# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

In [None]:
'''### Research Question:
**What is life without technology?**

This study explores how life changes when people give up modern technology, especially digital devices like smartphones and computers. The aim is to understand how this affects well-being, productivity, and social interactions. By examining how people adapt without technology, we can gain insights into its impact on daily life.

### Data Collection Strategy:

To answer this question, both **subjective** (self-reported) and **objective** (observational) data are needed. Participants will abstain from technology for a specific period (e.g., 1-4 weeks), and data will be collected on how this affects their routine, mental state, and social interactions.

#### 1. **Mental and Emotional Well-being Data**
   - **Variables**: Participants report their mental health, stress, and mood.
   - **Tool**: Daily or weekly surveys (via an app before and after the detox, or manually during the detox).
   - **Frequency**: Daily or weekly.
   - **Sample Size**: 100-200 participants for 1 month.
   - **Amount**: At least 3,000 entries (100 participants x 30 days).

#### 2. **Productivity and Daily Activities Data**
   - **Variables**: Time spent on activities like work, hobbies, or face-to-face interactions.
   - **Tool**: Activity logs (written during the detox, or digital before and after).
   - **Frequency**: Daily.
   - **Sample Size**: Same as above.
   - **Amount**: 3,000 entries.

#### 3. **Social Interaction Data**
   - **Variables**: Type and frequency of social interactions (e.g., phone calls, in-person meetings).
   - **Tool**: Weekly logs to track the quality and quantity of social interactions.
   - **Frequency**: Weekly.
   - **Sample Size**: Same as above.
   - **Amount**: 400 entries.

#### 4. **Physiological Data (optional)**
   - **Variables**: Sleep quality, heart rate variability, and physical activity.
   - **Tool**: Wearables like smartwatches (optional).
   - **Sample Size**: A subset of 50-100 participants.

---

### Steps for Collecting and Saving Data:

#### **Step 1: Recruit Participants**
- Recruit a diverse group of participants who are willing to give up technology for a set period. This group should include individuals from different age groups, work backgrounds (e.g., remote workers, students), and lifestyles.

#### **Step 2: Set Up the Data Collection Platform**
- Use Google Forms or paper-based surveys to collect data on well-being, activities, and social interactions before and after the detox.
- Provide participants with physical notebooks to manually record their daily activities and feelings during the detox.

#### **Step 3: Collect Pre-Detox Baseline Data**
- Before the detox begins, collect baseline data on participants’ current technology use (screen time, social media hours, etc.), mental health, and productivity.

#### **Step 4: Collect Data During the Detox**
- Participants abstain from technology for 1-4 weeks. During this period, they will record their daily activities, feelings, and social interactions manually.
- Collect daily logs, with participants noting changes in their routine, stress levels, and emotional state.

#### **Step 5: Post-Detox Data Collection**
- After the detox, participants will resume using the digital tools to submit final surveys reflecting on their experiences and changes in well-being and productivity.

#### **Step 6: Manage Data Storage and Backup**
- Store the data securely in a cloud-based storage solution (e.g., Google Cloud).
- Perform regular backups and check the quality of the collected data to ensure completeness.

#### **Step 7: Data Security and Anonymization**
- De-identify all data by assigning participants unique codes to protect their privacy.
- Ensure compliance with data protection regulations like GDPR.

---

### Data Analysis:
1. **Compare Well-being**: Analyze how participants’ well-being (stress, mood, mental health) changed during the technology-free period compared to their baseline period.
2. **Analyze Productivity**: Examine any shifts in daily productivity and work habits without technology.
3. **Social Interaction Trends**: Assess whether participants had more or higher-quality face-to-face social interactions during the detox.
4. **Qualitative Insights**: Review open-ended responses on how participants felt about their experience, the challenges, and benefits.

By analyzing both quantitative and qualitative data, the study will reveal how living without technology impacts people's lives.
'''
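
Step 7 above calls for de-identifying the data by assigning each participant a unique code. The block below is a minimal sketch of that idea using only the standard library. The field names (`name`, `day`, `stress`, `mood`), the `P-` code prefix, and the sample rows are assumptions made for illustration, not part of the study design; in practice the name-to-code mapping would have to be stored separately from the survey data.

```python
import csv
import io
import secrets


def assign_participant_codes(names):
    """Map each participant name to a unique random code such as 'P-3f9a1c'."""
    codes = {}
    used = set()
    for name in names:
        code = f"P-{secrets.token_hex(3)}"
        while code in used:  # regenerate on the rare collision
            code = f"P-{secrets.token_hex(3)}"
        used.add(code)
        codes[name] = code
    return codes


def deidentify(rows, codes):
    """Replace the 'name' field of each survey row with its participant code."""
    return [
        {
            "participant_code": codes[row["name"]],
            "day": row["day"],
            "stress": row["stress"],
            "mood": row["mood"],
        }
        for row in rows
    ]


# Hypothetical survey entries; names and scores are made up for illustration.
raw_rows = [
    {"name": "Alice Example", "day": 1, "stress": 6, "mood": 5},
    {"name": "Bob Example", "day": 1, "stress": 4, "mood": 7},
    {"name": "Alice Example", "day": 2, "stress": 5, "mood": 6},
]

codes = assign_participant_codes(sorted({row["name"] for row in raw_rows}))
clean_rows = deidentify(raw_rows, codes)

# Write the de-identified rows as CSV text (a file object would work the same way).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["participant_code", "day", "stress", "mood"])
writer.writeheader()
writer.writerows(clean_rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The same participant keeps the same code across days, so daily entries can still be linked for the before/after comparison without exposing names.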

## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [None]:
'''To collect a dataset of 1000 samples for the question "How does the daily exposure to green spaces influence mental well-being and productivity in a remote-working population?", you can simulate a data collection process in Python. This involves generating random data based on certain assumptions for mental well-being, productivity, and time spent in green spaces.

Here’s simple Python code that generates a dataset of 1000 samples, where each sample includes:

- Time spent in green spaces (in minutes).
- Mental well-being score (1-10).
- Productivity score (1-10).

We'll assume the well-being and productivity scores are influenced by the time spent in green spaces.'''

import pandas as pd
import numpy as np

# Function to generate mental well-being and productivity based on time spent in green spaces
def generate_wellbeing_productivity(time_in_green_space):
    # Assume mental well-being increases with more time spent in green spaces
    wellbeing = np.clip(np.random.normal(loc=time_in_green_space / 30 + 5, scale=1), 1, 10)

    # Assume productivity peaks at moderate exposure (e.g., 60-90 minutes) and decreases if too much/too little
    productivity = np.clip(np.random.normal(loc=-0.01 * (time_in_green_space - 75)**2 + 8, scale=1), 1, 10)

    return wellbeing, productivity

# Parameters for the dataset
num_samples = 1000

# Simulate data
data = {
    "Participant_ID": np.arange(1, num_samples + 1),
    # Random whole minutes from 0 to 180; randint's upper bound is exclusive, hence 181
    "Time_in_Green_Space": np.random.randint(0, 181, num_samples),
}

# Initialize lists to store wellbeing and productivity scores
wellbeing_scores = []
productivity_scores = []

# Generate wellbeing and productivity scores based on time spent in green spaces
for time in data["Time_in_Green_Space"]:
    wellbeing, productivity = generate_wellbeing_productivity(time)
    wellbeing_scores.append(wellbeing)
    productivity_scores.append(productivity)

# Add the generated scores to the dataset
data["Mental_Wellbeing"] = wellbeing_scores
data["Productivity"] = productivity_scores

# Convert the data to a pandas DataFrame
df = pd.DataFrame(data)

# Show the first few rows of the dataset
print(df.head())

# Save to a CSV file
df.to_csv('green_space_wellbeing_productivity_dataset.csv', index=False)
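
Because the dataset above is simulated, it can help to make the simulation reproducible and to check how the clipping to the 1-10 range affects the scores. The following is a sketch under stated assumptions: the seed value 42 is arbitrary, and `floor_share` is a name introduced here for illustration. It uses the same mean formulas as the loop above, applied to all samples at once. Note that the quadratic productivity mean falls below 1 once the time is more than about 26 minutes away from 75, so a large share of productivity scores ends up at the floor.

```python
import numpy as np

# Assumption: a fixed seed (42 is arbitrary) so that reruns produce the same data.
rng = np.random.default_rng(42)
num_samples = 1000

# Minutes per day in green spaces, 0 to 180 inclusive (upper bound is exclusive).
minutes = rng.integers(0, 181, size=num_samples)

# Same mean formulas as the loop-based version, evaluated for every sample at once.
wellbeing = np.clip(rng.normal(loc=minutes / 30 + 5, scale=1.0), 1, 10)
productivity = np.clip(
    rng.normal(loc=-0.01 * (minutes - 75) ** 2 + 8, scale=1.0), 1, 10
)

# Share of productivity scores that were clipped to the minimum value of 1.
floor_share = float(np.mean(productivity == 1))

print("samples:", minutes.shape[0])
print("well-being range:", wellbeing.min(), wellbeing.max())
print("productivity range:", productivity.min(), productivity.max())
print("share of productivity scores at the floor:", floor_share)
```

The three arrays can be passed to `pd.DataFrame` in the same way as the lists in the cell above.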



## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), CiteSeerX (https://citeseerx.ist.psu.edu/index), Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Library (https://dl.acm.org/) with the keyword "XYZ". The articles should have been published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference where the article was published

(3) Year

(4) Authors

(5) Abstract

In [2]:
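import re
import time

import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_scholar_articles(keyword, num_articles):
    """Scrape Google Scholar results (2014-2024) for `keyword`.

    Google Scholar rate-limits automated requests and may return a CAPTCHA
    page, so fewer than `num_articles` entries may come back.
    """
    articles = []
    url = "https://scholar.google.com/scholar"
    headers = {"User-Agent": "Mozilla/5.0"}  # browser-like user agent
    start = 0  # result offset; Scholar shows 10 results per page

    while len(articles) < num_articles:
        params = {"q": keyword, "hl": "en", "as_sdt": "0,5",
                  "as_ylo": 2014, "as_yhi": 2024, "start": start}
        response = requests.get(url, params=params, headers=headers, timeout=30)
        if response.status_code != 200:
            break  # blocked or rate-limited

        soup = BeautifulSoup(response.text, 'html.parser')
        entries = soup.find_all('div', class_='gs_ri')
        if not entries:
            break  # no more results (or a CAPTCHA page)

        for entry in entries:
            title_tag = entry.find('h3', class_='gs_rt')
            meta_tag = entry.find('div', class_='gs_a')
            abstract_tag = entry.find('div', class_='gs_rs')

            # The gs_a line usually reads "Authors - Venue, Year - Publisher"
            meta = meta_tag.get_text().replace('\xa0', ' ') if meta_tag else ""
            parts = [part.strip() for part in meta.split(' - ')]
            authors = parts[0]
            venue_year = parts[1] if len(parts) > 1 else ""
            year_match = re.search(r'\b(?:19|20)\d{2}\b', venue_year)
            year = year_match.group(0) if year_match else ""
            venue = venue_year[:year_match.start()].rstrip(' ,') if year_match else venue_year

            articles.append({
                'Title': title_tag.get_text() if title_tag else "",
                'Venue': venue,
                'Year': year,
                'Authors': authors,
                'Abstract': abstract_tag.get_text() if abstract_tag else "No abstract available"
            })

            if len(articles) >= num_articles:
                break

        start += 10
        time.sleep(2)  # pause between pages to reduce the chance of being blocked

    return articles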

# Collect 1000 articles with the keyword 'XYZ'
articles = get_scholar_articles('XYZ', 1000)

# Convert the list of articles to a DataFrame and save it as a CSV file
df = pd.DataFrame(articles)
df.to_csv('scholar_articles.csv', index=False)

print(f"Collected {len(articles)} articles and saved to scholar_articles.csv")


Collected 1000 articles and saved to scholar_articles.csv
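
Since the question also lists Semantic Scholar as an acceptable source, a search API is an alternative to parsing HTML. The sketch below is hedged: the endpoint URL, the parameter names (`query`, `year`, `fields`, `offset`, `limit`), and the response layout with a `data` list are assumptions based on the public Semantic Scholar Graph API and should be checked against its current documentation, as should any result caps or rate limits. The helper names `paper_to_row`, `fetch_page`, and `collect` are introduced here for illustration, and the standard library's `urllib` is used only to keep the example self-contained.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumption: endpoint and parameter names follow the public Semantic Scholar
# Graph API; verify them against the current API documentation before use.
SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"
FIELDS = "title,venue,year,authors,abstract"


def paper_to_row(paper):
    """Flatten one API record into the five fields the exercise asks for."""
    authors = ", ".join(a.get("name", "") for a in paper.get("authors") or [])
    return {
        "Title": paper.get("title") or "",
        "Venue": paper.get("venue") or "",
        "Year": paper.get("year"),
        "Authors": authors,
        "Abstract": paper.get("abstract") or "No abstract available",
    }


def fetch_page(keyword, offset, limit=100):
    """Request one page of search results (requires network access)."""
    query = urllib.parse.urlencode({
        "query": keyword,
        "year": "2014-2024",
        "fields": FIELDS,
        "offset": offset,
        "limit": limit,
    })
    with urllib.request.urlopen(f"{SEARCH_URL}?{query}", timeout=30) as resp:
        return json.load(resp).get("data", [])


def collect(keyword, total=1000, page_size=100):
    """Page through results until `total` rows are gathered or results run out."""
    rows = []
    for offset in range(0, total, page_size):
        page = fetch_page(keyword, offset, page_size)
        if not page:
            break
        rows.extend(paper_to_row(p) for p in page)
        time.sleep(1)  # short pause between requests as a courtesy to the service
    return rows[:total]


# Hypothetical record, shaped like one element of the response's "data" list.
sample = {
    "title": "An Example Paper",
    "venue": "Example Conference",
    "year": 2020,
    "authors": [{"authorId": "1", "name": "A. Author"}, {"authorId": "2", "name": "B. Author"}],
    "abstract": None,
}
print(paper_to_row(sample))
```

Calling `collect("XYZ")` would return a list of dictionaries that can be passed to `pd.DataFrame` and saved with `to_csv`, as in the cell above.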


## Question 4A (10 Points)
Develop Python code to collect data from social media platforms such as Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [None]:
!pip install tweepy
import tweepy
import pandas as pd

# I don't have developer API keys for Twitter or any other social media platform yet; I have applied and am waiting for a reply.
API_KEY = 'api_key'
API_SECRET_KEY = 'api_secret_key'
ACCESS_TOKEN = 'access_token'
ACCESS_TOKEN_SECRET = 'access_token_secret'

# Authenticate to Twitter API
auth = tweepy.OAuthHandler(API_KEY, API_SECRET_KEY)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True)


hashtag = "#climatechange"
tweet_count = 100

# Collect tweets
tweets_data = []
for tweet in tweepy.Cursor(api.search_tweets, q=hashtag, lang="en", tweet_mode="extended").items(tweet_count):
    tweets_data.append({
        'Tweet ID': tweet.id_str,
        'Username': tweet.user.screen_name,
        'Tweet Text': tweet.full_text,
        'Created At': tweet.created_at,
        'Likes': tweet.favorite_count,
        'Retweets': tweet.retweet_count
    })

# Convert collected tweets data to a DataFrame
df = pd.DataFrame(tweets_data)

# Display the first few rows of the DataFrame
print(df.head())

# Save to CSV file
df.to_csv('twitter_hashtag_data.csv', index=False)
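
The cell above uses tweepy's v1.1-style `API` object. As an alternative, tweepy also provides a v2 `Client` with a `search_recent_tweets` method. The sketch below swaps in that client; it is not the approach used above, and whether a given developer account can call the endpoint depends on its access level. The helper names `tweet_to_row` and `fetch_recent_tweets` and the hand-built `fake_tweet` are hypothetical and exist only to show the row shape without calling the API.

```python
from types import SimpleNamespace


def tweet_to_row(tweet):
    """Flatten a v2 tweet object into a row with more than four columns."""
    metrics = getattr(tweet, "public_metrics", None) or {}
    return {
        "Tweet ID": str(tweet.id),
        "Author ID": str(tweet.author_id),
        "Tweet Text": tweet.text,
        "Created At": tweet.created_at,
        "Likes": metrics.get("like_count", 0),
        "Retweets": metrics.get("retweet_count", 0),
    }


def fetch_recent_tweets(bearer_token, query, max_results=100):
    """Search recent tweets with tweepy's v2 Client (needs valid credentials)."""
    import tweepy  # imported here so tweet_to_row works without tweepy installed

    client = tweepy.Client(bearer_token=bearer_token, wait_on_rate_limit=True)
    response = client.search_recent_tweets(
        query=query,
        tweet_fields=["created_at", "public_metrics", "author_id"],
        max_results=max_results,  # the endpoint accepts 10 to 100 per request
    )
    return [tweet_to_row(t) for t in (response.data or [])]


# Hypothetical tweet, built by hand to show the row shape without calling the API.
fake_tweet = SimpleNamespace(
    id=1,
    author_id=2,
    text="Example #climatechange post",
    created_at="2024-01-01T00:00:00Z",
    public_metrics={"like_count": 3, "retweet_count": 1},
)
row = tweet_to_row(fake_tweet)
print(row)
```

With real credentials, `fetch_recent_tweets(bearer_token, "#climatechange lang:en")` would return rows that can be loaded into a DataFrame and saved to CSV as in the cell above.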



## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF file) to any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the code cell below.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

In [None]:
# write your answer here


# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [None]:
'''
Write your response here.
'''