## The Second In-Class Exercise (09/13/2023, 40 points in total)

Kindly use the provided .ipynb document to write your code or respond to the questions. Avoid generating a new file.
Execute all the cells before your final submission.

This in-class exercise is due tomorrow September 14, 2023 at 11:59 PM. No late submissions will be considered.

The purpose of this exercise is to understand users' information needs, then collect data from different sources for analysis.

Question 1 (10 points): Describe an interesting research question (or a practical question, or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? How much data is needed for the analysis? Describe the detailed steps for collecting and saving the data.

In [3]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):
"""

How does the use of urban green space affect the mental health of residents in different demographic groups, and what does this imply for urban planning and public health efforts?

A comprehensive dataset combining several kinds of data should be gathered to answer this research question. The objective is to understand the connection between urban green spaces and mental health while accounting for the potential influence of demographic factors.
The steps for gathering and storing the data are as follows:

Specify the variables:

Green space variables: location, size, type (such as parks or gardens), accessibility, and quality of urban green spaces. Green spaces can be mapped using Geographic Information System (GIS) data.
Demographic variables: age, gender, socioeconomic status, level of education, and ethnicity of the population.
Mental health variables: use standardized mental health assessment measures (such as the PHQ-9 or GAD-7) to gauge participants' levels of stress, anxiety, and depression as well as their general well-being.
Environmental variables: include information on temperature, noise levels, and air quality to account for potential confounding factors.

Sampling Plan: Use a statistical power analysis to choose an adequate sample size. To ensure that the diverse demographic groups within the study area are adequately represented, a stratified random sampling strategy may be used.

Data Gathering Techniques:

Surveys: Distribute surveys to residents to collect information on their demographics and mental health. To ensure diversity in the sample, use stratified sampling.
GIS Data: Obtain spatial data about green areas from regional or local government organizations or publicly available databases.
Environmental Data: Work with the relevant organizations to gather information on the study area's temperature, noise levels, and air quality.

Data Storage:

To store the collected data, create a secure and well-organized database. For flexibility and scalability, use a relational database system (like MySQL or PostgreSQL) or a NoSQL database (like MongoDB).
When handling sensitive information, make sure to comply with data protection laws (such as GDPR and HIPAA).

Data Preprocessing and Cleaning:

Clean and preprocess the collected data, handling missing values, outliers, and inconsistent entries.
To link demographic information with geographic locations, geocode residents' addresses.

Data Analysis:

Examine the association between green space variables, demographic factors, and mental well-being scores using statistical analysis, including regression models.
Apply more advanced methods, such as spatial analysis, to investigate spatial patterns and clustering effects.

Ethical Considerations:

Ensure survey participants' privacy and anonymity, and obtain their informed consent.
Protect private information and follow ethical standards for research involving human participants.

Data Visualization:

To communicate the findings effectively, create informative data visualizations such as maps, charts, and graphs.

"""

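The stratified random sampling mentioned in the sampling plan can be sketched with the standard library alone. This is a minimal illustration, not the study's actual procedure: the helper name `stratified_sample`, the `ses` field, and the 600/300/100 sampling frame are all hypothetical, and proportional allocation is only one of several possible allocation rules.

```python
import random
from collections import defaultdict

def stratified_sample(records, strata_key, total_size, seed=0):
    """Draw a proportional stratified random sample from a list of dict records."""
    rng = random.Random(seed)  # Fixed seed so the draw is reproducible
    groups = defaultdict(list)
    for record in records:
        groups[record[strata_key]].append(record)

    sample = []
    for members in groups.values():
        # Proportional allocation, with at least one record per stratum
        share = max(1, round(total_size * len(members) / len(records)))
        sample.extend(rng.sample(members, min(share, len(members))))
    return sample

# Hypothetical sampling frame: 600 low-, 300 medium-, and 100 high-SES residents
frame = (
    [{"id": i, "ses": "Low"} for i in range(600)]
    + [{"id": i, "ses": "Medium"} for i in range(600, 900)]
    + [{"id": i, "ses": "High"} for i in range(900, 1000)]
)
sample = stratified_sample(frame, "ses", total_size=100)

# Count how many sampled records fall in each stratum
counts = defaultdict(int)
for record in sample:
    counts[record["ses"]] += 1
print(dict(counts))  # {'Low': 60, 'Medium': 30, 'High': 10}
```

Because each stratum's share is rounded separately, the total can differ slightly from `total_size` when the proportions do not divide evenly.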

Question 2 (10 points): Write python code to collect 1000 data samples you discussed above.

In [4]:
import random
import pandas as pd

# Simulated demographic data
def generate_demographic_data(num_samples):
    data = []
    for _ in range(num_samples):
        age = random.randint(18, 70)
        gender = random.choice(["Male", "Female", "Other"])
        socioeconomic_status = random.choice(["Low", "Medium", "High"])
        education_level = random.choice(["High School", "Bachelor's", "Master's", "PhD"])
        ethnicity = random.choice(["White", "Black", "Asian", "Hispanic", "Other"])
        data.append([age, gender, socioeconomic_status, education_level, ethnicity])
    return data

# Simulated mental well-being data
def generate_mental_wellbeing_data(num_samples):
    data = []
    for _ in range(num_samples):
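        depression_score = random.randint(0, 27)  # PHQ-9 total score ranges from 0 to 27
        anxiety_score = random.randint(0, 21)     # GAD-7 total score ranges from 0 to 21
        stress_score = random.randint(0, 30)      # Illustrative 0-30 scale (no instrument specified)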
        overall_wellbeing_score = random.randint(0, 100)
        data.append([depression_score, anxiety_score, stress_score, overall_wellbeing_score])
    return data

# Generate 1000 samples of demographic and mental well-being data
num_samples = 1000
demographic_data = generate_demographic_data(num_samples)
mental_wellbeing_data = generate_mental_wellbeing_data(num_samples)

# Create a DataFrame to store the data
columns_demographic = ["Age", "Gender", "Socioeconomic Status", "Education Level", "Ethnicity"]
columns_mental_wellbeing = ["Depression Score", "Anxiety Score", "Stress Score", "Overall Well-being Score"]

demographic_df = pd.DataFrame(demographic_data, columns=columns_demographic)
mental_wellbeing_df = pd.DataFrame(mental_wellbeing_data, columns=columns_mental_wellbeing)

# Combine demographic and mental well-being data
combined_df = pd.concat([demographic_df, mental_wellbeing_df], axis=1)

# Save the data to a CSV file
combined_df.to_csv("sample_data.csv", index=False)
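
Before analysis, it can help to sanity-check the saved file. The sketch below is one possible check using only the standard library. The function name `validate_rows` and the bounds passed to it are assumptions for illustration; the bounds should match whatever ranges the generator actually uses.

```python
import csv
import io

def validate_rows(rows, score_ranges, expected_count):
    """Return a list of problems found in rows (an empty list means all checks passed)."""
    problems = []
    if len(rows) != expected_count:
        problems.append(f"expected {expected_count} rows, found {len(rows)}")
    for index, row in enumerate(rows):
        for column, (low, high) in score_ranges.items():
            value = row.get(column)
            # CSV values are read as strings, so convert before range-checking
            try:
                number = int(value)
            except (TypeError, ValueError):
                problems.append(f"row {index}: {column} is not an integer ({value!r})")
                continue
            if not low <= number <= high:
                problems.append(f"row {index}: {column}={number} outside [{low}, {high}]")
    return problems

# Small in-memory example; for the real file, read from open("sample_data.csv", newline="")
sample_csv = "Age,Overall Well-being Score\n25,80\n17,120\n"
rows = list(csv.DictReader(io.StringIO(sample_csv)))
bounds = {"Age": (18, 70), "Overall Well-being Score": (0, 100)}
issues = validate_rows(rows, bounds, expected_count=2)
for issue in issues:
    print(issue)
```

The second row of the example is deliberately out of range, so both of its columns are reported.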




Question 3 (10 points): Write python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "information retrieval". The articles should be published in the last 10 years (2013-2023).

The following information of the article needs to be collected:

(1) Title

(2) Venue/journal/conference being published

(3) Year

(4) Authors

(5) Abstract

In [5]:
# Your code here (Please add comments in the code):

# Import necessary libraries
import requests
from bs4 import BeautifulSoup
import json

# Function to fetch articles from Google Scholar
def fetch_google_scholar_articles(query, start_year, end_year, num_articles):
    # Base URL for Google Scholar
    url = "https://scholar.google.com/scholar"
    articles = []  # List to store the collected articles

    # Loop to paginate through search results (10 results per page)
    for start in range(0, num_articles, 10):
        # Parameters for the search query, including keywords and date range
        params = {
            "q": query,             # Search query
            "as_ylo": start_year,   # Start year of publication range
            "as_yhi": end_year,     # End year of publication range
            "start": start          # Pagination offset
        }

        # User-Agent header to mimic a web browser
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.1234.0 Safari/537.36"
        }

        # Send a GET request to the Google Scholar search URL with parameters and headers
        response = requests.get(url, params=params, headers=headers)

        # Check if the response is successful (HTTP status code 200)
        if response.status_code == 200:
            # Parse the HTML content of the response using BeautifulSoup
            soup = BeautifulSoup(response.text, 'html.parser')

            # Find all the search result div elements
            results = soup.find_all('div', {'class': 'gs_ri'})

            # Iterate through each search result
            for result in results:
                article = {}  # Dictionary to store article information

                # Extract the title of the article (inside an h3 element with class 'gs_rt')
                title = result.find('h3', {'class': 'gs_rt'})
                if title:
                    article['title'] = title.text  # Store the title in the dictionary

                # The 'gs_a' byline usually reads "authors - venue, year - publisher",
                # so parse it once instead of storing the whole line as the venue
                byline = result.find('div', {'class': 'gs_a'})
                if byline:
                    # Normalize non-breaking spaces and split on the " - " separators
                    parts = [p.strip() for p in byline.text.replace('\xa0', ' ').split(' - ')]
                    article['authors'] = parts[0]  # First part: the authors

                    # Middle part holds "venue, year" (the venue may be missing)
                    middle = parts[1] if len(parts) > 1 else ''
                    # The year is the last 4-digit token in the middle part, if any
                    years = [t for t in middle.replace(',', ' ').split() if t.isdigit() and len(t) == 4]
                    article['year'] = years[-1] if years else None

                    # The venue is whatever precedes the year
                    venue = middle[:middle.rfind(years[-1])] if years else middle
                    article['venue'] = venue.strip(' ,') or None

                # Extract the abstract (inside a div element with class 'gs_rs')
                abstract = result.find('div', {'class': 'gs_rs'})
                if abstract:
                    article['abstract'] = abstract.text  # Store the abstract in the dictionary

                # Append the article dictionary to the list of articles
                articles.append(article)

                # Check if the desired number of articles has been collected
                if len(articles) >= num_articles:
                    return articles
        else:
            # Google Scholar may reject automated requests (for example with HTTP 429),
            # so report the failure instead of silently collecting nothing
            print(f"Request failed with status {response.status_code}; stopping.")
            break

    return articles

# Main program
if __name__ == "__main__":
    keyword = "information retrieval"  # Keyword for the search
    start_year = 2013                # Start year of publication range
    end_year = 2023                  # End year of publication range
    num_articles = 1000              # Desired number of articles to collect

    # Call the fetch_google_scholar_articles function to collect articles
    articles = fetch_google_scholar_articles(keyword, start_year, end_year, num_articles)

    # Save the collected articles to a JSON file
    with open("articles.json", "w", encoding="utf-8") as json_file:
        json.dump(articles, json_file, indent=4, ensure_ascii=False)

    # Print the number of collected articles and a confirmation message
    print(f"Collected {len(articles)} articles and saved to 'articles.json'.")


Collected 0 articles and saved to 'articles.json'.
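
The `gs_a` line of a Google Scholar result typically packs authors, venue, year, and publisher into one string separated by " - ". Here is a minimal sketch of splitting it, assuming that layout holds. The helper name `parse_byline` and the sample string are hypothetical, and the sample is made up rather than taken from a real result page.

```python
import re

YEAR_PATTERN = r"\b(?:19|20)\d{2}\b"

def parse_byline(byline):
    """Split a byline of the assumed form 'authors - venue, year - publisher'."""
    # Scholar often uses a non-breaking space before the hyphen separator
    text = byline.replace("\xa0", " ")
    parts = [part.strip() for part in text.split(" - ")]
    authors = parts[0] if parts else ""
    middle = parts[1] if len(parts) > 1 else ""

    # Take the first 4-digit year in the middle part, if any
    year_match = re.search(YEAR_PATTERN, middle)
    year = year_match.group(0) if year_match else None

    # The venue is the middle part with a trailing ", year" removed
    venue = re.sub(r",?\s*" + YEAR_PATTERN + r"\s*$", "", middle).strip() or None
    return {"authors": authors, "venue": venue, "year": year}

# Made-up sample byline in the assumed layout
parsed = parse_byline("A Author, B Writer\xa0- Journal of Examples, 2015 - example.org")
print(parsed)
```

Splitting on " - " (with surrounding spaces) rather than on a bare hyphen keeps hyphenated author names and domain names intact.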


In [None]:
# Import necessary libraries
import requests
from bs4 import BeautifulSoup

# Function to fetch and display articles from Google Scholar
def fetch_and_display_google_scholar_articles(query, start_year, end_year, num_articles):
    # Base URL for Google Scholar
    url = "https://scholar.google.com/scholar"

    # Loop to paginate through search results (10 results per page)
    for start in range(0, num_articles, 10):
        # Parameters for the search query, including keywords and date range
        params = {
            "q": query,             # Search query
            "as_ylo": start_year,   # Start year of publication range
            "as_yhi": end_year,     # End year of publication range
            "start": start          # Pagination offset
        }

        # User-Agent header to mimic a web browser
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.1234.0 Safari/537.36"
        }

        # Send a GET request to the Google Scholar search URL with parameters and headers
        response = requests.get(url, params=params, headers=headers)

        # Check if the response is successful (HTTP status code 200)
        if response.status_code == 200:
            # Parse the HTML content of the response using BeautifulSoup
            soup = BeautifulSoup(response.text, 'html.parser')

            # Find all the search result div elements
            results = soup.find_all('div', {'class': 'gs_ri'})

            # Iterate through each search result and display the information
            for i, result in enumerate(results, start=1):
                # Number articles across pages (start is the page offset)
                print(f"Article {start + i}:")

                # Extract the title of the article (inside an h3 element with class 'gs_rt')
                title = result.find('h3', {'class': 'gs_rt'})
                if title:
                    print(f"Title: {title.text}")

                # The 'gs_a' byline usually reads "authors - venue, year - publisher"; parse it once
                byline = result.find('div', {'class': 'gs_a'})
                if byline:
                    # Normalize non-breaking spaces and split on the " - " separators
                    parts = [p.strip() for p in byline.text.replace('\xa0', ' ').split(' - ')]
                    # Middle part holds "venue, year"; the year is its last 4-digit token
                    middle = parts[1] if len(parts) > 1 else ''
                    years = [t for t in middle.replace(',', ' ').split() if t.isdigit() and len(t) == 4]
                    venue = middle[:middle.rfind(years[-1])] if years else middle
                    print(f"Venue: {venue.strip(' ,')}")
                    print(f"Year: {years[-1] if years else 'unknown'}")
                    print(f"Authors: {parts[0]}")

                # Extract the abstract (inside a div element with class 'gs_rs')
                abstract = result.find('div', {'class': 'gs_rs'})
                if abstract:
                    print(f"Abstract: {abstract.text}")

                print("\n")  # Add a newline between articles

                # Check if the desired number of articles has been displayed
                # (start is the page offset and i the 1-based position on the page)
                if start + i >= num_articles:
                    return

# Main program
if __name__ == "__main__":
    keyword = "information retrieval"  # Keyword for the search
    start_year = 2013                # Start year of publication range
    end_year = 2023                  # End year of publication range
    num_articles = 10                # Desired number of articles to display

    # Call the fetch_and_display_google_scholar_articles function to display articles
    fetch_and_display_google_scholar_articles(keyword, start_year, end_year, num_articles)
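
Google Scholar offers no official API, so scraping it can return nothing. Since the question also allows Semantic Scholar, one alternative is its public Graph API. The sketch below is written under assumptions and is not a verified implementation: the endpoint URL, the `year` range parameter, the field names, and the `data` key of the response should all be checked against the official documentation, and the helper names are hypothetical.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumed endpoint and field names for the Semantic Scholar Graph API;
# verify them against the official documentation before relying on this.
API_URL = "https://api.semanticscholar.org/graph/v1/paper/search"
FIELDS = "title,venue,year,authors,abstract"

def build_search_url(query, start_year, end_year, offset, limit=100):
    """Build the URL for one page of search results."""
    params = {
        "query": query,
        "year": f"{start_year}-{end_year}",  # Assumed year-range syntax
        "fields": FIELDS,
        "offset": offset,
        "limit": limit,
    }
    return API_URL + "?" + urllib.parse.urlencode(params)

def flatten_paper(paper):
    """Keep only the five fields the exercise asks for."""
    return {
        "title": paper.get("title"),
        "venue": paper.get("venue"),
        "year": paper.get("year"),
        "authors": [author.get("name") for author in paper.get("authors") or []],
        "abstract": paper.get("abstract"),
    }

def collect_papers(query, start_year, end_year, total=1000, limit=100, pause=3.0):
    """Fetch pages until `total` papers are collected or the results run out."""
    papers = []
    for offset in range(0, total, limit):
        url = build_search_url(query, start_year, end_year, offset, limit)
        with urllib.request.urlopen(url, timeout=30) as response:
            page = json.load(response)
        batch = page.get("data") or []  # Assumed key holding the result list
        if not batch:
            break
        papers.extend(flatten_paper(paper) for paper in batch)
        time.sleep(pause)  # Pause between requests to stay within rate limits
    return papers[:total]

# Show the first page URL without sending a request
print(build_search_url("information retrieval", 2013, 2023, offset=0))
```

Calling `collect_papers("information retrieval", 2013, 2023)` would send the actual requests. Rate limits and result caps may apply, so the pause and page size may need adjusting.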


Question 4 (10 points): Write python code to collect 1000 posts from Twitter, or Facebook, or Instagram. You can either use hashtags, keywords, user_name, user_id, or other information to collect the data.

The following information needs to be collected:

(1) User_name

(2) Posted time

(3) Text

In [None]:
# Your code here (Please add comments in the code):

import tweepy
import json

# Set up your Twitter API credentials
consumer_key = 'YOUR_CONSUMER_KEY'
consumer_secret = 'YOUR_CONSUMER_SECRET'
access_token = 'YOUR_ACCESS_TOKEN'
access_token_secret = 'YOUR_ACCESS_TOKEN_SECRET'

# Authenticate with the Twitter API
authenti = tweepy.OAuthHandler(consumer_key, consumer_secret)
authenti.set_access_token(access_token, access_token_secret)

# Creating an API object
api = tweepy.API(authenti, wait_on_rate_limit=True)

def collect_tweets_by_keyword(keyword, num_tweets):
    tweets = []

    # Iterate through pages of tweets to collect the desired number
    # (Tweepy 4.x renamed API.search to API.search_tweets)
    for tweet in tweepy.Cursor(api.search_tweets, q=keyword, tweet_mode='extended').items(num_tweets):
        tweet_data = {
            'User_name': tweet.user.screen_name,  # User's Twitter handle
            'Posted_time': tweet.created_at.strftime('%Y-%m-%d %H:%M:%S'),  # Time of the tweet
            'Text': tweet.full_text  # The tweet text
        }
        tweets.append(tweet_data)

    return tweets

if __name__ == "__main__":
    keyword = "your_keyword_here"  # Replace with your desired keyword
    num_tweets = 1000  # Number of tweets to collect

    tweets = collect_tweets_by_keyword(keyword, num_tweets)

    # Save the collected tweets to a JSON file
    with open("tweets.json", "w", encoding="utf-8") as json_file:
        json.dump(tweets, json_file, indent=4, ensure_ascii=False)

    print(f"Collected {len(tweets)} tweets and saved to 'tweets.json'.")

