
# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

The topic I am interested in for this assessment is: how do daily variations in social media engagement relate to fluctuations in individuals' reported happiness levels? Beyond daily usage time per platform and a self-reported happiness score, additional data on the specific content or interactions within each social media platform, such as the nature of posts or the sentiment of interactions, would be crucial to answering this question. Furthermore, incorporating external factors like daily events, news sentiment, or weather conditions could enhance the analysis. A substantial dataset, with at least a year's worth of daily observations for a diverse sample, would be needed to capture seasonal and long-term trends adequately. The steps for collecting and saving the data would involve expanding the social media usage simulation to include more detailed information and merging it with an enriched individual dataset. The resulting dataset could then be saved in a CSV file for further analysis.

## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [None]:
import pandas as pd
from datetime import datetime
import random

# Simulate social media usage data
def generate_social_media_data(start_date, end_date):
    date_range = pd.date_range(start=start_date, end=end_date, freq='D')
    social_media_data = {'Date': date_range,
                         'Facebook_Hours': [random.uniform(0, 3) for _ in range(len(date_range))],
                         'Instagram_Hours': [random.uniform(0, 2) for _ in range(len(date_range))],
                         'Twitter_Hours': [random.uniform(0, 1) for _ in range(len(date_range))],
                         'Other_Hours': [random.uniform(0, 2) for _ in range(len(date_range))]}
    return pd.DataFrame(social_media_data)

# Simulate individual data
def generate_individual_data(start_date, end_date):
    date_range = pd.date_range(start=start_date, end=end_date, freq='D')
    individual_data = {'Date': date_range,
                       'Happiness_Score': [random.randint(1, 10) for _ in range(len(date_range))],
                       'Age': [random.randint(18, 60) for _ in range(len(date_range))],
                       'Gender': [random.choice(['Male', 'Female']) for _ in range(len(date_range))]}
    return pd.DataFrame(individual_data)

# Set the timeframe for data collection
start_date = datetime(2023, 1, 1)
end_date = datetime(2023, 3, 31)

# Generate social media usage data
social_media_data = generate_social_media_data(start_date, end_date)

# Generate individual data
individual_data = generate_individual_data(start_date, end_date)

# Merge dataframes on 'Date'
dataset = pd.merge(social_media_data, individual_data, on='Date')

# Save the dataset to a CSV file
dataset.to_csv('social_media_happiness_dataset.csv', index=False)

# Load the dataset
loaded_dataset = pd.read_csv('social_media_happiness_dataset.csv')

# Display the first few rows of the loaded dataset
print("First few rows of the loaded dataset:")
print(loaded_dataset.head())

# Display basic statistics of the loaded dataset
print("\nBasic statistics of the loaded dataset:")
print(loaded_dataset.describe())

# Display the unique values in the 'Gender' column
print("\nUnique values in the 'Gender' column:")
print(loaded_dataset['Gender'].unique())


First few rows of the loaded dataset:
         Date  Facebook_Hours  Instagram_Hours  Twitter_Hours  Other_Hours  \
0  2023-01-01        1.904099         0.443164       0.997229     0.084667   
1  2023-01-02        1.007663         0.361478       0.737084     0.597787   
2  2023-01-03        0.977409         1.109788       0.801282     0.557709   
3  2023-01-04        0.256258         1.304992       0.397953     1.665752   
4  2023-01-05        0.067369         0.139420       0.753975     1.832711   

   Happiness_Score  Age  Gender  
0                6   40    Male  
1                5   20  Female  
2                5   39  Female  
3                6   26    Male  
4                2   19  Female  

Basic statistics of the loaded dataset:
       Facebook_Hours  Instagram_Hours  Twitter_Hours  Other_Hours  \
count       90.000000        90.000000      90.000000    90.000000   
mean         1.427203         0.976420       0.513907     0.938930   
std          0.860750         0.566659
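Question 2 asks for 1000 samples, whereas the run above produces 90 daily rows. The following is a minimal sketch of one way to reach 1000 rows, assuming a hypothetical panel of 10 participants observed for 100 days each. The function names `simulate_panel` and `save_rows`, the `Participant_ID` column, and the output file name are illustrative assumptions, and the values are simulated exactly as in the cell above, not collected from real users. It uses only the standard library so it runs without extra installs.

```python
import csv
import random
from datetime import date, timedelta


def simulate_panel(n_participants=10, n_days=100, start=date(2023, 1, 1), seed=42):
    """Return one row per (participant, day); 10 x 100 = 1000 rows by default."""
    rng = random.Random(seed)  # seeded so repeated runs give the same data
    rows = []
    for pid in range(1, n_participants + 1):
        # Age and gender are drawn once per participant, not redrawn every day
        age = rng.randint(18, 60)
        gender = rng.choice(["Male", "Female"])
        for offset in range(n_days):
            rows.append({
                "Participant_ID": pid,
                "Date": (start + timedelta(days=offset)).isoformat(),
                "Facebook_Hours": round(rng.uniform(0, 3), 3),
                "Instagram_Hours": round(rng.uniform(0, 2), 3),
                "Twitter_Hours": round(rng.uniform(0, 1), 3),
                "Other_Hours": round(rng.uniform(0, 2), 3),
                "Happiness_Score": rng.randint(1, 10),
                "Age": age,
                "Gender": gender,
            })
    return rows


def save_rows(rows, path):
    """Write the simulated rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


rows = simulate_panel()
print(len(rows))
# To persist the data: save_rows(rows, "social_media_happiness_panel.csv")
```

Keeping `Age` and `Gender` fixed per participant is a design choice of this sketch; it makes each participant's rows describe one consistent person across days.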

## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference where the article was published

(3) Year

(4) Authors

(5) Abstract

In [21]:
import re  # needed for the year extraction below

import requests
from bs4 import BeautifulSoup

def fetch_google_scholar_articles(keyword, num_articles=10, years_range=(2014, 2024)):
    # Search results are served from the /scholar path, not the site root
    base_url = "https://scholar.google.com/scholar"
    articles = []

    for year in range(years_range[0], years_range[1] + 1):
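        params = {
            "q": keyword,
            "hl": "en",
            "as_sdt": "0,5",
            # Restrict results to a single publication year
            "as_ylo": year,
            "as_yhi": year,
        }

        # A timeout keeps the loop from hanging on an unresponsive request
        response = requests.get(base_url, params=params, timeout=10)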

        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            results = soup.find_all('div', class_='gs_ri')

            for result in results:
                title_element = result.find('h3', class_='gs_rt')
                title = title_element.get_text(strip=True) if title_element else "Title not found"

                venue = result.find('div', class_='gs_a').get_text(strip=True) if result.find('div', class_='gs_a') else "Venue not found"
                year_match = [int(y) for y in re.findall(r'\b\d{4}\b', venue) if years_range[0] <= int(y) <= years_range[1]]
                year = year_match[0] if year_match else None
                authors = venue.split('-')[0].strip()

                abstract_element = result.find('div', class_='gs_rs')
                abstract = abstract_element.get_text(strip=True) if abstract_element else "Abstract not found"

                article_info = {
                    "title": title,
                    "venue": venue,
                    "year": year,
                    "authors": authors,
                    "abstract": abstract,
                }

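                articles.append(article_info)

                # Stop once the requested number of articles has been collected
                if len(articles) >= num_articles:
                    return articles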

    return articles

# Example usage
keyword = "XYZ"
num_articles = 10
articles = fetch_google_scholar_articles(keyword, num_articles)

for i, article in enumerate(articles, 1):
    print(f"Article {i}:")
    print(f"Title: {article['title']}")
    print(f"Venue: {article['venue']}")
    print(f"Year: {article['year']}")
    print(f"Authors: {article['authors']}")
    print(f"Abstract: {article['abstract']}")
    print("=" * 50)
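The cell above stores the whole `gs_a` byline in `venue` and splits authors on a bare `-`, which would also split hyphenated names. The following is a minimal sketch of a separate parsing step, assuming the byline follows the pattern `authors - venue, year - publisher`. The function name `parse_byline` and the sample strings are hypothetical, and real bylines may deviate from this pattern (for example, truncated author lists).

```python
import re


def parse_byline(byline, year_range=(2014, 2024)):
    """Split an 'authors - venue, year - publisher' string into separate fields."""
    # Normalize non-breaking spaces so that " - " works as a separator
    text = byline.replace("\xa0", " ")
    parts = [p.strip() for p in text.split(" - ")]

    # First segment: comma-separated author names
    authors = [a.strip() for a in parts[0].split(",") if a.strip()]

    # Second segment (if present): venue followed by the year
    venue_part = parts[1] if len(parts) > 1 else ""
    years = [
        int(y)
        for y in re.findall(r"\b(?:19|20)\d{2}\b", venue_part)
        if year_range[0] <= int(y) <= year_range[1]
    ]
    year = years[-1] if years else None

    # Drop a trailing ", YYYY" from the venue; None if nothing is left
    venue = re.sub(r",?\s*\b(?:19|20)\d{2}\b\s*$", "", venue_part).strip() or None

    return {"authors": authors, "venue": venue, "year": year}


# Made-up byline used only to illustrate the expected shape of the input
example = parse_byline("J Smith, A Lee - Journal of Data Science, 2019 - example.org")
print(example)
```

Separating the parsing from the HTTP request also makes it possible to check the field extraction on sample strings without sending any requests.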


## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

I used the online tool Web Scraper to do the web scraping, and chose LinkedIn as the platform to collect job details from. I installed the Web Scraper extension on my local system and opened it on the page I wanted to extract data from. In the Web Scraper tab, I created a sitemap by giving it a name and the URL to scrape. I then created a root node that contained all the sub-elements. For each sub-element, I gave it a name and a type (such as link, text, or URL), selected the data on the page that belongs to it, and then clicked "Scrape sitemap". Each link on the webpage was then scraped and the results were extracted in table format. Finally, I exported the data as a CSV file to my local system.

Here is the link to the drive with the CSV file of scraped data:
 https://docs.google.com/spreadsheets/d/1JVjLmapQ4NBfLnDKxLs6UxCbn-Wd2y1P/edit?usp=drive_link&ouid=106647022315744601568&rtpof=true&sd=true
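Since the exported data is expected to have more than four columns, a quick check of the CSV can confirm this before submission. The following is a minimal sketch, assuming a local copy of the export; the function name `summarize_csv`, the column names, and the sample row are all made up for illustration and do not come from the actual scraped file.

```python
import csv
import io


def summarize_csv(fileobj):
    """Count rows and columns of a CSV and flag whether it has more than four columns."""
    reader = csv.DictReader(fileobj)
    rows = list(reader)
    columns = reader.fieldnames or []
    return {
        "n_rows": len(rows),
        "n_columns": len(columns),
        "columns": columns,
        "has_more_than_four_columns": len(columns) > 4,
    }


# Stand-in for the exported file; header and values are hypothetical
sample = io.StringIO(
    "job_title,company,location,posted,link\n"
    "Data Analyst,Example Corp,Remote,2 days ago,https://example.com/job/1\n"
)
summary = summarize_csv(sample)
print(summary)
```

For a real export, the same function can be called on an opened file, for example `with open("export.csv", newline="", encoding="utf-8") as f: summarize_csv(f)`, where `export.csv` is a placeholder path.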

# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

Web scraping definitely stood out as an interesting topic to me, and doing it hands-on with an online tool gave me practical knowledge of how these tools actually perform the scraping. However, I found it difficult to write the scraping code in Python. I wasn't very aware of which modules, functions, or methods would be suitable for setting up the web scraping tools. I think building more skill in this topic would be extremely helpful, given that it makes data extraction and gathering much easier than manual methods.