# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

**Data Collection Process**

**Identify Relevant Weather Sources**: Choose suitable weather APIs or websites known for providing historical weather data.


**Specify Locations**: Identify locations relevant to the online shopping transaction data to ensure alignment between weather conditions and consumer behavior.


**Define Date Range**: Specify the duration corresponding to the Twitter data collection period to extract daily weather information.


**Retrieve Hourly Weather Data**: Use the chosen weather API or website to programmatically retrieve hourly temperature and weather conditions for the specified locations (latitude and longitude) and timeframe.


**Data Formatting**: Organize the collected weather data into a structured format, ensuring consistency and compatibility with subsequent analysis.


**Save Data**: Store the retrieved weather data in a structured format such as CSV, including fields for humidity, temperature, and weather conditions.
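
The date-range step can be sized before any request is made. The following is a minimal sketch, assuming hourly readings (24 rows per day); the helper name `date_window` is hypothetical and only illustrates the arithmetic for reaching a target number of rows.

```python
import math
from datetime import date, timedelta


def date_window(start, target_rows, rows_per_day=24):
    """Return (start, end, days) covering at least `target_rows` hourly rows."""
    # Round up so a partial day still counts as a full day to request.
    days = math.ceil(target_rows / rows_per_day)
    end = start + timedelta(days=days - 1)  # inclusive end date
    return start, end, days


start, end, days = date_window(date(2022, 1, 1), 1000)
print(start.isoformat(), end.isoformat(), days)  # 2022-01-01 2022-02-11 42
```

Under these assumptions, 1000 hourly samples starting on 2022-01-01 need 42 days of data, which ends on 2022-02-11.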

## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [None]:
!pip install retry_requests

Collecting retry_requests
  Downloading retry_requests-2.0.0-py3-none-any.whl (15 kB)
Installing collected packages: retry_requests
Successfully installed retry_requests-2.0.0


In [None]:
import openmeteo_requests
import requests_cache
import pandas as pd
from retry_requests import retry

# Cache responses on disk (never expiring) and retry failed requests with backoff.
cache_session = requests_cache.CachedSession('.cache', expire_after=-1)
retry_session = retry(cache_session, retries=5, backoff_factor=0.2)
openmeteo = openmeteo_requests.Client(session=retry_session)

# Location and collection window.
latitude = 33.214840
longitude = -97.133064
start_date = "2022-01-01"
end_date = "2024-02-10"

# Order matters: hourly.Variables(i) follows the order requested here.
hourly_variables = ["temperature_2m", "relative_humidity_2m", "precipitation", "rain", "snowfall", "weather_code"]


def fetch_weather_data_for_date(date):
    """Request one day of hourly archive data for the configured location."""
    url = "https://archive-api.open-meteo.com/v1/archive"
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "start_date": date,
        "end_date": date,
        "hourly": hourly_variables
    }
    return openmeteo.weather_api(url, params=params)


weather_dataframe = pd.DataFrame(columns=["date"] + hourly_variables)

# Fetch one day (24 hourly rows) at a time until 1000 rows are collected
# or the end of the collection window is reached.
date = pd.to_datetime(start_date)
while len(weather_dataframe) < 1000 and date <= pd.to_datetime(end_date):
    response = fetch_weather_data_for_date(date.strftime("%Y-%m-%d"))
    hourly = response[0].Hourly()
    # hourly.Time() is the start of the requested day, so all 24 rows share one date value.
    hourly_data = {"date": pd.to_datetime(hourly.Time(), unit="s")}
    for i, name in enumerate(hourly_variables):
        hourly_data[name] = hourly.Variables(i).ValuesAsNumpy()
    daily_dataframe = pd.DataFrame(data=hourly_data)
    weather_dataframe = pd.concat([weather_dataframe, daily_dataframe], ignore_index=True)
    date += pd.Timedelta(days=1)

# Trim the last partial day so exactly 1000 rows are saved.
weather_dataframe = weather_dataframe.head(1000)
# The index is written as an unnamed first column, which the next cell drops.
weather_dataframe.to_csv('weather_data.csv')
print(weather_dataframe)


          date  temperature_2m  relative_humidity_2m  precipitation  rain  \
0   2022-01-01       18.839500             93.922546            0.0   0.0   
1   2022-01-01       18.139500             95.085800            0.2   0.2   
2   2022-01-01       18.439499             91.849548            0.0   0.0   
3   2022-01-01       17.639500             93.868118            0.0   0.0   
4   2022-01-01       18.389500             90.116112            2.3   2.3   
..         ...             ...                   ...            ...   ...   
995 2022-02-11        8.339499             45.343323            0.0   0.0   
996 2022-02-11        9.439500             42.569962            0.0   0.0   
997 2022-02-11        9.339499             45.133007            0.0   0.0   
998 2022-02-11       10.639500             44.524261            0.0   0.0   
999 2022-02-11       13.639500             43.328072            0.0   0.0   

     snowfall  weather_code  
0         0.0           3.0  
1         0.0  

In [None]:
data = pd.read_csv('weather_data.csv')
data = data.drop('Unnamed: 0',axis=1)
data

Unnamed: 0,date,temperature_2m,relative_humidity_2m,precipitation,rain,snowfall,weather_code
0,2022-01-01,18.839500,93.922550,0.0,0.0,0.0,3.0
1,2022-01-01,18.139500,95.085800,0.2,0.2,0.0,51.0
2,2022-01-01,18.439499,91.849550,0.0,0.0,0.0,3.0
3,2022-01-01,17.639500,93.868120,0.0,0.0,0.0,3.0
4,2022-01-01,18.389500,90.116110,2.3,2.3,0.0,61.0
...,...,...,...,...,...,...,...
995,2022-02-11,8.339499,45.343323,0.0,0.0,0.0,0.0
996,2022-02-11,9.439500,42.569960,0.0,0.0,0.0,0.0
997,2022-02-11,9.339499,45.133007,0.0,0.0,0.0,0.0
998,2022-02-11,10.639500,44.524260,0.0,0.0,0.0,0.0


In [None]:
data['snowfall'].sum()

3.71000003
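
In the printed output above, every row of a given day carries the same `date` value, because only the start timestamp of each response is used. The sketch below shows one way to expand a start, an end, and an interval into one timestamp per hourly row. It assumes the Open-Meteo response object exposes these three values (as `Time()`, `TimeEnd()`, and `Interval()`); the helper name `hourly_index` is hypothetical, and it takes plain integers so it can run without a network call.

```python
import pandas as pd


def hourly_index(start_epoch, end_epoch, interval_seconds):
    """Build one UTC timestamp per reading from epoch seconds and a step size."""
    start = pd.to_datetime(start_epoch, unit="s", utc=True)
    end = pd.to_datetime(end_epoch, unit="s", utc=True)
    # inclusive="left" keeps the start and drops the end, so one day gives 24 rows.
    return pd.date_range(
        start=start,
        end=end,
        freq=pd.Timedelta(seconds=interval_seconds),
        inclusive="left",
    )


# Epoch seconds for 2022-01-01 00:00 UTC and 2022-01-02 00:00 UTC, hourly step.
index = hourly_index(1640995200, 1641081600, 3600)
print(len(index), index[0], index[-1])
```

With an index like this in the `date` column, each of the 1000 rows would be identifiable by hour rather than only by day.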

## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference being published

(3) Year

(4) Authors

(5) Abstract

In [None]:
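import json
import time
import requests
from bs4 import BeautifulSoup
import pandas as pd

keyword = "XYZ"
start_year = 2014
end_year = 2024
max_articles = 1000
result = []

base_url = "https://scholar.google.com/scholar"

num_articles = 0
start = 0  # offset of the first result on the page being requested

while num_articles < max_articles:
    # Pass the query as params so requests URL-encodes the keyword.
    params = {"q": keyword, "as_ylo": start_year, "as_yhi": end_year, "start": start}
    response = requests.get(base_url, params=params)
    if response.status_code != 200:
        # A 403 or 429 here means the site refused or throttled the request.
        print(f"Stopping: HTTP {response.status_code}")
        break

    soup = BeautifulSoup(response.content, "html.parser")
    article_blocks = soup.find_all("div", class_="gs_ri")
    if not article_blocks:
        # Without this check an empty page would repeat the same request forever.
        break

    for article_block in article_blocks:
        title_tag = article_block.find("h3", class_="gs_rt")
        byline_tag = article_block.find("div", class_="gs_a")
        abstract_tag = article_block.find("div", class_="gs_rs")
        if title_tag is None or byline_tag is None:
            continue

        # The byline reads "authors - venue, year - publisher".
        # Splitting on "-" also splits hyphenated names, so treat the result as approximate.
        byline_parts = byline_tag.text.split("-")
        authors = byline_parts[0].strip()
        venue_year = byline_parts[1].strip() if len(byline_parts) > 1 else ""
        venue_parts = venue_year.split(",")
        year = venue_parts[-1].strip()
        venue = ", ".join(part.strip() for part in venue_parts[:-1])

        result.append({
            'title': title_tag.text,
            'venue': venue,
            'year': year,
            'authors': authors,
            'abstract': abstract_tag.text if abstract_tag is not None else ""
        })

        num_articles += 1

        if num_articles >= max_articles:
            break

    start += len(article_blocks)
    time.sleep(1)  # pause between pages to reduce the chance of being throttled

df = pd.DataFrame(result)

df.to_csv('articles.csv', index=False)

print(json.dumps(result, indent=4))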


[
    {
        "title": "The XYZ states revisited",
        "venue": "International Journal of Modern Physics A",
        "year": "2018",
        "authors": "CZ Yuan",
        "abstract": "The BESIII and the LHCb became the leading experiments in the study of the exotic states \nafter the Belle and BaBar experiments finished their data taking in the first decade of this \u2026"
    },
    {
        "title": "The XYZ states: experimental and theoretical status and perspectives",
        "venue": "Physics Reports",
        "year": "2020",
        "authors": "N Brambilla, S Eidelman, C Hanhart, A Nefediev\u2026",
        "abstract": "The quark model was formulated in 1964 to classify mesons as bound states made of a \nquark\u2013antiquark pair, and baryons as bound states made of three quarks. For a long time all \u2026"
    },
    {
        "title": "Perancangan Digitalisasi Ruang Baca Fakultas XYZ Pada Universitas XYZ",
        "venue": "Prosiding CORISINDO 2023",
        "year": "2023

In [None]:
df = pd.DataFrame(result)
df.to_csv('articles.csv', index=False)

In [None]:
data = pd.read_csv('articles.csv')
data

Unnamed: 0,title,venue,year,authors,abstract
0,The XYZ states revisited,International Journal of Modern Physics A,2018,CZ Yuan,The BESIII and the LHCb became the leading exp...
1,The XYZ states: experimental and theoretical s...,Physics Reports,2020,"N Brambilla, S Eidelman, C Hanhart, A Nefediev…",The quark model was formulated in 1964 to clas...
2,Perancangan Digitalisasi Ruang Baca Fakultas X...,Prosiding CORISINDO 2023,2023,"E Hartati, Y Aprizal",… memiliki 11 Fakultas yang salah satunya adal...
3,An overview of XYZ new particles,Chinese Science Bulletin,2014,X Liu,… (XYZ\) have been announced by experiments af...
4,The xyz algorithm for fast interaction search ...,Journal of Machine Learning …,2018,"GA Thanei, N Meinshausen, RD Shah","… In this section, we present a version of the..."
...,...,...,...,...,...
595,Analisis dan perancangan sistem penentuan prio...,Journal of …,2022,"W Witanti, T Harihayati…",… Perusahaan XYZ merupakan suatu perusahaan ya...
596,[PDF][PDF] Pengaruh Marketing Public Relation ...,Value: Journal of Management and Business,2017,H Sugesti,… XYZ adalah sebuah taman rekreasi air yang me...
597,Peramalan Trend Pendapatan di Toko Online XYZ ...,Jurnal JTIK (Jurnal Teknologi …,2023,"JN Gustin, MAI Pakereng",The Utilization of an online shopping platform...
598,[PDF][PDF] Usulan Perawatan Mesin Berdasarkan ...,Jurnal Teknik …,2014,"DC Siagian, H Napitupulu…",PT. XYZ adalah salah satu perusahaan yang berg...
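
The byline handling in the scraping cell can be isolated and checked without contacting Google Scholar. The sketch below assumes the byline has the shape "authors - venue, year - publisher", which is what the collected records suggest. The helper name `parse_byline` is hypothetical, and the sample string is constructed for illustration from the first record above with a placeholder publisher.

```python
import re


def parse_byline(text):
    """Split a byline into (authors, venue, year); missing parts become ''."""
    # Normalize non-breaking spaces, then split only on hyphens surrounded by
    # whitespace so hyphenated names stay intact.
    parts = [p.strip() for p in re.split(r"\s+-\s+", text.replace("\xa0", " "))]
    authors = parts[0]
    venue_year = parts[1] if len(parts) > 1 else ""
    # Take a four-digit year only when it closes the venue segment.
    match = re.search(r",?\s*((?:19|20)\d{2})\s*$", venue_year)
    if match:
        return authors, venue_year[:match.start()].strip(), match.group(1)
    return authors, venue_year, ""


sample = "CZ Yuan - International Journal of Modern Physics A, 2018 - publisher.example"
print(parse_byline(sample))
```

Anchoring the year at the end of the segment keeps venue names that contain a year, such as "Prosiding CORISINDO 2023", intact.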


## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


**Reddit on topic Python**

In [None]:
!pip install httpx

Collecting httpx
  Downloading httpx-0.26.0-py3-none-any.whl (75 kB)
Collecting httpcore==1.* (from httpx)
  Downloading httpcore-1.0.3-py3-none-any.whl (77 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx)
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)
Installing collected packages: 

In [None]:
import httpx
import pandas as pd
import time

base_url = "https://www.reddit.com"
endpoint = "/r/python"
category = "/hot"
url = f"{base_url}{endpoint}{category}.json"
after_post_id = None  # pagination cursor returned by each response

dataset = []

# Request up to 5 pages of 100 posts each.
for _ in range(5):
    params = {
        'limit': 100,
        't': 'year',  # time unit (hour, day, week, month, year, all)
        'after': after_post_id
    }
    response = httpx.get(url, params=params)
    print(f'Fetching "{response.url}"...')
    if response.status_code != 200:
        print(response)
        raise Exception('Failed to fetch data')
    json_data = response.json()
    # Each child wraps one post; keep only its 'data' payload.
    dataset.extend([rec['data'] for rec in json_data['data']['children']])
    after_post_id = json_data['data']['after']
    if after_post_id is None:
        # No further pages; another request would start over from the first page.
        break
    time.sleep(0.5)  # pause between requests

df = pd.DataFrame(dataset)
df.to_csv('reddit_python.csv', index=False)


Fetching "https://www.reddit.com/r/python/hot.json?limit=100&t=year&after="...
Fetching "https://www.reddit.com/r/python/hot.json?limit=100&t=year&after=t3_1ai5okp"...
Fetching "https://www.reddit.com/r/python/hot.json?limit=100&t=year&after=t3_19etjtd"...
Fetching "https://www.reddit.com/r/python/hot.json?limit=100&t=year&after=t3_196sd4c"...
Fetching "https://www.reddit.com/r/python/hot.json?limit=100&t=year&after=t3_18y7t6j"...


In [None]:
data = pd.read_csv('reddit_python.csv')

In [None]:
data

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,subreddit_subscribers,created_utc,num_crossposts,media,is_video,post_hint,preview,author_cakeday,url_overridden_by_dest,media_metadata
0,,Python,# Weekly Thread: What's Everyone Working On Th...,t2_6l4z3,False,,0,False,Sunday Daily Thread: What's everyone working o...,"[{'a': ':pythonLogo:', 'e': 'emoji', 'u': 'htt...",...,1206085,1.707610e+09,0,,False,,,,,
1,,Python,# Weekly Thread: Meta Discussions and Free Tal...,t2_6l4z3,False,,0,False,Friday Daily Thread: r/Python Meta and Free-Ta...,"[{'a': ':pythonLogo:', 'e': 'emoji', 'u': 'htt...",...,1206085,1.708042e+09,0,,False,self,{'images': [{'source': {'url': 'https://extern...,,,
2,,Python,From the makers of `ruff` comes [`uv`](https:/...,t2_jlklb3zi,False,,0,False,Announcing uv: Python packaging in Rust,[],...,1206085,1.708027e+09,0,,False,self,{'images': [{'source': {'url': 'https://extern...,,,
3,,Python,I'm excited to announce the beta release of [B...,t2_yezak,False,,0,False,BlackMarblePy: Python Package to Retrieve NASA...,"[{'e': 'text', 't': 'Showcase'}]",...,1206085,1.708043e+09,1,,False,self,{'images': [{'source': {'url': 'https://extern...,,,
4,,Python,A GitHub repository of Python Tutorials in mar...,t2_rgcg4,False,,0,False,Anaconda Python Distribution Tutorials,"[{'e': 'text', 't': 'Tutorial'}]",...,1206085,1.708074e+09,0,,False,self,{'images': [{'source': {'url': 'https://extern...,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
472,,Python,"Really, [oreiller](https://www.google.com/sear...",t2_13esq4qb,False,,0,False,Oreiller: An image library for easy Pillow man...,"[{'e': 'text', 't': 'Beginner Showcase'}]",...,1206085,1.703839e+09,0,,False,,,,,"{'r8ixcd8f579c1': {'status': 'valid', 'e': 'Im..."
473,,Python,I started taking pictures 'everyday' in 2019 a...,t2_cjzp0u3,False,,0,False,I created a program to align thousands of self...,"[{'e': 'text', 't': 'Intermediate Showcase'}]",...,1206085,1.703816e+09,0,,False,self,{'images': [{'source': {'url': 'https://extern...,,,
474,,Python,[pyjanitor](https://pyjanitor-devs.github.io/p...,t2_umww5x61,False,,0,False,Efficient Range Joins in Pandas,"[{'e': 'text', 't': 'Resource'}]",...,1206085,1.703852e+09,0,,False,,,,,
475,,Python,*Stop wasting time saving plots manually — aut...,t2_89cohrt0,False,,0,False,A Better Way to Wrangle Figures Out of Jupyter...,"[{'e': 'text', 't': 'Intermediate Showcase'}]",...,1206085,1.703835e+09,0,,False,self,{'images': [{'source': {'url': 'https://extern...,,,"{'sj6f6u69q69c1': {'status': 'valid', 'e': 'Im..."
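
The raw listing has many columns, most of them empty. The sketch below keeps a handful of fields and adds a readable timestamp. It assumes the JSON shape used in the cell above (`data` → `children` → `data`); the helper name `listing_to_frame`, the column selection, and the two sample posts are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Fields kept from each post; more than four, per the question's requirement.
COLUMNS = ["title", "author", "score", "num_comments", "created_utc", "url"]


def listing_to_frame(listing, columns=COLUMNS):
    """Flatten one listing page into a DataFrame with a fixed set of columns."""
    records = [child["data"] for child in listing["data"]["children"]]
    # reindex keeps the chosen columns and fills any missing field with NaN.
    frame = pd.DataFrame(records).reindex(columns=columns)
    frame["created"] = pd.to_datetime(frame["created_utc"], unit="s", utc=True)
    return frame


# Hypothetical sample in the same shape as a listing response.
sample_listing = {
    "data": {
        "after": None,
        "children": [
            {"data": {"title": "Example post A", "author": "user_a", "score": 10,
                      "num_comments": 2, "created_utc": 1708027000,
                      "url": "https://example.com/a", "saved": False}},
            {"data": {"title": "Example post B", "author": "user_b", "score": 5,
                      "created_utc": 1708043000, "url": "https://example.com/b"}},
        ],
    }
}

frame = listing_to_frame(sample_listing)
print(frame.shape)
print(list(frame.columns))
```

Fixing the column list up front keeps the saved CSV stable even when individual posts omit a field.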


## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

Web scraping and data collection are both fascinating and challenging endeavors. Through my experience with these tasks, I have gained a wealth of knowledge. The process of scraping data is particularly demanding, often requiring hours of work and pushing the limits of GPU capabilities. This has led me to the realization that a highly capable computer is essential for efficient data scraping.


Moreover, I have encountered challenges with social media platforms, which have implemented pricing models that limit access for developers seeking to learn. For instance, while I successfully authenticated with the Twitter API, I found that calls for additional features were forbidden due to pricing constraints. Despite these obstacles, I managed to extract data from Reddit, which conveniently provides JSON data for Python discussion posts. By accessing the API for these JSON files, I was able to successfully scrape data related to Python discussions.


However, I have faced issues with websites either returning a 403 error, indicating that access is forbidden, or a 429 error, signaling too many requests. Despite these challenges, the learning experience has been immensely rewarding.
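
One common way to handle 429 responses is to wait and retry with growing delays. The following is a minimal sketch of that idea, not the approach used in the cells above. The helper name `fetch_with_backoff` and its parameters are hypothetical, and `get` stands for any callable that returns an object with a `status_code` attribute, such as `requests.get` or `httpx.get`. A 403 is returned immediately, since retrying a forbidden request rarely helps.

```python
import time


def fetch_with_backoff(get, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call get(url), retrying on 429/503 with exponentially growing pauses."""
    response = None
    for attempt in range(max_attempts):
        response = get(url)
        if response.status_code not in (429, 503):
            return response  # success, or an error that retrying will not fix
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return response  # still throttled after the last attempt


class StubResponse:
    """Stand-in for a real HTTP response, used only for this offline demo."""

    def __init__(self, status_code):
        self.status_code = status_code


statuses = iter([429, 429, 200])
pauses = []
final = fetch_with_backoff(
    lambda url: StubResponse(next(statuses)),
    "https://example.com/data",
    sleep=pauses.append,  # record the pauses instead of actually sleeping
)
print(final.status_code, pauses)  # 200 [1.0, 2.0]
```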


I am now confident in my ability to scrape data from virtually any source, provided I have the necessary resources. Although I encountered numerous difficulties throughout this process, the knowledge and satisfaction gained in the end have been incredibly fulfilling.