# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission. , and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind, what kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

#"Could wearable biometric data forecast the development of decision fatigue among people taking challenging decisions throughout the day?"

Synopsis:
People who make a lot of judgments during the day may become weary of them, which could affect the caliber and speed of their decisions. Through the examination of physiological information (heart rate, skin temperature, and stress levels), we can ascertain whether there exist quantifiable markers that forecast the onset of decision fatigue. This may result in customized solutions that support people in keeping their minds clear when making decisions.


##Data Collection Plan:

###Biometric Data:
1)Modulation of heart rate

2)temperature of the skin

3)sleep statistics

4) Exercise

###Mental Performance:
 1)Decision-making time

 2)Decision-making quality and accuracy

###Decision Perspective:
1)Total number of decisions made in a day

2)Decision difficulty

###Emotional Information:
1)Mental tiredness

2)Self-reported decision-making

##Amount of Data Needed:
50–100 people, covering various lifestyles and professions.

To examine patterns and variability in daily decision fatigue over time, data should be collected for a minimum of three to four months.

Wearable technology is used to continuously collect biometric data.

Decision-making and thinking-making information gathered following each session of decision-making

##Steps to collect and save the data:

###Enrollment of Participants:
Find participants who make judgments on a daily basis, such as students, entrepreneurs, or those in high-risk positions. Teach participants how to use a basic smartphone app or survey platform to gather subjective data.

###Biometric Data Collection:
Now we need to collect biometric data from variables such as Modulation of heart rate, skin temperature, sleep stats and excercise routine to assess ceretian factors like stress, mental stability,fluctuations that may indicate  fatigue, physical activity and sleep cycle.

###Decision-making skills:
Create a decision-logging application that allows users to score mental effort, confidence, and weariness as well as record and categorize daily decisions. Automatic reminders should be set up to remind users to enter the data on a regular basis.Provide controlled decision-making exercises with a range of difficulty levels for puzzles, strategic games, and logic challenges. To replicate decision fatigue patterns, repeat activities throughout the day, vary task difficulty, and measure time and accuracy.

###Data Management and Storage:
Utilize Participant ID, Date & Time, and biometric information to organize data (sleep, activity, GSR, HRV, and skin temperature). Factors such as work kind, difficulty, duration, accuracy, effort, and confidence should be considered while making decisions. Keep subjective measurements (such as energy, weariness, and mood) encrypted in a safe cloud database.

###Analysis:
Investigate relationships between biometric markers and decision fatigue using data analytic technologies. Use machine learning models to analyze heart rate variability, skin temperature, GSR, sleep, and activity to forecast the onset, accuracy, and decision time of weariness.

##Conclusion:
This study could contribute to our understanding of decision fatigue and inspire novel approaches, including wearable apps that recommend breaks or modifications to enhance cognitive function during times when decision-making demands are high.










## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [None]:
import random
import pandas as pd

# biometric data and decision fatigue simulation
def data_simulation(samples):
    dt = []
    for _ in range(samples):
        decisions = random.randint(5, 15)
        difficulty = random.choice([1, 2, 3, 4, 5])
        fatigue = min(15, random.uniform(1, 5) + decisions * 0.2 * difficulty)
        hrv = max(40, 80 - fatigue * 3)
        skin_temperature = 98 + fatigue * 0.2
        stress = random.uniform(1, 20) if fatigue > 5 else random.uniform(5, 10)

        dt.append([decisions, difficulty, fatigue, hrv, skin_temperature, stress])

    return dt

# dataset creation
samples = 1000
columns = ['decisions', 'difficulty', 'fatigue', 'hrv', 'skin_temperature', 'stress']
dataset = data_simulation(samples)

# Saving data to csv
df = pd.DataFrame(dataset, columns=columns)
df.to_csv('biometric data .csv', index=False)

print("Dataset saved as 'biometric data.csv'")


Dataset saved as 'biometric data.csv'


## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference being published

(3) Year

(4) Authors

(5) Abstract

In [4]:
import requests
import pandas as pd
from datetime import datetime


#Article retrieval function from the Semantic Scholar API
def articles(keyphrases, start_year, end_year, article_count):
    url = "https://api.semanticscholar.org/graph/v1/paper/search"
    params = {
        "object": keyphrases,
        "limit": article_count,
        "year": f"{start_year}-{end_year}",
        "cols": "title,venue,year,authors,abstract"
    }

    decision = requests.get(url, params=params)
    if decision.status_code != 100:
        print("Unable to retrieve data")
        return []

    data = decision.json()
    articles = data.get("data", [])

    results = []
    for article in articles:
        title = article.get("title")
        venue = article.get("venue")
        year = article.get("year")
        authors = ", ".join([author.get("name") for author in article.get("authors", [])])
        abstract = article.get("abstract")

        results.append({
            "Title": title,
            "Venue/Journal/Conference": venue,
            "Year": year,
            "Authors": authors,
            "Abstract": abstract
        })

    return results

# Specifications
keyphrases = "XYZ"
start_year = 2014
end_year = 2024
article_count = 1000

# data fetching
articles = articles(keyphrases, start_year, end_year, article_count)

#Dataframe
df = pd.DataFrame(articles)

# Saving data to csv
df.to_csv('articles_data.csv', index=False)

print(f"Data collected and saved as 'articles_data.csv'.")


Unable to retrieve data
Data collected and saved as 'articles_data.csv'.


## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, Twitter (formerly known as X), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please only choose one option for question 4. If you do both options, we will grade only the first one

In [3]:
# write your answer here
https://myunt-my.sharepoint.com/:f:/g/personal/vijayaramareddymallidi_my_unt_edu/EnFhduZ1d5RFqjdF5tacthABPYeqykk0xHJWWBpxGNzMiw?e=EY3flT

# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [2]:
'''

Working on web scraping tasks was a valuable experience, teaching key concepts like HTML structure, pagination, and dynamic content extraction. Challenges included anti-scraping mechanisms like CAPTCHA, which were handled using tools like Octoparse. Web scraping enhances research by automating data collection for analysis, benefiting decision-making and large-scale studies,I also tried my level best to complete 4A but i couldn't make it will more on it in future
'''

"\n\nWorking on web scraping tasks was a valuable experience, teaching key concepts like HTML structure, pagination, and dynamic content extraction. Challenges included anti-scraping mechanisms like CAPTCHA, which were handled using tools like Octoparse. Web scraping enhances research by automating data collection for analysis, benefiting decision-making and large-scale studies,I also tried my level best to complete 4A but i couldn't make it will more on it in future\n"