<a href="https://colab.research.google.com/github/snampally97/assignment-reviews/blob/main/nampally_srikanth_Exercise_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

In [None]:
What effects does telemedicine adoption have on patient outcomes and accessibility to healthcare in remote areas?

Reasons:

By overcoming geographic barriers, telemedicine offers a viable way to improve healthcare accessibility in rural areas. Remote consultations and services can narrow the gap in healthcare access between rural and urban populations. Understanding how telemedicine affects healthcare accessibility is therefore critical to reducing access inequities.

Examining how telemedicine adoption affects patient outcomes such as health status, adherence to treatment, and satisfaction offers valuable insight into how effective the technology is at delivering healthcare in remote areas. Assessing patient outcomes helps gauge the quality of care and identify areas for improvement.

To allocate resources and make policy decisions about rural healthcare, it is essential to evaluate the cost-effectiveness of telemedicine compared with traditional approaches. Understanding its financial implications helps maximize results while making the best use of resources.

Healthcare providers' opinions on the advantages and difficulties of telemedicine deployment, along with their experiences using it, are another valuable source of information. Provider viewpoints can reveal adoption hurdles and inform supportive actions.

The long-term impacts of telemedicine adoption, including sustainability and scalability, must be considered in future healthcare planning for rural areas. Understanding its transformative potential guides the development of sustainable initiatives that meet the evolving demands of rural populations.

Data Collection:

Telemedicine Adoption Data:
Get information about the telemedicine platforms that are being used, the services that are being provided, and the degree of patient acceptance in rural healthcare facilities.

Healthcare Accessibility Metrics:
Acquire information on variables related to rural towns' healthcare accessibility, including travel times to the closest medical facility, the presence of medical personnel, and the percentage of services used.

Patient Outcomes Data:
Gather data on patient outcomes before and after telemedicine interventions, including health status, adherence to treatment, patient satisfaction, and health-related quality of life.

Cost Data:
Gather information on the costs of implementing telemedicine and providing healthcare services, such as the upfront investment costs, ongoing operating costs, and cost reductions over more conventional approaches.

Provider Surveys or Interviews:
Conduct surveys or interviews with healthcare providers in remote areas to gather their experiences, opinions, and difficulties with telemedicine adoption and use.

Data Storage and Analysis:

Data Management:
When storing acquired data, make sure it complies with privacy laws like HIPAA (Health Insurance Portability and Accountability Act) by following the proper data management practices.
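
One data management step can be sketched as follows: replacing direct patient identifiers with keyed hashes before records are stored. This is only an illustrative sketch. The field names (`patient_id`, `satisfaction`), the sample record, and the `SECRET_KEY` value are hypothetical, and pseudonymizing identifiers does not by itself make a dataset HIPAA compliant.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be loaded from a secure store, not hard-coded.
SECRET_KEY = b"replace-with-a-real-secret"

def pseudonymize(patient_id: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-256 digest that stands in for the raw identifier."""
    return hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()

def deidentify(record: dict) -> dict:
    """Copy a record, swapping the raw patient_id for its pseudonym."""
    cleaned = dict(record)
    cleaned["patient_id"] = pseudonymize(record["patient_id"])
    return cleaned

# Hypothetical example record
record = {"patient_id": "MRN-001234", "satisfaction": "Satisfied"}
safe_record = deidentify(record)
print(safe_record)
```

Because the same identifier always maps to the same digest, before-and-after records for one patient can still be linked without storing the raw ID.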

Quantitative Analysis:
Quantitative data can be studied using statistical analytic techniques like regression analysis to assess the relationship between the utilization of telemedicine and patient outcomes or healthcare accessibility.
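
The regression idea above can be illustrated with a minimal ordinary least squares fit. The numbers below are made-up placeholders, not study data, and the variable names (`visits`, `satisfaction`) are assumptions chosen only for illustration.

```python
from statistics import mean

def fit_line(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Made-up illustrative values: telemedicine visits per patient per year
# and a 1-10 satisfaction score for the same patients.
visits = [0, 1, 2, 3, 4, 5]
satisfaction = [5.0, 5.5, 6.0, 6.5, 7.0, 7.5]

slope, intercept = fit_line(visits, satisfaction)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

A real analysis would likely use a statistics package and adjust for confounders such as age or distance to the nearest clinic; a single-predictor fit like this only shows the basic mechanics.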

Qualitative Analysis:
Utilize qualitative research techniques, such as thematic analysis, to examine qualitative information obtained from provider interviews or surveys in order to pinpoint important themes and takeaways about the acceptance and application of telemedicine.

Integration and Interpretation:
Combine results from qualitative and quantitative research to give a thorough understanding of how telemedicine affects patient outcomes and healthcare accessibility in remote areas.

Report and Dissemination:
To assist in decision-making and to advance evidence-based healthcare practices, disseminate study findings via scholarly publications, policy briefs, or presentations to relevant parties, such as legislators, representatives of rural communities, and healthcare practitioners.


## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [None]:
import pandas as pd
import random

def generate_data(num_samples):
    # Builds synthetic records by random sampling; no real-world data is collected here.
    data = []
    for _ in range(num_samples):
        telemedicine_adoption = random.choice(['High', 'Medium', 'Low'])
        patient_health_status = random.choice(['Improved', 'No Change', 'Worsened'])
        treatment_adherence = random.choice(['High', 'Medium', 'Low'])
        patient_satisfaction = random.choice(['Satisfied', 'Neutral', 'Dissatisfied'])
        cost_effectiveness = random.choice(['Cost-effective', 'Neutral', 'Cost-ineffective'])
        provider_perspectives = random.choice(['Positive', 'Neutral', 'Negative'])
        long_term_implications = random.choice(['Positive', 'Neutral', 'Negative'])

        data.append([telemedicine_adoption, patient_health_status, treatment_adherence,
                     patient_satisfaction, cost_effectiveness, provider_perspectives,
                     long_term_implications])

    return data

def save_to_csv(data, filename):
    df = pd.DataFrame(data, columns=['Telemedicine Adoption', 'Patient Health Status', 'Treatment Adherence',
                                     'Patient Satisfaction', 'Cost Effectiveness', 'Provider Perspectives',
                                     'Long-term Implications'])
    df.to_csv(filename, index=False)

dataset = generate_data(1000)
save_to_csv(dataset, 'telemedicine_study_dataset.csv')
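
A quick sanity check on a file written this way might look like the sketch below. It uses only the standard library and writes a tiny stand-in file to a temporary directory so that it runs on its own. The helper name `summarize_csv`, the file name `sample.csv`, and the three sample rows are placeholders, not part of the dataset above.

```python
import csv
import os
import tempfile
from collections import Counter

def summarize_csv(path, column):
    """Return (row_count, value_counts) for one column of a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    return len(rows), Counter(row[column] for row in rows)

# Write a tiny placeholder file so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.csv")
with open(sample_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Telemedicine Adoption", "Patient Satisfaction"])
    writer.writerows([["High", "Satisfied"], ["Low", "Neutral"], ["High", "Neutral"]])

n_rows, counts = summarize_csv(sample_path, "Telemedicine Adoption")
print(n_rows, dict(counts))
```

Pointing `summarize_csv` at `telemedicine_study_dataset.csv` should report 1000 rows, with the three adoption levels appearing in roughly equal proportions because they are sampled uniformly.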

## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference being published

(3) Year

(4) Authors

(5) Abstract

In [None]:
import requests
from bs4 import BeautifulSoup
import csv

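def fetch_articles_microsoft(keyword, num_articles):
    # Note: Microsoft Academic was retired at the end of 2021, so this request is expected to fail.
    articles = []
    base_url = "https://academic.microsoft.com/search"
    # Pass the keyword via params so it is URL-encoded; the timeout avoids hanging indefinitely.
    response = requests.get(base_url, params={"q": keyword}, timeout=30)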
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        results = soup.find_all('div', class_='paper')
        for result in results[:num_articles]:
            try:
                title = result.find('div', class_='name').text.strip()
                venue = result.find('div', class_='conference-name').text.strip()
                year = result.find('span', class_='year').text.strip()
                authors = result.find('div', class_='authors').text.strip()
                abstract = result.find('div', class_='abstract').text.strip()
                articles.append([title, venue, year, authors, abstract])
            except Exception as e:
                print(f"Error: {e}")
    else:
        print("Failed to fetch articles")

    return articles

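# Assuming 'XYZ' as keyword and 1000 articles to be collected
keyword = "XYZ"
num_articles = 1000
articles = fetch_articles_microsoft(keyword, num_articles)

# Write with the csv module: save_to_csv from Question 2 expects seven columns, not these five.
with open(f'{keyword}_articles_microsoft.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Title', 'Venue', 'Year', 'Authors', 'Abstract'])
    writer.writerows(articles)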
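
Since the question also allows Semantic Scholar, one alternative is its public Graph API. The sketch below is an outline under stated assumptions, not the method used above: the search endpoint, the `year` filter, and the `next` paging field reflect the paper search endpoint as I understand it and should be checked against the current API documentation. The function names are hypothetical, and unauthenticated requests may be rate limited or capped in how many results they return.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.semanticscholar.org/graph/v1/paper/search"
FIELDS = "title,venue,year,authors,abstract"

def build_url(keyword, offset, limit=100, years="2014-2024"):
    """Build one paged search URL with the keyword safely URL-encoded."""
    params = {"query": keyword, "fields": FIELDS, "year": years,
              "offset": offset, "limit": limit}
    return API_URL + "?" + urllib.parse.urlencode(params)

def to_row(paper):
    """Flatten one paper record into [title, venue, year, authors, abstract]."""
    authors = "; ".join(a.get("name", "") for a in paper.get("authors") or [])
    return [paper.get("title") or "", paper.get("venue") or "",
            paper.get("year") or "", authors, paper.get("abstract") or ""]

def fetch_articles(keyword, num_articles=1000):
    """Page through results until enough rows are collected or results run out."""
    rows, offset = [], 0
    while len(rows) < num_articles:
        with urllib.request.urlopen(build_url(keyword, offset), timeout=30) as resp:
            payload = json.load(resp)
        rows.extend(to_row(p) for p in payload.get("data", []))
        if "next" not in payload:  # assumed to be absent on the last page
            break
        offset = payload["next"]
    return rows[:num_articles]
```

Calling `fetch_articles("XYZ")` and writing the rows with `csv.writer` would yield the five required columns. Missing venues or abstracts come back as empty strings rather than raising an error.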

## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [None]:
import os
import tweepy
import pandas as pd

# Twitter API credentials are read from environment variables so that secrets
# are never stored in the notebook. The variable names below are placeholders.
consumer_key = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']
access_token = os.environ['TWITTER_ACCESS_TOKEN']
access_token_secret = os.environ['TWITTER_ACCESS_TOKEN_SECRET']


# Authenticate Twitter
auth = tweepy.OAuth1UserHandler(consumer_key, consumer_secret, access_token, access_token_secret)
twitter_api = tweepy.API(auth, wait_on_rate_limit=True)


def collect_twitter_data(api, query, count=10):
    tweets_data = []
    for tweet in api.search_tweets(q=query, lang='en', tweet_mode='extended', count=count):
        # Collect six fields so the dataset has more than four columns, as the question requires.
        tweets_data.append([tweet.id, tweet.created_at, tweet.user.screen_name, tweet.full_text,
                            tweet.retweet_count, tweet.favorite_count])
    return tweets_data

# Example usage
query = '#python'
twitter_data = collect_twitter_data(twitter_api, query, count=5)

# Combine data into a DataFrame
twitter_df = pd.DataFrame(twitter_data, columns=['Tweet ID', 'Created At', 'Username', 'Text',
                                                 'Retweets', 'Likes'])

# Display the DataFrame
print("Twitter Data:")
print(twitter_df)


Forbidden: 403 Forbidden
453 - You currently have access to a subset of Twitter API v2 endpoints and limited v1.1 endpoints (e.g. media post, oauth) only. If you need access to this endpoint, you may need a different access level. You can learn more here: https://developer.twitter.com/en/portal/product
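
Given the 403 response above, one fallback among the platforms the question lists is Reddit, whose listing pages have historically also been available as JSON. The sketch below is a hedged outline rather than a tested collector: the `search.json` URL, the `User-Agent` string, the function names, and the chosen fields are assumptions, and Reddit may throttle or block unauthenticated requests, in which case its official OAuth-based API would be needed.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timezone

COLUMNS = ["Post ID", "Created At", "Author", "Subreddit", "Title", "Score", "Comments"]

def post_to_row(post):
    """Flatten one Reddit post dict into a row matching COLUMNS."""
    created = datetime.fromtimestamp(post.get("created_utc", 0), tz=timezone.utc)
    return [post.get("id"), created.isoformat(), post.get("author"),
            post.get("subreddit"), post.get("title"),
            post.get("score"), post.get("num_comments")]

def collect_reddit_posts(query, limit=25):
    """Fetch search results for a keyword and return them as rows."""
    params = urllib.parse.urlencode({"q": query, "limit": limit})
    request = urllib.request.Request(
        "https://www.reddit.com/search.json?" + params,
        headers={"User-Agent": "info5731-exercise-sketch/0.1"},  # hypothetical agent string
    )
    with urllib.request.urlopen(request, timeout=30) as resp:
        listing = json.load(resp)
    return [post_to_row(child["data"]) for child in listing["data"]["children"]]
```

Wrapping the result as `pd.DataFrame(collect_reddit_posts("python"), columns=COLUMNS)` would give seven columns, which satisfies the requirement of more than four.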

## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

In [None]:
# write your answer here


# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [None]:
Working on web scraping tasks taught me a great deal about CSS selectors and HTML structure, both of which are essential for precise data extraction. Exploring scraping libraries such as BeautifulSoup and Selenium gave me more insight into the techniques used to gather web data. Dynamically generated content is hard to handle with static HTML parsing, so it calls for browser automation with Selenium. Barriers such as IP filtering and CAPTCHAs further hinder data collection. In my field, AI researchers rely on web scraping to gather diverse datasets from online platforms for sentiment analysis, model training, and insight extraction, which improves the breadth and depth of their studies.