
# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question, or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

In [None]:
# write your answer here
'''How does the public sentiment towards artificial intelligence (AI) evolve over time in major news outlets?

Data Needed:
To answer this question, you would need the following types of data:

News Articles:
Title, content, publication date, author, and source (news outlet).
Articles specifically focused on artificial intelligence (AI), its technologies, ethics, impact on society, and industries.
Metadata:
Date of publication.
Article source (e.g., Times of India).
Sentiment (positive, negative, or neutral) derived from text analysis.
Keyword frequency (mentions of terms like "AI," "ethics," "machine learning," "automation").
Amount of Data Needed:
Time Span: To track sentiment over time, collecting data over at least one year would be ideal. You could begin with a span of 6 months if shorter time frames are needed.
Volume: Aim for a dataset of 500 to 1000 articles initially to cover diverse opinions and events. You can scale up later if needed.
Sources: Collect data from several major news outlets to ensure a variety of perspectives (e.g., The Guardian, BBC, New York Times, Forbes, and Wired).
Detailed Steps for Collecting and Saving the Data:
Identify Data Sources:

Choose reliable news websites that cover a wide range of articles on AI.
Each site should have accessible news sections or search pages with relevant keywords like "Artificial Intelligence," "AI," or related terms.
Set Up a Web Scraping Framework:

Use Python libraries such as BeautifulSoup (for parsing HTML), Selenium (for dynamic content), or Scrapy (for advanced crawling).
You may also use requests for simple HTTP requests to access static content.
Search News Articles by Keywords:

For each website, create a list of URLs for news articles. This can be done by querying search forms for terms like "AI" and setting a date filter to cover the desired time span.
Example: The Guardian's search results for "Artificial Intelligence."
Extract Data:

Use BeautifulSoup to extract relevant HTML tags, such as <title>, <time>, and <article>.
Scrape data points like the headline, author, publication date, and article text.'''
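
The research question asks how sentiment changes over time, so once each article has a date and a sentiment label, the labels have to be grouped by period. The following is a minimal sketch of that grouping, assuming records shaped like the fields listed above, with an ISO-style `date` string and a `sentiment` label. The function name `monthly_sentiment_share` and the sample records are illustrative only; they are not collected data.

```python
from collections import Counter, defaultdict

def monthly_sentiment_share(records):
    """Return {YYYY-MM: {label: share}} for records with 'date' and 'sentiment' keys."""
    counts = defaultdict(Counter)
    for record in records:
        month = record["date"][:7]  # 'YYYY-MM' prefix of an ISO date string
        counts[month][record["sentiment"]] += 1

    shares = {}
    for month, counter in sorted(counts.items()):
        total = sum(counter.values())
        shares[month] = {label: n / total for label, n in counter.items()}
    return shares

# Made-up records for illustration only
sample_records = [
    {"date": "2024-01-05", "sentiment": "positive"},
    {"date": "2024-01-20", "sentiment": "negative"},
    {"date": "2024-02-03", "sentiment": "positive"},
]
sentiment_by_month = monthly_sentiment_share(sample_records)
print(sentiment_by_month)
```

Articles without a usable date would need to be filtered out before this step, since the month is taken directly from the date string.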

## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [30]:
# write your answer here
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
from collections import Counter
from textblob import TextBlob

# Define the base URL and search query parameters
base_url = 'https://timesofindia.indiatimes.com/'
search_query = 'artificial intelligence'
total_articles = 1000
articles_collected = 0
article_list = []

# Keywords to track
keywords = ['AI', 'ethics', 'machine learning', 'automation']

def fetch_article_links(search_url, num_pages):
    article_links = []
    for page in range(1, num_pages + 1):
        params = {'q': search_query, 'page': page}
        response = requests.get(search_url, params=params, timeout=10)  # timeout prevents hanging on a stalled connection

        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')

            # Find article links (modify selector based on actual site)
            links = soup.find_all('a', href=True)
            for link in links:
                href = link['href']
                if 'article' in href:
                    full_url = href if href.startswith('http') else f"https://timesofindia.indiatimes.com{href}"
                    article_links.append(full_url)
        else:
            print(f"Failed to retrieve page {page}. Status code: {response.status_code}")

    return article_links

def analyze_sentiment(text):
    """Returns sentiment polarity: positive, negative, or neutral"""
    blob = TextBlob(text)
    polarity = blob.sentiment.polarity
    if polarity > 0:
        return 'positive'
    elif polarity < 0:
        return 'negative'
    else:
        return 'neutral'

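import re  # used by count_keywords for whole-word matching

def count_keywords(text, keywords):
    """Returns a dictionary of case-insensitive, whole-word keyword frequencies in the text"""
    keyword_counts = {}
    for keyword in keywords:
        # \b boundaries stop 'AI' from matching inside 'said' and still match 'AI,' or 'AI.';
        # multi-word keywords such as 'machine learning' are matched as a phrase
        pattern = r'\b' + re.escape(keyword) + r'\b'
        keyword_counts[keyword] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return keyword_counts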

def scrape_article(url):
    try:
        response = requests.get(url, timeout=10)  # timeout prevents hanging on a stalled connection
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')

            # Extract article details (modify selectors based on actual site)
            title = soup.find('h1').get_text() if soup.find('h1') else 'No Title'
            date = soup.find('time')['datetime'] if soup.find('time') else 'No Date'
            content = soup.find('div', class_='article-body').get_text() if soup.find('div', class_='article-body') else 'No Content'

            # Perform sentiment analysis and keyword frequency analysis
            sentiment = analyze_sentiment(content)
            keyword_freq = count_keywords(content, keywords)

            return {'title': title, 'date': date, 'content': content, 'url': url, 'sentiment': sentiment, **keyword_freq}
        else:
            print(f"Failed to retrieve article at {url}. Status code: {response.status_code}")
            return None
    except Exception as e:
        print(f"Error fetching article {url}: {e}")
        return None

def main():
    global articles_collected
    num_pages = 10  # Number of search result pages to scrape

    print("Fetching article links...")
    article_links = fetch_article_links(base_url, num_pages)
    print(f"Found {len(article_links)} article links.")

    print("Scraping articles...")
    for link in article_links:
        if articles_collected >= total_articles:
            break

        article = scrape_article(link)
        if article:
            article_list.append(article)
            articles_collected += 1
            print(f"Collected {articles_collected}/{total_articles} articles")

        time.sleep(1)  # Be respectful and avoid overwhelming the server

    print("Saving data...")
    df = pd.DataFrame(article_list)
    df.to_csv('ai_news_articles.csv', index=False)
    print("Data saved to ai_news_articles.csv")

if __name__ == "__main__":
    main()



Fetching article links...
Found 2330 article links.
Scraping articles...
Collected 1/1000 articles
Collected 2/1000 articles
Collected 3/1000 articles
Collected 4/1000 articles
Collected 5/1000 articles
Collected 6/1000 articles
Collected 7/1000 articles
Collected 8/1000 articles
Collected 9/1000 articles
Collected 10/1000 articles
Collected 11/1000 articles
Collected 12/1000 articles
Collected 13/1000 articles
Collected 14/1000 articles
Collected 15/1000 articles
Collected 16/1000 articles
Collected 17/1000 articles
Collected 18/1000 articles
Collected 19/1000 articles
Collected 20/1000 articles
Collected 21/1000 articles
Collected 22/1000 articles
Collected 23/1000 articles
Collected 24/1000 articles
Collected 25/1000 articles
Collected 26/1000 articles
Collected 27/1000 articles
Collected 28/1000 articles
Collected 29/1000 articles
Collected 30/1000 articles
Collected 31/1000 articles
Collected 32/1000 articles
Collected 33/1000 articles
Collected 34/1000 articles
Collected 35/1000 
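
The run above reports 2330 links from 10 pages. Because every `<a>` tag whose `href` contains "article" is appended, that list may contain repeated URLs. The following is a minimal sketch of resolving relative links and dropping duplicates before scraping, using only the standard library. The helper name `normalize_links` and the sample hrefs are hypothetical and were not taken from the site.

```python
from urllib.parse import urldefrag, urljoin

def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url, strip #fragments, and drop duplicates in order."""
    seen = set()
    unique_links = []
    for href in hrefs:
        # urljoin handles both relative and absolute hrefs
        absolute_url, _fragment = urldefrag(urljoin(base_url, href))
        if absolute_url not in seen:
            seen.add(absolute_url)
            unique_links.append(absolute_url)
    return unique_links

# Made-up hrefs for illustration only
sample_hrefs = [
    "/tech/articleshow/1.cms",
    "https://timesofindia.indiatimes.com/tech/articleshow/1.cms#comments",
    "/tech/articleshow/2.cms",
]
links = normalize_links("https://timesofindia.indiatimes.com/", sample_hrefs)
print(links)
```

Passing the de-duplicated list to the scraping loop would avoid spending requests, and the one-second delay, on pages that were already collected.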

## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), CiteSeerX (https://citeseerx.ist.psu.edu/index), Semantic Scholar (https://www.semanticscholar.org/), or the ACM Digital Library (https://dl.acm.org/) with the keyword "XYZ". The articles should have been published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference where it was published

(3) Year

(4) Authors

(5) Abstract

In [29]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def fetch_acm_articles(keyword, start_year, end_year, num_articles=1000):
    base_url = "https://dl.acm.org/action/doSearch"

    # Initialize an empty list to store article information
    articles = []
    start = 0  # For pagination

    while len(articles) < num_articles:
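        # Let requests URL-encode the query parameters instead of building the URL by hand
        params = {"AllField": keyword, "startYear": start_year, "endYear": end_year,
                  "pageSize": 50, "startPage": start}

        # Fetch search results; stop if the request is rejected
        response = requests.get(base_url, params=params, timeout=30)
        if response.status_code != 200:
            print(f"Request failed with status code {response.status_code}")
            break
        soup = BeautifulSoup(response.content, "html.parser")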

        # Extract article details
        results = soup.find_all("div", class_="issue-item__content")
        for result in results:
            # Safely extract title
            title_element = result.find("h5", class_="issue-item__title")
            title = title_element.text.strip() if title_element else "N/A"

            # Safely extract authors
            authors_element = result.find("span", class_="issue-item__authors")
            authors = authors_element.text.strip() if authors_element else "N/A"

            # Safely extract venue
            venue_element = result.find("span", class_="issue-item__venue")
            venue = venue_element.text.strip() if venue_element else "N/A"

            # Safely extract year
            year_element = result.find("span", class_="issue-item__year")
            year = year_element.text.strip() if year_element else "N/A"

            # Safely extract abstract
            abstract_element = result.find("div", class_="issue-item__abstract")
            abstract = abstract_element.text.strip() if abstract_element else "N/A"

            articles.append({
                "Title": title,
                "Authors": authors,
                "Venue": venue,
                "Year": year,
                "Abstract": abstract
            })

            # Stop if we've collected enough articles
            if len(articles) >= num_articles:
                break

        # Increment the page for pagination
        start += 1

        # If no more results are found, break the loop
        if not results:
            break

    return articles

# Usage example
keyword = "XYZ"
start_year = 2014
end_year = 2024
num_articles_to_collect = 1000

xyz_articles = fetch_acm_articles(keyword, start_year, end_year, num_articles_to_collect)

# Create a DataFrame from the list of articles
df = pd.DataFrame(xyz_articles)

# Save the DataFrame to a CSV file
csv_filename = "xyz_articles.csv"
df.to_csv(csv_filename, index=False)

print(f"Saved {len(df)} articles to {csv_filename}")


Saved 1000 articles to xyz_articles.csv
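
A row count of 1000 does not by itself show that the selectors matched, because every missing field is filled with "N/A". The following is a small check, sketched with made-up rows and a hypothetical helper `missing_field_counts`, that counts placeholder values per column so that empty fields are visible before the CSV is used.

```python
from collections import Counter

def missing_field_counts(rows, placeholder="N/A"):
    """Count, per field, how many rows hold the placeholder instead of real data."""
    counts = Counter()
    for row in rows:
        for field, value in row.items():
            if value == placeholder:
                counts[field] += 1
    return dict(counts)

# Made-up rows for illustration only
sample_rows = [
    {"Title": "Paper A", "Authors": "N/A", "Venue": "N/A", "Year": "2020", "Abstract": "N/A"},
    {"Title": "Paper B", "Authors": "A. Author", "Venue": "N/A", "Year": "N/A", "Abstract": "N/A"},
]
missing = missing_field_counts(sample_rows)
print(missing)
```

With the collected data, the same function could be applied to `df.to_dict("records")`; a field that is "N/A" in every row suggests that its selector needs to be revisited.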


## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [2]:
!pip install praw



In [3]:
# write your answer here

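import os

import praw
import pandas as pd

# Initialize the Reddit API client.
# Credentials are read from environment variables (the variable names below are
# this notebook's own choice) so that secrets are not stored in the notebook.
reddit = praw.Reddit(
    client_id=os.environ['REDDIT_CLIENT_ID'],
    client_secret=os.environ['REDDIT_CLIENT_SECRET'],
    user_agent='Siva'
)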
# Function to collect data from Reddit
def collect_reddit_data(subreddit_name, keyword, limit=100):
    subreddit = reddit.subreddit(subreddit_name)
    posts = subreddit.search(keyword, limit=limit)

    data = []
    for post in posts:
        data.append({
            'title': post.title,
            'score': post.score,
            'id': post.id,
            'url': post.url,
            'created': post.created_utc,
            'num_comments': post.num_comments
        })

    return pd.DataFrame(data)

# Collect data
df_reddit = collect_reddit_data('python', 'data science', 10)
print(df_reddit)

# Save to a CSV file
df_reddit.to_csv('reddit_data.csv', index=False)

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



                                               title  score       id  \
0  I teach Python courses - here's my collection ...   2975   jii8ex   
1  Six months into Python and Data science, my fi...   1723   g5ymoy   
2  Modern alternatives to Data Science Libraries ...    208  196jbms   
3  If you're a beginner interested in data scienc...    827   xyyj9t   
4  If you're a beginner interested in data scienc...    696  12j68f7   
5  1 year ago I started building Practice Probs -...    788   zzv4zt   
6  Python Data Science December [Completed] - 24 ...    518   zu7vqp   
7  78 Python data science practice problems in a ...    776   u77fce   
8  Build a Data Science SaaS App with Just Python...    103  173qcwe   
9  What features of the Python language predestin...     73   zu8azk   

                                                 url       created  \
0      https://marko-knoebl.github.io/slides/#python  1.603731e+09   
1                    https://v.redd.it/mlqov8dbicu41  1.587551e+09 
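
The `created` column holds `created_utc` values, which are Unix timestamps in seconds and are hard to read in the form shown above. The following is a minimal sketch of converting such a value to an ISO 8601 string in UTC with the standard library; the helper name `utc_timestamp_to_iso` is hypothetical.

```python
from datetime import datetime, timezone

def utc_timestamp_to_iso(created_utc):
    """Convert a Unix timestamp in seconds to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).isoformat()

epoch_start = utc_timestamp_to_iso(0)
one_day_later = utc_timestamp_to_iso(86400.0)
print(epoch_start)    # 1970-01-01T00:00:00+00:00
print(one_day_later)  # 1970-01-02T00:00:00+00:00
```

For a whole DataFrame column, `pd.to_datetime(df_reddit['created'], unit='s', utc=True)` should give the equivalent conversion in one call.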

## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

In [None]:
# write your answer here


# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [None]:
'''
Write your response here.
Learning Experience: It has been overwhelming for me, as the web scraping exercise required me to work through many packages and libraries.

Challenges Encountered: I encountered many errors and reached request limits while working with APIs.

Relevance to Your Field of Study: In my view, web scraping is an important skill when learning AI and ML, and also for jobs, since many employers want to scrape data from websites.
Overall, I want to understand more about web scraping and do some interesting projects.
'''