<a href="https://colab.research.google.com/github/shreyamadarapu/INFO_5731/blob/main/Madarapu_Shreya_Exercise_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission. , and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind, what kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

**Research Question:** What are some ways to incorporate sentiment analysis into a cybersecurity system to improve the accuracy of identifying false information on social media platforms?

1. Gathering Data:
Content on Social Media: Gather a wide variety of postings on social media from different sites in order to examine how users feel about news items.
News Reports: In order to train the model, compile a set of data of news stories that have been classified as authentic or fraudulent.
Get information on likes, shares, comments, and retweets in order to gauge how social media users are affected by news items.
Assign either positive or negative sentiment labels to the gathered news items and social media messages .
2. Amount of data:
Collect arount 10000 social media posts from diverse platforms and 5000 news reports(both real and fake) to train the model.

Steps for collecting and saving data:
a) Gather Data: Connect posts using relevant hashtags and keywords by using social media APIs.
For news items, scrape news websites or use pre-existing datasets like Kaggle.
b) Data Preprocessing: Filter through the gathered information to eliminate spam, redundant posts, and unnecessary content.
For sentiment analysis, tokenize the text data and eliminate any stop words.
Determine the veracity of the news reports by utilizing pre-existing datasets and fact-checking resources.
c) Data Storage: To make analysis and access easier, store the gathered data in an organized format like CSV or JSON.
To guarantee data integrity and confidentiality, save the cleaned and preprocessed data in a secure database.
To avoid data loss and preserve consistency for upcoming research and examination, put data backup mechanisms into place.

## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [1]:
import pandas as pd

# Load the dataset
df = pd.read_csv('fake_or_real_news.csv')

# Filter out the rows with label 'FAKE'
fake_news = df[df['label'] == 'FAKE']

# Filter out the rows with label 'REAL'
real_news = df[df['label'] == 'REAL']

# Sample 500 fake news and 500 real news
fake_sample = fake_news.sample(n=500, random_state=42)
real_sample = real_news.sample(n=500, random_state=42)

# Concatenate the samples
dataset = pd.concat([fake_sample, real_sample], ignore_index=True)

# Save the dataset to a new CSV file
dataset.to_csv('sentiment_analysis_cybersecurity_dataset.csv', index=False)

print("Dataset collected and saved successfully.")

Dataset collected and saved successfully.


## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference being published

(3) Year

(4) Authors

(5) Abstract

In [None]:
import requests
from bs4 import BeautifulSoup
import time

def scrape_google_scholar(keyword, num_articles=1000):
    base_url = "https://scholar.google.com/scholar"
    params = {
        "q": keyword,
        "hl": "en",
        "as_sdt": "0,5",
        "as_vis": "1",
        "as_ylo": "2013",
        "as_yhi": "2023",
    }
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
    }

    articles = []
    while len(articles) < num_articles:
        try:
            response = requests.get(base_url, params=params, headers=headers)
            if response.status_code == 200:
                soup = BeautifulSoup(response.text, "html.parser")
                results = soup.find_all("div", class_="gs_r gs_or gs_scl")

                for result in results:
                    article = {}
                    title = result.find("h3", class_="gs_rt")
                    if title:
                        article["title"] = title.text

                    venue_year = result.find("div", class_="gs_a")
                    if venue_year:
                        venue_year = venue_year.text.split("-")
                        if len(venue_year) >= 2:
                            article["venue"] = venue_year[0].strip()
                            article["year"] = venue_year[-1].strip()

                    authors = result.find("div", class_="gs_a").find("a")
                    if authors:
                        article["authors"] = authors.text

                    abstract = result.find("div", class_="gs_rs")
                    if abstract:
                        article["abstract"] = abstract.text.strip()

                    articles.append(article)
                    if len(articles) >= num_articles:
                        break

                next_page = soup.find("button", class_="gs_btnPR gs_in_ib gs_btn_lrge gs_btn_half gs_btn_lsu")
                if not next_page:
                    break

                next_page_url = base_url + next_page.get("onclick").split("=")[-1].replace("'", "")
                response = requests.get(next_page_url, headers=headers)
                params = {}

            else:
                print("Failed to retrieve page:", response.status_code)
        except Exception as e:
            print("Error occurred:", e)

        time.sleep(5)

    return articles

keyword = "information retrieval"
num_articles = 1000
articles = scrape_google_scholar(keyword, num_articles)
print(f"Number of articles collected: {len(articles)}")

for i, article in enumerate(articles, start=1):
    print(f"\nArticle {i}:")
    print("Title:", article.get("title", "N/A"))
    print("Venue:", article.get("venue", "N/A"))
    print("Year:", article.get("year", "N/A"))
    print("Authors:", article.get("authors", "N/A"))
    print("Abstract:", article.get("abstract", "N/A"))


Number of articles collected: 10

Article 1:
Title: [BOOK][B] Information retrieval: Implementing and evaluating search engines
Venue: S Buttcher, CLA Clarke, GV Cormack
Year: books.google.com
Authors: CLA Clarke
Abstract: … Information retrieval forms the foundation for modern search engines. In this textbook we 
provide an introduction to information retrieval targeted at graduate students and working …

Article 2:
Title: Information retrieval as statistical translation
Venue: A Berger, J Lafferty
Year: dl.acm.org
Authors: A Berger
Abstract: … There is a large literature on probabilistic approaches to information retrieval, and we will 
not attempt to survey it here. Instead, we focus on the language modeling approach introduced …

Article 3:
Title: A language modeling approach to information retrieval
Venue: JM Ponte, WB Croft
Year: dl.acm.org
Authors: WB Croft
Abstract: … models, we have developed an approach to retrieval based on probabilistic language … in 
information retrieval 

## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, Twitter (formerly known as X), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [None]:
import requests
from bs4 import BeautifulSoup

# Define the URL of the Wikipedia page you want to scrape
url = 'https://en.wikipedia.org/wiki/Python_(programming_language)'  # Replace with the Wikipedia URL of your choice

# Fetch the HTML content
response = requests.get(url)
html = response.text

# Parse the HTML
soup = BeautifulSoup(html, 'html.parser')

# Find and extract headings and content
headings = soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
content = soup.find_all('p')

# Initialize lists to store headings and content text
heading_texts = []
content_texts = []

# Extract text from headings and content
for heading in headings:
    heading_texts.append(heading.get_text())

for paragraph in content:
    content_texts.append(paragraph.get_text())

# Print or save the collected data
for heading_text in heading_texts:
    print("Heading:", heading_text)

for paragraph_text in content_texts:
    print("Content:", paragraph_text)

Heading: Contents
Heading: Python (programming language)
Heading: History[edit]
Heading: Design philosophy and features[edit]
Heading: Syntax and semantics[edit]
Heading: Indentation[edit]
Heading: Statements and control flow[edit]
Heading: Expressions[edit]
Heading: Methods[edit]
Heading: Typing[edit]
Heading: Arithmetic operations[edit]
Heading: Programming examples[edit]
Heading: Libraries[edit]
Heading: Development environments[edit]
Heading: Implementations[edit]
Heading: Reference implementation[edit]
Heading: Other implementations[edit]
Heading: Unsupported implementations[edit]
Heading: Cross-compilers to other languages[edit]
Heading: Performance[edit]
Heading: Development[edit]
Heading: API documentation generators[edit]
Heading: Naming[edit]
Heading: Popularity[edit]
Heading: Uses[edit]
Heading: Languages influenced by Python[edit]
Heading: See also[edit]
Heading: References[edit]
Heading: Sources[edit]
Heading: Further reading[edit]
Heading: External links[edit]
Content: 



## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please only choose one option for question 4. If you do both options, we will grade only the first one

In [None]:
# write your answer here


# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [None]:
'''
I felt it wasnt easy to work on web scraping exercise. The comprehension of HTML structure and its navigation through the use of tools such as BeautifulSoup in Python is one of the fundamental ideas that I found most helpful. Furthermore, being familiar with HTTP requests and response codes made it easier to handle the numerous situations that came up when web scraping.
There were a few significant problems including a faulty gateway that I ran into when gathering data from specific websites. It was also necessary to devise ways to get around or lessen the impact of anti-scraping methods that certain websites were using, such as rate limiting or IP banning for ques-3.
In a lot of areas, including mine, having the ability to collect and evaluate data from internet sources is quite important. This data makes it easier to construct and test machine learning models since it can incorporate text from a variety of sources, such as academic papers, news stories, and social media. Additionally, online scraping makes it possible to gather data in real-time, which is useful for sentiment analysis, trend tracking, and keeping tabs on public opinion on particular subjects related to my research.
'''