<a href="https://colab.research.google.com/github/sthapa5496-ops/Samraggi/blob/main/Thapa_Samraggi_Assignment_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Monday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (25 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]


(3) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(4) Collect all the information of the 904 narrators in the Densho Digital Repository.

(5) **Collect a total of 10000 reviews** of the top 100 most popular software from G2 and Capterra.


In [None]:
import requests
import csv
import time

# (Step 1) Define query and output file
search_term = "machine learning"   # you can change to "data science", "AI", etc.
records_needed = 1000              # papers collected per batch file
batch_limit = 100                  # papers fetched per request
file_name = "batch9.csv"

# (Step 2) API endpoint and fields to retrieve
api_url = "https://api.semanticscholar.org/graph/v1/paper/search"
fields_required = "title,abstract,year,authors"

# (Step 3) Function to fetch papers
def fetch_papers(query, total, per_request):
    results = []
    offset = 0

    while len(results) < total:
        params = {
            "query": query,
            "offset": offset,
            "limit": per_request,
            "fields": fields_required
        }
        response = requests.get(api_url, params=params, timeout=30)

        if response.status_code != 200:
            # Typically 429 (rate limited): wait, then retry the same offset
            print(f" Error {response.status_code}, retrying...")
            time.sleep(2)
            continue

        data = response.json().get("data", [])
        if not data:
            break

        results.extend(data)
        offset += per_request
        time.sleep(0.2)  # respect API limits

    return results[:total]

# (Step 4) Save to CSV
def save_csv(records, filename):
    with open(filename, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Paper_Title", "Abstract", "Year", "Authors"])
        for paper in records:
            # The API can return null for any field, so fall back to empty values
            title = paper.get("title") or ""
            abstract = paper.get("abstract") or ""
            year = paper.get("year") or ""
            authors = ", ".join(a.get("name") or "" for a in (paper.get("authors") or []))
            writer.writerow([title, abstract, year, authors])

# (Step 5) Run test
if __name__ == "__main__":
    papers = fetch_papers(search_term, records_needed, batch_limit)
    save_csv(papers, file_name)
    print(f"✅ Test complete: {len(papers)} papers saved to {file_name}")

import pandas as pd

# list all your batch files
batch_files = [f"batch{i}.csv" for i in range(1, 11)]

# read and merge
merged = pd.concat([pd.read_csv(f) for f in batch_files], ignore_index=True)

# save as one big CSV
merged.to_csv("papers_10000.csv", index=False)

print("✅ Combined into papers_10000.csv with", len(merged), "rows")
from google.colab import files
files.download("papers_10000.csv")



 Error 429, retrying...
 Error 429, retrying...
 Error 429, retrying...
 Error 429, retrying...
 Error 429, retrying...
✅ Test complete: 1000 papers saved to batch9.csv
✅ Combined into papers_10000.csv with 10000 rows
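
The repeated `Error 429` lines above are rate-limit responses. The cell waits a fixed 2 seconds before each retry; another common option is to let the wait grow after every failed attempt. The block below is a minimal sketch of that idea, not what the cell above does. `backoff_delays` and `call_with_retries` are hypothetical helper names, and the base, factor, and cap values are arbitrary assumptions.

```python
import time


def backoff_delays(base=1.0, factor=2.0, cap=30.0, attempts=5):
    """Return the wait (in seconds) before each retry, growing up to a cap."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


def call_with_retries(func, delays, sleep=time.sleep):
    """Call func() until it returns a non-None value or the delays run out.

    func is any zero-argument callable that returns None on a retryable
    failure (for example, after an HTTP 429 response).
    """
    result = func()
    for delay in delays:
        if result is not None:
            break
        sleep(delay)
        result = func()
    return result


if __name__ == "__main__":
    print(backoff_delays())  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

Because the list of delays is finite, this version gives up after a bounded number of attempts instead of looping until the server recovers.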



# Question 2 (15 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.
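
As a rough illustration, steps (1) to (4) can be chained in one function using only the standard library. This is a sketch under assumptions: `basic_clean` is a hypothetical helper, `SAMPLE_STOPWORDS` is a tiny stand-in for a real stopword list such as NLTK's, the sample sentence is made up, and stemming and lemmatization are left out because they need NLTK. Removed characters are replaced with a space here, so hyphenated words split into two tokens.

```python
import re

# Tiny stand-in for a real stopword list (assumption: NLTK's list in practice)
SAMPLE_STOPWORDS = {"a", "an", "the", "is", "of", "and", "we", "in", "for"}


def basic_clean(text):
    """Apply steps (1)-(4): strip punctuation, strip digits, lowercase, drop stopwords."""
    if not isinstance(text, str):      # missing abstracts arrive as NaN (a float)
        return ""
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # (1) special characters and punctuation
    text = re.sub(r"\d+", " ", text)              # (2) numbers
    text = text.lower()                           # (4) lowercase before the stopword check
    tokens = [t for t in text.split() if t not in SAMPLE_STOPWORDS]  # (3) stopwords
    return " ".join(tokens)


if __name__ == "__main__":
    print(basic_clean("We present Fashion-MNIST, a new dataset of 70,000 images."))
    # present fashion mnist new dataset images
```

Lowercasing before the stopword check means capitalized stopwords such as "We" or "The" are caught without a separate case-insensitive comparison.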

In [None]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download required NLTK data (run once)
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

#  Load dataset
df = pd.read_csv("papers_10000.csv")

# Let's assume the column to clean is "Abstract"
text_col = "Abstract"

# 1. Remove noise (special characters, punctuation); digits are kept here so step 2 has something to remove
df["clean_noise"] = df[text_col].astype(str).apply(lambda x: re.sub(r"[^a-zA-Z0-9\s]", "", x))
print("✅ Removed noise")
print(df[["Abstract", "clean_noise"]].head())

# 2. Remove numbers
df["clean_no_numbers"] = df["clean_noise"].apply(lambda x: re.sub(r"\d+", "", x))
print("✅ Removed numbers")
print(df[["clean_noise", "clean_no_numbers"]].head())

# 3. Remove stopwords
stop_words = set(stopwords.words("english"))
df["clean_no_stopwords"] = df["clean_no_numbers"].apply(
    lambda x: " ".join([word for word in x.split() if word.lower() not in stop_words])
)
print("✅ Removed stopwords")
print(df[["clean_no_numbers", "clean_no_stopwords"]].head())

# 4. Lowercase all text
df["clean_lower"] = df["clean_no_stopwords"].str.lower()
print("✅ Lowercased text")
print(df[["clean_no_stopwords", "clean_lower"]].head())

# 5. Stemming
stemmer = PorterStemmer()
df["clean_stemmed"] = df["clean_lower"].apply(
    lambda x: " ".join([stemmer.stem(word) for word in x.split()])
)
print("✅ Applied stemming")
print(df[["clean_lower", "clean_stemmed"]].head())

# 6. Lemmatization
lemmatizer = WordNetLemmatizer()
df["clean_lemmatized"] = df["clean_lower"].apply(
    lambda x: " ".join([lemmatizer.lemmatize(word) for word in x.split()])
)
print("✅ Applied lemmatization")
print(df[["clean_lower", "clean_lemmatized"]].head())

# (Final Step) Save new CSV with cleaned columns
df.to_csv("papers_10000_cleaned.csv", index=False)
print("\n🎉 Cleaning complete! File saved as papers_10000_cleaned.csv")
from google.colab import files
files.download("papers_10000_cleaned.csv")


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


✅ Removed noise
                                            Abstract  \
0  We present Fashion-MNIST, a new dataset compri...   
1                                                NaN   
2  TensorFlow is an interface for expressing mach...   
3  With the widespread use of artificial intellig...   
4                                                NaN   

                                         clean_noise  
0  We present FashionMNIST a new dataset comprisi...  
1                                                nan  
2  TensorFlow is an interface for expressing mach...  
3  With the widespread use of artificial intellig...  
4                                                nan  
✅ Removed numbers
                                         clean_noise  \
0  We present FashionMNIST a new dataset comprisi...   
1                                                nan   
2  TensorFlow is an interface for expressing mach...   
3  With the widespread use of artificial intellig...   
4                  


# Question 3 (15 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity type.
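
To make the difference between the two tree types in part (2) concrete, the sketch below encodes both analyses of one short sentence as plain Python data. Everything here is an assumption for illustration: the sentence is made up, both analyses are written by hand rather than produced by a parser, and `leaves`, `phrases`, and `dependents` are hypothetical helpers. The dependency labels follow the style spaCy prints (`nsubj`, `dobj`, `ROOT`).

```python
# Hand-written analyses of a made-up sentence (not parser output)
SENTENCE = ["The", "model", "learns", "patterns"]

# Constituency tree: (label, child, child, ...) with words as leaves
CONSTITUENCY = ("S",
                ("NP", ("DT", "The"), ("NN", "model")),
                ("VP", ("VBZ", "learns"), ("NP", ("NNS", "patterns"))))

# Dependency arcs: (dependent, relation, head); the root points at itself
DEPENDENCY = [("The", "det", "model"),
              ("model", "nsubj", "learns"),
              ("learns", "ROOT", "learns"),
              ("patterns", "dobj", "learns")]


def leaves(tree):
    """Return the words under a constituency node, left to right."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words


def phrases(tree):
    """List (label, text) for every node that spans a phrase, skipping POS tags."""
    if isinstance(tree, str):
        return []
    label, children = tree[0], tree[1:]
    found = []
    if not (len(children) == 1 and isinstance(children[0], str)):
        found.append((label, " ".join(leaves(tree))))
    for child in children:
        found.extend(phrases(child))
    return found


def dependents(arcs, head):
    """Return the words attached directly to a head word."""
    return [dep for dep, rel, h in arcs if h == head and rel != "ROOT"]


if __name__ == "__main__":
    for label, text in phrases(CONSTITUENCY):
        print(f"{label}: {text}")
    print("Dependents of 'learns':", dependents(DEPENDENCY, "learns"))
```

The constituency tree groups words into nested phrases (the sentence splits into a noun phrase and a verb phrase), while the dependency arcs link each word directly to the word it modifies, with the main verb as the root.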

In [None]:

!pip install spacy
!python -m spacy download en_core_web_sm
import pandas as pd
import spacy
from collections import Counter


nlp = spacy.load("en_core_web_sm")
df = pd.read_csv("papers_10000_cleaned.csv")

text_col = "clean_lemmatized"
texts = df[text_col].dropna().astype(str).tolist()

sample_texts = texts[:5]

# (1) Parts of Speech (POS) Tagging

pos_counts = Counter()

for doc in nlp.pipe(sample_texts):
    for token in doc:
        if token.pos_ in ["NOUN", "VERB", "ADJ", "ADV"]:
            pos_counts[token.pos_] += 1

print("✅ POS Tagging Counts:")
print(f"Nouns: {pos_counts['NOUN']}")
print(f"Verbs: {pos_counts['VERB']}")
print(f"Adjectives: {pos_counts['ADJ']}")
print(f"Adverbs: {pos_counts['ADV']}\n")


# (2) Constituency & Dependency Parsing


example_sentence = sample_texts[0]  # first cleaned abstract (stopwords already removed, so not a full sentence)
doc = nlp(example_sentence)

print("✅ Dependency Parsing (word → head relation):")
for token in doc:
    print(f"{token.text} --> {token.dep_} --> {token.head.text}")


print("\nExample Sentence:", example_sentence)
print("Explanation:")
print("- Dependency tree shows how each word relates grammatically to others.")
print("- Constituency tree (not shown here) breaks the sentence into nested phrases like NP, VP, etc.")


# (3) Named Entity Recognition (NER)

entity_counts = Counter()

for doc in nlp.pipe(sample_texts):
    for ent in doc.ents:
        entity_counts[ent.label_] += 1

print("\n✅ Named Entity Recognition Counts:")
for ent, count in entity_counts.items():
    print(f"{ent}: {count}")

entities_table = []
for doc in nlp.pipe(sample_texts):
    for ent in doc.ents:
        entities_table.append([ent.text, ent.label_])

entities_df = pd.DataFrame(entities_table, columns=["Entity", "Label"])
entities_df.to_csv("entities_sample.csv", index=False)
print("\n🎉 Entities saved to entities_sample.csv for review.")



Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
✅ POS Tagging Counts:
Nouns: 235
Verbs: 88
Adjectives: 59
Adverbs: 11

✅ Dependency Parsing (word → head relation):
present --> amod --> comprising
fashionmnist --> amod --> comprising
new --> amod --> comprising
dataset --> npadvmod --> comprising
comprising --> nsubj --> set
x --> punct --> comprising
grayscale --> compound --> ima

# **The following questions must be answered using AI assistance**

# Question 4 (20 points)

Q4. (PART-1)
Scrape data from the GitHub Marketplace to gather details about popular actions. Using Python, the process begins by sending HTTP requests to multiple pages of the marketplace (1000 products), handling pagination through the page number in the URL. The key details extracted include the product name, a short description, and the URL.

The extracted data is stored in a structured CSV format with columns for product name, description, URL, and page number. A time delay is introduced between requests to avoid server overload. ChatGPT can assist with parsing the HTML, handling errors, and generating reports based on the data collected.

The goal is to complete the scraping within a specified time limit, ensuring that the process is efficient and adheres to GitHub’s usage guidelines.
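
The extraction step (name, description, URL) can be sketched with the standard library's `html.parser`. This is only an illustration under assumptions: the HTML fragment is invented and does not reflect GitHub's actual markup, `CardParser` is a hypothetical class, and a library such as BeautifulSoup would normally do this job.

```python
from html.parser import HTMLParser


class CardParser(HTMLParser):
    """Collect the first <a href>, <h3> text and <p> text from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.name = self.description = self.href = None
        self._capture = None   # field that the next text node should fill

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.href is None:
            self.href = dict(attrs).get("href")
        elif tag == "h3" and self.name is None:
            self._capture = "name"
        elif tag == "p" and self.description is None:
            self._capture = "description"

    def handle_data(self, data):
        if self._capture and data.strip():
            setattr(self, self._capture, data.strip())
            self._capture = None

    def handle_endtag(self, tag):
        if tag in ("h3", "p"):
            self._capture = None   # an empty tag must not swallow later text


# Invented fragment for illustration only
SAMPLE_CARD = """
<div class="card">
  <a href="/marketplace/actions/example-action">
    <h3>Example Action</h3>
    <p>Runs an example step.</p>
  </a>
</div>
"""

if __name__ == "__main__":
    parser = CardParser()
    parser.feed(SAMPLE_CARD)
    print(parser.name, "|", parser.description, "|", "https://github.com" + parser.href)
```

Missing fields stay `None`, so the caller can decide how to record them instead of crashing on an absent tag.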

(PART -2)

1.   **Preprocess Data**: Clean the text by tokenizing, removing stopwords, and converting to lowercase.

2. Perform **Data Quality** operations.


Preprocessing:
Preprocessing involves cleaning the text by removing noise such as special characters, HTML tags, and unnecessary whitespace. It also includes tasks like tokenization, stopword removal, and lemmatization to standardize the text for analysis.

Data Quality:
Data quality checks ensure completeness, consistency, and accuracy by verifying that all required columns are filled and formatted correctly. Additionally, it involves identifying and removing duplicates, handling missing values, and ensuring the data reflects the true content accurately.
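
The completeness and duplicate checks described above can be sketched without any third-party library. The block is a minimal illustration under assumptions: `quality_report` is a hypothetical helper, the sample rows are invented, and the column names `Product_Name` and `URL` are borrowed from the scraping output.

```python
def quality_report(rows, required):
    """Count rows with a missing required field and rows that repeat an earlier one."""
    seen = set()
    missing = duplicates = 0
    kept = []
    for row in rows:
        # Completeness: every required column must hold non-blank text
        if any(not str(row.get(col) or "").strip() for col in required):
            missing += 1
            continue
        # Consistency: compare on trimmed, lowercased values
        key = tuple(str(row[col]).strip().lower() for col in required)
        if key in seen:
            duplicates += 1
            continue
        seen.add(key)
        kept.append(row)
    return {"missing": missing, "duplicates": duplicates, "kept": kept}


if __name__ == "__main__":
    # Invented rows for illustration only
    sample = [
        {"Product_Name": "Setup Tool", "URL": "https://example.com/a"},
        {"Product_Name": "setup tool ", "URL": "https://example.com/a"},
        {"Product_Name": "", "URL": "https://example.com/b"},
        {"Product_Name": "Lint Check", "URL": None},
        {"Product_Name": "Lint Check", "URL": "https://example.com/c"},
    ]
    report = quality_report(sample, ["Product_Name", "URL"])
    print(report["missing"], report["duplicates"], len(report["kept"]))  # 2 1 2
```

Reporting the counts, rather than silently dropping rows, leaves a record of how much data each check removed.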


GitHub Marketplace page:
https://github.com/marketplace?type=actions

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

# Parameters
base_url = "https://github.com/marketplace?type=actions&page="
total_pages = 45   # about 20 products per page; use 50 pages for ~1000 products
delay = 2          # seconds to wait between requests to avoid server overload

data = []

for page in range(1, total_pages + 1):
    print(f"Scraping page {page}...")
    url = base_url + str(page)

    try:
        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
        if response.status_code != 200:
            print(f"⚠️ Page {page} failed with status {response.status_code}")
            continue

        soup = BeautifulSoup(response.text, "html.parser")


        products = soup.find_all("div", {"class": "d-flex flex-auto"})

        for product in products:

            name_tag = product.find("h3")
            name = name_tag.get_text(strip=True) if name_tag else "N/A"


            desc_tag = product.find("p")
            description = desc_tag.get_text(strip=True) if desc_tag else "N/A"


            link_tag = product.find("a", href=True)
            # Use a separate name so the page URL in `url` is not overwritten
            product_url = "https://github.com" + link_tag["href"] if link_tag else "N/A"

            data.append([name, description, product_url, page])

        time.sleep(delay)

    except Exception as e:
        print(f" Error on page {page}: {e}")
        continue

# Save to CSV
df = pd.DataFrame(data, columns=["Product_Name", "Description", "URL", "Page_Number"])
df.to_csv("github_marketplace_actions.csv", index=False)
print("✅ Scraping is complete. Data saved to github_marketplace_actions.csv")




Scraping page 1...
Scraping page 2...
Scraping page 3...
Scraping page 4...
Scraping page 5...
Scraping page 6...
Scraping page 7...
Scraping page 8...
Scraping page 9...
Scraping page 10...
Scraping page 11...
Scraping page 12...
Scraping page 13...
Scraping page 14...
Scraping page 15...
Scraping page 16...
Scraping page 17...
Scraping page 18...
Scraping page 19...
Scraping page 20...
Scraping page 21...
Scraping page 22...
Scraping page 23...
Scraping page 24...
Scraping page 25...
Scraping page 26...
Scraping page 27...
Scraping page 28...
Scraping page 29...
Scraping page 30...
Scraping page 31...
Scraping page 32...
Scraping page 33...
Scraping page 34...
Scraping page 35...
Scraping page 36...
Scraping page 37...
Scraping page 38...
Scraping page 39...
Scraping page 40...
Scraping page 41...
Scraping page 42...
Scraping page 43...
Scraping page 44...
Scraping page 45...
✅ Scraping is complete. Data saved to github_marketplace_actions.csv


In [None]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download NLTK resources (first run only)
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")


df = pd.read_csv("github_marketplace_actions.csv")

df["Description"] = df["Description"].fillna("")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(text):

    text = text.lower()
    text = re.sub(r"<.*?>", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = nltk.word_tokenize(text)
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return " ".join(tokens)

df["Clean_Description"] = df["Description"].apply(clean_text)


df = df.drop_duplicates()

df = df[df["Product_Name"].notna() & df["URL"].notna()]

df = df.reset_index(drop=True)

df.to_csv("github_marketplace_actions_cleaned.csv", index=False)

print("✅ Preprocessing & Data Quality checks completed and Saved as github_marketplace_actions_cleaned.csv")


✅ Preprocessing & Data Quality checks completed and Saved as github_marketplace_actions_cleaned.csv


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# Question 5 (20 points)

PART 1:
Collect tweets from Twitter using the Tweepy API, specifically targeting hashtags related to the subtopics machine learning or artificial intelligence.
The extracted data includes the tweet ID, username, and text.

PART 2:
Perform data cleaning procedures.

A final data quality check ensures the completeness and consistency of the dataset. The cleaned data is then saved into a CSV file for further analysis.
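
Tweet text usually carries HTML entities, links, and mentions that get in the way of later analysis. The block below is a minimal sketch of one possible text-cleaning step using only the standard library. `clean_tweet` is a hypothetical helper, the sample string is invented, and which elements to strip is a judgment call rather than a fixed rule.

```python
import html
import re


def clean_tweet(text):
    """Normalise one tweet: decode HTML entities, drop URLs and @mentions, tidy spaces."""
    text = html.unescape(text)                 # "&amp;" -> "&"
    text = re.sub(r"https?://\S+", " ", text)  # links
    text = re.sub(r"@\w+", " ", text)          # user mentions
    text = text.replace("#", "")               # keep the hashtag word, drop the symbol
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    # Invented sample text for illustration only
    print(clean_tweet("@Tesla Blockchain &amp; AI #ML https://t.co/abc123  today"))
    # Blockchain & AI ML today
```

Applied to a DataFrame, this could populate a new column, for example `df["clean_text"] = df["text"].apply(clean_tweet)`, leaving the original text untouched.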


**Note**

1.   Follow the tutorials provided in Canvas to obtain API keys. Use ChatGPT to get the code. Make sure the file is downloaded and saved.
2.   Make sure you divide the GPT code as shown in the tutorials; don't make multiple requests.


In [2]:
#Part 1
import tweepy
import pandas as pd


import os
# Read credentials from environment variables instead of hard-coding them in the notebook.
# The variable names are placeholders; set them in the runtime before running this cell.
api_key = os.environ["TWITTER_API_KEY"]
api_key_secret = os.environ["TWITTER_API_KEY_SECRET"]
bearer_token = os.environ["TWITTER_BEARER_TOKEN"]
access_token = os.environ["TWITTER_ACCESS_TOKEN"]
access_token_secret = os.environ["TWITTER_ACCESS_TOKEN_SECRET"]

client = tweepy.Client(bearer_token=bearer_token,
                       consumer_key=api_key,
                       consumer_secret=api_key_secret,
                       access_token=access_token,
                       access_token_secret=access_token_secret)


query = "(#AI OR #machinelearning OR #ML OR #artificialintelligence) -is:retweet lang:en"


print("🔎 Searching tweets...")

response = client.search_recent_tweets(
    query=query,
    tweet_fields=["id", "text", "author_id"],
    user_fields=["username"],
    expansions=["author_id"],
    max_results=100
)

tweets_data = []

if response.data:
    users = {u.id: u.username for u in response.includes["users"]}
    for tweet in response.data:
        tweets_data.append({
            "tweet_id": tweet.id,
            "username": users.get(tweet.author_id, "N/A"),
            "text": tweet.text
        })
    print(f"✅ Found {len(tweets_data)} tweets")
else:
    print(" No tweets found.")


df = pd.DataFrame(tweets_data)
print(df.head())

🔎 Searching tweets...
✅ Found 100 tweets
              tweet_id      username  \
0  1977551287547027555   KanzaKhan09   
1  1977551268869464297  hemettante14   
2  1977551265212305579    raonsecure   
3  1977551183754432683   tomas_corza   
4  1977551121875963930       ARBSOAI   

                                                text  
0  @Tesla Wild move by FSD 14.1 backing out like ...  
1  Talus Labs is designing intelligent systems th...  
2  📢RaonSecure’s 2025 Blockchain &amp; AI Hackath...  
3  Click the link to learn how to make your conte...  
4  📣 "How to Lose at Poker While Still Claiming I...  


In [3]:

# Part 2

df.drop_duplicates(subset=["tweet_id"], inplace=True)

df.dropna(inplace=True)

df.reset_index(drop=True, inplace=True)

print("✅ Final Dataset Shape:", df.shape)
print("\nColumn Info:")
print(df.info())
print("\nSample Data:")
print(df.head())


df.to_csv("cleaned_tweets.csv", index=False, encoding="utf-8")
print("📁 Data saved to 'cleaned_tweets.csv'")

from google.colab import files

files.download("cleaned_tweets.csv")



✅ Final Dataset Shape: (100, 3)

Column Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   tweet_id  100 non-null    int64 
 1   username  100 non-null    object
 2   text      100 non-null    object
dtypes: int64(1), object(2)
memory usage: 2.5+ KB
None

Sample Data:
              tweet_id      username  \
0  1977551287547027555   KanzaKhan09   
1  1977551268869464297  hemettante14   
2  1977551265212305579    raonsecure   
3  1977551183754432683   tomas_corza   
4  1977551121875963930       ARBSOAI   

                                                text  
0  @Tesla Wild move by FSD 14.1 backing out like ...  
1  Talus Labs is designing intelligent systems th...  
2  📢RaonSecure’s 2025 Blockchain &amp; AI Hackath...  
3  Click the link to learn how to make your conte...  
4  📣 "How to Lose at Poker While Still Claiming I...  
📁 Data saved to 'clea


 Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

ANS- The assignment was interesting and useful because I was able to collect real data, clean it, and then analyze it. Handling the large dataset and the API rate limits was the most difficult part, but I enjoyed seeing the raw text turn into clean, organized data ready for analysis.

# Write your response below
Fill out survey and provide your valuable feedback.

https://docs.google.com/forms/d/e/1FAIpQLSd_ObuA3iNoL7Az_C-2NOfHodfKCfDzHZtGRfIker6WyZqTtA/viewform?usp=dialog