<a href="https://colab.research.google.com/github/raguram-3398/Raguram_INFO5731_Fall2025/blob/Assignments/Poobalan_Raguram_Assignment_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Monday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (25 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a recent movie (2023 or 2024; you can choose any movie) from IMDb. [If one movie doesn't have enough reviews, collect reviews of at least 2 or 3 movies]


(3) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(4) Collect all the information of the 904 narrators in the Densho Digital Repository.

(5) **Collect a total of 10000 reviews** of the top 100 most popular software products from G2 and Capterra.


In [None]:
import requests
import csv
import time
from datetime import datetime

def get_papers(query, total_papers_per_query, year):
    """Fetch up to `total_papers_per_query` papers with non-empty abstracts for one query and year."""
    search_url = "https://api.semanticscholar.org/graph/v1/paper/search"
    papers_data = []
    offset = 0
    limit = 100  # Max limit per request
    api_result_limit = 1000 # The API does not allow offsets >= 1000

    print(f"Starting data collection for query: '{query}' for year {year}")

    while len(papers_data) < total_papers_per_query:
        # Stop if the next request would exceed the API's hard limit for offsets.
        if offset >= api_result_limit:
            print(f"Reached the API's result limit for '{query}' for year {year}.")
            break

        # Parameters for the API request
        params = {
            'query': query,
            'year': year,
            'offset': offset,
            'limit': min(limit, total_papers_per_query - len(papers_data)),
            'fields': 'title,abstract'
        }

        try:
            # A timeout keeps a stalled connection from hanging the whole run
            response = requests.get(search_url, params=params, timeout=30)

            # Handle rate limiting specifically
            if response.status_code == 429:
                print("Rate limit reached. Waiting for 60 seconds before retrying...")
                time.sleep(60)
                continue # Retry the same request

            response.raise_for_status()  # Raise an exception for other bad status codes (4xx or 5xx)

            data = response.json()

            # Check if 'data' key exists and is not empty
            if 'data' not in data or not data['data']:
                print(f"No more results found for '{query}' for year {year}. Stopping.")
                break

            for paper in data['data']:
                # Ensure the abstract is not None before appending
                if paper.get('abstract'):
                    papers_data.append({
                        'title': paper.get('title'),
                        'abstract': paper.get('abstract')
                    })

            print(f"Collected {len(papers_data)}/{total_papers_per_query} papers for '{query}' for year {year}...")

            # Update offset for the next page, ensuring it doesn't exceed the API limit
            if 'next' in data and data['next'] < api_result_limit:
                offset = data['next']
            else:
                # No more pages to fetch or offset limit reached
                break

        except requests.exceptions.RequestException as e:
            print(f"An error occurred: {e}")
            break

        # Respectful delay to avoid overwhelming the API.
        # The API limit is 100 requests per 5 minutes (300 seconds), so >3 seconds per request.
        time.sleep(4)

    return papers_data

def save_to_csv(data, filename="research_papers.csv"):
    """Write a list of {'title', 'abstract'} dicts to a UTF-8 CSV file."""
    if not data:
        print("No data to save.")
        return

    # Define the headers for the CSV file
    headers = ['title', 'abstract']

    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=headers)
        writer.writeheader()
        writer.writerows(data)

    print(f"\nSuccessfully saved {len(data)} papers to {filename}")


if __name__ == "__main__":
    # Define the core search queries
    QUERIES = [
        "machine learning",
        "data science",
        "artificial intelligence",
        "information extraction"
    ]

    # Define the total number of papers to collect
    TOTAL_PAPERS_TARGET = 10000

    all_papers = []
    current_year = datetime.now().year
    start_year = 2010 # A reasonable year to stop searching to prevent an infinite loop

    print(f"Attempting to collect {TOTAL_PAPERS_TARGET} papers using {len(QUERIES)} core queries across different years.")

    # Loop until the target is reached or we run out of recent years
    while len(all_papers) < TOTAL_PAPERS_TARGET and current_year >= start_year:
        for query in QUERIES:
            # If we've already hit the target, stop fetching more
            if len(all_papers) >= TOTAL_PAPERS_TARGET:
                break

            # Calculate how many papers we still need to reach the goal
            papers_needed = TOTAL_PAPERS_TARGET - len(all_papers)

            # We can't fetch more than 1000 per query/year combo, and we don't want to overshoot our target
            papers_to_fetch_this_run = min(papers_needed, 1000)

            new_papers = get_papers(query, papers_to_fetch_this_run, current_year)
            all_papers.extend(new_papers)
            print("-" * 50)

        # Move to the previous year for the next set of queries
        current_year -= 1

    # Check if the target was met and inform the user
    if len(all_papers) < TOTAL_PAPERS_TARGET:
        print(f"\nWarning: Could only collect {len(all_papers)} out of {TOTAL_PAPERS_TARGET} targeted papers.")
        print(f"The search stopped after checking back to year {start_year}.")

    # Save all collected data to a single CSV file
    save_to_csv(all_papers)



Attempting to collect 10000 papers using 4 core queries across different years.
Starting data collection for query: 'machine learning' for year 2025
Rate limit reached. Waiting for 60 seconds before retrying...
Collected 95/1000 papers for 'machine learning' for year 2025...
Rate limit reached. Waiting for 60 seconds before retrying...
Collected 190/1000 papers for 'machine learning' for year 2025...
Rate limit reached. Waiting for 60 seconds before retrying...
Collected 287/1000 papers for 'machine learning' for year 2025...
Rate limit reached. Waiting for 60 seconds before retrying...
Rate limit reached. Waiting for 60 seconds before retrying...
Rate limit reached. Waiting for 60 seconds before retrying...
Rate limit reached. Waiting for 60 seconds before retrying...
Rate limit reached. Waiting for 60 seconds before retrying...
Collected 385/1000 papers for 'machine learning' for year 2025...
Rate limit reached. Waiting for 60 seconds before retrying...
Collected 447/1000 papers for 
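
The log above shows the fixed 60-second wait being triggered many times in a row. One common alternative is exponential backoff with a bounded number of retries. The following is only a minimal sketch, not part of the original notebook: `backoff_delays` and `fetch_with_backoff` are hypothetical helper names, and `fetch` is a stand-in for the `requests.get(...)` call used in `get_papers`.

```python
import time


def backoff_delays(base=2.0, factor=2.0, max_delay=60.0, max_retries=5):
    """Return the wait (in seconds) before each retry, growing geometrically up to max_delay."""
    return [min(base * factor ** attempt, max_delay) for attempt in range(max_retries)]


def fetch_with_backoff(fetch, delays, sleep=time.sleep):
    """Call fetch() until it returns a non-429 response or the delays run out.

    `fetch` is a hypothetical zero-argument callable standing in for the
    notebook's requests.get(...) call; it must return an object with a
    `status_code` attribute. The last response received is returned.
    """
    response = fetch()
    for delay in delays:
        if response.status_code != 429:
            break
        sleep(delay)  # wait longer after each consecutive 429
        response = fetch()
    return response


print(backoff_delays())  # [2.0, 4.0, 8.0, 16.0, 32.0]
```

Bounding the retries means a persistently throttled query gives up instead of looping forever, and the caller can then decide whether to skip it or stop.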

# Question 2 (15 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.
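
As a rough illustration of how steps (1) to (4) fit together, the sketch below chains them in one function using only the standard library. It is an assumption-laden example rather than the notebook's method: `basic_clean` and `DEMO_STOPWORDS` are hypothetical names, and the tiny stopword set merely stands in for a full stopword list such as NLTK's. Steps (5) and (6) need NLTK and spaCy and are not covered here.

```python
import re

# Tiny illustrative stopword set; a real run would use a full list such as NLTK's.
DEMO_STOPWORDS = {"the", "of", "and", "is", "a", "to", "in"}


def basic_clean(text, stop_words=DEMO_STOPWORDS):
    """Apply steps (1)-(4): strip symbols, drop digits, lowercase, remove stopwords."""
    if not isinstance(text, str):
        return ""
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # (1) punctuation and special characters
    text = re.sub(r"\d+", " ", text)              # (2) numbers
    tokens = text.lower().split()                 # (4) lowercase, then split on whitespace
    return " ".join(t for t in tokens if t not in stop_words)  # (3) stopwords


print(basic_clean("The triglyceride–glucose index (TyG) in 2024 is a predictor."))
# triglyceride glucose index tyg predictor
```

Lowercasing before the stopword check lets the comparison use the stopword list as-is, without a separate `.lower()` call per token.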

In [None]:
"""(1) Remove noise, such as special characters and punctuations."""

import pandas as pd
import re

# Input & output paths
input_file = "/content/research_papers.csv"
output_file = "/content/research_step1.csv"

df = pd.read_csv(input_file)

def strip_symbols(text):
    if pd.isna(text):
        return ""
    # Replace every character that is not a letter or whitespace (digits included) with a space
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # Normalize spaces
    return re.sub(r"\s+", " ", text).strip()

df["title_step1"] = df["title"].apply(strip_symbols)
df["abstract_step1"] = df["abstract"].apply(strip_symbols)

df.to_csv(output_file, index=False)
df.head()


Unnamed: 0,title,abstract,title_step1,abstract_step1
0,iMLGAM: Integrated Machine Learning and Geneti...,Abstract To address the substantial variabilit...,iMLGAM Integrated Machine Learning and Genetic...,Abstract To address the substantial variabilit...
1,Convergence of evolving artificial intelligenc...,The confluence of new technologies with artifi...,Convergence of evolving artificial intelligenc...,The confluence of new technologies with artifi...
2,The triglyceride–glucose index and its obesity...,Background Hypertension (HTN) is a global publ...,The triglyceride glucose index and its obesity...,Background Hypertension HTN is a global public...
3,AI and Machine Learning for Sustainable Energy...,This research explores how Machine Learning an...,AI and Machine Learning for Sustainable Energy...,This research explores how Machine Learning an...
4,Speech emotion recognition using machine learning,– Speech Emotion Recognition (SER) system usin...,Speech emotion recognition using machine learning,Speech Emotion Recognition SER system using ma...


In [None]:
"""(2) Remove numbers."""

import pandas as pd
import re

input_file = "/content/research_step1.csv"
output_file = "/content/research_step2.csv"

df = pd.read_csv(input_file)

def drop_numbers(text):
    return re.sub(r"\d+", "", text) if isinstance(text, str) else ""

df["title_step2"] = df["title_step1"].apply(drop_numbers)
df["abstract_step2"] = df["abstract_step1"].apply(drop_numbers)

df.to_csv(output_file, index=False)
df.head()


Unnamed: 0,title,abstract,title_step1,abstract_step1,title_step2,abstract_step2
0,iMLGAM: Integrated Machine Learning and Geneti...,Abstract To address the substantial variabilit...,iMLGAM Integrated Machine Learning and Genetic...,Abstract To address the substantial variabilit...,iMLGAM Integrated Machine Learning and Genetic...,Abstract To address the substantial variabilit...
1,Convergence of evolving artificial intelligenc...,The confluence of new technologies with artifi...,Convergence of evolving artificial intelligenc...,The confluence of new technologies with artifi...,Convergence of evolving artificial intelligenc...,The confluence of new technologies with artifi...
2,The triglyceride–glucose index and its obesity...,Background Hypertension (HTN) is a global publ...,The triglyceride glucose index and its obesity...,Background Hypertension HTN is a global public...,The triglyceride glucose index and its obesity...,Background Hypertension HTN is a global public...
3,AI and Machine Learning for Sustainable Energy...,This research explores how Machine Learning an...,AI and Machine Learning for Sustainable Energy...,This research explores how Machine Learning an...,AI and Machine Learning for Sustainable Energy...,This research explores how Machine Learning an...
4,Speech emotion recognition using machine learning,– Speech Emotion Recognition (SER) system usin...,Speech emotion recognition using machine learning,Speech Emotion Recognition SER system using ma...,Speech emotion recognition using machine learning,Speech Emotion Recognition SER system using ma...


In [None]:
"""(3) Remove stopwords by using the stopwords list."""

import pandas as pd
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

input_file = "/content/research_step2.csv"
output_file = "/content/research_step3.csv"

df = pd.read_csv(input_file)

def filter_stopwords(text):
    if not isinstance(text, str):
        return ""
    words = text.split()
    return " ".join([w for w in words if w.lower() not in stop_words])

df["title_step3"] = df["title_step2"].apply(filter_stopwords)
df["abstract_step3"] = df["abstract_step2"].apply(filter_stopwords)

df.to_csv(output_file, index=False)
df.head()


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,title,abstract,title_step1,abstract_step1,title_step2,abstract_step2,title_step3,abstract_step3
0,iMLGAM: Integrated Machine Learning and Geneti...,Abstract To address the substantial variabilit...,iMLGAM Integrated Machine Learning and Genetic...,Abstract To address the substantial variabilit...,iMLGAM Integrated Machine Learning and Genetic...,Abstract To address the substantial variabilit...,iMLGAM Integrated Machine Learning Genetic Alg...,Abstract address substantial variability immun...
1,Convergence of evolving artificial intelligenc...,The confluence of new technologies with artifi...,Convergence of evolving artificial intelligenc...,The confluence of new technologies with artifi...,Convergence of evolving artificial intelligenc...,The confluence of new technologies with artifi...,Convergence evolving artificial intelligence m...,confluence new technologies artificial intelli...
2,The triglyceride–glucose index and its obesity...,Background Hypertension (HTN) is a global publ...,The triglyceride glucose index and its obesity...,Background Hypertension HTN is a global public...,The triglyceride glucose index and its obesity...,Background Hypertension HTN is a global public...,triglyceride glucose index obesity related der...,Background Hypertension HTN global public heal...
3,AI and Machine Learning for Sustainable Energy...,This research explores how Machine Learning an...,AI and Machine Learning for Sustainable Energy...,This research explores how Machine Learning an...,AI and Machine Learning for Sustainable Energy...,This research explores how Machine Learning an...,AI Machine Learning Sustainable Energy Predict...,research explores Machine Learning AI used enh...
4,Speech emotion recognition using machine learning,– Speech Emotion Recognition (SER) system usin...,Speech emotion recognition using machine learning,Speech Emotion Recognition SER system using ma...,Speech emotion recognition using machine learning,Speech Emotion Recognition SER system using ma...,Speech emotion recognition using machine learning,Speech Emotion Recognition SER system using ma...


In [None]:
"""(4) Lowercase all texts"""

import pandas as pd

input_file = "/content/research_step3.csv"
output_file = "/content/research_step4.csv"

df = pd.read_csv(input_file)

df["title_step4"] = df["title_step3"].str.lower()
df["abstract_step4"] = df["abstract_step3"].str.lower()

df.to_csv(output_file, index=False)
df.head()


Unnamed: 0,title,abstract,title_step1,abstract_step1,title_step2,abstract_step2,title_step3,abstract_step3,title_step4,abstract_step4
0,iMLGAM: Integrated Machine Learning and Geneti...,Abstract To address the substantial variabilit...,iMLGAM Integrated Machine Learning and Genetic...,Abstract To address the substantial variabilit...,iMLGAM Integrated Machine Learning and Genetic...,Abstract To address the substantial variabilit...,iMLGAM Integrated Machine Learning Genetic Alg...,Abstract address substantial variability immun...,imlgam integrated machine learning genetic alg...,abstract address substantial variability immun...
1,Convergence of evolving artificial intelligenc...,The confluence of new technologies with artifi...,Convergence of evolving artificial intelligenc...,The confluence of new technologies with artifi...,Convergence of evolving artificial intelligenc...,The confluence of new technologies with artifi...,Convergence evolving artificial intelligence m...,confluence new technologies artificial intelli...,convergence evolving artificial intelligence m...,confluence new technologies artificial intelli...
2,The triglyceride–glucose index and its obesity...,Background Hypertension (HTN) is a global publ...,The triglyceride glucose index and its obesity...,Background Hypertension HTN is a global public...,The triglyceride glucose index and its obesity...,Background Hypertension HTN is a global public...,triglyceride glucose index obesity related der...,Background Hypertension HTN global public heal...,triglyceride glucose index obesity related der...,background hypertension htn global public heal...
3,AI and Machine Learning for Sustainable Energy...,This research explores how Machine Learning an...,AI and Machine Learning for Sustainable Energy...,This research explores how Machine Learning an...,AI and Machine Learning for Sustainable Energy...,This research explores how Machine Learning an...,AI Machine Learning Sustainable Energy Predict...,research explores Machine Learning AI used enh...,ai machine learning sustainable energy predict...,research explores machine learning ai used enh...
4,Speech emotion recognition using machine learning,– Speech Emotion Recognition (SER) system usin...,Speech emotion recognition using machine learning,Speech Emotion Recognition SER system using ma...,Speech emotion recognition using machine learning,Speech Emotion Recognition SER system using ma...,Speech emotion recognition using machine learning,Speech Emotion Recognition SER system using ma...,speech emotion recognition using machine learning,speech emotion recognition ser system using ma...


In [None]:
"""(5) Stemming."""

import pandas as pd
import nltk
from nltk.stem import PorterStemmer

nltk.download("punkt")

input_file = "/content/research_step4.csv"
output_file = "/content/research_step5.csv"

df = pd.read_csv(input_file)

stemmer = PorterStemmer()

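def stem_text(text):
    # Treat missing values (NaN) as empty so they are not stemmed as the literal string "nan"
    if not isinstance(text, str):
        return ""
    return " ".join(stemmer.stem(w) for w in text.split())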

df["title_step5"] = df["title_step4"].apply(stem_text)
df["abstract_step5"] = df["abstract_step4"].apply(stem_text)

df.to_csv(output_file, index=False)
df.head()


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Unnamed: 0,title,abstract,title_step1,abstract_step1,title_step2,abstract_step2,title_step3,abstract_step3,title_step4,abstract_step4,title_step5,abstract_step5
0,iMLGAM: Integrated Machine Learning and Geneti...,Abstract To address the substantial variabilit...,iMLGAM Integrated Machine Learning and Genetic...,Abstract To address the substantial variabilit...,iMLGAM Integrated Machine Learning and Genetic...,Abstract To address the substantial variabilit...,iMLGAM Integrated Machine Learning Genetic Alg...,Abstract address substantial variability immun...,imlgam integrated machine learning genetic alg...,abstract address substantial variability immun...,imlgam integr machin learn genet algorithm dri...,abstract address substanti variabl immun check...
1,Convergence of evolving artificial intelligenc...,The confluence of new technologies with artifi...,Convergence of evolving artificial intelligenc...,The confluence of new technologies with artifi...,Convergence of evolving artificial intelligenc...,The confluence of new technologies with artifi...,Convergence evolving artificial intelligence m...,confluence new technologies artificial intelli...,convergence evolving artificial intelligence m...,confluence new technologies artificial intelli...,converg evolv artifici intellig machin learn t...,confluenc new technolog artifici intellig ai m...
2,The triglyceride–glucose index and its obesity...,Background Hypertension (HTN) is a global publ...,The triglyceride glucose index and its obesity...,Background Hypertension HTN is a global public...,The triglyceride glucose index and its obesity...,Background Hypertension HTN is a global public...,triglyceride glucose index obesity related der...,Background Hypertension HTN global public heal...,triglyceride glucose index obesity related der...,background hypertension htn global public heal...,triglycerid glucos index obes relat deriv pred...,background hypertens htn global public health ...
3,AI and Machine Learning for Sustainable Energy...,This research explores how Machine Learning an...,AI and Machine Learning for Sustainable Energy...,This research explores how Machine Learning an...,AI and Machine Learning for Sustainable Energy...,This research explores how Machine Learning an...,AI Machine Learning Sustainable Energy Predict...,research explores Machine Learning AI used enh...,ai machine learning sustainable energy predict...,research explores machine learning ai used enh...,ai machin learn sustain energi predict model o...,research explor machin learn ai use enhanc ene...
4,Speech emotion recognition using machine learning,– Speech Emotion Recognition (SER) system usin...,Speech emotion recognition using machine learning,Speech Emotion Recognition SER system using ma...,Speech emotion recognition using machine learning,Speech Emotion Recognition SER system using ma...,Speech emotion recognition using machine learning,Speech Emotion Recognition SER system using ma...,speech emotion recognition using machine learning,speech emotion recognition ser system using ma...,speech emot recognit use machin learn,speech emot recognit ser system use machin lea...


In [None]:
"""(6) Lemmatization."""

import pandas as pd
import spacy

input_file = "/content/research_step5.csv"
output_file = "/content/research_step6.csv"

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner", "textcat"])

df = pd.read_csv(input_file)

def lemmatize_text(text):
    if not isinstance(text, str) or not text.strip():
        return ""
    doc = nlp(text)
    return " ".join([token.lemma_ for token in doc])

df["cleaned_title"] = df["title_step5"].apply(lemmatize_text)
df["cleaned_abstract"] = df["abstract_step5"].apply(lemmatize_text)

final_file = "/content/research_papers_cleaned.csv"
df.to_csv(final_file, index=False)
print(f"✅ Final file saved: {final_file}")

df[["title", "cleaned_title", "abstract", "cleaned_abstract"]].head()



✅ Final file saved: /content/research_papers_cleaned.csv


Unnamed: 0,title,cleaned_title,abstract,cleaned_abstract
0,iMLGAM: Integrated Machine Learning and Geneti...,imlgam integr machin learn genet algorithm dri...,Abstract To address the substantial variabilit...,abstract address substanti variabl immun check...
1,Convergence of evolving artificial intelligenc...,converg evolv artifici intellig machin learn t...,The confluence of new technologies with artifi...,confluenc new technolog artifici intellig ai m...
2,The triglyceride–glucose index and its obesity...,triglycerid glucos index obe relat deriv predi...,Background Hypertension (HTN) is a global publ...,background hypertens htn global public health ...
3,AI and Machine Learning for Sustainable Energy...,ai machin learn sustain energi predict model o...,This research explores how Machine Learning an...,research explor machin learn ai use enhanc ene...
4,Speech emotion recognition using machine learning,speech emot recognit use machin learn,– Speech Emotion Recognition (SER) system usin...,speech emot recognit ser system use machin lea...


# Question 3 (15 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tokens, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.
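
Part (1) asks for totals over the whole text rather than a single document. A minimal sketch of that aggregation step is shown below; it works on plain `(token, tag)` pairs so it does not depend on any particular tagger. `total_pos_counts` and `CATEGORY_OF` are hypothetical names, and treating `PROPN` as a noun is an assumption of this sketch.

```python
from collections import Counter

# Universal POS tags grouped into the four requested categories.
# Counting PROPN (proper nouns) as nouns is an assumption of this sketch.
CATEGORY_OF = {"NOUN": "N", "PROPN": "N", "VERB": "V", "ADJ": "Adj", "ADV": "Adv"}


def total_pos_counts(tagged_docs):
    """Sum N/V/Adj/Adv counts over an iterable of documents.

    Each document is an iterable of (token_text, pos_tag) pairs, for
    example [(t.text, t.pos_) for t in doc] for a spaCy doc.
    """
    totals = Counter({"N": 0, "V": 0, "Adj": 0, "Adv": 0})
    for doc in tagged_docs:
        for _, tag in doc:
            category = CATEGORY_OF.get(tag)
            if category:
                totals[category] += 1
    return dict(totals)


# Hand-tagged toy input (not real model output), just to show the shape.
docs = [
    [("speech", "NOUN"), ("recognition", "NOUN"), ("works", "VERB"), ("well", "ADV")],
    [("deep", "ADJ"), ("models", "NOUN"), ("learn", "VERB"), (".", "PUNCT")],
]
print(total_pos_counts(docs))  # {'N': 3, 'V': 2, 'Adj': 1, 'Adv': 1}
```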

In [None]:
import spacy
import pandas as pd

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Load cleaned file
file_path = "/content/research_papers_cleaned.csv"
df = pd.read_csv(file_path)

# Combine title and abstract for analysis
df['cleaned_text'] = df['cleaned_title'].fillna('') + ". " + df['cleaned_abstract'].fillna('')

# POS tagging for one document
text = df['cleaned_text'][0]   # analyze first row as example
doc = nlp(text)

# Count POS categories
pos_counts = {"NOUN":0, "VERB":0, "ADJ":0, "ADV":0}
for token in doc:
    if token.pos_ in pos_counts:
        pos_counts[token.pos_] += 1

print("POS tagging for first text:")
for token in doc:
    print(f"{token.text:<15} {token.pos_}")

print("\nPOS counts:", pos_counts)


POS tagging for first text:
imlgam          PROPN
integr          PROPN
machin          PROPN
learn           VERB
genet           PROPN
algorithm       PROPN
drive           VERB
multiom         PROPN
analysi         NOUN
pan             PROPN
cancer          NOUN
immunotherapi   PROPN
respon          PROPN
predict         PROPN
.               PUNCT
abstract        ADJ
address         NOUN
substanti       PROPN
variabl         PROPN
immun           PROPN
checkpoint      PROPN
blockad         PROPN
icb             PROPN
therapi         ADJ
effect          NOUN
develop         VERB
innov           PROPN
r               PROPN
packag          NOUN
call            PROPN
integr          PROPN
machin          PROPN
learn           VERB
genet           PROPN
algorithm       PROPN
drive           VERB
multiom         ADJ
analysi         NOUN
imlgam          NOUN
establish       VERB
comprehen       ADJ
score           NOUN
system          NOUN
predict         VERB
treatment       NOUN
outcom 

In [None]:
import spacy
import benepar

# Load spacy model
nlp = spacy.load("en_core_web_sm")

# Add benepar constituency parser
if spacy.__version__.startswith("3"):
    nlp.add_pipe("benepar", config={"model": "benepar_en3"})

# Example: parse first abstract
doc = nlp(df['cleaned_abstract'][0])
sent = list(doc.sents)[0]

print("Sentence:", sent.text)

# Dependency parse
print("\nDependency Parse:")
for token in sent:
    print(f"{token.text:<15} {token.dep_:<10} {token.head.text:<10} {token.pos_}")

# Constituency parse
print("\nConstituency Parse Tree:")
print(sent._.parse_string)


You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Sentence: abstract address substanti variabl immun checkpoint blockad icb therapi effect develop innov r packag call integr machin learn genet algorithm drive multiom analysi imlgam establish comprehen score system predict treatment outcom advanc multi omic data integr research demonstr imlgam score exhibit superior predict perform across independ cohort low score correl significantli enhanc therapeut respon outperform exist clinic biomark detail analysi reveal tumor low imlgam score display distinct immun microenviron characterist includ increas immun cell infiltr amplifi antitumor immun respons critic cluster regularli interspac short palindrom repeat screen identifi centrosom protein cep key molecul modul tumor immun evas mechanist confirm role regul cell mediat antitumor immun respon find valid imlgam power prognost tool also propos cep promis therapeut target offer novel strategi enhanc icb treatment efficaci imlgam packag freeli avail github http github com yelab imlgam provid re



In [42]:
import pandas as pd
import spacy
from collections import defaultdict

nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger", "lemmatizer", "textcat"])

df = pd.read_csv("/content/research_papers_cleaned.csv", dtype=str).fillna("")

texts = df["cleaned_title"].tolist() + df["cleaned_abstract"].tolist()

label_counts = defaultdict(int)
entities_set = set()

for doc in nlp.pipe(texts, batch_size=100):
    for ent in doc.ents:
        entities_set.add((ent.text, ent.label_))
        label_counts[ent.label_] += 1

print("🔎 Named Entity Counts:")
for label, count in label_counts.items():
    print(f"{label}: {count}")


🔎 Named Entity Counts:
PERSON: 30016
ORG: 18472
GPE: 5524
NORP: 4012
LOC: 469
CARDINAL: 7034
FAC: 647
ORDINAL: 2100
LAW: 29
PRODUCT: 1064
DATE: 1668
TIME: 63
WORK_OF_ART: 44
QUANTITY: 100
LANGUAGE: 110
MONEY: 11
EVENT: 12
PERCENT: 5


# **The following questions must be answered using AI assistance**

# Question 4 (20 points)

Q4. (PART-1)
Scrape data from the GitHub Marketplace to gather details about popular actions. Using Python, the process begins by sending HTTP requests to multiple pages of the marketplace (1000 products), handling pagination through dynamic page numbers. The key details to extract are the product name, a short description, and the URL.

The extracted data is stored in a structured CSV format with columns for product name, description, URL, and page number. A time delay is introduced between requests to avoid overloading the server. ChatGPT can assist with parsing the HTML, handling errors, and generating reports based on the collected data.

The goal is to complete the scraping within a specified time limit, ensuring that the process is efficient and adheres to GitHub's usage guidelines.
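
The pagination step can be planned before any request is sent. The sketch below only builds the list of listing URLs needed to reach a target count; it does not fetch anything. `pages_needed` and `listing_urls` are hypothetical helper names, and the default page size of 20 is an assumption that should be checked against what the site actually returns.

```python
import math

BASE_URL = "https://github.com/marketplace?type=actions&page={page}"


def pages_needed(target_items, items_per_page):
    """Number of listing pages to request to reach target_items."""
    return math.ceil(target_items / items_per_page)


def listing_urls(target_items=1000, items_per_page=20):
    """Build the listing URLs for pages 1..N (items_per_page=20 is an assumed page size)."""
    last_page = pages_needed(target_items, items_per_page)
    return [BASE_URL.format(page=p) for p in range(1, last_page + 1)]


urls = listing_urls()
print(len(urls), urls[0])
# 50 https://github.com/marketplace?type=actions&page=1
```

Because some pages may come back empty or fail, a real run would likely request a few more pages than this lower bound and stop once enough items are collected.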

(PART -2)

1.   **Preprocess Data**: Clean the text by tokenizing, removing stopwords, and converting to lowercase.

2. Perform **Data Quality** operations.


Preprocessing:
Preprocessing involves cleaning the text by removing noise such as special characters, HTML tags, and unnecessary whitespace. It also includes tasks like tokenization, stopword removal, and lemmatization to standardize the text for analysis.

Data Quality:
Data quality checks ensure completeness, consistency, and accuracy by verifying that all required columns are filled and formatted correctly. Additionally, it involves identifying and removing duplicates, handling missing values, and ensuring the data reflects the true content accurately.
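
The checks described above can be sketched with the standard library alone. This is an illustrative example under stated assumptions, not the notebook's implementation: `quality_report` and `drop_duplicate_urls` are hypothetical names, and the column names `Name`, `URL`, and `Description` are assumed to match the scraper's CSV output.

```python
def quality_report(rows, required=("Name", "URL", "Description")):
    """Count missing values per required column and duplicate URLs.

    `rows` is a list of dicts, for example as produced by csv.DictReader.
    """
    missing = {
        col: sum(1 for r in rows if not (r.get(col) or "").strip())
        for col in required
    }
    seen, duplicates = set(), 0
    for r in rows:
        url = (r.get("URL") or "").strip()
        if url and url in seen:
            duplicates += 1
        seen.add(url)
    return {"rows": len(rows), "missing": missing, "duplicate_urls": duplicates}


def drop_duplicate_urls(rows):
    """Keep the first row for each URL, preserving the original order."""
    seen, kept = set(), []
    for r in rows:
        if r["URL"] not in seen:
            seen.add(r["URL"])
            kept.append(r)
    return kept


# Made-up rows, only to show the report's shape.
rows = [
    {"Name": "A", "URL": "u1", "Description": "d"},
    {"Name": "B", "URL": "u2", "Description": ""},
    {"Name": "A", "URL": "u1", "Description": "d"},
]
print(quality_report(rows))
# {'rows': 3, 'missing': {'Name': 0, 'URL': 0, 'Description': 1}, 'duplicate_urls': 1}
```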


GitHub Marketplace page:
https://github.com/marketplace?type=actions

In [41]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time, random
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm

# Global session with headers
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
})

# Retry wrapper
def fetch_url(url, retries=2, timeout=10):
    for attempt in range(retries):
        try:
            resp = session.get(url, timeout=timeout)
            if resp.status_code == 200:
                return resp.text
        except requests.RequestException:
            pass
        # Back off before the next attempt (skip the wait after the last one)
        if attempt < retries - 1:
            time.sleep(1 + random.random())
    return None

# Step 1: Scrape listing pages
def scrape_marketplace_page(page_num):
    url = f"https://github.com/marketplace?type=actions&page={page_num}"
    html = fetch_url(url)
    results = []
    if not html:
        return results

    soup = BeautifulSoup(html, "html.parser")
    cards = soup.find_all("div", attrs={"data-testid": "non-featured-item"})

    for card in cards:
        link_tag = card.find("a", href=True)
        if link_tag:
            results.append({
                "Page": page_num,
                "Name": link_tag.get_text(strip=True),
                "URL": f"https://github.com{link_tag['href']}",
                "Description": ""
            })
    print(f"✅ Page {page_num}: {len(results)} actions found")
    return results

# Step 2: Fetch descriptions concurrently
def fetch_description(item):
    html = fetch_url(item["URL"])
    if not html:
        return item
    soup = BeautifulSoup(html, "html.parser")
    about_div = soup.find("div", attrs={"data-testid": "about"})
    if about_div:
        span = about_div.find("span")
        if span:
            item["Description"] = span.get_text(strip=True)
    return item

def enrich_with_descriptions_parallel(records, workers=10):
    results = []
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(fetch_description, rec) for rec in records]
        for f in tqdm(as_completed(futures), total=len(futures), desc="Fetching Descriptions"):
            results.append(f.result())
    return results

# Step 3: Orchestrator
def main(total_pages=5, outfile="github_actions_data.csv"):
    all_actions = []
    for p in range(1, total_pages + 1):
        all_actions.extend(scrape_marketplace_page(p))
        time.sleep(random.uniform(1, 3))  # politeness delay

    print(f"🔎 Collected {len(all_actions)} actions. Now fetching descriptions...")

    all_actions = enrich_with_descriptions_parallel(all_actions, workers=10)

    df = pd.DataFrame(all_actions)
    df.to_csv(outfile, index=False, encoding="utf-8")
    print(f"✅ Finished! Saved {len(all_actions)} actions to {outfile}")

# Run
if __name__ == "__main__":
    main(total_pages=60, outfile="github_actions_final.csv")


✅ Page 1: 20 actions found
✅ Page 2: 20 actions found
✅ Page 3: 20 actions found
✅ Page 4: 0 actions found
✅ Page 5: 20 actions found
✅ Page 6: 20 actions found
✅ Page 7: 0 actions found
✅ Page 8: 20 actions found
✅ Page 9: 20 actions found
✅ Page 10: 20 actions found
✅ Page 11: 0 actions found
✅ Page 12: 20 actions found
✅ Page 13: 20 actions found
✅ Page 14: 0 actions found
✅ Page 15: 20 actions found
✅ Page 16: 20 actions found
✅ Page 17: 20 actions found
✅ Page 18: 0 actions found
✅ Page 19: 20 actions found
✅ Page 20: 20 actions found
✅ Page 21: 20 actions found
✅ Page 22: 20 actions found
✅ Page 23: 20 actions found
✅ Page 24: 20 actions found
✅ Page 25: 20 actions found
✅ Page 26: 20 actions found
✅ Page 27: 20 actions found
✅ Page 28: 20 actions found
✅ Page 29: 20 actions found
✅ Page 30: 20 actions found
✅ Page 31: 20 actions found
✅ Page 32: 20 actions found
✅ Page 33: 20 actions found
✅ Page 34: 20 actions found
✅ Page 35: 20 actions found
✅ Page 36: 20 actions found
✅ Page

Fetching Descriptions: 100%|██████████| 1080/1080 [02:26<00:00,  7.36it/s]

✅ Finished! Saved 1080 actions to github_actions_final.csv
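
Among the pages shown in the log above, pages 4, 7, 11, 14, and 18 report 0 actions while the rest report 20. One possible explanation (an assumption, not verified here) is that those responses were throttled or served different markup. A minimal sketch of retrying an empty page with exponential backoff follows; `fetch_with_backoff` is a hypothetical helper, and `fetch_page` stands in for a function such as `scrape_marketplace_page`.

```python
import time

def fetch_with_backoff(fetch_page, page_num, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch_page(page_num) until it returns a non-empty list or attempts run out."""
    for attempt in range(max_attempts):
        results = fetch_page(page_num)
        if results:
            return results
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ... between attempts
    return []

# Usage with a fake fetcher that returns nothing twice, then succeeds (illustration only).
calls = {"n": 0}

def flaky_fetch(page_num):
    calls["n"] += 1
    return [] if calls["n"] < 3 else [{"Page": page_num, "Name": "demo"}]

delays = []  # record the requested delays instead of actually sleeping
print(fetch_with_backoff(flaky_fetch, 4, sleep=delays.append))
print(delays)
```

Passing `sleep` as a parameter keeps the example fast to run; in the scraper the default `time.sleep` would apply.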





In [43]:
import pandas as pd
import re
import nltk
import spacy
from nltk.corpus import stopwords

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner", "textcat"])

# Load scraped CSV
input_file = "/content/github_actions_final.csv"
output_file = "/content/github_actions_cleaned.csv"

df = pd.read_csv(input_file)

# ---------------------------
# 🔎 Step 1: Data Quality Checks
# ---------------------------
print("Initial shape:", df.shape)

# Drop duplicates
df = df.drop_duplicates()

# Fill missing values with empty string
df = df.fillna("")

# Ensure required columns exist
required_cols = ["Name", "Description", "URL", "Page"]
for col in required_cols:
    if col not in df.columns:
        df[col] = ""

print("After cleaning shape:", df.shape)

# ---------------------------
# 🔎 Step 2: Text Preprocessing
# ---------------------------
def preprocess_text(text):
    if not isinstance(text, str) or not text.strip():
        return ""

    # Remove HTML tags
    text = re.sub(r"<.*?>", " ", text)

    # Remove special characters (keep alphanumeric and spaces)
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)

    # Normalize whitespace
    text = re.sub(r"\s+", " ", text).strip()

    # Tokenize & lowercase
    tokens = text.lower().split()

    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]

    # Lemmatization
    doc = nlp(" ".join(tokens))
    tokens = [token.lemma_ for token in doc]

    return " ".join(tokens)

# Apply preprocessing to Name & Description
df["Clean_Name"] = df["Name"].apply(preprocess_text)
df["Clean_Description"] = df["Description"].apply(preprocess_text)

# ---------------------------
# 🔎 Step 3: Save Cleaned Data
# ---------------------------
df.to_csv(output_file, index=False, encoding="utf-8")
print(f"✅ Cleaned data saved to {output_file}")
df.head()


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Initial shape: (1080, 4)
After cleaning shape: (1080, 4)
✅ Cleaned data saved to /content/github_actions_cleaned.csv


| | Page | Name | URL | Description | Clean_Name | Clean_Description |
|---|---|---|---|---|---|---|
| 0 | 1 | Metrics embed | https://github.com/marketplace/actions/metrics... | An infographics generator with 40+ plugins and... | metric embe | infographic generator 40 plugin 300 option dis... |
| 1 | 1 | OpenCommit — improve commits with AI 🧙 | https://github.com/marketplace/actions/opencom... | Replaces lame commit messages with meaningful ... | opencommit improve commit ai | replace lame commit message meaningful ai gene... |
| 2 | 1 | yq - portable yaml processor | https://github.com/marketplace/actions/yq-port... | create, read, update, delete, merge, validate ... | yq portable yaml processor | create read update delete merge validate yaml |
| 3 | 1 | generate-snake-game-from-github-contribution-grid | https://github.com/marketplace/actions/generat... | Generates a snake game from a github user cont... | generate snake game github contribution grid | generate snake game github user contributions ... |
| 4 | 1 | TruffleHog OSS | https://github.com/marketplace/actions/truffle... | Find and verify leaked credentials in your sou... | trufflehog oss | find verify leak credential source code |


# Question 5 (20 points)

Part 1:
Collect tweets from Twitter using the Tweepy API, specifically targeting hashtags related to the subtopics machine learning or artificial intelligence.
The extracted data includes the tweet ID, username, and text.

Part 2:
Perform data cleaning procedures.

A final data quality check ensures the completeness and consistency of the dataset. The cleaned data is then saved into a CSV file for further analysis.


**Note**

1.   Follow the tutorials provided in Canvas to obtain API keys. Use ChatGPT to get the code. Make sure the file is downloaded and saved.
2.   Make sure you divide the GPT code as shown in the tutorials; don't make multiple requests.


In [44]:

!pip install tweepy pandas --quiet

import os

import tweepy
import pandas as pd

# -------------------------------
# Step 1: Authentication
# -------------------------------

# Read the bearer token from the environment instead of hard-coding it in the notebook.
# The variable name TWITTER_BEARER_TOKEN is an assumption; set it before running this cell.
BEARER_TOKEN = os.environ["TWITTER_BEARER_TOKEN"]

# Initialize Tweepy client
client = tweepy.Client(bearer_token=BEARER_TOKEN, wait_on_rate_limit=True)

# -------------------------------
# Step 2: Define Search Query
# -------------------------------
# Target hashtags and keywords
search_terms = "(#MachineLearning OR #ArtificialIntelligence OR #AI OR #ML) -is:retweet"

# -------------------------------
# Step 3: Collect Tweets
# -------------------------------
tweet_records = []

# Fetch recent tweets (max 100 per request)
response = client.search_recent_tweets(
    query=search_terms,
    tweet_fields=["id", "created_at", "text", "author_id", "lang"],
    max_results=100
)

# Extract information
if response.data:
    for t in response.data:
        tweet_records.append({
            "tweet_id": t.id,
            "user_id": t.author_id,
            "created_time": t.created_at,
            "tweet_text": t.text
        })

# -------------------------------
# Step 4: Save to CSV
# -------------------------------
df = pd.DataFrame(tweet_records)
output_file = "tweets_ml_ai.csv"
df.to_csv(output_file, index=False, encoding="utf-8")

print(f"✅ {len(df)} tweets saved to {output_file}")
print(df.head())


✅ 100 tweets saved to tweets_ml_ai.csv
              tweet_id              user_id              created_time  \
0  1972874264790360517  1952824190643699712 2025-09-30 04:00:53+00:00   
1  1972874253575090546   746293897551183872 2025-09-30 04:00:50+00:00   
2  1972874247530811789  1861827740590284801 2025-09-30 04:00:49+00:00   
3  1972874196087980230  1915737951402971136 2025-09-30 04:00:37+00:00   
4  1972874189540303040  1508940277217579010 2025-09-30 04:00:35+00:00   

                                          tweet_text  
0  🌴 👙 Holy Coconutz! Beach Vibes and Confidence ...  
1  Life\nAfraid of AI replacing you? Here are job...  
2  #DecentralizedStorm #WeatherApp #AITheories #T...  
3  📊🥤\n💵 $128K this weekend ✅ Stock expert @Mulho...  
4  #KOTHARIPET has shown a downtrend recently, wi...  


In [45]:
import pandas as pd
import re

# Load your dataset
df = pd.read_csv("tweets_ml_ai.csv")

print("🔎 Columns in dataset:", df.columns.tolist())
print("🔎 Original dataset shape:", df.shape)

# Choose correct column name for text
text_col = "text" if "text" in df.columns else "tweet_text"

# Remove duplicates and missing values
df.drop_duplicates(subset=["tweet_id"], inplace=True)
df.dropna(subset=[text_col], inplace=True)

# Cleaning function
def clean_tweet(text):
    text = str(text).lower()
    text = re.sub(r"http\S+|www\S+", "", text)   # remove URLs
    text = re.sub(r"@\w+", "", text)             # remove mentions
    text = re.sub(r"#", "", text)                # remove hashtag symbol
    text = re.sub(r"[^a-z0-9\s]", " ", text)     # keep only alphanumeric + space
    text = re.sub(r"\s+", " ", text).strip()     # remove extra spaces
    return text

# Apply cleaning
df["clean_text"] = df[text_col].apply(clean_tweet)

print("✅ After cleaning shape:", df.shape)
print(df[[text_col, "clean_text"]].head())

# Save cleaned CSV
df.to_csv("tweets_ml_ai_cleaned.csv", index=False, encoding="utf-8")
print("💾 Cleaned tweets saved to tweets_ml_ai_cleaned.csv")


🔎 Columns in dataset: ['tweet_id', 'user_id', 'created_time', 'tweet_text']
🔎 Original dataset shape: (100, 4)
✅ After cleaning shape: (100, 5)
                                          tweet_text  \
0  🌴 👙 Holy Coconutz! Beach Vibes and Confidence ...   
1  Life\nAfraid of AI replacing you? Here are job...   
2  #DecentralizedStorm #WeatherApp #AITheories #T...   
3  📊🥤\n💵 $128K this weekend ✅ Stock expert @Mulho...   
4  #KOTHARIPET has shown a downtrend recently, wi...   

                                          clean_text  
0  holy coconutz beach vibes and confidence for a...  
1  life afraid of ai replacing you here are jobs ...  
2  decentralizedstorm weatherapp aitheories techp...  
3  128k this weekend stock expert knows what he s...  
4  kotharipet has shown a downtrend recently with...  
💾 Cleaned tweets saved to tweets_ml_ai_cleaned.csv
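
The question also asks for a final data quality check. Because `clean_tweet` strips URLs, mentions, and non-alphanumeric characters, a tweet made up only of those can end up with an empty `clean_text`. A minimal sketch of such a check is shown below; `check_tweets` is a hypothetical helper and the sample records are made up, with field names assumed to match the cleaned CSV.

```python
def check_tweets(records):
    """Return a list of human-readable problems found in cleaned tweet records."""
    problems = []
    seen_ids = set()
    for i, rec in enumerate(records):
        tweet_id = rec.get("tweet_id")
        if tweet_id in seen_ids:
            problems.append(f"row {i}: duplicate tweet_id {tweet_id}")
        seen_ids.add(tweet_id)
        if not str(rec.get("clean_text") or "").strip():
            problems.append(f"row {i}: empty clean_text")
    return problems

# Hypothetical records for illustration; real rows would come from the cleaned CSV.
records = [
    {"tweet_id": 1, "clean_text": "machine learning demo"},
    {"tweet_id": 1, "clean_text": "machine learning demo"},
    {"tweet_id": 2, "clean_text": ""},
]
for problem in check_tweets(records):
    print(problem)
```

An empty result list would indicate that every record has a unique ID and non-empty cleaned text.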


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? What is your opinion on the time provided to complete the assignment?

# Write your response below
Fill out the survey and provide your valuable feedback.

https://docs.google.com/forms/d/e/1FAIpQLSd_ObuA3iNoL7Az_C-2NOfHodfKCfDzHZtGRfIker6WyZqTtA/viewform?usp=dialog