
# **INFO5731 Assignment 1**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid creating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100


**Late submissions incur a 10% penalty for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (25 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a recent movie from 2024 or 2025 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]


(3) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(4) Collect all the information of the 904 narrators in the Densho Digital Repository.

(5) **Collect a total of 10000 reviews** of the top 100 most popular software from G2 and Capterra.


In [174]:
import time
import requests
import pandas as pd

# -----------------------------
# USER SETTINGS
# -----------------------------
QUERY = "machine learning"  # allowed: "machine learning", "data science", "artificial intelligence", "information extraction"
TARGET_N = 10_000
OUT_CSV = "semantic_scholar_10000_abstracts_clean.csv"

# Optional: add your key for better rate limits (leave "" if you don't have one)
API_KEY = ""

FIELDS = [
    "paperId",
    "title",
    "abstract",
    "year",
    "url",
    "citationCount",
    "publicationDate",
    "authors"
]

BASE_URL = "https://api.semanticscholar.org/graph/v1/paper/search/bulk"


def make_headers(api_key: str) -> dict:
    headers = {"User-Agent": "INFO5731_Assignment1/1.0"}
    if api_key.strip():
        headers["x-api-key"] = api_key.strip()
    return headers


def authors_to_str(authors) -> str:
    if not isinstance(authors, list):
        return ""
    names = []
    for a in authors:
        if isinstance(a, dict) and a.get("name"):
            names.append(a["name"].strip())
    return "; ".join(names)


def fetch_bulk_page(query: str, fields: list, token: str | None, api_key: str, retries: int = 6) -> dict:
    """
    Bulk search endpoint paginates using a continuation `token`.
    """
    params = {
        "query": query,
        "fields": ",".join(fields),
        "sort": "citationCount"
    }
    if token:
        params["token"] = token

    headers = make_headers(api_key)

    for attempt in range(1, retries + 1):
        r = requests.get(BASE_URL, params=params, headers=headers, timeout=60)

        # Handle rate limiting gracefully
        if r.status_code == 429:
            sleep_s = min(60, 2 ** attempt)
            print(f"[429] Rate limited. Sleeping {sleep_s}s (attempt {attempt}/{retries})...")
            time.sleep(sleep_s)
            continue

        # Handle transient server issues
        if 500 <= r.status_code < 600:
            sleep_s = min(60, 2 ** attempt)
            print(f"[{r.status_code}] Server error. Sleeping {sleep_s}s (attempt {attempt}/{retries})...")
            time.sleep(sleep_s)
            continue

        r.raise_for_status()
        return r.json()

    raise RuntimeError("Failed to fetch after multiple retries (rate limits / server errors).")


def clean_text(s: str) -> str:
    if s is None:
        return ""
    return " ".join(str(s).split()).strip()


def main():
    rows = []
    token = None

    # Slow down if no key (shared/stricter limits are more likely without a key)
    sleep_between = 1.2 if API_KEY.strip() else 3.5

    while len(rows) < TARGET_N:
        data = fetch_bulk_page(QUERY, FIELDS, token, API_KEY)

        batch = data.get("data", [])
        token = data.get("token")  # next-page token returned by bulk search

        if not batch:
            print("No more results returned by API. Stopping.")
            break

        for p in batch:
            rows.append({
                "paperId": p.get("paperId", ""),
                "title": clean_text(p.get("title", "")),
                "abstract": clean_text(p.get("abstract", "")),
                "year": p.get("year"),
                "publicationDate": p.get("publicationDate", ""),
                "url": p.get("url", ""),
                "citationCount": p.get("citationCount"),
                "authors": authors_to_str(p.get("authors", [])),
                "query_used": QUERY
            })

            if len(rows) >= TARGET_N:
                break

        print(f"Collected {len(rows):,} papers | token={'yes' if token else 'no'}")

        if not token:
            print("No continuation token returned; cannot paginate further. Stopping.")
            break

        time.sleep(sleep_between)

    df = pd.DataFrame(rows)

    # Deduplicate by paperId
    if "paperId" in df.columns:
        df = df.drop_duplicates(subset=["paperId"]).reset_index(drop=True)

    # OPTIONAL: keep only rows with a non-empty abstract
    # (uncomment if your instructor expects every row to have text)
    # df = df[df["abstract"].str.len() > 0].reset_index(drop=True)

    df.to_csv(OUT_CSV, index=False, encoding="utf-8")
    print(f"\nSaved cleaned CSV: {OUT_CSV}")
    print(df.head(5))

    return df


if __name__ == "__main__":
    main()

Collected 1,000 papers | token=yes
Collected 2,000 papers | token=yes
Collected 3,000 papers | token=yes
Collected 4,000 papers | token=yes
Collected 5,000 papers | token=yes
Collected 6,000 papers | token=yes
Collected 7,000 papers | token=yes
[500] Server error. Sleeping 2s (attempt 1/6)...
Collected 7,671 papers | token=no
No continuation token returned; cannot paginate further. Stopping.

Saved cleaned CSV: semantic_scholar_10000_abstracts_clean.csv
                                    paperId  \
0  0000817f14d29ab7febdd976b9a971c81b7de4f6   
1  000099a8f4c7604b553963915427b46ad4b6e491   
2  0000bbee1421f05fd61f172b2788f60106282ca0   
3  00011b6a4ce0a7de7e4cf5b18b1b131ef4a4c103   
4  00011f3b71f704bf34ab3e3d8ea88cf07037f9b5   

                                               title  \
0  Transcriptomics-Guided In Silico Drug Repurpos...   
1  AI in Multidisciplinary Engineering: A Holisti...   
2  02011 Development of a machine learning algori...   
3  Why AI Projects in Banks Fail Wi
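
The retry loop in `fetch_bulk_page` waits `min(60, 2 ** attempt)` seconds after each 429 or 5xx response. As a minimal sketch of that schedule, the waits for the default six attempts can be listed without sending any requests. The helper name `backoff_delays` is hypothetical and not part of the notebook.

```python
def backoff_delays(retries: int = 6, cap: int = 60) -> list[int]:
    """Return the capped exponential wait (in seconds) for each retry attempt."""
    # Mirrors the notebook's formula: min(60, 2 ** attempt), attempts start at 1
    return [min(cap, 2 ** attempt) for attempt in range(1, retries + 1)]


delays = backoff_delays()
print(delays)       # [2, 4, 8, 16, 32, 60]
print(sum(delays))  # 122 seconds of waiting in the worst case
```

With these defaults, a page that keeps failing gives up after roughly two minutes of waiting, which is worth knowing before starting a long collection run.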

# Question 2 (15 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [69]:
# Install if needed (Colab users)
!pip install nltk --quiet

import pandas as pd
import re
import nltk

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('averaged_perceptron_tagger_eng', quiet=True)

True

In [70]:
# Load the CSV from Question 1
df = pd.read_csv("semantic_scholar_10000_abstracts_clean.csv")

# Change this if your column name is different
TEXT_COLUMN = "abstract"

print("Dataset shape:", df.shape)
df[[TEXT_COLUMN]].head(5)

Dataset shape: (8978, 9)


Unnamed: 0,abstract
0,"In tropical and subtropical areas, malaria sta..."
1,Artificial Intelligence (AI) has evolved from ...
2,
3,"Everywhere across the globe, banks are increas..."
4,


In [71]:
def remove_noise(text):
    if pd.isna(text):
        return ""
    # Keep only letters and whitespace (digits are removed here as well)
    text = re.sub(r'[^A-Za-z\s]', '', text)
    return text

df["clean_no_noise"] = df[TEXT_COLUMN].apply(remove_noise)

print("After removing noise:")
df[["abstract", "clean_no_noise"]].head(3)

After removing noise:


Unnamed: 0,abstract,clean_no_noise
0,"In tropical and subtropical areas, malaria sta...",In tropical and subtropical areas malaria stan...
1,Artificial Intelligence (AI) has evolved from ...,Artificial Intelligence AI has evolved from a ...
2,,
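
`remove_noise` deletes every character that is not a letter or whitespace, so hyphenated or slash-separated terms are fused into a single token. The sketch below contrasts deleting such characters with replacing them by a space, using only the standard library. `strip_noise_joined` and `strip_noise_spaced` are illustrative names, not functions from the notebook.

```python
import re


def strip_noise_joined(text: str) -> str:
    # Same pattern as the notebook: non-letters are deleted outright
    return re.sub(r"[^A-Za-z\s]", "", text)


def strip_noise_spaced(text: str) -> str:
    # Alternative: replace non-letters with a space, then collapse whitespace
    spaced = re.sub(r"[^A-Za-z\s]", " ", text)
    return " ".join(spaced.split())


sample = "Transcriptomics-Guided drug/target search"
print(strip_noise_joined(sample))  # TranscriptomicsGuided drugtarget search
print(strip_noise_spaced(sample))  # Transcriptomics Guided drug target search
```

Which variant is preferable depends on whether hyphenated terms should count as one token or two in the later analysis.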


In [72]:
def remove_numbers(text):
    return re.sub(r'\d+', '', text)

df["clean_no_numbers"] = df["clean_no_noise"].apply(remove_numbers)

print("After removing numbers:")
df[["clean_no_noise", "clean_no_numbers"]].head(3)

After removing numbers:


Unnamed: 0,clean_no_noise,clean_no_numbers
0,In tropical and subtropical areas malaria stan...,In tropical and subtropical areas malaria stan...
1,Artificial Intelligence AI has evolved from a ...,Artificial Intelligence AI has evolved from a ...
2,,
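
Because the noise-removal pattern `[^A-Za-z\s]` already drops digits, the `remove_numbers` step has nothing left to remove, which is why the two columns above look identical. The sketch below illustrates this ordering effect on part of one sentence from the dataset, using only the two regular expressions from the notebook.

```python
import re

raw = "causing an estimated 247 million cases worldwide annually."

no_noise = re.sub(r"[^A-Za-z\s]", "", raw)   # step (1): also strips "247"
no_numbers = re.sub(r"\d+", "", no_noise)    # step (2): finds no digits

print(no_noise == no_numbers)  # True
```

To make step (2) observable, the noise pattern would need to keep digits (for example `[^A-Za-z0-9\s]`) so that number removal runs on text that still contains them.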


In [73]:
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

df["clean_no_stopwords"] = df["clean_no_numbers"].apply(remove_stopwords)

print("After removing stopwords:")
df[["clean_no_numbers", "clean_no_stopwords"]].head(3)

After removing stopwords:


Unnamed: 0,clean_no_numbers,clean_no_stopwords
0,In tropical and subtropical areas malaria stan...,tropical subtropical areas malaria stands prof...
1,Artificial Intelligence AI has evolved from a ...,Artificial Intelligence AI evolved computation...
2,,


In [74]:
df["clean_lowercase"] = df["clean_no_stopwords"].str.lower()

print("After converting to lowercase:")
df[["clean_no_stopwords", "clean_lowercase"]].head(3)

After converting to lowercase:


Unnamed: 0,clean_no_stopwords,clean_lowercase
0,tropical subtropical areas malaria stands prof...,tropical subtropical areas malaria stands prof...
1,Artificial Intelligence AI evolved computation...,artificial intelligence ai evolved computation...
2,,


In [75]:
stemmer = PorterStemmer()

def apply_stemming(text):
    words = text.split()
    stemmed_words = [stemmer.stem(word) for word in words]
    return " ".join(stemmed_words)

df["clean_stemmed"] = df["clean_lowercase"].apply(apply_stemming)

print("After stemming:")
df[["clean_lowercase", "clean_stemmed"]].head(3)

After stemming:


Unnamed: 0,clean_lowercase,clean_stemmed
0,tropical subtropical areas malaria stands prof...,tropic subtrop area malaria stand profound pub...
1,artificial intelligence ai evolved computation...,artifici intellig ai evolv comput tool transfo...
2,,


In [76]:
lemmatizer = WordNetLemmatizer()

def apply_lemmatization(text):
    words = text.split()
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
    return " ".join(lemmatized_words)

df["clean_lemmatized"] = df["clean_lowercase"].apply(apply_lemmatization)

print("After lemmatization:")
df[["clean_lowercase", "clean_lemmatized"]].head(3)

After lemmatization:


Unnamed: 0,clean_lowercase,clean_lemmatized
0,tropical subtropical areas malaria stands prof...,tropical subtropical area malaria stand profou...
1,artificial intelligence ai evolved computation...,artificial intelligence ai evolved computation...
2,,


# Question 3 (15 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.
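
For part (3), once entities have been extracted, counting them per type is a small aggregation step. The sketch below assumes the entities are already available as `(text, label)` pairs, for example built from spaCy's `doc.ents` via `ent.text` and `ent.label_`. The sample pairs and the helper `count_entities` are hypothetical and only illustrate the counting.

```python
from collections import Counter


def count_entities(pairs):
    """Count entities per label and per (label, text) combination."""
    by_label = Counter(label for _, label in pairs)
    by_entity = Counter((label, text) for text, label in pairs)
    return by_label, by_entity


# Hypothetical (text, label) pairs, not real output from the dataset
sample_pairs = [
    ("Google", "ORG"),
    ("2024", "DATE"),
    ("Google", "ORG"),
    ("India", "GPE"),
]

by_label, by_entity = count_entities(sample_pairs)
print(by_label["ORG"])               # 2
print(by_entity[("ORG", "Google")])  # 2
print(by_label.most_common(1))       # [('ORG', 2)]
```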

In [175]:
import os, glob
import pandas as pd

# 1) Find CSVs in the current folder
csv_files = sorted(glob.glob("*.csv"))
print("CSV files found:", csv_files)

# 2) Pick the most likely cleaned file (fallback to first csv)
preferred = [
    "semantic_scholar_abstracts_fully_cleaned.csv",
    "semantic_scholar_10000_abstracts_clean.csv",
    "semanticscholar_papers_10000_clean.csv",
]
csv_path = None
for p in preferred:
    if p in csv_files:
        csv_path = p
        break
if csv_path is None:
    if len(csv_files) == 0:
        raise FileNotFoundError("No CSV found. Upload your cleaned CSV into this notebook runtime.")
    csv_path = csv_files[0]

print("✅ Using CSV:", csv_path)

df = pd.read_csv(csv_path)
print("Shape:", df.shape)
print("Columns:", list(df.columns))
df.head(2)

CSV files found: ['cleaned_ml_ai_tweets.csv', 'github_marketplace_actions_clean.csv', 'github_marketplace_actions_raw.csv', 'semantic_scholar_10000_abstracts_clean.csv', 'tweets_clean.csv', 'tweets_raw.csv']
✅ Using CSV: semantic_scholar_10000_abstracts_clean.csv
Shape: (7671, 9)
Columns: ['paperId', 'title', 'abstract', 'year', 'publicationDate', 'url', 'citationCount', 'authors', 'query_used']


Unnamed: 0,paperId,title,abstract,year,publicationDate,url,citationCount,authors,query_used
0,0000817f14d29ab7febdd976b9a971c81b7de4f6,Transcriptomics-Guided In Silico Drug Repurpos...,"In tropical and subtropical areas, malaria sta...",2023.0,2023-09-05,https://www.semanticscholar.org/paper/0000817f...,0,Joyce V. B. Borba; Beatriz Rosa de Azevedo; La...,machine learning
1,000099a8f4c7604b553963915427b46ad4b6e491,AI in Multidisciplinary Engineering: A Holisti...,Artificial Intelligence (AI) has evolved from ...,2025.0,2025-12-26,https://www.semanticscholar.org/paper/000099a8...,0,Dr. G. Gayatri Tanuja; Chethana T. V,machine learning


In [176]:
# These are common Q2 output column names
candidate_cols = [
    "clean_lemmatized",
    "clean_stemmed",
    "clean_lowercase",
    "clean_no_stopwords",
    "clean_no_numbers",
    "clean_no_noise",
    "abstract",  # in case you didn't create new columns
    "text",
    "review",
]

TEXT_COL = None
for c in candidate_cols:
    if c in df.columns:
        TEXT_COL = c
        break

if TEXT_COL is None:
    # fallback: pick the first object/string column
    obj_cols = [c for c in df.columns if df[c].dtype == "object"]
    if not obj_cols:
        raise ValueError("No text-like column found. Please check your CSV columns.")
    TEXT_COL = obj_cols[0]

df[TEXT_COL] = df[TEXT_COL].fillna("").astype(str)
print("✅ Using text column:", TEXT_COL)
df[[TEXT_COL]].head(5)

✅ Using text column: abstract


Unnamed: 0,abstract
0,"In tropical and subtropical areas, malaria sta..."
1,Artificial Intelligence (AI) has evolved from ...
2,
3,"Everywhere across the globe, banks are increas..."
4,


In [177]:
!pip install -q nltk
import nltk
from collections import Counter

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

noun_tags = {"NN","NNS","NNP","NNPS"}
verb_tags = {"VB","VBD","VBG","VBN","VBP","VBZ"}
adj_tags  = {"JJ","JJR","JJS"}
adv_tags  = {"RB","RBR","RBS"}

pos_counter = Counter()

# Use all rows; if it runs slow, change to df[TEXT_COL].head(2000)
texts = df[TEXT_COL].tolist()

for txt in texts:
    tokens = nltk.word_tokenize(txt)
    tagged = nltk.pos_tag(tokens)
    for _, tag in tagged:
        if tag in noun_tags:
            pos_counter["Noun"] += 1
        elif tag in verb_tags:
            pos_counter["Verb"] += 1
        elif tag in adj_tags:
            pos_counter["Adjective"] += 1
        elif tag in adv_tags:
            pos_counter["Adverb"] += 1

print("✅ Total POS Counts")
print(pos_counter)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


✅ Total POS Counts
Counter({'Noun': 372708, 'Verb': 135773, 'Adjective': 121559, 'Adverb': 23558})
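
The four tag sets in the cell above each share a Penn Treebank prefix (`NN`, `VB`, `JJ`, `RB`), so the same tally can be written as a prefix lookup. This is a minimal sketch on a hand-tagged token list: the tags are written by hand for illustration rather than produced by `nltk.pos_tag`, and `coarse_pos` is a hypothetical helper.

```python
from collections import Counter

PREFIX_TO_POS = {"NN": "Noun", "VB": "Verb", "JJ": "Adjective", "RB": "Adverb"}


def coarse_pos(tag: str):
    """Map a Penn Treebank tag to Noun/Verb/Adjective/Adverb, else None."""
    return PREFIX_TO_POS.get(tag[:2])


# Hand-written (token, tag) pairs for illustration only
tagged = [
    ("malaria", "NN"), ("stands", "VBZ"), ("as", "IN"), ("a", "DT"),
    ("profound", "JJ"), ("public", "JJ"), ("health", "NN"),
    ("challenge", "NN"), ("worldwide", "RB"),
]

counts = Counter(pos for _, tag in tagged if (pos := coarse_pos(tag)))
print(counts["Noun"], counts["Verb"], counts["Adjective"], counts["Adverb"])  # 3 1 2 1
```

Tags outside the four groups, such as `IN` and `DT`, map to `None` and are skipped, matching the behaviour of the `if`/`elif` chain.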


In [182]:
!pip uninstall -y transformers tokenizers # Uninstall potentially incompatible versions
!pip install -q tokenizers==0.12.1 # Install a compatible tokenizers version
!pip install -q transformers==4.30.0 # Install a compatible transformers version
!pip install -q spacy benepar
!python -m spacy download en_core_web_sm -q

import spacy
import benepar

nlp = spacy.load("en_core_web_sm")

# Add constituency parser (benepar)
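try:
    benepar.download("benepar_en3")
except Exception as exc:
    # Report download failures instead of silently ignoring them
    print("benepar model download failed:", exc)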

if "benepar" not in nlp.pipe_names:
    nlp.add_pipe("benepar", config={"model": "benepar_en3"})

print("✅ spaCy and benepar ready. Pipelines:", nlp.pipe_names)

Found existing installation: transformers 5.2.0
Uninstalling transformers-5.2.0:
  Successfully uninstalled transformers-5.2.0
Found existing installation: tokenizers 0.22.2
Uninstalling tokenizers-0.22.2:
  Successfully uninstalled tokenizers-0.22.2
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  error: subprocess-exited-with-error
  
  × Building wheel for tokenizers (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> See above for output.
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  Building wheel for tokenizers (pyproject.toml) ... error
  ERROR: Failed building wheel for tokenizers
ERROR: ERROR: Failed to build installable wheels for some pyproject.to

[nltk_data] Downloading package benepar_en3 to /root/nltk_data...
[nltk_data]   Package benepar_en3 is already up-to-date!


AttributeError: T5Tokenizer has no attribute build_inputs_with_special_tokens

In [179]:
# Find a good example sentence from your cleaned text
example_text = ""
for t in df[TEXT_COL]:
    if isinstance(t, str) and len(t.split()) > 8:
        example_text = t
        break

doc = nlp(example_text)

# pick first sentence (spaCy sentence segmentation)
sent = list(doc.sents)[0] if list(doc.sents) else doc
example_sentence = sent.text.strip()

print("✅ Example sentence:")
print(example_sentence)

✅ Example sentence:
In tropical and subtropical areas, malaria stands as a profound public health challenge, causing an estimated 247 million cases worldwide annually.


In [180]:
sent_doc = nlp(example_sentence)
sent = list(sent_doc.sents)[0] if list(sent_doc.sents) else sent_doc

print("✅ Constituency Parse Tree:\n")
print(sent._.parse_string)

✅ Constituency Parse Tree:



Exception: No constituency parse is available for this document. Consider adding a BeneparComponent to the pipeline.

In [181]:
print("✅ Dependency Parse (Token | POS | Head | Dependency)\n")
for token in sent:
    print(f"{token.text:<15} | {token.pos_:<6} | {token.head.text:<15} | {token.dep_}")

✅ Dependency Parse (Token | POS | Head | Dependency)

In              | ADP    | stands          | prep
tropical        | ADJ    | areas           | amod
and             | CCONJ  | tropical        | cc
subtropical     | ADJ    | tropical        | conj
areas           | NOUN   | In              | pobj
,               | PUNCT  | stands          | punct
malaria         | NOUN   | stands          | nsubj
stands          | VERB   | stands          | ROOT
as              | ADP    | stands          | prep
a               | DET    | challenge       | det
profound        | ADJ    | challenge       | amod
public          | ADJ    | health          | amod
health          | NOUN   | challenge       | compound
challenge       | NOUN   | as              | pobj
,               | PUNCT  | stands          | punct
causing         | VERB   | stands          | advcl
an              | DET    | cases           | det
estimated       | VERB   | cases           | amod
247             | NUM    | million      
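
In the table above, each row links a token to its head, and the root of the sentence is the token whose head is itself (`stands`, labelled `ROOT`). As a small sketch of how to read that structure, the snippet below copies a few `(token, head, dependency)` rows from the printed output and groups the dependents under each head. `children_by_head` is a hypothetical helper.

```python
from collections import defaultdict

# (token, head, dependency) rows copied from the dependency output above
rows = [
    ("In", "stands", "prep"),
    ("areas", "In", "pobj"),
    ("malaria", "stands", "nsubj"),
    ("stands", "stands", "ROOT"),
    ("as", "stands", "prep"),
    ("challenge", "as", "pobj"),
    ("causing", "stands", "advcl"),
]


def children_by_head(rows):
    """Group (dependent, relation) pairs under each head, skipping the root's self-link."""
    tree = defaultdict(list)
    for token, head, dep in rows:
        if token != head:
            tree[head].append((token, dep))
    return dict(tree)


root = next(token for token, head, dep in rows if dep == "ROOT")
tree = children_by_head(rows)
print(root)        # stands
print(tree[root])  # [('In', 'prep'), ('malaria', 'nsubj'), ('as', 'prep'), ('causing', 'advcl')]
print(tree["In"])  # [('areas', 'pobj')]
```

Read this way, `malaria` is the subject (`nsubj`) of `stands`, and the phrase beginning with `In` attaches to the verb through `prep`, with `areas` as the object of the preposition (`pobj`).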

# **The following questions must be answered using AI assistance**

# Question 4 (20 points)

Q4. (PART-1)
Scrape the GitHub Marketplace to gather details about popular actions. Using Python, the process begins by sending HTTP requests to multiple pages of the marketplace (1000 products), handling pagination through dynamic page numbers. The key details extracted include the product name, a short description, and the URL.

The extracted data is stored in a structured CSV format with columns for product name, description, URL, and page number. A time delay is introduced between requests to avoid server overload. ChatGPT can assist by helping with the parsing of HTML, error handling, and generating reports based on the data collected.

The goal is to complete the scraping within a specified time limit, ensuring that the process is efficient and adheres to GitHub's usage guidelines.
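
Pagination through dynamic page numbers amounts to varying one query parameter per request. The sketch below shows how the page URLs can be built with the standard library. The `type` and `page` parameter names follow the request parameters used by this notebook's scraper, and `build_page_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

MARKETPLACE_URL = "https://github.com/marketplace"


def build_page_url(page_num: int, listing_type: str = "actions") -> str:
    """Build the listing URL for one page of marketplace results."""
    query = urlencode({"type": listing_type, "page": page_num})
    return f"{MARKETPLACE_URL}?{query}"


for page in range(1, 4):
    print(build_page_url(page))
# https://github.com/marketplace?type=actions&page=1
# https://github.com/marketplace?type=actions&page=2
# https://github.com/marketplace?type=actions&page=3
```

At about 20 items per page, as in the run recorded in this notebook, 1000 products require at least 50 page requests, so the delay between requests largely determines the total run time.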

(PART -2)

1.   **Preprocess Data**: Clean the text by tokenizing, removing stopwords, and converting to lowercase.

2. Perform **Data Quality** operations.


Preprocessing:
Preprocessing involves cleaning the text by removing noise such as special characters, HTML tags, and unnecessary whitespace. It also includes tasks like tokenization, stopword removal, and lemmatization to standardize the text for analysis.
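
As a rough illustration of these preprocessing steps, the sketch below strips tags, lowercases, removes special characters, and drops stopwords using only the standard library. The regex-based tag stripping and the tiny `STOPWORDS` set are simplifying assumptions for the example; a real pipeline would use an HTML parser and a full stopword list, and lemmatization is omitted here.

```python
import re

# Tiny illustrative stopword set (assumption); real lists are much longer
STOPWORDS = {"a", "an", "and", "the", "to", "for", "with", "your"}


def simple_preprocess(text: str) -> list[str]:
    text = re.sub(r"<[^>]+>", " ", text)      # crude HTML tag removal
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # drop special characters
    tokens = text.split()                     # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]


print(simple_preprocess("<b>Rebuild</b> Armbian   and Kernel!"))
# ['rebuild', 'armbian', 'kernel']
```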

Data Quality:
Data quality checks ensure completeness, consistency, and accuracy by verifying that all required columns are filled and formatted correctly. Additionally, it involves identifying and removing duplicates, handling missing values, and ensuring the data reflects the true content accurately.
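
The completeness and duplicate checks described above can be sketched on plain dictionaries before involving pandas. The rows, the field names, and the helper `quality_report` below are hypothetical and only illustrate counting missing values and repeated URLs.

```python
def quality_report(rows, required=("product_name", "url")):
    """Count rows with empty required fields and rows whose URL repeats."""
    missing = {field: 0 for field in required}
    seen_urls, duplicates = set(), 0
    for row in rows:
        for field in required:
            if not str(row.get(field) or "").strip():
                missing[field] += 1
        url = row.get("url")
        if url in seen_urls:
            duplicates += 1
        elif url:
            seen_urls.add(url)
    return {"missing": missing, "duplicate_urls": duplicates}


# Hypothetical rows for illustration
rows = [
    {"product_name": "Checkout", "url": "https://github.com/marketplace/actions/checkout"},
    {"product_name": "Checkout", "url": "https://github.com/marketplace/actions/checkout"},
    {"product_name": "  ", "url": "https://github.com/marketplace/actions/example"},
]
print(quality_report(rows))
# {'missing': {'product_name': 1, 'url': 0}, 'duplicate_urls': 1}
```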


GitHub Marketplace page:
https://github.com/marketplace?type=actions

In [84]:
!pip install -q beautifulsoup4 requests pandas lxml

import time
import random
import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
import pandas as pd

In [85]:
BASE = "https://github.com"
MARKETPLACE_URL = "https://github.com/marketplace"
TYPE = "actions"

TARGET_PRODUCTS = 1000
START_PAGE = 1
MAX_PAGES_SAFETY = 300   # safety cap so it won't run forever if layout changes
SLEEP_MIN = 1.0          # be polite; increase if you get throttled
SLEEP_MAX = 2.0

HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; INFO5731-Scraper/1.0; +https://github.com)",
    "Accept-Language": "en-US,en;q=0.9",
}

session = requests.Session()
session.headers.update(HEADERS)

def fetch_page(page_num: int, max_retries: int = 5) -> str:
    params = {"type": TYPE, "page": page_num}
    url = MARKETPLACE_URL

    for attempt in range(1, max_retries + 1):
        r = session.get(url, params=params, timeout=30)

        # Basic throttling / transient handling
        if r.status_code in (429, 502, 503, 504):
            wait = min(60, 2 ** attempt) + random.random()
            print(f"[{r.status_code}] Throttled/server issue. Sleeping {wait:.1f}s and retrying...")
            time.sleep(wait)
            continue

        r.raise_for_status()
        return r.text

    raise RuntimeError(f"Failed to fetch page {page_num} after retries")

def parse_marketplace_actions(html: str, page_num: int):
    soup = BeautifulSoup(html, "lxml")
    items = []

    # Marketplace cards are typically anchors linking to /marketplace/actions/<slug>
    # We'll gather all such anchors and extract name/desc.
    anchors = soup.select('a[href^="/marketplace/actions/"]')

    # De-duplicate anchors by href (page can contain repeated nested anchors)
    seen = set()
    unique_anchors = []
    for a in anchors:
        href = a.get("href", "")
        if href and href not in seen:
            seen.add(href)
            unique_anchors.append(a)

    for a in unique_anchors:
        href = a.get("href", "")
        full_url = urljoin(BASE, href)

        # Try to find a "title/name" text
        # Often the name is in a strong/h3/span within the anchor
        name = ""
        name_candidate = a.get_text(" ", strip=True)

        # Heuristic: action name is usually the first part; but cards may include description too.
        # We'll attempt to locate a heading-like element first.
        heading = a.select_one("h3, h2, strong, .h3, .h4")
        if heading:
            name = heading.get_text(" ", strip=True)
        else:
            # Fallback: take first line-ish chunk
            name = name_candidate

        # Description often appears in a sibling <p> within same card container
        desc = ""
        card = a.find_parent(["div", "li", "article"])
        if card:
            p = card.select_one("p")
            if p:
                desc = p.get_text(" ", strip=True)

        # Clean name if it's too long (sometimes includes extra text)
        name = re.sub(r"\s+", " ", name).strip()
        desc = re.sub(r"\s+", " ", desc).strip()

        # Only keep plausible rows
        if "/marketplace/actions/" in full_url and name:
            items.append({
                "product_name": name,
                "description": desc,
                "url": full_url,
                "page_number": page_num
            })

    return items

def scrape_actions(target_n: int = 1000):
    results = []
    seen_urls = set()

    for page in range(START_PAGE, START_PAGE + MAX_PAGES_SAFETY):
        html = fetch_page(page)
        items = parse_marketplace_actions(html, page)

        # Add new unique products
        new_count = 0
        for it in items:
            if it["url"] not in seen_urls:
                seen_urls.add(it["url"])
                results.append(it)
                new_count += 1
                if len(results) >= target_n:
                    break

        print(f"Page {page}: found {len(items)} candidates, added {new_count}, total {len(results)}")

        # Stop if page returns nothing new (likely end or selector mismatch)
        if new_count == 0 and page > START_PAGE + 2:
            print("No new items found on this page. Stopping early.")
            break

        # polite delay
        time.sleep(random.uniform(SLEEP_MIN, SLEEP_MAX))

        if len(results) >= target_n:
            break

    return pd.DataFrame(results)

df_raw = scrape_actions(TARGET_PRODUCTS)
df_raw.head(10)

Page 1: found 20 candidates, added 20, total 20
Page 2: found 20 candidates, added 20, total 40
Page 3: found 20 candidates, added 20, total 60
Page 4: found 20 candidates, added 20, total 80
Page 5: found 20 candidates, added 20, total 100
Page 6: found 20 candidates, added 20, total 120
Page 7: found 20 candidates, added 20, total 140
Page 8: found 20 candidates, added 20, total 160
Page 9: found 20 candidates, added 20, total 180
Page 10: found 20 candidates, added 20, total 200
Page 11: found 20 candidates, added 20, total 220
Page 12: found 20 candidates, added 20, total 240
Page 13: found 20 candidates, added 20, total 260
Page 14: found 20 candidates, added 20, total 280
Page 15: found 20 candidates, added 20, total 300
Page 16: found 20 candidates, added 19, total 319
Page 17: found 20 candidates, added 20, total 339
Page 18: found 20 candidates, added 20, total 359
Page 19: found 20 candidates, added 20, total 379
Page 20: found 20 candidates, added 20, total 399
Page 21: foun

Unnamed: 0,product_name,description,url,page_number
0,TruffleHog OSS,,https://github.com/marketplace/actions/truffle...,1
1,Metrics embed,,https://github.com/marketplace/actions/metrics...,1
2,yq - portable yaml processor,,https://github.com/marketplace/actions/yq-port...,1
3,Super-Linter,,https://github.com/marketplace/actions/super-l...,1
4,Rebuild Armbian and Kernel,,https://github.com/marketplace/actions/rebuild...,1
5,Gosec Security Checker,,https://github.com/marketplace/actions/gosec-s...,1
6,Checkout,,https://github.com/marketplace/actions/checkout,1
7,OpenCommit — improve commits with AI 🧙,,https://github.com/marketplace/actions/opencom...,1
8,SSH Remote Commands,,https://github.com/marketplace/actions/ssh-rem...,1
9,Claude Code Action Official,,https://github.com/marketplace/actions/claude-...,1


In [58]:
RAW_CSV = "github_marketplace_actions_raw.csv"
df_raw.to_csv(RAW_CSV, index=False, encoding="utf-8")
print("Saved:", RAW_CSV, "| rows:", len(df_raw))

Saved: github_marketplace_actions_raw.csv | rows: 1000


In [59]:
!pip install -q nltk

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("omw-1.4")

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [60]:
def strip_html(text: str) -> str:
    # Descriptions should be plain already, but this is safe
    return BeautifulSoup(text, "lxml").get_text(" ", strip=True)

def normalize_whitespace(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()

def remove_noise(text: str) -> str:
    # keep letters, numbers, and spaces; remove special punctuation
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    return normalize_whitespace(text)

def preprocess_text(text: str) -> str:
    if text is None or (isinstance(text, float) and pd.isna(text)):
        return ""

    text = str(text)
    text = strip_html(text)
    text = text.lower()
    text = remove_noise(text)

    tokens = word_tokenize(text)
    tokens = [t for t in tokens if t not in STOPWORDS and len(t) > 1]

    # lemmatize (default noun; still acceptable for assignment)
    tokens = [lemmatizer.lemmatize(t) for t in tokens]

    return " ".join(tokens)

# Create cleaned columns
df = df_raw.copy()
df["product_name_clean"] = df["product_name"].apply(preprocess_text)
df["description_clean"] = df["description"].apply(preprocess_text)

df[["product_name", "product_name_clean", "description", "description_clean"]].head(5)

Unnamed: 0,product_name,product_name_clean,description,description_clean
0,TruffleHog OSS,trufflehog os,,
1,Metrics embed,metric embed,,
2,yq - portable yaml processor,yq portable yaml processor,,
3,Super-Linter,super linter,,
4,Rebuild Armbian and Kernel,rebuild armbian kernel,,


In [61]:
dq_report = {}

# 1) Completeness: missing values
dq_report["missing_product_name"] = int(df["product_name"].isna().sum() + (df["product_name"].astype(str).str.strip() == "").sum())
dq_report["missing_url"] = int(df["url"].isna().sum() + (df["url"].astype(str).str.strip() == "").sum())
dq_report["missing_description"] = int(df["description"].isna().sum() + (df["description"].astype(str).str.strip() == "").sum())

# 2) URL format validity (should start with https://github.com/marketplace/actions/)
valid_prefix = "https://github.com/marketplace/actions/"
df["url_valid"] = df["url"].astype(str).str.startswith(valid_prefix)
dq_report["invalid_url_count"] = int((~df["url_valid"]).sum())

# 3) Duplicates
dq_report["duplicate_url_rows"] = int(df.duplicated(subset=["url"]).sum())

# 4) Consistency: page_number should be positive integer
df["page_number_valid"] = df["page_number"].apply(lambda x: isinstance(x, (int,)) and x > 0)
dq_report["invalid_page_number"] = int((~df["page_number_valid"]).sum())

# 5) Basic length checks
df["name_len"] = df["product_name"].astype(str).str.len()
df["desc_len"] = df["description"].astype(str).str.len()
dq_report["very_short_names(<3)"] = int((df["name_len"] < 3).sum())

dq_report

{'missing_product_name': 0,
 'missing_url': 0,
 'missing_description': 1000,
 'invalid_url_count': 0,
 'duplicate_url_rows': 0,
 'invalid_page_number': 0,
 'very_short_names(<3)': 0}

In [62]:
# Drop duplicate URLs (keep first)
df_clean = df.drop_duplicates(subset=["url"]).copy()

# Fill missing descriptions with empty string (you can also drop them if required)
df_clean["description"] = df_clean["description"].fillna("")
df_clean["description_clean"] = df_clean["description_clean"].fillna("")

# Optionally: remove rows with invalid URL
df_clean = df_clean[df_clean["url"].astype(str).str.startswith(valid_prefix)].copy()

print("Before:", len(df), "After DQ fixes:", len(df_clean))
df_clean.head(5)

Before: 1000 After DQ fixes: 1000


Unnamed: 0,product_name,description,url,page_number,product_name_clean,description_clean,url_valid,page_number_valid,name_len,desc_len
0,TruffleHog OSS,,https://github.com/marketplace/actions/truffle...,1,trufflehog os,,True,True,14,0
1,Metrics embed,,https://github.com/marketplace/actions/metrics...,1,metric embed,,True,True,13,0
2,yq - portable yaml processor,,https://github.com/marketplace/actions/yq-port...,1,yq portable yaml processor,,True,True,28,0
3,Super-Linter,,https://github.com/marketplace/actions/super-l...,1,super linter,,True,True,12,0
4,Rebuild Armbian and Kernel,,https://github.com/marketplace/actions/rebuild...,1,rebuild armbian kernel,,True,True,26,0


In [63]:
FINAL_CSV = "github_marketplace_actions_clean.csv"
df_clean.drop(columns=["url_valid","page_number_valid","name_len","desc_len"], errors="ignore") \
        .to_csv(FINAL_CSV, index=False, encoding="utf-8")

print("Saved:", FINAL_CSV, "| rows:", len(df_clean))

Saved: github_marketplace_actions_clean.csv | rows: 1000


# Question 5 (20 points)

Part 1:
Scrape tweets from Twitter using the Tweepy API, specifically targeting hashtags related to the subtopics machine learning or artificial intelligence.
The extracted data should include the tweet ID, username, and text.

Part 2:
Perform data cleaning procedures on the extracted tweets.

A final data quality check should confirm the completeness and consistency of the dataset. The cleaned data is then saved to a CSV file for further analysis.
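
The cleaning and quality-check steps described above can be tried offline before any API call is made. The following is a minimal sketch using only the standard library; the sample rows are made up (not real tweets), and the helper name `clean_tweet` is hypothetical.

```python
import re


def clean_tweet(text: str) -> str:
    """Apply the cleaning steps to one tweet and return the cleaned string."""
    text = re.sub(r"http\S+", "", text)         # remove URLs
    text = re.sub(r"@\w+", "", text)            # remove mentions
    text = text.replace("#", "")                # drop the hashtag symbol, keep the word
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)  # remove remaining special characters
    return re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace


# Made-up sample rows (assumption: same fields as the assignment asks for).
rows = [
    {"tweet_id": 1, "username": "user_a",
     "text": "Loving #MachineLearning! https://example.com @someone"},
    {"tweet_id": 1, "username": "user_a",
     "text": "Loving #MachineLearning! https://example.com @someone"},
    {"tweet_id": 2, "username": "user_b", "text": "https://example.com"},
]

seen_ids = set()
cleaned = []
for row in rows:
    if row["tweet_id"] in seen_ids:
        continue  # skip duplicate tweet IDs
    seen_ids.add(row["tweet_id"])
    text = clean_tweet(row["text"])
    if text:  # drop rows that are empty after cleaning
        cleaned.append({**row, "text": text})

# Final quality check: unique IDs and no empty text.
ids = [r["tweet_id"] for r in cleaned]
assert len(ids) == len(set(ids))
assert all(r["text"] for r in cleaned)
print(cleaned)
```

Running the steps on plain dictionaries keeps the logic easy to verify; the same function could then be applied to a DataFrame column with `.apply(clean_tweet)`.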


**Note**

1.   Follow the tutorials provided in Canvas to obtain API keys. Use ChatGPT to get the code. Make sure the file is downloaded and saved.
2.   Make sure you divide the GPT code as shown in the tutorials; don't make multiple requests.


In [162]:
!pip install tweepy pandas



In [163]:
# ============================================
# PART 1: SCRAPE TWEETS (Colab Version)
# ============================================

import tweepy
import pandas as pd
import re

# 🔴 IMPORTANT: Paste your Bearer Token below
bearer_token = "PASTE_YOUR_BEARER_TOKEN_HERE"

# Remove accidental spaces
bearer_token = bearer_token.strip()

# Authenticate
client = tweepy.Client(
    bearer_token=bearer_token,
    wait_on_rate_limit=True
)

# Test connection first
try:
    test = client.get_user(username="TwitterDev")
    print("✅ Authentication Successful!")
except Exception as e:
    print("❌ Authentication Failed:", e)

# Search query
query = "(#MachineLearning OR #ArtificialIntelligence) lang:en -is:retweet"

# Fetch tweets
response = client.search_recent_tweets(
    query=query,
    max_results=100,
    tweet_fields=["id", "text", "author_id"],
    expansions=["author_id"],
    user_fields=["username"]
)

# Extract data
tweets_data = []

if response.data:
    users = {u["id"]: u for u in response.includes["users"]}

    for tweet in response.data:
        user = users[tweet.author_id]
        tweets_data.append({
            "tweet_id": tweet.id,
            "username": user.username,
            "text": tweet.text
        })

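# Create DataFrame (explicit columns so the cleaning steps still work if no tweets were returned)
df = pd.DataFrame(tweets_data, columns=["tweet_id", "username", "text"])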

print("\nInitial Data:")
print(df.head())

# ============================================
# PART 2: DATA CLEANING
# ============================================

# Remove duplicates
df.drop_duplicates(subset="tweet_id", inplace=True)

# Remove URLs
df["text"] = df["text"].apply(lambda x: re.sub(r"http\S+", "", x))

# Remove mentions
df["text"] = df["text"].apply(lambda x: re.sub(r"@\w+", "", x))

# Remove hashtag symbol
df["text"] = df["text"].apply(lambda x: re.sub(r"#", "", x))

# Remove special characters
df["text"] = df["text"].apply(lambda x: re.sub(r"[^A-Za-z0-9\s]", "", x))

# Remove extra spaces
df["text"] = df["text"].apply(lambda x: re.sub(r"\s+", " ", x).strip())

# Remove empty rows
df.dropna(inplace=True)
df = df[df["text"] != ""]

# ============================================
# FINAL CHECK
# ============================================

print("\nFinal Dataset Info:")
print(df.info())

print("\nFinal Cleaned Data:")
print(df.head())

# ============================================
# SAVE FILE
# ============================================

file_name = "cleaned_ml_ai_tweets.csv"
df.to_csv(file_name, index=False)

print(f"\n✅ File saved as {file_name}")

❌ Authentication Failed: 401 Unauthorized
Unauthorized


Unauthorized: 401 Unauthorized
Unauthorized
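
A 401 response generally means the API rejected the credentials; one common cause is that the placeholder string was never replaced with a real token. A minimal sketch of a guard that stops before any request is sent is shown below. It assumes the token is stored in an environment variable named `TWITTER_BEARER_TOKEN`; that name and the `load_bearer_token` helper are hypothetical, not part of Tweepy.

```python
import os

PLACEHOLDER = "PASTE_YOUR_BEARER_TOKEN_HERE"


def load_bearer_token(env_var: str = "TWITTER_BEARER_TOKEN") -> str:
    """Return the token from the environment, or raise if it is missing or still the placeholder."""
    token = os.environ.get(env_var, "").strip()
    if not token or token == PLACEHOLDER:
        raise ValueError(f"Set {env_var} to a real bearer token before calling the API.")
    return token


try:
    bearer_token = load_bearer_token()
    print("Token loaded; safe to build tweepy.Client.")
except ValueError as err:
    # Stop here instead of sending a request that is expected to fail.
    print("Skipping scrape:", err)
```

Reading the token from the environment also keeps the credential out of the shared notebook.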

# Mandatory Question (5 points)

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? What is your opinion on the time provided to complete the assignment?

Overall, this assignment was both educational and practical. It helped me understand how real-world data is collected using APIs and how important data cleaning is before performing any analysis. Working with the Twitter (X) API gave me insight into how social media data can be extracted for machine learning or artificial intelligence research.

One of the most challenging parts of the assignment was setting up API authentication. Managing developer credentials, handling authentication errors (such as 401 Unauthorized), and understanding access limitations required patience and troubleshooting. It highlighted how external system permissions can affect technical implementation, even when the code is correct.

Another challenge was ensuring proper data cleaning. Removing duplicates, URLs, special characters, and handling missing values required careful attention to maintain data quality without losing meaningful information.

What I enjoyed most was seeing the full workflow, from data collection to cleaning and exporting the final dataset. It made the process feel complete and practical rather than purely theoretical. I also appreciated learning how to structure code clearly and perform a final data quality check before saving the dataset.

Regarding the time provided, it was generally reasonable. However, additional time would be helpful because API access issues and authentication errors can take a while to debug and are sometimes outside the student's control.

Overall, this assignment strengthened my understanding of data acquisition, preprocessing, and the importance of clean datasets in machine learning projects.