<a href="https://colab.research.google.com/github/rozankc/Roza_INFO5731_Spring2025/blob/main/KhanChowdhury_Roza_Assignment_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Monday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Question 1 (25 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]


(3) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(4) Collect all the information of the 904 narrators in the Densho Digital Repository.

(5) **Collect a total of 10000 reviews** of the top 100 most popular software from G2 and Capterra.


In [None]:
import requests
import json
import csv

base_url = 'https://ddr.densho.org/api/0.2/narrator/?format=json'
narrators = []
url_next = base_url

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
}

while url_next:
    try:
        response = requests.get(url_next, headers=headers, timeout=10)

        response.raise_for_status()
        data = response.json()
        narrators.extend(data.get('objects', []))

        url_next = data.get('next_api', None)

    except requests.exceptions.RequestException as e:
        print(f"Error {e}")
        break

with open('narrators.json', 'w', encoding='utf-8') as f:
    json.dump(narrators, f, indent=2, ensure_ascii=False)

columns = ["ID", "Full Name", "Generation", "Birth Location", "Birth Date", "Death Date", "Biography"]
with open('narrators.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=columns)
    writer.writeheader()
    for narrator in narrators:
        row = {
            "ID": narrator.get("id", ""),
            "Full Name": narrator.get("display_name", ""),
            "Generation": narrator.get("generation", ""),
            "Birth Location": narrator.get("birth_location", ""),
            "Birth Date": narrator.get("b_date", ""),
            "Death Date": narrator.get("d_date", ""),
            "Biography": narrator.get("bio", "")
        }
        writer.writerow(row)
        print(row)  # Print each row as it is written

print(f"Done!")



{'ID': '361', 'Full Name': 'Kay Aiko Abe', 'Generation': 'Nisei', 'Birth Location': 'Selleck, Washington', 'Birth Date': '1927-05-09T00:00:00', 'Death Date': '2018-06-26T00:00:00', 'Biography': "Nisei female. Born May 9, 1927, in Selleck, Washington. Spent much of childhood in Beaverton, Oregon, where father owned a farm. Influenced at an early age by parents' conversion to Christianity. During World War II, removed to the Portland Assembly Center, Oregon, and the Minidoka concentration camp, Idaho. After the war, worked to establish a successful volunteer program to feed the homeless in Seattle, Washington."}
{'ID': '291', 'Full Name': 'Art Abe', 'Generation': 'Nisei', 'Birth Location': 'Seattle, Washington', 'Birth Date': '1921-06-12T00:00:00', 'Death Date': '2015-08-13T00:00:00', 'Biography': 'Nisei male. Born June 12, 1921, in Seattle, Washington. Grew up in an area of Seattle with few other Japanese Americans, and was attending the University of Washington when Japan bombed Pearl 
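
The pagination loop in the cell above can also be written as a small function that receives the page-fetching step as a parameter, which makes it testable without network access. This is a minimal sketch, not the notebook's actual code: the function name `collect_paginated`, the `max_pages` guard, and the `fake_pages` data are assumptions added for illustration. The `objects` and `next_api` keys are the ones the cell above reads from the Densho response.

```python
from typing import Callable, Dict, List, Optional


def collect_paginated(
    first_url: str,
    fetch_page: Callable[[str], Dict],
    items_key: str = "objects",
    next_key: str = "next_api",
    max_pages: int = 1000,
) -> List[Dict]:
    """Follow 'next' links until they run out, a URL repeats, or max_pages is hit."""
    items: List[Dict] = []
    seen_urls = set()
    url: Optional[str] = first_url
    while url and url not in seen_urls and len(seen_urls) < max_pages:
        seen_urls.add(url)
        data = fetch_page(url)
        items.extend(data.get(items_key, []))
        url = data.get(next_key)  # None or missing key ends the loop
    return items


# Hypothetical two-page response, used only to exercise the loop offline.
fake_pages = {
    "page1": {"objects": [{"id": "1"}, {"id": "2"}], "next_api": "page2"},
    "page2": {"objects": [{"id": "3"}], "next_api": None},
}
narrators_demo = collect_paginated("page1", fake_pages.__getitem__)
print(len(narrators_demo))  # 3
```

With the real API, `fetch_page` could be a function that wraps `requests.get(url, headers=headers, timeout=10)` and returns `response.json()`, as the cell above does inline.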

# Question 2 (15 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [None]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')

# Load the CSV file
file_path = '/content/narrators.csv'  # Adjust path if needed
df = pd.read_csv(file_path)

# Ensure 'Biography' column exists
if 'Biography' not in df.columns:
    raise KeyError("The 'Biography' column is missing from the dataset.")

# (1) Remove noise, such as special characters and punctuations.
def remove_noise(text):
    return re.sub(r'[^a-zA-Z\s]', '', str(text))

df['Clean_Biography'] = df['Biography'].apply(remove_noise)
print("Step 1 Output:\n", df[['Biography', 'Clean_Biography']].head())

# (2) Remove numbers.
def remove_numbers(text):
    return re.sub(r'\d+', '', text)

df['Clean_Biography'] = df['Clean_Biography'].apply(remove_numbers)
print("\nStep 2 Output:\n", df[['Biography', 'Clean_Biography']].head())

# (3) Remove stopwords by using the stopwords list.
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    words = text.split()
    return ' '.join([word for word in words if word.lower() not in stop_words])

df['Clean_Biography'] = df['Clean_Biography'].apply(remove_stopwords)
print("\nStep 3 Output:\n", df[['Biography', 'Clean_Biography']].head())

# (4) Lowercase all texts
df['Clean_Biography'] = df['Clean_Biography'].str.lower()
print("\nStep 4 Output:\n", df[['Biography', 'Clean_Biography']].head())

# (5) Stemming.
stemmer = PorterStemmer()

def stem_text(text):
    words = text.split()
    return ' '.join([stemmer.stem(word) for word in words])

df['Clean_Biography'] = df['Clean_Biography'].apply(stem_text)
print("\nStep 5 Output:\n", df[['Biography', 'Clean_Biography']].head())

# (6) Lemmatization.
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    words = text.split()
    return ' '.join([lemmatizer.lemmatize(word) for word in words])

df['Clean_Biography'] = df['Clean_Biography'].apply(lemmatize_text)
print("\nStep 6 Output:\n", df[['Biography', 'Clean_Biography']].head())

# Save cleaned data to a new CSV file
df.to_csv('/content/clean_narrator.csv', index=False)
print("\nCleaned data saved to 'clean_narrator.csv'")


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Step 1 Output:
                                            Biography  \
0  Nisei female. Born May 9, 1927, in Selleck, Wa...   
1  Nisei male. Born June 12, 1921, in Seattle, Wa...   
2  Nisei female. Born October 31, 1925, in Seattl...   
3  Nisei female. Born July 8, 1928, in Boyle Heig...   
4  Sansei male. Born March 15, 1950, in Torrance,...   

                                     Clean_Biography  
0  Nisei female Born May   in Selleck Washington ...  
1  Nisei male Born June   in Seattle Washington G...  
2  Nisei female Born October   in Seattle Washing...  
3  Nisei female Born July   in Boyle Heights Cali...  
4  Sansei male Born March   in Torrance Californi...  

Step 2 Output:
                                            Biography  \
0  Nisei female. Born May 9, 1927, in Selleck, Wa...   
1  Nisei male. Born June 12, 1921, in Seattle, Wa...   
2  Nisei female. Born October 31, 1925, in Seattl...   
3  Nisei female. Born July 8, 1928, in Boyle Heig...   
4  Sansei male. Born
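
The Step 1 output above already shows no digits ("Born May   in Selleck"), because the pattern `[^a-zA-Z\s]` removes every character that is not a letter or whitespace, digits included. The sketch below reproduces that on one sentence from the data using only the standard library. The small `DEMO_STOPWORDS` set and the `normalize` helper are assumptions for illustration; the notebook itself uses NLTK's English stopword list.

```python
import re

# Small illustrative stopword set (assumption); the notebook uses NLTK's list.
DEMO_STOPWORDS = {"in", "the", "of", "a", "and"}


def remove_noise(text: str) -> str:
    # Keeps only letters and whitespace, so digits are removed here as well.
    return re.sub(r"[^a-zA-Z\s]", "", text)


def remove_numbers(text: str) -> str:
    return re.sub(r"\d+", "", text)


def normalize(text: str) -> str:
    # Lowercase, collapse repeated whitespace, and drop stopwords.
    tokens = text.lower().split()
    return " ".join(t for t in tokens if t not in DEMO_STOPWORDS)


sample = "Nisei female. Born May 9, 1927, in Selleck, Washington."
step1 = remove_noise(sample)
step2 = remove_numbers(step1)

print(repr(step1))       # 'Nisei female Born May   in Selleck Washington'
print(step1 == step2)    # True: step 2 has nothing left to remove
print(normalize(step2))  # nisei female born may selleck washington
```

To make the effect of step (2) visible in its own output, one option is to run number removal before noise removal, or to let the noise pattern keep digits (for example `[^a-zA-Z0-9\s]`).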

# Question 3 (15 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tags, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.

In [None]:
!pip install nltk spacy



In [None]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True
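
The analysis cells that follow total nouns, verbs, adjectives, and adverbs by adding up individual Penn Treebank tags (`NN`, `NNS`, `NNP`, ...). An equivalent way to express this, sketched here under the assumption that the tagger emits Penn Treebank tags, is to group by tag prefix. The names `POS_GROUPS` and `group_pos_counts` and the demo counts are hypothetical.

```python
from collections import Counter

# Penn Treebank prefixes for the four classes the question asks about.
POS_GROUPS = {"Nouns": "NN", "Verbs": "VB", "Adjectives": "JJ", "Adverbs": "RB"}


def group_pos_counts(tag_counts: Counter) -> dict:
    """Sum per-tag counts into the four coarse classes by tag prefix."""
    grouped = {name: 0 for name in POS_GROUPS}
    for tag, count in tag_counts.items():
        for name, prefix in POS_GROUPS.items():
            if tag.startswith(prefix):
                grouped[name] += count
    return grouped


# Hypothetical tag counts, only to show the grouping.
demo_counts = Counter({"NN": 5, "NNS": 2, "VBD": 3, "JJ": 1, "RBR": 1, "IN": 4})
print(group_pos_counts(demo_counts))
# {'Nouns': 7, 'Verbs': 3, 'Adjectives': 1, 'Adverbs': 1}
```

Tags outside the four classes, such as `IN` (preposition) or `WRB` (wh-adverb), are ignored, which matches the explicit sums used in the cells below.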

In [None]:
import pandas as pd
import nltk
import spacy
from collections import Counter
from nltk import pos_tag, word_tokenize
from spacy import displacy  # Import displacy for visualization

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

# Load the cleaned CSV file
file_path = '/content/clean_narrator.csv'  # Update this path as needed
df = pd.read_csv(file_path)

# Ensure 'Clean_Biography' column exists
if 'Clean_Biography' not in df.columns:
    raise KeyError("The 'Clean_Biography' column is missing from the dataset.")

# (1) Parts of Speech (POS) Tagging:
def pos_analysis(text):
    if pd.isna(text) or not isinstance(text, str):
        return [], Counter()
    words = word_tokenize(text)
    pos_tags = pos_tag(words)
    pos_counts = Counter(tag for word, tag in pos_tags)
    return pos_tags, pos_counts

# Apply POS analysis
df['POS_Tags'], df['POS_Counts'] = zip(*df['Clean_Biography'].apply(pos_analysis))

# Calculate total counts of Nouns, Verbs, Adjectives, and Adverbs
total_counts = Counter()
for count in df['POS_Counts']:
    total_counts.update(count)

print("Total Counts of Nouns, Verbs, Adjectives, and Adverbs:")
print(f"Nouns: {total_counts['NN'] + total_counts['NNS'] + total_counts['NNP'] + total_counts['NNPS']}")
print(f"Verbs: {total_counts['VB'] + total_counts['VBD'] + total_counts['VBG'] + total_counts['VBN'] + total_counts['VBP'] + total_counts['VBZ']}")
print(f"Adjectives: {total_counts['JJ'] + total_counts['JJR'] + total_counts['JJS']}")
print(f"Adverbs: {total_counts['RB'] + total_counts['RBR'] + total_counts['RBS']}")

# (2) Constituency and Dependency Parsing
def parse_sentences(text):
    if pd.isna(text) or not isinstance(text, str):
        return
    doc = nlp(text)
    for sent in doc.sents:
        print(f"\nSentence: {sent.text}")

        # Dependency Parsing
        print("Dependency Parsing:")
        for token in sent:
            print(f"{token.text} --> {token.dep_} --> {token.head.text}")

        # Dependency tree visualization; spaCy has no built-in constituency parser,
        # so displacy's 'dep' style draws dependency arcs, not a constituency tree
        print("Dependency Parse Visualization:")
        displacy.render(sent, style='dep', jupyter=True)

# Apply parsing to each biography
df['Clean_Biography'].apply(parse_sentences)

# (3) Named Entity Recognition (NER)
def named_entity_recognition(text):
    if pd.isna(text) or not isinstance(text, str):
        return [], Counter()
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    entity_counts = Counter(ent.label_ for ent in doc.ents)
    return entities, entity_counts

# Apply NER to each biography
df['Named_Entities'], df['Entity_Counts'] = zip(*df['Clean_Biography'].apply(named_entity_recognition))

# Print Named Entities and their counts
print("\nNamed Entities and their Counts:")
for count in df['Entity_Counts']:
    print(count)

# Save analyzed data to a new CSV file
df.to_csv('/content/analyzed_narrator.csv', index=False)
print("\nAnalyzed data saved to 'analyzed_narrator.csv'")

RuntimeError: Only a single TORCH_LIBRARY can be used to register the namespace prims; please put all of your definitions in a single TORCH_LIBRARY block.  If you were trying to specify implementations, consider using TORCH_LIBRARY_IMPL (which can be duplicated).  If you really intended to define operators for a single namespace in a distributed way, you can use TORCH_LIBRARY_FRAGMENT to explicitly indicate this.  Previous registration of TORCH_LIBRARY was registered at /dev/null:241; latest registration was registered at /dev/null:241

In [None]:
import pandas as pd
import nltk
import spacy
from collections import Counter
from nltk import pos_tag, word_tokenize
from spacy import displacy  # Import displacy for visualization

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

# Load the cleaned CSV file
file_path = '/content/clean_narrator.csv'  # Update this path as needed
df = pd.read_csv(file_path)

# Ensure 'Clean_Biography' column exists
if 'Clean_Biography' not in df.columns:
    raise KeyError("The 'Clean_Biography' column is missing from the dataset.")

# (1) Parts of Speech (POS) Tagging:
def pos_analysis(text):
    if pd.isna(text) or not isinstance(text, str):
        return [], Counter()
    words = word_tokenize(text)
    pos_tags = pos_tag(words)
    pos_counts = Counter(tag for word, tag in pos_tags)
    return pos_tags, pos_counts

# Apply POS analysis
df['POS_Tags'], df['POS_Counts'] = zip(*df['Clean_Biography'].apply(pos_analysis))

# Calculate total counts of Nouns, Verbs, Adjectives, and Adverbs
total_counts = Counter()
for count in df['POS_Counts']:
    total_counts.update(count)

print("Total Counts of Nouns, Verbs, Adjectives, and Adverbs:")
print(f"Nouns: {total_counts['NN'] + total_counts['NNS'] + total_counts['NNP'] + total_counts['NNPS']}")
print(f"Verbs: {total_counts['VB'] + total_counts['VBD'] + total_counts['VBG'] + total_counts['VBN'] + total_counts['VBP'] + total_counts['VBZ']}")
print(f"Adjectives: {total_counts['JJ'] + total_counts['JJR'] + total_counts['JJS']}")
print(f"Adverbs: {total_counts['RB'] + total_counts['RBR'] + total_counts['RBS']}")

# (2) Constituency and Dependency Parsing
def parse_sentences(text):
    if pd.isna(text) or not isinstance(text, str):
        return
    doc = nlp(text)

    # Process only the first sentence
    first_sentence = next(doc.sents)  # Get the first sentence
    print(f"\nSentence: {first_sentence.text}")

    # Dependency Parsing
    print("Dependency Parsing:")
    for token in first_sentence:
        print(f"{token.text} --> {token.dep_} --> {token.head.text}")

    # Dependency tree visualization; spaCy has no built-in constituency parser,
    # so displacy's 'dep' style draws dependency arcs, not a constituency tree
    print("Dependency Parse Visualization:")
    displacy.render(first_sentence, style='dep', jupyter=True)

# Apply parsing to the first biography only
parse_sentences(df['Clean_Biography'].iloc[0])  # Change index to select a different biography if needed

# (3) Named Entity Recognition (NER)
def named_entity_recognition(text):
    if pd.isna(text) or not isinstance(text, str):
        return [], Counter()
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    entity_counts = Counter(ent.label_ for ent in doc.ents)
    return entities, entity_counts

# Apply NER to every biography (only the first result is printed below)
df['Named_Entities'], df['Entity_Counts'] = zip(*df['Clean_Biography'].apply(named_entity_recognition))

# Print Named Entities and their counts for the first biography
print("\nNamed Entities and their Counts for the first biography:")
print(df['Entity_Counts'].iloc[0])

# Save analyzed data to a new CSV file
df.to_csv('/content/analyzed_narrator.csv', index=False)
print("\nAnalyzed data saved to 'analyzed_narrator.csv'")

RuntimeError: Only a single TORCH_LIBRARY can be used to register the namespace prims; please put all of your definitions in a single TORCH_LIBRARY block.  If you were trying to specify implementations, consider using TORCH_LIBRARY_IMPL (which can be duplicated).  If you really intended to define operators for a single namespace in a distributed way, you can use TORCH_LIBRARY_FRAGMENT to explicitly indicate this.  Previous registration of TORCH_LIBRARY was registered at /dev/null:241; latest registration was registered at /dev/null:241

In [None]:
pip install torch spacy

Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-curand-cu12==10.3.5.147 (from torch)
  Downloading nvidia_curand_cu12-10.3.5

In [None]:

pip install torch torchvision torchaudio spacy nltk pandas




In [None]:
pip install spacy



In [None]:
!pip uninstall torch spacy
!pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
!pip install spacy
!python -m spacy download en_core_web_sm

Found existing installation: torch 2.5.1+cu124
Uninstalling torch-2.5.1+cu124:
  Would remove:
    /usr/local/bin/convert-caffe2-to-onnx
    /usr/local/bin/convert-onnx-to-caffe2
    /usr/local/bin/torchfrtrace
    /usr/local/bin/torchrun
    /usr/local/lib/python3.11/dist-packages/functorch/*
    /usr/local/lib/python3.11/dist-packages/torch-2.5.1+cu124.dist-info/*
    /usr/local/lib/python3.11/dist-packages/torch/*
    /usr/local/lib/python3.11/dist-packages/torchgen/*
Proceed (Y/n)? ERROR: Operation cancelled by user
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/pip/_internal/cli/base_command.py", line 179, in exc_logging_wrapper
    status = run_func(*args)
             ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pip/_internal/commands/uninstall.py", line 106, in run
    uninstall_pathset = req.uninstall(
                        ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pip/_internal/req

In [None]:
import pandas as pd
import nltk
import spacy
from collections import Counter
from nltk import pos_tag, word_tokenize
from spacy import displacy  # Import displacy for visualization

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

# Load the cleaned CSV file
file_path = '/content/clean_narrator.csv'
df = pd.read_csv(file_path)

# Ensure 'Clean_Biography' column exists
if 'Clean_Biography' not in df.columns:
    raise KeyError("The 'Clean_Biography' column is missing from the dataset.")

# (1) Parts of Speech (POS) Tagging:
def pos_analysis(text):
    if pd.isna(text) or not isinstance(text, str):
        return [], Counter()
    words = word_tokenize(text)
    pos_tags = pos_tag(words)
    pos_counts = Counter(tag for word, tag in pos_tags)
    return pos_tags, pos_counts

# Apply POS analysis
df['POS_Tags'], df['POS_Counts'] = zip(*df['Clean_Biography'].apply(pos_analysis))

# Calculate total counts of Nouns, Verbs, Adjectives, and Adverbs
total_counts = Counter()
for count in df['POS_Counts']:
    total_counts.update(count)

print("\nTotal Counts of Nouns, Verbs, Adjectives, and Adverbs:")
print(f"Nouns: {total_counts['NN'] + total_counts['NNS'] + total_counts['NNP'] + total_counts['NNPS']}")
print(f"Verbs: {total_counts['VB'] + total_counts['VBD'] + total_counts['VBG'] + total_counts['VBN'] + total_counts['VBP'] + total_counts['VBZ']}")
print(f"Adjectives: {total_counts['JJ'] + total_counts['JJR'] + total_counts['JJS']}")
print(f"Adverbs: {total_counts['RB'] + total_counts['RBR'] + total_counts['RBS']}")

# (2) Constituency and Dependency Parsing
def parse_sentences(text):
    if pd.isna(text) or not isinstance(text, str):
        return
    doc = nlp(text)

    # Process only the first sentence
    first_sentence = next(doc.sents)  # Get the first sentence
    print(f"\nSentence: {first_sentence.text}")

    # Dependency Parsing
    print("Dependency Parsing:")
    for token in first_sentence:
        print(f"{token.text} --> {token.dep_} --> {token.head.text}")

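    # Dependency tree visualization; spaCy has no built-in constituency parser,
    # so displacy's 'dep' style draws dependency arcs, not a constituency tree
    print("Dependency Parse Visualization:")
    displacy.render(first_sentence, style='dep', jupyter=True)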

# Apply parsing to the first biography only
parse_sentences(df['Clean_Biography'].iloc[0])

# (3) Named Entity Recognition (NER)
def named_entity_recognition(text):
    if pd.isna(text) or not isinstance(text, str):
        return [], Counter()
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    entity_counts = Counter(ent.label_ for ent in doc.ents)
    return entities, entity_counts

# Apply NER to each biography
df['Named_Entities'], df['Entity_Counts'] = zip(*df['Clean_Biography'].apply(named_entity_recognition))

# Print Named Entities and their counts
print("\nNamed Entities and their Counts:")
for count in df['Entity_Counts']:
    print(count)

df.to_csv('/content/analyzed_narrator.csv', index=False)
print("\nAnalyzed data saved to 'analyzed_narrator.csv'")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!



Total Counts of Nouns, Verbs, Adjectives, and Adverbs:
Nouns: 30693
Verbs: 4856
Adjectives: 8657
Adverbs: 1119

Sentence: nisei femal born may selleck washington spent much childhood beaverton oregon father own farm influenc earli age parent convers christian world war ii remov portland assembl center oregon minidoka concentr camp idaho war work establish success volunt program feed homeless seattl washington
Dependency Parsing:
nisei --> compound --> femal
femal --> nsubj --> selleck
born --> acl --> femal
may --> aux --> selleck
selleck --> compound --> washington
washington --> nsubj --> spent
spent --> ROOT --> spent
much --> amod --> beaverton
childhood --> compound --> beaverton
beaverton --> dobj --> spent
oregon --> compound --> father
father --> appos --> beaverton
own --> ccomp --> spent
farm --> compound --> influenc
influenc --> nmod --> convers
earli --> compound --> parent
age --> compound --> parent
parent --> compound --> convers
convers --> dobj --> own
christian --> 
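
The listing above is a dependency parse: each line links one word to its head through a grammatical relation, and the root points to itself. A constituency parse instead groups words into nested phrases (NP, VP, ...), which spaCy's default pipeline does not produce. The sketch below contrasts the two representations on a short sentence that was annotated by hand for illustration (it is not parser output). The helper names `to_brackets` and `leaves` are hypothetical.

```python
# Hand-annotated example (not parser output) for "The father owned a farm".
# Constituency tree: nested (label, children...) tuples; leaves are (tag, word).
constituency = (
    "S",
    ("NP", ("DT", "The"), ("NN", "father")),
    ("VP", ("VBD", "owned"), ("NP", ("DT", "a"), ("NN", "farm"))),
)

# Dependency tree: one (word, relation, head) triple per token.
dependencies = [
    ("The", "det", "father"),
    ("father", "nsubj", "owned"),
    ("owned", "ROOT", "owned"),
    ("a", "det", "farm"),
    ("farm", "dobj", "owned"),
]


def to_brackets(node) -> str:
    """Render a constituency tree in bracketed notation."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return f"({label} {children[0]})"
    return f"({label} " + " ".join(to_brackets(c) for c in children) + ")"


def leaves(node) -> list:
    """Return the words of a constituency tree from left to right."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]
    return [word for child in children for word in leaves(child)]


print(to_brackets(constituency))
# (S (NP (DT The) (NN father)) (VP (VBD owned) (NP (DT a) (NN farm))))
for word, relation, head in dependencies:
    print(f"{word} --> {relation} --> {head}")
```

Both structures cover the same five words. The constituency tree answers "which words form a phrase together", while the dependency tree answers "which word does each word modify".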


Named Entities and their Counts:
Counter({'GPE': 4, 'PERSON': 1, 'NORP': 1, 'EVENT': 1})
Counter({'GPE': 5, 'PERSON': 3, 'ORG': 2, 'DATE': 1, 'FAC': 1, 'CARDINAL': 1})
Counter({'GPE': 4, 'PERSON': 1})
Counter({'GPE': 4, 'PERSON': 2, 'EVENT': 1})
Counter({'PERSON': 4, 'GPE': 1, 'NORP': 1})
Counter({'GPE': 7, 'ORG': 3, 'PERSON': 2, 'LOC': 2, 'DATE': 1})
Counter({'PERSON': 9, 'GPE': 2, 'ORG': 2, 'CARDINAL': 1})
Counter({'GPE': 3, 'PERSON': 1, 'ORG': 1})
Counter({'GPE': 8, 'PERSON': 4, 'ORG': 3, 'DATE': 1, 'NORP': 1})
Counter({'GPE': 4, 'ORG': 1, 'DATE': 1, 'PERSON': 1})
Counter({'GPE': 5, 'PERSON': 2, 'DATE': 1, 'EVENT': 1, 'ORG': 1})
Counter({'PERSON': 2, 'GPE': 2, 'ORG': 1, 'EVENT': 1})
Counter({'EVENT': 1})
Counter({'EVENT': 1})
Counter({'GPE': 3, 'PERSON': 2, 'ORG': 2})
Counter({'ORG': 4, 'GPE': 4, 'PERSON': 2, 'PRODUCT': 1})
Counter({'GPE': 2, 'NORP': 1, 'ORG': 1})
Counter({'GPE': 3, 'PERSON': 1, 'NORP': 1, 'EVENT': 1})
Counter({'GPE': 3, 'PERSON': 3, 'ORG': 2, 'DATE': 1, 'NORP': 1,
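
The output above lists one `Counter` per biography. To report a single count per entity type for the whole dataset, the per-row counters can be summed. A minimal sketch follows, with the first three counters copied from the output above as demo input; the function name `total_entity_counts` is an assumption.

```python
from collections import Counter


def total_entity_counts(per_row_counts) -> Counter:
    """Add up per-biography entity counters into one dataset-wide counter."""
    total = Counter()
    for counts in per_row_counts:
        total.update(counts)
    return total


# First three per-biography counters from the output above.
demo_rows = [
    Counter({'GPE': 4, 'PERSON': 1, 'NORP': 1, 'EVENT': 1}),
    Counter({'GPE': 5, 'PERSON': 3, 'ORG': 2, 'DATE': 1, 'FAC': 1, 'CARDINAL': 1}),
    Counter({'GPE': 4, 'PERSON': 1}),
]
totals = total_entity_counts(demo_rows)
for label, count in totals.most_common():
    print(f"{label}: {count}")
```

In the notebook, the same function could be applied to `df['Entity_Counts']` right after the NER step.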

# **The following questions must be answered using AI assistance**

# Question 4 (20 points)

Q4. (PART-1)
Scrape data from the GitHub Marketplace to gather details about popular actions. Using Python, send HTTP requests to multiple pages of the marketplace (1000 products in total), handling pagination through the page number in the URL. For each product, extract the name, a short description, and the URL.

Store the extracted data in a structured CSV file with columns for product name, description, URL, and page number. Add a time delay between requests to avoid overloading the server. ChatGPT can assist with parsing the HTML, handling errors, and generating reports from the collected data.

The goal is to complete the scraping within a specified time limit while keeping the process efficient and within GitHub’s usage guidelines.

(PART-2)

1. **Preprocess Data**: Clean the text by tokenizing, removing stopwords, and converting to lowercase.

2. Perform **Data Quality** operations.


Preprocessing:
Preprocessing involves cleaning the text by removing noise such as special characters, HTML tags, and unnecessary whitespace. It also includes tasks like tokenization, stopword removal, and lemmatization to standardize the text for analysis.

Data Quality:
Data quality checks ensure completeness, consistency, and accuracy by verifying that all required columns are filled and formatted correctly. They also involve identifying and removing duplicates, handling missing values, and confirming that the data accurately reflects the scraped content.


GitHub Marketplace page:
https://github.com/marketplace?type=actions
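
The preprocessing and data quality steps described above can be sketched with the standard library alone. This is an illustrative outline under stated assumptions, not the assignment's required implementation: the stopword set is a small stand-in for a full list, the row keys follow the column names requested in Prompt 1, and the sample rows (including their URLs) are hypothetical.

```python
import html
import re

# Small illustrative stopword set (assumption); a full list such as NLTK's could be used instead.
DEMO_STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "your"}
REQUIRED_COLUMNS = ("product_name", "description", "url")


def preprocess(text: str) -> str:
    text = html.unescape(text)                 # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)       # drop HTML tags
    text = re.sub(r"[^a-zA-Z\s]", " ", text)   # drop special characters and digits
    tokens = text.lower().split()              # lowercase and tokenize on whitespace
    return " ".join(t for t in tokens if t not in DEMO_STOPWORDS)


def quality_filter(rows):
    """Drop rows with missing fields, malformed URLs, or duplicate URLs."""
    seen_urls = set()
    clean_rows = []
    for row in rows:
        if any(not (row.get(col) or "").strip() for col in REQUIRED_COLUMNS):
            continue  # completeness: a required field is missing or empty
        if not row["url"].startswith("https://"):
            continue  # consistency: URL is not in the expected format
        if row["url"] in seen_urls:
            continue  # uniqueness: same product already kept
        seen_urls.add(row["url"])
        clean_rows.append({**row, "clean_description": preprocess(row["description"])})
    return clean_rows


# Hypothetical rows, used only to exercise the checks.
demo_rows = [
    {"product_name": "Demo Action", "description": "<b>Build</b> &amp; test your code",
     "url": "https://github.com/marketplace/actions/demo-action"},
    {"product_name": "Demo Action", "description": "Duplicate entry",
     "url": "https://github.com/marketplace/actions/demo-action"},
    {"product_name": "No Description", "description": "",
     "url": "https://github.com/marketplace/actions/no-description"},
]
cleaned = quality_filter(demo_rows)
print(len(cleaned))                     # 1
print(cleaned[0]["clean_description"])  # build test code
```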

In [None]:
!pip install requests beautifulsoup4 pandas



**Prompt 1:** Write Python code using requests and BeautifulSoup4 to scrape data from the GitHub Marketplace actions page: https://github.com/marketplace?type=actions. Extract the product name, short description, and product URL. The product name is in an a tag with the class marketplace-common-module__marketplace-item-link--jrIHf inside an h3 tag. The short description is in a p tag with the class line-clamp-2. The product URL is in the href attribute of the same a tag as the product name. Store the results in a list of dictionaries, where each dictionary contains 'product_name', 'description', and 'url' keys. Convert the list of dictionaries into a Pandas DataFrame and save it to a CSV file named 'github_marketplace_actions.csv'. Include columns for 'product_name', 'description', 'url', and 'page_number'.

**Result:** To help you scrape data from GitHub Marketplace, here's a step-by-step plan for your task:

Code Outline:
1. Setup and Libraries: Install the required libraries.
2. Define the Base URL and Pagination: Loop through all 500 pages.
3. Scrape Data: Extract product name, description, and URL.
4. Store Data: Save the data in a CSV file.
5. Handle Errors: Add error handling and delays.

In [None]:
import requests
from bs4 import BeautifulSoup
import csv
import time

# Base URL
base_url = "https://github.com/marketplace?type=actions"

# List to store the product data
product_data = []

# Loop through 500 pages (for a total of 1000 products)
for page in range(1, 501):  # Page 1 to 500
    print(f"Scraping page {page}: {base_url}&page={page}")
    url = f"{base_url}&page={page}"

    try:
        # Send a request to the page
        response = requests.get(url)
        response.raise_for_status()  # Check for request errors

        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find all product items (within <h3> and <p> tags)
        products = soup.find_all('div', class_='Box-sc-g0xbh4-0')

        # Loop through each product on the page
        for product in products:
            title_tag = product.find('a', class_='marketplace-common-module__marketplace-item-link--jrIHf')
            description_tag = product.find('p', class_='line-clamp-2')

            if title_tag and description_tag:
                product_name = title_tag.text.strip()
                short_description = description_tag.text.strip()
                product_url = "https://github.com" + title_tag['href']

                # Append the product data along with the current page number
                product_data.append([product_name, short_description, product_url, page])

        # Add a delay to avoid overloading the server
        time.sleep(2)

    except Exception as e:
        print(f"Error on page {page}: {e}")

# Write the data to a CSV file
with open("github_marketplace_products.csv", mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(["Product Name", "Short Description", "Product URL", "Page Number"])  # CSV header
    writer.writerows(product_data)

print("Data scraping completed. Saved to 'github_marketplace_products.csv'")


Scraping page 1: https://github.com/marketplace?type=actions&page=1
Error on page 1: 400 Client Error: Bad Request for url: https://github.com/marketplace?type=actions&page=1
Scraping page 2: https://github.com/marketplace?type=actions&page=2
Error on page 2: 400 Client Error: Bad Request for url: https://github.com/marketplace?type=actions&page=2
Scraping page 3: https://github.com/marketplace?type=actions&page=3
Error on page 3: 400 Client Error: Bad Request for url: https://github.com/marketplace?type=actions&page=3
Scraping page 4: https://github.com/marketplace?type=actions&page=4
Error on page 4: 400 Client Error: Bad Request for url: https://github.com/marketplace?type=actions&page=4
Scraping page 5: https://github.com/marketplace?type=actions&page=5
Error on page 5: 400 Client Error: Bad Request for url: https://github.com/marketplace?type=actions&page=5
Scraping page 6: https://github.com/marketplace?type=actions&page=6
Error on page 6: 400 Client Error: Bad Request for url: h

KeyboardInterrupt: 

**Prompt 2:** I am getting this error. Help me fix this. Error on page 29: 400 Client Error: Bad Request for url: https://github.com/marketplace?type=actions&page=29
Scraping page 30: https://github.com/marketplace?type=actions&page=30

**Result:** The 400 Client Error: Bad Request suggests that the request sent to GitHub's server might be incorrectly formatted or that the server is rejecting it. This can happen for several reasons, such as:

*   Incorrect URL Structure: Ensure that the URL is formed correctly.
*   Headers: Sometimes, websites might block requests if they don't include common headers (like a User-Agent) that mimic a browser.
*   Pagination Format: Double-check the pagination part of the URL.
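
The URL structure and pagination format can be checked without sending a request. The sketch below, which assumes the same `type` and `page` query parameters used in the scraper, builds the URL with `urllib.parse` and prints it for inspection. The `build_page_url` helper is a hypothetical name.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "https://github.com/marketplace"

def build_page_url(page):
    """Build a paginated Marketplace URL from explicit query parameters."""
    query = urlencode({"type": "actions", "page": page})
    return f"{BASE}?{query}"

url = build_page_url(29)
print(url)
print(parse_qs(urlsplit(url).query))  # parsed query parameters, for a quick sanity check
```

If the printed URL matches the one in the error message, the URL itself is probably well formed, which points toward the request headers as the next thing to examine.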

In [None]:
import requests
from bs4 import BeautifulSoup
import csv
import time

# Base URL
base_url = "https://github.com/marketplace?type=actions"

# List to store the product data
product_data = []

# Headers to mimic a browser request (avoid getting blocked by GitHub)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# Loop through 500 pages (for a total of 1000 products)
for page in range(1, 501):  # Page 1 to 500
    print(f"Scraping page {page}: {base_url}&page={page}")
    url = f"{base_url}&page={page}"

    try:
        # Send a request to the page with the headers
        response = requests.get(url, headers=headers)
        response.raise_for_status()  # Check for request errors

        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find all product items (within <h3> and <p> tags)
        products = soup.find_all('div', class_='Box-sc-g0xbh4-0')

        # Loop through each product on the page
        for product in products:
            title_tag = product.find('a', class_='marketplace-common-module__marketplace-item-link--jrIHf')
            description_tag = product.find('p', class_='line-clamp-2')

            if title_tag and description_tag:
                product_name = title_tag.text.strip()
                short_description = description_tag.text.strip()
                product_url = "https://github.com" + title_tag['href']

                # Append the product data along with the current page number
                product_data.append([product_name, short_description, product_url, page])

        # Add a delay to avoid overloading the server
        time.sleep(2)

    except Exception as e:
        print(f"Error on page {page}: {e}")

# Write the data to a CSV file
with open("github_marketplace_products.csv", mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(["Product Name", "Short Description", "Product URL", "Page Number"])  # CSV header
    writer.writerows(product_data)

print("Data scraping completed. Saved to 'github_marketplace_products.csv'")


Scraping page 1: https://github.com/marketplace?type=actions&page=1
Error on page 1: 400 Client Error: Bad Request for url: https://github.com/marketplace?type=actions&page=1
Scraping page 2: https://github.com/marketplace?type=actions&page=2
Error on page 2: 400 Client Error: Bad Request for url: https://github.com/marketplace?type=actions&page=2
Scraping page 3: https://github.com/marketplace?type=actions&page=3
Error on page 3: 400 Client Error: Bad Request for url: https://github.com/marketplace?type=actions&page=3
Scraping page 4: https://github.com/marketplace?type=actions&page=4
Error on page 4: 400 Client Error: Bad Request for url: https://github.com/marketplace?type=actions&page=4
Scraping page 5: https://github.com/marketplace?type=actions&page=5
Error on page 5: 400 Client Error: Bad Request for url: https://github.com/marketplace?type=actions&page=5
Scraping page 6: https://github.com/marketplace?type=actions&page=6
Error on page 6: 400 Client Error: Bad Request for url: h

KeyboardInterrupt: 

**Prompt 3:** I am still getting this error. Check the header.


**Result:** Sure! Below is the full code that handles the scraping from the GitHub Marketplace.

Updated Code with Debugging:

In [None]:
import requests
from bs4 import BeautifulSoup
import csv
import time
import warnings
from urllib3.exceptions import InsecureRequestWarning

# Suppress SSL warnings (only if necessary)
# warnings.simplefilter('ignore', InsecureRequestWarning)

# Set headers to avoid being blocked by GitHub
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7"
}

# Base URL and setup CSV file
base_url = "https://github.com/marketplace?type=actions"
csv_file = "github_marketplace_data.csv"

# Open CSV file for writing data
with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(["Product Name", "Short Description", "Product URL", "Page Number"])

    # Loop through 500 pages
    for page in range(1, 501):
        url = f"{base_url}&page={page}"
        print(f"Scraping page {page}: {url}")

        try:
            # Send GET request to GitHub Marketplace page
            response = requests.get(url, headers=headers)

            # If the response status code is not 200, log the error and skip the page
            if response.status_code != 200:
                print(f"Error on page {page}: {response.status_code} - Skipping...")
                continue

            # Parse the page content using BeautifulSoup
            soup = BeautifulSoup(response.text, 'html.parser')

            # Find all products on the page
            products = soup.find_all('div', class_='Box-sc-g0xbh4-0')

            for product in products:
                try:
                    # Extract the product name (linked in <a> inside <h3>)
                    product_name = product.find('h3', class_='marketplace-common-module__marketplace-item-link--jrIHf')
                    if product_name:
                        product_name = product_name.get_text(strip=True)
                    else:
                        product_name = "N/A"

                    # Extract the short description (inside <p> with line-clamp-2 class)
                    short_description = product.find('p', class_='line-clamp-2')
                    if short_description:
                        short_description = short_description.get_text(strip=True)
                    else:
                        short_description = "N/A"

                    # Extract the product URL (href inside <a> tag)
                    product_url = product.find('a', class_='marketplace-common-module__marketplace-item-link--jrIHf')
                    if product_url:
                        product_url = f"https://github.com{product_url['href']}"
                    else:
                        product_url = "N/A"

                    # Write product data to CSV
                    writer.writerow([product_name, short_description, product_url, page])

                except Exception as e:
                    print(f"Error extracting data for a product on page {page}: {e}")

            # Wait between requests to avoid overloading the server
            time.sleep(2)

        except Exception as e:
            print(f"Error on page {page}: {e}")

print("Scraping completed. Data saved to 'github_marketplace_data.csv'.")


Scraping page 1: https://github.com/marketplace?type=actions&page=1
Scraping page 2: https://github.com/marketplace?type=actions&page=2
Scraping page 3: https://github.com/marketplace?type=actions&page=3
Scraping page 4: https://github.com/marketplace?type=actions&page=4
Scraping page 5: https://github.com/marketplace?type=actions&page=5
Scraping page 6: https://github.com/marketplace?type=actions&page=6
Scraping page 7: https://github.com/marketplace?type=actions&page=7
Scraping page 8: https://github.com/marketplace?type=actions&page=8
Scraping page 9: https://github.com/marketplace?type=actions&page=9
Scraping page 10: https://github.com/marketplace?type=actions&page=10
Scraping page 11: https://github.com/marketplace?type=actions&page=11
Scraping page 12: https://github.com/marketplace?type=actions&page=12
Scraping page 13: https://github.com/marketplace?type=actions&page=13
Scraping page 14: https://github.com/marketplace?type=actions&page=14
Scraping page 15: https://github.com/m

The CSV file was empty, so I changed the class selectors, and then it worked.
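
One likely reason the earlier attempt found nothing useful is that it looked for the link class on the `h3` element, while Prompt 1 describes that class as belonging to the `a` tag inside the `h3`. The sketch below checks which tag actually carries a class. It uses a hypothetical HTML fragment and the standard library's `html.parser` in place of BeautifulSoup so that it runs without extra packages.

```python
from html.parser import HTMLParser

TARGET_CLASS = "marketplace-common-module__marketplace-item-link--jrIHf"

# Hypothetical fragment shaped like the markup described in Prompt 1.
SAMPLE_HTML = (
    '<div><h3><a class="' + TARGET_CLASS + '" href="/marketplace/actions/example">'
    'Example Action</a></h3><p class="line-clamp-2">Does something.</p></div>'
)

class ClassLocator(HTMLParser):
    """Record the names of tags whose class attribute contains TARGET_CLASS."""

    def __init__(self):
        super().__init__()
        self.tags_with_class = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if TARGET_CLASS in classes:
            self.tags_with_class.append(tag)

locator = ClassLocator()
locator.feed(SAMPLE_HTML)
print(locator.tags_with_class)
```

In this fragment only the `a` tag matches, so a search for an `h3` with that class would return nothing.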

In [None]:
import requests
from bs4 import BeautifulSoup
import csv
import time
import random

# Set headers to avoid being blocked
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7"
}

# Base URL
base_url = "https://github.com/marketplace?type=actions"

# CSV file to save data
csv_file = "github_marketplace_data.csv"
all_products = []

# Open CSV file for writing
with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(["Product Name", "Short Description", "Product URL", "Page Number"])

    # Loop through the first 500 pages
    for page in range(1, 501):
        url = f"{base_url}&page={page}"
        print(f"Scraping page {page}: {url}")

        try:
            # Send GET request
            response = requests.get(url, headers=headers)

            # Check if request was successful
            if response.status_code != 200:
                print(f"Error on page {page}: {response.status_code} - Skipping...")
                continue

            # Parse the HTML
            soup = BeautifulSoup(response.text, 'html.parser')

            # Find all product items
            products = soup.find_all('a', class_='marketplace-common-module__marketplace-item-link--jrIHf')

            if not products:
                print(f"No products found on page {page}. Check selectors!")
                continue

            for product in products:
                try:
                    # Extract Product Name
                    product_name = product.get_text(strip=True) if product else "N/A"

                    # Extract Product URL
                    product_url = f"https://github.com{product['href']}" if product else "N/A"

                    # Short description
                    short_description = "N/A"
                    description_tag = product.find_next('p', class_='mt-1 mb-0 text-small fgColor-muted line-clamp-2')
                    if description_tag:
                        short_description = description_tag.get_text(strip=True)

                    # Save to list and CSV file
                    product_data = [product_name, short_description, product_url, page]
                    all_products.append(product_data)
                    writer.writerow(product_data)

                except Exception as e:
                    print(f"Error extracting product on page {page}: {e}")

            # Sleep with a random delay to avoid bot detection
            time.sleep(random.uniform(3, 6))

        except Exception as e:
            print(f"Error on page {page}: {e}")

# Print 10 sample results
print("\n✅ Scraping completed! Here are 10 sample results:\n")
for sample in all_products[:10]:
    print(sample)

print(f"\nData saved to '{csv_file}'.")



Scraping page 1: https://github.com/marketplace?type=actions&page=1
Scraping page 2: https://github.com/marketplace?type=actions&page=2
Scraping page 3: https://github.com/marketplace?type=actions&page=3
Scraping page 4: https://github.com/marketplace?type=actions&page=4
Scraping page 5: https://github.com/marketplace?type=actions&page=5
Scraping page 6: https://github.com/marketplace?type=actions&page=6
Scraping page 7: https://github.com/marketplace?type=actions&page=7
Scraping page 8: https://github.com/marketplace?type=actions&page=8
No products found on page 8. Check selectors!
Scraping page 9: https://github.com/marketplace?type=actions&page=9
Scraping page 10: https://github.com/marketplace?type=actions&page=10
No products found on page 10. Check selectors!
Scraping page 11: https://github.com/marketplace?type=actions&page=11
Scraping page 12: https://github.com/marketplace?type=actions&page=12
Scraping page 13: https://github.com/marketplace?type=actions&page=13
Scraping page 1
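
Pages 8 and 10 above returned no products even though neighbouring pages did, which may indicate a transient response rather than a real end of results. One option, sketched below under that assumption, is to retry an empty page a few times with a growing delay. `fetch_with_retry` and `flaky_fetch` are hypothetical helpers; `fetch` stands in for whatever function requests and parses one page.

```python
import time

def fetch_with_retry(fetch, page, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(page) until it returns a non-empty list or attempts run out."""
    for attempt in range(1, attempts + 1):
        items = fetch(page)
        if items:
            return items
        if attempt < attempts:
            sleep(base_delay * 2 ** (attempt - 1))  # wait 1s, 2s, 4s, ... between tries
    return []

# Stand-in fetcher that fails twice and then succeeds, to exercise the retry path.
calls = {"n": 0}

def flaky_fetch(page):
    calls["n"] += 1
    return [] if calls["n"] < 3 else [f"product-from-page-{page}"]

result = fetch_with_retry(flaky_fetch, page=8, sleep=lambda seconds: None)
print(result, calls["n"])
```

Passing `sleep` as a parameter keeps the example fast to run; in the scraper the default `time.sleep` would apply.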

**Prompt 4:** Generate Python code that loads a CSV file named github_marketplace_data.csv into a Pandas DataFrame, cleans the 'description' column by removing special characters, tokenizing, converting to lowercase, and removing stopwords using NLTK, stores the cleaned text in a new 'cleaned_description' column, performs data quality checks for missing values and duplicate rows, prints a summary of these checks, and saves the resulting DataFrame to a new CSV file named clean_github_marketplace_data.csv, ensuring the code includes the necessary NLTK downloads.

**Result:** Here’s the code for cleaning and preprocessing the text data, along with data quality operations like handling missing values, removing duplicates, and renaming the CSV file.

In [None]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Load the CSV file
df = pd.read_csv('github_marketplace_data.csv')

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Preprocessing function
def preprocess_text(text):
    if isinstance(text, str):  # Make sure it's a string
        # Convert to lowercase
        text = text.lower()

        # Remove HTML tags
        text = re.sub(r'<.*?>', '', text)

        # Remove special characters (keep only words and spaces)
        text = re.sub(r'[^a-zA-Z\s]', '', text)

        # Tokenization
        words = word_tokenize(text)

        # Remove stopwords
        stop_words = set(stopwords.words('english'))
        words = [word for word in words if word not in stop_words]

        # Lemmatization
        words = [lemmatizer.lemmatize(word) for word in words]

        # Join words back into a string
        text = ' '.join(words)

    return text

# Apply preprocessing to product name and description
df['Product Name'] = df['Product Name'].apply(preprocess_text)
df['Short Description'] = df['Short Description'].apply(preprocess_text)

# Data Quality Operations

# 1. Remove duplicates
df.drop_duplicates(subset=['Product Name', 'Short Description', 'Product URL'], keep='first', inplace=True)

# 2. Handle missing values: Fill missing values in 'Short Description' with 'N/A'
df['Short Description'].fillna('N/A', inplace=True)

# 3. Ensure completeness: Make sure all rows have values for required columns
df.dropna(subset=['Product Name', 'Product URL'], inplace=True)

# 4. Reset index after dropping rows
df.reset_index(drop=True, inplace=True)

# Save the cleaned data to a new CSV file
df.to_csv('clean_github_marketplace_data.csv', index=False)

print("Data preprocessing and quality operations completed. Cleaned data saved to 'clean_github_marketplace_data.csv'.")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


LookupError: 
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - '/root/nltk_data'
    - '/usr/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************


After running `nltk.download('punkt_tab')`, **the code worked.**

In [None]:
import pandas as pd
import nltk
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import Counter
from nltk.stem import WordNetLemmatizer

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('punkt_tab')  # Required by word_tokenize in recent NLTK releases
nltk.download('stopwords')
nltk.download('wordnet')

# Load the raw scraped CSV file
file_path = '/content/github_marketplace_data.csv'
df = pd.read_csv(file_path)

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Ensure 'Product Name' and 'Short Description' columns exist
if 'Product Name' not in df.columns or 'Short Description' not in df.columns:
    raise KeyError("The 'Product Name' or 'Short Description' column is missing from the dataset.")

# Preprocessing function to clean text
def preprocess_text(text):
    if isinstance(text, str):  # Make sure it's a string
        # Convert to lowercase
        text = text.lower()

        # Remove HTML tags
        text = re.sub(r'<.*?>', '', text)

        # Remove special characters (keep only words and spaces)
        text = re.sub(r'[^a-zA-Z\s]', '', text)

        # Tokenization
        words = word_tokenize(text)

        # Remove stopwords
        stop_words = set(stopwords.words('english'))
        words = [word for word in words if word not in stop_words]

        # Lemmatization
        words = [lemmatizer.lemmatize(word) for word in words]

        # Join words back into a string
        text = ' '.join(words)

    return text

# Apply preprocessing to 'Product Name' and 'Short Description'
df['Product Name'] = df['Product Name'].apply(preprocess_text)
df['Short Description'] = df['Short Description'].apply(preprocess_text)

# Data Quality Operations

# 1. Remove duplicates based on 'Product Name', 'Short Description', and 'Product URL'
df.drop_duplicates(subset=['Product Name', 'Short Description', 'Product URL'], keep='first', inplace=True)

# 2. Handle missing values: Fill missing values in 'Short Description' with 'N/A'
df['Short Description'] = df['Short Description'].fillna('N/A')

# 3. Ensure completeness: Make sure all rows have values for required columns
df.dropna(subset=['Product Name', 'Product URL'], inplace=True)

# 4. Reset index after dropping rows
df.reset_index(drop=True, inplace=True)

# Save the cleaned data to a new CSV file
df.to_csv('clean_github_marketplace_data.csv', index=False)

print("\nData preprocessing and quality operations completed. Cleaned data saved to 'clean_github_marketplace_data.csv'.")

# Save the cleaned data to a new CSV file
df.to_csv('clean_github_marketplace_data.csv', index=False)

print("\nData preprocessing and quality operations completed. Cleaned data saved to 'clean_github_marketplace_data.csv'.")
print(f"Number of rows after preprocessing: {len(df)}")

if not df.empty:
    print("\nRandom 10 Samples of Cleaned Data:")
    print(df.sample(min(10, len(df))))  # Avoids ValueError if df has fewer than 10 rows
else:
    print("The DataFrame is empty after preprocessing.")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!



Data preprocessing and quality operations completed. Cleaned data saved to 'clean_github_marketplace_data.csv'.

Data preprocessing and quality operations completed. Cleaned data saved to 'clean_github_marketplace_data.csv'.
Number of rows after preprocessing: 9361

Random 10 Samples of Cleaned Data:
                        Product Name  \
1148              fast forward merge   
8715  check vulnerable nuget package   
9103                  mdformataction   
4373             android translation   
6739        set environment variable   
3649                          tomchv   
4720              black duck scanner   
1821       vercel preview deployment   
3602              run kubernetes job   
2197                      create pdf   

                                      Short Description  \
1148   merge pull request fast forwarding target branch   
8715       github action check vulnerable nuget package   
9103  github action linting markdown file using mdfo...   
4373  github action 

# Question 5 (20 points)

Part 1:
Collect tweets from Twitter using the Tweepy API, specifically targeting hashtags related to the subtopics (machine learning or artificial intelligence).
The extracted data includes the tweet ID, username, and text.

Part 2:
Perform data cleaning procedures.

A final data quality check ensures the completeness and consistency of the dataset. The cleaned data is then saved into a CSV file for further analysis.


**Note**

1.   Follow the tutorials provided in Canvas to obtain API keys. Use ChatGPT to get the code. Make sure the file is downloaded and saved.
2.   Make sure you divide the GPT code as shown in the tutorials; don't make multiple requests.


In [None]:
!pip install tweepy



In [None]:
!pip install tweepy[async]

Collecting async-lru<3,>=1.0.3 (from tweepy[async])
  Downloading async_lru-2.0.4-py3-none-any.whl.metadata (4.5 kB)
Downloading async_lru-2.0.4-py3-none-any.whl (6.1 kB)
Installing collected packages: async-lru
Successfully installed async-lru-2.0.4


In [None]:
!pip install git+https://github.com/tweepy/tweepy.git

Collecting git+https://github.com/tweepy/tweepy.git
  Cloning https://github.com/tweepy/tweepy.git to /tmp/pip-req-build-bd15f83e
  Running command git clone --filter=blob:none --quiet https://github.com/tweepy/tweepy.git /tmp/pip-req-build-bd15f83e
  Resolved https://github.com/tweepy/tweepy.git to commit db28c0e84826485755eb7fcef0c30f75395dff5f
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


Code starts from here

**Prompt 1:** How do I authenticate with the Twitter API?

**Result:** Correct Authentication for Twitter API (Tweepy)
1️⃣ Using OAuth 1.0a for Read/Write Access (Posting, Retweeting, Following, etc.)
If you need full API access (to post tweets, like, follow, etc.), use OAuth 1.0a with OAuth1UserHandler:

In [None]:
import os
import tweepy

# Twitter API credentials, read from environment variables instead of being hard-coded.
# The variable names below are assumptions; set them in the runtime before running this cell.
API_KEY = os.environ["TWITTER_API_KEY"]
API_SECRET_KEY = os.environ["TWITTER_API_SECRET_KEY"]
ACCESS_TOKEN = os.environ["TWITTER_ACCESS_TOKEN"]
ACCESS_TOKEN_SECRET = os.environ["TWITTER_ACCESS_TOKEN_SECRET"]
BEARER_TOKEN = os.environ["TWITTER_BEARER_TOKEN"]


# Authenticate using OAuth 1.0a (for read/write access)
auth = tweepy.OAuth1UserHandler(API_KEY, API_SECRET_KEY)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

# Create an API object
api = tweepy.API(auth)

# Verify authentication
try:
    user = api.verify_credentials()
    print(f"Authenticated as: {user.screen_name}")
except Exception as e:
    print("Authentication failed:", e)


Authenticated as: NaserRoza69005


**Prompt 2:** Write Python code using Tweepy to authenticate with the Twitter API using these keys: [Your API Key], [Your API Secret Key], [Your Access Token], [Your Access Token Secret]. Include error handling. Also, using the authenticated Tweepy API, write Python code to search for tweets with the hashtag #bigdata. Extract the tweet ID, username, and tweet text. Save the results into a list of dictionaries. Add comments to explain each step. Lastly, convert the list of dictionaries into a Pandas DataFrame and save it as a CSV file named 'tweets_machine_learning_ai.csv'.

**Result:** Code to Scrape Tweets Using Tweepy

In [None]:
import os
import tweepy
import pandas as pd

# Twitter API credentials, read from environment variables instead of being hard-coded.
# The variable names below are assumptions; set them in the runtime before running this cell.
API_KEY = os.environ["TWITTER_API_KEY"]
API_SECRET_KEY = os.environ["TWITTER_API_SECRET_KEY"]
ACCESS_TOKEN = os.environ["TWITTER_ACCESS_TOKEN"]
ACCESS_TOKEN_SECRET = os.environ["TWITTER_ACCESS_TOKEN_SECRET"]
BEARER_TOKEN = os.environ["TWITTER_BEARER_TOKEN"]


# Authenticate using OAuth 1.0a
auth = tweepy.OAuth1UserHandler(API_KEY, API_SECRET_KEY)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

# Create an API object
api = tweepy.API(auth, wait_on_rate_limit=True)  # Avoids hitting rate limits

# Define search parameters
query = "#bigdata -filter:retweets"  # Search #bigdata and exclude retweets
max_tweets = 100  # Number of tweets to fetch

# Fetch tweets
tweets_data = []
for tweet in tweepy.Cursor(api.search_tweets, q=query, lang="en", tweet_mode="extended").items(max_tweets):
    tweets_data.append({
        "Tweet_ID": tweet.id,
        "Username": tweet.user.screen_name,
        "Text": tweet.full_text
    })

# Convert to DataFrame
df = pd.DataFrame(tweets_data)

# Display first few rows
print(df.head())

# Save to CSV
df.to_csv("tweets_machine_learning_ai.csv", index=False)
print("Tweets saved to CSV file successfully!")


Forbidden: 403 Forbidden
453 - You currently have access to a subset of X API V2 endpoints and limited v1.1 endpoints (e.g. media post, oauth) only. If you need access to this endpoint, you may need a different access level. You can learn more here: https://developer.x.com/en/portal/product

**Prompt 3:** I am getting this error. Forbidden: 403 Forbidden
453 - You currently have access to a subset of X API V2 endpoints and limited v1.1 endpoints (e.g. media post, oauth) only. If you need access to this endpoint, you may need a different access level. You can learn more here: https://developer.x.com/en/portal/product. Why? Fix my Code.

**Result:** ❌ Error: "403 Forbidden - You currently have access to a subset of X API V2 endpoints"
This error means your Twitter Developer Account does not have access to the Twitter v1.1 search API (search_tweets).

✅ Solution: Use Twitter API v2 with Tweepy’s Client.search_recent_tweets()
Since Twitter API v1.1 is restricted, we need to use Twitter API v2, which requires a Bearer Token.

🔹 Updated Code to Scrape Tweets Using Twitter API v2

In [None]:
import os
import tweepy
import pandas as pd

# Twitter API credentials, read from an environment variable instead of being hard-coded.
# The variable name below is an assumption; set it in the runtime before running this cell.
BEARER_TOKEN = os.environ["TWITTER_BEARER_TOKEN"]  # API v2 requires a Bearer Token

# Authenticate with the OAuth 2.0 Bearer Token (app-only)
client = tweepy.Client(bearer_token=BEARER_TOKEN)

# Define search parameters
query = "#bigdata -is:retweet"
max_tweets = 100  # Number of tweets to fetch

# Fetch tweets
tweets_data = []
response = client.search_recent_tweets(query=query, max_results=min(max_tweets, 100), tweet_fields=["id", "text", "author_id"])

# Process tweets
if response.data:
    for tweet in response.data:
        tweets_data.append({
            "Tweet_ID": tweet.id,
            "Username": tweet.author_id,  # Numeric author ID; v2 does not return the username by default
            "Text": tweet.text
        })

# Convert to DataFrame
df = pd.DataFrame(tweets_data)

# Display first few rows
print(df.head())

# Save to CSV
df.to_csv("tweets_machine_learning_ai.csv", index=False)
print("Tweets saved to CSV file successfully!")


              Tweet_ID             Username  \
0  1892399658145595522  1888728911288385536   
1  1892396798573990338  1887445725988200452   
2  1892396774075031840  1887445725988200452   
3  1892395073423151479   855649514937434112   
4  1892393870601691583   737142202481016832   

                                                Text  
0  #BigData #AIInBusiness #AdvancedAnalytics #Dat...  
1  #Kadena #Bullrun $KDA $FLUX $BTC #KDA #Crypto ...  
2  #Kadena #Bullrun $KDA $FLUX $BTC #KDA #Crypto ...  
3  SAFe® 6.0 and SAFe Studio - iPMOGuide https://...  
4  https://t.co/1SltPcUaAm\nmy @enilev🪔mom \n#MWC...  
Tweets saved to CSV file successfully!
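
The `Username` column above holds numeric author IDs rather than usernames. Tweepy's `Client.search_recent_tweets` accepts an `expansions="author_id"` argument, and the matching user objects are then returned under `response.includes["users"]`. The sketch below shows only the joining step, using stand-in objects with hypothetical values instead of a live API response; `attach_usernames` is a hypothetical helper name.

```python
from types import SimpleNamespace

def attach_usernames(tweets, users):
    """Map each tweet's author_id to a username using the expanded user objects."""
    username_by_id = {user.id: user.username for user in users}
    return [
        {
            "Tweet_ID": tweet.id,
            # Fall back to the numeric ID when no matching user was returned.
            "Username": username_by_id.get(tweet.author_id, str(tweet.author_id)),
            "Text": tweet.text,
        }
        for tweet in tweets
    ]

# Stand-in objects (hypothetical values) shaped like response.data and response.includes["users"].
tweets = [
    SimpleNamespace(id=1, author_id=10, text="#bigdata example"),
    SimpleNamespace(id=2, author_id=99, text="author not in the expansion"),
]
users = [SimpleNamespace(id=10, username="example_user")]

rows = attach_usernames(tweets, users)
print(rows)
```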


**Prompt 4:** Write Python code using Pandas to load the 'tweets_machine_learning_ai.csv' file. Remove duplicate tweets and tweets with missing text. Remove URLs from the text, and convert the text to lowercase. Save the cleaned data to a new CSV file named 'cleaned_tweets.csv'. Print a summary of the data quality checks.

**Result:** Code to Clean Tweets Data

In [None]:
import pandas as pd
import re

# Load the CSV file
df = pd.read_csv("tweets_machine_learning_ai.csv")

# Initial Data Quality Check
print(f"Initial dataset shape: {df.shape}")
print(f"Missing values before cleaning:\n{df.isnull().sum()}")

# Drop duplicates based on tweet text
df.drop_duplicates(subset="Text", keep="first", inplace=True)

# Remove rows where text is missing
df.dropna(subset=["Text"], inplace=True)

# Function to clean tweet text
def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r"http\S+|www\S+", "", text)  # Remove URLs (http\S+ already covers https)
    return text.strip()

# Apply text cleaning
df["Text"] = df["Text"].apply(clean_text)

print("\nFirst few rows of the cleaned data:\n")
print(df.head())

# Final Data Quality Check
print(f"\nDataset shape after cleaning: {df.shape}")
print(f"Missing values after cleaning:\n{df.isnull().sum()}")
print(f"Number of unique tweets: {df['Text'].nunique()}")

# Save cleaned data to a new CSV file
df.to_csv("cleaned_tweets.csv", index=False)
print("\n✅ Cleaned data saved to 'cleaned_tweets.csv'.")


Initial dataset shape: (100, 3)
Missing values before cleaning:
Tweet_ID    0
Username    0
Text        0
dtype: int64

First few rows of the cleaned data:

              Tweet_ID             Username  \
0  1892399658145595522  1888728911288385536   
1  1892396798573990338  1887445725988200452   
2  1892396774075031840  1887445725988200452   
3  1892395073423151479   855649514937434112   
4  1892393870601691583   737142202481016832   

                                                Text  
0  #bigdata #aiinbusiness #advancedanalytics #dat...  
1  #kadena #bullrun $kda $flux $btc #kda #crypto ...  
2  #kadena #bullrun $kda $flux $btc #kda #crypto ...  
3  safe® 6.0 and safe studio - ipmoguide   #ipmog...  
4  my @enilev🪔mom \n#mwc25 @mwchub\n@gsma 🌐#bosch...  

Dataset shape after cleaning: (99, 3)
Missing values after cleaning:
Tweet_ID    0
Username    0
Text        0
dtype: int64
Number of unique tweets: 48

✅ Cleaned data saved to 'cleaned_tweets.csv'.
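
The recorded output above reports 99 rows but only 48 unique texts. This is consistent with duplicates being dropped before cleaning: tweets that differ only in their shortened URL or letter case become identical once normalized. A minimal sketch of one way to handle this, assuming the same `Text` column, is to clean first and deduplicate afterwards. The function name `clean_and_dedupe` and the sample rows are made up for illustration:

```python
import re

import pandas as pd

URL_PATTERN = re.compile(r"http\S+|www\S+")


def clean_text(text: str) -> str:
    """Lowercase the text and strip URLs and surrounding whitespace."""
    return URL_PATTERN.sub("", text.lower()).strip()


def clean_and_dedupe(df: pd.DataFrame, text_col: str = "Text") -> pd.DataFrame:
    """Drop missing text, normalize it, then drop duplicates on the cleaned text."""
    out = df.dropna(subset=[text_col]).copy()
    out[text_col] = out[text_col].map(clean_text)
    out = out.drop_duplicates(subset=text_col, keep="first")
    return out.reset_index(drop=True)


# Illustrative rows: the first two differ only in case and URL
sample = pd.DataFrame({
    "Tweet_ID": [1, 2, 3, 4],
    "Text": [
        "#Kadena rally https://t.co/abc",
        "#kadena rally https://t.co/xyz",
        None,
        "Other tweet",
    ],
})

cleaned = clean_and_dedupe(sample)
print(cleaned)
```

With this ordering, the row count and `nunique()` of the cleaned `Text` column agree.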


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

# Write your response below
Fill out the survey and provide your valuable feedback.

https://docs.google.com/forms/d/e/1FAIpQLSd_ObuA3iNoL7Az_C-2NOfHodfKCfDzHZtGRfIker6WyZqTtA/viewform?usp=dialog