# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Monday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (25 points)

Write a python program to collect text data from **either of the following sources** and save the data into a **csv file:**

(1) Collect all the customer reviews of a product (you can choose any porduct) on amazon. [atleast 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie recently in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of atleast 2 or 3 movies]


(3) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artifical intelligence", or "information extraction" from Semantic Scholar.

(4) Collect all the information of the 904 narrators in the Densho Digital Repository.

(5)**Collect a total of 10000 reviews** of the top 100 most popular software from G2 and Capterra.


In [2]:
# Install necessary libraries
!apt-get update # update apt repository
!apt install -y wget curl unzip
!wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
!dpkg -i google-chrome-stable_current_amd64.deb
!apt --fix-broken install -y

# Install the required Python libraries
!pip install selenium webdriver-manager pandas

# Import libraries
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException, ElementClickInterceptedException
from webdriver_manager.chrome import ChromeDriverManager
import time
import pandas as pd

# Function to scrape IMDb reviews
def scrape_imdb_reviews(url, max_reviews=1000):
    # Set up Chrome options
    chrome_options = Options()
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    chrome_options.add_argument("--headless")  # Run headless for Colab
    chrome_options.add_argument(
        "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    )

    # Set up the driver and WebDriver Manager
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service, options=chrome_options)

    all_reviews = []

    try:
        driver.get(url)
        # Wait for the initial review articles to load
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "article")))

        # Click the "All" dropdown button if available to load full review details.
        try:
            all_button = WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.XPATH, "//button[contains(@class, 'ipc-see-more__button')]"))
            )
            # Scroll element into view before clicking
            driver.execute_script("arguments[0].scrollIntoView(true);", all_button)
            time.sleep(0.5)  # Small delay after scrolling
            all_button.click()
            time.sleep(2)
        except TimeoutException:
            print("No 'All' button found or clickable. Continuing without clicking it.")
        except ElementClickInterceptedException:
            print("ElementClickInterceptedException on 'All' button. Trying JavaScript click.")
            driver.execute_script("arguments[0].click();", all_button)
            time.sleep(2)

        while len(all_reviews) < max_reviews:
            # Scroll to bottom to prompt lazy-loading of reviews
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)  # brief pause after scrolling

            # Get current review articles and remember the count before loading more
            review_elements = driver.find_elements(By.TAG_NAME, "article")
            current_count = len(review_elements)

            for review in review_elements:
                if len(all_reviews) >= max_reviews:
                    break
                try:
                    # Attempt to get the rating; use None if not found
                    try:
                        rating = review.find_element(By.CLASS_NAME, "ipc-rating-star--rating").text
                    except NoSuchElementException:
                        rating = None

                    # Get the title and remove any extra text such as "Expand"
                    try:
                        title = review.find_element(By.CLASS_NAME, "ipc-title__text").text.replace(" \nExpand", "")
                    except Exception:
                        title = ""

                    # Expand review content if a spoiler button is present
                    try:
                        spoiler_button = review.find_element(By.CLASS_NAME, "review-spoiler-button")
                         # Scroll element into view before clicking
                        driver.execute_script("arguments[0].scrollIntoView(true);", spoiler_button)
                        time.sleep(0.2)
                        spoiler_button.click()
                        time.sleep(0.5)
                    except Exception:
                        pass

                    # Extract author and date if available
                    try:
                        author_info = review.find_element(By.CLASS_NAME, "iHZNcU")
                        author = author_info.find_element(By.CLASS_NAME, "ipc-link").text
                        date = author_info.find_element(By.CLASS_NAME, "review-date").text
                    except Exception:
                        author = ""
                        date = ""

                    # Extract the helpful votes (if available)
                    try:
                        helpful_votes = review.find_element(By.CLASS_NAME, "ipc-voting_label_count--up").text
                    except NoSuchElementException:
                        helpful_votes = ""

                    review_data = {
                        'rating': rating,
                        'title': title,
                        'author': author,
                        'date': date,
                        'helpful_votes': helpful_votes
                    }

                    # Avoid duplicates by checking if the review data is already added
                    if review_data not in all_reviews:
                        all_reviews.append(review_data)
                        print(f"Scraped review {len(all_reviews)}: {title[:50]}...")

                except Exception as e:
                    print(f"Error scraping a review: {str(e)}")
                    continue

            # If we already reached enough reviews, break out of the loop
            if len(all_reviews) >= max_reviews:
                break

            # Attempt to find and click the "Load More" button to load additional reviews
            try:
                load_more = WebDriverWait(driver, 10).until(
                    EC.element_to_be_clickable((By.XPATH, "//button[contains(@class, 'ipc-see-more__button')]"))
                )

                # Scroll element into view before clicking
                driver.execute_script("arguments[0].scrollIntoView(true);", load_more)
                time.sleep(0.5)  # Small delay after scrolling
                load_more.click()


                # Wait until new reviews have been loaded by checking that the count has increased.
                WebDriverWait(driver, 10).until(lambda d: len(d.find_elements(By.TAG_NAME, "article")) > current_count)
                time.sleep(1)
            except TimeoutException:
                print("No more reviews to load or timeout reached while waiting for more reviews.")
                break
            except ElementClickInterceptedException:
                print("ElementClickInterceptedException on 'Load More' button.  Trying JavaScript click.")
                load_more = WebDriverWait(driver, 10).until(
                    EC.presence_of_element_located((By.XPATH, "//button[contains(@class, 'ipc-see-more__button')]"))
                )
                driver.execute_script("arguments[0].click();", load_more)
                time.sleep(2)

        # Save the data to a CSV file
        if all_reviews:
            df = pd.DataFrame(all_reviews)
            df.to_csv('imdb_reviews.csv', index=False)
            print(f"\nSuccessfully scraped {len(all_reviews)} reviews.")
            print("Data saved to 'imdb_reviews.csv'.")

    except Exception as e:
        print(f"Error during scraping: {str(e)}")
        # Save partial results if an error occurred
        if all_reviews:
            df = pd.DataFrame(all_reviews)
            df.to_csv('imdb_reviews_partial.csv', index=False)
            print("Saved partial results to 'imdb_reviews_partial.csv'.")

    finally:
        driver.quit()


# Main script
url = "https://www.imdb.com/title/tt15398776/reviews/?ref_=tt_ov_ql_2"  # Example URL of reviews
scrape_imdb_reviews(url, max_reviews=1000)


Get:1 https://dl.google.com/linux/chrome/deb stable InRelease [1,825 B]
Hit:2 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Get:3 https://dl.google.com/linux/chrome/deb stable/main amd64 Packages [1,217 B]
Hit:4 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:5 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:7 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:8 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:9 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:10 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:11 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:12 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Fetched 3,042 B in 2s (1,467 B/s)
Reading package lists... Done
W: Skipping acquire of configured file 'main/source/Sources' as repos

# Question 2 (15 points)

Write a python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the csv file. The data cleaning steps include: [Code and output is required for each part]

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [3]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download necessary NLTK data
nltk.download('stopwords')
nltk.download('wordnet')

# Load the dataset
df = pd.read_csv('/content/imdb_reviews.csv')

# Check if the dataset has a column named 'title' or 'review', and choose one to clean
review_column = 'title' if 'title' in df.columns else 'review'

# Initialize NLP tools
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    if pd.isna(text):  # Check for NaN values
        return ""

    # 1. Remove special characters and punctuation
    text = re.sub(r'[^\w\s]', '', text)

    # 2. Remove numbers
    text = re.sub(r'\d+', '', text)

    # 3. Convert to lowercase
    text = text.lower()

    # 4. Remove stopwords
    words = text.split()
    words = [word for word in words if word not in stop_words]

    # 5. Apply stemming
    stemmed_words = [stemmer.stem(word) for word in words]

    # 6. Apply lemmatization
    lemmatized_words = [lemmatizer.lemmatize(word) for word in stemmed_words]

    return ' '.join(lemmatized_words)

# Apply cleaning function to the dataset
df['clean_review'] = df[review_column].astype(str).apply(clean_text)

# Save cleaned data to a new CSV file
df.to_csv('imdb_reviews_cleaned.csv', index=False)

# Display the first few rows to verify
print(df[['title', 'clean_review']].head())

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


                                               title  \
0                              Murphy is exceptional   
1  A challenging watch to be sure, but a worthwhi...   
2                             Quality but exhausting   
3                           And the Oscar goes to...   
4  A brilliantly layered examination of a man thr...   

                                        clean_review  
0                                      murphi except  
1                  challeng watch sure worthwhil one  
2                                    qualiti exhaust  
3                                          oscar goe  
4  brilliantli layer examin man throughout incred...  


# Question 3 (15 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [4]:
import pandas as pd
import nltk
from nltk import pos_tag, word_tokenize
from nltk.tree import Tree
import spacy
import subprocess  # Import the subprocess module

# Load the spacy model
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    print("Downloading en_core_web_sm model...")
    subprocess.run(["python", "-m", "spacy", "download", "en_core_web_sm"])
    nlp = spacy.load("en_core_web_sm")


def syntax_analysis(df, text_column='clean_review'):
    """
    Performs syntax and structure analysis on the text data.

    Args:
      df (pd.DataFrame): DataFrame containing the text data.
      text_column (str): The name of the column containing the text data.

    Returns:
      pd.DataFrame: The input DataFrame with added POS tags, constituency parsing, dependency parsing and named entities.
    """

    # 1. Parts of Speech (POS) Tagging
    def get_pos_counts(text):
        """ Tags parts of speech and calculates counts for Noun, Verb, Adjective, Adverb. """
        try:
            tokens = word_tokenize(text)
            tagged = pos_tag(tokens)
            counts = {'Noun': 0, 'Verb': 0, 'Adjective': 0, 'Adverb': 0}
            for word, tag in tagged:
                if tag.startswith('N'):
                    counts['Noun'] += 1
                elif tag.startswith('V'):
                    counts['Verb'] += 1
                elif tag.startswith('J'):
                    counts['Adjective'] += 1
                elif tag.startswith('R'):
                    counts['Adverb'] += 1
            return counts
        except Exception as e:
            print(f"POS Tagging Error: {e}")
            return {'Noun': 0, 'Verb': 0, 'Adjective': 0, 'Adverb': 0}  # Return zeros on error


    df['pos_counts'] = df[text_column].apply(get_pos_counts)

    # 2. Constituency Parsing and Dependency Parsing using SpaCy
    def get_parse_trees(text):
        """ Generates constituency and dependency parse trees using SpaCy. """
        try:
            doc = nlp(text)

            # Dependency Parsing Tree
            dep_tree = [(token.text, token.dep_, token.head.text) for token in doc]

            # Constituency Parsing (using SpaCy's sentence structure)
            constituency_tree = []
            for chunk in doc.noun_chunks:
                constituency_tree.append((chunk.text, chunk.root.dep_, chunk.root.head.text))

            return constituency_tree, dep_tree
        except Exception as e:
            print(f"Parsing Error: {e}")
            return [], []


    df[['constituency_tree', 'dependency_tree']] = df[text_column].apply(lambda x: pd.Series(get_parse_trees(x)))

    # 3. Named Entity Recognition
    def get_named_entities(text):
        """ Extracts named entities and counts occurrences. """
        try:
            doc = nlp(text)
            entities = {}
            for ent in doc.ents:
                if ent.label_ in entities:
                    entities[ent.label_] += 1
                else:
                    entities[ent.label_] = 1
            return entities
        except Exception as e:
            print(f"NER Error: {e}")
            return {}

    df['named_entities'] = df[text_column].apply(get_named_entities)
    return df


# Load the cleaned data
cleaned_df = pd.read_csv('/content/imdb_reviews_cleaned.csv')

# Perform syntax analysis
analyzed_df = syntax_analysis(cleaned_df)

# Display the results for the first row

print("Example Row Analysis:")
print("----------------------")
print("Original Title:", analyzed_df['title'][0])
print("Cleaned Title:", analyzed_df['clean_review'][0])
print("POS Counts:", analyzed_df['pos_counts'][0])
print("Constituency Tree:", analyzed_df['constituency_tree'][0])
print("Dependency Tree:", analyzed_df['dependency_tree'][0])
print("Named Entities:", analyzed_df['named_entities'][0])

# Example Explanation (using the first sentence)
print("\nExample Explanation:")
print("--------------------")
example_sentence = analyzed_df['clean_review'][0]
print("Example Sentence:", example_sentence)

print("\nDependency Parsing Example:")
print("The dependency parsing tree represents the relationships between words in the sentence.  Each word is connected to another word (its head) by a directed edge, representing the type of dependency.")
example_dep_tree = analyzed_df['dependency_tree'][0]
print(example_dep_tree)
print("""
    For example, ('Heavy', 'amod', 'Handed') means that the word 'Heavy' is an adjectival modifier (amod) of the word 'Handed'.
""")

print("\nConstituency Parsing Example:")
print("The constituency parsing tree divides the sentence into constituents (phrases).")
example_const_tree = analyzed_df['constituency_tree'][0]
print(example_const_tree)
print("""
    Here, each tuple shows the phrases and dependencies of the sentences.
""")

# Save the analyzed DataFrame to a new CSV file
analyzed_df.to_csv('oppenheimer_imdb_reviews_analyzed.csv', index=False)
print("\nAnalysis saved to 'oppenheimer_imdb_reviews_analyzed.csv'")


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
  >>> nltk.download('punkt_tab')
  [0m
  For more information see: https://www.nltk.org/data.html

  Attempted to load [93mtokenizers/punkt_tab/english/[0m

  Searched in:
    - '/root/nltk_data'
    - '/usr/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************

POS Tagging Error: 
**********************************************************************
  Resource [93mpunkt_tab[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('punkt_tab')
  [0m
  For more information see: https://www.nltk.org/data.html

  Attempted to load [93mtokenizers/punkt_tab/english/[0m

  Searched in:
    - '/root/nltk_data'
    - '/usr/nltk_data'
    - '/usr/share/nltk_dat

# **Following Questions must answer using AI assitance**

#Question 4 (20 points).

Q4. (PART-1)
Web scraping data from the GitHub Marketplace to gather details about popular actions. Using Python, the process begins by sending HTTP requests to multiple pages of the marketplace (1000 products), handling pagination through dynamic page numbers. The key details extracted include the product name, a short description, and the URL.

 The extracted data is stored in a structured CSV format with columns for product name, description, URL, and page number. A time delay is introduced between requests to avoid server overload. ChatGPT can assist by helping with the parsing of HTML, error handling, and generating reports based on the data collected.

 The goal is to complete the scraping within a specified time limit, ensuring that the process is efficient and adheres to GitHub’s usage guidelines.

(PART -2)

1.   **Preprocess Data**: Clean the text by tokenizing, removing stopwords, and converting to lowercase.

2. Perform **Data Quality** operations.


Preprocessing:
Preprocessing involves cleaning the text by removing noise such as special characters, HTML tags, and unnecessary whitespace. It also includes tasks like tokenization, stopword removal, and lemmatization to standardize the text for analysis.

Data Quality:
Data quality checks ensure completeness, consistency, and accuracy by verifying that all required columns are filled and formatted correctly. Additionally, it involves identifying and removing duplicates, handling missing values, and ensuring the data reflects the true content accurately.


Github MarketPlace page:
https://github.com/marketplace?type=actions

In [12]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
from nltk.corpus import stopwords

import random
# Base URL of GitHub Marketplace (Actions section)
BASE_URL = "https://github.com/marketplace?type=actions&page="

# Headers to mimic a browser request.  Added Accept-Encoding
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,/;q=0.8",
    "Referer": "https://www.google.com/",
    "Accept-Encoding": "gzip, deflate, br",  # Critical:  Tell the server we accept compressed responses
}

# Initialize list to store data
data = []
page = 0
max_retries = 3
total_scraped = 0
max_products = 1000

# Scrape multiple pages (Assuming 40 products per page, we need ~25 pages for 1000 products)
while total_scraped < max_products:
    page += 1
    url = f"{BASE_URL}{page}"
    for attempt in range(max_retries):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()  # Raise HTTPError for bad responses (4xx or 5xx)
            break  # If the request was successful, break out of the retry loop
        except requests.exceptions.RequestException as e:
            print(f"Failed to fetch page {page} (attempt {attempt + 1}/{max_retries}): {e}")
            if attempt < max_retries - 1:
                time.sleep(5 * (attempt + 1))  # Exponential backoff
            else:
                print(f"Failed to retrieve page {page} after {max_retries} attempts.")
                continue # Skip to the next page if all retries fail

    soup = BeautifulSoup(response.text, "html.parser")

    # Find all product containers
    items = soup.find_all('div', {'data-testid': 'non-featured-item'})

    num_items_scraped = 0
    for item in items:
        try:
            name = item.find('a', class_='marketplace-common-module__marketplace-item-link--jrIHf').text.strip()
            product_url = "https://github.com" + item.find('a')['href']
            description = item.find('p', class_='text-small').text.strip()
            data.append([name, description, product_url, page])
            num_items_scraped+=1
        except AttributeError as e:
            print(f"AttributeError parsing item on page {page}: {e}")
        except Exception as e:
            print(f"Unexpected error parsing item on page {page}: {e}")

    total_scraped += num_items_scraped

    if num_items_scraped == 0:
        print(f'Page {page} contain no more actions. Stopping...')
        break # There is not more actions in the github market place

    print(f"Scraped {num_items_scraped} products from page {page}, Total {total_scraped} actions")
    time.sleep(random.uniform(3, 7))  # Be very polite

# Save to CSV
df = pd.DataFrame(data, columns=["Product Name", "Description", "URL", "Page Number"])

df.to_csv("github_marketplace_actions.csv", index=False)
print("Scraping completed! Data saved to github_marketplace_actions.csv")

Scraped 20 products from page 1, Total 20 actions
Scraped 20 products from page 2, Total 40 actions
Scraped 20 products from page 3, Total 60 actions
Scraped 20 products from page 4, Total 80 actions
Scraped 20 products from page 5, Total 100 actions
Scraped 20 products from page 6, Total 120 actions
Scraped 20 products from page 7, Total 140 actions
Page 8 contain no more actions. Stopping...
Scraping completed! Data saved to github_marketplace_actions.csv


In [6]:
# Add these imports at the top
from nltk.stem import WordNetLemmatizer
import re
import pandas as pd
from tqdm import tqdm

# Initialize NLP resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    """
    Comprehensive text preprocessing pipeline with error handling
    """
    try:
        if pd.isna(text):
            return ""

        # Clean text
        text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)  # Remove URLs
        text = re.sub(r'\W+', ' ', text)  # Remove special characters
        text = re.sub(r'\d+', '', text)  # Remove numbers
        text = text.lower().strip()

        # Tokenization and lemmatization
        tokens = word_tokenize(text)
        tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

        return ' '.join(tokens)

    except Exception as e:
        print(f"Error processing text: {e}")
        return ""

def ensure_data_quality(df):
    """
    Comprehensive data quality checks and cleaning
    """
    # 1. Handle missing values
    print("\nData Quality Report:")
    print("Initial shape:", df.shape)

    # Critical columns check
    critical_cols = ['Product Name', 'URL']
    df = df.dropna(subset=critical_cols, how='any')

    # 2. Remove duplicates
    dup_count = df.duplicated(subset=['URL']).sum()
    print(f"Removing {dup_count} duplicate entries")
    df = df.drop_duplicates(subset=['URL'], keep='first')

    # 3. Validate URLs
    url_pattern = r'^https?://github\.com/.*'
    valid_urls = df['URL'].str.contains(url_pattern, na=False)
    print(f"Found {len(df) - valid_urls.sum()} invalid URLs")
    df = df[valid_urls]

    # 4. Clean text columns
    print("Processing text columns...")
    tqdm.pandas(desc="Cleaning Descriptions")
    df['Cleaned Description'] = df['Description'].progress_apply(preprocess_text)

    # 5. Final check
    print("\nFinal Data Quality Check:")
    print("Missing values per column:")
    print(df.isna().sum())
    print("\nData types:")
    print(df.dtypes)
    print("\nFinal shape:", df.shape)

    return df

# Add these lines AFTER creating the initial DataFrame but BEFORE saving to CSV
print("\nStarting data quality checks and preprocessing...")

# Run data quality pipeline
df_clean = ensure_data_quality(df)

# Save cleaned data
df_clean.to_csv("github_marketplace_actions_cleaned.csv", index=False)
print("\nCleaned data saved to github_marketplace_actions_cleaned.csv")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!



Starting data quality checks and preprocessing...

Data Quality Report:
Initial shape: (140, 4)
Removing 0 duplicate entries
Found 0 invalid URLs
Processing text columns...


Cleaning Descriptions: 100%|██████████| 140/140 [00:00<00:00, 2984.18it/s]

Error processing text: 
**********************************************************************
  Resource [93mpunkt_tab[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('punkt_tab')
  [0m
  For more information see: https://www.nltk.org/data.html

  Attempted to load [93mtokenizers/punkt_tab/english/[0m

  Searched in:
    - '/root/nltk_data'
    - '/usr/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************

Error processing text: 
**********************************************************************
  Resource [93mpunkt_tab[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('punkt_tab')
  [0m
  For more information see: https://www.nltk.org/dat




#Question 5 (20 points)

PART 1:
Web Scrape  tweets from Twitter using the Tweepy API, specifically targeting hashtags related to subtopics (machine learning or artificial intelligence.)
The extracted data includes the tweet ID, username, and text.

Part 2:
Perform data cleaning procedures

A final data quality check ensures the completeness and consistency of the dataset. The cleaned data is then saved into a CSV file for further analysis.


**Note**

1.   Follow tutorials provided in canvas to obtain api keys. Use ChatGPT to get the code. Make sure the file is downloaded and saved.
2.   Make sure you divide GPT code as shown in tutorials, dont make multiple requestes.


In [21]:
import tweepy
import time

# Set your Bearer Token here (for API v2)
bearer_token = 'AAAAAAAAAAAAAAAAAAAAAEGlzQEAAAAAk%2FzdseHVsevoyskzo1yOBfLZWMU%3DyKwaEIHkCbia4iqp3dHSLaDjJh6n2MXENv2I1Pka8qKws66K3B'  # Replace with your actual Bearer Token

# Create a Tweepy client with Bearer Token (API v2)
client = tweepy.Client(bearer_token=bearer_token)

# Define the search query and the number of tweets you want to fetch
hashtag = '#generativeAI -is:retweet'

# Function to fetch recent tweets with smaller requests and retry logic
def fetch_tweets_with_retry(query, max_results=10, retries=3):
    attempt = 0
    while attempt < retries:
        try:
            tweets = client.search_recent_tweets(query=query, tweet_fields=["created_at", "text", "author_id"], max_results=max_results)
            return tweets
        except tweepy.errors.TooManyRequests as e:
            # If rate-limited, wait and retry with a shorter backoff
            wait_time = 30  # Retry after 30 seconds instead of longer
            print(f"Rate limit exceeded. Waiting for {wait_time} seconds.")
            time.sleep(wait_time)
            attempt += 1
        except Exception as e:
            print(f"An error occurred: {e}")
            break
    return None

# Fetch tweets with the query using retry logic (smaller max_results per request)
tweets = fetch_tweets_with_retry(query=hashtag, max_results=10)

# Display fetched tweets
if tweets and tweets.data:
    for tweet in tweets.data:
        print(f"Tweet ID: {tweet.id}")
        print(f"Author ID: {tweet.author_id}")
        print(f"Tweet Text: {tweet.text}")
        print(f"Created At: {tweet.created_at}")
        print("-----")
else:
    print("Failed to fetch tweets after retries.")

Rate limit exceeded. Waiting for 30 seconds.
Rate limit exceeded. Waiting for 30 seconds.
Rate limit exceeded. Waiting for 30 seconds.
Failed to fetch tweets after retries.


In [22]:
import re
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk

# Make sure to download necessary NLTK data if not already done
nltk.download('stopwords')
nltk.download('punkt')

# Function to clean tweet text
def clean_tweet_text(text):
    # Remove URLs, mentions (@username), hashtags (#hashtag)
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'#\w+', '', text)

    # Remove special characters and digits
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Convert to lowercase
    text = text.lower()

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = word_tokenize(text)
    cleaned_text = ' '.join([word for word in tokens if word not in stop_words])

    return cleaned_text

# Create an empty list to store the cleaned tweet data
cleaned_data = []

# If the tweets data is not empty, clean the text and append to the cleaned_data list
if tweets and tweets.data:
    for tweet in tweets.data:
        tweet_id = tweet.id
        author_id = tweet.author_id
        original_text = tweet.text
        cleaned_text = clean_tweet_text(original_text)

        # Append cleaned tweet data as a list
        cleaned_data.append([tweet_id, author_id, original_text, cleaned_text])
else:
    print("No tweets found")

# Create a DataFrame from the cleaned data
df = pd.DataFrame(cleaned_data, columns=['Tweet ID', 'Author ID', 'Original Text', 'Cleaned Text'])

# Check if there are any missing values
print(f"Missing values: {df.isnull().sum()}")

# Perform a final data quality check for completeness and consistency
# For simplicity, let's just drop rows with missing values (if any)
df.dropna(inplace=True)

# Save the cleaned data to a CSV file
df.to_csv('cleaned_tweets_data.csv', index=False)

print("Data cleaning completed! Cleaned data saved to 'cleaned_tweets_data.csv'")

No tweets found
Missing values: Tweet ID         0
Author ID        0
Original Text    0
Cleaned Text     0
dtype: int64
Data cleaning completed! Cleaned data saved to 'cleaned_tweets_data.csv'


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

# Write your response below
Fill out survey and provide your valuable feedback.

https://docs.google.com/forms/d/e/1FAIpQLSd_ObuA3iNoL7Az_C-2NOfHodfKCfDzHZtGRfIker6WyZqTtA/viewform?usp=dialog