<a href="https://colab.research.google.com/github/sivarohith99/SivaRohith_INFO5731_Fall2024/blob/main/Jampana_SivaRohith_Assignment_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will gather text data from an open data source via web scraping or an API. You will then clean the text data and perform syntactic analysis on it. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Tuesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a recent movie from 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [None]:
# Your code here
import requests
import pandas as pd
from bs4 import BeautifulSoup
import time
import random
import re

def extract_product_id(url):
    # Try to find a product ID (ASIN) in the URL
    patterns = [
        r'/dp/([A-Z0-9]{10})',
        r'/product/([A-Z0-9]{10})',
        r'/([A-Z0-9]{10})(?:/|$)',
    ]

    for pattern in patterns:
        match = re.search(pattern, url)
        if match:
            return match.group(1)
    return None

def construct_review_url(product_id):
    return f"https://www.amazon.com/product-reviews/{product_id}/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews"

def get_reviews(url, max_reviews=1000):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.5938.132 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
    }

    product_id = extract_product_id(url)
    if not product_id:
        print("Could not extract product ID from URL")
        return []

    review_url = construct_review_url(product_id)
    reviews_list = []
    page_number = 1

    while len(reviews_list) < max_reviews:
        try:
            current_url = f"{review_url}&pageNumber={page_number}"
            print(f"Fetching page {page_number}...")

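            # A timeout keeps a stalled connection from hanging the loop;
            # any requests exception is handled by the except block below.
            response = requests.get(current_url, headers=headers, timeout=30)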
            if response.status_code != 200:
                print(f"Failed to fetch page {page_number}. Status code: {response.status_code}")
                break

            soup = BeautifulSoup(response.content, 'html.parser')

            # Find all review elements
            reviews = soup.find_all('div', {'data-hook': 'review'})

            if not reviews:
                print(f"No reviews found on page {page_number}, stopping...")
                break

            for review in reviews:
                try:
                    # Extract review data with multiple possible selectors
                    title_elem = (review.find('a', {'data-hook': 'review-title'}) or
                                 review.find('span', {'data-hook': 'review-title'}))
                    rating_elem = (review.find('i', {'data-hook': 'review-star-rating'}) or
                                  review.find('span', {'data-hook': 'review-star-rating'}))
                    body_elem = review.find('span', {'data-hook': 'review-body'})
                    date_elem = review.find('span', {'data-hook': 'review-date'})

                    if all([title_elem, rating_elem, body_elem, date_elem]):
                        title = title_elem.text.strip()
                        rating = float(rating_elem.text.split()[0])
                        body = body_elem.text.strip()
                        date = date_elem.text.strip()

                        reviews_list.append({
                            'Title': title,
                            'Rating': rating,
                            'Review': body,
                            'Date': date
                        })
                except Exception as e:
                    print(f"Error processing a review: {str(e)}")
                    continue

            print(f"Collected {len(reviews_list)} reviews so far...")

            # Random delay between requests
            time.sleep(random.uniform(2, 4))
            page_number += 1

        except Exception as e:
            print(f"Error on page {page_number}: {str(e)}")
            break

    return reviews_list

def save_to_csv(reviews, filename='amazon_reviews.csv'):
    if not reviews:
        print("No reviews to save!")
        return

    df = pd.DataFrame(reviews)
    df.to_csv(filename, index=False, encoding='utf-8')
    print(f"Saved {len(reviews)} reviews to {filename}")

def main():
    try:
        url = input("Enter the Amazon product URL: ").strip()

        if not url:
            print("URL cannot be empty!")
            return

        print("Starting to collect reviews...")
        reviews = get_reviews(url, max_reviews=1000)

        if reviews:
            filename = input("Enter the output CSV filename (default 'amazon_reviews.csv'): ").strip() or 'amazon_reviews.csv'
            save_to_csv(reviews, filename)

            print("\nSample Reviews:")
            for i, review in enumerate(reviews[:5], start=1):
                print(f"\nReview {i}:")
                print(f"Title: {review['Title']}")
                print(f"Rating: {review['Rating']}")
                print(f"Date: {review['Date']}")
                print(f"Review: {review['Review'][:100]}...")
        else:
            print("No reviews were collected! This might be due to:")
            print("1. Anti-scraping measures from Amazon")
            print("2. The product might not have any reviews")
            print("3. The URL might be incorrect")
            print("\nConsider:")
            print("- Using a VPN or proxy")
            print("- Checking if the product URL is correct")
            print("- Verifying that the product has reviews")

    except Exception as e:
        print(f"An error occurred: {str(e)}")

if __name__ == "__main__":
    main()


Enter the Amazon product URL: https://www.amazon.com/Apple-MX532LL-A-AirTag/dp/B0CWXNS552/ref=zg_bs_c_electronics_d_sccl_2/137-3609910-8111266?pd_rd_w=5aP2l&content-id=amzn1.sym.7379aab7-0dd8-4729-b0b5-2074f1cb413d&pf_rd_p=7379aab7-0dd8-4729-b0b5-2074f1cb413d&pf_rd_r=B0F8Q9BRWXXG6BCT7NKY&pd_rd_wg=v8XeA&pd_rd_r=61c47469-0ea5-4dc2-af9b-426cddf2c61f&pd_rd_i=B0CWXNS552&psc=1
Starting to collect reviews...
Fetching page 1...
Collected 10 reviews so far...
Fetching page 2...
Collected 20 reviews so far...
Fetching page 3...
Collected 30 reviews so far...
Fetching page 4...
Collected 40 reviews so far...
Fetching page 5...
Collected 50 reviews so far...
Fetching page 6...
Collected 60 reviews so far...
Fetching page 7...
Collected 70 reviews so far...
Fetching page 8...
Collected 80 reviews so far...
Fetching page 9...
Collected 90 reviews so far...
Fetching page 10...
Collected 100 reviews so far...
Fetching page 11...
No reviews found on page 11, stopping...
Enter the output CSV filename (d
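
As a quick check of the URL handling above, the sketch below reuses the same three patterns from `extract_product_id` and runs them on a shortened form of the product URL from this run. The second URL form and the bare homepage URL are illustrative inputs that were not part of the original run.

```python
import re

def extract_product_id(url):
    """Return the 10-character product ID (ASIN) found in an Amazon URL, or None."""
    patterns = [
        r'/dp/([A-Z0-9]{10})',        # .../dp/<ASIN>/...
        r'/product/([A-Z0-9]{10})',   # .../product/<ASIN>
        r'/([A-Z0-9]{10})(?:/|$)',    # fallback: any 10-character path segment
    ]
    for pattern in patterns:
        match = re.search(pattern, url)
        if match:
            return match.group(1)
    return None

examples = [
    # Shortened version of the URL entered in the run above
    "https://www.amazon.com/Apple-MX532LL-A-AirTag/dp/B0CWXNS552/ref=zg_bs_c_electronics_d_sccl_2",
    # Illustrative alternative URL form (assumption, not from the run)
    "https://www.amazon.com/gp/product/B0CWXNS552",
    # No product ID present
    "https://www.amazon.com/",
]

for url in examples:
    print(extract_product_id(url))
```

Because the patterns are tried in order, the specific `/dp/` and `/product/` forms take precedence over the looser fallback, which could otherwise match any unrelated 10-character uppercase path segment.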

# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [None]:
# Write code for each of the sub parts with proper comments.

import pandas as pd  # For handling data in tabular format
import re  # For regular expressions to process text
import nltk  # For various natural language processing tasks
from nltk.corpus import stopwords  # To filter out common words (stopwords)
from nltk.stem import PorterStemmer, WordNetLemmatizer  # For stemming and lemmatization

# Download NLTK resources if they haven't been downloaded yet
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

def load_reviews(file_path='amazon_reviews.csv'):
    """Load reviews from a CSV file and return as a DataFrame."""
    return pd.read_csv(file_path)

def clean_text(review_text):
    """Clean the provided text through multiple steps."""

    # Step 1: Remove special characters and punctuation
    review_text = re.sub(r'[^\w\s]', '', review_text)
    print(f"After removing noise: {review_text}")

    # Step 2: Remove numbers
    review_text = re.sub(r'\d+', '', review_text)
    print(f"After removing numbers: {review_text}")

    # Step 3: Convert text to lowercase
    review_text = review_text.lower()
    print(f"After lowercasing: {review_text}")

    # Step 4: Remove stopwords
    stop_words = set(stopwords.words('english'))
    review_text = ' '.join(word for word in review_text.split() if word not in stop_words)
    print(f"After removing stopwords: {review_text}")

    # Step 5: Stemming the words
    stemmer = PorterStemmer()
    stemmed_text = ' '.join(stemmer.stem(word) for word in review_text.split())
    print(f"After stemming: {stemmed_text}")

    # Step 6: Lemmatization of the words
    lemmatizer = WordNetLemmatizer()
    lemmatized_text = ' '.join(lemmatizer.lemmatize(word) for word in review_text.split())
    print(f"After lemmatization: {lemmatized_text}")

    return stemmed_text, lemmatized_text

def process_reviews(df):
    """Apply the cleaning process to each review in the DataFrame."""
    # Treat missing reviews as empty strings so the regex steps do not fail on NaN
    df[['Stemmed_Review', 'Lemmatized_Review']] = df['Review'].fillna('').astype(str).apply(clean_text).apply(pd.Series)
    return df

def save_cleaned_reviews(df, output_file='cleaned_amazon_reviews.csv'):
    """Save the cleaned reviews to a new CSV file."""
    df.to_csv(output_file, index=False, encoding='utf-8')
    print(f"Cleaned reviews saved to {output_file}")

if __name__ == "__main__":
    # Load reviews from the specified CSV file
    reviews_df = load_reviews()

    # Clean the reviews using the defined process
    cleaned_reviews_df = process_reviews(reviews_df)

    # Save the cleaned reviews to a new CSV file
    save_cleaned_reviews(cleaned_reviews_df)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


After removing noise: Bought this for my boyfriend to have in is wallet and he loves it Easy to use tracking is accurate pairs easily with his iPhone Overall great
After removing numbers: Bought this for my boyfriend to have in is wallet and he loves it Easy to use tracking is accurate pairs easily with his iPhone Overall great
After lowercasing: bought this for my boyfriend to have in is wallet and he loves it easy to use tracking is accurate pairs easily with his iphone overall great
After removing stopwords: bought boyfriend wallet loves easy use tracking accurate pairs easily iphone overall great
After stemming: bought boyfriend wallet love easi use track accur pair easili iphon overal great
After lemmatization: bought boyfriend wallet love easy use tracking accurate pair easily iphone overall great
After removing noise: I have bought SO MANY OF THESE I have one in my backpack car keys wallet motorcycle keys motorcycle and in my vehicle The AirTag gives me a sense of ease as I can 
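
One side effect of the step order used above is worth illustrating: stripping punctuation before removing stopwords turns contractions such as "don't" into "dont", which may no longer match an apostrophe-bearing stopword entry. The sketch below is a minimal illustration using only the standard library. The tiny `STOPWORDS` set, the sample sentence, and both function names are assumptions made for the example, not NLTK's list or the notebook's code.

```python
import re

# Tiny illustrative stopword set (an assumption, not NLTK's full list)
STOPWORDS = {"i", "it", "is", "don't", "the", "and"}

def strip_then_filter(text):
    """Order used above: strip punctuation first, then drop stopwords."""
    text = re.sub(r"[^\w\s]", "", text).lower()
    return [w for w in text.split() if w not in STOPWORDS]

def filter_then_strip(text):
    """Alternative order: drop stopwords first, then strip punctuation."""
    # Trim trailing sentence punctuation only for the stopword lookup
    words = [w for w in text.lower().split() if w.strip(".,!?;:") not in STOPWORDS]
    cleaned = (re.sub(r"[^\w\s]", "", w) for w in words)
    return [w for w in cleaned if w]

sample = "I don't regret it, and the tracking is accurate!"
print(strip_then_filter(sample))   # ['dont', 'regret', 'tracking', 'accurate']
print(filter_then_strip(sample))   # ['regret', 'tracking', 'accurate']
```

Which order is preferable depends on whether negations like "don't" should survive cleaning; the point here is only that the two orders can give different results.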

# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.
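
To make the idea of a constituency tree in part (2) concrete, here is a minimal sketch that represents a tree as nested tuples and prints it in bracketed form. The tree for "I bought this product" is hand-built for illustration and is not the output of any parser. The helper names `bracketed`, `leaves`, and `phrases` are hypothetical.

```python
# Hand-built constituency tree: (label, child, child, ...); leaves are plain strings.
# This structure is an illustrative assumption, not parser output.
tree = ("S",
        ("NP", ("PRP", "I")),
        ("VP", ("VBD", "bought"),
               ("NP", ("DT", "this"), ("NN", "product"))))

def bracketed(node):
    """Render the tree in the usual bracketed notation."""
    if isinstance(node, str):
        return node
    label, *children = node
    return "(" + label + " " + " ".join(bracketed(c) for c in children) + ")"

def leaves(node):
    """Return the words of the tree in sentence order."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node[1:] for leaf in leaves(child)]

def phrases(node, label):
    """Collect the word span of every constituent carrying the given label."""
    if isinstance(node, str):
        return []
    found = [" ".join(leaves(node))] if node[0] == label else []
    for child in node[1:]:
        found.extend(phrases(child, label))
    return found

print(bracketed(tree))      # (S (NP (PRP I)) (VP (VBD bought) (NP (DT this) (NN product))))
print(phrases(tree, "NP"))  # ['I', 'this product']
```

A constituency tree groups words into nested phrases (here two noun phrases inside a sentence and a verb phrase), whereas a dependency parse links each word directly to its head word with a labeled relation.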

In [None]:
# Your code here
!pip install spacy nltk
!python -m spacy download en_core_web_sm
import pandas as pd
import spacy
import nltk
from collections import Counter
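from nltk import pos_tag, word_tokenize

# word_tokenize and pos_tag need tokenizer and tagger resources. The resource
# names differ across NLTK versions, so request both the older and newer names.
for resource in ('punkt', 'punkt_tab', 'averaged_perceptron_tagger', 'averaged_perceptron_tagger_eng'):
    nltk.download(resource, quiet=True)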

# Load spaCy model for advanced NLP tasks
nlp = spacy.load("en_core_web_sm")

# Load the cleaned reviews CSV
def load_cleaned_reviews(file_path='cleaned_amazon_reviews.csv'):
    return pd.read_csv(file_path)

# Part 1: Parts of Speech (POS) Tagging
def pos_tagging(text):
    tokens = word_tokenize(text)
    tagged_words = pos_tag(tokens)

    # Count Nouns, Verbs, Adjectives, and Adverbs
    pos_count = Counter(tag for word, tag in tagged_words)
    noun_count = pos_count.get('NN', 0) + pos_count.get('NNS', 0) + pos_count.get('NNP', 0) + pos_count.get('NNPS', 0)  # Common and proper nouns, singular and plural
    verb_count = pos_count.get('VB', 0) + pos_count.get('VBD', 0) + pos_count.get('VBG', 0) + pos_count.get('VBN', 0) + pos_count.get('VBP', 0) + pos_count.get('VBZ', 0)  # Verb types
    adj_count = pos_count.get('JJ', 0) + pos_count.get('JJR', 0) + pos_count.get('JJS', 0)  # Adjectives
    adv_count = pos_count.get('RB', 0) + pos_count.get('RBR', 0) + pos_count.get('RBS', 0)  # Adverbs

    return tagged_words, noun_count, verb_count, adj_count, adv_count

# Part 2: Parsing (Constituency Parsing and Dependency Parsing)
def parse_sentence(text):
    doc = nlp(text)

    # Dependency parsing
    dependency_parsing = [(token.text, token.dep_, token.head.text) for token in doc]

    # Constituency parsing placeholder: spaCy has no built-in constituency parser,
    # so this is only a flat token(tag) sequence, not a true phrase-structure tree
    constituency_parsing = " ".join([f"{token.text}({token.tag_})" for token in doc])

    return dependency_parsing, constituency_parsing

# Part 3: Named Entity Recognition (NER)
def named_entity_recognition(text):
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    entity_counts = Counter([ent.label_ for ent in doc.ents])

    return entities, entity_counts

# Main Function to Process Reviews
def process_reviews(df):
    # Create lists to store analysis results
    pos_tags = []
    noun_counts = []
    verb_counts = []
    adj_counts = []
    adv_counts = []
    dependency_parses = []
    constituency_parses = []
    named_entities = []
    entity_counts_list = []

    # Loop through each review for analysis
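    # Empty cleaned reviews are read back from CSV as NaN, so convert them to empty strings
    for review in df['Lemmatized_Review'].fillna('').astype(str):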
        # Part 1: POS tagging
        tagged_words, noun_count, verb_count, adj_count, adv_count = pos_tagging(review)
        pos_tags.append(tagged_words)
        noun_counts.append(noun_count)
        verb_counts.append(verb_count)
        adj_counts.append(adj_count)
        adv_counts.append(adv_count)

        # Part 2: Parsing
        dependency_parsing, constituency_parsing = parse_sentence(review)
        dependency_parses.append(dependency_parsing)
        constituency_parses.append(constituency_parsing)

        # Part 3: Named Entity Recognition
        entities, entity_counts = named_entity_recognition(review)
        named_entities.append(entities)
        entity_counts_list.append(entity_counts)

    # Add analysis columns to the dataframe
    df['POS_Tags'] = pos_tags
    df['Noun_Count'] = noun_counts
    df['Verb_Count'] = verb_counts
    df['Adj_Count'] = adj_counts
    df['Adv_Count'] = adv_counts
    df['Dependency_Parsing'] = dependency_parses
    df['Constituency_Parsing'] = constituency_parses
    df['Named_Entities'] = named_entities
    df['Entity_Counts'] = entity_counts_list

    return df

# Save the analysis results to a CSV file
def save_analysis(df, output_file='review_analysis.csv'):
    df.to_csv(output_file, index=False, encoding='utf-8')
    print(f"Analysis saved to {output_file}")

# Example explanation function for a parsing tree
def explain_parsing_tree(sentence):
    doc = nlp(sentence)
    print("\nConstituency Parsing Explanation:")
    print("The sentence is tokenized and tagged as follows:")
    for token in doc:
        print(f"Token: {token.text}, POS Tag: {token.tag_}, Dependency: {token.dep_} -> Head: {token.head.text}")

if __name__ == "__main__":
    # Load the cleaned reviews from CSV
    reviews_df = load_cleaned_reviews()

    # Process reviews for analysis
    analyzed_reviews_df = process_reviews(reviews_df)

    # Save the analyzed reviews to a new CSV file
    save_analysis(analyzed_reviews_df)

    # Example sentence to explain parsing trees
    sample_sentence = "I bought this product last month and it works great."
    explain_parsing_tree(sample_sentence)



Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 73.7 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Analysis saved to review_analysis.csv

Constituency Parsing Explanation:
The sentence is tokenized and tagged as follows:
Token: I, POS Tag: PRP, Dependency: nsubj -> Head: bought
Token: bought, POS Tag: VBD, Dependency: ROOT -> Head: bought
Token: this, POS Tag: DT, Dependency: det -> Head: product
Token: product, POS Tag: NN, Dependency:
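
Part (1) asks for total counts of nouns, verbs, adjectives, and adverbs, while the code above stores counts per review. A minimal sketch of rolling per-review tags up into corpus totals follows. The tagged input is hand-written for illustration rather than produced by `pos_tag`, and the `COARSE` mapping and `coarse_counts` helper are hypothetical names.

```python
from collections import Counter

# Map the first two letters of a Penn Treebank tag to a coarse category,
# so NN/NNS/NNP/NNPS all count as nouns, VB/VBD/VBG/... as verbs, and so on.
COARSE = {"NN": "Noun", "VB": "Verb", "JJ": "Adjective", "RB": "Adverb"}

def coarse_counts(tagged_words):
    """Count coarse categories in one review's (word, tag) pairs."""
    counts = Counter()
    for _word, tag in tagged_words:
        category = COARSE.get(tag[:2])
        if category:
            counts[category] += 1
    return counts

# Hand-tagged illustrative input (assumption, not actual pos_tag output)
tagged_reviews = [
    [("bought", "VBD"), ("boyfriend", "NN"), ("wallet", "NN"), ("love", "VBP"), ("easy", "JJ")],
    [("tracking", "NN"), ("accurate", "JJ"), ("pair", "NN"), ("easily", "RB")],
]

# Summing Counters adds the per-review counts category by category
totals = sum((coarse_counts(review) for review in tagged_reviews), Counter())
print(dict(totals))
```

The same pattern, summing one `Counter` per review, would also work for combining the per-review `Entity_Counts` into a single count per entity label.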

# **Comment**
Make sure to submit the cleaned data CSV in the comment section - 10 points

Link to the cleaned data CSV:

[Cleaned data CSV](https://myunt-my.sharepoint.com/:x:/g/personal/sivarohithjampana_my_unt_edu/EaphPEboOe5Ojnka8IPPqtsBeCoTaRakmtiAqbYTap9etw?e=yPz4eb)

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

The assignment on analyzing Amazon reviews was challenging due to scraping limitations and complex text cleaning. I enjoyed applying NLP techniques like POS tagging and Named Entity Recognition, finding insights from the data. The allocated time was reasonable, allowing thorough exploration while feeling slightly rushed on parsing nuances.