<a href="https://colab.research.google.com/github/rmvsaipavan/manivenkatasaipavan_INFO5731_Fall2023/blob/main/Ramisetty_Manivenkatasaipavan_Assignment_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Two**

In this assignment, you will try to gather text data from open data source via web scraping or API. After that you need to clean the text data and syntactic analysis of the data.

# **Question 1**

(40 points). Write a python program to collect text data from **either of the following sources** and save the data into a **csv file**:

(1) Collect all the customer reviews of a product (you can choose any porduct) on amazon.

(2) Collect the top 10000 User Reviews of a film recently in 2023 or 2022 (you can choose any film) from IMDB.

(3) Collect all the reviews of the top 1000 most popular software from [G2](https://www.g2.com/) or [Capterra](https://www.capterra.com/)

(4) Collect the abstracts of the top 10000 research papers by using the query "machine learning", "data science", "artifical intelligence", or "information extraction" from [Semantic Scholar](https://www.semanticscholar.org).

(5) Collect all the information of the 904 narrators in the [Densho Digital Repository](https://ddr.densho.org/narrators/).

(6) Collect the top 10000 reddits by using a hashtag (you can use any hashtag) from Reddits.


In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

target_url = 'https://www.imdb.com/title/tt15398776/reviews/?ref_=tt_ql_2'    #Oppenheimer Movie reviews link
current_page = 1
total_reviews_to_scrape = 10000  #total number of reviews that we are collecting

all_reviews = []

while len(all_reviews) < total_reviews_to_scrape:
    page_url = f'{target_url}&start={25 * (current_page - 1)}'
    response = requests.get(page_url)

    if response.status_code == 200:
        parsed_html = BeautifulSoup(response.text, 'html.parser')
        reviews = parsed_html.find_all('div', class_='lister-item-content')

        if not reviews:
            break  # No more reviews found

        for idx, review in enumerate(reviews, 1):
            reviewer = review.find('span', class_='display-name-link').text
            review_date = review.find('span', class_='review-date').text

            try:
                user_rating = review.find('span', class_='rating-other-user-rating').text
            except AttributeError:
                user_rating = 'No Rating'

            review_title = review.find('a', class_='title').text.strip()
            review_content = review.find('div', class_='text').text.strip()

            all_reviews.append({
                'Reviewer': reviewer,
                'Review Date': review_date,
                'User Rating': user_rating,
                'Review Title': review_title,
                'Review Content': review_content
            })

            print(f'Review {len(all_reviews)}:')
            print(f'Reviewer: {reviewer}')
            print(f'Review Date: {review_date}')
            print(f'User Rating: {user_rating}')
            print(f'Review Title: {review_title}')
            print(f'Review Content: {review_content}\n')

        current_page += 1

# Create a DataFrame from the collected reviews
df = pd.DataFrame(all_reviews)

# Save the DataFrame to a CSV file
csv_file_name = 'imdb_reviews.csv'
df.to_csv(csv_file_name, index=False)

print(f'Reviews saved to {csv_file_name}')


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
Reviewer: mohameddawoud-26019
Review Date: 19 July 2023
User Rating: 





10/10

Review Title: A Masterpiece
Review Content: I may consider myself lucky to be alive to watch Christopher Nolan Works which get better by years.Oppenheimer is - with no doubt- going to be one of the best movies in the history. Amazing cinematography, Exceptional acting and terrifying Soundtracks.All the cast are great from cilian Murphy who is going for the oscar with this role to Rupert Downey jr and Emily blunt and finally rami malik who has small scenes but you will never forget them.I didn't watch it in Imax as i couldn't wait and ran to the nearest cinema but now i will sure book an imax ticket.Don't waste any time, book your ticket and Go watch it.. NOW.

Review 9632:
Reviewer: mark-217-307033
Review Date: 19 July 2023
User Rating: 





10/10

Review Title: And the Oscar goes to...
Review Content: I'm still collecting my thoughts after

# **Question 2**

(30 points). Write a python program to **clean the text data** you collected above and save the data in a new column in the csv file. The data cleaning steps include:

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the [stopwords list](https://gist.github.com/sebleier/554280).

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [None]:
import nltk
nltk.download('omw-1.4')


[nltk_data] Downloading package omw-1.4 to C:\Users\Sai
[nltk_data]     Pavan\AppData\Roaming\nltk_data...


True

In [2]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
import string

# Download NLTK resources (stopwords and lemmatization data)
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# Load the CSV file with the collected reviews
df = pd.read_csv('imdb_reviews.csv')

# Initialize stopwords and stemming/lemmatization tools
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Function to clean and preprocess text
def clean_text(text):
    # Remove noise, such as special characters and punctuations
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Remove numbers
    text = ''.join([word for word in text if not word.isdigit()])

    # Tokenize the text
    words = word_tokenize(text)

    # Remove stopwords, lowercase all texts, and perform stemming and lemmatization
    cleaned_words = [lemmatizer.lemmatize(stemmer.stem(word.lower())) for word in words if word.lower() not in stop_words]

    # Join the cleaned words back into a string
    cleaned_text = ' '.join(cleaned_words)

    return cleaned_text

# Apply the cleaning function to the 'Review Content' column and create a new 'Cleaned Review Content' column
df['Cleaned Review Content'] = df['Review Content'].apply(clean_text)

# Save the DataFrame with cleaned data to a new CSV file
cleaned_csv_file = 'imdb_reviews_cleaned.csv'
df.to_csv(cleaned_csv_file, index=False)

print(f'Cleaned data saved to {cleaned_csv_file}')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


Cleaned data saved to imdb_reviews_cleaned.csv


# **Question 3**

(30 points). Write a python program to conduct **syntax and structure analysis** of the clean text you just saved above. The syntax and structure analysis includes:

(1) Parts of Speech (POS) Tagging: Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) Constituency Parsing and Dependency Parsing: print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) Named Entity Recognition: Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [3]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
import string
import spacy

# Download additional NLTK data for named entity recognition
nltk.download('maxent_ne_chunker')
nltk.download('words')

# Load the cleaned data
df = pd.read_csv('imdb_reviews_cleaned.csv')

# Load the English language model for spaCy
nlp = spacy.load('en_core_web_sm')

# Define variables to count parts of speech
noun_count = 0
verb_count = 0
adj_count = 0
adv_count = 0

# Define a function to analyze and print POS, constituency parsing, and dependency parsing
def analyze_text(text):
    global noun_count, verb_count, adj_count, adv_count  # Define as global

    # Tokenize the text and process it with spaCy
    doc = nlp(text)

    # Initialize variables for constituency parsing and dependency parsing
    constituency_tree = ''
    dependency_tree = ''

    # Analyze each token in the document
    for token in doc:
        if token.pos_ == 'NOUN':
            noun_count += 1
        elif token.pos_ == 'VERB':
            verb_count += 1
        elif token.pos_ == 'ADJ':
            adj_count += 1
        elif token.pos_ == 'ADV':
            adv_count += 1

        # Append token information to constituency and dependency trees
        constituency_tree += f'({token.text} {token.dep_} '
        dependency_tree += f'{token.text} ({token.dep_}) -> '

    # Print constituency parsing tree for one sentence
    print(f'Constituency Parsing Tree: {constituency_tree}')

    # Print dependency parsing tree for one sentence
    print(f'Dependency Parsing Tree: {dependency_tree}')

# Function to extract named entities and count them
def extract_named_entities(text):
    doc = nlp(text)

    named_entities = {
        'Persons': 0,
        'Organizations': 0,
        'Locations': 0,
        'Products': 0,
        'Dates': 0
    }

    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            named_entities['Persons'] += 1
        elif ent.label_ == 'ORG':
            named_entities['Organizations'] += 1
        elif ent.label_ == 'GPE':
            named_entities['Locations'] += 1
        elif ent.label_ == 'PRODUCT':
            named_entities['Products'] += 1
        elif ent.label_ == 'DATE':
            named_entities['Dates'] += 1

    return named_entities

# Analyze each review's cleaned content
for index, row in df.iterrows():
    print(f'Review {index + 1}:')
    cleaned_review = row['Cleaned Review Content']

    # (1) Parts of Speech Tagging
    analyze_text(cleaned_review)

    # (2) Constituency Parsing and Dependency Parsing
    # Using the logic from your previous provided code

    # (3) Named Entity Recognition
    named_entities = extract_named_entities(cleaned_review)
    print("Named Entity Counts:")
    print(named_entities)

    print("\n" + "="*50 + "\n")  # Separator between reviews

# Print the total counts of different parts of speech
print(f'Total Nouns: {noun_count}')
print(f'Total Verbs: {verb_count}')
print(f'Total Adjectives: {adj_count}')
print(f'Total Adverbs: {adv_count}')

# Save the updated DataFrame with counts to a new CSV file
analyzed_csv_file = 'imdb_reviews_analyzed.csv'
df.to_csv(analyzed_csv_file, index=False)

print(f'Analyzed data saved to {analyzed_csv_file}')


[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


[1;30;43mStreaming output truncated to the last 5000 lines.[0m


Review 9377:
Constituency Parsing Tree: (one nummod (anticip compound (film compound (year compound (mani compound (peopl compound (includ compound (oppenheim compound (larg nsubj (deliv ROOT (much advmod (great amod (feel dobj (like prep (love pobj (two nummod (three nummod (hour npadvmod (like prep (hour pobj (fact npadvmod (stop conj (ador npadvmod (entir amod (thing nsubj (know parataxis (christoph compound (nolan compound (dunkirk nsubj (click conj (second compound (watch dobj (mayb compound (oppenheim nsubj (need ccomp (one nsubj (said ccomp (do aux (nt neg (feel xcomp (need aux (rush nsubj (see xcomp (soon advmod (long amod (exhaust compound (filmbut nmod (mani compound (way nsubj (ca aux (nt neg (deni amod (except prep (well advmod (made ccomp (one nummod (look nsubj (sound ccomp (amaz prep (you pobj (d nsubj (expect parataxis (feel dobj (though mark (accur compound (captur compound (time compound (period nsubj 

**Write your explanations of the constituency parsing tree and dependency parsing tree here (Question 3-2):**

In [None]:
"""Constituency Parsing Tree;
A constituency parsing tree also referred to as a phrase structure tree shows the structure of a sentence by breaking
it down into its constituent phrases. Each node, in the tree represents an unit, such as a word or a group of words
(like noun phrases or verb phrases). The branches of the tree demonstrate the relationships between these constituents.
For instance if we take the sentence "The cat chased the mouse" the constituency tree would reveal how words combine
to form units like noun phrases and verb phrases. It offers an overview of the sentences structure.

Dependency Parsing Tree;
A dependency parsing tree illustrates the relationships between words, in a sentence based on dependencies. In this
type of tree each word serves as a node and the arcs connecting nodes represent relationships where one word is
considered as the head of that relationship while another serves as its dependent. For example in "The cat chased the mouse"
the dependency tree would show that "chased" depends on "cat" as its subject and also depends on "mouse" as its object.
This kind of parsing offers an, in depth perspective on the syntax and connections, in a sentence."""