# **INFO5731 Assignment 2**

In this assignment, you will gather text data from an open data source via web scraping or an API. You will then clean the text data and perform syntactic analysis on it. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Tuesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (40 points)

Write a Python program to collect text data from **any one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have enough reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [2]:
#Your code here
#Importing necessary libraries to collect text data
import requests
import pandas as pd
from bs4 import BeautifulSoup
# Defining function to collect top 1000 reviews of a recent movie from IMDB
def user_review_imdb(movie_id):
    movie_url = "https://www.imdb.com/title/{}/reviews?ref_=tt_ov_rt".format(movie_id)
    user_reviews = []

    # Adding headers in order to mimic a browser request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
    }

    for page_number in range(1, 101):
        url = movie_url + "&sort=top&start={}&count=10".format((page_number - 1) * 10 + 1)
        response = requests.get(url, headers=headers)

        beautiful_soup = BeautifulSoup(response.content, 'html.parser')

        for movie_review in beautiful_soup.find_all('div', class_='review-container'):
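            review_body = movie_review.find('div', class_='text show-more__control')
            if review_body is None:  # Skipping containers that have no review text block
                continue
            user_review_text = review_body.get_text(strip=True)
            user_reviews.append(user_review_text)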
            if len(user_reviews) >= 1000:  # Stopping if we already have top 1000 reviews
                break

        if len(user_reviews) >= 1000:
            break

    return user_reviews[:1000]

# Movie id of a recent movie Deadpool & Wolverine (2024)
movie_id = "tt6263850"
#Calling the user review function
user_reviews = user_review_imdb(movie_id)
movie_reviews_data = {'User review': user_reviews}
movie_reviews_df = pd.DataFrame(movie_reviews_data)
movie_reviews_df.to_csv('movie_reviews.csv', index=False)
print(movie_reviews_df)


                                           User review
0    Hugh Jackman is the perfect Wolverine. What a ...
1    What a crazy blast ! Bonkers !!Sooo !...\nWhat...
2    We've waited so long for this moment, and it w...
3    So many Easter Eggs, so true to the comic char...
4    I read an IGN review where the guy gave it a 7...
..                                                 ...
995  My dad fell asleep during this. Twice. If he'd...
996  Remember back in 2022 when Dr Strange: Multive...
997  And to understand some of the Easter eggs, jok...
998  "Deadpool 3" is a thrilling and hilarious cont...
999  Awesome movie! Lot of funny humor. Multiple oc...

[1000 rows x 1 columns]
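
The loop in `user_review_imdb` computes the `start` offset as `(page_number - 1) * 10 + 1`, so page 1 starts at review 1, page 2 at review 11, and so on. The sketch below reproduces that arithmetic and adds a duplicate check. It assumes, as the code above does, that the reviews endpoint honors the `start` and `count` query parameters; if it does not, every request returns the same page, and deduplicating before saving would make that visible. The helper names `page_urls` and `deduplicate` are hypothetical and not part of the original program.

```python
from urllib.parse import urlencode

BASE = "https://www.imdb.com/title/{}/reviews"

def page_urls(movie_id, pages, per_page=10):
    """Yield one URL per page; `start` is the 1-based index of the first review."""
    for page_number in range(1, pages + 1):
        query = urlencode({
            "ref_": "tt_ov_rt",
            "sort": "top",
            "start": (page_number - 1) * per_page + 1,
            "count": per_page,
        })
        yield BASE.format(movie_id) + "?" + query

def deduplicate(reviews):
    """Drop repeated review texts while keeping first-seen order."""
    return list(dict.fromkeys(reviews))

urls = list(page_urls("tt6263850", pages=3))
for url in urls:
    print(url)

# If the server ignored `start`, the collected list would contain repeats.
collected = ["great movie", "too long", "great movie"]
unique_reviews = deduplicate(collected)
print(len(collected), "collected,", len(unique_reviews), "unique")
```

Comparing the two lengths before writing the CSV is a cheap way to confirm that the 1000 rows are really 1000 different reviews.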


# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts.

(5) Stemming.

(6) Lemmatization.

In [3]:
# Write code for each of the sub parts with proper comments.
#Importing necessary libraries to clean the data collected in the previous part
import re
import nltk
import ssl
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
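# Workaround for SSL certificate errors when downloading NLTK data. It disables
# certificate verification, so it should only be used in a trusted environment.
ssl._create_default_https_context = ssl._create_unverified_context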

# Downloading the NLTK stopwords and WordNet data
nltk.download('stopwords')
nltk.download('wordnet')

#Defining a function to clean the data collected
def clean_data(review):
    #Removing noise, such as special characters and punctuations.
    review = re.sub(r'[^\w\s]', '', review)

    # Removing numbers
    review = re.sub(r'\d+', '', review)

    # Removing stopwords by using the stopwords list.
    stop_words = set(stopwords.words('english'))
    review = ' '.join([item for item in review.split() if item.lower() not in stop_words])

    # Lowercasing all texts
    review = review.lower()

    # Stemming
    porter_stemmer = PorterStemmer()
    review = ' '.join([porter_stemmer.stem(item) for item in review.split()])

    # Lemmatization
    word_net_lemmatizer = WordNetLemmatizer()
    review = ' '.join([word_net_lemmatizer.lemmatize(item) for item in review.split()])

    return review

# Reading the csv file saved in the previous step
movie_reviews_dataset = pd.read_csv('movie_reviews.csv')

# Apply cleaning to the user review column in the saved dataset
movie_reviews_dataset['Cleaned User review'] = movie_reviews_dataset['User review'].apply(clean_data)

print(movie_reviews_dataset.head())


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


                                         User review  \
0  Hugh Jackman is the perfect Wolverine. What a ...   
1  What a crazy blast ! Bonkers !!Sooo !...\nWhat...   
2  We've waited so long for this moment, and it w...   
3  So many Easter Eggs, so true to the comic char...   
4  I read an IGN review where the guy gave it a 7...   

                                 Cleaned User review  
0  hugh jackman perfect wolverin fun movi like di...  
1  crazi blast bonker sooo say movi whole team be...  
2  weve wait long moment beyond fun wholesom full...  
3  mani easter egg true comic charact may possibl...  
4  read ign review guy gave stori poorth guy real...  
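
The `weve` token in row 2 above comes from the order of the steps in `clean_data`: punctuation is stripped before stopwords are filtered, so a contraction loses its apostrophe and can no longer match a stopword entry that contains one. The sketch below illustrates the effect with the standard library only. `STOP_WORDS` is a tiny hand-written stand-in for the NLTK English list (an assumption for illustration; the real list is larger but also contains contractions such as `don't`), and the helper names are hypothetical.

```python
import re

# Tiny stand-in for the NLTK English stopword list (illustrative assumption).
STOP_WORDS = {"i", "the", "it", "is", "this", "don't"}

def strip_punctuation(text):
    """Remove every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text)

def drop_stopwords(text):
    """Keep only the tokens whose lowercase form is not a stopword."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)

sample = "I don't think this is the best"

# Order used in clean_data: punctuation first, then stopwords.
punctuation_first = drop_stopwords(strip_punctuation(sample)).lower()

# Alternative order: stopwords first, then punctuation.
stopwords_first = strip_punctuation(drop_stopwords(sample)).lower()

print(punctuation_first)
print(stopwords_first)
```

With punctuation removed first, `dont` survives the filter; with stopwords removed first, it does not. Either order satisfies the listed steps, but the results differ, so the choice is worth stating explicitly.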


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tags, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.
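
To make the distinction in item (2) concrete: a constituency tree groups words into nested phrases (noun phrase, verb phrase, and so on), whereas a dependency tree links each word directly to its head word. The sketch below shows the constituency side with a hand-annotated parse of the first sentence of review 0, "Hugh Jackman is the perfect Wolverine". The tree is written by hand for illustration and is not the output of any parser; the nested-tuple representation and the helper names are assumptions.

```python
# Hand-annotated constituency parse (an illustration, not parser output).
# An inner node is (label, child, child, ...); a leaf is (POS tag, word).
tree = (
    "S",
    ("NP", ("NNP", "Hugh"), ("NNP", "Jackman")),
    ("VP",
     ("VBZ", "is"),
     ("NP", ("DT", "the"), ("JJ", "perfect"), ("NNP", "Wolverine"))),
)

def is_leaf(children):
    """A leaf has exactly one child, and that child is the word itself."""
    return len(children) == 1 and isinstance(children[0], str)

def to_brackets(node):
    """Render a nested-tuple tree in Penn Treebank bracket notation."""
    label, *children = node
    if is_leaf(children):
        return f"({label} {children[0]})"
    return f"({label} " + " ".join(to_brackets(c) for c in children) + ")"

def leaves(node):
    """Return the words of the tree in sentence order."""
    label, *children = node
    if is_leaf(children):
        return [children[0]]
    return [word for child in children for word in leaves(child)]

bracketed = to_brackets(tree)
print(bracketed)
print(" ".join(leaves(tree)))
```

Reading the brackets from the outside in, the sentence `S` splits into a subject noun phrase and a verb phrase, and the verb phrase in turn contains a second noun phrase. That nesting of phrases is what a constituency parser is expected to produce.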

In [4]:
# Your code here
# Importing necessary libraries to implement Parts of Speech Tagging
import spacy
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from collections import Counter

# Downloading NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Loading the spaCy English language model
nlp = spacy.load('en_core_web_sm')

def pos_tagging(review_text):
    # Tokenizing the text of the movie review
    tokenize_text = word_tokenize(review_text)

    # Performing POS tagging
    pos_tags = pos_tag(tokenize_text)

    # Calculating the total number of pos tags
    pos_tag_number = Counter(tag[1] for tag in pos_tags)

    return pos_tag_number

def parse_reviews(review):
    # Processing the text of the review using spaCy
    parsing = nlp(review)

    # Printing out the tokens and the dependency relations of each sentence
    for text in parsing.sents:
        # Note: en_core_web_sm has no constituency parser, so this prints the
        # sentence tokens in order rather than a true constituency tree
        constituency_parsing_tree = [item.text_with_ws for item in text]
        print("The Constituency Parsing Tree is")
        print(' '.join(constituency_parsing_tree))

        # Dependency parsing tree
        dependency_parsing_tree = [(item.text, item.dep_, item.head.text) for item in text]
        # print out the dependency parsing trees
        print("\nThe Dependency Parsing Tree is")
        for item, dependency, head in dependency_parsing_tree:
            print(f"{item}, {dependency} ---- {head}")
        print("\n")

def entities_extraction(review):
    # Processing the text using spaCy
    extracted_entities = nlp(review)

    # Extracting all the entities
    entity_count = Counter(item.label_ for item in extracted_entities.ents)

    # Returning the counts for each entity type
    return {
        'Person': entity_count.get('PERSON', 0),
        'Organization': entity_count.get('ORG', 0),
        'Location': entity_count.get('GPE', 0),
        'Product': entity_count.get('PRODUCT', 0),
        'Date': entity_count.get('DATE', 0)
    }


# Apply Parse reviews for each row in the 'cleaned user review' column
movie_reviews_dataset['Parse Reviews'] = movie_reviews_dataset['Cleaned User review'].apply(parse_reviews)
# Apply POS tagging for each row in the 'cleaned user review' column
movie_reviews_dataset['Pos tag of Review'] = movie_reviews_dataset['Cleaned User review'].apply(pos_tagging)

# Defining function to calculate total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.
def total_no_of_nouns(pos_count_review):
    return pos_count_review.get('NN', 0) + pos_count_review.get('NNS', 0) + pos_count_review.get('NNP', 0) + pos_count_review.get('NNPS', 0)

def total_no_of_verbs(pos_count_review):
    return pos_count_review.get('VB', 0) + pos_count_review.get('VBD', 0) + pos_count_review.get('VBG', 0) + pos_count_review.get('VBN', 0) + pos_count_review.get('VBP', 0) + pos_count_review.get('VBZ', 0)

def total_no_of_adjectives(pos_count_review):
    return pos_count_review.get('JJ', 0) + pos_count_review.get('JJR', 0) + pos_count_review.get('JJS', 0)

def total_no_of_adverbs(pos_count_review):
    return pos_count_review.get('RB', 0) + pos_count_review.get('RBR', 0) + pos_count_review.get('RBS', 0)

#Calculating the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.
nouns_count = movie_reviews_dataset['Pos tag of Review'].apply(total_no_of_nouns).sum()
verbs_count = movie_reviews_dataset['Pos tag of Review'].apply(total_no_of_verbs).sum()
adjectives_count = movie_reviews_dataset['Pos tag of Review'].apply(total_no_of_adjectives).sum()
adverbs_count = movie_reviews_dataset['Pos tag of Review'].apply(total_no_of_adverbs).sum()

# Printing the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.
print(f"Total number of N(oun) : {nouns_count}")
print(f"Total number of V(erb) : {verbs_count}")
print(f"Total number of Adj(ective) : {adjectives_count}")
print(f"Total number of Adv(erb) : {adverbs_count}")
print("\n")

# Applying entities extraction on each review
movie_reviews_dataset['Named Entity Recognition'] = movie_reviews_dataset['Cleaned User review'].apply(entities_extraction)

# Calculating the entities such as person names, organizations, locations, product names, and date from the clean texts
person_count = movie_reviews_dataset['Named Entity Recognition'].apply(lambda x: x['Person']).sum()
organization_count = movie_reviews_dataset['Named Entity Recognition'].apply(lambda x: x['Organization']).sum()
location_count = movie_reviews_dataset['Named Entity Recognition'].apply(lambda x: x['Location']).sum()
product_count = movie_reviews_dataset['Named Entity Recognition'].apply(lambda x: x['Product']).sum()
date_count = movie_reviews_dataset['Named Entity Recognition'].apply(lambda x: x['Date']).sum()

# Printing the total number of entity counts
print(f"Total number of Persons : {person_count}")
print(f"Total number of Organizations : {organization_count}")
print(f"Total number of Locations : {location_count}")
print(f"Total number of Products : {product_count}")
print(f"Total number of Dates : {date_count}")
print("\n")

print(movie_reviews_dataset)
movie_reviews_dataset.to_csv('Movie_reviews_dataset.csv', index=False)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Streaming output truncated to the last 5000 lines.
overal, nsubj ---- think
think, ccomp ---- saw
enjoy, nsubj ---- ride
ride, ccomp ---- think


The Constituency Parsing Tree is
deadpool  wolverin  everyth  ever  want  blew  mind  away 

The Dependency Parsing Tree is
deadpool, compound ---- everyth
wolverin, compound ---- everyth
everyth, nsubj ---- want
ever, advmod ---- want
want, ROOT ---- want
blew, ccomp ---- want
mind, dobj ---- blew
away, advmod ---- blew


The Constituency Parsing Tree is
surpris  cameo  gon na  tell  ya  watch  big  screen  believ  take  whole  new  level  liter  blow  away  everi  excit  element  whole  new  adventur 

The Dependency Parsing Tree is
surpris, compound ---- cameo
cameo, nsubj ---- gon
gon, ROOT ---- gon
na, aux ---- tell
tell, xcomp ---- gon
ya, dobj ---- tell
watch, xcomp ---- tell
big, amod ---- screen
screen, compound ---- believ
believ, npadvmod ---- tell
take, conj ---- gon
whole, amod ---- liter
new, amod ---- level
level,
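
The flat `token, relation ---- head` lines above are easier to read as a tree. The sketch below takes the triples printed for the sentence "deadpool wolverin everyth ever want blew mind away" and indents each word under its head. It assumes every token in the sentence is unique (true for this example, not in general) and uses hypothetical helper names; it only rearranges the output already shown and does not call spaCy.

```python
from collections import defaultdict

# (token, dependency label, head) triples copied from the output above.
triples = [
    ("deadpool", "compound", "everyth"),
    ("wolverin", "compound", "everyth"),
    ("everyth", "nsubj", "want"),
    ("ever", "advmod", "want"),
    ("want", "ROOT", "want"),
    ("blew", "ccomp", "want"),
    ("mind", "dobj", "blew"),
    ("away", "advmod", "blew"),
]

def build_tree(triples):
    """Return (root, children), where children maps head -> [(token, label), ...]."""
    children = defaultdict(list)
    root = None
    for token, label, head in triples:
        if label == "ROOT":  # the root is listed as its own head
            root = token
        else:
            children[head].append((token, label))
    return root, children

def render(token, label, children, depth=0):
    """Return the subtree under `token` as indented lines."""
    lines = [f"{'  ' * depth}{token} ({label})"]
    for child, child_label in children[token]:
        lines.extend(render(child, child_label, children, depth + 1))
    return lines

root, children = build_tree(triples)
tree_lines = render(root, "ROOT", children)
print("\n".join(tree_lines))
```

In this view `want` is the root, `everyth` is its subject with `deadpool` and `wolverin` attached as compounds, and `blew` is a clausal complement with `mind` as its object. Because the input was stemmed and stripped of stopwords, the parser is working on fragments, so these relations should be read with caution.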

# **Comment**
Make sure to submit the cleaned data CSV in the comment section - 10 points

In [5]:
#                                           User review  \
# 0    Hugh Jackman is the perfect Wolverine. What a ...
# 1    What a crazy blast ! Bonkers !!Sooo !...\nWhat...
# 2    We've waited so long for this moment, and it w...
# 3    So many Easter Eggs, so true to the comic char...
# 4    I read an IGN review where the guy gave it a 7...
# ..                                                 ...
# 995  My dad fell asleep during this. Twice. If he'd...
# 996  Remember back in 2022 when Dr Strange: Multive...
# 997  And to understand some of the Easter eggs, jok...
# 998  "Deadpool 3" is a thrilling and hilarious cont...
# 999  Awesome movie! Lot of funny humor. Multiple oc...

#                                    Cleaned User review Parse Reviews  \
# 0    hugh jackman perfect wolverin fun movi like di...          None
# 1    crazi blast bonker sooo say movi whole team be...          None
# 2    weve wait long moment beyond fun wholesom full...          None
# 3    mani easter egg true comic charact may possibl...          None
# 4    read ign review guy gave stori poorth guy real...          None
# ..                                                 ...           ...
# 995  dad fell asleep twice hed awak could leftthi m...          None
# 996  rememb back dr strang multivers mad came every...          None
# 997  understand easter egg joke scene even need kno...          None
# 998  deadpool thrill hilari continu franchis bring ...          None
# 999  awesom movi lot funni humor multipl occas whol...          None

#                                      Pos tag of Review  \
# 0    {'JJ': 12, 'NN': 37, 'VBP': 5, 'IN': 1, 'VBD':...
# 1    {'NN': 64, 'NNS': 2, 'VBP': 9, 'JJ': 22, 'IN':...
# 2    {'NNS': 5, 'VBP': 17, 'JJ': 50, 'NN': 94, 'IN'...
# 3    {'NN': 33, 'JJ': 12, 'MD': 2, 'VB': 4, 'VBP': ...
# 4    {'JJ': 14, 'NN': 33, 'VBD': 2, 'VBP': 8, 'VB':...
# ..                                                 ...
# 995  {'NN': 118, 'VBD': 12, 'JJ': 41, 'RB': 9, 'VBN...
# 996  {'NN': 219, 'RB': 22, 'JJ': 92, 'NNS': 9, 'VBP...
# 997  {'NN': 32, 'VBD': 4, 'RB': 2, 'VB': 2, 'VBP': ...
# 998  {'NN': 38, 'VBZ': 1, 'VBP': 2, 'RB': 1, 'JJ': ...
# 999  {'JJ': 18, 'NN': 31, 'IN': 2, 'VBN': 1, 'NNS':...

#                               Named Entity Recognition
# 0    {'Person': 0, 'Organization': 4, 'Location': 0...
# 1    {'Person': 1, 'Organization': 4, 'Location': 0...
# 2    {'Person': 2, 'Organization': 2, 'Location': 0...
# 3    {'Person': 1, 'Organization': 1, 'Location': 0...
# 4    {'Person': 1, 'Organization': 0, 'Location': 0...
# ..                                                 ...
# 995  {'Person': 3, 'Organization': 0, 'Location': 0...
# 996  {'Person': 11, 'Organization': 1, 'Location': ...
# 997  {'Person': 1, 'Organization': 1, 'Location': 0...
# 998  {'Person': 1, 'Organization': 0, 'Location': 0...
# 999  {'Person': 0, 'Organization': 1, 'Location': 0...

# [1000 rows x 5 columns]
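
The `Pos tag of Review` and `Named Entity Recognition` columns above hold one dictionary per review. The sketch below shows one possible way to collapse such dictionaries into corpus totals with `collections.Counter`: Penn Treebank tags are grouped by their two-letter prefix (`NN`, `VB`, `JJ`, `RB`), which covers the same tags as the four `total_no_of_*` functions. The rows are illustrative values in the same shape as the output, not the real data, and the helper names are hypothetical.

```python
from collections import Counter

# Illustrative rows in the same shape as the two columns (not the real data).
pos_rows = [
    {"JJ": 12, "NN": 37, "VBP": 5, "IN": 1},
    {"NN": 64, "NNS": 2, "VBP": 9, "JJ": 22, "RB": 3},
]
ner_rows = [
    {"Person": 0, "Organization": 4, "Location": 0},
    {"Person": 1, "Organization": 4, "Location": 0},
]

# Two-letter Penn Treebank prefixes mapped to the coarse categories.
PREFIX_TO_CATEGORY = {"NN": "Noun", "VB": "Verb", "JJ": "Adjective", "RB": "Adverb"}

def coarse_pos_totals(rows):
    """Sum tag counts into Noun/Verb/Adjective/Adverb totals; other tags are ignored."""
    totals = Counter()
    for row in rows:
        for tag, count in row.items():
            category = PREFIX_TO_CATEGORY.get(tag[:2])
            if category:
                totals[category] += count
    return totals

def entity_totals(rows):
    """Add up the per-review entity counts key by key."""
    totals = Counter()
    for row in rows:
        totals.update(row)
    return totals

pos_totals = coarse_pos_totals(pos_rows)
ner_totals = entity_totals(ner_rows)
print(dict(pos_totals))
print(dict(ner_totals))
```

Applied to the real columns, the same two functions could be called as `coarse_pos_totals(movie_reviews_dataset['Pos tag of Review'])` and `entity_totals(movie_reviews_dataset['Named Entity Recognition'])`, since iterating over a pandas Series yields its values.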

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [6]:
# Write your response below
# The task engaged me on many levels and forced me to solve complex issues. There was so much complexity that I usually spent extra hours after each session going through the work. I admit that this made the assignment somewhat intimidating at times, especially when I had to balance it with my other subjects. I also ran into technical problems, such as software issues, library loading failures, name errors, and key errors, which added to the schedule pressure.
# Despite these problems, I liked many things about the assignment. I enjoyed engaging with the content, and the work built up my curiosity and critical thinking. Putting theory into practice helped me internalize what I had learned and made the course more interesting. I had a chance to think creatively when addressing problems and coming up with new ideas. Another enjoyable aspect was the group work; it was rewarding to hear different perspectives, which enhanced my understanding.
# As for the time provided, I found it appropriate because it gave me a chance to analyze the topic in depth. However, since the task was quite comprehensive, a little more time would have allowed deeper analysis and research.

