# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**


# Question 1 (40 points)

Write a python program to collect text data from **either of the following sources** and save the data into a **csv file:**

(1) Collect all the customer reviews of a product (you can choose any porduct) on amazon. [atleast 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie recently in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of atleast 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artifical intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [2]:
# we are writing a python program for scraping reviews into csv file
#importing required libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.imdb.com/title/tt15398776/reviews/?ref_=ttrt_ql_2"
reviews = [] #creating an empty list to store the reviews

while len(reviews) < 1000: # we are using a while loop to scape 1000 reviews
    resp = requests.get(url)# we are fetching the reviews
    soup = BeautifulSoup(resp.content, 'html.parser')
    review_divs = soup.find_all('div', class_='lister-item-content')

    for review_div in review_divs:
        r = review_div.find('div', class_='text').get_text().strip()
        reviews.append(r)

        if len(reviews) == 1000:
            break

    next_button = soup.find('div', class_='load-more-data')
    if not next_button:
        break

    next_data_key = next_button['data-key']
    url = f'https://www.imdb.com/title/tt1877830/reviews/_ajax?paginationKey={next_data_key}'

df = pd.DataFrame(reviews, columns=['Review'])# creating a dataframe and saving the data in csv file
df.to_csv('movie_reviews_dataset.csv', index=False)
print(df.head(10))



                                              Review
0  One of the most anticipated films of the year ...
1  You'll have to have your wits about you and yo...
2  I'm a big fan of Nolan's work so was really lo...
3  "Oppenheimer" is a biographical thriller film ...
4  This movie is just... wow! I don't think I hav...
5  I was familiar with the Manhattan project and ...
6  Is it just me or did anyone else find this mov...
7  I'm still collecting my thoughts after experie...
8  0 out of 10 starsChristopher Nolan's Oppenheim...
9  I may consider myself lucky to be alive to wat...


# Question 2 (30 points)

Write a python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the csv file. The data cleaning steps include: [Code and output is required for each part]

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [3]:
# Write your code here
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Download NLTK data (stopwords and lemmatization data)
nltk.download('stopwords')
nltk.download('wordnet')

# Load the CSV file with user reviews
csv_file_name = "movie_reviews_dataset.csv"
df = pd.read_csv(csv_file_name)

# Define functions for text cleaning
def clean_text(text):
    # Remove special characters and punctuations
    text = ''.join([char for char in text if char.isalnum() or char.isspace()])

    # Remove numbers
    text = ''.join([char for char in text if not char.isdigit()])

    # Lowercase the text
    text = text.lower()

    return text
def remove_stopwords(text):
    stop_words = set(stopwords.words("english"))
    words = text.split()
    words = [word for word in words if word not in stop_words]
    return ' '.join(words)

def stem_text(text):
    stemmer = PorterStemmer()
    words = text.split()
    words = [stemmer.stem(word) for word in words]
    return ' '.join(words)

def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    words = text.split()
    words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(words)

# Apply the cleaning functions to the "User Reviews" column
df['Cleaned Reviews'] = df['Review'].apply(clean_text)
df['Cleaned Reviews'] = df['Cleaned Reviews'].apply(remove_stopwords)
df['Cleaned Reviews'] = df['Cleaned Reviews'].apply(stem_text)
df['Cleaned Reviews'] = df['Cleaned Reviews'].apply(lemmatize_text)

# Save the cleaned data to a new CSV file
cleaned_csv_file_name = "movie_cleaned_reviews.csv"
df.to_csv(cleaned_csv_file_name, index=False)
print(f"Cleaned data saved to {cleaned_csv_file_name}")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


Cleaned data saved to movie_cleaned_reviews.csv


# Question 3 (30 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [5]:
import nltk
nltk.download('punkt')




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [8]:

nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import word_tokenize
pos = []
for sentence in df['Cleaned Reviews']:
  text = word_tokenize(sentence)
  pos.append(nltk.pos_tag(text))
pos[0:50]



[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[[('one', 'CD'),
  ('anticip', 'NN'),
  ('film', 'NN'),
  ('year', 'NN'),
  ('mani', 'NN'),
  ('peopl', 'NN'),
  ('includ', 'NN'),
  ('oppenheim', 'NN'),
  ('larg', 'VBZ'),
  ('deliv', 'RB'),
  ('much', 'JJ'),
  ('great', 'JJ'),
  ('feel', 'NN'),
  ('like', 'IN'),
  ('love', 'NN'),
  ('two', 'CD'),
  ('three', 'CD'),
  ('hour', 'NN'),
  ('like', 'IN'),
  ('hour', 'NN'),
  ('fact', 'NN'),
  ('stop', 'VB'),
  ('ador', 'NN'),
  ('entir', 'JJ'),
  ('thing', 'NN'),
  ('know', 'VBP'),
  ('christoph', 'NN'),
  ('nolan', 'NN'),
  ('dunkirk', 'VBZ'),
  ('click', 'JJ'),
  ('second', 'JJ'),
  ('watch', 'NN'),
  ('mayb', 'NN'),
  ('oppenheim', 'VBD'),
  ('need', 'MD'),
  ('one', 'CD'),
  ('said', 'VBD'),
  ('dont', 'NN'),
  ('feel', 'NN'),
  ('need', 'VBP'),
  ('rush', 'NN'),
  ('see', 'VB'),
  ('soon', 'RB'),
  ('long', 'RB'),
  ('exhaust', 'JJ'),
  ('filmbut', 'NN'),
  ('mani', 'JJ'),
  ('way', 'NN'),
  ('cant', 'JJ'),
  ('deni', 'NN'),
  ('except', 'IN'),
  ('well', 'RB'),
  ('made', 'VBN'),
  

In [15]:
from collections import Counter
counts = []
for tags in pos:
  count = Counter( tag for word,  tag in tags)
  counts.append(count)
counts[0:20]

[Counter({'CD': 8,
          'NN': 84,
          'VBZ': 5,
          'RB': 15,
          'JJ': 39,
          'IN': 14,
          'VB': 4,
          'VBP': 8,
          'VBD': 4,
          'MD': 1,
          'VBN': 5,
          'RBS': 1,
          'PRP': 1,
          'JJS': 1,
          'NNS': 2,
          'FW': 1,
          'DT': 1,
          'WDT': 1}),
 Counter({'NN': 83,
          'JJ': 27,
          'MD': 2,
          'VB': 5,
          'RB': 4,
          'VBZ': 2,
          'JJS': 2,
          'VBP': 3,
          'NNS': 2,
          'IN': 1,
          'VBD': 1}),
 Counter({'NN': 39,
          'JJ': 15,
          'VB': 6,
          'RB': 6,
          'MD': 2,
          'NNS': 3,
          'VBP': 6,
          'VBN': 2,
          'JJS': 1,
          'IN': 2,
          'VBD': 2}),
 Counter({'JJ': 185,
          'NN': 477,
          'VBN': 7,
          'IN': 26,
          'VBD': 21,
          'FW': 6,
          'RB': 29,
          'JJS': 7,
          'VBP': 37,
          'NNS': 13,
   

In [None]:
import csv
import nltk
import spacy
from nltk import pos_tag, ne_chunk
from nltk.tokenize import sent_tokenize
from nltk.corpus import wordnet
from spacy import displacy
nlp = spacy.load("en_core_web_sm")

# Loading the cleaned text from the CSV file
load_csv = "movie_cleaned_reviews.csv"

# Initializing counters for POS tagging
countnoun = 0
countverb = 0
countadj = 0
countadv = 0

# Initializing counters for named entities
countperson = 0
countorganization = 0
countlocation = 0
countproduct = 0
countdate = 0


# creating a Function to perform POS tagging and entity recognition
def analyze(text):
    global countnoun, countverb, countadj, countadv
    global countperson, countorganization, countlocation, countproduct, countdate


    # Tokenize text into sentences
    sentences = sent_tokenize(text)

    for sentence in sentences:
        # POS tagging using NLTK
        words = nltk.word_tokenize(sentence)
        pos_tags = nltk.pos_tag(words)
        for word, pos in pos_tags:
            if pos.startswith('N'):  # Noun
                countnoun += 1
            elif pos.startswith('V'):  # Verb
                countverb += 1
            elif pos.startswith('J'):  # Adjective
                countadj += 1
            elif pos.startswith('R'):  # Adverb
                countadv += 1

        # Named Entity Recognition with spaCy
        doc = nlp(sentence)
        for ent in doc.ents:
            if ent.label_ == 'PERSON':
                countperson += 1
            elif ent.label_ == 'ORG':
                countorganization += 1
            elif ent.label_ == 'GPE':
                countlocation += 1
            elif ent.label_ == 'PRODUCT':
                countproduct += 1
            elif ent.label_ == 'DATE':
                countdate += 1

# Read the CSV file and analyze the text
with open(load_csv, 'r', newline='', encoding='utf-8') as input_file:
    reader = csv.reader(input_file)
    header = next(reader)  # Skip the header row

    for row in reader:
        cleaned_text = row[-1]
        analyze(cleaned_text)
# Display the analysis results
print("POS Tagging Results:")
print(f"Total Nouns: {countnoun}")
print(f"Total Verbs: {countverb}")
print(f"Total Adjectives: {countadj}")
print(f"Total Adverbs: {countadv}")

print("\nNamed Entity Recognition Results:")
print(f"Persons: {countperson}")
print(f"Organizations: {countorganization}")
print(f"Locations: {countlocation}")
print(f"Products: {countproduct}")
print(f"Dates: {countdate}")

POS Tagging Results:
Total Nouns: 52198
Total Verbs: 12886
Total Adjectives: 19917
Total Adverbs: 5066

Named Entity Recognition Results:
Persons: 3110
Organizations: 772
Locations: 282
Products: 18
Dates: 215


In [14]:
import spacy
import pandas as pd
# Loading the English language model from spaCy
nlp = spacy.load("en_core_web_sm")

# Loading the cleaned movie reviews CSV file using pandas
df = pd.read_csv('movie_cleaned_reviews.csv')
def analyze_text(text):
    # Procesings the text with spaCy to analyze it
    doc = nlp(text)

    # Displaying the dependency parsing visualization for the entire text
    spacy.displacy.render(doc, style="dep", options={'compact': True, 'distance': 90})

    spacy.displacy.render(doc, style = "dep", jupyter=True, options={"distance":140})

    # Displaying the named entity recognition visualization for the entire text
    spacy.displacy.render(doc, style="ent")

for index, row in df.head(2).iterrows():
    cleaned_text = row['Cleaned Reviews']
    analyze_text(cleaned_text)




# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

In [None]:
#It took a longtime to work on this assignment.Working on parsing is a bit challenging.