# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**


# Question 1 (40 points)

Write a python program to collect text data from **either of the following sources** and save the data into a **csv file:**

(1) Collect all the customer reviews of a product (you can choose any porduct) on amazon. [atleast 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie recently in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of atleast 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artifical intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [10]:
# Your code here
from bs4 import BeautifulSoup
import requests
import json
url = "https://www.amazon.com/fire-tv-stick-with-3rd-gen-alexa-voice-remote/dp/B08C1W5N87/ref=zg_bs_c_amazon-devices_d_sccl_2/135-8321326-4504565?pd_rd_w=mLXdV&content-id=amzn1.sym.309d45c5-3eba-4f62-9bb2-0acdcf0662e7&pf_rd_p=309d45c5-3eba-4f62-9bb2-0acdcf0662e7&pf_rd_r=CYBEY4VX3T80MTJKDSAP&pd_rd_wg=bKVQM&pd_rd_r=9b84bbc4-55c5-4804-bb36-e07e912de0f7&pd_rd_i=B08C1W5N87&psc=1"
numreviews = 0
reviews = []
while numreviews < 1000:
 response = requests.get(url)
 soup = BeautifulSoup(response.content, 'html.parser')
 productreview = soup.find_all('div', {'data-hook':'review'})
 for review in productreview:
        rating = review.find('i', {'data-hook':'review-star-rating'}).text.strip()
        title = review.find('a', {'data-hook':'review-title'}).text.strip()
        text = review.find('span', {'data-hook':'review-body'}).text.strip()
        reviewdata = {
            'rating': rating,
            'title': title,
            'text': text
        }
        reviews.append(reviewdata)
        numreviews += 1
for review in reviews:
 print(f"Scraped {numreviews} reviews")
 print("Rating:", review['rating'])
 print("Title:", review['title'])
 print("Review:", review['text'])
 print()



[1;30;43mStreaming output truncated to the last 5000 lines.[0m
Read more

Scraped 1000 reviews
Rating: 5.0 out of 5 stars
Title: 5.0 out of 5 stars
Excellent and easy to set up
Review: Was having issues with my fire starter and didn’t realize it was due to being to old 2016.  This was easy to set up however there was one point where I had to YouTube 1 of the steps because they weren’t giving me correct instructions
Read more

Scraped 1000 reviews
Rating: 5.0 out of 5 stars
Title: 5.0 out of 5 stars
Amount of free Apps was surprising
Review: Didn't realize not all smart TV's are created equal.  Bought this for a non smart TV and found lots of Apps didn't know where available for free viewing.  Remote control works great, love the multiple profile option, set up one for the wife and one for the grands making it easier to navigate quickly to what they like, streaming wireless on 5g with no timeouts.
Read more

Scraped 1000 reviews
Rating: 5.0 out of 5 stars
Title: 5.0 out of 5 stars
Thi

# Question 2 (30 points)

Write a python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the csv file. The data cleaning steps include: [Code and output is required for each part]

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [54]:
# Write code for each of the sub parts with proper comments.

import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string
import re
import csv
from google.colab import drive
import shutil
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
stopwordslist = stopwords.words('english')
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
reviews = []
with open('reviews.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['rating', 'title', 'text'])
    for review in reviews:
        rating = review['rating']
        title = review['title']
        text = review['text']
        writer.writerow([rating, title, text])
print('Reviews')
#1
cleanedtext = re.sub(r'[^a-zA-Z\s]', '', text)
reviews.append({'rating': rating, 'title': title, 'text': cleanedtext})
print('Reviews')
for review in reviews:
    print(review['text'])
#2
text = re.sub(r'\d+', '', text)
print("After removing numbers:", text)
#3
stopwords = set(stopwords.words('english'))
words = [word for word in text.split() if word not in stopwords]
stopwordsremoved = " ".join(words)
print("Stopwords removed:", stopwordsremoved)
#4
lowercased = stopwordsremoved.lower()
print("Lowercased:", lowercased)
#5
stemmed = " ".join([stemmer.stem(word) for word in lowercased.split()])
print("Stemmed:", stemmed)
#6
lemmatized = " ".join([lemmatizer.lemmatize(word) for word in stemmed.split()])
print("Lemmatized:", lemmatized)
cleanedtext = lemmatized
with open('cleanedreviews.csv', 'w', newline='') as outfile:
  writer = csv.writer(outfile)

  # Write header
  writer.writerow(['rating', 'title', 'cleanedtext'])

  for review in reviews:
    writer.writerow([review['rating'], review['title'], review['text']])

print("saved to 'cleanedreviews.csv'")

drive.mount('/content/drive')
file_path = '/content/drive/My Drive/cleanedreviews.csv'
shutil.copy('cleanedreviews.csv', file_path)

print("Cleaned reviews saved to Google Drive at:", file_path)







[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Reviews
Reviews
Didnt realize not all smart TVs are created equal  Bought this for a non smart TV and found lots of Apps didnt know where available for free viewing  Remote control works great love the multiple profile option set up one for the wife and one for the grands making it easier to navigate quickly to what they like streaming wireless on g with no timeouts
Read more
After removing numbers: Didn't realize not all smart TV's are created equal.  Bought this for a non smart TV and found lots of Apps didn't know where available for free viewing.  Remote control works great, love the multiple profile option, set up one for the wife and one for the grands making it easier to navigate quickly to what they like, streaming wireless on g with no timeouts.
Read more
Stopwords removed: Didn't realize smart TV's created equal. Bought non smart TV found lots Apps know available free viewing. Remote control works great, love multiple profile option, set one wife one grands making easier navi

# Question 3 (30 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [52]:
# Your code here
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
df = pd.read_csv('cleanedreviews.csv')
  #1
tokens = word_tokenize(cleanedtext)
pos_tags = pos_tag(tokens)
nouncount = 0
verbcount = 0
adjcount = 0
advcount = 0
for word, pos in pos_tags:
    if pos.startswith('N'):
      nouncount += 1
    elif pos.startswith('V'):
      verbcount += 1
    elif pos.startswith('J'):
      adjcount += 1
    elif pos.startswith('R'):
      advcount += 1
print("(1) Parts of Speech (POS) Tagging:")
print("Noun Count:", nouncount)
print("Verb Count:", verbcount)
print("Adjective Count:", adjcount)
print("Adverb Count:", advcount)
#2
sentences = word_tokenize(cleanedtext)
for sentence in sentences:
        print("\n(2) Constituency Parsing and Dependency Parsing for sentence:")
        print("Sentence:", sentence)




(1) Parts of Speech (POS) Tagging:
Noun Count: 15
Verb Count: 11
Adjective Count: 10
Adverb Count: 2

(2) Constituency Parsing and Dependency Parsing for sentence:
Sentence: did

(2) Constituency Parsing and Dependency Parsing for sentence:
Sentence: n't

(2) Constituency Parsing and Dependency Parsing for sentence:
Sentence: realiz

(2) Constituency Parsing and Dependency Parsing for sentence:
Sentence: smart

(2) Constituency Parsing and Dependency Parsing for sentence:
Sentence: tv

(2) Constituency Parsing and Dependency Parsing for sentence:
Sentence: '

(2) Constituency Parsing and Dependency Parsing for sentence:
Sentence: creat

(2) Constituency Parsing and Dependency Parsing for sentence:
Sentence: equal

(2) Constituency Parsing and Dependency Parsing for sentence:
Sentence: .

(2) Constituency Parsing and Dependency Parsing for sentence:
Sentence: bought

(2) Constituency Parsing and Dependency Parsing for sentence:
Sentence: non

(2) Constituency Parsing and Dependency Pars

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

In [None]:
# Write your response below
i really liked working on the questions.