<a href="https://colab.research.google.com/github/rajidisindhuja/sindhuja_INFO5731_Fall2023/blob/main/Rajidi_Sindhuja_Assignment_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Two**

In this assignment, you will try to gather text data from open data source via web scraping or API. After that you need to clean the text data and syntactic analysis of the data.

# **Question 1**

(40 points). Write a python program to collect text data from **either of the following sources** and save the data into a **csv file**:

(1) Collect all the customer reviews of a product (you can choose any porduct) on amazon.

(2) Collect the top 10000 User Reviews of a film recently in 2023 or 2022 (you can choose any film) from IMDB.

(3) Collect all the reviews of the top 1000 most popular software from [G2](https://www.g2.com/) or [Capterra](https://www.capterra.com/)

(4) Collect the abstracts of the top 10000 research papers by using the query "machine learning", "data science", "artifical intelligence", or "information extraction" from [Semantic Scholar](https://www.semanticscholar.org).

(5) Collect all the information of the 904 narrators in the [Densho Digital Repository](https://ddr.densho.org/narrators/).

(6) Collect the top 10000 reddits by using a hashtag (you can use any hashtag) from Reddits.


In [None]:
import requests
from bs4 import BeautifulSoup
import csv

# IMDb URL for the film's reviews
url = "https://www.imdb.com/title/tt10640346/reviews?ref_=tt_urv"

def scrape_imdb_reviews(url, max_reviews=10000):
    reviews = []
    page = 1

    while len(reviews) < max_reviews:
        response = requests.get(url, params={"start": (page - 1) * 25})
        if response.status_code != 200:
            print(f"Failed to retrieve page {page}. Check your internet connection or the URL.")
            break

        soup = BeautifulSoup(response.content, "html.parser")
        review_elements = soup.find_all("div", class_="lister-item")

        if not review_elements:
            print("No more reviews found.")
            break

        for review_element in review_elements:
            review_text = review_element.find("div", class_="text").get_text()
            reviews.append(review_text)

            if len(reviews) >= max_reviews:
                break

        page += 1

    return reviews

if __name__ == "__main__":
    max_reviews_to_collect = 10000  # Adjust this number to collect 10,000 reviews
    reviews = scrape_imdb_reviews(url, max_reviews_to_collect)

    if reviews:
        with open("imdb_reviews_babylon.csv", "w", newline="", encoding="utf-8") as csvfile:
            writer = csv.writer(csvfile)
            writer.writerow(["Review"])
            for review in reviews:
                writer.writerow([review.strip()])

        print(f"Successfully collected {len(reviews)} reviews and saved them to imdb_reviews_babylon.csv.")
    else:
        print("No reviews were found.")


Successfully collected 10000 reviews and saved them to imdb_reviews_babylon.csv.


# **Question 2**

(30 points). Write a python program to **clean the text data** you collected above and save the data in a new column in the csv file. The data cleaning steps include:

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the [stopwords list](https://gist.github.com/sebleier/554280).

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [17]:
# Write your code here

# importing text into a df
import pandas as pd
data_url="/content/imdb_reviews_babylon.csv"
df = pd.read_table(data_url,names=['text'])
df



Unnamed: 0,text
0,Review
1,This film felt like it was written and directe...
2,"Whether it be orgies, showcasing various bodil..."
3,In the opening scene of BABYLON an elephant on...
4,So many reviews praise this film for the level...
...,...
9996,This movie works on the premise that anything ...
9997,This movie should never have been made. From t...
9998,"Far too long, for the reward that it withholds..."
9999,This is definitely one of the worst movies I'v...


In [11]:
#Number of characters
df['Number of Characters'] = df['text'].str.len()
df

Unnamed: 0,text,Number of Characters
0,reviews,7
1,I needed to replace my teens Iphone and absolu...,776
2,I needed to replace my teens Iphone and absolu...,776
3,Was skeptical to get one a renewed iPhone . De...,878
4,I needed to replace my teens Iphone and absolu...,776
...,...,...
1316,I bought one of the used models and I have no ...,486
1317,"I got the white iphone 11, i haven't even had ...",375
1318,I wasn’t sure what to expect from a refurbishe...,618
1319,I bought the purple iPhone 11 from Stone Digit...,754


In [16]:
#Number of numerics
df['Numbers of numerics'] = df['text'].apply(lambda x: len([x for x in x.split() if x.isdigit()]))
df

Unnamed: 0,text,Number of Characters,Numbers of numerics
0,reviews,7,0
1,I needed to replace my teens Iphone and absolu...,776,0
2,I needed to replace my teens Iphone and absolu...,776,0
3,Was skeptical to get one a renewed iPhone . De...,878,5
4,I needed to replace my teens Iphone and absolu...,776,0
...,...,...,...
1316,I bought one of the used models and I have no ...,486,1
1317,"I got the white iphone 11, i haven't even had ...",375,0
1318,I wasn’t sure what to expect from a refurbishe...,618,1
1319,I bought the purple iPhone 11 from Stone Digit...,754,2


In [18]:
#digit removal
df['After digits removal'] = df['text'].apply(lambda x: ''.join([i for i in x if not i.isdigit()]))
df

Unnamed: 0,text,After digits removal
0,Review,Review
1,This film felt like it was written and directe...,This film felt like it was written and directe...
2,"Whether it be orgies, showcasing various bodil...","Whether it be orgies, showcasing various bodil..."
3,In the opening scene of BABYLON an elephant on...,In the opening scene of BABYLON an elephant on...
4,So many reviews praise this film for the level...,So many reviews praise this film for the level...
...,...,...
9996,This movie works on the premise that anything ...,This movie works on the premise that anything ...
9997,This movie should never have been made. From t...,This movie should never have been made. From t...
9998,"Far too long, for the reward that it withholds...","Far too long, for the reward that it withholds..."
9999,This is definitely one of the worst movies I'v...,This is definitely one of the worst movies I'v...


In [19]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [20]:
#Stopwords removal
from nltk.corpus import stopwords
stop = stopwords.words('english')
df['Stopwords Removal'] = df['text'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df

Unnamed: 0,text,After digits removal,Stopwords Removal
0,Review,Review,Review
1,This film felt like it was written and directe...,This film felt like it was written and directe...,This film felt like written directed high scho...
2,"Whether it be orgies, showcasing various bodil...","Whether it be orgies, showcasing various bodil...","Whether orgies, showcasing various bodily flui..."
3,In the opening scene of BABYLON an elephant on...,In the opening scene of BABYLON an elephant on...,In opening scene BABYLON elephant pick-up empt...
4,So many reviews praise this film for the level...,So many reviews praise this film for the level...,"So many reviews praise film level ""debauchery""..."
...,...,...,...
9996,This movie works on the premise that anything ...,This movie works on the premise that anything ...,This movie works premise anything said loudly ...
9997,This movie should never have been made. From t...,This movie should never have been made. From t...,This movie never made. From perspective histor...
9998,"Far too long, for the reward that it withholds...","Far too long, for the reward that it withholds...","Far long, reward withholds, although fearless ..."
9999,This is definitely one of the worst movies I'v...,This is definitely one of the worst movies I'v...,This definitely one worst movies I've ever see...


In [21]:
#Lower casing
df['Lower Case'] = df['text'].apply(lambda x: " ".join(x.lower() for x in x.split()))
df

Unnamed: 0,text,After digits removal,Stopwords Removal,Lower Case
0,Review,Review,Review,review
1,This film felt like it was written and directe...,This film felt like it was written and directe...,This film felt like written directed high scho...,this film felt like it was written and directe...
2,"Whether it be orgies, showcasing various bodil...","Whether it be orgies, showcasing various bodil...","Whether orgies, showcasing various bodily flui...","whether it be orgies, showcasing various bodil..."
3,In the opening scene of BABYLON an elephant on...,In the opening scene of BABYLON an elephant on...,In opening scene BABYLON elephant pick-up empt...,in the opening scene of babylon an elephant on...
4,So many reviews praise this film for the level...,So many reviews praise this film for the level...,"So many reviews praise film level ""debauchery""...",so many reviews praise this film for the level...
...,...,...,...,...
9996,This movie works on the premise that anything ...,This movie works on the premise that anything ...,This movie works premise anything said loudly ...,this movie works on the premise that anything ...
9997,This movie should never have been made. From t...,This movie should never have been made. From t...,This movie never made. From perspective histor...,this movie should never have been made. from t...
9998,"Far too long, for the reward that it withholds...","Far too long, for the reward that it withholds...","Far long, reward withholds, although fearless ...","far too long, for the reward that it withholds..."
9999,This is definitely one of the worst movies I'v...,This is definitely one of the worst movies I'v...,This definitely one worst movies I've ever see...,this is definitely one of the worst movies i'v...


In [22]:
#Stemming
from nltk.stem import PorterStemmer
st = PorterStemmer()
df['After Stemming'] = df['Lower Case'].apply(lambda x: " ".join([st.stem(word) for word in x]))
df

Unnamed: 0,text,After digits removal,Stopwords Removal,Lower Case,After Stemming
0,Review,Review,Review,review,r e v i e w
1,This film felt like it was written and directe...,This film felt like it was written and directe...,This film felt like written directed high scho...,this film felt like it was written and directe...,t h i s f i l m f e l t l i k e i t ...
2,"Whether it be orgies, showcasing various bodil...","Whether it be orgies, showcasing various bodil...","Whether orgies, showcasing various bodily flui...","whether it be orgies, showcasing various bodil...","w h e t h e r i t b e o r g i e s , s ..."
3,In the opening scene of BABYLON an elephant on...,In the opening scene of BABYLON an elephant on...,In opening scene BABYLON elephant pick-up empt...,in the opening scene of babylon an elephant on...,i n t h e o p e n i n g s c e n e o f ...
4,So many reviews praise this film for the level...,So many reviews praise this film for the level...,"So many reviews praise film level ""debauchery""...",so many reviews praise this film for the level...,s o m a n y r e v i e w s p r a i s e ...
...,...,...,...,...,...
9996,This movie works on the premise that anything ...,This movie works on the premise that anything ...,This movie works premise anything said loudly ...,this movie works on the premise that anything ...,t h i s m o v i e w o r k s o n t h e ...
9997,This movie should never have been made. From t...,This movie should never have been made. From t...,This movie never made. From perspective histor...,this movie should never have been made. from t...,t h i s m o v i e s h o u l d n e v e r ...
9998,"Far too long, for the reward that it withholds...","Far too long, for the reward that it withholds...","Far long, reward withholds, although fearless ...","far too long, for the reward that it withholds...","f a r t o o l o n g , f o r t h e r ..."
9999,This is definitely one of the worst movies I'v...,This is definitely one of the worst movies I'v...,This definitely one worst movies I've ever see...,this is definitely one of the worst movies i'v...,t h i s i s d e f i n i t e l y o n e ...


In [23]:
#Punctuation removal
df['Removal of Punctuation'] = df['Lower Case'].str.replace('[^\w\s]','')
df

  df['Removal of Punctuation'] = df['Lower Case'].str.replace('[^\w\s]','')


Unnamed: 0,text,After digits removal,Stopwords Removal,Lower Case,After Stemming,Removal of Punctuation
0,Review,Review,Review,review,r e v i e w,review
1,This film felt like it was written and directe...,This film felt like it was written and directe...,This film felt like written directed high scho...,this film felt like it was written and directe...,t h i s f i l m f e l t l i k e i t ...,this film felt like it was written and directe...
2,"Whether it be orgies, showcasing various bodil...","Whether it be orgies, showcasing various bodil...","Whether orgies, showcasing various bodily flui...","whether it be orgies, showcasing various bodil...","w h e t h e r i t b e o r g i e s , s ...",whether it be orgies showcasing various bodily...
3,In the opening scene of BABYLON an elephant on...,In the opening scene of BABYLON an elephant on...,In opening scene BABYLON elephant pick-up empt...,in the opening scene of babylon an elephant on...,i n t h e o p e n i n g s c e n e o f ...,in the opening scene of babylon an elephant on...
4,So many reviews praise this film for the level...,So many reviews praise this film for the level...,"So many reviews praise film level ""debauchery""...",so many reviews praise this film for the level...,s o m a n y r e v i e w s p r a i s e ...,so many reviews praise this film for the level...
...,...,...,...,...,...,...
9996,This movie works on the premise that anything ...,This movie works on the premise that anything ...,This movie works premise anything said loudly ...,this movie works on the premise that anything ...,t h i s m o v i e w o r k s o n t h e ...,this movie works on the premise that anything ...
9997,This movie should never have been made. From t...,This movie should never have been made. From t...,This movie never made. From perspective histor...,this movie should never have been made. from t...,t h i s m o v i e s h o u l d n e v e r ...,this movie should never have been made from th...
9998,"Far too long, for the reward that it withholds...","Far too long, for the reward that it withholds...","Far long, reward withholds, although fearless ...","far too long, for the reward that it withholds...","f a r t o o l o n g , f o r t h e r ...",far too long for the reward that it withholds ...
9999,This is definitely one of the worst movies I'v...,This is definitely one of the worst movies I'v...,This definitely one worst movies I've ever see...,this is definitely one of the worst movies i'v...,t h i s i s d e f i n i t e l y o n e ...,this is definitely one of the worst movies ive...


In [24]:
import nltk
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [25]:
#Lemmatization
from textblob import Word
import nltk
nltk.download('wordnet')
df['After Lemmatization'] = df['Removal of Punctuation'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
df

[nltk_data] Downloading package wordnet to /root/nltk_data...


Unnamed: 0,text,After digits removal,Stopwords Removal,Lower Case,After Stemming,Removal of Punctuation,After Lemmatization
0,Review,Review,Review,review,r e v i e w,review,review
1,This film felt like it was written and directe...,This film felt like it was written and directe...,This film felt like written directed high scho...,this film felt like it was written and directe...,t h i s f i l m f e l t l i k e i t ...,this film felt like it was written and directe...,this film felt like it wa written and directed...
2,"Whether it be orgies, showcasing various bodil...","Whether it be orgies, showcasing various bodil...","Whether orgies, showcasing various bodily flui...","whether it be orgies, showcasing various bodil...","w h e t h e r i t b e o r g i e s , s ...",whether it be orgies showcasing various bodily...,whether it be orgy showcasing various bodily f...
3,In the opening scene of BABYLON an elephant on...,In the opening scene of BABYLON an elephant on...,In opening scene BABYLON elephant pick-up empt...,in the opening scene of babylon an elephant on...,i n t h e o p e n i n g s c e n e o f ...,in the opening scene of babylon an elephant on...,in the opening scene of babylon an elephant on...
4,So many reviews praise this film for the level...,So many reviews praise this film for the level...,"So many reviews praise film level ""debauchery""...",so many reviews praise this film for the level...,s o m a n y r e v i e w s p r a i s e ...,so many reviews praise this film for the level...,so many review praise this film for the level ...
...,...,...,...,...,...,...,...
9996,This movie works on the premise that anything ...,This movie works on the premise that anything ...,This movie works premise anything said loudly ...,this movie works on the premise that anything ...,t h i s m o v i e w o r k s o n t h e ...,this movie works on the premise that anything ...,this movie work on the premise that anything s...
9997,This movie should never have been made. From t...,This movie should never have been made. From t...,This movie never made. From perspective histor...,this movie should never have been made. from t...,t h i s m o v i e s h o u l d n e v e r ...,this movie should never have been made from th...,this movie should never have been made from th...
9998,"Far too long, for the reward that it withholds...","Far too long, for the reward that it withholds...","Far long, reward withholds, although fearless ...","far too long, for the reward that it withholds...","f a r t o o l o n g , f o r t h e r ...",far too long for the reward that it withholds ...,far too long for the reward that it withholds ...
9999,This is definitely one of the worst movies I'v...,This is definitely one of the worst movies I'v...,This definitely one worst movies I've ever see...,this is definitely one of the worst movies i'v...,t h i s i s d e f i n i t e l y o n e ...,this is definitely one of the worst movies ive...,this is definitely one of the worst movie ive ...


In [None]:
df.to_csv('cleaned_data.csv',index=False)

# **Question 3**

(30 points). Write a python program to conduct **syntax and structure analysis** of the clean text you just saved above. The syntax and structure analysis includes:

(1) Parts of Speech (POS) Tagging: Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) Constituency Parsing and Dependency Parsing: print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) Named Entity Recognition: Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [27]:
import spacy
from collections import Counter

# Load the English NLP model from SpaCy
nlp = spacy.load("en_core_web_sm")

# Increase the max_length limit (adjust as needed)
nlp.max_length = 2000000

# Read the clean text from the CSV file
text = ""
with open("/content/cleaned_data.csv", "r", encoding="utf-8") as csvfile:
    for line in csvfile:
        text += line

# Split the text into sentences or paragraphs (adjust as needed)
sections = text.split("\n")  # Split by newline, assuming one review per line

# Initialize counters for POS tagging
noun_count = 0
verb_count = 0
adj_count = 0
adv_count = 0

# Initialize counters for named entities
named_entities = Counter()

# Iterate over sections to perform analysis
for section in sections:
    doc = nlp(section)

    # (1) Parts of Speech (POS) Tagging
    for token in doc:
        if token.pos_ == "NOUN":
            noun_count += 1
        elif token.pos_ == "VERB":
            verb_count += 1
        elif token.pos_ == "ADJ":
            adj_count += 1
        elif token.pos_ == "ADV":
            adv_count += 1

    # (3) Named Entity Recognition
    for token in doc.ents:
        named_entities[token.label_] += 1

# Print POS tagging counts
print("Parts of Speech (POS) Tagging:")
print(f"Nouns: {noun_count}")
print(f"Verbs: {verb_count}")
print(f"Adjectives: {adj_count}")
print(f"Adverbs: {adv_count}")

# Print Named Entity Recognition results
print("\nNamed Entity Recognition (NER):")
for entity, count in named_entities.items():
    print(f"{entity}: {count}")


Parts of Speech (POS) Tagging:
Nouns: 9646809
Verbs: 2992800
Adjectives: 1705200
Adverbs: 1310800

Named Entity Recognition (NER):
PERSON: 342001
WORK_OF_ART: 34401
ORG: 155601
CARDINAL: 83200
GPE: 74000
TIME: 59200
FAC: 4800
DATE: 56800
NORP: 23200
PRODUCT: 1200
MONEY: 1600
ORDINAL: 26800
QUANTITY: 2000
PERCENT: 1600
LOC: 400
EVENT: 800


**Write your explanations of the constituency parsing tree and dependency parsing tree here (Question 3-2):**

A constituency parsing tree is a tree-shaped representation of the grammatical structure of a sentence. It is based on the idea that sentences can be broken down into smaller and smaller units, called constituents. Constituents are groups of words that function together as a single unit in the sentence.

The root node of a constituency parsing tree is the entire sentence. The other nodes in the tree represent the constituents of the sentence. Constituents are typically labeled with their grammatical category, such as noun phrase (NP), verb phrase (VP), or prepositional phrase (PP).

Dependency parsing tree

A dependency parsing tree is another type of tree-shaped representation of the grammatical structure of a sentence. It is based on the idea that each word in a sentence depends on one or more other words in the sentence. The root node of a dependency parsing tree is the head word of the sentence. The head word is the most important word in the sentence, and the other words in the sentence depend on it in some way.


Constituency parsing trees and dependency parsing trees are two different ways of representing the grammatical structure of a sentence. They have different strengths and weaknesses, and they are used for different tasks.

Constituency parsing trees are good at showing the hierarchical structure of a sentence. They are also good at showing how different constituents in a sentence relate to each other. However, constituency parsing trees can be difficult to build for sentences with complex grammatical structures.

Dependency parsing trees are good at showing the grammatical relationships between words in a sentence. They are also easier to build than constituency parsing trees for sentences with complex grammatical structures. However, dependency parsing trees do not show the hierarchical structure of a sentence as well as constituency parsing trees.