
# **INFO5731 Assignment Two**

In this assignment, you will gather text data from an open data source via web scraping or an API. You will then clean the text data and perform syntactic analysis on it.

# **Question 1**

(40 points). Write a Python program to collect text data from **one of the following sources** and save the data into a **csv file**:

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon.

(2) Collect the top 10000 user reviews of a recent film from 2022 or 2023 (you can choose any film) from IMDb.

(3) Collect all the reviews of the top 1000 most popular software from [G2](https://www.g2.com/) or [Capterra](https://www.capterra.com/)

(4) Collect the abstracts of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from [Semantic Scholar](https://www.semanticscholar.org).

(5) Collect all the information of the 904 narrators in the [Densho Digital Repository](https://ddr.densho.org/narrators/).

(6) Collect the top 10000 Reddit posts by using a hashtag (you can use any hashtag) from Reddit.


(1) Collect all the customer reviews of a product (you can choose any product) on Amazon.

In [30]:
pip install requests beautifulsoup4




In [31]:
import requests
from bs4 import BeautifulSoup
import csv

# Define the URL of the IMDb page for "Baahubali"
url = "https://www.imdb.com/title/tt2631186/reviews?ref_=tt_ql_3"

# Send an HTTP GET request to the IMDb page (the timeout keeps a stalled connection from hanging the cell)
response = requests.get(url, timeout=30)

if response.status_code == 200:
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.text, "html.parser")

    # Find and extract user reviews
    review_elements = soup.find_all("div", class_="text show-more__control")
    user_reviews = [review.get_text() for review in review_elements]

    # Prepare data for saving to a CSV file
    data = [{"user_review": review} for review in user_reviews]

    # Define the CSV file name
    csv_file = "baahubali_reviews.csv"

    # Write the data to a CSV file (explicit UTF-8 so non-ASCII characters in reviews are written consistently)
    with open(csv_file, "w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=["user_review"])
        writer.writeheader()
        writer.writerows(data)

    print(f"Data saved to {csv_file}")
else:
    print(f"Failed to retrieve data from IMDb. Status code: {response.status_code}")


Data saved to baahubali_reviews.csv


(2) Collect the top 10000 user reviews of a recent film from 2022 or 2023 (you can choose any film) from IMDb.

In [8]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Function to scrape user reviews from IMDB for a given movie
def scrape_terrifier_reviews(movie_id, num_reviews):
    base_url = f"https://www.imdb.com/title/{movie_id}/reviews"
    reviews = []

    for start in range(1, num_reviews + 1, 25):
        url = f"{base_url}?sort=submissionDate&dir=desc&ratingFilter=0&start={start}"
        response = requests.get(url, timeout=30)  # do not wait forever on a stalled request

        if response.status_code != 200:
            print(f"Failed to retrieve reviews from {url}")
            continue

        soup = BeautifulSoup(response.text, 'html.parser')
        review_divs = soup.find_all('div', class_='text show-more__control')

        for review_div in review_divs:
            text = review_div.get_text(strip=True)
            reviews.append(text)

    return reviews

# Example: Scrape reviews for a terrifier movie
movie_id = "tt10403420"
num_reviews = 10000

reviews = scrape_terrifier_reviews(movie_id, num_reviews)

# Save reviews to a CSV file
df = pd.DataFrame({'Reviews': reviews})
df.to_csv('reviews_terrifier.csv', index=False)

print(f"Scraped {len(reviews)} reviews and saved to 'reviews_terrifier.csv'.")


Scraped 10000 reviews and saved to 'reviews_terrifier.csv'.
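
Because every page request above differs only in the `start` query parameter, a reported total of exactly 10000 does not by itself show that 10000 distinct reviews were collected. A minimal sketch of a sanity check, assuming the scraped texts are held in a plain list of strings (the helper name `summarize_duplicates` and the sample data are hypothetical):

```python
from collections import Counter


def summarize_duplicates(reviews):
    """Return (total, unique, most_repeated) for a list of review strings."""
    counts = Counter(reviews)
    # most_common(1) is empty for an empty list, so guard against that
    most_repeated = counts.most_common(1)[0] if counts else None
    return len(reviews), len(counts), most_repeated


# Hypothetical sample: the same page of two reviews fetched three times
sample = ["Great movie", "Too gory for me"] * 3
total, unique, most_repeated = summarize_duplicates(sample)
print(f"{total} rows, {unique} unique; most repeated: {most_repeated}")
```

If `unique` turns out far smaller than `total`, the pagination parameter is probably not being honored, and the list should be de-duplicated (for example with `list(dict.fromkeys(reviews))`) before it is saved.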


# **Question 2**

(30 points). Write a Python program to **clean the text data** you collected above and save the data in a new column in the csv file. The data cleaning steps include:

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the [stopwords list](https://gist.github.com/sebleier/554280).

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.
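
Steps (1) to (4) can be sketched with the standard library alone. This is a minimal illustration rather than a required implementation: the tiny `STOPWORDS` set is a hypothetical stand-in for the linked list, the helper name `basic_clean` is made up, and stemming and lemmatization (steps 5 and 6) are left out because they need NLTK. Replacing removed characters with a space instead of an empty string keeps the words on either side of a punctuation mark from being glued together.

```python
import re

# Hypothetical stand-in for the full stopword list linked in step (3)
STOPWORDS = {"i", "the", "was", "it", "a", "and", "t"}


def basic_clean(text):
    """Apply steps (1)-(4): strip noise and digits, lowercase, drop stopwords."""
    # (1) + (2): anything that is not a letter or whitespace becomes a space
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # (4): lowercase before the stopword lookup so "The" matches "the"
    tokens = text.lower().split()
    # (3): keep only tokens that are not stopwords
    return " ".join(tok for tok in tokens if tok not in STOPWORDS)


cleaned = basic_clean("I waited 2 years.The film was disappointing!")
print(cleaned)
```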

In [17]:
pip install pandas nltk textblob




In [44]:
import pandas as pd
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Download the NLTK stopwords and WordNet lemmatizer data
nltk.download('stopwords')
nltk.download('wordnet')

# Read the CSV file
df = pd.read_csv('baahubali_reviews.csv')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [37]:
df.columns

Index(['user_review'], dtype='object')

In [45]:
import pandas as pd
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Download the NLTK stopwords and WordNet lemmatizer data
nltk.download('stopwords')
nltk.download('wordnet')

# Read the CSV file
df = pd.read_csv('baahubali_reviews.csv')

# Build the stopword set, stemmer and lemmatizer once instead of on every call
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Function to clean and preprocess text
def clean_text(text):
    # Remove special characters and punctuation (this pattern also strips digits)
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Remove numbers (kept as an explicit step; the pattern above already removed them)
    text = re.sub(r'\d+', '', text)

    # Lowercase
    text = text.lower()

    # Tokenize on whitespace
    words = text.split()

    # Remove stopwords
    words = [word for word in words if word not in stop_words]

    # Lemmatize each word, then stem the lemma
    words = [stemmer.stem(lemmatizer.lemmatize(word)) for word in words]

    # Join the cleaned words back into a single string
    return ' '.join(words)

# Apply the cleaning function (note: this overwrites 'user_review' in place rather than adding a new column)
df['user_review'] = df['user_review'].apply(clean_text)

# Save the DataFrame with cleaned text to a new CSV file
df.to_csv('clean_data.csv', index=False)

print(df)


                                          user_review
0   eagerli wait film disappointedbaahubali one vi...
1   person bit movi sure person heard baahubali co...
2   year sheer hardwork mani peopl work heart visu...
3   im big fan rajamouli bias review rate gave raj...
4   test whether movi clich pathet romanc hero woo...
5   great cinematographi concept stori nice direct...
6   cant imagin im one found way twopart indian ep...
7   watch recent air tv noth superhero movi main p...
8   would simpli say get away anyon give review go...
9   gener like prabha movi rajamouli movi eega rea...
10  movi cannot even consid b movi person like mus...
11  usual dont watch bollywood masala movi mostli ...
12  hallmark indian cinema hang head shamei get be...
13  blockbust climax absolut interv movi man turn ...
14  portray biggest indian spectacl till date heav...
15  dont fool imdb rate movi im south indian didnt...
16  announc tollywood embark first kind visual epi...
17  genuin worst film ive ev

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
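
The question asks for the cleaned text to go in a new column, which means the raw review has to survive alongside it. A minimal sketch using only the `csv` module, assuming a single `user_review` input column as in the file above; the column name `clean_review`, the helper `add_clean_column`, and the in-memory sample are hypothetical, and `str.lower` stands in for a real cleaning function:

```python
import csv
import io


def add_clean_column(src, dst, clean, source_col="user_review",
                     target_col="clean_review"):
    """Copy a CSV from src to dst, appending clean(row[source_col]) as target_col."""
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames) + [target_col]
    writer = csv.DictWriter(dst, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        row[target_col] = clean(row[source_col])
        writer.writerow(row)


# In-memory stand-ins for the input and output files
src = io.StringIO("user_review\r\nGreat Film\r\nToo LONG\r\n")
dst = io.StringIO()
add_clean_column(src, dst, clean=str.lower)
rows = list(csv.DictReader(io.StringIO(dst.getvalue())))
print(rows)
```

With real files, both handles would be opened with `newline=""`, as the `csv` module documentation recommends, and any cleaning function that maps a string to a string can be passed as `clean`.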


# **Question 3**

(30 points). Write a Python program to conduct **syntax and structure analysis** of the clean text you just saved above. The syntax and structure analysis includes:

(1) Parts of Speech (POS) Tagging: Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) Constituency Parsing and Dependency Parsing: print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) Named Entity Recognition: Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [50]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [52]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [53]:
import nltk
import spacy
from collections import Counter
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.tree import Tree
from nltk.corpus import wordnet

# Download NLTK resources
nltk.download('maxent_ne_chunker')
nltk.download('words')

# Load the spaCy model used for dependency parsing and NER
nlp = spacy.load("en_core_web_sm")

# Example sentence used for all three analyses
example_sentence_with_entities = "I am sure that person heard about Baahubali couple of years back."

# Function for Parts of Speech (POS) Tagging
def ramya_pos_tagging(text):
    # Tag each token, then count how often each Penn Treebank tag occurs
    pos_tags = pos_tag(word_tokenize(text))
    pos_counts = Counter(tag for word, tag in pos_tags)
    return pos_counts

# Function for Constituency Parsing and Dependency Parsing
def ramya_parse_syntax_structure(text):
    # Constituency Parsing: this tree is hard-coded for illustration
    # and is NOT derived from `text`
    constituency_tree_string = "(S (NP (NNP Ramya)) (VP (VBZ is) (JJ Sad)))"
    ramya_constituency_parsing_tree = Tree.fromstring(constituency_tree_string)

    # Dependency Parsing: one (token, relation, head) triple per token
    doc = nlp(text)
    ramya_dependency_parsing_tree = [(token.text, token.dep_, token.head.text) for token in doc]

    return ramya_constituency_parsing_tree, ramya_dependency_parsing_tree

# Function for Named Entity Recognition (NER)
def ramya_named_entity_recognition(text):
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    # Count how many entities were found per label
    entity_counts = Counter(label for _, label in entities)
    return entities, entity_counts

# Example sentence for explanation
print("Example Sentence:")
print(example_sentence_with_entities)
print("\n")

# (1) Parts of Speech (POS) Tagging
ramya_pos_counts_example = ramya_pos_tagging(example_sentence_with_entities)
print("(1) Parts of Speech (POS) Tagging:")
print(ramya_pos_counts_example)
print("\n")

# (2) Constituency Parsing and Dependency Parsing
ramya_constituency_tree_example, ramya_dependency_tree_example = ramya_parse_syntax_structure(example_sentence_with_entities)
print("(2) Constituency Parsing Tree:")
print(ramya_constituency_tree_example)
print("\n")
print("(2) Dependency Parsing Tree:")
print(ramya_dependency_tree_example)
print("\n")

# (3) Named Entity Recognition (NER)
ramya_entities_example, ramya_entity_counts_example = ramya_named_entity_recognition(example_sentence_with_entities)
print("(3) Named Entity Recognition (NER):")
print(ramya_entities_example)
print(ramya_entity_counts_example)

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


Example Sentence:
I am sure that person heard about Baahubali couple of years back.


(1) Parts of Speech (POS) Tagging:
Counter({'IN': 3, 'NN': 2, 'PRP': 1, 'VBP': 1, 'JJ': 1, 'VBD': 1, 'NNP': 1, 'NNS': 1, 'RB': 1, '.': 1})


(2) Constituency Parsing Tree:
(S (NP (NNP Ramya)) (VP (VBZ is) (JJ Sad)))


(2) Dependency Parsing Tree:
[('I', 'nsubj', 'am'), ('am', 'ROOT', 'am'), ('sure', 'acomp', 'am'), ('that', 'det', 'person'), ('person', 'nsubj', 'heard'), ('heard', 'ccomp', 'sure'), ('about', 'prep', 'heard'), ('Baahubali', 'amod', 'couple'), ('couple', 'pobj', 'about'), ('of', 'prep', 'couple'), ('years', 'pobj', 'of'), ('back', 'advmod', 'heard'), ('.', 'punct', 'am')]


(3) Named Entity Recognition (NER):
[('Baahubali', 'PERSON')]
Counter({'PERSON': 1})
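
The tagger output above uses fine-grained Penn Treebank tags, while part (1) of the question asks for totals per noun, verb, adjective and adverb. A minimal sketch of that roll-up, assuming the usual Penn Treebank convention that noun tags start with `NN`, verb tags with `VB`, adjective tags with `JJ` and adverb tags with `RB`; the helper name `coarse_pos_counts` is hypothetical, and the input is the counter printed above:

```python
from collections import Counter

# Penn Treebank tag prefix -> coarse category
PREFIX_TO_CATEGORY = {"NN": "Noun", "VB": "Verb", "JJ": "Adj", "RB": "Adv"}


def coarse_pos_counts(tag_counts):
    """Sum fine-grained tag counts into Noun/Verb/Adj/Adv totals."""
    totals = Counter({category: 0 for category in PREFIX_TO_CATEGORY.values()})
    for tag, count in tag_counts.items():
        category = PREFIX_TO_CATEGORY.get(tag[:2])
        if category is not None:
            totals[category] += count
    return totals


fine = Counter({'IN': 3, 'NN': 2, 'PRP': 1, 'VBP': 1, 'JJ': 1,
                'VBD': 1, 'NNP': 1, 'NNS': 1, 'RB': 1, '.': 1})
totals = coarse_pos_counts(fine)
print(dict(totals))
```

Tags outside the four categories (here `IN`, `PRP` and `.`) are simply ignored.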


**Write your explanations of the constituency parsing tree and dependency parsing tree here (Question 3-2):**

Natural language processing uses two syntactic parsing techniques, constituency parsing and dependency parsing, to examine a sentence's grammatical structure. Creating a parse tree that depicts a sentence's syntactic structure is their shared goal.

A constituency parsing tree breaks a sentence down into smaller constituents, or phrases, to show its hierarchical structure. A constituent can be a single word or a group of words, and each node in the tree represents one constituent. The leaves are the words themselves, while the root node represents the sentence as a whole. Every internal node carries a label naming either a phrase type (such as noun phrase or verb phrase) or a part of speech (such as noun, verb, adjective, adverb, or preposition).

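Example of a constituency parsing tree:
Constituency Parsing Tree:
(S (NP (NNP Ramya)) (VP (VBZ is) (JJ Sad)))

Here:
S represents the sentence
NP represents the noun phrase, i.e., Ramya
NNP represents a proper noun, i.e., Ramya
VP represents the verb phrase, i.e., is Sad
VBZ represents a third-person singular present-tense verb, i.e., is
JJ represents an adjective, i.e., Sad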
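
To make the nesting explicit, the bracketed string can be read into nested Python lists and printed with one level of indentation per tree depth. This is a minimal standard-library sketch, assuming well-formed input in the bracket notation shown above, where each node holds either sub-nodes or words but not both; the helper names `parse_brackets` and `show` are hypothetical, and NLTK's `Tree.fromstring` does the parsing in the notebook itself.

```python
import re


def parse_brackets(s):
    """Parse '(S (NP ...))' into nested [label, child, ...] lists."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = [label]
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:                     # a bare token is a leaf word
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # consume ")"
        return children

    return node()


def show(tree, depth=0):
    """Return indented lines: phrase labels alone, tags followed by their word."""
    label, *children = tree
    if all(isinstance(c, str) for c in children):
        return ["  " * depth + f"{label}: {' '.join(children)}"]
    lines = ["  " * depth + label]
    for child in children:
        lines.extend(show(child, depth + 1))
    return lines


tree = parse_brackets("(S (NP (NNP Ramya)) (VP (VBZ is) (JJ Sad)))")
lines = show(tree)
print("\n".join(lines))
```

The indentation shows that `VP` covers both `is` and `Sad`, not the verb alone.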

A dependency parsing tree shows how the words of a sentence relate to one another grammatically. Every word in the sentence is a node in the tree, and the edges represent the syntactic relationships between words. The root of the tree is usually the main verb, and every other word is attached to it, directly or indirectly, according to its syntactic role. The labels on the edges indicate dependency types such as subject, object, or modifier.

Example of a dependency parsing tree:

Dependency Parsing Tree:
[('I', 'nsubj', 'am'), ('am', 'ROOT', 'am'), ('sure', 'acomp', 'am'), ('that', 'det', 'person'), ('person', 'nsubj', 'heard'), ('heard', 'ccomp', 'sure'), ('about', 'prep', 'heard'), ('Baahubali', 'amod', 'couple'), ('couple', 'pobj', 'about'), ('of', 'prep', 'couple'), ('years', 'pobj', 'of'), ('back', 'advmod', 'heard'), ('.', 'punct', 'am')]

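Explanation (each entry gives the word, its relation label, and the head word it attaches to):
I represents nsubj, i.e., nominal subject of "am"
am represents ROOT, i.e., the main verb and root of the tree
sure represents acomp, i.e., adjectival complement of "am"
that represents det, i.e., determiner of "person"
person represents nsubj, i.e., nominal subject of "heard"
heard represents ccomp, i.e., clausal complement of "sure"
about represents prep, i.e., prepositional modifier of "heard"
Baahubali represents amod, i.e., adjectival modifier of "couple"
couple represents pobj, i.e., object of the preposition "about"
of represents prep, i.e., prepositional modifier of "couple"
years represents pobj, i.e., object of the preposition "of"
back represents advmod, i.e., adverbial modifier of "heard"
. represents punct, i.e., punctuation attached to "am"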
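
The flat list of triples becomes easier to read once each word is grouped under its head. A minimal standard-library sketch, assuming `(token, relation, head)` triples in the format printed above, where the root is the triple whose relation is `ROOT`; the helper names `build_children` and `render` are hypothetical. Heads are matched by word text, which is adequate here because every word in the example occurs only once, but real code would match on token positions instead.

```python
from collections import defaultdict

triples = [('I', 'nsubj', 'am'), ('am', 'ROOT', 'am'), ('sure', 'acomp', 'am'),
           ('that', 'det', 'person'), ('person', 'nsubj', 'heard'),
           ('heard', 'ccomp', 'sure'), ('about', 'prep', 'heard'),
           ('Baahubali', 'amod', 'couple'), ('couple', 'pobj', 'about'),
           ('of', 'prep', 'couple'), ('years', 'pobj', 'of'),
           ('back', 'advmod', 'heard'), ('.', 'punct', 'am')]


def build_children(triples):
    """Map each head word to its (dependent, relation) pairs; return (root, map)."""
    children = defaultdict(list)
    root = None
    for token, relation, head in triples:
        if relation == "ROOT":
            root = token
        else:
            children[head].append((token, relation))
    return root, children


def render(word, relation, children, depth=0):
    """Return indented 'word (relation)' lines for word and its dependents."""
    lines = ["  " * depth + f"{word} ({relation})"]
    for dependent, dep_relation in children.get(word, []):
        lines.extend(render(dependent, dep_relation, children, depth + 1))
    return lines


root, children = build_children(triples)
lines = render(root, "ROOT", children)
print("\n".join(lines))
```

Each level of indentation is one dependency edge away from the root, so the chain from `am` through `sure` and `heard` down to `person` and `that` can be read directly off the printout.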

