
# **INFO5731 Assignment Two**

In this assignment, you will gather text data from an open data source via web scraping or an API. You will then clean the text data and perform syntactic analysis on it.

# **Question 1**

(40 points). Write a Python program to collect text data from **one of the following sources** and save the data to a **CSV file**:

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon.

(2) Collect the top 10000 user reviews of a film released in 2022 or 2023 (you can choose any film) from IMDb.

(3) Collect all the reviews of the top 1000 most popular software products from [G2](https://www.g2.com/) or [Capterra](https://www.capterra.com/).

(4) Collect the abstracts of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from [Semantic Scholar](https://www.semanticscholar.org).

(5) Collect all the information of the 904 narrators in the [Densho Digital Repository](https://ddr.densho.org/narrators/).

(6) Collect the top 10000 Reddit posts by using a hashtag (you can use any hashtag) from Reddit.


In [10]:
import csv
import requests
from bs4 import BeautifulSoup
# Review page for a popular Telugu-language film (IMDb title tt4727512).
url = "https://www.imdb.com/title/tt4727512/reviews?ref_=tt_ql_3"
# Send an HTTP GET request to the IMDb reviews page; the timeout avoids waiting indefinitely.
response = requests.get(url, timeout=30)
if response.status_code == 200:
    # Parse the HTML content of the page.
    soup = BeautifulSoup(response.text, "html.parser")
    # Extract the text of every review container found in the returned HTML.
    # Note: a single request only yields the reviews present on this one page.
    review_elements = soup.find_all("div", class_="text show-more__control")
    reviews = [element.get_text().strip() for element in review_elements]
    # Save the reviews to a CSV file with a single "Review" column.
    with open("srmtd_reviews.csv", "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["Review"])
        for review in reviews:
            writer.writerow([review])
    print("srmtd_reviews.csv.")
else:
    print("Failed to retrieve the data.")


srmtd_reviews.csv.
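
The single request above returns only the reviews contained in one page of HTML, so collecting a larger number of reviews means following the site's pagination. A minimal sketch of such a loop is shown below. It assumes a hypothetical `fetch_page` callable that takes a page token and returns the reviews on that page plus the token of the next page (or `None` at the end). How IMDb actually exposes that token is not covered in this notebook, so the fetcher is stubbed here with made-up data (`FAKE_PAGES`, `fake_fetch_page`).

```python
from typing import Callable, List, Optional, Tuple

# A page fetcher takes a page token (None for the first page) and returns
# (reviews on that page, token of the next page or None when there are no more).
PageFetcher = Callable[[Optional[str]], Tuple[List[str], Optional[str]]]


def collect_reviews(fetch_page: PageFetcher, limit: int) -> List[str]:
    """Follow pagination until `limit` reviews are collected or pages run out."""
    reviews: List[str] = []
    token: Optional[str] = None
    while len(reviews) < limit:
        page_reviews, token = fetch_page(token)
        reviews.extend(page_reviews)
        if token is None or not page_reviews:
            break  # no further pages (or an empty page): stop instead of looping forever
    return reviews[:limit]


# Hypothetical stand-in for a real fetcher: three fake pages keyed by token.
FAKE_PAGES = {
    None: (["review 1", "review 2"], "page-2"),
    "page-2": (["review 3", "review 4"], "page-3"),
    "page-3": (["review 5"], None),
}


def fake_fetch_page(token: Optional[str]) -> Tuple[List[str], Optional[str]]:
    return FAKE_PAGES[token]


print(collect_reviews(fake_fetch_page, limit=3))
print(len(collect_reviews(fake_fetch_page, limit=10000)))
```

In a real run, `fake_fetch_page` would be replaced by a function that performs the HTTP request and parses the page, ideally with a short pause between requests.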


# **Question 2**

(30 points). Write a Python program to **clean the text data** you collected above and save the cleaned data in a new column in the CSV file. The data cleaning steps include:

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the [stopwords list](https://gist.github.com/sebleier/554280).

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.
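
One ordering detail in these steps is worth illustrating: some stopword lists, including recent versions of NLTK's English list, contain contractions such as "don't", so stripping punctuation before the stopword filter turns them into forms like "dont" that no longer match. The sketch below covers steps (1) to (4) only, using the standard library and a tiny made-up stopword subset (`SAMPLE_STOPWORDS`) in place of the full list. The function name `basic_clean` is hypothetical, and stemming and lemmatization are left to a library such as NLTK.

```python
import re

# Tiny illustrative subset; a real run would load the full stopword list.
SAMPLE_STOPWORDS = {"the", "is", "a", "and", "i", "don't", "it"}


def basic_clean(text: str) -> str:
    """Lowercase, drop stopwords, then strip punctuation and digits."""
    # (4) Lowercase first so the stopword comparison is case-insensitive.
    words = text.lower().split()
    # (3) Filter stopwords while apostrophes are still intact. Punctuation on
    # the edges of a word (e.g. "film,") is trimmed for the comparison only.
    kept = [w for w in words if w.strip(".,!?;:\"()") not in SAMPLE_STOPWORDS]
    # (1) + (2) Replace everything that is not a lowercase ASCII letter or a
    # space. This is an ASCII-only simplification: accented letters are dropped too.
    letters_only = re.sub(r"[^a-z ]", " ", " ".join(kept))
    # Collapse the runs of spaces left behind by the substitution.
    return " ".join(letters_only.split())


print(basic_clean("I don't think the film is 3 hours long, and it is GREAT!"))
# prints: think film hours long great
```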

In [12]:
import string
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
# Download the NLTK resources used below (tokenizer, stopword list, WordNet data).
nltk.download('punkt')
nltk.download('omw')
nltk.download('omw-1.4')
nltk.download('stopwords')
nltk.download('wordnet')
df = pd.read_csv("srmtd_reviews.csv")
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Clean and preprocess a single review.
def clean_text(text):
    # Remove punctuation.
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove digits.
    text = ''.join(ch for ch in text if not ch.isdigit())
    # Tokenize, lowercase, and drop stopwords.
    tokens = nltk.word_tokenize(text)
    tokens = [word.lower() for word in tokens if word.lower() not in stop_words]
    # Stem each token, then lemmatize the stems so both steps contribute to the output.
    stemmed_tokens = [stemmer.stem(word) for word in tokens]
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in stemmed_tokens]
    return ' '.join(lemmatized_tokens)

# Treat missing reviews as empty strings so clean_text always receives text.
df['Cleaned_Review'] = df['Review'].fillna('').apply(clean_text)
# Write the cleaned dataset to a new CSV file.
df.to_csv("srmtd_reviews_cleaned.csv", index=False)
print("srmtd_reviews_cleaned.csv.")

srmtd_reviews_cleaned.csv.


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw to /root/nltk_data...
[nltk_data]   Package omw is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# **Question 3**

(30 points). Write a Python program to conduct **syntax and structure analysis** of the clean text you just saved above. The syntax and structure analysis includes:

(1) Parts of Speech (POS) Tagging: Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) Constituency Parsing and Dependency Parsing: Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) Named Entity Recognition: Extract all the entities, such as person names, organizations, locations, product names, and dates, from the clean texts, and calculate the count of each entity.

In [13]:
pip install spacy



In [None]:
!pip list | grep typing-extensions
!pip list | grep nltk
!pip list | grep spacy

nltk                          3.7
spacy                         3.7.2
spacy-legacy                  3.0.12
spacy-loggers                 1.0.5


In [15]:
from collections import Counter
import nltk
import spacy
from nltk import pos_tag, word_tokenize
nltk.download('averaged_perceptron_tagger')
# Sample text: one full review taken from the collected reviews.
sample_sentence = (
    "Srimanthudu works because of three reasons them being Mahesh Babu being used to his maxim potential, the role of Harsha fits him like a glove. The director made sure he used all of Mahesh's characteristics which have made him famous till now they are smartly placed into his character too look genuine.The other two reasons are the sensible story which connects with everyone from the family audiences to the masses and finally it was made sure the masses where fully pleased with some good action sequences, with some a whistle worth song or two too.Therefore a film like Srimanthudu is set for blockbuster status. But for me even though i read about the immense positive reviews about the film,I still found the story a little common and predictable. But what won me over was the hero's character and his characterization which I could relate too. And the films final message of how money is not the most important thing in life, how helping others grow with you is more important, it gives a real reason to live, and makes society move forward positively . We in the rustle, bustle of life forget about our roots.How people are struggling back home, it is our duty/debt to go back and make a change in any small way so we can see the growth of rural areas into a place where people want to live. In short after becoming successful in life don't forget the were you came from, and the people responsible for your success. A strong message like that is what makes Srimanthudu a winner.Story wise Srimanthudu tells the story of a man named Harsha, who is a son of a billionaire. His dad wants him to take over the business, but Harsha does not want too. Harsha is lost in his own world, he does not know what he really wants to do but he knows that he doesn't care about making loads of money, and that he loves helping people who are less privileged. "
    "He then eventually finds out he is interested in rural studies and pursues those studies, and doing so also falls in love with a fellow student. Through all this,conflicts keep on arising, and he finds himself heading to his native village. What is reason for him to go back to his native village, and is he successful in his reason is what makes up the the rest of the plot.To know more rush to the theatres.Acting wise its a Mahesh Babu show all the way, he rules the roost. He has been given an amazing character, and with it gives one of his career best performances. The superstar shines Shruti Hassan excels in a substantial role, she has has matured as an actor, and she looks the most beautiful she has ever on screen with this film.Jagapathi Babu and Rajendra Parsad impress. Villains Sampath Raj and Mukesh Rishi are the usual clichéd villains. Vennela Kishore and Ali provide ample laughs.The only minuses in the film are the excessive length of almost 3 hours plus and the slow pace, at times the film story doesn't move.Overall Srimanthudu all withstanding is a blockbuster all the way. The way the film has class and mass sensibilities mixed together, with a sensible story, the audiences will come in hordes.The film has Mahesh Babu like never before, and after the disappointing failures of Aagdu and 1 this provide a comeback of sorts.3.5/5* or 7/10"
)
# Load the spaCy English pipeline once so it is not reloaded on every call.
nlp = spacy.load("en_core_web_sm")

# Count how often each POS tag occurs in the text.
def pos_counts(text):
    tags = pos_tag(word_tokenize(text))
    return Counter(tag for _, tag in tags)

# Extract (entity text, entity label) pairs with spaCy's named entity recognizer.
def named_entity_recognition(text):
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]

# Analyze the POS tags.
pos_tag_counts = pos_counts(sample_sentence)
print("POS Tag Counts:")
print(pos_tag_counts)
# Analyze the named entities.
named_entities = named_entity_recognition(sample_sentence)
print("Named Entities:")
for entity, label in named_entities:
    print(f"Entity: {entity}, Label: {label}")

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


POS Tag Counts:
Counter({'NN': 93, 'DT': 66, 'IN': 64, 'VBZ': 42, 'JJ': 40, 'PRP': 34, 'RB': 30, 'NNP': 29, 'NNS': 29, 'CC': 28, ',': 25, '.': 17, 'VBP': 17, 'VB': 17, 'PRP$': 13, 'VBN': 11, 'TO': 11, 'VBG': 10, 'VBD': 8, 'CD': 7, 'WP': 7, 'WRB': 4, 'RP': 4, 'WDT': 3, 'MD': 3, 'PDT': 3, 'POS': 2, 'RBS': 2, 'RBR': 2, 'JJR': 1, 'JJS': 1})
Named Entities:
Entity: three, Label: CARDINAL
Entity: Mahesh Babu, Label: PERSON
Entity: Harsha, Label: PERSON
Entity: Mahesh, Label: GPE
Entity: two, Label: CARDINAL
Entity: two, Label: CARDINAL
Entity: Srimanthudu, Label: PERSON
Entity: Srimanthudu, Label: PERSON
Entity: Srimanthudu, Label: PERSON
Entity: Harsha, Label: PERSON
Entity: Harsha, Label: PERSON
Entity: Harsha, Label: PERSON
Entity: Babu, Label: PERSON
Entity: Shruti Hassan, Label: PERSON
Entity: Jagapathi Babu, Label: PERSON
Entity: Rajendra Parsad, Label: PERSON
Entity: Villains Sampath Raj, Label: PERSON
Entity: Mukesh Rishi, Label: PERSON
Entity: Vennela Kishore, Label: PERSON
Entity: 
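
Question 3 asks for totals per word class and per entity type, while the output above lists fine-grained Penn Treebank tags and one line per entity mention. The sketch below rolls both up using only `collections.Counter`. It copies the tag counts printed above and the entity labels that are visible before the output is cut off, so the entity totals are partial. The prefix-to-class mapping and the names `COARSE_PREFIXES`, `coarse_pos_totals`, and `visible_entity_labels` are assumptions made for illustration.

```python
from collections import Counter

# Tag counts copied from the "POS Tag Counts" output above.
tag_counts = Counter({
    'NN': 93, 'DT': 66, 'IN': 64, 'VBZ': 42, 'JJ': 40, 'PRP': 34, 'RB': 30,
    'NNP': 29, 'NNS': 29, 'CC': 28, ',': 25, '.': 17, 'VBP': 17, 'VB': 17,
    'PRP$': 13, 'VBN': 11, 'TO': 11, 'VBG': 10, 'VBD': 8, 'CD': 7, 'WP': 7,
    'WRB': 4, 'RP': 4, 'WDT': 3, 'MD': 3, 'PDT': 3, 'POS': 2, 'RBS': 2,
    'RBR': 2, 'JJR': 1, 'JJS': 1,
})

# Penn Treebank tags share a prefix per word class: NN* nouns, VB* verbs,
# JJ* adjectives, RB* adverbs. WRB (wh-adverb) starts with "W", so this
# simple prefix rule does not count it as an adverb.
COARSE_PREFIXES = {"Noun": "NN", "Verb": "VB", "Adjective": "JJ", "Adverb": "RB"}


def coarse_pos_totals(counts):
    """Sum fine-grained tag counts into one total per coarse word class."""
    return {
        name: sum(n for tag, n in counts.items() if tag.startswith(prefix))
        for name, prefix in COARSE_PREFIXES.items()
    }


# Entity labels in the order they appear in the (truncated) output above.
visible_entity_labels = (
    ["CARDINAL", "PERSON", "PERSON", "GPE", "CARDINAL", "CARDINAL"]
    + ["PERSON"] * 13
)
entity_label_counts = Counter(visible_entity_labels)

print(coarse_pos_totals(tag_counts))
print(entity_label_counts)
```

With the counts above, this gives 151 nouns, 105 verbs, 42 adjectives, and 34 adverbs, and 15 PERSON, 3 CARDINAL, and 1 GPE among the visible entities.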

**Write your explanations of the constituency parsing tree and dependency parsing tree here (Question 3-2):**

A constituency parsing tree gives a hierarchical view of a sentence's structure, grouping words into nested phrases such as noun phrases and verb phrases. A dependency parsing tree instead focuses on the grammatical relationships between individual words, showing which word each one depends on and in what role.
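
To make the contrast concrete, the sketch below encodes both analyses for one short sentence, "Harsha helps the villagers". The sentence and both parses are hand-written for illustration (they are not the output of a parser), and the helper names `leaves` and `to_brackets` are hypothetical.

```python
# Constituency tree as nested tuples: (label, child, child, ...); leaves are plain strings.
constituency = (
    "S",
    ("NP", ("NNP", "Harsha")),
    ("VP", ("VBZ", "helps"), ("NP", ("DT", "the"), ("NNS", "villagers"))),
)

# Dependency parse as (dependent, relation, head) triples; the root verb has no head.
dependencies = [
    ("Harsha", "nsubj", "helps"),
    ("helps", "ROOT", None),
    ("the", "det", "villagers"),
    ("villagers", "dobj", "helps"),
]


def leaves(tree):
    """Return the words of a constituency tree in sentence order."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:  # tree[0] is the phrase or POS label
        words.extend(leaves(child))
    return words


def to_brackets(tree):
    """Render the tree in the usual bracketed notation."""
    if isinstance(tree, str):
        return tree
    return "(" + " ".join([tree[0]] + [to_brackets(c) for c in tree[1:]]) + ")"


print(to_brackets(constituency))
for dependent, relation, head in dependencies:
    print(f"{dependent} --{relation}--> {head}")
```

In the constituency tree, "the villagers" forms a noun phrase (NP) nested inside the verb phrase (VP). In the dependency parse, the same structure appears as direct word-to-word links: "the" attaches to "villagers" as its determiner, and "villagers" attaches to the verb "helps" as its direct object.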