
# **INFO5731 Assignment Two**

In this assignment, you will gather text data from an open data source via web scraping or an API. You will then clean the text data and perform a syntactic analysis of it.

# **Question 1**

(40 points). Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file**:

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon.

(2) Collect the top 10000 user reviews of a film released in 2022 or 2023 (you can choose any film) from IMDb.

(3) Collect all the reviews of the top 1000 most popular software products from [G2](https://www.g2.com/) or [Capterra](https://www.capterra.com/).

(4) Collect the abstracts of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from [Semantic Scholar](https://www.semanticscholar.org).

(5) Collect all the information of the 904 narrators in the [Densho Digital Repository](https://ddr.densho.org/narrators/).

(6) Collect the top 10000 Reddit posts that use a given hashtag (you can use any hashtag) from Reddit.


In [1]:
# Write your code here

import requests
from bs4 import BeautifulSoup
import csv

# URL of the IMDb film page with user reviews (replace with your film's URL)
url = "https://www.imdb.com/title/tt1160419/reviews"

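# Send an HTTP GET request to the URL; a browser-like User-Agent and a timeout
# make the request less likely to be rejected or to hang indefinitely
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers, timeout=30)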

if response.status_code == 200:
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the user reviews on the page (you'll need to inspect the webpage's structure)
    user_reviews = soup.find_all("div", class_="text show-more__control")

    # Create a CSV file to save the data
    with open("film_user_reviews.csv", "w", newline="", encoding="utf-8") as csvfile:
        csv_writer = csv.writer(csvfile)
        # Write headers to the CSV file
        csv_writer.writerow(["User Review"])

        # Loop through each user review and extract the text
        for review in user_reviews:
            review_text = review.get_text(strip=True)
            # Write the information to the CSV file
            csv_writer.writerow([review_text])

    print("Data has been successfully collected and saved to film_user_reviews.csv.")
else:
    print("Failed to retrieve data from the website. Status code:", response.status_code)



Data has been successfully collected and saved to film_user_reviews.csv.
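
The cell above reads only the first page of reviews, so reaching a larger target (such as the 10000 reviews the question asks for) means following the site's pagination. How IMDb exposes its next page is not shown in this notebook, so the sketch below hides that detail behind a hypothetical `fetch_page` callable and only illustrates the collection loop. The names `collect_reviews`, `write_reviews`, and `fake_fetch_page`, and the sample data, are all assumptions made for illustration.

```python
import csv
import io
from typing import Callable, List, Optional, Tuple

# Hypothetical page fetcher type: given a pagination key (None for the first
# page), return that page's review texts and the key of the next page
# (None when there are no more pages).
PageFetcher = Callable[[Optional[str]], Tuple[List[str], Optional[str]]]


def collect_reviews(fetch_page: PageFetcher, max_reviews: int) -> List[str]:
    """Follow pagination until max_reviews are gathered or pages run out."""
    reviews: List[str] = []
    key: Optional[str] = None
    while len(reviews) < max_reviews:
        page_reviews, key = fetch_page(key)
        reviews.extend(page_reviews)
        # Stop when the site reports no next page or returns an empty page.
        if key is None or not page_reviews:
            break
    return reviews[:max_reviews]


def write_reviews(reviews: List[str], csvfile) -> None:
    """Write one review per row under the same header the cell above uses."""
    writer = csv.writer(csvfile)
    writer.writerow(["User Review"])
    for text in reviews:
        writer.writerow([text])


# Stand-in fetcher with made-up data, used only to exercise the loop.
_FAKE_PAGES = {
    None: (["review a", "review b"], "page2"),
    "page2": (["review c", "review d"], "page3"),
    "page3": (["review e"], None),
}


def fake_fetch_page(key: Optional[str]) -> Tuple[List[str], Optional[str]]:
    return _FAKE_PAGES[key]


collected = collect_reviews(fake_fetch_page, max_reviews=4)
buffer = io.StringIO()
write_reviews(collected, buffer)
print(buffer.getvalue())
```

In a real run, `fetch_page` would wrap the `requests` and `BeautifulSoup` calls from the cell above, and `write_reviews` would receive an open file instead of an in-memory buffer.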


# **Question 2**

(30 points). Write a Python program to **clean the text data** you collected above and save the data in a new column in the CSV file. The data cleaning steps include:

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the [stopwords list](https://gist.github.com/sebleier/554280).

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [2]:
# Write your code here
import pandas as pd
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Load the CSV file containing user reviews
df = pd.read_csv("film_user_reviews.csv")

# Create a list to store the cleaned reviews
cleaned_reviews = []

# Initialize NLTK resources
nltk.download("stopwords")
nltk.download("wordnet")

# Define the stopwords list
stop_words = set(stopwords.words("english"))

# Initialize the stemming and lemmatization objects
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Function to clean and preprocess text
def clean_text(text):
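    # Remove special characters and punctuation; this pattern keeps only
    # letters, so digits are replaced with spaces here as well
    text = re.sub(r"[^a-zA-Z]", " ", text)

    # Remove numbers (an explicit step for the assignment; the pattern
    # above has already removed any digits)
    text = re.sub(r"\d", " ", text)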

    # Lowercase the text
    text = text.lower()

    # Tokenize the text
    tokens = text.split()

    # Remove stopwords, then lemmatize each remaining token
    cleaned_tokens = []
    for token in tokens:
        if token not in stop_words:
            # Lemmatization is applied here; to use stemming instead,
            # comment out the lemmatizer line and uncomment the stemmer line
            # cleaned_tokens.append(stemmer.stem(token))
            cleaned_tokens.append(lemmatizer.lemmatize(token))

    # Join cleaned tokens into a string
    cleaned_text = " ".join(cleaned_tokens)
    return cleaned_text

# Apply the clean_text function to each review
df["Cleaned Review"] = df["User Review"].apply(clean_text)

# Save the cleaned data to a new CSV file
df.to_csv("film_user_reviews_cleaned.csv", index=False)

print("Data has been successfully cleaned and saved to film_user_reviews_cleaned.csv.")





[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


Data has been successfully cleaned and saved to film_user_reviews_cleaned.csv.
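
The order of the first cleaning steps can be traced on a single string. The sketch below is a minimal, standard-library-only version of that part of the pipeline: it assumes a tiny illustrative stopword set (`SAMPLE_STOPWORDS`) in place of NLTK's full list, and it leaves out stemming and lemmatization, which need NLTK. The function name `basic_clean` is hypothetical.

```python
import re

# Tiny illustrative stopword set (an assumption for this sketch; the cell
# above uses NLTK's full English list instead).
SAMPLE_STOPWORDS = {"the", "a", "was", "is", "in", "and", "it"}


def basic_clean(text: str) -> str:
    # Replace every non-letter (punctuation, digits, symbols) with a space.
    text = re.sub(r"[^a-zA-Z]", " ", text)
    # Lowercase, then split on whitespace to get tokens.
    tokens = text.lower().split()
    # Drop stopwords.
    kept = [tok for tok in tokens if tok not in SAMPLE_STOPWORDS]
    return " ".join(kept)


print(basic_clean("The film was released in 2021, and it is AMAZING!!!"))
# film released amazing
```

Because the pattern keeps letters only, the year disappears in the same step as the punctuation, which is why a separate digit-removal step has nothing left to do.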


# **Question 3**

(30 points). Write a Python program to conduct **syntax and structure analysis** of the clean text you just saved above. The syntax and structure analysis includes:

(1) Parts of Speech (POS) Tagging: Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) Constituency Parsing and Dependency Parsing: Print the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) Named Entity Recognition: Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity type.

In [5]:
import pandas as pd
import spacy
import nltk
from collections import Counter
from nltk.tree import ParentedTree

# Load the cleaned data
df = pd.read_csv("/content/film_user_reviews_cleaned.csv")

# Load the English language model for spaCy
nlp = spacy.load('en_core_web_sm')

# Initialize NLTK resources
nltk.download("punkt")

# Initialize counters for parts of speech
noun_count = 0
verb_count = 0
adj_count = 0
adv_count = 0

# Initialize entity recognition
entities = Counter()

# Define a function to count POS tags, print dependency arcs, and tally named entities
# (no constituency trees are printed: the spaCy pipeline loaded above does not produce them)
def analyze_text(text):
    global noun_count, verb_count, adj_count, adv_count
    doc = nlp(text)

    for token in doc:
        pos = token.pos_
        if pos == 'NOUN':
            noun_count += 1
        elif pos == 'VERB':
            verb_count += 1
        elif pos == 'ADJ':
            adj_count += 1
        elif pos == 'ADV':
            adv_count += 1

    # Perform dependency parsing using spaCy
    for sent in doc.sents:
        for token in sent:
            if token.dep_ != "punct":  # Exclude punctuation
                print(f"{token.text} ({token.dep_}) -> {token.head.text} ({token.head.dep_})")

    # Named Entity Recognition (NER): tally entities by label
    # (a Counter returns 0 for unseen keys, so no KeyError handling is needed)
    for ent in doc.ents:
        entities[ent.label_] += 1

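# Analyze each review's cleaned content; a review that was cleaned down to an
# empty string is read back from the CSV as NaN, so skip non-string values
for index, row in df.iterrows():
    text = row['Cleaned Review']
    if not isinstance(text, str):
        continue
    print(f'Review {index + 1}:')
    analyze_text(text)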

# Print the total counts of different parts of speech
print(f'Total Nouns (N): {noun_count}')
print(f'Total Verbs (V): {verb_count}')
print(f'Total Adjectives (Adj): {adj_count}')
print(f'Total Adverbs (Adv): {adv_count}')

# Print the count of named entities
print("\nNamed Entity Recognition (NER):")
for entity_type, count in entities.items():
    print(f"{entity_type}: Total {count}")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Review 1:
denis (compound) -> villeneuve (compound)
villeneuve (compound) -> accomplished (nsubj)
accomplished (nsubj) -> write (ROOT)
considered (acl) -> accomplished (nsubj)
impossible (amod) -> decade (npadvmod)
decade (npadvmod) -> considered (acl)
write (ROOT) -> write (ROOT)
direct (amod) -> adaptation (dobj)
faithful (amod) -> adaptation (dobj)
adaptation (dobj) -> write (ROOT)
fantastic (amod) -> adaptation (dobj)
sci (compound) -> fi (compound)
fi (compound) -> novel (appos)
novel (appos) -> adaptation (dobj)
frank (compound) -> herbert (appos)
herbert (appos) -> novel (appos)
tell (conj) -> write (ROOT)
done (xcomp) -> tell (conj)
actually (advmod) -> done (amod)
done (amod) -> dune (dobj)
introduced (amod) -> dune (dobj)
world (compound) -> dune (nmod)
dune (nmod) -> dune (dobj)
playing (acl) -> dune (nmod)
video (compound) -> game (compound)
game (compound) -> dune (dobj)
dune (dobj) -> done (xcomp)
released (amod) -> story (nsubj)
year (compound) -> story (nsubj)
story (ns

**Write your explanations of the constituency parsing tree and dependency parsing tree here (Question 3-2):**

First, the Constituency Parsing Tree

Constituency parsing, also called phrase-structure parsing, analyzes a sentence's grammatical structure by identifying its constituents (phrases or word groups) and how they nest hierarchically within the sentence. The result is usually displayed as a constituency parsing tree.

Key properties of constituency parsing trees:

Nodes: Each node in a constituency parsing tree represents a constituent. Words, phrases, clauses, and the whole sentence can all be constituents.

Hierarchy: The tree is hierarchical, with constituents nested inside one another. A noun phrase, for instance, can contain a determiner, an adjective, and a noun.

Labels: Each node carries a label indicating the type of constituent it represents. A node spanning a noun phrase, for example, is labeled "NP".

Leaves: The individual words of the sentence are the tree's leaves.

Root: The root is the top-level node and represents the whole sentence.

Branches: The links between nodes show which constituents are contained in which. A verb phrase (VP) node, for instance, may have a verb and a noun phrase (NP) as its children, showing that the verb and its object together form the verb phrase.

Constituency parsing is useful for understanding phrase structure, and it supports a number of NLP applications, such as information extraction, machine translation, and text summarization.
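
The properties listed above (labeled nodes, nested constituents, words as leaves, a single root) can be seen in a small hand-built example. The sketch below represents a constituency tree as nested tuples in plain Python. The sentence, the labels, and the helper names `leaves`, `bracketed`, and `phrases` are illustrative assumptions; the tree was written by hand and is not parser output.

```python
# A hand-written constituency tree for "the director adapted the novel".
# Each internal node is (label, child, child, ...); each leaf is a plain word.
TREE = (
    "S",
    ("NP", ("DT", "the"), ("NN", "director")),
    ("VP",
     ("VBD", "adapted"),
     ("NP", ("DT", "the"), ("NN", "novel"))),
)


def leaves(node):
    """Return the words at the leaves, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:  # node[0] is the label, the rest are children
        words.extend(leaves(child))
    return words


def bracketed(node):
    """Render the tree in bracketed notation."""
    if isinstance(node, str):
        return node
    label, children = node[0], node[1:]
    return "(" + label + " " + " ".join(bracketed(c) for c in children) + ")"


def phrases(node, label):
    """Collect the word span of every constituent with the given label."""
    if isinstance(node, str):
        return []
    found = [" ".join(leaves(node))] if node[0] == label else []
    for child in node[1:]:
        found.extend(phrases(child, label))
    return found


print(bracketed(TREE))
# (S (NP (DT the) (NN director)) (VP (VBD adapted) (NP (DT the) (NN novel))))
print(leaves(TREE))
# ['the', 'director', 'adapted', 'the', 'novel']
print(phrases(TREE, "NP"))
# ['the director', 'the novel']
```

Here the root `S` spans the whole sentence, the second `NP` sits inside the `VP`, and reading the leaves from left to right recovers the original words.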

Second, the Dependency Parsing Tree

Dependency parsing is another way to analyze the grammatical structure of a sentence; it focuses on the relationships between individual words rather than on phrases. The result is a dependency parsing tree, a directed graph whose nodes are the words of the sentence.

Key properties of dependency parsing trees:

Nodes: Each node in a dependency parsing tree represents one word of the sentence.

Edges: The edges (arcs) between nodes represent grammatical relationships. Each edge is labeled with the type of relationship, such as subject, object, or modifier.

Root: A single root node, usually the sentence's main verb, anchors the sentence's grammatical structure.

Direction: Edges point from a head word to its dependent. For instance, a subject (dependent) is attached to its verb (head) by an edge coming in from the verb. The output above prints each arc the other way around, as dependent -> head: `adaptation (dobj) -> write (ROOT)` says that "adaptation" is the direct object of the root verb "write".

Acyclicity: Dependency parsing trees are acyclic, which means the graph contains no loops.

Dependency parsing makes the syntactic relationships between the words of a sentence explicit. It is used in a number of natural language processing tasks, such as sentiment analysis, machine translation, and information extraction.
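
The structural properties described above (one node per word, labeled arcs, a single root, no cycles) can be checked on a small hand-built example. The sketch below stores arcs as a plain dictionary; the sentence, the relation labels, and the helper names `find_roots`, `is_tree`, and `dependents_of` are illustrative assumptions, not parser output. One difference from the spaCy output earlier in this notebook: spaCy makes the root its own head (as in the line `write (ROOT) -> write (ROOT)`), whereas this sketch marks the root with `None`.

```python
# Hand-written dependency arcs for "the director adapted the novel",
# stored as dependent index -> (relation, head index). Words are indexed by
# position; the root's head is None.
WORDS = ["the", "director", "adapted", "the", "novel"]
ARCS = {
    0: ("det", 1),
    1: ("nsubj", 2),
    2: ("ROOT", None),
    3: ("det", 4),
    4: ("dobj", 2),
}


def find_roots(arcs):
    """Return the indices of all words that have no head."""
    return [dep for dep, (_, head) in arcs.items() if head is None]


def is_tree(arcs):
    """True if there is exactly one root and no word lies on a cycle."""
    if len(find_roots(arcs)) != 1:
        return False
    for start in arcs:
        seen = set()
        node = start
        # Walk up the chain of heads; revisiting a node means a cycle.
        while node is not None:
            if node in seen:
                return False
            seen.add(node)
            node = arcs[node][1]
    return True


def dependents_of(arcs, head):
    """List (word, relation) pairs for the direct dependents of a head."""
    return [(WORDS[dep], rel) for dep, (rel, h) in arcs.items() if h == head]


root = find_roots(ARCS)[0]
print(WORDS[root])
# adapted
print(dependents_of(ARCS, root))
# [('director', 'nsubj'), ('novel', 'dobj')]
print(is_tree(ARCS))
# True
```

Storing one entry per dependent guarantees that every word has exactly one head, so the only properties left to verify are the single root and the absence of cycles.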

In conclusion, dependency parsing highlights the connections between head words and their dependents, whereas constituency parsing concentrates on how phrases nest to form the sentence. The two methods give complementary views of the syntax and structure of natural language.