<a href="https://colab.research.google.com/github/srivamsikakarla/venkatasuryasatya_INFO5731_Fall2023/blob/main/kakarlavenkatasuryasatya_Assignment_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Two**

In this assignment, you will try to gather text data from open data source via web scraping or API. After that you need to clean the text data and syntactic analysis of the data.

# **Question 1**

(40 points). Write a python program to collect text data from **either of the following sources** and save the data into a **csv file**:

(1) Collect all the customer reviews of a product (you can choose any porduct) on amazon.

(2) Collect the top 10000 User Reviews of a film recently in 2023 or 2022 (you can choose any film) from IMDB.

(3) Collect all the reviews of the top 1000 most popular software from [G2](https://www.g2.com/) or [Capterra](https://www.capterra.com/)

(4) Collect the abstracts of the top 10000 research papers by using the query "machine learning", "data science", "artifical intelligence", or "information extraction" from [Semantic Scholar](https://www.semanticscholar.org).

(5) Collect all the information of the 904 narrators in the [Densho Digital Repository](https://ddr.densho.org/narrators/).

(6) Collect the top 10000 reddits by using a hashtag (you can use any hashtag) from Reddits.


In [None]:
# Write your code here

# Import the required modules
import requests
import random
import time
from bs4 import BeautifulSoup
import pandas as pd

# Define a list of urls to scrape
urls = [f"https://ddr.densho.org/narrators/{i}" for i in range(1, 905)]
bad_urls = []
texts = []

for url in urls:
    # Wait for a random time between 0 and 2 seconds before scraping a link
    time.sleep(random.uniform(0, 1))

    # Set headers to confuse the server
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',}

    # Get the responsse from the url
    response = requests.get(url, headers=headers)

    if response.status_code != 200:
        # Append url to the bad_url list
        bad_urls.append(url)

    else:
        # Parse the response using BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find the h1 tag
        h_tag = soup.find('h1')

        # Get text from the h1 tag
        h_text = h_tag.get_text()

        # Find the p tag
        p_tag = soup.find('p')

        # Get the text from the p tag
        p_text = p_tag.get_text()
        texts.append(h_text + " " + p_text)

# Create a pandas dataframe object for the narrators
df = pd.DataFrame(texts, columns = ['Narrators'])

df.to_csv('narrators.csv', encoding='utf 8', index=False)

The output is a CSV file contaning all the narrators information.

# **Question 2**

In [1]:
pip install nltk



(30 points). Write a python program to **clean the text data** you collected above and save the data in a new column in the csv file. The data cleaning steps include:

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the [stopwords list](https://gist.github.com/sebleier/554280).

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [None]:
import requests
import pandas as pd
from nltk.stem import PorterStemmer, WordNetLemmatizer


df = pd.read_csv("narrators.csv")

# Removing all non-alphabetic characters (special characters and numbers) except spaces
df['Cleaned'] = df['Narrators'].str.replace('[^a-zA-Z ]', '', regex=True)

# Removing stop words
response = requests.get("https://gist.githubusercontent.com/sebleier/554280/raw/7e0e4a1ce04c2bb7bd41089c9821dbcf6d0c786c/NLTK's%2520list%2520of%2520english%2520stopwords")
stop_words = response.text.split("\n")
pattern = r'\b(?:{})\b'.format('|'.join(stop_words))
df['Cleaned'] = df['Cleaned'].str.replace(pattern, '', regex=True)
df['Cleaned'] = df['Cleaned'].str.replace(r'\s+', ' ', regex=True)  # remove extra spaces

# Changing all text to lowercase
df['Cleaned'] = df['Cleaned'].str.lower()

# Stemming the text in the cleaned column
stemmer = PorterStemmer()
df['Cleaned'] = df['Cleaned'].str.split().apply(lambda x: [stemmer.stem(word) for word in x])
df['Cleaned'] = df['Cleaned'].str.join(' ')

# Lemmatizing the text
lt = WordNetLemmatizer()
df['Cleaned'] = df['Cleaned'].str.split().apply(lambda x: [lt.lemmatize(word) for word in x])
df['Cleaned'] = df['Cleaned'].str.join(' ')

# Save the dataframe to a CSV file
output_file_path = "cleaned_data.csv"
df.to_csv(output_file_path, index=False)

# **Question 3**

(30 points). Write a python program to conduct **syntax and structure analysis** of the clean text you just saved above. The syntax and structure analysis includes:

(1) Parts of Speech (POS) Tagging: Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) Constituency Parsing and Dependency Parsing: print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) Named Entity Recognition: Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [None]:
# Write your code here

# Import spacy
import spacy

# Load the spacy model for English language
nlp = spacy.load("en_core_web_sm")

# Create a new column with spacy document objects from the 'Cleaned' column
df['Doc'] = df['Cleaned'].apply(lambda x: nlp(x))

# Create a new column with lists of tuples with word and POS tag from the 'Doc' column
df['Word_POS'] = df['Doc'].apply(lambda x: [(token.text, token.pos_) for token in x])

# Create a new column with counts of nouns, verbs, adjectives, and adverbs from the 'Word_POS' column
df['POS_Counts'] = (df['Word_POS'].apply(lambda x:
                                         "{} Nouns, {} Verbs, {} Adjectives, {} Adverbs".format(
                                          len([word for word, pos in x if pos == "NOUN"]),
                                          len([word for word, pos in x if pos == "VERB"]),
                                          len([word for word, pos in x if pos == "ADJ"]),
                                          len([word for word, pos in x if pos == "ADV"]))))

In [None]:
# Printing out the constituency parsing trees and dependency parsing trees of all the sentences.

# Import TextBlob and graphviz
from textblob import TextBlob
from graphviz import Source

# Parse a sentence using TextBlob
sentences = df['Cleaned'].values
trees = []

for sentence in sentences:
    blob = TextBlob(sentence)
    # Convert the tree object into a string
    tree = blob.parse().__str__()
    # Append the string to the list
    trees.append(tree)

# Print the constituency parsing tree and the dependency parsing tree as strings for the first sentence
for i, tree in enumerate(trees):
    print("----------------------------------------------")
    print(f"TREE {i+1}")
    print(tree)
    print(" ")

----------------------------------------------
TREE 1
gene/NN/B-NP/O akutsu/NN/I-NP/O nisei/NN/I-NP/O male/JJ/B-ADJP/O born/VBN/B-VP/O septemb/NN/B-NP/O seattl/NN/I-NP/O washington/NN/I-NP/O spent/VBD/B-VP/O prewar/JJ/B-NP/O childhood/NN/I-NP/O seattl/NN/I-NP/O nihonmachi/NN/I-NP/O incarcer/NN/I-NP/O puyallup/NN/I-NP/O assembl/NN/I-NP/O center/NN/I-NP/O washington/NN/I-NP/O minidoka/NN/I-NP/O concentr/NN/I-NP/O camp/NN/I-NP/O idaho/NN/I-NP/O refus/NNS/I-NP/O particip/NN/I-NP/O draft/NN/I-NP/O imprison/VB/B-VP/O mcneil/NN/B-NP/O island/NN/I-NP/O penitentiari/NN/I-NP/O washington/NN/I-NP/O draft/NN/I-NP/O resist/VB/B-VP/O resettl/NN/B-NP/O seattl/NN/I-NP/O
 
----------------------------------------------
TREE 2
jim/NN/B-NP/O akutsu/NN/I-NP/O nisei/NN/I-NP/O male/JJ/B-ADJP/O born/VBN/B-VP/O januari/NN/B-NP/O seattl/NN/I-NP/O washington/NN/I-NP/O incarcer/NN/I-NP/O puyallup/NN/I-NP/O assembl/NN/I-NP/O center/NN/I-NP/O washington/NN/I-NP/O minidoka/NN/I-NP/O concentr/NN/I-NP/O camp/NN/I-NP/

In [None]:
# Extracting all the entities such as person names, organizations, locations, product names, and date from the clean texts,
# and calculating the count of each entity

# Import Counter
from collections import Counter

# Get the clean texts from the dataframe
texts = df["Cleaned"].values

# Define an empty list to store the entities
entities = []

# Loop over each text in the dataframe
for text in texts:
    # Parse the text using spaCy
    doc = nlp(text)
    # Loop over each entity in the text
    for ent in doc.ents:
        # Append the entity text and label to the list
        entities.append((ent.text, ent.label_))

# Count the frequency of each entity type using Counter
entity_counts = Counter([label for text, label in entities])

# Print the results
print("The entities and their counts are:")
for entity, count in entity_counts.items():
    print(f"{entity}: {count}")

The entities and their counts are:
PERSON: 1513
GPE: 3275
ORG: 919
PRODUCT: 9
EVENT: 552
CARDINAL: 105
LOC: 56
DATE: 337
NORP: 315
ORDINAL: 47
MONEY: 2
FAC: 47
LANGUAGE: 6
PERCENT: 1
TIME: 5
LAW: 1
QUANTITY: 1


**Write your explanations of the constituency parsing tree and dependency parsing tree here (Question 3-2):**

A dependency parsing tree is a way to break down how words in a sentence connect to their functions in the sentence. It connects words by lines and labels them to show what word is doing what in a sentence.<br/>

For example, from the sentence:

*Miyoko Tsuboi Nakagawa  Nisei female. Born March 8, 1925, in Portland, Oregon. Lost mother at an early age and helped to take care of the family. During World War II, removed to the Portland Assembly Center, Oregon, and the Minidoka concentration camp, Idaho. After leaving camp, worked for a War Relocation Authority office helping to return confiscated property to Japanese Americans. Returned to Portland, then moved to South Bend, Washington, after marriage to an oyster farmer*

The root word is born. It has three words connected to it: Miyoko, March 8 1925 , and in Portland Oregon. The word Miyoko has three words connected to it: Tsuboi , Nakagawa , and Nisei , with labels compound (part of a name). The compound label means that Tsuboi , Nakagawa , and Nisei are part of Miyoko’s full name or identity. The word March 8 1925 has two words connected to it: , and , with labels punct (punctuation). The punct label means that , and , are just marks to separate the date parts. The word in Portland Oregon has one word connected to it: Portland , with label pobj (object of a preposition).nformation.g a story.

A constituency parsing tree, also known as a parse tree, represents the syntactic structure of a sentence according to a formal grammar. In the case of this example Miyoko Tsuboi Nakagawa, a constituency parsing tree would decompose the text into its constituent parts, such as noun phrases (NPs) and verb phrases (VPs). Each word in the narrative would be classified into its appropriate category (noun, verb, adjective, etc.), and these categories would be grouped into larger syntactic units, showing the hierarchical relationship between words and phrases. For instance, "Miyoko Tsuboi Nakagawa" would be identified as a proper noun phrase, while "Born March 8, 1925, in Portland, Oregon" could be parsed into a verb phrase with nested prepositional phrases. This tree visualization aids in understanding the grammatical composition of the narrative.
