# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a recent movie from 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have enough reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [7]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Define the base URL and headers
base_url = "https://ddr.densho.org/narrators/?page="
headers = {
    'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

# Generate the list of URLs using list comprehension
urls = [base_url + str(page_num) for page_num in range(1, 41)]

# Initialize an empty list to store the data
data = []

# Iterate over each URL
for url in urls:
    # Send a GET request to the URL; fail fast on timeouts and HTTP errors
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()

    # Parse the HTML content
    soup = BeautifulSoup(response.content, "html.parser")

    # Find all <a> tags with class "item-hover"
    a_tags = soup.find_all('a', class_='item-hover')

    # Find all <div> tags with class "source muted"
    div_tags = soup.find_all('div', class_='source muted')

    # Iterate over the pairs of <a> and <div> tags using zip
    for a_tag, div_tag in zip(a_tags, div_tags):
        # Extract the text from <a> and <div> tags, concatenate with colon, and append to the data list
        data.append({'Name_Info': f"{a_tag.text.strip()} : {div_tag.text.strip()}"})


# Create a DataFrame from the data list
df = pd.DataFrame(data)

# Print the DataFrame
print(df)


                                             Name_Info
0    Kay Aiko Abe : Nisei female. Born May 9, 1927,...
1    Art Abe : Nisei male. Born June 12, 1921, in S...
2    Sharon Tanagi Aburano : Nisei female. Born Oct...
3    Toshiko Aiboshi : Nisei female. Born July 8, 1...
4    Douglas L. Aihara : Sansei male. Born March 15...
..                                                 ...
972  Karen Yoshitomi : Sansei female. Born 1962 in ...
973  John Young : Chinese American male. Born May 2...
974  Sharon Yuen : Sansei female. Born July 1945 in...
975  Lois Yuki : Nisei female. Born September 13, 1...
976  Aaron Zajic : Born in Baltimore, Maryland. Dur...

[977 rows x 1 columns]
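
The loop above hard-codes 40 pages. A minimal sketch of an alternative, assuming that a page with no narrator entries marks the end of the listing: keep requesting pages until one comes back empty. `fetch_page` is a hypothetical callable (not part of the notebook) standing in for the `requests` + `BeautifulSoup` logic, so the control flow can be tried without network access.

```python
from typing import Callable, Dict, List

def collect_all_pages(
    fetch_page: Callable[[int], List[Dict[str, str]]],
    max_pages: int = 1000,
) -> List[Dict[str, str]]:
    """Collect rows page by page until a page comes back empty."""
    rows: List[Dict[str, str]] = []
    for page_num in range(1, max_pages + 1):
        page_rows = fetch_page(page_num)
        if not page_rows:  # assumption: an empty page means we ran past the last one
            break
        rows.extend(page_rows)
    return rows

# Stand-in fetcher for illustration only: three pages of fake rows, then nothing.
def fake_fetch(page_num: int) -> List[Dict[str, str]]:
    if page_num > 3:
        return []
    return [{"Name_Info": f"narrator {page_num}-{i}"} for i in range(2)]

rows = collect_all_pages(fake_fetch)
print(len(rows))
```

The `max_pages` cap is a safety net so that a site which never returns an empty page cannot cause an endless loop.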


In [22]:
# Save the scraped data; index=False keeps the row index out of the CSV
df.to_csv('Info_5731.csv', index=False)
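
Each scraped row stores the narrator's name and description in a single string joined by `" : "`. A sketch of splitting that string into two fields before saving, using only the standard library; `split_name_info` and the `Name`/`Info` field names are hypothetical, and the sample strings are the truncated values from the printed DataFrame.

```python
import csv
import io

def split_name_info(name_info: str) -> dict:
    """Split 'Name : description' on the first ' : ' separator."""
    name, _, info = name_info.partition(" : ")
    return {"Name": name.strip(), "Info": info.strip()}

sample = [
    "Kay Aiko Abe : Nisei female. Born May 9, 1927,...",
    "Art Abe : Nisei male. Born June 12, 1921, in S...",
]
records = [split_name_info(s) for s in sample]

# Write to an in-memory buffer here; pass a file opened with newline="" to save to disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Name", "Info"])
writer.writeheader()
writer.writerows(records)
print(buffer.getvalue())
```

With pandas, the same two-column layout could be saved with `to_csv(..., index=False)` so that the row index is not written as an extra column.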

# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [8]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer, WordNetLemmatizer

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [12]:
def remove_special_characters(text):
    if isinstance(text, (int, float)):
        return str(text)
    return ''.join([char for char in text if char.isalnum() or char.isspace()])

# Call the function on DataFrame columns
df['Name_Info'] = df['Name_Info'].apply(remove_special_characters)
df['Name_Info']

0      Kay Aiko Abe  Nisei female Born May 9 1927 in ...
1      Art Abe  Nisei male Born June 12 1921 in Seatt...
2      Sharon Tanagi Aburano  Nisei female Born Octob...
3      Toshiko Aiboshi  Nisei female Born July 8 1928...
4      Douglas L Aihara  Sansei male Born March 15 19...
                             ...                        
972    Karen Yoshitomi  Sansei female Born 1962 in Sp...
973    John Young  Chinese American male Born May 22 ...
974    Sharon Yuen  Sansei female Born July 1945 in S...
975    Lois Yuki  Nisei female Born September 13 1944...
976    Aaron Zajic  Born in Baltimore Maryland During...
Name: Name_Info, Length: 977, dtype: object

In [14]:
def remove_numbers(text):
    if isinstance(text, (int, float)):
        return str(text)
    return ''.join([char for char in text if not char.isdigit()])

# Call the function on DataFrame columns
df['Name_Info'] = df['Name_Info'].apply(remove_numbers)
df['Name_Info']

0      Kay Aiko Abe  Nisei female Born May   in Selle...
1      Art Abe  Nisei male Born June   in Seattle Was...
2      Sharon Tanagi Aburano  Nisei female Born Octob...
3      Toshiko Aiboshi  Nisei female Born July   in B...
4      Douglas L Aihara  Sansei male Born March   in ...
                             ...                        
972    Karen Yoshitomi  Sansei female Born  in Spokan...
973    John Young  Chinese American male Born May   i...
974    Sharon Yuen  Sansei female Born July  in Seatt...
975    Lois Yuki  Nisei female Born September   in th...
976    Aaron Zajic  Born in Baltimore Maryland During...
Name: Name_Info, Length: 977, dtype: object

In [15]:
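# Build the stopword set once; calling stopwords.words() for every word is very slow
STOP_WORDS = set(stopwords.words('english'))

def remove_stopwords(text):
    if isinstance(text, (int, float)):
        return str(text)
    return ' '.join([word for word in text.split() if word.lower() not in STOP_WORDS])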

# Call the function on DataFrame columns
df['Name_Info'] = df['Name_Info'].apply(remove_stopwords)
df['Name_Info']

0      Kay Aiko Abe Nisei female Born May Selleck Was...
1      Art Abe Nisei male Born June Seattle Washingto...
2      Sharon Tanagi Aburano Nisei female Born Octobe...
3      Toshiko Aiboshi Nisei female Born July Boyle H...
4      Douglas L Aihara Sansei male Born March Torran...
                             ...                        
972    Karen Yoshitomi Sansei female Born Spokane Was...
973    John Young Chinese American male Born May Los ...
974    Sharon Yuen Sansei female Born July Seattle Wa...
975    Lois Yuki Nisei female Born September Tule Lak...
976    Aaron Zajic Born Baltimore Maryland Redress Mo...
Name: Name_Info, Length: 977, dtype: object

In [16]:
def lowercase_text(text):
    if isinstance(text, (int, float)):
        return str(text)
    return text.lower()

# Call the function on DataFrame columns
df['Name_Info'] = df['Name_Info'].apply(lowercase_text)
df['Name_Info']

0      kay aiko abe nisei female born may selleck was...
1      art abe nisei male born june seattle washingto...
2      sharon tanagi aburano nisei female born octobe...
3      toshiko aiboshi nisei female born july boyle h...
4      douglas l aihara sansei male born march torran...
                             ...                        
972    karen yoshitomi sansei female born spokane was...
973    john young chinese american male born may los ...
974    sharon yuen sansei female born july seattle wa...
975    lois yuki nisei female born september tule lak...
976    aaron zajic born baltimore maryland redress mo...
Name: Name_Info, Length: 977, dtype: object
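
The question asks for the clean data to be saved in a new column, whereas the cells above overwrite `Name_Info` at each step. A minimal sketch of steps (1) to (4) as one function whose result is stored under a separate key, leaving the raw text intact. The `Clean_Info` name and the tiny stopword set are assumptions for illustration (the notebook uses NLTK's English list), and the sample string is modeled on the second scraped row.

```python
import re

# Tiny illustrative stopword set (an assumption); the notebook uses NLTK's English list.
STOPWORDS = {"in", "the", "a", "an", "of", "and", "to", "was", "on"}

def clean_text(text: str) -> str:
    """Apply steps (1)-(4): strip punctuation, strip digits, lowercase, drop stopwords."""
    text = re.sub(r"[^\w\s]|_", " ", text)   # punctuation and underscores -> space
    text = re.sub(r"\d+", " ", text)          # digits -> space
    tokens = text.lower().split()             # lowercase; split() also collapses whitespace
    return " ".join(t for t in tokens if t not in STOPWORDS)

rows = [{"Name_Info": "Art Abe : Nisei male. Born June 12, 1921, in Seattle"}]
for row in rows:
    # Keep the raw text and store the result under a separate key/column.
    row["Clean_Info"] = clean_text(row["Name_Info"])

print(rows[0]["Clean_Info"])
```

Replacing removed characters with a space and then splitting avoids the double spaces visible in the outputs above.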

In [17]:
def stem_text(text):
    if isinstance(text, (int, float)):
        return str(text)
    stemmer = SnowballStemmer('english')
    return ' '.join([stemmer.stem(word) for word in text.split()])

# Call the function on DataFrame columns
df['Name_Info'] = df['Name_Info'].apply(stem_text)

df['Name_Info']

0      kay aiko abe nisei femal born may selleck wash...
1      art abe nisei male born june seattl washington...
2      sharon tanagi aburano nisei femal born octob s...
3      toshiko aiboshi nisei femal born juli boyl hei...
4      dougla l aihara sansei male born march torranc...
                             ...                        
972    karen yoshitomi sansei femal born spokan washi...
973    john young chines american male born may los a...
974    sharon yuen sansei femal born juli seattl wash...
975    loi yuki nisei femal born septemb tule lake co...
976    aaron zajic born baltimor maryland redress mov...
Name: Name_Info, Length: 977, dtype: object

In [18]:
def lemmatize_text(text):
    if isinstance(text, (int, float)):
        return str(text)
    lemmatizer = WordNetLemmatizer()
    return ' '.join([lemmatizer.lemmatize(word) for word in text.split()])

# Call the function on DataFrame columns
df['Name_Info'] = df['Name_Info'].apply(lemmatize_text)
df['Name_Info']

0      kay aiko abe nisei femal born may selleck wash...
1      art abe nisei male born june seattl washington...
2      sharon tanagi aburano nisei femal born octob s...
3      toshiko aiboshi nisei femal born juli boyl hei...
4      dougla l aihara sansei male born march torranc...
                             ...                        
972    karen yoshitomi sansei femal born spokan washi...
973    john young chine american male born may los an...
974    sharon yuen sansei femal born juli seattl wash...
975    loi yuki nisei femal born septemb tule lake co...
976    aaron zajic born baltimor maryland redress mov...
Name: Name_Info, Length: 977, dtype: object
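
In the cells above, lemmatization runs on text that has already been stemmed, so the lemmatizer mostly sees stems such as "femal" and "seattl" and leaves them unchanged; in row 973 it even turns "chines" into "chine". A sketch of an alternative, assuming separate output columns are acceptable: derive the stemmed and lemmatized text from the same lowercased base. `toy_stem` and `toy_lemma` are placeholders for `SnowballStemmer.stem` and `WordNetLemmatizer.lemmatize`, and the column names are hypothetical.

```python
from typing import Callable, Dict

def add_normalized_columns(
    row: Dict[str, str],
    stem: Callable[[str], str],
    lemmatize: Callable[[str], str],
    source: str = "Clean_Info",
) -> Dict[str, str]:
    """Derive stemmed and lemmatized text from the same source column."""
    tokens = row[source].split()
    out = dict(row)
    out["Stemmed"] = " ".join(stem(t) for t in tokens)
    out["Lemmatized"] = " ".join(lemmatize(t) for t in tokens)
    return out

# Toy stand-ins so the sketch runs without NLTK; they are NOT real stemmers.
def toy_stem(token: str) -> str:
    return token[:-1] if token.endswith("e") else token

def toy_lemma(token: str) -> str:
    return {"movies": "movie"}.get(token, token)

row = {"Clean_Info": "chinese american male"}
result = add_normalized_columns(row, toy_stem, toy_lemma)
print(result)
```

Because both columns start from the same tokens, the lemmatizer receives real words rather than stems.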

# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean text, and calculate the count of each entity type.

In [19]:
import nltk

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Function to perform POS tagging and count POS tags
def pos_tag_and_count(text):
    # Tokenize the text into words
    words = nltk.word_tokenize(text)
    # Perform POS tagging
    pos_tags = nltk.pos_tag(words)

    # Initialize counts for nouns, verbs, adjectives, and adverbs
    noun_count = 0
    verb_count = 0
    adj_count = 0
    adv_count = 0

    # Count POS tags
    for word, tag in pos_tags:
        if tag.startswith('N'):
            noun_count += 1
        elif tag.startswith('V'):
            verb_count += 1
        elif tag.startswith('J'):
            adj_count += 1
        elif tag.startswith('R'):
            adv_count += 1

    return noun_count, verb_count, adj_count, adv_count

# Apply POS tagging and counting function to each row in the DataFrame
df['Noun_Count'], df['Verb_Count'], df['Adj_Count'], df['Adv_Count'] = zip(*df['Name_Info'].apply(pos_tag_and_count))

# Print the DataFrame with POS counts
print(df)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


                                             Name_Info  Noun_Count  \
0    kay aiko abe nisei femal born may selleck wash...          10   
1    art abe nisei male born june seattl washington...          10   
2    sharon tanagi aburano nisei femal born octob s...          14   
3    toshiko aiboshi nisei femal born juli boyl hei...          11   
4    dougla l aihara sansei male born march torranc...          11   
..                                                 ...         ...   
972  karen yoshitomi sansei femal born spokan washi...           6   
973  john young chine american male born may los an...           8   
974  sharon yuen sansei femal born juli seattl wash...           9   
975  loi yuki nisei femal born septemb tule lake co...          11   
976  aaron zajic born baltimor maryland redress mov...          11   

     Verb_Count  Adj_Count  Adv_Count  
0             3          2          1  
1             2          2          0  
2             1          4          0  

In [20]:
import spacy
from spacy import displacy

# Load the English language model in spaCy
nlp = spacy.load("en_core_web_sm")

# Function to perform constituency parsing and dependency parsing and visualize the parsing trees
def parse_and_visualize(text):
    # Parse the text using spaCy
    doc = nlp(text)

    # Render each sentence separately. Note: displacy's "dep" style draws dependency
    # arcs, so this output is not a true constituency tree.
    print("Constituency Parsing Tree:")
    for sent in doc.sents:
        displacy.render(sent, style="dep", options={"compact": True, "bg": "#09a3d5", "color": "#ffffff", "font": "Source Sans Pro"})

    # Visualize the dependency parsing tree
    print("Dependency Parsing Tree:")
    displacy.render(doc, style="dep", options={"compact": True, "bg": "#09a3d5", "color": "#ffffff", "font": "Source Sans Pro"})

# Apply parsing and visualization function to the first row in the DataFrame
parse_and_visualize(df['Name_Info'].iloc[0])


Constituency Parsing Tree:


Dependency Parsing Tree:
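
Both renderings above use displacy's `dep` style, so they show the same dependency view. To illustrate the difference the question asks about, here is a hand-built example (not parser output) for the sentence "the dog chased the cat". A constituency tree groups words into nested phrases (NP, VP) under a sentence node, while a dependency parse links each word directly to its head word with a labeled relation. The tuple layout and the `to_brackets` helper are assumptions made for this sketch.

```python
# Hand-built structures for "the dog chased the cat"; illustrative, not parser output.

# Constituency: nested phrases, written as (label, children...).
constituency = (
    "S",
    ("NP", ("DT", "the"), ("NN", "dog")),
    ("VP", ("VBD", "chased"), ("NP", ("DT", "the"), ("NN", "cat"))),
)

def to_brackets(node) -> str:
    """Render a nested-tuple tree in bracketed notation."""
    if isinstance(node, str):
        return node
    label, *children = node
    return "(" + label + " " + " ".join(to_brackets(c) for c in children) + ")"

# Dependency: one (dependent, relation, head) triple per word; the root has head None.
dependency = [
    ("the", "det", "dog"),
    ("dog", "nsubj", "chased"),
    ("chased", "ROOT", None),
    ("the", "det", "cat"),
    ("cat", "dobj", "chased"),
]

print(to_brackets(constituency))
for dependent, relation, head in dependency:
    print(f"{dependent} --{relation}--> {head}")
```

In the constituency view, "the cat" is a unit (an NP inside the VP); in the dependency view, there are no phrase nodes, only the arcs from "cat" to "chased" and from "the" to "cat".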


In [25]:
import spacy
import pandas as pd

# Load the English language model in spaCy
nlp = spacy.load("en_core_web_sm")

# Function to perform Named Entity Recognition and count entities
def extract_entities(text):
    # Parse the text using spaCy
    doc = nlp(text)

    # Initialize counts for different entity types
    entity_counts = {
        "PERSON": 0,
        "ORG": 0,
        "GPE": 0,  # Geo-Political Entity (Location)
        "PRODUCT": 0,
        "DATE": 0
    }

    # Iterate over each entity in the document
    for ent in doc.ents:
        # Check the entity label and update the corresponding count
        if ent.label_ in entity_counts:
            entity_counts[ent.label_] += 1

    return entity_counts

# Apply entity extraction function to the clean text column in the DataFrame
entity_counts = df['Name_Info'].apply(extract_entities)

# Initialize counts for total entities
total_entity_counts = {
    "PERSON": 0,
    "ORG": 0,
    "GPE": 0,
    "PRODUCT": 0,
    "DATE": 0
}

# Sum the entity counts across all rows
for counts in entity_counts:
    for entity_type, count in counts.items():
        total_entity_counts[entity_type] += count

# Print the total counts of each entity type
print("Total Entity Counts:")
for entity_type, count in total_entity_counts.items():
    print(f"{entity_type}: {count}")


Total Entity Counts:
PERSON: 2
ORG: 1
GPE: 1
PRODUCT: 0
DATE: 0
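
The totals above are very low for 977 biographies. A likely cause (an assumption, not verified here) is that the model was run on lowercased, stemmed text, which removes the capitalization and word forms that entity recognition usually relies on; running `nlp` on the raw scraped text would be worth trying. The aggregation step itself can be written more compactly with `collections.Counter`. The sketch below takes `(text, label)` pairs, such as `[(ent.text, ent.label_) for ent in doc.ents]`; the pairs shown are hand-written for illustration and are not spaCy output.

```python
from collections import Counter
from typing import Iterable, List, Tuple

WANTED_LABELS = ("PERSON", "ORG", "GPE", "PRODUCT", "DATE")

def count_entities(docs: Iterable[List[Tuple[str, str]]]) -> Counter:
    """Total the wanted entity labels over many documents."""
    totals = Counter({label: 0 for label in WANTED_LABELS})
    for entities in docs:
        totals.update(label for _text, label in entities if label in WANTED_LABELS)
    return totals

# Hand-written entity pairs for illustration; not actual spaCy output.
docs = [
    [("Art Abe", "PERSON"), ("June 12, 1921", "DATE"), ("Seattle", "GPE")],
    [("Kay Aiko Abe", "PERSON"), ("May 9, 1927", "DATE"), ("Nisei", "NORP")],
]
totals = count_entities(docs)
print(dict(totals))
```

Labels outside `WANTED_LABELS` (here `NORP`) are ignored, matching the filter in the cell above.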


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [None]:
# Write your response below
""" The assignment was challenging. Every time I work on an assignment
related to NLTK, it pushes me to think, search on Google, and acquire knowledge.
The first question was similar to the last exercise. The second and third questions took a lot of time, and I did them last.
The time was enough for the assignment.
"""
