
# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import time
import urllib.error
import urllib.request

# Base URL of the Densho narrator pages; each narrator page is addressed by a numeric ID
base = 'https://ddr.densho.org/narrators/'

# Browser-like request headers, defined once and reused for every request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# Initialize an empty list to store data
data_list = []

# Loop through narrator IDs 1 to 904
for i in range(1, 905):
    url = base + str(i) + '/'
    req = urllib.request.Request(url, headers=headers)

    try:
        with urllib.request.urlopen(req) as response:
            page_content = response.read()

        soup = BeautifulSoup(page_content, 'html.parser')
        # /html/body/div[2]/div/section/div[1]/div[1]/h1 - Name
        # /html/body/div[2]/div/section/div[1]/div[1]/p - Description
        name_tag = soup.find('h1')
        description_tag = soup.find('p')
        if name_tag is None or description_tag is None:
            # Skip pages that lack the expected elements instead of raising AttributeError
            continue

        # Append data to the list (.text keeps the whitespace found in the page markup)
        data_list.append({'Name': name_tag.text, 'Description': description_tag.text})

        # Introduce a delay (0.002 seconds) between requests
        time.sleep(0.002)

    except urllib.error.HTTPError:
        # Skip narrator IDs whose page returns an HTTP error
        pass

# Create a DataFrame from the list of data
df = pd.DataFrame(data_list)

# Save the DataFrame to a CSV file
df.to_csv('densho.csv', index=False)
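
The loop above skips any page that raises an `HTTPError` and pauses only 0.002 seconds between requests. A minimal sketch of a more forgiving fetch step is shown below. The helper name `fetch_with_retry` and its parameters (`opener`, `retries`, `base_delay`, `sleep`) are hypothetical and not part of the original notebook; the retry counts and delays are assumptions, not values recommended by the Densho site.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, opener=urllib.request.urlopen, retries=3,
                     base_delay=1.0, sleep=time.sleep):
    """Return the page bytes for `url`, or None if every attempt fails.

    Hypothetical helper: `opener` and `sleep` are injectable so the
    function can be exercised without touching the network.
    """
    for attempt in range(retries):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as error:
            # A missing page will not appear on a retry, so give up at once
            if error.code == 404:
                return None
        except urllib.error.URLError:
            # Connection-level failure; fall through to the backoff below
            pass
        # Wait longer after each failed attempt: 1x, 2x, 4x base_delay, ...
        if attempt < retries - 1:
            sleep(base_delay * 2 ** attempt)
    return None
```

Inside the scraping loop, `page_content = fetch_with_retry(req)` could replace the `urlopen` call, with a `continue` when it returns `None`.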

In [2]:
df.head()

| | Name | Description |
|---|---|---|
| 0 | \n Gene Akutsu\n | Nisei male. Born September 23, 1925, in Seattl... |
| 1 | \n Jim Akutsu\n | Nisei male. Born January 25, 1920, in Seattle,... |
| 2 | \n Terry Aratani\n | Nisei male. During World War II, served with I... |
| 3 | \n Kenneth Okuma\n | Nisei male. Born September 19, 1917, in Hanape... |
| 4 | \n Yone Bartholomew\n | Nisei female. Born April 12, 1904, in Bedderav... |
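
The `Name` values in the output above still carry the newlines and padding from the page markup (for example `\n Gene Akutsu\n`). One possible way to tidy them is sketched here; `normalize_whitespace` is a hypothetical helper, not something the original notebook defines.

```python
def normalize_whitespace(value):
    """Collapse runs of whitespace and trim both ends of a scraped string."""
    # str.split() with no arguments splits on any whitespace and drops empties
    return ' '.join(str(value).split())


print(normalize_whitespace('\n Gene Akutsu\n'))  # prints: Gene Akutsu
```

With pandas this could be applied as `df['Name'] = df['Name'].apply(normalize_whitespace)` before the CSV file is written.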


# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [3]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string

# Download NLTK resources if not already downloaded
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

# Load the DataFrame from the CSV file
df = pd.read_csv('densho.csv')

# Build the stopword set, stemmer, and lemmatizer once instead of on every call
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Function to perform text cleaning
def clean_text(text):
    # Return an empty string for missing (NaN) descriptions
    if pd.isna(text):
        return ''

    # Remove special characters and punctuation
    text = ''.join([char for char in str(text) if char not in string.punctuation])

    # Remove numbers
    text = ''.join([char for char in text if not char.isdigit()])

    # Remove stopwords (compared in lowercase so capitalized stopwords are caught too)
    text = ' '.join([word for word in text.split() if word.lower() not in stop_words])

    # Lowercase all text
    text = text.lower()

    # Tokenization (split the text into words)
    tokens = nltk.word_tokenize(text)

    # Stemming
    tokens = [stemmer.stem(word) for word in tokens]

    # Lemmatization (applied to the already stemmed tokens)
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    # Join the cleaned tokens back into a single string
    cleaned_text = ' '.join(tokens)

    return cleaned_text

# Apply the clean_text function to the 'Description' column and create a new 'Cleaned_Description' column
df['Cleaned_Description'] = df['Description'].apply(clean_text)

# Save the DataFrame with the new column to the CSV file
df.to_csv('densho_cleaned.csv', index=False)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [4]:
df.head()

| | Name | Description | Cleaned_Description |
|---|---|---|---|
| 0 | \n Gene Akutsu\n | Nisei male. Born September 23, 1925, in Seattl... | nisei male born septemb seattl washington spen... |
| 1 | \n Jim Akutsu\n | Nisei male. Born January 25, 1920, in Seattle,... | nisei male born januari seattl washington inca... |
| 2 | \n Terry Aratani\n | Nisei male. During World War II, served with I... | nisei male world war ii serv compani part nd r... |
| 3 | \n Kenneth Okuma\n | Nisei male. Born September 19, 1917, in Hanape... | nisei male born septemb hanapep hawaii world w... |
| 4 | \n Yone Bartholomew\n | Nisei female. Born April 12, 1904, in Bedderav... | nisei femal born april bedderavia california g... |
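
The question asks for code and output for each part, while `clean_text` returns only the final string, and the preview shows that stemming has already altered tokens such as `septemb` and `seattl`. One possible way to keep every intermediate result is sketched below. The function name `clean_steps` and the tiny `STOP_WORDS` set are placeholders invented for this example; a real run would use the NLTK stopword list, and the stemming and lemmatization steps would be added in the same pattern. Unlike `clean_text`, this sketch lowercases before removing stopwords, which gives the same tokens.

```python
import re
import string

# Placeholder stopword set for illustration only (an assumption, not the NLTK list)
STOP_WORDS = {'a', 'an', 'and', 'in', 'of', 'the', 'with'}


def clean_steps(text):
    """Return a dict holding the text after each cleaning step."""
    steps = {}

    # (1) Remove punctuation and special characters
    no_punct = text.translate(str.maketrans('', '', string.punctuation))
    steps['no_punctuation'] = no_punct

    # (2) Remove numbers
    no_digits = re.sub(r'\d+', '', no_punct)
    steps['no_numbers'] = no_digits

    # (4) Lowercase all text
    lowered = no_digits.lower()
    steps['lowercase'] = lowered

    # (3) Remove stopwords; split() also collapses the gaps left by removed digits
    tokens = [word for word in lowered.split() if word not in STOP_WORDS]
    steps['no_stopwords'] = ' '.join(tokens)

    return steps


for step, value in clean_steps('Nisei male. Born September 23, 1925, in Seattle.').items():
    print(f'{step}: {value}')
```

Each key could become its own DataFrame column, so that the unstemmed text remains available for later analysis.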


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tags, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.
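
Both the POS totals in (1) and the entity counts in (3) come down to tallying labelled items. The sketch below assumes the tags and entities are already available as `(text, label)` pairs, for example from `nltk.pos_tag` or from `(ent.text, ent.label_)` over spaCy's `doc.ents`. The helper names `count_pos` and `count_entities` are hypothetical, and the sample pairs are written by hand rather than produced by a tagger.

```python
from collections import Counter

# First letter of a Penn Treebank tag mapped to a coarse category.
# Note: the 'R' prefix also matches RP (particle), not only adverb tags.
POS_PREFIXES = {'N': 'Noun', 'V': 'Verb', 'J': 'Adjective', 'R': 'Adverb'}


def count_pos(tagged_tokens):
    """Total the nouns, verbs, adjectives, and adverbs in (word, tag) pairs."""
    totals = Counter({name: 0 for name in POS_PREFIXES.values()})
    for _word, tag in tagged_tokens:
        category = POS_PREFIXES.get(tag[:1])
        if category:
            totals[category] += 1
    return totals


def count_entities(entities):
    """Return (count per label, count per distinct entity) for (text, label) pairs."""
    by_label = Counter(label for _text, label in entities)
    by_entity = Counter(entities)
    return by_label, by_entity


# Hand-written sample input (assumed tags, not real tagger output)
sample_tags = [('nisei', 'NN'), ('born', 'VBN'), ('early', 'RB'),
               ('large', 'JJ'), ('seattle', 'NNP'), ('in', 'IN')]
sample_entities = [('Seattle', 'GPE'), ('Seattle', 'GPE'), ('1925', 'DATE')]

print(count_pos(sample_tags))
print(count_entities(sample_entities))
```

Summing the per-row results over the whole DataFrame would give the corpus-wide totals the question asks for.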

In [5]:
import nltk
from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize
from nltk.tree import Tree
import spacy
import pandas as pd

# Download NLTK resources if not already downloaded
try:
    nltk.download('punkt')
    nltk.download('maxent_ne_chunker')
    nltk.download('words')
    nltk.download('averaged_perceptron_tagger')
except Exception as nltk_download_exception:
    print(f"NLTK download failed with the following error: {nltk_download_exception}")

# Load the clean text DataFrame
try:
    df_cleaned = pd.read_csv('densho_cleaned.csv')
except FileNotFoundError as file_not_found_error:
    print(f"File not found: {file_not_found_error}")
    # Stop here: every step below needs df_cleaned, so check the file path and rerun
    raise

# Function for Parts of Speech (POS) tagging
def pos_tagging(text):
    try:
        words = word_tokenize(text)
        pos_tags = pos_tag(words)

        # Count the number of Nouns, Verbs, Adjectives, and Adverbs
        noun_count = len([word for word, pos in pos_tags if pos.startswith('N')])
        verb_count = len([word for word, pos in pos_tags if pos.startswith('V')])
        adj_count = len([word for word, pos in pos_tags if pos.startswith('J')])
        adv_count = len([word for word, pos in pos_tags if pos.startswith('R')])

        return noun_count, verb_count, adj_count, adv_count
    except Exception as pos_tagging_exception:
        print(f"POS tagging failed with the following error: {pos_tagging_exception}")
        return 0, 0, 0, 0  # Return default values or handle the error accordingly

# Apply POS tagging and create new columns for counts
try:
    df_cleaned[['Noun_Count', 'Verb_Count', 'Adj_Count', 'Adv_Count']] = df_cleaned['Cleaned_Description'].apply(lambda x: pd.Series(pos_tagging(x)))
except Exception as pos_tagging_apply_exception:
    print(f"POS tagging apply failed with the following error: {pos_tagging_apply_exception}")

# Load spaCy model for Named Entity Recognition (NER)
try:
    nlp = spacy.load("en_core_web_sm")
except Exception as spacy_load_exception:
    print(f"spaCy model loading failed with the following error: {spacy_load_exception}")

# Function for Named Entity Recognition (NER)
def ner_extraction(text):
    try:
        doc = nlp(text)

        # Extract entities (person names, organizations, locations, product names, and dates)
        entities = [ent.text for ent in doc.ents]

        return entities
    except Exception as ner_extraction_exception:
        print(f"NER extraction failed with the following error: {ner_extraction_exception}")
        return []  # Return an empty list or handle the error accordingly

# Apply NER and create a new column for extracted entities
try:
    df_cleaned['Entities'] = df_cleaned['Cleaned_Description'].apply(ner_extraction)
except Exception as ner_apply_exception:
    print(f"NER apply failed with the following error: {ner_apply_exception}")

# Save the DataFrame with the new columns to a new CSV file
try:
    df_cleaned.to_csv('densho_analysis.csv', index=False)
except Exception as csv_save_exception:
    print(f"CSV save failed with the following error: {csv_save_exception}")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


POS tagging failed with the following error: expected string or bytes-like object
POS tagging failed with the following error: expected string or bytes-like object
POS tagging failed with the following error: expected string or bytes-like object
POS tagging failed with the following error: expected string or bytes-like object
NER extraction failed with the following error: [E1041] Expected a string, Doc, or bytes as input, but got: <class 'float'>
NER extraction failed with the following error: [E1041] Expected a string, Doc, or bytes as input, but got: <class 'float'>
NER extraction failed with the following error: [E1041] Expected a string, Doc, or bytes as input, but got: <class 'float'>
NER extraction failed with the following error: [E1041] Expected a string, Doc, or bytes as input, but got: <class 'float'>
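
The messages above report that a `float` was passed where a string was expected. A likely cause (an assumption, since the notebook does not confirm it) is that descriptions which became empty after cleaning were written to `densho_cleaned.csv` as blank cells, and `pd.read_csv` reads blank cells back as `NaN`. A minimal guard, using a hypothetical `as_text` helper, could look like this:

```python
import math


def as_text(value):
    """Return `value` as a string, mapping None and NaN to an empty string."""
    if isinstance(value, str):
        return value
    if value is None:
        return ''
    if isinstance(value, float) and math.isnan(value):
        return ''
    return str(value)


print(repr(as_text(float('nan'))))  # prints: ''
print(repr(as_text('nisei male')))  # prints: 'nisei male'
```

Applying it with `df_cleaned['Cleaned_Description'].apply(as_text)`, or calling `fillna('')` on the column, before tagging would give the tokenizer and the spaCy pipeline a string for every row.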


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also give your opinion on the time provided to complete the assignment.

In [None]:
# Write your response below
# Thank you for providing such good questions for learning about web scraping, text preprocessing, and grammar analysis. For the first question I worked with the Densho website and collected each name and its description. The approach is simple: I took a base URL, appended each ID to it as a string, and passed the resulting page to BeautifulSoup. Some of the links did not return data, so I used error handling for those links and then stored the collected data in a CSV file.
# In question 2 I used NLTK and other libraries for the task. The dataset had some null values, so I again used error handling to skip them.
# In question 3 my approach combined NLTK and spaCy to extract the grammatical information, and I again used error handling so that the analysis could run to completion.
# Overall, it has been tough, but I learned a lot. Thank you.