
# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Do not create a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing using the button in the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late submissions will incur a penalty of 10% for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **any one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 User Reviews of a recent movie released in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [11]:
import urllib.error
import urllib.request
import pandas as pd
from bs4 import BeautifulSoup
import time

class DenshoScraper:
    def __init__(self, base_url):
        self.base_url = base_url
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }
        self.data_list = []

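    def scrape_page(self, page_number):
        url = f'{self.base_url}{page_number}/'
        req = urllib.request.Request(url, headers=self.headers)

        try:
            with urllib.request.urlopen(req) as response:
                page_content = response.read()
        except urllib.error.URLError as e:
            # Covers HTTPError (e.g. 404 for unused IDs) and network failures
            print(f'Skipping {url}: {e}')
            return

        soup = BeautifulSoup(page_content, 'html.parser')
        name_tag = soup.find('h1')
        description_tag = soup.find('p')
        # Skip pages that lack the expected elements instead of raising AttributeError
        if name_tag is None or description_tag is None:
            return
        self.data_list.append({
            'Name': name_tag.text.strip(),
            'Description': description_tag.text.strip(),
        })
        # Brief pause between requests; raise this value to be gentler on the server
        time.sleep(0.001)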

    def scrape_range(self, start_page, end_page):
        for page_number in range(start_page, end_page + 1):
            self.scrape_page(page_number)

    def save_to_csv(self, filename):
        df = pd.DataFrame(self.data_list)
        df.to_csv(filename, index=False)

# Usage:
base_url = 'https://ddr.densho.org/narrators/'
scraper = DenshoScraper(base_url)
scraper.scrape_range(1, 905)
scraper.save_to_csv('scrapped.csv')
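
The scraper above treats every failed request the same way, so a transient network error is indistinguishable from a narrator ID that does not exist. A minimal sketch of one way to separate the two cases follows. It assumes a hypothetical helper, `fetch_with_retry` (not part of the original code), that retries transient failures with an increasing delay and gives up immediately on HTTP 404. The `opener` and `sleep` parameters are illustrative injection points so the logic can be exercised without touching the network.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, retries=3, base_delay=1.0,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Return the response body as bytes, or None if the page does not exist.

    Hypothetical helper: retries transient errors, treats HTTP 404 as 'no page'.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as e:
            # HTTPError must be caught before URLError, its parent class
            if e.code == 404:
                return None  # missing page: retrying will not help
            last_error = e
        except urllib.error.URLError as e:
            last_error = e
        # Wait longer after each failed attempt: base_delay, 2x, 4x, ...
        if attempt < retries - 1:
            sleep(base_delay * (2 ** attempt))
    raise last_error
```

Inside `scrape_page`, a call such as `fetch_with_retry(url)` could stand in for the direct `urlopen` call if custom headers are not needed; passing a `urllib.request.Request` object instead of a plain URL string also works with the default opener.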

# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.
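
Steps (1), (2), and (4) above, plus a simplified version of step (3), need only the standard library, which makes them easy to check in isolation before NLTK is involved. The following is a minimal sketch, not the required solution: the function names and the sample sentence are illustrative assumptions, and `TINY_STOPWORDS` is a hypothetical stand-in for a real stopword list such as NLTK's. Punctuation is replaced with a space rather than deleted so that words joined by a hyphen or slash are not fused together, and every intermediate result is kept so the output of each part can be shown.

```python
import re
import string

# Hypothetical stand-in for a full stopword list (e.g. NLTK's English list)
TINY_STOPWORDS = {"in", "the", "a", "of", "and"}

_PUNCT_RE = re.compile(f"[{re.escape(string.punctuation)}]")


def remove_punctuation(text):
    # Replace each punctuation character with a space to keep word boundaries
    return _PUNCT_RE.sub(" ", text)


def remove_digits(text):
    return re.sub(r"\d+", " ", text)


def lowercase(text):
    return text.lower()


def drop_stopwords(text, stop_words=TINY_STOPWORDS):
    # split() with no argument also collapses the extra spaces left by earlier steps
    return " ".join(w for w in text.split() if w not in stop_words)


def clean_with_trace(text):
    """Apply the steps in order and return every intermediate result."""
    trace = {"original": text}
    steps = [
        ("no_punctuation", remove_punctuation),
        ("no_numbers", remove_digits),
        ("lowercase", lowercase),
        ("no_stopwords", drop_stopwords),
    ]
    for name, step in steps:
        text = step(text)
        trace[name] = text
    return trace


sample = "Nisei male. Born September 23, 1925, in Seattle, Washington."
for step_name, value in clean_with_trace(sample).items():
    print(f"{step_name}: {value}")
```

Lowercasing runs before stopword removal here so that the stopword comparison is a plain set lookup; stemming and lemmatization would follow as further entries in `steps`.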

In [6]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
import string

# Download NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')

# Load the data
df = pd.read_csv('scrapped.csv')

# Build shared NLTK resources once instead of on every row
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Define functions for text cleaning steps.
# Non-string input (e.g. NaN for a missing description) becomes ''.
def remove_noise(text):
    if not isinstance(text, str):
        return ''
    return ''.join(char for char in text if char not in string.punctuation)

def remove_numbers(text):
    if not isinstance(text, str):
        return ''
    return ''.join(char for char in text if not char.isdigit())

def remove_stopwords(text):
    if not isinstance(text, str):
        return ''
    return ' '.join(word for word in text.split() if word.lower() not in stop_words)

def text_lowercase(text):
    if not isinstance(text, str):
        return ''
    return text.lower()

def stemming(text):
    if not isinstance(text, str):
        return ''
    return ' '.join(ps.stem(word) for word in text.split())

def lemmatization(text):
    if not isinstance(text, str):
        return ''
    return ' '.join(lemmatizer.lemmatize(word) for word in text.split())

# Clean the 'Description' column
df['Cleaned_Description'] = df['Description'].apply(remove_noise)
df['Cleaned_Description'] = df['Cleaned_Description'].apply(remove_numbers)
df['Cleaned_Description'] = df['Cleaned_Description'].apply(remove_stopwords)
df['Cleaned_Description'] = df['Cleaned_Description'].apply(text_lowercase)
df['Cleaned_Description'] = df['Cleaned_Description'].apply(stemming)
df['Cleaned_Description'] = df['Cleaned_Description'].apply(lemmatization)

# Save the DataFrame to a CSV file with cleaned data
df.to_csv('scrapped_cleaned.csv', index=False)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [None]:
df.head()

| | Name | Description | Cleaned_Description |
| --- | --- | --- | --- |
| 0 | Gene Akutsu | Nisei male. Born September 23, 1925, in Seattl... | nisei male born septemb seattl washington spen... |
| 1 | Jim Akutsu | Nisei male. Born January 25, 1920, in Seattle,... | nisei male born januari seattl washington inca... |
| 2 | Terry Aratani | Nisei male. During World War II, served with I... | nisei male world war ii serv compani part nd r... |
| 3 | Kenneth Okuma | Nisei male. Born September 19, 1917, in Hanape... | nisei male born septemb hanapep hawaii world w... |
| 4 | Yone Bartholomew | Nisei female. Born April 12, 1904, in Bedderav... | nisei femal born april bedderavia california g... |


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities, such as person names, organizations, locations, product names, and dates, from the clean texts, and calculate the count of each entity.
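
For item (2) above, a dependency parse can be read as a list of (word, relation, head) edges, one per token, where the root token is its own head. The sketch below assumes spaCy and its `en_core_web_sm` model are installed; `dependency_edges` and `print_edges` are hypothetical helper names introduced only for illustration. spaCy's built-in pipeline does not produce constituency trees, so that half of item (2) would need a separate parser and is not shown here.

```python
def dependency_edges(doc):
    """Return (token, relation, head) triples for a parsed spaCy Doc or Span."""
    return [(token.text, token.dep_, token.head.text) for token in doc]


def print_edges(sentence, model_name="en_core_web_sm"):
    # Imported here so dependency_edges stays usable without spaCy installed
    import spacy

    nlp = spacy.load(model_name)
    for word, relation, head in dependency_edges(nlp(sentence)):
        print(f"{word:<12} --{relation}--> {head}")
```

Calling `print_edges` on a single sentence prints one line per token, which is a convenient form for explaining one example by hand: each line names the word, its grammatical relation, and the word it attaches to.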

In [8]:
import spacy
import pandas as pd

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

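# Load the cleaned text from the CSV file
df = pd.read_csv('scrapped_cleaned.csv')
# Empty descriptions are read back as NaN; replace them so str() does not yield 'nan'
clean_texts = df['Cleaned_Description'].fillna('').tolist()

# Part-of-Speech Tagging
pos_counts = {'NOUN': 0, 'VERB': 0, 'ADJ': 0, 'ADV': 0, 'PROPN': 0}
for text in clean_texts:
    try:
        doc = nlp(str(text))
        for token in doc:
            # Count only the tags tracked above; ignore all others
            if token.pos_ in pos_counts:
                pos_counts[token.pos_] += 1
    except ValueError:
        # nlp() raises ValueError when a text exceeds nlp.max_length
        pass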

# Save Part-of-Speech Counts to a file
with open('pos_counts.txt', 'w') as f:
    f.write("Part-of-Speech Counts:\n")
    for pos, count in pos_counts.items():
        f.write(f"{pos}: {count}\n")

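# Sentence Segmentation and Dependency Parsing
# Note: spaCy's built-in pipeline has no constituency parser. doc.sents yields
# sentence spans only, so constituency_trees holds sentences, not parse trees;
# a separate constituency parser would be needed to produce real trees.
constituency_trees = []
dependency_trees = []
for text in clean_texts:
    try:
        doc = nlp(str(text))
        constituency_trees.append(list(doc.sents))
        # doc.to_json() includes each token's dependency label and head index
        dependency_trees.append(doc.to_json())
    except ValueError:
        pass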

# Save Constituency Parsing Trees to a file
with open('constituency_trees.txt', 'w') as f:
    f.write("Constituency Parse Trees:\n")
    for tree in constituency_trees:
        f.write(f"{tree}\n")

# Save Dependency Parsing Trees to a file
with open('dependency_trees.txt', 'w') as f:
    f.write("Dependency Parse Trees:\n")
    for tree in dependency_trees:
        f.write(f"{tree}\n")
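
The cell above covers POS counts and parsing but not item (3), the entity counts. One possible sketch follows, assuming spaCy's `en_core_web_sm` model as in the cell above; `count_entity_labels` and `count_entities_in_texts` are hypothetical helper names. Running it on the original `Description` column rather than `Cleaned_Description` is an assumption worth testing: lowercased, stemmed text without punctuation tends to give an entity recognizer fewer cues to work with.

```python
from collections import Counter


def count_entity_labels(docs):
    """Tally entity labels (e.g. PERSON, ORG, GPE, DATE) over parsed docs."""
    counts = Counter()
    for doc in docs:
        counts.update(ent.label_ for ent in doc.ents)
    return counts


def count_entities_in_texts(texts, model_name="en_core_web_sm"):
    # Imported lazily so count_entity_labels can be used without spaCy installed
    import spacy

    nlp = spacy.load(model_name)
    # nlp.pipe processes texts in batches instead of one nlp() call per row
    return count_entity_labels(nlp.pipe(str(t) for t in texts))
```

With the DataFrame loaded in the cell above, `count_entities_in_texts(df['Description'].fillna(''))` would return a `Counter` keyed by entity label, and its `.most_common()` method lists the labels by frequency.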

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? What is your opinion on the time provided to complete the assignment?

In [None]:
# Write your response below
'''
Thank you for providing such valuable questions, which helped me improve my understanding of web scraping, text preprocessing, and grammatical analysis. In the first question, I scraped data from the Densho website, collecting narrator names and their associated descriptions. My approach involved iterating through page URLs and handling errors for pages that didn't yield data. I then stored the collected data in a CSV file.
For the second question, I used NLTK and other libraries to clean the text data. Handling null values in the dataset was crucial, and I implemented error handling to skip processing for these instances.
In the third question, I used a mixed approach with NLTK and spaCy for grammatical analysis. Error handling was again implemented to ensure smooth processing of the data.
Overall, these tasks were challenging but provided an excellent opportunity to learn and improve my skills. Thank you for the opportunity!
'''