
# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [8]:
import csv
import requests
import concurrent.futures
from bs4 import BeautifulSoup

# Function to collect user reviews from IMDb
def collect_user_reviews(imdb_url):
    review_data = []
    page_num = 1

    # Loop to collect reviews until 1000 reviews are gathered
    while len(review_data) < 1000:
        # Send a GET request to IMDb with sorting and filtering parameters
        response = requests.get(
            f"{imdb_url}?sort=helpfulnessScore&dir=desc&ratingFilter=0&page={page_num}",
            timeout=30,  # avoid hanging indefinitely on a stalled connection
        )
        soup = BeautifulSoup(response.text, 'html.parser')
        review_containers = soup.find_all('div', class_='review-container')

        if not review_containers:
            break  # No more reviews available

        # Iterating through each review container to extract review text and rating
        for container in review_containers:
            review_text_element = container.find('div', class_='text show-more__control')
            review_rating_element = container.find('span', class_='rating-other-user-rating')

            # Checking if both review text and rating elements are found
            if review_text_element and review_rating_element:
                review_text = review_text_element.text.strip()
                review_rating = review_rating_element.text.strip()
                review_data.append((review_text, review_rating))  # Appending review data as a tuple

                if len(review_data) >= 1000:
                    break

        page_num += 1

    return review_data[:1000]  # Return up to 1000 reviews

# Main function
def main():
    imdb_urls = [
        'https://www.imdb.com/title/tt1517268/reviews',  # Barbie 2023

    ]

    with open('movie_reviews.csv', 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['Review', 'Rating']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()

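        total_reviews_collected = 0
        with concurrent.futures.ThreadPoolExecutor() as executor:
            # Submit every URL before waiting on any result; calling .result()
            # right after submit() would block and run the URLs one at a time.
            futures = []
            for imdb_url in imdb_urls:
                print(f"Collecting reviews from: {imdb_url}")
                futures.append(executor.submit(collect_user_reviews, imdb_url))

            # Write results in submission order, stopping at 1000 rows overall
            for future in futures:
                for review, rating in future.result():
                    writer.writerow({'Review': review, 'Rating': rating})
                    total_reviews_collected += 1
                    if total_reviews_collected >= 1000:
                        break
                if total_reviews_collected >= 1000:
                    break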

    print("Reviews collected and saved to 'movie_reviews.csv'.")

"""
main() submits each IMDb URL to a ThreadPoolExecutor and writes the returned reviews to 'movie_reviews.csv'.
The collect_user_reviews function gathers user reviews by sending HTTP requests to IMDb, parsing each HTML response
to extract the review text and rating, and storing them in a list.
"""

if __name__ == "__main__":
    main()


Collecting reviews from: https://www.imdb.com/title/tt1517268/reviews
Reviews collected and saved to 'movie_reviews.csv'.
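
The page-by-page loop in `collect_user_reviews` can be sketched without any network access. This is a minimal sketch, not the notebook's actual scraper: `fetch_page` is a hypothetical callable that stands in for the HTTP request plus HTML parsing, and the duplicate check is an added safeguard (an assumption, not part of the original code) so the loop stops if a site keeps returning the same page.

```python
from typing import Callable, List, Tuple

Review = Tuple[str, str]  # (review text, rating)


def collect_paginated(fetch_page: Callable[[int], List[Review]], limit: int) -> List[Review]:
    """Collect up to `limit` unique reviews by requesting pages 1, 2, 3, ...

    `fetch_page` is a hypothetical stand-in for "download page N and parse it".
    """
    collected: List[Review] = []
    seen = set()
    page_num = 1
    while len(collected) < limit:
        added = 0
        for review in fetch_page(page_num):
            if review in seen:
                continue  # skip reviews already collected from an earlier page
            seen.add(review)
            collected.append(review)
            added += 1
            if len(collected) >= limit:
                break
        if added == 0:
            break  # empty page, or a page with only repeats: nothing more to collect
        page_num += 1
    return collected


# Fake data source with two pages, used only for this demonstration.
FAKE_PAGES = {1: [("Loved it", "9/10"), ("Too long", "5/10")], 2: [("Fun", "8/10")]}


def fake_fetch(page_num: int) -> List[Review]:
    return FAKE_PAGES.get(page_num, [])


def stuck_fetch(page_num: int) -> List[Review]:
    # Simulates a site that ignores the page parameter and always serves page 1.
    return FAKE_PAGES[1]


print(collect_paginated(fake_fetch, 10))
print(len(collect_paginated(stuck_fetch, 1000)))
```

Without the duplicate check, the `stuck_fetch` case would fill the result with 1000 copies of the same two reviews, so it is worth confirming that a scraped CSV contains distinct rows.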


# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [9]:
# Write code for each of the sub parts with proper comments.
!pip install nltk

import csv
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Build the stopword set, stemmer, and lemmatizer once rather than on every call
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()
LEMMATIZER = WordNetLemmatizer()

# Function to clean and preprocess text
def clean_text(text):
    # (1) Remove noise: special characters and punctuation
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # (2) Remove numbers (the pattern above already drops digits; kept as an explicit step)
    text = re.sub(r'\d+', '', text)

    # (3) Remove stopwords (compared case-insensitively)
    text = ' '.join([word for word in text.split() if word.lower() not in STOP_WORDS])

    # (4) Lowercase all text
    text = text.lower()

    # (5) Stemming
    text = ' '.join([STEMMER.stem(word) for word in text.split()])

    # (6) Lemmatization (applied to the stemmed tokens)
    text = ' '.join([LEMMATIZER.lemmatize(word) for word in text.split()])

    return text

# Main function to clean the text data
def clean_text_data(input_csv_file, output_csv_file):
    # Read the input CSV file
    with open(input_csv_file, 'r', encoding='utf-8') as csvfile:
        reader = csv.DictReader(csvfile)
        rows = list(reader)

    # Clean the text data and add a new column for clean text
    for row in rows:
        review = row['Review']
        clean_review = clean_text(review)
        row['Clean_Review'] = clean_review

    # Write the cleaned data to a new CSV file
    with open(output_csv_file, 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = list(rows[0].keys())
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    print(f"Cleaned data saved to '{output_csv_file}'.")

if __name__ == "__main__":
    input_csv_file = 'movie_reviews.csv'  # Input CSV file containing the original reviews
    output_csv_file = 'cleaned_movie_reviews.csv'  # Output CSV file to save the cleaned reviews
    clean_text_data(input_csv_file, output_csv_file)



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Cleaned data saved to 'cleaned_movie_reviews.csv'.
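
To make the effect of the first cleaning steps visible, here is a small standard-library sketch of the regex, stopword, and lowercase stages applied to one sentence. It is an illustration under stated assumptions: `DEMO_STOPWORDS` is a tiny hand-picked set rather than NLTK's English list, `basic_clean` and the sample sentence are hypothetical, and stemming and lemmatization are left out because they need NLTK.

```python
import re

# Tiny illustrative stopword set (an assumption); the notebook uses NLTK's English list.
DEMO_STOPWORDS = {"the", "was", "a", "and", "it", "i", "of"}


def basic_clean(text):
    # Keep only letters and whitespace: drops punctuation, symbols, and digits.
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    # Remove stopwords, comparing case-insensitively.
    words = [w for w in text.split() if w.lower() not in DEMO_STOPWORDS]
    # Lowercase the remaining words.
    return " ".join(words).lower()


sample = "The movie was GREAT!!! I watched it 2 times, and loved 90% of it."
print(basic_clean(sample))       # movie great watched times loved
print(basic_clean("I'm happy"))  # im happy
```

The second call shows a side effect of deleting apostrophes: "I'm" collapses to "im", which no longer matches the stopword "i". The same effect is likely behind the stray "i m" tokens in the parsing output further down.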


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of nouns (N), verbs (V), adjectives (Adj), and adverbs (Adv).

(2) **Constituency Parsing and Dependency Parsing:** Print the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all entities, such as person names, organizations, locations, product names, and dates, from the clean text, and calculate the count of each entity.

In [10]:
# (1) Parts of Speech (POS) Tagging: Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

import nltk
import pandas as pd
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from collections import Counter

# Download NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def pos_tagging_and_count(text):
    # Tokenize the text
    words = word_tokenize(text)

    # Perform POS tagging
    pos_tags = pos_tag(words)

    # Count the occurrences of each POS
    pos_counts = Counter(tag[1] for tag in pos_tags)

    return pos_counts

# Read cleaned data
cleaned_df = pd.read_csv('cleaned_movie_reviews.csv')

# Apply POS tagging and count for each row in the 'Clean_Review' column
cleaned_df['pos_counts'] = cleaned_df['Clean_Review'].apply(pos_tagging_and_count)

# Penn Treebank tags that make up each coarse part-of-speech class
NOUN_TAGS = ('NN', 'NNS', 'NNP', 'NNPS')
VERB_TAGS = ('VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ')
ADJ_TAGS = ('JJ', 'JJR', 'JJS')
ADV_TAGS = ('RB', 'RBR', 'RBS')

# Merge the per-review counters into one corpus-wide counter
total_counts = Counter()
for pos_count in cleaned_df['pos_counts']:
    total_counts.update(pos_count)

# Sum the tags that belong to each class (a Counter returns 0 for missing tags)
noun_count = sum(total_counts[tag] for tag in NOUN_TAGS)
verb_count = sum(total_counts[tag] for tag in VERB_TAGS)
adj_count = sum(total_counts[tag] for tag in ADJ_TAGS)
adv_count = sum(total_counts[tag] for tag in ADV_TAGS)

# Print POS counts
print("Parts of Speech Counts:")
print(f"Noun: {noun_count}")
print(f"Verb: {verb_count}")
print(f"Adjective: {adj_count}")
print(f"Adverb: {adv_count}")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Parts of Speech Counts:
Noun: 65347
Verb: 17763
Adjective: 25639
Adverb: 5578
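
The grouping of fine-grained Penn Treebank tags into the four requested classes can be checked on a few tokens without running a tagger. In this sketch the `tagged` list is hand-written for illustration (an assumption, not tagger output), and `coarse_pos_counts` is a hypothetical helper that groups tags by their two-letter prefix, which covers the same tags the notebook lists explicitly.

```python
from collections import Counter

# Two-letter Penn Treebank prefixes mapped to the four coarse classes.
PREFIX_TO_CLASS = {"NN": "Noun", "VB": "Verb", "JJ": "Adjective", "RB": "Adverb"}


def coarse_pos_counts(tagged_tokens):
    """Count nouns, verbs, adjectives, and adverbs in (word, tag) pairs."""
    counts = Counter()
    for _word, tag in tagged_tokens:
        label = PREFIX_TO_CLASS.get(tag[:2])
        if label is not None:  # tags such as DT or IN are ignored
            counts[label] += 1
    return counts


# Hand-tagged example tokens, for illustration only.
tagged = [
    ("Barbie", "NNP"), ("film", "NN"), ("reviews", "NNS"),
    ("was", "VBD"), ("really", "RB"), ("funny", "JJ"), ("the", "DT"),
]

print(coarse_pos_counts(tagged))
```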


In [11]:
#(2) Constituency Parsing and Dependency Parsing: print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

import spacy
import pandas as pd
from spacy import displacy

# Load the spaCy English language model
nlp = spacy.load('en_core_web_sm')

def parse_sentences(text):
    # Process the text using spaCy
    doc = nlp(text)

    # Get the first sentence in the document
    first_sent = next(doc.sents)

    # spaCy's en_core_web_sm pipeline has no constituency parser, so this step only
    # prints the sentence's tokens as a placeholder; it is not a real constituency tree
    constituency_tree = [token.text_with_ws for token in first_sent]
    print("Constituency Parsing Tree:")
    print(' '.join(constituency_tree))

    # Dependency parsing tree
    dependency_tree = [(token.text, token.dep_, token.head.text) for token in first_sent]
    print("\nDependency Parsing Tree:")
    for token, dep, head in dependency_tree:
        print(f"{token} ({dep}) --> {head}")

    # Plot dependency graph
    displacy.render(first_sent, style='dep', jupyter=True)

    print("\n")

if __name__ == '__main__':
    # Read cleaned data
    cleaned_df = pd.read_csv('cleaned_movie_reviews.csv')

    # Apply parsing for the first review only
    parse_sentences(cleaned_df['Clean_Review'][0])

    """
    The dependency output above is built by:
      1. Extracting the token text, dependency label, and head word for each token.
      2. Combining them into (token, label, head) tuples.
      3. Printing each tuple as "token (label) --> head".
    For example, "she (nsubj) --> given" means "she" is the nominal subject of "given",
    and the ROOT token "given" lists itself as its own head.

     """

Constituency Parsing Tree:
margot  best  she  given  film  disappoint  market  fun  quirki  satir  homag  movi  start  way  end  overdramat  speech  end  clearli  tri  make  audienc  feel  someth  left  everyon  feel  confus  say  i m  crotcheti  old  man 

Dependency Parsing Tree:
margot (advmod) --> best
best (advmod) --> given
she (nsubj) --> given
given (ROOT) --> given
film (compound) --> satir
disappoint (compound) --> market
market (compound) --> fun
fun (compound) --> satir
quirki (compound) --> satir
satir (compound) --> movi
homag (compound) --> movi
movi (nsubj) --> start
start (dobj) --> given
way (advmod) --> start
end (conj) --> given
overdramat (compound) --> speech
speech (nsubj) --> end
end (nsubj) --> make
clearli (compound) --> tri
tri (nsubj) --> make
make (ccomp) --> end
audienc (nsubj) --> feel
feel (ccomp) --> make
someth (advmod) --> left
left (amod) --> everyon
everyon (nsubj) --> feel
feel (ccomp) --> feel
confus (nsubj) --> say
say (ccomp) --> feel
i (nsubj) 
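
The flat "token (label) --> head" lines can also be read as a tree by following each head link. The sketch below shows one way to do that with the standard library only. The `ARCS` list copies the first four tokens of the output above; the numeric indices are added here as an assumption so that repeated words would stay distinguishable, and `format_tree` is a hypothetical helper, not a spaCy function.

```python
from collections import defaultdict

# (index, token, dependency label, head index). The root points at itself,
# matching how the ROOT token is reported in the output above.
ARCS = [
    (0, "margot", "advmod", 1),
    (1, "best", "advmod", 3),
    (2, "she", "nsubj", 3),
    (3, "given", "ROOT", 3),
]


def format_tree(arcs):
    """Render dependency arcs as an indented tree, root first."""
    children = defaultdict(list)
    by_index = {}
    root = None
    for idx, token, dep, head in arcs:
        by_index[idx] = (token, dep)
        if head == idx:
            root = idx  # the root is its own head
        else:
            children[head].append(idx)

    lines = []

    def walk(idx, depth):
        token, dep = by_index[idx]
        lines.append("  " * depth + f"{token} ({dep})")
        for child in children[idx]:
            walk(child, depth + 1)

    walk(root, 0)
    return "\n".join(lines)


print(format_tree(ARCS))
```

Each level of indentation is one dependency step away from the root, so "margot" sits under "best", which in turn sits under "given".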





In [12]:
#(3) Named Entity Recognition: Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

import spacy
from collections import Counter
import pandas as pd

# Load the spaCy English language model
nlp = spacy.load('en_core_web_sm')

def extract_entities(text):
    # Process the text using spaCy
    doc = nlp(text)

    # Extract named entities and count each distinct (entity text, entity label) pair
    entity_counts = Counter([(ent.text, ent.label_) for ent in doc.ents])

    return entity_counts

if __name__ == '__main__':
    # Read cleaned data
    cleaned_df = pd.read_csv('cleaned_movie_reviews.csv')

    # Apply NER for each review
    cleaned_df['entity_counts'] = cleaned_df['Clean_Review'].apply(extract_entities)

    # Print the count of each (entity text, entity label) pair for each review
    for i, entity_count in enumerate(cleaned_df['entity_counts']):
        print(f"Entities for Review {i + 1}:")
        for entity, count in entity_count.items():
            print(f"{entity}: {count}")
        print("\n")

Streaming output truncated to the last 5000 lines.
('shullivan', 'PERSON'): 1
('first', 'ORDINAL'): 1
('third', 'ORDINAL'): 1
('variou ken agre', 'PERSON'): 1
('ken barbi', 'PERSON'): 1
('deliveri produc', 'PERSON'): 1
('ryan gosl ken doll', 'PERSON'): 1
('eight', 'CARDINAL'): 1
('vaniti', 'GPE'): 1
('ryan gosl', 'PERSON'): 1
('chang world advanc femin barbi', 'ORG'): 1
('ken dollsal', 'PERSON'): 1
('today', 'DATE'): 1


Entities for Review 437:
('funni icon', 'ORG'): 1
('kinda', 'PERSON'): 1
('half', 'CARDINAL'): 1


Entities for Review 438:
('overdramat', 'GPE'): 1
('crotcheti', 'PERSON'): 1
('second', 'ORDINAL'): 1
('half', 'CARDINAL'): 1


Entities for Review 439:
('two', 'CARDINAL'): 1
('day', 'DATE'): 1
('everi day', 'EVENT'): 1
('compani barbi', 'PERSON'): 1
('ken ryan gosl', 'PERSON'): 1
('barbi fulli dedic conceit', 'ORG'): 1
('fulli', 'GPE'): 1
('mattel headquart', 'PERSON'): 1
('one third', 'CARDINAL'): 1
('bush', 'PERSON'): 1
('veng film repeatedli', 'PERSON')
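
The output above lists entities review by review. To get one count per entity and per entity type across the whole corpus, the per-review counters can be merged. The following is a minimal sketch: `per_review` copies the entries shown for Reviews 437 and 438, and `summarize` is a hypothetical helper name.

```python
from collections import Counter

# Per-review counters keyed by (entity text, entity label), copied from
# the entries printed for Reviews 437 and 438.
per_review = [
    Counter({("funni icon", "ORG"): 1, ("kinda", "PERSON"): 1, ("half", "CARDINAL"): 1}),
    Counter({("overdramat", "GPE"): 1, ("crotcheti", "PERSON"): 1,
             ("second", "ORDINAL"): 1, ("half", "CARDINAL"): 1}),
]


def summarize(per_review_counts):
    """Return corpus-wide counts per entity and per entity label."""
    entity_totals = Counter()
    for counts in per_review_counts:
        entity_totals.update(counts)  # adds counts for matching keys

    label_totals = Counter()
    for (_text, label), n in entity_totals.items():
        label_totals[label] += n
    return entity_totals, label_totals


entity_totals, label_totals = summarize(per_review)
print(entity_totals.most_common(3))
print(label_totals)
```

Many of the extracted "entities" are stemmed fragments such as "ryan gosl", so running NER on the original review text instead of the cleaned column may give more reliable labels.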

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [13]:
# Write your response below

"""
The assignment covered natural language processing (NLP) tasks such as text cleaning, named entity recognition, parsing, and part-of-speech tagging. Developing working solutions with Python and libraries such as NLTK, spaCy, and pandas required problem-solving skills.
What I enjoyed is that the assignment gave me the chance to learn about several NLP topics, such as entity recognition, syntactic and semantic analysis, and text preprocessing. It let me explore how to use libraries such as spaCy and NLTK to get the required results.
Working with natural language data is challenging because of its complexity and variability. Tasks such as text cleaning, tokenization, and parsing require careful handling of linguistic nuances and edge cases to produce accurate and reliable results. The hardest part was constructing the IMDb URL so that it fetched reviews from all the pages; that took almost a day of trial and error before it finally worked. I felt the time given was not sufficient: as a student in my last semester, I have three other courses alongside this one, and managing all of them together with preparing for a future job has been difficult. I also feel that appropriate guidance from the authorities would be very helpful when students are stuck at any point.

"""

