<a href="https://colab.research.google.com/github/tejaswini-151999/SriNagTejaswiniGandikota_INFO5731_Fall2024/blob/main/INFO5731_Assignment_2_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Tuesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (40 points)

Write a Python program to collect text data from **any one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 User Reviews of a recent movie released in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [40]:
import requests
import pandas as pd
import time


In [41]:
def fetch_abstracts(query, total_papers=10000, batch_size=100):
    """Page through Semantic Scholar search results for `query`.

    Returns a list of {'Title', 'Abstract'} dicts and stops early on an
    HTTP error or an empty page.
    """
    base_url = "https://api.semanticscholar.org/graph/v1/paper/search"
    abstracts = []
    offset = 0

    while len(abstracts) < total_papers:
        params = {
            'query': query,
            'limit': batch_size,
            'offset': offset,
            'fields': 'title,abstract'
        }

        print(f"Fetching batch starting at offset {offset} for query '{query}'...")
        response = requests.get(base_url, params=params, timeout=30)

        if response.status_code != 200:
            # 429 means the rate limit was hit; this call gives up and returns what it has
            print(f"Error fetching data: {response.status_code}")
            break

        data = response.json()
        papers = data.get('data', [])

        if not papers:
            print("No more abstracts available.")
            break

        for paper in papers:
            # A field can be present but null, in which case a dict.get default
            # is never used; fall back explicitly so no None reaches the CSV
            title = paper.get('title') or 'No title available'
            abstract = paper.get('abstract') or ''
            abstracts.append({'Title': title, 'Abstract': abstract})

            if len(abstracts) >= total_papers:
                break

        offset += batch_size
        time.sleep(2)  # Pause between batches to stay under the rate limit

    return abstracts


In [42]:
# Set parameters for fetching abstracts
queries = [
    "machine learning",
    "data science",
    "artificial intelligence",
    "information extraction"
]
total_papers_per_query = 2500  # Each query will fetch up to 2500 abstracts
all_abstracts = []

# Loop over each query and fetch abstracts
for query in queries:
    print(f"Fetching abstracts for query: {query}")
    abstracts = []
    retry_count = 0

    while len(abstracts) < total_papers_per_query and retry_count < 5:  # Retry a max of 5 times
        # fetch_abstracts always restarts at offset 0, so request the full target on
        # every attempt and keep the longest result; extending the list instead
        # would store the first pages again as duplicates
        fetched_abstracts = fetch_abstracts(query, total_papers=total_papers_per_query)

        if len(fetched_abstracts) == 0:  # If no papers were fetched
            print(f"No more papers found for query '{query}'.")
            break

        if len(fetched_abstracts) > len(abstracts):
            abstracts = fetched_abstracts

        if len(abstracts) < total_papers_per_query:
            # The fetch stopped early (rate limit or end of results); wait, then retry
            print(f"Total abstracts collected so far: {len(abstracts)}")
            time.sleep(2)
        else:
            print(f"Successfully fetched {len(abstracts)} abstracts for query '{query}'.")
            break

        retry_count += 1  # Increment retry count

    all_abstracts.extend(abstracts)  # Append the abstracts to the main list
    print(f"Total abstracts collected so far: {len(all_abstracts)}")
    time.sleep(5)  # Add a longer wait before the next query


Fetching abstracts for query: machine learning
Fetching batch starting at offset 0 for query 'machine learning'...
Error fetching data: 429
No more papers found for query 'machine learning'.
Total abstracts collected so far: 0
Fetching abstracts for query: data science
Fetching batch starting at offset 0 for query 'data science'...
Error fetching data: 429
No more papers found for query 'data science'.
Total abstracts collected so far: 0
Fetching abstracts for query: artificial intelligence
Fetching batch starting at offset 0 for query 'artificial intelligence'...
Fetching batch starting at offset 100 for query 'artificial intelligence'...
Error fetching data: 429
Total abstracts collected so far: 100
Fetching batch starting at offset 0 for query 'artificial intelligence'...
Fetching batch starting at offset 100 for query 'artificial intelligence'...
Fetching batch starting at offset 200 for query 'artificial intelligence'...
Fetching batch starting at offset 300 for query 'artificial 
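
The `429` lines in the log above are rate-limit rejections, and the fetch function stops as soon as it sees one. A common remedy is to retry with exponential backoff. The sketch below is one possible shape for that, not the notebook's method: `fetch_with_backoff`, `fetch_page`, `max_retries`, and `base_delay` are hypothetical names introduced here, and the request itself is passed in as a callable so the retry logic can be tried without network access.

```python
import random
import time


def fetch_with_backoff(fetch_page, max_retries=5, base_delay=2.0, sleep=time.sleep):
    """Call fetch_page() until it stops returning HTTP 429 or retries run out.

    fetch_page is a hypothetical zero-argument callable that returns a
    (status_code, payload) tuple, e.g. a thin wrapper around requests.get.
    """
    for attempt in range(max_retries + 1):
        status, payload = fetch_page()
        if status != 429:
            return status, payload
        if attempt == max_retries:
            break
        # Wait base_delay * 2**attempt seconds plus up to 1 s of random jitter
        delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
        print(f"Rate limited (429); retrying in {delay:.1f} s")
        sleep(delay)
    return 429, None


# Demonstration with a fake page fetcher: two 429 responses, then success.
# The sleep function is replaced by a no-op so the demo finishes instantly.
responses = iter([(429, None), (429, None), (200, {"data": []})])
status, payload = fetch_with_backoff(lambda: next(responses), sleep=lambda s: None)
print(status, payload)
```

In the notebook, `fetch_page` could be a small lambda that wraps the existing `requests.get(base_url, params=params)` call and returns `(response.status_code, response.json())` for successful responses.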

In [43]:
# Save the results to a CSV file
df = pd.DataFrame(all_abstracts)
df.to_csv('research_abstracts.csv', index=False)
print("Data saved to research_abstracts.csv")

# Optional: Download the CSV file
from google.colab import files
files.download('research_abstracts.csv')


Data saved to research_abstracts.csv
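
The log above requests offset 0 for 'artificial intelligence' more than once, so the same papers may have been collected twice before saving. A minimal sketch of removing repeats by title follows; `dedupe_records` is a hypothetical helper, and matching on title alone is an assumption that could merge distinct papers sharing a title.

```python
def dedupe_records(records, key="Title"):
    """Return records with later repeats of the same key value removed, order preserved."""
    seen = set()
    unique = []
    for record in records:
        # Normalise case and surrounding whitespace so trivial variants match
        marker = (record.get(key) or "").strip().lower()
        if marker in seen:
            continue
        seen.add(marker)
        unique.append(record)
    return unique


# Toy placeholder records (not data from the notebook)
toy_records = [
    {"Title": "Paper A", "Abstract": "first"},
    {"Title": "Paper B", "Abstract": "second"},
    {"Title": "paper a ", "Abstract": "first again"},
]
print(dedupe_records(toy_records))
```

With the notebook's DataFrame, `df.drop_duplicates(subset='Title')` is the pandas equivalent for exact title matches.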




# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [None]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
# Load the collected abstracts from the CSV file
df = pd.read_csv('research_abstracts.csv')

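# Check for NaN values and fill them with empty strings if any.
# Assign the result back: fillna(..., inplace=True) on a column selection
# will stop working in pandas 3.0.
df['Abstract'] = df['Abstract'].fillna('')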

# Display the first few rows to verify loading
print(df.head())


                                               Title  \
0  Fashion-MNIST: a Novel Image Dataset for Bench...   
1  TensorFlow: A system for large-scale machine l...   
2  TensorFlow: Large-Scale Machine Learning on He...   
3  Stop explaining black box machine learning mod...   
4  Convolutional LSTM Network: A Machine Learning...   

                                            Abstract  
0  We present Fashion-MNIST, a new dataset compri...  
1  TensorFlow is a machine learning system that o...  
2  TensorFlow is an interface for expressing mach...  
3                                                     
4  The goal of precipitation nowcasting is to pre...  




In [None]:
def remove_noise(text):
    # Debug print to check what is being passed
    print(f"Original text: {text}")

    if isinstance(text, str):  # Check if the text is a string
        # Keep only letters and whitespace: this drops punctuation and special
        # characters (step 1) and digits (step 2) in a single pass
        cleaned_text = re.sub(r'[^a-zA-Z\s]', '', text)
        # Debug print to see the cleaned text
        print(f"Cleaned text: {cleaned_text}")
        return cleaned_text
    else:
        return ''  # Return an empty string for NaN or non-string values


In [None]:
# Apply the noise removal function
df['Cleaned_Abstract'] = df['Abstract'].apply(remove_noise)

# Display the first few rows of the DataFrame
print(df[['Abstract', 'Cleaned_Abstract']].head())


Original text: We present Fashion-MNIST, a new dataset comprising of 28x28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per category. The training set has 60,000 images and the test set has 10,000 images. Fashion-MNIST is intended to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms, as it shares the same image size, data format and the structure of training and testing splits. The dataset is freely available at this https URL
Cleaned text: We present FashionMNIST a new dataset comprising of x grayscale images of  fashion products from  categories with  images per category The training set has  images and the test set has  images FashionMNIST is intended to serve as a direct dropin replacement for the original MNIST dataset for benchmarking machine learning algorithms as it shares the same image size data format and the structure of training and testing splits The dataset is freely avai
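
The cell above covers steps (1) and (2) only. Steps (3) to (6) could be sketched as below, using the `PorterStemmer` and `WordNetLemmatizer` classes the notebook already imports. The function names and the small `SAMPLE_STOPWORDS` set are assumptions made for illustration; the full list comes from `stopwords.words('english')` once the corpus is downloaded.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Illustrative stopword subset (an assumption); the notebook's own import,
# nltk.corpus.stopwords.words('english'), gives the full list after download
SAMPLE_STOPWORDS = {"a", "an", "the", "of", "is", "are", "has", "and", "to", "for", "we", "as", "it"}


def lowercase_and_filter(text, stop_words=SAMPLE_STOPWORDS):
    """Steps 3 and 4: lowercase, split on whitespace, drop stopwords."""
    return [tok for tok in text.lower().split() if tok not in stop_words]


def stem_tokens(tokens):
    """Step 5: Porter stemming (rule-based suffix stripping, no corpus needed)."""
    stemmer = PorterStemmer()
    return [stemmer.stem(tok) for tok in tokens]


def lemmatize_tokens(tokens):
    """Step 6: WordNet lemmatization; requires nltk.download('wordnet') first."""
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(tok) for tok in tokens]


# A sentence taken from the cleaned text above
tokens = lowercase_and_filter("The training set has images and the test set has images")
print(tokens)
print(stem_tokens(tokens))
```

Stemming and lemmatization are usually treated as alternatives, so one option is to store each result in its own column, for example by applying these functions to `df['Cleaned_Abstract']` and joining the tokens back into a string.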

In [None]:
# Save the cleaned DataFrame to a new CSV file
df.to_csv('cleaned_research_abstracts.csv', index=False)
print("Cleaned data saved to 'cleaned_research_abstracts.csv'.")


Cleaned data saved to 'cleaned_research_abstracts.csv'.


In [None]:
# Download the cleaned CSV file
from google.colab import files
files.download('cleaned_research_abstracts.csv')



# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tokens, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity type.

In [None]:
# Step 1: Import Libraries and Load Data
import pandas as pd
import nltk

# Ensure NLTK resources are downloaded
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Load your cleaned abstracts from the CSV file
df = pd.read_csv('cleaned_research_abstracts.csv')

# Sample Output
print("Loaded Data:")
print(df.head())  # Show the first few rows of the cleaned abstracts


Loaded Data:
                                               Title  \
0  Fashion-MNIST: a Novel Image Dataset for Bench...   
1  TensorFlow: A system for large-scale machine l...   
2  TensorFlow: Large-Scale Machine Learning on He...   
3  Stop explaining black box machine learning mod...   
4  Convolutional LSTM Network: A Machine Learning...   

                                            Abstract  \
0  We present Fashion-MNIST, a new dataset compri...   
1  TensorFlow is a machine learning system that o...   
2  TensorFlow is an interface for expressing mach...   
3                                                NaN   
4  The goal of precipitation nowcasting is to pre...   

                                    Cleaned_Abstract  
0  We present FashionMNIST a new dataset comprisi...  
1  TensorFlow is a machine learning system that o...  
2  TensorFlow is an interface for expressing mach...  
3                                                NaN  
4  The goal of precipitation nowcastin

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [None]:
# Step 2: POS Tagging and Counting
def pos_tagging(text):
    try:
        # Tokenize the text into words
        tokens = nltk.word_tokenize(text)
        # Tag the tokens with parts of speech
        pos_tags = nltk.pos_tag(tokens)
        return pos_tags
    except Exception as e:
        print(f"Error in POS tagging: {e}")
        return []

# Initialize counters for each POS
noun_count = 0
verb_count = 0
adj_count = 0
adv_count = 0

# Process each cleaned abstract
for abstract in df['Cleaned_Abstract']:
    # Skip empty abstracts
    if pd.isna(abstract) or not isinstance(abstract, str) or abstract.strip() == "":
        continue

    pos_tags = pos_tagging(abstract)
    if not pos_tags:  # Check if pos_tags is empty
        continue

    for word, tag in pos_tags:
        if tag.startswith('NN'):
            noun_count += 1
        elif tag.startswith('VB'):
            verb_count += 1
        elif tag.startswith('JJ'):
            adj_count += 1
        elif tag.startswith('RB'):
            adv_count += 1

# Output the results
print(f"Total Nouns: {noun_count}, Total Verbs: {verb_count}, Total Adjectives: {adj_count}, Total Adverbs: {adv_count}")


Total Nouns: 39857, Total Verbs: 18587, Total Adjectives: 13311, Total Adverbs: 3722
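
The prefix checks in the loop above rely on Penn Treebank tag names, where every noun tag starts with `NN`, every verb tag with `VB`, every adjective tag with `JJ`, and every adverb tag with `RB`. The same tally can be written more compactly with `collections.Counter`. This is a sketch of an equivalent formulation; `count_coarse_pos` and `COARSE_CLASSES` are hypothetical names, and the sample input is hand-tagged rather than tagger output.

```python
from collections import Counter

# Penn Treebank tag prefixes mapped to the four coarse classes counted above
COARSE_CLASSES = {"NN": "Noun", "VB": "Verb", "JJ": "Adjective", "RB": "Adverb"}


def count_coarse_pos(tagged_tokens):
    """Count nouns, verbs, adjectives and adverbs in (word, tag) pairs."""
    counts = Counter()
    for _word, tag in tagged_tokens:
        coarse = COARSE_CLASSES.get(tag[:2])
        if coarse is not None:
            counts[coarse] += 1
    return counts


# Hand-tagged toy input (an assumption, not tagger output from the notebook)
sample = [("We", "PRP"), ("present", "VBP"), ("a", "DT"), ("new", "JJ"),
          ("dataset", "NN"), ("freely", "RB"), ("images", "NNS")]
print(count_coarse_pos(sample))
```

Passing the output of `nltk.pos_tag(tokens)` to `count_coarse_pos` and summing the returned counters across abstracts would give the same four totals.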


In [None]:
# Step 3: Constituency and Dependency Parsing
import spacy

# Load the English NLP model
nlp = spacy.load("en_core_web_sm")

# Choose a sample sentence from the cleaned abstracts
sample_sentence = df['Cleaned_Abstract'].iloc[0]  # You can replace with any specific abstract

# Perform parsing
doc = nlp(sample_sentence)

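# Sentence segmentation only: spaCy's en_core_web_sm pipeline has no constituency
# parser, so this loop prints each sentence's text rather than a constituency tree
print("Constituency Tree:")
for sent in doc.sents:
    print(sent)

# Dependency Parsing: print each token, its dependency label, and its head token
print("\nDependency Parse:")
for token in doc:
    print(f"{token.text} --> {token.dep_} --> {token.head.text}")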


Constituency Tree:
We present FashionMNIST a new dataset comprising of x grayscale images of  fashion products from  categories with  images per category The training set has  images and the test set has  images FashionMNIST is intended to serve as a direct dropin replacement for the original MNIST dataset for benchmarking machine learning algorithms as it shares the same image size data format and the structure of training and testing splits The dataset is freely available at this https URL

Dependency Parse:
We --> nsubj --> present
present --> ROOT --> present
FashionMNIST --> dobj --> present
a --> det --> comprising
new --> amod --> comprising
dataset --> amod --> comprising
comprising --> dobj --> present
of --> prep --> comprising
x --> compound --> images
grayscale --> compound --> images
images --> pobj --> of
of --> prep --> images
  --> dep --> of
fashion --> compound --> products
products --> pobj --> of
from --> prep --> images
  --> dep --> from
categories --> pobj --> fr
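
Each line of the dependency output above links one token to its head through a labelled relation, for example `We --> nsubj --> present` marks "We" as the subject of the verb "present". A constituency tree instead groups words into nested phrases (noun phrase, verb phrase, and so on). To illustrate the difference, here is a minimal sketch that parses the opening words of the first abstract with NLTK's `ChartParser`. The grammar is a hand-written toy covering only this clause, which is an assumption made for illustration and not a general-purpose parser.

```python
import nltk

# Toy grammar (an assumption) covering only the clause "We present a new dataset"
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> PRP | DT JJ NN
VP -> VBP NP
PRP -> 'We'
VBP -> 'present'
DT -> 'a'
JJ -> 'new'
NN -> 'dataset'
""")

tokens = "We present a new dataset".split()
parser = nltk.ChartParser(grammar)
trees = list(parser.parse(tokens))

for tree in trees:
    print(tree)  # bracketed constituency tree
```

The root `S` splits into a subject noun phrase ("We") and a verb phrase, and the verb phrase in turn contains the object noun phrase "a new dataset". The dependency view expresses the same structure as direct word-to-word links with "present" as the root.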

In [None]:
# Step 4: Named Entity Recognition (NER)
def extract_entities(doc):
    entities = {
        'PERSON': 0,
        'ORG': 0,
        'GPE': 0,  # Geopolitical entity (countries, cities, states); other locations are tagged LOC and not counted here
        'PRODUCT': 0,
        'DATE': 0
    }

    for ent in doc.ents:
        if ent.label_ in entities:
            entities[ent.label_] += 1
    return entities

# Initialize total counts for entities
total_entities = {
    'PERSON': 0,
    'ORG': 0,
    'GPE': 0,
    'PRODUCT': 0,
    'DATE': 0
}

# Process each cleaned abstract
for abstract in df['Cleaned_Abstract']:
    # Skip empty or invalid abstracts
    if pd.isna(abstract) or not isinstance(abstract, str) or abstract.strip() == "":
        continue

    # Create a doc object for each valid abstract
    doc = nlp(abstract)

    # Extract and count named entities
    entities = extract_entities(doc)
    for key in total_entities:
        total_entities[key] += entities[key]

# Output the results
print("\nNamed Entity Counts:")
for entity, count in total_entities.items():
    print(f"{entity}: {count}")



Named Entity Counts:
PERSON: 407
ORG: 1666
GPE: 106
PRODUCT: 73
DATE: 120


# **Comment**
Make sure to submit the cleaned data CSV in the comment section - 10 points

In [None]:
# Save cleaned abstracts to a new CSV file
#df.to_csv('cleaned_research_abstracts.csv', index=False)

# Inform the user about the saved file
#print("Cleaned data saved to cleaned_research_abstracts.csv")


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [None]:
# Write your response below
# I guess the questions are getting more complicated.