<a href="https://colab.research.google.com/github/sivanathvenigalla/Jaya-Venkatasivanath_INFO5731_Fall2024/blob/main/Venigalla_jayavenkatasivanath_Assignment_2_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Tuesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie released recently, in 2023 or 2024 (you can choose any movie), from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [None]:
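import os
import time

import pandas as pd
import requests

# Read the API key from the environment instead of hard-coding it in the notebook.
# The variable name S2_API_KEY is an assumption; use whichever name you export.
api_key = os.environ.get("S2_API_KEY")

request_headers = {"Accept": "application/json"}
if api_key:
    request_headers["x-api-key"] = api_key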

search_term = "information extraction"

endpoint_url = "https://api.semanticscholar.org/graph/v1/paper/search"

request_params = {
    "query": search_term,
    "fields": "title,abstract",
    "limit": 100
}

collected_papers = []

# Request pages of 100 results each; stop at the first empty or failed page.
for start in range(0, 10000, 100):
    request_params["offset"] = start
    response = requests.get(endpoint_url, headers=request_headers, params=request_params)

    if response.status_code == 200:
        results = response.json().get("data", [])
        if results:
            collected_papers.extend(results)
            print(f"Retrieved papers starting from offset {start}.")
        else:
            print(f"No papers found at offset {start}.")
            break
    elif response.status_code == 429:
        print(f"Rate limit reached at offset {start}. Pausing briefly.")
        time.sleep(1)
        retry_response = requests.get(endpoint_url, headers=request_headers, params=request_params)

        if retry_response.status_code == 200:
            retry_results = retry_response.json().get("data", [])
            if retry_results:
                collected_papers.extend(retry_results)
                print(f"Successful retry for offset {start}.")
            else:
                print(f"No data found after retry at offset {start}.")
                break
        else:
            print(f"Retry failed with status code {retry_response.status_code} at offset {start}.")
            break
    else:
        print(f"Request failed with status code {response.status_code} at offset {start}.")
        break

    time.sleep(1)

# The API can return an explicit null abstract, so use `or` rather than a .get() default.
paper_data = [(item["title"], item.get("abstract") or "No abstract available") for item in collected_papers]

csv_file = "information_extraction_papers.csv"
df = pd.DataFrame(paper_data, columns=["Title", "Abstract"])
df.to_csv(csv_file, index=False, encoding="utf-8")

print(f"Successfully saved {len(paper_data)} paper abstracts to {csv_file}.")


Retrieved papers starting from offset 0.
Retrieved papers starting from offset 100.
Retrieved papers starting from offset 200.
Retrieved papers starting from offset 300.
Retrieved papers starting from offset 400.
Retrieved papers starting from offset 500.
Retrieved papers starting from offset 600.
Retrieved papers starting from offset 700.
Retrieved papers starting from offset 800.
Retrieved papers starting from offset 900.
Request failed with status code 400 at offset 1000.
Successfully saved 1000 paper abstracts to information_extraction_papers.csv.
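
The run above stops with status 400 at offset 1000, which suggests that this search endpoint does not page past roughly the first 1000 results. Semantic Scholar also offers a bulk search endpoint that pages with a continuation token instead of an offset. The sketch below shows that pagination pattern in isolation. The `fetch_page` callable, the `data` and `token` field names, and the endpoint path mentioned in the comment are assumptions to verify against the API documentation, not code from this notebook.

```python
def collect_with_token(fetch_page, max_items):
    """Collect up to max_items records from a token-paginated API.

    fetch_page is a hypothetical callable: it takes the continuation token
    (None for the first page) and returns a dict with a "data" list and,
    when more pages exist, a "token" string.
    """
    papers = []
    token = None
    while len(papers) < max_items:
        page = fetch_page(token)
        batch = page.get("data") or []
        if not batch:
            break  # empty page: nothing more to read
        papers.extend(batch)
        token = page.get("token")
        if not token:
            break  # no continuation token: this was the last page
    return papers[:max_items]


# With requests, fetch_page could wrap a call such as (assumed endpoint path):
#   requests.get("https://api.semanticscholar.org/graph/v1/paper/search/bulk",
#                headers=request_headers,
#                params={"query": search_term, "fields": "title,abstract", "token": token}).json()
```

Passing the page fetcher in as a function keeps the loop easy to test without network access.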


# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [None]:
# Write code for each of the sub parts with proper comments.
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

df = pd.read_csv('information_extraction_papers.csv')

ps = PorterStemmer()
wnl = WordNetLemmatizer()

stopword_set = set(stopwords.words('english'))

def preprocess_text(text):
    if not isinstance(text, str):
        return ''

    # Remove special characters and numbers
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text)

    # Tokenize and remove stopwords
    tokens = nltk.word_tokenize(text)
    tokens = [token for token in tokens if token.lower() not in stopword_set]

    # Convert to lowercase
    tokens = [token.lower() for token in tokens]

    # Apply stemming
    stemmed = [ps.stem(token) for token in tokens]

    # Apply lemmatization
    lemmatized = [wnl.lemmatize(token) for token in stemmed]

    # Join the processed words back into a single string
    return ' '.join(lemmatized)

# Fill missing values before casting; astype(str) first would turn NaN into the string 'nan'.
df['Abstract'] = df['Abstract'].fillna('').astype(str)
df['Processed_Abstract'] = df['Abstract'].apply(preprocess_text)
output_file = 'cleaned_information_extraction_papers.csv'
df.to_csv(output_file, index=False)
print(f"Cleaned data has been saved to '{output_file}'")


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Cleaned data has been saved to 'cleaned_information_extraction_papers.csv'
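
The question asks for output from each cleaning part, so the following minimal sketch traces the regex-based steps on one sample string. It uses only the standard library. `SAMPLE_STOPWORDS` is a tiny illustrative stand-in for NLTK's stopword list, and the function names are hypothetical. Unlike the cell above, it replaces punctuation with a space, so "zero-shot" becomes "zero shot" rather than "zeroshot". Stemming and lemmatization are left to the NLTK code above.

```python
import re

# Tiny illustrative stand-in for NLTK's English stopword list (an assumption).
SAMPLE_STOPWORDS = {"a", "and", "in", "is", "of", "the", "to"}


def squash(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def trace_cleaning(text):
    """Return the text after each cleaning step, keyed by step name."""
    steps = {}
    # (1) Replace punctuation and underscores with a space.
    text = squash(re.sub(r"[^\w\s]|_", " ", text))
    steps["no_punctuation"] = text
    # (2) Remove digits.
    text = squash(re.sub(r"\d+", " ", text))
    steps["no_numbers"] = text
    # (4) Lowercase first so stopword matching is case-insensitive.
    text = text.lower()
    steps["lowercased"] = text
    # (3) Drop stopwords.
    text = " ".join(t for t in text.split() if t not in SAMPLE_STOPWORDS)
    steps["no_stopwords"] = text
    return steps


for name, value in trace_cleaning("Zero-shot IE, in 2023, is the state_of_art!").items():
    print(f"{name}: {value}")
```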


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.
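
As a minimal illustration of the two parse types in part (2), the sketch below encodes hand-annotated parses of one short sentence. The sentence, the annotations, and the helper names are illustrative assumptions, not output from any parser. A constituency tree groups words into nested phrases (NP, VP), while a dependency parse links each word directly to its head word with a labeled relation.

```python
# Hand-annotated parses of "The model extracts entities" (illustrative only).

# Constituency tree as nested tuples: (label, child, ...); a leaf is (POS tag, word).
CONSTITUENCY = (
    "S",
    ("NP", ("DT", "The"), ("NN", "model")),
    ("VP", ("VBZ", "extracts"), ("NP", ("NNS", "entities"))),
)

# Dependency parse as (token, relation, head) triples; the root points to itself.
DEPENDENCIES = [
    ("The", "det", "model"),
    ("model", "nsubj", "extracts"),
    ("extracts", "ROOT", "extracts"),
    ("entities", "dobj", "extracts"),
]


def to_brackets(node):
    """Render a nested-tuple constituency tree in bracket notation."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return f"({label} {children[0]})"
    return f"({label} " + " ".join(to_brackets(child) for child in children) + ")"


def dependents_of(head, deps):
    """List the tokens whose head is the given word, excluding the root itself."""
    return [token for token, relation, h in deps if h == head and relation != "ROOT"]


print(to_brackets(CONSTITUENCY))
print(dependents_of("extracts", DEPENDENCIES))
```

In the constituency tree, "The model" forms a noun phrase that combines with the verb phrase to make the sentence. In the dependency parse, "model" (subject) and "entities" (object) both attach directly to the verb "extracts".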

In [None]:
# Your code here

import pandas as pd
import nltk
import spacy
from nltk import pos_tag, word_tokenize
from nltk.chunk import ne_chunk
from nltk.tree import Tree
from collections import Counter

# Download the NLTK resources used below (NLTK skips packages that are already up to date)
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('maxent_ne_chunker')
nltk.download('words')

# Load the cleaned CSV file
file = "cleaned_information_extraction_papers.csv"
data = pd.read_csv(file)

# Load the spaCy model for NER and dependency parsing
nlp = spacy.load("en_core_web_sm")

# Function to perform POS tagging and count POS types
def pos_analysis(text):
    tokens = word_tokenize(text)
    tagged = pos_tag(tokens)  # POS tagging

    # Counting specific POS tags
    counts = Counter(tag for word, tag in tagged)
    noun_count = sum(counts[tag] for tag in ['NN', 'NNS', 'NNP', 'NNPS'])
    verb_count = sum(counts[tag] for tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'])
    adjective_count = sum(counts[tag] for tag in ['JJ', 'JJR', 'JJS'])
    adverb_count = sum(counts[tag] for tag in ['RB', 'RBR', 'RBS'])

    pos_counts = {
        'Nouns': noun_count,
        'Verbs': verb_count,
        'Adjectives': adjective_count,
        'Adverbs': adverb_count
    }

    return tagged, pos_counts

# Function to perform dependency parsing using spaCy
def dependency_parsing(text):
    doc = nlp(text)
    return [(token.text, token.dep_, token.head.text) for token in doc]

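# Function to approximate constituency structure with NLTK's ne_chunk.
# Note: ne_chunk yields a shallow chunk tree (named-entity chunks under a root S),
# not a full constituency parse.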
def constituency_parsing(text):
    tokens = word_tokenize(text)
    tagged = pos_tag(tokens)
    chunked = ne_chunk(tagged)

    def tree_to_string(tree):
        """ Convert an NLTK tree to a string to visualize the tree structure. """
        if isinstance(tree, Tree):
            return f"({tree.label()} {' '.join(tree_to_string(child) for child in tree)})"
        else:
            return tree[0]

    return tree_to_string(chunked)

# Function to extract named entities and their counts
def ner_analysis(text):
    doc = nlp(text)
    entities = Counter(ent.label_ for ent in doc.ents)
    return entities, [(ent.text, ent.label_) for ent in doc.ents]

# Perform analysis for each abstract and store results
results = []

for index, row in data.iterrows():
    abstract = row['Processed_Abstract']

    # Skip empty abstracts
    if pd.isna(abstract) or abstract.strip() == "":
        continue

    # POS Analysis
    tagged, pos_counts = pos_analysis(abstract)

    # Dependency Parsing
    dependency_parse = dependency_parsing(abstract)

    # Constituency Parsing
    constituency_parse = constituency_parsing(abstract)

    # NER Analysis
    ner_counts, ner_entities = ner_analysis(abstract)

    results.append({
        'Abstract': abstract,
        'POS Tagged': tagged,
        'POS Counts': pos_counts,
        'Dependency Parsing': dependency_parse,
        'Constituency Parsing': constituency_parse,
        'Named Entities': ner_entities,
        'NER Counts': ner_counts
    })

# Save the results to a new CSV file
results_df = pd.DataFrame(results)
results_df.to_csv("analyzed_information_extraction_papers.csv", index=False)
print("Data is saved in analyzed_information_extraction_papers.csv")

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


Data is saved in analyzed_information_extraction_papers.csv
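
The cell above stores POS and entity counts per abstract, while the question asks for totals. A minimal sketch of summing the per-abstract counts is shown below. The helper name `total_counts` and the sample numbers are illustrative assumptions, not values from the dataset.

```python
import ast
from collections import Counter


def total_counts(per_document_counts):
    """Sum per-document count mappings into one overall tally."""
    total = Counter()
    for counts in per_document_counts:
        # Values read back from a CSV arrive as strings such as "{'Nouns': 3}".
        if isinstance(counts, str):
            counts = ast.literal_eval(counts)
        total.update(counts)
    return dict(total)


# Illustrative numbers only, not taken from the real dataset.
sample = [{"Nouns": 3, "Verbs": 1}, "{'Nouns': 2, 'Adverbs': 4}"]
print(total_counts(sample))
```

With the in-memory `results` list from the cell above, this could be called as `total_counts(r['POS Counts'] for r in results)` or `total_counts(r['NER Counts'] for r in results)`. The `NER Counts` values are `Counter` objects, which are written to CSV in a `Counter({...})` form that `ast.literal_eval` cannot parse, so aggregate them before saving or convert them with `dict(...)` first.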


# **Comment**
Make sure to submit the cleaned data CSV in the comment section - 10 points

In [None]:
# The file will be available at this path once the whole notebook has been run: /content/analyzed_information_extraction_papers.csv
import pandas as pd
file_path = "/content/analyzed_information_extraction_papers.csv"
df = pd.read_csv(file_path)
print(df.head())


                                            Abstract  \
0  zeroshot inform extract ie aim build ie system...   
1  capabl larg languag model llm like chatgpt com...   
2  inform extract suffer vari target heterogen st...   
3  larg languag model unlock strong multitask cap...   
4  humanlik larg languag model llm especi power p...   

                                          POS Tagged  \
0  [('zeroshot', 'JJ'), ('inform', 'NN'), ('extra...   
1  [('capabl', 'NN'), ('larg', 'NN'), ('languag',...   
2  [('inform', 'NN'), ('extract', 'NN'), ('suffer...   
3  [('larg', 'NN'), ('languag', 'NN'), ('model', ...   
4  [('humanlik', 'NN'), ('larg', 'NN'), ('languag...   

                                          POS Counts  \
0  {'Nouns': 66, 'Verbs': 14, 'Adjectives': 25, '...   
1  {'Nouns': 65, 'Verbs': 9, 'Adjectives': 25, 'A...   
2  {'Nouns': 63, 'Verbs': 4, 'Adjectives': 23, 'A...   
3  {'Nouns': 60, 'Verbs': 7, 'Adjectives': 15, 'A...   
4  {'Nouns': 67, 'Verbs': 7, 'Adjectives': 27,

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

The hardest part of this project was working with a large dataset, especially handling rate limits while collecting data from an API. During the Semantic Scholar collection, rate limits, retries, and offsets all had to be managed carefully so that the data was gathered without errors.
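
The retry handling mentioned above can be sketched as a small helper that waits longer after each rate-limited attempt. This is a minimal sketch, not the code used in Question 1: the name `get_with_backoff`, its parameters, and the doubling delay schedule are assumptions chosen for illustration.

```python
import time


def get_with_backoff(send, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a status other than 429 or retries run out.

    send is a zero-argument callable returning an object with a .status_code
    attribute, for example a lambda wrapping requests.get(...).
    The delay doubles after each rate-limited attempt: 1s, 2s, 4s, ...
    """
    for attempt in range(max_retries + 1):
        response = send()
        if response.status_code != 429:
            return response
        if attempt < max_retries:
            sleep(base_delay * 2 ** attempt)
    # Still rate-limited after all retries; let the caller decide what to do.
    return response
```

Inside the collection loop, this could replace the single manual retry, for example `response = get_with_backoff(lambda: requests.get(endpoint_url, headers=request_headers, params=request_params))`.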

Applying several Natural Language Processing (NLP) techniques, such as tokenization, noise removal, and stemming, made the text-cleaning phase enjoyable. It gave me a better understanding of how raw text can be turned into data that is usable for analysis. The syntactic analysis with POS tagging and dependency parsing was also interesting because it showed how sentence structure can be analyzed programmatically.

Although the time allotted for the assignment seems sufficient, extracting data efficiently from an API can occasionally cause delays. Given this, a little extra time would have helped me evaluate the data cleaning and analysis steps more thoroughly.
