
# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid creating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Tuesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [None]:
import requests
import pandas as pd


In [None]:
def fetch_abstracts(query, num_papers=10):
    """Fetch paper titles and abstracts from the Semantic Scholar search API."""
    base_url = "https://api.semanticscholar.org/graph/v1/paper/search"
    params = {
        'query': query,
        'limit': num_papers,
        'fields': 'title,abstract'
    }
    response = requests.get(base_url, params=params)

    print(f"API Response Status Code: {response.status_code}")
    if response.status_code != 200:
        print(f"Error fetching data: {response.status_code}")
        return []

    data = response.json()
    print(data)  # Dump the raw response for inspection
    abstracts = []

    for paper in data.get('data', []):
        # The defaults apply only when a key is absent; a null abstract in the
        # response stays None and is written to the CSV as a missing value.
        title = paper.get('title', 'No title available')
        abstract = paper.get('abstract', 'No abstract available')
        abstracts.append({'Title': title, 'Abstract': abstract})

    return abstracts


In [None]:
query = "machine learning"
num_papers = 5
abstracts = fetch_abstracts(query, num_papers)

# Displaying the number of abstracts fetched
print(f"Total abstracts fetched: {len(abstracts)}")


API Response Status Code: 200
{'total': 6035597, 'offset': 0, 'next': 5, 'data': [{'paperId': 'f9c602cc436a9ea2f9e7db48c77d924e09ce3c32', 'title': 'Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms', 'abstract': 'We present Fashion-MNIST, a new dataset comprising of 28x28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per category. The training set has 60,000 images and the test set has 10,000 images. Fashion-MNIST is intended to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms, as it shares the same image size, data format and the structure of training and testing splits. The dataset is freely available at this https URL'}, {'paperId': '4954fa180728932959997a4768411ff9136aac81', 'title': 'TensorFlow: A system for large-scale machine learning', 'abstract': 'TensorFlow is a machine learning system that operates at large scale and in heterogeneous environmen
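
The response above reports `total`, `offset`, and `next`, which suggests the endpoint returns results one page at a time. A minimal sketch of collecting several pages follows. It assumes a caller-supplied `fetch_page(offset, limit)` function (hypothetical, not part of this notebook) that returns the parsed JSON of one request; in the notebook it could wrap `requests.get` with `offset` and `limit` added to `params`, assuming the endpoint accepts an `offset` parameter. Rate limiting and any cap the API places on reachable offsets are not handled here.

```python
def collect_papers(fetch_page, target, page_size=100):
    """Gather up to `target` papers by following each page's `next` offset."""
    papers = []
    offset = 0
    while len(papers) < target:
        # Never request more than is still needed
        limit = min(page_size, target - len(papers))
        page = fetch_page(offset, limit)
        batch = page.get("data", [])
        if not batch:
            break  # the service returned no more results
        papers.extend(batch)
        if "next" not in page:
            break  # last page reached
        offset = page["next"]
    return papers


# Stand-in for the real API call so the sketch runs offline (hypothetical data)
def fake_fetch_page(offset, limit, total=12):
    end = min(offset + limit, total)
    page = {
        "total": total,
        "offset": offset,
        "data": [{"title": f"Paper {i}", "abstract": None} for i in range(offset, end)],
    }
    if end < total:
        page["next"] = end
    return page


print(len(collect_papers(fake_fetch_page, target=10, page_size=4)))
```

Passing the fetcher in as an argument keeps the paging logic separate from the HTTP call, so it can be checked without network access.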

In [None]:
if abstracts:
    df = pd.DataFrame(abstracts)
    df.to_csv('research_abstracts.csv', index=False)
    print("Data saved to research_abstracts.csv")
else:
    print("No abstracts found.")


Data saved to research_abstracts.csv


In [None]:
from google.colab import files
files.download('research_abstracts.csv')



# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [None]:
# Write code for each of the sub parts with proper comments.
!pip install nltk




In [None]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import re

In [None]:
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
df = pd.read_csv('research_abstracts.csv')
print(df.head())


                                               Title  \
0  Fashion-MNIST: a Novel Image Dataset for Bench...   
1  TensorFlow: A system for large-scale machine l...   
2  TensorFlow: Large-Scale Machine Learning on He...   
3  Stop explaining black box machine learning mod...   
4  Convolutional LSTM Network: A Machine Learning...   

                                            Abstract  
0  We present Fashion-MNIST, a new dataset compri...  
1  TensorFlow is a machine learning system that o...  
2  TensorFlow is an interface for expressing mach...  
3                                                NaN  
4  The goal of precipitation nowcasting is to pre...  


In [None]:
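# Build the reusable NLTK resources once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()
LEMMATIZER = WordNetLemmatizer()

def clean_text(text):
    # Missing abstracts become empty strings
    if pd.isna(text):
        return ''

    # (1) Remove noise: special characters and punctuation
    # (this pattern keeps only letters and whitespace, so it also strips digits)
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # (2) Remove numbers (kept as an explicit step; step 1 has already removed them)
    text = re.sub(r'\d+', '', text)

    # (3) Remove stopwords, comparing in lowercase
    text = ' '.join([word for word in text.split() if word.lower() not in STOP_WORDS])

    # (4) Lowercase all text
    text = text.lower()

    # (5) Stemming
    text = ' '.join([STEMMER.stem(word) for word in text.split()])

    # (6) Lemmatization (applied to the already stemmed tokens)
    text = ' '.join([LEMMATIZER.lemmatize(word) for word in text.split()])

    return text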


In [None]:
df['Cleaned_Abstract'] = df['Abstract'].apply(clean_text)
print(df[['Abstract', 'Cleaned_Abstract']].head())


                                            Abstract  \
0  We present Fashion-MNIST, a new dataset compri...   
1  TensorFlow is a machine learning system that o...   
2  TensorFlow is an interface for expressing mach...   
3                                                NaN   
4  The goal of precipitation nowcasting is to pre...   

                                    Cleaned_Abstract  
0  present fashionmnist new dataset compris x gra...  
1  tensorflow machin learn system oper larg scale...  
2  tensorflow interfac express machin learn algor...  
3                                                     
4  goal precipit nowcast predict futur rainfal in...  
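
The cleaned output above contains `fashionmnist`, because deleting the hyphen in `Fashion-MNIST` fuses the two words. The sketch below contrasts deleting non-letter characters with replacing them by a space, using only `re`. The helper names and the sample sentence are illustrative, and which behaviour is preferable depends on the downstream analysis.

```python
import re

def strip_by_deleting(text):
    # Same pattern as clean_text: drop everything that is not a letter or whitespace
    return re.sub(r'[^a-zA-Z\s]', '', text)

def strip_by_spacing(text):
    # Replace each run of non-letter characters with one space, then trim the ends
    return re.sub(r'[^a-zA-Z]+', ' ', text).strip()

sample = "Fashion-MNIST has 28x28 images."
print(strip_by_deleting(sample))  # FashionMNIST has x images
print(strip_by_spacing(sample))   # Fashion MNIST has x images
```

Either way the `x` from `28x28` survives as a stray token; removing it would need an extra rule, such as dropping single-letter words.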


In [None]:
df.to_csv('cleaned_research_abstracts.csv', index=False)
print("Cleaned data saved to cleaned_research_abstracts.csv")


Cleaned data saved to cleaned_research_abstracts.csv


In [None]:
from google.colab import files
files.download('cleaned_research_abstracts.csv')



# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity type.

In [None]:
# Your code here
!pip install spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 61.5 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
import pandas as pd
import spacy
from collections import Counter

In [None]:
df = pd.read_csv('cleaned_research_abstracts.csv')
print(df.head())


                                               Title  \
0  Fashion-MNIST: a Novel Image Dataset for Bench...   
1  TensorFlow: A system for large-scale machine l...   
2  TensorFlow: Large-Scale Machine Learning on He...   
3  Stop explaining black box machine learning mod...   
4  Convolutional LSTM Network: A Machine Learning...   

                                            Abstract  \
0  We present Fashion-MNIST, a new dataset compri...   
1  TensorFlow is a machine learning system that o...   
2  TensorFlow is an interface for expressing mach...   
3                                                NaN   
4  The goal of precipitation nowcasting is to pre...   

                                    Cleaned_Abstract  
0  present fashionmnist new dataset compris x gra...  
1  tensorflow machin learn system oper larg scale...  
2  tensorflow interfac express machin learn algor...  
3                                                NaN  
4  goal precipit nowcast predict futur rainfal in..

In [None]:
nlp = spacy.load("en_core_web_sm")


In [None]:
# Function that performs POS tagging and counts nouns, verbs, adjectives, and adverbs
def pos_tagging(text):
    doc = nlp(text)
    pos_count = {'Noun': 0, 'Verb': 0, 'Adjective': 0, 'Adverb': 0}

    for token in doc:
        if token.pos_ == 'NOUN':
            pos_count['Noun'] += 1
        elif token.pos_ == 'VERB':
            pos_count['Verb'] += 1
        elif token.pos_ == 'ADJ':
            pos_count['Adjective'] += 1
        elif token.pos_ == 'ADV':
            pos_count['Adverb'] += 1

    return pos_count


first_abstract_pos = pos_tagging(df['Cleaned_Abstract'].iloc[0])
print("Parts of Speech Counts for the First Abstract:", first_abstract_pos)


Parts of Speech Counts for the First Abstract: {'Noun': 13, 'Verb': 7, 'Adjective': 7, 'Adverb': 0}
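
The call above counts tags for the first abstract only, while the question asks for totals. One way to aggregate over the whole column is sketched below. To keep the example self-contained, the tagger is passed in as `tag_fn`, a hypothetical callable that returns one coarse POS tag per token; in this notebook it could be `lambda t: [tok.pos_ for tok in nlp(t)]`. Rows that are missing (read back from the CSV as NaN) or empty are skipped.

```python
from collections import Counter

# Map coarse POS tags to the labels used in the per-abstract counts
TRACKED = {"NOUN": "Noun", "VERB": "Verb", "ADJ": "Adjective", "ADV": "Adverb"}

def total_pos_counts(texts, tag_fn):
    """Sum noun/verb/adjective/adverb counts over an iterable of texts."""
    totals = Counter({name: 0 for name in TRACKED.values()})
    for text in texts:
        if not isinstance(text, str) or not text.strip():
            continue  # skip NaN or empty rows
        for tag in tag_fn(text):
            if tag in TRACKED:
                totals[TRACKED[tag]] += 1
    return dict(totals)

# Toy tagger for demonstration only: looks words up in a tiny hand-made table
TOY_TAGS = {"model": "NOUN", "learn": "VERB", "new": "ADJ", "dataset": "NOUN"}

def toy_tagger(text):
    return [TOY_TAGS.get(word, "X") for word in text.split()]

print(total_pos_counts(["new dataset", float("nan"), "model learn fast"], toy_tagger))
```

The same pattern (one `Counter` updated across all rows) would also work for entity labels instead of POS tags.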


In [None]:
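# Function that prints sentence segmentation and dependency arcs.
# Note: the spaCy pipeline loaded above provides a dependency parser only, so the
# first block prints each sentence's text rather than a constituency tree.
# Both loops cover every sentence and token in the text, not just the first sentence.
def parsing(text):
    doc = nlp(text)

    print("Constituency Parsing Tree (for the first sentence):")
    for sent in doc.sents:
        print(sent.text)

    # Each line shows: token --> dependency relation --> head token
    print("\nDependency Parsing Tree (for the first sentence):")
    for token in doc:
        print(f"{token.text} --> {token.dep_} --> {token.head.text}")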

parsing(df['Cleaned_Abstract'].iloc[0])


Constituency Parsing Tree (for the first sentence):
present fashionmnist new dataset compris x grayscal imag fashion product categori imag per categori train set imag test set imag fashionmnist intend serv direct dropin replac origin mnist dataset benchmark machin learn algorithm share imag size data format structur train test split dataset freeli avail http url

Dependency Parsing Tree (for the first sentence):
present --> amod --> product
fashionmnist --> amod --> product
new --> amod --> dataset
dataset --> compound --> product
compris --> nmod --> product
x --> punct --> product
grayscal --> amod --> imag
imag --> amod --> product
fashion --> compound --> product
product --> nsubj --> set
categori --> aux --> set
imag --> acl --> categori
per --> prep --> imag
categori --> compound --> train
train --> nsubj --> set
set --> ROOT --> set
imag --> amod --> test
test --> dobj --> set
set --> dep --> set
imag --> amod --> fashionmnist
fashionmnist --> nsubj --> intend
intend --> conj --
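
The output above lists dependency arcs, but the `en_core_web_sm` pipeline does not appear to include a constituency parser, so no phrase-structure tree is shown. To illustrate what such a tree looks like, the sketch below runs a small CKY parser over a toy grammar. The grammar, the example sentence, and the function name `cky_parse` are made up for this illustration; a real analysis would rely on a trained constituency parser instead.

```python
def cky_parse(tokens, lexical, binary, start="S"):
    """Return a bracketed constituency tree for `tokens`, or None if no parse exists.

    lexical: maps a word to the labels that can produce it, e.g. {"the": ["DT"]}
    binary:  maps a (left, right) label pair to its parent, e.g. {("DT", "NN"): "NP"}
    """
    n = len(tokens)
    # chart[i][j] maps a label to one bracketed subtree covering tokens[i:j]
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, tok in enumerate(tokens):
        for label in lexical.get(tok, []):
            chart[i][i + 1][label] = f"({label} {tok})"
    # Combine adjacent spans, shortest spans first
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (left, right), parent in binary.items():
                    if left in chart[i][k] and right in chart[k][j]:
                        subtree = f"({parent} {chart[i][k][left]} {chart[k][j][right]})"
                        chart[i][j].setdefault(parent, subtree)
    return chart[0][n].get(start)


# Toy grammar in Chomsky normal form (hypothetical, for illustration only)
LEXICAL = {"the": ["DT"], "model": ["NN"], "learns": ["VBZ"], "patterns": ["NP"]}
BINARY = {("NP", "VP"): "S", ("DT", "NN"): "NP", ("VBZ", "NP"): "VP"}

print(cky_parse("the model learns patterns".split(), LEXICAL, BINARY))
# (S (NP (DT the) (NN model)) (VP (VBZ learns) (NP patterns)))
```

In the constituency tree, words are grouped into nested phrases (a noun phrase and a verb phrase under the sentence node), whereas the dependency output links each word directly to a head word with a labelled relation.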

In [None]:
# Function that performs named entity recognition and counts entities by label
def named_entity_recognition(text):
    doc = nlp(text)
    entities_count = Counter()

    for ent in doc.ents:
        entities_count[ent.label_] += 1

    return entities_count

first_abstract_ner = named_entity_recognition(df['Cleaned_Abstract'].iloc[0])
print("Named Entity Recognition Counts for the First Abstract:", first_abstract_ner)


Named Entity Recognition Counts for the First Abstract: Counter({'PERSON': 1, 'NORP': 1})


# **Comment**
Make sure to submit the cleaned data CSV in the comment section - 10 points

In [None]:
# Cleaned Data CSV Submission

# The cleaned data CSV file has been saved as 'cleaned_research_abstracts.csv'.
# You can load the CSV file using the following code:
#
# import pandas as pd
# df = pd.read_csv('cleaned_research_abstracts.csv')


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [None]:
# Write your response below
# The questions seem to be getting more complicated.