<a href="https://colab.research.google.com/github/raian621/FinalProjectNLP/blob/main/FinalProjectNLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summary-Augmented Sentiment Analysis

## CSCE 4290 - Introduction to Natural Language Processing

- Ryan Bell
- Riwaj Mainali
- Farouq Siwoku

**Contents**
1. [Introduction](#scrollTo=NyWwimvyPOTi)
  - [1.1 Problem](#scrollTo=IzaaYyYzRAvB)
  - [1.2 Importance](#scrollTo=0aopY4JIRE6x)
  - [1.3 Dataset](#scrollTo=X7IAZFhXSmZH)
  - [1.4 Proposed Methodology](#scrollTo=efdTl80eRHmm)
  - [1.5 Project Management](#scrollTo=BblVXk1DSXbk)
2. [Implementation](#scrollTo=sY3kog91Q3az)
  - [2.1 Exploratory Data Analysis](#scrollTo=rmvhO7ncSbYr)
  - [2.2 Sentiment Analysis on Entire Passages](#scrollTo=TKJSjIK2OzMr)
    - [2.2.1 Aggregate Bag of Words](#scrollTo=opEbHPEJOMrn)
  - [2.3 Sentiment Analysis on Summaries](#scrollTo=RSm_uKq6QYiI)
    - [2.3.1 Summary Feature Engineering](#scrollTo=W_rR079PQ3Zf)
    - [2.3.2 Summarized Bag of Words](#scrollTo=laz-aTE3OfEK)
3. [Results](#scrollTo=H7TLZ7lIPUng)
4. [Conclusion](#scrollTo=MrV4glZJQ61_)



## 1. Introduction

### 1.1 Problem

The problem we want to solve, and our hypothesis:
- PERFORMANCE: Improve the performance of sentiment analysis
- HYPOTHESIS: Using a summary instead of the entire text as input to sentiment analysis may improve an ML model's accuracy.

### 1.2 Importance

- Model could be used to predict the sentiment of news stories about a stock
- Model could also be used to predict the general opinion on a product or move made by a company
- Using summaries may result in a smaller corpus, and thus bag-of-words models may consume less memory
- etc.

### 1.3 Dataset

Explain the source of the dataset, how it was compiled, the features in the dataset, what features we're planning to use, etc.

### 1.4 Proposed Methodology

Write about what models we'll use, what combination of solutions (sentiment analysis and summarization in this case) we'll employ, what features we'll use, and what kind of cross validation / train test split we'll utilize.

models:
- Summarization:
  - Some Generative Pre-trained Transformer (GPT) model(s)
  - Maybe a naive summary of just the set of the most common $N$ words?
- Sentiment Analysis:
  - Bag of Words:
    - Naive Bayes
    - Decision Tree
    - Logistic Regression
  - Word Embeddings
    - Some Bidirectional Encoder Representations from Transformer (BERT) model(s)

Summary -> Sentiment Analysis = Better Accuracy?
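
The naive summarization idea listed above (keeping only the most common $N$ words) could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the helper name `naive_summary`, the regex tokenizer, and the example sentence are all assumptions made for the sketch, and a real pipeline would more likely use `nltk.word_tokenize` with the NLTK stopword list.

```python
import re
from collections import Counter

def naive_summary(text, n=5, stop_words=frozenset()):
    """Return the n most frequent words in `text` (hypothetical helper)."""
    # Lowercase the text and keep runs of letters/apostrophes as tokens.
    tokens = [
        token
        for token in re.findall(r"[a-z']+", text.lower())
        if token not in stop_words
    ]
    # Counter.most_common sorts tokens by descending frequency.
    return [word for word, _ in Counter(tokens).most_common(n)]

# Made-up example text, for illustration only.
example = "The stock rose sharply. Analysts said the stock may keep rising."
print(naive_summary(example, n=1, stop_words={"the", "may"}))
```

Such a "summary" discards word order entirely, so it would only be meaningful as input to bag-of-words models, not to embedding-based models.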

### 1.5 Project Management


Explain how the project was organized. (The rubric is unclear about what this section requires.)

## 2. Implementation

In [1]:
# imports and dependencies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# natural language processing tools
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

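nltk.download('stopwords')
# word_tokenize also needs the Punkt tokenizer models
# (newer NLTK releases name this resource 'punkt_tab')
nltk.download('punkt')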

# metrics and train-test splitting
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score
)
from sklearn.model_selection import train_test_split
# "classic" machine learning models for classification:
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


### 2.1 Exploratory Data Analysis

Load the dataset and make some fancy graphs.

### 2.2 Sentiment Analysis on Entire Passages

In [2]:
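STOP_WORDS = set(stopwords.words('english'))  # build once; set lookups are fast

def stem_tokenizer(text):
  """Tokenize, lowercase, drop English stopwords, then Porter-stem."""
  ps = PorterStemmer()
  stemmed_words = []
  for word in word_tokenize(text):
    word_lower = word.lower()
    if word_lower not in STOP_WORDS:
      stemmed_words.append(ps.stem(word_lower))
  return stemmed_words

def print_scores(clf, y_pred, y_target):
  """Print accuracy, precision, recall, and F1 for a fitted classifier."""
  # average='weighted' is an assumption; it also handles multi-class labels
  print(type(clf).__name__)
  print(f"  accuracy:  {accuracy_score(y_target, y_pred):.3f}")
  for name, metric in [("precision", precision_score), ("recall", recall_score), ("f1", f1_score)]:
    print(f"  {name}: {metric(y_target, y_pred, average='weighted'):.3f}")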

#### 2.2.1 Aggregate Bag of Words

### 2.3 Sentiment Analysis on Summaries

Generate a bag of words representative of the entire corpus:
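
A bag-of-words representation of a corpus could be built along the lines below. This is a dependency-free sketch meant only to show the idea: the helper names `build_vocabulary` and `bag_of_words`, the whitespace tokenization, and the toy corpus are assumptions for illustration. In a scikit-learn pipeline, `CountVectorizer` plays this role and accepts a custom `tokenizer` callable such as `stem_tokenizer`.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each distinct token in the corpus to a column index."""
    # Sorting makes the column order deterministic across runs.
    tokens = sorted({token for doc in documents for token in doc.lower().split()})
    return {token: index for index, token in enumerate(tokens)}

def bag_of_words(document, vocabulary):
    """Return a count vector for one document; unknown tokens are ignored."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for token, count in counts.items():
        if token in vocabulary:
            vector[vocabulary[token]] = count
    return vector

# Toy corpus, made up for illustration only.
corpus = ["good product good price", "bad product"]
vocabulary = build_vocabulary(corpus)
matrix = [bag_of_words(doc, vocabulary) for doc in corpus]
print(vocabulary)
print(matrix)
```

Each row of `matrix` has one column per vocabulary entry, which is why a corpus with a smaller vocabulary yields narrower feature vectors.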

#### 2.3.1 Summary Feature Engineering

Generate a summary for each passage of text; together, the summaries form a "summarized" corpus.

#### 2.3.2 Summarized Bag of Words

In [3]:
%pip install --no-cache-dir transformers sentencepiece torch tqdm

Import Libraries

In [4]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from tqdm.auto import tqdm

Summary Generation Function

In [5]:
def generate_summary(text, tokenizer, model):
    inputs = tokenizer(
        text,
        return_tensors="pt",
        max_length=1024,
        truncation=True,
        padding="max_length",
    )
    summary_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=150,
        length_penalty=2.0,
        num_beams=4,
        early_stopping=True,
    )
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

Main Function for Processing CSV File

In [6]:
def main(csv_file_path):
    # Load the CSV file (head(2) keeps only the first two rows)
    data = pd.read_csv(csv_file_path).head(2)

    # Load tokenizer and model
    model_ckpt = "google/pegasus-cnn_dailymail"
    tokenizer = AutoTokenizer.from_pretrained(model_ckpt, use_fast=False)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt)

    # Generate summaries
    tqdm.pandas()
    data["summary"] = data["intro"].progress_apply(
        lambda x: generate_summary(x, tokenizer, model)
    )

    # Save the updated DataFrame to a new CSV file
    output_file_path = csv_file_path.replace(".csv", "_with_summaries.csv")
    data.to_csv(output_file_path, index=False)

    print(f"Updated DataFrame saved to {output_file_path}")
    # Return the path so later cells can load the summarized data
    return output_file_path


In [7]:
# "dataset.csv" is a placeholder: the dataset location is not given here,
# so set csv_file_path to the actual CSV path before running this cell.
csv_file_path = "dataset.csv"
output_file_path = main(csv_file_path)
summarized_data = pd.read_csv(output_file_path)

Analyzing the Summaries

In [None]:
# Calculate the length of the original texts and their summaries
summarized_data['original_length'] = summarized_data['intro'].apply(len)
summarized_data['summary_length'] = summarized_data['summary'].apply(len)

# Display the data with new columns
display(summarized_data[['original_length', 'summary_length']])


Visualization

In [None]:
plt.figure(figsize=(10, 6))
plt.bar(summarized_data.index, summarized_data['original_length'], label='Original Length')
plt.bar(summarized_data.index, summarized_data['summary_length'], label='Summary Length', alpha=0.7)
plt.ylabel('Length of Text')
plt.title('Comparison of Original Text Length to Summary Length')
plt.legend()
plt.show()

## 3. Results

Show the final metrics for each model / combination of models
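
As a reference for how the reported metrics relate to each other, the sketch below computes accuracy, precision, recall, and F1 by hand for a binary labeling. It assumes binary sentiment labels with `1` as the positive class, which may not match the final dataset; the function name `binary_metrics` and the example labels are made up for illustration. In the notebook itself, the `sklearn.metrics` functions imported earlier would be used instead.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    pairs = list(zip(y_true, y_pred))
    # Count true positives, false positives, and false negatives.
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # Fall back to 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels, for illustration only.
scores = binary_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Running the same function on predictions from the full-text models and the summary-based models would give directly comparable rows for a results table.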

## 4. Conclusion


Conclude with a conclusion in the concluding paragraph