<a href="https://colab.research.google.com/github/wrobbins0409/cse30124-project/blob/main/introToAI_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Intro to AI Project - Classical Sentiment Analysis vs Machine Learning ###



In this project, I fine-tuned an existing pretrained language model for sentiment analysis, both to learn the fine-tuning process and to test on a labeled dataset whether traditional (lexicon-based) sentiment analysis or machine-learning-based sentiment analysis works better.

I started from the distilbert-base-uncased checkpoint, a general-purpose pretrained model intended to be fine-tuned on a variety of downstream language tasks. I fine-tuned it on the sst2 portion of the glue dataset, which contains sentences labeled 0 (negative sentiment) or 1 (positive sentiment). Here is a link to the dataset: [glue/sst2](https://huggingface.co/datasets/glue/viewer/sst2)



The Python script used to train the model is included in the repo. This notebook demonstrates the trained model and compares it with a traditional method of sentiment analysis.

In [76]:
# imports and installations
# !pip install datasets
# !pip install transformers
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
import torch
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd
from tqdm import tqdm

We can start by loading the model into a pipeline using transformers, which makes it easy to plug and play with any public model on HuggingFace.

In [77]:

# load model into pipeline
pipe = pipeline("sentiment-analysis", model="wrobbins0409/distilbert-base-uncased-finetuned-sst2-wrobbins", truncation=True, max_length=512)

# test pipeline on negative sentence that should return a negative label
pipe("I hate everything!")

[{'label': 'negative', 'score': 0.9999563694000244}]

The pipeline returns a list containing a dict with the predicted label for the text and the score associated with that label. The score tends to be extreme, sitting very close to 1 for whichever label is predicted.
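
To compare predictions against the dataset's 0/1 labels, the label string has to be turned into an integer. Here is a minimal sketch of one way to do that. The helper name `label_to_int` is my own and is not part of transformers, and the sample input is copied by hand from the output above rather than recomputed. Lower-casing the label is there because some checkpoints report `NEGATIVE` instead of `negative`.

```python
def label_to_int(pipeline_output):
    """Map the first result of a sentiment pipeline to 0 (negative) or 1 (positive)."""
    # Normalize case so 'negative' and 'NEGATIVE' are handled the same way
    label = pipeline_output[0]["label"].lower()
    if label not in ("negative", "positive"):
        raise ValueError(f"unexpected label: {label!r}")
    return 0 if label == "negative" else 1


# Output shown above, copied by hand (hypothetical stand-in for a live pipeline call)
sample = [{"label": "negative", "score": 0.9999563694000244}]
print(label_to_int(sample))  # 0
```

Raising on an unknown label is a deliberate choice in this sketch: a model that uses names such as `LABEL_0` would otherwise be silently counted as positive.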

### Test the model on a dataset ###

Let's see how this model does at predicting labels on a two-label dataset (positive or negative) and compare its accuracy to that of the VADER method for text classification. We will test on the validation split of the glue sst2 data because the labels of the test split are not released, so accuracy cannot be computed on it.

In [78]:
# load only the validation split of the dataset
dataset = load_dataset("glue", "sst2", split="validation")

# look at the contents of the dataset
print(dataset)

# map label to name
labels = {
    0: 'negative',
    1: 'positive'
}


# show examples of the data
for i, entry in enumerate(dataset):

    # stop after ten examples
    if i == 10:
        break

    # print out the text data and the corresponding label (limiting text to 200 chars for readability)
    print(f'Sentence: {entry["sentence"][:200]}, Label: {labels[entry["label"]]}')

Dataset({
    features: ['sentence', 'label', 'idx'],
    num_rows: 872
})
Sentence: it 's a charming and often affecting journey . , Label: positive
Sentence: unflinchingly bleak and desperate , Label: negative
Sentence: allows us to hope that nolan is poised to embark a major career as a commercial yet inventive filmmaker . , Label: positive
Sentence: the acting , costumes , music , cinematography and sound are all astounding given the production 's austere locales . , Label: positive
Sentence: it 's slow -- very , very slow . , Label: negative
Sentence: although laced with humor and a few fanciful touches , the film is a refreshingly serious look at young women . , Label: positive
Sentence: a sometimes tedious film . , Label: negative
Sentence: or doing last year 's taxes with your ex-wife . , Label: negative
Sentence: you do n't have to know about music to appreciate the film 's easygoing blend of comedy and romance . , Label: positive
Sentence: in exactly 89 minutes , most of whic

Next we need to set up the VADER model for sentiment analysis. VADER assigns a polarity score to each word in a given text using a fixed lexicon, then combines these individual scores (with a few hand-written rules) into an overall sentiment score for the entire text. It does not use any machine learning, so it will be interesting to see whether this older lexical approach can still keep up with the machine learning approach.
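
To make the lexicon idea concrete, here is a toy sketch of that kind of scoring. It is not VADER itself: the word values in `TOY_LEXICON` are invented for illustration, and none of VADER's extra rules (negation, intensifiers, punctuation, capitalization) are included. The squashing step follows the form `total / sqrt(total^2 + alpha)` with `alpha = 15`, which mirrors the normalization in VADER's reference implementation as far as I am aware.

```python
import math

# Hypothetical mini-lexicon: these valences are made up for this example
# and are NOT taken from the real VADER lexicon.
TOY_LEXICON = {"love": 3.0, "great": 3.0, "hate": -3.0, "slow": -1.0}


def toy_compound(text, alpha=15.0):
    """Sum per-word valences, then squash the total into the range (-1, 1)."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    # Words missing from the lexicon contribute nothing (treated as neutral)
    total = sum(TOY_LEXICON.get(w, 0.0) for w in words)
    return total / math.sqrt(total * total + alpha)


for sentence in ["I hate everything!", "a great film", "the film"]:
    print(sentence, "->", round(toy_compound(sentence), 4))
```

A sentence with no lexicon words scores exactly 0, which is why the choice of where to put the positive/negative cutoff matters for a two-label task.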

In [79]:
# Download the VADER lexicon
nltk.download('vader_lexicon')

# Load the VADER sentiment analyzer
sia = SentimentIntensityAnalyzer()

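# define function to return a binary sentiment label based on VADER's compound score
# (1 = positive, 0 = negative), or the full score dict when return_scores is True;
# note that a compound score of exactly 0 (neutral) is counted as positive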
def calculate_vader_sentiment(text, return_scores = False):
    scores = sia.polarity_scores(text)
    compound_score = scores['compound']
    if return_scores:
        return scores
    else:
        return 1 if compound_score >= 0 else 0

# define sentence
sentence = "I hate everything!"

# test function with compound and total scores, negative sentiment = 0, positive = 1
print(f'Sentence: {sentence}, Scores: {calculate_vader_sentiment(sentence, return_scores = True)}, Sentiment: {labels[calculate_vader_sentiment(sentence)]}')

Sentence: I hate everything!, Scores: {'neg': 0.8, 'neu': 0.2, 'pos': 0.0, 'compound': -0.6114}, Sentiment: negative


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


Now that we are sure that both of our models are working, we need to test them on the dataset to see which one performs better.

In [81]:
# Calculate the accuracy of VADER and of the fine-tuned model on the validation set
correct_vader_predictions = 0
correct_mymodel_predictions = 0
examples = 0

for example in tqdm(dataset):
    examples += 1
    text = example['sentence']
    label = int(example['label'])

    # Calculate VADER sentiment
    vader_sentiment = calculate_vader_sentiment(text)
    if vader_sentiment == label:
        correct_vader_predictions += 1

    # Calculate Mymodel prediction
    mymodel_pipeline_result = pipe(text)
    if mymodel_pipeline_result[0]['label'] == 'negative':
        mymodel_prediction = 0
    else:
        mymodel_prediction = 1

    if mymodel_prediction == label:
        correct_mymodel_predictions += 1

accuracy_vader = float(correct_vader_predictions / examples)
accuracy_mymodel = float(correct_mymodel_predictions / examples)

print(f'\nvader accuracy: {accuracy_vader}, my models accuracy: {accuracy_mymodel}')

100%|██████████| 872/872 [00:58<00:00, 14.80it/s]


vader accuracy: 0.625, my models accuracy: 0.9059633027522935
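
The counting loop above could also be factored into a small reusable helper, so that each model is scored by the same code. The following is only a sketch of that idea: the function name `accuracy`, the toy examples, and the stand-in predictor are all hypothetical and are not real SST-2 rows or real model outputs.

```python
def accuracy(predict, examples):
    """Fraction of examples whose predicted label matches the gold label.

    `predict` is any function mapping a sentence string to 0 or 1.
    """
    examples = list(examples)
    if not examples:
        raise ValueError("no examples to score")
    correct = sum(predict(ex["sentence"]) == int(ex["label"]) for ex in examples)
    return correct / len(examples)


# Tiny hand-written stand-in for the dataset (hypothetical rows)
toy_examples = [
    {"sentence": "good", "label": 1},
    {"sentence": "bad", "label": 0},
    {"sentence": "fine", "label": 1},
    {"sentence": "awful", "label": 0},
]


def always_positive(text):
    """Stand-in predictor that labels every sentence as positive."""
    return 1


print(accuracy(always_positive, toy_examples))  # 0.5
```

In this notebook, `calculate_vader_sentiment` already has the right shape to be passed as `predict`, while a pipeline would first need a small wrapper that converts its label string to 0 or 1.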





It's apparent from this test that my fine-tuned model is much better at correctly classifying this text: it achieves an accuracy of 90.6%, while VADER achieves only 62.5%. However, I am curious to see how my model compares with another sentiment analysis model fine-tuned on similar data from the same base model, distilbert-base-uncased. For this test I will compare my model with the HuggingFace model distilbert-base-uncased-finetuned-sst-2-english, which was also fine-tuned on sst2.

In [83]:
# load pipeline for new model
hf_pipe = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english", truncation=True, max_length=512)

# test model with sample input
hf_pipe(sentence)

[{'label': 'NEGATIVE', 'score': 0.9992390871047974}]

Now that it appears to be working, let's see how my model matches up against this new one.

In [86]:
# Calculate accuracy
correct_hf_predictions = 0
correct_mymodel_predictions = 0
examples = 0

for example in tqdm(dataset):
    examples += 1
    text = example['sentence']
    label = int(example['label'])

    # Calculate HF sentiment
    hf_pipeline_result = hf_pipe(text)
    if hf_pipeline_result[0]['label'] == 'NEGATIVE':
        hf_prediction = 0
    else:
        hf_prediction = 1

    if hf_prediction == label:
        correct_hf_predictions += 1

    # Calculate Mymodel prediction
    mymodel_pipeline_result = pipe(text)
    if mymodel_pipeline_result[0]['label'] == 'negative':
        mymodel_prediction = 0
    else:
        mymodel_prediction = 1

    if mymodel_prediction == label:
        correct_mymodel_predictions += 1

accuracy_hf = float(correct_hf_predictions / examples)
accuracy_mymodel = float(correct_mymodel_predictions / examples)

print(f'\nHugging Face accuracy: {accuracy_hf}, my models accuracy: {accuracy_mymodel}')

100%|██████████| 872/872 [01:59<00:00,  7.29it/s]


Hugging Face accuracy: 0.9105504587155964, my models accuracy: 0.9059633027522935





With my model achieving 90.6% accuracy, I would say that it matched up quite well against the HuggingFace model, which achieved 91.1% (a difference of 4 of the 872 validation sentences). The small remaining gap could have several causes, such as differences in learning rate or batch size. I was limited to a batch size of 8 because anything larger caused my computer to blue screen. Perhaps in the future, with more capable hardware, I could use larger batch sizes and see whether that improves accuracy.
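
One rough way to judge how much a gap this small means is to estimate the uncertainty of an accuracy measured on only 872 examples. The sketch below uses a simple normal-approximation interval. The helper `accuracy_interval` is my own, and the correct-prediction counts are recovered from the accuracies printed above rather than recounted. Both models were scored on the same sentences, so a paired test would be the more appropriate tool; treat this as a sanity check only.

```python
import math


def accuracy_interval(correct, total, z=1.96):
    """Normal-approximation interval for an accuracy measured on `total` examples."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width


# Counts recovered from the reported accuracies (accuracy x 872 examples)
N = 872
my_correct = round(0.9059633027522935 * N)  # 790
hf_correct = round(0.9105504587155964 * N)  # 794

low, high = accuracy_interval(my_correct, N)
print(f"my model: {my_correct / N:.3f}, approx. 95% interval: [{low:.3f}, {high:.3f}]")
print(f"HF accuracy inside that interval: {low <= hf_correct / N <= high}")
```

Under these assumptions the interval extends roughly two percentage points on either side of my model's accuracy, so a gap of about half a point is within what could be expected from noise alone.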

Overall, I learned a lot from fine-tuning this model. I initially attempted to train one from scratch but quickly learned that it would require more resources and time than was reasonable, so I went the fine-tuning route instead. Still, it was quite a challenge that required a lot of research to find the best tools and datasets for the job while keeping the scope manageable for a laptop with just a mobile GPU.

The project was enjoyable, and my model is available on Hugging Face for sentiment analysis at this [link](https://huggingface.co/wrobbins0409/distilbert-base-uncased-finetuned-sst2-wrobbins).