![NVIDIA Logo](images/nvidia.png)

# Sentiment Analysis

In this notebook you will begin work on a sentiment analysis task using a dataset of Amazon reviews, establishing a zero-shot baseline with two GPT models.

---

## Learning Objectives

By the time you complete this notebook you will:
- Be familiar with the Amazon reviews dataset.
- Have observed zero-shot performance for sentiment analysis on the reviews using GPT43B and GPT8B.

---

## Imports

In [None]:
import json

from llm_utils.nemo_service_models import NemoServiceBaseModel
from llm_utils.models import Models

---

## List Models

In [None]:
Models.list_models()

---

## Amazon Review Data

For the sentiment analysis task, we will be working with a public dataset of Amazon customer reviews. The raw reviews file has been provided for you at `data/reviews.txt`. It contains 400,000 reviews.

In [None]:
!wc -l data/reviews.txt

If we look at the first few samples, we can see that each begins with either `__label__2`, which indicates a positive sentiment, or `__label__1`, which indicates a negative sentiment.

In [None]:
!head -3 data/reviews.txt
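To make the label format concrete, here is a minimal sketch that splits one line into its sentiment and review text. The helper name `parse_line` is hypothetical, and the sample line is made up for illustration rather than taken from `data/reviews.txt`.

```python
# Hypothetical helper, not part of the course utilities.
def parse_line(line):
    """Split a '__label__N review text' line into (sentiment, review)."""
    # Split only on the first space so the review text stays intact.
    label, review = line.strip().split(' ', 1)
    sentiment = 'positive' if label == '__label__2' else 'negative'
    return sentiment, review

# Made-up sample line in the same format as the reviews file.
sample = '__label__1 Stopped working: The charger failed after two days.\n'
print(parse_line(sample))
```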

---

## Sentiment Analysis Prompt Template

For our sentiment analysis task, we will be working with the following prompt template.

In [None]:
def sentiment_template(review):
    # Use the function's own parameter rather than relying on a global named `review`.
    return f'Is the overall sentiment of the following review "positive" or "negative"? {review} Sentiment:'

Assuming we have a review to pass into the template:

In [None]:
review = '''\
One of the best game music soundtracks - for a game I didn't really play: Despite the fact that I \
have only played a small portion of the game, the music I heard (plus the connection to Chrono Trigger \
which was great as well) led me to purchase the soundtrack, and it remains one of my favorite albums. \
There is an incredible mix of fun, epic, and emotional songs. Those sad and beautiful tracks I especially \
like, as there's not too many of those kinds of songs in my other video game soundtracks. \
I must admit that one of the songs (Life-A Distant Promise) has brought tears to my eyes on many occasions. \
My one complaint about this soundtrack is that they use guitar fretting effects in many of the songs, \
which I find distracting. But even if those weren't included I would still consider the collection worth it.\
'''

...we can generate a sentiment analysis prompt for the review.

In [None]:
print(sentiment_template(review))

## Process Prompts and Labels

For our purposes we will create a training dataset of 1500 samples, as well as a small test dataset of 20 samples.

Here we gather the first 1520 samples into a `prompts_with_labels` list which contains 2-tuples of review prompts, created using `sentiment_template`, and their labels.

In [None]:
prompts_with_labels = []

with open('data/reviews.txt', 'r', encoding='utf-8') as file:
    for i, line in enumerate(file):
        if i >= 1520:  # Stop after reading 1520 lines
            break

        label, review = line.strip().split(' ', 1)
        sentiment = 'positive' if label == '__label__2' else 'negative'
        prompts_with_labels.append((sentiment_template(review), sentiment))

In [None]:
print(prompts_with_labels[0])

Next we split the list into separate train and test lists.

In [None]:
train_prompts_with_labels = prompts_with_labels[:1500]
test_prompts_with_labels = prompts_with_labels[1500:]

In [None]:
len(train_prompts_with_labels)

In [None]:
len(test_prompts_with_labels)
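With a test set this small, it can help to check how many samples carry each label before reading too much into an accuracy figure. The following is a minimal sketch; `label_counts` is a hypothetical helper and `toy_pairs` is a made-up stand-in for the real lists.

```python
from collections import Counter

# Hypothetical helper, not part of the course utilities.
def label_counts(pairs):
    """Count how many (prompt, label) pairs carry each label."""
    return Counter(label for _, label in pairs)

# Toy stand-in for train_prompts_with_labels or test_prompts_with_labels.
toy_pairs = [
    ('prompt a', 'positive'),
    ('prompt b', 'negative'),
    ('prompt c', 'positive'),
]
print(label_counts(toy_pairs))
```

In the notebook, the same call could be made as `label_counts(test_prompts_with_labels)`.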

## Write Data to File

For use in subsequent notebooks, we will now write the train and test prompts and labels data to file.

In [None]:
with open('data/sentiment_prompts_labels_train_1500.json', 'w') as f:
    json.dump(train_prompts_with_labels, f)

In [None]:
with open('data/sentiment_prompts_labels_test_20.json', 'w') as f:
    json.dump(test_prompts_with_labels, f)
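One detail worth knowing when reading these files back: JSON has no tuple type, so each 2-tuple is stored as an array and loads as a 2-element list. The sketch below demonstrates this with a made-up pair and a temporary file, so it does not touch the real data files.

```python
import json
import os
import tempfile

# Made-up (prompt, label) pair for illustration only.
pairs = [('Is the review positive? Great product! Sentiment:', 'positive')]

# Write to a temporary location rather than the notebook's data directory.
path = os.path.join(tempfile.mkdtemp(), 'pairs.json')
with open(path, 'w') as f:
    json.dump(pairs, f)

with open(path) as f:
    loaded = json.load(f)

print(type(loaded[0]))  # each pair comes back as a list, not a tuple

# Unpacking with `for prompt, label in loaded` still works on lists;
# convert back only if tuples are specifically required.
restored = [tuple(item) for item in loaded]
```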

## Test Models on Zero-shot Prompts

Before we begin work on fine-tuning, let's establish a baseline for performance by using our zero-shot prompts with GPT43B and GPT8B.

## GPT43B

First we create an instance of the GPT43B model.

In [None]:
gpt43b = NemoServiceBaseModel(Models.gpt43b.value)

### Sanity Check

Let's try a single sentiment analysis prompt out on GPT43B.

In [None]:
prompt, label = test_prompts_with_labels[0]

In [None]:
label

In [None]:
gpt43b.generate(prompt)

Aside from some whitespace that we can strip, the response looks good so far.

### Try on Test Data

Let's try GPT43B on the full test set.

In [None]:
num_correct = 0
num_samples = len(test_prompts_with_labels)
for prompt, label in test_prompts_with_labels:
    response = gpt43b.generate(prompt).strip()
    is_correct = response == label
    if is_correct:
        num_correct += 1
    print(f'Response: {response}')
    print(f'Label: {label}')
    print(f'Is Correct: {is_correct}\n')

print(f'Number Correct: {num_correct}/{num_samples}')
print(f'Percentage Correct: {num_correct / num_samples * 100:.1f}%')

### Analysis

GPT43B seems to be well-suited out of the box for this sentiment analysis task.
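
The exact-match scoring used above can be wrapped in a small helper so the same logic is reusable with any model. This is a sketch under assumptions: `evaluate` and its `postprocess` parameter are hypothetical names, and `fake_generate` is a stub standing in for a real model call.

```python
# Hypothetical helper, not part of llm_utils.
def evaluate(generate, pairs, postprocess=str.strip):
    """Return (num_correct, num_samples) using exact-match scoring.

    `generate` is any callable that maps a prompt string to a response string.
    `postprocess` cleans up each response before it is compared to the label.
    """
    num_correct = 0
    for prompt, label in pairs:
        response = postprocess(generate(prompt))
        if response == label:
            num_correct += 1
    return num_correct, len(pairs)

# Stub in place of a real model; it always answers ' positive'.
def fake_generate(prompt):
    return ' positive'

toy_pairs = [('p1', 'positive'), ('p2', 'negative')]
num_correct, num_samples = evaluate(fake_generate, toy_pairs)
print(f'Number Correct: {num_correct}/{num_samples}')
```

In the notebook, this could be called as `evaluate(gpt43b.generate, test_prompts_with_labels)`.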

---

## GPT8B

Next we will try with GPT8B. First we create a model instance.

In [None]:
gpt8b = NemoServiceBaseModel(Models.gpt8b.value)

### Sanity Check

Let's try a single sentiment analysis prompt out on GPT8B.

In [None]:
prompt, label = test_prompts_with_labels[0]

In [None]:
label

In [None]:
gpt8b.generate(prompt)

GPT8B gave us the correct sentiment, but then kept generating long after we wanted it to stop.

### Try on Test Data

Let's try GPT8B on the full test set. We will tell the model to stop generating at the first newline character, then strip whitespace from its responses and lowercase them.

In [None]:
num_correct = 0
num_samples = len(test_prompts_with_labels)
for prompt, label in test_prompts_with_labels:
    response = gpt8b.generate(prompt, stop=['\n']).strip().lower()
    is_correct = response == label
    if is_correct:
        num_correct += 1
    print(f'Response: {response}')
    print(f'Label: {label}')
    print(f'Is Correct: {is_correct}\n')

print(f'Number Correct: {num_correct}/{num_samples}')
print(f'Percentage Correct: {num_correct / num_samples * 100:.1f}%')

### Analysis

GPT8B did pretty well on this task, although we had to rely on a fair amount of post-processing, including a stop character to keep it from generating well past the answer.

Looking at the outputs above, it missed at least a couple because it appended a period to its output, which fails the exact-match comparison against the label. It also still got the wrong sentiment on occasion.
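
Misses caused only by a trailing period could be avoided with slightly more post-processing. The following is a minimal sketch; `normalize_response` is a hypothetical helper, and it has not been run against the models here, so its effect on the score is not known.

```python
# Hypothetical helper, not part of llm_utils.
def normalize_response(response):
    """Strip whitespace, lowercase, and drop any trailing periods."""
    return response.strip().lower().rstrip('.').strip()

# Made-up raw responses for illustration.
for raw in [' Positive.\n', 'negative', 'Negative. ']:
    print(repr(normalize_response(raw)))
```

This only repairs formatting differences. A response with the wrong sentiment is still counted as wrong.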