IMDB movie review dataset sentiment analysis


The [IMDB movie review dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) has 50k movie reviews. Each review has a binary sentiment classification label "positive" or "negative". We will use LLMs to perform movie review sentiment classification.

By the end of the assignment, you will have gained hands-on experience with:
* Zero-shot and few-shot prompting of an LLM to perform movie review sentiment analysis
* Evaluating the accuracy of the sentiment predictions



In [None]:
# This cell mounts the assignment data directory so that its files can be read.
# Before you execute the cell, add the class content directory
# "Module 6 : Deep Dive Into LLMs - V" as a shortcut under your Google Drive.
# During execution, you will be prompted to grant permissions for the drive
# mounting operation.

import os

from google.colab import drive
drive.mount('/content/drive')
assets_dir = '/content/drive/MyDrive/Module 6 : Deep Dive into LLMs - V2/Assignment and MCQs/datasets/'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Problem: Classify the sentiment of an IMDB movie review as either `positive` or `negative`

In [None]:
# Parse the CSV data file "IMDB_Dataset.csv" into a data frame.
# The columns we use are "review" and "sentiment":
# * review: The text of an IMDB movie review.
# * sentiment: Ground-truth sentiment label of the review. Has two possible
#     values: "positive" and "negative".
import pandas as pd

df_reviews = pd.read_csv(os.path.join(assets_dir, 'IMDB_Dataset.csv'))

# view the first 3 rows of the data frame.
df_reviews.head(3)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive


In [None]:
# view full text of the first review in the data frame.
df_reviews.iloc[0].review

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

In [None]:
import torch
import transformers

transformers.utils.logging.set_verbosity_error()
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [None]:
#@title Download phi-3 model and its tokenizer from hugging face

from transformers import AutoModelForCausalLM, AutoTokenizer

phi3_model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct",
                                             torch_dtype="auto",
                                             trust_remote_code=True
                                             )
phi3_tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



### Zero Shot and Few Shot Sentiment Analysis

* IMDB movie review sentiment analysis
We want to predict the sentiment of IMDB movie reviews as either "positive" or "negative". We can solve the problem with an LLM by instructing the model with zero-shot or few-shot prompts tailored for movie review sentiment analysis. Our goal is to make the model generate a response string equal to either "positive" or "negative".

* In zero-shot prompting, we instruct the model to generate text that completes the desired task, without showing it any examples. For a translation task, we can write the following prompt:
```
Translate the sentence below to French.
Sentence: Today is sunny.
Translation:
```
Generally, we want the prompt to end with an instructional cue that the model can read as "it's now my turn to complete the task". In the translation prompt above, we do this by ending the prompt with "Translation:".

* In few-shot prompting, we also instruct the model to generate text that completes the desired task, but we additionally include a few examples. These examples should ideally be representative of the real data and cover the desired model behavior. For example, if we want the model to judge whether an input is humorous, it helps to show the model a "yes" example and a "no" example. Choosing how many examples to include, and which ones, is part of prompt engineering.
```
Judge whether the sentence is funny. Answer "yes" or "no". Here are some examples.
Sentence: Why was six afraid of seven? Because seven eight nine.
Answer:yes
Sentence: I had an omelette this morning.
Answer:no
Now it's your turn.
Sentence: Did you hear about the shepherd who drove his sheep through town? He was given a ticket for making a ewe turn.
Answer:
```

* Zero shot and few shot prompts for movie review sentiment analysis should include
  * Instructions describing the task to the model.
  * A few examples when using few shot.
  * The specific movie review we want to predict sentiment for.
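
The three ingredients listed above can be assembled mechanically. The sketch below shows one possible way to do it. The helper name `build_prompt`, the `Review:`/`Sentiment:` field labels, and the demo examples are illustrative assumptions, not the assignment's required solution.

```python
def build_prompt(instruction, review, examples=()):
    """Assemble a zero-shot (no examples) or few-shot prompt.

    `examples` is a sequence of (review_text, label) pairs.
    """
    parts = [instruction, ""]
    for example_review, label in examples:
        parts.append(f"Review: {example_review}")
        parts.append(f"Sentiment: {label}")
        parts.append("")
    # End with an open "Sentiment:" field so the model knows it is its turn.
    parts.append(f"Review: {review}")
    parts.append("Sentiment:")
    return "\n".join(parts)


instruction = 'Classify the sentiment of the movie review as "positive" or "negative".'
demo_examples = [
    ("Great story and cinematography.", "positive"),
    ("A total waste of time.", "negative"),
]

zero_shot = build_prompt(instruction, "I loved every minute.")
few_shot = build_prompt(instruction, "I loved every minute.", demo_examples)
print(few_shot)
```

Keeping the examples in a list makes it easy to experiment with how many examples to include and which ones.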

### (YOUR CODE HERE) Complete the functions building zero shot and few shot prompts.
Python's string formatting (see https://realpython.com/python-string-formatting/ for details) is a convenient way to combine the instructions and examples with a specific review.
```
review = "5 stars"
formatted_prompt = "Rate this movie as {review}".format(review=review)
print(formatted_prompt)

# Output: Rate this movie as 5 stars
```
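
Two details of `str.format` can matter when building prompts, so here is a small sketch of them. The variable names `tricky_review` and `json_template` are made up for illustration. Braces inside the inserted text are left alone, while literal braces in the template itself must be doubled.

```python
review = "5 stars"

# f-string equivalent of "Rate this movie as {review}".format(review=review)
formatted_prompt = f"Rate this movie as {review}"
print(formatted_prompt)

# Braces inside the *inserted* text are copied through unchanged...
tricky_review = "The plot {sort of} works"
template = "Review: {review}\nSentiment:"
prompt = template.format(review=tricky_review)
print(prompt)

# ...but literal braces in the *template* must be doubled to survive formatting.
json_template = 'Answer as {{"sentiment": ...}} for: {review}'
json_prompt = json_template.format(review=review)
print(json_prompt)
```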

In [None]:
def zero_shot_prompt(review):
  return """Classify the sentiment of the following movie review as either "positive" or "negative".

Review: {review}.
Sentiment:""".format(review=review)

def few_shot_prompt(review):
  return """Classify the sentiment of the following movie review as either "positive" or "negative".

Review: I had the pleasure of watching the Titanic movie. What a blast! Great story and cinematography.
Sentiment:positive

Review: The Room is a total waste of time. It's bad to the point of comical.
Sentiment:negative

Review: {review}.
Sentiment:""".format(review=review)


### (YOUR CODE HERE) Complete `generation_args`

We have prepared a sentiment prediction function using the `pipeline` object from the Hugging Face `transformers` package. The model card at https://huggingface.co/microsoft/Phi-3-mini-4k-instruct shows example usage.

The `pipeline` object is initialized with the LLM and its tokenizer. It takes a text prompt as input and returns the model's text response. Under the hood, the `pipeline` tokenizes the text input, calls the LLM to generate token ids, and finally decodes those token ids back to text. For the caller, this provides an easy "text in, text out" interface to the LLM.
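
To make the "under the hood" steps concrete, the sketch below does roughly the same thing by hand. The function name `generate_text` is hypothetical. It relies on the standard `transformers` calls `tokenizer(...)`, `model.generate(...)`, and `tokenizer.decode(...)`, and it is only an approximation: the real `pipeline` also handles batching, device placement, and post-processing.

```python
def generate_text(model, tokenizer, prompt, max_new_tokens=1):
    """Hypothetical helper: text in, text out, without `pipeline`."""
    # 1. Tokenize: text -> tensor of token ids (a batch of size 1).
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    prompt_length = inputs["input_ids"].shape[1]

    # 2. Generate: the model appends up to `max_new_tokens` token ids.
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )

    # 3. Decode: keep only the newly generated ids and map them back to text.
    new_token_ids = output_ids[0][prompt_length:]
    return tokenizer.decode(new_token_ids, skip_special_tokens=True)
```

With the model and tokenizer loaded earlier in this notebook, a call might look like `generate_text(phi3_model, phi3_tokenizer, some_prompt)`.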

The `predict_sentiment` function below expects a pre-loaded LLM and tokenizer, as well as a zero-shot or few-shot prompt built for movie review sentiment analysis. It returns the model's sentiment prediction for the prompt as text.

Within the function, `generation_args` controls LLM decoding behavior. Complete the configuration of `generation_args` by replacing `None` with appropriate values if needed. You can find documentation of various generation config parameters in https://huggingface.co/docs/transformers/v4.42.0/en/main_classes/text_generation#transformers.GenerationConfig.
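
As a toy illustration of what `do_sample` and a temperature setting mean, the sketch below picks a "next token" from hand-made scores. Everything in it (the candidate tokens, the scores, and the helper names `softmax` and `pick_next_token`) is made up for illustration. It is not how `transformers` implements decoding.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_next_token(tokens, logits, do_sample=False, temperature=1.0, rng=random):
    if not do_sample:
        # Greedy decoding: always take the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return tokens[best]
    # Sampling: draw a token at random according to its probability.
    probs = softmax(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Made-up scores for three candidate next tokens.
tokens = ["positive", "negative", "neutral"]
logits = [2.0, 1.5, 0.1]

greedy = pick_next_token(tokens, logits)  # deterministic
sampled = pick_next_token(tokens, logits, do_sample=True, temperature=0.7,
                          rng=random.Random(0))  # depends on the seed
print(greedy, sampled)
```

For a classification task, where we want the same answer every time for the same review, greedy decoding is usually the natural choice.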

In [None]:
from transformers import pipeline

def predict_sentiment(model, tokenizer, prompt):
  pipe = pipeline(
      "text-generation",
      model=model,
      tokenizer=tokenizer,
      device=device,
  )

  generation_args = {
      "max_new_tokens": 1,        # maximum number of tokens to decode from the LLM
      "return_full_text": False,  # return only the response (exclude the input prompt)
      "do_sample": False,         # False: greedy decoding. True: sample, with randomness controlled by a "temperature" setting.
  }

  messages = [{"role": "user", "content": prompt}]
  # Use the tokenizer passed in as an argument rather than the global one.
  chat_prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
  return pipe(chat_prompt, **generation_args)[0]['generated_text']

In [None]:
predict_sentiment(phi3_model, phi3_tokenizer, 'hello')

' Hello'

### (YOUR CODE HERE) Compare Zero Shot vs Few Shot Sentiment Analysis Accuracy

The IMDB movie review dataset has 50k examples. Running the LLM on all of them takes too much compute for this assignment, so we will look at only the first `n` examples in the dataset. Feel free to adjust `n` according to your compute budget.

Complete the two functions below to run zero-shot and few-shot sentiment analysis, and compute the accuracy of both methods.
* Which method is more accurate?
* Does accuracy change with prompt modification?
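
Before comparing accuracies, it helps to decide how raw model text maps to a label. The `' Hello'` output above starts with a space, so a prediction may need whitespace stripped before it is compared with the ground truth. The sketch below shows one possible scoring scheme. The helper names `parse_sentiment` and `accuracy` and the sample outputs are illustrative assumptions.

```python
def parse_sentiment(raw_output):
    """Map raw model text to "positive", "negative", or None if unclear."""
    text = raw_output.strip().lower()
    for label in ("positive", "negative"):
        if text.startswith(label):
            return label
    return None


def accuracy(predictions, labels):
    """Fraction of predictions that match the ground-truth labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    n_correct = sum(p == y for p, y in zip(predictions, labels))
    return n_correct / len(labels)


# Made-up raw outputs, including one that cannot be parsed.
raw_outputs = [" Positive", "negative.", "I think it is good", " NEGATIVE"]
labels = ["positive", "negative", "positive", "positive"]

predictions = [parse_sentiment(r) for r in raw_outputs]
print(predictions)
print(accuracy(predictions, labels))
```

Counting unparseable outputs as wrong, as done here, is one choice. Tracking them separately can also reveal whether a prompt change mainly improves the output format or the actual judgment.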

In [None]:
def zero_shot_sentiment_analysis(n=100):
  n_correct = 0
  for _, row in df_reviews.iloc[0:n].iterrows():
    review = row.review
    sentiment = row.sentiment
    prediction = predict_sentiment(phi3_model, phi3_tokenizer,
                                   zero_shot_prompt(review))
    # Strip whitespace first: the model output can start with a space.
    n_correct += prediction.strip().lower().startswith(sentiment)
  accuracy = n_correct / n
  print(f'zero shot {accuracy = }')
  return accuracy

In [None]:
def few_shot_sentiment_analysis(n=100):
  n_correct = 0
  for _, row in df_reviews.iloc[0:n].iterrows():
    review = row.review
    sentiment = row.sentiment
    prediction = predict_sentiment(phi3_model, phi3_tokenizer,
                                   few_shot_prompt(review))
    # Strip whitespace first: the model output can start with a space.
    n_correct += prediction.strip().lower().startswith(sentiment)
  accuracy = n_correct / n
  print(f'few shot {accuracy = }')
  return accuracy

In [None]:
zero_shot_sentiment_analysis()

In [None]:
few_shot_sentiment_analysis()
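
The two evaluation functions differ only in the prompt builder, so the loop can be shared. The sketch below is one way to factor it out. The names `evaluate`, `keyword_predictor`, and `plain_prompt` are hypothetical, and a simple keyword rule stands in for the LLM so that the example runs without a GPU.

```python
def evaluate(prompt_fn, predict_fn, reviews, labels):
    """Return the accuracy of `predict_fn` on prompts built by `prompt_fn`."""
    n_correct = 0
    for review, label in zip(reviews, labels):
        prediction = predict_fn(prompt_fn(review))
        n_correct += prediction.strip().lower().startswith(label)
    return n_correct / len(reviews)


# Stand-in for the LLM: a keyword rule, used only to make the sketch runnable.
def keyword_predictor(prompt):
    return " positive" if "great" in prompt.lower() else " negative"


def plain_prompt(review):
    return f"Review: {review}\nSentiment:"


reviews = ["A great film.", "Dull and slow.", "Great cast, weak script."]
labels = ["positive", "negative", "negative"]

score = evaluate(plain_prompt, keyword_predictor, reviews, labels)
print(f"accuracy = {score:.2f}")
```

In this notebook, `prompt_fn` could be `zero_shot_prompt` or `few_shot_prompt`, and `predict_fn` could be `lambda p: predict_sentiment(phi3_model, phi3_tokenizer, p)`.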