<a href="https://colab.research.google.com/github/thiagolaitz/IA368-search-engines/blob/main/Project%2003/zero_shot_few_shot_imdb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

In this Colab notebook, we explore how well the large language model ChatGPT performs sentiment analysis on a movie review dataset (IMDb). We evaluate two prompting techniques, zero-shot and few-shot, on the task of classifying reviews as positive or negative. In the few-shot setting, the prompt includes a small number of labeled examples of the task; in the zero-shot setting, the prompt contains only a task description, so the model must classify reviews for a task it has not been explicitly trained on. Along the way, we look at how ChatGPT can be used for sentiment analysis and at the benefits and limitations of both techniques.

# Dataset

The IMDb dataset is a large collection of movie reviews from IMDb, one of the most comprehensive online movie databases. Each review is labeled as positive or negative according to the reviewer's sentiment towards the movie. The reviews are typically quite lengthy and contain a range of opinions and sentiments. The dataset has been widely used in natural language processing research as a benchmark for sentiment analysis models, owing to its size and its balanced distribution of positive and negative reviews.

In [1]:
!pip install datasets -q
!pip install openai -q

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m469.0/469.0 KB[0m [31m8.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m199.8/199.8 KB[0m [31m10.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m212.2/212.2 KB[0m [31m8.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m110.5/110.5 KB[0m [31m6.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.0/1.0 MB[0m [31m9.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m132.9/132.9 KB[0m [31m3.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m114.2/114.2 KB[0m [31m2.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m264.6/264.6 KB[0m [31m2.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━

In [2]:
from datasets import load_dataset

# Load the IMDB dataset
dataset = load_dataset("imdb")


For this experiment, we randomly select 100 samples from the test set to evaluate the model.

In [3]:
import random

random.seed(42)
test_reviews = random.sample(list(dataset["test"]), 100)

In [13]:
from collections import Counter

# Distribution of the selected samples
Counter([review["label"] for review in test_reviews])

Counter({1: 42, 0: 58})
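
The random sample above is not perfectly balanced (58 negative, 42 positive) even though the full dataset is. If an exactly balanced evaluation set were preferred, one option is a stratified sample. The sketch below is only an illustration: `stratified_sample` is a hypothetical helper, and the toy list stands in for `dataset["test"]`.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(examples, per_label, seed=42):
    """Draw `per_label` examples for each label value (hypothetical helper)."""
    rng = random.Random(seed)
    # Group the examples by their label
    by_label = defaultdict(list)
    for example in examples:
        by_label[example["label"]].append(example)
    # Sample the same number of examples from every group
    sample = []
    for label in sorted(by_label):
        sample.extend(rng.sample(by_label[label], per_label))
    rng.shuffle(sample)
    return sample

# Toy data standing in for dataset["test"] (assumption for illustration)
toy = [{"text": f"review {i}", "label": i % 2} for i in range(1000)]
balanced = stratified_sample(toy, per_label=50)
print(Counter(example["label"] for example in balanced))
```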

# OpenAI

ChatGPT is a large language model developed by OpenAI, based on the GPT-3.5 architecture. It is pretrained on a massive amount of text data using unsupervised learning to predict the next word in a sequence of text. It is available through the OpenAI API under the model name "gpt-3.5-turbo".

In [66]:
api_key = "<YOUR_API_KEY>"  # replace the placeholder with your OpenAI API key

In [67]:
import requests

def send_prompt(prompt: str):
    """
    Send a prompt to ChatGPT and get its answer.
    Args:
        prompt (str): a string containing the prompt
    Returns:
        The answer and the request cost
    """
    data = {
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "top_p": 1
    }

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}"
    }

    response = requests.post(
        "https://api.openai.com/v1/chat/completions",
        json=data,
        headers=headers
    )
    response.raise_for_status()
    response = response.json()
    # Estimated cost in USD, assuming a flat rate of $0.002 per 1K tokens
    cost = 0.000002 * response["usage"]["total_tokens"]

    return response["choices"][0]["message"]["content"].strip().lower(), cost
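
Because `raise_for_status()` raises on any 4xx or 5xx response, a single rate-limit error (HTTP 429) would abort the whole evaluation loop. One possible mitigation, not part of the original notebook, is to retry with exponential backoff. In the sketch below, `call_with_retries` and its parameters are hypothetical names, and a fake request function replaces the real API call so the example runs offline.

```python
import time

def call_with_retries(fn, should_retry, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff while should_retry(error) is true."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as error:
            # Give up on non-retryable errors or when attempts are exhausted
            if attempt == max_retries - 1 or not should_retry(error):
                raise
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt
            sleep(base_delay * 2 ** attempt)

# Demo with a fake request that fails twice before succeeding
calls = {"count": 0}

def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

delays = []  # record the waits instead of actually sleeping
result = call_with_retries(
    flaky_request,
    should_retry=lambda error: isinstance(error, RuntimeError),
    sleep=delays.append,
)
print(result, delays)
```

With the real function, the call could be wrapped as `call_with_retries(lambda: send_prompt(prompt), should_retry=...)`, where the predicate checks for a `requests.HTTPError` whose `response.status_code` is 429.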

In [82]:
import time

from tqdm.notebook import tqdm

def get_answers(prompt_base: str, test_reviews: list):
    """
    Given a list of movie reviews (from IMDb dataset) and the initial prompt,
    this function builds a prompt for each review and gets the model prediction.
    Args:
        prompt_base (str): The base prompt.
        test_reviews (list): A list with the reviews to get the prediction
    Returns:
        The model's answers and a list of dicts containing all logs.
    """
    all_answers = []
    logs = []
    total_cost = 0

    for review in tqdm(test_reviews, desc="Getting answers"):
        # Builds the prompt with the review
        prompt = f"{prompt_base}{review['text']}\nAnswer:"
        # Gets the answer and the request cost
        answer, cost = send_prompt(prompt)
        total_cost += cost
        # Skip answers that are not one of the expected labels
        if answer not in ["positive", "negative"]:
            print("Not expected answer")
            continue
        # 1 for positive and 0 for negative
        pred_label = int(answer == 'positive')
        gold_label = int(review["label"])
        all_answers.append((pred_label, gold_label))
        # Append the prompt, review and predictions
        logs.append({
            "prompt": prompt,
            "review": review,
            "pred": pred_label,
            "gold": gold_label
        })
        # Sleep to prevent 429 (rate limiting) HTTP code
        time.sleep(3)
    print(f"Total cost: {total_cost}")
    return all_answers, logs
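
`get_answers` discards any response that is not exactly `positive` or `negative`. If the model adds trailing punctuation or whitespace, a lenient normalization step could recover some of those responses. The helper below is a hypothetical sketch; the notebook does not record what the discarded answers looked like, so whether it would help here is an assumption.

```python
import string

VALID_LABELS = ("positive", "negative")

def normalize_answer(raw):
    """Return 'positive' or 'negative', or None if the answer is not recognized."""
    # Lowercase, then strip surrounding whitespace and punctuation such as "Positive."
    cleaned = raw.lower().strip(string.punctuation + string.whitespace)
    return cleaned if cleaned in VALID_LABELS else None

for raw in ["Positive.", " negative\n", "mixed", "The review is positive"]:
    print(repr(raw), "->", normalize_answer(raw))
```

Answers that embed the label in a longer sentence are still rejected, which keeps the check strict enough that the prediction remains unambiguous.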

# Zero-shot

Zero-shot learning uses a large language model (LLM) to perform a task, here classification, without training it explicitly on that task and without showing it any labeled examples. Instead, the model receives a prompt that describes the task and the expected output format, and it relies on the knowledge of language acquired during pretraining to produce a response in that format. This technique is particularly useful when little or no training data is available for a task, since it allows quick adaptation to new tasks without extensive retraining.

In [72]:
from tqdm.notebook import tqdm

prompt_base = "The text below is a movie review. You should detect if the person who wrote the review gave a positive or negative review. You can't say mixed or both, answer with only one of the options: Positive or Negative.\n#\nReview:\n"

all_answers, logs = get_answers(prompt_base, test_reviews)

Getting answers:   0%|          | 0/100 [00:00<?, ?it/s]

Total cost: 0.066016


## Results

In [73]:
import json

# Saves the zero-shot logs
with open("zero-shot.json", "w") as fout:
    fout.write(json.dumps(logs))

In [74]:
# Get the model's accuracy
acc = sum(1 for pred, gold in all_answers if pred == gold)/len(all_answers)
print(f"Zero-shot accuracy: {acc:.2f}")

Zero-shot accuracy: 0.98
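
Accuracy alone does not show which class the errors fall in. Since `all_answers` holds `(pred, gold)` pairs with 1 for positive and 0 for negative, a breakdown can be computed with a `Counter`. The function name and the toy pairs below are illustrative assumptions, not results from this notebook.

```python
from collections import Counter

def confusion_counts(pairs):
    """Count (pred, gold) pairs; 1 is positive and 0 is negative."""
    counts = Counter(pairs)
    return {
        "true_positive": counts[(1, 1)],
        "true_negative": counts[(0, 0)],
        "false_positive": counts[(1, 0)],
        "false_negative": counts[(0, 1)],
    }

# Toy pairs standing in for all_answers
toy_answers = [(1, 1), (0, 0), (1, 0), (0, 0), (1, 1)]
summary = confusion_counts(toy_answers)
print(summary)
```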


# Few-shot

Few-shot learning involves adding a small number of labeled examples of the task to the prompt, so that the model can infer the expected behavior and output format from them.

In [57]:
# Gets 4 random examples from the training set
training_samples = random.sample(list(dataset["train"]), 4)

In [58]:
training_samples = [
    {
        "text": sample["text"],
        "label": "positive" if sample["label"] else "negative"
    }
    for sample in training_samples
]

In [75]:
training_samples[0]

{'text': 'There is absolutely NO reason to waste your time with this "film". The original said it all and still holds up. Either read the book or do some research about the story, and you\'ll realize this remake is ludicrous. Eric Roberts as Perry Smith? His sister could have done a better job! Having been to Holcomb & Edgerton, KS where the story takes place, the sets and locations looked NOTHING like Kansas. The original is riveting, from the location filming to the use of the actual participants, weapons and victims belongings. Unforgettable performances by Scott Wilson and Robert Blake. Soundtrack by Quincy Jones and cinematography by Conrad Hall...The original is available on DVD in widescreen now. Let this turkey die a quick death.',
 'label': 'negative'}

In [80]:
prompt_base = f"""
The text below contains examples of movie reviews. Your task is to determine whether the reviewer gave a positive or negative review. You cannot say 'mixed' or 'neutral' - provide only one option: positive or negative.
#
[Example 1]:
Review: {training_samples[0]["text"]}
Answer: {training_samples[0]["label"]}
#
[Example 2]:
Review: {training_samples[1]["text"]}
Answer: {training_samples[1]["label"]}
#
[Example 3]:
Review: {training_samples[2]["text"]}
Answer: {training_samples[2]["label"]}
#
[Example 4]:
Review: {training_samples[3]["text"]}
Answer: {training_samples[3]["label"]}
#
[Example 5]:
Review: 
"""
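
The template above hard-codes four examples. As a sketch of the same layout for any number of examples, the hypothetical helper below assembles the prompt from a list of `{"text", "label"}` dicts; the toy reviews are made up for illustration.

```python
def build_few_shot_prompt(instruction, examples):
    """Build a few-shot prompt that ends with an open 'Review: ' slot."""
    parts = [instruction]
    for index, example in enumerate(examples, start=1):
        parts.append(
            f"#\n[Example {index}]:\n"
            f"Review: {example['text']}\n"
            f"Answer: {example['label']}"
        )
    # The final, unanswered example is where the test review is appended
    parts.append(f"#\n[Example {len(examples) + 1}]:\nReview: ")
    return "\n".join(parts) + "\n"

toy_examples = [
    {"text": "A wonderful film.", "label": "positive"},
    {"text": "A complete waste of time.", "label": "negative"},
]
toy_prompt = build_few_shot_prompt(
    "Classify each movie review as positive or negative.", toy_examples
)
print(toy_prompt)
```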

In [77]:
print(prompt_base)


The text below contains examples of movie reviews. Your task is to determine whether the reviewer gave a positive or negative review. You cannot choose both or say 'mixed' or 'neutral' - provide only one option: positive or negative.
#
[Example 1]:
Review: There is absolutely NO reason to waste your time with this "film". The original said it all and still holds up. Either read the book or do some research about the story, and you'll realize this remake is ludicrous. Eric Roberts as Perry Smith? His sister could have done a better job! Having been to Holcomb & Edgerton, KS where the story takes place, the sets and locations looked NOTHING like Kansas. The original is riveting, from the location filming to the use of the actual participants, weapons and victims belongings. Unforgettable performances by Scott Wilson and Robert Blake. Soundtrack by Quincy Jones and cinematography by Conrad Hall...The original is available on DVD in widescreen now. Let this turkey die a quick death.
Answe

In [84]:
all_answers, logs = get_answers(prompt_base, test_reviews)

Getting answers:   0%|          | 0/100 [00:00<?, ?it/s]

Not expected answer
Total cost: 0.239242
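
The two reported totals can be converted back into token counts using the per-token rate from `send_prompt`. The short calculation below suggests that the few-shot run consumed roughly 3.6 times as many tokens as the zero-shot run, which is expected because every request repeats the four example reviews. `tokens_from_cost` is a hypothetical helper name.

```python
PRICE_PER_TOKEN = 0.000002  # same rate used in send_prompt

def tokens_from_cost(total_cost, price_per_token=PRICE_PER_TOKEN):
    """Recover the total token count from a reported total cost."""
    return round(total_cost / price_per_token)

zero_shot_tokens = tokens_from_cost(0.066016)
few_shot_tokens = tokens_from_cost(0.239242)
ratio = few_shot_tokens / zero_shot_tokens

print(f"Zero-shot tokens: {zero_shot_tokens}")
print(f"Few-shot tokens: {few_shot_tokens}")
print(f"Few-shot / zero-shot: {ratio:.2f}")
```

Each run sent 100 requests (the cost is accumulated before the answer is validated), so this is about 330 tokens per zero-shot request and about 1,196 per few-shot request.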


In [85]:
# Saves the few-shot logs
with open("few-shot.json", "w") as fout:
    fout.write(json.dumps(logs))

## Results

In [86]:
# Get the model's accuracy
acc = sum(1 for pred, gold in all_answers if pred == gold)/len(all_answers)
print(f"Few-shot accuracy: {acc:.2f}")

Few-shot accuracy: 0.96


# Conclusions

Both zero-shot and few-shot prompting performed well on the IMDb sentiment analysis task. On the 100 sampled test reviews, zero-shot prompting reached an accuracy of 98%, while few-shot prompting reached 96% (one few-shot response was not a valid label and was excluded from the calculation). With a sample this small, the two-point gap amounts to only a couple of reviews, so these results show that both techniques are effective rather than that one is better than the other.
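
To get a rough sense of the uncertainty in accuracies measured on about 100 reviews, one can compute a confidence interval for a proportion. The sketch below uses the Wilson score interval with z = 1.96 (approximately 95%). This analysis is not part of the original notebook, and the few-shot count of 95 correct out of 99 valid answers is inferred from the reported 0.96 accuracy, so treat it as an assumption.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width

zero_low, zero_high = wilson_interval(98, 100)  # zero-shot: 98 of 100 correct
few_low, few_high = wilson_interval(95, 99)     # few-shot: assumed 95 of 99 correct

print(f"Zero-shot: [{zero_low:.3f}, {zero_high:.3f}]")
print(f"Few-shot:  [{few_low:.3f}, {few_high:.3f}]")
```

If the two intervals overlap substantially, the sample does not support ranking one technique above the other.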