# Zero-shot and few-shot IMDB Classifier with OpenAI and SetFit

[![google colab link](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/tcvieira/IA368-DD-012023/blob/main/assingments/03/notebook.ipynb)

In [32]:
!pip install setfit
!pip install huggingface_hub -q
!pip install datasets -q
!pip install openai
!pip install sentence_transformers -q



# IMDB Dataset

In [19]:
from datasets import load_dataset
import random

dataset = load_dataset("imdb")




In [8]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [28]:
dataset['train'][0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

In [29]:
from datasets import concatenate_datasets

# create train dataset
seed=42
labels = 2
samples_per_label = 8
sampled_datasets = []
# for each label, shuffle its training examples and keep the first `samples_per_label`
for i in range(labels):
    sampled_datasets.append(dataset["train"].filter(lambda x: x["label"] == i).shuffle(seed=seed).select(range(samples_per_label)))

# concatenate the sampled datasets
train_dataset = concatenate_datasets(sampled_datasets)

# create test dataset
test_dataset = dataset["test"]
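
The balanced sampling above can also be sketched without the `datasets` library, which makes the logic easier to inspect. This is a minimal illustration on hypothetical toy data; `sample_per_label` and `toy_reviews` are names introduced here and are not part of the notebook.

```python
import random
from collections import Counter, defaultdict


def sample_per_label(examples, samples_per_label, seed=42):
    """Return a balanced subset with `samples_per_label` examples per label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for example in examples:
        by_label[example["label"]].append(example)
    subset = []
    for label in sorted(by_label):
        # rng.sample draws without replacement, so no example is repeated
        subset.extend(rng.sample(by_label[label], samples_per_label))
    return subset


# Toy stand-in for the IMDB training split (hypothetical data)
toy_reviews = [{"text": f"review {i}", "label": i % 2} for i in range(100)]
few_shot = sample_per_label(toy_reviews, samples_per_label=8)
print(Counter(example["label"] for example in few_shot))
```

Fixing the seed keeps the few-shot subset reproducible across runs, which matters when only 16 examples drive the whole experiment.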



In [30]:
train_dataset, test_dataset

(Dataset({
     features: ['text', 'label'],
     num_rows: 16
 }), Dataset({
     features: ['text', 'label'],
     num_rows: 25000
 }))

In [31]:
from collections import Counter

# Distribution of the selected samples
print(Counter([review["label"] for review in train_dataset]))
print(Counter([review["label"] for review in test_dataset]))

Counter({0: 8, 1: 8})
Counter({0: 12500, 1: 12500})


# OpenAI

In [69]:
import openai
import re
import time

In [86]:
openai.api_key = "API-KEY"

In [87]:
def send_prompt(prompt: str):
    data = {
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "top_p": 1
    }

    response = openai.ChatCompletion.create(**data)

    cost = 0.000002 * response["usage"]["total_tokens"]
    
    return response["choices"][0]["message"]["content"].strip().lower(), cost
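
To use `send_prompt` as a sentiment classifier, the review has to be wrapped in an instruction and the free-text reply mapped back to a label. The sketch below shows one possible way to do that; `build_prompt` and `parse_label` are hypothetical helpers, and the prompt wording is an assumption rather than the prompt used in this notebook.

```python
import re


def build_prompt(review, examples=()):
    """Build a sentiment prompt; `examples` is a sequence of (text, label) pairs.

    With no examples the prompt is zero-shot; with examples it is few-shot.
    """
    lines = ["Classify the sentiment of the movie review as positive or negative.", ""]
    for text, label in examples:
        lines += [f"Review: {text}", f"Sentiment: {label}", ""]
    lines += [f"Review: {review}", "Sentiment:"]
    return "\n".join(lines)


def parse_label(reply):
    """Map a free-text reply to 1 (positive), 0 (negative), or None if unclear."""
    match = re.search(r"\b(positive|negative)\b", reply.lower())
    if match is None:
        return None
    return 1 if match.group(1) == "positive" else 0


print(build_prompt("A dull, lifeless film.", [("I loved it!", "positive")]))
print(parse_label("Negative."))
```

The parser takes the first label word it finds, so a reply such as "not positive" would be read as positive; a stricter prompt or parser may be needed in practice.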

In [88]:
# send_prompt returns a (text, cost) tuple, so unpack it instead of indexing the raw response
reply, cost = send_prompt('hi, how are you today?')
reply

RateLimitError: ignored
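
A rate-limit error like the one above is often transient, so one common workaround is to retry with exponential backoff. The helper below is a generic sketch; `call_with_retries` and its parameters are assumptions introduced for illustration. Retrying will not help if the error comes from an exhausted quota rather than a temporary limit.

```python
import time


def call_with_retries(func, *args, max_retries=5, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(*args), retrying with exponential backoff on `retry_on` errors."""
    for attempt in range(max_retries):
        try:
            return func(*args)
        except retry_on:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            # wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** attempt)


# Possible usage with the function defined earlier (assumes the pre-1.0 openai package):
# reply, cost = call_with_retries(
#     send_prompt, "hi, how are you today?",
#     retry_on=(openai.error.RateLimitError,),
# )
```

Passing the `sleep` function as a parameter keeps the helper easy to test without actually waiting.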

# SetFit

## Zero-Shot

In [37]:
from sentence_transformers import SentenceTransformer
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')
model.to(device)

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

In [40]:
train_dataset_embeddings = model.encode(train_dataset['text'], show_progress_bar=True)


In [42]:
train_dataset_embeddings.shape

(16, 384)

In [44]:
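from sklearn.linear_model import LogisticRegression

# Baseline: the sentence encoder stays frozen (no fine-tuning); only a logistic
# regression head is fitted on the embeddings of the 16 labeled samples
clf = LogisticRegression(random_state=42).fit(train_dataset_embeddings, train_dataset['label'])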

### Evaluation

In [45]:
test_dataset_embeddings = model.encode(test_dataset['text'], show_progress_bar=True)


In [46]:
y_pred = clf.predict(test_dataset_embeddings)

In [48]:
from sklearn.metrics import classification_report

print(classification_report(test_dataset['label'], y_pred))

              precision    recall  f1-score   support

           0       0.64      0.60      0.62     12500
           1       0.62      0.66      0.64     12500

    accuracy                           0.63     25000
   macro avg       0.63      0.63      0.63     25000
weighted avg       0.63      0.63      0.63     25000
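
For reference, the per-class numbers in the report follow from counts of true positives, false positives, and false negatives. The sketch below recomputes precision, recall, and F1 by hand on hypothetical toy labels; `binary_report` is a name introduced here, not a scikit-learn function.

```python
def binary_report(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical toy labels, not the IMDB predictions
y_true_toy = [1, 1, 1, 0, 0, 0]
y_pred_toy = [1, 1, 0, 1, 0, 0]
print(binary_report(y_true_toy, y_pred_toy, positive=1))
```

Plugging the rounded values for class 0 into the same F1 formula gives 2 × 0.64 × 0.60 / (0.64 + 0.60) ≈ 0.62, which agrees with the table.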



## Few-Shot

In [68]:
from sentence_transformers.losses import CosineSimilarityLoss

from setfit import SetFitModel, SetFitTrainer, sample_dataset

model_id = "paraphrase-mpnet-base-v2" #sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2


# Load a SetFit model from Hub
model = SetFitModel.from_pretrained(
    model_id,
    cache_dir="./models/"
)

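# Caution: this evaluation split is drawn from train_dataset itself, and the trainer
# still trains on all 16 samples, so the accuracy it reports is measured on seen examples
eval_dataset = train_dataset.train_test_split(test_size=0.1, seed=42)['test']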

# Create trainer
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    metric="accuracy",
    batch_size=2,
    num_iterations=20,  # The number of text pairs to generate for contrastive learning
    num_epochs=1,  # The number of epochs to use for contrastive learning
    column_mapping={"text": "text", "label": "label"}  # Map dataset columns to text/label expected by trainer
)

# Train and evaluate
trainer.train()
metrics = trainer.evaluate()

# save the fine-tuned model using the public method instead of the private `_save_pretrained`
trainer.model.save_pretrained("./output/")

print(f"model used: {model_id}")
print(f"train dataset: {len(train_dataset)} samples")
print(f"accuracy: {metrics['accuracy']}")

config.json not found in HuggingFace Hub.


model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.
Applying column mapping to training dataset
***** Running training *****
  Num examples = 640
  Num epochs = 1
  Total optimization steps = 320
  Total train batch size = 2
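
The figure of 640 examples in the log is consistent with the contrastive setup: each of the 16 samples yields one positive and one negative pair per iteration, so 16 × 2 × 20 = 640 pairs, or 320 steps at batch size 2. The following is a simplified sketch of that pair generation, not SetFit's actual implementation; `generate_pairs` is a hypothetical name and the texts are placeholders.

```python
import random


def generate_pairs(texts, labels, num_iterations, seed=42):
    """Build (text_a, text_b, similarity) triples for contrastive training."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_iterations):
        for text, label in zip(texts, labels):
            same = [t for t, l in zip(texts, labels) if l == label]
            other = [t for t, l in zip(texts, labels) if l != label]
            pairs.append((text, rng.choice(same), 1.0))   # same label: target similarity 1
            pairs.append((text, rng.choice(other), 0.0))  # different label: target similarity 0
    return pairs


# Placeholder texts standing in for the 16 sampled reviews
texts = [f"review {i}" for i in range(16)]
labels = [0] * 8 + [1] * 8
pairs = generate_pairs(texts, labels, num_iterations=20)
print(len(pairs))       # 640
print(len(pairs) // 2)  # 320 optimization steps at batch size 2
```

In this simplified version a positive pair may match a text with itself, since the anchor is also a member of its own class.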


Applying column mapping to evaluation dataset
***** Running evaluation *****


model used: paraphrase-mpnet-base-v2
train dataset: 16 samples
accuracy: 1.0


In [70]:
model = trainer.model

In [71]:
from sklearn.metrics import classification_report

y_pred = model(test_dataset['text'])
y_true = test_dataset['label']

print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

           0       0.85      0.92      0.88     12500
           1       0.91      0.84      0.87     12500

    accuracy                           0.88     25000
   macro avg       0.88      0.88      0.88     25000
weighted avg       0.88      0.88      0.88     25000



## Inference

In [58]:
from setfit import SetFitModel

model = SetFitModel.from_pretrained("./output/", local_files_only=True)

sentiment_dict = {"negative": 0, "positive": 1}
inverse_dict = {value: key for (key, value) in sentiment_dict.items()}

# Run inference
text_list = [
    "i loved the spiderman movie!",
    "pineapple on pizza is the worst",
    "what the fuck is this piece",
    "good morning, lady boss",
    "the product is excellent",
    "a piece of rubbish"
]

preds = model(text_list)

for text, pred in zip(text_list, preds):
    print(text)
    print(inverse_dict[pred.item()])
    print('\n')

i loved the spiderman movie!
positive


pineapple on pizza is the worst
negative


what the fuck is this piece
negative


good morning, lady boss
negative


the product is excellent
positive


a piece of rubbish
negative


