<a href="https://colab.research.google.com/github/rimon15/nlp_intro_notebook/blob/main/NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text classification with BERT

Text classification is a common NLP task that assigns a label or class to text. Some of the largest companies run text classification in production for a wide range of practical applications. One of the most popular forms of text classification is sentiment analysis, which assigns a label like 🙂 positive, 🙁 negative, or 😐 neutral to a sequence of text.
Reference: https://huggingface.co/docs/transformers/en/tasks/sequence_classification

In [None]:
!pip install -q transformers
!pip install -q datasets
!pip install -q evaluate
!pip install -q accelerate


In [None]:
import transformers
import torch
from datasets import load_dataset

### Load IMDb dataset

In [None]:
imdb = load_dataset("imdb")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
import random
def sampling_data(key, n, k):
  # Keep k randomly chosen rows of imdb[key]; range(n) covers every index, including row 0
  imdb[key] = imdb[key].select(random.sample(range(n), k))

sampling_data('train', len(imdb['train']), len(imdb['train'])//10)
sampling_data('test', len(imdb['test']), len(imdb['test'])//10)

del imdb['unsupervised']
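
Because `random.sample` draws from the global random state, each run of the sampling cell keeps a different subset. A minimal sketch of a reproducible variant, assuming a fixed seed is acceptable; `sample_indices` and the seed value are hypothetical names introduced here, not part of the notebook:

```python
import random

def sample_indices(n, k, seed=0):
    """Return k distinct indices drawn from range(n), reproducibly."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    return rng.sample(range(n), k)

# 25000 rows reduced to one tenth, matching the 2500 rows shown in the notebook output
indices = sample_indices(25000, 2500, seed=42)
```

A list like `indices` could then be passed to `Dataset.select`, as `sampling_data` does.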

In [None]:
imdb

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 2500
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2500
    })
})

In [None]:
imdb["test"][0]

{'text': 'The video case for this film reads "a story of beauty, passion, and forbidden fruit". Are they talking about the same movie I just saw?! They can\'t be, as the film I just saw was beautiful, but there was no passion and as for the fruit, this is all hogwash meant to entice the potential viewer to see this movie. If only it did have some passion or some life to it, I would have greatly enjoyed this film. Instead, it was an agonizingly slow paced and not particularly interesting film that I would definitely not want to see again. It isn\'t that it\'s a bad film (after all it IS very beautifully filmed), but it is dull beyond belief. I kept waiting for something exciting or interesting to happen, but then the movie just ended. There was no great sense of excitement, mystery or anything--just a rather unexciting story about a young girl who becomes a servant and spends the next 10 years of her life working as a maid.',
 'label': 0}

### Preprocess

In [None]:
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id,
)

# Move model to GPU if available
if torch.cuda.is_available():
  model = model.to("cuda")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_imdb = imdb.map(preprocess_function, batched=True)

Map:   0%|          | 0/2500 [00:00<?, ? examples/s]

Map:   0%|          | 0/2500 [00:00<?, ? examples/s]
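
With `truncation=True`, reviews longer than the model's maximum input length are cut off rather than rejected. A toy sketch of the idea, assuming a 512-token limit and the `[CLS]`/`[SEP]` ids 101 and 102 used by `bert-base-uncased`; `truncate_with_special_tokens` is a hypothetical helper, not the tokenizer's actual code:

```python
def truncate_with_special_tokens(token_ids, max_length=512, cls_id=101, sep_id=102):
    """Cut the content tokens so that [CLS] + content + [SEP] fits in max_length."""
    body = token_ids[: max_length - 2]  # reserve two slots for the special tokens
    return [cls_id] + body + [sep_id]

long_review = list(range(1000, 1600))   # 600 made-up token ids
short_review = [2000, 2001, 2002]       # 3 made-up token ids

truncated_long = truncate_with_special_tokens(long_review)
truncated_short = truncate_with_special_tokens(short_review)
```

Anything past the limit is simply dropped, so the end of a very long review never reaches the model.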

In [None]:
import evaluate

accuracy = evaluate.load("accuracy")

In [None]:
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
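
`compute_metrics` receives the raw logits and the true labels, takes the highest-scoring class per row, and reports the share of matches. A toy sketch of the same arithmetic in plain Python, with made-up logits and labels; `accuracy_from_logits` is a hypothetical helper that only mirrors the shape of the result for this metric:

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.9]]
toy_labels = [0, 1, 0, 0]
result = accuracy_from_logits(toy_logits, toy_labels)
print(result)  # {'accuracy': 0.75}
```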

### Train

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
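
`DataCollatorWithPadding` pads each batch to the length of its longest sequence rather than to one fixed global length. A rough sketch of that idea, assuming right-side padding and a pad id of 0; `pad_batch` and the token ids are illustrative, not the collator's actual implementation:

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence on the right to the longest length in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

toy_batch = [[101, 7, 8, 9, 102], [101, 7, 102]]  # made-up token ids
batch = pad_batch(toy_batch)
```

Padding per batch keeps short reviews from being stretched to the length of the longest review in the whole dataset.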

In [None]:
training_args = TrainingArguments(
    output_dir="BERT_imdb_review_classifier",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    eval_strategy="epoch",  # called evaluation_strategy in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
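    # Note: this evaluates on the training split, so the accuracy reported during training is measured on data the model has already seen
    eval_dataset=tokenized_imdb["train"],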
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | No log        | 0.145108        | 0.9532   |
| 2     | No log        | 0.094433        | 0.9732   |


TrainOutput(global_step=314, training_loss=0.2619737758757962, metrics={'train_runtime': 637.8651, 'train_samples_per_second': 7.839, 'train_steps_per_second': 0.492, 'total_flos': 1298802502572000.0, 'train_loss': 0.2619737758757962, 'epoch': 2.0})
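
The `global_step=314` in the output follows from the dataset size and the batch size. A quick check, assuming a single device and no gradient accumulation; `total_training_steps` is a hypothetical helper:

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for a run where the last, smaller batch of each epoch still counts."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

steps = total_training_steps(num_examples=2500, batch_size=16, num_epochs=2)
print(steps)  # 314
```

With only 314 steps in total, the run probably never reaches the `Trainer`'s default logging interval of 500 steps, which would explain the `No log` entries in the Training Loss column.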

### Simple Inference

In [None]:
text = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."


In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="/content/BERT_imdb_review_classifier/checkpoint-314")
classifier(text)

[{'label': 'POSITIVE', 'score': 0.9872947335243225}]

### Inference step by step

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("/content/BERT_imdb_review_classifier/checkpoint-314")
inputs = tokenizer(text, return_tensors="pt")

model = AutoModelForSequenceClassification.from_pretrained("/content/BERT_imdb_review_classifier/checkpoint-314")
with torch.no_grad():
    logits = model(**inputs).logits

In [None]:
predicted_class_id = logits.argmax().item()
model.config.id2label[predicted_class_id]

'POSITIVE'
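
The `score` returned by the pipeline is a probability, whereas the step-by-step version stops at raw logits. A minimal sketch of the missing step, assuming a softmax over the two class logits; the logit values below are invented for illustration and are not the model's actual output:

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: "NEGATIVE", 1: "POSITIVE"}
toy_logits = [-2.0, 2.5]  # hypothetical logits for one review

probs = softmax(toy_logits)
predicted_class_id = max(range(len(probs)), key=probs.__getitem__)
prediction = {"label": id2label[predicted_class_id], "score": probs[predicted_class_id]}
```

Here `prediction` has the same shape as one entry of the pipeline output: a label plus the probability assigned to it.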

## Full evaluation on the test set

In [None]:
# Based on the above, take the model and run a full evaluation on the IMDB test set. What accuracy can you achieve?
# Your code goes here

# Using ChatGPT

Go to <a href="https://chat.openai.com/">ChatGPT</a>, and try some prompts with the reviews to see if ChatGPT can classify them. How does this compare to BERT?

Some factors to compare:
- Runtime performance
- Prompt adherence
- Accuracy
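
To compare prompt adherence and accuracy systematically, it can help to fix one prompt template and map each free-text reply back onto the two labels. A sketch of that bookkeeping, with no API calls; `build_prompt`, `parse_label`, and the template wording are all hypothetical, and the replies are made-up examples:

```python
import string

LABELS = ("NEGATIVE", "POSITIVE")

def build_prompt(review):
    """Wrap a review in a fixed instruction so every query is phrased the same way."""
    return (
        "Classify the sentiment of this movie review. "
        "Answer with exactly one word, POSITIVE or NEGATIVE.\n\n"
        f"Review: {review}"
    )

def parse_label(reply):
    """Return the label if the reply is a single valid label, otherwise None."""
    cleaned = reply.strip().strip(string.punctuation).upper()
    return cleaned if cleaned in LABELS else None

replies = ["Positive.", "NEGATIVE", "I think the reviewer liked it."]
parsed = [parse_label(r) for r in replies]

# Share of replies that followed the one-word instruction
adherence = sum(p is not None for p in parsed) / len(parsed)
```

Replies that parse to `None` count against prompt adherence, and the parsed labels can be scored against the dataset labels in the same way as the BERT predictions.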