# BERT Example

In [None]:
!pip install scikit-learn datasets transformers evaluate numpy huggingface_hub torch accelerate

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only the first GPU to this process
os.environ["NCCL_SHM_DISABLE"] = "1"  # disable NCCL's shared-memory transport

In [None]:
import evaluate
import numpy as np
from datasets import load_dataset
from huggingface_hub import interpreter_login
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments,
                          pipeline, BertTokenizer, BertModel)
import torch
from sklearn.metrics.pairwise import cosine_similarity

# get an access token from https://huggingface.co/settings/tokens
interpreter_login()

## 1. Word Embedding, Sentence Embedding, and Sentence Similarity

## 1.1 Split Sentence into Words

The BERT tokenizer splits input text into tokens, where each token is a word or a subword. For example, tokenizer.tokenize() converts "I like coding in Python." into ['i', 'like', 'coding', 'in', 'python', '.']. When a sentence is encoded with tokenizer(...) or tokenizer.encode(...), the tokenizer also inserts special tokens: [CLS] at the start of the input and [SEP] at the end of each sentence, which help BERT recognize sentence boundaries. Note that tokenizer.tokenize() on its own does not add these special tokens.
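The subword splitting and special-token wrapping described above can be sketched without downloading a model. The following is a simplified illustration of greedy longest-match WordPiece splitting, not the library's implementation: the function names and the miniature vocabulary are hypothetical, and the sketch splits on whitespace only, whereas the real tokenizer also separates punctuation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def encode_with_special_tokens(sentence, vocab):
    """Lowercase, split on whitespace, apply WordPiece, then wrap in [CLS] ... [SEP]."""
    tokens = []
    for word in sentence.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    return ["[CLS]"] + tokens + ["[SEP]"]


# Hypothetical miniature vocabulary, for illustration only.
toy_vocab = {"i", "like", "python", "token", "##ization", "##s"}

print(wordpiece_tokenize("tokenization", toy_vocab))
print(encode_with_special_tokens("I like Python tokens", toy_vocab))
```

A word that cannot be assembled from vocabulary pieces maps to the unknown token in this sketch, which mirrors why a large subword vocabulary rarely needs [UNK] for ordinary English text.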

In [3]:


# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Example sentences
sentence1 = "I like coding in Python most."
sentence2 = "Python is my favorite programming language."
sentence3 = "Java is my favorite programming language."

# Tokenize the sentences
tokens1 = tokenizer.tokenize(sentence1)
tokens2 = tokenizer.tokenize(sentence2)
tokens3 = tokenizer.tokenize(sentence3)


## 1.2 Generate Text Embeddings

This step loads the pre-trained 'bert-base-uncased' model, converts each tokenized sentence to token IDs, and reshapes the IDs into a tensor with a batch dimension of 1. The model returns one hidden-state vector per token, and the vector at position 0 is taken as the sentence embedding. Position 0 holds the [CLS] token only when special tokens have been added. Because tokenizer.tokenize() does not add them, position 0 in the cell below is the first word-piece token of each sentence.

In [None]:

model = BertModel.from_pretrained('bert-base-uncased')

# Convert tokens to input IDs
input_ids1 = torch.tensor(tokenizer.convert_tokens_to_ids(tokens1)).unsqueeze(0)  # Batch size 1
input_ids2 = torch.tensor(tokenizer.convert_tokens_to_ids(tokens2)).unsqueeze(0)  # Batch size 1
input_ids3 = torch.tensor(tokenizer.convert_tokens_to_ids(tokens3)).unsqueeze(0)  # Batch size 1

# Obtain the BERT embeddings
with torch.no_grad():
    outputs1 = model(input_ids1)
    outputs2 = model(input_ids2)
    outputs3 = model(input_ids3)
    embeddings1 = outputs1.last_hidden_state[:, 0, :]  # position 0: first token (no [CLS] was added)
    embeddings2 = outputs2.last_hidden_state[:, 0, :]  # position 0: first token (no [CLS] was added)
    embeddings3 = outputs3.last_hidden_state[:, 0, :]  # position 0: first token (no [CLS] was added)

In [6]:
# Calculate pairwise cosine similarity between the sentence embeddings
print("Sentence Similarity Score 1 vs 2:", cosine_similarity(embeddings1, embeddings2))

print("Sentence Similarity Score 1 vs 3:", cosine_similarity(embeddings1, embeddings3))

print("Sentence Similarity Score 2 vs 3:", cosine_similarity(embeddings2, embeddings3))

Sentence Similarity Score 1 vs 2: [[0.3855042]]
Sentence Similarity Score 1 vs 3: [[0.3912598]]
Sentence Similarity Score 2 vs 3: [[0.8175524]]
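The scores above are cosine similarities: the dot product of two vectors divided by the product of their lengths. As a minimal sketch of that formula, using only the standard library and made-up three-dimensional vectors (the helper name cosine_sim is hypothetical, and real BERT embeddings have 768 dimensions):

```python
import math


def cosine_sim(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors, for illustration only.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, twice as long
c = [3.0, 0.0, -1.0]  # orthogonal to a

print(cosine_sim(a, b))  # close to 1: same direction, length is ignored
print(cosine_sim(a, c))  # 0: orthogonal vectors
```

Because the measure ignores vector length, it compares only the direction of two embeddings, which is why it is a common choice for sentence similarity.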


## 2. Text Classification

## 2.1 Load Data

Text classification is a common NLP task that assigns a label or class to text. Some of the largest companies run text classification in production for a wide range of practical applications. One of the most popular forms of text classification is sentiment analysis, which assigns a label like positive, negative, or neutral to a sequence of text.

We use the IMDb dataset to demonstrate text classification with a BERT-style model. The dataset contains 50,000 movie reviews, each labeled as positive or negative. We will fine-tune DistilBERT, a smaller distilled version of BERT, to classify the sentiment of the reviews.

There are two fields in this dataset:

- text: the movie review text.
- label: a value that is either 0 for a negative review or 1 for a positive review.

In [4]:
# Load the dataset
imdb = load_dataset('imdb')
imdb["test"][0]

{'text': 'I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn\'t match the background, and painfully one-dimensional characters cannot be overcome with a \'sci-fi\' setting. (I\'m sure there are those of you out there who think Babylon 5 is good sci-fi TV. It\'s not. It\'s clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It\'s really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it\'s rubbish as 

## 2.2 Preprocess: Tokenization

- BERT-style models require a specific input format, so the text data must be preprocessed first. The first step is tokenization, which splits the text into individual words or subwords and maps them to token IDs.
- The model used here, DistilBERT, accepts at most 512 tokens per input, so longer reviews are truncated to that limit. Shorter reviews are padded later, per batch, by the data collator.
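Truncation and padding can be illustrated on plain lists of token IDs. The sketch below is an assumption-laden toy, not what the library does internally: the helper names are hypothetical, the content IDs are made up, and a limit of 6 stands in for the real limit of 512. The IDs 101, 102, and 0 are used for [CLS], [SEP], and [PAD].

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def truncate_and_wrap(content_ids, max_length):
    """Keep at most max_length - 2 content tokens, then add [CLS] and [SEP]."""
    kept = content_ids[: max_length - 2]
    return [CLS_ID] + kept + [SEP_ID]


def pad_batch(batch):
    """Pad every sequence to the longest one in the batch and build attention masks."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [PAD_ID] * (longest - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up content IDs for two reviews of different lengths.
reviews = [[11, 12, 13, 14, 15, 16, 17], [21, 22]]
encoded = [truncate_and_wrap(ids, max_length=6) for ids in reviews]
padded = pad_batch(encoded)

print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding to the longest sequence in each batch, rather than always to the model maximum, keeps batches of short reviews small. The attention mask tells the model which positions are real tokens and which are padding.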

In [5]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

# Tokenize the text and truncate sequences to DistilBERT's maximum input length
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

# Apply the preprocessing function to the dataset
tokenized_imdb = imdb.map(preprocess_function, batched=True)

In [6]:
# Pad each batch dynamically to the length of its longest sequence
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

## 2.3 Training

In [None]:
# Include a metric to compute accuracy during training
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

# Sentiment labels
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

# Load the model
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)
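The compute_metrics function above reduces each row of logits to a class index with argmax and compares it with the reference label. A dependency-free sketch of the same calculation, using made-up logits and hypothetical helper names, might look like this:

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class matches the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Made-up logits for four reviews; columns are [NEGATIVE, POSITIVE].
logits = [[2.1, -0.3], [-1.0, 0.5], [0.2, 0.1], [-0.7, 1.9]]
labels = [0, 1, 1, 1]

print(accuracy_from_logits(logits, labels))  # 3 of 4 predictions match
```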

In [None]:
# Train the model
training_args = TrainingArguments(
    output_dir="my_awesome_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

## 2.4 Sentiment Classification

After fine-tuning DistilBERT on the IMDb dataset, we can use the model to classify the sentiment of movie reviews. The two example reviews below demonstrate sentiment classification:

In [None]:
classifier = pipeline("sentiment-analysis", model="Morrisovo/my_awesome_model", device=0)

test1 = classifier("This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three.")

test2 = classifier("This was a disappointment. Strayed too far from the books, and failed to captivate from beginning to end. Easily my least favorite of the three.")

In [10]:
print(test1)

print(test2)


[{'label': 'POSITIVE', 'score': 0.9976010918617249}]
[{'label': 'NEGATIVE', 'score': 0.9928188920021057}]
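The score in each result is a probability. For a two-label classifier like this one, it is obtained by applying softmax to the model's two output logits and reporting the larger value together with its label. The sketch below assumes that behavior and uses made-up logits; the helper names are hypothetical, and id2label matches the mapping defined in section 2.3.

```python
import math

id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(logits):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return {"label": id2label[best], "score": probs[best]}


# Made-up logits, for illustration only.
print(to_prediction([-2.0, 3.0]))
print(to_prediction([1.5, -1.5]))
```

A score near 1, as in the outputs above, means the gap between the two logits is large, so the model is confident in its label.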
