# Sentiment Analysis with BERT

The goal is to predict the sentiment label of a sentence.
There are three labels: `positive`, `negative`, and `neutral`.
You are given a dataset of sentences paired with their sentiment labels, and your task is to fine-tune BERT into a sentence classifier on this dataset.

For the implementation, this week we introduce the [🤗 transformers](https://huggingface.co/) library, maintained by Hugging Face, as the training framework. [PyTorch](https://pytorch.org/) is used as the deep learning backend in this tutorial.

## Step 1: Prepare your environment

### 1.1 Create a new environment with conda

Again, we highly recommend installing all packages inside a virtual environment manager such as [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) to prevent version conflicts between packages.

```
# Create a new environment
conda create -y -n sentiment-analysis python=3.9

# Activate the environment
conda activate sentiment-analysis
```

### 1.2 Install python packages

Command:
```
pip install -r requirements.txt
```

Dependencies:

1. `numpy`: for matrix operations
2. `scikit-learn`: for label encoding
3. `datasets`: for data preparation
4. `transformers`: for model loading and fine-tuning
5. `pytorch`: the backend deep learning framework

### 1.3 Select GPU(s) for your backend

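Select which GPU(s) to use.
Note that `CUDA_VISIBLE_DEVICES` should be set before you import TensorFlow or PyTorch, because the set of visible devices is fixed once the framework initializes CUDA.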

In [1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = '0'
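
A minimal sketch of why the order matters, assuming a hypothetical helper named `select_gpus` (it is not part of any library). It sets `CUDA_VISIBLE_DEVICES` and reports whether a deep learning framework was already imported, in which case the setting may come too late.

```python
import os
import sys


def select_gpus(device_ids):
    """Hypothetical helper: expose only the given GPU ids to CUDA.

    Returns the names of frameworks that were already imported, because
    changing CUDA_VISIBLE_DEVICES after CUDA is initialized has no effect.
    """
    already_loaded = [name for name in ("torch", "tensorflow") if name in sys.modules]
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in device_ids)
    return already_loaded


too_late = select_gpus([0])
if too_late:
    print(f"Warning: {too_late} imported before GPU selection")
print(os.environ["CUDA_VISIBLE_DEVICES"])
```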

### 1.4 Check PyTorch

In [2]:
import torch

torch.cuda.is_available()

True

## Step 2: Prepare the dataset

Before starting the training, we need to load and process our dataset.

### 2.1 Load data

`datasets` is a library provided by Hugging Face.

It gives access to many public datasets and helps with data processing.

We can use its `load_dataset` function to read the input `.csv` file.

Reference:
 - [Official datasets document](https://huggingface.co/docs/datasets)
 - [datasets.load_dataset](https://huggingface.co/docs/datasets/loading.html)
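
To make the expected input concrete, here is a small sketch that uses only Python's standard `csv` module. It assumes the file has a header row with `label` and `text` columns, which matches the features printed in section 2.2. The file name `toy_train.csv` and its two rows are made up for illustration.

```python
import csv
import os
import tempfile

# Made-up rows in the assumed layout: one header row, then label/text pairs.
rows = [
    {"label": "positive", "text": "I loved the show, it was great!"},
    {"label": "negative", "text": "Worst service I have ever had."},
]

path = os.path.join(tempfile.mkdtemp(), "toy_train.csv")
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["label", "text"])
    writer.writeheader()
    writer.writerows(rows)

# Reading it back gives one dict per row, keyed by column name.
with open(path, newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))

print(loaded[0]["label"], "->", loaded[0]["text"])
```

A file in this layout is the kind of input that `load_dataset('csv', data_files=...)` reads.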

In [3]:
import os
from datasets import load_dataset

In [4]:
# TODO: Setup path and filename
path_to_folder = ...
filename = ...
dataset = load_dataset('csv', data_files = os.path.join(..., ...))

Using custom data configuration default-b6e650acb670b9ab
Found cached dataset csv (/home/nlplab/kedy/.cache/huggingface/datasets/csv/default-b6e650acb670b9ab/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317)
100%|██████████| 1/1 [00:00<00:00, 419.10it/s]


### 2.2 Check loaded data structure

In [5]:
print(dataset['train'])

Dataset({
    features: ['label', 'text'],
    num_rows: 20632
})
{'label': 'neutral', 'text': 'Order Go Set a Watchman in store or through our website before Tuesday and get it half price! #GSAW @GSAWatchmanBook https://t.co/KET6EGD1an'}
["Picturehouse's, Pink Floyd's, 'Roger Waters: The Walll - opening 29 Sept is now making waves. Watch the trailer on Rolling Stone - look...", 'Order Go Set a Watchman in store or through our website before Tuesday and get it half price! #GSAW @GSAWatchmanBook https://t.co/KET6EGD1an', 'If these runway renovations at the airport prevent me from seeing Taylor Swift on Monday, Bad Blood will have a new meaning.', 'If you could ask an onstage interview question at Miss USA tomorrow, what would it be?', 'A portion of book sales from our Harper Lee/Go Set a Watchman release party on Mon. 7/13 will support @CAP_Tulsa and the great work they do.']


In [None]:
# [ Practice ] Check first data
print(dataset['train'][...])

In [None]:
# [ Practice ] Check first 5 texts
print(dataset['train'][...][:5])

## Step 3: Preprocessing

Before being fed into the model, texts must be tokenized, mapped to token IDs, and padded.

Here we use [BERT](https://arxiv.org/abs/1810.04805) (**B**idirectional **E**ncoder **R**epresentations from **T**ransformers), a language model proposed by Google AI in 2018.

It is one of the most popular models in NLP.

However, we will not use BERT directly in this tutorial, because it is large and takes a long time to train.

Instead, we use [DistilBERT](https://medium.com/huggingface/distilbert-8cf3380435b5).

DistilBERT is a distilled version of BERT that is much more lightweight than the original model while retaining about 95% of its accuracy, which makes it a good fit for our task today.

### 3.1 Specify the language model

In [6]:
# Available models can be found here: https://huggingface.co/models
# DistilBERT base model (uncased): https://huggingface.co/distilbert-base-uncased
# TODO: Setup the model name
MODEL_NAME = ...

### 3.2 Sentence processing with tokenizer

Different pre-trained language models may use different preprocessing, which is why we should use the tokenizer that was trained along with the model.

In our case, we are using DistilBERT, so we should use the DistilBERT tokenizer.

With Hugging Face, loading a tokenizer is easy: import `AutoTokenizer` from `transformers`, tell it which model you plan to use, and it handles the rest.

In [7]:
from transformers import AutoTokenizer # For tokenization

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

### 3.3 Play with the tokenizer

In [8]:
example = "This so-called \"Perfect Evening\" was so disappointing, as well as discouraging us from coming to your Circle Theatre again."

embeddings = tokenizer(example)
embeddings

{'input_ids': [101, 2023, 2061, 1011, 2170, 1000, 3819, 3944, 1000, 2001, 2061, 15640, 1010, 2004, 2092, 2004, 12532, 4648, 4726, 2149, 2013, 2746, 2000, 2115, 4418, 3004, 2153, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [9]:
# TODO: Detokenization
decoded_tokens = tokenizer.batch_decode(embeddings[...])
print(' '.join(decoded_tokens))

[CLS] this so - called " perfect evening " was so disappointing , as well as disco ##ura ##ging us from coming to your circle theatre again . [SEP]
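
The `##` pieces in the output above come from WordPiece tokenization, which BERT-style tokenizers use to split a word that is not in the vocabulary into known sub-words. The sketch below is a simplified greedy longest-match version with a tiny, made-up vocabulary (`TOY_VOCAB`). The real tokenizer has a far larger vocabulary and also handles lowercasing and punctuation, so treat this only as an illustration of the idea.

```python
TOY_VOCAB = {"disco", "##ura", "##ging", "us", "[UNK]"}  # made-up vocabulary


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


print(wordpiece("discouraging", TOY_VOCAB))
print(wordpiece("us", TOY_VOCAB))
print(wordpiece("theatre", TOY_VOCAB))
```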


In [10]:
# EXAMPLE: tokenize, pad, and truncate directly into PyTorch tensors
embeddings = tokenizer(
                       example,
                       max_length=128,
                       padding="max_length",
                       is_split_into_words=False,
                       truncation=True,
                       return_tensors='pt',
                      )
embeddings

{'input_ids': tensor([[  101,  2023,  2061,  1011,  2170,  1000,  3819,  3944,  1000,  2001,
          2061, 15640,  1010,  2004,  2092,  2004, 12532,  4648,  4726,  2149,
          2013,  2746,  2000,  2115,  4418,  3004,  2153,  1012,   102,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,  
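
The long run of zeros in the output above is padding. As a rough sketch of what `padding="max_length"` and `truncation=True` do to a single sequence, the hypothetical function `pad_and_truncate` below cuts a list of token IDs to `max_length`, right-pads it with the padding ID (assumed to be 0, as in the output above), and builds the matching attention mask.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Cut token_ids to max_length, then right-pad with pad_id.

    Simplification: the real tokenizer keeps the final [SEP] token when
    it truncates; this sketch simply slices the list.
    """
    ids = token_ids[:max_length]
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }


encoded = pad_and_truncate([101, 2023, 2061, 102], max_length=8)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```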

### 3.4 Label processing

In the following section, we will use the [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder) provided by scikit-learn to turn the string labels into one-hot vectors.

In [11]:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# First, declare a new encoder
# (in scikit-learn >= 1.2 this argument is named `sparse_output`; `sparse` was removed in 1.4)
encoder = OneHotEncoder(sparse = False)

# Second, let the encoder learn all categories in the given dataset
encoder = encoder.fit(np.reshape(dataset['train']['label'], (-1, 1)))

In [12]:
LABEL_COUNT = len(encoder.categories_[0])
LABEL_COUNT

3

### 3.5 Play with OneHotEncoder

#### 3.5.1 Check which categories the encoder has learned

In [13]:
print(encoder.categories_)

[array(['negative', 'neutral', 'positive'], dtype='<U8')]


#### 3.5.2 One-hot encode labels

Use `encoder.transform` to get the one-hot code of a label.

In [14]:
# TODO: Check the code for each label
print(encoder.transform([[...], [...], [...]]))

[[0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]]


#### 3.5.3 Decode one-hot code

Use `encoder.inverse_transform` to map a one-hot code back to its label.

In [15]:
print(encoder.inverse_transform([[0, 0, 1]]))

[['positive']]
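
To show what the encoder does conceptually, here is a pure-Python sketch of one-hot encoding and decoding. It assumes the categories are kept in sorted order, as in the `encoder.categories_` output above. The sample labels and the helper names `one_hot` and `decode` are made up for illustration.

```python
labels = ["neutral", "positive", "negative", "neutral"]  # made-up sample

# Categories are stored in sorted order, as in `encoder.categories_`.
categories = sorted(set(labels))


def one_hot(label):
    """Return a vector with 1.0 at the position of `label`, 0.0 elsewhere."""
    return [1.0 if label == c else 0.0 for c in categories]


def decode(vector):
    """Return the category at the position of the largest value."""
    return categories[vector.index(max(vector))]


print(categories)
print(one_hot("positive"))
print(decode([0.0, 1.0, 0.0]))
```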


### 3.6 Process the whole data

With the `tokenizer` and `encoder` prepared, we can write a function to process the whole dataset.

#### 3.6.1 Define preprocess function

In [16]:
def preprocess(dataslice):
    """
    Input: a batch of your dataset
    Example: { 'text': ['sentence1', 'sentence2', ...],
               'label': ['label1', 'label2', ...] }
    """

    embeddings = tokenizer(
                            dataslice['text'],
                            max_length=128,
                            padding="max_length",
                            is_split_into_words=False,
                            truncation=True,
                            return_tensors='pt',
                          )
    labels = encoder.transform(np.reshape(dataslice['label'], (-1, 1)))

    """
    Output: a batch of processed dataset
    Example: {
                'input_ids': ...,
                'attention_mask': ...,
                'label': ...
              }
    """
    return {**embeddings, 'label': labels}

#### 3.6.2 Apply the preprocess function to the whole dataset

In [17]:
processed_data = dataset.map(
                             preprocess,    # your processing function
                             batched = True # Process in batches so it can be faster
                            )

Loading cached processed dataset at /home/nlplab/kedy/.cache/huggingface/datasets/csv/default-b6e650acb670b9ab/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317/cache-541a61ccb849e795.arrow


In [18]:
# Take a look at processed dataset
print(processed_data)
processed_data['train'][0]

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 20632
    })
})


{'label': [0.0, 1.0, 0.0],
 'text': "Picturehouse's, Pink Floyd's, 'Roger Waters: The Walll - opening 29 Sept is now making waves. Watch the trailer on Rolling Stone - look...",
 'input_ids': [101, 3861, 4580, 1005, 1055, 1010, 5061, 12305, 1005, 1055,
  1010, 1005, 5074, 5380, 1024, 1996, 2813, 2140, 1011, 3098,
  2756, 17419, 2003, 2085, 2437, 5975, 1012, 3422, 1996, 9117,
  2006, 5291, 2962, 1011, 2298, 1012, 1012, 1012, 102,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1,
  1,
  1,
  1,
  1,
  1,
  1

### 3.7 DataCollator

To batch and pad examples at training time, we can use the data collator classes provided by `transformers`.

 - [transformers.DataCollatorWithPadding](https://huggingface.co/docs/transformers/master/en/main_classes/data_collator#transformers.DataCollatorWithPadding)

In [19]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
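
As a rough pure-Python sketch of the idea behind a padding collator: it takes a list of individual examples and returns one batch in which every sequence is padded to the length of the longest one. The function `collate` and the toy examples are hypothetical, and the real class also converts the batch to tensors. Because `preprocess` above already pads every example to 128 tokens, there is little padding left to do in this tutorial.

```python
def collate(features, pad_id=0):
    """Turn a list of examples into one batch padded to the longest example."""
    longest = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = longest - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * n_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
    return batch


toy_features = [
    {"input_ids": [101, 2023, 102], "attention_mask": [1, 1, 1]},
    {"input_ids": [101, 2023, 2061, 1011, 102], "attention_mask": [1, 1, 1, 1, 1]},
]
toy_batch = collate(toy_features)
print(toy_batch["input_ids"])
print(toy_batch["attention_mask"])
```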

## Step 4: Training

### Preparation

We can load the pretrained model with `transformers`.

In general, you need to build your own model on top of BERT to use it for a downstream task.

Fortunately, sequence classification is a common task. With the `transformers` library, everything can be done in two steps:

1. Import the `AutoModelForSequenceClassification` class.
2. Load the pretrained model.

#### 4.1 Load `BERT` model

In [20]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME,
                                                           num_labels = LABEL_COUNT)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

### 4.2 Split train/val data

The `Dataset` class we prepared before already has the `train_test_split` method. You can use it to split your dataset.

Document:
 - [datasets.Dataset - Sort, shuffle, select, split, and shard](https://huggingface.co/docs/datasets/process.html#sort-shuffle-select-split-and-shard)


In [21]:
# [ TODO ] Choose the validation data size
train_val_dataset = processed_data['train'].train_test_split(test_size = ...)

In [22]:
# Take a look at split data
print(train_val_dataset)

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 18568
    })
    test: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 2064
    })
})
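
Conceptually, a train/validation split shuffles the row indices and cuts them at a chosen point. The sketch below assumes the validation size is the requested fraction rounded up. The function `split_indices`, the seed, and the fraction `0.1` are illustrative choices, not necessarily the values used to produce the output above, although with 20632 rows this fraction happens to give the same sizes.

```python
import math
import random


def split_indices(n_rows, test_fraction, seed=0):
    """Shuffle row indices and split them into (train, validation) lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_fraction)  # assumed rounding rule
    return indices[n_test:], indices[:n_test]


train_idx, val_idx = split_indices(20632, 0.1)
print(len(train_idx), len(val_idx))
```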


### 4.3 Set up training parameters

#### 4.3.1 Import `TrainingArguments`, `Trainer` from `transformers`

In [23]:
from transformers import TrainingArguments, Trainer

#### 4.3.2 Set training properties and initialize a trainer

In [24]:
# TODO: Setup training details
BATCH_SIZE = ...
EPOCH = ...
MODEL_FOLDER = "finetuned"
training_args = TrainingArguments(
    output_dir = f"./model/{MODEL_FOLDER}",
    learning_rate = 2e-4,
    per_device_train_batch_size = BATCH_SIZE,
    per_device_eval_batch_size = BATCH_SIZE,
    num_train_epochs = EPOCH,
    # You can also set other parameters here
)

trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_val_dataset["train"],
    eval_dataset = train_val_dataset["test"],
    tokenizer = tokenizer,
    data_collator = data_collator,
    # You can also set other parameters
)

### 4.4 Training

Training is pretty easy. Simply ask the trainer to train the model for you!

In [25]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 18568
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 1161
  Number of trainable parameters = 66955779
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
500,0.5595
1000,0.5087


Saving model checkpoint to ./model/finetuned/checkpoint-500
Configuration saved in ./model/finetuned/checkpoint-500/config.json
Model weights saved in ./model/finetuned/checkpoint-500/pytorch_model.bin
tokenizer config file saved in ./model/finetuned/checkpoint-500/tokenizer_config.json
Special tokens file saved in ./model/finetuned/checkpoint-500/special_tokens_map.json
Saving model checkpoint to ./model/finetuned/checkpoint-1000
Configuration saved in ./model/finetuned/checkpoint-1000/config.json
Model weights saved in ./model/finetuned/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in ./model/finetuned/checkpoint-1000/tokenizer_config.json
Special tokens file saved in ./model/finetuned/checkpoint-1000/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=1161, training_loss=0.5272752628769986, metrics={'train_runtime': 114.5589, 'train_samples_per_second': 162.083, 'train_steps_per_second': 10.135, 'total_flos': 614924630673408.0, 'train_loss': 0.5272752628769986, 'epoch': 1.0})

You can see that the Trainer saves checkpoints during training, so you can load the model from one of them if you want to fall back to a specific version.
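
The step count and checkpoint names in the log can be reproduced with a little arithmetic. This sketch assumes a single device and no gradient accumulation, as reported in the log, and a checkpoint interval of 500 steps, which is inferred from the `checkpoint-500` and `checkpoint-1000` folders. Both function names are hypothetical.

```python
import math


def total_steps(num_examples, batch_size, epochs):
    """Optimization steps when every batch triggers one update."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


def checkpoint_steps(n_steps, save_every=500):
    """Steps at which a checkpoint would be written (assumed interval)."""
    return list(range(save_every, n_steps + 1, save_every))


n_steps = total_steps(num_examples=18568, batch_size=16, epochs=1)
print(n_steps)
print(checkpoint_steps(n_steps))
```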

### 4.5 Save for future use

In [26]:
model.save_pretrained(os.path.join('model', 'finetuned'))

Configuration saved in model/finetuned/config.json
Model weights saved in model/finetuned/pytorch_model.bin


## Step 5: Prediction

### 5.1 Load finetuned model

In [27]:
from transformers import AutoModelForSequenceClassification

mymodel = AutoModelForSequenceClassification.from_pretrained(os.path.join('model', 'finetuned'))

loading configuration file model/finetuned/config.json
Model config DistilBertConfig {
  "_name_or_path": "model/finetuned",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "multi_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.24.0",
  "vocab_size": 30522
}

loading weights file model/finetuned/pytorch_model.bin
All model checkpoint weights were used when initializing DistilBertForSequenceClassification.

All 

### 5.2 Get the prediction

In [28]:
examples = [
    # neutral
    "All of you people who're saying The Weekend is the next Michael Jackson, Go to sleep, you got school tomorrow.",
    # negative
    "Here I am saying Kanye isn't a bad guy when these people just gave you a fucking award that Michael Jackson got c'mon bro",
    # positive
    "@MariahCarey may he R.I.P. Happy Birthday Michael Jackson . :)",
]

In [29]:
# Transform the sentences into embeddings
input = tokenizer(examples, truncation=True, padding=True, return_tensors="pt")
# Get the output
logits = mymodel(**input).logits
logits

tensor([[-2.2525, -0.4102,  0.0044],
        [-0.0196, -0.5089, -2.0296],
        [-3.4792, -1.5006,  1.2738]], grad_fn=<AddmmBackward0>)

### 5.3 Transform logits with softmax activation

Use the softmax function to turn the logits into probability-like scores that sum to 1 for each sentence.

In [30]:
from torch import nn

predicts = nn.functional.softmax(logits, dim = -1)
predicts

tensor([[0.0593, 0.3742, 0.5665],
        [0.5724, 0.3509, 0.0767],
        [0.0081, 0.0583, 0.9337]], grad_fn=<SoftmaxBackward0>)
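
For reference, softmax exponentiates each logit and divides by the sum of the exponentials. The sketch below applies a plain-Python version to the first row of logits printed in section 5.2. The helper `softmax` is written here only for illustration, and the inputs are the rounded values from the output, so the result agrees with the tensor above only up to rounding.

```python
import math


def softmax(logits):
    """Map a list of logits to positive scores that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([-2.2525, -0.4102, 0.0044])
print([round(p, 4) for p in probs])
```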

### 5.4 Transform logits back to labels

In [31]:
def predict2label(logits, input):
    predictions = np.zeros(shape=(len(input['input_ids']), LABEL_COUNT))
    predict_id = logits.argmax(dim = 1)
    predictions[np.arange(predict_id.size()[0]), predict_id] = 1
    return predict_id, encoder.inverse_transform(predictions).flatten()

In [32]:
predict2label(predicts, input)

(tensor([2, 0, 2]), array(['positive', 'negative', 'positive'], dtype='<U8'))

## Step 6: Evaluation

Load the testing data and calculate your accuracy.


### 6.1 Load test data

In [34]:
test_data = load_dataset('csv', data_files = os.path.join('data', 'test.csv'))['train']
input = tokenizer(test_data['text'], truncation = True, padding = True, return_tensors = 'pt')

Using custom data configuration default-2b08c652a1924c67
Found cached dataset csv (/home/nlplab/kedy/.cache/huggingface/datasets/csv/default-2b08c652a1924c67/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317)
100%|██████████| 1/1 [00:00<00:00, 330.18it/s]


In [35]:
logits = mymodel(**input).logits
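
Passing the whole test set through the model in one call can exhaust memory on larger datasets. One common alternative, sketched here, is to iterate over the sentences in fixed-size chunks and tokenize and predict one chunk at a time, typically inside `torch.no_grad()` so that no gradients are stored. The generator `batches` is hypothetical and the sentences are placeholders.

```python
def batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = [f"sentence {i}" for i in range(10)]  # placeholder sentences
sizes = [len(chunk) for chunk in batches(texts, batch_size=4)]
print(sizes)
```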

In [36]:
predicts = nn.functional.softmax(logits, dim = -1)

In [37]:
predict_id, predict_label = predict2label(predicts, input)

In [38]:
for idx, (sent, label) in enumerate(zip(test_data['text'], predict_label)):
    if idx >= 5: break
    print(f'{label}: {sent}')

neutral: 05 Beat it - Michael Jackson - Thriller (25th Anniversary Edition) [HD] http://t.co/A4K2B86PBv
neutral: Jay Z joins Instagram with nostalgic tribute to Michael Jackson: Jay Z apparently joined Instagram on Saturday and.. http://t.co/Qj9I4eCvXy
positive: Michael Jackson: Bad 25th Anniversary Edition (Picture Vinyl): This unique picture disc vinyl includes the original 1 http://t.co/fKXhToAAuW
positive: I liked a @YouTube video http://t.co/AaR3pjp2PI One Direction singing "Man in the Mirror" by Michael Jackson in Atlanta, GA [June 26,
neutral: 18th anniv of Princess Diana's death. I still want to believe she is living on a private island away from the public. With Michael Jackson.


### Accuracy

$
\text{accuracy} = \frac{\#\,\text{predictions that exactly match the label}}{\#\,\text{total}}
$

Example:
```
Prediction:   Pos Pos Neg Neu Neu
Ground truth: Pos Neu Neg Neu Neg
               ^       ^   ^
```

The accuracy is $\frac{3}{5} = 0.6$
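
The worked example above can be checked in a few lines. The sketch also counts (true label, predicted label) pairs, which is one simple way to see which classes are confused with each other. The variable names are chosen for this illustration only.

```python
from collections import Counter

predictions = ["Pos", "Pos", "Neg", "Neu", "Neu"]
ground_truth = ["Pos", "Neu", "Neg", "Neu", "Neg"]

# Accuracy: fraction of positions where prediction and label agree.
matches = sum(p == t for p, t in zip(predictions, ground_truth))
example_accuracy = matches / len(ground_truth)

# Count (true label, predicted label) pairs to see which classes get confused.
confusion = Counter(zip(ground_truth, predictions))

print(example_accuracy)
for (true_label, predicted), count in sorted(confusion.items()):
    print(f"true={true_label} predicted={predicted}: {count}")
```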

In [39]:
correct = 0
total = 0

for predict, label in zip(predict_label, test_data['label']):
    if predict == label:
        correct += 1
    total += 1

In [40]:
accuracy = correct / total
accuracy

0.5571142284569138