
# Practical use case: sentiment extraction with BERT
*Afshine Amidi, Shervine Amidi*

*Deep Learning for NLP - Part II, Stanford ICME Summer workshop 2021*

## Setup

### Pretty printing in Colab

First, let's set one detail: make Colab print arrays one element per line. To do so, we use the `pprint` package.

In [None]:
import pprint
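As a quick illustration of what `pprint` does, here is a minimal sketch on a made-up list (`names` is a placeholder, not part of the workshop data). When the formatted value does not fit within the default width of 80 characters, each element goes on its own line.

```python
import pprint

# Hypothetical list, only used to illustrate the formatting.
names = ['acronym_identification', 'ade_corpus_v2', 'adversarial_qa',
         'aeslc', 'afrikaans_ner_corpus']

# The list is longer than 80 characters once formatted, so pprint breaks it
# into one element per line.
formatted = pprint.pformat(names)
print(formatted)
```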

### Packages installation

Many widely used packages come pre-installed with the default Colab runtime, so we don't have to install everything at initialization.

When a package is missing, it is simple to add: use the `!` symbol followed by the command you would enter in a terminal session to install it.

Here, we do this procedure for all HuggingFace-related modules needed for data and model preparation.

In [None]:
!pip install datasets transformers
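To check whether a package is already available before or after installing it, one option is the standard library's `importlib.util.find_spec`. The helper name `is_installed` below is our own and only serves as an illustration.

```python
import importlib.util

def is_installed(module_name: str) -> bool:
  """Returns True if the module can be found in the current environment."""
  return importlib.util.find_spec(module_name) is not None

# Report the status of the two packages installed above.
for name in ['datasets', 'transformers']:
  status = 'available' if is_installed(name) else 'missing'
  print(f'{name}: {status}')
```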

### Imports

For better readability, we gather all import statements in one place.

In [None]:
import datasets
import random
import tensorflow as tf
import transformers

## Data

In [None]:
pprint.pprint(datasets.list_datasets()[:30])

['acronym_identification',
 'ade_corpus_v2',
 'adversarial_qa',
 'aeslc',
 'afrikaans_ner_corpus',
 'ag_news',
 'ai2_arc',
 'air_dialogue',
 'ajgt_twitter_ar',
 'allegro_reviews',
 'allocine',
 'alt',
 'amazon_polarity',
 'amazon_reviews_multi',
 'amazon_us_reviews',
 'ambig_qa',
 'amttl',
 'anli',
 'app_reviews',
 'aqua_rat',
 'aquamuse',
 'ar_cov19',
 'ar_res_reviews',
 'ar_sarcasm',
 'arabic_billion_words',
 'arabic_pos_dialect',
 'arabic_speech_corpus',
 'arcd',
 'arsentd_lev',
 'art']


### Loading

#### From the public

In [None]:
NUM_LABELS = 2

In [None]:
train_dataset = datasets.load_dataset('imdb', split='train')
test_dataset = datasets.load_dataset('imdb', split='test')

Downloading and preparing dataset imdb/plain_text (download: 80.23 MiB, generated: 127.02 MiB, post-processed: Unknown size, total: 207.25 MiB) to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a...


Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a. Subsequent calls will reuse this data.


Reusing dataset imdb (/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a)


### Inspection

In [None]:
from typing import Optional

def print_samples_from_dataset(dataset: datasets.Dataset,
                               target_label: Optional[int] = None,
                               max_print_count: int = 1,
                               shuffle: bool = True) -> None:
  """Prints samples from the dataset, optionally restricted to one label."""
  # Shuffle the dataset with a random integer seed if the parameter is set
  # to True.
  if shuffle:
    dataset = dataset.shuffle(seed=random.randint(0, 10**7))

  # Loop over dataset samples.
  print_count = 0
  for sample in dataset:
    # Ignore labels that do not correspond to the one we are looking for.
    # When no target label is given, every sample is kept.
    if target_label is not None and sample['label'] != target_label:
      continue

    pprint.pprint(sample)
    print_count += 1

    # Stop condition.
    if print_count == max_print_count:
      break

In [None]:
print_samples_from_dataset(train_dataset, target_label=0)

{'label': 0,
 'text': '(aka: DEMONS III)<br /><br />Made for Italian TV although shot in '
         'English and was never meant to be a sequel to the earlier DEMONS '
         'films. It was supposed to be simply titled, THE OGRE, which is how '
         'director Lamberto Bava had released it.<br /><br />An American '
         'family rents an Italian villa for the summer. The woman (Virginia '
         'Bryant) has recurring dreams of herself as a little girl going down '
         'to the old wine-cellar of this villa an encountering this '
         'cocoon-like structure hanging down from the ceiling. It glows and is '
         'covered in cobwebs and has what looks like spider or insect legs '
         'hanging down from it. It drips what looks like green paint.<br /><br '
         "/>Of course the husband doesn't believe any of this. The villa just "
         'is old and creaks and makes strange noises in the middle of the '
         'night and she should just ignore it.<br /><br

In [None]:
print_samples_from_dataset(train_dataset, target_label=1)

{'label': 1,
 'text': 'I saw this film at a store in the cheap section. I actually vividly '
         'remembered seeing the commercials and trailer for it years ago. I '
         'thought "What the hey\' and bought it, basically because the plot '
         'sounded interesting and Claire Danes has always been someone of '
         'talent in my eyes (this was also before I became a huge Kate '
         "Beckinsale fan).<br /><br />So it's about two girls who sneak off to "
         'a vacation in Bangkok, get busted for narcotics (which they are '
         'innocent of) and then are sent to a Thailand prison. The film '
         'follows what will happen to them and at times questions their '
         'innocence.<br /><br />Both Claire Danes and Kate Beckinsale give '
         'great performances, and the plot of this film wraps itself up '
         'unconventionally, and raises some nice moral discussion '
         'questions.<br /><br />I think this is a solid good film, but there '

### Statistics

In [None]:
def count_samples(dataset, target_label=None):
  total_count = 0
  for i, sample in enumerate(dataset):
    if target_label is not None and sample['label'] != target_label:
      continue
    total_count += 1
  return total_count
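The same per-label counts can also be obtained in a single pass with `collections.Counter`. This is a minimal sketch on a toy list of dictionaries that mimics the `{'label', 'text'}` structure of the samples printed above; `count_labels` and `toy_samples` are hypothetical names introduced for the example.

```python
import collections

def count_labels(samples):
  """Counts how many samples carry each label, in a single pass."""
  return collections.Counter(sample['label'] for sample in samples)

# Toy samples, made up for illustration.
toy_samples = [
    {'label': 0, 'text': 'Dull and far too long.'},
    {'label': 1, 'text': 'A solid, well-acted film.'},
    {'label': 1, 'text': 'I would watch it again.'},
]

label_counts = count_labels(toy_samples)
print(label_counts[0], 'negative,', label_counts[1], 'positive')
```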

In [None]:
print(f'Training set contains {len(train_dataset)} samples of which')
print(f'  {count_samples(train_dataset, target_label=0)} are negative and')
print(f'  {count_samples(train_dataset, target_label=1)} are positive.')
print()
print(f'Test set contains {len(test_dataset)} samples of which')
print(f'  {count_samples(test_dataset, target_label=0)} are negative and')
print(f'  {count_samples(test_dataset, target_label=1)} are positive.')

Training set contains 25000 samples of which
  12500 are negative and
  12500 are positive.

Test set contains 25000 samples of which
  12500 are negative and
  12500 are positive.


### Tokenization

#### Loading

First, we start by loading a pre-trained WordPiece tokenizer.

In [None]:
tokenizer = transformers.BertTokenizerFast.from_pretrained('bert-base-uncased')


#### Inspection

In [None]:
vocab_dict = tokenizer.get_vocab()
sorted_vocab_list = sorted(vocab_dict.items(), key=lambda string_to_id: string_to_id[1])

Unused tokens and special characters.

In [None]:
print(sorted_vocab_list[:1000])

Subwords sorted by order of decreasing frequency.

In [None]:
pprint.pprint(sorted_vocab_list[2000:2020])
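To get an intuition for how these subwords are used, recall that WordPiece segments each word greedily, always taking the longest vocabulary entry that matches the remaining characters, and marks continuation pieces with a `##` prefix. The sketch below is a simplified version of that lookup on a tiny made-up vocabulary (`toy_vocab` is hypothetical); the real tokenizer also handles lowercasing, punctuation splitting and special tokens.

```python
def wordpiece_tokenize(word, vocab, unk_token='[UNK]'):
  """Splits one word into subwords by greedy longest-match-first lookup."""
  tokens = []
  start = 0
  while start < len(word):
    end = len(word)
    match = None
    # Try the longest remaining substring first, then shrink it.
    while start < end:
      piece = word[start:end]
      if start > 0:
        piece = '##' + piece  # Continuation pieces carry the '##' prefix.
      if piece in vocab:
        match = piece
        break
      end -= 1
    if match is None:
      return [unk_token]  # No piece matches: the whole word is unknown.
    tokens.append(match)
    start = end
  return tokens

# Tiny vocabulary, made up for illustration.
toy_vocab = {'play', '##ing', '##ed', 'un', '##play', '##able'}

print(wordpiece_tokenize('playing', toy_vocab))     # ['play', '##ing']
print(wordpiece_tokenize('unplayable', toy_vocab))  # ['un', '##play', '##able']
print(wordpiece_tokenize('xyz', toy_vocab))         # ['[UNK]']
```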

#### Fit

Now that we have checked that the loaded tokenizer looks reasonable, we can apply it to our dataset to produce the tokens that we will later feed to the model. Following HuggingFace's [docs](https://huggingface.co/transformers/preprocessing.html), we choose:
- a padding parameter such that the input is padded up to the size expected by the model
- a truncation parameter such that the input is cut whenever it exceeds the model's expected input size.

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
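To make the two choices concrete, here is a simplified pure-Python sketch of what padding and truncation do to a list of token ids. The ids 101, 102 and 0 are those of `[CLS]`, `[SEP]` and `[PAD]` in `bert-base-uncased`; the helper `pad_or_truncate`, the content ids and the tiny `max_length` are made up for illustration (the actual model expects 512 tokens).

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # Special token ids in bert-base-uncased.

def pad_or_truncate(token_ids, max_length):
  """Adds special tokens, then truncates or pads to exactly max_length."""
  # Truncation: keep room for the [CLS] and [SEP] tokens.
  kept = token_ids[:max_length - 2]
  input_ids = [CLS_ID] + kept + [SEP_ID]
  attention_mask = [1] * len(input_ids)

  # Padding: fill the remaining positions and mask them out.
  n_pad = max_length - len(input_ids)
  return {
      'input_ids': input_ids + [PAD_ID] * n_pad,
      'attention_mask': attention_mask + [0] * n_pad,
  }

# A short input gets padded, a long one gets truncated.
short = pad_or_truncate([7592, 2088], max_length=6)
long = pad_or_truncate(list(range(1000, 1010)), max_length=6)
print(short)
print(long)
```

The attention mask tells the model which positions hold real tokens (1) and which ones are padding (0).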

This operation returns the tokenized version of the dataset as a `Dataset` object as well, which avoids loading all tokens in memory. Instead, the tokenized version of the data is saved on disk and ready to be loaded at training time.

The `batched` boolean enables processing samples in batches (of default size 1000), which is much more efficient computationally.

In [None]:
tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_test_dataset = test_dataset.map(tokenize_function, batched=True)
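The batching behavior can be pictured with the following simplified sketch, which is not how `Dataset.map` is implemented: the real method passes a dictionary of columns to the function and caches the result on disk. The names `map_batched` and `toy_function` are hypothetical.

```python
def map_batched(samples, function, batch_size=1000):
  """Applies `function` to consecutive slices of at most batch_size samples."""
  results = []
  for start in range(0, len(samples), batch_size):
    batch = samples[start:start + batch_size]
    results.extend(function(batch))
  return results

calls = []

def toy_function(batch):
  calls.append(len(batch))  # Record the size of each batch received.
  return [text.lower() for text in batch]

texts = [f'Review number {i}' for i in range(2500)]
lowercased = map_batched(texts, toy_function)
print(calls)  # [1000, 1000, 500]
```

The function is called 3 times instead of 2500 times, which is where the efficiency gain comes from when the function itself is vectorized, as fast tokenizers are.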

In [None]:
tokenized_train_dataset

In [None]:
tokenizer.model_input_names

In [None]:
BATCH_SIZE = 64

def get_tf_dataset(hf_dataset):
  """Transforms HuggingFace dataset into a TF dataset feedable to Keras."""
  # Objects returned by the HuggingFace Dataset object are now TensorFlow
  # tensors.
  hf_dataset = hf_dataset.with_format('tensorflow')

  # Prepare input.
  X = {col: hf_dataset[col].to_tensor() for col in tokenizer.model_input_names}
  y = hf_dataset['label']

  # Create TensorFlow dataset.
  tf_dataset = tf.data.Dataset.from_tensor_slices((X, y))

  # VERY important since the IMDB data groups positive and negative reviews
  # together: shuffle the order in which the samples are generated from the
  # dataset.
  tf_dataset = tf_dataset.shuffle(len(tf_dataset))

  # Specify the batch size to be used in the model.
  tf_dataset = tf_dataset.batch(BATCH_SIZE)

  return tf_dataset
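To see why the shuffling step matters, consider this small pure-Python sketch with toy labels grouped by class, as in IMDB. The sizes and the helper names are made up for illustration.

```python
import random

def make_batches(labels, batch_size):
  """Splits a list of labels into consecutive batches."""
  return [labels[i:i + batch_size] for i in range(0, len(labels), batch_size)]

def count_single_class_batches(batches):
  """Counts batches in which all samples share the same label."""
  return sum(len(set(batch)) == 1 for batch in batches)

# Toy labels grouped by class: all negatives first, then all positives.
labels = [0] * 8 + [1] * 8

# Without shuffling, every batch contains a single class.
unshuffled_batches = make_batches(labels, batch_size=4)
print(count_single_class_batches(unshuffled_batches))  # 4

# With shuffling, batches typically mix both classes.
shuffled_labels = labels[:]
random.Random(0).shuffle(shuffled_labels)
shuffled_batches = make_batches(shuffled_labels, batch_size=4)
print(shuffled_batches)
```

Training on batches that contain a single class pushes the model toward always predicting that class, which is why the shuffle is applied before batching.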

In [None]:
tf_tokenized_train_dataset = get_tf_dataset(tokenized_train_dataset)
tf_tokenized_test_dataset = get_tf_dataset(tokenized_test_dataset)

In [None]:
next(iter(tf_tokenized_train_dataset))

## Model

### Initialization

#### From a pre-trained checkpoint

In [None]:
model = transformers.TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=NUM_LABELS)

#### Input/output sanity check
Let's do one forward pass.

In [None]:
inputs = tokenizer("I LOVE French", return_tensors="tf")
inputs["labels"] = tf.reshape(tf.constant(1),  # Label 1: positive sentiment.
                              (-1, 1)  # Resizing to have a shape of (batch_size, label) == (1, 1)
                              )
outputs = model(inputs)

In [None]:
print('Inputs')
pprint.pprint(inputs)
print()
print('Outputs')
pprint.pprint(outputs)
print()
print('Logits')
pprint.pprint(outputs.logits)
print()
print('Loss')
pprint.pprint(outputs.loss)

Now let's check that the computed loss is consistent with the model output.

In [None]:
# Output logits.
print('Logits given by the model')
print(outputs.logits)
print()

# Pass output logit to a softmax layer.
# softmax_values = exp(logit_values) / sum(exp(logit_values))
softmaxed_output = tf.nn.softmax(outputs.logits)
print('After softmax')
print(softmaxed_output)
print()

# Apply the cross-entropy formula. With two classes, it reduces to the binary
# cross-entropy, where p is the predicted probability of the positive class:
# loss = - [ y log(p) + (1 - y) log(1 - p) ]
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)(inputs['labels'], softmaxed_output)
print('After applying the binary cross-entropy formula')
print(loss)
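The same check can be done by hand. The sketch below uses made-up logits (`toy_logits` is hypothetical, not the output of the model above) and verifies that, with two classes, the sparse categorical cross-entropy on the softmax output coincides with the binary cross-entropy formula.

```python
import math

def softmax(logits):
  """softmax_values = exp(logit_values) / sum(exp(logit_values))."""
  shift = max(logits)  # Subtracting the max improves numerical stability.
  exps = [math.exp(x - shift) for x in logits]
  total = sum(exps)
  return [e / total for e in exps]

def sparse_categorical_cross_entropy(label, probabilities):
  """Negative log-probability assigned to the true class."""
  return -math.log(probabilities[label])

def binary_cross_entropy(label, p_positive):
  """loss = - [ y log(p) + (1 - y) log(1 - p) ]."""
  return -(label * math.log(p_positive)
           + (1 - label) * math.log(1 - p_positive))

toy_logits = [-0.3, 0.8]  # Made-up logits for (negative, positive).
probabilities = softmax(toy_logits)

for label in (0, 1):
  loss_categorical = sparse_categorical_cross_entropy(label, probabilities)
  loss_binary = binary_cross_entropy(label, probabilities[1])
  print(label, loss_categorical, loss_binary)
```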

#### Inspect the model

The summary of the model can be found with the command:

In [None]:
model.summary()

Let's inspect the parameters with which the BERT layer was trained.

In [None]:
model.layers[0].get_config()

HuggingFace's standard `TFBertForSequenceClassification` model adds a dropout layer with rate $p=0.1$

In [None]:
model.layers[1].get_config()

as well as a fully-connected layer of input size 768 and output size `NUM_LABELS`, i.e. 2.

In [None]:
model.layers[2].get_config()
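As a sanity check on the size of this classification head, we can count its parameters by hand, assuming the fully-connected layer uses a bias term (the Keras default): one weight per (input, output) pair plus one bias per output.

```python
HIDDEN_SIZE = 768  # Input size of the fully-connected layer.
NUM_LABELS = 2     # Output size of the fully-connected layer.

n_weights = HIDDEN_SIZE * NUM_LABELS
n_biases = NUM_LABELS
n_head_parameters = n_weights + n_biases
print(n_head_parameters)  # 1538
```

This number should match the parameter count reported for the last layer in `model.summary()`.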

In [None]:
tf.keras.utils.plot_model(model)

### Finetuning

#### Freeze pre-trained weights

First, we inspect the layers that we have at hand.

In [None]:
for layer in model.layers:
  print(type(layer))

Now, we freeze all layers, except for the last one.

In [None]:
for layer in model.layers:
  # For all layers except the last one, tell Keras that the weights
  # should remain frozen. Effectively, this operation will only apply
  # to the main BERT layer 'TFBertMainLayer'.
  if not isinstance(layer, tf.keras.layers.Dense):
    layer.trainable = False

Let's check that the only trainable parameters are those of the last layer.

In [None]:
model.summary()

#### Training

In [None]:
LEARNING_RATE = 5e-5

# Disable TensorFlow warnings that are unrelated to the
# computations.
tf.get_logger().setLevel('ERROR')

In [None]:
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=tf.metrics.SparseCategoricalAccuracy(),
)

In [None]:
model.fit(tf_tokenized_train_dataset, validation_data=tf_tokenized_test_dataset, epochs=1)

In [None]:
# The training process reaches ~95% train accuracy after training on all model
# weights for ~1-2 epochs.

#### Predictions after training

##### Load already fine-tuned model

In [None]:
!wget https://stanford.edu/~afshine/imdb_model_after_several_epochs.zip

In [None]:
import zipfile

file_name = 'imdb_model_after_several_epochs.zip'

with zipfile.ZipFile(file_name, 'r') as zip_ref:
  zip_ref.extractall()

In [None]:
# We load the model by simply giving the file path: 
model = transformers.TFBertForSequenceClassification.from_pretrained("imdb_model_after_several_epochs")

##### Metrics

In [None]:
# This package is handy to track the progress of for loops.
import tqdm

def compute_accuracy(tf_tokenized_dataset):
  """Computes model accuracy over a dataset."""
  # Initialize counters.
  n_samples = 0
  total_correct_pred = 0

  # Iterate over entire dataset.
  for sample in tqdm.tqdm(tf_tokenized_dataset):
    # Compute model outputs. Please note that the outputs are logits and not
    # probabilities, the latter being computed at the loss stage.
    output_logits = model.predict(sample).logits

    # Get predicted and true labels.
    pred_labels = tf.argmax(output_logits, axis=1)
    true_labels = sample[1]

    # A prediction is accurate when the predicted label equals the true label.
    correct_pred = tf.reduce_sum(tf.cast(pred_labels == true_labels, tf.float32)).numpy()

    # Some bookkeeping.
    total_correct_pred += correct_pred
    n_samples += len(true_labels)

  # Proportion of correct predictions.
  return total_correct_pred / n_samples

In [None]:
# This is equivalent to just doing model.evaluate(tf_tokenized_test_dataset).
# The function above gives some more details about what happens under the hood.
compute_accuracy(tf_tokenized_test_dataset)
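The core of this computation can be sketched without TensorFlow on a handful of made-up logits (`toy_logits` and `toy_labels` are hypothetical). Since softmax preserves the ordering of its inputs, taking the argmax over logits gives the same predicted label as taking it over probabilities.

```python
def argmax(values):
  """Returns the index of the largest value."""
  return max(range(len(values)), key=lambda i: values[i])

def accuracy(batch_logits, true_labels):
  """Proportion of samples whose predicted label equals the true label."""
  pred_labels = [argmax(logits) for logits in batch_logits]
  n_correct = sum(int(p == t) for p, t in zip(pred_labels, true_labels))
  return n_correct / len(true_labels)

# Made-up logits for (negative, positive) and their true labels.
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
toy_labels = [0, 1, 0, 0]

print(accuracy(toy_logits, toy_labels))  # 0.75
```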

##### Playground

In [None]:
SENTIMENTS = ['negative', 'positive']

def generate_sample_and_prediction(tf_tokenized_dataset):
  """Generates a sample and its predictions."""
  sample = next(iter(tf_tokenized_dataset))
  output_logits = model.predict(sample).logits
  output_probabilities = tf.nn.softmax(output_logits)

  tokenized_sentence = sample[0]['input_ids'][0]
  true_label = sample[1][0].numpy()
  pred_distribution = output_probabilities[0,:].numpy()

  print('Tokenized sentence: ')
  pprint.pprint(tokenizer.decode(tokenized_sentence))
  print()
  print(f'This review is labeled as {SENTIMENTS[true_label]}.')
  print()
  print('Model predicts:')
  print(f'{100 * pred_distribution[0]:.2f} % negative')
  print(f'{100 * pred_distribution[1]:.2f} % positive')

In [None]:
generate_sample_and_prediction(tf_tokenized_test_dataset)

## References
### Papers
- Devlin et al., 2018. *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding*. [arXiv:1810.04805](https://arxiv.org/pdf/1810.04805.pdf).

### Datasets
- Maas et al., 2011. *Large Movie Review Dataset*. [ai.stanford.edu/~amaas/data/sentiment/](https://ai.stanford.edu/~amaas/data/sentiment/). Unlicensed.

### Posts
- Amidi, 2018. *A detailed example of how to use data generators with Keras*. [stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly](https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly). Unlicensed.
- Amidi, 2018. *A detailed example of how to generate your data in parallel with PyTorch*. [stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel](https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel). Unlicensed.
- HuggingFace, 2020. *Fine-tuning a pretrained model*. [huggingface.co/transformers/training](https://huggingface.co/transformers/training.html). [Apache License 2.0](https://github.com/huggingface/transformers/blob/master/LICENSE).
- HuggingFace, 2020. *BERT*. [huggingface.co/transformers/model_doc/bert](https://huggingface.co/transformers/model_doc/bert.html). [Apache License 2.0](https://github.com/huggingface/transformers/blob/master/LICENSE).
- HuggingFace, 2020. *Auto Classes*. [huggingface.co/transformers/model_doc/auto](https://huggingface.co/transformers/model_doc/auto.html). [Apache License 2.0](https://github.com/huggingface/transformers/blob/master/LICENSE).