# NLP Seminar 5: Pretrained Transformers and Transfer-Learning

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, losses, optimizers
from tensorflow.keras import Sequential

## Introduction

Transformers can be implemented from scratch in both TensorFlow and PyTorch
(e.g. https://www.tensorflow.org/text/tutorials/transformer).
The multi-head attention layers used in transformers are also available as Keras layers in TensorFlow ([Attention](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Attention), [MultiHeadAttention](https://www.tensorflow.org/api_docs/python/tf/keras/layers/MultiHeadAttention)).
However, constructing or reproducing meaningful transformer architectures from scratch, even with these building blocks, remains challenging. This is especially true for the more complex NLP tasks that combine encoder and decoder transformers.
Furthermore, transformers have proved their state-of-the-art performance on NLP tasks mainly when trained on huge text corpora. For many specific tasks, transfer learning is therefore used to leverage the dynamic semantic information already acquired by pre-trained models.

Although transfer learning with pre-trained transformers such as BERT is possible directly in TensorFlow (e.g. [classify_text_with_bert](https://www.tensorflow.org/text/tutorials/classify_text_with_bert), [fine_tune_bert](https://www.tensorflow.org/tfmodels/nlp/fine_tune_bert)), this practical will instead introduce the `HuggingFace` transformers library, as it
- has a lot of pretrained state-of-the-art transformer models for various tasks,
- has a very high-level, user-friendly interface,
- is compatible with both TensorFlow and PyTorch,
- is used by many universities, research labs and companies.

If needed, see the official tutorials to go further:
- https://huggingface.co/learn/nlp-course/chapter1/1
- https://huggingface.co/docs/transformers/index

In [None]:
#!pip install --upgrade transformers datasets

## Pre-trained pipelines

The `pipeline` function loads pre-trained models from the `HuggingFace` database through a very simple interface, for a wide range of tasks. Almost all of the main open-source pretrained transformer families (BERT, GPT, ...) are available.

- Selected transformer architectures: https://huggingface.co/docs/transformers/index
- Community checkpoints: https://huggingface.co/models

Here are a few examples of pretrained transformer models (i.e. checkpoints) for some NLP tasks.

In [3]:
import datasets
from transformers import pipeline

#### Sentiment analysis

In [6]:
sent_pipe = pipeline("sentiment-analysis", # or "text-classification"
                     model="distilbert-base-uncased-finetuned-sst-2-english")

In [7]:
sent_pipe("I love this product.")

[{'label': 'POSITIVE', 'score': 0.9998775720596313}]

In [10]:
results = sent_pipe(["I love this product.",
                    "I hate this product."])

for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

label: POSITIVE, with score: 0.9999
label: NEGATIVE, with score: 0.9998


#### "Zero-shot" classification

Using natural language inference models to predict *entailment* between each sequence-label premise/hypothesis pair.

In [11]:
zsc_pipe = pipeline('zero-shot-classification',
                    model="facebook/bart-large-mnli", revision="c626438")

In [12]:
zsc_pipe("I like trains.", ["politics","vehicles","animals"])

{'sequence': 'I like trains.',
 'labels': ['vehicles', 'animals', 'politics'],
 'scores': [0.9911500811576843, 0.004934574011713266, 0.003915328532457352]}
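
The entailment-based scoring described above can be illustrated with a minimal, dependency-free sketch. It assumes a hypothesis template of the form "This example is {label}." and single-label mode, where the entailment logits are normalised across the candidate labels with a softmax. The logit values below are made up for illustration; a real NLI model would compute them for each premise/hypothesis pair.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

premise = "I like trains."
candidate_labels = ["politics", "vehicles", "animals"]

# Assumed hypothesis template: one NLI hypothesis per candidate label.
hypotheses = [f"This example is {label}." for label in candidate_labels]

# Hypothetical entailment logits, one per (premise, hypothesis) pair.
# These numbers are invented; they are not outputs of bart-large-mnli.
entailment_logits = [-2.1, 3.4, -1.9]

# Normalise across labels and rank them by decreasing score.
scores = softmax(entailment_logits)
ranked = sorted(zip(candidate_labels, scores), key=lambda pair: pair[1], reverse=True)
for label, score in ranked:
    print(f"{label}: {score:.3f}")
```

No label-specific training is needed in this scheme: adding a new candidate label only adds one more hypothesis to score.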

#### Dynamic word embeddings

In [83]:
dwe_pipe = pipeline("feature-extraction", model="bert-base-cased") # e.g. "bert-base-cased" "distilbert-base-cased"

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [16]:
(dwe_pipe("I like trains.", return_tensors=True))

tensor([[[ 0.8597,  0.1216, -0.0761,  ..., -0.1833,  0.1758,  0.0822],
         [ 0.8721, -0.3453,  0.5372,  ..., -0.1697,  0.0987,  0.1806],
         [ 0.6673, -0.0972, -0.5464,  ...,  0.5048, -0.4832,  0.1718],
         [ 0.4819, -0.0781,  0.1035,  ...,  0.0943, -0.3238,  0.1974],
         [ 0.8397, -0.1973, -0.0363,  ..., -0.0951,  0.3160, -0.0663],
         [ 1.6680,  0.1131, -0.2623,  ..., -0.2753,  0.5270, -0.1521]]])

This returns the output of the last transformer block: one vector per token, including the special tokens. Other embedding approaches exist, such as averaging or concatenating the activations of several of the transformer's layers.
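
As a rough illustration of the averaging and concatenating strategies, the sketch below combines per-layer token vectors stored as plain nested lists. The two tiny "layers" are toy values standing in for real hidden states, and the helper names `average_layers` and `concatenate_layers` are hypothetical.

```python
def average_layers(layers):
    """Average token vectors element-wise across layers.

    `layers` is a list of layers, each a list of token vectors
    (shape: n_layers x n_tokens x dim).
    """
    n_layers = len(layers)
    return [
        [sum(values) / n_layers for values in zip(*token_vectors)]
        for token_vectors in zip(*layers)
    ]

def concatenate_layers(layers):
    """Concatenate, for each token, its vectors from all layers."""
    return [
        [value for vector in token_vectors for value in vector]
        for token_vectors in zip(*layers)
    ]

# Toy hidden states: 2 layers, 2 tokens, dimension 2 (made-up numbers).
layer_a = [[1.0, 2.0], [3.0, 4.0]]
layer_b = [[3.0, 4.0], [5.0, 8.0]]

averaged = average_layers([layer_a, layer_b])
concatenated = concatenate_layers([layer_a, layer_b])
print(averaged)
print(concatenated)
```

Averaging keeps the embedding dimension unchanged, while concatenation multiplies it by the number of layers used.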

#### Text generation

With causal language models

In [17]:
generator = pipeline("text-generation", model="gpt2")

In [19]:
generator("In this NLP seminar about transformers, we will learn")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this NLP seminar about transformers, we will learn all about a basic transformation, especially involving a bit of information about the model that describes the form, the transformation, and the parameters for the transformation. However, this class is intended as an'}]

In [20]:
generator("In this NLP seminar about transformers, we will learn",
          max_length=30,
          num_return_sequences=2)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this NLP seminar about transformers, we will learn more about transformation and how you can improve your NLP with the tools we have. We'},
 {'generated_text': 'In this NLP seminar about transformers, we will learn about all of the transformers in the program that will use the transforms in the application.'}]

#### Mask filling
This language modelling task is part of how BERT architectures are often pre-trained.

In [21]:
unmasker = pipeline("fill-mask")

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [22]:
unmasker("This seminar will teach you all about <mask> models.", top_k=2)

[{'score': 0.19378963112831116,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This seminar will teach you all about mathematical models.'},
 {'score': 0.04502846300601959,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This seminar will teach you all about computational models.'}]

#### Named entity recognition

In [23]:
ner = pipeline("ner", grouped_entities=True)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [24]:
ner("My name is Olivier and I work at the University of Geneva near Plainpalais.")

[{'entity_group': 'PER',
  'score': 0.9992693,
  'word': 'Olivier',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.97970057,
  'word': 'University of Geneva',
  'start': 37,
  'end': 57},
 {'entity_group': 'LOC',
  'score': 0.9608657,
  'word': 'Plainpalais',
  'start': 63,
  'end': 74}]

#### Question answering

Using extractive question answering: the answer is a span selected from the context.

In [25]:
question_answerer = pipeline("question-answering")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [26]:
question_answerer(question="Where do I work?",
                  context="My name is Olivier and I work at the University of Geneva near Plainpalais. I have a lot of work at the office this week.")

{'score': 0.7924373149871826,
 'start': 37,
 'end': 57,
 'answer': 'University of Geneva'}
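
The `start` and `end` values in the output above are character offsets into the context string. A minimal check, reusing the context and the offsets shown above, recovers the answer by slicing:

```python
context = ("My name is Olivier and I work at the University of Geneva near Plainpalais. "
           "I have a lot of work at the office this week.")

# Offsets copied from the pipeline output shown above.
prediction = {"start": 37, "end": 57, "answer": "University of Geneva"}

# The answer is the slice of the context between the two offsets.
span = context[prediction["start"]:prediction["end"]]
print(span)
```

The model does not generate new text here: it only selects where the answer starts and ends in the given context.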

#### Document summarization

Using encoder-decoder transformers

In [27]:
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [28]:
summarizer(
    # Wikipedia page on NLP:
    """
    Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science,
    and artificial intelligence concerned with the interactions between computers and human language,
    in particular how to program computers to process and analyze large amounts of natural language data.
    The goal is a computer capable of "understanding" the contents of documents, including the contextual
    nuances of the language within them. The technology can then accurately extract information and insights
    contained in the documents as well as categorize and organize the documents themselves.
    
    Challenges in natural language processing frequently involve speech recognition, natural-language
    understanding, and natural-language generation.
    
    Natural language processing has its roots in the 1950s. Already in 1950, Alan Turing published an article
    titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a
    criterion of intelligence, though at the time that was not articulated as a problem separate from
    artificial intelligence. The proposed test includes a task that involves the automated interpretation
    and generation of natural language.
"""
)

[{'summary_text': ' The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them . The technology can then extract information and insights contained in the documents as well as categorize and organize the documents themselves . Challenges in natural language processing frequently involve speech recognition, natural-language .'}]

#### Translation

Using encoder-decoder transformers

In [29]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")



In [30]:
translator("Ce séminaire est exceptionnellement donné le mardi après-midi.")

[{'translation_text': 'This seminar is exceptionally given on Tuesday afternoon.'}]

## What constitutes a pipeline?

Example for a classification pipeline.

https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/full_nlp_pipeline.svg

In [31]:
pretrained_name = "distilbert-base-uncased-finetuned-sst-2-english"

sent_pipe = pipeline("sentiment-analysis", model=pretrained_name)

In [32]:
corp = ["I love this amazing Transformers introduction seminar.",
        "I hate debugging my code so much!"]

In [33]:
sent_pipe(corp)

[{'label': 'POSITIVE', 'score': 0.9998568296432495},
 {'label': 'NEGATIVE', 'score': 0.9962984919548035}]

### Step 1. Tokenizer

In [34]:
from transformers import AutoTokenizer

In [35]:
tokenizer = AutoTokenizer.from_pretrained(pretrained_name)

In [36]:
# Vocabulary indices (token ids) for the tokenized text corpus:
inputs = tokenizer(corp, padding=True, truncation=True, return_tensors="np")
print(inputs)

{'input_ids': array([[  101,  1045,  2293,  2023,  6429, 19081,  4955, 18014,  1012,
          102,     0,     0],
       [  101,  1045,  5223,  2139,  8569, 12588,  2026,  3642,  2061,
         2172,   999,   102]]), 'attention_mask': array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
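
The effect of `padding=True` can be reproduced by hand. The sketch below is a simplified stand-in for what the tokenizer does, not its actual implementation. It pads each sequence of token ids to the length of the longest one and builds the matching attention mask, assuming a padding id of 0 as in the output above. The helper name `pad_batch` is hypothetical.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token id sequences to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Token ids of the two example sentences, without padding.
batch = pad_batch([
    [101, 1045, 2293, 2023, 6429, 19081, 4955, 18014, 1012, 102],
    [101, 1045, 5223, 2139, 8569, 12588, 2026, 3642, 2061, 2172, 999, 102],
])
print(batch["input_ids"][0])
print(batch["attention_mask"][0])
```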


In [37]:
#For comparison, the pipeline tokenizer is the same:
sent_pipe.tokenizer(corp, padding=True, truncation=True, return_tensors="np")

{'input_ids': array([[  101,  1045,  2293,  2023,  6429, 19081,  4955, 18014,  1012,
          102,     0,     0],
       [  101,  1045,  5223,  2139,  8569, 12588,  2026,  3642,  2061,
         2172,   999,   102]]), 'attention_mask': array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [38]:
#Textual version of the tokens:
for doc in corp:
    print(tokenizer.tokenize(doc, add_special_tokens=True)) # sub-word / wordpiece

['[CLS]', 'i', 'love', 'this', 'amazing', 'transformers', 'introduction', 'seminar', '.', '[SEP]']
['[CLS]', 'i', 'hate', 'de', '##bu', '##gging', 'my', 'code', 'so', 'much', '!', '[SEP]']


In [39]:
tokenizer.decode([101, 1045, 2293, 2023, 6429, 19081, 4955, 18014, 1012, 102, 0, 0])

'[CLS] i love this amazing transformers introduction seminar. [SEP] [PAD] [PAD]'

In [40]:
tokenizer.decode([101, 1045, 5223, 2139, 8569, 12588, 2026, 3642, 2061, 2172, 999, 102])

'[CLS] i hate debugging my code so much! [SEP]'
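
The split of "debugging" into `de`, `##bu`, `##gging` seen above is typical of WordPiece-style tokenization. A minimal sketch of the usual greedy longest-match-first procedure is given below. It assumes a tiny hand-written vocabulary (`toy_vocab`) rather than the real one, and `wordpiece_tokenize` is a hypothetical helper, not the library's implementation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first sub-word split of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are prefixed
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No known piece covers this position: the whole word is unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary, chosen to reproduce the split shown above.
toy_vocab = {"i", "hate", "de", "##bu", "##gging", "my", "code"}

pieces = wordpiece_tokenize("debugging", toy_vocab)
print(pieces)
```

Words that are in the vocabulary stay whole, while rarer words are decomposed into known pieces, which keeps the vocabulary size bounded.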

### Step 2.1. Transformer model

https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/transformer_and_head.svg

In [41]:
from transformers import AutoModel, TFAutoModel

In [42]:
model = TFAutoModel.from_pretrained(pretrained_name)

Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertModel: ['classifier', 'pre_classifier', 'dropout_19']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


In [43]:
model.summary()

Model: "tf_distil_bert_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMai  multiple                 66362880  
 nLayer)                                                         
                                                                 
Total params: 66,362,880
Trainable params: 66,362,880
Non-trainable params: 0
_________________________________________________________________


In [47]:
outputs = model(**inputs)
outputs

TFBaseModelOutput(last_hidden_state=<tf.Tensor: shape=(2, 12, 768), dtype=float32, numpy=
array([[[ 0.74096173,  0.09954479,  0.17740408, ...,  0.37085024,
          1.0475554 , -0.53418267],
        [ 0.9534427 ,  0.17849974,  0.07076181, ...,  0.3553438 ,
          1.1498722 , -0.3271943 ],
        [ 1.1049628 ,  0.2552293 ,  0.33767316, ...,  0.3171587 ,
          1.0195711 , -0.37932056],
        ...,
        [ 1.1075138 ,  0.08023867,  0.6883422 , ...,  0.6212788 ,
          0.59678227, -0.8048649 ],
        [ 0.58498317,  0.11724679,  0.09631445, ...,  0.5941567 ,
          1.0606877 , -0.28705198],
        [ 0.57371145,  0.08199537,  0.07770054, ...,  0.43690923,
          1.0691313 , -0.36507556]],

       [[-0.12865362,  0.6125047 , -0.43362248, ...,  0.02990204,
         -0.37186334,  0.16652946],
        [-0.08211757,  0.91229707, -0.15405867, ..., -0.02132722,
         -0.25966606,  0.25903928],
        [-0.05406427,  0.6817771 , -0.06585205, ...,  0.07498453,
         -0.3

In [49]:
print(outputs.last_hidden_state.shape)

(2, 12, 768)


### Step 2.2. Transformer model with classification head

In [50]:
from transformers import AutoModelForSequenceClassification, TFAutoModelForSequenceClassification

In [51]:
classif_model = TFAutoModelForSequenceClassification.from_pretrained(pretrained_name)

Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['dropout_38']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [52]:
classif_model.summary()

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMai  multiple                 66362880  
 nLayer)                                                         
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
 dropout_38 (Dropout)        multiple                  0         
                                                                 
Total params: 66,955,010
Trainable params: 66,955,010
Non-trainable params: 0
_________________________________________________________________


In [53]:
outputs2 = classif_model(**inputs)

In [54]:
outputs2

TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-4.2869983,  4.5641413],
       [ 3.06056  , -2.5347545]], dtype=float32)>, hidden_states=None, attentions=None)

In [56]:
outputs2.logits

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-4.2869983,  4.5641413],
       [ 3.06056  , -2.5347545]], dtype=float32)>

In [55]:
print(outputs2.logits.shape)

(2, 2)


### Step 3. Post-processing

In [57]:
probabilities = tf.keras.activations.softmax(outputs2.logits, axis=-1)
probabilities

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[1.431980e-04, 9.998568e-01],
       [9.962985e-01, 3.701479e-03]], dtype=float32)>

In [60]:
# Predicted class:
probabilities.numpy().argmax(axis=-1)

array([1, 0], dtype=int64)

In [61]:
# Prediction certainty:
probabilities.numpy().max(axis=-1)

array([0.9998568, 0.9962985], dtype=float32)

In [63]:
# Class labels corresponding to the integer encoding:
classif_model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

In [64]:
#For comparison, the pipeline output:
sent_pipe(corp)

[{'label': 'POSITIVE', 'score': 0.9998568296432495},
 {'label': 'NEGATIVE', 'score': 0.9962984919548035}]
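
The whole post-processing step can also be written without any framework. The sketch below reuses the logits and the `id2label` mapping printed above, and applies a softmax, an argmax and a label lookup. It is only meant to make the arithmetic explicit.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Logits and label mapping copied from the outputs above.
logits = [[-4.2869983, 4.5641413],
          [3.06056, -2.5347545]]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

results = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=lambda i: probs[i])  # argmax
    results.append({"label": id2label[best], "score": probs[best]})

for result in results:
    print(result)
```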

For different tasks, there might be additional preprocessing and feature extraction steps.

### Saving

In [66]:
# save
tf_save_directory = "./checkpoints/tf_save_pretrained"
tokenizer.save_pretrained(tf_save_directory)
classif_model.save_pretrained(tf_save_directory)

In [67]:
#load
classif_model = TFAutoModelForSequenceClassification.from_pretrained("./checkpoints/tf_save_pretrained")

Some layers from the model checkpoint at ./checkpoints/tf_save_pretrained were not used when initializing TFDistilBertForSequenceClassification: ['dropout_38']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at ./checkpoints/tf_save_pretrained and are newly initialized: ['dropout_58']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [68]:
tokenizer = AutoTokenizer.from_pretrained("./checkpoints/tf_save_pretrained")

## Transfer learning with keras

#### Loading the data

In [69]:
simpsons = pd.read_csv("../Seminar08/data/simpsons_script_lines.csv",
                       usecols=["raw_character_text", "raw_location_text", "spoken_words", "normalized_text"],
                       dtype={'raw_character_text':'string', 'raw_location_text':'string',
                              'spoken_words':'string', 'normalized_text':'string'})
simpsons = simpsons.dropna().drop_duplicates().reset_index(drop=True)

n_classes = 10
main_characters = simpsons['raw_character_text'].value_counts(dropna=False)[:n_classes].index.to_list()
simpsons_main = simpsons.query("`raw_character_text` in @main_characters")

X = simpsons_main["normalized_text"].to_numpy()
y = simpsons_main["raw_character_text"].to_numpy()
y_int = np.array([np.where(np.array(main_characters)==char)[0].item() for char in y])
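
The integer encoding built above maps each character name to its rank among the most frequent speakers. A possible equivalent with a plain dictionary lookup is sketched below. The character names and lines are hypothetical placeholders, not values read from the dataset.

```python
# Hypothetical list of main characters, ordered by decreasing line count.
main_characters = ["Homer Simpson", "Marge Simpson", "Bart Simpson"]

# Map each character name to its integer class index.
char_to_int = {name: idx for idx, name in enumerate(main_characters)}

# Hypothetical speaker of each line of dialogue.
y = ["Bart Simpson", "Homer Simpson", "Homer Simpson", "Marge Simpson"]
y_int = [char_to_int[name] for name in y]
print(y_int)
```

The dictionary lookup avoids scanning the whole list of characters for every line, which matters little for 10 classes but scales better.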

In [73]:
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y_int, test_size=0.2, random_state=42, shuffle=True)

#### Loading the pretrained model

In [74]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

pretrained_name2 = "bert-base-cased" # e.g. "bert-base-cased" or "distilbert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(pretrained_name2)
model = TFAutoModelForSequenceClassification.from_pretrained(pretrained_name2, num_labels=n_classes)  # new classification head with n_classes outputs

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [75]:
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  108310272 
                                                                 
 dropout_96 (Dropout)        multiple                  0         
                                                                 
 classifier (Dense)          multiple                  7690      
                                                                 
Total params: 108,317,962
Trainable params: 108,317,962
Non-trainable params: 0
_________________________________________________________________


To freeze the pretrained transformer weights, and only train the classification head, we can set:

In [77]:
model.layers[0].trainable = False

In [78]:
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  108310272 
                                                                 
 dropout_96 (Dropout)        multiple                  0         
                                                                 
 classifier (Dense)          multiple                  7690      
                                                                 
Total params: 108,317,962
Trainable params: 7,690
Non-trainable params: 108,310,272
_________________________________________________________________
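
As a sanity check on the two summaries above, the trainable parameter count after freezing should equal the size of the classification head alone. The sketch below assumes a dense head with bias, a hidden size of 768 and 10 classes. The helper `dense_param_count` is hypothetical.

```python
def dense_param_count(n_inputs, n_units, use_bias=True):
    """Number of parameters of a fully connected layer."""
    return n_inputs * n_units + (n_units if use_bias else 0)

hidden_size = 768   # assumed size of the encoder output fed to the head
n_classes = 10

head_params = dense_param_count(hidden_size, n_classes)
encoder_params = 108_310_272  # 'bert' layer count, copied from the summary above
total_params = encoder_params + head_params

print(head_params)
print(total_params)
```

Both numbers match the summary: only the weight matrix and the bias of the `classifier` layer remain trainable.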


Allowing the transformer weights to be modified by leaving `model.layers[0].trainable = True` can significantly improve performance on the downstream task, but training will take significantly longer. Furthermore, more care is needed when selecting the training hyperparameters (low initial learning rate, learning rate decay, not too many epochs) to prevent [catastrophic forgetting](https://en.wikipedia.org/wiki/Catastrophic_interference), i.e. losing the information acquired during pretraining.
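
To make the learning rate decay concrete, here is a minimal sketch of a linear schedule written in plain Python. The function name `linear_decay_lr` and the step counts are hypothetical, and the starting value of 3e-5 is only one common choice. In Keras, a similar behaviour could presumably be obtained with `optimizers.schedules.PolynomialDecay` using `power=1.0`.

```python
def linear_decay_lr(step, total_steps, initial_lr=3e-5, end_lr=0.0):
    """Learning rate decreasing linearly from initial_lr to end_lr."""
    # Clamp progress to [0, 1] so steps past the end keep end_lr.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return initial_lr + (end_lr - initial_lr) * progress

total_steps = 1000  # hypothetical: steps per epoch * number of epochs
for step in (0, 250, 500, 1000):
    print(step, linear_decay_lr(step, total_steps))
```

Starting low and decaying towards zero limits how far the pretrained weights can drift from their initial values.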

In [79]:
X_train_tok = dict(tokenizer(X_train.tolist(), padding=True, truncation=True, return_tensors="tf"))
X_valid_tok = dict(tokenizer(X_valid.tolist(), padding=True, truncation=True, return_tensors="tf"))

In [81]:
model.compile(optimizer=optimizers.Adam(learning_rate=3e-5),
              loss=losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

In [None]:
# The fit is very long without a GPU or a cloud service
epochs = 5
history_ft = model.fit(X_train_tok, y_train, validation_data=(X_valid_tok, y_valid),
                       batch_size=16, epochs=epochs)

For larger datasets, training can be made more efficient with HuggingFace's `Datasets` library, which handles loading data from disk more efficiently than holding everything in memory: https://huggingface.co/docs/datasets/index.

HuggingFace's `Datasets` can also interact with the Keras API (e.g. the `model.fit()` method), for example through `model.prepare_tf_dataset()` or `Dataset.to_tf_dataset()`.