# NLP Seminar 5: Pretrained Transformers and Transfer-Learning

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, losses, optimizers
from tensorflow.keras import Sequential

## Introduction

Transformers can be implemented from scratch in both tensorflow and pytorch
(e.g. https://www.tensorflow.org/text/tutorials/transformer).
The multi-headed attention layers used in transformers are also implemented as a Keras layer in tensorflow ([Attention](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Attention), [MultiHeadAttention](https://www.tensorflow.org/api_docs/python/tf/keras/layers/MultiHeadAttention)).
However, constructing or reproducing meaningful transformer architectures from scratch, even with these building blocks, remains challenging. This is especially true for some of the more complex NLP tasks that combine encoder and decoder transformers.
Furthermore, transformers have mostly proved their state-of-the-art performance on NLP tasks when trained on huge corpora of data. In particular, for many specific tasks, transfer learning is used to leverage the dynamic semantic information already acquired by pre-trained models.
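
As a rough illustration of what these attention layers compute, here is a minimal single-head scaled dot-product attention in plain Python. It is only a sketch on made-up vectors: the learned projections, masking and multiple heads of the Keras layers are left out, and the function names are ours.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy self-attention: three 2-dimensional token vectors (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn_out, attn_weights = attention(tokens, tokens, tokens)
for row in attn_weights:
    print([round(w, 3) for w in row])
```

Each printed row holds the weights one token assigns to all tokens, and sums to 1.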

Although transfer-learning using pre-trained transformers such as BERT is possible with tensorflow (e.g. [classify_text_with_bert](https://www.tensorflow.org/text/tutorials/classify_text_with_bert), [fine_tune_bert](https://www.tensorflow.org/tfmodels/nlp/fine_tune_bert)), this practical will instead introduce the `HuggingFace` transformer library, as it has
- a lot of pretrained state-of-the-art transformer models for various tasks,
- a very high-level user-friendly interface,
- compatibility with both tensorflow and pytorch.

If needed, see the official tutorials to go further:
- https://huggingface.co/learn/nlp-course/chapter1/1
- https://huggingface.co/docs/transformers/index

In [None]:
#!pip install --upgrade transformers datasets

## Pre-trained pipelines

The `pipeline` function provides a very simple interface for loading pre-trained models from the `HuggingFace` Hub, for a wide range of tasks. Most of the major open-source pretrained transformers (BERT, GPT, ...) are available.

- Selected transformer architectures: https://huggingface.co/docs/transformers/index
- Community checkpoints: https://huggingface.co/models

Here are a few examples of pretrained transformer models (i.e. checkpoints) for some NLP tasks.

In [None]:
import datasets
from transformers import pipeline

#### Sentiment analysis

In [None]:
sent_pipe = pipeline("sentiment-analysis", # or "text-classification"
                     model="distilbert-base-uncased-finetuned-sst-2-english")

In [None]:
sent_pipe("This restaurant is awesome.")

In [None]:
results = sent_pipe([??, ??])

for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

#### "Zero-shot" classification

Using natural language inference models

In [None]:
zsc_pipe = pipeline('zero-shot-classification',
                    model="facebook/bart-large-mnli", revision="c626438")

In [None]:
zsc_pipe("I like trains.", ["politics","vehicles","animals"])
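
To make the link with natural language inference concrete, the sketch below shows one way the scores can be obtained: each candidate label is turned into a hypothesis, an NLI model scores how strongly the input entails it, and the entailment logits are normalised across labels. The hypothesis template and the logits are assumptions made for illustration; no model is called here.

```python
import math

def zero_shot_scores(entailment_logits):
    """Softmax over per-label entailment logits (single-label setting)."""
    m = max(entailment_logits.values())
    exps = {label: math.exp(v - m) for label, v in entailment_logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

premise = "I like trains."
labels = ["politics", "vehicles", "animals"]
# One hypothesis per candidate label, to be paired with the premise.
# The template wording is an assumption.
hypotheses = [f"This example is {label}." for label in labels]

# Hypothetical entailment logits an NLI model might return for each pair.
fake_logits = {"politics": -2.0, "vehicles": 3.0, "animals": -0.5}
zs_scores = zero_shot_scores(fake_logits)
best_label = max(zs_scores, key=zs_scores.get)
print(hypotheses)
print(best_label, round(zs_scores[best_label], 3))
```

Because the labels only appear inside the hypotheses, new labels can be added without retraining the model.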

#### Dynamic word embeddings

In [None]:
dwe_pipe = pipeline("feature-extraction", model="bert-base-cased") # e.g. "bert-base-cased" "distilbert-base-cased"

In [None]:
(dwe_pipe("I like trains.", return_tensors=True)).shape

This returns the output of the last transformer block, with one vector per token. Other embedding approaches exist, such as averaging or concatenating the activations of several of the transformer's layers.
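
The two alternatives can be sketched on toy data as follows. The hidden states are made-up nested lists of shape `[seq_len][hidden]`, and the helper names are ours; with a real model, the per-layer activations would first have to be requested (e.g. with `output_hidden_states=True`).

```python
def average_layers(layers):
    """Element-wise mean over a list of [seq_len][hidden] layer outputs."""
    n = len(layers)
    return [[sum(layer[t][h] for layer in layers) / n
             for h in range(len(layers[0][0]))]
            for t in range(len(layers[0]))]

def concat_layers(layers):
    """Concatenate, for each token, the hidden vectors of every layer."""
    return [[x for layer in layers for x in layer[t]]
            for t in range(len(layers[0]))]

# Toy hidden states: 2 layers, 3 tokens, hidden size 2 (made-up numbers).
layer_a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
layer_b = [[3.0, 2.0], [1.0, 0.0], [5.0, 8.0]]

avg_emb = average_layers([layer_a, layer_b])  # keeps hidden size 2
cat_emb = concat_layers([layer_a, layer_b])   # hidden size becomes 4
print(avg_emb)
print(cat_emb[0])
```

Averaging keeps the embedding dimension unchanged, while concatenation multiplies it by the number of layers used.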

#### Text generation

With causal language models

In [None]:
generator = pipeline("text-generation", model="gpt2")

generator("In this NLP seminar about transformers, we will learn")

In [None]:
generator("In this NLP seminar about transformers, we will learn",
          max_length=30,
          num_return_sequences=2)
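
The generation loop behind a causal language model can be sketched as below. This is a toy under strong assumptions: a hand-written table of next-token probabilities stands in for the model, it only looks at the last token (a real transformer conditions on the whole prefix), and decoding is greedy rather than sampled.

```python
# Hypothetical next-token probabilities standing in for a causal language model.
NEXT_TOKEN_PROBS = {
    "we": {"will": 0.9, "learn": 0.1},
    "will": {"learn": 0.8, "we": 0.2},
    "learn": {"transformers": 0.7, "<eos>": 0.3},
    "transformers": {"<eos>": 1.0},
}

def greedy_generate(prompt, max_length=10):
    """Repeatedly append the most probable next token given the last one."""
    tokens = list(prompt)
    # In this sketch, max_length counts the prompt tokens as well.
    while len(tokens) < max_length:
        probs = NEXT_TOKEN_PROBS.get(tokens[-1], {"<eos>": 1.0})
        next_token = max(probs, key=probs.get)
        if next_token == "<eos>":
            break  # the model chose to end the sequence
        tokens.append(next_token)
    return tokens

generated = greedy_generate(["we"])
truncated = greedy_generate(["we"], max_length=2)
print(generated)
print(truncated)
```

Sampling from `probs` instead of taking the maximum would give different continuations on each call, which is what makes several return sequences meaningful.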

#### Mask filling
This language modelling task is part of how BERT architectures are often pre-trained.

In [None]:
unmasker = pipeline("fill-mask")

unmasker("This seminar will teach you all about <mask> models.", top_k=2)
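
To show how masked language modelling produces training examples, here is a simplified sketch of the masking step. It assumes a `[MASK]` token as in BERT (other checkpoints use a different mask token, such as `<mask>` above) and always replaces a selected token by the mask. BERT's original recipe selects about 15% of the tokens and sometimes keeps them or swaps in a random token instead, which is omitted here.

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); labels are None where nothing was masked."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)   # the model must recover the original token
        else:
            masked.append(tok)
            labels.append(None)  # position ignored by the loss
    return masked, labels

sentence = "this seminar will teach you all about transformer models".split()
masked_sentence, mlm_labels = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(masked_sentence)
print(mlm_labels)
```

No manual annotation is needed: the text itself provides the labels.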

#### Named entity recognition

In [None]:
ner = pipeline("ner", aggregation_strategy="simple")  # replaces the deprecated grouped_entities=True

ner("My name is Olivier and I work at the University of Geneva near Plainpalais.")
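
A token classification model predicts one tag per token, so multi-word entities have to be merged afterwards. The sketch below illustrates such a grouping step on hand-written predictions in the BIO tagging scheme. It is an assumption about the general idea, not the pipeline's actual implementation, and the tags are made up.

```python
def group_entities(tokens, tags):
    """Merge consecutive B-/I- tagged tokens into (entity_type, text) spans."""
    entities, current_type, current_words = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current_type):
            # A new entity starts: flush the previous one, if any.
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = tag[2:], [token]
        elif tag.startswith("I-"):
            current_words.append(token)  # continuation of the current entity
        else:  # "O": outside any entity
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_words:
        entities.append((current_type, " ".join(current_words)))
    return entities

ner_tokens = ["Olivier", "works", "at", "the", "University", "of", "Geneva"]
# Hypothetical per-token predictions in the BIO tagging scheme.
ner_tags = ["B-PER", "O", "O", "O", "B-ORG", "I-ORG", "I-ORG"]
grouped = group_entities(ner_tokens, ner_tags)
print(grouped)
```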

#### Question answering

Using extractive question answering: the answer is a span selected from the context

In [None]:
question_answerer = pipeline("question-answering")

question_answerer(question="Where do I work?",
                  context="My name is Olivier and I work at the University of Geneva near Plainpalais. I have a lot of work this week.")
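
In extractive question answering, the model typically outputs one start score and one end score per context token, and the answer is the span that maximises their sum. A minimal sketch of this selection step, with made-up logits and a hypothetical `max_answer_len` limit:

```python
def best_span(start_logits, end_logits, max_answer_len=5):
    """Return (start, end) maximising start_logit + end_logit with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_logits):
        # Only consider spans of at most max_answer_len tokens.
        for j in range(i, min(i + max_answer_len, len(end_logits))):
            score = s + end_logits[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

context_tokens = ["I", "work", "at", "the", "University", "of", "Geneva", "."]
# Hypothetical logits a question-answering head might output per token.
start_logits = [0.1, 0.2, 0.5, 2.0, 3.5, 0.0, 0.3, -1.0]
end_logits = [0.0, 0.1, 0.0, 0.2, 1.0, 0.5, 4.0, -0.5]

start, end = best_span(start_logits, end_logits)
answer = " ".join(context_tokens[start:end + 1])
print(answer)
```

The answer is therefore always copied from the context, never generated.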

#### Document summarization

Using encoder-decoder transformers

In [None]:
summarizer = pipeline("summarization")

summarizer(
    # Wikipedia page on NLP:
    """
    Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science,
    and artificial intelligence concerned with the interactions between computers and human language,
    in particular how to program computers to process and analyze large amounts of natural language data.
    The goal is a computer capable of "understanding" the contents of documents, including the contextual
    nuances of the language within them. The technology can then accurately extract information and insights
    contained in the documents as well as categorize and organize the documents themselves.
    
    Challenges in natural language processing frequently involve speech recognition, natural-language
    understanding, and natural-language generation.
    
    Natural language processing has its roots in the 1950s. Already in 1950, Alan Turing published an article
    titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a
    criterion of intelligence, though at the time that was not articulated as a problem separate from
    artificial intelligence. The proposed test includes a task that involves the automated interpretation
    and generation of natural language.
"""
)

#### Translation

In [None]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

translator("Ce séminaire est exceptionnellement donné le mardi après-midi.")

## What constitutes a pipeline?

Example for a classification pipeline

In [None]:
pretrained_name = "distilbert-base-uncased-finetuned-sst-2-english"

sent_pipe = pipeline("sentiment-analysis", model=pretrained_name)

In [None]:
corp = ["I love this amazing Transformers introduction seminar.",
        "I hate debugging my code so much!"]

In [None]:
sent_pipe(corp)

### Step 1. Tokenizer

In [None]:
from transformers import AutoTokenizer

In [None]:
tokenizer = AutoTokenizer.from_pretrained(pretrained_name)

In [None]:
inputs = tokenizer(corp, padding=True, truncation=True, return_tensors="np")
print(inputs)

In [None]:
sent_pipe.tokenizer(corp, padding=True, truncation=True, return_tensors="np")

In [None]:
for doc in corp:
    print(tokenizer.tokenize(doc, add_special_tokens=True)) # sub-word / wordpiece
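
The sub-word splitting can be illustrated with a greedy longest-match-first sketch in the spirit of WordPiece. The vocabulary below is a tiny hypothetical one and the function is ours; the real tokenizer also handles normalisation, punctuation and special tokens.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are prefixed
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Tiny hypothetical vocabulary; a real one holds tens of thousands of entries.
toy_vocab = {"i", "hate", "de", "##bu", "##gging", "my", "code"}
pieces = [p for w in "i hate debugging my code".split() for p in wordpiece(w, toy_vocab)]
print(pieces)
```

Frequent words stay whole, while rarer ones are decomposed into known pieces, which keeps the vocabulary small without mapping most words to an unknown token.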

In [None]:
tokenizer.decode([101, 1045, 2293, 2023, 6429, 19081, 4955, 18014, 1012, 102, 0, 0])

In [None]:
tokenizer.decode([101, 1045, 5223, 2139, 8569, 12588, 2026, 3642, 2061, 2172, 999, 102])
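
The trailing zeros in the first list of ids are padding. A minimal sketch of what `padding=True` amounts to, assuming a pad id of 0 and right-padding (the helper is hypothetical, the ids are those decoded above):

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad id sequences to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Token ids of the two example sentences, without padding.
batch = pad_batch([
    [101, 1045, 2293, 2023, 6429, 19081, 4955, 18014, 1012, 102],
    [101, 1045, 5223, 2139, 8569, 12588, 2026, 3642, 2061, 2172, 999, 102],
])
print(batch["input_ids"][0])
print(batch["attention_mask"][0])
```

Padding gives every sequence in the batch the same length, so the batch can be stored as a single rectangular array.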

### Step 2.1. Transformer model

In [None]:
from transformers import AutoModel, TFAutoModel

In [None]:
model = TFAutoModel.from_pretrained(pretrained_name)

model.summary()

In [None]:
outputs = model(**inputs)

print(outputs.last_hidden_state.shape)

### Step 2.2. Transformer model with classification head

In [None]:
from transformers import AutoModelForSequenceClassification, TFAutoModelForSequenceClassification

In [None]:
classif_model = TFAutoModelForSequenceClassification.from_pretrained(pretrained_name)
classif_model.summary()

In [None]:
outputs2 = classif_model(**inputs)

In [None]:
outputs2

In [None]:
print(outputs2.logits.shape)

### Step 3. Post-processing

In [None]:
probabilities = tf.keras.activations.softmax(outputs2.logits, axis=-1)
probabilities

In [None]:
probabilities.numpy().argmax(axis=-1)

In [None]:
probabilities.numpy().max(axis=-1)

In [None]:
classif_model.config.id2label
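
Put together, these post-processing steps can be sketched in plain Python as below. The logits are made up and the label mapping is an assumption standing in for `classif_model.config.id2label`.

```python
import math

def postprocess(logits, id2label):
    """Turn one row of logits into a {'label', 'score'} dict."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]      # stable softmax numerator
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax
    return {"label": id2label[best], "score": probs[best]}

# Hypothetical logits for two documents; label names assumed for illustration.
toy_id2label = {0: "NEGATIVE", 1: "POSITIVE"}
toy_logits = [[-4.0, 4.0], [3.0, -3.0]]
predictions = [postprocess(row, toy_id2label) for row in toy_logits]
for p in predictions:
    print(p["label"], round(p["score"], 4))
```

The resulting dictionaries have the same `label` and `score` keys as the pipeline output.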

In [None]:
sent_pipe(corp)

In [None]:
outputs2

For different tasks, there might be additional preprocessing and feature extraction steps.

### Saving

In [None]:
# save
tf_save_directory = "./checkpoints/tf_save_pretrained"
tokenizer.save_pretrained(tf_save_directory)
classif_model.save_pretrained(tf_save_directory)

In [None]:
# load
classif_model = TFAutoModelForSequenceClassification.from_pretrained("./checkpoints/tf_save_pretrained")

## Transfer learning with keras

#### Loading the data

In [None]:
simpsons = pd.read_csv("../Seminar08/data/simpsons_script_lines.csv",
                       usecols=["raw_character_text", "raw_location_text", "spoken_words", "normalized_text"],
                       dtype={'raw_character_text':'string', 'raw_location_text':'string',
                              'spoken_words':'string', 'normalized_text':'string'})
simpsons = simpsons.dropna().drop_duplicates().reset_index(drop=True)

n_classes = 10
main_characters = simpsons['raw_character_text'].value_counts(dropna=False)[:n_classes].index.to_list()
simpsons_main = simpsons.query("`raw_character_text` in @main_characters")

X = simpsons_main["normalized_text"].to_numpy()
y = simpsons_main["raw_character_text"].to_numpy()
y_int = np.array([np.where(np.array(main_characters)==char)[0].item() for char in y])

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y_int, test_size=0.2, random_state=42, shuffle=True)

#### Loading the pretrained model

In [None]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

pretrained_name2 = "bert-base-cased"  # alternative: "distilbert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(pretrained_name2)
model = TFAutoModelForSequenceClassification.from_pretrained(pretrained_name2, num_labels=n_classes)

In [None]:
model.summary()

To freeze the pretrained transformer weights, and only train the classification head, we can set:

In [None]:
model.layers[0].trainable = False

In [None]:
model.summary()

Leaving the transformer weights trainable (`model.layers[0].trainable = True`) can significantly improve performance on the downstream task, but training takes significantly longer. The training hyperparameters must also be chosen with more care (low initial learning rate, learning rate decay, not too many epochs) to prevent [catastrophic forgetting](https://en.wikipedia.org/wiki/Catastrophic_interference), i.e. losing the information acquired during pretraining.
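
One common way to combine a low learning rate with decay is a linear warm-up followed by a linear decay to zero. The sketch below only computes the schedule values; the peak rate, number of warm-up steps and total steps are hypothetical. In Keras, a comparable schedule object from `optimizers.schedules` could be passed to the optimizer instead of a constant.

```python
def linear_warmup_decay(step, total_steps, peak_lr=3e-5, warmup_steps=100):
    """Learning rate at a given step: linear warm-up, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)

total = 1000  # hypothetical number of training steps
schedule = [linear_warmup_decay(s, total) for s in range(total + 1)]
print(schedule[0], schedule[100], schedule[550], schedule[1000])
```

The rate starts at zero, peaks at the end of the warm-up, and returns to zero at the last step, so the pretrained weights are never hit by large updates.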

In [None]:
#def tokenize_dataset(dataset):
#    return tokenizer(dataset["text"])
#
#dataset = dataset.map(tokenize_dataset)
#tf_dataset = model.prepare_tf_dataset(dataset, batch_size=16, shuffle=True, tokenizer=tokenizer)
#
#model.compile(optimizer=optimizers.Adam(learning_rate=3e-5))
#model.fit(tf_dataset)

In [None]:
X_train_tok = dict(tokenizer(X_train.tolist(), padding=True, truncation=True, return_tensors="np"))
X_valid_tok = dict(tokenizer(X_valid.tolist(), padding=True, truncation=True, return_tensors="np"))

In [None]:
model.compile(optimizer=optimizers.Adam(learning_rate=3e-5),
              loss=losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
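
The loss is configured with `from_logits=True` because the classification head outputs raw logits, and with the sparse variant because the labels are integer class indices rather than one-hot vectors. A plain-Python sketch of the quantity computed for one example, on made-up logits (the function is ours, not the Keras implementation):

```python
import math

def sparse_categorical_crossentropy(logits, label):
    """Cross-entropy of one example from raw logits and an integer class label."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    # Equals -log(softmax(logits)[label]), without forming the probabilities.
    return log_sum_exp - logits[label]

# Hypothetical logits over 3 classes.
confident_right = sparse_categorical_crossentropy([5.0, 0.0, 0.0], label=0)
confident_wrong = sparse_categorical_crossentropy([5.0, 0.0, 0.0], label=1)
uniform = sparse_categorical_crossentropy([0.0, 0.0, 0.0], label=2)
print(round(confident_right, 4), round(confident_wrong, 4), round(uniform, 4))
```

A confident correct prediction gives a loss close to zero, a uniform prediction gives `log(n_classes)`, and a confident wrong prediction is penalised most.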

In [None]:
# Training is very slow without a GPU or a cloud service
epochs = 5
#history_ft = model.fit(X_train_tok, y_train, validation_data=(X_valid_tok, y_valid),
#                       batch_size=16, epochs=epochs)

For larger datasets, training can be made more efficient with HuggingFace's `Datasets` library, which memory-maps data from disk instead of loading it all into RAM: https://huggingface.co/docs/datasets/index.

HuggingFace's `Datasets` can also interact with the Keras API (e.g. the `model.fit()` method), for example through `model.prepare_tf_dataset()` or `Dataset.to_tf_dataset()`.