# NLP Seminar 5: Pretrained Transformers and Transfer-Learning

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, losses, optimizers
from tensorflow.keras import Sequential

## Introduction

Transformers can be implemented from scratch in both TensorFlow and PyTorch
(e.g. https://www.tensorflow.org/text/tutorials/transformer).
The multi-head attention layers used in transformers are also available as Keras layers in TensorFlow ([Attention](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Attention), [MultiHeadAttention](https://www.tensorflow.org/api_docs/python/tf/keras/layers/MultiHeadAttention)).
However, constructing or reproducing meaningful transformer architectures from scratch, even with these building blocks, remains challenging. This is especially true for the more complex NLP tasks that combine encoder and decoder transformers.
Furthermore, transformers have mainly proved their state-of-the-art performance on NLP tasks when trained on huge corpora. For many specific tasks, transfer learning is therefore used to leverage the dynamic semantic information already acquired by pre-trained models.

Although transfer learning with pre-trained transformers such as BERT is possible directly in TensorFlow (e.g. [classify_text_with_bert](https://www.tensorflow.org/text/tutorials/classify_text_with_bert), [fine_tune_bert](https://www.tensorflow.org/tfmodels/nlp/fine_tune_bert)), this practical will instead introduce the HuggingFace `transformers` library, as it
- has a lot of pretrained state-of-the-art transformer models for various tasks,
- has a very high-level user-friendly interface,
- is compatible with both tensorflow and pytorch,
- is used by many universities, research labs and companies.

If needed, see the official tutorials to go further:
- https://huggingface.co/learn/llm-course/chapter1/1
- https://huggingface.co/docs/transformers/index

In [3]:
# !pip install --upgrade transformers datasets

In [4]:
# In case of version incompatibility issues between transformers and TensorFlow:
# !pip install tf-keras
from tf_keras import layers, losses, optimizers, Sequential




## Pre-trained pipelines

The `pipeline` function loads pre-trained models from the `HuggingFace` model hub through a very simple interface, for a wide range of tasks. Almost all major open-source pretrained transformers (BERT, GPT, ...) are available.

- Popular transformer architectures: https://huggingface.co/docs/transformers/v4.49.0/en/index
- Community checkpoints: https://huggingface.co/models

Here are a few examples of pretrained transformer models (i.e. checkpoints) for some NLP tasks.

In [5]:
import datasets
from transformers import pipeline

#### Sentiment analysis

In [6]:
sent_pipe = pipeline("sentiment-analysis", # or "text-classification"
                     model="distilbert-base-uncased-finetuned-sst-2-english")

Device set to use cpu


In [7]:
sent_pipe("The weather is bad today...")

[{'label': 'NEGATIVE', 'score': 0.9997754693031311}]

In [None]:
results = sent_pipe(["The weather is bad today...",
                     "It is sunny and warm outside."])

for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

label: NEGATIVE, with score: 0.9998
label: POSITIVE, with score: 0.9998


#### "Zero-shot" classification

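Zero-shot classification uses a natural language inference (NLI) model: the input sequence is the premise, each candidate label is turned into a hypothesis, and the model predicts *entailment* for each premise/hypothesis pair.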
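A minimal sketch of the final scoring step, assuming the NLI model has already produced one entailment logit per candidate label. The logit values and the `zero_shot_scores` helper are made up for illustration; they are not outputs of `facebook/bart-large-mnli` nor part of the library.

```python
import numpy as np

def zero_shot_scores(entailment_logits):
    """Softmax over per-label entailment logits (single-label setting)."""
    logits = np.asarray(entailment_logits, dtype=float)
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exp / exp.sum()

labels = ["politics", "vehicles", "animals"]
# Hypothetical entailment logits, one per (premise, hypothesis) pair
made_up_logits = [-2.1, 3.4, -1.8]

scores = zero_shot_scores(made_up_logits)
ranking = [labels[i] for i in np.argsort(-scores)]  # labels sorted by decreasing score
print(ranking, scores.round(3))
```

The labels are returned sorted by score, which is why the pipeline output lists them in a different order than the input.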

In [9]:
zsc_pipe = pipeline('zero-shot-classification',
                    model="facebook/bart-large-mnli", revision="c626438")

Device set to use cpu


In [12]:
zsc_pipe("I like trains.", ["politics","vehicles","animals"])

{'sequence': 'I like trains.',
 'labels': ['vehicles', 'animals', 'politics'],
 'scores': [0.9911500811576843, 0.004934567026793957, 0.003915321547538042]}

#### Dynamic word embeddings

In [13]:
dwe_pipe = pipeline("feature-extraction", model="bert-base-cased") # e.g. "bert-base-cased" "distilbert-base-cased"

Device set to use cpu


In [16]:
(dwe_pipe("I like trains.", return_tensors=True))

tensor([[[ 0.8597,  0.1216, -0.0761,  ..., -0.1833,  0.1758,  0.0822],
         [ 0.8721, -0.3453,  0.5372,  ..., -0.1697,  0.0987,  0.1806],
         [ 0.6673, -0.0972, -0.5464,  ...,  0.5048, -0.4832,  0.1718],
         [ 0.4819, -0.0781,  0.1035,  ...,  0.0943, -0.3238,  0.1974],
         [ 0.8397, -0.1973, -0.0363,  ..., -0.0951,  0.3160, -0.0663],
         [ 1.6680,  0.1131, -0.2623,  ..., -0.2753,  0.5270, -0.1521]]])

This returns the output of the last transformer block, i.e. one contextual vector per token (special tokens included). Other embedding approaches exist, such as averaging or concatenating the activations of several of the transformer's layers.
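The alternative embedding strategies mentioned above can be sketched with NumPy. The array below is random placeholder data standing in for per-layer hidden states (13 layers of shape 6 x 768 is an assumption matching a BERT-base-sized model and the six tokens of the example), so only the shapes are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, seq_len, hidden = 13, 6, 768  # embedding layer + 12 blocks (assumed sizes)
# Placeholder activations, NOT real model outputs
hidden_states = rng.normal(size=(n_layers, seq_len, hidden))

last_layer = hidden_states[-1]                                     # (6, 768)
mean_last4 = hidden_states[-4:].mean(axis=0)                       # (6, 768)
concat_last4 = np.concatenate(list(hidden_states[-4:]), axis=-1)   # (6, 3072)
sentence_vec = last_layer.mean(axis=0)                             # (768,) mean over tokens

print(last_layer.shape, mean_last4.shape, concat_last4.shape, sentence_vec.shape)
```

Averaging keeps the embedding dimension fixed, while concatenation multiplies it by the number of layers used.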

#### Text generation

Using causal (autoregressive) language models
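As an illustration of autoregressive generation, here is a toy greedy decoding loop. The vocabulary and the `next_logits` table are invented for this sketch, and the toy "model" only looks at the last token, whereas a real causal transformer conditions on the whole prefix.

```python
import numpy as np

vocab = ["<eos>", "we", "will", "learn", "transformers"]
# Hypothetical next-token logits: row = current token id, column = next token id
next_logits = np.array([
    [9, 0, 0, 0, 0],   # <eos>        -> <eos>
    [0, 0, 5, 1, 0],   # we           -> will
    [0, 0, 0, 5, 1],   # will         -> learn
    [1, 0, 0, 0, 5],   # learn        -> transformers
    [5, 0, 0, 0, 1],   # transformers -> <eos>
], dtype=float)

def greedy_generate(prompt_ids, max_new_tokens=10, eos_id=0):
    """Repeatedly append the most likely next token until <eos> or the length limit."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = int(np.argmax(next_logits[ids[-1]]))
        if next_id == eos_id:
            break
        ids.append(next_id)
    return ids

generated = [vocab[i] for i in greedy_generate([1])]
print(generated)
```

Sampling from the softmax of the logits instead of taking the argmax is one way to obtain several different continuations for the same prompt.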

In [17]:
generator = pipeline("text-generation", model="gpt2") #"HuggingFaceTB/SmolLM2-360M"

Device set to use cpu


In [18]:
generator("In this NLP seminar about transformers, we will learn")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this NLP seminar about transformers, we will learn about transformation and how to use them effectively. It also covers how they can increase performance in many more tasks over the next few weeks as we improve our methods and tools in our own way.'}]

In [19]:
generator("In this NLP seminar about transformers, we will learn",
          max_length=30,
          num_return_sequences=2)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this NLP seminar about transformers, we will learn about them, what they are, and how we can apply them to create solutions.\n'},
 {'generated_text': 'In this NLP seminar about transformers, we will learn about the different ways in which they can be used and a lot more. Our approach is'}]

#### Mask filling
Masked language modeling is part of how BERT-like architectures are typically pre-trained.
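A rough sketch of how training examples for this task can be built, following the masking recipe described in the original BERT paper (15% of tokens selected; of those, 80% replaced by `[MASK]`, 10% by a random token, 10% left unchanged). The `mask_tokens` helper is hypothetical and works on word strings rather than token ids for readability.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted inputs, prediction targets) for masked language modeling."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            targets.append(tok)                   # the model must recover the original token
            r = rng.random()
            if r < 0.8:
                inputs.append("[MASK]")           # 80%: mask token
            elif r < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: random token
            else:
                inputs.append(tok)                # 10%: unchanged
        else:
            inputs.append(tok)
            targets.append(None)                  # position ignored by the loss
    return inputs, targets

tokens = ["this", "seminar", "is", "about", "transformer", "models"]
inputs, targets = mask_tokens(tokens, vocab=tokens)
print(inputs, targets)
```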

In [20]:
unmasker = pipeline("fill-mask")

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


In [21]:
unmasker("This seminar will teach you all about <mask> models.", top_k=2)

[{'score': 0.18968380987644196,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This seminar will teach you all about mathematical models.'},
 {'score': 0.04678095877170563,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This seminar will teach you all about computational models.'}]

#### Named entity recognition

In [22]:
ner = pipeline("ner", aggregation_strategy="simple")  # replaces the deprecated grouped_entities=True

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


In [23]:
ner("My name is Olivier and I work at the University of Geneva near Plainpalais.")

[{'entity_group': 'PER',
  'score': 0.9992693,
  'word': 'Olivier',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.97970057,
  'word': 'University of Geneva',
  'start': 37,
  'end': 57},
 {'entity_group': 'LOC',
  'score': 0.9608657,
  'word': 'Plainpalais',
  'start': 63,
  'end': 74}]

#### Question answering

Using an encoder model for extractive question answering: the answer is a span selected from the context.
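A simplified sketch of the span selection, assuming the model outputs one start logit and one end logit per context token. The logits below are invented, and `best_span` is a hypothetical helper, not the library's actual post-processing.

```python
import numpy as np

def best_span(start_logits, end_logits, max_answer_len=15):
    """Return the (start, end) token indices maximizing start + end score, with end >= start."""
    best, best_score = (0, 0), -np.inf
    for s in range(len(start_logits)):
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

tokens = ["I", "work", "at", "the", "University", "of", "Geneva", "."]
# Made-up logits for illustration only
start_logits = np.array([0.1, 0.2, 0.0, 0.5, 4.0, 0.3, 0.2, -1.0])
end_logits = np.array([0.0, 0.1, 0.0, 0.1, 0.5, 0.4, 3.5, -1.0])

s, e = best_span(start_logits, end_logits)
answer = " ".join(tokens[s:e + 1])
print(answer)  # University of Geneva
```

The token indices of the selected span are then mapped back to character offsets in the context, which is what the `start` and `end` fields of the pipeline output report.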

In [24]:
question_answerer = pipeline("question-answering")

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


In [25]:
question_answerer(question="Where do I work?",
                  context="My name is Olivier and I work at the University of Geneva near Plainpalais. I have a lot of work at the office this week.")

{'score': 0.7924377918243408,
 'start': 37,
 'end': 57,
 'answer': 'University of Geneva'}

#### Document summarization

Using encoder-decoder transformers

In [26]:
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


In [27]:
summarizer(
    # Wikipedia page on NLP:
    """
    Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science,
    and artificial intelligence concerned with the interactions between computers and human language,
    in particular how to program computers to process and analyze large amounts of natural language data.
    The goal is a computer capable of "understanding" the contents of documents, including the contextual
    nuances of the language within them. The technology can then accurately extract information and insights
    contained in the documents as well as categorize and organize the documents themselves.
    
    Challenges in natural language processing frequently involve speech recognition, natural-language
    understanding, and natural-language generation.
    
    Natural language processing has its roots in the 1950s. Already in 1950, Alan Turing published an article
    titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a
    criterion of intelligence, though at the time that was not articulated as a problem separate from
    artificial intelligence. The proposed test includes a task that involves the automated interpretation
    and generation of natural language.
    """
)

[{'summary_text': ' Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence . The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them . The technology can then extract information and insights from documents as well as categorize and organize the documents themselves .'}]

#### Translation

Using encoder-decoder transformers

In [28]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

Device set to use cpu


In [29]:
translator("Ce séminaire est donné chaque semaine le jeudi après-midi.")

[{'translation_text': 'This seminar is given every week on Thursday afternoon.'}]

Or with a multilingual text-to-text model, where the task is specified in the input prompt:

In [30]:
mtranslator = pipeline(task="translation", model="google-t5/t5-small")

Device set to use cpu


In [32]:
mtranslator("translate to French: I would like to learn more about LLMs.")

[{'translation_text': 'Je voudrais en savoir plus sur les LLM.'}]

## What constitutes a pipeline?

Example for a classification pipeline.

https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/full_nlp_pipeline.svg

In [33]:
pretrained_name = "distilbert-base-uncased-finetuned-sst-2-english"

sent_pipe = pipeline("sentiment-analysis", model=pretrained_name)

Device set to use cpu


In [34]:
corp = ["I love this amazing Transformers introduction seminar.",
        "I hate debugging my code so much!"]

In [35]:
sent_pipe(corp)

[{'label': 'POSITIVE', 'score': 0.9998568296432495},
 {'label': 'NEGATIVE', 'score': 0.9962984919548035}]

### Step 1. Tokenizer

In [36]:
from transformers import AutoTokenizer

In [37]:
tokenizer = AutoTokenizer.from_pretrained(pretrained_name)

In [38]:
# vocabulary indices (token ids) of the tokenized text corpus:
inputs = tokenizer(corp, padding=True, truncation=True, return_tensors="np")
print(inputs)

{'input_ids': array([[  101,  1045,  2293,  2023,  6429, 19081,  4955, 18014,  1012,
          102,     0,     0],
       [  101,  1045,  5223,  2139,  8569, 12588,  2026,  3642,  2061,
         2172,   999,   102]]), 'attention_mask': array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


In [39]:
# For comparison, the pipeline's tokenizer is the same:
sent_pipe.tokenizer(corp, padding=True, truncation=True, return_tensors="np")

{'input_ids': array([[  101,  1045,  2293,  2023,  6429, 19081,  4955, 18014,  1012,
          102,     0,     0],
       [  101,  1045,  5223,  2139,  8569, 12588,  2026,  3642,  2061,
         2172,   999,   102]]), 'attention_mask': array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [40]:
# Textual version of the tokens:
for doc in corp:
    print(tokenizer.tokenize(doc, add_special_tokens=True)) # sub-word / wordpiece

['[CLS]', 'i', 'love', 'this', 'amazing', 'transformers', 'introduction', 'seminar', '.', '[SEP]']
['[CLS]', 'i', 'hate', 'de', '##bu', '##gging', 'my', 'code', 'so', 'much', '!', '[SEP]']
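The sub-word split of "debugging" above can be reproduced with a toy greedy longest-match-first procedure, which is the general idea behind WordPiece tokenization. The vocabulary below is a tiny made-up subset and `wordpiece` is a hypothetical helper, so this is only a sketch of the principle, not the tokenizer's actual implementation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand       # continuation pieces carry the ## prefix
            if cand in vocab:
                piece = cand
                break
            end -= 1                     # try a shorter candidate
        if piece is None:
            return [unk]                 # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"de", "##bu", "##gging", "code", "my"}  # made-up mini vocabulary
print(wordpiece("debugging", toy_vocab))  # ['de', '##bu', '##gging']
print(wordpiece("code", toy_vocab))       # ['code']
```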


In [41]:
tokenizer.decode([101, 1045, 2293, 2023, 6429, 19081, 4955, 18014, 1012, 102, 0, 0])

'[CLS] i love this amazing transformers introduction seminar. [SEP] [PAD] [PAD]'

In [42]:
tokenizer.decode([101, 1045, 5223, 2139, 8569, 12588, 2026, 3642, 2061, 2172, 999, 102])

'[CLS] i hate debugging my code so much! [SEP]'

### Step 2.1. Transformer model

https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/transformer_and_head.svg

In [44]:
from transformers import AutoModel, TFAutoModel

In [45]:
model = TFAutoModel.from_pretrained(pretrained_name)




Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertModel: ['pre_classifier.weight', 'classifier.bias', 'classifier.weight', 'pre_classifier.bias']
- This IS expected if you are initializing TFDistilBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFDistilBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


In [46]:
model.summary()

Model: "tf_distil_bert_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMa  multiple                  66362880  
 inLayer)                                                        
                                                                 
Total params: 66362880 (253.15 MB)
Trainable params: 66362880 (253.15 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [47]:
outputs = model(**inputs)
outputs

TFBaseModelOutput(last_hidden_state=<tf.Tensor: shape=(2, 12, 768), dtype=float32, numpy=
array([[[ 0.7409617 ,  0.09954516,  0.17740384, ...,  0.37085018,
          1.0475554 , -0.53418255],
        [ 0.95344263,  0.1784999 ,  0.07076187, ...,  0.35534364,
          1.1498725 , -0.32719424],
        [ 1.1049625 ,  0.25522926,  0.33767354, ...,  0.3171588 ,
          1.0195712 , -0.3793204 ],
        ...,
        [ 1.1075132 ,  0.08023884,  0.68834203, ...,  0.6212788 ,
          0.59678215, -0.804865  ],
        [ 0.5849833 ,  0.11724682,  0.09631411, ...,  0.5941564 ,
          1.0606879 , -0.2870518 ],
        [ 0.5737111 ,  0.0819948 ,  0.07770069, ...,  0.43690968,
          1.0691311 , -0.36507562]],

       [[-0.12865382,  0.6125046 , -0.43362278, ...,  0.02990168,
         -0.37186378,  0.16652936],
        [-0.08211732,  0.91229683, -0.15405825, ..., -0.02132722,
         -0.2596661 ,  0.25903916],
        [-0.05406391,  0.6817771 , -0.06585187, ...,  0.07498465,
         -0.3

In [48]:
print(outputs.last_hidden_state.shape)

(2, 12, 768)


### Step 2.2. Transformer model with classification head

In [49]:
from transformers import AutoModelForSequenceClassification, TFAutoModelForSequenceClassification

In [50]:
classif_model = TFAutoModelForSequenceClassification.from_pretrained(pretrained_name)

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


In [51]:
classif_model.summary()

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMa  multiple                  66362880  
 inLayer)                                                        
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
 dropout_38 (Dropout)        multiple                  0 (unused)
                                                                 
Total params: 66955010 (255.41 MB)
Trainable params: 66955010 (255.41 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [52]:
outputs2 = classif_model(**inputs)

In [53]:
outputs2

TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-4.2869983,  4.5641413],
       [ 3.06056  , -2.5347545]], dtype=float32)>, hidden_states=None, attentions=None)

In [54]:
outputs2.logits

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-4.2869983,  4.5641413],
       [ 3.06056  , -2.5347545]], dtype=float32)>

In [55]:
print(outputs2.logits.shape)

(2, 2)


### Step 3. Post-processing

In [56]:
probabilities = tf.keras.activations.softmax(outputs2.logits, axis=-1)
probabilities

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[1.4319799e-04, 9.9985683e-01],
       [9.9629849e-01, 3.7014787e-03]], dtype=float32)>

In [58]:
# Predicted class:
probabilities.numpy().argmax(axis=-1)

array([1, 0], dtype=int64)

In [59]:
# Prediction certainty:
probabilities.numpy().max(axis=-1)

array([0.9998568, 0.9962985], dtype=float32)

In [61]:
# Class labels corresponding to the integer encoding:
classif_model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

In [62]:
# For comparison, the pipeline output:
sent_pipe(corp)

[{'label': 'POSITIVE', 'score': 0.9998568296432495},
 {'label': 'NEGATIVE', 'score': 0.9962984919548035}]
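The whole post-processing step can also be reproduced in plain NumPy from the logits printed above, as a quick check that the pipeline does nothing more than a softmax, an argmax and a label lookup. The logits and the label mapping are copied from the outputs shown earlier.

```python
import numpy as np

# Logits and label mapping copied from the outputs above
logits = np.array([[-4.2869983, 4.5641413],
                   [3.06056, -2.5347545]])
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Numerically stable softmax over the class axis
shifted = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

predictions = [{"label": id2label[int(i)], "score": float(p[i])}
               for p, i in zip(probs, probs.argmax(axis=-1))]
print(predictions)
```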

For different tasks, there might be additional preprocessing and feature extraction steps.

### Saving

In [63]:
# save model and tokenizer
tf_save_directory = "./checkpoints/tf_save_pretrained"
tokenizer.save_pretrained(tf_save_directory)
classif_model.save_pretrained(tf_save_directory)

In [64]:
# load model
classif_model = TFAutoModelForSequenceClassification.from_pretrained("./checkpoints/tf_save_pretrained")

Some layers from the model checkpoint at ./checkpoints/tf_save_pretrained were not used when initializing TFDistilBertForSequenceClassification: ['dropout_38']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at ./checkpoints/tf_save_pretrained and are newly initialized: ['dropout_58']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [65]:
# load tokenizer
tokenizer = AutoTokenizer.from_pretrained("./checkpoints/tf_save_pretrained")

## Transfer learning with Keras

#### Loading the data

In [66]:
simpsons = pd.read_csv("../data/simpsons_script_lines.csv",
                       usecols=["raw_character_text", "raw_location_text", "spoken_words", "normalized_text"],
                       dtype={'raw_character_text':'string', 'raw_location_text':'string',
                              'spoken_words':'string', 'normalized_text':'string'})
simpsons = simpsons.dropna().drop_duplicates().reset_index(drop=True)

n_classes = 10
main_characters = simpsons['raw_character_text'].value_counts(dropna=False)[:n_classes].index.to_list()
simpsons_main = simpsons.query("`raw_character_text` in @main_characters")

X = simpsons_main["normalized_text"].to_numpy()
y = simpsons_main["raw_character_text"].to_numpy()
y_int = np.array([np.where(np.array(main_characters)==char)[0].item() for char in y])

In [67]:
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y_int, test_size=0.2, random_state=42, shuffle=True)

#### Loading the pretrained model

In [68]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

pretrained_name2 = "bert-base-cased" # e.g. "bert-base-cased" or "distilbert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(pretrained_name2)
model = TFAutoModelForSequenceClassification.from_pretrained(pretrained_name2, num_labels=n_classes)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [69]:
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  108310272 
                                                                 
 dropout_96 (Dropout)        multiple                  0 (unused)
                                                                 
 classifier (Dense)          multiple                  7690      
                                                                 
Total params: 108317962 (413.20 MB)
Trainable params: 108317962 (413.20 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


To freeze the pretrained transformer weights, and only train the classification head, we can set:

In [73]:
model.layers[0].trainable = False

In [75]:
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  108310272 
                                                                 
 dropout_96 (Dropout)        multiple                  0 (unused)
                                                                 
 classifier (Dense)          multiple                  7690      
                                                                 
Total params: 108317962 (413.20 MB)
Trainable params: 7690 (30.04 KB)
Non-trainable params: 108310272 (413.17 MB)
_________________________________________________________________
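The parameter counts in the summary can be checked by hand: the classification head is a single dense layer mapping the hidden vector to the class logits. The hidden size of 768 is an assumption based on the BERT-base architecture; the other numbers are taken from the summary above.

```python
hidden_size, n_classes = 768, 10          # hidden size assumed for bert-base-cased
bert_params = 108310272                   # from the model summary

classifier_params = hidden_size * n_classes + n_classes   # weights + biases
total_params = bert_params + classifier_params

# float32 weights take 4 bytes per parameter
total_mb = total_params * 4 / 2**20
head_kb = classifier_params * 4 / 2**10

print(classifier_params, total_params)        # 7690 108317962
print(round(total_mb, 2), round(head_kb, 2))  # 413.2 30.04
```

With the transformer frozen, fewer than 0.01% of the parameters are updated during training.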


Leaving the transformer weights trainable (`model.layers[0].trainable = True`) can significantly improve performance on the downstream task, but training takes significantly longer. More care is also needed when selecting the training hyperparameters (low initial learning rate, learning rate decay, not too many epochs) to prevent [catastrophic forgetting](https://en.wikipedia.org/wiki/Catastrophic_interference), i.e. losing the information acquired during pretraining.
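One common way to combine a low learning rate with decay is a linear warmup followed by a linear decay to zero. The sketch below is framework-independent plain Python; the function name, the 10% warmup fraction and the step counts are illustrative assumptions, not settings used in this notebook.

```python
def linear_warmup_decay(step, total_steps, peak_lr=3e-5, warmup_frac=0.1):
    """Learning rate at a given step: linear ramp-up, then linear decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)  # avoid division by zero
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)

total_steps = 100  # hypothetical number of optimizer steps
schedule = [linear_warmup_decay(s, total_steps) for s in range(total_steps + 1)]
print(schedule[0], max(schedule), schedule[-1])
```

The small initial steps limit how far the pretrained weights move while the newly initialized classification head is still producing noisy gradients.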

In [76]:
X_train_tok = dict(tokenizer(X_train.tolist(), padding=True, truncation=True, return_tensors="tf"))
X_valid_tok = dict(tokenizer(X_valid.tolist(), padding=True, truncation=True, return_tensors="tf"))

In [77]:
model.compile(optimizer=optimizers.Adam(learning_rate=3e-5),
              loss=losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

In [None]:
# Training is very slow without a GPU (local or through a cloud service)
epochs = 5
history_ft = model.fit(X_train_tok, y_train, validation_data=(X_valid_tok, y_valid),
                       batch_size=16, epochs=epochs)

For larger datasets, training can be made more efficient with HuggingFace's `Datasets` library, which memory-maps the data from disk instead of loading everything into RAM: https://huggingface.co/docs/datasets/index.

HuggingFace's `Datasets` can then also interact with the Keras API (e.g. the `model.fit()` method), for example through `model.prepare_tf_dataset()` or `Dataset.to_tf_dataset()`.