This notebook walks through some examples of how to use [Hugging Face's implementation of Transformers](https://github.com/huggingface/transformers), including how to:

*   fine-tune a pre-trained BERT model and apply it to a downstream task such as sequence classification
*   directly apply a pre-trained model to several supported tasks

This notebook is based on: https://huggingface.co/transformers/custom_datasets.html, https://github.com/huggingface/transformers/blob/master/notebooks/03-pipelines.ipynb

You can also find other useful tutorials on Hugging Face's website: https://huggingface.co/transformers/notebooks.html

In [1]:
import tensorflow as tf
print(tf.__version__)

2.3.0


In [2]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/3a/83/e74092e7f24a08d751aa59b37a9fc572b2e4af3918cb66f7766c3affb1b4/transformers-3.5.1-py3-none-any.whl (1.3MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting sentencepiece==0.1.91
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
Collecting tokenizers==0.9.3
  Downloading https://files.pythonhosted.org/packages/4c/34/b39eb9994bc3c999270b69c9eea40ecc6f0e97991dba28282b9fd32d44ee/tokenizers-0.9.3-cp36-cp36m-manylinux1_x86_64.whl (2.9MB)

# **Sequence Classification with IMDb Reviews**

In this example, we’ll show how to download, tokenize, and train a model on the IMDb reviews dataset. This task takes the text of a review and requires the model to predict whether the sentiment of the review is positive or negative. Let’s start by downloading the dataset from the [Large Movie Review Dataset webpage](http://ai.stanford.edu/~amaas/data/sentiment/).

NOTE: This dataset can also be explored in the Hugging Face model hub (IMDb), and can alternatively be downloaded with the NLP library via `load_dataset("imdb")`. Here, though, we show a general way to handle a custom dataset from scratch.

## **Data Processing**

In [3]:
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

--2020-11-30 14:29:07--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2020-11-30 14:29:15 (10.7 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]



This data is organized into *pos* and *neg* folders with one text file per example. Let’s write a function that can read this in.

In [4]:
from pathlib import Path

def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text())
            labels.append(0 if label_dir == "neg" else 1)  # compare strings with ==, not identity

    return texts, labels

train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')
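As a quick sanity check, the reader can be exercised on a tiny throwaway directory that mimics the *pos*/*neg* layout, without unpacking the full archive. The sketch below is illustrative only: the `toy_split` directory, the file names, and the review texts are made up, and the function is repeated (with sorted iteration, which the notebook does not use) so that the block runs on its own and gives a deterministic order.

```python
import tempfile
from pathlib import Path

def read_imdb_split(split_dir):
    """Read one text file per example from the pos/ and neg/ subfolders."""
    split_dir = Path(split_dir)
    texts, labels = [], []
    for label_dir in ["pos", "neg"]:
        # sorted() only makes the order deterministic for this demo
        for text_file in sorted((split_dir / label_dir).iterdir()):
            texts.append(text_file.read_text(encoding="utf-8"))
            labels.append(0 if label_dir == "neg" else 1)
    return texts, labels

# Build a hypothetical two-review split in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "toy_split"
    for name, text in [("pos", "Loved it."), ("neg", "Dull and far too long.")]:
        (root / name).mkdir(parents=True)
        (root / name / "0_1.txt").write_text(text, encoding="utf-8")
    toy_texts, toy_labels = read_imdb_split(root)

print(toy_texts, toy_labels)
```

Positive reviews are read first and receive label 1; negative reviews follow with label 0.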

We now have a train and test dataset, but let’s also create a validation set which we can use for evaluation and tuning without tainting our test set results. Sklearn has a convenient utility for creating such splits:

In [5]:
from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)
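To make the effect of `test_size=.2` concrete, here is a pure-Python sketch of a shuffled 80/20 split. It is not scikit-learn's implementation: `simple_split` and the toy data are hypothetical, and a fixed seed is assumed so that the split is reproducible.

```python
import random

def simple_split(texts, labels, test_size=0.2, seed=0):
    """Shuffle example indices, then hold out a `test_size` fraction for validation."""
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)
    n_val = int(round(len(indices) * test_size))
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    # Same return order as train_test_split: train/val texts, then train/val labels.
    return (pick(texts, train_idx), pick(texts, val_idx),
            pick(labels, train_idx), pick(labels, val_idx))

toy_texts = [f"review {i}" for i in range(10)]
toy_labels = [i % 2 for i in range(10)]
tr_x, va_x, tr_y, va_y = simple_split(toy_texts, toy_labels)
print(len(tr_x), len(va_x))  # 8 training examples, 2 validation examples
```

With scikit-learn itself, passing `random_state` makes the split reproducible, and `stratify=train_labels` keeps the class balance the same in both parts.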

We’ve read in our dataset. Now let’s tackle tokenization. We’ll eventually train a classifier using pre-trained DistilBERT (a distilled version of BERT with far fewer parameters and nearly the same performance), so let’s use the DistilBERT tokenizer.

In [6]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

Now we can simply pass our texts to the tokenizer. We’ll pass `truncation=True` and `padding=True`, which ensure that all of our sequences are padded to the same length and truncated to be no longer than the model’s maximum input length. This allows us to feed batches of sequences into the model at the same time.

In [10]:
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
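The effect of `truncation=True` and `padding=True` can be sketched in pure Python. This is a simplified illustration, not the tokenizer's implementation: `pad_and_truncate` is a hypothetical helper, the token ids are made up, and a real tokenizer also keeps the closing special token when it truncates, which this sketch does not.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest remaining one."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids; real ids come from the tokenizer's vocabulary.
batch = pad_and_truncate(
    [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 5, 102]],
    max_length=5,
)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Every row ends up with the same length, which is what lets the sequences be stacked into a single batch.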

Now, let’s turn our labels and encodings into a Dataset object. In TensorFlow, we pass our input encodings and labels to the `from_tensor_slices` constructor method. We put the data in this format so that it can be easily batched, with each key in the batch encoding corresponding to a named parameter of the `call()` method (the TensorFlow counterpart of PyTorch's `forward()`) of the model we will train.

In [25]:
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
))
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
))
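Conceptually, `from_tensor_slices` cuts the column-wise inputs along their first axis into one `(features, label)` pair per example, and batching later regroups consecutive examples. The pure-Python sketch below mimics that behaviour on toy data; it does not use TensorFlow, and `slice_examples`, `batch_examples`, and the token ids are hypothetical.

```python
def slice_examples(features, labels):
    """Yield (feature_dict, label) pairs, one per row of the column-wise inputs."""
    for i, label in enumerate(labels):
        yield {name: column[i] for name, column in features.items()}, label

def batch_examples(examples, batch_size):
    """Group consecutive examples back into column-wise batches."""
    examples = list(examples)
    for start in range(0, len(examples), batch_size):
        chunk = examples[start:start + batch_size]
        features = {name: [ex[0][name] for ex in chunk] for name in chunk[0][0]}
        yield features, [ex[1] for ex in chunk]

toy_encodings = {
    "input_ids": [[101, 7, 102], [101, 8, 102], [101, 9, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
}
toy_labels = [1, 0, 1]
batches = list(batch_examples(slice_examples(toy_encodings, toy_labels), batch_size=2))
print(batches[0])
```

Each batch keeps the `input_ids` and `attention_mask` keys, so it can be passed to the model by parameter name.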

## **Fine-tuning with Trainer**


The steps above prepared the datasets in the form that the trainer expects. Now all we need to do is create a model to fine-tune, define the [TrainingArguments](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments)/[TFTrainingArguments](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TFTrainingArguments) and instantiate a [Trainer](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Trainer)/[TFTrainer](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TFTrainer).

In [None]:
from transformers import TFDistilBertForSequenceClassification, TFTrainer, TFTrainingArguments

training_args = TFTrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

with training_args.strategy.scope():
    model = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

trainer = TFTrainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

trainer.train()

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_layer_norm', 'vocab_transform', 'vocab_projector', 'activation_13']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier', 'classifier', 'dropout_59']
You should probably TRAIN this model on a down-stream task to be able to use i
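To give a feel for what `warmup_steps=500` means, the sketch below computes a learning rate that rises linearly from zero during warmup and then decays linearly to zero. The shape of the schedule, the peak rate of 5e-5, and the total of 1250 steps (20,000 training examples after the 80/20 split, in batches of 16, for one epoch) are assumptions for illustration; `lr_at_step` is a hypothetical helper, not the trainer's code.

```python
def lr_at_step(step, peak_lr=5e-5, warmup_steps=500, total_steps=1250):
    """Linear warmup to peak_lr, then linear decay to zero (an assumed schedule)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)

for step in (0, 250, 500, 875, 1250):
    print(step, lr_at_step(step))
```

Under these assumptions, warmup covers 500 of the 1250 steps, so a large share of a single epoch runs below the peak learning rate.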

## **Fine-tuning with native TensorFlow**

Alternatively, we can train with native TensorFlow:

In [None]:
from transformers import TFDistilBertForSequenceClassification

model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')

model.summary()

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_layer_norm', 'vocab_transform', 'vocab_projector', 'activation_13']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier', 'classifier', 'dropout_79']
You should probably TRAIN this model on a down-stream task to be able to use i

Model: "tf_distil_bert_for_sequence_classification_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
distilbert (TFDistilBertMain multiple                  66362880  
_________________________________________________________________
pre_classifier (Dense)       multiple                  590592    
_________________________________________________________________
classifier (Dense)           multiple                  1538      
_________________________________________________________________
dropout_79 (Dropout)         multiple                  0         
Total params: 66,955,010
Trainable params: 66,955,010
Non-trainable params: 0
_________________________________________________________________


In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss, metrics=['accuracy']) # can also use any keras loss fn
model.fit(train_dataset.shuffle(1000).batch(16), epochs=1)  # the dataset is already batched, so batch_size is not passed



<tensorflow.python.keras.callbacks.History at 0x7f3023512da0>

# **Use hidden states of Transformer model**

**NOTE:** If you want to use a Transformer model as part of your own model, you may want access to the hidden states it outputs:

For more info, see the official doc: https://huggingface.co/transformers/model_doc/bert.html#tfbertmodel

In [28]:
from transformers import TFBertModel, BertTokenizer
import tensorflow as tf
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = TFBertModel.from_pretrained('bert-base-cased', return_dict=True)


Some layers from the model checkpoint at bert-base-cased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


In [29]:
inputs = tokenizer("Hey, how are you today?", return_tensors="tf", max_length=5) # max_length caps the sequence length; longer inputs are truncated
print(f"Input:{inputs}" )
outputs = model(inputs)
last_hidden_states = outputs.last_hidden_state
print(last_hidden_states.shape)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Input:{'input_ids': <tf.Tensor: shape=(1, 5), dtype=int32, numpy=array([[ 101, 4403,  117, 1293,  102]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(1, 5), dtype=int32, numpy=array([[0, 0, 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 5), dtype=int32, numpy=array([[1, 1, 1, 1, 1]], dtype=int32)>}
(1, 5, 768)
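The shape `(1, 5, 768)` reads as (batch size, sequence length, hidden size): one 768-dimensional vector per token. A common way to turn these into a single vector per sentence is masked mean pooling, sketched below in pure Python on a tiny made-up tensor (hidden size 2 instead of 768). `masked_mean_pool` is a hypothetical helper and is not something this notebook does.

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring positions where the mask is 0."""
    pooled = []
    for token_vectors, mask in zip(hidden_states, attention_mask):
        kept = [vec for vec, m in zip(token_vectors, mask) if m == 1]
        hidden_size = len(token_vectors[0])
        pooled.append([sum(vec[d] for vec in kept) / len(kept) for d in range(hidden_size)])
    return pooled

# Made-up values: batch of 1, sequence length 3, hidden size 2; the last token is padding.
toy_hidden = [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]]
toy_mask = [[1, 1, 0]]
print(masked_mean_pool(toy_hidden, toy_mask))  # [[2.0, 3.0]]
```

The padded position is excluded, so its large values do not distort the sentence vector.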


# **Can we use a pre-trained model directly?**

The answer is yes! With **pipelines**, we can apply a state-of-the-art model to a given text in just a few lines of code.

**Pipelines** provide a high-level, easy-to-use API for running inference on a variety of downstream tasks, including:



*   Sentence Classification (Sentiment Analysis): Indicates whether the overall sentence is positive or negative, i.e. a binary classification task.
*   Token Classification (Named Entity Recognition, Part-of-Speech tagging): Assigns a label to each sub-entity (token) in the input, i.e. a per-token classification task.
*   Question-Answering: Given a tuple (question, context), the model finds the span of text in the context that answers the question.
*   Mask-Filling: Suggests possible word(s) to fill the masked input with respect to the provided context.
*   Summarization: Summarizes the input article to a shorter article.
*   Translation: Translates the input from a language to another language.
*   Feature Extraction: Maps the input to a higher, multi-dimensional space learned from the data.

**Pipelines** encapsulate the overall flow of every NLP task:


*   Tokenization: Split the initial input into multiple sub-entities with ... properties (i.e. tokens).
*   Inference: Maps every token into a more meaningful representation.
*   Decoding: Use the above representation to generate and/or extract the final output for the underlying task.


The overall API is exposed to the end-user through the `pipeline()` function with the following structure:

In [30]:
from transformers import pipeline

# Using default model and tokenizer for the task
# pipeline("<task-name>")

# Using a user-specified model
# pipeline("<task-name>", model="<model_name>")

# Using custom model/tokenizer as str
# pipeline('<task-name>', model='<model name>', tokenizer='<tokenizer_name>')

In [31]:
from __future__ import print_function
import ipywidgets as widgets

## **Sentence Classification - Sentiment Analysis**

In [None]:
nlp_sentence_classif = pipeline('sentiment-analysis')


In [35]:
nlp_sentence_classif('ANLY590 is a great course !')

[{'label': 'POSITIVE', 'score': 0.999788224697113}]
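The `score` field is a probability for the predicted label. Assuming, as is usual for classification heads, that it comes from a softmax over the model's two output logits, the computation can be sketched as follows. The logits here are made-up values, not numbers taken from the model.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

labels = ["NEGATIVE", "POSITIVE"]
toy_logits = [-4.0, 4.5]  # hypothetical values for illustration
probs = softmax(toy_logits)
best = max(range(len(probs)), key=probs.__getitem__)
print({"label": labels[best], "score": probs[best]})
```

A large gap between the two logits pushes the winning probability close to 1, which is why confident predictions show scores such as 0.9997.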

## **Token Classification - Named Entity Recognition**

In [None]:
nlp_token_class = pipeline('ner')

In [None]:
nlp_token_class('ANLY590 is a deep learning course taught by Dr.Hines and Dr.Wittenbach at Georgetown University.')

[{'entity': 'I-ORG', 'index': 1, 'score': 0.7398982048034668, 'word': 'AN'},
 {'entity': 'I-ORG', 'index': 2, 'score': 0.5741072297096252, 'word': '##L'},
 {'entity': 'I-PER', 'index': 15, 'score': 0.9984267950057983, 'word': 'Hi'},
 {'entity': 'I-PER',
  'index': 16,
  'score': 0.9949373006820679,
  'word': '##nes'},
 {'entity': 'I-PER', 'index': 20, 'score': 0.9989754557609558, 'word': 'W'},
 {'entity': 'I-PER',
  'index': 21,
  'score': 0.9809252619743347,
  'word': '##itte'},
 {'entity': 'I-PER', 'index': 22, 'score': 0.9432482719421387, 'word': '##n'},
 {'entity': 'I-PER',
  'index': 23,
  'score': 0.988129734992981,
  'word': '##bach'},
 {'entity': 'I-ORG',
  'index': 25,
  'score': 0.9992836117744446,
  'word': 'Georgetown'},
 {'entity': 'I-ORG',
  'index': 26,
  'score': 0.9952607154846191,
  'word': 'University'}]
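The NER output in this section is reported per word piece, so names such as "Wittenbach" arrive split into `W`, `##itte`, `##n`, `##bach`. A minimal post-processing sketch that glues the `##` continuation pieces back together is shown below. `merge_word_pieces` is a hypothetical helper, and the input is copied from the output in this section with the scores omitted.

```python
def merge_word_pieces(tokens):
    """Glue '##' continuation pieces onto the preceding token."""
    merged = []
    for tok in tokens:
        is_continuation = (
            tok["word"].startswith("##")
            and bool(merged)
            and tok["index"] == merged[-1]["last_index"] + 1
        )
        if is_continuation:
            merged[-1]["word"] += tok["word"][2:]  # drop the '##' marker
            merged[-1]["last_index"] = tok["index"]
        else:
            merged.append({"word": tok["word"], "entity": tok["entity"],
                           "last_index": tok["index"]})
    return [(m["word"], m["entity"]) for m in merged]

# Token-level predictions from this section (scores omitted).
ner_output = [
    {"entity": "I-ORG", "index": 1, "word": "AN"},
    {"entity": "I-ORG", "index": 2, "word": "##L"},
    {"entity": "I-PER", "index": 15, "word": "Hi"},
    {"entity": "I-PER", "index": 16, "word": "##nes"},
    {"entity": "I-PER", "index": 20, "word": "W"},
    {"entity": "I-PER", "index": 21, "word": "##itte"},
    {"entity": "I-PER", "index": 22, "word": "##n"},
    {"entity": "I-PER", "index": 23, "word": "##bach"},
    {"entity": "I-ORG", "index": 25, "word": "Georgetown"},
    {"entity": "I-ORG", "index": 26, "word": "University"},
]
print(merge_word_pieces(ner_output))
```

This only rejoins word pieces; "Georgetown" and "University" stay separate entries because neither is a `##` continuation. Some versions of the library also expose a built-in grouping option on the `ner` pipeline, so it is worth checking the documentation for the installed version.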

## **Question Answering**

In [36]:
nlp_qa = pipeline('question-answering')

In [37]:
nlp_qa(context='ANLY590 is a deep learning course taught by Dr.Hines and Dr.Wittenbach at Georgetown University.', question='Who teaches ANLY590 ?')



{'answer': 'Dr.Hines and Dr.Wittenbach',
 'end': 70,
 'score': 0.7651808857917786,
 'start': 44}
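The `start` and `end` fields are character offsets into the context, with `end` exclusive, so the answer can be recovered by slicing. The short check below uses the context and result values copied from the cell and output shown here.

```python
context = ("ANLY590 is a deep learning course taught by Dr.Hines and "
           "Dr.Wittenbach at Georgetown University.")
# Values copied from the question-answering output.
result = {"answer": "Dr.Hines and Dr.Wittenbach", "start": 44, "end": 70}

span = context[result["start"]:result["end"]]
# Mark the answer inside the original context with square brackets.
highlighted = context[:result["start"]] + "[" + span + "]" + context[result["end"]:]
print(span)
print(highlighted)
```

The offsets make it easy to highlight the answer in the source text instead of only printing it.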

## **Translation**

Translation is currently supported by T5 for the language mappings English-to-French (translation_en_to_fr), English-to-German (translation_en_to_de) and English-to-Romanian (translation_en_to_ro).

In [None]:
# English to French
translator = pipeline('translation_en_to_fr')
translator("ANLY590 is a deep learning course taught by Dr.Hines and Dr.Wittenbach at Georgetown University.")

Some weights of T5Model were not initialized from the model checkpoint at t5-base and are newly initialized: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[{'translation_text': "ANLY590 est un cours d'apprentissage approfondi enseigné par Dr.Hines et Dr.Wittenbach à l'Université Georgetown."}]