<a href="https://colab.research.google.com/github/santiagxf/interpret/blob/main/saliency-maps/loading_transformers_allennlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Loading a Hugging Face model into AllenNLP

In [1]:
%pip install transformers allennlp eli5 --quiet
%pip install -U google-cloud-storage==1.40.0 --quiet

In this case we will use `nlptown/bert-base-multilingual-uncased-sentiment`, a `bert-base-multilingual-uncased` model fine-tuned for sentiment analysis on product reviews in six languages: English, Dutch, German, French, Spanish and Italian. It predicts the sentiment of a review as a number of stars (between 1 and 5).

The model can be used directly as a sentiment analysis model for product reviews in any of the six languages, or further fine-tuned on related sentiment analysis tasks. To keep the example small, we won't fine-tune it on our own data here.

## Loading the model with transformers

In [22]:
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

model_uri = 'nlptown/bert-base-multilingual-uncased-sentiment'

config = AutoConfig.from_pretrained(model_uri)
tokenizer = AutoTokenizer.from_pretrained(model_uri)
classifier = AutoModelForSequenceClassification.from_pretrained(model_uri, config=config)

The transformers library provides a convenient way to store all the artifacts of a given model: the model's `save_pretrained` method.

In [2]:
model_path = 'rating_classifier'
classifier.save_pretrained(model_path)

This writes the model configuration (`config.json`) together with the weights, stored in `pytorch_model.bin` in the transformers version used here. However, remember that to run the model we also need its corresponding tokenizer. The same `save_pretrained` method is available for the tokenizer and generates another set of files:

In [3]:
tokenizer.save_pretrained(model_path)

('rating_classifier/tokenizer_config.json',
 'rating_classifier/special_tokens_map.json',
 'rating_classifier/vocab.txt',
 'rating_classifier/added_tokens.json',
 'rating_classifier/tokenizer.json')
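
Before pointing AllenNLP at this directory, it can help to confirm that the expected artifacts are actually there. The following is a minimal sketch and not part of the original notebook: `missing_artifacts` is a hypothetical helper, and the file names are assumptions taken from the tokenizer output above plus `config.json` and the weights file, whose name depends on the transformers version (`pytorch_model.bin` or `model.safetensors`).

```python
from pathlib import Path
from typing import List

# Assumed file names: config.json comes from classifier.save_pretrained,
# vocab.txt and tokenizer_config.json from tokenizer.save_pretrained.
REQUIRED_FILES = ("config.json", "vocab.txt", "tokenizer_config.json")
# The weights file name depends on the transformers version in use.
WEIGHT_FILES = ("pytorch_model.bin", "model.safetensors")


def missing_artifacts(model_dir: str) -> List[str]:
    """Return the names of expected files that are absent from model_dir."""
    root = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    # Accept either weights format, since only one of them is normally written.
    if not any((root / name).is_file() for name in WEIGHT_FILES):
        missing.append(" or ".join(WEIGHT_FILES))
    return missing
```

With the cells above executed, `missing_artifacts(model_path)` would be expected to return an empty list.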

## Loading the saved model using AllenNLP

In [4]:
model_name = 'rating_classifier'

### Vocabulary

In [5]:
from allennlp.data.vocabulary import Vocabulary

transformer_vocab = Vocabulary.from_pretrained_transformer(model_name)

### Tokenizer

AllenNLP divides the tokenization work into two parts, which keeps the approach modular:

- Tokenizer: a Tokenizer splits a string of text into a sequence of discrete tokens, typically word tokens or character tokens. It goes from text -> sequence of tokens.
- Indexer: its job is to take a sequence of tokens and translate each one to its index according to the vocabulary. It goes from sequence of tokens -> sequence of indexes.
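
The two stages can be illustrated with a toy example. This is only a sketch of the idea, not AllenNLP's implementation: `toy_tokenize` and `toy_index` are hypothetical helpers, the tokenizer splits on whitespace instead of using WordPiece, the vocabulary entries are copied from the token IDs this notebook prints further down, and the ID used for unknown tokens is an assumption.

```python
from typing import Dict, List

# Toy vocabulary copied from the token IDs printed later in this notebook;
# the real vocabulary lives in rating_classifier/vocab.txt.
TOY_VOCAB: Dict[str, int] = {
    "[CLS]": 101, "[SEP]": 102, "this": 10372, "is": 10127,
    "a": 143, "great": 11838, "read": 18593,
}
UNK_ID = 100  # assumed ID for tokens missing from the vocabulary


def toy_tokenize(text: str) -> List[str]:
    """Tokenizer step: text -> sequence of tokens (whitespace split, not WordPiece)."""
    return ["[CLS]", *text.lower().split(), "[SEP]"]


def toy_index(tokens: List[str]) -> List[int]:
    """Indexer step: sequence of tokens -> sequence of vocabulary indexes."""
    return [TOY_VOCAB.get(token, UNK_ID) for token in tokens]


tokens = toy_tokenize("this is a great read")
token_ids = toy_index(tokens)
print(tokens)
print(token_ids)
```

Keeping the two steps separate means the same tokens could be indexed in more than one way, which is the modularity AllenNLP is after.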


In [6]:
from allennlp.data.tokenizers.pretrained_transformer_tokenizer import PretrainedTransformerTokenizer
from allennlp.data.token_indexers.pretrained_transformer_indexer import PretrainedTransformerIndexer

transformer_tokenizer = PretrainedTransformerTokenizer(model_name)
token_indexer = PretrainedTransformerIndexer(model_name)


Some weights of the model checkpoint at rating_classifier were not used when initializing BertModel: ['classifier.weight', 'classifier.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Embedder

The embedder's job is to provide vectors for tokens: it takes a token's index in the vocabulary and returns its vector representation.

In [16]:
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder
from allennlp.modules.token_embedders.pretrained_transformer_embedder import PretrainedTransformerEmbedder

In [99]:
token_embedder = BasicTextFieldEmbedder({ "tokens": PretrainedTransformerEmbedder(model_name) })

> Note that AllenNLP supports providing multiple embedders for different inputs, which follows from its modular approach.

A `Seq2VecEncoder` is a Module that takes a sequence of vectors as input and returns a single vector: the input has shape `(batch_size, sequence_length, input_dim)` and the output is a `(batch_size, output_dim)` tensor. The BERT architecture ends with a pooling layer, which returns an embedding for the `[CLS]` token after passing it through a linear layer and a non-linear tanh activation; this layer is also part of the BERT model.

We can create it in AllenNLP:

In [100]:
from allennlp.modules.seq2vec_encoders.bert_pooler import BertPooler

transformer_encoder = BertPooler(model_name)
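
To make the pooling step concrete, here is a dependency-free sketch of what a BERT-style pooler computes for each sequence: pick the first token's vector (the `[CLS]` position), apply a dense layer, then tanh. It is an illustration only; `bert_style_pool` is a hypothetical helper, the weights are made-up values, and the real `BertPooler` uses the pretrained weights and operates on tensors.

```python
import math
from typing import List

Vector = List[float]


def bert_style_pool(sequence: List[Vector], weight: List[Vector], bias: Vector) -> Vector:
    """Pool one sequence of token vectors into a single vector.

    Takes the first token's vector (the [CLS] position), applies a dense
    layer and then tanh.
    """
    cls_vector = sequence[0]
    return [
        math.tanh(sum(w * x for w, x in zip(row, cls_vector)) + b)
        for row, b in zip(weight, bias)
    ]


# A batch of 2 sequences, each with 3 tokens of dimension 2:
# shape (batch_size=2, sequence_length=3, input_dim=2).
batch = [
    [[0.5, -1.0], [0.1, 0.2], [0.3, 0.4]],
    [[0.0, 0.0], [0.9, 0.8], [0.7, 0.6]],
]
# Identity weights and zero bias (made-up values) keep the arithmetic easy to follow.
weight = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]

# One pooled vector per sequence: shape (batch_size=2, output_dim=2).
pooled = [bert_style_pool(sequence, weight, bias) for sequence in batch]
print(pooled)
```

Only the first token of each sequence influences the result, which is why the sequence dimension disappears from the output shape.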

### Building the model

In [101]:
from allennlp.models import BasicClassifier

model = BasicClassifier(vocab=transformer_vocab, 
                        text_field_embedder=token_embedder, 
                        seq2vec_encoder=transformer_encoder, 
                        dropout=0.1, 
                        num_labels=5)

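### Loading the model's weights

`BasicClassifier` creates its classification layer with freshly initialized weights, so we copy the fine-tuned weights and bias over from the Hugging Face classifier: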

In [102]:
# Copy the fine-tuned classification head (weight and bias) into the AllenNLP model.
model._classification_layer.load_state_dict(classifier.classifier.state_dict())

In [103]:
_ = model.eval()

## Data readers

AllenNLP uses the concept of a `DatasetReader`, which creates `Instance` objects that provide the inputs in the format the model expects. This abstraction lets the framework apply any preprocessing needed before the data is actually sent to the model.

In [104]:
from typing import Dict, Optional

from allennlp.data import DatasetReader, Instance, Batch
from allennlp.data.fields import Field, TextField, LabelField
from allennlp.data.token_indexers import TokenIndexer
from allennlp.data.tokenizers import Tokenizer

class ClassificationTransformerReader(DatasetReader):
    """Turns raw text (and an optional label) into AllenNLP instances."""

    def __init__(
        self,
        tokenizer: Tokenizer,
        token_indexer: TokenIndexer,
        max_tokens: int,
        **kwargs
    ):
        super().__init__(**kwargs)
        self.tokenizer = tokenizer
        # The "tokens" key must match the key used in the BasicTextFieldEmbedder.
        self.token_indexers: Dict[str, TokenIndexer] = { "tokens": token_indexer }
        self.max_tokens = max_tokens

    def text_to_instance(self, text: str, label: Optional[str] = None) -> Instance:
        tokens = self.tokenizer.tokenize(text)
        # Truncate long inputs so they fit the model's maximum length.
        if self.max_tokens:
            tokens = tokens[: self.max_tokens]

        fields: Dict[str, Field] = { }
        fields["tokens"] = TextField(tokens, self.token_indexers)

        # The label is only present for training or evaluation data.
        if label is not None:
            fields["label"] = LabelField(label)

        return Instance(fields)

In [105]:
dataset_reader = ClassificationTransformerReader(tokenizer=transformer_tokenizer, 
                                                 token_indexer=token_indexer, 
                                                 max_tokens=400)
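
One caveat with the slice-based truncation in `text_to_instance`: when an input is longer than `max_tokens`, the slice also cuts off the closing `[SEP]` token. The sketch below shows one possible alternative that keeps the final token. `truncate_keep_sep` is a hypothetical helper, shown here on plain strings for readability; it is not part of AllenNLP or of the original notebook.

```python
from typing import List


def truncate_keep_sep(tokens: List[str], max_tokens: int) -> List[str]:
    """Truncate to max_tokens while keeping the final token (e.g. [SEP])."""
    if max_tokens < 2:
        raise ValueError("max_tokens must leave room for [CLS] and [SEP]")
    if len(tokens) <= max_tokens:
        return tokens
    # Keep the first max_tokens - 1 tokens and re-append the last one.
    return tokens[: max_tokens - 1] + tokens[-1:]


tokens = ["[CLS]", "this", "is", "a", "great", "read", "[SEP]"]
print(tokens[:4])                    # plain slice: the closing [SEP] is lost
print(truncate_keep_sep(tokens, 4))  # keeps [CLS] ... [SEP]
```

`PretrainedTransformerTokenizer` also accepts a `max_length` argument, which may be a simpler way to get a similar effect at tokenization time.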

Testing if the reader works:

In [106]:
instance = dataset_reader.text_to_instance("this is a great read everyone should have")

In [107]:
from allennlp.nn import util

dataset = Batch([instance])
dataset.index_instances(transformer_vocab)
model_input = util.move_to_device(dataset.as_tensor_dict(), model._get_prediction_device())

In [108]:
model.make_output_human_readable(model(**model_input))

{'label': ['4'],
 'logits': tensor([[-2.3093, -2.4158, -0.5280,  1.7162,  2.7962]],
        grad_fn=<AddmmBackward0>),
 'probs': tensor([[0.0044, 0.0039, 0.0260, 0.2448, 0.7209]], grad_fn=<SoftmaxBackward0>),
 'token_ids': tensor([[  101, 10372, 10127,   143, 11838, 18593, 36053, 14693, 10574,   102]]),
 'tokens': [['[CLS]',
   'this',
   'is',
   'a',
   'great',
   'read',
   'everyone',
   'should',
   'have',
   '[SEP]']]}
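
The `probs` in this output are the softmax of the `logits`, and `label` appears to be the position of the largest probability in string form (the vocabulary built here carries no label names). The sketch below recomputes both from the rounded logits printed above; `softmax` is a small helper written for this illustration.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Logits copied from the output above (rounded to four decimals).
logits = [-2.3093, -2.4158, -0.5280, 1.7162, 2.7962]
probs = softmax(logits)
# The predicted class is the index with the highest probability.
predicted_index = max(range(len(probs)), key=probs.__getitem__)

print([round(p, 4) for p in probs])
print(predicted_index)
```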

## Making the tokenizer and the model a single piece

In [109]:
from allennlp.predictors import TextClassifierPredictor

predictor = TextClassifierPredictor(model, dataset_reader)

In [110]:
predictor.predict("this is a great read everyone should have")

{'label': '4',
 'logits': [-2.309262275695801,
  -2.415769338607788,
  -0.5280130505561829,
  1.7162026166915894,
  2.7961645126342773],
 'probs': [0.004371450748294592,
  0.003929796162992716,
  0.025954479351639748,
  0.2448289394378662,
  0.7209153175354004],
 'token_ids': [101, 10372, 10127, 143, 11838, 18593, 36053, 14693, 10574, 102],
 'tokens': ['[CLS]',
  'this',
  'is',
  'a',
  'great',
  'read',
  'everyone',
  'should',
  'have',
  '[SEP]']}
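
Since the model rates reviews from 1 to 5 stars, the zero-based `label` still has to be translated into a star count. The following is a minimal sketch under the assumption that class index `i` corresponds to `i + 1` stars; `to_star_rating` is a hypothetical helper, and `classifier.config.id2label` is the place to check the names the checkpoint actually defines.

```python
from typing import Dict

# Assumed mapping from class index to stars; check classifier.config.id2label
# for the label names stored with the checkpoint.
INDEX_TO_STARS: Dict[int, int] = {index: index + 1 for index in range(5)}


def to_star_rating(prediction: Dict) -> Dict:
    """Summarize a predictor output as a star rating with its probability."""
    index = int(prediction["label"])
    return {"stars": INDEX_TO_STARS[index], "probability": prediction["probs"][index]}


# Trimmed copy of the prediction shown above.
prediction = {
    "label": "4",
    "probs": [0.004371450748294592, 0.003929796162992716, 0.025954479351639748,
              0.2448289394378662, 0.7209153175354004],
}
print(to_star_rating(prediction))
```

In the notebook itself, the same helper could be applied directly to the dictionary returned by `predictor.predict(...)`.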