<a href="https://colab.research.google.com/github/swarnava-96/Hugging-Face/blob/main/Exploring_Hugging_Face.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Getting started with Hugging Face**

In [1]:
# Let's install transformers
!pip install transformers

Collecting transformers
  Downloading transformers-4.11.3-py3-none-any.whl (2.9 MB)
Collecting huggingface-hub>=0.0.17
  Downloading huggingface_hub-0.0.19-py3-none-any.whl (56 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempti

Let's see how this works for sentiment analysis (the other tasks are all covered in the [task summary](https://huggingface.co/transformers/task_summary.html)):

In [2]:
# Let's use pretrained models
from transformers import pipeline
classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


The first time this command runs, a pretrained model and its tokenizer are downloaded and cached. We will look at both later on, but as an introduction, the tokenizer's job is to preprocess the text for the model, which is then responsible for making predictions. The pipeline groups all of that together and post-processes the predictions to make them readable. For instance:

In [3]:
# Testing
classifier('We are very happy to show you the 🤗 Transformers library.')

[{'label': 'POSITIVE', 'score': 0.9997795224189758}]

In [4]:
classifier("He is not a good human being.")

[{'label': 'NEGATIVE', 'score': 0.9997840523719788}]

We can use it on a list of sentences, which will be preprocessed then fed to the model as a batch, returning a list of dictionaries like this one:

In [5]:
results = classifier(["We are very happy to show you the 🤗 Transformers library.",
                      "We hope you don't hate it."])
for result in results:
  print(f"label: {result['label']}, with score: {round(result['score'], 4)}") 


label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309


We can see the second sentence has been classified as negative (the model has to choose between positive and negative), but its score of 0.5309 is
close to 0.5, so the prediction is fairly neutral.
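
Since the pipeline always returns one of the two labels, it can help to flag predictions whose score is close to 0.5. The snippet below is a minimal sketch of that idea in plain Python. The threshold of 0.75 and the helper name `describe` are assumptions made for illustration; they are not part of the library.

```python
# Results copied from the batch example above (scores rounded to 4 decimals).
results = [
    {"label": "POSITIVE", "score": 0.9998},
    {"label": "NEGATIVE", "score": 0.5309},
]

# Hypothetical cut-off: anything below it is reported as uncertain.
CONFIDENCE_THRESHOLD = 0.75


def describe(result, threshold=CONFIDENCE_THRESHOLD):
    """Return the label, or flag the prediction as uncertain if the score is low."""
    if result["score"] < threshold:
        return f"uncertain (leaning {result['label']})"
    return result["label"]


summaries = [describe(r) for r in results]
print(summaries)
```

A sensible threshold depends on the application, so it is worth tuning on real examples rather than taking this value as given.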

By default, the model downloaded for this pipeline is called "distilbert-base-uncased-finetuned-sst-2-english". We can
look at its [model page](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) to get more
information about it. It uses the [DistilBERT architecture](https://huggingface.co/transformers/model_doc/distilbert.html) and has been fine-tuned on a
dataset called SST-2 for the sentiment analysis task.

Let's say we want to use another model; for instance, one that has been trained on French data. We can search through
the [model hub](https://huggingface.co/models) that gathers models pretrained on a lot of data by research labs, but
also community models (usually fine-tuned versions of those big models on a specific dataset). Applying the tags
"French" and "text-classification" gives back a suggestion "nlptown/bert-base-multilingual-uncased-sentiment". Let's
see how we can use it.

We can directly pass the name of the model to use to `pipeline`:

In [6]:
classifier = pipeline("sentiment-analysis", model = "nlptown/bert-base-multilingual-uncased-sentiment")

In [7]:
# Testing
classifier("Esperamos que no lo odie.")

[{'label': '3 stars', 'score': 0.33688196539878845}]

This classifier can now deal with texts in English and French, but also Dutch, German, Italian and Spanish! We can also
replace that name with a local folder where we have saved a pretrained model (see below). We can also pass a model
object and its associated tokenizer.

We will need two classes for this. The first is `AutoTokenizer`, which we will use to download the
tokenizer associated with the model we picked and instantiate it. The second is
`AutoModelForSequenceClassification` (or
`TFAutoModelForSequenceClassification` if we are using TensorFlow), which we will use to download
the model itself. Note that if we were using the library on another task, the class of the model would change. The
[task summary](https://huggingface.co/transformers/task_summary.html) tutorial summarizes which class is used for which task.

In [8]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

Now, to download the model and tokenizer we found previously, we just have to use the
`from_pretrained` method of `TFAutoModelForSequenceClassification` (or `AutoModelForSequenceClassification` in PyTorch)
and of `AutoTokenizer` (we can replace `model_name` with any other model from the model hub):

In [10]:
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
# This model only exists in PyTorch, so we use the `from_pt` flag to import that model in TensorFlow.
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, from_pt = True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline("sentiment-analysis", model = model, tokenizer = tokenizer)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

All the weights of TFBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [11]:
# Testing
classifier("Swarnava is a good boy.")

[{'label': '4 stars', 'score': 0.4621807336807251}]

If we don't find a model that has been pretrained on data similar to ours, we will need to fine-tune a
pretrained model on our data. The library provides [example scripts](https://huggingface.co/transformers/examples.html) to do so.

### Under the hood: pretrained models
Let's now see what happens under the hood when using those pipelines. As we saw, the model and tokenizer are created using the `from_pretrained` method:

In [12]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tf_model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['dropout_95']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We mentioned the tokenizer is responsible for the preprocessing of our texts. First, it will split a given text into
words (or parts of words, punctuation symbols, etc.), usually called *tokens*. There are multiple rules that can govern
that process (we can learn more about them in the [tokenizer summary](https://huggingface.co/transformers/tokenizer_summary.html)), which is why we need
to instantiate the tokenizer using the name of the model, to make sure we use the same rules as when the model was
pretrained.

The second step is to convert those *tokens* into numbers, to be able to build a tensor out of them and feed them to
the model. To do this, the tokenizer has a *vocab*, which is the part we download when we instantiate it with the
`from_pretrained` method, since we need to use the same *vocab* as when the model was pretrained.

To apply these steps on a given text, we can just feed it to our tokenizer:

In [13]:
inputs = tokenizer("We are very happy to show you the 🤗 Transformers library.")

This returns a dictionary mapping strings to lists of ints. It contains the [ids of the tokens](https://huggingface.co/transformers/glossary.html#input-ids), as
mentioned before, but also additional arguments that will be useful to the model. Here for instance, we also have an
[attention mask](https://huggingface.co/transformers/glossary.html#attention-mask) that tells the model which positions hold real tokens and which hold
padding:

In [15]:
print(inputs)

{'input_ids': [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
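
To make the token-to-id step concrete, here is a toy sketch that reproduces the ids printed above. It assumes a tiny vocabulary excerpt read off that output by position, and it splits text with a simple regular expression. The real tokenizer uses WordPiece subword splitting and a much larger vocabulary, so `TOY_VOCAB` and `toy_encode` are illustrative names only.

```python
import re

# Tiny vocabulary excerpt, read off the ids printed above by position.
TOY_VOCAB = {
    "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "we": 2057, "are": 2024, "very": 2200, "happy": 3407, "to": 2000,
    "show": 2265, "you": 2017, "the": 1996, "transformers": 19081,
    "library": 3075, ".": 1012,
}


def toy_encode(text):
    """Lowercase, split into words and single symbols, then look up each id."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Anything missing from the vocabulary (here, the emoji) maps to [UNK].
    ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    # The model expects [CLS] at the start and [SEP] at the end.
    return [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]


text = "We are very happy to show you the \U0001F917 Transformers library."
input_ids = toy_encode(text)
print(input_ids)
```

The emoji is not in the vocabulary, which is why id 100 (`[UNK]`) appears in the middle of the sequence.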


We can pass a list of sentences directly to our tokenizer. If our goal is to send them through our model as a batch, we probably want to pad them all to the same length, truncate them to the maximum length the model can accept, and get tensors back. We can specify all of that to the tokenizer:

In [17]:
tf_batch = tokenizer(
    ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
    padding = True,
    truncation = True,
    max_length = 512,
    return_tensors = "tf"
)

The padding is automatically applied on the side expected by the model (in this case, on the right), with the padding token the model was pretrained with. The attention mask is also adapted to take the padding into account:

In [18]:
for key, value in tf_batch.items():
  print(f"{key} : {value.numpy().tolist()}")

input_ids : [[101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102], [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102, 0, 0, 0]]
attention_mask : [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]
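
The padding behaviour can be sketched in a few lines of plain Python. This is a simplified illustration, assuming right-padding with pad id 0 as in the output above; `pad_batch` is a hypothetical helper, not a library function, and it does not handle truncation.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest one and build matching attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([
    [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102],
    [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102],
])
for key, value in batch.items():
    print(f"{key} : {value}")
```

The second sequence is three tokens shorter than the first, so it receives three pad ids and three zeros in its mask.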


We can learn more about tokenizers [here](https://huggingface.co/transformers/preprocessing.html).

### Using the Model
Once our input has been preprocessed by the tokenizer, we can send it directly to the model. As we mentioned, it
contains all the relevant information the model needs. With a TensorFlow model, we can pass the dictionary returned by
the tokenizer directly to the model; with a PyTorch model, we need to unpack the dictionary by adding `**`.

In [19]:
tf_output = tf_model(tf_batch)

In 🤗 Transformers, model outputs are objects that also behave like tuples (possibly with a single element). Here, the only field that is set is `logits`, the final activations of the model.

In [20]:
print(tf_output)

TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-4.083296  ,  4.336417  ],
       [ 0.08180393, -0.04178003]], dtype=float32)>, hidden_states=None, attentions=None)


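The model can return more than just the final activations (for example a loss, hidden states or attention weights), which is why the output has several fields. Here we only asked for the final activations, so `logits` is the only field that is not `None`, and `tf_output[0]` returns it.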

NOTE: All 🤗 Transformers models (PyTorch or TensorFlow) return the activations of the model before the final activation function (like SoftMax) since this final activation function is often fused with the loss.

Let's apply the SoftMax activation to get predictions.

In [21]:
import tensorflow as tf
tf_predictions = tf.nn.softmax(tf_output[0], axis = -1)

We can see we get the same scores the pipeline returned earlier (0.9998 for the first sentence and 0.5309 for the second):

In [22]:
print(tf_predictions)

tf.Tensor(
[[2.2042930e-04 9.9977952e-01]
 [5.3085673e-01 4.6914327e-01]], shape=(2, 2), dtype=float32)
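
For readers who want to check the arithmetic, the softmax can be reproduced with the standard library alone. This sketch uses the logits printed earlier; the `id2label` mapping is an assumption inferred from the pipeline results above (index 0 is NEGATIVE, index 1 is POSITIVE).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the model output printed above.
logits = [[-4.083296, 4.336417], [0.08180393, -0.04178003]]
probabilities = [softmax(row) for row in logits]

# Assumed label order, inferred from the pipeline results.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
predictions = [
    (id2label[row.index(max(row))], round(max(row), 4)) for row in probabilities
]
print(predictions)
```

Subtracting the largest logit before exponentiating does not change the result, but it avoids overflow when logits are large.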


If we have labels, we can provide them to the model; the output then contains the loss as well as the final activations.

In [23]:
import tensorflow as tf
tf_output = tf_model(tf_batch, labels = tf.constant([1,0]))
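
As a rough guide to what the loss looks like, the sketch below computes a per-example cross-entropy from the logits printed earlier and the labels `[1, 0]`. It assumes the loss is the standard cross-entropy on raw logits; the exact reduction applied by the library (per example or averaged) is not shown in this notebook, so the values are illustrative.

```python
import math


def cross_entropy(logits, label):
    """Negative log-probability of the correct class, computed from raw logits."""
    peak = max(logits)
    # log-sum-exp trick: stable way to compute log(sum(exp(logits))).
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[label]


# Logits copied from the earlier model output; labels from the cell above.
logits = [[-4.083296, 4.336417], [0.08180393, -0.04178003]]
labels = [1, 0]
losses = [cross_entropy(row, y) for row, y in zip(logits, labels)]
print(losses)
```

The first example has a loss close to zero because the model is confident and correct, while the second has a loss near 0.63 because the model is almost undecided.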

Models are standard [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) or [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) so we can use them in our usual training loop. 🤗
Transformers also provides a `Trainer` (or `TFTrainer` if we are using
TensorFlow) class to help with our training (taking care of things such as distributed training, mixed precision,
etc.). See the [training tutorial](https://huggingface.co/transformers/training.html) for more details.

NOTE: PyTorch and TensorFlow model outputs are special dataclasses, so we get autocompletion for their attributes in an IDE. They also behave like a tuple or a dictionary (e.g., we can index with an integer, a slice or a string), in which case the attributes that are not set (that have `None` values) are ignored.
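
One practical consequence of skipping unset attributes is that integer indices shift once a loss is present. The toy class below imitates that behaviour to show the effect. `ToyOutput` is a hypothetical stand-in written for this illustration, not the library's output class, and the numbers in it are made up.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class ToyOutput:
    """Hypothetical stand-in for a model output; not the library's class."""
    loss: Optional[Any] = None
    logits: Optional[Any] = None
    hidden_states: Optional[Any] = None

    def to_tuple(self):
        # Keep only the attributes that were actually set.
        values = (getattr(self, f.name) for f in fields(self))
        return tuple(v for v in values if v is not None)

    def __getitem__(self, key):
        # String keys act like a dictionary, integers and slices like a tuple.
        if isinstance(key, str):
            return getattr(self, key)
        return self.to_tuple()[key]


# Made-up values, for illustration only.
without_labels = ToyOutput(logits=[[0.1, 0.9]])
with_labels = ToyOutput(loss=0.105, logits=[[0.1, 0.9]])

print(without_labels[0])      # the logits, since loss is None and skipped
print(with_labels[0])         # the loss, since it is now the first set field
print(with_labels["logits"])  # string access is unaffected by the shift
```

Accessing fields by name is therefore the safer habit when the same code runs both with and without labels.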

Once our model is fine-tuned, we can save it with its tokenizer in the following way:
```
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)
```

We can then load this model back using the `AutoModel.from_pretrained` method by passing the directory name instead of the model name. One cool feature of 🤗 Transformers is that we can easily switch between PyTorch and TensorFlow: any model saved as before can be loaded back either in PyTorch or TensorFlow. If we are loading a saved PyTorch model in a TensorFlow model, we use `TFAutoModel.from_pretrained` like this:
```
tokenizer = AutoTokenizer.from_pretrained(save_directory)
model = TFAutoModel.from_pretrained(save_directory, from_pt=True)
```
and if we are loading a saved TensorFlow model in a PyTorch model, we should use the following code:
```
tokenizer = AutoTokenizer.from_pretrained(save_directory)
model = AutoModel.from_pretrained(save_directory, from_tf=True)
```

Lastly, we can also ask the model to return all hidden states and all attention weights if we need them:

```
tf_outputs = tf_model(tf_batch, output_hidden_states=True, output_attentions=True)
all_hidden_states, all_attentions = tf_outputs[-2:]
```


### Accessing the code
The `AutoModel` and `AutoTokenizer` classes are just shortcuts that will automatically work with any
pretrained model. Behind the scenes, the library has one model class per combination of architecture and task (for
example, DistilBERT for sequence classification), so the code is easy to access and tweak if we need to.

In our previous example, the model was called "distilbert-base-uncased-finetuned-sst-2-english", which means it's using
the [DistilBERT](https://huggingface.co/transformers/model_doc/distilbert.html) architecture. As
`AutoModelForSequenceClassification` (or
`TFAutoModelForSequenceClassification` if we are using TensorFlow) was used, the model
automatically created is then a `DistilBertForSequenceClassification` (or a
`TFDistilBertForSequenceClassification` in TensorFlow, as in this notebook). We can look at its
documentation for all details relevant to that specific model, or browse the source code. This is how we would
directly instantiate the model and tokenizer without the auto magic:

In [27]:
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = TFDistilBertForSequenceClassification.from_pretrained(model_name)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['dropout_135']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Customizing the model
If we want to change how the model itself is built, we can define a custom configuration class. Each architecture
comes with its own relevant configuration (in the case of DistilBERT, `DistilBertConfig`), which
allows us to specify the hidden dimension, dropout rate, etc. If we make core modifications, like changing the
hidden size, we won't be able to use a pretrained model anymore and will need to train from scratch. We would then
instantiate the model directly from this configuration.

Here we use the predefined vocabulary of DistilBERT (hence load the tokenizer with the
`DistilBertTokenizer.from_pretrained` method) and initialize the model from scratch (hence
instantiate the model from the configuration instead of using the
`DistilBertForSequenceClassification.from_pretrained` method).

In [26]:
from transformers import DistilBertConfig, DistilBertTokenizer, TFDistilBertForSequenceClassification
config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4*512)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = TFDistilBertForSequenceClassification(config)
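
To see why changing the hidden size rules out pretrained weights, we can compare the weight shapes implied by the two configurations. This is a back-of-the-envelope sketch: it assumes the default DistilBERT base sizes (`n_heads=12`, `dim=768`, `hidden_dim=3072`), and the `shapes` helper is hypothetical and only covers two representative matrices.

```python
# Default DistilBERT base sizes versus the custom configuration from the cell above.
pretrained = {"n_heads": 12, "dim": 768, "hidden_dim": 3072}
custom = {"n_heads": 8, "dim": 512, "hidden_dim": 4 * 512}


def shapes(cfg):
    """Shapes of one attention projection and one feed-forward matrix."""
    assert cfg["dim"] % cfg["n_heads"] == 0, "dim must be divisible by n_heads"
    return {
        "head_size": cfg["dim"] // cfg["n_heads"],
        "attention_projection": (cfg["dim"], cfg["dim"]),
        "feed_forward": (cfg["dim"], cfg["hidden_dim"]),
    }


pretrained_shapes = shapes(pretrained)
custom_shapes = shapes(custom)

# Any entry listed here could not be filled from the pretrained checkpoint.
mismatched = [k for k in custom_shapes if custom_shapes[k] != pretrained_shapes[k]]
print(custom_shapes)
print(mismatched)
```

Both configurations use 64 dimensions per attention head, but the projection and feed-forward matrices have different shapes, so the pretrained tensors cannot be copied in and the model has to be trained from scratch.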

For something that only changes the head of the model (for instance, the number of labels), we can still use a
pretrained model for the body. For instance, let's define a classifier for 10 different labels using a pretrained body.
We could create a configuration with all the default values and just change the number of labels, but more easily, we
can directly pass any argument a configuration would take to the `from_pretrained` method and it will update the
default configuration with it:

In [28]:
from transformers import DistilBertConfig, DistilBertTokenizer, TFDistilBertForSequenceClassification
model_name = "distilbert-base-uncased"
model = TFDistilBertForSequenceClassification.from_pretrained(model_name, num_labels=10)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_layer_norm', 'activation_13', 'vocab_transform', 'vocab_projector']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier', 'pre_classifier', 'dropout_155']
You should probably TRAIN this model on a down-stream task to be able to use 