**Intro to 🤗 Hugging Face APIs for NLP and Vision**

This notebook introduces the Hugging Face APIs, based on the documentation provided by Hugging Face at https://huggingface.co/course/chapter1

It aims to introduce simple yet powerful APIs for getting started with Hugging Face in Python, without requiring deep DL expertise

**Jargon Alert!!!!**

*Hugging Face Hub -* This is a repository of pre-trained models and datasets which can be imported directly by anyone with just one line of code


*Note - This notebook only deals with data preprocessing, inference, and postprocessing. Training/fine-tuning will be shown in the next notebook*



In [1]:
'''
First, let's install transformers
'''
!pip install transformers





---


**Experiment 1: Pipeline**

Pipeline is an API to do inference using a pre-trained DL model (we refer to NLP models here) available in Hugging Face Hub in just 3 lines of code.

It actually performs 3 steps under the hood: preprocessing the input data, passing it to the model, and postprocessing the model output to get an intelligible answer


---





In [2]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
sentences = [
             "I think you can do better",
             "This is a great camera from Amazon"
]
results = classifier(sentences)

print(results)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'NEGATIVE', 'score': 0.9979079961776733}, {'label': 'POSITIVE', 'score': 0.9997420907020569}]
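
The pipeline returns one dictionary per input, in the same order as the inputs. Below is a minimal sketch of pairing each sentence with its prediction. The `results` list is copied from the output above (no model is loaded), while the `threshold` value and the `confident` name are assumptions made only for illustration.

```python
# Inputs and outputs copied from the cell above (no model is loaded here).
sentences = [
    "I think you can do better",
    "This is a great camera from Amazon",
]
results = [
    {"label": "NEGATIVE", "score": 0.9979079961776733},
    {"label": "POSITIVE", "score": 0.9997420907020569},
]

threshold = 0.9  # hypothetical cut-off, chosen only for illustration

# Results come back in input order, so zip() pairs them up correctly.
confident = [
    (sentence, result["label"])
    for sentence, result in zip(sentences, results)
    if result["score"] >= threshold
]

for sentence, label in confident:
    print(f"{label:8s} | {sentence}")
```

Filtering on the score like this is one simple way to ignore low-confidence predictions before acting on them.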




---


You saw that the pipeline handled almost everything. All it needed was the **type** of pipeline (list of pipelines in [this](https://huggingface.co/models) lesson) and the sentences as input.

We chose sentiment analysis here. Depending on the pipeline you choose, the format of the results will be different.

There are multiple models available in Hugging Face Hub to choose from for a task. Since we did not choose one, pipeline used the default model for sentiment analysis from the Hub (distilbert-base-uncased-finetuned-sst-2-english).

We can always give a model name as well. To do that, go to [hub](https://huggingface.co/models) and choose the one you like. 

I like to choose based on the **Task** on the left side. You can also test the model by selecting it and playing with **Hosted Inference API**

Here, I selected **siebert/sentiment-roberta-large-english**, which was one of the models under the *Text Classification* task

GO PLAY!


---



In [3]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="siebert/sentiment-roberta-large-english")
sentences = [
             "I think you can do better",
             "This is a great camera from Amazon"
]
results = classifier(sentences)

print(results)



[{'label': 'NEGATIVE', 'score': 0.9968762397766113}, {'label': 'POSITIVE', 'score': 0.9987062215805054}]




---


Here's another example of a pipeline for an NLP task: question answering


---



In [4]:
ques_ans = pipeline("question-answering")
context = "Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tunea model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script."
result = ques_ans(question="What is extractive question answering?", context=context)
print(result)


No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


{'score': 0.616338312625885, 'start': 33, 'end': 94, 'answer': 'the task of extracting an answer from a text given a question'}
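
The `start` and `end` fields in the result are character offsets into the context, so the answer can be recovered by slicing. The sketch below checks this using only the first sentence of the context and the result dictionary copied from the output above. No model is loaded.

```python
# First sentence of the context and the result, copied from the cell above.
context = (
    "Extractive Question Answering is the task of extracting an answer "
    "from a text given a question."
)
result = {
    "score": 0.616338312625885,
    "start": 33,
    "end": 94,
    "answer": "the task of extracting an answer from a text given a question",
}

# 'start' and 'end' are character positions; 'end' is exclusive, like a slice.
span = context[result["start"]:result["end"]]
print(span)
print(span == result["answer"])
```

This is what "extractive" means here: the answer is a span copied out of the context, not newly generated text.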




---


**Experiment 2: Tokenizers and models**

Remember we said pipeline does 3 things - preprocessing input, passing it to the model, and postprocessing? We now separate these out to gain more control over the process.

**Jargon Alert!!**

Tokenizer - This takes care of steps 1 and 3 (preprocessing and postprocessing data). It handles the conversion from text to numerical inputs for the neural network, and the conversion back to text when it is needed

In NLP, tokenization is the process of dividing a sentence into tokens based on some set of rules, and this is one of the steps a tokenizer performs (more info on tokenizers [here](https://huggingface.co/course/chapter2/4?fw=pt))


---




In [5]:
# AutoTokenizer selects the correct tokenizer class for the chosen checkpoint
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

# The tokenizer transforms raw text into the format the chosen model expects.
# We return PyTorch tensors (return_tensors="pt") because the model needs tensors.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
raw_inputs = [
    "I am delighted with the product",
    "This might not be the best idea"
]

# Preprocessed inputs: pad to the longest sentence and truncate overly long ones
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101,  1045,  2572, 15936,  2007,  1996,  4031,   102,     0],
        [  101,  2023,  2453,  2025,  2022,  1996,  2190,  2801,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1]])}




---


The tokenizer has converted the text into 'input_ids', which will be used by the model! The 'attention_mask' marks real tokens with 1 and padding with 0.
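
To make the effect of `padding=True` concrete, here is a plain-Python sketch of padding a batch to its longest sequence and building the matching attention mask. The token IDs and the pad ID of 0 are read off the output above. The names `batch`, `PAD_ID`, and `max_len` are assumptions for illustration, and this is not how the library implements padding internally.

```python
# Token IDs copied from the output above, before padding.
# (The trailing 0 in the first row of that output is padding.)
batch = [
    [101, 1045, 2572, 15936, 2007, 1996, 4031, 102],
    [101, 2023, 2453, 2025, 2022, 1996, 2190, 2801, 102],
]
PAD_ID = 0  # pad token ID seen in the output above

max_len = max(len(ids) for ids in batch)

# Right-pad every sequence to the length of the longest one.
input_ids = [ids + [PAD_ID] * (max_len - len(ids)) for ids in batch]
# 1 marks a real token, 0 marks padding the model should ignore.
attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]

print(input_ids)
print(attention_mask)
```

Padding gives every row the same length, which is what allows the batch to be stored as a single rectangular tensor.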


---


So let's talk about models now.
There are two kinds of models: with heads and without heads (headless). Which one to use depends on whether we need feature extraction or a specific task like classification.

*Models without heads -* The task-specific last layer is left out, and we get the hidden states (feature vectors) of the base model as output

*Models with heads -* The last layer, called the head, determines the task to be done (like classification)



---





In [6]:
# Model without a head. AutoModel selects the correct architecture for the checkpoint
from transformers import AutoModel

# This architecture contains only the base Transformer: given the inputs from the
# previous step, it outputs hidden states (features), one vector per token.
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['pre_classifier.weight', 'classifier.bias', 'classifier.weight', 'pre_classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


torch.Size([2, 9, 768])




---


So! We see a **2x9x768** tensor here.

2 - number of inputs (batch size)

9 - number of tokens in each input (after padding, including the special tokens)

768 - number of features for each token (the hidden size)
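
To connect this shape to the idea of a head, here is a small sketch that uses a random tensor of the same shape in place of real hidden states. It picks the first token's feature vector for each input and passes it through a single linear layer. This is an assumption-laden illustration only: the weights are random, and the real head in this checkpoint has more layers (the warning above lists `pre_classifier` and `classifier` weights).

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for outputs.last_hidden_state: (batch, tokens, features).
last_hidden_state = torch.randn(2, 9, 768)
batch_size, seq_len, hidden_size = last_hidden_state.shape
print(batch_size, seq_len, hidden_size)

# One 768-dimensional vector per input: the features of the first token.
first_token = last_hidden_state[:, 0, :]
print(first_token.shape)

# Hypothetical head: a single linear layer mapping 768 features to 2 classes.
head = nn.Linear(hidden_size, 2)
logits = head(first_token)
print(logits.shape)
```

A head, in this simplified view, is just the extra layer(s) that turn per-token features into task-specific scores.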


---

Now let's look at models with heads. 

We can choose a model using AutoModelFor*** to do this, which will select the appropriate head.

In this example, we classify the previously obtained inputs (from the tokenizer)



---



In [7]:
import torch
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

outputs = model(**inputs)

# Use softmax to turn the logits into probabilities
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

print(predictions)

# Predicted class index for each input
results = torch.argmax(predictions, dim=-1)
print(results)
# Mapping from class index to label
print(model.config.id2label)

tensor([[1.1872e-04, 9.9988e-01],
        [9.9974e-01, 2.5673e-04]], grad_fn=<SoftmaxBackward>)
tensor([1, 0])
{0: 'NEGATIVE', 1: 'POSITIVE'}
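
The postprocessing above is softmax followed by argmax and a label lookup. A plain-Python sketch of those three steps is shown below. The logits are hypothetical values chosen for illustration (the cell above does not print the real ones), and `id2label` is copied from the output above.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the max keeps exp() numerically stable.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for two inputs; real values come from outputs.logits.
logits_batch = [[-4.0, 4.5], [4.2, -4.1]]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # copied from model.config.id2label

labels = []
for logits in logits_batch:
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])  # argmax
    labels.append(id2label[best])
    print(id2label[best], round(probs[best], 4))
```

Softmax does not change which class scores highest, so argmax on the logits would pick the same labels. The softmax step is only needed when the probabilities themselves are of interest.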



---


While tokenizer + model together handle preprocessing -> inference -> postprocessing well, we now go deeper into tokenization to understand it better.

Translating text to numbers is known as encoding, and this is essentially what a tokenizer does.

Encoding is a two-step process: tokenization (splitting the text into tokens), followed by the conversion of the tokens to input IDs.

*Note - Atomic operations a tokenizer can handle: tokenization, conversion to IDs, and converting IDs back to a string*

We now demonstrate this two-step process below


---




In [8]:
#Dividing tokenization into two steps
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

tokens = tokenizer.tokenize("Using a Transformer network is simple")
print(tokens)

ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
[7993, 170, 13809, 23763, 2443, 1110, 3014]
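
The split of "Transformer" into 'Trans' and '##former' comes from subword tokenization. Below is a simplified sketch of greedy longest-match-first splitting, the idea behind WordPiece tokenizers. The tiny `vocab` is hypothetical and contains only the tokens and IDs printed above. The real tokenizer has a far larger vocabulary and also handles punctuation and normalization, which this sketch ignores.

```python
# Tiny hypothetical vocabulary; the IDs are the ones printed above.
vocab = {
    "Using": 7993, "a": 170, "Trans": 13809, "##former": 23763,
    "network": 2443, "is": 1110, "simple": 3014,
}

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into subword tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a '##' prefix
            if piece in vocab:
                break
            end -= 1
        if end == start:
            return ["[UNK]"]  # no piece matched at this position
        pieces.append(piece)
        start = end
    return pieces

text = "Using a Transformer network is simple"
tokens = [piece for word in text.split() for piece in wordpiece(word, vocab)]
ids = [vocab[token] for token in tokens]
print(tokens)
print(ids)
```

Because "Transformer" is not in the vocabulary as a whole word, the longest matching prefix ('Trans') is taken first and the remainder is matched as a continuation piece.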




---


And if we convert these IDs back to a string (decode), we get the text input back below


---



In [9]:
decoded_string = tokenizer.decode(ids)
print(decoded_string)

Using a Transformer network is simple
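
Decoding reverses both steps: IDs are mapped back to tokens, and '##' continuation pieces are glued onto the preceding token. The sketch below illustrates that idea with a hand-written `id_to_token` table built from the outputs above. It is an assumption-based illustration, not the library's decoding code.

```python
# ID-to-token table for the example sentence (IDs copied from the output above).
id_to_token = {
    7993: "Using", 170: "a", 13809: "Trans", 23763: "##former",
    2443: "network", 1110: "is", 3014: "simple",
}

def decode(ids):
    """Map IDs back to tokens and glue '##' pieces onto the previous token."""
    words = []
    for token in (id_to_token[i] for i in ids):
        if token.startswith("##") and words:
            words[-1] += token[2:]  # drop the '##' marker and merge
        else:
            words.append(token)
    return " ".join(words)

decoded = decode([7993, 170, 13809, 23763, 2443, 1110, 3014])
print(decoded)
```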




---


Calling the tokenizer directly performs both steps in one call. For the same sentence, it yields the same input IDs as above, plus two special tokens: 101 (`[CLS]`) at the start and 102 (`[SEP]`) at the end of the sequence


---



In [10]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

tokens = tokenizer("Using a Transformer network is simple")  # preferred: one call does both steps
print(tokens)

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
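
The extra fields in this output can be sketched in plain Python. The helper below wraps the IDs from the two-step tokenization of "Using a Transformer network is simple" with the start and end IDs (101 and 102) and builds the two companion lists. The function name `add_special_tokens` is a hypothetical helper for illustration, not a library call.

```python
CLS_ID, SEP_ID = 101, 102  # IDs that wrap the sequence in the output above

def add_special_tokens(ids):
    """Build the three fields the tokenizer call returns for one sentence."""
    input_ids = [CLS_ID] + ids + [SEP_ID]
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),  # single sentence: all zeros
        "attention_mask": [1] * len(input_ids),  # no padding: all ones
    }

# IDs from the two-step tokenization of "Using a Transformer network is simple".
encoded = add_special_tokens([7993, 170, 13809, 23763, 2443, 1110, 3014])
print(encoded)
```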




---


Moving forward, we will always use this API to handle the two steps of tokenization automatically



---

Now that we understand pipelines, tokenizers (and how they work under the hood) and models, we shall wrap it up by showing the preferred flow for inference below

**The cell below is a summary of this notebook**


---



In [11]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
    "This is such a nice view",
    "I think this is not up to the mark"
]

# Preprocessing: text -> padded tensors of input IDs
tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors='pt')
# Inference: tensors -> logits
outputs = model(**tokens)

# Postprocessing: use softmax to turn the logits into probabilities
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

print(predictions)

# Predicted class index for each input
results = torch.argmax(predictions, dim=-1)
print(results)
# Mapping from class index to label
print(model.config.id2label)

tensor([[1.5476e-04, 9.9985e-01],
        [9.9977e-01, 2.3102e-04]], grad_fn=<SoftmaxBackward>)
tensor([1, 0])
{0: 'NEGATIVE', 1: 'POSITIVE'}




---


**Bonus Tip**

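Remember how we used pipeline at the beginning of this notebook?
We can do that here too.

But pipelines give us less control over the individual steps. The examples in this notebook only pass Hub model names to pipeline, but its `model` argument also accepts a local directory or an already loaded model object (together with its tokenizer), so fine-tuned weights do not have to be on the Hugging Face Hub. When we need to inspect or customize preprocessing, logits, or postprocessing, the tokenizer + model flow above is the better fit.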


---



In [12]:
from transformers import pipeline
# We can also use a pipeline: this is what we started with
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
sequences = [
    "This is such a nice view",
    "I think this is not up to the mark"
]
result = classifier(sequences)
print(result)




[{'label': 'POSITIVE', 'score': 0.9998452663421631}, {'label': 'NEGATIVE', 'score': 0.9997690320014954}]
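
The pipeline output and the manual flow agree: the pipeline reports only the most likely class and its probability. The sketch below converts the probabilities printed by the summary cell into the pipeline's format. The values are copied from that output (rounded as printed), and `to_pipeline_format` is a hypothetical helper for illustration.

```python
# Probabilities and label map copied from the summary cell's output.
predictions = [[1.5476e-04, 9.9985e-01], [9.9977e-01, 2.3102e-04]]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

def to_pipeline_format(probs):
    """Keep only the most likely class, as the pipeline output does."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return {"label": id2label[best], "score": probs[best]}

formatted = [to_pipeline_format(p) for p in predictions]
print(formatted)
```

Up to the rounding of the printed tensor, the scores match the ones returned by the pipeline above.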
