In this notebook, we will explore how to use pre-trained transformer models from the Hugging Face library for Natural Language Processing (NLP) tasks. Pre-trained models, such as BERT (Bidirectional Encoder Representations from Transformers), are powerful tools that save us from training models from scratch. They let us leverage the knowledge the model has already learned from vast amounts of text data.


**What are pre-trained models?**


*   Pre-trained models are machine learning models that have already been trained on large text datasets with general-purpose objectives. These models can be used directly or fine-tuned for specific tasks like sentiment analysis, text classification, or question answering.

**Why use Hugging Face?**


*   Hugging Face provides a user-friendly interface to access a wide range of pre-trained models like BERT, GPT-2, RoBERTa, and many more. It also offers tools like tokenizers and pipelines that simplify complex tasks.

**What will you learn in this notebook?**



*   How to load and use a pre-trained model for inference.
*   Step-by-step procedures for tokenization, creating inputs, and obtaining model predictions.









# **Step 1 : Install the Required Libraries**

In [1]:
! pip install transformers
! pip install torch





*   **Transformers** : This library provides the pre-trained models and tokenizers.
*   **Torch** : The Hugging Face models in this notebook run on PyTorch, so you need to have PyTorch installed. You can also use TensorFlow with Hugging Face, but we will focus on PyTorch here.



# **Step 2 : Import Required Libraries**

In [2]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

**AutoTokenizer** : This tokenizes your input text into a format that the model can understand (token IDs).

**AutoModelForSequenceClassification** : This loads a pre-trained model with a classification head, for tasks such as sentiment analysis or spam detection.

**torch** : This is the PyTorch library, which is essential for running the model.

# **Step 3: Load Pre-trained Tokenizer and Model**

Here's where we load a pre-trained model and tokenizer. You can use BERT, GPT-2, or any other model available on Hugging Face.

**Loading the Tokenizer**

In [3]:
tokenizer = AutoTokenizer.from_pretrained('google-bert/bert-base-uncased')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.





*   **'bert-base-uncased'** : This is a popular pre-trained version of BERT, published on the Hub as `google-bert/bert-base-uncased`. **uncased** means the text is lowercased before tokenization, so the model doesn't differentiate between uppercase and lowercase.
*   The tokenizer converts text into tokens (words or subwords) and maps them to IDs, which is the form the model takes as input.
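
To make the "words or subwords" idea concrete, here is a minimal sketch of WordPiece-style splitting, the scheme BERT tokenizers are based on. The vocabulary `TOY_VOCAB` and the helpers `wordpiece` and `encode` are made up for illustration; the real tokenizer also handles punctuation and adds special tokens such as `[CLS]` and `[SEP]`, which this toy skips.

```python
# Toy WordPiece-style tokenizer: greedy longest-match-first over a tiny,
# made-up vocabulary. The real BERT vocabulary has about 30,000 entries.
TOY_VOCAB = {"[UNK]": 0, "hug": 1, "##ging": 2, "face": 3, "i": 4,
             "love": 5, "token": 6, "##izer": 7, "##s": 8}

def wordpiece(word, vocab):
    """Split one lowercase word into subword pieces, or return ['[UNK]']."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

def encode(text, vocab):
    """Lowercase, split on whitespace, and map every piece to its ID."""
    pieces = [p for word in text.lower().split() for p in wordpiece(word, vocab)]
    return pieces, [vocab[p] for p in pieces]

pieces, ids = encode("I love Hugging Face tokenizers", TOY_VOCAB)
print(pieces)
print(ids)
```

A word that is not in the vocabulary as a whole ("hugging") is broken into a known stem plus `##` continuation pieces, which is why the number of tokens can be larger than the number of words.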



**Loading the model**

In [4]:
model = AutoModelForSequenceClassification.from_pretrained('google-bert/bert-base-uncased', num_labels =2)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




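*   **AutoModelForSequenceClassification** : This loads the pre-trained BERT encoder and adds a sequence classification head on top. As the warning above says, the head (`classifier.weight`, `classifier.bias`) is newly initialized, so the model has to be fine-tuned on a task such as sentiment analysis before its predictions mean anything.
*   **num_labels=2**: This indicates that we are working with a binary classification task (e.g., positive/negative sentiment).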



# **Step 4: Tokenize the Input Text**

Once the tokenizer and model are loaded, we need to tokenize the input text. Tokenization is the process of converting text into tokens (smaller pieces like words or subwords).

In [5]:
text = ['I love Hugging Face Transformers!', "I hate bugs in my code."]

# Tokenizing the text
inputs = tokenizer(text, padding = True, truncation = True, max_length = 512, return_tensors='pt')




*   **padding=True**: Pads all sentences in the batch to the length of the longest one (important when working with batches of text).
*   **truncation=True**: Cuts off any text that is longer than max_length.
*   **max_length=512**: Limits the tokenized input to a maximum of 512 tokens (the maximum sequence length for bert-base-uncased).
*   **return_tensors='pt'**: Returns the output as PyTorch tensors.
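
The effect of padding and truncation on a batch can be sketched in plain Python. This is only an illustration, not the tokenizer's implementation: `pad_and_truncate` is a hypothetical helper, the token IDs are made up, and the real tokenizer keeps the closing `[SEP]` token when it truncates. The pad ID of 0 matches bert-base-uncased but should be treated as an assumption for other models.

```python
PAD_ID = 0  # ID of [PAD] in bert-base-uncased; an assumption for other models

def pad_and_truncate(batch, max_length, pad_id=PAD_ID):
    """Hypothetical helper: cut each ID list to max_length, then pad to the longest."""
    truncated = [ids[:max_length] for ids in batch]
    target = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = target - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two made-up sequences of different lengths
batch = [[11, 12, 13], [21, 22, 23, 24, 25, 26]]
encoded = pad_and_truncate(batch, max_length=5)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The `inputs` object returned by the tokenizer has the same two keys, `input_ids` and `attention_mask`, which is why it can be passed to the model as `**inputs`.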





# **Step 5: Make Predictions with the Model**

Now that the input text is tokenized, we can feed it into the model to make predictions. Since this is a classification task, the model will output the raw scores (logits) for each class.

In [6]:
# getting model prediction (logits)

with torch.no_grad():
  outputs = model(**inputs)

# outputs.logits contains the raw prediction scores
logits = outputs.logits
print(logits)

tensor([[0.6960, 0.4076],
        [0.8498, 0.1031]])


**outputs.logits:** The raw output from the model (logits). These are not probabilities but scores that the model predicts for each class.

# **Step 6: Convert Logits to Probabilities and Class Labels**

To interpret the output logits, we need to convert them into probabilities. The typical way to do this is to apply the **softmax function**, which turns logits into probabilities.

In [12]:
# Applying softmax to get probabilities
probabilities = torch.nn.functional.softmax(logits, dim=-1)

# Get the predicted class for each sample
predicted_classes = torch.argmax(probabilities, dim=-1)
print(f"Predicted class: {predicted_classes.tolist()}")

for idx, pred_class in enumerate(predicted_classes):
    print(f"Prediction for sample {idx}: {pred_class.item()}")

Predicted class: [0, 0]
Prediction for sample 0: 0
Prediction for sample 1: 0




*   **softmax:** This function converts logits into a probability distribution.
*   **torch.argmax:** This gives the index of the class with the highest probability.
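
To see what these two functions do to the numbers, here is a minimal sketch in plain Python that recomputes them by hand. The `softmax` helper is a hand-written stand-in for `torch.nn.functional.softmax`, and the logits are copied from the rounded values printed in Step 5, so the last digit may differ slightly from what PyTorch reports.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

logits = [[0.6960, 0.4076], [0.8498, 0.1031]]  # rounded values printed in Step 5

results = []
for row in logits:
    probs = softmax(row)
    # argmax: index of the largest probability
    predicted = max(range(len(probs)), key=probs.__getitem__)
    results.append((probs, predicted))
    print([round(p, 4) for p in probs], "-> class", predicted)
```

Softmax never changes which score is largest, so taking the argmax of the logits directly would give the same predicted class.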

If you are performing binary classification (e.g., sentiment analysis with positive/negative labels), each entry of predicted_classes will be either 0 or 1:



*   0 might correspond to "negative" sentiment.
*   1 might correspond to "positive" sentiment.
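
Turning class indices into readable labels is a simple lookup. The sketch below assumes the mapping 0 = negative, 1 = positive; that mapping is hypothetical, because which index means which sentiment is decided by the labels used during fine-tuning. Hugging Face models keep such a mapping in `model.config.id2label`, which for a freshly created two-label model holds the generic names `LABEL_0` and `LABEL_1`.

```python
# Hypothetical mapping: the meaning of each index comes from the
# fine-tuning labels, not from the model architecture.
id2label = {0: "negative", 1: "positive"}

texts = ["I love Hugging Face Transformers!", "I hate bugs in my code."]
predicted_ids = [0, 0]          # class indices printed above
confidences = [0.5716, 0.6785]  # softmax of the Step 5 logits, highest value per row

labelled = []
for text, class_id, conf in zip(texts, predicted_ids, confidences):
    labelled.append((text, id2label[class_id]))
    print(f"{text!r} -> {id2label[class_id]} ({conf:.1%})")
```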






# **Step 7: Interpreting the Results**

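At this point, we have the predicted class and the probability for each class. Keep in mind that the classification head has not been trained yet, so these predictions are not meaningful: both sentences land in class 0 even though one is positive and one is negative.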

In [14]:
print(f"Probabilities: {probabilities}")
print(f"Predicted class: {predicted_classes.tolist()}")

Probabilities: tensor([[0.5716, 0.4284],
        [0.6785, 0.3215]])
Predicted class: [0, 0]


# **Step 8: Fine-Tuning the Model (Optional)**

Fine-tuning involves these key steps:


1.  Preparing the dataset
2.  Tokenizing the text data
3.  Loading a pre-trained model and adding task-specific layers
4.  Training the model on your data
5.  Evaluating and saving the fine-tuned model

By following this step-by-step approach, you will be able to fine-tune a pre-trained model from Hugging Face for your specific task!
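
The training and evaluation steps can be sketched as a bare PyTorch loop. This is a sketch under loud assumptions, not a recipe: the token IDs and labels are random stand-ins for a real tokenized dataset, and the model is a tiny BERT built from a `BertConfig` with random weights so that nothing has to be downloaded. In a real run you would load the model with `AutoModelForSequenceClassification.from_pretrained(...)` as in Step 3 and evaluate on held-out data.

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

torch.manual_seed(0)

# Stand-in for steps 1-2: made-up token IDs and labels instead of a real dataset.
input_ids = torch.randint(low=1, high=100, size=(8, 12))
attention_mask = torch.ones_like(input_ids)
labels = torch.tensor([0, 1, 0, 1, 0, 1, 0, 1])

# Stand-in for step 3: a tiny, randomly initialized BERT (assumption for this sketch).
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=1,
                    num_attention_heads=2, intermediate_size=64, num_labels=2)
model = BertForSequenceClassification(config)

# Step 4: a few gradient steps; passing labels makes the model return a loss.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
model.train()
losses = []
for step in range(5):
    optimizer.zero_grad()
    outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    losses.append(outputs.loss.item())
    print(f"step {step}: loss = {losses[-1]:.4f}")

# Step 5: switch to eval mode and measure accuracy (here on the training batch itself).
model.eval()
with torch.no_grad():
    preds = model(input_ids=input_ids, attention_mask=attention_mask).logits.argmax(dim=-1)
accuracy = (preds == labels).float().mean().item()
print(f"accuracy on the toy batch: {accuracy:.2f}")
```

With random data the loss values carry no meaning; the point is only the shape of the loop: forward pass with labels, `backward()`, optimizer step, then evaluation under `torch.no_grad()`.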







# **Step 9 : Save and Load the Model**

Once we have fine-tuned the model, we can save it and load it for later use.

In [3]:
# save the model and tokenizer
## model.save_pretrained('./saved_model')
## tokenizer.save_pretrained('./saved_model')

# Load the model and tokenizer back
## model = AutoModelForSequenceClassification.from_pretrained('./saved_model')
## tokenizer = AutoTokenizer.from_pretrained('./saved_model')
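
A quick way to check that saving and loading worked is to compare the outputs of the original and the reloaded model on the same input. The sketch below does this offline with a tiny, randomly initialized BERT and a temporary directory; both are assumptions made for the illustration, and with a real fine-tuned model you would use your own save path as in the cell above.

```python
import tempfile

import torch
from transformers import (AutoModelForSequenceClassification, BertConfig,
                          BertForSequenceClassification)

# Tiny stand-in model with random weights (assumption: replaces the fine-tuned model).
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=1,
                    num_attention_heads=2, intermediate_size=64, num_labels=2)
model = BertForSequenceClassification(config).eval()
input_ids = torch.tensor([[1, 5, 9, 2]])  # made-up token IDs

# Save to a temporary directory and load the weights back from it.
with tempfile.TemporaryDirectory() as save_dir:
    model.save_pretrained(save_dir)
    reloaded = AutoModelForSequenceClassification.from_pretrained(save_dir).eval()

# In eval mode dropout is off, so both models should give the same logits.
with torch.no_grad():
    same = torch.allclose(model(input_ids).logits, reloaded(input_ids).logits, atol=1e-6)
print("reloaded model reproduces the logits:", same)
```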