## Notebook Objectives:

In the last notebook, we covered **AutoTokenizer**. In this notebook, our focus will be on **AutoModel**.

### Objectives:

1. **Introduction to AutoModel**: We will explore the concept of AutoModel and understand its role in Hugging Face's Transformers library.

2. **Providing Raw Data**: We will learn how to provide raw data to the AutoModel for inference. This involves preprocessing the input data, tokenizing it using AutoTokenizer, and feeding it to AutoModel for generating outputs.

3. **Mapping Logit Scores**: We will learn how to interpret the output of AutoModel, which typically consists of logits. We will learn how to map these logits to positive and negative sentiment scores.

By the end of this notebook, you will know how to run sentiment analysis with AutoModel: tokenize raw text, feed it to the model, and map the output logits to labels.




Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

Collecting datasets
  Downloading datasets-2.17.1-py3-none-any.whl (536 kB)
Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
Collecting responses<0.19 (from evaluate)
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Installing collected packages: dill, responses, multiproce

The code below uses the Hugging Face `transformers` library to build a sentiment analysis pipeline around a pre-trained model. Here's a breakdown of the code:

1. **Importing the necessary modules**:
   - The code begins by importing the `pipeline` function from the `transformers` library. This function allows us to easily create pipelines for various natural language processing (NLP) tasks.

2. **Creating a sentiment analysis pipeline**:
   - The `pipeline` function is used to create a sentiment analysis pipeline. With the default model loaded here, the pipeline classifies each input text as positive or negative.
   - The argument `"sentiment-analysis"` specifies that we want to create a pipeline specifically for sentiment analysis.

3. **Performing sentiment analysis**:
   - The `classifier` object created by the pipeline function is then used to analyze the sentiment of two input sentences.
   - The input sentences are provided as a list within the `classifier` function call.
   - The sentiment analysis pipeline analyzes each input sentence and predicts its sentiment.

4. **Output**:
   - The output of the `classifier` function call is a list of dictionaries, where each dictionary represents the sentiment analysis result for one of the input sentences.
   - Each dictionary contains two keys:
     - `"label"`: The predicted sentiment label, which for this model is either "POSITIVE" or "NEGATIVE".
     - `"score"`: The probability the model assigns to the predicted label, a value between 0 and 1.



In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



[{'label': 'POSITIVE', 'score': 0.9598048329353333},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]
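
Because the pipeline returns a plain list of dictionaries, its results can be post-processed with ordinary Python. The sketch below pairs each input sentence with its result and flags low-confidence predictions. The helper `summarize` and the threshold `MIN_SCORE` are hypothetical names chosen for illustration, and the results are copied from the output above rather than recomputed.

```python
texts = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
# Results copied from the pipeline output above.
pipeline_results = [
    {"label": "POSITIVE", "score": 0.9598048329353333},
    {"label": "NEGATIVE", "score": 0.9994558691978455},
]

MIN_SCORE = 0.99  # hypothetical confidence threshold, not part of the pipeline API


def summarize(texts, results, min_score):
    """Pair each text with its label and whether the score reaches min_score."""
    return [
        (text, result["label"], result["score"] >= min_score)
        for text, result in zip(texts, results)
    ]


summary = summarize(texts, pipeline_results, MIN_SCORE)
for text, label, confident in summary:
    flag = "confident" if confident else "low confidence"
    print(f"{label:8} ({flag}): {text}")
```

With this threshold, only the second sentence counts as a confident prediction; a different threshold would change that split.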

`classifier.model`: This line accesses the `model` attribute of the `classifier` object. The `model` attribute provides access to the underlying pre-trained model used by the `pipeline` object. You can use this attribute to directly interact with the model, access its parameters, fine-tune it on custom data, or perform any other operations supported by the underlying model architecture.


In [None]:
# classifier details
classifier.model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

## AutoTokenizer


- `from transformers import AutoTokenizer`: This line imports the `AutoTokenizer` class from the Transformers library. Tokenizers are used to preprocess text data before feeding it into a model. The `AutoTokenizer` class automatically selects an appropriate tokenizer based on the provided model checkpoint name.

- `checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"`: This line defines the checkpoint name of a pre-trained model for text classification. In this case, it specifies the "distilbert-base-uncased-finetuned-sst-2-english" checkpoint, which is a fine-tuned version of the DistilBERT model on the Stanford Sentiment Treebank (SST-2) dataset.

- `tokenizer = AutoTokenizer.from_pretrained(checkpoint)`: This line initializes an instance of the `AutoTokenizer` class with the specified checkpoint name. The `from_pretrained` method loads the pre-trained tokenizer associated with the provided checkpoint. The tokenizer will be used to tokenize and preprocess text inputs for the text classification model.




In [None]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


- `inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")`: This line tokenizes `raw_inputs`. `padding=True` pads every sequence to the length of the longest sequence in the batch, `truncation=True` truncates sequences that exceed the maximum length the model accepts, and `return_tensors="pt"` returns PyTorch tensors. Along with `input_ids`, the tokenizer returns an `attention_mask` that marks real tokens with 1 and padding with 0.
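
To make the padding step concrete, here is a minimal pure-Python sketch of what `padding=True` does to a batch. It is not the tokenizer's actual implementation: `pad_batch` is a hypothetical helper, the token ids are copied from the tokenizer output shown further down in this notebook (with the padding zeros removed), and the pad id of 0 is taken from that same output.

```python
# Token ids copied from this notebook's tokenizer output, without padding.
sequences = [
    [101, 1999, 2023, 5278, 5957, 1997, 9932, 1010, 1045, 2359, 2000, 4553,
     2222, 5244, 1012, 102],
    [101, 1045, 5223, 2784, 4083, 2061, 2172, 999, 102],
    [101, 2393, 102],
]


def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest length and build the attention mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # pad on the right
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch(sequences)
for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(len(ids), sum(mask))  # padded length, number of real tokens
```

Every row ends up 16 ids long, while the mask records that only 16, 9, and 3 of them are real tokens. The sketch covers padding only; truncation is left out.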


In [None]:
raw_inputs = [
    "In This changing landscape of AI, I wanted to learn LLMs.",
    "I hate Deep Learning so much!",
    "Help"
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
# When you pass 3 sentences of lengths (5, 10, 15) to the tokenizer with padding=True,
# the tokenizer pads every sentence to the same length,
# which is the length of the longest sentence in the input.

print("-------- input to model --------------------")
print("input to the model: ",inputs)
print("-------- input mapping to ids --------------")
print("input ids",inputs['input_ids'] )
print("-------- tokens to the model  --------------")

tokenized_sentences = inputs["input_ids"]

# Loop through each row of token ids and print the text it decodes to
for sentence in tokenized_sentences:
    # decode() turns the ids back into one string, including special tokens
    # such as [CLS], [SEP], and [PAD]
    sentence_tokens = tokenizer.decode(sentence)

    # Print the decoded string
    print(sentence_tokens)

-------- input to model --------------------
input to the model:  {'input_ids': tensor([[ 101, 1999, 2023, 5278, 5957, 1997, 9932, 1010, 1045, 2359, 2000, 4553,
         2222, 5244, 1012,  102],
        [ 101, 1045, 5223, 2784, 4083, 2061, 2172,  999,  102,    0,    0,    0,
            0,    0,    0,    0],
        [ 101, 2393,  102,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}
-------- input mapping to ids --------------
input ids tensor([[ 101, 1999, 2023, 5278, 5957, 1997, 9932, 1010, 1045, 2359, 2000, 4553,
         2222, 5244, 1012,  102],
        [ 101, 1045, 5223, 2784, 4083, 2061, 2172,  999,  102,    0,    0,    0,
            0,    0,    0,    0],
        [ 101, 2393,  102,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0

## AutoModel
- `model = AutoModel.from_pretrained(checkpoint)`: This loads the pre-trained model named by `checkpoint`, downloading the weights from the Hugging Face Hub on first use and reusing the local cache afterwards. `AutoModel` loads only the base Transformer, without a task-specific head, so its output is a tensor of hidden states rather than class scores.


In [None]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

### Without unpacking
outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])

### With unpacking
outputs = model(**inputs)
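
The two calls above are equivalent because `**` expands a dictionary into keyword arguments. The sketch below shows this with a hypothetical stand-in function, `fake_model`, and a hand-written dictionary, `toy_inputs`; neither is part of the Transformers library.

```python
def fake_model(input_ids, attention_mask=None):
    """Hypothetical stand-in for a model call; it only reports what it received."""
    return {
        "n_sequences": len(input_ids),
        "has_mask": attention_mask is not None,
    }


toy_inputs = {
    "input_ids": [[101, 2393, 102]],
    "attention_mask": [[1, 1, 1]],
}

# Passing each key by hand ...
explicit = fake_model(
    input_ids=toy_inputs["input_ids"],
    attention_mask=toy_inputs["attention_mask"],
)
# ... is the same as letting ** expand the dictionary.
unpacked = fake_model(**toy_inputs)

print(explicit == unpacked)  # True
```

Unpacking is convenient because the call stays the same even when a tokenizer returns additional keys that the model accepts.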

In [None]:
# **inputs unpacks the tokenizer's dictionary into keyword arguments (input_ids=..., attention_mask=...).
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([3, 16, 768])


In [None]:
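from transformers import AutoModelForSequenceClassification

# Same checkpoint as before, now loaded together with its sequence-classification head,
# so the output holds one logit per label instead of hidden states.
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)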

In [None]:
print(outputs.logits.shape)

torch.Size([3, 2])
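
The two shapes printed so far differ because `AutoModelForSequenceClassification` adds a classification head on top of the hidden states: `[3, 16, 768]` is one vector per token, while `[3, 2]` is one score per label for each sentence. The pure-Python sketch below illustrates the idea under simplifying assumptions: a toy hidden size of 4 instead of 768, random numbers instead of real activations, and a single linear layer applied to the first token's vector. The real DistilBERT head also applies a 768-to-768 layer with ReLU and dropout before the final projection. All names here are hypothetical.

```python
import random

random.seed(0)  # fixed seed so the toy numbers are reproducible

BATCH, SEQ_LEN, HIDDEN, NUM_LABELS = 3, 16, 4, 2  # HIDDEN is 768 in the real model

# Toy stand-in for last_hidden_state: one vector per token per sentence.
hidden_states = [
    [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(SEQ_LEN)]
    for _ in range(BATCH)
]

# Hypothetical head parameters: one weight row and one bias per label.
weights = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(NUM_LABELS)]
bias = [0.0] * NUM_LABELS


def classification_head(hidden_states):
    """Project the first token's vector of each sentence to NUM_LABELS logits."""
    logits = []
    for sentence_vectors in hidden_states:
        first_token = sentence_vectors[0]  # position 0 holds the [CLS] token
        logits.append([
            sum(w * x for w, x in zip(row, first_token)) + b
            for row, b in zip(weights, bias)
        ])
    return logits


toy_logits = classification_head(hidden_states)
print(len(toy_logits), len(toy_logits[0]))  # 3 2
```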


In [None]:
print(outputs.logits)

tensor([[-1.5473,  1.5419],
        [ 3.3221, -2.7394],
        [-2.2521,  2.2720]], grad_fn=<AddmmBackward0>)


In [None]:
import torch

# Softmax over the last dimension turns each row of logits into probabilities that sum to 1.
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[0.0436, 0.9564],
        [0.9977, 0.0023],
        [0.0107, 0.9893]], grad_fn=<SoftmaxBackward0>)
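
To see what the softmax step computes, the following sketch reimplements it with the standard library and applies it to the logits printed earlier. It is an illustration of the formula, softmax(x)_i = exp(x_i) / sum_j exp(x_j), not PyTorch's implementation; the function name and variable names are chosen for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of logits."""
    largest = max(row)
    exps = [math.exp(x - largest) for x in row]  # subtracting the max avoids overflow
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the printed `outputs.logits` above.
logit_rows = [
    [-1.5473, 1.5419],
    [3.3221, -2.7394],
    [-2.2521, 2.2720],
]

prob_rows = [softmax(row) for row in logit_rows]
for row in prob_rows:
    print([round(p, 4) for p in row], "sum =", round(sum(row), 4))
```

The rows agree with the tensor above to four decimals. Because the logits were copied after rounding, later digits can differ slightly from what the model computes internally.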


In [None]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

In [None]:
# Map predictions to labels using argmax
predicted_indices = torch.argmax(predictions, dim=-1)

[model.config.id2label[idx.item()] for idx in predicted_indices]


['POSITIVE', 'NEGATIVE', 'POSITIVE']
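
Putting the last two steps together reproduces the kind of result the pipeline returned at the start of this notebook. The sketch below is a pure-Python illustration, not the pipeline's actual code: `to_pipeline_style` is a hypothetical helper, and the probabilities and label mapping are copied from the outputs above.

```python
# Values copied from this notebook: the softmax output and model.config.id2label.
probabilities = [
    [0.0436, 0.9564],
    [0.9977, 0.0023],
    [0.0107, 0.9893],
]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def to_pipeline_style(prob_rows, id2label):
    """Return one {'label', 'score'} dict per row, mirroring the pipeline's output format."""
    results = []
    for row in prob_rows:
        best = max(range(len(row)), key=row.__getitem__)  # argmax: index of the largest value
        results.append({"label": id2label[best], "score": row[best]})
    return results


labelled = to_pipeline_style(probabilities, id2label)
for result in labelled:
    print(result)
```

Softmax preserves the ordering of the logits, so taking the argmax of the raw logits would select the same labels. The softmax step is only needed when the probability itself is of interest.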