# LLMs with the Hugging Face Transformers Library and PyTorch


### Hugging Face Inc.:

Hugging Face, Inc. is a French-American company based in New York City that develops computation tools for building applications using machine learning. It is most notable for its transformers library built for natural language processing applications and its platform that allows users to share machine learning models and datasets and showcase their work.

## Services & technologies:

### Transformers:

The `transformers` library is a Python package that contains open-source implementations of transformer models for text, image and audio tasks. It is compatible with `pytorch`, `tensorflow` and `jax`, and includes implementations of notable models such as `llama` and `bert`. It was initially called `pytorch-pretrained-bert`, then `pytorch-transformers`, and was finally renamed to just `transformers`.

### Hugging Face Hub:

**Hugging Face Hub** is a platform (a centralized web service) for hosting:

<ul>
  <li>Git-based code repositories, including discussions and pull requests</li>
  <li>Models, along with Git-based version control</li>
  <li>Datasets, mainly text, image and audio</li>
</ul>

> Along with the `transformers` library, the HF ecosystem also contains libraries for other tasks such as data processing (`datasets`), model evaluation (`evaluate`) and machine learning demos (`gradio`).

The Hugging Face ecosystem integrates closely with PyTorch and makes it convenient to work with LLMs in `pytorch` or `tensorflow`.


### Implementation of DistilBERT's classification model

`distilbert/distilbert-base-uncased-finetuned-sst-2-english` is a DistilBERT base (uncased) checkpoint fine-tuned on the SST-2 sentiment dataset. DistilBERT base is a transformer model, smaller and faster than `bert-base`, that was pretrained on the same corpus in a self-supervised fashion, using the `bert-base` model as its teacher. This means it was pretrained on raw texts only, with no humans labelling them in any way (which allows it to use lots of publicly available text data), with an automated process generating inputs and labels from those texts using the `bert-base` uncased model. More precisely, it was pretrained with three objectives:

<ul>
  <li>Distillation loss</li>
  <li>Masked language modelling loss</li>
  <li>Cosine embedding loss</li>
</ul>
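
The three objectives can be combined into a single training loss. The block below is a minimal PyTorch sketch on random tensors, not the original training code; the temperature, the loss weights, the tensor sizes and the helper name `distillation_triple_loss` are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def distillation_triple_loss(student_logits, teacher_logits, mlm_labels,
                             student_hidden, teacher_hidden,
                             temperature=2.0, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of distillation, masked-LM and cosine embedding losses.

    temperature and weights are illustrative values, not the published ones.
    """
    vocab = student_logits.size(-1)
    s_logits = student_logits.view(-1, vocab)
    t_logits = teacher_logits.view(-1, vocab)

    # 1) Distillation loss: KL divergence between softened distributions.
    loss_kd = F.kl_div(
        F.log_softmax(s_logits / temperature, dim=-1),
        F.softmax(t_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # 2) Masked language modelling loss: cross-entropy on masked positions
    #    (positions labelled -100 are ignored).
    loss_mlm = F.cross_entropy(s_logits, mlm_labels.view(-1), ignore_index=-100)

    # 3) Cosine embedding loss: align student and teacher hidden states.
    dim = student_hidden.size(-1)
    s_hid = student_hidden.view(-1, dim)
    t_hid = teacher_hidden.view(-1, dim)
    target = torch.ones(s_hid.size(0))
    loss_cos = F.cosine_embedding_loss(s_hid, t_hid, target)

    w_kd, w_mlm, w_cos = weights
    return w_kd * loss_kd + w_mlm * loss_mlm + w_cos * loss_cos


torch.manual_seed(0)
batch, seq_len, vocab_size, hidden = 2, 5, 11, 8
student_logits = torch.randn(batch, seq_len, vocab_size)
teacher_logits = torch.randn(batch, seq_len, vocab_size)
mlm_labels = torch.full((batch, seq_len), -100)
mlm_labels[:, 2] = torch.tensor([3, 7])  # pretend position 2 was masked
student_hidden = torch.randn(batch, seq_len, hidden)
teacher_hidden = torch.randn(batch, seq_len, hidden)

loss = distillation_triple_loss(student_logits, teacher_logits, mlm_labels,
                                student_hidden, teacher_hidden)
print(loss)
```

When the student reproduces the teacher exactly, the first and third terms vanish and only the masked language modelling term remains.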


In [None]:
### Importing pprint for pretty-printing nested outputs
from pprint import pprint

### Importing transformers and torch
import transformers
import torch

### Importing the AutoTokenizer class for tokenizing the input
from transformers import AutoTokenizer

### Importing AutoModel for initializing a model instance and generating vector embeddings
from transformers import AutoModel

### Importing AutoModelForSequenceClassification for instantiating a text classification model
from transformers import AutoModelForSequenceClassification

### Importing pipeline for the high-level HF abstraction
from transformers import pipeline

### Importing pandas and numpy for data processing
import pandas as pd
import numpy as np

In [None]:
### Variable storing the model checkpoint ID on the Hub
model_checkpoint = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"

In [None]:
### Input

input=[
    "India landed on the moon bringing great joy to the nation and surprise to everyone else",
    "I am not feeling well today",
    "India has the most number of deaths in the world."
      ]

In [None]:
### Instantiating tokenizer

tokenizer= AutoTokenizer.from_pretrained(model_checkpoint)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

In [None]:
### Exploring the working directory for files retrieved by AutoTokenizer

!ls -lah

total 16K
drwxr-xr-x 1 root root 4.0K Apr 24 18:20 .
drwxr-xr-x 1 root root 4.0K Apr 27 09:51 ..
drwxr-xr-x 4 root root 4.0K Apr 24 18:19 .config
drwxr-xr-x 1 root root 4.0K Apr 24 18:20 sample_data


In [None]:
# Finding the path where the model files are downloaded
!find / -iname config.json

/root/.cache/huggingface/hub/models--distilbert--distilbert-base-uncased-finetuned-sst-2-english/snapshots/714eb0fa89d2f80546fda750413ed43d93601a13/config.json
/root/.julia/packages/TiffImages/w9Bbj/docs/demos/config.json
find: ‘/proc/67/task/67/net’: Invalid argument
find: ‘/proc/67/net’: Invalid argument
/usr/local/lib/python3.11/dist-packages/zmq/utils/config.json
/tools/google-cloud-sdk/lib/googlecloudsdk/core/config.json


In [None]:
# Exploring the folder
!ls /root/.cache/huggingface/hub/models--distilbert--distilbert-base-uncased-finetuned-sst-2-english/ -lh

total 12K
drwxr-xr-x 2 root root 4.0K Apr 27 09:53 blobs
drwxr-xr-x 2 root root 4.0K Apr 27 09:53 refs
drwxr-xr-x 3 root root 4.0K Apr 27 09:53 snapshots
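
Rather than searching the whole filesystem, the cache location can usually be derived from the checkpoint name. The sketch below assumes the default cache root `~/.cache/huggingface/hub` (it can be overridden through environment variables such as `HF_HOME`) and the `models--<org>--<name>` folder naming seen in the output above; `cache_folder_for` is a hypothetical helper, not a library function.

```python
from pathlib import Path


def cache_folder_for(repo_id, cache_root="~/.cache/huggingface/hub"):
    """Build the expected cache folder for a model repo (hypothetical helper)."""
    folder_name = "models--" + repo_id.replace("/", "--")
    return Path(cache_root).expanduser() / folder_name


folder = cache_folder_for(
    "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
)
print(folder)

# List the sub-folders (blobs, refs, snapshots) if the model was downloaded.
if folder.is_dir():
    for child in sorted(folder.iterdir()):
        print(child.name)
```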


In [None]:
!ls /root/.cache/huggingface/hub/models--distilbert--distilbert-base-uncased-finetuned-sst-2-english/blobs -lah

total 244K
drwxr-xr-x 2 root root 4.0K Apr 27 09:53 .
drwxr-xr-x 6 root root 4.0K Apr 27 09:53 ..
-rw-r--r-- 1 root root   48 Apr 27 09:53 3ed34255a7cb8e6706a8bb21993836e99e7b959f
-rw-r--r-- 1 root root  629 Apr 27 09:53 b57fe5dfcb8ec3f9bab35ed427c3434e3c7dd1ba
-rw-r--r-- 1 root root 227K Apr 27 09:53 fb140275c155a9c7c5a3b3e0e77a9e839594a938


In [None]:
!head -9000 /root/.cache/huggingface/hub/models--distilbert--distilbert-base-uncased-finetuned-sst-2-english/blobs/fb140275c155a9c7c5a3b3e0e77a9e839594a938 | tail

virtually
gen
gravity
exploration
amber
vital
wishes
powell
doctrine
elbow


In [None]:
# Printing the model's config.json from the snapshot folder
!cat /root/.cache/huggingface/hub/models--distilbert--distilbert-base-uncased-finetuned-sst-2-english/snapshots/714eb0fa89d2f80546fda750413ed43d93601a13/config.json

{
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "finetuning_task": "sst-2",
  "hidden_dim": 3072,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "initializer_range": 0.02,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "vocab_size": 30522
}
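
A few derived quantities are easier to see in code. The sketch below parses a subset of the config shown above (copied inline so the block is self-contained) and computes the per-head attention size; the variable names are illustrative.

```python
import json

# Subset of the config.json printed above, inlined for a self-contained example.
config_text = """
{
  "dim": 768,
  "hidden_dim": 3072,
  "n_heads": 12,
  "n_layers": 6,
  "max_position_embeddings": 512,
  "vocab_size": 30522,
  "id2label": {"0": "NEGATIVE", "1": "POSITIVE"}
}
"""
config = json.loads(config_text)

# Each attention head works on an equal slice of the hidden dimension.
head_dim = config["dim"] // config["n_heads"]
# JSON object keys are strings, so convert them to int class indices.
id2label = {int(k): v for k, v in config["id2label"].items()}

print(f"layers={config['n_layers']}, heads={config['n_heads']}, head_dim={head_dim}")
print(f"feed-forward expansion: {config['hidden_dim'] // config['dim']}x")
print(id2label)
```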


### Tokenizing the Input

In [None]:
### Tokenizing the input using tokenizer instance

tokenized_text= tokenizer(input)
pprint(tokenized_text)

{'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
 'input_ids': [[101,
                2634,
                5565,
                2006,
                1996,
                4231,
                5026,
                2307,
                6569,
                2000,
                1996,
                3842,
                1998,
                4474,
                2000,
                3071,
                2842,
                102],
               [101, 1045, 2572, 2025, 3110, 2092, 2651, 102],
               [101,
                2634,
                2038,
                1996,
                2087,
                2193,
                1997,
                6677,
                1999,
                1996,
                2088,
                1012,
                102]]}


In [None]:
# Type of Output

type(tokenized_text)

In [None]:
# token ids for the input text
pprint(tokenized_text.input_ids)

[[101,
  2634,
  5565,
  2006,
  1996,
  4231,
  5026,
  2307,
  6569,
  2000,
  1996,
  3842,
  1998,
  4474,
  2000,
  3071,
  2842,
  102],
 [101, 1045, 2572, 2025, 3110, 2092, 2651, 102],
 [101, 2634, 2038, 1996, 2087, 2193, 1997, 6677, 1999, 1996, 2088, 1012, 102]]


In [None]:
# Attention mask for the input ids
tokenized_text.attention_mask

[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 [1, 1, 1, 1, 1, 1, 1, 1],
 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]

In [None]:
# getting input sentence back from input ids
tokenizer.decode(tokenized_text.input_ids[0])

'[CLS] india landed on the moon bringing great joy to the nation and surprise to everyone else [SEP]'

In [None]:
# Relation between token IDs and the vocab file: a token ID is a line index in vocab.txt

df = pd.Series(open("/root/.cache/huggingface/hub/models--distilbert--distilbert-base-uncased-finetuned-sst-2-english/blobs/fb140275c155a9c7c5a3b3e0e77a9e839594a938"))

In [None]:
df.iloc[tokenized_text.input_ids[0]]

Unnamed: 0,0
101,[CLS]\n
2634,india\n
5565,landed\n
2006,on\n
1996,the\n
4231,moon\n
5026,bringing\n
2307,great\n
6569,joy\n
2000,to\n
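
The lookup above works because `vocab.txt` stores one token per line, and a token's ID is its zero-based line number. A minimal sketch with a tiny made-up vocabulary file (the tokens and IDs below are illustrative, not the real DistilBERT vocabulary):

```python
import os
import tempfile

# Tiny made-up vocabulary: one token per line, ID = zero-based line number.
toy_vocab = ["[PAD]", "[CLS]", "[SEP]", "india", "landed", "on", "the", "moon"]

with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "vocab.txt")
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("\n".join(toy_vocab) + "\n")

    # Read the file back and strip the trailing newline from each line.
    with open(vocab_path, encoding="utf-8") as f:
        id_to_token = [line.rstrip("\n") for line in f]

# Reverse mapping, from token string to ID.
token_to_id = {token: idx for idx, token in enumerate(id_to_token)}

toy_ids = [1, 3, 4, 5, 6, 7, 2]
decoded = [id_to_token[i] for i in toy_ids]
print(decoded)
```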


To pass the tokens to the model, we first need to convert them to tensors and make all sentences the same length by padding the shorter ones.

In [None]:
# Generating padded PyTorch tensors that the model accepts
tokenized_text = tokenizer(input,
                           padding= True,
                           return_tensors="pt")

pprint(tokenized_text)

{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]]),
 'input_ids': tensor([[ 101, 2634, 5565, 2006, 1996, 4231, 5026, 2307, 6569, 2000, 1996, 3842,
         1998, 4474, 2000, 3071, 2842,  102],
        [ 101, 1045, 2572, 2025, 3110, 2092, 2651,  102,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0],
        [ 101, 2634, 2038, 1996, 2087, 2193, 1997, 6677, 1999, 1996, 2088, 1012,
          102,    0,    0,    0,    0,    0]])}


As we can see, all the token sequences now have the same length and are returned as PyTorch tensors.
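
Padding itself is simple: shorter sequences are filled with the pad token ID (0 for this tokenizer, per `pad_token_id` in the config) and the attention mask marks real tokens with 1 and padding with 0. The block below is a hand-rolled sketch of that idea, not the tokenizer's implementation; `pad_batch` is a hypothetical helper.

```python
def pad_batch(batch_ids, pad_token_id=0):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


# Token IDs of the three example sentences, copied from the output above.
batch = [
    [101, 2634, 5565, 2006, 1996, 4231, 5026, 2307, 6569, 2000, 1996, 3842,
     1998, 4474, 2000, 3071, 2842, 102],
    [101, 1045, 2572, 2025, 3110, 2092, 2651, 102],
    [101, 2634, 2038, 1996, 2087, 2193, 1997, 6677, 1999, 1996, 2088, 1012, 102],
]
padded_ids, masks = pad_batch(batch)
for ids, mask in zip(padded_ids, masks):
    print(len(ids), sum(mask))
```

The sum of each mask recovers the original, unpadded sequence length.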

### Model instantiation and passing the input tokens

In [None]:
### Instantiating distilbert-base-uncased-model

model = AutoModel.from_pretrained(model_checkpoint)

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

In [None]:
# Model Config
model.config

DistilBertConfig {
  "_attn_implementation_autoset": true,
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "finetuning_task": "sst-2",
  "hidden_dim": 3072,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "initializer_range": 0.02,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.51.3",
  "vocab_size": 30522
}

In [None]:
# Model Architecture
pprint(model)

DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0-5): 6 x TransformerBlock(
        (attention): DistilBertSdpaAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): L

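This model has an embedding layer followed by 6 identical transformer blocks (`n_layers: 6` in the config), each combining a self-attention layer with a feed-forward network.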

In [None]:
%%capture capt
# Model weights (state dict); the output is captured because it is very long
model.state_dict()

In [None]:
capt.show()

In [None]:
# Model parameters
no_param = model.num_parameters()
print(f"No of parameters in the model: {no_param}")

No of parameters in the model: 66362880


In [None]:
# Model size in MB, assuming float32 weights (4 bytes per parameter)
print(f"size of the model: {no_param*4/10**6} mb")

size of the model: 265.45152 mb
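
The parameter count can be reproduced by hand from the layer shapes. The sketch below assumes the standard DistilBERT block layout (four attention projections, a two-layer feed-forward network and two layer norms per block); part of that layout is cut off in the printed architecture above, so treat it as an assumption.

```python
vocab_size, max_positions, dim, hidden_dim, n_layers = 30522, 512, 768, 3072, 6


def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


layer_norm = 2 * dim  # scale and shift vectors

embeddings = vocab_size * dim + max_positions * dim + layer_norm
attention = 4 * linear_params(dim, dim)  # q, k, v and output projections
ffn = linear_params(dim, hidden_dim) + linear_params(hidden_dim, dim)
block = attention + ffn + 2 * layer_norm  # assumed: two layer norms per block

total = embeddings + n_layers * block
size_mb = total * 4 / 10**6  # float32 stores each parameter in 4 bytes

print(f"embeddings: {embeddings:,}")
print(f"per block:  {block:,}")
print(f"total:      {total:,}")
print(f"size:       {size_mb} MB")
```

Roughly a third of the parameters sit in the embedding tables, which is why the vocabulary size matters so much for small models.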


In [None]:
# After passing the tokens to the model, the last hidden state of every token is returned

model(**tokenized_text)

BaseModelOutput(last_hidden_state=tensor([[[ 0.6786,  0.2116,  0.2134,  ...,  0.4311,  0.9219, -0.5872],
         [ 0.4083,  0.1110,  0.3373,  ...,  0.2122,  0.7986, -0.6584],
         [ 0.5258,  0.2334,  0.4273,  ..., -0.0329,  0.7275, -0.5175],
         ...,
         [ 0.7248,  0.2988,  0.3545,  ...,  0.4456,  1.0790, -0.5557],
         [ 0.6883,  0.2909,  0.3172,  ...,  0.4103,  1.0424, -0.7060],
         [ 1.1934,  0.1189,  0.5984,  ...,  0.4708,  0.7246, -0.8912]],

        [[-0.8214,  0.7480,  0.0973,  ..., -0.0457, -1.1065, -0.3524],
         [-0.8090,  0.7177,  0.2172,  ..., -0.1756, -0.9477, -0.2446],
         [-1.0262,  0.6648,  0.1837,  ..., -0.2497, -0.9956, -0.2826],
         ...,
         [-0.8590,  0.7709,  0.1200,  ..., -0.0781, -0.9487, -0.3195],
         [-0.8194,  0.7506,  0.0613,  ..., -0.0756, -0.9781, -0.3513],
         [-0.7707,  0.7625,  0.0414,  ..., -0.0674, -0.9842, -0.3520]],

        [[-0.3636,  0.2916, -0.3554,  ..., -0.5411,  0.0856, -0.0165],
         [-

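### Native Layer Norm Backward:

When the model is called as above, PyTorch tracks gradients, so the returned tensor carries a `grad_fn` (typically `NativeLayerNormBackward0`, since the last operation is a layer norm). For inference this bookkeeping is unnecessary and can be switched off with `torch.inference_mode()`.
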
In [None]:
# Disabling gradient tracking so the output no longer carries the layer norm backward node

with torch.inference_mode():
  output = model(**tokenized_text)

pprint(output)

BaseModelOutput(last_hidden_state=tensor([[[ 0.6786,  0.2116,  0.2134,  ...,  0.4311,  0.9219, -0.5872],
         [ 0.4083,  0.1110,  0.3373,  ...,  0.2122,  0.7986, -0.6584],
         [ 0.5258,  0.2334,  0.4273,  ..., -0.0329,  0.7275, -0.5175],
         ...,
         [ 0.7248,  0.2988,  0.3545,  ...,  0.4456,  1.0790, -0.5557],
         [ 0.6883,  0.2909,  0.3172,  ...,  0.4103,  1.0424, -0.7060],
         [ 1.1934,  0.1189,  0.5984,  ...,  0.4708,  0.7246, -0.8912]],

        [[-0.8214,  0.7480,  0.0973,  ..., -0.0457, -1.1065, -0.3524],
         [-0.8090,  0.7177,  0.2172,  ..., -0.1756, -0.9477, -0.2446],
         [-1.0262,  0.6648,  0.1837,  ..., -0.2497, -0.9956, -0.2826],
         ...,
         [-0.8590,  0.7709,  0.1200,  ..., -0.0781, -0.9487, -0.3195],
         [-0.8194,  0.7506,  0.0613,  ..., -0.0756, -0.9781, -0.3513],
         [-0.7707,  0.7625,  0.0414,  ..., -0.0674, -0.9842, -0.3520]],

        [[-0.3636,  0.2916, -0.3554,  ..., -0.5411,  0.0856, -0.0165],
         [-
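
The effect of `torch.inference_mode()` is easiest to see on a small layer. In the sketch below a stand-in `torch.nn.LayerNorm` plays the role of the model (an assumption made to keep the example light); the same behaviour applies to the full model call.

```python
import torch

torch.manual_seed(0)
layer = torch.nn.LayerNorm(4)  # small stand-in for the transformer model
x = torch.randn(2, 4)

# Default mode: autograd records the operation, so the output has a grad_fn.
tracked = layer(x)
print(tracked.requires_grad, tracked.grad_fn is not None)

# Inference mode: no graph is recorded, which saves memory and time.
with torch.inference_mode():
    untracked = layer(x)
print(untracked.requires_grad, untracked.grad_fn)

# The numerical values are identical in both modes.
print(torch.allclose(tracked.detach(), untracked))
```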

Since the base model has no task-specific head, its output is simply the contextualized hidden state of every input token, with shape:

In [None]:
# Shape of the last hidden state
output.last_hidden_state.shape

torch.Size([3, 18, 768])

There are 3 input sequences, each padded to 18 tokens, and each token is represented by a vector of length 768.
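
To turn the per-token vectors into one vector per sentence, two common options are taking the first (`[CLS]`) position or averaging over the real tokens using the attention mask. The sketch below uses a random tensor with the same shape as the output above instead of the real hidden states; the sequence lengths 18, 8 and 13 come from the tokenizer output shown earlier.

```python
import torch

torch.manual_seed(0)
# Random stand-in with the same shape as output.last_hidden_state.
last_hidden_state = torch.randn(3, 18, 768)

# Attention mask for sequences with 18, 8 and 13 real tokens.
lengths = torch.tensor([18, 8, 13])
attention_mask = (torch.arange(18).unsqueeze(0) < lengths.unsqueeze(1)).long()

# Option 1: the vector at position 0, i.e. the [CLS] token.
cls_embedding = last_hidden_state[:, 0, :]

# Option 2: mean over real tokens only, ignoring padded positions.
mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (3, 18, 1)
mean_embedding = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

print(cls_embedding.shape, mean_embedding.shape)
```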

### Model for Classification

Using the `AutoModelForSequenceClassification` class to generate classification scores for the input tensors.


In [None]:
# Instantiating the classification Model
cls_model= AutoModelForSequenceClassification.from_pretrained(model_checkpoint)

In [None]:
# Classification models architecture
pprint(cls_model)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


In [None]:
# Model Config
cls_model.config

DistilBertConfig {
  "_attn_implementation_autoset": true,
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "finetuning_task": "sst-2",
  "hidden_dim": 3072,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "initializer_range": 0.02,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.51.3",
  "vocab_size": 30522
}

Compared with the previous `AutoModel`, the classification model has a classification head attached on top, which transforms the last hidden state of the input into one logit per class. These logits are then converted to probabilities with softmax.
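
A rough sketch of such a head is shown below. It follows an assumed layout (first-token pooling, a 768 to 768 linear layer with ReLU, dropout at the `seq_classif_dropout` rate from the config, then a 768 to 2 linear layer); the weights are random and the class name `ToyClassificationHead` is hypothetical, so the values it produces are meaningless.

```python
import torch
from torch import nn


class ToyClassificationHead(nn.Module):
    """Hypothetical head: first-token pooling -> Linear -> ReLU -> Dropout -> Linear."""

    def __init__(self, dim=768, num_labels=2, dropout=0.2):
        super().__init__()
        self.pre_classifier = nn.Linear(dim, dim)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(dim, num_labels)

    def forward(self, last_hidden_state):
        pooled = last_hidden_state[:, 0]  # (batch, dim): [CLS] position
        pooled = torch.relu(self.pre_classifier(pooled))
        pooled = self.dropout(pooled)
        return self.classifier(pooled)  # (batch, num_labels) logits


torch.manual_seed(0)
head = ToyClassificationHead().eval()  # eval() disables dropout
hidden = torch.randn(3, 18, 768)       # stand-in for the base model output

with torch.inference_mode():
    logits = head(hidden)
print(logits.shape)
```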

In [None]:
# Generating classification logits
output= cls_model(**tokenized_text)
output

SequenceClassifierOutput(loss=None, logits=tensor([[-4.3472,  4.6557],
        [ 4.5186, -3.6110],
        [ 2.6655, -2.1722]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [None]:
# Applying softmax over the logits to get class probabilities
prob_scores= torch.nn.functional.softmax(output.logits, dim=-1)

In [None]:
# Labeling the probability scores using argmax
labels = torch.argmax(prob_scores, dim=-1)
labels

tensor([1, 0, 0])

In [None]:
# Classifying the output labels as positive or negative sentiment
[cls_model.config.id2label[i] for i in labels.tolist()]

['POSITIVE', 'NEGATIVE', 'NEGATIVE']
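
The softmax and argmax steps can be checked by hand. The sketch below recomputes them in plain Python from the logits printed above; `softmax` here is a local helper, not the PyTorch function.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    largest = max(row)
    exps = [math.exp(value - largest) for value in row]
    total = sum(exps)
    return [value / total for value in exps]


# Logits copied from the classifier output above.
logits = [[-4.3472, 4.6557], [4.5186, -3.6110], [2.6655, -2.1722]]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

probabilities = [softmax(row) for row in logits]
# argmax: index of the largest probability in each row
predictions = [id2label[p.index(max(p))] for p in probabilities]

for label, p in zip(predictions, probabilities):
    print(label, round(max(p), 4))
```

Because softmax preserves the ordering of the logits, taking the argmax of the raw logits would give the same labels; softmax is only needed when the probabilities themselves are of interest.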

### HF Abstraction:

