
In [None]:
%%capture
!pip install datasets transformers

### *pipeline*

In [None]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)


No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598048329353333},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

### Model

In [None]:
# Load any checkpoint; AutoModel automatically detects the model configuration
from transformers import AutoModel
model = AutoModel.from_pretrained('distilgpt2')

In [None]:
# Create a BERT model from the default configuration.
# Its weights are randomly initialized, so it needs to be trained before use.
from transformers import BertConfig, BertModel
from pprint import pprint
config = BertConfig()
pprint(config)
bert_model = BertModel(config)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.35.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
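A few quantities follow directly from the configuration printed above. The sketch below is only a rough illustration: it copies the relevant values into a plain dictionary (the name `cfg` is made up for this example, and no library is loaded) and derives the per-head dimension and the size of the three embedding tables.

```python
# Values copied from the BertConfig output above; `cfg` is an illustrative name.
cfg = {
    "hidden_size": 768,
    "num_attention_heads": 12,
    "vocab_size": 30522,
    "max_position_embeddings": 512,
    "type_vocab_size": 2,
}

# Each attention head works on an equal slice of the hidden vector.
head_dim = cfg["hidden_size"] // cfg["num_attention_heads"]

# Each embedding table holds one row of hidden_size weights per entry.
word_emb = cfg["vocab_size"] * cfg["hidden_size"]
pos_emb = cfg["max_position_embeddings"] * cfg["hidden_size"]
type_emb = cfg["type_vocab_size"] * cfg["hidden_size"]

print("head_dim:", head_dim)
print("word embedding weights:", word_emb)
print("position embedding weights:", pos_emb)
print("token type embedding weights:", type_emb)
```

Note that these are the defaults of `BertConfig()`. A pretrained checkpoint can differ: the `bert-base-cased` model printed in the next cell shows a `word_embeddings` layer with 28996 rows rather than 30522.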



In [None]:
model = BertModel.from_pretrained('bert-base-cased')
model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(28996, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

### Tokenizer




*   A tokenizer converts our text inputs to numerical data.
*   A word that is not in the vocabulary is mapped to the "unknown" token, often represented as "[UNK]".
*   One way to reduce the number of unknown tokens is to go one level deeper and use a character-based tokenizer.
*   **Subword tokenization** algorithms rely on the principle that frequently used words should not be split into smaller subwords, while rare words should be decomposed into meaningful subwords.

<img src ="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/bpe_subword.svg">


* Other tokenization algorithms
    - Byte-level BPE, as used in GPT-2
    - WordPiece, as used in BERT
    - SentencePiece or Unigram, as used in several multilingual models
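
To make the subword idea concrete, here is a minimal sketch of WordPiece-style splitting, assuming a greedy longest-match-first strategy and a tiny hand-written vocabulary. The function name `wordpiece_tokenize` and the contents of `toy_vocab` are hypothetical; a real vocabulary is learned from data and is far larger.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary: "Transformer" is absent, but its pieces are present.
toy_vocab = {"Using", "a", "Trans", "##former", "network", "is", "simple"}
sentence = "Using a Transformer network is simple"
tokens = [p for w in sentence.split() for p in wordpiece_tokenize(w, toy_vocab)]
print(tokens)
print(wordpiece_tokenize("xyz", toy_vocab))
```

Frequent words stay whole, the rarer word is split into two pieces, and a word with no matching pieces falls back to the unknown token.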



In [None]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
tokenizer(["Using a Transformer network is simple","using a transformer network is simple"])

{'input_ids': [[101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], [101, 1606, 170, 11303, 1200, 2443, 1110, 3014, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1]]}

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer(["Using a Transformer network is simple","using a transformer network is simple"])

{'input_ids': [[101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], [101, 1606, 170, 11303, 1200, 2443, 1110, 3014, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1]]}

**Encoding**

Translating text to numbers is known as encoding.

Encoding is a two-step process: **tokenization**, followed by **conversion to input IDs**.



In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)
print(tokens)
print(tokenizer.convert_tokens_to_ids(tokens))

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
[7993, 170, 13809, 23763, 2443, 1110, 3014]


**Decoding**

Decoding goes the other way around: **from vocabulary indices, we want to get a string**. This can be done with the `decode()` method.

In [None]:
print(tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014]))

Using a transformer network is simple
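
Encoding and decoding can be pictured as lookups in a vocabulary table. The sketch below is a simplified illustration, not the library's implementation: the token-to-ID pairs are copied from the encoding output earlier in this section, while `toy_decode` is a hypothetical helper that merges `##` continuation pieces back into the previous word.

```python
# Token/ID pairs copied from the tokenizer output shown earlier in this section.
token_to_id = {
    "Using": 7993, "a": 170, "Trans": 13809, "##former": 23763,
    "network": 2443, "is": 1110, "simple": 3014,
}
id_to_token = {i: t for t, i in token_to_id.items()}

def toy_decode(ids):
    """Map IDs back to tokens and glue ## pieces onto the preceding word."""
    words = []
    for i in ids:
        tok = id_to_token[i]
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # drop the ## marker and join
        else:
            words.append(tok)
    return " ".join(words)

tokens = ["Using", "a", "Trans", "##former", "network", "is", "simple"]
ids = [token_to_id[t] for t in tokens]  # encoding step 2: tokens to IDs
print(ids)
print(toy_decode(ids))
```

This is why decoding returns readable words rather than the raw pieces: the continuation marker tells the decoder which pieces belong together.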


### Handling multiple sequences


- How do we handle multiple sequences?
- How do we handle multiple sequences of different lengths?
- Are vocabulary indices the only inputs that allow a model to work well?
- Is there such a thing as too long a sequence?

In [None]:
# Models expect a batch of inputs
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
# Wrap ids in a list to add a batch dimension: models expect a batch of sequences by default
model(torch.tensor([ids]))

SequenceClassifierOutput(loss=None, logits=tensor([[-2.7276,  2.8789]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
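
The logits are unnormalized scores. As a minimal sketch, a softmax turns them into probabilities. The logit values are copied from the output above, and the label order (index 0 is NEGATIVE, index 1 is POSITIVE) is assumed to match this checkpoint's configuration. Because `tokenize` followed by `convert_tokens_to_ids` does not add special tokens, the result is not expected to match the pipeline score for the same sentence exactly.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-2.7276, 2.8789]           # copied from the model output above
labels = ["NEGATIVE", "POSITIVE"]    # assumed label order for this checkpoint

probs = softmax(logits)
best = max(range(len(probs)), key=lambda i: probs[i])
print(labels[best], round(probs[best], 4))
```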

**Padding the inputs**

If the input sequences have different lengths, we pad the shorter ones so that every sequence in the batch has the same length.

In [None]:
print('pad_token_id',tokenizer.pad_token_id)
ids = [[100,200,200,200],[100,200,tokenizer.pad_token_id,tokenizer.pad_token_id]]
print(ids)
model(torch.tensor(ids))

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


pad_token_id 0
[[100, 200, 200, 200], [100, 200, 0, 0]]


SequenceClassifierOutput(loss=None, logits=tensor([[ 0.9290, -0.7948],
        [ 0.6497, -0.5397]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
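
Padding by hand, as in the cell above, can be wrapped in a small helper. The sketch below is a plain-Python illustration (the name `pad_batch` is hypothetical): it right-pads each sequence to the longest length in the batch and also returns a 0/1 mask marking the real tokens, which is the information the warning above asks for.

```python
def pad_batch(batch, pad_id):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks a padding position.
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return padded, mask

# Same IDs as the cell above, with pad_token_id 0.
padded, mask = pad_batch([[100, 200, 200, 200], [100, 200]], pad_id=0)
print(padded)
print(mask)
```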

**Attention masks**

Attention layers contextualize each token, so without a mask they also attend to the padding tokens.

Attention masks are tensors with the exact **same shape as the input IDs tensor**, filled with 0s and 1s: **1s indicate the corresponding tokens should be attended to**, and **0s indicate the corresponding tokens should not be attended to**.
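
One common way a mask takes effect is sketched below, under simplifying assumptions: scores at masked positions are set to negative infinity before the softmax, so those positions receive zero attention weight. The scores are made-up numbers, `masked_softmax` is a hypothetical helper, and real models do this on tensors for every head and layer.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over scores, giving zero weight to positions where mask is 0.

    Assumes at least one position has mask == 1.
    """
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5, 0.5]  # made-up attention scores for one query token
mask = [1, 1, 0, 0]            # the last two positions are padding
weights = masked_softmax(scores, mask)
print([round(w, 4) for w in weights])
```

The padded positions end up with weight 0, so they cannot influence the representation of the real tokens.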

In [None]:
from transformers import AutoTokenizer
from pprint import pprint
checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
pprint(tokenizer(['attention layers that contextualize each token','large language model'], max_length=10, padding="max_length")) # attention_mask 0 = padding position, ignored by attention

{'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]],
 'input_ids': [[101, 2209, 8798, 1115, 5618, 4746, 3708, 1296, 22559, 102],
               [101, 1415, 1846, 2235, 102, 0, 0, 0, 0, 0]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}


**Longer sequences**
- Transformer models have a limit on the length of the sequences we can pass to them. Most models handle sequences of up to 512 or 1024 tokens.
- There are two solutions to this problem:
    - Use a model with a longer supported sequence length (https://huggingface.co/docs/transformers/model_doc/longformer)
    - Truncate your sequences

In [None]:
# Truncate your sequences
checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
pprint(tokenizer(['attention layers that contextualize each token','large language model'], max_length=5, padding="max_length",truncation=True))

{'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1]],
 'input_ids': [[101, 2209, 8798, 1115, 102], [101, 1415, 1846, 2235, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]}
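
Both truncated sequences above still start with 101 and end with 102, so the special tokens count toward `max_length`. The sketch below reproduces that behavior in plain Python for a single sequence. The function name is hypothetical, the IDs are copied from this section's outputs, and the real tokenizer supports more truncation strategies than this.

```python
CLS_ID, SEP_ID = 101, 102  # IDs of [CLS] and [SEP] as seen in the outputs of this section

def truncate_with_special_tokens(body_ids, max_length):
    """Keep at most max_length IDs in total, including [CLS] and [SEP].

    Assumes max_length >= 2 so there is room for both special tokens.
    """
    room = max_length - 2  # two slots are reserved for the special tokens
    return [CLS_ID] + body_ids[:room] + [SEP_ID]

# IDs of 'attention layers that contextualize each token' without special tokens.
body = [2209, 8798, 1115, 5618, 4746, 3708, 1296, 22559]
truncated = truncate_with_special_tokens(body, max_length=5)
print(truncated)
```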


**Special tokens**

- The tokenizer added the special token [CLS] at the beginning and the special token [SEP] at the end.

In [None]:
# Compare token IDs without and with special tokens
checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
seq = 'attention layers that contextualize each token'
tok = tokenizer(seq)
print(tokenizer.convert_tokens_to_ids(tokenizer.tokenize(seq)))
print('input_ids',tok['input_ids'])

# Decode the full input_ids to reveal the special tokens
tokenizer.decode(tok['input_ids'])

[2209, 8798, 1115, 5618, 4746, 3708, 1296, 22559]
input_ids [101, 2209, 8798, 1115, 5618, 4746, 3708, 1296, 22559, 102]


'[CLS] attention layers that contextualize each token [SEP]'