# 🤗 Hugging Face Tokenization and Model Usage

This notebook demonstrates how the Hugging Face Transformers library handles:
- Text tokenization
- Model-specific special tokens
- Integration with PyTorch
- Saving and loading pre-trained models

The focus is on understanding how text is prepared for models
and how models are executed and reused in practice.


## 1️⃣ Tokenization with AutoTokenizer (BERT)

In this section, we use the `bert-base-uncased` tokenizer to observe how
a sentence is transformed into tokens and numerical IDs.

The steps include:
- Encoding text into model inputs
- Tokenizing text into subword tokens
- Converting tokens to token IDs
- Decoding token IDs back to tokens
- Inspecting special tokens used by BERT


In [1]:
from transformers import AutoTokenizer

In [2]:
model = "bert-base-uncased"

In [3]:
tokenizer = AutoTokenizer.from_pretrained(model)

In [4]:
sentence = "I'm so excited to learn about Transformers library!"

In [5]:
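# Calling the tokenizer returns a dict-like encoding with input_ids,
# token_type_ids, and attention_mask (not just the IDs)
input_ids = tokenizer(sentence)
print(input_ids)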

{'input_ids': [101, 1045, 1005, 1049, 2061, 7568, 2000, 4553, 2055, 19081, 3075, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [6]:
tokens = tokenizer.tokenize(sentence)
print(tokens)

['i', "'", 'm', 'so', 'excited', 'to', 'learn', 'about', 'transformers', 'library', '!']


In [7]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)

[1045, 1005, 1049, 2061, 7568, 2000, 4553, 2055, 19081, 3075, 999]


In [8]:
decoded_tokens = tokenizer.convert_ids_to_tokens(token_ids)
print(decoded_tokens)

['i', "'", 'm', 'so', 'excited', 'to', 'learn', 'about', 'transformers', 'library', '!']


## 2️⃣ Special Tokens in BERT

BERT tokenizers automatically add special tokens to represent
sentence boundaries and structure.

In this section, we explicitly decode:
- `[CLS]` → marks the start of a sequence
- `[SEP]` → marks the end or separation of sequences

BERT was pre-trained with these tokens in place, so omitting them can degrade its predictions.


In [9]:
tokenizer.decode(101)  # ID 101 is [CLS], which the tokenizer adds at the start of a sequence

'[CLS]'

In [10]:
tokenizer.decode(102)  # ID 102 is [SEP], which the tokenizer adds at the end of a sequence

'[SEP]'
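The relationship between the two ID lists printed earlier can be checked directly. The sketch below copies the IDs from the notebook outputs and confirms that the full encoding is the plain token IDs wrapped in `[CLS]` (101) and `[SEP]` (102). The variable names are illustrative only.

```python
# IDs copied from the outputs above (bert-base-uncased).
encoded_ids = [101, 1045, 1005, 1049, 2061, 7568, 2000, 4553, 2055, 19081, 3075, 999, 102]
plain_ids = [1045, 1005, 1049, 2061, 7568, 2000, 4553, 2055, 19081, 3075, 999]

CLS_ID, SEP_ID = 101, 102

# Single-sentence BERT layout: [CLS] tokens... [SEP]
rebuilt = [CLS_ID] + plain_ids + [SEP_ID]
print(rebuilt == encoded_ids)  # True
```

This is why `tokenizer(sentence)` returns two more IDs than `convert_tokens_to_ids(tokens)`. Passing `add_special_tokens=False` to the tokenizer call should yield the plain IDs without the wrapper.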

## 3️⃣ Tokenization with a Different Model (XLNet)

Here, we repeat the same tokenization steps using the
`xlnet-base-cased` tokenizer.

This demonstrates that:
- Tokenization rules differ between models
- Token IDs and special tokens are model-specific
- Each architecture defines its own input format


In [11]:
#another model
model2 = "xlnet-base-cased"

In [12]:
tokenizer2 = AutoTokenizer.from_pretrained(model2)

In [13]:
input_ids = tokenizer2(sentence)
print(input_ids)

{'input_ids': [35, 26, 98, 102, 5564, 22, 1184, 75, 17, 21442, 270, 2992, 136, 4, 3], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [14]:
tokens = tokenizer2.tokenize(sentence)
print(tokens)

['▁I', "'", 'm', '▁so', '▁excited', '▁to', '▁learn', '▁about', '▁', 'Transform', 'ers', '▁library', '!']


In [15]:
token_ids = tokenizer2.convert_tokens_to_ids(tokens)
print(token_ids)

[35, 26, 98, 102, 5564, 22, 1184, 75, 17, 21442, 270, 2992, 136]
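XLNet's tokenizer is SentencePiece-based, and the `▁` character (U+2581) at the start of a token marks a preceding space. As a rough sketch of what that means, the helper below (a hypothetical `detokenize`, not a library function) rebuilds the original sentence from the tokens printed above.

```python
# Tokens copied from the XLNet output above.
xlnet_tokens = ['▁I', "'", 'm', '▁so', '▁excited', '▁to', '▁learn', '▁about',
                '▁', 'Transform', 'ers', '▁library', '!']


def detokenize(tokens):
    """Join SentencePiece-style tokens, turning each '▁' back into a space."""
    return "".join(tokens).replace("\u2581", " ").strip()


print(detokenize(xlnet_tokens))
# I'm so excited to learn about Transformers library!
```

Note that, unlike the uncased BERT tokenizer, this one preserves case, and it splits `Transformers` into `Transform` + `ers`.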


## 4️⃣ Special Tokens in XLNet

XLNet uses a different set of special tokens compared to BERT.

In this section, we decode XLNet-specific token IDs to observe
how sequence structure is represented differently across models.


In [16]:
tokenizer2.decode(4)  # ID 4 is XLNet's separator token

'<sep>'

### What special tokens do

Special tokens are placeholders or markers that help the model perform various tasks. They:
 - help the model understand structure and context
 - guide model behavior
 - ensure the tokenizer output is in a format the model understands

The two seen here are the classification token (`cls`) and the separator token (`sep`).

In [17]:
tokenizer2.decode(3)  # ID 3 is XLNet's classification token

'<cls>'
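The XLNet output printed earlier shows a different layout from BERT: the special tokens come at the end, and the final `<cls>` position has its own segment ID. The sketch below reconstructs that layout from the IDs in the notebook outputs. The variable names are illustrative, and the segment ID value is simply read off the printed `token_type_ids`.

```python
# IDs copied from the XLNet outputs above.
xlnet_encoded = [35, 26, 98, 102, 5564, 22, 1184, 75, 17, 21442, 270, 2992, 136, 4, 3]
xlnet_plain = [35, 26, 98, 102, 5564, 22, 1184, 75, 17, 21442, 270, 2992, 136]

SEP_ID, CLS_ID = 4, 3  # '<sep>' and '<cls>' as decoded in this section

# Single-sentence XLNet layout: tokens... <sep> <cls>
rebuilt = xlnet_plain + [SEP_ID, CLS_ID]
print(rebuilt == xlnet_encoded)  # True

# Segment IDs as printed above: 0 for the sentence and <sep>, 2 for <cls>.
token_type_ids = [0] * (len(xlnet_plain) + 1) + [2]
print(len(token_type_ids) == len(xlnet_encoded))  # True
```

Note that ID 102 appears in both models but means different things: it is `[SEP]` in BERT and an ordinary token here, so IDs are never portable across tokenizers.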

## 5️⃣ Using Hugging Face Models with PyTorch

In this section, we integrate Hugging Face Transformers with PyTorch.

The steps include:
- Tokenizing text and returning PyTorch tensors
- Loading a fine-tuned sequence classification model
- Running inference without gradient computation
- Extracting model logits
- Mapping predicted class IDs to human-readable labels


In [18]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

In [19]:
print(sentence)

I'm so excited to learn about Transformers library!


In [20]:
print(input_ids)

{'input_ids': [35, 26, 98, 102, 5564, 22, 1184, 75, 17, 21442, 270, 2992, 136, 4, 3], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [21]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")  # tokenizer for a DistilBERT checkpoint fine-tuned on SST-2


In [22]:
input_ids_pt = tokenizer(sentence, return_tensors="pt")  # "pt" requests PyTorch tensors
print(input_ids_pt)

{'input_ids': tensor([[  101,  1045,  1005,  1049,  2061,  7568,  2000,  4553,  2055, 19081,
          3075,   999,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


In [23]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

In [24]:
with torch.no_grad():
    logits = model(**input_ids_pt).logits

predicted_class_id = logits.argmax(dim=-1).item()  # index of the highest-scoring class
model.config.id2label[predicted_class_id]

'POSITIVE'
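The last two steps (logits to class ID to label) can be illustrated without loading the model. The sketch below uses made-up logits, since the notebook does not print the real ones, and assumes the label mapping matches `model.config.id2label` for this checkpoint. `softmax` is a small helper written here, not a library call.

```python
import math


def softmax(scores):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


# Assumed to mirror model.config.id2label for the SST-2 checkpoint.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Hypothetical logits for one sentence; real values will differ.
example_logits = [-4.0, 4.0]

probs = softmax(example_logits)
predicted_class_id = max(range(len(probs)), key=probs.__getitem__)
predicted_label = id2label[predicted_class_id]
print(predicted_label)  # POSITIVE
```

Taking the argmax of the logits gives the same class as taking the argmax of the probabilities, which is why the cell above skips the softmax; the softmax is only needed when a confidence score is wanted.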

## 6️⃣ Saving and Loading Models

Finally, we demonstrate how to persist models and tokenizers locally.

This includes:
- Saving a tokenizer to disk
- Saving a fine-tuned model to disk
- Reloading both components for future use

This is essential for deployment and reuse without retraining.


In [25]:
model_directory = r"F:\Project_folder\models_directory"  # raw string so backslashes are not treated as escapes

In [26]:
tokenizer.save_pretrained(model_directory)

('F:\\Project_folder\\models_directory\\tokenizer_config.json',
 'F:\\Project_folder\\models_directory\\special_tokens_map.json',
 'F:\\Project_folder\\models_directory\\vocab.txt',
 'F:\\Project_folder\\models_directory\\added_tokens.json',
 'F:\\Project_folder\\models_directory\\tokenizer.json')

In [27]:
model.save_pretrained(model_directory)

In [28]:
my_tokenizer = AutoTokenizer.from_pretrained(model_directory)

In [29]:
my_model = AutoModelForSequenceClassification.from_pretrained(model_directory)
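Two practical details around the save/load cells can be sketched with the standard library alone: building the Windows path without hand-written backslashes, and checking that a directory contains the expected files before reloading from it. The helper `missing_files` is hypothetical, and the file names in the demo are only examples taken from typical `save_pretrained` output.

```python
import tempfile
from pathlib import Path, PureWindowsPath

# Build the directory path from parts instead of typing backslashes.
windows_dir = PureWindowsPath("F:/Project_folder") / "models_directory"
print(windows_dir)  # F:\Project_folder\models_directory


def missing_files(directory, expected_names):
    """Return the expected file names that are absent from the directory."""
    directory = Path(directory)
    return [name for name in expected_names if not (directory / name).is_file()]


# Demo on a temporary directory holding only a stand-in config.json.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}")
    demo_missing = missing_files(tmp, ["config.json", "tokenizer_config.json"])

print(demo_missing)  # ['tokenizer_config.json']
```

A check like this before calling `from_pretrained(model_directory)` can make a failed or partial save easier to diagnose.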

## Key Takeaways

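- Tokenization is model-specific: the same sentence yields different tokens and IDs under BERT and XLNet
- Special tokens mark sequence structure (`[CLS]`/`[SEP]` in BERT, `<cls>`/`<sep>` in XLNet)
- With `return_tensors="pt"`, the tokenizer returns PyTorch tensors that can be passed straight to the model
- `save_pretrained` and `from_pretrained` store and reload a model and its tokenizer from a local directory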
