# 🤗 Hugging Face Tokenization and Model Usage

This notebook demonstrates how the Hugging Face Transformers library handles:

- Text tokenization  
- Model-specific special tokens  
- Integration with PyTorch  
- Saving and loading pre-trained models  

The focus is on understanding how text is prepared for models
and how models are executed and reused in practice.


In [None]:
from transformers import AutoTokenizer

---

## 1️⃣ Tokenization with AutoTokenizer (BERT)

In this section, we use the `bert-base-uncased` tokenizer to observe how
a sentence is transformed into tokens and numerical IDs.

The steps include:
- Encoding text into model inputs  
- Tokenizing text into subword tokens  
- Converting tokens to token IDs  
- Decoding token IDs back to tokens  
- Inspecting special tokens used by BERT


In [None]:
model_name = "bert-base-uncased"  # Checkpoint name on the Hugging Face Hub

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
sentence = "I'm so excited to learn about the Transformers library!"

In [None]:
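input_ids = tokenizer(sentence)  # Returns input_ids, token_type_ids, and attention_mask
print(input_ids)

In [None]:
tokens = tokenizer.tokenize(sentence)  # Subword tokens only; no [CLS] or [SEP] added
print(tokens)

In [None]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)  # Vocabulary lookup for each token
print(token_ids)

In [None]:
decoded_tokens = tokenizer.convert_ids_to_tokens(token_ids)  # Reverse lookup: IDs back to tokens
print(decoded_tokens)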
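
To make the subword step less opaque, here is a minimal sketch of greedy longest-match splitting, the strategy that WordPiece tokenizers such as BERT's apply to each word. The function name `wordpiece_split` and the tiny `toy_vocab` are hypothetical and exist only for illustration. The real `bert-base-uncased` vocabulary is far larger, so the actual tokenizer may split the same words differently.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Split one word into subword pieces by greedy longest-match (illustrative sketch)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # "##" marks a piece that continues a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, NOT the real BERT vocabulary.
toy_vocab = {"transform", "##ers", "learn", "##ing", "library"}

print(wordpiece_split("transformers", toy_vocab))  # ['transform', '##ers']
print(wordpiece_split("learning", toy_vocab))      # ['learn', '##ing']
print(wordpiece_split("xyz", toy_vocab))           # ['[UNK]']
```

The `##` prefix in the output of `tokenizer.tokenize` has the same meaning: the piece attaches to the piece before it.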

---

## 2️⃣ Special Tokens in BERT

BERT tokenizers automatically add special tokens to represent
sentence boundaries and structure.

In this section, we explicitly decode:
- `[CLS]` → marks the start of a sequence  
- `[SEP]` → marks the end or separation of sequences  

BERT was pre-trained with these tokens in place, so inputs that omit them no longer match what the model expects.


In [None]:
tokenizer.decode(101)  # 101 is [CLS], which the tokenizer adds at the start of a sequence

In [None]:
tokenizer.decode(102)  # 102 is [SEP], which the tokenizer adds to end or separate sequences
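
The placement of these two tokens can be sketched in plain Python. The helper `bert_layout` is hypothetical, and the content IDs `7, 8, 9` and `5, 6` are placeholders rather than real vocabulary entries. Only `101` and `102` come from the cells above.

```python
CLS_ID, SEP_ID = 101, 102  # the two IDs decoded in the cells above


def bert_layout(ids_a, ids_b=None):
    """Wrap one or two ID lists in BERT's [CLS] ... [SEP] layout (hypothetical helper)."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID]
    token_type_ids = [0] * len(input_ids)  # segment 0 covers [CLS], A, and the first [SEP]
    if ids_b is not None:
        input_ids += ids_b + [SEP_ID]
        token_type_ids += [1] * (len(ids_b) + 1)  # segment 1 covers B and the final [SEP]
    return input_ids, token_type_ids


single_ids, single_types = bert_layout([7, 8, 9])
pair_ids, pair_types = bert_layout([7, 8, 9], [5, 6])

print(single_ids)  # [101, 7, 8, 9, 102]
print(pair_ids)    # [101, 7, 8, 9, 102, 5, 6, 102]
print(pair_types)  # [0, 0, 0, 0, 0, 1, 1, 1]
```

This is also why `tokenizer(sentence)` returns two more IDs than `tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentence))` for a single sentence.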

---

## 3️⃣ Tokenization with a Different Model (XLNet)

Here, we repeat the same tokenization steps using the
`xlnet-base-cased` tokenizer.

This demonstrates that:
- Tokenization rules differ between models  
- Token IDs and special tokens are model-specific  
- Each architecture defines its own input format


In [None]:
# A second checkpoint, for comparison with BERT
model2 = "xlnet-base-cased"

In [None]:
tokenizer2 = AutoTokenizer.from_pretrained(model2)

In [None]:
input_ids = tokenizer2(sentence)
print(input_ids)

In [None]:
tokens = tokenizer2.tokenize(sentence)
print(tokens)

In [None]:
token_ids = tokenizer2.convert_tokens_to_ids(tokens)
print(token_ids)

---

## 4️⃣ Special Tokens in XLNet

XLNet uses a different set of special tokens compared to BERT.

In this section, we decode XLNet-specific token IDs to observe
how sequence structure is represented differently across models.

### Notes on Special Tokens
- They mark where sequences start, end, and separate  
- They must match what the model saw during pre-training, which is why the tokenizer adds them automatically  
- They differ by architecture: XLNet uses `<cls>` and `<sep>` where BERT uses `[CLS]` and `[SEP]`


In [None]:
tokenizer2.decode(4)  # XLNet's separator token, <sep>

In [None]:
tokenizer2.decode(3)  # XLNet's classification token, <cls>
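
XLNet also places its special tokens differently: they go at the end of the sequence rather than around it. The sketch below assumes that ID `4` is `<sep>` and ID `3` is `<cls>`, which the two decode cells above let you confirm. The helper `xlnet_layout` is hypothetical, and the content IDs are placeholders.

```python
SEP_ID, CLS_ID = 4, 3  # assumed to be <sep> and <cls>; confirm with the decode cells above


def xlnet_layout(ids_a, ids_b=None):
    """Append XLNet's special tokens after the content IDs (hypothetical helper)."""
    input_ids = ids_a + [SEP_ID]
    if ids_b is not None:
        input_ids += ids_b + [SEP_ID]
    return input_ids + [CLS_ID]  # <cls> comes last, unlike BERT's leading [CLS]


print(xlnet_layout([7, 8, 9]))          # [7, 8, 9, 4, 3]
print(xlnet_layout([7, 8, 9], [5, 6]))  # [7, 8, 9, 4, 5, 6, 4, 3]
```

Comparing the last two entries of `tokenizer2(sentence)["input_ids"]` with this layout is a quick way to check the assumption.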

---

## 5️⃣ Using Hugging Face Models with PyTorch

In this section, we integrate Hugging Face Transformers with PyTorch.

The steps include:
- Tokenizing text and returning PyTorch tensors  
- Loading a fine-tuned sequence classification model  
- Running inference without gradient computation  
- Extracting model logits  
- Mapping predicted class IDs to human-readable labels


In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

In [None]:
print(sentence)

In [None]:
print(input_ids)

In [None]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")  # Tokenizer for the fine-tuned sentiment model

In [None]:
input_ids_pt = tokenizer(sentence, return_tensors="pt")  # "pt" returns PyTorch tensors
print(input_ids_pt)

In [None]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

In [None]:
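with torch.no_grad():  # Inference only, so skip gradient tracking
    logits = model(**input_ids_pt).logits

# Take the argmax over the class dimension; .item() works because the batch holds one sentence
predicted_class_id = logits.argmax(dim=-1).item()
model.config.id2label[predicted_class_id]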
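
The last two steps, turning logits into a label, can be sketched in plain Python so that they run without downloading a model. The logits here are hypothetical values, not output from the model above, and `id2label` mirrors the two-class mapping that this sentiment checkpoint uses. With real tensors, `torch.softmax(logits, dim=-1)` performs the same normalization.

```python
import math

id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # two-class sentiment mapping
logits = [-1.0, 2.0]  # hypothetical scores for one sentence, one per class


def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(logits)
predicted_class_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
label = id2label[predicted_class_id]

print(label)               # POSITIVE
print(round(probs[1], 3))  # 0.953
```

The argmax of the logits and the argmax of the probabilities always agree, which is why the cell above can skip the softmax when only the label is needed.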

---

## 6️⃣ Saving and Loading Models

Finally, we demonstrate how to persist models and tokenizers locally.

This includes:
- Saving a tokenizer to disk  
- Saving a fine-tuned model to disk  
- Reloading both components for future use  

Saving locally lets you reload the same weights and vocabulary later, for example at deployment time, without downloading or retraining them.


In [None]:
model_directory = r"F:\Project_folder\models_directory"  # Raw string so backslashes are not read as escapes

In [None]:
tokenizer.save_pretrained(model_directory)

In [None]:
model.save_pretrained(model_directory)

In [None]:
my_tokenizer = AutoTokenizer.from_pretrained(model_directory)

In [None]:
my_model = AutoModelForSequenceClassification.from_pretrained(model_directory)
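
A note on the Windows path used above: backslashes in ordinary Python string literals can start escape sequences, so a raw string (prefix `r`) is the safer way to write one. The sketch below uses only the standard library. The second path, `F:\new_models`, is a hypothetical example chosen to show the failure mode.

```python
from pathlib import PureWindowsPath

# Raw string: every backslash is kept literally.
model_directory = r"F:\Project_folder\models_directory"
parts = PureWindowsPath(model_directory).parts
print(parts)  # ('F:\\', 'Project_folder', 'models_directory')

# Hypothetical path where a normal string goes wrong: "\n" becomes a newline.
risky = "F:\new_models"
safe = r"F:\new_models"
print("\n" in risky)  # True
print("\n" in safe)   # False
```

Forward slashes are another option, since Python's file APIs on Windows accept them in paths as well.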

---

## Key Takeaways

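- Tokenization is model-specific: vocabularies, token IDs, and special tokens all differ between checkpoints  
- Special tokens mark sequence boundaries and must match what the model saw in pre-training  
- With `return_tensors="pt"`, tokenizer output feeds directly into a PyTorch model  
- `save_pretrained` and `from_pretrained` persist and reload both models and tokenizers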
