<h2 align="center">NLP using Hugging Face Pipelines</h2>

### 4.2: Pipelines

In [2]:
from transformers import pipeline

#### Sentiment Classification

In [None]:
cls = pipeline("sentiment-analysis")

In [4]:
cls("Pushpa 2 movie is full of violence and gave me a headache")

[{'label': 'NEGATIVE', 'score': 0.9987161159515381}]

In [5]:
cls("12th fail is such an inspiring movie")

[{'label': 'POSITIVE', 'score': 0.9983185529708862}]

**Specify Model Explicitly**

On Windows, enable Developer Mode first so that the Hugging Face cache can create symlinks when downloading models: https://learn.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development

pipe = pipeline(model="FacebookAI/roberta-large-mnli")
pipe("This restaurant is awesome")

#### Language Translation

In [None]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-hi")

translation = translator("How are you?")
translation

#### Zero-Shot Classification

In [None]:
classifier = pipeline("zero-shot-classification")
classifier(
    "I bought the product but it is faulty, I would like to return it and get my money back",
    candidate_labels=["refund", "new order", "existing order"],
)

#### Text Generation

In [None]:
generator = pipeline("text-generation")
generator("To become happy in life, we need to focus on healthy diet and ")

#### NER

In [None]:
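# aggregation_strategy="simple" groups sub-tokens into whole entities;
# it replaces the deprecated grouped_entities=True flag
ner = pipeline("ner", aggregation_strategy="simple")
ner("I am Dhaval, I work for Codebasics and live in New Jersey, USA")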

### 4.3: BERT, DistilBERT, RoBERTa, ALBERT

#### Comparison of NLP Models

| Model      | Creator                | Architecture                  | Key Features                                                             | Strengths                            | Weaknesses                            |
|------------|------------------------|-------------------------------|--------------------------------------------------------------------------|--------------------------------------|---------------------------------------|
| BERT       | Google AI (2018)       | Transformer (12–24 layers)    | Bidirectional MLM, NSP                                                   | High performance, Strong embeddings  | Large size, Computationally expensive |
| DistilBERT | Hugging Face (2019)    | Distilled BERT (6 layers)     | Knowledge distillation, No NSP                                           | Faster and lightweight               | Slight performance drop               |
| ALBERT     | Google Research (2019) | Optimized BERT (12–24 layers) | Cross-layer parameter sharing, Factorized embeddings, SOP instead of NSP | Fewer parameters, Lower memory usage | Complex architecture, Longer training |
| RoBERTa    | Facebook AI (2019)     | Enhanced BERT (12–24 layers)  | More training data, Dynamic masking, No NSP                              | Outperforms BERT, Robust performance | Resource-intensive, Long training     |


#### Quick Guide
- Need speed & smaller size? → DistilBERT
- Want parameter efficiency? → ALBERT
- Need best performance? → RoBERTa
- Want a balanced baseline? → BERT, which gives a good trade-off in most cases.

### 4.4: Tokenizers in Hugging Face

In [12]:
from transformers import DistilBertTokenizer, AutoTokenizer

#### DistilBERT Tokenizer

In [None]:
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

tokenizer = DistilBertTokenizer.from_pretrained(model_name)

text = "Happiness lies within you"
output = tokenizer(text)
output

#### BERT Tokenizer

In [14]:
model_name = 'bert-base-uncased'

tokenizer = AutoTokenizer.from_pretrained(model_name)

In [15]:
text = "Happiness lies within you"

output = tokenizer(text)
output

{'input_ids': [101, 8404, 3658, 2306, 2017, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1]}

In [16]:
tokenizer.decode(output['input_ids'])

'[CLS] happiness lies within you [SEP]'

In [17]:
tokens = tokenizer.convert_ids_to_tokens(output['input_ids'])
tokens

['[CLS]', 'happiness', 'lies', 'within', 'you', '[SEP]']
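
Every word in this sentence is in BERT's vocabulary, so each maps to a single token. For a word that is not in the vocabulary, BERT's WordPiece tokenizer falls back to greedy longest-match-first subword splitting. The sketch below illustrates that idea in plain Python; the tiny vocabulary and the `wordpiece` helper are hypothetical and only for illustration, not the library's implementation.

```python
# Toy vocabulary (an assumption for this sketch, not BERT's real vocabulary).
# Pieces that continue a word carry the "##" prefix, as in BERT.
toy_vocab = {"happiness", "un", "##happi", "##ness"}

def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits, so the whole word becomes the unknown token.
            return [unk]
        pieces.append(match)
        start = end
    return pieces

print(wordpiece("happiness", toy_vocab))    # ['happiness']
print(wordpiece("unhappiness", toy_vocab))  # ['un', '##happi', '##ness']
print(wordpiece("xyz", toy_vocab))          # ['[UNK]']
```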

#### Special token ids

In [18]:
tokenizer.cls_token_id

101

In [19]:
tokenizer.sep_token_id

102

In [20]:
tokenizer.pad_token_id

0

In [21]:
texts = [
    "Happiness lies within you",
    "I love nature"
]

In [22]:
tokenizer(texts)

{'input_ids': [[101, 8404, 3658, 2306, 2017, 102], [101, 1045, 2293, 3267, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}

#### padding and truncation

In [23]:
tokenizer(texts, padding=True, return_tensors='pt')

{'input_ids': tensor([[ 101, 8404, 3658, 2306, 2017,  102],
        [ 101, 1045, 2293, 3267,  102,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 0]])}

In [24]:
tokenizer(texts, padding='max_length', max_length=5, truncation=True, return_tensors='pt')

{'input_ids': tensor([[ 101, 8404, 3658, 2306,  102],
        [ 101, 1045, 2293, 3267,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1]])}

In [25]:
tokenizer(texts, padding='max_length', max_length=20, truncation=True, return_tensors='pt')

{'input_ids': tensor([[ 101, 8404, 3658, 2306, 2017,  102,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0],
        [ 101, 1045, 2293, 3267,  102,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}
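
The outputs above can be reproduced by hand, which makes the roles of `padding`, `truncation`, and `attention_mask` easier to see. The following is a simplified sketch for single sentences; the `encode` helper is hypothetical, and it hard-codes the special token ids printed earlier (101, 102, 0). The real tokenizer supports more truncation and padding strategies than this.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special token ids shown earlier

def encode(content_ids, max_length):
    """Truncate and pad one sentence to exactly max_length ids."""
    # Truncation: keep room for [CLS] and [SEP], drop content tokens from the end.
    kept = content_ids[: max_length - 2]
    input_ids = [CLS_ID] + kept + [SEP_ID]
    # Real tokens get mask 1, padding gets mask 0 so the model ignores it.
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad

happiness = [8404, 3658, 2306, 2017]  # "happiness lies within you"
nature = [1045, 2293, 3267]           # "i love nature"

print(encode(happiness, 5))  # ([101, 8404, 3658, 2306, 102], [1, 1, 1, 1, 1])
print(encode(nature, 6))     # ([101, 1045, 2293, 3267, 102, 0], [1, 1, 1, 1, 1, 0])
```

Note that with `max_length=5` the last word of the first sentence is dropped while `[SEP]` is kept, matching the tensor shown for that call.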

#### Supplying tokens to a model

In [29]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequences = [
    "That phone case broke after 2 days of use", 
    "That herbel tea has helped me so much"
]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
output

SequenceClassifierOutput(loss=None, logits=tensor([[ 4.0561, -3.2456],
        [-3.6340,  3.8584]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [31]:
import torch
import torch.nn.functional as F

probs = F.softmax(output.logits, dim=-1)
probs

tensor([[9.9933e-01, 6.7395e-04],
        [5.5700e-04, 9.9944e-01]], grad_fn=<SoftmaxBackward0>)

In [32]:
predicted_classes = torch.argmax(probs, dim=1).tolist()
predicted_classes

[0, 1]

input text ==> tokenizer ==> token IDs ==> model ==> logits ==> post-processing ==> output label

Earlier, the Hugging Face pipeline did all of this in a single line of code. The code above shows what the pipeline does internally.
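
The post-processing step is small enough to write out in plain Python. This sketch takes the logits printed above (rounded to four decimals), applies softmax, and maps the winning index to a label. The `id2label` mapping is written out by hand here as an assumption; in practice it can be read from `model.config.id2label`.

```python
import math

# Logits copied from the model output above
logits = [[4.0561, -3.2456], [-3.6340, 3.8584]]

# Assumed label mapping for this checkpoint (normally model.config.id2label)
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

def softmax(row):
    """Turn a row of logits into probabilities that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

probs = [softmax(row) for row in logits]
predicted = [row.index(max(row)) for row in probs]
labels = [id2label[i] for i in predicted]

print(labels)  # ['NEGATIVE', 'POSITIVE']
```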

In [33]:
from transformers import pipeline
pipe = pipeline("sentiment-analysis")
pipe("My dog is cute")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9997941851615906}]

### 4.5: Model Fine Tuning

In [1]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments, DataCollatorWithPadding
from datasets import load_dataset
import numpy as np
import torch
from sklearn.metrics import accuracy_score, f1_score

#### Load Dataset

In [None]:
dataset = load_dataset("glue", "mrpc")
dataset

In [None]:
dataset['train'][0]

In [None]:
# Print the first few sentence pairs labelled 0 (not paraphrases)
threshold = 5
count = 0
for record in dataset['train']:
    if record['label'] == 0:
        print(record)
        count += 1
        if count >= threshold:
            break

In [None]:
dataset['train'].features

#### Tokenization

In [None]:
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

In [None]:
tokenized_dataset

In [None]:
# Prepare datasets for PyTorch
train_dataset = tokenized_dataset["train"].shuffle(seed=42)
valid_dataset = tokenized_dataset["validation"]
test_dataset = tokenized_dataset["test"]

In [None]:
# The data collator applies dynamic padding per batch. For example, say batch 1 has texts of
# length 10, 12 and 15, and batch 2 has lengths 8, 6 and 9. With the collator, batch 1 is padded
# to its own max length of 15, whereas batch 2 is padded to a max length of 9. The alternative is
# global padding, which takes the max length over the entire dataset and pads every text to that
# length, which is less efficient.

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
data_collator
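
To make the saving concrete, the sketch below counts how many token slots each strategy allocates for the two example batches described in the comment. The helper names are hypothetical, and the lengths are the illustrative ones from the comment, not measurements from MRPC.

```python
# Token counts per example: two batches of three texts each
batches = [[10, 12, 15], [8, 6, 9]]

def slots_dynamic(batches):
    """Each batch is padded only to its own longest example."""
    return sum(max(batch) * len(batch) for batch in batches)

def slots_global(batches):
    """Every example is padded to the longest example in the whole dataset."""
    longest = max(max(batch) for batch in batches)
    return sum(longest * len(batch) for batch in batches)

real_tokens = sum(sum(batch) for batch in batches)

print(real_tokens)             # 60 real tokens
print(slots_dynamic(batches))  # 72 slots, so 12 are padding
print(slots_global(batches))   # 90 slots, so 30 are padding
```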

In [None]:
# Check if CUDA is available
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
device

#### Train / Fine Tune the Model

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2).to(device)

In [None]:
# Define evaluation metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    acc = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions)
    return {"accuracy": acc, "f1": f1}
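
For intuition about what the two metrics measure, here is a hand-rolled version on a tiny made-up example. The labels and predictions are invented for illustration, and the helpers are a sketch of the binary case only (positive class 1, which is also the default that `f1_score` uses).

```python
def accuracy(labels, preds):
    """Fraction of predictions that match the labels."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)

def f1_binary(labels, preds, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(l == positive and p == positive for l, p in zip(labels, preds))
    fp = sum(l != positive and p == positive for l, p in zip(labels, preds))
    fn = sum(l == positive and p != positive for l, p in zip(labels, preds))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up example: 2 true positives, 1 false positive, 1 false negative
labels = [1, 0, 1, 1, 0]
preds = [1, 0, 0, 1, 1]

print(accuracy(labels, preds))             # 0.6
print(round(f1_binary(labels, preds), 4))  # 0.6667
```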

In [None]:
# Set training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

In [None]:
trainer = Trainer(
    model,
    training_args,
    train_dataset=train_dataset,   # shuffled training split prepared above
    eval_dataset=valid_dataset,    # validation split prepared above
    data_collator=data_collator,
    processing_class=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
# You may get an error if mlflow or dagshub are installed. To remove that error, you can uninstall 
# mlflow and dagshub and then restart the kernel
trainer.train()

#### Model Evaluation

In [None]:
# Evaluate on test set
results = trainer.evaluate(test_dataset)
results

#### Predictions

In [None]:
# Perform predictions on new sentence pairs
def predict(sentences):
    model.eval()  # disable dropout so predictions are deterministic
    inputs = tokenizer(sentences["sentence1"], sentences["sentence2"], return_tensors="pt", truncation=True, padding=True).to(device)
    with torch.no_grad():  # no gradients needed for inference
        logits = model(**inputs).logits
    predictions = torch.argmax(logits, dim=1)
    return predictions.cpu().numpy()

In [None]:
sample_sentences = {
    "sentence1": [
        "PCCW 's chief operating officer , Mike Butcher , and Alex Arena , the chief financial officer , will report directly to Mr So",
        "The Nasdaq composite index increased 10.73 , or 0.7 percent , to 1,514.77"
    ],
    "sentence2": [
        "Current Chief Operating Officer Mike Butcher and Group Chief Financial Officer Alex Arena will report to So", 
        "The Nasdaq Composite index, full of technology stocks, was lately up around 18 points"
    ]
}
predictions = predict(sample_sentences)
print("Predictions:", predictions)

### MCQ on NLP using Hugging Face

- Which pipeline is used for entity extraction?
  - The pipeline used for entity extraction is the Named Entity Recognition (NER) pipeline, created in Hugging Face with `pipeline("ner")` (an alias for token classification). This pipeline processes text to identify and extract entities such as names, organizations, and locations.
- What is Hugging Face known for in the field of NLP?
  - Providing tools for developing NLP applications
  - Hugging Face is known for democratizing Natural Language Processing (NLP) by providing open-source libraries like Transformers, which offer thousands of pre-trained models (such as BERT, GPT, RoBERTa) for NLP tasks including text classification, translation, summarization, and question answering. They also provide tools for dataset management (Datasets library), efficient tokenization (Tokenizers library), and a collaborative Model Hub to share and deploy models. Hugging Face has built a vibrant community, making advanced NLP technology accessible and easy to use for researchers and developers worldwide.
- Which tokenization technique is used by GPT models?
  - GPT models use Byte Pair Encoding (BPE), which starts from small units and iteratively merges the most frequent pair of adjacent symbols into a new token. The resulting subword vocabulary lets GPT models handle rare and out-of-vocabulary words efficiently.
- What task does AutoTokenizer.from_pretrained() perform?
  - AutoTokenizer.from_pretrained() loads the appropriate tokenizer and its vocabulary associated with a specific pretrained model, allowing you to easily preprocess text data for use with that model.
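
The BPE merge loop mentioned above can be sketched in a few lines. This is a character-level toy version on an invented word-frequency table, with hypothetical helper names; GPT-2 and later models apply the same idea to bytes and learn tens of thousands of merges.

```python
from collections import Counter

def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(vocab, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Invented toy corpus: word -> frequency
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
vocab = {tuple(word): freq for word, freq in word_freqs.items()}

merges = []
for _ in range(2):  # learn two merges
    pairs = count_pairs(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(vocab, best)
    merges.append(best)

print(merges)
print(list(vocab))
```

After two merges the frequent suffix "est" has become a single symbol, so "newest" is represented as `('n', 'e', 'w', 'est')`.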

### MCQ on NLP

- What is the primary purpose of using a tokenizer in Hugging Face models?
  - To split the text into tokens compatible with the model
- Which of the following correctly represents the role of BERT in NLP?
  - Contextual word embeddings for a wide range of tasks
- Why is it important to remove stop words in text preprocessing?
  - Removing stop words in text preprocessing is important because it reduces the size of the text data, improves computational efficiency, and increases the relevance of analysis by focusing on words that carry meaningful information.
- In tokenization, what challenge does subword tokenization aim to solve?
  - Handling out-of-vocabulary words by breaking them into smaller units
- Which metric in TF-IDF represents the importance of a word in a specific document?
  - In TF-IDF, the metric that represents the importance of a word in a specific document is the TF-IDF score itself. It combines Term Frequency (TF) (how often a word appears in the document) and Inverse Document Frequency (IDF) (how unique the word is across all documents). A higher TF-IDF score indicates greater importance of the word in that particular document.
- Why is fine-tuning transformer models computationally expensive?
  - They have millions (or billions) of parameters to optimize
- Which of the following best describes the difference between Bag of Words (BoW) and TF-IDF?
  - BoW counts word occurrences; TF-IDF also considers word importance.
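
The difference between the two representations shows up clearly on a toy corpus. The sketch below uses the plain textbook definitions (term frequency times log of inverse document frequency) on three invented documents; libraries such as scikit-learn apply smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

# Invented toy corpus, already tokenized
docs = [
    ["the", "movie", "was", "inspiring"],
    ["the", "movie", "was", "violent"],
    ["the", "tea", "was", "helpful"],
]

def tf(term, doc):
    """Term frequency: share of the document's tokens equal to `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency; `term` must occur in at least one document."""
    df = sum(term in doc for doc in docs)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Bag of Words: every word in the first document simply counts once
bow = Counter(docs[0])
print(bow["the"], bow["inspiring"])  # 1 1

# TF-IDF: "the" appears in every document, so its weight drops to zero
for term in ["the", "movie", "inspiring"]:
    print(term, round(tfidf(term, docs[0], docs), 4))
```

Here "inspiring" gets the highest score because it occurs in only one document, while "the" scores 0 even though Bag of Words treats both words the same.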