<a href="https://colab.research.google.com/github/xmpuspus/NLP-transfer-learning/blob/main/NLP_transfer_learning_xmpuspus.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## Question 1: Develop an AI model that will be used to classify the components of an object that will be provided via a text input. For example, if the user keys in "a can of tuna", the expected output will be as follows: 3 possible components; a) can, b) tuna, c) can packaging, each with the relevant inference score.

In [None]:
!pip install transformers torch datasets
!pip install accelerate -U


Collecting datasets
  Downloading datasets-2.17.1-py3-none-any.whl (536 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
Installing collected packages: dill, multiprocess, datasets
Successfully installed datasets-2.17.1 dill-0.3.8 multiprocess-0.70.16
Collecting accelerate
  Downloading accelerate-0.27.2-py3-none-any.whl (279 kB)
Installing collected pa

In [None]:
import pandas as pd
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer
from datasets import Dataset
from sklearn.model_selection import train_test_split

# Extended dataset examples
data = {
    "Object": [
        "a can of tuna", "a bottle of water", "a chocolate bar",
        "a laptop with charger", "a book on AI", "a pair of sneakers",
        "a glass of milk", "a plate of spaghetti", "a cup of coffee",
        "a smartphone with earbuds", "a pack of gum", "a jar of honey",
        "a bowl of cereal", "a tube of toothpaste", "a bag of flour",
        "a box of tea bags", "a can of soda", "a bottle of shampoo",
        "a bar of soap", "a jug of orange juice", "a carton of eggs",
        "a packet of seeds", "a loaf of bread", "a piece of cake",
        "a slice of pizza", "a tub of ice cream", "a stick of butter",
        "a flask of oil", "a container of yogurt", "a pouch of tobacco"
    ],
    "Components": [
        "can;tuna;packaging", "bottle;water;packaging",
        "chocolate;bar;wrapper", "laptop;charger;packaging",
        "book;AI;cover", "sneakers;;box", "glass;milk;",
        "plate;spaghetti;", "cup;coffee;", "smartphone;earbuds;packaging",
        "pack;gum;wrapper", "jar;honey;lid", "bowl;cereal;",
        "tube;toothpaste;cap", "bag;flour;packaging",
        "box;tea bags;wrapper", "can;soda;tab", "bottle;shampoo;cap",
        "bar;soap;wrapper", "jug;orange juice;cap", "carton;eggs;container",
        "packet;seeds;wrapper", "loaf;bread;bag", "piece;cake;plate",
        "slice;pizza;box", "tub;ice cream;lid", "stick;butter;wrapper",
        "flask;oil;cap", "container;yogurt;lid", "pouch;tobacco;seal"
    ]
}

df = pd.DataFrame(data)

# Simplified label map for demonstration purposes
label_map = {'O': 0, 'B-COMP': 1, 'I-COMP': 2}  # 'O': Other, 'B-COMP': Beginning of Component, 'I-COMP': Inside a Component

# Function to encode components into integer labels: one label per
# semicolon-separated component (not per word); empty components map to 'O'
def encode_labels(components_str):
    encoded_labels = []
    for comp in components_str.split(';'):
        if comp:  # If component is not empty
            encoded_labels.append(label_map['B-COMP'])
        else:
            encoded_labels.append(label_map['O'])
    return encoded_labels

# Encoding labels for each object
df['Encoded Components'] = df['Components'].apply(encode_labels)


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples['Object'], truncation=True, padding='max_length', is_split_into_words=False, return_tensors="pt")
    labels = []
    for i, label in enumerate(examples['Encoded Components']):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
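        label_ids = []
        for word_id in word_ids:
            if word_id is None:
                # Special and padding tokens get -100 so the loss ignores them
                label_ids.append(-100)
            elif word_id < len(label):
                label_ids.append(label[word_id])
            else:
                # Words beyond the end of the label list default to 'O'
                label_ids.append(label_map['O'])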
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

# Train-test Split
train_df, val_df = train_test_split(df, test_size=0.1)

# DataFrame into Hugging Face's Dataset
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)

# Tokenization and label alignment
train_dataset = train_dataset.map(tokenize_and_align_labels, batched=True)
val_dataset = val_dataset.map(tokenize_and_align_labels, batched=True)

model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased", num_labels=len(label_map))

# Train configs
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
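    # Note: 500 warmup steps exceed the 21 optimizer steps of this run (see
    # global_step in the output), so the learning rate never leaves warmup
    warmup_steps=500,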
    weight_decay=0.01,
    logging_dir='./logs',
    evaluation_strategy="epoch",
)

# Actually train the model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

trainer.train()




Map:   0%|          | 0/27 [00:00<?, ? examples/s]

Map:   0%|          | 0/3 [00:00<?, ? examples/s]

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | No log        | 1.227792        |
| 2     | No log        | 1.150621        |
| 3     | No log        | 1.029593        |


TrainOutput(global_step=21, training_loss=1.159881864275251, metrics={'train_runtime': 10.8151, 'train_samples_per_second': 7.489, 'train_steps_per_second': 1.942, 'total_flos': 21165228647424.0, 'train_loss': 1.159881864275251, 'epoch': 3.0})
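
The `encode_labels` function above produces one label per semicolon-separated component, while a token-classification model expects one label per word of the input text. A minimal sketch of word-level BIO labelling is shown below. It assumes each component appears verbatim in the object text and tags only its first occurrence; `word_level_labels` is a hypothetical helper, not part of the notebook.

```python
LABEL_MAP = {"O": 0, "B-COMP": 1, "I-COMP": 2}


def word_level_labels(text, components_str):
    """Return one BIO label id per whitespace-separated word of `text`."""
    words = text.lower().split()
    labels = [LABEL_MAP["O"]] * len(words)
    for comp in components_str.split(";"):
        comp_words = comp.lower().split()
        if not comp_words:
            continue  # skip empty components such as "glass;milk;"
        n = len(comp_words)
        # Tag the first position where the component's words match exactly.
        for start in range(len(words) - n + 1):
            if words[start:start + n] == comp_words:
                labels[start] = LABEL_MAP["B-COMP"]
                for k in range(start + 1, start + n):
                    labels[k] = LABEL_MAP["I-COMP"]
                break
    return labels


print(word_level_labels("a jug of orange juice", "jug;orange juice;cap"))
```

Components that never occur in the text (such as "packaging" or "cap") receive no label in this scheme, so predicting them would need a different formulation than token tagging.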

In [None]:
import torch

# Check if CUDA is available and choose device accordingly
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Move the model to the selected device (GPU if available, otherwise CPU)
model.to(device)

BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, el

In [None]:
test_cases = [
    "a smartphone with a case and screen protector",
    "a pair of headphones with Bluetooth connectivity",
    "a laptop bag with multiple compartments",
    "a water bottle made of stainless steel",
    "a notebook with lined pages and a hard cover"
]

for test_case in test_cases:

    inputs = tokenizer(test_case, return_tensors="pt", padding=True, truncation=True, max_length=512)
    inputs = {k: v.to(device) for k, v in inputs.items()}


    with torch.no_grad():
        outputs = model(**inputs)
    predictions = outputs.logits.argmax(dim=-1)

    # Decode predictions
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].cpu().numpy())
    id2label = {v: k for k, v in label_map.items()}  # invert label_map for decoding
    predicted_labels = [id2label[p] for p in predictions[0].cpu().tolist()]

    # Combine tokens with their predicted labels
    token_label_pairs = zip(tokens, predicted_labels)
    print(f"Test case: '{test_case}'")
    for token, label in token_label_pairs:
        print(f"{token}: {label}")
    print("\n")

Test case: 'a smartphone with a case and screen protector'
[CLS]: I-COMP
a: B-COMP
smartphone: I-COMP
with: O
a: B-COMP
case: O
and: B-COMP
screen: I-COMP
protector: O
[SEP]: B-COMP


Test case: 'a pair of headphones with Bluetooth connectivity'
[CLS]: B-COMP
a: B-COMP
pair: B-COMP
of: O
head: O
##phones: O
with: O
blue: O
##tooth: O
connectivity: I-COMP
[SEP]: B-COMP


Test case: 'a laptop bag with multiple compartments'
[CLS]: I-COMP
a: B-COMP
laptop: O
bag: O
with: O
multiple: O
compartments: I-COMP
[SEP]: B-COMP


Test case: 'a water bottle made of stainless steel'
[CLS]: B-COMP
a: B-COMP
water: O
bottle: O
made: B-COMP
of: O
stainless: O
steel: I-COMP
[SEP]: B-COMP


Test case: 'a notebook with lined pages and a hard cover'
[CLS]: I-COMP
a: B-COMP
notebook: O
with: O
lined: O
pages: I-COMP
and: I-COMP
a: B-COMP
hard: B-COMP
cover: I-COMP
[SEP]: B-COMP
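
Question 1 asks for an inference score per component, which the loop above does not print. One way to get it is to turn each token's logits into probabilities with a softmax and average them over each predicted component span. The sketch below does this in plain Python on hypothetical logits; `extract_components` and `SPECIAL_TOKENS` are illustrative names. In the notebook, the same probabilities could be obtained with `torch.softmax(outputs.logits, dim=-1)`, and the tokens with `tokenizer.convert_ids_to_tokens`.

```python
import math

ID2LABEL = {0: "O", 1: "B-COMP", 2: "I-COMP"}
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def softmax(row):
    """Numerically stable softmax over one token's logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def extract_components(tokens, logits):
    """Group B-/I- tagged tokens into (component, mean probability) pairs."""
    components = []  # each entry is [text, [per-token probabilities]]
    open_span = False
    for token, row in zip(tokens, logits):
        if token in SPECIAL_TOKENS:
            open_span = False
            continue
        probs = softmax(row)
        best = probs.index(max(probs))
        label = ID2LABEL[best]
        if token.startswith("##"):
            # A WordPiece continuation joins the span of the preceding piece.
            if open_span:
                components[-1][0] += token[2:]
                components[-1][1].append(probs[best])
            continue
        if label == "B-COMP" or (label == "I-COMP" and not open_span):
            components.append([token, [probs[best]]])
            open_span = True
        elif label == "I-COMP":
            components[-1][0] += " " + token
            components[-1][1].append(probs[best])
        else:
            open_span = False
    return [(text, sum(s) / len(s)) for text, s in components]


# Hypothetical logits for "a jug of orange juice" (columns: O, B-COMP, I-COMP)
demo_tokens = ["[CLS]", "a", "jug", "of", "orange", "juice", "[SEP]"]
demo_logits = [[0, 0, 0], [4, 0, 0], [0, 4, 0], [4, 0, 0],
               [0, 4, 0], [0, 0, 4], [0, 0, 0]]
for name, score in extract_components(demo_tokens, demo_logits):
    print(f"{name}: {score:.3f}")
```

Averaging token probabilities is only one possible scoring choice; taking the minimum over the span would be a more conservative alternative.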




## Question 2: Based on the model developed in Question 1, how can it be improved through different machine learning techniques? List each machine learning technique and the approach to be used.

- Increase the diversity and quantity of the training data, which helps the model learn more robust features and reduces overfitting. This can be done by creating synthetic training data, modifying existing examples, or finding external datasets that fit the problem statement. Packages such as nlpaug (https://github.com/makcedward/nlpaug) can be used to enlarge the dataset artificially.

- Try other pretrained checkpoints and their tokenizers, such as `xlnet-base-cased`, to handle out-of-vocabulary words better, especially since our model performs poorly on words that aren't in the dataset. This script uses `bert-base-uncased`.

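- Hyperparameter optimization (HPO) was skipped due to time constraints. Even a basic approach such as grid search over the training hyperparameters (for example, learning rate and batch size) is a natural future improvement.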
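
The first point, creating synthetic training data, can be sketched without extra dependencies by recombining containers and contents through a fixed template. The vocabularies and the `make_synthetic_examples` helper below are hypothetical, and the labels are word-level ids from the `O`/`B-COMP` scheme used in Question 1. A library such as nlpaug could replace or complement this approach.

```python
import itertools
import random

# Hypothetical vocabularies; extend them to cover the target domain.
CONTAINERS = ["can", "bottle", "jar", "bag", "box"]
CONTENTS = ["tuna", "water", "honey", "flour", "rice"]

LABEL_MAP = {"O": 0, "B-COMP": 1, "I-COMP": 2}


def make_synthetic_examples(n_examples, seed=0):
    """Build (text, word-level labels) pairs from an 'a X of Y' template."""
    pairs = list(itertools.product(CONTAINERS, CONTENTS))
    random.Random(seed).shuffle(pairs)  # seeded for reproducibility
    examples = []
    for container, content in pairs[:n_examples]:
        text = f"a {container} of {content}"
        # "a" and "of" are not components; the container and content are.
        labels = [LABEL_MAP["O"], LABEL_MAP["B-COMP"],
                  LABEL_MAP["O"], LABEL_MAP["B-COMP"]]
        examples.append((text, labels))
    return examples


for text, labels in make_synthetic_examples(3):
    print(text, labels)
```

Template data of this kind adds volume but little linguistic variety, so it works best mixed with hand-written or externally sourced examples.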


## Question 3: How do you prevent overfitting of the features / data points in the model?

- Since I built the dataset above by hand, it is small (30 examples). With more data, k-fold cross-validation (with at least 10 folds) would be useful to check that the model's performance is consistent across different parts of the dataset, which helps reveal overfitting. A tool like Optuna (https://optuna.readthedocs.io/en/stable/) can be used for HPO.

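- Regularization techniques can also help prevent overfitting; the training configuration above already applies weight decay (`weight_decay=0.01`). Early stopping is another option: it halts training once validation performance stops improving, before the model starts fitting noise in the training data. It can be added easily by passing `EarlyStoppingCallback` from `transformers` to the Trainer's `callbacks` parameter.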
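
The k-fold idea from the first point can be sketched in plain Python. This is a minimal illustration rather than the notebook's method: `k_fold_indices` is a hypothetical helper, and `sklearn.model_selection.KFold` offers equivalent functionality off the shelf. Each fold serves once as the validation set while a fresh model is trained on the remaining folds.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, val_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, val_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val_fold


# With the 30 hand-made examples and k=10, each fold holds out 3 examples.
splits = list(k_fold_indices(n_samples=30, k=10))
for fold_number, (train_idx, val_idx) in enumerate(splits, start=1):
    print(f"fold {fold_number}: train={len(train_idx)} val={len(val_idx)}")
```

A large spread in validation loss across folds would suggest that the model is sensitive to which examples it sees, a common sign of overfitting on small datasets.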