# Zero Shot Classification - BART - Fine Tuning

In this post we are going to use the BART model and a movies dataset. We will first fine-tune BART on a few classes in the movies data, and then use the fine-tuned model for zero-shot classification to predict classes on validation data that contains all the classes.

## Importing Data

Our data is saved in Google Drive. Let's install the required libraries, mount the drive, and load the data.

In [None]:
!pip install transformers==4.28.0

Collecting transformers==4.28.0
  Downloading transformers-4.28.0-py3-none-any.whl (7.0 MB)
Collecting huggingface-hub<1.0,>=0.11.0 (from transformers==4.28.0)
  Downloading huggingface_hub-0.17.2-py3-none-any.whl (294 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers==4.28.0)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.17.2 tokenizers-0.13.3 transformers-4.28.0


In [None]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.14.5-py3-none-any.whl (519 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected packages: xxhash, dill, multiprocess, datasets
Successfully installed datasets-2.14.

In [None]:
from datasets import disable_caching
disable_caching()

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd

train_data = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Datasets/train.csv')
train_data.head()

Unnamed: 0,id,movie_name,synopsis,genre
0,44978,Super Me,A young scriptwriter starts bringing valuable ...,fantasy
1,50185,Entity Project,A director and her friends renting a haunted h...,horror
2,34131,Behavioral Family Therapy for Serious Psychiat...,This is an educational video for families and ...,family
3,78522,Blood Glacier,Scientists working in the Austrian Alps discov...,scifi
4,2206,Apat na anino,Buy Day - Four Men Widely - Apart in Life - By...,action


In [None]:
unique_labels = train_data['genre'].unique().tolist()
unique_labels

['fantasy',
 'horror',
 'family',
 'scifi',
 'action',
 'crime',
 'adventure',
 'mystery',
 'romance',
 'thriller']

In [None]:
id2label = {idx: label for idx, label in enumerate(unique_labels)}
id2label

{0: 'fantasy',
 1: 'horror',
 2: 'family',
 3: 'scifi',
 4: 'action',
 5: 'crime',
 6: 'adventure',
 7: 'mystery',
 8: 'romance',
 9: 'thriller'}

In [None]:
label2id = {label: idx for idx, label in enumerate(unique_labels)}
label2id

{'fantasy': 0,
 'horror': 1,
 'family': 2,
 'scifi': 3,
 'action': 4,
 'crime': 5,
 'adventure': 6,
 'mystery': 7,
 'romance': 8,
 'thriller': 9}

In [None]:
train_data['genre_id'] = train_data['genre'].map(label2id)
train_data['genre_id'].value_counts()

0    5400
1    5400
2    5400
3    5400
4    5400
5    5400
6    5400
7    5400
8    5400
9    5400
Name: genre_id, dtype: int64

Now let's split the data into training and validation sets. Since this is just an academic exercise, I keep only 10% of the data for training to limit the computing power needed; the remaining 90% is set aside for validation.

In [None]:
#Split the data
from sklearn.model_selection import train_test_split

train_texts_label, val_texts_label = train_test_split(train_data, test_size=.9, random_state = 100)
train_texts_label.head(2)

Unnamed: 0,id,movie_name,synopsis,genre,genre_id
14333,49829,Predecessor,"Competing teams of scientists, one millennials...",horror,1
21599,84919,The Sleeping Car Murder,The witnesses of a train murder must take the ...,thriller,9


The next step would be to create a subset of the training data that keeps only 7 classes and holds the other 3 out of the fine-tuning process. First, let's check how many rows each class has.

In [None]:
train_texts_label['genre_id'].value_counts()

7    578
4    566
9    559
1    558
8    538
2    538
5    534
3    531
6    507
0    491
Name: genre_id, dtype: int64

In this attempt I have used all available classes to train the model. However, to make this a truly zero-shot classification strategy, one should hold certain classes back from training.

In [None]:
#train_texts_label_filtered = train_texts_label[~train_texts_label['genre'].isin(['mystery','romance','crime'])]
train_texts_label_filtered = train_texts_label
train_texts_label_filtered['genre_id'].value_counts()

7    578
4    566
9    559
1    558
8    538
2    538
5    534
3    531
6    507
0    491
Name: genre_id, dtype: int64
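To make the hold-out idea concrete, here is a minimal sketch in plain Python. The tiny `rows` list and the helper `split_seen_unseen` are hypothetical and only for illustration; the three held-out genres are the ones named in the commented-out filter above.

```python
# Hypothetical miniature of the training rows: (text, genre) pairs.
rows = [
    ("A detective hunts a killer.", "mystery"),
    ("Two strangers fall in love.", "romance"),
    ("A heist goes wrong.", "crime"),
    ("A dragon guards a kingdom.", "fantasy"),
    ("A ghost haunts a family home.", "horror"),
]

# Genres to hide from fine-tuning (from the commented-out filter above).
held_out = {"mystery", "romance", "crime"}


def split_seen_unseen(rows, held_out):
    """Return (seen, unseen): rows usable for fine-tuning and rows held back."""
    seen = [r for r in rows if r[1] not in held_out]
    unseen = [r for r in rows if r[1] in held_out]
    return seen, unseen


seen_rows, unseen_rows = split_seen_unseen(rows, held_out)
print([genre for _, genre in seen_rows])
print([genre for _, genre in unseen_rows])
```

With a split like this, the held-out genres would appear only at validation time, so any accuracy on them would reflect zero-shot behaviour rather than memorised labels.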

## Data Pre-Processing for Model Finetuning

We need a few more steps before we can train our model. The first is to prepare the training text, which in this case is the concatenation of the synopsis and the movie name. The synopsis in the data is limited to 1-2 lines per movie, which in my opinion may be too little to achieve high accuracy.

In [None]:
concatenated_train_text = train_texts_label_filtered['synopsis'] + " " + train_texts_label_filtered['movie_name']
concatenated_train_text.head(1)

14333    Competing teams of scientists, one millennials...
dtype: object

The split in this section is simply there so the model can be evaluated on a validation set during training.

In [None]:
#Split the data
from sklearn.model_selection import train_test_split

train_texts, val_texts, train_labels, val_labels = train_test_split(concatenated_train_text, train_texts_label_filtered['genre'], test_size=.2)


In [None]:
val_labels.value_counts()

family       127
mystery      120
action       115
adventure    107
scifi        106
horror       106
fantasy      105
romance      105
thriller      98
crime         91
Name: genre, dtype: int64

In [None]:
train_labels.value_counts()

thriller     461
mystery      458
horror       452
action       451
crime        443
romance      433
scifi        425
family       411
adventure    400
fantasy      386
Name: genre, dtype: int64
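The row counts printed so far follow directly from the two splits. The sketch below reproduces them with plain arithmetic; `split_sizes` is a hypothetical helper, and it assumes (as I understand `train_test_split`) that a fractional test size is rounded up, which makes no difference here because the divisions are exact.

```python
import math
from fractions import Fraction


def split_sizes(n_rows, test_fraction):
    """Return (n_train, n_test) for a fractional split, rounding the test part up."""
    n_test = math.ceil(test_fraction * n_rows)
    return n_rows - n_test, n_test


total_rows = 10 * 5400  # ten genres with 5400 rows each

# First split: test_size=.9 keeps 10% for fine-tuning.
kept, set_aside = split_sizes(total_rows, Fraction(9, 10))

# Second split: test_size=.2 on the kept rows.
n_train, n_val = split_sizes(kept, Fraction(2, 10))

print(kept, set_aside)  # 5400 48600
print(n_train, n_val)   # 4320 1080
```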

In this attempt, I have leveraged bart-large-mnli as my model. But one can also leverage other available zero-shot-classification models.

In [None]:
from transformers import BartTokenizerFast, BartForSequenceClassification

checkpoint = 'facebook/bart-large-mnli'
tokenizer = BartTokenizerFast.from_pretrained(checkpoint)
model = BartForSequenceClassification.from_pretrained(checkpoint)

In order to pass the data to the model, we need to convert our pandas DataFrames into the Dataset format.

In [None]:
import torch
from datasets import Dataset, load_metric

train_data_1 = pd.DataFrame(data = {'text': train_texts, 'class': train_labels}, columns = ['text', 'class']).reset_index()
val_data_1 = pd.DataFrame(data = {'text': val_texts, 'class': val_labels}, columns = ['text', 'class']).reset_index()

train_ds = Dataset.from_pandas(train_data_1)
test_ds = Dataset.from_pandas(val_data_1)

train_ds

Dataset({
    features: ['index', 'text', 'class'],
    num_rows: 4320
})

This step is one of the crucial ones, as we are creating a premise and hypothesis for each text. Along with the correct hypothesis, I also randomly assign one of the other labels as a contradictory hypothesis for the same text. The labels `[2, 0]` are the entailment and contradiction ids in the bart-large-mnli label mapping. This can be enhanced further by including more contradictory hypotheses per text to improve training. In my other attempts on the same data, I will also modify the hypothesis template to compare the accuracy of the models.

In [None]:
import random
template = 'This movie is about {}'
def create_input_sequence(sample):
  text = sample["text"]
  label = sample["class"][0]
  contradiction_label = random.choice([x for x in unique_labels if x != label])
  encoded_sequence = tokenizer(text*2 , [template.format(label), template.format(contradiction_label)], truncation = True, padding = 'max_length', max_length = 128)
  encoded_sequence["labels"] = [2,0]
  encoded_sequence["input_sentence"] = tokenizer.batch_decode(encoded_sequence.input_ids)
  return encoded_sequence


train_dataset = train_ds.map(create_input_sequence, batched = True, batch_size = 1, remove_columns = ['index', 'text', 'class'])
test_dataset = test_ds.map(create_input_sequence, batched = True, batch_size = 1, remove_columns = ['index', 'text', 'class'])
train_dataset

Dataset({
    features: ['input_ids', 'attention_mask', 'labels', 'input_sentence'],
    num_rows: 8640
})

In [None]:
train_dataset['input_sentence'][0]


'<s>After getting interested in murder as a kid in Colombia, Gabriela now has a scrapbook on murders including clippings on "The Blue Blood Killer". While cleaning his latest murder scene in Miami, she comes across a clue missed by the cops. Curdled</s></s>This movie is about thriller</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>'
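The idea of adding more contradictory hypotheses per text could look roughly like the sketch below. It only builds the (premise, hypothesis, label) triples and leaves tokenization out; `build_nli_pairs` and its `n_contradictions` parameter are hypothetical names, and the short genre list is just sample data.

```python
import random

ENTAILMENT, CONTRADICTION = 2, 0  # same label ids as in the cell above
template = "This movie is about {}"


def build_nli_pairs(text, label, all_labels, n_contradictions=3, rng=random):
    """Return (premise, hypothesis, nli_label) triples: one entailment pair
    plus up to n_contradictions pairs built from randomly chosen wrong genres."""
    wrong = [x for x in all_labels if x != label]
    sampled = rng.sample(wrong, k=min(n_contradictions, len(wrong)))
    pairs = [(text, template.format(label), ENTAILMENT)]
    pairs += [(text, template.format(w), CONTRADICTION) for w in sampled]
    return pairs


genres = ["fantasy", "horror", "family", "scifi", "action"]
pairs = build_nli_pairs(
    "A ghost haunts a family home.", "horror", genres,
    n_contradictions=3, rng=random.Random(0),
)
for premise, hypothesis, nli_label in pairs:
    print(nli_label, "|", hypothesis)
```

One trade-off to keep in mind: with several contradictions per text the two NLI classes are no longer balanced, so accuracy on the pair-level validation set becomes harder to interpret.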

One can also play around with training arguments below, such as increasing epochs and decreasing learning rates for better results.

In [None]:
from transformers import Trainer, TrainingArguments, EvalPrediction
import numpy as np

def compute_metrics(p: EvalPrediction):
  metric_acc = load_metric("accuracy")
  metric_f1 = load_metric("f1")
  preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
  preds = np.argmax(preds, axis = 1)
  result = {}
  result["accuracy"] = metric_acc.compute(predictions = preds, references = p.label_ids)["accuracy"]
  result["f1"] = metric_f1.compute(predictions = preds, references = p.label_ids, average = 'macro')["f1"]
  return result

training_args = TrainingArguments(
  output_dir = 'bart_classifier',      # Output directory
  num_train_epochs=3,               # total number of training epochs
  per_device_train_batch_size=8,   # batch size per device during training
  per_device_eval_batch_size=8,    # batch size for evaluation
  weight_decay=0.01,                # strength of weight decay
  evaluation_strategy="epoch",      # evaluation is done at the end of each epoch
  load_best_model_at_end=True,      # load the best model when finished training (defaults to `False`)
  save_strategy='epoch',            # save the model at the end of each epoch
  metric_for_best_model='f1',       # metric to use to compare models
  greater_is_better=True            # whether a larger metric value is better
)

trainer = Trainer(
  model = model,                     # The instantiated model to be trained
  args = training_args,              # Training arguments, defined above
  compute_metrics = compute_metrics, # A function to compute the metrics
  train_dataset = train_dataset,     # Training dataset
  eval_dataset = test_dataset,       # Evaluation dataset
  tokenizer = tokenizer              # The tokenizer that was used
)


In [None]:
trainer.train()

You're using a BartTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.7144,0.693422,0.5,0.333333
2,0.7053,0.694,0.5,0.333333
3,0.6966,0.693183,0.5,0.333333


TrainOutput(global_step=3240, training_loss=0.7096578668665003, metrics={'train_runtime': 2935.8045, 'train_samples_per_second': 8.829, 'train_steps_per_second': 1.104, 'total_flos': 7042374048890880.0, 'train_loss': 0.7096578668665003, 'epoch': 3.0})
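The validation numbers in the training log can be checked by hand. The sketch below assumes a model that gives probability 0.5 to the true class and, for the metrics, always predicts the same NLI label; under those assumptions it reproduces the loss of about 0.693, the accuracy of 0.5, and the macro F1 of 0.333333. `macro_f1_constant` is a hypothetical helper written for this check, not part of any library.

```python
import math

# Cross-entropy of a model that puts probability 0.5 on the true class.
loss_at_chance = -math.log(0.5)
print(round(loss_at_chance, 4))  # 0.6931


def macro_f1_constant(y_true, constant):
    """Macro F1 when every prediction equals `constant`."""
    classes = sorted(set(y_true) | {constant})
    scores = []
    for c in classes:
        tp = sum(1 for y in y_true if y == c and constant == c)
        fp = sum(1 for y in y_true if y != c and constant == c)
        fn = sum(1 for y in y_true if y == c and constant != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# One entailment (2) and one contradiction (0) pair per validation text.
y_true = [2, 0] * 1080
accuracy = sum(1 for y in y_true if y == 0) / len(y_true)
macro_f1 = macro_f1_constant(y_true, 0)
print(accuracy, round(macro_f1, 6))  # 0.5 0.333333
```

Matching these values is consistent with a model that has collapsed to a single output, although the log alone does not prove it.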

It's important to save the model so it can be loaded later for estimating the accuracies.

In [None]:
trainer.save_model('/content/drive/MyDrive/Colab Notebooks/Bart_Classifier_Movies_Attempt1')

## Inference and Validation

In this section we will compare the fine-tuned model against the original pre-trained model.

In [None]:
import torch
torch.cuda.empty_cache()

Loading the saved model

In [None]:
from transformers import AutoModelForSequenceClassification

finetuned_model = AutoModelForSequenceClassification.from_pretrained('/content/drive/MyDrive/Colab Notebooks/Bart_Classifier_Movies_Attempt1')

Now in order to compare the original pre-trained model versus our trained model, I am creating two different classifiers which will be used to compare their accuracies.

In [None]:
from transformers import pipeline

classifier_original = pipeline("zero-shot-classification", model = "facebook/bart-large-mnli", device = 0, framework = 'pt')
classifier_finetuned = pipeline("zero-shot-classification", model = finetuned_model, tokenizer = "facebook/bart-large-mnli", device = 0, framework = 'pt')
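As I understand the zero-shot pipeline, with `multi_label=False` it scores each candidate label by running the text against one hypothesis per label, taking the entailment logit of each pair, and applying a softmax across the labels. The sketch below shows only that last step in plain Python; `zero_shot_scores` is a hypothetical helper and the logits are made-up numbers.

```python
import math


def zero_shot_scores(entailment_logits):
    """Softmax over per-label entailment logits, returned as (label, score)
    pairs sorted from most to least likely."""
    m = max(entailment_logits.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - m) for k, v in entailment_logits.items()}
    total = sum(exps.values())
    scored = [(k, e / total) for k, e in exps.items()]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)


# Hypothetical entailment logits for three candidate genres.
logits = {"horror": 2.1, "thriller": 1.4, "romance": -0.7}
ranked = zero_shot_scores(logits)
for label, score in ranked:
    print(f"{label}: {score:.3f}")
```

Note also that the pipeline builds its hypotheses from its own default template, "This example is {}.", unless a `hypothesis_template` argument is passed, so the template used at inference here differs from the 'This movie is about {}' template used during fine-tuning.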


Now we load the validation data set aside in the first split. This is data the trainer hasn't seen.

In [None]:
val_texts_label.head(1)

Unnamed: 0,id,movie_name,synopsis,genre,genre_id
7995,84186,Jailbreak Pact,This drama is inspired on a real event in 1990...,thriller,9


In [None]:
val_texts_label['genre'].value_counts()

fantasy      4909
adventure    4893
scifi        4869
crime        4866
romance      4862
family       4862
horror       4842
thriller     4841
action       4834
mystery      4822
Name: genre, dtype: int64

Just reducing the size of the validation data to manage the memory usage.

In [None]:
val_dataset_sample = val_texts_label.groupby('genre').apply(lambda x: x.sample(frac=0.05, random_state=42))
val_dataset_sample['genre'].value_counts()

adventure    245
fantasy      245
crime        243
family       243
romance      243
scifi        243
action       242
horror       242
thriller     242
mystery      241
Name: genre, dtype: int64

The list of labels remains the same.

In [None]:
unique_labels

['fantasy',
 'horror',
 'family',
 'scifi',
 'action',
 'crime',
 'adventure',
 'mystery',
 'romance',
 'thriller']

The code below predicts the labels with both models.

In [None]:
val_dataset_sample['concatenated_text'] = ""
val_dataset_sample['label_predicted_original'] = ""
val_dataset_sample['label_predicted_finetuned'] = ""
i = 0
for ind in val_dataset_sample.index:
  text = val_dataset_sample.loc[ind, 'synopsis'] + " " + val_dataset_sample.loc[ind, 'movie_name']
  # Assign with .loc; chained indexing (df[col][ind] = ...) raises SettingWithCopyWarning
  val_dataset_sample.loc[ind, 'concatenated_text'] = text
  output_original = classifier_original(text, unique_labels, multi_label=False)
  val_dataset_sample.loc[ind, 'label_predicted_original'] = output_original['labels'][0]
  output_finetuned = classifier_finetuned(text, unique_labels, multi_label=False)
  val_dataset_sample.loc[ind, 'label_predicted_finetuned'] = output_finetuned['labels'][0]
  i += 1
  # Print progress every 500 rows
  if i % 500 == 0:
    print(i)


500
1000
1500
2000


In [None]:
val_dataset_sample.head(5)

genre,,id,movie_name,synopsis,genre,genre_id,concatenated_text,label_predicted_original,label_predicted_finetuned
action,37441,7547,Men with Wings,Plot #1 is the love triangle between two guys ...,action,4,Plot #1 is the love triangle between two guys ...,romance,fantasy
action,37804,5630,Tadap,Unforeseen circumstances threaten the passiona...,action,4,Unforeseen circumstances threaten the passiona...,romance,fantasy
action,35566,904,Detrimental Decisions,"Matthew, a decent young man, is reluctantly su...",action,4,"Matthew, a decent young man, is reluctantly su...",action,fantasy
action,52512,8774,Robot Ninja,A scientist helps a comic-book artist to becom...,action,4,A scientist helps a comic-book artist to becom...,adventure,fantasy
action,33150,4546,Synthetic,"Set in the near future, a retired soldier's li...",action,4,"Set in the near future, a retired soldier's li...",scifi,fantasy


In [None]:
val_dataset_sample.to_csv('/content/drive/MyDrive/Colab Notebooks/Validation_Data.csv')

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(val_dataset_sample['genre'],val_dataset_sample['label_predicted_original'])

0.21778509674763277

In [None]:
accuracy_score(val_dataset_sample['genre'],val_dataset_sample['label_predicted_finetuned'])

0.10086455331412104

In [None]:
from sklearn.metrics import classification_report

# Rows are ordered by sorted label value, so no target_names are passed
# (the unsorted unique_labels list would mislabel the rows).
print(classification_report(val_dataset_sample['genre'],val_dataset_sample['label_predicted_finetuned'], digits=4))

              precision    recall  f1-score   support

      action     0.3333    0.0041    0.0082       242
   adventure     0.0000    0.0000    0.0000       245
       crime     0.0000    0.0000    0.0000       243
      family     0.0000    0.0000    0.0000       243
     fantasy     0.1006    0.9959    0.1828       245
      horror     0.0000    0.0000    0.0000       242
     mystery     0.0000    0.0000    0.0000       241
     romance     0.0000    0.0000    0.0000       243
       scifi     0.0000    0.0000    0.0000       243
    thriller     0.0000    0.0000    0.0000       242

    accuracy                         0.1009      2429
   macro avg     0.0434    0.1000    0.0191      2429
weighted avg     0.0434    0.1009    0.0192      2429




In [None]:
# Rows are ordered by sorted label value, so no target_names are passed
# (the unsorted unique_labels list would mislabel the rows).
print(classification_report(val_dataset_sample['genre'],val_dataset_sample['label_predicted_original'], digits=4))

              precision    recall  f1-score   support

      action     0.1337    0.5661    0.2163       242
   adventure     0.2438    0.2408    0.2423       245
       crime     0.3049    0.2058    0.2457       243
      family     0.2050    0.2346    0.2188       243
     fantasy     0.3636    0.0653    0.1107       245
      horror     0.3224    0.2025    0.2487       242
     mystery     0.2295    0.3361    0.2727       241
     romance     0.6216    0.1893    0.2902       243
       scifi     0.5476    0.0947    0.1614       243
    thriller     0.2000    0.0455    0.0741       242

    accuracy                         0.2178      2429
   macro avg     0.3172    0.2181    0.2081      2429
weighted avg     0.3174    0.2178    0.2080      2429
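A caveat when naming report rows with a custom list: when no `labels` argument is given, scikit-learn's `classification_report` orders rows by sorted label value, and `target_names` only renames the rows in that order. The quick check below, a sketch using just the label list from this post, shows that `unique_labels` is not in sorted order, so passing it as `target_names` would attach the wrong genre name to most rows.

```python
unique_labels = ['fantasy', 'horror', 'family', 'scifi', 'action',
                 'crime', 'adventure', 'mystery', 'romance', 'thriller']

# Row order the report uses when labels= is not passed.
sorted_labels = sorted(unique_labels)

# Pair each row's actual genre with the name target_names would print for it.
mismatches = [(actual, shown)
              for actual, shown in zip(sorted_labels, unique_labels)
              if actual != shown]

print(sorted_labels == unique_labels)  # False
print(len(mismatches))                 # 9
print(mismatches[:3])
```

Either omitting `target_names` or passing the same list as both `labels` and `target_names` keeps the row names aligned with the data.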



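Comparing the accuracies, the fine-tuned model clearly performs worse than the original model (about 10% versus 22%). The training log already hints at why: validation accuracy stayed at 0.5 and the loss stayed near ln 2 ≈ 0.693 for all three epochs, which suggests the model never learned to separate entailment from contradiction. To improve the accuracy, let's try the following changes to this attempt:

1. Increase the number of training and validation samples
2. Change the hypothesis string
3. Change the number of epochs

But more on that in the next attempt.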

