# Zero Shot Classification - BART - Fine Tuning

In this post we use the BART model and a movies dataset. We first fine-tune the BART model on a few classes in the movies data, and then use the fine-tuned model for zero-shot classification to predict classes on validation data that contains all the classes.

This is the second attempt at the same problem. In the first attempt we trained on a smaller set, which gave worse results than the original pre-trained model. In this attempt I increase the size of the training dataset.

## Importing Data

Our data is saved in Google Drive. Let's install the required packages, mount the drive and load the data.

In [None]:
!pip install transformers==4.28.0



In [None]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.14.5-py3-none-any.whl (519 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected packages: xxhash, dill, multiprocess, datasets
Successfully installed datasets-2.14.

In [None]:
from datasets import disable_caching
disable_caching()

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd

train_data = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Datasets/train.csv')
train_data.head()

Unnamed: 0,id,movie_name,synopsis,genre
0,44978,Super Me,A young scriptwriter starts bringing valuable ...,fantasy
1,50185,Entity Project,A director and her friends renting a haunted h...,horror
2,34131,Behavioral Family Therapy for Serious Psychiat...,This is an educational video for families and ...,family
3,78522,Blood Glacier,Scientists working in the Austrian Alps discov...,scifi
4,2206,Apat na anino,Buy Day - Four Men Widely - Apart in Life - By...,action


In [None]:
unique_labels = train_data['genre'].unique().tolist()
unique_labels

['fantasy',
 'horror',
 'family',
 'scifi',
 'action',
 'crime',
 'adventure',
 'mystery',
 'romance',
 'thriller']

In [None]:
id2label = {idx: label for idx, label in enumerate(unique_labels)}
id2label

{0: 'fantasy',
 1: 'horror',
 2: 'family',
 3: 'scifi',
 4: 'action',
 5: 'crime',
 6: 'adventure',
 7: 'mystery',
 8: 'romance',
 9: 'thriller'}

In [None]:
label2id = {label: idx for idx, label in enumerate(unique_labels)}
label2id

{'fantasy': 0,
 'horror': 1,
 'family': 2,
 'scifi': 3,
 'action': 4,
 'crime': 5,
 'adventure': 6,
 'mystery': 7,
 'romance': 8,
 'thriller': 9}

In [None]:
train_data['genre_id'] = train_data['genre'].map(label2id)
train_data['genre_id'].value_counts()

0    5400
1    5400
2    5400
3    5400
4    5400
5    5400
6    5400
7    5400
8    5400
9    5400
Name: genre_id, dtype: int64

Now let's split the data into train and validation sets. Since this is just an academic exercise, I reduce the train data significantly to limit the computing power needed.

In this attempt, I changed the split here so that the training set has 3 times as many records as in attempt 1.

In [None]:
#Split the data
from sklearn.model_selection import train_test_split

train_texts_label, val_texts_label = train_test_split(train_data, test_size=.5, random_state = 100)
train_texts_label.head(2)

Unnamed: 0,id,movie_name,synopsis,genre,genre_id
35781,63765,Barefoot,"The ""black sheep"" son of a wealthy family meet...",romance,8
8412,23741,A New York Heartbeat,"Spider, a young gang leader, gets in over his ...",crime,5


The next step would be to create a subset of the train dataset that keeps only 7 classes and holds the other 3 out of the fine-tuning process.

In [None]:
train_texts_label['genre_id'].value_counts()

9    2769
7    2756
4    2748
6    2722
1    2709
5    2694
0    2669
2    2647
8    2643
3    2643
Name: genre_id, dtype: int64

In this attempt I have used all available classes for training. However, to make this a truly zero-shot classification strategy, one should hold certain classes back from the training data.

In [None]:
#train_texts_label_filtered = train_texts_label[~train_texts_label['genre'].isin(['mystery','romance','crime'])]
train_texts_label_filtered = train_texts_label
train_texts_label_filtered['genre_id'].value_counts()

9    2769
7    2756
4    2748
6    2722
1    2709
5    2694
0    2669
2    2647
8    2643
3    2643
Name: genre_id, dtype: int64
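To hold classes back as suggested above, the training rows can be filtered by genre while the full label list is kept for inference. Below is a minimal sketch in plain Python. The rows are hypothetical stand-ins for the real DataFrame, and the held-out genres are the ones named in the commented-out line above.

```python
# Hypothetical rows standing in for train_texts_label (not real data).
rows = [
    {"movie_name": "Movie A", "genre": "horror"},
    {"movie_name": "Movie B", "genre": "romance"},
    {"movie_name": "Movie C", "genre": "scifi"},
    {"movie_name": "Movie D", "genre": "crime"},
]

all_labels = ["fantasy", "horror", "family", "scifi", "action",
              "crime", "adventure", "mystery", "romance", "thriller"]
held_out = {"mystery", "romance", "crime"}

# Fine-tune only on rows whose genre is not held out.
train_rows = [r for r in rows if r["genre"] not in held_out]
seen_labels = [g for g in all_labels if g not in held_out]

# At inference time the candidate labels still include the held-out genres.
candidate_labels = all_labels

print([r["genre"] for r in train_rows])
print(len(seen_labels), len(candidate_labels))
```

With this setup, accuracy on the three held-out genres would measure zero-shot behaviour, while accuracy on the other seven would measure the effect of fine-tuning.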

## Data Pre-Processing for Model Finetuning

We need a few more steps before we can train our model. The first is to prepare our training text, which in this case is the concatenation of the synopsis and the movie name. The synopsis in the data is limited to 1-2 lines per movie, which in my opinion may be too little to achieve higher accuracies.

In [None]:
concatenated_train_text = train_texts_label_filtered['synopsis'] + " " + train_texts_label_filtered['movie_name']
concatenated_train_text.head(1)

35781    The "black sheep" son of a wealthy family meet...
dtype: object

The split in this section is simply there so the model can be evaluated on a validation dataset.

In [None]:
#Split the data
from sklearn.model_selection import train_test_split

train_texts, val_texts, train_labels, val_labels = train_test_split(concatenated_train_text, train_texts_label_filtered['genre'], test_size=.2)


In [None]:
val_labels.value_counts()

thriller     584
adventure    576
fantasy      567
crime        548
family       542
mystery      542
horror       537
action       527
romance      503
scifi        474
Name: genre, dtype: int64

In [None]:
train_labels.value_counts()

action       2221
mystery      2214
thriller     2185
horror       2172
scifi        2169
crime        2146
adventure    2146
romance      2140
family       2105
fantasy      2102
Name: genre, dtype: int64

In this attempt, I have leveraged bart-large-mnli as my model. But one can also leverage other available zero-shot-classification models.

In [None]:
from transformers import BartTokenizerFast, BartForSequenceClassification

checkpoint = 'facebook/bart-large-mnli'
tokenizer = BartTokenizerFast.from_pretrained(checkpoint)
model = BartForSequenceClassification.from_pretrained(checkpoint)

Downloading (…)okenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.15k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

In order to pass the data to the model, we need to convert our pandas DataFrames to the Dataset format.

In [None]:
import torch
from datasets import Dataset, load_metric

train_data_1 = pd.DataFrame(data = {'text': train_texts, 'class': train_labels}, columns = ['text', 'class']).reset_index()
val_data_1 = pd.DataFrame(data = {'text': val_texts, 'class': val_labels}, columns = ['text', 'class']).reset_index()

train_ds = Dataset.from_pandas(train_data_1)
test_ds = Dataset.from_pandas(val_data_1)

train_ds

Dataset({
    features: ['index', 'text', 'class'],
    num_rows: 21600
})

This step is one of the crucial ones, as we create the premise and hypothesis for each text. Along with the correct hypothesis, I also randomly assign one of the other labels as a contradictory hypothesis for the same text. This can be extended to include more contradictory hypotheses per text for better training. In my other attempts on the same data, I will also try modifying the hypothesis template to compare the accuracy of the models.

In [None]:
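import random
template = 'This movie is about {}'
def create_input_sequence(sample):
  # With batched = True and batch_size = 1, each field is a list holding one item
  text = sample["text"]
  label = sample["class"][0]
  # Pick one wrong genre at random to build the contradictory hypothesis
  contradiction_label = random.choice([x for x in unique_labels if x != label])
  # text*2 repeats the premise so it is paired with both hypotheses
  encoded_sequence = tokenizer(text*2 , [template.format(label), template.format(contradiction_label)], truncation = True, padding = 'max_length', max_length = 128)
  # bart-large-mnli label ids: 2 = entailment, 0 = contradiction
  encoded_sequence["labels"] = [2,0]
  encoded_sequence["input_sentence"] = tokenizer.batch_decode(encoded_sequence.input_ids)
  return encoded_sequence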


train_dataset = train_ds.map(create_input_sequence, batched = True, batch_size = 1, remove_columns = ['index', 'text', 'class'])
test_dataset = test_ds.map(create_input_sequence, batched = True, batch_size = 1, remove_columns = ['index', 'text', 'class'])
train_dataset

Map:   0%|          | 0/21600 [00:00<?, ? examples/s]

Map:   0%|          | 0/5400 [00:00<?, ? examples/s]

Dataset({
    features: ['input_ids', 'attention_mask', 'labels', 'input_sentence'],
    num_rows: 43200
})
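The pair construction above uses one contradictory hypothesis per text. Below is a minimal sketch of the extension to several contradictory hypotheses, in plain Python without the tokenizer. `build_nli_pairs` and `n_contradictions` are hypothetical names, the example text is made up, and the label ids assume the bart-large-mnli convention (2 = entailment, 0 = contradiction).

```python
import random

unique_labels = ['fantasy', 'horror', 'family', 'scifi', 'action',
                 'crime', 'adventure', 'mystery', 'romance', 'thriller']
template = 'This movie is about {}'

def build_nli_pairs(text, label, n_contradictions=3, rng=random):
    """Return (premise, hypothesis, nli_label) triples for one text."""
    # Draw distinct wrong genres without replacement.
    wrong_labels = rng.sample([x for x in unique_labels if x != label], n_contradictions)
    pairs = [(text, template.format(label), 2)]                      # entailment
    pairs += [(text, template.format(w), 0) for w in wrong_labels]   # contradictions
    return pairs

rng = random.Random(0)  # seeded so the sketch is repeatable
pairs = build_nli_pairs("A detective investigates a cursed tomb.", "mystery",
                        n_contradictions=3, rng=rng)
for premise, hypothesis, nli_label in pairs:
    print(nli_label, hypothesis)
```

With more contradictions than entailments per text, the two NLI labels are no longer balanced, which may need to be taken into account when reading the training loss.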

In [None]:
train_dataset['input_sentence'][1]


'<s>The tomb of Tutankhamun is coming to a museum in town, and someone thinks there is a curse to it. The detectives are on it. Operasjon Mumie</s></s>This movie is about thriller</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>'

One can also experiment with the training arguments below, for example increasing the number of epochs or decreasing the learning rate, for better results.

In [None]:
from transformers import Trainer, TrainingArguments, EvalPrediction
import numpy as np

def compute_metrics(p: EvalPrediction):
  metric_acc = load_metric("accuracy")
  metric_f1 = load_metric("f1")
  preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
  preds = np.argmax(preds, axis = 1)
  result = {}
  result["accuracy"] = metric_acc.compute(predictions = preds, references = p.label_ids)["accuracy"]
  result["f1"] = metric_f1.compute(predictions = preds, references = p.label_ids, average = 'macro')["f1"]
  return result

training_args = TrainingArguments(
  output_dir = 'bart_classifier',      # Output directory
  num_train_epochs=3,               # total number of training epochs
  per_device_train_batch_size=8,   # batch size per device during training
  per_device_eval_batch_size=1,    # batch size for evaluation
  weight_decay=0.01,                # strength of weight decay
  evaluation_strategy="no",      # no evaluation is run during training
  load_best_model_at_end=True,      # load the best model when finished training (defaults to `False`); has no effect here, since nothing is evaluated or saved during training
  save_strategy='no',            # no checkpoints are saved during training
  metric_for_best_model='f1',       # metric to use to compare models
  greater_is_better=True            # whether a larger metric value is better
)

trainer = Trainer(
  model = model,                     # The instantiated model to be trained
  args = training_args,              # Training arguments, defined above
  compute_metrics = compute_metrics, # A function to compute the metrics
  train_dataset = train_dataset,     # Training dataset
  eval_dataset = test_dataset,       # Evaluation dataset
  tokenizer = tokenizer              # The tokenizer that was used
)


In [None]:
import torch
torch.cuda.empty_cache()


In [None]:
trainer.train()

You're using a BartTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
500,0.7465
1000,0.7102
1500,0.7101
2000,0.7359
2500,0.701
3000,0.704
3500,0.7033
4000,0.6997
4500,0.7016
5000,0.7029


TrainOutput(global_step=16200, training_loss=0.7006390248993297, metrics={'train_runtime': 4703.3845, 'train_samples_per_second': 27.555, 'train_steps_per_second': 3.444, 'total_flos': 3.52118702444544e+16, 'train_loss': 0.7006390248993297, 'epoch': 3.0})
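Two quick sanity checks on the training output above, as a rough sketch. The step count should follow from the dataset size and batch size (assuming a single device), and the loss can be compared with the cross-entropy of an even guess between the two labels used in training (entailment and contradiction), which is ln 2.

```python
import math

train_texts = 21600          # rows in train_ds
pairs_per_text = 2           # one entailment + one contradiction hypothesis
batch_size = 8               # per_device_train_batch_size, single device assumed
epochs = 3                   # num_train_epochs

train_pairs = train_texts * pairs_per_text
steps_per_epoch = math.ceil(train_pairs / batch_size)
total_steps = steps_per_epoch * epochs
print(train_pairs, steps_per_epoch, total_steps)

# Cross-entropy of a 50/50 guess between two labels.
chance_loss = math.log(2)
reported_loss = 0.7006390248993297  # training_loss from TrainOutput
print(round(chance_loss, 4), round(reported_loss - chance_loss, 4))
```

The computed step count matches `global_step=16200`. The reported loss sits within about 0.01 of ln 2, which may mean the model ended up near an even split between the two labels rather than learning to separate them. A lower learning rate is one thing to try.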

In [None]:
trainer.evaluate()

OutOfMemoryError: ignored

It's important to save the model so that it can be loaded later to estimate the accuracies.

In [None]:
trainer.save_model('/content/drive/MyDrive/Colab Notebooks/Bart_Classifier_Movies_Attempt2')

## Inference and Validation

In this section we compare the fine-tuned model against the pre-trained model.

In [None]:
import torch
torch.cuda.empty_cache()

Loading the saved model

In [None]:
from transformers import AutoModelForSequenceClassification

finetuned_model = AutoModelForSequenceClassification.from_pretrained('/content/drive/MyDrive/Colab Notebooks/Bart_Classifier_Movies_Attempt2')

To compare the original pre-trained model with our fine-tuned model, I create two classifiers and compare their accuracies.

In [None]:
from transformers import pipeline

classifier_original = pipeline("zero-shot-classification", model = "facebook/bart-large-mnli", device = 0, framework = 'pt')
classifier_finetuned = pipeline("zero-shot-classification", model = finetuned_model, tokenizer = "facebook/bart-large-mnli", device = 0, framework = 'pt')
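One detail worth checking when comparing the two classifiers: the zero-shot pipeline builds one hypothesis per candidate label from a template. Its default template is `This example is {}.`, which differs from the `This movie is about {}` template used for fine-tuning. Below is a minimal sketch of the hypotheses each template produces (plain Python, no model needed).

```python
default_template = "This example is {}."      # default in the zero-shot pipeline
training_template = "This movie is about {}"  # template used during fine-tuning

candidate_labels = ["fantasy", "horror", "family"]

# One hypothesis per candidate label, for each template.
default_hypotheses = [default_template.format(label) for label in candidate_labels]
training_hypotheses = [training_template.format(label) for label in candidate_labels]

for d, t in zip(default_hypotheses, training_hypotheses):
    print(f"{d!r:32} vs {t!r}")
```

To evaluate with the fine-tuning template, the pipeline call accepts a `hypothesis_template` argument, for example `classifier_finetuned(text, unique_labels, hypothesis_template=template, multi_label=False)`. Whether this changes the accuracies reported in this post has not been tested here.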


Now we load the validation data from the first split. This is data the trainer hasn't seen.

In [None]:
val_texts_label.head(1)

Unnamed: 0,id,movie_name,synopsis,genre,genre_id
7995,84186,Jailbreak Pact,This drama is inspired on a real event in 1990...,thriller,9


In [None]:
val_texts_label['genre'].value_counts()

romance      2757
scifi        2757
family       2753
fantasy      2731
crime        2706
horror       2691
adventure    2678
action       2652
mystery      2644
thriller     2631
Name: genre, dtype: int64

We sample 10% of each genre from the validation data to manage memory usage.

In [None]:
val_dataset_sample = val_texts_label.groupby('genre').apply(lambda x: x.sample(frac=0.1, random_state=42))
val_dataset_sample['genre'].value_counts()

romance      276
scifi        276
family       275
fantasy      273
crime        271
horror       269
adventure    268
action       265
mystery      264
thriller     263
Name: genre, dtype: int64

The list of labels remains the same.

In [None]:
unique_labels

['fantasy',
 'horror',
 'family',
 'scifi',
 'action',
 'crime',
 'adventure',
 'mystery',
 'romance',
 'thriller']

The code below predicts the labels with both models.

In [None]:
# Build the input text for every row in one step
val_dataset_sample['concatenated_text'] = val_dataset_sample['synopsis'] + " " + val_dataset_sample['movie_name']
# Collect predictions in lists and assign them as columns afterwards,
# which avoids chained assignment and the SettingWithCopyWarning it triggers
preds_original, preds_finetuned = [], []
for i, text in enumerate(val_dataset_sample['concatenated_text'], start=1):
  output_original = classifier_original(text, unique_labels, multi_label=False)
  preds_original.append(output_original['labels'][0])
  output_finetuned = classifier_finetuned(text, unique_labels, multi_label=False)
  preds_finetuned.append(output_finetuned['labels'][0])
  if i % 500 == 0:
    print(i)
val_dataset_sample['label_predicted_original'] = preds_original
val_dataset_sample['label_predicted_finetuned'] = preds_finetuned


500
1000
1500
2000
2500
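For reference, with `multi_label=False` the zero-shot pipeline takes the entailment logit for each candidate label and applies a softmax across the labels. The top-scoring label is the prediction. Below is a minimal sketch with made-up logits (the numbers are hypothetical, not model output).

```python
import math

candidate_labels = ["action", "crime", "thriller"]
# Hypothetical entailment logits, one per (text, hypothesis) pair.
entailment_logits = [1.2, 0.3, 2.1]

# Softmax across the candidate labels (shifted by the max for numerical stability).
m = max(entailment_logits)
exps = [math.exp(x - m) for x in entailment_logits]
total = sum(exps)
scores = [e / total for e in exps]

# Sort labels by score, highest first, as in output['labels'].
ranked = sorted(zip(candidate_labels, scores), key=lambda p: p[1], reverse=True)
predicted_label = ranked[0][0]
print(predicted_label, [round(s, 3) for _, s in ranked])
```

Because the scores are normalised across labels, a model whose entailment logits barely differ between hypotheses will produce near-uniform scores, and the top label is then decided by very small differences.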


In [None]:
val_dataset_sample.head(5)

genre (index),row (index),id,movie_name,synopsis,genre,genre_id,concatenated_text,label_predicted_original,label_predicted_finetuned
action,11482,2485,Confession,On the 25th anniversary of his sister's rape/m...,action,4,On the 25th anniversary of his sister's rape/m...,action,thriller
action,15302,6091,A.S.T.,"A brave, young operative for a secret, Black B...",action,4,"A brave, young operative for a secret, Black B...",action,thriller
action,51462,3857,Nadiya Kollappetta Rathri,Railway Anti-Criminal Task Force (RATs) invest...,action,4,Railway Anti-Criminal Task Force (RATs) invest...,crime,family
action,25155,6916,Hei ying di gu dao,Avenging a father's death and locating a treas...,action,4,Avenging a father's death and locating a treas...,action,thriller
action,41859,5199,Soldiers of Fortune,Wealthy thrill-seekers pay huge premiums to ha...,action,4,Wealthy thrill-seekers pay huge premiums to ha...,thriller,thriller


In [None]:
val_dataset_sample.to_csv('/content/drive/MyDrive/Colab Notebooks/Validation_Data.csv')

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(val_dataset_sample['genre'],val_dataset_sample['label_predicted_original'])

0.21333333333333335

In [None]:
accuracy_score(val_dataset_sample['genre'],val_dataset_sample['label_predicted_finetuned'])

0.09851851851851852
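To put the two accuracies in context, they can be converted back to counts on the 2,700 validation samples and compared with the roughly 1-in-10 rate expected from guessing uniformly among the ten genres on a near-balanced sample. A small sketch:

```python
n_samples = 2700   # size of val_dataset_sample
n_classes = 10     # number of genres

acc_original = 0.21333333333333335
acc_finetuned = 0.09851851851851852

# Convert accuracies back to the number of correct predictions.
correct_original = round(acc_original * n_samples)
correct_finetuned = round(acc_finetuned * n_samples)
chance_accuracy = 1 / n_classes

print(correct_original, correct_finetuned, chance_accuracy)
```

On these numbers the fine-tuned model is at about chance level, while the pre-trained model is roughly twice that.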

In [None]:
from sklearn.metrics import classification_report

# Rows are ordered by sorted label name, which is what the report uses when target_names is omitted
print(classification_report(val_dataset_sample['genre'],val_dataset_sample['label_predicted_finetuned'], digits=4))

              precision    recall  f1-score   support

      action     0.1159    0.0302    0.0479       265
   adventure     0.0903    0.0522    0.0662       268
       crime     0.1349    0.0627    0.0856       271
      family     0.0600    0.0109    0.0185       275
     fantasy     0.0189    0.0037    0.0061       273
      horror     0.0952    0.0149    0.0257       269
     mystery     0.1062    0.0909    0.0980       264
     romance     0.1190    0.1812    0.1437       276
       scifi     0.0894    0.0761    0.0822       276
    thriller     0.0937    0.4715    0.1563       263

    accuracy                         0.0985      2700
   macro avg     0.0924    0.0994    0.0730      2700
weighted avg     0.0922    0.0985    0.0728      2700



In [None]:
# Rows are ordered by sorted label name, which is what the report uses when target_names is omitted
print(classification_report(val_dataset_sample['genre'],val_dataset_sample['label_predicted_original'], digits=4))

              precision    recall  f1-score   support

      action     0.1308    0.5736    0.2130       265
   adventure     0.1794    0.2015    0.1898       268
       crime     0.3184    0.2362    0.2712       271
      family     0.2360    0.2764    0.2546       275
     fantasy     0.4146    0.0623    0.1083       273
      horror     0.3718    0.2156    0.2729       269
     mystery     0.2043    0.2500    0.2249       264
     romance     0.6250    0.1630    0.2586       276
       scifi     0.4902    0.0906    0.1529       276
    thriller     0.2676    0.0722    0.1138       263

    accuracy                         0.2133      2700
   macro avg     0.3238    0.2141    0.2060      2700
weighted avg     0.3257    0.2133    0.2062      2700
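A note on reading the reports: scikit-learn's `classification_report` orders its rows by sorted label value when no `labels` argument is given, so any `target_names` list has to follow that same order. Below is a minimal check for this label set, pairing the sorted labels with the per-genre sample counts printed earlier.

```python
unique_labels = ['fantasy', 'horror', 'family', 'scifi', 'action',
                 'crime', 'adventure', 'mystery', 'romance', 'thriller']

# Per-genre counts from val_dataset_sample['genre'].value_counts()
sample_counts = {'romance': 276, 'scifi': 276, 'family': 275, 'fantasy': 273,
                 'crime': 271, 'horror': 269, 'adventure': 268, 'action': 265,
                 'mystery': 264, 'thriller': 263}

# Row order used by the report: labels sorted alphabetically.
report_order = sorted(unique_labels)
for label in report_order:
    print(f"{label:>12} {sample_counts[label]:>6}")
```

The support column of a report should match these counts row by row (265, 268, 271, and so on), which is a quick way to confirm that the row names are attached to the right genres.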



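Compared with the results from Attempt 1, increasing the number of training samples does increase the accuracies. However, on this validation sample the fine-tuned model (9.85% accuracy) still trails the original pre-trained model (21.33%). To improve the accuracies further, one still needs to consider the following changes:

1. Increase the number of training and validation samples
2. Change the hypothesis string
3. Change the number of epochs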
