<a href="https://colab.research.google.com/github/sblaizerwize/LLMs-with-transformers/blob/main/Analyzing%20Movie%20Genre%20Predictions%20through%20the%20Lens%20of%20Hugging%20Face%20Transformers%20and%20a%20Training%20Loop%20Approach.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analyzing Movie Genre Predictions through the Lens of Hugging Face Transformers and a Training Loop Approach
This notebook proposes a solution to the [Movie Genre Prediction competition from Hugging Face](https://huggingface.co/spaces/competitions/movie-genre-prediction) using a BERT-based model (`bert-base-uncased`) to classify movie genres from their titles and synopses. It implements the training loop manually instead of using the Hugging Face `Trainer` API. This choice gives direct control over the fine-tuning phase, so that selected hyperparameters can be set and tuned by hand.

The fine-tuned model obtained the following competition scores:

*   Public Score:  0.4260611
*   Private Score: 0.4184444

The fine-tuning stage required 1.5 compute units on a T4 GPU and took 23 min on the provided movie dataset. The prediction stage then took approximately 2.5 min.

This notebook was inspired by [Anubhav's solution](https://anubhavmaity.github.io/myblog/posts/movie_genre_prediction_using_hf_transformer/) and the concepts acquired from the [Hugging Face NLP course](https://huggingface.co/learn/nlp-course/chapter1/1).


In [None]:
# View the infrastructure provided by Colab
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Thu Jan 11 11:08:08 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
# View the assigned RAM memory
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 13.6 gigabytes of available RAM

Not using a high-RAM runtime


In [None]:
# Install libraries and modules
!pip install evaluate datasets transformers[sentencepiece]
!pip install accelerate -U
!pip install huggingface_hub



In [None]:
# Check transformers and accelerate modules version
import transformers
import accelerate

transformers.__version__, accelerate.__version__

('4.35.2', '0.26.0')

In [None]:
# Log in to the Hugging Face Hub to access the movie dataset
from huggingface_hub import notebook_login
notebook_login()


In [None]:
# Import libraries and packages
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding
from datasets import load_dataset, Dataset
from collections import Counter
import evaluate

import numpy as np
import pandas as pd
from rich import print

---

# Loading Movie Datasets

In [None]:
# Load competition datasets
raw_datasets = load_dataset("datadrivenscience/movie-genre-prediction")
raw_datasets

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/1.19k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/7.16M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/4.74M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/54000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/36000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'movie_name', 'synopsis', 'genre'],
        num_rows: 54000
    })
    test: Dataset({
        features: ['id', 'movie_name', 'synopsis', 'genre'],
        num_rows: 36000
    })
})

In [None]:
# Explore train dataset
raw_datasets["train"].features

{'id': Value(dtype='int64', id=None),
 'movie_name': Value(dtype='string', id=None),
 'synopsis': Value(dtype='string', id=None),
 'genre': Value(dtype='string', id=None)}

In [None]:
# Explore train dataset
raw_datasets["train"][:5]

{'id': [44978, 50185, 34131, 78522, 2206],
 'movie_name': ['Super Me',
  'Entity Project',
  'Behavioral Family Therapy for Serious Psychiatric Disorders',
  'Blood Glacier',
  'Apat na anino'],
 'synopsis': ['A young scriptwriter starts bringing valuable objects back from his short nightmares of being chased by a demon. Selling them makes him rich.',
  'A director and her friends renting a haunted house to capture paranormal events in order to prove it and become popular.',
  'This is an educational video for families and family therapists that describes the Behavioral Family Therapy approach to dealing with serious psychiatric illnesses.',
  'Scientists working in the Austrian Alps discover that a glacier is leaking a liquid that appears to be affecting local wildlife.',
  'Buy Day - Four Men Widely - Apart in Life - By Night Shadows United in One Fight Venting the Fire of their Fury Against the Hated Oppressors.'],
 'genre': ['fantasy', 'horror', 'family', 'scifi', 'action']}

In [None]:
# Explore test dataset
raw_datasets["test"].features

{'id': Value(dtype='int64', id=None),
 'movie_name': Value(dtype='string', id=None),
 'synopsis': Value(dtype='string', id=None),
 'genre': Value(dtype='string', id=None)}

In [None]:
# Explore test dataset
raw_datasets["test"][:5]

{'id': [16863, 48456, 41383, 84007, 40269],
 'movie_name': ['A Death Sentence',
  'Intermedio',
  '30 Chua Phai Tet',
  'Paranoiac',
  'Ordinary Happiness'],
 'synopsis': ["12 y.o. Ida's dad'll die without a DKK1,500,000 operation. Ida plans to steal the money from the bank, her mom installed alarm systems in. She'll need her climbing skills, her 2 friends and 3 go-karts.",
  'A group of four teenage friends become trapped in a Mexican border tunnel where they fall prey, one-by one, to tortured ghosts who haunt it.',
  "A guy left his home for 12 years till he came back to claim what's his from his father, the vast Land, just to uncover that he had to live that day, year-end Lunar day, for another 12 years.",
  'A man long believed dead returns to the family estate to claim his inheritance.',
  'After a deadly accident, Paolo comes back on Earth just 92 minutes more, thanks to a calculation error made in a paradise office.'],
 'genre': ['action', 'action', 'action', 'action', 'action']}

## What are the existing movie genres?

In [None]:
# Identifying the existing genres in the train dataset
labels = set(raw_datasets["train"]["genre"])
num_labels = len(labels)
num_labels, labels

(10,
 {'action',
  'adventure',
  'crime',
  'family',
  'fantasy',
  'horror',
  'mystery',
  'romance',
  'scifi',
  'thriller'})

In [None]:
# Counting the number of movies per genre in the train dataset
labels_count = Counter(raw_datasets['train']['genre'])
print(labels_count)

In [None]:
# Counting the number of movies per genre in the test dataset
labels_count_test = Counter(raw_datasets['test']['genre'])
print(labels_count_test)

In [None]:
# Rename the "genre" column to "labels" in both splits and cast it to a ClassLabel type
raw_datasets = raw_datasets.rename_column('genre','labels')
raw_datasets = raw_datasets.class_encode_column('labels')
raw_datasets['train'].features

Casting to class labels:   0%|          | 0/54000 [00:00<?, ? examples/s]

Casting to class labels:   0%|          | 0/36000 [00:00<?, ? examples/s]

{'id': Value(dtype='int64', id=None),
 'movie_name': Value(dtype='string', id=None),
 'synopsis': Value(dtype='string', id=None),
 'labels': ClassLabel(names=['action', 'adventure', 'crime', 'family', 'fantasy', 'horror', 'mystery', 'romance', 'scifi', 'thriller'], id=None)}

**Answer: The train dataset contains 10 genres that appear to be evenly distributed across the dataset. Meanwhile, the test dataset contains only "action" as a dummy label prior to inference.**

## Removing Duplicated Items

In [None]:
# Convert Datasets into Dataframes
raw_datasets.set_format('pandas')

In [None]:
# Convert Datasets into Dataframes
train_dataset = raw_datasets['train'][:]
train_dataset.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54000 entries, 0 to 53999
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          54000 non-null  int64 
 1   movie_name  54000 non-null  object
 2   synopsis    54000 non-null  object
 3   labels      54000 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 15.4 MB


In [None]:
train_dataset.head(3)

Unnamed: 0,id,movie_name,synopsis,labels
0,44978,Super Me,A young scriptwriter starts bringing valuable ...,4
1,50185,Entity Project,A director and her friends renting a haunted h...,5
2,34131,Behavioral Family Therapy for Serious Psychiat...,This is an educational video for families and ...,3


In [None]:
# Drop duplicates from the train dataframe
train_dataset = train_dataset.drop_duplicates(['movie_name', 'synopsis'])
train_dataset.info(memory_usage = 'deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46344 entries, 0 to 53998
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          46344 non-null  int64 
 1   movie_name  46344 non-null  object
 2   synopsis    46344 non-null  object
 3   labels      46344 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 13.8 MB


**Answer: The train dataset contained 7,656 duplicates.**
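
To make the deduplication rule concrete, here is a minimal pure-Python sketch of what `drop_duplicates(['movie_name', 'synopsis'])` does with its default settings: it keeps the first row for each (title, synopsis) pair and drops later repeats. The `rows` list and the `drop_duplicate_rows` helper are hypothetical and only serve as an illustration; they are not part of pandas.

```python
def drop_duplicate_rows(rows, keys):
    """Keep the first row for each distinct combination of `keys`."""
    seen = set()
    unique = []
    for row in rows:
        signature = tuple(row[k] for k in keys)
        if signature not in seen:
            seen.add(signature)
            unique.append(row)
    return unique


# Hypothetical toy rows (not taken from the competition data)
rows = [
    {"id": 1, "movie_name": "Alpha", "synopsis": "A heist goes wrong."},
    {"id": 2, "movie_name": "Beta", "synopsis": "A ghost story."},
    {"id": 3, "movie_name": "Alpha", "synopsis": "A heist goes wrong."},
]

unique_rows = drop_duplicate_rows(rows, ["movie_name", "synopsis"])
removed = len(rows) - len(unique_rows)
print(removed)  # 1
```

The same subtraction on the two `info()` outputs above gives 54,000 - 46,344 = 7,656 removed rows.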

## Analyzing text movie titles and their synopses
This section analyzes the length of movie titles and their synopses.

In [None]:
train_dataset.head(3)

Unnamed: 0,id,movie_name,synopsis,labels
0,44978,Super Me,A young scriptwriter starts bringing valuable ...,4
1,50185,Entity Project,A director and her friends renting a haunted h...,5
2,34131,Behavioral Family Therapy for Serious Psychiat...,This is an educational video for families and ...,3


In [None]:
# Create a new column "synopsis_len" that contains the synopsis length
train_dataset['synopsis_len'] = train_dataset['synopsis'].apply(lambda x: len(x))
train_dataset.head(3)

Unnamed: 0,id,movie_name,synopsis,labels,synopsis_len
0,44978,Super Me,A young scriptwriter starts bringing valuable ...,4,141
1,50185,Entity Project,A director and her friends renting a haunted h...,5,120
2,34131,Behavioral Family Therapy for Serious Psychiat...,This is an educational video for families and ...,3,164


In [None]:
# Create a new column "movie_name_len" that contains the length of movie_name
train_dataset['movie_name_len'] = train_dataset['movie_name'].apply(lambda x: len(x))
train_dataset.head(3)

Unnamed: 0,id,movie_name,synopsis,labels,synopsis_len,movie_name_len
0,44978,Super Me,A young scriptwriter starts bringing valuable ...,4,141,8
1,50185,Entity Project,A director and her friends renting a haunted h...,5,120,14
2,34131,Behavioral Family Therapy for Serious Psychiat...,This is an educational video for families and ...,3,164,59


In [None]:
# Order train dataframe by synopsis_len
train_dataset.sort_values(by='synopsis_len', ascending=False)

Unnamed: 0,id,movie_name,synopsis,labels,synopsis_len,movie_name_len
52518,46444,Final Destination,Alex Browning is among a group of high school ...,5,400,17
49498,1468,Bhargava Ramudu,"Bhargava, an efficient, yet jobless young man ...",0,395,15
29141,71309,Krishnatulasi,Krishna is a blind young man who works as a gu...,7,381,13
50834,44856,The Sex Cycle,The Cocoa Poodle bar is the central meeting pl...,4,377,13
53370,4779,Uro,Turning his back on a delinquent past and join...,0,370,3
...,...,...,...,...,...,...
6891,71298,Qismat 2,Fortune 2.,7,10,8
38284,5454,Rader,Invasion.,0,9,5
3301,15654,Adventure Night,TBD,1,3,15
34698,42213,Dark Army,NA.,4,3,9


In [None]:
import plotly.figure_factory as ff
fig = ff.create_distplot([train_dataset['synopsis_len']], ['length'], colors=['#2ca02c'])
fig.update_layout(title_text='Character Count Distribution of Movie Synopses')
fig.show()

In [None]:
fig2 = ff.create_distplot([train_dataset['movie_name_len']], ['length'], colors=['#ffa408'])
fig2.update_layout(title_text='Character Count Distribution of Movie Titles')
fig2.show()

In [None]:
train_dataset['movie_name_len'].max(), train_dataset['synopsis_len'].max()

(180, 400)

**The average movie title is about 12 characters long. For the synopsis, we see two peaks around 145 and 230 characters. The maximum length is 180 characters for the movie name and 400 characters for the synopsis. Note that the `bert-base-uncased` limit of 512 is measured in tokens, not characters. Because a WordPiece token usually covers several characters, inputs of this size should rarely, if ever, reach that limit, and the `tokenize` function defined below applies `truncation=True` as a safeguard.**
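
The model's 512 limit is counted in tokens (the tokenizer reports `model_max_length=512`), whereas `len()` on a string counts characters. The sketch below compares the three units for the first training example. The `input_ids` list is copied from the sample tokenization output shown later in this notebook, so no tokenizer download is needed; the variable names are only illustrative.

```python
title = "Super Me"
synopsis = (
    "A young scriptwriter starts bringing valuable objects back from "
    "his short nightmares of being chased by a demon. Selling them "
    "makes him rich."
)

# Token ids for (title, synopsis), copied from the sample tokenization output
input_ids = [
    101, 3565, 2033, 102, 1037, 2402, 5896, 15994, 4627, 5026, 7070,
    5200, 2067, 2013, 2010, 2460, 15446, 1997, 2108, 13303, 2011, 1037,
    5698, 1012, 4855, 2068, 3084, 2032, 4138, 1012, 102,
]

char_len = len(title) + len(synopsis)                   # characters
word_len = len(title.split()) + len(synopsis.split())   # whitespace-separated words
token_len = len(input_ids)                              # tokens, including [CLS] and [SEP]

print(char_len, word_len, token_len)  # 149 25 31
```

For this example, 149 characters become 31 tokens, which is far below the 512-token limit.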

# Tokenization

In [None]:
# Convert the train_dataset Dataframe to DataSet format again
train_ds = Dataset.from_pandas(train_dataset)
train_ds.features

{'id': Value(dtype='int64', id=None),
 'movie_name': Value(dtype='string', id=None),
 'synopsis': Value(dtype='string', id=None),
 'labels': Value(dtype='int64', id=None),
 'synopsis_len': Value(dtype='int64', id=None),
 'movie_name_len': Value(dtype='int64', id=None),
 '__index_level_0__': Value(dtype='int64', id=None)}

In [None]:
# Turn "labels" column into ClassLabel type
train_ds = train_ds.class_encode_column('labels')
train_ds.features

Stringifying the column:   0%|          | 0/46344 [00:00<?, ? examples/s]

Casting to class labels:   0%|          | 0/46344 [00:00<?, ? examples/s]

{'id': Value(dtype='int64', id=None),
 'movie_name': Value(dtype='string', id=None),
 'synopsis': Value(dtype='string', id=None),
 'labels': ClassLabel(names=['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], id=None),
 'synopsis_len': Value(dtype='int64', id=None),
 'movie_name_len': Value(dtype='int64', id=None),
 '__index_level_0__': Value(dtype='int64', id=None)}

In [None]:
# Create the tokenizer
# Other checkpoints to try: bert-large-uncased, bert-large-uncased-whole-word-masking
checkpoint = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer

tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

BertTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [None]:
# Do a sample tokenization
sample_tokenized = tokenizer(train_ds['movie_name'][0], train_ds['synopsis'][0])
tokenizer.decode(sample_tokenized['input_ids'])

'[CLS] super me [SEP] a young scriptwriter starts bringing valuable objects back from his short nightmares of being chased by a demon. selling them makes him rich. [SEP]'

In [None]:
sample_tokenized

{'input_ids': [101, 3565, 2033, 102, 1037, 2402, 5896, 15994, 4627, 5026, 7070, 5200, 2067, 2013, 2010, 2460, 15446, 1997, 2108, 13303, 2011, 1037, 5698, 1012, 4855, 2068, 3084, 2032, 4138, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
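
The `token_type_ids` above mark which segment each token belongs to: 0 for the title segment (through the first `[SEP]`) and 1 for the synopsis segment. As a rough sketch of that rule, and not the tokenizer's actual implementation, the hypothetical `segment_ids` helper below rebuilds them from the `input_ids`, assuming `[SEP]` has id 102 as listed in the tokenizer output.

```python
SEP_ID = 102  # id of [SEP] in the bert-base-uncased vocabulary (see tokenizer output)


def segment_ids(input_ids, sep_id=SEP_ID):
    """Return 0 for tokens up to and including the first [SEP], 1 afterwards."""
    ids = []
    segment = 0
    for token_id in input_ids:
        ids.append(segment)
        if token_id == sep_id and segment == 0:
            segment = 1  # everything after the first [SEP] is the second segment
    return ids


# input_ids copied from the sample tokenization above
input_ids = [
    101, 3565, 2033, 102, 1037, 2402, 5896, 15994, 4627, 5026, 7070,
    5200, 2067, 2013, 2010, 2460, 15446, 1997, 2108, 13303, 2011, 1037,
    5698, 1012, 4855, 2068, 3084, 2032, 4138, 1012, 102,
]

rebuilt = segment_ids(input_ids)
print(rebuilt[:6])  # [0, 0, 0, 0, 1, 1]
```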

In [None]:
# Split train_ds into training and validation sets (the validation split is stored under the 'test' key)
train_ds = train_ds.train_test_split(test_size=0.2, stratify_by_column="labels")
train_ds

DatasetDict({
    train: Dataset({
        features: ['id', 'movie_name', 'synopsis', 'labels', 'synopsis_len', 'movie_name_len', '__index_level_0__'],
        num_rows: 37075
    })
    test: Dataset({
        features: ['id', 'movie_name', 'synopsis', 'labels', 'synopsis_len', 'movie_name_len', '__index_level_0__'],
        num_rows: 9269
    })
})

In [None]:
# Define a tokenize function
def tokenize(ds):
  return tokenizer(ds['movie_name'], ds['synopsis'], truncation=True)

In [None]:
# Tokenize train_ds
tokenized_datasets = train_ds.map(tokenize, batched=True)
tokenized_datasets

Map:   0%|          | 0/37075 [00:00<?, ? examples/s]

Map:   0%|          | 0/9269 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'movie_name', 'synopsis', 'labels', 'synopsis_len', 'movie_name_len', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 37075
    })
    test: Dataset({
        features: ['id', 'movie_name', 'synopsis', 'labels', 'synopsis_len', 'movie_name_len', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 9269
    })
})

In [None]:
# Decode the last training example to verify tokenization
tokenizer.decode(tokenized_datasets['train'][37074]['input_ids'])

'[CLS] begum [SEP] a sheltered beauty, begum, is introduced to the enchanting world of bollywood by the enigmatic madan where she discovers true freedom and love come at the price of her passion and life. [SEP]'

# Preparing data for the training stage

In [None]:
# Removing columns the model doesn't expect
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'movie_name', 'synopsis', 'synopsis_len', 'movie_name_len','__index_level_0__'])
tokenized_datasets['train'].column_names

['labels', 'input_ids', 'token_type_ids', 'attention_mask']

In [None]:
# Setting the datasets format so that they return PyTorch tensors
tokenized_datasets.set_format("torch")

In [None]:
# Define a data_collator function for dynamic padding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [None]:
# Defining DataLoaders
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets['train'], shuffle=True, batch_size=32, collate_fn=data_collator
)

eval_dataloader = DataLoader(
    tokenized_datasets['test'], batch_size=64, collate_fn=data_collator
)

In [None]:
# Inspecting a batch from train_dataloader
for batch in train_dataloader:
  break
{k:v.shape for k,v in batch.items()}

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'labels': torch.Size([32]),
 'input_ids': torch.Size([32, 65]),
 'token_type_ids': torch.Size([32, 65]),
 'attention_mask': torch.Size([32, 65])}

In [None]:
# Inspecting a batch from train_dataloader
batch.input_ids

tensor([[  101,  3019,  5320,  ...,     0,     0,     0],
        [  101,  1051, 10381,  ...,     0,     0,     0],
        [  101, 13970, 13278,  ...,     0,     0,     0],
        ...,
        [  101, 14477,  9587,  ...,     0,     0,     0],
        [  101,  1037,  3543,  ...,     0,     0,     0],
        [  101, 15274,  1004,  ...,     0,     0,     0]])
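
The trailing zeros in the batch above come from dynamic padding: each batch is padded only up to its own longest sequence (65 tokens in the inspected batch) rather than to a fixed global length. The sketch below illustrates the idea in plain Python. The `pad_batch` helper and the two short sequences are hypothetical, and the real `DataCollatorWithPadding` also handles tensors and other fields.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences
    ]
    return input_ids, attention_mask


# Hypothetical short sequences of token ids
sequences = [[101, 3565, 2033, 102], [101, 1037, 102]]
padded_ids, mask = pad_batch(sequences)
print(padded_ids)  # [[101, 3565, 2033, 102], [101, 1037, 102, 0]]
print(mask)        # [[1, 1, 1, 1], [1, 1, 1, 0]]
```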

# Step-by-step setting of the training stage

## Model Instantiation

In [None]:
# Instantiate a new model
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_labels)

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Passing a single batch to our model to check that everything is OK
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)

**Note: When labels are provided, HF transformers returns both the loss and the logits. The logits have shape `(batch_size, num_labels)`, so here there are 10 per input, one for each genre.**
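
To give a feel for the loss value, here is a minimal sketch that assumes the standard cross-entropy loss commonly used for single-label classification. The `cross_entropy` helper is hypothetical and works on one example. With 10 classes and uninformative (equal) logits, the loss is ln(10), about 2.30, which is roughly what a freshly initialized classification head is expected to produce.

```python
import math


def cross_entropy(logits, label):
    """Negative log-probability of `label` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


num_labels = 10
uniform_logits = [0.0] * num_labels  # a model with no preference between genres
loss = cross_entropy(uniform_logits, label=3)
print(round(loss, 4))  # 2.3026
```

A loss far above this value on the first batch would suggest a problem with the labels or the inputs.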

In [None]:
outputs.logits

tensor([[-4.0785e-01, -5.9843e-01,  2.3831e-01, -3.6281e-01,  7.7003e-02,
         -7.9042e-01,  3.7949e-01,  6.3509e-02,  2.5117e-01, -4.0870e-01],
        [-4.7463e-01,  8.1971e-01, -5.7489e-01,  2.9883e-01, -5.2216e-03,
          2.9287e-01,  1.3395e-01, -1.2713e-01, -6.2679e-03,  2.6193e-01],
        [-3.2331e-01, -7.0246e-01,  3.2733e-01, -4.3661e-01,  1.2958e-02,
         -9.1787e-01,  4.7303e-01,  3.7585e-02,  4.2619e-01, -4.0903e-01],
        [-3.8950e-01, -5.9636e-01,  1.9592e-01, -3.3946e-01,  6.1083e-02,
         -7.2596e-01,  4.0944e-01,  6.8973e-02,  2.9999e-01, -4.2091e-01],
        [-3.8568e-01, -5.6021e-01,  1.7959e-01, -3.0257e-01,  6.5529e-02,
         -7.6407e-01,  3.8535e-01,  2.6806e-02,  2.6749e-01, -4.2585e-01],
        [-3.6171e-01, -6.5866e-01,  3.1120e-01, -3.8170e-01,  5.1930e-02,
         -8.8321e-01,  4.5273e-01,  5.1025e-02,  3.6372e-01, -4.0050e-01],
        [-4.2266e-01, -4.7145e-01,  9.5230e-02, -2.3389e-01,  1.3982e-01,
         -6.2665e-01,  3.7735e-0

In [None]:
# Instantiate a new model
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_labels)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Setting an optimizer and accelerator

In [None]:
# Setting an accelerator and optimizer
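from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch implementation
from accelerate import Accelerator

accelerator = Accelerator()
# weight_decay=0.0 matches the default of the deprecated transformers.AdamW
optimizer = AdamW(model.parameters(), lr=1e-5, weight_decay=0.0)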





In [None]:
# Prepare data for accelerator
train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)

## Setting a scheduler

In [None]:
# Setting a learning rate scheduler
from transformers import get_scheduler

num_epochs = 2
num_training_steps = num_epochs * len(train_dl)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)
print(num_training_steps)
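
The `"linear"` schedule with no warmup decays the learning rate from its initial value to zero over `num_training_steps`. The sketch below reproduces that rule and the step count in plain Python. The `linear_lr` helper is a hypothetical re-implementation for illustration, and the step count assumes the 37,075 training rows and batch size of 32 used in this notebook, with the last partial batch kept.

```python
import math


def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Linear warmup followed by linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)


train_rows, batch_size, num_epochs = 37075, 32, 2
steps_per_epoch = math.ceil(train_rows / batch_size)   # batches per epoch
num_training_steps = num_epochs * steps_per_epoch

print(steps_per_epoch, num_training_steps)  # 1159 2318
for step in (0, steps_per_epoch, num_training_steps):
    print(step, linear_lr(step, 1e-5, num_training_steps))
```

With `num_epochs = 3`, the same arithmetic gives 3,477 steps, which matches the progress bars shown in this notebook.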

## Verifying infrastructure settings

In [None]:
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device

device(type='cpu')

## Setting a progress bar to track the training stage

In [None]:
# Add a progress bar
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
  for batch in train_dataloader:
    batch = {k:v.to(device) for k,v in batch.items()}
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()

    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()
    progress_bar.update(1)

  0%|          | 0/3477 [00:00<?, ?it/s]

## Setting the evaluation stage

In [None]:
metric = evaluate.load("accuracy")
model.eval()

for batch in eval_dataloader:
  batch = {k: v.to(device) for k,v in batch.items()}
  with torch.no_grad():
    outputs = model(**batch)

  logits = outputs.logits
  predictions = torch.argmax(logits, dim=-1)
  metric.add_batch(predictions=predictions, references=batch['labels'])

metric.compute()

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

{'accuracy': 0.4532312007767828}
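
Accuracy is the fraction of examples whose highest-scoring class equals the reference label. The following sketch mirrors the `argmax` plus comparison logic of the evaluation loop in plain Python; the helpers and the toy logits are hypothetical.

```python
def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def accuracy(logits_batch, labels):
    """Share of rows whose argmax matches the reference label."""
    predictions = [argmax(row) for row in logits_batch]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical logits for 4 examples and 3 classes
logits_batch = [
    [0.1, 2.0, -1.0],
    [1.5, 0.2, 0.3],
    [0.0, 0.1, 0.9],
    [0.4, 0.3, 0.2],
]
labels = [1, 0, 1, 0]
score = accuracy(logits_batch, labels)
print(score)  # 0.75
```

For reference, if the 10 genres are roughly balanced, random guessing would score about 0.10, so an accuracy near 0.45 is well above chance.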

# Full training loop with accelerate



In [None]:
from accelerate import Accelerator
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch implementation
from transformers import AutoModelForSequenceClassification, get_scheduler

In [None]:
# Model instantiation
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_labels)

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
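# Setting an Optimizer, Accelerator, and Scheduler
# weight_decay=0.0 is set explicitly so the optimizer behaves the same with either AdamW implementation
optimizer = AdamW(model.parameters(), lr=1e-5, weight_decay=0.0)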
accelerator = Accelerator()

train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)

num_epochs = 3
num_training_steps = num_epochs * len(train_dl)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)





In [None]:
# Verifying infrastructure settings
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device

device(type='cuda')

In [None]:
# Model training and evaluation
from tqdm.auto import tqdm

metric = evaluate.load("accuracy")
progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_epochs):
  # Training
  model.train()
  for batch in train_dl:
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)

    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()
    progress_bar.update(1)

  # Evaluation
  model.eval()
  for batch in eval_dl:
    with torch.no_grad():
      outputs = model(**batch)

    predictions = outputs.logits.argmax(dim=-1)
    labels = batch["labels"]

    predictions_gathered = accelerator.gather(predictions)
    labels_gathered = accelerator.gather(labels)
    metric.add_batch(predictions=predictions_gathered, references=labels_gathered)

  results = metric.compute()
  print(f"epoch {epoch}: {results['accuracy']}")

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

  0%|          | 0/3477 [00:00<?, ?it/s]

# Preparing results for submitting to the competition

Now that we have fine-tuned our classification model, it's time to run it on the competition test dataset. Since we aren't using the Trainer API, the test dataset must be preprocessed manually, following the same steps as for the training data.

In [None]:
# Inspect the test dataset
raw_datasets['test'].features

{'id': Value(dtype='int64', id=None),
 'movie_name': Value(dtype='string', id=None),
 'synopsis': Value(dtype='string', id=None),
 'labels': ClassLabel(names=['action'], id=None)}

In [None]:
# Convert the test dataset to a dataframe
test_raw_dataset = raw_datasets['test'][:]
test_raw_dataset.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36000 entries, 0 to 35999
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          36000 non-null  int64 
 1   movie_name  36000 non-null  object
 2   synopsis    36000 non-null  object
 3   labels      36000 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 10.3 MB


In [None]:
# Turn the test dataframe into a Dataset format again
test_ds = Dataset.from_pandas(test_raw_dataset)
test_ds.features

{'id': Value(dtype='int64', id=None),
 'movie_name': Value(dtype='string', id=None),
 'synopsis': Value(dtype='string', id=None),
 'labels': Value(dtype='int64', id=None)}

In [None]:
# Turn the "labels" column into a ClassLabel type
test_ds = test_ds.class_encode_column('labels')
test_ds.features

Stringifying the column:   0%|          | 0/36000 [00:00<?, ? examples/s]

Casting to class labels:   0%|          | 0/36000 [00:00<?, ? examples/s]

{'id': Value(dtype='int64', id=None),
 'movie_name': Value(dtype='string', id=None),
 'synopsis': Value(dtype='string', id=None),
 'labels': ClassLabel(names=['0'], id=None)}

In [None]:
# Tokenize the test dataset
tokenized_test_ds = test_ds.map(tokenize, batched=True)

Map:   0%|          | 0/36000 [00:00<?, ? examples/s]

In [None]:
# Inspect the content of the test dataset
tokenized_test_ds.column_names

['id',
 'movie_name',
 'synopsis',
 'labels',
 'input_ids',
 'token_type_ids',
 'attention_mask']

In [None]:
# Create a copy of the original test dataset
from copy import deepcopy
tokenized_test_ds_copy = deepcopy(tokenized_test_ds)

In [None]:
# Remove columns the model doesn't expect
tokenized_test_ds_copy = tokenized_test_ds_copy.remove_columns(['id', 'movie_name', 'synopsis'])
tokenized_test_ds_copy.column_names

['labels', 'input_ids', 'token_type_ids', 'attention_mask']

In [None]:
# Define a DataLoader for the test dataset
from torch.utils.data import DataLoader
test_dataloader = DataLoader(
    tokenized_test_ds_copy, batch_size=64, collate_fn=data_collator
)

In [None]:
# Inspect a batch from test_dataloader
for batch in test_dataloader:
  break
{k:v.shape for k,v in batch.items()}


{'labels': torch.Size([64]),
 'input_ids': torch.Size([64, 73]),
 'token_type_ids': torch.Size([64, 73]),
 'attention_mask': torch.Size([64, 73])}

In [None]:
# Specify a device type since we aren't using accelerate for the prediction stage
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device

device(type='cuda')

In [None]:
# Run the model to get predictions
num_eval_steps = len(test_dataloader)
progress_bar = tqdm(range(num_eval_steps))

predictions = []
model.eval()
for batch in test_dataloader:
  batch = {k:v.to(device) for k,v in batch.items()}
  with torch.no_grad():
    outputs = model(**batch)

  batch_predictions = outputs.logits.argmax(dim=-1).tolist()
  predictions.extend(batch_predictions)
  progress_bar.update(1)

  0%|          | 0/563 [00:00<?, ?it/s]

In [None]:
# Display some predictions
print(predictions[:20])

In [None]:
# Convert predictions to their string representations based on the mapping defined in the 'labels' feature.
predicted_genre = raw_datasets['train'].features['labels'].int2str(predictions)
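
Conceptually, `int2str` looks up each predicted class id in the ordered list of genre names stored in the `ClassLabel` feature. The sketch below shows that lookup in plain Python, using the name order printed earlier for the train split. The `ids_to_genres` helper and the sample ids are hypothetical.

```python
# Genre names in the order reported by the ClassLabel feature of the train split
GENRE_NAMES = [
    "action", "adventure", "crime", "family", "fantasy",
    "horror", "mystery", "romance", "scifi", "thriller",
]


def ids_to_genres(ids, names=GENRE_NAMES):
    """Map integer class ids to their genre names."""
    return [names[i] for i in ids]


sample_ids = [3, 5, 4]  # hypothetical predictions
print(ids_to_genres(sample_ids))  # ['family', 'horror', 'fantasy']
```

This lookup only gives correct genres if the model was trained with the same id-to-name order, which holds here because both encodings sort the 10 classes identically.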

In [None]:
# Create a dataframe specifying movie id and genre
df = pd.DataFrame({'id':tokenized_test_ds['id'], 'genre':predicted_genre})
df.head(3)

Unnamed: 0,id,genre
0,16863,family
1,48456,horror
2,41383,fantasy


In [None]:
# Save results to a CSV file without the DataFrame index, keeping only the id and genre columns
df.to_csv('submission.csv', index=False)