# Fine-tuning for Video Classification with 🤗 Transformers
### Abstract
We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers. In order to handle the long sequences of tokens encountered in video, we propose several, efficient variants of our model which factorise the spatial- and temporal-dimensions of the input. Although transformer-based models are known to only be effective when large training datasets are available, we show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple video classification benchmarks including Kinetics 400 and 600, Epic Kitchens, Something-Something v2 and Moments in Time, outperforming prior methods based on deep 3D convolutional networks. To facilitate further research, we release code at https://github.com/google-research/scenic/tree/main/scenic/projects/vivit

https://arxiv.org/pdf/2103.15691

![image.png](vivit.png)


## Embeddings
### Uniform frame sampling 
A straightforward method of tokenising the input video is to uniformly sample $n_t$ frames from the input video clip, embed each 2D frame independently using the same method as ViT, and concatenate all these tokens together. Concretely, if $n_h \cdot n_w$ non-overlapping image patches are extracted from each frame, then a total of $n_t \cdot n_h \cdot n_w$ tokens will be forwarded through the transformer encoder. Intuitively, this process may be seen as simply constructing a large 2D image to be tokenised following ViT.

### Tubelet embedding
An alternate method is to extract non-overlapping, spatio-temporal “tubes” from the input volume, and to linearly project each of them to $\mathbb{R}^d$. This method is an extension of ViT’s embedding to 3D, and corresponds to a 3D convolution.
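
The tubelet idea can be sketched in NumPy. This is only an illustration under stated assumptions: a 10-frame 224×224 RGB clip, a tubelet of 2×16×16 (the sizes suggested by the `16x2` checkpoint name used later), and a hidden size of 768. `extract_tubelets` is a hypothetical helper, and the random projection matrix stands in for learned weights.

```python
import numpy as np


def extract_tubelets(video, t=2, h=16, w=16):
    """Split a (T, H, W, C) clip into non-overlapping t x h x w tubes, one flat row per tube."""
    T, H, W, C = video.shape
    assert T % t == 0 and H % h == 0 and W % w == 0, "clip must tile evenly"
    n_t, n_h, n_w = T // t, H // h, W // w
    x = video.reshape(n_t, t, n_h, h, n_w, w, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)  # -> (n_t, n_h, n_w, t, h, w, C)
    return x.reshape(n_t * n_h * n_w, t * h * w * C)


rng = np.random.default_rng(0)
clip = rng.random((10, 224, 224, 3), dtype=np.float32)  # assumed clip size
tubes = extract_tubelets(clip)

# Random stand-in for the learned linear projection to R^d (d = 768 is assumed).
projection = rng.standard_normal((tubes.shape[1], 768)).astype(np.float32)
tokens = tubes @ projection

print(tubes.shape, tokens.shape)
```

Applying one shared linear map to every non-overlapping tube is what a 3D convolution with kernel size and stride both equal to the tubelet size computes, which is why the two views are equivalent.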

### HF Vivit
https://huggingface.co/docs/transformers/main/model_doc/vivit

# Dataset
https://paperswithcode.com/dataset/kinetics-400-1

# Download Dataset sayakpaul/ucf101-subset
#### Complete UCF101
UCF101 is an action recognition data set of realistic action videos, collected from YouTube, having 101 action categories. This data set is an extension of UCF50 data set which has 50 action categories.

With 13320 videos from 101 action categories, UCF101 gives the largest diversity in terms of actions and with the presence of large variations in camera motion, object appearance and pose, object scale, viewpoint, cluttered background, illumination conditions, etc, it is the most challenging data set to date. As most of the available action recognition data sets are not realistic and are staged by actors, UCF101 aims to encourage further research into action recognition by learning and exploring new realistic action categories.

https://www.crcv.ucf.edu/research/data-sets/ucf101/

In [5]:
from huggingface_hub import hf_hub_download
import os
hf_dataset_identifier = "sayakpaul/ucf101-subset"
filename = "UCF101_subset.tar.gz"
file_path = hf_hub_download(repo_id=hf_dataset_identifier, filename=filename, repo_type="dataset", local_dir=".")
file_path

'UCF101_subset.tar.gz'

In [7]:
os.getcwd()

'/mnt/d/repos2/video'

In [4]:
import tarfile
import os
with tarfile.open("UCF101_subset.tar.gz") as t:
    t.extractall("./data")

In [1]:
import torch
import wandb
from transformers import Trainer, TrainingArguments
from model_configuration import *  # project-local helpers
from model_configuration import initialise_model
from preprocessing import create_dataset
from data_handling import frames_convert_and_create_dataset_dictionary

In [2]:
from dotenv import load_dotenv
import os
env_path =  ".env"
load_dotenv(env_path)

True

# Base Model

https://github.com/google-research/scenic/tree/main/scenic/projects/vivit

### google/vivit-b-16x2-kinetics400

![image.png](models.png)


https://huggingface.co/docs/transformers/main/model_doc/vivit

In [3]:
import model_configuration
from model_configuration import compute_metrics
import cv2
import av
from data_handling import sample_frame_indices, read_video_pyav

In [4]:
container = av.open("./data/UCF101_subset/test/ApplyEyeMakeup/v_ApplyEyeMakeup_g03_c01.avi")

In [5]:
container.streams.video[0].frames

209

In [6]:
# from ipywidgets import Video, Image
# from IPython.display import display
# import numpy as np
# import cv2
# import base64

In [7]:
# cap = cv2.VideoCapture("./data/UCF101_subset/test/ApplyEyeMakeup/v_ApplyEyeMakeup_g03_c01.avi")

# frames = []

# while(1):
#     try:
#         _, frame = cap.read()

#         fgmask = cv2.Canny(frame, 100, 100)

#         mask = fgmask > 100
#         frame[mask, :] = 0

#         frames.append(frame)
#     except Exception:
#         break

# width = int(cap.get(3))
# height = int(cap.get(4))

# filename = 'tmp/output.mp4'

# fourcc = cv2.VideoWriter_fourcc(*'avc1')
# writer = cv2.VideoWriter(filename, fourcc, 25, (width, height))

# for frame in frames:
#     writer.write(frame)

# cap.release()
# writer.release()

# with open(filename, 'rb') as f:
#     video.value = f.read()

In [8]:
container = av.open("./data/UCF101_subset/test/ApplyEyeMakeup/v_ApplyEyeMakeup_g03_c01.avi")
indices = sample_frame_indices(clip_len=50, frame_sample_rate=1,seg_len=container.streams.video[0].frames)
video = read_video_pyav(container=container, indices=indices)

In [9]:
indices

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])

In [10]:
video.shape

(50, 224, 224, 3)

In [11]:
# from importlib import reload
# reload(model_configuration)



In [12]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [13]:
path_files = "data/UCF101_subset"
video_dict, class_labels = frames_convert_and_create_dataset_dictionary(path_files)


Processing file data/UCF101_subset/test/ApplyEyeMakeup/v_ApplyEyeMakeup_g03_c01.avi number of Frames: 209
Processing file data/UCF101_subset/test/ApplyEyeMakeup/v_ApplyEyeMakeup_g03_c03.avi number of Frames: 115
Processing file data/UCF101_subset/test/ApplyEyeMakeup/v_ApplyEyeMakeup_g03_c05.avi number of Frames: 146
Processing file data/UCF101_subset/test/ApplyEyeMakeup/v_ApplyEyeMakeup_g23_c02.avi number of Frames: 131
Processing file data/UCF101_subset/test/ApplyEyeMakeup/v_ApplyEyeMakeup_g23_c04.avi number of Frames: 200
Processing file data/UCF101_subset/test/ApplyEyeMakeup/v_ApplyEyeMakeup_g23_c06.avi number of Frames: 140
Processing file data/UCF101_subset/test/ApplyLipstick/v_ApplyLipstick_g14_c01.avi number of Frames: 176
Processing file data/UCF101_subset/test/ApplyLipstick/v_ApplyLipstick_g14_c03.avi number of Frames: 174
Processing file data/UCF101_subset/test/ApplyLipstick/v_ApplyLipstick_g16_c02.avi number of Frames: 165
Processing file data/UCF101_subset/test/ApplyLipstic

In [77]:
len(video_dict)

405

In [78]:
video_dict[0].keys()

dict_keys(['video', 'labels'])

In [80]:
video_dict[0]['video'].shape

(10, 224, 224, 3)

In [15]:
video_dict[0]['labels']

'ApplyEyeMakeup'

In [16]:
num_frames, height, width, channels =  video_dict[0]['video'].shape
num_frames, height, width, channels 

(10, 224, 224, 3)

In [17]:
# filename = "./tmp/saved.avi"
# codec_id = "mp4v" # ID for a video codec.
# fourcc = cv2.VideoWriter_fourcc(*codec_id)
# out = cv2.VideoWriter(filename, fourcc=fourcc, fps=20, frameSize=(width, height))

# for frame in np.split(video_dict[0]['video'], num_frames, axis=0):
#     out.write(frame)

In [18]:
# from IPython.display import Video

# Video(filename)

In [19]:
class_labels = sorted(class_labels)
label2id = {label: i for i, label in enumerate(class_labels)}
id2label = {i: label for label, i in label2id.items()}

print(f"Unique classes: {list(label2id.keys())}.")

Unique classes: ['ApplyEyeMakeup', 'ApplyLipstick', 'Archery', 'BabyCrawling', 'BalanceBeam', 'BandMarching', 'BaseballPitch', 'Basketball', 'BasketballDunk', 'BenchPress'].


In [20]:
shuffled_dataset = create_dataset(video_dict)


In [21]:
shuffled_dataset['train'].features

{'labels': ClassLabel(names=['ApplyEyeMakeup', 'ApplyLipstick', 'Archery', 'BabyCrawling', 'BalanceBeam', 'BandMarching', 'BaseballPitch', 'Basketball', 'BasketballDunk', 'BenchPress'], id=None),
 'pixel_values': Sequence(feature=Sequence(feature=Sequence(feature=Sequence(feature=Value(dtype='float32', id=None), length=-1, id=None), length=-1, id=None), length=-1, id=None), length=-1, id=None)}

In [22]:

model = model_configuration.initialise_model(shuffled_dataset, device)

Some weights of VivitForVideoClassification were not initialized from the model checkpoint at google/vivit-b-16x2-kinetics400 and are newly initialized because the shapes did not match:
- vivit.embeddings.position_embeddings: found shape torch.Size([1, 3137, 768]) in the checkpoint and torch.Size([1, 981, 768]) in the model instantiated
- classifier.weight: found shape torch.Size([400, 768]) in the checkpoint and torch.Size([10, 768]) in the model instantiated
- classifier.bias: found shape torch.Size([400]) in the checkpoint and torch.Size([10]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
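
The position-embedding mismatch in the warning above follows from the token count. A small sketch of the arithmetic, assuming a 2×16×16 tubelet, 224×224 frames, one extra classification token, and a checkpoint pretrained on 32-frame clips (`vivit_sequence_length` is an illustrative helper, not a library function):

```python
def vivit_sequence_length(num_frames, image_size=224, tubelet=(2, 16, 16), cls_token=True):
    """Number of tokens seen by the encoder for a clip of `num_frames` frames."""
    t, h, w = tubelet
    n_tokens = (num_frames // t) * (image_size // h) * (image_size // w)
    return n_tokens + (1 if cls_token else 0)


print(vivit_sequence_length(32))  # 3137, the shape found in the checkpoint
print(vivit_sequence_length(10))  # 981, the shape of the model instantiated here
```

Because the clips here have 10 frames instead of 32, the position embeddings are re-initialised along with the 10-class classifier head, so both need to be learned during fine-tuning.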


In [23]:
training_output_dir = "/tmp/results"
training_args = TrainingArguments(
    output_dir=training_output_dir,         
    num_train_epochs=3,             
    per_device_train_batch_size=2,   
    per_device_eval_batch_size=2,    
    learning_rate=5e-05,            
    weight_decay=0.01,              
    logging_dir="./logs",           
    logging_steps=10,                
    seed=42,                       
    eval_strategy="steps",    
    eval_steps=10,                   
    warmup_steps=int(0.1 * 20),      
    optim="adamw_torch",          
    lr_scheduler_type="linear",      
    fp16=True,  
    report_to="wandb"
)
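
Note that `warmup_steps=int(0.1 * 20)` evaluates to 2 steps. If the intent is a warmup of about 10% of the whole run, the step count can be derived from the dataset size instead. The sketch below is an assumption-laden illustration: `warmup_steps_for` is a hypothetical helper, and the figure of 364 training clips is inferred from the 546 global steps reported for this run, not read from the dataset.

```python
import math


def warmup_steps_for(num_train_examples, batch_size, num_epochs, warmup_fraction=0.1):
    """Return (total optimizer steps, warmup steps) for a single-device run."""
    steps_per_epoch = math.ceil(num_train_examples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    return total_steps, int(warmup_fraction * total_steps)


total, warmup = warmup_steps_for(num_train_examples=364, batch_size=2, num_epochs=3)
print(total, warmup)  # 546 54
```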

In [24]:
wandb_key =  os.getenv("WANDB_API_KEY")
wandb.login(key=wandb_key)

PROJECT = "ViViT"
MODEL_NAME = "google/vivit-b-16x2-kinetics400"
DATASET = "sayakpaul/ucf101-subset"

wandb.init(project=PROJECT, # the project I am working on
           tags=[MODEL_NAME, DATASET],
           notes ="Fine tuning ViViT with ucf101-subset")

[34m[1mwandb[0m: Currently logged in as: [33molonok[0m ([33molonok69[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /home/olonok/.netrc


In [25]:
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-05, betas=(0.9, 0.999), eps=1e-08)
# Define the trainer
trainer = Trainer(
    model=model,                      
    args=training_args,              
    train_dataset=shuffled_dataset["train"],      
    eval_dataset=shuffled_dataset["test"],       
    optimizers=(optimizer, None),  
    compute_metrics = compute_metrics
)

  self.scaler = torch.cuda.amp.GradScaler(**kwargs)
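
`compute_metrics` is imported from the local `model_configuration` module, whose source is not shown here. A plausible minimal version, offered only as an assumption about what it does, takes the `(logits, labels)` pair that `Trainer` passes in and reports accuracy under the key that appears in the training table:

```python
import numpy as np


def compute_metrics_sketch(eval_pred):
    """Hypothetical stand-in for model_configuration.compute_metrics."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}


# Toy check with made-up logits for three clips and two classes.
logits = np.array([[2.0, 0.1], [0.2, 1.5], [0.9, 0.3]])
labels = np.array([0, 1, 1])
print(compute_metrics_sketch((logits, labels)))
```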


In [26]:
with wandb.init(
    project=PROJECT,  # the project I am working on
    job_type="train",
    tags=[MODEL_NAME, DATASET],
    notes=f"Fine tuning {MODEL_NAME} with {DATASET}.",
):
    train_results = trainer.train()




| Step | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 10 | 2.4822 | 2.53044 | 0.146341 |
| 20 | 2.1769 | 2.166158 | 0.219512 |
| 30 | 2.2479 | 1.969536 | 0.268293 |
| 40 | 1.7939 | 2.041849 | 0.365854 |
| 50 | 1.7219 | 2.050001 | 0.243902 |
| 60 | 1.7689 | 1.586923 | 0.414634 |
| 70 | 1.6194 | 1.292942 | 0.585366 |
| 80 | 1.1536 | 1.353015 | 0.585366 |
| 90 | 1.2014 | 1.127209 | 0.609756 |
| 100 | 1.0635 | 0.94532 | 0.780488 |



Run history:

| Metric | History |
|---|---|
| eval/accuracy | ▁▂▂▂▃▅▅▆▅▆▆▆▆▆▇▆▆▆▆▆▇▇▇▇████████████████ |
| eval/loss | █▇▆▇▅▅▄▄▃▃▃▃▃▃▂▃▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |
| eval/runtime | ▄▁▄▅▆▆█▇███▆▅▄▄▃▄▇▇▆▂▂▁▁▁▁▁▁▂▁▁▁▂▂▂▂▁▁▁▂ |
| eval/samples_per_second | ▅█▄▄▃▃▁▂▁▁▁▃▄▄▅▅▅▂▂▂▇▇▇▇████▇▇▇▇▇▇▇▇███▇ |
| eval/steps_per_second | ▅█▅▃▃▂▁▁▁▁▁▃▄▄▅▆▅▂▂▂▇▇▇█████▇▇██▇▇▇▇███▇ |
| train/epoch | ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇███ |
| train/global_step | ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇███ |
| train/grad_norm | ▅▅▅▅▄▃▄▄█▇▄▅▄▂▁▄▂▃▁▁▄▁▁▄▂▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁ |
| train/learning_rate | ███▇▇▇▇▇▇▆▆▆▆▆▅▅▅▅▅▅▄▄▄▄▄▄▃▃▃▃▃▂▂▂▂▂▂▁▁▁ |
| train/loss | █▇▇▆▆▆▄▄▄▄▃▄▃▃▂▂▂▂▁▂▁▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |

Run summary:

| Metric | Value |
|---|---|
| eval/accuracy | 1.0 |
| eval/loss | 0.05494 |
| eval/runtime | 127.9668 |
| eval/samples_per_second | 0.32 |
| eval/steps_per_second | 0.164 |
| total_flos | 8.580287827825459e+17 |
| train/epoch | 3.0 |
| train/global_step | 546.0 |
| train/grad_norm | 0.06288 |
| train/learning_rate | 0.0 |


In [29]:
trainer.save_model("model")
trainer.log_metrics("train", train_results.metrics)
trainer.save_metrics("train", train_results.metrics)
trainer.save_state()

***** train metrics *****
  epoch                    =         3.0
  total_flos               = 799101575GF
  train_loss               =      0.5111
  train_runtime            =  3:45:40.14
  train_samples_per_second =       0.081
  train_steps_per_second   =        0.04


In [30]:
custom_path = "./model"

In [31]:
with wandb.init(project=PROJECT, job_type="models"):
  artifact = wandb.Artifact("ViViT-Fine-tuned", type="model")
  artifact.add_dir(custom_path)
  wandb.save(custom_path)
  wandb.log_artifact(artifact)


[34m[1mwandb[0m: Adding directory to artifact (./model)... Done. 5.1s



# Inference

In [33]:
path_files_val = "data/UCF_101_subset_val"
video_dict_val, class_labels_val = frames_convert_and_create_dataset_dictionary(path_files_val)

Processing file data/UCF_101_subset_val/val/ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c01.avi number of Frames: 164
Processing file data/UCF_101_subset_val/val/ApplyEyeMakeup/v_ApplyEyeMakeup_g14_c05.avi number of Frames: 160
Processing file data/UCF_101_subset_val/val/ApplyEyeMakeup/v_ApplyEyeMakeup_g20_c04.avi number of Frames: 220
Processing file data/UCF_101_subset_val/val/ApplyLipstick/v_ApplyLipstick_g10_c04.avi number of Frames: 247
Processing file data/UCF_101_subset_val/val/ApplyLipstick/v_ApplyLipstick_g20_c04.avi number of Frames: 140
Processing file data/UCF_101_subset_val/val/ApplyLipstick/v_ApplyLipstick_g25_c02.avi number of Frames: 151
Processing file data/UCF_101_subset_val/val/Archery/v_Archery_g12_c03.avi number of Frames: 436
Processing file data/UCF_101_subset_val/val/Archery/v_Archery_g18_c02.avi number of Frames: 291
Processing file data/UCF_101_subset_val/val/Archery/v_Archery_g18_c06.avi number of Frames: 160
Processing file data/UCF_101_subset_val/val/BabyCrawling/v

In [34]:
val_dataset = create_dataset(video_dict_val)


In [35]:
import wandb
run = wandb.init()
artifact = run.use_artifact('olonok69/ViViT/ViViT-Fine-tuned:v0', type='model')
artifact_dir = artifact.download()

[34m[1mwandb[0m: Downloading large artifact ViViT-Fine-tuned:v0, 331.90MB. 3 files... 
[34m[1mwandb[0m:   3 of 3 files downloaded.  
Done. 0:0:5.7


In [81]:
artifact_dir

'/mnt/d/repos2/video/artifacts/ViViT-Fine-tuned:v0'

In [36]:
val_dataset

DatasetDict({
    train: Dataset({
        features: ['labels', 'pixel_values'],
        num_rows: 27
    })
    test: Dataset({
        features: ['labels', 'pixel_values'],
        num_rows: 3
    })
})

In [71]:
from data_handling import generate_all_files
import os
import numpy as np
import av
from pathlib import Path
def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.
    Args:
        container (`av.container.input.InputContainer`): PyAV container.
        indices (`List[int]`): List of frame indices to decode.
    Returns:
        result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])


def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
    '''
    Sample a given number of frame indices from the video.
    Args:
        clip_len (`int`): Total number of frames to sample.
        frame_sample_rate (`int`): Sample every n-th frame.
        seg_len (`int`): Maximum allowed index of sample's last frame.
    Returns:
        indices (`List[int]`): List of sampled frame indices
    '''
    converted_len = int(clip_len * frame_sample_rate)
    end_idx = np.random.randint(converted_len, seg_len)
    start_idx = end_idx - converted_len
    indices = np.linspace(start_idx, end_idx, num=clip_len)
    indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
    return indices
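
One caveat with `sample_frame_indices` as written: `np.random.randint(converted_len, seg_len)` raises a `ValueError` when the video has no more frames than `clip_len * frame_sample_rate`. A guarded variant might look like the sketch below. `sample_frame_indices_safe`, its optional `rng` argument, and the fallback of spreading indices over the whole clip are assumptions for illustration, not part of the original helpers.

```python
import numpy as np


def sample_frame_indices_safe(clip_len, frame_sample_rate, seg_len, rng=None):
    """Like sample_frame_indices, but falls back to even spacing for short clips."""
    if rng is None:
        rng = np.random.default_rng()
    converted_len = int(clip_len * frame_sample_rate)
    if seg_len <= converted_len:
        # Too short for a random window: spread indices over the whole clip
        # (indices repeat when seg_len < clip_len).
        return np.linspace(0, seg_len - 1, num=clip_len).astype(np.int64)
    end_idx = int(rng.integers(converted_len, seg_len))  # upper bound is exclusive
    start_idx = end_idx - converted_len
    indices = np.linspace(start_idx, end_idx, num=clip_len)
    return np.clip(indices, start_idx, end_idx - 1).astype(np.int64)


# A 164-frame clip takes the random-window path; a 30-frame clip takes the fallback.
long_clip = sample_frame_indices_safe(10, 1, 164, rng=np.random.default_rng(0))
short_clip = sample_frame_indices_safe(10, 4, 30)
print(long_clip)
print(short_clip)
```

Passing a seeded generator also makes the sampled window reproducible, which can help when comparing evaluation runs.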

In [54]:
labels = val_dataset['train'].features['labels'].names
config = VivitConfig.from_pretrained(artifact_dir)
config.num_labels = len(labels)  # the classification head is sized from num_labels
config.id2label = {str(i): c for i, c in enumerate(labels)}
config.label2id = {c: str(i) for i, c in enumerate(labels)}
config.num_frames=10
config.video_size= [10, 224, 224]

In [46]:
import gc
gc.collect()
torch.cuda.empty_cache()

In [45]:
from transformers import VivitImageProcessor, VivitForVideoClassification

In [55]:
image_processor = VivitImageProcessor.from_pretrained("google/vivit-b-16x2-kinetics400")
fine_tune_model = VivitForVideoClassification.from_pretrained(artifact_dir,config=config)

In [66]:
directory =  "data/UCF_101_subset_val"

In [73]:
class_labels = []
true_labels=[]
predictions = []
predictions_labels = []
all_videos=[]
video_files= []
sizes = []
for p in generate_all_files(Path(directory), only_files=True):
    set_files = str(p).split("/")[2]  # split directory (here "val")
    cls = str(p).split("/")[3]  # class name
    file = str(p).split("/")[4]  # file name
    #file name path
    file_name= os.path.join(directory, set_files, cls, file)
    true_labels.append(cls)   
    # Process class
    if cls not in class_labels:
        class_labels.append(cls)
    # process video File
    container = av.open(file_name)
    #print(f"Processing file {file_name} number of Frames: {container.streams.video[0].frames}")  
    indices = sample_frame_indices(clip_len=10, frame_sample_rate=1,seg_len=container.streams.video[0].frames)
    video = read_video_pyav(container=container, indices=indices)
    inputs = image_processor(list(video), return_tensors="pt")
    with torch.no_grad():
        outputs = fine_tune_model(**inputs)
        logits = outputs.logits

    # the fine-tuned model predicts one of the 10 UCF101-subset classes
    predicted_label = logits.argmax(-1).item()
    prediction = fine_tune_model.config.id2label[str(predicted_label)]
    predictions.append(prediction)
    predictions_labels.append(predicted_label)
    print(f"file {file_name} True Label {cls}, predicted label {prediction}")

file data/UCF_101_subset_val/val/ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c01.avi True Label ApplyEyeMakeup, predicted label ApplyEyeMakeup
file data/UCF_101_subset_val/val/ApplyEyeMakeup/v_ApplyEyeMakeup_g14_c05.avi True Label ApplyEyeMakeup, predicted label ApplyEyeMakeup
file data/UCF_101_subset_val/val/ApplyEyeMakeup/v_ApplyEyeMakeup_g20_c04.avi True Label ApplyEyeMakeup, predicted label ApplyEyeMakeup
file data/UCF_101_subset_val/val/ApplyLipstick/v_ApplyLipstick_g10_c04.avi True Label ApplyLipstick, predicted label ApplyLipstick
file data/UCF_101_subset_val/val/ApplyLipstick/v_ApplyLipstick_g20_c04.avi True Label ApplyLipstick, predicted label ApplyLipstick
file data/UCF_101_subset_val/val/ApplyLipstick/v_ApplyLipstick_g25_c02.avi True Label ApplyLipstick, predicted label ApplyLipstick
file data/UCF_101_subset_val/val/Archery/v_Archery_g12_c03.avi True Label Archery, predicted label Archery
file data/UCF_101_subset_val/val/Archery/v_Archery_g18_c02.avi True Label Archery, predicted la

In [74]:
from sklearn.metrics import classification_report

In [76]:
report = classification_report(true_labels, predictions)
print(report)

                precision    recall  f1-score   support

ApplyEyeMakeup       1.00      1.00      1.00         3
 ApplyLipstick       1.00      1.00      1.00         3
       Archery       1.00      1.00      1.00         3
  BabyCrawling       1.00      1.00      1.00         3
   BalanceBeam       1.00      1.00      1.00         3
  BandMarching       1.00      1.00      1.00         3
 BaseballPitch       1.00      1.00      1.00         3
    Basketball       1.00      1.00      1.00         3
BasketballDunk       1.00      1.00      1.00         3
    BenchPress       1.00      1.00      1.00         3

      accuracy                           1.00        30
     macro avg       1.00      1.00      1.00        30
  weighted avg       1.00      1.00      1.00        30
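
With only three clips per class, a per-class breakdown is easy to compute by hand as a cross-check of the report. A minimal sketch using only the standard library; `per_class_accuracy` is an illustrative helper and the demo labels below are made up, not results from this run:

```python
from collections import Counter


def per_class_accuracy(true_labels, predictions):
    """Fraction of clips of each true class that were predicted correctly (i.e. recall)."""
    totals = Counter(true_labels)
    hits = Counter(t for t, p in zip(true_labels, predictions) if t == p)
    return {label: hits[label] / totals[label] for label in sorted(totals)}


# Made-up example: one Archery clip is mistaken for Basketball.
demo_true = ["Archery", "Archery", "Basketball", "Basketball"]
demo_pred = ["Archery", "Basketball", "Basketball", "Basketball"]
print(per_class_accuracy(demo_true, demo_pred))
```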

