# Learning language and video representation with VideoBERT 
In this section, we will look at another interesting variant of the BERT model called VideoBERT. As the name suggests, VideoBERT learns the representation of video as well as the representation of language. It is the first model to learn the representations of video and language jointly.

Just as we fine-tune the pre-trained BERT model for downstream tasks, we can fine-tune the pre-trained VideoBERT model for many interesting downstream tasks, such as image caption generation, video captioning, and predicting the next frames of a video.

But how exactly is the VideoBERT model pre-trained to learn video and language representations? Let's find out in the next section.

# Pre-training the VideoBERT model  
We know that the BERT model is pre-trained using two tasks: masked language modeling (the cloze task) and next sentence prediction. Can we also pre-train VideoBERT using these two tasks? Yes and no. We can pre-train VideoBERT using the cloze task, but we cannot use next sentence prediction. Instead, we use a new task called the linguistic-visual alignment task. Now, let's explore in detail how the VideoBERT model is pre-trained using the cloze task and the linguistic-visual alignment task.

# Cloze task 
First, let's see how VideoBERT is pre-trained using the cloze task. To pre-train VideoBERT, we use instructional videos, such as cooking videos. But why instructional videos? Why can't we use any random videos? Let's understand this with an example. Consider a video where someone is teaching us how to cook. Say the speaker says, "Cut the lemon into slices." While we hear the speaker say "lemon into slices", we also see them cutting the lemon into slices. This is shown in the following example:

![title](images/1.png)

Instructional videos of this sort, where the speaker's statement and the corresponding visuals align with each other, are very useful for pre-training VideoBERT. Why are they more useful than random videos? Because the speaker's statement matches what is shown on screen, the model can associate the words with the visuals, and this helps it learn a better joint representation of language and video.

So, we have learned that instructional videos are useful for pre-training the VideoBERT model. What's next? How can we use a video for training? First, we need to extract the language tokens (linguistic tokens) and the visual tokens (video tokens) from the video. Let's see how to extract these tokens.

We extract the linguistic tokens from the audio (the speaker's statements) in the video. To do this, we need to extract the audio from the video and convert it to text. We achieve this with an automatic speech recognition (ASR) toolkit. After converting the audio to text, we tokenize the text, and the resulting tokens form our language tokens.

Now, how do we obtain the visual tokens? We sample image frames from the video at 20 fps (frames per second). Then, we convert the sampled frames into visual tokens, where each visual token covers 1.5 seconds of video (that is, 30 frames).
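
The arithmetic behind these numbers can be checked with a short sketch. It assumes that the 1.5-second windows are consecutive and non-overlapping, and that leftover frames at the end are dropped; both are assumptions made for illustration. The helper name `split_into_clips` is hypothetical, and the step that maps each clip to a discrete visual token is left out:

```python
FPS = 20              # sampling rate stated in the text
CLIP_SECONDS = 1.5    # duration covered by one visual token
FRAMES_PER_CLIP = int(FPS * CLIP_SECONDS)  # 30 frames per clip


def split_into_clips(frames, frames_per_clip=FRAMES_PER_CLIP):
    """Group sampled frames into consecutive, non-overlapping clips.

    Trailing frames that do not fill a whole clip are dropped
    (an assumption made for this sketch).
    """
    n_full = len(frames) // frames_per_clip
    return [
        frames[i * frames_per_clip:(i + 1) * frames_per_clip]
        for i in range(n_full)
    ]


# Stand-in for 10 seconds of video sampled at 20 fps: 200 frame indices.
frames = list(range(10 * FPS))
clips = split_into_clips(frames)
print(len(clips), len(clips[0]))  # prints: 6 30
```

Each of the resulting clips would then be turned into one visual token.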

That's it. Now we have the language and visual tokens. What's next? How do we pre-train the VideoBERT model with these tokens? First, we combine the language and visual tokens, as shown below. We can observe that there is a [>] token between the language tokens and the visual tokens. It is a special token used for joining the two:

![title](images/2.png)


We know that in BERT, we add the [CLS] token at the beginning of the first sentence and the [SEP] token at the end of every sentence. Here, we add the [CLS] token at the beginning of the language tokens and the [SEP] token only at the end of the visual tokens, which indicates that we treat the language and visual tokens together as a single sentence:

![title](images/3.png)
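
The layout of the combined input can be sketched in a few lines. The function name `build_input_sequence` and the visual token names `v1`, `v2`, `v3` are hypothetical placeholders chosen for illustration; only the special tokens [CLS], [>], and [SEP] and their positions come from the description above:

```python
def build_input_sequence(language_tokens, visual_tokens):
    """Join language and visual tokens into a single input sequence.

    [CLS] goes first, [>] separates the two modalities, and a single
    [SEP] closes the whole sequence.
    """
    return (
        ["[CLS]"]
        + list(language_tokens)
        + ["[>]"]
        + list(visual_tokens)
        + ["[SEP]"]
    )


language_tokens = ["cut", "lemon", "into", "slices"]
# Hypothetical placeholder names standing in for real visual tokens.
visual_tokens = ["v1", "v2", "v3"]

sequence = build_input_sequence(language_tokens, visual_tokens)
print(sequence)
```

Note that [SEP] appears only once, at the very end, which reflects treating the whole input as a single sentence.
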

Now, we randomly mask a few language and visual tokens, as shown below:

![title](images/4.png)
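
A minimal sketch of the masking step follows. The masking probability is not given in this section, so the default of 0.15 is an assumption borrowed from BERT, and the example call uses a larger value only so that a short toy sequence is likely to contain a mask. The function name `mask_tokens` and the visual token names are hypothetical, and the sketch always substitutes [MASK] rather than applying any more elaborate replacement scheme:

```python
import random

SPECIAL_TOKENS = {"[CLS]", "[>]", "[SEP]"}


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Randomly replace non-special tokens with [MASK].

    Returns the masked sequence and a dict {position: original token}
    holding the prediction targets for the cloze task.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if tok not in SPECIAL_TOKENS and rng.random() < mask_prob:
            masked.append("[MASK]")
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets


tokens = ["[CLS]", "cut", "lemon", "into", "slices",
          "[>]", "v1", "v2", "v3", "[SEP]"]
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=42)
print(masked)
print(targets)
```

Both language and visual positions are eligible for masking, while the special tokens are always left intact.
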

Next, we feed all the tokens to the VideoBERT model, which returns a representation for each token. For instance, in the following figure, $R_{[CLS]}$ denotes the representation of the [CLS] token, $R_{cut}$ denotes the representation of the token 'cut', and so on:

![title](images/5.png)

Now, we feed the representation of the masked token returned by VideoBERT to a classifier (feedforward + softmax), and the classifier predicts the masked token, as shown below:

![title](images/6.png)
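
The classifier step can be sketched in plain Python. This is an illustration under stated assumptions, not the actual model: the feedforward part is reduced to a single linear layer, the weights are random and untrained, the vocabulary is a toy one, and the vector `r_mask` is a random stand-in for the representation of a masked token. All names here are hypothetical:

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(representation, weights, bias, vocab):
    """Apply a linear layer followed by softmax.

    Returns the most probable token and the full distribution.
    """
    logits = [
        sum(w * x for w, x in zip(row, representation)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    best = max(range(len(vocab)), key=probs.__getitem__)
    return vocab[best], probs


rng = random.Random(0)
hidden_size = 8
# Toy vocabulary mixing language tokens and placeholder visual tokens.
vocab = ["cut", "lemon", "into", "slices", "v1", "v2", "v3"]

# Random, untrained parameters and a random stand-in representation.
weights = [[rng.gauss(0, 1) for _ in range(hidden_size)] for _ in vocab]
bias = [0.0] * len(vocab)
r_mask = [rng.gauss(0, 1) for _ in range(hidden_size)]

predicted, probs = classify(r_mask, weights, bias, vocab)
print(predicted, round(sum(probs), 6))
```

During pre-training, the predicted distribution would be compared with the true masked token, and the resulting loss would be used to update the model; with untrained weights, as here, the prediction is arbitrary.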

In this way, we pre-train the VideoBERT model with the cloze task by predicting the masked linguistic and visual tokens.

Now that we have learned how VideoBERT is pre-trained using the cloze task, in the next section, we will see how it is pre-trained using the linguistic-visual alignment task.
