# Computer Vision Applications for Videos

Throughout the course we have discussed various CV applications, in both the traditional CV space and the deep learning space. This notebook discusses some relatively new applications in these spaces for videos.

This notebook presents an overview of some interesting applications that analyze or process videos using computer vision techniques. Along with videos, it also presents some noteworthy real-time applications of CV. These do not necessarily operate on video data but on a stream of images.

Let's look at these applications and try to understand the concepts behind them on a high level.

## CV for Videos

We have seen how to analyse videos with image processing techniques by processing one frame at a time. Those techniques, however, do not leverage the relationship between consecutive frames of the video.

A video is a sequence of a large number of frames, and these frames are temporally related to each other. This relation manifests both as temporal redundancy (consecutive frames are largely similar) and as a logical temporal relationship between the frames.

**Example**: a video of a car on a highway will have frames showing the car progressing in a particular direction as we progress forward in the frame sequence.
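The temporal redundancy described above can be illustrated with a toy example. The sketch below is only an illustration under simplifying assumptions: the "video" is a tiny synthetic grayscale clip built from plain Python lists, and the helper names `make_frames` and `changed_fraction` are hypothetical, not part of any library.

```python
def make_frames(num_frames=5, height=4, width=8):
    """Build a toy grayscale video: one bright pixel (the "car")
    moves one column to the right in each frame."""
    frames = []
    for t in range(num_frames):
        frame = [[0] * width for _ in range(height)]
        frame[height // 2][t % width] = 255
        frames.append(frame)
    return frames


def changed_fraction(frame_a, frame_b):
    """Fraction of pixel positions whose value differs between two frames."""
    total = 0
    changed = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for a, b in zip(row_a, row_b):
            total += 1
            changed += a != b
    return changed / total


frames = make_frames()
# Compare every frame with the one that follows it.
fractions = [changed_fraction(a, b) for a, b in zip(frames, frames[1:])]
print(fractions)
```

Only a small fraction of the pixels changes from one frame to the next (the old and the new position of the bright pixel), which is the kind of redundancy that video models and video codecs can exploit.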

**This temporal nature makes videos a good candidate input for sequential models such as RNNs and LSTMs.**

Let us see some examples of CV applied to video data.

### Video Classification

**Problem Statement**


Video classification is the categorization of a video into predefined categories. The problem is similar to image classification but more complex: a video is a collection of hundreds or thousands of frames, and individual frames may look quite different from each other even though they belong to a single category.

<br/>

<figure>
    <img src="http://blog.qure.ai/assets/images/actionrec/fronststroke.gif" width = 200px/>
    <figcaption style = "text-align:center">Front Stroke. Ref: 
        <a href="http://blog.qure.ai/notes/deep-learning-for-videos-action-recognition-review">Deep Learning for Videos: A 2018 Guide to Action Recognition</a>
    </figcaption>
</figure>

<br/>

<br/>

<figure>
    <img src="http://blog.qure.ai/assets/images/actionrec/breaststroke.gif" width = 200px/>
    <figcaption style = "text-align:center">Breast Stroke. Ref: 
        <a href="http://blog.qure.ai/notes/deep-learning-for-videos-action-recognition-review">Deep Learning for Videos: A 2018 Guide to Action Recognition</a>
    </figcaption>
</figure>

<br/>

**Datasets**:

- [Sports 1M Dataset](https://github.com/gtoderici/sports-1m-dataset/): provides links to 1,133,158 YouTube videos annotated with 487 sports labels. The annotations were generated automatically using YouTube Topics, which has a public API.
- [UCF 101 - Action Recognition Dataset](https://www.crcv.ucf.edu/data/UCF101.php): an action recognition dataset of realistic action videos collected from YouTube, covering 101 action categories.

**Approaches**:

- Classifying each frame of a video using a 2D CNN and then averaging the predictions.
- Using a 3D CNN to perform convolution on a set of frames.
- Extracting features for each frame using a 2D CNN in a time-distributed manner and then feeding the feature sequence to an RNN.
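The first approach in the list above can be sketched in a few lines. This is a minimal illustration, not a full pipeline: it assumes a 2D CNN has already produced class probabilities for each frame, and the probability values, the label names and the helper `average_frame_predictions` are made up for the example.

```python
def average_frame_predictions(frame_probs):
    """Average per-frame class probabilities.

    Returns (mean_probs, predicted_index), where predicted_index is the
    class with the highest averaged probability.
    """
    if not frame_probs:
        raise ValueError("need at least one frame prediction")
    num_classes = len(frame_probs[0])
    mean_probs = [
        sum(probs[c] for probs in frame_probs) / len(frame_probs)
        for c in range(num_classes)
    ]
    predicted = max(range(num_classes), key=lambda c: mean_probs[c])
    return mean_probs, predicted


# Hypothetical per-frame outputs of a 2D CNN for a 4-frame clip.
labels = ["front_stroke", "breast_stroke"]
frame_probs = [
    [0.9, 0.1],
    [0.4, 0.6],  # this frame alone would be misclassified
    [0.8, 0.2],
    [0.7, 0.3],
]

mean_probs, predicted = average_frame_predictions(frame_probs)
print(labels[predicted], mean_probs)
```

Averaging smooths out individual frames that the classifier gets wrong, but it ignores the order of the frames, which is why the 3D CNN and CNN + RNN approaches are used when the temporal structure matters.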

**Interesting Reads**:

- https://www.pyimagesearch.com/2019/07/15/video-classification-with-keras-and-deep-learning/
- https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42455.pdf
- http://blog.qure.ai/notes/deep-learning-for-videos-action-recognition-review
- https://www.youtube.com/watch?v=PrPv9GV1jPI

### Video Captioning/Description

**Problem Statement**

Video captioning/description is an interesting intersection of CV and NLP. It refers to generating a sequence of text that describes a sequence of frames. The challenge is twofold: the model has to learn the spatio-temporal dependencies in the video frames, and it has to learn how they map to the sequential representation of text.

<br/>

<figure>
    <img src="https://github.com/AdrianHsu/S2VT-seq2seq-video-captioning-attention/raw/master/util/s2vt-1.png" width = 900px/>
    <figcaption style = "text-align:center">Image showing a Seq2Seq model for video captioning. Ref: 
        <a href="https://github.com/AdrianHsu/S2VT-seq2seq-video-captioning-attention">S2VT</a>
    </figcaption>
</figure>

<br/>

**Approaches**

- Seq2Seq Models
- Attention Models
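To give a rough idea of what an attention model does in this setting, the sketch below computes generic dot-product attention over per-frame feature vectors. It is an assumption-laden illustration rather than the mechanism of any specific captioning model: the feature vectors, the decoder state and the helper names `softmax` and `attend` are invented for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, frame_features):
    """Dot-product attention over frame features.

    Each frame is scored by its dot product with the query; the scores are
    normalised with softmax and used to form a weighted sum of the features.
    Returns (weights, context).
    """
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in frame_features]
    weights = softmax(scores)
    dim = len(frame_features[0])
    context = [
        sum(w * feat[d] for w, feat in zip(weights, frame_features))
        for d in range(dim)
    ]
    return weights, context


# Hypothetical 2-dimensional features for three frames and a decoder state.
frame_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]

weights, context = attend(decoder_state, frame_features)
print(weights, context)
```

In a captioning decoder, a context vector of this kind would typically be recomputed for every generated word, so that each word can focus on the frames most relevant to it.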

**Interesting Reads**

- https://github.com/DataScienceNigeria/AI-powered-by-Google-s-VideoBERT-
- https://github.com/AdrianHsu/S2VT-seq2seq-video-captioning-attention