Data \& Code Preparation
---

To download the code and run it in your own environment, or to reproduce our experiments, follow the steps below:

  - ### 1. Clone the repo and install the dependencies

In [None]:
! git clone https://github.com/verlab/TextDrivenVideoAcceleration_TPAMI_2022
%cd TextDrivenVideoAcceleration_TPAMI_2022
! pip install -r requirements.txt

  - ### 2. Prepare the data to train VDAN+

  Download and organize the VaTeX dataset (annotations and videos), then download the pretrained GloVe embeddings.

In [None]:
## Download VaTeX JSON data
! wget -O semantic_encoding/resources/vatex_training_v1.0.json https://eric-xw.github.io/vatex-website/data/vatex_training_v1.0.json
! wget -O semantic_encoding/resources/vatex_validation_v1.0.json https://eric-xw.github.io/vatex-website/data/vatex_validation_v1.0.json

## Download the Pretrained GloVe Embeddings
! wget -O semantic_encoding/resources/glove.6B.zip http://nlp.stanford.edu/data/glove.6B.zip
! unzip -j semantic_encoding/resources/glove.6B.zip glove.6B.300d.txt -d semantic_encoding/resources/
! rm semantic_encoding/resources/glove.6B.zip

## Download VaTeX Videos (We used the kinetics-datasets-downloader tool to download the available videos from YouTube)
# NOTE: VaTeX is composed of the VALIDATION split of the Kinetics-600 dataset; therefore, you must modify the script to download the validation videos only.
# We adapted the download_test_set function in the kinetics-datasets-downloader/downloader/download.py file to do so.
# 1. Clone repository and copy the modified files
! git clone https://github.com/dancelogue/kinetics-datasets-downloader/ semantic_encoding/resources/VaTeX_downloader_files/kinetics-datasets-downloader/
! cp semantic_encoding/resources/VaTeX_downloader_files/download.py semantic_encoding/resources/VaTeX_downloader_files/kinetics-datasets-downloader/downloader/download.py
! cp semantic_encoding/resources/VaTeX_downloader_files/config.py semantic_encoding/resources/VaTeX_downloader_files/kinetics-datasets-downloader/downloader/lib/config.py

# 2. Get the kinetics dataset annotations
! wget -O semantic_encoding/resources/VaTeX_downloader_files/kinetics600.tar.gz https://storage.googleapis.com/deepmind-media/Datasets/kinetics600.tar.gz
! tar -xf semantic_encoding/resources/VaTeX_downloader_files/kinetics600.tar.gz -C semantic_encoding/resources/VaTeX_downloader_files/
! rm semantic_encoding/resources/VaTeX_downloader_files/kinetics600.tar.gz

# 3. Download the videos. This can take a while (~28k videos to download). You can stop it at any time and train with the videos downloaded so far.
! python3 semantic_encoding/resources/VaTeX_downloader_files/kinetics-datasets-downloader/downloader/download.py --val

# Troubleshooting: If the download stalls for a long time, try increasing the queue size in the parallel downloader (semantic_encoding/resources/VaTeX_downloader_files/kinetics-datasets-downloader/downloader/lib/parallel_download.py)
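
Since the download can be stopped early, it can help to see how many annotated clips are actually on disk before training. The sketch below is only an illustration and is not part of the repository: `count_downloaded` is a hypothetical helper, and it assumes that each entry in the VaTeX JSON carries a `videoID` field and that each video is saved under a file name whose stem equals that ID. Adjust both assumptions if your files look different.

```python
import json
from pathlib import Path


def count_downloaded(captions_json, videos_dir):
    """Return (annotated, found): how many annotated clips exist on disk."""
    with open(captions_json, encoding="utf-8") as f:
        entries = json.load(f)

    # Assumption: every annotation entry has a "videoID" field.
    annotated = {entry["videoID"] for entry in entries}

    # Assumption: each video file is named <videoID>.<extension>.
    on_disk = {p.stem for p in Path(videos_dir).rglob("*") if p.is_file()}

    return len(annotated), len(annotated & on_disk)
```

For example, calling `count_downloaded("semantic_encoding/resources/vatex_training_v1.0.json", "semantic_encoding/datasets/VaTeX/raw_videos/")` would report the coverage of the training annotations, assuming the videos were stored in that folder.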

### Training VDAN+

To train VDAN+, first set the model and training parameters in the **semantic_encoding/config.py** file (the current values are the same as those described in the paper), then run the training script **semantic_encoding/train.py**.

The training script will save the model in the **semantic_encoding/models** folder.

  - ### 1. Setup

    ```python
        import getpass
        import socket

        import torch.nn as nn

        model_params = {
            'num_input_frames': 32,
            'word_embed_size': 300,
            'sent_embed_size': 512,  # h_ij
            'doc_embed_size': 512,  # h_i
            'hidden_feat_size': 512,
            'feat_embed_size': 128,  # d = 128. We also tested 512 and 1024 and observed no substantial changes
            'sent_rnn_layers': 1,  # Not used in our paper, but feel free to change
            'word_rnn_layers': 1,  # Not used in our paper, but feel free to change
            'word_att_size': 1024,  # c_p
            'sent_att_size': 1024,  # c_d

            'use_sentence_level_attention': True,  # Not used in our paper, but feel free to change
            'use_word_level_attention': True,  # Not used in our paper, but feel free to change
            'use_visual_shortcut': True,  # Uses the R(2+1)D output as the first hidden state (h_0) of the document embedder Bi-GRU.
            'learn_first_hidden_vector': False  # Learns the first hidden state (h_0) of the document embedder Bi-GRU.
        }

        ETA_MARGIN = 0.  # η from Equation 1 - (Section 3.1.3 Training)

        train_params = {
            # VaTeX
            'captions_train_fname': 'resources/vatex_training_v1.0.json', # Run semantic_encoding/resources/download_resources.sh first to obtain this file
            'captions_val_fname': 'resources/vatex_validation_v1.0.json', # Run semantic_encoding/resources/download_resources.sh first to obtain this file
            'train_data_path': 'datasets/VaTeX/raw_videos/', # Download all Kinetics-600 (10-seconds) validation videos using the semantic_encoding/resources/download_vatex_videos.sh script
            'val_data_path': 'datasets/VaTeX/raw_videos/', # Download all Kinetics-600 (10-seconds) validation videos using the semantic_encoding/resources/download_vatex_videos.sh script

            'embeddings_filename': 'resources/glove.6B.300d.txt', # Run semantic_encoding/resources/download_resources.sh first to obtain this file

            'max_sents': 20,  # maximum number of sentences per document
            'max_words': 20,  # maximum number of words per sentence

            # Training parameters
            'train_batch_size': 64, # We used a batch size of 64 (requires a 24 GB GPU card)
            'val_batch_size': 64, # We used a batch size of 64 (requires a 24 GB GPU card)
            'num_epochs': 100, # We trained for 100 epochs
            'learning_rate': 1e-5,
            'model_checkpoint_filename': None,  # Add an already trained model to continue training (Leave it as None to train from scratch)...

            # Video transformation parameters
            'resize_size': (128, 171),  # h, w
            'random_crop_size': (112, 112),  # h, w
            'do_random_horizontal_flip': True,  # Randomly flip the whole video horizontally (all frames together)

            # Training process
            'optimizer': 'Adam',
            'eta_margin': ETA_MARGIN,
            'criterion': nn.CosineEmbeddingLoss(ETA_MARGIN),

            # Machine and user data
            'username': getpass.getuser(),
            'hostname': socket.gethostname(),

            # Logging parameters
            'checkpoint_folder': 'models/',
            'log_folder': 'logs/',

            # Debugging helpers (speeding things up for debugging)
            'use_random_word_embeddings': False,  # Choose if you want to use random embeddings
            'train_data_proportion': 1.,  # Choose how much data you want to use for training
            'val_data_proportion': 1.,  # Choose how much data you want to use for validation
        }

        models_paths = {
            'VDAN': '<PATH/TO/THE/VDAN/MODEL>', # OPTIONAL: Provide the path to the VDAN model (https://github.com/verlab/StraightToThePoint_CVPR_2020/releases/download/v1.0.0/vdan_pretrained_model.pth) from the CVPR paper: https://github.com/verlab/StraightToThePoint_CVPR_2020/
            'VDAN+': '<PATH/TO/THE/VDAN+/MODEL>' # You must fill this path after training the VDAN+ to train the SAFFA agent
        }

        deep_feats_base_folder = '<PATH/TO/THE/VDAN+EXTRACTED_FEATS/FOLDER>' # Provide the location you stored/want to store your VDAN+ extracted feature vectors
    ```
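
The configuration passes `ETA_MARGIN` to `nn.CosineEmbeddingLoss`. To make the role of the margin concrete, here is a minimal plain-Python sketch of the per-pair formula documented for that PyTorch loss: `1 - cos(a, b)` for a matching pair (label `1`) and `max(0, cos(a, b) - margin)` for a non-matching pair (label `-1`). The function names are hypothetical and the code is an illustration only; the training script uses the PyTorch implementation, which additionally averages over the batch by default.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_embedding_loss(a, b, label, margin=0.0):
    """Per-pair loss; label is 1 (matching pair) or -1 (non-matching pair)."""
    cos = cosine_similarity(a, b)
    if label == 1:
        # Matching pairs are pulled towards a cosine similarity of 1.
        return 1.0 - cos
    # Non-matching pairs are penalized only while their similarity exceeds the margin.
    return max(0.0, cos - margin)
```

With a margin of 0, as in the configuration above, a non-matching pair stops contributing to the loss once its cosine similarity is zero or negative.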

  - ### 2. Train

    First, make sure the NLTK `punkt` tokenizer data is installed:

In [None]:
import nltk
nltk.download('punkt')
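
The `punkt` data backs NLTK's sentence tokenizer, and the configuration caps each document at `max_sents` sentences and each sentence at `max_words` words. The sketch below shows one way such limits could be applied to an already tokenized document. It is an assumption for illustration, not the repository's code: `clip_document` is a hypothetical helper, and simple truncation is assumed as the clipping strategy.

```python
def clip_document(sentences, max_sents=20, max_words=20):
    """Keep at most max_sents sentences and max_words tokens per sentence.

    `sentences` is a list of token lists, e.g. the result of splitting a
    caption with nltk.sent_tokenize and then nltk.word_tokenize.
    """
    return [tokens[:max_words] for tokens in sentences[:max_sents]]
```

Shorter documents pass through unchanged, so the helper is safe to apply to every caption.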

  Finally, you're ready to go!

In [None]:
%cd semantic_encoding/
! python train.py