wangyu-ustc/LVChat

LVChat

This is the official implementation of our paper LVChat: Facilitating Long Video Comprehension. Our code base is built on the repo Ask-Anything.

Environment Preparation

conda create --name lvchat python=3.11
conda activate lvchat
pip install -r requirements.txt

Datasets

We used instruction data for training, specifically the following subsets (please refer to the link here, which includes all the JSON files needed for training):

conversation_videochat1
conversation_videochat2
conversation_videochatgpt
caption_videochat
reasoning_clevrer_qa
reasoning_clevrer_mc
reasoning_next_qa

To replicate our training for Frame Scalable Encoding (FSE), please download the datasets Clevrer, NExT-QA, VideoChatGPT, and WebVid-10M (note that this dataset is no longer available), as well as the JSON files from VideoChat2-IT. Then arrange all the datasets in the following structure:

- data
    - ANet
        - activitynet_train_videos_video_chatgpt
    - anno
        - video
            - caption
            - conversation
            - reasoning
    - clevrer
    - internvid-10s (This is the instruction dataset collected by VideoChat2. The videos are from InternVid (https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid). Because the data is very large, you may need to download the videos yourself. For example, in "LLU5X98aozs_648.258.mp4", "LLU5X98aozs" is the YouTube ID, "648.258" is the start time, and the clip duration is 10s. Thanks to Kunchang Li, the author of VideoChat2, for providing the link and instructions.)
    - nextqa
    - WebVid10M (All the videos of VideoChat v1 data are from here)
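The clip naming convention noted for internvid-10s can be sketched in shell as follows. This is only an illustration under the assumption that every clip follows the "<youtube_id>_<start_time>.mp4" pattern from the example above; the function name parse_clip_name is hypothetical and not part of this repo.

```shell
# Split an InternVid clip filename into its YouTube ID and start time.
# Assumes the "<youtube_id>_<start_time>.mp4" pattern described above.
# parse_clip_name is a hypothetical helper, not part of this repo.
parse_clip_name() {
    base=$(basename "$1" .mp4)   # drop any directory prefix and the extension
    yt_id=${base%_*}             # everything before the last underscore
    start=${base##*_}            # everything after the last underscore
    printf '%s %s\n' "$yt_id" "$start"
}

parse_clip_name "LLU5X98aozs_648.258.mp4"   # prints: LLU5X98aozs 648.258
```

The split is done at the last underscore because YouTube IDs can themselves contain underscores and hyphens.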

Base model preparation

  1. Download the VideoBLIP model.
wget -P video_models https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/videochat2/umt_l16_qformer.pth
  2. Follow the instructions here to prepare vicuna-7b-v0 and place it under video_models.

Training with Frame Scalable Encoding (FSE)

Download the model videochat2_7b_stage3.pth from here, then put it under the folder video_models. The folder video_models should now have the following structure:

- video_models
    - vicuna-7b-v0
    l16_25m.pth
    umt_l16_qformer.pth
    videochat2_7b_stage3.pth
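Before launching training, it may help to confirm that the folder matches the listing above. The following is a minimal sketch, assuming the four entries listed are all that is required; the function name check_video_models and its output format are hypothetical.

```shell
# Report which of the expected entries are missing from a model folder.
# The entry names come from the listing above; check_video_models itself
# is a hypothetical helper, not part of this repo.
check_video_models() {
    dir=${1:-video_models}
    missing=0
    for entry in vicuna-7b-v0 l16_25m.pth umt_l16_qformer.pth videochat2_7b_stage3.pth; do
        if [ ! -e "$dir/$entry" ]; then
            echo "missing: $dir/$entry"
            missing=1
        fi
    done
    # Exit status 0 means everything is present, 1 means something is missing.
    return "$missing"
}
```

For example, `check_video_models video_models` prints one line per missing entry and returns a non-zero status if anything is absent.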

For validation, please refer to the Evaluation section below to download MVBench and put the dataset under the folder ./MVBench.

Then simply run the following command (remember to set the number of GPUs, NUM_GPUS, in the script first).

sh run_7b_stage4.sh

Evaluation

Download MVBench

Download from Hugging Face and place it under ./MVBench. The file structure under MVBench is:

- assert
- json
- video
.gitattributes
README.md

Prepare street-scene data (required if you want to use the extended MVBench data)

bash download_street_scnene.sh 

Prepare LV-Chat Model

Please download the model from LV-Chat. Put the pth file 7b_stage4.pth under the folder video_models.

Evaluate LV-Chat on MVBench

Run the script to test our model; the results will be written to logs:

bash run_mvbench.sh

You can also run the baseline (VideoChat2) using:

bash run_mvbench.sh --config ./configs/config_videochat2.json

Evaluate LV-Chat on Real-world datasets

TACoS

  1. Download TACoS dataset from here and place the videos folder under ./TACoS.
  2. Download the GPT-4-generated summaries:
wget -P ./TACoS https://huggingface.co/datasets/Kevin99z/tacos_summary/resolve/main/summary.json
  3. Evaluate TACoS
bash run_tacos.sh # add --config ./configs/config_videochat2.json to test the baseline

EgoSchema

  1. Download EgoSchema from here and place it under ./EgoSchema.
  2. Evaluate EgoSchema
bash run_egoschema.sh # add --config ./configs/config_videochat2.json to test the baseline
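To compare LV-Chat against the VideoChat2 baseline on the same benchmark, the two invocations shown above can be wrapped in one helper. This is a sketch only: run_both and the DRY_RUN variable are hypothetical names introduced here, while the script names and the --config flag are the ones used in this README.

```shell
# Run one benchmark script for LV-Chat, then for the VideoChat2 baseline.
# run_both and DRY_RUN are hypothetical; set DRY_RUN=1 to print the
# commands instead of executing them.
run_both() {
    script=$1
    baseline_cfg=./configs/config_videochat2.json
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "bash $script"
        echo "bash $script --config $baseline_cfg"
    else
        # The baseline run starts only if the LV-Chat run succeeds.
        bash "$script" && bash "$script" --config "$baseline_cfg"
    fi
}
```

For example, `run_both run_tacos.sh` or `run_both run_egoschema.sh`, run from the repository root.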

If you find our paper or code useful, please consider citing our paper.
