
Gemamba

This repository contains training code for the Gemamba multimodal language model.

Gemamba is the first multimodal LLM to combine a Mamba-based video encoder with the performant and flexible Gemma transformer LLM in a LLaVA-style architecture.

Getting started

We recommend using Dev Containers to set up the environment from the pre-made configuration.

  1. Install PyTorch.

  2. Install Python dependencies.

pip3 install -r requirements.txt

  3. Install VideoMamba dependencies:

pip3 install -e llava/model/multimodal_encoder/videomamba/causal-conv1d
pip3 install -e llava/model/multimodal_encoder/videomamba/mamba

[optional] Update transformers to get Phi-3 support:

pip3 install git+https://github.com/huggingface/transformers

  4. Download the pretrained weights for VideoMamba:

wget https://huggingface.co/OpenGVLab/VideoMamba/resolve/main/videomamba_m16_25M_f8_res224.pth

  5. Refer to run_finetune.ipynb to learn how to load a checkpoint and run inference; a rough sketch of that flow follows below.
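
The notebook is the authoritative reference for how loading and inference work in this fork. As a rough orientation only, the sketch below shows what that typically looks like for a LLaVA-style video model. It assumes the fork keeps LLaVA's load_pretrained_model builder, a <video> placeholder token in the prompt, and an images= argument on generate; the checkpoint path and dummy frames are placeholders, so adapt the names to whatever run_finetune.ipynb actually does.

import numpy as np
import torch
from PIL import Image

# Assumed to be kept from upstream LLaVA; check run_finetune.ipynb for the loader this fork actually uses.
from llava.model.builder import load_pretrained_model

# Placeholder path -- point this at the checkpoint from the Pretrained checkpoints section below.
model_path = "checkpoints/gemamba"

tokenizer, model, video_processor, context_len = load_pretrained_model(
    model_path, model_base=None, model_name="gemamba"
)

# The VideoMamba weights downloaded above expect 8 frames at 224x224 resolution;
# replace these black dummy frames with frames sampled from a real clip.
frames = [Image.fromarray(np.zeros((224, 224, 3), dtype=np.uint8)) for _ in range(8)]
video = video_processor(frames, return_tensors="pt")["pixel_values"]
video = video.to(model.device, dtype=torch.float16)

# Prompt format is a guess at a LLaVA-style chat template with a video placeholder token.
prompt = "USER: <video>\nWhat is happening in this clip? ASSISTANT:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

with torch.inference_mode():
    output_ids = model.generate(input_ids, images=video, max_new_tokens=128)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

Because the video tower here is VideoMamba rather than a CLIP ViT, the exact preprocessing and the shape of the video tensor may differ from upstream LLaVA; treat the notebook as the source of truth.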

Pretrained checkpoints

A pretrained checkpoint for the model can be found here: HF 🤗 (a download snippet follows the notes below).

  • The model's projector has been pretrained for 1 epoch on the Valley dataset.
  • The LLM and the projector have then been jointly fine-tuned using the Video-ChatGPT dataset.
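
If you would rather fetch the weights programmatically than through the web UI, huggingface_hub's snapshot_download can pull the whole checkpoint directory. The repository id below is a placeholder; the HF link above is the authoritative source.

from huggingface_hub import snapshot_download

# Placeholder repo id -- substitute the one behind the HF link above.
checkpoint_dir = snapshot_download(
    repo_id="<org>/<gemamba-checkpoint>",
    local_dir="checkpoints/gemamba",
)
print(checkpoint_dir)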

Training

We inherit most of the training workflow from the original LLaVA. Please refer to scripts/train for the configurations used to train the model, and to scripts/eval for the scripts used to calculate benchmark scores.
