
Feeling and building multimodal intelligence.

LMMs-Lab: Building Multimodal Intelligence

We are a group of researchers focused on large multimodal models (LMMs). We hope to bring insights to the community through our research.


For one week, six individuals lived together, capturing every moment through AI glasses and creating the EgoLife dataset. Building on it, we develop models and benchmarks to drive the future of AI life assistants that can recall past events, track habits, and provide personalized, long-context assistance to enhance daily life. This multi-person, multi-view, multimodal, long-term setting is just the beginning, unlocking new frontiers for AI assistants with truly deep understanding. 🚀

We conducted a speed-run to investigate the R1 paradigm in multimodal models after observing growing interest in R1 and studying the elegant implementation of the GRPO algorithm in open-r1 and trl.
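For readers unfamiliar with GRPO, here is a minimal sketch of its group-relative advantage computation: each prompt gets a group of sampled completions, and every completion's reward is normalized by the group's mean and standard deviation. This is a generic illustration of the idea, not the open-r1 or trl implementation; the function name and tensor shapes below are our own.

```python
# Minimal sketch of GRPO's group-relative advantage (illustrative only; the
# actual open-r1 / trl implementations also handle batching, clipping, and
# the KL penalty on top of this step).
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar rewards for each sampled completion."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each.
rewards = torch.tensor([[1.0, 0.0, 0.5, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(group_relative_advantages(rewards))
```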

We're on an exciting journey toward Artificial General Intelligence (AGI), pursued with an enthusiasm much like that of the 1960s moon landing. This journey is powered by advanced large language models (LLMs) and large multimodal models (LMMs), complex systems capable of understanding, learning, and performing a wide variety of human tasks.

To gauge how advanced these models are, we use a variety of evaluation benchmarks. These benchmarks are tools that help us understand the capabilities of these models, showing us how close we are to achieving AGI. To address this challenge, we introduce lmms-eval, an evaluation framework meticulously crafted for consistent and efficient evaluation of LMMs.
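As a quick orientation, an evaluation run can be launched from Python by invoking the lmms-eval CLI entry point, roughly as sketched below. The flag names follow the lm-evaluation-harness-style interface described in the lmms-eval README, and the model adapter (llava) and task (mme) are only example choices; check the repository for the options available in your version.

```python
# Rough sketch of launching an lmms-eval run by calling its CLI entry point.
# Flags mirror the interface documented in the lmms-eval README; the model
# and task below are placeholders, not recommendations.
import subprocess
import sys

subprocess.run(
    [
        sys.executable, "-m", "lmms_eval",
        "--model", "llava",          # model adapter registered in lmms-eval
        "--tasks", "mme",            # benchmark task(s), comma-separated
        "--batch_size", "1",
        "--log_samples",             # keep per-sample predictions for inspection
        "--output_path", "./logs/",
    ],
    check=True,
)
```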

VideoMMMU is a multimodal, multi-disciplinary video benchmark that evaluates a model's ability to acquire knowledge from educational videos.

Our dataset comprises 300 lecture-style videos spanning 6 professional disciplines: Art, Business, Science, Medicine, Humanities, and Engineering, with 30 subjects distributed among them.

VideoMMMU features a Knowledge Acquisition-based Question Design. Each video includes 3 question-answer pairs aligned with the three knowledge acquisition stages: Perception (identifying key information related to the knowledge), Comprehension (understanding the underlying concepts), and Adaptation (applying knowledge to new scenarios).

VideoMMMU proposes a knowledge acquisition metric (Δknowledge) to measure performance gains on practice exam questions after learning from videos. This metric enables us to quantitatively evaluate how effectively LMMs can assimilate and utilize the information presented in the videos to solve real-world, novel problems.
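Concretely, Δknowledge can be read as a normalized accuracy gain: how much of the model's remaining headroom on the exam questions is closed after it watches the video. The sketch below is our own paraphrase of that idea, not the benchmark's official scoring script; see the VideoMMMU repository for the reference implementation.

```python
# Sketch of a normalized knowledge-gain score in the spirit of VideoMMMU's
# Δknowledge: the accuracy improvement on exam questions after watching the
# video, normalized by the headroom the model had left. Illustrative
# paraphrase only, not the benchmark's official scoring code.
def delta_knowledge(acc_before: float, acc_after: float) -> float:
    """acc_before / acc_after are accuracies in [0, 1] on the same questions,
    answered without and with access to the lecture video."""
    headroom = 1.0 - acc_before
    if headroom == 0.0:
        return 0.0  # model was already perfect; no gain is measurable
    return (acc_after - acc_before) / headroom * 100.0

# Example: 45% correct without the video, 60% correct after watching it.
print(f"Δknowledge ≈ {delta_knowledge(0.45, 0.60):.1f}%")  # ≈ 27.3%
```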

We expanded the LLaVA-NeXT series with recent, stronger open LLMs and report our findings on these more capable language models. We maintain the efficient training strategy of previous LLaVA models: our models are supervised fine-tuned on the same data as the previous LLaVA-NeXT 7B/13B/34B models. Our current largest model, LLaVA-NeXT-110B, was trained on 128 H800-80G GPUs for 18 hours.

With stronger LLM backbones, LLaVA-NeXT achieves consistently better performance than prior open-source LMMs simply by increasing LLM capability, and it catches up to GPT-4V on selected benchmarks.

We report detailed ablations, including architectural modifications, enlarged visual tokens, and varied training strategies, to explore potential improvements in LLaVA-NeXT's performance.

We explore LLaVA-NeXT's capabilities in video understanding tasks, highlighting its strong performance. Key improvements include:

SoTA performance! Without seeing any video data, LLaVA-NeXT demonstrates strong zero-shot modality transfer, outperforming all existing open-source LMMs (e.g., LLaMA-VID) that were specifically trained on videos. Compared with proprietary models, it achieves performance comparable to Gemini Pro on NExT-QA and ActivityNet-QA.

Strong length generalization ability. Despite being trained under a 4096-token sequence length limit, LLaVA-NeXT generalizes remarkably well to longer sequences. This capability ensures robust performance even when processing long-frame content that exceeds the original token limit.

DPO pushes performance. Direct preference optimization (DPO) with AI feedback on videos yields significant performance gains.
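For context, the sketch below shows the standard DPO objective on preference pairs (a chosen and a rejected response per prompt), as introduced in the original DPO paper. It is a generic illustration rather than our training code, and the β value is an arbitrary placeholder.

```python
# Generic DPO loss on a batch of preference pairs (illustrative; not the
# LLaVA-NeXT training code). Inputs are summed log-probabilities of the
# chosen/rejected responses under the policy and the frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    # Log-ratio of policy vs. reference for each response.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between chosen and rejected log-ratios.
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()

# Example with dummy log-probs for 3 preference pairs.
lp = lambda *xs: torch.tensor(xs)
print(dpo_loss(lp(-12.0, -9.5, -11.0), lp(-13.5, -10.0, -12.5),
               lp(-12.5, -9.8, -11.2), lp(-13.0, -9.9, -12.4)))
```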

Pinned

  1. lmms-eval

     Accelerating the development of large multimodal models (LMMs) with one-click evaluation module - lmms-eval.

  2. Otter

     🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.

  3. LongVA

     Long Context Transfer from Language to Vision

  4. multimodal-sae

     Auto Interpretation Pipeline and many other functionalities for Multimodal SAE Analysis.

  5. open-r1-multimodal

     A fork to add multimodal model training to open-r1

  6. EgoLife

     [CVPR 2025] EgoLife: Towards Egocentric Life Assistant

