thomas-yanxin

Organizations

@ColugoMum @X-D-Lab

Stars

VLM

28 repositories

A family of lightweight multimodal models.

Python 1,051 75 Updated Nov 18, 2024

Mobile-Agent: The Powerful GUI Agent Family

Python 7,009 722 Updated Dec 2, 2025

✨✨Latest Advances on Multimodal Large Language Models

17,228 1,107 Updated Dec 26, 2025

[CVPR 2024] "LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning"; an interactive Large Language 3D Assistant.

Python 310 14 Updated Jul 17, 2024

[CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

Python 942 48 Updated Oct 16, 2024

One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks

Python 3,572 491 Updated Jan 20, 2026

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

Python 2,264 132 Updated May 30, 2025

Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks

Python 3,724 614 Updated Jan 15, 2026

Lumina-T2X is a unified framework for Text to Any Modality Generation

Python 2,248 94 Updated Feb 16, 2025

[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o-level performance.

Python 9,725 756 Updated Sep 22, 2025
Python 4,518 441 Updated Sep 14, 2025

Open Source framework for voice and multimodal conversational AI

Python 9,907 1,641 Updated Jan 20, 2026

[CVPR'25 highlight] RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness

Python 441 21 Updated May 14, 2025

[T-IV] This repository collects research papers of large Vision Language Models in Autonomous driving and Intelligent Transportation System. The repository will be continuously updated to track the…

441 32 Updated Apr 1, 2025

Multimodal Models in Real World

Jupyter Notebook 552 23 Updated Feb 24, 2025

SpeechGPT Series: Speech Large Language Models

Python 1,400 94 Updated Jul 22, 2024

[ECCV2024] Video Foundation Models & Data for Multimodal Understanding

Python 2,174 137 Updated Dec 15, 2025

A Repository for Single- and Multi-modal Speaker Verification, Speaker Recognition and Speaker Diarization

Python 2,740 242 Updated Dec 8, 2025

Data annotation toolbox supports image, audio and video data.

Python 1,473 160 Updated Oct 1, 2025

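Book: "Multimodal Large Models: A New-Generation Artificial Intelligence Technology Paradigm" (《多模态大模型:新一代人工智能技术范式》), by Liu Yang (刘阳) and Lin Liang (林倞)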

HTML 258 24 Updated Dec 5, 2024

[ICCV-2025] Official implementation of Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data

Python 95 3 Updated Jul 26, 2025

OpenVLA: An open-source vision-language-action model for robotic manipulation.

Python 5,053 612 Updated Mar 23, 2025

[ICLR & NeurIPS 2025] Repository for Show-o series, One Single Transformer to Unify Multimodal Understanding and Generation.

Python 1,856 83 Updated Jan 8, 2026

LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.

Python 3,111 219 Updated May 19, 2025

High-quality and streaming Speech-to-Speech interactive agent in a single file. A streaming, full-duplex speech interaction prototype agent implemented in just one file!

Python 491 52 Updated Dec 15, 2025

VaViM and VaVAM: Autonomous Driving through Video Generative Modeling (official repository).

Jupyter Notebook 138 8 Updated Jul 3, 2025