- Beijing, China
- https://thomas-yanxin.github.io/
- @thomas_yanxin
Highlights
VLM
Mobile-Agent: The Powerful GUI Agent Family
✨✨Latest Advances on Multimodal Large Language Models
[CVPR 2024] "LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning"; an interactive Large Language 3D Assistant.
[CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
Lumina-T2X is a unified framework for Text to Any Modality Generation
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance
Open Source framework for voice and multimodal conversational AI
[CVPR'25 highlight] RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness
[T-IV] This repository collects research papers on large Vision Language Models in autonomous driving and intelligent transportation systems. The repository will be continuously updated to track the…
SpeechGPT Series: Speech Large Language Models
[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
A Repository for Single- and Multi-modal Speaker Verification, Speaker Recognition and Speaker Diarization
Data annotation toolbox that supports image, audio, and video data.
[ICCV-2025] Official implementation of Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data
OpenVLA: An open-source vision-language-action model for robotic manipulation.
[ICLR & NeurIPS 2025] Repository for Show-o series, One Single Transformer to Unify Multimodal Understanding and Generation.
LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.
High-quality and streaming Speech-to-Speech interactive agent in a single file. A streaming, full-duplex voice interaction prototype agent implemented in just one file!
VaViM and VaVAM: Autonomous Driving through Video Generative Modeling (official repository).



