
- Beijing, China
- https://thomas-yanxin.github.io/
- @thomas_yanxin
Starred repositories
Roblox Foundation Model for 3D Intelligence
A generative world for general-purpose robotics & embodied AI learning.
[CVPR 2025] Source code for the paper "3D-Mem: 3D Scene Memory for Embodied Exploration and Reasoning"
DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping
VaViM and VaVAM: Autonomous Driving through Video Generative Modeling (official repository).
The Python library for real-time communication
[ICLR 2025] SOTA discrete acoustic codec models with 40/75 tokens per second for audio language modeling
Tencent Hunyuan3D-1.0: A Unified Framework for Text-to-3D and Image-to-3D Generation
Drop in a screenshot and convert it to clean code (HTML/Tailwind/React/Vue)
High-quality and streaming Speech-to-Speech interactive agent in a single file. A streaming, full-duplex voice interaction prototype agent implemented in just one file!
Data-Driven Astrology 💫 Kerykeion is a Python library for astrology. It generates SVG charts and extracts detailed data for birth charts, synastry, transits, and composite charts.
Quickly produce both human-readable and JSON-formatted astrology chart data based on the Swiss Ephemeris and astro.com.
Clone a voice in 5 seconds to generate arbitrary speech in real-time
TEN Agent is a conversational voice AI agent powered by TEN, integrating Deepseek, Gemini, OpenAI, RTC, and hardware like ESP32. It enables realtime AI capabilities like seeing, hearing, and speaki…
LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.
EvolKit is an innovative framework designed to automatically enhance the complexity of instructions used for fine-tuning Large Language Models (LLMs).
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs
End-to-end stack for WebRTC. SFU media server and SDKs.
[ICLR 2025] Repository for Show-o, One Single Transformer to Unify Multimodal Understanding and Generation.
A pipeline for LLM knowledge distillation
✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction