I'm a CS master's student at Lanzhou University, focused on two corners of multimodal ML that don't get along with each other as well as they should:
- 3D-aware vision-language models — making MLLMs reason about layout, occlusion, rotation, not just object identity.
- Speaker-side speech models — small, practical tools for screening cloned / synthetic voices, and for understanding what makes a TTS clip sound off.

Both interests stem from the same place: most "multimodal" systems treat each modality as a flat bag of tokens, and they fall down on the parts that need a little structure.

- I write papers in tmux + vim, but I'm not proud of it
- The slowest part of my workflow is convincing the cluster scheduler I deserve a GPU
- My first ML project was a digit classifier; my second was the same one, debugged

📍 Lanzhou, China · he/him · happy to chat about benchmarks and voice anti-spoofing
