A transformer-based video analysis tool that uses multimodal large language models for zero-shot behavioral classification across Drosophila species. It bypasses traditional computer vision pipelines and labels behavior directly from video, aiming for more direct and nuanced ethological insight.
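
As a rough illustration of what zero-shot classification with a multimodal model could look like, the sketch below samples a few frames from a clip, asks a model to choose one label from a fixed list, and maps the free-text reply back onto that list. Everything here is an assumption made for illustration and is not taken from the tool itself: the label set `BEHAVIORS`, the helper names, the prompt wording, and the `model` callable, which stands in for whatever multimodal LLM client is actually used.

```python
import re
from typing import Callable, Sequence

# Hypothetical label set; the tool's real behavior vocabulary is not given here.
BEHAVIORS = ("courtship", "grooming", "locomotion", "feeding", "other")

# Assumed model interface: (prompt, frames) -> free-text reply.
Model = Callable[[str, Sequence[bytes]], str]


def sample_frame_indices(n_frames: int, n_samples: int) -> list[int]:
    """Pick evenly spaced frame indices from a clip of n_frames frames."""
    if n_frames <= 0 or n_samples <= 0:
        return []
    n_samples = min(n_samples, n_frames)
    step = n_frames / n_samples
    # Take the centre of each of n_samples equal-width bins.
    return [int(step * i + step / 2) for i in range(n_samples)]


def build_prompt(species: str, labels: Sequence[str]) -> str:
    """Zero-shot prompt: no examples, only the species and the allowed labels."""
    options = ", ".join(labels)
    return (
        f"The attached frames show {species} in temporal order. "
        f"Answer with exactly one label from: {options}."
    )


def parse_label(reply: str, labels: Sequence[str], fallback: str = "other") -> str:
    """Return the single label named in the reply, or fallback if ambiguous."""
    words = set(re.findall(r"[a-z]+", reply.lower()))
    matches = [label for label in labels if label in words]
    return matches[0] if len(matches) == 1 else fallback


def classify_clip(
    frames: Sequence[bytes],
    species: str,
    model: Model,
    labels: Sequence[str] = BEHAVIORS,
    n_samples: int = 4,
) -> str:
    """Sample frames, query the model once, and normalise its answer."""
    picked = [frames[i] for i in sample_frame_indices(len(frames), n_samples)]
    reply = model(build_prompt(species, labels), picked)
    return parse_label(reply, labels)


def stub_model(prompt: str, frames: Sequence[bytes]) -> str:
    """Placeholder for a real multimodal LLM call; always gives the same reply."""
    return "Grooming."


if __name__ == "__main__":
    fake_clip = [b"frame"] * 100  # stand-in for decoded video frames
    print(classify_clip(fake_clip, "Drosophila melanogaster", stub_model))
```

Because the species name and label list are passed in as plain text, the same function can be pointed at another Drosophila species or a different behavior vocabulary without retraining anything, which is the sense in which this sketch is zero-shot. Replies that name no label, or more than one, fall back to `other` rather than being guessed.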