We investigate how Large Audio-Language Models (ALMs) can be adapted for L2 pronunciation training, focusing on mispronunciation detection and human-friendly feedback generation. The repository covers three pipelines:
- Cascaded ASR + LLM
- Existing Audio Language Models (ALMs)
- Instruction-Tuning Pipeline (Two-Stage)
All experiments were conducted on NVIDIA A40 GPUs (46GB) using strictly controlled prompts and evaluation protocols.
```bash
git clone <your-git-remote> ALMs4Learning
cd ALMs4Learning
conda env create -f environment_base.yml
conda env create -f environment_instruct_tuning.yml
conda activate ALMs_base
```

```
ALMs4Learning/
├── data/
│   ├── L2-Arctic-plus/          # L2-ARCTIC raw + augmented annotations
│   └── training_datasets/       # HuggingFace datasets for pretrain/finetune
├── pipelines/
│   ├── cascaded_asr_llms/       # ASR front-ends, LLM back-ends, parsing
│   ├── existing_alms/           # GPT-4o, Qwen Audio, Qwen2 Audio baselines
│   └── instruction_tuning/      # Model configs, training, inference, parsing
├── eval/                        # Unified evaluation script + results
├── image/
├── environment_base.yml
└── environment_instruct_tuning.yml
```

Request access at: https://psi.engr.tamu.edu/l2-arctic-corpus/
After submitting your information you will receive Google Drive links. Download "L2-ARCTIC-V5.0 (everything packed)", unpack it, and place:
```
./data/L2-Arctic-plus/l2arctic_release_v5.0/
```

This repository includes augmented prompts + GPT-4o annotations:

```
./data/L2-Arctic-plus/train_data.json   # (1,699 entries)
./data/L2-Arctic-plus/test_data.json    # (900 entries)
```

- train_data.json → instruction tuning
- test_data.json → evaluation & baselines
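A minimal sketch for loading these annotation files, assuming they are standard JSON (the per-entry fields are not shown here):

```python
import json

# Load the augmented annotations documented above.
with open("./data/L2-Arctic-plus/train_data.json") as f:
    train_data = json.load(f)
with open("./data/L2-Arctic-plus/test_data.json") as f:
    test_data = json.load(f)

# Expected sizes: 1,699 training entries and 900 test entries.
print(len(train_data), len(test_data))
```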
Pretraining Dataset:

```bash
python data/training_datasets/build/build_pretrain_dataset.py \
    --output_folder ./data/training_datasets/pretrain \
    --max_examples 200000
```

Finetuning Dataset:

```bash
python data/training_datasets/build/build_finetune_dataset.py \
    --output_folder ./data/training_datasets/finetune \
    --num_proc 10
```

Generated HuggingFace datasets:

```
./data/training_datasets/pretrain/
./data/training_datasets/finetune/
```
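To inspect the generated datasets, a sketch using the HuggingFace `datasets` library, assuming the build scripts write them with `save_to_disk`:

```python
from datasets import load_from_disk

# Assumes the build scripts wrote the datasets with Dataset.save_to_disk.
pretrain_ds = load_from_disk("./data/training_datasets/pretrain")
finetune_ds = load_from_disk("./data/training_datasets/finetune")

print(pretrain_ds)   # splits, features, and number of rows
print(finetune_ds)
```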
Run the ASR front-ends:

```bash
python ./pipelines/cascaded_asr_llms/asr/asr_whisper.py
python ./pipelines/cascaded_asr_llms/asr/wav2vec2.py
```

Each script loads train_data.json and test_data.json, runs ASR with Whisper or Wav2Vec2, and writes outputs to:

```
./pipelines/cascaded_asr_llms/asr/results/{train,test}/<asr>.json
```
Run the LLM back-ends:

```bash
python ./pipelines/cascaded_asr_llms/llms/generate.py
```

This script:

- loads ASR outputs
- normalizes transcript text
- formats prompts
- sends them to Mistral/Llama models (see the sketch below)

Results are saved to:

```
./pipelines/cascaded_asr_llms/llms/results/<llm>/<asr>.json
```
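For illustration, a sketch of the kind of normalization and prompt formatting generate.py performs; the prompt template, example texts, and model ID are placeholders rather than the repository's actual configuration:

```python
from transformers import pipeline

# Placeholder prompt template; generate.py defines the real one.
PROMPT = (
    "Reference text: {reference}\n"
    "Learner transcript (ASR): {transcript}\n"
    "List the mispronounced words and give a short suggestion for each."
)

def normalize(text: str) -> str:
    # Minimal transcript normalization for illustration.
    return " ".join(text.lower().strip().split())

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # example LLM back-end
    device_map="auto",
)

prompt = PROMPT.format(
    reference="the north wind and the sun",
    transcript=normalize("The nort wind and the san"),
)
print(generator(prompt, max_new_tokens=256, do_sample=False)[0]["generated_text"])
```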
Parse the LLM outputs:

```bash
python ./pipelines/cascaded_asr_llms/parse/parse.py
```

Outputs are placed under:

```
./pipelines/cascaded_asr_llms/parse/results/<llm>/<asr>.json
```

All outputs become normalized dictionaries:

```json
{
  "word": [
    {"issue": "...", "suggestion": "..."}
  ]
}
```
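Consuming a normalized dictionary is straightforward; the entry below is a made-up example that just follows the schema shown above:

```python
# Made-up example following the word -> [{issue, suggestion}] schema above.
entry = {
    "comfortable": [
        {
            "issue": "stress placed on the second syllable",
            "suggestion": "stress the first syllable: COM-for-ta-ble",
        }
    ]
}

for word, feedback in entry.items():
    for item in feedback:
        print(f"{word}: {item['issue']} -> {item['suggestion']}")
```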
This repository includes three strong audio-language model baselines:

- GPT-4o Audio
- Qwen Audio
- Qwen2 Audio
GPT-4o Audio communicates with the OpenAI gpt-4o-audio-preview endpoint using paired reference–utterance audio inputs:

```bash
python ./pipelines/existing_alms/alms/gpt4o_audio.py
```

Note: Set your OpenAI API key in gpt4o_audio.py.
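For orientation, a hedged sketch of a call to the audio-capable chat completions endpoint with base64-encoded WAV inputs; the file names and prompt are placeholders, and gpt4o_audio.py remains the authoritative implementation:

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def encode_wav(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# Placeholder file names: a native reference recording and the learner's utterance.
reference_b64 = encode_wav("reference.wav")
learner_b64 = encode_wav("learner.wav")

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text"],
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Compare the learner's pronunciation to the reference and describe any mispronounced words."},
            {"type": "input_audio", "input_audio": {"data": reference_b64, "format": "wav"}},
            {"type": "input_audio", "input_audio": {"data": learner_b64, "format": "wav"}},
        ],
    }],
)
print(response.choices[0].message.content)
```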
Qwen Audio and Qwen2 Audio run fully locally via HuggingFace implementations:
```bash
python ./pipelines/existing_alms/alms/qwen_audio.py
python ./pipelines/existing_alms/alms/qwen2_audio.py
```

All raw outputs are saved under:

```
./pipelines/existing_alms/alms/results/
```

They can be converted into structured formats using:

```bash
python ./pipelines/existing_alms/parse/parse.py
```

Parsed outputs are stored in:

```
./pipelines/existing_alms/parse/results/
```

The instruction-tuning pipeline has two stages: Stage 1 learns a universal audio-to-text projector, while Stage 2 adapts the LLM to L2 pronunciation feedback tasks.
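As a conceptual sketch (not the repository's actual module), a Stage-1 audio-to-text projector is typically a small MLP that maps acoustic-encoder hidden states into the LLM embedding space; the hidden sizes below are illustrative:

```python
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """Illustrative two-layer MLP projector: encoder features -> LLM embedding space."""

    def __init__(self, audio_dim: int = 1280, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, frames, audio_dim) from Whisper / Wav2Vec2
        return self.proj(audio_feats)  # (batch, frames, llm_dim)

# In Stage 1, only the projector is trained; the encoder and LLM stay frozen.
projector = AudioProjector()
dummy = torch.randn(2, 100, 1280)
print(projector(dummy).shape)  # torch.Size([2, 100, 4096])
```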
Pretraining is launched using the following script:
```bash
conda activate ALMs_instruct_tuning
bash ./pipelines/instruction_tuning/training/pretraining.sh
```

This script is a DeepSpeed wrapper around train_model.py and configures:
- LLM backbone
- acoustic encoder (Whisper / Wav2Vec2)
- modality builder
- per-device batch size
- gradient accumulation steps
- learning-rate scheduler
Note: Set your configuration in pretraining.sh.
Checkpoints are saved under:
```
pipelines/instruction_tuning/checkpoints/pretraining/{llm}_{asr}_{modality}/
```

Finetuning is launched via:

```bash
bash ./pipelines/instruction_tuning/training/finetuning.sh
```

Note: Set your configuration in finetuning.sh.
The script loads the pretrained projector from:
```
.../pretraining/.../non_lora_trainables.bin
```

LoRA is enabled during finetuning (see the sketch below), using the dataset located at:

```
./data/training_datasets/finetune/
```

Finetuned checkpoints are saved under:

```
./pipelines/instruction_tuning/checkpoints/finetuning/{llm}_{asr}_{modality}/
```
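For orientation, a hedged sketch of how LoRA adapters are typically attached with `peft`; the backbone, rank, alpha, and target modules below are illustrative defaults, not the values configured in finetuning.sh:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative backbone; finetuning.sh selects the actual LLM.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=16,                                  # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # common choice for Llama-style attention
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters (plus the projector) are updated
```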
Run inference with:

```bash
python ./pipelines/instruction_tuning/inference/inference.py
```

This loads a selected LoRA checkpoint and writes predictions to:

```
./pipelines/instruction_tuning/inference/results/{llm}_{asr}_{modality}.json
```

Normalize and structure inference outputs using:
```bash
python ./pipelines/instruction_tuning/parse/parse.py
```

Parsed results are saved to:

```
./pipelines/instruction_tuning/parse/results/{llm}_{asr}_{modality}.json
```

The unified evaluation script produces 8 metrics across two groups:
Mispronunciation Detection Evaluation (MDE)
- Accuracy
- Precision
- Recall
- F1
- Extra Word Ratio
Feedback Generation Evaluation (FGE)
- BLEU-2
- ROUGE-L
- BERTScore
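As a rough illustration of the detection-side metrics (not eval.py's exact implementation), word-level precision, recall, and F1 can be computed from the sets of predicted vs. annotated mispronounced words per utterance:

```python
def detection_scores(predicted: set[str], reference: set[str]) -> dict[str, float]:
    """Word-level precision/recall/F1 for mispronunciation detection (illustrative)."""
    tp = len(predicted & reference)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: the model flags two words, one of which is annotated as mispronounced.
print(detection_scores({"north", "sun"}, {"north", "wind"}))
# {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```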
Run:

```bash
python eval/eval.py
```

Produces:

```
./eval/eval_results.json
```

Format:
```json
[
  {
    "pipeline_type": "cascaded_asr_llms",
    "llm": "llama",
    "asr": "whisper",
    "variant": "large",
    "MDE": {...},
    "FGE": {...}
  },
  {
    "pipeline_type": "existing_alms",
    "alm": "gpt4o",
    "MDE": {...},
    "FGE": {...}
  }
]
```
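A small sketch for summarizing eval_results.json; since the metric keys inside MDE/FGE are abbreviated as `{...}` above, they are printed generically here:

```python
import json

with open("./eval/eval_results.json") as f:
    results = json.load(f)

for entry in results:
    # Cascaded entries carry llm/asr; existing-ALM entries carry alm.
    system = entry.get("alm") or f"{entry.get('llm')}+{entry.get('asr')}"
    print(entry["pipeline_type"], system)
    print("  MDE:", entry["MDE"])
    print("  FGE:", entry["FGE"])
```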