# Fine-tune MiDashengLM with ESC-50

## Pre-run checklist

Before running, ensure MDL-Toolkit is properly installed. Running the command below should print the help message of `mdl-toolkit`. If it fails, please check your installation. For more details, see the Installation Guide at ./installation.md.

> ### Note
> When running locally, it is strongly recommended to install MDL-Toolkit into an isolated virtual environment to avoid dependency issues. In notebooks, environment handling can be tricky, so it’s fine to install MDL-Toolkit into the notebook’s current environment.

> ### Note
> During the example execution, the ESC-50 dataset and the full MiDashengLM-7B bf16 weights will be downloaded over the network. Ensure stable connections to GitHub and Hugging Face and enough disk space.
>
> You can also configure MDL-Toolkit to download models from Modelscope. To do so, make sure you have installed the `modelscope` extra, then add `--from-modelscope true` to your `mdl-toolkit` commands.

In [1]:
# Install MDL-Toolkit, for example:
# !pip install mdl-toolkit
!mdl-toolkit --help

usage: mdl-toolkit [-h] {train,convert-dataset,inference} ...

options:
  -h, --help            show this help message and exit

subcommands:
  {train,convert-dataset,inference}
    train
    convert-dataset
    inference
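If you prefer to check the prerequisites from Python, the following is a minimal sketch using only the standard library. The function name `preflight` and the 30 GiB threshold are illustrative assumptions, not values required by MDL-Toolkit; adjust the threshold to your own setup.

```python
import shutil


def preflight(command="mdl-toolkit", path=".", min_free_gib=30.0):
    """Report whether `command` is on PATH and how much disk space is free.

    `min_free_gib` is an illustrative threshold, not an official requirement.
    """
    free_gib = shutil.disk_usage(path).free / 2**30
    return {
        "command_found": shutil.which(command) is not None,
        "free_gib": round(free_gib, 1),
        "enough_space": free_gib >= min_free_gib,
    }


print(preflight())
```

If `command_found` is `False`, the `mdl-toolkit` executable is not visible to the current environment, which usually points to an installation or virtual environment problem.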


## Data preparation

### Download and extract the ESC-50 dataset

Run the following command to download and extract the dataset. You can also obtain the dataset by other means; adjust paths in later steps accordingly.

> ### Network access
> This downloads the dataset (~615 MiB) from GitHub and may take some time. Please ensure your network is stable and you have sufficient storage space.

In [2]:
!wget -nc https://github.com/karoldvl/ESC-50/archive/master.zip -O ESC-50.zip
!unzip -o -q ESC-50.zip

File ‘ESC-50.zip’ already there; not retrieving.


Now the ESC-50 dataset should be available under `ESC-50-master`.

### Convert the dataset to the required training format

ESC-50 ships a metadata CSV that lists every clip and assigns it to one of five folds, `1` to `5`. The header and the first 10 rows are:

In [3]:
!head -n 11 ESC-50-master/meta/esc50.csv

filename,fold,target,category,esc10,src_file,take
1-100032-A-0.wav,1,0,dog,True,100032,A
1-100038-A-14.wav,1,14,chirping_birds,False,100038,A
1-100210-A-36.wav,1,36,vacuum_cleaner,False,100210,A
1-100210-B-36.wav,1,36,vacuum_cleaner,False,100210,B
1-101296-A-19.wav,1,19,thunderstorm,False,101296,A
1-101296-B-19.wav,1,19,thunderstorm,False,101296,B
1-101336-A-30.wav,1,30,door_wood_knock,False,101336,A
1-101404-A-34.wav,1,34,can_opening,False,101404,A
1-103298-A-9.wav,1,9,crow,False,103298,A
1-103995-A-30.wav,1,30,door_wood_knock,False,103995,A


We split the dataset into training and test sets by `fold`: the code below holds out fold `5` for testing and uses folds `1` to `4` for training. Each row is formatted so that the model predicts the category in the form `category: <category>, target: <target>`. Feel free to modify the code, for example to make the model output JSON.

In [4]:
import csv
import os
from pathlib import Path

esc50_base = Path("ESC-50-master")
meta_file = esc50_base / "meta" / "esc50.csv"
train_output = Path("train.csv")
test_output = Path("test.csv")

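# newline="" lets the csv module control line endings on every platform.
with (
    open(meta_file, "r", newline="") as meta,
    open(train_output, "w", newline="") as train,
    open(test_output, "w", newline="") as test,
):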
    reader = csv.DictReader(meta)
    train_writer = csv.DictWriter(train, fieldnames=["audio", "prediction"])
    test_writer = csv.DictWriter(test, fieldnames=["audio", "prediction"])
    train_writer.writeheader()
    test_writer.writeheader()

    for row in reader:
        writer = train_writer if row["fold"] != "5" else test_writer
        writer.writerow(
            {
                "audio": os.fspath(esc50_base / "audio" / row["filename"]),
                "prediction": f"category: {row['category']}, target: {row['target']}",
            }
        )
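As one possible variation on the label format, here is a small sketch that encodes each label as JSON instead of the `category: ..., target: ...` string. The helper name `json_prediction` is hypothetical; to use it, you would swap it in for the f-string that builds the `prediction` value above and adjust the system prompt to match.

```python
import json


def json_prediction(row):
    """Encode one ESC-50 metadata row as a compact JSON label string."""
    return json.dumps({"category": row["category"], "target": int(row["target"])})


example_row = {"filename": "1-100032-A-0.wav", "fold": "1", "target": "0", "category": "dog"}
print(json_prediction(example_row))  # {"category": "dog", "target": 0}
```

`csv.DictWriter` quotes the resulting string automatically, so the embedded commas and double quotes do not break the CSV.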

In [5]:
!echo '==== Train split ===='
!head -n 11 train.csv
!echo '==== Test split  ===='
!head -n 11 test.csv

==== Train split ====
audio,prediction
ESC-50-master/audio/1-100032-A-0.wav,"category: dog, target: 0"
ESC-50-master/audio/1-100038-A-14.wav,"category: chirping_birds, target: 14"
ESC-50-master/audio/1-100210-A-36.wav,"category: vacuum_cleaner, target: 36"
ESC-50-master/audio/1-100210-B-36.wav,"category: vacuum_cleaner, target: 36"
ESC-50-master/audio/1-101296-A-19.wav,"category: thunderstorm, target: 19"
ESC-50-master/audio/1-101296-B-19.wav,"category: thunderstorm, target: 19"
ESC-50-master/audio/1-101336-A-30.wav,"category: door_wood_knock, target: 30"
ESC-50-master/audio/1-101404-A-34.wav,"category: can_opening, target: 34"
ESC-50-master/audio/1-103298-A-9.wav,"category: crow, target: 9"
ESC-50-master/audio/1-103995-A-30.wav,"category: door_wood_knock, target: 30"
==== Test split  ====
audio,prediction
ESC-50-master/audio/5-103415-A-2.wav,"category: pig, target: 2"
ESC-50-master/audio/5-103416-A-2.wav,"category: pig, target: 2"
ESC-50-master/audio/5-103418-A-2.wav,"category: pig, t
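Before moving on, it can be worth checking that both splits have the expected size and that every referenced audio file exists. The sketch below is one way to do that; `check_split` is a hypothetical helper, not part of MDL-Toolkit. With the split used here, you would expect 1600 training rows and 400 test rows.

```python
import csv
from pathlib import Path


def check_split(csv_path):
    """Return (row_count, missing_audio_paths) for a training-format CSV."""
    missing = []
    count = 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            count += 1
            if not Path(row["audio"]).is_file():
                missing.append(row["audio"])
    return count, missing


for name in ("train.csv", "test.csv"):
    if Path(name).is_file():
        count, missing = check_split(name)
        print(f"{name}: {count} rows, {len(missing)} missing audio files")
```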

### Check the pretrained model’s output

MiDashengLM doesn’t know the ESC-50 label space out of the box, so it may output categories that are not part of the dataset. In this tutorial, we’ll fine-tune the model to align its outputs with the expected labels and format. In practice, carefully designed prompts or decoding constraints may improve outputs without fine-tuning; choose what fits your use case.

MDL-Toolkit provides a convenient inference command to quickly run inference without writing code. The input format matches training, except the `prediction` column is optional; the command will write predictions into `prediction` and preserve other columns. Because the inference input format is compatible with training, we can reuse the `test.csv` generated above to observe the base model’s outputs.
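Since the `prediction` column is optional for inference, unlabeled audio can be listed in a CSV that has only an `audio` column. The following is a minimal sketch of building such a file; the helper name `write_inference_csv` and the paths in the usage note are hypothetical.

```python
import csv
from pathlib import Path


def write_inference_csv(audio_dir, output_csv, pattern="*.wav"):
    """Write a CSV with a single `audio` column listing matching files."""
    paths = sorted(Path(audio_dir).glob(pattern))
    with open(output_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["audio"])
        writer.writeheader()
        for path in paths:
            writer.writerow({"audio": path.as_posix()})
    return len(paths)
```

For example, `write_inference_csv("my-audio", "unlabeled.csv")` would list every `.wav` file directly under `my-audio/`, and the resulting file could then be passed to the inference command in place of `test.csv`.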

Arguments:
- `--model-name mispeech/midashenglm-7b-bf16`: the Hugging Face repo or local path of the model to use.

> ### Note
> This tutorial uses bf16 weights to reduce download and disk usage. If you already have the full fp32 weights, you may use `--model-name mispeech/midashenglm-7b` instead of `--model-name mispeech/midashenglm-7b-bf16` to avoid re-downloading.

In [6]:
!mdl-toolkit inference \
    test.csv \
    --system-prompt "Output the predicted category in the format of category: <category>, category_id: <category_id>." \
    --output orig-output.csv \
    --model-name mispeech/midashenglm-7b-bf16
! head -n 11 orig-output.csv

Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
Loading checkpoint shards: 100%|██████████████████| 4/4 [00:04<00:00,  1.03s/it]
Generating train split: 400 examples [00:00, 5910.05 examples/s]
Processing dataset: 100%|██████████████| 400/400 [00:18<00:00, 22.22 examples/s]
Batching examples (num_proc=32): 100%|█| 400/400 [00:03<00:00, 127.43 examples/s
Inference: 100%|████████████████████████████████| 32/32 [00:32<00:00,  1.02s/it]
audio,prediction
ESC-50-master/audio/5-103415-A-2.wav,"category: livestock, category_id: 1"
ESC-50-master/audio/5-103416-A-2.wav,"category: music, category_id: 1"
ESC-50-master/audio/5-103418-A-2.wav,"category: pig, category_id: 1"
ESC-50-master/audio/5-103420-A-2.wav,"category: animal, category_id: 1"
ESC-50-master/audio/5-103421-A-2.wav,"category: pig, category_id: 1"
ESC-50-master/audio/5-103422-A-2.wav,"category: animal, category_id: 1"
ESC-50-master/audio/5-117118-A-42.wav,"category: alarm, category_id: 1"
ESC-50-master

### Convert the data

To speed up training, we convert the dataset ahead of time. The commands below convert the CSV files and set a simple system prompt. You can skip this step and convert during training; in that case, replace dataset paths in later steps with the CSV paths, and expect some conversion time before each run.

> ### Network access
> The following command will download the tokenizer from Hugging Face. Ensure your network is stable and wait patiently.
>
> To download from Modelscope, add `--from-modelscope true` to the commands and ensure you have installed the `modelscope` extra.

Arguments:
- `train.csv`: path to the input CSV file.
- `--output train-converted/`: output directory for the converted dataset; the directory is created, and any existing content is overwritten.
- `--system-prompt ...`: a simple system prompt to guide the model’s behavior.

In [7]:
!mdl-toolkit convert-dataset \
    train.csv \
    --output train-converted/ \
    --system-prompt "Output the predicted category in the format of category: <category>, category_id: <category_id>."
!mdl-toolkit convert-dataset \
    test.csv \
    --output test-converted/ \
    --system-prompt "Output the predicted category in the format of category: <category>, category_id: <category_id>."

Generating train split: 1600 examples [00:00, 27101.33 examples/s]
Processing dataset: 100%|████████████| 1600/1600 [00:46<00:00, 34.10 examples/s]
Deriving labels for training (num_proc=32): 100%|█| 1600/1600 [00:02<00:00, 667.
Saving the dataset (2/2 shards): 100%|█| 1600/1600 [00:03<00:00, 522.67 examples
Processing dataset: 100%|██████████████| 400/400 [00:16<00:00, 23.59 examples/s]
Deriving labels for training (num_proc=32): 100%|█| 400/400 [00:02<00:00, 172.46
Saving the dataset (1/1 shards): 100%|█| 400/400 [00:00<00:00, 1083.64 examples/


## Train the model

We want the model to classify audio while following a strict output format. Learning the format is a simple task, so a modest LoRA rank is sufficient. This tutorial uses bf16 weights and evaluates only every 100 steps to save compute.

> ### Network access
> The command will download model weights from Hugging Face and may take some time. Ensure a stable network, enough storage space, and wait patiently.
>
> To download from Modelscope, add `--from-modelscope true` and ensure you have installed the `modelscope` extra.

> ### Note
> We recommend training on a high-performance GPU for speed; MDL-Toolkit will automatically detect and use available GPUs. By default, training uses a single GPU. If you have multiple GPUs, see the Distributed Training Guide at ./distributed.md. Avoid CPU-only training as it will be very slow.
>
> To run with bf16 precision, you’ll need about 18 GiB of VRAM. If VRAM is limited, try adding `--quantization 8bit` or `--quantization 4bit` to quantize the model on load with bitsandbytes. Note that quantization may reduce capabilities and lead to suboptimal results.
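To see roughly where the memory goes, the sketch below estimates the size of the weights alone at different precisions. The parameter count is the total reported in this tutorial's training log; the bytes-per-parameter figures for the quantized modes are idealized assumptions that ignore bitsandbytes overhead and any layers left unquantized, so treat those results as rough lower bounds.

```python
GIB = 2**30

# Total parameter count as reported in this tutorial's training log
# (it includes the LoRA adapter parameters).
TOTAL_PARAMS = 8_350_708_352

# Idealized bytes per parameter; the quantized figures ignore per-block
# overhead and unquantized layers, so they are rough lower bounds.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "8bit": 1.0, "4bit": 0.5}


def weight_gib(num_params, bytes_per_param):
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / GIB


for name, size in BYTES_PER_PARAM.items():
    print(f"{name}: {weight_gib(TOTAL_PARAMS, size):.1f} GiB")
```

The bf16 estimate of about 15.6 GiB is close to the peak of 15.684 GiB that the training log reports during loading. The remainder of the roughly 18 GiB budget goes to activations, gradients, and optimizer state for the LoRA parameters.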

Arguments:
- `--lora-rank 32`: set LoRA rank to 32. For more complex tasks, consider increasing the rank.
- `--eval-steps 100`: evaluate every 100 steps.
- `--train-dataset train-converted/`: training dataset path; you can also specify the CSV path to convert on the fly.
- `--eval-dataset test-converted/`: evaluation dataset path; you can also specify the CSV path. If omitted, evaluation is skipped.
- `--output output/`: output directory for checkpoints and results; the directory is created, and any existing content is overwritten.

In [8]:
!mdl-toolkit train \
     --lora-rank 32 \
     --eval-steps 100 \
     --train-dataset train-converted/ \
     --eval-dataset test-converted/ \
     --output output/ \
     --model-name mispeech/midashenglm-7b-bf16

Distributed: NO
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
Loading checkpoint shards: 100%|██████████████████| 4/4 [00:04<00:00,  1.01s/it]
Model loaded with torch.bfloat16
trainable params: 68,968,448 || all params: 8,350,708,352 || trainable%: 0.8259
Peak VRAM during loading: 15.684 GiB
  0%|                                                   | 0/200 [00:00<?, ?it/s]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.
{'loss': 3.8798, 'grad_norm': 3.2964117527008057, 'learning_rate': 0.0001, 'epoch': 0.01}
{'loss': 3.6867, 'grad_norm': 3.0418500900268555, 'learning_rate': 9.95e-05, 'epoch': 0.01}
{'loss': 3.1348, 'grad_norm': 2.080457925796509, 'learning_rate': 9.900000000000001e-05, 'epoch': 0.01}
{'loss': 2.8754, 'grad_norm': 2.0675864219665527, 'learning_rate': 9.850000000000001e-05, 'epoch': 0.02}
{'lo
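The first log lines are consistent with a single epoch of 200 optimizer steps and a learning rate that decays linearly from `1e-4` to zero. This is an inference from the numbers shown above, not documented MDL-Toolkit behaviour, so take the sketch below as a way to sanity-check your own logs rather than as a description of the implementation.

```python
TRAIN_EXAMPLES = 1600  # rows in the training split, per the conversion log
TOTAL_STEPS = 200      # optimizer steps shown by the progress bar
PEAK_LR = 1e-4         # first learning rate in the log


def linear_decay_lr(step, total_steps=TOTAL_STEPS, peak_lr=PEAK_LR):
    """Learning rate after `step` optimizer steps under linear decay to zero."""
    return peak_lr * (total_steps - step) / total_steps


# Assuming one epoch, each optimizer step consumes this many examples.
print("examples per optimizer step:", TRAIN_EXAMPLES // TOTAL_STEPS)
for step in range(4):
    print(step, linear_decay_lr(step))
```

Under these assumptions, each optimizer step covers 8 examples, and the computed learning rates match the logged values of `0.0001`, `9.95e-05`, `9.9e-05`, and `9.85e-05` up to floating-point rounding.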

## Inference

After training, use the fine-tuned model for inference. By default, the LoRA adapters are merged into the weights, so you can load the fine-tuned model the same way as the base model by pointing to its output path:

In [9]:
# Set TOKENIZERS_PARALLELISM=0 in the Notebook to prevent warnings from
# messing up the output. Do not set this outside the Notebook, as it may
# cause reduced performance.
import os

os.environ["TOKENIZERS_PARALLELISM"] = "0"

import torch
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_id = "./output/final/"

model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

model.eval()

messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "Output the predicted category in the format of category: <category>, category_id: <category_id>.",
            },
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": "ESC-50-master/audio/5-103415-A-2.wav"},
        ],
    },
]

with torch.no_grad():
    model_inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        add_special_tokens=True,
        return_dict=True,
    ).to(device=model.device, dtype=model.dtype)
    generation = model.generate(**model_inputs)
    output = tokenizer.batch_decode(generation, skip_special_tokens=True)

print(output)

Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
Loading checkpoint shards: 100%|██████████████████| 4/4 [00:15<00:00,  3.78s/it]


['category: pig, target: 2']
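To classify several clips, the message structure from the cell above can be wrapped in a small helper. This is a sketch only: `build_messages` is a hypothetical name, and it simply reproduces the same system and user messages for a given audio path.

```python
def build_messages(audio_path, system_prompt):
    """Build the chat messages for one audio clip, mirroring the cell above."""
    return [
        {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
        {"role": "user", "content": [{"type": "audio", "path": audio_path}]},
    ]


SYSTEM_PROMPT = (
    "Output the predicted category in the format of "
    "category: <category>, category_id: <category_id>."
)

conversations = [
    build_messages(path, SYSTEM_PROMPT)
    for path in [
        "ESC-50-master/audio/5-103415-A-2.wav",
        "ESC-50-master/audio/5-103416-A-2.wav",
    ]
]
print(len(conversations), "conversations prepared")
```

Each element of `conversations` can then be passed to `processor.apply_chat_template` in place of `messages`, one conversation at a time, exactly as in the cell above.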


We can also use the MDL-Toolkit inference command to obtain predictions:

In [10]:
!mdl-toolkit inference \
    test.csv \
    --system-prompt "You are a helpful audio classifier." \
    --user-prompt "Output the predicted category in the format of category: <category>, category_id: <category_id>." \
    --output finetuned-output.csv \
    --model-name ./output/final/

Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
Loading checkpoint shards: 100%|██████████████████| 4/4 [00:04<00:00,  1.24s/it]
Processing dataset: 100%|██████████████| 400/400 [00:10<00:00, 37.82 examples/s]
Batching examples (num_proc=32): 100%|█| 400/400 [00:03<00:00, 119.73 examples/s
Inference: 100%|████████████████████████████████| 32/32 [00:33<00:00,  1.06s/it]


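The predictions are written to the specified CSV file with the same schema as the training data. After fine-tuning, the model should produce outputs that follow the format of the training labels, `category: <category>, target: <target>`. The cell below prints the base model's predictions first, followed by the fine-tuned model's: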

In [11]:
! head -n 11 orig-output.csv
! head -n 11 finetuned-output.csv

audio,prediction
ESC-50-master/audio/5-103415-A-2.wav,"category: livestock, category_id: 1"
ESC-50-master/audio/5-103416-A-2.wav,"category: music, category_id: 1"
ESC-50-master/audio/5-103418-A-2.wav,"category: pig, category_id: 1"
ESC-50-master/audio/5-103420-A-2.wav,"category: animal, category_id: 1"
ESC-50-master/audio/5-103421-A-2.wav,"category: pig, category_id: 1"
ESC-50-master/audio/5-103422-A-2.wav,"category: animal, category_id: 1"
ESC-50-master/audio/5-117118-A-42.wav,"category: alarm, category_id: 1"
ESC-50-master/audio/5-117120-A-42.wav,"category: alarm, category_id: 1"
ESC-50-master/audio/5-117122-A-42.wav,"category: alarm, category_id: 1"
ESC-50-master/audio/5-117250-A-2.wav,"category: animal, category_id: 1"
audio,prediction
ESC-50-master/audio/5-103415-A-2.wav,"category: pig, target: 2"
ESC-50-master/audio/5-103416-A-2.wav,"category: door_wood_creaks, target: 33"
ESC-50-master/audio/5-103418-A-2.wav,"category: pig, target: 2"
ESC-50-master/audio/5-10
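To quantify the improvement rather than inspect rows by eye, one option is to compare each prediction file against the labels in `test.csv`. The sketch below assumes the label format used in this tutorial; `parse_label`, `accuracy`, and `read_rows` are hypothetical helpers, and predictions that do not match the format are counted as wrong.

```python
import csv
import re
from pathlib import Path

LABEL_RE = re.compile(r"category:\s*(?P<category>\w+),\s*target:\s*(?P<target>\d+)")


def parse_label(text):
    """Return (category, target) from a label string, or None if malformed."""
    match = LABEL_RE.fullmatch(text.strip())
    return (match["category"], int(match["target"])) if match else None


def accuracy(reference_rows, predicted_rows):
    """Fraction of clips whose parsed prediction equals the reference label."""
    if not predicted_rows:
        return 0.0
    reference = {row["audio"]: parse_label(row["prediction"]) for row in reference_rows}
    correct = 0
    for row in predicted_rows:
        predicted = parse_label(row["prediction"])
        if predicted is not None and predicted == reference.get(row["audio"]):
            correct += 1
    return correct / len(predicted_rows)


def read_rows(path):
    """Load a CSV file as a list of dictionaries."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


if Path("test.csv").is_file():
    for name in ("orig-output.csv", "finetuned-output.csv"):
        if Path(name).is_file():
            score = accuracy(read_rows("test.csv"), read_rows(name))
            print(f"{name}: accuracy {score:.3f}")
```

Because the base model answers with `category_id` instead of `target`, its predictions fail to parse and score zero under this strict metric; relax `LABEL_RE` if you only want to compare category names.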

## How to improve performance

Congrats on completing your first fine-tune! Hyperparameters in this tutorial may not be optimal for all tasks. If you’re not satisfied with the results, try the following:

1. Increase the LoRA rank, e.g., `--lora-rank 64`.
2. Tune the learning rate, e.g., `--lr 5e-5`. The best LR depends on many factors and may require multiple trials or a systematic search.
3. Adjust trainable targets, e.g., `--train-target encoder --train-target projector --train-target decoder --train-target embed_tokens --train-target lm_head` to train all available targets. In some cases, adding targets, especially `embed_tokens` and `lm_head`, can improve results.
4. Use higher numerical precision. If you used quantization, try running without it. If you didn’t, you can set `--bf16 false` to load fp32 weights.
5. Increase both the quantity and quality of training data. This often helps, although naively repeating the same data (e.g., setting `--num-epochs` to a value greater than 1) may have limited effect or even hurt performance.
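If you want to try several of these settings systematically, a small script can generate one training command per combination. The sketch below only uses flags that appear in this tutorial; the output directory naming scheme and the helper name `sweep_commands` are hypothetical, and the candidate values are examples rather than recommendations.

```python
import itertools
import shlex

BASE_COMMAND = [
    "mdl-toolkit", "train",
    "--eval-steps", "100",
    "--train-dataset", "train-converted/",
    "--eval-dataset", "test-converted/",
    "--model-name", "mispeech/midashenglm-7b-bf16",
]


def sweep_commands(lora_ranks, learning_rates):
    """Yield one training command per (rank, learning rate) combination."""
    for rank, lr in itertools.product(lora_ranks, learning_rates):
        output_dir = f"output-r{rank}-lr{lr}/"  # hypothetical naming scheme
        yield BASE_COMMAND + [
            "--lora-rank", str(rank),
            "--lr", str(lr),
            "--output", output_dir,
        ]


for command in sweep_commands([32, 64], ["1e-4", "5e-5"]):
    print(shlex.join(command))
```

The script only prints the commands. To launch them one after another, you could pass each list to `subprocess.run(command, check=True)`; keeping a separate output directory per run avoids overwriting earlier results.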