# Evaluation Results (Tables 1, 2, and 3 in the Paper)

- **Pretrained Single_Velocity_HPT**  
   Reported by *Kong et al. (2021)*; uses AMT-estimated MIDI notes for evaluation.  
   → This reflects the performance of the **original Automatic Music Transcription (AMT)** systems.

- **Retrained Single_Velocity_HPT**  
   Uses the **ground-truth MIDI notes** for evaluation.  
   → This simulates a use case where **the MIDI transcription timing has already been corrected** (either manually or via audio–MIDI alignment tools).

- **Dual_Velocity_HPT** and **Triple_Velocity_HPT**  
   Follow the same setup as the retrained Single_Velocity_HPT; the goal is to **refine MIDI velocity prediction** given known note timings.
<br>
<br>
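
Evaluating with known note timings can be sketched as follows. This is a minimal illustration, not the repository's scoring code: the note tuples, the `velocity_roll` layout, and the helper `velocity_errors` are hypothetical, and taking the maximum over a note's frames is an assumed read-out rule.

```python
from statistics import mean, pstdev


def velocity_errors(notes, velocity_roll):
    """Absolute velocity error per note, using known note timings.

    notes: list of (pitch_index, onset_frame, offset_frame, true_velocity).
    velocity_roll: velocity_roll[frame][pitch_index] is a predicted velocity (0-127).
    Both layouts are hypothetical and chosen only for illustration.
    """
    errors = []
    for pitch, onset, offset, true_velocity in notes:
        # Assumed read-out: largest predicted velocity within the note's frames.
        predicted = max(velocity_roll[frame][pitch] for frame in range(onset, offset))
        errors.append(abs(predicted - true_velocity))
    return errors


# Toy data: 4 frames, 2 pitches.
velocity_roll = [
    [60, 0],
    [70, 0],
    [65, 40],
    [0, 50],
]
notes = [(0, 0, 3, 64), (1, 2, 4, 45)]

errors = velocity_errors(notes, velocity_roll)
print(errors, mean(errors), pstdev(errors))
```

Because the note boundaries come from the ground truth, the error in this sketch reflects only the velocity estimate, not onset or pitch detection.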

The results are available at this link (available_later_for_anonymous). Our code automatically uploads all evaluation results to the wandb logger, which generates this online report. The following code is for a single inference run:

In [None]:
!python pytorch/calculate_scores.py exp.batch_size=30 \
  model.name="FiLMUNetPretrained" model.input2="frame" exp.ckpt_iteration=1000000 \
  dataset.test_set=smd \
  feature.audio_feature=logmel feature.sample_rate=16000 feature.segment_seconds=2.0 feature.hop_seconds=1.0 feature.frames_per_second=100

Evaluation Mode : Kim et al.
Model Name      : FiLMUNetPretrained+frame
Test Set        : smd
Kim Eval: 100%|███████████████| 49/49 [09:30<00:00, 11.64s/file, frame_err=9.96]

===== Kim-style Average Metrics =====
frame_max_error: 9.9590
frame_max_std: 8.3273
f1_score: 0.5935
precision: 0.5944
recall: 0.8877
onset_masked_error: 16.0383
onset_masked_std: 12.4601
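
The averages printed above are plain `name: value` lines, so they can be collected into a dictionary for comparison across runs. A minimal sketch; `parse_metrics` is a hypothetical helper and not part of the repository.

```python
def parse_metrics(log_text):
    """Collect `name: value` lines whose value parses as a float."""
    metrics = {}
    for line in log_text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # no colon, e.g. the "=====" banner
        try:
            metrics[name.strip()] = float(value)
        except ValueError:
            # Skip header lines such as "Test Set        : smd" and progress bars.
            continue
    return metrics


log_text = """\
Evaluation Mode : Kim et al.
Test Set        : smd
===== Kim-style Average Metrics =====
frame_max_error: 9.9590
frame_max_std: 8.3273
f1_score: 0.5935
onset_masked_error: 16.0383
"""

metrics = parse_metrics(log_text)
print(metrics["frame_max_error"], metrics["onset_masked_error"])
```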


In [8]:
!python pytorch/inference.py \
  --mode dataset \
  --velocity-method max_frame \
  --overrides \
    model.name=FiLMUNetPretrained model.input2=frame \
    dataset.test_set=smd exp.ckpt_iteration=1000000 \
    feature.sample_rate=16000 feature.segment_seconds=2.0 feature.hop_seconds=1.0 \
    feature.audio_feature=logmel feature.frames_per_second=100

Inference Mode : DATASET (single checkpoint)
Model Name     : FiLMUNetPretrained+frame
Test Set       : smd
Using Device   : cuda
Feature Config : logmel | sr=16000 | fps=100 | seg=2.0s
Checkpoint     : /media/datadisk/home/22828187/zhanh/202510_hpt_data/workspaces/checkpoints/FiLMUNetPretrained+frame/1000000_iterations.pth
Proc 1000000 Ckpt: 100%|██████████████████████| 49/49 [07:29<00:00,  9.17s/file]

[Done] Dataset inference finished in 449.41 sec
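
The feature overrides fix the time grid the model sees. As a worked example, assuming the hop length is `sample_rate / frames_per_second` (a common convention, not confirmed from this repository's code), the values from the command above give:

```python
sample_rate = 16000
frames_per_second = 100
segment_seconds = 2.0
hop_seconds = 1.0

hop_samples = sample_rate // frames_per_second             # samples between frames
segment_samples = int(segment_seconds * sample_rate)       # samples per segment
segment_frames = int(segment_seconds * frames_per_second)  # frames per segment
overlap = 1.0 - hop_seconds / segment_seconds              # overlap of consecutive segments

print(hop_samples, segment_samples, segment_frames, overlap)
```

Under that assumption each frame advances by 160 samples, a 2.0 s segment spans 200 frames, and consecutive segments overlap by half.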


In [None]:
!python pytorch/inference.py \
  --mode dataset_velo_score \
  --velocity-method max_frame \
  --overrides \
    model.name=FiLMUNetPretrained model.input2=frame \
    dataset.test_set=smd exp.ckpt_iteration=1000000 \
    feature.sample_rate=16000 feature.segment_seconds=2.0 feature.hop_seconds=1.0 \
    feature.audio_feature=logmel feature.frames_per_second=100


Inference Mode : DATASET_SCORE (single checkpoint)
Model Name     : FiLMUNetPretrained+frame
Test Set       : smd
Using Device   : cuda
Feature Config : logmel | sr=16000 | fps=100 | seg=2.0s
Checkpoint     : /media/datadisk/home/22828187/zhanh/202510_hpt_data/workspaces/checkpoints/FiLMUNetPretrained+frame/1000000_iterations.pth
MIDI Output    : /media/datadisk/home/22828187/zhanh/202510_hpt_data/workspaces/dataset_score/smd/FiLMUNetPretrained+frame/1000000_iterations/midis
Results Dir    : /media/datadisk/home/22828187/zhanh/202510_hpt_data/workspaces/dataset_score/smd/FiLMUNetPretrained+frame/1000000_iterations
Dataset Score: 100%|█████████| 49/49 [13:55<00:00, 17.05s/file, frame_err=10.00]

===== Dataset Score (velocity-picked) Averages =====
frame_max_error: 10.0024
frame_max_std: 8.2800
f1_score: 0.8426
precision: 0.9868
recall: 0.7838
frame_mask_f1: 0.3493
frame_mask_precision: 0.5000
frame_mask_recall: 0.2838
onset_masked_error: 10.1006
onset_masked_std: 8.3790
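
The reported `f1_score` need not equal the harmonic mean of the reported `precision` and `recall`. Assuming the averages are taken per file (the "Averages" header suggests this, but the notebook does not confirm it), the mean of per-file F1 scores differs from the F1 of the mean precision and recall. A toy illustration with made-up numbers:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up (precision, recall) pairs for two files.
per_file = [(1.0, 0.5), (0.5, 1.0)]

mean_of_f1 = sum(f1(p, r) for p, r in per_file) / len(per_file)
mean_precision = sum(p for p, _ in per_file) / len(per_file)
mean_recall = sum(r for _, r in per_file) / len(per_file)
f1_of_means = f1(mean_precision, mean_recall)

print(round(mean_of_f1, 4), round(f1_of_means, 4))
```

Here both files have an F1 of about 0.667, while the F1 computed from the averaged precision and recall is 0.75.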


In [None]:
!python pytorch/inference.py \
  --mode dataset_audio_score \
  --velocity-method max_frame \
  --soundfont-path /path/to/FluidR3_GM.sf2 \
  --overrides model.name=FiLMUNetPretrained model.input2=frame \
             dataset.test_set=smd exp.ckpt_iteration=1000000 \
             feature.sample_rate=16000 feature.segment_seconds=2.0 \
             feature.hop_seconds=1.0 feature.audio_feature=logmel \
             feature.frames_per_second=100 \
  --audio-eval-sample-rate 44100 \
  --audio-eval-frames-per-second 86 \
  --audio-eval-fft-size 1024
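
The audio-evaluation flags imply a particular spectrogram grid. As a worked example, assuming the frame rate equals `sample_rate / hop_length` (an assumption about how the script interprets these flags), 86 frames per second at 44100 Hz corresponds to a hop close to 512 samples, about half of the 1024-sample FFT window:

```python
sample_rate = 44100
frames_per_second = 86
fft_size = 1024

hop_exact = sample_rate / frames_per_second  # samples between frames (not an integer)
overlap_ratio = 1.0 - hop_exact / fft_size   # fraction of the window shared by neighbours
fps_at_hop_512 = sample_rate / 512           # frame rate if the hop were exactly 512

print(round(hop_exact, 2), round(overlap_ratio, 3), round(fps_at_hop_512, 2))
```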


# Old Scores

In [2]:
!python pytorch/calculate_scores.py \
  model.name="FiLMUNetPretrained" model.input2="frame" exp.ckpt_iteration=1000000 \
  dataset.test_set=smd \
  feature.sample_rate=16000 feature.segment_seconds=2.0 feature.hop_seconds=1.0 \
  feature.audio_feature=logmel feature.frames_per_second=100

Evaluation Mode : Kim et al.
Model Name      : FiLMUNetPretrained+frame
Test Set        : smd
Kim Eval: 100%|███████████████| 49/49 [09:01<00:00, 11.06s/file, frame_err=9.96]

===== Kim-style Average Metrics =====
frame_max_error: 9.9590
frame_max_std: 8.3273
f1_score: 0.5935
precision: 0.5944
recall: 0.8877
frame_mask_f1: 0.9694
frame_mask_precision: 0.9694
frame_mask_recall: 0.9694
onset_masked_error: 16.0383
onset_masked_std: 12.4601


In [None]:
!python pytorch/inference.py \
  exp.run_infer="multi" model.type="velo" \
  model.name="DynestAudioCNN" \
  dataset.test_set="smd"

Evaluation Mode : SINGLE
Model Name      : FiLMUNetPretrained+frame
Test Set        : smd
Using device    : cpu
Checkpoint      : 1100000_iterations.pth

[Done] Score Calculation Time: 11.99 sec

===== FiLMUNetPretrained+frame, iter=1100000 =====
velocity_mae: 15.6400
velocity_std: 12.3622
velocity_recall: 0.5247


In [19]:
!python pytorch/calculate_scores.py exp.run_infer='single' model.type='velo' exp.ckpt_iteration=1000000 \
     model.name='FiLMUNetPretrained' model.input2='frame' \
     dataset.test_set='smd' exp.num_workers=12 # maps, maestro

Evaluation Mode : SINGLE
Model Name      : FiLMUNetPretrained+frame
Test Set        : smd
Using device    : cpu
Checkpoint      : 1000000_iterations.pth

[Done] Score Calculation Time: 14.13 sec

===== FiLMUNetPretrained+frame, iter=1000000 =====
velocity_mae: 16.0383
velocity_std: 12.4601
velocity_recall: 0.5262


In [None]:
# !python pytorch/calculate_scores.py exp.run_infer='multi' exp.num_workers=12 model.type='velo'\
#   model.name='FiLMUNetPretrained' model.input2="frame" \
#   dataset.test_set='smd' # maps, maestro

# !python pytorch/calculate_scores.py exp.run_infer='single' model.type='velo' exp.ckpt_iteration=1100000\
#      model.name='FiLMUNetPretrained' model.input2='frame'\
#      dataset.test_set='smd' exp.num_workers=12 # maps, maestro

!python pytorch/calculate_scores.py exp.run_infer='multi' exp.num_workers=12 model.type='velo' \
  model.name='DynestAudioCNN' \
  dataset.test_set='smd' # maps, maestro

Evaluation Mode : MULTI
Model Name      : DynestAudioCNN
Test Set        : smd
Using device    : cpu
Found 1 checkpoints in /media/datadisk/home/22828187/zhanh/202510_hpt_data/workspaces/checkpoints/DynestAudioCNN
------------------------------------------------------------
[1/1] Evaluating: 0_iterations.pth
[Done] Time: 15.19 sec
velocity_mae: 15.3810
velocity_std: 10.4575
velocity_recall: 0.3875

[Saved] Summary in Wandb and CSV: /media/datadisk/home/22828187/zhanh/202510_hpt_data/workspaces/logs/DynestAudioCNN_smd.csv
All checkpoint scores completed.


In [2]:
!python pytorch/inference.py model.type='velo' \
    exp.run_infer='single' exp.ckpt_iteration=1000000 \
    model.name='FiLMUNetPretrained' model.input2='frame' \
    dataset.test_set='smd' \
    feature.sample_rate=16000 \
    feature.segment_seconds=2.0 \
    feature.hop_seconds=1.0 \
    feature.frames_per_second=100 \
    feature.audio_feature="logmel"

Inference Mode : SINGLE
Model Name     : FiLMUNetPretrained+frame
Test Set       : smd
Using Device   : cuda
Checkpoint     : 1000000_iterations.pth
Proc 1000000 Ckpt: 100%|██████████████████████| 49/49 [07:53<00:00,  9.67s/file]

[Done] Inference time: 474.09 sec


In [10]:
!python pytorch/calculate_scores.py exp.num_workers=12 model.type='velo' \
  exp.run_infer='single' exp.ckpt_iteration=1000000 \
  model.name='FiLMUNetPretrained' model.input2='frame' \
  dataset.test_set='smd' \
  feature.sample_rate=16000

Evaluation Mode : Kim-style (frame-level)
Model Name      : FiLMUNetPretrained+frame
Test Set        : smd
Kim-style scoring: 100%|█████| 49/49 [01:08<00:00,  1.40s/file, frame_err=18.56]

===== Kim-style Average Metrics =====
frame_max_error: 16.0383
std_max_error: 12.4601
average_precision_score: 0.1888
f1_score: 0.3170
precision_score: 0.1888
recall_score: 1.0000
frame_precision: 0.1888
frame_recall: 1.0000
frame_f1: 0.3170
onset_mean_error: 16.0383
onset_std_error: 12.4601


In [28]:
!python pytorch/calculate_scores.py \
  exp.batch_size=12 \
  model.name="Dual_Velocity_HPT" \
  exp.ckpt_iteration=18000 \
  model.input2=onset \
  dataset.test_set=smd \
  feature.audio_feature=logmel

Evaluation Mode : Kim et al.
Model Name      : Dual_Velocity_HPT+onset
Test Set        : smd
Kim Eval: 100%|██████████████| 49/49 [08:42<00:00, 10.67s/file, frame_err=16.07]

===== Kim-style Average Metrics =====
frame_max_error: 16.0671
frame_max_std: 8.9057
f1_score: 0.0470
precision: 0.0249
recall: 0.5000
frame_mask_f1: 1.0000
frame_mask_precision: 1.0000
frame_mask_recall: 1.0000
onset_masked_error: 9.8807
onset_masked_std: 6.9940


In [None]:
!python pytorch/inference.py exp.run_infer='multi' model.type='velo' \
     model.name='Single_Velocity_HPT' \
     dataset.test_set='smd' # maps, maestro

# !python pytorch/inference.py exp.run_infer='multi' model.type='velo'\
#      model.name='Dual_Velocity_HPT' model.input2='onset'\
#      dataset.test_set='smd' # maps, maestro

# !python pytorch/inference.py exp.run_infer='multi' model.type='velo'\
#      model.name='Triple_Velocity_HPT' model.input2='onset' model.input3='exframe'\
#      dataset.test_set='smd' # maps, maestro

Inference Mode : MULTI
Model Name     : Single_Velocity_HPT
Test Set       : smd
Using Device   : cuda
Found 2 checkpoints in ./workspaces/checkpoints/Single_Velocity_HPT
------------------------------------------------------------
[1/2] 195000_iteration.pth
Proc 195000 Ckpt: 100%|███████████████████████| 49/49 [06:25<00:00,  7.87s/file]
[Done] Time: 385.84 sec
------------------------------------------------------------
[2/2] 200000_iteration.pth
Proc 200000 Ckpt: 100%|███████████████████████| 49/49 [06:28<00:00,  7.93s/file]
[Done] Time: 388.63 sec

All checkpoint inference completed in 774.47 sec


In [None]:
!python pytorch/calculate_scores.py exp.run_infer='multi' exp.num_workers=12 model.type='velo' \
     model.name='Single_Velocity_HPT' \
     dataset.test_set='smd' # maps, maestro

# !python pytorch/calculate_scores.py exp.run_infer='multi' exp.num_workers=12 model.type='velo'\
#      model.name='Dual_Velocity_HPT' model.input2='onset'\
#      dataset.test_set='smd' # maps, maestro

# !python pytorch/calculate_scores.py exp.run_infer='multi' exp.num_workers=12 model.type='velo'\
#      model.name='Triple_Velocity_HPT' model.input2='onset' model.input3='exframe'\
#      dataset.test_set='smd' # maps, maestro

Evaluation Mode : MULTI
Model Name      : Single_Velocity_HPT
Test Set        : smd
Using device    : cpu
Found 2 checkpoints in ./workspaces/checkpoints/Single_Velocity_HPT
------------------------------------------------------------
[1/2] Evaluating: 195000_iteration.pth
[Done] Time: 36.30 sec
velocity_mae: 14.8968
velocity_std: 8.6206
velocity_recall: 0.7515
------------------------------------------------------------
[2/2] Evaluating: 200000_iteration.pth
[Done] Time: 41.78 sec
velocity_mae: 15.3933
velocity_std: 8.7335
velocity_recall: 0.7493

[Saved] Summary in Wandb and CSV: ./workspaces/logs/Single_Velocity_HPT_smd.csv
All checkpoint scores completed.
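
With several checkpoints scored, the one with the lowest `velocity_mae` can be picked programmatically. A minimal sketch using the two results printed above; the dictionary layout is hypothetical and does not reflect the actual columns of the saved CSV.

```python
# Scores copied from the log above, keyed by checkpoint iteration.
scores = {
    195000: {"velocity_mae": 14.8968, "velocity_std": 8.6206, "velocity_recall": 0.7515},
    200000: {"velocity_mae": 15.3933, "velocity_std": 8.7335, "velocity_recall": 0.7493},
}

# Lower mean absolute error is better.
best_iteration = min(scores, key=lambda iteration: scores[iteration]["velocity_mae"])
print(best_iteration, scores[best_iteration]["velocity_mae"])
```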


We ran inference on cloud GPUs with the following code, so no results are recorded in this Jupyter notebook. Our results are available in the [wandb report](https://api.wandb.ai/links/zhanh-uwa/wc7q3j5b).

In [None]:
MODEL_CONFIGS=(
    "model.name='Single_Velocity_HPT'"
    "model.name='Dual_Velocity_HPT' model.input2='onset'"
    "model.name='Dual_Velocity_HPT' model.input2='frame'"
    "model.name='Dual_Velocity_HPT' model.input2='exframe'"
    "model.name='Triple_Velocity_HPT' model.input2='onset' model.input3='frame'"
    "model.name='Triple_Velocity_HPT' model.input2='onset' model.input3='exframe'"
    "model.name='Triple_Velocity_HPT' model.input2='frame' model.input3='exframe'"
)

DATASET='maps' # 'smd', 'maps', or 'maestro'

# Get the specific config based on SLURM_ARRAY_TASK_ID
CONFIG=${MODEL_CONFIGS[$SLURM_ARRAY_TASK_ID]}

echo "Selected model config index: $SLURM_ARRAY_TASK_ID"
echo "Running inference with config: $CONFIG"
# $CONFIG is intentionally unquoted so that each override is passed as a separate argument
python pytorch/inference.py exp.run_infer='multi' model.type='velo' $CONFIG dataset.test_set="$DATASET"

echo "Running scoring with config: $CONFIG"
python pytorch/calculate_scores.py exp.run_infer='multi' exp.num_workers=12 model.type='velo' $CONFIG dataset.test_set="$DATASET"
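
The seven entries in `MODEL_CONFIGS` are the single-input model plus every choice of one or two auxiliary inputs from `onset`, `frame`, and `exframe`. A minimal sketch that generates the same list in the same order; `AUX_INPUTS` and `build_model_configs` are hypothetical names used only for illustration.

```python
from itertools import combinations

AUX_INPUTS = ["onset", "frame", "exframe"]


def build_model_configs():
    """Return the override strings in the same order as MODEL_CONFIGS."""
    configs = ["model.name='Single_Velocity_HPT'"]
    for aux in AUX_INPUTS:
        configs.append(f"model.name='Dual_Velocity_HPT' model.input2='{aux}'")
    # combinations() keeps list order: (onset, frame), (onset, exframe), (frame, exframe).
    for second, third in combinations(AUX_INPUTS, 2):
        configs.append(
            f"model.name='Triple_Velocity_HPT' model.input2='{second}' model.input3='{third}'"
        )
    return configs


configs = build_model_configs()
for index, config in enumerate(configs):
    print(index, config)
```

Since Bash arrays are zero-indexed, an array job covering task IDs 0 to 6 would visit each configuration exactly once.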