ISMIR 2025 paper codebase
CountEM is a training framework for automatic music transcription (AMT) that relies only on unordered note count histograms, eliminating the need for aligned or ordered labels. This repository provides the official implementation of our ISMIR 2025 paper: “Count the Notes: Histogram‑Based Supervision for Automatic Music Transcription.”
See the project page for video demos and transcription results produced by CountEM.
- Trains AMT models using weak, alignment-free supervision
- Leverages only histogram-level supervision (note counts); see the toy example below
- Built on top of the Onsets & Frames and Unaligned Supervision codebases
- Ships with pre-trained models and scripts for training, inference, and data conversion
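To make the supervision signal concrete, here is a toy illustration (not from the codebase) of the histogram-level label CountEM uses: for each pitch, only the number of note events in an excerpt, with no onsets, offsets, or ordering:
from collections import Counter

# Hypothetical MIDI pitches of the note events in one excerpt, in no particular order
notes = [60, 60, 64, 67, 60, 64]

# The histogram-level label: event counts per pitch, nothing else
histogram = Counter(notes)
print(histogram)  # Counter({60: 3, 64: 2, 67: 1})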
The implementation is based on the following projects:
- Unaligned Supervision – ICML 2022, alignment‑free training utilities.
- Onsets & Frames – the canonical PyTorch baseline for the Onsets and Frames architecture.
- Repository Structure
- Quick Start
- Training & Inference
- Bootstrap Checkpoint
- Credits & Citation
- License
.
├── train.py # ▶ Entry point: training
├── inference.py # ▶ Entry point: inference/evaluation
├── ckpts/ # Contains *links* to download pre-trained transcribers
│ └── checkpoint_link.txt # (see Bootstrap Checkpoint)
├── datasets/ # Directory that should contain the datasets. You can see the expected format in datasets/README.md
├── onsets_and_frames/ # Upstream architecture (lightly modified)
├── conversion_maps/ # Instrument ↔︎ MIDI helpers
├── scripts/ # Utility scripts
│ ├── make_pitch_shifted_copies.py # Data augmentation script
│ └── make_parsed_tsv_from_midi.py # MIDI → TSV label conversion
├── NoteEM_tsv/ # MIDI labels in TSV format (used for training and eval)
├── static/, index.html # Demo / docs site assets (optional)
├── requirements.txt # Python dependencies
├── LICENSE.md
└── README.md
Requires Python ≥3.8 and a GPU with CUDA 11+.
We provide both a pyproject.toml and a requirements.txt.
You can use uv for installation:
# 1) Create virtual environment and install dependencies using uv
uv sync

Or you can install using pip:
# 1) Create & activate a virtual‑env (recommended)
python -m venv .venv
source .venv/bin/activate # (or .venv\Scripts\activate on Windows)
# 2) Install dependencies
pip install -r requirements.txt
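After installing, you can quickly confirm the environment is usable (a minimal check; assumes the PyTorch dependency installed via requirements.txt):
import torch

print(torch.__version__)
print(torch.cuda.is_available())  # should be True on a machine with CUDA 11+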
We provide two pre-trained models on Hugging Face Hub for easy use without training:
- countem-synth - Transcriber trained on synthetic MIDI renderings
- countem-musicnet - Transcriber trained with the CountEM framework on the MusicNet dataset
# Install with Hugging Face Hub support
pip install huggingface_hub
# Or install all dependencies
pip install -r requirements.txt

from onsets_and_frames.hf_model import CountEMModel
import soundfile as sf
# Load model from Hugging Face Hub (downloads automatically on first use)
model = CountEMModel.from_pretrained("Yoni232/countem-musicnet")
# Load your audio file (must be 16kHz)
audio, sr = sf.read("your_audio.flac", dtype="float32")
assert sr == 16000, f"Audio must be 16kHz, got {sr}Hz"
# Transcribe to MIDI
model.transcribe_to_midi(audio, "output.mid")

Important: Audio must be sampled at 16kHz. You can resample using ffmpeg or librosa:
# Using ffmpeg
ffmpeg -i input.wav -ar 16000 output_16k.flac
# Using librosa in Python
import librosa
audio, sr = librosa.load("input.wav", sr=16000)

See hf_example.py for a complete command-line example:
python hf_example.py input_audio.flac output.mid --model Yoni232/countem-musicnet

We do not ship large model weights.
Instead, the folder ckpts/ contains a text file checkpoint_link.txt with download links to:
- A transcriber trained purely on synthetic MIDI renderings.
- A transcriber trained with the CountEM framework on the MusicNet dataset.
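If you prefer to script the download, here is a minimal sketch, assuming the first line of checkpoint_link.txt is a direct download URL (check the file first, since it lists links for both models):
import urllib.request

# Assumption: ckpts/checkpoint_link.txt holds one direct download URL per line.
with open("ckpts/checkpoint_link.txt") as f:
    url = f.readline().strip()

# Save under the filename used by the training example below.
urllib.request.urlretrieve(url, "ckpts/model-70.pt")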
# Show full help
python train.py -h
# Example usage
python train.py \
--logdir "$LOGDIR" \
--dataset-name <your_dataset_name> \
--batch-size 8 \
--transcriber-ckpt ckpts/model-70.pt
- Audio data: by default the script expects your audio under
  datasets/<your_dataset_name>/noteEM_audio.
  Override with --data-dir-path if you've moved your datasets elsewhere.
  See datasets/README.md for the exact audio-folder layout.
- MIDI/TSV labels: by default it reads label files from
  NoteEM_tsv/<your_dataset_name>.
  You can override with --tsv-dir.
  These must be in TSV format (onset, offset, note, velocity, instrument);
  use the conversion tool at scripts/make_parsed_tsv_from_midi.py to generate them from your .mid files.
  See NoteEM_tsv/README.md for details on the TSV schema; a quick sanity check on a converted file is sketched below.
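Before training, it can help to sanity-check a converted label file; a minimal sketch, assuming the tab-separated five-column schema above (the file path is hypothetical):
import csv
from collections import Counter

counts = Counter()
with open("NoteEM_tsv/my_dataset/example.tsv") as f:  # hypothetical path
    for row in csv.reader(f, delimiter="\t"):
        # skip a header row here first if your converter writes one
        assert len(row) == 5, f"expected 5 columns, got {len(row)}"
        onset, offset, note, velocity, instrument = row
        counts[int(note)] += 1

# Per-pitch note counts: the histogram-level signal CountEM trains on
print(counts.most_common(5))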
# List available flags
python inference.py -h

We provide a standalone evaluation utility to assess the accuracy of transcribed MIDI files against reference MIDI or TSV annotations using both note-level and frame-level metrics.
This is useful for benchmarking transcribers outside the training pipeline — for example, comparing inference results from inference.py with ground truth.
python eval.py \
--transcribed-dir evaluation/midis_no_pitch_shift_transcribed \
--reference-dir evaluation/new_inference_dir/results \
--outfile results.txt \
--tolerance 0.05 \
--shift 0.0 \
--parallel \
--max-workers 8

- --transcribed-dir:
  Path to a folder containing transcribed .mid or .midi files, typically produced by a model (e.g., via inference.py).
  Filenames must match or sort consistently with the reference files.
- --reference-dir:
  Path to a folder with reference MIDI or TSV files.
  Supported formats:
  - .mid or .midi — standard MIDI files
  - .tsv — 5-column text files (onset, offset, pitch, velocity, instrument); must be tab-delimited and compatible with mir_eval
- --outfile:
  Optional path to a .txt file where summary metrics will be saved.
  A CSV file with per-piece scores will also be generated alongside it.
- --tolerance:
  Onset tolerance in seconds for matching (default: 0.05).
- --shift:
  Uniform time shift (in seconds) to apply to the reference labels.
  Useful when aligning outputs from different renderings.
- --parallel:
  Use multiprocessing for faster evaluation over many files.
- --max-workers:
  Number of CPU workers for parallel mode (default: 4).
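For context, the note-level metric here is the standard mir_eval transcription score; a minimal sketch with made-up notes (eval.py wraps this kind of computation, and its details may differ):
import numpy as np
import mir_eval

# Reference and estimated notes: (onset, offset) intervals in seconds, pitches in Hz
ref_intervals = np.array([[0.50, 1.00], [1.20, 1.70]])
ref_pitches = mir_eval.util.midi_to_hz(np.array([60, 64]))
est_intervals = np.array([[0.52, 1.05], [1.50, 2.00]])
est_pitches = mir_eval.util.midi_to_hz(np.array([60, 64]))

# Note-level precision/recall/F1 with a 50 ms onset tolerance; offsets ignored
p, r, f, _ = mir_eval.transcription.precision_recall_f1_overlap(
    ref_intervals, ref_pitches, est_intervals, est_pitches,
    onset_tolerance=0.05, offset_ratio=None)
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")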
- The evaluation assumes that the transcribed and reference files match by sorted filename order.
- If using .tsv files in the reference directory, you can generate them using:
python scripts/make_parsed_tsv_from_midi.py --input-dir your_midis --output-dir NoteEM_tsv/your_dataset

© Jonathan Yaffe (Tel Aviv University), Ben Maman (International Audio Laboratories Erlangen, Germany), Meinard Müller (International Audio Laboratories Erlangen, Germany), Amit Bermano (Tel Aviv University), 2025.
This project is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. You are free to share and adapt the material for any purpose, provided that:
- Attribution — You give appropriate credit to the authors.
See the full text in LICENSE.md.