ISMIR 2025 paper codebase
CountEM is a training framework for automatic music transcription (AMT) that relies only on unordered note count histograms, eliminating the need for aligned or ordered labels. This repository provides the official implementation of our ISMIR 2025 paper: “Count the Notes: Histogram‑Based Supervision for Automatic Music Transcription.”
See the project page for video demos and transcription results produced by CountEM.
- Trains AMT models using weak, alignment-free supervision
- Leverages only histogram-level supervision (note counts); see the toy example below
- Built on top of the Onsets & Frames and Unaligned Supervision codebases
- Ships with pre-trained models and scripts for training, inference, and data conversion
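To make the supervision signal concrete, here is a toy illustration (not from the codebase) of the histogram-level label CountEM uses: for each pitch, only the number of note events in an excerpt, with no onsets, offsets, or ordering:
from collections import Counter

# Hypothetical MIDI pitches of the note events in one excerpt, in no particular order
notes = [60, 60, 64, 67, 60, 64]

# The histogram-level label: event counts per pitch, nothing else
histogram = Counter(notes)
print(histogram)  # Counter({60: 3, 64: 2, 67: 1})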
The implementation is based on the following projects:
- Unaligned Supervision – ICML 2022, alignment‑free training utilities.
- Onsets & Frames – the canonical PyTorch baseline for the Onsets and Frames architecture.
- Repository Structure
- Quick Start
- Training & Inference
- Bootstrap Checkpoint
- Credits & Citation
- License
.
├── train.py # ▶ Entry point: training
├── inference.py # ▶ Entry point: inference/evaluation
├── ckpts/ # Contains *links* to download pre-trained transcribers
│ └── checkpoint_link.txt # (see Bootstrap Checkpoint)
├── datasets/ # Directory that should contain the datasets. You can see the expected format in datasets/README.md
├── onsets_and_frames/ # Upstream architecture (lightly modified)
├── conversion_maps/ # Instrument ↔︎ MIDI helpers
├── scripts/ # Utility scripts
│ ├── make_pitch_shifted_copies.py # Data augmentation script
│ └── make_parsed_tsv_from_midi.py # MIDI → TSV label conversion
├── NoteEM_tsv/ # MIDI labels in TSV format (used for training and eval)
├── static/, index.html # Demo / docs site assets (optional)
├── requirements.txt # Python dependencies
├── LICENSE.md
└── README.md
Requires Python ≥3.8 and a GPU with CUDA 11+.
We provide both a pyproject.toml and a requirements.txt.
You can use uv for installation:
# 1) Create virtual environment and install dependencies using uv
uv sync

Or you can install using pip:
# 1) Create & activate a virtual‑env (recommended)
python -m venv .venv
source .venv/bin/activate # (or .venv\Scripts\activate on Windows)
# 2) Install dependencies
pip install -r requirements.txt
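After installing, you can quickly confirm the environment is usable (a minimal check; assumes the PyTorch dependency installed via requirements.txt):
import torch

print(torch.__version__)
print(torch.cuda.is_available())  # should be True on a machine with CUDA 11+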
We provide two pre-trained models on Hugging Face Hub for easy use without training:
- countem-synth - Transcriber trained on synthetic MIDI renderings
- countem-musicnet - Transcriber trained with the CountEM framework on the MusicNet dataset
# Install with Hugging Face Hub support
pip install huggingface_hub
# Or install all dependencies
pip install -r requirements.txt

from onsets_and_frames.hf_model import CountEMModel
import soundfile as sf
# Load model from Hugging Face Hub (downloads automatically on first use)
model = CountEMModel.from_pretrained("Yoni232/countem-musicnet")
# Load your audio file (must be 16kHz)
audio, sr = sf.read("your_audio.flac", dtype="float32")
assert sr == 16000, f"Audio must be 16kHz, got {sr}Hz"
# Transcribe to MIDI
model.transcribe_to_midi(audio, "output.mid")

Important: Audio must be sampled at 16kHz. You can resample using ffmpeg or librosa:
# Using ffmpeg
ffmpeg -i input.wav -ar 16000 output_16k.flac
# Using librosa in Python
import librosa
audio, sr = librosa.load("input.wav", sr=16000)

See hf_example.py for a complete command-line example:
python hf_example.py input_audio.flac output.mid --model Yoni232/countem-musicnet

We do not ship large model weights.
Instead, the folder ckpts/ contains a text file checkpoint_link.txt with download links to:
- A transcriber trained purely on synthetic MIDI renderings.
- A transcriber trained with the CountEM framework on the MusicNet dataset.
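If you prefer to script the download, here is a minimal sketch, assuming the first line of checkpoint_link.txt is a direct download URL (check the file first, since it lists links for both models):
import urllib.request

# Assumption: ckpts/checkpoint_link.txt holds one direct download URL per line.
with open("ckpts/checkpoint_link.txt") as f:
    url = f.readline().strip()

# Save under the filename used by the training example below.
urllib.request.urlretrieve(url, "ckpts/model-70.pt")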
# Show full help
python train.py -h
# Example usage
python train.py \
--logdir "$LOGDIR" \
--dataset-name <your_dataset_name> \
--batch-size 8 \
--transcriber-ckpt ckpts/model-70.pt
- Audio data: by default the script expects your audio under
  datasets/<your_dataset_name>/noteEM_audio.
  Override with --data-dir-path if you've moved your datasets elsewhere.
  See datasets/README.md for the exact audio-folder layout.
- MIDI/TSV labels: by default it reads label files from
  NoteEM_tsv/<your_dataset_name>.
  You can override with --tsv-dir.
  These must be in TSV format (onset, offset, note, velocity, instrument);
  use the conversion tool at scripts/make_parsed_tsv_from_midi.py to generate them from your .mid files.
  See NoteEM_tsv/README.md for details on the TSV schema; a quick sanity check on a converted file is sketched below.
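Before training, it can help to sanity-check a converted label file; a minimal sketch, assuming the tab-separated five-column schema above (the file path is hypothetical):
import csv
from collections import Counter

counts = Counter()
with open("NoteEM_tsv/my_dataset/example.tsv") as f:  # hypothetical path
    for row in csv.reader(f, delimiter="\t"):
        # skip a header row here first if your converter writes one
        assert len(row) == 5, f"expected 5 columns, got {len(row)}"
        onset, offset, note, velocity, instrument = row
        counts[int(note)] += 1

# Per-pitch note counts: the histogram-level signal CountEM trains on
print(counts.most_common(5))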
# List available flags
python inference.py -h

We provide a standalone evaluation utility to assess the accuracy of transcribed MIDI files against reference MIDI or TSV annotations using both note-level and frame-level metrics.
This is useful for benchmarking transcribers outside the training pipeline — for example, comparing inference results from inference.py with ground truth.
python eval.py \
--transcribed-dir evaluation/midis_no_pitch_shift_transcribed \
--reference-dir evaluation/new_inference_dir/results \
--outfile results.txt \
--tolerance 0.05 \
--shift 0.0 \
--parallel \
--max-workers 8

- --transcribed-dir:
  Path to a folder containing transcribed .mid or .midi files, typically produced by a model (e.g., via inference.py).
  Filenames must match or sort consistently with the reference files.
- --reference-dir:
  Path to a folder with reference MIDI or TSV files.
  Supported formats:
  - .mid or .midi — standard MIDI files
  - .tsv — 5-column text files (onset, offset, pitch, velocity, instrument); must be tab-delimited and compatible with mir_eval
- --outfile:
  Optional path to a .txt file where summary metrics will be saved.
  A CSV file with per-piece scores will also be generated alongside it.
- --tolerance:
  Onset tolerance in seconds for matching (default: 0.05).
- --shift:
  Uniform time shift (in seconds) to apply to the reference labels.
  Useful when aligning outputs from different renderings.
- --parallel:
  Use multiprocessing for faster evaluation over many files.
- --max-workers:
  Number of CPU workers for parallel mode (default: 4).
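For context, the note-level metric here is the standard mir_eval transcription score; a minimal sketch with made-up notes (eval.py wraps this kind of computation, and its details may differ):
import numpy as np
import mir_eval

# Reference and estimated notes: (onset, offset) intervals in seconds, pitches in Hz
ref_intervals = np.array([[0.50, 1.00], [1.20, 1.70]])
ref_pitches = mir_eval.util.midi_to_hz(np.array([60, 64]))
est_intervals = np.array([[0.52, 1.05], [1.50, 2.00]])
est_pitches = mir_eval.util.midi_to_hz(np.array([60, 64]))

# Note-level precision/recall/F1 with a 50 ms onset tolerance; offsets ignored
p, r, f, _ = mir_eval.transcription.precision_recall_f1_overlap(
    ref_intervals, ref_pitches, est_intervals, est_pitches,
    onset_tolerance=0.05, offset_ratio=None)
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")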
- The evaluation assumes that the transcribed and reference files match by sorted filename order.
- If using .tsv files in the reference directory, you can generate them using:
python scripts/make_parsed_tsv_from_midi.py --input-dir your_midis --output-dir NoteEM_tsv/your_dataset

© Jonathan Yaffe (Tel Aviv University), Ben Maman (International Audio Laboratories Erlangen, Germany), Meinard Müller (International Audio Laboratories Erlangen, Germany), Amit Bermano (Tel Aviv University), 2025.
This project is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. You are free to share and adapt the material for any purpose, provided that:
- Attribution — You give appropriate credit to the authors.
See the full text in LICENSE.md.