bat 🦇

A Beat As Token transformer architecture designed from scratch ~~to fail~~ to enhance beat token embeddings with pairwise beat comparisons.

Idea

Let's train BERT since we don't want to generate anything.
Let's take a MIDI file. If you don't have one, take your audio, run demucs + basic-pitch, and pack the transcribed tracks into one file.
Run beat tracking to get the exact measure/beat onsets.
~~Discard your file if it's not 4/4~~
Write a function that gives you all notes participating in a single beat.
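A minimal sketch of that function, under assumptions that aren't part of the idea itself: the `Note` tuple, the `track` labels and the `track_end` argument are made up for illustration, and "participating" is read as "sounding at any point during the beat", so a held note shows up in every beat it overlaps.

```python
from bisect import bisect_right
from typing import NamedTuple


class Note(NamedTuple):
    onset: float   # seconds
    offset: float  # seconds
    pitch: int     # MIDI pitch number
    track: str     # hypothetical label, e.g. "bass", "melody", "drums"


def notes_per_beat(notes, beat_onsets, track_end):
    """Return one list of notes per beat.

    A note belongs to beat i if it sounds at any point inside
    [beat_onsets[i], beat_onsets[i + 1]); the last beat ends at track_end.
    """
    bounds = list(beat_onsets) + [track_end]
    beats = [[] for _ in beat_onsets]
    for note in notes:
        # Beat whose window contains the note onset (clamped to 0 for pickups).
        first = max(bisect_right(bounds, note.onset) - 1, 0)
        for i in range(first, len(beat_onsets)):
            if bounds[i] >= note.offset:
                break  # the note has already ended before this beat starts
            beats[i].append(note)
    return beats


notes = [Note(0.5, 1.5, 40, "bass"), Note(1.0, 2.0, 64, "melody")]
beats = notes_per_beat(notes, [0.0, 1.0, 2.0, 3.0], track_end=4.0)
```

If you only care about onsets, drop the inner loop and append the note to `beats[first]` alone.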
Now write a function that extracts binary features from a single beat. The more the better. Think relatively. Hey, music isn't about absolute pitches.
- Is there a bass onset at this beat?
- Eleven features for presence of all intervals above the bass?
- Does the melody have one/two/four notes? Where do they go?
- Is there a chord strumming?
- Which quarter of the measure is it?
- Does it have 16th hi-hats? Snare/kick/crash?
- etc etc
Make a bit vector out of these features. As we multiply it by a learnable matrix B, we'll get beat embeddings - these will be our tokens.
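Here's a rough sketch of the bit vector and the multiplication by B. Everything concrete in it is an assumption: the note format, the track names, the choice of just 16 features, and the random values standing in for a learned B. The only real point is that multiplying a 0/1 vector by B is the same as summing the rows of B whose bit is set.

```python
import random


# Hypothetical note format: (onset_in_beats, midi_pitch, track_name).
def beat_features(notes, beat_index):
    """Return a list of 0/1 features for one beat; no absolute pitches are used."""
    bass = [p for _, p, track in notes if track == "bass"]
    bits = [1 if bass else 0]  # is there a bass note at this beat?

    # Eleven bits: which intervals (1..11 semitones, octave-folded) sound above the bass.
    lowest = min(bass) if bass else None
    present = set()
    if lowest is not None:
        present = {(p - lowest) % 12 for _, p, track in notes if track != "drums"}
    bits += [1 if k in present else 0 for k in range(1, 12)]

    # Four one-hot bits: which quarter of a 4/4 measure this beat is.
    bits += [1 if beat_index % 4 == q else 0 for q in range(4)]
    return bits


def embed(bits, B):
    """bits @ B for a 0/1 vector: the sum of the rows of B whose bit is set."""
    out = [0.0] * len(B[0])
    for bit, row in zip(bits, B):
        if bit:
            out = [o + r for o, r in zip(out, row)]
    return out


NUM_FEATURES, EMBED_DIM = 16, 32
rng = random.Random(0)
# B would be learned during training; random values stand in for it here.
B = [[rng.gauss(0.0, 0.02) for _ in range(EMBED_DIM)] for _ in range(NUM_FEATURES)]

notes = [(0.0, 36, "bass"), (0.0, 52, "chord"), (0.0, 55, "chord"), (0.5, 60, "melody")]
bits = beat_features(notes, beat_index=0)
token = embed(bits, B)  # one beat token of length EMBED_DIM
```

For the C major chord over a C bass above, the set interval bits are the major third and the fifth, whatever key the track is in.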
By the way, we'll go for a context of 512 tokens, because most tracks fit under 128 measures (4 beats each in 4/4).
Now for the weird part. We need to build the first self-attention layer. A key-query-value dance. Let's calculate relative pairwise features between two beats and use them there.
So, write a function that extracts binary features from a pair of beats. Again, no absolute pitches allowed:
- Is the bass/chord/melody the same?
- Is the melody from beat 1 transposed to beat 2?
- Are the bass/chord/melody starting from these two beats equal for the next 4/16/64 beats? (A bit of pre-compute for faster convergence.)
- What's the interval between leftmost bass notes of two beats?
- Is the second beat an exact transposition of all notes one tone up?
- Encode the relative distance between these two, in measures and in beats. Look up rotary position encoding and invent your own.
- etc etc
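A handful of these pairwise features could be sketched like this. The beat format (a dict of pitch lists per voice) is hypothetical, "leftmost bass note" is read as the first bass note of the beat, and the distance bits are a crude stand-in, not a rotary encoding.

```python
def pair_features(beat_a, beat_b, index_a, index_b):
    """Binary features for an ordered pair of beats.

    Each beat is a hypothetical dict: {"bass": [pitches], "melody": [pitches]},
    with pitches as MIDI numbers in onset order.
    """
    def shape(pitches):
        # Intervals between consecutive notes: invariant under transposition.
        return [b - a for a, b in zip(pitches, pitches[1:])]

    mel_a, mel_b = beat_a["melody"], beat_b["melody"]
    same_melody = bool(mel_a) and mel_a == mel_b
    transposed = (bool(mel_a) and len(mel_a) == len(mel_b)
                  and shape(mel_a) == shape(mel_b) and mel_a[0] != mel_b[0])

    bits = [int(beat_a["bass"] == beat_b["bass"]), int(same_melody), int(transposed)]

    # Twelve one-hot bits: interval between the first bass notes, modulo the octave.
    interval = None
    if beat_a["bass"] and beat_b["bass"]:
        interval = (beat_b["bass"][0] - beat_a["bass"][0]) % 12
    bits += [int(interval == k) for k in range(12)]

    # Crude distance bits (an assumption): same position in a 4/4 measure,
    # and whether the beats are less than 1, 4 or 16 measures apart.
    distance = abs(index_b - index_a)
    bits.append(int(distance % 4 == 0))
    bits += [int(distance < 4 * m) for m in (1, 4, 16)]
    return bits


a = {"bass": [36], "melody": [60, 62, 64]}
b = {"bass": [38], "melody": [62, 64, 66]}
bits = pair_features(a, b, index_a=0, index_b=8)
```

Here beat `b` is beat `a` moved up a whole tone, so the "transposed" bit fires and the bass interval one-hot lands on 2 semitones, with no absolute pitch leaking into the features.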
Then multiply this by the Key/Query/Value matrices, compute self-attention, and add an MLP.
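How exactly the pair features enter the K/Q/V dance is left open above. One possible reading, sketched below as an assumption rather than the design: embed the pair bits with a second learnable matrix (called `P` here, a made-up name), add that pair embedding to the token of the attended beat, and only then project to keys and values. Single head, plain Python, no residual connection or MLP, random weights standing in for learned ones.

```python
import math
import random


def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*W)]


def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pairwise_attention(tokens, pair_bits, P, Wq, Wk, Wv):
    """Single-head self-attention where beat j, seen from beat i,
    is represented as tokens[j] + pair_bits[i][j] @ P."""
    dim = len(Wq[0])
    out = []
    for i, x_i in enumerate(tokens):
        q = matvec(x_i, Wq)
        keys, values = [], []
        for j, x_j in enumerate(tokens):
            rel = matvec(pair_bits[i][j], P)  # pair embedding for (i, j)
            mixed = [a + b for a, b in zip(x_j, rel)]
            keys.append(matvec(mixed, Wk))
            values.append(matvec(mixed, Wv))
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return out


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


N, D, F = 4, 8, 5  # beats, embedding length, pair features (toy sizes)
tokens = rand_matrix(N, D)
pair_bits = [[[rng.randint(0, 1) for _ in range(F)] for _ in range(N)] for _ in range(N)]
P, Wq, Wk, Wv = rand_matrix(F, D), rand_matrix(D, D), rand_matrix(D, D), rand_matrix(D, D)
out = pairwise_attention(tokens, pair_bits, P, Wq, Wk, Wv)
```

Note the cost: the pair bits form an N × N × F tensor, so at 512 tokens that's 262,144 feature vectors per track for this first layer alone.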
Stack more layers. Don't repeat the pairwise embeddings again - hope that a residual connection will help if need be. Use some standard self-attention with some relative distances.

Where to start

Start learning this on tiny sizes with a handful of features and just 2 self-attention layers. Maybe try a tiny embedding length - 32?

Oh, how do we train? What should the [MASK] token be equal to? Well, let's predict every binary beat feature through a sigmoid. The loss sums the per-feature sigmoid losses (binary cross-entropy being the usual choice).
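A sketch of that loss, assuming binary cross-entropy per feature and assuming (BERT-style) that only the masked beats are scored. Neither choice is pinned down above, and the function and argument names are made up.

```python
import math


def sigmoid(z):
    # Split by sign so math.exp never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def masked_beat_loss(logits, targets, masked):
    """Sum of binary cross-entropy terms over all features of masked beats.

    logits[i][f] is the model's raw score for feature f of beat i,
    targets[i][f] is the true 0/1 bit, masked[i] says whether beat i was hidden.
    """
    eps = 1e-12  # guards against log(0)
    loss = 0.0
    for beat_logits, beat_bits, is_masked in zip(logits, targets, masked):
        if not is_masked:
            continue  # assumption: unmasked beats don't contribute
        for z, y in zip(beat_logits, beat_bits):
            p = sigmoid(z)
            loss -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return loss


logits = [[0.0, 0.0], [5.0, -5.0]]
targets = [[1, 0], [1, 0]]
unsure = masked_beat_loss(logits, targets, masked=[True, False])
confident = masked_beat_loss(logits, targets, masked=[False, True])
```

A model that outputs logit 0 everywhere pays ln 2 per feature, so `unsure` is 2 · ln 2 for the two features of the first beat; `confident` is much smaller because both logits point the right way.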

Fine-tuning

We can try to fine-tune on answering music theory questions about every beat:

- Is this a tonic chord?
- Is this a true measure start? A 4-measure phrase start?
- Is this the start of a direct modulation up?
- What's the likely local scale?
- Is this blues?
