| `one_token_stream_for_programs` | Best For                          | Pros                                                | Cons                                |
| ------- | --------------------------------- | --------------------------------------------------- | ----------------------------------- |
| `True`  | Simpler training, small datasets  | Unified sequence, learns inter-instrument relations | More token noise, program switching |
| `False` | Multi-instrument/orchestral music | Clean per-instrument modeling                       | Requires multi-stream handling      |

> In this notebook we try both settings by changing the `one_token_stream_test` boolean variable.

## Fixing generated tokens

### 1. `tokenizer.complete_sequence()`

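**Purpose:**

A `TokSequence` can hold several representations of the same sequence, mainly `tokens` (strings) and `ids` (integers). A model usually outputs only the ids.
`complete_sequence()` fills in the missing representations from the ones that are present. It works in place and returns `None`, so do not reassign its result. It does not insert missing tempo, bar, or program tokens: a structurally invalid sequence stays invalid.

```python
from miditok import TokSequence

ids = model.generate(...)            # placeholder: token ids from your model, as a plain list of ints
seq = TokSequence(ids=ids)
tokenizer.complete_sequence(seq)     # in place: fills seq.tokens from seq.ids
midi = tokenizer.decode(seq)
```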

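### 2. `fast=True`

The idea is to let decoding skip or repair inconsistent events instead of failing on them:

```python
midi = tokenizer.decode(tokens, fast=True)
```

`fast` is not among the documented parameters of MidiTok's `decode()` (`tokens`, `programs`, `output_path`), so confirm that your installed version accepts it before relying on this call.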
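Before decoding, it can help to check a generated token list for structurally incomplete notes. The sketch below is a hypothetical helper, not part of MidiTok. It assumes the REMI layout used in this notebook, where every `Pitch` token is followed by a `Velocity` token and then a `Duration` token.

```python
def find_incomplete_notes(tokens: list[str]) -> list[int]:
    """Return the indices of Pitch tokens not followed by Velocity, then Duration."""
    bad = []
    for i, tok in enumerate(tokens):
        if not tok.startswith("Pitch_"):
            continue
        # Token types of the two tokens that follow the Pitch token
        kinds = [t.split("_", 1)[0] for t in tokens[i + 1 : i + 3]]
        if kinds != ["Velocity", "Duration"]:
            bad.append(i)
    return bad


good = ["Bar_None", "Position_0", "Program_74", "Pitch_69", "Velocity_103", "Duration_1.0.8"]
broken = ["Bar_None", "Position_0", "Pitch_69", "Velocity_103", "Pitch_72", "Duration_0.4.8"]

print(find_incomplete_notes(good))    # []
print(find_incomplete_notes(broken))  # [2, 4]
```

A non-empty result points at the notes worth dropping or regenerating before calling `decode()`.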

In [21]:
from miditok import REMI, TokenizerConfig

one_token_stream_test = True

# Set up a config that gives REMI+ behavior
config = TokenizerConfig(
    use_programs=True,                                    # Include instrument program numbers (to distinguish instruments)
    one_token_stream_for_programs=one_token_stream_test,  # If True, merge all instruments into a single token stream
    program_changes=True,                                 # Emit a Program token only when the instrument changes
    use_time_signatures=True,                             # Include time signature tokens in the encoding
    use_chords=True,                                      # (Optional) Detect and include chord tokens
    use_rests=True,                                       # (Optional) Represent silence periods as rest tokens
    use_tempos=True,                                      # (Optional) Include tempo change tokens
)

tokenizer = REMI(config)

tokenizer

564 tokens with ('T',) io format (one token stream), not trained

In [22]:
test_midi_file = "../../midi_tests/3-Due_pupille.mid"

token = tokenizer(test_midi_file)
token

TokSequence(tokens=['Bar_None', 'TimeSig_4/4', 'Position_0', 'Tempo_121.29', 'Program_74', 'Pitch_69', 'Velocity_103', 'Duration_1.0.8', 'Pitch_72', 'Velocity_103', 'Duration_1.0.8', 'Program_60', 'Pitch_53', 'Velocity_103', 'Duration_1.0.8', 'Program_42', 'Pitch_53', 'Velocity_79', 'Duration_1.0.8', 'Program_0', 'Pitch_69', 'Velocity_103', 'Duration_1.0.8', 'Pitch_72', 'Velocity_103', 'Duration_1.0.8', 'Position_8', 'Program_74', 'Pitch_67', 'Velocity_103', 'Duration_0.4.8', 'Pitch_70', 'Velocity_103', 'Duration_0.4.8', 'Program_60', 'Pitch_53', 'Velocity_103', 'Duration_0.4.8', 'Program_42', 'Pitch_53', 'Velocity_79', 'Duration_1.0.8', 'Program_0', 'Pitch_67', 'Velocity_103', 'Duration_0.4.8', 'Pitch_70', 'Velocity_103', 'Duration_0.4.8', 'Position_12', 'Program_74', 'Pitch_65', 'Velocity_103', 'Duration_0.4.8', 'Pitch_69', 'Velocity_103', 'Duration_0.4.8', 'Program_60', 'Pitch_53', 'Velocity_103', 'Duration_0.4.8', 'Program_0', 'Pitch_65', 'Velocity_103', 'Duration_0.4.8', 'Pitch_69
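With `program_changes=True`, a `Program` token applies to every note that follows it until the next `Program` token. As a rough illustration of how to read the single-stream output above, the sketch below counts notes per program from a list of token strings. `notes_per_program` is a hypothetical helper written for this notebook, not a MidiTok function, and the excerpt is copied from the first tokens of the output.

```python
from collections import Counter


def notes_per_program(tokens: list[str]) -> Counter:
    """Count Pitch tokens under the most recent Program token."""
    counts: Counter = Counter()
    current = None  # no Program token seen yet
    for tok in tokens:
        kind, _, value = tok.partition("_")
        if kind == "Program":
            current = int(value)
        elif kind == "Pitch" and current is not None:
            counts[current] += 1
    return counts


# First tokens of the single-stream output shown above
excerpt = [
    "Bar_None", "TimeSig_4/4", "Position_0", "Tempo_121.29",
    "Program_74", "Pitch_69", "Velocity_103", "Duration_1.0.8",
    "Pitch_72", "Velocity_103", "Duration_1.0.8",
    "Program_60", "Pitch_53", "Velocity_103", "Duration_1.0.8",
    "Program_42", "Pitch_53", "Velocity_79", "Duration_1.0.8",
    "Program_0", "Pitch_69", "Velocity_103", "Duration_1.0.8",
    "Pitch_72", "Velocity_103", "Duration_1.0.8",
]

print(dict(notes_per_program(excerpt)))
```

At `Position_0`, programs 74 and 0 each play two notes and programs 60 and 42 play one each, which is the interleaving a model has to learn in single-stream mode.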

In [23]:
output = tokenizer.decode(token)
output

Score(ttype=Tick, tpq=16, begin=0, end=1504, tracks=4, notes=601, time_sig=1, key_sig=0, markers=0)

In [24]:
output.dump_midi("one_token_stream_True.mid")

In [13]:
from miditok import REMI, TokenizerConfig

one_token_stream_test = False

# Set up a config that gives REMI+ behavior
config = TokenizerConfig(
    use_programs=True,                                    # Include instrument program numbers (to distinguish instruments)
    one_token_stream_for_programs=one_token_stream_test,  # If True, merge all instruments into a single token stream
    use_time_signatures=True,                             # Include time signature tokens in the encoding
    use_chords=True,                                      # (Optional) Detect and include chord tokens
    use_rests=True,                                       # (Optional) Represent silence periods as rest tokens
    use_tempos=True,                                      # (Optional) Include tempo change tokens
)

tokenizer = REMI(config)

tokenizer

564 tokens with ('I', 'T') io format, not trained

In [14]:
test_midi_file = "../../midi_tests/3-Due_pupille.mid"

token = tokenizer(test_midi_file)
token

[TokSequence(tokens=['Bar_None', 'TimeSig_4/4', 'Position_0', 'Tempo_121.29', 'Program_74', 'Pitch_72', 'Velocity_103', 'Duration_1.0.8', 'Position_8', 'Program_74', 'Pitch_70', 'Velocity_103', 'Duration_0.4.8', 'Position_12', 'Program_74', 'Pitch_69', 'Velocity_103', 'Duration_0.4.8', 'Position_16', 'Program_74', 'Pitch_69', 'Velocity_103', 'Duration_1.0.8', 'Position_24', 'Program_74', 'Pitch_69', 'Velocity_103', 'Duration_0.4.8', 'Position_28', 'Program_74', 'Pitch_69', 'Velocity_87', 'Duration_0.4.8', 'Position_31', 'Program_74', 'Pitch_72', 'Velocity_87', 'Duration_0.1.8', 'Bar_None', 'TimeSig_4/4', 'Position_0', 'Program_74', 'Pitch_70', 'Velocity_103', 'Duration_1.4.8', 'Position_12', 'Program_74', 'Pitch_69', 'Velocity_103', 'Duration_0.4.8', 'Position_16', 'Program_74', 'Pitch_67', 'Velocity_103', 'Duration_1.0.8', 'Rest_1.0.4', 'Position_0', 'Program_74', 'Pitch_69', 'Velocity_83', 'Duration_0.6.8', 'Rest_0.2.8', 'Position_8', 'Program_74', 'Pitch_67', 'Velocity_83', 'Duratio

In [15]:
output = tokenizer.decode(token)
output

Score(ttype=Tick, tpq=16, begin=0, end=1504, tracks=5, notes=601, time_sig=1, key_sig=0, markers=0)

In [16]:
output.dump_midi("one_token_stream_False.mid")

In [6]:
from miditok import REMI, TokenizerConfig

config = TokenizerConfig(
    # ----- Instruments -----
    use_programs=True,                   # Distinguish instruments via program numbers
    one_token_stream_for_programs=True,  # Merge instruments into one stream (if False → one stream per track)
    program_changes=True,                # Emit a Program token only when the instrument changes

    # ----- Musical structure -----
    use_time_signatures=True,            # Add TimeSig tokens
    use_tempos=True,                     # Add Tempo tokens
    use_chords=True,                     # Detect and include chord tokens
    use_rests=True,                      # Represent silences as Rest tokens

    # ----- Velocity -----
    use_velocities=True,                 # Include Velocity tokens (for dynamics); this is the default
)

tokenizer = REMI(config)
tokenizer


564 tokens with ('T',) io format (one token stream), not trained

In [7]:
test_midi_file = "../../midi_tests/3-Due_pupille.mid"

token = tokenizer.encode(test_midi_file)
token

TokSequence(tokens=['Bar_None', 'TimeSig_4/4', 'Position_0', 'Tempo_121.29', 'Program_74', 'Pitch_69', 'Velocity_103', 'Duration_1.0.8', 'Pitch_72', 'Velocity_103', 'Duration_1.0.8', 'Program_60', 'Pitch_53', 'Velocity_103', 'Duration_1.0.8', 'Program_42', 'Pitch_53', 'Velocity_79', 'Duration_1.0.8', 'Program_0', 'Pitch_69', 'Velocity_103', 'Duration_1.0.8', 'Pitch_72', 'Velocity_103', 'Duration_1.0.8', 'Position_8', 'Program_74', 'Pitch_67', 'Velocity_103', 'Duration_0.4.8', 'Pitch_70', 'Velocity_103', 'Duration_0.4.8', 'Program_60', 'Pitch_53', 'Velocity_103', 'Duration_0.4.8', 'Program_42', 'Pitch_53', 'Velocity_79', 'Duration_1.0.8', 'Program_0', 'Pitch_67', 'Velocity_103', 'Duration_0.4.8', 'Pitch_70', 'Velocity_103', 'Duration_0.4.8', 'Position_12', 'Program_74', 'Pitch_65', 'Velocity_103', 'Duration_0.4.8', 'Pitch_69', 'Velocity_103', 'Duration_0.4.8', 'Program_60', 'Pitch_53', 'Velocity_103', 'Duration_0.4.8', 'Program_0', 'Pitch_65', 'Velocity_103', 'Duration_0.4.8', 'Pitch_69

In [3]:
output = tokenizer.decode(token)
output

Score(ttype=Tick, tpq=16, begin=0, end=1504, tracks=4, notes=601, time_sig=1, key_sig=0, markers=0)

In [4]:
output.dump_midi("one_token_stream_True_2.mid")

In [1]:
from music_tokenizer import get_tokenized_data

get_tokenized_data("../../dataset/midi_dataset", "../../dataset/tokens_folder")

Tokenizing MIDI files: 100%|██████████| 182/182 [00:33<00:00,  5.44it/s]
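The body of `music_tokenizer.get_tokenized_data` is not shown in this notebook. Assuming its second argument is an output folder that receives one file of token ids per MIDI file, the saving step could look roughly like the sketch below. `save_token_ids` and `load_token_ids` are hypothetical names, and the JSON layout is an assumption made for illustration.

```python
import json
import tempfile
from pathlib import Path


def save_token_ids(ids: list[int], source_name: str, out_dir: str) -> Path:
    """Write one sequence of token ids to <out_dir>/<source stem>.json."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    target = out_path / f"{Path(source_name).stem}.json"
    target.write_text(json.dumps({"ids": ids}))
    return target


def load_token_ids(path: Path) -> list[int]:
    """Read back the ids written by save_token_ids."""
    return json.loads(path.read_text())["ids"]


# Round trip in a temporary folder, with made-up ids
with tempfile.TemporaryDirectory() as tmp:
    saved = save_token_ids([4, 17, 203], "3-Due_pupille.mid", tmp)
    restored = load_token_ids(saved)

print(saved.name, restored)
```

Storing ids rather than token strings keeps the files small, but they can only be decoded with the same tokenizer configuration that produced them.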


In [None]:
from pathlib import Path
from tqdm import tqdm

from music_tokenizer import get_tokenizer

def get_tokenized_data(midi_path: str) -> list[list[int]]:
    """Tokenize every .mid file in `midi_path` and return one list of token ids per file."""
    # 1. Get the tokenizer
    tokenizer = get_tokenizer()

    # 2. Convert the midi_path string to a Path object
    midi_dir = Path(midi_path)
    tokenized = []

    # 3. Loop through every MIDI file in the folder (sorted for a reproducible order)
    for midi_file in tqdm(sorted(midi_dir.glob("*.mid")), desc="Tokenizing MIDI files"):
        # 3.1. Tokenize the MIDI file
        tokens = tokenizer(midi_file)
        # 3.2. Keep only the ids (assumes a single-stream tokenizer that returns one TokSequence)
        tokenized.append(tokens.ids)

    return tokenized


tokenized = get_tokenized_data("../../dataset/midi_dataset")

Tokenizing MIDI files: 100%|██████████| 182/182 [00:31<00:00,  5.72it/s]
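The function above reads `tokens.ids`, which works only when the tokenizer returns a single `TokSequence`. With `one_token_stream_for_programs=False` the result is a list of sequences, as seen earlier in this notebook. A minimal sketch of a helper that accepts both shapes follows. `collect_ids` is a hypothetical name, and `SimpleNamespace` objects stand in for `TokSequence` so the example runs without MidiTok.

```python
from types import SimpleNamespace


def collect_ids(tokens) -> list[list[int]]:
    """Return one list of ids per stream, for a single sequence or a list of them."""
    streams = tokens if isinstance(tokens, list) else [tokens]
    return [list(seq.ids) for seq in streams]


# Stand-ins for TokSequence objects: anything with an `ids` attribute works here
single = SimpleNamespace(ids=[1, 2, 3])
per_track = [SimpleNamespace(ids=[1, 2]), SimpleNamespace(ids=[3])]

print(collect_ids(single))     # [[1, 2, 3]]
print(collect_ids(per_track))  # [[1, 2], [3]]
```

Inside the loop, `tokenized.extend(collect_ids(tokens))` would then replace `tokenized.append(tokens.ids)`, at the cost of no longer having exactly one entry per file.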
