# MAGNeT
Welcome to MAGNeT's demo Jupyter notebook.
Here you will find a self-contained example of how to use MAGNeT for music and sound-effect generation.

First, we initialize MAGNeT for music generation. You can choose a model from the following selection:
1. facebook/magnet-small-10secs - a 300M-parameter non-autoregressive transformer that generates 10-second music samples conditioned on text.
2. facebook/magnet-medium-10secs - 1.5B parameters, 10-second music samples.
3. facebook/magnet-small-30secs - 300M parameters, 30-second music samples.
4. facebook/magnet-medium-30secs - 1.5B parameters, 30-second music samples.

We will use the `facebook/magnet-small-10secs` variant for the purpose of this demonstration.
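The four checkpoints listed above vary along two axes, model size and clip length, so the choice can be written as a small lookup. The sketch below is illustrative only: `MAGNET_MUSIC_MODELS` and `pick_music_model` are hypothetical names that are not part of audiocraft, and only the checkpoint identifiers come from the list above.

```python
# Hypothetical lookup table; the identifiers are the ones listed in this notebook.
MAGNET_MUSIC_MODELS = {
    "facebook/magnet-small-10secs": {"size": "small", "seconds": 10},
    "facebook/magnet-medium-10secs": {"size": "medium", "seconds": 10},
    "facebook/magnet-small-30secs": {"size": "small", "seconds": 30},
    "facebook/magnet-medium-30secs": {"size": "medium", "seconds": 30},
}


def pick_music_model(size, seconds):
    """Return the checkpoint name matching the requested size and clip length."""
    for name, info in MAGNET_MUSIC_MODELS.items():
        if info["size"] == size and info["seconds"] == seconds:
            return name
    raise ValueError(f"no MAGNeT music checkpoint for size={size!r}, seconds={seconds!r}")


print(pick_music_model("small", 10))
```

The returned string can be passed to `MAGNeT.get_pretrained`, as in the cell below.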

In [1]:
from audiocraft.models import MAGNeT

model = MAGNeT.get_pretrained('facebook/magnet-small-10secs')



Next, let us configure the generation parameters. Specifically, you can control the following:
* `use_sampling` (bool, optional): use sampling if True, else do argmax decoding. Defaults to True.
* `top_k` (int, optional): top_k used for sampling. Defaults to 0.
* `top_p` (float, optional): top_p used for sampling, when set to 0 top_k is used. Defaults to 0.9.
* `temperature` (float, optional): Initial softmax temperature parameter. Defaults to 3.0.
* `max_cfg_coef` (float, optional): Initial coefficient used for classifier-free guidance. Defaults to 10.0.
* `min_cfg_coef` (float, optional): Final coefficient used for classifier-free guidance. Defaults to 1.0.
* `decoding_steps` (list of n_q ints, optional): The number of iterative decoding steps, for each of the n_q RVQ codebooks.
* `span_arrangement` (str, optional): Use either non-overlapping spans ('nonoverlap') or overlapping spans ('stride1') in the masking scheme.

Any parameter you do not set falls back to its default value.
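To make the sampling parameters concrete, here is a minimal pure-Python sketch of how `use_sampling`, `temperature`, `top_p`, and `top_k` typically interact when picking one token from a vector of logits. It is not MAGNeT's implementation: the function names are hypothetical, the temperature is held fixed here (the parameter list calls it an initial value), and the real model works on batched tensors.

```python
import math
import random


def softmax(logits, temperature):
    """Convert logits to probabilities; a higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def keep_only(probs, indices):
    """Zero out every token not in `indices` and renormalize the rest."""
    kept = set(indices)
    mass = sum(probs[i] for i in kept)
    return [p / mass if i in kept else 0.0 for i, p in enumerate(probs)]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose total probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return keep_only(probs, chosen)


def top_k_filter(probs, top_k):
    """Keep only the top_k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return keep_only(probs, order[:top_k])


def sample_token(logits, use_sampling=True, top_k=0, top_p=0.9, temperature=3.0):
    """Pick one token index following the parameter semantics described above."""
    if not use_sampling:
        # Argmax decoding: always take the most likely token.
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    if top_p > 0.0:
        probs = top_p_filter(probs, top_p)
    elif top_k > 0:
        # top_k only applies when top_p is set to 0.
        probs = top_k_filter(probs, top_k)
    return random.choices(range(len(probs)), weights=probs, k=1)[0]
```

With `use_sampling=False` the result is deterministic; otherwise a larger `temperature` or `top_p` lets less likely tokens through.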

In [2]:
model.set_generation_params(
    use_sampling=True,
    top_k=0,
    top_p=0.9,
    temperature=3.0,
    max_cfg_coef=10.0,
    min_cfg_coef=1.0,
    decoding_steps=[int(20 * model.lm.cfg.dataset.segment_duration // 10),  10, 10, 10],
    span_arrangement='stride1'
)
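The `decoding_steps` expression above scales the step count of the first codebook with the clip length and keeps the other three at 10. The arithmetic can be checked in isolation; the sketch below assumes `model.lm.cfg.dataset.segment_duration` is 10 for the 10-second checkpoints and 30 for the 30-second ones, and `decoding_steps_for` is a hypothetical helper name.

```python
def decoding_steps_for(segment_duration, later_steps=(10, 10, 10)):
    """Reproduce the decoding_steps expression used in the cell above."""
    # `*` and `//` have equal precedence and associate left to right,
    # so this is (20 * segment_duration) // 10.
    first = int(20 * segment_duration // 10)
    return [first, *later_steps]


for duration in (10, 30):
    steps = decoding_steps_for(duration)
    print(duration, steps, sum(steps))
# prints:
# 10 [20, 10, 10, 10] 50
# 30 [60, 10, 10, 10] 90
```

Under that assumption a 10-second model runs 50 decoding steps in total, which appears consistent with the `50 / 50` progress line printed during generation.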

We can now generate music from text prompts.

### Text-conditional Generation - Music

In [3]:
from audiocraft.utils.notebook import display_audio

###### Text-to-music prompts - examples ######
# text = "80s electronic track with melodic synthesizers, catchy beat and groovy bass"
# text = "80s electronic track with melodic synthesizers, catchy beat and groovy bass. 170 bpm"
# text = "Earthy tones, environmentally conscious, ukulele-infused, harmonic, breezy, easygoing, organic instrumentation, gentle grooves"
# text = "Funky groove with electric piano playing blue chords rhythmically"
# text = "Rock with saturated guitars, a heavy bass line and crazy drum break and fills."
# text = "A grand orchestral arrangement with thunderous percussion, epic brass fanfares, and soaring strings, creating a cinematic atmosphere fit for a heroic battle"
# text = """There once was a shepherd boy who watched village sheep.
# He became bored and decided to prank the villagers by pretending a wolf was attacking"""
text = """
Grandeur: Majestic, monumental, heroic, grand
Orchestral: Symphonic, lush, rich, full-bodied
Intensity: Powerful, dramatic, intense, thrilling
Emotion: Emotional, poignant, moving, stirring
Epic Scale: Massive, expansive, vast, larger-than-life
Dynamic Scale: Dynamic, contrast, crecendos, swells
Chorus: Choral, Choir, vocal, harmonies"""

N_VARIATIONS = 1
descriptions = [text for _ in range(N_VARIATIONS)]

print(f"text prompt: {text}\n")
output = model.generate(descriptions=descriptions, progress=True, return_tokens=True)
display_audio(output[0], sample_rate=model.compression_model.sample_rate)

text prompt: 
Grandeur: Majestic, monumental, heroic, grand
Orchestral: Symphonic, lush, rich, full-bodied
Intensity: Powerful, dramatic, intense, thrilling
Emotion: Emotional, poignant, moving, stirring
Epic Scale: Massive, expansive, vast, larger-than-life
Dynamic Scale: Dynamic, contrast, crecendos, swells
Chorus: Choral, Choir, vocal, harmonies

    50 /     50

In [6]:
text = """
Reflective Melody: Contemplative, introspective, melodic, soul-stirring
Narrative Journey: Evocative storytelling, lyrical narration, emotional depth
Diverging Paths: Choices, crossroads, uncertainty, branching possibilities
Nature's Embrace: Woodsy ambiance, rustling leaves, whispered breezes
Exploration: Curiosity, discovery, venturing into the unknown
Echoes of Decision: Regret, determination, acceptance, the weight of choices
The Road Less Traveled: Adventure, risk-taking, forging one's own path
Legacy of Choices: Impact, consequence, the ripple effect of decisions
"""

N_VARIATIONS = 3
descriptions = [text for _ in range(N_VARIATIONS)]
print(f"text prompt: {text}\n")
output = model.generate(descriptions=descriptions, progress=True, return_tokens=True)
display_audio(output[0], sample_rate=model.compression_model.sample_rate)

text prompt: 
Reflective Melody: Contemplative, introspective, melodic, soul-stirring
Narrative Journey: Evocative storytelling, lyrical narration, emotional depth
Diverging Paths: Choices, crossroads, uncertainty, branching possibilities
Nature's Embrace: Woodsy ambiance, rustling leaves, whispered breezes
Exploration: Curiosity, discovery, venturing into the unknown
Echoes of Decision: Regret, determination, acceptance, the weight of choices
The Road Less Traveled: Adventure, risk-taking, forging one's own path
Legacy of Choices: Impact, consequence, the ripple effect of decisions


    50 /     50
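The cells above repeat one prompt `N_VARIATIONS` times so that a single `generate` call returns several samples for the same text. Since `descriptions` is just a list of strings, the same pattern extends to several prompts in one batch. A minimal sketch, with `build_descriptions` as a hypothetical helper that is not part of audiocraft:

```python
def build_descriptions(prompts, n_variations):
    """Repeat each prompt n_variations times, keeping variations of a prompt adjacent."""
    return [prompt for prompt in prompts for _ in range(n_variations)]


prompts = [
    "Funky groove with electric piano playing blue chords rhythmically",
    "Rock with saturated guitars, a heavy bass line and crazy drum break and fills.",
]
descriptions = build_descriptions(prompts, n_variations=2)

# Sample i in the generated batch corresponds to descriptions[i].
for i, description in enumerate(descriptions):
    print(i, description)
```

Larger batches need more memory, so the number of prompts times the number of variations should stay within what the device can hold.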

In [24]:
text = "A very sad day"
N_VARIATIONS = 3
descriptions = [text for _ in range(N_VARIATIONS)]
print(f"text prompt: {text}\n")
output = model.generate(descriptions=descriptions, progress=True, return_tokens=True)
display_audio(output[0], sample_rate=model.compression_model.sample_rate)



text prompt: A very sad day

    50 /     50

### Text-conditional Generation - Sound Effects

Besides music, MAGNeT models can generate sound effects from text prompts.
First, let's load an Audio-MAGNeT model from the following collection:
1. facebook/audio-magnet-small - a 300M-parameter non-autoregressive transformer that generates 10-second sound effects conditioned on text.
2. facebook/audio-magnet-medium - 1.5B parameters, 10-second sound effects.

We will use the `facebook/audio-magnet-small` variant for the purpose of this demonstration.

In [4]:
from audiocraft.models import MAGNeT

model = MAGNeT.get_pretrained('facebook/audio-magnet-small')


The recommended parameters for sound generation differ slightly from MAGNeT's defaults, so let's set them explicitly:

In [5]:
model.set_generation_params(
    use_sampling=True,
    top_k=0,
    top_p=0.8,
    temperature=3.5,
    max_cfg_coef=20.0,
    min_cfg_coef=1.0,
    decoding_steps=[int(20 * model.lm.cfg.dataset.segment_duration // 10),  10, 10, 10],
    span_arrangement='stride1'
)
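The cell above raises `max_cfg_coef` to 20.0 while keeping `min_cfg_coef` at 1.0. Since these are described as the initial and final classifier-free guidance coefficients, the guidance strength falls from one to the other over the decoding steps. The sketch below shows one plausible way to do that; the cosine weighting is an assumption made for illustration, not a confirmed detail of MAGNeT, and `cfg_schedule` and `apply_cfg` are hypothetical names.

```python
import math


def cfg_schedule(num_steps, max_cfg_coef=20.0, min_cfg_coef=1.0):
    """Interpolate the guidance coefficient from max (first step) to min (last step)."""
    coefs = []
    for step in range(num_steps):
        # Progress through decoding, from 0.0 at the first step to 1.0 at the last.
        t = step / max(num_steps - 1, 1)
        # Assumed cosine weight: 1.0 at the first step, about 0.0 at the last.
        w = math.cos(t * math.pi / 2)
        coefs.append(w * max_cfg_coef + (1.0 - w) * min_cfg_coef)
    return coefs


def apply_cfg(cond_logits, uncond_logits, coef):
    """Classifier-free guidance: push logits away from the unconditional prediction."""
    return [u + coef * (c - u) for c, u in zip(cond_logits, uncond_logits)]


schedule = cfg_schedule(20)
print(round(schedule[0], 3), round(schedule[-1], 3))
```

A coefficient of 1.0 leaves the text-conditioned logits unchanged, so under this sketch the early steps follow the prompt most strongly and the last steps apply no extra guidance.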

We can now generate sounds from text prompts.

In [28]:
from audiocraft.utils.notebook import display_audio
               
###### Text-to-audio prompts - examples ######
# text = "Seagulls squawking as ocean waves crash while wind blows heavily into a microphone."
# text = "A toilet flushing as music is playing and a man is singing in the distance."
# text = """Two roads diverged in a yellow wood,
# And sorry I could not travel both
# And be one traveler, long I stood
# And looked down one as far as I could
# To where it bent in the undergrowth;
#
# Then took the other, as just as fair,
# And having perhaps the better claim,
# Because it was grassy and wanted wear;
# Though as for that the passing there
# Had worn them really about the same,
#
# And both that morning equally lay
# In leaves no step had trodden black.
# Oh, I kept the first for another day!
# Yet knowing how way leads on to way,
# I doubted if I should ever come back.
#
# I shall be telling this with a sigh
# Somewhere ages and ages hence:
# Two roads diverged in a wood, and I—
# I took the one less traveled by,
# And that has made all the difference."""

text = """Movie scene in a desert with percussion"""

N_VARIATIONS = 3
descriptions = [text for _ in range(N_VARIATIONS)]

print(f"text prompt: {text}\n")
output = model.generate(descriptions=descriptions, progress=True, return_tokens=True)
display_audio(output[0], sample_rate=model.compression_model.sample_rate)

text prompt: Movie scene in a desert with percussion

    50 /     50