<a href="https://colab.research.google.com/github/wjung1008/AI-Christmas/blob/main/AI_christmas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Original Code
https://github.com/openai/jukebox
https://colab.research.google.com/github/SMarioMan/jukebox/blob/master/jukebox/Interacting_with_Jukebox.ipynb#scrollTo=Zy4Rehq9ZKv_

### Check GPU in Google Colab
A P100 is recommended; on other GPUs the notebook may run out of memory.

In [5]:
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-f80de020-26ad-b982-abb8-f1a2e1f88674)
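The check above can also be done in code by parsing the `nvidia-smi -L` line. This is only a minimal sketch: `gpu_name_from_listing`, `is_recommended`, and the `RECOMMENDED` set are hypothetical helpers written for illustration, not part of Jukebox or Colab.

```python
import re

# Hypothetical allow-list for illustration; the notebook only names the P100.
RECOMMENDED = {"Tesla P100"}

def gpu_name_from_listing(line):
    """Extract the GPU model name from one line of `nvidia-smi -L` output."""
    match = re.match(r"GPU \d+: (.+?) \(UUID: ", line)
    return match.group(1) if match else None

def is_recommended(line):
    """Return True if the listed GPU name starts with a recommended model."""
    name = gpu_name_from_listing(line)
    return name is not None and any(name.startswith(r) for r in RECOMMENDED)

# The line printed by the cell above.
listing = "GPU 0: Tesla T4 (UUID: GPU-f80de020-26ad-b982-abb8-f1a2e1f88674)"
print(gpu_name_from_listing(listing), is_recommended(listing))  # Tesla T4 False
```

For this session the result is `False`, which is consistent with the memory problems described at the end of the notebook.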


### Mount Google Drive to save sample levels as they are generated.

In [6]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


### Prepare the environment.



In [7]:
!pip install git+https://github.com/openai/jukebox.git

Collecting git+https://github.com/openai/jukebox.git
  Cloning https://github.com/openai/jukebox.git to /tmp/pip-req-build-hlqg3mr3
  Running command git clone -q https://github.com/openai/jukebox.git /tmp/pip-req-build-hlqg3mr3
Building wheels for collected packages: jukebox
  Building wheel for jukebox (setup.py) ... done
  Created wheel for jukebox: filename=jukebox-1.0-cp36-none-any.whl size=197909 sha256=98e1f227d25aeb2e0adf6b305e859cd85c027ee2f237fee29134410cd774a6a3
  Stored in directory: /tmp/pip-ephem-wheel-cache-__npbsv6/wheels/bd/b6/f9/ad38a67dd989a522bbe6677e95efbc4607cdcf71e7249485fe
Successfully built jukebox


## Import necessary packages


In [8]:
import jukebox
import torch as t
import librosa
import os
from IPython.display import Audio
from jukebox.make_models import make_vqvae, make_prior, MODELS, make_model
from jukebox.hparams import Hyperparams, setup_hparams
from jukebox.sample import sample_single_window, _sample, \
                           sample_partial_window, upsample, \
                           load_prompts
from jukebox.utils.dist_utils import setup_dist_from_mpi
from jukebox.utils.torch_utils import empty_cache
rank, local_rank, device = setup_dist_from_mpi()

Using cuda True


## Sample from the 5B or 1B Lyrics Model

In [9]:
model = '1b_lyrics' # or '5b' or '5b_lyrics'
hps = Hyperparams()
hps.sr = 44100
hps.n_samples = 3 if model in ('1b', '1b_lyrics') else 8
# Specifies the directory to save the sample in.
# We set this to the Google Drive mount point.
hps.name = '/content/gdrive/MyDrive/Jukebox/samples'
chunk_size = 16 if model in ('1b', '1b_lyrics') else 32
max_batch_size = 3 if model in ('1b', '1b_lyrics') else 16
hps.levels = 3
hps.hop_fraction = [.5,.5,.125]

vqvae, *priors = MODELS[model]
vqvae = make_vqvae(setup_hparams(vqvae, dict(sample_length = 1048576)), device)
top_prior = make_prior(setup_hparams(priors[-1], dict()), vqvae, device)

Downloading from azure
Restored from /root/.cache/jukebox/models/5b/vqvae.pth.tar
0: Loading vqvae in eval mode
Creating cond. autoregress with prior bins [79, 2048], 
dims [384, 6144], 
shift [ 0 79]
input shape 6528
input bins 2127
Self copy is False
Loading artist IDs from /usr/local/lib/python3.6/dist-packages/jukebox/data/ids/v3_artist_ids.txt
Loading artist IDs from /usr/local/lib/python3.6/dist-packages/jukebox/data/ids/v3_genre_ids.txt
Level:2, Cond downsample:None, Raw to tokens:128, Sample length:786432
Downloading from azure
Restored from /root/.cache/jukebox/models/1b_lyrics/prior_level_2.pth.tar
0: Loading prior in eval mode


# Select mode
Primed mode
- Condition the model on the first part of an existing song and let the AI sing the rest

Ancestral mode
- Let the AI sing the whole song from scratch, based on the artist and genre
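Each mode needs different inputs, and the sampling cell later in this notebook also accepts an `'upsample'` mode that reads saved codes. The sketch below mirrors the asserts in that cell; `check_mode` is a hypothetical helper for illustration, not a Jukebox function.

```python
def check_mode(mode, audio_file=None, codes_file=None):
    """Validate the inputs a sampling mode needs and describe what it will do."""
    if mode == "ancestral":
        # No prompt needed: the song is generated from scratch.
        return "sample from scratch"
    if mode == "primed":
        if audio_file is None:
            raise ValueError("primed mode needs an audio_file to continue from")
        return f"continue from {audio_file}"
    if mode == "upsample":
        if codes_file is None:
            raise ValueError("upsample mode needs a codes_file of saved codes")
        return f"upsample codes in {codes_file}"
    raise ValueError(f"Unknown sample mode {mode}.")

print(check_mode("primed", audio_file="jingle_bells.mp3"))
print(check_mode("ancestral"))
```

Checking this before loading the model avoids finding a missing file only after several minutes of setup.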

In [10]:
# Prime song creation using an arbitrary audio sample.
mode = 'primed'
codes_file=None
# Specify an audio file here.
audio_file = '/content/gdrive/MyDrive/Jukebox/samples/jingle_bells.mp3'
# Specify how many seconds of audio to prime on.
prompt_length_in_seconds=12

sample_hps = Hyperparams(dict(mode=mode, codes_file=codes_file, audio_file=audio_file, prompt_length_in_seconds=prompt_length_in_seconds))

## Specifying Artist and genre with lyrics

### Genre can be chosen here:
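 https://github.com/openai/jukebox/blob/master/jukebox/data/ids/v2_genre_ids.txt

 The load log above shows the 1b_lyrics model reading v3_artist_ids.txt and v3_genre_ids.txt, so check those files in the same directory as well.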

#### Keep the sample short enough to finish before a free-tier Google Colab session expires.
#### In my case, I set it to 30 s to keep Google Colab from crashing.

In [11]:
sample_length_in_seconds = 30          # Full length of musical sample to generate - we find songs in the 1 to 4 minute
                                       # range work well, with generation time proportional to sample length.  
                                       # This total length affects how quickly the model 
                                       # progresses through lyrics (model also generates differently
                                       # depending on if it thinks it's in the beginning, middle, or end of sample)
hps.sample_length = (int(sample_length_in_seconds*hps.sr)//top_prior.raw_to_tokens)*top_prior.raw_to_tokens
assert hps.sample_length >= top_prior.n_ctx*top_prior.raw_to_tokens, 'Please choose a larger sample length'
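To see what the two lines above compute for this run, here is the arithmetic worked through in plain Python. The constants are read from the model load log earlier in the notebook ("Raw to tokens:128", "Sample length:786432"). I am assuming the second number equals `top_prior.n_ctx*top_prior.raw_to_tokens`, and `round_down_to_tokens` is a hypothetical helper.

```python
def round_down_to_tokens(seconds, sr, raw_to_tokens):
    """Length in raw audio samples, rounded down to a whole number of tokens."""
    return (int(seconds * sr) // raw_to_tokens) * raw_to_tokens

SR = 44100            # hps.sr in this notebook
RAW_TO_TOKENS = 128   # "Raw to tokens:128" in the load log
MIN_SAMPLES = 786432  # "Sample length:786432" in the load log (assumed n_ctx * raw_to_tokens)

sample_length = round_down_to_tokens(30, SR, RAW_TO_TOKENS)
print(sample_length, sample_length // RAW_TO_TOKENS)  # 1322880 10335
print(MIN_SAMPLES // RAW_TO_TOKENS)                   # 6144
print(round(MIN_SAMPLES / SR, 2))                     # 17.83
```

Under these assumptions, 30 seconds becomes 10335 top-level tokens, and anything shorter than about 17.83 seconds would trip the assert.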

In [12]:
# Note: Metas can contain different prompts per sample.
# By default, all samples use the same prompt.
metas = [dict(artist = "boney m",
            genre = "Pop",
            total_length = hps.sample_length,
            offset = 0,
            lyrics = """Jingle bells, Jingle bells, Jingle all the way
Oh what fun it is to ride in a one-horse open sleigh
Jingle bells, Jingle bells, Jingle all the way
Oh what fun it is to ride in a one-horse open sleigh
Jingle bells, Jingle bells, Jingle all the way
Oh what fun it is to ride in a one-horse open sleigh
Jingle bells, Jingle bells, Jingle all the way
Oh what fun it is to ride in a one-horse open sleigh
Dashing thro' the snow, in a one-horse open sleigh
O'er the fields we go, laughing all the way
Bells on bob-tails ring, making spirits bright
what fun it is to ride and sing a sleighing song tonight
Jingle bells, Jingle bells, Jingle all the way
Oh what fun it is to ride in a one-horse open sleigh
Jingle bells, Jingle bells, Jingle all the way
Oh what fun it is to ride in a one-horse open sleigh
A day or two ago, I thought I'd take a ride
And soon Miss Fanny Bright
Was seated at my side
The horse was lean and lank
Misfortune seemed his lot
he got into and drifted back, and we we've got upset
Jingle bells, Jingle bells, Jingle all the way
Oh what fun it is to ride in a one-horse open sleigh
Jingle bells, Jingle bells, Jingle all the way
Oh what fun it is to ride in a one-horse open sleigh
Jingle bells, Jingle bells, Jingle all the way
Oh what fun it is to ride in a one-horse open sleigh
Jingle bells, Jingle bells, Jingle all the way
Oh what fun it is to ride in a one-horse open sleigh
""",),] * hps.n_samples
labels = [None, None, top_prior.labeller.get_batch_labels(metas, 'cuda')]

Input artist boney m maps to boney m, which is not present in /usr/local/lib/python3.6/dist-packages/jukebox/data/ids/v3_artist_ids.txt. Defaulting to (artist_id, artist) = (0, unknown), if that seems wrong please format artist correctly
Input artist boney m maps to boney m, which is not present in /usr/local/lib/python3.6/dist-packages/jukebox/data/ids/v3_artist_ids.txt. Defaulting to (artist_id, artist) = (0, unknown), if that seems wrong please format artist correctly
Input artist boney m maps to boney m, which is not present in /usr/local/lib/python3.6/dist-packages/jukebox/data/ids/v3_artist_ids.txt. Defaulting to (artist_id, artist) = (0, unknown), if that seems wrong please format artist correctly


In [13]:
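sampling_temperature = .98  # temperature for the top-level prior (level 2)

# Batch and chunk sizes for the two upsampling levels (0 and 1).
lower_batch_size = 16
lower_level_chunk_size = 32
# Batch and chunk sizes for the top level; smaller for the 1B model to save memory.
max_batch_size = 3 if model in ('1b', '1b_lyrics') else 16
chunk_size = 16 if model in ('1b', '1b_lyrics') else 32
# One dict per level, ordered level 0, level 1, level 2.
sampling_kwargs = [dict(temp=0.99, fp16=True, max_batch_size=lower_batch_size,
                        chunk_size=lower_level_chunk_size),
                    dict(temp=0.99, fp16=True, max_batch_size=lower_batch_size,
                         chunk_size=lower_level_chunk_size),
                    dict(temp=sampling_temperature, fp16=True, 
                         max_batch_size=max_batch_size, chunk_size=chunk_size)]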
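The `temp` values above control how random the sampling is. Jukebox's own sampling code is not shown in this notebook, so the sketch below is just the standard temperature-scaled softmax, which I assume is what `temp` refers to. The logits are made up.

```python
import math

def softmax_with_temperature(logits, temp):
    """Turn logits into probabilities after dividing them by `temp`."""
    scaled = [x / temp for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for temp in (1.0, 0.98, 0.5):
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

Lower temperatures put more probability on the highest-scoring token, so a value just under 1, like the .98 used here, makes sampling only slightly more conservative.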

Now we're ready to sample from the model. We'll generate the top level (2) first, followed by the first upsampling (level 1), and the second upsampling (level 0). In this Colab we load the top prior separately from the upsamplers, because of memory concerns on the hosted runtimes. If you are using a local machine, you can also load all models directly with make_models, and then use sample.py's ancestral_sampling to put this all in one step.

After each level, we decode to raw audio and save the audio files.

This next cell will take a while (approximately 10 minutes per 20 seconds of generated music).

In [14]:
if sample_hps.mode == 'ancestral':
  zs = [t.zeros(hps.n_samples,0,dtype=t.long, device='cuda') for _ in range(len(priors))]
  zs = _sample(zs, labels, sampling_kwargs, [None, None, top_prior], [2], hps)
elif sample_hps.mode == 'upsample':
  assert sample_hps.codes_file is not None
  # Load codes.
  data = t.load(sample_hps.codes_file, map_location='cpu')
  zs = [z.cuda() for z in data['zs']]
  assert zs[-1].shape[0] == hps.n_samples, f"Expected bs = {hps.n_samples}, got {zs[-1].shape[0]}"
  del data
  print('Falling through to the upsample step later in the notebook.')
elif sample_hps.mode == 'primed':
  assert sample_hps.audio_file is not None
  audio_files = sample_hps.audio_file.split(',')
  duration = (int(sample_hps.prompt_length_in_seconds*hps.sr)//top_prior.raw_to_tokens)*top_prior.raw_to_tokens
  x = load_prompts(audio_files, duration, hps)
  zs = top_prior.encode(x, start_level=0, end_level=len(priors), bs_chunks=x.shape[0])
  zs = _sample(zs, labels, sampling_kwargs, [None, None, top_prior], [2], hps)
else:
  raise ValueError(f'Unknown sample mode {sample_hps.mode}.')



Sampling level 2
Sampling 6144 tokens for [0,6144]. Conditioning on 4134 tokens
Primed sampling 3 samples with temp=0.98, top_k=0, top_p=0.0
283/283 [00:31<00:00,  8.91it/s]
2010/2010 [02:51<00:00, 11.75it/s]
Sampling 6144 tokens for [768,6912]. Conditioning on 5376 tokens
Primed sampling 3 samples with temp=0.98, top_k=0, top_p=0.0
360/360 [00:39<00:00,  9.16it/s]
768/768 [01:05<00:00, 11.77it/s]
Sampling 6144 tokens for [1536,7680]. Conditioning on 5376 tokens
Primed sampling 3 samples with temp=0.98, top_k=0, top_p=0.0
360/360 [00:39<00:00,  9.05it/s]
768/768 [01:05<00:00, 11.72it/s]
Sampling 6144 tokens for [2304,8448]. Conditioning on 5376 tokens
Primed sampling 3 samples with temp=0.98, top_k=0, top_p=0.0
360/360 [00:39<00:00,  9.18it/s]
768/768 [01:05<00:00, 11.75it/s]
Sampling 6144 tokens for [3072,9216]. Conditioning on 5376 tokens
Primed sampling 3 samples with temp=0.98, top_k=0, top_p=0.0
360/360 [00:39<00:00,  9.05it/s]
768/768 [01:05<00:00, 11.71it/s]
Sampling 6144 tokens
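The window ranges in the log above can be reproduced with a little arithmetic. This is a sketch that assumes each window holds 6144 tokens and advances by `hop_fraction` (.125 for the top level) of that. `window_starts` is a hypothetical helper written to match the log, not a call into Jukebox.

```python
def window_starts(total_tokens, n_ctx, hop):
    """Start offsets of overlapping windows; the last is shifted back to end at total_tokens."""
    starts = []
    for start in range(0, total_tokens - n_ctx + hop, hop):
        if start + n_ctx >= total_tokens:
            start = total_tokens - n_ctx
        starts.append(start)
    return starts

N_CTX = 6144                  # tokens per window, from the log
HOP = int(N_CTX * 0.125)      # 768, from hps.hop_fraction for the top level
TOTAL = (30 * 44100) // 128   # 10335 tokens for the 30 s sample
PROMPT = (12 * 44100) // 128  # 4134 tokens from the 12 s prompt

tokens_so_far = PROMPT        # tokens already known before sampling starts
for start in window_starts(TOTAL, N_CTX, HOP):
    end = start + N_CTX
    conditioning = min(tokens_so_far, end) - start
    print(f"[{start},{end}] conditioning on {conditioning} tokens")
    tokens_so_far = end
```

The first five windows printed by this sketch match the log: `[0,6144]` conditioned on the 4134 prompt tokens, then `[768,6912]` onward conditioned on 5376 tokens each. The log is cut off after that, so the remaining windows are only what the sketch predicts.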

In [15]:
Audio(f'{hps.name}/level_2/item_0.wav')

## In my case, I couldn't upsample the output because Google Colab ran out of memory.
## I stopped here, but the output can be upsampled further to reduce noise. Refer to the original code for the upsampling step.