Since an audio file is time-series data, it contains a huge number of samples per second (44,100/sec for CD-quality audio), and each sample can take one of 2^16 amplitude values (16-bit audio). Furthermore, real-world sound is periodic, and a single cycle can span a substantial number of samples, so a model must capture long-range structure in the data. As a result, the off-the-shelf LSTM or RNN models in machine-learning libraries are not well suited to training on music data.
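For concreteness, here is a minimal sketch of what those numbers look like when you inspect a file with SciPy (the filename `sample.wav` is only a placeholder):

```python
# Inspect a 16-bit PCM WAV file ('sample.wav' is a placeholder name).
from scipy.io import wavfile

rate, data = wavfile.read('sample.wav')
print(rate)                   # sample rate, e.g. 44100 samples per second
print(data.dtype)             # int16 -> 2**16 possible amplitude values
print(data.shape[0])          # total number of samples
print(data.shape[0] / rate)   # duration in seconds
```

Even a few seconds of audio at this rate is a sequence hundreds of thousands of steps long, which is what makes it hard for a plain recurrent model.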
To overcome these problems, a large receptive field is needed. I found two methods that use this technique: Google DeepMind's WaveNet and WaveGAN. WaveNet is based on dilated CNNs, and WaveGAN adapts DCGAN to one-dimensional audio.
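To see why dilation matters for the receptive field, here is a small back-of-the-envelope calculation (the kernel size and depth are illustrative values, not the actual WaveNet hyperparameters):

```python
# Receptive field of a WaveNet-style stack of dilated 1-D convolutions,
# with the dilation doubling at each layer, versus an undilated stack of
# the same depth. Kernel size and depth are illustrative only.
kernel_size = 2
num_layers = 10  # dilations 1, 2, 4, ..., 512

dilated_rf = 1
plain_rf = 1
for layer in range(num_layers):
    dilated_rf += (kernel_size - 1) * 2 ** layer
    plain_rf += (kernel_size - 1)

print(dilated_rf)  # 1024 samples covered by the dilated stack
print(plain_rf)    # only 11 samples for the same depth without dilation
```

With dilation, the receptive field grows exponentially with depth instead of linearly, which is what lets these models see long stretches of raw audio at once.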
Of the two, I chose the latter. By referencing the paper and the authors' GitHub code, I was able to generate ASMR files.
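For reference, generation roughly follows the inference snippet in the WaveGAN repository: the training script exports an inference graph, and you feed uniform latent vectors to the generator. The checkpoint paths and tensor names (`z:0`, `G_z:0`) below are what I recall from that export and may differ in your checkout:

```python
# Sketch of sampling from a trained WaveGAN generator (TensorFlow 1.x),
# based on the generation example in the WaveGAN repository. Paths and
# tensor names are assumptions taken from that repo's export.
import numpy as np
import tensorflow as tf

tf.reset_default_graph()
saver = tf.train.import_meta_graph('infer.meta')
graph = tf.get_default_graph()
sess = tf.InteractiveSession()
saver.restore(sess, 'model.ckpt')

_z = (np.random.rand(1, 100) * 2.0) - 1.0   # latent vectors in [-1, 1)
z = graph.get_tensor_by_name('z:0')
G_z = graph.get_tensor_by_name('G_z:0')
audio = sess.run(G_z, {z: _z})[0, :, 0]     # one generated raw waveform
```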
Running the whole procedure in a local environment takes a huge amount of time per file. To train on multiple datasets quickly, I used NSML, an on-premise MLaaS platform. By referencing the NSML documentation and the NSML examples, you can port your code to NSML, and it will allocate GPU resources to you.
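As a rough sketch of what porting looks like (based on the public NSML examples; `DATASET_PATH` is where NSML mounts the allocated dataset, and the save/load bodies and training step here are placeholders):

```python
# Minimal NSML session skeleton, based on the public NSML examples.
import os
import nsml
from nsml import DATASET_PATH  # NSML mounts the requested dataset here

def save(dir_name):
    pass  # write model checkpoints into dir_name so NSML can snapshot them

def load(dir_name):
    pass  # restore model checkpoints from a previous NSML snapshot

nsml.bind(save=save, load=load)  # register the hooks with the platform

data_dir = os.path.join(DATASET_PATH, 'train')  # assumed dataset layout

for epoch in range(10):
    loss = 0.0  # placeholder for one real training epoch over data_dir
    nsml.report(step=epoch, loss=loss)  # metrics show up in the NSML web UI
```

Per the examples, a session is then launched from the CLI with something like `nsml run -d <dataset> -e main.py`, after which the platform assigns a GPU to the job.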
- Chopin - Ballades (Krystian Zimerman)
- Bach - Goldberg Variations (Glenn Gould)
- Sergei Rachmaninov - Piano Concerto No. 1, 1st mov.
- La La Land OST (instruments)
- Philip Glass - Opening
- Antonio Carlos Jobim - Wave
- Autumn Leaves
- Bird
- La Seine
- NYC
- Cutting
- Tropical wave
- Barber shop
- Trafalgar Square
- Fry
- Le Café
- Yamanote Line