This notebook is split into two sections:
- How to run the model for testing and training.
- Code documentation with description of important files, functions, and variables. 

# How to

## Test

To test the model on an audio file, set the variables below and then run the cells that follow. A test example has already been run, so you can also scroll down and play the resulting audio.

In [1]:
# variables to be changed
# Using VCTK data
filename = 'p292_022.wav' # should be included in project directory
sr_hr = 16000
songlength = 600

In [2]:
import sys  
sys.path.insert(0,'generators') 
sys.path.insert(0,'models') 
sys.path.insert(0,'helpers') 

import pickle

import random
import os
import numpy as np
import time
import glob
from pydub import AudioSegment

import librosa

import tensorflow as tf
from tensorflow.keras.models import Model

import importlib

import vctk_audio_generator_2
importlib.reload(vctk_audio_generator_2)
from vctk_audio_generator_2 import VCTKGenerator

import audio_autoencoder
importlib.reload(audio_autoencoder)
from audio_autoencoder import create_autoencoder

import helper_functions
importlib.reload(helper_functions)
from helper_functions import get_input_shape, get_input_shape_2, make_song_list_file, vctk_tracklist_to_partition, audiosegment_to_ndarray

import testing
importlib.reload(testing)
from testing import get_segments_from_file, make_val_list, get_batch_from_audio, reformat_model_arrays, librosa_array_to_pydub, test_on_audio_file

Using TensorFlow backend.


In [3]:
# create model
number_of_layers = 4
skip_connections = True
input_shape = get_input_shape_2(filename, songlength, sr_hr)
print(input_shape)

(9600, 1)
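
The printed shape is consistent with `songlength` being a duration in milliseconds: 600 ms at 16000 samples per second gives 9600 samples. The sketch below only reproduces that arithmetic. `expected_input_shape` is a hypothetical name (it is not part of `helper_functions`), and the millisecond interpretation is an assumption inferred from the output above.

```python
def expected_input_shape(songlength_ms, sample_rate, n_channels=1):
    """Number of samples in a clip of `songlength_ms` milliseconds.

    Assumption: `songlength` is given in milliseconds.
    """
    n_samples = songlength_ms * sample_rate // 1000
    return (n_samples, n_channels)


print(expected_input_shape(600, 16000))  # (9600, 1)
```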


### $r = 2$

In [4]:
ratio = 2

In [5]:
sr_lr = int(sr_hr/ratio)
model_input, model_output = create_autoencoder(input_shape, number_of_layers, ratio, skip_connections)
model = Model(inputs=model_input, outputs=model_output)
# load weights
model.load_weights('weights/wgan_weights_2/generator_weights_49_')

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x2b262b18a860>

In [6]:
# run model
lr_audio, hr_audio, generated_audio = test_on_audio_file(model, filename, songlength, input_shape, ratio, sr_lr, sr_hr)

In [7]:
print('Low Resolution Input:')
lr_audio

Low Resolution Input:


In [8]:
print('High Resolution Ground Truth:')
hr_audio

High Resolution Ground Truth:


In [9]:
print('High Resolution Output:')
generated_audio

High Resolution Output:


### $r = 4$

In [10]:
ratio = 4

In [11]:
sr_lr = int(sr_hr/ratio)
model_input, model_output = create_autoencoder(input_shape, number_of_layers, ratio, skip_connections)
model = Model(inputs=model_input, outputs=model_output)
# load weights
model.load_weights('weights/wgan_weights_4/generator_weights_49_')

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x2b268ffc4a90>

In [12]:
# run model
lr_audio, hr_audio, generated_audio = test_on_audio_file(model, filename, songlength, input_shape, ratio, sr_lr, sr_hr)

In [13]:
print('Low Resolution Input:')
lr_audio

Low Resolution Input:


In [14]:
print('High Resolution Ground Truth:')
hr_audio

High Resolution Ground Truth:


In [15]:
print('High Resolution Output:')
generated_audio

High Resolution Output:


## Train

Each model is defined as a class with its own training loop in a `.train()` method, so training amounts to constructing the class and calling that method. The loops were custom written, although specific pieces such as the UNet, VGGish, and the critic/discriminator were adapted from existing code (links can be found in each respective file) and converted to `tensorflow.keras.Model` format. All model classes follow the same format for their training steps.

To train a UNet on the VCTK dataset, use the following. To train with the content extractor, set `use_content_extractor = True`; the content extractor weights can be found at https://drive.google.com/file/d/1mhqXZ8CANgHyepum7N4yrjiyIg6qaMe6/view.

In [9]:
import sys  
sys.path.insert(0,'generators') 
sys.path.insert(0,'models') 
sys.path.insert(0,'helpers') 

import random
import os
import tensorflow as tf
import time
import importlib


import vctk_audio_generator_2
importlib.reload(vctk_audio_generator_2)
from vctk_audio_generator_2 import VCTKGenerator

import audio_autoencoder
importlib.reload(audio_autoencoder)
from audio_autoencoder import create_autoencoder

import vggish
importlib.reload(vggish)
from vggish import create_VGGish

import helper_functions
importlib.reload(helper_functions)
from helper_functions import get_input_shape, make_song_list_file, vctk_tracklist_to_partition

import sr_unet
importlib.reload(sr_unet)
from sr_unet import SRUNet

To train the model, the following variables and paths must be defined first (see the documentation below for details).

In [10]:
use_content_extractor = False
    
saved_weights_dir_path = '/home/tretzlof/projects/def-pft3/tretzlof/audio_super_resolution/src/weights_fma/content_gan_weights_2'
ratio = 2
dataset_type = 'VCTK'
sr_hr = 16000
save_epoch_step = 1
songlength = 600
datapath = '/home/tretzlof/scratch/VCTK-Corpus/wav48'

sr_lr = int(sr_hr/ratio)  # integer sampling rate, matching the test cells above

number_of_layers = 4
lr = 3e-4
skip_connections = True    
    
batchsize = 64
epochs = 400
content_weights_file_path = '/home/tretzlof/projects/def-pft3/tretzlof/audio_super_resolution/src/weights/vggish/vggish_16k_r2.h5'
n_mels = 20
nfft = 1024
fmin=27.5
fmax=16000
power_melgram=2.0

In [11]:
if dataset_type == 'FMA':
    tracklistfile = '/home/tretzlof/projects/def-pft3/tretzlof/audio_super_resolution/src/fma_medium_audio_list.txt'
elif dataset_type == 'VCTK':
    tracklistfile = '/home/tretzlof/projects/def-pft3/tretzlof/audio_super_resolution/src/vctk_medium_audio_list.txt'
else:
    print(dataset_type, ' is not a correct dataset type')
    
if use_content_extractor: 
    save_pickle_file_path = '/home/tretzlof/projects/def-pft3/tretzlof/audio_super_resolution/src/unet_content_{}_losses.pickle'.format(ratio)
else:
    save_pickle_file_path = '/home/tretzlof/projects/def-pft3/tretzlof/audio_super_resolution/src/unet_no_content_{}_losses.pickle'.format(ratio)

print('\nsr_hr: {}\nratio: {}\nuse_content_extractor: {}\nsaved_weights_dir_path: {}\nsave_epoch_step: {}\nsave_pickle_file_path: {}\ntracklistfile:{}\n\n'.format(sr_hr, ratio, use_content_extractor, saved_weights_dir_path, save_epoch_step, save_pickle_file_path, tracklistfile))

input_shape = get_input_shape(tracklistfile, songlength, sr_hr)
input_length = input_shape[0]

partition, labels = vctk_tracklist_to_partition(tracklistfile, split=[1, 0])

training_generator = VCTKGenerator(partition['train'], songlength, batch_size=batchsize, 
                                   input_shape=input_shape, datapath=datapath, ratio=ratio, sr_lr=sr_lr, sr_hr=sr_hr)
validation_generator = None




# data_output_generator = create_data_generator(training_generator, num_workers)

create_generator = create_autoencoder
create_content_extractor = create_VGGish


sr_hr: 16000
ratio: 2
use_content_extractor: False
saved_weights_dir_path: /home/tretzlof/projects/def-pft3/tretzlof/audio_super_resolution/src/weights_fma/content_gan_weights_2
save_epoch_step: 1
save_pickle_file_path: /home/tretzlof/projects/def-pft3/tretzlof/audio_super_resolution/src/unet_no_content_2_losses.pickle
tracklistfile:/home/tretzlof/projects/def-pft3/tretzlof/audio_super_resolution/src/vctk_medium_audio_list.txt




In [12]:
unet_content = SRUNet(input_shape, input_length, songlength, datapath, save_pickle_file_path, sr_hr, ratio, 
                 tracklistfile, training_generator, validation_generator, number_of_layers, lr, 
                 create_generator, create_content_extractor, batchsize, epochs, skip_connections, use_content_extractor,
                 saved_weights_dir_path, content_weights_file_path, save_epoch_step, n_mels, nfft, fmin, fmax, 
                 power_melgram)

In [13]:
unet_content.train()



--------------------Epoch: 0--------------------
Train step 0/781, Generator loss: 0.00309737
Train step 1/781, Generator loss: 0.12155
Train step 2/781, Generator loss: 0.00107881


KeyboardInterrupt: 

To train the GAN, run the import and variable definition blocks above, then run the following.

In [None]:
from ps_discriminator import create_discriminator

import audiosr_gan
importlib.reload(audiosr_gan)
from audiosr_gan import AudioSRGan

In [None]:
audio_srgan = AudioSRGan(input_shape, input_length, songlength, datapath, save_pickle_file_path, sr_hr, ratio, 
                          tracklistfile, training_generator, validation_generator, number_of_layers, lr, 
                          create_discriminator, create_autoencoder, create_VGGish, batchsize, epochs, 
                          skip_connections, use_content_extractor, saved_weights_dir_path, content_weights_file_path, save_epoch_step, 
                          n_mels, nfft, fmin, fmax, power_melgram)

In [None]:
audio_srgan.train()

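To train the WGAN, run the import and variable definition blocks for the UNet, then run the following. Note that `n_critic` and `clip_value` are not set in the variable block above and must be defined before the model is constructed.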

In [None]:
import ps_discriminator
importlib.reload(ps_discriminator)
from ps_discriminator import create_discriminator

import audiosr_wgan
importlib.reload(audiosr_wgan)
from audiosr_wgan import AudioASRWGAN

In [None]:
audio_srwgan = AudioASRWGAN(input_shape, input_length, songlength, datapath, save_pickle_file_path, sr_hr, ratio, tracklistfile, 
                 training_generator, validation_generator, number_of_layers, lr, create_discriminator, 
                 create_autoencoder, create_content_extractor, batchsize, epochs, skip_connections,
                 use_content_extractor, saved_weights_dir_path, content_weights_file_path, n_critic, clip_value, 
                 save_epoch_step, n_mels, nfft, fmin, fmax, power_melgram)

In [None]:
audio_srwgan.train()

# Documentation

The code is split up into three subdirectories: `generators`, `models`, and `helpers`.

## `generators`

### `vctk_audio_generator_2.py`

Creates a generator that is turned into a TensorFlow `Dataset` to produce training data. A Python generator function yields the batches; to load data, the generator's `next` method is called. Each file is first loaded as a pydub `AudioSegment` and then converted to a 1D sample array (the form used by the Python Librosa module). Loading directly with Librosa was slower than this `AudioSegment` -> sample array route. A high resolution audio sample is loaded first, and the low resolution version is then computed from it.

In [None]:
VCTKGenerator(list_IDs, labels, songlength, batch_size, input_shape, datapath, ratio, sr_lr, sr_hr, n_channels=1, 
                 shuffle=True)

- `list_IDs`: List of strings. Contains list of filenames to be trained on.
- `songlength`: Integer. Length of input to the network 
- `batch_size`: Integer. Training batch size
- `input_shape`: Tuple. Input shape, should be of a one dimensional signal of the form `(n_samples, 1)` where `n_samples` is the number of samples in the input signal
- `datapath`: String. Path to directory where data is stored
- `ratio`: Integer. Upsampling ratio
- `sr_lr`: Integer. Low resolution sampling rate
- `sr_hr`: Integer. High resolution sampling rate
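
The batching idea described for this file can be sketched in plain Python. This is a minimal illustration, not the code in `vctk_audio_generator_2.py`: `naive_downsample` and `batch_generator` are hypothetical names, the clips are toy lists rather than loaded audio, and keeping every `ratio`-th sample is an assumed stand-in for however the real generator computes the low resolution signal (which may filter or interpolate).

```python
import random


def naive_downsample(hr_samples, ratio):
    """Keep every `ratio`-th sample (no anti-aliasing filter; illustration only)."""
    return hr_samples[::ratio]


def batch_generator(clips, batch_size, ratio, shuffle=True, seed=0):
    """Yield (low_res_batch, high_res_batch) pairs forever."""
    order = list(range(len(clips)))
    rng = random.Random(seed)
    while True:
        if shuffle:
            rng.shuffle(order)  # new order at the start of every pass
        # Drop the final partial batch so every batch has the same size.
        for start in range(0, len(order) - batch_size + 1, batch_size):
            hr_batch = [clips[i] for i in order[start:start + batch_size]]
            lr_batch = [naive_downsample(clip, ratio) for clip in hr_batch]
            yield lr_batch, hr_batch


# Four toy "clips" of 8 samples each stand in for loaded audio.
clips = [[float(i * 10 + j) for j in range(8)] for i in range(4)]
gen = batch_generator(clips, batch_size=2, ratio=2, shuffle=False)
lr_batch, hr_batch = next(gen)
print(len(hr_batch[0]), len(lr_batch[0]))  # 8 4
```

Calling `next(gen)` mirrors how the generator's `next` method is used to pull one batch at a time.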

## `models`

The models in this project as well as their layer definitions are in this directory. 

### `audio_autoencoder.py`

This file defines the UNet generator model used in this project.

In [None]:
create_autoencoder(input_shape, number_of_layers, ratio, skip_connections)

Creates a UNet model.
- `number_of_layers`: Integer. Number of upsampling and downsampling blocks. This project uses $B=4$
- `skip_connections`: Boolean. Whether skip connections are used in the UNet. In this project, this is always set to `True`
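
To get a feel for what `number_of_layers` means for the signal length, the sketch below traces a 9600-sample input through four downsampling blocks. It assumes each block halves the temporal length (stride 2), which is a common choice for audio UNets but is not confirmed by this notebook; `encoder_lengths` is a hypothetical helper.

```python
def encoder_lengths(n_samples, number_of_layers, stride=2):
    """Temporal length after each downsampling block.

    Assumption: every block reduces the length by a factor of `stride`.
    """
    lengths = [n_samples]
    for _ in range(number_of_layers):
        if lengths[-1] % stride != 0:
            raise ValueError("length is not divisible by the stride")
        lengths.append(lengths[-1] // stride)
    return lengths


print(encoder_lengths(9600, 4))  # [9600, 4800, 2400, 1200, 600]
```

Under this assumption, the upsampling path would mirror these lengths in reverse, and skip connections would join encoder and decoder stages of equal length.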

### `ps_discriminator.py`

Creates discriminator and critic models (same function for both).

In [None]:
create_discriminator(input_shape, kernel_len=25, dim=64, use_batchnorm=False, phaseshuffle_rad=1)

### `vggish.py`

Creates VGGish model for content loss. 

In [None]:
create_VGGish(input_length, sr_hr, n_mels, hoplength, nfft, fmin, fmax, power_melgram, pooling='avg')

- `input_length`: Integer. Input length, this is the number of samples in the input
- `n_mels`: Number of frequency bands for spectrogram 
- `hoplength`: Hoplength for spectrogram
- `nfft`: Length of the FFT window for spectrogram 
- `fmin`: Lowest frequency for spectrogram 
- `fmax`: Highest frequency for spectrogram 
- `power_melgram`: Exponent for the magnitude spectrogram
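
The spectrogram parameters interact with the sampling rate. The sketch below computes a few standard quantities for the values used in the training cell (`sr_hr = 16000`, `nfft = 1024`). `stft_summary` is a hypothetical helper and does not call the Kapre layer. It also shows that the configured `fmax = 16000` lies above the Nyquist frequency for a 16 kHz signal; how the spectrogram layer treats that case is not stated in this notebook.

```python
def stft_summary(sample_rate, nfft):
    """Basic frequency-axis facts for an FFT of length `nfft`."""
    return {
        "nyquist_hz": sample_rate / 2,        # highest representable frequency
        "n_frequency_bins": nfft // 2 + 1,    # bins of a one-sided spectrum
        "bin_width_hz": sample_rate / nfft,   # spacing between adjacent bins
    }


summary = stft_summary(sample_rate=16000, nfft=1024)
fmax = 16000  # value from the training configuration
fmax_exceeds_nyquist = fmax > summary["nyquist_hz"]

print(summary)
print(fmax_exceeds_nyquist)  # True
```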

### `sr_unet.py`

Defines a class to train the UNet model. This sets up the model, defines a training loop and loss, and uses TensorFlow's gradient tape to compute the gradients and update the weights.

In [None]:
SRUNet(input_shape, input_length, songlength, datapath, save_pickle_file_path, sr_hr, ratio, tracklistfile, 
       training_generator, validation_generator, number_of_layers, lr, create_generator, create_content_extractor, 
       batchsize, epochs, skip_connections, use_content_extractor, saved_weights_dir_path, content_weights_file_path, 
       save_epoch_step, n_mels, nfft, fmin, fmax, power_melgram)

- `save_pickle_file_path`: String. Path to save a pickle file that contains training info such as validation loss
- `ratio`: Integer. Upsampling ratio
- `tracklistfile`: String. Path to a text file where each line denotes a file to be used for training
- `training_generator`: Tensorflow dataset for training
- `validation_generator`: Tensorflow dataset for validation (not used currently)
- `number_of_layers`: Integer. Number of upsampling and downsampling blocks, in this project we used $B=4$
- `lr`: Float. Learning rate 
- `create_generator`: Function that will be used to create the generator, e.g. `create_autoencoder` 
- `create_content_extractor`: Function that will be used to create the content extractor 
- `epochs`: Integer. Total epoch number to train for
- `skip_connections`: Boolean. If skip connections should be used in the generator 
- `use_content_extractor`: Boolean. If the content extractor should be used
- `saved_weights_dir_path`: String. Path to directory where weights should be saved
- `content_weights_file_path`: String. Path to the file from which the content extractor weights are loaded.
- `save_epoch_step`: Integer. Number of epochs between saving weights.
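
As an illustration of the `tracklistfile` format and of the `split=[1, 0]` argument used in the training cell, the sketch below splits a list of lines into training and validation partitions. It is not the implementation of `vctk_tracklist_to_partition` (which also returns labels); `split_tracklist` is a hypothetical name, the file names are made-up examples, and reading `split` as relative train/validation proportions is an assumption.

```python
def split_tracklist(lines, split=(1, 0)):
    """Split track IDs into train/validation partitions.

    Assumption: `split` holds relative proportions (train, validation).
    """
    ids = [line.strip() for line in lines if line.strip()]  # skip blank lines
    n_train = round(len(ids) * split[0] / sum(split))
    return {"train": ids[:n_train], "validation": ids[n_train:]}


# Example contents of a track list file, one audio file per line.
lines = ["p225_001.wav\n", "p225_002.wav\n", "p226_001.wav\n", "p226_002.wav\n"]

all_train = split_tracklist(lines, split=(1, 0))
print(len(all_train["train"]), len(all_train["validation"]))  # 4 0
```

With `split=(1, 0)` every file goes to training, which matches the notebook setting `validation_generator = None`.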

### `audiosr_gan.py`

Defines a class to train the GAN model. This sets up the model, defines a training loop and loss, and uses TensorFlow's gradient tape to compute the gradients and update the weights.

In [None]:
AudioSRGan(input_shape, input_length, songlength, datapath, save_pickle_file_path, sr_hr, ratio, tracklistfile, 
           training_generator, validation_generator, number_of_layers, lr, create_discriminator, create_generator, 
           create_content_extractor, batchsize, epochs, skip_connections, use_content_extractor,
           saved_weights_dir_path, content_weights_file_path, save_epoch_step, n_mels, nfft, fmin, fmax, 
           power_melgram)

- `create_discriminator`: Function used for discriminator

### `audiosr_wgan.py`

In [None]:
AudioASRWGAN(input_shape, input_length, songlength, datapath, save_pickle_file_path, sr_hr, ratio, tracklistfile, 
                 training_generator, validation_generator, number_of_layers, lr, create_discriminator, 
                 create_autoencoder, create_content_extractor, batchsize, epochs , skip_connections,
                 use_content_extractor, saved_weights_dir_path, content_weights_file_path, n_critic, clip_value, 
                 save_epoch_step, n_mels, nfft, fmin, fmax, power_melgram)

- `n_critic`: Integer. Number of steps the critic is trained for between generator training steps.
- `clip_value`: Float. Value used to clip the critic weights.
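
The roles of `n_critic` and `clip_value` can be sketched as follows. This follows the usual weight-clipping WGAN recipe and is only an assumption about how `AudioASRWGAN` uses them: the function names are hypothetical, weights are plain floats rather than tensors, and the values 3 and 0.01 are illustrative, not the settings used in this project.

```python
def clip_weights(weights, clip_value):
    """Clamp every weight into [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, w)) for w in weights]


def training_schedule(generator_steps, n_critic):
    """Order of updates: n_critic critic steps before each generator step."""
    schedule = []
    for _ in range(generator_steps):
        schedule.extend(["critic"] * n_critic)
        schedule.append("generator")
    return schedule


print(clip_weights([-0.5, 0.005, 0.2], clip_value=0.01))  # [-0.01, 0.005, 0.01]
print(training_schedule(generator_steps=2, n_critic=3))
```

In the usual recipe, clipping is applied to the critic weights after each critic update.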

## `layers`

### `subpixel.py`

Defines the subpixel layer.

In [None]:
phase_shift_tensorlayer(r)

- `r`: Upsampling ratio

Computes subpixel operation. 
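
A one-dimensional subpixel operation trades channels for temporal resolution: an input with `r * C` channels per time step becomes an output that is `r` times longer with `C` channels. The sketch below shows the idea on nested lists. It is not the code in `subpixel.py`; `subpixel_1d` is a hypothetical name, and the specific channel-to-time ordering chosen here is an assumption.

```python
def subpixel_1d(x, r):
    """Rearrange (length, r * c_out) into (length * r, c_out).

    `x` is a list of time steps, each a list of channel values.
    Assumption: channels are grouped as r consecutive blocks of c_out values.
    """
    if not x:
        return []
    n_channels = len(x[0])
    if n_channels % r != 0:
        raise ValueError("channel count must be divisible by r")
    c_out = n_channels // r
    out = []
    for step in x:
        for k in range(r):
            # Each block of c_out channels becomes its own output time step.
            out.append(step[k * c_out:(k + 1) * c_out])
    return out


x = [[1, 2], [3, 4], [5, 6]]  # 3 time steps, 2 channels
print(subpixel_1d(x, 2))  # [[1], [2], [3], [4], [5], [6]]
```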

## `Kapre`

This directory uses code from https://github.com/keunwoochoi/kapre, which makes it possible to implement a neural network layer that computes the spectrogram of its input. The Python module did not work in this project, so the code was copied into the project to be able to use it.

## `helpers`

### `helper_functions.py`

Defines helper functions used to run the code. Contains code to convert between pydub audio and librosa audio, as well as code to create the files needed for training (such as the song list file).
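
The conversion between the two audio representations mostly comes down to scaling: pydub exposes integer PCM samples, while librosa works with floats in roughly [-1, 1]. The sketch below shows that scaling with the standard library only. It does not reproduce `audiosegment_to_ndarray` or `librosa_array_to_pydub`; the function names are hypothetical, and the `array('h', ...)` input stands in for 16-bit samples such as those a pydub `AudioSegment` would provide.

```python
from array import array


def int_samples_to_float(samples, sample_width):
    """Scale signed integer PCM samples to floats in [-1.0, 1.0)."""
    scale = float(1 << (8 * sample_width - 1))  # 32768.0 for 16-bit audio
    return [s / scale for s in samples]


def float_samples_to_int(samples, sample_width):
    """Scale floats back to signed integers, clipping to the valid range."""
    scale = 1 << (8 * sample_width - 1)
    lo, hi = -scale, scale - 1
    return [max(lo, min(hi, int(round(s * scale)))) for s in samples]


pcm = array("h", [0, 16384, -32768])  # 16-bit signed samples
as_float = int_samples_to_float(pcm, sample_width=2)
print(as_float)  # [0.0, 0.5, -1.0]
print(float_samples_to_int([0.0, 0.5, -1.0, 1.0], sample_width=2))  # [0, 16384, -32768, 32767]
```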

### `testing.py`

Defines functions for testing. Includes code to generate data from an audio file name and various other helper functions for testing. 