# Introduction

This is our implementation of resnet inspired model on raw wave dataset(time domain) dataset. Here one can just hit run all and results will be generated in 2 csv files as inteded. The weights and models will be also saved for reproducibility. Thus this notebook can be used to train custom version of our model and experiment further.

# 0. Setup

At first we are importing and setting up all the libraries to implement this solution. If some of the files are not installed in your device, please refer to the readme file for installation direction

## utility libraries


In [1]:
!pip install -r "requirement (1).txt"



In [2]:
import os
import pathlib
from IPython import display
from pathlib import Path
from IPython.display import Audio
import matplotlib.pyplot as plt

# data handling libraries
import numpy as np
import librosa
import librosa.display
import pandas as pd
from sklearn.model_selection import train_test_split

# ML and DL Libraries
import tensorflow as tf
import tensorflow.keras as keras
from sklearn.model_selection import KFold


## declaring hyperparameters


In [3]:
seed = 42
SR = 16000 #sampling rate of each audio file
BATCH_SIZE = 1
AUD_LENGTH = 10# we are taking 10s audio for each sample to have an uniform dataset
TRAIN_TEST_SPLIT = 0.3

## all file paths


In [4]:
# insert needed paths here

# this is the known and unknown dataset path
DATASET_AUDIO_PATH = 'classwise_dataset/'

# this is the random extra data addition folders
ASVSPOOF_DATA_PATH  = 'external files/asvspoof'
LIBRISPEECH_DATA_PATH = 'external files\librispeech'

# this is the evaluation folder for phase 1 and 2
EVAL_PATH_1 = 'spcup_2022_eval_part1'
EVAL_PATH_2 = 'spcup_2022_eval_part1'

### saving paths
CSV_DIR = './'
MODEL_SAVE_DIR = './'
WEIGHT_SAVE_DIR = './'

# 1. Generating Dataset

In [5]:
# getting audio dataset path to divide into 3 datasets and also for making tf datasets later

class_names = os.listdir(DATASET_AUDIO_PATH)
print("Our class names: {}".format(class_names,))

audio_paths = []
labels = []
for label, name in enumerate(class_names):
    label = int(name)
    print("Processing speaker {}".format(name,))
    print("Actual Label ",label)
    dir_path = Path(DATASET_AUDIO_PATH) / name
    speaker_sample_paths = [
        os.path.join(dir_path, filepath)
        for filepath in os.listdir(dir_path)
        if filepath.endswith(".wav")
    ]
    audio_paths += speaker_sample_paths
    labels += [label] * len(speaker_sample_paths)

print(
    "Found {} files belonging to {} classes.".format(len(audio_paths), len(class_names))
)

Our class names: ['0', '1', '2', '3', '4', '5']
Processing speaker 0
Actual Label  0
Processing speaker 1
Actual Label  1
Processing speaker 2
Actual Label  2
Processing speaker 3
Actual Label  3
Processing speaker 4
Actual Label  4
Processing speaker 5
Actual Label  5
Found 6000 files belonging to 6 classes.


Now as our extra data addition is done, we are splitting the dataset into train and validation dataset

In [6]:
X_train, X_val, y_train, y_val = train_test_split(audio_paths, labels, test_size=TRAIN_TEST_SPLIT, random_state=seed)

In [7]:
#checking if the size is alright
print("Samples in train dataset: ",len(X_train))
print("Samples in validation dataset:",len(X_val))

Samples in train dataset:  4200
Samples in validation dataset: 1800


## B. Creating Repeated Dataset

Although cropping dataset to the minimum length or less could help us uniform the dataset, our data variance is very high(approximately 2.5 to 14.5s). In order to avoid such data loss we took sample size to be 10 sec and repeated the samples shorter than that instead of zero padding or other methods as in several ASR research it has shown better performance. 

In [8]:
# utility functions for repeating audio files
def repeated_data(file_path):
    """ This function will take a file path and give out truncated and padded to 10s version waveform"""
    y, sr = librosa.load(file_path,sr=SR)
    aud_length = AUD_LENGTH*sr # making all audio length 10 s and truncating the rest
    duration = librosa.get_duration(y=y, sr=sr)
    if duration < AUD_LENGTH:
        y = np.tile(y, int((aud_length/sr) // duration)+1)
    y = librosa.resample(y[:aud_length], orig_sr=sr, target_sr=SR)
    return y

def repeated_dataset(dataset):
    """ This function generated waveshape dataset"""
    new_ds = []
    for f in dataset:
        new_ds.append(repeated_data(f))
    return new_ds

## C. Creating DataGenerator

As we were facing memory issue with saving such huge array of raw waveshape dataset and ran out of memory for quite a period, we used this datagenerator in order to efficiently load the model into our training system using the allowed resources.

Also, we are using tf dataset API for processing our dataset as it gives better performance while using Tensorflow.

In [9]:
class DataGenerator(keras.utils.Sequence):
    'Generates data for Keras'
    def __init__(self, list_IDs, labels, batch_size= BATCH_SIZE, 
                 n_classes=6, shuffle=True):
        'Initialization'
        self.dim = AUD_LENGTH * SR
        self.batch_size = batch_size
        self.labels = labels
        self.shuffle = shuffle
        self.list_IDs = list_IDs
        self.on_epoch_end()

    def path_to_audio(self,path):
        """Reads and decodes an audio file."""
        return repeated_data(path)

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.floor(len(self.list_IDs) / self.batch_size))

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]

        # Generate data
        X, y = self.__data_generation(indexes)

        return X, y

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.list_IDs))
        if self.shuffle == True:
            np.random.shuffle(self.indexes)

    def __data_generation(self, list_IDs_temp):
        'Generates data containing batch_size samples'
        
        X = []
        y = []
        # Generate data
        for i, ID in enumerate(list_IDs_temp):
            # Store sample
            _tempx = self.path_to_audio(self.list_IDs[ID])
            #_tempx = self.spect_audio(_tempx)
            X.append(_tempx)

            # Store class
            y.append(self.labels[ID])

        return np.reshape(np.array(X), (self.batch_size,SR*AUD_LENGTH,1)).astype(np.float32),np.array(y).astype(np.float32)

In [10]:
# generating tf datasets
train_ds = DataGenerator(X_train,y_train)
valid_ds = DataGenerator(X_val,y_val)

# 2. Building A Model

## A. Resnet based model

For our model we have tried out different models and fine tuned them for the best performance

In [11]:
### Model Name == "resnet_version"

# Resnet Block
def residual_block(xx, filters):
    """ This Block will work as the repeating Resnet Block for extracting important features from the waveshape"""
    yy = tf.keras.layers.Conv1D(filters, kernel_size = 3, padding="same")(xx)
    yy = tf.keras.layers.BatchNormalization()(yy)
    yy = tf.keras.layers.ReLU()(yy)
    
    yy = tf.keras.layers.Conv1D(filters, kernel_size = 3, padding="same")(yy)
    yy = tf.keras.layers.BatchNormalization()(yy)
    yy = tf.keras.layers.ReLU()(yy)
    
    yy = tf.keras.layers.Conv1D(filters, kernel_size = 3, padding="same")(yy)
    
    xx = tf.keras.layers.Conv1D(filters, kernel_size = 1, padding="same")(xx)
    
    xx = tf.keras.layers.Concatenate(axis=1)([xx,yy])
    xx = tf.keras.layers.ReLU()(xx)
    
    return xx

def resnet_version(input_shape, num_classes):
    inputs = tf.keras.layers.Input(shape=input_shape, name="input")
    x      = tf.keras.layers.Conv1D(16, kernel_size = 3, padding="same")(inputs)
    x      = tf.keras.layers.BatchNormalization()(x)
    x      = tf.keras.layers.ReLU()(x)
    x      = tf.keras.layers.MaxPool1D(pool_size = 4)(x)
    
    # stacked resnet modules
    # res1
    x      = residual_block(x,32)
    x      = tf.keras.layers.MaxPool1D(pool_size = 4)(x)
    # res2
    x      = residual_block(x,64)
    x      = tf.keras.layers.MaxPool1D(pool_size = 4)(x)
    # res3
    x      = residual_block(x,128)
    x      = tf.keras.layers.MaxPool1D(pool_size = 4)(x)
    # res4
    x      = residual_block(x,128)
    x      = tf.keras.layers.MaxPool1D(pool_size = x.shape[-1])(x)
    
    x      = tf.keras.layers.Flatten()(x)
    x      = tf.keras.layers.Dense(64, activation="relu")(x)
    x      = tf.keras.layers.Dense(32, activation="relu")(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax", name="output")(x)
    
    return tf.keras.models.Model(inputs=inputs, outputs=outputs)
    
aud_length = AUD_LENGTH * SR

model = resnet_version((aud_length, 1), len(class_names))

model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input (InputLayer)             [(None, 160000, 1)]  0           []                               
                                                                                                  
 conv1d (Conv1D)                (None, 160000, 16)   64          ['input[0][0]']                  
                                                                                                  
 batch_normalization (BatchNorm  (None, 160000, 16)  64          ['conv1d[0][0]']                 
 alization)                                                                                       
                                                                                                  
 re_lu (ReLU)                   (None, 160000, 16)   0           ['batch_normalization[0][0]']

# 3. Training the Model

for training we have chosed the cross validation method as it is proven to work well as the model can learn different distribution from diff classes. 

In [12]:
# training parameters
EPOCHS=1
NFOLDS=5

# chossing the model parameters
MODEL = "resnet_version"
LOSS = "sparse_categorical_crossentropy"
OPTIMIZER = "Adam"

# setting model and weight name
MODEL_NAME = "model_resnet1D_cv_"
WEIGHT_NAME = "weight_resnet1D_cv"

In [13]:
folds = KFold(n_splits=NFOLDS)
splits = folds.split(audio_paths, labels)

def evaluate_model(X_train, X_val, y_train, y_val,j):
    
    train_ds = DataGenerator(X_train,y_train)
    valid_ds = DataGenerator(X_val,y_val)
    
    aud_length = AUD_LENGTH * SR
    
    if MODEL == "resnet_version":
        model = resnet_version((aud_length, 1), len(class_names))
        
    epochs = EPOCHS
    batch_size = BATCH_SIZE
    
    model.compile(
    optimizer= OPTIMIZER, loss= LOSS, metrics=["accuracy"])
    weight_save_filename = WEIGHT_NAME +str(j)+"fold_.h5"
    
    lr_reduce = tf.keras.callbacks.ReduceLROnPlateau(monitor='loss', factor=0.5, patience=5, verbose=1, mode='min', min_lr=1e-9)
    earlystopping_cb = tf.keras.callbacks.EarlyStopping(monitor='loss', min_delta=0.001, patience=10, mode='min', restore_best_weights=True)
    mdlcheckpoint_cb = tf.keras.callbacks.ModelCheckpoint(weight_save_filename, monitor="val_accuracy", save_best_only=True,save_weights_only=True)
    
    history = model.fit(
    train_ds,
    epochs=EPOCHS,
    validation_data=valid_ds,
    callbacks=[lr_reduce,earlystopping_cb, mdlcheckpoint_cb],
)
 
    _, val_acc = model.evaluate(valid_ds, verbose = 1)
    
    model.load_weights(os.path.join(WEIGHT_SAVE_DIR,weight_save_filename)) #
    model.save(os.path.join(MODEL_SAVE_DIR,MODEL_NAME + str(j)+"fold_.h5"))
    return model, val_acc

fin_model = 1
cv_scores, model_history = list(), list()
train = audio_paths
targets = labels
for fold, (train_idx, val_idx) in enumerate(splits):
    X_train = []
    X_valid = []
    y_train = []
    y_valid = []
    for i in train_idx:
        X_train.append(train[i])
        y_train.append(targets[i])
    for j in val_idx:
        X_valid.append(train[j])
        y_valid.append(targets[j])

    print('-'*15, '>', f'Fold {fold+1}', '<', '-'*15)
    model, val_acc = evaluate_model(X_train, X_val, y_train, y_val,fold)
    print('>%.3f' % val_acc)
    cv_scores.append(val_acc)
    if val_acc == max(cv_scores):
        fin_model = model
    model_history.append(model)

--------------- > Fold 1 < ---------------
>0.364
--------------- > Fold 2 < ---------------
>0.504
--------------- > Fold 3 < ---------------
>0.606
--------------- > Fold 4 < ---------------
>0.359
--------------- > Fold 5 < ---------------
>0.713


## Generating Predictions

we will be using the ensemble of each model generated by our cross validation method to get the least variance in our prediction

In [14]:
def ensemble_predictions(members, testX,testy=1):
    yhats = [model.predict(testX) for model in members]
    yhats = np.array(yhats)
    # sum across ensemble members
    summed = np.sum(yhats, axis=0)
    # argmax across classes
    result = np.argmax(summed, axis=1)
    return result

In [15]:
preds = ensemble_predictions(model_history, valid_ds)
preds

array([2, 4, 2, ..., 5, 4, 4], dtype=int64)