# Speech AI Sandbox

In this notebook, we apply several neural network architectures to the Speech Commands dataset.

In [1]:
from src.utils import load_audio, play_audio, get_waveforms
from src.lstm_autoencoder import LSTMAutoencoder

## 1) Setup

First, we need to load the required data. We will also define the training hyperparameters here.

### 1.1) Data Loading

In [2]:
train_dataset, test_dataset, SAMPLE_RATE = load_audio(fraction=0.1)
X_train, X_test = get_waveforms(train_dataset), get_waveforms(test_dataset)

In [3]:
play_audio(train_dataset[0])

Label: backward, Speaker ID: 0165e0e8, Utterance #: 0


### 1.2) Hyperparameters

In [4]:
LEARNING_RATE = 1e-3
EPOCHS = 4
VERBOSE = 1

## 2) Audio Reconstruction

To generate realistic audio, a model must first learn to accurately reconstruct input audio from its latent representation. This step ensures the encoder captures essential information and the decoder can recover the original signal. Our initial goal is therefore to train models that can replicate the input audio as faithfully as possible.
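To make "as faithfully as possible" measurable, one option is to compare each reconstruction with its original using mean squared error (MSE) and signal-to-noise ratio (SNR). The sketch below is an assumption on our part, not something taken from `src.utils`: the helper `reconstruction_metrics` is hypothetical, and a synthetic sine wave with added noise stands in for a real waveform and its reconstruction.

```python
import numpy as np


def reconstruction_metrics(original, reconstruction):
    """Return (MSE, SNR in dB) between two waveforms of equal shape.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    original = np.asarray(original, dtype=np.float64)
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    if original.shape != reconstruction.shape:
        raise ValueError("original and reconstruction must have the same shape")

    error = original - reconstruction
    mse = float(np.mean(error ** 2))
    signal_power = float(np.mean(original ** 2))
    # A perfect reconstruction has zero error power, i.e. infinite SNR.
    snr_db = float("inf") if mse == 0 else float(10 * np.log10(signal_power / mse))
    return mse, snr_db


# Synthetic stand-in: one second of a 440 Hz tone at an assumed 16 kHz rate,
# with light Gaussian noise playing the role of an imperfect reconstruction.
t = np.linspace(0, 1, 16000, endpoint=False)
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.01 * np.random.default_rng(0).standard_normal(t.shape)

mse, snr_db = reconstruction_metrics(clean, noisy)
print(f"MSE: {mse:.6f}, SNR: {snr_db:.1f} dB")
```

A lower MSE and a higher SNR both indicate a closer match, but neither fully reflects perceived audio quality, so listening to the reconstructions remains worthwhile.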

### 2.1) LSTMAutoencoder

We start with the LSTMAutoencoder model, which pairs the classic autoencoder layout (an encoder that compresses the input and a decoder that rebuilds it) with LSTM layers. An LSTM is a type of recurrent neural network (RNN) that works well with sequential data such as audio waveforms.
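The encoder-decoder idea can be sketched as a forward pass in plain NumPy. This is an illustration under our own assumptions and not the implementation in `src.lstm_autoencoder`, whose internals are not shown here: the frame size, hidden size, helper names, and the choice to feed the latent vector to the decoder at every step are all hypothetical, and the weights are random rather than trained.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_lstm(input_dim, hidden_dim, rng):
    """Random weights for one LSTM layer; gates stacked as [input, forget, cell, output]."""
    scale = 1.0 / np.sqrt(hidden_dim)
    return {
        "W": rng.uniform(-scale, scale, (4 * hidden_dim, input_dim)),
        "U": rng.uniform(-scale, scale, (4 * hidden_dim, hidden_dim)),
        "b": np.zeros(4 * hidden_dim),
    }


def lstm_step(params, x, h, c):
    """Advance one LSTM time step for a single sequence."""
    z = params["W"] @ x + params["U"] @ h + params["b"]
    i, f, g, o = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # update the cell state
    h = sigmoid(o) * np.tanh(c)                   # expose a gated view of it
    return h, c


def autoencode(waveform, enc, dec, w_out, frame_size):
    """Encode a waveform into one latent vector, then decode it frame by frame."""
    frames = waveform.reshape(-1, frame_size)
    hidden_dim = enc["U"].shape[1]

    # Encoder: read the frames in order; the final hidden state is the latent code.
    h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
    for frame in frames:
        h, c = lstm_step(enc, frame, h, c)
    latent = h

    # Decoder: feed the latent code at every step and project back to audio frames.
    h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
    decoded = []
    for _ in range(len(frames)):
        h, c = lstm_step(dec, latent, h, c)
        decoded.append(w_out @ h)
    return latent, np.concatenate(decoded)


rng = np.random.default_rng(0)
FRAME_SIZE, HIDDEN_DIM = 160, 32  # assumed values for illustration only
enc = init_lstm(FRAME_SIZE, HIDDEN_DIM, rng)
dec = init_lstm(HIDDEN_DIM, HIDDEN_DIM, rng)
w_out = rng.uniform(-0.1, 0.1, (FRAME_SIZE, HIDDEN_DIM))

waveform = rng.standard_normal(1600)  # stand-in for a short audio clip
latent, reconstruction = autoencode(waveform, enc, dec, w_out, FRAME_SIZE)
print(latent.shape, reconstruction.shape)
```

Because the weights are untrained, the output here is meaningless as audio; the point is only the data flow, in which the whole clip is squeezed through a single fixed-size latent vector before being expanded back to the original length.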

In [5]:
lstm_ae = LSTMAutoencoder(learning_rate=LEARNING_RATE, epochs=EPOCHS, verbose=VERBOSE)
lstm_ae.fit(X_train)

Epoch 4/4 - Loss: 0.0002


In [6]:
# Reconstruct the first three test clips and play each next to its original.
for i in range(3):
    # Slice with i:i+1 to keep the batch dimension for a single clip.
    X_pred = lstm_ae.reconstruct(X_test[i:i+1])
    print(f'Original audio and reconstruction #{i+1}:')
    display(play_audio((X_test[i], SAMPLE_RATE), path=f'audio/original_{i+1}.wav'))
    display(play_audio((X_pred[0], SAMPLE_RATE), path=f'audio/lstm_ae/reconstruction_{i+1}.wav'))

Original audio and reconstruction #1:


Original audio and reconstruction #2:


Original audio and reconstruction #3:
