<a href="https://colab.research.google.com/github/shoutong/colabs/blob/master/Copy_of_Public_phoneme_recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center> <h1> Phoneme Recognition </h1> </center>

For this homework we will build a simple [Phoneme](https://en.wikipedia.org/wiki/Phoneme) Recognition Neural Network.

We will use the TIMIT dataset for this homework. It contains utterances from several different English speakers saying sentences (see more details [here](https://catalog.ldc.upenn.edu/LDC93S1)).

<center> <h2> Setup </h2> </center>

#### Google Colaboratory

Before getting started, get familiar with Google Colaboratory:
https://colab.research.google.com/notebooks/welcome.ipynb

This is a neat Python environment that runs in the cloud and does not require you to
set up anything on your personal machine
(it also has some built-in IDE features that make writing code easier).
Moreover, it allows you to copy any existing Colaboratory notebook, alter it, and share it
with other people. In this homework, we will ask you to copy the current notebook,
complete all the tasks, and share your Colaboratory notebook with us so
that we can grade it.

#### Submission

Before you start working on this homework do the following steps:

1. Select __File > Save a copy in Drive...__. This gives you your own copy that you can change.
2. Follow all the steps in this Colaboratory notebook and write/change/uncomment code as necessary.
3. Do not forget to occasionally select __File > Save__ to save your progress.
4. After all the changes are done and your progress is saved, press the __Share__ button (top right corner of the page), press __get shareable link__, and make sure the option __Anyone with the link can view__ is selected.
5. Paste the link into your submission PDF file so that we can view and grade it.

In [0]:
import numpy as np
import random
import torch
random.seed(1234)
torch.manual_seed(1234)

<torch._C.Generator at 0x7f5b18284370>

# Dataset 
For convenience, we have done some preprocessing of the TIMIT audio. The files `{train/dev/test}_feats.mat.npy` and `{train/dev/test}_labels.mat.npy` contain [MFCC features](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum) and phoneme labels per frame.

The cell below downloads the preprocessed data.

In [0]:
!wget -q -nc https://raw.githubusercontent.com/jhu-intro-hlt/jhu-intro-hlt.github.io/master/data-phone-recognitiona/dev_feats.mat.npy
!wget -q -nc https://raw.githubusercontent.com/jhu-intro-hlt/jhu-intro-hlt.github.io/master/data-phone-recognitiona/train_feats.mat.npy
!wget -q -nc https://raw.githubusercontent.com/jhu-intro-hlt/jhu-intro-hlt.github.io/master/data-phone-recognitiona/test_feats.mat.npy
!wget -q -nc https://raw.githubusercontent.com/jhu-intro-hlt/jhu-intro-hlt.github.io/master/data-phone-recognitiona/dev_labels.mat.npy
!wget -q -nc https://raw.githubusercontent.com/jhu-intro-hlt/jhu-intro-hlt.github.io/master/data-phone-recognitiona/train_labels.mat.npy
!wget -q -nc https://raw.githubusercontent.com/jhu-intro-hlt/jhu-intro-hlt.github.io/master/data-phone-recognitiona/test_labels.mat.npy

Next, we define two functions to read the NumPy-formatted data. The `get_labels` function maps each phoneme to an index (i.e. an `int`), and `load_npy` loads the train, dev, and test splits.



In [0]:
def get_labels(data):
    label_dict = {}
    for y in data:
        label_dict[y] = label_dict.get(y, len(label_dict))
    return label_dict


def load_npy():
    train_feats = np.load('train_feats.mat.npy', allow_pickle=True)
    train_labels = np.load('train_labels.mat.npy', allow_pickle=True)
    label_idx = get_labels(train_labels)
    test_feats = np.load('test_feats.mat.npy', allow_pickle=True)
    test_labels = np.load('test_labels.mat.npy', allow_pickle=True)
    dev_feats = np.load('dev_feats.mat.npy', allow_pickle=True)
    dev_labels = np.load('dev_labels.mat.npy', allow_pickle=True)
    return label_idx, (train_feats, train_labels), (dev_feats, dev_labels), (test_feats, test_labels)
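
To see what `get_labels` produces, here is a small sketch on a made-up label sequence (the toy list is illustrative and not taken from TIMIT). Each distinct phoneme receives the next unused integer, in order of first appearance.

```python
def get_labels(data):
    # Same logic as above: assign each new label the next free index.
    label_dict = {}
    for y in data:
        label_dict[y] = label_dict.get(y, len(label_dict))
    return label_dict

# Hypothetical toy sequence of per-frame labels.
toy_labels = ['sil', 'sil', 'ax', 's', 's', 'ax', 'uw']
toy_dict = get_labels(toy_labels)
print(toy_dict)  # {'sil': 0, 'ax': 1, 's': 2, 'uw': 3}
```

Because the mapping is built from the training labels only, a label that appears only in the dev or test split would have no index.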


In [0]:
label_dict, train, dev, test = load_npy()

# Display the shape of the features and the number of labels
print(train[0].shape, len(train[1]))
print(dev[0].shape, len(dev[1]))
print(test[0].shape, len(test[1]))

# Display the first 300 labels
print(train[1][:300])
# Display the first 40 speech feature vectors
print(train[0][:40])

(200000, 39) 200000
(10000, 39) 10000
(10000, 39) 10000
['sil' 'sil' 'sil' 'sil' 'sil' 'sil' 'sil' 'sil' 'sil' 'sil' 'sil' 'sil'
 'sil' 'ax' 'ax' 'ax' 'ax' 'ax' 'ax' 's' 's' 's' 's' 's' 's' 's' 's' 's'
 's' 's' 's' 's' 's' 's' 's' 's' 'uw' 'uw' 'uw' 'uw' 'uw' 'uw' 'uw' 'uw'
 'uw' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'm' 'f' 'f' 'f' 'f' 'f' 'f' 'f' 'f'
 'f' 'f' 'f' 'f' 'f' 'ao' 'ao' 'ao' 'ao' 'ao' 'ao' 'r' 'r' 'r' 'r' 'r'
 'ix' 'ix' 'ix' 'vcl' 'vcl' 'vcl' 'vcl' 'vcl' 'z' 'z' 'z' 'z' 'z' 'z' 'z'
 'z' 'ae' 'ae' 'ae' 'ae' 'ae' 'ae' 'ae' 'ae' 'ae' 'ae' 'm' 'm' 'm' 'm'
 'cl' 'cl' 'cl' 'p' 'p' 'p' 'p' 'uh' 'uh' 'uh' 'l' 'l' 'l' 'l' 'l' 'l' 'l'
 'l' 'l' 'l' 'l' 'l' 'l' 'ax' 'ax' 'ax' 'ax' 'ax' 'ax' 'ax' 's' 's' 's'
 's' 's' 's' 's' 's' 's' 's' 's' 's' 's' 'ix' 'ix' 'ix' 'ix' 'ix' 'cl'
 'cl' 'cl' 'ch' 'ch' 'ch' 'ch' 'ch' 'ch' 'ch' 'ch' 'ch' 'ch' 'ch' 'uw'
 'uw' 'uw' 'uw' 'uw' 'uw' 'uw' 'uw' 'uw' 'ey' 'ey' 'ey' 'ey' 'ey' 'ey'
 'ey' 'ey' 'ey' 'ey' 'sh' 'sh' 'sh' 'sh' 'sh' 'sh' 'sh' 'sh' 'sh' 'sh'
 'sh

Every frame has a corresponding phoneme label, and our goal is to train a model that correctly predicts the labels of unseen frames.

A simple model can predict phonemes from a single frame, but adding context can improve accuracy. We add context by stacking each frame with its neighbouring frames (zero-padded at the edges of the data). We then group the examples into batches so that the model can exploit the parallel processing of the GPU. This is done in the function below.

In [0]:
def batchify(data_feats, data_labels, batch_size, label_dict, window=5, to_cuda=False):
    batched_tdata = []
    curr_batch = []
    fz = np.zeros((window, 39))
    bz = np.zeros((window, 39))
    fl = ['sil'] * window
    bl = ['sil'] * window
    data_labels = fl + data_labels.tolist() + bl
    data_feats = np.concatenate((fz, data_feats, bz))
    for i in range(window, len(data_labels) - window):
        x = data_feats[i - window: i + window + 1]
        y = data_labels[i]
        tx = torch.Tensor(x).unsqueeze(0) # shape: (1, 2 * window + 1, 39)
        ty = torch.Tensor([label_dict[y]]) # shape: (1,)
        #if y != 'sil':
        curr_batch.append((tx, ty))
        if len(curr_batch) == batch_size:
            # Batch is full: concatenate it and start a new one,
            # so that no frame is dropped at a batch boundary
            _tx, _ty = zip(*curr_batch)
            b_tx = torch.cat(_tx, dim=0)
            b_ty = torch.cat(_ty, dim=0)
            if to_cuda:
                b_tx, b_ty = b_tx.cuda(), b_ty.cuda()
            batched_tdata.append((b_ty, b_tx))
            curr_batch = []
    if len(curr_batch) > 0:
        _tx, _ty = zip(*curr_batch)
        b_tx = torch.cat(_tx, dim=0)
        b_ty = torch.cat(_ty, dim=0)
        if to_cuda:
            b_tx, b_ty = b_tx.cuda(), b_ty.cuda()
        batched_tdata.append((b_ty, b_tx))
    return batched_tdata
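
The windowing step inside `batchify` can be illustrated on a tiny example. The sketch below uses one number per frame instead of 39 MFCC features, and `context_windows` is a hypothetical helper written only for this illustration. It shows how zero padding and slicing give every frame `2 * window + 1` entries.

```python
def context_windows(frames, window):
    """Return, for each frame, the frame plus `window` neighbours on each side.

    Positions beyond the start or end of the sequence are zero-padded.
    """
    padded = [0] * window + list(frames) + [0] * window
    return [padded[i - window: i + window + 1]
            for i in range(window, len(padded) - window)]

toy_frames = [1, 2, 3, 4]
windows = context_windows(toy_frames, window=1)
print(windows)  # [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 0]]
```

With `window=0` every frame stands alone, which is the setting used for the single-frame model.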

In [0]:
window=0
batched_train_0 = batchify(train[0], train[1], 2000, label_dict, window, True)
batched_dev_0 = batchify(dev[0], dev[1], 2000, label_dict, window, True)
batched_test_0 = batchify(test[0], test[1], 2000, label_dict, window, True)

## Model

Below we ask you to complete the definition of a simple network. Write your code wherever a `#TODO` comment appears. You are expected to only add code; do not change the provided code.

In [0]:
class MLP_Simple(torch.nn.Module):
    def __init__(self,
                 hidden_size,
                 num_labels):
        super().__init__()
        # Just single frame, therefore, 1 * 39
        self.layer0 = torch.nn.Linear(1 * 39, hidden_size)

        
        # Do not change the surrounding code, only add yours
        #TODO: Add more layers and activations here...
        
        self.final_layer = torch.nn.Linear(hidden_size, num_labels)

    def forward(self, x):
        """Generate output distribution and argmax
        Args:
            x: batch of single frames of 39-dimensional MFCC features, shape (batch_size, 1, 39)
        Return:
            y_hat: unnormalized score (logit) for each output class
            y_pred: index of the label with the highest score
        """
        batch_size, frames, features = x.shape
        x = x.squeeze(1)
        x = self.layer0(x)

        # TODO: use your defined layers here.
      

        y_hat = self.final_layer(x) # shape: (batch_size, num_labels)
        _, y_pred = y_hat.max(dim=-1) # y_pred shape: (batch_size,)
        return y_hat, y_pred

Here we create an instance of our simple model, which uses a single frame without context.

In [0]:
model_simple = MLP_Simple(hidden_size=512, num_labels=len(label_dict))
print(model_simple)
print('num parameters:', sum([p.numel() for p in model_simple.parameters()]))

MLP_Simple(
  (layer0): Linear(in_features=39, out_features=512, bias=True)
  (final_layer): Linear(in_features=512, out_features=48, bias=True)
)
num parameters: 45104
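
The parameter count printed above can be checked by hand: a `Linear(in, out)` layer holds `in * out` weights plus `out` biases. A quick sketch, where `linear_params` is a helper introduced just for this check:

```python
def linear_params(in_features, out_features):
    # Weight matrix plus bias vector of a fully connected layer.
    return in_features * out_features + out_features

layer0 = linear_params(39, 512)        # 20480
final_layer = linear_params(512, 48)   # 24624
print(layer0 + final_layer)            # 45104
```

Any layers added in the `#TODO` section will increase this number.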


## Training

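Here we define a function `train_model` that optimizes the model's parameters with Adam and a cross-entropy loss. It reports training loss, training accuracy, and dev accuracy after every epoch, and test accuracy once training is complete.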

In [0]:
def train_model(model, batched_train, batched_dev, batched_test, max_epoch=20):
    model = model.cuda()
    loss = torch.nn.CrossEntropyLoss(reduction='mean')
    optim = torch.optim.Adam(model.parameters())
  
    for epoch in range(max_epoch):
        random.shuffle(batched_train)
        train_loss = []
        train_acc = []
        model.train()
        for batch in batched_train:
            optim.zero_grad()
            y, x = batch
            y_hat, y_pred = model(x)
            batch_loss = loss(y_hat, y.long())
            batch_loss.backward()
            optim.step()
            batch_acc = (y_pred == y.long()).sum().item() / y.numel()
            train_loss.append(batch_loss.item())
            train_acc.append(batch_acc)
        _loss = sum(train_loss) / len(train_loss)
        _acc = sum(train_acc) / len(train_acc)
        print(f"Epoch {epoch}")
        print(f"train loss {_loss:.4f} train_acc {_acc:.4f}")
        dev_acc = []
        model.eval()
        for batch in batched_dev:
            y, x = batch
            with torch.no_grad():
                y_hat, y_pred = model(x)
                batch_acc = (y_pred == y.long()).sum().item() / y.numel()
                dev_acc.append(batch_acc)
        _acc = sum(dev_acc) / len(dev_acc)
        print(f"dev_acc {_acc:.4f}")
    test_acc = []
    model.eval()
    for batch in batched_test:
        y, x = batch
        with torch.no_grad():
            y_hat, y_pred = model(x)
            batch_acc = (y_pred == y.long()).sum().item() / y.numel()
            test_acc.append(batch_acc)
    _acc = sum(test_acc) / len(test_acc)
    print(f"training completed.\n")
    print(f"test_acc {_acc:.4f}")

train_model(model_simple, batched_train_0, batched_dev_0, batched_test_0)

Epoch 0
train loss 0.5294 train_acc 0.8144
dev_acc 0.5768
Epoch 1
train loss 0.4947 train_acc 0.8285
dev_acc 0.5682
Epoch 2
train loss 0.4865 train_acc 0.8319
dev_acc 0.5718
Epoch 3
train loss 0.4807 train_acc 0.8327
dev_acc 0.5720
Epoch 4
train loss 0.4651 train_acc 0.8374
dev_acc 0.5664
Epoch 5
train loss 0.4525 train_acc 0.8419
dev_acc 0.5699
Epoch 6
train loss 0.4426 train_acc 0.8464
dev_acc 0.5772
Epoch 7
train loss 0.4207 train_acc 0.8536
dev_acc 0.5640
Epoch 8
train loss 0.4172 train_acc 0.8542
dev_acc 0.5678
Epoch 9
train loss 0.4065 train_acc 0.8593
dev_acc 0.5648
Epoch 10
train loss 0.3941 train_acc 0.8629
dev_acc 0.5538
Epoch 11
train loss 0.3834 train_acc 0.8668
dev_acc 0.5668
Epoch 12
train loss 0.3720 train_acc 0.8697
dev_acc 0.5628
Epoch 13
train loss 0.3715 train_acc 0.8699
dev_acc 0.5635
Epoch 14
train loss 0.3572 train_acc 0.8748
dev_acc 0.5681
Epoch 15
train loss 0.3519 train_acc 0.8768
dev_acc 0.5712
Epoch 16
train loss 0.3468 train_acc 0.8786
dev_acc 0.5639
Epoch 1
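
For reference when reading the numbers above: `torch.nn.CrossEntropyLoss` takes the raw scores from `final_layer` (logits) together with integer class targets, applies a log-softmax, and averages the negative log-probability of the correct class. The sketch below reproduces that computation and the accuracy computation in plain Python. The logits and targets are made up, so none of the values come from the actual model.

```python
import math

def cross_entropy(logits, targets):
    """Mean negative log-softmax probability of the target class."""
    losses = []
    for row, t in zip(logits, targets):
        m = max(row)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(v - m) for v in row))
        losses.append(log_norm - row[t])
    return sum(losses) / len(losses)

def accuracy(logits, targets):
    """Fraction of rows whose highest-scoring class equals the target."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)

# Hypothetical batch: 3 frames, 3 classes.
toy_logits = [[2.0, 0.5, 0.1], [0.2, 0.1, 3.0], [1.0, 1.5, 0.3]]
toy_targets = [0, 2, 0]
print(f"loss {cross_entropy(toy_logits, toy_targets):.4f}")
print(f"acc {accuracy(toy_logits, toy_targets):.4f}")
```

A model that assigns equal scores to all 48 classes would have a loss of `log(48)`, roughly 3.87, which is a useful baseline when judging the first-epoch loss.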

# Adding Context

Next, we will look at the effect of more context on model performance. The cell below creates batched data in which each key frame is accompanied by its 5 neighbouring frames on the left and its 5 neighbouring frames on the right, making 11 frames in total.

In [0]:
window=5
batched_train_5 = batchify(train[0], train[1], 2000, label_dict, window, True)
batched_dev_5 = batchify(dev[0], dev[1], 2000, label_dict, window, True)
batched_test_5 = batchify(test[0], test[1], 2000, label_dict, window, True)
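
With context, each example is a `(2 * window + 1, 39)` block of frames, and the context model flattens it into one vector before its first linear layer. The arithmetic is sketched below with a hypothetical helper `input_size`:

```python
NUM_FEATURES = 39  # MFCC features per frame, as in the data above

def input_size(window, num_features=NUM_FEATURES):
    # Key frame plus `window` frames on each side, flattened.
    return (2 * window + 1) * num_features

for w in (0, 1, 5):
    print(w, 2 * w + 1, input_size(w))
# 0 1 39
# 1 3 117
# 5 11 429
```

The first layer therefore grows linearly with the window size, so larger windows mean more parameters to fit from the same amount of data.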

In [0]:
class MLP_Context(torch.nn.Module):
    def __init__(self,
                 hidden_size,
                 num_labels,
                 window_size=5):
        super().__init__()
        self.layer0 = torch.nn.Linear((2 * window_size + 1) * 39, hidden_size)

        
        # Do not change the surrounding code, only add yours
        #TODO: Add more layers and activations here...

        self.final_layer = torch.nn.Linear(hidden_size, num_labels)

    def forward(self, x):
        """Generate output distribution and argmax
        Args:
            x: batch of context windows of 39-dimensional MFCC features, shape (batch_size, 2 * window_size + 1, 39)
        Return:
            y_hat: unnormalized score (logit) for each output class
            y_pred: index of the label with the highest score
        """
        batch_size, frames, features = x.shape
        x = x.view(batch_size, -1)
        x = self.layer0(x)

        # TODO: use your defined layer

        y_hat = self.final_layer(x) # shape: (batch_size, num_labels)
        _, y_pred = y_hat.max(dim=-1) # y_pred shape: (batch_size,)
        return y_hat, y_pred


In [0]:
model_context = MLP_Context(hidden_size=512, num_labels=len(label_dict))
train_model(model_context, batched_train_5, batched_dev_5, batched_test_5)

Epoch 0
train loss 1.6566 train_acc 0.5111
dev_acc 0.5609
Epoch 1
train loss 1.2214 train_acc 0.6140
dev_acc 0.6009
Epoch 2
train loss 1.0732 train_acc 0.6553
dev_acc 0.6154
Epoch 3
train loss 0.9835 train_acc 0.6793
dev_acc 0.6286
Epoch 4
train loss 0.9218 train_acc 0.6963
dev_acc 0.6070
Epoch 5
train loss 0.8723 train_acc 0.7096
dev_acc 0.6226
Epoch 6
train loss 0.8189 train_acc 0.7257
dev_acc 0.6383
Epoch 7
train loss 0.7638 train_acc 0.7441
dev_acc 0.6465
Epoch 8
train loss 0.7176 train_acc 0.7568
dev_acc 0.6550
Epoch 9
train loss 0.6761 train_acc 0.7706
dev_acc 0.6464
Epoch 10
train loss 0.6372 train_acc 0.7824
dev_acc 0.6450
Epoch 11
train loss 0.6010 train_acc 0.7948
dev_acc 0.6519
Epoch 12
train loss 0.5649 train_acc 0.8057
dev_acc 0.6366
Epoch 13
train loss 0.5313 train_acc 0.8165
dev_acc 0.6462
Epoch 14
train loss 0.5080 train_acc 0.8239
dev_acc 0.6434
Epoch 15
train loss 0.4506 train_acc 0.8442
dev_acc 0.6451
Epoch 16
train loss 0.4223 train_acc 0.8520
dev_acc 0.6256
Epoch 1

## Tasks


1. Experiment with different numbers of layers and hidden sizes for a window size of 5.

2. Explore different window sizes and report the one that worked best for you. Does increasing the context help?

3. (Optional) Write code below that implements a model based on convolutions, and see whether it performs better than the one with fully connected layers. You might find [this](https://pytorch.org/docs/stable/nn.html?highlight=conv1d#torch.nn.Conv1d) helpful.
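
For the optional convolution task, it helps to know which shapes to expect. `torch.nn.Conv1d` takes input of shape `(batch, channels, length)`, so the `(batch, frames, 39)` batches used here would need their last two dimensions swapped to treat the 39 features as channels. The output length follows the formula given in the PyTorch documentation. The helper name `conv1d_out_len` and the kernel sizes below are illustrative assumptions, not a recommended architecture.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    # Output length formula from the torch.nn.Conv1d documentation.
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

frames = 2 * 5 + 1  # a window of 5 on each side gives 11 frames
print(conv1d_out_len(frames, kernel_size=3))             # 9
print(conv1d_out_len(frames, kernel_size=3, padding=1))  # 11
print(conv1d_out_len(frames, kernel_size=5, stride=2))   # 4
```

Multiplying the output length by the number of output channels gives the input size of any linear layer placed after the convolution.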

In [0]:
#Cells for optional part.