# Build the speech model

Now that we've created spectrogram images, it's time to build the computer vision model. If you're following along with the different modules in this PyTorch learning path, then you should have a good understanding of how to create a computer vision model (in particular, see the "Introduction to Computer Vision with PyTorch" Learn module). You'll be using the `torchvision` package to build your vision model. The convolutional neural network (CNN) layer (`conv2d`) will be used to extract the unique features from the spectrogram image for each speech command.

Let's import the packages we need to build the model.

In [1]:
import torch
import torchaudio
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import Dataset, DataLoader
from torchvision import datasets, models, transforms
from torchinfo import summary
import pandas as pd
import os

## Load spectrogram images into a data loader for training

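Here, we provide the path to our image data and use PyTorch's `ImageFolder` dataset helper class to load the images into tensors. We'll also resize every image to 201 x 81 so that all inputs have the same dimensions. `ToTensor` then converts each image to a tensor with pixel values scaled to the range 0 to 1.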

In [2]:
data_path = './data/spectrograms'  # root folder with one subfolder per class

yes_no_dataset = datasets.ImageFolder(
    root=data_path,
    transform=transforms.Compose([transforms.Resize((201,81)),
                                  transforms.ToTensor()
                                  ])
)
print(yes_no_dataset)

Dataset ImageFolder
    Number of datapoints: 7985
    Root location: ./data/spectrograms
    StandardTransform
Transform: Compose(
               Resize(size=(201, 81), interpolation=bilinear, max_size=None, antialias=warn)
               ToTensor()
           )


`ImageFolder` automatically creates the image class labels and indices based on the folder for each audio class. We'll use the `class_to_idx` attribute to view the class mapping for the image dataset.

<img alt="Folder class index diagram" src="images/4-model-1.png" align="middle" />

In [3]:
class_map=yes_no_dataset.class_to_idx

print("\nClass category and index of the images: {}\n".format(class_map))


Class category and index of the images: {'no': 0, 'yes': 1}
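
The mapping above follows from the fact that `ImageFolder` assigns indices to class folders in sorted name order. The following sketch mimics that rule with plain Python on temporary folders; `find_classes_sketch` is a hypothetical helper written for this illustration, not a torchvision function.

```python
import os
import tempfile

def find_classes_sketch(root):
    """Mimic the labeling rule: sorted subfolder names map to 0, 1, 2, ..."""
    classes = sorted(entry.name for entry in os.scandir(root) if entry.is_dir())
    return {name: index for index, name in enumerate(classes)}

with tempfile.TemporaryDirectory() as root:
    # Create the folders in a different order than the final index order.
    for name in ("yes", "no"):
        os.makedirs(os.path.join(root, name))
    mapping = find_classes_sketch(root)

# Invert the mapping to turn a predicted index back into a class name.
idx_to_class = {index: name for name, index in mapping.items()}

print(mapping)       # {'no': 0, 'yes': 1}
print(idx_to_class)  # {0: 'no', 1: 'yes'}
```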



## Split the data for training and testing
We'll need to split the data to use 80 percent to train the model, and 20 percent to test.

In [4]:
#split data to test and train
#use 80% to train
train_size = int(0.8 * len(yes_no_dataset))
test_size = len(yes_no_dataset) - train_size
yes_no_train_dataset, yes_no_test_dataset = torch.utils.data.random_split(yes_no_dataset, [train_size, test_size])

print("Training size:", len(yes_no_train_dataset))
print("Testing size:",len(yes_no_test_dataset))

Training size: 6388
Testing size: 1597
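
Because `random_split` shuffles the indices, each run produces a different split unless a seed is fixed. As a hedged sketch, assuming you want a reproducible split, you can pass a seeded `torch.Generator`. The `TensorDataset` of integers and the seed value 42 are stand-ins chosen for illustration.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in dataset with the same number of items as the spectrogram dataset.
dataset = TensorDataset(torch.arange(7985))
train_size = int(0.8 * len(dataset))
test_size = len(dataset) - train_size

def split(seed):
    # A seeded generator makes the shuffled indices repeatable.
    generator = torch.Generator().manual_seed(seed)
    return random_split(dataset, [train_size, test_size], generator=generator)

train_a, test_a = split(42)
train_b, test_b = split(42)

print(len(train_a), len(test_a))
print(list(train_a.indices) == list(train_b.indices))  # same seed, same split
```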


Because the dataset was randomly split, let's count the training data to verify that the data has a fairly even distribution between the images in the `yes` and `no` categories.

In [5]:
from collections import Counter

# labels in training set
train_classes = [label for _, label in yes_no_train_dataset]
Counter(train_classes)

Counter({1: 3248, 0: 3140})
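
Iterating over the training subset, as above, loads and transforms every image just to read its label. A possibly faster alternative, sketched here under the assumption that the underlying dataset exposes a `targets` list (as `ImageFolder` does) and the split returns a `Subset` with `indices`, is to look the labels up directly. `ToyLabeledDataset` is a hypothetical stand-in for the image dataset.

```python
from collections import Counter

import torch
from torch.utils.data import Dataset, random_split

class ToyLabeledDataset(Dataset):
    """Hypothetical stand-in that exposes `targets` the way ImageFolder does."""

    def __init__(self, targets):
        self.targets = list(targets)

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, index):
        # A real ImageFolder would open and transform an image file here.
        return torch.zeros(1), self.targets[index]

full = ToyLabeledDataset([0, 1] * 50)
generator = torch.Generator().manual_seed(0)
train_subset, _ = random_split(full, [80, 20], generator=generator)

# Fast path: read labels through the subset's indices, no image loading.
fast_counts = Counter(full.targets[i] for i in train_subset.indices)
# Slow path used in the tutorial cell: iterate over every item.
slow_counts = Counter(label for _, label in train_subset)

print(fast_counts == slow_counts)
```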

Load the data into a `DataLoader` and specify the batch size, which is the number of images passed to the model in each training iteration. We'll also set `num_workers`, the number of subprocesses used to load the data.

In [9]:
train_dataloader = torch.utils.data.DataLoader(
    yes_no_train_dataset,
    batch_size=15,
    num_workers=2,
    shuffle=True
)

test_dataloader = torch.utils.data.DataLoader(
    yes_no_test_dataset,
    batch_size=15,
    num_workers=2,
    shuffle=True
)
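
The batch size determines how many iterations make up one pass over the data. This short calculation, using the training size and batch size from this unit, shows the number of batches per epoch and the sample counts at which a progress message is printed when logging every 100 batches (the interval used by the `train` function defined later in this unit).

```python
import math

train_size, batch_size = 6388, 15

# The last batch is smaller when the dataset size is not a multiple of batch_size.
num_batches = math.ceil(train_size / batch_size)
last_batch_size = train_size - (num_batches - 1) * batch_size
print(num_batches, last_batch_size)  # 426 13

# Sample offsets reported when logging every 100th batch.
logged_offsets = [b * batch_size for b in range(num_batches) if b % 100 == 0]
print(logged_offsets)  # [0, 1500, 3000, 4500, 6000]
```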

Let's take a look at part of a training tensor. The following cell prints the first row of pixels in the first color channel of the first training image:

In [10]:
td = train_dataloader.dataset[0][0][0][0]
print(td)

tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0.])
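
The chained indexing in that cell can be hard to read. The sketch below unpacks it step by step on a stand-in sample; the all-zeros tensor is an assumption used in place of a real spectrogram, with the 3 x 201 x 81 shape produced by the resize step.

```python
import torch

# Stand-in for one dataset item: (image tensor, label).
sample = (torch.zeros(3, 201, 81), 0)

image = sample[0]    # the image tensor: channels x height x width
channel = image[0]   # the first color channel: height x width
row = channel[0]     # the first row of pixels in that channel: width values

print(image.shape, channel.shape, row.shape)
```

The final row has 81 values, which matches the 81 entries printed above.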


Get GPU for training, or use CPU if GPU isn't available.

In [11]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Using {} device'.format(device))

Using cuda device


## Create the convolutional neural network


![Diagram showing a convolutional neural network.](./images/4-model-2.png)

We'll define our layers and parameters:

- `conv2d`: Takes 3 input channels, one for each RGB color, because our input images are in color. The 32 is the number of feature maps that the convolutional layer produces. Each feature map is computed by sliding a 5 x 5 kernel over the input with a stride of 1. Max pooling with a 2 x 2 kernel then reduces the dimensions of the feature maps, and the `ReLU` activation replaces negative values with 0.
- `conv2d`: Takes the 32 feature maps from the previous convolutional layer as input and produces 64 feature maps, again with a 5 x 5 kernel and a stride of 1. Max pooling with a 2 x 2 kernel reduces the dimensions of the feature maps, and the `ReLU` activation replaces negative values with 0.
- `dropout`: Randomly zeroes some of the features extracted by the second `conv2d` layer during training, with a probability of 0.50, to prevent overfitting. A second dropout is applied after the first linear layer.
- `flatten`: Converts the feature maps from the `conv2d` output into a one-dimensional vector for the linear layers.
- `Linear`: Takes 51136 features as input and outputs 50 hidden features. The next linear layer takes those 50 values as input and produces 2 output values. The `ReLU` activation function is applied after each linear layer to replace negative values with 0. The 2 output values are used to predict the classification `yes` or `no`.
- `log_softmax`: An activation function that converts the 2 output values into log-probabilities for the two audio classes.
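
The figure of 51136 input features follows from the layer settings listed above. As a worked check, this sketch applies the standard output-size formula for convolution and pooling, floor((size + 2 * padding - kernel) / stride) + 1, to the 201 x 81 input. The helper name `out_size` is introduced here for illustration only.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Output length along one dimension for a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1

h, w = 201, 81
h, w = out_size(h, 5), out_size(w, 5)                      # conv1, 5 x 5 kernel
print("after conv1:", h, w)                                # 197 77
h, w = out_size(h, 2, stride=2), out_size(w, 2, stride=2)  # 2 x 2 max pooling
print("after pool1:", h, w)                                # 98 38
h, w = out_size(h, 5), out_size(w, 5)                      # conv2, 5 x 5 kernel
print("after conv2:", h, w)                                # 94 34
h, w = out_size(h, 2, stride=2), out_size(w, 2, stride=2)  # 2 x 2 max pooling
print("after pool2:", h, w)                                # 47 17

flattened = 64 * h * w  # 64 feature maps from the second conv layer
print("flattened features:", flattened)                    # 51136
```

If you change the resize dimensions or the kernel sizes, the input size of the first linear layer has to be recomputed the same way.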

After defining the CNN, we'll create an instance and move it to the device selected earlier.

In [12]:
class CNNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=5)   # RGB input -> 32 feature maps
        self.conv2 = nn.Conv2d(32, 64, kernel_size=5)  # 32 -> 64 feature maps
        self.conv2_drop = nn.Dropout2d()               # zeroes whole channels, p=0.5
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(51136, 50)                # 64 * 47 * 17 = 51136 features
        self.fc2 = nn.Linear(50, 2)                    # two classes: no, yes

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = self.flatten(x)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)  # active only in training mode
        x = F.relu(self.fc2(x))
        return F.log_softmax(x, dim=1)

model = CNNet().to(device)
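
The model uses dropout, which behaves differently in training and evaluation mode. The following sketch illustrates that difference with a standalone `nn.Dropout` layer on a vector of ones. It uses element-wise dropout rather than the `Dropout2d` layer in the model, which zeroes whole channels, so treat it as an illustration of the mode switch rather than of the model's exact behavior.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()          # training mode: elements are randomly zeroed
train_out = drop(x)

drop.eval()           # evaluation mode: dropout is a no-op
eval_out = drop(x)

# In training mode, surviving values are scaled by 1 / (1 - p) = 2.
print(sorted(train_out.unique().tolist()))
print(torch.equal(eval_out, x))
```

This is why calling `model.train()` before training and `model.eval()` before testing matters for a model that contains dropout layers.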

## Create train and test functions

Next, set the cost function, learning rate, and optimizer. Then define the `train` and `test` functions that you'll use to fit and evaluate the CNN.

In [15]:
# cost function used to determine best parameters
cost = torch.nn.CrossEntropyLoss()

# used to create optimal parameters
learning_rate = 0.0001
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Create the training function

def train(dataloader, model, cost, optimizer):
    model.train()
    size = len(dataloader.dataset)
    for batch, (X, Y) in enumerate(dataloader):
        X, Y = X.to(device), Y.to(device)

        # Forward pass, then backpropagate and update the weights
        optimizer.zero_grad()
        pred = model(X)
        loss = cost(pred, Y)
        loss.backward()
        optimizer.step()

        # Report progress every 100 batches
        if batch % 100 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f'loss: {loss:>7f}  [{current:>5d}/{size:>5d}]')


# Create the validation/test function

def test(dataloader, model):
    size = len(dataloader.dataset)
    model.eval()
    test_loss, correct = 0, 0

    with torch.no_grad():
        for batch, (X, Y) in enumerate(dataloader):
            X, Y = X.to(device), Y.to(device)
            pred = model(X)

            test_loss += cost(pred, Y).item()
            correct += (pred.argmax(1)==Y).type(torch.float).sum().item()

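    # Each batch adds its mean loss, so dividing the sum by the number of
    # samples (not batches) reports roughly the mean loss / batch size.
    test_loss /= size
    correct /= size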

    print(f'\nTest Error:\nacc: {(100*correct):>0.1f}%, avg loss: {test_loss:>8f}\n')
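
The model ends with `log_softmax`, and the cost function is `CrossEntropyLoss`, which applies a log-softmax of its own. The sketch below, using random stand-in values rather than real model output, checks that this combination gives the same loss as passing raw scores to `CrossEntropyLoss` or passing log-probabilities to `NLLLoss`, because applying log-softmax to values that are already log-probabilities leaves them unchanged.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
scores = torch.randn(4, 2)             # stand-in raw outputs for a batch of 4
targets = torch.tensor([0, 1, 1, 0])   # stand-in class labels
log_probs = F.log_softmax(scores, dim=1)

ce_on_scores = nn.CrossEntropyLoss()(scores, targets)
ce_on_log_probs = nn.CrossEntropyLoss()(log_probs, targets)  # the tutorial's setup
nll_on_log_probs = nn.NLLLoss()(log_probs, targets)

print(torch.allclose(ce_on_scores, ce_on_log_probs))
print(torch.allclose(ce_on_scores, nll_on_log_probs))
```

The extra `log_softmax` in the model is therefore redundant for training but harmless, and it makes the model's outputs directly interpretable as log-probabilities.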

## Train the model

Now let's set the number of epochs and call our `train` and `test` functions once per epoch. During training, we print the loss every 100 batches; it should decrease over time. After each epoch, we print the accuracy on the test set, which should increase as the model improves.

In [16]:
epochs = 15

for t in range(epochs):
    print(f'Epoch {t+1}\n-------------------------------')
    train(train_dataloader, model, cost, optimizer)
    test(test_dataloader, model)
print('Done!')

Epoch 1
-------------------------------
loss: 0.692127  [    0/ 6388]
loss: 0.395299  [ 1500/ 6388]
loss: 0.253122  [ 3000/ 6388]
loss: 0.317790  [ 4500/ 6388]
loss: 0.561404  [ 6000/ 6388]

Test Error:
acc: 89.9%, avg loss: 0.016103

Epoch 2
-------------------------------
loss: 0.234733  [    0/ 6388]
loss: 0.351605  [ 1500/ 6388]
loss: 0.123859  [ 3000/ 6388]
loss: 0.193750  [ 4500/ 6388]
loss: 0.217090  [ 6000/ 6388]

Test Error:
acc: 90.5%, avg loss: 0.014598

Epoch 3
-------------------------------
loss: 0.066517  [    0/ 6388]
loss: 0.086646  [ 1500/ 6388]
loss: 0.120797  [ 3000/ 6388]
loss: 0.072816  [ 4500/ 6388]
loss: 0.119996  [ 6000/ 6388]

Test Error:
acc: 92.0%, avg loss: 0.013056

Epoch 4
-------------------------------
loss: 0.124941  [    0/ 6388]
loss: 0.057019  [ 1500/ 6388]
loss: 0.118088  [ 3000/ 6388]
loss: 0.026632  [ 4500/ 6388]
loss: 0.032528  [ 6000/ 6388]

Test Error:
acc: 93.0%, avg loss: 0.011396

Epoch 5
-------------------------------
loss: 0.121537  [   

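Let's look at the summary breakdown of the model architecture. It lists the output shape and parameter count of each layer, including the number of feature maps produced by each convolutional layer. Max pooling and `ReLU` are called as functions in `forward`, so they don't appear as separate rows, but the effect of pooling is visible in the reduced image size between layers. The summary also shows the 51136 input features and the 2 outputs used for classification in the linear layers.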

In [17]:
summary(model, input_size=(15, 3, 201, 81))

Layer (type:depth-idx)                   Output Shape              Param #
CNNet                                    [15, 2]                   --
├─Conv2d: 1-1                            [15, 32, 197, 77]         2,432
├─Conv2d: 1-2                            [15, 64, 94, 34]          51,264
├─Dropout2d: 1-3                         [15, 64, 94, 34]          --
├─Flatten: 1-4                           [15, 51136]               --
├─Linear: 1-5                            [15, 50]                  2,556,850
├─Linear: 1-6                            [15, 2]                   102
Total params: 2,610,648
Trainable params: 2,610,648
Non-trainable params: 0
Total mult-adds (Units.GIGABYTES): 3.05
Input size (MB): 2.93
Forward/backward pass size (MB): 82.80
Params size (MB): 10.44
Estimated Total Size (MB): 96.17
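
The parameter counts in the summary can be reproduced by hand: a convolutional layer has one weight per input channel, output channel, and kernel position, plus one bias per output channel, and a linear layer has one weight per input-output pair plus one bias per output. The helper names below are introduced for this illustration.

```python
def conv2d_params(in_channels, out_channels, kernel):
    """Weights (in * out * k * k) plus one bias per output channel."""
    return in_channels * out_channels * kernel * kernel + out_channels

def linear_params(in_features, out_features):
    """Weights (in * out) plus one bias per output feature."""
    return in_features * out_features + out_features

counts = {
    "conv1": conv2d_params(3, 32, 5),
    "conv2": conv2d_params(32, 64, 5),
    "fc1": linear_params(51136, 50),
    "fc2": linear_params(50, 2),
}
total = sum(counts.values())

print(counts)
print(total)                       # 2610648
print(round(total * 4 / 1e6, 2))   # size in MB at 4 bytes per float32: 10.44
```

Almost all of the parameters sit in the first linear layer, which is a direct consequence of the large flattened feature vector.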

## Test the model

You should reach roughly 93 to 95 percent accuracy by the 15th epoch. Here we grab a batch from our test data and compare the model's prediction for the first image with the actual label.

In [18]:
model.eval()
class_names = ['no', 'yes']  # index order matches class_to_idx

with torch.no_grad():
    # Take a single batch and compare the first prediction with its label
    X, Y = next(iter(test_dataloader))
    X, Y = X.to(device), Y.to(device)
    pred = model(X)
    predicted, actual = pred[0].argmax(0).item(), Y[0].item()
    print("Predicted:\nvalue={}, class_name= {}\n".format(predicted, class_names[predicted]))
    print("Actual:\nvalue={}, class_name= {}\n".format(actual, class_names[actual]))

Predicted:
value=0, class_name= no

Actual:
value=0, class_name= no
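
The cell above inspects only the first item in the batch. As a sketch of how the same idea extends to a whole batch, the example below converts log-probabilities to probabilities with `exp` and picks the most likely class per row. The values are hypothetical and hand-written for illustration; they are not output from the trained model.

```python
import torch

class_names = ['no', 'yes']

# Hypothetical log-probabilities for a batch of 3 images.
log_probs = torch.log(torch.tensor([[0.9, 0.1],
                                    [0.2, 0.8],
                                    [0.6, 0.4]]))

probs = log_probs.exp()                 # back to probabilities; each row sums to 1
pred_idx = log_probs.argmax(dim=1)      # most likely class index per image
labels = [class_names[i] for i in pred_idx.tolist()]

print(probs.sum(dim=1))
print(labels)  # ['no', 'yes', 'no']
```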

