# Task description
- Classify the speakers of the given features.
- Main goal: Learn how to use a transformer.
- Baselines:
  - Easy: Run the sample code and learn how to use a transformer.
  - Medium: Learn how to adjust the parameters of a transformer.
  - Hard: Construct a [conformer](https://arxiv.org/abs/2005.08100), which is a variant of the transformer.

- Other links
  - Kaggle: [link](https://www.kaggle.com/t/859c9ca9ede14fdea841be627c412322)
  - Slide: [link](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/hw/HW04/HW04.pdf)
  - Data: [link](https://drive.google.com/file/d/1T0RPnu-Sg5eIPwQPfYysipfcz81MnsYe/view?usp=sharing)
  - Video (Chinese): [link](https://www.youtube.com/watch?v=EPerg2UnGaI)
  - Video (English): [link](https://www.youtube.com/watch?v=Gpz6AUvCak0)
  - Workaround if the dataset download fails: [link](https://drive.google.com/drive/folders/13T0Pa_WGgQxNkqZk781qhc5T9-zfh19e?usp=sharing)

# Download dataset
- Follow the instructions [here](https://drive.google.com/drive/folders/13T0Pa_WGgQxNkqZk781qhc5T9-zfh19e?usp=sharing) to download the data.
- The data is [here](https://drive.google.com/file/d/1gaFy8RaQVUEXo2n0peCBR5gYKCB-mNHc/view?usp=sharing).

In [2]:
!gdown --id '1A7x5ndSbACK3QyUolMDoHRIFZXh7nLs0' --output Dataset.zip
!unzip Dataset.zip

Failed to retrieve file url:

	Too many users have viewed or downloaded this file recently. Please
	try accessing the file again later. If the file you are trying to
	access is particularly large or is shared with many people, it may
	take up to 24 hours to be able to view or download the file. If you
	still can't access a file after 24 hours, contact your domain
	administrator.

You may still be able to access the file from the browser:

	https://drive.google.com/uc?id=1A7x5ndSbACK3QyUolMDoHRIFZXh7nLs0

but Gdown can't. Please check connections and permissions.
unzip:  cannot find or open Dataset.zip, Dataset.zip.zip or Dataset.zip.ZIP.


# Data

## Dataset
- The original dataset is [VoxCeleb1](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/).
- The [license](https://creativecommons.org/licenses/by/4.0/) and its [complete version](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/files/license.txt) for VoxCeleb1.
- We randomly select 600 speakers from VoxCeleb1.
- The raw waveforms are then preprocessed into mel-spectrograms.

- Args:
  - data_dir: The path to the data directory.
  - metadata_path: The path to the metadata.
  - segment_len: The length of audio segment for training.
- The architecture of the data directory:
  - data directory
    - metadata.json
    - testdata.json
    - mapping.json
    - uttr-{random string}.pt

- The information in the metadata:
  - "n_mels": The dimension of the mel-spectrogram.
  - "speakers": A dictionary.
    - Key: speaker ids.
    - Value: "feature_path" and "mel_len"
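
As a rough illustration of the layout described above, the sketch below builds a tiny in-memory stand-in for `metadata.json` and flattens it into (feature path, class index) pairs. The speaker ids, file names, and lengths are invented for illustration; only the keys `n_mels`, `speakers`, `feature_path`, and `mel_len` come from the description above.

```python
# Hypothetical miniature of metadata.json; ids, paths and lengths are invented.
metadata = {
    "n_mels": 40,
    "speakers": {
        "id_a": [
            {"feature_path": "uttr-aaa.pt", "mel_len": 435},
            {"feature_path": "uttr-bbb.pt", "mel_len": 490},
        ],
        "id_b": [
            {"feature_path": "uttr-ccc.pt", "mel_len": 375},
        ],
    },
}

# Map each speaker name to an integer class index (mapping.json plays this role).
speaker2id = {name: idx for idx, name in enumerate(sorted(metadata["speakers"]))}

# Flatten into one (feature_path, class index) pair per utterance.
pairs = [
    (utt["feature_path"], speaker2id[name])
    for name, utts in metadata["speakers"].items()
    for utt in utts
]
print(pairs)
```

Each speaker contributes one pair per utterance, so the number of training items is the total number of utterances, not the number of speakers.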


For efficiency, we cut the mel-spectrograms into fixed-length segments during the training step.
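
A minimal sketch of this segmenting step on dummy data, assuming features shaped (frames, n_mels). The helper name `random_segment` is hypothetical and only mirrors the cropping logic used in the dataset class.

```python
import random

import torch


def random_segment(mel: torch.Tensor, segment_len: int = 128) -> torch.Tensor:
    """Return a random window of `segment_len` frames, or the whole input if it is shorter."""
    if len(mel) > segment_len:
        # random.randint is inclusive on both ends, so the last valid start is allowed.
        start = random.randint(0, len(mel) - segment_len)
        return mel[start:start + segment_len]
    return mel


long_mel = torch.randn(500, 40)   # dummy utterance longer than one segment
short_mel = torch.randn(90, 40)   # dummy utterance shorter than one segment

print(random_segment(long_mel).shape)   # torch.Size([128, 40])
print(random_segment(short_mel).shape)  # torch.Size([90, 40])
```

Short utterances are returned unchanged, which is why batches still need padding later on.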

In [3]:
import os
import json
import torch
import random
from pathlib import Path
from torch.utils.data import Dataset
from torch.nn.utils.rnn import pad_sequence


class myDataset(Dataset):
  def __init__(self, data_dir, segment_len=128):
    self.data_dir = data_dir
    self.segment_len = segment_len

    # Load the mapping from speaker name to their corresponding id.
    mapping_path = Path(data_dir) / "mapping.json"
    mapping = json.load(mapping_path.open())
    self.speaker2id = mapping["speaker2id"]

    # Load metadata of training data.
    metadata_path = Path(data_dir) / "metadata.json"
    metadata = json.load(open(metadata_path))["speakers"]

    # Get the total number of speakers.
    self.speaker_num = len(metadata.keys())
    self.data = []
    for speaker in metadata.keys():
      for utterances in metadata[speaker]:
        self.data.append([utterances["feature_path"], self.speaker2id[speaker]])

  def __len__(self):
    return len(self.data)

  def __getitem__(self, index):
    feat_path, speaker = self.data[index]
    # Load preprocessed mel-spectrogram.
    mel = torch.load(os.path.join(self.data_dir, feat_path))

    # Segment mel-spectrogram into "segment_len" frames.
    if len(mel) > self.segment_len:
      # Randomly get the starting point of the segment.
      start = random.randint(0, len(mel) - self.segment_len)
      # Get a segment with "segment_len" frames.
      mel = torch.FloatTensor(mel[start:start+self.segment_len])
    else:
      mel = torch.FloatTensor(mel)
    # Turn the speaker id into long for computing loss later.
    speaker = torch.FloatTensor([speaker]).long()
    return mel, speaker

  def get_speaker_number(self):
    return self.speaker_num

## Dataloader
- Split the dataset into a training set (90%) and a validation set (10%).
- Create dataloaders to iterate over the data.
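
The two steps can be tried in isolation on toy data. The sketch below assumes dummy all-zero features and a 50-item `TensorDataset` as stand-ins for the real dataset; it only demonstrates `pad_sequence` and `random_split`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import TensorDataset, random_split

# Three dummy utterances with different numbers of frames, 40 mel bins each.
mels = [torch.zeros(128, 40), torch.zeros(90, 40), torch.zeros(57, 40)]

# Pad every utterance to the longest one in the batch with the value -20.
padded = pad_sequence(mels, batch_first=True, padding_value=-20)
print(padded.shape)              # torch.Size([3, 128, 40])
print(padded[2, 57:].unique())   # tensor([-20.])

# 90/10 split of a toy dataset with 50 items.
toy = TensorDataset(torch.arange(50))
train_len = int(0.9 * len(toy))
train_set, valid_set = random_split(toy, [train_len, len(toy) - train_len])
print(len(train_set), len(valid_set))  # 45 5
```

Computing the validation length as the remainder keeps the two lengths summing to the dataset size even when 90% is not a whole number.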


In [4]:
import torch
from torch.utils.data import DataLoader, random_split
from torch.nn.utils.rnn import pad_sequence


def collate_batch(batch):
  """Collate a batch of data."""
  # Process features within a batch.
  mel, speaker = zip(*batch)
  # Because we train the model batch by batch, we need to pad the features in the same batch to make their lengths the same.
  mel = pad_sequence(mel, batch_first=True, padding_value=-20)    # Pad with -20, i.e. the log of 10^(-20), which is a very small value.
  # mel: (batch size, length, 40)
  return mel, torch.FloatTensor(speaker).long()


def get_dataloader(data_dir, batch_size, n_workers):
  """Generate dataloader"""
  dataset = myDataset(data_dir)
  speaker_num = dataset.get_speaker_number()
  # Split dataset into training dataset and validation dataset
  trainlen = int(0.9 * len(dataset))
  lengths = [trainlen, len(dataset) - trainlen]
  trainset, validset = random_split(dataset, lengths)

  train_loader = DataLoader(
    trainset,
    batch_size=batch_size,
    shuffle=True,
    drop_last=True,
    num_workers=n_workers,
    pin_memory=True,
    collate_fn=collate_batch,
  )
  valid_loader = DataLoader(
    validset,
    batch_size=batch_size,
    num_workers=n_workers,
    drop_last=True,
    pin_memory=True,
    collate_fn=collate_batch,
  )

  return train_loader, valid_loader, speaker_num


# Model
- TransformerEncoderLayer:
  - Base transformer encoder layer in [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
  - Parameters:
    - d_model: the number of expected features of the input (required).
    - nhead: the number of heads in the multi-head attention module (required).
    - dim_feedforward: the dimension of the feedforward network (default=2048).
    - dropout: the dropout value (default=0.1).
    - activation: the activation function of the intermediate layer, relu or gelu (default=relu).

- TransformerEncoder:
  - TransformerEncoder is a stack of N transformer encoder layers.
  - Parameters:
    - encoder_layer: an instance of the TransformerEncoderLayer() class (required).
    - num_layers: the number of sub-encoder-layers in the encoder (required).
    - norm: the layer normalization component (optional).
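
For reference, a small sketch of how these two classes fit together. The sizes (`d_model=80`, `nhead=2`, `dim_feedforward=256`, two layers) are illustrative assumptions, not settings taken from this notebook.

```python
import torch
import torch.nn as nn

d_model = 80  # illustrative value; it must be divisible by nhead

# One encoder layer: self-attention followed by a feedforward network.
layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=2, dim_feedforward=256, dropout=0.1
)
# Stack two copies of that layer.
encoder = nn.TransformerEncoder(layer, num_layers=2)

# Without batch_first=True, the expected input layout is (length, batch size, d_model).
x = torch.randn(128, 4, d_model)
out = encoder(x)
print(out.shape)  # torch.Size([128, 4, 80])
```

The encoder keeps the input shape, so a pooling step over the length dimension is still needed before a classification layer.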

In [20]:
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeedForwardModule(nn.Module):
    def __init__(self, d_model, dim_ff, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, dim_ff),
            nn.ReLU(),          # using nn.ReLU() here is also fine if nothing else is available
            nn.Dropout(dropout),
            nn.Linear(dim_ff, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # x: (B, T, D)
        return self.net(x)


class ConvolutionModule(nn.Module):
    def __init__(self, d_model, kernel_size=15, dropout=0.1):
        super().__init__()
        self.layer_norm = nn.LayerNorm(d_model)
        self.pointwise_conv1 = nn.Conv1d(d_model, 2 * d_model, kernel_size=1)
        self.glu = nn.GLU(dim=1)
        self.depthwise_conv = nn.Conv1d(
            d_model,
            d_model,
            kernel_size=kernel_size,
            padding=kernel_size // 2,
            groups=d_model,          # depthwise
        )
        self.batch_norm = nn.BatchNorm1d(d_model)
        self.act = nn.SiLU()        # or Swish / ReLU
        self.pointwise_conv2 = nn.Conv1d(d_model, d_model, kernel_size=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: (B, T, D)
        x = self.layer_norm(x)
        x = x.transpose(1, 2)       # (B, D, T) for Conv1d

        x = self.pointwise_conv1(x)
        x = self.glu(x)             # (B, D, T)

        x = self.depthwise_conv(x)
        x = self.batch_norm(x)
        x = self.act(x)

        x = self.pointwise_conv2(x)
        x = self.dropout(x)

        x = x.transpose(1, 2)       # back to (B, T, D)
        return x

class ConformerBlock(nn.Module):
    def __init__(self,
                 d_model,
                 nhead,
                 dim_ff,
                 conv_kernel_size=15,
                 dropout=0.1):
        super().__init__()
        self.ffn1 = FeedForwardModule(d_model, dim_ff, dropout)
        self.ffn2 = FeedForwardModule(d_model, dim_ff, dropout)

        self.self_attn = nn.MultiheadAttention(
            embed_dim=d_model,
            num_heads=nhead,
            dropout=dropout,
            batch_first=False,      # we will feed it (T, B, D)
        )

        self.conv_module = ConvolutionModule(d_model, conv_kernel_size, dropout)

        self.norm_mha = nn.LayerNorm(d_model)
        self.norm_out = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: (B, T, D)
        # FFN1
        residual = x
        x = residual + 0.5 * self.ffn1(x)

        # MHA
        residual = x
        x_norm = self.norm_mha(x)
        # Convert to (T, B, D) for MHA
        x_attn_in = x_norm.transpose(0, 1)
        x_attn, _ = self.self_attn(x_attn_in, x_attn_in, x_attn_in)
        x_attn = x_attn.transpose(0, 1)    # back to (B, T, D)
        x = residual + self.dropout(x_attn)

        # Conv module
        residual = x
        x_conv = self.conv_module(x)
        x = residual + self.dropout(x_conv)

        # FFN2
        residual = x
        x = residual + 0.5 * self.ffn2(x)

        # Apply LayerNorm once more at the end
        x = self.norm_out(x)
        return x



class Classifier(nn.Module):
    def __init__(self,
                 d_model=144,
                 n_spks=600,
                 dropout=0.1,
                 num_layers=4,
                 nhead=4,
                 dim_ff=576,
                 conv_kernel_size=15):
        super().__init__()

        self.prenet = nn.Linear(40, d_model)

        # A stack of conformer blocks
        self.blocks = nn.ModuleList([
            ConformerBlock(
                d_model=d_model,
                nhead=nhead,
                dim_ff=dim_ff,
                conv_kernel_size=conv_kernel_size,
                dropout=dropout,
            )
            for _ in range(num_layers)
        ])

        self.pred_layer = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, n_spks),
        )

    def forward(self, mels):
        # mels: (B, T, 40)
        x = self.prenet(mels)    # (B, T, D)

        for blk in self.blocks:
            x = blk(x)           # (B, T, D)

        stats = x.mean(dim=1)    # mean pooling → (B, D)
        out = self.pred_layer(stats)
        return out


# Learning rate schedule
- For transformer architectures, the learning rate schedule is designed differently from that of CNNs.
- Previous works show that learning rate warmup is useful for training models with transformer architectures.
- The warmup schedule:
  - Set the learning rate to 0 at the beginning.
  - The learning rate increases linearly from 0 to the initial learning rate during the warmup period.
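
To make the shape of the schedule concrete, here is a plain-Python sketch of the factor that multiplies the initial learning rate: linear warmup followed by the half-cosine decay used in this notebook. The function name `lr_multiplier` is hypothetical, and the step counts (1000 warmup steps, 70000 total steps) match the training configuration used later.

```python
import math


def lr_multiplier(step, warmup_steps, total_steps, num_cycles=0.5):
    """Factor applied to the initial learning rate at a given step."""
    if step < warmup_steps:
        # Linear warmup from 0 to 1.
        return step / max(1, warmup_steps)
    # Cosine decay from 1 to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * num_cycles * 2.0 * progress)))


for step in (0, 500, 1000, 35500, 70000):
    print(step, round(lr_multiplier(step, 1000, 70000), 3))
```

The factor is 0 at step 0, 0.5 halfway through warmup, 1 when warmup ends, 0.5 halfway through the decay, and 0 at the final step.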

In [21]:
import math

import torch
from torch.optim import Optimizer
from torch.optim.lr_scheduler import LambdaLR


def get_cosine_schedule_with_warmup(
  optimizer: Optimizer,
  num_warmup_steps: int,
  num_training_steps: int,
  num_cycles: float = 0.5,
  last_epoch: int = -1,
):
  """
  Create a schedule with a learning rate that decreases following the values of the cosine function between the
  initial lr set in the optimizer to 0, after a warmup period during which it increases linearly between 0 and the
  initial lr set in the optimizer.

  Args:
    optimizer (:class:`~torch.optim.Optimizer`):
      The optimizer for which to schedule the learning rate.
    num_warmup_steps (:obj:`int`):
      The number of steps for the warmup phase.
    num_training_steps (:obj:`int`):
      The total number of training steps.
    num_cycles (:obj:`float`, `optional`, defaults to 0.5):
      The number of waves in the cosine schedule (the default is to just decrease from the max value to 0
      following a half-cosine).
    last_epoch (:obj:`int`, `optional`, defaults to -1):
      The index of the last epoch when resuming training.

  Return:
    :obj:`torch.optim.lr_scheduler.LambdaLR` with the appropriate schedule.
  """

  def lr_lambda(current_step):
    # Warmup
    if current_step < num_warmup_steps:
      return float(current_step) / float(max(1, num_warmup_steps))
    # Cosine decay
    progress = float(current_step - num_warmup_steps) / float(
      max(1, num_training_steps - num_warmup_steps)
    )
    return max(
      0.0, 0.5 * (1.0 + math.cos(math.pi * float(num_cycles) * 2.0 * progress))
    )

  return LambdaLR(optimizer, lr_lambda, last_epoch)


# Model Function
- Model forward function.
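
The loss and accuracy computed in the forward step can be checked by hand on a toy batch. The logits and labels below are made-up values for four utterances and three speakers.

```python
import torch
import torch.nn as nn

# Made-up model outputs: one row per utterance, one column per speaker.
logits = torch.tensor([[2.0, 0.5, 0.1],
                       [0.2, 0.3, 3.0],
                       [1.5, 0.1, 0.4],
                       [0.1, 2.2, 0.3]])
labels = torch.tensor([0, 2, 1, 1])

# CrossEntropyLoss takes raw logits and integer class labels.
loss = nn.CrossEntropyLoss()(logits, labels)

# The predicted speaker is the column with the highest score.
preds = logits.argmax(1)
accuracy = (preds == labels).float().mean()
print(preds.tolist(), accuracy.item())  # [0, 2, 0, 1] 0.75
```

Three of the four predictions match the labels, which gives the accuracy of 0.75.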

In [22]:
import torch


def model_fn(batch, model, criterion, device):
  """Forward a batch through the model."""

  mels, labels = batch
  mels = mels.to(device)
  labels = labels.to(device)

  outs = model(mels)

  loss = criterion(outs, labels)

  # Get the speaker id with highest probability.
  preds = outs.argmax(1)
  # Compute accuracy.
  accuracy = torch.mean((preds == labels).float())

  return loss, accuracy


# Validate
- Calculate accuracy of the validation set.
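
Validation switches the model to evaluation mode because layers such as dropout behave differently there. A small sketch on a standalone `nn.Dropout` layer (not the classifier itself) shows the difference.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

# Evaluation mode: dropout is a no-op, so outputs are deterministic.
drop.eval()
with torch.no_grad():
    y_eval = drop(x)
print(torch.equal(y_eval, x))  # True

# Training mode: entries are zeroed at random and survivors are scaled by 1 / (1 - p).
drop.train()
y_train = drop(x)
print(sorted(y_train.unique().tolist()))
```

In training mode every output is either 0 or 2 for this input, so accuracy measured without calling `eval()` would be noisy.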

In [23]:
from tqdm import tqdm
import torch


def valid(dataloader, model, criterion, device):
  """Validate on validation set."""

  model.eval()
  running_loss = 0.0
  running_accuracy = 0.0
  pbar = tqdm(total=len(dataloader.dataset), ncols=0, desc="Valid", unit=" uttr")

  for i, batch in enumerate(dataloader):
    with torch.no_grad():
      loss, accuracy = model_fn(batch, model, criterion, device)
      running_loss += loss.item()
      running_accuracy += accuracy.item()

    pbar.update(dataloader.batch_size)
    pbar.set_postfix(
      loss=f"{running_loss / (i+1):.2f}",
      accuracy=f"{running_accuracy / (i+1):.2f}",
    )

  pbar.close()
  model.train()

  return running_accuracy / len(dataloader)


# Main function

In [24]:
from tqdm import tqdm

import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.utils.data import DataLoader, random_split


def parse_args():
  """arguments"""
  config = {
    "data_dir": "./Dataset",
    "save_path": "model.ckpt",
    "batch_size": 32,
    "n_workers": 8,
    "valid_steps": 2000,
    "warmup_steps": 1000,
    "save_steps": 10000,
    "total_steps": 70000,
  }

  return config


def main(
  data_dir,
  save_path,
  batch_size,
  n_workers,
  valid_steps,
  warmup_steps,
  total_steps,
  save_steps,
):
  """Main function."""
  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  print(f"[Info]: Use {device} now!")

  train_loader, valid_loader, speaker_num = get_dataloader(data_dir, batch_size, n_workers)
  train_iterator = iter(train_loader)
  print(f"[Info]: Finish loading data!",flush = True)

  model = Classifier(n_spks=speaker_num).to(device)
  criterion = nn.CrossEntropyLoss()
  optimizer = AdamW(model.parameters(), lr=1e-3)
  scheduler = get_cosine_schedule_with_warmup(optimizer, warmup_steps, total_steps)
  print(f"[Info]: Finish creating model!",flush = True)

  best_accuracy = -1.0
  best_state_dict = None

  pbar = tqdm(total=valid_steps, ncols=0, desc="Train", unit=" step")

  for step in range(total_steps):
    # Get data
    try:
      batch = next(train_iterator)
    except StopIteration:
      train_iterator = iter(train_loader)
      batch = next(train_iterator)

    loss, accuracy = model_fn(batch, model, criterion, device)
    batch_loss = loss.item()
    batch_accuracy = accuracy.item()

    # Update model
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()

    # Log
    pbar.update()
    pbar.set_postfix(
      loss=f"{batch_loss:.2f}",
      accuracy=f"{batch_accuracy:.2f}",
      step=step + 1,
    )

    # Do validation
    if (step + 1) % valid_steps == 0:
      pbar.close()

      valid_accuracy = valid(valid_loader, model, criterion, device)

      # keep the best model
      if valid_accuracy > best_accuracy:
        best_accuracy = valid_accuracy
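        # Snapshot the weights: state_dict() returns references that keep changing as training continues.
        best_state_dict = {k: v.detach().clone() for k, v in model.state_dict().items()}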

      pbar = tqdm(total=valid_steps, ncols=0, desc="Train", unit=" step")

    # Save the best model so far.
    if (step + 1) % save_steps == 0 and best_state_dict is not None:
      torch.save(best_state_dict, save_path)
      pbar.write(f"Step {step + 1}, best model saved. (accuracy={best_accuracy:.4f})")

  pbar.close()


if __name__ == "__main__":
  main(**parse_args())


[Info]: Use cuda now!
[Info]: Finish loading data!
[Info]: Finish creating model!


Train: 100% 2000/2000 [00:46<00:00, 43.42 step/s, accuracy=0.19, loss=3.55, step=2000]
Valid: 100% 6944/6944 [00:02<00:00, 3295.32 uttr/s, accuracy=0.21, loss=3.77]
Train: 100% 2000/2000 [00:45<00:00, 44.13 step/s, accuracy=0.25, loss=3.04, step=4000]
Valid: 100% 6944/6944 [00:01<00:00, 3483.43 uttr/s, accuracy=0.32, loss=3.12]
Train: 100% 2000/2000 [00:45<00:00, 44.21 step/s, accuracy=0.44, loss=2.11, step=6000]
Valid: 100% 6944/6944 [00:02<00:00, 3430.40 uttr/s, accuracy=0.41, loss=2.58]
Train: 100% 2000/2000 [00:45<00:00, 43.89 step/s, accuracy=0.62, loss=1.67, step=8000]
Valid: 100% 6944/6944 [00:01<00:00, 3502.86 uttr/s, accuracy=0.50, loss=2.14]
Train: 100% 2000/2000 [00:45<00:00, 43.99 step/s, accuracy=0.50, loss=2.20, step=1e+4]
Valid: 100% 6944/6944 [00:01<00:00, 3521.79 uttr/s, accuracy=0.54, loss=1.99]
Train:   0% 8/2000 [00:00<00:53, 37.16 step/s, accuracy=0.62, loss=1.85, step=1e+4]

Step 10000, best model saved. (accuracy=0.5395)


Train: 100% 2000/2000 [00:45<00:00, 44.18 step/s, accuracy=0.38, loss=2.42, step=12000]
Valid: 100% 6944/6944 [00:01<00:00, 3530.55 uttr/s, accuracy=0.58, loss=1.81]
Train: 100% 2000/2000 [00:45<00:00, 44.37 step/s, accuracy=0.69, loss=1.36, step=14000]
Valid: 100% 6944/6944 [00:01<00:00, 3504.97 uttr/s, accuracy=0.61, loss=1.68]
Train: 100% 2000/2000 [00:45<00:00, 44.36 step/s, accuracy=0.53, loss=1.71, step=16000]
Valid: 100% 6944/6944 [00:01<00:00, 3528.31 uttr/s, accuracy=0.62, loss=1.60]
Train: 100% 2000/2000 [00:45<00:00, 44.03 step/s, accuracy=0.75, loss=0.62, step=18000]
Valid: 100% 6944/6944 [00:01<00:00, 3500.20 uttr/s, accuracy=0.65, loss=1.45]
Train: 100% 2000/2000 [00:45<00:00, 44.26 step/s, accuracy=0.69, loss=1.16, step=2e+4]
Valid: 100% 6944/6944 [00:02<00:00, 3462.54 uttr/s, accuracy=0.66, loss=1.43]
Train:   0% 8/2000 [00:00<00:58, 34.01 step/s, accuracy=0.66, loss=1.25, step=2e+4]

Step 20000, best model saved. (accuracy=0.6645)


Train: 100% 2000/2000 [00:45<00:00, 44.29 step/s, accuracy=0.75, loss=0.93, step=22000]
Valid: 100% 6944/6944 [00:01<00:00, 3498.33 uttr/s, accuracy=0.70, loss=1.29]
Train: 100% 2000/2000 [00:45<00:00, 43.94 step/s, accuracy=0.78, loss=1.03, step=24000]
Valid: 100% 6944/6944 [00:02<00:00, 3346.07 uttr/s, accuracy=0.70, loss=1.25]
Train: 100% 2000/2000 [00:45<00:00, 44.26 step/s, accuracy=0.84, loss=0.87, step=26000]
Valid: 100% 6944/6944 [00:01<00:00, 3557.62 uttr/s, accuracy=0.72, loss=1.16]
Train: 100% 2000/2000 [00:45<00:00, 44.38 step/s, accuracy=0.94, loss=0.44, step=28000]
Valid: 100% 6944/6944 [00:02<00:00, 3428.21 uttr/s, accuracy=0.73, loss=1.09]
Train: 100% 2000/2000 [00:45<00:00, 44.22 step/s, accuracy=0.75, loss=0.68, step=3e+4]
Valid: 100% 6944/6944 [00:02<00:00, 3399.72 uttr/s, accuracy=0.75, loss=1.06]
Train:   0% 9/2000 [00:00<00:48, 41.37 step/s, accuracy=0.91, loss=0.47, step=3e+4]

Step 30000, best model saved. (accuracy=0.7461)


Train: 100% 2000/2000 [00:45<00:00, 44.35 step/s, accuracy=0.84, loss=0.53, step=32000]
Valid: 100% 6944/6944 [00:01<00:00, 3478.81 uttr/s, accuracy=0.75, loss=1.05]
Train: 100% 2000/2000 [00:45<00:00, 43.87 step/s, accuracy=0.88, loss=0.54, step=34000]
Valid: 100% 6944/6944 [00:02<00:00, 3336.52 uttr/s, accuracy=0.77, loss=0.94]
Train: 100% 2000/2000 [00:45<00:00, 43.88 step/s, accuracy=0.97, loss=0.32, step=36000]
Valid: 100% 6944/6944 [00:02<00:00, 3360.51 uttr/s, accuracy=0.78, loss=0.93]
Train: 100% 2000/2000 [00:45<00:00, 43.95 step/s, accuracy=0.94, loss=0.24, step=38000]
Valid: 100% 6944/6944 [00:02<00:00, 3438.67 uttr/s, accuracy=0.80, loss=0.88]
Train: 100% 2000/2000 [00:45<00:00, 44.41 step/s, accuracy=0.94, loss=0.23, step=4e+4]
Valid: 100% 6944/6944 [00:02<00:00, 3340.95 uttr/s, accuracy=0.80, loss=0.88]
Train:   0% 8/2000 [00:00<00:59, 33.21 step/s, accuracy=0.91, loss=0.44, step=4e+4]

Step 40000, best model saved. (accuracy=0.8029)


Train: 100% 2000/2000 [00:44<00:00, 44.64 step/s, accuracy=0.84, loss=0.44, step=42000]
Valid: 100% 6944/6944 [00:02<00:00, 3365.40 uttr/s, accuracy=0.80, loss=0.87]
Train: 100% 2000/2000 [00:44<00:00, 44.55 step/s, accuracy=0.94, loss=0.18, step=44000]
Valid: 100% 6944/6944 [00:02<00:00, 3439.33 uttr/s, accuracy=0.82, loss=0.78]
Train: 100% 2000/2000 [00:45<00:00, 43.58 step/s, accuracy=0.84, loss=0.47, step=46000]
Valid: 100% 6944/6944 [00:02<00:00, 3332.94 uttr/s, accuracy=0.82, loss=0.82]
Train: 100% 2000/2000 [00:45<00:00, 44.10 step/s, accuracy=0.97, loss=0.15, step=48000]
Valid: 100% 6944/6944 [00:02<00:00, 3455.16 uttr/s, accuracy=0.83, loss=0.75]
Train: 100% 2000/2000 [00:45<00:00, 43.79 step/s, accuracy=0.88, loss=0.35, step=5e+4]
Valid: 100% 6944/6944 [00:02<00:00, 3276.72 uttr/s, accuracy=0.84, loss=0.74]
Train:   0% 8/2000 [00:00<00:53, 37.39 step/s, accuracy=0.94, loss=0.17, step=5e+4]

Step 50000, best model saved. (accuracy=0.8355)


Train: 100% 2000/2000 [00:46<00:00, 43.38 step/s, accuracy=0.97, loss=0.14, step=52000]
Valid: 100% 6944/6944 [00:02<00:00, 3438.08 uttr/s, accuracy=0.84, loss=0.70]
Train: 100% 2000/2000 [00:45<00:00, 44.00 step/s, accuracy=0.91, loss=0.17, step=54000]
Valid: 100% 6944/6944 [00:02<00:00, 3382.76 uttr/s, accuracy=0.84, loss=0.71]
Train: 100% 2000/2000 [00:45<00:00, 44.08 step/s, accuracy=0.91, loss=0.34, step=56000]
Valid: 100% 6944/6944 [00:02<00:00, 3395.62 uttr/s, accuracy=0.85, loss=0.66]
Train: 100% 2000/2000 [00:45<00:00, 43.63 step/s, accuracy=0.88, loss=0.58, step=58000]
Valid: 100% 6944/6944 [00:01<00:00, 3516.01 uttr/s, accuracy=0.85, loss=0.63]
Train: 100% 2000/2000 [00:45<00:00, 44.22 step/s, accuracy=0.97, loss=0.25, step=6e+4]
Valid: 100% 6944/6944 [00:02<00:00, 3412.47 uttr/s, accuracy=0.86, loss=0.66]
Train:   0% 8/2000 [00:00<00:55, 35.94 step/s, accuracy=0.94, loss=0.26, step=6e+4]

Step 60000, best model saved. (accuracy=0.8567)


Train: 100% 2000/2000 [00:45<00:00, 43.86 step/s, accuracy=1.00, loss=0.06, step=62000]
Valid: 100% 6944/6944 [00:02<00:00, 3355.47 uttr/s, accuracy=0.87, loss=0.60]
Train: 100% 2000/2000 [00:45<00:00, 43.89 step/s, accuracy=0.94, loss=0.29, step=64000]
Valid: 100% 6944/6944 [00:02<00:00, 3345.39 uttr/s, accuracy=0.86, loss=0.62]
Train: 100% 2000/2000 [00:44<00:00, 44.51 step/s, accuracy=0.91, loss=0.22, step=66000]
Valid: 100% 6944/6944 [00:02<00:00, 3395.65 uttr/s, accuracy=0.86, loss=0.60]
Train: 100% 2000/2000 [00:45<00:00, 44.31 step/s, accuracy=1.00, loss=0.08, step=68000]
Valid: 100% 6944/6944 [00:01<00:00, 3472.44 uttr/s, accuracy=0.86, loss=0.62]
Train: 100% 2000/2000 [00:46<00:00, 43.21 step/s, accuracy=0.94, loss=0.25, step=7e+4]
Valid: 100% 6944/6944 [00:01<00:00, 3507.92 uttr/s, accuracy=0.86, loss=0.64]
Train:   0% 0/2000 [00:00<?, ? step/s]


Step 70000, best model saved. (accuracy=0.8674)


# Inference

## Dataset of inference

In [25]:
import os
import json
import torch
from pathlib import Path
from torch.utils.data import Dataset


class InferenceDataset(Dataset):
  def __init__(self, data_dir):
    testdata_path = Path(data_dir) / "testdata.json"
    metadata = json.load(testdata_path.open())
    self.data_dir = data_dir
    self.data = metadata["utterances"]

  def __len__(self):
    return len(self.data)

  def __getitem__(self, index):
    utterance = self.data[index]
    feat_path = utterance["feature_path"]
    mel = torch.load(os.path.join(self.data_dir, feat_path))

    return feat_path, mel


def inference_collate_batch(batch):
  """Collate a batch of data."""
  feat_paths, mels = zip(*batch)

  return feat_paths, torch.stack(mels)


## Main function of inference

In [26]:
import json
import csv
from pathlib import Path
from tqdm.notebook import tqdm

import torch
from torch.utils.data import DataLoader

def parse_args():
  """arguments"""
  config = {
    "data_dir": "./Dataset",
    "model_path": "./model.ckpt",
    "output_path": "./output.csv",
  }

  return config


def main(
  data_dir,
  model_path,
  output_path,
):
  """Main function."""
  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  print(f"[Info]: Use {device} now!")

  mapping_path = Path(data_dir) / "mapping.json"
  mapping = json.load(mapping_path.open())

  dataset = InferenceDataset(data_dir)
  dataloader = DataLoader(
    dataset,
    batch_size=1,
    shuffle=False,
    drop_last=False,
    num_workers=8,
    collate_fn=inference_collate_batch,
  )
  print(f"[Info]: Finish loading data!",flush = True)

  speaker_num = len(mapping["id2speaker"])
  model = Classifier(n_spks=speaker_num).to(device)
  model.load_state_dict(torch.load(model_path, map_location=device))
  model.eval()
  print(f"[Info]: Finish creating model!",flush = True)

  results = [["Id", "Category"]]
  for feat_paths, mels in tqdm(dataloader):
    with torch.no_grad():
      mels = mels.to(device)
      outs = model(mels)
      preds = outs.argmax(1).cpu().numpy()
      for feat_path, pred in zip(feat_paths, preds):
        results.append([feat_path, mapping["id2speaker"][str(pred)]])

  with open(output_path, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(results)


if __name__ == "__main__":
  main(**parse_args())


[Info]: Use cuda now!
[Info]: Finish loading data!
[Info]: Finish creating model!


  0%|          | 0/6000 [00:00<?, ?it/s]