# Train & Align NN AM with Win-to-MFCC and simple speaker adaptation
Repeatedly re-align the phone label sequence while training the phone model.
To avoid a proliferation of the more frequent phones (mostly silence), we reduce the b() probabilities of frequent phones during re-alignment. We use 3 states per phone.
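
The frequency-based correction can be illustrated with a minimal sketch. This is not the actual `b_log_corrections` from `acmodel`: the helper name `log_frequency_corrections`, the `floor` parameter, and the exact formula (adding the negative log of each symbol's relative frequency, which amounts to dividing by the prior) are assumptions made for illustration.

```python
import math
from collections import Counter

def log_frequency_corrections(targets, b_set, floor=1e-6):
    """Return one additive log-domain correction per symbol in b_set.

    Hypothetical helper: the correction is -log(relative frequency), so
    frequent symbols (typically silence) receive the smallest bonus when
    the correction is added to log b() during alignment.
    """
    counts = Counter(targets)
    total = max(sum(counts[s] for s in b_set), 1)  # avoid division by zero
    corrections = []
    for s in b_set:
        rel_freq = max(counts[s] / total, floor)   # floor avoids log(0) for unseen symbols
        corrections.append(-math.log(rel_freq))
    return corrections

# Toy alignment: 7 silence frames, 2 frames of "a", 1 frame of "b"
b_set = "ab|"
targets = "||||aab|||"
corr = log_frequency_corrections(targets, b_set)
```

In this toy case silence gets the smallest correction and the rare `b` the largest, so the aligner is less tempted to label everything as silence.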

We use simple speaker adaptation by making the average cepstra over the recording visible as NN inputs (a very simple i-vector-like approach). The MFCC frames are split into 4 groups according to log-energy: first into above-average and below-average frames, then each group is split in two again the same way. Future versions might compute MFCC averages for key frequent phones.
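
The two-level energy split can be sketched in plain Python as follows. This is an illustration only, not the `mfcc_make_speaker_vector` used later in the notebook: the function names are hypothetical, and it assumes that coefficient 0 of each frame carries the log-energy and that an empty group contributes zeros.

```python
def mean_vector(frames, dim):
    """Average a list of equal-length frames; zeros if the list is empty."""
    if not frames:
        return [0.0] * dim
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]

def split_by_energy(frames):
    """Split frames into (below-average, above-average) by coefficient 0."""
    if not frames:
        return [], []
    avg = sum(f[0] for f in frames) / len(frames)
    low = [f for f in frames if f[0] < avg]
    high = [f for f in frames if f[0] >= avg]
    return low, high

def make_speaker_vector(frames):
    """Concatenate the mean cepstra of 4 energy bands (needs >= 1 frame)."""
    dim = len(frames[0])
    low, high = split_by_energy(frames)
    vector = []
    for half in (low, high):
        for band in split_by_energy(half):
            vector.extend(mean_vector(band, dim))
    return vector

# Toy recording: 4 frames with 2 coefficients each (energy, one cepstrum)
toy_frames = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
speaker_vector = make_speaker_vector(toy_frames)
```

With 13 coefficients per frame, a vector built this way has 4 × 13 = 52 values, which matches the `4*13` term added to the input size later on.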

We iterate in the Baum-Welch style, but no GMMs are used: we start directly with a NN, which recovers from the poor initial alignment very quickly.
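
The re-alignment step of such a loop can be sketched as a Viterbi forced alignment. The code below is a simplified illustration and not the `align_hmm` function used in this notebook: it assumes a strictly left-to-right state string without pronunciation variants, no transition probabilities, and per-frame log-probabilities given as plain dicts. The names `force_align` and `toy_emission` are hypothetical.

```python
import math

def force_align(states, frame_logprobs):
    """Viterbi alignment of a strictly left-to-right state sequence.

    states: string of state symbols, e.g. "|a|"
    frame_logprobs: one dict per frame mapping symbol -> log b(symbol | frame)
    Returns one state symbol per frame; every state is visited at least once.
    """
    n_frames, n_states = len(frame_logprobs), len(states)
    if n_frames < n_states:
        raise ValueError("fewer frames than states")
    neg_inf = float("-inf")
    # score[t][s]: best log-score of any path that is in state s at frame t
    score = [[neg_inf] * n_states for _ in range(n_frames)]
    advanced = [[False] * n_states for _ in range(n_frames)]
    score[0][0] = frame_logprobs[0][states[0]]
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s]
            advance = score[t - 1][s - 1] if s > 0 else neg_inf
            best = max(stay, advance)
            if best == neg_inf:
                continue  # state s is not reachable at frame t
            advanced[t][s] = advance > stay
            score[t][s] = best + frame_logprobs[t][states[s]]
    # Backtrack from the last state at the last frame
    s = n_states - 1
    path = []
    for t in range(n_frames - 1, -1, -1):
        path.append(states[s])
        if advanced[t][s]:
            s -= 1
    return "".join(reversed(path))

def toy_emission(best):
    """Toy log-probabilities: 0.8 for the favoured symbol, 0.1 for the others."""
    return {sym: math.log(0.8 if sym == best else 0.1) for sym in "|ab"}

toy_frames_lp = [toy_emission(c) for c in "|aa||"]
alignment = force_align("|a|", toy_frames_lp)  # "|aa||"
```

In the training loop, the per-frame log-probabilities would come from the NN outputs, and the resulting alignment would become the targets for the next round of training.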

This notebook expects CommonVoice files converted to WAV and an `initial_train.tsv` (made from `train.tsv`) that looks like this:
```
wav	sentence
/some/path/common_voice_cs_23896695.wav	Ve srovnání s jinými sýry je téměř bez zápachu.
/some/path/common_voice_cs_23896696.wav	Na stavbě se podíleli příslušníci sedmnácti národů.
...
```

## Config

In [1]:
infile = 'initial_train.tsv' # no phone targets in this tsv yet
sideview = 9 # how many additional MFCC frames before and after the focus point are seen
mid_size = 100 # width of the hidden layers
filename_base_base = "default_training" # prefix for the saved .pth and .tsv files

In [2]:
import sys
if sys.path[0] != '..':
    sys.path[0:0] = ['..'] # prepend main Prak directory
from prongen.hmm_pron import *
from acmodel.matrix import * # will be removed, mostly old experiments not needed anymore
from acmodel.praat_ifc import *
from acmodel.hmm_acmodel import *

In [3]:
import torch # imported explicitly rather than relying on the star imports above
device = "cuda" if torch.cuda.is_available() else "cpu"
#device = "cpu"
print(f"Using {device} device") # FIXME: does not really use GPU anymore, some tiny glitch...
from acmodel.nn_acmodel import *

Using cuda device


In [4]:
#b_set = b123_set # uncomment to use untied tristate models of phones (except silence)

## Get training data

In [5]:
import pandas as pd

In [6]:
df = pd.read_csv(infile, sep="\t", keep_default_na=False)
hmms = get_training_hmms(infile, derivatives=0)

In [7]:
df

Unnamed: 0,wav,sentence
0,/data4T/commonvoice/cv-corpus-7.0-2021-07-21/c...,Ve srovnání s jinými sýry je téměř bez zápachu.
1,/data4T/commonvoice/cv-corpus-7.0-2021-07-21/c...,Na stavbě se podíleli příslušníci sedmnácti ná...
2,/data4T/commonvoice/cv-corpus-7.0-2021-07-21/c...,"Děkuji vám, vaše sdělení je jasné."
3,/data4T/commonvoice/cv-corpus-7.0-2021-07-21/c...,Při následné záchranné operaci byl zabit i pil...
4,/data4T/commonvoice/cv-corpus-7.0-2021-07-21/c...,Občasně konzumuje i větší hmyz a výjimečně i r...
...,...,...
10751,/data4T/commonvoice/cv-corpus-7.0-2021-07-21/c...,Celkově tyto změny v signalizaci negativně ovl...
10752,/data4T/commonvoice/cv-corpus-7.0-2021-07-21/c...,Zvyk je tendence vykonávat za určitých okolnos...
10753,/data4T/commonvoice/cv-corpus-7.0-2021-07-21/c...,Jeho žena Marie byla mladší sestra spisovatele...
10754,/data4T/commonvoice/cv-corpus-7.0-2021-07-21/c...,Za stejnou roli získal i Oscara.


In [8]:
for hmm in hmms:
    triple_hmm_states(hmm) # Upgrade to 3 states per phone (just for duration, b() is still shared)
    #multiply_hmm_states(hmm)
    #triple_hmm_states(hmm, untied=True) # use 3 different states for all nonsilent phones

In [9]:
def create_start_targets(hmm):
    """
    Create mostly fictional targets for a direct launch of the NN training.
    Except for some silence at the beginning/end, most targets will be wrong.
    We just take the b string as it is (with triple states), even with
    variants (!), and put it in the middle of the training data, with silence
    around it.
    """
    states = len(hmm.b)
    frames = hmm.mfcc.size()[0]
    before = (frames-states)//2
    after = frames-before-states
    hmm.targets = '|'*before+hmm.b+'|'*after

In [10]:
for hmm in hmms:
    create_start_targets(hmm)

In [11]:
hmms[0].targets

'||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||vvveee|||sssrrrooovvvnnnáááňňňýýý|||sssjjjyyynnnýýýmmmyyy|||sssýýýrrryyy|||jjjeee|||tttééémmmňňňeeeŘŘŘ|||řřřbbbeeezzzzzzááápppaaaHHHuuu|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||'
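
The shape of the string above can be reproduced on a toy example. The sketch below works on plain strings only, whereas the notebook's `triple_hmm_states` and `create_start_targets` operate on HMM objects; the names `triple_states` and `start_targets` are hypothetical stand-ins.

```python
def triple_states(b):
    """Repeat every state symbol three times (tied 3-state phones)."""
    return "".join(ch * 3 for ch in b)

def start_targets(b, frames, silence="|"):
    """Centre the state string b within `frames` frames, padding with silence."""
    before = (frames - len(b)) // 2
    after = frames - before - len(b)
    return silence * before + b + silence * after

tripled = triple_states("ve|s")            # "vvveee|||sss"
toy_targets = start_targets(tripled, 20)   # "||||vvveee|||sss||||"
```

Note that when an utterance has fewer frames than states, both padding counts are non-positive and Python string repetition yields empty strings, so the result is longer than the number of frames. Such utterances would need separate handling.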

In [12]:
df['targets'] = [hmm.targets for hmm in hmms]
tsv_zero = filename_base_base+"_0000.tsv"
df.to_csv(tsv_zero, sep="\t", index=False) # artificial start targets

In [13]:
#b_log_corr = b_log_corrections(infile) # get b() corrections based on frequency
b_log_corr = b_log_corrections(tsv_zero, b_set=b_set) # get b() corrections based on frequency
#b_log_corr = torch.tensor([0]*len(b_set))

In [14]:
len(b_set)

45

In [15]:
all_mfcc, all_targets = collect_training_material(hmms)

out_size = len(b_set)
in_size = hmms[0].mfcc.size(1)

" ".join(b_set), out_size, in_size

('? A E G H N O Z a b c d e f g h j k l m n o p r s t u v y z | á é ó ú ý č ď ň Ř ř š ť Ž ž',
 45,
 13)

## Add speaker vectors (mean cepstra in 4 energy bands)

In [16]:
# Make s-vectors
all_speaker_vectors_refs = []
for hmm in hmms:
    hmm.speaker_vector = mfcc_make_speaker_vector(hmm.mfcc)
    ref = hmm.speaker_vector.to(device)
    all_speaker_vectors_refs += [ref]*len(hmm.mfcc)

## Changes for the Window-to-MFCC input

In [17]:
in_size = hmms[0].mfcc.size(1) * (sideview+1+sideview) + 4*13 # added s-vector
in_size

299
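
The value 299 follows from the window and speaker vector sizes. A short worked check, using the values shown in this notebook (13 MFCC coefficients, `sideview = 9`, 4 energy bands); the variable names other than `sideview` and `in_size` are chosen here for illustration:

```python
n_mfcc = 13          # coefficients per frame, hmms[0].mfcc.size(1) in the notebook
sideview = 9         # frames seen on each side of the focus frame
n_energy_bands = 4   # groups in the speaker vector

window_frames = sideview + 1 + sideview       # 19 frames in the window
window_size = n_mfcc * window_frames          # 247 features from the window
speaker_size = n_energy_bands * n_mfcc        # 52 features from the speaker vector
in_size = window_size + speaker_size          # 299 inputs in total
```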

In [18]:
# for alignment decoding, change mfcc in all hmms (for training, we already have a copy)
# NOTE: Make speaker vectors BEFORE this!
for hmm in hmms:
    hmm.mfcc = mfcc_win_view(mfcc_add_sideview(hmm.mfcc, sideview=sideview), sideview=sideview)
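
What a windowed view of the MFCC frames looks like can be sketched in plain Python. This is not the implementation of `mfcc_add_sideview` or `mfcc_win_view`: the function name `stack_window` is hypothetical, and padding the edges by repeating the first and last frame and appending the speaker vector to every row are assumptions, since the notebook does not show how those helpers behave.

```python
def stack_window(frames, sideview, speaker_vector=()):
    """Return one flat feature list per frame.

    Each row holds the 2*sideview+1 frames centred on the focus frame,
    followed by the (optional) speaker vector.
    """
    last = len(frames) - 1
    rows = []
    for t in range(len(frames)):
        row = []
        for offset in range(-sideview, sideview + 1):
            idx = min(max(t + offset, 0), last)  # clamp: repeats the edge frames
            row.extend(frames[idx])
        row.extend(speaker_vector)
        rows.append(row)
    return rows

# Toy input: 4 frames with 1 coefficient each, 1 frame of context on each side
toy_rows = stack_window([[0.0], [1.0], [2.0], [3.0]], sideview=1, speaker_vector=[9.0])
```

Each row then has `n_mfcc * (2 * sideview + 1)` window features plus the speaker vector, which is the layout the input size computed above assumes.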

## Setup training

In [19]:
model = NeuralNetwork(in_size, out_size, mid_size).to(device) # 50 20 100=svec 50=sv50
print(model)

NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=299, out_features=100, bias=True)
    (1): ReLU()
    (2): Linear(in_features=100, out_features=100, bias=True)
    (3): ReLU()
    (4): Linear(in_features=100, out_features=100, bias=True)
    (5): ReLU()
    (6): Linear(in_features=100, out_features=45, bias=True)
  )
)


In [20]:
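import torch.nn as nn # imported explicitly rather than relying on the star imports
import torch.optim as optim
from torch.utils.data import DataLoader # used in the training loop below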

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

training_data = SpeechDataset(all_mfcc, all_targets, b_set, sideview=sideview, speaker_vectors=all_speaker_vectors_refs) # initial alignment

In [None]:
for mega_epoch in range(1, 50): # starting from 1 as we have zero tsv
    print(f"======= Train {filename_base_base}, Epoch {mega_epoch} ========")

    all_targets = "".join([hmm.targets for hmm in hmms])  # collect alignments
    training_data.all_targets = all_targets  # just update the object with new targets

    train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True) # new dataloader for this alignment

    train_n_epochs(train_dataloader, optimizer, model, criterion, min(mega_epoch, 5)) # at start, align more often

    filename_base = f"{filename_base_base}_{mega_epoch:04d}"
    torch.save(model.state_dict(), filename_base+".pth")

    #break

    print('Interrupted training for re-alignment...')

    model.eval() # switch to evaluation mode


    for idx, hmm in enumerate(hmms):
        if idx%1000==0:
            print(f"Align {idx}")   
        alp = align_hmm(hmm, model, b_set, b_log_corr=b_log_corr, group_tripled=True)
        #alp = align_hmm(hmm, model, b_set, b_log_corr=b_log_corr*1.0, group_tripled=False)


    df['targets'] = [hmm.targets for hmm in hmms]

    df.to_csv(filename_base+".tsv", sep="\t", index=False)
    
    b_log_corr = b_log_corrections(filename_base+".tsv", b_set=b_set) # get new b() corrections based on frequency

[1, 20000] loss: 1.479
[1, 40000] loss: 1.463
[1, 60000] loss: 1.452
Interrupted training for re-alignment...
Align 0
Align 1000
Align 2000
Align 3000
Align 4000
Align 5000
Align 6000
Align 7000
Align 8000
Align 9000
Align 10000
[1, 20000] loss: 1.662
[1, 40000] loss: 1.614
[1, 60000] loss: 1.596
[2, 20000] loss: 1.571
[2, 40000] loss: 1.565
[2, 60000] loss: 1.561
Interrupted training for re-alignment...
Align 0
Align 1000
Align 2000
Align 3000
Align 4000
Align 5000
Align 6000
Align 7000
Align 8000
Align 9000
Align 10000
[1, 20000] loss: 1.154
[1, 40000] loss: 1.133
[1, 60000] loss: 1.128
[2, 20000] loss: 1.114
[2, 40000] loss: 1.112
[2, 60000] loss: 1.110
[3, 20000] loss: 1.100
[3, 40000] loss: 1.101
[3, 60000] loss: 1.101
Interrupted training for re-alignment...
Align 0
Align 1000
Align 2000
Align 3000
Align 4000
Align 5000
Align 6000
Align 7000
Align 8000
Align 9000
Align 10000
[1, 20000] loss: 0.891
[1, 40000] loss: 0.884
[1, 60000] loss: 0.879
[2, 20000] loss: 0.870
[2, 40000] los

[1, 20000] loss: 0.563
[1, 40000] loss: 0.566
[1, 60000] loss: 0.569
[2, 20000] loss: 0.563
[2, 40000] loss: 0.565
[2, 60000] loss: 0.567
[3, 20000] loss: 0.563
[3, 40000] loss: 0.566
[3, 60000] loss: 0.567
[4, 20000] loss: 0.563
[4, 40000] loss: 0.566
[4, 60000] loss: 0.565
[5, 20000] loss: 0.563
[5, 40000] loss: 0.564
[5, 60000] loss: 0.565
Interrupted training for re-alignment...
Align 0
Align 1000
Align 2000
Align 3000
Align 4000
Align 5000
Align 6000
Align 7000
Align 8000
Align 9000
Align 10000
[1, 20000] loss: 0.564
[1, 40000] loss: 0.565
[1, 60000] loss: 0.564
[2, 20000] loss: 0.563
[2, 40000] loss: 0.564
[2, 60000] loss: 0.566
[3, 20000] loss: 0.563
[3, 40000] loss: 0.564
[3, 60000] loss: 0.566
[4, 20000] loss: 0.562
[4, 40000] loss: 0.565
[4, 60000] loss: 0.564
[5, 20000] loss: 0.563
[5, 40000] loss: 0.565
[5, 60000] loss: 0.564
Interrupted training for re-alignment...
Align 0
Align 1000
Align 2000
Align 3000
Align 4000
Align 5000
Align 6000
Align 7000
Align 8000
Align 9000
Al