# Sources

<ul>
    <li><a href='https://github.com/neychev/small_DL_repo/tree/master/datasets/Multi30k'>Multi30k Datasets</a></li>
    <li><a href='https://commonvoice.mozilla.org/en/datasets'>German Audio Datase</a></li>
    <li><a href='https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-german'>Pretrained German SSR</a></li>
    <li><a href='https://huggingface.co/oliverguhr/spelling-correction-german-base'>Spelling Correcter</a></li>
</ul>

# Some Comments

Model was build with ChatGPT's help, a many-many-many tutorials that I simply can't include all here, but will pick the most important.

And finally, with the help of my poor laptop that went through agony, pain, 43 kernel crashes, 3 blue screens, and days of debugging

# Tutorials Used

<ol>
    <li>
        <a href='https://towardsdatascience.com/build-your-own-transformer-from-scratch-using-pytorch-84c850470dcb'>
            Build your own Transformer from scratch using Pytorch
        </a>
    </li>
    <li>
        <a href='https://pytorch.org/audio/stable/tutorials/speech_recognition_pipeline_tutorial.html'>
            SPEECH RECOGNITION WITH WAV2VEC2
        </a>
    </li>
    <li>
        <a href='https://medium.com/mlearning-ai/build-speech-to-text-model-from-scratch-580bc2c107a'>
            Build Speech to Text Model from Scratch
        </a>
    </li>
    <li>
        <a href='https://www.youtube.com/watch?v=U0s0f995w14&ab_channel=AladdinPersson'>
            Pytorch Transformers from Scratch (Attention is all you need) | VIDEO
        </a>
    </li>
<ol>

# Model Architecture

<div style='display: flex; justify-content:center;'>
    <img src='./architecture.jpg' width='90%'/>
</div>

# Imports

In [1]:
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import pandas as pd
import torchaudio
import torch
from transformers import pipeline
import torch
import torch.nn as nn
import torch.optim as optim
import spacy
from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator
from tqdm import tqdm
import numpy as np




# Utility Variables

In [2]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
batch_size = 32

# Utility Functions 

In [93]:
def speech_file_to_waveform(path):
    waveform, sample_rate = torchaudio.load(path,format = path.split(".")[-1])
    if sample_rate != 16000:
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

    return waveform

def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    
    if sampling_rate != 16000:
        speech_array = torchaudio.functional.resample(speech_array, sampling_rate, 16000)

    batch["speech"] = speech_array[0].numpy()
    batch["sentence"] = batch["sentence"].lower()
    return batch

def predict_speech(inputs, model, processor):
    inputs = processor(inputs, sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to(device), attention_mask=inputs.attention_mask.to(device)).logits

    pred_ids = torch.argmax(logits, dim=-1)[0]
    return processor.decode(pred_ids)

def fix_spelling(sentence,spelling_model):
    prefix = 'correct: '
    sentence = prefix + sentence
    corrected = spelling_model(sentence,max_length=256)
    return corrected[0]['generated_text']

def load_checkpoint(checkpoint, model, optimizer):
    print("=> Loading checkpoint")
    model.load_state_dict(checkpoint["state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer"])

def save_checkpoint(state, filename="my_checkpoint.pth.tar"):
    print("=> Saving checkpoint")
    torch.save(state, filename)

def translate_sentence(model, sentence, german, english, device, tokenizer, max_length=50):
    if type(sentence) == str:
        tokens = [token.text.lower() for token in tokenizer(sentence)]
    else:
        tokens = [token.lower() for token in sentence]

    tokens.insert(0, german.init_token)
    tokens.append(german.eos_token)

    text_to_indices = [german.vocab.stoi[token] for token in tokens]

    sentence_tensor = torch.LongTensor(text_to_indices).unsqueeze(1).to(device)

    outputs = [english.vocab.stoi["<sos>"]]
    
    for i in range(max_length):
        trg_tensor = torch.LongTensor(outputs).unsqueeze(1).to(device)

        with torch.no_grad():
            output = model(sentence_tensor, trg_tensor)

        best_guess = output.argmax(2)[-1, :].item()
        outputs.append(best_guess)

        if best_guess == english.vocab.stoi["<eos>"]:
            break

    translated_sentence = [english.vocab.itos[idx] for idx in outputs]
    # remove start token
    return translated_sentence[1:]

def remove_tokens(sentence):
    special_tokens = ['<sos>','<eos>','<unk>','<pad>']
    sent = []
    for token in sentence:
        if token not in special_tokens:
            sent.append(token)
            
    return sent

def filter(sent, vocab,tokenizer):
    sent = tokenizer(sent)
    for token in sent:
        if token not in vocab:
            return False
        
    return True

# Data

In [55]:
df = pd.read_csv('../data/de/validated.tsv', sep='\t')
df = df[['path', 'sentence']]
df['path'] = '../data/de/clips/' + df['path'].astype(str)
df['sentence'] = df['sentence'].apply(lambda x: x.lower())

In [9]:
spacy_de = spacy.load("de_core_news_sm")
spacy_eng = spacy.load("en_core_web_sm")

def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)]


def tokenize_eng(text):
    return [tok.text for tok in spacy_eng.tokenizer(text)]


german = Field(tokenize=tokenize_de, lower=True, init_token="<sos>", eos_token="<eos>")

english = Field(
    tokenize=tokenize_eng, lower=True, init_token="<sos>", eos_token="<eos>"
)

train_data, valid_data, test_data = Multi30k.splits(
    exts=(".de", ".en"), fields=(german, english)
)

german.build_vocab(train_data, max_size=10000, min_freq=2)
english.build_vocab(train_data, max_size=10000, min_freq=2)

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size=batch_size,
    sort_within_batch=True,
    sort_key=lambda x: len(x.src),
    device=device,
)

In [50]:
df['sentence'] = df['sentence'].apply(tokenize_de)

In [51]:
df

Unnamed: 0,path,sentence
0,../data/de/clips/common_voice_de_38356033.mp3,"[Ebenso, sind, Aktivitäten, bezüglich, der, Qu..."
1,../data/de/clips/common_voice_de_38211356.mp3,"[Ich, werde, Ihnen, das, sofort, beweisen, .]"
2,../data/de/clips/common_voice_de_38422759.mp3,"[Die, M]"
3,../data/de/clips/common_voice_de_38194475.mp3,"[Sie, selbst, tritt, dann, jeweils, nur, in, e..."
4,../data/de/clips/common_voice_de_38194477.mp3,"[Er, wohnt, in, Sigulda, .]"
...,...,...
6862,../data/de/clips/common_voice_de_38436731.mp3,"[Die, wirtschaftliche, Radikalkur, hat, sozial..."
6863,../data/de/clips/common_voice_de_38436745.mp3,"[Direkt, im, Anschluss, unterschrieb, er, eine..."
6864,../data/de/clips/common_voice_de_38436747.mp3,"[Was, ich, in, Zukunft, verhindert, sehen, möc..."
6865,../data/de/clips/common_voice_de_38436748.mp3,"[Während, dieses, Aufenthalts, starb, er, an, ..."


# Speech To Text (German)

In [10]:
model_name = "jonatasgrosman/wav2vec2-large-xlsr-53-german"
processor = Wav2Vec2Processor.from_pretrained(model_name)
acoustic_model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)

Some weights of the model checkpoint at jonatasgrosman/wav2vec2-large-xlsr-53-german were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_v', 'wav2vec2.encoder.pos_conv_embed.conv.weight_g']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at jonatasgrosman/wav2vec2-large-xlsr-53-german and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You

# Spelling Correction

In [11]:
spelling_model = pipeline("text2text-generation",model="oliverguhr/spelling-correction-german-base")

# Translator (German - English)

In [12]:
class Transformer(nn.Module):
    def __init__(
        self,
        embedding_size,
        src_vocab_size,
        trg_vocab_size,
        src_pad_idx,
        num_heads,
        num_encoder_layers,
        num_decoder_layers,
        forward_expansion,
        dropout,
        max_len,
        device,
    ):
        super(Transformer, self).__init__()
        self.src_word_embedding = nn.Embedding(src_vocab_size, embedding_size)
        self.src_position_embedding = nn.Embedding(max_len, embedding_size)
        self.trg_word_embedding = nn.Embedding(trg_vocab_size, embedding_size)
        self.trg_position_embedding = nn.Embedding(max_len, embedding_size)

        self.device = device
        self.transformer = nn.Transformer(
            embedding_size,
            num_heads,
            num_encoder_layers,
            num_decoder_layers,
            forward_expansion,
            dropout,
        )
        self.fc_out = nn.Linear(embedding_size, trg_vocab_size)
        self.dropout = nn.Dropout(dropout)
        self.src_pad_idx = src_pad_idx

    def make_src_mask(self, src):
        src_mask = src.transpose(0, 1) == self.src_pad_idx

        # (N, src_len)
        return src_mask.to(self.device)

    def forward(self, src, trg):
        src_seq_length, N = src.shape
        trg_seq_length, N = trg.shape

        src_positions = (
            torch.arange(0, src_seq_length)
            .unsqueeze(1)
            .expand(src_seq_length, N)
            .to(self.device)
        )

        trg_positions = (
            torch.arange(0, trg_seq_length)
            .unsqueeze(1)
            .expand(trg_seq_length, N)
            .to(self.device)
        )
        src_word_embedding = self.src_word_embedding(src)
        src_position_embedding = self.src_position_embedding(src_positions)

        trg_word_embedding = self.trg_word_embedding(trg)
        trg_position_embedding = self.trg_position_embedding(trg_positions)

        embed_src = self.dropout(src_word_embedding + src_position_embedding)
        embed_trg = self.dropout(trg_word_embedding + trg_position_embedding)

        src_padding_mask = self.make_src_mask(src)
        trg_mask = self.transformer.generate_square_subsequent_mask(trg_seq_length).to(
            self.device
        )

        out = self.transformer(
            embed_src,
            embed_trg,
            src_key_padding_mask=src_padding_mask,
            tgt_mask=trg_mask,
        )
        out = self.fc_out(out)
        return out

In [13]:
load_model = True
learning_rate = 3e-4

src_vocab_size = len(german.vocab)
trg_vocab_size = len(english.vocab)
embedding_size = 512
num_heads = 8
num_encoder_layers = 3
num_decoder_layers = 3
dropout = 0.10
max_len = 100
forward_expansion = 4
src_pad_idx = english.vocab.stoi["<pad>"]

In [14]:
model = Transformer(
    embedding_size,
    src_vocab_size,
    trg_vocab_size,
    src_pad_idx,
    num_heads,
    num_encoder_layers,
    num_decoder_layers,
    forward_expansion,
    dropout,
    max_len,
    device,
).to(device)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)



In [15]:
load_checkpoint(torch.load("my_checkpoint.pth.tar"), model, optimizer)

=> Loading checkpoint


### Training for above model

In [None]:
num_epochs = 10

optimizer = optim.Adam(model.parameters(), lr=learning_rate)

scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.1, patience=10, verbose=True
)

if load_model:
    load_checkpoint(torch.load("my_checkpoint.pth.tar"), model, optimizer)


pad_idx = english.vocab.stoi["<pad>"]
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)


sentence = "ein pferd geht unter einer brücke neben einem boot."

def decode_sentence(sentence,vocab):
    # decode from int to string
    return [vocab.itos[idx] for idx in sentence]

from tqdm import tqdm

best_valid_loss = float("inf")

for epoch in tqdm(range(num_epochs), total=num_epochs, unit="epoch", desc="Epoch"):
    model.train()
    losses = []

    for batch_idx, batch in enumerate(train_iterator):
        inp_data = batch.src.to(device)
        target = batch.trg.to(device)


        output = model(inp_data, target[:-1, :])

        output = output.reshape(-1, output.shape[2])
        target = target[1:].reshape(-1)

        optimizer.zero_grad()

        loss = criterion(output, target)
        losses.append(loss.item())

        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1)

        optimizer.step()

    mean_loss = sum(losses) / len(losses)
    scheduler.step(mean_loss)

    print(f"Mean loss at epoch {epoch} is {mean_loss:.5f}")

    model.eval()
    
    val_loss = []

    for batch_idx, batch in enumerate(valid_iterator):
        inp_data = batch.src.to(device)
        target = batch.trg.to(device)

        output = model(inp_data, target[:-1, :])

        output = output.reshape(-1, output.shape[2])
        target = target[1:].reshape(-1)

        loss = criterion(output, target)
        val_loss.append(loss.item())

    print(f"Validation loss at epoch {epoch} is {loss:.5f}")


    avg_val_loss = sum(val_loss) / len(val_loss)

    if avg_val_loss < best_valid_loss:
        print(f"Saving model at epoch {epoch}")
        print(f"Validation loss decreased from {best_valid_loss:.5f} to {avg_val_loss:.5f}")
        best_valid_loss = avg_val_loss
        checkpoint = {
            "state_dict": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        }
        save_checkpoint(checkpoint)

# Final Model

In [94]:
filtered = df[df['sentence'].apply(lambda x: filter(x, tokenizer=tokenize_de, vocab=german.vocab.stoi))]

In [95]:
sampled = filtered

In [82]:
sampled = sampled.apply(speech_file_to_array_fn, axis=1)

In [83]:
sampled['predicted'] = sampled['speech'].apply(lambda x: predict_speech(x, acoustic_model, processor))

In [84]:
tqdm.pandas()
sampled['predicted_fixed'] = sampled['predicted'].progress_apply(lambda x: fix_spelling(x,spelling_model))

100%|██████████| 155/155 [03:00<00:00,  1.16s/it]


In [85]:
sampled

Unnamed: 0,path,sentence,speech,predicted,predicted_fixed
2,../data/de/clips/common_voice_de_38422759.mp3,die m,"[1.33550524e-14, 2.0671935e-13, 1.0357326e-13,...",die m,Die M.
22,../data/de/clips/common_voice_de_38272068.mp3,was genau soll das sein?,"[-3.4281543e-13, -4.872646e-14, 1.5407771e-12,...",was genau soll das sein,Was genau soll das sein?
78,../data/de/clips/common_voice_de_38174091.mp3,das kann so nicht bleiben.,"[-9.74852e-12, -9.262564e-12, 2.2325806e-11, -...",das kanns so nicht bleiben,Das kann so nicht bleiben.
80,../data/de/clips/common_voice_de_38174095.mp3,dabei wurde ein bauarbeiter getötet und drei w...,"[3.5853782e-15, -3.357858e-16, -1.2464322e-14,...",dabei wurde ein bauarbeiter getötet und drei w...,Dabei wurde ein Bauarbeiter getötet und drei w...
218,../data/de/clips/common_voice_de_38181894.mp3,im dorf befindet sich auch eine fabrik.,"[2.657345e-15, -1.9893757e-15, 6.4312305e-16, ...",im dorf befindet sich auch eine fabrik,Im Dorf befindet sich auch eine Fabrik.
...,...,...,...,...,...
6682,../data/de/clips/common_voice_de_38395163.mp3,da wäre ich sehr traurig.,"[-6.2494054e-16, 4.5298488e-15, 6.5760656e-15,...",da wäre ich sehr traurig,Da wäre ich sehr traurig.
6716,../data/de/clips/common_voice_de_38395204.mp3,ihre kleinen augen stehen weit oben am kopf.,"[1.7646292e-14, 4.405076e-14, 7.24041e-14, 2.1...",ihre kleinen augen stehen weit oben am kopf,Ihre kleinen Augen stehen weit oben am Kopf.
6833,../data/de/clips/common_voice_de_38436640.mp3,genau das ist richtig!,"[2.5489205e-12, 9.356387e-12, 4.0864946e-12, 1...",genau das ist richtig,Genau das ist richtig.
6834,../data/de/clips/common_voice_de_38436641.mp3,drei hunde sind einfach zu viele.,"[2.385959e-13, -7.5150205e-13, -1.7912293e-12,...",drei hunde sind einfach zu viele,Drei Hunde sind einfach zu viele.


In [86]:
tqdm.pandas()
sampled['translated'] = sampled['predicted_fixed'].progress_apply(lambda x: translate_sentence(model, x, german, english, device, spacy_de, max_length=50))

100%|██████████| 155/155 [00:22<00:00,  6.94it/s]


In [87]:
sampled['translated'] = sampled['translated'].apply(lambda x: ' '.join(x).strip())

In [88]:
sampled

Unnamed: 0,path,sentence,speech,predicted,predicted_fixed,translated
2,../data/de/clips/common_voice_de_38422759.mp3,die m,"[1.33550524e-14, 2.0671935e-13, 1.0357326e-13,...",die m,Die M.,the <unk> is <unk> . <eos>
22,../data/de/clips/common_voice_de_38272068.mp3,was genau soll das sein?,"[-3.4281543e-13, -4.872646e-14, 1.5407771e-12,...",was genau soll das sein,Was genau soll das sein?,a person is looking at his board . <eos>
78,../data/de/clips/common_voice_de_38174091.mp3,das kann so nicht bleiben.,"[-9.74852e-12, -9.262564e-12, 2.2325806e-11, -...",das kanns so nicht bleiben,Das kann so nicht bleiben.,the <unk> is using n't tools . <eos>
80,../data/de/clips/common_voice_de_38174095.mp3,dabei wurde ein bauarbeiter getötet und drei w...,"[3.5853782e-15, -3.357858e-16, -1.2464322e-14,...",dabei wurde ein bauarbeiter getötet und drei w...,Dabei wurde ein Bauarbeiter getötet und drei w...,there is three men wearing a construction and ...
218,../data/de/clips/common_voice_de_38181894.mp3,im dorf befindet sich auch eine fabrik.,"[2.657345e-15, -1.9893757e-15, 6.4312305e-16, ...",im dorf befindet sich auch eine fabrik,Im Dorf befindet sich auch eine Fabrik.,a village is in the shower of a factory . <eos>
...,...,...,...,...,...,...
6682,../data/de/clips/common_voice_de_38395163.mp3,da wäre ich sehr traurig.,"[-6.2494054e-16, 4.5298488e-15, 6.5760656e-15,...",da wäre ich sehr traurig,Da wäre ich sehr traurig.,there is some work very high - war . <eos>
6716,../data/de/clips/common_voice_de_38395204.mp3,ihre kleinen augen stehen weit oben am kopf.,"[1.7646292e-14, 4.405076e-14, 7.24041e-14, 2.1...",ihre kleinen augen stehen weit oben am kopf,Ihre kleinen Augen stehen weit oben am Kopf.,their small eyes stand on the head . <eos>
6833,../data/de/clips/common_voice_de_38436640.mp3,genau das ist richtig!,"[2.5489205e-12, 9.356387e-12, 4.0864946e-12, 1...",genau das ist richtig,Genau das ist richtig.,the <unk> is about . <eos>
6834,../data/de/clips/common_voice_de_38436641.mp3,drei hunde sind einfach zu viele.,"[2.385959e-13, -7.5150205e-13, -1.7912293e-12,...",drei hunde sind einfach zu viele,Drei Hunde sind einfach zu viele.,three dogs are running in the same . <eos>


In [89]:
result = sampled[['sentence','predicted_fixed','translated']]

In [90]:
result

Unnamed: 0,sentence,predicted_fixed,translated
2,die m,Die M.,the <unk> is <unk> . <eos>
22,was genau soll das sein?,Was genau soll das sein?,a person is looking at his board . <eos>
78,das kann so nicht bleiben.,Das kann so nicht bleiben.,the <unk> is using n't tools . <eos>
80,dabei wurde ein bauarbeiter getötet und drei w...,Dabei wurde ein Bauarbeiter getötet und drei w...,there is three men wearing a construction and ...
218,im dorf befindet sich auch eine fabrik.,Im Dorf befindet sich auch eine Fabrik.,a village is in the shower of a factory . <eos>
...,...,...,...
6682,da wäre ich sehr traurig.,Da wäre ich sehr traurig.,there is some work very high - war . <eos>
6716,ihre kleinen augen stehen weit oben am kopf.,Ihre kleinen Augen stehen weit oben am Kopf.,their small eyes stand on the head . <eos>
6833,genau das ist richtig!,Genau das ist richtig.,the <unk> is about . <eos>
6834,drei hunde sind einfach zu viele.,Drei Hunde sind einfach zu viele.,three dogs are running in the same . <eos>


# Conclusion

If the vocab would be much bigger, with much bigger vocab, the accuracy of the model would be much better.

However, it has its own disadvantage. The model with large training data would take exponentially longer to train, or exponentially more space.

Anyways, for the sentences that consist of tokens from vocab, the translation is relatevly close (according to my own knowledge of German)

Overall, the model did a good job