My course project has two parts, plus an optional third:

* grapheme-to-phoneme
* speech-to-text
* (question answering as a bonus)

#### speech2text metric

The most widely used metric for evaluating speech-to-text (s2t) models is word error rate (WER):

https://en.wikipedia.org/wiki/Word_error_rate
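
For reference, WER is usually defined from the word-level alignment between the reference and the hypothesis. In the formula below, S, D and I are the numbers of substitutions, deletions and insertions, and N is the number of words in the reference:

```latex
\mathrm{WER} = \frac{S + D + I}{N}
```

Because insertions are counted, the value can exceed 1 when the hypothesis is much longer than the reference.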

In [23]:
import numpy as np
import pandas as pd

from jiwer import wer

In [30]:
def word_error_rate(r, h, split=True):
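    """
    Compute the error rate between a reference and a hypothesis: the
    Levenshtein distance (insertions, deletions, substitutions) divided by
    the reference length.

    Parameters
    ----------
    r : str
        reference sentence (must contain at least one token)
    h : str
        hypothesis sentence
    split : bool, default True
        split the sentences into words on whitespace. Set to False to compare
        character by character, which gives the character error rate (CER)

    Returns
    -------
    result : float
        edit distance divided by the number of reference tokens; can exceed 1
    """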
    
    if split:
        r = r.split()
        h = h.split()
        
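    # d[i][j] is the edit distance between the first i reference tokens
    # and the first j hypothesis tokens; int64 avoids overflow on long inputs
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=np.int64)
    d[0, :] = np.arange(len(h) + 1)  # empty reference: j insertions
    d[:, 0] = np.arange(len(r) + 1)  # empty hypothesis: i deletions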

    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            if r[i - 1] == h[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                substitution = d[i - 1][j - 1] + 1
                insertion = d[i][j - 1] + 1
                deletion = d[i - 1][j] + 1
                d[i][j] = min(substitution, insertion, deletion)
    result = float(d[len(r)][len(h)]) / len(r)

    return result

In [31]:
sentence1 = 'Дівчина насупилась набурмосилась і від того ще покращала.'
sentence2 = 'Дівчина накрутилась напросилась і від того ще покращала.'

word_error_rate(sentence1, sentence2)

0.25

In [32]:
wer(sentence1, sentence2)

0.25
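
The docstring above notes that the same computation gives a character error rate when the input is not split into words. The sketch below illustrates that idea in a self-contained way. It is not the notebook's function: `edit_distance` and `error_rate` are hypothetical helper names, and the `unit` parameter is an assumption introduced here in place of the `split` flag.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (hypothetical helper)."""
    # prev[j] is the distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r_tok in enumerate(ref, start=1):
        curr = [i]
        for j, h_tok in enumerate(hyp, start=1):
            cost = 0 if r_tok == h_tok else 1
            curr.append(min(prev[j - 1] + cost,  # match or substitution
                            curr[j - 1] + 1,     # insertion
                            prev[j] + 1))        # deletion
        prev = curr
    return prev[-1]


def error_rate(ref, hyp, unit="word"):
    """Edit distance divided by reference length, per word or per character."""
    if unit == "word":
        ref_tokens, hyp_tokens = ref.split(), hyp.split()
    else:
        ref_tokens, hyp_tokens = list(ref), list(hyp)
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


reference = 'Дівчина насупилась набурмосилась і від того ще покращала.'
hypothesis = 'Дівчина накрутилась напросилась і від того ще покращала.'

wer_value = error_rate(reference, hypothesis, unit="word")
cer_value = error_rate(reference, hypothesis, unit="char")
print(wer_value, cer_value)
```

The word-level value should agree with the 0.25 reported above. The character-level value is lower, since most characters in the two mismatched words still line up.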

#### real-life example

In [33]:
df = pd.read_excel('stt_report.xlsx')#.drop('wer_gc', 1)

In [34]:
%%time

df['wer'] = df.apply(lambda x: word_error_rate(x.transcript, x.pred_gc), axis=1)
# WER exceeds 1 when there are more errors than reference words; cap it at 1
df['wer'] = np.where(df['wer'] > 1, 1, df['wer'])

CPU times: user 648 ms, sys: 0 ns, total: 648 ms
Wall time: 647 ms


In [35]:
df.head()

Unnamed: 0,file_name,transcript,pred_gc,wer
0,PILOT_20200206-201536_VDAD_0797068313_1239978-...,"alo, cho hỏi bùi thái an đang nghe máy đúng kh...",alo phải không chị thấy em luôn hả chị,0.884615
1,PILOT_20200206-201537_VDAD_0915234827_1239965-...,"cảm ơn đã chờ máy, chị gấm nghe máy hả chị? Al...",cảm ơn đã kiểu bánh chị dám nghe máy hả chị alo,0.927273
2,PILOT_20200206-201551_VDAD_0777060852_1239985-...,cảm ơn đã chờ máy,cảm ơn đã kiểu bánh,0.4
3,PILOT_20200206-201800_VDAD_0345591318_1240055-...,"alo, cho em hỏi anh thế nghe máy hả anh? Dạ em...",alo cho em hỏi anh thế mà máy em chào anh thì ...,0.973333
4,PILOT_20200206-202015_VDAD_0936646364_1240105-...,"alo, cho hỏi số thuê bao này của vũ thị hằng đ...",zalo cho thuê bao này của vũ thị hằng,0.428571
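
In the rows above, the reference transcripts contain punctuation and capital letters while the predictions do not, so part of the measured WER may come from formatting rather than recognition errors. Below is a minimal sketch of normalizing a string before scoring, assuming that lowercasing and stripping punctuation are acceptable for this dataset. `normalize_text` is a hypothetical helper, not part of the notebook, and the input is only a fragment of one transcript.

```python
import re
import unicodedata


def normalize_text(text):
    """Lowercase, drop punctuation, collapse whitespace (hypothetical helper)."""
    # Compose accents first so that letters with diacritics are single characters
    text = unicodedata.normalize("NFC", text).lower()
    # \w is Unicode-aware in Python 3, so accented letters are kept;
    # everything that is not a word character or whitespace becomes a space
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


raw_reference = "cảm ơn đã chờ máy, chị gấm nghe máy hả chị?"
normalized = normalize_text(raw_reference)
print(normalized)
```

Applying the same function to both the transcript and the prediction columns before computing WER would keep the comparison symmetric.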


In [36]:
np.mean(df['wer'])

0.641254431881672
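
The value above is the mean of per-utterance WERs, which weights a short utterance the same as a long one. A common alternative is a corpus-level WER: total errors divided by total reference words. The sketch below contrasts the two on made-up counts. The numbers are hypothetical and not taken from stt_report.xlsx, and the notebook's dataframe stores only the per-row ratio.

```python
# Hypothetical (errors, reference_words) pairs, one per utterance
utterances = [(2, 5), (30, 50)]

# Mean of per-utterance ratios: every utterance has equal weight
mean_of_ratios = sum(e / n for e, n in utterances) / len(utterances)

# Corpus-level ratio: longer utterances contribute more
corpus_level = sum(e for e, _ in utterances) / sum(n for _, n in utterances)

print(mean_of_ratios, corpus_level)
```

The two numbers differ whenever utterance lengths vary, so it helps to state which one is being reported.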

#### grapheme2phoneme metric

The evaluation approach is similar to s2t, but we compare phonemes instead of words. The formula is the same as for WER, so we can reuse the same function to compute the phoneme error rate (PER). The function splits on whitespace, so each space-separated phoneme counts as one token.

In [39]:
word1 = "Д' і в ч и н а"
word2 = 'Д и в ч і н а'
word3 = "Н а к р у т и л а с'"
word4 = "Н а т р у т и л а с' я"

print(word_error_rate(word1, word2))
word_error_rate(word3, word4)

0.42857142857142855


0.2