## Project metrics explanation

A good overview of NLP metrics: https://github.com/gcunhase/NLPMetrics

My course project has two parts (three with the bonus):

* grapheme-to-phoneme
* speech-to-text
* (question answering as a bonus)

#### speech2text metric

The most popular metric for evaluating speech-to-text (s2t) models is the word error rate (WER).
We will also review other metrics here:  
ROUGE, BLEU, METEOR and WAcc (as an extension of WER)

source: https://en.wikipedia.org/wiki/Word_error_rate
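
For reference, the usual definition of WER can be written as below. The symbols are the conventional ones and are not introduced elsewhere in this notebook: $S$, $D$ and $I$ are the numbers of substituted, deleted and inserted words in a minimal alignment, $C$ is the number of correct words, and $N$ is the number of words in the reference.

```latex
\mathrm{WER} = \frac{S + D + I}{N} = \frac{S + D + I}{S + D + C}
```

Because $I$ is not bounded by $N$, WER can be greater than 1 when the hypothesis contains many extra words.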

In [1]:
import re
import numpy as np
import pandas as pd

from jiwer import wer
from numba import jit

In [2]:
@jit
def word_error_rate(r, h, split=True):
    """
    Compute the word error rate (WER): the minimum number of word insertions,
    deletions and substitutions needed to turn the reference into the hypothesis,
    divided by the number of words in the reference.

    Parameters
    ----------
    r : str
        reference sentence
    h : str
        hypothesis sentence
    split : bool, default True
        split the sentences into words; set to False to compute the
        character error rate (CER) instead

    Returns
    -------
    result : float
    """
    
    if split:
        r = r.split()
        h = h.split()
        
    d = np.zeros((len(r) + 1) * (len(h) + 1), dtype=np.uint16)
    d = d.reshape((len(r) + 1, len(h) + 1))
    for i in range(len(r) + 1):
        for j in range(len(h) + 1):
            if i == 0:
                d[0][j] = j
            elif j == 0:
                d[i][0] = i

    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            if r[i - 1] == h[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                substitution = d[i - 1][j - 1] + 1
                insertion = d[i][j - 1] + 1
                deletion = d[i - 1][j] + 1
                d[i][j] = min(substitution, insertion, deletion)
    result = float(d[len(r)][len(h)]) / len(r)

    return result

In [3]:
hypothesis = "Це тестове синтезоване речення яке порівнюється з синтезованим моделлю іншим"
reference = "Це тестове еталонне речення яке ми будемо намагатися порівняти з синтезованим моделлю"

In [6]:
word_error_rate(reference, hypothesis)

0.5

In [7]:
wer(truth=reference, hypothesis=hypothesis)

0.5
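
To see where the 0.5 comes from, it may help to split the edit distance into its three error types. The following is a minimal pure-Python sketch; `edit_ops` is a hypothetical helper written for this illustration (it is not part of `jiwer`), and it reports the counts of one minimal alignment.

```python
def edit_ops(ref_tokens, hyp_tokens):
    """Return (substitutions, deletions, insertions) of one minimal alignment."""
    n, m = len(ref_tokens), len(hyp_tokens)
    # cost[i][j] = edit distance between ref_tokens[:i] and hyp_tokens[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        cost[i][0] = i
    for j in range(m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = int(ref_tokens[i - 1] != hyp_tokens[j - 1])
            cost[i][j] = min(cost[i - 1][j - 1] + mismatch,  # match / substitution
                             cost[i - 1][j] + 1,             # deletion
                             cost[i][j - 1] + 1)             # insertion

    # Walk back from the bottom-right corner and count each operation.
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        mismatch = int(i > 0 and j > 0 and ref_tokens[i - 1] != hyp_tokens[j - 1])
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + mismatch:
            subs += mismatch
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, dels, ins


hypothesis = "Це тестове синтезоване речення яке порівнюється з синтезованим моделлю іншим"
reference = "Це тестове еталонне речення яке ми будемо намагатися порівняти з синтезованим моделлю"

subs, dels, ins = edit_ops(reference.split(), hypothesis.split())
print(subs, dels, ins, (subs + dels + ins) / len(reference.split()))
```

For these two sentences the six errors split into two substitutions, three deletions and one insertion over twelve reference words.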

#### real-life example

In [8]:
df = pd.read_excel('stt_report.xlsx')
df['transcript'] = df['transcript'].map(lambda x: re.sub(r"[\.\?\,]", '', x))

In [9]:
%%time

df['wer'] = df.apply(lambda x: wer(truth=x.transcript, hypothesis=x.pred_gc), axis=1)
df['wer'] = np.where(df['wer'] > 1, 1, df['wer'])

CPU times: user 644 ms, sys: 635 µs, total: 645 ms
Wall time: 644 ms


In [10]:
df

| Unnamed: 0 | file_name | transcript | pred_gc | wer |
|---|---|---|---|---|
| 0 | PILOT_20200206-201536_VDAD_0797068313_1239978-... | alo cho hỏi bùi thái an đang nghe máy đúng khô... | alo phải không chị thấy em luôn hả chị | 0.730769 |
| 1 | PILOT_20200206-201537_VDAD_0915234827_1239965-... | cảm ơn đã chờ máy chị gấm nghe máy hả chị Alo ... | cảm ơn đã kiểu bánh chị dám nghe máy hả chị alo | 0.918182 |
| 2 | PILOT_20200206-201551_VDAD_0777060852_1239985-... | cảm ơn đã chờ máy | cảm ơn đã kiểu bánh | 0.4 |
| 3 | PILOT_20200206-201800_VDAD_0345591318_1240055-... | alo cho em hỏi anh thế nghe máy hả anh Dạ em c... | alo cho em hỏi anh thế mà máy em chào anh thì ... | 0.971208 |
| 4 | PILOT_20200206-202015_VDAD_0936646364_1240105-... | alo cho hỏi số thuê bao này của vũ thị hằng đú... | zalo cho thuê bao này của vũ thị hằng | 0.428571 |
| 5 | PILOT_20200206-202142_VDAD_0979028682_1240142-... | alo cho hỏi số thuê bao này của trương thị loa... | hello của trương thị lan phú yên ai là trương ... | 0.748148 |
| 6 | PILOT_20200206-202202_VDAD_0907890030_1240170-... | alo chào anh cho hỏi chị phải số điện thoại củ... | alo chào anh cho hỏi cái điện thoại này có chị... | 0.4375 |
| 7 | PILOT_20200206-202313_VDAD_0394408951_1240218-... | alo cho hỏi phải anh phước đang nghe máy không... | alo cho hỏi phan phước đang nghe máy không mà ... | 0.72973 |
| 8 | PILOT_20200206-202316_VDAD_0947228560_1240217-... | alo | alo | 0.0 |


In [11]:
# select the reference and the prediction by column name rather than by position
ref, hyp = df.iloc[6][['transcript', 'pred_gc']]

In [12]:
wer(truth=ref, hypothesis=hyp)

0.4375

In [13]:
np.mean(df['wer'])

0.596012024507655

#### grapheme2phoneme metric

The evaluation approach is similar to s2t, but instead of words we compare phonemes. The calculation is the same as for WER, so we can reuse the same function to evaluate the phoneme error rate: each phoneme is written as a space-separated token.

In [14]:
word1 = "Д' і в ч и н а"
word2 = 'Д и в ч і н а'
word3 = "Н а к р у т и л а с'"
word4 = "Н а т р у т и л а с' я"

print(word_error_rate(word1, word2))
word_error_rate(word3, word4)

0.42857142857142855


0.2
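
The same idea can be expressed without relying on space-separated strings. Below is a small sketch, assuming a hypothetical helper `error_rate` that accepts any two token sequences (lists of words, lists of phonemes, or plain strings for a character error rate).

```python
from typing import Sequence


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Levenshtein distance between two token sequences divided by len(ref)."""
    if len(ref) == 0:
        raise ValueError("reference must contain at least one token")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r_tok in enumerate(ref, start=1):
        curr = [i]
        for j, h_tok in enumerate(hyp, start=1):
            curr.append(min(prev[j - 1] + (r_tok != h_tok),  # match / substitution
                            prev[j] + 1,                     # deletion
                            curr[j - 1] + 1))                # insertion
        prev = curr
    return prev[-1] / len(ref)


# Phoneme error rate: tokens are phonemes, so "Д'" counts as one unit.
per = error_rate("Д' і в ч и н а".split(), "Д и в ч і н а".split())
# Character error rate: a string is already a sequence of characters.
cer = error_rate("кіт", "кит")
print(per, cer)
```

Tokenizing first keeps multi-character phonemes such as `Д'` as single units, which a character-level comparison would split.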

### WAcc

When reporting the performance of a speech recognition system, word accuracy (WAcc) is sometimes used instead:

WAcc = 1 - WER

Because all the metrics below are meant to be maximized, we will use WAcc instead of WER for comparison purposes.
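
One consequence worth keeping in mind is that WAcc can become negative when there are many insertions. Here is a minimal sketch that works from error counts; the function name `word_accuracy` and the counts are illustrative assumptions (the first set was counted by hand for the Ukrainian test sentences above).

```python
def word_accuracy(correct: int, substitutions: int, deletions: int, insertions: int) -> float:
    """WAcc = 1 - WER, with WER = (S + D + I) / N and N = C + S + D."""
    n_ref = correct + substitutions + deletions  # words in the reference
    if n_ref == 0:
        raise ValueError("reference must contain at least one word")
    wer = (substitutions + deletions + insertions) / n_ref
    return 1.0 - wer


# Ukrainian test pair: 7 correct words, 2 substitutions, 3 deletions, 1 insertion.
print(word_accuracy(7, 2, 3, 1))

# One-word reference recognised correctly, plus three inserted words.
raw = word_accuracy(1, 0, 0, 3)
print(raw, max(0.0, raw))  # raw value and the value after clipping at zero
```

Clipping WAcc at zero corresponds to the `np.where(df['wer'] > 1, 1, df['wer'])` step used for the real-life data.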

### Comparing other NLP metrics

#### ROUGE

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation

source: https://www.aclweb.org/anthology/W04-1013.pdf  

human explanation: https://kavita-ganesan.com/what-is-rouge-and-how-it-works-for-evaluation-of-summaries/


I would explain ROUGE as a precision/recall/F1-score metric designed for text summarization.
It takes into account the unigrams and bigrams of the compared texts, as well as the longest common subsequence or common skip-grams.
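
To make the n-gram part concrete, here is a rough pure-Python sketch of ROUGE-N. It is not the implementation used by the `rouge` package; `ngram_counts` and `rouge_n` are hypothetical names, and overlaps are counted with clipping (each n-gram is matched at most as often as it appears in both texts).

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference: str, hypothesis: str, n: int = 1) -> dict:
    ref = ngram_counts(reference.split(), n)
    hyp = ngram_counts(hypothesis.split(), n)
    overlap = sum((ref & hyp).values())           # clipped n-gram matches
    p = overlap / max(sum(hyp.values()), 1)       # share of hypothesis n-grams found in the reference
    r = overlap / max(sum(ref.values()), 1)       # share of reference n-grams found in the hypothesis
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}


hypothesis = "Це тестове синтезоване речення яке порівнюється з синтезованим моделлю іншим"
reference = "Це тестове еталонне речення яке ми будемо намагатися порівняти з синтезованим моделлю"

print(rouge_n(reference, hypothesis, n=1))
print(rouge_n(reference, hypothesis, n=2))
```

Nothing in this computation looks at where in the sentence an n-gram occurs, only at whether it occurs.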

In [15]:
from rouge import Rouge

In [16]:
print(reference)
print(hypothesis)

Це тестове еталонне речення яке ми будемо намагатися порівняти з синтезованим моделлю
Це тестове синтезоване речення яке порівнюється з синтезованим моделлю іншим


In [17]:
rouge = Rouge()
scores = rouge.get_scores(hyps=hypothesis, refs=reference)
scores

[{'rouge-1': {'f': 0.6363636314049588, 'p': 0.7, 'r': 0.5833333333333334},
  'rouge-2': {'f': 0.39999999505,
   'p': 0.4444444444444444,
   'r': 0.36363636363636365},
  'rouge-l': {'f': 0.6363636314049588, 'p': 0.7, 'r': 0.5833333333333334}}]

In [18]:
print("WAcc:", 1 - wer(truth=reference, hypothesis=hypothesis))

WAcc: 0.5


In [19]:
print(ref)
print(hyp)

alo chào anh cho hỏi chị phải số điện thoại của chị kim ngân không ạ
alo chào anh cho hỏi cái điện thoại này có chị không ạ


In [20]:
rouge = Rouge()
scores = rouge.get_scores(hyps=hyp, refs=ref)
scores

[{'rouge-1': {'f': 0.6896551674673009, 'p': 0.7692307692307693, 'r': 0.625},
  'rouge-2': {'f': 0.4444444395061729, 'p': 0.5, 'r': 0.4},
  'rouge-l': {'f': 0.7142857093112245,
   'p': 0.7692307692307693,
   'r': 0.6666666666666666}}]

In [21]:
print("WAcc:", 1 - wer(truth=ref, hypothesis=hyp))

WAcc: 0.5625


As you can see, the main problem with applying ROUGE to s2t is that it only counts matched and unmatched n-grams (TP and FP samples) and mostly ignores word order and the distance between words.  
Of course, ROUGE-N takes order into account to some extent with larger n-grams, but not as directly as the Levenshtein approach in WER.  
ROUGE-L is not really applicable either, as it only counts the longest common subsequence and ignores everything else.

In general, this metric can be used for s2t. IMO, it is just not as good as WER.
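
The ROUGE-L idea can be sketched with a plain longest-common-subsequence (LCS) computation. This is a simplified sentence-level version under my own naming (`lcs_length`, `rouge_l`); the `rouge` package may treat repeated words differently, so its numbers can deviate from this sketch.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference: str, hypothesis: str) -> dict:
    ref, hyp = reference.split(), hypothesis.split()
    lcs = lcs_length(ref, hyp)
    p = lcs / len(hyp) if hyp else 0.0
    r = lcs / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}


hypothesis = "Це тестове синтезоване речення яке порівнюється з синтезованим моделлю іншим"
reference = "Це тестове еталонне речення яке ми будемо намагатися порівняти з синтезованим моделлю"
print(rouge_l(reference, hypothesis))

# Hypothetical reordering example: only "the cat" survives as a common subsequence.
print(rouge_l("the cat sat", "sat the cat"))
```

Everything outside the single longest subsequence contributes nothing to the score, which is the limitation described above.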

#### BLEU

source: https://www.aclweb.org/anthology/P02-1040.pdf

From the article:

BLEU, or the Bilingual Evaluation Understudy, is a score for comparing a candidate translation of text to one or more reference translations.

The primary programming task for a BLEU implementor is to compare n-grams of the candidate with the n-grams of the reference translation and count the number of matches. These matches are position-independent. The more the matches, the better the candidate translation is.
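
As an illustration of the counting described in the article, here is a minimal single-reference sketch of BLEU: clipped n-gram precisions combined by a weighted geometric mean and multiplied by the brevity penalty. The function `bleu` is a hypothetical stand-in, not NLTK's implementation. In particular, it returns exactly 0.0 when an n-gram order has no matches, and it skips orders whose weight is 0.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(reference: str, hypothesis: str, weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    log_sum = 0.0
    for n, weight in enumerate(weights, start=1):
        if weight == 0:
            continue  # this n-gram order does not contribute
        hyp_counts = ngram_counts(hyp, n)
        total = sum(hyp_counts.values())
        clipped = sum((hyp_counts & ngram_counts(ref, n)).values())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_sum += weight * math.log(clipped / total)
    # Brevity penalty: hypotheses shorter than the reference are scaled down.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_sum)


hypothesis = "Це тестове синтезоване речення яке порівнюється з синтезованим моделлю іншим"
reference = "Це тестове еталонне речення яке ми будемо намагатися порівняти з синтезованим моделлю"

print("1-gram:", bleu(reference, hypothesis, weights=(1, 0, 0, 0)))
print("cumulative 2-gram:", bleu(reference, hypothesis, weights=(0.5, 0.5, 0, 0)))
print("cumulative 4-gram:", bleu(reference, hypothesis))
```

The hypothesis shares no 4-gram with the reference, so the default 4-gram score collapses to zero in this sketch.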

In [22]:
from nltk.translate.bleu_score import sentence_bleu

In [23]:
print(reference)
print(hypothesis)

Це тестове еталонне речення яке ми будемо намагатися порівняти з синтезованим моделлю
Це тестове синтезоване речення яке порівнюється з синтезованим моделлю іншим


In [24]:
print("WAcc:", 1 - wer(truth=reference, hypothesis=hypothesis))

WAcc: 0.5


In [28]:
print("BLEU 1-gram:", sentence_bleu(references=[reference.split()], hypothesis=hypothesis.split(), weights=(1, 0, 0, 0)))
print("BLEU 2-gram:", sentence_bleu(references=[reference.split()], hypothesis=hypothesis.split(), weights=(0, 1, 0, 0)))
print("BLEU cumulative 2-gram:", sentence_bleu(references=[reference.split()], hypothesis=hypothesis.split(), weights=(0.5, 0.5, 0, 0)))
print("BLEU cumulative 3-gram:", sentence_bleu(references=[reference.split()], hypothesis=hypothesis.split(), weights=(0.33, 0.33, 0.33, 0)))
print("BLEU cumulative 4-gram (default):", sentence_bleu(references=[reference.split()], hypothesis=hypothesis.split(), weights=(0.25, 0.25, 0.25, 0.25)))

BLEU 1-gram: 0.5731115271545874
BLEU 2-gram: 0.3638803347013253
BLEU cumulative 2-gram: 0.4566661957296586
BLEU cumulative 3-gram: 0.2804035645280751
BLEU cumulative 4-gram (default): 4.440517594603186e-78


In [29]:
print(ref)
print(hyp)

alo chào anh cho hỏi chị phải số điện thoại của chị kim ngân không ạ
alo chào anh cho hỏi cái điện thoại này có chị không ạ


In [30]:
print("WAcc:", 1 - wer(truth=ref, hypothesis=hyp))

WAcc: 0.5625


In [31]:
print("BLEU 1-gram:", sentence_bleu(references=[ref.split()], hypothesis=hyp.split(), weights=(1, 0, 0, 0)))
print("BLEU 2-gram:", sentence_bleu(references=[ref.split()], hypothesis=hyp.split(), weights=(0, 1, 0, 0)))
print("BLEU cumulative 2-gram:", sentence_bleu(references=[ref.split()], hypothesis=hyp.split(), weights=(0.5, 0.5, 0, 0)))
print("BLEU cumulative 3-gram:", sentence_bleu(references=[ref.split()], hypothesis=hyp.split(), weights=(0.33, 0.33, 0.33, 0)))
print("BLEU cumulative 4-gram (default):", sentence_bleu(references=[ref.split()], hypothesis=hyp.split(), weights=(0.25, 0.25, 0.25, 0.25)))

BLEU 1-gram: 0.6107097367830394
BLEU 2-gram: 0.3969613289089756
BLEU cumulative 2-gram: 0.4923699307340427
BLEU cumulative 3-gram: 0.37724841138205684
BLEU cumulative 4-gram (default): 0.30215132342213097


Again, the situation is the same as with ROUGE: we are counting n-grams.  
With ROUGE-S we can at least capture a case such as  
"мама мыла раму" -> "мама раму"  
by using skip-grams, but the n-gram scores of both metrics barely penalize a reordering such as  
"мама мыла раму" -> "раму мама мыла"

In [34]:
print(1- wer("мама мыла раму", "мама раму"))
print(sentence_bleu(references=["мама мыла раму".split()], hypothesis="мама раму".split(), weights=(1, 0, 0, 0)))
print(sentence_bleu(references=["мама мыла раму".split()], hypothesis="мама раму".split(), weights=(0.5, 0.5, 0, 0)))
rouge.get_scores(hyps="мама раму", refs="мама мыла раму")

0.6666666666666667
0.6065306597126334
9.047424648113057e-155


[{'rouge-1': {'f': 0.7999999952000001, 'p': 1.0, 'r': 0.6666666666666666},
  'rouge-2': {'f': 0.0, 'p': 0.0, 'r': 0.0},
  'rouge-l': {'f': 0.7999999952000001, 'p': 1.0, 'r': 0.6666666666666666}}]

In [35]:
print(1- wer("мама мыла раму", "раму мама мыла"))
print(sentence_bleu(references=["мама мыла раму".split()], hypothesis="раму мама мыла".split(), weights=(1, 0, 0, 0)))
print(sentence_bleu(references=["мама мыла раму".split()], hypothesis="раму мама мыла".split(), weights=(0.5, 0.5, 0, 0)))
rouge.get_scores(hyps="раму мама мыла", refs="мама мыла раму")

0.33333333333333337
1.0
0.7071067811865476


[{'rouge-1': {'f': 0.999999995, 'p': 1.0, 'r': 1.0},
  'rouge-2': {'f': 0.4999999950000001, 'p': 0.5, 'r': 0.5},
  'rouge-l': {'f': 0.6666666616666668,
   'p': 0.6666666666666666,
   'r': 0.6666666666666666}}]

Why is this important? In analytic languages such as English, word order carries meaning.
Of the metrics compared so far, only WER penalizes us for changing the order of the predicted words.
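
ROUGE-S is mentioned above but not computed, since the `rouge` package used here only reports ROUGE-1, ROUGE-2 and ROUGE-L. The following is a rough sketch of the skip-bigram idea with unlimited skip distance; `skip_bigrams` and `rouge_s` are hypothetical helpers, not a library API.

```python
from collections import Counter
from itertools import combinations


def skip_bigrams(tokens):
    """All ordered word pairs (w_i, w_j) with i < j, gaps allowed."""
    return Counter(combinations(tokens, 2))


def rouge_s(reference: str, hypothesis: str) -> dict:
    ref = skip_bigrams(reference.split())
    hyp = skip_bigrams(hypothesis.split())
    overlap = sum((ref & hyp).values())
    p = overlap / max(sum(hyp.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}


print(rouge_s("мама мыла раму", "мама раму"))       # dropped word
print(rouge_s("мама мыла раму", "раму мама мыла"))  # reordered words
```

Each skip-bigram keeps the in-sentence order of its two words, so in this sketch the reordered hypothesis loses two of its three pairs, while the shortened hypothesis keeps perfect precision and loses only recall.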

#### METEOR

source: https://www.cs.cmu.edu/~alavie/METEOR/pdf/Lavie-Agarwal-2007-METEOR.pdf

tl;dr
METEOR is BLEU on steroids. In addition to exact word matches, it also counts matches between stemmed words and synonyms. This fits the MT problem well, but is mostly useless for s2t.
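
To show how the METEOR score is assembled, here is a simplified sketch that uses exact word matches only (no stemming, no synonyms) and a greedy left-to-right alignment. The function `meteor_exact` is hypothetical. The parameter values `alpha=0.9`, `beta=3.0`, `gamma=0.5` are an assumption on my side: I believe they are the NLTK defaults, and they reproduce the small-example scores reported later in this notebook. Sentences with repeated words need a proper alignment search, which this sketch does not attempt.

```python
def meteor_exact(reference: str, hypothesis: str,
                 alpha: float = 0.9, beta: float = 3.0, gamma: float = 0.5) -> float:
    ref, hyp = reference.split(), hypothesis.split()

    # Greedy alignment: each hypothesis word takes the first unused identical reference word.
    used, pairs = set(), []
    for h_idx, word in enumerate(hyp):
        for r_idx, ref_word in enumerate(ref):
            if ref_word == word and r_idx not in used:
                used.add(r_idx)
                pairs.append((h_idx, r_idx))
                break

    matches = len(pairs)
    if matches == 0:
        return 0.0

    precision, recall = matches / len(hyp), matches / len(ref)
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)

    # A chunk is a run of matches that is contiguous in both sentences.
    chunks = 1 + sum(1 for (h0, r0), (h1, r1) in zip(pairs, pairs[1:])
                     if not (h1 == h0 + 1 and r1 == r0 + 1))
    penalty = gamma * (chunks / matches) ** beta  # more fragments, larger penalty
    return (1 - penalty) * f_mean


print(meteor_exact("мама мыла раму", "мама раму"))
print(meteor_exact("мама мыла раму", "раму мыла мама"))
```

The fragmentation penalty is what makes the score sensitive to word order: the same set of matched words scores lower when it is split into more chunks.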

In [36]:
from nltk.translate.meteor_score import single_meteor_score

In [37]:
print(reference)
print(hypothesis)

Це тестове еталонне речення яке ми будемо намагатися порівняти з синтезованим моделлю
Це тестове синтезоване речення яке порівнюється з синтезованим моделлю іншим


In [38]:
print("WAcc:", 1 - wer(truth=reference, hypothesis=hypothesis))

WAcc: 0.5


In [39]:
single_meteor_score(reference=reference, hypothesis=hypothesis)

0.5698720166032515

In [40]:
print(ref)
print(hyp)

alo chào anh cho hỏi chị phải số điện thoại của chị kim ngân không ạ
alo chào anh cho hỏi cái điện thoại này có chị không ạ


In [41]:
print("WAcc:", 1 - wer(truth=ref, hypothesis=hyp))

WAcc: 0.5625


In [42]:
single_meteor_score(reference=ref, hypothesis=hyp)

0.6165605095541401

### Conclusion

As we can see, all three metrics work with n-grams.  
``ROUGE-S`` is attractive because it takes skip-grams into account, but it captures word order only loosely. The fact that it is absent from the commonly used libraries makes me think that it is not actually popular among researchers.  
``BLEU`` just compares common n-grams and aggregates the results.  
``METEOR`` is the best one for this purpose, as it also aligns the hypothesis to the reference, which affects the ``penalty`` in the final result.

In [43]:
r, h = "мама мыла раму", "мама раму"

In [44]:
print("WAcc: ", 1- wer(r, h))
print("BLEU 1-gram: ", sentence_bleu(references=[r.split()], hypothesis=h.split(), weights=(1, 0, 0, 0)))
print("BLEU cumulative 2-gram: ", sentence_bleu(references=[r.split()], hypothesis=h.split(), weights=(0.5, 0.5, 0, 0)))
print("METEOR: ", single_meteor_score(reference=r, hypothesis=h))
print("ROUGE:")
rouge.get_scores(hyps=h, refs=r)

WAcc:  0.6666666666666667
BLEU 1-gram:  0.6065306597126334
BLEU cumulative 2-gram:  9.047424648113057e-155
METEOR:  0.3448275862068965
ROUGE:


[{'rouge-1': {'f': 0.7999999952000001, 'p': 1.0, 'r': 0.6666666666666666},
  'rouge-2': {'f': 0.0, 'p': 0.0, 'r': 0.0},
  'rouge-l': {'f': 0.7999999952000001, 'p': 1.0, 'r': 0.6666666666666666}}]

In [45]:
r, h = "мама мыла раму", "раму мыла мама"

In [46]:
print("WAcc: ", 1- wer(r, h))
print("BLEU 1-gram: ", sentence_bleu(references=[r.split()], hypothesis=h.split(), weights=(1, 0, 0, 0)))
print("BLEU cumulative 2-gram: ", sentence_bleu(references=[r.split()], hypothesis=h.split(), weights=(0.5, 0.5, 0, 0)))
print("METEOR: ", single_meteor_score(reference=r, hypothesis=h))
print("ROUGE:")
rouge.get_scores(hyps=h, refs=r)

WAcc:  0.33333333333333337
BLEU 1-gram:  1.0
BLEU cumulative 2-gram:  1.491668146240062e-154
METEOR:  0.5
ROUGE:


[{'rouge-1': {'f': 0.999999995, 'p': 1.0, 'r': 1.0},
  'rouge-2': {'f': 0.0, 'p': 0.0, 'r': 0.0},
  'rouge-l': {'f': 0.3333333283333334,
   'p': 0.3333333333333333,
   'r': 0.3333333333333333}}]

In [47]:
r, h = "the cat sat on the mat", "sat on the mat the cat"

In [48]:
print("WAcc: ", 1- wer(r, h))
print("BLEU 1-gram: ", sentence_bleu(references=[r.split()], hypothesis=h.split(), weights=(1, 0, 0, 0)))
print("BLEU cumulative 2-gram: ", sentence_bleu(references=[r.split()], hypothesis=h.split(), weights=(0.5, 0.5, 0, 0)))
print("METEOR: ", single_meteor_score(reference=r, hypothesis=h))
print("ROUGE:")
rouge.get_scores(hyps=h, refs=r)

WAcc:  0.33333333333333337
BLEU 1-gram:  1.0
BLEU cumulative 2-gram:  0.8944271909999159
METEOR:  0.7106481481481481
ROUGE:


[{'rouge-1': {'f': 0.999999995, 'p': 1.0, 'r': 1.0},
  'rouge-2': {'f': 0.7999999950000002, 'p': 0.8, 'r': 0.8},
  'rouge-l': {'f': 0.7999999950000002, 'p': 0.8, 'r': 0.8}}]

#### Time test

In [49]:
from time import time

In [50]:
start_time = time()
_ = 1- wer(r, h)
print(f"WAcc exec time: {(time() - start_time) * 1000:.2f} milliseconds")

start_time = time()
_ = 1- word_error_rate(r, h)
print(f"WAcc with JIT exec time: {(time() - start_time) * 1000:.2f} milliseconds")

start_time = time()
_ = sentence_bleu(references=[r.split()], hypothesis=h.split(), weights=(1, 0, 0, 0))
print(f"BLEU 1-gram exec time: {(time() - start_time) * 1000:.2f} milliseconds")

start_time = time()
_ = sentence_bleu(references=[r.split()], hypothesis=h.split(), weights=(0.5, 0.5, 0, 0))
print(f"BLEU cumulative 2-gram exec time: {(time() - start_time) * 1000:.2f} milliseconds")

start_time = time()
_ = single_meteor_score(reference=r, hypothesis=h)
print(f"METEOR exec time: {(time() - start_time) * 1000:.2f} milliseconds")

start_time = time()
_ = rouge.get_scores(hyps=h, refs=r)
print(f"ROUGE exec time: {(time() - start_time) * 1000:.2f} milliseconds")

WAcc exec time: 0.88 milliseconds
WAcc with JIT exec time: 0.64 milliseconds
BLEU 1-gram exec time: 0.56 milliseconds
BLEU cumulative 2-gram exec time: 0.50 milliseconds
METEOR exec time: 0.24 milliseconds
ROUGE exec time: 0.56 milliseconds
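
Single-shot timings of sub-millisecond calls are noisy, so the differences above should be read with caution. A possible alternative is to repeat each call many times with the standard `timeit` module and keep the best run. In the sketch below, `best_time_ms` is a hypothetical helper and `token_overlap` is only a stand-in for whichever metric function is being timed.

```python
import timeit


def best_time_ms(func, *args, repeat: int = 5, number: int = 200) -> float:
    """Best average time per call in milliseconds over `repeat` runs of `number` calls."""
    timer = timeit.Timer(lambda: func(*args))
    return min(timer.repeat(repeat=repeat, number=number)) / number * 1000


def token_overlap(reference: str, hypothesis: str) -> int:
    # Stand-in workload; replace with wer, sentence_bleu, etc. when they are installed.
    return len(set(reference.split()) & set(hypothesis.split()))


elapsed = best_time_ms(token_overlap, "the cat sat on the mat", "sat on the mat the cat")
print(f"token_overlap: {elapsed:.4f} milliseconds per call")
```

Taking the minimum over several runs reduces the influence of background load and one-off costs such as caching or JIT compilation.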


So there is no clear performance advantage among the described metrics, and speed cannot be the criterion for choosing the best one.
However, if word order in the generated text matters to us, we should choose ``METEOR`` or ``WAcc``.
``METEOR`` penalizes missing words more heavily, while ``WAcc`` is just the ``Levenshtein distance`` over sentences and weights all errors equally.