# 【Speech Recognition - Whisper】 Measuring Accuracy Requires a 📏 Ruler


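Earlier we covered several basic Whisper concepts; here is the [🚀 portal](https://vocus.cc/article/644526c8fd89780001ffdd9f) if you would like to read up on them first. Knowing how to use a speech recognition tool is only half the job, though: accuracy matters just as much. So how should accuracy actually be evaluated? By the end of this article you will know how to compute the character error rate, the word error rate, and related metrics for a transcript. Let's work through the evaluation step by step!

For the evaluation we use jiwer, which supports several metrics, including WER, CER, and MER. How do these metrics differ? Read on!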

## Installing the packages

In [1]:
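# Error-rate calculation tool
!pip install jiwer

# Remove the unrelated package that shares its name with the speech recognition package
# !pip uninstall whisper

# Speech recognition (ASR)
!pip install -U openai-whisper

# Hugging Face datasets library
!pip install datasets

# Word segmenter
!pip install jieba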

Collecting jiwer
  Downloading jiwer-3.0.3-py3-none-any.whl (21 kB)
Collecting rapidfuzz<4,>=3 (from jiwer)
  Downloading rapidfuzz-3.6.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.4 MB)
Installing collected packages: rapidfuzz, jiwer
Successfully installed jiwer-3.0.3 rapidfuzz-3.6.2
Collecting openai-whisper
  Downloading openai-whisper-20231117.tar.gz (798 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tiktoken (from openai-whisper)
  Downloading tiktoken-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.8 MB)

## What are the different metrics?

### Computing at the word level

Let's first look at the word-level alignment:

In [None]:


import jiwer

reference = "今天 天氣 很好 嗎"
hypothesis = "今天 天氣 很 好 啊"

# Align the two word sequences and print the alignment
out = jiwer.process_words(reference, hypothesis)
print(jiwer.visualize_alignment(out))

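#### Word Error Rate (WER)
WER is computed at the word level. It measures how many words in the recognition result must be changed for it to match the reference, relative to the number of reference words. Here S, D, and I are the numbers of substitutions, deletions, and insertions, and H is the number of hits (correctly recognized words).

```bash
Formula: (S + D + I) / (H + S + D)
Calculation: (2 + 0 + 1) / (2 + 2 + 0)
3 / 4 = 75%
```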
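
To make the counts in the formula above concrete, here is a minimal pure-Python sketch that aligns two space-separated sentences with a standard edit-distance table and tallies H, S, D, and I. The helper names `align_counts` and `wer` are hypothetical, and this is not how jiwer is implemented internally; it only illustrates where the numbers come from.

```python
def align_counts(reference, hypothesis):
    """Return (hits, substitutions, deletions, insertions) for one minimum-edit alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)

    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # hit or substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion

    # Walk back from the bottom-right corner and count each operation.
    hits = subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            hits += 1
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            subs += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return hits, subs, dels, ins


def wer(reference, hypothesis):
    """Word error rate: (S + D + I) / (H + S + D)."""
    h, s, d, i = align_counts(reference, hypothesis)
    return (s + d + i) / (h + s + d)


print(align_counts("今天 天氣 很好 嗎", "今天 天氣 很 好 啊"))  # (2, 2, 0, 1)
print(wer("今天 天氣 很好 嗎", "今天 天氣 很 好 啊"))           # 0.75
```

The denominator `H + S + D` is simply the number of words in the reference, which is why WER is often described as "errors per reference word".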

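💡 Since the unit is the `word`, segment both the reference and the recognition result into words first (usually separated by spaces). Punctuation also counts toward the result.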

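#### Match Error Rate (MER)
The main difference from WER is that the denominator also includes `Insertion`. MER is therefore the share of all aligned positions (hits, substitutions, deletions, and insertions) that are errors, so it always lies between 0 and 1, whereas WER can exceed 100% when the recognition result contains many insertions.

```bash
Formula: (S + D + I) / (H + S + D + I)
Calculation: (2 + 0 + 1) / (2 + 2 + 0 + 1)

3 / 5 = 60%
```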
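
The effect of the extra `I` in the denominator is easiest to see side by side. The sketch below computes both rates directly from the four counts; the function names are hypothetical and the second set of counts is a made-up illustration (one correct word followed by two inserted words), not a result from this article.

```python
def wer_from_counts(h, s, d, i):
    """WER: errors divided by the number of reference words (H + S + D)."""
    return (s + d + i) / (h + s + d)


def mer_from_counts(h, s, d, i):
    """MER: errors divided by all aligned positions (H + S + D + I)."""
    return (s + d + i) / (h + s + d + i)


# Counts from the example sentence pair: H=2, S=2, D=0, I=1
print(wer_from_counts(2, 2, 0, 1), mer_from_counts(2, 2, 0, 1))

# Illustrative counts with many insertions: H=1, S=0, D=0, I=2.
# WER goes above 1.0 here, while MER stays below 1.0.
print(wer_from_counts(1, 0, 0, 2), mer_from_counts(1, 0, 0, 2))
```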

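#### Word Information Preserved (WIP)
This metric estimates how much of the word information survives recognition: it multiplies the fraction of reference words that were matched exactly by the fraction of recognized words that were matched exactly.

```bash
num_rf_words = number of words in the reference = 4
num_hp_words = number of words in the recognition result = 5
Formula: (H / num_rf_words) * (H / num_hp_words)
Calculation: (2 / 4) * (2 / 5)
0.5 * 0.4 = 20%
```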
#### Word Information Lost (WIL)
WIL is the complement of WIP: once WIP is known, subtracting it from 1 gives a rough measure of the proportion of word information that was lost.
```bash
Formula: 1 - WIP
1 - 0.2 = 80%
```
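
The two formulas above translate directly into code. This is a small sketch with hypothetical function names that takes the hit count and the two sentence lengths as inputs; it mirrors the arithmetic shown above rather than any particular library implementation.

```python
def wip(hits, num_rf_words, num_hp_words):
    """Word Information Preserved: (H / N_ref) * (H / N_hyp)."""
    return (hits / num_rf_words) * (hits / num_hp_words)


def wil(hits, num_rf_words, num_hp_words):
    """Word Information Lost: 1 - WIP."""
    return 1.0 - wip(hits, num_rf_words, num_hp_words)


# Example sentence pair: 2 hits, 4 reference words, 5 recognized words
print(f"WIP = {wip(2, 4, 5):.0%}")  # WIP = 20%
print(f"WIL = {wil(2, 4, 5):.0%}")  # WIL = 80%
```

A perfect transcript gives WIP = 1 and WIL = 0; a transcript with no matching words gives WIP = 0 and WIL = 1.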

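### Computing at the character level

#### Character Error Rate (CER)
CER is computed at the character level. For the reference `今天天氣很好嗎` and the recognition result `今天天氣很好啊`, there is 1 substitution at the character level, so 1 of the 7 reference characters is wrong:

```
1 / 7 ≈ 14.29%
```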

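💡 Since the unit is the `character`, remove the spaces from both the reference and the recognition result so that they are not counted as characters. Punctuation and similar symbols also affect the result.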
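
As a rough illustration of the calculation and of the whitespace tip, the following sketch computes CER with a plain Levenshtein distance after stripping all whitespace. The names `edit_distance` and `cer` are hypothetical helpers written for this example; removing every whitespace character is an assumption of the sketch, and punctuation is left untouched.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, keeping only one row in memory."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j - 1] + (r != h),  # hit or substitution
                            prev[j] + 1,             # deletion
                            curr[j - 1] + 1))        # insertion
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate after removing all whitespace from both strings."""
    ref = "".join(reference.split())
    hyp = "".join(hypothesis.split())
    return edit_distance(ref, hyp) / len(ref)


print(f"{cer('今天天氣很好嗎', '今天天氣很好啊'):.2%}")      # 14.29%
# Spaces left over from word segmentation no longer change the result
print(f"{cer('今天 天氣 很好 嗎', '今天天氣很好啊'):.2%}")   # 14.29%
```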

In [2]:
# Connect to your own Google Drive
from google.colab import drive
drive.mount('/content/drive')
import os
os.chdir('/content/drive/MyDrive/Colab Notebooks')

Mounted at /content/drive


In [7]:
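import re

def remove_repeated_words(text):
    """Collapse immediately repeated runs of 2 to 100 word characters into one copy."""
    # (\w{2,100}) captures a run; \1 requires the same run to follow directly
    pattern = r'(\w{2,100})\1'
    # Repeat until nothing changes, so triple or longer repeats collapse too
    while True:
        new_text = re.sub(pattern, r'\1', text)
        if new_text == text: break
        text = new_text
    return text

# new_transcript = []
# for transcript in transcripts:
#     new_transcript.append(remove_repeated_words(transcript))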
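
A quick usage sketch shows what this helper does to a transcript. The function is repeated here so the block runs on its own, and the sample strings are made up for illustration. Note that the pattern cannot tell a recognition loop from legitimate reduplication, so a phrase such as `研究研究` is shortened as well.

```python
import re


def remove_repeated_words(text):
    """Collapse immediately repeated runs of 2 to 100 word characters into one copy."""
    pattern = r'(\w{2,100})\1'
    while True:
        new_text = re.sub(pattern, r'\1', text)
        if new_text == text:
            break
        text = new_text
    return text


# A phrase repeated three times in a row collapses to a single copy
print(remove_repeated_words("謝謝大家謝謝大家謝謝大家"))  # 謝謝大家

# Caveat: intentional reduplication is collapsed too
print(remove_repeated_words("研究研究一下"))  # 研究一下

# Text without immediate repeats is returned unchanged
print(remove_repeated_words("今天天氣很好"))  # 今天天氣很好
```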

In [9]:
import jiwer

# Hypothesis: the ASR output to be evaluated
with open("openai20221025 臺南市政府第566次市政會議.txt", "r", encoding="utf-8") as openaiasr:
    hypothesis = openaiasr.read()
# Reference: the corrected transcript
with open("corrected/20221025-臺南市政府第566次市政會議_正確版.docx.txt", "r", encoding="utf-8") as correctedasr:
    reference = correctedasr.read()

#reference = "今天天氣很好嗎"
#hypothesis = "今天天氣很好啊"
#hypothesis=remove_repeated_words(hypothesis)
error = jiwer.cer(reference, hypothesis)
print(error)
output = jiwer.process_characters(reference, hypothesis)
#print(jiwer.visualize_alignment(output))


0.2553241557830333


In [None]:
import jiwer
with open("seihing20221025 臺南市政府第566次市政會議.txt","r",encoding="utf-8") as openaiasr:
  hypothesis=openaiasr.read()
with open("corrected/20221025-臺南市政府第566次市政會議_正確版.docx.txt","r",encoding="utf-8") as correctedasr:
  reference=correctedasr.read()

#reference = "今天天氣很好嗎"
#hypothesis = "今天天氣很好啊"

#output = jiwer.process_characters(reference, hypothesis)
#print(jiwer.visualize_alignment(output))
error = jiwer.cer(reference, hypothesis)
print(error)


0.39475232380280034


In [None]:
import jiwer
with open("openai20230829台南市政府第609次市政會議 直播.txt","r",encoding="utf-8") as openaiasr:
  hypothesis=openaiasr.read()
with open("corrected/202310829 台南市政府 第609次市政會議 正確版.txt","r",encoding="utf-8") as correctedasr:
  reference=correctedasr.read()

#reference = "今天天氣很好嗎"
#hypothesis = "今天天氣很好啊"

#output = jiwer.process_characters(reference, hypothesis)
#print(jiwer.visualize_alignment(output))
error = jiwer.cer(reference, hypothesis)
print(error)

0.2962521125360374


In [None]:
import jiwer
with open("seiching20230829台南市政府第609次市政會議 直播.txt","r",encoding="utf-8") as openaiasr:
  hypothesis=openaiasr.read()
with open("corrected/202310829 台南市政府 第609次市政會議 正確版.txt","r",encoding="utf-8") as correctedasr:
  reference=correctedasr.read()

#reference = "今天天氣很好嗎"
#hypothesis = "今天天氣很好啊"

error = jiwer.cer(reference, hypothesis)
print(error)

0.3848295059151009


In [None]:
import jiwer
with open("openai20231225 台南市政府 第626次市政會議 直播.txt","r",encoding="utf-8") as openaiasr:
  hypothesis=openaiasr.read()
with open("corrected/20231225 台南市政府 第626次市政會議 正確版.txt","r",encoding="utf-8") as correctedasr:
  reference=correctedasr.read()

#reference = "今天天氣很好嗎"
#hypothesis = "今天天氣很好啊"

error = jiwer.cer(reference, hypothesis)
print(error)

0.36810352365130716


In [None]:
import jiwer
with open("seiching20231225 台南市政府 第626次市政會議 直播.txt","r",encoding="utf-8") as openaiasr:
  hypothesis=openaiasr.read()
with open("corrected/20231225 台南市政府 第626次市政會議 正確版.txt","r",encoding="utf-8") as correctedasr:
  reference=correctedasr.read()

#reference = "今天天氣很好嗎"
#hypothesis = "今天天氣很好啊"

error = jiwer.cer(reference, hypothesis)
print(error)

0.37789630147766023


In [None]:
import jiwer
with open("openai20240130 台南市政府 第631次市政會議 直播.txt","r",encoding="utf-8") as openaiasr:
  hypothesis=openaiasr.read()
with open("corrected/20240130 台南市政府 第631次市政會議 正確版.txt","r",encoding="utf-8") as correctedasr:
  reference=correctedasr.read()

#reference = "今天天氣很好嗎"
#hypothesis = "今天天氣很好啊"

error = jiwer.cer(reference, hypothesis)
print(error)

0.3496987951807229


In [None]:
import jiwer
with open("seiching20240130 台南市政府 第631次市政會議 直播.txt","r",encoding="utf-8") as openaiasr:
  hypothesis=openaiasr.read()
with open("corrected/20240130 台南市政府 第631次市政會議 正確版.txt","r",encoding="utf-8") as correctedasr:
  reference=correctedasr.read()

#reference = "今天天氣很好嗎"
#hypothesis = "今天天氣很好啊"

error = jiwer.cer(reference, hypothesis)
print(error)

0.35873493975903614
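
To compare the runs at a glance, the CER values printed by the cells above can be collected into one small table. The numbers below are copied from those outputs; grouping them by the `openai` and `seiching` file-name prefixes, and treating those prefixes as two transcription sources, is an assumption made for this sketch.

```python
from statistics import mean

# CER values copied from the cell outputs above, keyed by meeting date.
# "openai" / "seiching" are the hypothesis file-name prefixes (the first
# seiching file is spelled "seihing" in its file name); reading them as
# two transcription sources is an assumption.
reported_cer = {
    "20221025": {"openai": 0.2553241557830333, "seiching": 0.39475232380280034},
    "20230829": {"openai": 0.2962521125360374, "seiching": 0.3848295059151009},
    "20231225": {"openai": 0.36810352365130716, "seiching": 0.37789630147766023},
    "20240130": {"openai": 0.3496987951807229, "seiching": 0.35873493975903614},
}


def mean_cer(source):
    """Average CER of one source across all meetings."""
    return mean(row[source] for row in reported_cer.values())


for date, row in reported_cer.items():
    print(f"{date}  openai={row['openai']:.2%}  seiching={row['seiching']:.2%}")
print(f"mean      openai={mean_cer('openai'):.2%}  seiching={mean_cer('seiching'):.2%}")
```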
