# Summary of test runs

In [1]:
import pandas as pd
import jiwer
import os

## Build combined DataFrame

My runs are saved as CSV files in the `out` directory. Here I import them into one combined DataFrame for analysis.

The test runs were augmented with virtual test runs without capitalization and punctuation using [01_remove_pc.ipynb](./01_remove_pc.ipynb).
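As a rough illustration of what such a cleaning step could look like, here is a minimal sketch. The function name `remove_pc` and the exact rules are assumptions, not the contents of `01_remove_pc.ipynb`. The rules are guessed from the sample rows below, where apostrophes and commas disappear but the hyphen in "thirty-six" survives.

```python
import re

def remove_pc(text: str) -> str:
    """Hypothetical helper: lowercase the text and strip punctuation.

    Assumption: hyphens are kept, everything else that is not a word
    character or whitespace is dropped.
    """
    text = text.lower()
    # Remove every character that is not a word character, whitespace, or hyphen.
    text = re.sub(r"[^\w\s-]", "", text)
    # Collapse runs of whitespace that removed characters may leave behind.
    return re.sub(r"\s+", " ", text).strip()

print(remove_pc("Fast as his legs could carry him, Servadak"))
# fast as his legs could carry him servadak
```

Applying the same function to both the prediction and the reference keeps the two sides comparable before the WER is recomputed.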

In [2]:
OUT_DIR = "../out"

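def import_df(run):
    """Load one run's CSV and tag every row with the run's file name."""
    df = pd.read_csv(os.path.join(OUT_DIR, run))
    df["run"] = run
    return df

# Every CSV file in OUT_DIR is treated as one test run.
runs = [el for el in os.listdir(OUT_DIR) if el.endswith(".csv")]
# ignore_index=True yields one continuous row index; the "run" column identifies the source file.
run_data = pd.concat([import_df(run) for run in runs], ignore_index=True)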
run_data

Unnamed: 0,id,wer,pred_transcription,time,pred_transcription_clean,wer_clean,run
0,2961-961-0000,0.220503,socrates begins the timaeus with a summary of...,4.024658,socrates begins the timaeus with a summary of...,0.031008,baseline_large-v3-turbo.csv
1,4970-29093-0000,0.111667,you'll never dig it out of the astor library ...,4.773546,youll never dig it out of the astor library t...,0.021667,baseline_large-v3-turbo.csv
2,6930-76324-0001,0.180556,they were certainly no near the solution of t...,3.610766,they were certainly no near the solution of t...,0.053241,baseline_large-v3-turbo.csv
3,7729-102255-0000,0.241182,the bogus legislature numbered thirty-six mem...,9.543177,the bogus legislature numbered thirty-six mem...,0.052431,baseline_large-v3-turbo.csv
4,5105-28240-0000,0.119658,"Fast as his legs could carry him, Servadak ha...",3.614152,fast as his legs could carry him servadak had...,0.029915,baseline_large-v3-turbo.csv
...,...,...,...,...,...,...,...
244,1995-1837-0000,0.205941,"He knew the silver fleece, his and zora's must...",100.129853,he knew the silver fleece his and zoras must b...,0.061386,2025-04-09_13-12-43_librispeech-pc-test-clean_...
245,237-126133-0000,0.321348,"Here she would stay, comforted and soothed amo...",92.567921,here she would stay comforted and soothed amon...,0.184270,2025-04-09_13-12-43_librispeech-pc-test-clean_...
246,6829-68771-0000,0.261561,"So, to the surprise of the Democratic Committe...",132.127315,so to the surprise of the democratic committee...,0.114162,2025-04-09_13-12-43_librispeech-pc-test-clean_...
247,5683-32879-0000,0.217949,It was not very much past eleven that morning ...,105.007132,it was not very much past eleven that morning ...,0.088172,2025-04-09_13-12-43_librispeech-pc-test-clean_...


## Analytics

### Run summary

For each run on LibriSpeech PC, there is also a virtual test run on LibriSpeech, created by removing capitalization and punctuation from both the prediction and the reference transcript. I interpret the difference between the two WERs as the punctuation error: all instances where the correct word was predicted but its case or punctuation did not match.
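To make this interpretation concrete, here is a small worked example on a made-up sentence pair. It is only a sketch: the `wer` function is a plain word-level edit distance written for illustration rather than a call to `jiwer`, and `strip_pc` is a hypothetical stand-in for the cleaning step.

```python
import re

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)

def strip_pc(text: str) -> str:
    """Hypothetical cleaning step: lowercase and drop punctuation except hyphens."""
    return re.sub(r"[^\w\s-]", "", text.lower())

reference = "So, he went home."    # made-up example, not from the data set
hypothesis = "so he went house"

wer_pc = wer(reference, hypothesis)                         # 2 of 4 words differ
wer_clean = wer(strip_pc(reference), strip_pc(hypothesis))  # only "house" differs
pc_error = wer_pc - wer_clean

print(wer_pc, wer_clean, pc_error)
# 0.5 0.25 0.25
```

Here half of the total WER comes from case and punctuation alone ("So," vs. "so"), while the remaining half is a genuine word error ("home" vs. "house").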

In [18]:
mean_wer = run_data.groupby("run")["wer_clean"].mean()

mean_wer_pc = run_data.groupby("run")["wer"].mean()

# Part of the WER that disappears once case and punctuation are ignored
mean_pc_errors = mean_wer_pc - mean_wer
mean_pc_error_percentage = (mean_pc_errors / mean_wer_pc).apply(lambda x: f"{x:.2%}")

summary = pd.DataFrame({
    "mean_wer_pc": mean_wer_pc,
    "mean_wer": mean_wer,
    "mean_pc_errors": mean_pc_errors,
    "mean_pc_error_percentage": mean_pc_error_percentage
}).reset_index()
summary = summary.rename(columns={"run": "Run"})
summary

Unnamed: 0,Run,mean_wer_pc,mean_wer,mean_pc_errors,mean_pc_error_percentage
0,2025-04-09_13-12-43_librispeech-pc-test-clean_...,0.245097,0.129239,0.115858,47.27%
1,2025-04-13_12-32-11_librispeech-pc-test-clean_...,0.247452,0.121528,0.125924,50.89%
2,baseline_large-v3-turbo.csv,0.17595,0.041012,0.134937,76.69%


| Column                     | Description                                                                                     |
|----------------------------|-------------------------------------------------------------------------------------------------|
| `mean_wer_pc`              | Mean WER of the test run on LibriSpeech PC (with capitalization and punctuation)                |
| `mean_wer`                 | Mean WER of the *virtual* test run without capitalization and punctuation                       |
| `mean_pc_errors`           | Mean rate of punctuation-only errors (correct word, incorrect case or punctuation)              |
| `mean_pc_error_percentage` | Share of `mean_wer_pc` that is due to punctuation-only errors (`mean_pc_errors / mean_wer_pc`)  |

### Calculate divergence from baseline

The relative difference between each run and the offline baseline gives a sense of how much the online transcription affects quality.
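The relative difference is `(run - baseline) / baseline`. As a quick sanity check, the sketch below applies it to rounded mean WER values taken from the summary table above. The helper name `rel_diff` is introduced only for this illustration.

```python
def rel_diff(value: float, baseline: float) -> float:
    """Relative difference of value with respect to baseline."""
    return (value - baseline) / baseline

# Rounded mean_wer values from the run summary above
baseline_wer = 0.041012   # baseline_large-v3-turbo.csv
run_wer = 0.129239        # run from 2025-04-09

print(f"{rel_diff(run_wer, baseline_wer):+.2%}")
# +215.12%
```

A value of +215.12% means the mean WER of the run is a bit more than three times the baseline's, while the baseline compared with itself gives +0.00%.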

In [None]:
baseline = summary[summary["Run"].str.contains("baseline", case=False)].iloc[0]

summary["mean_wer_rel_diff"] = (summary["mean_wer"] - baseline["mean_wer"]) / baseline["mean_wer"]
summary["mean_wer_rel_diff"] = summary["mean_wer_rel_diff"].apply(lambda x: f"{x:+.2%}")
summary["mean_wer_pc_rel_diff"] = (summary["mean_wer_pc"] - baseline["mean_wer_pc"]) / baseline["mean_wer_pc"]
summary["mean_wer_pc_rel_diff"] = summary["mean_wer_pc_rel_diff"].apply(lambda x: f"{x:+.2%}")
summary[["Run", "mean_wer", "mean_wer_rel_diff", "mean_wer_pc", "mean_wer_pc_rel_diff"]]

Unnamed: 0,Run,mean_wer,mean_wer_rel_diff,mean_wer_pc,mean_wer_pc_rel_diff
0,2025-04-09_13-12-43_librispeech-pc-test-clean_...,0.129239,+215.12%,0.245097,+39.30%
1,2025-04-13_12-32-11_librispeech-pc-test-clean_...,0.121528,+196.32%,0.247452,+40.64%
2,baseline_large-v3-turbo.csv,0.041012,+0.00%,0.17595,+0.00%


| Column                 | Description                                                          |
|------------------------|----------------------------------------------------------------------|
| `Run`                  | Identifier or name of the evaluation run                             |
| `mean_wer`             | Mean Word Error Rate (WER) without punctuation                       |
| `mean_wer_rel_diff`    | Relative difference of `mean_wer` compared to the baseline (in %)    |
| `mean_wer_pc`          | Mean Word Error Rate with punctuation (LibriSpeech PC test)          |
| `mean_wer_pc_rel_diff` | Relative difference of `mean_wer_pc` compared to the baseline (in %) |