In [7]:
import webvtt
import jiwer

First, we'll use the Python webvtt library (https://webvtt-py.readthedocs.io/en/latest/usage.html) to read in a sample .vtt file. For each caption, it parses out the start time, end time, and caption text. Here, we only want the caption text.

In [12]:
original_caption_file = 'original_vtt_files/CFCH-SFF_2014_0629_China_Teahouse_Commons_0004-000001.vtt'
caption_list = [caption.text for caption in webvtt.read(original_caption_file)]
caption_list[:5]

["Speaker 1: Hello, hello everyone we're getting set up for our next session in Teahouse Commons.",
 'Speaker 1: You are in the China Program of the 2014 Smithsonian Folklife Festival.',
 'Speaker 1: This program is presented through a collaboration by the Center for Folklife and Cultural Heritage at the Smithsonian with the China International Cultural Association, working with the China Arts and Entertainment Group in Beijing.',
 "Speaker 1: Thanks a lot for being here. Today I don't know if you've noticed but we're doing programs slightly different today.",
 'Speaker 1: The China Festival program actually brought over 120 artists and presenters from China to participate in this program.']

For our error rate calculation, we join the individual caption lines together with a space to make one long string.

In [13]:
transcription_full_text = ' '.join(caption_list)
transcription_full_text[:250]

"Speaker 1: Hello, hello everyone we're getting set up for our next session in Teahouse Commons. Speaker 1: You are in the China Program of the 2014 Smithsonian Folklife Festival. Speaker 1: This program is presented through a collaboration by the Cen"

Then we do the same thing with the .vtt file generated by Whisper...

In [14]:
whisper_caption_file = 'whisper_small_vtt/CFCH-SFF_2014_0629_China_Teahouse_Commons_0004-000001.vtt'
whisper_caption_list = [caption.text for caption in webvtt.read(whisper_caption_file)]
whisper_caption_list[:5]

["Hello, hello everyone. We're getting set up for our next session in tea house comments. You are in the",
 'China program of the 2014 Smithsonian Folklife Festival',
 'This program is presented through a collaboration by the Center for Folklife and Cultural Heritage at the Smithsonian with the China',
 'International Culture Association',
 'Working with the China Arts and Entertainment Group in Beijing. Thanks a lot for being here today']

In [15]:
whisper_full_text = ' '.join(whisper_caption_list)
whisper_full_text[:250]

"Hello, hello everyone. We're getting set up for our next session in tea house comments. You are in the China program of the 2014 Smithsonian Folklife Festival This program is presented through a collaboration by the Center for Folklife and Cultural H"

Now we use the jiwer Python library (https://github.com/jitsi/jiwer) to align the two full transcriptions, with the original transcription as the reference and the Whisper output as the hypothesis. On this first pass, we see a word error rate (WER) of 36.87%.

In [16]:
out = jiwer.process_words(transcription_full_text, whisper_full_text)
print(jiwer.visualize_alignment(out))

sentence 1
REF: Speaker 1: Hello, hello  everyone we're getting set up for our next session in Teahouse Commons.   Speaker 1: You are in the China Program of the 2014 Smithsonian Folklife Festival. Speaker 1: This program is presented through a collaboration by the Center for Folklife and Cultural Heritage at the Smithsonian with the China International Cultural Association, working with the China Arts and Entertainment Group in Beijing. Speaker 1: Thanks a lot for being here. Today I don't know if you've noticed but we're doing programs slightly different today. Speaker 1: The China Festival program actually brought over 120 artists and presenters from China to participate in this program. Speaker 1: But today    we are doing something a little bit different, we're calling it "Diaspora Day". Speaker 1: There are about    4 million people of Chinese ancestry in the United States, and today we're celebrating the way in which when people move, how they transform and adapt culture, commun

A lot of those errors appear to come from mismatches in capitalization and punctuation. Let's try converting both strings to lower case first. That drops the WER to 31.7%.

In [17]:
out = jiwer.process_words(transcription_full_text.lower(), whisper_full_text.lower())
print(jiwer.visualize_alignment(out))

sentence 1
REF: speaker 1: hello, hello  everyone we're getting set up for our next session in teahouse commons.   speaker 1: you are in the china program of the 2014 smithsonian folklife festival. speaker 1: this program is presented through a collaboration by the center for folklife and cultural heritage at the smithsonian with the china international cultural association, working with the china arts and entertainment group in beijing. speaker 1: thanks a lot for being here. today i don't know if you've noticed but we're doing programs slightly different today. speaker 1: the china festival program actually brought over 120 artists and presenters from china to participate in this program. speaker 1: but today    we are doing something a little bit different, we're calling it "diaspora day". speaker 1: there are about    4 million people of chinese ancestry in the united states, and today we're celebrating the way in which when people move, how they transform and adapt culture, commun

Now, with some regular expression help (found on Stack Overflow), we can remove the punctuation from both strings. That brings us down to a 20.35% WER. Not too bad.

That is especially true since most of the remaining errors appear to come from the speaker tags left in the original transcription. If we can remove those, we should get a much better rate.

In [19]:
import re
import string

In [20]:
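# Character class matching any single ASCII punctuation character.
# Note this includes apostrophes, so "we're" becomes "were" in both strings.
regex = re.compile('[%s]' % re.escape(string.punctuation))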

In [22]:
out = jiwer.process_words(regex.sub('', transcription_full_text.lower()), 
                          regex.sub('', whisper_full_text.lower()))
print(jiwer.visualize_alignment(out))

sentence 1
REF: speaker 1 hello hello everyone were getting set up for our next session in teahouse commons  speaker 1 you are in the china program of the 2014 smithsonian folklife festival speaker 1 this program is presented through a collaboration by the center for folklife and cultural heritage at the smithsonian with the china international cultural association working with the china arts and entertainment group in beijing speaker 1 thanks a lot for being here today i dont know if youve noticed but were doing programs slightly different today speaker 1 the china festival program actually brought over 120 artists and presenters from china to participate in this program speaker 1 but today   we are doing something a little bit different were calling it diaspora day speaker 1 there are about    4 million people of chinese ancestry in the united states and today were celebrating the way in which when people move how they transform and adapt culture community and traditions to the new *
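As a possible next step, the speaker tags could be stripped before the other normalization steps. The sketch below is one way to do that, assuming every tag has the form `Speaker N:` as in the sample captions above; the names `SPEAKER_TAG`, `PUNCTUATION`, and `normalize` are hypothetical helpers, not part of jiwer or webvtt.

```python
import re
import string

# Assumption: speaker tags always look like "Speaker 1:", "Speaker 2:", etc.
SPEAKER_TAG = re.compile(r"\bSpeaker \d+:\s*")
# Same punctuation pattern as used earlier in the notebook.
PUNCTUATION = re.compile("[%s]" % re.escape(string.punctuation))


def normalize(text: str) -> str:
    """Drop speaker tags, lower-case, strip punctuation, collapse whitespace."""
    text = SPEAKER_TAG.sub("", text)          # must run before lower-casing
    text = PUNCTUATION.sub("", text.lower())
    return " ".join(text.split())


sample = ("Speaker 1: Hello, hello everyone we're getting set up. "
          "Speaker 1: You are in the China Program.")
normalized_sample = normalize(sample)
print(normalized_sample)
# hello hello everyone were getting set up you are in the china program
```

The tag pattern is applied first because it relies on the capital "S" and the colon, both of which the later steps remove. Passing `normalize(transcription_full_text)` and `normalize(whisper_full_text)` to `jiwer.process_words` would then compare the two transcripts without the tags.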