First, we install Whisper directly from its GitHub repository.

In [None]:
!pip install git+https://github.com/openai/whisper.git 

Next we pull down some audio to transcribe.

Finally, we run Whisper! It may take a little time to get started, but soon the transcription should start to appear.

In [None]:
!whisper "/content/morrison_297569.mp4" --model small --language English 

## Checking Whisper's Work

Whisper hasn't just produced text; it has also given us the time interval in which it believes each piece of text occurred. In this section we'll read in Whisper's transcript, split up the audio according to Whisper's timestamps, and then print Whisper's text and play the corresponding audio. How well do they match?

In [None]:
import pandas as pd
import numpy as np
import IPython.display as ipd

import warnings
warnings.filterwarnings('ignore')

Whisper's output is saved in `.vtt` format; we'll install `webvtt-py`, a package that can read that format.

In [None]:
!pip install webvtt-py

In [None]:
import webvtt
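
For orientation, a `.vtt` file is plain text: a `WEBVTT` header followed by cues, each with a `start --> end` timing line and one or more lines of text. The sketch below parses a small made-up sample using only the standard library. The sample text and the `parse_vtt` helper are illustrative assumptions, not Whisper output or part of `webvtt-py`, which handles the full format.

```python
import re

# Made-up sample in the WebVTT layout (not real Whisper output)
SAMPLE_VTT = """WEBVTT

00:00.000 --> 00:04.500
Hello and welcome.

00:04.500 --> 00:09.000
Today we talk about
speech recognition.
"""

# Timing line: "start --> end", where the hours field is optional
CUE_TIMING = re.compile(
    r"^((?:\d+:)?\d{2}:\d{2}\.\d{3})\s+-->\s+((?:\d+:)?\d{2}:\d{2}\.\d{3})"
)

def parse_vtt(text):
    """Hypothetical minimal parser: returns a list of (start, end, text) tuples."""
    cues = []
    # Cues are separated from each other by blank lines
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = CUE_TIMING.match(line)
            if match:
                # Everything after the timing line is the cue text
                cues.append((match.group(1), match.group(2), " ".join(lines[i + 1:])))
                break
    return cues

for start, end, text in parse_vtt(SAMPLE_VTT):
    print(start, end, text)
```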

`librosa` is a library for reading and manipulating audio files.

In [None]:
import librosa

We have two custom functions here, one to convert H:M:S timestamps into seconds, and another to trim out a chunk of audio corresponding to a particular `start` and `end` time.

In [None]:
def simple_hms(s):
  """Convert an 'H:M:S' timestamp string (e.g. '00:01:23.450') to seconds."""
  h,m,sec = [float(x) for x in s.split(':')]
  return 3600*h + 60*m + sec
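
As a quick check of the arithmetic, `00:01:23.450` is 0 hours, 1 minute, and 23.45 seconds, or 83.45 seconds in total. WebVTT also permits timestamps that omit the hours field (`MM:SS.mmm`), so the sketch below is a slightly more tolerant variant in case timestamps ever arrive in that form. `hms_to_seconds` is a hypothetical helper name, not part of this notebook or of `webvtt-py`.

```python
def hms_to_seconds(timestamp):
    """Convert 'HH:MM:SS.mmm' or 'MM:SS.mmm' to seconds (illustrative helper)."""
    seconds = 0.0
    # Each field is worth 60 times the field to its right
    for part in timestamp.split(':'):
        seconds = 60 * seconds + float(part)
    return seconds

print(hms_to_seconds("00:01:23.450"))  # with hours
print(hms_to_seconds("01:23.450"))     # hours omitted
```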

In [None]:
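def trim_audio(row,audio,sample_rate):
  """Return the samples of `audio` between row.start_s and row.end_s (in seconds)."""
  # Time of each sample, in seconds
  t = np.arange(len(audio))/sample_rate
  # Keep only the samples that fall inside this segment's interval
  mask = (t>=row.start_s) & (t<=row.end_s)
  return audio[mask]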
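
Because the samples are evenly spaced, the same segment can also be selected by converting the start and end times directly into sample indices, which avoids building a full time array for every row. This is a sketch of that alternative, not the notebook's method. `trim_audio_slice` is a hypothetical name, and floating-point rounding at the boundaries may make it differ from `trim_audio` above by a sample.

```python
import math
import numpy as np

def trim_audio_slice(audio, sample_rate, start_s, end_s):
    """Return samples whose times fall within [start_s, end_s] (illustrative helper)."""
    first = max(0, math.ceil(start_s * sample_rate))  # first sample at or after start_s
    last = math.floor(end_s * sample_rate)            # last sample at or before end_s
    return audio[first:last + 1]

# Toy signal: 10 samples at 2 samples per second, so sample i occurs at i/2 seconds
toy = np.arange(10)
print(trim_audio_slice(toy, 2, 1.0, 2.0))
```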

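As promised, we use `webvtt` to read in the transcript and `librosa` to read in the audio. By default, `librosa.load` converts the audio to mono and resamples it to 22,050 Hz, so `sample_rate` reflects that rate rather than the file's original one.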

In [None]:
transcript = webvtt.read('/content/morrison_297569.vtt')
audio,sample_rate = librosa.load('/content/morrison_297569.mp4')

For convenience we're going to set up a Pandas dataframe to store the various quantities we want to track. Each row will correspond to one segment of the Whisper transcript.

In [None]:
df = pd.DataFrame(columns=['start','end','text'])

df['start'] = [x.start for x in transcript]
df['end'] = [x.end for x in transcript]
df['text'] = [x.text for x in transcript]
df['start_s'] = df['start'].apply(simple_hms)
df['end_s'] = df['end'].apply(simple_hms)
df['audio'] = df.apply(trim_audio,axis=1,args=(audio,sample_rate))
df.head()
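
To compare text and audio for a given segment, one option is to print the row's text and pass its samples to `IPython.display.Audio`. The sketch below assumes the `df` and `sample_rate` built above. `describe_segment` and `play_segment` are hypothetical helper names, and the toy DataFrame at the end is made-up data that stands in for the real one so the example runs on its own.

```python
import pandas as pd

def describe_segment(row):
    """Format one segment's time span and text for printing."""
    return f"[{row['start_s']:.2f}s - {row['end_s']:.2f}s] {row['text']}"

def play_segment(row, sample_rate):
    """Print the segment's text and return an audio player for its samples."""
    # Imported here so describe_segment also works outside a notebook
    import IPython.display as ipd
    print(describe_segment(row))
    return ipd.Audio(row['audio'], rate=sample_rate)

# Toy stand-in for the notebook's df (made-up values, no audio column)
toy_df = pd.DataFrame({
    'start_s': [0.0],
    'end_s': [1.0],
    'text': ['Hello and welcome.'],
})
print(describe_segment(toy_df.iloc[0]))

# In the notebook, with the real df, the last expression in a cell could be:
# play_segment(df.iloc[0], sample_rate)
```

Because a notebook only renders the player when it is the cell's final expression, looping over several rows would need `ipd.display(...)` around each returned player.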

In [None]:
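# The 'audio' column holds NumPy arrays, which a CSV would store only as truncated text, so we leave it out
df.drop(columns=['audio']).to_csv("morrison_297569.csv")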

In [None]:
#df.to_json("morrison_297569_main.json")
