<a href="https://colab.research.google.com/github/torstenek/Whisper/blob/main/WhisperVideoDrive.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

If you're looking at this on GitHub and new to Python Notebooks or Colab, click the Google Colab badge above 👆

#📼 OpenAI Whisper + Google Drive Video Transcription

📺 Getting started video: https://youtu.be/YGpYinji7II

###This application will extract audio from all the video files in a Google Drive folder and create a high-quality transcription with OpenAI's Whisper automatic speech recognition system.

*Note: This requires giving the application permission to connect to your drive. Only you will have access to the contents of your drive, but please read the warnings carefully.*

This notebook application:
1. Connects to your Google Drive when you give it permission.
2. Creates a WhisperVideo folder and four subfolders (ProcessedVideo, AudioFiles, TextFiles and JsonFiles).
3. When you run the application, it searches your WhisperVideo folder for video files (.mp4, .mov, .mkv and .avi) and transcribes them. Each transcript is saved to WhisperVideo/TextFiles, a copy of the extracted audio to WhisperVideo/AudioFiles and the sentence-level segments to WhisperVideo/JsonFiles. The video file is then moved to WhisperVideo/ProcessedVideo.
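
The file selection in step 3 can be sketched as follows. This is a minimal illustration, not the notebook's own code: `select_video_files` and `VIDEO_EXTENSIONS` are hypothetical names, and the sketch matches extensions case-insensitively, which is an assumption about what is wanted for files such as `clip.MOV`.

```python
import os

# Extensions treated as video, taken from the list above.
VIDEO_EXTENSIONS = {".mp4", ".mov", ".mkv", ".avi"}

def select_video_files(file_names):
    """Keep only the names whose extension is a supported video format."""
    return [
        name for name in file_names
        if os.path.splitext(name)[1].lower() in VIDEO_EXTENSIONS
    ]

# Hypothetical folder listing: two videos, a text file and a subfolder.
sample = ["talk.mp4", "notes.txt", "clip.MOV", "ProcessedVideo"]
print(select_video_files(sample))  # ['talk.mp4', 'clip.MOV']
```

Subfolders such as ProcessedVideo have no extension, so they are skipped along with non-video files.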

###**For faster performance set your runtime to "GPU"**
*Click on "Runtime" in the menu and click "Change runtime type". Select "GPU".*
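
To confirm that the runtime change took effect, a best-effort check like the one below may help. It assumes that GPU runtimes expose the `nvidia-smi` command line tool, which is the usual case on Colab but is not guaranteed; `gpu_runtime_available` is a hypothetical helper name.

```python
import shutil
import subprocess

def gpu_runtime_available():
    """Best-effort check: True if nvidia-smi exists and exits successfully."""
    nvidia_smi = shutil.which("nvidia-smi")
    if nvidia_smi is None:
        return False
    try:
        completed = subprocess.run([nvidia_smi], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return completed.returncode == 0

print("GPU runtime detected:", gpu_runtime_available())
```

Once the libraries from step 1 are installed, `torch.cuda.is_available()` gives a more direct answer.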


**Note: If you add new files after running this application, you'll need to remount the drive in step 2 to make them searchable.**

##1. Load the code libraries

In [None]:
!pip install git+https://github.com/openai/whisper.git 
!pip install git+https://github.com/linto-ai/whisper-timestamped
!sudo apt update && sudo apt install -y ffmpeg
!pip install librosa
!pip install -Uqq ipdb
import ipdb

import whisper
# import whisper_timestamped as whisper
import time
import librosa
import soundfile as sf
import re
import os
import json

# model = whisper.load_model("tiny.en")
# model = whisper.load_model("base.en")  
model = whisper.load_model("small.en") # load the small model
# model = whisper.load_model("medium.en")
# model = whisper.load_model("large")

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-w1tv3pb_
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-w1tv3pb_
  Resolved https://github.com/openai/whisper.git to commit 6dea21fd7f7253bfe450f1e2512a0fe47ee2d258
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/linto-ai/whisper-timestamped
  Cloning https://github.com/linto-ai/whisper-timestamped to /tmp/pip-req-build-7e_f1xhv
  Running command git clone --filter=blob:none --quiet https://github.com/linto-ai/whisper-timestamped /tmp/pip-req-build-7e_f

##2. Give the application permission to mount the drive and create the folders

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount("/content/drive", force_remount=True) # This will prompt for authorization.

# This will create the WhisperVideo folders if they don't exist.
folders =  ["WhisperVideo/", "WhisperVideo/ProcessedVideo/", "WhisperVideo/TextFiles/", "WhisperVideo/AudioFiles/", "WhisperVideo/JsonFiles/"]
for folder in folders:
  path = "/content/drive/MyDrive/" + folder
  if not os.path.exists(path): # Create the folder if it does not exist
    os.mkdir(path)


Mounted at /content/drive
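
The folder setup above can also be written so that it is safe to run repeatedly without an explicit existence check. The sketch below is one way to do that with `os.makedirs(..., exist_ok=True)`; `create_whisper_folders` is a hypothetical helper, and a temporary directory stands in for `/content/drive/MyDrive` so the example runs outside Colab.

```python
import os
import tempfile

SUBFOLDERS = ["ProcessedVideo", "TextFiles", "AudioFiles", "JsonFiles"]

def create_whisper_folders(root):
    """Create root/WhisperVideo and its subfolders; safe to call repeatedly."""
    base = os.path.join(root, "WhisperVideo")
    for name in SUBFOLDERS:
        # exist_ok=True makes this a no-op when the folder is already there.
        os.makedirs(os.path.join(base, name), exist_ok=True)
    return base

# A temporary directory stands in for /content/drive/MyDrive here.
with tempfile.TemporaryDirectory() as fake_drive:
    base = create_whisper_folders(fake_drive)
    create_whisper_folders(fake_drive)  # second call changes nothing
    print(sorted(os.listdir(base)))  # ['AudioFiles', 'JsonFiles', 'ProcessedVideo', 'TextFiles']
```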


In [None]:
# Helper functions for regrouping Whisper segments into complete sentences

import json

# Function to read a JSON file and return its content as a Python dictionary
def read_json_file(file_path):
    with open(file_path, 'r') as file:
        data = json.load(file)
    return data

# Return the index of the first sentence-ending character (".", "?" or "!")
# in segmentText, or -1 if the text contains none of them.
def find_end_of_sentence(segmentText):
    positions = [segmentText.find(mark) for mark in ".?!"]
    positions = [position for position in positions if position != -1]
    return min(positions) if positions else -1

# Function to determine if the segment is an incomplete sentence.
# If the segmentText does not contain a period, question mark, or exclamation point,
# then the segment is an incomplete sentence.
def is_incomplete_sentence(segmentText):
    return find_end_of_sentence(segmentText) == -1


# transcribe walks segmentsArray from segmentIndex onward and merges segments
# into complete sentences, appending each one to the global newSegmentList.
# accumulatedSentenceStart and accumulatedSentenceText carry a sentence that
# began in an earlier segment (pass None for both to start fresh).

def transcribe(accumulatedSentenceStart, accumulatedSentenceText, segmentIndex, segmentsArray):
    sentenceStart = accumulatedSentenceStart
    sentenceText = accumulatedSentenceText or ""
    lastEnd = None

    # Loop rather than recurse so long videos cannot hit Python's recursion limit.
    for currentSegment in segmentsArray[segmentIndex:]:
        remainingText = currentSegment.get("text", "")
        lastEnd = currentSegment.get("end")

        # A new sentence starts with this segment unless one is already in progress.
        if sentenceStart is None:
            sentenceStart = currentSegment.get("start")

        # Emit every complete sentence contained in this segment.
        while not is_incomplete_sentence(remainingText):
            endIndex = find_end_of_sentence(remainingText)
            # Keep runs of punctuation such as "..." or "?!" with their sentence.
            while endIndex + 1 < len(remainingText) and remainingText[endIndex + 1] in ".?!":
                endIndex += 1

            # The new segment starts where the sentence started and ends where
            # the segment containing its last word ends.
            completeSentence = (sentenceText + remainingText[:endIndex + 1]).strip()
            newSegmentList.append({"text": completeSentence, "start": sentenceStart, "end": lastEnd})

            remainingText = remainingText[endIndex + 1:]
            sentenceText = ""
            # Whisper gives no timestamp inside a segment, so a sentence that
            # begins mid-segment uses the segment's start time as an approximation.
            sentenceStart = currentSegment.get("start")

        # Carry any unfinished text into the next segment.
        sentenceText += remainingText
        if not sentenceText.strip():
            sentenceText = ""
            sentenceStart = None

    # Keep trailing text that never reached sentence-ending punctuation.
    if sentenceText.strip():
        newSegmentList.append({"text": sentenceText.strip(), "start": sentenceStart, "end": lastEnd})
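
To make the regrouping step concrete, here is a small self-contained sketch of the same idea applied to three segments. It is an illustration under stated assumptions rather than the notebook's implementation: `merge_into_sentences` is a hypothetical name, only the first segment is taken from the sample output in this notebook, and the text and timestamps of the other two are made up. A sentence that begins in the middle of a segment is given that segment's start time, because no finer timestamp is available in the segment data.

```python
def merge_into_sentences(segments):
    """Regroup Whisper-style segments so each output item is one sentence."""
    sentences = []
    text, start = "", None
    for segment in segments:
        remaining = segment["text"]
        if start is None:
            start = segment["start"]
        while True:
            # Position of the earliest ".", "?" or "!" in the remaining text.
            hits = [i for i in (remaining.find(m) for m in ".?!") if i != -1]
            if not hits:
                break
            cut = min(hits) + 1
            sentences.append({"text": (text + remaining[:cut]).strip(),
                              "start": start, "end": segment["end"]})
            text, remaining = "", remaining[cut:]
            start = segment["start"]  # approximation for mid-segment starts
        text += remaining
        if not text.strip():
            text, start = "", None
    # Keep trailing text that never reached closing punctuation.
    if text.strip():
        sentences.append({"text": text.strip(), "start": start,
                          "end": segments[-1]["end"]})
    return sentences

segments = [
    {"start": 0.0, "end": 6.32,
     "text": " So it seems like I need to record more audio for Whisper to be able to learn"},
    {"start": 6.32, "end": 9.0, "text": " my speech patterns. Thank you"},      # made up
    {"start": 9.0, "end": 11.5, "text": " and may your feet go with you."},     # made up
]
for sentence in merge_into_sentences(segments):
    print(sentence)
```

The three input segments become two sentences: the first spans 0.0 to 9.0 seconds, and the second ("Thank you and may your feet go with you.") spans 6.32 to 11.5 seconds.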


##3. Upload any video files you want transcribed in the "WhisperVideo" folder in your Google Drive.

##4. Extract audio from the video files and create a transcription
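
Each video produces several output files in the WhisperVideo subfolders. The sketch below shows one way to derive all of the destination paths from a video file name; `output_paths` is a hypothetical helper, and it uses `os.path.splitext` so the result does not depend on the extension being exactly four characters long.

```python
import os

def output_paths(video_file, root="/content/drive/MyDrive/WhisperVideo"):
    """Map a video file name to the paths of the files generated for it."""
    stem = os.path.splitext(video_file)[0]
    return {
        "audio": f"{root}/AudioFiles/{stem}.wav",
        "text": f"{root}/TextFiles/{stem}.txt",
        "json": f"{root}/JsonFiles/{stem}.json",
        "processed": f"{root}/ProcessedVideo/{video_file}",
    }

# "lecture.mp4" is a made-up file name.
for kind, path in output_paths("lecture.mp4").items():
    print(kind, path)
```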

In [8]:
import json

# Sentences rebuilt by transcribe() are collected in this list
newSegmentList = []
# Load all the audio file paths in a Google Drive folder
from google.colab import drive
drive.mount("/content/drive", force_remount=True) # This will prompt for authorization.

# Get the list of video files from the WhisperVideo folder
video_files = os.listdir("/content/drive/MyDrive/WhisperVideo/")

# Loop through the video files and transcribe them
for video_file in video_files:

  # Skip the file if it is not a video format
  if not video_file.lower().endswith((".mp4", ".mov", ".avi", ".mkv")):
    continue

  # Extract the audio from the video file using librosa
  video_path = "/content/drive/MyDrive/WhisperVideo/" + video_file
  audio_path = "/content/drive/MyDrive/WhisperVideo/AudioFiles/" + video_file[:-4] + ".wav" # Replace the video extension with .wav

  y, sr = librosa.load(video_path, sr=16000) # Load the audio with 16 kHz sampling rate
  sf.write(audio_path, y, sr) # Save the audio as a wav file

  # Transcribe the audio file using Whisper
  result = model.transcribe(audio_path)

  print(json.dumps(result, indent = 2, ensure_ascii = False))

  segments = result.get("segments")
  print(json.dumps(segments, indent = 2, ensure_ascii = False))
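  if segments is None:
      print("No segments found")
      continue # Skip this file; exit() would stop the whole notebook session

  newSegmentList = [] # Start a fresh sentence list for each video
  transcribe(None, None, 0, segments)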
  

  print(json.dumps(newSegmentList, indent = 2, ensure_ascii = False))

  text = result["text"].strip()
  text = text.replace(". ", ".\n\n")

  # Save the transcription as a text file in Google Drive
  text_file = video_file[:-4] + ".txt" # Replace the video extension with .txt
  text_path = "/content/drive/MyDrive/WhisperVideo/TextFiles/" + text_file
  with open(text_path, "w") as f:
    f.write(text)

  json_file = video_file[:-4] + ".json"
  json_path = "/content/drive/MyDrive/WhisperVideo/JsonFiles/" + json_file
  with open(json_path, "w") as f:
    f.write(json.dumps(newSegmentList, indent = 2, ensure_ascii = False))
    
  # Move the video file to the ProcessedVideo folder
  processed_path = "/content/drive/MyDrive/WhisperVideo/ProcessedVideo/" + video_file
  os.rename(video_path, processed_path)

  # Print a message to indicate the progress
  print(f"Processed {video_file} and saved the transcription as {text_file}")

Mounted at /content/drive




{
  "text": " So it seems like I need to record more audio for Whisper to be able to learn my speech patterns and for Whisper to correctly punctuate what I'm saying. So in order to provide some more verbiage here I'm just gonna babble for a little bit. I hope you're listening Whisper and that based on what I'm saying you'll be able to punctuate the transcription accordingly. Thank you and may your feet go with you.",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 6.32,
      "text": " So it seems like I need to record more audio for Whisper to be able to learn",
      "tokens": [
        50363,
        1406,
        340,
        2331,
        588,
        314,
        761,
        284,
        1700,
        517,
        6597,
        329,
        28424,
        525,
        284,
        307,
        1498,
        284,
        2193,
        50679
      ],
      "temperature": 0.0,
      "avg_logprob": -0.1390289867625517,
      "compression_ratio"