# **Videos Transcription and Translation with Faster Whisper and ChatGPT**


[![notebook shield](https://img.shields.io/static/v1?label=&message=Notebook&color=blue&style=for-the-badge&logo=googlecolab&link=https://colab.research.google.com/github/lewangdev/autotranslate/blob/main/autotranslate.ipynb)](https://colab.research.google.com/github/lewangdev/autotranslate/blob/main/autotranslate.ipynb)
[![repository shield](https://img.shields.io/static/v1?label=&message=Repository&color=blue&style=for-the-badge&logo=github&link=https://github.com/lewangdev/autotranslate)](https://github.com/lewangdev/autotranslate)

This notebook guides you through transcribing and translating videos with [Faster Whisper](https://github.com/guillaumekln/faster-whisper) and ChatGPT. You can explore most inference parameters, or use the notebook as-is to store the transcript, translation, and video audio in your Google Drive.

In [None]:
#@markdown # **Check GPU type** 🕵️

#@markdown The type of GPU assigned to your Colab session determines how fast the video is transcribed.
#@markdown The higher the number of floating point operations per second (FLOPS), the faster the transcription.
#@markdown But even the least powerful GPU available in Colab is able to run any Whisper model.
#@markdown Make sure you've selected `GPU` as hardware accelerator for the Notebook (Runtime &rarr; Change runtime type &rarr; Hardware accelerator).

#@markdown |  GPU   |  GPU RAM   | FP32 teraFLOPS |     Availability   |
#@markdown |:------:|:----------:|:--------------:|:------------------:|
#@markdown |  T4    |    16 GB   |       8.1      |         Free       |
#@markdown | P100   |    16 GB   |      10.6      |      Colab Pro     |
#@markdown | V100   |    16 GB   |      15.7      |  Colab Pro (Rare)  |

#@markdown ---
#@markdown **Factory reset your Notebook's runtime if you want to get assigned a new GPU.**

!nvidia-smi -L

!nvidia-smi
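
The same check can be scripted so that later cells can react to a missing GPU. This is a minimal sketch, assuming only that `nvidia-smi -L` prints one line per GPU; the helper name `list_gpus` is hypothetical and not part of the notebook.

```python
import shutil
import subprocess


def list_gpus():
    """Return one description line per GPU, or an empty list if none is visible."""
    # nvidia-smi is absent on CPU-only runtimes, so check for it first
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = list_gpus()
print(gpus if gpus else "No GPU detected: select a GPU runtime before continuing.")
```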

In [None]:
#@markdown # **Install libraries** 🏗️
#@markdown This cell will take a little while to download several libraries.

#@markdown ---
! pip install faster-whisper==0.10.0
! pip install yt-dlp==2023.11.16
! pip install openai==0.28.1

# Windows libs: https://github.com/Purfview/whisper-standalone-win/releases/download/libs/cuBLAS.and.cuDNN_win_v2.7z
! apt install -y p7zip-full p7zip-rar
! wget https://github.com/Purfview/whisper-standalone-win/releases/download/libs/cuBLAS.and.cuDNN_linux_v2.7z
! 7z x cuBLAS.and.cuDNN_linux_v2.7z -o/usr/lib



In [None]:
#@markdown # **Import libraries for Python** 🐍

#@markdown This cell imports the Python libraries used by the rest of the notebook.
import sys
import warnings
from faster_whisper import WhisperModel
from pathlib import Path
import yt_dlp
import subprocess
import torch
import shutil
import numpy as np
from IPython.display import display, Markdown, YouTubeVideo

# Fall back to the CPU when no CUDA device is available
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print('Using device:', device, file=sys.stderr)

In [None]:
#@markdown # **Optional:** Save data in Google Drive 💾
#@markdown Enter a Google Drive path and run this cell if you want to store the results inside Google Drive.

# Mount Google Drive so the generated transcripts can be copied there.
from google.colab import drive
drive_mount_path = Path("/") / "content" / "drive"
drive.mount(str(drive_mount_path))
drive_mount_path /= "My Drive"
#@markdown ---
drive_path = "Colab Notebooks/Videos Transcription and Translation" #@param {type:"string"}
#@markdown ---
#@markdown **Run this cell again if you change your Google Drive path.**

drive_whisper_path = drive_mount_path / Path(drive_path.lstrip("/"))
drive_whisper_path.mkdir(parents=True, exist_ok=True)
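
The `lstrip("/")` above matters because joining an absolute path onto another path discards the left-hand side. The sketch below illustrates this with `PurePosixPath`, so it only manipulates path strings and touches no files; `resolve_drive_folder` is a hypothetical helper used for illustration.

```python
from pathlib import PurePosixPath


def resolve_drive_folder(mount_root, user_path):
    """Join a user-supplied folder onto the Drive root, ignoring leading slashes."""
    return PurePosixPath(mount_root) / user_path.lstrip("/")


root = "/content/drive/My Drive"

# Without lstrip, an absolute right-hand side replaces the whole left-hand path
print(PurePosixPath(root) / "/Colab Notebooks")
# With lstrip, the folder stays inside the mounted Drive
print(resolve_drive_folder(root, "/Colab Notebooks"))
```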

In [None]:
#@markdown # **Model selection** 🧠

#@markdown Whisper models come in five sizes; the largest is available in three versions (`large-v1`, `large-v2`, `large-v3`):

#@markdown |   Size   | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
#@markdown |:--------:|:----------:|:------------------:|:------------------:|:-------------:|:--------------:|
#@markdown |   tiny   |    39 M    |     `tiny.en`      |       `tiny`       |    ~0.8 GB    |      ~32x      |
#@markdown |   base   |    74 M    |     `base.en`      |       `base`       |    ~1.0 GB    |      ~16x      |
#@markdown |  small   |   244 M    |     `small.en`     |      `small`       |    ~1.4 GB    |      ~6x       |
#@markdown |  medium  |   769 M    |    `medium.en`     |      `medium`      |    ~2.7 GB    |      ~2x       |
#@markdown | large-v1 |   1550 M   |        N/A         |     `large-v1`     |    ~4.3 GB    |       1x       |
#@markdown | large-v2 |   1550 M   |        N/A         |     `large-v2`     |    ~4.3 GB    |       1x       |
#@markdown | large-v3 |   1550 M   |        N/A         |     `large-v3`     |    ~3.6 GB    |       1x       |

#@markdown ---
model_size = 'large-v2' #@param ['tiny', 'tiny.en', 'base', 'base.en', 'small', 'small.en', 'medium', 'medium.en', 'large-v1', 'large-v2', 'large-v3']
device_type = "cuda" #@param ['cuda', 'cpu']
compute_type = "float16" #@param ['float16', 'int8_float16', 'int8']
#@markdown ---
#@markdown **Run this cell again if you change the model.**

model = WhisperModel(model_size, device=device_type, compute_type=compute_type)
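
`float16` generally needs a GPU, so switching `device_type` to `cpu` usually means changing `compute_type` as well. The following sketch shows one way to pick a matching value; the helper `pick_compute_type` and its `low_memory` flag are assumptions made for illustration and are not part of faster-whisper.

```python
def pick_compute_type(device_type, low_memory=False):
    """Suggest a compute_type string for the given device."""
    if device_type == "cuda":
        # int8_float16 trades a little accuracy for a smaller memory footprint
        return "int8_float16" if low_memory else "float16"
    if device_type == "cpu":
        # float16 is generally not efficient on CPUs; int8 is the usual choice
        return "int8"
    raise ValueError(f"Unsupported device type: {device_type!r}")


print(pick_compute_type("cuda"))
print(pick_compute_type("cpu"))
```

The returned string could then be passed as `compute_type=` when constructing `WhisperModel`.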


In [None]:
#@markdown # **Video selection** 📺

#@markdown Enter the URL of the video you want to transcribe

#@markdown #### **Video or playlist URL**
URL = "https://youtu.be/pTCxXZh6VyE?si=ItBkmwhihuxInLjp" #@param {type:"string"}
# store_audio = True #@param {type:"boolean"}
#@markdown ---
def download_video(URL):
  """Download the audio track of URL and return the path of the extracted WAV file."""
  ydl_opts = {
      'format': 'm4a/bestaudio/best',
      'outtmpl': '%(id)s.%(ext)s',
      # ℹ️ See help(yt_dlp.postprocessor) for a list of available Postprocessors and their arguments
      'postprocessors': [{  # Extract audio using ffmpeg
          'key': 'FFmpegExtractAudio',
          'preferredcodec': 'wav',
      }]
  }

  with yt_dlp.YoutubeDL(ydl_opts) as ydl:
      # Download the audio and fetch the metadata in a single call
      video_info = ydl.extract_info(URL, download=True)

  # The FFmpegExtractAudio postprocessor writes <id>.wav into the working directory.
  # Note: for a playlist URL, 'id' is the playlist id, so only single videos resolve correctly.
  video_path_local = Path(f"{video_info['id']}.wav")
  print(video_path_local)
  return video_path_local
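
If a download ever yields a video container such as `.mp4` instead of a `.wav` file, ffmpeg can extract the audio track. The sketch below only builds the command (16 kHz, mono, 16-bit PCM) and does not run it; `build_ffmpeg_command` is a hypothetical helper, and the file name is derived from the example URL's video id.

```python
from pathlib import Path


def build_ffmpeg_command(source):
    """Return (command, wav_path) for converting `source` to 16 kHz mono WAV."""
    source = Path(source)
    target = source.with_suffix(".wav")
    command = [
        "ffmpeg",
        "-i", str(source),       # input file
        "-vn",                   # drop the video stream
        "-acodec", "pcm_s16le",  # 16-bit little-endian PCM
        "-ar", "16000",          # 16 kHz sample rate
        "-ac", "1",              # mono
        str(target),
    ]
    return command, target


command, wav_path = build_ffmpeg_command("pTCxXZh6VyE.mp4")
print(" ".join(command))
```

The resulting list could be handed to `subprocess.run(command, check=True)` on a machine where ffmpeg is installed.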



In [None]:
def run_model(video_path_local):
  #@markdown # **Run the model** 🚀

  #@markdown Run this cell to transcribe the video. This can take a while; the duration varies with the length of the video and the number of parameters of the model selected above.
  def seconds_to_time_format(s):
      # Work in whole milliseconds so that rounding can never produce ",1000"
      total_ms = int(round(s * 1000))
      hours, total_ms = divmod(total_ms, 3_600_000)
      minutes, total_ms = divmod(total_ms, 60_000)
      seconds, milliseconds = divmod(total_ms, 1000)

      # Return the SRT-style timestamp HH:MM:SS,mmm
      return f"{hours:02d}:{minutes:02d}:{seconds:02d},{milliseconds:03d}"


  #@markdown ## **Parameters** ⚙️

  #@markdown ### **Behavior control**
  #@markdown #### Language
  language_options = {
      "Auto Detect": "auto",
      "English": "en",
      "中文(Chinese)": "zh",
      "日本語(Japanese)": "ja",
      "Deutsch(German)": "de",
      "Français(French)": "fr"
  }

  language_option = "Auto Detect" #@param ["Auto Detect", "English", "中文(Chinese)", "日本語(Japanese)", "Deutsch(German)", "Français(French)"] {allow-input: true}
  language = language_options.get(language_option, language_option)

  #@markdown #### Initial prompt
  initial_prompt = "Hello, Let's begin to talk." #@param {type:"string"}
  #@markdown ---
  #@markdown #### Word-level timestamps
  word_level_timestamps = True #@param {type:"boolean"}
  #@markdown ---
  #@markdown #### VAD filter
  vad_filter = False #@param {type:"boolean"}
  vad_filter_min_silence_duration_ms = 50 #@param {type:"integer"}
  #@markdown ---


  segments, info = model.transcribe(str(video_path_local), beam_size=5,
                                    language=None if language == "auto" else language,
                                    initial_prompt=initial_prompt,
                                    word_timestamps=word_level_timestamps,
                                    vad_filter=vad_filter,
                                    vad_parameters=dict(min_silence_duration_ms=vad_filter_min_silence_duration_ms))

  language_detected = info.language
  display(Markdown(f"Detected language '{info.language}' with probability {info.language_probability}"))

  fragments = []

  for segment in segments:
    print(f"[{seconds_to_time_format(segment.start)} --> {seconds_to_time_format(segment.end)}] {segment.text}")
    if word_level_timestamps:
      for word in segment.words:
        ts_start = seconds_to_time_format(word.start)
        ts_end = seconds_to_time_format(word.end)
        #print(f"[{ts_start} --> {ts_end}] {word.word}")
        fragments.append(dict(start=word.start,end=word.end,text=word.word))
    else:
      ts_start = seconds_to_time_format(segment.start)
      ts_end = seconds_to_time_format(segment.end)
      #print(f"[{ts_start} --> {ts_end}] {segment.text}")
      fragments.append(dict(start=segment.start,end=segment.end,text=segment.text))

  def merge_word_to_sentences():
    #@title Merge words/segments to sentences

    #@markdown Run this cell to merge words/segments to sentences.
    #@markdown ## **Parameters** ⚙️

    #@markdown ### **Behavior control**
    #@markdown #### Maximum gap in milliseconds between two sentences
    max_gap_ms_between_two_sentence = 200 #@param {type:"integer"}

    import json

    # Merge words/segments to sentences
    def merge_fragments(fragments, gap_ms):
      new_fragments = []
      new_fragment = {}
      length = len(fragments)
      for i, fragment in enumerate(fragments):
        start = fragment['start']
        end = fragment['end']
        text = fragment['text']

        if new_fragment.get('start', None) is None:
          new_fragment['start'] = start
        if new_fragment.get('end', None) is None:
          new_fragment['end'] = end
        if new_fragment.get('text', None) is None:
          new_fragment['text'] = ""

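        if start - new_fragment['end'] > gap_ms:
          # A long pause closes the current sentence; this fragment starts the next one.
          # Fall through (no `continue`) so the end-of-sentence check below still runs
          # and the final fragment is never dropped.
          new_fragments.append(new_fragment)
          new_fragment = dict(start=start, end=end, text="")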

        new_fragment['end'] = end

        #delimiter = '' if text.startswith('-') else ' '
        delimiter = ' ' if language_detected in ['en', 'de', 'fr'] else ''
        new_fragment['text'] = f"{new_fragment['text']}{delimiter}{text.lstrip()}"

        # A sentence ends at terminal punctuation (. ? ! 。 ？ ！) or at the last fragment
        if (len(text) > 0 and text[-1] in ['.', '?', '。', '？', '!', '！']) or i == length-1:
          new_fragments.append(new_fragment)
          new_fragment = {}
      return new_fragments


    new_fragments = merge_fragments(fragments, max_gap_ms_between_two_sentence/1000.0)

    # Save as JSON file
    json_ext_name = ".json"
    json_transcript_file_name = video_path_local.stem + json_ext_name
    with open(json_transcript_file_name, 'w', encoding='utf-8') as f:
      f.write(json.dumps(new_fragments))
    display(Markdown(f"**Transcript JSON file created: {video_path_local.parent / json_transcript_file_name}**"))

    # Save as srt
    srt_ext_name = ".srt"
    srt_transcript_file_name = video_path_local.stem + srt_ext_name
    with open(srt_transcript_file_name, 'w', encoding='utf-8') as f:
      for sentence_idx, fragment in enumerate(new_fragments):
        ts_start = seconds_to_time_format(fragment['start'])
        ts_end = seconds_to_time_format(fragment['end'])
        text = fragment['text']
        print(f"[{ts_start} --> {ts_end}] {text}")
        f.write(f"{sentence_idx + 1}\n")
        f.write(f"{ts_start} --> {ts_end}\n")
        f.write(f"{text.strip()}\n\n")

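    try:
      shutil.copy(video_path_local.parent / srt_transcript_file_name,
                  drive_whisper_path / srt_transcript_file_name)
      display(Markdown(f"**Transcript SRT file copied to: {drive_whisper_path / srt_transcript_file_name}**"))
      # `df` and `index` are globals set by the main loop cell; in single-video
      # mode they do not exist and the resulting NameError is handled below
      df.at[index, 'status'] = 'Transcribed'
    except Exception:
      # Drive is not mounted, or there is no CSV row to update: keep the local file
      display(Markdown(f"**Transcript SRT file created: {video_path_local.parent / srt_transcript_file_name}**"))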

  merge_word_to_sentences()
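
The cell above boils down to three steps: merge word fragments into sentences, format the timestamps, and write SRT blocks. The standalone sketch below runs those steps on made-up word timings. It is deliberately simplified (it always joins words with a space and takes the gap in seconds), so it illustrates the idea rather than reproducing the cell exactly; all function names are hypothetical.

```python
SENTENCE_END = (".", "?", "!", "。", "？", "！")


def to_srt_time(seconds):
    """Format seconds as HH:MM:SS,mmm using whole milliseconds."""
    total_ms = int(round(seconds * 1000))
    hours, total_ms = divmod(total_ms, 3_600_000)
    minutes, total_ms = divmod(total_ms, 60_000)
    secs, millis = divmod(total_ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def merge_into_sentences(fragments, max_gap_s=0.2):
    """Group word fragments into sentences on punctuation or long pauses."""
    sentences, current = [], None
    for frag in fragments:
        word = frag["text"].strip()
        if current is not None and frag["start"] - current["end"] > max_gap_s:
            sentences.append(current)  # a long pause closes the sentence
            current = None
        if current is None:
            current = {"start": frag["start"], "end": frag["end"], "text": word}
        else:
            current["end"] = frag["end"]
            current["text"] += " " + word
        if word.endswith(SENTENCE_END):
            sentences.append(current)
            current = None
    if current is not None:
        sentences.append(current)  # flush a trailing unfinished sentence
    return sentences


def to_srt(sentences):
    """Render sentences as SRT text: index, time range, text, blank line."""
    blocks = []
    for idx, s in enumerate(sentences, start=1):
        time_range = f"{to_srt_time(s['start'])} --> {to_srt_time(s['end'])}"
        blocks.append(f"{idx}\n{time_range}\n{s['text']}\n")
    return "\n".join(blocks)


# Made-up word timings for illustration
words = [
    {"start": 0.0, "end": 0.4, "text": " Hello"},
    {"start": 0.4, "end": 0.9, "text": " world."},
    {"start": 1.5, "end": 1.8, "text": " Next"},
    {"start": 2.5, "end": 3.0, "text": " part"},
]
print(to_srt(merge_into_sentences(words)))
```

Working in whole milliseconds avoids a timestamp such as `00:00:01,1000` when the fractional part rounds up to a full second.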


In [None]:
# @markdown # **Mount Drive and load CSV**
# @markdown After mounting, you can find the file's path in the Drive folder of the file browser.
#@markdown The CSV file must be stored in Google Drive because its `status` column is updated while transcribing.


import pandas as pd
from google.colab import drive
google_drive_csv_path = "/content/drive/MyDrive/video_links.csv" #@param {type:"string"}

# @markdown If the CSV file cannot be read, the video given in the Video selection section is transcribed instead. To transcribe a single video, enter a dummy path in this input.

# Mount Google Drive to save and load checkpoints
drive.mount('/content/drive', force_remount=True)
#Read CSV File
try:
  df = pd.read_csv(google_drive_csv_path)
  print(df)
except Exception:
  # No readable CSV: fall back to the single video selected above
  video_path = download_video(URL)
  run_model(video_path)
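
The main loop reads two columns from the CSV, `video_url` and `status`. The sketch below builds a matching file in memory with the standard `csv` module (used here instead of pandas to keep the example dependency-free); the second URL and the `Pending` value are placeholders chosen for illustration.

```python
import csv
import io

rows = [
    {"video_url": "https://youtu.be/pTCxXZh6VyE", "status": "Pending"},
    {"video_url": "https://example.com/another-video", "status": "Transcribed"},
]

# Write to an in-memory buffer; a real file would be opened with newline=""
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["video_url", "status"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```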

In [None]:
# @markdown # **Main cell**
# @markdown This cell loops through all rows of the CSV file and transcribes each video. Transcribed videos are marked as `Transcribed` so that they are not transcribed again.
for index, row in df.iterrows():
    video_url = row['video_url']
    # str() guards against empty cells, which pandas reads as NaN (a float)
    status = str(row['status']).strip().lower()
    print(video_url, "-", status)
    # Skip videos that are already transcribed
    if status != 'transcribed':
      video_path = download_video(video_url)
      run_model(video_path)
      # run_model updates the row's status in df; persist it to the CSV right away
      df.to_csv(google_drive_csv_path, index=False)
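
The skip logic depends on how the `status` cell is normalized. A small sketch of that check in isolation, assuming that empty CSV cells may arrive as `None` or NaN rather than as strings; `needs_transcription` is a hypothetical helper.

```python
def needs_transcription(status):
    """Return True unless the status cell says the video is already transcribed."""
    # Empty CSV cells can arrive as None or float('nan') rather than strings
    if not isinstance(status, str):
        return True
    return status.strip().lower() != "transcribed"


for value in ["Transcribed", " transcribed ", "Pending", "", None, float("nan")]:
    print(repr(value), "->", needs_transcription(value))
```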


