<a href="https://colab.research.google.com/github/vanity0616/Meeting-Minutes-Whisper/blob/main/Meeting_Minutes_Transcription_Using_OpenAI's_Whisper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Meeting Minutes Transcription Using OpenAI's Whisper**

Whisper (https://github.com/openai/whisper) is a general-purpose speech recognition model that can perform multilingual speech recognition as well as speech translation. This notebook shows you how to turn recordings of face-to-face meetings or in-person classes into text with Whisper and then summarize them.

<font size='5'>[IMPORTANT]:

1.   Run the following cells one by one and do not skip any steps.

2.   Make sure you select GPU as the hardware accelerator in the notebook settings; otherwise processing will be very slow.

# **0. Requirements**

In [None]:
#@markdown This cell will take a little while to download several libraries, including Whisper.

! pip install git+https://github.com/openai/whisper.git
! pip install sounddevice wavio
! pip install ipywebrtc notebook
! apt install -y ffmpeg
! apt-get install -y libportaudio2

import os
import numpy as np

try:
    import tensorflow  # required in Colab to avoid protobuf compatibility issues
except ImportError:
    pass

import torch
import pandas as pd
import whisper
import torchaudio

from ipywebrtc import AudioRecorder, CameraStream
from IPython.display import Audio, display
import ipywidgets as widgets

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
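
Before moving on, it can help to confirm that the tools installed above are actually available. The following is a minimal sketch, not part of the original notebook: `check_environment` is a hypothetical helper that reports whether `ffmpeg` is on the PATH and which device PyTorch would use.

```python
import shutil


def check_environment(tools=("ffmpeg",)):
    """Report which command-line tools are on PATH and which device PyTorch would use."""
    report = {tool: shutil.which(tool) is not None for tool in tools}
    try:
        import torch  # preinstalled on Colab; optional elsewhere

        report["device"] = "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        report["device"] = "cpu (PyTorch not installed)"
    return report


print(check_environment())
```

If `device` is reported as `cpu` on Colab, the GPU accelerator is probably not selected in the notebook settings.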

# **1. Select File**

*1.1 Local File*

*1.2 Online Recording*





In [None]:
#@title **1.1 Upload Local File**
#@markdown You can upload any audio or video file; it is converted to WAV automatically.
!pip install pydub
from pydub import AudioSegment
from google.colab import files

uploaded = files.upload()
# Only the first uploaded file is used.
file_name = list(uploaded.keys())[0]
new_name = "my_recording.wav"
# Decode the original file (ffmpeg detects the format) and write a real WAV copy.
audio = AudioSegment.from_file(file_name)
audio.export(new_name, format="wav")

!ls

print('File uploaded and converted to my_recording.wav. Run the next cell to continue.')

In [None]:
#@title **1.2 Online Recording (Use your microphone)**
#@markdown We need to enable some Colab widgets in order to make an audio recording.
from google.colab import output
output.enable_custom_widget_manager()

In [None]:
#@markdown Once you've pressed the circle button, talk. When done, press the circle button once more. The widget will start playing back what it captured.
camera = CameraStream(constraints={'audio': True,'video':False})
recorder = AudioRecorder(stream=camera)
recorder

In [None]:
#@markdown PyTorch cannot read the above-captured audio format. We transform our recording into a format that PyTorch can understand in this step.

with open('recording.webm', 'wb') as f:
    f.write(recorder.audio.value)
!ffmpeg -i recording.webm -ac 1 -f wav my_recording.wav -y -hide_banner -loglevel panic

In my tests, the quality of the in-notebook recording was not ideal, so I recommend making the recording with Vocaroo (https://vocaroo.com/) and then uploading it in step 1.1.
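
Whichever route you take, the result should be a valid `my_recording.wav`. As a quick sanity check, here is a small sketch that uses only the standard-library `wave` module; `describe_wav` is a hypothetical helper name, and it assumes the file is plain PCM WAV, which is what the conversion steps above normally produce.

```python
import os
import wave


def describe_wav(path):
    """Return channel count, sample rate (Hz) and duration (s) of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "duration_s": frames / rate,
        }


# Only inspect the recording if an earlier cell has already created it.
if os.path.exists("my_recording.wav"):
    print(describe_wav("my_recording.wav"))
```

A duration close to zero usually means the recording or upload did not work and should be repeated.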

# **2. Models and Text Export**

*2.1 Select and Run Model*

*2.2 Text Export*

In [None]:
#@title **2.1 Select and Run Model**

#@markdown Whisper is capable of detecting the input language and performing transcriptions for many languages. However, to be on the safe side, we can explicitly tell Whisper which language to expect.

language_options = whisper.tokenizer.TO_LANGUAGE_CODE
language_list = list(language_options.keys())

In [None]:
lang_dropdown = widgets.Dropdown(options=language_list, value='english')
output = widgets.Output()
display(lang_dropdown)

In [None]:
task_dropdown = widgets.Dropdown(options=['transcribe', 'translate'], value='transcribe')
output = widgets.Output()
display(task_dropdown)

In [None]:
#@markdown |  Size  | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
#@markdown |:------:|:----------:|:------------------:|:------------------:|:-------------:|:--------------:|
#@markdown |  tiny  |    39 M    |     `tiny.en`      |       `tiny`       |     ~1 GB     |      ~32x      |
#@markdown |  base  |    74 M    |     `base.en`      |       `base`       |     ~1 GB     |      ~16x      |
#@markdown | small  |   244 M    |     `small.en`     |      `small`       |     ~2 GB     |      ~6x       |
#@markdown | medium |   769 M    |    `medium.en`     |      `medium`      |     ~5 GB     |      ~2x       |
#@markdown | large  |   1550 M   |        N/A         |      `large`       |    ~10 GB     |       1x       |
#@markdown ---
Model = 'small.en' #@param ['tiny.en', 'tiny', 'base.en', 'base', 'small.en', 'small', 'medium.en', 'medium', 'large']
#@markdown ---
#@markdown **Run this cell again if you change the model.**

#@markdown In my tests I traded accuracy against transcription speed by choosing the largest model that still transcribed reasonably fast. The small model turned out to be a good compromise.

# English-only checkpoints (names ending in ".en") cannot transcribe other
# languages, so switch to the multilingual checkpoint of the same size.
model_name = Model
if lang_dropdown.value != "english" and model_name.endswith(".en"):
  model_name = model_name[:-len(".en")]

# Load the model once.
model = whisper.load_model(model_name)
print(
    f"Model is {'multilingual' if model.is_multilingual else 'English-only'} "
    f"and has {sum(np.prod(p.shape) for p in model.parameters()):,} parameters."
)
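
The table above can also be used to choose a model programmatically. This is a rough sketch under stated assumptions: `largest_model_that_fits` is a hypothetical helper, the memory figures are the approximate values from the table, and it considers memory only, not transcription speed.

```python
# Approximate VRAM needs in GB, copied from the table above (smallest to largest).
VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}


def largest_model_that_fits(free_vram_gb, english_only=False):
    """Return the largest Whisper size whose approximate VRAM need fits the budget."""
    fitting = [name for name, need in VRAM_GB.items() if need <= free_vram_gb]
    if not fitting:
        raise ValueError("Not enough memory for even the tiny model")
    name = fitting[-1]  # dicts keep insertion order, so the last match is the largest
    # There is no English-only variant of the large model.
    if english_only and name != "large":
        name += ".en"
    return name


print(largest_model_that_fits(4, english_only=True))  # small.en
```

Because larger models are also slower, the result is best treated as an upper bound rather than a recommendation.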

In [None]:
options = whisper.DecodingOptions(language=lang_dropdown.value, task=task_dropdown.value, without_timestamps=True)
options

In [None]:
#@title **2.2 Text Export**
#@markdown All that's left to do now is feed our audio into Whisper.

# Load the entire audio file
audio = whisper.load_audio("my_recording.wav")

# Transcribe the audio, passing the language and task selected above
result = whisper.transcribe(model, audio, language=lang_dropdown.value, task=task_dropdown.value)

# Get the transcribed text
text = result["text"]

# Split the text into individual sentences
sentences = text.split('.')

# Remove empty sentences
sentences = [sentence.strip() for sentence in sentences if sentence.strip()]

# Format the sentences with line breaks
formatted_text = '\n'.join(sentences)

# Print the formatted text
print(formatted_text)
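
The cell above splits on every period, which drops the punctuation and also breaks numbers such as "3.5". As a possible alternative, here is a small sketch that splits only on whitespace following `.`, `!` or `?`. `split_sentences` is a hypothetical helper, and it is still a heuristic: abbreviations such as "Dr." will be treated as sentence ends.

```python
import re


def split_sentences(text):
    """Split on whitespace that follows ., ! or ? so punctuation and decimals survive."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


sample = "Revenue grew 3.5 percent. Any questions? Let's move on."
print("\n".join(split_sentences(sample)))
```

To use it in the notebook, `formatted_text = "\n".join(split_sentences(text))` would replace the split, strip and join steps.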


# **3. Analysis and Summary**

In [None]:
#@title **3.1 OpenAI**
#@markdown Install and import OpenAI
# The cells below use the legacy openai.Completion interface, which was removed in openai 1.0.
!pip install "openai<1"
import openai

In [None]:
#@title **3.2 OpenAI API Key**
#@markdown Enter your OpenAI API key and press Enter.
from getpass import getpass
openai.api_key = getpass()

In [None]:
#@title **3.3 Splits the text**
#@markdown Divide the text into 4 roughly equal parts. The number 4 is arbitrary; adjust it to the length of your text.

import numpy as np
# split() without arguments also splits on the line breaks in formatted_text
allwords = formatted_text.split()
parts = np.array_split(allwords, 4)
parts
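
Instead of fixing the number of parts, the split can be derived from the length of the text. The sketch below is one way to do that without NumPy; `split_by_word_budget` and the default of 500 words per chunk are assumptions for illustration, not values from the original notebook.

```python
def split_by_word_budget(text, max_words=500):
    """Split text into roughly equal consecutive chunks of at most max_words words."""
    words = text.split()
    if not words:
        return []
    n_parts = -(-len(words) // max_words)  # ceiling division
    base, extra = divmod(len(words), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        # The first `extra` chunks get one additional word each.
        size = base + (1 if i < extra else 0)
        chunks.append(" ".join(words[start:start + size]))
        start += size
    return chunks


chunks = split_by_word_budget("one two three four five six seven eight nine ten", max_words=4)
print([len(c.split()) for c in chunks])  # [4, 3, 3]
```

Each chunk is already a string, so it can be placed directly into a summarization prompt.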

In [None]:
#@title **3.4 Reassemble the first part**
#@markdown To begin, join the words of the first part back into a single block of text.
para0=' '.join(list(parts[0]))
para0

In [None]:
#@title **3.5 Summarize the first paragraph**
prompt0=f"{para0}\n\ntl;dr:"

response0 = openai.Completion.create(
  model="gpt-3.5-turbo-instruct",  # text-davinci-003 has been retired; this is its suggested replacement
  prompt=prompt0,
  temperature=0,
  max_tokens=100,
  top_p=1.0,
  frequency_penalty=0.0,
  presence_penalty=1
)
response0

In [None]:
#@title **3.6 Keep only the "text" field of "choices"**
summary0=response0["choices"][0]["text"]
summary0

In [None]:
#@title **3.7 Summary**
#@markdown Now run the same steps in a loop to summarize every part.
allsummary=[]

for part in parts:

  para=' '.join(list(part))

  prompt=f"{para}\n\ntl;dr:"

  response = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",  # text-davinci-003 has been retired; this is its suggested replacement
    prompt=prompt,
    temperature=0,
    max_tokens=100,
    top_p=1.0,
    frequency_penalty=0.0,
    presence_penalty=1
    )

  summary=response["choices"][0]["text"]
  allsummary.append(summary)

# Join the partial summaries once the loop has finished
result=" ".join(allsummary)
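
API calls occasionally fail for transient reasons such as rate limits. The loop above can be sketched as a reusable function that takes the summarizer as an argument and retries a failed call. `summarize_parts`, `retries` and `wait_s` are hypothetical names, and the summarizer is any function that maps text to a summary string.

```python
import time


def summarize_parts(parts, summarize, retries=2, wait_s=1.0):
    """Apply summarize(text) to each part, retrying failed calls a few times."""
    summaries = []
    for part in parts:
        # Parts may be strings or sequences of words (as produced by np.array_split).
        text = part if isinstance(part, str) else " ".join(part)
        for attempt in range(retries + 1):
            try:
                summaries.append(summarize(text).strip())
                break
            except Exception:
                if attempt == retries:
                    raise  # give up after the last allowed attempt
                time.sleep(wait_s * (attempt + 1))  # simple linear back-off
    return " ".join(summaries)


# Example with a stand-in summarizer that keeps the first three words of each part.
print(summarize_parts([["a", "b", "c", "d"], "e f g h"], lambda t: " ".join(t.split()[:3])))
```

In the notebook, the stand-in would be replaced by a function that sends `f"{text}\n\ntl;dr:"` to the completion endpoint and returns the `"text"` field of the first choice.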

In [None]:
#@markdown Get Summary
import textwrap
print(textwrap.fill(result,150))
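
The notebook only prints the transcript and the summary. To keep them as meeting minutes, they can be written to a file. This is a minimal sketch: `save_minutes`, the file name `meeting_minutes.md` and the Markdown layout are assumptions for illustration.

```python
from datetime import date
from pathlib import Path


def save_minutes(transcript, summary, path="meeting_minutes.md"):
    """Write the summary followed by the full transcript to a Markdown file."""
    body = (
        f"# Meeting minutes ({date.today().isoformat()})\n\n"
        f"## Summary\n\n{summary.strip()}\n\n"
        f"## Transcript\n\n{transcript.strip()}\n"
    )
    out = Path(path)
    out.write_text(body, encoding="utf-8")
    return out


# In the notebook this could be called as: save_minutes(formatted_text, result)
```

On Colab, the saved file can then be downloaded with `files.download("meeting_minutes.md")` from `google.colab`.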

# **4. GUI**


In [None]:
#@title **4.1 Install the Web UI Toolkit**
#@markdown We'll be using Gradio to provide the widgets we need to do audio recording.
! pip install -q gradio torch torchaudio


In [None]:
import gradio as gr
import torch
import torchaudio

In [None]:
#@title **4.2 Web Interface**
#@markdown After running this cell, widgets appear below that let you upload a local file or record live audio and then show the transcript.

#@markdown This web interface only shows the transcript; I did not manage to make it export the summary. To get the summary, run the notebook step by step (sections 1 to 3).
import gradio as gr
import textwrap
model = whisper.load_model("small")
def your_transcription_function(audio):
  audio = whisper.load_audio(audio)
  result = whisper.transcribe(model, audio)
  text = result["text"]
  sentences = text.split('.')
  sentences = [sentence.strip() for sentence in sentences if sentence.strip()]
  formatted_text = '\n'.join(sentences)
  return formatted_text
def analyze_and_summarize(audio):
    result = your_transcription_function(audio)
    wrapped_text = textwrap.fill(result, width=150)
    return wrapped_text
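def update_output_text(audio, record):
    # Gradio passes None for an input that was left empty, so prefer the
    # uploaded file and fall back to the microphone recording.
    path = audio or record
    if path is None:
        return "Please upload a file or record audio first."
    return analyze_and_summarize(path)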

# gr.inputs and the `source` argument were removed in Gradio 4; use gr.Audio with `sources`.
audio_input = gr.Audio(sources=["upload"], label="Upload Audio File", type="filepath")
recording_input = gr.Audio(sources=["microphone"], label="Record Audio", type="filepath")

interface = gr.Interface(fn=update_output_text, inputs=[audio_input, recording_input],
                         outputs="text")
interface.launch(share=True,debug=True)
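
One possible way to show the summary in the web interface as well is to return two values from the callback and declare two text outputs. The sketch below is untested against a live Gradio session and rests on assumptions: it targets Gradio 4 or later (for the `sources` argument), and `transcribe_and_summarize` and `build_interface` are hypothetical helpers that receive the transcription and summarization functions as arguments.

```python
def transcribe_and_summarize(path, transcribe, summarize):
    """Return (transcript, summary) for one audio file path."""
    if not path:
        return "No audio provided.", ""
    transcript = transcribe(path)
    return transcript, summarize(transcript)


def build_interface(transcribe, summarize):
    """Wire the function above into a Gradio interface with two text outputs."""
    import gradio as gr  # imported here so the helper above works without Gradio

    return gr.Interface(
        fn=lambda path: transcribe_and_summarize(path, transcribe, summarize),
        inputs=gr.Audio(sources=["upload", "microphone"], type="filepath"),
        outputs=[gr.Textbox(label="Transcript"), gr.Textbox(label="Summary")],
    )
```

In this notebook, `transcribe` could be `your_transcription_function`, and `summarize` could be any function that sends the transcript to the summarization step of section 3; calling `build_interface(...).launch(share=True)` would then start the interface.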