<img src="../images/coefficient-pyconde.png" width=1200>

# Whispered Secrets: Building An Open-Source Tool To Live Transcribe & Summarize Conversations
## 1. Transcription
**Questions?** contact@coefficient.ai / [@CoefficientData](https://twitter.com/CoefficientData)

---

## 0. Imports 📦

In [1]:
from queue import Queue

import numpy as np
import speech_recognition as sr
import torch
import whisper

## 1. Listen 🎤️

<img src="../images/speechrecognition.png" width=1200>

<img src="../images/sr-enginesupport.png" width=400>

### Configure the microphone

In [2]:
sr.Microphone.list_microphone_names()

['Jabra Evolve 75 SE',
 'Jabra Evolve 75 SE',
 'Dell USB Audio',
 'Dell USB Audio',
 'Jabra Evolve 75 SE',
 'Jabra Evolve 75 SE',
 'MacBook Pro Microphone',
 'MacBook Pro Speakers',
 'Microsoft Teams Audio']

In [3]:
print("Available microphone devices are: ")
for index, name in enumerate(sr.Microphone.list_microphone_names()):
    print(f'{index}: Microphone with name "{name}" found')

Available microphone devices are: 
0: Microphone with name "Jabra Evolve 75 SE" found
1: Microphone with name "Jabra Evolve 75 SE" found
2: Microphone with name "Dell USB Audio" found
3: Microphone with name "Dell USB Audio" found
4: Microphone with name "Jabra Evolve 75 SE" found
5: Microphone with name "Jabra Evolve 75 SE" found
6: Microphone with name "MacBook Pro Microphone" found
7: Microphone with name "MacBook Pro Speakers" found
8: Microphone with name "Microsoft Teams Audio" found


In [5]:
mic_index = 0
# Or choose interactively:
# mic_index = int(input("Please enter the index of the microphone you want to use: "))

In [6]:
source = sr.Microphone(sample_rate=16000, device_index=mic_index)
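Device indices change whenever a headset is plugged in or removed, so a hard-coded `mic_index` can silently point at the wrong device. A minimal sketch of looking the index up by name instead. `find_microphone_index` is a hypothetical helper, not part of SpeechRecognition, and the names are copied from the listing above:

```python
def find_microphone_index(names, keyword):
    """Return the index of the first device whose name contains `keyword`, or None."""
    keyword = keyword.lower()
    for index, name in enumerate(names):
        if keyword in name.lower():
            return index
    return None


# Device names copied from the output above; in the notebook you would pass
# sr.Microphone.list_microphone_names() instead.
names = [
    "Jabra Evolve 75 SE",
    "Jabra Evolve 75 SE",
    "Dell USB Audio",
    "Dell USB Audio",
    "Jabra Evolve 75 SE",
    "Jabra Evolve 75 SE",
    "MacBook Pro Microphone",
    "MacBook Pro Speakers",
    "Microsoft Teams Audio",
]

builtin_index = find_microphone_index(names, "macbook pro microphone")
print(builtin_index)  # 6
```

The result can be passed as `device_index`; a `None` result means no device matched and is worth handling explicitly.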

### Listen & transcribe

In [7]:
recorder = sr.Recognizer()

In [8]:
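# Reuse the 16 kHz microphone configured above. Opening a fresh sr.Microphone()
# here would rebind `source` to the default device and its default sample rate.
with source: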
    print("Say something!")
    audio = recorder.listen(source)

Say something!


In [9]:
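try:
    # Keep the call outside the f-string: reusing double quotes inside an
    # f-string is a syntax error before Python 3.12.
    heard = recorder.recognize_whisper(audio, language="english").strip()
    print(f"Whisper thinks you said: '{heard}'")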
except sr.UnknownValueError:
    print("Whisper could not understand audio")
except sr.RequestError as e:
    print(f"Could not request results from Whisper; {e}")

Whisper thinks you said: 'Hello?'


### Live transcription

In [10]:
audio_model = whisper.load_model("tiny.en")

In [11]:
# The SpeechRecognition Recognizer detects when speech ends.
recorder = sr.Recognizer()

# Minimum energy level that counts as speech; anything quieter is treated as silence.
recorder.energy_threshold = 300

In [12]:
# Dynamic energy compensation can lower the energy threshold so far that
# the Recognizer never stops recording, so we turn it off.
recorder.dynamic_energy_threshold = False
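To build some intuition for `energy_threshold`, here is a rough sketch that computes the RMS (root mean square) energy of 16-bit audio and compares it with the threshold. It assumes "energy" means the RMS of the raw samples; the library's own calculation may differ in detail. `rms_energy` is a hypothetical helper and the sample values are made up:

```python
import array
import math


def rms_energy(raw: bytes) -> float:
    """Root-mean-square amplitude of 16-bit signed PCM audio."""
    samples = array.array("h")  # "h" = signed 16-bit integers
    samples.frombytes(raw)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


# Made-up sample values standing in for background noise and speech.
quiet = array.array("h", [10, -12, 8, -9] * 100).tobytes()
loud = array.array("h", [2000, -1800, 2200, -2100] * 100).tobytes()

energy_threshold = 300
print(rms_energy(quiet) > energy_threshold)  # False: treated as silence
print(rms_energy(loud) > energy_threshold)   # True: treated as speech
```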

In [13]:
with source:
    recorder.adjust_for_ambient_noise(source)

In [14]:
# Thread-safe Queue for passing data from the threaded recording callback.
data_queue = Queue()

In [15]:
def record_callback(_, audio: sr.AudioData) -> None:
    """
    Threaded callback function to receive audio data when recordings finish.

    audio: An AudioData containing the recorded bytes.
    """
    data_queue.put(audio.get_raw_data())

In [16]:
transcription = [""]

#### 👇 **START TALKING!**

In [17]:
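# Maximum length of each recorded chunk in seconds. Lower values feel closer to real time.
record_timeout = 2.0

# Create a background thread that will pass us raw audio bytes.
# We could do this manually, but SpeechRecognition provides a nice helper.
# It returns a function; call stop_listening() to end the background thread.
stop_listening = recorder.listen_in_background(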
    source,
    record_callback,
    phrase_time_limit=record_timeout,
)

print("Model 'tiny.en' loaded & listening...\n")

Model 'tiny.en' loaded & listening...



In [18]:
data_queue.empty()

True

In [19]:
data_queue

<queue.Queue at 0x134be5ac0>

In [20]:
# Combine the audio chunks waiting in the queue into a single bytes object.
audio_data = b"".join(data_queue.queue)
data_queue.queue.clear()
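Reaching into `data_queue.queue` works in a demo, but the background thread can add a chunk between the join and the `clear()`, and that chunk would be lost. A minimal sketch of draining through the public `Queue` API instead. `drain` is a hypothetical helper and the byte strings stand in for recorded audio:

```python
from queue import Empty, Queue


def drain(queue: Queue) -> bytes:
    """Remove every chunk currently in the queue and return them joined."""
    chunks = []
    while True:
        try:
            chunks.append(queue.get_nowait())
        except Empty:
            break
    return b"".join(chunks)


# Stand-in chunks; in the notebook these come from record_callback.
demo_queue = Queue()
for chunk in (b"\x01\x00", b"\x02\x00", b"\x03\x00"):
    demo_queue.put(chunk)

audio_bytes = drain(demo_queue)
print(len(audio_bytes), demo_queue.empty())  # 6 True
```

Each `get_nowait()` removes exactly one item under the queue's lock, so a chunk that arrives mid-drain is either picked up now or left for the next pass.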

In [21]:
audio_data

b''

In [22]:
from IPython.display import Audio

In [23]:
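# Play the audio. The queue holds raw 16-bit PCM recorded at the microphone's
# 16 kHz sample rate, so decode the bytes into samples before playing them.
sample_rate = 16000
Audio(np.frombuffer(audio_data, dtype=np.int16), rate=sample_rate)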

In [24]:
# Convert the in-RAM buffer to something the model can use directly, without
# needing a temp file: read the bytes as 16-bit integers, cast them to 32-bit
# floats, and divide by 32768 (2**15) to scale the samples into [-1.0, 1.0).
audio_np = np.frombuffer(audio_data, dtype=np.int16).astype(np.float32) / 32768.0
audio_np

array([], dtype=float32)
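With an empty buffer the conversion is hard to see, so here is the same expression applied to four hand-picked samples covering the int16 range. The sample values are illustrative only:

```python
import numpy as np

# Hand-picked samples: the int16 minimum, silence, half scale, and the int16 maximum.
samples = np.array([-32768, 0, 16384, 32767], dtype=np.int16)
raw = samples.tobytes()  # what audio.get_raw_data() would hand us: plain bytes

# Same expression as in the cell above.
as_float = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0

print(as_float.dtype)     # float32
print(as_float.tolist())  # -1.0, 0.0, 0.5 and a value just under 1.0
```

Dividing by 32768 rather than 32767 means the most negative sample maps to exactly -1.0, and the largest positive sample lands just below 1.0.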

In [25]:
# Pass the float samples directly and use the 16 kHz rate they were recorded at.
Audio(audio_np, rate=16000)

In [26]:
# Transcribe the audio. Half precision (fp16) is only used when a CUDA GPU is available.
result = audio_model.transcribe(audio_np, fp16=torch.cuda.is_available())
result

{'text': '', 'segments': [], 'language': 'en'}

In [None]:
text = result["text"].strip()
text

"Oh, I'm roasting. Hello, I'm roasting."

In [None]:
transcription.append(text)

## 2. Live transcription demo: run `python -m demo.transcribe` from repo root 🔊

<img src="../images/transcribe.gif" width=1200>

### Change #1: typer CLI

<img src="../images/typer.png" width=1000>

<img src="../images/typer2.png" width=1000>

### Change #2: Load tiny, small, medium models

<img src="../images/load-models.png" width=800>
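A minimal sketch of how a size option could map to a Whisper model name. `whisper_model_name` is a hypothetical helper, not necessarily what the demo does. It relies on Whisper's published naming, where English-only variants carry a `.en` suffix and `large` has no such variant:

```python
def whisper_model_name(size: str, english_only: bool = True) -> str:
    """Build the name to pass to whisper.load_model()."""
    sizes = {"tiny", "base", "small", "medium", "large"}
    if size not in sizes:
        raise ValueError(f"Unknown model size: {size!r}")
    # The large model is multilingual only, so it never gets the ".en" suffix.
    if english_only and size != "large":
        return f"{size}.en"
    return size


print(whisper_model_name("tiny"))                        # tiny.en
print(whisper_model_name("medium", english_only=False))  # medium
```

The result would then be passed on, for example `whisper.load_model(whisper_model_name("small"))`.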

### Change #3: Infinite loop!

<img src="../images/loop.png" width=800>

### Change #4: Phrase detection

<img src="../images/phrase1.png" width=800>

<img src="../images/phrase2.png" width=800>
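One way phrase detection might work, sketched under assumptions: if more than `phrase_timeout` seconds have passed since the last audio arrived, the next text starts a new line, otherwise it replaces the current line. `update_transcription` and `phrase_timeout` are hypothetical names, and the demo's actual logic may differ:

```python
from datetime import datetime, timedelta


def update_transcription(transcription, text, last_audio_time, now, phrase_timeout=3.0):
    """Append `text` as a new phrase, or overwrite the phrase still in progress."""
    phrase_complete = (
        last_audio_time is None
        or now - last_audio_time > timedelta(seconds=phrase_timeout)
    )
    if phrase_complete:
        transcription.append(text)
    else:
        # Same phrase: the longer audio was re-transcribed, so replace the line.
        transcription[-1] = text
    return transcription


# Simulated timeline with made-up text.
lines = [""]
t0 = datetime(2024, 1, 1, 12, 0, 0)
update_transcription(lines, "Hello", None, t0)
update_transcription(lines, "Hello there", t0, t0 + timedelta(seconds=1))
update_transcription(lines, "New phrase", t0 + timedelta(seconds=1), t0 + timedelta(seconds=10))
print(lines)  # ['', 'Hello there', 'New phrase']
```

Overwriting the last line assumes the audio for an unfinished phrase is accumulated and transcribed again as a whole, so each new result supersedes the previous one.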