# Speech recognition and summarization using colab

## Project Overview

In this project, we'll build a system that can automatically recognize speech and summarize it. This can be used for automatically transcribing and summarizing lecture recordings, podcasts, or videos.

We'll also include a way to hook up a microphone to automatically record and transcribe audio for live notetaking. This could be used to record and transcribe meetings in real-time.

**Project Steps**

1. Create a speech recognition system using vosk
2. Add punctuation to the text transcript using recasepunc
3. Summarize the text using a huggingface summarization pipeline
4. Create a widget to record and transcribe live audio

In [1]:
# Install required libraries
!pip install vosk
!pip install pydub
!pip install transformers
!pip install torch -f https://download.pytorch.org/whl/torch_stable.html
!apt install libasound2-dev portaudio19-dev libportaudio2 libportaudiocpp0 ffmpeg
!pip install PyAudio
!pip install ipywidgets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in links: https://download.pytorch.org/whl/torch_stable.html
Reading package lists... Done
Building dependency tree       
Reading state information... Done
libportaudio2 is already the newest version (19.6.0-1).
libportaudiocpp0 is already the newest version (19.6.0-1).
portaudio19-dev is already the newest version (19.6.0-1).
libasound2-dev is already the newest version (1.1.3-5ubuntu0.6).
ffmpeg is already the newest version (7:3.4.11-0ubuntu0.1).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
0 upgr

In [2]:
# import required libraries

import os
from google.colab import files
import shutil
import zipfile
from vosk import Model, KaldiRecognizer
from pydub import AudioSegment
import subprocess
from transformers import pipeline
import ipywidgets as widgets
from IPython.display import display
import json
import pyaudio

In [3]:
# vosk speech recognition

FRAME_RATE = 16000
CHANNELS=1

# model = Model(model_name="vosk-model-en-us-0.22")
model = Model(model_name="vosk-model-small-en-us-0.15")
# For a smaller download size, use model = Model(model_name="vosk-model-small-en-us-0.15")
rec = KaldiRecognizer(model, FRAME_RATE)
rec.SetWords(True)

- Upload your mp3 files to `content` folder on colab runtime

In [4]:
files.upload()

Saving marketplace_full.mp3 to marketplace_full (1).mp3
Saving marketplace.mp3 to marketplace (1).mp3


In [5]:
# verify that mp3 files exist

os.listdir('/content')

['.config',
 'vosk-recasepunc-en-0.22.zip',
 'vosk-recasepunc-en-0.22',
 'marketplace_full.mp3',
 '.ipynb_checkpoints',
 'marketplace.mp3',
 'sample_data']

In [6]:
mp3 = AudioSegment.from_mp3("/content/marketplace.mp3")
mp3 = mp3.set_channels(CHANNELS)
mp3 = mp3.set_frame_rate(FRAME_RATE)

In [7]:
rec.AcceptWaveform(mp3.raw_data)
result = rec.Result()

In [8]:
text = json.loads(result)["text"]

In [9]:
text

"the funny thing about the big economic news of the day the fed raising interest rates have a percentage point was that there was only really one bit of actual news in the news and the interest rate increase wasn't it you know it was common i know it was common wall street news common businesses knew it was common so on this fed day on this program something a little bit different j powell in his own words five of i'm his most used economic words from today's press conference where number one of course it's the biggie two percent inflation flesh and inflation inflation inflation place in english dealing with inflation bells big worry that thing keeping him up at night price stability is the feds whole ballgame right now pal basically said as much to day or number two"

In [None]:
# Download recasepunc

!wget --no-check-certificate \
    https://alphacephei.com/vosk/models/vosk-recasepunc-en-0.22.zip \
    -O /content/vosk-recasepunc-en-0.22.zip

In [None]:
local_zip = '/content/vosk-recasepunc-en-0.22.zip'
zip_ref = zipfile.ZipFile(local_zip, 'r')
zip_ref.extractall('/content')
zip_ref.close()

In [10]:
# recase and punctuation

cased = subprocess.check_output('python vosk-recasepunc-en-0.22/recasepunc.py predict vosk-recasepunc-en-0.22/checkpoint', shell=True, text=True, input=text)

In [11]:
cased

"The funny thing about the big economic news of the day, the Fed raising interest rates have a percentage point was that there was only really one bit of actual news in the news. And the interest rate increase wasn ' t it. You know, it was common. I know it was common Wall Street news, common businesses knew it was common. So on this Fed day on this program, something a little bit different. J. Powell, in his own words, five of I ' m his most used economic words from today ' s press conference, where number one, Of course, it ' s the biggie Two percent inflation, flesh and inflation inflation inflation place in English. Dealing with inflation bells. Big worry, that thing keeping him up at night. Price stability is the Feds whole ballgame right now, pal. Basically said as much to day or number two.\n"

In [12]:
# Create speech recognition function

def voice_recognition(filename):
    # model = Model(model_name="vosk-model-en-us-0.22")
    model = Model(model_name="vosk-model-small-en-us-0.15")
    # For a smaller download size, use model = Model(model_name="vosk-model-small-en-us-0.15")
    rec = KaldiRecognizer(model, FRAME_RATE)
    rec.SetWords(True)
    
    mp3 = AudioSegment.from_mp3(filename)
    mp3 = mp3.set_channels(CHANNELS)
    mp3 = mp3.set_frame_rate(FRAME_RATE)
    
    step = 45000
    transcript = ""
    for i in range(0, len(mp3), step):
        print(f"Progress: {i/len(mp3)}")
        segment = mp3[i:i+step]
        rec.AcceptWaveform(segment.raw_data)
        result = rec.Result()
        text = json.loads(result)["text"]
        transcript += text
    
    cased = subprocess.check_output('python vosk-recasepunc-en-0.22/recasepunc.py predict vosk-recasepunc-en-0.22/checkpoint', shell=True, text=True, input=transcript)
    return cased

In [13]:
# Trancribe bigger mp3 file

transcript = voice_recognition("/content/marketplace_full.mp3")

Progress: 0.0
Progress: 0.02666815218151411
Progress: 0.05333630436302822
Progress: 0.08000445654454233
Progress: 0.10667260872605644
Progress: 0.13334076090757055
Progress: 0.16000891308908466
Progress: 0.18667706527059877
Progress: 0.21334521745211288
Progress: 0.240013369633627
Progress: 0.2666815218151411
Progress: 0.29334967399665524
Progress: 0.3200178261781693
Progress: 0.34668597835968346
Progress: 0.37335413054119754
Progress: 0.4000222827227117
Progress: 0.42669043490422576
Progress: 0.4533585870857399
Progress: 0.480026739267254
Progress: 0.5066948914487681
Progress: 0.5333630436302822
Progress: 0.5600311958117963
Progress: 0.5866993479933105
Progress: 0.6133675001748246
Progress: 0.6400356523563386
Progress: 0.6667038045378528
Progress: 0.6933719567193669
Progress: 0.720040108900881
Progress: 0.7467082610823951
Progress: 0.7733764132639093
Progress: 0.8000445654454234
Progress: 0.8267127176269374
Progress: 0.8533808698084515
Progress: 0.8800490219899657
Progress: 0.90671717

In [14]:
# summarization of transcribed audio
summarizer = pipeline("summarization")
# summarizer = pipeline("summarization", model="t5-small")
# For a smaller model, use: summarizer = pipeline("summarization", model="t5-small")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [15]:
split_tokens = transcript.split(" ")
docs = []
for i in range(0, len(split_tokens), 850):
    selection = " ".join(split_tokens[i:(i+850)])
    docs.append(selection)

In [16]:
summaries = summarizer(docs)

In [17]:
summary = "\n\n".join([d["summary_text"] for d in summaries])

In [18]:
print(summary)

 Elon Musk has sealed the deal to buy Twitter . He has offered to buy the company for fifty four dollars, twentycents a share . Twitter has managed Clinton on it 's platform and he wants to change that. He wants to run it as a better business and show the the the. the. .

 Twitter shares on this Monday up almost six percent still, though a couple about shy of must offer of once again fifty four dollars, twenty cents . Inflation rate for low income households was running at about nine percent before high income households as much closer to the average inflation rate at eight and a half percent .

 Americans and Hispanics are disproportionately likely to lose their jobs and your job, which you can read as your wages, right ? That affects what you buy, which gets reflected in those inflation numbers . Some economists in their survey don 't expect their firms to pay workers more in the future, say over the next three months .

 Masterworks allows you to invest in artworks by Bank C, Basque

## Microphone live speech recognition

- Note: This may only work on local system since it needs access to microphone

In [None]:
# Find local microphone index

p = pyaudio.PyAudio()
p.get_device_count()

for i in range(p.get_device_count()):
    print(p.get_device_info_by_index(i).get('name'))

In [None]:
# Function to record from microphone

def record_microphone(seconds=10, chunk=1024, audio_format=pyaudio.paInt16):
    p = pyaudio.PyAudio()

    stream = p.open(format=audio_format,
                    channels=CHANNELS,
                    rate=FRAME_RATE,
                    input=True,
                    input_device_index=2, # match this index to your local microphone index
                    frames_per_buffer=chunk)

    frames = []

    for i in range(0, int(FRAME_RATE / chunk * seconds)):
        data = stream.read(chunk)
        frames.append(data)

    stream.stop_stream()
    stream.close()
    p.terminate()

    sound = AudioSegment(
        data=b''.join(frames),
        sample_width=p.get_sample_size(audio_format),
        frame_rate=FRAME_RATE,
        channels=CHANNELS
    )
    sound.export("temp.mp3", "mp3")

In [None]:
record_microphone()

In [None]:
# Create buttons for start and stop recording and transcribe/summarize live recording from microphone

record_button = widgets.Button(
    description='Record',
    disabled=False,
    button_style='success',
    tooltip='Record',
    icon='microphone'
)

summary = widgets.Output()

def start_recording(data):
    with summary:
        display("Starting the recording.")
        record_microphone()
        display("Finished recording.")
        transcript = voice_recognition("temp.mp3")
        display(f"Transcript: {transcript}")

record_button.on_click(start_recording)

display(record_button, summary)