# Speech Recognition and Summarization

## Project Overview

In this project, we'll build a system that automatically recognizes speech and summarizes it. You can use it to transcribe and summarize lecture recordings, podcasts, or videos.

We'll also add a way to record from a microphone and transcribe the audio for live note-taking, for example to capture what was said in a meeting.

By the end of this project, you'll have a speech-to-text system that you can continue to build on.

### Project Steps

1. Create a speech recognition system using Vosk
2. Add punctuation to the text transcript using recasepunc
3. Summarize the text using a Hugging Face summarization pipeline
4. Create a widget to record and transcribe live audio

In [1]:
# import required libraries

from vosk import Model, KaldiRecognizer
from pydub import AudioSegment
import subprocess
from transformers import pipeline
import ipywidgets as widgets
from IPython.display import display
import json
import pyaudio



In [2]:
FRAME_RATE = 16000
CHANNELS=1

# The small model keeps the download size down. To use the larger model instead:
# model = Model(model_name="vosk-model-en-us-0.22")
model = Model(model_name="vosk-model-small-en-us-0.15")
rec = KaldiRecognizer(model, FRAME_RATE)
rec.SetWords(True)

LOG (VoskAPI:ReadDataFiles():model.cc:213) Decoding params beam=10 max-active=3000 lattice-beam=2
LOG (VoskAPI:ReadDataFiles():model.cc:216) Silence phones 1:2:3:4:5:6:7:8:9:10
LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 0 orphan nodes.
LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 0 orphan components.
LOG (VoskAPI:ReadDataFiles():model.cc:248) Loading i-vector extractor from /home/ranga/.cache/vosk/vosk-model-small-en-us-0.15/ivector/final.ie
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (VoskAPI:ReadDataFiles():model.cc:282) Loading HCL and G from /home/ranga/.cache/vosk/vosk-model-small-en-us-0.15/graph/HCLr.fst /home/ranga/.cache/vosk/vosk-model-small-en-us-0.15/graph/Gr.fst
LOG (VoskAPI:ReadDataFiles():model.cc:303) Loading winfo /home/ranga/.cache/vosk/vosk-model-small-en-us-0.15/graph/phones/word_boundary.int
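
The recognizer is told the sample rate up front, so any audio passed to it has to match `FRAME_RATE` and `CHANNELS`. As a rough sanity check, the sketch below works out how much raw PCM data a stretch of audio occupies at these settings. It assumes 16-bit samples (2 bytes each), which this notebook does not state explicitly, and `pcm_bytes` is a hypothetical helper name.

```python
FRAME_RATE = 16000   # samples per second, as in the cell above
CHANNELS = 1         # mono
SAMPLE_WIDTH = 2     # bytes per sample; assumes 16-bit PCM (an assumption, not from the notebook)

def pcm_bytes(seconds, frame_rate=FRAME_RATE, channels=CHANNELS, sample_width=SAMPLE_WIDTH):
    """Return the size in bytes of `seconds` of uncompressed PCM audio."""
    return int(seconds * frame_rate * channels * sample_width)

print(pcm_bytes(1))    # bytes in one second of audio
print(pcm_bytes(45))   # bytes in a 45-second stretch
```

Numbers like these help when deciding how much audio to hand to the recognizer in one call.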


In [3]:
mp3 = AudioSegment.from_mp3("marketplace.mp3")
mp3 = mp3.set_channels(CHANNELS)
mp3 = mp3.set_frame_rate(FRAME_RATE)

In [4]:
rec.AcceptWaveform(mp3.raw_data)
result = rec.Result()

In [5]:
text = json.loads(result)["text"]
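
Because `rec.SetWords(True)` was called earlier, the JSON returned by `rec.Result()` should also carry a `result` list with per-word timing and confidence next to the `text` field. The sketch below shows one way to read it. The `sample_result` string is hand-written for illustration and is not output from this notebook; `low_confidence_words` and the 0.8 threshold are hypothetical.

```python
import json

# Hand-written example in the shape Vosk uses when word output is enabled.
sample_result = json.dumps({
    "result": [
        {"conf": 1.0, "start": 0.36, "end": 0.51, "word": "the"},
        {"conf": 0.62, "start": 0.51, "end": 0.90, "word": "funny"},
        {"conf": 0.97, "start": 0.90, "end": 1.20, "word": "thing"},
    ],
    "text": "the funny thing",
})

def low_confidence_words(result_json, threshold=0.8):
    """Return (word, start_seconds) pairs whose confidence is below threshold."""
    parsed = json.loads(result_json)
    # "result" is missing when nothing was recognized, so default to an empty list.
    words = parsed.get("result", [])
    return [(w["word"], w["start"]) for w in words if w["conf"] < threshold]

print(low_confidence_words(sample_result))
```

A list like this could be used to flag parts of the transcript worth checking against the audio.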

In [6]:
text

"the funny thing about the big economic news of the day the fed raising interest rates have a percentage point was that there was only really one bit of actual news in the news and the interest rate increase wasn't it you know it was common i know it was common wall street news common businesses knew it was common so on this fed day on this program something a little bit different j powell in his own words five of i'm his most used economic words from today's press conference where number one of course it's the biggie two percent inflation flesh and inflation inflation inflation place in english dealing with inflation bells big worry that thing keeping him up at night price stability is the feds whole ballgame right now pal basically said as much to day or number two"

In [7]:
# %pip install transformers

In [8]:
# %pip install torch -f https://download.pytorch.org/whl/torch_stable.html

In [9]:
cased = subprocess.check_output('python recasepunc/recasepunc.py predict recasepunc/checkpoint', shell=True, text=True, input=text)

Downloading vocab.txt: 100%|██████████| 226k/226k [00:00<00:00, 700kB/s] 
Downloading tokenizer_config.json: 100%|██████████| 28.0/28.0 [00:00<00:00, 12.3kB/s]
Downloading config.json: 100%|██████████| 570/570 [00:00<00:00, 55.9kB/s]
Downloading pytorch_model.bin: 100%|██████████| 420M/420M [01:24<00:00, 5.24MB/s] 
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expe

In [10]:
cased

"The funny thing about the big economic news of the day, the Fed raising interest rates have a percentage point was that there was only really one bit of actual news in the news. And the interest rate increase wasn ' t it. You know, it was common. I know it was common Wall Street news. Common businesses knew it was common. So on this Fed day, on this program something a little bit different. J. Powell, in his own words, five of I ' m his most used economic words from today ' s press conference, where number one, Of course, it ' s the biggie Two percent inflation, flesh and inflation inflation inflation Place in English. Dealing with inflation bells. Big worry, that thing keeping him up at night. Price stability is the Feds whole ballgame right now, Pal, Basically said as much to day or number two.\n"

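The punctuated text can contain apostrophes separated from their words by spaces (for example `wasn ' t` and `today ' s`). The sketch below is one possible cleanup step, not part of recasepunc itself; `tidy_apostrophes` is a hypothetical helper, and the regular expression only handles the `word ' word` pattern.

```python
import re

def tidy_apostrophes(text):
    """Join contractions and possessives that were split into `word ' word`."""
    return re.sub(r"(\w) ' (\w)", r"\1'\2", text).strip()

sample = "And the interest rate increase wasn ' t it. From today ' s press conference.\n"
print(tidy_apostrophes(sample))
```

Running the transcript through a helper like this before summarizing may give the summarizer cleaner input.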
In [11]:
# Create speech recognition function

def voice_recognition(filename):
    # The small model keeps the download size down. To use the larger model instead:
    # model = Model(model_name="vosk-model-en-us-0.22")
    model = Model(model_name="vosk-model-small-en-us-0.15")
    rec = KaldiRecognizer(model, FRAME_RATE)
    rec.SetWords(True)
    
    # Convert to mono audio at the sample rate the recognizer expects
    mp3 = AudioSegment.from_mp3(filename)
    mp3 = mp3.set_channels(CHANNELS)
    mp3 = mp3.set_frame_rate(FRAME_RATE)
    
    step = 45000  # segment length in milliseconds
    transcript = ""
    for i in range(0, len(mp3), step):
        print(f"Progress: {i/len(mp3)}")
        segment = mp3[i:i+step]
        rec.AcceptWaveform(segment.raw_data)
        result = rec.Result()
        text = json.loads(result)["text"]
        # Add a space so words at segment boundaries don't run together
        transcript += text + " "
    
    # Restore casing and punctuation with recasepunc
    cased = subprocess.check_output('python recasepunc/recasepunc.py predict recasepunc/checkpoint', shell=True, text=True, input=transcript)
    return cased
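
pydub measures length and slices in milliseconds, so `step = 45000` cuts the recording into 45-second segments, with a shorter final segment when the length is not an exact multiple. The sketch below shows the boundaries this produces; `segment_bounds` is a hypothetical helper, and the 100-second length is made up for illustration.

```python
def segment_bounds(total_ms, step_ms=45000):
    """Return (start, end) millisecond pairs covering the whole recording."""
    return [
        (start, min(start + step_ms, total_ms))
        for start in range(0, total_ms, step_ms)
    ]

# Hypothetical 100-second recording
for start, end in segment_bounds(100_000):
    print(f"{start / 1000:.0f}s to {end / 1000:.0f}s")
```

Note that a fixed step can cut a word in half at a boundary, so a segment edge may produce a misrecognized word.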

In [12]:
# Transcribe a bigger mp3 file
transcript = voice_recognition("marketplace_full.mp3")

LOG (VoskAPI:ReadDataFiles():model.cc:213) Decoding params beam=10 max-active=3000 lattice-beam=2
LOG (VoskAPI:ReadDataFiles():model.cc:216) Silence phones 1:2:3:4:5:6:7:8:9:10
LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 0 orphan nodes.
LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 0 orphan components.
LOG (VoskAPI:ReadDataFiles():model.cc:248) Loading i-vector extractor from /home/ranga/.cache/vosk/vosk-model-small-en-us-0.15/ivector/final.ie
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (VoskAPI:ReadDataFiles():model.cc:282) Loading HCL and G from /home/ranga/.cache/vosk/vosk-model-small-en-us-0.15/graph/HCLr.fst /home/ranga/.cache/vosk/vosk-model-small-en-us-0.15/graph/Gr.fst
LOG (VoskAPI:ReadDataFiles():model.cc:303) Loading winfo /home/ranga/.cache/vosk/vosk-model-small-en-us-0.15/graph/phones/word_boundary.int


Progress: 0.0
Progress: 0.02666815218151411
Progress: 0.05333630436302822
Progress: 0.08000445654454233
Progress: 0.10667260872605644
Progress: 0.13334076090757055
Progress: 0.16000891308908466
Progress: 0.18667706527059877
Progress: 0.21334521745211288
Progress: 0.240013369633627
Progress: 0.2666815218151411
Progress: 0.29334967399665524
Progress: 0.3200178261781693
Progress: 0.34668597835968346
Progress: 0.37335413054119754
Progress: 0.4000222827227117
Progress: 0.42669043490422576
Progress: 0.4533585870857399
Progress: 0.480026739267254
Progress: 0.5066948914487681
Progress: 0.5333630436302822
Progress: 0.5600311958117963
Progress: 0.5866993479933105
Progress: 0.6133675001748246
Progress: 0.6400356523563386
Progress: 0.6667038045378528
Progress: 0.6933719567193669
Progress: 0.720040108900881
Progress: 0.7467082610823951
Progress: 0.7733764132639093
Progress: 0.8000445654454234
Progress: 0.8267127176269374
Progress: 0.8533808698084515
Progress: 0.8800490219899657
Progress: 0.90671717

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
# Summarize the transcribed audio
# t5-small is a small model. To use the pipeline's default model instead:
# summarizer = pipeline("summarization")
summarizer = pipeline("summarization", model="t5-small")

In [None]:
split_tokens = transcript.split(" ")
docs = []
for i in range(0, len(split_tokens), 850):
    selection = " ".join(split_tokens[i:(i+850)])
    docs.append(selection)
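
Splitting every 850 words can cut a sentence in half at a chunk boundary. Since the transcript now has punctuation, one alternative is to pack whole sentences into each chunk. The sketch below is a hedged variation on the cell above, not the method this notebook uses; `chunk_by_sentence` is a hypothetical helper, and the sentence split is a simple regular expression that will miss some cases (abbreviations, for instance).

```python
import re

def chunk_by_sentence(text, max_words=850):
    """Greedily pack whole sentences into chunks of at most max_words words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

print(chunk_by_sentence("One two three. Four five. Six seven eight nine.", max_words=5))
```

A single sentence longer than `max_words` is kept whole as its own chunk, so chunk sizes are a target rather than a hard limit.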

In [None]:
summaries = summarizer(docs)

In [None]:
summary = "\n\n".join([d["summary_text"] for d in summaries])

In [None]:
print(summary)

## Microphone live speech recognition

In [None]:
# Find local microphone index

p = pyaudio.PyAudio()

# Print each device's index next to its name
for i in range(p.get_device_count()):
    print(i, p.get_device_info_by_index(i).get('name'))

p.terminate()
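
The device list usually mixes microphones with speakers and other outputs. The dictionaries returned by `get_device_info_by_index` include a `maxInputChannels` field, so one way to narrow the list is to keep only devices that can record. The sketch below works on plain dictionaries so it can be tried without audio hardware; `input_devices` is a hypothetical helper and the example entries are hand-written, not taken from a real machine.

```python
def input_devices(device_infos):
    """Return (index, name) for devices that report at least one input channel."""
    return [
        (d["index"], d["name"])
        for d in device_infos
        if d.get("maxInputChannels", 0) > 0
    ]

# Hand-written entries for illustration. On a real machine the list could be built with
# [p.get_device_info_by_index(i) for i in range(p.get_device_count())].
example_infos = [
    {"index": 0, "name": "HDMI Output", "maxInputChannels": 0},
    {"index": 1, "name": "Built-in Microphone", "maxInputChannels": 2},
]
print(input_devices(example_infos))
```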

In [None]:
# Function to record from microphone

def record_microphone(seconds=10, chunk=1024, audio_format=pyaudio.paInt16):
    p = pyaudio.PyAudio()

    stream = p.open(format=audio_format,
                    channels=CHANNELS,
                    rate=FRAME_RATE,
                    input=True,
                    input_device_index=2, # match this index to your local microphone index
                    frames_per_buffer=chunk)

    frames = []

    # Each read returns `chunk` samples, so this loop captures roughly `seconds` of audio
    for i in range(0, int(FRAME_RATE / chunk * seconds)):
        data = stream.read(chunk)
        frames.append(data)

    stream.stop_stream()
    stream.close()
    # Look up the sample width before releasing the audio system
    sample_width = p.get_sample_size(audio_format)
    p.terminate()

    sound = AudioSegment(
        data=b''.join(frames),
        sample_width=sample_width,
        frame_rate=FRAME_RATE,
        channels=CHANNELS
    )
    # Save the recording so voice_recognition can load it
    sound.export("temp.mp3", "mp3")
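
The number of reads in the recording loop is `int(FRAME_RATE / chunk * seconds)`, which rounds down, so the recording can come out slightly shorter than requested. The sketch below works through that arithmetic for the default arguments; `recording_plan` is a hypothetical helper that does not touch the microphone.

```python
FRAME_RATE = 16000  # samples per second, as defined earlier in the notebook

def recording_plan(seconds=10, chunk=1024, frame_rate=FRAME_RATE):
    """Return (number_of_reads, seconds_actually_captured) for the recording loop."""
    reads = int(frame_rate / chunk * seconds)
    captured = reads * chunk / frame_rate
    return reads, captured

reads, captured = recording_plan()
print(f"{reads} reads capture about {captured:.3f} seconds")
```

If the exact duration matters, rounding the read count up with `math.ceil` would be one way to avoid coming up short.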

In [None]:
record_microphone()

In [None]:
# Create a button that records from the microphone and transcribes the recording

record_button = widgets.Button(
    description='Record',
    disabled=False,
    button_style='success',
    tooltip='Record',
    icon='microphone'
)

# Named `output` so it doesn't overwrite the `summary` text from the summarization step
output = widgets.Output()

def start_recording(data):
    with output:
        display("Starting the recording.")
        record_microphone()
        display("Finished recording.")
        transcript = voice_recognition("temp.mp3")
        display(f"Transcript: {transcript}")

record_button.on_click(start_recording)

display(record_button, output)