# Speech to Text

This notebook does not require GPU access to run.

This notebook is authored by [Anmol Talwar](https://www.linkedin.com/in/anmol-talwar-922061164/) - Founder and Trainer at Talent Catalyst AI

Visit the blog at [Talent Catalyst AI](https://talentcatalystai.com/) to learn more about new-generation technology.

**SpeechRecognition**

Speech recognition is the ability of computer software to identify words and phrases in spoken language and convert them to human-readable text. We will use the [SpeechRecognition library](https://pypi.org/project/SpeechRecognition/) for this task.

With this library, we do not need to build an ML model ourselves, because it provides convenient wrappers for several well-known speech recognition engines and APIs, such as:

* CMU Sphinx (works offline)
* **Google Speech Recognition**
* Google Cloud Speech API
* Wit.ai
* Microsoft Azure Speech
* Microsoft Bing Voice Recognition (Deprecated)
* Houndify API
* IBM Speech to Text
* Snowboy Hotword Detection (works offline)
* Tensorflow
* Vosk API (works offline)
* OpenAI Whisper (works offline)
* Whisper API

In [1]:
# installing the dependencies
!pip install SpeechRecognition pydub

Collecting SpeechRecognition
  Downloading SpeechRecognition-3.10.4-py2.py3-none-any.whl (32.8 MB)
     32.8/32.8 MB 17.8 MB/s eta 0:00:00
Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Installing collected packages: pydub, SpeechRecognition
Successfully installed SpeechRecognition-3.10.4 pydub-0.25.1


In [3]:
# importing the required packages
import speech_recognition as sr

In [4]:
# path to the audio file
filename = "/content/machine-learning_speech-recognition_16-122828-0002.wav"
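
Before handing a file to the recognizer, it can help to confirm that it is a readable WAV file and to see how long it is. The sketch below uses only the standard-library `wave` module, which reads uncompressed PCM WAV files. The helper name `describe_wav` and the dictionary keys are hypothetical, chosen for this example only.

```python
import wave


def describe_wav(path):
    """Return basic properties of a PCM WAV file (hypothetical helper)."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()   # total number of audio frames
        rate = wav.getframerate()   # frames per second (Hz)
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }
```

Calling `describe_wav(filename)` on the file above would return its channel count, sample width, sample rate, and duration in seconds. A long duration is a hint that the chunked approach later in this notebook may be the better fit.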

In [5]:
# initialize the recognizer
r = sr.Recognizer()

In [6]:
# open the file
with sr.AudioFile(filename) as source:
    # listen for the data (load audio to memory)
    audio_data = r.record(source)
    # recognize (convert speech to text); this call needs internet access
    text = r.recognize_google(audio_data)
    print(text)

I believe you are just talking nonsense


**Transcribing Large Audio Files**

The function below uses split_on_silence() from the pydub.silence module to split the audio into chunks wherever it finds silence. The min_silence_len parameter is the minimum length of silence, in milliseconds, required for a split.

silence_thresh is the loudness threshold: anything quieter than it is treated as silence. Here it is set to the clip's average loudness (dBFS) minus 14. The keep_silence argument is the amount of silence, in milliseconds, to leave at the beginning and end of each detected chunk.
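
To make the three parameters concrete, here is a simplified, pure-Python sketch of the idea. It is not pydub's implementation. It assumes the audio has already been reduced to one loudness value (in dBFS) per millisecond, and the function names `find_silent_runs` and `split_ranges_on_silence` are hypothetical.

```python
def find_silent_runs(loudness_db, min_silence_len, silence_thresh):
    """Return (start, end) pairs for runs quieter than silence_thresh that
    last at least min_silence_len entries. The end index is exclusive."""
    runs = []
    run_start = None
    for i, level in enumerate(loudness_db):
        if level < silence_thresh:
            if run_start is None:
                run_start = i              # a quiet run begins here
        else:
            if run_start is not None and i - run_start >= min_silence_len:
                runs.append((run_start, i))  # long enough to count as silence
            run_start = None
    # handle a quiet run that lasts until the end of the audio
    if run_start is not None and len(loudness_db) - run_start >= min_silence_len:
        runs.append((run_start, len(loudness_db)))
    return runs


def split_ranges_on_silence(loudness_db, min_silence_len, silence_thresh,
                            keep_silence):
    """Return (start, end) ranges of the non-silent chunks, each padded by
    up to keep_silence entries on both sides."""
    n = len(loudness_db)
    runs = find_silent_runs(loudness_db, min_silence_len, silence_thresh)
    chunks = []
    cursor = 0  # start of the current non-silent stretch
    # the (n, n) sentinel closes the final chunk at the end of the audio
    for start, end in runs + [(n, n)]:
        if start > cursor:
            chunks.append((max(0, cursor - keep_silence),
                           min(n, start + keep_silence)))
        cursor = end
    return chunks
```

In this sketch a quiet stretch shorter than min_silence_len does not cause a split, which is why raising that value yields fewer, longer chunks. Padded neighbours can overlap when the gap between them is shorter than twice keep_silence, and the sketch leaves that overlap in place.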

In [7]:
# importing libraries
import speech_recognition as sr
import os
from pydub import AudioSegment
from pydub.silence import split_on_silence

In [8]:
# create a speech recognition object
r = sr.Recognizer()

# a function to recognize speech in the audio file
# so that we don't repeat ourselves in other functions
def transcribe_audio(path):
    # use the audio file as the audio source
    with sr.AudioFile(path) as source:
        audio_listened = r.record(source)
        # try converting it to text
        text = r.recognize_google(audio_listened)
    return text

In [9]:
# a function that splits the audio file into chunks on silence
# and applies speech recognition
def get_large_audio_transcription_on_silence(path):
    """Split the large audio file into chunks
    and apply speech recognition to each of these chunks."""
    # open the audio file using pydub
    sound = AudioSegment.from_file(path)
    # split the audio where silence lasts 500 milliseconds or more and get chunks
    chunks = split_on_silence(sound,
        # experiment with this value for your target audio file
        min_silence_len = 500,
        # adjust this per requirement
        silence_thresh = sound.dBFS-14,
        # keep 500 ms of silence at each end of a chunk, adjustable as well
        keep_silence=500,
    )
    folder_name = "audio-chunks"
    # create a directory to store the audio chunks
    if not os.path.isdir(folder_name):
        os.mkdir(folder_name)
    whole_text = ""
    # process each chunk
    for i, audio_chunk in enumerate(chunks, start=1):
        # export audio chunk and save it in
        # the `folder_name` directory.
        chunk_filename = os.path.join(folder_name, f"chunk{i}.wav")
        audio_chunk.export(chunk_filename, format="wav")
        # recognize the chunk
        try:
            text = transcribe_audio(chunk_filename)
        except sr.UnknownValueError as e:
            print("Error:", str(e))
        else:
            text = f"{text.capitalize()}. "
            print(chunk_filename, ":", text)
            whole_text += text
    # return the text for all chunks detected
    return whole_text
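
The threshold `sound.dBFS-14` used in the function above is relative to the clip's average loudness. As a rough illustration of where such a number comes from, the sketch below computes dBFS for raw signed PCM samples, assuming the common definition 20·log10(RMS / full-scale amplitude). The helper `dbfs` and its `sample_width_bytes` parameter are hypothetical names for this example, and this is not pydub's own code.

```python
import math


def dbfs(samples, sample_width_bytes=2):
    """Average loudness of signed integer PCM samples in dBFS.

    0 dBFS corresponds to full scale; quieter audio gives negative values.
    Assumes `samples` is a non-empty sequence of integers.
    """
    # largest magnitude representable at this sample width (32768 for 16-bit)
    max_amplitude = 2 ** (8 * sample_width_bytes - 1)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return float("-inf")  # digital silence has no finite level
    return 20 * math.log10(rms / max_amplitude)


# A signal at half of full scale sits roughly 6 dB below full scale.
half_scale = [16384, -16384] * 100
level = dbfs(half_scale)
# Mirror the notebook's choice: anything 14 dB quieter than average is silence.
silence_thresh = level - 14
```

Because the threshold is derived from the clip's own average level, the same offset of 14 adapts to quiet and loud recordings alike. The offset itself is still worth tuning per file.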

In [10]:
path = "/content/machine-learning_speech-recognition_7601-291468-0006.wav"
print("\nFull text:", get_large_audio_transcription_on_silence(path))

audio-chunks/chunk1.wav : Here's a bird which he had fixed in a bowery or a country seat. 
audio-chunks/chunk2.wav : Add a short distance from the city. 
audio-chunks/chunk3.wav : Just that what is now called dutch street. 
audio-chunks/chunk4.wav : Soon abounded with proofs of his ingenuity. 
audio-chunks/chunk5.wav : Patent smoke. 
audio-chunks/chunk6.wav : It required a horse to work some. 
audio-chunks/chunk7.wav : Dutch ovens that roasted meat without fire. 
audio-chunks/chunk8.wav : Carts that went before the horses. 
audio-chunks/chunk9.wav : Weather cox that turned against the wind and other wrong-headed contrivances. 
audio-chunks/chunk10.wav : Set astonished and confounded all beholders. 

Full text: Here's a bird which he had fixed in a bowery or a country seat. Add a short distance from the city. Just that what is now called dutch street. Soon abounded with proofs of his ingenuity. Patent smoke. It required a horse to work some. Dutch ovens that roasted meat without fire. C