# What I learned when trying to transcribe an audio file with Python


## Overview
Once a month I attend a meeting at my department to discuss issues related to the undergraduate chemistry course. The meeting is recorded with a mobile phone, and the secretary later writes the minutes, which are filed after the participants approve them. A recording is typically about 2 hours long, and the eight participants speak Brazilian Portuguese with different accents.

To automate the transcription of this audio (named `meeting.wav`), and to make it one of my adventures with Python, I decided to try doing the job in Python.

Since I have no previous knowledge of speech recognition (the process of converting audio into text), and because I don't want to reinvent the wheel, I'll use existing programs and ideas (references included) and try to adapt them to this problem. Along the way, I hope to learn a little about speech recognition, audio transcription and Python.

## Tools
* [Python 3](https://www.python.org/)
* [pydub](http://pydub.com/) module
* [SpeechRecognition](https://github.com/Uberi/speech_recognition#readme) module
* [FFmpeg.exe](https://ffmpeg.org/) program (record, convert and stream audio and video)
* [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/)
* An audio editor (optional), such as [Audacity](https://www.audacityteam.org/).

The Python modules can be installed with `pip` or `conda`. After installing `ffmpeg.exe`, don't forget to add it to the system `PATH`. An audio editor also helps to visualize the audio data and to read off useful parameters such as the *silence threshold* (in dB) and the *silence length* (in milliseconds).
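A quick way to confirm that FFmpeg is reachable from Python is to look it up on the `PATH`. The following is a minimal sketch using only the standard library; the helper name `find_tool` is illustrative and is not part of pydub or SpeechRecognition.

```python
import shutil


def find_tool(name="ffmpeg"):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


if __name__ == "__main__":
    location = find_tool("ffmpeg")
    if location is None:
        # pydub relies on ffmpeg to read formats other than wav
        print("ffmpeg was not found on PATH")
    else:
        print(f"ffmpeg found at {location}")
```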

## Automatic speech recognition (ASR) system
In general, an ASR system can be represented by the following simplified diagram:

![Speech recognition diagram](img/speech.jpg)

Here a digital audio input, taken from a previously recorded file or a microphone, is processed by an automatic speech recognition system, which has the difficult task of interpreting human speech, and the output is the transcribed text. The internet has a lot of useful information about speech recognition with Python; I started with
[The ultimate guide to speech recognition with Python](https://realpython.com/python-speech-recognition/) and [Speech recognition is hard](https://towardsdatascience.com/speech-recognition-is-hard-part-1-258e813b6eb7).

Speech recognition can be implemented with an online API (Application Programming Interface) or with an engine that runs offline, such as Pocketsphinx. Some Python packages for this are:
* [Assemblyai](https://pypi.org/project/assemblyai/)
* [Google Cloud Speech](https://pypi.org/project/google-cloud-speech/)
* [IBM Watson](https://pypi.org/project/watson-developer-cloud/) 
* [Pocketsphinx](https://pypi.org/project/pocketsphinx/)
* [SpeechRecognition](https://pypi.org/project/SpeechRecognition/)

The package used here is *SpeechRecognition* with the *Google Web Speech API*, because it doesn't require an account and is easy to use. Note that this API comes with a default key and doesn't require any authentication; however, since Google may revoke the key at any time, it should be used only for testing purposes.

### 1. An idealized audio model
This was the first [python tutorial](https://pythonbasics.org/transcribe-audio/#Installprequisites) I found useful to start with. I used it to transcribe the audio `harvard.wav`, which I found on [GitHub](https://github.com/realpython/python-speech-recognition), just to see how it would perform. Even though the language of this audio is English, the speech recognition engine also supports other languages. And, since the audio file was already in .wav format, I didn't need to convert it from .mp3.
 
The program performs the following actions to get the audio transcribed:
* convert the audio file from .mp3 to .wav (if needed)
* load the audio file
* use a speech recognition engine

The output will be the transcription of the original audio file.

In [9]:
import speech_recognition as sr
from os import path
from pydub import AudioSegment

# convert mp3 file to wav                                                       
# sound = AudioSegment.from_mp3("harvard.mp3")
# sound.export("harvard.wav", format="wav")

# transcribe audio file                                                         
AUDIO_FILE = "harvard.wav"

# use the audio file as the audio source
r = sr.Recognizer()
with sr.AudioFile(AUDIO_FILE) as source:
    audio = r.record(source)  # read the entire audio file

print("Transcription: " + r.recognize_google(audio))

Transcription: the stale smell of old beer lingers it takes heat to bring out the odor a cold dip restores health and zest a salt pickle taste fine with ham tacos al Pastore are my favorite a zestful food is be hot cross bun


The transcription of the audio file `harvard.wav`, presented above, was accurate.

Since I want to transcribe audio in Brazilian Portuguese, I asked four different people to read a short passage from "The Art of War" in Portuguese and recorded them with my mobile phone. The resulting audio files were then converted from .3gpp format to .wav using FFmpeg:

`ffmpeg -i file.3gpp file.wav`
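
With several recordings, the same command can be issued from Python for each file. This is only a sketch: the function names and the `*.3gpp` glob pattern are assumptions for illustration, and `ffmpeg` must be on the `PATH` for `convert_all` to work.

```python
import subprocess
from pathlib import Path


def ffmpeg_convert_command(src, target_suffix=".wav"):
    """Build the argument list for `ffmpeg -i <src> <dst>`."""
    src = Path(src)
    dst = src.with_suffix(target_suffix)
    return ["ffmpeg", "-i", str(src), str(dst)]


def convert_all(folder=".", pattern="*.3gpp"):
    """Convert every matching recording in `folder` to .wav."""
    for src in sorted(Path(folder).glob(pattern)):
        # check=True raises an error if ffmpeg reports a failure
        subprocess.run(ffmpeg_convert_command(src), check=True)
```

Building the command as a list, rather than as one string, avoids quoting problems when a file name contains spaces.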

I also set the language to Brazilian Portuguese in the speech recognition engine:
`print("Transcription: " + r.recognize_google(audio, language='pt-BR'))`

The speakers in these audio files differ in gender, age and accent; the reading was paced and the ambient noise was minimal.

In [1]:
import speech_recognition as sr
from os import path
from pydub import AudioSegment

# create the recognizer once and reuse it for every file
r = sr.Recognizer()

# loop over the four audio files
for i in range(1, 5):
    AUDIO_FILE = "speaker" + str(i) + ".wav"

    # use the audio file as the audio source
    with sr.AudioFile(AUDIO_FILE) as source:
        audio = r.record(source)  # read the entire audio file

    print(f'\nSpeaker {i}:\n')
    print(" Transcription: " + r.recognize_google(audio, language='pt-BR'))


Speaker 1:

 Transcription: comandar muitos é o mesmo que comandar poucos tudo é uma questão de organização controlar muitos ou poucos é uma mesma e única coisa é apenas uma questão de formação e sinalizações

Speaker 2:

 Transcription: como andar muito é o mesmo que comandar pouco tudo é uma questão de organização controlar muito ou pouco é uma mesma e única coisa e apenas uma questão de formação e sinalizações

Speaker 3:

 Transcription: comandar muitos é o mesmo que comandar poucos tudo uma questão de organização controlar muitos ou poucos é uma mesma e única coisa é apenas uma questão de formação e sinalizações

Speaker 4:

 Transcription: comandar muitos é o mesmo que comandar poucos tudo é uma questão de organização controlar muitos ou poucos é uma mesma e única coisa é apenas uma questão de formação e sinalizações


The transcriptions of these audio files using `language='pt-BR'` are shown above, and were also accurate.
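
One common way to put a number on how close two transcriptions are is the word error rate (WER): the word-level edit distance divided by the number of words in the reference. This measure was not part of the original experiment; the sketch below is plain Python, and treating Speaker 1's output as the reference is an assumption made only for illustration, since the original passage is not reproduced here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# opening words of the outputs for Speaker 1 and Speaker 2
speaker_1 = "comandar muitos é o mesmo que comandar poucos"
speaker_2 = "como andar muito é o mesmo que comandar pouco"
print(f"WER: {word_error_rate(speaker_1, speaker_2):.2f}")
```

A WER of 0 means the two texts match word for word; larger values mean more substitutions, insertions or deletions.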

### 2. A "real world" audio

Now let's see how it performs with the messier, more "real world" audio file mentioned earlier, `meeting.wav`.

In [None]:
import speech_recognition as sr
from os import path
from pydub import AudioSegment

# transcribe audio file                                                         
AUDIO_FILE = "meeting.wav"

# use the audio file as the audio source
r = sr.Recognizer()
with sr.AudioFile(AUDIO_FILE) as source:
    audio = r.record(source)  # read the entire audio file

print("Transcription: " + r.recognize_google(audio, language='pt-BR'))

This time the program returned several errors (not shown) when trying to transcribe my real, 2-hour-long audio file. I [found](https://www.geeksforgeeks.org/python-speech-recognition-on-large-audio-files/) that recognition accuracy decreases for long audio files and that the Google Speech Recognition API can't recognize long audio files with reasonable accuracy. The recording quality must also be considered, as well as the quality and speed of the internet connection. The file-size problem can be addressed in two ways:
* First, by dividing the original file into small chunks at the silences, i.e., the pauses we make between sentences. The difficulty is choosing the pause duration, since different people pause for different lengths of time. For this approach I used a more elaborate [code](https://www.geeksforgeeks.org/python-speech-recognition-on-large-audio-files/).
* Second, by splitting the original file into small chunks of a fixed length, 30 s for example, using FFmpeg: `ffmpeg -i meeting.wav -f segment -segment_time 30 -c copy chunk%d.wav`. For this approach I used a modified version of the first code presented here.
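
The fixed-length split can also be done without leaving Python. The sketch below is an alternative to the FFmpeg command, not the method used here: `chunk_bounds` and `export_chunks` are hypothetical helper names, and the second function assumes pydub's millisecond-based slicing of an `AudioSegment`.

```python
import math


def chunk_bounds(total_ms, chunk_ms=30_000):
    """Return (start, end) pairs in milliseconds covering the whole recording."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    count = math.ceil(total_ms / chunk_ms)
    return [(k * chunk_ms, min((k + 1) * chunk_ms, total_ms)) for k in range(count)]


def export_chunks(path, chunk_ms=30_000, prefix="chunk"):
    """Slice a wav file with pydub and write chunk0.wav, chunk1.wav, ..."""
    # imported here so that chunk_bounds can be used without pydub installed
    from pydub import AudioSegment

    audio = AudioSegment.from_wav(path)
    # len() of an AudioSegment is its duration in milliseconds
    for k, (start, end) in enumerate(chunk_bounds(len(audio), chunk_ms)):
        audio[start:end].export(f"{prefix}{k}.wav", format="wav")
```

By this arithmetic, a recording of exactly 2 hours split into 30 s chunks produces 240 files, which is also the upper limit of the loop that later reads them back.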

### 3. Processing large audio files: silence threshold criterion
I tried different combinations of the variables `min_silence_len` (e.g., 500 ms, 1000 ms and 2000 ms) and `silence_thresh` (e.g., -16, -22, -30, -42 and -50 dBFS). An audio editor can be very helpful for choosing these values.
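
The `silence_thresh` values are in dBFS (decibels relative to full scale): 0 dBFS is the loudest level the format can represent, and quieter audio has more negative values. The following plain-Python sketch works through the formula; the function name and the 16-bit default are assumptions for illustration. pydub reports the same kind of figure through the `dBFS` attribute of an `AudioSegment`.

```python
import math


def dbfs(samples, max_amplitude=32768):
    """Loudness of integer PCM samples in dBFS (16-bit audio assumed by default)."""
    if not samples:
        raise ValueError("no samples given")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return float("-inf")  # digital silence
    return 20 * math.log10(rms / max_amplitude)


# a signal at half of full scale sits about 6 dB below full scale
print(round(dbfs([16384, -16384, 16384, -16384]), 2))
```

The closer the threshold is to 0, the more of the recording counts as silence, so comparing the candidate values with the measured loudness of the file is a reasonable starting point.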

In [None]:
# importing libraries 
import speech_recognition as sr 
import os  
from pydub import AudioSegment 
from pydub.silence import split_on_silence 
  
# this function splits the audio file into chunks 
# and applies speech recognition 
def silence_based_conversion(path = "my_audio.wav"): 
    # open the audio file stored in 
    # the local system as a wav file. 
    song = AudioSegment.from_wav(path) 
  
    # open two files: in the first one it will concatenate   
    # and store the recognized text and in the second one
    # it will write some error messages
    fh = open("transcribed.txt", "w+")
    flog = open("transcribed.log", "w+")
          
    # split the track wherever there is silence lasting
    # 2 seconds or more and get the chunks
    chunks = split_on_silence(song,
        # minimum silence length in ms (here 2000 ms = 2 s).
        # adjust this value based on the recording: if the
        # speakers pause for longer, increase it; else, decrease it.
        min_silence_len = 2000,

        # consider it silent if quieter than -50 dBFS.
        # adjust this per requirement
        silence_thresh = -50
    )
  
    # create a directory to store the audio chunks. 
    try: 
        os.mkdir('chunks') 
    except(FileExistsError): 
        pass
  
    # move into the directory to 
    # store the audio files. 
    os.chdir('chunks') 
  
    # process each chunk
    for i, chunk in enumerate(chunks):
        # create a 10 ms silence segment
        chunk_silent = AudioSegment.silent(duration = 10)

        # pad the chunk with silence at the beginning and the end
        # so that it doesn't seem abruptly sliced.
        audio_chunk = chunk_silent + chunk + chunk_silent

        # export the audio chunk to the current directory.
        # (a bitrate setting is not needed for uncompressed wav)
        filename = "chunk{0}.wav".format(i)
        print("saving " + filename)
        audio_chunk.export(filename, format = "wav")

        print("Processing chunk " + str(i))

        # create a speech recognition object
        r = sr.Recognizer()

        # read the chunk
        with sr.AudioFile(filename) as source:
            # calibrates on (and consumes) the first second of
            # the chunk; remove this if it is not working correctly.
            r.adjust_for_ambient_noise(source)
            # record() reads the rest of the chunk, whereas
            # listen() would stop at the first pause
            audio_listened = r.record(source)

        # try converting it to text
        try:
            rec = r.recognize_google(audio_listened, language = 'pt-BR')
            # write the output to the file.
            fh.write(rec + ". ")

        # log any errors.
        except sr.UnknownValueError:
            flog.write(f"file = {filename}: could not understand audio\n")

        except sr.RequestError as e:
            flog.write(f"file = {filename}: could not request results ({e}). check your internet connection\n")

    os.chdir('..')

    # close the output files so that everything is written to disk
    fh.close()
    flog.close()
  
 
if __name__ == '__main__':
    silence_based_conversion(path = 'meeting.wav')   # set the name of the audio file that should be transcribed
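
The loop above logs a `RequestError` and moves on, so a brief network failure loses the text of that chunk. One possible refinement, not part of the original code, is to retry the request a few times before giving up. The helper below is a generic sketch; its name and parameters are hypothetical.

```python
import time


def call_with_retries(func, retry_on, attempts=3, wait_seconds=2.0):
    """Call func(); on a `retry_on` exception wait and try again, up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # give up and let the caller log the error
            time.sleep(wait_seconds * attempt)  # wait a little longer each time
```

Inside the loop, the recognition call could then be wrapped as `call_with_retries(lambda: r.recognize_google(audio_listened, language='pt-BR'), sr.RequestError)`, leaving the existing `except` blocks in place for the final failure.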

### 4. Processing large audio files: file size criterion
I split the original file into parts of 30 s, and then of 60 s, using FFmpeg as mentioned before. Since the chunk files already existed, I just looped through them.

In [None]:
# importing libraries
import speech_recognition as sr

# open two files, to store the recognized text and
# to store error messages; both are closed automatically
with open("transcription.txt", "w") as fh, open("transcription.log", "w") as flog:
    # loop through audio chunks
    for i in range(0, 116):      # set the range of the chunks
        filename = 'chunks/chunk' + str(i) + '.wav'
        print(f'i = {i} | filename = {filename}\n')

        r = sr.Recognizer()
        with sr.AudioFile(filename) as source:
            # record() reads the whole chunk, whereas
            # listen() would stop at the first pause
            audio = r.record(source)

        # try converting it to text
        try:
            rec = r.recognize_google(audio, language = 'pt-BR')
            fh.write(rec + ". ")

        # log any errors
        except sr.UnknownValueError:
            flog.write(f"file = {filename}: could not understand audio\n")

        except sr.RequestError as e:
            flog.write(f"file = {filename}: could not request results ({e}). check your internet connection\n")
        

### 5. Concluding remarks

Of the two methods used here, the second one extracted more text, but in both cases the transcribed text is far from what was actually said. The audio was recorded with a mobile phone, so its quality is not high. However, I believe the greatest difficulty came from the way the participants speak: they have different accents and dialects, or speak very fast without separating words and sentences, which makes their speech unclear to the engine unless a more sophisticated and better-trained algorithm is used. Many other factors, such as age, gender, context and intent, can also affect the recognition process.

[Human speech is not a simple phenomenon](https://www.youtube.com/watch?v=FZ0PDJbRU5I): it is influenced by many complex factors that change from speaker to speaker. Machine recognition systems work much better in ideal conditions, where the noise level is low and the speech is paced. Fortunately, there are more sophisticated algorithms and techniques to tackle this problem, but they are not the subject here.

The main goal was to use Python to try to solve a particular problem while learning Python itself and JupyterLab. This was my first personal project in Python, so there is much to improve.