# Homework 5 - Google Cloud Speech-to-Text

## Part 1 

### Imports

In [1]:
import Levenshtein as Lev
import io
import os

from google.cloud import speech
from google.cloud.speech import enums
from google.cloud.speech import types

import numpy as np

### Recognize Function
Uses the Google Cloud Speech-to-Text API to transcribe a WAV file. The configuration below assumes 16 kHz LINEAR16 audio.

In [2]:
def google_recognize(audio_name,auth_key):

    os.environ["GOOGLE_APPLICATION_CREDENTIALS"]=auth_key

    # Instantiates a client
    client = speech.SpeechClient()

    # The name of the audio file to transcribe
    file_name = audio_name

    # Loads the audio into memory
    with io.open(file_name, 'rb') as audio_file:
        content = audio_file.read()
        audio = types.RecognitionAudio(content=content)

    config = types.RecognitionConfig(
        encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code='en-US')

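    # Detects speech in the audio file
    response = client.recognize(config, audio)
    # Returns the top transcript of the first result only;
    # implicitly returns None when nothing is recognized
    for result in response.results:
        return result.alternatives[0].transcript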
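
`google_recognize` returns after the first entry in `response.results`, so if the API returns several results for one file, only the first segment is kept. Below is a minimal sketch of an alternative that joins the top alternative of every result. `join_transcripts` is a hypothetical helper, not part of the Google client library, and the stand-in objects only mimic the `results[i].alternatives[0].transcript` shape used above.

```python
from types import SimpleNamespace


def join_transcripts(results):
    """Concatenate the top alternative of every result; return None if empty."""
    pieces = [r.alternatives[0].transcript.strip()
              for r in results if r.alternatives]
    return ' '.join(pieces) if pieces else None


# Stand-in objects that mimic the response structure (illustrative only)
def fake_result(text):
    return SimpleNamespace(alternatives=[SimpleNamespace(transcript=text)])


results = [fake_result('she had your dark suit'),
           fake_result(' in greasy wash water all year')]
joined = join_transcripts(results)
empty = join_transcripts([])
```

Returning `None` for an empty response keeps the same contract as the function above, so callers can still check for a missing transcription.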

### Word Error Rate
WER is computed by dividing the sum of substitutions, deletions and insertions by the total number of words in the reference transcript.  
We can compute it as the Levenshtein distance between the two sentences, taken at the word level.

In [3]:
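def wer(s1, s2):
    # s1 is the predicted sentence, s2 is the reference sentence
    s1 = s1.lower()
    s2 = s2.lower()

    # Map every distinct word to a unique character so that the
    # character-level Levenshtein distance equals the word-level distance
    b = set(s1.split() + s2.split())
    word2char = dict(zip(b, range(len(b))))

    w1 = [chr(word2char[w]) for w in s1.split()]
    w2 = [chr(word2char[w]) for w in s2.split()]
    # Normalize by the number of words in the reference
    return Lev.distance(''.join(w1), ''.join(w2))/float(len(s2.split()))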
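
As a cross-check on `wer`, the same quantity can be computed with a plain dynamic-programming edit distance over word lists. This is a minimal sketch that does not use the `Levenshtein` package. `word_edit_distance` and `wer_reference` are hypothetical names introduced here for illustration.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # extra word in hyp
                            curr[j - 1] + 1,      # ref word missing from hyp
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer_reference(pred, ref):
    """Word error rate of pred against ref (case-insensitive)."""
    pred_words = pred.lower().split()
    ref_words = ref.lower().split()
    return word_edit_distance(pred_words, ref_words) / float(len(ref_words))


# One substitution ("dog" for "cat") against a three-word reference
one_sub = wer_reference("the dog sat", "The cat sat")
# Extra words count as errors too, so WER can exceed 1.0
many_ins = wer_reference("a b c d e", "a b")
```

The second call shows why WER is not bounded by 1: three extra words against a two-word reference give 3/2.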

### Path to Test Files

In [4]:
test_path = "./TIMIT_full/test/"

In [5]:
d_list = sorted(os.listdir(test_path))

### Path to Google's API key

In [7]:
key = "b659tutorial-259123-a3daa20bf890.json"

### Get Real Sentence
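This function returns the reference transcript for a wav file. Each TIMIT `.txt` file holds two sample indices followed by the sentence, so the first two fields are dropped.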

In [8]:
def get_real_sen(path):
    # Drop the first two space-separated fields and return the sentence text
    with open(path+'.txt') as file:
        for line in file:
            line = line.split(' ')
            return ' '.join(line[2:])
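
To make the slicing concrete, the sketch below applies the same `split`/`join` to an in-memory string instead of a file. The sample indices in the example line are illustrative values, and `parse_transcript_line` is a hypothetical helper.

```python
def parse_transcript_line(line):
    """Drop the two leading sample indices and return the sentence text."""
    fields = line.strip().split(' ')
    return ' '.join(fields[2:])


# Illustrative line in the "<start> <end> <sentence>" layout
example = "0 46797 She had your dark suit in greasy wash water all year.\n"
sentence = parse_transcript_line(example)
```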

### Calculate WER for all Sentences in the TIMIT test data

In [9]:
wer_all = []
count = 0
for d in d_list:
    person_list = sorted(os.listdir(test_path+d+'/'))
    for person in person_list:
        new_path = test_path+d+'/'+person + '/'
        files = sorted(os.listdir(new_path))
        for file in files:
            fname, ext = file.split('.')
            if ext == 'wav':
                count += 1
                if count % 100 == 0:
                    print(count)
                
                # new_path already ends with '/'
                audio = new_path + fname
                pred_sen = google_recognize(audio+'.wav',key)+'.'
                real_sen = get_real_sen(audio)
                wer_all.append(wer(pred_sen, real_sen)) 

100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600


### Mean and Standard Deviation of WER on TIMIT Test Data

In [10]:
wer_np = np.array(wer_all)
mean = wer_np.mean()
std = wer_np.std()
print(mean)
print(std)

0.1218402405384106
0.16213492486584333


### Comparison: Google Cloud Speech-to-Text API vs. Kaldi (Traditional ASR)
The Google Cloud Speech-to-Text API performs much better than Kaldi's traditional ASR models. Kaldi's best models, SGMM2 and SGMM2 + SMI, had a WER of ~19.3%, whereas the mean WER of Google Cloud is ~12.2% (0.122 × 100).

### Things I noticed
The Google Cloud Speech-to-Text API doesn't add a full stop at the end of the transcript when it predicts a single sentence. Because `wer` splits on whitespace, the final word of the reference (with its full stop attached) never matched the final predicted word, so every sentence picked up one extra error. This inflated the mean WER to 23.2% (standard deviation 15.7%). I therefore append a full stop to the transcription predicted by the API before computing the WER.  

I assume Google's API omits the full stop to improve the performance of its voice assistants: they usually receive single-sentence commands, so adding a full stop and then removing it again for knowledge extraction would be wasted work.
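
An alternative to appending a full stop is to strip punctuation from both the prediction and the reference before scoring. This is a sketch of that idea, not the approach used for the results above. `normalize` is a hypothetical helper, and it assumes punctuation should never count toward WER.

```python
import string

# Translation table that deletes ASCII punctuation, except apostrophes so
# that contractions such as "don't" stay intact
_PUNCT = string.punctuation.replace("'", "")
_DROP_PUNCT = str.maketrans('', '', _PUNCT)


def normalize(sentence):
    """Lowercase, remove punctuation, and collapse whitespace."""
    return ' '.join(sentence.lower().translate(_DROP_PUNCT).split())


pred = normalize("she had your dark suit in greasy wash water all year")
real = normalize("She had your dark suit in greasy wash water all year.\n")
same = pred == real
```

Both normalized strings could then be passed to `wer`. The scores would likely differ slightly from the ones reported here, because punctuation inside a sentence (commas, question marks) is removed as well.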

## Part 2

### Imports

In [9]:
import soundfile as sndfl

### Train File to be used as noise

In [10]:
noise_path = './TIMIT_full/train/dr1/fcjf0/sa1.wav'
noise, sr = sndfl.read(noise_path)

### Desired SNRs

In [11]:
dsnr = [-5, 0, 10, 25]

### Lists to store WERs respective to Desired SNRs

In [12]:
wer_5 = []
wer_0 = []
wer_10 = []
wer_25 = []

### Generate Noisy
This function repeats or trims the noise to the length of the clean signal, scales it so that the mixture has the desired SNR (in dB), and returns the noisy signal.

In [13]:
def generate_noisy(signal, noise, dsnr):

    # Repeat the noise until it is at least as long as the signal, then trim
    while len(noise) < len(signal):
        noise = np.hstack([noise, noise])
    noise = noise[:len(signal)]

    # Choose b so that 10*log10(sum(signal^2)/sum((b*noise)^2)) == dsnr
    b_square = np.sum(np.square(signal))/np.sum(np.square(noise))
    b_square = b_square/(10**(dsnr/10.0))
    b = np.sqrt(b_square)
    noisy = signal + b*noise
    return noisy
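
The gain `b` follows from requiring `10*log10(sum(signal^2) / sum((b*noise)^2))` to equal the desired SNR, which gives `b^2 = (sum(signal^2) / sum(noise^2)) / 10^(dsnr/10)`. The sketch below checks that relationship on toy signals, using plain Python lists rather than NumPy arrays so that the arithmetic is explicit. `noise_gain` and `snr_db` are hypothetical helper names, and the sine and cosine inputs are stand-ins for real audio.

```python
import math


def power(x):
    """Sum of squared samples."""
    return sum(v * v for v in x)


def snr_db(signal, scaled_noise):
    """Signal-to-noise ratio in decibels."""
    return 10 * math.log10(power(signal) / power(scaled_noise))


def noise_gain(signal, noise, dsnr):
    """Gain b such that snr_db(signal, b * noise) equals dsnr."""
    return math.sqrt(power(signal) / power(noise) / (10 ** (dsnr / 10.0)))


# Toy signals standing in for speech and noise (illustrative values only)
signal = [math.sin(0.01 * n) for n in range(1000)]
noise = [math.cos(0.37 * n) for n in range(1000)]

achieved = {}
for dsnr in [-5, 0, 10, 25]:
    b = noise_gain(signal, noise, dsnr)
    achieved[dsnr] = snr_db(signal, [b * v for v in noise])
```

Each entry of `achieved` should match its key up to floating-point error, which confirms that the scaling hits the requested SNR regardless of the original noise level.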

### Temp Path to store the noisy signals

In [14]:
temp_path = test_path + 'temp/'

### Calculate WER for all Signals in the TIMIT test data
For each clean signal I generate four noisy signals with desired SNRs of -5, 0, 10 and 25 dB. I then use Google's API to transcribe each noisy signal, calculate the WER, and append it to the list for that SNR.

In [34]:
count = 0

for d in d_list:
    person_list = sorted(os.listdir(test_path+d+'/'))
    for person in person_list:
        new_path = test_path+d+'/'+person + '/'
        
        files = sorted(os.listdir(new_path))
        for file in files:
            fname, ext = file.split('.')
            if ext == 'wav':
                count += 1
                if count % 50 == 0:
                    print(count)
                    
                # new_path already ends with '/'
                audio = new_path + fname
                real_sen = get_real_sen(audio)
                clean_signal, sr = sndfl.read(audio+'.wav')
                
                for d2 in dsnr:
                    noisy_signal = generate_noisy(clean_signal, noise, d2)
                    noisy_path = temp_path + fname + "_" + str(d2) + '.wav' 
                    sndfl.write(noisy_path, noisy_signal, sr)
                    pred_sen = google_recognize(noisy_path, key)
                    if pred_sen is None:
                        pred_sen = ""
                    else:
                        pred_sen += '.'
                    result = wer(pred_sen, real_sen)
                    
                    if d2 == -5:
                        wer_5.append(result)
                    elif d2 == 0:
                        wer_0.append(result)
                    elif d2 == 10:
                        wer_10.append(result)
                    else:
                        wer_25.append(result)

50
100
150
200
250
300
350
400
450
500
550
600
650
700
750
800
850
900
950
1000
1050
1100
1150
1200
1250
1300
1350
1400
1450
1500
1550
1600
1650
1700
1750
1800
1850


### Converting List to Numpy Array

In [35]:
wer_5_np = np.array(wer_5)
wer_0_np = np.array(wer_0)
wer_10_np = np.array(wer_10)
wer_25_np = np.array(wer_25)

### Mean and Standard Deviations of WERs for all Desired SNRs

In [43]:
mean = wer_5_np.mean()
std = wer_5_np.std()
print("For -5 db SNR signals, \n Mean WER = " + str(mean) + " Standard Deviation = " + str(std))

For -5 db SNR signals, 
 Mean WER = 1.0858575080673294 Standard Deviation = 0.3595471852656777


In [44]:
mean = wer_0_np.mean()
std = wer_0_np.std()
print("For 0 db SNR signals, \n Mean WER = " + str(mean) + " Standard Deviation = " + str(std))

For 0 db SNR signals, 
 Mean WER = 0.9084751703780346 Standard Deviation = 0.3000168169901338


In [45]:
mean = wer_10_np.mean()
std = wer_10_np.std()
print("For 10 db SNR signals, \n Mean WER = " + str(mean) + " Standard Deviation = " + str(std))

For 10 db SNR signals, 
 Mean WER = 0.21941584869441266 Standard Deviation = 0.2568347970989601


In [46]:
mean = wer_25_np.mean()
std = wer_25_np.std()
print("For 25 db SNR signals, \n Mean WER = " + str(mean) + " Standard Deviation = " + str(std))

For 25 db SNR signals, 
 Mean WER = 0.12538260887505803 Standard Deviation = 0.16328324974300254
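
The four separate lists and the repeated mean/standard-deviation cells above could be collapsed into one dictionary keyed by SNR. The following is a sketch of that bookkeeping only, using the standard library and made-up WER values. It is not the code that produced the numbers in this notebook.

```python
import statistics
from collections import defaultdict

# Hypothetical per-utterance results as (desired SNR in dB, WER) pairs;
# the numbers are made up purely to show the bookkeeping
results = [(-5, 1.0), (-5, 1.2), (0, 0.8), (0, 1.0),
           (10, 0.2), (10, 0.4), (25, 0.0), (25, 0.2)]

wer_by_snr = defaultdict(list)
for snr, value in results:
    wer_by_snr[snr].append(value)

summary = {}
for snr in sorted(wer_by_snr):
    values = wer_by_snr[snr]
    # pstdev is the population standard deviation, matching np.std's default
    summary[snr] = (statistics.mean(values), statistics.pstdev(values))
    print("For %d dB SNR signals, mean WER = %.4f, standard deviation = %.4f"
          % (snr, summary[snr][0], summary[snr][1]))
```

Keying by SNR removes the `if`/`elif` chain in the loop and means a new SNR value only has to be added to `dsnr`.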


## Discussion
As the SNR decreases, performance degrades (the mean WER increases).
At SNRs of -5 dB and 0 dB the noise power is greater than or equal to the speech power, which is why the mean WER is so high (108.59% at -5 dB and 90.85% at 0 dB). A WER above 100% is possible because inserted words count as errors while the denominator is only the number of reference words. Listening to these noisy samples, the speech of the two speakers intermingles and it is difficult to make out what either person is saying.

For the 10 dB signals the mean WER is comparable to the Kaldi results: Kaldi's best models, SGMM2 and SGMM2 + SMI, had a WER of ~19.3%, and at 10 dB we get a mean WER of ~21.9%. Listening to these signals, it takes a little effort to ignore the noise.

For the 25 dB signals, the mean WER is comparable to the results from Part 1 (signals without any added noise): 12.54% at 25 dB versus 12.18% for the clean signals. Listening to the 25 dB signals, the noise is almost inaudible.

The standard deviation also increases as the SNR decreases. The standard deviation measures how far the values are spread from the mean, so at high SNR the WERs cluster near the mean (they do not vary much), while at low SNR the WER varies widely from sentence to sentence.