In [1]:
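import io

# NOTE: the enums and types modules used below ship with google-cloud-speech
# releases before 2.0; later releases expose these classes on speech directly.
from google.cloud import speech
from google.cloud.speech import enums
from google.cloud.speech import types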

#### Instantiate a speech client

In [2]:
client = speech.SpeechClient()

#### Using .flac file stored in a Cloud Storage bucket
This audio file is a short clip of spoken English. We will use the Speech API to transcribe it.

Currently, only Google Cloud Storage URIs are supported, which must be specified in the following format: gs://bucket_name/object_name

Source of the file: https://www.kaggle.com/toponowicz/spoken-language-identification/home 
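
Since the API only accepts URIs of the form gs://bucket_name/object_name, it can help to check a URI before sending a request. The helper below is a minimal sketch of such a check; split_gcs_uri is a hypothetical name introduced here for illustration and is not part of the Speech client library.

```python
def split_gcs_uri(uri):
    """Split 'gs://bucket_name/object_name' into (bucket_name, object_name)."""
    prefix = 'gs://'
    if not uri.startswith(prefix):
        raise ValueError('Expected a URI starting with gs://, got: {}'.format(uri))
    # Everything up to the first slash is the bucket; the rest is the object.
    bucket_name, _, object_name = uri[len(prefix):].partition('/')
    if not bucket_name or not object_name:
        raise ValueError('URI must name both a bucket and an object: {}'.format(uri))
    return bucket_name, object_name


print(split_gcs_uri('gs://cloud-ml-api/audio-file-english.flac'))
```

A URI that fails this check (for example an https:// link) would need to be copied into a Cloud Storage bucket first, or its bytes sent directly as content.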

In [3]:
audio = types.RecognitionAudio(uri='gs://cloud-ml-api/audio-file-english.flac')

#### Create a RecognitionConfig object
The RecognitionConfig tells the recognizer how to process the request, for example which audio encoding and which language to expect.

In [4]:
config = types.RecognitionConfig(
    encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
    language_code='en-US')

#### Perform synchronous speech recognition
The call to client.recognize blocks and returns the results once all the audio has been sent and processed. This function can only handle audio shorter than 1 minute.

In [5]:
# Keyword arguments make the call explicit and also work with newer client releases
response = client.recognize(config=config, audio=audio)

#### Parse the response
The speech recognition results are stored in the results property of the response object. This property is a list in which each entry holds the transcript for one sequential portion of the audio.

Note that each result contains a field called alternatives, which may hold multiple alternative transcriptions. Here there is only one, the default, since we did not ask for more.

Also, the transcription is not always fully accurate, which is one reason to request alternatives.
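
To make the nesting concrete (response, then results, then alternatives, then transcript), here is a small sketch that runs without calling the API. The mock_response object is a hypothetical stand-in built with SimpleNamespace. The first alternative in each result reuses text from the output of this notebook, while the second alternative in the last result is invented purely for illustration.

```python
from types import SimpleNamespace  # standard library, unrelated to speech.types

# Hypothetical stand-in for a response object (not produced by the API).
mock_response = SimpleNamespace(results=[
    SimpleNamespace(alternatives=[
        SimpleNamespace(transcript='a day with John Milton'),
    ]),
    SimpleNamespace(alternatives=[
        SimpleNamespace(transcript=' about 4 on a September morning of 1665'),
        # Invented second alternative, for illustration only.
        SimpleNamespace(transcript=' about four on a September morning of 1665'),
    ]),
])


def top_transcripts(response):
    """Return the first alternative's transcript for every result, stripped."""
    return [result.alternatives[0].transcript.strip()
            for result in response.results if result.alternatives]


def full_transcript(response):
    """Join the per-section transcripts into a single string."""
    return ' '.join(top_transcripts(response))


print(full_transcript(mock_response))
```

The same two helpers should work on a real response, since they only read the results, alternatives and transcript fields described above.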

In [6]:
# enumerate replaces the manual counter; numbering starts at 1
for i, result in enumerate(response.results, start=1):
    print('Section ', i, ': ', result.alternatives[0].transcript)

Section  1 :  a day with John Milton
Section  2 :   about 4 on a September morning of 1665


### Using longer audio files requires a different function to be called
The client.recognize() function cannot be used for audio longer than 1 minute.

#### Source of file:
http://freemusicarchive.org/music/Andrew_Walton/Fresh_Delivery/In_Place_Of_Fear#

The original MP3 file has been converted to FLAC format.

The file comes under an attribution share-alike license: https://creativecommons.org/licenses/by-sa/4.0/

In [7]:
long_audio = types.RecognitionAudio(uri='gs://cloud-ml-api/Andrew_Walton-In_Place_Of_Fear_mono.flac')

#### Create a new config
* This audio contains speech in a British accent, so the language_code in the config changes to en-GB.
* We also ask for up to 2 alternatives for the transcription (one in addition to the single alternative provided by default). max_alternatives is an upper bound, so the service may return fewer. The alternatives tend to differ around words or phrases whose pronunciation is ambiguous.

In [8]:
long_audio_config = types.RecognitionConfig(
    encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
    language_code='en-GB',
    max_alternatives=2)

#### Use the asynchronous long_running_recognize() function
The recognize() function used previously is synchronous: it blocks until the transcription is ready and returns quickly for short clips, but it only accepts audio shorter than a minute. The long_running_recognize() function is asynchronous. It returns a long-running operation object, which we then use to wait for and fetch the result.

The call takes the config created above together with the reference to the long audio file.
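
The choice between the two functions can be wrapped in a small helper. This is only a sketch: transcribe and SYNC_LIMIT_SECONDS are hypothetical names introduced here, and the caller is assumed to know the duration of the audio already, since the helper does not inspect the file.

```python
SYNC_LIMIT_SECONDS = 60  # the 1-minute limit for client.recognize() noted above


def transcribe(client, config, audio, duration_seconds, timeout=300):
    """Use recognize() for short audio and long_running_recognize() otherwise.

    duration_seconds must be supplied by the caller; this sketch does not
    measure the audio itself.
    """
    if duration_seconds < SYNC_LIMIT_SECONDS:
        # Synchronous: blocks until the transcription is returned.
        return client.recognize(config=config, audio=audio)
    # Asynchronous: returns an operation object that we then wait on.
    operation = client.long_running_recognize(config=config, audio=audio)
    return operation.result(timeout=timeout)
```

With the objects from this notebook, a call could look like transcribe(client, long_audio_config, long_audio, duration_seconds=120), where 120 is a made-up duration used only as an example.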

In [9]:
# Keyword arguments make the call explicit and also work with newer client releases
operation = client.long_running_recognize(config=long_audio_config, audio=long_audio)

#### Retrieve response from the operation object
Here, we specify a timeout (in seconds) to limit how long we wait for the operation to finish. If the result is not ready within that time, the call raises an exception instead of returning a response.
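
If the wait might exceed the timeout, the call can be wrapped so that the notebook does not stop with a traceback. The sketch below assumes the operation object raises concurrent.futures.TimeoutError, as a standard-library Future does; this is an assumption to verify against the installed client library. A plain Future is used as a stand-in for the operation so that the example runs without calling the API, and wait_for_result is a hypothetical helper name.

```python
import concurrent.futures


def wait_for_result(operation, timeout):
    """Return operation.result(), or None if it is not ready within timeout seconds."""
    try:
        return operation.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        print('Operation did not finish within {} seconds'.format(timeout))
        return None


# Stand-in for the operation: a Future that never completes, so the wait times out.
pending = concurrent.futures.Future()
print(wait_for_result(pending, timeout=0.1))
```

Returning None keeps the sketch short; in practice one might retry with a longer timeout instead.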

In [10]:
print('Waiting for operation to complete...')
response = operation.result(timeout=300)

Waiting for operation to complete...


#### Parse the response
This is similar to what we did previously - fetch the transcript for each section of the audio in the response. Here, we get the transcripts for the first version of the response.

In [11]:
# enumerate replaces the manual counter; numbering starts at 1
for i, result in enumerate(response.results, start=1):
    print('Section ', i, ': ', result.alternatives[0].transcript)

Section  1 :  in place of Fear antibiotics a magic Bullet to cure oral infection stop bacteria in their tracks and give you protection but there's a snag a floor I hate which Darwin for sore as Bugs divide and conquer they will have the last hurrah drugs are not a panacea for every malady the kill microbes of course but no permanent remedy exists the bacilli that survive can go on to thrive evolve go forth and multiply I'll drugs only drive evolutionary pressure the stakes are ranked higher we need to use our brain power to put out this fire but big Pharma has no interest driven on
Section  2 :  migrate companies focus on the bottom line not what people need there is profit and drugs which customers keep on taking statins lower cholesterol and keep the money flowing a molecule to eat here and there will give me a patent without any of the effort none of the Investment to design your medicine which will soon be obsolete in a vicious circle I bet he demands of Wall Street Viagra Parcs up

#### Check the alternative transcription
Since we asked for 2 alternatives in the config for the audio recognition, we can check the alternative version of the transcript. The variation between this and the previous version tends to be around words or phrases where the pronunciation is a bit ambiguous.

In [12]:
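for i, result in enumerate(response.results, start=1):
    # max_alternatives is an upper bound, so a result may hold only one alternative
    if len(result.alternatives) > 1:
        print('Section ', i, ': ', result.alternatives[1].transcript)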

Section  1 :  in place of Fear antibiotics a magic Bullet to cure oral infection stop bacteria in their tracks and give you protection but there's a snag a floor I hate which Darwin for sore as Bugs divide and conquer they will have the last hurrah drugs are not a panacea for every malady the kill microbes of course but no permanent remedy exists the bacilli that survive can go on to thrive evolve go forth and multiply Isle drugs only drive evolutionary pressure the stakes are amped higher we need to use our brain power to put out this fire but big Pharma has no interest driven on
Section  2 :  migrate companies focus on the bottom line not what people need there is profit and drugs which customers keep on taking statins lower cholesterol and keep the money flowing a molecule to eat here and there will give me a patent without any of the effort none of the Investment to design your medicine which will soon be obsolete in a vicious circle I bet he demands of Wall Street Viagra Parcs up 
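
The two versions are long and nearly identical, so the differences are hard to spot by eye. One way to surface them is a word-level comparison with difflib from the standard library. This is a sketch rather than part of the Speech API; differing_phrases is a hypothetical helper name, and the two strings are short excerpts copied from the Section 1 outputs above.

```python
import difflib


def differing_phrases(first, second):
    """Return (phrase_in_first, phrase_in_second) pairs where the word sequences differ."""
    a, b = first.split(), second.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [(' '.join(a[i1:i2]), ' '.join(b[j1:j2]))
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != 'equal']


# Excerpts from the first and second alternative of Section 1.
first = "and multiply I'll drugs only drive evolutionary pressure the stakes are ranked higher"
second = "and multiply Isle drugs only drive evolutionary pressure the stakes are amped higher"
print(differing_phrases(first, second))
```

For these excerpts the helper reports two differing pairs: I'll versus Isle, and ranked versus amped. To compare whole sections, pass result.alternatives[0].transcript and result.alternatives[1].transcript instead.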

### Using a Linear16 format (i.e. .raw) file
Here, we load a file from our own file system. The original file was downloaded from:
https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/speech/cloud-client/resources/audio.raw

In [17]:
linear_file = 'datasets/audio.raw'

#### Initialize a types.RecognitionAudio object with the raw contents of an audio file
We need to load the binary contents of the audio file into the object. Besides passing the URI of a file on Google Cloud Storage (as done previously), this is the only other way to supply audio.


In [18]:
# Read the raw bytes, then wrap them once the file has been closed
with io.open(linear_file, 'rb') as audio_file:
    content = audio_file.read()

raw_audio = types.RecognitionAudio(content=content)

#### Define the config for this speech-to-text conversion
* The AudioEncoding for this raw file is LINEAR16 (uncompressed 16-bit samples).
* We explicitly set sample_rate_hertz, which tells the recognizer the rate at which the audio was recorded. A headerless raw file does not carry this information, so the value must match the actual sample rate of the file (16000 Hz here). Setting it does not re-sample the audio.
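
Because a raw file has no header, the declared sample rate is also what ties its size in bytes to a playing time. The sketch below estimates the duration of LINEAR16 audio, which is a quick way to check that a file fits within the 1-minute limit of client.recognize(). It assumes mono audio by default, and linear16_duration_seconds is a hypothetical helper name. The byte count in the example is made up.

```python
BYTES_PER_SAMPLE = 2  # LINEAR16 stores each sample as a 16-bit (2-byte) integer


def linear16_duration_seconds(num_bytes, sample_rate_hertz, channels=1):
    """Estimate the playing time of headerless LINEAR16 audio from its size."""
    bytes_per_second = sample_rate_hertz * BYTES_PER_SAMPLE * channels
    return num_bytes / bytes_per_second


# Hypothetical size: 64000 bytes of mono audio sampled at 16000 Hz
print(linear16_duration_seconds(64000, 16000))
```

With the file loaded in this notebook, the call would be linear16_duration_seconds(len(content), 16000). A wrong sample rate gives a wrong duration here, just as it would mislead the recognizer.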

In [19]:
raw_audio_config = types.RecognitionConfig(
    encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code='en-US')

In [20]:
response = client.recognize(config=raw_audio_config, audio=raw_audio)

In [21]:
for result in response.results:
    print('Transcript: {}'.format(result.alternatives[0].transcript))

Transcript: how old is the Brooklyn Bridge
