# Overview

Amazon Transcribe uses advanced machine learning technologies to recognize speech in audio files and transcribe them into text. You can use Amazon Transcribe to convert audio to text and to create applications that incorporate the content of audio files. For example, you can transcribe the audio track from a video recording to create closed captioning for the video. 

This notebook introduces the Amazon Transcribe service and the API calls it provides. It also covers customizing transcription with custom vocabularies and custom language models. To measure accuracy we use a common metric, word error rate (WER).

Resources:
* https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/transcribe.html
* https://aws.amazon.com/blogs/machine-learning/evaluating-an-automatic-speech-recognition-service/
* https://aws.amazon.com/blogs/machine-learning/improving-speech-to-text-transcripts-from-amazon-transcribe-using-custom-vocabularies-and-amazon-augmented-ai/
* https://aws.amazon.com/blogs/machine-learning/building-custom-language-models-to-supercharge-speech-to-text-performance-for-amazon-transcribe/

*Aaron Sengstacken*

*Machine Learning Solutions Architect*
___


# Basic Amazon Transcribe Usage

### Asynchronous

Most Transcribe jobs are asynchronous: you submit a job and wait for it to complete. To submit a job, the audio file must first be in S3. Below is an example of how to submit a transcription job.

In [1]:
# import libraries
import json
import boto3
import time

In [2]:
job_name = 'test_transcript'
job_uri = 'https://s3.amazonaws.com/random.datasets.sengstacken/tmp/clm-blog-16k-audio.m4a'
output_bucket = 'random.datasets.sengstacken'
role = 'arn:aws:iam::431615879134:role/service-role/AmazonTranscribeServiceRoleFullAccess-MyTranscribeRole'
 
# transcribe audio
transcribe = boto3.client('transcribe')

In [3]:
response = transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    LanguageCode='en-US',
    Media={
        'MediaFileUri':job_uri
    },
    OutputBucketName=output_bucket,
    Settings={
#        'ChannelIdentification':True,
        'ShowAlternatives':True,
        'MaxAlternatives':2
    },
    JobExecutionSettings={
        'AllowDeferredExecution': True,
        'DataAccessRoleArn':role
    },
    ContentRedaction={
        'RedactionType': 'PII',
        'RedactionOutput': 'redacted_and_unredacted'
    },
 
)
 
while True:
    print("Transcription Started")
    status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        print(status['TranscriptionJob']['TranscriptionJobStatus'])
        break
    print("Not done yet!")
    time.sleep(10)
    
print(transcribe.get_transcription_job(TranscriptionJobName=job_name))

Transcription Started
Not done yet!
Transcription Started
Not done yet!
Transcription Started
Not done yet!
Transcription Started
Not done yet!
Transcription Started
Not done yet!
Transcription Started
Not done yet!
Transcription Started
COMPLETED
{'TranscriptionJob': {'TranscriptionJobName': 'test_transcript', 'TranscriptionJobStatus': 'COMPLETED', 'LanguageCode': 'en-US', 'MediaSampleRateHertz': 44100, 'MediaFormat': 'mp4', 'Media': {'MediaFileUri': 'https://s3.amazonaws.com/random.datasets.sengstacken/tmp/clm-blog-16k-audio.m4a'}, 'Transcript': {'TranscriptFileUri': 'https://s3.us-east-1.amazonaws.com/random.datasets.sengstacken/test_transcript.json', 'RedactedTranscriptFileUri': 'https://s3.us-east-1.amazonaws.com/random.datasets.sengstacken/redacted-test_transcript.json'}, 'StartTime': datetime.datetime(2021, 5, 26, 13, 55, 38, 73000, tzinfo=tzlocal()), 'CreationTime': datetime.datetime(2021, 5, 26, 13, 55, 38, 34000, tzinfo=tzlocal()), 'CompletionTime': datetime.datetime(2021, 5,
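The polling loop above can be wrapped in a small reusable helper with a timeout, so a stuck job does not block the notebook forever. This is a minimal sketch and not part of the original notebook: `poll_until` and `transcription_job_status` are hypothetical names, and the 10-second interval and one-hour timeout are assumed defaults.

```python
import time


def poll_until(fetch_status, terminal_states, poll_seconds=10,
               timeout_seconds=3600, sleep=time.sleep):
    """Call fetch_status() until it returns a terminal state or the timeout elapses."""
    waited = 0
    while True:
        status = fetch_status()
        if status in terminal_states:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"still {status!r} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


def transcription_job_status(client, job_name):
    """Read the status field from a get_transcription_job response."""
    response = client.get_transcription_job(TranscriptionJobName=job_name)
    return response['TranscriptionJob']['TranscriptionJobStatus']


# Usage in this notebook (assumes the `transcribe` client and `job_name` defined above):
# final_status = poll_until(
#     lambda: transcription_job_status(transcribe, job_name),
#     terminal_states={'COMPLETED', 'FAILED'},
# )
```

Because the helper only needs a callable that returns a status string, the same pattern should also work for other resources that report a status, with the appropriate terminal states.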

You can list all transcription jobs, optionally filtered by status, with .list_transcription_jobs().

In [4]:
response = transcribe.list_transcription_jobs(
    Status='COMPLETED',
)
print(response)

{'Status': 'COMPLETED', 'TranscriptionJobSummaries': [{'TranscriptionJobName': 'test_transcript', 'CreationTime': datetime.datetime(2021, 5, 26, 13, 55, 38, 34000, tzinfo=tzlocal()), 'StartTime': datetime.datetime(2021, 5, 26, 13, 55, 38, 73000, tzinfo=tzlocal()), 'CompletionTime': datetime.datetime(2021, 5, 26, 13, 56, 37, 24000, tzinfo=tzlocal()), 'LanguageCode': 'en-US', 'TranscriptionJobStatus': 'COMPLETED', 'OutputLocationType': 'CUSTOMER_BUCKET', 'ContentRedaction': {'RedactionType': 'PII', 'RedactionOutput': 'redacted_and_unredacted'}}], 'ResponseMetadata': {'RequestId': '0792238d-a963-4d57-9c60-8ba39cb1abf3', 'HTTPStatusCode': 200, 'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1', 'date': 'Wed, 26 May 2021 13:56:46 GMT', 'x-amzn-requestid': '0792238d-a963-4d57-9c60-8ba39cb1abf3', 'content-length': '376', 'connection': 'keep-alive'}, 'RetryAttempts': 0}}


Or, you can use .get_transcription_job() to get information about a specific job.

In [5]:
response = transcribe.get_transcription_job(
    TranscriptionJobName=job_name
)

print(response)

{'TranscriptionJob': {'TranscriptionJobName': 'test_transcript', 'TranscriptionJobStatus': 'COMPLETED', 'LanguageCode': 'en-US', 'MediaSampleRateHertz': 44100, 'MediaFormat': 'mp4', 'Media': {'MediaFileUri': 'https://s3.amazonaws.com/random.datasets.sengstacken/tmp/clm-blog-16k-audio.m4a'}, 'Transcript': {'TranscriptFileUri': 'https://s3.us-east-1.amazonaws.com/random.datasets.sengstacken/test_transcript.json', 'RedactedTranscriptFileUri': 'https://s3.us-east-1.amazonaws.com/random.datasets.sengstacken/redacted-test_transcript.json'}, 'StartTime': datetime.datetime(2021, 5, 26, 13, 55, 38, 73000, tzinfo=tzlocal()), 'CreationTime': datetime.datetime(2021, 5, 26, 13, 55, 38, 34000, tzinfo=tzlocal()), 'CompletionTime': datetime.datetime(2021, 5, 26, 13, 56, 37, 24000, tzinfo=tzlocal()), 'Settings': {'ChannelIdentification': False, 'ShowAlternatives': True, 'MaxAlternatives': 2}, 'ContentRedaction': {'RedactionType': 'PII', 'RedactionOutput': 'redacted_and_unredacted'}}, 'ResponseMetadata'

Once the transcription job is complete, the location of the JSON transcript is found in the response.

In [6]:
# get name of output file
response['TranscriptionJob']['Transcript']['TranscriptFileUri']

'https://s3.us-east-1.amazonaws.com/random.datasets.sengstacken/test_transcript.json'

Let's now copy the JSON output from Transcribe to the local notebook instance. Note that we are using the AWS CLI for S3 here. We could also use the boto3 S3 client or manually copy the file to the notebook instance.

In [7]:
# copy file
!aws s3 cp s3://random.datasets.sengstacken/test_transcript.json .

download: s3://random.datasets.sengstacken/test_transcript.json to ./test_transcript.json
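The same copy can be done with boto3 instead of the CLI. The sketch below assumes the transcript URI is path-style (`https://s3.<region>.amazonaws.com/<bucket>/<key>`), as in the response shown earlier; `split_path_style_uri` and `download_transcript` are hypothetical helper names, not part of boto3.

```python
from urllib.parse import urlparse


def split_path_style_uri(uri):
    """Split https://s3.<region>.amazonaws.com/<bucket>/<key> into (bucket, key)."""
    path = urlparse(uri).path.lstrip('/')
    bucket, _, key = path.partition('/')
    return bucket, key


def download_transcript(s3_client, transcript_uri, local_path=None):
    """Download the transcript JSON and return the local file name."""
    bucket, key = split_path_style_uri(transcript_uri)
    # Default to the last component of the key as the local file name
    local_path = local_path or key.split('/')[-1]
    s3_client.download_file(bucket, key, local_path)
    return local_path


# Usage (assumes AWS credentials are configured for boto3):
# import boto3
# uri = response['TranscriptionJob']['Transcript']['TranscriptFileUri']
# download_transcript(boto3.client('s3'), uri)
```

Passing the S3 client in as an argument keeps the helper easy to exercise without network access.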


Now, let's take a look at the output file.

In [8]:
!head test_transcript.json

{"jobName":"test_transcript","accountId":"431615879134","isRedacted":false,"results":{"transcripts":[{"transcript":"The 2020 holiday season is right around the corner and with the way that the year has been going, we can all hope for a little excitement around the next gen, video game consoles coming out soon. So what's the difference in hard respects between the upcoming PlayStation five and xbox series X. Well, let's take a look under the hood of each of these consoles. The PS five features an A. M. D. S. And two CPU with up to 3.5 gigahertz frequency is sports and I am the radio on GPU that tells 10.3 teraflops running up to 2.23 gigahertz memory and storage, respectively, doll in at 16 gigabytes and 825 gigabytes. The PS five supports both PS the PS five supports both four K. And A K resolutions. Meanwhile, the Xbox series X also features an A. M. D. S. And two CPU but clocks in at 3.8 gigahertz. Instead, the console boasts a similar M. D. Accustomed Gpu with 12 teraflops and 1.8 t

Using Python, we can read the file, pull out the transcript field, and write it to a text file.

In [9]:
with open('test_transcript.json') as f:
    data = json.load(f)

print(data['results']['transcripts'][0]['transcript'])

with open('temp_transcript.txt','w') as f:
    f.write(data['results']['transcripts'][0]['transcript'])

The 2020 holiday season is right around the corner and with the way that the year has been going, we can all hope for a little excitement around the next gen, video game consoles coming out soon. So what's the difference in hard respects between the upcoming PlayStation five and xbox series X. Well, let's take a look under the hood of each of these consoles. The PS five features an A. M. D. S. And two CPU with up to 3.5 gigahertz frequency is sports and I am the radio on GPU that tells 10.3 teraflops running up to 2.23 gigahertz memory and storage, respectively, doll in at 16 gigabytes and 825 gigabytes. The PS five supports both PS the PS five supports both four K. And A K resolutions. Meanwhile, the Xbox series X also features an A. M. D. S. And two CPU but clocks in at 3.8 gigahertz. Instead, the console boasts a similar M. D. Accustomed Gpu with 12 teraflops and 1.8 to 5 gigahertz memory is the same as that of the PS five come in at 16 gigabytes, but the default storage is where th
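Besides the full transcript, the output JSON also carries per-word details. The following sketch lists words the service was less sure about, which can be a useful starting point when deciding what to add to a custom vocabulary. It assumes the `results.items` layout of the standard Transcribe output, where each item has a `type` and a list of `alternatives` with string-valued `confidence` and `content`; the function name, the 0.9 threshold, and the sample values are hypothetical.

```python
def low_confidence_words(transcribe_output, threshold=0.9):
    """Return (word, confidence) pairs for spoken words below the threshold."""
    flagged = []
    for item in transcribe_output['results']['items']:
        # Punctuation items carry no meaningful confidence, so skip them
        if item.get('type') != 'pronunciation':
            continue
        best = item['alternatives'][0]
        confidence = float(best['confidence'])
        if confidence < threshold:
            flagged.append((best['content'], confidence))
    return flagged


# Hypothetical sample in the assumed layout (values are made up for illustration)
sample = {
    'results': {
        'items': [
            {'type': 'pronunciation',
             'alternatives': [{'confidence': '0.99', 'content': 'PlayStation'}]},
            {'type': 'pronunciation',
             'alternatives': [{'confidence': '0.42', 'content': 'radio'}]},
            {'type': 'punctuation',
             'alternatives': [{'confidence': '0.0', 'content': '.'}]},
        ]
    }
}

print(low_confidence_words(sample))  # [('radio', 0.42)]
```

In the notebook, the `data` dictionary loaded from test_transcript.json could be passed in place of `sample`.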

___

# Evaluation Metrics

When evaluating the performance of speech recognition models, we want to use an objective metric. Unfortunately, there isn't a single standard metric. The most common one is Word Error Rate (WER), which compares a reference text to a hypothesis text and is defined as:

\begin{equation*}
WER = \frac{(S+D+I)}{N}
\end{equation*}

where

- S is the number of substitutions: anytime a word gets replaced (for example, “twinkle” is transcribed as “crinkle”)
- D is the number of deletions:  anytime a word is omitted from the transcript (for example, “get it done” becomes “get done”)
- I is the number of insertions:  anytime a word gets added that wasn’t said (for example, “trailblazers” becomes “tray all blazers”)
- N is the number of words in the reference

Many times the accuracy of speech recognition systems is evaluated using the word accuracy (WAcc).  Word accuracy is defined as:

\begin{equation*}
WAcc = 1 - WER
\end{equation*}

Note that lower WER is better: values closer to zero indicate better performance. WER can exceed 1.0 when the hypothesis contains many insertions, and in that case WAcc is less than 0.0.
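To make the definition concrete, here is a minimal from-scratch sketch that counts S, D, and I with a word-level edit distance. It is only an illustration of the formula, not the implementation used by the libraries in this notebook; the function names are hypothetical, and when several minimal alignments exist the split between S, D, and I is one valid choice among them.

```python
def wer_counts(reference, hypothesis):
    """Return (S, D, I, N) for a minimal word-level alignment."""
    ref = reference.split()
    hyp = hypothesis.split()
    # table[i][j] = (total_edits, S, D, I) for ref[:i] versus hyp[:j]
    table = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, i, 0)   # only deletions
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, 0, 0, j)   # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                table[i][j] = table[i - 1][j - 1]   # match, no edit
                continue
            total, s, d, ins = table[i - 1][j - 1]
            substitution = (total + 1, s + 1, d, ins)
            total, s, d, ins = table[i - 1][j]
            deletion = (total + 1, s, d + 1, ins)
            total, s, d, ins = table[i][j - 1]
            insertion = (total + 1, s, d, ins + 1)
            table[i][j] = min(substitution, deletion, insertion)
    _, s, d, ins = table[-1][-1]
    return s, d, ins, len(ref)


def word_error_rate(reference, hypothesis):
    s, d, ins, n = wer_counts(reference, hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one word")
    return (s + d + ins) / n


print(wer_counts("get it done", "get done"))               # (0, 1, 0, 3)
print(wer_counts("trailblazers", "tray all blazers"))      # (1, 0, 2, 1)
print(word_error_rate("trailblazers", "tray all blazers")) # 3.0
```

The last example shows a WER above 1.0: a one-word reference transcribed as three words needs three edits.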

To calculate WER, we'll use some available Python libraries.

### Python Libraries for WER

* [python-Levenshtein](https://pypi.org/project/python-Levenshtein/#id1)
* [jiwer](https://pypi.org/project/jiwer/)
* [asr-evaluation](https://github.com/belambert/asr-evaluation)
* [WER-in-python](https://github.com/zszyellow/WER-in-python)

For this demo, we'll use asr-evaluation and jiwer.

In [10]:
!pip install asr-evaluation



In [11]:
!wer -h

usage: wer [-h] [-i | -r] [--head-ids] [-id] [-c] [-p] [-m count] [-a] [-e]
           ref hyp

Evaluate an ASR transcript against a reference transcript.

positional arguments:
  ref                   Reference transcript filename
  hyp                   ASR hypothesis filename

optional arguments:
  -h, --help            show this help message and exit
  -i, --print-instances
                        Print all individual sentences and their errors.
  -r, --print-errors    Print all individual sentences that contain errors.
  --head-ids            Hypothesis and reference files have ids in the first
                        token? (Kaldi format)
  -id, --tail-ids, --has-ids
                        Hypothesis and reference files have ids in the last
                        token? (Sphinx format)
  -c, --confusions      Print tables of which words were confused.
  -p, --print-wer-vs-length
                        Print table of average WER grouped by reference
                        sent

In [12]:
!wer -i -a aws_blog_groundtruth.txt temp_transcript.txt  

REF: the 2020 holiday season is right around the CORNER. and with the way that the **** YEAR’S been going, we can all hope for a little excitement around the **** NEXT-GEN video game consoles coming out soon. SO, WHAT’S the difference in HARDWARE SPECS between the upcoming playstation 5 and xbox series X? well, LET’S take a look under the HOODS of each ** these consoles. the ** PS5 features an ** ** ** AMD ZEN 2 cpu with up to 3.5 GHZ FREQUENCY. IT sports *** * ** AN AMD RADEON gpu that TOUTS 10.3 TERAFLOPS, running up to 2.23 GHZ. memory and STORAGE RESPECTIVELY DIAL in at 16 GB and 825 GB. the **

In [13]:
!pip install jiwer



In [14]:
from jiwer import wer

In [15]:
with open('aws_blog_groundtruth.txt') as f:
    ground_truth = f.read()
    
with open('test_transcript.json') as f:
    asr = json.load(f)

print(asr['results']['transcripts'][0]['transcript'])

The 2020 holiday season is right around the corner and with the way that the year has been going, we can all hope for a little excitement around the next gen, video game consoles coming out soon. So what's the difference in hard respects between the upcoming PlayStation five and xbox series X. Well, let's take a look under the hood of each of these consoles. The PS five features an A. M. D. S. And two CPU with up to 3.5 gigahertz frequency is sports and I am the radio on GPU that tells 10.3 teraflops running up to 2.23 gigahertz memory and storage, respectively, doll in at 16 gigabytes and 825 gigabytes. The PS five supports both PS the PS five supports both four K. And A K resolutions. Meanwhile, the Xbox series X also features an A. M. D. S. And two CPU but clocks in at 3.8 gigahertz. Instead, the console boasts a similar M. D. Accustomed Gpu with 12 teraflops and 1.8 to 5 gigahertz memory is the same as that of the PS five come in at 16 gigabytes, but the default storage is where th

In [16]:
error = wer(ground_truth.lower(), asr['results']['transcripts'][0]['transcript'].lower())
print(error)

0.336


This example highlights that the WER calculated for the same transcript can differ between libraries (typically because of differences in tokenization and text normalization). How can we drive the WER lower? The lower the WER, the better the transcription and the higher the accuracy.
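One thing that affects any WER number is how the text is normalized before comparison. In the `wer` output above, tokens such as "CORNER." and "SO," are flagged even though the spoken words match, because punctuation is part of the token. The sketch below shows one possible normalization step; it is an assumption for illustration and was not applied to the numbers reported in this notebook, so using it would change them. `normalize_for_wer` is a hypothetical helper name.

```python
import string


def normalize_for_wer(text):
    """Lowercase, unify apostrophes, and strip punctuation from token edges."""
    words = []
    for token in text.lower().replace('\u2019', "'").split():
        # Strip only leading/trailing punctuation so "3.5" and "next-gen" survive
        token = token.strip(string.punctuation)
        if token:
            words.append(token)
    return ' '.join(words)


print(normalize_for_wer("The CORNER. So, what\u2019s next-gen? 3.5 GHz."))
# the corner so what's next-gen 3.5 ghz
```

Applying the same function to both the reference and the hypothesis before calling a WER routine keeps the comparison symmetric.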

___
# Advanced Amazon Transcribe

____
## Custom Vocabularies

In [17]:
# create custom vocabulary
response = transcribe.create_vocabulary(
    VocabularyName='custom_vocab2',
    LanguageCode='en-US',
    VocabularyFileUri='https://s3.amazonaws.com/random.datasets.sengstacken/tmp/custom_vocab_table.txt'
)

In [16]:
response = transcribe.get_vocabulary(
    VocabularyName='custom_vocab2'
)
print(response)

{'VocabularyName': 'custom_vocab2', 'LanguageCode': 'en-US', 'VocabularyState': 'READY', 'LastModifiedTime': datetime.datetime(2021, 5, 25, 18, 32, 17, 591000, tzinfo=tzlocal()), 'DownloadUri': 'https://s3.us-east-1.amazonaws.com/aws-transcribe-dictionary-model-us-east-1-prod/431615879134/custom_vocab2/4acc90d0-e801-4af3-b239-7f105bed1be9/input.txt?X-Amz-Security-Token=IQoJb3JpZ2luX2VjEIb%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLWVhc3QtMSJIMEYCIQCTH3kzbMF2hw%2FY0XXWlSS8g0g31pYQnK3V2xkgldMO8gIhAIkg9yY29VelQ3NkjWTWP8Wfjvx0ZqoL7XXs1BWbfi8gKvoDCC4QAhoMMjc2NjU2NDMzMTUzIgz%2Fn7fe3YHv%2BoRfNCoq1wOPPFB0ft5ZKXf2G0zE4RhH%2FzKWKFr2r75R24%2F8PsRfMkUlG3YD%2F8i7tq8MI21bmdk7e%2FXEPQgHa36VASRsLy7B8D05HlEZ%2BNSaZedKzIFwAoN3CM%2FCKwPEtU0GjL9mhcKsGOjTryvlji4fFtE30Ovc6OQchEMxhGCPh5rRdBNey578XdwTWS8P9Z%2FAUK0bYpuoxSI678%2B3UAA1Juh84IoiG%2BhCLHU%2BzZnUXji2H03IWu%2BTxv3pALdlcveCX3%2B3MB2fbfJw4reLmN%2Fp0%2BrsBl7Lux%2BtRiwxb%2FXwvO3p39VEbRpqkMTEvPsjw%2FnKev1dpj3qsy5ZWdoMglhTVtFlr5%2FnzaKQdOlU3%2BkZvLejYDrpf9c2qqzF

In [17]:
# list all vocabs that are in the "READY" state
response = transcribe.list_vocabularies(
    StateEquals='READY',
)

Now, let's run the same transcription job again. This time, let's use a custom vocabulary and see how the WER improves.

In [18]:
job_name = 'transcribe_w_vocab'
response = transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    LanguageCode='en-US',
    Media={
        'MediaFileUri':job_uri
    },
    OutputBucketName=output_bucket,
    JobExecutionSettings={
        'AllowDeferredExecution': True,
        'DataAccessRoleArn':role
    },
    Settings={
        'VocabularyName': 'custom_vocab2',
    },
 
)
 
while True:
    print("Transcription Started")
    status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        print(status['TranscriptionJob']['TranscriptionJobStatus'])
        break
    print("Not done yet!")
    time.sleep(10)
    
print(transcribe.get_transcription_job(TranscriptionJobName=job_name))

Transcription Started
Not done yet!
Transcription Started
Not done yet!
Transcription Started
Not done yet!
Transcription Started
Not done yet!
Transcription Started
Not done yet!
Transcription Started
COMPLETED
{'TranscriptionJob': {'TranscriptionJobName': 'transcribe_w_vocab', 'TranscriptionJobStatus': 'COMPLETED', 'LanguageCode': 'en-US', 'MediaSampleRateHertz': 44100, 'MediaFormat': 'mp4', 'Media': {'MediaFileUri': 'https://s3.amazonaws.com/random.datasets.sengstacken/tmp/clm-blog-16k-audio.m4a'}, 'Transcript': {'TranscriptFileUri': 'https://s3.us-east-1.amazonaws.com/random.datasets.sengstacken/transcribe_w_vocab.json'}, 'StartTime': datetime.datetime(2021, 5, 26, 13, 57, 59, 692000, tzinfo=tzlocal()), 'CreationTime': datetime.datetime(2021, 5, 26, 13, 57, 59, 669000, tzinfo=tzlocal()), 'CompletionTime': datetime.datetime(2021, 5, 26, 13, 58, 44, 363000, tzinfo=tzlocal()), 'Settings': {'VocabularyName': 'custom_vocab2', 'ChannelIdentification': False, 'ShowAlternatives': False}}, 

In [20]:
def get_transcript(job_name):

    # get transcript URI (path-style: https://s3.<region>.amazonaws.com/<bucket>/<key>)
    job = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    transcript_uri = job['TranscriptionJob']['Transcript']['TranscriptFileUri']
    bucket, key = transcript_uri.split('/')[-2:]

    # download the json transcription output
    s3 = boto3.client('s3')
    s3.download_file(bucket, key, key)

    # read the json transcript
    with open(key) as f:
        data = json.load(f)
    text = data['results']['transcripts'][0]['transcript']

    # write the raw transcript to a TXT file
    with open(key.split('.')[0] + '.txt', 'w') as f:
        f.write(text)

    return text

In [21]:
transcript = get_transcript(job_name)

In [22]:
transcript

"The 2020 holiday season is right around the corner and with the way that the year has been going, we can all hope for a little excitement around the next gen, video game consoles coming out soon. So what's the difference in hard respects between the upcoming PlayStation five and xbox series X. Well, let's take a look under the hood of each of these consoles. The PS five features an A M. D. S and two CPU with up to 3.5 gigahertz frequency is sports and AMD radio on GPU. That tells 10.3 teraflops running up to 2.23 GHz memory and storage respectively, doll in at 16, GB and 825. GB The PS five supports both PS The PS five supports both four K and AK resolutions. Meanwhile, the Xbox series X also features an a M. D s and two CPU but clocks in at 3.8. GHz instead, the console boasts a similar AMD accustomed Gpu with 12 teraflops and 1.8 to 5 GHz memory is the same as that of the PS five come in at 16 GB but the default storage is where the system has an edge bring out a massive one terabyt

In [23]:
!wer -i -a aws_blog_groundtruth.txt {job_name +'.txt'} 

REF: the 2020 holiday season is right around the CORNER. and with the way that the **** YEAR’S been going, we can all hope for a little excitement around the **** NEXT-GEN video game consoles coming out soon. SO, WHAT’S the difference in HARDWARE SPECS between the upcoming playstation 5 and xbox series X? well, LET’S take a look under the HOODS of each ** these consoles. the ** PS5 features an * ** ** AMD ZEN 2 cpu with up to 3.5 GHZ FREQUENCY. IT sports AN amd ***** RADEON GPU that TOUTS 10.3 TERAFLOPS, running up to 2.23 GHZ. memory and storage RESPECTIVELY DIAL in at 16 gb and 825 GB. the ** PS5 supports both ps the

In [24]:
error = wer(ground_truth.lower(), transcript.lower())
print(error)

0.316


Woohoo! We've improved the WER from 33.6% to 31.6%.

___
# Custom Language Models

A recent addition to Amazon Transcribe is custom language models. Use custom language models to train and develop language models that are domain-specific. For example, you can use custom language models to improve transcription performance for domains such as legal, hospitality, finance, and insurance. Although the general model provided by Amazon Transcribe works well in most instances, custom language models might produce even more accurate results.

To train a custom language model, you must upload text data from your specific use case to Amazon Simple Storage Service (Amazon S3), provide Amazon Transcribe with permission to access that data, and choose a base model. A base model is a general speech recognition model, which you customize with your text data. 

In [23]:
training_data = 's3://random.datasets.sengstacken/transcribe/languagemodel/train/'
tuning_data = 's3://random.datasets.sengstacken/transcribe/languagemodel/tune/'

In [24]:
response = transcribe.create_language_model(
    LanguageCode='en-US',
    BaseModelName='WideBand',
    ModelName='python_lm_train_tuning',
    InputDataConfig={
        'S3Uri': training_data,
        'TuningDataS3Uri': tuning_data,
        'DataAccessRoleArn': role
    }
)


In [25]:
response = transcribe.describe_language_model(
    ModelName='python_lm_train_tuning'
)
response

{'LanguageModel': {'ModelName': 'python_lm_train_tuning',
  'CreateTime': datetime.datetime(2021, 5, 25, 18, 32, 34, 402000, tzinfo=tzlocal()),
  'LastModifiedTime': datetime.datetime(2021, 5, 25, 23, 57, 54, 171000, tzinfo=tzlocal()),
  'LanguageCode': 'en-US',
  'BaseModelName': 'WideBand',
  'ModelStatus': 'COMPLETED',
  'UpgradeAvailability': False,
  'InputDataConfig': {'S3Uri': 's3://random.datasets.sengstacken/transcribe/languagemodel/train/',
   'TuningDataS3Uri': 's3://random.datasets.sengstacken/transcribe/languagemodel/tune/',
   'DataAccessRoleArn': 'arn:aws:iam::431615879134:role/service-role/AmazonTranscribeServiceRoleFullAccess-MyTranscribeRole'}},
 'ResponseMetadata': {'RequestId': '6ccc55b3-6b65-432c-b415-805f689b4aa6',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',
   'date': 'Wed, 26 May 2021 14:00:02 GMT',
   'x-amzn-requestid': '6ccc55b3-6b65-432c-b415-805f689b4aa6',
   'content-length': '526',
   'connection': 'keep-alive'

In [26]:
response = transcribe.list_language_models(
    StatusEquals='COMPLETED',
)
response

{'Models': [{'ModelName': 'python_lm_train_tuning',
   'CreateTime': datetime.datetime(2021, 5, 25, 18, 32, 34, 402000, tzinfo=tzlocal()),
   'LastModifiedTime': datetime.datetime(2021, 5, 25, 23, 57, 54, 171000, tzinfo=tzlocal()),
   'LanguageCode': 'en-US',
   'BaseModelName': 'WideBand',
   'ModelStatus': 'COMPLETED',
   'UpgradeAvailability': False,
   'InputDataConfig': {'S3Uri': 's3://random.datasets.sengstacken/transcribe/languagemodel/train/',
    'TuningDataS3Uri': 's3://random.datasets.sengstacken/transcribe/languagemodel/tune/',
    'DataAccessRoleArn': 'arn:aws:iam::431615879134:role/service-role/AmazonTranscribeServiceRoleFullAccess-MyTranscribeRole'}}],
 'ResponseMetadata': {'RequestId': 'f707d5b1-efc7-498c-b3de-09412ca9d713',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',
   'date': 'Wed, 26 May 2021 14:00:07 GMT',
   'x-amzn-requestid': 'f707d5b1-efc7-498c-b3de-09412ca9d713',
   'content-length': '521',
   'connection': 'keep-al

In [27]:
job_name = 'transcribe_w_lm'
lm_name = 'python_lm_train_tuning'
response = transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    LanguageCode='en-US',
    Media={
        'MediaFileUri':job_uri
    },
    OutputBucketName=output_bucket,
    JobExecutionSettings={
        'AllowDeferredExecution': True,
        'DataAccessRoleArn':role
    },
    ModelSettings={
        'LanguageModelName': lm_name
    },
 
)
 
while True:
    status = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        print(status['TranscriptionJob']['TranscriptionJobStatus'])
        break
    print("Not done yet!")
    time.sleep(10)
    
print(transcribe.get_transcription_job(TranscriptionJobName=job_name))

Not done yet!
Not done yet!
Not done yet!
Not done yet!
Not done yet!
Not done yet!
Not done yet!
Not done yet!
COMPLETED
{'TranscriptionJob': {'TranscriptionJobName': 'transcribe_w_lm', 'TranscriptionJobStatus': 'COMPLETED', 'LanguageCode': 'en-US', 'MediaSampleRateHertz': 44100, 'MediaFormat': 'mp4', 'Media': {'MediaFileUri': 'https://s3.amazonaws.com/random.datasets.sengstacken/tmp/clm-blog-16k-audio.m4a'}, 'Transcript': {'TranscriptFileUri': 'https://s3.us-east-1.amazonaws.com/random.datasets.sengstacken/transcribe_w_lm.json'}, 'StartTime': datetime.datetime(2021, 5, 26, 14, 0, 12, 484000, tzinfo=tzlocal()), 'CreationTime': datetime.datetime(2021, 5, 26, 14, 0, 12, 454000, tzinfo=tzlocal()), 'CompletionTime': datetime.datetime(2021, 5, 26, 14, 1, 29, 480000, tzinfo=tzlocal()), 'Settings': {'ChannelIdentification': False, 'ShowAlternatives': False}, 'ModelSettings': {'LanguageModelName': 'python_lm_train_tuning'}}, 'ResponseMetadata': {'RequestId': '58fbeef6-e5ec-4553-b640-6adfff55c

In [28]:
transcript = get_transcript(job_name)

In [29]:
!wer -i -a aws_blog_groundtruth.txt {job_name +'.txt'} 

REF: the 2020 holiday season is right around the corner. and with the way that the YEAR’S been going, we can all HOPE for a little excitement around the **** NEXT-GEN video game consoles coming out soon. SO, WHAT’S the difference in HARDWARE SPECS between the upcoming playstation 5 and xbox series X? well, LET’S take a look under the HOODS of each ** these consoles. the ** PS5 features an ** AMD ZEN 2 cpu with up to 3.5 GHZ frequency. it sports AN AMD RADEON gpu that TOUTS 10.3 TERAFLOPS, running up TO 2.23 GHZ. memory and STORAGE RESPECTIVELY DIAL IN at 16 GB and 825 GB. the ** PS5 supports both ps the **

In [30]:
error = wer(ground_truth.lower(), transcript.lower())
print(error)

0.288


# Summary

The initial WER for the example audio was 33.6%. Using a custom vocabulary improved the WER from 33.6% to 31.6%. Additionally, training a custom language model (used here without the custom vocabulary) brought the WER down to 28.8%. The data used in this notebook is a modified version of the data from the [blog post](https://aws.amazon.com/blogs/machine-learning/building-custom-language-models-to-supercharge-speech-to-text-performance-for-amazon-transcribe/) that describes how to implement custom language models.
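The improvements can also be expressed as relative reductions against the baseline. The short sketch below uses only the three WER values reported in this notebook; `relative_reduction` is a hypothetical helper name.

```python
def relative_reduction(baseline, improved):
    """Fraction of the baseline WER that was removed."""
    return (baseline - improved) / baseline


# WER values reported in this notebook
wer_by_run = {
    'baseline': 0.336,
    'custom vocabulary': 0.316,
    'custom language model': 0.288,
}

baseline = wer_by_run['baseline']
for name, value in wer_by_run.items():
    print(f"{name}: WER={value:.3f}, "
          f"relative reduction={relative_reduction(baseline, value):.1%}")
```

Relative to the baseline, this works out to roughly a 6.0% reduction for the custom vocabulary and roughly a 14.3% reduction for the custom language model.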