# <img src="imgs/transcribe.png" alt="Amazon Transcribe" style="width: 70px;"/> Amazon Transcribe 


Amazon Transcribe is an automatic speech recognition (ASR) service that makes it easy for developers to add speech-to-text capability to their applications. Using the Amazon Transcribe API, you can analyze audio files stored in Amazon S3 and have the service return a text file of the transcribed speech. You can also send a live audio stream to Amazon Transcribe and receive a stream of transcripts in real time.

Amazon Transcribe can be used for many common applications, including transcribing customer service calls and generating subtitles for audio and video content. The service can transcribe audio files stored in common formats, such as WAV and MP3, and returns a timestamp for every word so that you can locate the matching audio in the original source by searching the text. Amazon Transcribe is continually learning and improving to keep pace with the evolution of language.


# How it works
Amazon Transcribe converts speech to text. A basic transcription request produces a transcript that contains data about the transcribed content, including confidence scores and timestamps for each word or punctuation mark. For a complete list of features that you can apply to your transcription, refer to the feature summary.

Transcription methods can be separated into two main categories:

* Batch transcription jobs: Transcribe media files that have been uploaded into an Amazon S3 bucket.

* Streaming transcriptions: Transcribe media streams in real time.

## Supported file formats for batch transcription jobs

![supported format](imgs/transcribe-supported-format.png)

# Getting Started
In this project, we use Amazon Transcribe to create subtitle files for a set of movies.
Disclaimer: These movies were obtained from the Internet Archive and are freely available online. The movies used in this example are:

* [Wonderful World 1959](https://archive.org/details/0731_Wonderful_World_19_01_23_00) 
* [Achievement USA 1955](https://archive.org/details/Achievem1955)
* [Adventure of Mr Wonderbird]()


## Setting up the Boto3 client and SageMaker session

In [297]:
import boto3
import sagemaker
import time
import os

In [298]:
transcribe_client = boto3.client("transcribe")
session = sagemaker.Session()
default_bucket = session.default_bucket()

In [299]:
media_file_s3_input_prefix = "data/amazon-transcribe/input"
media_file_s3_output_prefix = "data/amazon-transcribe/output"

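def submit_transcribe_job(media_file_s3_uri,
                          output_bucket_name,
                          output_prefix="data/amazon-transcribe/output",
                          media_format="mp4",
                          identify_language=True,
                          identify_multiple_languages=True):
    """Start a batch transcription job that also generates SRT subtitles."""
    # A request may set only one language option, so multi-language
    # identification takes precedence over single-language identification.
    if identify_multiple_languages:
        language_options = {'IdentifyMultipleLanguages': True}
    elif identify_language:
        language_options = {'IdentifyLanguage': True}
    else:
        raise ValueError("Enable identify_language or identify_multiple_languages")

    media_file_name = media_file_s3_uri.split("/")[-1]
    output_name = media_file_name.replace(" ", "_")
    # The timestamp suffix keeps job names unique across runs.
    transcription_name = f"{output_name}-{time.time()}"

    response = transcribe_client.start_transcription_job(
        TranscriptionJobName=transcription_name,
        MediaFormat=media_format,
        Media={'MediaFileUri': media_file_s3_uri},
        OutputBucketName=output_bucket_name,
        # A key ending in "/" is used as a folder; files are named after the job.
        OutputKey=f"{output_prefix}/{output_name}/",
        Subtitles={'Formats': ['srt'], 'OutputStartIndex': 1},
        # Label up to 10 distinct speakers in the transcript.
        Settings={'ShowSpeakerLabels': True, 'MaxSpeakerLabels': 10},
        **language_options,
    )
    return response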
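
The job name above is derived from the media file name by replacing spaces only. The `StartTranscriptionJob` API restricts job names to the characters `0-9`, `a-z`, `A-Z`, `.`, `_`, and `-`, with a maximum length of 200, so a file name containing other characters would be rejected. The following is a minimal sketch of a stricter name builder; `make_job_name` is a hypothetical helper and is not part of the original notebook.

```python
import re
import time

# Constraints from the StartTranscriptionJob API reference:
# only [0-9a-zA-Z._-] is allowed, and names are at most 200 characters long.
_INVALID_JOB_NAME_CHARS = re.compile(r"[^0-9a-zA-Z._-]")
MAX_JOB_NAME_LENGTH = 200


def make_job_name(media_file_name, timestamp=None):
    """Build a valid, unique job name from a media file name (hypothetical helper)."""
    if timestamp is None:
        timestamp = time.time()
    suffix = f"-{timestamp}"
    # Replace every disallowed character with an underscore.
    safe_name = _INVALID_JOB_NAME_CHARS.sub("_", media_file_name)
    # Trim the file-name part so that name plus suffix fits within the limit.
    safe_name = safe_name[: MAX_JOB_NAME_LENGTH - len(suffix)]
    return f"{safe_name}{suffix}"


print(make_job_name("My Movie (1959).mp4", timestamp=1676770842.35))
# My_Movie__1959_.mp4-1676770842.35
```

Passing the timestamp explicitly, as in the example call, makes the result reproducible; leaving it out falls back to the current time, as in the notebook.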
    

## Loop through the S3 bucket location and invoke a transcribe job for each media file.

In [300]:
s3_client = boto3.client("s3")
objects = s3_client.list_objects_v2(Bucket=default_bucket, Prefix=media_file_s3_input_prefix)

responses = []
# .get() avoids a KeyError when no objects match the prefix.
for obj in objects.get('Contents', []):
    key = obj['Key']
    if key.endswith("/"):
        continue  # skip zero-byte "folder" placeholder objects
    # Use the file extension as the media format, for example "mp4".
    media_format = key.split(".")[-1].lower()
    input_file = f"s3://{default_bucket}/{key}"
    response = submit_transcribe_job(media_file_s3_uri=input_file, 
                                     output_bucket_name=default_bucket, 
                                     output_prefix=media_file_s3_output_prefix,
                                     media_format=media_format)
    responses.append(response)
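
The loop above submits a job for every object under the prefix, whatever its type. As a sketch of a more defensive approach, the hypothetical helper `select_media_keys` below keeps only keys whose extension matches one of the `MediaFormat` values accepted by `StartTranscriptionJob`. The sample keys are made up for illustration.

```python
import os

# MediaFormat values accepted by the StartTranscriptionJob API.
SUPPORTED_MEDIA_FORMATS = {"amr", "flac", "m4a", "mp3", "mp4", "ogg", "wav", "webm"}


def select_media_keys(keys):
    """Return (key, media_format) pairs for keys that look like supported media files."""
    selected = []
    for key in keys:
        if key.endswith("/"):
            continue  # zero-byte "folder" placeholder object
        extension = os.path.splitext(key)[1].lstrip(".").lower()
        if extension in SUPPORTED_MEDIA_FORMATS:
            selected.append((key, extension))
    return selected


# Illustrative keys only; these are not the contents of a real bucket.
sample_keys = [
    "data/amazon-transcribe/input/",
    "data/amazon-transcribe/input/Achievem1955.mp4",
    "data/amazon-transcribe/input/notes.txt",
    "data/amazon-transcribe/input/Interview.WAV",
]
print(select_media_keys(sample_keys))
```

Note that a single `list_objects_v2` call returns at most 1,000 keys. For larger prefixes, `s3_client.get_paginator("list_objects_v2")` can be used to collect the keys page by page before filtering them.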
    

## Here we'll loop through all the jobs and monitor the status of each.

In [301]:
while True:
    completion_cnt = 0
    for response in responses:
        job_name = response['TranscriptionJob']['TranscriptionJobName']
        job = transcribe_client.get_transcription_job(TranscriptionJobName=job_name)
        job_status = job['TranscriptionJob']['TranscriptionJobStatus']
        # FAILED and COMPLETED are terminal states. A finished job is reported
        # again on every polling pass until all jobs are done.
        if job_status in ['FAILED', 'COMPLETED']:
            print(f"Job {job_name} completed with status: {job_status}")
            completion_cnt += 1
    if completion_cnt == len(responses):
        break
    # Wait a minute before polling again.
    time.sleep(60)


Job Achievem1955.mp4-1676770842.3578079 completed with status: COMPLETED
Job bandicam-2021-02-15.mp4-1676770842.4853857 completed with status: COMPLETED
Job Achievem1955.mp4-1676770842.3578079 completed with status: COMPLETED
Job bandicam-2021-02-15.mp4-1676770842.4853857 completed with status: COMPLETED
Job 0731_Wonderful_World_19_01_23_00_3mb.mp4-1676770842.1546438 completed with status: COMPLETED
Job Achievem1955.mp4-1676770842.3578079 completed with status: COMPLETED
Job bandicam-2021-02-15.mp4-1676770842.4853857 completed with status: COMPLETED
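
The repeated lines in the output come from the loop reporting finished jobs on every pass. A variant that reports each job once and gives up after a timeout could look like the sketch below. `wait_for_jobs` and its `fetch_status` callback are hypothetical names introduced for this example; the demonstration uses scripted statuses so that it runs without AWS access.

```python
import time


def wait_for_jobs(job_names, fetch_status, poll_seconds=60,
                  timeout_seconds=3600, sleep=time.sleep):
    """Poll until every job is COMPLETED or FAILED; return {job_name: status}."""
    terminal_states = {"COMPLETED", "FAILED"}
    pending = list(job_names)
    finished = {}
    waited = 0
    while pending:
        still_pending = []
        for job_name in pending:
            status = fetch_status(job_name)
            if status in terminal_states:
                # Finished jobs leave the pending list, so each is reported once.
                finished[job_name] = status
                print(f"Job {job_name} finished with status: {status}")
            else:
                still_pending.append(job_name)
        pending = still_pending
        if not pending:
            break
        if waited >= timeout_seconds:
            raise TimeoutError(f"Jobs still running after {waited}s: {pending}")
        sleep(poll_seconds)
        waited += poll_seconds
    return finished


# Offline demonstration: scripted statuses stand in for real API calls.
scripted = {"job-a": iter(["IN_PROGRESS", "COMPLETED"]), "job-b": iter(["FAILED"])}
result = wait_for_jobs(["job-a", "job-b"], lambda name: next(scripted[name]),
                       poll_seconds=0, sleep=lambda seconds: None)
print(result)
```

In the notebook, `fetch_status` would wrap the same call used above, for example `lambda name: transcribe_client.get_transcription_job(TranscriptionJobName=name)['TranscriptionJob']['TranscriptionJobStatus']`.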


In [302]:
from urllib.parse import urlparse

os.makedirs("output", exist_ok = True)

for response in responses:
    job_name = response['TranscriptionJob']['TranscriptionJobName']
    transcription_response = transcribe_client.get_transcription_job(
        TranscriptionJobName=job_name
    ) 
    subfolder_name = f"output/{job_name}"
    os.makedirs(subfolder_name, exist_ok = True)
    media_s3_uri = transcription_response['TranscriptionJob']['Media']['MediaFileUri']
    parsed_media_uri = urlparse(media_s3_uri)
    media_bucket = parsed_media_uri.hostname
    media_file_key = parsed_media_uri.path[1:]
    base_media_filename = parsed_media_uri.path.split("/")[-1]
    
    transcription_file_uri = transcription_response['TranscriptionJob']['Transcript']['TranscriptFileUri']
    subtitle_file_uri = transcription_response['TranscriptionJob']['Subtitles']['SubtitleFileUris'][0]

    parsed_transcription_uri = urlparse(transcription_file_uri).path.split("/")
    transcription_bucket = parsed_transcription_uri[1]
    transcription_s3_key = "/".join(parsed_transcription_uri[2:])
    base_transcription_filename = parsed_transcription_uri[-1]

    parsed_subtitle_uri = urlparse(subtitle_file_uri).path.split("/")
    subtitle_bucket = parsed_subtitle_uri[1]
    subtitle_s3_key = "/".join(parsed_subtitle_uri[2:])
    base_subtitle_filename = parsed_subtitle_uri[-1]

    s3_client.download_file(media_bucket, media_file_key, f"{subfolder_name}/{base_media_filename}")
    s3_client.download_file(transcription_bucket, transcription_s3_key, f"{subfolder_name}/{base_transcription_filename}")
    s3_client.download_file(subtitle_bucket, subtitle_s3_key, f"{subfolder_name}/{base_subtitle_filename}")
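
The cell above parses two kinds of locations: the media file is an `s3://bucket/key` URI, while the transcript and subtitle files are treated as HTTPS URLs with the bucket as the first path segment. The parsing could be gathered into one function, sketched below under that same assumption. `split_s3_location` is a hypothetical helper, and the bucket and key names in the example are made up. Virtual-hosted-style URLs, where the bucket is part of the host name, are not handled.

```python
from urllib.parse import urlparse


def split_s3_location(uri):
    """Return (bucket, key) for an s3:// URI or a path-style HTTPS S3 URL."""
    parsed = urlparse(uri)
    if parsed.scheme == "s3":
        # s3://bucket/key: the bucket is the network location.
        return parsed.netloc, parsed.path.lstrip("/")
    if parsed.scheme in ("http", "https"):
        # https://<endpoint>/bucket/key: the bucket is the first path segment.
        bucket, _, key = parsed.path.lstrip("/").partition("/")
        return bucket, key
    raise ValueError(f"Unsupported S3 location: {uri}")


# Illustrative locations only.
print(split_s3_location("s3://my-bucket/data/in/movie.mp4"))
print(split_s3_location("https://s3.us-east-1.amazonaws.com/my-bucket/data/out/job.json"))
```

Each returned pair can be passed directly to `s3_client.download_file(bucket, key, local_path)`.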
    
    

In [303]:
!pip install ffmpeg-python -q



## To test the subtitles, let's first play a video file without subtitles

In [304]:
subfolder_name = "output/Achievem1955.mp4-1676770842.3578079"
base_media_filename="Achievem1955.mp4"
base_subtitle_filename = "Achievem1955.mp4-1676770842.3578079.srt"
base_transcription_filename = "Achievem1955.mp4-1676770842.3578079.json"

In [305]:
from IPython.display import HTML

local_media_file = f"{subfolder_name}/{base_media_filename}"
HTML(f"""
    <video alt="video" controls>
        <source src="{local_media_file}" type="video/mp4">
    </video>
""")


In [306]:
!apt update && apt install ffmpeg -y -qq

Get:1 http://security.debian.org/debian-security buster/updates InRelease [34.8 kB]
Hit:2 http://deb.debian.org/debian buster InRelease
Hit:3 http://deb.debian.org/debian buster-updates InRelease
Get:4 http://security.debian.org/debian-security buster/updates/main amd64 Packages [433 kB]
Fetched 468 kB in 0s (1255 kB/s)
Reading package lists... Done
Building dependency tree       
Reading state information... Done
59 packages can be upgraded. Run 'apt list --upgradable' to see them.
ffmpeg is already the newest version (7:4.1.10-0+deb10u1).
0 upgraded, 0 newly installed, 0 to remove and 59 not upgraded.


In [307]:
import ffmpeg

In [308]:
local_subtitle_file = f"{subfolder_name}/{base_subtitle_filename}"
local_media_file_no_ext = local_media_file.split("/")[-1].split(".")[0]
local_file_no_ext = f"{subfolder_name}/{local_media_file_no_ext}"
local_media_file_ext = local_media_file.split(".")[-1]
subtitled_media_file = f'{local_file_no_ext}.srt.{local_media_file_ext}'

## To test the subtitle files, we'll use an open source video processing tool called FFmpeg to burn the subtitles into the original video file

In [None]:
try:
    video = ffmpeg.input(local_media_file)
    audio = video.audio
    # Render the subtitles onto the video frames, then rejoin the original audio.
    subtitled_video = video.filter("subtitles", local_subtitle_file)
    (
        ffmpeg.concat(subtitled_video, audio, v=1, a=1)
        .output(subtitled_media_file)
        .run(capture_stdout=True, capture_stderr=True, overwrite_output=True)
    )
except ffmpeg.Error as e:
    print('stdout:', e.stdout.decode('utf8'))
    print('stderr:', e.stderr.decode('utf8'))
    raise
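
For comparison, a similar result can be obtained by calling the `ffmpeg` command-line tool directly. The sketch below only builds the argument list; `build_burn_in_command` is a hypothetical helper and the file names are placeholders. Unlike the cell above, it copies the audio stream instead of re-encoding it. It assumes an FFmpeg build that includes the `subtitles` filter and a subtitle path free of characters that need escaping in a filter expression, such as `:`, `,`, or quotes.

```python
import shlex


def build_burn_in_command(media_path, subtitle_path, output_path):
    """Return the ffmpeg argument list that burns an SRT file into a video."""
    return [
        "ffmpeg",
        "-y",                                  # overwrite the output file if it exists
        "-i", media_path,                      # input video
        "-vf", f"subtitles={subtitle_path}",   # draw the subtitles onto the frames
        "-c:a", "copy",                        # keep the original audio stream as-is
        output_path,
    ]


# Placeholder file names for illustration.
command = build_burn_in_command("movie.mp4", "movie.srt", "movie.srt.mp4")
print(shlex.join(command))
# ffmpeg -y -i movie.mp4 -vf subtitles=movie.srt -c:a copy movie.srt.mp4
```

The list can be run with `subprocess.run(command, check=True)`. Passing a list rather than a single string avoids shell quoting problems with file names that contain spaces.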

## Now we'll play the processed video with the subtitles burned in during the previous step

In [None]:
HTML(f"""
    <video alt="video" controls>
        <source src="{subtitled_media_file}" type="video/mp4">
    </video>
""")


## Transcribe Job Output
Transcription output is in JSON format. The first part of your transcript contains the transcript itself, in paragraph form, followed by additional data for every word and punctuation mark. The data provided depends on the features you include in your request. 

* At a minimum, your transcript contains the **start time, end time, and confidence score** for every word. 
* All batch transcripts are stored in Amazon S3 buckets. 

Let's examine the transcription output from the job that created the subtitle for the videos we just saw.

In [267]:
local_transcription_output = f"{subfolder_name}/{base_transcription_filename}"

In [276]:
import json
with open(local_transcription_output, "r") as f:
    transcribe_output = json.load(f)

The top-level keys in the transcription output are:

* jobName
* accountId
* results
* status

In [278]:
transcribe_output.keys()

dict_keys(['jobName', 'accountId', 'results', 'status'])

Let's dive into the results to better understand the structure of the transcription

In [280]:
transcribe_output['results'].keys()

dict_keys(['transcripts', 'speaker_labels', 'items', 'language_codes'])

If **ShowSpeakerLabels** is enabled in the settings of an Amazon Transcribe job, Transcribe identifies the speakers and partitions the transcript by individual speaker.

Here are some of the details available for each identified speaker:

* number of speakers
* overall start time and end time for each identified speaker
* speaker by individual segment, with timeline that corresponds to each speaker.

In [286]:
# showing a segment of a speaker
transcribe_output['results']['speaker_labels']['segments'][0]

{'start_time': '22.41',
 'speaker_label': 'spk_0',
 'end_time': '30.7',
 'items': [{'start_time': '22.41',
   'speaker_label': 'spk_0',
   'end_time': '22.69'},
  {'start_time': '22.69', 'speaker_label': 'spk_0', 'end_time': '22.96'},
  {'start_time': '22.96', 'speaker_label': 'spk_0', 'end_time': '23.13'},
  {'start_time': '23.13', 'speaker_label': 'spk_0', 'end_time': '23.24'},
  {'start_time': '23.25', 'speaker_label': 'spk_0', 'end_time': '23.79'},
  {'start_time': '23.79', 'speaker_label': 'spk_0', 'end_time': '24.05'},
  {'start_time': '24.05', 'speaker_label': 'spk_0', 'end_time': '24.16'},
  {'start_time': '24.16', 'speaker_label': 'spk_0', 'end_time': '24.27'},
  {'start_time': '24.27', 'speaker_label': 'spk_0', 'end_time': '25.06'},
  {'start_time': '25.44', 'speaker_label': 'spk_0', 'end_time': '25.73'},
  {'start_time': '25.74', 'speaker_label': 'spk_0', 'end_time': '25.91'},
  {'start_time': '25.91', 'speaker_label': 'spk_0', 'end_time': '26.02'},
  {'start_time': '26.02',
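
Because every segment carries a speaker label together with start and end times, the segments can be aggregated per speaker. The sketch below sums the speaking time for each label. `speaking_time_by_speaker` is a hypothetical helper, and the sample data is hand-written in the shape shown above: only the first segment's times come from the real output, the other two are invented for illustration.

```python
from collections import defaultdict


def speaking_time_by_speaker(segments):
    """Sum segment durations, in seconds, for each speaker label."""
    totals = defaultdict(float)
    for segment in segments:
        # Times are stored as strings in the transcript JSON.
        duration = float(segment["end_time"]) - float(segment["start_time"])
        totals[segment["speaker_label"]] += duration
    return {speaker: round(total, 2) for speaker, total in totals.items()}


# Hand-written sample; the second and third segments are invented.
sample_segments = [
    {"start_time": "22.41", "end_time": "30.7", "speaker_label": "spk_0", "items": []},
    {"start_time": "31.0", "end_time": "35.5", "speaker_label": "spk_1", "items": []},
    {"start_time": "36.0", "end_time": "40.0", "speaker_label": "spk_0", "items": []},
]
print(speaking_time_by_speaker(sample_segments))
# {'spk_0': 12.29, 'spk_1': 4.5}
```

In the notebook, the same function could be applied to `transcribe_output['results']['speaker_labels']['segments']`.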

The output also contains the full transcript as a single block of text. Here's the transcript of the video that we just watched:

In [290]:
transcribe_output['results']['transcripts'][0]['transcript']

"It looks like an ordinary day in the USA. But in the city of flint michigan all is excitement, Even a small fryer buzzing and the older boys and girls are let out of school. Oh, this is a day! The whole town's a bustle. Yes, siree. There's going to be a parade too. And what's a parade without festive budding and gala decorations and vans? So the boys with the tall Chicos practice their structure and all over town. The final touches are put on sleek and shiny floats for this parade is going to be a mile long and some of the out of town floats have to be hustled over the road to make it on time. And in the town auditorium, a troop of broadway and Hollywood artists feverishly polished their song and dance routines, symbolizing the teamwork and progress of GM people everywhere for their show is to be played to a standout audience of very important guests, hundreds of guests, many of whom will arrive in flint on a train pulled by a glistening diesel. A golden engine for a golden day. Yes s

# Transcripts
For each identified word, the transcription job returns a time window (start and end time), the identified language code, a speaker label, a confidence score, and the item type.

Let's take a look at the first 5 items returned from the transcription job:

In [296]:
transcribe_output['results']['items'][:5]

[{'start_time': '22.41',
  'language_code': 'en-US',
  'speaker_label': 'spk_0',
  'end_time': '22.69',
  'alternatives': [{'confidence': '1.0', 'content': 'It'}],
  'type': 'pronunciation'},
 {'start_time': '22.69',
  'language_code': 'en-US',
  'speaker_label': 'spk_0',
  'end_time': '22.96',
  'alternatives': [{'confidence': '1.0', 'content': 'looks'}],
  'type': 'pronunciation'},
 {'start_time': '22.96',
  'language_code': 'en-US',
  'speaker_label': 'spk_0',
  'end_time': '23.13',
  'alternatives': [{'confidence': '1.0', 'content': 'like'}],
  'type': 'pronunciation'},
 {'start_time': '23.13',
  'language_code': 'en-US',
  'speaker_label': 'spk_0',
  'end_time': '23.24',
  'alternatives': [{'confidence': '1.0', 'content': 'an'}],
  'type': 'pronunciation'},
 {'start_time': '23.25',
  'language_code': 'en-US',
  'speaker_label': 'spk_0',
  'end_time': '23.79',
  'alternatives': [{'confidence': '1.0', 'content': 'ordinary'}],
  'type': 'pronunciation'}]
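
The per-word confidence scores are useful for spotting words that may need manual review before the subtitles are published. The sketch below lists words whose confidence falls below a threshold. `low_confidence_words` and the 0.9 threshold are assumptions made for this example. The sample items follow the structure shown above, but the second word's score and the punctuation item are invented for illustration.

```python
def low_confidence_words(items, threshold=0.9):
    """Return (start_time, word, confidence) for words scored below the threshold."""
    flagged = []
    for item in items:
        if item.get("type") != "pronunciation":
            continue  # punctuation items have no timestamps to report
        best = item["alternatives"][0]
        confidence = float(best["confidence"])
        if confidence < threshold:
            flagged.append((float(item["start_time"]), best["content"], confidence))
    return flagged


# Sample items; the 0.62 score and the punctuation entry are invented.
sample_items = [
    {"start_time": "22.41", "end_time": "22.69", "type": "pronunciation",
     "alternatives": [{"confidence": "1.0", "content": "It"}]},
    {"start_time": "23.25", "end_time": "23.79", "type": "pronunciation",
     "alternatives": [{"confidence": "0.62", "content": "fryer"}]},
    {"type": "punctuation",
     "alternatives": [{"confidence": "0.0", "content": "."}]},
]
print(low_confidence_words(sample_items))
# [(23.25, 'fryer', 0.62)]
```

In the notebook, calling `low_confidence_words(transcribe_output['results']['items'])` would return the flagged words with their start times, which makes it easy to jump to the matching position in the video.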