# Step 1: Read Watson Speech to Text transcript

A transcript file, generated by Watson Speech to Text, for a sample product video is available here: [product_video.txt](https://raw.githubusercontent.com/spackows/CASCON-2021_Processing_video/main/sample-product-video/product_video.txt)

In this step, download that transcript file to the notebook working directory.

In [62]:
# Download the file
import urllib.request
transcript_url = "https://raw.githubusercontent.com/spackows/CASCON-2021_Processing_video/main/sample-product-video/product_video.txt"
transcript_filename = "product_video.txt"
urllib.request.urlretrieve( transcript_url, transcript_filename )

('product_video.txt', <http.client.HTTPMessage at 0x7f82901075e0>)

In [10]:
# View the contents of the working directory
!ls

product_video.txt


# Step 2: Read corrected transcript

A manually corrected transcript for the same product video is available here: [product_video_corrected.txt](https://raw.githubusercontent.com/spackows/CASCON-2021_Processing_video/main/sample-product-video/product_video_corrected.txt)

In this step, download that corrected transcript file to the notebook working directory.

In [63]:
# Download the file
import urllib.request
corrected_url = "https://raw.githubusercontent.com/spackows/CASCON-2021_Processing_video/main/sample-product-video/product_video_corrected.txt"
corrected_filename = "product_video_corrected.txt"
urllib.request.urlretrieve( corrected_url, corrected_filename )

('product_video_corrected.txt', <http.client.HTTPMessage at 0x7f8290107af0>)

In [17]:
# View the contents of the working directory
!ls

product_video_corrected.txt  product_video.txt
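
Both download cells repeat the same steps, so they could be folded into one helper. The sketch below is a minimal illustration, not part of the original notebook: `fetch_if_missing` is a hypothetical name, and skipping files that already exist is an assumed convenience so that re-running the notebook does not download them again.

```python
import os
import urllib.request


def fetch_if_missing(url, filename):
    """Download url to filename unless the file is already present (hypothetical helper)."""
    if not os.path.exists(filename):
        # Same call the cells above use; only runs when the file is absent
        urllib.request.urlretrieve(url, filename)
    return filename


# Possible usage with the variables defined in the cells above:
# transcript_filename = fetch_if_missing(transcript_url, "product_video.txt")
# corrected_filename = fetch_if_missing(corrected_url, "product_video_corrected.txt")
```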


# Step 3: Pull the transcript text and the corrected text into strings

In [64]:
import re
import string

# Strip out timestamps and join the text into one long,
# lowercase string with no punctuation (hyphens are kept)
def stringFromTranscript( filename ):
    with open( filename ) as file:
        lines_arr = file.readlines()
    #print( lines_arr )
    txt = ""
    for line in lines_arr:
        if re.match( r"\S+", line ) and not re.match( r"^\d{2}\:\d{2}\:\d{2}", line ):
            txt += line.strip().lower() + " "
    txt = re.sub( r"\s*\%hesitation\s*", " ", txt )
    # All punctuation except the hyphen, so compounds like "well-defined" survive
    punc_symbols = re.sub( r"\-", "", string.punctuation )
    regex1 = re.compile( r"\s+[" + re.escape( punc_symbols ) + "]" )
    regex2 = re.compile( "[" + re.escape( punc_symbols ) + r"]\s+" )
    regex3 = re.compile( "[" + re.escape( punc_symbols ) + "]" )
    txt = regex1.sub( " ", txt )
    txt = regex2.sub( " ", txt )
    txt = regex3.sub( "", txt )
    txt = re.sub( r"\s+", " ", txt )
    txt = re.sub( r"^\s+", "", txt )
    txt = re.sub( r"\s+$", "", txt )
    return txt
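
To see what this normalization does without downloading anything, here is a simplified, self-contained sketch of the same idea applied to a few in-memory lines. The helper name `normalize_lines` and the sample lines are hypothetical; the only details taken from the function above are that timestamp lines start with `HH:MM:SS`, that `%hesitation` markers are dropped, and that hyphens are kept.

```python
import re
import string


def normalize_lines(lines):
    """Simplified variant of the cleanup above, working on a list of lines."""
    # All punctuation except the hyphen
    punc = string.punctuation.replace("-", "")
    # Skip blank lines and timestamp lines, lowercase the rest
    kept = [
        line.strip().lower()
        for line in lines
        if line.strip() and not re.match(r"\d{2}:\d{2}:\d{2}", line)
    ]
    txt = " ".join(kept)
    txt = txt.replace("%hesitation", " ")
    txt = re.sub("[" + re.escape(punc) + "]", "", txt)
    # Collapse runs of whitespace and trim both ends
    return " ".join(txt.split())


# Hypothetical sample, not taken from the real transcript file
sample = [
    "00:00:01\n",
    "This video shows you %HESITATION how to create\n",
    "\n",
    "00:00:04\n",
    "a well-defined, fairly narrow goal.\n",
]
print(normalize_lines(sample))
# this video shows you how to create a well-defined fairly narrow goal
```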

In [65]:
transcript_txt = stringFromTranscript( transcript_filename )
print( "Speech to text output:\n" )
print( transcript_txt[0:350], "\n..." )

Speech to text output:

this video shows you how to create a watson studio project from the home page you can create a project projects are way to organize resources for specific data science task or goal rajit includes data collaborators notebooks models and so one all to support finding insights for a well defined and fairly narrow goal for example how weather affects s 
...


In [66]:
corrected_txt = stringFromTranscript( corrected_filename )
print( "Corrected transcript:\n" )
print( corrected_txt[0:350], "\n..." )

Corrected transcript:

this video shows you how to create a watson studio project from the home page you can create a project projects are a way to organize resources for a specific data science task or goal a project includes data collaborators notebooks models and so on all to support finding insights for a well-defined and fairly narrow goal for example how weather af 
...


# Step 4: Compare the original transcript with the corrected one

This step uses the library [`difflib`](https://docs.python.org/3/library/difflib.html) to find the words that were removed, replaced, and added.

In [67]:
import difflib


d = difflib.Differ()
diff = d.compare( transcript_txt.split(), corrected_txt.split() )

# In the output below:
# - Words with a "+" preceding them were added in the corrections
# - Words with a "-" preceding them were removed in the corrections
# - Lines with a "?" preceding them can be ignored in our case
# See the description: https://docs.python.org/3/library/difflib.html#difflib.Differ
print( "..." )
print( "\n".join( list( diff )[20:50] ) )
print( "..." )

...
  projects
  are
+ a
  way
  to
  organize
  resources
  for
+ a
  specific
  data
  science
  task
  or
  goal
- rajit
+ a
+ project
  includes
  data
  collaborators
  notebooks
  models
  and
  so
- one
?   -

+ on
  all
  to
...
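
The `Differ` output lists one word per line, so a replacement shows up as a "-" line followed by one or more "+" lines. As an alternative view, `difflib.SequenceMatcher.get_opcodes()` reports each change as a single `insert`, `delete`, or `replace` operation. The sketch below is an illustration only: `word_changes` is a hypothetical helper, and the two short strings are made-up fragments modeled on the output above.

```python
import difflib


def word_changes(original, corrected):
    """Return (tag, removed_words, added_words) for every non-matching span."""
    a, b = original.split(), corrected.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, a[i1:i2], b[j1:j2]))
    return changes


changes = word_changes(
    "are way to goal rajit includes so one all",
    "are a way to goal a project includes so on all",
)
for change in changes:
    print(change)
# ('insert', [], ['a'])
# ('replace', ['rajit'], ['a', 'project'])
# ('replace', ['one'], ['on'])
```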


In [43]:
# Create a helper function to visually display the differences
def htmlDiff( transcript_txt, corrected_txt ):
    d = difflib.Differ()
    diff = d.compare( transcript_txt.split(), corrected_txt.split())
    html = ""
    for word in list( diff ):
        if re.match( r"^\- ", word ):
            html += "<span style='color: red;'>" + re.sub( r"^\-\s+", "", word ) + "</span>" + " "
        elif re.match( r"^\+ ", word ):
            html += "<span style='color: green;'>" + re.sub( r"^\+\s+", "", word ) + "</span>" + " "
        elif re.match( r"^\? ", word ):
            html += " " + " "
        else:
            html += word + " "
    return html

In [71]:
# In the output below:
# - Words in red were removed in the corrections
# - Words in green were added in the corrections
from IPython.display import display, HTML
html = htmlDiff( transcript_txt, corrected_txt )
display( HTML( "...<br/>" + html[143:1840] + "<br/>..." ) )

In [82]:
display( HTML( "...<br/>" + html[5573:] + "<br/>..." ) )

# Step 5: Generate dictionary files for customizing Watson Speech to Text

The transcript above was generated by Watson Speech to Text, using the built-in model: "en-US_BroadbandModel".

In our projects, we customized the language model in our Watson Speech to Text service to recognize our domain-specific jargon.  We did this by creating dictionaries of custom words:
1. Programmatically compare the original transcript with the manually corrected transcript using `difflib`
2. Use the words added to generate an initial custom words dictionary
3. Manually correct and refine the custom words dictionary
4. Use the Watson Speech to Text API to add the custom words dictionary

Steps 1 and 2 are shown below.

See also:
- [Understanding customization](https://cloud.ibm.com/docs/speech-to-text?topic=speech-to-text-customization)
- [Creating a custom language model](https://cloud.ibm.com/docs/speech-to-text?topic=speech-to-text-languageCreate)
- [Add words to the custom language model](https://cloud.ibm.com/docs/speech-to-text?topic=speech-to-text-languageCreate#addWords)
- [API: Add custom words](https://cloud.ibm.com/apidocs/speech-to-text#addwords)

In [163]:
# Create a helper function to generate the initial custom words dictionary

def listAdditions( transcript_txt, corrected_txt ):
    d = difflib.Differ()
    diff = d.compare( transcript_txt.split(), corrected_txt.split())
    added_words = []
    txt = ""
    for word in list( diff ):
        if re.match( r"^\+", word ):
            txt += word[1:]
        elif re.search( r"\S", txt ):
            txt = re.sub( r"^\s*", "", txt )
            if txt not in added_words:
                added_words.append( txt )
            txt = ""
    added_words.sort()
    return added_words

In [164]:
added_words = listAdditions( transcript_txt, corrected_txt )
added_words

['a',
 'a project',
 'an',
 'and',
 'and a',
 'definedcrowd',
 'federation',
 'github',
 'on',
 'or',
 'pak',
 'predict',
 'readme',
 'saml',
 'select',
 'so on',
 'then create',
 'there',
 'well-defined',
 'zipped']
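
Many of these candidates are everyday words ('a', 'an', 'and') rather than domain jargon, which is why the list needs manual refinement. A first pass could be automated by dropping candidates made up entirely of common words. The sketch below is an assumption-laden illustration: `drop_common_terms` and the small `COMMON_WORDS` set are hypothetical, and a real project would likely want a fuller stop-word list plus human review.

```python
# Hypothetical stop-word set, chosen only for this illustration
COMMON_WORDS = {"a", "an", "and", "on", "or", "so", "there"}


def drop_common_terms(candidates, common_words=COMMON_WORDS):
    """Keep a candidate only if at least one of its words is not a common word."""
    return [
        term
        for term in candidates
        if not all(word in common_words for word in term.split())
    ]


candidates = ["a", "a project", "and a", "github", "readme", "so on", "well-defined"]
print(drop_common_terms(candidates))
# ['a project', 'github', 'readme', 'well-defined']
```

Phrases such as 'a project' still get through, so this only narrows the list before the manual step.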

The custom words dictionaries are in this form:
```
[
   { word : "IBM-Cloud", sounds_like : [ "I. B. M. Cloud" ],     display_as : "IBM Cloud" },
   { word : "README",    sounds_like : [ "read me" ],            display_as : "README" },
   { word : "GitHub",    sounds_like : [ "git hub", "get hub" ], display_as : "GitHub" },
   ...
]
```

Where:
- `word` is an id for the custom word (ids cannot contain spaces)
- `sounds_like` provides one or more phonetically written pronunciations
- `display_as` is how you want the custom word to be written in the transcript (including capitalization)
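
As a rough sketch of how refined entries might be serialized, the code below builds the three fields described above and prints them as JSON. Two things are assumptions rather than something this notebook shows: the helper name `to_words_payload` is hypothetical, and wrapping the entries in a top-level `words` key reflects our reading of the Add custom words API, so check the API reference linked above before relying on it. Replacing spaces with hyphens in `word` follows the "IBM-Cloud" example.

```python
import json


def to_words_payload(entries):
    """Build a payload from (display_as, sounds_like list) pairs (hypothetical helper)."""
    words = []
    for display_as, sounds_like in entries:
        words.append({
            # ids cannot contain spaces, so "IBM Cloud" becomes "IBM-Cloud"
            "word": display_as.replace(" ", "-"),
            "sounds_like": list(sounds_like),
            "display_as": display_as,
        })
    # Assumed wrapper key; confirm against the API reference
    return {"words": words}


entries = [
    ("IBM Cloud", ["I. B. M. Cloud"]),
    ("README", ["read me"]),
    ("GitHub", ["git hub", "get hub"]),
]
payload = to_words_payload(entries)
print(json.dumps(payload, indent=2))
```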


In [183]:
import json

# Helper function to generate initial dictionary, ready for manual refinement
def genDictionary( added_words ):
    custom_words = []
    for term in added_words:
        custom_words.append( { "word" : term, "sounds_like" : [ term ], "display_as" : term } )
    return custom_words  

# Helper function to view dictionary
def prettyPrintDictionary( custom_words ):
    # Use "out" rather than "str" so the built-in str type is not shadowed
    out = "[\n"
    for entry in custom_words:
        word = "word: '" + entry["word"] + "'" + " "*( 13 - len( entry["word"] ) )
        sounds_like = "sounds_like: [ '" + entry["sounds_like"][0] + "' ]" + " "*( 13 - len( entry["sounds_like"][0] ) )
        display_as = "display_as: '" + entry["display_as"] + "'" + " "*( 13 - len( entry["display_as"] ) )
        out += "   { " + word + sounds_like + display_as + " },\n"
    # Drop the comma after the last entry
    out = re.sub( r"\s*,$", "", out )
    out += "]"
    print( out )

In [191]:
custom_words = genDictionary( added_words )
prettyPrintDictionary( custom_words )

[
   { word: 'a'            sounds_like: [ 'a' ]            display_as: 'a'             },
   { word: 'a project'    sounds_like: [ 'a project' ]    display_as: 'a project'     },
   { word: 'an'           sounds_like: [ 'an' ]           display_as: 'an'            },
   { word: 'and'          sounds_like: [ 'and' ]          display_as: 'and'           },
   { word: 'and a'        sounds_like: [ 'and a' ]        display_as: 'and a'         },
   { word: 'definedcrowd' sounds_like: [ 'definedcrowd' ] display_as: 'definedcrowd'  },
   { word: 'federation'   sounds_like: [ 'federation' ]   display_as: 'federation'    },
   { word: 'github'       sounds_like: [ 'github' ]       display_as: 'github'        },
   { word: 'on'           sounds_like: [ 'on' ]           display_as: 'on'            },
   { word: 'or'           sounds_like: [ 'or' ]           display_as: 'or'            },
   { word: 'pak'          sounds_like: [ 'pak' ]          display_as: 'pak'           },
   { word: 'predict