# About

This Python notebook demonstrates using the Python library [`punctuator`](https://pypi.org/project/punctuator/) to add punctuation and capitalization to the Speech to Text output of a meeting recording.

The notebook uses a pretrained model created by [Ottokar Tilk](https://ee.linkedin.com/in/ottokar-tilk). See: [Paper: Bidirectional Recurrent Neural Network with Attention Mechanism for Punctuation Restoration](https://www.isca-speech.org/archive/pdfs/interspeech_2016/tilk16_interspeech.pdf)

The notebook also demonstrates using the [Natural Language Toolkit](https://www.nltk.org/) to break the punctuated transcript into an array of sentences.

This notebook is a sample to support a paper presentation at CASCONxEVOKE 2021.

See:
- [CASCONxEVOKE 2021](https://pheedloop.com/casconevoke2021/site/home)
- [Presentation](https://pheedloop.com/casconevoke2021/site/sessions/?id=SESPZ87C5K5VZKT28)
- [Samples GitHub repo](https://github.com/spackows/CASCON-2021_Processing_video)

# Step 1: Download Speech to Text output to working directory

Watson Speech to Text output for a sample meeting recording is available here: [favorite-animals-short-meeting_STT-raw-results.json](https://raw.githubusercontent.com/spackows/CASCON-2021_Processing_video/main/sample-meeting/favorite-animals-short-meeting_STT-raw-results.json)

In this step, download that file to the notebook working directory.

In [1]:
# Download the file
import urllib.request
stt_results_url = "https://raw.githubusercontent.com/spackows/CASCON-2021_Processing_video/main/sample-meeting/favorite-animals-short-meeting_STT-raw-results.json"
stt_results_filename = "favorite-animals-short-meeting_STT-raw-results.json"
urllib.request.urlretrieve( stt_results_url, stt_results_filename )

('favorite-animals-short-meeting_STT-raw-results.json',
 <http.client.HTTPMessage at 0x7faff8608550>)

In [5]:
# View the contents of the working directory
!ls

favorite-animals-short-meeting_STT-raw-results.json


# Step 2: Pull all text into a long string

In [4]:
# Read the raw Watson Speech to Text JSON results from the file
import json
with open( stt_results_filename ) as json_file:
    stt_results_json = json.load( json_file )
print( json.dumps( stt_results_json["results"][0:3], indent=2 ) )

[
  {
    "final": true,
    "alternatives": [
      {
        "transcript": "thanks everybody for joining me for this short meeting ",
        "confidence": 0.95
      }
    ]
  },
  {
    "final": true,
    "alternatives": [
      {
        "transcript": "what I wanted to do today was to go around the room and asked people to share what is their favorite animal and why ",
        "confidence": 0.9
      }
    ]
  },
  {
    "final": true,
    "alternatives": [
      {
        "transcript": "okay hi my name is heather and my favorite animal is a dog and the reason that it's my favorite animal how did a lot of people scare animal ",
        "confidence": 0.87
      }
    ]
  }
]


In [None]:
# Print all results for interest
print( json.dumps( stt_results_json, indent=2 ) )

In [19]:
# Paste all the transcript pieces together into one long string
stt_results_str = ""
for result in stt_results_json["results"]:
    stt_results_str += result["alternatives"][0]["transcript"]
print( stt_results_str )

thanks everybody for joining me for this short meeting what I wanted to do today was to go around the room and asked people to share what is their favorite animal and why okay hi my name is heather and my favorite animal is a dog and the reason that it's my favorite animal how did a lot of people scare animal because dogs are such loving companions who are so very loyal and I feel like they seem to know when you need them to come snuggle by you they're very perceptive of your feelings and they want to please you and I don't think there are well there are mean dogs well most dogs are very lovable and only just want to %HESITATION please humans and so I would say dogs are my favorite animal I can go next so %HESITATION my name Serra and %HESITATION every once in awhile actually reflecting a different animal and they like it connects with a different one but I recently I've been thinking a lot about elephants and I think offensively fast eating they were present in the African culture the
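
The concatenation above can also be written as a small helper. The following is a minimal sketch, not part of the original notebook: `join_transcripts` and `sample_results` are hypothetical names, and the sample data only mimics the shape of the Watson Speech to Text results printed earlier. Unlike the loop above, it strips each piece and joins with a single space, and it skips any result that has no alternatives.

```python
# Hypothetical sample data in the same shape as the Speech to Text results above
sample_results = {
    "results": [
        {"final": True,
         "alternatives": [{"transcript": "thanks everybody for joining me ",
                           "confidence": 0.95}]},
        {"final": True,
         "alternatives": [{"transcript": "what I wanted to do today ",
                           "confidence": 0.9}]},
        # A result with no alternatives is skipped rather than raising an error
        {"final": True, "alternatives": []},
    ]
}

def join_transcripts(stt_json):
    """Join the first alternative of each result into one string."""
    pieces = []
    for result in stt_json.get("results", []):
        alternatives = result.get("alternatives", [])
        if alternatives:
            # Strip the trailing space so pieces are separated by exactly one space
            pieces.append(alternatives[0]["transcript"].strip())
    return " ".join(pieces)

joined = join_transcripts(sample_results)
print(joined)
```

Collecting the pieces in a list and joining once avoids repeated string concatenation, which may matter for long recordings.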

In [23]:
# Remove %HESITATION from the string
# (Those %HESITATION points mark where someone said something like "um" or "uh".)
import re
stt_results_str = re.sub( r"\s*%HESITATION\s*", " ", stt_results_str )
print( stt_results_str )

thanks everybody for joining me for this short meeting what I wanted to do today was to go around the room and asked people to share what is their favorite animal and why okay hi my name is heather and my favorite animal is a dog and the reason that it's my favorite animal how did a lot of people scare animal because dogs are such loving companions who are so very loyal and I feel like they seem to know when you need them to come snuggle by you they're very perceptive of your feelings and they want to please you and I don't think there are well there are mean dogs well most dogs are very lovable and only just want to please humans and so I would say dogs are my favorite animal I can go next so my name Serra and every once in awhile actually reflecting a different animal and they like it connects with a different one but I recently I've been thinking a lot about elephants and I think offensively fast eating they were present in the African culture they are symbol of strength and power a
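
To see what the substitution does on a small input, here is a minimal sketch using the same regular expression. The helper name `remove_hesitations` and the sample string are assumptions for illustration. The extra whitespace cleanup at the end is an addition that the cell above does not perform.

```python
import re

def remove_hesitations(text):
    """Remove %HESITATION markers and tidy up the surrounding whitespace."""
    # Same pattern as the cell above: the marker plus any whitespace around it
    cleaned = re.sub(r"\s*%HESITATION\s*", " ", text)
    # Assumption: also collapse leftover whitespace runs and trim the ends
    return re.sub(r"\s+", " ", cleaned).strip()

sample = "%HESITATION so %HESITATION my name  is heather %HESITATION "
print(remove_hesitations(sample))
```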

# Step 3: Prepare sample Punctuator model

This step uses the Python library [`punctuator`](https://pypi.org/project/punctuator/) to add punctuation and capitalization to the Speech to Text results string.

In particular, this step uses a pretrained model created by [Ottokar Tilk](https://ee.linkedin.com/in/ottokar-tilk).

See:
- [Model: `english-europarl-v7`](https://github.com/ottokart/punctuator2#english-europarl-v7)
- [Paper: Bidirectional Recurrent Neural Network with Attention Mechanism for Punctuation Restoration](https://www.isca-speech.org/archive/pdfs/interspeech_2016/tilk16_interspeech.pdf)

In [None]:
# Install the library for downloading a large file from Google Drive
!pip install gdown

### Get a `link_id` for the pretrained model:
The pretrained model, created by Ottokar Tilk, is available to download from Google Drive:
1. In Ottokar Tilk's [punctuator2 GitHub repo](https://github.com/ottokart/punctuator2), navigate to the "How well does it work" section of the README: [Link](https://github.com/ottokart/punctuator2#how-well-does-it-work)
2. Follow the link where it says "Pretrained models can be downloaded _here_"
3. Right-click on the file named "Demo-Europarl-EN.pcl" and then select **Make a copy**
4. In your Google Drive, right-click on the file named "Copy of Demo-Europarl-EN.pcl" and then click **Get link**
5. The link will be of the form `https://drive.google.com/file/d/<link_id>/view?usp=sharing`
6. Paste that `<link_id>` into the cell below

Create your own copy and `link_id` because Google locks a download link when too many people use the same link to download the file.
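
If you would rather not copy the ID out of the link by hand, the sketch below pulls it out with a regular expression. It assumes the share link has the form shown in step 5. The helper name `extract_link_id` and the ID `EXAMPLE_ID_123` are placeholders, not a real file.

```python
import re

def extract_link_id(share_url):
    """Return the <link_id> part of a Google Drive share link."""
    # Assumes the form https://drive.google.com/file/d/<link_id>/view?usp=sharing
    match = re.search(r"/file/d/([^/]+)", share_url)
    if match is None:
        raise ValueError("Not a recognized Google Drive share link: " + share_url)
    return match.group(1)

# Placeholder URL for illustration only
example_url = "https://drive.google.com/file/d/EXAMPLE_ID_123/view?usp=sharing"
example_link_id = extract_link_id(example_url)
print(example_link_id)
```

The returned value is what goes into `link_id` in the next cell.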

In [48]:
link_id = ""

In [None]:
# Download the pretrained model
import gdown
url = "https://drive.google.com/uc?id=" + link_id
output = "Demo-Europarl-EN.pcl"
gdown.download( url, output, quiet=False )

In [51]:
# View the contents of the working directory
!ls

Demo-Europarl-EN.pcl  favorite-animals-short-meeting_STT-raw-results.json


In [None]:
# Install the punctuator library
!pip install punctuator

In [54]:
# Load the pretrained model
from punctuator import Punctuator
p = Punctuator( "Demo-Europarl-EN.pcl" )

# Step 4: Add punctuation and capitalization

In [55]:
# Add punctuation and capitalization using the pretrained model
stt_results_punc = p.punctuate( stt_results_str )
print( stt_results_punc )

Thanks everybody for joining me for this short meeting. What I wanted to do today was to go around the room and asked people to share what is their favorite animal and why? Okay hi my name is heather and my favorite animal is a dog and the reason that it's my favorite animal. How did a lot of people scare animal, because dogs are such loving companions who are so very loyal and I feel like they seem to know when you need them to come snuggle by you, they're very perceptive of your feelings, and they want to please you and I: don't think there are well, there are mean dogs. Well, most dogs are very lovable and only just want to please humans and so I would say. Dogs are my favorite. Animal I can go next, so my name Serra and every once in awhile, actually reflecting a different animal and they like it, connects with a different one. But I recently I've been thinking a lot about elephants and I think offensively fast eating. They were present in the African culture. They are symbol of st

# Step 5: Split string into sentences

This step uses the [Natural Language Toolkit](https://www.nltk.org/) to break the punctuated text into sentences.  

Specifically, the [NLTK sentence tokenizer](https://www.nltk.org/_modules/nltk/tokenize/punkt.html) is used.

In [56]:
# Download the Punkt sentence tokenizer data
import nltk
nltk.download( "punkt" )

[nltk_data] Downloading package punkt to /home/wsuser/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [57]:
# Split the text into an array of sentences
from nltk import tokenize
sentences_arr = tokenize.sent_tokenize( stt_results_punc )
sentences_arr

['Thanks everybody for joining me for this short meeting.',
 'What I wanted to do today was to go around the room and asked people to share what is their favorite animal and why?',
 "Okay hi my name is heather and my favorite animal is a dog and the reason that it's my favorite animal.",
 "How did a lot of people scare animal, because dogs are such loving companions who are so very loyal and I feel like they seem to know when you need them to come snuggle by you, they're very perceptive of your feelings, and they want to please you and I: don't think there are well, there are mean dogs.",
 'Well, most dogs are very lovable and only just want to please humans and so I would say.',
 'Dogs are my favorite.',
 'Animal I can go next, so my name Serra and every once in awhile, actually reflecting a different animal and they like it, connects with a different one.',
 "But I recently I've been thinking a lot about elephants and I think offensively fast eating.",
 'They were present in the Afri