# Sign Language Recognition Demo

This Jupyter notebook is a quick sign language recognition demo. Follow along to train a sign language recognizer!

In [None]:
%load_ext autoreload
%autoreload 2

!pip3 install memory_profiler
%load_ext memory_profiler

## Feature Extraction Pipeline

This part of the notebook covers feature extraction from raw videos. We assume that our raw videos are sourced from the ASL Capture App. This app records whole sessions of the user signing words. Each session consists of 20 signs for 20 different words. The raw video must therefore be split up and then have mediapipe features extracted before we can train the models. The mediapipe features are then converted into HTK features.

### Mediapipe Feature Extraction

To extract mediapipe features, we must first add the raw videos to `videos/` and then run the following commands. You can find the raw videos at this Google Drive link: https://drive.google.com/file/d/1_sImmOjPiflbV7TWDzTiHs1W1qMF3DtY

We can count the videos.

In [None]:
sliced_video_files = !ls videos/
print(len(sliced_video_files))

We can also list the sliced-up videos. Note the naming convention. The elements in the name are delimited by a hyphen (`-`). The first element is the user ID set in the Capture App. The second element is the word being signed. The third element is the date and time at which the session was recorded (`YYYY_MM_DD_HH_MM_SS`). The fourth and last element, before the file extension, is the session ID followed by the number of times the sign was recorded in that session (`0` indexed). Ignore the error directory that appears in the listing.

In [None]:
!ls videos/
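The naming convention described above can be taken apart programmatically. The cell below is a minimal sketch, not part of the repository: `parse_video_name` and the field names `user_id`, `sign`, `recorded_at`, and `session_tag` are hypothetical names chosen for illustration. It only splits the name on hyphens and keeps the last two fields as raw strings, without interpreting them further.

```python
import os
from typing import NamedTuple


class VideoName(NamedTuple):
    """Fields of a sliced video file name (hypothetical field names)."""
    user_id: str       # user ID set in the Capture App
    sign: str          # word being signed
    recorded_at: str   # date/time field, kept as the raw string
    session_tag: str   # last field before the extension, kept as the raw string


def parse_video_name(file_name: str) -> VideoName:
    """Split a sliced video file name into its four hyphen-delimited fields."""
    stem, _ext = os.path.splitext(os.path.basename(file_name))
    # The user ID is the first field; the last two fields are taken from the
    # right, so a sign that itself contains a hyphen would still parse.
    user_id, rest = stem.split('-', 1)
    sign, recorded_at, session_tag = rest.rsplit('-', 2)
    return VideoName(user_id, sign, recorded_at, session_tag)


parsed = parse_video_name('00001-sleep-2022_09_19_15_42_00.307-18.mp4')
print(parsed)
```

A name with fewer than four fields raises a `ValueError` from the tuple unpacking, which is a convenient way to spot files that do not follow the convention.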

Finally, we can play any videos listed above. To do so, define the `video_file` variable below as the name of the video file.

In [None]:
import os
from IPython.display import Video

video_dir = 'videos'
video_file = '00001-sleep-2022_09_19_15_42_00.307-18.mp4'

video_path = os.path.join(video_dir, video_file)
Video(video_path, height=540, width=360)

In the next step, we extract mediapipe features from the split videos and store them in `mediapipe/`. To do this, we call `mediapipe_convert.py`. This script generates mediapipe features for the split videos and stores them in the output directory. The directory hierarchy under the output directory is organized as follows. The first level contains a directory for each user, named `[Capture App User ID]-singlesign/`. Beneath this level, the script creates a directory for each signed word, named `[sign]/`. Finally, at the last level, the script creates a directory named after the date and time of each session in which the sign was recorded. The name generally looks like this: `[YYYY_MM_DD_HH_MM_SS]`. This final directory contains a data file with the mediapipe features.

This step can take a while. You can skip it by downloading and extracting the tarball at the following link: https://drive.google.com/file/d/1opuR5k8AwmoivuOBHePvT9_HhRYhJmT1/view?usp=sharing

The tarball linked above contains features for all the videos linked in the README.

In [None]:
from mediapipe_convert import main

# This is system dependent, so set it appropriately. Setting a large number can crash the kernel.
num_threads = 4
source_dir = 'videos/'
dest_dir = 'mediapipe/'

# -f names the function whose lines memory_profiler should report on.
%mprun -f main main(noMark=True, inputDirectory=source_dir, outputDirectory=dest_dir, num_threads=num_threads)
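Since the right value for `num_threads` depends on the machine, one option is to derive it from the number of available CPUs instead of hard-coding it. The cell below is only a heuristic sketch: `pick_num_threads` and its `cap` parameter are hypothetical, and leaving one core free is an assumption rather than a requirement of `mediapipe_convert.py`.

```python
import os


def pick_num_threads(cap: int = 4) -> int:
    """Return a conservative worker count: at most `cap`, at least 1."""
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    # Leave one core free for the notebook kernel, but never go below 1.
    return max(1, min(cap, available - 1))


num_threads = pick_num_threads(cap=4)
print(num_threads)
```

Lowering `cap` is the simplest remedy if the kernel still runs out of memory.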

We can count the number of signs for which we have mediapipe features.

In [None]:
!ls mediapipe/00001-singlesign/ | wc -l
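The same check can be done in Python, which also lets us see how many sessions were recorded for each sign. The cell below is a minimal sketch that assumes the three-level hierarchy described earlier (`[user ID]-singlesign/[sign]/[session date and time]/`). The helper name `count_sessions_per_sign` is hypothetical and not part of the repository.

```python
from collections import Counter
from pathlib import Path


def count_sessions_per_sign(features_root, user_id):
    """Count session directories under each sign directory for one user."""
    user_dir = Path(features_root) / f'{user_id}-singlesign'
    counts = Counter()
    if not user_dir.is_dir():
        return counts  # no features extracted for this user yet
    for sign_dir in sorted(user_dir.iterdir()):
        if sign_dir.is_dir():
            # Each subdirectory of a sign directory is one recorded session.
            counts[sign_dir.name] = sum(1 for s in sign_dir.iterdir() if s.is_dir())
    return counts


counts = count_sessions_per_sign('mediapipe', '00001')
print(len(counts), 'signs,', sum(counts.values()), 'sessions')
```

Signs with unexpectedly few sessions are worth checking before training, since they usually point to videos that failed to convert.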

For the final data preparation step, we convert our mediapipe features into a format usable by the HTK toolkit.

In [None]:
from utils.utils import load_json
from prepare_data import prepare_data

# This file should contain a path near the end that points to the mediapipe features above.
features_file = 'configs/features.json'

# Load features
raw_data = load_json(features_file)
    
prepare_data(raw_data, isSingleWord=True)

## Training Demo

Run the train command below to train HMM models. The `train_iters` argument determines how many iterations to train for, while `n_splits` determines the number of folds for cross-validation. Set `gmm_mix` to a value other than `None` to model the emission probabilities with Gaussian mixture models (GMMs). Set `gmm_pattern` to `middle` to use GMMs for all but the start/end states, or to `all` to use GMMs for every state.

In [None]:
from train import train

train_iters = [150]
leave_one_out = False
n_splits = 20
hmm_step_type = 'single'
gmm_mix = 3
gmm_pattern = 'middle'
random_state = 423

train(
    train_iters=train_iters,
    leave_one_out=leave_one_out,  # if True, n_splits is ignored
    n_splits=n_splits,
    hmm_step_type=hmm_step_type,  # see utils/model_utils.py for possible values
    gmm_mix=gmm_mix,
    gmm_pattern=gmm_pattern,  # ignored if gmm_mix is None
    random_state=random_state
)
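To make the roles of `n_splits`, `leave_one_out`, and `random_state` concrete, the cell below sketches generic shuffled k-fold splitting using only the standard library. This is an illustration of the technique, not the splitting code used by `train`, which may group or shuffle the data differently. The helper name `kfold_indices` is hypothetical.

```python
import random


def kfold_indices(n_items, n_splits, random_state=None):
    """Yield (train_idx, test_idx) index lists for shuffled k-fold cross-validation."""
    if not 2 <= n_splits <= n_items:
        raise ValueError('n_splits must be between 2 and n_items')
    indices = list(range(n_items))
    # A fixed seed makes the folds reproducible across runs.
    random.Random(random_state).shuffle(indices)
    # Spread items as evenly as possible: the first `extra` folds get one more.
    base, extra = divmod(n_items, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


for fold, (train_idx, test_idx) in enumerate(kfold_indices(10, 5, random_state=423)):
    print(fold, len(train_idx), len(test_idx))
```

Leave-one-out cross-validation is the special case where `n_splits` equals the number of items, so every fold tests on exactly one sample. That is why a separate `n_splits` value has no effect in that mode.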

## Testing Demo

### Results

The next cell runs the HMMs trained in the previous step on the test data of every fold. It outputs the error averaged across all folds, weighted by the number of test rows in each fold.

In [None]:
from utils.utils import get_results, clean_and_make_path, get_num_lines
import shutil
import os
import numpy as np
# Run HVite and HResults on the test data of every fold.

all_results = {'error': [], 'num_rows': []}

for i in range(n_splits):
    fold = str(i)
    
    clean_and_make_path(f'results/{fold}')
    clean_and_make_path(f'hresults/{fold}')
    
    model_file = f'models/{fold}/hmm150/newMacros'
    test_data = f'lists/{fold}/test.data'
    results_file = f'results/{fold}/res_hmm150.mlf'
    hresults_file = f'hresults/{fold}/res_hmm150.txt'
    
    # HVite
    HVite_str = (f'HVite -A -H {model_file} -m -S {test_data} -i '
                    f'{results_file} -p -200 -w wordNet.txt -s 25 dict wordList')

    os.system(HVite_str)

    # HResults
    HResults_str = (f'HResults -A -h -e \\?\\?\\? sil0 -e \\?\\?\\? '
                        f'sil1 -p -t -I all_labels.mlf wordList {results_file} '
                        f'>> {hresults_file}')
    os.system(HResults_str)
    
    results = get_results(hresults_file)
    num_rows = get_num_lines(test_data)
    
    all_results['error'].append(results['error'])
    all_results['num_rows'].append(num_rows)

# Average the per-fold error, weighted by the number of test rows in each fold.
print("\nError: ", np.average(all_results['error'], weights=all_results['num_rows']))
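The weighting matters when folds have different sizes. The cell below is a small worked sketch of the same computation in plain Python, with made-up numbers chosen only for illustration. The helper name `weighted_error` is hypothetical.

```python
def weighted_error(errors, num_rows):
    """Average per-fold errors, weighting each fold by its number of test rows."""
    total_rows = sum(num_rows)
    if total_rows == 0:
        raise ValueError('no test rows to average over')
    return sum(e * n for e, n in zip(errors, num_rows)) / total_rows


# Illustrative values only: a fold with 30 rows and 10% error,
# and a fold with 10 rows and 20% error.
errors = [10.0, 20.0]
num_rows = [30, 10]

print(weighted_error(errors, num_rows))  # (10*30 + 20*10) / 40 = 12.5
print(sum(errors) / len(errors))         # unweighted mean = 15.0
```

The weighted value equals the error we would get by pooling the test rows of all folds, whereas the unweighted mean lets a small fold count as much as a large one.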
    

### Video Samples

The section below contains some video samples from the test data. In each cell, the video on the left was labeled correctly by the trained HMMs and the video on the right was labeled incorrectly. The first cell displays two different words, while the next two cells each display the same word, with a correct label on the left and an incorrect label on the right.

In [None]:
## The video on the left is labeled correctly by the HMM (label shown below)
## The video on the right is labeled incorrectly by the HMM (label shown below)

from ipywidgets import Output, GridspecLayout
from IPython import display

videos = [
    'videos/00001-puzzle-2022_09_19_15_42_00.307-17.mp4',
    'videos/00001-puzzle-2022_09_19_15_42_00.307-10.mp4'
]
grid = GridspecLayout(1, len(videos))

for i, videopath in enumerate(videos):
    out = Output()
    with out:
        display.display(display.Video(videopath, embed=True, height=540, width=360))
    grid[0,i] = out

grid
    