# Data Science - Capstone Project Submission #2

* Student Name: **James Toop**
* Student Pace: **Self Paced**
* Scheduled project review date/time: **29th October 2021 @ 21:30 BST**
* Instructor name: **Jeff Herman / James Irving**
* Blog URL: **https://toopster.github.io/**

## Table of Contents
1. [Business Case, Project Purpose and Approach](0_business_case.ipynb#business-case)
    1. [The importance of communication for people with severe learning disabilities](0_business_case.ipynb#communication-and-learning-disabilities)
    2. [Types of communication](0_business_case.ipynb#types-of-communication)
    3. [Communication techniques for people with learning disabilities](0_business_case.ipynb#communication-techniques)
    4. [Project purpose & approach](0_business_case.ipynb#project-purpose)
2. [Exploratory Data Analysis](1_eda.ipynb#eda)
    1. [The Datasets](1_eda.ipynb#the-datasets)
    2. [Discovery](1_eda.ipynb#data-discovery)
    3. [Preprocessing](1_eda.ipynb#data-preprocessing)
3. [Deep Learning Neural Networks](#deep-learning-neural-networks)
    1. [Initial Model Using Spectrograms](2_spectrogram_model.ipynb#model-1)
    2. [Advanced Model Using MFCCs](3_mfcc_model.ipynb#model-2)
4. [Final Model Performance Evaluation](#final-model-performance-evaluation)

---
<a name="eda"></a>
## 2. Exploratory Data Analysis

This section takes a first look at the available data. It investigates and documents the datasets and the relationships within them, highlighting any potential issues or shortcomings.


<a name="the-datasets"></a>
### 2A. The Datasets


#### Speech Commands: A dataset for limited-vocabulary speech recognition –
https://arxiv.org/abs/1804.03209

The Speech Commands dataset is an attempt to build a standard training and evaluation dataset for a class of simple speech recognition tasks. Its primary goal is to provide a way to build and test small models that detect when a single word is spoken, from a set of ten or fewer target words, with as few false positives as possible from background noise or unrelated speech.


#### Ultrasuite: A collection of ultrasound and acoustic speech data from child speech therapy sessions –
https://ultrasuite.github.io/

Ultrasuite is a collection of ultrasound and acoustic speech data from child speech therapy sessions. The current release includes three datasets, one from typically developing children and two from speech disordered children:

* **Ultrax Typically Developing (UXTD)** -  A dataset of 58 typically developing children. 
* **Ultrax Speech Sound Disorders (UXSSD)** - A dataset of 8 children with speech sound disorders. 
* **UltraPhonix (UPX)** - A second dataset of children with speech sound disorders, collected from 20 children.

**IMPORTANT NOTE:**

The datasets have not been included in the GitHub repository with this notebook and will need to be downloaded and stored in the local repository for the code to run correctly.

The code below will, however, download, store and transform the datasets as required for the models to run.
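Because the Ultrasuite data has to be present locally before the transformation cells can run, a quick check of the expected folders can save a failed run later on. The sketch below is only an illustration: the helper name `missing_dataset_dirs` is hypothetical, and the directory names are taken from the paths used in the cells of this notebook.

```python
from pathlib import Path

# Directory names taken from the paths used elsewhere in this notebook
EXPECTED_DIRS = [
    'data/ultrasuite/labels-uxtd-uxssd-upx',
    'data/ultrasuite/core-upx/core',
    'data/ultrasuite/core-uxssd/core',
    'data/ultrasuite/core-uxtd/core',
]

def missing_dataset_dirs(root='.', expected=EXPECTED_DIRS):
    """Return the expected dataset directories that do not exist under root."""
    root = Path(root)
    return [d for d in expected if not (root / d).is_dir()]

# Report anything that still needs to be downloaded
missing = missing_dataset_dirs()
if missing:
    print('Missing dataset directories:', missing)
else:
    print('All Ultrasuite directories found')
```

An empty list means all of the Ultrasuite folders referenced in this notebook are in place.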

In [None]:
# Import required libraries and modules for data preprocessing
import pandas as pd
import numpy as np
import wave
import soundfile as sf
import librosa
import os
import tensorflow as tf

import pathlib
from pathlib import Path
import shutil
import collections
import nltk
from nltk.corpus import wordnet as wn
nltk.download('wordnet')

#### Download Speech Commands v0.02 dataset

In [None]:
# Function to download the Speech Commands dataset, unpack and remove unnecessary files
def download_speech_commands():
    
    data_dir = pathlib.Path('data/speech_commands_v0.02')
    
    # Check to see if data directory already exists, download if not
    if not data_dir.exists():
        tf.keras.utils.get_file(
            'speech_commands_v0.02.zip',
            origin='http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz',
            extract=True,
            cache_dir='.',
            cache_subdir='data/speech_commands_v0.02')
    else:
        print('Speech Commands dataset already exists')
        
    # Remove the _background_noise_ samples as these are not required
    try:
        shutil.rmtree(data_dir / '_background_noise_')
    except OSError as e:
        print('Error: %s - %s.' % (e.filename, e.strerror))

    # Remove the downloaded archive as it is no longer required once extracted
    # (it is a tar.gz archive, saved under the .zip name passed to get_file above)
    archive_file = data_dir / 'speech_commands_v0.02.zip'
    if archive_file.exists():
        archive_file.unlink()

In [None]:
download_speech_commands()

#### Download and transform the Ultrasuite dataset

To combine the datasets before training the neural network, the Ultrasuite datasets need to be transformed into the same structure as the Speech Commands dataset: one folder per word, each containing the audio clips for that word.
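As a rough illustration of the target structure, the sketch below builds the path an isolated Ultrasuite clip would end up at: one folder per word, with the dataset, speaker, session and source recording kept in the filename. The function name `transformed_path` and the example values are hypothetical; the naming pattern mirrors the one used by `standardise_filing` in this notebook.

```python
from pathlib import PurePosixPath

def transformed_path(utterance, dataset, speaker, waveform, session=None,
                     root='data/ultrasuite_transformed'):
    """Build the per-word target path for one isolated clip."""
    # Keep the original references so each clip can be traced back to its source
    parts = [dataset, speaker]
    if session is not None:
        parts.append(session)
    parts.append(waveform)
    filename = f"{utterance}_{'-'.join(parts)}.wav"
    return PurePosixPath(root) / utterance / filename

# Illustrative values only
print(transformed_path('ambulance', 'upx', '01F', '005A', session='BL1'))
```

Datasets without session folders simply omit the session part of the filename.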

In [None]:
# Function for downloading the Ultrasuite datasets
# def download_ultrasuite(datasets):
    
#     for dataset in datasets:
#         os.system('rsync -av ultrasuite-rsync.inf.ed.ac.uk::ultrasuite/core-uxtd')

In [None]:
# Function for extracting and combining labels from all .lab files into a single DataFrame
def all_ultrasuite_word_labels(src_directory, src_dataset):
    
    directory = src_directory + src_dataset + '/word_labels/lab/'
    columns = ['start_time', 'end_time', 'utterance']
    all_labels = []

    for filename in os.listdir(directory):
    
        filepath = directory + filename
    
        labels_df = pd.read_csv(filepath, sep=" ", header=None, names=columns)
    
        # Extract the speaker, session and speech data from the filename and add to the dataframe
        labels_df['dataset'] = src_dataset
        labels_df['speaker'] = filename[0:3]
        if len(filename[4:-9]) == 0:
            labels_df['session'] = None
        else:
            labels_df['session'] = filename[4:-9]
        labels_df['speech_waveform'] = filename[-8:-4]

        # Tidy up data formatting and convert the times to timedeltas
        # (the * 100 assumes .lab times are in units of 100 ns; to_timedelta defaults to nanoseconds)
        labels_df['utterance'] = labels_df['utterance'].str.lower()
        labels_df['start_time'] = pd.to_timedelta(labels_df['start_time'] * 100)
        labels_df['end_time'] = pd.to_timedelta(labels_df['end_time'] * 100)

        # Collect the incoming labels
        all_labels.append(labels_df)
    
    # Combine the per-file labels (DataFrame.append was removed in pandas 2.0)
    if not all_labels:
        return pd.DataFrame(columns=columns)
    return pd.concat(all_labels, ignore_index=True)
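
The fixed-position slices used above depend on every label filename having exactly the same layout. A slightly more forgiving alternative, shown here only as a sketch, is to split the filename on hyphens. It assumes filenames follow the pattern `<speaker>[-<session>]-<waveform>.lab`, which is inferred from the slicing above and from the example file `01F-BL1-005A.lab`; the function name `parse_label_filename` and the second example filename are hypothetical.

```python
from pathlib import Path

def parse_label_filename(filename):
    """Split '<speaker>[-<session>]-<waveform>.lab' into its parts."""
    parts = Path(filename).stem.split('-')
    speaker, waveform = parts[0], parts[-1]
    # A session is only present when the name has three hyphen-separated parts
    session = parts[1] if len(parts) == 3 else None
    return speaker, session, waveform

print(parse_label_filename('01F-BL1-005A.lab'))   # with a session
print(parse_label_filename('01M-005A.lab'))       # hypothetical name without a session
```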

In [None]:
# Load the labels for the Ultrax Speech Sound Disorders dataset
uxssd_df = all_ultrasuite_word_labels('data/ultrasuite/labels-uxtd-uxssd-upx/', 'uxssd')

In [None]:
# Preview the data
uxssd_df.head()

In [None]:
# Load the labels for the Ultrax Typically Developing dataset
uxtd_df = all_ultrasuite_word_labels('data/ultrasuite/labels-uxtd-uxssd-upx/', 'uxtd')

In [None]:
# Preview the data
uxtd_df.head()

In [None]:
# Preview the dataframe info
uxtd_df.info()

In [None]:
# Load the labels for the Ultraphonix dataset
upx_df = all_ultrasuite_word_labels('data/ultrasuite/labels-uxtd-uxssd-upx/', 'upx')

In [None]:
# Preview the data
upx_df.head()

#### Transform the Ultrasuite dataset

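The transformation follows these steps:

* Read the word labels (start time, end time and utterance) for each recording from its `.lab` file.
* Splice each original `.wav` recording into one audio clip per labelled word, using the start and end timestamps.
* Copy the clips into one folder per word, renaming each file so that it keeps a reference to its source dataset, speaker, session and recording.
* Remove folders whose name is not a word in the NLTK WordNet corpus, is a single letter, or contains 5 or fewer samples.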
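
The splicing step depends on converting label timestamps into sample positions in the audio array. The sketch below shows that arithmetic on its own, assuming (as the `* 100` conversion to nanoseconds in this notebook implies) that label times are stored in units of 100 ns. The function name `label_time_to_samples` is hypothetical, and the default sample rate matches the `sr=22050` used when loading the recordings.

```python
import math

def label_time_to_samples(start_100ns, end_100ns, sr=22050):
    """Convert label times (assumed to be in 100 ns units) to sample indices."""
    start_s = start_100ns / 10_000_000   # 10,000,000 units of 100 ns per second
    end_s = end_100ns / 10_000_000
    # Round outwards so the clip never cuts into the labelled word
    return math.floor(start_s * sr), math.ceil(end_s * sr)

# A word labelled from 1.0 s to 1.5 s
print(label_time_to_samples(10_000_000, 15_000_000))
```

The resulting pair can be used directly as a slice, for example `y[start:end]`.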

In [None]:
# Function for extracting labels from a single .lab file into a DataFrame
def ultrasuite_word_labels(src_dataset, src_file):
    
    filepath = 'data/ultrasuite/labels-uxtd-uxssd-upx/' + src_dataset + '/word_labels/lab/' + src_file

    columns = ['start_time', 'end_time', 'utterance']
    word_labels_df = pd.read_csv(filepath, sep=" ", header=None, names=columns)
    
    # Extract the speaker, session and speech data from the filename and add to the dataframe
    word_labels_df['dataset'] = src_dataset
    word_labels_df['speaker'] = src_file[0:3]
    if len(src_file[4:-9]) == 0:
        word_labels_df['session'] = None
    else:
        word_labels_df['session'] = src_file[4:-9]
    word_labels_df['speech_waveform'] = src_file[-8:-4]

    # Tidy up data formatting and correct time based units
    word_labels_df['utterance'] = word_labels_df['utterance'].str.lower()
    word_labels_df['start_time'] = pd.to_timedelta(word_labels_df['start_time'] * 100)
    word_labels_df['end_time'] = pd.to_timedelta(word_labels_df['end_time'] * 100)
    
    return word_labels_df

In [None]:
# Quick test to check the function works for a single labels file
upx_01F_df = ultrasuite_word_labels('upx', '01F-BL1-005A.lab')
upx_01F_df.head()

In [None]:
# Function for splicing original *.wav file based on timestamps
def extract_segments(y, sr, segments, dataset):
    
    # Compute segment regions in number of samples
    starts = np.floor(segments.start_time.dt.total_seconds() * sr).astype(int)
    ends = np.ceil(segments.end_time.dt.total_seconds() * sr).astype(int)
    
    isolated_directory = 'data/ultrasuite_isolated/' + dataset + '/'
    os.makedirs(isolated_directory, exist_ok=True)
    
    # Slice the audio into segments, one per labelled utterance
    # NOTE: a word repeated within one recording overwrites its earlier clip
    for i, (start, end) in enumerate(zip(starts, ends)):
        audio_seg = y[start:end]
        print('extracting audio segment:', len(audio_seg), 'samples')
        
        # Set the directory for the spliced audio file
        file_path = isolated_directory + str(segments.speaker[i]) + '/'
        if segments.session[i] is not None:
            file_path = file_path + str(segments.session[i]) + '/'
        file_path = file_path + str(segments.speech_waveform[i]) + '/'
            
        # Create the directory (and any parents) if it doesn't exist
        os.makedirs(file_path, exist_ok=True)
            
        file_name = file_path + str(segments.utterance[i]) + '.wav'
        
        sf.write(file_name, audio_seg, sr)

In [None]:
# Function for processing and splicing all ultrasuite *.wav files
def process_ultrasuite_wav_files(src_dataset, src_speaker, src_session):

    directory = 'data/ultrasuite/core-' + src_dataset + '/core/' + src_speaker + '/'
    
    # Set the target directory based on session if available
    if src_session != False:
         directory = directory + src_session + '/'

    # Loop through files in the directory, splice and rename files based on labels
    for filename in os.listdir(directory):

        # Only process .wav files, skipping recordings whose type suffix is 'D' or 'E'
        if filename.endswith('.wav') and filename[-5:-4] not in ('D', 'E'):
            # Fetch the corresponding word labels and load into a DataFrame
            # Handle errors for when no labels exist
            # Files are graded on basis of quality and labels only available for high quality samples
            try:
                if src_session != False:
                    labels_filename = src_speaker + '-' + src_session + '-' + filename[-8:-4] + '.lab'
                else:
                    labels_filename = src_speaker + '-' + filename[-8:-4] + '.lab'
                
                labels_df = ultrasuite_word_labels(src_dataset, labels_filename)
                
                wav_path = directory + filename
                y, sr = librosa.load(wav_path, sr=22050)
                extract_segments(y, sr, labels_df, src_dataset)                
            
            except IOError:
                if src_session != False:
                    print('\n' + src_speaker + '-' + src_session + '-' + filename[-8:-4] + '.lab not found \n')
                else:
                    print('\n' + src_speaker + '-' + filename[-8:-4] + '.lab not found \n')

In [None]:
def process_all_wav_files(datasets):
    
    # Loop through the datasets
    for dataset in datasets:
        current_dataset_dir = 'data/ultrasuite/core-' + dataset + '/core/'
        
        # Loop through the speakers
        for speaker in os.listdir(current_dataset_dir):
            current_speaker_dir = current_dataset_dir + speaker + '/'
            if not os.path.isdir(current_speaker_dir):
                continue

            sessions = [entry for entry in os.listdir(current_speaker_dir)
                        if os.path.isdir(os.path.join(current_speaker_dir, entry))]

            # If there are session folders, process each one; otherwise process
            # the speaker folder once (rather than once per file it contains)
            if sessions:
                for session in sessions:
                    process_ultrasuite_wav_files(dataset, speaker, session)
            else:
                process_ultrasuite_wav_files(dataset, speaker, False)

In [None]:
# Splice all *.wav files for all datasets
# NOTE: This takes a long time to run
process_datasets = ['upx', 'uxssd', 'uxtd']
process_all_wav_files(process_datasets)

In [None]:
# Standardise filing structure for isolated samples from Ultrasuite dataset, renaming files in the process
def standardise_filing(datasets):
   
    # Loop through the datasets
    for dataset in datasets:

        isolated_files = Path.cwd() / 'data/ultrasuite_isolated' / dataset

        for isolated_file in isolated_files.glob('**/*'):

            if isolated_file.is_file():

                filename = isolated_file.stem
                extension = isolated_file.suffix
                sourcedata = dataset
                sourcefile = isolated_file.parent.parts[-1]
                
                # Rename the file but don't lose the original references handling the different folder structures
                if dataset == 'uxtd':
                    speaker = isolated_file.parent.parts[-2] 
                    new_filename = f'{filename}_{dataset}-{speaker}-{sourcefile}{extension}'
                    
                else:
                    session = isolated_file.parent.parts[-2]
                    speaker = isolated_file.parent.parts[-3]
                    new_filename = f'{filename}_{dataset}-{speaker}-{session}-{sourcefile}{extension}'

                # Define the new file path and create directory if it doesn't exist
                new_path = Path.cwd() / 'data/ultrasuite_transformed' / filename

                if not new_path.exists():
                    new_path.mkdir(parents=True, exist_ok=True)

                new_file_path = new_path.joinpath(new_filename)

                # Copy file to new location
                shutil.copy(str(isolated_file), str(new_file_path))

In [None]:
# Run the function to standardise the filing for all Ultrasuite datasets
standardise_filing(['upx', 'uxssd', 'uxtd'])

In [None]:
# Cleanse the Ultrasuite dataset:
# 1. Only keep audio samples of actual words, using NLTK WordNet as the source corpus
# 2. Remove audio samples of single phonetic letters
# 3. Only keep words that have more than 5 audio samples

def remove_invalid_samples():

    transformed_files = 'data/ultrasuite_transformed/'
    
    for name in sorted(os.listdir(transformed_files)):
        
        path = os.path.join(transformed_files, name)
        
        if os.path.isdir(path):
            num_files = len(os.listdir(path))
        
            # Remove audio samples of single letters and of words not listed in the NLTK WordNet corpus
            # (WordNet is imported above under the alias wn)
            if not wn.synsets(name) or len(name)==1:
                print(name, 'is NOT a valid word... removing')
                print(num_files)
                shutil.rmtree(path)
            elif num_files <= 5: 
                print(name, 'does NOT have enough samples', num_files, '... removing')
                shutil.rmtree(path)
            else:
                print(name, 'is a valid word and there are', num_files, 'samples')

In [None]:
remove_invalid_samples()
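
The three cleansing rules can also be expressed as a single decision function, which makes them easy to check without deleting anything. This is a sketch rather than the notebook's implementation: the name `keep_word` is hypothetical, and the word lookup is passed in as a function so that the example runs without WordNet. With NLTK available, `lambda w: bool(wn.synsets(w))` could be supplied instead.

```python
def keep_word(name, num_files, is_known_word, min_samples=6):
    """Return True if a word folder passes all three cleansing rules."""
    # Rule 2: single letters are phonetic samples, not words
    if len(name) == 1:
        return False
    # Rule 1: the folder name must be a recognised word
    if not is_known_word(name):
        return False
    # Rule 3: more than 5 samples, i.e. at least 6
    return num_files >= min_samples

# Illustrative stand-in for the WordNet lookup
vocabulary = {'ambulance', 'zebra', 'a'}
for word, count in [('ambulance', 12), ('a', 50), ('zebra', 5), ('xyzzy', 20)]:
    print(word, count, keep_word(word, count, vocabulary.__contains__))
```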

In [None]:
# Function to get the audio sample file statistics based on a target directory
def get_filestats(src_directory):
    
    src_files = Path.cwd() / src_directory
    filedata = []

    for src_file in src_files.glob('**/*.wav'):
        
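        if src_file.is_file():
            # Read the duration with soundfile, which avoids the librosa.get_duration
            # keyword that was renamed from filename= to path= in librosa 0.10
            filedata.append([src_file.parent.parts[-1], 
                             src_file.name, 
                             sf.info(str(src_file)).duration])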
            
    columns = ['sample_utterance', 'sample_filename', 'sample_duration']
    filestats_df = pd.DataFrame(data=filedata, columns=columns)
    
    return filestats_df

In [None]:
ultrasuite_filestats = get_filestats('data/ultrasuite_transformed')

In [None]:
ultrasuite_filestats.head()

In [None]:
one_second_df = ultrasuite_filestats[ultrasuite_filestats['sample_duration']>=1.0]
print('Number of Ultrasuite samples with duration >= 1 second =', len(one_second_df))

In [None]:
# Check the frame rate of each sample for one example word
directory_path = 'data/ultrasuite_transformed/ambulance'
for file_name in os.listdir(directory_path):
    try:
        with wave.open(os.path.join(directory_path, file_name), mode='rb') as wave_file:
            frame_rate = wave_file.getframerate()
            print(file_name, ' : ', frame_rate)
    except wave.Error:
        print(file_name, ' : ERROR')
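
Printing one line per file becomes hard to read for larger folders, so the same check can be summarised as a count per frame rate. The sketch below uses only the standard library; the function name `frame_rate_counts` and the `'unreadable'` key are hypothetical choices for this illustration.

```python
import wave
from collections import Counter
from pathlib import Path

def frame_rate_counts(directory):
    """Count the .wav files in a directory by frame rate."""
    counts = Counter()
    for wav_path in sorted(Path(directory).glob('*.wav')):
        try:
            with wave.open(str(wav_path), mode='rb') as wave_file:
                counts[wave_file.getframerate()] += 1
        except (wave.Error, EOFError):
            # Files the wave module cannot parse are grouped together
            counts['unreadable'] += 1
    return counts
```

Calling `frame_rate_counts('data/ultrasuite_transformed/ambulance')` returns a `Counter` keyed by frame rate. Since the recordings are loaded and written at `sr=22050` when spliced, a single key of 22050 would be the expected result for the isolated clips.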