# CMU Pronouncing Dictionary - Data Exploration

## 1.0 Introduction

The CMU Pronouncing Dictionary is being used as part of a research project into the effectiveness of a phonetic spell checker for school children (referred to as users from here on) in the Republic of Ireland. When a user cannot spell a word correctly, their attempt is often written phonetically, as they would say it. As a result, the phoneme sequence of the incorrect phonetic spelling often overlaps with, or exactly matches, the phoneme sequence of the correct spelling. See Table 1 below for some examples.

### Table 1

| Correct Spelling | Phoneme Sequence | - | Incorrect Spelling | Phoneme Sequence |
|:----------------:|:----------------:|:-:|:------------------:|:----------------:|
| detailed         | D IH T EY L D    | - | detaled            | D IH T EY L D    |
| opened           | OW P AH N D      | - | opend              | OW P AH N D      |

In some cases the phoneme sequences will not match exactly, but they may be close. To measure this closeness, an edit distance function based on Levenshtein distance is applied to the phoneme sequences. It currently works as follows:

* For each misspelling (represented as a phoneme sequence), compare it with each phoneme sequence in the CMU dictionary and calculate the edit distance. A sequence currently counts as close if its distance is 2 or less.
* Use a frequency dictionary to pick the highest-frequency word from the resulting list of close matches.
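
The two steps above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual implementation: the function names `phoneme_edit_distance` and `suggest`, the toy dictionary and the frequency counts are all hypothetical, and each phoneme is assumed to be a single symbol with unit cost for insertion, deletion and substitution.

```python
from typing import Dict, Optional, Sequence


def phoneme_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance where one phoneme (not one character) is one symbol."""
    previous = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, pa in enumerate(a, start=1):
        current = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def suggest(misspelt_phonemes: str,
            cmu: Dict[str, str],
            freq: Dict[str, int],
            max_distance: int = 2) -> Optional[str]:
    """Return the most frequent word whose phonemes are within max_distance."""
    target = misspelt_phonemes.split()
    candidates = [word for word, phonemes in cmu.items()
                  if phoneme_edit_distance(target, phonemes.split()) <= max_distance]
    if not candidates:
        return None
    return max(candidates, key=lambda word: freq.get(word, 0))


# Hypothetical toy data: four entries and made-up frequency counts.
toy_cmu = {"opened": "OW P AH N D", "open": "OW P AH N",
           "detailed": "D IH T EY L D", "three": "TH R IY"}
toy_freq = {"opened": 50, "open": 120, "detailed": 30, "three": 200}

print(suggest("OW P AH N D", toy_cmu, toy_freq))  # open
```

In this sketch the frequency step does not look at the distance itself, so "open" (distance 1, higher toy frequency) is returned ahead of "opened" (distance 0).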

The current implementation has two identified drawbacks:

* The complexity is O(n<sup>2</sup>), as each misspelling needs to be compared with every entry in the CMU dictionary.
* To reduce this cost, only phoneme sequences that start and end with the same phonemes as the misspelling are compared. This means that similar-sounding words may be excluded. For example, "three" starts with "TH" in a British accent but with "T" in a strong Irish accent.
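
The effect of the second drawback can be illustrated with a small sketch. The helper `passes_boundary_filter` and the four toy entries are assumptions made for illustration only; the real filter may be implemented differently.

```python
from typing import Sequence


def passes_boundary_filter(attempt: Sequence[str], entry: Sequence[str]) -> bool:
    """Keep an entry only if its first and last phonemes match the attempt's."""
    return (bool(attempt) and bool(entry)
            and attempt[0] == entry[0] and attempt[-1] == entry[-1])


# Hypothetical toy subset of the dictionary (stress markers removed).
toy_entries = {"three": "TH R IY", "tree": "T R IY", "tea": "T IY", "throw": "TH R OW"}

# "three" as it might be written by a user with a strong Irish accent.
attempt = "T R IY".split()

kept = [word for word, phonemes in toy_entries.items()
        if passes_boundary_filter(attempt, phonemes.split())]
print(kept)  # ['tree', 'tea']

# The intended word is a single substitution away, yet it never reaches
# the edit distance step because its first phoneme differs.
three = toy_entries["three"].split()
substitutions = sum(x != y for x, y in zip(attempt, three))
print(substitutions)  # 1
```

The filter cuts the number of comparisons, but "three" is dropped before its distance of 1 can be measured.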

This notebook explores the CMU Pronouncing Dictionary to identify ways to reduce the complexity of the current strategy of calculating edit distance across phoneme sequences.

## 2.0 Data Exploration and Understanding






### 2.1 CMU Pronouncing Dictionary

This section explores the structure of the CMU dictionary with the aim of identifying patterns that may help reduce the time complexity of the comparison operations. The following steps will be taken:

* Explore word count and variation in the data set
* Explore word count per phoneme sequence
* Explore the phoneme sequence data - length statistics, count of different lengths
* Explore the variety of phonemes in each sequence

In [27]:
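import pandas as pd
import numpy as np
from pathlib import Path

# Location of the pre-processed CMU dictionary (specific to the author's machine)
data_folder = Path("C:/Users/robert/Documents/zeeko_nlp/input_files/")
file_to_open = data_folder / "cmu_processed.csv"

# Load the file, labelling the two columns Word and Phonemes
df_cmu = pd.read_csv(file_to_open, encoding="ISO-8859-1", names=["Word", "Phonemes"])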

In [28]:
df_cmu.head(5)

|   | Word     | Phonemes       |
|:-:|:--------:|:--------------:|
| 0 | aa       | EY EY          |
| 1 | aaa      | T R IH P AH L EY |
| 2 | aaberg   | AA B ER G      |
| 3 | aachen   | AA K AH N      |
| 4 | aachener | AA K AH N ER   |
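
The exploration steps listed at the start of this section could be approached along the following lines. This is only a sketch on a small hypothetical DataFrame named `toy` with the same two columns as `df_cmu`; the variable names are illustrative and the numbers say nothing about the real dictionary.

```python
import pandas as pd

# Hypothetical stand-in for df_cmu with the same column layout.
toy = pd.DataFrame({
    "Word": ["aa", "aachen", "aachener", "two", "too", "to"],
    "Phonemes": ["EY EY", "AA K AH N", "AA K AH N ER", "T UW", "T UW", "T UW"],
})

# Word count and variation in the data set.
n_rows = len(toy)
n_unique_words = toy["Word"].nunique()

# Word count per phoneme sequence (homophones share a sequence).
words_per_sequence = (toy.groupby("Phonemes")["Word"]
                         .nunique()
                         .sort_values(ascending=False))

# Phoneme sequence length: summary statistics and count of each length.
tokens = toy["Phonemes"].str.split()
seq_len = tokens.str.len()
length_stats = seq_len.describe()
length_counts = seq_len.value_counts().sort_index()

# Variety of phonemes: distinct phonemes per sequence and overall inventory.
distinct_per_sequence = tokens.apply(lambda p: len(set(p)))
inventory = sorted(set(tokens.explode()))

print(n_rows, n_unique_words)
print(words_per_sequence.head())
print(length_counts)
print(inventory)
```

Replacing `toy` with `df_cmu` would apply the same steps to the full dictionary, assuming the Phonemes column holds space-separated phonemes as shown in the output above.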


### 2.2 Acoustic Similarity Matrix


In [16]:
import os

cwd = os.getcwd()  # Get the current working directory (cwd)
files = os.listdir(cwd)  # Get all the files in that directory

In [17]:
print("Files in %r: %s" % (cwd, files))

Files in 'C:\\Users\\robert\\Documents\\zeeko_nlp\\data_analysis': ['.ipynb_checkpoints', 'cmu_data_exploration.ipynb']
