In this notebook, I am going to implement a phone recognizer using a neural network.

The first step is to get the training and testing data. The TIMIT documentation has a section called "Core Test Set", which defines a set of 192 utterances from 24 speakers. For now, I am going to use this set as my test set to keep things fast. Later, I will switch to the complete TIMIT test set.

Here is the table of speakers in the core test set:
| Dialect |    Male    | Female | #Texts/Speaker | #Total Texts |
|---------|:----------:|:------:|:--------------:|:------------:|
|    1    | DAB0, WBT0 |  ELC0  |       8        |      24      |
|    2    | TAS1, WEW0 |  PAS0  |       8        |      24      |
|    3    | JMP0, LNT0 |  PKT0  |       8        |      24      |
|    4    | LLL0, TLS0 |  JLM0  |       8        |      24      |
|    5    | BPM0, KLT0 |  NLP0  |       8        |      24      |
|    6    | CMJ0, JDH0 |  MGD0  |       8        |      24      |
|    7    | GRT0, NJM0 |  DHC0  |       8        |      24      |
|    8    | JLN0, PAM0 |  MLD0  |       8        |      24      |
|  Total  |     16     |   8    |                |     192      |
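
As a quick sanity check on the table, the speaker list can be written down in code and counted. This is only a sketch: the names `CORE_TEST_SPEAKERS` and `core_speaker_ids` are made up for illustration, and it assumes that full speaker IDs carry a gender letter (`M` or `F`) in front of the IDs listed in the table, as in the TIMIT directory names.

```python
# Core test set speakers, transcribed from the table above.
# Each entry maps a dialect region to (male speakers, female speakers).
CORE_TEST_SPEAKERS = {
    1: (["DAB0", "WBT0"], ["ELC0"]),
    2: (["TAS1", "WEW0"], ["PAS0"]),
    3: (["JMP0", "LNT0"], ["PKT0"]),
    4: (["LLL0", "TLS0"], ["JLM0"]),
    5: (["BPM0", "KLT0"], ["NLP0"]),
    6: (["CMJ0", "JDH0"], ["MGD0"]),
    7: (["GRT0", "NJM0"], ["DHC0"]),
    8: (["JLN0", "PAM0"], ["MLD0"]),
}

TEXTS_PER_SPEAKER = 8  # from the "#Texts/Speaker" column


def core_speaker_ids(table=CORE_TEST_SPEAKERS):
    """Return full speaker IDs in table order.

    Assumption: the full ID is the gender letter followed by the table entry,
    e.g. "DAB0" in the Male column becomes "MDAB0".
    """
    ids = []
    for dialect in sorted(table):
        males, females = table[dialect]
        ids += ["M" + speaker for speaker in males]
        ids += ["F" + speaker for speaker in females]
    return ids


speaker_ids = core_speaker_ids()
print(len(speaker_ids), len(speaker_ids) * TEXTS_PER_SPEAKER)  # 24 192
```

The two printed numbers match the totals quoted above: 24 speakers and 192 utterances.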

The metadata (file name, speaker ID, etc.) of the training and testing data is stored in the files `train_data.csv` and `test_data.csv`, so we start by loading these two files.


In [1]:
import os
import pandas
from pathlib import Path

In [3]:
# Load `train_data.csv` and `test_data.csv` as pandas `DataFrames`
TIMIT = Path(os.environ["TIMIT"])
train_data = pandas.read_csv(TIMIT / "train_data.csv")
test_data = pandas.read_csv(TIMIT / "test_data.csv")

# Take a look at the first few rows of `train_data`
train_data.head()

Unnamed: 0,index,test_or_train,dialect_region,speaker_id,filename,path_from_data_dir,path_from_data_dir_windows,is_converted_audio,is_audio,is_word_file,is_phonetic_file,is_sentence_file
0,1.0,TRAIN,DR4,MMDM0,SI681.WAV.wav,TRAIN/DR4/MMDM0/SI681.WAV.wav,TRAIN\\DR4\\MMDM0\\SI681.WAV.wav,True,True,False,False,False
1,2.0,TRAIN,DR4,MMDM0,SI1311.PHN,TRAIN/DR4/MMDM0/SI1311.PHN,TRAIN\\DR4\\MMDM0\\SI1311.PHN,False,False,False,True,False
2,3.0,TRAIN,DR4,MMDM0,SI1311.WRD,TRAIN/DR4/MMDM0/SI1311.WRD,TRAIN\\DR4\\MMDM0\\SI1311.WRD,False,False,True,False,False
3,4.0,TRAIN,DR4,MMDM0,SX321.PHN,TRAIN/DR4/MMDM0/SX321.PHN,TRAIN\\DR4\\MMDM0\\SX321.PHN,False,False,False,True,False
4,5.0,TRAIN,DR4,MMDM0,SX321.WRD,TRAIN/DR4/MMDM0/SX321.WRD,TRAIN\\DR4\\MMDM0\\SX321.WRD,False,False,True,False,False


From the first few rows of `train_data`, we can see that each row describes one file and has the following columns (the leading `Unnamed: 0` column is just a row index that was saved along with the CSV):
`index, test_or_train, dialect_region, speaker_id, filename, path_from_data_dir, path_from_data_dir_windows, is_converted_audio, is_audio, is_word_file, is_phonetic_file, is_sentence_file`

Here is a brief explanation of each column:
- `index`: a running number for the entry (stored as a float, e.g. `1.0`)
- `test_or_train`: whether the file belongs to the test set or the training set (e.g. `TRAIN`)
- `dialect_region`: the dialect region of the speaker (e.g. `DR4`)
- `speaker_id`: the ID of the speaker (e.g. `MMDM0`)
- `filename`: the name of the file (e.g. `SI1311.PHN`)
- `path_from_data_dir`: the path of the file relative to the `data` directory
- `path_from_data_dir_windows`: the same relative path, written with Windows-style backslashes
- `is_converted_audio`: whether the file is audio that has been converted to `.wav` format (e.g. `SI681.WAV.wav`)
- `is_audio`: whether the file is an audio file
- `is_word_file`: whether the file is a word-level transcription (`.WRD`)
- `is_phonetic_file`: whether the file is a phonetic transcription (`.PHN`)
- `is_sentence_file`: whether the file is a sentence file

Since we only care about phonetic information, the only file types we care about are `.PHN` and `.wav`. The columns for filtering them are `is_phonetic_file` and `is_converted_audio`. To locate the files, the important column is `path_from_data_dir`. For `test_data.csv`, `speaker_id` is also important, since we want to restrict the data to the core test set.
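
The filtering described above could be sketched as follows. This is a minimal illustration, not the final code: `toy` is a hypothetical stand-in for `test_data.csv` with made-up rows (only the column names come from the real file), `select_files` is a helper name invented here, and the `TEST/DR1/...` paths are assumed by analogy with the `TRAIN/DR4/...` paths shown earlier. The optional `drop_sa` flag rests on my understanding that the core test set counts 8 of the 10 sentences per speaker by leaving out the two `SA` dialect sentences; this should be checked against the TIMIT documentation.

```python
import pandas

# Toy stand-in for `test_data.csv`: hypothetical rows, real column names.
toy = pandas.DataFrame(
    [
        ("MDAB0", "SX49.WAV.wav", "TEST/DR1/MDAB0/SX49.WAV.wav", True, False),
        ("MDAB0", "SX49.WAV", "TEST/DR1/MDAB0/SX49.WAV", False, False),
        ("MDAB0", "SX49.PHN", "TEST/DR1/MDAB0/SX49.PHN", False, True),
        ("MDAB0", "SX49.WRD", "TEST/DR1/MDAB0/SX49.WRD", False, False),
        ("MDAB0", "SA1.PHN", "TEST/DR1/MDAB0/SA1.PHN", False, True),
        ("FAKS0", "SX43.PHN", "TEST/DR1/FAKS0/SX43.PHN", False, True),
    ],
    columns=[
        "speaker_id",
        "filename",
        "path_from_data_dir",
        "is_converted_audio",
        "is_phonetic_file",
    ],
)


def select_files(df, speaker_ids=None, drop_sa=False):
    """Keep converted `.wav` files and `.PHN` files, optionally per speaker."""
    keep = df["is_converted_audio"] | df["is_phonetic_file"]
    if speaker_ids is not None:
        # Restrict to the given speakers, e.g. the core test set speakers.
        keep = keep & df["speaker_id"].isin(list(speaker_ids))
    if drop_sa:
        # Assumption: SA sentences are not part of the core test set.
        keep = keep & ~df["filename"].str.startswith("SA")
    return df.loc[keep, ["speaker_id", "filename", "path_from_data_dir"]]


core = select_files(toy, speaker_ids=["MDAB0"], drop_sa=True)
print(core["filename"].tolist())  # ['SX49.WAV.wav', 'SX49.PHN']
```

For the training data, calling `select_files(train_data)` without a speaker list would keep every converted audio file and every phonetic file.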