# Voice based Human Profiling and Forensics

> This project aims to explore the methodologies and technologies for voice based human profiling and voice based forensics.

> Keywords: voice, biometrics, profiling (describe people), forensics (technologies to assist crime detection), machine learning, signal processing

## Motivations
**Speech forensics** employs speech processing technologies to discover rich information contained in (concealed) speech associated with suspects, and provides evidence that could be used in court. 

This information includes
1. identity-related
    
    name, gender, age, height, weight, race, language & dialect, facial & body characteristics
2. geographical-related
    
    speech occurrence location & conditions, trace map
3. social-relation-related
    
    home, family members, education, work, party, social status, connections, upbringing
4. personal-traits-related
    
    mental state, personality, emotion tendency, habits
5. health-related

    illness history, disease tendency, DNA, body shape, composition and size of their vocal tract, skeletal proportions, lung volume and breathing functions
6. criminal-records-related
    
    crime history, crime tendency

**Justification**

We know that the acoustical aspects of speech are closely related to the speaker's articulatory system, which is further related to the speaker's facial structure and movement, and even to many other physical characteristics. The recordings also contain environmental information that can be exploited. Furthermore, the semantic aspects of speech contain a lot of useful information.

The sub-objectives are
1. Discover disguised voice

    How to tell whether a speech is disguised or not? 
2. Discover voice under manipulation

    How can we tell if a speaker is under pressure, or threatened, etc.?
3. Privacy

    How can we protect the privacy of speakers while preserving their confidential information?
4. Reconstruction
    
    Can we 3D reconstruct the speaker?

## Objectives
### Break voice disguise
Model a person’s normal speech state;  tell if a speech is natural or manipulated/controlled.
### Voice profiling
Build connections between speech models and a person’s profiles (physical, physiological, psychological factors, etc.).
### Voice hologram
Reconstruct a 3D figure of a person (and the surroundings) with the profiles from the speech; then build a hologram from the 3D figure.

## Methods
A disguised voice can come from a person deliberately impersonating another person, or from machine synthesis. In a disguise, the time-frequency traits and semantic traits are altered. Either way, to break the disguise we need to identify the *invariant* and *variant* characteristics of speech: some factors are innate, while others can be modified. We first identify these factors (as mentioned in the previous part). We then define the *normal* manner in which a speaker talks, and categorize and quantify the *deviation* of speech from it. Using the discovered factors, we create a forensic profile of the speaker. With this profile, we can perform a variety of prediction and classification tasks.
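
One way to make the notion of *deviation* from a speaker's *normal* manner concrete is a per-feature z-score against that speaker's baseline recordings. The sketch below is only an illustration under assumptions that are not part of these notes: the feature names (`pitch_hz`, `f1_hz`), the sample values, and the choice of a z-score are hypothetical placeholders, not the project's method.

```python
from statistics import mean, stdev

def fit_baseline(samples):
    """Per-feature mean and standard deviation over a speaker's normal speech.

    `samples` is a list of dicts mapping a feature name to a float.
    """
    names = list(samples[0].keys())
    return {
        n: (mean([s[n] for s in samples]), stdev([s[n] for s in samples]))
        for n in names
    }

def deviation(baseline, observation):
    """Z-score of each observed feature relative to the baseline."""
    return {n: (observation[n] - mu) / sd for n, (mu, sd) in baseline.items()}

# Hypothetical measurements from three normal utterances of one speaker.
normal = [
    {"pitch_hz": 118.0, "f1_hz": 510.0},
    {"pitch_hz": 122.0, "f1_hz": 490.0},
    {"pitch_hz": 120.0, "f1_hz": 500.0},
]
baseline = fit_baseline(normal)
scores = deviation(baseline, {"pitch_hz": 150.0, "f1_hz": 505.0})
print(scores)
```

A large absolute z-score on one feature and a small one on another would suggest which trait has been altered, which is the kind of categorized deviation the paragraph above asks for.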

1. Microstructures: the sub-phonetic level features
    
    Voice onset time
    
    harmonic bandwidth

    Creak / Vocal fry
    
    Excitation

    Modulation

    Formant frequencies

    Formant bandwidth

    Formant dispersion

    Glottal airflow / Glottal pulse shape

    Harmonicity / Peak-to-valley ratio

    Long-term average spectra

    Nasality

    Pitch

    Register

    Resonance

    Voice bar 

    Voice Bar bandwidth

    Voice coil peak displacement
    
    gradient
    
    region of interest
    
    neural nets auto-discovered features
    
2. Hypothesis & tests

3. More

![methods](./docs/pics/methods.png)

### Normal speech modeling
#### Features
High: phones, idiolect, semantics, accent, pronunciation, etc.

Middle: pitch, energy, duration, rhythm, timbre, etc.

Low: temporal-frequency, glottal, etc.

Micro: temporal (voice onset time, intra-transition time, gap duration, multi-scale temporal representation), 
frequential (formant number, position, width, proportions among formants), magnitudial (intensity transitions), statistics (pattern repeat frequency, close-pattern replacement, disappearing patterns), unvoiced pattern, etc.

### Quantification & modeling methods
Regression models, spectral models, gradient-based models, statistical models, template-based models

Neural net based methods (CNN, RNN, MLP, AE)

Graphical models (Bayesian nets, Markov random fields)

Neural networks + graphical models (e.g., CNN, LSTM, MLP + HMM, Gaussian process, Markov random fields)

### Connect speech models and speaker profiles
Classification, regression or generative tasks

### Speaker profile reconstruction
Generative models

For surroundings reconstruction, may need non-speech sounds and active probing methods (e.g., ultrasound arrays)

## Update 3/5 - 3/11
    1. Microfeatures
        a. For tidigits, construct speaker-feature dictionary. Speaker dict contains id, gender, age, dialect, seq info. Feature dict contains speaker id, spectrograms, mel spectrograms, const-q spectrograms, mfccs, etc. Separate training and test set. Separate single digits and seqs.
        b. For timit, segment by word and by phone. Compute spectrograms and mfccs.
        c. Write an interface for general tasks: input speaker id and retrieve its features; input features and predict speaker-ids.
        d. Try a Conv-deconv network for feature extraction.
    2. Interspeech
        a. Compute metrics.
        b. Make figures.
    3. Qual preparation
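
The speaker and feature dictionaries and the lookup interface from item 1 could be organised roughly as follows. This is a sketch only: the class name `FeatureStore`, its methods, and the field layout are hypothetical, and the feature values are short placeholder lists rather than real spectrograms or MFCCs. The "input features and predict speaker ids" direction is shown as a nearest-neighbour lookup purely as a stand-in for a trained model.

```python
from dataclasses import dataclass, field

@dataclass
class Speaker:
    speaker_id: str
    gender: str
    age: int
    dialect: str

@dataclass
class FeatureStore:
    speakers: dict = field(default_factory=dict)  # speaker_id -> Speaker
    features: dict = field(default_factory=dict)  # speaker_id -> {name -> [vectors]}

    def add(self, speaker, name, vector):
        """Register a speaker and append one feature vector under `name`."""
        self.speakers[speaker.speaker_id] = speaker
        per_speaker = self.features.setdefault(speaker.speaker_id, {})
        per_speaker.setdefault(name, []).append(vector)

    def get(self, speaker_id, name):
        """Input a speaker id, retrieve its stored features."""
        return self.features[speaker_id][name]

    def predict(self, name, vector):
        """Input a feature vector, return the id of the closest stored speaker."""
        def dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

        def best(sid):
            stored = self.features[sid].get(name, [])
            return min((dist(v, vector) for v in stored), default=float("inf"))

        return min(self.features, key=best)

store = FeatureStore()
store.add(Speaker("ae", "M", 30, "01"), "mfcc", [1.0, 2.0])
store.add(Speaker("bd", "W", 25, "02"), "mfcc", [5.0, 6.0])
print(store.get("ae", "mfcc"))
print(store.predict("mfcc", [4.5, 6.2]))
```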

## Update 5/8 - 5/13

### 1. Get familiar with datasets
> **TIMIT**
    * total 6300 sentences, 10 sentences spoken by each of 630 speakers
    * 8 major dialect regions of the United States
    
    
    1. **Dialect distribution**
    
    Table 1:  Dialect distribution of speakers
    
Dialect Region(dr) | # Male | # Female | Total
:------------------|-------:|---------:|------:
1 New England       |31 (63%) |18 (37%)  | 49 (8%)
2 Northern       |71 (70%) |31 (30%)  |102 (16%)
3 North Midland       |79 (77%) |23 (23%)  |102 (16%)
4 South Midland       |69 (69%) |31 (31%)  |100 (16%) 
5 Southern       |62 (63%) |36 (37%)  | 98 (16%) 
6 New York City       |30 (65%) |16 (35%)  | 46 (7%) 
7 Western       |74 (74%) |26 (26%)  |100 (16%) 
8 Army Brat (moved around)       |22 (67%) |11 (33%)  | 33 (5%)
total      |438 (70%)|192 (30%) |630 (100%)
    
    2. **Corpus text**
    The dialect sentences (SA) were meant to expose the dialectal variants of the speakers and were read by all 630 speakers.
    
    The phonetically-compact sentences (SX) were designed to provide a good coverage of pairs of phones, with extra occurrences of phonetic contexts thought to be either difficult or of particular interest. Each speaker read 5 of these sentences and each text was spoken by 7 different speakers.
    
    The phonetically-diverse sentences (SI) were selected from existing text sources - the Brown Corpus (Kuchera and Francis, 1967) and the Playwrights Dialog (Hultzen, et al., 1964) - so as to add diversity in sentence types and phonetic contexts. The selection criteria maximized the variety of allophonic contexts found in the texts. Each speaker read 3 of these sentences, with each sentence being read only by a single speaker. 
    
    Table 2:  TIMIT speech material

Sentence Type |  #Sentences |  #Speakers |  Total |  #Sentences/Speaker
:-------------|----------:| ---------:| -----:| ------------------:
Dialect (SA)|         2|         630|       1260|           2
Compact (SX)|        450|           7|       3150|           5
Diverse (SI)|       1890|           1|       1890|           3
Total|              2342|            |       6300|          10

    3. **Filesystem**
    
    /<CORPUS>/<USAGE>/<DIALECT>/<SEX><SPEAKER_ID>/<SENTENCE_ID>.<FILE_TYPE>
         SPEAKER_ID :== <INITIALS><DIGIT>
             INITIALS :== speaker initials, 3 letters
             DIGIT :== number 0-9 to differentiate speakers with identical initials
                             
    .wav - waveform file (SPHERE-headered)
    .txt - transcription
    .wrd - time-aligned word transcription
    .phn - time-aligned phonetic transcription
    
    Examples:
     /timit/train/dr1/fcjf0/sa1.wav
                         
     (TIMIT corpus, training set, dialect region 1, female speaker, 
      speaker-ID "cjf0", sentence text "sa1", speech waveform file)
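
The path layout above can be split into its fields with a regular expression. The following is a minimal sketch, assuming lower-case paths shaped like the example, `train`/`test` as the only usage values, dialect directories `dr1` to `dr8`, and sentence ids of the form `sa`/`sx`/`si` plus a number. The names `TimitPath` and `parse_timit_path` are made up for illustration.

```python
import re
from typing import NamedTuple

class TimitPath(NamedTuple):
    corpus: str
    usage: str
    dialect: str
    sex: str
    speaker_id: str
    sentence_id: str
    file_type: str

# /<CORPUS>/<USAGE>/<DIALECT>/<SEX><SPEAKER_ID>/<SENTENCE_ID>.<FILE_TYPE>
_TIMIT_RE = re.compile(
    r"^/(?P<corpus>[^/]+)/(?P<usage>train|test)/(?P<dialect>dr[1-8])/"
    r"(?P<sex>[mf])(?P<speaker_id>[a-z]{3}[0-9])/"
    r"(?P<sentence_id>(?:sa|sx|si)[0-9]+)\.(?P<file_type>wav|txt|wrd|phn)$",
    re.IGNORECASE,
)

def parse_timit_path(path):
    """Split a TIMIT file path into its named components."""
    match = _TIMIT_RE.match(path)
    if match is None:
        raise ValueError(f"not a TIMIT path: {path!r}")
    return TimitPath(**{k: v.lower() for k, v in match.groupdict().items()})

info = parse_timit_path("/timit/train/dr1/fcjf0/sa1.wav")
print(info)
```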
      
    4. **Dataset division**
    (1) Roughly 20 to 30% of the corpus should be used for testing purposes,
     leaving the remaining 70 to 80% for training.
    (2) No speaker should appear in both the training and testing portions.

    (3) All the dialect regions should be represented in both subsets, with 
     at least 1 male and 1 female speaker from each dialect.

    (4) The amount of overlap of text material in the two subsets should be
     minimized; if possible no texts should be identical.

    (5) All the phonemes should be covered in the test material, preferably
     each phoneme should occur multiple times in different contexts.
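
Criteria (2) and (3) above are easy to check mechanically for any custom split. The sketch below assumes each subset is a list of `(speaker_id, dialect, sex)` tuples; the function name `check_split` and the speaker id `"xxx0"` are hypothetical, and criteria (1), (4) and (5) are not covered.

```python
from collections import defaultdict

def check_split(train, test):
    """Check criteria (2) and (3) for two lists of (speaker_id, dialect, sex).

    Returns a list of human-readable problems; an empty list means both hold.
    """
    problems = []
    # (2) No speaker may appear in both subsets.
    shared = {s for s, _, _ in train} & {s for s, _, _ in test}
    if shared:
        problems.append(f"speakers in both subsets: {sorted(shared)}")
    # (3) Every dialect needs at least one male and one female in each subset.
    dialects = {d for _, d, _ in train} | {d for _, d, _ in test}
    for name, subset in (("train", train), ("test", test)):
        sexes = defaultdict(set)
        for _, dialect, sex in subset:
            sexes[dialect].add(sex)
        for dialect in sorted(dialects):
            missing = {"m", "f"} - sexes[dialect]
            if missing:
                problems.append(f"{name}: {dialect} lacks {sorted(missing)}")
    return problems

train = [("cjf0", "dr1", "f"), ("xxx0", "dr1", "m")]  # "xxx0" is a made-up id
test = [("dab0", "dr1", "m"), ("elc0", "dr1", "f")]
ok = check_split(train, test)
leaky = check_split(train, test + [("cjf0", "dr1", "f")])
print(ok)
print(leaky)
```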

The core test set thus contains 192 different texts ((5 SX + 3 SI sentences) x 24 speakers).  To avoid overlap with the training material the 2 SA sentences have been excluded from the core and complete test sets. **THESE SENTENCES ARE INCLUDED ON THE CD-ROM, BUT SHOULD NOT BE USED FOR TRAINING OR TEST PURPOSES.**

Table 1: Speakers in the Core Test Set

Dialect |       Male |        Female |   #Texts/Speaker |  #Total Texts
:-------: | :---------: | :------: | --------------: | ------------: 
1 |       DAB0, WBT0 |    ELC0 |        8 |              24 
2 |       TAS1, WEW0 |    PAS0 |        8 |              24
3 |       JMP0, LNT0 |    PKT0 |        8 |              24
4 |       LLL0, TLS0 |    JLM0 |        8 |              24
5 |       BPM0, KLT0 |    NLP0 |        8 |              24
6 |       CMJ0, JDH0 |    MGD0 |        8 |              24
7 |       GRT0, NJM0 |    DHC0 |        8 |              24
8 |       JLN0, PAM0 |    MLD0 |        8 |              24
Total |         16 |         8 |          |             192

The complete test set consists of a total of 1344 sentences, 8 sentences from each of the 168 speakers. In this set there are 120 distinct SX texts and 504 different SI texts. Thus, roughly 27% (624) of the texts have been reserved
for the test material.

Table 2: Dialect Distribution of Speakers in Complete Test Set
     
Dialect  |   #Male |  #Female |  Total
:-------: | -----: | -------: | -----:
1|           7|        4|       11
2|          18|        8|       26
3|          23|        3|       26
4|          16|       16|       32
5|          17|       11|       28
6|           8|        3|       11
7|          15|        8|       23
8|           8|        3|       11
Total|       112|       56|      168
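
The test-set figures quoted above follow from simple arithmetic on the table entries. The short check below reproduces them; it uses only numbers stated in this section.

```python
# Core test set: 8 dialects, 2 male + 1 female speakers each.
core_speakers = 8 * (2 + 1)
texts_per_speaker = 5 + 3            # 5 SX + 3 SI; the 2 SA sentences are excluded
core_texts = core_speakers * texts_per_speaker

# Complete test set: 112 male + 56 female speakers.
complete_speakers = 112 + 56
complete_sentences = complete_speakers * texts_per_speaker

# Share of distinct texts reserved for testing.
distinct_test_texts = 120 + 504      # SX + SI texts in the complete test set
all_texts = 2 + 450 + 1890           # SA + SX + SI texts in the corpus
test_text_percent = round(100 * distinct_test_texts / all_texts)

print(core_speakers, core_texts)
print(complete_speakers, complete_sentences)
print(distinct_test_texts, all_texts, test_text_percent)
```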
      
    5. **Other docs**
        
        sentences, dict, lexicon, alignment, tagging
        
    6. **Extra info**
    
        Birthday, height, race, education
        
        mixed-race, multi-lingual
        
        high/low pitch, conscious attempt to change accent, denasality, inhale/exhale, slow rate, high freq, intonation
        
        /R/ in "WASH", whistling /S/'S, 
        
        movement
        
        hearing loss, cold, glottal fry, hoarse, voice disorder

> **TIDIGITS**
    * more than 25 thousand digit sequences
    * 326 speakers (111 men, 114 women, 50 boys, and 51 girls)
    * collected in a quiet environment and digitized at 20 kHz
    
    1. **Speaker statistics**
        1. Age distribution
TABLE 1.  Number and Age Ranges of Speakers

Category | Symbol | Number | Age Range (years)
:--------|-------:|-------:|-----------------:
Man | M | 111 | 21 - 70
Woman | W | 114 | 17 - 59
Boy | B | 50 | 6 - 14
Girl | G | 51 | 8 - 15
        
        2. Dialect distribution
        
TABLE 2.  Description of Dialects and Distribution of Speakers

City|                      Dialect|             M|    W|    B|    G
:---|----------------------------:|-------------:|----:|----:|-----:
01 Boston, MA|            Eastern New England|      5|    5|    0|    1
02 Richmond, VA|          Virginia Piedmont|        5|    5|    2|    4
03 Lubbock, TX|           Southwest|                5|    5|    0|    1
04 Los Angeles, CA|       Southern California|      5|    5|    0|    1
05 Knoxville, TN|         South Midland|            5|    5|    0|    0
06 Rochester, NY|         Central New York|         6|    6|    0|    0
07 Denver, CO|            Rocky Mountains|          5|    5|    0|    0
08 Milwaukee, WI|         North Central|            5|    5|    2|    0
09 Philadelphia, PA|      Delaware Valley|          5|    6|    0|    1
10 Kansas City, KS|       Midland|                  5|    5|    4|    1
11 Chicago, IL|           North Central|            5|    5|    1|    2
12 Charleston, SC|        South Carolina|           5|    5|    1|    0
13 New Orleans, LA|       Gulf South|               5|    5|    2|    0
14 Dayton, OH|            South Midland|            5|    5|    0|    0
15 Atlanta, GA|           Gulf South|               5|    5|    0|    1
16 Miami, FL|             Spanish American|         5|    5|    1|    0
17 Dallas, TX|            Southwest|                5|    5|   34|   36
18 New York, NY|          New York City|            5|    5|    2|    2
19 Little Rock, AR|       South Midland|            5|    6|    0|    0
20 Portland, OR|          Pacific Northwest|        5|    5|    0|    0
21 Pittsburgh, PA|        Upper Ohio Valley|        5|    5|    0|    0
22|                       Black|                    5|    6|    1|    1
Total Speakers|           (326 in all)|             111|  114|   50|   51
        
    2. **Corpus text**
Each speaker provided 253 digits and 176 digit transitions. The procedure of generating the digits sequence makes the **frequency distribution uniform** over all eleven digits. However, the "zero"-"zero" and "oh"-"oh" transitions tend to occur **twice** as frequently as any other transition.

**Quiet data acquisition**; Utterances were digitized using a Digital Sound Corporation Model 200 **16-bit** A/D/A. The sampling rate was **20 kHz**, and a **10 kHz anti-aliasing** filter was used.

TABLE 2.  Corpus text types

Type | No. of sentences
:----|----------------:
isolated digits (two tokens of each of the eleven digits) | 22
two-digit sequences | 11
three-digit sequences | 11
four-digit sequences | 11
five-digit sequences | 11
seven-digit sequences | 11
TOTAL | 77
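
The per-speaker counts quoted above (253 digits and 176 digit transitions) can be recovered from the table of sequence types, and the corpus size from the number of speakers. A quick check, using only figures from this section:

```python
# (sequence length in digits, number of such sequences per speaker)
sequence_types = [(1, 22), (2, 11), (3, 11), (4, 11), (5, 11), (7, 11)]

utterances = sum(count for _, count in sequence_types)
digits = sum(length * count for length, count in sequence_types)
# A sequence of n digits contains n - 1 digit-to-digit transitions.
transitions = sum((length - 1) * count for length, count in sequence_types)
corpus_utterances = 326 * utterances

print(utterances, digits, transitions, corpus_utterances)
```

The last number is consistent with the "more than 25 thousand digit sequences" stated for the corpus.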
    
    3. **Filesystem**
        
        FILESPEC ::= /tidigits/<USAGE>/<SPEAKER-TYPE>/<SPEAKER-ID>/<DIGIT-STRING><PRODUCTION>.wav
             USAGE ::= test | train
             SPEAKER-ID ::= aa | ab | ac | ... | tc
             
        Example:
         /tidigits/train/man/fd/6z97za.wav

         ("tidigits" corpus, training material, adult male, speaker code "fd", digit sequence "six zero nine seven 
         zero", first production, NIST SPHERE file.)
    
The filename assigned to each data file consists of 3 to 9 characters and is of the form "NSI".

The symbol N represents a string of 1 to 7 of the characters Z,1,2,3,4,5,6,7,8,9,O and indicates the spoken digit sequence.

The symbol S represents a 2-letter speaker designator (initials). (The letters Z and O are not used in speaker designators.)

The symbol I is either null or a single digit, and is used to distinguish multiple utterances of the same digit sequence by the same speaker. The absence of a digit indicates there is only one utterance of the digit sequence by the speaker, while the presence of a digit M, say, indicates the M-th utterance of the digit sequence by the speaker. 

For example, the filename "23Z45MA" was assigned to the file containing the first or only utterance of the sequence "2 3 zero 4 5" by speaker designated "MA". The filename "ODF2" was assigned to the file containing the second utterance of the sequence "oh" by the speaker designated "DF".
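
The "NSI" naming rule can be parsed unambiguously because Z and O never occur in speaker designators. The sketch below covers only the "NSI" form described here (not the `<DIGIT-STRING><PRODUCTION>.wav` layout shown under Filesystem); the function name `parse_nsi` is hypothetical, and a missing index is reported as utterance 1.

```python
import re

# N: 1-7 of Z,1-9,O; S: two letters excluding Z and O; I: optional digit.
_NSI_RE = re.compile(r"^([Z1-9O]{1,7})([A-NP-Y]{2})([0-9]?)$")

_SPOKEN = {"Z": "zero", "O": "oh"}

def parse_nsi(name):
    """Return (spoken digits, speaker designator, utterance number)."""
    match = _NSI_RE.match(name.upper())
    if match is None:
        raise ValueError(f"not an NSI filename: {name!r}")
    digits, speaker, index = match.groups()
    spoken = [_SPOKEN.get(ch, ch) for ch in digits]
    return spoken, speaker, (int(index) if index else 1)

first = parse_nsi("23Z45MA")
second = parse_nsi("ODF2")
print(first)
print(second)
```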

    4. **Dataset division**
TABLE 7.  Number of Speakers as a Function of Speaker Category -- Test and Train

Speaker Category |   Man |   Woman |  Boy |   Girl
:----------------|------:|--------:|-----:|------:
Train|          55|     57|   25|     26
Test|           56|     57|   25|     25
    
    5. **Other docs**
many convenient records

    6. **Extra info**
    
    The following information was stored in the header of each data file:

    Speaker's name;
    Two-character speaker designator;
    Speaker's age;
    Speaker's dialect classification;
    Speaker's category (M,W,B,G);
    Speaker's subset (Train, Test);
    Sequence of digits uttered.

    The dataset provides statistics on **Speaker error** and **Listener error**.
    
    It also provides the inherent recognizability of the data. Each utterance in the database was downsampled to 12.5 kHz, analyzed and synthesized using 14-th order autocorrelation LPC analysis. A 25 ms window length and 10 ms frame period were used with pre-emphasis constant of 0.9375. Pitch tracking was accomplished using a crosscorrelation algorithm with post-processing. Listeners heard only this synthesized speech. The recognizability of the (LPC synthesized) data was measured as 99.99%.
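
To make the analysis settings above concrete, the sketch below applies pre-emphasis with the stated constant and converts the window and frame period into sample counts at 12.5 kHz. It assumes the usual first-order pre-emphasis filter y[n] = x[n] - a·x[n-1]; passing the first sample through unchanged and rounding the 312.5-sample window down are choices made here, not details given by the dataset documentation.

```python
def pre_emphasize(signal, coeff=0.9375):
    """First-order pre-emphasis: y[n] = x[n] - coeff * x[n-1]."""
    if not signal:
        return []
    # The first sample has no predecessor, so it is passed through (assumption).
    return [signal[0]] + [
        signal[n] - coeff * signal[n - 1] for n in range(1, len(signal))
    ]

def frame_params(sample_rate_hz=12500, window_ms=25, period_ms=10):
    """Window length and frame period in samples (fractions rounded down)."""
    window = int(sample_rate_hz * window_ms / 1000)
    period = int(sample_rate_hz * period_ms / 1000)
    return window, period

emphasized = pre_emphasize([1.0, 1.0, 1.0, 0.0])
window, period = frame_params()
print(emphasized)
print(window, period)
```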

> Comparison

### 2. Listen, visualize, and compare
    
    1. Listen to samples from the two datasets
    
    2. Compute their spectra, visualize and analyze
    
    3. Discoveries

### 3. Get familiar with Sphinx

    1. Read docs
    
    2. Read code and run demos

### 4. Other
    
    1. Code maintenance: format data io, extract and visualize features, previous speech align and segmentation codes, general interfaces
    
    2. Read relevant papers: speech production, deep kernel learning
    
    3. Summarize statistical learning

## Update 5/14 - 5/20

### New thoughts on feature extraction and speaker profiling
(discussed with Yandong)
> Build a framework (classical Gaussian models or neural nets) that takes in **speech spectrograms** and outputs **speaker profiles** (id and id-related attributes: gender, age, height, race, education, language, dialects). Then **trace back** from outputs to inputs, and find the **active regions** (inducing parts) in the inputs and their **activation paths to the output**. Next, build a graphical model upon the active regions and the hidden parameters (such as pitch, breathing patterns, speaking patterns, illness, etc.).

> This idea aims to exploit the structures that exist in both the input and the output spaces, and to connect the two spaces with neural net models.

#### Steps
1. Baseline: (1) Build Gaussian models to do speaker recognition; (2) Build neural network models to do speaker recognition; (3) Get decent performance.

2. Back tracking: Trace back from output to input to obtain active path and active region.

3. Active region modeling: build a graphical model (such as the Ising model or an RBM) to model the active region with the corresponding hidden parameters. We can then perform inference on the graph based on observations. By doing so we can find the connection between the **specific nodes in the graph** (the patterns in spectrograms) and the **observed profiles**.

4. Using the **discovered patterns in spectrograms** and their indications, we can further build a simplified predictor to do speaker recognition and related tasks.
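
The notes do not fix how the back tracking in step 2 is carried out. One simple stand-in is occlusion sensitivity: zero out each time-frequency patch of the input and record how much the model's score drops. The sketch below illustrates that substitute technique on a toy scorer; `occlusion_map`, `toy_score`, the patch size and the 4x4 "spectrogram" are all hypothetical, and a real experiment would use the trained speaker-recognition network as `score_fn`.

```python
def occlusion_map(spectrogram, score_fn, patch=2):
    """Score drop when each patch x patch block of the input is zeroed out."""
    rows, cols = len(spectrogram), len(spectrogram[0])
    base = score_fn(spectrogram)
    drops = {}
    for r0 in range(0, rows, patch):
        for c0 in range(0, cols, patch):
            masked = [row[:] for row in spectrogram]  # copy before masking
            for r in range(r0, min(r0 + patch, rows)):
                for c in range(c0, min(c0 + patch, cols)):
                    masked[r][c] = 0.0
            drops[(r0, c0)] = base - score_fn(masked)
    return drops

def toy_score(spec):
    """Toy stand-in for a network: it only looks at the top-left 2x2 block."""
    return spec[0][0] + spec[0][1] + spec[1][0] + spec[1][1]

spec = [
    [1.0, 2.0, 0.5, 0.5],
    [3.0, 4.0, 0.5, 0.5],
    [0.5, 0.5, 0.5, 0.5],
    [0.5, 0.5, 0.5, 0.5],
]
drops = occlusion_map(spec, toy_score)
active = max(drops, key=drops.get)
print(active, drops[active])
```

Patches with a large score drop play the role of the active regions; gradient-based attribution through the network would be an alternative way to obtain them.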

### Others
1. Update previous sections in this notebook.

2. Building baseline.



## Update ~ 05/29/2017

### Summary
1. wrote my own scripts for force alignment
**NOTE:** TO DO ALL TASKS TOGETHER:
```bash
./scripts/feat_align_extract_join.master.sh
```

2. Writing code for baselines

### Force align audio && extract phones and words
#### 0. Extract features: MFCC / logspec
    
    1. Prepare control files: filelists
timit_train_wavlist.ctl
![](./docs/pics/timit_train_list.png)

```bash
# command example
find ../timit_data/timit/ -name "*.wav" | grep -v "maxed" | grep "train" | sed "s/.*timit\///g" | sort > timit_train_wavlist.ctl
```

    2. Run scripts to extract mfcc / logspec
```bash
# command example
# NOTE: run all scripts from the directory that contains scripts/, i.e. as ./scripts/script_name
./scripts/compute_feat.sh --type mfcc -f ./lists/timit_train_wavlist.ctl -i ../timit_data/timit -o ./mfcc -t ./tmp

# view mfcc
./wave2feat/sphinx_cepview -f ./mfcc/train/dr1/fcjf0/sa1.80-7200-40f.mfc
```
sampled mfcc
![](./docs/pics/mfcc_sample.png)

#### 1. Force align
    1. Prepare transcripts, dicts, phonelists
transcripts: transcript / file pairs ![timit_trans](./docs/pics/timit_trans.png)
dicts ![timit_dicts](./docs/pics/timit_dict.png)

    2. Decode and align
phone segments
![](./docs/pics/ph_seg.png)
word segments
![](./docs/pics/word_seg.png)
```bash
# command example
./scripts/falign_timit.sh --part 1 --npart 1 --listbasedir ./lists/transcripts --ctlf timit_train.ctl --transf timit_correct.trans --dictf timit.dict --fillerf timit.fillerdict --mfcdir ./mfcc --outputdir falign_out --outsegdir phseg_out --jobname timit_falign_05292017 &

# NOTE: can check correctness by reading falign_out/*.log
```

**NOTE:** 1. alignment may differ each time.
2. Misalignment


#### 2. Extract phones and words
    1. Prepare phone ctls
Get ctl for each force-aligned phone.
![](./docs/pics/phnctl_output.png)
![](./docs/pics/phone_ctl.png)
```bash
# command example
./scripts/makectl.sh  -indir ./phseg_out -type phone -phase train -ext phseg -tmpdir tmp -outdir phctl_out
```

    2. Extract phone segments
```bash
# command example
./scripts/extract_join_segs.sh -ctl phctl_out/AA.ctl -inwav ../timit_data/timit/train -ext wav -outwav phsegwav_out -join true -jpath ./ -jnum 100
```
    3. Hear and visualize

#### Extra: Using cluster: qsub

## Update ~ 06/04/2017
### Summary
1. Read literature on speaker recognition
2. Work on SRE with GMM, i-vector
3. Work on SRE with DNN

### 1. Literature: speech inversion
**Main idea:**
1. use DNN/CNN to predict articulatory trajectories from speech
    
    Data: multi-speaker synthetic articulatory data obtained by varying vocal tract length, pitch, articulatory weights, etc. It is generated with the Haskins Laboratories Task Dynamics Application (TADA): the CMU dictionary is fed to TADA to generate vocal tract constriction variables and the corresponding synthetic speech.
    
    **Tract-variable trajectories / vocal tract constriction variables (*TVs*)**: variables defining the states of 5 **constrictors** (lip, tongue tip, tongue body, velum, glottis); they result from **articulatory activities** and are formulated as equations of motion, namely **critically damped second-order systems**.
    
    Use the **Pearson product-moment correlation (PPMC) coefficient** to measure amplitude and dynamic similarity between the ground-truth and the estimated TVs; the reported similarity scores are therefore correlations.
    
2. then use two parallel CNNs on articulatory information and acoustical information (gammatone-filterbanked spectrograms) to do speech recognition
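
For reference, the PPMC coefficient used in item 1 to compare ground-truth and estimated TVs can be computed as below. The function name `pearson` and the trajectory values are illustrative only. The example also shows why the score is a correlation rather than an error measure: an estimate that is a scaled and shifted copy of the ground truth still scores 1.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # A constant sequence has zero variance and raises ZeroDivisionError here.
    return cov / sqrt(var_x * var_y)

truth = [0.0, 1.0, 2.0, 3.0]            # made-up ground-truth trajectory
estimate = [0.5, 2.5, 4.5, 6.5]         # 2 * truth + 0.5
r_same_shape = pearson(truth, estimate)
r_reversed = pearson(truth, estimate[::-1])
print(r_same_shape, r_reversed)
```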

**Remarks:** 

1. This builds a connection between articulatory kinetics and a set of governing variables. Then we can do speech inversion to learn (TVs, acoustical feats (normalized modulation coefficients)). Following the proposal in _Update 5/14 - 5/20_, we do back tracking through the NN to find the active regions in spectrograms, which gives us (active regions, speech). Now we propose to further relate (TVs, active regions), possibly through joint learning.

2. The TVs are discrete models of articulatory activities, while in reality articulatory activities are continuous and overlapped. We propose to form a continuous model instead.

ref: https://www.sri.com/work/publications/hybrid-convolutional-neural-networks-articulatory-and-acoustic-information-based

**Questions:**
1. what are senones?

### 2. Speaker recognition with GMM based methods
1. UBM-GMM/i-vector model (with source-normalization)

*Use tools: Bob*
https://pythonhosted.org/bob/index.html#

### 3. Speaker recognition with NN based methods
0. DNN/i-vector model

1. DNN bottleneck model

2. CNN model

3. RNN model

*Implemented with PyTorch*

### To Dos
1. Continue building baselines

2. Study NN backtracking, GAN and other generative models

https://128.84.21.199/pdf/1706.00550.pdf

3. Work on theories on (articulatory, acoustical, speech and speaker id) modeling

## Update ~ 06/25/2017

### 0. Summary
> (With Yandong) Implement and run speaker recognition baselines.

> Paper reading

### 1. Baselines
1. speaker recognition on timit

```bash
# Example command
./ivector_sr.py --vv -d timit -p energy-2gauss -e mfcc-60 -a ${ALGORITHM} -s timit-${ALGORITHM} -T temp -R results --parallel <N>
# Evaluate
evaluate.py -d ${PATH_TO_SCORES_DEV} -c EER
```

baseline | config  | EER (%)
:--------|:-------:|-------:
mfcc-60 + ubm-gmm | 128\*gaussians | 5.367
mfcc-60 + ivector + cosine | 512\*gaussians, subspace dim 400 | 4.762
mfcc-60 + ivector + plda | 256\*gaussians, subspace dim 100, G_F dim 50 | running
mfcc-60 + ivector + lda-wccn-plda | lda dim 50, 256\*gaussians, subspace dim 100, G_F dim 50 | running
mfcc-60 + jfa | 512\*gaussians, U_V dim 2 | 11.607
dnn | |
cnn | 4\*conv + 1\*pool + 2\*fc | Acc 92~94
rnn | |

**Remarks:**

preprocessing: 2 Gaussian energy-based VAD

mfcc-60: (19 MFCC features + Energy) + First and second derivatives

wccn: within-class covariance normalization
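
The EER column in the table above is the operating point where the false-accept and false-reject rates are equal. How `evaluate.py` computes it is not reproduced here; the sketch below is a simple threshold sweep over the observed scores, with made-up genuine and impostor scores, and it returns the average of the two rates at the threshold where they are closest.

```python
def equal_error_rate(genuine, impostor):
    """Approximate EER by trying every observed score as the accept threshold."""
    best_gap, eer = None, None
    for threshold in sorted(set(genuine) | set(impostor)):
        # False reject: a genuine trial scoring below the threshold.
        frr = sum(s < threshold for s in genuine) / len(genuine)
        # False accept: an impostor trial scoring at or above the threshold.
        far = sum(s >= threshold for s in impostor) / len(impostor)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

genuine_scores = [0.9, 0.8, 0.7, 0.4]    # hypothetical same-speaker trials
impostor_scores = [0.6, 0.3, 0.2, 0.1]   # hypothetical different-speaker trials
eer = equal_error_rate(genuine_scores, impostor_scores)
print(f"EER = {100 * eer:.1f}%")
```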

### 2. Paper reading

* Wavenet: a generative model for raw audio

**remarks:**

    a. neural generative model, autoregressive, models the joint prob as a product of conditionals, dilated causal convolutions (== time-delay NNs), gated, residual, skip connections
    b. context stacks?
    c. GAN or Dual: (speaker, context dependent) speech synthesis v.s. speech recog
    d. Text to speech:
![](./docs/pics/tts.png)
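
The dilated causal convolutions noted in remark a. can be illustrated in a few lines. This is a toy sketch, not WaveNet itself: it uses a single channel, a kernel of size 2 with weights fixed to 1, and dilations 1, 2, 4, all chosen only for illustration. Feeding an impulse through the stack shows that each output depends only on current and past samples, and that the receptive field grows to 1 + (kernel size - 1) * (sum of dilations).

```python
def dilated_causal_conv(x, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - k * dilation], with x[t] = 0 for t < 0."""
    return [
        sum(
            w * x[t - k * dilation]
            for k, w in enumerate(weights)
            if t - k * dilation >= 0
        )
        for t in range(len(x))
    ]

def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence one output sample."""
    return 1 + (kernel_size - 1) * sum(dilations)

impulse = [1.0] + [0.0] * 7
response = impulse
for dilation in (1, 2, 4):
    response = dilated_causal_conv(response, [1.0, 1.0], dilation)

field = receptive_field(2, (1, 2, 4))
print(response)
print(field)
```

The impulse at t = 0 reaches exactly the next `field` output samples, so doubling the dilation per layer widens the context exponentially with depth.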


### Todos

1. work on baseline network dissection

## Update ~ 7/3/2017
### 0. Summary
> classification tasks with phonemes

> DCGAN on phonemes

### 1.  Classification with phonemes

speaker / dialect / height / education / race classification with phonemes

use **spectrograms of phonemes** as input

net | input | label | acc (%)
:---|------:|------:|-------:
cnn | AA | speaker id |
 | AE | |
 | AO | |
 | B | |
 | D | |
 | EY | |
 | G | |
 | HH | |
 | IY | |
 | K | |
 | L | |
 | M | |
 | N | |
 | NG | |
 | OW | |
 | P | |
 | S | |
 | UW | |
 | V | |
 | W | |
 | Y | |
 | Z | |


### 2. DCGAN on phonemes

use Deep Convolutional GAN to find internal representation of phonemes


ref:

[1] Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

https://arxiv.org/abs/1511.06434

### 3. VAE on phonemes

use Variational Autoencoder to find internal representation of phonemes

ref:

[1] Stochastic Gradient VB and the Variational Auto-Encoder

https://arxiv.org/pdf/1312.6114.pdf

### Todos

1. CNN with Gaussian filters

2. Net dissection
