# Speech corpus

[NLTK](http://www.nltk.org/) includes a small subset of the
[TIMIT](https://catalog.ldc.upenn.edu/LDC93S1) corpus.
Ideally, you want access to the full corpus, as we do.

### Notes

* [`nltk` corpus reader](https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/timit.py)
* [`nltk` corpus reader example](https://github.com/nltk/nltk/blob/develop/nltk/test/corpus.doctest#L863)
* Using TIMIT in PyLearn2
  * https://ift6266h14.wordpress.com/experimenting/
  * https://github.com/jfsantos/ift6266h14/blob/master/old/timit_full.py
  * http://vdumoulin.github.io/articles/timit-part-2/
  * https://jpraymond.wordpress.com/2014/02/21/using-the-new-an-improved-pylearn2-timit-dataset/
  * https://github.com/vdumoulin/research/blob/master/code/pylearn2/datasets/timit.py
  * https://github.com/jfsantos/ift6266h14/tree/master/old/pylearn2_timit

### Possibly useful Python packages

* [`PySoundFile`](https://github.com/bastibe/PySoundFile) (reads NIST Sphere, hopefully)
* [`PySoundCard`](https://github.com/bastibe/PySoundCard)
* [`audio` and related tools](https://github.com/boisgera?tab=repositories) (psychoacoustics?)

In [None]:
import nltk

try:
    print(nltk.data.find('corpora/timit'))
except LookupError:
    nltk.download('timit')
    print(nltk.data.find('corpora/timit'))

In [None]:
from IPython.display import Audio
from nltk.corpus import timit

In [None]:
utt = timit.utteranceids()[0]
Audio(data=timit.wav(utt, start=1000))

# Incorporating TIMIT with Nengo

Basically, we want to use TIMIT
to generate evaluation points
and phoneme targets,
which we will use to solve for
appropriate decoding weights
for the ensembles that represent
acoustic features.
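
As a rough illustration of what "solving for decoding weights" means here, the sketch below fits decoders by ridge-regularized least squares on synthetic data. It is not the Nengo solver and does not use TIMIT: the features, the rectified-linear "ensemble", and the `solve_decoders` helper are all assumptions made for illustration. The rows of `activities` play the role of evaluation points, and the one-hot rows of `targets` play the role of phoneme targets.

```python
import numpy as np

def solve_decoders(activities, targets, reg=0.01):
    """Ridge-regularized least squares for D such that activities @ D ~ targets."""
    n_points, n_neurons = activities.shape
    gram = activities.T @ activities + reg * n_points * np.eye(n_neurons)
    return np.linalg.solve(gram, activities.T @ targets)

rng = np.random.RandomState(0)
n_classes, n_per_class, n_neurons = 3, 100, 50

# Stand-in "acoustic features": one well-separated cluster per phoneme class
labels = np.repeat(np.arange(n_classes), n_per_class)
features = 5.0 * np.eye(n_classes)[labels] + 0.1 * rng.randn(labels.size, n_classes)

# Stand-in "ensemble": rectified-linear units with random encoders and biases
encoders = rng.randn(n_classes, n_neurons)
biases = rng.randn(n_neurons)
activities = np.maximum(0, features @ encoders + biases)

# One-hot phoneme targets, one row per evaluation point
targets = np.eye(n_classes)[labels]

decoders = solve_decoders(activities, targets)
predicted = np.argmax(activities @ decoders, axis=1)
accuracy = np.mean(predicted == labels)
print(accuracy)
```

With real data, the activities would come from simulating the model on TIMIT audio rather than from random clusters.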

In [None]:
# Let's work with a single utterance first
import os

# Get the utterance and the data associated with it
region = 1
sex = 'm'
spkr_id = 'cpm0'
sent_type = 'i'
sent_number = 564

timit_root = nltk.data.find('corpora/timit')
spkr_dir = "dr%d-%s%s" % (region, sex, spkr_id)
sent_file = "s%s%d" % (sent_type, sent_number)

path = os.path.join(timit_root, spkr_dir, sent_file)

In [None]:
Audio(filename="%s.wav" % path)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import soundfile as sf

data, fs = sf.read("%s.sph" % path)  # Try the Sphere version
dt = 1. / fs
plt.plot(np.arange(data.size) * dt, data)

Phone transcriptions are (fortunately!) available
in `*.phn` files. Here's an example.

In [None]:
!cat {path}.phn
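
Each line of a `.phn` file holds a start sample, an end sample, and a phone label. The sketch below converts those sample indices to seconds, assuming TIMIT's 16 kHz sampling rate. The `parse_phn` helper and the sample numbers in `example_phn` are made up for illustration and are not taken from the corpus.

```python
def parse_phn(text, fs=16000):
    """Turn `.phn` text into (start_seconds, end_seconds, label) tuples."""
    segments = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        start, end, label = line.split()
        segments.append((int(start) / fs, int(end) / fs, label))
    return segments

# Made-up sample numbers, for illustration only
example_phn = """0 1600 h#
1600 2400 sh
2400 4000 iy
4000 5600 h#
"""

segments = parse_phn(example_phn)
for start, end, label in segments:
    print("%-3s %.3f s to %.3f s" % (label, start, end))
```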

## Phonemes in TIMIT

### Consonants

#### Stops

| Symbol | Example word    | Possible phonetic transcription |
|--------|-----------------|---------------------------------|
| b      |    bee          |    BCL B iy                     |
| d      |    day          |    DCL D ey                     |
| g      |    gay          |    GCL G ey                     |
| p      |    pea          |    PCL P iy                     |
| t      |    tea          |    TCL T iy                     |
| k      |    key          |    KCL K iy                     |
| dx     |    muddy, dirty |    m ah DX iy, dcl d er DX iy   |
| q      |    bat          |    bcl b ae Q                   |

#### Affricates

| Symbol | Example word | Possible phonetic transcription |
|--------|--------------|---------------------------------|
| jh     |    joke      |    DCL JH ow kcl k              |
| ch     |    choke     |    TCL CH ow kcl k              |
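
In the stop and affricate tables above, each release is preceded by a closure interval (for example `DCL JH` or `KCL K`). One way to handle this, sketched below, is to merge a closure with the release that directly follows it into a single segment. This is only one possible approach; the `RELEASES` table and `merge_closures` helper are hypothetical names, and the sample indices are invented.

```python
# Closure label -> release labels it may precede (taken from the tables above)
RELEASES = {
    'bcl': {'b'}, 'dcl': {'d', 'jh'}, 'gcl': {'g'},
    'pcl': {'p'}, 'tcl': {'t', 'ch'}, 'kcl': {'k'},
}

def merge_closures(segments):
    """Merge each closure with the matching release that directly follows it."""
    merged = []
    i = 0
    while i < len(segments):
        start, end, label = segments[i]
        nxt = segments[i + 1] if i + 1 < len(segments) else None
        if nxt is not None and nxt[2] in RELEASES.get(label, ()):
            # Span from closure start to release end, labelled as the release
            merged.append((start, nxt[1], nxt[2]))
            i += 2
        else:
            merged.append((start, end, label))
            i += 1
    return merged

# "joke" as (start_sample, end_sample, label); sample indices are invented
example = [(0, 100, 'dcl'), (100, 150, 'jh'), (150, 400, 'ow'),
           (400, 480, 'kcl'), (480, 520, 'k')]
merged = merge_closures(example)
print(merged)
```

A closure that is not followed by a matching release keeps its own label, so unreleased stops remain visible in the output.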

#### Fricatives

| Symbol | Example word | Possible phonetic transcription |
|--------|--------------|---------------------------------|
| s      |    sea       |    S iy                         |
| sh     |    she       |    SH iy                        |
| z      |    zone      |    Z ow n                       |
| zh     |    azure     |    ae ZH er                     |
| f      |    fin       |    F ih n                       |
| th     |    thin      |    TH ih n                      |
| v      |    van       |    V ae n                       |
| dh     |    then      |    DH eh n                      |

#### Nasals

| Symbol | Example word  | Possible phonetic transcription |
|--------|---------------|---------------------------------|
| m      |    mom        |    M aa M                       |
| n      |    noon       |    N uw N                       |
| ng     |    sing       |    s ih NG                      |
| em     |    bottom     |    b aa tcl t EM                |
| en     |    button     |    b ah q EN                    |
| eng    |    washington |    w aa sh ENG tcl t ax n       |
| nx     |    winner     |    w ih NX axr                  |

#### Semivowels and glides

| Symbol | Example word | Possible phonetic transcription |
|--------|--------------|---------------------------------|
| l      |    lay       |    L ey                         |
| r      |    ray       |    R ey                         |
| w      |    way       |    W ey                         |
| y      |    yacht     |    Y aa tcl t                   |
| hh     |    hay       |    HH ey                        |
| hv     |    ahead     |    ax HV eh dcl d               |
| el     |    bottle    |    bcl b aa tcl t EL            |

### Vowels

| Symbol | Example word | Possible phonetic transcription  |
|--------|--------------|----------------------------------|
| iy     |    beet      |    bcl b IY tcl t                |
| ih     |    bit       |    bcl b IH tcl t                |
| eh     |    bet       |    bcl b EH tcl t                |
| ey     |    bait      |    bcl b EY tcl t                |
| ae     |    bat       |    bcl b AE tcl t                |
| aa     |    bott      |    bcl b AA tcl t                |
| aw     |    bout      |    bcl b AW tcl t                |
| ay     |    bite      |    bcl b AY tcl t                |
| ah     |    but       |    bcl b AH tcl t                |
| ao     |    bought    |    bcl b AO tcl t                |
| oy     |    boy       |    bcl b OY                      |
| ow     |    boat      |    bcl b OW tcl t                |
| uh     |    book      |    bcl b UH kcl k                |
| uw     |    boot      |    bcl b UW tcl t                |
| ux     |    toot      |    tcl t UX tcl t                |
| er     |    bird      |    bcl b ER dcl d                |
| ax     |    about     |    AX bcl b aw tcl t             |
| ix     |    debit     |    dcl d eh bcl b IX tcl t       |
| axr    |    butter    |    bcl b ah dx AXR               |
| ax-h   |    suspect   |    s AX-H s pcl p eh kcl k tcl t |

### Others

| Symbol | Description                          |
|--------|--------------------------------------|
| pau    | pause                                |
| epi    | epenthetic silence                   |
| h#     | begin/end marker (non-speech events) |
| 1      | primary stress marker                |
| 2      | secondary stress marker              |

In [None]:
consonants = [
    'b', 'd', 'g', 'p', 't', 'k', 'dx', 'q',
    'jh', 'ch',
    's', 'sh', 'z', 'zh', 'f', 'th', 'v', 'dh',
    'm', 'n', 'ng', 'em', 'en', 'eng', 'nx',
    'l', 'r', 'w', 'y', 'hh', 'hv', 'el'
]
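# "the closure intervals of stops which are distinguished from the stop release"
# Each closure maps to its stop. 'dcl' and 'tcl' also precede the affricates
# 'jh' and 'ch', but a dict key can only map to one value, so those closures
# are folded into 'd' and 't'.
closures = {
    'bcl': 'b',
    'dcl': 'd',
    'gcl': 'g',
    'pcl': 'p',
    'tcl': 't',
    'kcl': 'k',
}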
vowels = [
    'iy', 'ih', 'eh', 'ey',
    'ae', 'aa', 'aw', 'ay', 'ah', 'ao',
    'oy', 'ow', 'uh', 'uw', 'ux',
    'er', 'ax', 'ix', 'axr', 'ax-h',
]
ignores = [
    'pau', 'epi', 'h#', '1', '2',
]

Let's parse a `.phn` file into a sequence of phonemes
and their corresponding audio slices.
We'll collect the slices in two dictionaries,
one for vowels and one for consonants,
keyed by phoneme.

In [None]:
from collections import defaultdict
cons = defaultdict(list)
vows = defaultdict(list)

with open("%s.phn" % path, 'r') as phnfile:
    for line in phnfile:
        start, end, phn = line.split()
        start, end = int(start), int(end)

        if phn in ignores:
            continue
        if phn in closures:
            phn = closures[phn]

        dataslice = np.array(data[start:end])
        if phn in consonants:
            cons[phn].append(dataslice)
        elif phn in vowels:
            vows[phn].append(dataslice)
        else:
            raise ValueError("Unrecognized phoneme: '%s'" % phn)

In [None]:
# Let's look at all of the speech samples for a random vowel
import random
vow_phn = random.choice(list(vows))
print(vow_phn)
speech = np.concatenate(vows[vow_phn])
# Plot against time in seconds
dt = 1. / fs
plt.plot(np.arange(speech.size) * dt, speech)

In [None]:
# Let's repeat this to get something we can listen to

def timit_path(region, sex, spkr_id, sent_type, sent_number):
    timit_root = nltk.data.find('corpora/timit')
    spkr_dir = "dr%d-%s%s" % (region, sex, spkr_id)
    sent_file = "s%s%d" % (sent_type, sent_number)
    return os.path.join(timit_root, spkr_dir, sent_file)


def add_utterance(tpath, cons, vows):
    data, fs = sf.read("%s.sph" % tpath)  # Try the Sphere version
    with open("%s.phn" % tpath, 'r') as phnfile:
        for line in phnfile:
            start, end, phn = line.split()
            start, end = int(start), int(end)

            if phn in ignores:
                continue
            if phn in closures:
                phn = closures[phn]

            dataslice = np.array(data[start:end])
            if phn in consonants:
                cons[phn].append(dataslice)
            elif phn in vowels:
                vows[phn].append(dataslice)
            else:
                raise ValueError("Unrecognized phoneme: '%s'" % phn)

cons = defaultdict(list)
vows = defaultdict(list)

region = 1
sex = 'm'
spkr_id = 'cpm0'

for sent_type, sent_number in zip(['a', 'a', 'i', 'i', 'i', 'x', 'x', 'x', 'x', 'x'],
                                  [1, 2, 564, 1194, 1824, 24, 114, 204, 294, 384]):
    tpath = timit_path(region, sex, spkr_id, sent_type, sent_number)
    add_utterance(tpath, cons=cons, vows=vows)

# Let's hear all the 'ow' phonemes
phn = np.concatenate(vows['ow']).ravel()
print(phn.shape)
Audio(data=phn, rate=fs)  # Dunno why this only works 10% of the time...

In [None]:
import nengo
import phd.filters
import phd.sounds
from phd.sounds import ArrayProcess

# Final step: transform cons and vows into eval_points and targets
def phn2nengo(model, probe, phonemes, samples):
    # Remember the original sound so we can restore it afterwards
    orig_sound = model.auditory_filter.sound_process
    dt = 1. / fs

    eval_points = []
    targets = []
    for i, phoneme in enumerate(phonemes):
        if not samples.get(phoneme):
            continue  # no recorded samples for this phoneme
        sound = np.concatenate(samples[phoneme]).ravel()
        model.auditory_filter.sound_process = ArrayProcess(sound)
        sim = nengo.Simulator(model, dt=dt*.5)
        sim.run(dt * sound.size)
        #if pool is not None:
        #    d = vowel.shape[1] // pool
        #    pooled_v = np.zeros((vowel.shape[0], d))
        #    for p in range(d):
        #        pooled_v[:, p] = np.sum(vowel[:, p*pool:(p+1)*pool], axis=1)
        #    vowel = pooled_v
        points = sim.data[probe]
        # One-hot target: one row per probed sample, 1 in this phoneme's column
        target = np.zeros((points.shape[0], len(phonemes)))
        target[:, i] = 1
        eval_points.append(points)
        targets.append(target)

    model.auditory_filter.sound_process = orig_sound
    return np.concatenate(eval_points), np.concatenate(targets)

# fs = 20000.
freqs = phd.filters.erbspace(20, 10000, 64)
sound = phd.sounds.WavFile('speech.wav')
aud_filter = phd.filters.gammatone(freqs)
cons_delay = 0.075
vowel_delay = 0.03
# Note: no integrator here

model = phd.SpeechRecognition()
model.add_periphery(freqs, sound, aud_filter, fs=fs, middle_ear=True)
model.add_derivative(n_neurons=30, delay=cons_delay)
model.add_derivative(n_neurons=30, delay=vowel_delay)

with model:
    # TODO: put all the info into one probe
    pass

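with model:
    # NOTE: `vowel` and `cons` must be the Nengo objects to probe; they are
    # not defined above, and `cons` currently names the consonant sample dict.
    vowel_p = nengo.Probe(vowel, synapse=0.01, sample_every=0.001)
    cons_p = nengo.Probe(cons, synapse=0.01, sample_every=0.001)

# Use the probes defined above
vowel_ep, vowel_targets = phn2nengo(model, vowel_p, vowels, vows)
cons_ep, cons_targets = phn2nengo(model, cons_p, consonants, cons)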

_, vow_detect = model.add_phoneme_detector(15, vowel_ep, vowel_targets, [vowel_delay])
_, cons_detect = model.add_phoneme_detector(15, cons_ep, cons_targets, [cons_delay])