This repository has Python classes for iterating through the annotations
in the [Buckeye Corpus](http://buckeyecorpus.osu.edu/). They include 
cross-references between the .words, .phones, and .log files, and
can be used to extract sound clips from the .wav files. The docstrings in
`buckeye.py` and `containers.py` describe how to use the classes in more
detail.

The scripts can be installed directly from GitHub with pip using this command:

    pip install git+http://github.com/scjs/buckeye.git

You can also copy the `buckeye/` subdirectory into your working directory, or
put it in your Python path.

### Speaker

A `Speaker` instance is created by pointing to one of the zipped speaker
archives that can be downloaded from the corpus website.

In [1]:
import buckeye

speaker = buckeye.Speaker('./speakers/s01.zip')

This will open and process the annotations in each of the sub-archives inside
the speaker archive (the tracks, such as `s0101a` and `s0101b`). If an optional
`load_wavs` argument is set to `True` when creating a `Speaker` instance, the
wav files associated with each track will also be loaded into memory. Otherwise,
only the annotations are loaded.

Each `Speaker` instance has the speaker's code-name, sex, age, and interviewer
sex available as attributes.

In [2]:
print speaker.name, speaker.sex, speaker.age, speaker.interviewer

s01 f y f


The tracks can be accessed by iterating through the `Speaker` instance.
There is more detail about accessing the annotations below under the heading
**Tracks**.

In [3]:
for track in speaker:
    print track.name

s0101a
s0101b
s0102a
s0102b
s0103a


The tracks can also be accessed through the `tracks` attribute.

In [4]:
print speaker.tracks

[Track("s01/s0101a.zip"), Track("s01/s0101b.zip"), Track("s01/s0102a.zip"), Track("s01/s0102b.zip"), Track("s01/s0103a.zip")]


The `corpus()` generator function is a convenience for iterating through all of
the speaker archives together. Put all forty speaker archives in one directory,
such as `speakers/`. Create a new generator with this directory as an argument.

In [5]:
corpus = buckeye.corpus('./speakers')

The generator will yield the `Speaker` instances in numerical order.

In [6]:
for speaker in corpus:
    print speaker.name,

s01 s02 s03 s04 s05 s06 s07 s08 s09 s10 s11 s12 s13 s14 s15 s16 s17 s18 s19 s20 s21 s22 s23 s24 s25 s26 s27 s28 s29 s30 s31 s32 s33 s34 s35 s36 s37 s38 s39 s40


### Track

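Each `Track` instance has `words`, `phones`, `log`, `txt`, and `wav` attributes that contain the corpus data from one track for one speaker. Each speaker has several tracks; `s01` above has five. After the loop over the corpus above, `speaker` refers to the last speaker that was yielded, `s40`.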

In [8]:
speaker

Speaker("./speakers\s40.zip")

In [9]:
track = speaker[0]
track

Track("s40/s4001a.zip")

#### Words

The `words` attribute stores a list of Word and Pause instances, created from the `.words` file.

In [10]:
track.words[:10]

[Pause(u'<SIL>', 0.0, 0.96774),
 Pause(u'{B_TRANS}', 0.96774, 0.96774),
 Pause(u'<EXCLUDE-name>', 0.96774, 1.529995),
 Pause(u'<VOCNOISE>', 1.529995, 1.805585),
 Pause(u'<IVER>', 1.805585, 9.063611),
 Word(u'alright', 9.063611, 9.453562, [u'aa', u'l', u'r', u'ay', u't'], [u'ao', u'r', u'eh', u't'], u'NN'),
 Pause(u'<IVER>', 9.453562, 35.562644),
 Pause(u'<VOCNOISE>', 35.562644, 36.182116),
 Pause(u'<IVER>', 36.182116, 65.563712),
 Word(u'um', 65.563712, 65.963575, [u'ah', u'm'], [u'ah', u'm'], u'UH')]

Word instances have these nine attributes:

* `orthography` (the word's spelling)
* `beg` (word onset time, in seconds)
* `end` (word offset time)
* `dur` (duration)
* `phonemic` (the canonical transcription)
* `phonetic` (the close transcription)
* `pos` (the word's part of speech)
* `misaligned` (marked as True if the word has a negative duration, or if the phonetic transcription doesn't match what's in the `.phones` file)
* `phones` (a list of references to Phone instances that have the labels and timing information for the phonetic transcription)

In [11]:
word = track.words[5]

print word.orthography, word.beg, word.end, word.dur, word.phonemic, word.phonetic, word.pos, word.misaligned

The phones are retrieved based on the timestamps for the word and for the entries in the `.phones` file.

In [None]:
word.phones

Phones have four attributes:

* `seg` (the pseudo-ARPABET transcription of the phone)
* `beg` (onset time)
* `end` (offset time)
* `dur` (duration)

In [None]:
for phone in word.phones:
    print phone.seg, phone.beg, phone.end, phone.dur
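The `phonemic` and `phonetic` attributes can be compared to see how far a token departs from its canonical form. The sketch below is only an illustration: `FakeWord` is a hypothetical stand-in that mimics the attribute names listed above so the example runs without the corpus, and `segment_difference()` is a helper name invented here, not part of the package. The transcriptions are copied from the `alright` and `um` entries shown earlier.

```python
from collections import namedtuple

# Hypothetical stand-in that mimics the attribute names of a Word instance.
FakeWord = namedtuple('FakeWord', ['orthography', 'phonemic', 'phonetic'])


def segment_difference(word):
    """Return the number of canonical segments minus the number of
    segments in the close transcription."""
    return len(word.phonemic) - len(word.phonetic)


alright = FakeWord('alright',
                   ['aa', 'l', 'r', 'ay', 't'],
                   ['ao', 'r', 'eh', 't'])
um = FakeWord('um', ['ah', 'm'], ['ah', 'm'])

differences = [(w.orthography, segment_difference(w)) for w in (alright, um)]
```

With real data, the same function should work on any item in `track.words` that has both transcription attributes, which excludes Pause instances.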

Many of the annotations are things like `<SIL>` (silence) or `<IVER>` (the interviewer's turn). These are yielded as Pause instances, not as Word instances. Pause instances have six attributes:

* `entry` (the kind of Pause, e.g. `<SIL>`)
* `beg` (pause onset time, in seconds)
* `end` (pause offset time)
* `dur` (duration)
* `misaligned` (marked as True if the Pause has a negative duration)
* `phones` (a list of references to Phone instances that are associated with this Pause, e.g. one or more `SIL` tokens)

In [None]:
pause = track.words[1]

print pause.entry, pause.beg, pause.end, pause.dur, pause.misaligned

In [None]:
pause.phones
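Because `track.words` mixes Word and Pause instances, it is often useful to separate them before reading word-only attributes such as `orthography`. The following is a minimal sketch that relies on the attribute lists above: it treats anything with an `orthography` attribute as a word. `FakeWord`, `FakePause`, and `split_words_and_pauses()` are hypothetical names used so the example is self-contained; the timestamps are copied from the listing above.

```python
from collections import namedtuple

# Hypothetical stand-ins that mimic the attribute names described above.
FakeWord = namedtuple('FakeWord', ['orthography', 'beg', 'end'])
FakePause = namedtuple('FakePause', ['entry', 'beg', 'end'])


def split_words_and_pauses(items):
    """Split a mixed list into (words, pauses), using the presence of an
    `orthography` attribute to identify words."""
    words, pauses = [], []
    for item in items:
        if hasattr(item, 'orthography'):
            words.append(item)
        else:
            pauses.append(item)
    return words, pauses


items = [
    FakePause('<SIL>', 0.0, 0.96774),
    FakePause('<IVER>', 1.805585, 9.063611),
    FakeWord('alright', 9.063611, 9.453562),
    FakePause('<IVER>', 9.453562, 35.562644),
    FakeWord('um', 65.563712, 65.963575),
]

words, pauses = split_words_and_pauses(items)
```

Checking for an attribute keeps the sketch independent of where the classes are defined; testing against the package's own classes with `isinstance` would be an alternative.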

#### Phones

The phones in a track can also be accessed directly through the `phones` attribute of a Track instance.

In [None]:
for phone in track.phones[:10]:
    print phone.seg, phone.beg, phone.end, phone.dur
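A flat list of phones lends itself to simple tallies. As a rough sketch, the segment labels can be counted with `collections.Counter`. `FakePhone` and `count_segments()` are hypothetical names, and the labels are taken from the transcriptions shown earlier rather than from a real `.phones` file.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in that carries only the `seg` attribute of a Phone.
FakePhone = namedtuple('FakePhone', ['seg'])


def count_segments(phones):
    """Count how often each segment label occurs in a list of phones."""
    return Counter(phone.seg for phone in phones)


phones = [FakePhone(seg) for seg in ['ao', 'r', 'eh', 't', 'ah', 'm', 'ah', 'm']]
counts = count_segments(phones)
```

With the corpus loaded, `count_segments(track.phones)` should give the same kind of tally for a whole track.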

#### Log

The entries in the Track's `.log` file are stored in the `log` attribute as a list of `LogEntry` instances. `LogEntry` instances have `entry`, `beg`, `end`, and `dur` attributes.

In [None]:
for log in track.log:
    print log.entry, log.beg, log.end, log.dur

You can call the `get_logs()` method of a Track to retrieve the log entries that overlap with the given timestamps.

For example, the log entries that overlap with the interval from 60 seconds to 62 seconds can be found like this:

In [None]:
logs = track.get_logs(60.0, 62.0)

for log in logs:
    print log.entry, log.beg, log.end
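To make "overlap" concrete, here is a small sketch of an interval test. It is not the package's implementation of `get_logs()`. In particular, it assumes that intervals which only touch at an endpoint do not count as overlapping, which may differ from what the library does. The function name and the example intervals are invented for illustration.

```python
def overlaps(entry_beg, entry_end, beg, end):
    """Return True if the interval (entry_beg, entry_end) shares some
    stretch of time with the interval (beg, end).

    Assumption: intervals that only touch at an endpoint do not overlap.
    """
    return entry_beg < end and entry_end > beg


# Hypothetical (beg, end) pairs standing in for log entries.
entries = [(10.0, 20.0), (55.0, 61.0), (60.5, 61.5), (62.0, 70.0)]

overlapping = [(b, e) for (b, e) in entries if overlaps(b, e, 60.0, 62.0)]
```

Here the second and third intervals are kept, because each covers part of the span from 60 to 62 seconds.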

#### Txt

The `txt` attribute holds a list of speaker turns without timestamps, read from
the `.txt` file in the track.

In [None]:
track.txt[1]

#### Wavs

If a Speaker instance is created with `load_wavs=True`, each Track will also have a `wav` attribute that stores a `Wave_read` instance.

In [12]:
speaker = buckeye.Speaker('./speakers/s01.zip', load_wavs=True)
track = speaker[0]

track.wav

<wave.Wave_read instance at 0x00000000057A3988>

You can extract sound clips from the wav file with the `clip_wav()` method of each Track:

In [13]:
track.clip_wav('myclip.wav', 60.0, 62.0)

This will create a wav file in the current directory called `myclip.wav` which
contains the sound between 60 and 62 seconds in the track audio.
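For readers who want to see what clipping involves, the sketch below does something similar with the standard library's `wave` module alone. It is not necessarily how `clip_wav()` is implemented: the `clip()` function is a hypothetical helper, and the input is a synthetic three-second silent file with arbitrary parameters, built in memory so that the example runs without the corpus.

```python
import io
import wave


def clip(wav, out, beg, end):
    """Copy the audio between `beg` and `end` (in seconds) from an open
    Wave_read instance into `out`, a filename or writable file object."""
    rate = wav.getframerate()
    first = int(beg * rate)
    last = int(end * rate)

    # Jump to the first frame of the clip and read up to the last one.
    wav.setpos(first)
    frames = wav.readframes(last - first)

    writer = wave.open(out, 'wb')
    writer.setparams(wav.getparams())
    writer.writeframes(frames)  # the frame count in the header is fixed up on close
    writer.close()


# Build a synthetic 3-second, mono, 16-bit, 16 kHz wav file in memory.
source = io.BytesIO()
synthetic = wave.open(source, 'wb')
synthetic.setnchannels(1)
synthetic.setsampwidth(2)
synthetic.setframerate(16000)
synthetic.writeframes(b'\x00\x00' * 16000 * 3)
synthetic.close()

source.seek(0)
reader = wave.open(source, 'rb')

clipped = io.BytesIO()
clip(reader, clipped, 1.0, 2.5)
```

Passing a filename such as `'myclip.wav'` as `out` would write the clip to disk instead of to memory.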

If you're using a `corpus()` generator, you can set `load_wavs` to `True` and it will be passed down to every `Track` instance, so that all of the wav files will be loaded.

In [14]:
corpus = buckeye.corpus('./speakers', load_wavs=True)