# Creating and Manipulating Datasets

## Introduction

Creating and (re)organizing datasets is the starting point of almost every data-related task. This notebook covers a few popular datasets for speech recognition, enhancement, and activity detection included in `audlib.data`. All datasets follow [PyTorch's `Dataset` convention](https://pytorch.org/docs/stable/_modules/torch/utils/data/dataset.html#Dataset) and are therefore compatible with its [`DataLoader`](https://pytorch.org/docs/stable/_modules/torch/utils/data/dataloader.html#DataLoader) out of the box, provided the corresponding corpus is available on disk.

Modules covered in this notebook are:
- `audlib.data.wsj.WSJ0` for speech recognition
- `audlib.data.wsj.RATS` for speech activity detection and enhancement

## Creating Datasets
This section creates a few datasets to demonstrate the generic interface shared by all datasets, as well as the keyword parameters that are specific to particular tasks.

### Generic Interface
A generic interface for any dataset is:

```python
DatasetX(root, train=True, filt=None, transform=None)
```
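To make the shape of this interface concrete, here is a minimal sketch of a map-style dataset with the same constructor signature. It is not the `audlib` implementation: `ToyDataset`, the `train`/`test` subdirectory layout, and the exact meaning given to `filt` (a predicate on file paths) and `transform` (a function applied to each sample on access) are assumptions made for illustration.

```python
import os


class ToyDataset:
    """Hypothetical dataset mirroring DatasetX(root, train, filt, transform)."""

    def __init__(self, root, train=True, filt=None, transform=None):
        # Assumption: partitions are stored in <root>/train and <root>/test.
        partition = os.path.join(root, "train" if train else "test")
        paths = []
        for dirpath, _, filenames in os.walk(partition):
            for name in filenames:
                paths.append(os.path.join(dirpath, name))
        # `filt` decides which paths are kept; None keeps every file.
        self.filepaths = sorted(p for p in paths if filt is None or filt(p))
        self.transform = transform

    def __len__(self):
        # Number of valid files in the chosen partition.
        return len(self.filepaths)

    def __getitem__(self, idx):
        sample = self.filepaths[idx]
        # `transform` is applied lazily, one sample at a time.
        return self.transform(sample) if self.transform else sample
```

Because the class only relies on `__len__` and `__getitem__`, an object like this should be usable wherever a map-style PyTorch dataset is expected. A real dataset would load audio in `__getitem__` instead of returning the path.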

```python
from audlib.data.wsj import WSJ0, ASRWSJ0
from audlib.asr.util import PhonemeMap

phonememap = PhonemeMap("/home/xyy/repos/pyaudlib/audlib/misc/cmudict-0.7b")
wsj0_train = WSJ0("/home/xyy/data/wsj0/", train=True)
print(wsj0_train)
wsj0_test = WSJ0("/home/xyy/data/wsj0/", train=False)
print(wsj0_test)
```


```
+++++ Summary for [WSJ0][Train partition] +++++
Total [43177] valid files to be processed.

+++++ Summary for [WSJ0][Test partition] +++++
Total [4122] valid files to be processed.
```


### Random Sampling and Additive Noise for Speech Enhancement Datasets

A common scenario in deep-learning-based speech enhancement is to train on noisy speech produced by adding noise to clean speech at various signal-to-noise ratios (SNRs). The `data.enhance` module provides tools to assemble such a dataset. Specifically,
- `RandSample` samples segments of a specified duration range from any dataset, while
- `Additive` combines two `Dataset`s at different SNRs and produces pairs of noisy and clean speech.
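The two operations described above can be sketched in a few lines. This is an illustration of the underlying arithmetic, not the `audlib` code: the helper names `random_segment`, `power`, and `mix_at_snr` are hypothetical, and plain Python lists stand in for audio arrays to keep the sketch dependency-free. The noise gain $g$ is chosen so that $10\log_{10}\left(P_\text{clean} / (g^2 P_\text{noise})\right)$ equals the target SNR in dB.

```python
import math
import random


def random_segment(signal, sr, minlen, maxlen, rng):
    """Return a random contiguous slice lasting minlen to maxlen seconds."""
    lo = int(minlen * sr)
    hi = min(int(maxlen * sr), len(signal))
    if lo > hi:
        raise ValueError("signal is shorter than the minimum duration")
    length = rng.randint(lo, hi)  # randint is inclusive on both ends
    start = rng.randint(0, len(signal) - length)
    return signal[start:start + length]


def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(clean, noise, snr_db):
    """Scale noise to reach snr_db relative to clean, then add the two.

    Assumes noise is at least as long as clean and has nonzero power.
    """
    if len(noise) < len(clean):
        raise ValueError("noise must be at least as long as the clean signal")
    noise = noise[:len(clean)]
    # Solve P_clean / (gain^2 * P_noise) = 10^(snr_db / 10) for gain.
    gain = math.sqrt(power(clean) / (power(noise) * 10 ** (snr_db / 10)))
    scaled = [gain * n for n in noise]
    noisy = [c + n for c, n in zip(clean, scaled)]
    return noisy, scaled
```

A typical use would draw a segment of clean speech with `random_segment(speech, 16000, 3.0, 8.0, random.Random(0))`, then pass it with a noise recording to `mix_at_snr(segment, noise, 5)` to obtain a noisy and clean pair at 5 dB SNR.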

```python
from audlib.data.enhance import RandSample, Additive

SR = 16000  # fix sampling rate
wsjspeech = RandSample("/home/xyy/data/wsj0/", sr=SR, minlen=3., maxlen=8., unit='second',
                       filt=lambda p: p.endswith('.wv1'))
noizeus = RandSample("/home/xyy/Documents/MATLAB/loizou/Databases/noise16k/",
                     sr=SR, minlen=3.)
wsj_noizues = Additive(wsjspeech, noizeus)  # noisy speech dataset
```

```python
from IPython.display import Audio
sample = wsj_noizues[250]
print("SNR: [{}]dB".format(sample.snr))
Audio(sample.noisy.signal, rate=sample.noisy.samplerate)
```

```
SNR: [5]dB
```
