# Creating and Manipulating Datasets

## Introduction

Creating and (re-)organizing datasets is the starting point of almost every data-related task. This notebook covers a few popular datasets for speech recognition, enhancement, and activity detection that are included in `audlib.data`. All datasets follow [PyTorch's convention](https://pytorch.org/docs/stable/_modules/torch/utils/data/dataset.html#Dataset) and are therefore compatible with its [`DataLoader`](https://pytorch.org/docs/stable/_modules/torch/utils/data/dataloader.html#DataLoader) out of the box, provided that the relevant dataset is available on disk.

Modules covered in this notebook are:
- `audlib.data.wsj.WSJ0` for speech recognition
- `audlib.data.wsj.RATS` for speech activity detection and enhancement

## Creating Datasets
This section creates a few datasets to demonstrate the generic interface shared by all datasets, as well as the keyword parameters that are specific to datasets for particular tasks.

### Generic Interface
A generic interface for any dataset is:

```python
DatasetX(root, train=True, filt=None, transform=None)
```
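To make the shape of this interface concrete, here is a minimal sketch of a dataset with the same constructor signature. It is not taken from `audlib`: `ToyDataset`, `make_toy_root`, and the `train`/`test` directory layout are hypothetical, and it assumes that `filt` is a predicate on file paths and that `transform` is a callable applied to each sample when it is accessed.

```python
import os
import tempfile


class ToyDataset:
    """Hypothetical map-style dataset mirroring DatasetX(root, train, filt, transform)."""

    def __init__(self, root, train=True, filt=None, transform=None):
        # Assumed layout: <root>/train and <root>/test each hold the samples.
        split_dir = os.path.join(root, "train" if train else "test")
        paths = sorted(
            os.path.join(split_dir, name) for name in os.listdir(split_dir)
        )
        # Assumption: `filt` is a predicate on file paths; None keeps everything.
        self.paths = [p for p in paths if filt is None or filt(p)]
        # Assumption: `transform` is a callable applied to each sample on access.
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        with open(self.paths[idx]) as fp:
            sample = fp.read()
        return self.transform(sample) if self.transform else sample


def make_toy_root():
    """Create a throwaway directory tree with a few text 'samples'."""
    root = tempfile.mkdtemp()
    for split, names in (("train", ["a.txt", "b.txt", "c.log"]), ("test", ["d.txt"])):
        os.makedirs(os.path.join(root, split))
        for name in names:
            # Each file stores its own stem, e.g. "a.txt" contains "a".
            with open(os.path.join(root, split, name), "w") as fp:
                fp.write(name.split(".")[0])
    return root


root = make_toy_root()
# Keep only .txt files from the training split and upper-case each sample.
train = ToyDataset(root, train=True,
                   filt=lambda p: p.endswith(".txt"),
                   transform=str.upper)
# No filter or transform: the test split is returned as stored.
test = ToyDataset(root, train=False)
print(len(train), train[0], len(test), test[0])
```

Because the class defines `__len__` and `__getitem__`, it follows the map-style dataset convention that PyTorch's `DataLoader` expects; a real dataset would typically subclass `torch.utils.data.Dataset` rather than a plain class.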

```python
from audlib.data.wsj import WSJ0, ASRWSJ0
from audlib.asr.util import PhonemeMap

# Phoneme map built from the CMU pronouncing dictionary (adjust the path to your setup).
phonememap = PhonemeMap("/home/xyy/repos/pyaudlib/audlib/misc/cmudict-0.7b")

# Training partition of WSJ0.
wsj0_train = WSJ0("/home/xyy/data/wsj0/", train=True)
print(wsj0_train)
# wsj0_asr_train = ASRWSJ0(wsj0_train, phonememap)
# print(wsj0_asr_train)

# Test partition of WSJ0.
wsj0_test = WSJ0("/home/xyy/data/wsj0/", train=False)
print(wsj0_test)
# wsj0_asr_test = ASRWSJ0(wsj0_test, phonememap)
# print(wsj0_asr_test)
```