Scripts to aid in the setup of various databases for PyTorch:
- Gigaword Abstractive Summary Corpus
- Librispeech
- The Wall Street Journal Speech Corpus
- TIMIT Acoustic-Phonetic Continuous Speech Corpus
Please consult the wiki for details about individual databases.
`pytorch-database-prep` is not intended to be installed. It is intended to be
included in some other Git repository as a submodule and pinned to a specific
commit, e.g.

``` sh
git submodule add https://github.com/sdrobert/pytorch-database-prep prep
cd prep
git checkout --detach  # current commit
```
See this blog post for more info.
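Pinning works because the superproject records the submodule's exact commit in its own tree. The workflow above can be sketched end-to-end offline; this is a minimal illustration that substitutes a throwaway local repository (`prep-src`) for the GitHub URL, with illustrative names and identities throughout:

``` sh
set -e
tmp=$(mktemp -d)

# stand-in for https://github.com/sdrobert/pytorch-database-prep
git init -q "$tmp/prep-src"
git -C "$tmp/prep-src" -c user.email=you@example.com -c user.name=you \
    commit -q --allow-empty -m "initial commit"

# the consuming recipe repository
git init -q "$tmp/recipe"
cd "$tmp/recipe"
# protocol.file.allow is needed for local-path submodules in recent Git
git -c protocol.file.allow=always submodule add -q "$tmp/prep-src" prep
(cd prep && git checkout -q --detach)  # pin to the current commit
git -c user.email=you@example.com -c user.name=you \
    commit -qm "add prep submodule"

# the superproject now records the exact commit prep is pinned to
# (mode 160000 marks a gitlink entry)
git ls-tree HEAD prep
```

Anyone cloning the recipe repository with `git clone --recurse-submodules` then gets `prep` checked out at exactly that commit.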
There is no fixed list of requirements for this repository. Most recipes will
rely on `pytorch >= 1.0`, `pydrobert-speech`, and `pydrobert-pytorch`, but some
will need more. All packages needed should be readily available on PyPI and
Conda (perhaps in the `conda-forge` and `sdrobert` channels). I keep track of
all the packages that I explicitly install when building/testing these recipes
in `environment.yaml`. If you have Conda, you should be able to recreate the
environment I use by calling

``` sh
conda env create -f environment.yaml
```

The package versions in `environment.yaml` are unlikely to be strictly
necessary; if you have a conflicting package, try the version you already have
instead.
Recipes for speech corpora such as WSJ "borrow" from Kaldi, which is Apache
2.0-licensed, as is this repository. Kaldi's license file has been copied to
`COPYING_kaldi`.
The `rouge-1.5.5.py` script uses code from py-rouge, which is Apache
2.0-licensed.
`ngram_lm.DbfilenameCountDict` and the various routines surrounding it are
based on the Python 3.11 source of the `shelve` module, subject to the PSF
license. Details are in `LICENSE_python`.