Scripts to aid in the setup of various databases for PyTorch:
- Gigaword Abstractive Summary Corpus
- Librispeech
- The Wall Street Journal Speech Corpus
- TIMIT Acoustic-Phonetic Continuous Speech Corpus
Please consult the wiki for details about individual databases.
`pytorch-database-prep` is not intended to be installed. It is intended to be
included in some other Git repository as a submodule and pinned to a specific
commit, e.g.

``` sh
git submodule add https://github.com/sdrobert/pytorch-database-prep prep
cd prep
git checkout --detach  # current commit
```
See this blog post for more info.
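Pinning works because the superproject records the submodule's exact commit in its own tree. The workflow above can be sketched end-to-end offline; this is a minimal illustration that substitutes a throwaway local repository (`prep-src`) for the GitHub URL, with illustrative names and identities throughout:

``` sh
set -e
tmp=$(mktemp -d)

# stand-in for https://github.com/sdrobert/pytorch-database-prep
git init -q "$tmp/prep-src"
git -C "$tmp/prep-src" -c user.email=you@example.com -c user.name=you \
    commit -q --allow-empty -m "initial commit"

# the consuming recipe repository
git init -q "$tmp/recipe"
cd "$tmp/recipe"
# protocol.file.allow is needed for local-path submodules in recent Git
git -c protocol.file.allow=always submodule add -q "$tmp/prep-src" prep
(cd prep && git checkout -q --detach)  # pin to the current commit
git -c user.email=you@example.com -c user.name=you \
    commit -qm "add prep submodule"

# the superproject now records the exact commit prep is pinned to
# (mode 160000 marks a gitlink entry)
git ls-tree HEAD prep
```

Anyone cloning the recipe repository with `git clone --recurse-submodules` then gets `prep` checked out at exactly that commit.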
There is no fixed list of requirements for this repository. Most recipes will
rely on `pytorch >= 1.0`, `pydrobert-speech`, and `pydrobert-pytorch`, but some
will need more. All packages needed should be readily available on PyPI and
Conda (perhaps in the `conda-forge` and `sdrobert` channels). I keep track of
all the packages that I explicitly install when building/testing these recipes
in `environment.yaml`. If you have Conda, you should be able to recreate the
environment I use by calling

``` sh
conda env create -f environment.yaml
```

The package versions in `environment.yaml` are unlikely to be strictly
necessary; if you have a conflicting package, try the version you already have
instead.
Recipes for speech corpora such as WSJ "borrow" from Kaldi, which is Apache
2.0-licensed, as is this repository. Kaldi's license file has been copied to
`COPYING_kaldi`.
The `rouge-1.5.5.py` script uses code from py-rouge, which is Apache
2.0-licensed.
`ngram_lm.DbfilenameCountDict` and the various routines surrounding it are
based on the Python 3.11 source of the `shelve` module, subject to the PSF
license. Details are in `LICENSE_python`.