pytorch-database-prep

Scripts to aid in the setup of various databases for PyTorch

Databases

  • Gigaword Abstractive Summary Corpus
  • Librispeech
  • The Wall Street Journal Speech Corpus
  • TIMIT Acoustic-Phonetic Continuous Speech Corpus

Please consult the wiki for details about individual databases.

Installation

pytorch-database-prep is not intended to be installed. It is intended to be included in some other Git repository as a submodule and pinned to a specific commit, e.g.

git submodule add https://github.com/sdrobert/pytorch-database-prep prep
cd prep
git checkout --detach  # current commit

See this blog post for more info.

There is no fixed list of requirements for this repository. Most recipes will rely on pytorch >= 1.0, pydrobert-speech, and pydrobert-pytorch, but some will need more. All packages needed should be readily available on PyPI and Conda (perhaps in the conda-forge and sdrobert channels). I will keep track of all the packages that I explicitly install when building/testing these recipes under environment.yaml. If you have Conda, you should be able to recreate the environment I use by calling

conda env create -f environment.yaml

The package versions listed in environment.yaml are unlikely to be strictly necessary; if one of them conflicts with a version you already have installed, just try your version instead.
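Since there is no fixed requirements list, it can help to check which of the commonly needed packages are already present before running a recipe. The following is a minimal sketch, not part of this repository. The helper names `installed_versions` and `major_version` are hypothetical, and the distribution names (`torch`, `pydrobert-speech`, `pydrobert-pytorch`) are assumptions about how these packages register their metadata.

```python
from importlib import metadata

# Assumed distribution names: PyTorch usually registers itself as "torch",
# and the two pydrobert packages are assumed to use their PyPI names.
COMMON_PACKAGES = ("torch", "pydrobert-speech", "pydrobert-pytorch")


def installed_versions(names=COMMON_PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def major_version(version):
    """Return the leading integer of a version string such as '2.1.0+cu118'."""
    return int(version.split(".")[0])


if __name__ == "__main__":
    found = installed_versions()
    for name, version in found.items():
        print(f"{name}: {version or 'not installed'}")
    torch_version = found["torch"]
    if torch_version is not None and major_version(torch_version) < 1:
        print("torch is older than 1.0; most recipes expect pytorch >= 1.0")
```

A package reported as "not installed" is not necessarily a problem: individual recipes may need fewer or more packages than these three.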

Licensing

The recipes for speech corpora such as WSJ "borrow" from Kaldi. Kaldi's license file has been copied to COPYING_kaldi. Kaldi is Apache 2.0 licensed, as is this repo.

The rouge-1.5.5.py script uses code from py-rouge, which is Apache 2.0 licensed.

ngram_lm.DbfilenameCountDict and the various routines surrounding it are based on the Python 3.11 source of the shelve module, which is subject to the PSF license. Details are in LICENSE_python.
