

Supervised learning baselines

torchtext 0.4.0 includes several example scripts that showcase how to create datasets, build vocabularies, train, test, and run inference for common supervised learning baselines. We further provide a tutorial that explains these examples in more detail.

For an advanced application of these constructs, see the example.


We would like to thank the open source community, who continues to send pull
requests for new features and bug-fixes.

Major New Features

New Features


  • Added logging to download_from_url (#569)
  • Added fast, basic English sentence normalization to get_tokenizer (#569, #568)
  • Updated docs theme to pytorch_sphinx_theme (#573)
  • Refined Example.fromJSON() to support parsing nested keys in nested JSON datasets (#563)
  • Added __len__ and get_vecs_by_tokens to the Vectors class to generate vectors from a list of tokens (#561)
  • Added templates for torchtext users to file issues (#553, #574)
  • Added a new argument specials in Field.build_vocab to save the user-defined special tokens (#495)
  • Added a new argument is_target in the RawField class to indicate whether the field is a target variable, False by default (#459); set is_target in LabelField to True so that it takes effect (#450)
  • Added the option to serialize fields with torch.save or pickle.dump, allowing tokenizers in different languages (#453)
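The specials behavior above can be sketched in plain Python. This is a minimal illustration of the idea behind Field.build_vocab(specials=...) and the unk-index fix, not the torchtext implementation; the build_vocab function here is a hypothetical stand-in.

```python
# Minimal sketch: special tokens occupy the first vocabulary slots,
# then remaining tokens are added by frequency.  Pure-Python
# illustration of the concept, not the torchtext API.
from collections import Counter

def build_vocab(tokens, specials=("<unk>", "<pad>")):
    counts = Counter(tokens)
    itos = list(specials)  # user-defined specials come first
    itos += [t for t, _ in counts.most_common() if t not in specials]
    stoi = {t: i for i, t in enumerate(itos)}
    return itos, stoi

itos, stoi = build_vocab("the cat sat on the mat".split())
# The unk index is derived from the position of "<unk>" among the
# specials, which is the behavior the fix in #531 restores.
unk_index = stoi["<unk>"]
```

With the default specials above, "<unk>" maps to index 0 and "<pad>" to index 1, with corpus tokens following.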

Bug Fixes

  • Allow caching from unverified SSL in CharNGram (#554)
  • Fix the wrong unk index by generating the unk_index according to the specials (#531)
  • Update Moses tokenizer link in README.rst file (#529)
  • Fix the url to load wiki.simple.vec (#525), fix the dead url to load fastText vectors (#521)
  • Fix UnicodeDecodeError for loading sequence tagging dataset (#506)
  • Fix collisions between OOV words and in-vocab words (Issue #447, fixed in #482)
  • Fix a mistake in the progress bar of the Vectors class (#480)
  • Add the dependency on six under 'install_requires' in setup.py (PR #475 for Issue #465)
  • Fix a bug in Field class which causes overwriting the stop_words attribute (PR #458 for Issue #457)
  • Transpose the text and target tensors if the text field in BPTTIterator has 'batch_first' set to True (#462)
  • Add <unk> to default specials (#567)

Backward Compatibility

  • Dropped support for Python 2.7.9 (#552)

Major changes:

  • Added bAbI dataset (#286)
  • Added MultiNLI dataset (#326)
  • PyTorch 0.4 compatibility + bug fixes (#299, #302)
  • Batch iteration now returns a tuple of (inputs, outputs) by default, without having to index attributes from Batch (#288)
  • [BREAKING] Iterator no longer repeats infinitely by default (now stops after epoch has completed) (#417)
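The breaking change in #417 can be illustrated with a toy iterator. This is a hypothetical minimal class, not the torchtext Iterator; it only shows the shift from repeat-forever to stop-after-one-epoch as the default.

```python
# Sketch of the #417 behavior change: by default, iteration now stops
# after one pass over the data instead of repeating forever.
import itertools

class Iterator:
    """Toy stand-in for an epoch-based data iterator."""
    def __init__(self, data, repeat=False):
        self.data = data
        self.repeat = repeat

    def __iter__(self):
        if self.repeat:
            return itertools.cycle(self.data)  # old default: never stops
        return iter(self.data)                 # new default: one epoch

batches = list(Iterator([1, 2, 3]))  # terminates after one epoch
```

Callers that relied on the old behavior would now pass repeat=True explicitly.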

Minor changes:

  • Handle the Moses tokenizer being migrated from NLTK (#361)
  • Vector loading made more efficient and flexible (#353)
  • Allow special tokens to be added to the end of the vocabulary (#400)
  • Allow filtering unknown words from examples (#413)
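The unknown-word filtering option (#413) can be sketched as a simple membership test against the vocabulary. A pure-Python illustration of the idea, not the torchtext API; filter_unknown is a hypothetical helper name.

```python
# Sketch: drop tokens that are not in the vocabulary, the idea behind
# torchtext's option to filter unknown words from examples (#413).
vocab = {"<unk>", "the", "cat", "sat"}

def filter_unknown(tokens, vocab):
    return [tok for tok in tokens if tok in vocab]

filtered = filter_unknown("the cat sat on the mat".split(), vocab)
# -> ["the", "cat", "sat", "the"]
```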


Bugfixes:

  • Documentation (#382, #383, #393, #395, #410)
  • Create cache dir for pretrained embeddings if it doesn't exist (#301)
  • Various typos (#293, #369, #373, #344, #401, #404, #405, #418)
  • Fixed Dataset.split() not copying sort_key (#279)
  • Various python 2.* vs python 3.* issues (#280)
  • Fix OOV token vector dimensionality (#308)
  • Lowercased type of TabularDataset (#315)
  • Fix splits method in various translation datasets (#377, #385, #392, #429)
  • Fix ParseTextField postprocessing (#386)
  • Fix SubwordVocab (#399)
  • Make NestedField GPU compatible and fix frequency saving (#409, #403)
  • Allow CSV reader params to be modified by the user (#432)
  • Use tqdm progressbar in downloads (#425)

@jekbradbury released this Apr 9, 2018 · 128 commits to master since this release

Release notes coming shortly.

Apr 9, 2018
bump version

@jekbradbury released this Dec 28, 2017 · 165 commits to master since this release

This is a minor release: it contains no breaking API changes, but it does add some new features.

We have always intended to support lazy datasets (specifically, those implemented as Python generators) but this version includes a bugfix that makes that support more useful. See a demo of it in action here.
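A generator-backed dataset of the kind described above can be sketched in a few lines. This is a hypothetical helper for illustration, not a torchtext class: nothing is read or processed until an example is requested.

```python
# Sketch of a lazily evaluated dataset implemented as a Python
# generator: examples are produced one at a time, never materialized
# up front.  Illustrative only, not the torchtext implementation.
def lazy_examples(lines):
    for line in lines:               # consumed lazily, one line at a time
        yield line.strip().split()   # tokenize on demand

stream = lazy_examples(iter(["the cat sat\n", "on the mat\n"]))
first = next(stream)                 # only now is the first line processed
```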


New datasets:

  • Added support for sequence tagging (e.g., NER/POS/chunking) datasets and wrapped the Universal Dependencies POS-tagged corpus (#157, thanks @sivareddyg!)


New features:

  • Added pad_first keyword argument to Field constructors, allowing left-padding in addition to right-padding (#161, thanks @GregorySenay!)
  • Support loading word vectors from local folder (#168, thanks @ahhegazy!)
  • Support using list (character tokenization) in ReversibleField (#188)
  • Added hooks for Sphinx/RTD documentation (#179, thanks @keon and @EntilZha!)
  • Added support for torchtext.__version__ (#179, thanks @keon!)
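The pad_first option above toggles between left- and right-padding, which can be sketched in plain Python (an illustration of the concept, not the Field implementation; pad is a hypothetical helper):

```python
# Sketch of left- vs right-padding, the behavior pad_first toggles.
def pad(tokens, length, pad_token="<pad>", pad_first=False):
    padding = [pad_token] * (length - len(tokens))
    # pad_first=True pads on the left; the default pads on the right.
    return padding + tokens if pad_first else tokens + padding

right_padded = pad(["hi"], 3)                  # -> ["hi", "<pad>", "<pad>"]
left_padded = pad(["hi"], 3, pad_first=True)   # -> ["<pad>", "<pad>", "hi"]
```

Left-padding is useful, for example, when a model reads sequences right-aligned so that the final token always sits in the last position.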


Bugfixes:

  • Fixed deprecated word vector usage in WT2 dataset (#166, thanks @keon!)
  • Fixed bug in word vector loading (#168, thanks @ahhegazy!)
  • Fixed bug in word vector aliases (#191, thanks @ryanleary!)
  • Fixed side effects of building a vocabulary (#193 + #181, thanks @donglixp!)
  • Fixed arithmetic mistake in language modeling dataset length calculation (#182, thanks @jihunchoi!)
  • Avoid materializing an otherwise-lazy dataset when using filter_pred (#194)
  • Fixed bug in raw float fields (#159)
  • Avoid providing a misleading len when using batch_size_fn (#192)
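The batch_size_fn issue above arises because batch "size" can be measured by a custom function rather than by example count, so a conventional len is not meaningful. A hedged sketch of the idea, with a hypothetical helper that batches by total token count rather than the torchtext implementation:

```python
# Sketch: group examples until a custom size measure (here, total token
# count) would be exceeded, the idea behind batch_size_fn.  With such
# batching, the number of batches is not a simple function of len(data),
# which is why a naive len would be misleading (#192).
def batches_by_tokens(examples, max_tokens):
    batch, n_tokens = [], 0
    for ex in examples:
        if batch and n_tokens + len(ex) > max_tokens:
            yield batch
            batch, n_tokens = [], 0
        batch.append(ex)
        n_tokens += len(ex)
    if batch:
        yield batch

data = [["a"], ["b", "c"], ["d", "e", "f"], ["g"]]
groups = list(batches_by_tokens(data, max_tokens=3))
# -> [[["a"], ["b", "c"]], [["d", "e", "f"]], [["g"]]]
```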

Breaking changes:

  • By default, examples are now sorted within a batch by decreasing sequence length (#95, #139). This is required for use of PyTorch PackedSequences, and it can be flexibly overridden with a Dataset constructor flag.
  • The unknown token is now included as part of specials and can be overridden or removed in the Field constructor (part of #107).

New features:

  • New word vector API with classes for GloVe and FastText; string descriptors are still accepted for backwards compatibility (#94, #102, #115, #120, thanks @nelson-liu and @bmccann!)
  • Reversible tokenization (#107). Introduces a new Field subclass, ReversibleField, with a .reverse method that detokenizes. All implementations of ReversibleField should guarantee that the tokenization+detokenization round-trip is idempotent; torchtext provides wrappers for the revtok tokenizer and subword segmenter that satisfy this property.
  • Skip header line in CSV/TSV loading (#146)
  • RawFields that represent any data type without processing (#147, thanks @kylegao91!)
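The round-trip guarantee that ReversibleField's .reverse method relies on can be shown with a toy tokenizer. Whitespace splitting stands in for the revtok tokenizer here; this is an illustration of the idempotence property, not the torchtext code.

```python
# Toy illustration of the tokenize/detokenize round trip that
# reversible tokenization is meant to guarantee.
def tokenize(text):
    return text.split(" ")

def reverse(tokens):
    return " ".join(tokens)

text = "the cat sat"
round_trip = reverse(tokenize(text))
# Idempotence: re-tokenizing and reversing again changes nothing.
twice = reverse(tokenize(reverse(tokenize(text))))
```

Real tokenizers must also preserve spacing and punctuation details, which is exactly what revtok's wrappers handle.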

New datasets:


Bugfixes:

  • Fix pretrained word vector loading (#99, thanks @matt-peters!)
  • Fix JSON loader silently ignoring requested columns not present in the file (#105, thanks @nelson-liu!)
  • Many fixes for Python 2, especially surrounding Unicode (#105, #112, #135, #153, thanks @nelson-liu!)
  • Fix behavior (#113, thanks @nelson-liu!)
  • Fix README example (#134, thanks @czhang99!)
  • Fix WikiText2 loader (#138)
  • Fix typo in MT loader (#142, thanks @sivareddyg!)
  • Fix Example.fromlist behavior on non-strings (#145)
  • Update test set URL for Multi30k (#149)
  • Fix SNLI data loader (#150, thanks @sivareddyg!)
  • Fix language modeling iterator (#151)
  • Remove transpose as a side effect of Field.reverse (#155)

@jekbradbury released this Aug 15, 2017 · 293 commits to master since this release

This release lets us develop v0.2 on master, with refactored and extended word vectors (minimally breaking) and revtok support (a reversible tokenizer with optional wordpieces; a major feature, but one that shouldn't break the API).
