Base-Inflection Encoding

This repository contains code for the paper "Mind Your Inflections! Improving NLP for Non-Standard Englishes with Base-Inflection Encoding" (EMNLP 2020).

Authors: Samson Tan, Shafiq Joty, Lav Varshney, and Min-Yen Kan
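As a quick illustration of the idea (not the library's actual implementation), BITE rewrites each inflected word as its base form followed by an inflection tag. The minimal sketch below uses a hypothetical toy lemma table; the real system derives bases and Penn Treebank-style tags automatically.

```python
# Toy sketch of Base-Inflection Encoding (BITE), for illustration only:
# each inflected word becomes its base form plus an inflection tag.
# This lemma table is a hypothetical stand-in for a real lemmatizer/tagger.
TOY_LEMMAS = {
    "was": ("be", "[VBD]"),
    "going": ("go", "[VBG]"),
    "engines": ("engine", "[NNS]"),
}

def bite_encode(tokens):
    """Replace inflected tokens with (base, tag) pairs; pass others through."""
    out = []
    for tok in tokens:
        if tok.lower() in TOY_LEMMAS:
            base, tag = TOY_LEMMAS[tok.lower()]
            out.extend([base, tag])
        else:
            out.append(tok)
    return out

print(bite_encode("I was going to the engine room !".split()))
# ['I', 'be', '[VBD]', 'go', '[VBG]', 'to', 'the', 'engine', 'room', '!']
```

Because the base form is preserved as a normal token, standard and non-standard inflection patterns map to the same base vocabulary, which is the point of the encoding.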


pip install git+


from bite import BITETokenizer

bite = BITETokenizer('moses')
print(bite.tokenize('I was going to the engine room!'))

We also include a script you can use to tokenize entire files; its parser arguments (--argument_name) will give you an idea of the options it supports.

If you are using HuggingFace's BERT model, you may want to use the BiteWordpieceTokenizer instead. This is the implementation we use in our BERT-based experiments.

Pretokenization modes

Three types of pretokenizers are supported out of the box:

  1. BertPreTokenizer (HuggingFace)
  2. Moses (sacremoses)
  3. Whitespace splitting
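To see why the pretokenizer choice matters, here is a rough comparison of mode 3 (whitespace splitting) against a naive punctuation-aware split. The second function is only a stand-in that mimics one aspect of Moses-style pretokenization; sacremoses handles far more cases.

```python
import re

def whitespace_pretokenize(text):
    # Mode 3: plain whitespace splitting; punctuation stays attached.
    return text.split()

def naive_punct_pretokenize(text):
    # Rough stand-in for Moses-style behavior: split punctuation into
    # separate tokens. Illustration only, not sacremoses itself.
    return re.findall(r"\w+|[^\w\s]", text)

sent = "I was going to the engine room!"
print(whitespace_pretokenize(sent))   # ends with 'room!'
print(naive_punct_pretokenize(sent))  # ends with 'room', '!'
```

Whether "room!" is one token or two changes what the downstream base/inflection analysis sees, so the pretokenizer should match the one used by the rest of your pipeline.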

Inflection symbols

Since subword tokenizers often operate on individual characters, running them on BITE-processed input with human readable inflection tags (e.g., [VBD]) would skew the character/subword statistics of the training corpus and occupy unnecessary slots in the subword vocabulary. Therefore, we recommend using single-character inflection symbols (by passing map_to_single_char=True to tokenize) when using BITE with such tokenizers.
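The effect described above can be sketched in a few lines. The specific single characters chosen here (Unicode private-use-area code points) are assumptions for illustration; the actual symbols produced by map_to_single_char=True may differ.

```python
# Illustration of single-character inflection symbols. The mapping below
# is hypothetical; BITE's map_to_single_char=True may use other symbols.
TAG_TO_CHAR = {"[VBD]": "\ue001", "[VBG]": "\ue002", "[NNS]": "\ue003"}

def shorten_tags(tokens):
    return [TAG_TO_CHAR.get(tok, tok) for tok in tokens]

encoded = ["go", "[VBG]", "engine", "[NNS]"]
short = shorten_tags(encoded)
# A character-level subword learner now sees one opaque symbol per tag
# instead of the five characters '[', 'V', 'B', 'G', ']'.
print(sum(len(t) for t in encoded))  # 18 characters
print(sum(len(t) for t in short))    # 10 characters
```

Keeping each tag to a single character stops the brackets and tag letters from inflating the character statistics the subword learner is trained on.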

Dialectal Data

The scripts for cleaning the CORAAL data and scraping the Colloquial Singapore English data can be found in paper_scripts. Please be considerate when scraping and do not flood the site's servers with requests :)


Please cite the following if you use the code in this repository:

@inproceedings{tan-etal-2020-mind,
    title = "Mind Your Inflections! {I}mproving {NLP} for Non-Standard {E}nglishes with {B}ase-{I}nflection {E}ncoding",
    author = "Tan, Samson  and
      Joty, Shafiq  and
      Varshney, Lav  and
      Kan, Min-Yen",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "",
    pages = "5647--5663",
}
