Byte Pair Encoding (BPE) is a compression technique that replaces commonly occurring substrings with atomic IDs, which are sometimes called pieces. For example, if the substring `at` occurs often in your training data, we can save space by not representing `at` as two integer IDs every time it appears, but instead allocating a single integer ID to the substring `at`. That way, we save one ID every time we encounter the substring, which saves space at the cost of a larger dictionary.
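As a toy illustration (not the sentencepiece implementation the script below relies on), a single BPE merge step can be sketched in plain Python: find the most frequent adjacent pair of symbols and fuse it into a new piece. The corpus and helper names here are assumptions made purely for illustration.

```python
from collections import Counter

# Toy sketch of one BPE merge step; the repository itself uses sentencepiece
# rather than this hand-rolled version.
corpus = ["c a t", "c a t s", "a t"]  # words as space-separated symbols

def most_frequent_pair(words):
    # Count how often each adjacent pair of symbols occurs.
    pairs = Counter()
    for word in words:
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs.most_common(1)[0][0]

def merge(words, pair):
    # Fuse the chosen pair into a single new piece everywhere it occurs.
    return [word.replace(" ".join(pair), "".join(pair)) for word in words]

pair = most_frequent_pair(corpus)   # ('a', 't') is the most frequent pair
corpus = merge(corpus, pair)        # ['c at', 'c at s', 'at']
print(pair, corpus)
```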
A BPE encoder can also be given a fixed vocabulary size, which speeds up training. For example, we can choose to represent only the 30,000 most frequent pieces.
Unlike a word-based encoder, which doesn't know what to do with unseen words, a BPE-based encoder can represent almost any word in the language on which it was trained.
For example, if we use a word-based encoder and have only seen `cat` during training, but not `cats`, our system would not know what to do with `cats`. A BPE-based system, however, can encode `cats` as `cat` + `s`.
The only exception is unknown characters: if the letter `x` never appears in your training data, the encoder will not be able to represent `catx` as `cat` + `x`.
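You can see this splitting behaviour directly with a pretrained model from the BPEmb package (the same package used further below). The word and the pieces shown in the comment are only an assumed example, since the exact split depends on the trained vocabulary.

```python
from bpemb import BPEmb

# Pretrained English subword model; vs is the vocabulary size (number of pieces).
bpemb_en = BPEmb(lang="en", vs=25000)

# Even a word never seen as a whole is split into known pieces,
# e.g. something like ['▁cat', 'ap', 'ults'] (the exact split depends on the model).
print(bpemb_en.encode("catapults"))
```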
BPE embeddings can be a useful replacement for regular word-based word2vec embeddings, especially in noisy domains. The script in this repository facilitates training BPE-based word embeddings on a corpus by training the BPE encoder and a word2vec model on the same set of texts. It requires the following packages:
- gensim
- sentencepiece
- numpy
The script assumes your corpus has been preprocessed beforehand. Things to consider:
- lowercasing
- removing punctuation
- removing numbers (setting them to 0)
Tokenization is not necessary.
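For illustration, a minimal preprocessing pass along these lines might look as follows; the output file name and the exact regular expressions are assumptions, not part of the script.

```python
import re

def preprocess(line: str) -> str:
    # Lowercase, strip punctuation, and map every digit to 0, as suggested above.
    line = line.lower()
    line = re.sub(r"[^\w\s]", "", line)  # remove punctuation
    line = re.sub(r"\d", "0", line)      # set numbers to 0
    return line

# Hypothetical file names: write a cleaned copy of the corpus.
with open("file.txt", encoding="utf-8") as infile, \
        open("file_clean.txt", "w", encoding="utf-8") as outfile:
    for line in infile:
        outfile.write(preprocess(line))
```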
Once you have a corpus, you can train a sentencepiece encoder and word2vec model like this:
```python
# `train` is the entry point provided by the script in this repository.
my_corpus = "file.txt"
num_pieces = 30000
spm_kwargs = {}
w2v_kwargs = {"sg": 1, "window": 15}

train(my_corpus,
      "spm_model_name",
      num_pieces,
      "w2v_path",
      spm_kwargs,
      w2v_kwargs)
```
The end result is a trained sentencepiece model and a trained word2vec model in .vec format; the word2vec vocabulary consists of the pieces produced by the sentencepiece model, so the two are aligned.
The spm model and embeddings can then be fed into BPEmb, as follows:
```python
from bpemb import BPEmb
from bpemb.util import sentencepiece_load, load_word2vec_file

# Start from a stock BPEmb object and swap in the newly trained models.
b = BPEmb(lang="en")
b.spm = sentencepiece_load("spm_model_name.model")
b.emb = load_word2vec_file("w2v_path")

# embed() returns one vector per piece in the encoded sentence.
s = b.embed("the dog flew over the fence")
print(s.shape)
```
License: MIT

Author: Stéphan Tulkens