Even though vector representations of words are an important concept in natural language processing and have remained highly popular since the introduction of word2vec, existing formats for storing them still suffer from several drawbacks. This library addresses two of them: large file size and slow loading into memory.
To reduce storage requirements, we use aggressive quantization followed by Huffman encoding. The implementation mostly follows the paper Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. Experiments show that quality loss remains negligible even for models compressed by a factor of 6.
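To illustrate the idea (a minimal sketch, not the library's actual implementation; the toy matrix and the quantize helper below are made up for demonstration): trained quantization clusters the weights into 2**bits shared centroids with k-means, so each weight is stored as a small centroid index, and the index stream is then Huffman-coded so that frequently used centroids get shorter codes.

import numpy as np

def quantize(weights, bits=6, iters=20):
    # Simple 1-D k-means: learn 2**bits shared centroids for all weights
    flat = weights.ravel()
    codebook = np.linspace(flat.min(), flat.max(), 2 ** bits)
    for _ in range(iters):
        # assign every weight to its nearest centroid
        codes = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
        # move each centroid to the mean of the weights assigned to it
        for c in range(len(codebook)):
            if np.any(codes == c):
                codebook[c] = flat[codes == c].mean()
    codes = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
    return codes.reshape(weights.shape).astype(np.uint8), codebook

weights = np.random.randn(100, 300).astype(np.float32)  # toy embedding matrix
codes, codebook = quantize(weights, bits=6)
reconstructed = codebook[codes]  # dequantization is just a table lookup
print(np.abs(weights - reconstructed).mean())  # small reconstruction error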
To overcome the second problem, we use a single mmap call during Reader construction for disk input. This allows us to:
- Reduce initial load time to several milliseconds
- Make all data loads lazy: only the vectors requested by the user are read
- Keep data in physical memory even after a process restart. On the other hand, when the machine is low on free memory, the data will be evicted by the operating system without any explicit deletion or garbage collection calls.
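As a rough illustration of the approach (a minimal sketch, not the library's actual file format; the file name, dimensionality, and layout are assumed), mapping the file once makes the start-up cost independent of the model size, and a vector only touches the disk when its slice is actually read:

import mmap
import numpy as np

DIM = 300  # assumed vector dimensionality
ITEM_SIZE = np.dtype(np.float32).itemsize

with open('vectors.bin', 'rb') as f:  # hypothetical flat float32 matrix on disk
    buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

def vector(index):
    # only the pages backing this slice are faulted in from disk or page cache
    offset = index * DIM * ITEM_SIZE
    return np.frombuffer(buf, dtype=np.float32, count=DIM, offset=offset)

print(vector(42))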
We present results for the tayga_upos_skipgram_300_2_2019 model in this section. We also observed similar behavior for the ruscorpora_upos_cbow_300_20_2019 model. You can see that
- A compressed model using 6 bits per weight is 6 times smaller than the full-precision one and is indistinguishable from it in terms of quality
- The correlation between hand-labeled and predicted word similarity is much less sensitive to quantization noise than analogy task accuracy. Even models that use 2 bits per weight show correlation just as high as full-precision models
- Models compressed to 1 bit per weight still maintain significant predictive power
Experiments involving training neural networks with compressed embeddings show that 4 bits of precision are sufficient to match full-precision models.
- Download and install wheels from the releases page
- Obtain compressed embedding files:
  - Download pretrained vectors from https://drive.google.com/open?id=1g2Bv-5qscDRdnD4UjE6TAB2uaNdz2JdU. GloVe 840B, fastText crawl (English), tayga_upos_skipgram_300_2_2019 and ruscorpora_upos_cbow_300_20_2019 (Russian, from the RusVectores project) are available at the moment.
  - Or use the memb_converter tool to convert embeddings from the word2vec text format. It is recommended to pass --quantization trained --bits-per-weight 6 to the script as parameters and leave --max-words empty.
- Now you can create a Reader object:
from memb import Reader
reader = Reader('glove.840B.300d.4bit.bin')
- Find out the vector dimensionality:
print(reader.dim)
- Obtain the embedding for a single word, which returns an array of shape (reader.dim,):
print(reader['the'])
- Or embeddings for a list of words, which is a two-dimensional array of shape (n_words, reader.dim):
print(reader[['a', 'the', 'of']])
ReadersUnion allows combining several embedding sources into a single one. Use mode='average' or mode='concatenate' to choose the respective merging strategy:
from memb import Reader, ReadersUnion
filenames = ['glove.840B.300d.4bit.bin', 'fasttext-crawl-300d-2M.4bit.bin']
readers = [Reader(filename) for filename in filenames]
union = ReadersUnion(readers, mode='concatenate')
print(union['the'])
The tokenizer_embedding method creates an array suitable for use as Embedding layer weights from a keras.preprocessing.text.Tokenizer:
from memb import Reader
from keras.preprocessing.text import Tokenizer
from keras.layers import Embedding
texts = ['First sentence', 'Second sentence']
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
reader = Reader('glove.840B.300d.4bit.bin')
embedding_matrix = reader.tokenizer_embedding(tokenizer)
embedding_layer = Embedding(
    *embedding_matrix.shape,
    weights=[embedding_matrix],
    trainable=False)
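The frozen layer can then be used like any other Keras layer, for example (a generic Keras sketch, not part of memb):

from keras.models import Sequential
from keras.layers import GlobalAveragePooling1D, Dense

model = Sequential([
    embedding_layer,                 # frozen weights taken from the reader
    GlobalAveragePooling1D(),        # average word vectors over the sequence
    Dense(1, activation='sigmoid'),  # e.g. a binary classification head
])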
Use the to_keyed_vectors method to export the model to gensim's KeyedVectors. Note: you must install gensim to use it.
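For example (a minimal sketch; we assume here that to_keyed_vectors takes no arguments):

from memb import Reader

reader = Reader('glove.840B.300d.4bit.bin')
keyed_vectors = reader.to_keyed_vectors()  # gensim KeyedVectors object
print(keyed_vectors.most_similar('the', topn=5))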
To build the library from source, you'll need
- CMake
- Compiler with C++14 support
- Additional libraries
  - Boost
  - FlatBuffers
You can follow the steps described in the dependency installation scripts for Linux and macOS.