# Generate word embeddings using Swivel

## Overview

In this notebook we show how to generate word embeddings based on the K-CAP corpus using the [Swivel algorithm](https://arxiv.org/pdf/1602.02215). In particular, we reuse the implementation included in the [TensorFlow models repo on GitHub](https://github.com/tensorflow/models/tree/master/research/swivel) (with some small modifications).

In [1]:
import os

## Learn embeddings

### Generate co-occurrence matrix using Swivel `prep`

Call `prep` using the `%run` magic command. In this case, we use the `kcap15-17.txt` corpus, which consists of plain text extracted from the papers accepted at K-CAP 2015 and 2017.

We set `shard_size` to 512 because the corpus is quite small. For larger corpora we could use the standard value of 4096.

In [2]:
coocs_path = 'corpora/kcap15-17/txt/coocs/'
%run -i swivel/prep --input='corpora/kcap15-17/kcap15-17.txt' --output={coocs_path} --shard_size=512


vocabulary contains 6656 tokens

writing shard 169/169
done!


The expected output is:

    vocabulary contains 6656 tokens

    writing shard 169/169
    done!
    
The `prep` step does the following:
  - It uses basic whitespace tokenization to get sequences of tokens.
  - In a first pass through the corpus, it counts all tokens and keeps only those that reach a minimum frequency (5) in the corpus. It then truncates this list so that its length is a multiple of the `shard_size`. The tokens that are kept form the **vocabulary** of size $v$.
  - In a second pass through the corpus, it uses a sliding window to count co-occurrences between the focus token and the context tokens (similar to `word2vec`). The result is a sparse co-occurrence matrix of size $v \times v$.
  - For easier storage and manipulation, Swivel uses *sharding* to split the co-occurrence matrix into sub-matrices of size $s \times s$, where $s$ is the `shard_size`.
  ![Swivel co-occurrence matrix sharding](img/swivel-sharding.PNG)
  - It stores the sharded co-occurrence submatrices as protobuf files.
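
The two passes described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the `prep` implementation: the function names `build_vocab` and `count_cooccurrences` are hypothetical, the window size is arbitrary, every co-occurrence is given a weight of 1, and tokens outside the vocabulary are simply dropped.

```python
from collections import Counter


def build_vocab(lines, min_count=5, shard_size=512):
    """First pass: count whitespace tokens and keep the frequent ones."""
    counts = Counter(tok for line in lines for tok in line.split())
    kept = [tok for tok, n in counts.most_common() if n >= min_count]
    # Truncate so that the vocabulary size is a multiple of shard_size.
    size = len(kept) - len(kept) % shard_size
    return kept[:size]


def count_cooccurrences(lines, vocab, window=2):
    """Second pass: count symmetric co-occurrences inside a sliding window."""
    index = {tok: i for i, tok in enumerate(vocab)}
    coocs = Counter()
    for line in lines:
        # Assumption for this sketch: out-of-vocabulary tokens are dropped.
        ids = [index[tok] for tok in line.split() if tok in index]
        for pos, focus in enumerate(ids):
            for context in ids[pos + 1:pos + 1 + window]:
                coocs[(focus, context)] += 1
                coocs[(context, focus)] += 1
    return coocs


# Toy corpus; the real corpus and parameters are much larger.
toy_corpus = ["a b a b"] * 3
toy_vocab = build_vocab(toy_corpus, min_count=1, shard_size=2)
toy_coocs = count_cooccurrences(toy_corpus, toy_vocab, window=1)
```

The counts are kept in a dictionary keyed by `(row, column)` pairs because the matrix is sparse: most word pairs never co-occur.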

### Learn embeddings from the co-occurrence matrix
With the sharded co-occurrence matrix it is now possible to learn embeddings:
 - the input is the folder with the co-occurrence matrix (protobuf files with the sparse matrix).
 - `submatrix_rows` and `submatrix_cols` must match the `shard_size` used in the `prep` step.
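
As a quick sanity check on these settings, the arithmetic below relates the vocabulary size to the number of shards. The helper `shard_grid` is a hypothetical name introduced only for illustration; the numbers are the ones reported by `prep` above.

```python
def shard_grid(vocab_size, submatrix_rows, submatrix_cols):
    """Return (row_blocks, col_blocks) for a vocab_size x vocab_size matrix."""
    if vocab_size % submatrix_rows or vocab_size % submatrix_cols:
        raise ValueError('vocabulary size must be a multiple of the submatrix size')
    return vocab_size // submatrix_rows, vocab_size // submatrix_cols


row_blocks, col_blocks = shard_grid(6656, 512, 512)
num_shards = row_blocks * col_blocks
print(num_shards)  # 169, matching "writing shard 169/169"
```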

In [3]:
vec_path = 'corpora/kcap15-17/txt/vec/'
%run -i swivel/swivel --input_base_path={coocs_path} \
    --output_base_path={vec_path} \
    --num_epochs=40 --dim=300 \
    --submatrix_rows=512 --submatrix_cols=512

INFO:tensorflow:Starting standard services.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Saving checkpoint to path corpora/kcap15-17/txt/vec/model.ckpt
INFO:tensorflow:Starting queue runners.
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:local_step=10 global_step=10 loss=40.6, 0.1% complete
INFO:tensorflow:local_step=20 global_step=20 loss=44.9, 0.3% complete
INFO:tensorflow:local_step=30 global_step=30 loss=41.9, 0.4% complete
INFO:tensorflow:local_step=40 global_step=40 loss=41.8, 0.6% complete
INFO:tensorflow:local_step=50 global_step=50 loss=40.7, 0.7% complete
INFO:tensorflow:local_step=60 global_step=60 loss=41.1, 0.9% complete
INFO:tensorflow:local_step=70 global_step=70 loss=43.2, 1.0% complete
INFO:tensorflow:local_step=80 global_step=80 loss=42.4, 1.2% complete
INFO:tensorflow:local_step=90 global_step=90 loss=38.7, 1.3% complete
INFO:tensorflow:local_step=100 global_step=100 loss=33.9, 1.5% complete
INFO:tensorflow:local_step=110 global_step=110 loss=37.

INFO:tensorflow:local_step=1110 global_step=1110 loss=25.3, 16.4% complete
INFO:tensorflow:local_step=1120 global_step=1120 loss=23.1, 16.6% complete
INFO:tensorflow:local_step=1130 global_step=1130 loss=22.5, 16.7% complete
INFO:tensorflow:local_step=1140 global_step=1140 loss=44.4, 16.9% complete
INFO:tensorflow:local_step=1150 global_step=1150 loss=25.9, 17.0% complete
INFO:tensorflow:local_step=1160 global_step=1160 loss=24.8, 17.2% complete
INFO:tensorflow:local_step=1170 global_step=1170 loss=26.1, 17.3% complete
INFO:tensorflow:local_step=1180 global_step=1180 loss=23.2, 17.5% complete
INFO:tensorflow:local_step=1190 global_step=1190 loss=27.6, 17.6% complete
INFO:tensorflow:local_step=1200 global_step=1200 loss=23.9, 17.8% complete
INFO:tensorflow:local_step=1210 global_step=1210 loss=22.9, 17.9% complete
INFO:tensorflow:local_step=1220 global_step=1220 loss=28.6, 18.0% complete
INFO:tensorflow:local_step=1230 global_step=1230 loss=23.8, 18.2% complete
INFO:tensorflow:local_ste

INFO:tensorflow:local_step=2200 global_step=2200 loss=22.9, 32.5% complete
INFO:tensorflow:local_step=2210 global_step=2210 loss=26.6, 32.7% complete
INFO:tensorflow:local_step=2220 global_step=2220 loss=24.8, 32.8% complete
INFO:tensorflow:local_step=2230 global_step=2230 loss=24.3, 33.0% complete
INFO:tensorflow:local_step=2240 global_step=2240 loss=21.9, 33.1% complete
INFO:tensorflow:local_step=2250 global_step=2250 loss=22.5, 33.3% complete
INFO:tensorflow:local_step=2260 global_step=2260 loss=24.3, 33.4% complete
INFO:tensorflow:local_step=2270 global_step=2270 loss=23.5, 33.6% complete
INFO:tensorflow:local_step=2280 global_step=2280 loss=25.8, 33.7% complete
INFO:tensorflow:local_step=2290 global_step=2290 loss=24.3, 33.9% complete
INFO:tensorflow:local_step=2300 global_step=2300 loss=25.0, 34.0% complete
INFO:tensorflow:local_step=2310 global_step=2310 loss=21.4, 34.2% complete
INFO:tensorflow:local_step=2320 global_step=2320 loss=21.1, 34.3% complete
INFO:tensorflow:local_ste

INFO:tensorflow:local_step=3300 global_step=3300 loss=26.5, 48.8% complete
INFO:tensorflow:local_step=3310 global_step=3310 loss=25.6, 49.0% complete
INFO:tensorflow:local_step=3320 global_step=3320 loss=26.2, 49.1% complete
INFO:tensorflow:local_step=3330 global_step=3330 loss=22.8, 49.3% complete
INFO:tensorflow:local_step=3340 global_step=3340 loss=23.2, 49.4% complete
INFO:tensorflow:local_step=3350 global_step=3350 loss=24.4, 49.6% complete
INFO:tensorflow:local_step=3360 global_step=3360 loss=27.4, 49.7% complete
INFO:tensorflow:local_step=3370 global_step=3370 loss=29.6, 49.9% complete
INFO:tensorflow:local_step=3380 global_step=3380 loss=34.1, 50.0% complete
INFO:tensorflow:local_step=3390 global_step=3390 loss=27.7, 50.1% complete
INFO:tensorflow:local_step=3400 global_step=3400 loss=69.2, 50.3% complete
INFO:tensorflow:local_step=3410 global_step=3410 loss=92.1, 50.4% complete
INFO:tensorflow:local_step=3420 global_step=3420 loss=28.8, 50.6% complete
INFO:tensorflow:local_ste

INFO:tensorflow:local_step=4390 global_step=4390 loss=23.5, 64.9% complete
INFO:tensorflow:local_step=4400 global_step=4400 loss=90.3, 65.1% complete
INFO:tensorflow:local_step=4410 global_step=4410 loss=29.2, 65.2% complete
INFO:tensorflow:local_step=4420 global_step=4420 loss=22.3, 65.4% complete
INFO:tensorflow:local_step=4430 global_step=4430 loss=24.4, 65.5% complete
INFO:tensorflow:local_step=4440 global_step=4440 loss=23.2, 65.7% complete
INFO:tensorflow:local_step=4450 global_step=4450 loss=25.2, 65.8% complete
INFO:tensorflow:local_step=4460 global_step=4460 loss=25.7, 66.0% complete
INFO:tensorflow:local_step=4470 global_step=4470 loss=22.1, 66.1% complete
INFO:tensorflow:local_step=4480 global_step=4480 loss=25.8, 66.3% complete
INFO:tensorflow:local_step=4490 global_step=4490 loss=21.9, 66.4% complete
INFO:tensorflow:local_step=4500 global_step=4500 loss=68.8, 66.6% complete
INFO:tensorflow:local_step=4510 global_step=4510 loss=28.9, 66.7% complete
INFO:tensorflow:local_ste

INFO:tensorflow:local_step=5490 global_step=5490 loss=22.5, 81.2% complete
INFO:tensorflow:local_step=5500 global_step=5500 loss=22.3, 81.4% complete
INFO:tensorflow:local_step=5510 global_step=5510 loss=24.8, 81.5% complete
INFO:tensorflow:local_step=5520 global_step=5520 loss=24.7, 81.7% complete
INFO:tensorflow:local_step=5530 global_step=5530 loss=26.0, 81.8% complete
INFO:tensorflow:local_step=5540 global_step=5540 loss=24.4, 82.0% complete
INFO:tensorflow:local_step=5550 global_step=5550 loss=22.2, 82.1% complete
INFO:tensorflow:local_step=5560 global_step=5560 loss=26.2, 82.2% complete
INFO:tensorflow:local_step=5570 global_step=5570 loss=21.9, 82.4% complete
INFO:tensorflow:local_step=5580 global_step=5580 loss=27.3, 82.5% complete
INFO:tensorflow:local_step=5590 global_step=5590 loss=24.5, 82.7% complete
INFO:tensorflow:local_step=5600 global_step=5600 loss=27.5, 82.8% complete
INFO:tensorflow:local_step=5610 global_step=5610 loss=29.8, 83.0% complete
INFO:tensorflow:local_ste

INFO:tensorflow:local_step=6580 global_step=6580 loss=28.2, 97.3% complete
INFO:tensorflow:local_step=6590 global_step=6590 loss=188.1, 97.5% complete
INFO:tensorflow:local_step=6600 global_step=6600 loss=22.9, 97.6% complete
INFO:tensorflow:local_step=6610 global_step=6610 loss=23.7, 97.8% complete
INFO:tensorflow:local_step=6620 global_step=6620 loss=23.7, 97.9% complete
INFO:tensorflow:local_step=6630 global_step=6630 loss=27.0, 98.1% complete
INFO:tensorflow:local_step=6640 global_step=6640 loss=27.0, 98.2% complete
INFO:tensorflow:local_step=6650 global_step=6650 loss=25.7, 98.4% complete
INFO:tensorflow:local_step=6660 global_step=6660 loss=20.3, 98.5% complete
INFO:tensorflow:local_step=6670 global_step=6670 loss=31.3, 98.7% complete
INFO:tensorflow:local_step=6680 global_step=6680 loss=23.3, 98.8% complete
INFO:tensorflow:local_step=6690 global_step=6690 loss=22.4, 99.0% complete
INFO:tensorflow:local_step=6700 global_step=6700 loss=58.4, 99.1% complete
INFO:tensorflow:local_st

This should take a few minutes, depending on your machine.
The result is a list of files in the specified output folder, including:
 - checkpoints of the model
 - `tsv` files for the column and row embeddings.

In [4]:
os.listdir(vec_path)

['checkpoint',
 'col_embedding.tsv',
 'events.out.tfevents.1512241137.9f7852dbefc0',
 'graph.pbtxt',
 'model.ckpt-0.data-00000-of-00001',
 'model.ckpt-0.index',
 'model.ckpt-0.meta',
 'model.ckpt-6760.data-00000-of-00001',
 'model.ckpt-6760.index',
 'model.ckpt-6760.meta',
 'row_embedding.tsv']
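
The step number in the final checkpoint name is consistent with one training step per shard per epoch. The short calculation below only checks the numbers from this run; it is an observation about the log, not a description of how the training script schedules its work, and the names are chosen for illustration.

```python
num_epochs = 40   # value passed as --num_epochs
num_shards = 169  # reported by prep
total_steps = num_epochs * num_shards


def percent_complete(step):
    """Progress of a given global step, as a percentage of total_steps."""
    return 100.0 * step / total_steps


print(total_steps)                        # 6760, as in model.ckpt-6760
print('%.1f%%' % percent_complete(3380))  # 50.0%
```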

### Convert `tsv` files to `bin` file
The `tsv` files are easy to inspect, but they take up a lot of space and are slow to load, because every value has to be parsed as a float and packed into a vector. Swivel offers a utility that converts the `tsv` files into a binary format. At the same time, it combines the column and row embeddings into a single space by adding the two vectors for each word in the vocabulary.
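
The merge can be pictured with the following sketch. It is not the `text2bin` code: the function name `merge_embeddings` is hypothetical, and it assumes that each `tsv` line holds a token followed by tab-separated float values and that the binary file stores 32-bit floats row by row. Both assumptions should be checked against the actual files.

```python
import io
import struct


def merge_embeddings(row_tsv, col_tsv):
    """Add row and column vectors line by line; return (tokens, packed bytes)."""
    tokens, chunks = [], []
    for row_line, col_line in zip(row_tsv, col_tsv):
        row_parts = row_line.rstrip('\n').split('\t')
        col_parts = col_line.rstrip('\n').split('\t')
        # Assumption: both files list the tokens in the same order.
        if row_parts[0] != col_parts[0]:
            raise ValueError('token mismatch: %s vs %s' % (row_parts[0], col_parts[0]))
        vec = [float(r) + float(c) for r, c in zip(row_parts[1:], col_parts[1:])]
        tokens.append(row_parts[0])
        chunks.append(struct.pack('%df' % len(vec), *vec))
    return tokens, b''.join(chunks)


# Two toy "files" with two tokens and two dimensions each.
rows = io.StringIO('knowledge\t0.5\t1.0\nbase\t0.25\t0.0\n')
cols = io.StringIO('knowledge\t1.5\t1.0\nbase\t0.25\t2.0\n')
tokens, packed = merge_embeddings(rows, cols)
merged = struct.unpack('4f', packed)
```

In this toy case the packed result takes 16 bytes (4 values of 4 bytes each), while the text representation needs a variable number of characters per value plus parsing at load time.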

In [6]:
%run -i swivel/text2bin --vocab={vec_path}vocab.txt --output={vec_path}vecs.bin \
        {vec_path}row_embedding.tsv \
        {vec_path}col_embedding.tsv

executing text2bin
merging files ['corpora/kcap15-17/txt/vec/row_embedding.tsv', 'corpora/kcap15-17/txt/vec/col_embedding.tsv'] into output bin


This adds `vocab.txt` and `vecs.bin` to the folder with the vectors:

In [7]:
os.listdir(vec_path)

['checkpoint',
 'col_embedding.tsv',
 'events.out.tfevents.1512241137.9f7852dbefc0',
 'graph.pbtxt',
 'model.ckpt-0.data-00000-of-00001',
 'model.ckpt-0.index',
 'model.ckpt-0.meta',
 'model.ckpt-6760.data-00000-of-00001',
 'model.ckpt-6760.index',
 'model.ckpt-6760.meta',
 'row_embedding.tsv',
 'vecs.bin',
 'vocab.txt']

## Read stored binary embeddings and inspect them

Swivel provides the `vecs` library, which implements the basic `Vecs` class. It accepts a `vocab_file` and a file with the binary serialization of the vectors (`vecs.bin`).
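
To make the role of the two files concrete, here is a minimal sketch of a cosine-similarity neighbor lookup over such a pair of files. It does not reproduce the `Vecs` implementation: `load_vectors` and `nearest` are hypothetical names, and the layout (32-bit floats stored row by row, one row per vocabulary line) is an assumption.

```python
import math
import struct


def load_vectors(vocab_lines, packed):
    """Pair each token with its unit-normalized vector."""
    tokens = [line.strip() for line in vocab_lines]
    dim = len(packed) // (4 * len(tokens))  # 4 bytes per 32-bit float
    vectors = {}
    for i, token in enumerate(tokens):
        vec = struct.unpack_from('%df' % dim, packed, i * dim * 4)
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors[token] = [x / norm for x in vec]
    return vectors


def nearest(vectors, word, k=10):
    """Return the k most similar tokens by cosine similarity."""
    if word not in vectors:
        return []
    query = vectors[word]
    # On unit vectors, the dot product equals the cosine similarity.
    sims = [(sum(a * b for a, b in zip(query, vec)), token)
            for token, vec in vectors.items()]
    sims.sort(reverse=True)
    return [(token, sim) for sim, token in sims[:k]]


# Toy data: three tokens with two dimensions each.
demo_vocab = ['knowledge', 'base', 'geometry']
demo_bin = struct.pack('6f', 1.0, 0.0, 1.0, 1.0, 0.0, 1.0)
demo_vectors = load_vectors(demo_vocab, demo_bin)
demo_neighbors = nearest(demo_vectors, 'knowledge', k=2)
```

Because the query word is compared with every vector, including its own, it comes back as its own closest neighbor with a similarity of 1.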

In [8]:
from swivel import vecs

...and we can load existing vectors. Here we load some pre-computed embeddings, but feel free to use the embeddings you computed by following the steps above (although, due to the random initialization of weights during the training step, your results may be different).

In [9]:
precomp_vec_path = 'corpora/kcap15-17/precomp-txt/vec/'
vecs = vecs.Vecs(precomp_vec_path + 'vocab.txt', 
            precomp_vec_path + 'vecs.bin')

Opening vector with expected size 5632 from file corpora/kcap15-17/precomp-txt/vec/vocab.txt
vocab size 5632 (unique 5632)
read rows


Next, let's define a basic method for printing the `k` nearest neighbors for a given word:

In [10]:
def k_neighbors(word, k=10):
    res = vecs.neighbors(word)
    if not res:
        print('%s is not in the vocabulary, try e.g. %s' % (word, vecs.random_word_in_vocab()))
    else:
        # Print the k most similar words with their similarity scores.
        for neighbor, sim in res[:k]:
            print('%0.4f: %s' % (sim, neighbor))

And let's use the method on a few words:

In [11]:
k_neighbors('knowledge')

1.0000: knowledge
0.5351: bases
0.4386: base
0.4098: base,
0.3995: domain-specific
0.3576: acquisition
0.3541: hierarchical
0.3365: base.
0.3312: commonsense
0.3301: integrating


In [12]:
k_neighbors('capture')

1.0000: capture
0.3359: positional
0.3327: geometry
0.3318: bases
0.3150: knowledge
0.2915: compounds
0.2875: rules,
0.2720: rules
0.2649: First,
0.2625: notions


In [13]:
k_neighbors('embedding')

1.0000: embedding
0.4480: single-modal
0.4080: paragraph
0.4040: multi-modal
0.3930: space,
0.3719: word2vec
0.3372: embeddings
0.3331: learned
0.3269: space
0.3266: dimensional


In [14]:
k_neighbors('ontology')

1.0000: ontology
0.3774: design
0.3317: first-order
0.3292: zooming
0.3290: w.r.t.
0.3161: standard.
0.3158: ontologies,
0.3145: FO
0.2987: OntoSoft
0.2959: visualizations.


## Conclusion

In this notebook, we used Swivel to generate word embeddings and explored them by inspecting the nearest neighbors of a few words.