# Attempting to combine Vocab with Vectors for POS Tagger training

This notebook shows that training the POS tagger on a combined vocab and vectors is no more effective than training it on the vectors alone.
Training the POS tagger without a custom vocab has already been added to DVC; this notebook is only a demonstration and will not become part of any pipeline.

In [28]:
# make sure to do this first:

In [None]:
!cd .. && dvc pull && dvc repro vocab-model.dvc && dvc repro init-model-vectors-300.dvc

In [29]:
import spacy  # make sure it is our own master, not the latest spaCy from pip

In [30]:
!pwd

/home/kk385830/spacy-pl-utils/analysis


In [32]:
Polish = spacy.util.get_lang_class('pl')
vocab = spacy.vocab.Vocab().from_disk('../training/vocab/vocab/')
vectors = spacy.vectors.Vectors().from_disk('../training/vectors-300/vocab/')

### Inspect differences between vocab lexemes and vector keys

In [33]:
vocab.length

146089

In [34]:
vectors.shape

(143478, 300)

In [35]:
%%time
missing_keys = {key for key in vectors.keys() if key not in vocab}
print(missing_keys)

{4918752717281726814, 4347571224981593329, 14399288562721897527, 1591739283372538942}
CPU times: user 40 ms, sys: 0 ns, total: 40 ms
Wall time: 37.5 ms


In [36]:
%%time
vk = set(vectors.keys())
missing_keys_2 = {lex.text for lex in vocab if lex.orth not in vk}
print(len(missing_keys_2))
print(list(missing_keys_2)[:10])

2615
['A.E.', 'T. T.', 'R. A.', 'x.', 'R.U.', 'Z.Z.', 'O. Ź.', 'O. W.', 'K. L.', 'W.Ś.']
CPU times: user 136 ms, sys: 0 ns, total: 136 ms
Wall time: 136 ms
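
The two cells above are two one-sided set differences between the vector keys and the vocab lexeme ids. As a minimal sketch with made-up toy keys (the integers below are hypothetical, not real spaCy hashes), the same bookkeeping also gives a quick consistency check: the overlap must be the same whichever side it is computed from. The second half repeats the check on the counts printed in this notebook, assuming every vector row has exactly one key.

```python
# Toy stand-ins for the real key sets (hypothetical values, not spaCy hashes).
vocab_keys = {1, 2, 3, 4, 5, 6}
vector_keys = {4, 5, 6, 7}

only_in_vectors = vector_keys - vocab_keys   # analogue of missing_keys
only_in_vocab = vocab_keys - vector_keys     # analogue of missing_keys_2
shared = vocab_keys & vector_keys

# The overlap must be the same whichever side it is computed from.
assert len(vector_keys) - len(only_in_vectors) == len(shared)
assert len(vocab_keys) - len(only_in_vocab) == len(shared)

# Same check on the counts reported in this notebook
# (assumes every vector row has exactly one key).
n_vocab, n_vectors = 146089, 143478
n_only_in_vectors, n_only_in_vocab = 4, 2615
shared_from_vectors = n_vectors - n_only_in_vectors
shared_from_vocab = n_vocab - n_only_in_vocab
print(shared_from_vectors, shared_from_vocab)  # 143474 143474
```

Both sides agree on 143474 shared keys, so the two missing-key computations are at least consistent with each other.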


### Adding vectors to their corresponding lexemes
Note that `missing_keys` contains the vector keys that have no matching lexeme in the vocab, so their vectors cannot be attached to any lexeme.

In [37]:
%%time
for key, vector in vectors.items():
    try:
        vocab.set_vector(key, vector)
    except Exception:
        # set_vector should only fail for keys we already know are missing;
        # if this assertion fails, missing_keys was computed incorrectly
        assert key in missing_keys

CPU times: user 840 ms, sys: 1min 22s, total: 1min 23s
Wall time: 1min 23s


In [38]:
vocab.length

146089

In [39]:
vocab.vectors_length  # should be 300

300

##### Alternative approach
The spaCy CLI `train` command uses `util.load_model`, which does not match the vocab to the vectors by itself (at least not in our case).

Possible reasons for this behavior:
- either the vocab or the vectors are malformed, which confuses the loader
- the matching strategy implemented above is wrong: the vocab and the vectors represent completely different sets of words, even though their ids/hashes match (at least in most cases)
- the vocab cannot hold the additional information that is present, and that causes the matches to fail
- passing `vocab` to `util.load_model` to attach a vocab to the vectors is broken, and we should report it as an issue

We need to inspect further why this is the case; see:
https://github.com/explosion/spaCy/blob/021d04069a9c49bc29c6f4415e3d2b4b7a02012e/spacy/cli/train.py#L93

(as well as the code below):

In [40]:
model2 = spacy.util.load_model('../training/vectors-300/', vocab=vocab)

In [41]:
model2.vocab.length  # nearly 2x what it should be

289760

Note that the vocab is duplicated, as if the vectors had been generated for a different vocab than the one provided.
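
A quick arithmetic check supports the reading that the two key sets ended up side by side rather than merged. This sketch uses only the counts printed in this notebook; the interpretation in the comments is an assumption and has not been confirmed against the spaCy source.

```python
n_vocab = 146089      # vocab.length before load_model
n_vectors = 143478    # rows in vectors.shape
n_shared = 143474     # vector keys found in the vocab (143478 - 4 missing keys)
n_loaded = 289760     # model2.vocab.length after load_model

# If load_model had matched vectors to existing lexemes, the vocab would
# only grow by the vector keys that were not already present.
expected_if_merged = n_vocab + (n_vectors - n_shared)
# If no key had been matched at all, the two sizes would simply add up.
expected_if_disjoint = n_vocab + n_vectors

print(expected_if_merged)               # 146093
print(expected_if_disjoint)             # 289567
print(n_loaded - expected_if_disjoint)  # 193
```

The observed size is within 193 entries of the fully disjoint case and far from the merged case. This check does not explain where the extra 193 entries come from.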

In [13]:
model2.vocab.vectors_length

300

### Load the vocab and train the Polish POS tagger

In [14]:
nlp = Polish(vocab=vocab)

In [15]:
nlp.pipeline  # should be empty so that we can add the pos tagger

[]

In [16]:
tagger = spacy.pipeline.Tagger(nlp.vocab)

In [17]:
# TODO: Add tag map to tagger !!!
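
For context on the TODO above: in spaCy 2.x a tag map is a dictionary from fine-grained tags to attribute dictionaries that include the coarse-grained POS. The sketch below only illustrates that shape in plain Python. The tag names, the mapping, and the `coarse_pos` helper are hypothetical examples, not the tag map used in this project, and nothing here calls spaCy.

```python
# Hypothetical tag map: fine-grained tag -> attributes with a coarse POS.
# The entries are illustrative only and would need to match the tagset
# actually used in the training data.
EXAMPLE_TAG_MAP = {
    'subst': {'pos': 'NOUN'},
    'adj': {'pos': 'ADJ'},
    'fin': {'pos': 'VERB'},
    'prep': {'pos': 'ADP'},
    'interp': {'pos': 'PUNCT'},
}


def coarse_pos(tag, tag_map, default='X'):
    """Return the coarse POS for a fine-grained tag, or `default` if unmapped."""
    return tag_map.get(tag, {}).get('pos', default)


print(coarse_pos('subst', EXAMPLE_TAG_MAP))
print(coarse_pos('unknown-tag', EXAMPLE_TAG_MAP))
```

Falling back to a default for unmapped tags is a design choice of this sketch; an alternative is to raise an error so that gaps in the map are noticed early.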

In [18]:
nlp.add_pipe(tagger)

##### Begin training the tagger
We cannot use `train` from the spaCy CLI because it does not load our vocab correctly. The code below is a copy of it that uses our already loaded language, which contains the vocab as well as the vectors.

In [19]:
import random
import json
from spacy.util import minibatch, compounding
from spacy import util

In [20]:
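from tqdm import tqdm

import plac
from pathlib import Path
from thinc.neural._classes.model import Model
from timeit import default_timer as timer

from spacy.attrs import PROB, IS_OOV, CLUSTER, LANG
# this minibatch replaces the one imported from spacy.util in the previous cell
from spacy.gold import GoldCorpus, minibatch
from spacy.util import prints
from spacy import util
from spacy import about
from spacy import displacy
from spacy.compat import json_dumps
# Messages is used by the path checks in the training cell. Assumption: in the
# spaCy 2.0.x sources it lives in spacy.cli._messages; adjust for your checkout.
from spacy.cli._messages import Messages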

In [21]:
train_path = '../data/training/pos-train.json'
n_iter = 10

In [22]:
with open(train_path, 'r') as f:
    train_data = json.load(f)
len(train_data)

1941

In [23]:
lang = 'pl'
output_dir = '../training/pos-tagger-poc-2/'
train_data = '../data/training/pos-train.json'  # a path; rebinds the JSON list loaded above
dev_data = '../data/training/pos-validation.json'
n_iter = 10
n_sents = None
use_gpu = 3
# vectors is not set here: the vectors were attached to the vocab manually above
no_tagger = False
no_parser = True
no_entities = True
gold_preproc = False
meta_path = None
verbose = False  # DO NOT USE THIS
version = '0.0.0'

In [24]:
def _render_parses(i, to_render):
    to_render[0].user_data['title'] = "Batch %d" % i
    with Path('/tmp/entities.html').open('w') as file_:
        html = displacy.render(to_render[:5], style='ent', page=True)
        file_.write(html)
    with Path('/tmp/parses.html').open('w') as file_:
        html = displacy.render(to_render[:5], style='dep', page=True)
        file_.write(html)


def print_progress(itn, losses, dev_scores, cpu_wps=0.0, gpu_wps=0.0):
    scores = {}
    for col in ['dep_loss', 'tag_loss', 'uas', 'tags_acc', 'token_acc',
                'ents_p', 'ents_r', 'ents_f', 'cpu_wps', 'gpu_wps']:
        scores[col] = 0.0
    scores['dep_loss'] = losses.get('parser', 0.0)
    scores['ner_loss'] = losses.get('ner', 0.0)
    scores['tag_loss'] = losses.get('tagger', 0.0)
    scores.update(dev_scores)
    scores['cpu_wps'] = cpu_wps
    scores['gpu_wps'] = gpu_wps or 0.0
    tpl = ''.join((
        '{:<6d}',
        '{dep_loss:<10.3f}',
        '{ner_loss:<10.3f}',
        '{uas:<8.3f}',
        '{ents_p:<8.3f}',
        '{ents_r:<8.3f}',
        '{ents_f:<8.3f}',
        '{tags_acc:<8.3f}',
        '{token_acc:<9.3f}',
        '{cpu_wps:<9.1f}',
        '{gpu_wps:.1f}',
    ))
    print(tpl.format(itn, **scores))


def print_results(scorer):
    results = {
        'TOK': '%.2f' % scorer.token_acc,
        'POS': '%.2f' % scorer.tags_acc,
        'UAS': '%.2f' % scorer.uas,
        'LAS': '%.2f' % scorer.las,
        'NER P': '%.2f' % scorer.ents_p,
        'NER R': '%.2f' % scorer.ents_r,
        'NER F': '%.2f' % scorer.ents_f}
    util.print_table(results, title="Results")

In [26]:
# copied from spacy.train
util.fix_random_seed()
util.set_env_log(True)
n_sents = None
output_path = util.ensure_path(output_dir)
train_path = util.ensure_path(train_data)
dev_path = util.ensure_path(dev_data)
meta_path = util.ensure_path(meta_path)
if not output_path.exists():
    output_path.mkdir()
if not train_path.exists():
    prints(train_path, title=Messages.M050, exits=1)
if dev_path and not dev_path.exists():
    prints(dev_path, title=Messages.M051, exits=1)
if meta_path is not None and not meta_path.exists():
    prints(meta_path, title=Messages.M020, exits=1)
meta = util.read_json(meta_path) if meta_path else {}
if not isinstance(meta, dict):
    prints(Messages.M053.format(meta_type=type(meta)),
           title=Messages.M052, exits=1)
meta.setdefault('lang', lang)
meta.setdefault('name', 'unnamed')

pipeline = ['tagger', 'parser', 'ner']
if no_tagger and 'tagger' in pipeline:
    pipeline.remove('tagger')
if no_parser and 'parser' in pipeline:
    pipeline.remove('parser')
if no_entities and 'ner' in pipeline:
    pipeline.remove('ner')

# Take dropout and batch size as generators of values -- dropout
# starts high and decays sharply, to force the optimizer to explore.
# Batch size starts at 1 and grows, so that we make updates quickly
# at the beginning of training.
dropout_rates = util.decaying(util.env_opt('dropout_from', 0.2),
                              util.env_opt('dropout_to', 0.2),
                              util.env_opt('dropout_decay', 0.0))
batch_sizes = util.compounding(util.env_opt('batch_from', 1),
                               util.env_opt('batch_to', 16),
                               util.env_opt('batch_compound', 1.001))
max_doc_len = util.env_opt('max_doc_len', 5000)
corpus = GoldCorpus(train_path, dev_path, limit=n_sents)
n_train_words = corpus.count_train()

# lang_class = util.get_lang_class(lang)
# nlp = lang_class()
meta['pipeline'] = pipeline
nlp.meta.update(meta)
# if vectors:
#     print("Load vectors model", vectors)
#     util.load_model(vectors, vocab=nlp.vocab)
#     for lex in nlp.vocab:
#         values = {}
#         for attr, func in nlp.vocab.lex_attr_getters.items():
#             # These attrs are expected to be set by data. Others should
#             # be set by calling the language functions.
#             if attr not in (CLUSTER, PROB, IS_OOV, LANG):
#                 values[lex.vocab.strings[attr]] = func(lex.orth_)
#         lex.set_attrs(**values)
#         lex.is_oov = False
# for name in pipeline:
#     nlp.add_pipe(nlp.create_pipe(name), name=name)
# if parser_multitasks:
#     for objective in parser_multitasks.split(','):
#         nlp.parser.add_multitask_objective(objective)
# if entity_multitasks:
#     for objective in entity_multitasks.split(','):
#         nlp.entity.add_multitask_objective(objective)


optimizer = nlp.begin_training(lambda: corpus.train_tuples, device=use_gpu)
nlp._optimizer = None

print("Itn.  Dep Loss  NER Loss  UAS     NER P.  NER R.  NER F.  Tag %   Token %  CPU WPS  GPU WPS")
try:
    train_docs = corpus.train_docs(nlp, projectivize=True, noise_level=0.0,
                                   gold_preproc=gold_preproc, max_length=0)
    train_docs = list(train_docs)
    for i in range(n_iter):
        with tqdm(total=n_train_words, leave=False) as pbar:
            losses = {}
            for batch in minibatch(train_docs, size=batch_sizes):
                batch = [(d, g) for (d, g) in batch if len(d) < max_doc_len]
                if not batch:
                    continue
                docs, golds = zip(*batch)
                nlp.update(docs, golds, sgd=optimizer,
                           drop=next(dropout_rates), losses=losses)
                pbar.update(sum(len(doc) for doc in docs))

        with nlp.use_params(optimizer.averages):
            util.set_env_log(False)
            epoch_model_path = output_path / ('model%d' % i)
            nlp.to_disk(epoch_model_path)
            nlp_loaded = util.load_model_from_path(epoch_model_path)
            dev_docs = list(corpus.dev_docs(
                            nlp_loaded,
                            gold_preproc=gold_preproc))
            nwords = sum(len(doc_gold[0]) for doc_gold in dev_docs)
            start_time = timer()
            scorer = nlp_loaded.evaluate(dev_docs, verbose)
            end_time = timer()
            if use_gpu < 0:
                gpu_wps = None
                cpu_wps = nwords/(end_time-start_time)
            else:
                gpu_wps = nwords/(end_time-start_time)
                with Model.use_device('cpu'):
                    nlp_loaded = util.load_model_from_path(epoch_model_path)
                    dev_docs = list(corpus.dev_docs(
                                    nlp_loaded, gold_preproc=gold_preproc))
                    start_time = timer()
                    scorer = nlp_loaded.evaluate(dev_docs)
                    end_time = timer()
                    cpu_wps = nwords/(end_time-start_time)
            acc_loc = (output_path / ('model%d' % i) / 'accuracy.json')
            with acc_loc.open('w') as file_:
                file_.write(json_dumps(scorer.scores))
            meta_loc = output_path / ('model%d' % i) / 'meta.json'
            meta['accuracy'] = scorer.scores
            meta['speed'] = {'nwords': nwords, 'cpu': cpu_wps,
                             'gpu': gpu_wps}
            meta['vectors'] = {'width': nlp.vocab.vectors_length,
                               'vectors': len(nlp.vocab.vectors),
                               'keys': nlp.vocab.vectors.n_keys}
            meta['lang'] = nlp.lang
            meta['pipeline'] = pipeline
            meta['spacy_version'] = '>=%s' % about.__version__
            meta.setdefault('name', 'model%d' % i)
            meta.setdefault('version', version)

            with meta_loc.open('w') as file_:
                file_.write(json_dumps(meta))
            util.set_env_log(True)
        print_progress(i, losses, scorer.scores, cpu_wps=cpu_wps,
                       gpu_wps=gpu_wps)
finally:
    print("Saving model...")
    with nlp.use_params(optimizer.averages):
        final_model_path = output_path / 'model-final'
        nlp.to_disk(final_model_path)

dropout_from = 0.2 by default
dropout_to = 0.2 by default
dropout_decay = 0.0 by default
batch_from = 1 by default
batch_to = 16 by default
batch_compound = 1.001 by default
max_doc_len = 5000 by default
learn_rate = 0.001 by default
optimizer_B1 = 0.9 by default
optimizer_B2 = 0.999 by default
optimizer_eps = 1e-08 by default
L2_penalty = 1e-06 by default
grad_norm_clip = 1.0 by default
Itn.  Dep Loss  NER Loss  UAS     NER P.  NER R.  NER F.  Tag %   Token %  CPU WPS  GPU WPS
0     0.000     0.000     0.000   0.000   0.000   0.000   85.020  100.000  13458.4  7556.5
1     0.000     0.000     0.000   0.000   0.000   0.000   86.609  100.000  11628.4  11374.4
2     0.000     0.000     0.000   0.000   0.000   0.000   87.542  100.000  1143.4   1978.9
3     0.000     0.000     0.000   0.000   0.000   0.000   88.115  100.000  1311.3   2081.4
4     0.000     0.000     0.000   0.000   0.000   0.000   88.544  100.000  11394.7  3088.9
5     0.000     0.000     0.000   0.000   0.000   0.000   88.774  100.000  3778.6   4093.1
6     0.000     0.000     0.000   0.000   0.000   0.000   88.873  100.000  11337.2  3001.2
7     0.000     0.000     0.000   0.000   0.000   0.000   88.662  100.000  12124.9  3955.3
8     0.000     0.000     0.000   0.000   0.000   0.000   28.393  100.000  13094.9  11887.5
9     0.000     0.000     0.000   0.000   0.000   0.000   25.100  100.000  10901.4  9552.1
Saving model...
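
The comments in the training cell describe dropout and batch size as generators of values. As a rough sketch of how such schedules can behave, the two generators below are written from scratch for illustration; they are an assumption about the general shape and are not copied from `spacy.util`. They are driven with the default values printed in the log above.

```python
from itertools import islice


def compounding(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... clipped at stop."""
    value = float(start)
    while True:
        yield min(value, stop) if start <= stop else max(value, stop)
        value *= compound


def decaying(start, stop, decay):
    """Yield start / (1 + decay * step) for step = 1, 2, ..., clipped at stop."""
    step = 1
    while True:
        value = start / (1.0 + decay * step)
        yield max(value, stop) if start >= stop else min(value, stop)
        step += 1


# Defaults from the log: batch_from=1, batch_to=16, batch_compound=1.001,
# dropout_from=0.2, dropout_to=0.2, dropout_decay=0.0
batch_sizes = list(islice(compounding(1, 16, 1.001), 3001))
dropout_rates = list(islice(decaying(0.2, 0.2, 0.0), 5))

print(batch_sizes[0], batch_sizes[1000], batch_sizes[-1])
print(dropout_rates)
```

Under these assumptions, dropout is effectively constant at 0.2 because the decay is 0.0, and with a compound factor of 1.001 the batch size needs a few thousand updates to reach the cap of 16.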


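By the looks of it, adding the vocab to the model does not improve performance in any way: tag accuracy peaks at 88.873% in iteration 6, slips to 88.662% in iteration 7, and then collapses to 28.393% and 25.100% in iterations 8 and 9 (overfitting after iteration #6).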
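
Because the training loop writes one model directory per iteration, the best checkpoint can be picked from the logged tag accuracy. A minimal sketch, with the accuracy values copied by hand from the training log in this notebook:

```python
# Tag accuracy (%) per iteration, copied from the training log.
tag_acc = [85.020, 86.609, 87.542, 88.115, 88.544,
           88.774, 88.873, 88.662, 28.393, 25.100]

# Index of the iteration with the highest dev tag accuracy.
best_iter = max(range(len(tag_acc)), key=tag_acc.__getitem__)
# Matches the per-epoch directory naming ('model%d') used in the training loop.
best_model_dir = 'model%d' % best_iter

print(best_iter, tag_acc[best_iter], best_model_dir)
```

Note that `model-final` is written after the last iteration, so it likely reflects the degraded state from iteration 9 rather than the peak; the per-epoch directory is probably the one worth keeping.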