# Dependency Parsing with SuPar

Project repo: https://github.com/yzhangcs/parser

Prerequisites:
* `python`: 3.7
* `pytorch`: 1.4
* `transformers`: 3.0
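
Before installing anything else, it can help to see which versions are already present in the environment. The snippet below is a minimal sketch: the helper `installed_version` and the `LISTED` table are our own illustrative names, not part of SuPar, and note that the `pytorch` package is imported as `torch`.

```python
import importlib
import sys

# Versions listed in the prerequisites above (module name -> listed version).
LISTED = {"torch": "1.4", "transformers": "3.0"}


def installed_version(module_name):
    """Return the module's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


print("python", ".".join(map(str, sys.version_info[:3])), "(listed: 3.7)")
for name, listed in LISTED.items():
    found = installed_version(name) or "not installed"
    print(name, found, f"(listed: {listed})")
```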

We recommend installing SuPar from the GitHub repository. 
When we tested it, the `pip install supar` version did not allow us to train models.
In the repository's Issues, the author also recommends installing SuPar from source.

In [None]:
## If not installed earlier:
# !apt install git -y
# !apt install wget -y
# !pip install gensim

In [None]:
!pip install transformers
!pip install corpuscula
!pip install junky

# At the time we tested SuPar, the `pip install` version didn't work, so we
# clone the original repo with the latest version

!git clone https://github.com/yzhangcs/parser
!cd parser && python setup.py install

In [None]:
# Download the model file from Google Drive and save it as `model`
# (it is loaded later with `Parser.load('model')`)
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1YV9L3AXORclrfGKuxh9LuARNhaUX5iri' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1YV9L3AXORclrfGKuxh9LuARNhaUX5iri" -O model && rm -rf /tmp/cookies.txt

In [None]:
# Download the word vectors used below as `FT_VECTORS_PATH`
!wget http://files.deeppavlov.ai/embeddings/ft_native_300_ru_wiki_lenta_nltk_wordpunct_tokenize/ft_native_300_ru_wiki_lenta_nltk_wordpunct_tokenize.vec

## Loading corpus

In [None]:
from corpuscula.corpus_utils import download_ud

corpus_name = 'UD_Russian-SynTagRus'
# corpus_name = 'UD_Russian-Taiga'
download_ud(corpus_name, root_dir='.')

## Filtering embeddings

SuPar requires a lot of CPU memory (RAM) to load the full word embeddings: 25 GB is not enough. If you have enough memory, you can use SuPar without filtering the embeddings.

Otherwise, shrinking the word vectors helps: it takes no more than a couple of minutes, and the resulting models still show results comparable to SOTA. We implemented vector filtering in our `junky` library.
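
To make the idea concrete, here is a minimal sketch of vocabulary-based filtering in plain Python. It is not the `junky` implementation: `build_vocab` and `filter_vec_file` are hypothetical helpers, and we assume the corpus is a list of sentences, each a list of token strings, and that the vectors are stored in the text format with a `<vocab_size> <emb_dim>` header line.

```python
def build_vocab(sentences):
    """Collect the set of distinct tokens from a list of tokenized sentences."""
    return {token for sentence in sentences for token in sentence}


def filter_vec_file(src_path, dst_path, vocab):
    """Copy only the vectors whose token is in `vocab` from a text .vec file.

    Assumes a `<vocab_size> <emb_dim>` header followed by one
    `token v1 v2 ...` line per word. Returns the number of vectors kept.
    """
    kept = []
    with open(src_path, encoding="utf-8") as src:
        _, emb_dim = src.readline().split()  # the original vocab size is discarded
        for line in src:
            token = line.split(" ", 1)[0]
            if token in vocab:
                kept.append(line.rstrip("\n"))
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(f"{len(kept)} {emb_dim}\n")  # header with the new vocab size
        for line in kept:
            dst.write(line + "\n")
    return len(kept)
```

Only one line of the source file is held in memory at a time, so the sketch stays cheap even when the full embedding file is far larger than available RAM.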

In [2]:
import junky

In [3]:
junky.clear_tqdm()
fpath = 'corpus/_UD/UD_Russian-SynTagRus/ru_syntagrus-ud-{}.conllu'

train, train_lemmas = junky.get_conllu_fields(fpath.replace('{}', 'train'), fields=['LEMMA'])
dev, dev_lemmas = junky.get_conllu_fields(fpath.replace('{}', 'dev'), fields=['LEMMA'])
test, test_lemmas = junky.get_conllu_fields(fpath.replace('{}', 'test'), fields=['LEMMA'])

Load corpus
Corpus has been loaded: 48814 sentences, 871526 tokens
Load corpus
Corpus has been loaded: 6584 sentences, 118692 tokens
Load corpus
Corpus has been loaded: 6491 sentences, 117523 tokens


In [None]:
full_corpus = train + train_lemmas + dev + dev_lemmas + test + test_lemmas

In [None]:
# Any pretrained word embeddings in text format whose first line is `<vocab_size> <emb_dim>`
FT_VECTORS_PATH = 'ft_native_300_ru_wiki_lenta_nltk_wordpunct_tokenize.vec'

_ = junky.filter_embeddings(pretrained_embs=FT_VECTORS_PATH, corpus=full_corpus,
                      min_abs_freq=1, save_name='filtered_vectors_freq1.vec',
                      pad_token='[PAD]', unk_token='[UNK]')

## Biaffine Dependency Parsing

Original paper: Timothy Dozat and Christopher D. Manning. 2017. [Deep Biaffine Attention for Neural Dependency Parsing](https://arxiv.org/abs/1611.01734).

### Training with w2v+BERT embeddings

Depending on the GPU you have, you might get the following error when running the training script:
```
Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp.so.1 library.
	Try to import numpy first or set the threading layer accordingly. Set MKL_SERVICE_FORCE_INTEL to force it.
```
To avoid the problem, set the environment variable `MKL_SERVICE_FORCE_INTEL=1`. 
You can run the code below and then run the training script without restarting the notebook.
```python
import os
os.environ["MKL_SERVICE_FORCE_INTEL"] = "1"
```

In [None]:
!python -m supar.cmds.biaffine_dependency train -b -d 0    \
  -p ./model-bert \
  -f bert  \
  --punct  \
  --train corpus/_UD/UD_Russian-SynTagRus/ru_syntagrus-ud-train.conllu \
  --dev corpus/_UD/UD_Russian-SynTagRus/ru_syntagrus-ud-dev.conllu \
  --test corpus/_UD/UD_Russian-SynTagRus/ru_syntagrus-ud-test.conllu \
  --embed filtered_vectors_freq1.vec \
  --unk '[UNK]' \
  --n-embed 300 \
  --batch-size 1000 \
  --bert bert-base-multilingual-cased

### Training with w2v+character embeddings

In [None]:
!python -m supar.cmds.biaffine_dependency train -b -d 0    \
  -p ./model-char \
  -f char  \
  --punct  \
  --train corpus/_UD/UD_Russian-SynTagRus/ru_syntagrus-ud-train.conllu \
  --dev corpus/_UD/UD_Russian-SynTagRus/ru_syntagrus-ud-dev.conllu \
  --test corpus/_UD/UD_Russian-SynTagRus/ru_syntagrus-ud-test.conllu \
  --embed filtered_vectors_freq1.vec \
  --unk '[UNK]' \
  --n-embed 300 \
  --batch-size 1000

### Distributed training on several GPUs

In [None]:
!python -m torch.distributed.launch \
  --nproc_per_node=4 --master_port=10000  \
  -m supar.cmds.biaffine_dependency train -b -d 0,1 \
  -p ./model-bert-distr \
  -f bert  \
  --punct  \
  --train corpus/_UD/UD_Russian-SynTagRus/ru_syntagrus-ud-train.conllu \
  --dev corpus/_UD/UD_Russian-SynTagRus/ru_syntagrus-ud-dev.conllu \
  --test corpus/_UD/UD_Russian-SynTagRus/ru_syntagrus-ud-test.conllu \
  --embed filtered_vectors_freq1.vec \
  --unk '[UNK]' \
  --n-embed 300 \
  --batch-size 1500 \
  --bert bert-base-multilingual-cased

## Prediction

In [6]:
from supar import Parser

In [7]:
parser = Parser.load('model')
test_dataset = parser.predict(test, pred='supar_char_depparse_test.conllu')

2020-08-07 13:32:45 INFO Load the data
2020-08-07 13:32:51 INFO                                                           
Dataset(n_sentences=6491, n_batches=26, n_buckets=8)
2020-08-07 13:32:51 INFO Make predictions on the dataset
100%|####################################| 26/26 00:04<00:00,  6.33it/s
2020-08-07 13:32:56 INFO Save predicted results to supar_char_depparse_test.conllu
2020-08-07 13:32:57 INFO 0:00:04.629948s elapsed, 1401.96 Sents/s
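
The predictions are saved in CoNLL-U format, so they can be inspected without SuPar. The reader below is a minimal sketch (the helper `read_conllu_arcs` is our own illustrative name, not a SuPar or `junky` function); it keeps only the word form, the head index and the dependency relation of each token.

```python
def read_conllu_arcs(lines):
    """Yield one list of (form, head, deprel) tuples per sentence.

    `lines` is any iterable of CoNLL-U lines, e.g. an open file.
    """
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                   # a blank line ends the sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):       # sentence-level comment
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():      # skip multiword (1-2) and empty (1.1) nodes
            continue
        # CoNLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        head = int(cols[6]) if cols[6].isdigit() else None
        sentence.append((cols[1], head, cols[7]))
    if sentence:                       # file without a trailing blank line
        yield sentence


# Toy input; for the real file, pass
# open('supar_char_depparse_test.conllu', encoding='utf-8') instead.
sample = [
    "# text = She sleeps",
    "1\tShe\tshe\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\tsleeps\tsleep\tVERB\t_\t_\t0\troot\t_\t_",
    "",
]
sentences = list(read_conllu_arcs(sample))
print(sentences)
```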


## Evaluation

During evaluation, the model first makes predictions on the target corpus and then scores them against the gold annotations.

In [8]:
parser = Parser.load('model-bert')
loss, metric = parser.evaluate('corpus/_UD/UD_Russian-SynTagRus/ru_syntagrus-ud-test.conllu')

2020-08-07 13:33:10 INFO Load the data
2020-08-07 13:33:26 INFO                                                           
Dataset(n_sentences=6491, n_batches=26, n_buckets=8)
2020-08-07 13:33:26 INFO Evaluate the dataset
2020-08-07 13:33:39 INFO loss: 0.2480 - UCM: 61.61% LCM: 50.41% UAS: 94.89% LAS: 92.79%
2020-08-07 13:33:39 INFO 0:00:13.039016s elapsed, 497.81 Sents/s


In [9]:
parser = Parser.load('model')
loss, metric = parser.evaluate('corpus/_UD/UD_Russian-SynTagRus/ru_syntagrus-ud-test.conllu')

2020-08-07 13:33:41 INFO Load the data
2020-08-07 13:33:48 INFO                                                           
Dataset(n_sentences=6491, n_batches=26, n_buckets=8)
2020-08-07 13:33:48 INFO Evaluate the dataset
2020-08-07 13:33:52 INFO loss: 0.2753 - UCM: 58.45% LCM: 47.97% UAS: 94.29% LAS: 92.15%
2020-08-07 13:33:52 INFO 0:00:04.232478s elapsed, 1533.62 Sents/s
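
The logs above report four metrics: UAS (share of tokens with the correct head), LAS (correct head and relation label), and UCM / LCM (share of sentences in which every token is correct, unlabeled or labeled). The sketch below shows how such scores can be computed from toy data. It is an illustration under our own assumptions, not SuPar's metric code, and it does not reproduce SuPar's handling of punctuation; `attachment_scores` is a hypothetical helper.

```python
def attachment_scores(gold, pred):
    """Compute UAS, LAS, UCM and LCM as fractions in [0, 1].

    `gold` and `pred` are parallel lists of sentences; each sentence is a
    list of (head, deprel) tuples, one per token.
    """
    n_tokens = n_head_ok = n_both_ok = 0
    n_ucm = n_lcm = 0
    for gold_sent, pred_sent in zip(gold, pred):
        head_ok = [g[0] == p[0] for g, p in zip(gold_sent, pred_sent)]
        both_ok = [g == p for g, p in zip(gold_sent, pred_sent)]
        n_tokens += len(gold_sent)
        n_head_ok += sum(head_ok)
        n_both_ok += sum(both_ok)
        n_ucm += all(head_ok)   # every head in the sentence is correct
        n_lcm += all(both_ok)   # every head and label is correct
    n_sents = len(gold)
    return {
        "UAS": n_head_ok / n_tokens,
        "LAS": n_both_ok / n_tokens,
        "UCM": n_ucm / n_sents,
        "LCM": n_lcm / n_sents,
    }


gold = [[(2, "nsubj"), (0, "root")],
        [(2, "det"), (0, "root"), (2, "obj")]]
pred = [[(2, "nsubj"), (0, "root")],
        [(2, "det"), (0, "root"), (2, "nmod")]]  # one wrong label
scores = attachment_scores(gold, pred)
print(scores)  # {'UAS': 1.0, 'LAS': 0.8, 'UCM': 1.0, 'LCM': 0.5}
```

A single wrong label leaves UAS and UCM untouched but lowers LAS and halves LCM here, which is why the complete-match scores in the logs are much lower than the attachment scores.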
