# Stanza Dependency Parsing

## Navigation:
* [General Info](#info)
* [Setting up Stanza for training](#setup)
* [Preparing Dataset for DEPPARSE](#prepare)
* [Training a Dependency Parser with Stanza](#depparse)
* [Using Trained Model for Prediction](#predict)
* [Prediction and Saving Results to CONLL-U](#save)

## General Info <a class="anchor" id="info"></a>

[`Link to Manual`](https://stanfordnlp.github.io/stanza/index.html) [`Training Page`](https://stanfordnlp.github.io/stanza/training.html)

[`Link to GitHub Repository`](https://github.com/stanfordnlp/stanza) (git clone this repo)

`Libraries needed:` `corpuscula` (conllu parsing); `stanza` (training); `tqdm` (displaying progress); `junky` (loading datasets); `mordl` (conllu evaluation script).

`Pre-Trained Embeddings used in this example:` The recommended vectors can be downloaded from [here](https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-1989/word-embeddings-conll17.tar?sequence=9&isAllowed=y) (~30 GB, 60+ languages).

`Pipeline Input:` CONLL-U file.

`Pipeline Output:` CONLL-U file with predictions.

`Sample pipeline output:`
```
>>> nlp = stanza.Pipeline('ru',
                          processors='tokenize,pos,lemma,ner,depparse',
                          depparse_model_path='stanza/saved_models/depparse/ru_syntagrus_parser.pt',
                          tokenize_pretokenized=True)

>>> doc = nlp(' '.join(test[0]))

>>> print(*[f'id: {word.id}\tword: {word.text}\thead id: {word.head}\t\
        head: {sent.words[word.head-1].text if word.head > 0 else "root"}\tdeprel: {word.deprel}'
        for sent in doc.sentences for word in sent.words], sep='\n')
        
id: 1	word: В	head id: 3	        head: период	deprel: case
id: 2	word: советский	head id: 3	        head: период	deprel: amod
id: 3	word: период	head id: 11	        head: составляло	deprel: obl
id: 4	word: времени	head id: 3	        head: период	deprel: nmod
id: 5	word: число	head id: 11	        head: составляло	deprel: nsubj
id: 6	word: ИТ	head id: 5	        head: число	deprel: nmod
id: 7	word: -	head id: 8	        head: специалистов	deprel: punct
id: 8	word: специалистов	head id: 6	        head: ИТ	deprel: appos
id: 9	word: в	head id: 10	        head: Армении	deprel: case
id: 10	word: Армении	head id: 5	        head: число	deprel: nmod
id: 11	word: составляло	head id: 0	        head: root	deprel: root
id: 12	word: около	head id: 14	        head: тысяч	deprel: case
id: 13	word: десяти	head id: 14	        head: тысяч	deprel: nummod
id: 14	word: тысяч	head id: 11	        head: составляло	deprel: obl
id: 15	word: .	head id: 11	        head: составляло	deprel: punct
```

## Setting up Stanza for training<a class="anchor" id="setup"></a>

In [1]:
#!pip install stanza

In [13]:
# !pip install -U stanza

Run in terminal.

1. Clone Stanza GitHub repository
```
$ git clone https://github.com/stanfordnlp/stanza
```

2. Move to the cloned repository and download the embeddings (`{lang}.vectors.xz` format). Run the download in a `screen` session, since it takes several hours depending on your Internet speed. Make sure the vectors end up in the `./extern_data/word2vec` folder. You will probably need to create this folder and move the downloaded word vector folders there manually.
```
$ cd stanza
$ ./scripts/download_vectors.sh ./extern_data/
```

3. Make sure your `./stanza/scripts/config.sh` is set up like below. Modify if necessary (pay attention to UDBASE and NERBASE).

```
export UDBASE=./udbase

export NERBASE=./nerbase

# Set directories to store processed training/evaluation files
export DATA_ROOT=./data
export TOKENIZE_DATA_DIR=$DATA_ROOT/tokenize
export MWT_DATA_DIR=$DATA_ROOT/mwt
export POS_DATA_DIR=$DATA_ROOT/pos
export LEMMA_DATA_DIR=$DATA_ROOT/lemma
export DEPPARSE_DATA_DIR=$DATA_ROOT/depparse
export ETE_DATA_DIR=$DATA_ROOT/ete
export NER_DATA_DIR=$DATA_ROOT/ner
export CHARLM_DATA_DIR=$DATA_ROOT/charlm

# Set directories to store external word vector data
export WORDVEC_DIR=./extern_data/
```
**NB!** Set `WORDVEC_DIR=./extern_data/` if your vectors are in the `./extern_data/word2vec` folder.
If you instead set `WORDVEC_DIR=./extern_data/word2vec`, your vectors must be stored in the `./extern_data/word2vec/word2vec` folder.
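
Before training, it can help to confirm that a vector file is actually reachable under `WORDVEC_DIR/word2vec`. The sketch below is only an illustration: `find_vectors` is a hypothetical helper, not part of Stanza, and it assumes the files keep the `{lang}.vectors.xz` naming mentioned in step 2.

```python
from pathlib import Path


def find_vectors(wordvec_dir, lang):
    """Return all `{lang}.vectors.xz` files found under `<wordvec_dir>/word2vec`.

    Hypothetical helper (not a Stanza function). It searches recursively,
    so it makes no assumption about per-language subfolder names.
    """
    root = Path(wordvec_dir) / 'word2vec'
    if not root.is_dir():
        return []
    return sorted(root.rglob(f'{lang}.vectors.xz'))


if __name__ == '__main__':
    # Run from inside the cloned stanza directory, with WORDVEC_DIR=./extern_data/
    found = find_vectors('./extern_data/', 'ru')
    print(found if found else 'No ru.vectors.xz found under ./extern_data/word2vec')
```

An empty result means the vectors are in a different folder than the one `WORDVEC_DIR` implies.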

4. Download language resources:

In [1]:
import stanza
stanza.download('ru')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/master/resources_1.0.0.json: 115kB [00:00, 2.47MB/s]                    
2020-07-24 11:27:31 INFO: Downloading default packages for language: ru (Russian)...
2020-07-24 11:27:32 INFO: File exists: /home/steysie/stanza_resources/ru/default.zip.
2020-07-24 11:27:38 INFO: Finished downloading models and saved to /home/steysie/stanza_resources.


## Preparing Dataset for DEPPARSE<a class="anchor" id="prepare"></a>

In [8]:
from corpuscula.corpus_utils import syntagrus, download_ud, Conllu
from corpuscula import corpus_utils
import junky

import corpuscula.corpus_utils as cu
import stanza
# cu.set_root_dir('.')

In [5]:
# !pip install -U junky

In [5]:
corpus_utils.download_syntagrus(root_dir=corpus_utils.get_root_dir(), overwrite=True)

Downloading SynTagRus 1 of 3
[##################] 100%                                          
done: 81043533 bytes
Downloading SynTagRus 2 of 3
[###########] 100%                                                 
done: 10903424 bytes
Downloading SynTagRus 3 of 3
[###########] 100%                                                 
done: 10798207 bytes


['./corpus/_UD/UD_Russian-SynTagRus/ru_syntagrus-ud-train.conllu',
 './corpus/_UD/UD_Russian-SynTagRus/ru_syntagrus-ud-dev.conllu',
 './corpus/_UD/UD_Russian-SynTagRus/ru_syntagrus-ud-test.conllu']

In [11]:
junky.clear_tqdm()
# train, train_heads, train_deprels = junky.get_conllu_fields(syntagrus.train, fields=['HEAD', 'DEPREL'])
# dev, dev_heads, dev_deprels = junky.get_conllu_fields(syntagrus.dev, fields=['HEAD', 'DEPREL'])
test, test_heads, test_deprels = junky.get_conllu_fields(syntagrus.test, fields=['HEAD', 'DEPREL'])

Load corpus


Corpus has been loaded: 48814 sentences, 871526 tokens


Load corpus
Corpus has been loaded: 6584 sentences, 118692 tokens
Load corpus
Corpus has been loaded: 6491 sentences, 117523 tokens
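
For readers without `junky`, the shape of the loaded data (one token list per sentence, plus parallel lists for each requested field) can be illustrated with a dependency-free sketch. `read_conllu_fields` and the toy `sample` sentence are hypothetical. This does not reproduce `junky.get_conllu_fields`, and it returns HEAD values as strings, which may differ from what `junky` returns.

```python
COLUMNS = ['ID', 'FORM', 'LEMMA', 'UPOS', 'XPOS',
           'FEATS', 'HEAD', 'DEPREL', 'DEPS', 'MISC']


def read_conllu_fields(lines, fields=('HEAD', 'DEPREL')):
    """Split CoNLL-U lines into per-sentence token lists plus one list per field."""
    idx = [COLUMNS.index(f) for f in fields]
    sents = []
    extracted = [[] for _ in fields]
    cur_forms, cur_fields = [], [[] for _ in fields]

    def flush():
        # Close the current sentence, if it has any tokens
        if cur_forms:
            sents.append(list(cur_forms))
            for store, cur in zip(extracted, cur_fields):
                store.append(list(cur))
                cur.clear()
            cur_forms.clear()

    for line in lines:
        line = line.rstrip('\n')
        if not line.strip():          # a blank line ends a sentence
            flush()
            continue
        if line.startswith('#'):      # sentence-level comment
            continue
        cols = line.split('\t')
        if '-' in cols[0] or '.' in cols[0]:
            continue                  # skip multi-word ranges and empty nodes
        cur_forms.append(cols[1])
        for cur, i in zip(cur_fields, idx):
            cur.append(cols[i])
    flush()
    return (sents, *extracted)


# Toy sentence, made up for illustration only
sample = (
    '# text = Мама мыла раму .\n'
    '1\tМама\tмама\tNOUN\t_\t_\t2\tnsubj\t_\t_\n'
    '2\tмыла\tмыть\tVERB\t_\t_\t0\troot\t_\t_\n'
    '3\tраму\tрама\tNOUN\t_\t_\t2\tobj\t_\t_\n'
    '4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n'
    '\n'
)
sents, heads, deprels = read_conllu_fields(sample.splitlines())
```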


## Training a Dependency Parser with Stanza<a class="anchor" id="depparse"></a>

**`STEP 1`**

`Input files for DEPPARSE model training should be placed here:` 

**`{UDBASE}/{corpus}/{corpus_short}-ud-{train,dev,test}.conllu`**, where 
* **`{UDBASE}`** is `./stanza/udbase/` (specified in `config.sh`), 
* **`{corpus}`** is full corpus name (e.g. `UD_Russian-SynTagRus` or `UD_English-EWT`, case-sensitive), and 
* **`{corpus_short}`** is the treebank code, which can be [found here](https://stanfordnlp.github.io/stanza/model_history.html) (e.g. `ru_syntagrus`).
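
The path template above can be expanded and checked programmatically before running the preparation script. This is a minimal sketch: `treebank_files` is a hypothetical helper, and the `./stanza/udbase` location assumes the `config.sh` values shown earlier.

```python
from pathlib import Path


def treebank_files(udbase, corpus, corpus_short):
    """Build the expected train/dev/test paths for one treebank."""
    base = Path(udbase) / corpus
    return {split: base / f'{corpus_short}-ud-{split}.conllu'
            for split in ('train', 'dev', 'test')}


if __name__ == '__main__':
    files = treebank_files('./stanza/udbase', 'UD_Russian-SynTagRus', 'ru_syntagrus')
    for split, path in files.items():
        status = 'ok' if path.is_file() else 'MISSING'
        print(f'{split}: {path} [{status}]')
```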

**`STEP 2`**

**Important:** Create `./data/depparse/` folder, otherwise the code below will fail to run.
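
The folder can also be created from the notebook. In this sketch, `ensure_depparse_dir` is a hypothetical helper, and `stanza_root` is assumed to point at the cloned repository.

```python
from pathlib import Path


def ensure_depparse_dir(stanza_root='.'):
    """Create <stanza_root>/data/depparse if it does not exist yet."""
    target = Path(stanza_root) / 'data' / 'depparse'
    # parents=True also creates ./data; exist_ok=True makes reruns harmless
    target.mkdir(parents=True, exist_ok=True)
    return target
```

Call `ensure_depparse_dir()` from inside the repository, or `ensure_depparse_dir('stanza')` from its parent directory.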


**`STEP 3`** To prepare data, run:
```
$ cd stanza
$ ./scripts/prep_depparse_data.sh UD_Russian-SynTagRus gold
```
The script above prepares the train/dev/test `.conllu` data located in `./udbase/UD_Russian-SynTagRus/`.

**`STEP 4`**
To start training, run:
```
$ ./scripts/run_depparse.sh UD_Russian-SynTagRus gold
```
The model will be saved to `saved_models/depparse/ru_syntagrus_parser.pt`.

**`HOW TO USE`** 
#### Loading Trained Models to Pipeline


To load the model for prediction, specify the path to the model when setting up the Pipeline:
```
nlp = stanza.Pipeline('ru', 
                       processors='tokenize,pos,lemma,ner,depparse',
                       pos_model_path=<path to model>,
                       lemma_model_path=<path to model>,
                       ner_model_path=<path to model>,
                       depparse_model_path=<path to model>)
```
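
When only some processors use custom models, the keyword arguments can be assembled so that unset paths are simply omitted. The logs later in this document show that processors without an explicit path load the downloaded packages. `pipeline_kwargs` is a hypothetical helper; only the `*_model_path` parameter names come from the template above.

```python
def pipeline_kwargs(processors='tokenize,pos,lemma,ner,depparse', **model_paths):
    """Collect Pipeline keyword arguments, skipping model paths left as None."""
    allowed = {'pos_model_path', 'lemma_model_path',
               'ner_model_path', 'depparse_model_path'}
    unknown = set(model_paths) - allowed
    if unknown:
        raise ValueError(f'Unexpected arguments: {sorted(unknown)}')
    kwargs = {'processors': processors}
    kwargs.update({k: v for k, v in model_paths.items() if v is not None})
    return kwargs


kwargs = pipeline_kwargs(
    depparse_model_path='stanza/saved_models/depparse/ru_syntagrus_parser.pt',
    pos_model_path=None,  # None: the argument is omitted entirely
)
# With stanza installed and the 'ru' resources downloaded:
# import stanza
# nlp = stanza.Pipeline('ru', **kwargs)
```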

## Using Trained Model for Prediction  <a class="anchor" id="predict"></a>

To disable Stanza's built-in tokenizer, pass `tokenize_pretokenized=True` to the Pipeline.

Each input sentence is still passed as a string, but its tokens must be separated by spaces; no multi-word tokens will appear.
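
A small sketch of preparing such input from token lists follows. `to_pretokenized_text` is a hypothetical helper. It joins tokens with spaces and sentences with newlines, which is how Stanza's documentation describes pretokenized text, and it rejects tokens containing whitespace because they would be split into several tokens.

```python
def to_pretokenized_text(sents):
    """Join token lists: spaces between tokens, newlines between sentences."""
    for sent in sents:
        for tok in sent:
            if not tok or any(ch.isspace() for ch in tok):
                raise ValueError(f'Token {tok!r} is empty or contains whitespace')
    return '\n'.join(' '.join(sent) for sent in sents)


text = to_pretokenized_text([['В', 'советский', 'период'], ['Да', '.']])
```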

In [9]:
nlp = stanza.Pipeline('ru',
                          processors='tokenize,pos,lemma,ner,depparse',
                          depparse_model_path='stanza/saved_models/depparse/ru_syntagrus_parser.pt',
                          tokenize_pretokenized=True)

2020-07-28 13:14:34 INFO: Loading these models for language: ru (Russian):
| Processor | Package                 |
---------------------------------------
| tokenize  | syntagrus               |
| pos       | syntagrus               |
| lemma     | syntagrus               |
| depparse  | stanza/sav..._parser.pt |
| ner       | wikiner                 |

2020-07-28 13:14:36 INFO: Use device: cpu
2020-07-28 13:14:36 INFO: Loading: tokenize
2020-07-28 13:14:36 INFO: Loading: pos
2020-07-28 13:14:37 INFO: Loading: lemma
2020-07-28 13:14:37 INFO: Loading: depparse
2020-07-28 13:14:38 INFO: Loading: ner
2020-07-28 13:14:38 INFO: Done loading processors!


In [23]:
doc = nlp(' '.join(test[0]))

In [24]:
print(*[f'id: {word.id}\tword: {word.text}\thead id: {word.head}\t\
        head: {sent.words[word.head-1].text if word.head > 0 else "root"}\tdeprel: {word.deprel}'
        for sent in doc.sentences for word in sent.words], sep='\n')

id: 1	word: В	head id: 3	        head: период	deprel: case
id: 2	word: советский	head id: 3	        head: период	deprel: amod
id: 3	word: период	head id: 11	        head: составляло	deprel: obl
id: 4	word: времени	head id: 3	        head: период	deprel: nmod
id: 5	word: число	head id: 11	        head: составляло	deprel: nsubj
id: 6	word: ИТ	head id: 5	        head: число	deprel: nmod
id: 7	word: -	head id: 8	        head: специалистов	deprel: punct
id: 8	word: специалистов	head id: 6	        head: ИТ	deprel: appos
id: 9	word: в	head id: 10	        head: Армении	deprel: case
id: 10	word: Армении	head id: 5	        head: число	deprel: nmod
id: 11	word: составляло	head id: 0	        head: root	deprel: root
id: 12	word: около	head id: 14	        head: тысяч	deprel: case
id: 13	word: десяти	head id: 14	        head: тысяч	deprel: nummod
id: 14	word: тысяч	head id: 11	        head: составляло	deprel: obl
id: 15	word: .	head id: 11	        head: составляло	deprel: punct
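
The `word.head - 1` lookup used above works because word ids are 1-based and a head of `0` marks the root. The same logic on plain lists, with the forms and head ids copied from the output above (`head_text` is a hypothetical helper):

```python
def head_text(forms, head):
    """Return the form of the head token, or 'root' when head == 0."""
    if head == 0:
        return 'root'
    if not 1 <= head <= len(forms):
        raise IndexError(f'head id {head} is outside 1..{len(forms)}')
    return forms[head - 1]  # ids are 1-based, list indices are 0-based


forms = ['В', 'советский', 'период', 'времени', 'число', 'ИТ', '-',
         'специалистов', 'в', 'Армении', 'составляло', 'около',
         'десяти', 'тысяч', '.']
heads = [3, 3, 11, 3, 11, 5, 8, 6, 10, 5, 0, 14, 14, 11, 11]
head_forms = [head_text(forms, h) for h in heads]
```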


In [27]:
doc

[
  [
    {
      "id": "1",
      "text": "В",
      "lemma": "в",
      "upos": "ADP",
      "head": 3,
      "deprel": "case",
      "misc": "start_char=0|end_char=1"
    },
    {
      "id": "2",
      "text": "советский",
      "lemma": "советский",
      "upos": "ADJ",
      "feats": "Animacy=Inan|Case=Acc|Degree=Pos|Gender=Masc|Number=Sing",
      "head": 3,
      "deprel": "amod",
      "misc": "start_char=2|end_char=11"
    },
    {
      "id": "3",
      "text": "период",
      "lemma": "период",
      "upos": "NOUN",
      "feats": "Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing",
      "head": 11,
      "deprel": "obl",
      "misc": "start_char=12|end_char=18"
    },
    {
      "id": "4",
      "text": "времени",
      "lemma": "время",
      "upos": "NOUN",
      "feats": "Animacy=Inan|Case=Gen|Gender=Neut|Number=Sing",
      "head": 3,
      "deprel": "nmod",
      "misc": "start_char=19|end_char=26"
    },
    {
      "id": "5",
      "text": "число",
      "lemma": "чи

In [29]:
from collections import OrderedDict
import stanza
from tqdm import tqdm

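def stanza_parse(sents,
                 depparse_model='stanza/saved_models/depparse/ru_syntagrus_parser.pt'
                ):
    """Parse pretokenized sentences and yield each one as a list of
    CoNLL-U token dicts.

    sents: list of sentences, each a list of token strings.
    depparse_model: path to the trained dependency parser model.
    """
    # Pretokenized input: tokens within a sentence are joined by spaces
    sents = [' '.join(sent) for sent in sents]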
    nlp = stanza.Pipeline('ru',
                          processors='tokenize,pos,lemma,ner,depparse',
#                           pos_model_path=pos_model,
#                           lemma_model_path=lemma_model,
#                           ner_model_path=ner_model,
                          depparse_model_path=depparse_model,
                          tokenize_pretokenized=True)
    
    for idx, sent in enumerate(tqdm(sents)):
        doc = nlp(sent)       
        res = []

        assert len(doc.sentences) == 1, \
               'ERROR: incorrect lengths of sentences ({}) for sent {}' \
                   .format(len(doc.sentences), idx)
        sent = doc.sentences[0]
        tokens, words = sent.tokens, sent.words
        assert len(tokens) == len(words), \
               'ERROR: inconsistent lengths of tokens and words for sent {}' \
                   .format(idx)
        for token, word in zip(tokens, words):
            res.append({
                'ID': token.id,
                'FORM': token.text,
                'LEMMA': word.lemma,
                'UPOS': word.upos,
                'XPOS': word.xpos,
                'FEATS': OrderedDict(
                    [(k, v) for k, v in [
                        t.split('=', 1) for t in word.feats.split('|')
                    ]] if word.feats else []
                ),
                'HEAD': str(word.head),
                'DEPREL': word.deprel,
                'DEPS': str(word.head)+':'+ word.deprel,
                'MISC': OrderedDict(
                    [('NE', token.ner[2:])] if token.ner != 'O' else []
                )
            })

        yield res
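
The two field conversions inside `stanza_parse` can be tested in isolation. The helpers below are hypothetical restatements of that logic. The extra check for `'_'` in `parse_feats` is an addition not present in the function above, and `ner_label` assumes tags carry a two-character position prefix such as `B-` or `I-`.

```python
from collections import OrderedDict


def parse_feats(feats):
    """Turn 'Case=Acc|Number=Sing' into an ordered mapping; empty input gives {}."""
    if not feats or feats == '_':
        return OrderedDict()
    return OrderedDict(item.split('=', 1) for item in feats.split('|'))


def ner_label(tag):
    """Drop the two-character position prefix from an NER tag; 'O' gives None."""
    return None if tag == 'O' else tag[2:]


feats = parse_feats('Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing')
```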

## Prediction and Saving Results to CONLL-U<a class="anchor" id="save"></a>

In [30]:
junky.clear_tqdm()

In [31]:
Conllu.save(stanza_parse(test), 'stanza_syntagrus.conllu',
            fix=True, log_file=None)

2020-07-28 13:24:45 INFO: Loading these models for language: ru (Russian):
| Processor | Package                 |
---------------------------------------
| tokenize  | syntagrus               |
| pos       | syntagrus               |
| lemma     | syntagrus               |
| depparse  | stanza/sav..._parser.pt |
| ner       | wikiner                 |

2020-07-28 13:24:45 INFO: Use device: cpu
2020-07-28 13:24:45 INFO: Loading: tokenize
2020-07-28 13:24:45 INFO: Loading: pos
2020-07-28 13:24:46 INFO: Loading: lemma
2020-07-28 13:24:46 INFO: Loading: depparse
2020-07-28 13:24:47 INFO: Loading: ner
2020-07-28 13:24:48 INFO: Done loading processors!
100%|██████████| 6491/6491 [25:39<00:00,  4.22it/s]
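
To see what ends up in the file, one token dict produced by `stanza_parse` can be serialized by hand. This is a simplified sketch, not the implementation of `Conllu.save`: `to_conllu_line` is hypothetical, performs no escaping, and does nothing comparable to `fix=True`. The example token reuses the values shown for word 2 in the `doc` output above; `XPOS` is assumed to be `None` there.

```python
def to_conllu_line(tok):
    """Serialize one token dict into a tab-separated CoNLL-U line."""
    def join_pairs(pairs):
        # Empty mappings are written as '_', as are other missing values
        return '|'.join(f'{k}={v}' for k, v in pairs.items()) or '_'

    cols = [
        str(tok['ID']), tok['FORM'],
        tok['LEMMA'] or '_', tok['UPOS'] or '_', tok['XPOS'] or '_',
        join_pairs(tok['FEATS']),
        str(tok['HEAD']), tok['DEPREL'], tok['DEPS'],
        join_pairs(tok['MISC']),
    ]
    return '\t'.join(cols)


token = {
    'ID': '2', 'FORM': 'советский', 'LEMMA': 'советский',
    'UPOS': 'ADJ', 'XPOS': None,
    'FEATS': {'Animacy': 'Inan', 'Case': 'Acc', 'Degree': 'Pos',
              'Gender': 'Masc', 'Number': 'Sing'},
    'HEAD': '3', 'DEPREL': 'amod', 'DEPS': '3:amod', 'MISC': {},
}
line = to_conllu_line(token)
```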


## Evaluation on Test Corpus

In [35]:
# !pip install mordl

In [32]:
from mordl import conll18_ud_eval

In [33]:
gold_file = 'corpus/_UD/UD_Russian-SynTagRus/ru_syntagrus-ud-test.conllu'
system_file = 'stanza_syntagrus.conllu'

In [34]:
conll18_ud_eval(gold_file, system_file, verbose=True, counts=False)



Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     99.99 |    100.00 |     99.99 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     99.99 |    100.00 |     99.99 |
UPOS       |     98.58 |     98.58 |     98.58 |     98.58
XPOS       |     99.99 |    100.00 |     99.99 |    100.00
UFeats     |     96.28 |     96.29 |     96.29 |     96.29
AllTags    |     95.94 |     95.94 |     95.94 |     95.95
Lemmas     |     97.90 |     97.90 |     97.90 |     97.91
UAS        |     93.16 |     93.16 |     93.16 |     93.16
LAS        |     91.33 |     91.33 |     91.33 |     91.34
CLAS       |     89.68 |     89.62 |     89.65 |     89.62
MLAS       |     85.61 |     85.55 |     85.58 |     85.56
BLEX       |     87.41 |     87.34 |     87.37 |     87.35
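
As a reading aid for the UAS and LAS rows: UAS counts tokens whose predicted head is correct, and LAS additionally requires the correct relation label. The sketch below computes both as plain accuracies over parallel lists such as `test_heads` and `test_deprels`. It is a simplification, not the `conll18_ud_eval` algorithm: `attachment_scores` is hypothetical, assumes identical tokenization in gold and predicted data, and is not expected to reproduce the numbers in the table. The example values are a toy case.

```python
def attachment_scores(gold_heads, gold_deprels, pred_heads, pred_deprels):
    """Return (UAS, LAS) in percent for identically tokenized sentences."""
    if len({len(gold_heads), len(gold_deprels),
            len(pred_heads), len(pred_deprels)}) != 1:
        raise ValueError('All inputs must contain the same number of sentences')
    total = uas_hits = las_hits = 0
    for gh, gd, ph, pd in zip(gold_heads, gold_deprels, pred_heads, pred_deprels):
        if not (len(gh) == len(gd) == len(ph) == len(pd)):
            raise ValueError('Sentence lengths differ; tokenization must be identical')
        for g_head, g_rel, p_head, p_rel in zip(gh, gd, ph, pd):
            total += 1
            if str(g_head) == str(p_head):   # head ids may be str or int
                uas_hits += 1
                if g_rel == p_rel:           # LAS needs head and label correct
                    las_hits += 1
    if total == 0:
        raise ValueError('No tokens to score')
    return 100.0 * uas_hits / total, 100.0 * las_hits / total


# Toy example: 4 tokens, one wrong label and one wrong head
uas, las = attachment_scores(
    [[2, 0, 2, 2]], [['nsubj', 'root', 'obj', 'punct']],
    [[2, 0, 2, 3]], [['nsubj', 'root', 'iobj', 'punct']],
)
```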
