# Stanza POS, LEMMA and NER Tagging Pipeline

## Navigation:
* [General Info](#info)
* [Setting up Stanza for training](#setup)
* [Training POS and LEMMA taggers with Stanza](#pos)
* [Preparing Dataset for NER](#prepare)
* [Adding BIOES Annotation](#bioes)
* [Training NER tagger with Stanza](#ner)
* [Using Trained Model for Prediction](#predict)
* [Prediction and Saving Results to CONLL-U](#save)

## General Info <a class="anchor" id="info"></a>

[`Link to Manual`](https://stanfordnlp.github.io/stanza/index.html) [`Training Page`](https://stanfordnlp.github.io/stanza/training.html)

[`Link to GitHub Repository`](https://github.com/stanfordnlp/stanza) (git clone this repo)

`Libraries needed:` `corpuscula.conllu` (conllu parsing); `stanza` (training); `json` (saving results); `tqdm` (displaying progress)

`Pre-Trained Embeddings used in this example:` The recommended vectors can be downloaded from [here](https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-1989/word-embeddings-conll17.tar?sequence=9&isAllowed=y) (~30 GB, 60+ languages)

`Pipeline Input:` CONLL-U parsed text file.

`Processing:` Extracting tokens and named entities as separate lists of lists of strings, and adding BIOES tags to entities.

`Train Input:` `{train,dev,test}.bio` files in BIOES format as shown [here](https://en.wikipedia.org/wiki/Inside–outside–beginning_(tagging))

`Sample train input:`
```
здравствуйте O
расскажите O
справочной S-Department
аэропорта S-Organization
город B-Geo
томск E-Geo
```

`Sample inference (predict) result:`
```
>> print(*[f'token: {token.text}\tner: {token.ner}' for sent in doc.sentences for token in sent.tokens], sep='\n')
token: 4	ner: B-Organization
token: больница	ner: I-Organization
token: детская	ner: I-Organization
token: городская	ner: I-Organization
token: больница	ner: I-Organization
token: номер	ner: I-Organization
token: 4	ner: E-Organization
token: города	ner: B-Geo
token: сочи	ner: E-Geo
token: приемный	ner: B-Department
token: покой	ner: E-Department
```
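The per-token tags above can be grouped back into whole entities. The following is a minimal sketch, not part of Stanza: `bioes_to_spans` is a hypothetical helper, and dropping stray `I-`/`E-` tags that have no opening `B-` is an assumption made here.

```python
def bioes_to_spans(tokens, tags):
    """Group BIOES-tagged tokens into (entity_type, text) pairs."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag == 'O':
            current, current_type = [], None
            continue
        prefix, ent_type = tag.split('-', 1)
        if prefix == 'S':
            # Single-token entity
            spans.append((ent_type, token))
            current, current_type = [], None
        elif prefix == 'B':
            # Open a new multi-token entity
            current, current_type = [token], ent_type
        elif ent_type == current_type:
            # 'I' or 'E' continuing the open entity
            current.append(token)
            if prefix == 'E':
                spans.append((ent_type, ' '.join(current)))
                current, current_type = [], None
        else:
            # Stray 'I'/'E' without a matching 'B': dropped (assumption)
            current, current_type = [], None
    return spans


tokens = ['города', 'сочи', 'приемный', 'покой']
tags = ['B-Geo', 'E-Geo', 'B-Department', 'E-Department']
print(bioes_to_spans(tokens, tags))
# [('Geo', 'города сочи'), ('Department', 'приемный покой')]
```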

`Pipeline Output:` JSON with NER, POS, Features Parsing (list of lists of dict)

`Sample pipeline output:`
```
[[{'word': 'подскажите',
   'entity': None,
   'pos': 'VERB',
   'feats': 'Aspect=Perf|Mood=Imp|Number=Plur|Person=2|VerbForm=Fin|Voice=Act'},
  {'word': 'мне',
   'entity': None,
   'pos': 'PRON',
   'feats': 'Case=Dat|Number=Sing|Person=1'},
  {'word': 'регистратуру',
   'entity': 'Department',
   'pos': 'NOUN',
   'feats': 'Animacy=Inan|Case=Acc|Gender=Fem|Number=Sing'}, ...]
```
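A structure like the one above can be written to disk with the standard `json` module. This is only a sketch: the helper names `save_json` and `load_json` and the file name `result.json` are illustrative, not taken from the pipeline.

```python
import json
import os
import tempfile


def save_json(data, path):
    # ensure_ascii=False keeps Cyrillic text readable in the file
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)


def load_json(path):
    with open(path, encoding='utf-8') as f:
        return json.load(f)


sample = [[{'word': 'мне', 'entity': None, 'pos': 'PRON',
            'feats': 'Case=Dat|Number=Sing|Person=1'}]]

# Round trip through a temporary file; None is stored as JSON null
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'result.json')
    save_json(sample, path)
    restored = load_json(path)
```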

## Setting up Stanza for training <a class="anchor" id="setup"></a>

In [1]:
#!pip install stanza

Run the following steps in a terminal.

1. Clone Stanza GitHub repository
```
$ git clone https://github.com/stanfordnlp/stanza
```

2. Move to the cloned repository and download the embeddings (`{lang}.vectors.xz` format).
(Run this in a `screen` session: it takes about 5-6 hours.)
```
$ cd stanza
$ ./scripts/download_vectors.sh ./extern_data/
```

3. Make sure your `./stanza/scripts/config.sh` is set up like below. Modify if necessary (pay attention to UDBASE and NERBASE).

```
export UDBASE=./udbase

export NERBASE=./nerbase

# Set directories to store processed training/evaluation files
export DATA_ROOT=./data
export TOKENIZE_DATA_DIR=$DATA_ROOT/tokenize
export MWT_DATA_DIR=$DATA_ROOT/mwt
export POS_DATA_DIR=$DATA_ROOT/pos
export LEMMA_DATA_DIR=$DATA_ROOT/lemma
export DEPPARSE_DATA_DIR=$DATA_ROOT/depparse
export ETE_DATA_DIR=$DATA_ROOT/ete
export NER_DATA_DIR=$DATA_ROOT/ner
export CHARLM_DATA_DIR=$DATA_ROOT/charlm

# Set directories to store external word vector data
export WORDVEC_DIR=./extern_data/word2vec
```

The prepared train, dev and test files go to `{NERBASE}/{corpus}/{train,dev,test}.bio`, where `{corpus}` is the full language name (e.g. `Russian`).

Stanza does not accept any other corpus names.

4. Download language resources:

In [2]:
#stanza.download('ru')

## Training POS and LEMMA taggers with Stanza <a class="anchor" id="pos"></a>

**`STEP 1`**

`Input files for POS and LEMMA model training should be placed here:` 

**`{UDBASE}/{corpus}/{corpus_short}-ud-{train,dev,test}.conllu`**, where 
* **`{UDBASE}`** is `./stanza/udbase/` (specified in `config.sh`), 
* **`{corpus}`** is full corpus name (e.g. `UD_Russian-SynTagRus` or `UD_English-EWT`, case-sensitive), and 
* **`{corpus_short}`** is the treebank code, which can be [found here](https://stanfordnlp.github.io/stanza/model_history.html) (e.g. `ru_syntagrus`).
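
Before running the preparation scripts, it can help to check that the files follow this naming pattern. A minimal sketch, assuming the `./stanza/udbase` location named above; `ud_paths` is a hypothetical helper, not part of Stanza.

```python
import os


def ud_paths(udbase, corpus, corpus_short):
    """Return the expected train/dev/test CoNLL-U paths for a treebank."""
    return {
        split: os.path.join(udbase, corpus,
                            '{}-ud-{}.conllu'.format(corpus_short, split))
        for split in ('train', 'dev', 'test')
    }


paths = ud_paths('./stanza/udbase', 'UD_Russian-SynTagRus', 'ru_syntagrus')
# Any path listed here still has to be created or renamed
missing = [p for p in paths.values() if not os.path.isfile(p)]
for p in missing:
    print('missing:', p)
```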

**`STEP 2`**

**Important:** Create the `./data/pos/` and `./data/lemma/` folders first, otherwise the scripts below will fail to run.
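
The folders can also be created from Python. A small sketch; `make_data_dirs` is a hypothetical helper, and the data root you pass should match `DATA_ROOT` in your `config.sh`.

```python
import os


def make_data_dirs(data_root, processors=('pos', 'lemma')):
    """Create one subfolder per processor under data_root."""
    created = []
    for name in processors:
        path = os.path.join(data_root, name)
        os.makedirs(path, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created

# Example call (path is an assumption): make_data_dirs('./stanza/data')
```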


**`STEP 3`** To prepare data, run the script for each tagger (`{pos,lemma}` stands for either `pos` or `lemma`):
```
$ cd stanza
$ ./scripts/prep_{pos,lemma}_data.sh UD_Russian-SynTagRus
```
The script above prepares the train, dev and test `.conllu` data located in `./udbase/UD_Russian-SynTagRus/`.

**`STEP 4`**
To start training, run:
```
$ ./scripts/run_{pos,lemma}.sh UD_Russian-SynTagRus
```
The model will be saved to `saved_models/{pos,lemma}/` inside the `stanza` directory.

**`HOW TO USE`** 
#### Loading Trained Models to Pipeline


To load the trained models for prediction, pass their paths when setting up the tagger Pipeline:
```
nlp = stanza.Pipeline('ru', 
                       processors='tokenize,pos,lemma,ner',
                       pos_model_path=<path to model>,
                       lemma_model_path=<path to model>,
                       ner_model_path=<path to model>)
```

## Preparing Dataset for NER <a class="anchor" id="prepare"></a>

In [3]:
from corpuscula.conllu import Conllu

In [4]:
def read_corpus(corpus=None, silent=False):
    if isinstance(corpus, str):
        corpus = Conllu.load(corpus, **({'log_file': None} if silent else {}))
    elif callable(corpus):
        corpus = corpus()

    parsed_corpus = []
    parsed_ne = []

    for sent in corpus:
        # Keep only regular tokens: skip empty forms and multi-word token ranges.
        # The same filter is applied to both lists so that they stay aligned.
        tokens = [x for x in sent[0] if x['FORM'] and '-' not in x['ID']]
        curr_sent = [x['FORM'] for x in tokens]
        curr_ne = [x['MISC'].get('NE', 'O') for x in tokens]
        parsed_corpus.append(curr_sent)
        parsed_ne.append(curr_ne)

    return parsed_corpus, parsed_ne
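
To make the extraction step concrete without `corpuscula`, here is a dependency-free sketch that does roughly the same thing on raw CoNLL-U text. It assumes the entity label sits in the MISC column as `NE=<label>`, as in the function above; `extract_tokens_and_ne` is a hypothetical name, and skipping empty nodes (IDs containing `.`) is an addition made here.

```python
def extract_tokens_and_ne(conllu_text):
    """Return (sentences, labels) from CoNLL-U text; 'O' if MISC has no NE."""
    sentences, labels = [], []
    tokens, nes = [], []
    # The extra '' flushes the last sentence if no blank line follows it
    for line in conllu_text.splitlines() + ['']:
        line = line.strip()
        if not line:
            if tokens:
                sentences.append(tokens)
                labels.append(nes)
            tokens, nes = [], []
            continue
        if line.startswith('#'):
            continue  # sentence-level comment
        cols = line.split('\t')
        if '-' in cols[0] or '.' in cols[0]:
            continue  # skip multi-word token ranges and empty nodes
        misc = dict(kv.split('=', 1) for kv in cols[9].split('|') if '=' in kv)
        tokens.append(cols[1])
        nes.append(misc.get('NE', 'O'))
    return sentences, labels


sample = (
    '1\tаэропорта\t_\t_\t_\t_\t_\t_\t_\tNE=Organization\n'
    '2\tгород\t_\t_\t_\t_\t_\t_\t_\tNE=Geo\n'
    '3\tда\t_\t_\t_\t_\t_\t_\t_\t_\n'
)
print(extract_tokens_and_ne(sample))
# ([['аэропорта', 'город', 'да']], [['Organization', 'Geo', 'O']])
```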

In [5]:
# replace file names, if necessary
parsed_corpus_train, named_entities_train = read_corpus('result_ner_train.conllu')
parsed_corpus_dev, named_entities_dev = read_corpus('result_ner_dev.conllu')
parsed_corpus_test, named_entities_test = read_corpus('result_ner_test.conllu')

Load corpus
Corpus has been loaded: 30390 sentences, 378829 tokens
Load corpus
[====] 3799                                                        
Corpus has been loaded: 3799 sentences, 47280 tokens
Load corpus
[====] 3798                                                        
Corpus has been loaded: 3798 sentences, 47126 tokens


In [6]:
parsed_corpus_train[:1], named_entities_train[:1]

([['добрый',
   'день',
   'девушка',
   'скажите',
   'пожалуйста',
   'мне',
   'телефончик',
   'автобусная',
   'да',
   'по',
   'бежецкого']],
 [['O', 'O', 'O', 'O', 'O', 'O', 'O', 'Organization', 'O', 'O', 'Address']])

## Adding BIOES Annotation <a class="anchor" id="bioes"></a>

In [7]:
def bioes_annotation(ne_list):
    """Convert per-token entity labels (e.g. 'Geo', 'O') to BIOES tags.

    Adjacent tokens with the same label are treated as one entity.
    """
    prev_ne = 'O'   # raw label of the previous token
    bioes_ne = []

    for i, ne in enumerate(ne_list):
        # Does the next token continue the same entity?
        has_next = i < len(ne_list) - 1 and ne == ne_list[i + 1]

        if ne == 'O':
            tag = 'O'
        elif ne == prev_ne:
            # Inside or at the end of a multi-token entity
            tag = ('I-' if has_next else 'E-') + ne
        else:
            # Start of a multi-token entity, or a single-token entity
            tag = ('B-' if has_next else 'S-') + ne

        prev_ne = ne
        bioes_ne.append(tag)

    return bioes_ne

In [8]:
bio_ne_train = [bioes_annotation(ne_seq) for ne_seq in named_entities_train]
bio_ne_dev = [bioes_annotation(ne_seq) for ne_seq in named_entities_dev]
bio_ne_test = [bioes_annotation(ne_seq) for ne_seq in named_entities_test]

In [9]:
bio_ne_train[:1]

[['O', 'O', 'O', 'O', 'O', 'O', 'O', 'S-Organization', 'O', 'O', 'S-Address']]
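
Before writing the tags to disk, a quick well-formedness check can catch annotation mistakes. This is a sketch under the assumption that every tag is either `O` or `<prefix>-<type>` with a prefix from `B`, `I`, `E`, `S`; `is_valid_bioes` is a hypothetical helper.

```python
def is_valid_bioes(tags):
    """Check that B/I/E/S prefixes form well-formed entities."""
    open_type = None  # type of the entity currently being continued
    for tag in tags:
        if tag == 'O':
            prefix, ent_type = 'O', None
        else:
            prefix, ent_type = tag.split('-', 1)
        if prefix in ('I', 'E'):
            if ent_type != open_type:
                return False  # continuation without a matching start
            if prefix == 'E':
                open_type = None
        else:
            if open_type is not None:
                return False  # previous entity was never closed with E
            open_type = ent_type if prefix == 'B' else None
    return open_type is None


print(is_valid_bioes(['O', 'S-Organization', 'O', 'S-Address']))  # True
print(is_valid_bioes(['B-Geo', 'O']))                             # False
```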

In [10]:
# Modify paths and file names, if necessary
import os

# Note that "Russian" is treated as the name of the corpus.
# You cannot give your corpus other names, only language names.
nerbase = './stanza/nerbase/'
corpus_lang = 'Russian'

dn = os.path.join(nerbase, corpus_lang)
os.makedirs(dn, exist_ok=True)

def write_bio(file_name, corpus, tags):
    # One "token tag" pair per line; a blank line ends each sentence
    with open(os.path.join(dn, file_name), 'wt', encoding='utf-8') as f:
        for sent_tokens, sent_tags in zip(corpus, tags):
            for pair in zip(sent_tokens, sent_tags):
                print(' '.join(pair), file=f)
            print(file=f)

write_bio('train.bio', parsed_corpus_train, bio_ne_train)
write_bio('dev.bio', parsed_corpus_dev, bio_ne_dev)
write_bio('test.bio', parsed_corpus_test, bio_ne_test)
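
To verify the written files, they can be read back and compared with the in-memory lists. A minimal sketch, assuming the layout described above (one `token tag` pair per line, a blank line between sentences); `read_bio` is a hypothetical helper.

```python
def read_bio(path):
    """Read a .bio file into (sentences, tags) lists of lists."""
    sentences, tags = [], []
    cur_tokens, cur_tags = [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line:
                # Blank line: close the current sentence, if any
                if cur_tokens:
                    sentences.append(cur_tokens)
                    tags.append(cur_tags)
                cur_tokens, cur_tags = [], []
                continue
            # The tag is the last space-separated field on the line
            token, tag = line.rsplit(' ', 1)
            cur_tokens.append(token)
            cur_tags.append(tag)
    if cur_tokens:  # the file may not end with a blank line
        sentences.append(cur_tokens)
        tags.append(cur_tags)
    return sentences, tags
```

For example, `read_bio('./stanza/nerbase/Russian/train.bio')` should return lists equal to `parsed_corpus_train` and `bio_ne_train`, provided no sentence is empty and no token contains a space.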

## Training NER tagger with Stanza <a class="anchor" id="ner"></a>

Go to terminal and run:
```
$ cd stanza
$ ./scripts/run_ner.sh Russian --shorthand='ru_syntagrus'
```

Your model will be saved to `./stanza/saved_models/ner/ru_syntagrus_nertagger.pt`.

## Using Trained Model for Prediction  <a class="anchor" id="predict"></a>

To disable Stanza's built-in tokenizer, pass `tokenize_pretokenized=True` to the Pipeline.

The input is still a list of strings (one per sentence), but tokens are split on spaces only and no multi-word tokens are produced.
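
Because tokens are recovered by splitting on whitespace, a token that itself contains whitespace would change the token count. The sketch below joins token lists into strings and rejects such tokens first; `to_pretokenized` is a hypothetical helper, and raising an error is just one possible policy.

```python
def to_pretokenized(sents):
    """Join token lists into space-separated strings, one per sentence."""
    joined = []
    for idx, tokens in enumerate(sents):
        for token in tokens:
            # An empty token or one with whitespace would not survive a split
            if not token or token != ''.join(token.split()):
                raise ValueError(
                    'sentence {}: bad token {!r}'.format(idx, token))
        joined.append(' '.join(tokens))
    return joined


sents = [['добрый', 'день'], ['город', 'томск']]
strings = to_pretokenized(sents)
print(strings)  # ['добрый день', 'город томск']
```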

In [11]:
from collections import OrderedDict
import stanza
from tqdm import tqdm

def stanza_parse(sents,
                 pos_model='./stanza/saved_models/pos/ru_syntagrus_tagger.pt',
                 lemma_model='./stanza/saved_models/lemma/ru_syntagrus_lemmatizer.pt',
                 ner_model='./stanza/saved_models/ner/ru_syntagrus_nertagger.pt'):
    
    sents = [' '.join(sent) for sent in sents]
    nlp = stanza.Pipeline('ru',
                          processors='tokenize,pos,lemma,ner',
                          pos_model_path=pos_model,
                          lemma_model_path=lemma_model,
                          ner_model_path=ner_model,
                          tokenize_pretokenized=True)
    
    for idx, sent in enumerate(tqdm(sents)):
        doc = nlp(sent)       
        res = []

        assert len(doc.sentences) == 1, \
               'ERROR: incorrect lengths of sentences ({}) for sent {}' \
                   .format(len(doc.sentences), idx)
        sent = doc.sentences[0]
        tokens, words = sent.tokens, sent.words
        assert len(tokens) == len(words), \
               'ERROR: inconsistent lengths of tokens and words for sent {}' \
                   .format(idx)
        for token, word in zip(tokens, words):
            res.append({
                'ID': token.id,
                'FORM': token.text,
                'LEMMA': word.lemma,
                'UPOS': word.upos,
                'XPOS': word.xpos,
                'FEATS': OrderedDict(
                    [(k, v) for k, v in [
                        t.split('=', 1) for t in word.feats.split('|')
                    ]] if word.feats else []
                ),
                'HEAD': None,
                'DEPREL': None,
                'DEPS': None,
                'MISC': OrderedDict(
                    [('NE', token.ner[2:])] if token.ner != 'O' else []
                )
            })

        yield res
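
The two small conversions inside the loop, FEATS string to mapping and BIOES tag to plain entity label, can be looked at in isolation. The helpers below are illustrative only (`parse_feats` and `strip_bioes` are hypothetical names) and mirror what the function above does inline.

```python
from collections import OrderedDict


def parse_feats(feats):
    """Turn 'Case=Dat|Number=Sing' into an ordered mapping; None/'' -> empty."""
    if not feats:
        return OrderedDict()
    return OrderedDict(pair.split('=', 1) for pair in feats.split('|'))


def strip_bioes(tag):
    """Drop the 'B-'/'I-'/'E-'/'S-' prefix; 'O' means no entity."""
    return None if tag == 'O' else tag[2:]


print(parse_feats('Case=Dat|Number=Sing|Person=1'))
print(strip_bioes('E-Geo'))  # Geo
```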

## Prediction and Saving Results to CONLL-U  <a class="anchor" id="save"></a>

In [12]:
Conllu.save(stanza_parse(parsed_corpus_test), 'stanza_syntagrus.conllu',
            fix=True, log_file=None)

2020-04-04 12:08:55 INFO: Loading these models for language: ru (Russian):
| Processor | Package                 |
---------------------------------------
| tokenize  | syntagrus               |
| pos       | ./stanza/s..._tagger.pt |
| lemma     | ./stanza/s...matizer.pt |
| ner       | ./stanza/s...rtagger.pt |

2020-04-04 12:08:55 INFO: Use device: gpu
2020-04-04 12:08:55 INFO: Loading: tokenize
2020-04-04 12:08:55 INFO: Loading: pos
2020-04-04 12:08:59 INFO: Loading: lemma
2020-04-04 12:08:59 INFO: Loading: ner
2020-04-04 12:08:59 INFO: Done loading processors!
100%|██████████| 3798/3798 [02:09<00:00, 29.40it/s]
