# T/V parsing

This notebook has a single purpose: parsing a test file in Russian into the [CoNLL-U](https://universaldependencies.org/format.html) format. The parser output is later used to label sentences with T/V tags based on a set of heuristics.

We use the [DeepPavlov model](http://docs.deeppavlov.ai/en/master/features/models/syntaxparser.html) for joint syntactic and morphological parsing and the [conllu package](https://pypi.org/project/conllu/) to parse the model output.
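
For orientation, each token line in the parser output holds ten tab-separated fields. The sketch below splits one such line using only the standard library. The helper name `parse_token_line` is hypothetical and is not part of DeepPavlov or conllu; the notebook itself relies on the conllu package for the real parsing.

```python
# Field names follow the CoNLL-U format description (ten tab-separated columns).
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_token_line(line: str) -> dict:
    """Split one CoNLL-U token line into a dict (hypothetical helper)."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError(f"expected 10 fields, got {len(values)}")
    token = dict(zip(CONLLU_FIELDS, values))
    # "_" marks an empty field; FEATS is a "|"-separated list of Key=Value pairs.
    feats = token["feats"]
    token["feats"] = (
        {} if feats == "_"
        else dict(pair.split("=", 1) for pair in feats.split("|"))
    )
    return token


token = parse_token_line(
    "1\tВы\tвы\tPRON\t_\tCase=Nom|Number=Plur|Person=2\t2\tnsubj\t_\t_"
)
print(token["upos"], token["feats"]["Number"])
```

All values are kept as strings here, which is enough for inspecting the morphological features.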

### How to run this notebook

- Use Google Colaboratory. This saves space on the local machine and avoids painful dependency conflicts.

- Package installation (1st cell) and model building (4th cell) take some time, so please be patient.

- If parsing fails, remove `%%capture` and inspect the installation logs.

In [1]:
%%capture
! pip install deeppavlov
! python -m deeppavlov install syntax_ru_syntagrus_bert
! pip install russian-tagsets
! pip install conllu
! pip install wget

In [None]:
# Restart the runtime so that the package versions installed from the
# deeppavlov requirements override the default Colab packages.
import os
os.kill(os.getpid(), 9)

In [1]:
import os
import wget
from tqdm import tqdm
from timeit import default_timer as timer
from typing import Tuple, List, Set, Union, Optional

import deeppavlov
from deeppavlov import build_model, configs
import conllu
from conllu import parse as parse_conllu

In [2]:
%%capture
model = build_model("ru_syntagrus_joint_parsing", download=True)

2022-03-27 13:04:26.889 INFO in 'deeppavlov.core.common.file'['file'] at line 32: Interpreting 'ru_syntagrus_joint_parsing' as '/usr/local/lib/python3.7/dist-packages/deeppavlov/configs/syntax/ru_syntagrus_joint_parsing.json'
2022-03-27 13:04:27.662 INFO in 'deeppavlov.core.data.utils'['utils'] at line 95: Downloading from http://files.deeppavlov.ai/deeppavlov_data/bert/rubert_cased_L-12_H-768_A-12_v1.tar.gz to /root/.deeppavlov/downloads/rubert_cased_L-12_H-768_A-12_v1.tar.gz
2022-03-27 13:04:52.68 INFO in 'deeppavlov.core.data.utils'['utils'] at line 272: Extracting /root/.deeppavlov/downloads/rubert_cased_L-12_H-768_A-12_v1.tar.gz archive into /root/.deeppavlov/downloads/bert_models
2022-03-27 13:05:00.772 INFO in 'deeppavlov.core.data.utils'['utils'] at line 95: Downloading from http://files.deeppavlov.ai/deeppavlov_data/morpho_tagger/BERT/morpho_ru_syntagrus_bert.tar.gz to /root/.deeppavlov/models/morpho_ru_syntagrus_bert.tar.gz
2022-03-27 13:05:25.958 INFO in 'deeppavlov.core.dat

In [4]:
test_sentences = ["Вы увидитесь завтра.", "Ты увидишься завтра."]

for conll_str in model(test_sentences):
    print(conll_str, end="\n\n")

1	Вы	вы	PRON	_	Case=Nom|Number=Plur|Person=2	2	nsubj	_	_
2	увидитесь	увидеться	VERB	_	Aspect=Perf|Mood=Ind|Number=Plur|Person=2|Tense=Fut|VerbForm=Fin|Voice=Mid	0	root	_	_
3	завтра	завтра	ADV	_	Degree=Pos	2	advmod	_	_
4	.	.	PUNCT	_	_	2	punct	_	_

1	Ты	ты	PRON	_	Case=Nom|Number=Sing|Person=2	2	nsubj	_	_
2	увидишься	увидеться	VERB	_	Aspect=Perf|Mood=Ind|Number=Sing|Person=2|Tense=Fut|VerbForm=Fin|Voice=Mid	0	root	_	_
3	завтра	завтра	ADV	_	Degree=Pos	2	advmod	_	_
4	.	.	PUNCT	_	_	2	punct	_	_
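
The two parses above differ only in the `Number` feature of the second-person pronoun and verb, which is the kind of signal a T/V heuristic can use. The sketch below shows one illustrative rule, not the set of heuristics used in this project: a sentence is labeled `V` if all its second-person PRON/VERB forms are plural, `T` if all are singular, and `None` otherwise. The function name `guess_tv_label` and the rule itself are assumptions.

```python
from typing import Optional


def guess_tv_label(conll_str: str) -> Optional[str]:
    """Illustrative T/V rule (an assumption, not the project's heuristics)."""
    numbers = set()
    for line in conll_str.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        upos, feats = cols[3], cols[5]
        if upos not in {"PRON", "VERB"} or feats == "_":
            continue
        feat_map = dict(pair.split("=", 1) for pair in feats.split("|"))
        if feat_map.get("Person") == "2" and "Number" in feat_map:
            numbers.add(feat_map["Number"])
    if numbers == {"Plur"}:
        return "V"
    if numbers == {"Sing"}:
        return "T"
    return None  # no second-person form, or mixed evidence


v_parse = "\n".join([
    "1\tВы\tвы\tPRON\t_\tCase=Nom|Number=Plur|Person=2\t2\tnsubj\t_\t_",
    "2\tувидитесь\tувидеться\tVERB\t_\tAspect=Perf|Mood=Ind|Number=Plur"
    "|Person=2|Tense=Fut|VerbForm=Fin|Voice=Mid\t0\troot\t_\t_",
])
t_parse = "\n".join([
    "1\tТы\tты\tPRON\t_\tCase=Nom|Number=Sing|Person=2\t2\tnsubj\t_\t_",
    "2\tувидишься\tувидеться\tVERB\t_\tAspect=Perf|Mood=Ind|Number=Sing"
    "|Person=2|Tense=Fut|VerbForm=Fin|Voice=Mid\t0\troot\t_\t_",
])
print(guess_tv_label(v_parse), guess_tv_label(t_parse))
```

A rule this simple cannot tell polite "вы" from a plural addressee, so it should be read only as a starting point.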



In [9]:
# load test data
git_path = r'https://raw.githubusercontent.com/tsimafeip/TV-distinction/main/'

test_filename = 'tv_model_oracle_labels'
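# Build the URL with plain string concatenation: os.path.join uses the
# OS path separator, which is not guaranteed to be "/" on every platform.
test_gitpath = git_path + 'translations/' + test_filename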

if not os.path.isfile(test_filename):
    wget.download(test_gitpath, test_filename)

In [None]:
# Batching helps to avoid out-of-memory problems.
BATCH_SIZE = 100
# Set START_SENTENCE > 0 to resume an interrupted run; output is then appended.
START_SENTENCE = 0

file_mode = 'a' if START_SENTENCE > 0 else 'w'
with open(test_filename, encoding='utf-8') as input_file, \
        open(test_filename + '.conll', file_mode, encoding='utf-8') as conll_file:

    input_lines = input_file.read().splitlines()

    start = timer()
    print(f"Started parsing of {len(input_lines)} sentences ...")
    for i in tqdm(range(START_SENTENCE, len(input_lines), BATCH_SIZE)):
        cur_batch = input_lines[i:i+BATCH_SIZE]
        # Prefix each parsed sentence with CoNLL-U comment lines, e.g.:
        # # sent_id = 1
        # # text = They buy and sell books.
        conllu_lines = []
        for k, conll_str in enumerate(model(cur_batch)):
            header = f"# sent_id = {i+k}\n# text = {cur_batch[k]}\n"
            conllu_lines.append(header + conll_str + "\n\n")

        conll_file.writelines(conllu_lines)
    print("\nFinished parsing in %f seconds\n" % (timer() - start))

Started parsing of 9684 sentences ...


 90%|█████████ | 9/10 [1:00:26<06:42, 402.83s/it]
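
The batching loop can be tried out without loading the model. The sketch below replaces the parser with a stub (`fake_parser`, hypothetical) and writes to an in-memory buffer, to show how batch offsets map to `sent_id` values. The function `write_conllu` and its parameters are likewise names introduced here for illustration.

```python
import io
from typing import Callable, List, TextIO


def fake_parser(batch: List[str]) -> List[str]:
    """Stand-in for the DeepPavlov model: one fake token line per word."""
    return [
        "\n".join(f"{n}\t{word}" for n, word in enumerate(sent.split(), 1))
        for sent in batch
    ]


def write_conllu(sentences: List[str],
                 parser: Callable[[List[str]], List[str]],
                 out: TextIO,
                 batch_size: int = 2,
                 start: int = 0) -> None:
    """Parse sentences batch by batch and write them with comment headers."""
    for i in range(start, len(sentences), batch_size):
        batch = sentences[i:i + batch_size]
        for k, parsed in enumerate(parser(batch)):
            # i is the batch offset, k the position inside the batch,
            # so i + k is the index of the sentence in the input file.
            out.write(f"# sent_id = {i + k}\n# text = {batch[k]}\n{parsed}\n\n")


buffer = io.StringIO()
sentences = ["a b", "c", "d e f"]
write_conllu(sentences, fake_parser, buffer)
blocks = buffer.getvalue().strip().split("\n\n")
print(len(blocks))
print(blocks[-1])
```

Because `sent_id` is derived from the position in the input, it stays stable regardless of the batch size chosen.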

In [10]:
# CoNLL parsing demo
test_sentences = ["Вы увидите Вашего сына.", "Ты увидишь твоего сына"]

for conll_str in model(test_sentences):
    print(conll_str, end="\n\n")
    conll_token_list = parse_conllu(conll_str)[0]
    # print_tree() prints the tree itself and returns None, so it is not wrapped in print().
    conll_token_list.to_tree().print_tree()
    conll_token_list = conll_token_list.filter(upos=lambda x: x in {'PRON', 'DET', 'VERB'})
    for sample_token in conll_token_list[:3]:
        print(sample_token, type(sample_token))
        print(sample_token.items())
    print()

1	Вы	вы	PRON	_	Case=Nom|Number=Plur|Person=2	2	nsubj	_	_
2	увидите	увидеть	VERB	_	Aspect=Perf|Mood=Ind|Number=Plur|Person=2|Tense=Fut|VerbForm=Fin|Voice=Act	0	root	_	_
3	Вашего	ваш	DET	_	Case=Acc|Gender=Masc|Number=Sing	4	det	_	_
4	сына	сын	NOUN	_	Animacy=Anim|Case=Acc|Gender=Masc|Number=Sing	2	obj	_	_
5	.	.	PUNCT	_	_	2	punct	_	_

(deprel:root) form:увидите lemma:увидеть upos:VERB [2]
    (deprel:nsubj) form:Вы lemma:вы upos:PRON [1]
    (deprel:obj) form:сына lemma:сын upos:NOUN [4]
        (deprel:det) form:Вашего lemma:ваш upos:DET [3]
    (deprel:punct) form:. lemma:. upos:PUNCT [5]
Вы <class 'conllu.models.Token'>
dict_items([('id', 1), ('form', 'Вы'), ('lemma', 'вы'), ('upos', 'PRON'), ('xpos', None), ('feats', {'Case': 'Nom', 'Number': 'Plur', 'Person': '2'}), ('head', 2), ('deprel', 'nsubj'), ('deps', None), ('misc', None)])
увидите <class 'conllu.models.Token'>
dict_items([('id', 2), ('form', 'увидите'), ('lemma', 'увидеть'), ('upos', 'VERB'), ('xpos', None), ('feats', {'