# Lab 4. Morphological disambiguation using machine learning methods

Write a program that takes a disambiguated corpus as input (something from shared tasks, the Russian National Corpus (НКРЯ), Universal Dependencies, …), trains a machine learning model (any model, deep networks included; CRF or Random Forest by default), and disambiguates the text supplied as input (for example, the test part of the corpus).

### Grading criteria:

- 2 - a tagged corpus is used
- 4 - a model is trained
- 2 - the program performs disambiguation
- 2 - elegance of the solution (the instructor's subjective opinion)


In [100]:
import pathlib

In [101]:
datadir = pathlib.Path().cwd() / "data"

In [102]:
train_filenames = [
    "processed_ru_syntagrus-ud-train-a.conllu",
    "processed_ru_syntagrus-ud-train-b.conllu",
    "processed_ru_syntagrus-ud-train-c.conllu",
]

test_filename = "processed_ru_syntagrus-ud-test.conllu"

In [103]:
train_data = []

for filename in train_filenames:
    with open(datadir / filename, "r", encoding="utf-8") as file:
        train_data.extend(file.read().rstrip().split("\n\n"))

with open(datadir / test_filename, "r", encoding="utf-8") as file:
    test_data = file.read().rstrip().split("\n\n")

In [104]:
len(train_data), len(test_data)

(69630, 8800)

In [105]:
print(train_data[1])

Начальник	POS=NOUN|Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing
областного	POS=ADJ|Case=Gen|Degree=Pos|Gender=Neut|Number=Sing
управления	POS=NOUN|Animacy=Inan|Case=Gen|Gender=Neut|Number=Sing
связи	POS=NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing
Семен	POS=PROPN|Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing
Еремеевич	POS=PROPN|Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing
был	POS=AUX|Aspect=Imp|Gender=Masc|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act
человек	POS=NOUN|Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing
простой	POS=ADJ|Case=Nom|Degree=Pos|Gender=Masc|Number=Sing
,	POS=PUNCT
приходил	POS=VERB|Aspect=Imp|Gender=Masc|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act
на	POS=ADP
работу	POS=NOUN|Animacy=Inan|Case=Acc|Gender=Fem|Number=Sing
всегда	POS=ADV|Degree=Pos
вовремя	POS=ADV|Degree=Pos
,	POS=PUNCT
здоровался	POS=VERB|Aspect=Imp|Gender=Masc|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Mid
с	POS=ADP
секретаршей	POS=NOUN|Animacy=Anim|Case=Ins|Gender=Fe

In [106]:
def prep_entry(text: str) -> list[list[str]]:
    """Split one sentence block into [token, grammar] pairs, skipping '#' comment lines."""
    return [line.split("\t") for line in text.split("\n") if not line.startswith("#")]

In [107]:
def split_grammar(line:str) -> dict[str, str]:
    return {
        (p := pair.split("="))[0]: p[1]
        for pair
        in line.split("|")
    }
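
As a quick check of `split_grammar`, the sketch below applies it to one annotation string in the `Key=Value|Key=Value` format shown in the corpus printout above. The function is restated in a slightly different form so the snippet runs on its own; the sample string is copied from that printout.

```python
def split_grammar(line: str) -> dict[str, str]:
    """Turn 'POS=NOUN|Case=Nom' into {'POS': 'NOUN', 'Case': 'Nom'}."""
    return dict(pair.split("=", 1) for pair in line.split("|"))


sample = "POS=NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing"
parsed = split_grammar(sample)
print(parsed["POS"], parsed["Case"])  # NOUN Gen
```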

In [108]:
all_tags: dict[str, set] = dict()
word_dict: dict[str, dict[str, int]] = {}

In [109]:
for sent in train_data:
    lines = prep_entry(sent)
    for line in lines:
        word, grammar = line
        word_lower = word.lower()

        # Update the per-word dictionary of parse counts
        d = word_dict.get(word_lower, dict())
        d[grammar] = d.get(grammar, 0) + 1
        word_dict[word_lower] = d

        # Update the inventory of values seen for each tag
        for tag, val in split_grammar(grammar).items():
            all_tags[tag] = all_tags.get(tag, {"[PAD]"}) | {val}

In [110]:
word_dict["стали"]

{'POS=VERB|Aspect=Perf|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Act': 223,
 'POS=VERB|Aspect=Perf|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Mid': 144,
 'POS=NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing': 19,
 'POS=NOUN|Animacy=Inan|Case=Nom|Gender=Fem|Number=Plur': 1,
 'POS=NOUN|Animacy=Inan|Case=Dat|Gender=Fem|Number=Sing': 1}

In [111]:
sum(
    [
        sum(word_dict[word].values())
        for word
        in word_dict
    ]
)

1206302

In [112]:
# Drop every parse that occurs fewer than THRESHOLD times

THRESHOLD = 6

for word in word_dict:
    word_dict[word] = {
        k: val
        for k, val
        in word_dict[word].items()
        if val >= THRESHOLD
    }


In [113]:
sum(
    [
        sum(word_dict[word].values())
        for word
        in word_dict
    ]
)

1000132

In [114]:
word_dict["стали"]

{'POS=VERB|Aspect=Perf|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Act': 223,
 'POS=VERB|Aspect=Perf|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Mid': 144,
 'POS=NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing': 19}
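
The effect of the threshold can be reproduced on a toy dictionary. In the sketch below the counts for `стали` are the ones printed above, but the parse labels are abbreviated stand-ins; the second word and its count are made up for illustration, and `prune_rare_parses` is a hypothetical helper name rather than part of the notebook.

```python
def prune_rare_parses(word_dict, threshold):
    """Keep only the parses seen at least `threshold` times for each word."""
    return {
        word: {parse: n for parse, n in parses.items() if n >= threshold}
        for word, parses in word_dict.items()
    }


toy = {
    "стали": {
        "VERB/Act": 223,
        "VERB/Mid": 144,
        "NOUN/Gen/Sing": 19,
        "NOUN/Nom/Plur": 1,
        "NOUN/Dat/Sing": 1,
    },
    "rare_word": {"PROPN": 2},  # made-up entry: every parse is below the threshold
}

pruned = prune_rare_parses(toy, threshold=6)
print(pruned["стали"])      # the two rare NOUN parses are gone
print(pruned["rare_word"])  # {} - the word stays as a key but has no parses left
```

A word left with an empty dictionary is what the taggers in this notebook later treat as unknown.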

In [115]:
test_sent = list(map(lambda pair: pair[0], prep_entry(test_data[0])))
test_sent

['В',
 'советский',
 'период',
 'времени',
 'число',
 'ИТ',
 '-',
 'специалистов',
 'в',
 'Армении',
 'составляло',
 'около',
 'десяти',
 'тысяч',
 '.']

In [116]:
class BaseMorphotagger:

    def __init__(self, entries):
        self.word_dict = entries

    def disambig(self, d: dict):
        pass

    def parse_unknown(self, tok, **kwargs):
        return "[UNKNOWN]"

    def parse_known(self, tok, **kwargs):
        d = kwargs["d"]
        grammar = max(
            d,
            key=d.get
        )

        if len(d) > 1:
            grammar = "[AMBIG]: " + grammar

        else:
            grammar = "[SING]: " + grammar

        return grammar

    def parse_tokens(self, tokens: list[str]):
        result = []

        for tok in tokens:

            tok_lower = tok.lower()
            d = self.word_dict.get(tok_lower, {})  # unseen tokens fall through to parse_unknown

            if not d:
                grammar = self.parse_unknown(tok_lower, tokens=tokens, grammar=result)

            else:
                grammar = self.parse_known(tok_lower, d=d, tokens=tokens, grammar=result)

            result.append(grammar)

        return result

In [117]:
morphotagger = BaseMorphotagger(word_dict)

In [118]:
for tok, grammar in zip(test_sent, morphotagger.parse_tokens(test_sent)):
    print(tok, grammar, sep="\t")

В	[SING]: POS=ADP
советский	[AMBIG]: POS=ADJ|Case=Nom|Degree=Pos|Gender=Masc|Number=Sing
период	[AMBIG]: POS=NOUN|Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing
времени	[AMBIG]: POS=NOUN|Animacy=Inan|Case=Gen|Gender=Neut|Number=Sing
число	[AMBIG]: POS=NOUN|Animacy=Inan|Case=Nom|Gender=Neut|Number=Sing
ИТ	[UNKNOWN]
-	[SING]: POS=PUNCT
специалистов	[AMBIG]: POS=NOUN|Animacy=Anim|Case=Gen|Gender=Masc|Number=Plur
в	[SING]: POS=ADP
Армении	[UNKNOWN]
составляло	[SING]: POS=VERB|Aspect=Imp|Gender=Neut|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act
около	[SING]: POS=ADP
десяти	[AMBIG]: POS=NUM|Case=Gen|NumType=Card
тысяч	[SING]: POS=NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Plur
.	[SING]: POS=PUNCT


In [119]:
all_tags

{'POS': {'ADJ',
  'ADP',
  'ADV',
  'AUX',
  'CCONJ',
  'DET',
  'INTJ',
  'NOUN',
  'NUM',
  'PART',
  'PRON',
  'PROPN',
  'PUNCT',
  'SCONJ',
  'SYM',
  'VERB',
  'X',
  '[PAD]',
  '_'},
 'Animacy': {'Anim', 'Inan', '[PAD]'},
 'Case': {'Acc', 'Dat', 'Gen', 'Ins', 'Loc', 'Nom', 'Par', 'Voc', '[PAD]'},
 'Gender': {'Fem', 'Masc', 'Neut', '[PAD]'},
 'Number': {'Plur', 'Sing', '[PAD]'},
 'Degree': {'Cmp', 'Pos', 'Sup', '[PAD]'},
 'Aspect': {'Imp', 'Perf', '[PAD]'},
 'Mood': {'Cnd', 'Imp', 'Ind', '[PAD]'},
 'Tense': {'Fut', 'Past', 'Pres', '[PAD]'},
 'VerbForm': {'Conv', 'Fin', 'Inf', 'Part', '[PAD]'},
 'Voice': {'Act', 'Mid', 'Pass', '[PAD]'},
 'Person': {'1', '2', '3', '[PAD]'},
 'PronType': {'Dem', 'Ind', 'Int,Rel', 'Neg', 'Prs', 'Tot', '[PAD]'},
 'Variant': {'Short', '[PAD]'},
 'NumType': {'Card', '[PAD]'},
 'Poss': {'Yes', '[PAD]'},
 'Reflex': {'Yes', '[PAD]'},
 'Foreign': {'Yes', '[PAD]'},
 'Polarity': {'Neg', '[PAD]'},
 'Typo': {'Yes', '[PAD]'},
 'Abbr': {'Yes', '[PAD]'}}

In [120]:
all_tags["POS"].remove('_')

In [121]:
all_tags["POS"]

{'ADJ',
 'ADP',
 'ADV',
 'AUX',
 'CCONJ',
 'DET',
 'INTJ',
 'NOUN',
 'NUM',
 'PART',
 'PRON',
 'PROPN',
 'PUNCT',
 'SCONJ',
 'SYM',
 'VERB',
 'X',
 '[PAD]'}

In [122]:
to_del = []

for tag in all_tags:
    if len(all_tags[tag]) < 3:
        print(tag)
        to_del.append(tag)

for tag in to_del:
    del all_tags[tag]

Variant
NumType
Poss
Reflex
Foreign
Polarity
Typo
Abbr


In [123]:
all_tags

{'POS': {'ADJ',
  'ADP',
  'ADV',
  'AUX',
  'CCONJ',
  'DET',
  'INTJ',
  'NOUN',
  'NUM',
  'PART',
  'PRON',
  'PROPN',
  'PUNCT',
  'SCONJ',
  'SYM',
  'VERB',
  'X',
  '[PAD]'},
 'Animacy': {'Anim', 'Inan', '[PAD]'},
 'Case': {'Acc', 'Dat', 'Gen', 'Ins', 'Loc', 'Nom', 'Par', 'Voc', '[PAD]'},
 'Gender': {'Fem', 'Masc', 'Neut', '[PAD]'},
 'Number': {'Plur', 'Sing', '[PAD]'},
 'Degree': {'Cmp', 'Pos', 'Sup', '[PAD]'},
 'Aspect': {'Imp', 'Perf', '[PAD]'},
 'Mood': {'Cnd', 'Imp', 'Ind', '[PAD]'},
 'Tense': {'Fut', 'Past', 'Pres', '[PAD]'},
 'VerbForm': {'Conv', 'Fin', 'Inf', 'Part', '[PAD]'},
 'Voice': {'Act', 'Mid', 'Pass', '[PAD]'},
 'Person': {'1', '2', '3', '[PAD]'},
 'PronType': {'Dem', 'Ind', 'Int,Rel', 'Neg', 'Prs', 'Tot', '[PAD]'}}

In [124]:
id2tags = {k: None for k in all_tags}
tag2ids = {k: None for k in all_tags}

for k in all_tags:
    id2tags[k] = {i: val for i, val in enumerate(all_tags[k])}

for k in id2tags:
    tag2ids[k] = {val: key for key, val in id2tags[k].items()}

for k in all_tags:
    zeroth_elem = id2tags[k][0]
    if zeroth_elem != "[PAD]":

        pad_idx = tag2ids[k]["[PAD]"]
        id2tags[k][0] = "[PAD]"
        id2tags[k][pad_idx] = zeroth_elem

        tag2ids[k]["[PAD]"] = 0
        tag2ids[k][zeroth_elem] = pad_idx


In [125]:
tag2ids

{'POS': {'NUM': 16,
  'SYM': 1,
  'CCONJ': 2,
  'PROPN': 3,
  'ADV': 4,
  'VERB': 5,
  'X': 6,
  'NOUN': 7,
  'ADJ': 8,
  'PRON': 9,
  'DET': 10,
  'PUNCT': 11,
  'PART': 12,
  'AUX': 13,
  'SCONJ': 14,
  'INTJ': 15,
  '[PAD]': 0,
  'ADP': 17},
 'Animacy': {'Anim': 2, 'Inan': 1, '[PAD]': 0},
 'Case': {'Par': 4,
  'Acc': 1,
  'Loc': 2,
  'Voc': 3,
  '[PAD]': 0,
  'Ins': 5,
  'Gen': 6,
  'Nom': 7,
  'Dat': 8},
 'Gender': {'[PAD]': 0, 'Fem': 1, 'Neut': 2, 'Masc': 3},
 'Number': {'Sing': 2, 'Plur': 1, '[PAD]': 0},
 'Degree': {'Pos': 1, '[PAD]': 0, 'Sup': 2, 'Cmp': 3},
 'Aspect': {'Imp': 1, '[PAD]': 0, 'Perf': 2},
 'Mood': {'[PAD]': 0, 'Ind': 1, 'Imp': 2, 'Cnd': 3},
 'Tense': {'Fut': 3, 'Past': 1, 'Pres': 2, '[PAD]': 0},
 'VerbForm': {'Conv': 3, 'Fin': 1, 'Inf': 2, '[PAD]': 0, 'Part': 4},
 'Voice': {'Pass': 1, '[PAD]': 0, 'Mid': 2, 'Act': 3},
 'Person': {'2': 2, '1': 1, '[PAD]': 0, '3': 3},
 'PronType': {'Dem': 1,
  '[PAD]': 0,
  'Ind': 2,
  'Tot': 3,
  'Prs': 4,
  'Neg': 5,
  'Int,Rel': 6}}

In [126]:
id2tags

{'POS': {0: '[PAD]',
  1: 'SYM',
  2: 'CCONJ',
  3: 'PROPN',
  4: 'ADV',
  5: 'VERB',
  6: 'X',
  7: 'NOUN',
  8: 'ADJ',
  9: 'PRON',
  10: 'DET',
  11: 'PUNCT',
  12: 'PART',
  13: 'AUX',
  14: 'SCONJ',
  15: 'INTJ',
  16: 'NUM',
  17: 'ADP'},
 'Animacy': {0: '[PAD]', 1: 'Inan', 2: 'Anim'},
 'Case': {0: '[PAD]',
  1: 'Acc',
  2: 'Loc',
  3: 'Voc',
  4: 'Par',
  5: 'Ins',
  6: 'Gen',
  7: 'Nom',
  8: 'Dat'},
 'Gender': {0: '[PAD]', 1: 'Fem', 2: 'Neut', 3: 'Masc'},
 'Number': {0: '[PAD]', 1: 'Plur', 2: 'Sing'},
 'Degree': {0: '[PAD]', 1: 'Pos', 2: 'Sup', 3: 'Cmp'},
 'Aspect': {0: '[PAD]', 1: 'Imp', 2: 'Perf'},
 'Mood': {0: '[PAD]', 1: 'Ind', 2: 'Imp', 3: 'Cnd'},
 'Tense': {0: '[PAD]', 1: 'Past', 2: 'Pres', 3: 'Fut'},
 'VerbForm': {0: '[PAD]', 1: 'Fin', 2: 'Inf', 3: 'Conv', 4: 'Part'},
 'Voice': {0: '[PAD]', 1: 'Pass', 2: 'Mid', 3: 'Act'},
 'Person': {0: '[PAD]', 1: '1', 2: '2', 3: '3'},
 'PronType': {0: '[PAD]',
  1: 'Dem',
  2: 'Ind',
  3: 'Tot',
  4: 'Prs',
  5: 'Neg',
  6: 'Int,Rel'}}
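
The IDs above come from enumerating a Python `set`, and the iteration order of a set of strings can differ between interpreter runs, so a rebuilt mapping may not line up with a model trained earlier. A minimal sketch of a reproducible alternative, assuming it is acceptable to sort the values and pin `[PAD]` to index 0; `build_vocab` is a hypothetical helper, not the notebook's code:

```python
def build_vocab(values, pad="[PAD]"):
    """Map values to stable integer ids, with the padding symbol at 0."""
    ordered = [pad] + sorted(v for v in values if v != pad)
    id2tag = dict(enumerate(ordered))
    tag2id = {tag: i for i, tag in id2tag.items()}
    return id2tag, tag2id


id2tag, tag2id = build_vocab({"Anim", "Inan", "[PAD]"})
print(tag2id)  # {'[PAD]': 0, 'Anim': 1, 'Inan': 2}
```

With this construction the swap step that moves `[PAD]` to position 0 is no longer needed.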

In [127]:
set(all_tags.keys())

{'Animacy',
 'Aspect',
 'Case',
 'Degree',
 'Gender',
 'Mood',
 'Number',
 'POS',
 'Person',
 'PronType',
 'Tense',
 'VerbForm',
 'Voice'}

In [128]:
sorted_keys = sorted(all_tags.keys())

In [129]:
sorted_keys

['Animacy',
 'Aspect',
 'Case',
 'Degree',
 'Gender',
 'Mood',
 'Number',
 'POS',
 'Person',
 'PronType',
 'Tense',
 'VerbForm',
 'Voice']

In [130]:
def get_grammar_dict(line:str, allowed_tags=set(all_tags.keys())):
    pairs = split_grammar(line)
    return {
        k: tag2ids.get(
            k,
            dict()
        ).get(
            pairs.get(k, "[PAD]"),
            0
        )
        for k
        in allowed_tags
    }

In [131]:
get_grammar_dict("POS=NOUN|Animacy=Inan")

{'Tense': 0,
 'Number': 0,
 'Gender': 0,
 'Case': 0,
 'Aspect': 0,
 'VerbForm': 0,
 'POS': 7,
 'Animacy': 1,
 'Mood': 0,
 'Person': 0,
 'Degree': 0,
 'Voice': 0,
 'PronType': 0}

In [132]:
def get_grammar_vector(line:str, allowed_tags=set(all_tags.keys())):
    grammar_dict = get_grammar_dict(line, allowed_tags=allowed_tags)
    return [grammar_dict[k] for k in sorted(grammar_dict.keys())]

In [133]:
# Values follow the alphabetical order of the tag names
get_grammar_vector("POS=NOUN|Animacy=Inan")

[1, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0]

In [134]:
def grammar_vector_to_dict(vector, sorted_keys=sorted_keys):
    return {
        sorted_keys[i]: val
        for i, val
        in enumerate(vector)
    }

In [135]:
grammar_vector_to_dict(
    get_grammar_vector("POS=NOUN|Animacy=Inan")
)

{'Animacy': 1,
 'Aspect': 0,
 'Case': 0,
 'Degree': 0,
 'Gender': 0,
 'Mood': 0,
 'Number': 0,
 'POS': 7,
 'Person': 0,
 'PronType': 0,
 'Tense': 0,
 'VerbForm': 0,
 'Voice': 0}

In [136]:
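def grammar_dict_to_line(grammar_dict, id2tags=id2tags, pad_idx=0):
    # `grammar_dict` maps tag name -> value id; padded (absent) tags are skipped
    return "|".join(
        [
            f"{tag}={id2tags[tag][grammar_dict[tag]]}"
            for tag
            in grammar_dict
            if grammar_dict[tag] != pad_idx
        ]
    )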

In [137]:
grammar_dict_to_line(
    grammar_vector_to_dict(
        get_grammar_vector("POS=NOUN|Animacy=Inan")
    )
)

'Animacy=Inan|POS=NOUN'
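
The round trip is lossy: any category removed from `all_tags` (for example `NumType`) has no slot in the vector, so two annotation strings that differ only in such a category encode identically. The sketch below shows this with made-up toy id tables for two categories; `encode` is a hypothetical, simplified counterpart of `get_grammar_vector`.

```python
# Toy id tables for two categories; the ids are illustrative only.
tag2ids = {
    "POS": {"[PAD]": 0, "NUM": 1, "NOUN": 2},
    "Case": {"[PAD]": 0, "Nom": 1, "Gen": 2},
}
sorted_keys = sorted(tag2ids)  # ['Case', 'POS']


def encode(line):
    """Encode 'K=V|K=V' as one id per known category, in sorted key order."""
    pairs = dict(p.split("=", 1) for p in line.split("|"))
    return [tag2ids[k].get(pairs.get(k, "[PAD]"), 0) for k in sorted_keys]


# NumType is not in the table, so both strings collapse to the same vector.
print(encode("POS=NUM|NumType=Card"))  # [0, 1]
print(encode("POS=NUM"))               # [0, 1]
```

A classifier fed such vectors cannot tell these candidates apart and will assign them the same score.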

In [138]:
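def extract_features(sentence, index, neg=False, word_dict=word_dict):
    """Build one training row for the token at `index`.

    The row holds three casing/digit flags, the encoded parses of the three
    preceding tokens, and the encoded candidate parse. With `neg=True` the
    candidate is replaced by a wrong parse and the returned label is 0.
    """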

    word, target_tags = sentence[index]

    # word = word.lower()

    if index - 1 >= 0:
        _, prev_tag_1 = sentence[index - 1]
    else:
        _, prev_tag_1 = "[PAD]", "DUMMY=dummy|dummy=DUMMY"

    # prev_tag_1 = get_grammar_dict(prev_tag_1)
    prev_tag_1 = get_grammar_vector(prev_tag_1)

    if index - 2 >= 0:
        _, prev_tag_2 = sentence[index - 2]
    else:
        _, prev_tag_2 = "[PAD]", "DUMMY=dummy|dummy=DUMMY"

    prev_tag_2 = get_grammar_vector(prev_tag_2)

    if index - 3 >= 0:
        _, prev_tag_3 = sentence[index - 3]
    else:
        _, prev_tag_3 = "[PAD]", "DUMMY=dummy|dummy=DUMMY"

    prev_tag_3 = get_grammar_vector(prev_tag_3)

    # prev_tag_2 = get_grammar_dict(prev_tag_2)
    

    target_vector = get_grammar_vector(target_tags)

    if neg:
        for parse in word_dict[word.lower()]:
            vector = get_grammar_vector(parse)
            if vector != target_vector:
                target_vector = vector
                break
        else:
            target_vector = [0]*len(target_vector)

    training_row = [
        int(word.isupper()),
        int(word.istitle()),
        int(word.isdigit()),
        *prev_tag_1,
        *prev_tag_2,
        *prev_tag_3,
        *target_vector
    ]

    return word.lower(), training_row, int(not neg)

In [139]:
word, trr, tstr = extract_features(prep_entry(train_data[1]), 0)
word, word_dict.get(word.lower(), "DUMMY=dummy|dummy=DUMMY"), tstr

('начальник',
 {'POS=NOUN|Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing': 72},
 1)

In [140]:
word, trr, tstr = extract_features(prep_entry(train_data[1]), 1)
word, word_dict.get(word.lower(), "DUMMY=dummy|dummy=DUMMY"), tstr

('областного', {'POS=ADJ|Case=Gen|Degree=Pos|Gender=Masc|Number=Sing': 6}, 1)

In [141]:
word, trr, tstr = extract_features(prep_entry(train_data[1]), 2)
word, word_dict.get(word.lower(), "DUMMY=dummy|dummy=DUMMY"), tstr

('управления',
 {'POS=NOUN|Animacy=Inan|Case=Gen|Gender=Neut|Number=Sing': 185,
  'POS=PROPN|Animacy=Inan|Case=Gen|Gender=Neut|Number=Sing': 10},
 1)

In [142]:
word, trr, tstr = extract_features(prep_entry(train_data[1]), 2, neg=True)
word, word_dict.get(word.lower(), "DUMMY=dummy|dummy=DUMMY"), tstr

('управления',
 {'POS=NOUN|Animacy=Inan|Case=Gen|Gender=Neut|Number=Sing': 185,
  'POS=PROPN|Animacy=Inan|Case=Gen|Gender=Neut|Number=Sing': 10},
 0)

In [143]:
trr

[0,
 0,
 0,
 0,
 0,
 6,
 1,
 2,
 0,
 2,
 8,
 0,
 0,
 0,
 0,
 0,
 2,
 0,
 7,
 0,
 3,
 0,
 2,
 7,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 6,
 0,
 2,
 0,
 2,
 3,
 0,
 0,
 0,
 0,
 0]

In [144]:
tstr

0
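
The `neg=True` branch builds a negative example by swapping the gold parse for the first different parse that the dictionary lists for the same word, and falls back to an all-padding vector when no alternative exists. A stripped-down sketch of that choice follows. It compares parse strings, whereas the notebook compares encoded vectors, and `pick_negative` is a hypothetical name.

```python
def pick_negative(gold, candidates):
    """Return the first candidate parse that differs from the gold one, else None."""
    return next((parse for parse in candidates if parse != gold), None)


# Candidate parses for 'управления', copied from the output above.
candidates = {
    "POS=NOUN|Animacy=Inan|Case=Gen|Gender=Neut|Number=Sing": 185,
    "POS=PROPN|Animacy=Inan|Case=Gen|Gender=Neut|Number=Sing": 10,
}
gold = "POS=NOUN|Animacy=Inan|Case=Gen|Gender=Neut|Number=Sing"

negative = pick_negative(gold, candidates)
print(negative)

# With a single known parse there is nothing to swap in.
print(pick_negative(gold, {gold: 185}))  # None
```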

In [145]:
def gen_training_row(train_data):
    while True:
        for i, entry in enumerate(train_data):
            parsed_entry = prep_entry(entry)
            for j in range(len(parsed_entry)):
                yield extract_features(parsed_entry, j)
                yield extract_features(parsed_entry, j, neg=True)


In [146]:
gen_train = gen_training_row(train_data)
gen_test  = gen_training_row(test_data)

In [147]:
from lightgbm import LGBMClassifier
import numpy as np
from tqdm import tqdm

In [148]:
model = LGBMClassifier(
    objective="binary",
    metric="auc",
    boosting_type="gbdt",
    num_leaves=31,
    learning_rate=0.05,
    verbose=-1
)

In [149]:
EPOCHS = 5000
BATCH_SIZE = 15000

In [150]:
for epoch in tqdm(range(1, EPOCHS + 1)):
    X_bar = []
    y_bar = []

    for _ in range(BATCH_SIZE):
        word, features, labels = next(gen_train)
        X_bar.append(features)
        y_bar.append(labels)

    X_bar = np.array(X_bar)
    y_bar = np.array(y_bar)

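    # Caveat: without `init_model`, each `fit` call trains a new booster from
    # scratch, so the resulting model reflects only the last batch.
    model.fit(X_bar, y_bar)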

100%|██████████| 5000/5000 [36:17<00:00,  2.30it/s] 
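
Each pass of the loop above pulls `BATCH_SIZE` rows from the endless generator. The same batching can be written with `itertools.islice`; the sketch below uses a toy stream in place of `gen_train`, and `take_batch` is a hypothetical helper.

```python
from itertools import cycle, islice


def take_batch(rows, batch_size):
    """Pull the next `batch_size` (word, features, label) rows and unzip them."""
    batch = list(islice(rows, batch_size))
    _, features, labels = zip(*batch)
    return list(features), list(labels)


# Toy stand-in for gen_train: an endless stream of (word, features, label).
toy_rows = cycle([("a", [1, 0], 1), ("a", [0, 0], 0), ("b", [0, 1], 1)])

X, y = take_batch(toy_rows, 4)
print(X)  # [[1, 0], [0, 0], [0, 1], [1, 0]]
print(y)  # [1, 0, 1, 1]
```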


In [160]:
# DEBUG = False
DEBUG = True

In [161]:
def disambiguate_sentence(tokens:list[str], model:LGBMClassifier):
    prep_tokens = []

    for tok in tokens:
        entries = word_dict.get(tok.lower(), dict())

        if len(entries) == 0:
            prep_tokens.append([tok, "Unkn=Unkn"])

        elif len(entries) == 1:
            prep_tokens.append([tok, next(iter(entries))])

        else:
            parses_probas = []

            for parse in entries:
                try_tokens = prep_tokens.copy() + [[tok, parse]]
                feats = np.array(extract_features(try_tokens, len(try_tokens)-1)[1]).reshape(1, -1)

                proba = model.predict_proba(feats, verbose=-1)[0, -1]
                if DEBUG:
                    print(tok, parse, proba, sep="\t")
                parses_probas.append([parse, proba])

            decided_entry = max(parses_probas, key = lambda pair: pair[1])[0]
            prep_tokens.append([tok, decided_entry])


    return prep_tokens
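
`disambiguate_sentence` works greedily from left to right: unknown tokens get a placeholder, unambiguous tokens take their only parse, and each ambiguous token takes the parse scored highest given the parses already chosen. The sketch below isolates that control flow. The `score` callback is a hypothetical stand-in for `model.predict_proba`, and the lexicon and scoring rule are toy assumptions.

```python
def greedy_disambiguate(tokens, lexicon, score):
    """Pick one parse per token, left to right, using score(history, token, parse)."""
    chosen = []
    for tok in tokens:
        parses = list(lexicon.get(tok.lower(), {}))
        if not parses:
            best = "Unkn=Unkn"
        elif len(parses) == 1:
            best = parses[0]
        else:
            best = max(parses, key=lambda p: score(chosen, tok, p))
        chosen.append((tok, best))
    return chosen


# Toy lexicon and a stub scorer that prefers a noun right after an adposition.
lexicon = {"в": {"POS=ADP": 10}, "стали": {"POS=VERB": 5, "POS=NOUN": 2}}


def score(history, tok, parse):
    after_adp = bool(history) and history[-1][1] == "POS=ADP"
    return 1.0 if (parse == "POS=NOUN") == after_adp else 0.0


result = greedy_disambiguate(["В", "стали", "дело"], lexicon, score)
print(result)
```

Because earlier decisions are never revisited, a wrong choice early in the sentence is passed on as context to every later token.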

### Example 1: part of speech

In [168]:
disambiguate_sentence("В итоге они стали разбойниками .".split(), model)

они	POS=PRON|Case=Nom|Number=Plur|Person=3|PronType=Prs	0.8574635691041903
они	POS=PRON|Case=Nom|Number=Plur|Person=3	0.0704212368055494
стали	POS=VERB|Aspect=Perf|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Act	0.967774957402844
стали	POS=VERB|Aspect=Perf|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Mid	0.9707824405681388
стали	POS=NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing	0.8097826302937551




[['В', 'POS=ADP'],
 ['итоге', 'POS=NOUN|Animacy=Inan|Case=Loc|Gender=Masc|Number=Sing'],
 ['они', 'POS=PRON|Case=Nom|Number=Plur|Person=3|PronType=Prs'],
 ['стали',
  'POS=VERB|Aspect=Perf|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Mid'],
 ['разбойниками', 'Unkn=Unkn'],
 ['.', 'POS=PUNCT']]

In [169]:
disambiguate_sentence("Завод , производящий стали .".split(), model)

Завод	POS=NOUN|Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing	0.8747726228080009
Завод	POS=NOUN|Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing	0.3022019757904673
стали	POS=VERB|Aspect=Perf|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Act	0.9680747421354197
стали	POS=VERB|Aspect=Perf|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Mid	0.9701506220205404
стали	POS=NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing	0.8206447256887596




[['Завод', 'POS=NOUN|Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing'],
 [',', 'POS=PUNCT'],
 ['производящий', 'Unkn=Unkn'],
 ['стали',
  'POS=VERB|Aspect=Perf|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Mid'],
 ['.', 'POS=PUNCT']]

In [170]:
# Not enough context
disambiguate_sentence("стали , которые мы производим .".split(), model)

стали	POS=VERB|Aspect=Perf|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Act	0.9680747421354197
стали	POS=VERB|Aspect=Perf|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Mid	0.9701506220205404
стали	POS=NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing	0.7618211837673081
которые	POS=PRON|Case=Acc|PronType=Int,Rel	0.5224583542141229
которые	POS=PRON|Case=Nom|PronType=Int,Rel	0.2449058229316219
которые	POS=PRON|Case=Nom|Number=Plur	0.06184213261852183
которые	POS=PRON|Animacy=Inan|Case=Acc|Number=Plur	0.06719047634263982
мы	POS=PRON|Case=Nom|Number=Plur|Person=1|PronType=Prs	0.8455422376251126
мы	POS=PRON|Case=Nom|Number=Plur|Person=1	0.06054985085902337




[['стали',
  'POS=VERB|Aspect=Perf|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Mid'],
 [',', 'POS=PUNCT'],
 ['которые', 'POS=PRON|Case=Acc|PronType=Int,Rel'],
 ['мы', 'POS=PRON|Case=Nom|Number=Plur|Person=1|PronType=Prs'],
 ['производим', 'Unkn=Unkn'],
 ['.', 'POS=PUNCT']]

### Example 2: case

In [171]:
disambiguate_sentence("Для установления связи с пришельцами .".split(), model)

связи	POS=NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing	0.8713172024249778
связи	POS=NOUN|Animacy=Inan|Case=Loc|Gender=Fem|Number=Sing	0.1756566290122286
связи	POS=NOUN|Animacy=Inan|Case=Acc|Gender=Fem|Number=Plur	0.19450508188085774
связи	POS=NOUN|Animacy=Inan|Case=Nom|Gender=Fem|Number=Plur	0.7420672195614567




[['Для', 'POS=ADP'],
 ['установления', 'POS=NOUN|Animacy=Inan|Case=Gen|Gender=Neut|Number=Sing'],
 ['связи', 'POS=NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing'],
 ['с', 'POS=ADP'],
 ['пришельцами', 'Unkn=Unkn'],
 ['.', 'POS=PUNCT']]

In [172]:
disambiguate_sentence("Наши связи с коллегами .".split(), model)

Наши	POS=DET|Case=Nom|Number=Plur|Poss=Yes|PronType=Prs	0.6362390043419599
Наши	POS=DET|Case=Acc|Number=Plur|Poss=Yes|PronType=Prs	0.5534279071909028
Наши	POS=DET|Animacy=Inan|Case=Acc|Number=Plur	0.06601800849158747
Наши	POS=DET|Case=Nom|Number=Plur	0.07254468550448342
связи	POS=NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing	0.8097826302937551
связи	POS=NOUN|Animacy=Inan|Case=Loc|Gender=Fem|Number=Sing	0.22106952734012555
связи	POS=NOUN|Animacy=Inan|Case=Acc|Gender=Fem|Number=Plur	0.19236769378562488
связи	POS=NOUN|Animacy=Inan|Case=Nom|Gender=Fem|Number=Plur	0.8926212844364179




[['Наши', 'POS=DET|Case=Nom|Number=Plur|Poss=Yes|PronType=Prs'],
 ['связи', 'POS=NOUN|Animacy=Inan|Case=Nom|Gender=Fem|Number=Plur'],
 ['с', 'POS=ADP'],
 ['коллегами', 'POS=NOUN|Animacy=Anim|Case=Ins|Gender=Masc|Number=Plur'],
 ['.', 'POS=PUNCT']]

In [173]:
disambiguate_sentence("18 человек .".split(), model)

18	POS=NUM|NumType=Card	0.6838407988105323
18	POS=NUM	0.6838407988105323
18	POS=ADJ	0.48508467736522665
человек	POS=NOUN|Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing	0.8827861755771124
человек	POS=NOUN|Animacy=Anim|Case=Gen|Gender=Masc|Number=Plur	0.7777328596403669




[['18', 'POS=NUM|NumType=Card'],
 ['человек', 'POS=NOUN|Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing'],
 ['.', 'POS=PUNCT']]

In [None]:
disambiguate_sentence("Несколько человек .".split(), model)

Несколько	POS=ADV|Degree=Pos	0.7705854914602377
Несколько	POS=NUM|Animacy=Inan|Case=Acc|NumType=Card	0.5530824330719456
Несколько	POS=NUM|Case=Nom|NumType=Card	0.651617060870511
Несколько	POS=NUM|Case=Nom	0.651617060870511
Несколько	POS=NUM|Animacy=Inan|Case=Acc	0.5530824330719456
человек	POS=NOUN|Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing	0.8441678757850495
человек	POS=NOUN|Animacy=Anim|Case=Gen|Gender=Masc|Number=Plur	0.7648674332679221




[['Несколько', 'POS=ADV|Degree=Pos'],
 ['человек', 'POS=NOUN|Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing'],
 ['.', 'POS=PUNCT']]
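
For a quantitative check on the test split, one option is token-level accuracy over predicted and gold parse strings. The sketch below assumes both are equal-length lists of `[token, parse]` pairs, the shape returned by `disambiguate_sentence` and `prep_entry`. `token_accuracy` is a hypothetical helper, and the parse strings in the example are abbreviated for illustration.

```python
def token_accuracy(predicted, gold):
    """Share of tokens whose predicted parse string equals the gold one."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(p[1] == g[1] for p, g in zip(predicted, gold))
    return hits / len(gold)


gold = [["18", "POS=NUM"], ["человек", "POS=NOUN|Case=Gen"], [".", "POS=PUNCT"]]
pred = [["18", "POS=NUM"], ["человек", "POS=NOUN|Case=Nom"], [".", "POS=PUNCT"]]

accuracy = token_accuracy(pred, gold)
print(f"{accuracy:.3f}")  # two of three tokens match
```

Exact string match is strict: a parse that differs only in a category the feature encoding cannot see still counts as an error.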