# Lab 4. Morphological disambiguation with machine learning methods

Write a program that takes a disambiguated corpus as input (from a shared task, the Russian National Corpus (НКРЯ), Universal Dependencies, …), trains a machine learning model (any model, including deep networks; CRF or Random Forest by default), and disambiguates the text passed to it (for example, the test part of the corpus).

### Grading criteria:

- 2 - an annotated corpus is used
- 4 - a model is trained
- 2 - the program disambiguates text
- 2 - elegance of the solution (the instructor's subjective opinion)


In [1]:
import pathlib

In [2]:
datadir = pathlib.Path().cwd() / "data"

In [3]:
train_filenames = [
    "processed_ru_syntagrus-ud-train-a.conllu",
    "processed_ru_syntagrus-ud-train-b.conllu",
    "processed_ru_syntagrus-ud-train-c.conllu",
]

test_filename = "processed_ru_syntagrus-ud-test.conllu"

In [4]:
train_data = []

for filename in train_filenames:
    with open(datadir / filename, "r", encoding="utf-8") as file:
        train_data.extend(file.read().rstrip().split("\n\n"))

with open(datadir / test_filename, "r", encoding="utf-8") as file:
    test_data = file.read().rstrip().split("\n\n")

In [5]:
len(train_data), len(test_data)

(69630, 8800)

In [6]:
print(train_data[1])

Начальник	POS=NOUN|Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing
областного	POS=ADJ|Case=Gen|Degree=Pos|Gender=Neut|Number=Sing
управления	POS=NOUN|Animacy=Inan|Case=Gen|Gender=Neut|Number=Sing
связи	POS=NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing
Семен	POS=PROPN|Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing
Еремеевич	POS=PROPN|Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing
был	POS=AUX|Aspect=Imp|Gender=Masc|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act
человек	POS=NOUN|Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing
простой	POS=ADJ|Case=Nom|Degree=Pos|Gender=Masc|Number=Sing
,	POS=PUNCT
приходил	POS=VERB|Aspect=Imp|Gender=Masc|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act
на	POS=ADP
работу	POS=NOUN|Animacy=Inan|Case=Acc|Gender=Fem|Number=Sing
всегда	POS=ADV|Degree=Pos
вовремя	POS=ADV|Degree=Pos
,	POS=PUNCT
здоровался	POS=VERB|Aspect=Imp|Gender=Masc|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Mid
с	POS=ADP
секретаршей	POS=NOUN|Animacy=Anim|Case=Ins|Gender=Fe

In [7]:
def prep_entry(text:str):
    entry = tuple(filter(lambda l: not l.startswith("#"), text.split("\n")))
    entry = [line.split("\t") for line in entry]
    return entry

In [8]:
def split_grammar(line: str) -> dict[str, str]:
    # "POS=NOUN|Case=Gen" -> {"POS": "NOUN", "Case": "Gen"}
    return dict(pair.split("=", 1) for pair in line.split("|"))
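
To make the data flow concrete, here is a minimal sketch of how one corpus entry turns into tokens with feature dictionaries. Both helpers are restated compactly so the snippet runs on its own; the `sample` entry, including its `# sent_id` comment line, is made up for illustration and only imitates the two-column format shown above.

```python
def prep_entry(text: str) -> list[list[str]]:
    # Drop "#" comment lines, then split every remaining line on the tab.
    lines = [line for line in text.split("\n") if not line.startswith("#")]
    return [line.split("\t") for line in lines]


def split_grammar(line: str) -> dict[str, str]:
    # "POS=NOUN|Case=Gen" -> {"POS": "NOUN", "Case": "Gen"}
    return dict(pair.split("=", 1) for pair in line.split("|"))


# Hypothetical entry in the same format as the processed corpus files.
sample = (
    "# sent_id = demo\n"
    "на\tPOS=ADP\n"
    "работу\tPOS=NOUN|Animacy=Inan|Case=Acc|Gender=Fem|Number=Sing"
)

parsed = [(word, split_grammar(grammar)) for word, grammar in prep_entry(sample)]
for word, features in parsed:
    print(word, features)
```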

In [9]:
all_tags: dict[str, set] = dict()
word_dict: dict[str, dict[str, int]] = {}

In [10]:
for sent in train_data:
    lines = prep_entry(sent)
    for line in lines:
        word, grammar = line
        word_lower = word.lower()

        # Update the per-word parse frequency dictionary
        d = word_dict.get(word_lower, dict())
        d[grammar] = d.get(grammar, 0) + 1
        word_dict[word_lower] = d

        # Update the tag inventory
        for tag, val in split_grammar(grammar).items():
            all_tags[tag] = all_tags.get(tag, {"[PAD]"}) | {val}
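
The counting step can also be written with `collections.Counter`, which removes the manual `get(..., 0) + 1` bookkeeping. This is only a sketch on made-up pairs; `pairs` and `toy_word_dict` are illustrative names and not part of the notebook.

```python
from collections import Counter, defaultdict

# Toy (word, grammar) pairs standing in for the corpus; the tags are shortened.
pairs = [
    ("Стали", "POS=VERB|Number=Plur"),
    ("стали", "POS=VERB|Number=Plur"),
    ("стали", "POS=NOUN|Case=Gen"),
]

# Lowercased word form -> Counter of grammar strings seen for that form.
toy_word_dict = defaultdict(Counter)
for word, grammar in pairs:
    toy_word_dict[word.lower()][grammar] += 1

print(dict(toy_word_dict["стали"]))
```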

In [11]:
word_dict["стали"]

{'POS=VERB|Aspect=Perf|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Act': 223,
 'POS=VERB|Aspect=Perf|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Mid': 144,
 'POS=NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing': 19,
 'POS=NOUN|Animacy=Inan|Case=Nom|Gender=Fem|Number=Plur': 1,
 'POS=NOUN|Animacy=Inan|Case=Dat|Gender=Fem|Number=Sing': 1}

In [12]:
sum(
    [
        sum(word_dict[word].values())
        for word
        in word_dict
    ]
)

1206302

In [13]:
# Drop every parse that occurs fewer than THRESHOLD times

THRESHOLD = 6

for word in word_dict:
    word_dict[word] = {
        k: val
        for k, val
        in word_dict[word].items()
        if val >= THRESHOLD
    }


In [14]:
sum(
    [
        sum(word_dict[word].values())
        for word
        in word_dict
    ]
)

1000132
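
The two totals (1206302 before pruning, 1000132 after) give the share of training tokens that the pruned dictionary still covers. The sketch below computes that share and shows, on a made-up two-word dictionary, that a word whose parses all fall below the threshold stays in the dictionary with an empty entry. `toy` and `pruned` are illustrative names.

```python
total_before = 1_206_302  # token count before pruning, from the output above
total_after = 1_000_132   # token count after pruning with THRESHOLD = 6

coverage = total_after / total_before
print(f"{coverage:.3f}")  # 0.829

THRESHOLD = 6

# Made-up counts: "ит" has no parse that reaches the threshold.
toy = {"стали": {"VERB": 223, "NOUN-Dat": 1}, "ит": {"PROPN": 2}}
pruned = {
    word: {parse: n for parse, n in parses.items() if n >= THRESHOLD}
    for word, parses in toy.items()
}
print(pruned)  # {'стали': {'VERB': 223}, 'ит': {}}
```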

In [15]:
word_dict["стали"]

{'POS=VERB|Aspect=Perf|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Act': 223,
 'POS=VERB|Aspect=Perf|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Mid': 144,
 'POS=NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing': 19}

In [16]:
test_sent = list(map(lambda pair: pair[0], prep_entry(test_data[0])))
test_sent

['В',
 'советский',
 'период',
 'времени',
 'число',
 'ИТ',
 '-',
 'специалистов',
 'в',
 'Армении',
 'составляло',
 'около',
 'десяти',
 'тысяч',
 '.']

In [17]:
class BaseMorphotagger:

    def __init__(self, entries):
        self.word_dict = entries

    def disambig(self, d: dict):
        pass

    def parse_unknown(self, tok, **kwargs):
        return "[UNKNOWN]"

    def parse_known(self, tok, **kwargs):
        d = kwargs["d"]
        grammar = max(
            d,
            key=d.get
        )

        if len(d) > 1:
            grammar = "[AMBIG]: " + grammar

        else:
            grammar = "[SING]: " + grammar

        return grammar

    def parse_tokens(self, tokens: list[str]):
        result = []

        for tok in tokens:

            tok_lower = tok.lower()
            # Words missing from the dictionary are treated like words with no parses left.
            d = self.word_dict.get(tok_lower, {})

            if not d:
                grammar = self.parse_unknown(tok_lower, tokens=tokens, grammar=result)

            else:
                grammar = self.parse_known(tok_lower, d=d, tokens=tokens, grammar=result)

            result.append(grammar)

        return result

In [18]:
morphotagger = BaseMorphotagger(word_dict)

In [19]:
for tok, grammar in zip(test_sent, morphotagger.parse_tokens(test_sent)):
    print(tok, grammar, sep="\t")

В	[SING]: POS=ADP
советский	[AMBIG]: POS=ADJ|Case=Nom|Degree=Pos|Gender=Masc|Number=Sing
период	[AMBIG]: POS=NOUN|Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing
времени	[AMBIG]: POS=NOUN|Animacy=Inan|Case=Gen|Gender=Neut|Number=Sing
число	[AMBIG]: POS=NOUN|Animacy=Inan|Case=Nom|Gender=Neut|Number=Sing
ИТ	[UNKNOWN]
-	[SING]: POS=PUNCT
специалистов	[AMBIG]: POS=NOUN|Animacy=Anim|Case=Gen|Gender=Masc|Number=Plur
в	[SING]: POS=ADP
Армении	[UNKNOWN]
составляло	[SING]: POS=VERB|Aspect=Imp|Gender=Neut|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act
около	[SING]: POS=ADP
десяти	[AMBIG]: POS=NUM|Case=Gen|NumType=Card
тысяч	[SING]: POS=NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Plur
.	[SING]: POS=PUNCT


In [20]:
all_tags

{'POS': {'ADJ',
  'ADP',
  'ADV',
  'AUX',
  'CCONJ',
  'DET',
  'INTJ',
  'NOUN',
  'NUM',
  'PART',
  'PRON',
  'PROPN',
  'PUNCT',
  'SCONJ',
  'SYM',
  'VERB',
  'X',
  '[PAD]',
  '_'},
 'Animacy': {'Anim', 'Inan', '[PAD]'},
 'Case': {'Acc', 'Dat', 'Gen', 'Ins', 'Loc', 'Nom', 'Par', 'Voc', '[PAD]'},
 'Gender': {'Fem', 'Masc', 'Neut', '[PAD]'},
 'Number': {'Plur', 'Sing', '[PAD]'},
 'Degree': {'Cmp', 'Pos', 'Sup', '[PAD]'},
 'Aspect': {'Imp', 'Perf', '[PAD]'},
 'Mood': {'Cnd', 'Imp', 'Ind', '[PAD]'},
 'Tense': {'Fut', 'Past', 'Pres', '[PAD]'},
 'VerbForm': {'Conv', 'Fin', 'Inf', 'Part', '[PAD]'},
 'Voice': {'Act', 'Mid', 'Pass', '[PAD]'},
 'Person': {'1', '2', '3', '[PAD]'},
 'PronType': {'Dem', 'Ind', 'Int,Rel', 'Neg', 'Prs', 'Tot', '[PAD]'},
 'Variant': {'Short', '[PAD]'},
 'NumType': {'Card', '[PAD]'},
 'Poss': {'Yes', '[PAD]'},
 'Reflex': {'Yes', '[PAD]'},
 'Foreign': {'Yes', '[PAD]'},
 'Polarity': {'Neg', '[PAD]'},
 'Typo': {'Yes', '[PAD]'},
 'Abbr': {'Yes', '[PAD]'}}

In [21]:
all_tags["POS"].remove('_')

In [22]:
all_tags["POS"]

{'ADJ',
 'ADP',
 'ADV',
 'AUX',
 'CCONJ',
 'DET',
 'INTJ',
 'NOUN',
 'NUM',
 'PART',
 'PRON',
 'PROPN',
 'PUNCT',
 'SCONJ',
 'SYM',
 'VERB',
 'X',
 '[PAD]'}

In [23]:
to_del = []

for tag in all_tags:
    if len(all_tags[tag]) < 3:
        print(tag)
        to_del.append(tag)

for tag in to_del:
    del all_tags[tag]

Variant
NumType
Poss
Reflex
Foreign
Polarity
Typo
Abbr


In [24]:
all_tags

{'POS': {'ADJ',
  'ADP',
  'ADV',
  'AUX',
  'CCONJ',
  'DET',
  'INTJ',
  'NOUN',
  'NUM',
  'PART',
  'PRON',
  'PROPN',
  'PUNCT',
  'SCONJ',
  'SYM',
  'VERB',
  'X',
  '[PAD]'},
 'Animacy': {'Anim', 'Inan', '[PAD]'},
 'Case': {'Acc', 'Dat', 'Gen', 'Ins', 'Loc', 'Nom', 'Par', 'Voc', '[PAD]'},
 'Gender': {'Fem', 'Masc', 'Neut', '[PAD]'},
 'Number': {'Plur', 'Sing', '[PAD]'},
 'Degree': {'Cmp', 'Pos', 'Sup', '[PAD]'},
 'Aspect': {'Imp', 'Perf', '[PAD]'},
 'Mood': {'Cnd', 'Imp', 'Ind', '[PAD]'},
 'Tense': {'Fut', 'Past', 'Pres', '[PAD]'},
 'VerbForm': {'Conv', 'Fin', 'Inf', 'Part', '[PAD]'},
 'Voice': {'Act', 'Mid', 'Pass', '[PAD]'},
 'Person': {'1', '2', '3', '[PAD]'},
 'PronType': {'Dem', 'Ind', 'Int,Rel', 'Neg', 'Prs', 'Tot', '[PAD]'}}

In [25]:
id2tags = {k: None for k in all_tags}
tag2ids = {k: None for k in all_tags}

for k in all_tags:
    id2tags[k] = {i: val for i, val in enumerate(all_tags[k])}

for k in id2tags:
    tag2ids[k] = {val: key for key, val in id2tags[k].items()}

for k in all_tags:
    zeroth_elem = id2tags[k][0]
    if zeroth_elem != "[PAD]":

        pad_idx = tag2ids[k]["[PAD]"]
        id2tags[k][0] = "[PAD]"
        id2tags[k][pad_idx] = zeroth_elem

        tag2ids[k]["[PAD]"] = 0
        tag2ids[k][zeroth_elem] = pad_idx
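
Because `all_tags` holds sets, the ids produced by `enumerate` depend on set iteration order, which for strings can change between Python processes. A deterministic alternative, sketched here under the assumption that `[PAD]` must always get id 0, is to put `[PAD]` first and sort the remaining values. `build_vocab` is a hypothetical helper, not part of the notebook.

```python
def build_vocab(values: set[str], pad: str = "[PAD]") -> tuple[dict[int, str], dict[str, int]]:
    # PAD first, then the remaining values in sorted order: ids are stable across runs.
    ordered = [pad] + sorted(v for v in values if v != pad)
    id2tag = dict(enumerate(ordered))
    tag2id = {tag: idx for idx, tag in id2tag.items()}
    return id2tag, tag2id


id2tag, tag2id = build_vocab({"Anim", "Inan", "[PAD]"})
print(tag2id)  # {'[PAD]': 0, 'Anim': 1, 'Inan': 2}
```

With stable ids, a model saved in one session keeps the same meaning for each feature value when it is loaded in another.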


In [26]:
tag2ids

{'POS': {'PRON': 6,
  'NUM': 1,
  'PART': 2,
  'VERB': 3,
  'CCONJ': 4,
  'PUNCT': 5,
  '[PAD]': 0,
  'ADJ': 7,
  'ADV': 8,
  'DET': 9,
  'X': 10,
  'INTJ': 11,
  'PROPN': 12,
  'NOUN': 13,
  'SCONJ': 14,
  'ADP': 15,
  'SYM': 16,
  'AUX': 17},
 'Animacy': {'[PAD]': 0, 'Anim': 1, 'Inan': 2},
 'Case': {'Acc': 3,
  'Ins': 1,
  'Voc': 2,
  '[PAD]': 0,
  'Dat': 4,
  'Loc': 5,
  'Gen': 6,
  'Par': 7,
  'Nom': 8},
 'Gender': {'Masc': 1, '[PAD]': 0, 'Fem': 2, 'Neut': 3},
 'Number': {'[PAD]': 0, 'Plur': 1, 'Sing': 2},
 'Degree': {'Cmp': 2, 'Sup': 1, '[PAD]': 0, 'Pos': 3},
 'Aspect': {'[PAD]': 0, 'Perf': 1, 'Imp': 2},
 'Mood': {'Cnd': 2, 'Imp': 1, '[PAD]': 0, 'Ind': 3},
 'Tense': {'Fut': 2, 'Pres': 1, '[PAD]': 0, 'Past': 3},
 'VerbForm': {'Conv': 3, 'Part': 1, 'Inf': 2, '[PAD]': 0, 'Fin': 4},
 'Voice': {'Pass': 1, '[PAD]': 0, 'Mid': 2, 'Act': 3},
 'Person': {'1': 1, '[PAD]': 0, '2': 2, '3': 3},
 'PronType': {'Tot': 3,
  'Dem': 1,
  'Int,Rel': 2,
  '[PAD]': 0,
  'Ind': 4,
  'Neg': 5,
  'Prs': 6}}

In [27]:
id2tags

{'POS': {0: '[PAD]',
  1: 'NUM',
  2: 'PART',
  3: 'VERB',
  4: 'CCONJ',
  5: 'PUNCT',
  6: 'PRON',
  7: 'ADJ',
  8: 'ADV',
  9: 'DET',
  10: 'X',
  11: 'INTJ',
  12: 'PROPN',
  13: 'NOUN',
  14: 'SCONJ',
  15: 'ADP',
  16: 'SYM',
  17: 'AUX'},
 'Animacy': {0: '[PAD]', 1: 'Anim', 2: 'Inan'},
 'Case': {0: '[PAD]',
  1: 'Ins',
  2: 'Voc',
  3: 'Acc',
  4: 'Dat',
  5: 'Loc',
  6: 'Gen',
  7: 'Par',
  8: 'Nom'},
 'Gender': {0: '[PAD]', 1: 'Masc', 2: 'Fem', 3: 'Neut'},
 'Number': {0: '[PAD]', 1: 'Plur', 2: 'Sing'},
 'Degree': {0: '[PAD]', 1: 'Sup', 2: 'Cmp', 3: 'Pos'},
 'Aspect': {0: '[PAD]', 1: 'Perf', 2: 'Imp'},
 'Mood': {0: '[PAD]', 1: 'Imp', 2: 'Cnd', 3: 'Ind'},
 'Tense': {0: '[PAD]', 1: 'Pres', 2: 'Fut', 3: 'Past'},
 'VerbForm': {0: '[PAD]', 1: 'Part', 2: 'Inf', 3: 'Conv', 4: 'Fin'},
 'Voice': {0: '[PAD]', 1: 'Pass', 2: 'Mid', 3: 'Act'},
 'Person': {0: '[PAD]', 1: '1', 2: '2', 3: '3'},
 'PronType': {0: '[PAD]',
  1: 'Dem',
  2: 'Int,Rel',
  3: 'Tot',
  4: 'Ind',
  5: 'Neg',
  6: 'Prs'}}

In [28]:
set(all_tags.keys())

{'Animacy',
 'Aspect',
 'Case',
 'Degree',
 'Gender',
 'Mood',
 'Number',
 'POS',
 'Person',
 'PronType',
 'Tense',
 'VerbForm',
 'Voice'}

In [29]:
sorted_keys = sorted(all_tags.keys())

In [30]:
sorted_keys

['Animacy',
 'Aspect',
 'Case',
 'Degree',
 'Gender',
 'Mood',
 'Number',
 'POS',
 'Person',
 'PronType',
 'Tense',
 'VerbForm',
 'Voice']

In [31]:
def get_grammar_dict(line:str, allowed_tags=set(all_tags.keys())):
    pairs = split_grammar(line)
    return {
        k: tag2ids.get(
            k,
            dict()
        ).get(
            pairs.get(k, "[PAD]"),
            0
        )
        for k
        in allowed_tags
    }

In [32]:
get_grammar_dict("POS=NOUN|Animacy=Inan")

{'Gender': 0,
 'Animacy': 2,
 'Aspect': 0,
 'Number': 0,
 'Mood': 0,
 'Voice': 0,
 'PronType': 0,
 'VerbForm': 0,
 'POS': 13,
 'Person': 0,
 'Tense': 0,
 'Degree': 0,
 'Case': 0}

In [33]:
def get_grammar_vector(line:str, allowed_tags=set(all_tags.keys())):
    grammar_dict = get_grammar_dict(line, allowed_tags=allowed_tags)
    return [grammar_dict[k] for k in sorted(grammar_dict.keys())]

In [34]:
# Values follow the alphabetical order of the tag names
get_grammar_vector("POS=NOUN|Animacy=Inan")

[2, 0, 0, 0, 0, 0, 0, 13, 0, 0, 0, 0, 0]

In [35]:
def grammar_vector_to_dict(vector, sorted_keys=sorted_keys):
    return {
        sorted_keys[i]: val
        for i, val
        in enumerate(vector)
    }

In [36]:
grammar_vector_to_dict(
    get_grammar_vector("POS=NOUN|Animacy=Inan")
)

{'Animacy': 2,
 'Aspect': 0,
 'Case': 0,
 'Degree': 0,
 'Gender': 0,
 'Mood': 0,
 'Number': 0,
 'POS': 13,
 'Person': 0,
 'PronType': 0,
 'Tense': 0,
 'VerbForm': 0,
 'Voice': 0}

In [37]:
def grammar_dict_to_line(grammar_dict, id2tags=id2tags, pad_idx=0):
    # Skip padded categories and join the rest as "Tag=Value" pairs.
    return "|".join(
        [
            f"{tag}={id2tags[tag][grammar_dict[tag]]}"
            for tag
            in grammar_dict
            if grammar_dict[tag] != pad_idx
        ]
    )

In [38]:
grammar_dict_to_line(
    grammar_vector_to_dict(
        get_grammar_vector("POS=NOUN|Animacy=Inan")
    )
)

'Animacy=Inan|POS=NOUN'
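
Note that the round trip returns the pairs in alphabetical order, while the corpus lines and the `word_dict` keys put `POS` first. If the decoded string ever needs to be compared with those keys, the `POS` pair has to be moved back to the front. A minimal sketch, with `pos_first` as a hypothetical helper:

```python
def pos_first(line: str) -> str:
    # Move the POS=... pair to the front; keep the other pairs in their given order.
    pairs = line.split("|")
    pos = [p for p in pairs if p.startswith("POS=")]
    rest = [p for p in pairs if not p.startswith("POS=")]
    return "|".join(pos + rest)


print(pos_first("Animacy=Inan|POS=NOUN"))  # POS=NOUN|Animacy=Inan
```

This only restores the ordering; categories that were removed from `all_tags` are not recovered by the round trip.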

In [39]:
def extract_features(sentence, index, neg=False, word_dict=word_dict):

    word, target_tags = sentence[index]

    # Grammar vectors of the three preceding tokens, nearest first.
    # Positions before the sentence start get a dummy tag, which encodes as all [PAD].
    prev_vectors = []
    for offset in (1, 2, 3):
        if index - offset >= 0:
            _, prev_tag = sentence[index - offset]
        else:
            prev_tag = "DUMMY=dummy|dummy=DUMMY"
        prev_vectors.extend(get_grammar_vector(prev_tag))

    target_vector = get_grammar_vector(target_tags)

    if neg:
        # Negative example: take the first dictionary parse whose vector differs
        # from the gold one; if there is none, use an all-[PAD] vector.
        for parse in word_dict[word.lower()]:
            vector = get_grammar_vector(parse)
            if vector != target_vector:
                target_vector = vector
                break
        else:
            target_vector = [0]*len(target_vector)

    training_row = [
        int(word.isupper()),
        int(word.istitle()),
        int(word.isdigit()),
        *prev_vectors,
        *target_vector
    ]

    # Label is 1 for the gold parse and 0 for a negative example.
    return word.lower(), training_row, int(not neg)
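
The row returned by `extract_features` has a fixed layout: three surface flags, then one 13-value grammar vector for each of the three previous tokens, then the vector of the candidate parse. The sketch below spells that layout out as column names. The names themselves (`is_upper`, `prev1_…`, `cand_…`) are made up for illustration; only their order mirrors the function.

```python
# Tag categories in the alphabetical order used by get_grammar_vector.
sorted_keys = [
    "Animacy", "Aspect", "Case", "Degree", "Gender", "Mood", "Number",
    "POS", "Person", "PronType", "Tense", "VerbForm", "Voice",
]

columns = ["is_upper", "is_title", "is_digit"]
for prefix in ("prev1", "prev2", "prev3", "cand"):
    columns += [f"{prefix}_{key}" for key in sorted_keys]

print(len(columns))  # 55
print(columns[3], columns[-1])  # prev1_Animacy cand_Voice
```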

In [40]:
word, trr, tstr = extract_features(prep_entry(train_data[1]), 0)
word, word_dict.get(word.lower(), "DUMMY=dummy|dummy=DUMMY"), tstr

('начальник',
 {'POS=NOUN|Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing': 72},
 1)

In [41]:
word, trr, tstr = extract_features(prep_entry(train_data[1]), 1)
word, word_dict.get(word.lower(), "DUMMY=dummy|dummy=DUMMY"), tstr

('областного', {'POS=ADJ|Case=Gen|Degree=Pos|Gender=Masc|Number=Sing': 6}, 1)

In [42]:
word, trr, tstr = extract_features(prep_entry(train_data[1]), 2)
word, word_dict.get(word.lower(), "DUMMY=dummy|dummy=DUMMY"), tstr

('управления',
 {'POS=NOUN|Animacy=Inan|Case=Gen|Gender=Neut|Number=Sing': 185,
  'POS=PROPN|Animacy=Inan|Case=Gen|Gender=Neut|Number=Sing': 10},
 1)

In [43]:
word, trr, tstr = extract_features(prep_entry(train_data[1]), 2, neg=True)
word, word_dict.get(word.lower(), "DUMMY=dummy|dummy=DUMMY"), tstr

('управления',
 {'POS=NOUN|Animacy=Inan|Case=Gen|Gender=Neut|Number=Sing': 185,
  'POS=PROPN|Animacy=Inan|Case=Gen|Gender=Neut|Number=Sing': 10},
 0)

In [44]:
trr

[0,
 0,
 0,
 0,
 0,
 6,
 3,
 3,
 0,
 2,
 7,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 8,
 0,
 1,
 0,
 2,
 13,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 2,
 0,
 6,
 0,
 3,
 0,
 2,
 12,
 0,
 0,
 0,
 0,
 0]

In [45]:
tstr

0

In [46]:
def gen_training_row(train_data):
    while True:
        for i, entry in enumerate(train_data):
            parsed_entry = prep_entry(entry)
            for j in range(len(parsed_entry)):
                w = parsed_entry[j][0].lower()
                if w in word_dict and len(word_dict[w]) > 1:

                    yield extract_features(parsed_entry, j)
                    yield extract_features(parsed_entry, j, neg=True)
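
Since the generator never terminates, batches have to be cut from it explicitly. One way to do that, sketched here with `itertools.islice`, is a small helper that pulls a fixed number of rows and separates features from labels. `take_batch` and `toy_rows` are hypothetical; `toy_rows` only imitates the `(word, features, label)` triples that `gen_training_row` yields.

```python
from itertools import islice


def take_batch(rows, batch_size):
    # Pull up to batch_size (word, features, label) triples and split them into X and y.
    batch = list(islice(rows, batch_size))
    X = [features for _, features, _ in batch]
    y = [label for _, _, label in batch]
    return X, y


def toy_rows():
    # Stand-in for gen_training_row: alternates positive and negative rows forever.
    while True:
        yield "стали", [0, 1, 0], 1
        yield "стали", [0, 0, 1], 0


X, y = take_batch(toy_rows(), 4)
print(y)  # [1, 0, 1, 0]
```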


In [47]:
gen_train = gen_training_row(train_data)
gen_test  = gen_training_row(test_data)

In [49]:
from lightgbm import LGBMClassifier
import numpy as np
from tqdm import tqdm

In [50]:
model = LGBMClassifier(
    objective="binary",
    metric="auc",
    boosting_type="gbdt",
    num_leaves=31,
    learning_rate=0.05,
    verbose=-1
)

In [51]:
EPOCHS = 500
BATCH_SIZE = 15000

In [52]:
for epoch in tqdm(range(1, EPOCHS + 1)):
    X_bar = []
    y_bar = []

    for _ in range(BATCH_SIZE):
        word, features, labels = next(gen_train)
        X_bar.append(features)
        y_bar.append(labels)

    X_bar = np.array(X_bar)
    y_bar = np.array(y_bar)

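    # Caveat: with LightGBM's scikit-learn API, each fit() call trains a new
    # booster unless init_model is passed, so the model kept after the loop
    # reflects only the last batch.
    model.fit(X_bar, y_bar)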

  0%|          | 0/500 [00:00<?, ?it/s]

100%|██████████| 500/500 [03:50<00:00,  2.17it/s]


In [53]:
# DEBUG = False
DEBUG = True

In [54]:
def disambiguate_sentence(tokens:list[str], model:LGBMClassifier):
    prep_tokens = []

    for tok in tokens:
        entries = word_dict.get(tok.lower(), dict())

        if len(entries) == 0:
            prep_tokens.append([tok, "Unkn=Unkn"])

        elif len(entries) == 1:
            prep_tokens.append([tok, next(iter(entries))])

        else:
            parses_probas = []

            for parse in entries:
                try_tokens = prep_tokens.copy() + [[tok, parse]]
                feats = np.array(extract_features(try_tokens, len(try_tokens)-1)[1]).reshape(1, -1)

                proba = model.predict_proba(feats, verbose=-1)[0, -1]
                if DEBUG:
                    print(tok, parse, proba, sep="\t")
                parses_probas.append([parse, proba])

            decided_entry = max(parses_probas, key = lambda pair: pair[1])[0]
            prep_tokens.append([tok, decided_entry])


    return prep_tokens

### Example 1: part of speech

In [75]:
disambiguate_sentence("В итоге они стали разбойниками .".split(), model)

они	POS=PRON|Case=Nom|Number=Plur|Person=3|PronType=Prs	0.9410943122339618
они	POS=PRON|Case=Nom|Number=Plur|Person=3	0.05654492476862961
стали	POS=VERB|Aspect=Perf|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Act	0.2773807683956345
стали	POS=VERB|Aspect=Perf|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Mid	0.487979429865982
стали	POS=NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing	0.4352742436662145




[['В', 'POS=ADP'],
 ['итоге', 'POS=NOUN|Animacy=Inan|Case=Loc|Gender=Masc|Number=Sing'],
 ['они', 'POS=PRON|Case=Nom|Number=Plur|Person=3|PronType=Prs'],
 ['стали',
  'POS=VERB|Aspect=Perf|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Mid'],
 ['разбойниками', 'Unkn=Unkn'],
 ['.', 'POS=PUNCT']]

In [70]:
disambiguate_sentence("Завод , производящий стали .".split(), model)

Завод	POS=NOUN|Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing	0.7657080149316211
Завод	POS=NOUN|Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing	0.23089967080332932
стали	POS=VERB|Aspect=Perf|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Act	0.3996179464628823
стали	POS=VERB|Aspect=Perf|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Mid	0.46966881176653846
стали	POS=NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing	0.4797654538727015




[['Завод', 'POS=NOUN|Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing'],
 [',', 'POS=PUNCT'],
 ['производящий', 'Unkn=Unkn'],
 ['стали', 'POS=NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing'],
 ['.', 'POS=PUNCT']]

In [None]:
# Not enough context: "стали" is sentence-initial, so all previous-tag features are padding
disambiguate_sentence("стали , которые мы производим .".split(), model)

стали	POS=VERB|Aspect=Perf|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Act	0.5333808103536697
стали	POS=VERB|Aspect=Perf|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Mid	0.4703197575905293
стали	POS=NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing	0.4824340613896696
которые	POS=PRON|Case=Acc|PronType=Int,Rel	0.412156596773844
которые	POS=PRON|Case=Nom|PronType=Int,Rel	0.16492322400122736
которые	POS=PRON|Case=Nom|Number=Plur	0.23170548226228918
которые	POS=PRON|Animacy=Inan|Case=Acc|Number=Plur	0.3236271419588549
мы	POS=PRON|Case=Nom|Number=Plur|Person=1|PronType=Prs	0.8831806459383733
мы	POS=PRON|Case=Nom|Number=Plur|Person=1	0.04320322244734478




[['стали',
  'POS=VERB|Aspect=Perf|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Act'],
 [',', 'POS=PUNCT'],
 ['которые', 'POS=PRON|Case=Acc|PronType=Int,Rel'],
 ['мы', 'POS=PRON|Case=Nom|Number=Plur|Person=1|PronType=Prs'],
 ['производим', 'Unkn=Unkn'],
 ['.', 'POS=PUNCT']]

### Example 2: case

In [56]:
disambiguate_sentence("Для установления связи с пришельцами .".split(), model)

связи	POS=NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing	0.7821928980705306
связи	POS=NOUN|Animacy=Inan|Case=Loc|Gender=Fem|Number=Sing	0.14301670560841828
связи	POS=NOUN|Animacy=Inan|Case=Acc|Gender=Fem|Number=Plur	0.12173660638699113
связи	POS=NOUN|Animacy=Inan|Case=Nom|Gender=Fem|Number=Plur	0.289142732564695




[['Для', 'POS=ADP'],
 ['установления', 'POS=NOUN|Animacy=Inan|Case=Gen|Gender=Neut|Number=Sing'],
 ['связи', 'POS=NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing'],
 ['с', 'POS=ADP'],
 ['пришельцами', 'Unkn=Unkn'],
 ['.', 'POS=PUNCT']]

In [57]:
disambiguate_sentence("Наши связи с коллегами .".split(), model)

Наши	POS=DET|Case=Nom|Number=Plur|Poss=Yes|PronType=Prs	0.8174051474847277
Наши	POS=DET|Case=Acc|Number=Plur|Poss=Yes|PronType=Prs	0.49333631059075006
Наши	POS=DET|Animacy=Inan|Case=Acc|Number=Plur	0.17932659073661827
Наши	POS=DET|Case=Nom|Number=Plur	0.19312027592786477
связи	POS=NOUN|Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing	0.39891742605929353
связи	POS=NOUN|Animacy=Inan|Case=Loc|Gender=Fem|Number=Sing	0.12105718061149123
связи	POS=NOUN|Animacy=Inan|Case=Acc|Gender=Fem|Number=Plur	0.0962986355968993
связи	POS=NOUN|Animacy=Inan|Case=Nom|Gender=Fem|Number=Plur	0.8932993573756414




[['Наши', 'POS=DET|Case=Nom|Number=Plur|Poss=Yes|PronType=Prs'],
 ['связи', 'POS=NOUN|Animacy=Inan|Case=Nom|Gender=Fem|Number=Plur'],
 ['с', 'POS=ADP'],
 ['коллегами', 'POS=NOUN|Animacy=Anim|Case=Ins|Gender=Masc|Number=Plur'],
 ['.', 'POS=PUNCT']]

In [59]:
disambiguate_sentence("18 человек .".split(), model)

18	POS=NUM|NumType=Card	0.8270148472327914
18	POS=NUM	0.8270148472327914
18	POS=ADJ	0.10900230332007474
человек	POS=NOUN|Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing	0.37675607807173767
человек	POS=NOUN|Animacy=Anim|Case=Gen|Gender=Masc|Number=Plur	0.6343944777518347




[['18', 'POS=NUM|NumType=Card'],
 ['человек', 'POS=NOUN|Animacy=Anim|Case=Gen|Gender=Masc|Number=Plur'],
 ['.', 'POS=PUNCT']]

In [60]:
disambiguate_sentence("Тот самый человек .".split(), model)

Тот	POS=DET|Case=Acc|Gender=Masc|Number=Sing|PronType=Dem	0.34409332452990526
Тот	POS=DET|Case=Nom|Gender=Masc|Number=Sing|PronType=Dem	0.711244767108151
Тот	POS=DET|Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing	0.17517261307106563
Тот	POS=DET|Case=Nom|Gender=Masc|Number=Sing	0.18878527574404855
самый	POS=ADJ|Animacy=Inan|Case=Acc|Degree=Pos|Gender=Masc|Number=Sing	0.13351796729952956
самый	POS=ADJ|Case=Nom|Degree=Pos|Gender=Masc|Number=Sing	0.7929178155095201
самый	POS=DET|Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing	0.08319809881083459
самый	POS=DET|Case=Nom|Gender=Masc|Number=Sing	0.37717207676575604
человек	POS=NOUN|Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing	0.8778543527320118
человек	POS=NOUN|Animacy=Anim|Case=Gen|Gender=Masc|Number=Plur	0.5242649552704631




[['Тот', 'POS=DET|Case=Nom|Gender=Masc|Number=Sing|PronType=Dem'],
 ['самый', 'POS=ADJ|Case=Nom|Degree=Pos|Gender=Masc|Number=Sing'],
 ['человек', 'POS=NOUN|Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing'],
 ['.', 'POS=PUNCT']]

In [None]:
# "Несколько" is misclassified, and the error propagates to "человек"
disambiguate_sentence("Несколько человек .".split(), model)



Несколько	POS=ADV|Degree=Pos	0.5250802368455961
Несколько	POS=NUM|Animacy=Inan|Case=Acc|NumType=Card	0.5663796516062514
Несколько	POS=NUM|Case=Nom|NumType=Card	0.6190846233633428
Несколько	POS=NUM|Case=Nom	0.6190846233633428
Несколько	POS=NUM|Animacy=Inan|Case=Acc	0.5663796516062514
человек	POS=NOUN|Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing	0.7509036516654578
человек	POS=NOUN|Animacy=Anim|Case=Gen|Gender=Masc|Number=Plur	0.6606745070552508




[['Несколько', 'POS=NUM|Case=Nom|NumType=Card'],
 ['человек', 'POS=NOUN|Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing'],
 ['.', 'POS=PUNCT']]
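
The calls above are spot checks on hand-written sentences. To get a number for the test part of the corpus, the predicted parses can be compared with the gold ones token by token. The following is a minimal sketch of such a metric; `tag_accuracy` is a hypothetical helper, and the `gold` and `pred` lists are made up rather than taken from a real run.

```python
def tag_accuracy(gold, predicted):
    # gold and predicted are equally long lists of [token, grammar] pairs.
    if len(gold) != len(predicted):
        raise ValueError("sequences must have the same length")
    if not gold:
        return 0.0
    hits = sum(g[1] == p[1] for g, p in zip(gold, predicted))
    return hits / len(gold)


gold = [["Наши", "POS=DET"], ["связи", "POS=NOUN|Case=Nom"], [".", "POS=PUNCT"]]
pred = [["Наши", "POS=DET"], ["связи", "POS=NOUN|Case=Gen"], [".", "POS=PUNCT"]]

print(round(tag_accuracy(gold, pred), 3))  # 0.667
```

In the notebook's setting, `gold` would come from `prep_entry(test_data[i])` and `pred` from `disambiguate_sentence` on the same tokens. Exact string comparison is strict: it also counts a parse as wrong when it differs only in a category the model never sees.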