Задание 1. (5 баллов) 
В тетрадке реализована биграмная языковая модель (при генерации учитывается информация только о 1 предыдущем слове). Реализуйте триграмную модель и сгенерируйте несколько текстов. Сравните их с текстами, сгенерированными биграмной моделью. 
Можно использовать те же тексты, что в семинаре, или взять какой-то другой (на английском или русском языке).  

Делать это задание будет легче после прочтения первых 7 страниц вот этой главы из Журафского - https://web.stanford.edu/~jurafsky/slp3/3.pdf



In [15]:
import nltk
import re
from collections import Counter, defaultdict
import numpy as np
import copy
from gensim.models.phrases import Phrases

In [16]:
with open('stranger.txt', encoding='utf-8') as file: #Robert Heinlein's Stranger in a Strange Land with preface removed
    stranger = file.read()

In [17]:
stranger = stranger.replace('“Stranger In A Strange Land” by Robert Heinlein', '')

In [18]:
sents = nltk.tokenize.sent_tokenize(stranger)

In [19]:
pattern = re.compile(r'([A-Za-z]+[\'-]?[A-Za-z]*)')

In [20]:
sents = [re.findall(pattern, sent) for sent in sents]

In [21]:
tri_model = defaultdict(lambda: defaultdict(lambda: 0))
for sentence in sents:
    for w1, w2, w3 in nltk.trigrams(sentence, pad_right=True, pad_left=True, left_pad_symbol='<s>', right_pad_symbol='</s>'):
        tri_model[(w1, w2)][w3] += 1

Логарифм не использую, потому что верятности нужны только, чтобы сделать взвешенную выборку, а не чтобы их глазами сравнивать.

In [22]:
for bigram in tri_model:
    total_count = sum(tri_model[bigram].values())
    for target in tri_model[bigram]:
        tri_model[bigram][target] /= total_count

In [23]:
def tri_generate(model, start=('<s>', '<s>')):
    text = list(start)
    while text[-1] != '</s>': 
        index = tuple(text[-2:])
        keys = list(model[index].keys())
        values = list(model[index].values())
        key = np.random.choice(keys, 1, values)[0]
        text.append(key)
    return ' '.join(text[2:]).strip(' </s>')

In [24]:
def text_generator(sent_generator, model, number_of_sents=1, count_words=False):
    result = []
    for _ in range(number_of_sents):
        result.append(sent_generator(model))
       
    if count_words == True:
        count = count_words_avg(result)
        return count
    else:
        result = '. '.join(result) + '.'
        return result


In [25]:
def count_words_avg(sents):
    total = 0
    for sent in sents:
        total += len(sent.split(' '))
    return total/len(sents)

In [26]:
tri_example = text_generator(tri_generate, tri_model, 6)
tri_example

"We call her over. repeated Jill. Hypothetical situation for you wench Miriam insisted. Mnimm on Mars which we can judge his virtue- virtue so great mind you that would show up for sale but they seemed like gun. Eternity is no cure short of breaking him on the spot while it's hot- your newscaster Happy Holliday Went on as before. grabbed his ear pulled him gently to Jill cupped a handful of water brother said breathlessly."

In [27]:
bi_model = defaultdict(lambda: defaultdict(lambda: 0))
for sentence in sents:
    for w1, w2 in nltk.bigrams(sentence, pad_right=True, pad_left=True, left_pad_symbol='<s>', right_pad_symbol='</s>'):
        bi_model[w1][w2] += 1

In [28]:
for unigram in bi_model:
    total_count = sum(bi_model[unigram].values())
    for target in bi_model[unigram]:
        bi_model[unigram][target] /= total_count

In [29]:
def bi_generate(model, start=['<s>']):
    text = copy.deepcopy(start)
    while text[-1] != '</s>': 
        index = text[-1]
        keys = list(model[index].keys())
        values = list(model[index].values())
        key = np.random.choice(keys, 1, values)[0]
        text.append(key)
    return ' '.join(text[1:]).strip(' </s>')

In [30]:
bi_example = text_generator(bi_generate, bi_model, 6)
bi_example

"Duke-where s stop worrying about another channel then searched in anyone would again rose hushe. Suddenly she pleased mood from his oath that Martians haven't estimated at hand in seeing it difficult and won't his shoulder shoved back stood alone-a slender young fellows seem. Through your eyes-no violence brutality-but it sent back later maybe priestess-it can dig him become untired. Thirty minutes had whether or at Berkeley and Jiib with equal. t count is no purpose-Dr Nelson long pauses some religion before with more aggressive. Feeling better by ancestry."

Результат триграммной модели для сравнения: <br> <br>

In [31]:
tri_example

"We call her over. repeated Jill. Hypothetical situation for you wench Miriam insisted. Mnimm on Mars which we can judge his virtue- virtue so great mind you that would show up for sale but they seemed like gun. Eternity is no cure short of breaking him on the spot while it's hot- your newscaster Happy Holliday Went on as before. grabbed his ear pulled him gently to Jill cupped a handful of water brother said breathlessly."

Можно отметить две вещи: <br>
1) Предложения, созданные триграммной модели определённо больше похожи на текст, написанный человеком. Они более связные (только грамматически, естественно). <br>
2) Биграммные предложения длиннее, чем триграммные, при условии, что мы останавливаемся только на символе окончания предложения. Это подтверждается экспериментом ниже.

In [32]:
n = 100
m = 10
bi_test = 0
tri_test = 0
for _ in range(n):
    bi_test += text_generator(bi_generate, bi_model, m, count_words=True)
    tri_test += text_generator(tri_generate, tri_model, m, count_words=True)
print(f'Average bigram model sentence length: {bi_test/n:.2f} words\n' +
      f'Average trigram model sentence length: {tri_test/n:.2f} words\n')

Average bigram model sentence length: 19.81 words
Average trigram model sentence length: 12.93 words



Задание 2. (5 баллов) 
При помощи gensim.models.Phrases реализуйте byte-pair-encoding, про который говорилось на первом семинаре (https://github.com/mannefedov/compling_nlp_hse_course/blob/master/notebooks/Preprocessing.ipynb) 
А именно 1) возьмите любой текст; разбейте его на предложения, а каждое предложение разбейте на отдельные символы (не потеряйте пробелы) 2) обучите gensim.models.Phrases на полученных символьных предложениях 3) примените полученный нграммер к этим символьным предложениям 4) повторите 2 и 3 N количество раз, чтобы начали получаться целые слова
Параметры в gensim.models.Phrases влияют на количество получаемых нграммов после каждого прохода, поэтому не забудьте их настроить


In [33]:
symbol_sents = [' '.join(sent) for sent in sents]

In [34]:
symbol_sents = [[ch for ch in sent if ch not in ',.;!?\n'] for sent in symbol_sents]

In [39]:
phrases1 = Phrases(symbol_sents, scoring='npmi', threshold=1, min_count=3)

In [40]:
phrases2 = Phrases(phrases1[symbol_sents], scoring='npmi', threshold=0, min_count=2)

In [41]:
phrases3 = Phrases(phrases2[symbol_sents], scoring='npmi', threshold=-1)

In [42]:
list(phrases3[phrases2[phrases1[symbol_sents]]])[0]

['p_a_r',
 't_ _o_n',
 'e_ _H',
 'I_S',
 ' _M_A',
 'C_U_L_A',
 'T_E_ _O',
 'R_I_G',
 'I_N_ _I',
 ' _O_N_C',
 'E_ _U',
 'P_O_N',
 ' _A_ _T',
 'I_M_E',
 ' _w_h_e',
 'n_ _t_h',
 'e_ _w_o',
 'r_l_d',
 ' _w_a_s',
 ' _y_o_u',
 'n_g_ _t',
 'h_e_r_e',
 ' _w_a_s',
 ' _a_ _M',
 'a_r_t_i',
 'a_n_ ',
 'n_a_m',
 'e_d_ _S',
 'm_i_t_h']