# Language Modeling

We will work with a corpus of Shakespeare's texts. To download it, run:
```python
!wget http://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt
```

**Complete the following exercises:**
1. Split the text into words.
2. Convert everything to lowercase.
3. Count the frequency of every word.
4. Replace words that occur fewer than 2 times with UNK.
5. Build a dictionary in which key _i_ holds a dictionary of frequencies of _n_-grams of length _i_.
6. Write a function that estimates sentence probabilities from this dictionary using Laplace smoothing.
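
As a rough illustration of steps 1 to 5, here is a minimal sketch on a toy sentence. The helper names `preprocess` and `ngram_counts` are hypothetical and are not used elsewhere in this notebook; the sketch also skips punctuation stripping and stores n-grams as tuples rather than space-joined strings.

```python
from collections import Counter

def preprocess(raw_text, min_count=2, unk="UNK"):
    """Lowercase, split on whitespace, and replace rare words with `unk`."""
    tokens = raw_text.lower().split()
    counts = Counter(tokens)
    return [tok if counts[tok] >= min_count else unk for tok in tokens]

def ngram_counts(tokens, max_n):
    """Map each n in 1..max_n to a Counter of n-gram tuples of length n."""
    return {
        n: Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        for n in range(1, max_n + 1)
    }

tokens = preprocess("To be or not to be that is the question")
counts = ngram_counts(tokens, 2)
print(tokens)
print(counts[2][("to", "be")])
```

Only "to" and "be" occur twice in the toy sentence, so every other word becomes `UNK`.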

In [440]:
from collections import Counter, defaultdict

# Helper that replaces rare words (frequency below 2) with the UNK tag
def replace_(word, frequency_dict):
    # .get avoids inserting unseen words into the defaultdict as a side effect
    if frequency_dict.get(word, 0) < 2:
        return "UNK"
    return word

# Helper that removes extra symbols (punctuation) from a word
def strip_(word):
    word = word.strip()
    result = ""
    for symbol in word:
        if symbol not in ".,!@#$%^&*()_+:\"'|/":
            result += symbol
    return result.lower()

with open("shakespeare_input.txt", 'r') as fin:
    frequency_dict = defaultdict(int)
    # Only the first 10,000 characters are used. Note that `fin` is rebound
    # from the file object to the text itself and is reused in later cells.
    fin = fin.read(10000)
    text = [strip_(word) for word in fin.split()]
    for word in text:
        frequency_dict[word] += 1
    text = [replace_(word, frequency_dict) for word in text]

In [419]:
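# Helper that builds a dictionary of Laplace-smoothed probabilities for n-grams of length n from the given text:
def create_dict(n, text):
    counts = defaultdict(int)
    for i in range(0, len(text) - n + 1):
        ngram = " ".join(text[i:i + n]).strip()
        counts[ngram] += 1
    # Add-one (Laplace) smoothing: P = (count + 1) / (N + V), where N is the
    # number of n-gram tokens and V the number of distinct observed n-grams.
    total = sum(counts.values()) + len(counts)
    # Unseen n-grams fall back to the smoothed probability of a zero count.
    unseen = 1.0 / total if total else 1.0
    d = defaultdict(lambda: unseen)
    for key, count in counts.items():
        d[key] = (count + 1.0) / total
    return d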

# Helper that builds a dictionary of dictionaries
# (each key is an integer n-gram length,
# each value is the dictionary of n-grams of that length):
def create_meta_dict(n, text):
    dictionary = defaultdict(dict)
    for i in range(1, n + 1):
        dictionary[i] = create_dict(i, text)
    return dictionary

In [420]:
import re
def tokenize_text(text):
    # Raw string so that the regex escapes are not treated as string escapes
    text = re.split(r"\.|\n|\?|!", text)
    return [strip_(t) for t in text if t]
sentences = tokenize_text(fin)

# To avoid creating too many n-gram dictionaries, cap n at the length of the longest sentence:
max_length = len(max(sentences, key=len).split())
print(max_length)
dictionary = create_meta_dict(max_length, text)

51


In [439]:
# Functions for estimating the probability of a sentence with a bigram model:

def joint(tokens, dictionary):
    x = dictionary[len(tokens.split())][tokens]
    if x != 0.0:
        return x
    else:
        return 1.0
    

def cond_prob(tokens, dictionary):
    joint_prob = joint(tokens, dictionary)
    tokens = tokens.split(" ")
    p = dictionary[1][tokens[0]]
    return joint_prob / p


def sentence_probability(sentence, dictionary):
    sentence = sentence.split()
    sentence = [replace_(word, frequency_dict) for word in sentence]
    # Chain rule under the bigram assumption:
    # P(w1..wn) = P(w1) * P(w2 | w1) * ... * P(wn | wn-1)
    prob = dictionary[1][sentence[0]]
    for i in range(1, len(sentence)):
        # Condition each word on the single preceding word only
        span = sentence[i - 1] + " " + sentence[i]
        prob *= cond_prob(span, dictionary)
    return prob
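
For comparison, the more common textbook form of add-one smoothing is applied directly to the conditional probability: P(w | prev) = (c(prev, w) + 1) / (c(prev) + V), where V is the vocabulary size. The sketch below is a self-contained illustration of that variant, not the notebook's method; the names `train_bigram`, `laplace_cond_prob` and `laplace_sentence_prob` are hypothetical.

```python
from collections import Counter

def train_bigram(tokens):
    """Count unigrams and adjacent word pairs in a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def laplace_cond_prob(prev, word, unigrams, bigrams):
    """Add-one estimate of P(word | prev); V is the number of distinct words."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def laplace_sentence_prob(sentence, unigrams, bigrams):
    """Chain-rule probability of a whitespace-separated sentence."""
    words = sentence.split()
    if not words:
        return 1.0
    total = sum(unigrams.values())
    vocab_size = len(unigrams)
    # The first word has no history, so it uses the smoothed unigram estimate.
    prob = (unigrams[words[0]] + 1) / (total + vocab_size)
    for prev, word in zip(words, words[1:]):
        prob *= laplace_cond_prob(prev, word, unigrams, bigrams)
    return prob

toy_unigrams, toy_bigrams = train_bigram("a b a b a c".split())
print(laplace_cond_prob("a", "b", toy_unigrams, toy_bigrams))
print(laplace_sentence_prob("a b", toy_unigrams, toy_bigrams))
```

With this form, the conditional probabilities of all vocabulary words given a fixed history sum to one, which is not guaranteed when a smoothed joint probability is divided by a smoothed unigram probability.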

## Random Text Generation (Homework)

To generate random text, you need two things:

1. A training corpus.
2. A language model.

The first is easy. Run the line below.

In [None]:
!wget http://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt

The second is more interesting. Following the comments, write a class that implements a very simple **character-level** language model.

In [199]:
from collections import Counter, defaultdict

class LanguageModel:
    def __init__(self, data, order=4):
        self.order = order
        self.ngrams = defaultdict(Counter)
        pad = '~' * order
        data = pad + data
        
        # For each n-gram in data, count all characters following this n-gram.
        # For instance, for order = 2 and data = 'abcbcb', self.ngrams should be the following:
        # self.ngrams['~~']['a'] == 1
        # self.ngrams['~a']['b'] == 1
        # self.ngrams['ab']['c'] == 1
        # self.ngrams['bc']['b'] == 2
        # self.ngrams['cb']['c'] == 1

        for current_index in range(0, len(data) - order):
            # The history is the `order` characters starting at current_index
            ngram = data[current_index:current_index + order]
            self.ngrams[ngram][data[current_index + order]] += 1.

        self.lm = {history: self.normalize(chars) for history, chars in self.ngrams.items()}
    
    def normalize(self, counter):
        total = sum([counter[c] for c in counter])
        return [(c, counter[c]/total) for c in counter]
    
    def __getitem__(self, history):
        return self.lm[history]
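
To check the counting logic by hand, the toy example from the comments (`order = 2`, `data = 'abcbcb'`) can be reproduced in isolation. This is a minimal sketch; `count_followers` is a hypothetical standalone helper that mirrors the loop in `__init__` without the normalization step.

```python
from collections import Counter, defaultdict

def count_followers(data, order):
    """For every history of `order` characters, count the characters that follow it."""
    padded = "~" * order + data
    followers = defaultdict(Counter)
    for i in range(len(padded) - order):
        followers[padded[i:i + order]][padded[i + order]] += 1
    return followers

toy = count_followers("abcbcb", order=2)
for history, chars in toy.items():
    print(repr(history), dict(chars))
```

The history `'bc'` occurs twice in the padded string and is followed by `'b'` both times, so its count is 2.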

Now it is time to train the language model and check the results.

In [201]:
with open('shakespeare_input.txt', 'r') as fin:
    lm = LanguageModel(fin.read())

In [202]:
lm['ello']

[('!', 0.0068143100511073255),
 (' ', 0.013628620102214651),
 ("'", 0.017035775127768313),
 (',', 0.027257240204429302),
 ('.', 0.0068143100511073255),
 ('r', 0.059625212947189095),
 ('u', 0.03747870528109029),
 ('w', 0.817717206132879),
 ('n', 0.0017035775127768314),
 (':', 0.005110732538330494),
 ('?', 0.0068143100511073255)]

In [203]:
lm['Firs']

[('t', 1.0)]

Now let's write a function for generating random texts!

In [204]:
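from random import random

def generate_letter(lm, history):
    # Greedy choice: always return the most probable next character for this
    # history, so generation is deterministic and can fall into loops.
    next_letter = max(lm[history], key=lambda x: x[1])
    return next_letter[0]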
        
def generate_text(lm, n_letters=1000):
    history = '~' * lm.order
    out = []
    for i in range(0, n_letters):
        next_letter = generate_letter(lm, history)
        out.append(next_letter)
        history += next_letter
        history = history[-lm.order:]
    return ''.join(out)
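
`generate_letter` picks the highest-probability character with `max`, so the generated text is the same on every run. A sampling variant might look like the sketch below, which draws the next character in proportion to its probability. The names `sample_letter` and `sample_text` are hypothetical, and the toy model is a plain dict with the same shape as `LanguageModel.lm` (history mapped to a list of `(char, probability)` pairs), an assumption made only to keep the example self-contained.

```python
import random

def sample_letter(distribution, rng=random):
    """Draw one character from a list of (char, probability) pairs."""
    chars, probs = zip(*distribution)
    return rng.choices(chars, weights=probs, k=1)[0]

def sample_text(lm, order, n_letters=1000, rng=random):
    """Generate up to n_letters characters, stopping early on an unseen history."""
    history = "~" * order
    out = []
    for _ in range(n_letters):
        try:
            distribution = lm[history]
        except KeyError:
            break  # this history never occurred in the training data
        next_letter = sample_letter(distribution, rng)
        out.append(next_letter)
        history = (history + next_letter)[-order:]
    return "".join(out)

# Toy order-1 model: after 'a' the next character is 'a' or 'b' with equal probability.
toy_lm = {"~": [("a", 1.0)], "a": [("a", 0.5), ("b", 0.5)], "b": [("a", 1.0)]}
print(sample_text(toy_lm, order=1, n_letters=20, rng=random.Random(0)))
```

Passing a seeded `random.Random` instance makes a run reproducible while still sampling from the full distribution.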

In [173]:
with open('shakespeare_input.txt', 'r') as fin:
    lm = LanguageModel(fin.read())

print(generate_text(lm, 2000))

First Senator:
The shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the shall the s

In [153]:
with open('shakespeare_input.txt', 'r') as fin:
    lm = LanguageModel(fin.read(), 8)
    
print(generate_text(lm, 2000))

First Citizen:
We have seen the charge thee, and the son of my love to her sin and his company.

DUKE VINCENTIO:
I know not what they shall be so bold to say so in this contract of eternal spirit,
And therefore the wind and set forth the same dish? for, in choosing me a cup of sack with him that hath a hand that the streets,
And for the which he hath been a man of the world and a coward the Third, he bid you come to the sea was called the dead bodies shall be so bold to say so in this contract of eternal spirit,
And therefore the wind and set forth the same dish? for, in choosing me a cup of sack with him that hath a hand that the streets,
And for the which he hath been a man of the world and a coward the Third, he bid you come to the sea was called the dead bodies shall be so bold to say so in this contract of eternal spirit,
And therefore the wind and set forth the same dish? for, in choosing me a cup of sack with him that hath a hand that the streets,
And for the which he hath been 

In [155]:
with open('shakespeare_input.txt', 'r') as fin:
    lm = LanguageModel(fin.read(), 16)
    
print(generate_text(lm, 2000))

First Citizen:
Bring him with triumph home unto his house.

Second Citizen:
What he cannot help in his nature, you account a
vice in him. You must in no way say he is covetous.

First Citizen:
We are accounted poor citizens, the patricians of you. For your wants,
Your suffering in this dearth, you may as well
Strike at the heaven with your ships:
They are in readiness.

QUEEN MARGARET:
And take my heart with extreme laughter:
I pry'd me through the hole of this vile wall!

Thisbe:
I kiss the wall's hole, not your lips at all.

Pyramus:
Wilt thou at Ninny's tomb meet me straightway?

Thisbe:
'Tide life, 'tide death, I come without delay.

Wall:
Thus have I, Wall, my part discharged so;
And, being done, thus Wall away doth go.

THESEUS:
Now is the mural down between the two moist elements,
Like Perseus' horse: where's then the saucy boat
Whose weak untimber'd sides but even now
Co-rivall'd greatness? Either to harbour fled,
Or made a toast for Neptune. Even so
Doth valour's show and valo