## Assignment
Task 1.
Train a neural network to solve the Caesar cipher.

What needs to be done:

    Write a Caesar cipher algorithm to generate the dataset (shift every letter by K; for example, with a shift of 2 the Cyrillic letter “А” becomes “В”, and so on)
    Build a neural network
    Train it (input: an encrypted phrase, output: the decrypted phrase)
    Evaluate its quality

Task 2.
Complete the practical exercise from the lecture notebook.

    Build an RNN cell from fully connected layers
    Use the cell to generate text based on lines spoken by characters of “The Simpsons”


### Task 1. Caesar cipher.

##### 1. Preparation

In [1]:
# import libraries
import torch
import random
import time
import datetime
import tqdm

##### 2. Symbol vocabulary
##### We will work with Russian letters, digits, and punctuation marks.

In [2]:
letters = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя 1234567890.,!:;\'\"?()" # full list of allowed symbols
letters_len = 33 # number of letters in the Russian alphabet
full_len = len(letters) # total number of symbols in the list

##### 3. Helper functions

In [3]:
# shift functions for the cipher: only letters are shifted; spaces, digits, and punctuation stay as they are. Both return a list of characters.
def encode(text, k = 5):
    return [letters[(letters.index(c) + k) % letters_len] if letters.index(c) < letters_len else c for c in text ]
def decode(text, k = 5):
    return [letters[(letters.index(c) - k) % letters_len] if letters.index(c) < letters_len else c for c in text ]

In [4]:
# check
enc = encode('ая.')
denc = decode(enc)
print(f"encoded:\t{enc}\ndecoded:\t{denc}")

encoded:	['е', 'д', '.']
decoded:	['а', 'я', '.']
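
The check above covers one phrase and the default key. A broader round-trip check over every possible key could look like the sketch below. The helper `roundtrip_ok` and the sample sentence are illustrative additions, not part of the original notebook; `letters`, `encode`, and `decode` are copied so the block runs on its own.

```python
# Alphabet copied from the notebook: 33 Russian letters first, then the other symbols.
letters = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя 1234567890.,!:;'\"?()"
letters_len = 33

def encode(text, k=5):
    # shift letters only; every other symbol passes through unchanged
    return [letters[(letters.index(c) + k) % letters_len]
            if letters.index(c) < letters_len else c for c in text]

def decode(text, k=5):
    return [letters[(letters.index(c) - k) % letters_len]
            if letters.index(c) < letters_len else c for c in text]

def roundtrip_ok(text, k):
    """Hypothetical helper: True if decoding the encoded text restores the original."""
    return "".join(decode(encode(text, k), k)) == text

sample = "съешь ещё этих мягких булок, да выпей чаю! 123"
failed_keys = [k for k in range(letters_len) if not roundtrip_ok(sample, k)]
print(failed_keys)  # an empty list means every key round-trips
```

Python's `%` returns a non-negative result for a positive modulus, which is why `decode` needs no special handling for negative indices.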


In [5]:
# functions that convert characters to indices and back
def text_to_idx(text):
    indices = torch.zeros(len(text))
    for i in range(len(text)):
        indices[i] = letters.index(text[i])
    return indices.int()

def idx_to_text(indices):
    text = ""
    for i in indices:
        text += letters[int(i)]
    return text

In [6]:
# check
print(text_to_idx('ая.'))
print(idx_to_text(text_to_idx('ая.')))

tensor([ 0, 32, 44], dtype=torch.int32)
ая.


In [7]:
# functions that encrypt and decrypt a tensor of indices by shifting it (they assume 0 <= k <= letters_len)
def encode_idx(idx_tens, k = 5):
    result = idx_tens.clone().detach()
    mask1 = result < letters_len
    result[mask1] += k
    mask2 = mask1 & (result > letters_len - 1)
    result[mask2] -= letters_len
    return result

def decode_idx(idx_tens, k = 5):
    result = idx_tens.clone().detach()
    mask1 = result < letters_len
    result[mask1] -= k
    mask2 = mask1 & (result < 0)
    result[mask2] += letters_len
    return result

In [8]:
print(encode_idx(text_to_idx('ая.')))
print(decode_idx(encode_idx(text_to_idx('ая.'))))

tensor([ 5,  4, 44], dtype=torch.int32)
tensor([ 0, 32, 44], dtype=torch.int32)
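
The two mask-based functions above can also be written as a single modular shift. The sketch below is one possible variant, not the notebook's own code: `shift_idx` is a hypothetical name, and it relies on the `%` operator of PyTorch tensors following Python's sign convention (the result takes the sign of the divisor).

```python
import torch

letters_len = 33  # same meaning as in the notebook: indices 0..32 are letters

def shift_idx(idx_tens, k):
    """Shift letter indices by k modulo letters_len; leave other symbols unchanged.
    A negative k undoes a positive shift."""
    is_letter = idx_tens < letters_len
    shifted = (idx_tens + k) % letters_len  # always lands in 0..letters_len-1
    return torch.where(is_letter, shifted, idx_tens)

idx = torch.tensor([0, 32, 44])  # indices of 'а', 'я', '.'
enc = shift_idx(idx, 5)
dec = shift_idx(enc, -5)
print(enc.tolist(), dec.tolist())  # [5, 4, 44] [0, 32, 44]
```

Unlike the mask version, which subtracts or adds `letters_len` only once, this form also accepts keys outside the range 0..33.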


In [9]:
# check encryption and decryption
input_text = ('Проверочный текст для шифрования 123 !').lower()
shift = 6
encoded_text = text_to_idx(input_text)
encoded_tensor = encode_idx(encoded_text, shift)
decoded_tensor = decode_idx(encoded_tensor, shift)
decoded_text = idx_to_text(decoded_tensor)
print(f"input text:\t\t{input_text}\nencryption key:\t{shift}\n{encoded_text}\n{encoded_tensor}\n{decoded_tensor}\noutput text:\t{decoded_text}")

input text:		проверочный текст для шифрования 123 !
encryption key:	6
tensor([16, 17, 15,  2,  5, 17, 15, 24, 14, 28, 10, 33, 19,  5, 11, 18, 19, 33,
         4, 12, 32, 33, 25,  9, 21, 17, 15,  2,  0, 14,  9, 32, 33, 34, 35, 36,
        33, 46], dtype=torch.int32)
tensor([22, 23, 21,  8, 11, 23, 21, 30, 20,  1, 16, 33, 25, 11, 17, 24, 25, 33,
        10, 18,  5, 33, 31, 15, 27, 23, 21,  8,  6, 20, 15,  5, 33, 34, 35, 36,
        33, 46], dtype=torch.int32)
tensor([16, 17, 15,  2,  5, 17, 15, 24, 14, 28, 10, 33, 19,  5, 11, 18, 19, 33,
         4, 12, 32, 33, 25,  9, 21, 17, 15,  2,  0, 14,  9, 32, 33, 34, 35, 36,
        33, 46], dtype=torch.int32)
output text:	проверочный текст для шифрования 123 !


##### 4. Building and training the model.

##### The idea is to make the model nearly universal, i.e. to train it on a sufficiently large set of symbols with different random shifts, so that it can decrypt arbitrary text encrypted with the Caesar cipher. The run below uses a single key, k = 5.

In [10]:
# training data: random indices of symbols from the vocabulary
DATA_CHARS = 10000
# the upper bound of torch.randint is exclusive, so full_len is needed to cover every symbol
X_train = torch.randint(0, full_len, (DATA_CHARS,))
X_test = torch.randint(0, full_len, (int(0.3 * DATA_CHARS),))

In [11]:
embedding = torch.nn.Embedding(full_len, 30)
# embedding(X_train[:4]).shape

In [12]:
model = torch.nn.Sequential(
    torch.nn.Embedding(full_len, 128),
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, full_len)
)

In [13]:
loss = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=.01)
BATCH_SIZE = 100

In [14]:
k = range(1,10)
k

range(1, 10)

In [15]:
k = 5  # a single fixed key is used for training
for ep in range(10):
    start = time.time()
    train_loss = 0.
    train_passed = 0
    for i in range(int(len(X_train) / BATCH_SIZE)):
        batch = X_train[i * BATCH_SIZE:(i + 1) * BATCH_SIZE]  # treated as encrypted symbols
        Y = decode_idx(batch, k)  # targets: the same symbols shifted back by k
        optimizer.zero_grad()
        answers = model(batch)
        l = loss(answers, Y)
        train_loss += l.item()
        l.backward()
        optimizer.step()
        train_passed += 1
    print("Epoch {}. Time: {:.3f}, Train loss: {:.3f}".format(ep, time.time() - start, train_loss / train_passed))


Epoch 0. Time: 0.166, Train loss: 0.212
Epoch 1. Time: 0.067, Train loss: 0.000
Epoch 2. Time: 0.066, Train loss: 0.000
Epoch 3. Time: 0.070, Train loss: 0.000
Epoch 4. Time: 0.075, Train loss: 0.000
Epoch 5. Time: 0.067, Train loss: 0.000
Epoch 6. Time: 0.067, Train loss: 0.000
Epoch 7. Time: 0.072, Train loss: 0.000
Epoch 8. Time: 0.071, Train loss: 0.000
Epoch 9. Time: 0.071, Train loss: 0.000


##### Check. The model reaches 100% accuracy, with one caveat: it is quite simple and not universal, because it was trained on a single Caesar cipher key and, naturally, decrypts phrases encrypted with that key perfectly.
##### The ideal would be a model that decrypts phrases encrypted with any key. As far as I can tell, such a model would have to be trained on a very large text encrypted with different keys, which I cannot manage right now.

In [16]:
Y_test = decode_idx(X_test)
accuracy = float((model(X_test).argmax(axis=1) == Y_test).sum() / len(Y_test))
accuracy

1.0

In [17]:
phrase = "проверка 1 точности 2, модели"
encoded = encode(phrase, 5)
indices = text_to_idx(encoded)
decoded = model(indices).argmax(axis = 1)
idx_to_text(decoded)

'проверка 1 точности 2, модели'
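
The key-agnostic model mentioned above would need training pairs built from real text, because uniformly random index sequences such as `X_train` carry no information about the key. It would also need a model that sees whole sequences; a per-symbol model maps each input index to exactly one output. The sketch below covers only the data-generation step. `make_pairs`, `copies_per_phrase`, and the tiny `corpus` are hypothetical placeholders, and `encode` is redefined here to return a string.

```python
import random

letters = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя 1234567890.,!:;'\"?()"
letters_len = 33  # only the first 33 symbols (the letters) are shifted

def encode(text, k):
    # string-returning variant of the notebook's encode; negative k decrypts
    return "".join(letters[(letters.index(c) + k) % letters_len]
                   if letters.index(c) < letters_len else c for c in text)

def make_pairs(phrases, copies_per_phrase=3, seed=0):
    """Encrypt every phrase several times, each time with a random key in 1..32.
    Returns a list of (encrypted, plain, key) triples."""
    rng = random.Random(seed)
    pairs = []
    for phrase in phrases:
        for _ in range(copies_per_phrase):
            key = rng.randrange(1, letters_len)
            pairs.append((encode(phrase, key), phrase, key))
    return pairs

# placeholder corpus; a real run would need a large body of Russian text
corpus = ["проверка точности модели", "шифр цезаря", "текст для шифрования 123"]
pairs = make_pairs(corpus)
print(len(pairs))  # 9
```

Keeping the key in each triple makes it possible to train either a direct decryption model or a model that first predicts the key.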

### Task 2. Simpsons phrases.

##### 1. Preparation

In [18]:
# import libraries
import torch
import torch.nn as nn
from collections import Counter
import string
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm

##### 2. Settings and helper functions

In [19]:
torch.autograd.set_detect_anomaly(True)
batch_size = 100
seq_size = 32
embedding_size = 64
lstm_size = 64
gradients_norm = 5
device = "cpu"

In [20]:
# bag of words: split every line into words and collect them in one list
def doc2words(doc):
    words=[]
    for line in doc:
        words += line.split()
    return words

In [21]:
# remove punctuation
def removepunct(words):
    punct = set(string.punctuation)
    words = [''.join([char for char in list(word) if char not in punct]) for word in words]
    return words

In [22]:
# batch generation function
def get_batches(words, vocab_to_int, batch_size, seq_size):
    # Build Xs and Ys with shape (batch_size * num_batches, seq_size)
    word_ints = [vocab_to_int[word] for word in words]
    num_batches = int(len(word_ints) / (batch_size * seq_size))
    Xs = word_ints[:num_batches*batch_size*seq_size]
    Ys = np.zeros_like(Xs)
    Ys[:-1] = Xs[1:]
    Ys[-1] = Xs[0]
    Xs = np.reshape(Xs, (num_batches*batch_size, seq_size))
    Ys = np.reshape(Ys, (num_batches*batch_size, seq_size))
    
    # iterate over rows of Xs and Ys to generate batches
    for i in range(0, num_batches*batch_size, batch_size):
        yield Xs[i:i+batch_size, :], Ys[i:i+batch_size, :]
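
The relationship between `Xs` and `Ys` in `get_batches` is easier to see on a toy sequence: each target is the next token of the input, and the very last target wraps around to the first token. The pure-Python sketch below illustrates only that shift and the row layout; `shifted_targets` and `to_rows` are hypothetical names, not functions from the notebook.

```python
def shifted_targets(tokens):
    """Targets are the inputs moved one step left; the last target wraps to the first token."""
    return tokens[1:] + tokens[:1]

def to_rows(tokens, seq_size):
    # cut a flat token list into consecutive rows of length seq_size
    return [tokens[i:i + seq_size] for i in range(0, len(tokens), seq_size)]

tokens = [10, 11, 12, 13, 14, 15]  # stand-in word indices
xs = to_rows(tokens, seq_size=3)
ys = to_rows(shifted_targets(tokens), seq_size=3)
print(xs)  # [[10, 11, 12], [13, 14, 15]]
print(ys)  # [[11, 12, 13], [14, 15, 10]]
```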

In [23]:
# network definition
class RNNModule(nn.Module):
    # initialize RNN module
    def __init__(self, n_vocab, seq_size=32, embedding_size=64, lstm_size=64):
        super(RNNModule, self).__init__()
        self.seq_size = seq_size
        self.lstm_size = lstm_size
        self.embedding = nn.Embedding(n_vocab, embedding_size)
        self.lstm = nn.LSTM(embedding_size,
                            lstm_size,
                            batch_first=True)
        self.dense = nn.Linear(lstm_size, n_vocab)
        
    def forward(self, x, prev_state):
        embed = self.embedding(x)
        output, state = self.lstm(embed, prev_state)
        logits = self.dense(output)

        return logits, state
    
    def zero_state(self, batch_size):
        return (torch.zeros(1, batch_size, self.lstm_size),torch.zeros(1, batch_size, self.lstm_size))

In [24]:
def vocab_map(vocab):
    # two dictionaries: int to word and word to int
    int_to_vocab = {k:w for k,w in enumerate(vocab)}
    vocab_to_int = {w:k for k,w in int_to_vocab.items()}
    return int_to_vocab, vocab_to_int

In [25]:
def getvocab(words):
    # vocabulary built from the bag of words, sorted by ascending frequency
    wordfreq = Counter(words)
    sorted_wordfreq = sorted(wordfreq, key=wordfreq.get)
    return sorted_wordfreq
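
A toy run shows what the two vocabulary helpers produce. The functions are copied from the cells above so the block runs on its own; the six-word input is made up for illustration.

```python
from collections import Counter

def getvocab(words):
    wordfreq = Counter(words)
    return sorted(wordfreq, key=wordfreq.get)  # rarest words come first

def vocab_map(vocab):
    int_to_vocab = {k: w for k, w in enumerate(vocab)}
    vocab_to_int = {w: k for k, w in int_to_vocab.items()}
    return int_to_vocab, vocab_to_int

words = "hey you hey homer hey you".split()
vocab = getvocab(words)
int_to_vocab, vocab_to_int = vocab_map(vocab)
print(vocab)         # ['homer', 'you', 'hey']
print(vocab_to_int)  # {'homer': 0, 'you': 1, 'hey': 2}
```

Because the sort is ascending, the most frequent words receive the largest indices.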

In [26]:
# loss function and optimizer
def get_loss_and_train_op(net, lr=0.001):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)

    return criterion, optimizer

In [27]:
# text generation
def generate_text(device, net, words, n_vocab, vocab_to_int, int_to_vocab, top_k=5):
    net.eval()

    state_h, state_c = net.zero_state(1)
    state_h = state_h.to(device)
    state_c = state_c.to(device)
    for w in words:
        ix = torch.tensor([[vocab_to_int[w]]]).to(device)
        output, (state_h, state_c) = net(ix, (state_h, state_c))
    
    _, top_ix = torch.topk(output[0], k=top_k)
    choices = top_ix.tolist()
    choice = np.random.choice(choices[0])

    words.append(int_to_vocab[choice])
    
    for _ in range(100):
        ix = torch.tensor([[choice]]).to(device)
        output, (state_h, state_c) = net(ix, (state_h, state_c))

        _, top_ix = torch.topk(output[0], k=top_k)
        choices = top_ix.tolist()
        choice = np.random.choice(choices[0])
        words.append(int_to_vocab[choice])

    print(' '.join(words))
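
The sampling step in `generate_text` takes the `top_k` highest-scoring words and picks one of them uniformly, so the relative sizes of the logits inside the top group do not matter. The pure-Python sketch below reproduces that behavior on a made-up score list; `sample_top_k` is a hypothetical helper, not part of the notebook.

```python
import random

def sample_top_k(scores, top_k=5, rng=random):
    """Pick one index uniformly at random among the top_k highest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return rng.choice(ranked[:top_k])

scores = [0.1, 2.5, -1.0, 3.2, 0.7, 1.9]  # stand-in logits for a 6-word vocabulary
rng = random.Random(0)
picks = {sample_top_k(scores, top_k=3, rng=rng) for _ in range(200)}
print(sorted(picks))  # only indices of the three largest scores can appear
```

Weighting the candidates by their softmax probabilities would be an alternative design; the uniform choice trades some fluency for more varied output.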

In [28]:
# network training function
def train_rnn(words, vocab_to_int, int_to_vocab, n_vocab):
    
    # RNN cell (here an nn.LSTM-based module)
    net = RNNModule(n_vocab, seq_size, embedding_size, lstm_size)
    net = net.to(device)
    criterion, optimizer = get_loss_and_train_op(net, 0.01)

    iteration = 0
    
    # iterate over epochs
    for epoch in tqdm(range(10)): # train for 10 epochs
        # get the batches
        batches = get_batches(words, vocab_to_int, batch_size, seq_size)
        # initialize the hidden state and the cell state
        state_h, state_c = net.zero_state(batch_size)

        # move the states to the target device
        state_h = state_h.to(device)
        state_c = state_c.to(device)
        # iterate over batches
        for x, y in tqdm(batches):
            iteration += 1

            # switch to training mode
            net.train()

            # reset the gradients
            optimizer.zero_grad()

            # move x and y to the target device
            x = torch.tensor(x).to(device)
            y = torch.tensor(y).to(device)
            
            # the model returns the logits and the new LSTM state (hidden and cell)
            logits, (state_h, state_c) = net(x, (state_h, state_c))
            loss = criterion(logits.transpose(1, 2), y)

            state_h = state_h.detach()
            state_c = state_c.detach()

            loss_value = loss.item()

            # back-propagation
            loss.backward(retain_graph=True)

            _ = torch.nn.utils.clip_grad_norm_(net.parameters(), gradients_norm)
            
            # update the parameters with one optimizer step
            optimizer.step()

            if iteration % 100 == 0:
                print('Epoch: {}/{}'.format(epoch, 10),'Iteration: {}'.format(iteration),'Loss: {}'.format(loss_value))

            # if iteration % 1000 == 0:
                # predict(device, net, flags.initial_words, n_vocab,vocab_to_int, int_to_vocab, top_k=5)
                # torch.save(net.state_dict(),'checkpoint_pt/model-{}.pth'.format(iteration))
                
    return net

##### Loading and preprocessing the data

In [29]:
df = pd.read_csv('/home/vk/OneDrive/Образование/Слава/Нетология/16. DLL-30 Deep Learning/Материалы/simpsons_script_lines.csv')
df.head()


Unnamed: 0,id,episode_id,number,raw_text,timestamp_in_ms,speaking_line,character_id,location_id,raw_character_text,raw_location_text,spoken_words,normalized_text,word_count
0,9549,32,209,"Miss Hoover: No, actually, it was a little of ...",848000,True,464.0,3.0,Miss Hoover,Springfield Elementary School,"No, actually, it was a little of both. Sometim...",no actually it was a little of both sometimes ...,31
1,9550,32,210,Lisa Simpson: (NEAR TEARS) Where's Mr. Bergstrom?,856000,True,9.0,3.0,Lisa Simpson,Springfield Elementary School,Where's Mr. Bergstrom?,wheres mr bergstrom,3
2,9551,32,211,Miss Hoover: I don't know. Although I'd sure l...,856000,True,464.0,3.0,Miss Hoover,Springfield Elementary School,I don't know. Although I'd sure like to talk t...,i dont know although id sure like to talk to h...,22
3,9552,32,212,Lisa Simpson: That life is worth living.,864000,True,9.0,3.0,Lisa Simpson,Springfield Elementary School,That life is worth living.,that life is worth living,5
4,9553,32,213,Edna Krabappel-Flanders: The polls will be ope...,864000,True,40.0,3.0,Edna Krabappel-Flanders,Springfield Elementary School,The polls will be open from now until the end ...,the polls will be open from now until the end ...,33


In [30]:
phrases = df['normalized_text'].astype(str).tolist()  # column with the preprocessed text; missing values become the string 'nan'
phrases[:10]

['no actually it was a little of both sometimes when a disease is in all the magazines and all the news shows its only natural that you think you have it',
 'wheres mr bergstrom',
 'i dont know although id sure like to talk to him he didnt touch my lesson plan what did he teach you',
 'that life is worth living',
 'the polls will be open from now until the end of recess now just in case any of you have decided to put any thought into this well have our final statements martin',
 'i dont think theres anything left to say',
 'bart',
 'victory party under the slide',
 'nan',
 'mr bergstrom mr bergstrom']

In [32]:
# build the bag of words and strip punctuation
words = removepunct(doc2words(phrases))
# vocabulary from the bag of words
vocab = getvocab(words)
# two dictionaries: int_to_vocab and vocab_to_int
int_to_vocab, vocab_to_int = vocab_map(vocab)

In [33]:
rnn_net = train_rnn(words, vocab_to_int, int_to_vocab, len(vocab))

Epoch: 0/10 Iteration: 100 Loss: 6.873432636260986
Epoch: 0/10 Iteration: 200 Loss: 6.394588470458984
Epoch: 0/10 Iteration: 300 Loss: 6.195881366729736
Epoch: 0/10 Iteration: 400 Loss: 6.484381198883057


Epoch: 1/10 Iteration: 500 Loss: 6.052926540374756
Epoch: 1/10 Iteration: 600 Loss: 5.834512233734131
Epoch: 1/10 Iteration: 700 Loss: 5.844120502471924
Epoch: 1/10 Iteration: 800 Loss: 5.741043567657471


Epoch: 2/10 Iteration: 900 Loss: 5.632015228271484
Epoch: 2/10 Iteration: 1000 Loss: 5.5734639167785645
Epoch: 2/10 Iteration: 1100 Loss: 5.623622894287109
Epoch: 2/10 Iteration: 1200 Loss: 5.638889789581299


Epoch: 3/10 Iteration: 1300 Loss: 5.701300621032715
Epoch: 3/10 Iteration: 1400 Loss: 5.505397796630859
Epoch: 3/10 Iteration: 1500 Loss: 5.360555648803711
Epoch: 3/10 Iteration: 1600 Loss: 5.492495536804199


Epoch: 4/10 Iteration: 1700 Loss: 5.550518989562988
Epoch: 4/10 Iteration: 1800 Loss: 5.295286655426025
Epoch: 4/10 Iteration: 1900 Loss: 5.394376754760742
Epoch: 4/10 Iteration: 2000 Loss: 5.351922035217285


Epoch: 5/10 Iteration: 2100 Loss: 5.246644973754883
Epoch: 5/10 Iteration: 2200 Loss: 5.304635047912598
Epoch: 5/10 Iteration: 2300 Loss: 5.224936485290527
Epoch: 5/10 Iteration: 2400 Loss: 5.135441303253174


Epoch: 6/10 Iteration: 2500 Loss: 5.088631629943848
Epoch: 6/10 Iteration: 2600 Loss: 5.22064208984375
Epoch: 6/10 Iteration: 2700 Loss: 5.239499092102051
Epoch: 6/10 Iteration: 2800 Loss: 5.217926025390625
Epoch: 6/10 Iteration: 2900 Loss: 5.269838333129883


Epoch: 7/10 Iteration: 3000 Loss: 5.032680988311768
Epoch: 7/10 Iteration: 3100 Loss: 5.1731791496276855
Epoch: 7/10 Iteration: 3200 Loss: 5.002187728881836
Epoch: 7/10 Iteration: 3300 Loss: 5.08482551574707


Epoch: 8/10 Iteration: 3400 Loss: 5.261453151702881
Epoch: 8/10 Iteration: 3500 Loss: 5.096644401550293
Epoch: 8/10 Iteration: 3600 Loss: 5.008889675140381
Epoch: 8/10 Iteration: 3700 Loss: 4.8966875076293945


Epoch: 9/10 Iteration: 3800 Loss: 4.804150104522705
Epoch: 9/10 Iteration: 3900 Loss: 5.081767559051514
Epoch: 9/10 Iteration: 4000 Loss: 4.911337375640869
Epoch: 9/10 Iteration: 4100 Loss: 4.930315017700195
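
One way to read the logged values is to convert them to perplexity. `nn.CrossEntropyLoss` uses natural logarithms, so perplexity is `exp(loss)`. The sketch below applies this to the first and last logged values, which moves the figure from roughly 970 to roughly 140; these are single-batch losses on training data, so the numbers are only a rough indication. The helper name `perplexity` is an illustrative addition.

```python
import math

def perplexity(cross_entropy):
    # cross-entropy in nats -> perplexity
    return math.exp(cross_entropy)

first_logged = 6.873432636260986  # iteration 100 in the log above
last_logged = 4.930315017700195   # iteration 4100 in the log above
print(round(perplexity(first_logged)), round(perplexity(last_logged)))
```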


##### Generating lines

In [34]:
generate_text(device, rnn_net, ['hey', 'you'], len(vocab), vocab_to_int, int_to_vocab)

hey you can see what i said you can see that the big leagues oh my dear boy i dont think im not gonna be on your hands are up in a few days ago you can watch out your heart nan hey hey homer i dont care what you want that the big step to me but im not talking that was a good one night nan oh no i was wondering what i can have a big disappointment and now i love the truth is a very bad idea to me and a lot like a good day i think im
