In [0]:
!pip3 -qq install torch==0.4.1
!pip install -q --upgrade nltk gensim bokeh pandas

!wget -O quora.zip -qq --no-check-certificate "https://drive.google.com/uc?export=download&id=1ERtxpdWOgGQ3HOigqAMHTJjmOE_tWvoF"
!unzip quora.zip

import nltk
nltk.download('punkt')

In [0]:
import time
import math
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import clear_output
%matplotlib inline

import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim 

np.random.seed(42)

# PyTorch

PyTorch is one of the most popular frameworks for working with neural networks.

Why PyTorch in particular? It is pleasant to use, Pythonic, and easier to debug than heavyweights like TensorFlow (although TF 2.0 with eager execution should be about the same).

And anyway, we are here to learn neural networks, not frameworks :)

## Automatic differentiation

### Computation graphs

Computation graphs are a convenient way to quickly compute gradients of arbitrarily complex functions.

For example, the function

$$f = (x + y) \cdot z$$

can be represented as a graph:

![graph](https://github.com/DanAnastasyev/DeepNLP-Course/raw/master/Week%2003/Images/Circuit.png =500x)  
*From [Backpropagation, Intuitions - CS231n](http://cs231n.github.io/optimization-2/)*

**Task** Assign values to $x, y, z$ (shown in green in the picture). How would you compute $\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z}$? (*Recall what backpropagation is.*)

In PyTorch such computations are very simple.

First we define the function, which is just a sequence of operations:

In [0]:
x = torch.tensor(-2., requires_grad=True)
y = torch.tensor(5., requires_grad=True)
z = torch.tensor(-4., requires_grad=True)

q = x + y
f = q * z


![graph](https://raw.githubusercontent.com/pytorch/pytorch/master/docs/source/_static/img/dynamic_graph.gif)  
*From [github.com/pytorch/pytorch](https://github.com/pytorch/pytorch)*

From the described sequence of operations, a computation graph is built *on the fly*, and the backward pass is then performed on it.

This is the key difference from TensorFlow: the graph does not have to be compiled before the code runs, which lets you control its structure more flexibly.

In [0]:
f.backward()

print('df/dz =', z.grad)
print('df/dx =', x.grad)
print('df/dy =', y.grad)

Calling `backward()` computes gradients for every tensor in the graph that has `requires_grad == True`.
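
To see what `backward()` does under the hood, here is a minimal sketch of the same backward pass done by hand in plain Python, using the values from the cell above. The variable names `df_dq`, `df_dx`, `df_dy`, `df_dz` are introduced only for this illustration.

```python
# Forward pass for f = (x + y) * z with the same values as above.
x, y, z = -2.0, 5.0, -4.0
q = x + y            # q = 3.0
f = q * z            # f = -12.0

# Backward pass: apply the chain rule node by node, from f back to the inputs.
df_dq = z            # d(q * z)/dq = z
df_dz = q            # d(q * z)/dz = q
df_dx = df_dq * 1.0  # dq/dx = 1
df_dy = df_dq * 1.0  # dq/dy = 1

print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0
```

These are the same values that `x.grad`, `y.grad` and `z.grad` hold after `f.backward()`.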

There is also a way to skip gradient computation altogether: context managers (see [Locally disabling gradient computation](https://pytorch.org/docs/stable/autograd.html#locally-disabling-gradient-computation)):
```python
torch.autograd.no_grad()
torch.autograd.enable_grad()
torch.autograd.set_grad_enabled(mode)

```

In [0]:
with torch.autograd.no_grad():
    x = torch.tensor(-2., requires_grad=True)
    y = torch.tensor(5., requires_grad=True)
    q = x + y

z = torch.tensor(-4., requires_grad=True)
f = q * z

f.backward()

print('df/dz =', z.grad)
print('df/dx =', x.grad)
print('df/dy =', y.grad)

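Here `q` was computed inside `no_grad`, so it is not connected to `x` and `y` in the graph: only `z.grad` gets populated, while `x.grad` and `y.grad` stay `None`.

To learn more about how autograd works, see [Autograd mechanics](https://pytorch.org/docs/stable/notes/autograd.html).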

In general, a tensor in PyTorch is the analogue of a multidimensional array in NumPy.

It stores the data:

In [0]:
x.data

The accumulated gradient:

In [0]:
x.grad

The function that computes the gradient:

In [0]:
q.grad_fn

And various additional metadata:

In [0]:
x.type(), x.shape, x.device, x.layout

Why... I have just one question: why do we need all this?

### Warm-up problem

To figure it out, let's solve a simple linear regression problem:

In [0]:
w_orig, b_orig = 2.6, -0.4

X = np.random.rand(100) * 10. - 5.
y_orig = w_orig * X + b_orig

y = y_orig + np.random.randn(100)

plt.plot(X, y, '.')
plt.plot(X, y_orig)
plt.show()

Yes, this is where we bolt on backpropagation.

There are two parameters, $w$ and $b$; we need to fit them so that they end up as close as possible to the original $w_{orig}, b_{orig}$.

What will we optimize? The MSE:

$$J(w, b) = \frac{1}{N} \sum_{i=1}^N || \hat y_i - y_i(w, b)||^2 =\frac{1}{N} \sum_{i=1}^N || \hat y_i - (w \cdot x_i + b)||^2. $$

With this loss function we can run plain gradient descent (not even stochastic yet):

$$w_{t+1} := w_t - \alpha \cdot \frac{\partial J}{\partial w}(w_t, b_t)$$
$$b_{t+1} := b_t - \alpha \cdot \frac{\partial J}{\partial b}(w_t, b_t)$$

**Task** Implement the optimization in pure NumPy.

To do this you need to:
1. Compute the value of the function on the forward pass: $y(w, b) = w \cdot x + b$;
2. Derive and compute the gradients $\frac{\partial J}{\partial w}, \frac{\partial J}{\partial b}$ on the backward pass;
3. Step $w, b$ along the negative gradients.
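
The three steps above can be sketched as follows. This is only one possible solution, not the reference one: it generates its own data with a local `RandomState` (an assumption made to keep the block self-contained), skips plotting, and runs 1000 iterations instead of 100 so that `b` has time to converge with `alpha = 0.01`.

```python
import numpy as np

rng = np.random.RandomState(42)

# Synthetic data generated the same way as in the cell above.
w_orig, b_orig = 2.6, -0.4
X = rng.rand(100) * 10. - 5.
y = w_orig * X + b_orig + rng.randn(100)

w, b = rng.randn(), rng.randn()
alpha = 0.01

for epoch in range(1000):
    y_pred = w * X + b                 # 1. forward pass
    error = y_pred - y
    loss = np.mean(error ** 2)         # MSE

    # 2. gradients of the MSE, derived by hand:
    #    dJ/dw = 2/N * sum((y_pred - y) * x),  dJ/db = 2/N * sum(y_pred - y)
    w_grad = 2. * np.mean(error * X)
    b_grad = 2. * np.mean(error)

    # 3. step along the negative gradients
    w -= alpha * w_grad
    b -= alpha * b_grad

print('w = {:.2f}, b = {:.2f}, loss = {:.2f}'.format(w, b, loss))
```

The fitted `w` and `b` should land close to `w_orig` and `b_orig`, and the final loss should be close to the variance of the added noise.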

In [0]:
def display_progress(epoch, loss, w, b, X, y, y_pred):
    clear_output(True)
    print('Epoch = {}, Loss = {}, w = {}, b = {}'.format(epoch, loss, w, b))
    plt.plot(X, y, '.')
    plt.plot(X, y_pred)
    plt.show()
    time.sleep(1)


w = np.random.randn()
b = np.random.randn()

alpha = 0.01

for i in range(100):
    y_pred = <calc it>

    loss = <and it>

    <find w_grad and b_grad>

    w -= alpha * w_grad
    b -= alpha * b_grad
    
    if (i + 1) % 5 == 0:
        display_progress(i + 1, loss, w, b, X, y, y_pred)

In PyTorch the same thing is somewhat simpler: the forward pass can be copied almost verbatim.

We already know how to do the backward pass: just call `loss.backward()`.

Updating `w` and `b` requires keeping a couple of things in mind. First, PyTorch will not let you update them directly:

In [0]:
w = torch.randn(1, requires_grad=True)

w -= 1.

The problem is that in-place operations are hard for autograd to support ([In-place operations with autograd](https://pytorch.org/docs/stable/notes/autograd.html#in-place-operations-with-autograd)).

But we don't need gradient support here! We will never run a backward pass through this operation; we only need to update the value of the variable. To do that, you can use the `no_grad` context, or update the buffer that the tensor uses directly:

In [0]:
w.data -= 1.

Another thing to remember is that gradients accumulate in tensors. Between calls to `loss.backward()` we need to reset the gradients of `w` and `b`:

```python
w.grad.zero_()
b.grad.zero_()
```
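
A small sketch of both points, on a toy scalar parameter rather than the regression itself: gradients add up across `backward()` calls until they are zeroed, and an update inside `torch.no_grad()` is not tracked by autograd. The names `accumulated` and `fresh` are illustrative only.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

# Two backward passes without resetting: the gradients add up.
(3.0 * w).backward()
(3.0 * w).backward()
accumulated = w.grad.item()    # 3 + 3

# Reset, then the next backward pass starts from zero again.
w.grad.zero_()
(3.0 * w).backward()
fresh = w.grad.item()

# Parameter update that autograd does not track.
with torch.no_grad():
    w -= 0.1 * w.grad

print(accumulated, fresh, w.item())
```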

**Task** Implement linear regression on pytorch.

In [0]:
X = torch.as_tensor(X).float()
y = torch.as_tensor(y).float()

w = torch.randn(1, requires_grad=True)
b = torch.randn(1, requires_grad=True)

for i in range(100):
    <copy forward pass and add backward pass + parameters updates>
    
    if (i + 1) % 5 == 0:
        display_progress(i + 1, loss, w.item(), b.item(), 
                         X.data.numpy(), y.data.numpy(), y_pred.data.numpy())

Much less thinking required now, right? :)

You can read about other low-level PyTorch tricks here: [PyTorch: your new deep learning framework](https://habr.com/post/334380/) (in Russian; the article is fun but somewhat dated, so read it alongside the [PyTorch 0.4.0 Migration Guide](https://pytorch.org/blog/pytorch-0_4_0-migration-guide/)).

## Word embeddings and the high-level PyTorch API

Let's look at the high-level API, which already implements various building-block classes for training neural networks.

We'll tackle the same problem as last time, word embeddings, except that now we'll train them ourselves!

First we need to prepare the training data.

Collect and tokenize the texts:

In [0]:
import pandas as pd
from nltk.tokenize import word_tokenize

quora_data = pd.read_csv('train.csv')

quora_data.question1 = quora_data.question1.replace(np.nan, '', regex=True)
quora_data.question2 = quora_data.question2.replace(np.nan, '', regex=True)

texts = list(pd.concat([quora_data.question1, quora_data.question2]).unique())

tokenized_texts = [word_tokenize(text.lower()) for text in texts]

Build an index of the most frequent words:

In [0]:
from collections import Counter

MIN_COUNT = 5

words_counter = Counter(token for tokens in tokenized_texts for token in tokens)
word2index = {
    '<unk>': 0
}

for word, count in words_counter.most_common():
    if count < MIN_COUNT:
        break
        
    word2index[word] = len(word2index)
    
index2word = [word for word, _ in sorted(word2index.items(), key=lambda x: x[1])]
    
print('Vocabulary size:', len(word2index))
print('Tokens count:', sum(len(tokens) for tokens in tokenized_texts))
print('Unknown tokens appeared:', sum(1 for tokens in tokenized_texts for token in tokens if token not in word2index))
print('Most freq words:', index2word[1:21])
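
To make the indexing logic easier to follow, here is a minimal sketch on three made-up tokenized texts. The names `toy_texts`, `toy_word2index` and the threshold `min_count = 2` are assumptions for this illustration only.

```python
from collections import Counter

toy_texts = [['how', 'to', 'learn'], ['how', 'to', 'cook'], ['how', 'are', 'you']]
min_count = 2                                   # toy threshold, smaller than MIN_COUNT above

counter = Counter(token for tokens in toy_texts for token in tokens)
toy_word2index = {'<unk>': 0}
for word, count in counter.most_common():
    if count < min_count:
        break
    toy_word2index[word] = len(toy_word2index)

# Words below the threshold fall back to index 0, i.e. '<unk>'.
encoded = [toy_word2index.get(token, 0) for token in ['how', 'to', 'cook']]
print(toy_word2index, encoded)
```

Because `most_common()` returns words in descending order of frequency, the loop can stop at the first word below the threshold.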

### Skip-Gram Word2vec

Let's start with the skip-gram model for training word2vec.

It is a simple model with just two layers. The idea is to learn embedding vectors from which the context of the corresponding words can be predicted as well as possible. In other words, if we have learned to encode well the words that a given word co-occurs with, then we know something about the word itself. For example, words that appear in the same contexts (say, `apple` and `orange`) will naturally end up with similar embedding vectors.

![](https://github.com/DanAnastasyev/DeepNLP-Course/raw/master/Week%2003/Images/Word2vecExample.jpeg =600x)  
*From cs224n, Lecture 2*

To do this we model the probabilities $\{P(w_{c+j}|w_c):  j = -k, \ldots, k,\ j \neq 0\}$, where $k$ is the context window size and $c$ is the index of the central word.

Let's build such a model: we will train a pair of matrices, $U$, the embedding matrix that we will later use for our own tasks, and $V$, the output layer matrix.

Each word in the vocabulary corresponds to a row of $U$ and a column of $V$.

![skip-gram](https://github.com/DanAnastasyev/DeepNLP-Course/raw/master/Week%2003/Images/SkipGram.png =500x)

What happens here? A word is mapped to its embedding, the row $u_c$. This embedding is then multiplied by the matrix $V$.

As a result we get a set of numbers $v_j^T u_c$, each measuring the similarity between word number $j$ and our word.

Let's turn these numbers into something like probabilities with the softmax function: $P(i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$.

Then we compute the cross-entropy loss:

$$-\sum_{-k \leq j \leq k, j \neq 0} \log \frac{\exp(v_{c+j}^T u_c)}{\sum_{i=1}^{|V|} \exp(v_i^T u_c)} \to \min_{U, V}.$$

As a result, the vector $u_c$ moves closer to the vectors $v_{c+j}$ from its context.
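
The computation described above can be sketched in NumPy for a single central word. The matrices are random and the sizes, the `center` index and the `context` indices are made up for illustration; this is not the training code, only the forward computation of the loss.

```python
import numpy as np

rng = np.random.RandomState(0)
vocab_size, emb_dim = 6, 4             # toy sizes, chosen only for illustration

U = rng.randn(vocab_size, emb_dim)     # embedding matrix: one row per word
V = rng.randn(emb_dim, vocab_size)     # output matrix: one column per word

center = 2                             # index of the central word
context = [0, 1, 3, 4]                 # indices of its context words

u_c = U[center]                        # embedding of the central word
scores = u_c @ V                       # v_j^T u_c for every word j

# Softmax, shifted by the max score for numerical stability.
exp_scores = np.exp(scores - scores.max())
probs = exp_scores / exp_scores.sum()

# Cross-entropy summed over the context words.
loss = -np.log(probs[context]).sum()
print(probs.round(3), loss)
```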

Let's implement all of this to make sure we understand it.

#### Batch generation

First we need to collect the contexts.

In [0]:
def build_contexts(tokenized_texts, window_size):
    contexts = []
    for tokens in tokenized_texts:
        for i in range(len(tokens)):
            central_word = tokens[i]
            context = [tokens[i + delta] for delta in range(-window_size, window_size + 1) 
                       if delta != 0 and i + delta >= 0 and i + delta < len(tokens)]

            contexts.append((central_word, context))
            
    return contexts

In [0]:
contexts = build_contexts(tokenized_texts, window_size=2)

In [0]:
contexts[:5]
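
To see what `build_contexts` returns without loading the dataset, here is the same logic run on a single made-up sentence (the sentence and the name `toy_contexts` are assumptions for this illustration).

```python
def build_contexts(tokenized_texts, window_size):
    contexts = []
    for tokens in tokenized_texts:
        for i, central_word in enumerate(tokens):
            # Neighbours within the window, skipping the word itself
            # and positions that fall outside the sentence.
            context = [tokens[i + delta]
                       for delta in range(-window_size, window_size + 1)
                       if delta != 0 and 0 <= i + delta < len(tokens)]
            contexts.append((central_word, context))
    return contexts


toy_contexts = build_contexts([['how', 'do', 'i', 'learn', 'pytorch']], window_size=2)
for central_word, context in toy_contexts:
    print(central_word, context)
```

Note that only the middle word `i` gets a full context of `2 * window_size` words; words near the sentence boundaries get shorter ones.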

Convert the words to indices.

In [0]:
contexts = [(word2index.get(central_word, 0), [word2index.get(word, 0) for word in context]) 
            for central_word, context in contexts]

Implement a batch generator for our network:

In [0]:
import random

def make_skip_gram_batchs_iter(contexts, window_size, num_skips, batch_size):
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * window_size
    
    central_words = [word for word, context in contexts if len(context) == 2 * window_size and word != 0]
    contexts = [context for word, context in contexts if len(context) == 2 * window_size and word != 0]
    
    batch_size = int(batch_size / num_skips)
    batchs_count = int(math.ceil(len(contexts) / batch_size))
    
    print('Initializing batchs generator with {} batchs per epoch'.format(batchs_count))
    
    while True:
        indices = np.arange(len(contexts))
        np.random.shuffle(indices)

        for i in range(batchs_count):
            batch_begin, batch_end = i * batch_size, min((i + 1) * batch_size, len(contexts))
            batch_indices = indices[batch_begin: batch_end]

            batch_data, batch_labels = [], []

            for data_ind in batch_indices:
                central_word, context = central_words[data_ind], contexts[data_ind]
                
                words_to_use = random.sample(context, num_skips)
                batch_data.extend([central_word] * num_skips)
                batch_labels.extend(words_to_use)
            
            yield batch_data, batch_labels

In [0]:
batch, labels = next(make_skip_gram_batchs_iter(contexts, window_size=2, num_skips=2, batch_size=32))

#### nn.Sequential

The simplest way to build a model in PyTorch is the `nn.Sequential` module. You just list all the layers in it, and it applies them one after another.

In [0]:
model = nn.Sequential(
    nn.Embedding(len(word2index), 32),
    nn.Linear(32, len(word2index))
)

Another PyTorch feature we haven't mentioned yet is support for GPU computation. Most neural networks run much faster on a GPU thanks to massive parallelism. Telling PyTorch to compute on the GPU is very easy:

In [0]:
model.cuda()

or

In [0]:
device = torch.device("cuda")

model = model.to(device)

Tensors can be created on the GPU like this, for example:

In [0]:
batch = torch.cuda.LongTensor(batch)
labels = torch.cuda.LongTensor(labels)

To make the model compute its output:

In [0]:
logits = model(batch)

Now we need a loss function:

In [0]:
loss_function = nn.CrossEntropyLoss().cuda() 

Its value is computed like this:

In [0]:
loss = loss_function(logits, labels)

And now, of course, backprop!

In [0]:
loss.backward()

And finally, the optimizer.

We will use Adam. Its interface takes the list of parameters to optimize and the learning rate.

In [0]:
optimizer = optim.Adam(model.parameters(), lr=0.01) 

Optimization is simple: just call `step()`:

In [0]:
print(model[1].weight)

optimizer.step()

print(model[1].weight)

One last thing: the gradients must be zeroed!

In [0]:
optimizer.zero_grad()
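
Put together, the individual calls above form one training step. The sketch below repeats that step on a tiny made-up batch and runs on the CPU so that it works without a GPU; the vocabulary size, the batch and the labels are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

vocab_size, emb_dim = 10, 4            # toy sizes, for illustration only
model = nn.Sequential(
    nn.Embedding(vocab_size, emb_dim),
    nn.Linear(emb_dim, vocab_size),
)
loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# A made-up batch: central word indices and the context words to predict.
batch = torch.tensor([1, 1, 2, 2], dtype=torch.long)
labels = torch.tensor([3, 4, 5, 6], dtype=torch.long)

losses = []
for step in range(50):
    logits = model(batch)                  # forward pass: (4, vocab_size)
    loss = loss_function(logits, labels)
    loss.backward()                        # backward pass
    optimizer.step()                       # parameter update
    optimizer.zero_grad()                  # reset accumulated gradients
    losses.append(loss.item())

print(losses[0], losses[-1])
```

The loss cannot reach zero here, because each central word is paired with two different labels, but it should still go down from its initial value.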

#### Training the skip-gram model

Finally, let's write the training loop, just as we did for linear regression.

**Task** Fill in the loop.

In [0]:
loss_every_nsteps = 1000
total_loss = 0
start_time = time.time()

for step, (batch, labels) in enumerate(make_skip_gram_batchs_iter(contexts, window_size=2, num_skips=4, batch_size=128)):
    <1. convert data to tensors>
    
    <2. make forward pass>

    <3. make backward pass>

    <4. apply optimizer>
    
    <5. zero grads>

    total_loss += loss.item()
    
    if step != 0 and step % loss_every_nsteps == 0:
        print("Step = {}, Avg Loss = {:.4f}, Time = {:.2f}s".format(step, total_loss / loss_every_nsteps, 
                                                                    time.time() - start_time))
        total_loss = 0
        start_time = time.time()

#### Analysis

The embeddings can be extracted with this incantation:

In [0]:
embeddings = model[0].weight.cpu().data.numpy()

Let's check whether the result is at all reasonable.

In [0]:
from sklearn.metrics.pairwise import cosine_similarity

def most_similar(embeddings, index2word, word2index, word):
    word_emb = embeddings[word2index[word]]
    
    similarities = cosine_similarity([word_emb], embeddings)[0]
    top10 = np.argsort(similarities)[-10:]
    
    return [index2word[index] for index in reversed(top10)]

most_similar(embeddings, index2word, word2index, 'warm')

And visualize it!

In [0]:
import bokeh.models as bm, bokeh.plotting as pl
from bokeh.io import output_notebook

from sklearn.manifold import TSNE
from sklearn.preprocessing import scale


def draw_vectors(x, y, radius=10, alpha=0.25, color='blue',
                 width=600, height=400, show=True, **kwargs):
    """ draws an interactive plot for data points with auxiliary info on hover """
    output_notebook()
    
    if isinstance(color, str): 
        color = [color] * len(x)
    data_source = bm.ColumnDataSource({ 'x' : x, 'y' : y, 'color': color, **kwargs })

    fig = pl.figure(active_scroll='wheel_zoom', width=width, height=height)
    fig.scatter('x', 'y', size=radius, color='color', alpha=alpha, source=data_source)

    fig.add_tools(bm.HoverTool(tooltips=[(key, "@" + key) for key in kwargs.keys()]))
    if show: 
        pl.show(fig)
    return fig


def get_tsne_projection(word_vectors):
    tsne = TSNE(n_components=2, verbose=100)
    return scale(tsne.fit_transform(word_vectors))
    
    
def visualize_embeddings(embeddings, index2word, word_count):
    word_vectors = embeddings[1: word_count + 1]
    words = index2word[1: word_count + 1]
    
    word_tsne = get_tsne_projection(word_vectors)
    draw_vectors(word_tsne[:, 0], word_tsne[:, 1], color='green', token=words)
    
    
visualize_embeddings(embeddings, index2word, 1000)

### Continuous Bag of Words (CBoW)

An alternative model:

![](https://github.com/DanAnastasyev/DeepNLP-Course/raw/master/Week%2003/Images/CBOW.png =500x)

Here the central word is predicted from the sum of the context vectors.

**Task** Implement the missing part of the batch generation function.

In [1]:
def make_cbow_batchs_iter(contexts, window_size, batch_size):
    data = np.array([context for word, context in contexts if len(context) == 2 * window_size and word != 0])
    labels = np.array([word for word, context in contexts if len(context) == 2 * window_size and word != 0])
        
    batchs_count = int(math.ceil(len(data) / batch_size))
    
    print('Initializing batchs generator with {} batchs per epoch'.format(batchs_count))
    
    while True:
        <do batchs generation>
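
One possible way to fill in the batch generation, sketched by analogy with the skip-gram generator: reshuffle the examples every epoch and yield consecutive slices. The function name `make_cbow_batches_sketch` and the toy contexts are hypothetical, and this is not necessarily the intended reference solution.

```python
import math
import numpy as np


def make_cbow_batches_sketch(contexts, window_size, batch_size):
    # Keep only full-size contexts whose central word is known (index != 0).
    data = np.array([context for word, context in contexts
                     if len(context) == 2 * window_size and word != 0])
    labels = np.array([word for word, context in contexts
                       if len(context) == 2 * window_size and word != 0])
    batches_count = int(math.ceil(len(data) / batch_size))

    while True:
        indices = np.random.permutation(len(data))   # reshuffle every epoch
        for i in range(batches_count):
            batch_indices = indices[i * batch_size: (i + 1) * batch_size]
            yield data[batch_indices], labels[batch_indices]


# Made-up index contexts for window_size=1: (central word, [left, right]).
toy = [(1, [2, 3]), (2, [1, 3]), (0, [1, 2]), (3, [1]), (4, [2, 1])]
batch, labels = next(make_cbow_batches_sketch(toy, window_size=1, batch_size=2))
print(batch.shape, labels.shape)   # (2, 2) (2,)
```

Unlike skip-gram, each example here is a whole context (the input) paired with a single central word (the label).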

Let's look at another way to define a model, the one we will use most often: subclassing `nn.Module`. Schematically, it looks like this:

```python
class MyNetModel(nn.Module):
    def __init__(self, *args, **kwargs):
        super(MyNetModel, self).__init__()
        <initialize layers>
        
    def forward(self, inputs):
        <apply layers>
        return final_output
```
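
As a concrete instance of that template, here is a small sketch of a module that sums the embeddings of its input words and maps the sum to vocabulary logits. The class name `SumEmbeddingClassifier` and the toy sizes are hypothetical; treat it as one possible shape for a CBoW-style `forward`, not the reference solution.

```python
import torch
import torch.nn as nn


class SumEmbeddingClassifier(nn.Module):   # hypothetical name, not from the course
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.out_layer = nn.Linear(embedding_dim, vocab_size)

    def forward(self, inputs):
        # inputs: (batch_size, context_size) tensor of word indices
        embedded = self.embeddings(inputs)      # (batch, context, emb_dim)
        summed = embedded.sum(dim=1)            # (batch, emb_dim)
        return self.out_layer(summed)           # (batch, vocab_size) logits


toy_model = SumEmbeddingClassifier(vocab_size=10, embedding_dim=4)
toy_inputs = torch.tensor([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=torch.long)
print(toy_model(toy_inputs).shape)   # torch.Size([2, 10])
```

Calling the module like a function invokes `forward` for you.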



In [0]:
class CBoWModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.out_layer = nn.Linear(embedding_dim, vocab_size)

    def forward(self, inputs):
        <apply layers>
        return output
      
model = CBoWModel(vocab_size=len(word2index), embedding_dim=32).cuda()

loss_function = <create loss function>
optimizer = <create optimizer>

In [0]:
loss_every_nsteps = 1000
total_loss = 0
start_time = time.time()

for step, (batch, labels) in enumerate(make_cbow_batchs_iter(contexts, window_size=2, batch_size=128)):
    <copy-paste learning cycle>

    total_loss += loss.item()
    
    if step != 0 and step % loss_every_nsteps == 0:
        print("Step = {}, Avg Loss = {:.4f}, Time = {:.2f}s".format(step, total_loss / loss_every_nsteps, 
                                                                    time.time() - start_time))
        total_loss = 0
        start_time = time.time()

In [0]:
visualize_embeddings(model.embeddings.weight.data.cpu().numpy(), index2word, 1000)

### Negative Sampling

What is the most expensive part right now? Computing the softmax and applying gradients to every word in the vocabulary $V$.

One way to deal with this is *Negative Sampling*.

Essentially, instead of predicting the index of a word from its context, we predict the probability that a word $w$ can appear in a context $c$: $P(D = 1 | w, c)$.

An ordinary sigmoid is enough to get this probability:
$$P(D = 1 | w, c) = \sigma(v_w^T u_c) = \frac{1}{1 + \exp(-v_w^T u_c)}.$$

Training then looks like this: for each (word, context) pair we generate a set of negative examples:

![Negative Sampling](https://github.com/DanAnastasyev/DeepNLP-Course/raw/master/Week%2003/Images/Negative_Sampling.png =350x)

For CBoW, the loss function looks like this:

$$-\log \sigma(v_c^T u_c) - \sum_{k=1}^K \log \sigma(-\tilde v_k^T u_c),$$

where $v_c$ is the vector of the central word, $u_c$ is the context vector (the sum of the context word vectors), and $\tilde v_1, \ldots, \tilde v_K$ are the sampled negative examples.
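
A minimal NumPy sketch of this loss for one example, with random vectors standing in for learned ones. The helper `log_sigmoid` and the toy sizes are assumptions made for illustration.

```python
import numpy as np


def log_sigmoid(x):
    # log(sigma(x)) = -log(1 + exp(-x)), computed stably via logaddexp.
    return -np.logaddexp(0., -x)


rng = np.random.RandomState(0)
emb_dim, num_negatives = 4, 5                 # toy sizes, for illustration only

u_c = rng.randn(emb_dim)                      # context vector (sum of context embeddings)
v_c = rng.randn(emb_dim)                      # output vector of the true central word
v_neg = rng.randn(num_negatives, emb_dim)     # output vectors of K sampled negatives

positive_term = log_sigmoid(v_c @ u_c)
negative_term = log_sigmoid(-(v_neg @ u_c)).sum()
loss = -positive_term - negative_term
print(loss)
```

The cost of one example now depends on $K$ rather than on the vocabulary size.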

Compare this formula with the usual CBoW loss:
$$-v_c^T u_c + \log \sum_{i=1}^{|V|} \exp(v_i^T u_c).$$

Words are usually sampled from $U^{3/4}$, where $U$ is the unigram distribution, that is, the number of occurrences of each word divided by the total number of words.

We have already computed the frequencies: they are stored in the `Counter` built above (`words_counter`). Simply convert them to probabilities and raise those probabilities to the power $\frac 3 4$. Why $\frac 3 4$? Some intuition can be found in the following example:

$$P(\text{is}) = 0.9, \ P(\text{is})^{3/4} = 0.92$$
$$P(\text{Constitution}) = 0.09, \ P(\text{Constitution})^{3/4} = 0.16$$
$$P(\text{bombastic}) = 0.01, \ P(\text{bombastic})^{3/4} = 0.032$$

The probability of high-frequency words barely increases (in relative terms), while low-frequency words will be sampled with noticeably higher probability.
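
The example above can be reproduced in a few lines of NumPy. The sketch also renormalizes the result so that it sums to one, which is what makes it usable as a sampling distribution; the three-word vocabulary is the toy one from the example.

```python
import numpy as np

words = ['is', 'constitution', 'bombastic']
unigram = np.array([0.9, 0.09, 0.01])     # the toy probabilities from the example

smoothed = unigram ** 0.75                # raise to the power 3/4 ...
smoothed /= smoothed.sum()                # ... and renormalize to sum to 1

for word, p, q in zip(words, unigram, smoothed):
    print('{:>12}: {:.3f} -> {:.3f}'.format(word, p, q))
```

After renormalization the most frequent word loses some probability mass and the rarest one gains it.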

**Task** Implement your own Negative Sampling.

First, let's set the distribution for sampling:

In [0]:
words_sum_count = sum(words_counter.values())
word_distribution = np.array([(words_counter[word] / words_sum_count) ** (3 / 4) for word in index2word])
# Strictly speaking, this is a bit sloppy and could be done better
word_distribution /= word_distribution.sum()

indices = np.arange(len(word_distribution))

np.random.choice(indices, p=word_distribution, size=(32, 5))

In [0]:
class NegativeSamplingModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.out_layer = nn.Linear(embedding_dim, vocab_size)

    def forward(self, inputs, targets, num_samples):
        '''
        inputs: (batch_size, context_size)
        targets: (batch_size)
        num_samples: int
        '''
        
        <calculate u_c's>
        
        <calculate v_c>
        
        <sample indices>
        <calculate negative vectors v'_c>
        
        <apply F.logsigmoid to v_c * u_c and to -v'_c * u_c>
        
        <calc result loss>

In [0]:
model = NegativeSamplingModel(vocab_size=len(word2index), embedding_dim=32).cuda()

optimizer = optim.Adam(model.parameters(), lr=0.01)  

loss_every_nsteps = 1000
total_loss = 0
start_time = time.time()

for step, (batch, labels) in enumerate(make_cbow_batchs_iter(contexts, window_size=2, batch_size=128)):
    <copy-paste (mostly) learning cycle>

    total_loss += loss.item()
    
    if step != 0 and step % loss_every_nsteps == 0:
        print("Step = {}, Avg Loss = {:.4f}, Time = {:.2f}s".format(step, total_loss / loss_every_nsteps, 
                                                                    time.time() - start_time))
        total_loss = 0
        start_time = time.time()

In [0]:
visualize_embeddings(model.embeddings.weight.data.cpu().numpy(), index2word, 1000)

### Structured Word2Vec

**Task** The paper [Two/Too Simple Adaptations of Word2Vec for Syntax Problems (2015), Ling, Wang, et al.](https://www.aclweb.org/anthology/N/N15/N15-1142.pdf) describes two ways to improve embeddings, the *Structured Skip-gram Model* and the *Continuous Window Model*:   
![](https://github.com/DanAnastasyev/DeepNLP-Course/raw/master/Week%2003/Images/StructuredWord2vec.png =600x)  
*From Two/Too Simple Adaptations of Word2Vec for Syntax Problems*

The difference is that each context position gets its own matrix. This works well on large corpora, but on our small one it will not work as well: there are too many parameters to learn.

The idea is that word order in a sentence matters a lot (especially in English, which is what they test on, as usual). By encoding the order, the models learn syntax better.

Read the paper and try to implement one of them.

# Additional materials
## Reading
### Blogs
[On word embeddings - Part 1, Sebastian Ruder](http://ruder.io/word-embeddings-1/)  
[On word embeddings - Part 2: Approximating the Softmax, Sebastian Ruder](http://ruder.io/word-embeddings-softmax/index.html)  
[Word2Vec Tutorial - The Skip-Gram Model, Chris McCormick](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)  
[Word2Vec Tutorial Part 2 - Negative Sampling, Chris McCormick](http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/) 

### Papers
[Word2vec Parameter Learning Explained (2014), Xin Rong](https://arxiv.org/abs/1411.2738)  
[Neural word embedding as implicit matrix factorization (2014), Levy, Omer, and Yoav Goldberg](http://u.cs.biu.ac.il/~nlp/wp-content/uploads/Neural-Word-Embeddings-as-Implicit-Matrix-Factorization-NIPS-2014.pdf)  

### Improving embeddings
[Two/Too Simple Adaptations of Word2Vec for Syntax Problems (2015), Ling, Wang, et al.](https://www.aclweb.org/anthology/N/N15/N15-1142.pdf)  
[Not All Neural Embeddings are Born Equal (2014)](https://arxiv.org/pdf/1410.0718.pdf)  
[Retrofitting Word Vectors to Semantic Lexicons (2014), M. Faruqui, et al.](https://arxiv.org/pdf/1411.4166.pdf)  
[All-but-the-top: Simple and Effective Postprocessing for Word Representations (2017), Mu, et al.](https://arxiv.org/pdf/1702.01417.pdf)  

### Sentence embeddings
[Skip-Thought Vectors (2015), Kiros, et al.](https://arxiv.org/pdf/1506.06726)  

### Backpropagation
[Backpropagation, Intuitions, cs231n + next parts in the Module 1](http://cs231n.github.io/optimization-2/)   
[Calculus on Computational Graphs: Backpropagation, Christopher Olah](http://colah.github.io/posts/2015-08-Backprop/)

## Videos
[cs224n "Lecture 2 - Word Vector Representations: word2vec"](https://www.youtube.com/watch?v=ERibwqs9p38&index=2&list=PLqdrfNEc5QnuV9RwUAhoJcoQvu4Q46Lja&t=0s)  
[cs224n "Lecture 5 - Backpropagation"](https://www.youtube.com/watch?v=isPiE-DBagM&index=5&list=PLqdrfNEc5QnuV9RwUAhoJcoQvu4Q46Lja&t=0s)   


# Assignment submission

[Submission](https://goo.gl/forms/rzWjQQsGpqYNz5yt1)  
[Survey](https://goo.gl/forms/as640TWE058bFTpy2)