## Resources
Many word2vec resources are collected here: http://mccormickml.com/2016/04/27/word2vec-resources/

CBOW learns to predict a word from its context; that is, it maximizes the probability of the target word given the surrounding words. This turns out to be a problem for rare words. For example, given the context `yesterday was a really [...] day`, a CBOW model will say that the missing word is most probably `beautiful` or `nice`. A word like `delightful` gets much less of the model's attention, because the model is designed to predict the most probable word; the rare word is smoothed over by the many examples that contain more frequent words.

The skip-gram model, on the other hand, is designed to predict the context. Given the word `delightful`, it must understand the word and tell us that, with high probability, the context is `yesterday was a really [...] day` or some other relevant context. With skip-gram, the word `delightful` does not compete with the word `beautiful`; instead, each `delightful`+context pair is treated as a new observation.
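
To make the difference concrete, here is a minimal sketch of how the two models could slice the same text into training examples. The helper names `cbow_examples` and `skipgram_examples`, the toy sentence, and the window size of 2 are illustrative assumptions, not part of the notebook.

```python
def cbow_examples(tokens, window=2):
    """One example per position: (all context words, target word)."""
    examples = []
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        examples.append((context, tokens[i]))
    return examples


def skipgram_examples(tokens, window=2):
    """One example per (center word, single context word) pair."""
    examples = []
    for i in range(window, len(tokens) - window):
        for j in range(i - window, i + window + 1):
            if j != i:
                examples.append((tokens[i], tokens[j]))
    return examples


# Only 'delightful' has two neighbours on both sides in this toy sentence.
tokens = "a really delightful day today".split()
cbow = cbow_examples(tokens)
skipgram = skipgram_examples(tokens)
print(cbow)      # one example: the whole context predicts 'delightful'
print(skipgram)  # four examples: 'delightful' predicts each neighbour
```

In the CBOW example `delightful` is a single target that competes with every other word that fits the same context. In the skip-gram examples it is the input of four separate observations.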

In [53]:
import torch
import torch.nn as nn
import torch.autograd as autograd
import torch.optim as optim
import torch.nn.functional as F
import operator

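# Continuous Bag of Words model
class CBOW(nn.Module):

    def __init__(self, context_size=2, embedding_size=100, vocab_size=None):
        super().__init__()
        # context_size is accepted but unused: summing the context embeddings
        # makes the model independent of the window size.
        self.embeddings = nn.Embedding(vocab_size, embedding_size)
        self.linear1 = nn.Linear(embedding_size, vocab_size)

    def forward(self, inputs):
        # inputs: 1-D LongTensor of context word indices
        lookup_embeds = self.embeddings(inputs)   # (num_context, embedding_size)
        embeds = lookup_embeds.sum(dim=0)         # (embedding_size,)
        out = self.linear1(embeds)                # (vocab_size,)
        # dim must be given explicitly; the implicit choice is deprecated
        out = F.log_softmax(out, dim=-1)
        return out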



def make_context_vector(context, word_to_ix):
    # Map each context word to its vocabulary index
    idxs = [word_to_ix[w] for w in context]
    # A plain tensor replaces the deprecated autograd.Variable wrapper
    return torch.tensor(idxs, dtype=torch.long)



In [45]:
CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
EMBEDDING_SIZE = 10
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".lower().split()

# How could you do a better pre-processing?
# Maybe a sentence tokenizer?
# Maybe a word lemmatizer?
# Should you replace this small corpus with a bigger one?
# Maybe you should remove stopwords?
# Maybe you should just Google?
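
One possible answer to the questions above, as a rough sketch: split the text into sentences, keep only alphabetic tokens so that `program.` and `program` become the same word, and optionally drop stopwords. The function name `tokenize` and the tiny `STOPWORDS` set are assumptions made for illustration; a real stopword list is much longer, and the rest of the notebook keeps using the plain whitespace split.

```python
import re

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "the", "of", "to", "are", "is", "by", "in", "we", "our", "with"}


def tokenize(text, remove_stopwords=False):
    """Lowercase, split into sentences, and keep only alphabetic tokens."""
    sentences = re.split(r"[.!?]+", text.lower())
    tokenized = []
    for sentence in sentences:
        words = re.findall(r"[a-z]+", sentence)
        if remove_stopwords:
            words = [w for w in words if w not in STOPWORDS]
        if words:  # skip the empty string left after the final period
            tokenized.append(words)
    return tokenized


sample = "People create programs to direct processes. In effect, we conjure the spirits."
plain = tokenize(sample)
filtered = tokenize(sample, remove_stopwords=True)
print(plain)
print(filtered)
```

Tokenizing per sentence also makes it possible to build context windows that do not cross sentence boundaries.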

In [46]:
# Create the vocabulary (sorted so that word indices are the same on every run)
vocab = sorted(set(raw_text))
vocab_size = len(vocab)

word_to_ix = {word: i for i, word in enumerate(vocab)}
data = []
# Pair each word with CONTEXT_SIZE neighbours on either side
for i in range(CONTEXT_SIZE, len(raw_text) - CONTEXT_SIZE):
    context = (raw_text[i - CONTEXT_SIZE:i]
               + raw_text[i + 1:i + CONTEXT_SIZE + 1])
    target = raw_text[i]
    data.append((context, target))

print(data[0])

(['we', 'are', 'to', 'study'], 'about')


In [47]:
# The model already returns log-probabilities, so NLLLoss is the matching loss
loss_func = nn.NLLLoss()
net = CBOW(CONTEXT_SIZE, embedding_size=EMBEDDING_SIZE, vocab_size=vocab_size)
optimizer = optim.SGD(net.parameters(), lr=0.01)

# The training loop
for epoch in range(100):
    total_loss = 0
    for context, target in data:
        context_var = make_context_vector(context, word_to_ix)
        net.zero_grad()
        # Forward pass: log-probabilities over the vocabulary
        log_probs = net(context_var)
        target_var = torch.tensor([word_to_ix[target]], dtype=torch.long)
        loss = loss_func(log_probs.reshape(1, -1), target_var)

        loss.backward()
        optimizer.step()
        # detach() keeps the running total free of autograd history
        total_loss += loss.detach()
    print("Loss for epoch ", epoch, " : ", total_loss)



Loss for epoch  0  :  tensor(255.9591)
Loss for epoch  1  :  tensor(226.5571)
Loss for epoch  2  :  tensor(203.5022)
Loss for epoch  3  :  tensor(185.3333)
Loss for epoch  4  :  tensor(170.6402)
Loss for epoch  5  :  tensor(158.3100)
Loss for epoch  6  :  tensor(147.6673)
Loss for epoch  7  :  tensor(138.3020)
Loss for epoch  8  :  tensor(129.9490)
Loss for epoch  9  :  tensor(122.4272)
Loss for epoch  10  :  tensor(115.6029)
Loss for epoch  11  :  tensor(109.3704)
Loss for epoch  12  :  tensor(103.6462)
Loss for epoch  13  :  tensor(98.3646)
Loss for epoch  14  :  tensor(93.4732)
Loss for epoch  15  :  tensor(88.9299)
Loss for epoch  16  :  tensor(84.6992)
Loss for epoch  17  :  tensor(80.7513)
Loss for epoch  18  :  tensor(77.0601)
Loss for epoch  19  :  tensor(73.6033)
Loss for epoch  20  :  tensor(70.3612)
Loss for epoch  21  :  tensor(67.3165)
Loss for epoch  22  :  tensor(64.4538)
Loss for epoch  23  :  tensor(61.7596)
Loss for epoch  24  :  tensor(59.2216)
Loss for epoch  25  : 

In [48]:
# Now look up the learned embedding for every word
vocab_to_embedding = {}
for word in vocab:
    # Call the layer directly instead of .forward(); detach() drops gradient tracking
    vocab_to_embedding[word] = net.embeddings(make_context_vector([word], word_to_ix)).detach()

In [65]:
def find_k_similar_words(word, k=5):
    word = word.lower()
    if word not in vocab:
        print("Not found ", word)
        return []
    a = vocab_to_embedding[word]
    sim_here = {}
    for b in vocab_to_embedding:
        emb = vocab_to_embedding[b]
        # Cosine similarity between the two embedding vectors
        sim = torch.dot(a.reshape(-1), emb.reshape(-1)) / (a.norm() * emb.norm())
        # Indexing a 0-dim tensor (sim.data[0]) fails in current PyTorch
        sim_here[b] = sim.detach()
    # Most similar words first
    sorted_t = sorted(sim_here.items(), key=operator.itemgetter(1), reverse=True)
    return sorted_t[:k]
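
The score computed inside `find_k_similar_words` is the cosine similarity, the dot product of two vectors divided by the product of their norms. As a small worked illustration in plain Python (the function name `cosine_similarity` and the toy vectors are assumptions for this sketch, not part of the notebook):

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Same direction, different length: similarity close to 1
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: similarity 0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
# Opposite direction: similarity close to -1
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
print(same_direction, orthogonal, opposite)
```

Because the score ignores vector length, a word always has similarity 1 with itself, which is why the query word appears first in its own result list.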

In [66]:
find_k_similar_words('program.', 5)



[('program.', tensor(1.)),
 ('other', tensor(0.5284)),
 ('processes', tensor(0.4700)),
 ('a', tensor(0.4521)),
 ('our', tensor(0.3823))]

### Could you define a skip-gram model?
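
One possible starting point, offered as a sketch rather than a reference solution: reuse the CBOW layers, but feed a single center word and score every vocabulary word as a potential context word. The names `SkipGram`, `make_skipgram_pairs`, the toy sentence, and the hyperparameters below are all assumptions chosen for illustration. The sketch uses a full softmax and leaves out the negative sampling and subsampling used in practical word2vec implementations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SkipGram(nn.Module):
    """Predicts a distribution over context words from one center word."""

    def __init__(self, vocab_size, embedding_size=100):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_size)
        self.linear1 = nn.Linear(embedding_size, vocab_size)

    def forward(self, center):
        # center: LongTensor of shape (batch,) with center-word indices
        embeds = self.embeddings(center)      # (batch, embedding_size)
        out = self.linear1(embeds)            # (batch, vocab_size)
        return F.log_softmax(out, dim=-1)


def make_skipgram_pairs(tokens, word_to_ix, window=2):
    """Build (center index, context index) pairs, skipping the text edges."""
    pairs = []
    for i in range(window, len(tokens) - window):
        for j in range(i - window, i + window + 1):
            if j != i:
                pairs.append((word_to_ix[tokens[i]], word_to_ix[tokens[j]]))
    return pairs


torch.manual_seed(0)
toy_tokens = "we conjure the spirits of the computer with our spells".split()
toy_vocab = sorted(set(toy_tokens))
toy_word_to_ix = {w: i for i, w in enumerate(toy_vocab)}
toy_pairs = make_skipgram_pairs(toy_tokens, toy_word_to_ix)
centers = torch.tensor([c for c, _ in toy_pairs], dtype=torch.long)
contexts = torch.tensor([o for _, o in toy_pairs], dtype=torch.long)

sg_net = SkipGram(vocab_size=len(toy_vocab), embedding_size=10)
sg_loss_func = nn.NLLLoss()  # matches the log_softmax output
sg_optimizer = torch.optim.SGD(sg_net.parameters(), lr=0.1)

sg_losses = []
for epoch in range(50):
    sg_optimizer.zero_grad()
    sg_log_probs = sg_net(centers)            # one row per (center, context) pair
    sg_loss = sg_loss_func(sg_log_probs, contexts)
    sg_loss.backward()
    sg_optimizer.step()
    sg_losses.append(sg_loss.item())

print("first loss:", sg_losses[0], "last loss:", sg_losses[-1])
```

Each center word contributes one training pair per context word, so a rare word is the input of its own observations instead of competing with frequent words as a prediction target.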