In [1]:
!pip install tsundoku

Collecting tsundoku
  Downloading https://files.pythonhosted.org/packages/53/9a/ee69fb151e5c628683f67021c39d327a24a2f0ae81d4c433823711756a11/tsundoku-0.0.3.tar.gz
Building wheels for collected packages: tsundoku
  Running setup.py bdist_wheel for tsundoku: started
  Running setup.py bdist_wheel for tsundoku: finished with status 'done'
  Stored in directory: C:\Users\monis\AppData\Local\pip\Cache\wheels\ee\b3\9f\bfdb5e507142dacffd199c4aaa9c58d805191f24e4d7d71905
Successfully built tsundoku
Installing collected packages: tsundoku
Successfully installed tsundoku-0.0.3


You are using pip version 9.0.1, however version 19.0 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.


In [2]:
!pip install sklearn torch tqdm nltk lazyme requests gensim
!python -m nltk.downloader movie_reviews



You are using pip version 9.0.1, however version 19.0 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.


[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\monis\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!




In [16]:
from IPython.display import display, Markdown, Latex
from tsundoku.word2vec_hints import *

# Overview


- <a href="#section-3-0">**3.0. Data Preparation**</a>
  - <a href="#section-3-0-1">3.0.1. *Vocabulary*</a>
    - <a href="#section-3-0-1-a"> Pet Peeve: using `gensim`</a>
  - <a href="#section-3-0-2">3.0.2. *Dataset*</a>  (<a href="#section-3-0-2-hints">Hints</a>)
    - <a href="#section-3-0-2-return-dict">Return `dict` in `__getitem__()`</a>
    - <a href="#section-3-0-2-labeleddata">Try `LabeledDataset`</a>
<br><br>
- <a href="#section-3-1">**3.1. Word2Vec from Scratch**</a>
  - <a href="#section-3-1-1">3.1.1. *CBOW*</a>
  - <a href="#section-3-1-2">3.1.2. *Skipgram*</a>
  - <a href="#section-3-1-3">3.1.3. *Word2Vec Dataset*</a> (<a href="#section-3-1-3-hint">Hints</a>)
  - <a href="#section-3-1-4-hint">3.1.4. *Train a CBOW model*</a>
    - <a href="#section-3-1-4-fill-cbow">Fill in the CBOW model</a>
    - <a href="#section-3-1-4-train-cbow">Train the model (*for real*)</a>
    - <a href="#section-3-1-4-evaluate-cbow">Evaluate the model</a>
    - <a href="#section-3-1-4-load-model">Load model at specific epoch</a>
  - <a href="#section-3-1-5">3.1.5. *Train a Skipgram model*</a>
    - <a href="#section-3-1-5-forward">Take a closer look at `forward()`</a>
    - <a href="#section-3-1-5-train">Train the model (*for real*)</a>
    - <a href="section-3-1-5-evaluate">Evaluate the model</a>
  - <a href="#section-3-1-6">3.1.6. *Loading Pre-trained Embeddings*</a>
    - <a href="#section-3-1-6-vocab">Override the Embedding vocabulary</a>
    - <a href="#section-3-1-6-pretrained">Override the Embedding weights</a>
    - <a href="#section-3-1-6-eval-skipgram">Evaluate on the Skipgram task</a>
    - <a href="#section-3-1-6-eval-cbow">Evaluate on the CBOW task</a>
    - <a href="#section-3-1-6-unfreeze-finetune">Unfreeeze and finetune</a>
    - <a href="#section-3-1-6-reval-cbow">Re-evaluate on the CBOW task</a>
<br><br>


<a id="section-3-0"></a>
# 3.0. Data Preparation

Before we train our own embeddings, lets first understand how to read text data into pytorch.
The native pytorch way to load datasets is to use the `torch.utils.data.Dataset` object.

There are already several other libraries that help with loading text datasets, e.g. 

 - FastAI https://docs.fast.ai/text.data.html
 - AllenNLP https://allenai.github.io/allennlp-docs/api/allennlp.data.dataset.html
 - Torch Text https://github.com/pytorch/text#data
 - Texar https://texar.readthedocs.io/en/latest/code/data.html#id4 
 - SpaCy https://github.com/explosion/thinc
 

But to truly understand and use it for the custom datasets you'll see at work, lets learn it the native way.

<a id="section-3-0-1"></a>
## 3.0.1  Vocabulary

Given a text, the first thing to do is to build a vocabulary (i.e. a dictionary of unique words) and assign an index to each unique word.

In [17]:
import random
from itertools import chain

from tqdm import tqdm
from gensim.corpora import Dictionary

import torch
from torch import nn, optim, tensor, autograd
from torch.nn import functional as F
from torch.utils.data import Dataset
import numpy as np

from functools import partial
from torch.utils.data import Dataset, DataLoader
from torch import functional as F


In [18]:
try: # Use the default NLTK tokenizer.
    from nltk import word_tokenize, sent_tokenize 
    # Testing whether it works. 
    # Sometimes it doesn't work on some machines because of setup issues.
    word_tokenize(sent_tokenize("This is a foobar sentence. Yes it is.")[0])
except: # Use a naive sentence tokenizer and toktok.
    import re
    from nltk.tokenize import ToktokTokenizer
    # See https://stackoverflow.com/a/25736515/610569
    sent_tokenize = lambda x: re.split(r'(?<=[^A-Z].[.?]) +(?=[A-Z])', x)
    # Use the toktok tokenizer that requires no dependencies.
    toktok = ToktokTokenizer()
    word_tokenize = word_tokenize = toktok.tokenize


In [19]:

text = """Language users never choose words randomly, and language is essentially
non-random. Statistical hypothesis testing uses a null hypothesis, which
posits randomness. Hence, when we look at linguistic phenomena in corpora, 
the null hypothesis will never be true. Moreover, where there is enough
data, we shall (almost) always be able to establish that it is not true. In
corpus studies, we frequently do have enough data, so the fact that a relation 
between two phenomena is demonstrably non-random, does not support the inference 
that it is not arbitrary. We present experimental evidence
of how arbitrary associations between word frequencies and corpora are
systematically non-random. We review literature in which hypothesis testing 
has been used, and show how it has often led to unhelpful or misleading results.""".lower()

tokenized_text = [word_tokenize(sent) for sent in sent_tokenize(text)]

uniq_tokens = set(chain(*tokenized_text))

vocab = {}   # Assign indices to every word.
idx2tok = {} # Also keep an dict of index to words.
for i, token in enumerate(uniq_tokens):
    vocab[token] = i
    idx2tok[i] = token

In [20]:
vocab

{'(': 2,
 ')': 14,
 ',': 17,
 '.': 20,
 'a': 10,
 'able': 29,
 'almost': 64,
 'always': 80,
 'and': 16,
 'arbitrary': 48,
 'are': 30,
 'associations': 63,
 'at': 12,
 'be': 50,
 'been': 73,
 'between': 82,
 'choose': 69,
 'corpora': 11,
 'corpus': 36,
 'data': 74,
 'demonstrably': 84,
 'do': 39,
 'does': 47,
 'enough': 76,
 'essentially': 85,
 'establish': 41,
 'evidence': 8,
 'experimental': 72,
 'fact': 18,
 'frequencies': 57,
 'frequently': 28,
 'has': 38,
 'have': 21,
 'hence': 34,
 'how': 31,
 'hypothesis': 32,
 'in': 19,
 'inference': 59,
 'is': 83,
 'it': 65,
 'language': 45,
 'led': 62,
 'linguistic': 53,
 'literature': 67,
 'look': 86,
 'misleading': 26,
 'moreover': 55,
 'never': 78,
 'non-random': 22,
 'not': 51,
 'null': 0,
 'of': 42,
 'often': 58,
 'or': 33,
 'phenomena': 75,
 'posits': 9,
 'present': 66,
 'randomly': 27,
 'randomness': 54,
 'relation': 13,
 'results': 44,
 'review': 7,
 'shall': 37,
 'show': 43,
 'so': 56,
 'statistical': 3,
 'studies': 6,
 'support': 4,


In [21]:
# Retrieve the index of the word 'corpora'
vocab['non-random']

22

In [22]:
# The indexed representation of the first sentence.

sent0 = tokenized_text[0]

[vocab[token] for token in sent0] 

[45, 81, 78, 69, 70, 27, 17, 16, 45, 83, 85, 22, 20]

<a id="section-3-0-1-a"></a>

### Pet Peeve (Gensim)

I (Liling) don't really like to write my own vectorizer the `gensim` has functions that are optimized for such operations. In fact, I've written a [whole preprocessing pipeline library for me to use for language modelling and machine translation purposes](https://github.com/alvations/komorebi/blob/master/komorebi/text.py) =)

Using `gensim`, I would have written the above as such:

In [23]:
from gensim.corpora.dictionary import Dictionary
vocab = Dictionary(tokenized_text)

In [24]:
# Note the key-value order is different of gensim from the native Python's
dict(vocab.items())

{0: ',',
 1: '.',
 2: 'and',
 3: 'choose',
 4: 'essentially',
 5: 'is',
 6: 'language',
 7: 'never',
 8: 'non-random',
 9: 'randomly',
 10: 'users',
 11: 'words',
 12: 'a',
 13: 'hypothesis',
 14: 'null',
 15: 'posits',
 16: 'randomness',
 17: 'statistical',
 18: 'testing',
 19: 'uses',
 20: 'which',
 21: 'at',
 22: 'be',
 23: 'corpora',
 24: 'hence',
 25: 'in',
 26: 'linguistic',
 27: 'look',
 28: 'phenomena',
 29: 'the',
 30: 'true',
 31: 'we',
 32: 'when',
 33: 'will',
 34: '(',
 35: ')',
 36: 'able',
 37: 'almost',
 38: 'always',
 39: 'data',
 40: 'enough',
 41: 'establish',
 42: 'it',
 43: 'moreover',
 44: 'not',
 45: 'shall',
 46: 'that',
 47: 'there',
 48: 'to',
 49: 'where',
 50: 'arbitrary',
 51: 'between',
 52: 'corpus',
 53: 'demonstrably',
 54: 'do',
 55: 'does',
 56: 'fact',
 57: 'frequently',
 58: 'have',
 59: 'inference',
 60: 'relation',
 61: 'so',
 62: 'studies',
 63: 'support',
 64: 'two',
 65: 'are',
 66: 'associations',
 67: 'evidence',
 68: 'experimental',
 69: 'fr

In [25]:
vocab.token2id['corpora']

23

In [26]:
vocab.doc2idx(sent0)

[6, 10, 7, 3, 11, 9, 0, 2, 6, 5, 4, 8, 1]

The "indexed form" of the tokens in the sentence forms the ***vectorized*** input to the `nn.Embedding` layer in PyTorch.

<a id="section-3-0-2"></a>

# 3.0.2 Dataset

Lets try creating a `torch.utils.data.Dataset` object.

In [27]:
from torch.utils.data import Dataset, DataLoader

class Text(Dataset):
    def __init__(self, tokenized_texts):
        """
        :param tokenized_texts: Tokenized text.
        :type tokenized_texts: list(list(str))
        """
        self.sents = tokenized_texts
        self.vocab = Dictionary(tokenized_texts)

    def __getitem__(self, index):
        """
        The primary entry point for PyTorch datasets.
        This is were you access the specific data row you want.
        
        :param index: Index to the data point.
        :type index: int
        """
        # Hint: You want to return a vectorized sentence here.
        return {'x': self.vectorize(self.sents[index])}

    def vectorize(self, tokens):
        """
        :param tokens: Tokens that should be vectorized. 
        :type tokens: list(str)
        """
        # See https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.doc2idx 
        return self.vocab.doc2idx(tokens)
    
    def unvectorize(self, indices):
        """
        :param indices: Converts the indices back to tokens.
        :type tokens: list(int)
        """
        return [self.vocab[i] for i in indices]

<a id="section-3-0-2-hints"></a>
## Hints to the above cell

In [28]:
# Option 1: To see the hint and partial code for the cell above, uncomment the following line.
##hint_dataset_vectorize()
##code_text_dataset_vectorize()

# Option 2: "I give up just, run the code for me" 
# Uncomment the next two lines, if you really gave up... 
#full_code_text_dataset_vectorize()
##from tsundoku.word2vec import Text


In [29]:
tokenized_text[5]

['we',
 'present',
 'experimental',
 'evidence',
 'of',
 'how',
 'arbitrary',
 'associations',
 'between',
 'word',
 'frequencies',
 'and',
 'corpora',
 'are',
 'systematically',
 'non-random',
 '.']

In [30]:
text_dataset = Text(tokenized_text)

In [31]:
text_dataset[5] # First sentence.

{'x': [31, 72, 68, 67, 71, 70, 50, 66, 51, 74, 69, 2, 23, 65, 73, 8, 1]}

<a id="section-3-0-2-return-dict"></a>

### Return `dict` in `__getitem__()`

This is nice if we're just representing sentences/documents by their indices but when we're doing machine learning, we usually have `X` and `Y`. 

If we have labels for the each sentence, we can also put it into to `__getitem__()` by having it return a dictionary.

In [32]:
from torch.utils.data import Dataset, DataLoader

class LabeledText(Dataset):
    def __init__(self, tokenized_texts, labels):
        """
        :param tokenized_texts: Tokenized text.
        :type tokenized_texts: list(list(str))
        """
        self.sents = tokenized_texts
        self.labels = labels # Sentence level labels.
        self.vocab = Dictionary(self.sents)

    def __getitem__(self, index):
        """
        The primary entry point for PyTorch datasets.
        This is were you access the specific data row you want.
        
        :param index: Index to the data point.
        :type index: int
        """
        return {'X': self.vectorize(self.sents[index]), 'Y': self.labels[index]}

    def vectorize(self, tokens):
        """
        :param tokens: Tokens that should be vectorized. 
        :type tokens: list(str)
        """
        # See https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.doc2idx 
        return self.vocab.doc2idx(tokens)
    
    def unvectorize(self, indices):
        """
        :param indices: Converts the indices back to tokens.
        :type tokens: list(int)
        """
        return [self.vocab[i] for i in indices]

<a id="section-3-0-2-labeleddata"></a>

### Lets try the `LabeledDataset` on a movie review corpus

In [33]:
from nltk.corpus import movie_reviews

In [34]:
documents = []
labels = []

for fileid in tqdm(movie_reviews.fileids()):
    label = fileid.split('/')[0]
    doc = word_tokenize(movie_reviews.open(fileid).read())
    documents.append(doc)
    labels.append(label)

100%|█████████████████████████████████████████████████████████████████████████████| 2000/2000 [00:18<00:00, 108.12it/s]


In [35]:
documents[0]

['plot',
 ':',
 'two',
 'teen',
 'couples',
 'go',
 'to',
 'a',
 'church',
 'party',
 ',',
 'drink',
 'and',
 'then',
 'drive',
 '.',
 'they',
 'get',
 'into',
 'an',
 'accident',
 '.',
 'one',
 'of',
 'the',
 'guys',
 'dies',
 ',',
 'but',
 'his',
 'girlfriend',
 'continues',
 'to',
 'see',
 'him',
 'in',
 'her',
 'life',
 ',',
 'and',
 'has',
 'nightmares',
 '.',
 'what',
 "'s",
 'the',
 'deal',
 '?',
 'watch',
 'the',
 'movie',
 'and',
 '``',
 'sorta',
 '``',
 'find',
 'out',
 '.',
 '.',
 '.',
 'critique',
 ':',
 'a',
 'mind-fuck',
 'movie',
 'for',
 'the',
 'teen',
 'generation',
 'that',
 'touches',
 'on',
 'a',
 'very',
 'cool',
 'idea',
 ',',
 'but',
 'presents',
 'it',
 'in',
 'a',
 'very',
 'bad',
 'package',
 '.',
 'which',
 'is',
 'what',
 'makes',
 'this',
 'review',
 'an',
 'even',
 'harder',
 'one',
 'to',
 'write',
 ',',
 'since',
 'i',
 'generally',
 'applaud',
 'films',
 'which',
 'attempt',
 'to',
 'break',
 'the',
 'mold',
 ',',
 'mess',
 'with',
 'your',
 'head',
 '

In [36]:
labeled_dataset = LabeledText(documents, labels)

In [37]:
labeled_dataset[0]  # First review in the data.

{'X': [243,
  17,
  314,
  294,
  77,
  140,
  307,
  20,
  68,
  237,
  6,
  97,
  34,
  299,
  98,
  8,
  302,
  135,
  167,
  33,
  22,
  8,
  226,
  220,
  297,
  145,
  87,
  6,
  60,
  158,
  136,
  74,
  307,
  262,
  157,
  165,
  153,
  179,
  6,
  34,
  149,
  214,
  8,
  333,
  2,
  297,
  82,
  18,
  326,
  297,
  204,
  34,
  19,
  280,
  19,
  124,
  230,
  8,
  8,
  8,
  79,
  17,
  20,
  199,
  204,
  129,
  297,
  294,
  133,
  296,
  311,
  225,
  20,
  322,
  75,
  164,
  6,
  60,
  245,
  169,
  165,
  20,
  322,
  46,
  234,
  8,
  337,
  168,
  333,
  188,
  304,
  253,
  33,
  108,
  148,
  226,
  307,
  345,
  6,
  272,
  163,
  132,
  37,
  122,
  337,
  42,
  307,
  59,
  297,
  201,
  6,
  196,
  341,
  348,
  152,
  34,
  290,
  4,
  185,
  156,
  1,
  195,
  5,
  6,
  60,
  300,
  38,
  142,
  34,
  46,
  328,
  220,
  189,
  28,
  315,
  220,
  122,
  6,
  34,
  301,
  128,
  173,
  86,
  208,
  276,
  304,
  226,
  76,
  8,
  302,
  263,
  307,
  150,
  2

In [38]:
labeled_dataset[0]['X']  # Label of the first review in the data. 

[243,
 17,
 314,
 294,
 77,
 140,
 307,
 20,
 68,
 237,
 6,
 97,
 34,
 299,
 98,
 8,
 302,
 135,
 167,
 33,
 22,
 8,
 226,
 220,
 297,
 145,
 87,
 6,
 60,
 158,
 136,
 74,
 307,
 262,
 157,
 165,
 153,
 179,
 6,
 34,
 149,
 214,
 8,
 333,
 2,
 297,
 82,
 18,
 326,
 297,
 204,
 34,
 19,
 280,
 19,
 124,
 230,
 8,
 8,
 8,
 79,
 17,
 20,
 199,
 204,
 129,
 297,
 294,
 133,
 296,
 311,
 225,
 20,
 322,
 75,
 164,
 6,
 60,
 245,
 169,
 165,
 20,
 322,
 46,
 234,
 8,
 337,
 168,
 333,
 188,
 304,
 253,
 33,
 108,
 148,
 226,
 307,
 345,
 6,
 272,
 163,
 132,
 37,
 122,
 337,
 42,
 307,
 59,
 297,
 201,
 6,
 196,
 341,
 348,
 152,
 34,
 290,
 4,
 185,
 156,
 1,
 195,
 5,
 6,
 60,
 300,
 38,
 142,
 34,
 46,
 328,
 220,
 189,
 28,
 315,
 220,
 122,
 6,
 34,
 301,
 128,
 173,
 86,
 208,
 276,
 304,
 226,
 76,
 8,
 302,
 263,
 307,
 150,
 293,
 304,
 246,
 209,
 72,
 6,
 60,
 113,
 169,
 295,
 8,
 277,
 333,
 38,
 297,
 248,
 341,
 297,
 204,
 18,
 331,
 6,
 170,
 186,
 247,
 168,
 296,
 169,
 2,

<a id="section-3-1"></a>

# 3.1 Word2Vec Training

Word2Vec has two training variants:

 - **Continuous Bag of Words (CBOW)**: Predict center word from (bag of) context words.
 - **Skip-grams**: Predict context words given center word.
  
Visually, they look like this:

<img src="https://ibin.co/4UIznsOEyH7t.png" width="500" align="left">


<a id="section-3-1-1"></a>

## 3.1.1. CBOW

CBOW windows through the sentence and picks out the center word as the `Y` and the surrounding context words as the inputs `X`. 


In [39]:
from lazyme import per_window, per_chunk

xx =[1,2,3,4]
list(per_window(xx, n=2))
list(per_chunk(xx, n=3))

[(1, 2, 3), (4, None, None)]

In [40]:
def per_window(sequence, n=1):
    """
    From http://stackoverflow.com/q/42220614/610569
        >>> list(per_window([1,2,3,4], n=2))
        [(1, 2), (2, 3), (3, 4)]
        >>> list(per_window([1,2,3,4], n=3))
        [(1, 2, 3), (2, 3, 4)]
    """
    start, stop = 0, n
    seq = list(sequence)
    while stop <= len(seq):
        yield seq[start:stop]
        start += 1
        stop += 1

def cbow_iterator(tokens, window_size):
    n = window_size * 2 + 1
    for window in per_window(tokens, n):
        target = window.pop(window_size)
        yield window, target   # X = window ; Y = target. 


In [41]:
sent0 = ['language', 'users', 'never', 'choose', 'words', 'randomly', ',', 
         'and', 'language', 'is', 'essentially', 'non-random', '.']

In [42]:
list(cbow_iterator(sent0, 2)) 

[(['language', 'users', 'choose', 'words'], 'never'),
 (['users', 'never', 'words', 'randomly'], 'choose'),
 (['never', 'choose', 'randomly', ','], 'words'),
 (['choose', 'words', ',', 'and'], 'randomly'),
 (['words', 'randomly', 'and', 'language'], ','),
 (['randomly', ',', 'language', 'is'], 'and'),
 ([',', 'and', 'is', 'essentially'], 'language'),
 (['and', 'language', 'essentially', 'non-random'], 'is'),
 (['language', 'is', 'non-random', '.'], 'essentially')]

In [43]:
list(cbow_iterator(sent0, 3)) 

[(['language', 'users', 'never', 'words', 'randomly', ','], 'choose'),
 (['users', 'never', 'choose', 'randomly', ',', 'and'], 'words'),
 (['never', 'choose', 'words', ',', 'and', 'language'], 'randomly'),
 (['choose', 'words', 'randomly', 'and', 'language', 'is'], ','),
 (['words', 'randomly', ',', 'language', 'is', 'essentially'], 'and'),
 (['randomly', ',', 'and', 'is', 'essentially', 'non-random'], 'language'),
 ([',', 'and', 'language', 'essentially', 'non-random', '.'], 'is')]

<a id="section-3-1-2"></a>

## 3.1.2. Skipgram

Skipgram training windows through the sentence and pictures out the center word as the input `X` and the context words as the outputs `Y`, additionally, it will randommly sample words not in the window as **negative samples**.

In [44]:
def skipgram_iterator(tokens, window_size):
    n = window_size * 2 + 1 
    for i, window in enumerate(per_window(tokens, n)):
        target = window.pop(window_size)
        # Generate positive samples.
        for context_word in window:
            yield target, context_word, 1
        # Generate negative samples.
        for _ in range(n-1):
            leftovers = tokens[:i] + tokens[i+n:]
            yield target, random.choice(leftovers), 0

In [45]:
list(skipgram_iterator(sent0, 2))

[('never', 'language', 1),
 ('never', 'users', 1),
 ('never', 'choose', 1),
 ('never', 'words', 1),
 ('never', 'essentially', 0),
 ('never', 'is', 0),
 ('never', 'randomly', 0),
 ('never', '.', 0),
 ('choose', 'users', 1),
 ('choose', 'never', 1),
 ('choose', 'words', 1),
 ('choose', 'randomly', 1),
 ('choose', ',', 0),
 ('choose', 'and', 0),
 ('choose', 'language', 0),
 ('choose', 'language', 0),
 ('words', 'never', 1),
 ('words', 'choose', 1),
 ('words', 'randomly', 1),
 ('words', ',', 1),
 ('words', 'non-random', 0),
 ('words', 'essentially', 0),
 ('words', 'language', 0),
 ('words', 'language', 0),
 ('randomly', 'choose', 1),
 ('randomly', 'words', 1),
 ('randomly', ',', 1),
 ('randomly', 'and', 1),
 ('randomly', 'essentially', 0),
 ('randomly', 'essentially', 0),
 ('randomly', 'users', 0),
 ('randomly', 'language', 0),
 (',', 'words', 1),
 (',', 'randomly', 1),
 (',', 'and', 1),
 (',', 'language', 1),
 (',', '.', 0),
 (',', 'users', 0),
 (',', 'non-random', 0),
 (',', 'users', 0),

## Cut-away: What is `partial`?

The [`functools.partial`](https://docs.python.org/3.7/library/functools.html#functools.partial) function in Python is a mechanism to overload a function with preset arguments. 

For example:

In [46]:
from nltk import ngrams

# Generates bigrams
list(ngrams('this is a sentence'.split(), n=2))

[('this', 'is'), ('is', 'a'), ('a', 'sentence')]

In [47]:
from functools import partial

# You can create a new function that "preset" the `n` argument, e.g.
bigrams = partial(ngrams, n=2)
trigrams = partial(ngrams, n=3)

In [48]:
list(trigrams('this is a sentence'.split()))

[('this', 'is', 'a'), ('is', 'a', 'sentence')]

In [49]:
list(bigrams('this is a sentence'.split()))

[('this', 'is'), ('is', 'a'), ('a', 'sentence')]

<a id="section-3-1-3"></a>

## 3.1.3 Word2Vec Dataset

Now that we know what are the inputs `X` and outputs `Y` of the Word2Vec task. 

Lets put everything together and modify the `Dataset` so that `__getitem__` retrieves CBOW or Skipgram formats.

In [50]:
class Word2VecText(Dataset):
    def __init__(self, tokenized_texts, window_size, variant):
        """
        :param tokenized_texts: Tokenized text.
        :type tokenized_texts: list(list(str))
        """
        self.sents = tokenized_texts
        self._len = len(self.sents)
        self.vocab = Dictionary(self.sents)
        self.window_size = window_size
        self.variant = variant
        if variant.lower() == 'cbow':
            self._iterator = partial(self.cbow_iterator, window_size=self.window_size)
        elif variant.lower() == 'skipgram':
            self._iterator = partial(self.skipgram_iterator, window_size=self.window_size)

    def __getitem__(self, index):
        """
        The primary entry point for PyTorch datasets.
        This is were you access the specific data row you want.
        
        :param index: Index to the data point.
        :type index: int
        """
        vectorized_sent = self.vectorize(self.sents[index])
        
        return list(self._iterator(vectorized_sent))

    def __len__(self):
        return self._len
    
    def vectorize(self, tokens):
        """
        :param tokens: Tokens that should be vectorized. 
        :type tokens: list(str)
        """
        # See https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.doc2idx 
        return self.vocab.doc2idx(tokens)
    
    def unvectorize(self, indices):
        """
        :param indices: Converts the indices back to tokens.
        :type tokens: list(int)
        """
        return [self.vocab[i] for i in indices]

    def cbow_iterator(self,tokens, window_size):
        n = window_size * 2 + 1
        for window in per_window(tokens, n):
            target = window.pop(window_size)
            yield window, target   # X = window ; Y = target. 

    def skipgram_iterator(self, tokens, window_size):
        n = window_size * 2 + 1 
        for i, window in enumerate(per_window(tokens, n)):
            target = window.pop(window_size)
            # Generate positive samples.
            for context_word in window:
                yield target, context_word, 1
            # Generate negative samples.
            for _ in range(n-1):
                leftovers = tokens[:i] + tokens[i+n:]
                yield target, random.choice(leftovers), 0
                

<a id="section-3-1-3-hint"></a>
## Hints for the cell above.

In [51]:
# Option 1: To see the hint and partial code for the cell above, uncomment the following line.
##hint_word2vec_dataset()

# Option 2: "I give up just, run the code for me" 
# Uncomment the next two lines, if you really gave up... 
##full_code_word2vec_dataset()
##from tsundoku.word2vec import Word2VecText


<a id="section-3-1-4-hint"></a>

## 3.1.4. Train a CBOW model

### Lets Get Some Data

Lets take Kilgarriff (2005) , "Language is never ever, ever random". 

In [52]:
import os
import requests
import io #codecs


# Text version of https://kilgarriff.co.uk/Publications/2005-K-lineer.pdf
if os.path.isfile('language-never-random.txt'):
    with io.open('language-never-random.txt', encoding='utf8') as fin:
        text = fin.read()
else:
    url = "https://gist.githubusercontent.com/alvations/53b01e4076573fea47c6057120bb017a/raw/b01ff96a5f76848450e648f35da6497ca9454e4a/language-never-random.txt"
    text = requests.get(url).content.decode('utf8')
    with io.open('language-never-random.txt', 'w', encoding='utf8') as fout:
        fout.write(text)

tokenized_text = [list(map(str.lower, word_tokenize(sent))) for sent in sent_tokenize(text)]
window_size = 2
w2v_dataset = Word2VecText(tokenized_text, window_size=window_size, variant='cbow')

In [53]:
print(text[:1000])

                       Language is never, ever, ever, random

                                                               ADAM KILGARRIFF




Abstract
Language users never choose words randomly, and language is essentially
non-random. Statistical hypothesis testing uses a null hypothesis, which
posits randomness. Hence, when we look at linguistic phenomena in cor-
pora, the null hypothesis will never be true. Moreover, where there is enough
data, we shall (almost) always be able to establish that it is not true. In
corpus studies, we frequently do have enough data, so the fact that a rela-
tion between two phenomena is demonstrably non-random, does not sup-
port the inference that it is not arbitrary. We present experimental evidence
of how arbitrary associations between word frequencies and corpora are
systematically non-random. We review literature in which hypothesis test-
ing has been used, and show how it has often led to unhelpful or mislead-
ing results.
Keywords: 쎲쎲쎲

1. Int

In [54]:
# Sanity check, lets take a look at the data.
print(tokenized_text[0])

['language', 'is', 'never', ',', 'ever', ',', 'ever', ',', 'random', 'adam', 'kilgarriff', 'abstract', 'language', 'users', 'never', 'choose', 'words', 'randomly', ',', 'and', 'language', 'is', 'essentially', 'non-random', '.']


In [55]:
from lazyme import color_str

def visualize_predictions(x, y, prediction, vocab, window_size, unk='<unk>'):
    left = ' '.join([vocab.get(int(_x), '<unk>') for _x in x[:window_size]])
    right = ' '.join([vocab.get(int(_x), '<unk>') for _x in x[window_size:]])
    target = vocab.get(int(y), '<unk>')

    if not prediction:
        predicted_word = '______'
    else:
        predicted_word = vocab.get(int(prediction), '<unk>') 
    print(color_str(target, 'green'), '\t' if len(target) > 6 else '\t\t', 
          left, color_str(predicted_word, 'green' if target == predicted_word else 'red'), right)
    
device = 'cuda' if torch.cuda.is_available() else 'cpu'
sent_idx = 10
window_size = 2
w2v_dataset = Word2VecText(tokenized_text, window_size=window_size, variant='cbow')
print(' '.join(w2v_dataset.sents[sent_idx]))
for w2v_io in w2v_dataset[sent_idx]:
    context, target = w2v_io
    context, target = tensor(context).to(device), tensor(target).to(device)
    visualize_predictions(context, target, None, w2v_dataset.vocab, window_size)

the bulk of linguistic questions concern the dis- tinction between a and m. a linguistic account of a phenomenon gen- erally gives us reason to view the relation between , for example , a verb ’ s syntax and its semantics , as motivated rather than arbitrary .


NameError: name 'device' is not defined

<a id="section-3-1-4-cbow-model"></a>

## Fill-in the code for the CBOW Model

<img src="https://lilianweng.github.io/lil-log/assets/images/word2vec-cbow.png" width="500" align="left">


(Image from https://lilianweng.github.io/lil-log/2017/10/15/learning-word-embedding.html)

In [56]:
import torch
from torch import nn, optim, tensor, autograd
from torch.nn import functional as F

class CBOW(nn.Module):
    def __init__(self, vocab_size, embd_size, context_size, hidden_size):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embd_size)
        self.linear1 = nn.Linear(2*context_size*embd_size, hidden_size)
        self.linear2 = nn.Linear(hidden_size, vocab_size)
        
    def forward(self, inputs):
        embedded = self.embeddings(inputs).view((1, -1))
        hid = F.relu(self.linear1(embedded))
        out = self.linear2(hid)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


## Lets take a closer look from the inputs to the first `nn.Linear`

Cos after it reach the first `nn.Linear` it's just the same as our multi-layered perceptron example =)

In [57]:
# Lets take a look at the first output.
x, y = w2v_dataset[0][0]['x'],  w2v_dataset[0][0]['y'], 

x = tensor(x)
y = autograd.Variable(tensor(y, dtype=torch.long))
print(x)
print(y)

TypeError: tuple indices must be integers or slices, not str

In [58]:
embd_size = 5
emb = nn.Embedding(len(w2v_dataset.vocab), embd_size)
emb.state_dict()

OrderedDict([('weight',
              tensor([[-4.2168e-01,  4.8609e-02, -2.1194e+00, -6.1581e-01, -4.4611e-01],
                      [ 2.1553e-01,  8.2212e-01,  7.6828e-01, -1.9757e+00,  1.0063e+00],
                      [ 1.1148e+00, -1.5603e+00,  9.6289e-02,  1.1540e+00,  5.2998e-03],
                      [ 1.0295e+00, -5.8385e-01,  8.4793e-01,  1.1187e+00, -7.8351e-02],
                      [-6.6948e-01, -8.0136e-01,  1.3363e+00,  7.1023e-01, -8.5575e-01],
                      [-1.2863e+00,  2.8408e-01, -1.0605e+00, -1.6178e+00,  8.6683e-02],
                      [-1.2780e+00, -2.1280e+00,  3.8221e-01,  9.6016e-01,  7.6150e-01],
                      [-1.3492e+00, -1.1232e+00, -6.3847e-01,  1.3816e+00, -2.7207e-01],
                      [-9.0710e-01,  1.7912e+00,  1.5406e+00,  1.7876e+00, -5.0253e-01],
                      [-8.6535e-02, -2.5198e+00,  4.1283e-02, -1.8119e+00, -4.9352e-01],
                      [ 6.7914e-01, -5.3780e-01,  6.0494e-01,  2.9811e-01, -1.4980e+00

In [59]:
print(emb.state_dict()['weight'].shape)
emb.state_dict()['weight']

torch.Size([87, 5])


tensor([[-4.2168e-01,  4.8609e-02, -2.1194e+00, -6.1581e-01, -4.4611e-01],
        [ 2.1553e-01,  8.2212e-01,  7.6828e-01, -1.9757e+00,  1.0063e+00],
        [ 1.1148e+00, -1.5603e+00,  9.6289e-02,  1.1540e+00,  5.2998e-03],
        [ 1.0295e+00, -5.8385e-01,  8.4793e-01,  1.1187e+00, -7.8351e-02],
        [-6.6948e-01, -8.0136e-01,  1.3363e+00,  7.1023e-01, -8.5575e-01],
        [-1.2863e+00,  2.8408e-01, -1.0605e+00, -1.6178e+00,  8.6683e-02],
        [-1.2780e+00, -2.1280e+00,  3.8221e-01,  9.6016e-01,  7.6150e-01],
        [-1.3492e+00, -1.1232e+00, -6.3847e-01,  1.3816e+00, -2.7207e-01],
        [-9.0710e-01,  1.7912e+00,  1.5406e+00,  1.7876e+00, -5.0253e-01],
        [-8.6535e-02, -2.5198e+00,  4.1283e-02, -1.8119e+00, -4.9352e-01],
        [ 6.7914e-01, -5.3780e-01,  6.0494e-01,  2.9811e-01, -1.4980e+00],
        [-1.7107e+00, -1.1553e-02,  9.6819e-01,  5.5295e-01, -1.2484e-01],
        [ 1.2672e+00, -1.4675e+00,  5.1901e-01, -1.4572e+00,  1.1917e+00],
        [-6.3780e-01, -1.

In [60]:
print(emb(x).shape)
print(emb(x))

NameError: name 'x' is not defined

In [61]:
print(emb(x).view(1, -1).shape)
emb(x).view(1, -1)

NameError: name 'x' is not defined

In [119]:
hidden_size = 100
lin1 = nn.Linear(len(x)*embd_size, hidden_size)
print(lin1.state_dict())

OrderedDict([('weight', tensor([[ 0.0682, -0.0824, -0.1613,  ...,  0.0075,  0.0211, -0.1730],
        [ 0.1452,  0.1545, -0.1338,  ..., -0.0559,  0.1861, -0.1367],
        [ 0.1486,  0.1155,  0.0959,  ...,  0.0907,  0.0113,  0.1604],
        ...,
        [ 0.1149,  0.0058, -0.0836,  ...,  0.0469, -0.1511, -0.1294],
        [ 0.1305,  0.0394,  0.1792,  ...,  0.0004,  0.1719, -0.1286],
        [ 0.0849, -0.0992,  0.1415,  ..., -0.0807, -0.0264, -0.0846]])), ('bias', tensor([ 1.8791e-02,  1.5262e-01,  1.5195e-01,  1.1731e-01,  9.1180e-02,
         9.6480e-02,  4.2502e-02,  1.1213e-01, -9.4328e-02,  1.0605e-01,
        -1.1435e-01,  5.4850e-02,  2.5943e-02,  1.8330e-01,  1.5268e-02,
        -1.9103e-01,  3.5164e-02, -7.9650e-02, -9.0032e-02, -1.1632e-01,
         7.7349e-02,  1.2977e-01, -1.3206e-01, -7.0709e-03, -1.3217e-01,
         1.0222e-01,  1.4309e-01,  6.8925e-02,  1.8032e-01,  1.4187e-01,
        -8.7168e-03, -1.8049e-01, -1.3915e-01, -1.3955e-01, -5.4416e-03,
        -4.7703e-02,

In [120]:
print(lin1.state_dict()['weight'].shape)
print(lin1.state_dict()['weight'])

torch.Size([100, 25])
tensor([[ 0.0682, -0.0824, -0.1613,  ...,  0.0075,  0.0211, -0.1730],
        [ 0.1452,  0.1545, -0.1338,  ..., -0.0559,  0.1861, -0.1367],
        [ 0.1486,  0.1155,  0.0959,  ...,  0.0907,  0.0113,  0.1604],
        ...,
        [ 0.1149,  0.0058, -0.0836,  ...,  0.0469, -0.1511, -0.1294],
        [ 0.1305,  0.0394,  0.1792,  ...,  0.0004,  0.1719, -0.1286],
        [ 0.0849, -0.0992,  0.1415,  ..., -0.0807, -0.0264, -0.0846]])


In [121]:
print(lin1(emb(x).view(1, -1)).shape)
lin1(emb(x).view(1, -1))

torch.Size([1, 100])


tensor([[ 0.3971,  0.3754, -0.4184,  0.1931, -0.3988, -0.2276,  0.0472,  0.6021,
          0.2035, -0.1339,  0.0322, -0.7428, -0.5011,  0.1104, -0.6841,  0.2021,
          0.4721,  0.0744, -0.0591, -0.3688,  0.5280, -0.2357, -0.1436, -0.6866,
          0.0530,  0.2143,  0.9329,  0.1224, -0.1782,  0.0445, -0.2867,  0.4525,
         -0.4842,  0.2364,  1.3188, -0.2338, -0.2273, -0.7911,  0.8791, -0.1656,
          0.8919, -0.4287, -0.7792, -0.0292, -0.0628,  0.3437,  0.6424,  0.0903,
          0.3973, -0.1186,  0.6307, -0.7290,  0.1930, -0.1160,  0.3763,  0.3032,
         -0.0249, -0.4606, -0.2629, -0.3670,  0.2756,  0.3389,  0.7264, -0.2258,
          0.5897, -0.0961,  0.1126, -0.8335, -0.5276,  0.0637, -0.4331,  1.4119,
         -0.0479, -0.8726, -0.0240,  0.2704,  0.0539, -0.2571,  0.0348,  0.2081,
          0.1643,  0.6453,  0.6159, -0.3658,  0.2794, -0.6706,  0.2810, -0.5973,
          0.1143, -0.3701,  0.0720,  0.5984, -0.2203, -0.3437, -0.8729, -0.9318,
         -0.2127, -0.2728, -

In [122]:
relu = nn.ReLU()
print(relu(lin1(emb(x).view(1, -1))).shape)
relu(lin1(emb(x).view(1, -1)))

torch.Size([1, 100])


tensor([[0.3971, 0.3754, 0.0000, 0.1931, 0.0000, 0.0000, 0.0472, 0.6021, 0.2035,
         0.0000, 0.0322, 0.0000, 0.0000, 0.1104, 0.0000, 0.2021, 0.4721, 0.0744,
         0.0000, 0.0000, 0.5280, 0.0000, 0.0000, 0.0000, 0.0530, 0.2143, 0.9329,
         0.1224, 0.0000, 0.0445, 0.0000, 0.4525, 0.0000, 0.2364, 1.3188, 0.0000,
         0.0000, 0.0000, 0.8791, 0.0000, 0.8919, 0.0000, 0.0000, 0.0000, 0.0000,
         0.3437, 0.6424, 0.0903, 0.3973, 0.0000, 0.6307, 0.0000, 0.1930, 0.0000,
         0.3763, 0.3032, 0.0000, 0.0000, 0.0000, 0.0000, 0.2756, 0.3389, 0.7264,
         0.0000, 0.5897, 0.0000, 0.1126, 0.0000, 0.0000, 0.0637, 0.0000, 1.4119,
         0.0000, 0.0000, 0.0000, 0.2704, 0.0539, 0.0000, 0.0348, 0.2081, 0.1643,
         0.6453, 0.6159, 0.0000, 0.2794, 0.0000, 0.2810, 0.0000, 0.1143, 0.0000,
         0.0720, 0.5984, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
         0.5720]], grad_fn=<ThresholdBackward0>)

In [123]:
lin2 = nn.Linear(hidden_size, len(w2v_dataset.vocab))
print(lin2.state_dict()['weight'].shape)
lin2.state_dict()['weight']

torch.Size([1388, 100])


tensor([[-0.0163, -0.0994,  0.0672,  ...,  0.0186, -0.0159,  0.0627],
        [-0.0074, -0.0539, -0.0873,  ..., -0.0316, -0.0493,  0.0676],
        [-0.0643, -0.0600, -0.0529,  ..., -0.0705, -0.0151, -0.0443],
        ...,
        [-0.0546, -0.0097, -0.0718,  ..., -0.0164,  0.0428,  0.0612],
        [-0.0227, -0.0751, -0.0962,  ...,  0.0180,  0.0922,  0.0302],
        [-0.0727, -0.0318, -0.0375,  ...,  0.0241, -0.0968,  0.0940]])

In [124]:
h_x = relu(lin1(emb(x).view(1, -1)))
print(lin2(h_x).shape)
lin2(h_x)

torch.Size([1, 1388])


tensor([[ 0.2554,  0.0995,  0.1225,  ..., -0.0309, -0.2753, -0.0698]],
       grad_fn=<AddmmBackward>)

In [125]:
softmax = nn.LogSoftmax(dim=1)
softmax(lin2(h_x)).detach().numpy().tolist()

[[-7.008087635040283,
  -7.164051532745361,
  -7.140991687774658,
  -6.906752109527588,
  -7.240259647369385,
  -7.038198471069336,
  -7.3443193435668945,
  -6.702448844909668,
  -7.318045616149902,
  -6.870830535888672,
  -7.056977272033691,
  -7.279659271240234,
  -7.4093427658081055,
  -6.968787670135498,
  -7.378303527832031,
  -7.249201774597168,
  -7.308876037597656,
  -7.218778610229492,
  -7.1557393074035645,
  -7.171926975250244,
  -7.515322208404541,
  -7.045439720153809,
  -7.370819091796875,
  -7.049893379211426,
  -6.97169828414917,
  -7.340801239013672,
  -7.147621154785156,
  -6.899478912353516,
  -7.073168754577637,
  -6.781639099121094,
  -7.03122615814209,
  -7.764267444610596,
  -7.429055213928223,
  -7.157585144042969,
  -7.229382038116455,
  -7.688353061676025,
  -7.3556647300720215,
  -7.453991413116455,
  -7.008212566375732,
  -7.171889305114746,
  -7.36814546585083,
  -7.133216857910156,
  -7.111328125,
  -7.182351112365723,
  -7.2988715171813965,
  -7.473907947

In [126]:
# Select the index with highest softmax probabilities
torch.max(softmax(lin2(h_x)), 1)

(tensor([-6.5213], grad_fn=<MaxBackward0>), tensor([1173]))

<a id="section-3-1-4-train-cbow"></a>

# Now, we train the CBOW model for real.

In [127]:
# First we split the data into training and testing.
from sklearn.model_selection import train_test_split

tokenized_text_train, tokenized_text_test = train_test_split(tokenized_text, test_size=0.1, random_state=42)
len(tokenized_text_train), len(tokenized_text_test)

(211, 24)

In [128]:
import torch
from torch import nn, optim, tensor, autograd
from torch.nn import functional as F

class CBOW(nn.Module):
    def __init__(self, vocab_size, embd_size, context_size, hidden_size):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embd_size)
        self.linear1 = nn.Linear(2*context_size*embd_size, hidden_size)
        self.linear2 = nn.Linear(hidden_size, vocab_size)
        
    def forward(self, inputs):
        embedded = self.embeddings(inputs).view((1, -1))
        hid = F.relu(self.linear1(embedded))
        out = self.linear2(hid)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


In [130]:
embd_size = 100
learning_rate = 0.003
hidden_size = 100
window_size = 2

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Initialize the dataset.
w2v_dataset = Word2VecText(tokenized_text_train, window_size=window_size, variant='cbow')
vocab_size = len(w2v_dataset.vocab)

criterion = nn.NLLLoss()
model = CBOW(vocab_size, embd_size, window_size, hidden_size).to(device)
optimizer = optim.SGD(model.parameters(), lr=learning_rate)

losses = []

model = nn.DataParallel(model)

num_epochs = 100
for _e in tqdm(range(num_epochs)):
    epoch_loss = []
    for sent_idx in range(w2v_dataset._len):
        for w2v_io in w2v_dataset[sent_idx]:
            # Retrieve the inputs and outputs.
            x, y = w2v_io
            x = tensor(x).to(device)
            y = autograd.Variable(tensor(y, dtype=torch.long)).to(device)
            # Zero gradient.
            model.zero_grad()
            # Calculate the log probability of the context embeddings.
            logprobs = model(x)
            # This unsqueeze thing is really a feature/bug... -_-
            loss = criterion(logprobs, y.unsqueeze(0)) 
            loss.backward()
            optimizer.step()
            epoch_loss.append(float(loss))
    # Save model after every epoch.
    torch.save(model.state_dict(), 'cbow_checkpoint_{}.pt'.format(_e))
    losses.append(sum(epoch_loss)/len(epoch_loss))




  0%|          | 0/100 [00:00<?, ?it/s][A
[A

KeyboardInterrupt: 

In [83]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

sns.set_style("darkgrid")
sns.set(rc={'figure.figsize':(12, 8)})

plt.plot(losses)
plt.show()

<Figure size 1200x800 with 1 Axes>

<a id="section-3-1-4-evaluate-cbow"></a>

# Apply and Evaluate the CBOW Model 

In [69]:
from lazyme import color_str

true_positive = 0
all_data = 0
# Iterate through the test sentences. 
for sent in tokenized_text_test:
    # Extract all the CBOW contexts (X) and targets (Y)
    for w2v_io in w2v_dataset._iterator(w2v_dataset.vectorize(sent)):
        # Retrieve the inputs and outputs.
        x = tensor(w2v_io['x'])
        y = tensor(w2v_io['y'])
        
        if -1 in x: # Skip unknown words.
            continue
            
        with torch.no_grad():
            _, prediction =  torch.max(model(x), 1)
        true_positive += int(prediction) == int(y)
        visualize_predictions(x, y, prediction, w2v_dataset.vocab, window_size=window_size)
        all_data += 1

NameError: name 'tokenized_text_test' is not defined

In [144]:
print('Accuracy:', true_positive/all_data)

Accuracy: 0.24497991967871485


<a id="section-3-1-4-load-model"></a>

# Go back to the 10th Epoch

In [72]:
model_10 = CBOW(vocab_size, embd_size, window_size, hidden_size)
model_10 = torch.nn.DataParallel(model_10)
model_10.load_state_dict(torch.load('cbow_checkpoint_10.pt'))
model_10.eval()

DataParallel(
  (module): CBOW(
    (embeddings): Embedding(1310, 100)
    (linear1): Linear(in_features=400, out_features=100, bias=True)
    (linear2): Linear(in_features=100, out_features=1310, bias=True)
  )
)

In [73]:

true_positive = 0
all_data = 0
# Iterate through the test sentences. 
for sent in tokenized_text_test:
    # Extract all the CBOW contexts (X) and targets (Y)
    for w2v_io in w2v_dataset._iterator(w2v_dataset.vectorize(sent)):
        # Retrieve the inputs and outputs.
        x = tensor(w2v_io['x'])
        y = tensor(w2v_io['y'])
        
        if -1 in x: # Skip unknown words.
            continue
            
        with torch.no_grad():
            _, prediction =  torch.max(model_10(x), 1)
        true_positive += int(prediction) == int(y)
        visualize_predictions(x, y, prediction, w2v_dataset.vocab, window_size=window_size)
        all_data += 1

[92mis[0m 		 the problem [91m______[0m essentially this
[92messentially[0m 	 problem is [91mnot[0m this :
[92mthis[0m 		 is essentially [91mto[0m : if
[92m:[0m 		 essentially this [91m([0m if a
[92mif[0m 		 this : [91mas[0m a word
[92ma[0m 		 : if [91mand[0m word (
[92mword[0m 		 if a [91mtwo[0m ( or
[92m([0m 		 a word [91m______[0m or bigram
[92mor[0m 		 word ( [91m1[0m bigram ,
[92mbigram[0m 		 ( or [91mrandom[0m , or
[92m<unk>[0m 		 , or [91mrandom[0m , or
[92m<unk>[0m 		 , or [91mby[0m etc. )
[92mis[0m 		 the web [92mis[0m a vast
[92ma[0m 		 web is [91mthe[0m vast re-
[92mvast[0m 		 is a [91msmall[0m re- source
[92mre-[0m 		 a vast [91mto[0m source for
[92msource[0m 		 vast re- [91m:[0m for many
[92mthe[0m 		 is that [92mthe[0m association is
[92massociation[0m 	 that the [91mprobability[0m is random
[92mis[0m 		 the association [91mrandom[0m random ,
[92mrandom[0m 		 association is [91mever[0m , ar

[92msyntax[0m 		 ’ s [91m______[0m and its
[92m<unk>[0m 		 and its [91mdata[0m , as
[92mmotivated[0m 	 , as [91ma[0m rather than
[92mrather[0m 		 as motivated [91mless[0m than arbitrary
[92mthan[0m 		 motivated rather [91mand[0m arbitrary .
[92mvalue[0m 		 the average [92mvalue[0m of the
[92mof[0m 		 average value [92mof[0m the error
[92mthe[0m 		 value of [92mthe[0m error term
[92merror[0m 		 of the [91mnull[0m term ,
[92mterm[0m 		 the error [92mterm[0m , language
[92m,[0m 		 error term [91mis[0m language is
[92mlanguage[0m 	 term , [92mlanguage[0m is never
[92mis[0m 		 , language [92mis[0m never ,
[92mnever[0m 		 language is [92mnever[0m , ever
[92m,[0m 		 is never [91m______[0m ever ,
[92mever[0m 		 never , [92mever[0m , ever
[92m,[0m 		 , ever [91m______[0m ever ,
[92mever[0m 		 ever , [92mever[0m , random
[92m<unk>[0m 		 , random [91msamples[0m ( &#124;
[92mo[0m 		 ( &#124; [91m)[0m ⫺ e
[92m⫺[0m 		 

In [74]:
print('Accuracy:', true_positive/all_data)

Accuracy: 0.24096385542168675


# [optional] How to Handle Unknown Words? 

This is not the best way to handle unknown words, but we can simply assign an index for unknown words.

**Hint:** Ensure that you have `gensim` version 3.7.0 first. Otherwise this part of the code won't work. Try `python -m pip install -U pip` and then `python -m pip install -U gensim==3.7.0`

In [75]:
vocab = Dictionary(['this is a foo bar sentence'.split()])
dict(vocab.items())

{0: 'a', 1: 'bar', 2: 'foo', 3: 'is', 4: 'sentence', 5: 'this'}

In [58]:
# See https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.patch_with_special_tokens
vocab = Dictionary(['this is a foo bar sentence'.split()])

try:
    special_tokens = {'<pad>': 0, '<unk>': 1}
    vocab.patch_with_special_tokens(special_tokens)
except: # If gensim is not 3.7.0
    pass
    
dict(vocab.items())

{6: 'a',
 7: 'bar',
 2: 'foo',
 3: 'is',
 4: 'sentence',
 5: 'this',
 0: '<pad>',
 1: '<unk>'}

# [optional] Lets Rewrite the `Word2VecText` Object

Now with the (i) unknown word patch in the vocabulary as well as (ii) `skipgram_iterator`

In [131]:
class Word2VecText(Dataset):
    def __init__(self, tokenized_texts, window_size, variant):
        """
        :param tokenized_texts: Tokenized text.
        :type tokenized_texts: list(list(str))
        """
        self.sents = tokenized_texts
        self._len = len(self.sents)
        
        # Add the unknown word patch here.
        self.vocab = Dictionary(self.sents)
        try:
            special_tokens = {'<pad>': 0, '<unk>': 1}
            self.vocab.patch_with_special_tokens(special_tokens)
        except:
            pass
        
        self.window_size = window_size
        self.variant = variant
        if variant.lower() == 'cbow':
            self._iterator = self.cbow_iterator
        elif variant.lower() == 'skipgram':
            self._iterator = self.skipgram_iterator

    def __getitem__(self, index):
        """
        The primary entry point for PyTorch datasets.
        This is were you access the specific data row you want.
        
        :param index: Index to the data point.
        :type index: int
        """
        vectorized_sent = self.vectorize(self.sents[index])
        
        return list(self._iterator(vectorized_sent, self.window_size))

    def __len__(self):
        return self._len
    
    def vectorize(self, tokens):
        """
        :param tokens: Tokens that should be vectorized. 
        :type tokens: list(str)
        """
        # See https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.doc2idx 
        return self.vocab.doc2idx(tokens, unknown_word_index=1)
    
    def unvectorize(self, indices):
        """
        :param indices: Converts the indices back to tokens.
        :type tokens: list(int)
        """
        return [self.vocab[i] for i in indices]
    
    def cbow_iterator(self, tokens, window_size):
        n = window_size * 2 + 1
        for window in per_window(tokens, n):
            target = window.pop(window_size)
            yield {'x': window, 'y': target}   # X = window ; Y = target. 
            
    def skipgram_iterator(self, tokens, window_size):
        n = window_size * 2 + 1 
        for i, window in enumerate(per_window(tokens, n)):
            focus = window.pop(window_size)
            # Generate positive samples.
            for context_word in window:
                yield {'x': (focus, context_word), 'y':1}
            # Generate negative samples.
            for _ in range(n-1):
                leftovers = tokens[:i] + tokens[i+n:]
                if leftovers:
                    yield {'x': (focus, random.choice(leftovers)), 'y':0}
                

<a id="section-3-1-5"></a>

# Lets try the skipgram task

In [132]:
class SkipGram(nn.Module):
    def __init__(self, vocab_size, embd_size):
        super(SkipGram, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embd_size)
    
    def forward(self, focus, context):
        embed_focus = self.embeddings(focus).view((1, -1))
        embed_context = self.embeddings(context).view((1, -1))
        # See https://pytorch.org/docs/stable/torch.html#torch.t
        score = torch.mm(embed_focus, torch.t(embed_context))
        log_probs = F.logsigmoid(score)
        return log_probs

<a id="section-3-1-5-foward"></a>

# Take a closer look at what's in the `forward()`

In [133]:
xx1 = torch.rand(1,20)
xx2 = torch.rand(1,20)

xx1_numpy = xx1.detach().numpy()
xx2_numpy = xx2.detach().numpy()

In [134]:
print(xx1_numpy.shape)
print(xx2_numpy.T.shape)
print(np.dot(xx1_numpy, xx2_numpy.T))

(1, 20)
(20, 1)
[[3.1778643]]


In [135]:
print(xx1.shape)
print(torch.t(xx2).shape) 

print(torch.mm(xx1, torch.t(xx2))) # 

torch.Size([1, 20])
torch.Size([20, 1])
tensor([[3.1779]])


<a id="section-3-1-5-train"></a>

# Train a Skipgram model (for real)

In [136]:

embd_size = 100
learning_rate = 0.03
hidden_size = 300
window_size = 3

# Initialize the dataset.
w2v_dataset = Word2VecText(tokenized_text_train, window_size=3, variant='skipgram')
vocab_size = len(w2v_dataset.vocab)

criterion = nn.MSELoss()
model = SkipGram(vocab_size, embd_size,).to(device)
optimizer = optim.SGD(model.parameters(), lr=learning_rate)

losses = []

model = nn.DataParallel(model)

num_epochs = 100
for _e in tqdm(range(num_epochs)):
    epcoh_loss = 0
    for sent_idx in range(w2v_dataset._len):
        for w2v_io in w2v_dataset[sent_idx]:
            # Retrieve the inputs and outputs.
            x1, x2 = w2v_io['x']
            x1, x2 = tensor(x1).to(device), tensor(x2).to(device)
            y = autograd.Variable(tensor(w2v_io['y'], dtype=torch.float)).to(device)
            # Zero gradient.
            model.zero_grad()
            # Calculate the log probability of the context embeddings.
            logprobs = model(x1, x2)
            # This unsqueeze thing is really a feature/bug... -_-
            loss = criterion(logprobs, y.unsqueeze(0)) 
            loss.backward()
            optimizer.step()
            epcoh_loss += float(loss)
    torch.save(model.state_dict(), 'skipgram_checkpoint_{}.pt'.format(_e))
    losses.append(epcoh_loss)




  0%|          | 0/100 [00:00<?, ?it/s][A
[A

KeyboardInterrupt: 

<a id="section-3-1-5-evaluate"></a>

# Evaluate the model on the skipgram task

In [137]:

true_positive = 0
all_data = 0
# Iterate through the test sentences. 
for sent in tokenized_text_test:
    # Extract all the CBOW contexts (X) and targets (Y)
    for w2v_io in w2v_dataset._iterator(w2v_dataset.vectorize(sent)):
        model.zero_grad()
        # Retrieve the inputs and outputs.
        x1, x2 = w2v_io['x']
        x1, x2 = tensor(x1), tensor(x2)
        y = w2v_io['y']
        _, prediction =  torch.max(model(x1, x2), 1)    
        true_positive += int(prediction) == int(y)
        all_data += 1

TypeError: skipgram_iterator() missing 1 required positional argument: 'window_size'

In [138]:
print('Accuracy:', true_positive/all_data)

ZeroDivisionError: division by zero

## Download the Collobert and Weston SENNA Embeddings


If you're on a Mac or Linux, you can use the `!` bang commands in the next cell to get the data.

```
!pip install kaggle
!mkdir -p .kaggle
!echo '{"username":"natgillin","key":"54ae95ab760b52c3307ed4645c6c9b5d"}' > .kaggle/kaggle.json
!chmod 600 .kaggle/kaggle.json
!kaggle datasets download -d alvations/vegetables-senna-embeddings --force -p ./
```

If you're on windows go to https://www.kaggle.com/alvations/vegetables-senna-embeddings and download the data files. 

What's most important are the 
 - `.txt` file that contains the vocabulary list
 - `.npy` file that contains the binarized numpy array
 
The rows of the numpy array corresponds to the vocabulary in the order from the `.txt` file.

<a id="section-3-1-6-vocab"></a>


## 3.1.6. Loading Pre-trained Embeddings

Lets overwrite the `Word2VecText` object with the pretrained embeddings. 

Most important thing is the overwrite the `Dictionary` from `gensim` with the vocabulary of the pre-trained embeddings, as such:

```python
        # Loads the pretrained keys. 
        with open('senna.wiki-reuters.lm2.50d.txt') as fin:
            pretrained_keys = {line.strip():i for i, line in enumerate(fin)}
        self.vocab = Dictionary({})
        self.vocab.token2id = pretrained_keys
```


In [104]:
class Word2VecText(Dataset):
    def __init__(self, tokenized_texts, window_size, variant):
        """
        :param tokenized_texts: Tokenized text.
        :type tokenized_texts: list(list(str))
        """
        self.sents = tokenized_texts
        self._len = len(self.sents)
        
        # Loads the pretrained keys. 
        with open('senna.wiki-reuters.lm2.50d.txt') as fin:
            pretrained_keys = {line.strip():i for i, line in enumerate(fin)}
        self.vocab = Dictionary({})
        self.vocab.token2id = pretrained_keys
        
        self.window_size = window_size
        self.variant = variant
        if variant.lower() == 'cbow':
            self._iterator = partial(self.cbow_iterator, window_size=self.window_size)
        elif variant.lower() == 'skipgram':
            self._iterator = partial(self.skipgram_iterator, window_size=self.window_size)

    def __getitem__(self, index):
        """
        The primary entry point for PyTorch datasets.
        This is were you access the specific data row you want.
        
        :param index: Index to the data point.
        :type index: int
        """
        vectorized_sent = self.vectorize(self.sents[index])
        
        return list(self._iterator(vectorized_sent))

    def __len__(self):
        return self._len
    
    def vectorize(self, tokens):
        """
        :param tokens: Tokens that should be vectorized. 
        :type tokens: list(str)
        """
        # See https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.doc2idx 
        return self.vocab.doc2idx(tokens, unknown_word_index=-1)
    
    def unvectorize(self, indices):
        """
        :param indices: Converts the indices back to tokens.
        :type tokens: list(int)
        """
        return [self.vocab[i] for i in indices]
    
    def cbow_iterator(self, tokens, window_size):
        n = window_size * 2 + 1
        for window in per_window(tokens, n):
            target = window.pop(window_size)
            yield {'x': window, 'y': target}   # X = window ; Y = target. 
            
    def skipgram_iterator(self, tokens, window_size):
        n = window_size * 2 + 1 
        for i, window in enumerate(per_window(tokens, n)):
            focus = window.pop(window_size)
            # Generate positive samples.
            for context_word in window:
                yield {'x': (focus, context_word), 'y':1}
            # Generate negative samples.
            for _ in range(n-1):
                leftovers = tokens[:i] + tokens[i+n:]
                if leftovers:
                    yield {'x': (focus, random.choice(leftovers)), 'y':0}
                

<a id="section-3-1-6-pretrained"></a>

## Override the embeddings layer with the pre-trained weights.

In PyTorch, the weights of the `nn.Embedding` object can be easily overwritten with `from_pretrained` function, see https://pytorch.org/docs/stable/nn.html#embedding

In [105]:
class SkipGram(nn.Module):
    def __init__(self, pretrained_npy):
        super(SkipGram, self).__init__()
        self.embeddings = nn.Embedding.from_pretrained(pretrained_npy)
    
    def forward(self, focus, context):
        embed_focus = self.embeddings(focus).view((1, -1))
        embed_context = self.embeddings(context).view((1, -1))
        # See https://pytorch.org/docs/stable/torch.html#torch.t
        score = torch.mm(embed_focus, torch.t(embed_context))
        log_probs = F.logsigmoid(score)
        return log_probs

In [81]:
w2v_dataset = Word2VecText(tokenized_text_train, window_size=window_size, variant='skipgram')
pretrained_npy = torch.tensor(np.load('senna.wiki-reuters.lm2.50d.npy'))
pretrained_model = SkipGram(pretrained_npy)

NameError: name 'tokenized_text_train' is not defined

<a id="section-3-1-6-eval-skipgram"></a>
## Test Pretrained Embeddings on the Skipgram Task

In [106]:
true_positive = 0
all_data = 0
# Iterate through the test sentences. 
for sent in tokenized_text_test:
    # Extract all the CBOW contexts (X) and targets (Y)
    for w2v_io in w2v_dataset._iterator(w2v_dataset.vectorize(sent)):
        pretrained_model.zero_grad()
        # Retrieve the inputs and outputs.
        x1, x2 = w2v_io['x']
        if -1 in (x1, x2): # Skip unknown words.
            continue
        x1, x2 = tensor(x1), tensor(x2)
        y = w2v_io['y']
        with torch.no_grad():
            logprobs = pretrained_model(x1, x2)
            _, prediction =  torch.max(logprobs, 1)    
        true_positive += int(prediction) == int(y)
        all_data += 1

NameError: name 'tokenized_text_test' is not defined

In [107]:
with open('senna.wiki-reuters.lm2.50d.txt') as fin:
    pretrained_keys = {line.strip():i for i, line in enumerate(fin)}

In [108]:
print('Accuracy:', true_positive/all_data)

ZeroDivisionError: division by zero

<a id="section-3-1-6-eval-cbow"></a>
## Test Pretrained Embeddings on the CBOW Task

In [110]:
class CBOW(nn.Module):
    def __init__(self, pretrained_npy, context_size, hidden_size):
        super(CBOW, self).__init__()
        vocab_size, embd_size = list(pretrained_npy.shape)
        self.embeddings = nn.Embedding.from_pretrained(pretrained_npy)
        self.linear1 = nn.Linear(2*context_size*embd_size, hidden_size)
        self.linear2 = nn.Linear(hidden_size, vocab_size)
        
    def forward(self, inputs):
        embedded = self.embeddings(inputs).float().view((1, -1))
        hid = F.relu(self.linear1(embedded))
        out = self.linear2(hid)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


In [111]:
window_size = 5
w2v_dataset = Word2VecText(tokenized_text_train, window_size=window_size, variant='cbow')
hidden_size = 300
pretrained_cbow_model = CBOW(pretrained_npy, window_size, hidden_size)

NameError: name 'tokenized_text_train' is not defined

In [112]:

true_positive = 0
all_data = 0
# Iterate through the test sentences. 
for sent in tokenized_text_test:
    # Extract all the CBOW contexts (X) and targets (Y)
    for w2v_io in w2v_dataset._iterator(w2v_dataset.vectorize(sent)):
        # Retrieve the inputs and outputs.
        x = tensor(w2v_io['x'])
        y = tensor(w2v_io['y'])
        
        if -1 in x: # Skip unknown words.
            continue
        with torch.no_grad():
            _, prediction =  torch.max(pretrained_cbow_model(x), 1)
        true_positive += int(prediction) == int(y)
        visualize_predictions(x, y, prediction, w2v_dataset.vocab, window_size=window_size)
        all_data += 1

NameError: name 'tokenized_text_test' is not defined

In [113]:
print('Accuracy:', true_positive/all_data)

ZeroDivisionError: division by zero

<a id="section-3-1-6-unfreeze-finetune"></a>
## Unfreeze the Embedddings and Tune it on the CBOW Task

In [114]:
class CBOW(nn.Module):
    def __init__(self, pretrained_npy, context_size, hidden_size):
        super(CBOW, self).__init__()
        vocab_size, embd_size = list(pretrained_npy.shape)
        self.embeddings = nn.Embedding.from_pretrained(pretrained_npy, freeze=False)
        self.linear1 = nn.Linear(2*context_size*embd_size, hidden_size)
        self.linear2 = nn.Linear(hidden_size, vocab_size)
        
    def forward(self, inputs):
        embedded = self.embeddings(inputs).float().view((1, -1))
        hid = F.relu(self.linear1(embedded))
        out = self.linear2(hid)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


In [115]:
window_size = 2 
w2v_dataset = Word2VecText(tokenized_text_train, window_size=window_size, variant='cbow')
hidden_size = 300
pretrained_cbow_model = CBOW(pretrained_npy, window_size, hidden_size)

NameError: name 'tokenized_text_train' is not defined

In [116]:
learning_rate = 0.003
criterion = nn.NLLLoss()
optimizer = optim.SGD(pretrained_cbow_model.parameters(), lr=learning_rate)

losses = []

model = nn.DataParallel(pretrained_cbow_model)

num_epochs = 100
for _e in tqdm(range(num_epochs)):
    epoch_loss = []
    for sent_idx in range(w2v_dataset._len):
        for w2v_io in w2v_dataset[sent_idx]:
            # Retrieve the inputs and outputs.
            x = tensor(w2v_io['x'])
            y = autograd.Variable(tensor(w2v_io['y'], dtype=torch.long))
            
            if -1 in x or int(y) == -1:
                continue
            # Zero gradient.
            model.zero_grad()
            # Calculate the log probability of the context embeddings.
            logprobs = pretrained_cbow_model(x)
            # This unsqueeze thing is really a feature/bug... -_-
            loss = criterion(logprobs, y.unsqueeze(0)) 
            loss.backward()
            optimizer.step()
            epoch_loss.append(float(loss))
    # Save model after every epoch.
    torch.save(model.state_dict(), 'cbow_finetuning_checkpoint_{}.pt'.format(_e))
    losses.append(sum(epoch_loss)/len(epoch_loss))



NameError: name 'pretrained_cbow_model' is not defined

<a id="section-3-1-6-reval-cbow"></a>

## Re-Test Pretrained Embeddings on the CBOW Task

In [None]:

true_positive = 0
all_data = 0
# Iterate through the test sentences. 
for sent in tokenized_text_test:
    # Extract all the CBOW contexts (X) and targets (Y)
    for w2v_io in w2v_dataset._iterator(w2v_dataset.vectorize(sent)):
        # Retrieve the inputs and outputs.
        x = tensor(w2v_io['x'])
        y = tensor(w2v_io['y'])
        
        if -1 in x: # Skip unknown words.
            continue
        with torch.no_grad():
            _, prediction =  torch.max(pretrained_cbow_model(x), 1)
        true_positive += int(prediction) == int(y)
        visualize_predictions(x, y, prediction, w2v_dataset.vocab, window_size=window_size)
        all_data += 1

In [117]:
print('Accuracy:', true_positive/all_data)

ZeroDivisionError: division by zero