## Instructions
0. If you haven't already, follow [the setup instructions here](https://jennselby.github.io/MachineLearningCourseNotes/#setting-up-python3) to get all necessary software installed.
0. Install the Gensim word2vec Python implementation: `python3 -m pip install --upgrade gensim`
0. Get the trained model (1billion_word_vectors.zip) from Canvas and put it in the same folder as this ipynb file.
0. Unzip the trained model file. You should now have three files in the folder (if zip created a new folder, move these files out of that separate folder into the same folder as this ipynb file):
    * 1billion_word_vectors
    * 1billion_word_vectors.syn1neg.npy
    * 1billion_word_vectors.wv.syn0.npy
0. Read through the code in the following sections:
    * [Load trained word vectors](#Load-Trained-Word-Vectors)
    * [Explore word vectors](#Explore-Word-Vectors)
0. Optionally, complete [Exercise: Explore Word Vectors](#Exercise:-Explore-Word-Vectors)
0. Read through the code in the following sections:
    * [Use Word Vectors in an Embedding Layer of a Keras Model](#Use-Word-Vectors-in-an-Embedding-Layer-of-a-Keras-Model)
    * [IMDB Dataset](#IMDB-Dataset)
    * [Train IMDB Word Vectors](#Train-IMDB-Word-Vectors)
    * [Process Dataset](#Process-Dataset)
    * [Classification With Word Vectors Trained With Model](#Classification-With-Word-Vectors-Trained-With-Model)
0. Complete one of the two [Exercises](#Exercises). Remember to keep notes about what you do!

## Extra Details -- Do Not Do This
This took a while, which is why I'm giving you the trained file rather than having you do this. But just in case you're curious, here is how to create the trained model file.
1. Download the corpus of sentences from [http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz](http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz)
1. Unzip and unarchive the file: `tar zxf 1-billion-word-language-modeling-benchmark-r13output.tar.gz` 
1. Run the following Python code:
    ```
    from gensim.models import word2vec

    corpus_dir = '1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled'
    sentences = word2vec.PathLineSentences(corpus_dir)
    model = word2vec.Word2Vec(sentences) # just use all of the default settings for now
    model.save('1billion_word_vectors')
    ```

## Documentation/Sources
* [https://radimrehurek.com/gensim/models/word2vec.html](https://radimrehurek.com/gensim/models/word2vec.html) for more information about how to use gensim word2vec in general
* [https://codekansas.github.io/blog/2016/gensim.html](https://codekansas.github.io/blog/2016/gensim.html) for information about using it to create embedding layers for neural networks (_this blog post has since been removed_).
* [https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/](https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/) for information on sequence classification with keras
* [https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html) for using pre-trained embeddings with keras (though the syntax they use for the model layers differs from most other tutorials).
* [https://keras.io/](https://keras.io/) Keras API documentation

## Load Trained Word Vectors

In [1]:
from gensim.models import word2vec

Load the trained model file into memory

In [2]:
wv_model = word2vec.Word2Vec.load('1billion_word_vectors')

Since we do not need to continue training the model, we can save memory by keeping the parts we need (the word vectors themselves) and getting rid of the rest of the model.

In [3]:
wordvec = wv_model.wv
del wv_model

## Explore Word Vectors
Now we can look at some of the relationships between different words.

Like [the gensim documentation](https://radimrehurek.com/gensim/models/word2vec.html), let's start with a famous example: king + woman - man

In [4]:
wordvec.most_similar(positive=['king', 'woman'], negative=['man'])

[('queen', 0.8407386541366577),
 ('monarch', 0.7541723847389221),
 ('prince', 0.7350203394889832),
 ('princess', 0.696908175945282),
 ('empress', 0.6771803498268127),
 ('sultan', 0.6649758815765381),
 ('Chakri', 0.6451102495193481),
 ('goddess', 0.6439394950866699),
 ('ruler', 0.6275452971458435),
 ('kings', 0.6273427605628967)]
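
To make the arithmetic concrete, here is a minimal sketch of roughly what `most_similar` does: normalize each vector to unit length, add the positive ones, subtract the negative ones, and rank the remaining words by cosine similarity to the result. The 2-dimensional vectors are invented for illustration and are not taken from the trained model, and `analogy` is a hypothetical helper, not a gensim function.

```python
import numpy as np

# Toy vectors, made up for this example: dimension 0 ~ "royalty", dimension 1 ~ "gender".
toy_vectors = {
    'king':  np.array([0.9, 0.8]),
    'queen': np.array([0.9, -0.8]),
    'man':   np.array([0.1, 0.9]),
    'woman': np.array([0.1, -0.9]),
    'apple': np.array([-0.8, 0.1]),
}

def unit(v):
    """Scale a vector to length 1 so that dot products become cosine similarities."""
    return v / np.linalg.norm(v)

def analogy(vectors, positive, negative):
    """Rank every word not used in the query by cosine similarity to the combined vector."""
    target = sum(unit(vectors[w]) for w in positive) - sum(unit(vectors[w]) for w in negative)
    target = unit(target)
    scores = {
        word: float(unit(vec) @ target)
        for word, vec in vectors.items()
        if word not in positive + negative  # query words are excluded from the results
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranked = analogy(toy_vectors, positive=['king', 'woman'], negative=['man'])
print(ranked[0][0])  # queen
```

With real 100-dimensional vectors the directions are far less tidy, which is why the similarity scores above top out around 0.84 rather than 1.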

This next one does not work as well as I'd hoped, but it gets close. Maybe you can find a better example.

In [5]:
wordvec.most_similar(positive=['panda', 'eucalyptus'], negative=['bamboo'])

[('okapi', 0.7140713334083557),
 ('gibbon', 0.7034620046615601),
 ('koala', 0.697202742099762),
 ('cub', 0.690765917301178),
 ('tortoise', 0.6886162757873535),
 ('beetle', 0.6859476566314697),
 ('salamander', 0.6855185627937317),
 ('psyllid', 0.6837549805641174),
 ('lynx', 0.6802859902381897),
 ('carnivore', 0.6794543266296387)]

Which one of these is not like the others?

Note: It looks like the gensim code needs to be updated to meet the requirements of later versions of numpy. You can ignore the warning.

In [6]:
wordvec.doesnt_match(['red', 'purple', 'laptop', 'turquoise', 'ruby'])

'laptop'
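
As a rough sketch of the idea behind `doesnt_match`: average the unit-length vectors of the words, then report the word least similar to that average. The toy vectors and the helper name `odd_one_out` are assumptions made for illustration; they are not part of gensim or of the trained model.

```python
import numpy as np

# Invented 2-D vectors: the three colours point one way, 'laptop' points another.
toy_vectors = {
    'red':       np.array([1.0, 0.1]),
    'purple':    np.array([0.9, 0.2]),
    'turquoise': np.array([0.8, 0.3]),
    'laptop':    np.array([0.1, 1.0]),
}

def odd_one_out(vectors, words):
    """Return the word whose unit vector is least similar to the normalized mean."""
    normed = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    centre = normed.mean(axis=0)
    centre = centre / np.linalg.norm(centre)
    similarities = normed @ centre  # cosine similarity of each word to the centre
    return words[int(np.argmin(similarities))]

print(odd_one_out(toy_vectors, ['red', 'purple', 'laptop', 'turquoise']))  # laptop
```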

How far apart are different words?

In [7]:
wordvec.distances('laptop', ['computer', 'phone', 'rabbit'])

array([0.205414  , 0.36557418, 0.6597437 ], dtype=float32)
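
The numbers returned by `distances` are cosine distances, that is, 1 minus the cosine similarity, so 0 means "same direction" and larger values mean "further apart". A small sketch with made-up vectors (the helper `cosine_distance` is hypothetical, written here only to show the formula):

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cos(angle between a and b); ignores vector length."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine_distance(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
orthogonal = cosine_distance(np.array([1.0, 0.0]), np.array([0.0, 3.0]))
opposite = cosine_distance(np.array([1.0, 0.0]), np.array([-2.0, 0.0]))

print(round(same_direction, 6), orthogonal, opposite)
```

Read this way, 'computer' (about 0.21) points in nearly the same direction as 'laptop', while 'rabbit' (about 0.66) does not.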

Let's see what one of these vectors actually looks like.

In [8]:
wordvec['textbook']

array([ 0.50756323, -2.8890731 ,  0.9743826 , -0.60089743, -0.23762947,
       -2.324566  , -0.64634913, -0.66476715, -2.3432739 ,  1.4446437 ,
       -0.15542823,  1.8248576 ,  1.1309539 , -0.21071543, -0.82512087,
       -0.2773584 , -0.1973424 , -0.5337731 ,  2.1143918 ,  1.0673765 ,
       -0.2341243 ,  1.5292411 ,  0.66977274,  1.1214821 , -0.57710004,
       -0.02504024,  0.6074397 ,  0.19416903, -1.1265849 , -0.6618393 ,
        1.7525213 ,  1.6232891 , -0.3886833 , -1.1867149 ,  0.45511633,
        1.4240934 , -0.87929034, -1.8920534 ,  2.6986032 , -0.5277589 ,
        2.1202435 ,  0.62670445,  1.0352231 ,  1.4998924 ,  2.5809426 ,
        0.74698585, -0.07757699, -0.67074645,  1.6887746 , -0.22081567,
        1.2107906 ,  0.16741815,  3.3496742 ,  1.1832954 ,  0.4423463 ,
        0.04771314, -0.14557275, -1.3345221 ,  1.3236852 ,  2.0154989 ,
       -0.6510446 ,  0.21808812, -0.31578887, -1.822629  ,  0.8436349 ,
       -1.1500564 ,  1.24044   , -2.6430037 ,  1.0617311 ,  1.20

What other methods are available to us?

In [9]:
help(wordvec)

Help on KeyedVectors in module gensim.models.keyedvectors object:

class KeyedVectors(gensim.utils.SaveLoad)
 |  KeyedVectors(vector_size, count=0, dtype=<class 'numpy.float32'>, mapfile_path=None)
 |  
 |  Serialize/deserialize objects from disk, by equipping them with the `save()` / `load()` methods.
 |  
 |  --------
 |  This uses pickle internally (among other techniques), so objects must not contain unpicklable attributes
 |  such as lambda functions etc.
 |  
 |  Method resolution order:
 |      KeyedVectors
 |      gensim.utils.SaveLoad
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __contains__(self, key)
 |  
 |  __getitem__(self, key_or_keys)
 |      Get vector representation of `key_or_keys`.
 |      
 |      Parameters
 |      ----------
 |      key_or_keys : {str, list of str, int, list of int}
 |          Requested key or list-of-keys.
 |      
 |      Returns
 |      -------
 |      numpy.ndarray
 |          Vector representation for `key_or_keys` (1D if

# Exercise: Explore Word Vectors

## Optional
What other interesting relationships can you find, using the methods used in the examples above or anything you find in the help message?

In [10]:
wordvec.most_similar(positive=['moscow', 'japan'], negative=['russia'])

[('ramallah', 0.6777190566062927),
 ('24-31', 0.6596744656562805),
 ('manama', 0.6435973644256592),
 ('cairo', 0.64258873462677),
 ('08th', 0.6417117714881897),
 ('tokyo', 0.637772262096405),
 ('riyadh', 0.6341758966445923),
 ('islamabad', 0.630756676197052),
 ('algiers', 0.6278460025787354),
 ('17-22', 0.6210923790931702)]

In [11]:
wordvec.most_similar(positive=['tokyo', 'russia'], negative=['japan'])

[('doha', 0.6637120246887207),
 ('kuwait', 0.6630067825317383),
 ('december', 0.6405400633811951),
 ('jeddah', 0.6365143060684204),
 ('Pardoned', 0.6338832974433899),
 ('HASSAN', 0.6331698894500732),
 ('tehran', 0.6282866597175598),
 ('october', 0.6272332668304443),
 ('Postpones', 0.6267986297607422),
 ('Pm', 0.6261789798736572)]

In [12]:
wordvec.most_similar(positive=['russia', 'capital'])

[('Tblisi', 0.5997629165649414),
 ('Tajikistan', 0.595119833946228),
 ('japan', 0.5884883403778076),
 ('Kuwait', 0.5838582515716553),
 ('Baku', 0.5808634161949158),
 ('petrodollars', 0.5748108625411987),
 ('Baltics', 0.5739815831184387),
 ('Ceyhan', 0.5721654295921326),
 ('india', 0.571567177772522),
 ('germany', 0.5645203590393066)]

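??? That one doesn't seem to work. One possible reason: the vocabulary is case-sensitive (the results mix `japan` with `Tajikistan`), so the lowercase country names may be rare in the training text.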

In [107]:
wordvec.most_similar(positive=['ran', 'swim'], negative=['run'])

[('swam', 0.7419927716255188),
 ('paddled', 0.6203966736793518),
 ('swims', 0.6203612089157104),
 ('went', 0.6167516112327576),
 ('frolicked', 0.6015949845314026),
 ('wandered', 0.5925613641738892),
 ('flew', 0.5924727916717529),
 ('paddling', 0.5871047973632812),
 ('walked', 0.5834778547286987),
 ('waddle', 0.5737559199333191)]

In [114]:
wordvec.most_similar(positive=['dove', 'bleed'], negative=['dive'])

[('bled', 0.6704407334327698),
 ('bleeds', 0.6500250101089478),
 ('rips', 0.5687224864959717),
 ('wiggling', 0.5533760786056519),
 ('stomped', 0.5458374619483948),
 ('thins', 0.5449791550636292),
 ('squish', 0.54433673620224),
 ('splayed', 0.5438013672828674),
 ('chapped', 0.5423763394355774),
 ('quivering', 0.5410894155502319)]

Unsurprisingly, the tense analogy works with irregular verbs too: `swam` and `bled` come out on top.

In [13]:
wordvec.doesnt_match(['calculus', 'algebra', 'geometry', 'trigonometry', 'math'])

'calculus'

Interesting: the model singles out 'calculus' rather than the more general 'math'.

In [14]:
wordvec.most_similar(positive=['america', 'president', 'recession'], negative=[])

[('economy', 0.5504289865493774),
 ('downturn', 0.5400635004043579),
 ('economic', 0.5373403429985046),
 ('crisis', 0.533382773399353),
 ('president-elect', 0.5192151665687561),
 ('fiscal', 0.510170042514801),
 ('reform', 0.5052710771560669),
 ('meltdown', 0.5023159980773926),
 ('globalization', 0.5014582872390747),
 ('chancellor', 0.5008257627487183)]

I think for the analogy to work, you need the things that you're subtracting and adding to be actual attributes of the subject, not just related terms.

In [15]:
wordvec.most_similar(positive=['prince', 'woman'], negative=['man'])

[('monarch', 0.75556480884552),
 ('princess', 0.7457445859909058),
 ('countess', 0.7068136930465698),
 ('sultan', 0.7052081227302551),
 ('duchess', 0.6884149312973022),
 ('royal', 0.6816162467002869),
 ('bridegroom', 0.676705002784729),
 ('queen', 0.6749706864356995),
 ('bride', 0.6683909296989441),
 ('king', 0.6559572815895081)]

In [16]:
wordvec.most_similar(positive=['doctor', 'woman'], negative=['man'])

[('dentist', 0.8244913220405579),
 ('nurse', 0.8198768496513367),
 ('pharmacist', 0.7904511094093323),
 ('pediatrician', 0.7857694029808044),
 ('gynecologist', 0.7747914791107178),
 ('physician', 0.7724529504776001),
 ('midwife', 0.7595626711845398),
 ('chiropractor', 0.7300244569778442),
 ('neurologist', 0.7300089001655579),
 ('dermatologist', 0.7267611026763916)]

In [17]:
wordvec.most_similar(positive=['doctor', 'man'], negative=['woman'])

[('dentist', 0.7486180663108826),
 ('surgeon', 0.7301046252250671),
 ('psychiatrist', 0.7274444699287415),
 ('neurologist', 0.7194344401359558),
 ('physician', 0.7150372266769409),
 ('pharmacist', 0.7138777375221252),
 ('cardiologist', 0.7023807764053345),
 ('chiropractor', 0.6808537840843201),
 ('colleague', 0.6757304072380066),
 ('radiologist', 0.6755921244621277)]

In [18]:
wordvec.most_similar(positive=['paramedic', 'woman'], negative=['man'])

[('nurse', 0.8294133543968201),
 ('midwife', 0.7920540571212769),
 ('caseworker', 0.7147266864776611),
 ('pharmacist', 0.7122661471366882),
 ('dispatcher', 0.6932677626609802),
 ('firefighter', 0.6930925250053406),
 ('chiropractor', 0.6906752586364746),
 ('receptionist', 0.6809461116790771),
 ('medic', 0.6792126893997192),
 ('dentist', 0.6783925294876099)]

# HMMM
Looks like word2vec is uncovering some societal associations, not just linguistic ones. Let's dig into this more.

In [19]:
wordvec.most_similar(positive=['manager', 'woman'], negative=['man'])

[('administrator', 0.6977607607841492),
 ('director', 0.6935369968414307),
 ('manger', 0.6855568885803223),
 ('supervisor', 0.6438586711883545),
 ('treasurer', 0.6358998417854309),
 ('Manager', 0.6231703758239746),
 ('vice-president', 0.6213657259941101),
 ('co-owner', 0.6199526190757751),
 ('assistant', 0.6160887479782104),
 ('owner', 0.611164391040802)]

In [20]:
wordvec.most_similar(positive=['chairman', 'woman'], negative=['man'])

[('chairwoman', 0.8845049738883972),
 ('chairperson', 0.8089302182197571),
 ('vice-chairman', 0.7859854102134705),
 ('co-chairman', 0.7798680663108826),
 ('co-chair', 0.7480093240737915),
 ('vice-chair', 0.7432311177253723),
 ('vice-president', 0.7374943494796753),
 ('treasurer', 0.7311645150184631),
 ('Chairman', 0.724380373954773),
 ('director', 0.7126836776733398)]

In [21]:
wordvec.most_similar(positive=['criminal', 'arab'], negative=[])

[('islamic', 0.7542759776115417),
 ('terrorism', 0.6793594360351562),
 ('antidemocratic', 0.6656423211097717),
 ('terrorist', 0.6611432433128357),
 ('infidel', 0.6598539352416992),
 ('muslim', 0.6579608917236328),
 ('apostate', 0.6509077548980713),
 ('totalitarian', 0.6498880982398987),
 ('transnational', 0.6441349983215332),
 ('anti-revolutionary', 0.6280985474586487)]

In [22]:
wordvec.most_similar(positive=['slaves', 'white'], negative=['black'])

[('peasants', 0.7252991795539856),
 ('labourers', 0.7162727117538452),
 ('sharecroppers', 0.6973912715911865),
 ('slave', 0.6827452778816223),
 ('serfs', 0.6733953356742859),
 ('colonists', 0.6698740124702454),
 ('nomads', 0.6671000123023987),
 ('settlers', 0.6589178442955017),
 ('squatters', 0.6425473093986511),
 ('weavers', 0.6410399675369263)]

And it picks up historical analogies too.

In [23]:
wordvec.distances('instrument', ['piano', 'guitar', 'flute'])

array([0.45758528, 0.44694537, 0.42413288], dtype=float32)

## Use Word Vectors in an Embedding Layer of a Keras Model

In [24]:
from keras.models import Sequential
from keras.layers import Embedding  # needed for the Embedding(...) call below
import numpy

You may have noticed in the help text for wordvec that it has a built-in method for converting into a Keras embedding layer.

Since we'll only be giving the embedding layer one word at a time in this experiment, we can set the input length to 1.

In [25]:
test_embedding_layer = wordvec.get_keras_embedding()
test_embedding_layer.input_length = 1

AttributeError: 'KeyedVectors' object has no attribute 'get_keras_embedding'

## After some digging, it looks like `get_keras_embedding()` was removed in Gensim 4.0.
[This page](https://github.com/RaRe-Technologies/gensim/wiki/Using-Gensim-Embeddings-with-Keras-and-Tensorflow) describes the replacement approach:

In [29]:
keyed_vectors = wordvec  # structure holding the result of training
weights = keyed_vectors.vectors  # vectors themselves, a 2D numpy array    
index_to_key = keyed_vectors.index_to_key  # which row in `weights` corresponds to which word?
train_embeddings = False

test_embedding_layer = Embedding(
    input_dim=weights.shape[0],
    output_dim=weights.shape[1],
    weights=[weights],
    trainable=train_embeddings,
)

In [30]:
test_embedding_layer.input_length = 1

In [31]:
embedding_model = Sequential()
embedding_model.add(test_embedding_layer)

But how do we actually use this? If you look at the [Keras Embedding Layer documentation](https://keras.io/layers/embeddings/) you might notice that it takes numerical input, not strings. How do we know which number corresponds to a particular word? In addition to having a vector, each word has an index:

In [32]:
wordvec.vocab['python'].index

AttributeError: The vocab attribute was removed from KeyedVector in Gensim 4.0.0.
Use KeyedVector's .key_to_index dict, .index_to_key list, and methods .get_vecattr(key, attr) and .set_vecattr(key, attr, new_val) instead.
See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4

In [33]:
wordvec.key_to_index['python']

30438

As desired
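
That index is all an embedding layer needs: conceptually, a frozen embedding layer just returns row `index` of its weight matrix. The sketch below mimics that lookup with plain NumPy indexing; the three-word vocabulary and the weights are invented stand-ins for `wordvec.index_to_key` and `wordvec.vectors`.

```python
import numpy as np

# Hypothetical miniature vocabulary and weight matrix (one row per word).
toy_index_to_key = ['the', 'python', 'snake']
toy_key_to_index = {word: i for i, word in enumerate(toy_index_to_key)}
toy_weights = np.array([[0.0, 0.1],
                        [1.0, 1.1],
                        [2.0, 2.1]], dtype=np.float32)

def embed(indices):
    """Look up one row of the weight matrix per integer in `indices`."""
    return toy_weights[np.asarray(indices)]

batch = np.array([[toy_key_to_index['python']]])  # shape (1, 1): one review, one word
result = embed(batch)
print(result.shape)  # (1, 1, 2): batch, sequence length, vector size
print(result[0][0])  # the row stored for 'python'
```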

Let's see if we get the same vector from the embedding layer as we get from our word vector object.

In [34]:
wordvec['python']

array([-1.1750487e+00,  2.3066440e-04, -6.0706180e-01, -1.1156354e+00,
       -1.0580894e+00, -2.7154784e+00, -3.6140988e+00, -1.0810910e+00,
        1.1234255e+00, -7.7326834e-01, -1.3322397e+00,  9.2905626e-02,
       -2.4488842e+00, -1.7817341e-01, -3.5459950e+00, -1.7320968e+00,
        1.9397168e+00, -6.3734710e-01,  2.3254216e+00, -1.3535864e+00,
       -1.4451812e-01, -2.4297442e+00,  1.5498929e+00,  8.1969726e-01,
        9.0982294e-01, -6.6116208e-01,  3.8905215e-01,  3.3855909e-01,
       -7.5454485e-01, -1.0352553e+00, -2.5936973e+00,  1.2103225e+00,
       -3.0236175e+00,  3.0580134e+00, -3.9140179e+00,  4.0223894e-01,
        1.7356061e+00,  9.0976155e-01,  2.0956397e-02,  2.0190549e+00,
        4.5332021e-01, -1.6634842e+00, -4.8180079e-01,  2.0414692e-01,
       -5.9267312e-01, -1.4182589e+00, -9.7301149e-01,  5.1611459e-01,
        2.0727324e+00,  2.0064230e+00, -7.5027935e-02, -1.1723986e+00,
       -8.6943096e-01,  1.7028141e+00,  2.2190344e+00,  9.3605727e-01,
      

In [35]:
embedding_model.predict(numpy.array([[30438]]))

array([[[-1.1750487e+00,  2.3066440e-04, -6.0706180e-01, -1.1156354e+00,
         -1.0580894e+00, -2.7154784e+00, -3.6140988e+00, -1.0810910e+00,
          1.1234255e+00, -7.7326834e-01, -1.3322397e+00,  9.2905626e-02,
         -2.4488842e+00, -1.7817341e-01, -3.5459950e+00, -1.7320968e+00,
          1.9397168e+00, -6.3734710e-01,  2.3254216e+00, -1.3535864e+00,
         -1.4451812e-01, -2.4297442e+00,  1.5498929e+00,  8.1969726e-01,
          9.0982294e-01, -6.6116208e-01,  3.8905215e-01,  3.3855909e-01,
         -7.5454485e-01, -1.0352553e+00, -2.5936973e+00,  1.2103225e+00,
         -3.0236175e+00,  3.0580134e+00, -3.9140179e+00,  4.0223894e-01,
          1.7356061e+00,  9.0976155e-01,  2.0956397e-02,  2.0190549e+00,
          4.5332021e-01, -1.6634842e+00, -4.8180079e-01,  2.0414692e-01,
         -5.9267312e-01, -1.4182589e+00, -9.7301149e-01,  5.1611459e-01,
          2.0727324e+00,  2.0064230e+00, -7.5027935e-02, -1.1723986e+00,
         -8.6943096e-01,  1.7028141e+00,  2.2190344

Looks good, right? But let's not waste our time when the computer could tell us definitively and quickly:

In [36]:
embedding_model.predict(numpy.array([[wordvec.key_to_index['python']]]))[0][0] == wordvec['python']

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True])
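
An element-by-element array of `True` values is correct but hard to scan. As a small aside, NumPy can collapse the comparison into a single boolean: `numpy.array_equal` for exact equality, or `numpy.allclose` when tiny floating-point differences should be tolerated. The arrays below are made-up stand-ins for the layer output and the word vector.

```python
import numpy as np

expected = np.array([0.5, -1.25, 3.0])   # stand-in for wordvec['python']
from_layer = expected.copy()             # stand-in for the embedding layer's output
slightly_off = expected + 1e-9           # simulates harmless rounding noise

print(np.array_equal(from_layer, expected))    # True: every element matches exactly
print(np.array_equal(slightly_off, expected))  # False: exact comparison is strict
print(np.allclose(slightly_off, expected))     # True: equal within the default tolerance
```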

Now we have a way to turn words into word vectors with Keras layers. Yes! Time to get some data.

## IMDB Dataset
The [IMDB dataset](https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification) consists of movie reviews that have been marked as positive or negative. (There is also a built-in dataset of [Reuters newswires](https://keras.io/datasets/#reuters-newswire-topics-classification) that have been classified by topic.)

In [37]:
from keras.datasets import imdb
(x_train, y_train), (x_test, y_test) = imdb.load_data()

It looks like our labels consist of 0 or 1, which makes sense for positive and negative.

In [38]:
print(y_train[0:9])
print(max(y_train))
print(min(y_train))

[1 0 0 1 0 0 1 0 1]
1
0


But x is a bit more trouble. The words have already been converted to numbers -- numbers that have nothing to do with the word embeddings we spent time learning!

In [39]:
x_train[0]

[1,
 14,
 22,
 16,
 43,
 530,
 973,
 1622,
 1385,
 65,
 458,
 4468,
 66,
 3941,
 4,
 173,
 36,
 256,
 5,
 25,
 100,
 43,
 838,
 112,
 50,
 670,
 22665,
 9,
 35,
 480,
 284,
 5,
 150,
 4,
 172,
 112,
 167,
 21631,
 336,
 385,
 39,
 4,
 172,
 4536,
 1111,
 17,
 546,
 38,
 13,
 447,
 4,
 192,
 50,
 16,
 6,
 147,
 2025,
 19,
 14,
 22,
 4,
 1920,
 4613,
 469,
 4,
 22,
 71,
 87,
 12,
 16,
 43,
 530,
 38,
 76,
 15,
 13,
 1247,
 4,
 22,
 17,
 515,
 17,
 12,
 16,
 626,
 18,
 19193,
 5,
 62,
 386,
 12,
 8,
 316,
 8,
 106,
 5,
 4,
 2223,
 5244,
 16,
 480,
 66,
 3785,
 33,
 4,
 130,
 12,
 16,
 38,
 619,
 5,
 25,
 124,
 51,
 36,
 135,
 48,
 25,
 1415,
 33,
 6,
 22,
 12,
 215,
 28,
 77,
 52,
 5,
 14,
 407,
 16,
 82,
 10311,
 8,
 4,
 107,
 117,
 5952,
 15,
 256,
 4,
 31050,
 7,
 3766,
 5,
 723,
 36,
 71,
 43,
 530,
 476,
 26,
 400,
 317,
 46,
 7,
 4,
 12118,
 1029,
 13,
 104,
 88,
 4,
 381,
 15,
 297,
 98,
 32,
 2071,
 56,
 26,
 141,
 6,
 194,
 7486,
 18,
 4,
 226,
 22,
 21,
 134,
 476,
 26,
 480,
 5

Looking at the help page for imdb, it appears there is a way to get the word back. Phew.

In [40]:
help(imdb)

Help on module keras.datasets.imdb in keras.datasets:

NAME
    keras.datasets.imdb - IMDB sentiment classification dataset.

FILE
    /Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/keras/datasets/imdb.py




In [41]:
imdb_offset = 3
imdb_map = dict((index + imdb_offset, word) for (word, index) in imdb.get_word_index().items())
imdb_map[0] = 'PADDING'
imdb_map[1] = 'START'
imdb_map[2] = 'UNKNOWN'
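
The effect of the offset in the cell above is easiest to see on a toy vocabulary. In this sketch, `toy_word_index` is an invented stand-in for `imdb.get_word_index()` (word to frequency rank, starting at 1), and `decode` is a hypothetical helper; the reserved indices 0, 1, and 2 follow the same convention as `imdb_map`.

```python
# Invented stand-in for imdb.get_word_index(): word -> frequency rank, starting at 1.
toy_word_index = {'the': 1, 'film': 2, 'brilliant': 3}

OFFSET = 3  # indices 0, 1, 2 are reserved, so every real word is shifted up by 3

toy_map = {rank + OFFSET: word for word, rank in toy_word_index.items()}
toy_map.update({0: 'PADDING', 1: 'START', 2: 'UNKNOWN'})

def decode(review):
    """Turn a list of dataset indices back into a space-separated string."""
    return ' '.join(toy_map.get(index, 'UNKNOWN') for index in review)

encoded = [1, 4, 5, 6]
print(decode(encoded))  # START the film brilliant
```

Without the shift, index 1 would decode to 'the' instead of the start marker, and every other word would be off by three places.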

The knowledge about the initial indices and offset came from [this Stack Overflow post](https://stackoverflow.com/questions/42821330/restore-original-text-from-keras-s-imdb-dataset) after I got gibberish when I tried to translate the first review, below. It looks coherent now!

In [42]:
' '.join([imdb_map[word_index] for word_index in x_train[0]])

"START this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert redford's is an amazing actor and now the same being director norman's father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for retail and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also congratulations to the two little boy's that played the part's of norman and paul they were just brilliant children are often left out of the praising list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and shou

## Train IMDB Word Vectors
The word vectors from the 1 billion words dataset might work for us when trying to classify the IMDB data. Word vectors trained on the IMDB data itself might work better, though.

In [43]:
train_sentences = [['PADDING'] + [imdb_map[word_index] for word_index in review] for review in x_train]
test_sentences = [['PADDING'] + [imdb_map[word_index] for word_index in review] for review in x_test]

In [44]:
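# min_count=1 puts any word that appears at least once into the vocabulary
# vector_size sets the dimension of the output vectors
# the extra sentence must itself be a list of tokens; a bare 'UNKNOWN' string would be split into single characters
imdb_wv_model = word2vec.Word2Vec(train_sentences + test_sentences + [['UNKNOWN']], min_count=1, vector_size=100)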

In [45]:
imdb_wordvec = imdb_wv_model.wv
del imdb_wv_model

## Process Dataset
For this exercise, we're going to keep all inputs the same length (we'll see how to do variable-length later). This means we need to choose a maximum length for the review, cutting off longer ones and adding padding to shorter ones. What should we make the length? Let's understand our data.

In [46]:
lengths = [len(review) for review in x_train + x_test]
print('Longest review: {} Shortest review: {}'.format(max(lengths), min(lengths)))


Longest review: 2697 Shortest review: 70


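2697 words! Wow. One caveat: `x_train` and `x_test` appear to be NumPy object arrays, so `x_train + x_test` joins the i-th training review to the i-th test review instead of appending one dataset to the other. That is why only 25000 lengths are counted, and these figures describe train/test pairs rather than individual reviews. Well, let's see how many reviews would get cut off at a particular cutoff.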

In [47]:
cutoff = 500
print('{} reviews out of {} are over {}.'.format(
    sum([1 for length in lengths if length > cutoff]), 
    len(lengths), 
    cutoff))

8485 reviews out of 25000 are over 500.
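
The count of 25000 (rather than 50000) is a hint about how `+` behaves on NumPy arrays of lists. The sketch below uses tiny invented arrays to show the difference between elementwise `+` and plain list concatenation; it does not recompute the statistics for the real dataset.

```python
import numpy as np

# Build object arrays that hold Python lists, similar in kind to the IMDB review arrays.
toy_train = np.empty(2, dtype=object)
toy_train[0], toy_train[1] = [1, 14, 22], [1, 5]
toy_test = np.empty(2, dtype=object)
toy_test[0], toy_test[1] = [1, 7], [1, 9, 9, 9]

paired = toy_train + toy_test                # elementwise: review i of train + review i of test
combined = list(toy_train) + list(toy_test)  # plain concatenation: all four reviews

print([len(review) for review in paired])    # [5, 6]
print([len(review) for review in combined])  # [3, 2, 2, 4]
```

Assuming the real arrays behave the same way, `list(x_train) + list(x_test)` would give one length per individual review.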


In [48]:
from keras.preprocessing import sequence
x_train_padded = sequence.pad_sequences(x_train, maxlen=cutoff)
x_test_padded = sequence.pad_sequences(x_test, maxlen=cutoff)
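
With its default settings, `pad_sequences` pads at the front with zeros and, for reviews that are too long, keeps only the last `maxlen` entries. The function below is a hand-written approximation of that default behavior for illustration (the name `pad_pre` is made up); it is not the Keras implementation.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep the LAST `maxlen` items of long ones."""
    out = np.full((len(sequences), maxlen), value, dtype='int32')
    for row, seq in enumerate(sequences):
        kept = list(seq)[-maxlen:]
        if kept:  # guard: out[row, -0:] would address the whole row
            out[row, -len(kept):] = kept
    return out

padded = pad_pre([[1, 2, 3], [4, 5, 6, 7, 8, 9]], maxlen=4)
print(padded)
# [[0 1 2 3]
#  [6 7 8 9]]
```

So with `cutoff = 500`, a longer review loses its beginning, not its end.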

## Classification With Word Vectors Trained With Model

In [49]:
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, Dense, Flatten

Model definition. The embedding layer here learns its 100-dimensional word vectors as part of training the overall classification model. That is usually what we want, unless we have a lot of unlabeled data that could be used to train word vectors but not a classification model.

In [50]:
not_pretrained_model = Sequential()
not_pretrained_model.add(Embedding(input_dim=len(imdb_map), output_dim=100, input_length=cutoff))
not_pretrained_model.add(Conv1D(filters=32, kernel_size=5, activation='relu'))
not_pretrained_model.add(Conv1D(filters=32, kernel_size=5, activation='relu'))
not_pretrained_model.add(Flatten())
not_pretrained_model.add(Dense(units=128, activation='relu'))
not_pretrained_model.add(Dense(units=1, activation='sigmoid')) # because at the end, we want one yes/no answer
not_pretrained_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['binary_accuracy'])

Train the model. __This takes a while. You might not want to re-run it.__

In [51]:
not_pretrained_model.fit(x_train_padded, y_train, epochs=1, batch_size=64)



<tensorflow.python.keras.callbacks.History at 0x127563730>

Assess the model. __This takes a while. You might not want to re-run it.__

In [52]:
not_pretrained_scores = not_pretrained_model.evaluate(x_test_padded, y_test)
print('loss: {} accuracy: {}'.format(*not_pretrained_scores))

loss: 0.27524182200431824 accuracy: 0.8870800137519836


# Exercises

## These exercises will help you learn more about how to use word vectors in a model and how to translate between data representations.

## For any model that you try in these exercises, take notes about the performance you see and anything you notice about the differences between the models.

## Exercise Option #1 - Advanced Difficulty
Using the details above about how the imdb dataset and the keras embedding layer represent words, define a model that uses the pre-trained word vectors from the imdb dataset rather than an embedding that keras learns as it goes along. You'll need to replace the embedding layer and feed in different training data.

In [53]:
imdb_wordvec

<gensim.models.keyedvectors.KeyedVectors at 0x129981850>

In [82]:
keyed_vectors = imdb_wordvec  # structure holding the result of training
weights = keyed_vectors.vectors  # vectors themselves, a 2D numpy array    
index_to_key = keyed_vectors.index_to_key  # which row in `weights` corresponds to which word?
train_embeddings = False

imdb_embedding_layer = Embedding(
    input_dim=weights.shape[0],
    output_dim=weights.shape[1],
    weights=[weights],
    trainable=train_embeddings,
    input_shape=(500,),  # declares the sequence length up front (tip from https://stackoverflow.com/a/45153347)
)

In [83]:
imdb_embedding_layer.input_length = 500

In [84]:
#embedding_model = Sequential()
#embedding_model.add(test_embedding_layer)

In [85]:
imdb_pretrained_model = Sequential()
imdb_pretrained_model.add(imdb_embedding_layer)
imdb_pretrained_model.add(Conv1D(filters=32, kernel_size=5, activation='relu'))
imdb_pretrained_model.add(Conv1D(filters=32, kernel_size=5, activation='relu'))
imdb_pretrained_model.add(Flatten())
imdb_pretrained_model.add(Dense(units=128, activation='relu'))
imdb_pretrained_model.add(Dense(units=1, activation='sigmoid')) # because at the end, we want one yes/no answer
imdb_pretrained_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['binary_accuracy'])

In [94]:
imdb_pretrained_model.fit(x_train_padded, y_train, epochs=1, batch_size=64)  # 4 total epochs



<tensorflow.python.keras.callbacks.History at 0x126681c40>

After the first epoch: 63% accuracy.

In [91]:
imdb_pretrained_scores = imdb_pretrained_model.evaluate(x_test_padded, y_test)  # second epoch
print('loss: {} accuracy: {}'.format(*imdb_pretrained_scores))

loss: 0.5888961553573608 accuracy: 0.6945599913597107


In [93]:
imdb_pretrained_scores = imdb_pretrained_model.evaluate(x_test_padded, y_test)  # third epoch
print('loss: {} accuracy: {}'.format(*imdb_pretrained_scores))

loss: 0.5674989223480225 accuracy: 0.7126399874687195


In [95]:
imdb_pretrained_scores = imdb_pretrained_model.evaluate(x_test_padded, y_test)  # fourth epoch
print('loss: {} accuracy: {}'.format(*imdb_pretrained_scores))

loss: 0.6891921758651733 accuracy: 0.7110400199890137


Overfitting begins to take a toll in the fourth epoch: test loss climbs from 0.567 to 0.689 while test accuracy stays at about 71%.

In [87]:
imdb_pretrained_model.summary()

Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_9 (Embedding)      (None, 500, 100)          8859100   
_________________________________________________________________
conv1d_12 (Conv1D)           (None, 496, 32)           16032     
_________________________________________________________________
conv1d_13 (Conv1D)           (None, 492, 32)           5152      
_________________________________________________________________
flatten_6 (Flatten)          (None, 15744)             0         
_________________________________________________________________
dense_11 (Dense)             (None, 128)               2015360   
_________________________________________________________________
dense_12 (Dense)             (None, 1)                 129       
Total params: 10,895,773
Trainable params: 2,036,673
Non-trainable params: 8,859,100
___________________________________
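
The parameter counts in the summary can be reproduced by hand, which is a useful check that the layers are wired the way we think. The sketch below redoes the arithmetic; the 88591 embedding rows are read off the summary itself (8,859,100 / 100), and the other sizes come from the model definition.

```python
vocab_rows, embed_dim, seq_len = 88591, 100, 500
kernel, filters, dense_units = 5, 32, 128

embedding = vocab_rows * embed_dim               # frozen, so non-trainable
conv1 = kernel * embed_dim * filters + filters   # weights + one bias per filter
len1 = seq_len - kernel + 1                      # no padding: 500 -> 496
conv2 = kernel * filters * filters + filters
len2 = len1 - kernel + 1                         # 496 -> 492
flat = len2 * filters                            # Flatten output size
dense1 = flat * dense_units + dense_units
dense2 = dense_units * 1 + 1

trainable = conv1 + conv2 + dense1 + dense2
print(embedding, conv1, conv2, flat, dense1, dense2)
print(trainable, embedding + trainable)
```

Almost all of the trainable parameters sit in the first dense layer, because it is connected to every one of the 15,744 flattened values.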

In [106]:
x_train_padded

array([[    0,     0,     0, ...,    19,   178,    32],
       [    0,     0,     0, ...,    16,   145,    95],
       [    0,     0,     0, ...,     7,   129,   113],
       ...,
       [    0,     0,     0, ...,     4,  3586, 22459],
       [    0,     0,     0, ...,    12,     9,    23],
       [    0,     0,     0, ...,   204,   131,     9]], dtype=int32)
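
One thing worth checking here: the array above still holds IMDB word indices, whereas row `i` of a gensim weight matrix belongs to `index_to_key[i]`. Nothing guarantees that the two numberings agree, so the frozen embedding may be looking up the wrong rows, which could be part of why it lags behind the learned embedding. The sketch below shows one way to translate reviews from one numbering to the other. `translate_reviews` and the toy dictionaries are hypothetical; in the notebook their roles would be played by `imdb_map` and `imdb_wordvec.key_to_index`.

```python
def translate_reviews(reviews, source_index_to_word, target_key_to_index, fallback_word):
    """Re-express each review in the target vocabulary's row numbers.

    Words missing from either vocabulary are mapped to `fallback_word`.
    """
    fallback = target_key_to_index[fallback_word]
    return [
        [target_key_to_index.get(source_index_to_word.get(index), fallback) for index in review]
        for review in reviews
    ]

# Toy stand-ins (invented values) for imdb_map and imdb_wordvec.key_to_index.
toy_imdb_map = {0: 'PADDING', 1: 'START', 2: 'UNKNOWN', 4: 'the', 5: 'film'}
toy_key_to_index = {'the': 0, 'film': 1, 'PADDING': 2, 'START': 3}

translated = translate_reviews([[0, 0, 1, 4, 5, 77]], toy_imdb_map, toy_key_to_index, 'PADDING')
print(translated)  # [[2, 2, 3, 0, 1, 2]]
```

In the notebook this might look like `numpy.array(translate_reviews(x_train_padded, imdb_map, imdb_wordvec.key_to_index, 'PADDING'))`, fed to `fit` in place of `x_train_padded`. I have not run that, so how much it changes the accuracy is an open question.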

## Exercise Option #2 - Advanced Difficulty
Same as option 1, but try using the word vectors trained on the 1 billion word corpus instead of the IMDB vectors. If you also did option 1, comment on how the performance changes.

In [96]:
keyed_vectors = wordvec  # structure holding the result of training
weights = keyed_vectors.vectors  # vectors themselves, a 2D numpy array    
index_to_key = keyed_vectors.index_to_key  # which row in `weights` corresponds to which word?
train_embeddings = False

bil_embedding_layer = Embedding(
    input_dim=weights.shape[0],
    output_dim=weights.shape[1],
    weights=[weights],
    trainable=train_embeddings,
    input_shape=(500,),
)

In [97]:
bil_embedding_layer.input_length = 500

In [98]:
bil_pretrained_model = Sequential()
bil_pretrained_model.add(bil_embedding_layer)
bil_pretrained_model.add(Conv1D(filters=32, kernel_size=5, activation='relu'))
bil_pretrained_model.add(Conv1D(filters=32, kernel_size=5, activation='relu'))
bil_pretrained_model.add(Flatten())
bil_pretrained_model.add(Dense(units=128, activation='relu'))
bil_pretrained_model.add(Dense(units=1, activation='sigmoid')) # because at the end, we want one yes/no answer
bil_pretrained_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['binary_accuracy'])

In [104]:
bil_pretrained_model.fit(x_train_padded, y_train, epochs=1, batch_size=64)  # 3 total epochs



<tensorflow.python.keras.callbacks.History at 0x12a5b8220>

In [101]:
bil_pretrained_scores = bil_pretrained_model.evaluate(x_test_padded, y_test)  # first round
print('loss: {} accuracy: {}'.format(*bil_pretrained_scores))

loss: 0.6138777136802673 accuracy: 0.6630399823188782


In [103]:
bil_pretrained_scores = bil_pretrained_model.evaluate(x_test_padded, y_test)  # second round
print('loss: {} accuracy: {}'.format(*bil_pretrained_scores))

loss: 0.5469103455543518 accuracy: 0.7221999764442444


In [105]:
bil_pretrained_scores = bil_pretrained_model.evaluate(x_test_padded, y_test)  # third round
print('loss: {} accuracy: {}'.format(*bil_pretrained_scores))

loss: 0.5919256210327148 accuracy: 0.7084800004959106


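This model has a slightly higher peak accuracy than the IMDB-vector model (72.2% versus 71.3% on the test set), but overfitting hits one round earlier: test loss rises and accuracy drops in the third round.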