Use the IMDB dataset packaged with Keras. The reviews are already labelled, and each word has already been encoded as an integer. Newer versions of NumPy default `allow_pickle` to `False`, so we temporarily wrap `np.load` to pass `allow_pickle=True`; otherwise the data fails to load. When loading the IMDB data we pass `num_words=10000` so that we keep only the 10,000 most frequently used words. This discards some information the neural net could have trained on, but it keeps the input vectors at a tractable size.
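To make the effect of `num_words` concrete, here is a minimal sketch of the capping step on a toy sequence. The helper name `cap_vocabulary` is hypothetical, and the sketch assumes the Keras defaults as we understand them: any index at or above `num_words` is replaced by the out-of-vocabulary index 2.

```python
def cap_vocabulary(sequence, num_words=10000, oov_index=2):
    """Replace every word index >= num_words with the out-of-vocabulary index.

    Hypothetical helper for illustration; oov_index=2 is assumed to match
    the Keras default for the IMDB loader.
    """
    return [w if w < num_words else oov_index for w in sequence]


# Toy sequence: 12500 and 10000 fall outside the 10k vocabulary
capped = cap_vocabulary([1, 14, 22, 12500, 43, 10000, 9999])
print(capped)
```

Under that assumption, the many 2s in an encoded review mark the positions of rare words that fell outside the vocabulary.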

In [2]:
import numpy as np
from keras.datasets import imdb

# Force allow_pickle=True, since newer NumPy versions default it to False
old = np.load
np.load = lambda *a,**k: old(*a, allow_pickle=True, **k)

# Keep only 10k most commonly used words, rest discarded, to keep vectors of tractable size
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

print ('Training Data: ', len(train_data), 'Testing Data: ', len(test_data))
print ('Training Data Example: ', train_data[3], ' Label: ', train_labels[3])

np.load = old
del old

Training Data:  25000 Testing Data:  25000
Training Data Example:  [1, 4, 2, 2, 33, 2804, 4, 2040, 432, 111, 153, 103, 4, 1494, 13, 70, 131, 67, 11, 61, 2, 744, 35, 3715, 761, 61, 5766, 452, 9214, 4, 985, 7, 2, 59, 166, 4, 105, 216, 1239, 41, 1797, 9, 15, 7, 35, 744, 2413, 31, 8, 4, 687, 23, 4, 2, 7339, 6, 3693, 42, 38, 39, 121, 59, 456, 10, 10, 7, 265, 12, 575, 111, 153, 159, 59, 16, 1447, 21, 25, 586, 482, 39, 4, 96, 59, 716, 12, 4, 172, 65, 9, 579, 11, 6004, 4, 1615, 5, 2, 7, 5168, 17, 13, 7064, 12, 19, 6, 464, 31, 314, 11, 2, 6, 719, 605, 11, 8, 202, 27, 310, 4, 3772, 3501, 8, 2722, 58, 10, 10, 537, 2116, 180, 40, 14, 413, 173, 7, 263, 112, 37, 152, 377, 4, 537, 263, 846, 579, 178, 54, 75, 71, 476, 36, 413, 263, 2504, 182, 5, 17, 75, 2306, 922, 36, 279, 131, 2895, 17, 2867, 42, 17, 35, 921, 2, 192, 5, 1219, 3890, 19, 2, 217, 4122, 1710, 537, 2, 1236, 5, 736, 10, 10, 61, 403, 9, 2, 40, 61, 4494, 5, 27, 4494, 159, 90, 263, 2311, 4319, 309, 8, 178, 5, 82, 4319, 4, 65, 15, 9225, 145, 1
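The patch-and-restore pattern used above leaves `np.load` patched if loading raises an exception. One possible variation, sketched here, wraps the patch in a context manager so the original function is always restored. The name `allow_pickle_load` is hypothetical, and the demo uses an in-memory buffer rather than the IMDB files.

```python
import contextlib
import io

import numpy as np


@contextlib.contextmanager
def allow_pickle_load():
    """Temporarily make np.load default to allow_pickle=True (hypothetical helper)."""
    original = np.load

    def patched(*args, **kwargs):
        # Only set the flag if the caller did not pass it explicitly
        kwargs.setdefault("allow_pickle", True)
        return original(*args, **kwargs)

    np.load = patched
    try:
        yield
    finally:
        # Restore the original function even if loading fails
        np.load = original


# Demo: an object array needs pickling, so a plain np.load would reject it
original_load = np.load
buf = io.BytesIO()
np.save(buf, np.array([[1, 2], [3]], dtype=object))
buf.seek(0)

with allow_pickle_load():
    loaded = np.load(buf)

print(loaded[0], np.load is original_load)
```

Inside the `with` block, a call such as `imdb.load_data(num_words=10000)` would take the place of the demo `np.load` call.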

Create a reverse dictionary lookup (integer to word) to see what a review originally looked like and get a sense of the data. Note that the cell below decodes the first training review, `train_data[0]`.

In [4]:
# Decode an encoded review: invert the word -> integer index
word_index = imdb.get_word_index()
reverse_word_index = dict(
  [(value, key) for (key, value) in word_index.items()]
)
# Shift the index by 3 because indices 0, 1 and 2 are reserved for padding, start of sequence and unknown
decoded_review = ' '.join(
  [reverse_word_index.get(i-3, '?') for i in train_data[0]]
)

print ('Decoded review of above: ', decoded_review)
print ('Label: ', train_labels[0])

Decoded review of above:  ? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what th
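The offset of 3 is easier to see on a toy vocabulary. The sketch below uses a made-up four-word `toy_word_index` (not the real IMDB index) and assumes the same convention as above: encoded values are shifted by 3, with 1 marking the start of the sequence and 2 an unknown word.

```python
# Made-up vocabulary for illustration; the real index comes from imdb.get_word_index()
toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}

# Invert the mapping: integer -> word
toy_reverse_index = {value: key for key, value in toy_word_index.items()}

# Encoded review: 1 = start marker, 2 = unknown word, real words are shifted by 3
encoded = [1, 4, 5, 6, 2, 7]

# Undo the shift; reserved indices have no entry and fall back to '?'
decoded = " ".join(toy_reverse_index.get(i - 3, "?") for i in encoded)
print(decoded)
```

The start marker and the unknown word both decode to `?`, which is why the real decoded review begins with `?` and has `?` scattered through it.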

Next we multi-hot encode each review into a vector of length 10,000. Along the dimension axis, every position whose index (word) appears in the sequence is set to 1 and the rest stay 0. This means we lose all information contained in the ordering of the words, which is obviously very important, but the NN can still learn that having certain words in a review is good and others are bad. This allows the first layer of the NN to be a dense layer capable of handling floating-point vector data.

### Example:
 `(3, [2, 34, 9934, 62, 88])` is a (row index, sequence) pair. It indicates that row 3 of the matrix will have indices 2, 34, 9934, 62 and 88 set to 1,
 and everything else along the dimension axis will be 0.
 

In [6]:
def vectorize_sequences(sequences, dimension=10000):
  results = np.zeros((len(sequences), dimension))
  for i, sequence in enumerate(sequences):
    results[i, sequence] = 1.
  return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

print ('Train example: ', x_train[0])
print ('Train label: ', y_train[0])

Train example:  [0. 1. 1. ... 0. 0. 0.]
Train label:  1.0
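To see what the encoding keeps and what it throws away, here is the same function run on a toy input with `dimension=6`. The sequences are made up for illustration.

```python
import numpy as np


def vectorize_sequences(sequences, dimension=10000):
    # One row per sequence, one column per possible word index
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # Set every column whose index appears in the sequence to 1
        results[i, sequence] = 1.
    return results


# Rows 0 and 1 contain the same words in a different order and with a repeat
toy = vectorize_sequences([[1, 3, 3], [3, 1], [0, 5]], dimension=6)
print(toy)
```

Rows 0 and 1 come out identical: the encoding records only whether a word occurs, not where or how often.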
