# Deep-Learning with Keras

#### Ugur URESIN, AI Engineer | Data Scientist
#### Mail: uresin.ugur@gmail.com

## Chapter 09. Recurrent Neural Nets for Text Data

### Using Word Embeddings

Another popular and powerful way to associate a vector with a word is to use **dense word vectors** (a.k.a. **word embeddings**).  

**One-hot encodings** are binary, sparse (mostly made of zeros), and very high-dimensional.  

**Word embeddings** are low-dimensional floating-point vectors (that is, dense vectors, as opposed to sparse vectors).  
See figure below.

![one-hot vs. word-embeddings](./img/word-vectors.png "one-hot vs. word-embeddings")

Unlike the word vectors obtained via one-hot encoding, **word embeddings are learned from data**.  

It’s common to see word embeddings that are 256-dimensional, 512-dimensional, or 1,024-dimensional when dealing with very large vocabularies.  
On the other hand, one-hot encoding words generally leads to vectors that are 20,000-dimensional or greater, with one dimension per word in the vocabulary.  
So, word embeddings pack more information into far fewer dimensions.
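
The contrast can be made concrete with a small sketch in plain Python. The five-word vocabulary and the 3-dimensional embedding size are assumptions chosen for illustration, and the random values only stand in for weights that a real model would learn from data.

```python
import random

# Toy vocabulary (hypothetical words chosen for illustration)
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a sparse binary vector with one dimension per vocabulary word."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

# A dense embedding table: one short float vector per word.
# Random values stand in for weights that would normally be learned from data.
embedding_dim = 3
rng = random.Random(0)
embedding_table = [[rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab]

def embed(word):
    """Look up the dense vector for a word."""
    return embedding_table[word_to_index[word]]

print(one_hot("cat"))  # [0, 1, 0, 0, 0]
print(embed("cat"))    # three floats; the values depend on the random seed
```

The one-hot vector grows with the vocabulary, while the dense vector keeps the same small size no matter how many words are added.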

### Instantiating an Embedding layer

In [3]:
from keras.layers import Embedding
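# Embedding(number of possible tokens, embedding dimensionality):
# here 1,000 tokens (1 + maximum word index), each mapped to a 64-dimensional vector
embedding_layer = Embedding(1000, 64)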
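
An `Embedding` layer can be thought of as a lookup table that maps integer word indices to dense vectors. The following is a minimal plain-Python sketch of that idea, not the Keras implementation: `table` and `embedding_lookup` are hypothetical names, and the random values stand in for learned weights.

```python
import random

num_tokens = 1000    # plays the role of the first argument in Embedding(1000, 64)
embedding_dim = 64   # plays the role of the second argument

rng = random.Random(42)
# Lookup table with one row per token index; a real layer learns these values.
table = [[rng.gauss(0.0, 0.05) for _ in range(embedding_dim)]
         for _ in range(num_tokens)]

def embedding_lookup(batch):
    """Map a 2D batch of integer indices to a 3D batch of dense vectors."""
    return [[table[index] for index in sequence] for sequence in batch]

batch = [[4, 25, 999], [7, 0, 3]]  # shape (samples=2, sequence_length=3)
output = embedding_lookup(batch)

print(len(output), len(output[0]), len(output[0][0]))  # 2 3 64
```

The input has shape `(samples, sequence_length)` and the output has shape `(samples, sequence_length, embedding_dim)`, which is why the model later in this chapter needs a `Flatten` layer before the `Dense` classifier.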

#### Loading the IMDB data for use with an Embedding layer

In [4]:
from keras.datasets import imdb
from keras import preprocessing

max_features = 10000  # number of words to consider as features
maxlen = 20           # keeps only this many words per review (among the max_features most common words)

# Loads the data as lists of integers
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

# Turns the lists of integers into 2D integer tensors of shape (samples, maxlen)
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)
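
With its default arguments (`padding='pre'`, `truncating='pre'`), `pad_sequences` pads short sequences with zeros on the left and trims long sequences from the front, so with `maxlen = 20` each review keeps its last 20 tokens. The plain-Python sketch below imitates that default behavior on toy data; `pad_pre` is a hypothetical helper, not part of Keras.

```python
def pad_pre(sequences, maxlen, value=0):
    """Sketch of pad_sequences with padding='pre' and truncating='pre'."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                            # keep the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + seq)   # left-pad with zeros
    return padded

reviews = [[11, 12, 13, 14, 15, 16], [21, 22]]
print(pad_pre(reviews, maxlen=4))
# [[13, 14, 15, 16], [0, 0, 21, 22]]
```

Every row now has the same length, which is what allows the list of reviews to be packed into a single 2D tensor.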


#### Using an Embedding layer and classifier on the IMDB data

In [6]:
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model = Sequential()
# input_length specifies the maximum input length to the Embedding layer,
# so that the embedded inputs can later be flattened.
# After the Embedding layer, the activations have shape (samples, maxlen, 8).
model.add(Embedding(10000, 8, input_length=maxlen))
# Flattens the 3D tensor of embeddings into a 2D tensor of shape (samples, maxlen * 8)
model.add(Flatten())

model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()

history = model.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.2)

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 20, 8)             80000     
_________________________________________________________________
flatten_2 (Flatten)          (None, 160)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 161       
Total params: 80,161
Trainable params: 80,161
Non-trainable params: 0
_________________________________________________________________
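
The numbers in the summary can be checked by hand. The short calculation below reproduces them from the values used in the model (vocabulary of 10,000, 8-dimensional embeddings, 20 tokens per review); the variable names are chosen here for illustration.

```python
max_features = 10000  # vocabulary size passed to Embedding
embedding_dim = 8     # embedding dimensionality
maxlen = 20           # tokens per review

embedding_params = max_features * embedding_dim  # one 8-dimensional vector per token
flatten_units = maxlen * embedding_dim           # 20 * 8 values per sample
dense_params = flatten_units * 1 + 1             # one weight per input plus one bias

print(embedding_params)                 # 80000
print(flatten_units)                    # 160
print(dense_params)                     # 161
print(embedding_params + dense_params)  # 80161
```

Almost all of the trainable parameters sit in the embedding table; the classifier on top contributes only 161.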


Train on 20000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


### From raw text to word embeddings

#### Processing the labels of the raw IMDB data

First, go to http://mng.bz/0tIo, download the raw IMDB dataset, and uncompress it so that the `aclImdb` directory sits under `./data`.

In [11]:
import os

imdb_dir = './data/aclImdb'
train_dir = os.path.join(imdb_dir, 'train')

labels = []
texts = []

for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)
    for fname in os.listdir(dir_name):
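        if fname.endswith('.txt'):
            # The with-block closes the file even if reading fails;
            # the explicit encoding avoids platform-dependent defaults
            with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                texts.append(f.read())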
            if label_type == 'neg':
                labels.append(0)
            else:
                labels.append(1)
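
To try the loading logic without downloading the dataset, the same loop can be wrapped in a function and run against a tiny temporary directory. This is only a sketch: `load_labeled_texts` is a hypothetical helper, the three reviews are made up, and `sorted()` is added so the file order is deterministic.

```python
import os
import tempfile

def load_labeled_texts(train_dir):
    """Read every .txt file under train_dir/neg and train_dir/pos.

    Hypothetical helper: returns (texts, labels) with 0 = negative, 1 = positive.
    """
    texts, labels = [], []
    for label_type in ['neg', 'pos']:
        dir_name = os.path.join(train_dir, label_type)
        for fname in sorted(os.listdir(dir_name)):  # sorted for a stable order
            if fname.endswith('.txt'):
                with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                    texts.append(f.read())
                labels.append(0 if label_type == 'neg' else 1)
    return texts, labels

# Build a tiny stand-in for ./data/aclImdb/train with made-up reviews.
with tempfile.TemporaryDirectory() as toy_dir:
    samples = {'neg': ['Dull and slow.'], 'pos': ['Loved it.', 'A great film.']}
    for label_type, reviews in samples.items():
        os.makedirs(os.path.join(toy_dir, label_type))
        for i, review in enumerate(reviews):
            path = os.path.join(toy_dir, label_type, '%d.txt' % i)
            with open(path, 'w', encoding='utf-8') as f:
                f.write(review)
    toy_texts, toy_labels = load_labeled_texts(toy_dir)

print(toy_labels)  # [0, 1, 1]
print(toy_texts)   # ['Dull and slow.', 'Loved it.', 'A great film.']
```

Because the `neg` directory is read first, all negative labels come before the positive ones, which is why the data is shuffled before it is split into training and validation sets.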

#### Tokenizing the text of the raw IMDB data

In [13]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np

maxlen = 100               #Cuts off reviews after 100 words
training_samples = 200     #Trains on 200 samples
validation_samples = 10000 #Validates on 10,000 samples
max_words = 10000          #Considers only the top 10,000 words in the dataset          

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=maxlen)

labels = np.asarray(labels)
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

# Shuffles the data before splitting it into a training set and a validation set,
# because the samples are ordered (all negative first, then all positive)
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]

x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples: training_samples + validation_samples]
y_val = labels[training_samples: training_samples + validation_samples]

Found 88582 unique tokens.
Shape of data tensor: (25000, 100)
Shape of label tensor: (25000,)
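
Note that `word_index` reports 88,582 unique tokens even though `num_words` is 10,000: the tokenizer indexes every word it sees and applies the `num_words` limit only when converting texts to sequences. The plain-Python sketch below imitates that behavior on three made-up sentences. It is a simplification, not the Keras implementation: `tokenize`, `fit_word_index`, and `texts_to_sequences` are hypothetical helpers, and the regular expression only approximates the tokenizer's punctuation filtering.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (simplified)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def fit_word_index(texts):
    """Assign index 1 to the most frequent word, 2 to the next, and so on."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Encode texts as integers, keeping only words whose index is below num_words."""
    return [[word_index[w] for w in tokenize(text)
             if w in word_index and word_index[w] < num_words]
            for text in texts]

texts = ["The movie was great", "The movie was bad", "Great acting, great movie"]
word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index, num_words=4)

print(len(word_index))  # 6
print(sequences)        # [[3, 1, 2], [3, 1], [2, 2, 1]]
```

All six words appear in `word_index`, but only the three most frequent ones (indices 1 to 3) survive in the sequences, which mirrors how `num_words=max_words` behaves above.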
