# Deep Learning with Python (Chollet)
## Chapter 6: Deep learning for text and sequences

Recall that a convolutional neural net (convnet) is a stack of `Conv2D`[(link)](https://keras.io/layers/convolutional/) and `MaxPooling2D`[(link)](https://keras.io/layers/pooling/) layers. See notes on chapter 5 for more on this topic.

Two fundamental deep-learning algorithms for sequence processing (for example, time series and text analysis) are:
- Recurrent neural networks
- 1D convnets ([link](https://keras.io/layers/convolutional/))

Two applications:

- Sentiment analysis on IMDB dataset
- Temperature forecasting

## Working with text data

- Text can be understood as a sequence of words or as a sequence of characters.
- Deep-learning models for text map the statistical structure of written language.

The first step is to transform the text into a numeric input that a model can process.

> **Vectorization**: Vectorizing text is the process of transforming text into numeric tensors.

There are several ways to do this:

- Segment text into words, and transform each word into a vector.
- Segment text into characters, and transform each character into a vector.
- Extract n-grams of words or characters, and transform each n-gram into a vector. N-grams are overlapping groups of multiple consecutive words or characters.
 - E.g. the bag of 3-grams for "The cat did" is the set {"The", "cat", "did", "The cat", "cat did", "The cat did"}.
 - That is, all groups of N (or fewer) consecutive words that can be extracted from the sentence.
  
In general, units into which text is broken down are called tokens. Breaking down text into units is therefore called tokenization. 
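
The n-gram definition above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the book: the helper name `extract_ngrams` is hypothetical, and tokenization is assumed to be simple whitespace splitting.

```python
def extract_ngrams(text, n):
    """Return the set of all word n-grams of size 1..n found in `text`."""
    words = text.split()  # assumption: naive whitespace tokenization
    ngrams = set()
    for size in range(1, n + 1):
        # slide a window of `size` consecutive words over the sentence
        for start in range(len(words) - size + 1):
            ngrams.add(" ".join(words[start:start + size]))
    return ngrams


bag_of_3_grams = extract_ngrams("The cat did", 3)
print(sorted(bag_of_3_grams))
```

Because only consecutive words are grouped, a pair such as "The did" never appears in the result.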

### One-hot encoding of words and characters

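One-hot encoding is the most common and most basic way to turn a token into a vector: each token gets a unique integer index, and is then represented by a binary vector that is all zeros except for a 1 at that index (token present vs. token absent, encoded as a 0/1 variable).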

In [1]:
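# toy example: word-level one-hot encoding
import numpy as np

samples = ["The cat sat on the mat.", "The dog ate my homework"]

# build an index of all tokens in the data
token_index = {}

for sample in samples:
    # naive whitespace tokenization: case and punctuation are kept,
    # so "The", "the" and "mat." are all distinct tokens
    for word in sample.split():
        if word not in token_index:
            # assign a unique index to each word; index 0 is left unused
            token_index[word] = len(token_index) + 1

print(token_index)

# only the first max_length words of each sample are encoded
max_length = 10

# results has shape (samples, max_length, highest token index + 1)
results = np.zeros(shape=(len(samples),
                          max_length,
                          max(token_index.values()) + 1))

for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1

print("\nShape of results:", results.shape)
print("\nResult of one-hot encoding:\n\n", results)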


{'The': 1, 'cat': 2, 'sat': 3, 'on': 4, 'the': 5, 'mat.': 6, 'dog': 7, 'ate': 8, 'my': 9, 'homework': 10}

Shape of results: (2, 10, 11)

Result of one-hot encoding:

 [[[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

 [[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]]
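
The same idea works at the character level. The following is a minimal sketch, assuming the vocabulary is the set of printable ASCII characters (`string.printable`) and that only the first 50 characters of each sample are encoded; both choices are illustrative assumptions.

```python
import string

import numpy as np

samples = ["The cat sat on the mat.", "The dog ate my homework"]

# assumed vocabulary: all printable ASCII characters;
# index 0 is left unused, mirroring the word-level example
characters = string.printable
char_index = {char: i + 1 for i, char in enumerate(characters)}

max_length = 50  # only the first 50 characters of each sample are encoded

char_results = np.zeros((len(samples), max_length, len(char_index) + 1))
for i, sample in enumerate(samples):
    for j, char in enumerate(sample[:max_length]):
        char_results[i, j, char_index[char]] = 1.0

print(char_results.shape)
```

Each sample becomes a `(max_length, vocabulary size + 1)` matrix with one 1 per encoded character; positions past the end of the sample stay all-zero.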


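In practice, Keras's built-in utilities are used for one-hot encoding. They also lowercase the text and strip punctuation, which is why the vocabulary below has 9 tokens rather than the 10 found by the toy example.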

In [2]:
# use built-in function
from keras.preprocessing.text import Tokenizer

samples = ["The cat sat on the mat.", "The dog ate my homework"]

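# initialize the Tokenizer so that it only considers the 1,000 most common words,
# then build the word index from the samples
tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts(samples)

# turn each string into a list of integer word indices
sequences = tokenizer.texts_to_sequences(samples)

# binary one-hot (bag-of-words) matrix with one row per sample
one_hot_results = tokenizer.texts_to_matrix(samples, mode="binary")
# word-to-index mapping computed during fitting
word_index = tokenizer.word_index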

print('Found %s unique tokens.' % len(word_index))
print(word_index)

Using TensorFlow backend.


Found 9 unique tokens.
{'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5, 'dog': 6, 'ate': 7, 'my': 8, 'homework': 9}
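
The difference between this vocabulary and the toy example's comes from text normalization. The sketch below imitates that normalization in plain Python; it is not the `Tokenizer` implementation. The `FILTERS` string is assumed to match the Keras default filter set, and `normalize` and `sketch_word_index` are hypothetical names.

```python
from collections import Counter

samples = ["The cat sat on the mat.", "The dog ate my homework"]

# assumption: same punctuation filter as the Keras Tokenizer default
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def normalize(text):
    """Lowercase the text, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({char: " " for char in FILTERS})
    return text.lower().translate(table).split()


counts = Counter(word for sample in samples for word in normalize(sample))

# the most frequent word gets index 1; ties keep first-seen order
sketch_word_index = {
    word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)
}
print(sketch_word_index)
```

After lowercasing, "The" and "the" collapse into one token, and "mat." becomes "mat", which leaves 9 unique tokens.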


left out: 
- (one-hot hashing trick)
- (hash collisions)
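
For orientation only, here is a rough sketch of the one-hot hashing trick named above: instead of keeping an explicit word index, each word is hashed into a fixed-size vector. The use of `hashlib.md5` (chosen so that indices are reproducible across runs), the helper `stable_hash`, and the vector size of 1000 are assumptions of this sketch.

```python
import hashlib

import numpy as np

samples = ["The cat sat on the mat.", "The dog ate my homework"]

dimensionality = 1000  # assumed size of the hashed vectors
max_length = 10


def stable_hash(word):
    """Deterministic integer hash of a word (unlike built-in hash(), stable across runs)."""
    return int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)


hashed_results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in enumerate(sample.split()[:max_length]):
        # no explicit word index: the hash picks the position directly
        index = stable_hash(word) % dimensionality
        hashed_results[i, j, index] = 1.0

print(hashed_results.shape)
```

A hash collision occurs when two different words land on the same index, making them indistinguishable to the model. Collisions become less likely as `dimensionality` grows relative to the number of unique tokens.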

## Using word embeddings

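- Word embeddings are dense, low-dimensional, floating-point word vectors that are learned from data.
 - Compared with sparse, high-dimensional one-hot vectors, word embeddings pack more information into far fewer dimensions.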
- ...

In [3]:
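# embedding layer: 1,000 possible tokens (1 + maximum word index), 64-dimensional embeddings
from keras.layers import Embedding
embedding_layer = Embedding(1000, 64)

The `Embedding` layer can be understood as a dictionary that maps integer indices (which stand for specific words) to dense vectors. It takes a 2D tensor of integers of shape `(samples, sequence_length)` and returns a 3D floating-point tensor of shape `(samples, sequence_length, embedding_dimensionality)`.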
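
The dictionary analogy can be made concrete with NumPy. This is a sketch of the lookup only, not of the Keras layer itself: `embedding_matrix` is a randomly initialized stand-in for weights that a real layer would learn during training, and the word indices are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 1000, 64

# stand-in for the layer's weights: one 64-dimensional row per word index
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

# a batch of 2 sequences of 4 word indices each: shape (samples, sequence_length)
word_indices = np.array([[1, 2, 3, 4],
                         [1, 6, 7, 8]])

# the "dictionary lookup" is plain integer indexing into the weight matrix
embedded = embedding_matrix[word_indices]
print(embedded.shape)
```

The result has shape `(samples, sequence_length, embedding_dim)`, and the same word index always maps to the same vector regardless of where it occurs.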

- ...

In [4]:
# loading IMDB data for use with an Embedding layer
from keras.datasets import imdb
from keras import preprocessing

max_features = 10000  # number of words to consider as features
maxlen = 20  # keep only 20 words per review (among the max_features most common words)

# load data
(x_train, y_train), (x_test, y_test)= imdb.load_data(num_words=max_features)

print(x_train.shape)
print(y_train.shape)

# turn lists of integers into a 2D integer tensor
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)

print(x_train.shape)
print(x_test.shape)

(25000,)
(25000,)
(25000, 20)
(25000, 20)
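
To make the padding step less opaque, here is a NumPy sketch of what `pad_sequences` does to each review, assuming its default settings `padding='pre'` and `truncating='pre'`. The helper `pad_or_truncate` is a hypothetical re-implementation for illustration, not the Keras function.

```python
import numpy as np


def pad_or_truncate(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep only the last `maxlen` items of long ones."""
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = list(seq)[-maxlen:]  # 'pre' truncation drops items from the front
        if trimmed:
            # 'pre' padding leaves the fill value at the front of the row
            padded[row, -len(trimmed):] = trimmed
    return padded


example = pad_or_truncate([[1, 2, 3], [4, 5, 6, 7, 8, 9]], maxlen=4)
print(example)
```

Under these defaults a review longer than `maxlen` keeps its last `maxlen` words, and a shorter review is filled with zeros on the left.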


In [5]:
# embedding and classifier on the imdb data

# imports
from keras.models import Sequential
from keras.layers import Flatten, Dense

model = Sequential()
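# 10,000 possible tokens, 8-dimensional embeddings;
# output shape after this layer is (samples, maxlen, 8)
model.add(Embedding(10000, 8, input_length=maxlen))

# flatten the 3D tensor of embeddings into a 2D tensor of shape (samples, maxlen * 8)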
model.add(Flatten())

# add classifier on top
model.add(Dense(1, activation="sigmoid"))
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])
model.summary()

history = model.fit(x_train, y_train,
                   epochs=10,
                   batch_size=32, 
                   validation_split=0.2)

# evaluate on the held-out test set
test_loss, test_acc = model.evaluate(x_test, y_test)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 20, 8)             80000     
_________________________________________________________________
flatten_1 (Flatten)          (None, 160)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 161       
Total params: 80,161
Trainable params: 80,161
Non-trainable params: 0
_________________________________________________________________
Train on 20000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [6]:
print("Test accuracy: {:.2f}%".format(test_acc * 100))

Test accuracy: 75.58%


Note that this model flattens the embedded sequence and feeds it to a single `Dense` layer, so it treats each word in the input sequence separately, without considering inter-word relationships and sentence structure. We can do better by adding recurrent layers or 1D convolutional layers on top of the embedded sequences to learn features that take each sequence as a whole into account. This is the motivation for recurrent neural networks (RNNs).
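
The claim that each word is treated separately can be checked numerically. The sketch below rebuilds the forward pass of the model above in NumPy with random stand-in weights (not the trained ones) and shows that the pre-sigmoid output is a plain sum of one contribution per position. All array names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim, maxlen = 10000, 8, 20

# random stand-ins for the Embedding weights and the Dense layer's weights and bias
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))
dense_weights = rng.normal(size=(maxlen * embedding_dim,))
dense_bias = 0.1

review = rng.integers(0, vocab_size, size=maxlen)  # one made-up padded review

# forward pass as in the model: embed, flatten, dense (before the sigmoid)
flattened = embedding_matrix[review].reshape(-1)
logit = flattened @ dense_weights + dense_bias

# equivalent view: every position contributes its own independent term
per_position_weights = dense_weights.reshape(maxlen, embedding_dim)
contributions = (embedding_matrix[review] * per_position_weights).sum(axis=1)
logit_from_sum = contributions.sum() + dense_bias

print(np.isclose(logit, logit_from_sum))

# parameter count: embedding weights + dense weights + bias
n_params = embedding_matrix.size + dense_weights.size + 1
print(n_params)
```

No term in the sum depends on two words at once, so word interactions cannot influence the prediction. The parameter count also reproduces the 80,161 reported by `model.summary()`.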

left out: 
- Using pretrained word embeddings

### Putting it all together 

- left out

