
# Neural Nets for Sequential Data

-----
**OBJECTIVES**


- Use RNNs and CNNs to model text data
------

In [1]:
import pandas_datareader as pdr
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Sequential Models for Text
-------

Now we use the Keras `Tokenizer` to preprocess our SMS spam data and feed it through several sequential network architectures.

In [2]:
import pandas as pd
import numpy as np

In [3]:
from keras.preprocessing.text import Tokenizer

In [5]:
spam = pd.read_csv('data/sms_spam.csv')

In [6]:
spam.head() #word order carries meaning here, so we want models that respect the sequential nature of the text

Unnamed: 0,type,text
0,ham,Hope you are having a good week. Just checking in
1,ham,K..give back my thanks.
2,ham,Am also doing in cbe only. But have to pay.
3,spam,"complimentary 4 STAR Ibiza Holiday or £10,000 ..."
4,spam,okmail: Dear Dave this is your final notice to...


### `Tokenizer`
------
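Here we cap the vocabulary with `num_words = 500`, fit the tokenizer on the texts, and then convert each message to a sequence of integer word indices with `.texts_to_sequences`. Keras keeps only words whose frequency rank is below `num_words`, so in practice the 499 most frequent words survive. To ensure every sequence has the same length, we use the `pad_sequences` function.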

In [7]:
#create a tokenizer and specify the vocabulary size
tokenizer = Tokenizer(num_words = 500) #keep only the most frequently occurring words (frequency ranks 1 to num_words - 1)

In [8]:
#fit it on text
tokenizer.fit_on_texts(spam['text']) #learns the vocabulary 

In [9]:
#generate sequences
sequences = tokenizer.texts_to_sequences(spam['text']) #replace each word with its integer index from the learned vocabulary

In [10]:
sequences[:3] #messages have different numbers of words - different sized rows

[[122, 3, 22, 313, 4, 53, 110, 37, 8],
 [92, 134, 86, 11, 170],
 [60, 179, 155, 8, 62, 24, 17, 2, 387]]
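
To make the integer sequences above less mysterious, here is a minimal pure-Python sketch of the idea behind `fit_on_texts` and `texts_to_sequences`. It is a simplification and not the Keras implementation: it only lowercases and splits on whitespace (the real `Tokenizer` also filters punctuation), and the helper names `fit_word_index` and `to_sequences` are hypothetical.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by count, descending; ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words):
    """Replace each word by its index, dropping words ranked num_words or higher."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

toy_texts = ["free prize now", "call now", "now free"]
toy_index = fit_word_index(toy_texts)
toy_sequences = to_sequences(toy_texts, toy_index, num_words=3)
print(toy_index)      # word -> frequency rank
print(toy_sequences)  # rarer words ("prize", "call") are dropped
```

Words outside the capped vocabulary simply disappear from the sequence, which is one reason the rows end up with different lengths.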

In [11]:
from keras.preprocessing.sequence import pad_sequences

In [12]:
#pad sequences to 100
X = pad_sequences(sequences, maxlen = 100) #sequences just padded with zeroes at the front - normalizes the length of every observation

In [13]:
#take a peek
X[0]

array([  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
       122,   3,  22, 313,   4,  53, 110,  37,   8], dtype=int32)
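
The output above shows the default behavior of `pad_sequences`: zeros are added at the front, and sequences longer than `maxlen` lose their earliest tokens. The following NumPy sketch mimics that default under those assumptions; `pad_pre` is a hypothetical helper for illustration, not part of Keras.

```python
import numpy as np

def pad_pre(seqs, maxlen):
    """Left-pad with zeros; keep only the last `maxlen` tokens of long sequences."""
    padded = np.zeros((len(seqs), maxlen), dtype="int32")
    for row, seq in enumerate(seqs):
        kept = list(seq)[-maxlen:]       # truncate from the front if too long
        if kept:
            padded[row, -len(kept):] = kept  # right-align the tokens
    return padded

toy_padded = pad_pre([[5, 6, 7], [8]], maxlen=2)
print(toy_padded)
```

Because index 0 is never assigned to a word by the tokenizer, it is safe to use as the padding value.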

### Model
-------

In [16]:
from keras.layers import Embedding, Dense, SimpleRNN
from keras.models import Sequential

In [17]:
#sequential model
text_model1 = Sequential()
#embedding layer - maps each integer word index to a learned 64-dimensional dense vector
text_model1.add(Embedding(input_dim = tokenizer.num_words, output_dim = 64))
#simple RNN
text_model1.add(SimpleRNN(16))
#dense layer
text_model1.add(Dense(20, activation = 'relu'))
#output
text_model1.add(Dense(1, activation = 'sigmoid'))
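#compilation - binary cross-entropy loss; no optimizer is given, so Keras uses its default ('rmsprop')
text_model1.compile(loss = 'binary_crossentropy', metrics = ['accuracy'])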
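
As a rough illustration of what the first two layers of this model compute, the NumPy sketch below looks up an embedding vector for each token and then runs the simple recurrence `h = tanh(x_t @ W_x + h @ W_h + b)` across the sequence. The weights are random stand-ins (Keras would learn them), and the names `embedding_matrix`, `W_x`, `W_h`, `b`, and `rnn_forward` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, rnn_units = 500, 64, 16

# Randomly initialised stand-ins for the weights Keras would learn
embedding_matrix = rng.normal(scale=0.1, size=(vocab_size, embed_dim))
W_x = rng.normal(scale=0.1, size=(embed_dim, rnn_units))   # input weights
W_h = rng.normal(scale=0.1, size=(rnn_units, rnn_units))   # recurrent weights
b = np.zeros(rnn_units)

def rnn_forward(token_ids):
    """Embed each token, then carry a hidden state through the sequence."""
    vectors = embedding_matrix[token_ids]      # shape: (timesteps, 64)
    h = np.zeros(rnn_units)
    for x_t in vectors:                        # one step per token, in order
        h = np.tanh(x_t @ W_x + h @ W_h + b)
    return h                                   # final state, shape: (16,)

toy_ids = np.array([0, 0, 122, 3, 22])
final_state = rnn_forward(toy_ids)
print(final_state.shape)
```

Only the final hidden state is passed on to the dense layers, which is why `SimpleRNN(16)` turns a whole message into a single 16-dimensional vector.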

In [18]:
#make y binary
y = np.where(spam['type'] == 'ham', 0, 1)

In [20]:
#baseline: the accuracy of always predicting the majority class (ham)

1 - np.sum(y)/len(y)

#any model should beat roughly 87% accuracy to be useful

0.8656233135456017
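
The same baseline can be computed without assuming which class is the majority. This is a small sketch on made-up labels; `majority_baseline` is a hypothetical helper name.

```python
import numpy as np

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common class."""
    counts = np.bincount(labels)       # occurrences of each integer label
    return counts.max() / counts.sum()

toy_labels = np.array([0, 0, 0, 1])    # made-up labels: three ham, one spam
print(majority_baseline(toy_labels))
```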

In [21]:
#fit it
text_model1.fit(X, y) #a single epoch (the default) already reaches about 94% accuracy



<keras.callbacks.History at 0x7fb7cce9df70>

In [22]:
history = text_model1.fit(X, y, epochs = 10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [23]:
#what about a validation set?

history = text_model1.fit(X, y, validation_split = 0.2, epochs = 10)

#validation_split holds out the LAST 20% of the rows (taken before any shuffling), not a random sample
#reminder: training continues from the weights learned in the earlier fit calls; the model is not reset

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
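
Since `validation_split` takes its hold-out set from the end of the arrays, a dataset that is sorted in some way can give a misleading validation score. The sketch below contrasts that tail split with a shuffled split using plain NumPy indices; `tail_split` is a hypothetical helper that mirrors the documented behavior, not Keras code.

```python
import numpy as np

def tail_split(n_samples, validation_split):
    """Indices as validation_split would divide them: the last rows are held out."""
    split_at = int(np.floor(n_samples * (1.0 - validation_split)))
    indices = np.arange(n_samples)
    return indices[:split_at], indices[split_at:]

train_idx, val_idx = tail_split(10, 0.2)
print(train_idx, val_idx)   # validation rows are simply the final two

# Shuffled alternative: permute the row indices first, then cut
rng = np.random.default_rng(42)
shuffled = rng.permutation(10)
shuf_train, shuf_val = shuffled[:8], shuffled[8:]
print(shuf_train, shuf_val)
```

With shuffled indices in hand, the held-out rows can be passed explicitly through the `validation_data` argument of `fit` instead of relying on `validation_split`.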


### Convolutional Networks in 1D
--------

In [24]:
from keras.layers import Conv1D, MaxPooling1D

In [25]:
tokenizer.num_words

500

In [27]:
conv_test = Sequential()
#better representation of text -- word embedding
conv_test.add(Embedding(input_dim = tokenizer.num_words, output_dim = 64 ))
#convolution piece
conv_test.add(Conv1D(filters = 16, kernel_size = 10))
conv_test.add(MaxPooling1D(4))
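#dense layers - note there is no Flatten or global pooling layer before them, so they receive a 3D tensor (this triggers the error below)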
conv_test.add(Dense(20, activation = 'relu'))
conv_test.add(Dense(1, activation = 'sigmoid'))
#compilation
conv_test.compile(loss = 'bce', metrics = ['acc'])

In [28]:
history = conv_test.fit(X, y)

ValueError: in user code:

    File "/opt/anaconda3/lib/python3.9/site-packages/keras/engine/training.py", line 1021, in train_function  *
        return step_function(self, iterator)
    File "/opt/anaconda3/lib/python3.9/site-packages/keras/engine/training.py", line 1010, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/opt/anaconda3/lib/python3.9/site-packages/keras/engine/training.py", line 1000, in run_step  **
        outputs = model.train_step(data)
    File "/opt/anaconda3/lib/python3.9/site-packages/keras/engine/training.py", line 860, in train_step
        loss = self.compute_loss(x, y, y_pred, sample_weight)
    File "/opt/anaconda3/lib/python3.9/site-packages/keras/engine/training.py", line 918, in compute_loss
        return self.compiled_loss(
    File "/opt/anaconda3/lib/python3.9/site-packages/keras/engine/compile_utils.py", line 201, in __call__
        loss_value = loss_obj(y_t, y_p, sample_weight=sw)
    File "/opt/anaconda3/lib/python3.9/site-packages/keras/losses.py", line 141, in __call__
        losses = call_fn(y_true, y_pred)
    File "/opt/anaconda3/lib/python3.9/site-packages/keras/losses.py", line 245, in call  **
        return ag_fn(y_true, y_pred, **self._fn_kwargs)
    File "/opt/anaconda3/lib/python3.9/site-packages/keras/losses.py", line 1932, in binary_crossentropy
        backend.binary_crossentropy(y_true, y_pred, from_logits=from_logits),
    File "/opt/anaconda3/lib/python3.9/site-packages/keras/backend.py", line 5247, in binary_crossentropy
        return tf.nn.sigmoid_cross_entropy_with_logits(labels=target, logits=output)

    ValueError: `logits` and `labels` must have the same shape, received ((None, 22, 1) vs (None,)).
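
The shape in the error message can be traced by hand. With inputs of length 100, `Conv1D` with `kernel_size = 10` leaves 100 - 10 + 1 = 91 timesteps, and `MaxPooling1D(4)` reduces that to floor(91 / 4) = 22. The `Dense` layers then act on the last axis only, so the model emits one prediction per timestep, `(None, 22, 1)`, instead of one per message. One possible fix, sketched here as an assumption rather than the notebook's intended solution, is to collapse the time axis with `GlobalMaxPooling1D` (a `Flatten` layer is another common choice); the name `conv_fixed` is hypothetical.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, MaxPooling1D, GlobalMaxPooling1D, Dense

conv_fixed = Sequential()
conv_fixed.add(Embedding(input_dim=500, output_dim=64))   # (batch, 100, 64)
conv_fixed.add(Conv1D(filters=16, kernel_size=10))        # (batch, 91, 16)
conv_fixed.add(MaxPooling1D(4))                           # (batch, 22, 16)
conv_fixed.add(GlobalMaxPooling1D())                      # (batch, 16): time axis collapsed
conv_fixed.add(Dense(20, activation='relu'))
conv_fixed.add(Dense(1, activation='sigmoid'))            # (batch, 1): one score per message
conv_fixed.compile(loss='binary_crossentropy', metrics=['accuracy'])

# Push a dummy batch of two padded messages through to check the output shape
dummy_batch = np.zeros((2, 100), dtype='int32')
output_shape = tuple(conv_fixed(dummy_batch).shape)
print(output_shape)
```

Once the output has one value per message, it lines up with the `(None,)` labels and `fit(X, y)` can compute the loss.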


### Exercise

Build a model on the tweets data from `tweets.csv`. 