## Deep Learning for Text and Sequence Notes

*This is my notebook for recording important terms, code, and concepts for applying deep learning to text data.*

**Reference**: *Chollet, F. (2017). Deep Learning with Python. Manning Publications.*

#### The fundamental algorithms:
- Recurrent neural networks (RNNs)
- 1D convolutional networks (Conv1D)

Vectorizing text is the process of transforming text into numeric tensors.

Tokens (words, characters, or n-grams) are the units that text is broken down into.

Tokenization: breaking text into tokens.

#### Methods of associating a vector with a token:
- One-hot encoding
- Word embedding

N-grams: groups of N consecutive words extracted from a sentence. Extracting N-grams is a form of feature engineering. It is useful for shallow language-processing models (such as logistic regression or random forests) rather than for deep learning.
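
To make tokens and N-grams concrete, here is a minimal sketch in plain Python. The helper names `tokenize` and `ngrams` are my own, and the regex-based splitting is a simplifying assumption, not how Keras tokenizes internally.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n):
    """Return every group of exactly n consecutive tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("The cat sat on the mat.")
bigrams = ngrams(tokens, 2)

print(tokens)   # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(bigrams)  # first item is ('the', 'cat')
```

A sentence with fewer than n tokens simply yields an empty list.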

#### One-hot encoding:

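Each token is assigned a unique integer index n, and its vector is all zeros except for the n-th entry, which is 1.

One-hot vectors are:

- Binary
- Sparse
- High-dimensional (same dimensionality as the vocabulary size)

One-hot hashing trick: used when the number of unique tokens in the vocabulary is too large to handle explicitly. Instead of keeping an explicit index for every word, tokens are hashed into vectors of a fixed size, at the risk of hash collisions.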
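
The one-hot hashing trick mentioned above can be sketched with NumPy as follows. This is only an illustration under my own assumptions: the helper names `stable_hash` and `hashed_one_hot` are hypothetical, and I use an MD5 digest instead of Python's built-in `hash()` because the latter is salted per process for strings, which would make the indices change between runs.

```python
import hashlib
import numpy as np

def stable_hash(token, dimensionality):
    """Map a token to a reproducible index in [0, dimensionality)."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % dimensionality

def hashed_one_hot(samples, max_length=10, dimensionality=1000):
    """Encode each sample as a (max_length, dimensionality) binary matrix."""
    results = np.zeros((len(samples), max_length, dimensionality))
    for i, sample in enumerate(samples):
        # only the first max_length whitespace-separated words are encoded
        for j, word in enumerate(sample.split()[:max_length]):
            results[i, j, stable_hash(word, dimensionality)] = 1.0
    return results

samples = ["The cat sat on the mat.", "The dog ate my homework."]
encoded = hashed_one_hot(samples)
print(encoded.shape)  # (2, 10, 1000)
```

No word index is stored, which saves memory, but two different words can end up with the same hash. Collisions become less likely when `dimensionality` is much larger than the number of unique tokens.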

In [None]:
# word-level one-hot encoding

from keras.preprocessing.text import Tokenizer 

samples = ["I love learning deep learning.",
           "Machine learning is the future of humanity."]

tokenizer = Tokenizer(num_words=1000)  # keep only the 1,000 most common words
tokenizer.fit_on_texts(samples)        # build the word index

# turn each string into a list of integer indices
sequences = tokenizer.texts_to_sequences(samples)

# one-hot binary representation, shape (len(samples), num_words)
one_hot_results = tokenizer.texts_to_matrix(samples, mode="binary")

word_index = tokenizer.word_index
print("Found %s unique tokens." % len(word_index))

#### Word embeddings

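- Low-dimensional floating-point vectors (dense vectors)
- Learned from data
- Pack more information into far fewer dimensions

Geometric distance between word vectors should reflect semantic distance: words with similar meanings are embedded close together, while words meaning different things are embedded at points far away from each other.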

The Embedding layer can be thought of as a dictionary that maps integer indices (each standing for a specific word) to dense vectors.
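
The dictionary view of the Embedding layer can be sketched with plain NumPy indexing. The random matrix below is only a stand-in for weights that a real layer would learn during training, and the sizes are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embedding_dim = 1000, 64
# Stand-in for the layer's weight matrix (one row per token index).
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

# A batch of 2 sequences, each made of 3 integer word indices.
word_indices = np.array([[4, 25, 7],
                         [4, 0, 999]])

# The "dictionary lookup" is just row indexing.
embedded = embedding_matrix[word_indices]
print(embedded.shape)  # (2, 3, 64)
```

A 2D integer tensor of shape (samples, sequence_length) goes in, and a 3D float tensor of shape (samples, sequence_length, embedding_dim) comes out. The same index always maps to the same vector.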

All sequences in a batch must have the same length, so shorter sequences are **padded with zeros** and longer sequences are **truncated**.
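
Padding and truncation can be sketched in plain Python. The helper `pad_or_truncate` is a hypothetical name of mine. It pads and truncates at the start of each sequence, which, as far as I know, matches the default behavior of Keras's `pad_sequences`.

```python
def pad_or_truncate(sequences, maxlen, value=0):
    """Make every sequence exactly maxlen long.

    Short sequences get `value` added at the front; long sequences
    keep only their last maxlen items.
    """
    result = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            seq = seq[len(seq) - maxlen:]        # truncate: drop the earliest items
        padding = [value] * (maxlen - len(seq))  # pad: fill the front with zeros
        result.append(padding + seq)
    return result

batch = pad_or_truncate([[1, 2], [3, 4, 5, 6, 7]], maxlen=4)
print(batch)  # [[0, 0, 1, 2], [4, 5, 6, 7]]
```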

In [None]:
from keras.layers import Embedding

# Embedding(number of possible tokens, embedding dimensionality):
# here a vocabulary of 1,000 tokens, each mapped to a 64-dimensional vector
embedding_layer = Embedding(1000, 64)

In [None]:
# IMDB movie-review sentiment-prediction

from keras.datasets import imdb
from keras import preprocessing
from keras.models import Sequential
from keras.layers import Flatten, Dense

max_features = 10000  # number of most frequent words to keep as features
maxlen = 20           # number of words kept per review

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)

In [None]:
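model = Sequential()

# after the Embedding layer, activations have shape (samples, maxlen, 8)
model.add(Embedding(max_features, 8, input_length=maxlen))

# flatten the 3D tensor of embeddings into a 2D tensor of shape (samples, maxlen * 8)
model.add(Flatten())

# single sigmoid unit for binary sentiment classification
model.add(Dense(1, activation="sigmoid"))

model.summary()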

In [None]:
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])

history = model.fit(x_train, y_train,
                   epochs=10,
                   batch_size=32,
                   validation_split=0.2)

#### Using GloVe embeddings (https://nlp.stanford.edu/projects/glove)
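
As a placeholder for this part, here is a minimal sketch of how pretrained vectors in GloVe's text format (one word per line, followed by its float components) can be turned into an embedding matrix. The three-dimensional vectors and the `word_index` below are made-up toy data, not real GloVe values; a real run would read one of the downloaded GloVe files instead of the in-memory string.

```python
import io
import numpy as np

# Toy stand-in for a GloVe file (made-up 3-dimensional vectors).
toy_glove = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "cat 0.4 0.5 0.6\n"
    "sat 0.7 0.8 0.9\n"
)

# Parse each line into a word -> vector mapping.
embeddings_index = {}
for line in toy_glove:
    values = line.split()
    embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

# Hypothetical word index in the style of Tokenizer.word_index (indices start at 1).
word_index = {"the": 1, "cat": 2, "dog": 3}
embedding_dim = 3

# Row i holds the vector of the word with index i.
# Row 0 and words missing from the pretrained vectors stay all zeros.
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim), dtype="float32")
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

print(embedding_matrix.shape)  # (4, 3)
```

A matrix like this can then be loaded as the weights of an Embedding layer, which is usually frozen so that training does not overwrite the pretrained vectors.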

### Recurrent Neural Networks

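Densely connected networks and convnets are feedforward networks: they have no memory, so each input is processed independently, with no state kept between inputs.

RNN:
- Processes information incrementally while maintaining an internal model of what it's processing.
- This internal state is built from past information and constantly updated as new information comes in.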
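
The loop-with-state idea can be sketched as a NumPy forward pass of a simple RNN. The sizes are arbitrary and the random weights are stand-ins for parameters that training would learn.

```python
import numpy as np

timesteps, input_features, output_features = 100, 32, 64
rng = np.random.default_rng(0)

inputs = rng.random((timesteps, input_features))  # one input sequence
state_t = np.zeros((output_features,))            # initial state: all zeros

# Random weights stand in for learned parameters.
W = rng.random((output_features, input_features))
U = rng.random((output_features, output_features))
b = rng.random((output_features,))

successive_outputs = []
for input_t in inputs:
    # Combine the current input with the previous state.
    output_t = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)
    successive_outputs.append(output_t)
    state_t = output_t  # the output becomes the state for the next step

final_output_sequence = np.stack(successive_outputs, axis=0)
print(final_output_sequence.shape)  # (100, 64)
```

Each output depends on the current input and on everything seen before it, through `state_t`.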

In [None]:
# RNN in Keras

from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN, Dense

model = Sequential()
model.add(Embedding(10000, 32))
# return_sequences=True returns the output at every timestep,
# shape (batch_size, timesteps, 32), instead of only the last output
model.add(SimpleRNN(32, return_sequences=True))

model.summary()

In [None]:
from keras.datasets import imdb
from keras.preprocessing import sequence

max_features = 10000
maxlen = 500
batch_size = 32

print("Loading data...")
(input_train, y_train), (input_test, y_test) = imdb.load_data(num_words=max_features)
print(len(input_train), "train sequences")
print(len(input_test), "test sequences")


print("Pad sequences (samples x time)")
input_train = sequence.pad_sequences(input_train, maxlen=maxlen)
input_test = sequence.pad_sequences(input_test, maxlen=maxlen)
print("input_train_shape: ", input_train.shape)
print("input_test_shape: ", input_test.shape)

In [None]:
model = Sequential()
model.add(Embedding(max_features, 32))
model.add(SimpleRNN(32))
model.add(Dense(1, activation="sigmoid"))

model.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics=["acc"])

history = model.fit(input_train, y_train,
                   epochs=10,
                   batch_size=128,
                   validation_split=0.2)

In [None]:
import matplotlib.pyplot as plt

acc = history.history["acc"]
val_acc = history.history["val_acc"]
loss = history.history["loss"]
val_loss = history.history["val_loss"]

epochs = range(1, len(acc) + 1)

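# line = training, dots = validation
plt.plot(epochs, acc, "b", label="Training acc")
plt.plot(epochs, val_acc, "bo", label="Validation acc")
plt.title("Training and validation accuracy")
plt.xlabel("Epochs")
plt.legend()
plt.show()

plt.plot(epochs, loss, "r", label="Training loss")
plt.plot(epochs, val_loss, "ro", label="Validation loss")
plt.title("Training and validation loss")
plt.xlabel("Epochs")
plt.legend()
plt.show()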

### LSTM and GRU layers 

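SimpleRNN is too simplistic for most real-world use. In theory it can retain information about inputs seen many timesteps earlier, but in practice such long-term dependencies are impossible to learn because of the vanishing gradient problem.

LSTM and GRU layers are designed to address this problem.

LSTM saves information for later (it allows past information to be reinjected at a later time), thus preventing older signals from gradually vanishing during processing.

LSTMs are well suited to difficult NLP problems such as question answering and machine translation.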
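
To see how an LSTM carries information across timesteps, here is a NumPy sketch of a single LSTM step using the standard gate equations. The function name `lstm_step`, the way the four gates are packed into one weight matrix, and their ordering are my own assumptions for illustration. Keras's internal layout may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM timestep.

    W: (4 * units, input_dim), U: (4 * units, units), b: (4 * units,)
    Blocks, in order: input gate, forget gate, candidate, output gate.
    """
    units = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0 * units:1 * units])  # input gate: how much new information to write
    f = sigmoid(z[1 * units:2 * units])  # forget gate: how much of the old carry to keep
    g = np.tanh(z[2 * units:3 * units])  # candidate values
    o = sigmoid(z[3 * units:4 * units])  # output gate
    c_t = f * c_prev + i * g             # updated carry
    h_t = o * np.tanh(c_t)               # output / hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, units, timesteps = 8, 4, 20
W = rng.normal(scale=0.1, size=(4 * units, input_dim))
U = rng.normal(scale=0.1, size=(4 * units, units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(timesteps, input_dim)):
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape, c.shape)  # (4,) (4,)
```

The carry `c_t` is updated additively. When the forget gate is close to 1 and the input gate is close to 0, old information passes through almost unchanged, which is what keeps older signals from vanishing.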

In [None]:
from keras.layers import LSTM

model = Sequential()
model.add(Embedding(max_features, 32))
model.add(LSTM(32))
model.add(Dense(1, activation="sigmoid"))

model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])

history = model.fit(input_train, y_train,
                   epochs=10,
                   batch_size=128,
                   validation_split=0.2)

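Recurrent dropout: a dropout variant for recurrent layers that applies the same dropout mask at every timestep; it helps prevent overfitting.

Stacking recurrent layers: increases the representational power of the network, at the cost of a higher computational load.

Bidirectional recurrent layers: process the sequence in both chronological and reverse order; this can increase accuracy and mitigate forgetting issues.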
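
Stacking and bidirectional processing can be sketched with a small NumPy RNN (recurrent dropout is left out to keep it short). The helpers `simple_rnn` and `random_weights` are hypothetical names of mine, and the random weights stand in for learned parameters.

```python
import numpy as np

def simple_rnn(inputs, W, U, b, return_sequences=False):
    """Run a tanh RNN over inputs of shape (timesteps, features)."""
    state = np.zeros(U.shape[0])
    outputs = []
    for x_t in inputs:
        state = np.tanh(W @ x_t + U @ state + b)
        outputs.append(state)
    return np.stack(outputs) if return_sequences else state

def random_weights(rng, input_dim, units):
    """Small random weights standing in for learned parameters."""
    return (rng.normal(scale=0.1, size=(units, input_dim)),
            rng.normal(scale=0.1, size=(units, units)),
            np.zeros(units))

rng = np.random.default_rng(0)
timesteps, features, units = 12, 5, 8
sequence = rng.normal(size=(timesteps, features))

# Stacking: the lower layer must return its full output sequence
# so that the upper layer receives one input per timestep.
layer1 = random_weights(rng, features, units)
layer2 = random_weights(rng, units, units)
hidden_sequence = simple_rnn(sequence, *layer1, return_sequences=True)
stacked_output = simple_rnn(hidden_sequence, *layer2)

# Bidirectional: one RNN reads the sequence in order, a second one
# (with its own weights) reads it reversed; the outputs are concatenated.
forward = random_weights(rng, features, units)
backward = random_weights(rng, features, units)
bidirectional_output = np.concatenate([
    simple_rnn(sequence, *forward),
    simple_rnn(sequence[::-1], *backward),
])

print(hidden_sequence.shape, stacked_output.shape, bidirectional_output.shape)
# (12, 8) (8,) (16,)
```

In Keras, the corresponding pieces are `return_sequences=True` on every recurrent layer except the last, the `Bidirectional` wrapper layer, and the `dropout` and `recurrent_dropout` arguments of the recurrent layers.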

In [None]:
fname = "../input/weather-archive-jena/jena_climate_2009_2016.csv"
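# read the whole file; the context manager closes it automatically
with open(fname) as f:
    data = f.read()

# strip() drops a trailing newline that would otherwise leave an empty last line
lines = data.strip().split("\n")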
header = lines[0].split(",")
lines = lines[1:]

print(header)
print(len(lines))

In [None]:
# convert into a NumPy array (dropping the first column, the date/time stamp)

import numpy as np

float_data = np.zeros((len(lines), len(header) - 1))
for i, line in enumerate(lines):
    values = [float(x) for x in line.split(",")[1:]]
    float_data[i, :] = values

In [None]:
temp = float_data[:, 1]  # temperature (in degrees Celsius)
plt.plot(range(len(temp)), temp)

In [None]:
# first 10 days of data (one record every 10 minutes, 144 per day)
plt.plot(range(1440), temp[:1440])

In [None]:
# to be updated