# 1 - Introduction

This notebook develops artificial neural network (ANN) models. The focus will be on recurrent neural networks (RNNs). The idea is that in order to properly

## 1.1 Load Packages and Global Variables

In [1]:
%matplotlib inline
import os
import luigi
import numpy as np
import nltk
import joblib  # sklearn.externals.joblib was removed from newer scikit-learn releases
from sklearn.feature_extraction.text import CountVectorizer
from nltk import word_tokenize, wordpunct_tokenize
from sklearn.model_selection import train_test_split
from keras.callbacks import ModelCheckpoint
from keras.callbacks import EarlyStopping 
from keras import optimizers
import scipy.stats

Using TensorFlow backend.


In [2]:
PROJECT_DIR = os.path.join(os.getcwd(), os.pardir)
os.chdir(PROJECT_DIR)

In [3]:
from src.data.clean import CleanData
from src.data.download import DownloadFile

## 1.2 Load the Data

The following Luigi task ensures that the cleaned test and train sets are available, and produces them if they are not.

In [4]:
luigi.build([CleanData()], local_scheduler = True)

DEBUG: Checking if CleanData() is complete
INFO: Informed scheduler that task   CleanData__99914b932b   has status   DONE
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Done
DEBUG: There are no more tasks to run at this time
INFO: Worker Worker(salt=838591748, workers=1, host=DESKTOP-6UJS098, username=wertu, pid=4416) was stopped. Shutting down Keep-Alive thread
INFO: 
===== Luigi Execution Summary =====

Scheduled 1 tasks of which:
* 1 present dependencies were encountered:
    - 1 CleanData()

Did not run any tasks
This progress looks :) because there were no failed tasks or missing external dependencies

===== Luigi Execution Summary =====



True

Now that we have ensured that the data is present, load it.

In [5]:
#Load data
train = joblib.load('data/interim/train.pkl')
test = joblib.load('data/interim/test.pkl')

The following Luigi task ensures that the GloVe embeddings are available, and downloads them if they are not.

In [6]:
luigi.build([DownloadFile(url='http://nlp.stanford.edu/data/glove.6B.zip',
                           out_name='data/external/GloveVectors', filetype='zip')], local_scheduler = True)

DEBUG: Checking if DownloadFile(out_name=data/external/GloveVectors, url=http://nlp.stanford.edu/data/glove.6B.zip, filetype=zip) is complete
INFO: Informed scheduler that task   DownloadFile_zip_data_external_Gl_http___nlp_stanf_8240f740b5   has status   DONE
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Done
DEBUG: There are no more tasks to run at this time
INFO: Worker Worker(salt=442427354, workers=1, host=DESKTOP-6UJS098, username=wertu, pid=4416) was stopped. Shutting down Keep-Alive thread
INFO: 
===== Luigi Execution Summary =====

Scheduled 1 tasks of which:
* 1 present dependencies were encountered:
    - 1 DownloadFile(out_name=data/external/GloveVectors, url=http://nlp.stanford.edu/data/glove.6B.zip, filetype=zip)

Did not run any tasks
This progress looks :) because there were no failed tasks or missing external dependencies

===== Luigi Execution Summary =====



True

# 2 - Create Embedding Matrix

Use sklearn's CountVectorizer to create a dictionary of tokens with indexes. Use the nltk tokenizer instead of the built-in tokenizer in sklearn, since nltk doesn't completely ignore punctuation.

In [7]:
vectorizer = CountVectorizer(tokenizer=wordpunct_tokenize, min_df=5)
vectorizer.fit(train["full_text"])
vocab = vectorizer.vocabulary_
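
To illustrate why the tokenizer choice matters, here is a minimal sketch that uses only the standard library. It assumes that the two regular expressions below are the ones behind sklearn's default `token_pattern` and nltk's `wordpunct_tokenize`; the sample sentence is made up for illustration.

```python
import re

# Assumed patterns: scikit-learn's default token_pattern and the regex
# behind nltk's wordpunct_tokenize, reproduced with the standard library.
SKLEARN_DEFAULT = re.compile(r"(?u)\b\w\w+\b")
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")

# Hypothetical sample text, lowercased as CountVectorizer does by default.
text = "Knock, knock! Who's there?".lower()

default_tokens = SKLEARN_DEFAULT.findall(text)
wordpunct_tokens = WORDPUNCT.findall(text)

print(default_tokens)    # punctuation and one-character tokens are dropped
print(wordpunct_tokens)  # punctuation runs are kept as their own tokens
```

Under these assumptions, the default pattern silently discards `,`, `!`, `?` and the `s` of `who's`, while the wordpunct pattern keeps them, which may matter for text where punctuation carries signal.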

In [8]:
# Build an embedding matrix from a pretrained embedding file.
# It takes two inputs: the path to the embedding file and a word index.
# The word index is a dictionary with a word as key and an index number as value.
# The function reads the embedding file into a dictionary with one key per line/word;
# the value is that word's embedding vector.
# It then creates an embedding matrix with one row per word in the index, plus two reserved rows,
# and iterates over the word index, copying in the pretrained vector of every word that has one.
def create_embedmatrix(embedding_file, word_index):
    # word embeddings read from the file
    embeddings_index = {}
    not_found = {}
    with open(embedding_file, encoding="utf8") as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    print('Found %s word vectors.' % len(embeddings_index))
    embedding_dim = next(iter(embeddings_index.values())).shape[0]

    # now make the embedding matrix
    # row 0 is reserved for the padding character - all 0s so it can be masked later
    # row 1 is reserved for words that are not in vocab (also all 0s and masked later)
    # words that are in vocab but not in the embeddings get their own rows,
    # initialized randomly since we might want to train them later
    # draw from a truncated normal; parameters come from loading the embedding
    lower = -2
    upper = 2
    mu = 0.0  # mean
    sigma = 0.5  # standard deviation
    embedding_matrix = scipy.stats.truncnorm.rvs(
              (lower-mu)/sigma, (upper-mu)/sigma, loc=mu, scale=sigma,
              size=(len(word_index) + 2, embedding_dim))
    # zero the two reserved rows so padding and out-of-vocab tokens can be masked
    embedding_matrix[:2, :] = 0.

    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # shift by 2 to skip the reserved rows (matches the offset in create_seqs)
            embedding_matrix[i + 2] = embedding_vector
        else:
            # words not found in the embedding index keep their random initialization
            not_found[word] = not_found.get(word, 0) + 1
    return embedding_matrix, embedding_dim, not_found

In [9]:
embedding_matrix, embedding_dim, not_found = create_embedmatrix('data/external/GloveVectors/glove.6B.200d.txt', vocab)

Found 400000 word vectors.
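
The row layout described in the comments above (row 0 for padding, row 1 for out-of-vocabulary words, vocabulary index `i` stored at row `i + 2`) can be sketched on toy data with the standard library alone. The GloVe-style lines, the four-word vocabulary, and the `truncated_normal` helper are all hypothetical stand-ins; the helper uses simple rejection sampling instead of `scipy.stats.truncnorm`, whose first two arguments are the bounds expressed in standard deviations from the mean.

```python
import random

def truncated_normal(mu, sigma, lower, upper, rng):
    """Draw one sample from N(mu, sigma) restricted to [lower, upper] by rejection."""
    while True:
        x = rng.gauss(mu, sigma)
        if lower <= x <= upper:
            return x

# Toy stand-ins (hypothetical): three GloVe-style lines and a four-word vocabulary.
glove_lines = ["the 0.1 0.2 0.3", "joke -0.4 0.5 0.6", "cat 0.7 -0.8 0.9"]
toy_vocab = {"cat": 0, "joke": 1, "lolz": 2, "the": 3}

# Parse "word v1 v2 ..." lines into a word -> vector dictionary.
pretrained = {}
for line in glove_lines:
    word, *values = line.split()
    pretrained[word] = [float(v) for v in values]
dim = len(next(iter(pretrained.values())))

rng = random.Random(0)
n_rows = len(toy_vocab) + 2  # row 0: padding, row 1: out-of-vocabulary
matrix = [[truncated_normal(0.0, 0.5, -2.0, 2.0, rng) for _ in range(dim)]
          for _ in range(n_rows)]
matrix[0] = [0.0] * dim  # padding row, all zeros so it can be masked
matrix[1] = [0.0] * dim  # out-of-vocabulary row, all zeros as well

missing = []
for word, idx in toy_vocab.items():
    if word in pretrained:
        matrix[idx + 2] = pretrained[word]  # shift past the two reserved rows
    else:
        missing.append(word)  # keeps its random initial row

print(n_rows, dim, missing)
```

In this sketch `lolz` is in the vocabulary but has no pretrained vector, so it keeps a random row inside `[-2, 2]` that could later be trained.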


The above shows the need to be able to update word embeddings, instead of just ignoring...

## 2.2 Sequences

Now transform the texts into fixed-length sequences of word indexes. This should really be made into a pipeline.

In [10]:
max_len = 300

In [11]:
def create_seqs(texts, vocab, max_len):
    tokens = []
    for text in texts:
        tokens.append(wordpunct_tokenize(text.lower()))
    seqs = np.zeros((len(tokens), max_len), dtype=np.int32)
    
    for i, text in enumerate(tokens):
        for j, word in enumerate(text):
            if j >= max_len:
                break
            #need to increment by 2 since the first two rows in the embedding matrix are reserved
            #if the word doesn't exist, get returns -1, which will be incremented to 1
            seqs[i,j] = vocab.get(word, -1) + 2
    return seqs

In [12]:
seqs = create_seqs(train["full_text"], vocab, max_len)
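
As a small worked example of the index scheme used in `create_seqs`, the sketch below encodes two toy token lists. The `encode` helper and the three-word vocabulary are hypothetical; they only mirror the `+ 2` offset, the zero padding, and the truncation at `max_len`.

```python
def encode(tokens, vocab, max_len):
    """Map tokens to row indices: 0 = padding, 1 = unknown word, i + 2 = vocab index i."""
    ids = [vocab.get(tok, -1) + 2 for tok in tokens[:max_len]]
    return ids + [0] * (max_len - len(ids))

toy_vocab = {"cat": 0, "joke": 1, "the": 2}  # hypothetical vocabulary

# A short text is padded with 0s; "meows" is not in the vocabulary, so it maps to 1.
short = encode(["the", "cat", "meows"], toy_vocab, max_len=5)
# A long text is truncated to the first max_len tokens.
long = encode(["the", "joke", "the", "cat", "the", "joke"], toy_vocab, max_len=5)

print(short)  # [4, 2, 1, 0, 0]
print(long)   # [4, 3, 4, 2, 4]
```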

In [13]:
#need validation set
seqs_train, seqs_valid, y_train, y_valid = train_test_split(seqs, train.funny.values, test_size = 0.125, random_state = 123)

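# 3 - First Model

Keep it very simple: a frozen embedding layer, masking of the padding, a single LSTM layer, and one dense layer feeding a sigmoid output.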

In [14]:
from keras.layers import Input, Dense, Masking, BatchNormalization
from keras.layers import Dropout, Embedding, SpatialDropout1D
from keras.layers import LSTM

from keras.models import Model
from keras.regularizers import l2
import tensorflow as tf

In [15]:
from keras import losses
### TODO: Train the model.


adam = optimizers.Adam()

In [20]:
def create_model(L2):
    embedding_layer = Embedding(embedding_matrix.shape[0],
                            embedding_matrix.shape[1],
                            weights=[embedding_matrix],
                            input_length=max_len,
                            trainable=False, name = 'embedding')
    
    sequence_input = Input(shape=(max_len,), dtype='int32', name='joke_seq')

    embedded_input = embedding_layer(sequence_input)

    mask_pads = Masking(mask_value=0.)(embedded_input)

    lstm = LSTM(200, implementation=2, unroll=True)(mask_pads)

    dense = Dense(200, activation='relu',
                 kernel_initializer='he_normal',
                 kernel_regularizer=l2(L2))(lstm)

    preds = Dense(1, activation = 'sigmoid',
                 kernel_regularizer=l2(L2))(dense)

    model = Model(inputs=sequence_input, outputs=preds)
    
    return model
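
The `Masking(mask_value=0.)` layer is the reason the padding row of the embedding matrix has to be all zeros. As far as I understand Keras, a timestep is skipped only when every feature of that timestep equals `mask_value`. The sketch below reproduces that rule in plain Python; `timestep_mask` and the toy embedding table are hypothetical and are not part of Keras.

```python
def timestep_mask(embedded, mask_value=0.0):
    """Return True for timesteps to keep: at least one feature differs from mask_value."""
    return [any(v != mask_value for v in step) for step in embedded]

# Toy embedding table (hypothetical): row 0 is the all-zero padding row.
table = [[0.0, 0.0], [0.3, -0.1], [0.5, 0.2]]

sequence = [2, 1, 0, 0]                  # two real tokens followed by padding
embedded = [table[i] for i in sequence]  # what the embedding layer would output
mask = timestep_mask(embedded)

print(mask)  # [True, True, False, False]
```

Any embedding row that happens to be all zeros is masked in the same way, so a zeroed out-of-vocabulary row would be skipped by the LSTM too.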

In [22]:
#once validation loss stops decreasing, resume training with a decreased amount of regularization
#stop this process once regularization is really low, or validation loss has not improved....
#starting with an L2 lambda of 0.1, and going to a min of 1e-8, means that there will be 73 iterations....
model_path="models/neural_decrease_l2.hdf5"
model_prev_path = "models/neural_decrease_l2_prev.hdf5"
history_list = list()
best_val_loss_list = list()
prev_best_val_loss = np.inf
i = 0
L2 = 0.1
checkpointer = ModelCheckpoint(filepath=model_path,
                               monitor='val_binary_crossentropy',
                               verbose=1,
                               mode='min',
                               save_best_only=True)

earlystopper = EarlyStopping(monitor='val_binary_crossentropy',
                             min_delta=0,
                             patience=3,
                             verbose=1,
                             mode='min')

while L2 > 1e-8:
    #the stronger the reg, more patience
    patience = max(3, int(8-i/8))
    print("\nStarting iteration {0:} with an L2 lambda of {1:0.8f} and a patience of {2:}\n".format(i, L2, patience))

    model = create_model(L2)
    if i != 0:
        model.load_weights(model_prev_path)
        

    
    earlystopper = EarlyStopping(monitor='val_binary_crossentropy',
                             min_delta=0,
                             patience=patience,
                             verbose=1,
                             mode='min')
    
    model.compile(loss='binary_crossentropy',
              optimizer=adam,
              metrics=['acc', losses.binary_crossentropy])
    
    history = model.fit(x = seqs_train, y = y_train, epochs=100, batch_size=2500,
                    validation_data=(seqs_valid, y_valid), callbacks=[checkpointer, earlystopper], verbose=1)
    best_val_loss = min(history.history["val_binary_crossentropy"])
    
    print("\nValidation loss has gone from {0:0.5f} to {1:0.5f}\n".format(prev_best_val_loss, best_val_loss))
    
    if best_val_loss > prev_best_val_loss:
        print("\nValidation loss has NOT improved. Ignoring new history\n")
    else:
        print("\nValidation HAS improved. Incorporating new history\n")
        #save the model as of its final epoch (even though the most recent epochs have not led to an improvement in validation score)
        #the checkpointer will ensure that we always have the model with the best validation score on hand
        #at model_path
        #saving to the "prev" path; this is what is loaded and used for the next iteration
        model.save(model_prev_path)
        best_val_loss_list.append(best_val_loss)
        history_list.append(history.history)
        prev_best_val_loss = best_val_loss
        
    L2 = L2 * 0.8
    i += 1


Starting iteration 0 with an L2 lambda of 0.10000000 and a patience of 8

Train on 171668 samples, validate on 24524 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100

KeyboardInterrupt: 
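
The schedule that the loop above walks through can be checked without training anything. The sketch below replays only the arithmetic of the loop (multiply the L2 lambda by 0.8 until it drops to 1e-8 or below, with patience `max(3, int(8 - i/8))`); the `l2_schedule` helper is hypothetical and exists only for this illustration.

```python
def l2_schedule(start=0.1, decay=0.8, floor=1e-8):
    """List (iteration, L2 lambda, patience) for the decaying-regularization loop."""
    steps = []
    i, l2 = 0, start
    while l2 > floor:
        # stronger regularization gets more patience, never fewer than 3 epochs
        patience = max(3, int(8 - i / 8))
        steps.append((i, l2, patience))
        l2 *= decay
        i += 1
    return steps

schedule = l2_schedule()
print(len(schedule))              # number of training rounds
print(schedule[0], schedule[-1])  # first and last (iteration, lambda, patience)
```

This confirms the 73 iterations mentioned in the comment: patience starts at 8 and has fallen to its floor of 3 by iteration 33.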

In [None]:
min(history.history["val_binary_crossentropy"])