# 1 - Introduction

This notebook prepares the files that will be used to train the artificial neural network (ANN) models, with a focus on recurrent neural networks (RNNs). It saves the pretrained embedding matrices and the padded token sequences for the train, validation, and test sets.

## 1.1 Load Packages and Global Variables

In [1]:
%matplotlib inline
import os
import io
import zipfile
import requests
import numpy as np
import nltk
import joblib
from sklearn.feature_extraction.text import CountVectorizer
from nltk import wordpunct_tokenize
from sklearn.model_selection import train_test_split
import scipy.stats

In [2]:
PROJECT_DIR = os.path.join(os.getcwd(), os.pardir)
os.chdir(PROJECT_DIR)

## 1.2 Load the Data

Load the cleaned train and test data sets if they are present, otherwise raise an exception.

In [3]:
#Load data
try:
    train = joblib.load('data/interim/train.pkl')
    test = joblib.load('data/interim/test.pkl')
except FileNotFoundError:
    #need to run earlier notebook if files not present
    raise Exception("Files not found. Run Notebook 1")

# 2 - Create Embedding Matrix

Use sklearn's CountVectorizer to create a dictionary that maps tokens to indexes. Use the nltk tokenizer instead of sklearn's built-in tokenizer, since nltk doesn't completely ignore punctuation.

In [4]:
vectorizer = CountVectorizer(tokenizer=wordpunct_tokenize, min_df=5)
vectorizer.fit(train["full_text"])
vocab = vectorizer.vocabulary_
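
To see why the tokenizer choice matters, the sketch below compares the two tokenization rules on a toy string using only the standard library. It assumes that `wordpunct_tokenize` behaves like the regular expression `\w+|[^\w\s]+` and that sklearn's default `token_pattern` is `(?u)\b\w\w+\b`. The sample text and the constant names are made up for illustration.

```python
import re

# Assumed regex equivalents of the two tokenizers (see the note above)
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")  # nltk wordpunct_tokenize
SKLEARN_PATTERN = re.compile(r"(?u)\b\w\w+\b")   # CountVectorizer default

sample = "wait... a joke?!"  # made-up example text

wordpunct_tokens = WORDPUNCT_PATTERN.findall(sample)
sklearn_tokens = SKLEARN_PATTERN.findall(sample)

print(wordpunct_tokens)  # runs of punctuation are kept as their own tokens
print(sklearn_tokens)    # punctuation and one-character words are dropped
```

For a task like humor detection, tokens such as "..." and "?!" may carry signal, which is one plausible reason to keep them.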

In [5]:
#this function downloads the GloVe embeddings
#it will only be called if necessary
def download_Glove():
    url = 'http://nlp.stanford.edu/data/glove.6B.zip'
    out_name='data/external/GloveVectors'
    response = requests.get(url, allow_redirects=True)
    z = zipfile.ZipFile(io.BytesIO(response.content))
    z.extractall(path = out_name)

In [6]:
#this function takes two inputs: the path to the embedding file and the word index
#the word index is a dictionary with a word as key and its index number as value
#the function reads the embedding file into a dictionary, one key per line/word,
#with the embedding vector as the value
#it then creates an embedding matrix with a row for each word in the index
#and iterates over the word index, copying each pretrained vector into its row
def create_embedmatrix(embedding_file, word_index):
    #word embedding
    embeddings_index = {}
    not_found = {}
    try:
        f = open(embedding_file, encoding="utf8")
    except FileNotFoundError:
        print("Pretrained GloVe embeddings not found. Downloading them.")
        download_Glove()
        f = open(embedding_file, encoding="utf8")
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
    f.close()
    print('Found %s word vectors.' % len(embeddings_index))
    embedding_dim = next(iter(embeddings_index.values())).shape[0]
    
    #now make the embedding matrix
    #row 0 is reserved for the padding character - all 0s so it can be masked later
    #row 1 is reserved for words that are not in vocab - also all 0s
    #the word with vocab index i lives in row i+2, matching the ids built in create_seqs
    #words that are in vocab but not in the embeddings get their own random rows
    #(might want to train them later), drawn from a truncated normal
    lower = -2
    upper = 2
    mu = 0.0 #mean
    sigma = 0.5 #standard deviation
    embedding_matrix = scipy.stats.truncnorm.rvs(
        (lower-mu)/sigma, (upper-mu)/sigma, loc=mu, scale=sigma,
        size=(len(word_index)+2, embedding_dim))
    #zero the two reserved rows (padding and out-of-vocab)
    embedding_matrix[:2,:] = 0.

    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            #overwrite the random row with the pretrained vector
            embedding_matrix[i+2] = embedding_vector
        else:
            #no pretrained vector: the word keeps its random initialization
            not_found[word] = not_found.get(word, 0) + 1
    return embedding_matrix, embedding_dim, not_found
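
The intended row layout (row 0 for padding, row 1 for out-of-vocabulary words, then one row per vocabulary word) can be illustrated on toy data. This is a minimal sketch rather than the function above: `toy_glove`, `toy_vocab`, and the vectors are made up, plain Python lists stand in for the numpy array, and uniform noise replaces the truncated normal for brevity.

```python
import io
import random

# Made-up stand-in for a GloVe text file: a word, then its vector, per line
toy_glove = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "joke 0.4 0.5 0.6\n"
)
# Made-up vocabulary in the same shape as vectorizer.vocabulary_
toy_vocab = {"joke": 0, "lol": 1, "the": 2}

# Parse each line into word -> list of floats
pretrained = {}
for line in toy_glove:
    word, *coefs = line.split()
    pretrained[word] = [float(c) for c in coefs]
dim = len(next(iter(pretrained.values())))

rng = random.Random(0)
# Two reserved rows plus one row per vocabulary word, randomly initialized
# (uniform noise is a simplification of the truncated normal)
rows = [[rng.uniform(-2, 2) for _ in range(dim)]
        for _ in range(len(toy_vocab) + 2)]
rows[0] = [0.0] * dim  # padding
rows[1] = [0.0] * dim  # out-of-vocabulary

missing = []
for word, i in toy_vocab.items():
    if word in pretrained:
        rows[i + 2] = pretrained[word]  # vocab index i -> row i + 2
    else:
        missing.append(word)            # keeps its random row

print(len(rows), missing)
```

Here "lol" has no pretrained vector, so row 3 stays random while rows 2 and 4 hold the vectors for "joke" and "the".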

In [7]:
embedding_matrix, embedding_dim, not_found = create_embedmatrix('data/external/GloveVectors/glove.6B.50d.txt', vocab)

Found 400000 word vectors.


In [8]:
print("there were {} words that did not have a corresponding entry in the pretrained vectors. Their values "
      "have been initialized randomly".format(len(not_found)))

there were 1507 words that did not have a corresponding entry in the pretrained vectors. Their values have been initialized randomly


The count above shows the need to be able to update word embeddings, instead of just ignoring words that have no pretrained vector.

In [10]:
#now output
joblib.dump(embedding_matrix, "data/interim/embeddings50.pkl")

['data/interim/embeddings50.pkl']

In [11]:
embedding_matrix, embedding_dim, not_found = create_embedmatrix('data/external/GloveVectors/glove.6B.300d.txt', vocab)

Found 400000 word vectors.


In [12]:
#now output
joblib.dump(embedding_matrix, "data/interim/embeddings300.pkl")

['data/interim/embeddings300.pkl']

## 2.2 Sequences

Now transform the texts into sequences of token indexes. This should really be made into a pipeline.

In [None]:
#first, need to decide on a max length
#actually, do this later

In [13]:
#need to justify why max length is 300
#a longer length doesn't really cost anything...
max_len = 300
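
One way to justify the choice of max length would be to check what share of texts fit within it. The sketch below is a hypothetical helper (`coverage` is not part of this notebook) run on made-up texts; it uses the regular expression `\w+|[^\w\s]+` as an assumed stand-in for `wordpunct_tokenize`.

```python
import re

def coverage(texts, max_len):
    """Fraction of texts whose token count fits within max_len."""
    pattern = re.compile(r"\w+|[^\w\s]+")  # assumed stand-in for wordpunct_tokenize
    lengths = [len(pattern.findall(text.lower())) for text in texts]
    return sum(n <= max_len for n in lengths) / len(lengths)

# Made-up texts with 2, 5, 1, and 10 tokens
toy_texts = [
    "short one",
    "a bit longer text here",
    "tiny",
    "this one has quite a few more tokens in it",
]
print(coverage(toy_texts, max_len=5))
```

On the real data the same check would be `coverage(train["full_text"], 300)`; a value close to 1 would indicate that few texts are truncated.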

In [14]:
def create_seqs(texts, vocab, max_len):
    tokens = []
    for text in texts:
        tokens.append(wordpunct_tokenize(text.lower()))
    seqs = np.zeros((len(tokens), max_len), dtype=np.int32)
    
    for i, text in enumerate(tokens):
        for j, word in enumerate(text):
            if j >= max_len:
                break
            #need to increment by 2 since rows 0 and 1 of the embedding matrix are reserved
            #if the word doesn't exist, get returns -1, which will be incremented to 1
            seqs[i,j] = vocab.get(word, -1) + 2
    return seqs
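
The id scheme used by create_seqs (0 for padding, 1 for unknown words, vocabulary index plus 2 otherwise) can be traced on a toy input. This is a pure-Python sketch of a single row, not a replacement for the function: `encode` and `toy_vocab` are hypothetical names, and the regular expression is an assumed stand-in for `wordpunct_tokenize`.

```python
import re

PAD = 0     # id used to fill unused positions
OFFSET = 2  # shift applied to vocab indexes (ids 0 and 1 are reserved)

def encode(text, vocab, max_len):
    """Pure-Python sketch of how one row of create_seqs is filled."""
    # Assumed stand-in for wordpunct_tokenize
    tokens = re.findall(r"\w+|[^\w\s]+", text.lower())
    # Unknown words: get returns -1, and -1 + 2 = 1, the out-of-vocab id
    ids = [vocab.get(tok, -1) + OFFSET for tok in tokens[:max_len]]
    return ids + [PAD] * (max_len - len(ids))  # right-pad with zeros

toy_vocab = {"funny": 0, "joke": 1, "!": 2}  # made-up vocabulary

padded = encode("Funny joke, right?!", toy_vocab, max_len=8)
truncated = encode("joke joke joke", toy_vocab, max_len=2)
print(padded)
print(truncated)
```

Note that "?!" is a single token under this tokenizer, so it is treated as unknown even though "!" is in the toy vocabulary.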

In [15]:
seqs = create_seqs(train["full_text"], vocab, max_len)

In [16]:
#need validation set
seqs_train, seqs_valid, y_train, y_valid = train_test_split(seqs, train.funny.values, test_size = 0.125, random_state = 123)
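
With test_size = 0.125, one eighth of the training rows are held out for validation. The arithmetic can be sketched as follows, assuming scikit-learn rounds the held-out count up and gives the remainder to the training split; `split_sizes` is a hypothetical helper and the sample count of 8000 is made up.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_valid) for a fractional test_size."""
    n_valid = math.ceil(test_size * n_samples)  # held-out rows, rounded up
    return n_samples - n_valid, n_valid

sizes = split_sizes(8000, 0.125)  # 8000 is an illustrative sample count
print(sizes)
```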

In [17]:
#output train seqs
joblib.dump({"seqs":seqs_train, "labels":y_train},"data/processed/train_nn.pkl")

['data/processed/train_nn.pkl']

In [18]:
#output validation seqs
joblib.dump({"seqs":seqs_valid, "labels":y_valid},"data/processed/valid_nn.pkl")

['data/processed/valid_nn.pkl']

In [23]:
#create test seqs
seqs_test = create_seqs(test["full_text"], vocab, max_len)

In [24]:
#output test seqs
joblib.dump({"seqs":seqs_test, "labels":test.funny.values},"data/processed/test_nn.pkl")

['data/processed/test_nn.pkl']