<h1>Content<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Load" data-toc-modified-id="Load-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load</a></span></li><li><span><a href="#Clearing" data-toc-modified-id="Clearing-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Clearing</a></span></li><li><span><a href="#Vectorizing" data-toc-modified-id="Vectorizing-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Vectorizing</a></span></li><li><span><a href="#Build-model" data-toc-modified-id="Build-model-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Build model</a></span><ul class="toc-item"><li><span><a href="#Scores" data-toc-modified-id="Scores-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Scores</a></span></li><li><span><a href="#Models" data-toc-modified-id="Models-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Models</a></span></li></ul></li><li><span><a href="#Training" data-toc-modified-id="Training-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Training</a></span><ul class="toc-item"><li><span><a href="#Spacy-part" data-toc-modified-id="Spacy-part-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Spacy part</a></span><ul class="toc-item"><li><span><a href="#Test" data-toc-modified-id="Test-5.1.1"><span class="toc-item-num">5.1.1&nbsp;&nbsp;</span>Test</a></span></li></ul></li><li><span><a href="#Wor2vec-part" data-toc-modified-id="Wor2vec-part-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Wor2vec part</a></span><ul class="toc-item"><li><span><a href="#Test" data-toc-modified-id="Test-5.2.1"><span class="toc-item-num">5.2.1&nbsp;&nbsp;</span>Test</a></span></li></ul></li><li><span><a href="#Without-preprocessing" data-toc-modified-id="Without-preprocessing-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Without preprocessing</a></span><ul class="toc-item"><li><span><a href="#Spacy" data-toc-modified-id="Spacy-5.3.1"><span class="toc-item-num">5.3.1&nbsp;&nbsp;</span>Spacy</a></span></li><li><span><a href="#w2vec" data-toc-modified-id="w2vec-5.3.2"><span class="toc-item-num">5.3.2&nbsp;&nbsp;</span>w2vec</a></span></li></ul></li><li><span><a href="#Results" data-toc-modified-id="Results-5.4"><span class="toc-item-num">5.4&nbsp;&nbsp;</span>Results</a></span></li></ul></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></div>

### Load

In [1]:
import pandas as pd, numpy as np
import spacy, string
from time import time

In [2]:
train = pd.read_csv('data/train.csv', index_col='time')
test = pd.read_csv('data/test.csv', index_col='time')

In [3]:
X_train, y_train, X_valid, y_valid = train[[col for col in train.columns if col!='severity']], train['severity'],\
                                     test[[col for col in train.columns if col!='severity']], test['severity']

### Clearing

In [4]:
from nltk.corpus import stopwords

stopwords = stopwords.words('english')

nlp = spacy.load('en_core_web_sm')

In [5]:
# Clean text before feeding it to spaCy
punctuations = string.punctuation

# Define function to cleanup text by removing personal pronouns, stopwords, and puncuation
def cleanup_text(docs, logging=False):
    texts = []
    counter = 1
    for doc in docs:
        if counter % 1000 == 0 and logging:
            print("Processed %d out of %d documents." % (counter, len(docs)))
        counter += 1
        doc = nlp(doc, disable=['parser', 'ner'])
        tokens = [tok.lemma_.lower().strip() for tok in doc if tok.lemma_ != '-PRON-']
        tokens = [tok for tok in tokens if tok not in stopwords and tok not in punctuations]
        tokens = ' '.join(tokens)
        texts.append(tokens)
    return pd.Series(texts)

In [6]:
train_cleaned = cleanup_text(X_train['message_encoding'], logging=True)
valid_cleaned = cleanup_text(X_valid['message_encoding'], logging=True)

Processed 1000 out of 10364 documents.
Processed 2000 out of 10364 documents.
Processed 3000 out of 10364 documents.
Processed 4000 out of 10364 documents.
Processed 5000 out of 10364 documents.
Processed 6000 out of 10364 documents.
Processed 7000 out of 10364 documents.
Processed 8000 out of 10364 documents.
Processed 9000 out of 10364 documents.
Processed 10000 out of 10364 documents.


### Vectorizing

In [7]:
def get_vect(X_cleaned):
    # Parse documents and print some info
    print('Parsing documents...')
    start = time()
    train_vec = []
    for doc in nlp.pipe(X_cleaned, batch_size=256):
        if doc.has_vector:
            train_vec.append(doc.vector)
            # If doc doesn't have a vector, then fill it with zeros.
        else:
            train_vec.append(np.zeros((128,), dtype="float32"))
        
    # train_vec = [doc.vector for doc in nlp.pipe(train_cleaned, batch_size=500)]
    train_vec = np.array(train_vec)

    end = time()
    print('Total time passed parsing documents: {} seconds'.format(end - start))
    print('Total number of documents parsed: {}'.format(len(train_vec)))
    print('Size of vector embeddings: ', train_vec.shape[1])
    print('Shape of vectors embeddings matrix: ', train_vec.shape)
    return train_vec

In [8]:
train_vec = get_vect(train_cleaned)
valid_vec = get_vect(valid_cleaned)

Parsing documents...
Total time passed parsing documents: 76.28474140167236 seconds
Total number of documents parsed: 10364
Size of vector embeddings:  128
Shape of vectors embeddings matrix:  (10364, 128)
Parsing documents...
Total time passed parsing documents: 7.493882656097412 seconds
Total number of documents parsed: 922
Size of vector embeddings:  128
Shape of vectors embeddings matrix:  (922, 128)


### Build model

In [9]:
from keras.models import Sequential, Model
from keras.layers import Dense, Dropout, Input, LSTM, Embedding, Bidirectional, Flatten
from keras.layers import Conv1D, MaxPooling1D, GlobalMaxPooling1D
from keras.utils import to_categorical
from keras import backend as K

from keras.callbacks import EarlyStopping, ModelCheckpoint

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


#### Scores

In [10]:
def precision(y_true, y_pred):
    """Precision metric.

    Only computes a batch-wise average of precision.

    Computes the precision, a metric for multi-label classification of
    how many selected items are relevant.
    """
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision

def recall(y_true, y_pred):
    """Recall metric.

    Only computes a batch-wise average of recall.

    Computes the recall, a metric for multi-label classification of
    how many relevant items are selected.
    """
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall

def f1(y_true, y_pred):
    def recall(y_true, y_pred):
        """Recall metric.

        Only computes a batch-wise average of recall.

        Computes the recall, a metric for multi-label classification of
        how many relevant items are selected.
        """
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
        recall = true_positives / (possible_positives + K.epsilon())
        return recall

    def precision(y_true, y_pred):
        """Precision metric.

        Only computes a batch-wise average of precision.

        Computes the precision, a metric for multi-label classification of
        how many selected items are relevant.
        """
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
        precision = true_positives / (predicted_positives + K.epsilon())
        return precision
    precision = precision(y_true, y_pred)
    recall = recall(y_true, y_pred)
    return 2*((precision*recall)/(precision+recall))

####  Models

In [11]:
def build_model(architecture='mlp'):
    model = Sequential()
    if architecture == 'mlp':
        # Densely Connected Neural Network (Multi-Layer Perceptron)
        model.add(Dense(512, activation='relu', kernel_initializer='he_normal', input_dim=128))
        model.add(Dropout(0.2))
        model.add(Dense(512, activation='relu', kernel_initializer='he_normal'))
        model.add(Dropout(0.2))
        model.add(Dense(512, activation='relu', kernel_initializer='he_normal'))
        model.add(Dropout(0.2))
        model.add(Dense(512, activation='relu', kernel_initializer='he_normal'))
        model.add(Dropout(0.2))
        model.add(Dense(5, activation='softmax'))
    elif architecture == 'cnn':
        # 1-D Convolutional Neural Network
        inputs = Input(shape=(128,1))

        x = Conv1D(64, 3, strides=1, padding='same', activation='relu')(inputs)

        #Cuts the size of the output in half, maxing over every 2 inputs
        x = MaxPooling1D(pool_size=2)(x)
        x = Conv1D(128, 3, strides=1, padding='same', activation='relu')(x)
        x = GlobalMaxPooling1D()(x) 
        outputs = Dense(5, activation='softmax')(x)

        model = Model(inputs=inputs, outputs=outputs, name='CNN')
    elif architecture == 'lstm':
        # LSTM network
        inputs = Input(shape=(128,1))

        x = Bidirectional(LSTM(64, return_sequences=True),
                          merge_mode='concat')(inputs)
        
        x = Dropout(0.2)(x)
        x = Flatten()(x)
        outputs = Dense(5, activation='softmax')(x)

        model = Model(inputs=inputs, outputs=outputs, name='LSTM')
    else:
        print('Error: Model type not found.')
    
    model.compile(optimizer='adadelta', 
                  loss='categorical_crossentropy', 
                  metrics=[precision, recall, f1])
    
    return model

### Training

In [12]:
y_train_ohe = to_categorical(y_train)
y_valid_ohe = to_categorical(y_valid)

In [13]:
def prepare_data(X_train, model):
    if model.name == "CNN" or model.name == "LSTM":
        X_train = np.expand_dims(X_train, axis=2)
        print('Text train shape: ', X_train.shape)
    return X_train

In [14]:
def train(X, y, batch_size = 64, epochs = 100, name='Test'):
    models = [build_model(), build_model('cnn'), build_model('lstm')]
    fited = []
    
    for ind, model in enumerate(models):
        X_prep = prepare_data(X, model)
        file_path= name + "best_weights_%d.h5" %ind
        checkpoint = ModelCheckpoint(file_path, monitor='val_f1', 
                                     verbose=1, save_best_only=True, mode='max')
        
        early = EarlyStopping(monitor="val_f1", mode="max", patience=30)
        
        callbacks_list = [checkpoint, early]
        model.fit(X_prep, y,
                  validation_split=0.2,
                  batch_size=batch_size, 
                  epochs=epochs, 
                  callbacks=callbacks_list)
        model.load_weights(file_path)
        fited.append(model)
    return fited

#### Spacy part

In [None]:
keras_m = train(train_vec, y_train_ohe, name='wieghts/spacy_')

Train on 8291 samples, validate on 2073 samples
Epoch 1/100

Epoch 00001: val_f1 did not improve
Epoch 2/100

Epoch 00002: val_f1 did not improve
Epoch 3/100

Epoch 00003: val_f1 improved from -inf to 0.60792, saving model to wieghts/spacy_best_weights_0.h5
Epoch 4/100

Epoch 00004: val_f1 did not improve
Epoch 5/100

Epoch 00005: val_f1 did not improve
Epoch 6/100

Epoch 00006: val_f1 did not improve
Epoch 7/100

Epoch 00007: val_f1 did not improve
Epoch 8/100

Epoch 00008: val_f1 did not improve
Epoch 9/100

Epoch 00009: val_f1 did not improve
Epoch 10/100

Epoch 00010: val_f1 did not improve
Epoch 11/100

Epoch 00011: val_f1 did not improve
Epoch 12/100

Epoch 00012: val_f1 did not improve
Epoch 13/100

Epoch 00013: val_f1 did not improve
Epoch 14/100

Epoch 00014: val_f1 did not improve
Epoch 15/100

Epoch 00015: val_f1 did not improve
Epoch 16/100

Epoch 00016: val_f1 did not improve
Epoch 17/100

Epoch 00017: val_f1 did not improve
Epoch 18/100

Epoch 00018: val_f1 did not improv


Epoch 00033: val_f1 did not improve
Text train shape:  (10364, 128, 1)
Train on 8291 samples, validate on 2073 samples
Epoch 1/100

Epoch 00001: val_f1 improved from -inf to 0.17695, saving model to wieghts/spacy_best_weights_1.h5
Epoch 2/100

Epoch 00002: val_f1 improved from 0.17695 to 0.60576, saving model to wieghts/spacy_best_weights_1.h5
Epoch 3/100

Epoch 00003: val_f1 did not improve
Epoch 4/100

Epoch 00004: val_f1 improved from 0.60576 to 0.60830, saving model to wieghts/spacy_best_weights_1.h5
Epoch 5/100

Epoch 00005: val_f1 did not improve
Epoch 6/100

Epoch 00006: val_f1 did not improve
Epoch 7/100

Epoch 00007: val_f1 did not improve
Epoch 8/100

Epoch 00008: val_f1 did not improve
Epoch 9/100

Epoch 00009: val_f1 did not improve
Epoch 10/100

Epoch 00010: val_f1 did not improve
Epoch 11/100

Epoch 00011: val_f1 did not improve
Epoch 12/100

Epoch 00012: val_f1 did not improve
Epoch 13/100

Epoch 00013: val_f1 did not improve
Epoch 14/100

Epoch 00014: val_f1 did not im


Epoch 00031: val_f1 did not improve
Epoch 32/100

Epoch 00032: val_f1 did not improve
Epoch 33/100

Epoch 00033: val_f1 did not improve
Epoch 34/100

Epoch 00034: val_f1 did not improve
Text train shape:  (10364, 128, 1)
Train on 8291 samples, validate on 2073 samples
Epoch 1/100

Epoch 00001: val_f1 improved from -inf to 0.60673, saving model to wieghts/spacy_best_weights_2.h5
Epoch 2/100

Epoch 00002: val_f1 did not improve
Epoch 3/100

Epoch 00003: val_f1 did not improve
Epoch 4/100

Epoch 00004: val_f1 did not improve
Epoch 5/100

Epoch 00005: val_f1 did not improve
Epoch 6/100

Epoch 00006: val_f1 did not improve
Epoch 7/100

Epoch 00007: val_f1 did not improve
Epoch 8/100

Epoch 00008: val_f1 did not improve
Epoch 9/100

Epoch 00009: val_f1 did not improve
Epoch 10/100

Epoch 00010: val_f1 did not improve
Epoch 11/100

Epoch 00011: val_f1 did not improve
Epoch 12/100

Epoch 00012: val_f1 did not improve
Epoch 13/100

Epoch 00013: val_f1 did not improve
Epoch 14/100

Epoch 00014:


Epoch 00028: val_f1 did not improve
Epoch 29/100

Epoch 00029: val_f1 did not improve
Epoch 30/100

Epoch 00030: val_f1 did not improve
Epoch 31/100

Epoch 00031: val_f1 did not improve
Epoch 32/100

Epoch 00032: val_f1 did not improve
Epoch 33/100

Epoch 00033: val_f1 did not improve
Epoch 34/100

Epoch 00034: val_f1 did not improve
Epoch 35/100

Epoch 00035: val_f1 did not improve
Epoch 36/100

Epoch 00036: val_f1 did not improve
Epoch 37/100

Epoch 00037: val_f1 did not improve
Epoch 38/100

Epoch 00038: val_f1 did not improve
Epoch 39/100

Epoch 00039: val_f1 did not improve
Epoch 40/100

Epoch 00040: val_f1 did not improve
Epoch 41/100

Epoch 00041: val_f1 did not improve
Epoch 42/100

Epoch 00042: val_f1 did not improve
Epoch 43/100

Epoch 00043: val_f1 did not improve
Epoch 44/100

##### Test

In [None]:
def testing(X, y, models, ind = 3):
    scores = []
    for model in models:
        X_prep = prepare_data(X, model)
        scores.append(round(model.evaluate(X_prep, y)[ind], 4))
    return scores

In [None]:
print (testing(valid_vec, y_valid_ohe, keras_m))

<table>
  <tr>
    <th>DNN type</th>
    <th>F1-score with spacy</th>
  </tr>
  <tr>
    <td>Multi-Layer Perceptron</td>
    <td>0.57</td>
  </tr>
  <tr>
    <td>1-D Convolutional Neural Network</td>
    <td>0.57</td>
  </tr>
  <tr>
    <td>LSTM network</td>
    <td>0.57</td>
   </tr>
</table>

#### Wor2vec part

In [None]:
from gensim.models.word2vec import Word2Vec

import re

In [None]:
# Define function to preprocess text for a word2vec model
def cleanup_text_word2vec(docs, logging=False):
    sentences = []
    counter = 1
    for doc in docs:
        if counter % 1000 == 0 and logging:
            print("Processed %d out of %d documents" % (counter, len(docs)))
        # Disable tagger so that lemma_ of personal pronouns (I, me, etc) don't getted marked as "-PRON-"
        doc = nlp(doc, disable=['tagger'])
        # Grab lemmatized form of words and make lowercase
        doc = " ".join([tok.lemma_.lower() for tok in doc])
        # Split into sentences based on punctuation
        doc = re.split("[\.?!;] ", doc)
        # Remove commas, periods, and other punctuation (mostly commas)
        doc = [re.sub("[\.,;:!?]", "", sent) for sent in doc]
        # Split into words
        doc = [sent.split() for sent in doc]
        sentences += doc
        counter += 1
    return sentences

In [None]:
train_cleaned_w2v = cleanup_text_word2vec(X_train['message_encoding'], logging=True)
valid_cleaned_w2v = cleanup_text_word2vec(X_valid['message_encoding'], logging=True)

In [None]:
text_dim = 128
print("Training Word2Vec model...")
wordvec_model = Word2Vec(train_cleaned_w2v, size=text_dim, window=5, min_count=3, workers=11, sg=1)
print("Word2Vec model created.")
print("%d unique words represented by %d dimensional vectors" % (len(wordvec_model.wv.vocab), text_dim))

In [None]:
def create_average_vec(doc):
    average = np.zeros((text_dim,), dtype='float32')
    num_words = 0.
    for word in doc.split():
        if word in wordvec_model.wv.vocab:
            average = np.add(average, wordvec_model[word])
            num_words += 1.
    if num_words != 0.:
        average = np.divide(average, num_words)
    return average

In [None]:
def get_vect_w2v(X):
    X_vec = np.zeros((X.shape[0], text_dim), dtype="float32")
    for i in range(len(X)):
        X_vec[i] = create_average_vec(X[i])
    return X_vec

In [None]:
train_cleaned_vec = get_vect_w2v(train_cleaned)
valid_cleaned_vec = get_vect_w2v(valid_cleaned)

In [None]:
keras_m_vec = train(train_cleaned_vec, y_train_ohe, name='wieghts/w2v_')

##### Test

In [None]:
print (testing(valid_cleaned_vec, y_valid_ohe, keras_m_vec, ind=3))

<table>
  <tr>
    <th>DNN type</th>
    <th>F1-score with w2v</th>
  </tr>
  <tr>
    <td>Multi-Layer Perceptron</td>
    <td>0.42</td>
  </tr>
  <tr>
    <td>1-D Convolutional Neural Network</td>
    <td>0.34</td>
  </tr>
  <tr>
    <td>LSTM network</td>
    <td>0.57</td>
   </tr>
</table>

#### Without preprocessing

In [None]:
train_vec_w_c = get_vect(X_train['message_encoding'])
valid_vec_w_c = get_vect(X_valid['message_encoding'])

In [None]:
text_dim = 128
print("Training Word2Vec model...")
wordvec_model = Word2Vec(X_train['message_encoding'], size=text_dim, window=5, min_count=3, workers=11, sg=1)
print("Word2Vec model created.")
print("%d unique words represented by %d dimensional vectors" % (len(wordvec_model.wv.vocab), text_dim))

In [None]:
train_cleaned_vec_w_c = get_vect_w2v(X_train['message_encoding'].values)
valid_cleaned_vec_w_c = get_vect_w2v(X_valid['message_encoding'].values)

##### Spacy

In [None]:
keras_m_w_c = train(train_vec_w_c, y_train_ohe, name='wieghts/spacy_w_c_')

In [None]:
print (testing(valid_vec_w_c, y_valid_ohe, keras_m_w_c, ind=3))

##### w2vec

In [None]:
keras_m_vec_w_c = train(train_cleaned_vec_w_c, y_train_ohe, name='wieghts/w2v_w_c_')

In [None]:
print (testing(valid_cleaned_vec_w_c, y_valid_ohe, keras_m_vec_w_c, ind=3))

#### Results

<table>
  <tr>
    <th>DNN type</th>
    <th>F1-score with spacy(no cleaning)</th>
    <th>F1-score with w2v(no cleaning)</th>
  </tr>
  <tr>
    <td>Multi-Layer Perceptron</td>
    <td>0.55</td>
    <td>0.55</td>
  </tr>
  <tr>
    <td>1-D Convolutional Neural Network</td>
    <td>0.57</td>
    <td>0.55</td>
  </tr>
  <tr>
    <td>LSTM network</td>
    <td>0.56</td>
    <td>0.57</td>
   </tr>
</table>

### Conclusion

If we don't use preprocessing, it will show better performance in LTSM network.

When we use TF-IDF accuracy is better. We can try other architectures, customize for each its own optimizer, tune hyperparmeters. Add features to vector of word. 
