<h1>Content<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Load" data-toc-modified-id="Load-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load</a></span></li><li><span><a href="#Calculate-Term-Frequencies" data-toc-modified-id="Calculate-Term-Frequencies-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Calculate Term Frequencies</a></span></li><li><span><a href="#Build-keras-model" data-toc-modified-id="Build-keras-model-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Build keras model</a></span></li><li><span><a href="#Train-and-save-our-model" data-toc-modified-id="Train-and-save-our-model-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Train and save our model</a></span></li><li><span><a href="#Testing-on-validation-set" data-toc-modified-id="Testing-on-validation-set-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Testing on validation set</a></span></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></div>

## Load

In [1]:
import pandas as pd, numpy as np

In [2]:
train = pd.read_csv('data/train.csv', index_col='time')
test = pd.read_csv('data/test.csv', index_col='time')

In [3]:
# Every column except the target is a feature
feature_cols = [col for col in train.columns if col != 'severity']
X_train, y_train = train[feature_cols], train['severity']
X_valid, y_valid = test[feature_cols], test['severity']

## Calculate Term Frequencies

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer
import re

import scipy.sparse as sps

In [5]:
def Vect(df, use_idf=True, min_df=1, max_df=1.0, ngram_range=(4, 8)):
    '''
    Fit a TF-IDF or plain term-frequency vectorizer on df['message_encoding'].

    use_idf: bool, use TfidfVectorizer if True, CountVectorizer otherwise;
    min_df: int or float, minimum document frequency (count or proportion);
    max_df: int or float, maximum document frequency (count or proportion);
    ngram_range: tuple (min_n, max_n) of word n-gram sizes.

    Returns the vectorized sparse matrix and the fitted vectorizer.
    '''
    # Both vectorizers take the same arguments, so pick the class once
    vectorizer_cls = TfidfVectorizer if use_idf else CountVectorizer
    vectorizer = vectorizer_cls(stop_words=stopwords.words('english'), preprocessor=None,
                                ngram_range=ngram_range, strip_accents='ascii', analyzer='word',
                                min_df=min_df, max_df=max_df, tokenizer=None)

    message_encoding = vectorizer.fit_transform(df['message_encoding'])

    return message_encoding, vectorizer

In [6]:
X_train_idf, idf_m = Vect(X_train, max_df=0.80, min_df=30)
X_valid_idf = idf_m.transform(X_valid['message_encoding'])

X_train_tf, tf_m = Vect(X_train, use_idf=False, max_df=0.80, min_df=30)
X_valid_tf = tf_m.transform(X_valid['message_encoding'])
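
To see what `ngram_range=(4, 8)` means in practice, here is a small sketch on two made-up commit messages. The messages are hypothetical, and stop-word removal is left out for brevity. With word n-grams of length 4 to 8, a message with fewer than four tokens contributes no features at all.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up messages: one with five tokens, one with only two
messages = ["fix null pointer crash today", "fix typo"]

vectorizer = CountVectorizer(ngram_range=(4, 8), analyzer='word')
X = vectorizer.fit_transform(messages)

# Only the five-token message can produce 4-grams and 5-grams
ngrams = sorted(vectorizer.vocabulary_)
print(ngrams)

# Number of non-zero features per message
nnz_per_row = X.getnnz(axis=1)
print(nnz_per_row)
```

If many messages in the data are short, a lower bound smaller than 4 may be worth trying; that is a suggestion, not something tested in this notebook.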

In [7]:
# Author and timestamp features appended to the text features
meta_cols = ['author_name', 'author_email', 'committer_name', 'committer_email',
             'tm_year', 'tm_mon', 'tm_mday', 'tm_hour', 'tm_wday', 'tm_yday', 'first_w']

X_train_idf = sps.hstack((X_train_idf, train[meta_cols]))
X_valid_idf = sps.hstack((X_valid_idf, test[meta_cols]))

X_train_tf = sps.hstack((X_train_tf, train[meta_cols]))
X_valid_tf = sps.hstack((X_valid_tf, test[meta_cols]))
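
The stacking step above can be sketched on toy data. All values below are made up; the point is only to show how `sps.hstack` joins a sparse text matrix with dense numeric columns row by row.

```python
import numpy as np
import scipy.sparse as sps

# Made-up stand-ins: 3 documents, 4 text features, 2 numeric metadata columns
text_features = sps.csr_matrix(np.array([[0.0, 0.5, 0.0, 0.5],
                                         [1.0, 0.0, 0.0, 0.0],
                                         [0.0, 0.0, 0.0, 0.0]]))
metadata = np.array([[2018, 3],
                     [2018, 4],
                     [2017, 11]])

# Every block must have the same number of rows; format='csr' returns a
# row-sliceable matrix instead of the default COO format
combined = sps.hstack((text_features, sps.csr_matrix(metadata)), format='csr')

print(combined.shape)
print(combined.toarray()[0])
```

Note that columns such as the year are on a much larger scale than the TF-IDF weights, so scaling the metadata before stacking may be worth trying.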

## Build keras model

I chose a popular architecture (embedding, 1D convolution, max pooling, LSTM) and adjusted it a little.

In [8]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, Dropout
from keras.layers import LSTM, Activation
from keras.layers.convolutional import Convolution1D
from keras.layers.convolutional import MaxPooling1D
from keras.preprocessing import sequence
from keras.callbacks import EarlyStopping, ModelCheckpoint

Using TensorFlow backend.


In [9]:
# Embedding
max_features = 500
maxlen = 1344
embedding_size = 32

# Convolution
filter_length = 4
nb_filter = 32
pool_length = 8

# LSTM
lstm_output_size = 150

In [10]:
from keras import backend as K

def precision(y_true, y_pred):
    """Precision metric.

    Only computes a batch-wise average of precision.

    Computes the precision, a metric for multi-label classification of
    how many selected items are relevant.
    """
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision

def recall(y_true, y_pred):
    """Recall metric.

    Only computes a batch-wise average of recall.

    Computes the recall, a metric for multi-label classification of
    how many relevant items are selected.
    """
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall

def f1(y_true, y_pred):
    """F1 metric.

    Only computes a batch-wise average: the harmonic mean of the
    precision and recall metrics defined above.
    """
    p = precision(y_true, y_pred)
    r = recall(y_true, y_pred)
    # K.epsilon() guards against 0/0 when a batch has no true positives
    return 2 * ((p * r) / (p + r + K.epsilon()))
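
To check what these batch-wise metrics compute, here is a NumPy mirror of the same formulas on a made-up batch. The names `precision_np`, `recall_np` and `f1_np` are hypothetical helpers for this sketch, and the epsilon in the F1 denominator is an assumption added to avoid a 0/0 division.

```python
import numpy as np

EPS = 1e-7  # same value as the default K.epsilon()

def precision_np(y_true, y_pred):
    # Rounded predictions that coincide with a true label, over all rounded predictions
    tp = np.sum(np.round(np.clip(y_true * y_pred, 0, 1)))
    predicted = np.sum(np.round(np.clip(y_pred, 0, 1)))
    return tp / (predicted + EPS)

def recall_np(y_true, y_pred):
    # Rounded predictions that coincide with a true label, over all true labels
    tp = np.sum(np.round(np.clip(y_true * y_pred, 0, 1)))
    possible = np.sum(np.round(np.clip(y_true, 0, 1)))
    return tp / (possible + EPS)

def f1_np(y_true, y_pred):
    p, r = precision_np(y_true, y_pred), recall_np(y_true, y_pred)
    return 2 * p * r / (p + r + EPS)

# Made-up one-hot labels and sigmoid-like outputs for 4 samples, 3 classes
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=float)
y_pred = np.array([[0.9, 0.2, 0.1],
                   [0.6, 0.7, 0.1],
                   [0.2, 0.3, 0.4],
                   [0.8, 0.1, 0.1]])

p = precision_np(y_true, y_pred)
r = recall_np(y_true, y_pred)
f = f1_np(y_true, y_pred)
print(round(p, 4), round(r, 4), round(f, 4))
```

Because every output is rounded at 0.5 independently, a sample can count as zero predictions (third row) or two predictions (second row). The metric is therefore a micro-averaged, thresholded score rather than an argmax-based one.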

In [11]:
model = Sequential()
model.add(Embedding(max_features, embedding_size, input_length=maxlen))
model.add(Dropout(0.3))
# Keras 2 argument names (filters, kernel_size, padding, strides, pool_size)
# replace the deprecated Keras 1 ones
model.add(Convolution1D(filters=nb_filter,
                        kernel_size=filter_length,
                        padding='valid',
                        activation='relu',
                        strides=1))
model.add(MaxPooling1D(pool_size=pool_length))
model.add(LSTM(lstm_output_size))
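model.add(Dense(5))
# Note: 'softmax' is the usual pairing with categorical_crossentropy for
# single-label classification; 'sigmoid' is kept here as in the original run
model.add(Activation('sigmoid'))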

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=[f1, recall, precision])

  


In [12]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 1344, 32)          16000     
_________________________________________________________________
dropout_1 (Dropout)          (None, 1344, 32)          0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 1341, 32)          4128      
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 167, 32)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 150)               109800    
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 755       
_________________________________________________________________
activation_1 (Activation)    (None, 5)                 0         
Total para
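
The output shapes and parameter counts in the summary above can be reproduced by hand from the hyperparameters. The following is a plain-Python sketch of that arithmetic, not a call into Keras; `n_classes` is a name introduced here for the 5 output units.

```python
# Hyperparameters copied from the notebook
max_features, maxlen, embedding_size = 500, 1344, 32
filter_length, nb_filter, pool_length = 4, 32, 8
lstm_output_size, n_classes = 150, 5

# A 'valid' convolution with stride 1 shortens the sequence by filter_length - 1
conv_len = maxlen - filter_length + 1
# Max pooling without padding, with stride equal to the pool size
pool_len = (conv_len - pool_length) // pool_length + 1

# One embedding vector per vocabulary index
embedding_params = max_features * embedding_size
# Each filter spans filter_length steps over all input channels, plus a bias
conv_params = filter_length * embedding_size * nb_filter + nb_filter
# An LSTM has 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((nb_filter + lstm_output_size) * lstm_output_size + lstm_output_size)
# Fully connected layer: weights plus biases
dense_params = lstm_output_size * n_classes + n_classes

print(conv_len, pool_len)                                         # 1341 167
print(embedding_params, conv_params, lstm_params, dense_params)   # 16000 4128 109800 755
```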

## Train and save our model

In [13]:
from keras.utils import to_categorical
y_train_ohe = to_categorical(y_train)
y_valid_ohe = to_categorical(y_valid)
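
When `num_classes` is not given, `to_categorical` infers the number of columns from the largest label it sees. The NumPy sketch below illustrates the consequence on made-up labels; `one_hot` is a hypothetical stand-in for this illustration, not the Keras function.

```python
import numpy as np

def one_hot(labels, num_classes=None):
    """One-hot encode integer labels 0..num_classes-1."""
    labels = np.asarray(labels, dtype=int)
    if num_classes is None:
        # Infer the width from the largest label present
        num_classes = labels.max() + 1
    return np.eye(num_classes, dtype='float32')[labels]

y_a = one_hot([0, 2, 4])            # labels reach 4 -> 5 columns
y_b = one_hot([0, 1, 3])            # highest class absent -> only 4 columns
y_b_fixed = one_hot([0, 1, 3], 5)   # explicit width keeps both sets aligned

print(y_a.shape, y_b.shape, y_b_fixed.shape)
```

If the validation set happened to lack the highest severity class, the two encoded matrices would have different widths, so passing `num_classes=5` explicitly to `to_categorical` may be the safer choice.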

In [14]:
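def train(X, y, batch_size=64, epochs=100, name='Test'):
    '''
    Fit the global `model` and return it with the best checkpoint loaded.

    Note: this definition rebinds the name `train`, which held the
    training DataFrame in the cells above.
    '''
    file_path = name + "best_weights.h5"
    # Keep only the weights from the epoch with the highest validation F1
    checkpoint = ModelCheckpoint(file_path, monitor='val_f1',
                                 verbose=1, save_best_only=True, mode='max')

    # Stop once validation F1 has not improved for 30 consecutive epochs
    early = EarlyStopping(monitor="val_f1", mode="max", patience=30)

    callbacks_list = [checkpoint, early]
    # The last 20% of the rows are held out for validation;
    # the sparse matrix is converted to a dense array for Keras
    model.fit(X.toarray(), y,
              validation_split=0.2,
              batch_size=batch_size,
              epochs=epochs,
              callbacks=callbacks_list)
    # Restore the best weights rather than those of the last epoch
    model.load_weights(file_path)
    return model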
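
The interplay of "save the best epoch" and "stop after `patience` epochs without improvement" can be sketched in plain Python. This is a simplified illustration of the idea, not the Keras implementation; `run_with_patience` and the scores are made up for the example.

```python
def run_with_patience(val_scores, patience):
    """Return (best_epoch, stop_epoch), 0-indexed, for a metric to maximise."""
    best_score = float('-inf')
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            # New best: this is the epoch a checkpoint would keep
            best_score, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    # Ran through all epochs without triggering the stop
    return best_epoch, len(val_scores) - 1

# Made-up validation scores, not taken from the training log
toy_scores = [0.60, 0.62, 0.61, 0.61, 0.63, 0.62]
stopped_early = run_with_patience(toy_scores, patience=2)
ran_to_end = run_with_patience(toy_scores, patience=30)
print(stopped_early, ran_to_end)
```

With a small patience the run stops before the later, better epoch is reached. A large patience, such as the 30 used here, trades extra training time for a lower risk of stopping too soon.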

In [15]:
keras_m = train(X_train_idf, y_train_ohe, epochs=100, name='wieghts/tfidf')

Train on 8291 samples, validate on 2073 samples
Epoch 1/100

Epoch 00001: val_f1 improved from -inf to 0.60927, saving model to wieghts/tfidfbest_weights.h5
Epoch 2/100

Epoch 00002: val_f1 improved from 0.60927 to 0.61023, saving model to wieghts/tfidfbest_weights.h5
Epoch 3/100

Epoch 00003: val_f1 improved from 0.61023 to 0.61116, saving model to wieghts/tfidfbest_weights.h5
Epoch 4/100

Epoch 00004: val_f1 improved from 0.61116 to 0.62470, saving model to wieghts/tfidfbest_weights.h5
Epoch 5/100

Epoch 00005: val_f1 did not improve
Epoch 6/100

Epoch 00006: val_f1 improved from 0.62470 to 0.62654, saving model to wieghts/tfidfbest_weights.h5
Epoch 7/100

Epoch 00007: val_f1 did not improve
Epoch 8/100

Epoch 00008: val_f1 improved from 0.62654 to 0.63858, saving model to wieghts/tfidfbest_weights.h5
Epoch 9/100

Epoch 00009: val_f1 did not improve
Epoch 10/100

Epoch 00010: val_f1 did not improve
Epoch 11/100

Epoch 00011: val_f1 did not improve
Epoch 12/100

Epoch 00012: val_f1 di

## Testing on validation set

In [17]:
# evaluate() returns [loss, f1, recall, precision]; index 1 is the F-score
print('F-score:', keras_m.evaluate(X_valid_idf.toarray(), y_valid_ohe)[1])

F-score: 0.5913594516394195


## Conclusion

The score is a little worse than the XGBoost one. I'll leave this model as it is and try more Keras solutions in other notebooks. We can try:

- KerasClassifier
- Text vectorizing with the Keras preprocessing tools
- More data: neural networks tend to score better with more data, which is probably why XGBoost performs better here.
- Add more layers!
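
As a rough idea of what the text-vectorizing bullet could look like, here is a pure-Python sketch of turning messages into padded sequences of integer word indices, which is the kind of input an `Embedding` layer expects. It is a simplified stand-in and not the Keras preprocessing API; `fit_vocab`, `encode` and the sample messages are hypothetical.

```python
from collections import Counter

PAD, OOV = 0, 1  # reserved indices: padding and out-of-vocabulary words

def fit_vocab(texts, max_features):
    """Map the most frequent words to indices 2..max_features-1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = [word for word, _ in counts.most_common(max_features - 2)]
    return {word: index + 2 for index, word in enumerate(most_common)}

def encode(texts, vocab, maxlen):
    """Convert texts to fixed-length lists of word indices."""
    rows = []
    for text in texts:
        ids = [vocab.get(word, OOV) for word in text.lower().split()][:maxlen]
        # Pad on the right so every row has exactly maxlen entries
        rows.append(ids + [PAD] * (maxlen - len(ids)))
    return rows

# Made-up commit messages
train_texts = ["fix crash", "fix typo in docs"]
vocab = fit_vocab(train_texts, max_features=500)
encoded = encode(["fix unknown crash"], vocab, maxlen=5)
print(encoded)
```

Every index stays below `max_features`, so the rows can index an embedding table of that size directly.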