## Intro

In this notebook, you will see:
* How this project is organized
* How each model is trained and evaluated

Please follow the links provided and refer to the other notebooks for more information on each model.

## Agenda:
1. Baseline: Support Vector Classifier (SVC)
2. Logistic Regression
3. Multi-Layer Perceptron (MLP)
4. Long Short-Term Memory (LSTM)
5. Convolutional Neural Network (CNN)

## Baseline: Support Vector Classifier (SVC)

Please find the code cell and outputs in [SVC](SVC.ipynb).

### Tokenize the comments

In [None]:
vec = TfidfVectorizer()
train_tokens = vec.fit_transform(train['comment_text'])
test_tokens = vec.transform(test['comment_text'])
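
As a minimal sketch of what this step produces, the following fits a `TfidfVectorizer` on a tiny made-up corpus. The `toy_train` and `toy_test` lists and all `toy_*` names are illustrative assumptions, not project data. The vocabulary is learned from the training text only, so a word that appears only in the test text gets no column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up comments, for illustration only
toy_train = ["you are great", "you are awful", "great work"]
toy_test = ["awful work indeed"]

toy_vec = TfidfVectorizer()
toy_train_tokens = toy_vec.fit_transform(toy_train)  # learns the vocabulary
toy_test_tokens = toy_vec.transform(toy_test)        # reuses the same vocabulary

print(sorted(toy_vec.vocabulary_))
print(toy_train_tokens.shape, toy_test_tokens.shape)
```

Both matrices are sparse and share the same number of columns, which is what allows them to be stacked in the preprocessing step.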

### Preprocessing (transforming the vectors)

In [None]:
from scipy.sparse import vstack
global_tokens = vstack([train_tokens, test_tokens])
# Apply Truncated Singular Value Decomposition (SVD)
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=100, n_iter=10)
svd.fit(global_tokens)
global_svd = svd.transform(global_tokens)
# Apply MaxAbsScaler
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
scaler.fit(global_svd)
global_scaled = scaler.transform(global_svd)
train_scaled = global_scaled[:len(train)]
test_scaled = global_scaled[len(train):]
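
The same pipeline can be tried on random sparse data to check the shapes and the value range it produces. This is only a sketch: the `toy_*` matrices are randomly generated assumptions, and the number of components is reduced to 5 to fit the small input.

```python
import numpy as np
from scipy.sparse import random as sparse_random, vstack
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import MaxAbsScaler

# Random sparse stand-ins for the TF-IDF matrices (illustrative only)
toy_train = sparse_random(12, 50, density=0.2, format="csr", random_state=0)
toy_test = sparse_random(8, 50, density=0.2, format="csr", random_state=1)
toy_global = vstack([toy_train, toy_test])

# Reduce 50 sparse columns to 5 dense components
toy_svd = TruncatedSVD(n_components=5, n_iter=10, random_state=0).fit_transform(toy_global)
# Divide each column by its largest absolute value
toy_scaled = MaxAbsScaler().fit_transform(toy_svd)

# Split back into the train and test rows
toy_train_scaled = toy_scaled[:toy_train.shape[0]]
toy_test_scaled = toy_scaled[toy_train.shape[0]:]
print(toy_train_scaled.shape, toy_test_scaled.shape)
print(np.abs(toy_scaled).max(axis=0))
```

After `MaxAbsScaler`, every component lies in [-1, 1] and the largest absolute value in each column is 1.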

### Predict for each class with the pre-built SVC model

In [None]:
import numpy as np
from sklearn import svm
from tqdm import tqdm

predict_on_test = np.zeros((len(test), len(labels)))
predict_on_train = np.zeros((len(train), len(labels)))
# Fit one binary SVC per label and keep the positive-class probability
for idx, label in tqdm(enumerate(labels)):
    print("Training %s" % label)
    m = svm.SVC(probability=True).fit(train_scaled, train[label])
    predict_on_test[:, idx] = m.predict_proba(test_scaled)[:, 1]
    predict_on_train[:, idx] = m.predict_proba(train_scaled)[:, 1]

Score on Kaggle = 0.73786

### Improvement: balancing the weight of each class

In [None]:
m = svm.SVC(probability=True, class_weight='balanced').fit(train_scaled, train[label])

Score on Kaggle = 0.91595

Balancing the class weights makes errors on the rarer class count more heavily. It is an effective fix for the weak baseline and raises the Kaggle score from 0.73786 to 0.91595.
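
To see what `class_weight='balanced'` does, the weights can be computed by hand. scikit-learn documents the formula as `n_samples / (n_classes * np.bincount(y))`. The label vector below is a made-up assumption with a 9:1 imbalance, not the project's data.

```python
import numpy as np

# Made-up binary labels: nine negatives and one positive
toy_labels = np.array([0] * 9 + [1] * 1)

n_samples = len(toy_labels)
class_counts = np.bincount(toy_labels)  # samples per class
n_classes = len(class_counts)

# Weights are inversely proportional to class frequency
balanced_weights = n_samples / (n_classes * class_counts)
print(balanced_weights)
```

The rare class receives a weight nine times larger than the common one, so each of its samples contributes more to the training objective.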

## Logistic Regression

Please find the code cell and outputs in [Logistic Regression](Logistic_Regression.ipynb).

The input and the preprocessing of the Logistic Regression model are the same as for SVC. We simply replaced the SVC model with the pre-built Logistic Regression model in sklearn.

In [None]:
m = LogisticRegression().fit(train_scaled, train[label])

### Evaluate the model with cross validation

In [None]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import log_loss, roc_auc_score
def crossValidation(clf, X, y, n=5):
    cv = KFold(n_splits=n)
    scores = []
    i = 0
    y_pred  = []
    y_true = []
    # split the training data to training and validation data
    for train_index, valid_index in cv.split(X):
        i += 1
        X_tr, X_va = X[train_index], X[valid_index]
        y_tr, y_va = y.iloc[train_index], y.iloc[valid_index]
        clf.fit(X_tr, y_tr)
        y_pred_sub = clf.predict(X_va)
        # Log loss and ROC AUC are scored on positive-class probabilities, not hard labels
        y_prob_sub = clf.predict_proba(X_va)[:, 1]
        newScore = clf.score(X_va, y_va)
        scores.append(newScore)
        newLogLoss = log_loss(y_va, y_prob_sub, labels=[0, 1])
        newROCAUC = roc_auc_score(y_va, y_prob_sub)
        print("loop %d, accuracy %0.6f, logloss %0.6f, roc_auc_score %0.6f" % (i, newScore, newLogLoss, newROCAUC))
        # preserve one pair of y_true and y_pred
        if i == 1:
            y_pred = y_pred_sub
            y_true = y_va
    scores_array = np.asarray(scores)
    print("Accuracy: %0.6f (+/- %0.6f)" % (scores_array.mean(), scores_array.std() * 2))
    print()
    
    return y_true, y_pred

In [None]:
y_pred = []
y_true = []
for i, label in tqdm(enumerate(labels)):
    print("----- evaluating %s -----" % label)
    m = LogisticRegression()
    y_true_sub, y_pred_sub = crossValidation(m, train_scaled, train[label])
    y_true.append(y_true_sub)
    y_pred.append(y_pred_sub)

We evaluated our SVC and Logistic Regression models with Log Loss scores and ROC AUC scores. 
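
Both metrics are most informative when they are given predicted probabilities instead of hard 0/1 labels. The sketch below compares the two on four made-up predictions. All `toy_*` values are illustrative assumptions, not results from this project.

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

# Made-up ground truth and predicted probabilities of the positive class
toy_true = np.array([0, 0, 1, 1])
toy_prob = np.array([0.1, 0.4, 0.45, 0.8])
# Hard labels obtained by thresholding at 0.5
toy_hard = (toy_prob >= 0.5).astype(float)

auc_prob = roc_auc_score(toy_true, toy_prob)
auc_hard = roc_auc_score(toy_true, toy_hard)
loss_prob = log_loss(toy_true, toy_prob)
loss_hard = log_loss(toy_true, toy_hard)

print("ROC AUC  probabilities: %.3f  hard labels: %.3f" % (auc_prob, auc_hard))
print("Log loss probabilities: %.3f  hard labels: %.3f" % (loss_prob, loss_hard))
```

The probabilities rank every positive above every negative, so their ROC AUC is perfect. Thresholding discards that ranking and also makes the single misclassified sample extremely costly under log loss.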

Score on Kaggle = 0.94415

Logistic Regression outperforms SVC and is much more efficient to train.

## Three Neural Network models

Please find the code cell and outputs in [MLP](MLP.ipynb), [LSTM](LSTM.ipynb), and [CNN](CNN.ipynb).

### Input Process

In [None]:
# Parameters
embed_size = 100
max_features = 20000
MAX_SEQUENCE_LENGTH = 100
EPOCH = 5
BATCH_SIZE = 32

In [None]:
tokenizer = Tokenizer(num_words = max_features)
tokenizer.fit_on_texts(list(train['comment_text']))
train_input = pad_sequences(tokenizer.texts_to_sequences(train['comment_text']), maxlen = MAX_SEQUENCE_LENGTH)
test_input = pad_sequences(tokenizer.texts_to_sequences(test['comment_text']), maxlen = MAX_SEQUENCE_LENGTH)
examples_input = pad_sequences(tokenizer.texts_to_sequences(examples['comment_text']), maxlen = MAX_SEQUENCE_LENGTH)

We used the `Tokenizer` class to build a vocabulary from all training comments. The tokenizer assigns an integer index to each word and counts how often each word appears; with `num_words = max_features`, only the most frequent words are kept when texts are converted. We then called `tokenizer.texts_to_sequences` to convert each comment into a sequence of word indexes, and `pad_sequences` to give every sequence the same length, which is 100 here. Sequences shorter than 100 are padded with 0 and longer ones are truncated; by default, `pad_sequences` pads and truncates at the start of the sequence. These index sequences are the input of our neural networks.
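
The padding rule is easy to reproduce in plain Python. The helper `pad_pre` below is a hypothetical stand-in written for this illustration; it only mimics the default behavior of `pad_sequences` (`padding='pre'`, `truncating='pre'`) for a single sequence.

```python
def pad_pre(sequence, maxlen, value=0):
    """Keep the last `maxlen` items and left-pad with `value` (hypothetical helper)."""
    truncated = sequence[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated

short_padded = pad_pre([5, 8, 2], maxlen=5)
long_truncated = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short_padded)    # [0, 0, 5, 8, 2]
print(long_truncated)  # [3, 4, 5, 6, 7]
```

Because index 0 is used only for padding, the tokenizer's word indexes start at 1.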

In [None]:
import nltk
nltk.download('stopwords')
# Reference: https://stackoverflow.com/questions/37793118/load-pretrained-glove-vectors-in-python
from nltk.corpus import stopwords
EMBEDDING_FILE = '../data/glove.6B.100d.txt'
def get_coefs(word,*arr):
    return word, np.asarray(arr, dtype = 'float32')

def remove_stopwords(old_dict):
    for key in stopwords.words():
        if key in old_dict.keys():
            del old_dict[key]
    return old_dict
    
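with open(EMBEDDING_FILE, encoding="utf8") as f:
    embedding_index = dict(get_coefs(*line.strip().split()) for line in f)

# np.stack needs a sequence, not a dict view
all_embs = np.stack(list(embedding_index.values()))
emb_mean, emb_std = all_embs.mean(), all_embs.std()

word_index = tokenizer.word_index
num_words = min(max_features, len(word_index))
# Filter a copy so the tokenizer's own word_index is not mutated
word_index_without_sw = remove_stopwords(dict(word_index))
# Rows without a GloVe vector keep this random initialization
embedding_matrix = np.random.normal(emb_mean, emb_std, (num_words, embed_size))
# Row i must hold the vector of the word whose tokenizer index is i,
# because the Embedding layer looks rows up by those indexes
for word, i in word_index_without_sw.items():
    if i >= num_words:
        continue
    if word in embedding_index:
        embedding_matrix[i] = embedding_index[word]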

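Here we used glove.6B.100d as our pre-trained word embeddings (the reference for the loading code is given in the comment). The custom function `get_coefs` parses each line of the GloVe file, and we collect the results in a dictionary whose keys are words and whose values are the corresponding 100-dimensional vectors. We then built an embedding matrix for the tokenizer's vocabulary: words found in GloVe get their pre-trained vector, stopwords are skipped, and every other row keeps a random initialization drawn from a normal distribution with the mean and standard deviation of the GloVe vectors. This matrix is used in the Embedding layer of each neural network.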
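
The lookup-table idea can be sketched with a three-word vocabulary. Everything prefixed with `toy_` is a made-up assumption: the vectors are not real GloVe values, and the sketch assumes that row `i` of the matrix belongs to the word whose tokenizer index is `i`.

```python
import numpy as np

toy_embed_size = 3
# Made-up "pre-trained" vectors; "zzz" deliberately has none
toy_embedding_index = {
    "good": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "bad": np.array([-0.1, -0.2, -0.3], dtype="float32"),
}
# Tokenizer-style indexes start at 1; row 0 is left for padding
toy_word_index = {"good": 1, "zzz": 2, "bad": 3}
toy_num_words = len(toy_word_index) + 1

# Start from random rows, then overwrite the rows of known words
rng = np.random.default_rng(0)
toy_matrix = rng.normal(0.0, 0.1, (toy_num_words, toy_embed_size))
for word, idx in toy_word_index.items():
    vector = toy_embedding_index.get(word)
    if vector is not None:
        toy_matrix[idx] = vector

print(toy_matrix)
```

Rows 1 and 3 now hold the known vectors, while rows 0 and 2 keep their random initialization.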

### Multilayer Perceptron (MLP)

Please find the code cell and outputs in [MLP](MLP.ipynb).

In [None]:
model = models.Sequential()
model.add(layers.Input(shape = (MAX_SEQUENCE_LENGTH,)))
model.add(layers.Embedding(max_features, embed_size, weights=[embedding_matrix]))
model.add(layers.GlobalMaxPool1D())
model.add(layers.Dense(100, activation = "relu"))
model.add(layers.Dense(100, activation = "relu"))
model.add(layers.Dense(100, activation = "relu"))
model.add(layers.Dense(6, activation = "sigmoid"))

model.summary()
opt = optimizers.Adam(learning_rate=0.0005)
model.compile(optimizer = opt, loss = "binary_crossentropy", metrics = ["accuracy"])


In [None]:
history = model.fit(train_input, train_labels, batch_size = BATCH_SIZE, epochs = EPOCH, validation_split = 0.2);

This is the structure of our MLP. Its hidden layers are all Keras `Dense` layers, which are fully connected layers. The input layer fixes the length of each input sequence at 100. Because the raw input, stored in `train_input`, consists of word indexes, the Embedding layer converts each index into its word embedding using `embedding_matrix`; it can be viewed as a lookup table that maps indexes to vectors. Each comment therefore becomes a 100 × 100 matrix (100 words, each a 100-dimensional vector), whereas the Dense layers expect one vector per comment. The `GlobalMaxPool1D` layer closes this gap by taking the maximum over the sequence axis, which leaves a single 100-dimensional vector.

The hidden layers use ReLU activations, and the output layer uses a sigmoid activation to produce 6 probabilities, one for each of the 6 classes. The model is trained with the Adam optimizer at a learning rate of 0.0005 and a binary cross-entropy loss over the labels.
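
The effect of `GlobalMaxPool1D` on the tensor shape can be reproduced in NumPy. This is a sketch on random data, where `toy_batch` is an assumed stand-in for the output of the Embedding layer.

```python
import numpy as np

rng = np.random.default_rng(0)
# (batch size, sequence length, embedding size), random values for illustration
toy_batch = rng.normal(size=(2, 100, 100))

# Global max pooling keeps, for each embedding dimension,
# the largest value found anywhere along the sequence axis
pooled = toy_batch.max(axis=1)

print(toy_batch.shape, "->", pooled.shape)
```

Each comment is reduced from a 100 × 100 matrix to one 100-dimensional vector, which the Dense layers can consume directly.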

Score on Kaggle = 0.96942

### Long Short-Term Memory (LSTM)

Please find the code cell and outputs in [LSTM](LSTM.ipynb).

In [None]:
model = models.Sequential()

model.add(layers.Input(shape = (MAX_SEQUENCE_LENGTH,)))
model.add(layers.Embedding(max_features, embed_size, weights=[embedding_matrix]))
model.add(layers.Bidirectional(layers.LSTM(50, return_sequences=True, dropout=0.2, recurrent_dropout=0.2)))
model.add(layers.GlobalMaxPool1D())
model.add(layers.Dense(50, activation="relu"))
model.add(layers.Dropout(0.2))
model.add(layers.Dense(6, activation="sigmoid"))

model.summary()

opt = optimizers.Adam(learning_rate=0.0005)
model.compile(optimizer = opt, loss = "binary_crossentropy", metrics = ["accuracy"])


The only implementation difference between the MLP and the LSTM is the network structure. Like the MLP, the LSTM model uses an input layer and an embedding layer to convert the input data, and a `GlobalMaxPool1D` layer to reduce the sequence dimension. Two new layer types appear in this model: dropout layers and a bidirectional LSTM layer. A Dropout layer randomly sets a fraction of its inputs (20% here) to zero during training, which is commonly used to reduce overfitting. The bidirectional LSTM is the core layer of this model; it reads each sequence in both directions, and we also set its own `dropout` and `recurrent_dropout` parameters. ReLU activation is used in the hidden Dense layer and sigmoid activation is used on the final output. For a fair comparison, this model also uses the Adam optimizer with a learning rate of 0.0005 and a binary cross-entropy loss over the labels.
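
What a dropout layer does during training can be sketched in a few lines of NumPy. The `dropout` function below is a hypothetical helper written for this illustration. It follows the usual "inverted dropout" scheme, in which the surviving activations are scaled by `1 / (1 - rate)` so that the expected sum stays unchanged.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Zero a random fraction `rate` of the inputs and rescale the rest (hypothetical helper)."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
toy_activations = np.ones((1000, 50))  # made-up activations, all equal to 1
dropped = dropout(toy_activations, rate=0.2, rng=rng)

zero_fraction = (dropped == 0).mean()
print("fraction zeroed: %.3f" % zero_fraction)
print("mean after dropout: %.3f" % dropped.mean())
```

Roughly 20% of the values are zeroed and the mean stays close to 1. At inference time dropout is disabled and the activations pass through unchanged.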

Score on Kaggle = 0.97206

### Convolutional Neural Network (CNN)

Please find the code cell and outputs in [CNN](CNN.ipynb).

In [None]:
model = models.Sequential()

model.add(layers.Input(shape = (MAX_SEQUENCE_LENGTH,)))
model.add(layers.Embedding(max_features, embed_size, weights=[embedding_matrix]))
model.add(layers.Conv1D(64, 3, activation='relu'))
model.add(layers.MaxPooling1D(pool_size=2, strides=1, padding='valid'))
model.add(layers.Dropout(0.2))
model.add(layers.Conv1D(128, 3, activation='relu'))
model.add(layers.MaxPooling1D(pool_size=2, strides=1, padding='valid'))
model.add(layers.Conv1D(64, 3, activation='relu'))
model.add(layers.GlobalMaxPool1D())
model.add(layers.Dense(50, activation="relu"))
model.add(layers.Dropout(0.2))
model.add(layers.Dense(6, activation="sigmoid"))

model.summary()

opt = optimizers.Adam(learning_rate=0.0005)
model.compile(optimizer = opt, loss = "binary_crossentropy", metrics = ["accuracy"])


The last model is a CNN, an architecture most often used to process image data. Here we applied one-dimensional convolutions along the word sequence, treating each comment much like a one-dimensional image. The model adds `Conv1D` and `MaxPooling1D` layers. The convolutions use a kernel size of 3 and a stride of 1. For max pooling we chose the `valid` padding rule, for which the output length is floor((input length - pool size) / strides) + 1. This choice influenced the results a lot. ReLU activations are used in the convolution layers and the hidden Dense layer, and sigmoid activation is used on the final output. This model also uses the Adam optimizer with a learning rate of 0.0005 and a binary cross-entropy loss over the labels.
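
The sequence length after each layer can be traced with a few lines of arithmetic. The helper `valid_output_length` is a hypothetical name introduced here. It applies the `valid` rule to the layer settings used in this model; `Conv1D` uses `valid` padding by default, and dropout layers are omitted because they do not change the length.

```python
def valid_output_length(input_length, window, stride=1):
    """Output length of a 1D convolution or pooling layer with 'valid' padding."""
    return (input_length - window) // stride + 1

# (layer type, window size, stride) in the order used by the CNN
cnn_layers = [("conv", 3, 1), ("pool", 2, 1), ("conv", 3, 1), ("pool", 2, 1), ("conv", 3, 1)]

length = 100  # MAX_SEQUENCE_LENGTH
trace = [length]
for name, window, stride in cnn_layers:
    length = valid_output_length(length, window, stride)
    trace.append(length)

print(trace)  # [100, 98, 97, 95, 94, 92]
```

With a pooling stride of 1, each pooling layer shortens the sequence by only one step, so 92 positions remain before `GlobalMaxPool1D` collapses them.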

Score on Kaggle = 0.96000

### Evaluation of Neural Network

In [None]:
#Reference: https://towardsdatascience.com/machine-learning-recurrent-neural-networks-and-long-short-term-memory-lstm-python-keras-example-86001ceaaebc
import matplotlib.pyplot as plt
plt.clf()
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(loss) + 1)
plt.plot(epochs, loss, 'g', label='Training loss')
plt.plot(epochs, val_loss, 'y', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

In [None]:
plt.clf()
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
plt.plot(epochs, acc, 'g', label='Training accuracy')
plt.plot(epochs, val_acc, 'y', label='Validation accuracy')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

All three neural networks were evaluated in the same way, using training accuracy, validation accuracy, training loss, and validation loss. The reference for the plotting code is given in the comment above. The figures show the curves of these four metrics over the training epochs.