### Introduction

Hello! I'm Manfred Michael, a beginner machine learning enthusiast, and this is my first Kaggle NLP notebook. How did I end up here? I joined a challenge called #66DaysData, initiated by [Ken Jee](https://www.youtube.com/channel/UCiT9RITQ9PW6BhXK0y2jaeg), a data scientist and YouTuber. From there, I found a small group where we discussed how to tackle this notebook.

This notebook uses scikit-learn pipelines, nltk modules, and custom transformers. This is the highest score I could achieve so far.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn #remove spooky warning when working on this spooky notebook

# Prepare data

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
np.set_printoptions(suppress=True)

author_data = pd.read_csv('../input/spooky-author-identification/train.zip')
author_data.head()

There are 3 authors: 
* EAP: Edgar Allan Poe
* HPL: HP Lovecraft
* MWS: Mary Wollstonecraft Shelley

Each row has an author tag and a chunk of text from one of that author's books.

In [None]:
print(author_data['author'].value_counts())
author_data.info()

It seems the dataset is free of missing values. Now, let's look at the length of each text.

In [None]:
author = author_data.copy()
author['text_length'] = author['text'].apply(lambda text: len(text))
author.head()

In [None]:
print(author['text_length'].describe())
ax = author['text_length'].hist(bins=100);
plt.axis([0, 1000, 0, 5000])
plt.xlabel('text length', fontsize=14)
plt.title('Text length distribution', fontsize=18)

In [None]:
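# Select the numeric column explicitly; newer pandas raises on .mean() over text columns
author.groupby('author')['text_length'].mean()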

Kinda disappointing: there is no obvious difference in text length between the authors. That probably makes sense, since the provider of this dataset might have taken chunks of similar length from each author's books. Having too much variance in text length would give us trouble.

### Split data

For now, let's split our data. Here's our plan:
1. We split the data to create test set
2. Play with the train set
3. Don't touch the test set
4. Build our model and measure its performance with only train set
5. Still don't touch the test set
6. Only when we are confident with our model, we could try it with the test set


But why not use the test set to develop our model? Because we don't want a model that performs well only on the test set; we want it to perform well on any general dataset.

**NOTE:** This test set is the one we created, not the test.zip

In [None]:
X = author_data.drop(['author'], axis=1)
y = author_data.copy()['author']

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y ,test_size=0.2, random_state=42)

In [None]:
len(X_train), len(X_test)
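
To see what `stratify=y` does in the split above, here is a minimal sketch on made-up labels (the 50/30/20 toy counts are an assumption for illustration, not the competition data). Both splits should keep the class proportions of the full data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels with a 50/30/20 class balance (made-up data)
toy_y = pd.Series(['EAP'] * 50 + ['HPL'] * 30 + ['MWS'] * 20)
toy_X = pd.DataFrame({'text': ['sample'] * len(toy_y)})

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, stratify=toy_y, test_size=0.2, random_state=42
)

# Both splits keep the same class proportions as the full data
print(toy_y_train.value_counts(normalize=True).sort_index())
print(toy_y_test.value_counts(normalize=True).sort_index())
```

Without `stratify`, a random split could over- or under-represent one author in the test set, which would make the test score harder to compare with the cross-validation scores.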

# Feature Engineering

And here's the plan for feature engineering:
* Tokenize text
* Stem text
* Remove stopwords
* Use word count to vectorize text
* Use additional features

### Importing text feature extraction modules

#### Tokenizer

In [None]:
import nltk
nltk.word_tokenize('We don\'t do that here')

#### Stemmer

In [None]:
stemmer = nltk.PorterStemmer()

for word in ('computing','computed','compulsive'):
    print(word,'=>',stemmer.stem(word))

#### Stopwords

In [None]:
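# The `sklearn.feature_extraction.stop_words` module was removed in newer scikit-learn releases;
# aliasing the `text` module keeps `stop_words.ENGLISH_STOP_WORDS` working everywhere below
from sklearn.feature_extraction import text as stop_words
stop_words.ENGLISH_STOP_WORDS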

In [None]:
text_list = 'i myself believe it was beautiful'.split()
filtered_words = [word for word in text_list if word not in stop_words.ENGLISH_STOP_WORDS]
print(filtered_words)

#### Word Counter

In [None]:
import re
from collections import Counter

sentence = 'I love real madrid for real but I dont like real betis'
sentence = re.sub(r'\W+', ' ', sentence, flags=re.M) #remove punctuation from string

c = Counter(sentence.split())
print(c.most_common())

In [None]:
c['love'] += 3  
print(c.most_common())

### Extra Feature

Credit: https://www.kaggle.com/sudalairajkumar/simple-feature-engg-notebook-spooky-author (sudalairajkumar's spooky notebook)

In [None]:
import string

def derive_new_features(data):
    data = data[['text']].copy()

    #Unique words Count
    data['unique_words_count'] = data.text.apply(lambda x: len(set(str(x).split())))     

    #Punctuation count
    # Count punctuation characters (the word-level check only matched standalone punctuation tokens)
    data['punctuation_count'] = data.text.apply(lambda x: len([c for c in str(x) if c in string.punctuation]))

    #Upper case words count
    data['uppercase_words_count'] = data.text.apply(lambda x: sum([x.isupper() for x in x.split()]))

    #Title words count
    data['title_words_count'] = data.text.apply(lambda x: sum([x.istitle() for x in x.split()]))

    return data.drop(['text'], axis=1)

In [None]:
derive_new_features(X_train).head()

### Create Pipelines

Now, we put all of our feature extractor in pipelines. The goal is to automate our data preparation, so we can tune its parameters later. We are using custom transformers to use feature extractor modules we have imported.

Credit: https://github.com/ageron/handson-ml2/blob/master/03_classification.ipynb (Aurelien Geron's Hands-On ML Chapter 3 Exercise)

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class TextToWordCounterTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, remove_punctuation=True, lower_case=True, stemming=True, replace_numbers=True, remove_stopwords=True):
        self.remove_punctuation = remove_punctuation
        self.lower_case = lower_case
        self.stemming = stemming
        self.replace_numbers = replace_numbers
        self.remove_stopwords = remove_stopwords
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        word_counters = []
        for text in X['text']:
            if self.lower_case:
                text = text.lower()
            if self.replace_numbers:
                text = re.sub(r'\d+(?:\.\d*)?(?:[eE][+-]?\d+)?', 'NUMBER', text) #replace any number (integer, decimal, or scientific notation) with 'NUMBER'
            if self.remove_punctuation:
                text = re.sub(r'\W+', ' ', text, flags=re.M)
            if self.remove_stopwords:
                words = [word for word in text.split() if not word in stop_words.ENGLISH_STOP_WORDS]
                text = ' '.join(words)
            word_list = nltk.word_tokenize(text)
            word_count = Counter(word_list)
            if self.stemming:
                stemmed_word_count = Counter()
                for word in word_list:
                    stemmed_word = stemmer.stem(word)
                    stemmed_word_count[stemmed_word] += 1
                word_count = stemmed_word_count
            word_counters.append(word_count)
        return np.array(word_counters)

Using TextToWordCounterTransformer, we transform the text dataframe into an array of Counter objects. Each Counter records how many times each word occurs in a single text.

In [None]:
X_word_count = TextToWordCounterTransformer().fit_transform(X_train)
X_word_count

In [None]:
from scipy.sparse import csr_matrix

class WordCounterToVectorTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, vocabulary_size=1000):
        self.vocabulary_size = vocabulary_size
    def fit(self, X, y=None):
        total_count = Counter()
        for word_count in X:
            for word, count in word_count.items():
                total_count[word] += min(count, 10) 
        most_common = total_count.most_common()[:self.vocabulary_size]
        self.most_common_ = most_common
        self.vocabulary_ = {word: index + 1 for index, (word, count) in enumerate(most_common)}  #spare an index for excluded words
        return self
    def transform(self, X, y=None):
        data = []
        rows = []
        cols = []
        for row, word_count in enumerate(X):
            for word, count in word_count.items():
                data.append(count)
                rows.append(row)
                cols.append(self.vocabulary_.get(word, 0))
        return csr_matrix((data, (rows, cols)), shape=(len(X), self.vocabulary_size + 1))

Now, WordCounterToVectorTransformer transforms the Counter objects from the previous transformer into a sparse matrix. The vocabulary used for the vectors is limited too (this time to the 1000 most common words).

In [None]:
vectorizer = WordCounterToVectorTransformer(vocabulary_size=1000)
vectorizer.fit_transform(X_word_count).toarray()
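
To make the mapping from Counter objects to matrix columns concrete, here is a small sketch with two made-up word counts and a hypothetical two-word vocabulary. It mirrors the logic of `transform` above: known words go to their own column, and every other word falls into column 0.

```python
from collections import Counter
from scipy.sparse import csr_matrix

# Two toy "documents" already turned into word counts (made-up data)
counters = [Counter({'raven': 2, 'night': 1}), Counter({'night': 1, 'monster': 3})]

# Hypothetical 2-word vocabulary; index 0 is reserved for every other word
vocabulary = {'night': 1, 'raven': 2}

data, rows, cols = [], [], []
for row, word_count in enumerate(counters):
    for word, count in word_count.items():
        data.append(count)
        rows.append(row)
        cols.append(vocabulary.get(word, 0))  # unknown words fall into column 0

matrix = csr_matrix((data, (rows, cols)), shape=(len(counters), len(vocabulary) + 1))
print(matrix.toarray())
```

In the second row, 'monster' is not in the vocabulary, so its count of 3 lands in column 0.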

These are the 10 most common words:

In [None]:
vectorizer.most_common_[:10]

Now, let's put these 2 transformers in a single pipeline:

In [None]:
from sklearn.pipeline import Pipeline

preprocess_pipeline = Pipeline([
    ('text_to_word_count', TextToWordCounterTransformer()),
    ('word_count_to_vector', WordCounterToVectorTransformer(vocabulary_size=14000)), 
])

Remember that we have the derive_new_features() function. To put it in a pipeline, we need to turn it into a custom transformer.

In [None]:
class NewFeaturesAdderTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        return np.array(derive_new_features(X))

In [None]:
NewFeaturesAdderTransformer().fit_transform(X_train)
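
As a possible alternative to writing a class, scikit-learn's `FunctionTransformer` can wrap a plain function so it behaves like a transformer. This is only a sketch: `count_unique_words` is a hypothetical single-feature helper standing in for derive_new_features(), and the two texts are made up.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def count_unique_words(data):
    # Hypothetical single-feature helper, standing in for derive_new_features
    counts = data['text'].apply(lambda text: len(set(text.split())))
    return counts.to_frame('unique_words_count')

toy = pd.DataFrame({'text': ['the raven the night', 'once upon a midnight dreary']})

# FunctionTransformer turns a stateless function into a pipeline-ready step
feature_adder = FunctionTransformer(count_unique_words)
print(feature_adder.fit_transform(toy))
```

A class is still the better fit when the transformer needs its own tunable hyperparameters, as TextToWordCounterTransformer does.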

Finally, combining the last transformer with the previous pipeline gives us the full pipeline:

In [None]:
from sklearn.compose import ColumnTransformer

full_pipeline = ColumnTransformer([
    ('feature_adder', NewFeaturesAdderTransformer(), ['text']),
    ('text_pipeline', preprocess_pipeline, ['text']),
])

# Train Model

y_train has object dtype (strings). We need to encode it as 3 integer categories, so we will use scikit-learn's LabelEncoder.

In [None]:
y_train.head()

In [None]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

In [None]:
X_train_transformed = full_pipeline.fit_transform(X_train)
y_train_transformed = label_encoder.fit_transform(y_train.values)

Be aware that the submission format requires the categories to be in this order: EAP,HPL,MWS

In [None]:
label_encoder.classes_    #submission format: id, EAP,HPL,MWS
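
The order works out because LabelEncoder stores its classes in sorted order, whatever order the labels appear in. A minimal sketch with made-up labels:

```python
from sklearn.preprocessing import LabelEncoder

# Labels given in arbitrary order (toy data)
toy_labels = ['MWS', 'EAP', 'HPL', 'EAP']

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_labels)

print(encoder.classes_)  # classes are stored in sorted order
print(encoded)           # each integer code is the position in classes_
```

So column 0 of any `predict_proba` output corresponds to EAP, column 1 to HPL, and column 2 to MWS.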

Now, let's train several algorithms.

In [None]:
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=42, loss='hinge')

In [None]:
from sklearn.model_selection import cross_val_score

cv_score = cross_val_score(
        sgd_clf, 
        X_train_transformed, y_train_transformed, 
        cv=5,
        verbose=3,
    )
print('mean score :', cv_score.mean())

In [None]:
from sklearn.linear_model import LogisticRegression

log_clf = LogisticRegression(random_state=42)
cv_score = cross_val_score(
        log_clf, 
        X_train_transformed, y_train_transformed, 
        cv=5,
        verbose=3,
    )
print('mean score :', cv_score.mean())

In [None]:
from sklearn.naive_bayes import MultinomialNB

mnb_clf = MultinomialNB()
cv_score = cross_val_score(
        mnb_clf, 
        X_train_transformed, y_train_transformed, 
        cv=5,
        verbose=3,
    )
print('mean score :', cv_score.mean())

### Evaluate Model

This competition uses log loss for scoring. I took this code from Abhishek's notebook.

credit: https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle (Abhishek Thakur's spooky notebook)

In [None]:
def multiclass_logloss(actual, predicted, eps=1e-15):
    """Multi class version of Logarithmic Loss metric.
    :param actual: Array containing the actual target classes
    :param predicted: Matrix with class predictions, one probability per class
    """
    # Convert 'actual' to a binary array if it's not already:
    if len(actual.shape) == 1:
        actual2 = np.zeros((actual.shape[0], predicted.shape[1]))
        for i, val in enumerate(actual):
            actual2[i, val] = 1
        actual = actual2

    clip = np.clip(predicted, eps, 1 - eps)
    rows = actual.shape[0]
    vsota = np.sum(actual * np.log(clip))
    return -1.0 / rows * vsota
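
As a sanity check on what this metric measures, here is a small sketch on made-up probabilities. It computes the average negative log-probability assigned to the true class by hand and compares it with scikit-learn's `log_loss`, which should agree.

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy example: 3 samples, 3 classes (made-up probabilities)
actual = np.array([0, 2, 1])
predicted = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])

# Manual computation: average negative log-probability of the true class
manual = -np.mean(np.log(predicted[np.arange(len(actual)), actual]))

print(manual)
print(log_loss(actual, predicted, labels=[0, 1, 2]))
```

A lower value is better: a confident correct prediction contributes almost nothing, while a confident wrong one is punished heavily.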

In [None]:
from sklearn.model_selection import cross_val_predict

y_scores = cross_val_predict(
        log_clf, 
        X_train_transformed, y_train_transformed, 
        cv=5,
        verbose=3,
        method='predict_proba',
    )

In [None]:
pd.DataFrame(columns=label_encoder.classes_, data=y_scores).head()

In [None]:
multiclass_logloss(y_train_transformed, y_scores)

In [None]:
y_scores = cross_val_predict(
        mnb_clf, 
        X_train_transformed, y_train_transformed, 
        cv=5,
        verbose=3,
        method='predict_proba',
    )

multiclass_logloss(y_train_transformed, y_scores)

### Try different preprocessing parameters

We have automated our data preparation. Now, let's try several parameters for our pipeline in the hope of finding a better score.

In [None]:
import itertools

stopword_params = [True, False]
vocabulary_params = [7000, 8000, 9000, 10000]
data_scores = []

for stopword, vocabulary in list(itertools.product(stopword_params, vocabulary_params)):
   
    full_pipeline.set_params(
            text_pipeline__text_to_word_count__remove_stopwords=stopword,
            text_pipeline__word_count_to_vector__vocabulary_size=vocabulary,
        )
    X_train_processed = full_pipeline.fit_transform(X_train)
    y_scores = cross_val_predict(
            log_clf, 
            X_train_processed, y_train_transformed, 
            cv=3,
            method='predict_proba',
        )
    data_scores += [(stopword, vocabulary, multiclass_logloss(y_train_transformed, y_scores))]

In [None]:
for stopword, vocabulary, logloss in data_scores:
    print('remove_stopwords:', stopword, ',', 'vocabulary_size:',vocabulary)
    print('logloss: ', logloss)

Let's set our final pipeline to its best parameters.

In [None]:
full_pipeline.set_params(
            text_pipeline__text_to_word_count__remove_stopwords=False,
            text_pipeline__word_count_to_vector__vocabulary_size=8_000,
        )
X_train_transformed = full_pipeline.fit_transform(X_train)

### Grid Search

The algorithms we used are not at their best hyperparameters yet. Tuning them manually would waste our energy, so let's use scikit-learn's GridSearchCV to do the job for us.

In [None]:
%%time
from sklearn.model_selection import GridSearchCV

log_grid_params = {
    'penalty': ['l1', 'l2'],  # lowercase is required; of the solvers below, only 'saga' supports 'l1'
    'dual': [False],
    'tol':[1e-4, 1e-5],
    'class_weight': ['balanced', None],
    'solver': ['newton-cg', 'lbfgs', 'sag', 'saga'],
    'multi_class': ['ovr', 'auto', 'multinomial'],
}

log_grid_search = GridSearchCV(
    estimator=LogisticRegression(random_state=42),
    param_grid=log_grid_params,
    scoring='neg_log_loss',
    cv=3,
    verbose=2,
)

log_grid_search.fit(X_train_transformed, y_train_transformed)

In [None]:
print(log_grid_search.best_params_)

In [None]:
import joblib

joblib.dump(log_grid_search, 'log_grid_best.pkl')
log_clf_best = joblib.load('log_grid_best.pkl').best_estimator_
print(log_clf_best.get_params())

In [None]:
%%time
from sklearn.model_selection import GridSearchCV

sgd_grid_params = {
    'loss': ['log_loss'],  # spelled 'log' in scikit-learn versions before 1.1
    'penalty' : ['l2'],
    'eta0':[0.1],
    'alpha': [1e-4, 1e-5],
    'tol': [1e-3, 1e-4],
    'epsilon': [0.3, 0.5, 1],
    'learning_rate': ['adaptive'],
    'class_weight': ['balanced', None],
    'average':[True, False],
}

sgd_grid_search = GridSearchCV(
    estimator=SGDClassifier(random_state=42),
    param_grid=sgd_grid_params,
    scoring='neg_log_loss',
    cv=3,
    verbose=2,
)

sgd_grid_search.fit(X_train_transformed, y_train_transformed)

In [None]:
print(sgd_grid_search.best_params_)

In [None]:
joblib.dump(sgd_grid_search, 'sgd_grid_best.pkl')
sgd_clf_best = joblib.load('sgd_grid_best.pkl').best_estimator_
print(sgd_clf_best.get_params())

In [None]:
%%time
from sklearn.model_selection import GridSearchCV

mnb_grid_params = {
    'alpha': [0, 0.25, 0.5, 0.75, 1],
    'fit_prior': [False, True],
}

mnb_grid_search = GridSearchCV(
    estimator=MultinomialNB(),
    param_grid=mnb_grid_params,
    scoring='neg_log_loss',
    cv=3,
    verbose=2,
)

mnb_grid_search.fit(X_train_transformed, y_train_transformed)

In [None]:
print(mnb_grid_search.best_params_)

In [None]:
joblib.dump(mnb_grid_search, 'mnb_grid_best.pkl')
mnb_clf_best = joblib.load('mnb_grid_best.pkl').best_estimator_
print(mnb_clf_best.get_params())

### Ensemble

Ensembling several models often gives a better score. Let's try some combinations of models!

In [None]:
from sklearn.ensemble import VotingClassifier

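# Soft voting needs predict_proba, so make sure the SGD model uses the log loss found by the grid search
sgd_clf_best.set_params(loss=sgd_grid_search.best_params_['loss'])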
estimators=[('log_clf', log_clf_best), ('sgd_clf',sgd_clf_best), ('mnb_clf',mnb_clf_best)]
vot_clf = VotingClassifier(
    estimators=estimators,
    voting='soft',
)
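
For intuition, soft voting with equal weights averages the class probabilities of the individual models and then picks the class with the highest average. A minimal sketch with made-up probabilities for a single sample:

```python
import numpy as np

# Made-up class probabilities from three models for one sample (EAP, HPL, MWS)
proba_log = np.array([0.6, 0.3, 0.1])
proba_sgd = np.array([0.4, 0.5, 0.1])
proba_mnb = np.array([0.7, 0.2, 0.1])

# Soft voting with equal weights: average the probabilities, then take the argmax
averaged = np.mean([proba_log, proba_sgd, proba_mnb], axis=0)
print(averaged)
print(averaged.argmax())
```

Here the second model prefers HPL, but the averaged probabilities still favour EAP (index 0).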

In [None]:
y_scores = cross_val_predict(
        vot_clf, 
        X_train_transformed, y_train_transformed, 
        cv=5,
        verbose=3,
        method='predict_proba',
    )

multiclass_logloss(y_train_transformed, y_scores)

Amazing score! It is almost below 0.4.

We scored the SGD grid search with log loss, which requires `predict_proba`. Because of that, we could not try settings such as loss='hinge', which does not produce probabilities. I later found out that changing the SGD model's 'loss' hyperparameter to 'hinge' gives us a better score.

In [None]:
sgd_clf_best.set_params(loss='hinge')
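
The sketch below shows why hinge loss did not fit the earlier steps: an SGDClassifier with loss='hinge' exposes no `predict_proba`, only `decision_function`. The 'modified_huber' loss is used here purely as a contrast, since it does offer probabilities. As far as I understand, StackingClassifier with its default `stack_method='auto'` falls back to `decision_function` when `predict_proba` is missing, which is why the hinge model can still be stacked.

```python
from sklearn.linear_model import SGDClassifier

hinge_clf = SGDClassifier(loss='hinge', random_state=42)
huber_clf = SGDClassifier(loss='modified_huber', random_state=42)

print(hasattr(hinge_clf, 'predict_proba'))      # hinge: no probability estimates
print(hasattr(hinge_clf, 'decision_function'))  # but it still exposes decision scores
print(hasattr(huber_clf, 'predict_proba'))      # a loss that does support probabilities
```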

In [None]:

from sklearn.ensemble import StackingClassifier

estimators=[('sgd_clf',sgd_clf_best), ('log_clf', log_clf_best),  ('mnb_clf',mnb_clf_best)]
stk_clf = StackingClassifier(
    estimators=estimators,
)

In [None]:
y_scores = cross_val_predict(
        stk_clf, 
        X_train_transformed, y_train_transformed, 
        cv=5,
        verbose=3,
        method='predict_proba',
    )

multiclass_logloss(y_train_transformed, y_scores)

This is our best score! Now that we have a score below 0.4, we can test the model on the test set.

# Testing Model

In [None]:
final_model = stk_clf
final_model.fit(X_train_transformed, y_train_transformed)

In [None]:
X_test_transformed = full_pipeline.transform(X_test)
y_test_transformed = label_encoder.transform(y_test)

y_scores = final_model.predict_proba(X_test_transformed)
y_scores

In [None]:
multiclass_logloss(y_test_transformed, y_scores)

Fantastic! We never touched the test set, yet its score is pretty similar to the train set score.

Now that we are satisfied with our score, I will end this notebook here. I would appreciate any criticism and suggestions. Thanks for your time going through this notebook!

### Create Submission File

In [None]:
X_transformed = full_pipeline.transform(X)
y_transformed = label_encoder.transform(y)
final_model.fit(X_transformed, y_transformed)

In [None]:
test_data = pd.read_csv('../input/spooky-author-identification/test.zip')
test_data.head()

In [None]:
test_data_prepared = full_pipeline.transform(test_data)
test_scores = final_model.predict_proba(test_data_prepared)
test_scores

In [None]:
submission_file = pd.DataFrame({
    'id': test_data['id'].values,
    'EAP': test_scores[:,0],
    'HPL': test_scores[:,1],
    'MWS': test_scores[:,2],
}) 
submission_file.head()

In [None]:
submission_file.to_csv('submission.csv', index=False)