# Introduction

The objective of this task is to identify the author of excerpts from the works of three famous horror writers: Edgar Allan Poe, Mary Shelley, and H. P. Lovecraft.

A brief summary of the code: I'll begin by preparing the text data (read it into a dataframe, clean it and lemmatize it), then convert the text into vectors, choose the features and build the models. I'll use CountVectorizer to vectorize the text and MultinomialNB, LogisticRegression and XGBClassifier as prediction models.

Let me begin by importing the libs/mods I'll be using in this notebook:

In [None]:
import numpy as np
import pandas as pd
import re
import string
from string import punctuation 
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from collections import Counter

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn import model_selection
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

labelencoder = LabelEncoder()

In [None]:
train_set=pd.read_csv("../input/spooky-author-identification/train.zip")
test_set=pd.read_csv("../input/spooky-author-identification/test.zip")

In [None]:
train_set.head()

We load the data into dataframes and take a first look at it (as usual :))

In [None]:
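# Characters to strip; used as a plain collection of characters, not as a regex.
char_set = r"[!#$%&'()*+,./:;<=>?@[\]^_`{|}„“~—\"\-]–+«»…"
# NLTK English stopwords plus a few extra tokens.
stopwords_en = set(stopwords.words("english") + ["one", "le", "de", "u", "us"])

lemmatizer = WordNetLemmatizer()

def lemmatize(sent):
    """Clean a sentence and return its lemmas as a list of lowercase tokens."""
    # Drop any token that contains a digit.
    sent = re.sub(r'\w*\d\w*', '', sent)

    # Remove punctuation character by character.
    nopunct = ''.join(ch for ch in sent if ch not in char_set)

    lemmas = []
    for token in nopunct.split():
        token = token.lower().strip()
        # Skip stopwords and single-character tokens.
        if token not in stopwords_en and len(token) != 1:
            lemmas.append(lemmatizer.lemmatize(token))

    return lemmas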

This is the function I'll be using to clean the text data: it drops tokens that contain digits, removes punctuation marks and stopwords, and lemmatizes the remaining words.
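
As a rough illustration of the first cleaning steps only, here is a minimal sketch that relies on the standard library alone. The helper name `strip_noise` is hypothetical, and it strips ASCII punctuation via `string.punctuation` instead of the custom `char_set`; stopword removal and lemmatization are deliberately left out.

```python
import re
import string

def strip_noise(sent):
    """Drop tokens containing digits, remove ASCII punctuation, lowercase and split."""
    # Same digit-token pattern as in the notebook's lemmatize function.
    sent = re.sub(r"\w*\d\w*", "", sent)
    # Assumption: plain ASCII punctuation is enough for this toy example.
    table = str.maketrans("", "", string.punctuation)
    return sent.translate(table).lower().split()

tokens = strip_noise("It was 1845; the raven, perched above, said: 'Nevermore!'")
print(tokens)
```

The year disappears because it contains digits, and the remaining words come back as lowercase tokens without punctuation.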

In [None]:
# Lemmatize every excerpt; keep both the token lists and the re-joined phrases.
# (Column-wise assignment avoids the chained-indexing pattern df[col][i] = ...)
train_set['lems'] = train_set['text'].apply(lemmatize)
train_set['lemphrases'] = train_set['lems'].apply(' '.join)

In [None]:
train_set.sample(3)

In [None]:
all_words={"EAP":[], "HPL":[], "MWS":[]}
aw=[]

for auth in all_words.keys():
    for line in train_set[train_set["author"]==auth]["lems"]:
        for w in line:
            all_words[auth].append(w)
            aw.append(w)

In [None]:
Counter(aw).most_common()[:10]

Some of these most common words could be added to the list of stopwords, but I personally think they may give a clue about the style and wording of each author.
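
One way to check that intuition would be to look at the most frequent words per author rather than over the whole corpus. A small sketch with made-up word lists standing in for `all_words` (the toy data and the name `top_by_author` are assumptions for illustration only):

```python
from collections import Counter

# Toy stand-in for the notebook's all_words dictionary.
toy_words = {
    "EAP": ["upon", "could", "upon", "raven"],
    "HPL": ["old", "could", "old", "thing"],
    "MWS": ["love", "could", "heart", "love"],
}

# Most frequent words for each author separately.
top_by_author = {
    auth: Counter(words).most_common(2) for auth, words in toy_words.items()
}

for auth, top in top_by_author.items():
    print(auth, top)
```

If a frequent word ranks very differently from one author to the next, that would be an argument for keeping it out of the stopword list.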

In [None]:
all_words_unique={}
all_words_unique["EAP"]=set(all_words['EAP'])-(set(all_words['HPL'])| set(all_words['MWS']))
all_words_unique["HPL"]=  set(all_words['HPL'])-(set(all_words['EAP']) | set(all_words['MWS']))
all_words_unique["MWS"]= set(all_words['MWS'])-(set(all_words['EAP'])| set(all_words['HPL']))

auth_counts={"EAP": len(all_words_unique["EAP"]),
             "HPL": len(all_words_unique["HPL"]),
             "MWS":len(all_words_unique["MWS"])
             }

fig_sizes = {'S' : (6.5,4),
             'M' : (9.75,6),
             'L' : (13,8)}

def show_plot(f_size=(6.5,4), plot_title="", x_title="", y_title=""):
    """Create a labelled figure and return its axes so plots can be drawn on it."""
    fig, ax = plt.subplots(figsize=f_size)
    ax.set_xlabel(x_title)
    ax.set_ylabel(y_title)
    ax.set_title(plot_title)
    return ax

ax_bp = show_plot(fig_sizes['S'], 'Unique Vocabulary by Author', 'Author', 'Count')
sns.barplot(x=list(auth_counts.keys()), y=list(auth_counts.values()), ax=ax_bp)
plt.show()

We see that, of the three authors, Mary Shelley has the smallest unique vocabulary. I'm not going to use the per-author unique vocabulary in this notebook, but it may be worth thinking about how it could be turned into a feature.
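
For what it's worth, one possible way to turn this into a feature is to count, for each excerpt, how many of its tokens fall into each author's unique vocabulary. This is only a sketch on toy data; the function name `unique_vocab_hits` and the example sets are hypothetical, and in practice the unique vocabularies should be built from the training split only so that the validation score isn't inflated.

```python
def unique_vocab_hits(tokens, unique_vocab):
    """Count, per author, how many tokens belong to that author's unique vocabulary."""
    return {
        auth: sum(tok in vocab for tok in tokens)
        for auth, vocab in unique_vocab.items()
    }

# Toy stand-in for the notebook's all_words_unique dictionary.
toy_unique = {
    "EAP": {"raven", "pendulum"},
    "HPL": {"eldritch", "cyclopean"},
    "MWS": {"creature", "misery"},
}

hits = unique_vocab_hits(["eldritch", "thing", "cyclopean", "raven"], toy_unique)
print(hits)
```

The three counts could then be appended as extra numeric columns next to the vectorized text.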


Now I'll test several models with the original text, with the lemmatized phrases, with lemmatized phrases that include the stopwords, and with a combination of them in a pipeline.

### MultinomialNB

In [None]:
training_scores=[]
validation_scores=[]

def model_selector(data):

    y=data['author']
    
    y = labelencoder.fit_transform(y)
    
    
    for i in ['text', 'lemphrases']:
        
        X=data[i]
        
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)

        CV=CountVectorizer(ngram_range=(1,4), min_df=2)
        transformer=CV.fit(X_train)

        text_train=transformer.transform(X_train)
        text_test=transformer.transform(X_test)

        model = MultinomialNB()
        model = model.fit(text_train, y_train)
        
        score=model.score(text_train, y_train)
        training_scores.append(score)        
 
        score=model.score(text_test, y_test)
        validation_scores.append(score)      

    Xp=data.drop(['id','author', 'lems'], axis=1)
    yp=data['author']
    yp = labelencoder.fit_transform(yp)
    Xp_train, Xp_test, yp_train, yp_test = train_test_split(Xp, yp, test_size=0.2, shuffle=True, random_state=42)
    
    p_transformer=make_column_transformer((CountVectorizer(ngram_range=(1,4), min_df=2), "text"),
                                            (CountVectorizer(ngram_range=(1,3), min_df=2), "lemphrases")
                                            )

    p_model=make_pipeline(p_transformer, MultinomialNB())
    p_model.fit(Xp_train, yp_train)
    
    score=p_model.score(Xp_train, yp_train)
    training_scores.append(score)
    
    score=p_model.score(Xp_test, yp_test)
    validation_scores.append(score)   
     
    
model_selector(train_set) 

results=pd.DataFrame()
results["Features"]=["Original Text","Lemmatized","Pipeline"]
results["Training"]=training_scores
results["Validation"]=validation_scores

results

### XGBClassifier

In [None]:
def xgb_prediction(X,y):

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    CV=CountVectorizer(ngram_range=(1,4), min_df=2)
    transformer=CV.fit(X_train)

    text_train=transformer.transform(X_train)
    text_test=transformer.transform(X_test)

    xgb_model = XGBClassifier(learning_rate=0.1, max_depth=8, n_estimators=250, random_state=42)
    xgb_model.fit(text_train, y_train)
        
    from sklearn.calibration import CalibratedClassifierCV
    cal_model = CalibratedClassifierCV(xgb_model, method="sigmoid", cv=3)
    cal_model.fit(text_train, y_train)
        
    #print(f"Training score: {xgb_model.score(text_train, y_train)}") 
    print(f"Validation score: {xgb_model.score(text_test, y_test)}")
    print(f"Calibrated Validation score: {cal_model.score(text_test, y_test)}")

In [None]:
X=train_set["text"]
y=train_set["author"]
y=labelencoder.fit_transform(y)
xgb_prediction(X,y)

I'm trying the model on the original texts first, and then I'll repeat it for the lemmatized phrases. It's interesting to note that the non-lemmatized text without calibration gives better results than the lemmatized input, with or without calibration.

In [None]:
X=train_set["lemphrases"]
xgb_prediction(X,y)

I've already used MultinomialNB (Naive Bayes) above, and here I'm using XGBClassifier as the estimator. It's possible to add cross-validation and to compare the results for various parameters by using, for example, GridSearchCV. I've also applied some calibration here, and I obtained better results with the original texts (the non-cleaned data :)). It's possible to get much better results with larger n_estimators and max_depth values (for example, 250 to 500 estimators with a depth of 10 to 12), but it takes significantly longer to train and test the model.
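
A parameter search along those lines could look roughly like the sketch below. To keep it fast I've swapped in MultinomialNB for XGBClassifier and used a tiny invented corpus, so the texts, labels and grid values are all assumptions; the step names `countvectorizer` and `multinomialnb` are the ones `make_pipeline` generates from the class names.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus: two invented sentences per author label.
texts = [
    "the raven tapped at my chamber door",
    "a raven sat above the chamber door",
    "the old ones sleep beneath the sea",
    "old ones dream beneath the dark sea",
    "my creature wept in the cold night",
    "the creature fled into the cold night",
]
labels = ["EAP", "EAP", "HPL", "HPL", "MWS", "MWS"]

pipe = make_pipeline(CountVectorizer(), MultinomialNB())

# Hypothetical grid; real values would depend on the data and time budget.
param_grid = {
    "countvectorizer__ngram_range": [(1, 1), (1, 2)],
    "multinomialnb__alpha": [0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```

Putting the vectorizer inside the pipeline means it is refit on each training fold, so the validation folds never leak into the vocabulary.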

In [None]:
# Apply the same cleaning and lemmatization to the test excerpts.
test_set['lems'] = test_set['text'].apply(lemmatize)
test_set['lemphrases'] = test_set['lems'].apply(' '.join)

We process the test data the same way as we did with the train/validation data.

### LogisticRegression (with a pipeline)

In [None]:
yp=train_set['author']
Xp=train_set.drop(['id','author', 'lems'], axis=1)
yp=labelencoder.fit_transform(yp)

Xp_train, Xp_test, yp_train, yp_test = train_test_split(Xp, yp, test_size=0.2, shuffle=True, random_state=42)

p_transformer=make_column_transformer((CountVectorizer(ngram_range=(1,4), min_df=2), "text"),
                                        (CountVectorizer(ngram_range=(1,3), min_df=2), "lemphrases")
                                        )

p_model=make_pipeline(p_transformer, LogisticRegression(penalty="l2", max_iter=2000, solver="newton-cg"))
p_model.fit(Xp_train,yp_train)
p_model.score(Xp_train,yp_train)

In [None]:
p_model.score(Xp_test,yp_test)

### LogisticRegression (K-fold cross-validation)

In [None]:
kfold = model_selection.KFold(n_splits=3, shuffle=True, random_state=2323)
results = model_selection.cross_val_score(p_model, Xp, yp, cv=kfold)
print("Accuracy: %.1f%%" % (results.mean()*100.0))

In [None]:
f_model=p_model.fit(Xp, yp) #final model

In [None]:
Xt=test_set.drop(['id','lems'], axis=1)
probs=f_model.predict_proba(Xt)

In [None]:
# Clip very small and very large probabilities into [0.01, 0.99] so that
# no single prediction is extremely confident.
probs = np.clip(probs, 0.01, 0.99)
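
To see why limiting extreme probabilities can help, here is a small sketch of the penalty a single prediction would receive under multi-class log loss. Note the assumption: this notebook doesn't state how submissions are scored, so log loss is my guess at the metric, and the helper names `clip` and `penalty` are hypothetical.

```python
import math

def clip(p, lo=0.01, hi=0.99):
    """Clamp a probability into the interval [lo, hi]."""
    return min(max(p, lo), hi)

def penalty(p_true_class):
    """Log-loss contribution of one sample, given the probability of its true class."""
    return -math.log(p_true_class)

# A confident but wrong prediction: the true class got almost no probability.
raw = 1e-6
print(penalty(raw))        # about 13.8
print(penalty(clip(raw)))  # about 4.6
```

The trade-off is that correct, very confident predictions also lose a little, since 0.99 is penalized slightly more than a value closer to 1.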

In [None]:
pd.set_option('display.float_format', lambda x: '%.10f' % x)
preds=pd.DataFrame(data=probs, columns = ["EAP","HPL","MWS"])
results3 = pd.concat([test_set[['id']], preds], axis=1)

results3.head()

And there you have it, the results! :)) They aren't the best and are obviously far from perfect, but they can be improved with more complex models and more careful calibration. For example, averaging the predictions of the MultinomialNB and LogisticRegression models gives better predictions. It's also possible to average the results of all three models above. Alternatively, and even better, one could fit a simple linear regression on the results of two or three different models.
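
The averaging idea can be sketched in a few lines. The probability tables below are invented, and `average_probs` is a hypothetical helper; in the notebook the inputs would be the `predict_proba` outputs of the fitted models, with columns in the same class order.

```python
def average_probs(*prob_tables):
    """Average class probabilities row by row across several models."""
    return [
        [sum(vals) / len(vals) for vals in zip(*rows)]
        for rows in zip(*prob_tables)
    ]

# Invented predictions for two excerpts; columns are EAP, HPL, MWS.
nb_probs = [[0.8, 0.1, 0.1], [0.2, 0.5, 0.3]]
lr_probs = [[0.6, 0.3, 0.1], [0.4, 0.3, 0.3]]

avg = average_probs(nb_probs, lr_probs)
print(avg)
```

Because each input row sums to 1, the averaged rows do too, so the result can be used directly as a set of class probabilities.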