To identify the author of a text we need two things: features and a classifier. My approach here is to use one classifier to tag words and another to classify the features. First I take the treebank POS-tagged words from the NLTK package and use them to train the tagger. POS means part of speech: given a word, the tagger decides whether it is, for example, a verb (VB) or an adjective (JJ). The idea is to use tagged words (or sequences of them) as features.
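
As a small illustration of what tagged words look like, the sketch below counts how often each tag occurs in one sentence. The sentence is tagged by hand (an assumption for the example, not output from the tagger trained later), and the helper name `tag_counts` is hypothetical.

```python
from collections import Counter

# Hand-tagged example sentence (assumed tags, not produced by the trained tagger).
tagged_sentence = [
    ("The", "DT"), ("old", "JJ"), ("house", "NN"),
    ("stood", "VBD"), ("on", "IN"), ("a", "DT"),
    ("dark", "JJ"), ("hill", "NN"),
]


def tag_counts(tagged_tokens):
    """Count how many times each POS tag appears in a tagged sentence."""
    return Counter(tag for _word, tag in tagged_tokens)


counts = tag_counts(tagged_sentence)
print(counts["JJ"])   # number of adjectives
print(counts["VBD"])  # number of past-tense verbs
```

Counts like these are the kind of per-sentence numbers that can later be stored in a feature dictionary.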

In [None]:
from nltk.corpus import treebank
from nltk.tag.sequential import ClassifierBasedPOSTagger

Now the other imports:

In [None]:
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
import numpy as np
import mglearn
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer
import itertools

In [None]:
# -------------- Main code
train = pd.read_csv('train.csv')
train_sents = treebank.tagged_sents()
tagger = ClassifierBasedPOSTagger(train=train_sents)
stemmer = SnowballStemmer('english')

NLTK has a good parser, RegexpParser, which takes tag patterns in the appropriate format as its argument. For more information, follow this link:

[nltk book](http://www.nltk.org/api/nltk.chunk.htm)

Tags can give us information about the style of a given author, for example. I use this and search for features as "tag sequences", in other words: sequences of words with given tags. Each sequence is one unique feature. It can be more or less useful, but I believe this approach gives us many possibilities: see get_sequence_tags().

In [None]:
# Define tag sequences (raw strings, so that "\$" is passed to the parser unchanged)
SEQ_1 = r"SEQ_1: {<DT|PP>?<JJ>*}"
SEQ_2 = r"SEQ_2: {<NN><DT|PP\$>?<JJ>}"
SEQ_3 = r"SEQ_3: {<NP>?<VERB>?<NP|JJ>}"
SEQ_4 = r"SEQ_4: {<VB.*><NP|PP|CLAUSE>+$}"

# One RegexpParser per tag sequence
cp1 = nltk.RegexpParser(SEQ_1)
cp2 = nltk.RegexpParser(SEQ_2)
cp3 = nltk.RegexpParser(SEQ_3)
cp4 = nltk.RegexpParser(SEQ_4)

lst_seq = [cp1, cp2, cp3, cp4]

In the code above I define four sequences, pass each of them as an argument to nltk.RegexpParser(), and collect the four nltk.RegexpParser objects in a list. This list is used in the function get_sequence_tags(). Here are all the functions we need: except for plot_confusion_matrix(), they are all for extracting features.

In [None]:
def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix', cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')


def get_number_of_spaces(sentence):
    return sentence.count(' ')


def get_number_of_capitals(sentence):
    n = sum(1 for c in sentence if c.isupper())
    return n


def get_number_of_nouns(taged_tokens):
    # Count all noun tags: singular, plural, proper singular and proper plural
    n = sum(1 for word, tag in taged_tokens if tag in ('NN', 'NNS', 'NNP', 'NNPS'))
    return n


def get_number_of_adjectives(taged_tokens):
    n = sum(1 for word, tag in taged_tokens if tag == 'JJ' or tag == 'JJR' or tag == 'JJS')
    return n


def get_count_of_tagged(taged_tokens, tag_in):
    n = sum(1 for word, tag in taged_tokens if tag == tag_in)
    return n


def is_past_tense(taged_tokens):
    n = sum(1 for word, tag in taged_tokens if tag == 'VBD')
    return (n > 0)


def is_modal(taged_tokens):
    n = sum(1 for word, tag in taged_tokens if tag == 'MD')
    return (n > 0)


def vocab_richness(sentence):
    unique = set(sentence.split())
    count_uniques = len(unique)
    return count_uniques


def get_first_words(sentence, count):
    arr_words = sentence.split()
    ret_words = arr_words[:count]
    str_ret = ' '.join(ret_words)
    return str_ret


def get_one_word(sentence, position):
    arr_words = sentence.split()
    if len(arr_words) >= (position + 1):
        ret_word = arr_words[position]
        return ret_word
    else:
        return False


def exists_she(sentense):
    # Match the whole word, not a substring such as "ashes"
    return 'she' in sentense.lower().split()


def exists_he(sentense):
    # Match the whole word, not a substring such as "the" or "she"
    return 'he' in sentense.lower().split()



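def first_tag(taged_tokens):
    # Guard against an empty token list, like second_tag() and third_tag()
    if len(taged_tokens) > 0:
        return str(taged_tokens[0][1])
    else:
        return False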


def second_tag(taged_tokens):
    if len(taged_tokens) > 1:
        return str(taged_tokens[1][1])
    else:
        return False


def third_tag(taged_tokens):
    if len(taged_tokens) > 2:
        return str(taged_tokens[2][1])
    else:
        return False

def get_consonant_letters(sentence):
    consonants = 0
    for word in sentence:
        for letter in word:
            if letter in 'bcdfghjklmnpqrstvwxz':
                consonants += 1

    return consonants


def get_sonant_letters(sentence):
    sonants = 0
    for word in sentence:
        for letter in word:
            if letter in 'aieouy':
                sonants += 1

    return sonants


def lexical_diversity(text):
    # Avoid division by zero when the text is empty
    if len(text) == 0:
        return 0.0
    return len(set(text)) / len(text)


def get_sequence_tags(taged_tokens, n_sequence):
    countSequence = 0
    cp = lst_seq[n_sequence-1]
    result = cp.parse(taged_tokens)

    for tre in result:
        if isinstance(tre, nltk.tree.Tree):
            if tre.label() ==  cp._stages[0]._chunk_label:
                countSequence += 1

    return (countSequence > 0)
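
To make the idea behind get_sequence_tags() concrete, here is a minimal sketch that checks for a tag sequence with the standard `re` module instead of nltk.RegexpParser. It is a simplified stand-in, not NLTK's implementation: the helper `has_tag_sequence` and the pattern are hypothetical, and the tags are written by hand.

```python
import re


def has_tag_sequence(tagged_tokens, tag_pattern):
    """Return True if the tag pattern matches somewhere in the sentence's tags."""
    # Encode the sentence as "<DT><JJ><NN>..." so a regex can run over the tags only.
    tag_string = "".join("<{}>".format(tag) for _word, tag in tagged_tokens)
    return re.search(tag_pattern, tag_string) is not None


# Hand-tagged sentences (assumed tags for illustration).
with_sequence = [("the", "DT"), ("dark", "JJ"), ("hill", "NN"), ("slept", "VBD")]
without_sequence = [("he", "PRP"), ("slept", "VBD")]

# Optional determiner, one or more adjectives, then a noun.
pattern = r"(<DT>)?(<JJ>)+<NN>"

print(has_tag_sequence(with_sequence, pattern))
print(has_tag_sequence(without_sequence, pattern))
```

As in get_sequence_tags(), the result is a boolean: the feature records only whether the sequence occurs, not how often.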

The most important function is get_sentence_features(), so let me comment on the code a little.
This is the right place to use NLTK's SnowballStemmer: when the input sentence comes into the function,
each word is first transformed by the stemmer. What does it do? It simply takes a word and returns its stem.
Then the sentence is tokenized with nltk.wordpunct_tokenize, and finally the list of tokens
goes to the tagger. The variable taged_tokens is passed as an argument to the feature functions.

In [None]:
def get_sentence_features(sentens_in):
    stemmed_words = list()
    for w in sentens_in.split():
        stemmed_words.append(stemmer.stem(w))

    sentence = ' '.join(stemmed_words)
    word_tokens = nltk.wordpunct_tokenize(sentence)

    taged_tokens = tagger.tag(word_tokens)

    X_dict = {}

    X_dict['seq_01'] = get_sequence_tags(taged_tokens, 1)
    X_dict['seq_02'] = get_sequence_tags(taged_tokens, 2)
    X_dict['seq_03'] = get_sequence_tags(taged_tokens, 3)
    X_dict['seq_04'] = get_sequence_tags(taged_tokens, 4)

    X_dict['lexical_diversity'] = lexical_diversity(sentence.lower())
    X_dict['get_consonant_letters'] = get_consonant_letters(sentence.lower())
    X_dict['get_sonant_letters'] = get_sonant_letters(sentence.lower())

    X_dict['count_of_spaces'] = get_number_of_spaces(sentence)
    X_dict['count_capitals'] = get_number_of_capitals(sentence)
    X_dict['count_nouns'] = get_number_of_nouns(taged_tokens)
    X_dict['count_adjectives'] = get_number_of_adjectives(taged_tokens)

    X_dict['count_numbers'] = get_count_of_tagged(taged_tokens, 'CD')
    X_dict['count_NNS'] = get_count_of_tagged(taged_tokens, 'NNS')
    X_dict['count_NNP'] = get_count_of_tagged(taged_tokens, 'NNP')
    X_dict['count_NNPS'] = get_count_of_tagged(taged_tokens, 'NNPS')
    X_dict['count_RBS'] = get_count_of_tagged(taged_tokens, 'RBS')
    X_dict['count_RBR'] = get_count_of_tagged(taged_tokens, 'RBR')
    X_dict['count_WP'] = get_count_of_tagged(taged_tokens, 'WP')
    X_dict['count_WP$'] = get_count_of_tagged(taged_tokens, 'WP$')
    X_dict['count_WRB'] = get_count_of_tagged(taged_tokens, 'WRB')
    X_dict['count_PRP'] = get_count_of_tagged(taged_tokens, 'PRP')
    X_dict['count_POS'] = get_count_of_tagged(taged_tokens, 'POS')
    X_dict['count_FW'] = get_count_of_tagged(taged_tokens, 'FW')
    X_dict['count_VB'] = get_count_of_tagged(taged_tokens, 'VB')
    X_dict['count_VBD'] = get_count_of_tagged(taged_tokens, 'VBD')
    X_dict['count_VBG'] = get_count_of_tagged(taged_tokens, 'VBG')
    X_dict['count_VBN'] = get_count_of_tagged(taged_tokens, 'VBN')
    X_dict['count_CC'] = get_count_of_tagged(taged_tokens, 'CC')

    X_dict['count_DT']         = get_count_of_tagged(taged_tokens, 'DT')
    X_dict['count_UH']         = get_count_of_tagged(taged_tokens, 'UH')
    X_dict['count_SYM']        = get_count_of_tagged(taged_tokens, 'SYM')
    X_dict['count_PDT']        = get_count_of_tagged(taged_tokens, 'PDT')
    X_dict['count_LS']         = get_count_of_tagged(taged_tokens, 'LS')

    X_dict['count_3rd person'] = get_count_of_tagged(taged_tokens, 'VBZ')
    X_dict['count_gerund'] = get_count_of_tagged(taged_tokens, 'VBG')

    X_dict['is_past_tense'] = is_past_tense(taged_tokens)
    X_dict['is_modal'] = is_modal(taged_tokens)
    X_dict['vocab_richness'] = vocab_richness(sentence)
    X_dict['first_tag'] = first_tag(taged_tokens)
    X_dict['second_tag'] = second_tag(taged_tokens)
    X_dict['third_tag'] = third_tag(taged_tokens)
    
    X_dict['first_one_word'] = get_one_word(sentence, 0)
    X_dict['second_one_word'] = get_one_word(sentence, 1)
    X_dict['third_one_word'] = get_one_word(sentence, 2)
    X_dict['forth_one_word'] = get_one_word(sentence, 3)
    X_dict['fifth_one_word'] = get_one_word(sentence, 4)
    X_dict['sixth_one_word'] = get_one_word(sentence, 5)
    X_dict['seventh_one_word'] = get_one_word(sentence, 6)
    X_dict['eith_one_word'] = get_one_word(sentence, 7)
    X_dict['ninth_one_word'] = get_one_word(sentence, 8)
    X_dict['tenth_one_word'] = get_one_word(sentence, 9)

    X_dict['first_6_word'] = get_first_words(sentence, 6)
    X_dict['first_5_word'] = get_first_words(sentence, 5)
    X_dict['first_4_word'] = get_first_words(sentence, 4)
    X_dict['first_3_word'] = get_first_words(sentence, 3)
    X_dict['first_2_word'] = get_first_words(sentence, 2)

    X_dict['exists_she'] = exists_she(sentence)
    X_dict['exists_he']  = exists_he(sentence)

    X_dict['first_word_is_the'] = ('the' == get_first_words(sentence.lower(), 1))
    X_dict['first_word_is_she'] = ('she' == get_first_words(sentence.lower(), 1))
    X_dict['first_word_is_he']  = ('he' == get_first_words(sentence.lower(), 1))
    X_dict['first_word_is_it']  = ('it' == get_first_words(sentence.lower(), 1))
    X_dict['first_word_is_this'] = ('this' == get_first_words(sentence.lower(), 1))
    X_dict['first_word_is_you']  = ('you' == get_first_words(sentence.lower(), 1))


    X_dict['Raymond'] = ('raymond' in sentence.lower())
    X_dict['Perdita'] = ('perdita' in sentence.lower())
    X_dict['Idris']   = ('idris' in sentence.lower())
    X_dict['Adrian']  = ('adrian' in sentence.lower())
    X_dict['Chapter'] = ('chapter' in sentence.lower())
    X_dict['sinister'] = ('sinister' in sentence.lower())
    X_dict['weird']    = ('weird' in sentence.lower())
    X_dict['horrible'] = ('horrible' in sentence.lower())

    return X_dict

 
This function returns a dictionary of features that are used later for classification. Let me explain the method I use to extract features from a sentence. I mentioned tag sequences above. What is a tag? It is a grammatical description of a given word. The full name of these tags is POS, or part-of-speech, tags. A tag marks a word as a verb or an adjective, for example. In my opinion every author, not only horror authors, uses a distinct style and structure in their writing. This style can be described with features.

The full list of POS tags can be viewed here:

[penn_treebank_pos](http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)

The variables SEQ_1, SEQ_2, SEQ_3 and SEQ_4 are just strings used as arguments for nltk.RegexpParser(). Here you can find detailed information on how to use this parser:

[nltk.org book](http://www.nltk.org/book/ch07.html)

Pay attention to the functions I use for feature extraction. get_sonant_letters() counts all the vowel letters of the English language: "aieouy". Another function, get_consonant_letters(), counts the consonant letters: "bcdfghjklmnpqrstvwxz". The counts of these two types of letters are our features.

Another important feature is vocab_richness: it gives the count of unique words in the sentence. There is also the lexical_diversity feature, which gives the ratio of distinct characters to all characters in the sentence. The function get_count_of_tagged(taged_tokens, 'CD') returns the count of cardinal numbers (tag "CD"), and get_count_of_tagged(taged_tokens, 'NNP') returns the count of singular proper nouns. In this way we can count the different POS tags in the sentence, so we can detect, for example, the use of third-person verb forms or whether the sentence is in the past tense. Additionally, I take the first, second, etc. words, each as a feature. The last features are just words from train.csv. They are chosen empirically, without any special method: just words with high counts.

However, they do not give much information about the author and we cannot rely on them: there is no guarantee that these names are used in other texts by the same authors.
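
The difference between the two measures described above, vocab_richness and lexical_diversity, is easy to see on a short made-up sentence. The two functions below follow the same logic as the ones in the notebook; the sentence is an assumption for illustration.

```python
def vocab_richness(sentence):
    """Number of distinct whitespace-separated words."""
    return len(set(sentence.split()))


def lexical_diversity(text):
    """Distinct items divided by total items; on a string the items are characters."""
    return len(set(text)) / len(text)


sentence = "the cat saw the cat"  # made-up example sentence

print(vocab_richness(sentence))              # distinct words: the, cat, saw
print(lexical_diversity(sentence))           # character-level ratio
print(lexical_diversity(sentence.split()))   # word-level ratio for comparison
```

Passing a string gives a character-level ratio, while passing a list of words gives a word-level ratio, so the two calls measure different things.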

Now, using LabelEncoder, I transform the "y" labels into integers with the values [0, 1, 2]. I really need this because the code below works with these numeric class values.

In [None]:
lbl_enc = preprocessing.LabelEncoder()
y = lbl_enc.fit_transform(train.author.values)
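
What the encoding does can be sketched in plain Python: the distinct labels are sorted and each one is replaced by its position. This is a simplified illustration with made-up data, not scikit-learn's implementation.

```python
# Made-up author labels using the three class names from the data set.
authors = ["MWS", "EAP", "HPL", "EAP", "MWS"]

# Sort the distinct labels and give each one its position as an integer code.
classes = sorted(set(authors))
to_code = {label: code for code, label in enumerate(classes)}

y_codes = [to_code[author] for author in authors]

print(classes)
print(y_codes)
```

With this ordering EAP becomes 0, HPL becomes 1 and MWS becomes 2, which is the order in which the authors are discussed later.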


Then I split the input data into four variables: two for training data and two for validation data.

In [None]:
xtrain, xvalid, ytrain, yvalid = train_test_split(train.text.values, y,
                                                  stratify=y,
                                                  random_state=32,
                                                  test_size=0.2, shuffle=True)
print(xtrain.shape)
print(xvalid.shape)


Next I need the stop words. Removing stop words, a frequent step in NLP tasks, usually improves the share of correct classifications. Here I just added an array of additional values: ['.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}']

In [None]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
stop_words.update( ['.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}'])
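
The filtering itself is a simple membership test. The sketch below uses a tiny hand-written stop list (an assumption standing in for NLTK's English list) and a hypothetical helper `remove_stop_words`.

```python
# A tiny hand-written stop list (assumed; a stand-in for NLTK's English list).
example_stop_words = {"the", "a", "of", "and", ".", ","}


def remove_stop_words(tokens, stop_set):
    """Keep only tokens that are not in the stop list (exact, case-sensitive match)."""
    return [token for token in tokens if token not in stop_set]


tokens = ["The", "shadow", "of", "the", "tower", ",", "and", "a", "cry", "."]
filtered = remove_stop_words(tokens, example_stop_words)
print(filtered)
```

In this sketch the comparison is case-sensitive, so the capitalized "The" survives; lowercasing the tokens before the check would remove it as well.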


The main process of extracting features begins here. For every item in the input IN_x we try to extract features. The list Out_x contains the feature dictionaries for all input data (IN_x), and Out_y is a list with the labels. As the final result we have two pairs:

1. X_Train, Y_train -> for training data;
2. X_valid, Y_valid -> for validation;

In [None]:
def get_train_features(IN_x, IN_y):
    Out_x, Out_y = [], []
    index = 0

    for sentens_edna in IN_x:
        word_tokens1 = [i for i in word_tokenize(sentens_edna) if i not in stop_words]
        sentens_in = ' '.join(word_tokens1)
        X_feat_dict = get_sentence_features(sentens_in)
        Out_x.append(X_feat_dict)
        Out_y.append(IN_y[index])
        index += 1

    return Out_x, Out_y


X_Train, Y_train = get_train_features(xtrain, ytrain)
X_valid, Y_valid = get_train_features(xvalid, yvalid)

We build a pipeline and use cross-validation to achieve the best results.
The next step is to transform X_Train with DictVectorizer() from sklearn.feature_extraction. Our classifier is Naive Bayes, specifically the BernoulliNB() variant. During cross-validation, GridSearchCV passes different alpha parameters to BernoulliNB(), and the best result is shown.

We need Xtrain_ctv to train the classifier and X_valid_ctv for our confusion matrix.
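
The names grid, dict_vect, Xtrain_ctv and X_valid_ctv are used below without their defining cell being shown. A minimal sketch of how they could be produced follows, assuming the vectorizer is fitted separately and the pipeline contains only BernoulliNB (which would match the step name 'bernoullinb' used below). The toy data, the alpha grid and cv=2 are assumptions for illustration, not the settings actually used.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy feature dicts and labels (made up; the real input would be X_Train / Y_train).
toy_train = (
    [{"first_tag": "DT", "is_modal": False}] * 4
    + [{"first_tag": "PRP", "is_modal": True}] * 4
    + [{"first_tag": "NN", "is_modal": False}] * 4
)
toy_labels = [0] * 4 + [1] * 4 + [2] * 4
toy_valid = [
    {"first_tag": "DT", "is_modal": False},
    {"first_tag": "PRP", "is_modal": True},
]

# Turn the feature dicts into a sparse matrix; fit on training data only.
dict_vect = DictVectorizer()
Xtrain_ctv = dict_vect.fit_transform(toy_train)
X_valid_ctv = dict_vect.transform(toy_valid)

# Search over a few smoothing values (assumed grid) with cross-validation.
alphas = [0.01, 0.1, 1.0]
pipe = make_pipeline(BernoulliNB())
grid = GridSearchCV(pipe, param_grid={"bernoullinb__alpha": alphas}, cv=2)
grid.fit(Xtrain_ctv, toy_labels)

best_clf = grid.best_estimator_.named_steps["bernoullinb"]
toy_predictions = best_clf.predict(X_valid_ctv)
print("Best alpha:", best_clf.alpha)
print("Predictions:", toy_predictions)
```

make_pipeline names each step after its lowercased class name, which is why the parameter is addressed as bernoullinb__alpha.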

In [None]:
clf = grid.best_estimator_.named_steps['bernoullinb']
# coef_ is no longer available on BernoulliNB in recent scikit-learn releases;
# feature_log_prob_ holds the per-class log probabilities of the features
coef = clf.feature_log_prob_
best_alpha = clf.alpha
print("Best cross-validation alpha: {:.2f}".format(best_alpha))

We have to determine which of the features are important, that is, what their "coef" values are.
First we get the feature names, then with the help of the great tool mglearn we draw three graphics, one for every class.

In [None]:
# get_feature_names() was replaced by get_feature_names_out() in newer scikit-learn
feature_names = np.array(dict_vect.get_feature_names_out())

mglearn.tools.visualize_coefficients(coef[0], feature_names, n_top_features=25)
mglearn.tools.visualize_coefficients(coef[1], feature_names, n_top_features=25)
mglearn.tools.visualize_coefficients(coef[2], feature_names, n_top_features=25)

The mglearn.tools.visualize_coefficients method takes one row of coef as its first argument (coef is an array with three rows, one for every class). The next argument is the feature names, and n_top_features=25 means we want the 25 features with the best coefficients. The next three pictures show these features.
The coefficients are negative, and you can see the most important features on the right. Some of them are:
"get_consonant_letters",
"lexical_diversity",
"vocab_richness"
and others.
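
Why the coefficients are negative can be shown with a little arithmetic, assuming the plotted values are log probabilities of a binary feature being present within a class. The sketch uses a standard additive-smoothing estimate; the counts, the alpha value and the helper `smoothed_log_prob` are made up for illustration.

```python
import math


def smoothed_log_prob(feature_count, class_count, alpha):
    """Log of the smoothed probability that a binary feature is on within a class."""
    return math.log((feature_count + alpha) / (class_count + 2 * alpha))


# Made-up counts: in how many of 100 sentences of one class each feature is on.
feature_counts = {"is_modal": 40, "exists_she": 5, "first_tag=DT": 25}
class_count = 100
alpha = 1.0  # assumed smoothing value

log_probs = {
    name: smoothed_log_prob(count, class_count, alpha)
    for name, count in feature_counts.items()
}

# Rank features from most to least frequent within the class.
for name, value in sorted(log_probs.items(), key=lambda item: item[1], reverse=True):
    print("{:<14s} {:.3f}".format(name, value))
```

A probability is at most 1, so its logarithm is at most 0; values closer to zero belong to features that are switched on more often in that class.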
The first graphic shows the most important features for EAP.

![](https://i.imgur.com/gIp6PWo.png)

Next is the graphic for the second author, HPL. We see that the most important features here are almost the same.

![](https://i.imgur.com/D9kPt9i.png)

The last graphic is for the last author, MWS.
The most important features are a little bit different here.

![](https://i.imgur.com/ufNyje6.png)

I must explain that the features come from DictVectorizer(), which generates many, many features when it transforms the data.
So the features we can see in the function are only the basic features. The real features are the ones in the pictures above.
These are exactly the features that the classifier receives.
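
How a few basic features turn into many columns can be sketched without scikit-learn. The helper `expand_feature_names` below is hypothetical and only imitates the naming convention, where every distinct string value gets its own "name=value" column while numbers and booleans keep a single column.

```python
from numbers import Number


def expand_feature_names(feature_dicts):
    """Collect column names in the style of a one-hot dict vectorizer."""
    names = set()
    for features in feature_dicts:
        for key, value in features.items():
            if isinstance(value, str):
                # String values become one indicator column per distinct value.
                names.add("{}={}".format(key, value))
            elif isinstance(value, Number):
                # Numbers and booleans keep a single column under the plain key.
                names.add(key)
    return sorted(names)


# Made-up rows; second_tag is a string in one row and False in the other.
rows = [
    {"first_tag": "DT", "count_nouns": 2, "second_tag": "JJ"},
    {"first_tag": "PRP", "count_nouns": 0, "second_tag": False},
]
column_names = expand_feature_names(rows)
print(column_names)
```

Three basic features already give five columns here; word features such as first_one_word take many distinct values, which is why the real feature list grows so large.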


The confusion matrix shows the share of correctly predicted sentences for every author: the higher the diagonal values of the confusion matrix, the better, indicating many correct predictions.

In [None]:
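predictions = clf.predict_proba(X_valid_ctv)
predicted_lables = clf.predict(X_valid_ctv)

# Author names in encoded order (assumed here, since class_names is not defined above)
class_names = lbl_enc.classes_

cnf_matrix = confusion_matrix(Y_valid, predicted_lables)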

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
                      title='Normalized confusion matrix')

plt.show()

![](https://i.imgur.com/GKha9Zv.png)

Take a look at the diagonal, starting from the upper left corner. First is EAP, with 75% of its sentences predicted correctly; HPL has 60% and MWS 66%. We can also see here that the predictions are biased toward EAP.
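
The normalization behind these percentages is a row-wise division. The sketch below uses a made-up matrix of counts (not the results of this notebook) to show how the diagonal of the normalized matrix is obtained.

```python
# Made-up 3x3 confusion matrix: rows are true authors, columns are predictions.
example_names = ["EAP", "HPL", "MWS"]
example_counts = [
    [8, 1, 1],
    [2, 6, 2],
    [1, 2, 7],
]

# Divide each row by its total so that every row sums to 1.
normalized = [[cell / sum(row) for cell in row] for row in example_counts]

# The diagonal holds the share of each author's sentences predicted correctly.
diagonal = [normalized[i][i] for i in range(len(example_names))]
for name, share in zip(example_names, diagonal):
    print("{}: {:.2f}".format(name, share))
```

Because each row is divided by the number of true sentences of that author, the diagonal can be read as a per-author hit rate.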

Next is the Receiver Operating Characteristic curve (or ROC curve).
The ROC curve is created by plotting the true positive rate against the false positive rate.
The true positive rate is also known as sensitivity. ROC analysis is defined for binary classification, so we have one curve per class, each class against the rest.
On the graphic you see them in different colors. The legend shows the AUC, or area under the curve.
In a few words, the AUC represents the probability that a classifier will rank a randomly chosen positive observation higher than a randomly chosen negative observation.
Here you can find a good explanation of ROC and AUC:


[www.dataschool.io](http://www.dataschool.io/roc-curves-and-auc-explained/)

![](https://i.imgur.com/Te4Iaye.png)

In [None]:
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

n_classes = len(class_names)

# Binarize the output
Y_valid = label_binarize(Y_valid, classes=[0, 1, 2])

# Compute ROC curve and ROC area for each class
fpr     = dict()
tpr     = dict()
roc_auc = dict()

plt.figure(1)
plt.plot([0, 1], [0, 1], 'k--')


for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(  Y_valid[:,i] , predictions[:, i] )
    roc_auc[i] = auc(fpr[i], tpr[i])
    plt.plot(fpr[i], tpr[i], label=class_names[i] + ' ROC curve (area = %0.2f)' % roc_auc[i])

print('EAP ROC curve (area = %0.2f)' % roc_auc[0])
print('HPL ROC curve (area = %0.2f)' % roc_auc[1])
print('MWS ROC curve (area = %0.2f)' % roc_auc[2])

plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc="lower right")
plt.show()
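
The ranking interpretation of AUC mentioned above can be checked by hand on a small example. The sketch below counts, over all positive and negative pairs, how often the positive observation gets the higher score. The labels, the scores and the helper `pairwise_auc` are made up for illustration; this is not how roc_curve and auc compute the value internally.

```python
def pairwise_auc(labels, scores):
    """Share of (positive, negative) pairs where the positive gets the higher score."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up one-vs-rest labels (1 = this author) and predicted probabilities.
example_labels = [1, 1, 0, 1, 0, 0]
example_scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.2]

example_auc = pairwise_auc(example_labels, example_scores)
print("AUC = {:.3f}".format(example_auc))
```

Here 8 of the 9 positive and negative pairs are ordered correctly, so the value is 8/9; a classifier that ranks every positive above every negative would reach 1.0.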
