Some ideas to consider

* Word Frequency (WF)
* Sentence Length (SL)
* Paragraph Length (PL)
* Vocabulary Complexity (VC) (does the author use simple words, or more elegant ones?)
    Possibly with [Pattern](https://www.clips.uantwerpen.be/pattern)? This may be better known as 'lexical richness'.
* Bayesian Analysis (BA)
* Semantic Analysis (SA) (What mood does the author write in?)
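
As a rough illustration of two of the ideas above (sentence length and lexical richness), here is a minimal sketch. It assumes a naive regex sentence splitter and uses the type-token ratio (unique words divided by total words) as one common proxy for lexical richness. The function name `lexical_stats` and the sample text are made up for this example.

```python
import re

def lexical_stats(passage):
    """Return the type-token ratio and average sentence length (in words)."""
    # Naive sentence split after ., ! or ?; a stand-in for a real sentence tokenizer.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", passage.strip()) if s]
    # Lowercased alphabetic tokens (apostrophes kept so "don't" stays one word).
    words = re.findall(r"[a-z']+", passage.lower())
    if not words or not sentences:
        return {"ttr": 0.0, "avg_sentence_len": 0.0}
    return {
        "ttr": len(set(words)) / len(words),               # unique words / total words
        "avg_sentence_len": len(words) / len(sentences),   # words per sentence
    }

# Made-up sample text, not taken from the dataset.
sample = "The night was dark. The night was cold and the wind was loud."
stats = lexical_stats(sample)
print(stats)
```

Note that the type-token ratio tends to fall as passages get longer, so it is probably only comparable across passages of similar length.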

I would like to explore each of these, but the initial focus will be attempting some level of semantic analysis to determine the author of a particular phrase.  Semantic analysis alone will not be enough here, but it could be an important factor to consider.

Also, I will be pursuing the use of image classification for this problem.  Using https://pypi.python.org/pypi/pylinkgrammar we can generate sentence diagrams.  We can then turn these diagrams into images and train an image classifier on those images.

This will be our block for all imports.  We could do them inline; this just keeps them in one spot.

In [None]:
import numpy as np
import pandas as pd
import nltk
import matplotlib.pyplot as plt
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from wordcloud import WordCloud, STOPWORDS

Some obligatory 'first steps'.  Load our data and show a sample.

In [None]:
train = pd.read_csv("../input/train.csv")
train.head()

Show the 'shape' of the data, in this case 19759 passages written by three authors.

In [None]:
print(train.shape)

Here we just extract the text values by author, and print one to show some content.

In [None]:
eap = train[train.author=="EAP"]["text"].values
print(eap)
hpl = train[train.author=="HPL"]["text"].values
mws = train[train.author=="MWS"]["text"].values

Word clouds are a cool visual, and I copied this straight from another notebook.

In [None]:
def make_cloud(terms):
    wc = WordCloud(background_color="black", max_words=10000,
                   stopwords=STOPWORDS, max_font_size=40)
    wc.generate(" ".join(terms))
    return wc

In [None]:
plt.figure(figsize=(14,11))
plt.title("HP Lovecraft", fontsize=16)
plt.imshow(make_cloud(hpl).recolor(colormap='Pastel2', random_state=17), alpha=0.9)

Here is a function that will 'clean' the text a bit.  It tokenizes the text and removes all stop words.

In [None]:
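def clean_text(text):
    # Split the passage into word and punctuation tokens.
    text_list = nltk.word_tokenize(text)
    # A set gives fast membership checks against the English stop words.
    stopwords = set(nltk.corpus.stopwords.words('english'))
    return [word for word in text_list if word.lower() not in stopwords]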
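
Once stop words are out of the way, the word frequency idea from the list at the top is a short step. The sketch below is an assumption-laden stand-in, not the notebook's method: it uses a whitespace split and a tiny made-up stop list (`DEMO_STOPWORDS`) so it runs without NLTK data, and `top_words` is a hypothetical helper name.

```python
from collections import Counter

# Tiny hypothetical stop list so the sketch runs without NLTK data;
# the notebook itself uses nltk.corpus.stopwords.
DEMO_STOPWORDS = {"the", "a", "of", "and", "in", "was"}

def top_words(passages, n=3):
    """Count non-stop-word tokens across passages and return the n most common."""
    counts = Counter()
    for passage in passages:
        # A whitespace split stands in for nltk.word_tokenize here.
        tokens = (t.strip(".,;:!?\"'") for t in passage.lower().split())
        counts.update(t for t in tokens if t and t not in DEMO_STOPWORDS)
    return counts.most_common(n)

# Made-up passages, not taken from the dataset.
demo = ["The shadow of the house was old.", "A shadow in the old house."]
print(top_words(demo))
```

Comparing the top words per author (for example `top_words(hpl)` against `top_words(eap)`) might show which vocabulary is distinctive, though that is untested here.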

Next we'll create a sentiment analyzer

In [None]:
analyzer = SentimentIntensityAnalyzer()

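Here we calculate the 'avg semantic value' for an author, meaning the mean of the analyzer's positive, negative, and neutral scores over all of that author's passages. (This may be a poor approach, but I am just starting. :) ) Additionally, we track the average word length and the average number of words per passage.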

In [None]:
def calc_avg_stats(author, text):
    pos = 0 
    neg = 0 
    neu = 0
    word_lens = 0 
    total_words = 0
    length = text.size
    print('Analyzing %d passages for author: %s...' % (length, author))
    for s in text:
        res = analyzer.polarity_scores(s)
        pos += res['pos']
        neg += res['neg']
        neu += res['neu']
        words = clean_text(s)
        word_lens += sum([len(w) for w in words])
        total_words += len(words)
    return {
        'avg_pos': (pos/length), #avg positive semantic score
        'avg_neg': (neg/length), #avg negative semantic score
        'avg_neu': (neu/length), #avg neutral semantic score
        'avg_wlen': (word_lens/total_words), #avg word length
        'lex_rich': (total_words/length) #avg words per passage after stop-word removal (a rough stand-in for lexical richness)
        }

In [None]:
print(calc_avg_stats('HPL', hpl))
print(calc_avg_stats('MWS', mws))
print(calc_avg_stats('EAP', eap))
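
One possible way to use these per-author averages for the stated goal (guessing the author of a phrase) is a nearest-profile comparison. This is only a sketch under assumptions: `nearest_author` is a hypothetical helper, the profile values are made-up placeholders rather than results from the data, and plain Euclidean distance is just one of many choices.

```python
import math

def nearest_author(profiles, sample, keys=("avg_pos", "avg_neg", "avg_wlen")):
    """Return the author whose averaged stats are closest (Euclidean) to the sample's."""
    def distance(profile):
        return math.sqrt(sum((profile[k] - sample[k]) ** 2 for k in keys))
    return min(profiles, key=lambda author: distance(profiles[author]))

# Hypothetical numbers for illustration only; not results from the dataset.
profiles = {
    "A": {"avg_pos": 0.10, "avg_neg": 0.05, "avg_wlen": 5.0},
    "B": {"avg_pos": 0.05, "avg_neg": 0.12, "avg_wlen": 6.0},
}
sample = {"avg_pos": 0.06, "avg_neg": 0.10, "avg_wlen": 5.9}
print(nearest_author(profiles, sample))
```

Because the features sit on different scales (word length is much larger than the sentiment scores), word length dominates the distance here. Rescaling each feature first would likely be needed before this is useful.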