In [None]:
import warnings 
warnings.filterwarnings('ignore')
import nltk 
import pandas as pd 
from nltk.tokenize import RegexpTokenizer  # used to remove punctuation and tokenize the text
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_csv("../input/train.csv")

The goal of this analysis is to examine how much the authors differ in their use of three word types: adjectives, nouns and verbs. It is possible to distinguish between authors by word usage, but are some word types better than others for making the distinction?

The analysis requires part-of-speech (POS) tagging to identify and extract the adjectives, nouns and verbs. The steps I followed are:

1. Separate the 3 authors into 3 pandas datasets.
2. Concatenate all rows of the 'text' column in each dataset, remove punctuation and tokenize the text.
3. Tag the parts of speech.
4. Extract the adjectives, verbs and nouns for each author.
5. Investigate how each author uses the different word types.
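
The concatenate-and-tokenize step can be illustrated without NLTK. This is a minimal sketch, assuming that `re.findall(r"\w+", ...)` yields the same tokens as `RegexpTokenizer(r'\w+')` on this kind of input. The sample sentences are made up for illustration and are not from the dataset.

```python
import re

# Made-up sentences standing in for one author's rows of the 'text' column.
rows = ["The old man's hat fell.", "It was Old, very old!"]

# Concatenate the rows and lower-case them.
text = " ".join(rows).lower()

# \w+ keeps runs of letters, digits and underscores, so punctuation is dropped.
tokens = re.findall(r"\w+", text)
print(tokens)
```

Lower-casing makes 'Old' and 'old' count as the same word. The possessive "man's" is split into 'man' and 's', which is why single-character tokens are filtered out later in the notebook.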

In [None]:
df_eap = df[df['author'] == 'EAP'] #separate the 3 authors
df_hpl = df[df['author'] == 'HPL']
df_mws = df[df['author'] == 'MWS']

text_eap = df_eap['text'].str.cat(sep = ' ').lower() #concatenate all sentences for each author and set all characters to lower case
text_hpl = df_hpl['text'].str.cat(sep = ' ').lower()
text_mws = df_mws['text'].str.cat(sep = ' ').lower()

#I am changing everything to lower case because I will do some word frequency analysis and
#I don't want things like 'Old' and 'old' counted as different words

tokenizer = RegexpTokenizer(r'\w+') 
tokens_eap = tokenizer.tokenize(text_eap) #this will return a list of tokens(words) in lower case with punctuation removed
tokens_hpl = tokenizer.tokenize(text_hpl)
tokens_mws = tokenizer.tokenize(text_mws)
#note: one problem with the above approach is phrases like the "man's hat" will be tokenized into 3 tokens (man, s, hat)
pos_list_eap = nltk.pos_tag(tokens_eap) #this step will add the pos tags
pos_list_hpl = nltk.pos_tag(tokens_hpl)
pos_list_mws = nltk.pos_tag(tokens_mws)

The above code generates a list where each element is a tuple of the form (word, pos-tag). The next step is to extract the adjectives, so pull out each element whose pos-tag is JJ, JJR or JJS. You can get a complete list of tags by running: nltk.help.upenn_tagset() (the function prints the list itself, so it does not need to be wrapped in print).
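
To make the shape of the tagged data concrete, here is a minimal sketch that filters a hand-written list of (word, tag) pairs. The pairs are illustrative and are not real tagger output, and `ADJECTIVE_TAGS` is a name introduced only for this example.

```python
# Hand-written (word, tag) pairs in the shape returned by nltk.pos_tag;
# the tags here are illustrative, not real tagger output.
tagged = [
    ("the", "DT"), ("old", "JJ"), ("house", "NN"),
    ("was", "VBD"), ("darker", "JJR"), ("than", "IN"),
    ("the", "DT"), ("oldest", "JJS"), ("tomb", "NN"),
]

# Hypothetical constant: the three Penn Treebank adjective tags.
ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}

# Keep only the words whose tag is one of the adjective tags.
adjectives = [word for word, tag in tagged if tag in ADJECTIVE_TAGS]
print(adjectives)
```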

In [None]:
# function to test whether a tag marks an adjective, comparative or superlative
def is_adj(pos):
    return pos in ('JJ','JJR','JJS')

adj_eap = [word for word, pos in pos_list_eap if is_adj(pos) and len(word) > 1] #this is just a list of all adjectives for EAP
adj_hpl = [word for word, pos in pos_list_hpl if is_adj(pos) and len(word) > 1]
adj_mws = [word for word, pos in pos_list_mws if is_adj(pos) and len(word) > 1]

# I added the > 1 test because 'i' is sometimes marked as JJ. I think this is because JJ also covers numerals and ordinals, maybe 
#'i' is being seen as Roman numeral for one(?)

freq_eap = nltk.FreqDist(adj_eap) #this gets the frequency distribution for the adjectives in the list adj_eap
freq_hpl = nltk.FreqDist(adj_hpl)
freq_mws = nltk.FreqDist(adj_mws)

#if you want to print out the top twenty list use:
#print(freq_eap.most_common(20))
#print(freq_hpl.most_common(20))
#print(freq_mws.most_common(20))

Visualising the ratio of adjectives used to total words used.

In [None]:
eap = len(adj_eap)/len(tokens_eap)#number of adjectives divided by total number of words for EAP
hpl = len(adj_hpl)/len(tokens_hpl)
mws = len(adj_mws)/len(tokens_mws)

d = {'EAP':eap, 'HPL': hpl, 'MWS':mws} 

plt.bar(range(len(d)), list(d.values()), align='center')  # list() avoids passing dict views to matplotlib
plt.xticks(range(len(d)), list(d.keys()))
plt.title("Adjectives used as a fraction of total words for each author")

plt.show()

Note: the POS tagger is not 100% accurate. Some of the words it marked as 'JJ' include 'adrian', which is a name and not an adjective, and words like 'such' and 'other', which I think are determiners rather than adjectives. So we can't assume that every value returned by the tagger is what we expect it to be.
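
One way to limit the effect of such mis-tags is to drop a hand-maintained exclusion list before counting. This is a minimal sketch and not part of the analysis in this notebook. `SUSPECT_WORDS` and the sample word list are hypothetical.

```python
from collections import Counter

# Hypothetical sample of words the tagger labelled as adjectives.
tagged_as_adj = ["old", "such", "other", "adrian", "old", "great", "such"]

# Hypothetical exclusion list built by inspecting the tagger's output by hand.
SUSPECT_WORDS = {"such", "other", "adrian"}

# Drop the suspect words, then count what is left.
cleaned = [w for w in tagged_as_adj if w not in SUSPECT_WORDS]
counts = Counter(cleaned)
print(counts.most_common())
```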

Based on the training data, HPL uses adjectives the most (just under 10% of all words) and MWS the least (about 7.5% of all words).

Top 20 adjectives for each author

In [None]:
freq_eap.plot(20,cumulative=False,title='top 20 adjectives for EAP') #looking just at top 20 adjectives  for EAP

In [None]:
freq_hpl.plot(20,cumulative=False,title='top 20 adjectives for HPL')  # HPL

Note HPL's use of the word 'old'; he really does like that word.

In [None]:
freq_mws.plot(20,cumulative=False, title='top 20 adjectives for MWS')  # MWS

To further investigate the degree of similarity in adjectives used by the different authors I will convert the lists of adjectives to sets and calculate the Jaccard similarities.
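
For reference, the Jaccard similarity of two sets A and B is |A ∩ B| / |A ∪ B|: 0 means no shared words and 1 means identical vocabularies. Below is a small worked sketch with made-up word sets. The function name `jaccard` is used only for this illustration.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Made-up vocabularies, not taken from the dataset.
words_a = {"old", "great", "strange", "black"}
words_b = {"old", "great", "little", "dear"}

# 2 shared words out of 6 distinct words.
score = jaccard(words_a, words_b)
print(score)
```

Because the inputs are converted to sets, how often a word is used does not affect the score. Only whether it appears at all matters.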

In [None]:
def jaccard_similarity(x, y):
    # size of the intersection divided by size of the union of the two word sets
    set_x, set_y = set(x), set(y)
    return len(set_x & set_y) / len(set_x | set_y)

set_eap = set(adj_eap)
set_hpl = set(adj_hpl)
set_mws = set(adj_mws)
print('Jaccard Similarity scores for adjectives used by each pair of authors')
print('EAP - HPL: ' + str(jaccard_similarity(set_eap,set_hpl)))
print('EAP - MWS: ' + str(jaccard_similarity(set_eap,set_mws)))
print('MWS - HPL: ' + str(jaccard_similarity(set_hpl,set_mws)))

These numbers are quite low, which is good news for everyone using word usage to distinguish authors, at least for the adjectives; high numbers would mean a lot of overlap in word usage. It looks like the adjectives used by HPL and MWS are the most divergent, while those used by EAP and MWS are the most similar.

**Verb usage:**

In [None]:
def is_verb(pos):
    return pos in ('VB','VBD','VBG','VBN','VBP','VBZ')

verb_eap = [word for word, pos in pos_list_eap if is_verb(pos) and len(word) > 1] 
verb_hpl = [word for word, pos in pos_list_hpl if is_verb(pos) and len(word) > 1]
verb_mws = [word for word, pos in pos_list_mws if is_verb(pos) and len(word) > 1]
#the >1 test prevents things like 'i' being tagged, there are no verbs in English with just 1 character
freq_eap_verb = nltk.FreqDist(verb_eap) 
freq_hpl_verb = nltk.FreqDist(verb_hpl)
freq_mws_verb = nltk.FreqDist(verb_mws)

#if you want to print out the top twenty list use:
#print(freq_eap_verb.most_common(20))
#print(freq_hpl_verb.most_common(20))
#print(freq_mws_verb.most_common(20))

In [None]:
eap = len(verb_eap)/len(tokens_eap)
hpl = len(verb_hpl)/len(tokens_hpl)
mws = len(verb_mws)/len(tokens_mws)

d = {'EAP':eap, 'HPL': hpl, 'MWS':mws} 

plt.bar(range(len(d)), list(d.values()), align='center')  # list() avoids passing dict views to matplotlib
plt.xticks(range(len(d)), list(d.keys()))
plt.title("Verbs used as a fraction of total words for each author")
plt.show()

Note that each author uses a lot more verbs than adjectives.

Top twenty verbs:

In [None]:
freq_eap_verb.plot(20,cumulative=False,title='top 20 verbs for EAP')

In [None]:
freq_hpl_verb.plot(20,cumulative=False,title='top 20 verbs for HPL')

In [None]:
freq_mws_verb.plot(20,cumulative=False,title='top 20 verbs for MWS')

In [None]:
# jaccard_similarity() is already defined in the adjectives section above

set_eap = set(verb_eap)
set_hpl = set(verb_hpl)
set_mws = set(verb_mws)
 
print('Jaccard Similarity scores for verbs used by each pair of authors')
print('EAP - HPL: ' + str(jaccard_similarity(set_eap,set_hpl)))
print('EAP - MWS: ' + str(jaccard_similarity(set_eap,set_mws)))
print('MWS - HPL: ' + str(jaccard_similarity(set_hpl,set_mws)))

These values are higher than for adjectives, so there is more overlap in the verbs used. To put it another way, adjectives are better than verbs at distinguishing authors.
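
Comparing word types like this means repeating the same three pairwise calculations for each one. A minimal sketch of a helper that does this in one call, assuming a hypothetical `pairwise_jaccard` function and made-up word lists:

```python
from itertools import combinations

def pairwise_jaccard(words_by_author):
    """Return {(author_a, author_b): Jaccard similarity} for every pair."""
    sets = {author: set(words) for author, words in words_by_author.items()}
    scores = {}
    for a, b in combinations(sorted(sets), 2):
        union = sets[a] | sets[b]
        # Guard against two empty word lists, where the union is empty.
        scores[(a, b)] = len(sets[a] & sets[b]) / len(union) if union else 0.0
    return scores

# Made-up word lists; in the notebook these could be verb_eap, verb_hpl, verb_mws.
sample = {
    "EAP": ["said", "made", "found", "said"],
    "HPL": ["said", "saw", "heard"],
    "MWS": ["said", "made", "felt"],
}
scores = pairwise_jaccard(sample)
for pair, score in scores.items():
    print(pair, round(score, 3))
```

The same helper could be called once per word type, which would keep the adjective, verb, noun and stopword comparisons consistent.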

**Noun usage:**

In [None]:
def is_noun(pos):
    return pos in ('NN','NNP','NNPS','NNS')

noun_eap = [word for word, pos in pos_list_eap if is_noun(pos) and len(word) > 1]
noun_hpl = [word for word, pos in pos_list_hpl if is_noun(pos) and len(word) > 1]
noun_mws = [word for word, pos in pos_list_mws if is_noun(pos) and len(word) > 1]
#the >1 test gets rid of the 's' problem mentioned above, it also gets rid of the pronoun 'i' which dominates
#the plots
freq_eap_noun = nltk.FreqDist(noun_eap) 
freq_hpl_noun = nltk.FreqDist(noun_hpl)
freq_mws_noun = nltk.FreqDist(noun_mws)

#if you want to print out the top twenty list use:
#print(freq_eap_noun.most_common(20))
#print(freq_hpl_noun.most_common(20))
#print(freq_mws_noun.most_common(20))

In [None]:
eap = len(noun_eap)/len(tokens_eap)
hpl = len(noun_hpl)/len(tokens_hpl)
mws = len(noun_mws)/len(tokens_mws)

d = {'EAP':eap, 'HPL': hpl, 'MWS':mws} 

plt.bar(range(len(d)), list(d.values()), align='center')  # list() avoids passing dict views to matplotlib
plt.xticks(range(len(d)), list(d.keys()))
plt.title("Nouns used as a fraction of total words for each author")
plt.show()

Nouns are used more than either verbs or adjectives. There isn't any notable difference between the authors in the ratio of nouns used.

In [None]:
freq_eap_noun.plot(20,cumulative=False,title='top 20 nouns for EAP')

In [None]:
freq_hpl_noun.plot(20,cumulative=False,title='top 20 nouns for HPL')

In [None]:
freq_mws_noun.plot(20,cumulative=False,title='top 20 nouns for MWS')

In [None]:
# jaccard_similarity() is already defined in the adjectives section above

set_eap = set(noun_eap)
set_hpl = set(noun_hpl)
set_mws = set(noun_mws)
 
print('Jaccard Similarity scores for nouns used by each pair of authors')
print('EAP - HPL: ' + str(jaccard_similarity(set_eap,set_hpl)))
print('EAP - MWS: ' + str(jaccard_similarity(set_eap,set_mws)))
print('MWS - HPL: ' + str(jaccard_similarity(set_hpl,set_mws)))

MWS and HPL have the lowest overlap in terms of the nouns they use. EAP and MWS are closest in their use of nouns. I find it interesting that all three authors have the nouns man/men and time in their top six nouns even though they are writing about different things, and two of them have 'eyes' in the top six nouns. 
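
The overlap between the authors' top nouns can also be checked programmatically rather than read off the plots. This is a minimal sketch, assuming a hypothetical `shared_top_words` helper. The noun lists are made up and are not the dataset's actual counts.

```python
from collections import Counter

def shared_top_words(word_lists, n):
    """Words that appear in the top-n most frequent list of every author."""
    top_sets = [
        {word for word, _ in Counter(words).most_common(n)}
        for words in word_lists
    ]
    return set.intersection(*top_sets)

# Made-up noun lists standing in for noun_eap, noun_hpl and noun_mws.
eap_nouns = ["time", "man", "eyes", "time", "man", "door"]
hpl_nouns = ["man", "night", "time", "man", "time", "house"]
mws_nouns = ["time", "heart", "man", "time", "man", "love"]

common = shared_top_words([eap_nouns, hpl_nouns, mws_nouns], n=2)
print(sorted(common))
```

Note that `most_common` breaks ties at the cutoff by the order in which words were first seen, so words with equal counts near position n may or may not be included.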



**Analysis of stopwords:**

In [None]:
stopwords = set(nltk.corpus.stopwords.words('english'))  # a set makes the membership tests below much faster

eap_stop = [t for t in tokens_eap if t in stopwords]
hpl_stop = [t for t in tokens_hpl if t in stopwords]
mws_stop = [t for t in tokens_mws if t in stopwords]

freq_eap_stop = nltk.FreqDist(eap_stop) 
freq_hpl_stop = nltk.FreqDist(hpl_stop)
freq_mws_stop = nltk.FreqDist(mws_stop)

freq_eap_stop.plot(20,cumulative=False,title='top 20 stop words for EAP')

In [None]:
freq_hpl_stop.plot(20,cumulative=False,title='top 20 stop words for HPL')

In [None]:
freq_mws_stop.plot(20,cumulative=False,title='top 20 stop words for MWS')

Note the similarities in the graphs and the frequency distributions for the top stopwords. There is a lot less choice in words used when it comes to stopwords.

In [None]:
# jaccard_similarity() is already defined in the adjectives section above

set_eap = set(eap_stop)
set_hpl = set(hpl_stop)
set_mws = set(mws_stop)
 
print('Jaccard Similarity scores for stopwords used by each pair of authors')
print('EAP - HPL: ' + str(jaccard_similarity(set_eap,set_hpl)))
print('EAP - MWS: ' + str(jaccard_similarity(set_eap,set_mws)))
print('MWS - HPL: ' + str(jaccard_similarity(set_hpl,set_mws)))

These numbers are much higher than for the other word types. This is not surprising: the set of stopwords is quite small and authors often have little or no choice in the words they can use, because stopwords often perform grammatical functions and so are not optional or interchangeable.

**Third person (male v female) pronoun usage:**

In [None]:
pronoun = ['he','she','him','her','his']

eap_pronoun = [t for t in tokens_eap if t in pronoun]
hpl_pronoun = [t for t in tokens_hpl if t in pronoun]
mws_pronoun = [t for t in tokens_mws if t in pronoun]

freq_eap_pronoun = nltk.FreqDist(eap_pronoun) 
freq_hpl_pronoun = nltk.FreqDist(hpl_pronoun)
freq_mws_pronoun = nltk.FreqDist(mws_pronoun)

In [None]:
freq_eap_pronoun.plot(5,cumulative=False,title='top third person pronouns for EAP')

In [None]:
freq_hpl_pronoun.plot(5,cumulative=False,title='top third person pronouns for HPL')

In [None]:
freq_mws_pronoun.plot(5,cumulative=False,title='top third person pronouns for MWS')

If you choose to remove stopwords before making your predictions, how much data will you lose?
What percentage of all words used do stopwords represent for each author?

In [None]:
eap = len(eap_stop)/len(tokens_eap)
hpl = len(hpl_stop)/len(tokens_hpl)
mws = len(mws_stop)/len(tokens_mws)

d = {'EAP':eap, 'HPL': hpl, 'MWS':mws} 

plt.bar(range(len(d)), list(d.values()), align='center')  # list() avoids passing dict views to matplotlib
plt.xticks(range(len(d)), list(d.keys()))
plt.title("Stopwords as a fraction of total words for each author")
plt.show()

If you remove stopwords you will lose around 40% to 50% of the available data.

To test the effect of this I created two submission files using the code in Sohier Dane's kernel: How to Generate & Format a Simple Submission. I slightly changed the code by adding a function that removed stopwords before generating the submission file.
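
The stopword-removal function itself is not reproduced in this notebook. Below is a minimal sketch of what such a function could look like, assuming a hypothetical `remove_stopwords` helper and a tiny hand-written stopword set in place of NLTK's English list.

```python
import re

# Tiny hand-written stopword set for illustration only;
# the notebook uses nltk.corpus.stopwords.words('english').
STOPWORDS = {"the", "a", "of", "was", "in", "and", "it"}

def remove_stopwords(text, stopwords=STOPWORDS):
    """Lower-case and tokenize the text, then drop tokens found in `stopwords`."""
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(t for t in tokens if t not in stopwords)

sentence = "It was the darkest hour of the night."
cleaned = remove_stopwords(sentence)
print(cleaned)
```

In this made-up sentence five of the eight tokens are stopwords, which gives a feel for how much text such a filter can remove.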


**Results of testing**

The first submission (stopwords included) generated a score of 0.46887 and put me in 330th place. The second submission (stopwords removed) generated a score of 0.45376 and put me in 300th place, an improvement of about 3% in the score and 30 places on the leaderboard. For multinomial naive Bayes, removing the stopwords seems to result in a small improvement in the final result.
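
The relative improvement follows directly from the two scores, taking the lower score as the better one, as the text does. A short worked calculation (the variable names are illustrative):

```python
score_with_stopwords = 0.46887
score_without_stopwords = 0.45376

# Relative reduction in the score after removing stopwords.
relative_change = (score_with_stopwords - score_without_stopwords) / score_with_stopwords
print(f"{relative_change:.1%}")
```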

**Summary:** I think adjectives are the best words to use to distinguish between the authors (at least based on the training dataset), because they show the most divergence between authors. However, adjectives account for less than 10% of the total words used, so nouns and verbs cannot be discarded. Stopwords show the least divergence: they are a small set of grammatically essential words, so authors can't avoid using them and have little or no choice in the words they use (for example, there are no synonyms for 'the' or 'a' or 'is'...). My testing indicates a modest improvement for multinomial naive Bayes when stopwords are removed.
Overall, MWS and HPL are the most divergent in terms of the sets of words they use, so they should be a little easier to distinguish. EAP and MWS are the least divergent, so they might be more difficult to distinguish.