In [1]:
import pandas as pd

# Predicting Authorship of the Disputed Federalist Papers

The Federalist Papers are a collection of **85 essays** written by James Madison, Alexander Hamilton, and John Jay under the collective pseudonym "Publius" to promote the ratification of the United States Constitution.

Authorship of most of the papers was revealed some years later by Hamilton, though his claim to authorship of 12 of the papers was disputed for nearly 200 years (studies generally agree that the disputed essays were written by James Madison).

| Author | Papers |
| :- | -: |
| Jay | 2, 3, 4, 5, 64 |
| Madison | 10, 14, 37-48 |
| Hamilton | 1, 6, 7, 8, 9, 11, 12, 13, 15, 16, 17, 21-36, 59, 60, 61, 65-85 |
| Hamilton and Madison | 18, 19, 20 |
| Disputed | 49-58, 62, 63 |

The goal of this problem is to train a classifier that predicts the author of the disputed papers.

In [2]:
# load Federalist papers data
url = 'https://raw.githubusercontent.com/um-perez-alvaro/Data-Science-Practice/master/Data/papers.csv'
data = pd.read_csv(url)
data.head()

,paper,author
0,To the People of the State of New York: AFTE...,Hamilton
1,To the People of the State of New York: WHEN...,Jay
2,To the People of the State of New York: IT I...,Jay
3,To the People of the State of New York: MY L...,Jay
4,To the People of the State of New York: QUEE...,Jay


In [3]:
# Federalist paper No. 1
print(data.paper[0])

 To the People of the State of New York:  AFTER an unequivocal experience of the inefficacy of the subsisting federal government, you are called upon to deliberate on a new Constitution for the United States of America. The subject speaks its own importance; comprehending in its consequences nothing less than the existence of the UNION, the safety and welfare of the parts of which it is composed, the fate of an empire in many respects the most interesting in the world. It has been frequently remarked that it seems to have been reserved to the people of this country, by their conduct and example, to decide the important question, whether societies of men are really capable or not of establishing good government from reflection and choice, or whether they are forever destined to depend for their political constitutions on accident and force. If there be any truth in the remark, the crisis at which we are arrived may with propriety be regarded as the era in which that decision is to be ma

In [4]:
data.author.value_counts()

Hamilton            51
Madison             14
Disputed            12
Jay                  5
Hamilton+Madison     3
Name: author, dtype: int64

In [5]:
import string

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn import set_config
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# render fitted estimators as diagrams in the notebook
set_config(display='diagram')

In [6]:
nltk.pos_tag

<function nltk.tag.pos_tag(tokens, tagset=None, lang='eng')>

**Part 1 (text processing):** remove stop words and punctuation from the papers, and lemmatize them.

In [7]:
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)

def process_pos(pos):
    """Map a part-of-speech tag from nltk.pos_tag to the POS code used by the lemmatizer."""
    if pos.startswith('J'): #adjective
        return 'a'
    elif pos.startswith('V'): #verb
        return 'v'
    elif pos.startswith('N'): #noun
        return 'n'
    elif pos.startswith('R'): #adverb
        return 'r'
    else:
        return 'n'
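With its default settings, `nltk.pos_tag` returns Penn Treebank tags, so the first letter of the tag is enough to choose a WordNet category. The same mapping can be sketched as a lookup table; the name `to_wordnet_pos` is hypothetical and only used for illustration.

```python
# Hypothetical, table-driven equivalent of process_pos above.
WORDNET_POS = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}

def to_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    return WORDNET_POS.get(tag[:1], 'n')

# A few common tags: adjective, past-tense verb, plural noun, adverb, preposition
for tag in ['JJ', 'VBD', 'NNS', 'RB', 'IN']:
    print(tag, '->', to_wordnet_pos(tag))
```

Tags with no WordNet counterpart (such as `IN` for prepositions) fall back to `'n'`, matching the `else` branch of `process_pos`.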


In [8]:
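def process_text(text):
    """Lowercase and tokenize a paper, drop stop words and punctuation, then lemmatize."""
    words = [word.lower() for word in word_tokenize(text)]
    # tag the full token list first so the tagger sees every word in context
    lemmatized_words = [
        lemmatizer.lemmatize(word, pos=process_pos(pos))
        for word, pos in nltk.pos_tag(words)
        if word not in stop_words and word not in punctuation
    ]
    return ' '.join(lemmatized_words)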

In [9]:
data['processed_papers']=data.paper.apply(process_text)

We'll use the papers written by Hamilton and Madison as the training set, and the disputed papers as the testing set.

In [10]:
data_train = data[data.author.isin(['Hamilton','Madison'])]
data_test = data[data.author=='Disputed']

Extract the feature matrices `X_train` and `X_test`, and the target vector `y_train`.

In [11]:
X_train=data_train['processed_papers'].copy()
X_test= data_test['processed_papers'].copy()
y_train=data_train.author.copy()

**Part 3:** build a classification pipeline (count vectorizer + Naive Bayes model) that predicts the author of a paper.

In [12]:
pipe=Pipeline(steps=[
    ('vect',CountVectorizer()),
    ('clf',MultinomialNB())
])
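To make the two pipeline steps concrete, here is a from-scratch sketch of what a bag-of-words count followed by multinomial Naive Bayes computes. It is not scikit-learn's implementation: the training texts are made up, tokenization is a plain whitespace split, and the helper names (`log_posterior`, `predict`) are hypothetical.

```python
import math
from collections import Counter

# Toy, made-up training texts (not taken from the papers).
train = [
    ("upon the whole upon reflection", "Hamilton"),
    ("kind enough upon the matter", "Hamilton"),
    ("whilst the department whilst", "Madison"),
]

# Count-vectorizer step: accumulate word counts per author.
class_counts = {}
class_docs = Counter()
for text, author in train:
    class_counts.setdefault(author, Counter()).update(text.split())
    class_docs[author] += 1

vocab = sorted({w for counts in class_counts.values() for w in counts})

def log_posterior(text, author, alpha=1.0):
    """Unnormalized log P(author | text) with add-alpha smoothing."""
    counts = class_counts[author]
    total = sum(counts.values()) + alpha * len(vocab)
    score = math.log(class_docs[author] / len(train))  # log prior
    for word in text.split():
        if word in vocab:  # words outside the vocabulary are ignored
            score += math.log((counts[word] + alpha) / total)
    return score

def predict(text):
    """Return the author with the highest log posterior."""
    return max(class_counts, key=lambda author: log_posterior(text, author))

print(predict("whilst the department"))  # Madison
print(predict("upon the whole"))         # Hamilton
```

Each word occurrence adds one log-probability term, so words that one author uses far more often than the other dominate the decision.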

**Part 4:** Use a grid search to tune the pipeline hyperparameters

In [13]:
params_dic={
    'vect__max_features':[1000,2000,5000],
    'vect__min_df':[1,5,10],
    'vect__ngram_range':[(1,1),(1,2)],   
}
grid=GridSearchCV(pipe,params_dic, cv=5, scoring='accuracy', n_jobs=-1,verbose=True)

In [14]:
grid.fit(X_train,y_train)

Fitting 5 folds for each of 18 candidates, totalling 90 fits
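The totals in the log line above follow from the grid itself: every combination of the three parameter lists is one candidate, and each candidate is fitted once per fold. A small stdlib sketch of that count (the variable names here are illustrative):

```python
from itertools import product

# Same grid as params_dic above
param_grid = {
    'vect__max_features': [1000, 2000, 5000],
    'vect__min_df': [1, 5, 10],
    'vect__ngram_range': [(1, 1), (1, 2)],
}

# One candidate per element of the Cartesian product of the value lists
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_folds = 5

print(len(candidates), len(candidates) * n_folds)  # 18 90
```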


In [15]:
grid.best_score_

0.9230769230769231

In [16]:
grid.best_params_

{'vect__max_features': 2000, 'vect__min_df': 5, 'vect__ngram_range': (1, 2)}
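The selected `ngram_range` of `(1, 2)` means the vocabulary contains both single words and pairs of adjacent words. The sketch below shows the idea with a whitespace tokenizer; `ngram_counts` is a hypothetical helper, and `CountVectorizer` applies its own tokenization rules, so this is only an approximation of its behaviour.

```python
from collections import Counter

def ngram_counts(text, ngram_range=(1, 2)):
    """Count all n-grams of whitespace tokens for n in the inclusive range."""
    tokens = text.split()
    lo, hi = ngram_range
    counts = Counter()
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            counts[' '.join(tokens[i:i + n])] += 1
    return counts

counts = ngram_counts("judiciary department whilst judiciary department")
for ngram, count in counts.items():
    print(f"{ngram!r}: {count}")
```

Because the stop words were removed before vectorizing, a bigram joins words that were not necessarily adjacent in the original sentence.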

In [17]:
best_clf=grid.best_estimator_

In [18]:
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the filtered slice
data_train = data_train.assign(predicted_author=best_clf.predict(X_train))


In [19]:
confusion_matrix(data_train.author,data_train.predicted_author)

array([[51,  0],
       [ 0, 14]], dtype=int64)
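In scikit-learn's convention the rows of the confusion matrix are the true authors and the columns the predicted ones, so the diagonal holds the correct predictions. The sketch below recomputes the training accuracy from the matrix above and compares it with the cross-validated score reported earlier. The conversion of that score to a number of papers assumes five folds of 13 papers each, which is an assumption about how the 65 training papers were split.

```python
from fractions import Fraction

# Confusion matrix on the training set (rows: true, columns: predicted)
confusion = [[51, 0],
             [0, 14]]

correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)
train_accuracy = Fraction(correct, total)
print(float(train_accuracy))  # 1.0

# Cross-validated accuracy reported by grid.best_score_
cv_accuracy = 0.9230769230769231
# Under the equal-fold assumption, this is the number of held-out papers classified correctly
print(round(cv_accuracy * total))  # 60
```

A perfect score on the papers the model was fitted on is an optimistic estimate; the cross-validated score is the better guide to how it behaves on unseen papers.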

In [20]:
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the filtered slice
data_test = data_test.assign(predicted_author=best_clf.predict(X_test))


In [21]:
data_test[['author','predicted_author']]

,author,predicted_author
48,Disputed,Madison
49,Disputed,Madison
50,Disputed,Madison
51,Disputed,Madison
52,Disputed,Hamilton
53,Disputed,Madison
54,Disputed,Madison
55,Disputed,Hamilton
56,Disputed,Madison
57,Disputed,Madison


**Part 5:** How does your classification model choose between Hamilton and Madison?

In [22]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
words=best_clf['vect'].get_feature_names_out()
best_clf['clf'].classes_

array(['Hamilton', 'Madison'], dtype='<U8')

In [23]:
ham_count=best_clf['clf'].feature_count_[0]
mad_count=best_clf['clf'].feature_count_[1]

In [24]:
words_df=pd.DataFrame({'words':words,'hamilton':ham_count,'madison':mad_count}).set_index('words')
words_df

words,hamilton,madison
abandon,7.0,2.0
ability,12.0,0.0
able,44.0,13.0
abolish,13.0,8.0
abolition,5.0,2.0
...,...,...
yet may,7.0,0.0
yield,10.0,4.0
york,100.0,21.0
zeal,12.0,9.0


In [25]:
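# add 1 to every count to avoid dividing by 0
words_df = words_df + 1
# convert counts to per-author relative frequencies
words_df = words_df / words_df.sum()
# compute the frequency ratios between the two authors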
words_df['hamilton_ratio']=words_df['hamilton']/words_df['madison']
words_df['madison_ratio']=words_df['madison']/words_df['hamilton']
words_df.head()

words,hamilton,madison,hamilton_ratio,madison_ratio
abandon,0.00016,0.000159,1.012263,0.987885
ability,0.000261,5.3e-05,4.934784,0.202643
able,0.000903,0.00074,1.220139,0.819579
abolish,0.000281,0.000476,0.590487,1.693517
abolition,0.00012,0.000159,0.759198,1.31718
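Adding 1 to each count and normalizing per author is the same add-one smoothing that `MultinomialNB` applies with its default `alpha=1`, so the logarithm of each ratio is the amount one occurrence of that word shifts the log-odds between the two authors. The sketch below repeats the computation on the five raw counts displayed earlier. Since it normalizes over only those five words, its numbers differ from the full table and serve purely as an illustration.

```python
import math

# Raw counts (hamilton, madison) for five words from the count table
counts = {
    'abandon': (7, 2),
    'ability': (12, 0),
    'able': (44, 13),
    'abolish': (13, 8),
    'abolition': (5, 2),
}

# Add-one smoothing, then normalize each author's column to frequencies
smoothed = {w: (h + 1, m + 1) for w, (h, m) in counts.items()}
h_total = sum(h for h, _ in smoothed.values())
m_total = sum(m for _, m in smoothed.values())

ratios = {w: (h / h_total) / (m / m_total) for w, (h, m) in smoothed.items()}

for word, ratio in ratios.items():
    # log-ratio > 0 favours Hamilton, < 0 favours Madison
    print(f"{word:10s} ratio={ratio:.2f} log-ratio={math.log(ratio):+.2f}")
```

Without the smoothing step, `ability` would have a Madison frequency of zero and its ratio would be undefined.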


In [26]:
words_df.sort_values(by='hamilton_ratio',ascending=False).head(5)

words,hamilton,madison,hamilton_ratio,madison_ratio
upon,0.007482,0.000423,17.698794,0.056501
intend,0.000702,5.3e-05,13.285958,0.075267
enough,0.000682,5.3e-05,12.906359,0.077481
kind,0.001725,0.000159,10.881832,0.091896
readily,0.000502,5.3e-05,9.48997,0.105374


In [30]:
words_df.sort_values(by='madison_ratio',ascending=False).head(5)

words,hamilton,madison,hamilton_ratio,madison_ratio
judiciary department,8e-05,0.001427,0.056237,17.781932
whilst,4e-05,0.000687,0.0584,17.123342
article confederation,0.0001,0.001533,0.065448,15.27929
relief,4e-05,0.000423,0.0949,10.537441
exist congress,4e-05,0.000423,0.0949,10.537441


**Part 6:** use your classifier to find who was the most likely author of the 12 disputed essays: Hamilton or Madison.

In [28]:
data_test[['author','predicted_author']]

,author,predicted_author
48,Disputed,Madison
49,Disputed,Madison
50,Disputed,Madison
51,Disputed,Madison
52,Disputed,Hamilton
53,Disputed,Madison
54,Disputed,Madison
55,Disputed,Hamilton
56,Disputed,Madison
57,Disputed,Madison
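A quick tally of the rows displayed above summarizes the result. The dictionary below is copied by hand from that output and covers only the ten rows shown; the row index is zero-based, so the paper number is the index plus one.

```python
from collections import Counter

# Predictions copied from the rows displayed above (row index -> predicted author)
predicted = {
    48: 'Madison', 49: 'Madison', 50: 'Madison', 51: 'Madison',
    52: 'Hamilton', 53: 'Madison', 54: 'Madison', 55: 'Hamilton',
    56: 'Madison', 57: 'Madison',
}

tally = Counter(predicted.values())
print(tally)  # Counter({'Madison': 8, 'Hamilton': 2})

# Convert the zero-based row index to a Federalist paper number
hamilton_papers = [idx + 1 for idx, author in predicted.items() if author == 'Hamilton']
print(hamilton_papers)  # [53, 56]
```

Among the rows shown, the classifier attributes most of the disputed papers to Madison, in line with the studies mentioned in the introduction.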
