In [1]:
import pandas as pd

# Predicting Authorship of the Disputed Federalist Papers

The Federalist Papers are a collection of **85 essays** written by James Madison, Alexander Hamilton, and John Jay under the collective pseudonym "Publius" to promote the ratification of the United States Constitution.

Authorship of most of the papers was revealed some years later by Hamilton, though his claim to authorship of 12 papers was disputed for nearly 200 years (studies generally agree that the disputed essays were written by James Madison).

| Author | Papers |
| :- | -: |
| Jay | 2, 3, 4, 5, 64 |
| Madison | 10, 14, 37-48 |
| Hamilton | 1, 6, 7, 8, 9, 11, 12, 13, 15, 16, 17, 21-36, 59, 60, 61, 65-85 |
| Hamilton and Madison | 18, 19, 20 |
| Disputed | 49-58, 62, 63 |

The goal of this problem is to train a classifier that predicts the author of the disputed papers.

In [2]:
# load Federalist papers data
url = 'https://raw.githubusercontent.com/um-perez-alvaro/Data-Science-Practice/master/Data/papers.csv'
data = pd.read_csv(url)
data.head()

Unnamed: 0,paper,author
0,To the People of the State of New York: AFTE...,Hamilton
1,To the People of the State of New York: WHEN...,Jay
2,To the People of the State of New York: IT I...,Jay
3,To the People of the State of New York: MY L...,Jay
4,To the People of the State of New York: QUEE...,Jay


In [3]:
# Federalist paper No. 1
print(data.paper[0])

 To the People of the State of New York:  AFTER an unequivocal experience of the inefficacy of the subsisting federal government, you are called upon to deliberate on a new Constitution for the United States of America. The subject speaks its own importance; comprehending in its consequences nothing less than the existence of the UNION, the safety and welfare of the parts of which it is composed, the fate of an empire in many respects the most interesting in the world. It has been frequently remarked that it seems to have been reserved to the people of this country, by their conduct and example, to decide the important question, whether societies of men are really capable or not of establishing good government from reflection and choice, or whether they are forever destined to depend for their political constitutions on accident and force. If there be any truth in the remark, the crisis at which we are arrived may with propriety be regarded as the era in which that decision is to be ma

In [4]:
data.author.value_counts()

Hamilton            51
Madison             14
Disputed            12
Jay                  5
Hamilton+Madison     3
Name: author, dtype: int64

**Part 1 (text processing):** remove stop words and punctuation from the papers, and lemmatize them.

In [5]:
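from nltk.corpus import wordnet

def process_pos(pos):
    """Map a Penn Treebank POS tag to the WordNet POS constant the lemmatizer expects."""
    if pos.startswith('J'): # adjective
        return wordnet.ADJ
    elif pos.startswith('V'): # verb
        return wordnet.VERB
    elif pos.startswith('N'): # noun
        return wordnet.NOUN
    elif pos.startswith('R'): # adverb
        return wordnet.ADV
    else: # any other tag: fall back to noun, the lemmatizer's default
        return wordnet.NOUN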

In [6]:
import string

import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Requires the NLTK tokenizer, POS tagger, stop word, and WordNet resources (see nltk.download)
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # sets give fast membership tests
punctuation = set(string.punctuation)

def process_text(text):
    # tokenize and lowercase
    words = [word.lower() for word in word_tokenize(text)]
    # tag the full sentence first, then drop stop words and punctuation and lemmatize the rest
    lemmatized_words = [lemmatizer.lemmatize(word, pos=process_pos(pos))
                        for word, pos in nltk.pos_tag(words)
                        if word not in stop_words and word not in punctuation
                        ]
    return ' '.join(lemmatized_words)

data['processed_paper'] = data.paper.apply(process_text)

**Part 2: train-test split**

We'll use the papers written by Hamilton and Madison as the training set, and the disputed papers as the testing set.

In [7]:
data.head(3)

Unnamed: 0,paper,author,processed_paper
0,To the People of the State of New York: AFTE...,Hamilton,people state new york unequivocal experience i...
1,To the People of the State of New York: WHEN...,Jay,people state new york people america reflect c...
2,To the People of the State of New York: IT I...,Jay,people state new york new observation people c...


In [8]:
data_train = data[data.author.isin(['Hamilton','Madison'])]
data_test = data[data.author=='Disputed']

Extract the training and test texts (X_train, X_test) and the training labels (y_train). The pipeline in Part 3 turns the texts into feature matrices.

In [27]:
X_train = data_train.processed_paper
X_test = data_test.processed_paper
y_train = data_train.author
y_test = data_test.author  # every label is 'Disputed'; the true authors are unknown

In [28]:
y_train.unique()

array(['Hamilton', 'Madison'], dtype=object)

**Part 3:** build a classification pipeline (count vectorizer + Naive Bayes model) that predicts the author of a paper.

In [52]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB # or any other classifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, accuracy_score

In [54]:
pipe = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(max_features=1000, ngram_range=(1, 2))),  # keep the 1000 most frequent terms; (1, 2) = single words and two-word sequences
    ('clf', MultinomialNB())
])
pipe.fit(X_train,y_train)
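
For intuition about `ngram_range=(1, 2)`: the vectorizer builds features from single words (unigrams) and from pairs of adjacent words (bigrams). The helper `extract_ngrams` below is a hypothetical illustration written for this note, not scikit-learn's implementation, and it skips the vectorizer's own tokenization.

```python
def extract_ngrams(tokens, ngram_range=(1, 2)):
    """Return every n-gram of `tokens` for n in the inclusive range `ngram_range`."""
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # slide a window of n tokens over the list
        for i in range(len(tokens) - n + 1):
            ngrams.append(' '.join(tokens[i:i + n]))
    return ngrams

# first tokens of a processed paper, as shown in data.head(3)
tokens = 'people state new york'.split()

unigrams = extract_ngrams(tokens, (1, 1))
unigrams_and_bigrams = extract_ngrams(tokens, (1, 2))
print(unigrams)
print(unigrams_and_bigrams)
```

With `(1, 1)` only the four single words are produced; with `(1, 2)` the three adjacent pairs (`people state`, `state new`, `new york`) are added as extra features.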

**Part 4:** Use a grid search to tune the pipeline hyperparameters

In [43]:
params_dic = {'vectorizer__max_features': [500, 1000, 2000, 4000],
              'vectorizer__ngram_range': [(1, 1), (1, 2)],
              'vectorizer__use_idf': [False, True],  # False: L2-normalized term frequencies (no IDF weighting); True: full TF-IDF
              'vectorizer__min_df': [1, 2, 4],
              'vectorizer__max_df': [1.0, 0.7],
              'clf__alpha': [0.01, 0.1, 0.5, 0.9],
              }

In [44]:
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(pipe, params_dic, cv=5, n_jobs=-1, verbose=2)
grid.fit(X_train, y_train)

Fitting 5 folds for each of 384 candidates, totalling 1920 fits
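
The 384 candidates in the log are the product of the number of values tried for each hyperparameter. A quick check, re-typing the grid values from the cell above under the illustrative name `grid_values`:

```python
from itertools import product

# same values as params_dic, repeated here so the snippet runs on its own
grid_values = {
    'vectorizer__max_features': [500, 1000, 2000, 4000],
    'vectorizer__ngram_range': [(1, 1), (1, 2)],
    'vectorizer__use_idf': [False, True],
    'vectorizer__min_df': [1, 2, 4],
    'vectorizer__max_df': [1.0, 0.7],
    'clf__alpha': [0.01, 0.1, 0.5, 0.9],
}

# every combination of one value per hyperparameter is one candidate
n_candidates = len(list(product(*grid_values.values())))
cv_folds = 5
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)  # 384 1920
```

Each extra value added to any list multiplies the number of fits, so the grid grows quickly.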


**Part 5:** How does your classification model choose between Hamilton and Madison?

In [45]:
grid.best_params_
best_pipe = grid.best_estimator_
y_test_pred = best_pipe.predict(X_test)
# confusion_matrix(y_test, y_test_pred) is not informative here: every y_test label is 'Disputed'

In [46]:
y_test_pred

array(['Madison', 'Madison', 'Madison', 'Hamilton', 'Hamilton',
       'Hamilton', 'Hamilton', 'Hamilton', 'Hamilton', 'Hamilton',
       'Hamilton', 'Hamilton'], dtype='<U8')

In [49]:
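# store the vocabulary
words = best_pipe['vectorizer'].get_feature_names_out()

# per-author feature totals; rows follow best_pipe['clf'].classes_ (Hamilton, then Madison).
# The vectorizer normalizes each document, so these are summed weights, not integer counts.
hamilton_word_count = best_pipe['clf'].feature_count_[0,:]
madison_word_count = best_pipe['clf'].feature_count_[1,:]

words_df = pd.DataFrame({'word':words,
                        'hamilton':hamilton_word_count,
                         'madison':madison_word_count}).set_index('word')
words_df = words_df + 1  # add-one smoothing so that no ratio divides by zero
words_df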

Unnamed: 0_level_0,hamilton,madison
word,Unnamed: 1_level_1,Unnamed: 2_level_1
able,1.880116,1.174335
abolish,1.240288,1.183015
absolute,1.520991,1.138469
absolutely,1.177196,1.249269
abuse,1.476452,1.066716
...,...,...
writer,1.321227,1.093854
year,1.650584,1.399374
yet,1.910602,1.267580
york,2.057784,1.171113


### Get words with highest use ratios

In [50]:
# convert counts into frequencies
words_df.hamilton = words_df.hamilton/words_df.hamilton.sum()
words_df.madison = words_df.madison/words_df.madison.sum()

# ratios
words_df['hamilton_ratio'] = words_df.hamilton/words_df.madison
words_df['madison_ratio'] = words_df.madison/words_df.hamilton

words_df.sort_values(by='madison_ratio', ascending=False).head(20)

Unnamed: 0_level_0,hamilton,madison,hamilton_ratio,madison_ratio
word,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
article confederation,0.000658,0.001324,0.496958,2.012244
coin,0.000615,0.001212,0.507648,1.969869
judiciary department,0.000632,0.001228,0.514726,1.94278
legislative department,0.000631,0.00118,0.534817,1.869798
confederation,0.0009,0.001612,0.558581,1.790251
consequently,0.000623,0.001101,0.565988,1.766821
department,0.000987,0.001694,0.582513,1.716701
executive judiciary,0.0007,0.001188,0.589193,1.697237
congress,0.000946,0.001561,0.60576,1.650819
legislative executive,0.000731,0.001193,0.612359,1.63303
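
To connect these ratios to the prediction: multinomial Naive Bayes scores each author as the log prior plus the weighted sum of per-term log probabilities, and picks the higher score. The sketch below is a simplified illustration, not the fitted model. It uses three rows of the table above (which are add-one smoothed, whereas the model uses the tuned `alpha`), priors from the training counts (51 Hamilton, 14 Madison), and a made-up document `doc`. The fitted values live in `best_pipe['clf'].class_log_prior_` and `best_pipe['clf'].feature_log_prob_`.

```python
import math

# smoothed per-author frequencies copied from three rows of the table above
freqs = {
    'article confederation': {'Hamilton': 0.000658, 'Madison': 0.001324},
    'coin': {'Hamilton': 0.000615, 'Madison': 0.001212},
    'congress': {'Hamilton': 0.000946, 'Madison': 0.001561},
}

# class priors from the training set sizes (51 Hamilton papers, 14 Madison papers)
priors = {'Hamilton': 51 / 65, 'Madison': 14 / 65}

def log_score(author, term_weights):
    """log P(author) + sum over terms of weight * log P(term | author)."""
    score = math.log(priors[author])
    for term, weight in term_weights.items():
        score += weight * math.log(freqs[term][author])
    return score

# hypothetical document weights, made up for illustration only
doc = {'article confederation': 2, 'coin': 1, 'congress': 3}

scores = {author: log_score(author, doc) for author in priors}
predicted = max(scores, key=scores.get)
print(scores)
print(predicted)
```

With no evidence at all (`log_score(author, {})`), the prior alone favors Hamilton. Terms with a high `madison_ratio` have to accumulate enough weight to overcome that head start.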


**Part 6:** use your classifier to find who was the most likely author of the 12 disputed essays: Hamilton or Madison.

In [51]:
grid.best_params_
best_pipe = grid.best_estimator_
y_test_pred = best_pipe.predict(X_test)
y_test_pred

array(['Madison', 'Madison', 'Madison', 'Hamilton', 'Hamilton',
       'Hamilton', 'Hamilton', 'Hamilton', 'Hamilton', 'Hamilton',
       'Hamilton', 'Hamilton'], dtype='<U8')
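
To see which essays these predictions refer to, the array can be paired with the disputed paper numbers from the table at the top (49-58, 62, 63). This sketch assumes the rows of `data` are in paper order, so the test rows follow that same order. The predictions are copied from the output above rather than recomputed.

```python
from collections import Counter

# disputed paper numbers from the authorship table
disputed_papers = list(range(49, 59)) + [62, 63]

# predictions copied from the output above (assumed to be in paper order)
predictions = ['Madison'] * 3 + ['Hamilton'] * 9

attribution = dict(zip(disputed_papers, predictions))
for paper, author in attribution.items():
    print(f'No. {paper}: {author}')

counts = Counter(predictions)
print(counts)
```

Under that ordering assumption, the three Madison predictions are Nos. 49, 50, and 51.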

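The model attributes 3 of the disputed papers to Madison and the other 9 to Hamilton, whereas other studies generally find that Madison wrote all 12. I'm not sure why my model disagrees. One possible factor is class imbalance: the training set has 51 Hamilton papers and only 14 Madison papers, so the learned class prior favors Hamilton. Adding more frequency-related vectorizer settings to the grid search might also help.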