In [1]:
import pandas as pd

# Predicting Authorship of the Disputed Federalist Papers

The Federalist Papers are a collection of **85 essays** written by James Madison, Alexander Hamilton, and John Jay under the collective pseudonym "Publius" to promote the ratification of the United States Constitution.

Authorship of most of the papers was revealed some years later by Hamilton, though his claim to authorship of 12 papers was disputed for nearly 200 years. Studies generally agree that the disputed essays were written by James Madison.

| Author | Papers |
| :- | -: |
| Jay | 2, 3, 4, 5, 64 |
| Madison | 10, 14, 37-48 |
| Hamilton | 1, 6, 7, 8, 9, 11, 12, 13, 15, 16, 17, 21-36, 59, 60, 61, 65-85 |
| Hamilton and Madison | 18, 19, 20 |
| Disputed | 49-58, 62, 63 |

The goal of this problem is to train a classifier that predicts the author of the disputed papers.

In [2]:
# load Federalist papers data
url = 'https://raw.githubusercontent.com/um-perez-alvaro/Data-Science-Practice/master/Data/papers.csv'
data = pd.read_csv(url)
data.head()

Unnamed: 0,paper,author
0,To the People of the State of New York: AFTE...,Hamilton
1,To the People of the State of New York: WHEN...,Jay
2,To the People of the State of New York: IT I...,Jay
3,To the People of the State of New York: MY L...,Jay
4,To the People of the State of New York: QUEE...,Jay


In [3]:
# Federalist paper No. 1
print(data.paper[0])

 To the People of the State of New York:  AFTER an unequivocal experience of the inefficacy of the subsisting federal government, you are called upon to deliberate on a new Constitution for the United States of America. The subject speaks its own importance; comprehending in its consequences nothing less than the existence of the UNION, the safety and welfare of the parts of which it is composed, the fate of an empire in many respects the most interesting in the world. It has been frequently remarked that it seems to have been reserved to the people of this country, by their conduct and example, to decide the important question, whether societies of men are really capable or not of establishing good government from reflection and choice, or whether they are forever destined to depend for their political constitutions on accident and force. If there be any truth in the remark, the crisis at which we are arrived may with propriety be regarded as the era in which that decision is to be ma

In [42]:
data.author.value_counts()

Hamilton            51
Madison             14
Disputed            12
Jay                  5
Hamilton+Madison     3
Name: author, dtype: int64

**Part 1 (text processing):** remove stop words and punctuation from the papers, and stem or lemmatize them.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
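
One possible way to approach Part 1 is sketched below. It is only an illustration under a few assumptions: the helper name `preprocess` and the sample sentence are made up for this example, it stems with the `PorterStemmer` imported above rather than lemmatizing, it tokenizes with a regular expression instead of `word_tokenize` (which needs NLTK's tokenizer data to be downloaded first), and it borrows scikit-learn's built-in English stop-word list.

```python
import re

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

stemmer = PorterStemmer()

def preprocess(text):
    """Lowercase, keep alphabetic tokens only, drop stop words, stem the rest.

    `preprocess` is a hypothetical helper name, not part of the notebook.
    """
    # The regex keeps runs of letters, so digits and punctuation are discarded.
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [stemmer.stem(tok) for tok in tokens if tok not in ENGLISH_STOP_WORDS]
    return " ".join(kept)

# Sample sentence adapted from the opening of Federalist No. 1.
sample = "To the People of the State of New York: AFTER an unequivocal experience"
cleaned = preprocess(sample)
print(cleaned)
```

On the notebook's data frame, the same function could be applied with `data.paper.apply(preprocess)`.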


**Part 2: train-test split**

We'll use the papers written solely by Hamilton or solely by Madison as the training set, and the disputed papers as the testing set. Jay's papers and the jointly written ones are left out.

In [136]:
data_train = data[data.author.isin(['Hamilton','Madison'])]
data_test = data[data.author=='Disputed']

Extract the feature matrices `X_train` and `X_test`, and the target vector `y_train`.
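
A minimal sketch of this extraction step follows. To stay self-contained it builds a tiny stand-in for `data` with the same `paper` and `author` columns; the texts in it are invented placeholders, not real papers. It also assumes the raw text column is passed on as the "feature matrix", leaving the conversion to word counts to the vectorizer in Part 3.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` frame (same column names, made-up text).
data = pd.DataFrame({
    "paper": ["commerce navy revenue", "faction republic liberty",
              "navy revenue", "faction liberty"],
    "author": ["Hamilton", "Madison", "Disputed", "Disputed"],
})

data_train = data[data.author.isin(["Hamilton", "Madison"])]
data_test = data[data.author == "Disputed"]

# Raw text is the input; the author label is the target.
X_train = data_train.paper
y_train = data_train.author
X_test = data_test.paper  # no y_test: the true author is what we want to predict

print(X_train.shape, y_train.shape, X_test.shape)
```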

**Part 3:** build a classification pipeline (count vectorizer + Naive Bayes model) that predicts the author of a paper.
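
As a rough sketch of what such a pipeline could look like, the block below chains a `CountVectorizer` and a `MultinomialNB` in a scikit-learn `Pipeline`. The step names `vect` and `clf` and the four toy documents are assumptions chosen for illustration; in the notebook the pipeline would be fit on `X_train` and `y_train` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Step names "vect" and "clf" are arbitrary labels chosen for this sketch.
pipe = Pipeline([
    ("vect", CountVectorizer()),   # text -> sparse matrix of word counts
    ("clf", MultinomialNB()),      # word counts -> author
])

# Made-up toy documents standing in for the training papers.
docs = ["commerce navy revenue", "navy taxation commerce",
        "faction republic liberty", "republic liberty majority"]
labels = ["Hamilton", "Hamilton", "Madison", "Madison"]

pipe.fit(docs, labels)
pred = pipe.predict(["navy commerce"])
print(pred)
```

Keeping the vectorizer inside the pipeline means the vocabulary is learned only from the documents passed to `fit`, which matters once cross-validation is used.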

**Part 4:** Use a grid search to tune the pipeline hyperparameters
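
One way to set up the search is sketched here with `GridSearchCV`. The grid values (`ngram_range` and the smoothing parameter `alpha`), the two-fold split, and the toy documents are all illustrative assumptions; a real run would use the training papers and probably more folds and more candidate values.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipe = Pipeline([("vect", CountVectorizer()), ("clf", MultinomialNB())])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],  # unigrams only vs. unigrams + bigrams
    "clf__alpha": [0.1, 1.0],               # additive smoothing strength
}

# Made-up toy documents; each class needs at least `cv` examples.
docs = ["commerce navy revenue", "navy taxation commerce",
        "revenue taxation navy", "commerce revenue taxation",
        "faction republic liberty", "republic liberty faction",
        "liberty faction majority", "majority republic faction"]
labels = ["Hamilton"] * 4 + ["Madison"] * 4

grid = GridSearchCV(pipe, param_grid, cv=2)
grid.fit(docs, labels)

print(grid.best_params_)
best_pred = grid.predict(["navy commerce"])  # uses the refit best pipeline
```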

**Part 5:** How does your classification model choose between Hamilton and Madison?

**Part 6:** use your classifier to find who was the most likely author of the 12 disputed essays: Hamilton or Madison.
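
The final step could be sketched as follows: fit on the Hamilton and Madison rows, predict the disputed rows, and tally the predicted authors. The data frame here is an invented toy stand-in with the notebook's column names, so the counts it prints say nothing about the real papers.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for `data` (made-up text, real column names).
data = pd.DataFrame({
    "paper": ["commerce navy revenue", "navy taxation commerce",
              "faction republic liberty", "republic liberty majority",
              "faction liberty republic", "majority faction", "navy revenue"],
    "author": ["Hamilton", "Hamilton", "Madison", "Madison",
               "Disputed", "Disputed", "Disputed"],
})
train = data[data.author.isin(["Hamilton", "Madison"])]
test = data[data.author == "Disputed"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train.paper, train.author)

# Keep the original row index so each prediction can be traced back to its paper.
predicted = pd.Series(model.predict(test.paper), index=test.index,
                      name="predicted_author")
print(predicted.value_counts())
```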