In [7]:
import pandas as pd

# Project 2: Predicting Authorship of the Disputed Federalist Papers

The Federalist Papers are a collection of 85 essays written by James Madison, Alexander Hamilton, and John Jay under the collective pseudonym "Publius" to promote the ratification of the United States Constitution.

<img src="images/the_federalist_papers.jpg" width=200 height=50 />

Hamilton revealed the authorship of most of the papers some years later, though his claim to the authorship of 12 of them was disputed for nearly 200 years.

| Author | Papers |
| :- | -: |
| Jay | 2, 3, 4, 5, 64 |
| Madison | 10, 14, 37-48 |
| Hamilton | 1, 6, 7, 8, 9, 11, 12, 13, 15, 16, 17, 21-36, 59, 60, 61, 65-85 |
| Hamilton and Madison | 18, 19, 20 |
| Disputed | 49-58, 62, 63 |

The goal of this project is to use NLP and Naive Bayes to predict the author of the disputed papers.

Table of Contents

- [Getting and processing the data](#1.-Getting-and-processing-the-data)
- [Naive Bayes](#2.-Naive-Bayes-Classification)

## 1. Getting and processing the data

Retrieve an electronic version of the Federalist Papers from [Project Gutenberg](http://www.gutenberg.org/). Searching the site for the Federalist Papers returns several versions.
We'll use the plain text version [pg1404.txt](http://www.gutenberg.org/cache/epub/1404/pg1404.txt).
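
The download step can be scripted. The following is a minimal sketch using only the standard library; `fetch_text` is a hypothetical helper, and saving to `Data/papers.txt` assumes that the file read later in this notebook is this same download.

```python
from pathlib import Path
from urllib.request import urlopen

# URL as linked above; DEST matches the path opened later in this notebook
# (assumption: that file is this download).
URL = 'http://www.gutenberg.org/cache/epub/1404/pg1404.txt'
DEST = 'Data/papers.txt'

def fetch_text(url, dest):
    """Download url and save the raw bytes to dest (hypothetical helper)."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    with urlopen(url) as response:
        dest.write_bytes(response.read())
    return dest

# Uncomment to download (requires network access):
# fetch_text(URL, DEST)
```

A freshly downloaded copy would still need any manual edits the parsing step relies on.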

First, we'll build a dictionary that maps each Federalist paper's number to its text. The phrase `To the People of the State of New York` marks the beginning of a paper, and the word `PUBLIUS` marks the end. `PUBLIUS` closes every paper except Paper 37, so we need to insert it at the end of Paper 37 manually.

In [3]:
from re import match

In [4]:
path = 'Data/papers.txt'
Fed_dict = {}
opening = 'To the People of the State of New York'
closing = 'PUBLIUS'

counter = 0
paper = ''

# build a dictionary mapping paper number -> paper text
with open(path) as f:
    for string in f:  # iterate over the lines of the text file
        if match(opening, string):
            paper = ''  # start a new paper as an empty string
            counter += 1  # advance to the next paper number
        # strip the end-of-line character \n and append the line to the current paper
        paper = paper + ' ' + string.replace('\n', '')
        if match(closing, string):
            Fed_dict[counter] = paper  # closing marker reached: store the paper

In [5]:
len(Fed_dict)

85
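
The count of 85 depends on every paper ending with the closing marker. With the loop above, a paper that lacks `PUBLIUS` is never stored, which is why Paper 37 needs the manual fix. As a hedged alternative, the sketch below also stores a paper when the next opening line arrives. `split_papers` and the three-paper `sample` are hypothetical and only illustrate the idea; this is not the parsing used in this notebook.

```python
from re import match

OPENING = 'To the People of the State of New York'
CLOSING = 'PUBLIUS'

def split_papers(lines):
    """Return {paper_number: text}; a paper ends at CLOSING or at the next OPENING."""
    papers = {}
    counter = 0
    chunks = None  # None means we are not inside a paper
    for line in lines:
        if match(OPENING, line):
            if chunks is not None:
                # the previous paper never reached CLOSING: store it anyway
                papers[counter] = ' '.join(chunks)
            counter += 1
            chunks = []
        if chunks is not None:
            chunks.append(line.rstrip('\n'))
            if match(CLOSING, line):
                papers[counter] = ' '.join(chunks)
                chunks = None
    if chunks is not None:
        papers[counter] = ' '.join(chunks)  # file ended inside a paper
    return papers

# Made-up sample: the second paper is missing its closing marker.
sample = [
    'To the People of the State of New York:\n',
    'First essay.\n',
    'PUBLIUS\n',
    'To the People of the State of New York:\n',
    'Second essay, missing its signature.\n',
    'To the People of the State of New York:\n',
    'Third essay.\n',
    'PUBLIUS\n',
]
result = split_papers(sample)
print(len(result))
```

One caveat of this variant: any heading lines that sit between an unterminated paper and the next opening line end up attached to the unterminated paper, so inserting `PUBLIUS` by hand remains the cleaner fix for a single known case.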

In [8]:
# put the Federalist Papers into a DataFrame
papers = pd.DataFrame.from_dict(Fed_dict, orient='index',columns=['paper'])
papers.head(5)

Unnamed: 0,paper
1,To the People of the State of New York: AFTE...
2,To the People of the State of New York: WHEN...
3,To the People of the State of New York: IT I...
4,To the People of the State of New York: MY L...
5,To the People of the State of New York: QUEE...


In [9]:
# authorship function
def author(paper_num):
    'Return the author label of a Federalist Paper, given its number.'
    # papers authored by Jay:
    Jay_list = [2,3,4,5,64]
    # papers authored by Madison:
    Madison_list = [10,14]+list(range(37,49))
    # papers authored by Hamilton
    Hamilton_list = [1,6,7,8,9,11,12,13,15,16,17]+list(range(21,37))+[59,60,61]+list(range(65,86))
    # papers authored by Hamilton+Madison
    Hamilton_Madison_list = [18,19,20]
    # disputed papers
    disputed_list = list(range(49,59))+[62,63]
    if paper_num in Jay_list:
        return 'Jay'
    elif paper_num in Hamilton_list:
        return 'Hamilton'
    elif paper_num in Madison_list:
        return 'Madison'
    elif paper_num in Hamilton_Madison_list:
        return 'Hamilton+Madison'
    elif paper_num in disputed_list:
        return 'Disputed'

In [11]:
# add column author to DataFrame
papers['author'] = papers.index.map(author)
papers.head(5)

Unnamed: 0,paper,author
1,To the People of the State of New York: AFTE...,Hamilton
2,To the People of the State of New York: WHEN...,Jay
3,To the People of the State of New York: IT I...,Jay
4,To the People of the State of New York: MY L...,Jay
5,To the People of the State of New York: QUEE...,Jay
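
Because the author lists are typed by hand, it may be worth checking that they assign each of the 85 papers exactly once. The sketch below restates the same lists as a dictionary (`AUTHOR_LISTS` and `author_by_paper` are hypothetical names, not part of the notebook's code) and counts the papers per label.

```python
from collections import Counter

# Same paper numbers as in the authorship table, restated for a standalone check.
AUTHOR_LISTS = {
    'Jay': [2, 3, 4, 5, 64],
    'Madison': [10, 14] + list(range(37, 49)),
    'Hamilton': ([1, 6, 7, 8, 9, 11, 12, 13, 15, 16, 17]
                 + list(range(21, 37)) + [59, 60, 61] + list(range(65, 86))),
    'Hamilton+Madison': [18, 19, 20],
    'Disputed': list(range(49, 59)) + [62, 63],
}

author_by_paper = {}
for name, numbers in AUTHOR_LISTS.items():
    for n in numbers:
        # a paper number must not appear under two labels
        assert n not in author_by_paper, f'paper {n} is listed twice'
        author_by_paper[n] = name

# every paper from 1 to 85 must be covered
assert sorted(author_by_paper) == list(range(1, 86))

counts = Counter(author_by_paper.values())
print(counts)
```

A lookup dictionary like `author_by_paper` could also be passed to `papers.index.map` in place of the function, since `Index.map` accepts a dict.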


## 2. Naive Bayes Classification