# Samuel Anozie
## Author Attribution
Attributing authorship of the Federalist Papers, a series of essays arguing for ratification of the US Constitution.

Importing the data with pandas and displaying a preliminary view of it.

In [226]:
import pandas as pd
df = pd.read_csv('federalist.csv', usecols=[0,1], header=0)
df["author"] = pd.Categorical(df.author)
print(df.head(5))
author_counts = df.author.value_counts()
print(author_counts)

     author                                               text
0  HAMILTON  FEDERALIST. No. 1 General Introduction For the...
1       JAY  FEDERALIST No. 2 Concerning Dangers from Forei...
2       JAY  FEDERALIST No. 3 The Same Subject Continued (C...
3       JAY  FEDERALIST No. 4 The Same Subject Continued (C...
4       JAY  FEDERALIST No. 5 The Same Subject Continued (C...
HAMILTON                49
MADISON                 15
HAMILTON OR MADISON     11
JAY                      5
HAMILTON AND MADISON     3
Name: author, dtype: int64
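The counts above show a heavily imbalanced corpus, which sets a floor for any classifier: always guessing the largest class is already right more than half the time. A small sketch of that arithmetic follows, with the counts copied by hand from the output above (the `counts` dictionary is an illustration, not part of the notebook).

```python
# Class counts copied from the value_counts() output above.
counts = {
    "HAMILTON": 49,
    "MADISON": 15,
    "HAMILTON OR MADISON": 11,
    "JAY": 5,
    "HAMILTON AND MADISON": 3,
}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a classifier that always answers with the majority class.
baseline_accuracy = counts[majority_class] / total

print(total)                        # 83
print(majority_class)               # HAMILTON
print(round(baseline_accuracy, 3))  # 0.59
```

Any model scoring near 0.59 on a similar split is therefore doing about as well as this constant guess.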


Dividing into training and testing sets, and displaying the shape of the data.

In [227]:
from sklearn.model_selection import train_test_split
X = df.text
y = df.author
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, train_size=0.8, random_state=1234)

print(X_train.shape)
print(X_test.shape)

(66,)
(17,)
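With only 3 to 5 essays in the smallest classes, a purely random split can leave a class out of the training or test set. One option is the `stratify` argument of `train_test_split`, sketched below on synthetic labels that mirror the class counts (placeholder data, not the notebook's actual essays). Whether stratifying would change the results reported later is untested here.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Synthetic stand-in labels with the same class sizes as the corpus.
labels = (["HAMILTON"] * 49 + ["MADISON"] * 15 + ["HAMILTON OR MADISON"] * 11
          + ["JAY"] * 5 + ["HAMILTON AND MADISON"] * 3)
texts = [f"essay {i}" for i in range(len(labels))]  # placeholder documents

# stratify=labels keeps each class's share roughly equal in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=1234, stratify=labels
)

print(len(X_tr), len(X_te))
print(Counter(y_te))
```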


Using a TF-IDF vectorizer to turn the text into numeric features, with English stop words removed.

In [228]:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# keep the stop words as a list; recent scikit-learn versions reject a set for stop_words
stopwords = stopwords.words('english')
vectorizer = TfidfVectorizer(stop_words=stopwords)

X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

print(X_train.shape)
print(X_test.shape)

(66, 7876)
(17, 7876)
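To make the shapes above concrete, here is a minimal sketch of what `TfidfVectorizer` produces on a three-sentence toy corpus. The sentences are invented, and scikit-learn's built-in `'english'` stop-word list is used in place of the NLTK list so the example runs without a download; both are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy corpus, not taken from federalist.csv.
toy_docs = [
    "the power of the union",
    "the union and the states",
    "power of the states",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# One row per document, one column per surviving vocabulary term.
print(toy_matrix.shape)                    # (3, 3)
print(sorted(toy_vectorizer.vocabulary_))  # ['power', 'states', 'union']

# By default each row is scaled to unit L2 length.
row_norms = np.sqrt(toy_matrix.multiply(toy_matrix).sum(axis=1))
print(np.allclose(row_norms, 1.0))         # True
```

The same logic explains the `(66, 7876)` shape: 66 training essays, and 7876 distinct terms left after stop-word removal.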


Using Naive Bayes to predict authors.

In [229]:
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import classification_report
naive_bayes = BernoulliNB()
naive_bayes.fit(X_train, y_train)
# output: BernoulliNB(alpha=1.0, class_prior=None, fit_prior=True)

# make predictions on the test data
pred = naive_bayes.predict(X_test)
# print the classification report, restricted to the labels that were actually predicted
print(classification_report(y_test, pred, labels=np.unique(pred)))

              precision    recall  f1-score   support

    HAMILTON       0.59      1.00      0.74        10

   micro avg       0.59      1.00      0.74        10
   macro avg       0.59      1.00      0.74        10
weighted avg       0.59      1.00      0.74        10
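One thing worth knowing when reading this result: `BernoulliNB` binarizes its input at `0.0` by default, so the TF-IDF weights are reduced to word presence or absence before the model sees them. The sketch below checks this on invented sentences (the documents and labels are made up for illustration). It does not by itself prove why the model above collapsed to a single class.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

# Invented documents and labels, for illustration only.
docs = [
    "energy in the executive is essential",
    "the executive and the legislature",
    "factions and the public good",
    "a republic controls the effects of faction",
]
labels = ["HAMILTON", "HAMILTON", "MADISON", "MADISON"]

X_tfidf = TfidfVectorizer().fit_transform(docs)
# Same matrix with every positive weight replaced by 1.0.
X_presence = (X_tfidf > 0).astype(float)

nb_tfidf = BernoulliNB().fit(X_tfidf, labels)
nb_presence = BernoulliNB().fit(X_presence, labels)

# Both models end up with identical class probabilities.
same = np.allclose(
    nb_tfidf.predict_proba(X_tfidf),
    nb_presence.predict_proba(X_presence),
)
print(same)  # True
```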



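Naive Bayes did not work well with the default settings: it predicted HAMILTON for every test document. Retrying with the vocabulary capped at 1000 features (`max_features=1000`) and bigrams included (`ngram_range=(1,2)`).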

In [230]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, train_size=0.8, random_state=1234)

vectorizer = TfidfVectorizer(stop_words=stopwords, max_features=1000, ngram_range=(1,2))

X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

print(X_train.shape)
print(X_test.shape)

naive_bayes = BernoulliNB()
naive_bayes.fit(X_train, y_train)
# output: BernoulliNB(alpha=1.0, class_prior=None, fit_prior=True)

# make predictions on the test data
pred = naive_bayes.predict(X_test)
# print the classification report, restricted to the labels that were actually predicted
print(classification_report(y_test, pred, labels=np.unique(pred)))

(66, 1000)
(17, 1000)
                     precision    recall  f1-score   support

           HAMILTON       0.91      1.00      0.95        10
HAMILTON OR MADISON       1.00      1.00      1.00         3
                JAY       1.00      0.50      0.67         2
            MADISON       1.00      1.00      1.00         2

           accuracy                           0.94        17
          macro avg       0.98      0.88      0.90        17
       weighted avg       0.95      0.94      0.93        17
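The `max_features` setting keeps only the terms with the highest total frequency across the corpus and drops the rest. A minimal sketch on two invented sentences (unigrams only, to keep the counts easy to follow):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented corpus: "union" appears 4 times, "states" 2 times, "power" once.
toy_docs = [
    "union union union states",
    "union states power",
]

full = TfidfVectorizer().fit(toy_docs)
capped = TfidfVectorizer(max_features=2).fit(toy_docs)

print(sorted(full.vocabulary_))    # ['power', 'states', 'union']
print(sorted(capped.vocabulary_))  # ['states', 'union']
```

In the notebook this shrinks the feature space from 7876 columns to 1000, which discards the rarest terms.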



Naive Bayes did much better with the new settings, reaching 0.94 accuracy. We can now try logistic regression to see whether it performs better.

In [231]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, train_size=0.8, random_state=1234)

pipe1 = Pipeline([
        ('tfidf', TfidfVectorizer(stop_words=stopwords, binary=True)),
        ('logreg', LogisticRegression(solver='lbfgs')),
])

pipe1.fit(X_train, y_train)

pred = pipe1.predict(X_test)
print('accuracy score: ', accuracy_score(y_test, pred))
print('precision score: ', precision_score(y_test, pred, average='macro'))
print('recall score: ', recall_score(y_test, pred, average='macro'))
print('f1 score: ', f1_score(y_test, pred, average='macro'))

accuracy score:  0.5882352941176471
precision score:  0.14705882352941177
recall score:  0.25
f1 score:  0.18518518518518517


  _warn_prf(average, modifier, msg_start, len(result))
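The `_warn_prf` line is the tail of scikit-learn's undefined-metric warning, raised when some class is never predicted. The scores above are consistent with a model that answers HAMILTON for every test essay. The sketch below reproduces them from the class supports shown in the earlier Naive Bayes report (same `random_state`, so the same test set); the all-HAMILTON prediction vector is an inference, since the notebook does not print `pred`.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Test-set supports taken from the earlier classification report.
y_true = (["HAMILTON"] * 10 + ["HAMILTON OR MADISON"] * 3
          + ["JAY"] * 2 + ["MADISON"] * 2)
# Assumed predictions: a constant guess of the majority class.
y_pred = ["HAMILTON"] * len(y_true)

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores never-predicted classes as 0 without a warning.
prec = precision_score(y_true, y_pred, average="macro", zero_division=0)
rec = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(round(acc, 4))   # 0.5882
print(round(prec, 4))  # 0.1471
print(round(rec, 4))   # 0.25
print(round(f1, 4))    # 0.1852
```

Macro averaging gives each of the four classes equal weight, so three classes scoring zero pull precision down to a quarter of HAMILTON's 10/17.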


Logistic regression did worse than the tuned Naive Bayes model (accuracy 0.59 against 0.94), so it was not the best algorithm to use here. A neural network may perform better in this context.

In [232]:
from sklearn.neural_network import MLPClassifier
vectorizer = TfidfVectorizer(stop_words=stopwords, binary=True)
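# note: the vectorizer is fit on the full corpus before the split, so its vocabulary and IDF weights also see the test essays
X = vectorizer.fit_transform(df.text)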
y = df.author
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, train_size=0.8, random_state=1234)


classifier = MLPClassifier(solver='lbfgs', alpha=1e-5,
                   hidden_layer_sizes=(15, 2), random_state=1)
classifier.fit(X_train, y_train)

pred = classifier.predict(X_test)
print('accuracy score: ', accuracy_score(y_test, pred))
print('precision score: ', precision_score(y_test, pred, average='macro'))
print('recall score: ', recall_score(y_test, pred, average='macro'))
print('f1 score: ', f1_score(y_test, pred, average='macro'))

accuracy score:  0.7647058823529411
precision score:  0.35714285714285715
recall score:  0.5
f1 score:  0.4


  _warn_prf(average, modifier, msg_start, len(result))
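For reference, `hidden_layer_sizes=(15, 2)` means two hidden layers with 15 and 2 units, so every prediction passes through a 2-unit bottleneck before the output layer. The sketch below inspects the weight shapes on random toy data (6 features, 3 classes; the data is an assumption for illustration, not the TF-IDF matrix).

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Random toy data: 30 samples, 6 features, 3 classes.
rng = np.random.RandomState(0)
X_toy = rng.rand(30, 6)
y_toy = np.array(["A", "B", "C"] * 10)

mlp = MLPClassifier(solver="lbfgs", alpha=1e-5,
                    hidden_layer_sizes=(15, 2), random_state=1, max_iter=50)
mlp.fit(X_toy, y_toy)  # may emit a convergence warning on such a short run

# One weight matrix per layer transition: input->15, 15->2, 2->classes.
shapes = [w.shape for w in mlp.coefs_]
print(shapes)  # [(6, 15), (15, 2), (2, 3)]
```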


After trying several topologies, I could not beat the (15, 2) hidden-layer configuration. Below is the one that came closest in terms of accuracy.

In [233]:
classifier = MLPClassifier(solver='lbfgs', alpha=1e-5,
                   hidden_layer_sizes=(10, 8, 5), random_state=1)
classifier.fit(X_train, y_train)
pred = classifier.predict(X_test)
print('accuracy score: ', accuracy_score(y_test, pred))
print('precision score: ', precision_score(y_test, pred, average='macro'))
print('recall score: ', recall_score(y_test, pred, average='macro'))
print('f1 score: ', f1_score(y_test, pred, average='macro'))

accuracy score:  0.7058823529411765
precision score:  0.3106060606060606
recall score:  0.5
f1 score:  0.3630952380952381


  _warn_prf(average, modifier, msg_start, len(result))
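Comparing topologies on a single 17-essay test set is noisy. One possible alternative, sketched here on synthetic data from `make_classification` rather than the essays, is a cross-validated search over `hidden_layer_sizes` with `GridSearchCV`. The candidate grid and the macro-F1 scoring are assumptions chosen for illustration, and this was not run on the notebook's data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic 3-class problem standing in for the TF-IDF features.
X_syn, y_syn = make_classification(
    n_samples=60, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Hypothetical candidate topologies, including the two tried above.
param_grid = {"hidden_layer_sizes": [(15, 2), (10, 8, 5), (20,)]}

search = GridSearchCV(
    MLPClassifier(solver="lbfgs", alpha=1e-5, random_state=1, max_iter=200),
    param_grid,
    cv=3,                # 3-fold cross-validation on the training data
    scoring="f1_macro",  # macro F1, matching the metric style used above
)
search.fit(X_syn, y_syn)

print(search.best_params_)
print(round(search.best_score_, 3))
```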
