# **Author Attribution**

Different machine learning models are better suited to different tasks. This notebook compares the performance of several ML methods on a text classification task: predicting the author of a document. The data for training and testing the models comes from `federalist.csv`, which contains 83 documents from the Federalist Papers. The classification labels are the authors Alexander Hamilton, James Madison, and John Jay, plus two labels for papers credited to more than one candidate (`HAMILTON OR MADISON` and `HAMILTON AND MADISON`).

In [23]:
# For loading and handling the dataset
import pandas as pd
# To split the given dataset into training and testing sets
from sklearn.model_selection import train_test_split
# To Vectorize the bodies of text to allow model fitting
from sklearn.feature_extraction.text import TfidfVectorizer
# Machine Learning Models
from sklearn.naive_bayes import BernoulliNB
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
# For evaluating model performance
from sklearn.metrics import accuracy_score
# For text preprocessing
from nltk.corpus import stopwords
# Turn all warnings into exceptions so that a ConvergenceWarning
# from the neural network can be caught with try/except
import warnings
warnings.filterwarnings('error')

### Preprocessing Data

In [24]:
# PART 1

data = pd.read_csv('federalist.csv')
# Set the author column to the category type
data.author = data.author.astype('category')

# splits dataframe into labels and data
Y = data.author
X = data.text

# Print the first few rows, then the number of documents per author
print(data.head())
print('\nCounts by Author\n___________________________')
print(Y.value_counts())

     author                                               text
0  HAMILTON  FEDERALIST. No. 1 General Introduction For the...
1       JAY  FEDERALIST No. 2 Concerning Dangers from Forei...
2       JAY  FEDERALIST No. 3 The Same Subject Continued (C...
3       JAY  FEDERALIST No. 4 The Same Subject Continued (C...
4       JAY  FEDERALIST No. 5 The Same Subject Continued (C...

Counts by Author
___________________________
HAMILTON                49
MADISON                 15
HAMILTON OR MADISON     11
JAY                      5
HAMILTON AND MADISON     3
Name: author, dtype: int64


In [25]:
# PART 2

# Splits data into 80% training, 20% testing
rawTrainX, rawTestX, trainY, testY = train_test_split(X, Y, test_size=0.2, random_state=1234)
print('Training Dataset Shape:\t', rawTrainX.shape)
print('Testing Dataset Shape:\t', rawTestX.shape)

Training Dataset Shape:	 (66,)
Testing Dataset Shape:	 (17,)


#### Vectorization

Since machine learning models are mathematical models under the hood, text data must first be converted into a meaningful numerical representation. TF-IDF vectorization is one way to do this. TF-IDF stands for *term frequency-inverse document frequency*, a metric that reflects how "important" a word is to a document within a collection. A word's score rises the more often it occurs in the document and falls the more documents in the collection contain it, so words that appear everywhere carry little weight. These scores are the features the classification models are trained on.
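
As a rough illustration of the metric, the sketch below computes TF-IDF weights by hand for a made-up three-document corpus. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, as described in the scikit-learn documentation for the default settings. The `tfidf` helper and the toy sentences are hypothetical, and the whitespace tokenization is simpler than what `TfidfVectorizer` does.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Raw weight = term count * smoothed inverse document frequency
        raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        # Scale each document vector to unit length
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

# Made-up toy corpus, for illustration only
toy_docs = [
    "the union needs energy",
    "the senate needs wisdom",
    "the people need liberty",
]
weights = tfidf(toy_docs)
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:10s}{weight:.3f}")
```

In the first toy document, "union" appears in only one document and so outweighs "needs" (two documents), which in turn outweighs "the" (all three).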

In [26]:
# PART 3

# Vectorizer processes text while ignoring the nltk stopwords
# (passed as a list: recent scikit-learn versions reject a set for stop_words)
sw = stopwords.words('english')
vect = TfidfVectorizer(stop_words=sw)

# Processed data sets
trainX = vect.fit_transform(rawTrainX)
testX = vect.transform(rawTestX)

print('Training Set Shape:\t', trainX.shape)
print('Testing Set Shape:\t', testX.shape)

Training Set Shape:	 (66, 7876)
Testing Set Shape:	 (17, 7876)


## Bernoulli Naive Bayes Classifier

**Naive Bayes** classifiers are simple probabilistic classifiers that make the "naive" assumption that the features are independent of one another given the class. The model applies Bayes' Theorem: it estimates class priors and per-class feature probabilities from the training data, then predicts the class with the highest posterior probability for a given data point.

Here, `sklearn`'s `BernoulliNB` classifier is fitted to the training data with its default settings. `BernoulliNB` models binary features, so with the default `binarize=0.0` each TF-IDF weight is reduced to whether or not the term occurs in the document.
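
The idea can be sketched without any library. The code below is a minimal, hypothetical Bernoulli Naive Bayes with Laplace smoothing (`alpha=1.0`) on made-up presence/absence features; the function names and the toy data are assumptions for illustration and are not taken from `federalist.csv`.

```python
import math

def fit_bernoulli_nb(X, y, alpha=1.0):
    """X: rows of 0/1 features, y: labels. Returns {label: (log_prior, feature_probs)}."""
    model = {}
    n_features = len(X[0])
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        log_prior = math.log(len(rows) / len(X))
        # Smoothed estimate of P(feature present | class)
        probs = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
        model[label] = (log_prior, probs)
    return model

def predict_bernoulli_nb(model, x):
    """Return the label with the highest log posterior for one 0/1 feature row."""
    def log_posterior(label):
        log_prior, probs = model[label]
        # Independence assumption: per-feature log-likelihoods simply add up
        return log_prior + sum(
            math.log(p if xi else 1 - p) for xi, p in zip(x, probs)
        )
    return max(model, key=log_posterior)

# Made-up data: columns record whether two hypothetical marker words are present
toy_X = [[1, 0], [1, 0], [1, 1], [0, 1], [0, 1]]
toy_y = ["HAMILTON", "HAMILTON", "HAMILTON", "MADISON", "MADISON"]

toy_model = fit_bernoulli_nb(toy_X, toy_y)
print(predict_bernoulli_nb(toy_model, [1, 0]))
print(predict_bernoulli_nb(toy_model, [0, 1]))
```

Working in log space avoids numerical underflow when thousands of per-feature probabilities are multiplied together.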

In [27]:
# PART 4

bnb = BernoulliNB()
bnb.fit(trainX, trainY)

preds = bnb.predict(testX)

print('Accuracy Score:\t%.3f%%' % (accuracy_score(testY, preds) * 100))

Accuracy Score:	58.824%


The above accuracy is not very high, but the feature extraction can be adjusted so that the classifier generalizes better to the test set. Two changes are made below: the vocabulary is capped at the 1,000 most frequent terms to reduce overfitting to rare words, and bigrams are included alongside single words to give the model more context.
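
To make the two vectorizer options concrete, here is a small pure-Python sketch of what unigram-plus-bigram extraction and a frequency cap amount to. The helpers `ngrams` and `top_features` are hypothetical names; the real vectorizer also removes stop words and tokenizes differently, so this only approximates the behavior of `ngram_range=(1,2)` and `max_features`.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams of length n_min..n_max as space-joined strings."""
    out = []
    for n in range(n_min, n_max + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out

def top_features(docs, max_features):
    """Keep only the max_features most frequent n-grams across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(ngrams(doc.lower().split()))
    return [term for term, _ in counts.most_common(max_features)]

print(ngrams("the federal government".split()))
print(top_features(["the federal government", "the federal union"], max_features=3))
```

Capping the vocabulary discards terms that occur only once or twice, which a model trained on 66 documents could otherwise memorize.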

In [28]:
# PART 5

maxFeats = 1000
newVect = TfidfVectorizer(stop_words=sw, max_features=maxFeats, ngram_range=(1,2))
trainX = newVect.fit_transform(rawTrainX)
testX = newVect.transform(rawTestX)

print('Updated Training Set Shape:\t', trainX.shape)
print('Updated Testing Set Shape:\t', testX.shape)

bnb.fit(trainX, trainY)

preds = bnb.predict(testX)

print('\nUpdated Accuracy Score:\t%.3f%%' % (accuracy_score(testY, preds) * 100))

Updated Training Set Shape:	 (66, 1000)
Updated Testing Set Shape:	 (17, 1000)

Updated Accuracy Score:	94.118%


## Logistic Regression Classifier

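The **Logistic Regression** classifier computes a linear combination of the predictors for each class and converts the resulting scores into class probabilities, which produces linear decision boundaries between the classes.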
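
The prediction step can be sketched as follows, assuming a multinomial (softmax) formulation, which is one common way to extend logistic regression to more than two classes. The weights, biases, and input vector are made-up numbers, not values learned from the Federalist data.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(x, weights, biases):
    """One linear score per class (w . x + b), then softmax."""
    scores = [
        sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]
    return softmax(scores)

# Hypothetical parameters for two classes and two features
classes = ["HAMILTON", "MADISON"]
weights = [[2.0, -1.0], [-1.5, 1.0]]
biases = [0.1, -0.1]

probs = predict_proba([0.8, 0.1], weights, biases)
predicted = classes[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

Because each score is linear in the features, the boundary where two classes are equally likely is a hyperplane.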

In [29]:
# PART 6

lr = LogisticRegression()
lr.fit(trainX, trainY)

preds = lr.predict(testX)

print('Accuracy Score:\t%.3f%%' % (accuracy_score(testY, preds) * 100))

Accuracy Score:	58.824%


The above accuracy is, once again, not very high, but the classifier can be adjusted to fit the training data more closely. The `C` parameter of the Logistic Regression model is the inverse of the regularization strength: a higher `C` means a weaker penalty on large coefficients, so the model gives more weight to the training data. By increasing `C` from its default value of 1, we lead the model to "trust" the training data more and fit it better.
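
One common way to write the L2-regularized objective makes the role of `C` explicit. The form below is for the binary case with labels $y_i \in \{-1, +1\}$; the symbols $w$ (weights), $b$ (intercept), and $x_i$ (feature vector of document $i$) are chosen here for illustration.

```latex
\min_{w,\, b} \;\; \frac{1}{2}\, w^{\top} w \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \left(x_i^{\top} w + b\right)\right)\right)
```

The first term penalizes large weights and the second measures misfit on the training data, so a larger $C$ scales up the data-fit term relative to the penalty.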

In [30]:
lr = LogisticRegression(C=30)
lr.fit(trainX, trainY)

preds = lr.predict(testX)

print('Updated Accuracy Score:\t%.3f%%' % (accuracy_score(testY, preds) * 100))
# print(lr.score(testX, testY))

Updated Accuracy Score:	76.471%


## Neural Network Classifier

**Neural Networks** are machine learning models composed of "layers" of nodes. Each node computes a weighted sum of its inputs, adds a bias (threshold), and passes the result through an activation function. The outputs of one layer feed the next, and the final layer produces the network's prediction.

Below, to find a good topology for a network with two hidden layers, every combination of 1 to 14 nodes per layer is tested (196 topologies in total). Deeper networks might be more accurate, but an exhaustive search over them would take much longer. This brute-force approach is not efficient, but it works well enough for finding a good, simple two-layer topology.
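
The search pattern itself can be sketched independently of the model. In the code below, `grid_search` and `toy_evaluate` are hypothetical names; `toy_evaluate` is a made-up stand-in for fitting and scoring a classifier, and it raises a warning as an exception for some topologies to mimic a convergence failure.

```python
from itertools import product

def grid_search(evaluate, first_sizes, second_sizes):
    """Try every (first, second) pair; return (best_score, best_pair, failures)."""
    best_score, best_pair, failures = float("-inf"), None, []
    for pair in product(first_sizes, second_sizes):
        try:
            score = evaluate(pair)
        except Warning:
            # Record topologies whose evaluation raised a warning, then move on
            failures.append(pair)
            continue
        if score > best_score:
            best_score, best_pair = score, pair
    return best_score, best_pair, failures

def toy_evaluate(pair):
    """Made-up scoring function standing in for model training and testing."""
    first, second = pair
    if first == 1:
        raise UserWarning("did not converge")
    return 1.0 - (abs(first - 3) + abs(second - 2)) / 10

best_score, best_pair, failures = grid_search(toy_evaluate, range(1, 5), range(1, 5))
print(best_score, best_pair, len(failures))
```

Returning the failed pairs alongside the best result makes it easy to see how many topologies were skipped rather than silently dropping them.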

In [31]:
errors = []
maxAcc, maxFirst, maxSecond = 0, 0, 0

for i in range(1,15):
    for j in range(1,15):
        try:
            nn = MLPClassifier(hidden_layer_sizes=(i,j), max_iter=500, random_state=1)

            nn.fit(trainX, trainY)

            preds = nn.predict(testX)

            currAcc = accuracy_score(testY, preds)
            if currAcc > maxAcc:
                maxAcc = currAcc
                maxFirst = i
                maxSecond = j
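        except Warning:
            # Warnings are raised as exceptions (see filterwarnings above),
            # e.g. ConvergenceWarning; record the topology and continue
            errors.append((i, j))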

print('Best Accuracy Score:\t%.3f%% with layers: %d, %d' % ((maxAcc * 100), maxFirst, maxSecond))

Best Accuracy Score:	88.235% with layers: 7, 11
