# NLP Final Project: Authorship Attribution

## Yuval Timen

We will be exploring the task of Authorship Attribution - trying to assign to each piece of text the author who wrote it. We will be looking at the works of Robert Browning, Winston Churchill (the American author, not the British Prime Minister), William Shakespeare, Henry Wadsworth Longfellow, Percy Bysshe Shelley. All data was collected from Project Gutenburg.

We originally used authors such as Aristotle and Nietzsche, however, after realizing that the data collected for these authors was a translation composed of various translators, we decided to only use works originally written in English.

In [1]:
# Imports
import re
import numpy as np
import pandas as pd
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
import matplotlib.pyplot as plt

%matplotlib inline

## Constants

We define some constants here that we will use later in the program.

In [2]:
# Define the list of stopwords to use by default
STOP_WORDS = set(stopwords.words('english'))

# We examine this number of sentences at a time when making predictions
SENTENCES_PER_INPUT = 6

## Import the data

In [3]:
# Importing the raw data

# Each data file is a String
with open('./data_files/browning.txt', 'r') as fb:
    data_browning = fb.read()
    
with open('./data_files/churchill.txt', 'r') as fc:
    data_churchill = fc.read()
    
with open('./data_files/longfellow.txt', 'r') as fl:
    data_longfellow = fl.read()
    
with open('./data_files/shakespeare.txt', 'r') as fse:
    data_shakespeare = fse.read()
    
with open('./data_files/shelley.txt', 'r') as fsl:
    data_shelley = fsl.read()

In [None]:
raw_documents = [
    data_browning,
    data_churchill,
    data_longfellow,
    data_shakespeare,
    data_shelley
]

## Preprocessing and cleaning

We examine our data to see what sort of cleaning needs to be done. We remove all unwanted characters, replace any symbols with their English equivalents, and do any additional processing as necessary. Some preprocessing was done by hand, such as manually deleting the table of contents or Project Gutenberg licences at the end of the file. The rest was done programmatically. 

In [4]:
# Let's first see what kind of characters are in our text:

def build_charset(documents):
    all_chars = list() 
    for doc in documents:
        all_chars.extend(list(doc))
    print(set(all_chars))
    return all_chars

all_chars = build_charset(raw_documents)

{'υ', 'φ', 'Æ', '9', '|', 'ύ', '<', '6', 'ו', 'e', 'ή', 'è', 'r', '(', 'ῥ', '“', 'ú', '\n', 'ἰ', 'H', ']', 'έ', '=', 'î', 'ὸ', 'ὲ', 'י', 'Υ', 'ἐ', '&', 'F', 'l', 'à', 'β', 'ῃ', 'O', 'Ι', '4', 'η', 'ô', 'ἤ', 'ὐ', '”', 'm', 'S', 'ד', 'ὴ', 'Χ', 'Î', ' ', 'ù', '-', 'ς', 'P', '·', 'ש', 'c', 'α', 'ό', '.', '2', '*', 'Σ', 'G', 'ö', 'ך', 'ח', '‘', 'έ', 'd', '$', '[', 'ק', 'ὁ', 'ó', 'i', 'Q', 'N', 'v', '£', '%', 'Ἀ', 'ῦ', 'ç', 'I', '?', 'נ', 'z', 'V', 'σ', 'ε', 'γ', 'é', 'ύ', '—', '8', 'κ', 'ῶ', 'x', 'Z', 'b', '׃', '\\', 'ό', 'g', 'ἄ', '`', 'M', 'k', 'f', '_', 'ῖ', 'T', 'ῆ', 'ξ', 't', 'W', 'ι', 'ω', 'J', 'á', 'À', 'δ', 'p', 'ï', 'X', 'h', '0', 'L', 'ὼ', 'τ', 'â', 'a', 'ë', ')', 'ᾴ', 'o', 'Θ', '־', '{', 'æ', ';', 'Y', 'ὰ', '\t', '~', 'q', 'œ', 'D', 'j', 'ά', 'π', '!', 'K', 'ר', '>', 'ᵒ', 'U', 'ί', 'ά', 'ἀ', '3', 'u', 'ώ', 'ב', 'ἔ', 'ο', 'ὔ', 'ὅ', ',', 'ὖ', 'Œ', 'θ', 'ρ', '}', 'É', 'מ', 'n', 'ì', 'ל', 'ם', 'y', '’', 'A', 'ׁ', 'ὃ', 'ê', '1', 'Ὕ', 'E', 'ē', 'ט', 'ä', 'ἱ', 'א', 'û', 'χ', 'R', 'ò', '

In [5]:
# We see that, in addition to symbols and punctuation, we have
# some Greek and Hebrew words. We must make sure to substitute them
# for special tokens during data cleaning.
# For Greek words, we will use _GREEK_WORD_ token,
# and for Hebrew we will use _HEBREW_WORD_ token.

# This function takes in a String of data and
# returns a list of sentences. It removes
# all punctuation, symbols, numbers, and quotation 
# marks. Finally, we turn all upper-case into lower-case.
# If the resulting sentence does not contain enough words, we exclude it. 
def clean_data(data, word_threshold=3, stopwords_to_remove=STOP_WORDS, num_sentences_per_input=SENTENCES_PER_INPUT):
    
    # Define the stopword regex
    regex_stopwords = r"\b(" + "|".join(stopwords_to_remove) + r")\b"
    
    # Define common regex for cleaning
    regex_numbers = r'[0-9]'
    regex_amounts = '£[0-9]+(.*[0-9]*)*'
    regex_punctuation = r'([\“\”·£־§”$&*ᵒ%+=`‘.,;:_"\/\\\(\)\[\]!?<>\—\-\‐]|\n|--)'
    
    # Expressions matching a single quote that:
    # - precede a word
    # - follow a word
    # - are not next to a word
    regex_single_quote_start = r"([^a-zA-Z])\\'([a-zA-Z])"
    regex_single_quote_end = r"([a-zA-Z])\\'([^a-zA-Z])"
    regex_single_quote_standalone = r"([^a-zA-Z])\\'([^a-zA-Z])"
    
    # Expressions matching symbols/special characters
    # Some symbols can be replaced with ascii characters,
    # such as æ -> ae, or È -> e
    regex_replaceable_ae = r'[Ææ]'
    regex_replaceable_c = r'[Çç]'
    regex_replaceable_a = r'[ÀÁÂÄàáâä]'
    regex_replaceable_e = r'[ÉÊÈêéèēë]'
    regex_replaceable_oe = r'[Œœ]'
    regex_replaceable_u = r'[ÜÛüûùú]'
    regex_replaceable_i = r'[ÏÎìïî]'
    regex_replaceable_o = r'[ÔÖôöòó]'
    
    # Some words are entirely non-English, and must 
    # be transformed to a special token. 
    regex_greek = r'([ὔνΣῡὝύᾴαδθυοπὶἀὖἄἔἆωῥγτῃῆάζἰἤύόχέῶΘὸὴρόὃῑξφηΙμἐλΧςώὼιίάῦβήἈ᾽ἑὲὀεἴἱσκέ]+?)+'
    regex_hebrew = r'([ןזסשדטתאהלכשׁרפךחמבצנףוםקעיץג]+?)+'
    
    # Book or chapter headings often contain Roman numerals
    regex_book = r'^BOOK (((?=[MDCLXVI])M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3}))\s)'
    regex_chapter = r'^CHAPTER (((?=[MDCLXVI])M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3}))\s)'
    regex_roman_num = r'(?!I)((?=[MDCLXVI])M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3}))\s'    
    
    # Regex to match any word containing a non-ascii character
    regex_non_ascii = r"[^ -~]"
    
    
    # Before we begin sentence tokenization,
    # we remove all chapter headings. We do 
    # this before, because tokenization
    # will make it tricky to eliminate 
    # all headings consistently.
    data = re.sub(regex_book, ' ', data, re.IGNORECASE)
    data = re.sub(regex_chapter, ' ', data, re.IGNORECASE)
    
    sentences = sent_tokenize(data)
    output = []
    
    current_input = ""
    count = 0
    
    for sent in sentences:
        tmp = sent
        
        # Remove all unwanted characters
        tmp = re.sub(regex_amounts, ' ', tmp)
        tmp = re.sub(regex_numbers, ' ', tmp)
        tmp = re.sub(regex_punctuation, ' ', tmp)

        # Remove all quotes, preserving single apostrophes
        tmp = re.sub(regex_single_quote_start, r'\1 \2', tmp)
        tmp = re.sub(regex_single_quote_end, r'\1 \2', tmp)
        tmp = re.sub(regex_single_quote_standalone, r'\1 \2', tmp)
        
        # Replace all symbols with corresponding English characters
        tmp = re.sub(regex_replaceable_ae, 'ae', tmp)
        tmp = re.sub(regex_replaceable_c, 'c', tmp)
        tmp = re.sub(regex_replaceable_a, 'a', tmp)
        tmp = re.sub(regex_replaceable_e, 'e', tmp)
        tmp = re.sub(regex_replaceable_oe, 'oe', tmp)
        tmp = re.sub(regex_replaceable_u, 'u', tmp)
        tmp = re.sub(regex_replaceable_i, 'i', tmp)
        tmp = re.sub(regex_replaceable_o, 'o', tmp)
        
        # Replace foreign languages with corresponding tokens
        tmp = re.sub(regex_greek, '_GREEK_WORD_', tmp)
        tmp = re.sub(regex_hebrew, '_HEBREW_WORD_', tmp)
        
        tmp = tmp.lower()
        tmp = tmp.strip()
        
        # If the sentence still contains a Roman Numeral, we don't want it
        if (re.search(regex_roman_num, tmp, re.IGNORECASE)):
            continue
            
        # Remove any stopwords if necessary
        if stopwords_to_remove:
            tmp = re.sub(regex_stopwords, '', tmp)
            
        # We've reached the end of cleaning, remove any words that contain non-ascii characters
        # This acts as the last filter, in case previous preprocessing missed anything
        tmp = re.sub(regex_non_ascii, '', tmp)
        
        # The sentence must have the minimum word threshold to be included
        if tmp and len(tmp.split()) >= word_threshold:
            
            # Since it's above the word threshold, we will be using the sentence
            count += 1
            
            # If we've reached the limit, add to output and clear accumulator
            if count % num_sentences_per_input == 0:
                output.append(" ".join([current_input, tmp]))
                current_input = ""
            else:
                current_input = " ".join([current_input, tmp])
                
    # If there are any remaining sentences in the accumulator when we are
    # done, we just print them out. We do not want to use an input that 
    # has less than SENTENCES_PER_INPUT sentence. While these extra sentences
    # would be okay to be converted into inputs, for the purpose of this
    # experiment, we want to have every aspect be as standard as possible. 
    if current_input:
        print("Unused sentences: " + current_input)
                
    
    return output

In [6]:
# Convert all of our data into cleaned lines
browning_cleaned = clean_data(data_browning)
churchill_cleaned = clean_data(data_churchill)
longfellow_cleaned = clean_data(data_longfellow)
shakespeare_cleaned = clean_data(data_shakespeare)
shelley_cleaned = clean_data(data_shelley)

Unused sentences:  ' become  waring woe   went galloping   war   sick  ' sure     say  blame    ran away   seen better days  dear
Unused sentences:    modern schools french   example   really taught  pupils   acquire  mere smattering   language nothing short  victory    among us  advocate  boycott  germany  peace  declared  say   instance   america' action  entering  war    unselfish   attempting  turn calamity  good
Unused sentences:  tristia  book iii   elegy x tristia  book iii   elegy xii    sea  frozen     'er  ister   comes  sarmatian boor driving  stridulous cart
Unused sentences:  love comforteth like sunshine  rain   lust’ effect  tempest  sun                   love’ gentle spring doth always fresh remain  lust’ winter comes ere summer half  done dost thou drink tears   thou provok’st  weeping  may  heavy groan advantage thee  face remains alive ’ worth  viewing  canst thou boast  things long since   anything ensuing
Unused sentences:    spirits   air   oh whose waves  years  

## Data Exploration & Statistics

We generate some basic statistics about the data, such as sentence and word count, sentence and word length, and vocabulary size. We then run a pairwise set difference across all the vocabularies of our authors. Finally, we create a frequency graph to visualize these vocabulary differences.

In [7]:
# Prints some basic statistics about our dataset:
# 1. Total Number of Sentences
# 2. Average Sentence Length in Words
# 3. Total Number of Words
# 4. Average Word Length in Characters
# 5. Vocabulary Size
# The function returns the vocabulary of the dataset
def print_dataset_statistics(sentences):
    num_sentences = len(sentences) * SENTENCES_PER_INPUT
    avg_sent_length = np.mean([len(sent.split()) for sent in sentences]) / SENTENCES_PER_INPUT
    all_words = []
    for sent in sentences:
        all_words.extend(sent.split())
    avg_word_length = np.mean([len(word) for word in all_words])
    num_words = len(all_words)
    vocab_size = len(set(all_words))
    
    print(f"Total sentences: {num_sentences}")
    print(f"Average sentence length: {round(avg_sent_length, 3)} words")
    print(f"Total number of words: {num_words}")
    print(f"Average word length {round(avg_word_length, 3)} characters")
    print(f"Vocabulary size: {vocab_size}")
    print('\n')
    
    return list(set(all_words))

print("---Browning statistics---")
all_words_browning = print_dataset_statistics(browning_cleaned)

print("---Churchill statistics---")
all_words_churchill = print_dataset_statistics(churchill_cleaned)

print("---Longfellow statistics---")
all_words_longfellow = print_dataset_statistics(longfellow_cleaned)


print("---Shakespeare statistics---")
all_words_shakespeare = print_dataset_statistics(shakespeare_cleaned)
 
print("---Shelley statistics---")
all_words_shelley = print_dataset_statistics(shelley_cleaned)

all_vocab = list(set(all_words_browning 
                     + all_words_churchill 
                     + all_words_longfellow 
                     + all_words_shakespeare
                     + all_words_shelley))

---Browning statistics---
Total sentences: 5172
Average sentence length: 4.829 words
Total number of words: 24975
Average word length 5.152 characters
Vocabulary size: 6849


---Churchill statistics---
Total sentences: 10596
Average sentence length: 4.636 words
Total number of words: 49126
Average word length 5.342 characters
Vocabulary size: 8909


---Longfellow statistics---
Total sentences: 2034
Average sentence length: 4.556 words
Total number of words: 9266
Average word length 5.31 characters
Vocabulary size: 3322


---Shakespeare statistics---
Total sentences: 11934
Average sentence length: 4.448 words
Total number of words: 53081
Average word length 5.117 characters
Vocabulary size: 8413


---Shelley statistics---
Total sentences: 2016
Average sentence length: 5.012 words
Total number of words: 10105
Average word length 5.597 characters
Vocabulary size: 3665




In [8]:
# Pairwise set differences in vocabulary.

# Overlap Measure metric for measuring set similarity
# Given two sets, A and B, we calculate the Overlap as follows:
# J(A,B) = |intersection(A,B)| / min(|A|,|B|)
def overlap_measure(setA, setB):
    intersect = len(set(setA).intersection(set(setB)))
    denom = min(len(set(setA)), len(set(setB)))
    ans = 100 * intersect / denom
    return ans


# Given a list of sets and a metric function, this function will
# conduct a pairwise set evalution. For example, given s1, s2, s3,
# and the function f(), this function will yield:
# f(s1, s2), f(s1, s3), f(s2, s3)
def pairwise_metrics(func, list_of_sets):
    
    output = []
    total = len(list_of_sets)
    
    for i in range(total):
        cur = list_of_sets[i]
        rest = list_of_sets[i+1:]
        
        for other in rest:
            output.append(func(cur, other))
            
    return output


# Given the result of pairwise_metrics, this function
# formats and prints the statistics for us to see neatly.
def print_vocab_metrics(vocabs):
    
    metric_list = pairwise_metrics(overlap_measure, vocabs)
    
    print(f"Minimum overlap: {str(min(metric_list))[:4]}%")
    print(f"Maximum overlap: {str(max(metric_list))[:4]}%")
    print(f"Average overlap: {str(np.mean(metric_list))[:4]}%")
    

# Define the list of all vocabularies
all_vocabs = [all_words_browning, 
              all_words_churchill, 
              all_words_longfellow, 
              all_words_shakespeare, 
              all_words_shelley]

entire_vocab = list()
for v in all_vocabs:
    entire_vocab.extend(v)

entire_vocab = list(set(entire_vocab))
print_vocab_metrics(all_vocabs)

Minimum overlap: 35.4%
Maximum overlap: 55.1%
Average overlap: 45.7%


In [9]:
# Graph - 

## Featurization

We featurize our data in 2 ways. The baseline method will be to create one-hot encodings for each word in the sample sentence. The features for a single input sentence will be the vector sum of all one-hot vectors for each word in the given sentence. Essentially, this will be a vector of length |V|, where V is the set of vocabulary words; the words that appear in the sentence will have a 1 in the position of their index in the vector, indicating that they are present. All other indices will be 0. This is the one-hot vector equivalent of Bag-of-Words.

The second method is a TF-IDF vector representation for each author. TODO: 


In [10]:
# Imports 
from functools import reduce
from math import floor

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

#### First method - create the one-hot encodings

In [11]:
# Converts a list of sentences to a list of lists of words
def sentences_to_word_lists(sentences):
    output = []
    for sent in sentences:
        output.append(sent.split())
        
    return output
    

browning_listed = sentences_to_word_lists(browning_cleaned)
churchill_listed = sentences_to_word_lists(churchill_cleaned)
longfellow_listed = sentences_to_word_lists(longfellow_cleaned)
shakespeare_listed = sentences_to_word_lists(shakespeare_cleaned)
shelley_listed = sentences_to_word_lists(shelley_cleaned)

In [12]:
# Takes a list of list of words and an encoder and
# returns a list of list of integers
def encode_sentences(sentences, encoder, reshape=False):
    output = []
    for sent in sentences:
        if reshape:
            sent = sent.reshape(len(sent), 1)
        output.append(encoder.transform(sent))
    
    return output
    
# Create the indexer and fit it to the entire vocabulary
label_encoder = LabelEncoder()
label_encoder.fit(all_vocab)

# We need to reshape the result so its a column vector
vocab_indexed = label_encoder.transform(all_vocab)
vocab_indexed = vocab_indexed.reshape(len(vocab_indexed), 1)

# Index our data
browning_indexed = encode_sentences(browning_listed, label_encoder)
churchill_indexed = encode_sentences(churchill_listed, label_encoder)
longfellow_indexed = encode_sentences(longfellow_listed, label_encoder)
shakespeare_indexed = encode_sentences(shakespeare_listed, label_encoder)
shelley_indexed = encode_sentences(shelley_listed, label_encoder)

# Create the OneHotEncoder and fit it to the indexed vocab
onehot_encoder = OneHotEncoder(handle_unknown='ignore')
onehot_encoder.fit(vocab_indexed)

# One Hot Encode all of our data
browning_onehot = encode_sentences(browning_indexed, onehot_encoder, reshape=True)
churchill_onehot = encode_sentences(churchill_indexed, onehot_encoder, reshape=True)
longfellow_onehot = encode_sentences(longfellow_indexed, onehot_encoder, reshape=True)
shakespeare_onehot = encode_sentences(shakespeare_indexed, onehot_encoder, reshape=True)
shelley_onehot = encode_sentences(shelley_indexed, onehot_encoder, reshape=True)

In [13]:
# Takes in a a list of 2D numpy array, each of shape 
# (n, V) where n is the number of words in a sentence 
# and V is the vocab size. This function returns a 
# list of 1D np.arrays representing the condensed 
# encodings. 
def consolidate_encodings(all_encodings):
    output = []
    shape = all_encodings[0].shape
    for enc in all_encodings:
        
        output.append(np.array(reduce(lambda a1, a2: a1 + a2, list(enc), np.zeros((1, shape[1])))))
        
    return output

In [14]:
# Convert the one-hot encoded list of lists into a single encoded list
browning_encoded = consolidate_encodings(browning_onehot)
churchill_encoded = consolidate_encodings(churchill_onehot)
longfellow_encoded = consolidate_encodings(longfellow_onehot)
shakespeare_encoded = consolidate_encodings(shakespeare_onehot)
shelley_encoded = consolidate_encodings(shelley_onehot)
print(shelley_encoded[0].shape)

(1, 19444)


In [15]:
# We split the data into X (matrix: (n_samples, n_features))
# and y: (n_samples, 1).

label_mapping = {
    'browning': 1, 
    'churchill': 2,
    'longfellow': 3,
    'shakespeare': 4,
    'shelley': 5
}

encoded_data = np.array(browning_encoded 
                + churchill_encoded 
                + longfellow_encoded 
                + shakespeare_encoded
                + shelley_encoded)

encoded_data = encoded_data.reshape(encoded_data.shape[0], encoded_data.shape[2])

labels_ohe = np.array([label_mapping['browning'] for _ in browning_encoded]
         + [label_mapping['churchill'] for _ in churchill_encoded]
         + [label_mapping['longfellow'] for _ in longfellow_encoded]
         + [label_mapping['shakespeare'] for _ in shakespeare_encoded]
         + [label_mapping['shelley'] for _ in shelley_encoded])

df_ohe = pd.DataFrame(encoded_data)
df_ohe['label'] = labels_ohe
df_ohe.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,19435,19436,19437,19438,19439,19440,19441,19442,19443,label
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1


#### Second method - TfIdf Vectors

In [16]:
# Convert each author's list of sentences into a single string
def cleaned_lines_to_document(ls_of_ls):
    return " ".join(ls_of_ls)

browning_document = cleaned_lines_to_document(browning_cleaned)
churchill_document = cleaned_lines_to_document(churchill_cleaned)
longfellow_document = cleaned_lines_to_document(longfellow_cleaned)
shakespeare_document = cleaned_lines_to_document(shakespeare_cleaned)
shelley_document = cleaned_lines_to_document(shelley_cleaned)

In [17]:
# Convert the documents to a count matrix
count_vec = CountVectorizer()
tfidf_vec = TfidfTransformer()

corpus = [
    browning_document,
    churchill_document, 
    longfellow_document,
    shakespeare_document,
    shelley_document
]

countMatrix = count_vec.fit_transform(corpus).toarray()
tfIdfMatrix = tfidf_vec.fit_transform(countMatrix).toarray()

print(f"Shape of tfidf matrix: {tfIdfMatrix.shape}")

# Convert all inputs into tfidf
browning_encoded_count = count_vec.transform(np.array(browning_cleaned)).toarray()
browning_encoded_tfidf = tfidf_vec.transform(browning_encoded_count).toarray()

churchill_encoded_count = count_vec.transform(np.array(churchill_cleaned)).toarray()
churchill_encoded_tfidf = tfidf_vec.transform(churchill_encoded_count).toarray()

longfellow_encoded_count = count_vec.transform(np.array(longfellow_cleaned)).toarray()
longfellow_encoded_tfidf = tfidf_vec.transform(longfellow_encoded_count).toarray()


shakespeare_encoded_count = count_vec.transform(np.array(shakespeare_cleaned)).toarray()
shakespeare_encoded_tfidf = tfidf_vec.transform(shakespeare_encoded_count).toarray()

shelley_encoded_count = count_vec.transform(np.array(shelley_cleaned)).toarray()
shelley_encoded_tfidf = tfidf_vec.transform(shelley_encoded_count).toarray()

labels_tfidf = np.array([label_mapping['browning'] for _ in browning_encoded_count]
         + [label_mapping['churchill'] for _ in churchill_encoded_count]
         + [label_mapping['longfellow'] for _ in longfellow_encoded_count]
         + [label_mapping['shakespeare'] for _ in shakespeare_encoded_count]
         + [label_mapping['shelley'] for _ in shelley_encoded_count])


print(browning_encoded_tfidf.shape)

tfidf_data = np.concatenate([
    browning_encoded_tfidf,
    churchill_encoded_tfidf,
    longfellow_encoded_tfidf,
    shakespeare_encoded_tfidf,
    shelley_encoded_tfidf
])

df_tfidf = pd.DataFrame(tfidf_data)
df_tfidf['label'] = labels_tfidf

print(tfidf_data.shape)
print(labels_tfidf.shape)

df_tfidf.head()

Shape of tfidf matrix: (5, 17969)
(862, 17969)
(5292, 17969)
(5292,)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,17960,17961,17962,17963,17964,17965,17966,17967,17968,label
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1


## Running the Models

We implement Logistic Regression and a Feed Forward Neural Network.

In [18]:
from sklearn.model_selection import cross_validate
from sklearn.metrics import f1_score, precision_score, recall_score

from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

In [19]:
# We create generators for the sequences
# and define a test-train split
train_percent = 0.8

# Shuffle everything up
df_ohe = df_ohe.sample(frac=1).reset_index(drop=True)
df_tfidf = df_tfidf.sample(frac=1).reset_index(drop=True)


# Inputs for One Hot Encodings
data_X_one_hot = df_ohe.drop('label', axis=1).to_numpy()
data_y_one_hot = df_ohe['label'].to_numpy()

train_idx_cutoff_ohe = floor(len(data_y_one_hot) * train_percent)

train_X_ohe = data_X_one_hot[:train_idx_cutoff_ohe]
test_X_ohe = data_X_one_hot[train_idx_cutoff_ohe + 1:]

train_y_ohe = data_y_one_hot[:train_idx_cutoff_ohe]
test_y_ohe = data_y_one_hot[train_idx_cutoff_ohe + 1:]


# Inputs for TF-IDF
data_X_tfidf = df_tfidf.drop('label', axis=1).to_numpy()
data_y_tfidf = df_tfidf['label'].to_numpy()

train_idx_cutoff_tfidf = floor(len(data_y_tfidf) * train_percent)

train_X_tfidf = data_X_tfidf[:train_idx_cutoff_tfidf]
test_X_tfidf = data_X_tfidf[train_idx_cutoff_tfidf + 1:]

train_y_tfidf = data_y_tfidf[:train_idx_cutoff_tfidf]
test_y_tfidf = data_y_tfidf[train_idx_cutoff_tfidf + 1:]

In [20]:
# Logistic Regression:
# Here, we implement sklearn's Logistic Regression model.
# This is 
clf_logit_ohe = LogisticRegression(n_jobs=4, verbose=True)
clf_logit_tfidf = LogisticRegression(n_jobs=4, verbose=True)

clf_logit_ohe.fit(train_X_ohe, train_y_ohe)
predicted_logit_ohe = clf_logit_ohe.predict(test_X_ohe)

clf_logit_tfidf.fit(train_X_tfidf, train_y_tfidf)
predicted_logit_tfidf = clf_logit_tfidf.predict(test_X_tfidf)

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   1 out of   1 | elapsed:  1.6min finished
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   1 out of   1 | elapsed:  1.5min finished


In [21]:
clf_nn_ohe = MLPClassifier(solver='sgd', max_iter=1000, learning_rate_init=0.005, verbose=True)
clf_nn_tfidf = MLPClassifier(solver='sgd', max_iter=1000, learning_rate_init=0.005, verbose=True)

clf_nn_ohe.fit(train_X_ohe, train_y_ohe)
predicted_nn = clf_nn_ohe.predict(test_X_ohe)


clf_nn_tfidf.fit(train_X_tfidf, train_y_tfidf)
predicted_nn = clf_nn_tfidf.predict(test_X_tfidf)

Iteration 1, loss = 1.52570091
Iteration 2, loss = 1.28958675
Iteration 3, loss = 1.12024631
Iteration 4, loss = 1.01488181
Iteration 5, loss = 0.93528046
Iteration 6, loss = 0.86862986
Iteration 7, loss = 0.81119586
Iteration 8, loss = 0.75973735
Iteration 9, loss = 0.71251829
Iteration 10, loss = 0.67074503
Iteration 11, loss = 0.63238850
Iteration 12, loss = 0.59762638
Iteration 13, loss = 0.56534937
Iteration 14, loss = 0.53593928
Iteration 15, loss = 0.50899393
Iteration 16, loss = 0.48389944
Iteration 17, loss = 0.46085722
Iteration 18, loss = 0.43922415
Iteration 19, loss = 0.41896333
Iteration 20, loss = 0.40018077
Iteration 21, loss = 0.38259111
Iteration 22, loss = 0.36564956
Iteration 23, loss = 0.34988163
Iteration 24, loss = 0.33502149
Iteration 25, loss = 0.32092261
Iteration 26, loss = 0.30785305
Iteration 27, loss = 0.29543910
Iteration 28, loss = 0.28361027
Iteration 29, loss = 0.27227989
Iteration 30, loss = 0.26140121
Iteration 31, loss = 0.25127612
Iteration 32, los

Iteration 47, loss = 0.99705944
Iteration 48, loss = 0.98677024
Iteration 49, loss = 0.97661522
Iteration 50, loss = 0.96700743
Iteration 51, loss = 0.95728458
Iteration 52, loss = 0.94776026
Iteration 53, loss = 0.93851621
Iteration 54, loss = 0.92939702
Iteration 55, loss = 0.92056872
Iteration 56, loss = 0.91190135
Iteration 57, loss = 0.90337592
Iteration 58, loss = 0.89500729
Iteration 59, loss = 0.88697830
Iteration 60, loss = 0.87884155
Iteration 61, loss = 0.87104601
Iteration 62, loss = 0.86311543
Iteration 63, loss = 0.85541334
Iteration 64, loss = 0.84783523
Iteration 65, loss = 0.84044200
Iteration 66, loss = 0.83324472
Iteration 67, loss = 0.82596203
Iteration 68, loss = 0.81912052
Iteration 69, loss = 0.81190651
Iteration 70, loss = 0.80518975
Iteration 71, loss = 0.79843433
Iteration 72, loss = 0.79200155
Iteration 73, loss = 0.78540118
Iteration 74, loss = 0.77896006
Iteration 75, loss = 0.77265697
Iteration 76, loss = 0.76641040
Iteration 77, loss = 0.76022925
Iteratio

Iteration 297, loss = 0.15558936
Iteration 298, loss = 0.15462585
Iteration 299, loss = 0.15358133
Iteration 300, loss = 0.15255915
Iteration 301, loss = 0.15172811
Iteration 302, loss = 0.15060928
Iteration 303, loss = 0.14971764
Iteration 304, loss = 0.14877724
Iteration 305, loss = 0.14776647
Iteration 306, loss = 0.14674541
Iteration 307, loss = 0.14592919
Iteration 308, loss = 0.14504857
Iteration 309, loss = 0.14399228
Iteration 310, loss = 0.14299119
Iteration 311, loss = 0.14201816
Iteration 312, loss = 0.14125528
Iteration 313, loss = 0.14021904
Iteration 314, loss = 0.13938085
Iteration 315, loss = 0.13856737
Iteration 316, loss = 0.13760737
Iteration 317, loss = 0.13679023
Iteration 318, loss = 0.13585425
Iteration 319, loss = 0.13503493
Iteration 320, loss = 0.13419269
Iteration 321, loss = 0.13341365
Iteration 322, loss = 0.13243243
Iteration 323, loss = 0.13160055
Iteration 324, loss = 0.13077705
Iteration 325, loss = 0.12994124
Iteration 326, loss = 0.12914477
Iteration 

Iteration 546, loss = 0.04108350
Iteration 547, loss = 0.04089721
Iteration 548, loss = 0.04074289
Iteration 549, loss = 0.04058104
Iteration 550, loss = 0.04039420
Iteration 551, loss = 0.04023004
Iteration 552, loss = 0.04007749
Iteration 553, loss = 0.03991720
Iteration 554, loss = 0.03975411
Iteration 555, loss = 0.03960575
Iteration 556, loss = 0.03944191
Iteration 557, loss = 0.03928480
Iteration 558, loss = 0.03911718
Iteration 559, loss = 0.03899348
Iteration 560, loss = 0.03883716
Iteration 561, loss = 0.03865378
Iteration 562, loss = 0.03850821
Iteration 563, loss = 0.03835049
Iteration 564, loss = 0.03820222
Iteration 565, loss = 0.03808262
Iteration 566, loss = 0.03789168
Iteration 567, loss = 0.03775899
Iteration 568, loss = 0.03759693
Iteration 569, loss = 0.03746945
Iteration 570, loss = 0.03733656
Iteration 571, loss = 0.03717780
Iteration 572, loss = 0.03703783
Iteration 573, loss = 0.03688667
Iteration 574, loss = 0.03675821
Iteration 575, loss = 0.03660601
Iteration 

## Evaluation

Finally, we evaluate our models and interpret our results:

In [24]:
print("--- LOGISTIC REGRESSION (OneHot) ---")
print(f"Precision: {precision_score(test_y_ohe, predicted_logit_ohe, average='weighted')}")
print(f"Recall: {recall_score(test_y_ohe, predicted_logit_ohe, average='weighted')}")
print(f"F1 Measure: {f1_score(test_y_ohe, predicted_logit_ohe, average='weighted')}")
print()
print("--- LOGISTIC REGRESSION (TFIDF) ---")
print(f"Precision: {precision_score(test_y_tfidf, predicted_logit_ohe, average='weighted')}")
print(f"Recall: {recall_score(test_y_tfidf, predicted_logit_ohe, average='weighted')}")
print(f"F1 Measure: {f1_score(test_y_tfidf, predicted_logit_ohe, average='weighted')}")
print()
# print("--- NEURAL NETWORK (OneHot) ---")
# print(f"Precision: {precision_score(test_y_ohe, predicted_nn_ohe, average='weighted')}")
# print(f"Recall: {recall_score(test_y_ohe, predicted_nn_ohe, average='weighted')}")
# print(f"F1 Measure: {f1_score(test_y_ohe, predicted_nn_ohe, average='weighted')}")
print()
print("--- NEURAL NETWORK (TFIDF) ---")
print(f"Precision: {precision_score(test_y_tfidf, predicted_nn, average='weighted')}")
print(f"Recall: {recall_score(test_y_tfidf, predicted_nn, average='weighted')}")
print(f"F1 Measure: {f1_score(test_y_tfidf, predicted_nn, average='weighted')}")

--- LOGISTIC REGRESSION (OneHot) ---
Precision: 0.892646976928959
Recall: 0.8969754253308129
F1 Measure: 0.8906275762141447

--- LOGISTIC REGRESSION (TFIDF) ---
Precision: 0.30493274011745736
Recall: 0.31285444234404536
F1 Measure: 0.30860753063179047


--- NEURAL NETWORK (OneHot) ---
Precision: 0.9246455736504013
Recall: 0.9243856332703214
F1 Measure: 0.9222097963146505


## 