This project is about author fingerprinting. It continues an earlier, less successful NLTK-based attempt (https://www.kaggle.com/code/hjaimes/nltk-used-for-author-fingerprinting).

The question remains the same: who wrote the book of Hebrews? Was it the Apostle Paul, or was it Luke? While some believe it was Paul, the authorship is still uncertain today.

Since the book of Hebrews is relatively short, the model works with individual lines (verses) of Hebrews and of other books whose author is known. Identifying Paul's books is sometimes straightforward, as some of them open with "Paul, a servant of Jesus Christ..." or something similar. We then gather the books written by Luke. With these two labeled sets we can train a model and feed it "Hebrews" for prediction. Let's see what the results are.

In this second attempt I use a scikit-learn approach; in the future I would like to try spaCy (https://spacy.io/) to get closer to the truth. With spaCy in mind, the Bible version needs to be in contemporary English, such as the American Standard Version or the World English Bible (https://worldenglish.bible/). The latter is used here. The King James Version, although it is the most popular, is written in an older form of English (Early Modern English) that spaCy might not be optimized for.

In [1]:
# Some interesting libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn import metrics

# Useful definitions
FILE = 'WEB.txt' # Name of the bible version to analyze (World English Bible)

# Useful functions
# Print lines of a portion
def print_lines(bible_lines, start, stop):
    for i in range(start, stop):
        print(bible_lines[i])
        
# Keep only alphanumeric characters in each word and lowercase the line
def filter_line(line):
    words_fltd = []
    for word in line.split():  # split on whitespace
        # Drop punctuation and any other non-alphanumeric character
        words_fltd.append(''.join(char for char in word if char.isalnum()))
    return ' '.join(words_fltd).lower()

# Retrieve a full book as a dictionary of parallel lists.
# Each line of the file is expected to be '<reference>\t<text>'.
# Note: book_name is matched as a substring of the reference,
# so 'Thessalonians' matches both 1 and 2 Thessalonians.
def retrieve_book(book_name, bible_lines):
    book_dict = {'bible_lines_indexes': [], 'references': [], 'lines': [], 'lines_fltd': []}
    for i, bible_line in enumerate(bible_lines):
        fields = bible_line.split('\t')
        if book_name in fields[0]:
            book_dict['bible_lines_indexes'].append(i)
            book_dict['references'].append(fields[0])
            book_dict['lines'].append(fields[1])
            book_dict['lines_fltd'].append(filter_line(fields[1]))
    return book_dict

# Prepare dictionary of the book for model training
def add_book_author(book_dict, book_author):
    book_df = pd.DataFrame(book_dict)
    book_df['author'] = book_author
    return book_df.to_dict('list')
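
As a quick check of the helpers above, here is a minimal, self-contained sketch that runs the same matching and filtering logic on a three-line in-memory corpus instead of WEB.txt. The `sample_lines` list and the condensed `retrieve_references` helper are illustrative assumptions, and the Thessalonians verse texts are placeholders; only the Galatians 1:5 line is taken from the output shown in this notebook.

```python
# Condensed restatement of filter_line from the cell above
def filter_line(line):
    words = [''.join(ch for ch in word if ch.isalnum()) for word in line.split()]
    return ' '.join(words).lower()

# Hypothetical helper: returns (reference, filtered text) pairs for a book,
# using the same substring match on the reference as retrieve_book
def retrieve_references(book_name, bible_lines):
    matches = []
    for bible_line in bible_lines:
        fields = bible_line.split('\t')
        if book_name in fields[0]:
            matches.append((fields[0], filter_line(fields[1])))
    return matches

# Tiny stand-in for WEB.txt: '<reference>\t<text>\n' per line
sample_lines = [
    'Galatians 1:5\tto whom be the glory forever and ever. Amen.\n',
    '1 Thessalonians 1:1\t(placeholder text)\n',
    '2 Thessalonians 1:1\t(placeholder text)\n',
]

galatians = retrieve_references('Galatians', sample_lines)
thessalonians = retrieve_references('Thessalonians', sample_lines)

print(galatians)
print([ref for ref, _ in thessalonians])
```

Because the match is a substring test, a name such as 'Thessalonians' collects both letters, while '1 Corinthians' and '2 Corinthians' stay separate.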

In [2]:
# Read the full bible
with open(FILE, 'r', encoding='utf-8') as corpus:
    bible_lines = corpus.readlines() 

In [3]:
# Retrieve the book of interest and add book author
hebrews = add_book_author(retrieve_book('Hebrews', bible_lines), 'Unknown')

# Convert to Dataframe
hebrews_df = pd.DataFrame(hebrews)

hebrews_df.head()

Unnamed: 0,bible_lines_indexes,references,lines,lines_fltd,author
0,29966,Hebrews 1:1,"God, having in the past spoken to the fathers ...",god having in the past spoken to the fathers t...,Unknown
1,29967,Hebrews 1:2,has at the end of these days spoken to us by h...,has at the end of these days spoken to us by h...,Unknown
2,29968,Hebrews 1:3,"His Son is the radiance of his glory, the very...",his son is the radiance of his glory the very ...,Unknown
3,29969,Hebrews 1:4,"having become so much better than the angels, ...",having become so much better than the angels a...,Unknown
4,29970,Hebrews 1:5,For to which of the angels did he say at any t...,for to which of the angels did he say at any t...,Unknown


In [4]:
# Retrieve the book of interest and add book author
romans = add_book_author(retrieve_book('Romans', bible_lines), 'Paul')

# Convert to Dataframe
romans_df = pd.DataFrame(romans)

romans_df.head()

Unnamed: 0,bible_lines_indexes,references,lines,lines_fltd,author
0,27933,Romans 1:1,"Paul, a servant of Jesus Christ, called to be ...",paul a servant of jesus christ called to be an...,Paul
1,27934,Romans 1:2,which he promised before through his prophets ...,which he promised before through his prophets ...,Paul
2,27935,Romans 1:3,"concerning his Son, who was born of the seed o...",concerning his son who was born of the seed of...,Paul
3,27936,Romans 1:4,who was declared to be the Son of God with pow...,who was declared to be the son of god with pow...,Paul
4,27937,Romans 1:5,through whom we received grace and apostleship...,through whom we received grace and apostleship...,Paul


In [5]:
# Retrieve the book of interest and add book author
ephesians = add_book_author(retrieve_book('Ephesians', bible_lines), 'Paul')

# Convert to Dataframe
ephesians_df = pd.DataFrame(ephesians)

ephesians_df.head()

Unnamed: 0,bible_lines_indexes,references,lines,lines_fltd,author
0,29209,Ephesians 1:1,"Paul, an apostle of Christ Jesus through the w...",paul an apostle of christ jesus through the wi...,Paul
1,29210,Ephesians 1:2,Grace to you and peace from God our Father and...,grace to you and peace from god our father and...,Paul
2,29211,Ephesians 1:3,Blessed be the God and Father of our Lord Jesu...,blessed be the god and father of our lord jesu...,Paul
3,29212,Ephesians 1:4,even as he chose us in him before the foundati...,even as he chose us in him before the foundati...,Paul
4,29213,Ephesians 1:5,having predestined us for adoption as children...,having predestined us for adoption as children...,Paul


In [6]:
# Retrieve the book of interest and add book author
galatians = add_book_author(retrieve_book('Galatians', bible_lines), 'Paul')

# Convert to Dataframe
galatians_df = pd.DataFrame(galatians)

galatians_df.head()

Unnamed: 0,bible_lines_indexes,references,lines,lines_fltd,author
0,29060,Galatians 1:1,"Paul, an apostle (not from men, neither throug...",paul an apostle not from men neither through m...,Paul
1,29061,Galatians 1:2,"and all the brothers who are with me, to the a...",and all the brothers who are with me to the as...,Paul
2,29062,Galatians 1:3,"Grace to you and peace from God the Father, an...",grace to you and peace from god the father and...,Paul
3,29063,Galatians 1:4,"who gave himself for our sins, that he might d...",who gave himself for our sins that he might de...,Paul
4,29064,Galatians 1:5,to whom be the glory forever and ever. Amen.\n,to whom be the glory forever and ever amen,Paul


In [7]:
# Retrieve the book of interest and add book author
corinthians_1 = add_book_author(retrieve_book('1 Corinthians', bible_lines), 'Paul')

# Convert to Dataframe
corinthians_1_df = pd.DataFrame(corinthians_1)

corinthians_1_df.head()

Unnamed: 0,bible_lines_indexes,references,lines,lines_fltd,author
0,28366,1 Corinthians 1:1,"Paul, called to be an apostle of Jesus Christ ...",paul called to be an apostle of jesus christ t...,Paul
1,28367,1 Corinthians 1:2,to the assembly of God which is at Corinth; th...,to the assembly of god which is at corinth tho...,Paul
2,28368,1 Corinthians 1:3,Grace to you and peace from God our Father and...,grace to you and peace from god our father and...,Paul
3,28369,1 Corinthians 1:4,"I always thank my God concerning you, for the ...",i always thank my god concerning you for the g...,Paul
4,28370,1 Corinthians 1:5,"that in everything you were enriched in him, i...",that in everything you were enriched in him in...,Paul


In [8]:
# Retrieve the book of interest and add book author
corinthians_2 = add_book_author(retrieve_book('2 Corinthians', bible_lines), 'Paul')

# Convert to Dataframe
corinthians_2_df = pd.DataFrame(corinthians_2)

corinthians_2_df.head()

Unnamed: 0,bible_lines_indexes,references,lines,lines_fltd,author
0,28803,2 Corinthians 1:1,"Paul, an apostle of Christ Jesus through the w...",paul an apostle of christ jesus through the wi...,Paul
1,28804,2 Corinthians 1:2,Grace to you and peace from God our Father and...,grace to you and peace from god our father and...,Paul
2,28805,2 Corinthians 1:3,Blessed be the God and Father of our Lord Jesu...,blessed be the god and father of our lord jesu...,Paul
3,28806,2 Corinthians 1:4,"who comforts us in all our affliction, that we...",who comforts us in all our affliction that we ...,Paul
4,28807,2 Corinthians 1:5,"For as the sufferings of Christ abound to us, ...",for as the sufferings of christ abound to us e...,Paul


In [9]:
# Retrieve the book of interest and add book author
colossians = add_book_author(retrieve_book('Colossians', bible_lines), 'Paul')

# Convert to Dataframe
colossians_df = pd.DataFrame(colossians)

colossians_df.head()

Unnamed: 0,bible_lines_indexes,references,lines,lines_fltd,author
0,29468,Colossians 1:1,"Paul, an apostle of Christ Jesus through the w...",paul an apostle of christ jesus through the wi...,Paul
1,29469,Colossians 1:2,to the saints and faithful brothers in Christ ...,to the saints and faithful brothers in christ ...,Paul
2,29470,Colossians 1:3,We give thanks to God the Father of our Lord J...,we give thanks to god the father of our lord j...,Paul
3,29471,Colossians 1:4,"having heard of your faith in Christ Jesus, an...",having heard of your faith in christ jesus and...,Paul
4,29472,Colossians 1:5,because of the hope which is laid up for you i...,because of the hope which is laid up for you i...,Paul


In [10]:
# Retrieve the book of interest and add book author
philippians = add_book_author(retrieve_book('Philippians', bible_lines), 'Paul') # Paul & Timothy

# Convert to Dataframe
philippians_df = pd.DataFrame(philippians)

philippians_df.head()

Unnamed: 0,bible_lines_indexes,references,lines,lines_fltd,author
0,29364,Philippians 1:1,"Paul and Timothy, servants of Jesus Christ; To...",paul and timothy servants of jesus christ to a...,Paul
1,29365,Philippians 1:2,"Grace to you, and peace from God, our Father, ...",grace to you and peace from god our father and...,Paul
2,29366,Philippians 1:3,"I thank my God whenever I remember you,\n",i thank my god whenever i remember you,Paul
3,29367,Philippians 1:4,always in every request of mine on behalf of y...,always in every request of mine on behalf of y...,Paul
4,29368,Philippians 1:5,for your partnership in furtherance of the Goo...,for your partnership in furtherance of the goo...,Paul


In [11]:
# Retrieve the book of interest and add book author
thessalonians = add_book_author(retrieve_book('Thessalonians', bible_lines), 'Paul') # 1 & 2 Thessalonians (substring match); Paul, Silvanus & Timothy

# Convert to Dataframe
thessalonians_df = pd.DataFrame(thessalonians)

thessalonians_df.head()

Unnamed: 0,bible_lines_indexes,references,lines,lines_fltd,author
0,29563,1 Thessalonians 1:1,"Paul, Silvanus, and Timothy, to the assembly o...",paul silvanus and timothy to the assembly of t...,Paul
1,29564,1 Thessalonians 1:2,"We always give thanks to God for all of you, m...",we always give thanks to god for all of you me...,Paul
2,29565,1 Thessalonians 1:3,remembering without ceasing your work of faith...,remembering without ceasing your work of faith...,Paul
3,29566,1 Thessalonians 1:4,"We know, brothers loved by God, that you are c...",we know brothers loved by god that you are chosen,Paul
4,29567,1 Thessalonians 1:5,and that our Good News came to you not in word...,and that our good news came to you not in word...,Paul


In [12]:
# Retrieve the book of interest and add book author
timothy_1 = add_book_author(retrieve_book('1 Timothy', bible_lines), 'Paul') 

# Convert to Dataframe
timothy_1_df = pd.DataFrame(timothy_1)

timothy_1_df.head()

Unnamed: 0,bible_lines_indexes,references,lines,lines_fltd,author
0,29699,1 Timothy 1:1,"Paul, an apostle of Christ Jesus according to ...",paul an apostle of christ jesus according to t...,Paul
1,29700,1 Timothy 1:2,"to Timothy, my true child in faith: Grace, mer...",to timothy my true child in faith grace mercy ...,Paul
2,29701,1 Timothy 1:3,As I urged you when I was going into Macedonia...,as i urged you when i was going into macedonia...,Paul
3,29702,1 Timothy 1:4,neither to pay attention to myths and endless ...,neither to pay attention to myths and endless ...,Paul
4,29703,1 Timothy 1:5,"but the goal of this command is love, out of a...",but the goal of this command is love out of a ...,Paul


In [13]:
# Retrieve the book of interest and add book author
timothy_2 = add_book_author(retrieve_book('2 Timothy', bible_lines), 'Paul')

# Convert to Dataframe
timothy_2_df = pd.DataFrame(timothy_2)

timothy_2_df.head()

Unnamed: 0,bible_lines_indexes,references,lines,lines_fltd,author
0,29812,2 Timothy 1:1,"Paul, an apostle of Jesus Christ through the w...",paul an apostle of jesus christ through the wi...,Paul
1,29813,2 Timothy 1:2,"to Timothy, my beloved child: Grace, mercy, an...",to timothy my beloved child grace mercy and pe...,Paul
2,29814,2 Timothy 1:3,"I thank God, whom I serve as my forefathers di...",i thank god whom i serve as my forefathers did...,Paul
3,29815,2 Timothy 1:4,"longing to see you, remembering your tears, th...",longing to see you remembering your tears that...,Paul
4,29816,2 Timothy 1:5,having been reminded of the sincere faith that...,having been reminded of the sincere faith that...,Paul


In [14]:
# Retrieve the book of interest and add book author
titus = add_book_author(retrieve_book('Titus', bible_lines), 'Paul')

# Convert to Dataframe
titus_df = pd.DataFrame(titus)

titus_df.head()

Unnamed: 0,bible_lines_indexes,references,lines,lines_fltd,author
0,29895,Titus 1:1,"Paul, a servant of God, and an apostle of Jesu...",paul a servant of god and an apostle of jesus ...,Paul
1,29896,Titus 1:2,"in hope of eternal life, which God, who can’t ...",in hope of eternal life which god who cant lie...,Paul
2,29897,Titus 1:3,but in his own time revealed his word in the m...,but in his own time revealed his word in the m...,Paul
3,29898,Titus 1:4,"to Titus, my true child according to a common ...",to titus my true child according to a common f...,Paul
4,29899,Titus 1:5,"I left you in Crete for this reason, that you ...",i left you in crete for this reason that you w...,Paul


In [15]:
# Retrieve the book of interest and add book author
philemon = add_book_author(retrieve_book('Philemon', bible_lines), 'Paul') # Paul & Timothy

# Convert to Dataframe
philemon_df = pd.DataFrame(philemon)

philemon_df.head()

Unnamed: 0,bible_lines_indexes,references,lines,lines_fltd,author
0,29941,Philemon 1:1,"Paul, a prisoner of Christ Jesus, and Timothy ...",paul a prisoner of christ jesus and timothy ou...,Paul
1,29942,Philemon 1:2,"to the beloved Apphia, to Archippus, our fello...",to the beloved apphia to archippus our fellow ...,Paul
2,29943,Philemon 1:3,Grace to you and peace from God our Father and...,grace to you and peace from god our father and...,Paul
3,29944,Philemon 1:4,"I thank my God always, making mention of you i...",i thank my god always making mention of you in...,Paul
4,29945,Philemon 1:5,"hearing of your love, and of the faith which y...",hearing of your love and of the faith which yo...,Paul


In [16]:
# Retrieve the book of interest and add book author
luke = add_book_author(retrieve_book('Luke', bible_lines), 'Luke')

# Convert to Dataframe
luke_df = pd.DataFrame(luke)

luke_df.head()

Unnamed: 0,bible_lines_indexes,references,lines,lines_fltd,author
0,24896,Luke 1:1,Since many have undertaken to set in order a n...,since many have undertaken to set in order a n...,Luke
1,24897,Luke 1:2,even as those who from the beginning were eyew...,even as those who from the beginning were eyew...,Luke
2,24898,Luke 1:3,"it seemed good to me also, having traced the c...",it seemed good to me also having traced the co...,Luke
3,24899,Luke 1:4,that you might know the certainty concerning t...,that you might know the certainty concerning t...,Luke
4,24900,Luke 1:5,"There was in the days of Herod, the king of Ju...",there was in the days of herod the king of jud...,Luke


In [17]:
# Retrieve the book of interest and add book author
acts = add_book_author(retrieve_book('Acts', bible_lines), 'Luke')

# Convert to Dataframe
acts_df = pd.DataFrame(acts)

acts_df.head()

Unnamed: 0,bible_lines_indexes,references,lines,lines_fltd,author
0,26926,Acts 1:1,"The first book I wrote, Theophilus, concerned ...",the first book i wrote theophilus concerned al...,Luke
1,26927,Acts 1:2,"until the day in which he was received up, aft...",until the day in which he was received up afte...,Luke
2,26928,Acts 1:3,To these he also showed himself alive after he...,to these he also showed himself alive after he...,Luke
3,26929,Acts 1:4,"Being assembled together with them, he command...",being assembled together with them he commande...,Luke
4,26930,Acts 1:5,"For John indeed baptized in water, but you wil...",for john indeed baptized in water but you will...,Luke


In [18]:
# Let's try Luke and Paul
# Paul: romans_df, ephesians_df, corinthians_1_df, corinthians_2_df, colossians_df, galatians_df, philippians_df, thessalonians_df, timothy_1_df, timothy_2_df, titus_df,
#       philemon_df (& Timothy)
# Luke: luke_df, acts_df

# Classifier DataFrame
luke_paul_df = pd.concat([romans_df, luke_df, ephesians_df, corinthians_1_df, corinthians_2_df, colossians_df, galatians_df, philippians_df, thessalonians_df, timothy_1_df, timothy_2_df, titus_df,
                          philemon_df, acts_df])

In [19]:
# Let's see how balanced the data is, to avoid bias as much as possible
luke_paul_df['author'].value_counts()

Luke    2158
Paul    2033
Name: author, dtype: int64

In [20]:
# Yep, just two values for classification
luke_paul_df['author'].unique()

array(['Paul', 'Luke'], dtype=object)

In [21]:
# So far so good, as there is no null information
luke_paul_df.isnull().sum()

bible_lines_indexes    0
references             0
lines                  0
lines_fltd             0
author                 0
dtype: int64

In [22]:
# Split the data into training and test sets
X = luke_paul_df['lines_fltd']
y = luke_paul_df['author']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=5)

In [23]:
# Naïve Bayes:
text_clf_mnb = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', MultinomialNB()),])

# Train the model 
text_clf_mnb.fit(X_train, y_train)

# Form a prediction set
predictions = text_clf_mnb.predict(X_test)

# Print confusion matrix
print(metrics.confusion_matrix(y_test,predictions))

# Print a classification report
print(metrics.classification_report(y_test,predictions))

# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

[[495  48]
 [ 61 444]]
              precision    recall  f1-score   support

        Luke       0.89      0.91      0.90       543
        Paul       0.90      0.88      0.89       505

    accuracy                           0.90      1048
   macro avg       0.90      0.90      0.90      1048
weighted avg       0.90      0.90      0.90      1048

0.8959923664122137
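
The report above can be reproduced by hand from the confusion matrix, which helps when reading it. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels, ordered alphabetically (Luke, then Paul); the variable names are illustrative.

```python
# Confusion matrix printed for the Multinomial NB model
# rows = true author, columns = predicted author, order: Luke, Paul
luke_as_luke, luke_as_paul = 495, 48
paul_as_luke, paul_as_paul = 61, 444

total = luke_as_luke + luke_as_paul + paul_as_luke + paul_as_paul

# Accuracy: correctly labeled verses over all test verses
accuracy = (luke_as_luke + paul_as_paul) / total

# Precision: of the verses predicted as X, how many are truly X
precision_luke = luke_as_luke / (luke_as_luke + paul_as_luke)
precision_paul = paul_as_paul / (paul_as_paul + luke_as_paul)

# Recall: of the verses truly by X, how many were predicted as X
recall_luke = luke_as_luke / (luke_as_luke + luke_as_paul)
recall_paul = paul_as_paul / (paul_as_paul + paul_as_luke)

print(f'accuracy: {accuracy:.4f}')
print(f'Luke precision/recall: {precision_luke:.2f} / {recall_luke:.2f}')
print(f'Paul precision/recall: {precision_paul:.2f} / {recall_paul:.2f}')
```

The same arithmetic applies to the Linear SVC matrix; only the four counts change.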


In [24]:
# Linear SVC:
text_clf_lsvc = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC()),])

# Train the model 
text_clf_lsvc.fit(X_train, y_train)

# Form a prediction set
predictions = text_clf_lsvc.predict(X_test)

# Print confusion matrix
print(metrics.confusion_matrix(y_test,predictions))

# Print a classification report
print(metrics.classification_report(y_test,predictions))

# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

[[496  47]
 [ 63 442]]
              precision    recall  f1-score   support

        Luke       0.89      0.91      0.90       543
        Paul       0.90      0.88      0.89       505

    accuracy                           0.90      1048
   macro avg       0.90      0.89      0.89      1048
weighted avg       0.90      0.90      0.89      1048

0.8950381679389313




In [25]:
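# About 90% accuracy and F1-score for both Multinomial NB and Linear SVC seems acceptable; let's see which author each model predicts for "Hebrews"
# Note: predictions use the raw 'lines' column, while the models were trained on 'lines_fltd'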
hebrews_pred = {'references': [], 'lines': [], 'MNB pred': [], 'SVC pred': []}
for i in range(len(hebrews_df['lines'])):
    hebrews_pred['references'].append(hebrews_df['references'][i])
    hebrews_pred['lines'].append(hebrews_df['lines'][i])
    hebrews_pred['MNB pred'].append(text_clf_mnb.predict([hebrews_df['lines'][i]])[0])
    hebrews_pred['SVC pred'].append(text_clf_lsvc.predict([hebrews_df['lines'][i]])[0])

hebrews_pred_df = pd.DataFrame(hebrews_pred)
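
One detail worth checking in the loop above: the models were trained on `lines_fltd`, but the predictions use the raw `lines`. The sketch below uses only the standard library to compare the tokens each input would produce. It assumes TfidfVectorizer's default settings (lowercasing and the token pattern `(?u)\b\w\w+\b`); the `tokens` helper is illustrative, and the sample phrase is a fragment of Titus 1:2 as shown in the output above.

```python
import re

# Default token pattern of TfidfVectorizer: words of two or more characters
TOKEN_PATTERN = re.compile(r'(?u)\b\w\w+\b')

# Illustrative helper mimicking the vectorizer's default lowercasing + tokenizing
def tokens(text):
    return TOKEN_PATTERN.findall(text.lower())

# Condensed restatement of filter_line from the first cell
def filter_line(line):
    words = [''.join(ch for ch in word if ch.isalnum()) for word in line.split()]
    return ' '.join(words).lower()

raw = 'God, who can’t lie'
filtered = filter_line(raw)

raw_tokens = tokens(raw)            # what the model sees for a raw line
filtered_tokens = tokens(filtered)  # what the model saw during training

print(raw_tokens)
print(filtered_tokens)
```

The two inputs differ only for words with inner punctuation, such as contractions. Passing `hebrews_df['lines_fltd']` to `predict` would keep the input consistent with training; the figures reported in this notebook come from the raw lines.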

In [26]:
hebrews_pred_df.head()

Unnamed: 0,references,lines,MNB pred,SVC pred
0,Hebrews 1:1,"God, having in the past spoken to the fathers ...",Luke,Luke
1,Hebrews 1:2,has at the end of these days spoken to us by h...,Luke,Luke
2,Hebrews 1:3,"His Son is the radiance of his glory, the very...",Luke,Luke
3,Hebrews 1:4,"having become so much better than the angels, ...",Paul,Luke
4,Hebrews 1:5,For to which of the angels did he say at any t...,Luke,Luke


In [27]:
hebrews_pred_df.tail()

Unnamed: 0,references,lines,MNB pred,SVC pred
298,Hebrews 13:21,make you complete in every good work to do his...,Paul,Paul
299,Hebrews 13:22,"But I exhort you, brothers, endure the word of...",Paul,Paul
300,Hebrews 13:23,"Know that our brother Timothy has been freed, ...",Paul,Paul
301,Hebrews 13:24,Greet all of your leaders and all the saints. ...,Paul,Paul
302,Hebrews 13:25,Grace be with you all. Amen.\n,Paul,Paul


In [28]:
# Let's see how much the models agree with each other
hebrews_pred_df[['MNB pred', 'SVC pred']].value_counts()/len(hebrews_pred_df)*100

MNB pred  SVC pred
Paul      Paul        50.495050
Luke      Luke        33.993399
Paul      Luke         8.250825
Luke      Paul         7.260726
dtype: float64

It is hard to read the truth from here. Both models agree that 50.5% of the verses of "Hebrews" were written by Paul, but they also agree that 34.0% were written by Luke. On the other hand, they disagree on 15.5% of the book, assigning opposite authors. Keep in mind that a two-class model must label every verse as either Paul or Luke, so a split like this may also point to a writer outside the training set. Perhaps there are other contributors to the book!
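
The agreement figures follow from simple counts. As a sketch, the counts below are back-computed from the printed percentages and the 303 verses in `hebrews_pred_df` (they are an inference, not notebook output); the variable names are illustrative.

```python
# (MNB prediction, SVC prediction) -> number of verses
# Counts inferred from the percentages above and the 303 rows of hebrews_pred_df
pair_counts = {
    ('Paul', 'Paul'): 153,
    ('Luke', 'Luke'): 103,
    ('Paul', 'Luke'): 25,
    ('Luke', 'Paul'): 22,
}

total = sum(pair_counts.values())

# Verses on which both models give the same author
agree = sum(n for (mnb, svc), n in pair_counts.items() if mnb == svc)
agree_pct = 100 * agree / total
disagree_pct = 100 - agree_pct

# Share of verses each model assigns to Paul
mnb_paul_pct = 100 * sum(n for (mnb, _), n in pair_counts.items() if mnb == 'Paul') / total
svc_paul_pct = 100 * sum(n for (_, svc), n in pair_counts.items() if svc == 'Paul') / total

print(f'agree: {agree_pct:.1f}%  disagree: {disagree_pct:.1f}%')
print(f'MNB Paul: {mnb_paul_pct:.2f}%  SVC Paul: {svc_paul_pct:.2f}%')
```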

In [29]:
# Let's see what MNB Model 'thinks'...
hebrews_pred_df['MNB pred'].value_counts()/len(hebrews_pred_df)*100

Paul    58.745875
Luke    41.254125
Name: MNB pred, dtype: float64

In [30]:
# Let's see what SVC Model 'thinks'...
hebrews_pred_df['SVC pred'].value_counts()/len(hebrews_pred_df)*100

Paul    57.755776
Luke    42.244224
Name: SVC pred, dtype: float64

Interestingly enough, according to the data, it seems that both Paul and Luke wrote parts of the book of "Hebrews", and maybe others in a minor proportion (Timothy?). After all, Luke and Timothy are said to have visited Paul in jail. In a future attempt I will try spaCy (https://spacy.io/) to dig further into this conundrum.
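
To put the roughly 58% share in context, one rough check is to ask what share of "Paul" predictions a single-author text would receive, given the error rates measured on the test set. The sketch below uses the Multinomial NB confusion matrix from cell 23 and assumes, which is a strong assumption, that the verses of Hebrews behave like the test verses of the assumed author. The variable names are illustrative.

```python
# Multinomial NB confusion matrix (rows = true author, columns = predicted)
luke_as_luke, luke_as_paul = 495, 48
paul_as_luke, paul_as_paul = 61, 444

# Expected share of verses labeled 'Paul' if every verse were truly by Paul
paul_share_if_paul = paul_as_paul / (paul_as_luke + paul_as_paul)

# Expected share of verses labeled 'Paul' if every verse were truly by Luke
paul_share_if_luke = luke_as_paul / (luke_as_luke + luke_as_paul)

# Share of Hebrews verses that MNB labeled 'Paul' (from cell 29)
observed_paul_share = 58.745875 / 100

print(f'all Paul -> {paul_share_if_paul:.3f}')
print(f'all Luke -> {paul_share_if_luke:.3f}')
print(f'observed -> {observed_paul_share:.3f}')
```

Under that assumption, a text written entirely by Paul would be labeled "Paul" about 88% of the time and one written entirely by Luke about 9% of the time. The observed 58.7% sits in between, which is consistent with the mixed picture described above without proving it.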