# Discovery of Writing Differences - Data Exploration

Capstone project by Tomo Umer

<img src="https://tomoumerdotcom.files.wordpress.com/2022/04/cropped-pho_logo_notext.png" style="width:400px;height:400px;"/>

## Imports

In [1]:
import pandas as pd
import numpy as np
import re
import glob
import plotly.express as px
import pickle

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

## Exploring Available Books

An initial exploration of the available books and of what was not downloaded (but should have been, according to the metadata).

To start with, figure out which books are listed in the metadata CSV but were not downloaded.

In [2]:
books_list = []

# list of books downloaded successfully into the /raw/ folder
for name in glob.glob('../data/raw/*'):
    books_list.append(re.findall(r'PG\d*', name)[0])

# the metadata (books that should be there)
library = pd.read_csv('../data/metadata.csv')

print('there are', len(library) - len(books_list), '"books" listed in the metadata that did not get downloaded.')

there are 3435 "books" listed in the metadata that did not get downloaded.
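
The subtraction above equals the number of missing ids only if every downloaded file maps to exactly one id in the metadata. Below is a minimal sketch of a set-based cross-check. The ids are made up, and `library_demo` and `downloaded_demo` are hypothetical stand-ins, not the project's data.

```python
import pandas as pd

# toy stand-ins for the metadata table and the list of downloaded ids (hypothetical values)
library_demo = pd.DataFrame({'id': ['PG1', 'PG2', 'PG3', 'PG4']})
downloaded_demo = ['PG1', 'PG3', 'PG3', 'PG99']  # one duplicate, one id not in the metadata

metadata_ids = set(library_demo['id'])
downloaded_ids = set(downloaded_demo)

missing_ids = metadata_ids - downloaded_ids      # listed in the metadata, not on disk
unexpected_ids = downloaded_ids - metadata_ids   # on disk, not listed in the metadata

print(sorted(missing_ids))
print(sorted(unexpected_ids))
```

If `unexpected_ids` is empty and there are no duplicates, the simple length difference and `len(missing_ids)` should agree.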


In [3]:
missing_library = library.loc[~library['id'].isin(books_list)]

print(missing_library['type'].value_counts(dropna=False))
print('\nEnglish missing only:')
print(missing_library.loc[missing_library['language'].str.find('en') > -1]['type'].value_counts(dropna=False))

NaN            2215
Sound          1104
Dataset          83
Image            33
MovingImage       7
StillImage        3
Collection        1
Text              1
Name: type, dtype: int64

English missing only:
NaN            1991
Sound          1039
Dataset          83
Image            33
MovingImage       7
StillImage        3
Collection        1
Text              1
Name: type, dtype: int64


Initially, I filtered them one by one and explored the contents. To keep the notebook cleaner, I compacted the code below so that it displays (up to) five rows for each category:

- 'NaNs' are the actual books, so I am missing 2215 of them in my local library. If I get the time, I'd like to explore why they did not get downloaded.
- 'Sound' entries are frequently transcriptions of speeches. Not interested in those.
- 'Dataset' entries are primarily the Human Genome Project (72 of them). There are 10 calculations of square roots and 1/pi to a million digits. 'Moby Word Lists' is just info on Gutenberg, disclaimers, etc.
- 'Image' contains sheet music.
- 'MovingImage' contains a comet video, a rotating Earth and 5 nuclear test videos.
- 'StillImage' contains an illustrated kids' story and two maps/map images.
- 'Collection' contains only 'Project Gutenberg DVD: The July 2006 Special'.
- 'Text' was presumably meant for books, but the only entry that was not downloaded is empty (NaNs all around).
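
One alternative way to get such a per-category preview is `groupby` with `dropna=False`, which keeps the NaN category as its own group. This is a sketch on a made-up frame (`types_demo` is hypothetical), not the cell used in this notebook.

```python
import numpy as np
import pandas as pd

# hypothetical miniature of the metadata: two 'Sound' rows and two rows without a type
types_demo = pd.DataFrame({
    'id': ['PG1', 'PG2', 'PG3', 'PG4'],
    'type': ['Sound', np.nan, 'Sound', np.nan],
})

# dropna=False keeps the rows whose type is NaN as a separate group
for type_name, group in types_demo.groupby('type', dropna=False):
    print(type_name, len(group))
    print(group.head())

type_sizes = types_demo.groupby('type', dropna=False).size()
```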

In [4]:
# get a list of the file types (note: it's the same as in the full library)
file_types = missing_library['type'].unique()

# loop over the list and display dataframes belonging to those 8 different types
for file_type in file_types:
    filtered_library = missing_library.loc[missing_library['type'].isna() if pd.isna(file_type)
                                           else missing_library['type'] == file_type]
    display(filtered_library.head())


Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type
151,PG10137,Mary Had a Little Lamb: Recording taken from M...,"Edison, Thomas A. (Thomas Alva)",1847.0,1931.0,['en'],21,"{'Nursery rhymes, American'}",Sound
168,PG10152,Voice Trial - Kinetophone actor audition,"Lett, Bob",,,['en'],4,{'Auditions'},Sound
169,PG10153,Voice Trial - Kinetophone Actor Audition,"Lenord, Frank",,,['en'],4,{'Auditions'},Sound
170,PG10154,Voice Trial - Kinetophone Actor Audition,"Schultz, Siegfried Von",,,['en'],0,{'Auditions'},Sound
171,PG10155,The Right of the People to Rule,"Roosevelt, Theodore",1858.0,1919.0,['en'],9,"{'Progressivism (United States politics)', 'Po...",Sound


Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type
606,PG10547,Topsy-Turvy,"Verne, Jules",1828.0,1905.0,['en'],126,"{'Science fiction, French -- Translations into...",
703,PG10634,"The Queen of Hearts, and Sing a Song for Sixpence","Caldecott, Randolph",1846.0,1886.0,['en'],44,"{'Picture books for children', 'Nursery rhymes...",
841,PG10762,Impressions of Theophrastus Such,"Eliot, George",1819.0,1880.0,['en'],110,"{'Authors -- Fiction', 'England -- Fiction', '...",
923,PG10836,The Algebra of Logic,"Couturat, Louis",1868.0,1914.0,['en'],97,"{'Logic, Symbolic and mathematical', 'Algebrai...",
1106,PG10,The King James Version of the Bible,,,,['en'],5831,{'Bible'},


Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type
1108,PG11001,String Quartet No. 05 in A major Opus 18,"Beethoven, Ludwig van",1770.0,1827.0,['en'],5,"{'Music', 'String quartets -- Scores'}",Image
1109,PG11002,"String Quartet No. 11 in F minor Opus 95 ""Seri...","Beethoven, Ludwig van",1770.0,1827.0,['en'],6,"{'String quartets -- Scores', 'Music'}",Image
1944,PG11755,String Quartet No. 10 in E flat major Opus 74 ...,"Beethoven, Ludwig van",1770.0,1827.0,['en'],15,"{'Music', 'String quartets -- Scores'}",Image
2381,PG12149,String Quartet No. 03 in D major Opus 18,"Beethoven, Ludwig van",1770.0,1827.0,['en'],15,"{'String quartets -- Scores', 'Music'}",Image
2479,PG12237,String Quartet No. 16 in F major Opus 135,"Beethoven, Ludwig van",1770.0,1827.0,['en'],21,"{'Music', 'String quartets -- Scores'}",Image


Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type
1661,PG114,The Tenniel Illustrations for Carroll's Alice ...,"Tenniel, John",1820.0,1914.0,['en'],391,"{""Children's stories"", 'Fantasy fiction'}",StillImage
15515,PG239,Radar Map of the United States,United States,,,['en'],27,{'United States -- Maps'},StillImage
67797,PG758,"LandSat Picture of Washington, DC",United States. National Aeronautics and Space ...,,,['en'],36,{'Washington (D.C.) -- Remote-sensing images'},StillImage


Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type
1966,PG11775,"Human Genome Project, Build 34, Chromosome Num...",Human Genome Project,,,['en'],37,{'Nucleotide sequence'},Dataset
1967,PG11776,"Human Genome Project, Build 34, Chromosome Num...",Human Genome Project,,,['en'],6,{'Nucleotide sequence'},Dataset
1968,PG11777,"Human Genome Project, Build 34, Chromosome Num...",Human Genome Project,,,['en'],2,{'Nucleotide sequence'},Dataset
1969,PG11778,"Human Genome Project, Build 34, Chromosome Num...",Human Genome Project,,,['en'],3,{'Nucleotide sequence'},Dataset
1970,PG11779,"Human Genome Project, Build 34, Chromosome Num...",Human Genome Project,,,['en'],1,{'Nucleotide sequence'},Dataset


Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type
3416,PG13082,"Images of Comet Wild 2, Taken by NASA's Stardu...",United States. National Aeronautics and Space ...,,,['en'],38,{'Comets'},MovingImage
17404,PG256,Motion Picture of Rotating Earth,United States,,,['en'],41,{'World maps'},MovingImage
46756,PG5212,Film: Trinity Shot (first US Atomic Test),,,,['en'],26,"{'Nuclear weapons -- Testing', 'Manhattan Proj...",MovingImage
46767,PG5213,Film: the Bikini Island ABLE Atomic Test,,,,['en'],17,{'Nuclear weapons -- Marshall Islands -- Bikin...,MovingImage
46778,PG5214,Film: the Bikini Island BAKER Atomic Test,,,,['en'],15,{'Nuclear weapons -- Marshall Islands -- Bikin...,MovingImage


Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type
10150,PG19159,Project Gutenberg DVD: The July 2006 Special,,,,['en'],73,set(),Collection


Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type
69464,PG90907,,,,,['en'],1,set(),Text


## Selecting (usable) English books

Filtering down to just books (removing other file types) and then focusing specifically on English books.

This part is done in five steps, starting from the metadata "library":
1. Select books, i.e., "type" being NaN (see exploration above)
2. Remove Project Gutenberg index entries ("books" that are just links to actual works); I only found out about these later
3. Select only works marked as "English"
4. Select works that have actually been downloaded to my laptop
5. Remove unknown or various authors (I'm interested in known, single authors)

In [5]:
print('Full Collection:', len(library))

library_en = library.loc[library['type'].isna()].copy()
print('Books Only:', len(library_en))

library_en = library_en.loc[~(library_en['title'].str.find('Index') > -1) & ~(library_en['title'].str.find('Gutenberg') > -1)]
print('After removing "Index" Books:', len(library_en))

library_en = library_en.loc[library_en['language'].str.find('en') > -1]
print('English Books:', len(library_en))

library_en = library_en.loc[library_en['id'].isin(books_list)]
print('Downloaded English Books:', len(library_en))

library_en = library_en[~library_en['author'].isin(['Anonymous', 'Unknown', 'Various'])]
print('=====================================')
print('Downloaded English Books by known authors:', len(library_en))
print('Unique authors:', library_en['author'].nunique())
print('Total downloads:', library_en['downloads'].sum())

Full Collection: 70449
Books Only: 69197
After removing "Index" Books: 68853
English Books: 55428
Downloaded English Books: 53440
Downloaded English Books by known authors: 49328
Unique authors: 18231
Total downloads: 6696949


Furthermore, it may be worth considering that the vast majority of books (42,379 of 49,328) have 100 or fewer downloads in the last 30 days.

In [6]:
print(library_en[library_en['downloads'] <= 10].shape[0])
print(library_en[library_en['downloads'] <= 100].shape[0])
print(library_en[library_en['downloads'] <= 1000].shape[0])

5187
42379
48701
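
The three counts above are cumulative thresholds. For non-overlapping buckets, `pd.cut` is one option. Here is a rough sketch on invented numbers (`downloads_demo` and the bucket labels are hypothetical).

```python
import pandas as pd

# invented download counts, for illustration only
downloads_demo = pd.Series([0, 3, 10, 11, 100, 101, 999, 1000, 5000])

# right-closed intervals: (-1, 10], (10, 100], (100, 1000], (1000, inf]
bins = [-1, 10, 100, 1000, float('inf')]
labels = ['0-10', '11-100', '101-1000', '>1000']

bucket_counts = (
    pd.cut(downloads_demo, bins=bins, labels=labels)
    .value_counts()
    .sort_index()
)
print(bucket_counts)
```

Calling `.cumsum()` on `bucket_counts` gives back cumulative counts comparable to the `<=` thresholds used in the cell above.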


In [7]:
library_en.head()

Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type
1,PG10001,Apocolocyntosis,"Seneca, Lucius Annaeus",,65.0,['en'],400,"{'Claudius, Emperor of Rome, 10 B.C.-54 A.D. -...",
2,PG10002,The House on the Borderland,"Hodgson, William Hope",1877.0,1918.0,['en'],666,{'Science fiction'},
3,PG10003,"My First Years as a Frenchwoman, 1876-1879","Waddington, Mary King",1833.0,1923.0,['en'],43,"{'France -- Social life and customs', 'France ...",
4,PG10004,The Warriors,"Lindsay, Anna Robertson Brown",1864.0,1948.0,['en'],27,{'Christianity'},
5,PG10005,A Voyage to the Moon: With Some Account of the...,"Tucker, George",1775.0,1861.0,['en'],58,"{'Science fiction', 'Space flight to the moon ...",


## Selecting Writers

Choosing authors to use for further analysis. Given memory (and time) constraints, I will need to limit this analysis to a few select authors. 

Ideas for selecting authors:
- first, I started with the top 7 by total number of books written, to make sure my logic works

In [8]:
(
library_en
    .groupby('author')[['title']]
    .count()
    .sort_values(by='title', ascending=False)
    .rename(columns={'title': 'num_books'})
    .head(7)
    .reset_index()
)

Unnamed: 0,author,num_books
0,"Shakespeare, William",178
1,"Ebers, Georg",162
2,"Kingston, William Henry Giles",132
3,"Oliphant, Mrs. (Margaret)",132
4,"Parker, Gilbert",131
5,"Fenn, George Manville",128
6,"Twain, Mark",125


Other possible options included:
- most downloads over last 30 days
- at least 1000 downloads over last 30 days
- select based on the century
- personal preference

I went back and forth on this a lot and rewrote a bunch of code, including various selections and eliminations. The current approach is to pick authors by century, make sure each has at least 5 books, cap authors who have more at their 70 most popular books (model 1) to reduce class imbalance, and make sure I know of the author.
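
The "at least N books, capped at the M most popular" rule can be sketched on toy data as follows. `books_demo`, `MIN_BOOKS` and `MAX_BOOKS` are hypothetical, and the cap is set to 5 only to keep the example small.

```python
import pandas as pd

# toy data: author A has 6 books, B has 3, C has 5 (invented download counts)
books_demo = pd.DataFrame({
    'author': ['A'] * 6 + ['B'] * 3 + ['C'] * 5,
    'downloads': [50, 40, 30, 20, 10, 5, 9, 8, 7, 100, 90, 80, 70, 60],
})

MIN_BOOKS = 5  # hypothetical threshold: keep authors with at least this many books
MAX_BOOKS = 5  # hypothetical cap: keep at most this many books per author

# number of books per author, broadcast back onto every row
books_per_author = books_demo.groupby('author')['downloads'].transform('count')

# popularity rank within each author (1 = most downloaded)
rank_in_author = (
    books_demo.groupby('author')['downloads']
    .rank(method='first', ascending=False)
)

selected_demo = books_demo[(books_per_author >= MIN_BOOKS) & (rank_in_author <= MAX_BOOKS)]
print(selected_demo.groupby('author').size())
```

Author B is dropped for having too few books, and author A loses the least downloaded title.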

In [9]:
# function to calculate century of living, taking into account both birth and death years where possible
def calc_century(birth, death):

    if np.isnan(birth) and np.isnan(death):
        # np.nan is the spelling that also works on NumPy 2.x (np.NaN was removed there)
        return np.nan

    elif np.isnan(birth):
        x = death
    elif np.isnan(death):
        x = birth
    else:
        x = (birth + (death/2 - birth/2))

    # adjust for how centuries are counted: there is no year 0, so the -1 shifts the year down by one
    # before the // 100 + 1 maps it to the correct century, e.g. the 19th century is 1801 - 1900
    if x >= 0:
        x = (x - 1) // 100 + 1

    # the same logic from above is not needed for years before 0
    else:
        x = x // 100 

    return int(x)
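
To sanity-check the boundary handling, here is a small worked example. `century_of_year` is a simplified, hypothetical helper that mirrors only the integer arithmetic above (no NaN handling), and the midpoint uses the 1564 and 1616 birth and death years from the metadata.

```python
def century_of_year(year):
    """Map a non-zero calendar year to a signed century number (negative = BCE)."""
    if year > 0:
        return int((year - 1) // 100 + 1)   # 1801..1900 -> 19
    return int(year // 100)                 # -400..-301 -> -4

# midpoint of a life, written the same way as in calc_century
midpoint = 1564 + (1616 / 2 - 1564 / 2)     # 1590.0

for year in (1800, 1801, 1900, 1901, -301, -400, -401, midpoint):
    print(year, century_of_year(year))
```

The years 1801 and 1900 land in the same century, while 1901 moves to the next one. On the BCE side, -400 and -301 both map to -4.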

In [10]:
# function to convert the numeric int representation of century to ordinal, plus appending BCE or CE
def annotate_century(num_century):

    if np.isnan(num_century):
        return 'unknown'
    elif int(num_century) < 0 :
        ctry = ' century BCE'
        num_century = abs(int(num_century))
    else:
        num_century = int(num_century)
        ctry = ' century CE'


    # determine ordinal numbering
    if (num_century % 10 == 1) and (num_century % 100 != 11):
        ordinal_century = str(num_century) + 'st'

    elif (num_century % 10 == 2) and (num_century % 100 != 12):
        ordinal_century = str(num_century) + 'nd'

    elif (num_century % 10 == 3) and (num_century % 100 != 13):
        ordinal_century = str(num_century) + 'rd'
        
    else:
        ordinal_century = str(num_century) + 'th'

    return ordinal_century + ctry
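
The ordinal rule (special-casing 11, 12 and 13) can be checked in isolation with a compact variant. `ordinal` is a hypothetical helper written for this sketch, not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'   # covers the teens: 11th, 12th, 13th
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return f'{n}{suffix}'

for n in (1, 2, 3, 4, 11, 12, 13, 21, 22):
    print(ordinal(n))
```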

In [11]:
library_en['authorcentury'] = library_en.apply(lambda x: calc_century(x.authoryearofbirth, x.authoryearofdeath), axis=1)

In [12]:
library_en['authorcentury_str'] = library_en['authorcentury'].apply(annotate_century)

Here's a breakdown of the number of authors per century. Most of them are, of course, concentrated in the 19th or 20th century, plus the unknown. History is written by the victors.

In [57]:
(
library_en
    .groupby(['authorcentury', 'authorcentury_str'], dropna=False)[['author']]
    .agg(['count', 'nunique'])     #.nunique()
    .reset_index()
    #.drop(columns='authorcentury')
)

Unnamed: 0,authorcentury,authorcentury_str,author_count,author_nunique
0,-7.0,7th century BCE,14,1
1,-6.0,6th century BCE,5,1
2,-5.0,5th century BCE,30,8
3,-4.0,4th century BCE,57,5
4,-3.0,3rd century BCE,3,2
5,-2.0,2nd century BCE,1,1
6,-1.0,1st century BCE,46,9
7,1.0,1st century CE,50,9
8,2.0,2nd century CE,10,2
9,3.0,3rd century CE,9,3


The next few blocks of code are various explorations of the numbers by author. As stated before, I ultimately decided on a personalized approach. I will keep the blocks here for reference:

In [14]:
# authors with more than 1000 downloads over last 30 days
(
library_en.loc[library_en['downloads'] > 1000]
    .groupby('author')[['title','authorcentury']]
    .agg({'title':'count', 'authorcentury':'max'})
    .sort_values(by='title', ascending=False)
    .head(20)
)

author,title,authorcentury
"Dickens, Charles",12,19.0
"Shakespeare, William",12,16.0
"Doyle, Arthur Conan",10,19.0
"Nietzsche, Friedrich Wilhelm",9,19.0
"Twain, Mark",9,19.0
"Austen, Jane",8,18.0
"Christie, Agatha",8,20.0
Plato,8,-4.0
"Wilde, Oscar",7,19.0
"Chesterton, G. K. (Gilbert Keith)",6,20.0


In [15]:
# authors based on number of downloads overall
display(library_en.sort_values(by='downloads', ascending=False).head())

display(library_en.groupby('author')[['downloads','authorcentury']].agg({'downloads':'sum', 'authorcentury':'max'}).sort_values(by='downloads', ascending=False).head())

Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type,authorcentury,authorcentury_str
5702,PG1513,Romeo and Juliet,"Shakespeare, William",1564.0,1616.0,['en'],166112,"{'Juliet (Fictitious character) -- Drama', 'Ro...",,16.0,16th century CE
18200,PG2641,A Room with a View,"Forster, E. M. (Edward Morgan)",1879.0,1970.0,['en'],145035,"{'British -- Italy -- Fiction', 'Florence (Ita...",,20.0,20th century CE
30065,PG37106,"Little Women; Or, Meg, Jo, Beth, and Amy","Alcott, Louisa May",1832.0,1888.0,['en'],139345,"{'Bildungsromans', 'March family (Fictitious c...",,19.0,19th century CE
5102,PG145,Middlemarch,"Eliot, George",1819.0,1880.0,['en'],138208,"{'Bildungsromans', 'Married people -- Fiction'...",,19.0,19th century CE
18867,PG2701,"Moby Dick; Or, The Whale","Melville, Herman",1819.0,1891.0,['en'],135040,"{'Whaling -- Fiction', 'Psychological fiction'...",,19.0,19th century CE


author,downloads,authorcentury
"Smollett, T. (Tobias)",364509,18.0
"Shakespeare, William",333424,16.0
"Alcott, Louisa May",154196,19.0
"Forster, E. M. (Edward Morgan)",150394,20.0
"Eliot, George",145359,19.0


In [16]:
# same as above, but by century if interested in that
# cte = 19

# display(library_en.loc[library_en['authorcentury'] == cte].groupby('author')[['title']].count().sort_values(by='title',ascending=False).head()) #
# display(library_en.loc[library_en['authorcentury'] == cte].groupby('author')[['downloads']].sum().sort_values(by='downloads',ascending=False).head())

Finally, my approach to choosing the authors (I ended up with 21) was an iterative process. This was my initial setup:
```    
        library_en
        .loc[(library_en['author'].isin(select_authors)) | (library_en['authorcentury'] >= -7.)]
        .groupby('author')
        .agg({'authorcentury': 'max', 'title':'count', 'downloads': 'max'})
        .sort_values(by=['authorcentury', 'title', 'downloads'], ascending=[True, False, False])
        .head(30)
```
The authors list was empty, and the century filter essentially picked everything (no books are older than the 7th century BCE). Afterwards, it was an iterative process:
1. for each century, pick an author with at least 5 books that I recognize (if possible)
2. add the author to the select_authors list, and increase the authorcentury filter by 1
3. note: for later centuries I started picking more authors
4. make any adjustments to authors' names, or cut down the number of books if there are too many

In [17]:
# changed the name to pull together Plato and Shakespeare
library_en = library_en.replace({'author': {'Plato (spurious and doubtful works)': 'Plato'}})
library_en = library_en.replace({'author': {'Shakespeare (spurious and doubtful works)': 'Shakespeare, William'}})

In [18]:
# to use in the streamlit app
library_en.to_pickle('../data/library_en.pkl')

In [19]:
select_authors = [
    'Homer',
    'Confucius',
    'Plato',
    'Cicero, Marcus Tullius',
    'Seneca, Lucius Annaeus',
    #'Marcus Aurelius, Emperor of Rome', excluded because only had 4 works after removing index
    'Dante Alighieri',
    'Boccaccio, Giovanni',
    'Machiavelli, Niccolò',
    'Shakespeare, William',
    'Molière',
    'Defoe, Daniel',
    'Jefferson, Thomas',
    'Austen, Jane',
    'Twain, Mark',
    'Dickens, Charles',
    'Doyle, Arthur Conan',
    'Dumas, Alexandre',
    'Churchill, Winston',
    'Dick, Philip K.',
    'Huxley, Aldous',
    'Lovecraft, H. P. (Howard Phillips)'
]

In [20]:
len(select_authors)

21

In [21]:
(
    library_en
        .loc[(library_en['author'].isin(select_authors)) | (library_en['authorcentury'] >= 22.)]
        .groupby('author')
        .agg({'authorcentury': 'max', 'title':'count', 'downloads': 'max'})
        .sort_values(by=['authorcentury', 'title', 'downloads'], ascending=[True, False, False])
        .head(30)
        #.tail(10)
)

author,authorcentury,title,downloads
Homer,-7.0,14,11357
Confucius,-6.0,5,469
Plato,-4.0,32,8046
"Cicero, Marcus Tullius",-1.0,14,1103
"Seneca, Lucius Annaeus",1.0,5,590
Dante Alighieri,13.0,19,6896
"Boccaccio, Giovanni",14.0,6,2695
"Machiavelli, Niccolò",15.0,5,12776
"Shakespeare, William",16.0,188,166112
Molière,17.0,18,1288


In [22]:
# this was to see individual authors and their most popular works
library_en.loc[library_en['author'] == 'Shakespeare, William'].sort_values(by='downloads', ascending=False).head(10)

Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type,authorcentury,authorcentury_str
5702,PG1513,Romeo and Juliet,"Shakespeare, William",1564.0,1616.0,['en'],166112,"{'Juliet (Fictitious character) -- Drama', 'Ro...",,16.0,16th century CE
110,PG100,The Complete Works of William Shakespeare,"Shakespeare, William",1564.0,1616.0,['en'],124452,{'English drama -- Early modern and Elizabetha...,,16.0,16th century CE
14452,PG23042,The Tempest: The Works of William Shakespeare ...,"Shakespeare, William",1564.0,1616.0,['en'],6820,"{'Spirits -- Drama', 'Tragicomedy', 'Shipwreck...",,16.0,16th century CE
5824,PG1524,"Hamlet, Prince of Denmark","Shakespeare, William",1564.0,1616.0,['en'],4015,"{'Kings and rulers -- Succession -- Drama', 'H...",,16.0,16th century CE
19689,PG27761,"Hamlet, Prince of Denmark","Shakespeare, William",1564.0,1616.0,['en'],3638,"{'Kings and rulers -- Succession -- Drama', 'H...",,16.0,16th century CE
5924,PG1533,Macbeth,"Shakespeare, William",1564.0,1616.0,['en'],2911,"{'Scotland -- Kings and rulers -- Drama', 'Reg...",,16.0,16th century CE
1250,PG1112,The Tragedy of Romeo and Juliet,"Shakespeare, William",1564.0,1616.0,['en'],1512,"{'Juliet (Fictitious character) -- Drama', 'Ro...",,16.0,16th century CE
5902,PG1531,"Othello, the Moor of Venice","Shakespeare, William",1564.0,1616.0,['en'],1443,"{'Jealousy -- Drama', 'Interracial marriage --...",,16.0,16th century CE
5713,PG1514,A Midsummer Night's Dream,"Shakespeare, William",1564.0,1616.0,['en'],1399,"{'Fairy plays', 'Courtship -- Drama', 'Athens ...",,16.0,16th century CE
5724,PG1515,The Merchant of Venice,"Shakespeare, William",1564.0,1616.0,['en'],1159,"{'Venice (Italy) -- Drama', 'Jews -- Italy -- ...",,16.0,16th century CE


In [23]:
# this is to count the books in order based on descending popularity; +1 is there to start the count at 1 and not 0
# note: I could also sort them randomly
library_en['book_count'] =  (
     library_en
        .sort_values("downloads", ascending=False)
        .groupby('author')
        .cumcount() + 1
)
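
To see what the sort + `cumcount` combination produces, here is a toy run. `ranking_demo` is a hypothetical frame with invented numbers. The ranks are computed on the sorted copy and then aligned back to the original rows by index.

```python
import pandas as pd

# invented data: three books by author A, two by author B
ranking_demo = pd.DataFrame({
    'author': ['A', 'B', 'A', 'A', 'B'],
    'downloads': [10, 5, 30, 20, 7],
})

# rank each author's books by downloads (1 = most downloaded)
ranking_demo['book_count'] = (
    ranking_demo
    .sort_values('downloads', ascending=False)
    .groupby('author')
    .cumcount() + 1
)
print(ranking_demo)
```

The original row order is untouched. Only the new column reflects the per-author popularity rank.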

In [24]:
# get the authors that I selected above
library_select = library_en.loc[(library_en['author'].isin(select_authors))]

# for authors with more than 70 books, filter it down to 70
# library_select = library_select.loc[library_select['book_count'] <= 70.]

# for authors with more than 50 books, filter it down to 50
library_select = library_select.loc[library_select['book_count'] <= 50.]

# this is for hugging face, as I'll be selecting multiple parts within book
# library_select = library_select.loc[library_select['book_count'] <= 5.]

In [25]:
# to keep the order consistent
library_select = library_select.sort_values(by=['authorcentury', 'author', 'downloads'], ascending=[True,True,False])

In [26]:
# this is to have numbers for classification
select_authors =  list(library_select['author'].unique())

authors_to_num = {select_authors[i]: i for i in range(len(select_authors))}

library_select['author_num'] = library_select['author'].map(authors_to_num)
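
The same label encoding can be written with `enumerate`, together with the inverse mapping needed to turn predictions back into names. This is a small sketch with a made-up three-author list (`authors_demo` is hypothetical).

```python
# hypothetical short author list, in the order the labels should be assigned
authors_demo = ['Homer', 'Plato', 'Twain, Mark']

# author name -> integer label
authors_to_num_demo = {author: i for i, author in enumerate(authors_demo)}

# integer label -> author name (inverse of the mapping above)
num_to_authors_demo = {i: author for author, i in authors_to_num_demo.items()}

print(authors_to_num_demo)
print(num_to_authors_demo)
```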

In [27]:
# for streamlit
library_select[['author', 'authorcentury', 'authorcentury_str', 'author_num']].drop_duplicates().to_pickle('../data/select_authors.pkl')

## Reading Book Contents

Code to read in the actual texts (from library_select)

Function that opens files and extracts the text (leaving the Gutenberg info at top and bottom out).

In [28]:
def import_book(filepath):

    try:
        with open(filepath, encoding='utf-8') as fi:
            book = fi.read()

    except UnicodeDecodeError:
        # note: when using this, the weird characters, such as ì get left out!
        with open(filepath, encoding='unicode_escape') as fi:
            book = fi.read()

    # raw strings keep the escaped asterisks intact for the regex engine
    start_match = re.search(r'\*\*\* START OF .+? \*\*\*', book)
    end_match = re.search(r'\*\*\* END OF .+? \*\*\*', book)

    # fall back to the full text when a marker is missing
    # (slicing to -1, as before, would silently drop the last character)
    book_start = start_match.end() if start_match else 0
    book_end = end_match.start() if end_match else len(book)

    return book[book_start:book_end]
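
To illustrate the marker stripping without touching any files, here is a sketch on an in-memory string. `sample_text` is invented and `strip_gutenberg_markers` is a hypothetical helper that isolates the slicing step.

```python
import re

# invented miniature of a Gutenberg file: header, start marker, body, end marker, footer
sample_text = (
    'Header licence text\n'
    '*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n'
    'Actual book body.\n'
    '*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n'
    'Footer licence text\n'
)

def strip_gutenberg_markers(text):
    """Return the text between the START and END markers, if they are present."""
    start = re.search(r'\*\*\* START OF .+? \*\*\*', text)
    end = re.search(r'\*\*\* END OF .+? \*\*\*', text)
    begin_at = start.end() if start else 0
    stop_at = end.start() if end else len(text)
    return text[begin_at:stop_at]

body = strip_gutenberg_markers(sample_text)
print(repr(body))
```

A text without markers comes back unchanged, which is the fallback I want for files that do not follow the usual layout.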

Process the selected authors' books and save their texts into library_select:

In [29]:
for book_id in library_select['id']:
    
    filepath = f'../data/raw/{book_id}_raw.txt'

    try:
        book = import_book(filepath)

        library_select.loc[library_select['id'] == book_id, 'book_content'] = book
        
    except (OSError, UnicodeError):
        print('could not open', filepath)

        library_select.loc[library_select['id'] == book_id, 'book_content'] = 'could not open'

In [30]:
# some books had unusual characters and could not be opened at first; that is fixed now
library_unopened_books = library_select.loc[library_select['book_content'] == 'could not open']
print('could not open:', library_unopened_books.shape)

library_select = library_select.loc[library_select['book_content'] != 'could not open']
print('shape of library_select:', library_select.shape)

could not open: (0, 14)
shape of library_select: (105, 14)


In [31]:
# found out the author's name appears at the beginning of the book, sometimes multiple times during the introduction! need to address that
names_to_remove = [
    'Homer',
    'Confucius',
    'Plato',
    'Cicero', 'Marcus', 'Tullius',
    'Seneca', 'Lucius', 'Annaeus',
    'Dante', 'Alighieri',
    'Boccaccio', 'Giovanni',
    'Machiavelli', 'Niccolò',
    'Shakespeare', 'William',
    'Molière',
    'Defoe', 'Daniel',
    'Jefferson', 'Thomas',
    'Austen', 'Jane',
    'Twain', 'Mark',
    'Dickens', 'Charles',
    'Doyle', 'Arthur', 'Conan',
    'Dumas', 'Alexandre',
    'Churchill', 'Winston',
    'Dick', 'Philip',
    'Huxley', 'Aldous',
    'Lovecraft', 'Howard', 'Phillips'
]

In [32]:
# have to account for and delete all variations of the name (first letter capitalized, all uppercase, all lowercase)
for name in names_to_remove:
    library_select['book_content'] = library_select['book_content'].replace(name,'', regex=True)
    library_select['book_content'] = library_select['book_content'].replace(name.upper(),'', regex=True)
    library_select['book_content'] = library_select['book_content'].replace(name.lower(),'', regex=True)
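
One caveat with plain substring replacement is that a name is also removed when it sits inside a longer word, e.g. 'Dick' inside 'Dickens' or 'mark' inside 'remarks'. A possible refinement, sketched here as an assumption rather than what this notebook does, is a single case-insensitive pattern with word boundaries. `names_demo` and `sample` are made up.

```python
import re

# hypothetical subset of the names list
names_demo = ['Dick', 'Mark', 'Twain']

# \b restricts matches to whole words; re.escape guards against regex metacharacters in names
name_pattern = re.compile(
    r'\b(?:' + '|'.join(map(re.escape, names_demo)) + r')\b',
    flags=re.IGNORECASE,
)

sample = 'MARK TWAIN wrote remarks; Dickens did not. mark my words, Dick.'
cleaned = name_pattern.sub('', sample)
print(cleaned)
```

Because the match ignores case, ordinary words that happen to equal a name (like "mark") are still removed, just as with the lowercase pass above.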

In [33]:
# to see more in a column of df; set 50 to None to display all
# pd.set_option('display.max_colwidth', 50) 

In [34]:
# FIXED author names appearing in beginning

# hugging face
# library_select.to_pickle('../data/library_fixed_author_five.pkl')

# sklearn
library_select.to_pickle('../data/library_fixed_author_fifty.pkl')

In [35]:
#library_en.loc[library_en['author'].isin(authors_df['author'])].groupby('author')[['id']].count().rename(columns={'id': 'num_books'})

In [36]:
# testing
# library_select.loc[library_select['book_content'].str.find('\n\nContents\n\n') > -1]
# library_select.loc[library_select['book_content'].str.find('INTRODUCTION') > -1]

## Testing with LogReg

This part does not need to be run (and therefore it is commented). Testing purposes only.

In [37]:
# # dictionary mapping author names to numbers
# authors_to_num = {select_authors[i]: i for i in range(len(select_authors))}

# library_select = library_select.replace({'author': authors_to_num})

# # to invert the above
# num_to_authors = {v: k for k, v in authors_to_num.items()}

In [38]:
# authors_to_num

In [39]:
# X = library_select[['book_content']]
# y = library_select['author']

# X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)

In [40]:
# print(X_train.shape)

# print(X_test.shape)

In [41]:
# pipe_logreg = Pipeline(
#     steps = [
#         ('vect', TfidfVectorizer(min_df=2, max_df=0.8, ngram_range=(1,2))),
#         ('logreg', LogisticRegression(max_iter = 10000))
#     ]
# )

In [42]:
# pipe_logreg.fit(X_train['book_content'], y_train)
# y_pred = pipe_logreg.predict(X_test['book_content'])

# print('accuracy score:', accuracy_score(y_test, y_pred), '\n')
# print('---- confusion matrix ------')
# print(confusion_matrix(y_test, y_pred), '\n')
# print('-------- classification report ---------')
# print(classification_report(y_test, y_pred))

In [43]:
# fig = px.imshow(confusion_matrix(y_test, y_pred),
#                 width=1000,
#                 height=800,
#                 text_auto=True,
#                 labels=dict(x="Predicted Label",
#                             y="True Label"),
#                             x=tuple(authors_to_num.keys()),
#                             y=tuple(authors_to_num.keys()),
#                             color_continuous_scale='Teal'
#                             )

# fig.update(layout_coloraxis_showscale=False)

# fig.show()

In [44]:
## pipe_logreg['vect'].vocabulary_['î']

# {k:v for (k,v) in pipe_logreg['vect'].vocabulary_.items() if v < 100}

To test out my logic, let's see how my own two books measure up!

In [45]:
# my_books = pd.DataFrame()

# for book_name in ('Deathway', 'Lambda'):
#         filepath = f'../data/{book_name} by Tomo Umer.txt'

#         with open(filepath, encoding = 'utf-8') as fi:
#                 book = fi.read()
        
#         tmp_book = pd.DataFrame({'author': 'Umer, Tomo', 'title': [book_name], 'book_content': [book]})

#         my_books = pd.concat([my_books, tmp_book], ignore_index = True)

In [46]:
# pipe_logreg.predict_proba(my_books['book_content'])

It would appear both of my books are most similar to Mark Twain!

In [47]:
# pd.DataFrame(pipe_logreg.predict_proba(my_books['book_content']),columns=authors_to_num.keys())

In [48]:
# test_df = pd.DataFrame(pipe_logreg.predict_proba(my_books['book_content']).T, columns=['Deathway', 'Lambda'])

# test_df.insert (0, 'Authors', authors_to_num.keys())

# test_df