# Discovery of Writing Differences

Capstone project by Tomo Umer

<img src="https://tomoumerdotcom.files.wordpress.com/2022/04/cropped-pho_logo_notext.png" alt="PRAISE DOG" style="width:400px;height:400px;"/>



## Imports

In [1]:
import pandas as pd
import numpy as np
import re
import glob

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, ConfusionMatrixDisplay
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.preprocessing import MaxAbsScaler
from copy import deepcopy

import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns

import umap
from scipy import spatial
from scipy.cluster.hierarchy import linkage, dendrogram

## Exploring Available Books

To start with, figure out which books are listed in the metadata CSV but were not downloaded.

In [2]:
# list of books downloaded successfully into the /raw/ folder
books_list = []

for name in glob.glob('../data/raw/*'):
    books_list.append(re.findall(r'PG\d*', name)[0])

# the metadata (books that should be there)
library = pd.read_csv('../data/metadata.csv')

# the difference
len(library) - len(books_list)

3435

There are 3435 "books" listed in the metadata that did not get downloaded.
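
The subtraction above only gives the size of the gap. To see which IDs are missing, a set difference does the job. This is a minimal sketch on invented file names and IDs (not the project data), assuming the `<id>_raw.txt` naming used later in this notebook:

```python
import re

# Hypothetical file names following the '<id>_raw.txt' pattern.
raw_files = ['../data/raw/PG10_raw.txt', '../data/raw/PG1513_raw.txt']

# Hypothetical metadata IDs (in the notebook these would come from library['id']).
metadata_ids = ['PG10', 'PG114', 'PG1513', 'PG239']

# Extract the 'PG<digits>' part of each path.
downloaded = {re.search(r'PG\d+', name).group() for name in raw_files}

# IDs that are listed in the metadata but have no downloaded file.
missing = sorted(set(metadata_ids) - downloaded)
print(missing)  # ['PG114', 'PG239']
```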

In [3]:
library.loc[~library['id'].isin(books_list)]['type'].value_counts(dropna=False)

NaN            2215
Sound          1104
Dataset          83
Image            33
MovingImage       7
StillImage        3
Collection        1
Text              1
Name: type, dtype: int64

Start with the entries whose 'type' is NaN. From my exploration further on, those are actually books. I might have to come back to this at a later date and figure out why they were not downloaded.

In [4]:
library.loc[(~library['id'].isin(books_list)) & (library['type'].isna())].head()

Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type
606,PG10547,Topsy-Turvy,"Verne, Jules",1828.0,1905.0,['en'],126,"{'Science fiction, French -- Translations into...",
703,PG10634,"The Queen of Hearts, and Sing a Song for Sixpence","Caldecott, Randolph",1846.0,1886.0,['en'],44,"{'Picture books for children', 'Nursery rhymes...",
841,PG10762,Impressions of Theophrastus Such,"Eliot, George",1819.0,1880.0,['en'],110,"{'Authors -- Fiction', 'England -- Fiction', '...",
923,PG10836,The Algebra of Logic,"Couturat, Louis",1868.0,1914.0,['en'],97,"{'Logic, Symbolic and mathematical', 'Algebrai...",
1106,PG10,The King James Version of the Bible,,,,['en'],5831,{'Bible'},


For 'Sound', I don't care that they did not get downloaded: I'm only looking for books, not audio files.

In [5]:
library.loc[(~library['id'].isin(books_list)) & (library['type'] == 'Sound')].head()

Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type
151,PG10137,Mary Had a Little Lamb: Recording taken from M...,"Edison, Thomas A. (Thomas Alva)",1847.0,1931.0,['en'],21,"{'Nursery rhymes, American'}",Sound
168,PG10152,Voice Trial - Kinetophone actor audition,"Lett, Bob",,,['en'],4,{'Auditions'},Sound
169,PG10153,Voice Trial - Kinetophone Actor Audition,"Lenord, Frank",,,['en'],4,{'Auditions'},Sound
170,PG10154,Voice Trial - Kinetophone Actor Audition,"Schultz, Siegfried Von",,,['en'],0,{'Auditions'},Sound
171,PG10155,The Right of the People to Rule,"Roosevelt, Theodore",1858.0,1919.0,['en'],9,"{'Progressivism (United States politics)', 'Po...",Sound


Next up, the 'Dataset' entries. The vast majority of them are genomes. There are 10 calculations of square roots and 1/pi to a million digits. And 'Moby Word Lists' is just info on Gutenberg, disclaimers, etc.

In [6]:
library.loc[(~library['id'].isin(books_list)) & (library['type'] == 'Dataset')].groupby('author')[['id']].count()

Unnamed: 0_level_0,id
author,Unnamed: 1_level_1
"Bonnell, Jerry T.",2
"De Forest, Norman L.",1
Human Genome Project,72
"Kanada, Yasumasa",1
"Kerr, Stan",1
"Nemiroff, Robert J.",5
"Ward, Grady",1


On to the images. 'Image' contains sheet music. 'MovingImage' contains a video of comets, a rotating Earth and 5 nuclear test videos. 'StillImage' contains an illustrated children's story and two maps/map images.

In [7]:
library.loc[(~library['id'].isin(books_list)) & (library['type'] == 'Image')].head()

Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type
1108,PG11001,String Quartet No. 05 in A major Opus 18,"Beethoven, Ludwig van",1770.0,1827.0,['en'],5,"{'Music', 'String quartets -- Scores'}",Image
1109,PG11002,"String Quartet No. 11 in F minor Opus 95 ""Seri...","Beethoven, Ludwig van",1770.0,1827.0,['en'],6,"{'String quartets -- Scores', 'Music'}",Image
1944,PG11755,String Quartet No. 10 in E flat major Opus 74 ...,"Beethoven, Ludwig van",1770.0,1827.0,['en'],15,"{'Music', 'String quartets -- Scores'}",Image
2381,PG12149,String Quartet No. 03 in D major Opus 18,"Beethoven, Ludwig van",1770.0,1827.0,['en'],15,"{'String quartets -- Scores', 'Music'}",Image
2479,PG12237,String Quartet No. 16 in F major Opus 135,"Beethoven, Ludwig van",1770.0,1827.0,['en'],21,"{'Music', 'String quartets -- Scores'}",Image


In [8]:
library.loc[(~library['id'].isin(books_list)) & (library['type'] == 'StillImage')]

Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type
1661,PG114,The Tenniel Illustrations for Carroll's Alice ...,"Tenniel, John",1820.0,1914.0,['en'],391,"{""Children's stories"", 'Fantasy fiction'}",StillImage
15515,PG239,Radar Map of the United States,United States,,,['en'],27,{'United States -- Maps'},StillImage
67797,PG758,"LandSat Picture of Washington, DC",United States. National Aeronautics and Space ...,,,['en'],36,{'Washington (D.C.) -- Remote-sensing images'},StillImage


And finally, 'Collection' contains only 'Project Gutenberg DVD: The July 2006 Special', and the single 'Text' entry that was not downloaded is just empty.

In [9]:
library.loc[(~library['id'].isin(books_list)) & (library['type'] == 'Collection')]

Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type
10150,PG19159,Project Gutenberg DVD: The July 2006 Special,,,,['en'],73,set(),Collection


In [10]:
library.loc[(~library['id'].isin(books_list)) & (library['type'] == 'Text')]

Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type
69464,PG90907,,,,,['en'],1,set(),Text


In [11]:
# to find specific authors
# library.loc[library['author'].str.find('Lovecraft') > -1]

## Selecting English books

Starting with 70449 "books" in the catalogue, first select all the texts in the library that are marked as being in English ('en').

That reduces the library to 56954 books.

In [12]:
library_en = library.loc[library['language'].str.find('en') > -1]
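
One caveat with the substring test: it is true for any value that merely contains the letters `en`, which would also match a longer code such as `enm`. A stricter sketch follows, assuming the `language` column holds stringified Python lists like `"['en', 'zh']"`, as the table output suggests. `is_english` is a hypothetical helper, not part of the notebook:

```python
import ast

def is_english(language_field):
    """Return True only if 'en' is one of the listed language codes."""
    # Parse the stringified list, e.g. "['en', 'zh']" -> ['en', 'zh'].
    codes = ast.literal_eval(language_field)
    return 'en' in codes

print(is_english("['en']"))        # True
print(is_english("['en', 'zh']"))  # True
print(is_english("['enm']"))       # False, although "['enm']".find('en') > -1
```

Something like `library['language'].apply(is_english)` would then give a boolean mask. Whether it changes the count here depends on the data.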

Next, for now I'm also going to drop all of the entries that were not downloaded (briefly explored in the previous part). Within the English subset these are:
- NaN 1991
- Sound 1039
- Dataset 83
- Image 33
- MovingImage 7
- StillImage 3
- Collection 1
- Text 1

That additionally reduces the library to 53796 books.

In [13]:
library_en = library_en.loc[library_en['id'].isin(books_list)]

Finally, drop the files accompanying sound, dataset, etc. entries that did get downloaded, i.e. rows whose 'type' is not NaN. Only 20 in total.

In [14]:
library_en = library_en[library_en['type'].isna()]

Final count of books to potentially use is 53776!

## Beginning Exploration of Authors

There are:
- 6345 books with 10 or fewer downloads
- 42617 books with 100 or fewer downloads

Potentially worth considering!

In [15]:
library_en[library_en['downloads'] <= 10].shape

(6345, 9)

Grouping by author, I noticed that there are 117 titles by "Unknown", 601 by "Anonymous" and 3422 by "Various". Upon further inspection, the 'Various' titles are mostly periodicals, meaning various magazines, so I decided it was safe to remove them.

As for unknown and anonymous, those might be interesting to check once I have a model, but as is, since I'm looking for writing style, I do want to know who the author is. (lol at Happy and Gay Marching Away - children's poetry by Unknown author).

In [16]:
library_en.groupby('author')[['title']].count().sort_values(by='title', ascending=False).head(30)

Unnamed: 0_level_0,title
author,Unnamed: 1_level_1
Various,3422
Anonymous,601
"Shakespeare, William",178
"Ebers, Georg",163
"Parker, Gilbert",132
"Oliphant, Mrs. (Margaret)",132
"Kingston, William Henry Giles",132
"Twain, Mark",128
"Fenn, George Manville",128
Unknown,117


In [17]:
library_en[library_en['author'] == 'Various']['subjects'].value_counts().head(20)

{'English wit and humor -- Periodicals'}                                                                                            550
{'Periodicals'}                                                                                                                     233
{'Questions and answers -- Periodicals'}                                                                                            220
{'Popular literature -- Great Britain -- Periodicals'}                                                                              195
{"Children's periodicals, American"}                                                                                                162
{'Congregational churches -- Missions -- Periodicals', 'Home missions -- Periodicals'}                                              145
{'Encyclopedias and dictionaries'}                                                                                                  136
{'American periodicals'}                        

Below, keeping authors that are not Anonymous, Unknown or Various, which cuts down to 49636 books.

In [18]:
library_en = library_en[~library_en['author'].isin(['Anonymous', 'Unknown', 'Various'])]

Ideas for selecting authors:
- first I started with top 6 based on the total # of books written
    - used `library_en.groupby('author')['title'].count().sort_values(ascending=False).head(6).index.to_list()`
- another idea was to look at most downloads over last 30 days
- another one to look based on the decade

In [19]:
# note: downloads are for the last 30 days
display(library_en.sort_values(by='downloads', ascending=False).head())

display(library_en.groupby('author')[['downloads']].sum().sort_values(by='downloads', ascending=False).head())

Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type
5702,PG1513,Romeo and Juliet,"Shakespeare, William",1564.0,1616.0,['en'],166112,"{'Juliet (Fictitious character) -- Drama', 'Ro...",
18200,PG2641,A Room with a View,"Forster, E. M. (Edward Morgan)",1879.0,1970.0,['en'],145035,"{'British -- Italy -- Fiction', 'Florence (Ita...",
30065,PG37106,"Little Women; Or, Meg, Jo, Beth, and Amy","Alcott, Louisa May",1832.0,1888.0,['en'],139345,"{'Bildungsromans', 'March family (Fictitious c...",
5102,PG145,Middlemarch,"Eliot, George",1819.0,1880.0,['en'],138208,"{'Bildungsromans', 'Married people -- Fiction'...",
18867,PG2701,"Moby Dick; Or, The Whale","Melville, Herman",1819.0,1891.0,['en'],135040,"{'Whaling -- Fiction', 'Psychological fiction'...",


Unnamed: 0_level_0,downloads
author,Unnamed: 1_level_1
"Smollett, T. (Tobias)",364524
"Shakespeare, William",333424
"Alcott, Louisa May",154321
"Forster, E. M. (Edward Morgan)",150394
"Eliot, George",145394


In [20]:
# function to calculate the century an author lived in, taking into account both birth and death where possible
def calc_century(birth, death):

    if np.isnan(birth) and np.isnan(death):
        x = np.nan

    elif np.isnan(birth):
        x = int(death / 100)

    elif np.isnan(death):
        x = int(birth / 100)

    else:
        # midpoint between birth and death
        x = int((birth + (death / 2 - birth / 2)) / 100)

    return x
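
A few worked values make the behaviour concrete. The sketch below restates the same logic with `math.isnan` so that it runs without NumPy; the sample years are taken from tables in this notebook. Note that the result is the year divided by 100 and truncated, so `18` stands for the 1800s rather than "the 18th century".

```python
import math

def calc_century(birth, death):
    """Century index (year / 100, truncated) from birth and/or death year."""
    if math.isnan(birth) and math.isnan(death):
        return math.nan
    if math.isnan(birth):
        return int(death / 100)
    if math.isnan(death):
        return int(birth / 100)
    # Midpoint of the author's life.
    return int((birth + death) / 2 / 100)

print(calc_century(1564.0, 1616.0))   # 15 (Shakespeare, midpoint 1590)
print(calc_century(math.nan, 606.0))  # 6  (Sengcan, only the death year is known)
print(calc_century(1828.0, 1905.0))   # 18 (Verne, midpoint 1866.5)
```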

In [22]:
library_en['authorcentury'] = library_en.apply(lambda x: calc_century(x.authoryearofbirth, x.authoryearofdeath), axis=1)

In [23]:
# Using that calculation, the breakdown of the 18256 authors is below
# note: NaN means they had neither yearofbirth nor yearofdeath
library_en.groupby('author')[['authorcentury']].max().value_counts(dropna=False)

authorcentury
 18.0            8194
NaN              4573
 19.0            4487
 17.0             559
 16.0             227
 15.0             104
 0.0               18
 14.0              17
 13.0              14
 12.0              10
-4.0                8
 11.0               6
-3.0                6
 10.0               6
 3.0                5
 2.0                3
 7.0                3
 5.0                2
 9.0                2
 6.0                2
-2.0                2
 1.0                2
 20.0               1
-7.0                1
-5.0                1
 4.0                1
-1.0                1
 8.0                1
dtype: int64

In [48]:
cte = 6.0

display(library_en.loc[library_en['authorcentury'] == cte].groupby('author').count())

display(library_en.loc[library_en['authorcentury'] == cte])

Unnamed: 0_level_0,id,title,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type,authorcentury
author,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
"Muhammad, Prophet",1,1,0,1,1,1,1,0,1
Sengcan,1,1,0,1,1,1,1,0,1


Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type,authorcentury
31701,PG38580,True Heart/Mind,Sengcan,,606.0,"['en', 'zh']",36,{'Zen Buddhism'},,6.0
53747,PG58426,The Speeches & Table-Talk of the Prophet Mohammad,"Muhammad, Prophet",,632.0,['en'],64,"{'Religious life -- Islam', 'Muhammad, Prophet...",,6.0


In [None]:
# Plato (spurious and doubtful works)
# unknown: -2 ce, 2 ce, 3 ce, 4 ce, 5 ce
# -1 ce, only 1 book from Cato; to check how long
['Homer', 'Confucius', 'Herodotus', 'Plato', 'Seneca, Lucius Annaeus', 'Marcus Aurelius, Emperor of Rome'
]

In [None]:
# top six authors by number of titles; this name is used by the cells further down
top_six_authors_list = library_en.groupby('author')['title'].count().sort_values(ascending=False).head(6).index.to_list()

In [None]:
select_authors = ['Shakespeare, William', 'Melville, Herman', 'Dumas, Alexandre', 'Austen, Jane',
'Fitzgerald, F. Scott (Francis Scott)', 'Ibsen, Henrik', 'Stoker, Bram', 'Dickens, Charles',
'Doyle, Arthur Conan', 'Kafka, Franz', 'Tolstoy, Leo, graf', 'Dostoyevsky, Fyodor', 'Joyce, James',
'Machiavelli, Niccolò', 'Homer', 'Hobbes, Thomas', 'Plato', 'Nietzsche, Friedrich Wilhelm', 'Poe, Edgar Allan',
'Christie, Agatha', 'Twain, Mark', 'Verne, Jules', 'Dickinson, Emily', 'Dick, Philip K.', 'Lovecraft, H. P. (Howard Phillips)']

In [None]:
library_en[library_en['author'].isin(select_authors)].groupby('author')['authoryearofbirth'].max()

In [None]:
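# .copy() so that adding the 'book_content' column later does not write into a slice of library_en
library_top_six = library_en[library_en['author'].isin(top_six_authors_list)].copy()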

Just some interesting authors that I recognize...
- Shakespeare, William
- Melville, Herman
- Dumas, Alexandre
- Austen, Jane
- Fitzgerald, F. Scott (Francis Scott)
- Ibsen, Henrik
- Stoker, Bram
- Dickens, Charles
- Doyle, Arthur Conan
- Kafka, Franz
- Tolstoy, Leo, graf
- Dostoyevsky, Fyodor
- Joyce, James
- Machiavelli, Niccolò
- Homer
- Hobbes, Thomas
- Plato
- Nietzsche, Friedrich Wilhelm
- Poe, Edgar Allan
- Christie, Agatha
- Twain, Mark
- Verne, Jules
- Dickinson, Emily
- Dick, Philip K.
- Lovecraft, H. P. (Howard Phillips)

Authors based on most downloads;

## Reading and Tokenizing Books

Function that opens files and extracts the text (leaving the Gutenberg info at top and bottom out).

In [None]:
def import_book(filepath):

    try:
        with open(filepath, encoding='utf-8') as fi:
            book = fi.read()

    except UnicodeDecodeError:
        # note: when using this, the weird characters, such as ì get left out!
        with open(filepath, encoding='unicode_escape') as fi:
            book = fi.read()

    # raw strings, so that \* reaches the regex engine as an escaped asterisk
    start_marker = re.search(r'\*\*\* START OF .+? \*\*\*', book)
    end_marker = re.search(r'\*\*\* END OF .+? \*\*\*', book)

    # fall back to the whole text when a marker is missing
    book_start = start_marker.end() if start_marker else 0
    # len(book) rather than -1, which would cut off the last character
    book_end = end_marker.start() if end_marker else len(book)

    return book[book_start:book_end]
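
To check the marker logic without touching any files, here is a small sketch on an in-memory string. `strip_gutenberg_markers` is a hypothetical helper and the sample text is invented. It uses the same two regular expressions and falls back to the full text when a marker is missing.

```python
import re

START = re.compile(r'\*\*\* START OF .+? \*\*\*')
END = re.compile(r'\*\*\* END OF .+? \*\*\*')

def strip_gutenberg_markers(text):
    """Keep only what lies between the START and END marker lines."""
    start = START.search(text)
    end = END.search(text)
    begin = start.end() if start else 0
    stop = end.start() if end else len(text)
    return text[begin:stop]

sample = (
    'Some licence header\n'
    '*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n'
    'Call me Ishmael.\n'
    '*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n'
    'Some licence footer\n'
)

print(repr(strip_gutenberg_markers(sample)))  # '\nCall me Ishmael.\n'
print(strip_gutenberg_markers('no markers'))  # no markers
```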

Process the top six authors' books.

> NOTE: here I found out that some files have strange characters and won't open. I will have to decide what to do with those.

In [None]:
for book_id in library_top_six['id']:

    filepath = f'../data/raw/{book_id}_raw.txt'

    try:
        book = import_book(filepath)

        library_top_six.loc[library_top_six['id'] == book_id, 'book_content'] = book

    except (OSError, UnicodeError):
        # missing file, or neither encoding could decode it
        print('could not open', filepath)

        library_top_six.loc[library_top_six['id'] == book_id, 'book_content'] = 'could not open'


In [None]:
# some books had weird characters and could not be opened
library_unopened_books = library_top_six.loc[library_top_six['book_content'] == 'could not open']

library_top_six = library_top_six.loc[library_top_six['book_content'] != 'could not open']

In [None]:
# to see more in a column of df
# pd.set_option('display.max_colwidth', 50) #set it to None to display all

In [None]:
top_six_authors_dict = {author: i for i, author in enumerate(top_six_authors_list)}

library_top_six = library_top_six.replace({'author': top_six_authors_dict})

In [None]:
# to invert the above
num_to_author = {v: k for k, v in top_six_authors_dict.items()}
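
The two dictionaries are just a label encoding and its inverse. A tiny sketch with three invented placeholder names (not the actual top six):

```python
authors = ['Author A', 'Author B', 'Author C']  # placeholder names

# Map each author to an integer label, in list order.
author_to_num = {author: i for i, author in enumerate(authors)}

# Invert the mapping to recover names from predicted labels.
num_to_author = {num: author for author, num in author_to_num.items()}

predicted = [2, 0, 2]  # e.g. labels returned by a classifier
decoded = [num_to_author[n] for n in predicted]
print(decoded)  # ['Author C', 'Author A', 'Author C']
```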

In [None]:
top_six_authors_dict

In [None]:
X = library_top_six[['book_content']]
y = library_top_six['author']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 321, stratify = y)

In [None]:
# print both, since a cell only displays its last expression
print(X_train.shape)
print(X_test.shape)

In [None]:
pipe_vect_logreg = Pipeline(
    steps = [
        ('vect', TfidfVectorizer(min_df=2, max_df=0.8, ngram_range=(1,2))),
        ('logreg', LogisticRegression(max_iter = 10000))
    ]
)
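
As a rough illustration of what `min_df=2` and `max_df=0.8` do, the sketch below counts in how many documents each word occurs and keeps only the words inside that range. It is a simplified stand-in written in plain Python, not scikit-learn's implementation (no n-grams, no TF-IDF weighting), and the five toy documents are invented.

```python
docs = [
    'the whale and the sea',
    'the whale hunts',
    'the sea is calm',
    'the captain and the crew',
    'the ship',
]

min_df = 2     # keep words found in at least 2 documents
max_df = 0.8   # drop words found in more than 80% of the documents

# Document frequency: number of documents containing each word.
doc_freq = {}
for doc in docs:
    for word in set(doc.split()):
        doc_freq[word] = doc_freq.get(word, 0) + 1

upper = max_df * len(docs)
vocabulary = sorted(w for w, df in doc_freq.items() if min_df <= df <= upper)
print(vocabulary)  # ['and', 'sea', 'whale']
```

Here 'the' is dropped for being in every document, and words that occur in a single document are dropped as too rare.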

In [None]:
pipe_vect_logreg.fit(X_train['book_content'], y_train)
y_pred = pipe_vect_logreg.predict(X_test['book_content'])

print('accuracy score:', accuracy_score(y_test, y_pred), '\n')
print('---- confusion matrix ------')
print(confusion_matrix(y_test, y_pred), '\n')
print('-------- classification report ---------')
print(classification_report(y_test, y_pred))

In [None]:
fig = px.imshow(confusion_matrix(y_test, y_pred),
                width=1000,
                height=800,
                text_auto=True,
                labels=dict(x="Predicted Label",
                            y="True Label"),
                x=tuple(top_six_authors_dict.keys()),
                y=tuple(top_six_authors_dict.keys()),
                color_continuous_scale='Teal'
                )

fig.update(layout_coloraxis_showscale=False)

fig.show()

In [None]:
#pipe_vect_logreg['vect'].vocabulary_['î']

{k:v for (k,v) in pipe_vect_logreg['vect'].vocabulary_.items() if v < 100}

To test out my logic, let's see how my own two books measure up!

In [None]:
my_books = pd.DataFrame()

for book_name in ('Deathway', 'Lambda'):
    filepath = f'../data/{book_name} by Tomo Umer.txt'

    with open(filepath, encoding='utf-8') as fi:
        book = fi.read()

    tmp_book = pd.DataFrame({'title': [book_name], 'book_content': [book]})

    my_books = pd.concat([my_books, tmp_book], ignore_index=True)

In [None]:
pipe_vect_logreg.predict_proba(my_books['book_content'])

It would appear both of my books are most similar to Mark Twain!
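
Reading the "most similar" author off a probability matrix amounts to taking the largest entry in each row. A small sketch with invented probabilities and placeholder names (not the actual model output):

```python
authors = ['Author A', 'Author B', 'Author C']  # placeholder class names

# Invented probabilities: one row per book, one column per author.
probabilities = [
    [0.10, 0.65, 0.25],
    [0.20, 0.30, 0.50],
]

# Index of the largest probability in each row, mapped back to a name.
best_authors = [authors[max(range(len(row)), key=row.__getitem__)]
                for row in probabilities]
print(best_authors)  # ['Author B', 'Author C']
```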

In [None]:
pd.DataFrame(pipe_vect_logreg.predict_proba(my_books['book_content']),columns=top_six_authors_dict.keys())

In [None]:
test_df = pd.DataFrame(pipe_vect_logreg.predict_proba(my_books['book_content']).T, columns=['Deathway', 'Lambda'])

test_df.insert(0, 'Authors', list(top_six_authors_dict.keys()))

test_df

## Neural Networks

Using neural networks in order to obtain a better representation of similarity (by using an intermediate hidden layer of 100 neurons).

In [None]:
pipe_vect_nn = Pipeline(
    steps = [
        ('vect', TfidfVectorizer(min_df=2, max_df=0.8, ngram_range=(1,2))),
        ('scaler', MaxAbsScaler()),  #this is needed in order to make it converge in a reasonable time!
        ('nn', MLPClassifier(verbose = True,
                             hidden_layer_sizes = (100, ),
                             #activation = 'tanh',
                             #max_iter = 10000,
                             #alpha=0.05
                             ))
    ]
)

In [None]:
pipe_vect_nn.fit(X_train['book_content'], y_train)
y_pred = pipe_vect_nn.predict(X_test['book_content'])

print('accuracy score:', accuracy_score(y_test, y_pred), '\n')
print('---- confusion matrix ------')
print(confusion_matrix(y_test, y_pred), '\n')
print('-------- classification report ---------')
print(classification_report(y_test, y_pred))

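Now that the neural network is fit, I need to build an encoder from it: the same network cut off after the hidden layer, so that it outputs the 100 hidden activations instead of a class.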

In [None]:
# have to use the regressor because the classifier expects int (class) results, even if we chop it off before the final step!
encoder = MLPRegressor()
encoder.coefs_ = pipe_vect_nn['nn'].coefs_[:1]
encoder.intercepts_ = pipe_vect_nn['nn'].intercepts_[:1]
encoder.n_layers_ = 2
encoder.out_activation_ = 'relu'
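
With `n_layers_ = 2` and a ReLU output, the truncated network should compute a single affine map followed by ReLU, i.e. the hidden-layer activations of the classifier. Here is a NumPy sketch of that computation on made-up weights; `W` and `b` stand in for `coefs_[0]` and `intercepts_[0]` and are not real model values:

```python
import numpy as np

def hidden_activations(X, W, b):
    """ReLU(X @ W + b): the hidden-layer output of a one-hidden-layer MLP."""
    return np.maximum(X @ W + b, 0.0)

# Toy sizes: 2 samples, 3 input features, 4 hidden units.
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 0.0]])
W = np.array([[ 1.0, -1.0, 0.5, 0.0],
              [ 0.0,  2.0, 0.0, 1.0],
              [-1.0,  0.0, 1.0, 0.0]])
b = np.array([0.0, 0.5, -1.0, 0.0])

out = hidden_activations(X, W, b)
print(out)
# [[0.  0.  1.5 0. ]
#  [0.  2.5 0.  1. ]]
```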

In [None]:
# copy the pipeline
pipe_vect_encoder = deepcopy(pipe_vect_nn)

# remove the classifier
pipe_vect_encoder.steps.pop(2)

# append the new encoder (essentially, it contains all layers minus the final one)
pipe_vect_encoder.steps.append(['enc', encoder])

In [None]:
# projection = pipe_vect_encoder.predict(X_test['book_content'])

# plt.figure(figsize = (10,6))
# sns.scatterplot(x=projection[:,0], y=projection[:,1], hue = y_test.astype('category'))
# plt.legend(bbox_to_anchor = (1,1));


In [None]:
# now using predict from the encoder to get the 100-dimensional projection of the top six authors
nn_represent_top_six = pipe_vect_encoder.predict(library_top_six['book_content'])

Next, use UMAP to project the 100-dimensional representation down to 2D!

In [None]:
umap_mnist = umap.UMAP()
umap_mnist.fit(nn_represent_top_six)

In [None]:
umap_projection = umap_mnist.transform(nn_represent_top_six)

plt.figure(figsize = (10,6))
sns.scatterplot(x=umap_projection[:,0], y=umap_projection[:,1], 
                hue = library_top_six['author'].astype('category'),
               alpha = 0.7);

Now to use the encoder to see how the six authors get represented in this 100-dimensional space.

In [None]:
print(library_top_six.shape)
print(nn_represent_top_six.shape)

In [None]:
library_top_six_breakdown = pd.DataFrame(nn_represent_top_six, columns=[f'dim_{x}' for x in range(100)])

In [None]:
library_title_author = library_top_six[['title', 'author']].replace({'author': num_to_author})

In [None]:
library_top_six_breakdown = pd.concat([library_title_author, library_top_six_breakdown.set_index(library_title_author.index)], axis=1)

Calculating cosine similarity between authors.

In order to plot it, I'll condense each author down to the mean of their books' 100-dimensional vectors.

In [None]:
library_top_six_grouped = library_top_six_breakdown.drop(columns=['title']).groupby('author').mean()

In [None]:
dists = spatial.distance.pdist(library_top_six_grouped.values, metric = 'cosine')
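
Note that `pdist` with `metric='cosine'` returns cosine distances, i.e. one minus the cosine similarity, so 0 means the two vectors point the same way. A plain-Python sketch of that quantity on invented vectors:

```python
import math

def cosine_distance(u, v):
    """1 - cos(angle between u and v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

print(cosine_distance([1.0, 0.0], [2.0, 0.0]))  # 0.0 (same direction)
print(cosine_distance([1.0, 0.0], [0.0, 3.0]))  # 1.0 (orthogonal)
```

Because only the direction matters, an author whose vectors are uniformly larger is not pushed away from the others by magnitude alone.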

In [None]:
mergings = linkage(dists, method='complete')

plt.figure(figsize = (12,8))
dendrogram(mergings,
           labels = list(library_top_six_grouped.index),
           leaf_rotation = 90,
           leaf_font_size = 6);

plt.tight_layout()
#plt.savefig('images/dendogram_complete_cosine.png', transparent=False, facecolor='white', dpi = 150);

In [None]:
# this is a bit convoluted, but .. I first concatenate all the texts in the pandas series which returns an extremely long string
# I then turn that string into a pandas series (the predict requires an iterable object)
# pipe_vect_encoder.predict(pd.Series(library_top_six.loc[library_top_six['author'] == 0]['book_content'].str.cat()))