This notebook compares the metadata file created by the SPGC (Standardized Project Gutenberg Corpus) with the pg_catalog.csv file from Project Gutenberg.
It then builds a train/test dataset from the filtered metadata.

In [1]:
import numpy as np
import pandas as pd
import os, sys
import glob

from collections import Counter
import matplotlib.pyplot as plt

import misc_utils.dataset_filtering as dataset_filtering

In [2]:
git_repo_path = '/Users/dean/Documents/gitRepos'
gutenberg_repo_path = os.path.join(git_repo_path, 'gutenberg')
gutenberg_analysis_repo = os.path.join(git_repo_path, 'gutenberg-analysis')

In [3]:
## import internal helper functions
analysis_src_dir = os.path.join(gutenberg_analysis_repo,'src')
sys.path.append(analysis_src_dir)
from data_io import get_book

gutenberg_src_dir = os.path.join(gutenberg_repo_path,'src')
sys.path.append(gutenberg_src_dir)

from metaquery import meta_query
from jsd import jsdalpha

# Read in both metadata files

In [4]:
mq_filepath=os.path.join(gutenberg_repo_path,'metadata','metadata.csv')
pg_catalog_filepath=os.path.join(git_repo_path, 'gutenberg_corpus_analysis', 'sample_dataset', 'pg_catalog.csv')

# Load both metadata files

Load both the metadata file generated by SPGC and the metadata file from PG

In [5]:
df = dataset_filtering.read_metadata_and_catalog(mq_filepath, pg_catalog_filepath)
original_shape=df.shape
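
The body of `read_metadata_and_catalog` is not shown in this notebook. As a rough sketch of what such a join could look like, the example below merges two tiny, made-up tables on the numeric Project Gutenberg ID. The function name `merge_metadata_sketch`, the toy rows, and the assumption that the catalog's ID column is called `Text#` are illustrative guesses, not the helper's actual implementation.

```python
import pandas as pd

# Hypothetical miniature versions of the two metadata files.
spgc = pd.DataFrame({
    "id": ["PG10007", "PG1003"],
    "title": ["Carmilla", "Divine Comedy"],
    "type": ["Text", "Text"],
})
pg_catalog = pd.DataFrame({
    "Text#": [10007, 1003, 99999],   # assumed name of the catalog ID column
    "Title": ["Carmilla", "Divine Comedy", "Not in SPGC"],
    "Type": ["Text", "Text", "Text"],
    "Language": ["en", "en", "fr"],
})

def merge_metadata_sketch(spgc, pg_catalog):
    """Inner-join the two tables on the numeric Project Gutenberg ID."""
    left = spgc.copy()
    # Temporary join key: the SPGC id with the literal "PG" prefix removed.
    left["_key"] = left["id"].str.replace("PG", "", regex=False).astype(int)
    # Rename catalog columns so they do not collide with the SPGC ones.
    right = pg_catalog.rename(
        columns={"Text#": "PG_ID", "Title": "title_pgc", "Type": "type_pgc"}
    )
    merged = left.merge(right, left_on="_key", right_on="PG_ID", how="inner")
    return merged.drop(columns="_key")

merged = merge_metadata_sketch(spgc, pg_catalog)
print(merged.columns.tolist())
```

An inner join keeps only books present in both files, which is why the catalog-only row disappears in this toy example.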

Get only English books, according to PG catalog

In [6]:
df = df.query('Language=="en"')
df['Language'].unique()

array(['en'], dtype=object)

Let's verify that the language columns in both metadata files match

In [7]:
df['language'].unique()

array(["['en']", "['ne']"], dtype=object)

Uh oh, they don't! Which book is causing the problem?

In [8]:
df.query('Language=="en" and language=="[\'ne\']"')

Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type,PG_ID,type_pgc,title_pgc,Language,Authors,Subjects,LoCC,Bookshelves
72655,PG75522,The soldier's orphans,"Stephens, Ann S. (Ann Sophia)",1810.0,1886.0,['ne'],0,set(),Text,75522,Text,The soldier's orphans,en,"Stephens, Ann S. (Ann Sophia), 1810-1886",American fiction -- 19th century,PS,


Let's get rid of it!

In [9]:
index_to_drop = df.query('Language=="en" and language=="[\'ne\']"').index
df.drop(index_to_drop, inplace=True)
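
The query above hard-codes the one bad value, `['ne']`. A more general check could parse the stringified list in the SPGC `language` column and compare it with the PG catalog `Language` code. The sketch below uses made-up rows, and the helper name `languages_agree` is hypothetical.

```python
import ast
import pandas as pd

# Toy rows: SPGC stores a stringified list, the PG catalog a plain code.
toy = pd.DataFrame({
    "id": ["PG1", "PG2", "PG3"],
    "language": ["['en']", "['ne']", "['en']"],
    "Language": ["en", "en", "en"],
})

def languages_agree(row):
    """True when the SPGC list is exactly the single PG catalog code."""
    spgc_langs = ast.literal_eval(row["language"])  # "['en']" -> ['en']
    return spgc_langs == [row["Language"]]

# Keep only the rows where the two sources disagree.
language_mismatches = toy[~toy.apply(languages_agree, axis=1)]
print(language_mismatches["id"].tolist())
```

This only handles single-language catalog entries; multi-language books would need a different comparison.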

In [10]:
# Verifying that everything is good
print(df['Language'].unique())
print(df['language'].unique())

['en']
["['en']"]


In [11]:
print(f'Original Shape: {original_shape}')
print(f'Current Shape: {df.shape}')

Original Shape: (74141, 17)
Current Shape: (58991, 17)


# Let's get rid of anything missing a title or an author

In [12]:
tdf = df[['title', 'title_pgc', 'author', 'Authors']]
tdf[tdf.isnull().any(axis=1)]

Unnamed: 0,title,title_pgc,author,Authors
30,Spalding's Official Baseball Guide - 1913,Spalding's Official Baseball Guide - 1913,,"Foster, John B. (John Buckingham), 1863-1941 [..."
93,Moorish Literature: Comprising Romantic Ballad...,Moorish Literature\r\nComprising Romantic Ball...,,"Basset, René, 1855-1924 [Editor]"
114,"The Great Events by Famous Historians, Volume ...","The Great Events by Famous Historians, Volume ...",,"Johnson, Rossiter, 1840-1931 [Editor]; Horne, ..."
126,"The Great Events by Famous Historians, Volume ...","The Great Events by Famous Historians, Volume ...",,"Johnson, Rossiter, 1840-1931 [Editor]; Horne, ..."
134,The Literature of Arabia: With Critical and Bi...,The Literature of Arabia\r\nWith Critical and ...,,"Wilson, Epiphanius, 1845-1916 [Editor]"
...,...,...,...,...
75226,Who Was Who: 5000 B. C. to Date: Biographical ...,Who Was Who: 5000 B. C. to Date\r\nBiographica...,,"Gordon, Irwin Leslie, 1888-1954 [Editor]"
75300,Spalding's Baseball Guide and Official League ...,Spalding's Baseball Guide and Official League ...,,"Chadwick, Henry, 1824-1908 [Editor]"
75305,The Garden of Bright Waters: One Hundred and T...,The Garden of Bright Waters\r\nOne Hundred and...,,"Mathers, E. Powys (Edward Powys), 1892-1939 [T..."
75314,"The Great Events by Famous Historians, Volume 12","The Great Events by Famous Historians, Volume 12",,"Johnson, Rossiter, 1840-1931 [Editor]; Horne, ..."


Well, it looks like most of these do have authors (or at least editors) in the PG catalog; the author field is only missing in the metadata created by SPGC. Let's just drop them.

In [13]:
tdf.head()

Unnamed: 0,title,title_pgc,author,Authors
0,The Magna Carta,The Magna Carta,Anonymous,Anonymous
1,Apocolocyntosis,Apocolocyntosis,"Seneca, Lucius Annaeus","Seneca, Lucius Annaeus, 5? BCE-65; Rouse, W. H..."
2,The House on the Borderland,The House on the Borderland,"Hodgson, William Hope","Hodgson, William Hope, 1877-1918"
3,"My First Years as a Frenchwoman, 1876-1879","My First Years as a Frenchwoman, 1876-1879","Waddington, Mary King","Waddington, Mary King, 1833-1923"
4,The Warriors,The Warriors,"Lindsay, Anna Robertson Brown","Lindsay, Anna Robertson Brown, 1864-1948"


In [14]:
to_drop = tdf[tdf.isnull().any(axis=1)].index
df.loc[to_drop].head()

Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type,PG_ID,type_pgc,title_pgc,Language,Authors,Subjects,LoCC,Bookshelves
30,PG10028,Spalding's Official Baseball Guide - 1913,,,,['en'],125,{'Baseball'},Text,10028,Text,Spalding's Official Baseball Guide - 1913,en,"Foster, John B. (John Buckingham), 1863-1941 [...",Baseball,GV,Browsing: Sports/Hobbies/Motoring
93,PG10085,Moorish Literature: Comprising Romantic Ballad...,,,,['en'],186,{'Spanish literature -- Translations into Engl...,Text,10085,Text,Moorish Literature\r\nComprising Romantic Ball...,en,"Basset, René, 1855-1924 [Editor]",Spanish literature -- Translations into English,PQ,Browsing: Literature
114,PG10103,"The Great Events by Famous Historians, Volume ...",,,,['en'],157,{'World history'},Text,10103,Text,"The Great Events by Famous Historians, Volume ...",en,"Johnson, Rossiter, 1840-1931 [Editor]; Horne, ...",World history,D,Browsing: History - European; Browsing: Histor...
126,PG10114,"The Great Events by Famous Historians, Volume ...",,,,['en'],273,{'World history'},Text,10114,Text,"The Great Events by Famous Historians, Volume ...",en,"Johnson, Rossiter, 1840-1931 [Editor]; Horne, ...",World history,D,Greece; Browsing: History - European; Browsing...
134,PG10121,The Literature of Arabia: With Critical and Bi...,,,,['en'],109,{'Arabic literature'},Text,10121,Text,The Literature of Arabia\r\nWith Critical and ...,en,"Wilson, Epiphanius, 1845-1916 [Editor]",Arabic literature,PJ,Browsing: Culture/Civilization/Society; Browsi...


In [15]:
df.drop(to_drop, inplace=True)
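
The same filtering can be written in one step with `dropna(subset=...)`. This is a small sketch on made-up rows, not a replacement for the cells above.

```python
import pandas as pd

# Toy frame: PG2 lacks an SPGC author, PG3 lacks an SPGC title.
toy = pd.DataFrame({
    "id": ["PG1", "PG2", "PG3"],
    "title": ["A", "B", None],
    "title_pgc": ["A", "B", "C"],
    "author": ["X", None, "Z"],
    "Authors": ["X, 1800-1850", "Y [Editor]", "Z, 1900-1950"],
})

required = ["title", "title_pgc", "author", "Authors"]
# Drop every row that has a missing value in any of the required columns.
cleaned = toy.dropna(subset=required)
print(cleaned["id"].tolist())
```

`dropna` returns a new DataFrame, so the original frame is left untouched unless the result is assigned back.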

# Let's see if titles match

In [16]:
dont_match, attribute_errors = dataset_filtering.compare_columns(df, 'title', 'title_pgc', verbose=True)

Dont Match: id: PG59774   Thirty Strange Stories   30 Strange Stories
Dont Match: id: PG63765   The Divine and Perpetual Obligation of the Observance of the Sabbath, with Reference more Especially to a Pamphlet Lately Puvblished by the Rev. C.J. Vaughan, D.D., Head Master of Harrow School, Entitled “A Few Words on the Crystal Palace Question”   Mutiny
Dont Match: id: PG6420   Copyright Renewals 1960   U.S. Copyright Renewals, 1960


Let's get rid of PG63765, and note that we should remove the copyright renewals en masse.

We can also drop the duplicate title_pgc column

In [17]:
df = df[df['id']!='PG63765']
df.drop('title_pgc', axis=1, inplace=True)
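
`compare_columns` lives in `misc_utils.dataset_filtering` and its source is not shown here. The sketch below is one guess at how such a comparison might behave, given that it returns a DataFrame of mismatches plus a list of attribute errors. The normalization rules (treating a `\r\n` line break as `: ` and ignoring case) and the name `compare_columns_sketch` are assumptions, not the helper's real logic.

```python
import pandas as pd

def compare_columns_sketch(df, col_a, col_b, verbose=False):
    """Return (rows whose two columns differ, ids that could not be compared)."""
    def normalize(text):
        # Assumed rule: a line break in one title equals ": " in the other.
        return " ".join(text.replace("\r\n", ": ").split()).lower()

    mismatch_index, attribute_errors = [], []
    for idx, row in df.iterrows():
        try:
            a, b = normalize(row[col_a]), normalize(row[col_b])
        except AttributeError:
            # Non-string values such as NaN have no .replace method.
            attribute_errors.append(row["id"])
            continue
        if a != b:
            mismatch_index.append(idx)
            if verbose:
                print(f"Dont Match: id: {row['id']}   {row[col_a]}   {row[col_b]}")
    return df.loc[mismatch_index], attribute_errors

# Made-up rows covering a match, a mismatch, and a missing value.
toy = pd.DataFrame({
    "id": ["PG1", "PG2", "PG3"],
    "title": ["Moorish Literature: Ballads", "Thirty Strange Stories", float("nan")],
    "title_pgc": ["Moorish Literature\r\nBallads", "30 Strange Stories", "Something"],
})
dont_match_toy, errors_toy = compare_columns_sketch(toy, "title", "title_pgc")
print(dont_match_toy["id"].tolist(), errors_toy)
```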

# Verify Author Matches

In [18]:
dont_match, attribute_errors = dataset_filtering.compare_columns(df, 'author', 'Authors')
dont_match[['title','author','Authors']]

Unnamed: 0,title,author,Authors
1,Apocolocyntosis,"Seneca, Lucius Annaeus","Seneca, Lucius Annaeus, 5? BCE-65; Rouse, W. H..."
2,The House on the Borderland,"Hodgson, William Hope","Hodgson, William Hope, 1877-1918"
3,"My First Years as a Frenchwoman, 1876-1879","Waddington, Mary King","Waddington, Mary King, 1833-1923"
4,The Warriors,"Lindsay, Anna Robertson Brown","Lindsay, Anna Robertson Brown, 1864-1948"
5,A Voyage to the Moon: With Some Account of the...,"Tucker, George","Tucker, George, 1775-1861"
...,...,...,...
75389,"France and England in North America, Part III:...","Parkman, Francis","Parkman, Francis, 1823-1893"
75390,Poems,"Betham, Matilda","Betham, Matilda, 1776-1852"
75391,"Harriet, the Moses of Her People","Bradford, Sarah H. (Sarah Hopkins)","Bradford, Sarah H. (Sarah Hopkins), 1818-1912"
75393,Collected Articles of Frederick Douglass,"Douglass, Frederick","Douglass, Frederick, 1818-1895"


Note that many rows don't match exactly, but this appears to be mostly a formatting difference: the PG catalog appends birth and death years and lists additional contributors. We can come back to it if needed; the rows are left in place.

Let's ditch the duplicate Authors column, though

In [19]:
df.drop('Authors', axis=1, inplace=True)
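
If the author comparison is revisited later, one option is to reduce the PG catalog `Authors` string to the first listed name before comparing. The sketch below is a hypothetical helper (`first_author_name`) run on made-up strings; it only handles the common `Name, years [Role]` pattern seen in the table above and will miss other variants.

```python
import re

def first_author_name(authors_field):
    """Reduce a PG catalog 'Authors' string to its first name, without years or role."""
    first = authors_field.split(";")[0].strip()        # keep the first contributor
    first = re.sub(r"\s*\[[^\]]*\]\s*$", "", first)    # drop a trailing "[Editor]"-style role
    first = re.sub(r",\s*\d[^,]*$", "", first)         # drop a trailing ", 1877-1918"-style span
    return first

# Hypothetical examples in the catalog's style.
examples = [
    "Hodgson, William Hope, 1877-1918",
    "Seneca, Lucius Annaeus, 5? BCE-65; Second Contributor, 1800-1900 [Translator]",
    "Basset, René, 1855-1924 [Editor]",
    "Anonymous",
]
normalized = [first_author_name(e) for e in examples]
print(normalized)
```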

# Do IDs match?  They better!

Note: This should be totally unnecessary since we joined on ID

In [20]:
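# Strip the "PG" prefix from 'id'; the result stays a string so it can be
# compared directly with the string version of PG_ID below
df['id_numeric'] = (
    df['id']
    .str.replace('PG', '', regex=False)  # remove the literal "PG"
    .astype(str)                         # keep as string (not converted to integer)
)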

df['PG_ID'] = df['PG_ID'].astype(str)              # Convert numeric to string

dont_match, attribute_errors = dataset_filtering.compare_columns(
    df,
    'id_numeric',
    'PG_ID',
    verbose=True
)

In [21]:
print(dont_match)
print(attribute_errors)

Empty DataFrame
Columns: [id, title, author, authoryearofbirth, authoryearofdeath, language, downloads, subjects, type, PG_ID, type_pgc, Language, Subjects, LoCC, Bookshelves, id_numeric]
Index: []
[]


No entries have mismatched IDs. The placeholder id_numeric column can be dropped (it is still present in the cells below). I am keeping the redundant PG_ID column in case it is useful later for querying the raw data.

# Where do we stand?

In [22]:
df.shape

(56712, 16)

In [23]:
df.columns

Index(['id', 'title', 'author', 'authoryearofbirth', 'authoryearofdeath',
       'language', 'downloads', 'subjects', 'type', 'PG_ID', 'type_pgc',
       'Language', 'Subjects', 'LoCC', 'Bookshelves', 'id_numeric'],
      dtype='object')

In [24]:
df['author'].value_counts()

author
Various                                3608
Anonymous                               756
Lytton, Edward Bulwer Lytton, Baron     217
Shakespeare, William                    180
Ebers, Georg                            165
                                       ... 
Brown, Helen Dawes                        1
Stephens, William                         1
Prindle, Frances Carruth                  1
Pearce, Joseph                            1
George, Marian Minnie                     1
Name: count, Length: 19526, dtype: int64

In [25]:

#fig, ax = plt.subplots()
#df['author'].value_counts().plot(ax=ax, kind='bar')

In [26]:
#df['author'].value_counts().plot(kind='bar')

It will be difficult to categorize "Various", "Anonymous", or "Unknown" authors, so let's drop them

In [27]:
df = df[~df['author'].isin(['Various', 'Anonymous', 'Unknown'])]
df.shape

(52230, 16)

Let's see how many authors have more than a given number of books (10, 20, and so on)

In [28]:
vc = df['author'].value_counts()
vc

author
Lytton, Edward Bulwer Lytton, Baron    217
Shakespeare, William                   180
Ebers, Georg                           165
Twain, Mark                            158
Oliphant, Mrs. (Margaret)              141
                                      ... 
Brown, Helen Dawes                       1
Stephens, William                        1
Prindle, Frances Carruth                 1
Pearce, Joseph                           1
George, Marian Minnie                    1
Name: count, Length: 19523, dtype: int64

In [29]:
print(f'There are a total of {len(vc)} authors')
for book_count in [10, 20, 30, 40, 50, 75, 100]:
    print(f'There are {len(vc[vc > book_count])} authors with more than {book_count} books')


There are a total of 19523 authors
There are 802 authors with more than 10 books
There are 330 authors with more than 20 books
There are 173 authors with more than 30 books
There are 109 authors with more than 40 books
There are 75 authors with more than 50 books
There are 29 authors with more than 75 books
There are 15 authors with more than 100 books


In [30]:
# What should be the minimum number of books per author?
book_count_cutoff=30

In [31]:
authors_to_include = vc[vc > book_count_cutoff].index

mask = df['author'].isin(authors_to_include)
df = df[mask]
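
Note that the comparison `vc > book_count_cutoff` is strict, so with a cutoff of 30 every remaining author has at least 31 books. The toy example below (made-up authors, hypothetical variable names) shows the difference between `>` and `>=`.

```python
import pandas as pd

# Author A has 3 books, B has 2, C has 1.
toy = pd.DataFrame({"author": ["A"] * 3 + ["B"] * 2 + ["C"]})
counts = toy["author"].value_counts()

cutoff = 2
strict = counts[counts > cutoff].index      # more than `cutoff` books
inclusive = counts[counts >= cutoff].index  # at least `cutoff` books

print(sorted(strict), sorted(inclusive))
```

Use `>=` instead if the cutoff is meant to be the minimum number of books an author may have.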

In [32]:
df.shape

(9817, 16)

# Add information on the length of books

Adds the number of words, unique words, lines, and tokens for each book

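Note: in this run, books you haven't downloaded are not dropped. The row count below is unchanged (9817), and rows without count data show NaN in the new columns.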

#### Add the total word count of the entry, called 'word_count'.

In [None]:
count_path = os.path.join(gutenberg_repo_path, 'data', 'counts')
df['word_count'] = df['id'].apply(lambda pid: dataset_filtering.get_word_count(pid, count_path))

#### Add the total unique word count of the entry, called 'unique_word_count'.

In [None]:


count_path = os.path.join(gutenberg_repo_path, 'data', 'counts')
df['unique_word_count'] = df['id'].apply(
    lambda pid: dataset_filtering.get_unique_word_count(pid, count_path)
)
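
The `get_word_count` and `get_unique_word_count` helpers are not shown in this notebook. As an illustration only, the sketch below computes both numbers from a counts file, assuming each file is named `<id>_counts.txt` and holds one tab-separated `word<TAB>count` pair per line. The file layout, the function name `word_counts_sketch`, and the NaN-on-missing behavior are all assumptions.

```python
import math
import os
import tempfile

def word_counts_sketch(pg_id, count_dir):
    """Return (total words, unique words), or (nan, nan) if no counts file exists."""
    path = os.path.join(count_dir, f"{pg_id}_counts.txt")  # assumed file name
    if not os.path.exists(path):
        return math.nan, math.nan
    total, unique = 0, 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            word, count = line.rstrip("\n").split("\t")
            total += int(count)   # sum of all occurrences
            unique += 1           # one line per distinct word
    return total, unique

# Demonstrate on a temporary directory with one made-up counts file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "PG1_counts.txt"), "w", encoding="utf-8") as f:
        f.write("the\t3\ncat\t1\n")
    found = word_counts_sketch("PG1", tmp)
    missing = word_counts_sketch("PG2", tmp)

print(found, missing)
```

Returning NaN for a missing file would be consistent with the NaN entries that appear in the count columns later in this notebook.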

#### Add total lines of text in the text file, called 'line_count'.

Note that this counts lines in the somewhat-cleaned files in the text folder, not the files in the raw folder.

In [None]:


text_path = os.path.join(gutenberg_repo_path, 'data', 'text')
df['line_count'] = df['id'].apply(
    lambda pid: dataset_filtering.get_line_count(pid, text_path)
)

#### Add total tokens in the entry, called 'token_count'.

In [None]:


token_path = os.path.join(gutenberg_repo_path, 'data', 'tokens')
df['token_count'] = df['id'].apply(
    lambda pid: dataset_filtering.get_token_count(pid, token_path)
)

In [37]:
df.shape

(9817, 20)

# SETTINGS

In [38]:
df['author'].value_counts().min()

31

In [39]:
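#####################
#####################
def normalize_dataset(df, how='num_books'):
    """Downsample every author to the smallest per-author book count.

    Modifies df in place. The `how` argument is currently unused.
    """
    author_list = df['author'].unique()
    vc = df['author'].value_counts()

    # Every author is reduced to the size of the smallest author
    min_num_books = vc.min()

    for author in author_list:
        tdf = df.query('author==@author')
        num_to_drop = tdf.shape[0] - min_num_books
        # Randomly choose this author's surplus books and drop them
        ind_to_drop = tdf.sample(num_to_drop).index
        df.drop(ind_to_drop, inplace=True)

def split_test_train(df, train_perc=0.8):
    """Split df into (train_df, test_df), sampling train_perc of each author's books for training."""
    author_list = df['author'].unique()
    train_ind = []
    for author in author_list:
        tdf = df.query('author==@author')
        # Sample per author so both splits keep the same author balance
        single_author_train_ind = tdf.sample(frac=train_perc).index
        train_ind.extend(single_author_train_ind)

    train_df = df.loc[train_ind]
    test_df = df.drop(train_ind)

    return train_df, test_df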
    

In [40]:
normalize_dataset(df)

In [41]:
df.shape

(5363, 20)

In [42]:
train_df, test_df = split_test_train(df)
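
Both functions above sample without a fixed seed, so the split changes on every run. A reproducible variant could use `groupby(...).sample(...)` with `random_state`. The sketch below runs on made-up data; the function name `balance_and_split_sketch` and the `seed` parameter are hypothetical and not part of this notebook's helpers.

```python
import pandas as pd

# Toy data: author A has 7 books, author B has 5.
toy = pd.DataFrame({
    "id": [f"PG{i}" for i in range(12)],
    "author": ["A"] * 7 + ["B"] * 5,
})

def balance_and_split_sketch(df, train_frac=0.8, seed=0):
    """Balance books per author, then split each author into train and test."""
    n_per_author = int(df["author"].value_counts().min())
    # Keep the same number of books for every author.
    balanced = df.groupby("author").sample(n=n_per_author, random_state=seed)
    # Take train_frac of each author's books for training.
    train = balanced.groupby("author").sample(frac=train_frac, random_state=seed)
    # Everything not selected for training becomes the test set.
    test = balanced.drop(train.index)
    return train, test

train_toy, test_toy = balance_and_split_sketch(toy)
print(train_toy["author"].value_counts().to_dict())
print(test_toy["author"].value_counts().to_dict())
```

Unlike `normalize_dataset`, this version returns new frames and leaves the input untouched.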

In [43]:
train_df

Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type,PG_ID,type_pgc,Language,Subjects,LoCC,Bookshelves,id_numeric,word_count,unique_word_count,line_count,token_count
1785,PG11610,Madam Crowl's Ghost and the Dead Sexton,"Le Fanu, Joseph Sheridan",1814.0,1873.0,['en'],239,{'Ghost stories'},Text,11610,Text,en,Ghost stories,PR,Browsing: Literature; Browsing: Religion/Spiri...,11610,13642.0,2479.0,1548.0,13642.0
33856,PG40510,"The Watcher, and other weird stories","Le Fanu, Joseph Sheridan",1814.0,1873.0,['en'],219,"{'Ireland -- Fiction', 'Horror tales, English'...",Text,40510,Text,en,"Fantasy fiction, English; Horror tales, Englis...",PR,Browsing: Literature; Browsing: Science-Fictio...,40510,61889.0,6797.0,6823.0,61889.0
27358,PG34662,Willing to Die: A Novel,"Le Fanu, Joseph Sheridan",1814.0,1873.0,['en'],152,{'Fiction'},Text,34662,Text,en,Fiction,PR,Browsing: Literature; Browsing: Fiction,34662,,,,
33429,PG40126,The Cock and Anchor,"Le Fanu, Joseph Sheridan",1814.0,1873.0,['en'],292,"{'Dublin (Ireland) -- Fiction', 'City and town...",Text,40126,Text,en,City and town life -- Fiction; Dublin (Ireland...,PR,Browsing: Culture/Civilization/Society; Browsi...,40126,153175.0,11067.0,17442.0,153175.0
17281,PG25584,The Purcell Papers: Index and Contents of the ...,"Le Fanu, Joseph Sheridan",1814.0,1873.0,['en'],117,"{'Indexes', 'Ireland -- Fiction', 'Fantasy fic...",Text,25584,Text,en,"Fantasy fiction, English; Ireland -- Fiction; ...",PR,Browsing: Culture/Civilization/Society; Browsi...,25584,285.0,122.0,123.0,285.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
74924,PG9578,Poems in Wartime: Part 4 From Volume III of Th...,"Whittier, John Greenleaf",1807.0,1892.0,['en'],107,"{'United States -- History -- Civil War, 1861-...",Text,9578,Text,en,"United States -- History -- Civil War, 1861-18...",PS,Browsing: History - American; Browsing: Litera...,9578,,,,
74935,PG9588,My Summer with Dr. Singletary: Part 2 from Vol...,"Whittier, John Greenleaf",1807.0,1892.0,['en'],112,{'American literature'},Text,9588,Text,en,American literature,PS,Browsing: History - American; Browsing: Litera...,9588,,,,
74947,PG9599,"The Works of John Greenleaf Whittier, Volume V...","Whittier, John Greenleaf",1807.0,1892.0,['en'],255,"{'American literature', 'United States -- Poli...",Text,9599,Text,en,United States -- Politics and government; Lite...,PS,Browsing: History - American; Browsing: Litera...,9599,,,,
74931,PG9584,"The Tent on the Beach, and other poems: Part 4...","Whittier, John Greenleaf",1807.0,1892.0,['en'],90,{'American poetry -- 19th century'},Text,9584,Text,en,American poetry -- 19th century,PS,Browsing: Literature; Browsing: Poetry,9584,,,,


In [44]:
test_df

Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type,PG_ID,type_pgc,Language,Subjects,LoCC,Bookshelves,id_numeric,word_count,unique_word_count,line_count,token_count
7,PG10007,Carmilla,"Le Fanu, Joseph Sheridan",1814.0,1873.0,['en'],12084,"{'Vampires -- Fiction', 'Young women -- Fiction'}",Text,10007,Text,en,Young women -- Fiction; Vampires -- Fiction,PR,Horror; Gothic Fiction; Browsing: Gender & Sex...,10007,27281.0,3918.0,3345.0,27281.0
43,PG1003,"Divine Comedy, Longfellow's Translation, Paradise",Dante Alighieri,1265.0,1321.0,['en'],526,"{'Epic poetry, Italian -- Translations into En...",Text,1003,Text,en,"Epic poetry, Italian -- Translations into Engl...",PQ,Italy; Browsing: Culture/Civilization/Society;...,1003,36996.0,5147.0,6728.0,36996.0
49,PG10045,"Dave Darrin's Second Year at Annapolis: Or, Tw...","Hancock, H. Irving (Harrie Irving)",1868.0,1922.0,['en'],96,{'United States Naval Academy -- Juvenile fict...,Text,10045,Text,en,United States Naval Academy -- Juvenile fiction,PZ,Children's Book Series; Browsing: Children & Y...,10045,36826.0,4235.0,5838.0,36826.0
76,PG1006,"Divine Comedy, Cary's Translation, Purgatory",Dante Alighieri,1265.0,1321.0,['en'],192,"{'Epic poetry, Italian -- Translations into En...",Text,1006,Text,en,"Epic poetry, Italian -- Translations into Engl...",PQ,Italy; Browsing: History - European; Browsing:...,1006,,,,
209,PG1018,The Water-Babies,"Kingsley, Charles",1819.0,1875.0,['en'],575,"{'Fairy tales -- England', 'Chimney sweeps -- ...",Text,1018,Text,en,Chimney sweeps -- Juvenile fiction; Fairy tale...,PZ,Browsing: Children & Young Adult Reading; Brow...,1018,65078.0,5403.0,7443.0,65078.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
74986,PG9633,Sir George Tressady — Volume I,"Ward, Humphry, Mrs.",1851.0,1920.0,['en'],96,{'English fiction'},Text,9633,Text,en,English fiction,PR,Browsing: Literature; Browsing: Fiction,9633,,,,
74993,PG963,Little Dorrit,"Dickens, Charles",1812.0,1870.0,['en'],1664,"{'Inheritance and succession -- Fiction', 'Lon...",Text,963,Text,en,London (England) -- Fiction; Inheritance and s...,PR,Browsing: Culture/Civilization/Society; Browsi...,963,324389.0,14414.0,36886.0,324389.0
75115,PG974,The Secret Agent: A Simple Tale,"Conrad, Joseph",1857.0,1924.0,['en'],1633,"{'Conspiracies -- Fiction', 'London (England) ...",Text,974,Text,en,London (England) -- Fiction; Political fiction...,PR,Mystery Fiction; Browsing: Crime/Mystery; Brow...,974,86317.0,8489.0,9763.0,86317.0
75146,PG9778,Vane of the Timberlands,"Bindloss, Harold",1866.0,1945.0,['en'],158,{'British Columbia -- Fiction'},Text,9778,Text,en,British Columbia -- Fiction,PR,Browsing: Literature; Browsing: Travel & Geogr...,9778,,,,


In [45]:
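def write_csv_in_metadata_format(df, outfile):
    """Keep only the SPGC-style metadata columns and write them to a CSV file."""
    cols_to_keep = ['id', 'title', 'author', 'authoryearofbirth', 'authoryearofdeath',
       'language', 'downloads', 'subjects']
    df = df[cols_to_keep]
    # The DataFrame index is written as an unnamed first column
    df.to_csv(outfile)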

In [46]:
train_outfile = 'train.csv'
test_outfile = 'test.csv'
write_csv_in_metadata_format(train_df, train_outfile)
write_csv_in_metadata_format(test_df, test_outfile)
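
Because `to_csv` writes the index as an unnamed first column, reading these files back with a plain `read_csv` produces an extra `Unnamed: 0` column. The sketch below shows the round trip on a made-up frame, using an in-memory buffer instead of a file.

```python
import io
import pandas as pd

# Toy frame with a non-default index, like the filtered metadata above.
toy = pd.DataFrame({"id": ["PG1", "PG2"], "title": ["A", "B"]}, index=[7, 43])

buffer = io.StringIO()
toy.to_csv(buffer)  # index is written as the first, unnamed column

buffer.seek(0)
naive = pd.read_csv(buffer)                  # index comes back as a data column
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # index is restored

print(naive.columns.tolist())
print(restored.index.tolist())
```

Passing `index=False` to `to_csv` is another option if the original row index is not needed downstream.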