This notebook compares the metadata file created by the SPGC with the pg_catalog.csv file from Project Gutenberg.
Then it builds a train/test dataset from the filtered metadata.

In [2]:
import numpy as np
import pandas as pd
import os, sys
import glob

from collections import Counter
import matplotlib.pyplot as plt

import misc_utils.dataset_filtering as dataset_filtering

In [5]:
git_repo_path = '/Users/dean/Documents/gitRepos'
gutenberg_repo_path = os.path.join(git_repo_path, 'gutenberg')
gutenberg_analysis_repo = os.path.join(git_repo_path, 'gutenberg-analysis')

In [6]:
## import internal helper functions
analysis_src_dir = os.path.join(gutenberg_analysis_repo,'src')
sys.path.append(analysis_src_dir)
from data_io import get_book

gutenberg_src_dir = os.path.join(gutenberg_repo_path,'src')
sys.path.append(gutenberg_src_dir)

from metaquery import meta_query
from jsd import jsdalpha

# Read in both metadata files

In [None]:
mq_filepath=os.path.join(gutenberg_repo_path,'metadata','metadata.csv')
pg_catalog_filepath='/home/dean/Documents/gitRepos/gutenberg_corpus_analysis/sample_dataset/pg_catalog.csv'

# Load both metadata files

Load both the metadata file generated by SPGC and the metadata file from PG

In [5]:
df = dataset_filtering.read_metadata_and_catalog(mq_filepath, pg_catalog_filepath)
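The implementation of `read_metadata_and_catalog` is not shown in this notebook. As a rough sketch of what such a merge could look like, the example below joins two toy frames on the numeric Project Gutenberg ID. The function name `merge_metadata_and_catalog`, the catalog column name `Text#`, and the renaming to `PG_ID` and `title_pgc` are assumptions chosen to match the merged columns seen later; the real helper may work differently.

```python
import pandas as pd

# Toy stand-ins for metadata.csv (SPGC) and pg_catalog.csv (Project Gutenberg).
# The merge logic here is an assumption, not the actual helper.
spgc = pd.DataFrame({
    "id": ["PG10028", "PG75522"],
    "title": ["Spalding's Official Baseball Guide - 1913", "The soldier's orphans"],
    "language": ["['en']", "['ne']"],
})
catalog = pd.DataFrame({
    "Text#": [10028, 75522],  # assumed name of the catalog's ID column
    "Title": ["Spalding's Official Baseball Guide - 1913", "The soldier's orphans"],
    "Language": ["en", "en"],
})

def merge_metadata_and_catalog(spgc, catalog):
    """Hypothetical merge: join SPGC rows to catalog rows on the numeric PG ID."""
    spgc = spgc.copy()
    # "PG10028" -> 10028
    spgc["PG_ID"] = spgc["id"].str[2:].astype(int)
    catalog = catalog.rename(columns={"Text#": "PG_ID", "Title": "title_pgc"})
    # validate raises if either side has duplicate IDs
    return spgc.merge(catalog, on="PG_ID", how="inner", validate="one_to_one")

merged = merge_metadata_and_catalog(spgc, catalog)
print(merged[["id", "PG_ID", "language", "Language"]])
```

Using `validate="one_to_one"` makes the merge fail loudly if either file lists the same book twice, which seems preferable to silently duplicating rows.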
original_shape=df.shape

Get only English books, according to PG catalog

In [6]:
df = df.query('Language=="en"')
df['Language'].unique()

array(['en'], dtype=object)

Let's verify that the language columns in both metadata files match

In [7]:
df['language'].unique()

array(["['en']", "['ne']"], dtype=object)

Uh oh, they don't! Which book is causing the problem?

In [8]:
df.query('Language=="en" and language=="[\'ne\']"')

Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type,PG_ID,type_pgc,title_pgc,Language,Authors,Subjects,LoCC,Bookshelves
72655,PG75522,The soldier's orphans,"Stephens, Ann S. (Ann Sophia)",1810.0,1886.0,['ne'],0,set(),Text,75522,Text,The soldier's orphans,en,"Stephens, Ann S. (Ann Sophia), 1810-1886",American fiction -- 19th century,PS,


Let's get rid of it!

In [9]:
index_to_drop = df.query('Language=="en" and language=="[\'ne\']"').index
df.drop(index_to_drop, inplace=True)

In [10]:
# Verifying that everything is good
print(df['Language'].unique())
print(df['language'].unique())

['en']
["['en']"]
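The SPGC `language` column stores a stringified list such as `"['en']"`, while the catalog `Language` column stores the plain code. A minimal sketch of a row-wise comparison on toy data, assuming every entry holds exactly one language code (multi-language entries would need extra handling):

```python
import ast
import pandas as pd

toy = pd.DataFrame({
    "language": ["['en']", "['ne']", "['en']"],  # SPGC: stringified list
    "Language": ["en", "en", "en"],              # PG catalog: plain code
})

# Parse "['en']" into the Python list ['en'] and take its first element.
spgc_lang = toy["language"].map(lambda s: ast.literal_eval(s)[0])

# Rows where the two sources disagree
mismatch = toy[spgc_lang != toy["Language"]]
print(mismatch)
```

This finds disagreeing rows directly instead of relying on `unique()` to reveal them.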


In [11]:
print(f'Original Shape: {original_shape}')
print(f'Current Shape: {df.shape}')

Original Shape: (74141, 17)
Current Shape: (58991, 17)


# Let's get rid of anything missing a title or an author

In [12]:
tdf = df[['title', 'title_pgc', 'author', 'Authors']]
tdf[tdf.isnull().any(axis=1)]

Unnamed: 0,title,title_pgc,author,Authors
30,Spalding's Official Baseball Guide - 1913,Spalding's Official Baseball Guide - 1913,,"Foster, John B. (John Buckingham), 1863-1941 [..."
93,Moorish Literature: Comprising Romantic Ballad...,Moorish Literature\r\nComprising Romantic Ball...,,"Basset, René, 1855-1924 [Editor]"
114,"The Great Events by Famous Historians, Volume ...","The Great Events by Famous Historians, Volume ...",,"Johnson, Rossiter, 1840-1931 [Editor]; Horne, ..."
126,"The Great Events by Famous Historians, Volume ...","The Great Events by Famous Historians, Volume ...",,"Johnson, Rossiter, 1840-1931 [Editor]; Horne, ..."
134,The Literature of Arabia: With Critical and Bi...,The Literature of Arabia\r\nWith Critical and ...,,"Wilson, Epiphanius, 1845-1916 [Editor]"
...,...,...,...,...
75226,Who Was Who: 5000 B. C. to Date: Biographical ...,Who Was Who: 5000 B. C. to Date\r\nBiographica...,,"Gordon, Irwin Leslie, 1888-1954 [Editor]"
75300,Spalding's Baseball Guide and Official League ...,Spalding's Baseball Guide and Official League ...,,"Chadwick, Henry, 1824-1908 [Editor]"
75305,The Garden of Bright Waters: One Hundred and T...,The Garden of Bright Waters\r\nOne Hundred and...,,"Mathers, E. Powys (Edward Powys), 1892-1939 [T..."
75314,"The Great Events by Famous Historians, Volume 12","The Great Events by Famous Historians, Volume 12",,"Johnson, Rossiter, 1840-1931 [Editor]; Horne, ..."


Well, it looks like most of these do have authors; the author field is just missing in the metadata created by SPGC. Let's just drop them.

In [13]:
to_drop = tdf[tdf.isnull().any(axis=1)].index
df.loc[to_drop].head()

Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type,PG_ID,type_pgc,title_pgc,Language,Authors,Subjects,LoCC,Bookshelves
30,PG10028,Spalding's Official Baseball Guide - 1913,,,,['en'],125,{'Baseball'},Text,10028,Text,Spalding's Official Baseball Guide - 1913,en,"Foster, John B. (John Buckingham), 1863-1941 [...",Baseball,GV,Browsing: Sports/Hobbies/Motoring
93,PG10085,Moorish Literature: Comprising Romantic Ballad...,,,,['en'],186,{'Spanish literature -- Translations into Engl...,Text,10085,Text,Moorish Literature\r\nComprising Romantic Ball...,en,"Basset, René, 1855-1924 [Editor]",Spanish literature -- Translations into English,PQ,Browsing: Literature
114,PG10103,"The Great Events by Famous Historians, Volume ...",,,,['en'],157,{'World history'},Text,10103,Text,"The Great Events by Famous Historians, Volume ...",en,"Johnson, Rossiter, 1840-1931 [Editor]; Horne, ...",World history,D,Browsing: History - European; Browsing: Histor...
126,PG10114,"The Great Events by Famous Historians, Volume ...",,,,['en'],273,{'World history'},Text,10114,Text,"The Great Events by Famous Historians, Volume ...",en,"Johnson, Rossiter, 1840-1931 [Editor]; Horne, ...",World history,D,Greece; Browsing: History - European; Browsing...
134,PG10121,The Literature of Arabia: With Critical and Bi...,,,,['en'],109,{'Arabic literature'},Text,10121,Text,The Literature of Arabia\r\nWith Critical and ...,en,"Wilson, Epiphanius, 1845-1916 [Editor]",Arabic literature,PJ,Browsing: Culture/Civilization/Society; Browsi...


In [14]:
df.drop(to_drop, inplace=True)

# Let's see if titles match

In [15]:
dont_match, attribute_errors = dataset_filtering.compare_columns(df, 'title', 'title_pgc', verbose=True)

Dont Match: id: PG59774   Thirty Strange Stories   30 Strange Stories
Dont Match: id: PG63765   The Divine and Perpetual Obligation of the Observance of the Sabbath, with Reference more Especially to a Pamphlet Lately Puvblished by the Rev. C.J. Vaughan, D.D., Head Master of Harrow School, Entitled “A Few Words on the Crystal Palace Question”   Mutiny
Dont Match: id: PG6420   Copyright Renewals 1960   U.S. Copyright Renewals, 1960
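The two title columns differ in formatting (the catalog uses `\r\n` where SPGC uses `: `), so any comparison has to normalize them first. The sketch below is a hypothetical stand-in for `compare_columns`, whose real implementation is not shown here; the normalization rule and the meaning of `attribute_errors` are assumptions, and the `PG0` row is made up to exercise the error path.

```python
import pandas as pd

def compare_columns_sketch(df, col_a, col_b, id_col="id"):
    """Hypothetical comparison: return the ids whose two columns differ after
    normalizing line breaks, plus the ids holding non-string values."""
    dont_match, attribute_errors = [], []
    for _, row in df.iterrows():
        try:
            a = row[col_a].replace("\r\n", ": ").strip()
            b = row[col_b].replace("\r\n", ": ").strip()
        except AttributeError:
            # Missing values (None/NaN) have no .replace method
            attribute_errors.append(row[id_col])
            continue
        if a != b:
            dont_match.append(row[id_col])
    return dont_match, attribute_errors

toy = pd.DataFrame({
    "id": ["PG10085", "PG59774", "PG0"],
    "title": ["Moorish Literature: Comprising Romantic Ballads",
              "Thirty Strange Stories", None],
    "title_pgc": ["Moorish Literature\r\nComprising Romantic Ballads",
                  "30 Strange Stories", "x"],
})
dont_match, attribute_errors = compare_columns_sketch(toy, "title", "title_pgc")
print(dont_match, attribute_errors)
```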


Let's get rid of PG63765, and let's note that we should get rid of copyright renewals en masse.

We can also ditch the duplicate column

In [16]:
df = df[df['id']!='PG63765']
df.drop('title_pgc', axis=1, inplace=True)

# Verify Author Matches

In [17]:
dont_match, attribute_errors = dataset_filtering.compare_columns(df, 'author', 'Authors')

Note that a number of these don't match exactly, but it appears to be mostly a formatting issue. We can come back to it if needed; for now, the mismatched rows stay in the dataset.

Let's ditch the duplicate authors column though

In [18]:
df.drop('Authors', axis=1, inplace=True)

# Do IDs match?  They better!

Do this next

In [19]:
# TODO
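One possible way to fill in this check, sketched on toy data: rebuild the SPGC-style `id` from the numeric `PG_ID` and compare row by row. This assumes `id` is always `"PG"` followed by the catalog number, which is what the rows shown in this notebook suggest; the last toy row is deliberately wrong.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": ["PG41228", "PG9582", "PG1002"],
    "PG_ID": [41228, 9582, 1003],  # last row deliberately mismatched
})

# Rebuild the SPGC-style id from the catalog number and compare row by row.
expected_id = "PG" + toy["PG_ID"].astype(str)
id_mismatch = toy[toy["id"] != expected_id]
print(id_mismatch)
```

An empty `id_mismatch` on the real frame would confirm that the two ID columns agree.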




# Where do we stand?

In [20]:
df.shape

(56712, 15)

In [21]:
df.columns

Index(['id', 'title', 'author', 'authoryearofbirth', 'authoryearofdeath',
       'language', 'downloads', 'subjects', 'type', 'PG_ID', 'type_pgc',
       'Language', 'Subjects', 'LoCC', 'Bookshelves'],
      dtype='object')

In [22]:
df['author'].value_counts()

author
Various                                3608
Anonymous                               756
Lytton, Edward Bulwer Lytton, Baron     217
Shakespeare, William                    180
Ebers, Georg                            165
                                       ... 
Ryus, W. H. (William Henry)               1
Schoenrich, Otto                          1
Akenside, Mark                            1
Moore, T. Sturge (Thomas Sturge)          1
Turnbull, Margaret                        1
Name: count, Length: 19526, dtype: int64

In [23]:

#fig, ax = plt.subplots()
#df['author'].value_counts().plot(ax=ax, kind='bar')

In [24]:
#df['author'].value_counts().plot(kind='bar')

It will be difficult to categorize "Various", "Anonymous", or "Unknown" authors, so let's ditch them

In [25]:
df = df[~df['author'].isin(['Various', 'Anonymous', 'Unknown'])]
df.shape

(52230, 15)

Let's see how many authors have more than 10 or 20 books

In [26]:
vc = df['author'].value_counts()
vc

author
Lytton, Edward Bulwer Lytton, Baron    217
Shakespeare, William                   180
Ebers, Georg                           165
Twain, Mark                            158
Oliphant, Mrs. (Margaret)              141
                                      ... 
Beer, Max                                1
Gates, Burton N. (Burton Noble)          1
Hedge, Mary Ann                          1
Mackinlay, M. (Malcolm) Sterling         1
Stearns, Henry Putnam                    1
Name: count, Length: 19523, dtype: int64

In [27]:
print(f'There are a total of {len(vc)} authors')
for book_count in [10, 20, 30, 40, 50, 75, 100]:
    print(f'There are {len(vc[vc > book_count])} authors with more than {book_count} books')


There are a total of 19523 authors
There are 802 authors with more than 10 books
There are 330 authors with more than 20 books
There are 173 authors with more than 30 books
There are 109 authors with more than 40 books
There are 75 authors with more than 50 books
There are 29 authors with more than 75 books
There are 15 authors with more than 100 books


In [28]:
# Authors must have strictly more than this many books to be kept
book_count_cutoff=30

In [29]:
authors_to_include = vc[vc > book_count_cutoff].index

mask = df['author'].isin(authors_to_include)
df = df[mask]

In [30]:
df.shape

(9817, 15)

# Add information on the length of books

Adds the number of lines, the number of words, and the number of unique words

By default, drops the books you haven't downloaded

In [31]:
# NOTE: INCOMPLETE, SOME BOOKS DIDN'T DOWNLOAD, FIX SOON

raw_text_path=os.path.join(gutenberg_repo_path, 'data', 'text')
#tdf = dataset_filtering.add_line_counts(df, raw_text_path, drop_missing=False)
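For reference, here is a small sketch of how the three length statistics could be computed for a single text file. It is not the implementation of `add_line_counts`; the function name `text_length_stats`, the whitespace tokenization, the lowercasing, and the file name are all assumptions.

```python
from pathlib import Path
import tempfile

def text_length_stats(path):
    """Return (lines, words, unique words) for one plain-text file.

    Words are whitespace-separated and lowercased; this tokenization is an
    assumption, not necessarily what add_line_counts does.
    """
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    words = text.lower().split()
    return len(text.splitlines()), len(words), len(set(words))

# Demo on a throwaway file standing in for a downloaded book.
with tempfile.TemporaryDirectory() as tmp:
    book = Path(tmp) / "PG0_text.txt"  # hypothetical file name
    book.write_text("The cat sat.\nThe cat ran.\n", encoding="utf-8")
    stats = text_length_stats(book)
print(stats)  # (2, 6, 4)
```

Punctuation stays attached to words here ("sat." and "ran." count as distinct tokens), so the unique-word count is only a rough measure.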

In [32]:
df.shape

(9817, 15)

# SETTINGS

In [33]:
df['author'].value_counts().min()

np.int64(31)

In [40]:
#####################
#####################
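def normalize_dataset(df, how='num_books'):
    """Downsample every author to the smallest per-author book count.

    Modifies df in place. `how` is currently unused; only balancing by
    number of books is implemented.
    """
    author_list = df['author'].unique()
    vc = df['author'].value_counts()

    # Every author is reduced to this many books
    min_num_books = vc.min()

    for author in author_list:
        tdf = df.query('author==@author')
        num_to_drop = tdf.shape[0] - min_num_books
        # Randomly choose this author's surplus books and drop them
        ind_to_drop = tdf.sample(num_to_drop).index
        df.drop(ind_to_drop, inplace=True)


def split_test_train(df, train_perc=0.8):
    """Split df into train and test sets, sampling train_perc of each author's books."""
    author_list = df['author'].unique()
    train_ind = []
    for author in author_list:
        tdf = df.query('author==@author')
        single_author_train_ind = tdf.sample(frac=train_perc).index
        train_ind.extend(single_author_train_ind)

    # Everything not sampled for training goes to the test set
    train_df = df.loc[train_ind]
    test_df = df.drop(train_ind)

    return train_df, test_df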
    

In [35]:
normalize_dataset(df)

In [36]:
df.shape

(5363, 15)
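The resulting shape is consistent with the earlier counts: 173 authors have more than 30 books, the smallest author has 31, and 173 × 31 = 5363. As an alternative to the in-place loop, the same balancing can be sketched with `groupby(...).sample`, which returns a new frame and accepts a `random_state` for reproducibility. The toy data and the seed value are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [f"PG{i}" for i in range(8)],
    "author": ["A"] * 5 + ["B"] * 3,
})

# Smallest number of books any author has (3 in this toy frame)
min_books = int(toy["author"].value_counts().min())

# Draw exactly min_books rows per author; the input frame is left untouched.
balanced = toy.groupby("author").sample(n=min_books, random_state=0)
print(balanced["author"].value_counts())
```

Fixing `random_state` means rerunning the notebook yields the same subset of books, which the in-place version above does not guarantee.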

In [41]:
train_df, test_df = split_test_train(df)

In [42]:
train_df

Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type,PG_ID,type_pgc,Language,Subjects,LoCC,Bookshelves
34653,PG41228,"Guy Deverell, v. 1 of 2","Le Fanu, Joseph Sheridan",1814.0,1873.0,['en'],88,"{'England -- Fiction', 'Country homes -- Ficti...",Text,41228,Text,en,England -- Fiction; Revenge -- Fiction; Countr...,PR,Browsing: Culture/Civilization/Society; Browsi...
28253,PG35468,"The Tenants of Malory, Volume 2","Le Fanu, Joseph Sheridan",1814.0,1873.0,['en'],109,{'Mystery fiction'},Text,35468,Text,en,Mystery fiction,PR,Browsing: Crime/Mystery; Browsing: Literature;...
18241,PG26451,"A Stable for Nightmares; or, Weird Tales","Le Fanu, Joseph Sheridan",1814.0,1873.0,['en'],203,{'Fantasy fiction'},Text,26451,Text,en,Fantasy fiction,PR,Browsing: Literature; Browsing: Science-Fictio...
30146,PG37172,"In a Glass Darkly, v. 1/3","Le Fanu, Joseph Sheridan",1814.0,1873.0,['en'],325,"{'Ireland -- Fiction', 'Paranormal fiction'}",Text,37172,Text,en,Ireland -- Fiction; Paranormal fiction,PR,Browsing: Literature; Browsing: Religion/Spiri...
34654,PG41229,"Guy Deverell, v. 2 of 2","Le Fanu, Joseph Sheridan",1814.0,1873.0,['en'],154,"{'England -- Fiction', 'Country homes -- Ficti...",Text,41229,Text,en,England -- Fiction; Revenge -- Fiction; Countr...,PR,Browsing: Culture/Civilization/Society; Browsi...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
74929,PG9582,Personal Poems II: Part 2 from Volume IV of Th...,"Whittier, John Greenleaf",1807.0,1892.0,['en'],93,{'American poetry -- 19th century'},Text,9582,Text,en,American poetry -- 19th century,PS,Browsing: History - American; Browsing: Litera...
74917,PG9571,"Snow Bound, and other poems: Part 4 From Volum...","Whittier, John Greenleaf",1807.0,1892.0,['en'],95,{'American poetry -- 19th century'},Text,9571,Text,en,American poetry -- 19th century,PS,Browsing: Literature; Browsing: Poetry
74941,PG9593,Historical Papers: Part 3 from Volume VI of Th...,"Whittier, John Greenleaf",1807.0,1892.0,['en'],101,{'History'},Text,9593,Text,en,History,PS,Browsing: History - General; Browsing: Literature
74907,PG9562,"Barclay of Ury, and other poems: Part 3 From V...","Whittier, John Greenleaf",1807.0,1892.0,['en'],110,{'American poetry -- 19th century'},Text,9562,Text,en,American poetry -- 19th century,PS,Browsing: Literature; Browsing: Poetry


In [43]:
test_df

Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type,PG_ID,type_pgc,Language,Subjects,LoCC,Bookshelves
32,PG1002,"Divine Comedy, Longfellow's Translation, Purga...",Dante Alighieri,1265.0,1321.0,['en'],487,"{'Epic poetry, Italian -- Translations into En...",Text,1002,Text,en,"Epic poetry, Italian -- Translations into Engl...",PQ,Italy; Browsing: Culture/Civilization/Society;...
94,PG10086,The Minute Boys of the Mohawk Valley,"Otis, James",1848.0,1912.0,['en'],101,"{'United States -- History -- Revolution, 1775...",Text,10086,Text,en,"United States -- History -- Revolution, 1775-1...",PZ,Browsing: Children & Young Adult Reading; Brow...
98,PG1008,"Divine Comedy, Cary's Translation, Complete",Dante Alighieri,1265.0,1321.0,['en'],442,"{'Epic poetry, Italian -- Translations into En...",Text,1008,Text,en,"Epic poetry, Italian -- Translations into Engl...",PQ,Italy; Harvard Classics; Browsing: Literature;...
209,PG1018,The Water-Babies,"Kingsley, Charles",1819.0,1875.0,['en'],575,"{'Chimney sweeps -- Juvenile fiction', 'Fairy ...",Text,1018,Text,en,Chimney sweeps -- Juvenile fiction; Fairy tale...,PZ,Browsing: Children & Young Adult Reading; Brow...
407,PG10368,The Vizier of the Two-Horned Alexander,"Stockton, Frank R.",1834.0,1902.0,['en'],132,"{'Immortalism -- Fiction', 'Fantasy fiction', ...",Text,10368,Text,en,Fantasy fiction; Adventure stories; Immortalis...,PS,Browsing: Literature; Browsing: Science-Fictio...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
74940,PG9592,Personal Sketches and Tributes: Part 2 from Vo...,"Whittier, John Greenleaf",1807.0,1892.0,['en'],130,{'Biography'},Text,9592,Text,en,Biography,CT; PS,Browsing: Biographies; Browsing: History - Gen...
74969,PG9618,The Field of Ice: Part II of the Adventures of...,"Verne, Jules",1828.0,1905.0,['en'],165,"{'Sea stories', 'Adventure stories'}",Text,9618,Text,en,Sea stories; Adventure stories,PQ,Browsing: Literature; Browsing: Fiction
75110,PG9745,The Rock of Chickamauga: A Story of the Wester...,"Altsheler, Joseph A. (Joseph Alexander)",1862.0,1919.0,['en'],185,"{'United States -- History -- Civil War, 1861-...",Text,9745,Text,en,"United States -- History -- Civil War, 1861-18...",PZ,US Civil War; Children's Fiction; Browsing: Ch...
75130,PG9763,"Alice, or the Mysteries — Book 01","Lytton, Edward Bulwer Lytton, Baron",1803.0,1873.0,['en'],65,{'English fiction -- 19th century'},Text,9763,Text,en,English fiction -- 19th century,PR,Browsing: Literature; Browsing: Fiction
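A quick sanity check for a split like this is to confirm that the train and test sets are disjoint, cover every row, and keep the intended share per author. The sketch below does this on toy data, using `groupby(...).sample` as a stand-in for the per-author loop in `split_test_train`; with 31 books per author and `frac=0.8`, the real split should give roughly 25 training and 6 test books per author.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [f"PG{i}" for i in range(20)],
    "author": ["A"] * 10 + ["B"] * 10,
})

# Same idea as split_test_train: sample 80% of each author's rows for training.
train = toy.groupby("author").sample(frac=0.8, random_state=0)
test = toy.drop(train.index)

# The two parts should be disjoint and together cover every row.
overlap = train.index.intersection(test.index)
per_author = train["author"].value_counts().to_dict()
print(len(overlap), len(train), len(test), per_author)
```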


In [45]:
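def write_csv_in_metadata_format(df, outfile):
    """Write df to outfile, keeping only the SPGC-style metadata columns listed below."""
    cols_to_keep = ['id', 'title', 'author', 'authoryearofbirth', 'authoryearofdeath',
                    'language', 'downloads', 'subjects']
    df = df[cols_to_keep]
    # The DataFrame index is written as the first, unnamed CSV column
    df.to_csv(outfile)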

In [46]:
train_outfile = 'train.csv'
test_outfile = 'test.csv'
write_csv_in_metadata_format(train_df, train_outfile)
write_csv_in_metadata_format(test_df, test_outfile)