In [6]:
import pandas as pd
import re
import glob

## Exploring Available Books

To start, find out which books are listed in the metadata CSV but were not downloaded.

I made a copy of the RAW texts just in case.

In [7]:
books_list = []

for name in glob.glob('../data/raw/*'):
    books_list.append(re.findall(r'PG\d*', name)[0])
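
A slightly more defensive variant of the ID extraction above, as a sketch only. The helper name `extract_ids` and the sample file names are made up for illustration. It requires at least one digit after `PG`, skips files with no ID instead of raising an `IndexError`, and removes duplicates.

```python
import re
from pathlib import Path

# Hypothetical helper, not part of the original notebook.
PG_ID = re.compile(r'PG\d+')  # require at least one digit after "PG"


def extract_ids(paths):
    """Return the sorted, unique PG ids found in the given file names."""
    ids = set()
    for path in paths:
        match = PG_ID.search(Path(path).name)  # look at the file name only
        if match:  # skip files that carry no PG id
            ids.add(match.group(0))
    return sorted(ids)


# Made-up file names, only to show the behaviour
sample = [
    '../data/raw/PG10_raw.txt',
    '../data/raw/PG10547_raw.txt',
    '../data/raw/README.md',
    '../data/raw/PG10_raw.txt',
]
ids = extract_ids(sample)
print(ids)
```

With the real data this would be called as `extract_ids(glob.glob('../data/raw/*'))`.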

In [8]:
library = pd.read_csv('../data/metadata.csv')

In [9]:
len(library) - len(books_list)

3435

The metadata lists 3435 more entries than there are downloaded files, so roughly that many "books" were not downloaded. Next up: exploring why.
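
The subtraction above only equals the number of missing books if every downloaded file maps to exactly one metadata row. A minimal sketch of a stricter count, using a toy table with made-up ids in place of the real `library` and `books_list`:

```python
import pandas as pd

# Toy stand-ins for the metadata table and the downloaded ids (made-up values)
library = pd.DataFrame({
    'id': ['PG1', 'PG2', 'PG3', 'PG4'],
    'type': ['Text', 'Sound', None, 'Text'],
})
# Contains a duplicate and an id that has no metadata row
books_list = ['PG1', 'PG1', 'PG4', 'PG99']

naive = len(library) - len(books_list)     # difference of the two lengths
missing = ~library['id'].isin(books_list)  # True where the id has no file
exact = int(missing.sum())                 # number of metadata rows without a file

print(naive, exact)
print(library.loc[missing, 'id'].tolist())
```

In this toy case the two numbers disagree because of the duplicate and the stray id. On the real data they may well agree, but the mask is the safer thing to count.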

In [10]:
library.loc[~library['id'].isin(books_list)]['type'].value_counts()

Sound          1104
Dataset          83
Image            33
MovingImage       7
StillImage        3
Collection        1
Text              1
Name: type, dtype: int64

Starting with the entries whose 'type' is NaN. Either the flags are incorrect (I checked this against "The King James Version of the Bible"), or something else is keeping these books from being downloaded. The NaNs need a closer look.
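
The counts above show no NaN row because `value_counts()` leaves out missing values by default. A small sketch on a made-up Series showing how to include them:

```python
import pandas as pd

# Made-up 'type' column with some missing values
types = pd.Series(['Sound', None, 'Text', None, None], name='type')

default_counts = types.value_counts()           # missing values are left out
full_counts = types.value_counts(dropna=False)  # missing values get their own row
n_missing = int(types.isna().sum())             # direct count of missing values

print(full_counts)
print(n_missing)
```

On the real data the equivalent call would be `library.loc[~library['id'].isin(books_list), 'type'].value_counts(dropna=False)`.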

In [11]:
library.loc[(~library['id'].isin(books_list)) & (library['type'].isna())]

Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type
606,PG10547,Topsy-Turvy,"Verne, Jules",1828.0,1905.0,['en'],126,"{'Science fiction, French -- Translations into...",
703,PG10634,"The Queen of Hearts, and Sing a Song for Sixpence","Caldecott, Randolph",1846.0,1886.0,['en'],44,"{'Picture books for children', 'Nursery rhymes...",
841,PG10762,Impressions of Theophrastus Such,"Eliot, George",1819.0,1880.0,['en'],110,"{'Authors -- Fiction', 'England -- Fiction', '...",
923,PG10836,The Algebra of Logic,"Couturat, Louis",1868.0,1914.0,['en'],97,"{'Logic, Symbolic and mathematical', 'Algebrai...",
1106,PG10,The King James Version of the Bible,,,,['en'],5831,{'Bible'},
...,...,...,...,...,...,...,...,...,...
70441,PG9995,Little Journey to Puerto Rico: For Intermediat...,"George, Marian Minnie",1865.0,,['en'],12,{'Puerto Rico -- Description and travel'},
70442,PG9996,"""'Tis Sixty Years Since"": Address of Charles F...","Adams, Charles Francis",1835.0,1915.0,['en'],12,"{'Philosophy, Modern'}",
70443,PG9997,"France and England in North America, Part III:...","Parkman, Francis",1823.0,1893.0,['en'],34,{'Canada -- History -- To 1763 (New France)'},
70445,PG9999,"Harriet, the Moses of Her People","Bradford, Sarah H. (Sarah Hopkins)",1818.0,1912.0,['en'],103,"{'Slaves -- United States -- Biography', 'Afri...",


For 'Sound' it is pretty straightforward: I'm only looking for books, not audio files.

In [12]:
library.loc[(~library['id'].isin(books_list)) & (library['type'] == 'Sound')]

Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type
151,PG10137,Mary Had a Little Lamb: Recording taken from M...,"Edison, Thomas A. (Thomas Alva)",1847.0,1931.0,['en'],21,"{'Nursery rhymes, American'}",Sound
168,PG10152,Voice Trial - Kinetophone actor audition,"Lett, Bob",,,['en'],4,{'Auditions'},Sound
169,PG10153,Voice Trial - Kinetophone Actor Audition,"Lenord, Frank",,,['en'],4,{'Auditions'},Sound
170,PG10154,Voice Trial - Kinetophone Actor Audition,"Schultz, Siegfried Von",,,['en'],0,{'Auditions'},Sound
171,PG10155,The Right of the People to Rule,"Roosevelt, Theodore",1858.0,1919.0,['en'],9,"{'Progressivism (United States politics)', 'Po...",Sound
...,...,...,...,...,...,...,...,...,...
70159,PG9740,Tom Tiddler's Ground,"Dickens, Charles",1812.0,1870.0,['en'],6,{'English fiction'},Sound
70160,PG9741,The Uncommercial Traveller,"Dickens, Charles",1812.0,1870.0,['en'],6,{'England -- Social life and customs -- 19th c...,Sound
70161,PG9742,The Wreck of the Golden Mary,"Dickens, Charles",1812.0,1870.0,['en'],3,"{'Sea stories', 'Shipwrecks -- Fiction', 'Gold...",Sound
70162,PG9743,Sketches of Young Couples,"Dickens, Charles",1812.0,1870.0,['en'],3,"{'Couples -- England', 'England -- Social life...",Sound


Next up, datasets. The vast majority of them are genomes. There are 10 calculations of square roots and 1/pi to a million digits, and 'Moby Word Lists' is just information about Project Gutenberg, disclaimers, etc.
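
Grouping with `count()` returns one non-null count per column, so author rows with missing birth or death years show zeros there. If only the number of entries per author is needed, `size()` gives a single number. A sketch on a made-up frame, with placeholder author names:

```python
import pandas as pd

# Made-up rows standing in for the not-downloaded 'Dataset' entries
datasets = pd.DataFrame({
    'author': ['Author A', 'Author A', 'Author B'],
    'title': ['x', 'y', 'z'],
    'authoryearofbirth': [None, None, 1950.0],
})

# One number per author, largest group first
per_author = datasets.groupby('author').size().sort_values(ascending=False)
print(per_author)
```

Note that `groupby` drops rows with a missing author by default, so entries without an author would not appear in either version.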

In [13]:
library.loc[(~library['id'].isin(books_list)) & (library['type'] == 'Dataset')].groupby('author').count()

author,id,title,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type
"Bonnell, Jerry T.",2,2,0,0,2,2,2,2
"De Forest, Norman L.",1,1,0,0,1,1,1,1
Human Genome Project,72,72,0,0,72,72,72,72
"Kanada, Yasumasa",1,1,1,1,1,1,1,1
"Kerr, Stan",1,1,0,0,1,1,1,1
"Nemiroff, Robert J.",5,5,0,0,5,5,5,5
"Ward, Grady",1,1,1,0,1,1,1,1


On to the images! Image contains sheet music. MovingImage contains a video of comets, a rotating Earth and 5 nuclear test videos. StillImage contains an illustrated children's story and two maps or map images.
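
The same "not downloaded" filter is repeated in every cell. One way to avoid that, sketched here on a toy table with made-up titles, is to compute the filter once and then slice by type:

```python
import pandas as pd

# Toy stand-ins for the real metadata and downloaded ids (made-up values)
library = pd.DataFrame({
    'id': ['PG1', 'PG2', 'PG3', 'PG4'],
    'title': ['A score', 'A map', 'A novel', 'A film'],
    'type': ['Image', 'StillImage', None, 'MovingImage'],
})
books_list = ['PG3']

# Rows without a downloaded file, computed once
missing = library[~library['id'].isin(books_list)]

for kind in ['Image', 'MovingImage', 'StillImage']:
    subset = missing[missing['type'] == kind]
    print(kind, subset['title'].tolist())

# Or collect all titles per type in one pass
titles_by_type = missing.groupby('type')['title'].apply(list).to_dict()
print(titles_by_type)
```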

In [14]:
library.loc[(~library['id'].isin(books_list)) & (library['type'] == 'Image')]

Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type
1108,PG11001,String Quartet No. 05 in A major Opus 18,"Beethoven, Ludwig van",1770.0,1827.0,['en'],5,"{'Music', 'String quartets -- Scores'}",Image
1109,PG11002,"String Quartet No. 11 in F minor Opus 95 ""Seri...","Beethoven, Ludwig van",1770.0,1827.0,['en'],6,"{'String quartets -- Scores', 'Music'}",Image
1944,PG11755,String Quartet No. 10 in E flat major Opus 74 ...,"Beethoven, Ludwig van",1770.0,1827.0,['en'],15,"{'Music', 'String quartets -- Scores'}",Image
2381,PG12149,String Quartet No. 03 in D major Opus 18,"Beethoven, Ludwig van",1770.0,1827.0,['en'],15,"{'String quartets -- Scores', 'Music'}",Image
2479,PG12237,String Quartet No. 16 in F major Opus 135,"Beethoven, Ludwig van",1770.0,1827.0,['en'],21,"{'Music', 'String quartets -- Scores'}",Image
2986,PG12695,String Quartet No. 04 in C minor Opus 18,"Beethoven, Ludwig van",1770.0,1827.0,['en'],11,"{'Music', 'String quartets -- Scores'}",Image
3412,PG13078,String Quartet No. 12 in E flat major Opus 127,"Beethoven, Ludwig van",1770.0,1827.0,['en'],8,"{'String quartets -- Scores', 'Music'}",Image
3413,PG13079,String Quartet No. 14 in C-sharp minor Opus 131,"Beethoven, Ludwig van",1770.0,1827.0,['en'],14,"{'String quartets -- Scores', 'Music'}",Image
3495,PG13153,String Quartet No. 15 in A minor Opus 132,"Beethoven, Ludwig van",1770.0,1827.0,['en'],36,"{'String quartets -- Scores', 'Music'}",Image
3850,PG13473,String Quartet No. 06 in B flat major Opus 18,"Beethoven, Ludwig van",1770.0,1827.0,['en'],7,"{'Music', 'String quartets -- Scores'}",Image


In [15]:
library.loc[(~library['id'].isin(books_list)) & (library['type'] == 'StillImage')]

Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type
1661,PG114,The Tenniel Illustrations for Carroll's Alice ...,"Tenniel, John",1820.0,1914.0,['en'],391,"{""Children's stories"", 'Fantasy fiction'}",StillImage
15515,PG239,Radar Map of the United States,United States,,,['en'],27,{'United States -- Maps'},StillImage
67797,PG758,"LandSat Picture of Washington, DC",United States. National Aeronautics and Space ...,,,['en'],36,{'Washington (D.C.) -- Remote-sensing images'},StillImage


And finally, Collection contains only 'Project Gutenberg DVD: The July 2006 Special', and the one Text entry that was not downloaded is just empty.

In [16]:
library.loc[(~library['id'].isin(books_list)) & (library['type'] == 'Collection')]

Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type
10150,PG19159,Project Gutenberg DVD: The July 2006 Special,,,,['en'],73,set(),Collection


In [17]:
library.loc[(~library['id'].isin(books_list)) & (library['type'] == 'Text')]

Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type
69464,PG90907,,,,,['en'],1,set(),Text


In [18]:
library.loc[library['author'].str.contains('Lovecraft', na=False)]

Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type
22880,PG30637,"Writings in the United Amateur, 1915-1922","Lovecraft, H. P. (Howard Phillips)",1890.0,1937.0,['en'],331,"{'Periodicals', 'Literature -- History and cri...",
23804,PG31469,The Shunned House,"Lovecraft, H. P. (Howard Phillips)",1890.0,1937.0,['en'],406,"{'Haunted houses -- Fiction', 'Horror tales, A...",
44538,PG50133,The Dunwich Horror,"Lovecraft, H. P. (Howard Phillips)",1890.0,1937.0,['en'],789,"{'American fiction -- 20th century', 'Fantasy ...",
64643,PG68236,The colour out of space,"Lovecraft, H. P. (Howard Phillips)",1890.0,1937.0,['en'],572,"{'Extraterrestrial beings -- Fiction', 'Horror...",
64695,PG68283,The call of Cthulhu,"Lovecraft, H. P. (Howard Phillips)",1890.0,1937.0,['en'],2045,"{'Cthulhu (Fictitious character) -- Fiction', ...",
64987,PG68547,He,"Lovecraft, H. P. (Howard Phillips)",1890.0,1937.0,['en'],187,"{'New York (N.Y.) -- Fiction', 'Horror tales',...",
64994,PG68553,The festival,"Lovecraft, H. P. (Howard Phillips)",1890.0,1937.0,['en'],247,"{'Horror tales', 'New England -- Fiction', 'Sh...",
67130,PG70478,The silver key,"Lovecraft, H.P.",,,['en'],0,set(),
67139,PG70486,The lurking fear,"Lovecraft, H. P. (Howard Phillips)",1890.0,1937.0,['en'],1169,"{'Horror tales', 'Catskill Mountains Region (N...",


## Finding English Books

In [20]:
library.loc[library['language'].str.find('en') > -1]

Unnamed: 0,id,title,author,authoryearofbirth,authoryearofdeath,language,downloads,subjects,type
0,PG10000,The Magna Carta,Anonymous,,,['en'],183,{'Constitutional history -- England -- Sources...,
1,PG10001,Apocolocyntosis,"Seneca, Lucius Annaeus",,65.0,['en'],400,"{'Claudius, Emperor of Rome, 10 B.C.-54 A.D. -...",
2,PG10002,The House on the Borderland,"Hodgson, William Hope",1877.0,1918.0,['en'],666,{'Science fiction'},
3,PG10003,"My First Years as a Frenchwoman, 1876-1879","Waddington, Mary King",1833.0,1923.0,['en'],43,"{'France -- Social life and customs', 'France ...",
4,PG10004,The Warriors,"Lindsay, Anna Robertson Brown",1864.0,1948.0,['en'],27,{'Christianity'},
...,...,...,...,...,...,...,...,...,...
70443,PG9997,"France and England in North America, Part III:...","Parkman, Francis",1823.0,1893.0,['en'],34,{'Canada -- History -- To 1763 (New France)'},
70444,PG9998,Poems,"Betham, Matilda",1776.0,1852.0,['en'],23,{'Poetry'},
70445,PG9999,"Harriet, the Moses of Her People","Bradford, Sarah H. (Sarah Hopkins)",1818.0,1912.0,['en'],103,"{'Slaves -- United States -- Biography', 'Afri...",
70447,PG99,Collected Articles of Frederick Douglass,"Douglass, Frederick",1818.0,1895.0,['en'],170,"{'Reconstruction (U.S. history, 1865-1877)', '...",
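
The substring test above keeps any row whose language string contains `en` anywhere, so a code that merely includes those letters would also pass. A stricter sketch follows. It assumes the column holds stringified Python lists such as `"['en']"`, which is what the output suggests. The helper name `has_language` and the sample rows are made up.

```python
import ast

import pandas as pd

# Made-up rows; 'language' is assumed to be a stringified list of codes
library = pd.DataFrame({
    'id': ['PG1', 'PG2', 'PG3', 'PG4'],
    'language': ["['en']", "['enm']", "['fr', 'en']", "['de']"],
})


def has_language(cell, code='en'):
    """True if `code` is one of the language codes in a stringified list."""
    if not isinstance(cell, str):  # treat missing values as no match
        return False
    return code in ast.literal_eval(cell)


substring = library['language'].str.find('en') > -1  # the test used above
exact = library['language'].apply(has_language)       # exact code match

print(library[exact])
```

Here the substring test also keeps the `['enm']` row, while the exact match does not. Whether the real metadata contains such codes would need checking.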
