# Merge non-US canon

We have created a list of 52 volumes not by American authors that are reprinted by W. W. Norton. To extend this up to 1980, we have in a few cases selected novels by authors whose short fiction is included in the *Norton Anthology of English Literature*.

However, not all of these volumes are already present in our sample. So we probably need to add them and run another model.

*This notebook creates the third (and, let's hope, final) version of the project metadata.*

In [72]:
import pandas as pd
import numpy as np

#### the non-US canon

In [27]:
nonus = pd.read_csv('canon/nonusnorton.tsv', sep = '\t', index_col = 'docid')
nonus.head()

docid,author,title,firstpub
uc2.ark+=13960=t24b3010m,"Conrad, Joseph",The secret Agent,1907
mdp.39015032425400,"Conrad, Joseph",Heart of Darkness,1899
mdp.39015053037415,"Dickens, Charles",Hard Times,1854
uc1.b4713109,"Brontë, Charlotte",Jane Eyre,1847
mdp.39015039284107,"Brontë, Charlotte",Jane Eyre,1847


#### metadata documenting hypotheses

In [78]:
hypmeta = pd.read_csv('../supplement2/second_supplement_deduped.tsv', sep = '\t', index_col = 'docid')
hypmeta.head()

docid,allcopiesofwork,author,bestseller,contrast4reviewed,copiesin25yrs,earlyedition,firstpub,gender,heath,imprint,...,lastname,latestcomp,mostdiscussed,nationality,norton,nortonshort,preregistered,recordid,reviewed,title
uc1.b4095385,2.0,"Keyes, Frances Parkinson",True,False,2.0,True,,,False,"New York|Julian Messner, Inc.|c1950.",...,Keyes,1950,False,,False,False,False,589151.0,False,Joy Street
uiuo.ark+=13960=t6vx10m4w,2.0,"Solly, Henry",False,True,2.0,True,1874.0,m,False,London;Chapman and Hall;1874.,...,Solly,1874,False,uk,False,False,False,8719961.0,False,Gerald and his friend the doctor
nyp.33433076045701,,"Michelson, Miriam,",True,False,,,1904.0,,False,Indianapolis;The Bobbs-Merrill Company;1904,...,Michelson,1904,False,,False,False,False,5974479.0,False,In the bishop's carriage
uc2.ark+=13960=t6736nr4k,,"Hawthorne, Julian",False,True,,,1880.0,m,False,New York;D. Appleton;1880.,...,Hawthorne,1880,False,us,False,False,False,7661915.0,False,Garth: a novel
mdp.39015054194546,4.0,"Ferber, Edna",True,False,3.0,True,,,False,"Garden City, N. Y.|Doubleday|c1926.",...,Ferber,1926,False,,False,False,False,389554.0,False,Show boat


In [36]:
allmeta = pd.read_csv('../supplement2/supp2nationalitymeta.tsv', sep = '\t', index_col = 'docid')
allmeta.head()

docid,author,inferreddate,recordid,latestcomp,allcopiesofwork,earlyedition,copiesin25yrs,imprint,lastname,title,nationality
uc2.ark+=13960=t5n877988,"Aytoun, William Edmondstoune",1861,7685423.0,1861,2.0,True,2.0,Edinburgh;London;William Blackwood and Sons;1861.,Aytoun,Norman Sinclair,uk
uc2.ark+=13960=t1pg1hz3p,"Winfield, Arthur M",1916,7677354.0,1916,1.0,True,1.0,New York;Grosset & Dunlap;c1916.,Winfield,The Rover boys on a tour,uk
uc1.b4369662,,1980,652471.0,1980,1.0,True,1.0,Boston|Houghton Mifflin|1980.,anonymous,Seeds of corruption / | $c: Sabri Moussa ; tra...,guess: us
mdp.39015037418947,"Doig, Ivan",1996,3059361.0,1996,1.0,True,1.0,New York|Simon & Schuster|c1996.,Doig,Bucking the sun : | a novel / | $c: Ivan Doig.,us
uc1.32106010927223,"Tan, Amy",1989,1541408.0,1989,2.0,True,2.0,New York|Putnam|c1989.,Tan,The Joy Luck Club,us


### Initial analysis of the situation

How many of the non-us vols are already in our hypothesis list? How many are in the larger list?

In [30]:
print('Total number of non-us vols: ', len(set(nonus.index)))
hypoverlap = set(nonus.index).intersection(set(hypmeta.index))
print("In the hypothesis list: ", len(hypoverlap))

Total number of non-us vols:  55
In the hypothesis list:  13


In [29]:
alloverlap = set(nonus.index).intersection(set(allmeta.index))
print("In the larger list: ", len(alloverlap))

In the larger list:  25


In [32]:
print(len(alloverlap.intersection(hypoverlap)))

13
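
The same overlap questions can also be asked with boolean masks on the index, which keeps the answers aligned with the rows of `nonus`. This is only a sketch on toy data: the frames `nonus_toy`, `hyp_toy` and `all_toy` and their docids are made up for illustration. Note that masks count rows, while the set-based cells above count unique docids, so the two can differ if an index contains duplicates.

```python
import pandas as pd

# Toy stand-ins for nonus, hypmeta and allmeta; the docids are invented.
nonus_toy = pd.DataFrame({'title': ['A', 'B', 'C', 'D']},
                         index=pd.Index(['v1', 'v2', 'v3', 'v4'], name='docid'))
hyp_toy = pd.DataFrame({'title': ['A', 'X']},
                       index=pd.Index(['v1', 'v9'], name='docid'))
all_toy = pd.DataFrame({'title': ['A', 'B', 'X']},
                       index=pd.Index(['v1', 'v2', 'v9'], name='docid'))

# One boolean mask per reference list, aligned with the rows of nonus_toy.
in_hyp = nonus_toy.index.isin(hyp_toy.index)
in_all = nonus_toy.index.isin(all_toy.index)

print('In the hypothesis list:', in_hyp.sum())
print('In the larger list:', in_all.sum())
print('In both:', (in_hyp & in_all).sum())

# Volumes that appear in neither list are the ones that need to be fetched.
neither = list(nonus_toy.index[~in_hyp & ~in_all])
print('In neither:', neither)
```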


#### what about problems in hypmeta itself

Some vols in hypmeta aren't in allmeta.

In [52]:
print(hypmeta.shape)
largeoverlap = set(allmeta.index).intersection(set(hypmeta.index))
print(len(largeoverlap))

(2016, 21)
2016


In [43]:
abouttolose = set(hypmeta.index) - set(allmeta.index)
# recent pandas versions reject a set as an indexer, so convert it to a list
hypmeta.loc[list(abouttolose), : ]

docid,allcopiesofwork,author,bestseller,contrast4reviewed,copiesin25yrs,earlyedition,firstpub,gender,heath,imprint,...,lastname,latestcomp,mostdiscussed,nationality,norton,nortonshort,preregistered,recordid,reviewed,title
uc1.b250586,,"Sinclair, Catherine",False,False,,,1852.0,f,False,London;R. Bentley;1852.,...,Sinclair,1852,False,uk,False,False,False,6502576.0,True,"Beatrice, or, The unknown relatives"
mdp.39015031600409,1.0,"Chandler, Raymond",False,False,1.0,True,1939.0,m,False,New York|A. A. Knopf|1945,...,Chandler,1939,False,us,False,False,True,624157.0,False,The big sleep
mdp.39015004063601,11.0,"Steinbeck, John",True,False,6.0,True,1939.0,m,True,New York|The Viking press|c1939,...,Steinbeck,1939,False,us,False,False,True,1029460.0,False,The grapes of wrath
uc1.b250585,,"Sinclair, Catherine",False,False,,,1852.0,f,False,London;R. Bentley;1852.,...,Sinclair,1852,False,uk,False,False,False,6502576.0,True,"Beatrice, or, The unknown relatives"
uc1.b242928,,"Glasgow, Ellen Anderson Gholson",False,False,,,1898.0,f,False,New York;London;Harper & Brothers;1898.,...,Glasgow,1898,False,us,False,False,False,432668.0,True,Phases of an inferior planet
nyp.33433081852711,,"James, Henry",False,False,,,1881.0,m,False,New York;Modern Library;c1909,...,James,1880,True,us,False,False,True,9038144.0,True,The portrait of a lady
uc1.b249782,,"Davis, Richard Harding,",True,False,,,1895.0,,False,New York;Harper & brothers;1895.,...,Davis,1895,False,,False,False,False,484766.0,False,The Princess Aline
uc1.b248359,,Cousin Carrie,False,True,,,1864.0,,False,New York;London;D. Appleton;1864.,...,Cousin,1864,False,us,False,False,False,6501301.0,False,Keep a good heart
uc1.b250587,,"Sinclair, Catherine",False,False,,,1852.0,f,False,London;R. Bentley;1852.,...,Sinclair,1852,False,uk,False,False,False,6502576.0,True,"Beatrice, or, The unknown relatives"


That requires some special thinking. I happen to know that the `uc1.b` volumes are Hathi-level problems. But the other three are volumes I've been able to download, and I don't know why they aren't in allmeta. Let's check a second time.

In [77]:
keepthese = ['mdp.39015031600409', 'nyp.33433081852711', 'mdp.39015004063601']
print('Already in allmeta: ', len(set(keepthese).intersection(set(allmeta.index))))

Already in allmeta:  0


## Let's build an addition to allmeta

In [48]:
workmeta = pd.read_csv('../../noveltmmeta/workmeta.tsv', sep = '\t', index_col = 'docid', low_memory = False)

In [49]:
intersectcolumns = set(workmeta.columns).intersection(set(allmeta.columns))
intersectcolumns

{'allcopiesofwork',
 'author',
 'copiesin25yrs',
 'earlyedition',
 'imprint',
 'inferreddate',
 'latestcomp',
 'recordid',
 'title'}

In [50]:
set(allmeta.columns) - set(workmeta.columns)

{'lastname', 'nationality'}

In [60]:
missingvols = set(nonus.index) - set(allmeta.index)
missingvols = list(missingvols) + keepthese
print(len(missingvols))

33
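
Before selecting these rows with `.loc`, it may be worth confirming that every docid is actually present in the source table, since `.loc` raises a `KeyError` when a label in the list is missing. A minimal sketch, assuming toy data: `work_toy` and the docids in `wanted` are invented here and do not come from `workmeta`.

```python
import pandas as pd

# Toy stand-in for workmeta; docids and rows are invented for illustration.
work_toy = pd.DataFrame(
    {'author': ['Eliot, George', 'Achebe, Chinua'],
     'title': ['Middlemarch', 'Things fall apart'],
     'imprint': ['Edinburgh', 'Greenwich, Conn.']},
    index=pd.Index(['v1', 'v2'], name='docid'))

wanted = pd.Index(['v1', 'v2', 'v7'])

# Split the wanted docids into those we can select and those we cannot.
absent = wanted.difference(work_toy.index)
present = wanted.intersection(work_toy.index)
print('Missing from the source table:', list(absent))

# Selecting only the labels that exist avoids the KeyError.
subset = work_toy.loc[present, ['author', 'title']]
print(subset.shape)
```

If `absent` is non-empty, those volumes would need metadata from somewhere else before they can be added.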


In [61]:
# intersectcolumns is a set; recent pandas versions need a list as the column indexer
newallmeta = workmeta.loc[missingvols, list(intersectcolumns)]
newallmeta.head()

docid,copiesin25yrs,earlyedition,inferreddate,recordid,author,title,latestcomp,imprint,allcopiesofwork
mdp.39015039284107,11,False,1890,281476,"Brontë, Charlotte",Jane Eyre. | $c: By Charlotte Brontë ...,1855,New York;T. Y. Crowell and co.;c1890,29
pst.000029721290,1,True,1965,12269163,"Forster, E. M. (Edward Morgan)",Howards End / | $c: by E. M. Forster.,1965,Harmondsworth|Penguin Books|1965].,1
inu.39000003999138,1,False,1980,9399019,"Austen, Jane",Emma / | Jane Austen edited by Mary Lascelles ...,1817,London|Dent|New York|Dutton|1980,1
uiuo.ark+=13960=t8hd8j356,16,True,1871,8721591,"Eliot, George",Middlemarch: a study of provincial life,1871,Edinburgh;W. Blackwood;1871-72.,39
mdp.39015049725784,2,True,1959,1419817,"Achebe, Chinua",Things fall apart.,1959,"Greenwich, Conn.|Fawcett Pub.|c1959",5


In [64]:
newallmeta = newallmeta.assign(nationality = 'nonus')
for k in keepthese:
    newallmeta.loc[k, 'nationality'] = 'us'

In [65]:
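def get_last_name(aname):
    # Author strings look like 'Last, First', so the first whitespace-delimited
    # token, stripped of punctuation, serves as the last name.
    if pd.isnull(aname) or len(aname) < 2:
        return 'anonymous'
    else:
        names = aname.split()
        lastname = names[0].strip('()[],.')
        return lastname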

newallmeta = newallmeta.assign(lastname = newallmeta.author.map(get_last_name))

In [66]:
newallmeta = pd.concat([allmeta, newallmeta])
newallmeta.shape

(39850, 11)
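
When the two frames hold the same columns in a different order, `pd.concat` takes the union of the column labels, and the `sort` argument decides whether that union is sorted. The sketch below uses invented toy frames (`old` and `new`) to show that passing `sort=False` keeps the column order of the first frame; it is an illustration, not a rerun of the cell above.

```python
import pandas as pd

# Two toy frames with the same columns in a different order (invented data).
old = pd.DataFrame({'author': ['Tan, Amy'], 'nationality': ['us'],
                    'lastname': ['Tan']}, index=['v1'])
new = pd.DataFrame({'lastname': ['Eliot'], 'author': ['Eliot, George'],
                    'nationality': ['nonus']}, index=['v2'])

# sort=False keeps the first frame's column order instead of sorting the labels.
combined = pd.concat([old, new], sort=False)
print(list(combined.columns))
print(combined.shape)
```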

#### fixing a significant bug in the second supplement

A rather consequential detail here: I notice that the anonymization routine last time around failed to distinguish different anonymous authors!

In [67]:
print(len(set(newallmeta.lastname)))
ctr = 0
for idx in newallmeta.index:
    if newallmeta.loc[idx, 'lastname'].startswith('anonymou'):
        newallmeta.loc[idx, 'lastname'] = 'anonymous' + str(ctr)
        ctr += 1
print(len(set(newallmeta.lastname)))

10298
13228
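
The row-by-row loop works, but the same renumbering can be written with a boolean mask, which also avoids trouble if the index ever contains duplicate docids. This is a sketch on a toy frame (`meta_toy` is invented), assuming that anonymous rows are exactly those whose `lastname` starts with `'anonymou'`.

```python
import pandas as pd

# Toy frame; only the lastname column matters for this illustration.
meta_toy = pd.DataFrame(
    {'lastname': ['Tan', 'anonymous', 'Doig', 'anonymous', 'anonymous']},
    index=['v1', 'v2', 'v3', 'v4', 'v5'])

# na=False treats missing names as "not anonymous" rather than as NaN in the mask.
is_anon = meta_toy['lastname'].str.startswith('anonymou', na=False)
n_anon = int(is_anon.sum())

# Give every anonymous row its own numbered label, in row order.
meta_toy.loc[is_anon, 'lastname'] = ['anonymous' + str(i) for i in range(n_anon)]

print(meta_toy['lastname'].tolist())
print(meta_toy['lastname'].nunique())
```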


#### add birthdate

In [70]:
print(newallmeta.shape)
birthmeta = newallmeta.join(workmeta.authordate, how = 'left')
birthmeta.shape

(39850, 11)


(39850, 12)

In [74]:
def get_birth(authdate):
    # Extract a birth year from strings like '1819-1880', '1952-' or 'b. 1850'.
    if pd.isnull(authdate) or len(authdate) < 5:
        return np.nan
    elif (len(authdate) == 5 and authdate.endswith('-')) or len(authdate) > 8:
        # in both cases the year should be the first four characters
        try:
            return int(authdate[0:4])
        except ValueError:
            return np.nan
    elif authdate.startswith('b.') and authdate[-4:].isnumeric():
        return int(authdate[-4:])
    else:
        return np.nan

birthmeta = birthmeta.assign(birth = birthmeta.authordate.map(get_birth))
birthmeta.shape

(39850, 13)

In [75]:
sum(~pd.isnull(birthmeta.birth))

23751
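
One way to sanity-check the birth-year parsing is to try a stricter, regex-based variant on a handful of date strings. The function `birth_year` and the sample strings below are hypothetical. It is not the function used above and would not necessarily reproduce the count of 23751, because it only accepts a four-digit year followed by a hyphen, or a `b.` prefix followed by a year.

```python
import re

import numpy as np
import pandas as pd

LEADING_YEAR = re.compile(r'^(\d{4})-')     # e.g. '1819-1880' or '1952-'
BORN_YEAR = re.compile(r'^b\.\s*(\d{4})$')  # e.g. 'b. 1850'


def birth_year(authdate):
    """Return the birth year as an int, or NaN when no pattern matches."""
    if pd.isnull(authdate):
        return np.nan
    text = str(authdate).strip()
    match = LEADING_YEAR.match(text) or BORN_YEAR.match(text)
    return int(match.group(1)) if match else np.nan


# Invented examples of the kinds of strings an author-date field might hold.
samples = pd.Series(['1819-1880', '1952-', 'b. 1850', 'fl. 1864', None])
births = samples.map(birth_year)
print(births.tolist())
```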

In [76]:
birthmeta.to_csv('../supplement3/third_all_meta.tsv', sep = '\t', index_label = 'docid')

## Let's build an addition to hypmeta

In [79]:
intersectcolumns = set(workmeta.columns).intersection(set(hypmeta.columns))
intersectcolumns

{'allcopiesofwork',
 'author',
 'copiesin25yrs',
 'earlyedition',
 'imprint',
 'inferreddate',
 'latestcomp',
 'recordid',
 'title'}

In [80]:
set(hypmeta.columns) - set(workmeta.columns)

{'bestseller',
 'contrast4reviewed',
 'firstpub',
 'gender',
 'heath',
 'lastname',
 'mostdiscussed',
 'nationality',
 'norton',
 'nortonshort',
 'preregistered',
 'reviewed'}

In [81]:
abouttolose = set(hypmeta.index) - set(allmeta.index)
print(abouttolose)
for k in keepthese:
    abouttolose.remove(k)
abouttolose

{'uc1.b250586', 'mdp.39015031600409', 'mdp.39015004063601', 'uc1.b250585', 'uc1.b242928', 'nyp.33433081852711', 'uc1.b249782', 'uc1.b248359', 'uc1.b250587'}


{'uc1.b242928',
 'uc1.b248359',
 'uc1.b249782',
 'uc1.b250585',
 'uc1.b250586',
 'uc1.b250587'}

In [82]:
keep = set(hypmeta.index) - abouttolose
hypmeta = hypmeta.loc[list(keep), : ]
hypmeta.shape

(2019, 21)

In [83]:
print('Total number of non-us vols: ', len(set(nonus.index)))
hypoverlap = set(nonus.index).intersection(set(hypmeta.index))
print("In the hypothesis list: ", len(hypoverlap))

Total number of non-us vols:  55
In the hypothesis list:  13


In [84]:
hypmeta = hypmeta.assign(nonusnorton = False)
for h in hypoverlap:
    hypmeta.loc[h, 'nonusnorton'] = True
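
A flag column like this can also be built in one step with `Index.isin`. The sketch below uses an invented frame `hyp_toy` and an invented ID list `canon_ids`. One difference worth knowing: assigning through `.loc` with a label that is absent from the index would add a new row, whereas `isin` simply ignores IDs that are not present.

```python
import pandas as pd

# Toy stand-in for hypmeta; docids are invented.
hyp_toy = pd.DataFrame({'title': ['A', 'B', 'C']},
                       index=pd.Index(['v1', 'v2', 'v3'], name='docid'))

# Hypothetical canon list; 'v8' is deliberately not in hyp_toy.
canon_ids = ['v2', 'v8']

# True where the row's docid is in the canon list, False everywhere else.
hyp_toy = hyp_toy.assign(nonusnorton=hyp_toy.index.isin(canon_ids))
print(hyp_toy['nonusnorton'].tolist())
```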

In [85]:
needed2add = set(nonus.index) - set(hypmeta.index)
print(len(needed2add))

42


In [86]:
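# sets are not accepted as indexers in recent pandas versions, so convert both to lists
addition = workmeta.loc[list(needed2add), list(intersectcolumns)]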
addition.head()

docid,copiesin25yrs,earlyedition,recordid,inferreddate,author,title,latestcomp,imprint,allcopiesofwork
mdp.39015039284107,11,False,281476,1890,"Brontë, Charlotte",Jane Eyre. | $c: By Charlotte Brontë ...,1855,New York;T. Y. Crowell and co.;c1890,29
loc.ark+=13960=t75t4hz8p,2,True,11205329,1895,"Dickens, Charles",Great expectations and Hard times,1870,New York and London;Macmillan and co.;1895.,3
mdp.39015020732668,6,True,137557,1981,"Rushdie, Salman",Midnight's children : | a novel / | $c: by Sal...,1980,"New York|Knopf|1981, c1980.",6
pst.000029721290,1,True,12269163,1965,"Forster, E. M. (Edward Morgan)",Howards End / | $c: by E. M. Forster.,1965,Harmondsworth|Penguin Books|1965].,1
mdp.39076006766344,4,True,9916985,1954,"Woolf, Virginia",Jacob's room.,1941,London|Hogarth Press|1954.,5


In [88]:
booleans = {'bestseller', 'contrast4reviewed', 'heath', 'mostdiscussed', 'norton', 'nortonshort', 
            'preregistered', 'reviewed'}
for b in booleans:
    addition[b] = False
addition['nonusnorton'] = True
addition.head()

docid,copiesin25yrs,earlyedition,recordid,inferreddate,author,title,latestcomp,imprint,allcopiesofwork,heath,norton,contrast4reviewed,nonusnorton,bestseller,nortonshort,mostdiscussed,reviewed,preregistered
mdp.39015039284107,11,False,281476,1890,"Brontë, Charlotte",Jane Eyre. | $c: By Charlotte Brontë ...,1855,New York;T. Y. Crowell and co.;c1890,29,False,False,False,True,False,False,False,False,False
loc.ark+=13960=t75t4hz8p,2,True,11205329,1895,"Dickens, Charles",Great expectations and Hard times,1870,New York and London;Macmillan and co.;1895.,3,False,False,False,True,False,False,False,False,False
mdp.39015020732668,6,True,137557,1981,"Rushdie, Salman",Midnight's children : | a novel / | $c: by Sal...,1980,"New York|Knopf|1981, c1980.",6,False,False,False,True,False,False,False,False,False
pst.000029721290,1,True,12269163,1965,"Forster, E. M. (Edward Morgan)",Howards End / | $c: by E. M. Forster.,1965,Harmondsworth|Penguin Books|1965].,1,False,False,False,True,False,False,False,False,False
mdp.39076006766344,4,True,9916985,1954,"Woolf, Virginia",Jacob's room.,1941,London|Hogarth Press|1954.,5,False,False,False,True,False,False,False,False,False


In [89]:
set(hypmeta.columns) - set(addition.columns)

{'firstpub', 'gender', 'lastname', 'nationality'}

In [91]:
addition['firstpub'] = np.nan
for idx in addition.index:
    addition.loc[idx, 'firstpub'] = nonus.loc[idx, 'firstpub']
addition.head()

docid,copiesin25yrs,earlyedition,recordid,inferreddate,author,title,latestcomp,imprint,allcopiesofwork,heath,norton,contrast4reviewed,nonusnorton,bestseller,nortonshort,mostdiscussed,reviewed,preregistered,firstpub
mdp.39015039284107,11,False,281476,1890,"Brontë, Charlotte",Jane Eyre. | $c: By Charlotte Brontë ...,1855,New York;T. Y. Crowell and co.;c1890,29,False,False,False,True,False,False,False,False,False,1847.0
loc.ark+=13960=t75t4hz8p,2,True,11205329,1895,"Dickens, Charles",Great expectations and Hard times,1870,New York and London;Macmillan and co.;1895.,3,False,False,False,True,False,False,False,False,False,1861.0
mdp.39015020732668,6,True,137557,1981,"Rushdie, Salman",Midnight's children : | a novel / | $c: by Sal...,1980,"New York|Knopf|1981, c1980.",6,False,False,False,True,False,False,False,False,False,1981.0
pst.000029721290,1,True,12269163,1965,"Forster, E. M. (Edward Morgan)",Howards End / | $c: by E. M. Forster.,1965,Harmondsworth|Penguin Books|1965].,1,False,False,False,True,False,False,False,False,False,1910.0
mdp.39076006766344,4,True,9916985,1954,"Woolf, Virginia",Jacob's room.,1941,London|Hogarth Press|1954.,5,False,False,False,True,False,False,False,False,False,1922.0
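
Copying a column from one frame to another by docid can rely on pandas index alignment rather than an explicit loop. A minimal sketch with invented frames `nonus_toy` and `addition_toy`. It assumes the source index has no duplicate docids, since alignment fails on a duplicated index.

```python
import pandas as pd

# Toy source of first-publication dates, keyed by invented docids.
nonus_toy = pd.DataFrame({'firstpub': [1847, 1910, 1922]},
                         index=pd.Index(['v1', 'v2', 'v3'], name='docid'))

# Toy target frame whose rows are in a different order.
addition_toy = pd.DataFrame({'title': ['Howards End', 'Jane Eyre']},
                            index=pd.Index(['v2', 'v1'], name='docid'))

# Assigning a Series aligns on the index, so each row gets its own docid's value.
addition_toy['firstpub'] = nonus_toy['firstpub']
print(addition_toy)
```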


In [92]:
addition['gender'] = np.nan
addition['nationality'] = 'non-us'

In [93]:
addition = addition.assign(lastname = addition.author.map(get_last_name))

In [94]:
addition.shape

(42, 22)

In [95]:
hypmeta.shape

(2019, 22)

In [96]:
newhypmeta = pd.concat([hypmeta, addition])
newhypmeta.shape

(2061, 22)
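
Since this concatenation produces the file that later steps depend on, a few cheap checks before writing it out may be worthwhile. The sketch below uses invented frames `base` and `extra`. `verify_integrity=True` asks `pd.concat` to raise a `ValueError` if any docid would appear twice in the result.

```python
import pandas as pd

# Toy frames standing in for hypmeta and the addition; data are invented.
base = pd.DataFrame({'title': ['A', 'B'], 'nonusnorton': [False, True]},
                    index=pd.Index(['v1', 'v2'], name='docid'))
extra = pd.DataFrame({'nonusnorton': [True], 'title': ['C']},
                     index=pd.Index(['v3'], name='docid'))

# The two frames should describe the same set of columns.
assert set(base.columns) == set(extra.columns)

# verify_integrity=True raises ValueError if a docid occurs in both frames.
merged = pd.concat([base, extra], sort=False, verify_integrity=True)

# No rows gained or lost, and no column left partly empty.
assert len(merged) == len(base) + len(extra)
assert not merged.isnull().any().any()
print(merged.shape)
```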

In [97]:
newhypmeta.to_csv('../supplement3/third_hypothesis_meta.tsv', sep = '\t', index_label = 'docid')

In [99]:
toget = set(birthmeta.index) - set(allmeta.index)
print(len(toget))

33


In [100]:
with open('../gettexts/thirdsupplementIDs.txt', mode = 'w', encoding = 'utf-8') as f:
    for tg in toget:
        f.write(tg + '\n')
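
Because a set has no guaranteed order, writing the IDs in sorted order makes the output file reproducible from run to run, and reading it back confirms that nothing was lost. This sketch writes to a temporary directory rather than to `../gettexts/`, and the docids in `toget_toy` are invented.

```python
import os
import tempfile

# Invented docids standing in for the real set of volumes to fetch.
toget_toy = {'vol.c', 'vol.a', 'vol.b'}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'ids.txt')

    # Sorting gives a stable line order regardless of set iteration order.
    with open(path, mode='w', encoding='utf-8') as f:
        for docid in sorted(toget_toy):
            f.write(docid + '\n')

    # Read the file back, skipping any blank lines.
    with open(path, encoding='utf-8') as f:
        readback = [line.strip() for line in f if line.strip()]

print(readback)
```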