# Assemble metadata for supplement 3

This is going to be an annoyingly miscellaneous notebook, because this is a stage of the project where we're drawing on lots of different sources to slightly improve our metadata before the final interpretive push.

To start with, a separate project has been manually correcting metadata for the NovelTM fiction corpus. This work was done by Jessica Witte, Patrick Kimutis, and Ted Underwood.

### Use NovelTM metadata

Import the tables generated by separate editors, and concatenate them.

In [1]:
import pandas as pd
import random
from collections import Counter

In [2]:
jmc = pd.read_csv('../../meta2018/jessica.tsv', index_col = 'docid')

In [3]:
pmc = pd.read_csv('../../meta2018/patrick.tsv', index_col = 'docid')

In [4]:
tmc = pd.read_csv('../../meta2018/ted.tsv', sep = '\t', index_col = 'docid')

In [5]:
allmeta = pd.concat([tmc, jmc, pmc], sort = False)
allmeta.to_csv('../../meta2018/merged/mergeddirectsample.tsv', sep = '\t', index_label = 'docid')
allmeta.shape

(2730, 17)

### Manually edit the data and read it back in

There's a separate step here where Ted Underwood made some manual changes to the concatenated table, mostly involving the boundary separating "reprint" from "novel."

Then we read it back in.

In [6]:
edited = pd.read_csv('../../meta2018/merged/manuallyedited.tsv', sep = '\t', index_col = 'docid')

### Now read in the list of works we used for topic modeling

Ideally, we might have corrected metadata before generating the topic model. But in reality I think that makes a tiny difference. We're not actually going to filter *out* many works at all; we're mostly changing the dates.

In [7]:
third = pd.read_csv('third_all_meta.tsv', sep = '\t', index_col = 'docid')

In [8]:
third = third.assign(toremove = False)

I'm going to create a column to mark works I might remove, for generic reasons, if I do a fourth iteration of topic modeling. But these will  not actually be removed from the third model, because I don't want to get the corpus too far away from the model.

In [9]:
removenow = set()
dateschanged = 0

A smaller number of works will be removed now (because their date makes them impossible to compare on our timeline).

In [10]:
def make_corrections(existing, comparison):
    ''' This function uses a comparison table to identify changes to the
    "existing" table.
    '''
    
    global removenow, dateschanged
    
    for idx, row in existing.iterrows():
        if idx not in comparison.index:
            continue
        else:
            if pd.isnull(comparison.loc[idx, 'firstpub']):
                firstpub = 3000
            else:
                firstpub = int(comparison.loc[idx, 'firstpub'])
            
        if 'notfiction' in comparison.columns:
            nonfic = comparison.loc[idx, 'notfiction']
            if not pd.isnull(nonfic) and (nonfic == True or nonfic == 'TRUE'):
                toremove = True
            else:
                toremove = False
        elif 'category' in comparison.columns:
            cat = comparison.loc[idx, 'category']
            if not pd.isnull(cat) and (cat == 'nonfic' or cat == 'reprint' or cat == 'juvenile'):
                toremove = True
            else:
                toremove = False
        else:
            toremove = False
        
        if toremove:
            existing.loc[idx, 'toremove'] = True
        
        latestcomp = int(existing.loc[idx, 'latestcomp'])
        
        if firstpub < 1798:
            removenow.add(idx)
        elif firstpub < 1800:
            existing.loc[idx, 'latestcomp'] = 1800
            # we cheat a little around the margins to
            # avoid removing more than needed
            print(firstpub)
            dateschanged += 1
        elif firstpub < latestcomp:
            existing.loc[idx, 'latestcomp'] = firstpub
            dateschanged += 1
            

In [11]:
make_corrections(third, edited)

1799
1799
1799


In [12]:
dateschanged

157

In [13]:
print("Remove now:", len(removenow))
print("Remove later:", sum(third.toremove))

Remove now: 25
Remove later: 181


### Correcting suspiciously old authors

I also created a file of ```dates_to_correct.tsv``` where authors seemed suspiciously old. We can use this also to correct our third sample.

In [14]:
corrected = pd.read_csv('../meta/dates_to_correct.tsv', sep = '\t', index_col = 'docid')

In [15]:
make_corrections(third, corrected)
print('Total dates changed:', dateschanged)
print("Remove now:", len(removenow))
print("Remove later:", sum(third.toremove))

1799
Total dates changed: 280
Remove now: 66
Remove later: 212


### Selecting contrast sets

For each of the hypotheses we're checking, we need a contrast set that is comparable in terms of genre / accuracy of dating / nationality. Let's construct them.

First, let's construct a set of volumes that are accurately dated and exclude juvenile fiction and nonfiction.

In [16]:
contrast_source = edited[(edited.category == 'novel') | (edited.category == 'shortstories')]

In [17]:
contrast_source.shape

(2205, 17)

In [18]:
shared = contrast_source.loc[contrast_source.index.isin(third.index) , : ]
shared.shape

(1037, 17)

### Now load the hypotheses

We need the list of volumes in the positive set for each hypothesis.

In [19]:
hypo = pd.read_csv('third_hypothesis_meta.tsv', sep = '\t', index_col = 'docid')
hypo.shape

(2061, 22)

### Create isusa columns in all our data

This will make things simpler to check, since nationality is recorded in several different ways.

In [20]:
hypo = hypo.assign(isusa = 0)

changed = 0
isus = 0

for idx in hypo.index:
    if pd.isnull(hypo.loc[idx, 'nationality']):
        hypo.loc[idx, 'nationality'] = third.loc[idx, 'nationality']
        changed += 1
    
    nation = hypo.loc[idx, 'nationality']
    if nation.strip() =='us' or nation.strip() == "guess: us":
        hypo.loc[idx, 'isusa'] = 1
        isus += 1
        
print('Changed:', changed)
print('US:', isus)

Changed: 831
US: 1118


In [21]:
shared = shared.assign(isusa = 0)
isus = 0
changed = 0

for idx in shared.index:
    nation = shared.loc[idx, 'nationality']
    if pd.isnull(nation):
        shared.loc[idx, 'nationality'] = third.loc[idx, 'nationality']
        nation = shared.loc[idx, 'nationality']
        
    if nation.strip() =='us' or nation.strip() == "guess: us":
        shared.loc[idx, 'isusa'] = 1
        isus += 1
    
    firstpub = shared.loc[idx, 'firstpub']
    if not pd.isnull(firstpub):
        firstpub = int(firstpub)
        if firstpub < shared.loc[idx, 'latestcomp']:
            shared.loc[idx, 'latestcomp'] = firstpub
            changed += 1
        
print('Changed:', changed)        
print('US:', isus)

Changed: 156
US: 279


In [22]:
def whether_us(nation):
    if nation.strip() =='us' or nation.strip() == "guess: us":
        return 1
    else:
        return 0

third = third.assign(isusa = third.nationality.map(whether_us))
               
print('US:', sum(third.isusa))

US: 18148


In [52]:
print(third.shape)
third.drop(removenow, inplace = True)
print(third.shape)
third.to_csv('third_corrected_meta.tsv', sep = '\t', index_label = 'docid')

(39850, 15)
(39784, 15)


In [41]:
manualcheck = pd.read_csv('manualcheckus.tsv', sep = '\t')
manualcheck.drop_duplicates('docid', inplace = True)
manualcheck.shape

(194, 16)

In [43]:
manualcheck = manualcheck.set_index('docid')

In [44]:
for idx, row in manualcheck.iterrows():
    wehave = third.loc[idx, 'latestcomp']
    groundtruth = manualcheck.loc[idx, 'latestcomp']
    if groundtruth < wehave:
        print('correction')
        third.loc[idx, 'latestcomp'] = groundtruth

correction
correction
correction


In [26]:
manualcheck = manualcheck.assign(isusa = manualcheck.nationality.map(whether_us))
               
print('US:', sum(manualcheck.isusa))

US: 87


In [45]:
manualcheck.to_csv('manualcheckus.tsv', sep = '\t', index_label = 'docid')

In [68]:
hyposelected = dict()
contrast = dict()

In [69]:
print(hypo.shape)
print(shared.shape)
print(third.shape)

(2061, 23)
(1037, 18)
(39784, 36)


In [138]:
def find_matches(hypo, manually_checked, third, category, startdate, enddate):
    ''' Finds volumes in manually_checked that match the nationality and
    date distribution of volumes in category.
    '''
    global contrast, hyposelected
    
    subset = hypo.loc[(hypo[category] == True) &
                      (hypo.latestcomp >= startdate) &
                      (hypo.latestcomp < enddate), : ]
    
    contrast[(category, startdate)] = []
    hyposelected[(category, startdate)] = []
    
    print("Needed: ", subset.shape)
    
    # The contrast set is defined to exclude books
    # by authors in the positive set.
    hypo_authors = set(subset.author)
    
    available = manually_checked.loc[~manually_checked['author'].isin(hypo_authors), : ]
    
    print("Available: ", available.shape)
    already_chosen = set()
    
    # shuffle the subset
    subset = subset.sample(frac = 1)
    
    secondary = 0
    
    for idx, row in subset.iterrows():
        if len(already_chosen) > 140:
            break
            
        usflag = row['isusa']
        date = row['latestcomp']
        
        candidates = available.loc[(available.isusa == usflag) & 
                                   (available.latestcomp > (date - 4)) &
                                   (available.latestcomp < (date + 4)), : ]
        candidates = candidates.index.tolist()
        candidates = set(candidates) - already_chosen
        
        if len(candidates) < 1:
            failures[category] += 1
            thirdoptions = third.loc[(third.isusa == usflag) &
                                    (third.latestcomp > (date - 4)) &
                                    (third.latestcomp < (date + 4)), : ]
            thirdoptions = thirdoptions.loc[~thirdoptions['author'].isin(hypo_authors), : ]
            thirdoptions = thirdoptions.index.tolist()
            thirdoptions = set(thirdoptions) - already_chosen
            
            if len(thirdoptions) < 1:
                print("Unrecoverable error.")
            else:
                selection = random.sample(thirdoptions, 1)[0]
                contrast[(category, startdate)].append(selection)
                already_chosen.add(selection)
                hyposelected[(category, startdate)].append(idx)
                secondary += 1
                  
        else:
            selection = random.sample(candidates, 1)[0]
            contrast[(category, startdate)].append(selection)
            already_chosen.add(selection)
            hyposelected[(category, startdate)].append(idx)
            
    print("Found: ", len(contrast[(category, startdate)]))
    print("Secondary: ", secondary)
    
    return contrast[(category, startdate)]
    

In [71]:
selections = find_matches(hypo, shared, manualcheck, 'bestseller', 1821, 1900)

Needed:  (141, 23)
Available:  (1034, 18)
Unrecoverable error.
Unrecoverable error.
Found:  139
Secondary:  15


In [72]:
selections = find_matches(hypo, shared, manualcheck, 'bestseller', 1900, 1950)

Needed:  (414, 23)
Available:  (1024, 18)
Unrecoverable error.
Unrecoverable error.
Found:  141
Secondary:  33


In [73]:
selections = find_matches(hypo, shared, manualcheck, 'bestseller', 1950, 1990)

Needed:  (181, 23)
Available:  (1031, 18)
Found:  141
Secondary:  66


In [74]:
for k, v in hyposelected.items():
    print(k, len(v))

('bestseller', 1950) 141
('bestseller', 1900) 141
('bestseller', 1821) 139


In [75]:
selections = find_matches(hypo, shared, manualcheck, 'heath', 1800, 2011)

Needed:  (46, 23)
Available:  (1025, 18)
Found:  46
Secondary:  0


In [76]:
selections = find_matches(hypo, shared, manualcheck, 'norton', 1800, 2011)

Needed:  (54, 23)
Available:  (1028, 18)
Found:  54
Secondary:  1


In [81]:
selections = find_matches(hypo, shared, manualcheck, 'nortonshort', 1800, 2011)

Needed:  (5, 23)
Available:  (1037, 18)
Found:  5
Secondary:  0


In [82]:
sum(hypo.nortonshort)

5

In [78]:
selections = find_matches(hypo, shared, manualcheck, 'nonusnorton', 1800, 2011)

Needed:  (55, 23)
Available:  (1025, 18)
Found:  55
Secondary:  0


In [79]:
selections = find_matches(hypo, shared, manualcheck, 'preregistered', 1800, 2011)

Needed:  (20, 23)
Available:  (1028, 18)
Found:  20
Secondary:  0


In [80]:
selections = find_matches(hypo, shared, manualcheck, 'mostdiscussed', 1800, 2011)

Needed:  (29, 23)
Available:  (1032, 18)
Found:  29
Secondary:  0


In [63]:
hypo.columns

Index(['allcopiesofwork', 'author', 'bestseller', 'contrast4reviewed',
       'copiesin25yrs', 'earlyedition', 'firstpub', 'gender', 'heath',
       'imprint', 'inferreddate', 'lastname', 'latestcomp', 'mostdiscussed',
       'nationality', 'nonusnorton', 'norton', 'nortonshort', 'preregistered',
       'recordid', 'reviewed', 'title', 'isusa'],
      dtype='object')

In [54]:
third = third.assign(best1821_1900 = 0)
third = third.assign(best1900_1950 = 0)
third = third.assign(best1950_1990 = 0)
third = third.assign(anybest = 0)
third = third.assign(best1821_1900contrast = 0)
third = third.assign(best1900_1950contrast = 0)
third = third.assign(best1950_1990contrast = 0)
third = third.assign(best1900_1950 = 0)
third = third.assign(reviewed1850_1950 = 0)
third = third.assign(reviewed1850_1950contrast = 0)
third = third.assign(heath = 0)
third = third.assign(heathcontrast = 0)
third = third.assign(mostdiscussed = 0)
third = third.assign(mostdiscussedcontrast = 0)
third = third.assign(usnorton = 0)
third = third.assign(usnortoncontrast = 0)
third = third.assign(nonusnorton = 0)
third = third.assign(nonusnortoncontrast = 0)
third = third.assign(preregistered = 0)
third = third.assign(preregisteredcontrast = 0)
third = third.assign(reviewed1965_1990 = 0)
third = third.assign(reviewed1965_1990contrast = 0)
print(third.shape)
third.columns

(39784, 36)


Index(['allcopiesofwork', 'author', 'copiesin25yrs', 'earlyedition', 'imprint',
       'inferreddate', 'lastname', 'latestcomp', 'nationality', 'recordid',
       'title', 'authordate', 'birth', 'toremove', 'isusa', 'best1821_1900',
       'best1900_1950', 'best1950_1990', 'anybest', 'best1821_1900contrast',
       'best1900_1950contrast', 'best1950_1990contrast', 'reviewed1850_1950',
       'reviewed1850_1950contrast', 'heath', 'heathcontrast', 'mostdiscussed',
       'mostdiscussedcontrast', 'usnorton', 'usnortoncontrast', 'nonusnorton',
       'nonusnortoncontrast', 'preregistered', 'preregisteredcontrast',
       'reviewed1965_1990', 'reviewed1965_1990contrast'],
      dtype='object')

In [55]:
mostreviewedpost65 = pd.read_csv('../meta/canon/mostreviewedpost65.tsv', sep = '\t', index_col = 'docid')

In [85]:
category_dict = {'best1821_1900': ('bestseller', 1821),
 'best1900_1950': ('bestseller', 1900),
 'best1950_1990': ('bestseller', 1950),
 'best1900_1950contrast': ('bestseller', 1950),
 'best1950_1990contrast': ('bestseller', 1900),
 'best1821_1900contrast': ('bestseller', 1821),
 'heath': ('heath', 1800),
 'heathcontrast': ('heath', 1800),
 'mostdiscussed': ('mostdiscussed', 1800),
 'mostdiscussedcontrast': ('mostdiscussed', 1800),
 'usnorton': ('norton', 1800),
 'usnortoncontrast': ('norton', 1800),
 'nonusnorton': ('nonusnorton', 1800),
 'nonusnortoncontrast': ('nonusnorton', 1800),
 'preregistered': ('preregistered', 1800),
 'preregisteredcontrast': ('preregistered', 1800)}
 

In [87]:
for key, value in category_dict.items():
    ctr = 0
    
    if 'contrast' in key:
        volstoadd = contrast[value]
    else:
        volstoadd = hyposelected[value]
    
    for vol in volstoadd:
        third.loc[vol, key] = 1
        ctr +=1
    
    print(key, ctr)

mostdiscussed 29
best1950_1990contrast 141
best1950_1990 141
usnortoncontrast 54
mostdiscussedcontrast 29
best1821_1900contrast 139
nonusnorton 55
preregistered 20
heathcontrast 46
best1900_1950contrast 141
usnorton 54
best1821_1900 139
best1900_1950 141
preregisteredcontrast 20
heath 46
nonusnortoncontrast 55


In [148]:
for idx in mostreviewedpost65.index:
    third.loc[idx, 'reviewed1965_1990'] = 1

In [89]:
ctr = 0
for idx in hyposelected[('nortonshort', 1800)]:
    third.loc[idx, 'usnorton'] = 1
    ctr +=1
print(ctr)

5


In [90]:
ctr = 0
for idx in contrast[('nortonshort', 1800)]:
    third.loc[idx, 'usnortoncontrast'] = 1
    ctr +=1
print(ctr)

5


In [146]:
third = third.assign(reviewed1965_1990 = 0)
third = third.assign(reviewed1965_1990contrast = 0)
ctr = 0
for idx in hypo.loc[hypo.reviewed == True, : ].index:
    third.loc[idx, 'reviewed1850_1950'] = 1
    ctr += 1
print(ctr)

541


In [147]:
ctr = 0
for idx in hypo.loc[hypo.contrast4reviewed == True, : ].index:
    third.loc[idx, 'reviewed1850_1950contrast'] = 1
    ctr += 1
print(ctr)

603


In [93]:
ctr = 0
for idx in hypo.loc[hypo.bestseller == True, : ].index:
    third.loc[idx, 'anybest'] = 1
    ctr += 1
print(ctr)

814


In [104]:
for idx, row in corrected.iterrows():
    if idx not in third.index:
        continue
    else:
        groundtruth = row['authordate']
        existing = third.loc[idx, 'authordate']
        if pd.isnull(groundtruth):
            continue
        else:
            groundlen = len(str(groundtruth))
            
        if pd.isnull(existing):
            existlen = 0
        else:
            existlen = len(str(existing))
        
        if groundlen > existlen:
            print(groundtruth, existing)
            third.loc[idx, 'authordate'] = groundtruth

1770-1844. nan
1802-1887. nan
1770-1844. nan
1807-1876. -1824
1770-1844. nan
1930-. 1897-


In [106]:
for idx, row in edited.iterrows():
    if idx not in third.index:
        continue
    else:
        groundtruth = row['authordate']
        existing = third.loc[idx, 'authordate']
        if pd.isnull(groundtruth):
            continue
        else:
            groundlen = len(str(groundtruth))
            
        if pd.isnull(existing):
            existlen = 0
        else:
            existlen = len(str(existing))
        
        if groundlen > existlen:
            print(groundtruth, existing)
            third.loc[idx, 'authordate'] = groundtruth

1770-1844. nan
1899-1982. nan
1912- nan
1824-1906 nan
1822-1911 nan
1708-1778 nan
1798-1866 nan
1798-1857. nan
1892-1990 1892-
1922-1998. nan
1924-2008. 1924-
1748-1828. nan
1695-1758 nan
1758-1852 nan
1775-1842 nan
1757-1807 b. 1756.
1759-1805. nan
1759-1805. nan
1774-1815 nan
1770-1847 nan
1770-1847. nan
1770-1847. nan
f. 1827 nan
1763-1822. nan
1758-1852 nan
d. 1827 nan
1759-1857. nan
1775-1842. nan
1759-1857. nan
1759-1857. nan
d. 1849 nan
1779-1859 nan
1779-1846 nan
1779-1859. nan
1779-1859. nan
1758-1852. nan
1788-1869. d. 1869.
1790-1860. nan
1791-1854. nan
1779-1859 nan
1746-1830. nan
1787-1867 nan
1776-1865. nan
1775-1842. nan
1786-1820 nan
1785-1828. nan
1761-1826. d. 1818.
1772-1825. nan
1768-1842. nan
1784?-1854. d. 1854.
1800-1842 nan
1800-1842. nan
1798-1861. nan
1805-1844 nan
1806-1844 nan
1806-1844. nan
1800-1851. nan
1798-1861 nan
1804-1867 nan
1792-1838. nan
d. 1875 nan
1778-1845. d. 1845.
1806-1862. nan
1806-1862. nan
1801-1843. d. 1843.
1801-1843 d. 1843.
1798-1869 

In [107]:
third = third.assign(age = float('nan'))


In [112]:
ctr = 0

def calculate_age(row):
    global ctr
    if pd.isnull(row['authordate']):
        return float('nan')
    else:
        dates = str(row['authordate'].strip('.'))
        
    if dates.startswith('b.') or dates.endswith('-') or len(dates) > 7:
        try:
            birth = int(dates[0:4])
        except:
            birth = -1
    else:
        birth = -1
    
    compdate = int(row['latestcomp'])
    
    if birth > 0 and birth < compdate:
        age = compdate - birth
        if age > 100:
            return(float('nan'))
        else:
            ctr += 1
            return age
    else:
        return float('nan')

third = third.assign(age = third.apply(calculate_age, axis = 1))
print(ctr)
        

23826


In [113]:
third = third.assign(actualgender = float('nan'))

In [114]:
ctr = 0
for idx, row in edited.iterrows():
    if idx not in third.index:
        continue
    else:
        gender = row['gender']
        if not pd.isnull(gender):
            third.loc[idx, 'actualgender'] = row['gender']
            ctr += 1
print(ctr)

1130


In [115]:
ctr = 0
for idx, row in hypo.iterrows():
    if idx not in third.index:
        continue
    else:
        gender = row['gender']
        if not pd.isnull(gender):
            third.loc[idx, 'actualgender'] = row['gender']
            ctr += 1
print(ctr)

1202


In [116]:
third.shape

(39784, 38)

In [118]:
thirdtogender = third[['author', 'latestcomp', 'actualgender']]
thirdtogender.to_csv('third2gender.tsv', sep = '\t', index_label = 'docid')

In [125]:
gendered = pd.read_csv('thirdgendered.tsv', sep = '\t', index_col = 'docid')

In [120]:
third = third.assign(likelygender = float('nan'))

In [121]:
import numpy as np

In [133]:
right = 0
wrong = 0
ctr = 0

for idx, row in gendered.iterrows():
    ctr += 1
    if ctr % 1000 == 1:
        print(ctr)
        
    actual = third.loc[idx, 'actualgender']
    predicted = row.Gender.lower()
    if predicted == 'u':
        predicted = np.nan
        
    if pd.isnull(actual):
        third.loc[idx, 'likelygender'] = predicted
    else:
        if actual == predicted:
            right += 1
            third.loc[idx, 'likelygender'] = actual
        elif pd.isnull(predicted):
            third.loc[idx, 'likelygender'] = actual
            continue
        else:
            wrong += 1
            third.loc[idx, 'likelygender'] = actual

print("Right: ", right)
print("Wrong: ", wrong)       

1
1001
2001
3001
4001
5001
6001
7001
8001
9001
10001
11001
12001
13001
14001
15001
16001
17001
18001
19001
20001
21001
22001
23001
24001
25001
26001
27001
28001
29001
30001
31001
32001
33001
34001
35001
36001
37001
38001
39001
Right:  1406
Wrong:  79


That's not terrible accuracy, so likelygender column is probably okay to use.

In [128]:
ctr = 0

def calculate_birth(row):
    global ctr
    existingbirth = row['birth']
    
    if pd.isnull(row['authordate']):
        return existingbirth
    else:
        dates = str(row['authordate'].strip('.'))
        
    if dates.startswith('b.') or dates.endswith('-') or len(dates) > 7:
        try:
            birth = int(dates[0:4])
        except:
            birth = -1
    else:
        birth = -1
    
    if birth > 0 and birth < 2010:
        if birth != existingbirth:
            ctr += 1
            print(existingbirth, birth)
        
        return birth

    else:
        return existingbirth

third = third.assign(birth = third.apply(calculate_birth, axis = 1))
print(ctr)
        

nan 1778
nan 1748
nan 1925
nan 1956
nan 1786
nan 1844
nan 1833
nan 1791
nan 1968
1865.0 1866
nan 1915
nan 1759
nan 1935
nan 1770
nan 1770
nan 1828
nan 1847
nan 1925
nan 1822
nan 1925
1880.0 1914
nan 1775
nan 1925
nan 1779
nan 1779
nan 1708
nan 1865
nan 1770
nan 1852
nan 1925
nan 1770
nan 1839
nan 1806
nan 1925
nan 1912
nan 1828
nan 1758
nan 1787
nan 1925
nan 1819
nan 1763
nan 1922
nan 1798
1897.0 1930
nan 1779
nan 1758
nan 1899
nan 1798
nan 1807
nan 1891
nan 1824
nan 1949
nan 1785
nan 1812
nan 1798
nan 1805
nan 1841
nan 1907
nan 1935
nan 1931
nan 1770
nan 1921
nan 1836
nan 1814
nan 1777
nan 1910
nan 1792
nan 1808
nan 1941
nan 1841
nan 1851
nan 1779
nan 1928
nan 1915
nan 1802
nan 1759
nan 1915
nan 1900
nan 1933
nan 1827
nan 1791
nan 1880
nan 1801
nan 1784
nan 1758
nan 1937
nan 1806
nan 1772
nan 1847
nan 1911
nan 1845
nan 1779
nan 1790
nan 1798
nan 1862
nan 1932
nan 1955
nan 1816
nan 1788
nan 1949
nan 1768
nan 1829
nan 1919
nan 1759
nan 1925
nan 1807
nan 1798
nan 1695
nan 1807
nan 1889
n

In [129]:
third.columns

Index(['allcopiesofwork', 'author', 'copiesin25yrs', 'earlyedition', 'imprint',
       'inferreddate', 'lastname', 'latestcomp', 'nationality', 'recordid',
       'title', 'authordate', 'birth', 'toremove', 'isusa', 'best1821_1900',
       'best1900_1950', 'best1950_1990', 'anybest', 'best1821_1900contrast',
       'best1900_1950contrast', 'best1950_1990contrast', 'reviewed1850_1950',
       'reviewed1850_1950contrast', 'heath', 'heathcontrast', 'mostdiscussed',
       'mostdiscussedcontrast', 'usnorton', 'usnortoncontrast', 'nonusnorton',
       'nonusnortoncontrast', 'preregistered', 'preregisteredcontrast',
       'reviewed1965_1990', 'reviewed1965_1990contrast', 'age', 'actualgender',
       'likelygender'],
      dtype='object')

In [130]:
rightorder = ['allcopiesofwork', 'author', 'copiesin25yrs', 'earlyedition', 'imprint',
       'inferreddate', 'lastname', 'latestcomp', 'nationality', 'isusa', 'actualgender',
       'likelygender', 'title', 'authordate', 'birth', 'age', 'recordid', 'best1821_1900',
       'best1900_1950', 'best1950_1990', 'anybest', 'best1821_1900contrast',
       'best1900_1950contrast', 'best1950_1990contrast', 'reviewed1850_1950',
       'reviewed1850_1950contrast', 'heath', 'heathcontrast', 'mostdiscussed',
       'mostdiscussedcontrast', 'usnorton', 'usnortoncontrast', 'nonusnorton',
       'nonusnortoncontrast', 'preregistered', 'preregisteredcontrast',
       'reviewed1965_1990', 'reviewed1965_1990contrast', 'toremove']

In [131]:
third = third[rightorder]

In [139]:
selections = find_matches(third, shared, manualcheck, 'reviewed1965_1990', 1950, 2010)

Needed:  (100, 39)
Available:  (1030, 18)
Unrecoverable error.
Unrecoverable error.
Unrecoverable error.
Unrecoverable error.
Unrecoverable error.
Found:  95
Secondary:  53


In [149]:
for sel in selections:
    third.loc[sel, 'reviewed1965_1990contrast'] = 1

In [151]:
third.to_csv('thirdmastermeta.tsv', sep = '\t', index_label = 'docid')

In [144]:
cols2check = ['best1821_1900',
       'best1900_1950', 'best1950_1990', 'anybest', 'best1821_1900contrast',
       'best1900_1950contrast', 'best1950_1990contrast', 'reviewed1850_1950',
       'reviewed1850_1950contrast', 'heath', 'heathcontrast', 'mostdiscussed',
       'mostdiscussedcontrast', 'usnorton', 'usnortoncontrast', 'nonusnorton',
       'nonusnortoncontrast', 'preregistered', 'preregisteredcontrast',
       'reviewed1965_1990', 'reviewed1965_1990contrast']

In [150]:
for col in cols2check:
    print()
    print(col)
    df = third.loc[third[col] == 1, : ]
    print('All vols: ', len(df.isusa))
    print('US vols: ', sum(df.isusa))
    print('Mean date: ', np.mean(df.latestcomp))


best1821_1900
All vols:  139
US vols:  38
Mean date:  1870.5971223

best1900_1950
All vols:  141
US vols:  96
Mean date:  1922.4964539

best1950_1990
All vols:  141
US vols:  119
Mean date:  1968.4964539

anybest
All vols:  814
US vols:  544
Mean date:  1924.97420147

best1821_1900contrast
All vols:  139
US vols:  38
Mean date:  1868.61151079

best1900_1950contrast
All vols:  141
US vols:  120
Mean date:  1968.73049645

best1950_1990contrast
All vols:  141
US vols:  97
Mean date:  1922.23404255

reviewed1850_1950
All vols:  541
US vols:  210
Mean date:  1898.17560074

reviewed1850_1950contrast
All vols:  603
US vols:  303
Mean date:  1894.48922056

heath
All vols:  46
US vols:  40
Mean date:  1902.34782609

heathcontrast
All vols:  46
US vols:  40
Mean date:  1903.45652174

mostdiscussed
All vols:  29
US vols:  28
Mean date:  1903.13793103

mostdiscussedcontrast
All vols:  29
US vols:  27
Mean date:  1903.72413793

usnorton
All vols:  59
US vols:  51
Mean date:  1911.08474576

usnorto