# Third deduplication

A final standardizing pass on **workmeta**, followed by a temporal filter to produce **contemporaryworkmeta**.


In [1]:
import pandas as pd

In [2]:
work = pd.read_csv('newworkmeta.tsv', sep = '\t', low_memory = False)

### Standardize names

In our second deduplication notebook, we wrote to file the pairs of author names that we discovered while deduplicating works. In each pair, the texts and titles of certain works were similar enough to convince us that the two names refer to the same author.

However, we don't know which name in a pair is better. To decide, we need some rules, which the function below provides.

In [3]:
authsets = pd.read_csv('authorsets.tsv', sep = '\t', header = None, names = ['name1', 'name2'])
authsets.head()

Unnamed: 0,name1,name2
0,"Robinson, Mary Stephens","Robinson, Mary S"
1,"Howes, Edith Annie","Howes, Edith"
2,"Williams, Elizabeth Whitney, Mrs","Williams, Elizabeth Whitney"
3,"Somerville, E. Œ. (Edith Œnone)","Somerville, E. &#xbf;. (Edith &#xbf;none)"
4,"Pearce, Donn","Pearce, Donald"


In [4]:
def name2prefer(row):
    ''' A bunch of rules we can use to make up our mind.
    Basically, diacritical marks are preferred to garbage
    equivalents; otherwise longer forms are usually preferred.
    '''
    option1 = row.name1
    option2 = row.name2
    if '??' in option1 or '#xbf;' in option1:
        return option2
    elif '??' in option2 or '#xbf;' in option2:
        return option1
    elif 'from old catalog' in option1:
        return option2
    elif 'from old catalog' in option2:
        return option1
    elif '?_' in option1:
        return option2
    elif '?_' in option2:
        return option1
    elif ' ̌' in option1:
        return option2
    elif ' ̌' in option2:
        return option1
    elif '1' in option1:
        return option2
    elif '1' in option2:
        return option1
    elif '  ' in option1:
        return option2
    elif '  ' in option2:
        return option1
    elif 'è' in option1 or 'é' in option1:
        return option1
    elif 'è' in option2 or 'é' in option2:
        return option2
    elif len(option1) > len(option2):
        return option1
    else:
        return option2

authsets = authsets.assign(normative = authsets.apply(name2prefer, axis = 1))
authsets.head()

Unnamed: 0,name1,name2,normative
0,"Robinson, Mary Stephens","Robinson, Mary S","Robinson, Mary Stephens"
1,"Howes, Edith Annie","Howes, Edith","Howes, Edith Annie"
2,"Williams, Elizabeth Whitney, Mrs","Williams, Elizabeth Whitney","Williams, Elizabeth Whitney, Mrs"
3,"Somerville, E. Œ. (Edith Œnone)","Somerville, E. &#xbf;. (Edith &#xbf;none)","Somerville, E. Œ. (Edith Œnone)"
4,"Pearce, Donn","Pearce, Donald","Pearce, Donald"


#### Now actually make the change


In [5]:
firstset = set(authsets.name1)
secondset = set(authsets.name2)
ctr = 0

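def applynorm(name):
    ''' Replace a name with its preferred form if it appears in either
    column of authsets; otherwise return it unchanged. The counter
    tallies every matched name, including names that were already
    in their preferred form.
    '''
    # Only ctr is reassigned here, so it is the only name that needs
    # a global declaration; the others are merely read.
    global ctr

    if name in firstset:
        newname = str(authsets.loc[authsets.name1 == name, 'normative'].values[0])
        ctr += 1
        return newname
    elif name in secondset:
        newname = str(authsets.loc[authsets.name2 == name, 'normative'].values[0])
        ctr += 1
        return newname
    else:
        return name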
    
work = work.assign(author = work.author.map(applynorm))  
print(ctr)

2899


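The counter shows that 2899 rows had an author name belonging to one of the pairs. Rows whose name was already the preferred form are included in that count.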
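
The row-by-row lookup above could also be expressed as a single dictionary built once from the name pairs. The following is only a sketch of that alternative, not the method used in this notebook; `authsets_demo`, `work_demo`, and `lookup` are hypothetical names, and the rows are a small illustrative sample.

```python
import pandas as pd

# Toy stand-ins for authsets and work (illustrative rows only).
authsets_demo = pd.DataFrame({
    'name1': ['Howes, Edith Annie', 'Pearce, Donn'],
    'name2': ['Howes, Edith', 'Pearce, Donald'],
    'normative': ['Howes, Edith Annie', 'Pearce, Donald'],
})
work_demo = pd.DataFrame({
    'author': ['Howes, Edith', 'Pearce, Donn', 'Gaskell, Elizabeth Cleghorn']
})

# Build the lookup once. setdefault keeps the first entry seen, so a name
# found in name1 takes precedence over the same name found in name2,
# mirroring the if / elif order of applynorm.
lookup = {}
for row in authsets_demo.itertuples(index=False):
    lookup.setdefault(row.name1, row.normative)
for row in authsets_demo.itertuples(index=False):
    lookup.setdefault(row.name2, row.normative)

# Rows whose author appears in any pair (what ctr counts above).
matched = work_demo.author.isin(list(lookup))

# Names not in the lookup are returned unchanged.
normalized = work_demo.author.map(lambda name: lookup.get(name, name))

print(int(matched.sum()))
print(normalized.tolist())
```

A dictionary lookup avoids filtering the whole **authsets** frame once per row, which may matter on a table with hundreds of thousands of rows.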

### Correct authordates

An error in an earlier notebook has caused the author "blank" or "no author name provided" to inherit authordates. These are not real.

In [7]:
ctr = 0

def be_agnostic(row):
    global ctr
    
    if pd.isnull(row.author) or len(row.author) < 2:
        ctr += 1
        return ''
    else:
        return row.authordate

work = work.assign(authordate = work.apply(be_agnostic, axis = 1))
print('Change made in ' + str (ctr) + ' rows.')

Change made in 8553 rows.
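
For comparison, the same blanking rule can be written with a boolean mask instead of a row-wise function. This is a sketch on an invented four-row frame (`demo` and `blank_author` are hypothetical names), not the code that produced the count above.

```python
import pandas as pd

# Invented rows: one real author, one missing, one empty, one single character.
demo = pd.DataFrame({
    'author': ['Gaskell, Elizabeth Cleghorn', None, '', 'X'],
    'authordate': ['1810-1865.', '1800-1870', '1799-1850', '1900-'],
})

# True where the author is missing or shorter than two characters,
# the same condition be_agnostic tests row by row.
blank_author = demo.author.isnull() | (demo.author.str.len() < 2)

# Blank out the inherited dates for those rows only.
demo.loc[blank_author, 'authordate'] = ''

print('Change made in ' + str(int(blank_author.sum())) + ' rows.')
```

The mask also gives the row count directly, so no global counter is needed.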


### Update last date of composition

We have a column **latestcomp** that's supposed to contain an inference about the last possible date of composition, based on certain values of **datetype** and **enddate,** plus authors' death dates.

But the dates of death have been updated since **latestcomp** was created. We may be able to improve that column a little before using it in the next stage of dedup.

In [8]:
ctr = 0

def update(row):
    global ctr 
    
    lastcomp = row.latestcomp
    if pd.isnull(lastcomp):
        lastcomp = float('nan')
    else:
        lastcomp = int(lastcomp)
        
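    # Check for a missing authordate before converting to str: str(NaN) is
    # the string 'nan', which pd.isnull() does not treat as missing.
    missing_dates = pd.isnull(row.authordate)
    authordates = str(row.authordate).strip('[.]() ')

    if missing_dates: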
        if pd.isnull(lastcomp) or lastcomp > 2010:
            return float('nan')
        else:
            return lastcomp
        
    elif authordates.startswith('d.') or (len(authordates) > 7 and '-' in authordates):
        try:
            death = int(authordates[-4 : ])
        except ValueError:
            death = 0
    else:
        death = 0
    
    if (death > 1500 and death < 2010) and death < lastcomp:
        ctr += 1
        return death
    else:
        return lastcomp

work = work.assign(updated = work.apply(update, axis = 1))
print('Change made in ' + str (ctr) + ' rows.')

Change made in 1927 rows.
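
To make the death-date parsing concrete, here is the string-handling step of the function above pulled out on its own and run on a few sample values. `death_year` is a hypothetical helper name; only the first sample string comes from the data shown in this notebook, and the others are invented to exercise each branch.

```python
def death_year(authordate):
    """Return the death year implied by an authordate string, or 0 if none."""
    # Strip brackets, parentheses, periods, and spaces from both ends.
    cleaned = str(authordate).strip('[.]() ')
    # Accept either a 'd. YYYY' form or a birth-death range such as 'YYYY-YYYY'.
    if cleaned.startswith('d.') or (len(cleaned) > 7 and '-' in cleaned):
        try:
            return int(cleaned[-4:])
        except ValueError:
            return 0
    return 0

print(death_year('1810-1865.'))   # 1865: a full range, trailing period stripped
print(death_year('d. 1902'))      # 1902: explicit death date
print(death_year('1931-'))        # 0: open range, author presumably living
print(death_year('1810-18??'))    # 0: last four characters are not a number
```

A result of 0 falls outside the 1500 to 2010 window tested above, so such rows keep their existing **latestcomp**.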


In [9]:
work['latestcomp'] = work['updated']
work.drop(labels = ['updated'], axis = 1, inplace = True)

In [10]:
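def whethertokeep(row):
    ''' Flag a volume as an early edition unless its inferred date
    falls more than 25 years after the latest possible date of
    composition. Volumes with no latestcomp are always kept.
    '''
    infdate = int(row.inferreddate)
    lastcomp = row.latestcomp
    if pd.isnull(lastcomp):
        return True
    else:
        lastcomp = int(lastcomp)

    if infdate > (lastcomp + 25):
        return False
    else:
        return True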

work = work.assign(earlyedition = work.apply(whethertokeep, axis = 1))
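
The 25-year window behind the **earlyedition** flag can be checked on a few hand-picked dates. This is a minimal restatement of the rule for illustration, assuming plain numbers rather than DataFrame rows; `is_early_edition` and its `window` parameter are hypothetical names.

```python
import math

def is_early_edition(inferreddate, latestcomp, window=25):
    """True unless inferreddate is more than `window` years after latestcomp."""
    # With no known latest composition date, the volume is always kept.
    if latestcomp is None or math.isnan(latestcomp):
        return True
    return int(inferreddate) <= int(latestcomp) + window

print(is_early_edition(1890, 1865))          # True: exactly 25 years later
print(is_early_edition(1891, 1865))          # False: 26 years later
print(is_early_edition(0, 1865))             # True: undated volume (inferreddate 0)
print(is_early_edition(1950, float('nan')))  # True: no latestcomp to compare
```

Note that an undated volume with an inferred date of 0 always passes this test, because 0 can never exceed the cutoff.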

In [11]:
work.loc[work.author.str.startswith('Gaskell, E', na = False), : ]

Unnamed: 0,docid,oldauthor,author,authordate,inferreddate,latestcomp,datetype,startdate,enddate,imprint,...,recordid,instances,allcopiesofwork,copiesin25yrs,enumcron,volnum,title,parttitle,shorttitle,earlyedition
249,nyp.33433074857115,"Gaskell, Elizabeth Cleghorn","Gaskell, Elizabeth Cleghorn",1810-1865.,0,1865,n,uuuu,uuuu,New York|Dodd|n.d.,...,8665158,1,1,1,,,"Cranford, | with a memoir of the author; | $c:...",,"Cranford, with a memoir of the author;",True
8947,uiuo.ark+=13960=t9z03hc07,"Gaskell, Elizabeth Cleghorn","Gaskell, Elizabeth Cleghorn",1810-1865.,1848,1848,s,1848,,London;Chapman and Hall;1848.,...,1419882,1,8,1,v.1,1.0,Mary Barton,,Mary Barton,True
10281,njp.32101019691409,"Gaskell, Elizabeth Cleghorn","Gaskell, Elizabeth Cleghorn",1810-1865.,1853,1853,s,1853,,New York;Harper & brothers;1853.,...,1419868,1,35,2,,,"Cranford / | $c: by the author of ""Mary Barton...",,Cranford,True
10426,nyp.33433074857149,"Gaskell, Elizabeth Cleghorn","Gaskell, Elizabeth Cleghorn",1810-1865.,1853,1853,s,1853,,Leipzig;B. Tauchnitz;1853.,...,8665214,1,6,6,v. 1,1.0,"Ruth : | a novel / | $c: by the author of ""Mar...",,Ruth : a novel,True
10427,nyp.33433074857131,"Gaskell, Elizabeth Cleghorn","Gaskell, Elizabeth Cleghorn",1810-1865.,1853,1853,s,1853,,Leipzig;B. Tauchnitz;1853.,...,8665214,1,6,6,v. 2,2.0,"Ruth : | a novel / | $c: by the author of ""Mar...",,Ruth : a novel,True
11082,nyp.33433074857362,"Gaskell, Elizabeth Cleghorn","Gaskell, Elizabeth Cleghorn",1810-1865.,1855,1855,s,1855,,New York;Harper & Bros.;1855.,...,8668329,1,13,4,,,"North and south, | $c: by the author of Mary B...",,North and south,True
12045,nyp.33433082340302,"Gaskell, Elizabeth Cleghorn","Gaskell, Elizabeth Cleghorn",1810-1865.,1858,1858,s,1858,,New York;Appleton;1858.,...,8637302,1,1,1,v. 1-2,1.0,The life of Charlotte Brontë / | $c: by E. C....,,The life of Charlotte Brontë,True
12064,nyp.33433074857156,"Gaskell, Elizabeth Cleghorn","Gaskell, Elizabeth Cleghorn",1810-1865.,1858,1858,s,1858,,New York;Harper;1858.,...,8665192,1,2,2,,,My Lady Ludlow : | a novel / | $c: by Mrs. Gas...,,My Lady Ludlow : a novel,True
12236,inu.30000007412517,"Gaskell, Elizabeth Cleghorn","Gaskell, Elizabeth Cleghorn",1810-1865.,1859,1859,s,1859,,"London;S. Low, son & co.;1859.",...,6059946,1,3,1,v.1,1.0,"Round the sofa. | $c: By the author of ""Mary B...",,Round the sofa,True
12364,uc1.b3322505,"Gaskell, Elizabeth Cleghorn","Gaskell, Elizabeth Cleghorn",1810-1865.,1859,1859,s,1859,,London;S. Low;1859.,...,7915307,1,1,1,v.1,1.0,"Round the sofa / | $c: by the author of ""Mary ...",Round the sofa. My Lady Ludlow,Round the sofa. My Lady Ludlow,True


In [12]:
sum(work.earlyedition)

129023

In [13]:
work.to_csv('../workmeta.tsv', sep = '\t', index = False)

### Also correct authordate error in manifestationmeta

This is the same authordate error we corrected above for the work-level table, so we reuse the be_agnostic function here.

In [15]:
manifest = pd.read_csv('../manifestationmeta.tsv', sep = '\t', low_memory = False)

In [16]:
ctr = 0  # reset the global counter that be_agnostic increments
manifest = manifest.assign(authordate = manifest.apply(be_agnostic, axis = 1))
print('Change made in ' + str (ctr) + ' rows.')

Change made in 8553 rows.


In [17]:
manifest.to_csv('../manifestationmeta.tsv', sep = '\t', index = False)