# Standardize and deduplicate fiction metadata

This notebook begins after masterficmetadata.tsv has been created.

The next step is to produce a dataset where there's a single copy of each physical volume. In terms of [FRBR](https://en.wikipedia.org/wiki/Functional_Requirements_for_Bibliographic_Records), this is roughly the "manifestation" level.

In other words, after this round of dedup, there will still be several copies of *Middlemarch* in the dataset: maybe a three-volume edition from the 1870s, as well as one-volume editions in 1892 and 1950, adding up to five volumes. But there should be only one copy of each volume, in each edition.

I'll achieve that very simply by relying on HTRC record IDs + volume numbers.

This notebook also prepares for the next stage of deduplication by loosely standardizing author names.

The output is written to disk as newmanifestationmeta.tsv.


In [1]:
# Some generic imports.

import csv
import pandas as pd
import unicodedata

### General strategy for standardizing authors

We're not going to use fuzzy matching on this pass; we're going to make very conservative assumptions and link authors only if

    1) They have the same last name
    2) Also the same initials
    3) And also *either* the same birthdate, or
    4) Are listed as authoring the same title.

Very short author names (fewer than nine characters) will also be ignored.

To do this we need dictionaries of authors' birthdates and titles-authored. While we're at it we can group authors into blocks that share the same first two letters, to speed up the comparison process later.
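
As a rough illustration of the blocking idea, here's a minimal sketch using invented names (not drawn from the dataset). The helper `block_key` is hypothetical; it mirrors the rule used below of taking the first two letters, lowercased, and putting very short names in a catch-all block.

```python
from collections import defaultdict

def block_key(author):
    # Hypothetical helper: first two letters, lowercased;
    # names of two characters or fewer go in a catch-all block.
    return author[0:2].lower() if len(author) > 2 else 'xx'

# Invented example names
names = ['Eliot, George', 'Eliot, G.', 'Ellis, Sarah', 'Verne, Jules', 'Vo']

blocks = defaultdict(set)
for name in names:
    blocks[block_key(name)].add(name)

for key in sorted(blocks):
    print(key, sorted(blocks[key]))
```

Pairs of names are then compared only within a block, so the number of comparisons depends on the square of each block's size rather than the square of the whole author list.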

In [2]:
# read in the existing metadata, and normalize unicode

def normalize_unicode(astring):
    if pd.isnull(astring):
        return ''
    else:
        astring = astring.replace('  ', ' ')
        astring = unicodedata.normalize('NFC', astring)
        return astring

meta = pd.read_csv('../masterficmetadata.tsv', sep = '\t', low_memory = False)
meta = meta.assign(newauthor = meta.author.map(normalize_unicode))

In [3]:
grouped = meta.groupby('newauthor')
titlesets = dict()
birthdates = dict()
authorblocks = dict()

for author, group in grouped:
    birthset = set()
    titleset = set()
    for ix in group.index:
        authordate = group.loc[ix, 'authordate']
        if not pd.isnull(authordate) and len(authordate) >= 4:
            try:
                birth = int(authordate[0:4])
                birthset.add(birth)
            except ValueError:
                # the first four characters aren't a year; skip this row
                pass
        title = group.loc[ix, 'shorttitle']
        if not pd.isnull(title):
            if len(title) > 20:
                title = title[0:20]
            titleset.add(title)
    if len(birthset) == 1:
        birthdates[author] = birthset.pop()
    titlesets[author] = titleset
    
    if len(author) > 2:
        initial = author[0:2].lower()
    else:
        initial = 'xx'
    if initial not in authorblocks:
        authorblocks[initial] = set()
    authorblocks[initial].add(author)
        

#### Actually generate standardized names

We go through each block, looking for pairs of authors who meet the above criteria.

When a matched pair is found, we add it as a new group, unless one member of the pair is already in an existing group, in which case both are added to that group.
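
To make the grouping rule concrete, here's a small sketch with invented names standing in for pairs that have already passed the matching tests. The function `add_pair` is a hypothetical name for illustration; the notebook does the same thing inline.

```python
def add_pair(groups, a1, a2):
    # Add a matched pair to the first group containing either name,
    # or start a new group if neither name has been seen yet.
    for g in groups:
        if a1 in g or a2 in g:
            g.update({a1, a2})
            return
    groups.append({a1, a2})

groups = []
# Invented pairs, assumed to have passed the name/birthdate/title tests
add_pair(groups, 'Smith, J.', 'Smith, John')
add_pair(groups, 'Smith, John', 'Smith, John, 1820-1890')
add_pair(groups, 'Brown, A. B.', 'Brown, Anna B.')

print(len(groups))
```

One consequence of this simple rule: a pair only joins the first group it touches, so two groups that already exist are never merged with each other, even if a later pair bridges them.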

In [4]:
groups = []

def zapnsplit(a):
    a = a.replace('from old catalog', ' ')
    a = a.replace('.', ' ')
    a = a.replace('(', ' ')
    a = a.replace(')', ' ')
    a = a.replace('[', ' ')
    a = a.replace(']', ' ')
    a = a.replace(',', ' ')
    a = a.replace('Mrs', ' ')
    a = a.replace('Sir', ' ')
    # we don't want honorifics to interfere with 
    # identity
    
    parts = a.lower().split()
    lastname = parts[0]
    parts = [x for x in parts if len(x) > 0]
    return set(parts), lastname

def matchnames(namea, nameb):
    a, lasta = zapnsplit(namea)
    b, lastb = zapnsplit(nameb)
    
    if lasta != lastb:
        return False
    
    missing = a - b
    for m in missing:
        if m[0] not in b:
            return False
    # for each word that's missing in b,
    # we check to see if the initial is present in B
    
    missing = b - a
    for m in missing:
        if m[0] not in a:
            return False
    # likewise for a
    
    # If all tests are passed,  
    return True

# iterate through author blocks and actually do the work

for initial, block in authorblocks.items():
        
    for a1 in block:
        for a2 in block:
            if a1 == a2:
                continue
            if len(a1) < 9 or len(a2) < 9:
                continue
            if a1[0:5] != a2[0:5]:
                continue
            
            if len(a1) > 21:
                trunca1 = a1[0:21]
            else:
                trunca1 = a1
                
            if len(a2) > 21:
                trunca2 = a2[0:21]
            else:
                trunca2 = a2
                
            if trunca1 != trunca2 and not matchnames(a1, a2):
                continue
            
            titlematch = 0
            
            birthmatch = False
            if (a1 in birthdates and a2 in birthdates) and (birthdates[a1] == birthdates[a2]):
                birthmatch = True
                titlematch += 1
                # not requiring titlematch if birthdates match
                
            elif a1 not in birthdates:
                birthmatch = True
            elif a2 not in birthdates:
                birthmatch = True
            
            if not birthmatch:
                continue
            
            for t1 in titlesets[a1]:
                for t2 in titlesets[a2]:
                    if t1 == t2:
                        titlematch += 1
            
            if titlematch == 0:
                continue
            else:
                found = False
                for g in groups:
                    if a1 in g or a2 in g:
                        g.add(a1)
                        g.add(a2)
                        found = True
                        break
                        
                if not found:
                    groups.append({a1, a2})
                    
                

In [5]:
len(groups)

1148

In [6]:
# Write the author groups to file for future reference.

with open('newauthorgroups.tsv', mode = 'w', encoding = 'utf-8') as f:
    for g in groups:
        of_record = ''
        for n in g:
            official = n.replace('from old catalog', '')
            official = official.strip('[], .')
            if official[-1].isupper():
                official = official + '.'
            if len(of_record) < 1 or len(official) > len(of_record):
                of_record = official
        outline = of_record + '\t' + '\t'.join(g) + '\n'
        f.write(outline)
        

Now let's read that back in, to create a translation dictionary.
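
Each line of that file holds the name of record first, followed by every variant in the group, all tab-separated. As a quick sketch of what one line turns into (the line itself is invented, and `trans` is just an illustrative variable name):

```python
# An invented line in the same layout as newauthorgroups.tsv:
# name of record, then each variant in the group
line = 'Smith, John\tSmith, J.\tSmith, John\n'

fields = line.strip('\n').split('\t')
official = fields[0]

# Every variant (and the official form itself) maps to the official form
trans = {name: official for name in fields}
print(trans)
```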

In [7]:
author_trans = dict()

with open('newauthorgroups.tsv', encoding = 'utf-8') as f:
    for line in f:
        fields = line.strip('\n').split('\t')
        official = fields[0]
        for name in fields:
            author_trans[name] = official

Actually standardize the author names.

In [8]:
def map_names(aname):
    global author_trans
    if aname in author_trans:
        return author_trans[aname]
    else:
        return aname

meta = meta.assign(newauthor = meta.newauthor.map(map_names))

#### Let's also standardize authordates

Some rows have birth and/or death dates for an author; others for the same author may lack that info. We're going to want to take the richest available info, and spread it across all rows for the author. "Longest" is not the world's best metric of "richest," but in practice, for this case, it will do.
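
A tiny sketch of the "longest wins" heuristic, using invented date strings for a single author. The function name `richest` is hypothetical, and `None` stands in here for a missing value.

```python
def richest(datestrings):
    # Keep the longest non-missing string; on ties, keep the first seen.
    longest = ''
    for d in datestrings:
        if d is None:
            continue
        d = str(d)
        if len(d) > len(longest) and d.lower() != 'nan':
            longest = d
    return longest

# Invented authordate values as they might appear across rows
dates = ['1819-', None, '1819-1880.', 'nan']
print(richest(dates))
```

If every row for an author is missing a date, the result is an empty string.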

In [9]:
authordates = dict()
grouped = meta.groupby('newauthor')
for author, group in grouped:
    longest = ''
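    for d in group.authordate:
        # check for missing values before converting to a string;
        # str(NaN) is 'nan', which pd.isnull would no longer catch
        if pd.isnull(d):
            continue
        d = str(d)
        if len(d) > len(longest) and d.lower() != 'nan':
            longest = d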
    authordates[author] = longest


### The actual deduplication

We create a column that is str(recordid) + str(volnum), so we can group on the combination. (Passing both to groupby gives you a multi-index, which is more complexity than I need today.)

We also record the number of instances of a recordid-volnum combination that are being collapsed.
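
Here's a plain-Python sketch of the grouping key and the instance count, on invented rows. It uses `None` for a missing volume number (in the dataframe it would be NaN), and `grouping_key` is a hypothetical stand-in for the concatenation function defined below.

```python
from collections import Counter

# Invented rows: (docid, recordid, volnum)
rows = [
    ('doc1', 1001, 1.0),
    ('doc2', 1001, 1.0),   # second scan of the same volume
    ('doc3', 1001, 2.0),
    ('doc4', 2002, None),
    ('doc5', 2002, None),  # duplicate of a single-volume record
]

def grouping_key(recordid, volnum):
    # Missing volume numbers become the literal string 'nan'
    vol = 'nan' if volnum is None else str(volnum)
    return str(recordid) + '+' + vol

# Number of rows that will collapse into each record-volume combination
counts = Counter(grouping_key(r, v) for _, r, v in rows)
print(counts)
```

Five rows collapse to three record-volume combinations here, and the counts are what end up in the instances column.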

In [10]:
def concatenate_fields(row):
    ''' Concatenate two fields, with special provision
    for the likelihood that the second is null.
    '''
    fielda = row['recordid']
    fieldb = row['volnum']
    
    if pd.isnull(fieldb):
        fieldb = 'nan'
    else:
        fieldb = str(fieldb)
    
    return str(fielda) + '+' + fieldb

meta = meta.assign(groupingcolumn = meta.apply(concatenate_fields, axis = 1))

In [11]:
kept = []
ctr = 0
instances = dict()
grouped = meta.groupby(['groupingcolumn'])
for key, group in grouped:
    ctr += 1
    if ctr % 10000 == 1:
        print(ctr)
    keeper = ''
    lowest = 2100
    for idx in group.index:
        date = int(group.loc[idx, 'inferreddate'])
        if (date < lowest and date > 1699) or lowest == 2100:
            lowest = date
            
            if lowest < 1700:
                lowest = 2100
            # dubious "dates" should not outcompete real dates
            
            keeper = group.loc[idx, 'docid']
            if type(keeper) != str:
                keeper = keeper[0]
    numcopies = len(group.index)
    instances[keeper] = numcopies
    kept.append(keeper)

1
10001
20001
30001
40001
50001
60001
70001
80001
90001
100001
110001
120001
130001
140001
150001
160001
170001


That created a list of docids to keep. Now we just have to keep them.

In [12]:
# actually do the deduplication
print(meta.shape)
meta.set_index('docid', inplace = True)
deduped = meta.loc[kept, : ]
print(deduped.shape)
deduped = deduped[~deduped.index.duplicated(keep='first')]
print(deduped.shape)

(210278, 26)
(176623, 25)
(176623, 25)


In winnowing duplicates, we don't want to lose authors' birth and death dates. Let's ensure we have the richest information.

Also, while we're at it, let's add a column recording the number of instances for each record-vol.

In [13]:
def map_instances(docid):
    global instances
    if docid in instances:
        return instances[docid]
    else:
        return 0

deduped['instances'] = deduped.index.map(map_instances)

def enrich_authordates(row):
    ''' Enriches the authordate with a longer form where appropriate
    '''
    global authordates
    
    authdate = row['authordate']
    if pd.isnull(authdate):
        authdate = ''
    author  = row['newauthor']
    if author in authordates and len(authordates[author]) > len(authdate) and authordates[author] != 'nan':
        return authordates[author]
    else:
        return authdate
    
deduped = deduped.assign(authordate = deduped.apply(enrich_authordates, axis = 1))

In [14]:
deduped.head()

docid,oldauthor,author,authordate,inferreddate,latestcomp,datetype,startdate,enddate,imprint,imprintdate,...,place,recordid,enumcron,volnum,title,parttitle,shorttitle,newauthor,groupingcolumn,instances
uc2.ark:/13960/t4mk67z1j,"Jones, J. D. (John Daniel)","Jones, J. D. (John Daniel)",1865-1942.,1900,1900,,1900,1989,New York|George H. Doran|19--,"<estimate=""[19--]"">",...,nyu,100000247,,,The Lord of life and death / | $c: by J.D. Jones.,,The Lord of life and death,"Jones, J. D. (John Daniel)",100000247+nan,1
uc2.ark:/13960/t39z92s0m,"Ovcharenko, Ivan Vasilʹevich","Ovcharenko, Ivan Vasilʹevich",,1900,1900,,1900,1985,London|Modern Books|19--?,"<estimate=""[19--?]"">",...,enk,100000271,,,In a ring of fire; | memories of a partisan.,,In a ring of fire; memories of a partisan,"Ovcharenko, Ivan Vasilʹevich",100000271+nan,1
uc2.ark:/13960/t0ht2j50b,"Phillpotts, Eden","Phillpotts, Eden",1862-1960.,1900,1900,,1900,1987,Paris|Thomas Nelson and Sons|19--?,"<estimate=""[19--?]"">",...,fr,100000276,,,Old Delabole / | $c: by Eden Phllpotts.,,Old Delabole,"Phillpotts, Eden",100000276+nan,1
uc2.ark:/13960/t1sf2p90r,"Rosborough, Alexander J","Rosborough, Alexander J",,1900,1900,,1900,1986,"Yreka, CA|News-Journal Print Shop|19]̲̲","<estimate=""19]̲̲"">",...,cau,100000283,,,The mountie and the sourdough / | $c: by Alexa...,,The mountie and the sourdough,"Rosborough, Alexander J",100000283+nan,1
uc2.ark:/13960/t9c53hj4v,"Verne, Jules","Verne, Jules",1828-1905.,1900,1900,,1900,1986,New York|Phoenix Publishing Co.|19--?,"<estimate=""[19--?]"">",...,nyu,100000299,,,The tour of the world in eighty days / | $c: b...,,The tour of the world in eighty days,"Verne, Jules",100000299+nan,1


In [15]:
manual = dict()
with open('manual_author_matches.tsv', encoding = 'utf-8') as f:
    reader = csv.DictReader(f, delimiter = '\t')
    for row in reader:
        manual[row['alias']] = unicodedata.normalize('NFC', row['realname'])

def manual_correction(author):
    global manual
    if author in manual:
        return manual[author]
    else:
        return author

Now we just write to file, after reassigning the author column, applying the manual corrections, dropping the two working columns, and sorting.

In [16]:
deduped = deduped.assign(author = deduped.newauthor)
deduped = deduped.assign(author = deduped.author.map(manual_correction))
deduped.drop(labels = ['newauthor', 'groupingcolumn'], axis = 1, inplace = True)
deduped.sort_values(by = ['inferreddate', 'recordid', 'volnum'], inplace = True)
deduped.to_csv('newmanifestationmeta.tsv', sep = '\t')
