# Cleaning the master metadata

#### standardizing authors / titles

The authors and titles I got from HathiTrust have a few rough edges. Titles sometimes include a statement about authorship preceded by ```$c```. I don't usually want to treat that as part of the title.

Authors' names may be preceded by "Sir" or "Mrs"; generally I want to move that sort of honorific to the end of the name, so that last name always comes first. (Important for deduplication.)

#### volume-part inference

Commonly, a multi-volume set of *Works* will have a "contents" statement that enumerates the sub-title of each volume. With a bit of careful parsing, we can assign titles to individual volumes, so we have *Ivanhoe* instead of the less informative *Works of Scott,* vol 7.

#### date correction

The routine I used to infer ```inferreddate``` gave up a little too easily in some cases, and there are zeroes where we could make a better guess. Also, I'd like to add a column for "last possible date of composition." Using information about an author's date of death (!!), or in some cases copyright date, we can infer that some volumes are reprints of much earlier publications.


In [1]:
# a few useful imports

import pandas as pd
import re

In [2]:
# read the raw data

meta = pd.read_csv('mergedficmetadata.tsv', sep = '\t', index_col = 'docid', low_memory = False)
meta.head()

docid,author,authordate,contents,datetype,enddate,enumcron,genres,geographics,imprint,imprintdate,inferreddate,locnum,oclc,place,recordid,startdate,subjects,title
njp.32101071963472,"Rousseau, Jean-Jacques",1712-1778.,,s,,vol. 2,NotFiction,,"London;Printed for G.G.J. and J. Robinson, and...",1790,1790,,16894767.0,enk,8980647,1790,"Rousseau, Jean-Jacques|1712-1778|Correspondence","The confessions of J.J. Rousseau, citizen of G..."
nnc1.0037106139,"Savage, Richard",1846-1903.,Copyright ed. ...,s,,v.1,Fiction,,Leipzig;Tauchnitz;1899.,1899,1899,,35179607.0,gw,8398383,1899,,The white lady of Khaminavatka; | a story of t...
dul1.ark+=13960=t3nw07208,"Riddell, J. H",,,s,,v.1,,,London;Tinsley Brothers;1866.,1866,1866,PR5227.R36R33 1866,2753964.0,enk,10945362,1866,,The race for wealth
nyp.33433074869615,"Irving, Washington",1783-1859.,Surrey ed.,s,,v. 2,Fiction,,New York;G. P. Putnam;1896.,1896,1896,,8182806.0,nyu,8665326,1896,,"Bracebridge hall; or, The humourists."
nyp.33433068271737,,,New ed.,s,,,NotFiction,,London;F.C. & J. Rivington;1810.,1810,1810,,38289890.0,enk,8627815,1810,Religious aspects|Anecdotes|Tracts,"Cheap repository tracts: entertaining, moral, ..."


In [3]:
# Let's create some new columns. Two of them will be blank.
# One will contain just volume numbers. To that end, let's
# define a function that translates enumcrons to vol
# numbers.

def justvolnumbers(enum):
    
    ''' Returns strictly the numeric part of an enumcron,
    getting rid of the nonstandard 'v. ' or 'V.' It doesn't
    return anything for enums that are like 'c. 2' or 
    'copy 2'--that's not a volume number.
    '''
    
    if pd.isnull(enum) or len(enum) < 1:
        return ''
    elif enum.startswith('c') or enum.startswith('(c'):
        return ''
    else:
        matches = re.findall(r'\d+', enum)
        if len(matches) < 1:
            return ''
        else:
            volnum = int(matches[0])
            if volnum < 200 and volnum > 0:
                return volnum
            else:
                return ''

meta['volnum'] = meta['enumcron'].map(justvolnumbers)
meta['shorttitle'] = ''
meta['parttitle'] = ''
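
As a cross-check on the rules in `justvolnumbers`, here is a minimal vectorized sketch of the same idea. It is not the function used in this notebook: it swaps in `Series.str.extract` for the per-row regex, the `enums` values are made up for illustration, and it leaves missing volume numbers as NaN instead of empty strings.

```python
import pandas as pd

# Made-up enumcron values, for illustration only.
enums = pd.Series(['v.1', 'vol. 2', 'c. 2', 'copy 3', None, 'v. 1865'])

# Copy statements begin with 'c' or '(c', as in justvolnumbers.
is_copy = enums.str.match(r'\(?c', na=False).astype(bool)

# First run of digits in each string; missing values stay missing.
first_number = pd.to_numeric(
    enums.str.extract(r'(\d+)', expand=False), errors='coerce')

# Keep only plausible volume numbers (1 to 199) that aren't copy numbers.
volnum = first_number.where(~is_copy & first_number.between(1, 199))
print(volnum.tolist())
```

Only the first two values survive as volume numbers (1 and 2). The copy statements, the missing value, and the out-of-range 1865 all come back as missing.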

In [4]:
# Just to test what we produced:

meta.head()

docid,author,authordate,contents,datetype,enddate,enumcron,genres,geographics,imprint,imprintdate,...,locnum,oclc,place,recordid,startdate,subjects,title,volnum,shorttitle,parttitle
njp.32101071963472,"Rousseau, Jean-Jacques",1712-1778.,,s,,vol. 2,NotFiction,,"London;Printed for G.G.J. and J. Robinson, and...",1790,...,,16894767.0,enk,8980647,1790,"Rousseau, Jean-Jacques|1712-1778|Correspondence","The confessions of J.J. Rousseau, citizen of G...",2.0,,
nnc1.0037106139,"Savage, Richard",1846-1903.,Copyright ed. ...,s,,v.1,Fiction,,Leipzig;Tauchnitz;1899.,1899,...,,35179607.0,gw,8398383,1899,,The white lady of Khaminavatka; | a story of t...,1.0,,
dul1.ark+=13960=t3nw07208,"Riddell, J. H",,,s,,v.1,,,London;Tinsley Brothers;1866.,1866,...,PR5227.R36R33 1866,2753964.0,enk,10945362,1866,,The race for wealth,1.0,,
nyp.33433074869615,"Irving, Washington",1783-1859.,Surrey ed.,s,,v. 2,Fiction,,New York;G. P. Putnam;1896.,1896,...,,8182806.0,nyu,8665326,1896,,"Bracebridge hall; or, The humourists.",2.0,,
nyp.33433068271737,,,New ed.,s,,,NotFiction,,London;F.C. & J. Rivington;1810.,1810,...,,38289890.0,enk,8627815,1810,Religious aspects|Anecdotes|Tracts,"Cheap repository tracts: entertaining, moral, ...",,,


### volume-part inference

We want to convert a contents statement into a dictionary that maps volume numbers to the titles of individual volumes, like so:

![caption](files/parsed.png)

That's not terribly hard, with a regex:

In [5]:
def volmap(contents):
    
    ''' A function that turns a "contents" statement into a dictionary
    of titles.
    '''
    themap = dict()
    if pd.isnull(contents):
        return themap
    if len(contents) < 4:
        return themap
    
    # Replace Roman numerals with Arabic ones. Order matters: longer numerals
    # must be replaced before the shorter numerals they contain.
    romans = [('XVI.', '16'), ('XV.', '15'), ('XIV.', '14'), ('XIII.', '13'),
              ('XII.', '12'), ('XI.', '11'), ('IX.', '9'), ('X.', '10'),
              ('VIII.', '8'), ('VII.', '7'), ('VI.', '6'), ('IV.', '4'),
              ('V.', '5'), ('III.', '3'), ('II.', '2'), ('I.', '1')]
    for roman, arabic in romans:
        contents = contents.replace(roman, arabic)
    
    sequence = re.findall(r'\D+|\d+', contents)
    
    # The regex above does most of the work in this function, translating the
    # contents statement into a sequence of alternating alphabetic and numeric
    # sections.
    
    if len(sequence) < 3:
        return themap
    
    started = False
    hyphen = False
    
    for s in sequence:
        if s.isdigit() and not started:
            started = True
            nextvols = [int(s)]
            expectation = int(s) + 1
        elif not started:
            pass
        elif s == '-':
            hyphen = True
        elif s.isdigit() and hyphen:
            if int(s) < expectation:
                hyphen = False
                pass
            elif len(nextvols) == 1:
                for i in range(nextvols[0] + 1, int(s) + 1):
                    nextvols.append(i)
                expectation = int(s) + 1
                hyphen = False
            else:
                hyphen = False
                pass
        elif s.isdigit():
            if int(s) == expectation:
                nextvols = [int(s)]
                expectation = int(s) + 1
            else:
                pass
        else:
            for n in nextvols:
                themap[n] = s.strip('., -v[]()')
    
    return themap
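
To see what the tokenizing regex does on its own, here is a stripped-down sketch. The contents statement is made up, and the pairing loop is a deliberate simplification of `volmap`: it ignores hyphenated ranges and the check that volume numbers arrive in sequence.

```python
import re

# A made-up contents statement, for illustration only.
contents = 'v. 1. Waverley.--v. 2. Guy Mannering'

# Split into alternating runs of non-digits and digits.
sequence = re.findall(r'\D+|\d+', contents)
print(sequence)

# Pair each number with the text that follows it.
titles = {}
current = None
for token in sequence:
    if token.isdigit():
        current = int(token)
    elif current is not None:
        # Same strip set as volmap: punctuation, spaces, and the 'v' of 'v.'
        titles[current] = token.strip('., -v[]()')
print(titles)
```

The text before the first number is skipped, and the stray "v." that introduces the next volume is stripped from the end of each title, so this prints `{1: 'Waverley', 2: 'Guy Mannering'}`.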
                                

We also need to clean up titles, by getting rid of the part after "$c," along with various extra punctuation characters.

In [6]:
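def short_title(longtitle):
    # A missing title arrives as NaN (a float), which would break the
    # substring test below.
    if pd.isnull(longtitle):
        return ''
    if "$c" in longtitle: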
        parts = longtitle.split("$c")
        justtitle = parts[0]
    else:
        justtitle = longtitle
    
    shorttitle = justtitle.strip('| /.,').replace(' | ', ' ')
    return shorttitle
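
A quick worked example of those steps, written out one at a time. The raw title is made up in the style of the catalog records; it is not a row from the data.

```python
# A made-up raw title in the catalog's style; not a row from the data.
raw = 'Ivanhoe; | a romance / $c by Sir Walter Scott.'

# Keep only what comes before the authorship statement.
before_c = raw.split('$c')[0]

# Trim separators and punctuation from both ends.
trimmed = before_c.strip('| /.,')

# Collapse the internal ' | ' field separator into a single space.
short = trimmed.replace(' | ', ' ')
print(short)
```

This prints `Ivanhoe; a romance`.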

Now let's actually do the work.

In [None]:
grouped = meta.groupby('recordid')
ctr = 0
for record, group in grouped:
    ctr += 1
    if ctr % 100 == 1:
        print(ctr)
    maxlen = 0
    longest = ''
    for cont in group.contents:
        if pd.isnull(cont):
            continue
        elif len(cont) > maxlen:
            maxlen = len(cont)
            longest = cont
    themap = volmap(longest)

    for idx in group.index:
        volnum = group.loc[idx, 'volnum']
        if type(volnum) == int and volnum in themap:
            meta.loc[idx, 'parttitle'] = themap[volnum]
            meta.loc[idx, 'shorttitle'] = themap[volnum]
        else:
            meta.loc[idx, 'shorttitle'] = short_title(meta.loc[idx, 'title'])
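
The first half of the loop picks the longest contents statement within each record. A minimal sketch of that selection on a toy frame, using `groupby` and `idxmax` instead of an explicit loop; the record ids, docids, and contents here are all hypothetical.

```python
import pandas as pd

# Toy data: two volumes of one record, plus a record with no contents.
toy = pd.DataFrame(
    {'recordid': [101, 101, 202],
     'contents': ['v. 1. A', 'v. 1. A.--v. 2. B', None]},
    index=['doc_a', 'doc_b', 'doc_c'])

# Length of each contents statement, counting missing ones as zero.
lengths = toy['contents'].fillna('').str.len()

# For each record, the docid of the row with the longest statement.
longest_idx = lengths.groupby(toy['recordid']).idxmax()

# Look up the statement itself, indexed by recordid.
longest = longest_idx.map(toy['contents'])
print(longest)
```

Like the loop, `idxmax` keeps the first statement when two are tied for length. Unlike the loop, a record with no contents at all comes back as missing, not as an empty string.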
            
        

### Author standardization

Move those honorifics to the end of the name.

Also, while we're at it, let's redress a couple of historical injustices that affect prominent authors in ways that would complicate deduplication.

In [8]:
def flip_honorific(auth):
    if pd.isnull(auth):
        return ''
    elif auth == 'Ward, Humphry, Mrs' or auth == "Mrs. Humphry Ward" or auth == "Mrs., Ward, Humphry" or auth == 'Ward, Humphry':
        return "Ward, Mary Augusta"
    elif auth == 'Wood, Henry, Mrs' or auth == "Mrs. Henry Wood" or auth == "Mrs., Wood, Henry" or auth == 'Wood, Henry':
        return "Wood, Ellen"
    
    # yes, in principle that's unfair to the real Humphry Ward and Henry Wood
    # however, in practice ...
    
    elif auth.startswith('Sir') or auth.startswith('Mrs'):
        return auth[3: ].strip('. ,') + ', ' + auth[0:3]
    elif auth.startswith('Lady'):
        return auth[4: ].strip('. ,') + ', ' + auth[0:4]
    elif auth.startswith('(') and ')' in auth:
        parts = auth.split(')')
        firstpart = parts[1].strip('., ')
        name = firstpart + " " + parts[0] + ")"
        return name
    elif auth == 'Baron, Dunsany, Edward John Moreton Drax Plunkett':
        return 'Dunsany, Edward John Moreton Drax Plunkett'
    elif auth == 'Baron, Lytton, Edward Bulwer Lytton':
        return 'Lytton, Edward Bulwer Lytton'
    elif auth == 'Baroness, Orczy, Emmuska Orczy':
        return 'Orczy, Emmuska Orczy'
    else:
        return auth

meta['cleanauth'] = meta['author'].map(flip_honorific)
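
The prefix-flipping branch is easiest to check in isolation. Here is a small sketch of just that step; `move_prefix` is a hypothetical helper, not part of the notebook, and the names fed to it are made-up inputs in the formats the function above seems to expect.

```python
def move_prefix(name, prefixes=('Lady', 'Sir', 'Mrs')):
    '''Move a leading honorific to the end of the name.
    A hypothetical helper that mirrors the slicing in flip_honorific.
    '''
    for prefix in prefixes:
        if name.startswith(prefix):
            # Drop the prefix, then trim the punctuation that followed it.
            rest = name[len(prefix):].strip('. ,')
            return rest + ', ' + prefix
    return name

# Made-up inputs, for illustration only.
for raw in ['Sir, Scott, Walter', 'Mrs. Gaskell', 'Lady, Morgan, Sydney', 'Scott, Walter']:
    print(raw, '->', move_prefix(raw))
```

Because the test is a bare `startswith`, any name that merely begins with those letters would be flipped too; that is true of the sketch and of the function above.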

### Date correction

Fixing a few inferred dates, adding a column for last possible date of composition.

In [9]:
meta['latestcomp'] = ''

for idx in meta.index:
    infer = meta.loc[idx, 'inferreddate']
    if int(infer) == 0:
        try:
            newdate = int(meta.loc[idx, 'startdate'])
            if newdate > 1699 and newdate < 2100:
                meta.loc[idx, 'inferreddate'] = newdate
            else:
                newdate = int(meta.loc[idx, 'enddate'])
                if newdate > 1699 and newdate < 2100:
                    meta.loc[idx, 'inferreddate'] = newdate
        except (ValueError, TypeError):
            pass
        
    authdate = meta.loc[idx, 'authordate']
    
    death = 3000
    if not pd.isnull(authdate):
        authdate = authdate.strip(',.')
        if '-' in authdate and len(authdate) > 6:
            try:
                death = int(authdate[-4: ])
            except ValueError:
                death = 3000
        else:
            death = 3000
    
    datetype = meta.loc[idx, 'datetype']
    if datetype == 'c' or datetype == 't' or datetype == 'r':
        try:
            firstpub = int(meta.loc[idx, 'enddate'])
        except (ValueError, TypeError):
            firstpub = 3000
    else:
        firstpub = 3000
    
    infer = int(meta.loc[idx, 'inferreddate'])
    if infer < 1700:
        infer = 2100
    
    if death < 1700:
        death = 2100
    
    if firstpub < 1700:
        firstpub = 2100
    
    meta.loc[idx, 'latestcomp'] = min(death, infer, firstpub)
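
The core of the `latestcomp` logic is a minimum over three candidate years, with sentinel values standing in for anything unknown. A minimal sketch for a single row; `death_year` and `latest_composition` are hypothetical names, and the sentinels (3000 and 2100) are the ones used in the loop above.

```python
UNKNOWN = 3000

def death_year(authordate):
    '''Parse a death year from a string like '1783-1859.', else UNKNOWN.'''
    if not isinstance(authordate, str):
        return UNKNOWN
    cleaned = authordate.strip(',.')
    if '-' in cleaned and len(cleaned) > 6:
        try:
            return int(cleaned[-4:])
        except ValueError:
            return UNKNOWN
    return UNKNOWN

def latest_composition(authordate, inferreddate, firstpub=UNKNOWN):
    '''Earliest of: author's death, inferred date, first publication.'''
    candidates = [death_year(authordate), inferreddate, firstpub]
    # As in the loop above, implausibly early years are treated as unknown.
    candidates = [2100 if year < 1700 else year for year in candidates]
    return min(candidates)

# Dates taken from the Irving row shown earlier: an 1896 reprint.
print(latest_composition('1783-1859.', 1896))
```

For the Irving volume this gives 1859: the edition is dated 1896, but the text can't have been composed after the author's death.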
    
    
            

#### now write to file

In [10]:
cols_in_order = ['author', 'cleanauth', 'authordate',  'inferreddate', 'latestcomp', 'datetype', 'startdate', 'enddate', 'imprint',
 'imprintdate', 'contents', 'genres',  'subjects', 'geographics', 'locnum', 'oclc', 'place', 'recordid',
 'enumcron', 'volnum', 'title', 'parttitle', 'shorttitle']
# Sort into a new frame rather than sorting a slice of meta in place,
# which can trigger pandas' SettingWithCopyWarning.
outmeta = meta[cols_in_order].sort_values(by = ['inferreddate', 'recordid', 'volnum'])
outmeta.to_csv('masterficmetadata.tsv', sep = '\t')