## Second deduplication

This notebook begins with **manifestationmeta.tsv** and moves toward a smaller dataset that aspires to contain only one copy of each "work," in [FRBR terminology](https://en.wikipedia.org/wiki/Functional_Requirements_for_Bibliographic_Records).

However, this reference to FRBR should not be taken very literally. In reality, we're just identifying (relatively) unique title-author pairs, which may or may not line up with "works" and "expressions." I say "relatively" because fuzzy matching is used to allow for minor variations in spelling and punctuation.


In [58]:
import pandas as pd
from difflib import SequenceMatcher
from collections import Counter

### Create blocks

We start by grouping volumes into "blocks." This is purely a time-reduction step, to avoid useless comparisons of very different volumes. Each block is identified by the first five characters of the author's name plus the first six characters of the title. Records that share a block aren't necessarily matches, but they are plausible candidates for matching.

This strategy does unfortunately mean that the first few characters of names become very important, which is why I made some effort to standardize naming in the first deduplication notebook -- moving e.g. "sir" and "mrs" to the end of the name. More could probably be done here: names like "Du Maurier" and "Van Dyck" are potentially tricky.

In [2]:
meta = pd.read_csv('manifestationmeta.tsv', sep = '\t', low_memory = False)

blocks = dict()
for idx in meta.index:
    name = meta.loc[idx, 'author']
    if pd.isnull(name) or len(name) < 5:
        name = 'nan'
        
    title = meta.loc[idx, 'shorttitle']
    
    # note that we use short titles, which means that we'll be using
    # the titles for individual volume parts when available
    
    if pd.isnull(title) or len(title) < 6:
        title = 'default'
    
    blockcode = name[0:5].lower() + title[0:6].lower()
    if blockcode not in blocks:
        blocks[blockcode] = set()
    
    blocks[blockcode].add(idx)

In [3]:
len(blocks)


89178

### Group records into sets that have the same author / title

Generally, the strategy here is to loop through each block, comparing each record to all the other records in the block. If we find sufficient similarity, we make sure they end up in the same "group."

Records that don't match anything else in the block get their own group.
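The grouping step for a single matched pair can be sketched roughly as follows. The helper `add_pair` is hypothetical, and the integer ids stand in for dataframe row indices.

```python
def add_pair(groups, a, b):
    """Hypothetical helper mirroring the grouping step for one matched pair."""
    for g in groups:
        # If either record is already grouped, both join that group.
        if a in g or b in g:
            g.update((a, b))
            return
    # Otherwise the pair starts a new group.
    groups.append({a, b})

groups = []
add_pair(groups, 1, 2)
add_pair(groups, 2, 3)   # 2 is already grouped, so 3 joins it
add_pair(groups, 4, 5)   # neither is grouped yet: new group

print([sorted(g) for g in groups])   # [[1, 2, 3], [4, 5]]
```

Note that in this sketch a pair linking two existing groups joins only the first one found; the two groups themselves are not merged.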

But of course the devil is in the details. For instance, we ignore semicolons and colons, which often substitute for each other in titles. We also cap names at 25 chars and titles at 35 chars, because very long names/titles are often encumbered by extra phrases that do not actually disambiguate anything.
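A small sketch of what cleaning and capping do to a title before comparison. The function `clean` is a hypothetical stand-in for the notebook's `cleanstring`, and the sample titles are invented.

```python
from difflib import SequenceMatcher

def clean(text, cap):
    """Hypothetical stand-in for the notebook's cleanstring()."""
    text = text.replace(';', '').replace(':', '').lower()
    return text[:cap]

# Semicolon and colon variants collapse to the same string.
t1 = clean('Ivanhoe; a romance', 35)
t2 = clean('Ivanhoe: a romance', 35)
print(t1 == t2)      # True

# Long titles are truncated to the cap.
capped = clean("Oliver Twist; or, The parish boy's progress", 35)
print(len(capped))   # 35

# Minor spelling variation still scores above the 0.88 title cutoff.
score = SequenceMatcher(None, clean('Waverley', 35), clean('Waverly', 35)).ratio()
print(score > 0.88)  # True
```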


In [43]:
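def probablymatch(str1, str2):
    
    m = SequenceMatcher(None, str1, str2)
    # real_quick_ratio() is a cheap upper bound on ratio(), so we only
    # pay for the full comparison when a match is still possible.
    # All the cutoffs used below are higher than 0.75.
    match = m.real_quick_ratio()
    if match > 0.75:
        match = m.ratio()
    
    return match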

def cleanstring(astring, cap):
    astring = astring.replace(';', '')
    astring = astring.replace(':', '')
    astring = astring.lower()
    if len(astring) > cap:
        astring = astring[0 : cap]
    return astring
    
groups = []
dubiouscalls = []

ctr = 0
for code, block in blocks.items():
    ctr += 1
    if ctr % 10000 == 1:
        print(ctr)
    
    already_checked = dict()
    titledict = dict()
    authdict = dict()
    
    # we clean all the titles and authors in the block before 
    # attempting to match; otherwise you end up doing
    # n x n cleaning operations.
    
    for b in block:
        auth = meta.loc[b, 'author']
        if pd.isnull(auth) or len(auth) < 4:
            auth = 'cannot-match'
        else:
            auth = cleanstring(auth, 25)
        
        title = meta.loc[b, 'shorttitle']
        if pd.isnull(title) or len(title) < 5:
            title = 'cannot-match'
        else:
            title = cleanstring(title, 35)
        
        titledict[b] = title
        authdict[b] = auth
           
    for b1 in block:
        matched = False
        for b2 in block:
            if b1 == b2:
                continue
            if (str(b1) + ' ' + str(b2)) in already_checked:
                if not matched:
                    matched = already_checked[str(b1) + ' ' + str(b2)]
                continue
            
            auth1 = authdict[b1]
            auth2 = authdict[b2]
            title1 = titledict[b1]
            title2 = titledict[b2]
            
            if auth1 == 'cannot-match' or auth2 == 'cannot-match':
                already_checked[str(b2) + ' ' + str(b1)] = False
                continue
            if title1 == 'cannot-match' or title2 == 'cannot-match':
                already_checked[str(b2) + ' ' + str(b1)] = False
                continue
            
            if auth1 == auth2:
                authormatch = 1.0
            else:
                authormatch = probablymatch(auth1, auth2)
                if authormatch < 0.9:
                    already_checked[str(b2) + ' ' + str(b1)] = False
                    continue
            
            if title1 == title2:
                titlematch = 1.0
            else:
                titlematch = probablymatch(title1, title2)
                if titlematch < 0.88:
                    already_checked[str(b2) + ' ' + str(b1)] = False
                    continue
            
            if authormatch + titlematch < 1.85:
                already_checked[str(b2) + ' ' + str(b1)] = False
                continue
            elif authormatch + titlematch < 1.91:
                outline = auth1 + " | " + title1 + '\n' + auth2 + ' | ' + title2 + '\n' + str(authormatch + titlematch) + '\n'
                dubiouscalls.append(outline)           
            
            # we have a match!
            matched = True
            found = False
            for g in groups:
                if b1 in g or b2 in g:
                    g.add(b1)
                    g.add(b2)
                    found = True
                    break

            if not found:
                groups.append({b1, b2})
            
            already_checked[str(b2) + ' ' + str(b1)] = True
            
        if not matched:
            groups.append({b1})
                
                

1
10001
20001
30001
40001
50001
60001
70001
80001


### Write dubious calls to a file where they can be inspected

The cutoffs above were adjusted manually, and arbitrarily. Matches on the low end (a combined author + title score of at least 1.85 but below 1.91) are collected in a list named "dubiouscalls." The cells below write that list out.

By and large, I'm comfortable with these, though there are a few obvious errors; a couple of Bobbsey Twins books get grouped that shouldn't be grouped, etc.
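The way the cutoffs sort a pair into rejected, dubious, or accepted can be summarized in a short sketch. The function `classify` is a hypothetical condensation of the checks in the matching loop, not code from the notebook itself.

```python
def classify(authormatch, titlematch):
    """Hypothetical summary of the cutoffs used in the matching loop."""
    # Each score must clear its own cutoff first.
    if authormatch < 0.9 or titlematch < 0.88:
        return 'reject'
    total = authormatch + titlematch
    if total < 1.85:
        return 'reject'
    if total < 1.91:
        return 'dubious'   # grouped, but also logged for inspection
    return 'match'

print(classify(1.0, 1.0))    # match
print(classify(1.0, 0.88))   # dubious
print(classify(0.9, 0.9))    # reject
```

A "dubious" pair is still treated as a match; it is simply recorded so that it can be checked by eye.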

In [44]:
len(dubiouscalls)

16946

In [45]:
with open('dubiouscalls.txt', mode = 'w', encoding = 'utf-8') as f:
    for d in dubiouscalls:
        f.write(d)

### A little exploratory description

E.g., how many groups do we have? How big is the biggest?

In [46]:
len(groups)

128296

In [47]:
maxsize = 0
for g in groups:
    if len(g) > maxsize:
        maxsize = len(g)
print(maxsize)

317


### Now the actual deduplication

In principle, we want to take one volume from each group of volumes that have matching titles and authors. And in general we want to take the earliest volume, so our resulting dataset will be dated as close as possible to dates of first publication.

However, there are complicating cases. What if, for instance, the earliest instance of a novel is a Victorian three-decker edition? That's going to happen pretty often. In that case, we don't want to take *just one volume* from the group; we want all three volumes of the earliest edition. So we need a new rule: take all volumes sharing the *recordid* of the earliest volume. That will get all three volumes of a three-volume edition.

But we confront yet another complication! Volumes grouped by a recordid are sometimes three volumes of a single work. But often they are, say, 28 volumes in the *Collected Works of Scott.* They all share a single recordid, but they are not all the same fictional work. Some of the longer novels may be spread across two or three volumes, but many of the volumes represent a single novel. This gets bloody complicated.

So our *new* rule is: find the earliest volume. Get its recordid. Find all volumes sharing that recordid (all volumes in the same set). Then take all the volumes whose *short title* closely matches that of the earliest volume. If we have been able to identify vols 11 and 12 as *Ivanhoe,* this will get just 11 and 12. However, if we haven't been able to identify titles beyond *Collected Works of Scott,* we'll get all 28 vols! So the final rule is: ignore cases where we recover more than five vols sharing the same recordid and short title. We suspect these are collected works.
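The rule can be sketched on a handful of invented records. The helper `select_from_group` and the toy data are hypothetical, and the sketch leaves out the handling of missing and pre-1700 dates.

```python
from difflib import SequenceMatcher

# Invented toy records: a multi-work set (recordid 100) and a later reprint.
volumes = [
    {'id': 1, 'recordid': 100, 'shorttitle': 'Ivanhoe', 'date': 1820},
    {'id': 2, 'recordid': 100, 'shorttitle': 'Ivanhoe', 'date': 1820},
    {'id': 3, 'recordid': 100, 'shorttitle': 'Rob Roy', 'date': 1820},
    {'id': 4, 'recordid': 200, 'shorttitle': 'Ivanhoe', 'date': 1871},
]

def select_from_group(group_ids, volumes, max_vols=5):
    """Hypothetical helper: apply the selection rule to one group."""
    by_id = {v['id']: v for v in volumes}
    # 1. Find the earliest volume in the author/title group.
    earliest = min((by_id[i] for i in group_ids), key=lambda v: v['date'])
    # 2. Collect every volume that shares its recordid ...
    same_record = [v for v in volumes if v['recordid'] == earliest['recordid']]
    # 3. ... and keep those whose short title is a close match.
    matching = [
        v['id'] for v in same_record
        if SequenceMatcher(None, earliest['shorttitle'], v['shorttitle']).ratio() > 0.9
    ]
    # 4. More than max_vols matches suggests a collected-works set: skip it.
    return matching if len(matching) <= max_vols else []

# The group holds every Ivanhoe volume; only the earliest edition survives.
print(select_from_group([1, 2, 4], volumes))   # [1, 2]
```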

As we do this, we are going to want to keep track of the number of copies of a volume that have been collapsed into a single deduplicated record. We'll use a column of "instances" created in the earlier stage of deduplication; this counts vols that had the same recordid+volnum. We'll further aggregate that into "copies": vols that had the same author/title. Moreover, since we may want to distinguish *contemporary* popularity from later canonicity, we're going to keep track of this in two different ways: a general column of copies and a column of copies-published-within-25-yrs of our first example.
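A minimal sketch of the two counts, assuming each volume in a group contributes a (date, instances) pair. The function name `count_copies` and the numbers are invented for illustration.

```python
from collections import Counter

def count_copies(dated_instances, earliest_date, window=25):
    """Hypothetical helper: total copies, and copies dated within the window."""
    per_year = Counter()
    for date, instances in dated_instances:
        per_year[date] += instances
    all_copies = sum(per_year.values())
    # Strictly less than earliest_date + window, so year 25 itself is excluded.
    within = sum(n for d, n in per_year.items() if d < earliest_date + window)
    return all_copies, within

group = [(1820, 2), (1830, 1), (1844, 1), (1845, 3), (1900, 4)]
print(count_copies(group, 1820))   # (11, 4)
```

Here eleven copies survive in total, but only four date from the first 25 years after the earliest volume.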

In [61]:
selected = []
ignored = []
errors = 0
authtitlecopies = dict()
copiesin25yrs = dict()

ctr = 0
for g in groups:
    ctr += 1
    if ctr % 10000 == 1:
        print(ctr)
    
    # Some groups contain only a single volume.
    if len(g) == 1:
        e = next(iter(g))  # the group's only member
        selected.append(e)
        authtitlecopies[e] = int(meta.loc[e, 'instances'])
        copiesin25yrs[e] = authtitlecopies[e]
        # For a single volume, all these quantities will be the same.
        continue
        
    if len(g) < 1:
        errors += 1
        continue
    
    earliest = ''
    earliestdate = 2100
    instancectr = Counter()
    
    for element in g:
        date = meta.loc[element, 'inferreddate']
        copies = int(meta.loc[element, 'instances'])
        
        if pd.isnull(date):
            date = 2100
        else:
            date = int(date)
        
        instancectr[date] += copies
        
        if earliestdate == 2100 or date < earliestdate:
            earliestdate = date
            earliest = element
            if earliestdate < 1700:
                earliestdate = 2100
                # don't reward dubious dates
    
    # now let's add up those copies
    allcopies = 0
    copiesin25yrsofearliest = 0
    
    for date, count in instancectr.items():
        allcopies += count
        if date < (earliestdate + 25):
            copiesin25yrsofearliest += count
            
    record = meta.loc[earliest, 'recordid']
    title2match = str(meta.loc[earliest, 'shorttitle'])

    matching = []

    thisrec = meta.loc[meta.recordid == record, : ]
    for idx in thisrec.index:
        thistitle = str(thisrec.loc[idx, 'shorttitle'])
        match = probablymatch(title2match, thistitle)
        if match > 0.9:
            matching.append(idx)
    
    if len(matching) < 6:
        selected.extend(matching)
        for m in matching:
            authtitlecopies[m] = allcopies
            copiesin25yrs[m] = copiesin25yrsofearliest
    else:
        ignored.append((title2match, record))
        
print(errors)          

1
10001
20001
30001
40001
50001
60001
70001
80001
90001
100001
110001
120001
0


### Some exploratory description

For instance, how many records did we select? How many groups of vols were ignored?

In [62]:
print(len(selected))


135299


In [63]:
print(len(ignored))

255


In [64]:
ignored[0:20]

[('[Complete works', 1417287),
 ('The life of a lover. In a series of letters', 321626),
 ('Works, in an English translation', 1364287),
 ('The novels of Captain Marryat', 9245242),
 ('Scenes of Parisian life;', 1203519),
 ('Scenes of private life;', 1203519),
 ('Scenes of private life;', 1203519),
 ('Scenes of provincial life;', 1203519),
 ('Scenes of provincial life;', 1203519),
 ('Scenes of Parisian life', 7678129),
 ("The world's one hundred best short stories", 6511333),
 ("[Scott's novels]", 8665211),
 ("Journeys through Bookland; a new and original plan for reading, applied to the world's best literature for children",
  5543768),
 ('Complete writings of O. Henry [i.e. W.S. Porter]', 1376739),
 ('Works', 8881896),
 ('The works of Louise M?_hlbach in eighteen volumes', 7707100),
 ('The Riverside readers', 7910637),
 ('The real America in romance', 9909118),
 ('Philosophic and analytic studies;', 1203519),
 ('Philosophic and analytic studies;', 1203519)]

In [70]:
# Let's write the ignored records to file

with open('ignoredgroups.tsv', mode = 'w', encoding = 'utf-8') as f:
    for title, record in ignored:
        f.write(title + '\t' + str(record) + '\n')

### Now actually produce and write the dataframe

All of our effort so far has gone into selecting a list of indices that will be retained. Now we have to use those indices to actually produce a new dataframe.

In [65]:
# like so

deduped = meta.loc[selected, : ]

In [66]:
deduped.head()

Unnamed: 0,docid,oldauthor,author,authordate,inferreddate,latestcomp,datetype,startdate,enddate,imprint,...,locnum,oclc,place,recordid,enumcron,volnum,title,parttitle,shorttitle,instances
81024,uva.x000677513,"Hope, Laura Lee","Hope, Laura Lee",,1920,1920,s,1920,,New York;Grosset & Dunlap;c1920.,...,,4153330,nyu,9795572,,,The Bobbsey twins in the great West / | $c: by...,,The Bobbsey twins in the great West,1
82817,nyp.33433082332069,"Hope, Laura Lee","Hope, Laura Lee",,1922,1922,s,1922,,New York;Grosset and Dunlap;1922,...,,1410101,nyu,5805688,,,The Bobbsey Twins at the county fair / | $c: b...,,The Bobbsey Twins at the county fair,1
71530,nyp.33433082332028,"Hope, Laura Lee","Hope, Laura Lee",,1913,1913,s,1913,,New York;Grosset & Dunlap;c1913,...,PZ7.H772Bocs,2568839,nyu,8689225,,,"The Bobbsey twins at school, | $c: by Laura Le...",,The Bobbsey twins at school,1
60840,nyp.33433082332010,"Hope, Laura Lee","Hope, Laura Lee",,1907,1907,s,1907,,New York;Grosset & Dunlap;c1907.,...,PS3515.O585B603 1907,9613473,nyu,5104160,,,The Bobbsey twins at the seashore / | $c: by L...,,The Bobbsey twins at the seashore,1
78570,nyp.33433082344874,"Hope, Laura Lee","Hope, Laura Lee",,1919,1919,s,1919,,New York;Grosset & Dunlap;1919.,...,,2109184,nyu,5346603,,,The Bobbsey twins in Washington. / | $c: By La...,,The Bobbsey twins in Washington,1


#### Add copy counts

Before we write out the dataframe, we add columns reflecting the number of copies collapsed into each record.

In [67]:
def get_copy_count(idx, dictionary):
    return dictionary[idx]

deduped = deduped.assign(allcopiesofwork = deduped.apply(lambda row: get_copy_count(row.name, authtitlecopies), axis = 1))
deduped = deduped.assign(copiesin25yrs = deduped.apply(lambda row: get_copy_count(row.name, copiesin25yrs), axis = 1))



In [72]:
print(deduped.columns)

Index(['docid', 'oldauthor', 'author', 'authordate', 'inferreddate',
       'latestcomp', 'datetype', 'startdate', 'enddate', 'imprint',
       'imprintdate', 'contents', 'genres', 'subjects', 'geographics',
       'locnum', 'oclc', 'place', 'recordid', 'enumcron', 'volnum', 'title',
       'parttitle', 'shorttitle', 'instances', 'allcopiesofwork',
       'copiesin25yrs'],
      dtype='object')


In [74]:
# sort rows
deduped.sort_values(by = ['inferreddate', 'recordid', 'volnum'], inplace = True)

# put columns in desired order (title last)
deduped = deduped[['docid', 'oldauthor', 'author', 'authordate', 'inferreddate',
       'latestcomp', 'datetype', 'startdate', 'enddate', 'imprint',
       'imprintdate', 'contents', 'genres', 'subjects', 'geographics',
       'locnum', 'oclc', 'place', 'recordid', 'instances', 'allcopiesofwork',
       'copiesin25yrs', 'enumcron', 'volnum', 'title',
       'parttitle', 'shorttitle']]

# write to file
deduped.to_csv('workmeta.tsv', sep = '\t', index = False)