# Fuzzy-matching canonical volumes

Richard So provided lists of novels canonized by Norton or Heath. This notebook takes the British Norton list (`canon/britnorton.csv`) and uses fuzzy matching to identify the corresponding Hathi volumes. We match against deduplicated metadata so that each match is the earliest available copy.

In [3]:
import pandas as pd
from difflib import SequenceMatcher
from collections import Counter

In [4]:
norton = pd.read_csv('canon/britnorton.csv')
norton.head()

Unnamed: 0,author,title,firstpub
0,"Conrad, Joseph",The secret Agent,1907
1,"Conrad, Joseph",Heart of Darkness,1899
2,"Dickens, Charles",Hard Times,1854
3,"Brontë, Charlotte",Jane Eyre,1847
4,"Austen, Jane",Pride and Prejudice,1813


In [10]:
meta = pd.read_csv('../supplement2/supp2nationalitymeta.tsv', sep = '\t')
meta.shape

(39817, 12)

### Create "blocks" of records for speedier processing

Fuzzy matching can be pretty slow if you have to check each author/title combination against every record in the metadata (39,817 rows here). So it's conventional to group records into "blocks" and compare only within a block. Here we'll use the first two letters of the author's last name.

In reality, the number of books we're dealing with here isn't overwhelming, but I'm in the habit of doing this, and it sets a good example.

In [11]:
def blockcode(aname):
    if pd.isnull(aname):
        block = 'xx'
    elif len(aname) < 3:
        block = 'nn'
    else:
        block = aname.lower()[0:2]
    
    return block

meta = meta.assign(block = meta.author.map(blockcode))

byblock = meta.groupby('block')
block_dictionary = dict()

for code, df in byblock:
    block_dictionary[code] = df
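
To get a feel for what blocking saves, here is a rough sketch that counts comparisons with and without it. The author lists are made up for illustration, and this `blockcode` is a simplified plain-Python stand-in for the one above (it treats `None` as a missing author instead of calling `pd.isnull`).

```python
from collections import Counter

def blockcode(aname):
    # Simplified stand-in for the notebook's rule; None means "no author".
    if aname is None:
        return 'xx'
    elif len(aname) < 3:
        return 'nn'
    return aname.lower()[0:2]

# Made-up records standing in for the metadata table.
records = ['Conrad, Joseph', 'Collins, Wilkie', 'Austen, Jane',
           'Auden, W. H.', 'Dickens, Charles', None]
# Made-up names standing in for the canon list.
queries = ['Conrad, Joseph', 'Austen, Jane', 'Shelley, Mary']

# How many records fall into each block.
block_sizes = Counter(blockcode(r) for r in records)

# Without blocking, every query is compared with every record.
naive = len(queries) * len(records)
# With blocking, each query is compared only with its own block
# (a Counter returns 0 for a block that does not exist).
blocked = sum(block_sizes[blockcode(q)] for q in queries)

print(naive, blocked)
```

The tradeoff is that a record whose author name differs in its first two letters lands in a different block and can never be matched, however similar the rest of the name is.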

### Now the actual matching

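We check author first, then title. If the author similarity is at least 0.74 and the title similarity is above 0.7, we add the record to a list of candidates, scored by the product of the two similarities. After sorting, we take the top-scoring match.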
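
The matching function in the next cell calls `real_quick_ratio()` before `ratio()`. As a small illustration of why that helps: `real_quick_ratio()` is a fast upper bound on `ratio()` that only looks at string lengths, so a low value rules a pair out cheaply, while a high value still has to be confirmed. The helper name `both_ratios` and the test strings are just for this sketch.

```python
from difflib import SequenceMatcher

def both_ratios(a, b):
    # Return (cheap upper bound, real similarity) for a pair of strings.
    m = SequenceMatcher(None, a, b)
    return m.real_quick_ratio(), m.ratio()

# Same length, no characters in common: the bound is uninformative,
# so the full ratio() is needed to reject the pair.
same_length = both_ratios('abcd', 'wxyz')

# Very different lengths: the bound alone is enough to reject the pair.
short_vs_long = both_ratios('emma', 'a portrait of the artist as a young man')

# Identical strings: both measures agree.
identical = both_ratios('jane eyre', 'jane eyre')

print(same_length, short_vs_long, identical)
```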

In [12]:
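def fuzzymatch(str1, str2):
    # real_quick_ratio() is a cheap upper bound on ratio(), so the
    # expensive ratio() is computed only for pairs that could clear 0.7.
    m = SequenceMatcher(None, str1, str2)
    match = m.real_quick_ratio()
    if match > 0.7:
        match = m.ratio()

    return match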

allmatches = []

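for idx, row in norton.iterrows():
    normname = row['author'].lower()

    # Compare only the first 28 characters of each title, because Hathi
    # titles often carry subtitles and statements of responsibility.
    title = row['title'].lower()
    if len(title) > 28:
        title = title[0:28]
    # Same two-letter block code used to group the metadata.
    code = normname[0:2]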
    
    possiblematches = []
    
    if code in block_dictionary:
        block = block_dictionary[code]
        for idx2, row2 in block.iterrows():
            auth2match = row2['author'].lower()
            # Drop parenthetical name expansions such as "(Herbert George)".
            if '(' in auth2match:
                auth2match = auth2match.split('(')[0]
            authsimilarity = fuzzymatch(normname, auth2match)
            if authsimilarity < 0.74:
                continue
            
            title2match = row2['title']
            if pd.isnull(title2match):
                title2match = 'x x x'
            else:
                title2match = title2match.lower()
                if len(title2match) > 28:
                    title2match = title2match[0:28]
                
            titlesimilarity = fuzzymatch(title, title2match)
            if titlesimilarity > 0.7:
                possiblematches.append((authsimilarity * titlesimilarity, idx2))

    if len(possiblematches) > 0:
        possiblematches.sort()
        probability, matchidx = possiblematches[-1]
        
        print(normname, ' | ', title, block.loc[matchidx, 'author'], ' == ', block.loc[matchidx, 'title'])
        allmatches.append((block.loc[matchidx, 'docid'], idx))
    else:
        print(normname, ' || ', title, "NO MATCH")

conrad, joseph  |  the secret agent Conrad, Joseph  ==  The secret agent
conrad, joseph  ||  heart of darkness NO MATCH
dickens, charles  ||  hard times NO MATCH
brontë, charlotte  |  jane eyre Brontë, Charlotte  ==  Jane Eyre
austen, jane  |  pride and prejudice Austen, Jane  ==  Pride and Prejudice
hardy, thomas  |  jude the obscure Hardy, Thomas  ==  Jude the obscure, | $c: by Thomas Hardy.
austen, jane  ||  persuasion NO MATCH
ford, ford madox  |  the good soldier Ford, Ford Madox  ==  The good soldier; | a tale of passion, | $c: by Ford Madox Hueffer.
shelley, mary  ||  frankenstein NO MATCH
austen, jane  ||  emma NO MATCH
gaskell, elizabeth  |  mary barton Gaskell, Elizabeth Cleghorn  ==  Mary Barton
wells, h g  |  the time machine Wells, H. G. (Herbert George)  ==  The Time Machine
woolf, virginia  |  jacob's room Woolf, Virginia  ==  Jacob's room.
joyce, james  |  portrait of the artist as a  Joyce, James  ==  A portrait of the artist as a young man / | $c: by James Joyce.
wild
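
Two of the matches above depend on the trimming steps in the loop: cutting the Hathi author at the first parenthesis and comparing only the first 28 characters of each title. The sketch below reruns those two comparisons with and without trimming, using strings copied from the output above (lowercased). It calls `ratio()` directly rather than `fuzzymatch`, and the helper name `ratio` is just for this example.

```python
from difflib import SequenceMatcher

def ratio(a, b):
    # Plain similarity score, without the quick pre-check.
    return SequenceMatcher(None, a, b).ratio()

# Author: the parenthetical expansion drags the raw score down.
norton_author = 'wells, h g'
hathi_author = 'wells, h. g. (herbert george)'
author_raw = ratio(norton_author, hathi_author)
author_trimmed = ratio(norton_author, hathi_author.split('(')[0])

# Title: the subtitle and statement of responsibility do the same.
norton_title = 'the good soldier'
hathi_title = 'the good soldier; | a tale of passion, | $c: by ford madox hueffer.'
title_raw = ratio(norton_title, hathi_title)
title_trimmed = ratio(norton_title[0:28], hathi_title[0:28])

print(round(author_raw, 2), round(author_trimmed, 2))
print(round(title_raw, 2), round(title_trimmed, 2))
```

In both cases the untrimmed score falls below the threshold used in the loop (0.74 for authors, 0.7 for titles) and the trimmed score clears it.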

In [13]:
print("Number of matches: ", len(allmatches))
print(len(set(allmatches)))

Number of matches:  23
23
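
One caveat about the second number: each entry in `allmatches` is a `(docid, idx)` pair, and the loop appends at most one pair per Norton row, so the pairs are always distinct and the set can never be smaller than the list. To check whether two Norton rows were matched to the same Hathi volume, it is more informative to count the docids on their own. A minimal sketch, using hypothetical docids rather than real ones:

```python
from collections import Counter

# Hypothetical (docid, row index) pairs shaped like allmatches.
example_matches = [('vol.001', 0), ('vol.002', 3), ('vol.001', 7)]

# The pairs are all distinct even though one docid repeats.
distinct_pairs = len(set(example_matches))

# Counting docids alone exposes the repeat.
docid_counts = Counter(docid for docid, idx in example_matches)
repeated = {docid: n for docid, n in docid_counts.items() if n > 1}

print(distinct_pairs, repeated)
```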


In [14]:
norton['docid'] = float('nan')

In [15]:
for docid, idx in allmatches:
    norton.loc[idx, 'docid'] = docid

In [16]:
norton.to_csv('canon/britnorton.tsv', sep = '\t', index = False)