# Final transformation of title metadata

This is a final stage of processing on the file formerly called ```workmeta.tsv``` -- now ```titlemeta.tsv```. It corrects some errors that slipped into the file, especially in the "numbers of copies" associated with particular records.

At the same time, we absorb the probabilistic information in ```recordmeta.tsv``` and migrate it across to the smaller ```titlemeta.tsv``` file.

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import pearsonr
from collections import Counter

In [2]:
title = pd.read_csv('../titlemeta.tsv', sep = '\t', index_col = 'docid',
                   low_memory = False)

In [6]:
record = pd.read_csv('../recordmeta.tsv', sep = '\t', 
                     index_col = 'docid', low_memory = False)

### Fixing the "copies" calculation

In the second deduplication notebook, I

    a) grouped volumes that share the same author and title, and
    b) counted the number of instances of those volumes.

I then used those counts to fill ```allcopiesofwork``` and ```copiesin25yrs```. The only problem is that multi-volume works (e.g. triple-decker novels) got credit for *all three volumes*; thus if you construct a list of the most-reprinted books, they're almost all three- and four-volume novels!

"Ah," you'll say, "your error was that you should have grouped volumes that shared the same author, title and *enumcron/volume number*." That way vol 2 of *Ivanhoe* would only get credit for other volume twos of *Ivanhoe.*

That sounds plausible at first, except that deduplication often merges records with different numbers of volumes--e.g. one-, two-, and three-volume editions of *Ivanhoe*!

Fortunately, we saved the groupings created in the second deduplication notebook. This makes a more principled approach possible: go back and count the distinct record ids in each grouping, then compute the total number of copies by adding up instances for each *record*, not each *volume*.
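
To make the difference concrete, here is a minimal sketch with made-up numbers. The docids, record ids, and instance counts are hypothetical and not drawn from the data. It contrasts counting every volume in a group with counting each record once.

```python
# Hypothetical group: one three-volume edition (record 'r1') and one
# single-volume edition (record 'r2') of the same work.
group = [
    ('doc1', 'r1', 1),   # (docid, recordid, instances) for vol. 1
    ('doc2', 'r1', 1),   # vol. 2
    ('doc3', 'r1', 1),   # vol. 3
    ('doc4', 'r2', 1),   # one-volume edition
]

# Volume-level count: every volume adds to the total.
per_volume = sum(instances for _, _, instances in group)

# Record-level count: average the volumes within a record,
# then sum across records.
by_record = {}
for _, recordid, instances in group:
    by_record.setdefault(recordid, []).append(instances)
per_record = sum(sum(v) / len(v) for v in by_record.values())

print(per_volume, per_record)
```

With these invented numbers the volume-level total is 4, while the record-level total is 2.0: one copy of the three-volume edition plus one copy of the single-volume edition.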

#### first get the record IDs

In [7]:
# docid -> number of instances, and docid -> record id
copiesperdoc = record['instances'].astype(int).to_dict()
recordfordoc = record['recordid'].to_dict()

#### then read in the groupings created by the second dedup notebook

In [8]:
groupdict = dict()
with open('../dedup/allgroups.tsv', encoding = 'utf-8') as f:
    for line in f:
        docs = set(line.strip().split('\t'))
        for d in docs:
            groupdict[d] = docs
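
As the loop above implies, each line of `allgroups.tsv` is read as one group of tab-separated docids. A minimal sketch of the same parsing, using invented docids and an in-memory string in place of the real file (the names `sample` and `demo_groups` are hypothetical):

```python
import io

# Two hypothetical groups; the docids are invented for illustration.
sample = "docA\tdocB\tdocC\ndocD\n"

demo_groups = dict()
for line in io.StringIO(sample):
    docs = set(line.strip().split('\t'))
    for d in docs:
        # every member of a group points at the same shared set object
        demo_groups[d] = docs

print(demo_groups['docB'])
print(demo_groups['docA'] is demo_groups['docC'])
```

Because all members share one set, looking up any docid returns its whole group, including the docid itself.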

#### finally, add up the fractional instance values associated with titles

For each title, find the associated group. Add up all the fractional instance values or, in the case of ```copiesin25yrs```, only those found in the 25 years after a title's first appearance in Hathi.

This produces a value that is *roughly* the number of copies of a book's complete text found in Hathi. Since the volumes of a set are not always present in the same number of instances, averaging within a record can easily yield fractional values.
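
A small worked example may help here. It is only a sketch: the record ids, dates, and instance counts are invented, and the helper `copies` is a hypothetical name. It shows how a set with unevenly scanned volumes yields a fractional total, and how the 25-year window drops later reprints.

```python
# Hypothetical volumes in one group: (recordid, inferreddate, instances).
volumes = [
    ('r1', 1820, 2), ('r1', 1820, 1), ('r1', 1820, 2),  # three-volume set
    ('r2', 1850, 1),                                     # later one-volume reprint
]
first_date = 1820  # date of the title being counted (assumed)

def copies(vols):
    """Sum, across records, of the mean instance count per record."""
    by_record = {}
    for recordid, _, instances in vols:
        by_record.setdefault(recordid, []).append(instances)
    return sum(sum(v) / len(v) for v in by_record.values())

all_copies = copies(volumes)
# keep only volumes dated no more than 25 years after the title
within_25 = copies([v for v in volumes if v[1] <= first_date + 25])

print(all_copies, within_25)
```

Here the three-volume set contributes 5/3 of a copy because its second volume appears only once, so the totals come to about 2.67 overall and about 1.67 within the 25-year window.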

In [11]:
allcop = []
cop25 = []

for doc, row in title.iterrows():

    group = groupdict[doc]
    # map each record id to the instance counts of its volumes
    allcopies = dict()
    copies25 = dict()
    thisdate = row.inferreddate

    for d in group:
        # rewrite ids containing ':/' so they match the index of record
        # (':' becomes '+', '/' becomes '=')
        if ':/' in d:
            d = d.replace(':', '+')
            d = d.replace('/', '=')
        ddate = record.loc[d, 'inferreddate']
        r = recordfordoc[d]
        instances4d = copiesperdoc[d]
        allcopies.setdefault(r, []).append(instances4d)

        # count toward the 25-year total only if this volume is dated
        # no more than 25 years after the title
        if thisdate + 25 >= ddate:
            copies25.setdefault(r, []).append(instances4d)

    # average within each record (the volumes of a set),
    # then sum across records
    ac = sum(np.mean(v) for v in allcopies.values())
    c25 = sum(np.mean(v) for v in copies25.values())

    allcop.append(ac)
    cop25.append(c25)
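
The id rewriting inside the loop above can be isolated in a small helper. This is a sketch only: the function name `normalize_docid` and the sample ids are hypothetical, and it assumes, as the loop does, that only ids containing `:/` need rewriting.

```python
def normalize_docid(docid):
    """Replace ':' with '+' and '/' with '=' in ids that contain ':/'."""
    if ':/' in docid:
        docid = docid.replace(':', '+').replace('/', '=')
    return docid

print(normalize_docid('lib.ark:/12345/abc'))   # lib.ark+=12345=abc
print(normalize_docid('lib.123456'))           # lib.123456
```

Keeping the rule in one place would make it easier to apply the same rewriting wherever ids from the groupings file are looked up in the metadata tables.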

#### some EDA to confirm I did that roughly right

In [12]:
pearsonr(allcop, title.allcopiesofwork)

(0.96771458781090858, 0.0)

In [13]:
pearsonr(cop25, title.copiesin25yrs)

(0.93577806952108133, 0.0)

In [14]:
pearsonr(allcop, title.copiesin25yrs)

(0.68784454622681268, 0.0)

In [15]:
title = title.assign(allcopiesofwork = allcop)
title = title.assign(copiesin25yrs = cop25)

In [16]:
title.columns

Index(['oldauthor', 'author', 'authordate', 'inferreddate', 'latestcomp',
       'datetype', 'startdate', 'enddate', 'imprint', 'imprintdate',
       'contents', 'genres', 'subjects', 'geographics', 'locnum', 'oclc',
       'place', 'recordid', 'instances', 'allcopiesofwork', 'copiesin25yrs',
       'enumcron', 'volnum', 'title', 'parttitle', 'earlyedition',
       'shorttitle', 'nonficprob', 'juvenileprob'],
      dtype='object')

In [17]:
title.shape

(138137, 29)

In [18]:
title.to_csv('../titlemeta.tsv', sep = '\t', index_label = 'docid')