# Final transformation of title metadata

This is a final stage of processing on the file formerly called ```workmeta.tsv```, now ```titlemeta.tsv```. It corrects some errors that slipped into the file, especially in the "numbers of copies" associated with particular records.

At the same time, we absorb the probabilistic information in ```recordmeta.tsv``` (the ```nonficprob``` and ```juvenileprob``` columns) and migrate it across to the smaller titlemeta file.

In [29]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import pearsonr

In [3]:
title = pd.read_csv('../titlemeta.tsv', sep = '\t', index_col = 'docid',
                   low_memory = False)

In [4]:
record = pd.read_csv('../recordmeta.tsv', sep = '\t', 
                     index_col = 'docid', low_memory = False)

In [6]:
print(title.columns.tolist())

['oldauthor', 'author', 'authordate', 'inferreddate', 'latestcomp', 'datetype', 'startdate', 'enddate', 'imprint', 'imprintdate', 'contents', 'genres', 'subjects', 'geographics', 'locnum', 'oclc', 'place', 'recordid', 'instances', 'allcopiesofwork', 'copiesin25yrs', 'enumcron', 'volnum', 'title', 'parttitle', 'earlyedition', 'shorttitle']


### Fixing the "copies" calculation

In the second deduplication notebook, I

    a) grouped volumes that share the same author and title, and
    b) counted the number of instances of those volumes.

I then used those counts to fill ```allcopiesofwork``` and ```copiesin25yrs```. The problem is that multi-volume works (e.g. triple-decker novels) got credit for *all three volumes*; thus if you construct a list of the most-reprinted books, they're almost all three- and four-volume novels!

"Ah," you'll say, "your error was that you should have grouped volumes that shared the same author, title and *enumcron/volume number*." That way vol 2 of *Ivanhoe* would only get credit for other volume twos of *Ivanhoe.*

That sounds plausible at first, but deduplication often merges records with different numbers of volumes (e.g. one-, two-, and three-volume editions of *Ivanhoe*), so volume numbers don't line up across editions.

For an ideal version of this calculation, we might go back to the second dedup notebook and divide the count of copies by the number of volumes per record at that stage of the process. We would then be, in effect, counting recordids rather than volumes. The cells below apply that division here instead, using ```recordmeta.tsv```.
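To make the arithmetic concrete, here is a minimal sketch on made-up data. The rows, docids and recordids are hypothetical and are not drawn from the real files; the point is only to show how dividing by volumes-per-record removes the extra credit that a multi-volume edition gets.

```python
from collections import Counter

# Hypothetical toy rows: (docid, recordid, instances).
rows = [
    ('v1', 'r1', 3),  # three-volume edition, each volume held in 3 copies
    ('v2', 'r1', 3),
    ('v3', 'r1', 3),
    ('w1', 'r2', 3),  # one-volume edition, held in 3 copies
]

# Number of volumes catalogued under each recordid.
vols_per_record = Counter(recordid for _, recordid, _ in rows)

naive = Counter()       # sums instances over every volume
normalized = Counter()  # divides each volume's instances by vols-per-record
for docid, recordid, instances in rows:
    naive[recordid] += instances
    normalized[recordid] += instances / vols_per_record[recordid]

print(dict(naive))
print(dict(normalized))
```

In the naive tally the three-volume record gets three times the credit of the one-volume record; after normalization the two records come out equal.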

In [13]:
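# Map each docid to its share of copies: the volume's instances divided by
# the number of volumes in its record, so that a multi-volume record is
# not counted once per volume.
copiesperdoc = dict()

for r, df in record.groupby('recordid'):
    numvols = len(df)   # volumes catalogued under this recordid
    for idx in df.index:
        instances = int(df.loc[idx, 'instances'])
        copiesperdoc[idx] = instances / numvols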

In [14]:
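# Each line of allgroups.tsv is read as one tab-separated group of docids.
# Map every docid to the full set of docids in its group.
groupdict = dict()
with open('../dedup/allgroups.tsv', encoding = 'utf-8') as f:
    for line in f:
        docs = set(line.strip().split('\t'))
        for d in docs:
            groupdict[d] = docs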
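As a quick illustration of the resulting structure, the sketch below runs the same loop over an in-memory stand-in for the group file. The docids `a` through `d` are hypothetical placeholders.

```python
import io

# Hypothetical stand-in for the group file: one group of docids per line.
fake_file = io.StringIO('a\tb\tc\nd\n')

toy_groups = dict()
for line in fake_file:
    docs = set(line.strip().split('\t'))
    for d in docs:
        toy_groups[d] = docs

print(toy_groups['b'])
print(toy_groups['d'])
```

Every member of a group points at the same set object, so a lookup from any docid returns the whole group, including the docid itself; a volume with no duplicates maps to a one-element set.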

In [28]:
allcop = []
cop25 = []

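for doc, row in title.iterrows():
    
    group = groupdict[doc]   # every docid merged into this title
    allcopies = 0
    copies25 = 0
    thisdate = row.inferreddate
    
    for d in group:
        # Some docids in the group file contain ':/'; convert them to the
        # '+' and '=' form before looking them up in record.
        if ':/' in d:
            d = d.replace(':', '+')
            d = d.replace('/', '=')
        ddate = record.loc[d, 'inferreddate']
        instances4d = copiesperdoc[d]
        allcopies += instances4d
        # Count copies dated no later than 25 years after this title.
        if thisdate + 25 >= ddate:
            copies25 += instances4d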
            
    allcop.append(allcopies)
    cop25.append(copies25)
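The windowing logic is easier to check on a small case. The sketch below reproduces the inner loop as a function; `tally`, the toy dates and the toy copy counts are all hypothetical names and values introduced for illustration.

```python
# Hypothetical toy inputs: inferred dates and per-document copy counts.
toy_dates = {'a': 1820, 'b': 1840, 'c': 1850}
toy_copies = {'a': 1.0, 'b': 2.0, 'c': 4.0}
toy_group = {'a', 'b', 'c'}

def tally(thisdate, group, dates, copies, window=25):
    """Return (all copies, copies dated no later than thisdate + window)."""
    allcopies = 0
    inwindow = 0
    for d in group:
        allcopies += copies[d]
        if thisdate + window >= dates[d]:
            inwindow += copies[d]
    return allcopies, inwindow

print(tally(1820, toy_group, toy_dates, toy_copies))
print(tally(1850, toy_group, toy_dates, toy_copies))
```

Note that the condition sets only an upper bound: for the 1850 volume, the earlier 1820 and 1840 copies also fall "within 25 years". If each row of the title file carries the earliest date in its group (an assumption, not something this notebook verifies), the missing lower bound makes no difference.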

In [30]:
pearsonr(allcop, title.allcopiesofwork)

(0.8697198875602189, 0.0)

In [31]:
pearsonr(cop25, title.copiesin25yrs)

(0.76864050880328427, 0.0)

In [32]:
pearsonr(allcop, title.copiesin25yrs)

(0.62185135782640855, 0.0)

In [33]:
title = title.assign(allcopiesofwork = allcop)
title = title.assign(copiesin25yrs = cop25)

In [34]:
title = title.join(record.nonficprob, how = 'left')
title = title.join(record.juvenileprob, how = 'left')

In [35]:
title.shape

(138137, 29)

In [37]:
title.to_csv('../titlemeta.tsv', sep = '\t', index_label = 'docid')