# Final polishing stage for recordmeta

This is the final polishing stage for ```recordmeta.tsv```. It reconciles the probabilistic columns ```juvenileprob``` and ```nonficprob``` with ground truth. (There's no point relying on "predictions" about volumes where we actually know the truth.)

Then it checks docids, to catch errors and to find ids that have changed.

In [17]:
import pandas as pd
import numpy as np

#### read in existing recordmeta, enriched by make_predictions.py

In [18]:
record = pd.read_csv('../enrichedrecordmeta.tsv', sep = '\t', index_col = 'docid', low_memory = False)
print(record.columns.tolist())

['oldauthor', 'author', 'authordate', 'inferreddate', 'latestcomp', 'datetype', 'startdate', 'enddate', 'imprint', 'imprintdate', 'contents', 'genres', 'subjects', 'geographics', 'locnum', 'oclc', 'place', 'recordid', 'enumcron', 'volnum', 'title', 'parttitle', 'shorttitle', 'instances', 'juvenileprob', 'nonficprob']


#### read in ground truth

In [19]:
ground = pd.read_csv('../manuallists/union_of_subsets.csv', index_col = 'docid', low_memory = False)

#### make corrections

Specifically, we check each docid in recordmeta against the index of ground. If it appears there, we look up the category listed in ground, and set ```juvenileprob``` and ```nonficprob``` to zero or one as appropriate.

Along the way, whenever a predicted probability is not null, we append it to one of six lists (known yes, known no, and unknown, for each of the two columns) that we're going to use to estimate averages.

In [21]:
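# Predicted probabilities, grouped by ground-truth status:
# yes* = known positive, no* = known negative, unk* = docid not in ground.
yesnon = []
nonon = []
unknon = []
yesjuv = []
nojuv = []
unkjuv = []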

for idx in record.index:
    juvpredict = record.loc[idx, 'juvenileprob']
    nonpredict = record.loc[idx, 'nonficprob']
    
    if idx not in ground.index and not pd.isnull(juvpredict):
        unkjuv.append(juvpredict)
    elif idx in ground.index:
        truecat = ground.loc[idx, 'category']
        if truecat == 'juvenile':
            if not pd.isnull(juvpredict):
                yesjuv.append(juvpredict)
            record.loc[idx, 'juvenileprob'] = 1.0
        else:
            if not pd.isnull(juvpredict):
                nojuv.append(juvpredict)
            record.loc[idx, 'juvenileprob'] = 0.0
    
    if idx not in ground.index and not pd.isnull(nonpredict):
        unknon.append(nonpredict)
    elif idx in ground.index:
        truecat = ground.loc[idx, 'category']
        if truecat == 'notfiction':
            if not pd.isnull(nonpredict):
                yesnon.append(nonpredict)
            record.loc[idx, 'nonficprob'] = 1.0
        else:
            if not pd.isnull(nonpredict):
                nonon.append(nonpredict)
            record.loc[idx, 'nonficprob'] = 0.0
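
The row-by-row loop above can also be written with boolean masks. What follows is only a minimal sketch on invented toy data, not the code that produced this file: ```record_demo```, ```ground_demo``` and the helper ```reconcile``` are hypothetical names, and the sketch assumes that docids in ground are unique.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for `record` and `ground`; docids and values are invented.
record_demo = pd.DataFrame(
    {'juvenileprob': [0.9, 0.2, np.nan, 0.4],
     'nonficprob': [0.1, 0.8, 0.3, np.nan]},
    index=pd.Index(['a1', 'b2', 'c3', 'd4'], name='docid'))
ground_demo = pd.DataFrame(
    {'category': ['juvenile', 'notfiction', 'fiction']},
    index=pd.Index(['a1', 'b2', 'c3'], name='docid'))

def reconcile(record, ground, column, positive_category):
    """Return record[column] with known volumes set to 0.0 or 1.0 (hypothetical helper)."""
    # Align ground-truth categories to record's index; unknown docids become NaN.
    # This assumes docids in ground are unique.
    truecat = ground['category'].reindex(record.index)
    known = truecat.notna()
    truth = (truecat == positive_category).astype(float)
    # Replace the prediction with the truth wherever the volume is known.
    return record[column].mask(known, truth)

record_demo['juvenileprob'] = reconcile(record_demo, ground_demo, 'juvenileprob', 'juvenile')
record_demo['nonficprob'] = reconcile(record_demo, ground_demo, 'nonficprob', 'notfiction')
print(record_demo)
```

As in the loop, a known volume gets 0.0 or 1.0 even when its prediction was null, and volumes absent from ground keep whatever prediction they had.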


#### some EDA

I was just curious what predicted probabilities actually are, on average, for volumes that we *know* to be nonfiction (or know not to be) compared with those we don't know.

In [22]:
print(np.mean(yesnon))
print(np.mean(unknon))
print(np.mean(nonon))

0.791014044048
0.259495901629
0.207118329161


The upshot is that our model is pretty confident about the examples known to be nonfiction (```yesnon```, mean 0.79), but doesn't place the average volume known not to be nonfiction (```nonon```, mean 0.21) much lower than the average unknown volume (```unknon```, mean 0.26). This makes sense, as most unknown volumes are indeed fiction.

The same pattern holds for juvenile fiction. The last two numbers in this sequence are much closer to each other than the middle number is to the first:

In [23]:
print(np.mean(yesjuv))
print(np.mean(unkjuv))
print(np.mean(nojuv))

0.956455906831
0.158700985652
0.122038997642
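
The three averages for each column could also be computed in one pass with a groupby, provided it runs before the predictions are overwritten. This is a sketch under that assumption, using invented numbers; ```mean_by_status``` and the status labels are hypothetical names.

```python
import numpy as np
import pandas as pd

# Invented predictions and ground-truth categories, for illustration only.
predictions = pd.Series([0.9, 0.7, 0.2, 0.4, 0.1, np.nan],
                        index=['a', 'b', 'c', 'd', 'e', 'f'], name='nonficprob')
categories = pd.Series(['notfiction', 'notfiction', 'fiction', 'juvenile'],
                       index=['a', 'b', 'c', 'd'])

def mean_by_status(predictions, categories, positive_category):
    """Average prediction for known-yes, known-no and unknown volumes (hypothetical helper)."""
    # Docids missing from the ground truth come back as NaN after reindexing.
    truecat = categories.reindex(predictions.index)
    status = np.where(truecat.isna(), 'unknown',
                      np.where(truecat == positive_category, 'known yes', 'known no'))
    # mean() skips null predictions, which mirrors the pd.isnull checks in the loop.
    return predictions.groupby(status).mean()

means = mean_by_status(predictions, categories, 'notfiction')
print(means)
```

Calling the same helper with the ```juvenileprob``` column and ```'juvenile'``` as the positive category would give the second set of averages.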


#### some unimportant error checking

The first check counts docids in recordmeta that contain a colon; the second counts docids in ground that are missing from recordmeta. Both come back zero.

In [28]:
errorcount = 0
for idx in record.index:
    if ':' in idx:
        errorcount += 1
errorcount

0

In [29]:
errorcount = 0
for idx in ground.index:
    if idx not in record.index:
        errorcount += 1
errorcount

0
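
Both counts can be had without an explicit loop. A small sketch on invented docids, using pandas index methods; ```record_index``` and ```ground_index``` are hypothetical stand-ins for ```record.index``` and ```ground.index```.

```python
import pandas as pd

# Invented docids, for illustration only.
record_index = pd.Index(['vol1', 'vol2', 'vol3'], name='docid')
ground_index = pd.Index(['vol2', 'vol9'], name='docid')

# Number of docids containing a literal colon.
colon_count = int(record_index.str.contains(':', regex=False).sum())

# Docids present in the ground truth but absent from the record index.
missing_from_record = ground_index.difference(record_index)

print(colon_count, list(missing_from_record))
```

Keeping the offending ids rather than only their count makes it easier to see what went wrong when a check fails.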

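Next we load a translator that maps outdated docids (```badname```) to their current form (```goodname```), and count how many docids in recordmeta still use an outdated form.

In [31]: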
trans = pd.read_csv('../data/filename_translator.tsv', sep = '\t', index_col = 'badname')

In [32]:
trans.head()

badname,goodname
uc1.b250374,uc1.$b250374
uc1.b250174,uc1.$b250174
uc1.b249745,uc1.$b249745
uc1.b70887,uc1.$b70887
uc1.b182851,uc1.$b182851


In [35]:
errorcount = 0
for idx in record.index:
    if idx in trans.index:
        errorcount += 1
errorcount

70

**Note that** we have not yet corrected bad docids: 70 docids in recordmeta still match a ```badname``` in the translator.
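
One way the correction could eventually be made is to rename index entries using the translator as a mapping. This is a sketch only, on toy data, and is not applied in this notebook; ```record_demo```, ```trans_demo``` and the docid ```mdp.001``` are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for `record` and `trans`; titles and 'mdp.001' are invented.
record_demo = pd.DataFrame(
    {'title': ['First', 'Second', 'Third']},
    index=pd.Index(['uc1.b250374', 'uc1.$b250174', 'mdp.001'], name='docid'))
trans_demo = pd.DataFrame(
    {'goodname': ['uc1.$b250374', 'uc1.$b250174']},
    index=pd.Index(['uc1.b250374', 'uc1.b250174'], name='badname'))

# Map each bad docid to its replacement; docids not in the mapping are left alone.
mapping = trans_demo['goodname'].to_dict()
fixed = record_demo.rename(index=mapping)

# A renamed docid could collide with a row that already uses the good id,
# so it is worth checking uniqueness before writing anything to disk.
assert fixed.index.is_unique
print(fixed.index.tolist())
```

If the uniqueness check failed on the real data, the duplicated rows would need to be merged or dropped by hand before saving.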

Write the file, with its reconciled probabilities, to disk.

In [36]:
record.to_csv('../recordmeta.tsv', sep = '\t', index_label = 'docid')