# Merging data produced by three different people to create the manual_title_subset

This is the notebook that created List #4 in the report, the "manually-checked title subset." It can be re-run to recreate the table.

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from collections import Counter
from difflib import SequenceMatcher
%matplotlib inline

## Load the data, confirm basic format

In [2]:
j = pd.read_csv('titledata/jessica.csv')
j.shape

(1200, 18)

In [3]:
p = pd.read_csv('titledata/patrick.csv')
p.shape

(1200, 18)

In [4]:
t = pd.read_csv('titledata/ted.tsv', sep = '\t')
t.shape

(330, 18)

In [5]:
set(p.columns) == set(j.columns)

True

In [6]:
set(p.columns) == set(t.columns)

True
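
Both checks pass here. Had one returned False, a set difference would show which columns disagree. A minimal sketch with toy frames; the helper name `column_mismatches` and the toy data are illustrative assumptions, not part of the original notebook:

```python
import pandas as pd

def column_mismatches(left, right):
    """Return the columns present in only one of two dataframes."""
    only_left = sorted(set(left.columns) - set(right.columns))
    only_right = sorted(set(right.columns) - set(left.columns))
    return only_left, only_right

# Toy frames standing in for two coders' files (hypothetical data).
a = pd.DataFrame({'docid': ['x1'], 'gender': ['f'], 'category': ['novel']})
b = pd.DataFrame({'docid': ['x2'], 'gender': ['m'], 'nationality': ['us']})

only_a, only_b = column_mismatches(a, b)
print(only_a, only_b)
```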

### Concatenate three datasets

In [7]:
all = pd.concat([j, p, t], sort = False)

## Confirm that values match our data dictionary

In [8]:
set(all.gender)

{nan, 'o', 'f', 'm', 'u'}

In practice, there is no difference between 'u' and nan; they both mean we don't know. The only difference between 'o' and 'u,' in this data, is that one coder has used 'o' in five cases of multiple authorship; the other readers have not done the same thing, so we can't consistently maintain that distinction as part of the dictionary.

In [9]:
all.loc[all.gender == 'o', 'gender'] = 'u'

In [10]:
all['gender'] = all['gender'].fillna(value = 'u')
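
The two steps above amount to collapsing 'o' and missing values into 'u'. A small sketch on a toy Series (the values are made up for illustration) shows the same result in one chained expression:

```python
import numpy as np
import pandas as pd

# Toy gender column containing every code seen in the real data.
genders = pd.Series(['f', 'm', 'o', np.nan, 'u'])

# Map 'o' to 'u', then treat missing values as 'u' as well.
cleaned = genders.replace('o', 'u').fillna('u')
print(sorted(set(cleaned)))
```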


There's one row where a coder has double-listed nationality to address dual authorship. This is not something we've done consistently, so let's flatten it out to a single value.

In [11]:
all.loc[all.nationality == 'us| us', 'nationality'] = 'us'

In [12]:
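# Note: pd.concat copied the data in cell 7, so this only updates the
# standalone `t` frame; all.gender was already filled in cell 10.
t['gender'] = t['gender'].fillna(value = 'u')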

### Confirm the ```category``` field

This is a basic element of manual correction, indicating basically what genre a book should be filed under.

According to our data dictionary in process.md (https://github.com/tedunderwood/meta2018/blob/master/process.md), the allowable codes here are

    nonfic
    reprint
    novel
    poetry
    shortstories
    juvenile

Some of these are designations of a genre, form, or (in the case of "juvenile") audience. We don't postulate actually-crisp boundaries between these categories; for one thing, we're characterizing books, and books often include works from *multiple* genres. But I asked my collaborators to produce best-available generalizations for the purpose of a rough survey. Our goal here is not to study anything in detail, but to start by locating works that most people would consider fiction.

The "reprint" category is anomalous: it's not a genre or form. These are genuine works of fiction, but their first appearance in Hathi is more than 25 years after their original date of publication. We will correct the "firstpub" date, but researchers still may wish to exclude these from a random sample based on "firstpub," since some of these works didn't actually circulate widely near their first date of publication.

In subsequent conversation we added

    drama, and
    shortstories|juvenile

But I am on reflection going to simplify that last category by letting "juvenile" be the dominant tag. This project has been targeted from the beginning at adult fiction, and formal divisions within the juvenile category are not something we can really pretend to have addressed.

In reality, we recorded a more complex set of categories, because coders didn't always know which category to treat as dominant:

In [13]:
set(all.category)

{'drama',
 'juvenile',
 'juvenile | shortstories',
 'juvenile|novel',
 'juvenile|shortstories',
 'nonfic',
 'nonfic | reprint',
 'nonfic|juvenile',
 'nonfic|poetry',
 'novel',
 'novel|juvenile',
 'poetry',
 'reprint',
 'shortstories',
 'shortstories | poetry',
 'shortstories|juvenile',
 'shortstories|poetry'}

**solution**

In this instance, I'm going to impose a strict order of dominance on the categories so we don't have to use pipes and can have a one-to-one mapping here.

In [14]:
# We start by cleaning up spaces. Then we simplify the data by
# imposing a fixed order of dominance: nonfic, poetry, drama,
# reprint, juvenile, shortstories, novel.

def dominant_category(astring):
    ''' Accepts a category string that may contain multiple
    pipe-separated categories, and returns the single dominant
    category. Returns 'error' if no known category is present.
    '''
    astring = astring.replace(' ', '')
    cats = astring.split('|')
    if 'nonfic' in cats:
        return 'nonfic'
    if 'poetry' in cats:
        return 'poetry'
    if 'drama' in cats:
        return 'drama'
    if 'reprint' in cats:
        return 'reprint'
    if 'juvenile' in cats:
        return 'juvenile'
    if 'shortstories' in cats:
        return 'shortstories'
    if 'novel' in cats:
        return 'novel'
    else:
        return 'error'

all = all.assign(category = all.category.map(dominant_category))
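
The same order of dominance can also be written as a list, which makes the precedence easier to read and change. This is only a sketch of an alternative, not the code that produced the table; `PRECEDENCE` and `dominant_from_list` are names I've made up for illustration, and the order is copied from the if-chain above.

```python
# Earlier entries win over later ones (order copied from dominant_category).
PRECEDENCE = ['nonfic', 'poetry', 'drama', 'reprint',
              'juvenile', 'shortstories', 'novel']

def dominant_from_list(astring, precedence=PRECEDENCE):
    """Return the highest-precedence category in a pipe-separated string."""
    cats = {c.strip() for c in astring.split('|')}
    for candidate in precedence:
        if candidate in cats:
            return candidate
    return 'error'

for raw in ['juvenile | shortstories', 'nonfic|poetry', 'novel|juvenile']:
    print(raw, '->', dominant_from_list(raw))
```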

**what do we actually have?**

In [15]:
allcats = set(all.category)
for c in allcats:
    print(c, sum(all.category == c))

poetry 11
nonfic 199
juvenile 144
drama 3
reprint 129
shortstories 298
novel 1946


So in a random sample of 2730 books, 2517 are fiction. But some researchers may want to exclude the 144 volumes of juvenile fiction.

A sample focused on literary *production* might also want to exclude the 129 "reprints"; they are first appearing in Hathi significantly (>25 yrs) after their original publication. We have now provided correct first publication dates for these (where we can). But if you're trying to reflect relatively *prominent* works from e.g. the 1820s, it might be misleading to include these "reprints" as if they had been randomly sampled *from the 1820s* when their period of popularity may actually be later.
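
For readers who want to check the arithmetic, here are the totals recomputed from the counts printed above. The grouping of categories into "fiction" follows the text (novel, shortstories, juvenile, and reprint); the variable names are just for illustration.

```python
# Counts copied from the output of cell 15.
counts = {'poetry': 11, 'nonfic': 199, 'juvenile': 144, 'drama': 3,
          'reprint': 129, 'shortstories': 298, 'novel': 1946}

total = sum(counts.values())
fiction = sum(counts[c] for c in ['novel', 'shortstories', 'juvenile', 'reprint'])

# Fiction remaining if both juvenile volumes and reprints are excluded.
adult_contemporary = fiction - counts['juvenile'] - counts['reprint']

print(total, fiction, adult_contemporary)
```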

### Filling out ```firstpub``` column

We have manually entered a first publication date where it's earlier than the automatically-calculated "latest possible date of composition" (which intersects attested publication date with e.g. author's date of death).

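But the column is often left blank in our manual process. The next cell turns it into a column that always holds the earlier of the two dates, or just latestcomp if that's all we have. If the earlier date is 1790 or before, it is not accepted, and we fall back on latestcomp.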

In [16]:
def lowestof(row):
    if pd.isnull(row['firstpub']):
        return int(row['latestcomp'])
    else:
        latest = int(row['latestcomp'])
        first = int(row['firstpub'])
        lowest = min(latest, first)
        # Accept the earlier date only if it is later than 1790;
        # otherwise fall back on latestcomp.
        if lowest > 1790:
            return lowest
        else:
            return latest

all = all.assign(firstpub = all.apply(lowestof, axis = 1))
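
To make the rule concrete, here is the same logic applied to four toy rows. The data are invented, and `earliest_date` is a hypothetical restatement of `lowestof` with the 1790 cutoff exposed as a parameter.

```python
import numpy as np
import pandas as pd

def earliest_date(row, floor=1790):
    """Earlier of firstpub and latestcomp, falling back on latestcomp."""
    latest = int(row['latestcomp'])
    if pd.isnull(row['firstpub']):
        return latest
    lowest = min(latest, int(row['firstpub']))
    return lowest if lowest > floor else latest

# Toy rows: blank firstpub, earlier firstpub, later firstpub, too-early firstpub.
toy = pd.DataFrame({'firstpub': [np.nan, 1851, 1905, 1700],
                    'latestcomp': [1880, 1890, 1899, 1820]})
toy['firstpub'] = toy.apply(earliest_date, axis=1)
print(toy.firstpub.tolist())
```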

### Breaking the ```reprint``` category out as a distinct column

On reflection, it's problematic to include "reprint" as a category on the same level as, say "shortstories." Doing that would mean that people who want to include reprints lose any guidance about genre. Let's fix that with some further manual coding.

In [17]:
reprints = pd.read_csv('titledata/reprints.tsv', index_col = 'docid', sep = '\t')

In [18]:
def new_category_and_repval(row):
    global reprints
    
    if row['category'] == 'reprint':
        docid = row.docid
        newcat = reprints.loc[docid, 'category']
        firstpub = int(reprints.loc[docid, 'firstpub'])
    else:
        newcat = row['category']
        firstpub = int(row['firstpub'])
    
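    # A volume counts as a reprint if its inferred date is more than
    # 25 years after first publication.
    foundat = int(row['inferreddate'])
    if firstpub + 25 < foundat: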
        repval = 'reprint'
    else:
        repval = 'contemporary'
    
    return newcat, repval, firstpub
    
categories, repvals, firstpubs = zip(*all.apply(new_category_and_repval, axis = 1))
all = all.assign(category = categories)
all = all.assign(hathiadvent = repvals)
all = all.assign(firstpub = firstpubs)
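
The reprint test itself is a strict inequality, so a gap of exactly 25 years still counts as contemporary. A tiny sketch of just that comparison; `hathi_advent` is a made-up helper name, and the years are invented:

```python
def hathi_advent(firstpub, inferreddate, gap=25):
    """Label a volume by how long after first publication it appears."""
    if firstpub + gap < inferreddate:
        return 'reprint'
    return 'contemporary'

print(hathi_advent(1820, 1846))  # gap of 26 years
print(hathi_advent(1820, 1845))  # gap of exactly 25 years
```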

### Resetting certain text columns to original values

In creating the data we worked on, I truncated certain columns to a character limit, in order to make the spreadsheet more manageable. Also, although we tried to work in utf-8, there were slip-ups that caused certain portions of the data to be saved in a different encoding. Once that happens, special characters are lost.

We can reset those columns using the index, ```docid```. However, we need to be cautious in certain cases, since there are also manual edits to titles we want to preserve.

Doing both things at once becomes a fun algorithmic challenge.

In [19]:
titlemeta = pd.read_csv('../titlemeta.tsv', 
                        index_col = 'docid',
                       sep = '\t', low_memory = False)

In [20]:
muchshorter = 0
rejected_old_strings = []
oldbetter = 0

def match_values(row, column_name):
    # Only the counters that get reassigned need a global declaration.
    global muchshorter, oldbetter
    
    tocorrect = {'The manuscripts of Erdély': 'The Manuscripts of Erdély',
                'NhÃ¡Ì‚t Háº¡nh, ThÃ\xadch': 'Nhá̂t Hạnh, Thích',
                 'RÄ\uf181javaá¹ƒÅ›Ä«, Lakshmaá¹‡a': 'Rājavaṃśī, Lakshmaṇa'}
    
    docid = row['docid']
    if pd.isnull(row[column_name]):
        newval = ""
    else:
        newval = row[column_name]
        
    if pd.isnull(titlemeta.loc[docid, column_name]):
        oldval = ""
    else:
        oldval = titlemeta.loc[docid, column_name]
        
    if newval == oldval:
        return newval
    elif newval in tocorrect:
        return tocorrect[newval]
    elif len(oldval) < 1:
        return newval
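    # The new value is exactly the first 25 characters of a longer old
    # value, so it looks like a truncation rather than a manual edit.
    elif len(newval) == 25 and len(oldval) > 25 and newval == oldval[0:25]: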
        muchshorter += 1
        return oldval
    else:
        matcher = SequenceMatcher(None, oldval, newval)
        matchprob = matcher.ratio()
        
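        # Characters typical of encoding damage in the new value raise
        # the score, making it more likely that we keep the old value.
        for char in newval: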
            if char == 'Ä' or char == '�' or char == 'Ã' or char == 'Å':
                matchprob += 0.07
            elif char == '?':
                matchprob += 0.03
                
        if matchprob > 0.85:
            oldbetter += 1
            return oldval
        else:
            rejected_old_strings.append((oldval, newval))
            return newval
        
besttitles = all.apply(match_values, axis = 1, args = ['shorttitle'])
print("Old preferred because much longer: ", muchshorter)
print("Old preferred because it was close & weird chars in new: ", oldbetter) 
print('New preferred: ', len(rejected_old_strings))

Old preferred because much longer:  0
Old preferred because it was close & weird chars in new:  28
New preferred:  45
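
The scoring step in the final branch can be tried out in isolation. This sketch reuses the same weights and the 0.85 threshold from `match_values`; the `SUSPECT` table and the function name `old_value_score` are my own illustrative names, and I'm assuming the '�' in the code is the Unicode replacement character (written here as `'\ufffd'`).

```python
from difflib import SequenceMatcher

# Bonus added for each character that suggests encoding damage.
SUSPECT = {'Ä': 0.07, '\ufffd': 0.07, 'Ã': 0.07, 'Å': 0.07, '?': 0.03}

def old_value_score(oldval, newval):
    """Similarity of the two strings, plus a bonus per suspect character."""
    score = SequenceMatcher(None, oldval, newval).ratio()
    score += sum(SUSPECT.get(char, 0) for char in newval)
    return score

# A damaged name is close to the original, so the old value is kept.
damaged = old_value_score('Bouvé, Pauline Carrington Rust',
                          'Bouv\ufffd, Pauline Carrington Rust')

# A genuinely different string scores low, so the manual edit is kept.
edited = old_value_score('Poems', 'Tales of the sea')

print(damaged > 0.85, edited > 0.85)
```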


In [21]:
muchshorter = 0
rejected_old_strings = []
oldbetter = 0
bestauthors = all.apply(match_values, axis = 1, args = ['author'])
print("Old preferred because much longer: ", muchshorter)
print("Old preferred because it was close & weird chars in new: ", oldbetter) 
print('New preferred: ', len(rejected_old_strings))

Old preferred because much longer:  0
Old preferred because it was close & weird chars in new:  74
New preferred:  14


In [22]:
muchshorter = 0
rejected_old_strings = []
oldbetter = 0
bestimprints = all.apply(match_values, axis = 1, args = ['imprint'])
print("Old preferred because much longer: ", muchshorter)
print("Old preferred because it was close & weird chars in new: ", oldbetter) 
print('New preferred: ', len(rejected_old_strings))

Old preferred because much longer:  2109
Old preferred because it was close & weird chars in new:  1
New preferred:  4


In [23]:
muchshorter = 0
rejected_old_strings = []
oldbetter = 0
bestgenres = all.apply(match_values, axis = 1, args = ['genres'])
print("Old preferred because much longer: ", muchshorter)
print("Old preferred because it was close & weird chars in new: ", oldbetter) 
print('New preferred: ', len(rejected_old_strings))

Old preferred because much longer:  142
Old preferred because it was close & weird chars in new:  0
New preferred:  2


In [24]:
muchshorter = 0
rejected_old_strings = []
oldbetter = 0
bestsubjects = all.apply(match_values, axis = 1, args = ['subjects'])
print("Old preferred because much longer: ", muchshorter)
print("Old preferred because it was close & weird chars in new: ", oldbetter) 
print('New preferred: ', len(rejected_old_strings))

Old preferred because much longer:  630
Old preferred because it was close & weird chars in new:  3
New preferred:  7


In [25]:
all = all.assign(shorttitle = besttitles,
                author = bestauthors,
                imprint = bestimprints,
                genres = bestgenres,
                subjects = bestsubjects)

### Redefining categories to avoid misunderstanding

In manual coding we used the terms "novel" and "shortstories." But these terms are often misleading in practice. Folktales or anecdotes are not really short stories, and some older or experimental fiction might not be quite "a novel."

Let's use looser terms.

In [26]:
def remap_categories(cat):
    if cat == 'novel':
        return 'longfiction'
    elif cat == 'shortstories':
        return 'shortfiction'
    elif cat == 'nonfic':
        return 'notfiction'
    elif cat == 'poetry':
        return cat
    elif cat == 'drama':
        return cat
    elif cat == 'juvenile':
        return cat
    else:
        print(cat)
        return cat

all = all.assign(category = all.category.map(remap_categories))
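
The renaming is a simple lookup, so it can also be expressed as a dictionary. A sketch on a toy Series; `RENAMES` is an illustrative name. Unlike `remap_categories`, this version does not print unexpected values, so it gives up the sanity check that function provides.

```python
import pandas as pd

# Only the three categories that change need an entry.
RENAMES = {'novel': 'longfiction',
           'shortstories': 'shortfiction',
           'nonfic': 'notfiction'}

cats = pd.Series(['novel', 'shortstories', 'nonfic',
                  'poetry', 'drama', 'juvenile'])

# Values absent from the dictionary pass through unchanged.
renamed = cats.replace(RENAMES)
print(renamed.tolist())
```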

### A few minor name corrections

Still fixing errors caused by some data having been saved outside utf-8.

In [27]:
corrector = pd.read_csv('titledata/name_corrections.tsv', sep = '\t', index_col = 'error')
corrector.head()

| error | realname |
|---|---|
| Aminoff. Constance L�onie Caroline | Aminoff. Constance Léonie Caroline |
| Wibberley, Leonard Patrick O�Connor | Wibberley, Leonard Patrick O'Connor |
| B�lte, Amely | Bolte, Amely |
| De la Motte, Friedrich Heinrich Karl, Baron Fouqu� | De la Motte, Friedrich Heinrich Karl, Baron Fo... |
| Bouv�, Pauline Carrington Rust | Bouvé, Pauline Carrington Rust |


In [28]:
def correct_names(name):
    global corrector
    if name in corrector.index:
        return corrector.loc[name, 'realname']
    else:
        return name

all = all.assign(realname = all.realname.map(correct_names))
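
The correction step is a lookup with a fallback: names found in the table are replaced, and everything else passes through. A toy version with a plain dictionary, using one entry from the table above and one invented name that needs no correction (again assuming '�' is the Unicode replacement character, `'\ufffd'`):

```python
import pandas as pd

# One real correction from name_corrections.tsv, keyed on the damaged form.
corrections = {'Bouv\ufffd, Pauline Carrington Rust':
               'Bouvé, Pauline Carrington Rust'}

# The second name is a made-up example that is not in the table.
names = pd.Series(['Bouv\ufffd, Pauline Carrington Rust', 'Example, Author'])

# dict.get returns the name itself when there is no correction for it.
fixed = names.map(lambda name: corrections.get(name, name))
print(fixed.tolist())
```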

In [29]:
all.to_csv('manual_title_subset.tsv', index = False, sep = '\t')