# Testing and merging the manual title subset

This notebook created List #4 in the report, the "manually-checked title subset."

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

## Load the data, confirm basic format

In [9]:
j = pd.read_csv('titledata/jessica.csv')
j.shape

(1200, 18)

In [38]:
p = pd.read_csv('titledata/patrick.csv')
p.shape

(1200, 18)

In [11]:
t = pd.read_csv('titledata/ted.tsv', sep = '\t')
t.shape

(330, 18)

In [12]:
set(p.columns) == set(j.columns)

True

In [13]:
set(p.columns) == set(t.columns)

True

## Confirm that values match our data dictionary

In [15]:
set(j.gender)

{'f', 'm', 'u', nan, 'o'}

In [16]:
set(p.gender)

{'f', 'm', 'u', nan}

In [23]:
set(t.gender)

{nan, 'm', 'f'}

In practice, there is no difference between 'u' and nan; both mean we don't know. In this data, the only difference between 'o' and 'u' is that Jessica used 'o' in five cases of multiple authorship. The other readers did not do the same, so we can't consistently maintain that distinction as part of the dictionary. We therefore recode 'o' as 'u' and fill missing values with 'u' as well.

In [25]:
j.loc[j.gender == 'o', 'gender'] = 'u'

In [26]:
set(j.gender)

{'f', 'm', 'u', nan}

In [32]:
j['gender'] = j['gender'].fillna(value = 'u')


In [39]:
p['gender'] = p['gender'].fillna(value = 'u')

In [29]:
t['gender'] = t['gender'].fillna(value = 'u')
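The recoding steps above could also be gathered into a single helper so that every reader's file gets the same treatment. This is only a sketch: `normalize_gender` is a hypothetical name that does not appear elsewhere in this notebook, and the toy frame stands in for `j`, `p`, or `t`.

```python
import numpy as np
import pandas as pd

def normalize_gender(df, column="gender"):
    """Return a copy of df with 'o' and missing values in `column` recoded as 'u'.

    Hypothetical helper; it mirrors the cell-by-cell recoding done above.
    """
    out = df.copy()
    out[column] = out[column].replace("o", "u").fillna("u")
    return out

# Toy frame with made-up values, standing in for one reader's file.
demo = pd.DataFrame({"gender": ["f", "m", "o", np.nan, "u"]})
cleaned = normalize_gender(demo)
print(sorted(set(cleaned.gender)))
```

Because the helper works on a copy, the original frame is left untouched, which makes it easy to compare the values before and after recoding.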

There's one row where Jessica has double-listed nationality to address dual authorship. This is not something we've done consistently, so let's flatten it out to a single value.

In [34]:
j.loc[j.nationality == 'us| us', 'nationality'] = 'us'
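To confirm that no other rows double-list nationality, one option is to search the column for the delimiter. The sketch below assumes that `|` is the only delimiter readers used for multiple values; the toy frame and its values are invented for illustration.

```python
import pandas as pd

# Toy frame with made-up values, standing in for a reader's file.
demo = pd.DataFrame({"nationality": ["us", "uk", "us| us", None]})

# regex=False treats '|' as a literal character rather than regex alternation;
# na=False counts missing values as non-matches.
has_pipe = demo.nationality.str.contains("|", regex=False, na=False)
multi = demo.loc[has_pipe, "nationality"].tolist()
print(multi)
```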

**category**

According to our data dictionary in process.md (https://github.com/tedunderwood/meta2018/blob/master/process.md), the allowable codes here are:

    nonfic
    reprint
    novel
    poetry
    shortstories
    juvenile

In subsequent conversation we added:

    drama
    shortstories|juvenile
    juvenile|shortstories (order doesn't matter here)
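Since order doesn't matter in the combined code, one way to test values against the dictionary is to split on `|` and compare unordered sets. The following is a minimal sketch, not part of the original workflow: the names `BASE_CODES`, `ALLOWED_COMBOS`, `is_allowed`, and `unexpected_categories` are hypothetical, and stripping whitespace around each part is an added tolerance rather than a rule from process.md.

```python
# Single codes from the data dictionary, plus 'drama' added later.
BASE_CODES = {"nonfic", "reprint", "novel", "poetry",
              "shortstories", "juvenile", "drama"}

# The one combination we agreed to allow; a frozenset ignores order.
ALLOWED_COMBOS = {frozenset({"shortstories", "juvenile"})}

def is_allowed(category):
    """True if a category string matches the dictionary, in any order."""
    if not isinstance(category, str):
        return False  # nan and other missing values are not valid codes
    parts = frozenset(part.strip() for part in category.split("|"))
    if len(parts) == 1:
        return parts <= BASE_CODES
    return parts in ALLOWED_COMBOS

def unexpected_categories(values):
    """Sorted list of non-missing values that fall outside the dictionary."""
    return sorted({v for v in values
                   if isinstance(v, str) and not is_allowed(v)})

# Made-up sample; in the notebook this could be called on p.category.
sample = ["novel", "juvenile|shortstories", "nonfic|poetry", float("nan")]
print(unexpected_categories(sample))
```

Missing values are skipped by `unexpected_categories` on purpose, so that nulls can be counted and inspected separately.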

In [40]:
set(p.category)

{'juvenile',
 'juvenile|novel',
 'juvenile|shortstories',
 'nonfic',
 'nonfic|juvenile',
 'nonfic|poetry',
 'novel',
 'novel|juvenile',
 'poetry',
 'reprint',
 'shortstories',
 'shortstories|juvenile',
 'shortstories|poetry'}

In [42]:
sum(pd.isnull(j.category))

1

In [45]:
t.loc[pd.isnull(t.category), 'author']

207                                     Leslie, Madeline
208                                     Emerson, Alice B
209                          Longfellow, Henry Wadsworth
210                                   Scott, Walter, Sir
211                                          Wood, Ellen
212             James, G. P. R. (George Payne Rainsford)
213                            Bacheller, Irving Addison
214                                        Francis, M. E
215                                          Owen, Frank
216                                      Morris, Charles
217                              Johnson, Helen Kendrick
218                                          Peel, Doris
219                                  Goldsmith, Martin M
220          Champney, Elizabeth W. (Elizabeth Williams)
221                          Pounds, Jessie Hunter Brown
222                         Pepys, Charlotte Maria, Lady
223                               Russell, William Clark
224             Landon, Melvill

In [44]:
set(t.category)

{nan,
 'novel',
 'juvenile',
 'nonfic',
 'reprint',
 'shortstories',
 'drama',
 'poetry'}