# A template for experiments

I'm trying to develop a model we can use for experiments on the NEH data. But this is by no means set in stone yet; it's a first draft we should discuss and adjust.

For a first test, let's consider the problem of author gender. We know our model of gender is imperfect, and we don't imagine a predictive model trained on this boundary will tell us very much about gender directly; it's almost certainly, to some degree, a proxy for genre. But it's a tricky boundary to model and thus a good place to start. We're in no danger of getting 100% accuracy!

### Imports

In [12]:
import pandas as pd
import numpy as np
import os

### Title metadata

Eventually we will have detailed metadata for each "chunk" of a title, reporting things like error levels, the ratio between the lengths of clean-text and OCR-text chunks, etc.

I haven't created that yet. So for now our strategy is to import title-level metadata and then map it out to the chunks we observe, using the heuristic that each chunk is named according to the formula ```gbindex_chunknumber```.
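A minimal sketch of that naming heuristic, for reference. The helper name `parse_chunk_name` is hypothetical, and it assumes that the gbindex itself never contains an underscore and that the chunk number is always an integer; neither has been checked against the full file list.

```python
def parse_chunk_name(filename):
    """Split a chunk filename into (gbindex, chunknumber)."""
    stem = filename.rsplit('.', 1)[0]          # drop the file extension
    gbindex, chunknumber = stem.split('_', 1)  # split on the first underscore
    # Assumption: the chunk number is an integer.
    return gbindex, int(chunknumber)

print(parse_chunk_name('36965_3.txt'))
```

Keeping the gbindex as a string rather than converting it to an integer means it can be compared directly with keys built from `filename.split('_')`.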

In [2]:
metadata = pd.read_csv('../metadata/cleanrowswithhathimatches.tsv', sep = '\t')

In [14]:
cleanfiles = [x for x in os.listdir('/Users/tunder/Box Sync/NEHproject/cleannarratives/')
              if x.endswith('.txt')]
dirtyfiles = [x for x in os.listdir('/Users/tunder/Box Sync/NEHproject/dirtynarratives/')
               if x.endswith('.txt')]

In [15]:
assert len(cleanfiles) == len(dirtyfiles)
print("We have " + str(len(cleanfiles)) + " separate chunks.")

We have 696 separate chunks.
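Equal counts don't by themselves guarantee that the two folders contain the same chunks. If we assume that a clean chunk and its dirty counterpart share a filename (an assumption, not something verified here), a stricter check could look roughly like this. The helper `unmatched_chunks` and the toy listings are made up for illustration.

```python
def unmatched_chunks(clean_names, dirty_names):
    """Return the names that appear in only one of the two listings."""
    clean, dirty = set(clean_names), set(dirty_names)
    return sorted(clean - dirty), sorted(dirty - clean)

# Toy listings; in the notebook these would be cleanfiles and dirtyfiles.
only_clean, only_dirty = unmatched_chunks(
    ['36965_3.txt', '27553_5.txt'],
    ['36965_3.txt', '27553_4.txt'])
print(only_clean, only_dirty)
```

If both returned lists are empty, every clean chunk has a dirty counterpart with the same name.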


In [16]:
cleanfiles[0]

'36965_3.txt'

In [24]:
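def get_gbindex(filename):
    # Filenames look like '36965_3.txt'; the part before the underscore is the gbindex.
    return filename.split('_')[0]

# Map each gbindex to the list of chunk filenames that belong to it.
gbdict = dict()

for filename in cleanfiles:
    gbindex = get_gbindex(filename)
    if gbindex not in gbdict:
        gbdict[gbindex] = []
    gbdict[gbindex].append(filename)

gbset = set(gbdict.keys())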

In [25]:
print("But only " + str(len(gbset)) + " separate titles.")

But only 121 separate titles.


In [26]:
ourtitles = metadata.loc[metadata['gbindex'].isin(gbset), : ]
ourtitles.shape

(125, 24)

Lol, contrary to my assertion that ```cleanrowswithhathimatches``` has one row for each gbindex, there still appear to be duplicated gbindexes: 125 rows for 121 titles. I will need to fix that in the original. For right now, a kludge.

In [27]:
ourtitles = ourtitles.drop_duplicates(subset = 'gbindex')
ourtitles.shape

(121, 24)
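Before fixing the original file it may help to look at the duplicated rows themselves, since `drop_duplicates` silently keeps the first row of each group. A small sketch on invented data (the toy frame and its values are not from our metadata):

```python
import pandas as pd

# Toy stand-in for the title metadata; the values are invented.
toy = pd.DataFrame({
    'gbindex': ['100', '100', '200', '300', '300'],
    'title': ['A', 'A (second copy)', 'B', 'C', 'C'],
})

# keep=False marks every member of a duplicated group, not just the repeats.
dupes = toy.loc[toy['gbindex'].duplicated(keep=False)]
print(dupes.sort_values('gbindex'))
```

Comparing the rows within each group would show whether the duplicates are exact copies or disagree on some field, which matters for deciding which one to keep.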

### Chunk metadata

Eventually we will read this in directly. For right now, a kludge:

In [30]:
chunkmeta = []

for idx, row in ourtitles.iterrows():

    gbindex = row['gbindex']
    files4thisindex = gbdict[gbindex]

    for filename in files4thisindex:
        # Copy the title-level row so each chunk gets its own record.
        chunkrow = row.copy()
        chunkrow['filename'] = filename
        chunkmeta.append(chunkrow)

chunkmeta = pd.DataFrame(chunkmeta)
chunkmeta.head()

Unnamed: 0,docid,author,authordate,title,latestcomp,hathidate,imprint,gutenstring,enumcron,gbindex,...,instances,genre,audience,authgender,multiplehtids,comments,coder,Folder,Trimmed,filename
0,loc.ark+=13960=t5p851b8s,"Reid, Stuart J.",,Lord John Russell,1895,,New York;Harper & brothers;1,"Reid, Stuart J. | Lord John Russell",<blank>,27553,...,,bio,,u,,,morgan,gutenbiotrimmed,Trimmed,27553_5.txt
0,loc.ark+=13960=t5p851b8s,"Reid, Stuart J.",,Lord John Russell,1895,,New York;Harper & brothers;1,"Reid, Stuart J. | Lord John Russell",<blank>,27553,...,,bio,,u,,,morgan,gutenbiotrimmed,Trimmed,27553_4.txt
0,loc.ark+=13960=t5p851b8s,"Reid, Stuart J.",,Lord John Russell,1895,,New York;Harper & brothers;1,"Reid, Stuart J. | Lord John Russell",<blank>,27553,...,,bio,,u,,,morgan,gutenbiotrimmed,Trimmed,27553_6.txt
0,loc.ark+=13960=t5p851b8s,"Reid, Stuart J.",,Lord John Russell,1895,,New York;Harper & brothers;1,"Reid, Stuart J. | Lord John Russell",<blank>,27553,...,,bio,,u,,,morgan,gutenbiotrimmed,Trimmed,27553_7.txt
0,loc.ark+=13960=t5p851b8s,"Reid, Stuart J.",,Lord John Russell,1895,,New York;Harper & brothers;1,"Reid, Stuart J. | Lord John Russell",<blank>,27553,...,,bio,,u,,,morgan,gutenbiotrimmed,Trimmed,27553_3.txt
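The same mapping can probably be expressed as a merge, which avoids the row-by-row loop and gives each chunk its own index value. This is only a sketch on toy data: the frames below, the `'f'` value, and the name `chunkmeta_alt` are invented, and it assumes `gbindex` is stored as a string in both frames.

```python
import pandas as pd

# Toy stand-ins for ourtitles and cleanfiles; the values are illustrative only.
titles = pd.DataFrame({'gbindex': ['27553', '36965'],
                       'authgender': ['u', 'f']})
files = ['27553_3.txt', '27553_4.txt', '36965_3.txt']

chunks = pd.DataFrame({'filename': files})
# Same heuristic as get_gbindex: everything before the first underscore.
chunks['gbindex'] = chunks['filename'].str.split('_').str[0]

# validate='many_to_one' raises an error if gbindex is duplicated in titles.
chunkmeta_alt = chunks.merge(titles, on='gbindex', how='inner',
                             validate='many_to_one')
print(chunkmeta_alt)
```

A side benefit of `validate='many_to_one'` is that leftover duplicate gbindexes in the title metadata would fail loudly instead of silently multiplying chunks.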


### Analysis plan

We need to optimize the number of features and the regularization constant. But we don't want to overfit a particular sample. (There is in reality not a huge danger of overfitting with two parameters, but since we're trying to determine the risk of distortion as precisely as possible, it's best to be scrupulous here.)

So I would propose that we separate our test set (1/4 of the data) from the 3/4 of the data that we use for training and validation. That is to say, we cross-validate and optimize on 3/4 of the data (the same 3/4 for both clean and dirty corpora), and then finally test the model produced by that 3/4 on the held-out 1/4.

The one additional complication is that we need to divide the data *by author*, so that no author's chunks appear in both the training and test sets.
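One way the author-level split could be sketched is below. The function name `split_by_author`, the fixed seed, and the toy data are all assumptions for illustration; this is a first pass at the idea, not a settled procedure.

```python
import random
import pandas as pd

def split_by_author(meta, test_fraction=0.25, seed=0):
    """Hold out whole authors, so no author's chunks straddle the split."""
    authors = sorted(meta['author'].unique())
    random.Random(seed).shuffle(authors)   # reproducible shuffle
    n_test = int(round(len(authors) * test_fraction))
    test_authors = set(authors[:n_test])
    is_test = meta['author'].isin(test_authors)
    return meta.loc[~is_test], meta.loc[is_test]

# Toy stand-in for chunkmeta: eight invented authors with uneven chunk counts.
toy = pd.DataFrame({
    'author': ['A', 'A', 'B', 'C', 'C', 'C', 'D', 'E', 'F', 'G', 'H', 'H'],
    'filename': ['chunk%d.txt' % i for i in range(12)],
})
trainval, test = split_by_author(toy)
print(sorted(set(test['author'])), len(trainval), len(test))
```

Note that the fraction here applies to authors, not chunks, so the test set will only approximate 1/4 of the chunks when authors contribute different numbers of them. Fixing the seed is one way to get the same division for both the clean and dirty corpora.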

In [31]:
len(set(chunkmeta['author']))

114

In [32]:
authors = sorted(set(chunkmeta['author']))
for a in authors:
    print(a)

Abbott, Henry
Abbott, Jacob,
Alcott, Louisa May,
Allen, George Hoyt
Allen, Grant,
Barine, Arvède,
Barrie, J. M.
Bates, Arlo
Beerbohm, Julius
Bell, Henry Glassford,
Bettany, George Thomas,
Blind, Mathilde,
Blunden, Edmund
Broun, Heywood
Brown, E. E.
Burney, Fanny
Campan, Mme
Carter, Harry
Casson, Herbert Newton,
Castlemon, Harry
Cibber, Colley
Colmache, Édouard
Coolidge, Susan
Cooper, James Fenimore
Crockett, David
Cruttwell, Maud.
Daniels, Mabel W.
Duffy, Bella
Duncan, Norman
Farrar, Geraldine.
Gallishaw, John
Gaskell, Elizabeth Cleghorn
Gay, Sydney Howard,
Giberne, Agnes
Gilchrist, Anne Burrows,
Goodwin, William
Graham, Isabella
Grimaldi, Joseph,
Gronniosaw, James Albert
Habberton, John
Haldane, Elizabeth Sanderson,
Hall, Edward B.
Harland, Marion,
Harrison, James
Haynes, Henrietta.
Higginson, Thomas Wentworth,
Horne, C. Silvester
Howe, Julia Ward,
Hughes, Thomas
Ingemann, Bernhard Severin
Ingoldsby, Thomas
Jenkins, Hester Donaldson,
Johnson, Willis Fletcher,
Keller, Elizabeth Leavitt

Our next problem to resolve: check out Stowe above.
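One thing the listing does show is inconsistent trailing punctuation ("Abbott, Jacob," with a comma, "Cruttwell, Maud." with a period, "Abbott, Henry" with neither). If a single author ever appears under two such variants, they would be counted twice and could end up on both sides of an author-level split. This is a guess at a likely kind of problem, not a diagnosis of the Stowe case. A rough sketch of a grouping key, with the hypothetical helper name `author_key`:

```python
def author_key(name):
    """Strip surrounding whitespace and trailing commas or periods."""
    return name.strip().rstrip(' ,.')

for raw in ['Abbott, Jacob,', 'Cruttwell, Maud.', 'Barrie, J. M.']:
    print(repr(raw), '->', repr(author_key(raw)))
```

The key also drops the final period of initials ("Barrie, J. M"), which is harmless for grouping but means it shouldn't replace the displayed author name.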