# A template for experiments

I'm trying to develop a model we can use for experiments on the NEH data. But this is by no means set in stone yet; it's a first draft we should discuss and adjust.

For a first test, let's consider the problem of author gender. We know our model of gender is imperfect, and we don't imagine a predictive model trained on this boundary will tell us very much about gender directly; it's almost certainly, to some degree, a proxy for genre. But it's a tricky boundary to model and thus a good place to start. We're in no danger of getting 100% accuracy!

### Imports

In [66]:
import pandas as pd
import numpy as np
import os, random
from matplotlib import pyplot as plt
from sklearn.model_selection import cross_validate
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GroupKFold
from sklearn.model_selection import cross_val_predict

### Title metadata

Eventually we will have detailed metadata for each "chunk" of a title, reporting things like error levels, the ratio between the lengths of clean-text and ocr-text chunks, etc.

Right now I haven't created that yet. So our strategy for getting metadata will be to import title-level metadata and then map it out to the chunks we observe, using the heuristic that each chunk is named according to the formula ```gbindex_chunknumber.```

In [21]:
titlemeta = pd.read_csv('../metadata/cleanrowswithhathimatches.tsv', sep = '\t')

In [22]:
titlemeta.head()

Unnamed: 0,docid,author,authordate,title,latestcomp,hathidate,imprint,gutenstring,enumcron,gbindex,...,contents,instances,genre,audience,authgender,multiplehtids,comments,coder,Folder,Trimmed
0,loc.ark+=13960=t5p851b8s,"Reid, Stuart J.",,Lord John Russell,1895,,New York;Harper & brothers;1,"Reid, Stuart J. | Lord John Russell",<blank>,27553,...,,,bio,,u,,,morgan,gutenbiotrimmed,Trimmed
1,hvd.32044070870779,"Smiles, Samuel,",,Lives of the engineers,1879,,London;J. Murray;1874-1877.,"Smiles, Samuel | Lives of the Engineers",v. 5,27710,...,,,bio | short,,m,,"2 people, mixed together (not one per chapter)",morgan,gutenbiotrimmed,Trimmed
2,mdp.39015005892362,"Cruttwell, Maud.",,Luca Signorelli,1899,,London;G. Bell & sons;1899.,"Cruttwell, Maud | Luca Signorelli",<blank>,27759,...,,,bio,,f,,,morgan,gutenbiotrimmed,Trimmed
3,mdp.39015051108531,"Bettany, George Thomas,",,Life of Charles Darwin,1887,,London;W. Scott;1887.,"Bettany, George Thomas | Life of Charles Darwin",<blank>,28380,...,,,bio,,m,,,morgan,gutenbiotrimmed,Trimmed
4,loc.ark+=13960=t6b27z54n,"Gay, Sydney Howard,",,James Madison,1889,,"Boston;New York;Houghton, Mi","Gay, Sydney Howard | James Madison",<blank>,28992,...,,,bio,,u,,,morgan,gutenbiotrimmed,Trimmed


In [84]:
titlemeta.shape

(740, 24)

### Works we have actually trimmed and chunked so far

Let's limit the titles to those present in the Box folders that store processed texts.

In [4]:
cleanfiles = [x for x in os.listdir('/Users/tunder/Box Sync/NEHproject/cleannarratives/')
              if x.endswith('.txt')]
dirtyfiles = [x for x in os.listdir('/Users/tunder/Box Sync/NEHproject/dirtynarratives/')
               if x.endswith('.txt')]

In [5]:
assert len(cleanfiles) == len(dirtyfiles)
print("We have " + str(len(cleanfiles)) + " separate chunks.")

We have 1237 separate chunks.


What is actually in these data objects?

In [6]:
cleanfiles[0]

'36965_3.txt'

How many titles have we trimmed and chunked so far?

In [7]:
def get_gbindex(filename):
    return filename.split('_')[0]

gbdict = dict()

for filename in cleanfiles:
    gbindex = get_gbindex(filename)
    if gbindex not in gbdict:
        gbdict[gbindex] = []
    gbdict[gbindex].append(filename)

gbset = set(gbdict.keys())

In [8]:
print("But only " + str(len(gbset)) + " separate titles.")

But only 223 separate titles.


In [23]:
ourtitles = titlemeta.loc[titlemeta['gbindex'].isin(gbset), : ]
ourtitles.shape

(227, 24)

Lol, contrary to my assertion that ```cleanrowswithhathimatches``` has one row for each gbindex, there appear to still be duplicated gbindexes. I will need to fix that in the original. **(In fact, there's still a lot of work to be done on the original metadata.)** For right now, a kludge.

In [24]:
ourtitles = ourtitles.drop_duplicates(subset = 'gbindex')
ourtitles.shape

(223, 24)

In [11]:
ourtitles.sample(10)

Unnamed: 0,docid,author,authordate,title,latestcomp,hathidate,imprint,gutenstring,enumcron,gbindex,...,contents,instances,genre,audience,authgender,multiplehtids,comments,coder,Folder,Trimmed
213,uc1.$b288364,"Church, Alfred John",1829-1912.,Stories of the Old world,1884,1884.0,"Boston;Ginn, Heath & co.;188","Church, Alfred John | Stories of the Old world",,43982,...,,1.0,fic | folklore | short,juv,m,,,ted,gutenfictiontrimmed,Trimmed
43,uc2.ark+=13960=t22b8z55p,"Carter, Harry",,The autobiography of a Cornish smuggler,1809,1900.0,London;Gibbings & Co.;Truro;,"Carter, Harry | The Autobiography of a Cornish...",,40008,...,,1.0,bio,,m,,,ted,gutenbiotrimmed,Trimmed
454,uc2.ark+=13960=t7wm14b10,"Reed, Myrtle",1874-1911.,Master of the vineyard,1910,1910.0,New York;G.P. Putnam;1910.,"Reed, Myrtle | Master of the Vineyard",,27661,...,,2.0,fic | romance,,f,,,wenyi,gutenfictiontrimmed,Trimmed
455,uc2.ark+=13960=t5cc0wf99,"Tracy, Louis",1863-1928.,Cynthia's chauffeur,1910,1910.0,New York;Grosset & Dunlap;c1,"Tracy, Louis | Cynthia's Chauffeur",,31472,...,,2.0,fic,,m,,,peizhen,gutenfictiontrimmed,Trimmed
83,umn.319510019644103,"Stratton, Royal B",d. 1875.,Captivity of the Oatman girls:,1858,1858.0,New-York;Pub. for the author,"Stratton, Royal B. | Captivity of the Oatman G...",,55071,...,,1.0,bio,,m,,,peizhen,gutenbiotrimmed,Trimmed
66,njp.32101066164219,"Grimaldi, Joseph,",,Memoirs of Joseph Grimaldi,1846,,London;R. Bentley;1846.,"Grimaldi, Joseph | Memoirs of Joseph Grimaldi",<blank>,46709,...,,,bio,,m,,,morgan,gutenbiotrimmed,Trimmed
129,osu.32435002999571,"Kennedy, John Pendleton",1795-1870.,Horse-shoe Robinson,1835,1835.0,London;Bentley;1835.,"Kennedy, John Pendleton | Horse-Shoe Robinson",vol.1,33478,...,,1.0,fic | historical,,m,,,peizhen,gutenfictiontrimmed,Trimmed
21,mdp.39015061869783,"Tappan, Eva March",1854-1930.,In the days of Queen Victoria,1903,1903.0,"Boston;Lothrop, Lee & Shepar","Tappan, Eva March | In the Days of Queen Victoria",,35576,...,,1.0,bio,,f,,,wenyi,gutenbiotrimmed,Trimmed
369,nyp.33433074932322,"Barr, Robert",1850-1912.,The speculations of John Steele,1905,1905.0,New York;F. A. Stokes compan,"Barr, Robert | The Speculations of John Steele",,55328,...,,2.0,fic,,m,,,ted,gutenfictiontrimmed,Trimmed
77,uc2.ark+=13960=t27942z7w,"Jenkins, Hester Donaldson,",,Ibrahim Pasha,1911,,New York;Columbia University,"Jenkins, Hester Donaldson | Ibrahim Pasha",<blank>,51299,...,,,bio,,f,,,morgan,gutenbiotrimmed,Trimmed


#### size of the smaller class

In [13]:
sum(ourtitles.authgender == 'f')

81

### Chunk metadata

I've also produced chunk-level metadata, including a word error rate for each chunk.

In [14]:
chunkmeta = pd.read_csv('../metadata/chunks_w_error_rates.tsv', sep = '\t')
chunkmeta.shape

(1254, 5)

In [87]:
chunkmeta.sample(10)

Unnamed: 0,chunkid,gutenlen,hathilen,rejectedchars,wer,gbindex
954,42147_0,89274,83894,0,0.033099,42147
1094,54218_5,78340,76892,0,0.030434,54218
617,28271_1,72778,70995,0,0.012405,28271
853,37059_1,82548,89177,0,0.083024,37059
100,31210_2,76392,76864,0,0.050161,31210
1083,52072_10,79877,77545,0,0.043375,52072
948,41324_1,93766,97781,0,0.025723,41324
482,28366_3,81063,87463,0,0.020563,28366
442,51468_0,70936,74117,0,0.035709,51468
163,41286_11,2469,179,0,0.993197,41286


There are still some very interesting things happening in the pipeline. *Mostly* we have very low word error rates, but there are a few extremely weird cases. 

In [18]:
chunkmeta = chunkmeta.drop_duplicates(subset = 'chunkid')

In [19]:
chunkmeta['gbindex'] = chunkmeta.chunkid.apply(get_gbindex)

In [25]:
fullmeta = chunkmeta.merge(ourtitles, how = 'inner', on = 'gbindex')

In [26]:
fullmeta.shape

(1237, 29)

In [28]:
fullmeta.columns

Index(['chunkid', 'gutenlen', 'hathilen', 'rejectedchars', 'wer', 'gbindex',
       'docid', 'author', 'authordate', 'title', 'latestcomp', 'hathidate',
       'imprint', 'gutenstring', 'enumcron', 'nonficprob', 'juvenileprob',
       'LOCgenres', 'LOCsubjects', 'contents', 'instances', 'genre',
       'audience', 'authgender', 'multiplehtids', 'comments', 'coder',
       'Folder', 'Trimmed'],
      dtype='object')

### Balancing the distribution of classes across time

Language change is very easy to model, so if you try to model the boundary between two categories that happen to be distributed differently across time (in your collection), you're very likely to get a model of language change. That's a problem if you want to study the categorical difference in itself, separated from confounding issues of chronology that might just be selection bias.

Here we're distinguishing books written by men from those written by women. And although we don't really care about the model in its own right (since we're interested in the consequences of OCR distortion), it's still important to know what we're modeling, because OCR distortion could have *different* effects on different kinds of boundaries (e.g. chronological or demographic). 

So we need to be careful to balance the classes across time. At a minimum, we should require the median date for both categories to be roughly the same. A more ambitious approach would match the full distribution. But for right now let's keep it simple.

In [60]:
def match_medians(smaller_indexes, larger_indexes, metadata):
    '''
    smaller_indexes = indexes of metadata for the smaller class
    larger_indexes = indexes of metadata for the larger class
    '''
    selected_from_large = []
    smaller_median = np.median(metadata.loc[smaller_indexes, 'latestcomp'])
    
    largerdf = metadata.loc[larger_indexes, : ]
    above_median = largerdf.loc[largerdf['latestcomp'] >= smaller_median, : ].index.tolist()
    below_median = largerdf.loc[largerdf['latestcomp'] <= smaller_median, : ].index.tolist()
    
    for i in range(len(smaller_indexes)):
        if len(selected_from_large) > 0:
            larger_median = np.median(metadata.loc[selected_from_large, 'latestcomp'])
        else:
            larger_median = smaller_median
        
        if larger_median >= smaller_median and len(below_median) > 0:
            selected = random.sample(below_median, 1)[0]
            below_median.pop(below_median.index(selected))
            selected_from_large.append(selected)
        elif larger_median <= smaller_median and len(above_median) > 0:
            selected = random.sample(above_median, 1)[0]
            above_median.pop(above_median.index(selected))
            selected_from_large.append(selected)
        else:
            # we have no more items that won't distort the median
            break

    return selected_from_large, smaller_median, larger_median

indexes_f = ourtitles.loc[ourtitles.authgender == 'f', : ].index.tolist()
indexes_m = ourtitles.loc[ourtitles.authgender == 'm', : ].index.tolist()

selected_m, median_f, median_m = match_medians(indexes_f, indexes_m, ourtitles)  

In [61]:
print(median_f, median_m, len(selected))

1902.0 1900.5 81


In [63]:
all_selected_vols = indexes_f + selected
len(all_selected_vols)

162

In [64]:
selected_gbindexes = ourtitles.loc[all_selected_vols, 'gbindex']
print('We have ', len(selected_gbindexes))
selected_gbindexes[0:10]

We have  162


2     27759
8     32511
10    32835
12    33345
14    33537
20    35418
21    35576
24    36754
25    36847
27    36965
Name: gbindex, dtype: object

In [65]:
modelmeta = fullmeta.loc[fullmeta['gbindex'].isin(selected_gbindexes), : ]
modelmeta.shape

(897, 29)

In [68]:
modelmeta.head()

Unnamed: 0,chunkid,gutenlen,hathilen,rejectedchars,wer,gbindex,docid,author,authordate,title,...,contents,instances,genre,audience,authgender,multiplehtids,comments,coder,Folder,Trimmed
0,37908_0,78086,78835,0,0.064934,37908,wu.89006385306,"Opie, Amelia Alderson",1769-1853.,Adeline Mowbray,...,,1.0,fic,,f,,,wenyi,gutenfictiontrimmed,Trimmed
1,37908_1,78052,79125,0,0.083327,37908,wu.89006385306,"Opie, Amelia Alderson",1769-1853.,Adeline Mowbray,...,,1.0,fic,,f,,,wenyi,gutenfictiontrimmed,Trimmed
2,37908_2,78084,78744,0,0.051356,37908,wu.89006385306,"Opie, Amelia Alderson",1769-1853.,Adeline Mowbray,...,,1.0,fic,,f,,,wenyi,gutenfictiontrimmed,Trimmed
3,37908_3,78069,76923,0,0.058616,37908,wu.89006385306,"Opie, Amelia Alderson",1769-1853.,Adeline Mowbray,...,,1.0,fic,,f,,,wenyi,gutenfictiontrimmed,Trimmed
4,37908_4,78042,79252,0,0.051276,37908,wu.89006385306,"Opie, Amelia Alderson",1769-1853.,Adeline Mowbray,...,,1.0,fic,,f,,,wenyi,gutenfictiontrimmed,Trimmed


### Term-document matrices for clean and dirty narratives 


In [69]:
clean_rootdir = '/Users/tunder/Box Sync/NEHproject/cleannarratives/'
dirty_rootdir = '/Users/tunder/Box Sync/NEHproject/dirtynarratives/'

clean_paths = []
dirty_paths = []

for chunk_id in modelmeta['chunkid']:
    clean_paths.append(clean_rootdir + chunk_id + '.txt')
    dirty_paths.append(dirty_rootdir + chunk_id + '.txt')
    

    

In [73]:
if 'chunkid' in modelmeta.columns:         
    model_chunk_ids = modelmeta['chunkid']
    modelmeta = modelmeta.set_index('chunkid')   # If we haven't made this the index yet, let's do it.
else:
    model_chunk_ids = modelmeta.index.tolist()


vectorizer = CountVectorizer(input = 'filename', min_df = .04)
sparse_clean_counts = vectorizer.fit_transform(clean_paths) # the vectorizer produces something
                                                               # called a 'sparse matrix'; we need to
                                                               # unpack it
clean_wordcounts = pd.DataFrame(sparse_clean_counts.toarray(), index = model_chunk_ids, 
                            columns = vectorizer.get_feature_names())
clean_wordcounts.head()

Unnamed: 0,000,10,100,10th,11,11th,12,12th,13,13th,...,yours,yourself,yourselves,youth,youthful,youths,zeal,zealous,zest,zone
37908_0,0,0,0,0,0,0,0,0,0,0,...,1,1,0,1,2,0,0,0,0,0
37908_1,0,0,0,0,0,0,0,0,0,0,...,3,5,0,2,0,0,0,0,0,0
37908_2,0,0,0,0,0,0,0,0,0,0,...,6,2,1,2,0,0,0,0,0,0
37908_3,0,0,0,0,0,0,0,0,0,0,...,3,5,0,1,0,0,0,0,0,0
37908_4,0,0,0,0,0,0,0,0,0,0,...,2,11,0,2,0,0,0,0,0,0


In [74]:
vectorizer = CountVectorizer(input = 'filename', min_df = .04)
sparse_dirty_counts = vectorizer.fit_transform(dirty_paths) # the vectorizer produces something
                                                               # called a 'sparse matrix'; we need to
                                                               # unpack it
dirty_wordcounts = pd.DataFrame(sparse_dirty_counts.toarray(), index = model_chunk_ids, 
                            columns = vectorizer.get_feature_names())
dirty_wordcounts.head()

Unnamed: 0,000,10,100,101,102,103,104,105,106,107,...,yours,yourself,yourselves,youth,youthful,youths,zeal,zealous,zest,zone
37908_0,0,0,0,0,0,0,0,0,0,0,...,1,1,0,1,2,0,0,0,0,0
37908_1,0,0,1,0,0,0,1,1,1,0,...,3,3,0,2,0,0,0,0,0,0
37908_2,0,0,0,0,0,0,0,0,0,0,...,6,2,1,2,0,0,0,0,0,0
37908_3,0,0,0,1,1,0,0,1,1,1,...,3,5,0,1,0,0,0,0,0,0
37908_4,0,0,0,0,0,0,0,0,0,0,...,2,10,0,2,0,0,0,0,0,0


In [76]:
clean_rowsums = clean_wordcounts.sum(axis = 'columns')
clean_freqs = clean_wordcounts.divide(clean_rowsums, axis = 'rows')
scaler = StandardScaler()
clean_freqs = scaler.fit_transform(X)

In [77]:
dirty_rowsums = dirty_wordcounts.sum(axis = 'columns')
dirty_freqs = dirty_wordcounts.divide(clean_rowsums, axis = 'rows')

In [80]:
dirty_freqs.head()

Unnamed: 0,000,10,100,101,102,103,104,105,106,107,...,yours,yourself,yourselves,youth,youthful,youths,zeal,zealous,zest,zone
37908_0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,8.1e-05,8.1e-05,0.0,8.1e-05,0.000163,0.0,0.0,0.0,0.0,0.0
37908_1,0.0,0.0,8.1e-05,0.0,0.0,0.0,8.1e-05,8.1e-05,8.1e-05,0.0,...,0.000244,0.000244,0.0,0.000163,0.0,0.0,0.0,0.0,0.0,0.0
37908_2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000489,0.000163,8.1e-05,0.000163,0.0,0.0,0.0,0.0,0.0,0.0
37908_3,0.0,0.0,0.0,8.1e-05,8.1e-05,0.0,0.0,8.1e-05,8.1e-05,8.1e-05,...,0.000242,0.000403,0.0,8.1e-05,0.0,0.0,0.0,0.0,0.0,0.0
37908_4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000163,0.000817,0.0,0.000163,0.0,0.0,0.0,0.0,0.0,0.0


### Let's produce a model for the clean counts

We're going to do a grid search for the best model. The outer loop will select the number of features. The inner loop will select the regularization constant.

In selecting the top *n* features, we will always select the *n* with top *document* frequency.

Eventually we're going to want to separate the train-and-validation loop from a real, separate test set. But we don't have all the data yet; still processing; so it doesn't make a lot of sense to discuss that yet.

In [82]:
for featurecount in [2000, 3000, 4000, 5000, 6000, 7000]:
    docfreqs = []
    for col in clean_freqs.columns:
        docfreqs.append((sum(clean_freqs[col] > 0), col))
    docfreqs.sort()
    features = [x[1] for x in docfreqs[0: featurecount]]
    
    model_features = clean_freqs.loc[ : , features]
    
    scaler = StandardScaler()
    model_features = scaler.fit_transform(model_features)
    
    for c_param in [1, .1, .01, .001, .0001, .00001, .000001]:
        logreg = LogisticRegression(C = c_param, max_iter = 1000)
        
        grouper = GroupKFold(n_splits = 10)
        cv_results = cross_validate(logreg, model_features, modelmeta['authgender'], groups = modelmeta['author'], cv = grouper)
        cv_results = cross_validate(logreg, model_features, modelmeta['authgender'], cv = 10)
        mean_score = np.mean(cv_results['test_score'])
        print(featurecount, c_param, mean_score)
    
    
        

2000 1 0.6165792759051185
2000 0.1 0.6232833957553059
2000 0.01 0.6333333333333333
2000 0.001 0.6500873907615481
2000 0.0001 0.6712983770287141
2000 1e-05 0.5116853932584269
2000 1e-06 0.5094631710362046
3000 1 0.6456803995006243
3000 0.1 0.6445692883895131
3000 0.01 0.6479151061173534
3000 0.001 0.6546192259675405
3000 0.0001 0.672434456928839
3000 1e-05 0.5428464419475656
3000 1e-06 0.5094631710362046
4000 1 0.6779525593008738
4000 0.1 0.6734956304619226
4000 0.01 0.6757303370786517
4000 0.001 0.6857677902621724
4000 0.0001 0.6980524344569288
4000 1e-05 0.5952434456928838
4000 1e-06 0.5094631710362046
5000 1 0.7024594257178527
5000 0.1 0.7024594257178527
5000 0.01 0.7013233458177279
5000 0.001 0.7203245942571785
5000 0.0001 0.7326217228464419
5000 1e-05 0.6287141073657928
5000 1e-06 0.5094631710362046
6000 1 0.7213732833957552
6000 0.1 0.7224843945068663
6000 0.01 0.7258426966292134
6000 0.001 0.723645443196005
6000 0.0001 0.732621722846442
6000 1e-05 0.676629213483146
6000 1e-06 0.5

In [83]:
for featurecount in [6000, 7000]:
    docfreqs = []
    for col in dirty_freqs.columns:
        docfreqs.append((sum(dirty_freqs[col] > 0), col))
    docfreqs.sort()
    features = [x[1] for x in docfreqs[0: featurecount]]
    
    model_features = dirty_freqs.loc[ : , features]
    
    scaler = StandardScaler()
    model_features = scaler.fit_transform(model_features)
    
    for c_param in [1, .1, .01, .001, .0001, .00001, .000001]:
        logreg = LogisticRegression(C = c_param, max_iter = 1000)
        
        grouper = GroupKFold(n_splits = 10)
        cv_results = cross_validate(logreg, model_features, modelmeta['authgender'], groups = modelmeta['author'], cv = grouper)
        cv_results = cross_validate(logreg, model_features, modelmeta['authgender'], cv = 10)
        mean_score = np.mean(cv_results['test_score'])
        print(featurecount, c_param, mean_score)
    

6000 1 0.7024843945068664
6000 0.1 0.7192134831460675
6000 0.01 0.7258551810237204
6000 0.001 0.7258676654182272
6000 0.0001 0.720287141073658
6000 1e-05 0.6621847690387016
6000 1e-06 0.5094631710362046
7000 1 0.7035705368289638
7000 0.1 0.7180524344569287
7000 0.01 0.7280774032459426
7000 0.001 0.7380898876404495
7000 0.0001 0.7280898876404494
7000 1e-05 0.6844694132334581
7000 1e-06 0.5094631710362046


In [30]:
chunkmeta = []

for idx, row in ourtitles.iterrows():
    
    gbindex = row['gbindex']
    files4thisindex = gbdict[gbindex]
    
    for filename in files4thisindex:
        chunkrow = pd.Series(row)
        chunkrow['filename'] = filename
        chunkmeta.append(chunkrow)

chunkmeta = pd.DataFrame(chunkmeta)
chunkmeta.head()

Unnamed: 0,docid,author,authordate,title,latestcomp,hathidate,imprint,gutenstring,enumcron,gbindex,...,instances,genre,audience,authgender,multiplehtids,comments,coder,Folder,Trimmed,filename
0,loc.ark+=13960=t5p851b8s,"Reid, Stuart J.",,Lord John Russell,1895,,New York;Harper & brothers;1,"Reid, Stuart J. | Lord John Russell",<blank>,27553,...,,bio,,u,,,morgan,gutenbiotrimmed,Trimmed,27553_5.txt
0,loc.ark+=13960=t5p851b8s,"Reid, Stuart J.",,Lord John Russell,1895,,New York;Harper & brothers;1,"Reid, Stuart J. | Lord John Russell",<blank>,27553,...,,bio,,u,,,morgan,gutenbiotrimmed,Trimmed,27553_4.txt
0,loc.ark+=13960=t5p851b8s,"Reid, Stuart J.",,Lord John Russell,1895,,New York;Harper & brothers;1,"Reid, Stuart J. | Lord John Russell",<blank>,27553,...,,bio,,u,,,morgan,gutenbiotrimmed,Trimmed,27553_6.txt
0,loc.ark+=13960=t5p851b8s,"Reid, Stuart J.",,Lord John Russell,1895,,New York;Harper & brothers;1,"Reid, Stuart J. | Lord John Russell",<blank>,27553,...,,bio,,u,,,morgan,gutenbiotrimmed,Trimmed,27553_7.txt
0,loc.ark+=13960=t5p851b8s,"Reid, Stuart J.",,Lord John Russell,1895,,New York;Harper & brothers;1,"Reid, Stuart J. | Lord John Russell",<blank>,27553,...,,bio,,u,,,morgan,gutenbiotrimmed,Trimmed,27553_3.txt


### Analysis plan

We need to optimize the number of features and the regularization constant. But we don't want to overfit a particular sample. (There is in reality not a huge danger of overfitting with two parameters, but since we're trying to determine the risk of distortion as precisely as possible, it's best to be scrupulous here.)

So I would propose that we separate our test set (1/4 of the data) from 3/4 of the data that we use for training-and-validation. That's to say, we cross-validate and optimize on 3/4 of the data (the same 3/4 for both clean and dirty corpora). And then finally test the model produced by that 3/4 on a held-out 1/4 of the data.

The one additional complication is that we need to be dividing the data *by author.*

In [31]:
len(set(chunkmeta['author']))

114

In [32]:
authors = list(set(chunkmeta['author']))
authors.sort()
for a in authors:
    print(a)

Abbott, Henry
Abbott, Jacob,
Alcott, Louisa May,
Allen, George Hoyt
Allen, Grant,
Barine, Arvède,
Barrie, J. M.
Bates, Arlo
Beerbohm, Julius
Bell, Henry Glassford,
Bettany, George Thomas,
Blind, Mathilde,
Blunden, Edmund
Broun, Heywood
Brown, E. E.
Burney, Fanny
Campan, Mme
Carter, Harry
Casson, Herbert Newton,
Castlemon, Harry
Cibber, Colley
Colmache, Édouard
Coolidge, Susan
Cooper, James Fenimore
Crockett, David
Cruttwell, Maud.
Daniels, Mabel W.
Duffy, Bella
Duncan, Norman
Farrar, Geraldine.
Gallishaw, John
Gaskell, Elizabeth Cleghorn
Gay, Sydney Howard,
Giberne, Agnes
Gilchrist, Anne Burrows,
Goodwin, William
Graham, Isabella
Grimaldi, Joseph,
Gronniosaw, James Albert
Habberton, John
Haldane, Elizabeth Sanderson,
Hall, Edward B.
Harland, Marion,
Harrison, James
Haynes, Henrietta.
Higginson, Thomas Wentworth,
Horne, C. Silvester
Howe, Julia Ward,
Hughes, Thomas
Ingemann, Bernhard Severin
Ingoldsby, Thomas
Jenkins, Hester Donaldson,
Johnson, Willis Fletcher,
Keller, Elizabeth Leavitt

Our next problem to resolve is, check out Stowe above.