# Short story model

The code provided here extracts a short story corpus and a novel corpus from our master metadata, selecting volumes by using words in the title. ("Other" and "stories" select for short stories. "A novel," without "stories" or "tales" anywhere in the title, selects for novels.) 

Then we collect topic proportions for those volumes, and train a model to distinguish short stories from novels.

We do this two ways. First we select a test set at random by choosing individual volumes. This provides 80.5% accuracy. But since many authors tend to be short story writers or novelists for much of their career, there's some possibility that this model will "cheat" by learning to recognize specific writers, instead of really learning a generalizable model of the forms.

To avoid fooling ourselves in this way, we re-run the experiment in the second half of the notebook while rigorously separating the authors in the training set from authors in the test set. This avoids "leaking" information about authors from test to train.

There is still a kind of information leakage happening here, because *the topic model itself* is trained on all volumes at once. This means that the topics are shaped, in some small part, by the ~150 volumes in our test set, which may have helped to create topics that are specifically reflective of those volumes. It's unsupervised modeling, of course; it's not like the topic model knows these are/aren't short stories. So the information leakage is very indirect. But still, it's possible that we will get slightly higher accuracy than we would have if we trained a model without those 150 test volumes and used it to predict their topic proportions. 

However, the effect of 150 volumes in a topic model of 29,000+ is likely to be extremely small. And we're sampling authors 20 times to average out random variation. It would really be prohibitively difficult to train 20 topic models with 20 different sets of held-out authors. I think for the purpose of this project, it's sufficient to note that the actual accuracy we would get in the wild might be a couple tenths of a percentage point lower. A difference of that scale won't matter for the point we're illustrating.
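To put a rough number on that, here is a back-of-the-envelope calculation. It assumes the topic model was trained on about as many volumes as there are rows in the metadata table (29,351), which is my reading of "29,000+" rather than something this notebook verifies.

```python
# Figures taken from this notebook: the metadata table has 29,351 rows,
# and each held-out-author test set holds roughly 150 volumes.
# Assumption: the topic model saw about as many volumes as the metadata lists.
corpus_size = 29351
test_size = 150

share = test_size / corpus_size
print(f"Test volumes are about {share:.2%} of the topic-model corpus")
```

On those assumptions, the test volumes make up roughly half a percent of what the topic model saw.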

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

In [2]:
meta = pd.read_csv('../metadata/finalcorpus.tsv', sep = '\t')
meta.head()

Unnamed: 0,docid,hathi_author,hathi_title,authordate,birthyear,deathyear,chi_date,ukw_date,copyright_date,firstpub,...,distances,copyright_corpus,manual_corpus,us_national,pubdate_known,authof3ormore,tokens,pagesinchunk,tokensperpage,omnibus
0,loc.ark+=13960=t49p3qv8g,"Thwing, Edward P[ayson]","Outdoor life in Europe, sketches of men and ma...",,,,,,,1880,...,,False,False,False,False,False,45865,54,849.352,False
1,loc.ark+=13960=t7wm1fd5j,"Jones, Joseph",Major Jones's travels,1812-1882.,1812.0,,,,,1880,...,,False,False,False,False,False,42135,204,206.544,False
2,nyp.33433081882650,"Allan-Olney, Mary",The new Virginians,,,,,,,1880,...,,False,False,False,False,False,32052,244,131.361,False
3,uva.x030742146,,"The Growing world; or, Progress of civilizatio...",,,,,,,1880,...,,False,False,False,False,False,374115,420,890.75,False
4,nyp.33433074386792,of Samosata. Lucian,A traveller's true tale,,,,,,,1880,...,,False,False,False,False,False,11848,128,92.562,False


## Create the story and novel subcorpora

In [3]:
print(meta.shape)
metaclean = meta[~pd.isnull(meta.hathi_title)]
print(metaclean.shape)
stories = metaclean[(metaclean.hathi_title.str.contains('stories')) & 
                    (metaclean.hathi_title.str.contains('other'))]
stories.shape

(29351, 23)
(29311, 23)


(722, 23)

In [4]:
stories.head()

Unnamed: 0,docid,hathi_author,hathi_title,authordate,birthyear,deathyear,chi_date,ukw_date,copyright_date,firstpub,...,distances,copyright_corpus,manual_corpus,us_national,pubdate_known,authof3ormore,tokens,pagesinchunk,tokensperpage,omnibus
93,uc2.ark+=13960=t3707xn80,"Perry, Nora","The tragedy of the unexpected, and other stories",1831-1896.,1831.0,,,,,1880,...,,False,False,False,False,False,37848,251,150.789,False
149,uc2.ark+=13960=t7gq6s478,"Hawthorne, Julian","Yellow-cap, and other fairy-stories for children",1846-1934.,1846.0,1934.0,,1880.0,,1880,...,53.797678,False,True,True,True,True,29201,242,120.665,False
223,nyp.33433082525563,"Arthur, T. S. (Timothy Shay)",Who is the greatest? And other stories,1809-1885.,1809.0,,,,,1881,...,,False,False,False,False,False,25127,258,97.391,False
233,uc2.ark+=13960=t5bc41d87,"Besant, Walter, Sir","The ten years' tenant, and other stories",1836-1901.,1836.0,,,,,1881,...,,False,False,False,False,False,70571,297,237.613,False
246,nyp.33433082525589,"Arthur, T. S. (Timothy Shay)","Pierre, the organ-boy, and other stories",1809-1885.,1809.0,,,,,1881,...,,False,False,False,False,False,25520,254,100.472,False


In [222]:
novels = metaclean[(~metaclean.hathi_title.str.contains('stories')) & 
                       (~metaclean.hathi_title.str.contains('tales')) &
                       (~metaclean.hathi_title.str.contains('Stories')) &
                       (~metaclean.hathi_title.str.contains('Tales')) &
                       ((metaclean.hathi_title.str.contains('a novel')) |
                        (metaclean.hathi_title.str.contains('A Novel')))
                      ]
novels.shape

(2362, 23)
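The filters above are case-sensitive, so they spell out each capitalization by hand. A possible alternative, sketched here on invented toy titles rather than the real metadata, is to pass `case=False` to `str.contains`. This is not the selection that produced the counts in this notebook, and it would likely return somewhat different numbers.

```python
import pandas as pd

# Toy titles, invented for illustration only.
toy = pd.DataFrame({"hathi_title": [
    "Bliss, and other stories",
    "The second story man : a novel",
    "Tales of the sea; A Novel",
    "A NOVEL of manners",
    None,
]})

# Treat missing titles as empty strings so the masks stay boolean.
titles = toy.hathi_title.fillna("")

# Case-insensitive versions of the two selection rules.
is_collection = titles.str.contains("stories|tales", case=False)
is_novel = titles.str.contains("a novel", case=False) & ~is_collection

toy_stories = toy[titles.str.contains("stories", case=False)
                  & titles.str.contains("other", case=False)]
toy_novels = toy[is_novel]

print(toy_stories.hathi_title.tolist())
print(toy_novels.hathi_title.tolist())
```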

In [146]:
selectedstories = stories.sample(710)
selectedstories.shape

(710, 23)

#### Get a subset of novels chronologically matching the story corpus

In [223]:
selectednovels = []
badidxes = []

for idx, row in selectedstories.iterrows():
    thisdate = row.firstpub
    options = novels[(novels.firstpub >= thisdate-2) & (novels.firstpub <= thisdate + 2)]
    found = False
    tried = 0
    while not found:
        chosen = options.sample(1)
        thisidx = chosen.index[0]
        tried += 1
        if thisidx not in selectednovels:
            selectednovels.append(thisidx)
            found = True
        if tried > len(options):
            badidxes.append(idx)
            break

print(len(selectednovels))   

710
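The loop above retries random draws until it finds a novel that hasn't been used yet. A variant that filters out used novels before sampling, and that skips a story when no candidate exists, might look like the sketch below. The function name `match_by_date`, its parameters, and the toy frames are all hypothetical; this is not the code that produced the 710 matches.

```python
import pandas as pd

def match_by_date(stories, novels, window=2, seed=0):
    """Pick one unused novel within +/- window years of each story volume."""
    used = set()
    unmatched = []
    for idx, row in stories.iterrows():
        # Novels published close to this story collection.
        candidates = novels[(novels.firstpub >= row.firstpub - window)
                            & (novels.firstpub <= row.firstpub + window)]
        # Drop novels that were already assigned to an earlier story.
        candidates = candidates.loc[~candidates.index.isin(list(used))]
        if candidates.empty:
            unmatched.append(idx)
            continue
        used.add(candidates.sample(1, random_state=seed).index[0])
    return sorted(used), unmatched

# Toy data, invented for illustration.
toy_stories = pd.DataFrame({"firstpub": [1900, 1901, 1950]})
toy_novels = pd.DataFrame({"firstpub": [1899, 1902, 1925]}, index=[10, 11, 12])

matched, unmatched = match_by_date(toy_stories, toy_novels)
print(matched, unmatched)
```

In the toy data, the 1950 story has no novel within two years, so it ends up in `unmatched` instead of raising an error.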


In [150]:
selectedstories.shape

(710, 23)

In [27]:
selectedstories.sample(8)

Unnamed: 0,docid,hathi_author,hathi_title,authordate,birthyear,deathyear,chi_date,ukw_date,copyright_date,firstpub,...,distances,copyright_corpus,manual_corpus,us_national,pubdate_known,authof3ormore,tokens,pagesinchunk,tokensperpage,omnibus
13664,mdp.39015059406119,"Phillpotts, Eden",They could do no other; a volume of stories by...,1862-1960.,1862.0,1960.0,,,1931.0,1931,...,43.036775,True,False,False,True,True,46845,238,196.828,False
11126,uva.x000157545,"Mansfield, Katherine","Bliss, and other stories",1888-1923.,1888.0,1923.0,,,1923.0,1921,...,35.137842,True,False,False,True,True,39050,238,164.076,False
10675,wu.89094611258,"Levinger, Elma Ehrlich",Playmates in Egypt : and other stories,1887-,1887.0,,,,,1920,...,,False,False,False,False,False,12224,112,109.143,False
11626,mdp.39015031304671,"Tarkington, Booth",The fascinating stranger : an other stories,1869-1946.,1869.0,1946.0,,,1923.0,1923,...,30.205821,True,False,True,True,True,57555,416,138.353,False
621,uc2.ark+=13960=t8kd26w5q,"Turgenev, Ivan Sergi︠e︡evích","A reckless character, and other stories",1818-1883.,1818.0,,,,,1883,...,,False,False,False,False,False,47395,332,142.756,False
2157,wu.89098017650,"Crowley, Mary Catherine","Happy-go-lucky, and other stories",,,,,,,1889,...,,False,False,False,False,False,18926,174,108.77,False
23773,mdp.39015030748860,"Guerrero, Amadis María",The mainstream and other stories,,,,,,,1975,...,,False,False,False,False,False,15900,64,248.438,False
24244,mdp.39015005062339,"Aldiss, Brian Wilson","Last orders, and other stories",1925-,1925.0,,,,,1977,...,,False,False,False,True,True,42054,187,224.888,False


In [151]:
np.mean(selectedstories.firstpub)

1941.5394366197183

## Concatenate and write to tsv

In [225]:
selectednovels = novels.loc[selectednovels]
np.mean(selectednovels.firstpub)

1941.5746478873239

In [226]:
selectedstories['genre'] = 'short'
selectednovels['genre'] = 'long'

In [227]:
shortvsnovelmeta =  pd.concat([selectedstories, selectednovels])

In [156]:
shortmodelmeta.to_csv('shortmodelmeta.tsv', sep = '\t')

In [228]:
shortvsnovelmeta.to_csv('shortvsnovelmeta.tsv', sep = '\t')

## Here something important happens in another notebook

Sorry for the slightly confused sequence here. At this point execution moves to ```./topicdata/MakeShortStoryData.ipynb```, which combines ```shortvsnovelmeta.tsv``` with raw topic model results to create aggregated book-level topic proportions.

I load the results of that process below.

In [5]:
data = pd.read_csv('../topicdata/shortvsnoveldata.tsv', sep = '\t')
data.head()

Unnamed: 0,t0,t1,t2,t3,t4,t5,t6,t7,t8,t9,...,t199,docid,genre,birthyear,firstpub,hathi_author,hathi_title,us_national,authof3ormore,age
0,4.6e-05,5e-06,0.009303,2.4e-05,0.000293,4.6e-05,2.3e-05,0.000636,0.000169,0.012015,...,2e-06,osu.32435001753706,long,1881.0,1945,"Abbott, Jane",Mary Patten's daughters : a novel,True,True,64.0
1,0.000538,0.001999,9e-05,4e-06,0.001392,5e-06,8.2e-05,0.006044,7e-06,0.046004,...,0.001252,uc1.$b71369,long,1871.0,1929,"Adams, Samuel Hopkins",The flagrant years : a novel of the beauty market,True,True,58.0
2,6e-06,0.01879,0.023701,4e-06,0.000635,0.0231,3e-06,0.003576,0.000692,8e-06,...,0.000775,mdp.39015059384274,short,1889.0,1925,"Aiken, Conrad",Bring! bring! and other stories,True,True,36.0
3,6e-06,5e-06,0.00185,4e-06,1e-05,0.005259,2e-06,0.011151,7e-06,0.000112,...,0.000317,uc1.b4449482,long,,1975,"Albert, Mimi",The second story man : a novel,True,False,
4,0.000186,0.003024,0.007866,4.9e-05,0.0001,0.026803,2e-06,0.052632,0.001825,7e-06,...,2e-06,mdp.39015063779485,long,1892.0,1946,"Aldington, Richard","The romance of Casanova, a novel",False,True,54.0


#### create a stratified training set

In [6]:
shortstorydata = data[data.genre == 'short'].sample(frac = 1)
longstorydata = data[data.genre == 'long'].sample(frac = 1)

In [7]:
training = pd.concat([shortstorydata.iloc[0: 638, 0:200], longstorydata.iloc[0: 638, 0:200]])

In [8]:
y = [1] * 638 + [0] * 638
len(y)

1276

#### scale it and try logistic regression

In [9]:
scaler = StandardScaler()
training_scaled = scaler.fit_transform(training)

In [10]:
for constant in [0.00001, 0.0001, .001, .01, .1, 1, 10, 100, 1000]:
    logreg = LogisticRegression(C = constant, max_iter = 2000, penalty = 'l2')
    scores = cross_validate(logreg, training_scaled, y, scoring = 'accuracy', cv = 10)
    print(constant, round(np.mean(scores['test_score']), 4))

1e-05 0.75
0.0001 0.761
0.001 0.772
0.01 0.7751
0.1 0.7728
1 0.7657
10 0.7633
100 0.7618
1000 0.7618
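One caveat about the cell above: the scaler was fit on the whole training set before cross-validation, so each validation fold contributes a little to the scaling. The effect is probably small, but a stricter version wraps the scaler and the classifier in a pipeline so that scaling is re-fit inside each fold. The sketch below uses synthetic data from `make_classification` as a stand-in for the topic columns, so its score says nothing about the real corpus.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the topic proportions; not the real data.
X_toy, y_toy = make_classification(n_samples=300, n_features=20, random_state=0)

# Inside cross_validate, the scaler is fit on the training folds only.
pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(C=0.01, max_iter=2000))
cv_scores = cross_validate(pipe, X_toy, y_toy, scoring="accuracy", cv=10)
print(round(np.mean(cv_scores["test_score"]), 4))
```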


#### random forests work better though

In [249]:
deepdict = dict()
treedict = dict()

for numtrees in [275, 300, 325, 350, 375, 400, 425, 450]:
    treedict[numtrees] = []
    for deepness in [6, 8, 9, 10, 11, 12, 14, 16, 18]:
        if deepness not in deepdict:
            deepdict[deepness] = []
        randfor = RandomForestClassifier(n_estimators = numtrees, max_depth = deepness, n_jobs = 12)
        scores = cross_validate(randfor, training_scaled, y, scoring = 'accuracy', cv = 10)
        acc = np.mean(scores['test_score'])
        print(numtrees, deepness, round(acc, 4))
        deepdict[deepness].append(acc)
        treedict[numtrees].append(acc)
        

275 6 0.7869
275 8 0.7924
275 9 0.7963
275 10 0.7987
275 11 0.8135
275 12 0.7853
275 14 0.7947
275 16 0.801
275 18 0.8033
300 6 0.7908
300 8 0.7893
300 9 0.7853
300 10 0.79
300 11 0.7908
300 12 0.7963
300 14 0.7947
300 16 0.7861
300 18 0.79
325 6 0.7893
325 8 0.7893
325 9 0.79
325 10 0.7971
325 11 0.7854
325 12 0.7948
325 14 0.7986
325 16 0.8041
325 18 0.7877
350 6 0.7908
350 8 0.8026
350 9 0.7963
350 10 0.7908
350 11 0.7924
350 12 0.7948
350 14 0.794
350 16 0.801
350 18 0.7987
375 6 0.7893
375 8 0.7901
375 9 0.7971
375 10 0.8034
375 11 0.7955
375 12 0.7877
375 14 0.7939
375 16 0.7995
375 18 0.805
400 6 0.7932
400 8 0.7994
400 9 0.7971
400 10 0.7979
400 11 0.8034
400 12 0.794
400 14 0.7995
400 16 0.7948
400 18 0.7987
425 6 0.7963
425 8 0.7908
425 9 0.7924
425 10 0.7947
425 11 0.7979
425 12 0.7987
425 14 0.7939
425 16 0.79
425 18 0.7995
450 6 0.7924
450 8 0.7916
450 9 0.8018
450 10 0.7971
450 11 0.805
450 12 0.7987
450 14 0.7994
450 16 0.794
450 18 0.7901


In [251]:
for numtrees in [275, 300, 325, 350, 375, 400, 425, 450]:
    print(numtrees, round(np.mean(treedict[numtrees]), 4))
for deepness in [6, 8, 9, 10, 11, 12, 14, 16, 18]:
    print(deepness, round(np.mean(deepdict[deepness]), 4))
    

275 0.7969
300 0.7904
325 0.7929
350 0.7957
375 0.7957
400 0.7975
425 0.7949
450 0.7967
6 0.7911
8 0.7932
9 0.7946
10 0.7962
11 0.798
12 0.7938
14 0.7961
16 0.7963
18 0.7966


In [235]:
logreg = LogisticRegression(C = .01, max_iter = 1000)
logreg.fit(training_scaled, y)
coefs = logreg.coef_

#### create and scale the test set

In [8]:
test = pd.concat([shortstorydata.iloc[638: 710, 0:200], longstorydata.iloc[638: 710, 0:200]])

In [9]:
test_scaled = scaler.transform(test)

In [10]:
test_scaled.shape

(144, 200)

In [11]:
test_y = [1] * 72 + [0] * 72

In [240]:
predictions = logreg.predict(test_scaled)
sum(predictions == test_y) / len(test_y)

0.7916666666666666

#### Here's an incautious volume-randomized accuracy figure, averaging 10 runs

In [12]:
accuracies = []
for iter in range(10):
    randfor = RandomForestClassifier(n_estimators = 400, max_depth = 18, n_jobs = 12)
    randfor.fit(training_scaled, y)
    predictions = randfor.predict(test_scaled)
    acc = sum(predictions == test_y) / len(test_y)
    accuracies.append(acc)
print(round(np.mean(accuracies), 4))

0.8049


#### to find out which topics are predictive

We use the logistic regression. It can be done with random forests, but it's clunkier.
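For comparison, this is roughly what the random forest route looks like. A fitted `RandomForestClassifier` exposes `feature_importances_`, but those values are unsigned: they show that a topic matters, not whether it points toward short stories or novels. The sketch runs on synthetic data, not on the topic proportions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the topic columns; not the real data.
X_toy, y_toy = make_classification(n_samples=300, n_features=20,
                                   n_informative=5, random_state=0)

forest = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=0)
forest.fit(X_toy, y_toy)

# Rank features by importance, largest first. Importances are non-negative
# and carry no direction, unlike logistic regression coefficients.
ranked = sorted(((imp, idx) for idx, imp in enumerate(forest.feature_importances_)),
                reverse=True)
for imp, idx in ranked[:5]:
    print(idx, round(imp, 4))
```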

In [245]:
coefficients = [(coef, idx) for idx, coef in enumerate(coefs[0])]
coefficients.sort()
coefficients

[(-0.24313424170915315, 157),
 (-0.18221881645920668, 57),
 (-0.16482749559310558, 88),
 (-0.16307998478121052, 72),
 (-0.15354510037825858, 64),
 (-0.1516223297461646, 143),
 (-0.14853967389492304, 154),
 (-0.14597596888958125, 132),
 (-0.1320005506090026, 165),
 (-0.11787582289491635, 142),
 (-0.11360792304335095, 115),
 (-0.1135036554297054, 51),
 (-0.10954962665155109, 78),
 (-0.09736486798777723, 35),
 (-0.09623186936286825, 197),
 (-0.09236675241901808, 153),
 (-0.09200542352828629, 38),
 (-0.09130777223366604, 162),
 (-0.09014211989820133, 190),
 (-0.08796970176601236, 118),
 (-0.08700708254459807, 69),
 (-0.08579977130318572, 122),
 (-0.08515277027618856, 59),
 (-0.08502947516973054, 102),
 (-0.07993376264788019, 84),
 (-0.07875447216225773, 63),
 (-0.07849987031144769, 151),
 (-0.07848985303334048, 90),
 (-0.07800521482846585, 198),
 (-0.07692080840218442, 23),
 (-0.07635007658512093, 34),
 (-0.07556046741466166, 80),
 (-0.07487629768155978, 0),
 (-0.07453698156585857, 186),
 

## Let's do a more rigorous test with held-out authors

It's possible that the model above learns to recognize specific short-story writers and novelists. Let's replicate the result while dividing by authors.

In [82]:
data.head()

Unnamed: 0,t0,t1,t2,t3,t4,t5,t6,t7,t8,t9,...,t199,docid,genre,birthyear,firstpub,hathi_author,hathi_title,us_national,authof3ormore,age
0,4.6e-05,5e-06,0.009303,2.4e-05,0.000293,4.6e-05,2.3e-05,0.000636,0.000169,0.012015,...,2e-06,osu.32435001753706,long,1881.0,1945,"Abbott, Jane",Mary Patten's daughters : a novel,True,True,64.0
1,0.000538,0.001999,9e-05,4e-06,0.001392,5e-06,8.2e-05,0.006044,7e-06,0.046004,...,0.001252,uc1.$b71369,long,1871.0,1929,"Adams, Samuel Hopkins",The flagrant years : a novel of the beauty market,True,True,58.0
2,6e-06,0.01879,0.023701,4e-06,0.000635,0.0231,3e-06,0.003576,0.000692,8e-06,...,0.000775,mdp.39015059384274,short,1889.0,1925,"Aiken, Conrad",Bring! bring! and other stories,True,True,36.0
3,6e-06,5e-06,0.00185,4e-06,1e-05,0.005259,2e-06,0.011151,7e-06,0.000112,...,0.000317,uc1.b4449482,long,,1975,"Albert, Mimi",The second story man : a novel,True,False,
4,0.000186,0.003024,0.007866,4.9e-05,0.0001,0.026803,2e-06,0.052632,0.001825,7e-06,...,2e-06,mdp.39015063779485,long,1892.0,1946,"Aldington, Richard","The romance of Casanova, a novel",False,True,54.0


In [83]:
allauthors = set(data.hathi_author)

In [84]:
import random

#### selecting authors for the test set while ensuring that the genre balance stays even

Randomly selecting authors won't do, because we should ensure that the balance of novels to short stories is nearly the same in test and training sets. We write a function to ensure that with a loop.

In [96]:
def get_test_authors(data):
    allauthors = set(data.hathi_author)
    # stratified sample
    selectedvols = 0
    selectedlong = 0
    selectedshort = 0
    testauthors = []
    while selectedvols < 150:
        author = random.sample(list(allauthors), 1)[0]  # sampling directly from a set fails in Python 3.11+
        df = data.loc[data.hathi_author == author, : ]
        shortcount = sum(df.genre == 'short')
        longcount = len(df) - shortcount
        allowed = False
        if selectedlong > selectedshort and longcount <= shortcount:
            allowed = True
        elif selectedlong < selectedshort and longcount >= shortcount:
            allowed = True
        elif selectedlong == selectedshort:
            allowed = True

        if allowed:
            testauthors.append(author)
            allauthors.remove(author)
            selectedlong += longcount
            selectedshort += shortcount
            selectedvols += (longcount + shortcount)

    print(selectedlong, selectedshort)
    return testauthors

Define another function that uses the list of testauthors to create test and training sets, scale them, and get the true class labels for both.

In [97]:
def make_test_and_train(data, testauthors):
    authtest = data.loc[data.hathi_author.isin(testauthors)]
    authtrain = data.loc[~data.hathi_author.isin(testauthors)]
    scaler2 = StandardScaler()
    authtrain_scaled = scaler2.fit_transform(authtrain.iloc[ : , 0: 200])
    authtest_scaled = scaler2.transform(authtest.iloc[ : , 0: 200])
    authtrain_y = (authtrain.genre == 'short')
    authtest_y = (authtest.genre == 'short')
    print(np.mean(authtrain_y), np.mean(authtest_y))
    
    return authtrain_scaled, authtest_scaled, authtrain_y, authtest_y

Now use the functions defined above to make 20 different test/train splits and average the accuracies.

In [98]:
accuracies = []
for iter in range(20):
    testauthors = get_test_authors(data)
    authtrain_scaled, authtest_scaled, authtrain_y, authtest_y = make_test_and_train(data, testauthors)
    randfor = RandomForestClassifier(n_estimators = 400, max_depth = 18, n_jobs = 12)
    randfor.fit(authtrain_scaled, authtrain_y)
    predictions = randfor.predict(authtest_scaled)
    acc = sum(predictions == authtest_y) / len(authtest_y)
    print(acc)
    accuracies.append(acc)
print(round(np.mean(accuracies), 4))

75 76
0.4996059889676911 0.5033112582781457
0.8211920529801324
76 75
0.5003940110323088 0.4966887417218543
0.8211920529801324
75 75
0.5 0.5
0.7933333333333333
74 78
0.49842271293375395 0.5131578947368421
0.8223684210526315
75 75
0.5 0.5
0.8333333333333334
76 75
0.5003940110323088 0.4966887417218543
0.7549668874172185
76 74
0.5007874015748032 0.49333333333333335
0.8
75 75
0.5 0.5
0.78
74 76
0.49921259842519683 0.5066666666666667
0.7666666666666667
75 75
0.5 0.5
0.8
77 75
0.500788643533123 0.4934210526315789
0.8026315789473685
75 75
0.5 0.5
0.7933333333333333
75 75
0.4907630522088353 0.5657142857142857
0.8057142857142857
75 75
0.5 0.5
0.7866666666666666
72 78
0.49763779527559054 0.52
0.82
75 75
0.5 0.5
0.84
75 75
0.5 0.5
0.7866666666666666
75 75
0.5 0.5
0.82
75 75
0.4907630522088353 0.5657142857142857
0.8114285714285714
76 75
0.5003940110323088 0.4966887417218543
0.7549668874172185
0.8007


That's a more cautious number, and the number I'm going to report in the article.

In [99]:
print(round(np.mean(accuracies), 3))

0.801
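To get a sense of how much the figure moves from one author split to the next, we can summarize the 20 accuracies printed above. The values below are copied from that output. The standard error is only a rough guide, since the 20 test sets are drawn from the same pool of authors and are not independent.

```python
import numpy as np

# The 20 held-out-author accuracies printed by the loop above.
accs = [
    0.8211920529801324, 0.8211920529801324, 0.7933333333333333,
    0.8223684210526315, 0.8333333333333334, 0.7549668874172185,
    0.8, 0.78, 0.7666666666666667, 0.8,
    0.8026315789473685, 0.7933333333333333, 0.8057142857142857,
    0.7866666666666666, 0.82, 0.84,
    0.7866666666666666, 0.82, 0.8114285714285714, 0.7549668874172185,
]

mean_acc = np.mean(accs)
sd_acc = np.std(accs, ddof=1)          # sample standard deviation across splits
sem_acc = sd_acc / np.sqrt(len(accs))  # rough standard error of the mean
print(round(mean_acc, 4), round(sd_acc, 4), round(sem_acc, 4))
```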
