# Second supplement, data analysis

Earlier topic models didn't include all the volumes we would need to test our hypotheses. The model produced with a "second supplement" of volumes is the first one that covers all the bases. This is an initial analysis of information-theoretic asymmetries in that model.

For the code used to measure the asymmetries, see ```../entropycalc```. This analysis begins after that code has run, producing a series of "summary files" that describe novelty, transience, and (what Barron et al. call) "resonance" for each volume. We are probably going to rename resonance something like "anticipation."

First, we import a few modules that will prove useful later.

In [33]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
import glob, math, random
from scipy.stats import pearsonr, zscore, ttest_ind
from statistics import mean, stdev
from math import sqrt

### get metadata

In order to analyze the data we will need information about date of publication for each volume (technically "latest possible date of composition," which is defined as the *earlier* of date of publication or author's date of death if known).
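As a minimal sketch of that definition (the function name and the example years are hypothetical, not taken from the project's metadata code), the "latest possible date of composition" is just the smaller of two years when both are known:

```python
from typing import Optional


def latest_possible_composition(pub_year: int, death_year: Optional[int] = None) -> int:
    """Return the earlier of publication year and author's death year.

    If the death year is unknown (None), fall back on the publication year.
    """
    if death_year is None:
        return pub_year
    return min(pub_year, death_year)


# Made-up example: an edition published in 1925 by an author who died in 1916
# could not have been composed after 1916.
print(latest_possible_composition(1925, 1916))
# With no known death date, the publication year is the best available bound.
print(latest_possible_composition(1861))
```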

We're also going to need information about / predictions of nationality, which are collated in [```detect_americans```](https://github.com/tedunderwood/asymmetry/blob/master/analysis/detect_americans.ipynb).

In [2]:
meta = pd.read_csv('../supplement2/supp2nationalitymeta.tsv', sep = '\t', index_col = 'docid')

### get data

Now the data itself. This is broken into a number of "summary files" because the processing that produced it had to be distributed across a cluster. (The entropy calculation is done by comparing individual volumes to each other, and when you've got 40k vols, the number of cross-comparisons becomes fairly large.)
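To give a rough sense of scale, here is a small sketch that counts unordered pairs among *n* volumes. Treat it as an upper bound only: it assumes every volume could be compared with every other, whereas the actual entropy calculation may restrict comparisons (for instance to a temporal window).

```python
def unordered_pairs(n: int) -> int:
    """Number of distinct pairs that can be formed from n items."""
    return n * (n - 1) // 2


# Roughly 40,000 volumes, as mentioned above.
print(unordered_pairs(40_000))  # 799980000
```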

So we first make a list of all the files we need ...

In [3]:
paths = glob.glob('../supp2results/*summary.tsv')
paths

['../supp2results/supp232000summary.tsv',
 '../supp2results/supp218000summary.tsv',
 '../supp2results/supp236000summary.tsv',
 '../supp2results/supp226000summary.tsv',
 '../supp2results/supp28000summary.tsv',
 '../supp2results/supp222000summary.tsv',
 '../supp2results/supp22000summary.tsv',
 '../supp2results/supp228000summary.tsv',
 '../supp2results/supp26000summary.tsv',
 '../supp2results/supp212000summary.tsv',
 '../supp2results/supp216000summary.tsv',
 '../supp2results/supp238000summary.tsv',
 '../supp2results/supp24000summary.tsv',
 '../supp2results/supp214000summary.tsv',
 '../supp2results/supp210000summary.tsv',
 '../supp2results/supp234000summary.tsv',
 '../supp2results/supp230000summary.tsv',
 '../supp2results/supp20summary.tsv',
 '../supp2results/supp220000summary.tsv',
 '../supp2results/supp224000summary.tsv']

... and then loop across the list, reading them in ... and finally concatenate the data frames.

In [4]:
dfs = []
for p in paths:
    df = pd.read_csv(p, sep = '\t', index_col = 'docid')
    dfs.append(df)
    print(df.shape)

data = pd.concat(dfs, verify_integrity = True)
print(data.shape)

(2000, 36)
(2000, 36)
(2000, 36)
(2000, 36)
(2000, 36)
(2000, 36)
(2000, 36)
(2000, 36)
(2000, 36)
(2000, 36)
(2000, 36)
(1817, 36)
(2000, 36)
(2000, 36)
(2000, 36)
(2000, 36)
(2000, 36)
(2000, 36)
(2000, 36)
(2000, 36)
(39817, 36)


### Map a couple of columns from the metadata into our data

We're going to use **latest-possible-date-of-composition** for our analysis.

In [10]:
data = data.join(meta.latestcomp, how = 'inner')

We're also going to divide **US authors** from others, because

1. a lot of our hypotheses are US-specific, and
2. resonance/anticipation tends to be slightly higher for US authors than for others, simply because the US fraction of the library **increases** across time.

Given those two facts, it gives us slightly more confidence to calculate resonance/anticipation separately for US and for non-US authors. We could in principle also separate authors who are British, Canadian, Indian, and so forth, but since we have few hypotheses specific to those nationalities, the gains are likely to be outweighed by increases in fragility.

The nationality column of the metadata has a lot of possibilities; to simplify the next step, we're going to reduce it to a binary.

In [7]:
def whether_us(anation):
    if anation.startswith('us') or anation == 'guess: us':
        return True
    else:
        return False

meta = meta.assign(isus = meta.nationality.map(whether_us))

In [8]:
sum(meta.isus)

18146

In short, we've got about 18,000 volumes by US authors, in about 40,000 vols total. Note that a lot of these nationalities are inferred / estimated. Overall accuracy (combining manual ground truth and estimation) comes to about 88-89%; consult [```detect_americans```](https://github.com/tedunderwood/asymmetry/blob/master/analysis/detect_americans.ipynb) for details.

We are also fully aware that people like Henry James move across the Atlantic halfway through their lives, that some volumes are collections combining authors of different nationalities, and so on. If we were to fully describe the nuances of each volume, a single "nationality" code would be grossly inadequate. To provide that full description across 40,000 volumes would also take several million pages, and you would quickly stop reading!

Nationality is not a central subject of inquiry in this project. We're generating a simplified, imperfect *model* of nationality merely because it helps us (imperfectly) factor out a confounding variable that has a modest effect on our results. We think that provides a slight interpretive advantage, but the conclusions of our study would not be profoundly altered if we ignored nationality; to estimate the size of the effect, you can consult [earlier notebooks and models where nationality is not factored out.](https://github.com/tedunderwood/asymmetry/blob/master/analysis/first_eda.ipynb)

In [9]:
data = data.join(meta.isus, how = 'inner')
data.shape

(39817, 38)

### Normalize resonance for date and nationality

"Normalize" here means that we take a 7-year window of volumes that are by US authors (or non-US authors), and calculate z scores for volumes within that window. (I.e., subtract the mean and divide by standard deviation.) We replace the raw resonance/anticipation score for each volume with the z score calculated when it's at the center of the window.

I've already explained why we normalize for nationality. The reason for normalizing by date is that distance calculations on topic vectors cannot be trusted to remain uniform across a timeline. There are edge-sampling effects which make distances lower toward the ends of the timeline. This is true for both cosine distance and KLD.

Here's the loop where we actually do the normalizing. This does take some time to run (~20 min). There might be a simpler/quicker way to do this with an ```apply``` method, but it's a bit tricky, since we use a 7-year span to generate the z scores, but only copy over the scores for the central year.

The result is a new dataframe called ```zdata```. This is the data that will be used in subsequent analyses.

In doing this, we also loop across columns and calculate a separate z score for each column. The reason is that resonance/anticipation can be calculated in a range of different ways--using different fractions of the dataset, and different temporal windows. There's a separate column for each of these possibilities. (Note, however, that we preregistered some guidelines about the modes of calculation we would use in checking hypotheses, to avoid a garden of infinitely forking paths.)
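The windowing logic described above can be sketched on toy data without pandas. This is an illustration under stated assumptions, not the notebook's own implementation: the function name and the records are invented, the input is assumed to be already restricted to one nationality group and one score column, and the population standard deviation is used because that is the default for `scipy.stats.zscore`.

```python
from statistics import mean, pstdev


def centered_window_zscores(records, halfwidth=3):
    """Compute a z score for each volume against a window centered on its year.

    records: list of (docid, year, value) tuples (hypothetical toy format).
    Returns a dict mapping docid to z score. A window with zero spread
    (e.g. a single volume) yields NaN rather than a division error.
    """
    by_year = {}
    for docid, year, value in records:
        by_year.setdefault(year, []).append((docid, value))

    zscores = {}
    for year, members in by_year.items():
        # Gather every value whose year falls inside the 2 * halfwidth + 1 span.
        window = [
            value
            for y in range(year - halfwidth, year + halfwidth + 1)
            for _, value in by_year.get(y, [])
        ]
        mu = mean(window)
        sigma = pstdev(window)
        # Only volumes in the central year receive a score from this window.
        for docid, value in members:
            zscores[docid] = (value - mu) / sigma if sigma > 0 else float("nan")
    return zscores


toy = [("a", 1900, 1.0), ("b", 1901, 2.0), ("c", 1902, 3.0), ("d", 1950, 5.0)]
print(centered_window_zscores(toy))
```

In the toy data, volume `d` sits alone in its window, so its score is NaN; sparse stretches of a timeline behave the same way.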

In [11]:
zdata = data.copy(deep = True)
columns = [x for x in zdata.columns.tolist() if x.startswith('resonance')]

for col in columns:
    zdata[col] = np.nan
    # set default as empty

for yankeeness in [True, False]:
    for yr in range(1800, 2009):
        if yr % 50 == 1:
            print(yr)
        df = data.loc[(data.latestcomp >= yr - 3) & (data.latestcomp <= yr + 3) & (data.isus == yankeeness), : ]
        for col in columns:
            nas = np.isnan(df[col])
            seriestonormalize = df.loc[~nas, col]
            indices = seriestonormalize.index.values
            zscores = zscore(seriestonormalize)
            for idx, z in zip(indices, zscores):
                date = df.loc[idx, 'latestcomp']
                if date == yr:
                    zdata.loc[idx, col] = z

1801
1851
1901
1951
2001
1801
1851
1901
1951
2001




The loop emitted a warning (not shown here). I would guess it means I need to make the window a little larger in the early going.

# Analysis

Our hypotheses were [preregistered with the Open Science Framework.](https://osf.io/zuq9a/register/5771ca429ad5a1020de2872e) (See "how many and which conditions will participants be assigned to?")

We'll test them more or less in order.

### first hypothesis: reprinting

To start, the hypothesis about reprinting. For this we'll need information about the number of subsequent copies associated with each title. (To see where this is calculated, you'll need to consult the [```noveltmmeta```](https://github.com/tedunderwood/noveltmmeta) repo.)

In [12]:
zdata = zdata.join(meta.allcopiesofwork, how = 'inner')

In [15]:
zdata.head()

docid,novelty_1.0_10,novelty_1.0_25,novelty_1.0_40,novelty_0.2_10,novelty_0.2_25,novelty_0.2_40,novelty_0.05_10,novelty_0.05_25,novelty_0.05_40,novelty_0.025_10,...,resonance_0.05_10,resonance_0.05_25,resonance_0.05_40,resonance_0.025_10,resonance_0.025_25,resonance_0.025_40,inferreddate,isus,latestcomp,allcopiesofwork
uc2.ark+=13960=t5n877988,5.699619,5.66853,5.672994,4.381012,4.346295,4.342459,3.747675,3.7441,3.73563,3.478798,...,0.314764,0.020179,-0.393374,0.040855,-0.073399,-0.527166,1861,False,1861,2.0
uc2.ark+=13960=t1pg1hz3p,5.755898,5.936071,6.090055,4.115668,4.345863,4.584277,3.439015,3.663256,3.929144,3.215241,...,-1.386936,-0.435228,0.042845,-1.395966,-0.577248,-0.062578,1916,False,1916,1.0
uc1.b4369662,6.635509,6.726831,6.85405,5.142163,5.240199,5.376638,4.449461,4.5224,4.647743,4.162975,...,-0.122204,0.032498,,-0.273588,-0.164787,,1980,True,1980,1.0
mdp.39015037418947,5.421231,5.45746,5.546295,3.829046,3.873979,3.960809,3.216911,3.284133,3.366082,2.985479,...,0.209497,,,0.351961,,,1996,True,1996,1.0
uc1.32106010927223,4.894209,5.078146,5.249962,3.391462,3.559401,3.770094,2.812734,2.94512,3.151993,2.606917,...,2.993573,,,3.592816,,,1989,True,1989,2.0


Now we loop across resonance columns. For each column we remove NaNs, and then calculate correlation with the number of copies associated with volumes.

Since reprinting is sort of a feast-or-famine thing, we also try correlation with log(num_vols), which indeed is slightly stronger.
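To illustrate why a log transform can strengthen a correlation when counts are heavily skewed, here is a small self-contained sketch. The scores and copy counts are made-up toy values, and `pearson_r` is a hand-written helper for this example, not the `scipy` function used in the notebook.

```python
from math import log, sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


# Toy data: most titles have one or two copies, one has very many.
scores = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
copies = [1, 1, 2, 3, 8, 60]

r_raw = pearson_r(scores, copies)
r_log = pearson_r(scores, [log(c) for c in copies])
print("raw:", r_raw)
print("log-transformed:", r_log)
```

In this toy case the single heavily reprinted title dominates the raw correlation, while the log transform spreads the counts more evenly and the linear relationship comes through more strongly.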

In [19]:
for col in columns:
    nas = np.isnan(zdata[col]) | np.isnan(zdata['allcopiesofwork'])
    r, p = pearsonr(zdata.loc[~nas, col], zdata.loc[~nas, 'allcopiesofwork'])
    print(col, r, p)
    r, p = pearsonr(zdata.loc[~nas, col], np.log(zdata.loc[~nas, 'allcopiesofwork'] + 0.64))
    print("log-transformed: ", col, r, p)
    print()

resonance_1.0_10 0.0953931234566 7.77876227474e-75
log-transformed:  resonance_1.0_10 0.107388427978 1.90844556743e-94

resonance_1.0_25 0.0935771913294 1.33042731513e-63
log-transformed:  resonance_1.0_25 0.11447631186 1.86808398768e-94

resonance_1.0_40 0.0844974216679 5.10923172766e-43
log-transformed:  resonance_1.0_40 0.108920089117 1.88275759494e-70

resonance_0.2_10 0.115465304155 5.55746452309e-109
log-transformed:  resonance_0.2_10 0.120951970388 1.80197609569e-119

resonance_0.2_25 0.110813779942 1.36188617721e-88
log-transformed:  resonance_0.2_25 0.12616547101 1.79545832372e-114

resonance_0.2_40 0.1004553444 3.77330412289e-60
log-transformed:  resonance_0.2_40 0.120413581365 8.46942351179e-86

resonance_0.05_10 0.112353549022 2.967808422e-103
log-transformed:  resonance_0.05_10 0.12017833202 5.82065401123e-118

resonance_0.05_25 0.122453978833 6.69996349584e-108
log-transformed:  resonance_0.05_25 0.135938700606 9.58291528736e-133

resonance_0.05_40 0.115321908547 8.325597

We calculated resonance for 100% of the corpus, as well as the 20%, 5% and 2.5% of volumes closest to the volume being evaluated.

We also measured across 10-, 25-, and 40-year windows.

**Notice, however, that the differences between different measurement strategies are on the whole rather modest!** This is not necessarily true for individual vols, but when you're looking at broad correlations or differences of means between groups, the measurement strategy is rarely determinative.

### second and other hypotheses

The second and later hypotheses will require more metadata about the volumes contained in particular groups (bestsellers, mostdiscussed, etc.). We load this below.

In [20]:
hypothesis_meta = pd.read_csv('../meta/second_supplement_maxoverlap.tsv', sep = '\t')

In [21]:
hypothesis_meta.columns.tolist()

['docid',
 'author',
 'title',
 'inferreddate',
 'latestcomp',
 'firstpub',
 'allcopiesofwork',
 'copiesin25yrs',
 'gender',
 'nationality',
 'earlyedition',
 'lastname',
 'imprint',
 'recordid',
 'norton',
 'heath',
 'nortonshort',
 'mostdiscussed',
 'preregistered',
 'reviewed',
 'contrast4reviewed',
 'bestseller']

The last eight columns above describe particular groups to be used in testing hypotheses. Rows (volumes) will have True or False in each of these columns.

### second hypothesis

The hypothesis about reviewing is a little different than the others, because our test set in this case comes with a contrast set that was constructed manually to match its distribution across dates. So we can use one column to create the test set, another to create the contrast set, and then simply compare the groups.
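The comparison reports both a t-test and Cohen's d. As a standalone sketch of the effect-size part (toy numbers, invented for illustration), Cohen's d here is the difference of means divided by the root mean square of the two sample standard deviations:

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(a, b):
    """Difference of means scaled by the RMS of the two sample SDs."""
    pooled_sd = sqrt((stdev(a) ** 2 + stdev(b) ** 2) / 2)
    return (mean(a) - mean(b)) / pooled_sd


# Made-up z scores for a test group and a contrast group.
test_group = [0.9, 0.4, 1.2, 0.1, 0.7]
contrast_group = [0.2, -0.3, 0.5, -0.1, 0.0]
print(cohens_d(test_group, contrast_group))
```

Unlike the t statistic, this quantity does not grow with sample size, which makes it the more useful number when comparing groups of very different sizes.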

In [28]:
reviewed_docs = hypothesis_meta.loc[hypothesis_meta.reviewed == True, 'docid']
unreviewed_docs = hypothesis_meta.loc[hypothesis_meta.contrast4reviewed == True, 'docid']

for col in columns:
    reviewed_data = zdata.loc[reviewed_docs, col]
    reviewed_data = reviewed_data[~np.isnan(reviewed_data)]
    unreviewed_data = zdata.loc[unreviewed_docs, col]
    unreviewed_data = unreviewed_data[~np.isnan(unreviewed_data)]
    t, p = ttest_ind(reviewed_data, unreviewed_data)
    print(col, "t-test", t, p)
    a = reviewed_data
    b = unreviewed_data
    cohens_d = (mean(a) - mean(b)) / (sqrt((stdev(a) ** 2 + stdev(b) ** 2) / 2))
    print(col, "Cohen's d", cohens_d)
    print() 

resonance_1.0_10 t-test 8.18489644079 7.17697465673e-16
resonance_1.0_10 Cohen's d 0.484656100018

resonance_1.0_25 t-test 8.71311330151 1.01469822134e-17
resonance_1.0_25 Cohen's d 0.51603188055

resonance_1.0_40 t-test 8.37203403202 1.62931510126e-16
resonance_1.0_40 Cohen's d 0.495506207256

resonance_0.2_10 t-test 7.46905418059 1.59358708582e-13
resonance_0.2_10 Cohen's d 0.442330617997

resonance_0.2_25 t-test 8.5101111178 5.35650755746e-17
resonance_0.2_25 Cohen's d 0.50411223647

resonance_0.2_40 t-test 8.1844552086 7.20186320816e-16
resonance_0.2_40 Cohen's d 0.484619944984

resonance_0.05_10 t-test 7.4205011709 2.26330319651e-13
resonance_0.05_10 Cohen's d 0.438918048083

resonance_0.05_25 t-test 9.07193344903 4.94154737988e-19
resonance_0.05_25 Cohen's d 0.53724385254

resonance_0.05_40 t-test 8.51814366021 5.01843271806e-17
resonance_0.05_40 Cohen's d 0.504591225364

resonance_0.025_10 t-test 6.96516472852 5.50970318492e-12
resonance_0.025_10 Cohen's d 0.411950304381

resona

### functions used to test subsequent hypotheses

The other hypotheses we preregistered don't come with a predefined contrast set. Instead we specify a control set that matches the test set's distribution across the timeline, and excludes authors included in the test set. Ensuring the latter condition is tricky, but also not likely to make a huge difference given the low odds of random collision in this space. For the moment, let's just say that our contrast set will be selected by matching dates alone.

We can write a function to do this selection.

In [43]:
def main_and_contrast_set(categorylist, hypmeta, data):
    ''' Accepts a list of categories and finds volumes matching those categories in 
    *hypmeta*. Then constructs a set of matching volumes.
    '''
    
    vols_in_cat = []
    dates_of_cat = []
    for cat in categorylist:
        vols_in_cat.extend(hypmeta.loc[hypmeta[cat] == True, 'docid'])
        dates_of_cat.extend(hypmeta.loc[hypmeta[cat] == True, 'latestcomp'])
    
    # TODO: Right now, since categories overlap, some volumes can be represented more
    # than once. Need to fix this.
        
    contrast_vols = []
    for d in dates_of_cat:
        population = data.index[data['latestcomp'] == d].tolist()
        population = set(population) - set(vols_in_cat)
        if len(population) < 1:
            print('*')
            continue
        else:
            # random.sample() no longer accepts a set (TypeError in Python 3.11+)
            chosen = random.choice(sorted(population))
            contrast_vols.append(chosen)
    
    return vols_in_cat, contrast_vols
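
The function above matches on date only. The following is a hedged sketch of how the two open items (duplicate volumes from overlapping categories, and excluding authors who appear in the test set) might be handled. Everything here is hypothetical: the function name, the `catalog` structure, and the toy entries are invented for illustration, and drawing contrast volumes without replacement is a design choice that differs from the function above.

```python
import random


def date_matched_contrast(test_ids, catalog, seed=None):
    """Pick one date-matched contrast volume per unique test volume.

    test_ids: docids in the test set (may contain repeats).
    catalog:  dict mapping docid -> (year, author); hypothetical toy format.
    Volumes by any author in the test set are never eligible as contrasts.
    """
    rng = random.Random(seed)
    unique_test = list(dict.fromkeys(test_ids))  # drop repeats, keep order
    test_set = set(unique_test)
    test_authors = {catalog[d][1] for d in unique_test}

    # Group eligible volumes by year; sorting keeps runs reproducible.
    pool_by_year = {}
    for docid in sorted(catalog):
        year, author = catalog[docid]
        if docid not in test_set and author not in test_authors:
            pool_by_year.setdefault(year, []).append(docid)

    contrast = []
    for docid in unique_test:
        pool = pool_by_year.get(catalog[docid][0], [])
        if not pool:
            continue  # no eligible match for this year
        # Remove the chosen volume so it cannot be drawn twice.
        contrast.append(pool.pop(rng.randrange(len(pool))))
    return unique_test, contrast


toy_catalog = {
    "t1": (1884, "Author A"), "t2": (1900, "Author B"),
    "c1": (1884, "Author A"), "c2": (1884, "Author C"),
    "c3": (1900, "Author D"), "c4": (1850, "Author E"),
}
print(date_matched_contrast(["t1", "t2", "t1"], toy_catalog, seed=0))
```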
    
    

In [36]:
def general_test(acat, bcat, zdata): 
    ''' Given two sets of volumes, calculates t tests and
    Cohen's d for difference of means.
    '''
    
    global columns
    for col in columns:
        a = zdata.loc[acat, col]
        a = a[~np.isnan(a)]
        b = zdata.loc[bcat, col]
        b = b[~np.isnan(b)]
        t, p = ttest_ind(a, b)
        print(col, "t-test", t, p)
        cohens_d = (mean(a) - mean(b)) / (sqrt((stdev(a) ** 2 + stdev(b) ** 2) / 2))
        print(col, "Cohen's d", cohens_d)
        print()

### Best sellers

This is not technically one of our preregistered hypotheses, because we weren't confident we knew what to expect here. We mention that we're going to test it, and say that we don't know whether there will be an effect.

In [44]:
bestsellers, notbestsellers = main_and_contrast_set(['bestseller'], hypothesis_meta, zdata)
general_test(bestsellers, notbestsellers, zdata)

*
*
*
resonance_1.0_10 t-test 6.43007915411 1.67190978433e-10
resonance_1.0_10 Cohen's d 0.318634771171

resonance_1.0_25 t-test 7.10852024775 1.83227488482e-12
resonance_1.0_25 Cohen's d 0.372587699362

resonance_1.0_40 t-test 6.30153084753 4.03142788474e-10
resonance_1.0_40 Cohen's d 0.34993414797

resonance_0.2_10 t-test 5.91441105179 4.05132712186e-09
resonance_0.2_10 Cohen's d 0.293079008322

resonance_0.2_25 t-test 6.08722988243 1.46713177541e-09
resonance_0.2_25 Cohen's d 0.319057539169

resonance_0.2_40 t-test 5.55720429346 3.3244507281e-08
resonance_0.2_40 Cohen's d 0.308601285324

resonance_0.05_10 t-test 5.49343231996 4.56754238828e-08
resonance_0.05_10 Cohen's d 0.272212597703

resonance_0.05_25 t-test 6.33150578793 3.22773215249e-10
resonance_0.05_25 Cohen's d 0.331861075555

resonance_0.05_40 t-test 5.951514683 3.41543041064e-09
resonance_0.05_40 Cohen's d 0.33050608699

resonance_0.025_10 t-test 4.8823501877 1.1511762453e-06
resonance_0.025_10 Cohen's d 0.241934725263

r

### Third hypothesis: volumes canonized by Norton

Here we're adding Heath and Norton Short Fiction; we should separate those in further analysis.

In [46]:
canon, notcanon = main_and_contrast_set(['norton', 'heath', 'nortonshort'], hypothesis_meta, zdata)
general_test(canon, notcanon, zdata)

resonance_1.0_10 t-test 1.18215752533 0.238499377839
resonance_1.0_10 Cohen's d 0.163574074763

resonance_1.0_25 t-test 0.447653755485 0.654905142568
resonance_1.0_25 Cohen's d 0.0641355159363

resonance_1.0_40 t-test 0.746726289438 0.456219521472
resonance_1.0_40 Cohen's d 0.111661697526

resonance_0.2_10 t-test 1.03650354794 0.301176582912
resonance_0.2_10 Cohen's d 0.143342079545

resonance_0.2_25 t-test 0.240921984906 0.809871325157
resonance_0.2_25 Cohen's d 0.0344997697115

resonance_0.2_40 t-test 0.409553487155 0.682628727123
resonance_0.2_40 Cohen's d 0.0612078568973

resonance_0.05_10 t-test 2.64578422924 0.00877601427284
resonance_0.05_10 Cohen's d 0.366070994434

resonance_0.05_25 t-test 1.27400378963 0.204194072518
resonance_0.05_25 Cohen's d 0.182461084209

resonance_0.05_40 t-test 1.33192283316 0.184598079124
resonance_0.05_40 Cohen's d 0.199104019712

resonance_0.025_10 t-test 2.75672610718 0.00635984173369
resonance_0.025_10 Cohen's d 0.381395292855

resonance_0.025_25 

### okay, whoa

That's a huge blinking neon light saying "publish me."

Bestsellers are *at least* as influential as the canon. There's not even a lot of statistically significant evidence that the canon is influential at all, except when you look at "close resonance" in the 5% and 2.5% fraction.

This result needs to be examined carefully. The power of bestsellerdom varies by time (see below), and the power of canonicity will be stronger for Norton proper than for Heath and Norton Short Fiction. Also, the significance gap is partly caused by a big difference in *N*: we have a lot more bestsellers than canonized titles in our metadata. (Roughly 800 vs 100.)
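To make the point about *N* concrete: for two groups of equal size and equal variance, the t statistic is approximately Cohen's d times the square root of half the per-group size. The sketch below uses that textbook approximation with an arbitrary effect size of 0.3; the function name is invented, and the real groups here also lose some members to missing values.

```python
from math import sqrt


def approx_t(d, n_per_group):
    """Approximate two-sample t for equal group sizes and equal variances."""
    return d * sqrt(n_per_group / 2)


# The same effect size looks very different at n = 100 and n = 800 per group.
for n in (100, 800):
    print(n, round(approx_t(0.3, n), 2))
```

An identical effect size therefore produces a t statistic almost three times larger in the bigger group, which is why effect sizes are the fairer basis for comparing bestsellers with canonized titles.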

But I think even after slicing and dicing the result we're going to find that we can say something like "If you want to find nineteenth-century novels that were ahead of their time, you're better off looking at bestsellers, or volumes well-reviewed in the nineteenth century, than at a list of titles currently assigned in college courses on the period."

"Better off" might turn out to be an exaggeration, but at a minimum I suspect we can say "you'll do at least as well," and that's still a huge thing to be able to say.

### what about the specific titles we preregistered?

This was the fifth hypothesis we preregistered with OSF. We couldn't find all the titles in Hathi, but we found twenty.

In [48]:
prereg, notprereg = main_and_contrast_set(['preregistered'], hypothesis_meta, zdata)
general_test(prereg, notprereg, zdata)

resonance_1.0_10 t-test 3.98757765943 0.0003236210011
resonance_1.0_10 Cohen's d 1.3399071216

resonance_1.0_25 t-test 4.49654696207 8.06373095525e-05
resonance_1.0_25 Cohen's d 1.53893747919

resonance_1.0_40 t-test 5.42824605414 9.65907123498e-06
resonance_1.0_40 Cohen's d 2.04902702368

resonance_0.2_10 t-test 3.46369210984 0.0014249806473
resonance_0.2_10 Cohen's d 1.15907770706

resonance_0.2_25 t-test 4.40171217501 0.000106200848094
resonance_0.2_25 Cohen's d 1.49960896448

resonance_0.2_40 t-test 5.48505331687 8.29876166328e-06
resonance_0.2_40 Cohen's d 2.05527043677

resonance_0.05_10 t-test 2.80016001782 0.00825897888746
resonance_0.05_10 Cohen's d 0.937886502276

resonance_0.05_25 t-test 4.29729362594 0.000143637697755
resonance_0.05_25 Cohen's d 1.46136459099

resonance_0.05_40 t-test 5.49364860714 8.11045440775e-06
resonance_0.05_40 Cohen's d 2.06263158789

resonance_0.025_10 t-test 2.57331737562 0.0144687211737
resonance_0.025_10 Cohen's d 0.865807757845

resonance_0.025_

### We can certainly pick em!!

When we consciously aimed at things we expected to be "like the future," our aim was true. Statistically significant even at n = 20, and effect sizes are huge.

That's fucking interesting.

In [53]:
special = zdata.loc[prereg, ['resonance_0.025_25', 'resonance_1.0_40', 'resonance_0.05_10', 'latestcomp']]
special = special.join(meta.author, how = 'inner')
special

docid,resonance_0.025_25,resonance_1.0_40,resonance_0.05_10,latestcomp,author
nyp.33433074943634,,,0.555,1813.0,"Austen, Jane"
mdp.39015062084390,1.497536,1.392499,1.085073,1955.0,"Nabokov, Vladimir"
uc2.ark+=13960=t3cz3334b,2.28415,2.383205,0.656516,1884.0,"Twain, Mark"
uva.x000380956,-0.313829,-0.290482,0.144973,1865.0,"Verne, Jules"
nyp.33433076030760,0.274981,0.671516,0.199807,1844.0,"Poe, Edgar Allan"
inu.39000003707283,1.746966,1.574013,1.338988,1892.0,"Doyle, Arthur Conan"
uva.x000464259,0.506068,1.30901,0.345109,1850.0,"Hawthorne, Nathaniel"
mdp.39015000695182,2.900934,,2.209323,1972.0,"Atwood, Margaret"
uc2.ark+=13960=t1hh6d619,1.343092,0.882502,1.430802,1900.0,"Dreiser, Theodore"
uc1.32106002103940,1.834456,1.439508,1.342405,1952.0,"Ellison, Ralph"


Margaret Atwood, HG Wells, Conan Doyle, Mark Twain, Charlotte Brontë, Richard Wright, and Ralph Ellison are the champions of this round.

### Why is the "canon" so lame?

In [55]:
norton, notnorton = main_and_contrast_set(['norton'], hypothesis_meta, zdata)
nortonvols = zdata.loc[norton, ['resonance_0.025_25', 'resonance_1.0_40', 'resonance_0.05_10', 'latestcomp']]
nortonvols = nortonvols.join(meta.author, how = 'inner')
nortonvols

docid,resonance_0.025_25,resonance_1.0_40,resonance_0.05_10,latestcomp,author
uc1.32106010927223,,,2.993573,1989,"Tan, Amy"
"miun.abr7583,0001,001",-0.309324,0.372108,-0.142517,1924,"Fauset, Jessie Redmon"
mdp.39015035340093,0.415774,0.13839,0.223069,1904,"Chopin, Kate"
yale.39002014432901,1.083358,0.856012,0.282043,1841,"Dana, Richard Henry"
uc1.32106002107412,-0.646895,-0.502028,-0.557187,1940,"Fitzgerald, F. Scott (Francis Scott)"
inu.30000048909653,0.407513,0.315474,0.643952,1958,"Malamud, Bernard"
njp.32101068582491,0.146781,0.101713,0.163735,1898,"Cahan, Abraham"
uc2.ark+=13960=t3cz3334b,2.28415,2.383205,0.656516,1884,"Twain, Mark"
uc1.32106014299538,,,1.195299,1989,"Kingston, Maxine Hong"
mdp.39015054061430,-0.232882,0.901708,-0.732913,1928,"Larsen, Nella"


#### what's the mean for Norton?

In [56]:
np.mean(nortonvols.loc[~np.isnan(nortonvols['resonance_0.05_10']), 'resonance_0.05_10'])

0.42234910215688248

In [73]:
short, notshort = main_and_contrast_set(['nortonshort'], hypothesis_meta, zdata)
shortvols = zdata.loc[short, ['resonance_0.025_25', 'resonance_1.0_40', 'resonance_0.05_10', 'latestcomp']]
np.mean(shortvols.loc[~np.isnan(shortvols['resonance_0.05_10']), 'resonance_0.05_10'])

-0.059760015049459138

#### aha — Norton short fiction is the canon-killing villain!

In [57]:
heath, notheath = main_and_contrast_set(['heath'], hypothesis_meta, zdata)
heathvols = zdata.loc[heath, ['resonance_0.025_25', 'resonance_1.0_40', 'resonance_0.05_10', 'latestcomp']]
heathvols = heathvols.join(meta.author, how = 'inner')
heathvols

docid,resonance_0.025_25,resonance_1.0_40,resonance_0.05_10,latestcomp,author
uc1.32106002182373,0.910966,,0.605134,1974.0,"Oates, Joyce Carol"
"miun.aat2524,0001,001",0.425045,0.826687,0.238125,1929.0,"Larsen, Nella"
mdp.39015002379611,0.007465,0.108486,-0.191399,1942.0,"Redding, J. Saunders (Jay Saunders)"
mdp.39015035340093,0.415774,0.13839,0.223069,1904.0,"Chopin, Kate"
mdp.39015008221700,0.493657,-0.107157,0.867667,1912.0,"Antin, Mary"
uc2.ark+=13960=t20c4ss2f,2.711637,1.462955,1.199227,1887.0,"Harris, Joel Chandler"
inu.30000048909653,0.407513,0.315474,0.643952,1958.0,"Malamud, Bernard"
mdp.39015046832278,-0.488577,-0.321979,-0.49384,1854.0,"Hentz, Caroline Lee"
nyp.33433082294244,,,-1.093592,1824.0,"Rowson, Mrs"
nyp.33433076084221,0.292146,0.403312,0.163445,1907.0,"Howells, William Dean"


In [58]:
np.mean(heathvols.loc[~np.isnan(heathvols['resonance_0.05_10']), 'resonance_0.05_10'])

0.2687628268346951

#### Heath also brings down the average

but compare the preregistered volumes:

In [61]:
np.mean(special.loc[~np.isnan(special['resonance_0.05_10']), 'resonance_0.05_10'])

0.81528545651340767

### The raw value in bestsellerdom is not huge (though a bit larger than Heath) but the n is large enough to make statistical significance very strong

In [63]:
best, notbest = main_and_contrast_set(['bestseller'], hypothesis_meta, zdata)
bestvols = zdata.loc[best, ['resonance_0.025_25', 'resonance_1.0_40', 'resonance_0.05_10', 'latestcomp']]
bestvols = bestvols.join(meta.author, how = 'inner')
print('n = ', len(best))
np.mean(bestvols.loc[~np.isnan(bestvols['resonance_0.05_10']), 'resonance_0.05_10'])

*
*
*
n =  832


0.27247834510137858

### How does the power of bestsellerdom vary by time?

In [66]:
def main_and_contrast_by_date(startdate, enddate, categorylist, hypmeta, data):
    vols_in_cat = []
    dates_of_cat = []
    for cat in categorylist:
        vols_in_cat.extend(hypmeta.loc[(hypmeta[cat] == True) & (hypmeta.latestcomp >= startdate) & (hypmeta.latestcomp < enddate), 'docid'])
        dates_of_cat.extend(hypmeta.loc[(hypmeta[cat] == True) & (hypmeta.latestcomp >= startdate) & (hypmeta.latestcomp < enddate), 'latestcomp'])
        
    contrast_vols = []
    for d in dates_of_cat:
        population = data.index[data['latestcomp'] == d].tolist()
        population = set(population) - set(vols_in_cat)
        if len(population) < 1:
            print('*')
            continue
        else:
            # random.sample() no longer accepts a set (TypeError in Python 3.11+)
            chosen = random.choice(sorted(population))
            contrast_vols.append(chosen)
    
    return vols_in_cat, contrast_vols

In [67]:
# 19c

best, notbest = main_and_contrast_by_date(1800, 1900, ['bestseller'], hypothesis_meta, zdata)
bestvols = zdata.loc[best, ['resonance_0.025_25', 'resonance_1.0_40', 'resonance_0.05_10', 'latestcomp']]
bestvols = bestvols.join(meta.author, how = 'inner')
print('n = ', len(best))
np.mean(bestvols.loc[~np.isnan(bestvols['resonance_0.05_10']), 'resonance_0.05_10'])

n =  162


0.7723400941338544

#### as strong as our preregistered set!

In [68]:
# early 20c

best, notbest = main_and_contrast_by_date(1900, 1950, ['bestseller'], hypothesis_meta, zdata)
bestvols = zdata.loc[best, ['resonance_0.025_25', 'resonance_1.0_40', 'resonance_0.05_10', 'latestcomp']]
bestvols = bestvols.join(meta.author, how = 'inner')
print('n = ', len(best))
np.mean(bestvols.loc[~np.isnan(bestvols['resonance_0.05_10']), 'resonance_0.05_10'])

n =  429


0.21579183232781218

#### weaker ....

In [69]:
# late 20c

best, notbest = main_and_contrast_by_date(1950, 2010, ['bestseller'], hypothesis_meta, zdata)
bestvols = zdata.loc[best, ['resonance_0.025_25', 'resonance_1.0_40', 'resonance_0.05_10', 'latestcomp']]
bestvols = bestvols.join(meta.author, how = 'inner')
print('n = ', len(best))
np.mean(bestvols.loc[~np.isnan(bestvols['resonance_0.05_10']), 'resonance_0.05_10'])

n =  238


0.042102831538836046

#### nothing left.
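
The three cells above repeat the same steps for each period. A compact sketch of the same idea, looping over half-open date ranges, might look like this. The rows are made-up toy values and the function name is hypothetical; note also that this version counts only volumes with a non-missing score, whereas the cells above print the size of the whole group.

```python
from statistics import mean

# Half-open ranges [start, end), matching the >= / < tests used above.
PERIODS = [(1800, 1900), (1900, 1950), (1950, 2010)]


def mean_by_period(rows, periods=PERIODS):
    """Return {(start, end): (n, mean score)} for a list of (year, score) rows.

    Rows whose score is NaN are skipped; an empty period yields a NaN mean.
    """
    results = {}
    for start, end in periods:
        # score == score is False only for NaN, so this filters missing values.
        vals = [s for y, s in rows if start <= y < end and s == s]
        results[(start, end)] = (len(vals), mean(vals) if vals else float("nan"))
    return results


toy_rows = [
    (1852, 1.2), (1884, 0.6),
    (1931, 0.4), (1925, float("nan")),
    (1972, 0.1), (1989, -0.1),
]
for period, (n, avg) in mean_by_period(toy_rows).items():
    print(period, "n =", n, "mean =", avg)
```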

### Oh, what about mostdiscussed?


In [70]:
buzzy, notbuzzy = main_and_contrast_set(['mostdiscussed'], hypothesis_meta, zdata)
general_test(buzzy, notbuzzy, zdata)

resonance_1.0_10 t-test 3.98835538734 0.000205191788003
resonance_1.0_10 Cohen's d 1.08040628027

resonance_1.0_25 t-test 3.68459303103 0.000554764445372
resonance_1.0_25 Cohen's d 1.01505949932

resonance_1.0_40 t-test 3.76719664865 0.000459476510361
resonance_1.0_40 Cohen's d 1.07855795927

resonance_0.2_10 t-test 4.13983632437 0.000125167613307
resonance_0.2_10 Cohen's d 1.11880997868

resonance_0.2_25 t-test 3.47040609132 0.00106642315272
resonance_0.2_25 Cohen's d 0.954737225577

resonance_0.2_40 t-test 2.97495225992 0.00461549713077
resonance_0.2_40 Cohen's d 0.849834017732

resonance_0.05_10 t-test 3.59162594806 0.000719084487122
resonance_0.05_10 Cohen's d 0.96904032016

resonance_0.05_25 t-test 3.86561171888 0.000314821393508
resonance_0.05_25 Cohen's d 1.06241407582

resonance_0.05_40 t-test 3.05054155329 0.00374793988585
resonance_0.05_40 Cohen's d 0.871704206549

resonance_0.025_10 t-test 3.35859763186 0.00145687330653
resonance_0.025_10 Cohen's d 0.907129081963

resonance_

In [71]:
buzzyvols = zdata.loc[buzzy, ['resonance_0.025_25', 'resonance_1.0_40', 'resonance_0.05_10', 'latestcomp']]
buzzyvols = buzzyvols.join(meta.title, how = 'inner')
buzzyvols

docid,resonance_0.025_25,resonance_1.0_40,resonance_0.05_10,latestcomp,title
mdp.39015046349141,2.053983,1.489412,2.233331,1931.0,The sound and the fury
mdp.39015059959802,,,,2003.0,Love
uva.x000319019,0.287353,-0.005038,2.39431,1904.0,"The golden bowl,"
mdp.39015005186146,0.254466,-0.012963,0.066023,1856.0,Dred; | a tale of the great Dismal Swamp. | $c...
coo.31924052984741,-0.009456,-0.179583,2.022006,1903.0,"The ambassadors, | $c: a novel by Henry James."
nyp.33433074940861,1.352436,1.65684,-0.357777,1875.0,The adventures of Tom Sawyer
mdp.39015004956234,1.542085,1.376702,1.230469,1852.0,"Uncle Tom's cabin, or, Life among the lowly"
uc2.ark+=13960=t3cz3334b,2.28415,2.383205,0.656516,1884.0,Huckleberry Finn
nyp.33433076030760,0.274981,0.671516,0.199807,1844.0,Tales
uc2.ark+=13960=t5j96177f,0.640434,0.08001,0.152889,1886.0,The Bostonians;


In [72]:
np.mean(buzzyvols.loc[~np.isnan(buzzyvols['resonance_0.05_10']), 'resonance_0.05_10'])

0.61369824296594311

### write out the results for easy analysis

In [75]:
zdata.columns

Index(['novelty_1.0_10', 'novelty_1.0_25', 'novelty_1.0_40', 'novelty_0.2_10',
       'novelty_0.2_25', 'novelty_0.2_40', 'novelty_0.05_10',
       'novelty_0.05_25', 'novelty_0.05_40', 'novelty_0.025_10',
       'novelty_0.025_25', 'novelty_0.025_40', 'transience_1.0_10',
       'transience_1.0_25', 'transience_1.0_40', 'transience_0.2_10',
       'transience_0.2_25', 'transience_0.2_40', 'transience_0.05_10',
       'transience_0.05_25', 'transience_0.05_40', 'transience_0.025_10',
       'transience_0.025_25', 'transience_0.025_40', 'resonance_1.0_10',
       'resonance_1.0_25', 'resonance_1.0_40', 'resonance_0.2_10',
       'resonance_0.2_25', 'resonance_0.2_40', 'resonance_0.05_10',
       'resonance_0.05_25', 'resonance_0.05_40', 'resonance_0.025_10',
       'resonance_0.025_25', 'resonance_0.025_40', 'inferreddate', 'isus',
       'latestcomp', 'allcopiesofwork'],
      dtype='object')

In [76]:
zdata.to_csv('../supp2results/zdata.tsv', sep = '\t', index_label = 'docid')