# Make popular subset

Creates a subset of ```titlemeta``` by taking, for each decade from 1800 through 2009, the 100 volumes with the highest values in the ```copiesin25yrs``` column.

This may not be "popularity" in an absolute sense; please don't take the name of the file in a literal-minded fashion.


In [7]:
import pandas as pd
import numpy as np
from scipy.stats import pearsonr

In [2]:
title = pd.read_csv('../titlemeta.tsv', sep = '\t', index_col = 'docid', low_memory = False)

In [3]:
frames = []

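# One pass per decade: 1800-1809, 1810-1819, ..., 2000-2009 (21 decades).
for floor in range(1800, 2010, 10):
    # Volumes whose latest possible composition date falls in this decade.
    decade = title.loc[(title.latestcomp >= floor) & (title.latestcomp < (floor + 10)), : ]
    # Keep the 100 volumes with the most copies.
    sample = decade.nlargest(100, 'copiesin25yrs')
    frames.append(sample)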

subset = pd.concat(frames)
subset.shape

(2100, 29)
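
As a sanity check on the selection rule, here is a minimal sketch of the same per-decade ```nlargest``` logic on a tiny invented table. The helper name ```top_per_decade``` and all values are hypothetical; only the column names ```latestcomp``` and ```copiesin25yrs``` come from ```titlemeta```.

```python
import pandas as pd

# Invented toy metadata; column names mirror titlemeta, the values do not.
toy = pd.DataFrame(
    {
        "latestcomp": [1801, 1805, 1809, 1812, 1815, 1823],
        "copiesin25yrs": [2.0, 7.5, 5.0, 1.0, 6.0, 3.0],
    },
    index=pd.Index(["a", "b", "c", "d", "e", "f"], name="docid"),
)

def top_per_decade(meta, n, start=1800, stop=1830):
    """Return the n rows with the most copies from each decade in [start, stop)."""
    frames = []
    for floor in range(start, stop, 10):
        in_decade = (meta.latestcomp >= floor) & (meta.latestcomp < floor + 10)
        frames.append(meta.loc[in_decade].nlargest(n, "copiesin25yrs"))
    return pd.concat(frames)

toy_subset = top_per_decade(toy, n=2)
print(toy_subset)
```

A decade with fewer than ```n``` rows simply contributes every row it has, so the real subset reaching exactly 2100 rows suggests that each of the 21 decades held at least 100 volumes.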

In [4]:
subset.head()

docid,oldauthor,author,authordate,inferreddate,latestcomp,datetype,startdate,enddate,imprint,imprintdate,...,allcopiesofwork,copiesin25yrs,enumcron,volnum,title,parttitle,earlyedition,shorttitle,nonficprob,juvenileprob
hvd.rsmd2m,"More, Hannah","More, Hannah",1745-1833.,1809,1809,s,1809,,New-York;David Carlisle;1809.,1809,...,7.642857,7.5,,,Cœlebs in search of a wife : | comprehending o...,,True,Cœlebs in search of a wife : comprehending obs...,0.0,0.0
nyp.33433075814735,"Madame Cottin, (Sophie)","Madame Cottin, (Sophie)",1770-1807.,1810,1807,s,1810,,Poughkeepsie [N.Y.;Printed by Paraclete Potter...,1810,...,8.0,6.0,,,"Elizabeth, | or, The exiles of Siberia. A tale...",,True,"Elizabeth, or, The exiles of Siberia. A tale, ...",0.231656,0.008057
mdp.39015014117546,"Porter, Anna Maria","Porter, Anna Maria",1780-1832.,1807,1807,s,1807,,"London;Longman, Hurst, Rees, and Orme;1807.",1807,...,5.0,5.0,v.1,1.0,The Hungarian brothers ... | $c: By Miss Anna ...,,True,The Hungarian brothers,0.076883,0.616757
uiuo.ark+=13960=t2t449b4t,"Porter, Anna Maria","Porter, Anna Maria",1780-1832.,1807,1807,s,1807,,"London;Longman, Hurst, Rees, and Orme;1807.",1807,...,5.0,5.0,v.2,2.0,The Hungarian brothers,,True,The Hungarian brothers,0.208313,0.158856
uiuo.ark+=13960=t3vt28143,"Porter, Anna Maria","Porter, Anna Maria",1780-1832.,1807,1807,s,1807,,"London;Longman, Hurst, Rees, and Orme;1807.",1807,...,5.0,5.0,v.3,3.0,The Hungarian brothers,,True,The Hungarian brothers,0.224745,0.221614


In [5]:
subset.to_csv('most_popular_subset.tsv', sep = '\t', index_label = 'docid')

### Validate the logic of ```copiesin25yrs```

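This version of ```most_popular_subset``` was created after experimentally recalculating our "number of copies" statistics. The earlier version of that calculation privileged multi-volume works. I want to confirm that the new calculation works better. The check: group volumes by ```recordid```, then measure how strongly the number of volumes in a record correlates with the record's mean ```copiesin25yrs```. If multi-volume works are no longer privileged, that correlation should be close to zero.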
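
To make the check concrete, here is a minimal sketch on invented data in which multi-volume records have deliberately inflated copy counts. It uses ```groupby(...).agg``` rather than an explicit loop; the column names ```volumes``` and ```mean_copies``` are hypothetical, and the values are made up purely for illustration.

```python
import pandas as pd
from scipy.stats import pearsonr

# Invented toy data: each row is one volume; 'recordid' groups volumes of a work.
toy = pd.DataFrame(
    {
        "recordid": ["r1", "r1", "r1", "r2", "r3", "r3", "r4"],
        "copiesin25yrs": [9.0, 9.0, 9.0, 2.0, 6.0, 6.0, 1.0],
    }
)

# One row per record: how many volumes it has, and its mean copy count.
per_record = toy.groupby("recordid").agg(
    volumes=("copiesin25yrs", "size"),
    mean_copies=("copiesin25yrs", "mean"),
)

# A strong positive r means multi-volume records tend to have more copies.
r, p = pearsonr(per_record.volumes, per_record.mean_copies)
print(per_record.volumes.mean())
print(r, p)
```

In this toy table the correlation is strongly positive by construction, which is the pattern a copy count biased toward multi-volume works would produce.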

In [6]:
oldsub = pd.read_csv('old_popular_subset.tsv', sep = '\t', index_col = 'docid')

In [9]:
lengths = []   # number of volumes in each record
copies = []    # mean copiesin25yrs across each record's volumes

for recordid, df in oldsub.groupby('recordid'):
    lengths.append(len(df))
    copies.append(np.mean(df.copiesin25yrs))

print(np.mean(lengths))
pearsonr(lengths, copies)

1.43246930423


(0.32977201247571758, 1.5787079700343248e-38)

In [10]:
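# Same check as above, now on the subset built from the recalculated statistics.
lengths = []   # number of volumes in each record
copies = []    # mean copiesin25yrs across each record's volumes

for recordid, df in subset.groupby('recordid'):
    lengths.append(len(df))
    copies.append(np.mean(df.copiesin25yrs))

print(np.mean(lengths))
pearsonr(lengths, copies)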

1.1751538892


(0.02460100982758159, 0.29862515699656245)

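**Success!** The mean number of volumes per record is down from about 1.43 to about 1.18, and the correlation between volumes per record and mean copies per record is down from r ≈ 0.33 (p ≈ 1.6e-38) to r ≈ 0.02 (p ≈ 0.30), which is not statistically significant at conventional thresholds. This suggests that our recalculation has addressed the problem of over-privileging multi-volume works.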