# Interpret LIWC results

I ran ```liwc_gridsearch_oos.Rmd``` five times, producing five different versions of ```liwc_deltas_oos.csv``` (all uploaded to GitHub).

The code below averages the five runs, and then pairs the averaged results with metadata on the LIWC categories.

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

### Load the results of the R scripts.

In [67]:
root = 'liwc_delta_oos'

deltas = dict()

for i in range(1, 6):
    
    suffix = str(i) + '.csv'
    deltas[i] = pd.read_csv(root + suffix)
    
print(len(deltas), ' files loaded.')

5  files loaded.


In [68]:
deltas[1].head()

Unnamed: 0,depvar,cmse,pmse,totalr2,delta,adjdelta,bywidth,fpwidth,bydf,fpdf,pmse_oos,cmse_oos,delta_oos,r2_oos
0,Analytic,16.393,19.707,0.030445,0.4541,0.624579,by_24,fp_12,4,8,0.012416,0.002757112,0.181711,0.011727
1,Clout,31.131,36.723,0.021478,0.458794,0.404116,by_20,fp_20,5,4,0.017138,0.008107973,0.3211573,0.011231
2,Authentic,3.384,51.667,0.027007,0.06147,0.046822,by_24,fp_24,4,3,0.04455,1e-08,2.244654e-07,0.015761
3,Tone,80.008,24.689,0.151734,0.764186,0.876092,by_12,fp_4,11,24,0.019586,0.3158953,0.9416193,0.125711
4,WPS,11.926,18.425,0.007263,0.392936,0.49262,by_24,fp_16,4,6,0.011326,0.00127008,0.1008337,4.7e-05


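There is some continuity but also significant divergence between different runs. In run 4, for example, the in-sample values for Clout are identical to those in run 1, while delta for Analytic rises from 0.454 to 0.765.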

In [69]:
deltas[4].head()

Unnamed: 0,depvar,cmse,pmse,totalr2,delta,adjdelta,bywidth,fpwidth,bydf,fpdf,pmse_oos,cmse_oos,delta_oos,r2_oos
0,Analytic,35.497,10.925,0.032849,0.764659,0.795876,by_20,fp_16,5,6,0.000144,0.016745,0.991445,0.016127
1,Clout,31.131,36.723,0.021478,0.458794,0.404116,by_20,fp_20,5,4,0.023745,0.004315,0.153763,0.009093
2,Authentic,2.043,56.28,0.027835,0.035029,0.051639,by_24,fp_16,4,6,0.062243,0.002057,0.031996,0.013096
3,Tone,100.385,13.451,0.149717,0.881839,0.80279,by_12,fp_16,11,6,0.00619,0.288784,0.979014,0.122433
4,WPS,26.831,15.107,0.008909,0.639778,0.586922,by_20,fp_20,5,4,0.020945,0.017002,0.448041,0.002516
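
One way to put a number on that divergence is to line the runs up side by side and take the range of `delta` for each variable. The sketch below is illustrative only: `runs` is a hypothetical stand-in for the `deltas` dictionary, holding three rows copied from the run 1 and run 4 outputs above rather than the full files.

```python
import pandas as pd

# Hypothetical stand-in for `deltas`: three rows copied from runs 1 and 4 above.
runs = {
    1: pd.DataFrame({'depvar': ['Analytic', 'Clout', 'Tone'],
                     'delta': [0.4541, 0.458794, 0.764186]}),
    4: pd.DataFrame({'depvar': ['Analytic', 'Clout', 'Tone'],
                     'delta': [0.764659, 0.458794, 0.881839]}),
}

# One column per run, one row per LIWC variable.
wide = pd.DataFrame({run: df.set_index('depvar')['delta']
                     for run, df in runs.items()})

# Range (max minus min) of delta across runs for each variable.
spread = wide.max(axis=1) - wide.min(axis=1)
print(spread.sort_values(ascending=False))
```

With the real data, the same pattern should work on `deltas` directly, and `wide.std(axis=1)` would give a standard deviation instead of a range.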


### Construct a data frame that has average values

In [86]:
smoothed = dict()

cols = ['delta', 'adjdelta', 'delta_oos', 'cmse', 'pmse', 'cmse_oos', 'pmse_oos', 
        'totalr2', 'r2_oos', 'bydf', 'fpdf']

for c in cols:
    if c not in smoothed:
        smoothed[c] = []
    for rownum in range(len(deltas[1])):
        values = []
        for i in range(1, 6):
            if c in deltas[i].columns:      # skip runs that lack this column; 'agemse', for instance,
                                            # was added late and won't be in all five runs
                values.append(deltas[i].loc[rownum, c])
        smoothed[c].append(np.mean(values))

        
avgdf = pd.DataFrame(smoothed)
avgdf['depvar'] = deltas[1].depvar
dfcols = avgdf.columns.tolist()
dfcols = dfcols[0: -1]
dfcols.insert(0, 'depvar')
avgdf = avgdf.loc[ : , dfcols]

In [87]:
avgdf.head()

Unnamed: 0,depvar,delta,adjdelta,delta_oos,cmse,pmse,cmse_oos,pmse_oos,totalr2,r2_oos,bydf,fpdf
0,Analytic,0.677514,0.721063,0.537381,33.082,14.943,0.008605,0.007592,0.032774,0.014831,4.8,6.2
1,Clout,0.463125,0.425778,0.349306,29.558,35.574,0.012745,0.024844,0.020328,0.007741,4.8,4.2
2,Authentic,0.08525,0.075123,0.042442,4.2658,50.8382,0.0022,0.053805,0.027806,0.016091,4.2,4.8
3,Tone,0.68686,0.789703,0.669212,60.558,25.3484,0.154334,0.045105,0.144778,0.123361,7.4,13.6
4,WPS,0.616289,0.542211,0.435241,30.3132,16.5064,0.015054,0.018329,0.009946,0.000479,7.6,4.8
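
The row-by-row loop above can also be written with `pd.concat` and `groupby`. This is a sketch of an alternative, not the code that produced the table: `average_runs` is a hypothetical helper, and like the loop it assumes every run lists the variables in the same row order. Columns missing from some runs become `NaN` after concatenation and are skipped by `mean`, which mirrors the `if c in deltas[i].columns` check.

```python
import pandas as pd

def average_runs(frames, cols):
    """Average the columns in `cols` across runs, matching rows by position."""
    stacked = pd.concat(frames)                     # repeated index 0..n-1, once per run
    means = stacked.groupby(level=0)[cols].mean()   # NaN entries are skipped by mean()
    means.insert(0, 'depvar', frames[0]['depvar'])  # labels taken from the first run
    return means

# Toy runs with made-up numbers; the second lacks the 'bydf' column.
run_a = pd.DataFrame({'depvar': ['Analytic', 'Clout'],
                      'delta': [0.4, 0.5], 'bydf': [4, 5]})
run_b = pd.DataFrame({'depvar': ['Analytic', 'Clout'],
                      'delta': [0.8, 0.5]})

toy_avg = average_runs([run_a, run_b], ['delta', 'bydf'])
print(toy_avg)
```

On the real data the call would be something like `average_runs(list(deltas.values()), cols)`.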


## Calculate average delta as per our pre-registered plan

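The columns that matter most are the "weighted" ones. We've already decided to care more about variables where the model is strong than about ones where r2 is low and no chronological variables are very predictive, and likewise more about large topics than small ones. The weighted average therefore sums cmse and pmse across all variables before taking the ratio, rather than averaging the per-variable deltas, so variables with a larger cmse + pmse count for more.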

In [75]:
def weighted_avg(aframe):
    avg = sum(aframe.cmse) / (sum(aframe.cmse) + sum(aframe.pmse))
    return avg

def weighted_avg_oos(aframe):
    avg = sum(aframe.cmse_oos) / (sum(aframe.cmse_oos) + sum(aframe.pmse_oos))
    return avg
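
To see why pooling differs from a plain mean of the deltas, note that each row's `delta` is `cmse / (cmse + pmse)` (for Analytic in run 1, 16.393 / (16.393 + 19.707) ≈ 0.4541). The toy example below uses made-up numbers, not values from the runs, to show that the pooled ratio equals an average of the deltas weighted by `cmse + pmse`.

```python
import numpy as np
import pandas as pd

# Made-up numbers: one variable with large errors, one with small errors.
toy = pd.DataFrame({'depvar': ['strong', 'weak'],
                    'cmse': [90.0, 1.0],
                    'pmse': [10.0, 9.0]})
toy['delta'] = toy.cmse / (toy.cmse + toy.pmse)   # 0.9 and 0.1

raw = toy.delta.mean()                                         # unweighted mean
pooled = toy.cmse.sum() / (toy.cmse.sum() + toy.pmse.sum())    # same formula as weighted_avg
weighted = np.average(toy.delta, weights=toy.cmse + toy.pmse)  # explicit weights

print(round(raw, 4), round(pooled, 4), round(weighted, 4))
```

Here the raw mean is 0.5, while the pooled and explicitly weighted versions both come to 91 / 110 ≈ 0.8273, because the "strong" row carries ten times the weight of the "weak" one.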

In [76]:
print('raw in-sample\traw oos\t\tweighted in-sample\tweighted oos')
for i in range(1, 6):
    print(round(np.mean(deltas[i].delta), 4), '\t\t', round(np.mean(deltas[i].delta_oos), 4), '\t\t',
          round(weighted_avg(deltas[i]), 4), '\t\t', round(weighted_avg_oos(deltas[i]), 4))

raw in-sample	raw oos		weighted in-sample	weighted oos
0.5279 		 0.4459 		 0.5317 		 0.5199
0.546 		 0.4821 		 0.5504 		 0.5233
0.5262 		 0.4716 		 0.5308 		 0.5143
0.5176 		 0.4367 		 0.5309 		 0.5128
0.5443 		 0.4716 		 0.5479 		 0.5274


In [77]:
print("Overall, weighted in-sample is", round(weighted_avg(avgdf), 4))
print("And out-of-sample: ", round(weighted_avg_oos(avgdf), 4))

Overall, weighted in-sample is 0.5385
And out-of-sample:  0.5196


### Writing results to file

We'll save the averaged results of all five runs as ```mean_LIWC_result.tsv```.

In [88]:
avgdf.to_csv('mean_LIWC_result.tsv', sep = '\t')
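
Because `to_csv` writes the row index by default, the file gets an unnamed first column. A minimal sketch of reading it back follows; it uses a temporary directory and a two-row toy frame (values copied from the averaged table above) in place of the real `avgdf`.

```python
import os
import tempfile
import pandas as pd

# Toy frame standing in for avgdf.
toy = pd.DataFrame({'depvar': ['Analytic', 'Clout'],
                    'delta': [0.677514, 0.463125]})

path = os.path.join(tempfile.mkdtemp(), 'mean_LIWC_result.tsv')
toy.to_csv(path, sep='\t')    # the row index is written as an unnamed first column

naive = pd.read_csv(path, sep='\t')               # index comes back as 'Unnamed: 0'
clean = pd.read_csv(path, sep='\t', index_col=0)  # first column restored as the index

print(naive.columns.tolist())
print(clean.columns.tolist())
```

Passing `index=False` to `to_csv` would avoid the extra column altogether, at the cost of changing the file layout that downstream code may expect.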