This notebook assesses how well demographic outcomes can be predicted from survey data.

In [21]:
import os,glob
import numpy,pandas
from sklearn.svm import LinearSVC,SVC
from sklearn.linear_model import LinearRegression,LogisticRegressionCV,RandomizedLogisticRegression,ElasticNet,ElasticNetCV,Ridge,RidgeCV
from sklearn.preprocessing import scale
from sklearn.cross_validation import StratifiedKFold,KFold
from sklearn.metrics import accuracy_score,f1_score,roc_auc_score
%matplotlib inline

%load_ext rpy2.ipython
%R require(mirt)

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython


array([1], dtype=int32)

In [2]:
if not os.path.exists('factor_scores'):
    os.mkdir('factor_scores')


In [3]:
binary_vars=["ArrestedChargedLifeCount","DivorceCount","GamblingProblem","ChildrenNumber",
            "CreditCardDebt","RentOwn","RetirementAccount","TrafficTicketsLastYearCount","Obese",
             "TrafficAccidentsLifeCount","CaffienatedSodaCansPerDay"]

demogdata=pandas.read_csv('surveydata/demographics.tsv',index_col=0,delimiter='\t')
# remove a couple of outliers
demogdata=demogdata.query('WeightPounds>50')
demogdata=demogdata.query('CaffienatedSodaCansPerDay>-1')

# BMI = kg / m**2; 0.45 and 0.025 are rounded pound-to-kg and inch-to-metre factors
demogdata['BMI']=demogdata['WeightPounds']*0.45 / (demogdata['HeightInches']*0.025)**2
demogdata['Obese']=(demogdata['BMI']>30).astype('int')
demogdata=demogdata[binary_vars]
demogdata=demogdata[demogdata.isnull().sum(1)==0]
workers=list(demogdata.index)

subscale_data=pandas.read_csv('survey_subscales.csv',index_col=0)

subscale_data=subscale_data.ix[workers]
subscale_data=subscale_data[subscale_data.isnull().sum(1)==0]
subscale_vars=list(subscale_data.columns)
demogdata=demogdata.ix[subscale_data.index]
assert list(demogdata.index)==list(subscale_data.index)
demogdata_scaled=scale(demogdata.values)
subscale_data=scale(subscale_data.values)
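The BMI line in the cell above uses rounded conversion factors (0.45 kg per pound, 0.025 m per inch). As a hedged aside, the sketch below compares that formula with one using the exact definitions (0.45359237 kg per pound, 0.0254 m per inch). The function names `bmi_rounded` and `bmi_exact` and the example respondent are illustrative and not part of the notebook.

```python
def bmi_rounded(weight_lb, height_in):
    """BMI with the rounded factors used in the notebook cell."""
    return weight_lb * 0.45 / (height_in * 0.025) ** 2


def bmi_exact(weight_lb, height_in):
    """BMI with the exact pound-to-kilogram and inch-to-metre definitions."""
    return weight_lb * 0.45359237 / (height_in * 0.0254) ** 2


# hypothetical respondent: 200 lb, 70 in
ratio = bmi_rounded(200, 70) / bmi_exact(200, 70)
print(round(ratio, 3))
```

The ratio does not depend on the respondent: the rounded factors give values about 2.4% higher, so the `BMI>30` cutoff corresponds to an exact BMI of roughly 29.3.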

First, binarize the demographic variables and test how well they can be classified from the survey data. Only variables with at least 10% of observations in the less frequent category are included. Some of these were not collected as binary variables; we binarize them by treating any value above the minimum as a positive outcome.

In [4]:

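bvardata=numpy.zeros((len(demogdata),len(binary_vars)))
for i,v in enumerate(binary_vars):
    d=demogdata[v].copy()
    # recode the minimum value as 0 so every variable shares the same baseline
    if not d.min()==0:
        d[d==d.min()]=0
    # any value above the minimum counts as a positive outcome
    d[d>d.min()]=1
    assert d.isnull().sum()==0
    bvardata[:,i]=d.values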
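The text above mentions a 10% threshold for the less frequent category, but the cell itself does not show that check. A minimal, dependency-free sketch of how the binarization and the threshold could be verified is given below; `binarize`, `minority_fraction`, and the example counts are illustrative assumptions, not code from the notebook.

```python
def binarize(values):
    """Code the minimum value as 0 and anything above it as 1."""
    lowest = min(values)
    return [0 if v == lowest else 1 for v in values]


def minority_fraction(binary_values):
    """Share of observations in the less frequent of the two categories."""
    positive_rate = sum(binary_values) / float(len(binary_values))
    return min(positive_rate, 1 - positive_rate)


# made-up counts for ten respondents (e.g. number of divorces)
counts = [0, 0, 1, 0, 2, 0, 0, 0, 0, 0]
binary = binarize(counts)

# keep the variable only if the rarer category covers at least 10% of cases
keep = minority_fraction(binary) >= 0.10
print(binary, minority_fraction(binary), keep)
```

Applied to each column of `bvardata`, a check like this would flag any variable whose rarer category falls below the stated threshold.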



In [24]:
nfeatures=5 # number of features to show

for i in range(len(binary_vars)):
    print('')
    y=bvardata[:,i]
    kf=StratifiedKFold(y,n_folds=8) # use stratified K-fold CV to get roughly equal folds
    # SVC uses fixed default hyperparameters; the commented-out LogisticRegressionCV
    # would instead tune its penalty with an inner CV loop on the training data
    clf=SVC(probability=True) #LogisticRegressionCV(solver='liblinear',penalty='l1')  #LinearSVC()
    
    pred=numpy.zeros(len(y))

    for train,test in kf:
        clf.fit(subscale_data[train,:],y[train])
        # predict_proba returns one column per class; keep P(class 1) for the AUC
        pred[test]=clf.predict_proba(subscale_data[test,:])[:,1]
    rocauc=roc_auc_score(y,pred)

    print('%s: predictive accuracy (AUC - chance = 0.5) = %0.3f'%(binary_vars[i],rocauc))
    print("Features sorted by their absolute correlation with outcome (top %d):"%nfeatures)
    featcorr=numpy.array([numpy.corrcoef(subscale_data[:,x],y)[0,1] for x in range(subscale_data.shape[1])])
    idx=numpy.argsort(numpy.abs(featcorr))[::-1]
    # use j here so the outer loop index i is not shadowed
    for j in range(nfeatures):
        print('%f: %s'%(featcorr[idx[j]],subscale_vars[idx[j]]))


ArrestedChargedLifeCount: predictive accuracy (AUC - chance = 0.5) = 0.552
Features sorted by their absolute correlation with outcome (top 5):
-0.158825: future_time_perspective_survey.future_time_perspective
0.131977: dospert_eb_survey.recreational
0.129978: eating_survey.cognitive_restraint
0.128566: impulsive_venture_survey.venturesomeness
0.121823: bis_bas_survey.BAS_fun_seeking

DivorceCount: predictive accuracy (AUC - chance = 0.5) = 0.523
Features sorted by their absolute correlation with outcome (top 5):
0.229985: eating_survey.emotional_eating
0.182185: leisure_time_activity_survey.activity_level
0.180984: dospert_rt_survey.financial
-0.178070: time_perspective_survey.past_positive
-0.166550: five_facet_mindfulness_survey.describe





GamblingProblem: predictive accuracy (AUC - chance = 0.5) = 0.594
Features sorted by their absolute correlation with outcome (top 5):
0.224106: dickman_survey.functional
0.170774: dospert_eb_survey.social
0.155250: bis_bas_survey.BAS_drive
0.151556: dospert_eb_survey.recreational
0.140211: time_perspective_survey.past_positive

ChildrenNumber: predictive accuracy (AUC - chance = 0.5) = 0.506
Features sorted by their absolute correlation with outcome (top 5):
-0.306942: dospert_rp_survey.financial
-0.210904: bis_bas_survey.BAS_reward_responsiveness
0.201684: dospert_eb_survey.financial
0.179619: dospert_rt_survey.financial
-0.171689: dospert_rp_survey.health/safety

CreditCardDebt: predictive accuracy (AUC - chance = 0.5) = 0.459
Features sorted by their absolute correlation with outcome (top 5):
-0.163655: leisure_time_activity_survey.activity_level
0.147613: time_perspective_survey.present_hedonistic
-0.142226: erq_survey.suppression
-0.136683: bis11_survey.first_order_attention
-0.13
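
The AUC values reported above can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, which is why 0.5 corresponds to chance. A minimal, dependency-free sketch of that pairwise definition follows; `pairwise_auc` is an illustrative helper, not the `roc_auc_score` implementation used in the notebook, and the labels and scores are made up.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# made-up labels and predicted probabilities
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(labels, scores))
```

Under this reading, an AUC below 0.5, such as the 0.459 for CreditCardDebt, means the out-of-fold scores ranked cases slightly worse than chance.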

In [None]:
%%R -i workers
# export MAP factor scores for each fitted MIRT model (3 to 10 dimensions)
compnums=3:10
for (ncomps in compnums) {
  # the .Rdata file is expected to provide the fitted model object `m`
  load(sprintf('rdata_files_wrangler/mirt_%ddims.Rdata',ncomps))
  scores=fscores(m,full.scores = TRUE,method='MAP')
  scores=data.frame(scores)
  row.names(scores)=workers
  write.table(scores,file=sprintf('factor_scores/factor_scores_%ddims.tsv',ncomps),sep='\t',quote=FALSE,col.names=FALSE)
}