CONTENT WARNING: explicit and multiple references to suicide attempts and suicidal ideation

The following is an analysis regarding attempted suicides and suicidal thoughts. Questions around these topics will be directly referenced and mentioned explicitly. While their potential correlations with other parameters will be analysed and discussed at length, suicide attempts and suicidal ideation as concepts will not be discussed. Deaths by suicide will not be referenced or discussed at all.

Please only continue if it is comfortable for you to do so.

In [118]:
import pandas as pd
import statsmodels.api as sm
import numpy as np
from sklearn.model_selection import train_test_split
import seaborn as sb
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

The data are the National Survey on Drug Use and Health (NSDUH) 2017 results, downloaded from SAMHDA, the Substance Abuse and Mental Health Data Archive (https://www.datafiles.samhsa.gov/study-series/national-survey-drug-use-and-health-nsduh-nid13517), after agreeing to their terms of use.

In [4]:
df = pd.read_csv('NSDUH_2017_Tab.tsv', sep='\t', index_col=0)


In [6]:
df

QUESTID2,FILEDATE,CIGEVER,CIGOFRSM,CIGWILYR,CIGTRY,CIGYFU,CIGMFU,CIGREC,CIG30USE,CG30EST,...,POVERTY3,TOOLONG,TROUBUND,PDEN10,COUTYP4,MAIIN102,AIIND102,ANALWT_C,VESTR,VEREP
55235143,10/09/2018,1,99,99,13,9999,99,4,93,93,...,3.0,2,2,1,1,2,2,11203.888954,40043,1
13435143,10/09/2018,1,99,99,15,9999,99,1,18,99,...,3.0,1,2,1,1,2,2,9496.462244,40006,2
81345143,10/09/2018,1,99,99,14,9999,99,1,10,99,...,3.0,2,2,1,1,2,2,2943.702802,40030,2
53955143,10/09/2018,1,99,99,16,9999,99,4,93,93,...,3.0,2,2,2,2,2,2,1783.702549,40026,2
51775143,10/09/2018,2,99,99,991,9991,91,91,91,91,...,3.0,1,1,1,1,2,2,31528.749357,40029,1
47796143,10/09/2018,1,99,99,15,9999,99,4,93,93,...,1.0,2,2,3,3,2,2,13593.927387,40035,1
13196143,10/09/2018,1,99,99,15,9999,99,3,93,93,...,3.0,2,2,3,3,2,2,3486.457416,40011,2
81726143,10/09/2018,1,99,99,14,9999,99,3,93,93,...,2.0,2,2,2,2,2,2,782.266930,40032,2
61536143,10/09/2018,1,99,99,985,9998,98,14,93,93,...,1.0,2,2,2,3,2,2,836.875263,40024,2
10636143,10/09/2018,2,4,4,991,9991,91,91,91,91,...,1.0,2,2,1,1,2,2,782.663302,40001,2


The table has 2667 columns. That is a lot of parameters, and the large majority of them are categorical.
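One rough way to check how many columns behave like categorical variables is to count distinct values per column. The sketch below uses a made-up three-column table rather than the survey file, and both the helper name `low_cardinality_columns` and the threshold are illustrative assumptions, not something taken from the codebook.

```python
import pandas as pd

# Toy stand-in for the survey table (values are made up for illustration).
toy = pd.DataFrame({
    "CIGEVER": [1, 2, 1, 2, 1, 1],       # yes/no style code
    "HEALTH": [1, 2, 3, 4, 5, 3],        # five-level rating
    "ANALWT_C": [11203.9, 9496.5, 2943.7, 1783.7, 31528.7, 13593.9],  # continuous weight
})

def low_cardinality_columns(frame, max_levels=10):
    """Return the columns with at most max_levels distinct non-null values."""
    n_levels = frame.nunique(dropna=True)
    return n_levels[n_levels <= max_levels].index.tolist()

categorical_like = low_cardinality_columns(toy, max_levels=5)
print(categorical_like)
```

On the real table the same call could be made as `low_cardinality_columns(df)`, keeping in mind that the missing-value codes (such as 99) count as extra levels.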

Descriptions of the parameters can be found in the codebook.

'SUICTHNK' - At any time in the past 12 months, that is from [DATEFILL] up to and including today, did you seriously think about trying to kill yourself? yes = 1, no = 2; several other codes are also used (see the codebook)

'SUICPLAN' - During the past 12 months, did you make any plans to kill yourself?, yes = 1, no = 2

'SUICTRY' - During the past 12 months, did you try to kill yourself?

'MHSUITHK' - same as SUICTHNK, no = 0, yes = 1, all others = nan

'MHSUIPLN' - same as SUICPLAN, no = 0, yes = 1, all others = nan

'MHSUITRY' - same as SUICTRY, no = 0, yes = 1, all others = nan

'MHSUTK_U' - same as MHSUITHK, but no/unknown are grouped together

'ADWRSTHK' - Did you think about committing suicide? (think about the period of time/most recent period of time when your [FEELNOUN] and other problems were the worst.), yes = 1, no = 2

'ADWRSPLN' - same as above, but Did you make a suicide plan?

'ADWRSATP' - same as above, but Did you make a suicide attempt?

'AD_MDEA9' - ANY THOUGHTS OR PLANS OF SUICIDE, 1 = has symptoms, 2 = does not

'SIMHSUI2' - whether a respondent received their most recent mental health services from at least one inpatient/residential specialty mental health source in the past year because they thought about or tried to kill themselves, 0 = no, 1 = yes

'SOMHSUI' -  whether a respondent received their most recent mental health services from at least one outpatient specialty mental health source in the past year because they thought about or tried to kill themselves, 0 = no, 1 = yes

'SMHSUI2' -  whether a respondent received their most recent mental health services from at least one specialty mental health source in the past year because they thought about or tried to kill themselves, 0 = no, 1 = yes

'YOWRSTHK' - same as 'ADWRSTHK' but for youth (12-17)

'YOWRSPLN' - same as above, but with plan

'YOWRSATP' - same as above, but suicide attempt

'YO_MDEA9' - same as 'AD_MDEA9', but for youth (12-17 yrs)
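The relationship between the raw questions (yes = 1, no = 2, other codes for missing answers) and the recoded MHSUI* variables (no = 0, yes = 1, everything else NaN) can be sketched as follows. This is only an illustration of the recoding described in the list above, on made-up values; the helper name `recode_yes_no` is hypothetical and the survey's own derivation may involve additional rules.

```python
import pandas as pd

def recode_yes_no(series):
    """Map 1 -> 1.0 (yes) and 2 -> 0.0 (no); every other code becomes NaN."""
    return series.map({1: 1.0, 2: 0.0})

# Made-up answers, including a few missing/skip style codes.
raw = pd.Series([1, 2, 2, 94, 97, 99, 1], name="SUICTHNK")
recoded = recode_yes_no(raw)
print(recoded.tolist())
```

Because the missing codes turn into NaN, summary statistics on the recoded column are no longer dominated by values such as 99.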

In [7]:
use_col = ['SUICTHNK','SUICPLAN','SUICTRY','MHSUITHK','MHSUIPLN','MHSUITRY','MHSUTK_U','ADWRSTHK','ADWRSPLN',
           'ADWRSATP','AD_MDEA9','SIMHSUI2','SOMHSUI','SMHSUI2','YOWRSTHK','YOWRSPLN','YOWRSATP','YO_MDEA9']
df[use_col].describe()

Unnamed: 0,SUICTHNK,SUICPLAN,SUICTRY,MHSUITHK,MHSUIPLN,MHSUITRY,MHSUTK_U,ADWRSTHK,ADWRSPLN,ADWRSATP,AD_MDEA9,SIMHSUI2,SOMHSUI,SMHSUI2,YOWRSTHK,YOWRSPLN,YOWRSATP,YO_MDEA9
count,56276.0,56276.0,56276.0,42240.0,42237.0,42237.0,42554.0,56276.0,56276.0,56276.0,56276.0,320.0,1792.0,1896.0,56276.0,56276.0,56276.0,56276.0
mean,26.128367,94.519369,94.527063,0.061269,0.020006,0.009612,0.060817,85.948681,93.597146,93.604698,85.303469,0.553125,0.297991,0.313819,94.265939,96.37835,96.382206,94.021235
std,41.965473,20.373581,20.337973,0.239826,0.140023,0.097572,0.238997,33.170977,22.276343,22.245322,33.012253,0.497948,0.457503,0.464166,20.944202,15.750741,15.727177,21.049307
min,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
25%,2.0,99.0,99.0,0.0,0.0,0.0,0.0,99.0,99.0,99.0,98.0,0.0,0.0,0.0,99.0,99.0,99.0,99.0
50%,2.0,99.0,99.0,0.0,0.0,0.0,0.0,99.0,99.0,99.0,98.0,1.0,0.0,0.0,99.0,99.0,99.0,99.0
75%,2.0,99.0,99.0,0.0,0.0,0.0,0.0,99.0,99.0,99.0,98.0,1.0,1.0,1.0,99.0,99.0,99.0,99.0
max,99.0,99.0,99.0,1.0,1.0,1.0,1.0,99.0,99.0,99.0,99.0,1.0,1.0,1.0,99.0,99.0,99.0,99.0


Note: as discussed in the codebook, several parameters are derived from a regression model built in the 2012 iteration of this study. The codebook recommends against using them when studying suicide attempts and ideation (among other things), because they were found to systematically overpredict the occurrence of these phenomena. I will not be using them in any modeling.

In [8]:
print('Num adults thought about killing themselves in last year: '+str(len(df[df.MHSUITHK==1])))
print('Num adults made plans to kill themselves in last year: '+str(len(df[df.MHSUIPLN==1])))
print('Num adults attempted to kill themselves in last year: '+str(len(df[df.MHSUITRY==1])))
print('Num adults thought at difficult time about killing themselves: '+str(len(df[df.ADWRSTHK==1])))
print('Num adults made plans at difficult time to kill themselves: '+str(len(df[df.ADWRSPLN==1])))
print('Num adults attempted at difficult time to kill themselves: '+str(len(df[df.ADWRSATP==1])))
print('Num adults any thoughts/attempts to kill themselves: '+str(len(df[df.AD_MDEA9==1])))
len(df[(df.ADWRSTHK==1) | (df.ADWRSPLN==1) | (df.ADWRSATP==1) | (df.MHSUITHK==1) | (df.MHSUIPLN==1) | (df.MHSUITRY==1)])

Num adults thought about killing themselves in last year: 2588
Num adults made plans to kill themselves in last year: 845
Num adults attempted to kill themselves in last year: 406
Num adults thought at difficult time about killing themselves: 3121
Num adults made plans at difficult time to kill themselves: 1183
Num adults attempted at difficult time to kill themselves: 761
Num adults any thoughts/attempts to kill themselves: 4988


4153

The union of the six groups above contains 4153 adults. Why does that not equal 4988, the number of adults flagged by AD_MDEA9 as having any suicidal thoughts/attempts?

In [9]:
use_col = ['SUICTHNK','SUICPLAN','SUICTRY','MHSUITHK','MHSUIPLN','MHSUITRY','MHSUTK_U','ADWRSTHK','ADWRSPLN',
           'ADWRSATP','AD_MDEA9','SIMHSUI2','SOMHSUI','SMHSUI2','YOWRSTHK','YOWRSPLN','YOWRSATP','YO_MDEA9',
           'YUHOSUIC','YURSSUIC','ADWRDBTR','ADWRDLOT']

df[df.AD_MDEA9==1][use_col]

QUESTID2,SUICTHNK,SUICPLAN,SUICTRY,MHSUITHK,MHSUIPLN,MHSUITRY,MHSUTK_U,ADWRSTHK,ADWRSPLN,ADWRSATP,...,SOMHSUI,SMHSUI2,YOWRSTHK,YOWRSPLN,YOWRSATP,YO_MDEA9,YUHOSUIC,YURSSUIC,ADWRDBTR,ADWRDLOT
37837143,2,99,99,0.0,0.0,0.0,0.0,2,99,99,...,,,99,99,99,99,99,99,2,1
34608143,2,99,99,0.0,0.0,0.0,0.0,1,2,2,...,,,99,99,99,99,99,99,1,1
24160143,2,99,99,0.0,0.0,0.0,0.0,2,99,99,...,,,99,99,99,99,99,99,1,1
41932143,2,99,99,0.0,0.0,0.0,0.0,2,99,99,...,,,99,99,99,99,99,99,1,1
63244143,2,99,99,0.0,0.0,0.0,0.0,1,2,2,...,,,99,99,99,99,99,99,1,1
19246243,2,99,99,0.0,0.0,0.0,0.0,2,99,99,...,,,99,99,99,99,99,99,1,2
49297243,2,99,99,0.0,0.0,0.0,0.0,1,2,2,...,,,99,99,99,99,99,99,1,1
34879243,2,99,99,0.0,0.0,0.0,0.0,1,2,2,...,,,99,99,99,99,99,99,1,1
47040243,2,99,99,0.0,0.0,0.0,0.0,1,2,2,...,,,99,99,99,99,99,99,1,1
24770243,1,1,2,1.0,1.0,0.0,1.0,1,1,2,...,,,99,99,99,99,99,99,1,1


In [10]:
len(df[(df.ADWRSTHK==1) | (df.ADWRSPLN==1) | (df.ADWRSATP==1) | (df.ADWRDBTR==1) | (df.ADWRDLOT==1)])

4988

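AD_MDEA9 refers only to the most difficult period in a person's life (the ADWR* questions). It also counts thoughts about death (ADWRDLOT) and thinking one would be better off dead (ADWRDBTR), not just suicidal thoughts, plans and attempts, which is why the union of those five questions gives exactly 4988.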
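The counting logic here is a set union: a respondent who answers yes to several questions is counted once, so the union is never larger than the sum of the individual counts. A minimal sketch on made-up answers (the column names are borrowed from the survey, the values are not):

```python
import pandas as pd

# Made-up answers for five respondents; 1 = yes, 2 = no.
toy = pd.DataFrame({
    "ADWRSTHK": [1, 1, 2, 2, 2],
    "ADWRDLOT": [1, 2, 1, 2, 2],
    "ADWRDBTR": [1, 2, 2, 1, 2],
})

flags = toy.eq(1)                               # True where the answer is yes
sum_of_counts = int(flags.sum().sum())          # adds up the per-question counts
union_count = int(flags.any(axis=1).sum())      # respondents with at least one yes

print(sum_of_counts, union_count)
```

`flags.any(axis=1)` is equivalent to chaining the conditions with `|` as in the cell above, and is easier to extend when more questions are added.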

In [11]:
print('Num adults w/ suicidal thoughts last year: '+str(len(df[df.MHSUITHK==1])))
print('Num adults w/ suicidal plan last year: '+str(len(df[df.MHSUIPLN==1])))
print('Num adults w/ suicidal attempt last year: '+str(len(df[df.MHSUITRY==1])))
print('Num adults w/ suicidal thoughts in difficult time: '+str(len(df[df.ADWRSTHK==1])))
print('Num adults w/ suicidal plan in difficult time: '+str(len(df[df.ADWRSPLN==1])))
print('Num adults w/ suicidal attempt in difficult time: '+str(len(df[df.ADWRSATP==1])))
print('Num adults w/ thoughts about death (theirs, others, general) in difficult time: '+str(len(df[df.ADWRDLOT==1])))
print('Num adults w/ thought better if they were dead in difficult time: '+str(len(df[df.ADWRDBTR==1])))
print('Num adults answered yes for past 5 questions: '+str(len(df[df.AD_MDEA9==1])))

Num adults w/ suicidal thoughts last year: 2588
Num adults w/ suicidal plan last year: 845
Num adults w/ suicidal attempt last year: 406
Num adults w/ suicidal thoughts in difficult time: 3121
Num adults w/ suicidal plan in difficult time: 1183
Num adults w/ suicidal attempt in difficult time: 761
Num adults w/ thoughts about death (theirs, others, general) in difficult time: 4084
Num adults w/ thought better if they were dead in difficult time: 3930
Num adults answered yes for past 5 questions: 4988


In [12]:
print('Num adults w/ suicidal thoughts: '+str(len(df[(df.MHSUITHK==1) | (df.ADWRSTHK==1)])))
print('Num adults w/ suicidal plans: '+str(len(df[(df.MHSUIPLN==1) | (df.ADWRSPLN==1)])))
print('Num adults w/ suicidal attempt: '+str(len(df[(df.MHSUITRY==1) | (df.ADWRSATP==1)])))
print('Num adults w/ thoughts of death: '+str(len(df[(df.ADWRDLOT==1) | (df.ADWRDBTR==1)])))
print('Num adults w/ thoughts of death or suicide: '+str(len(df[(df.ADWRDLOT==1) | (df.ADWRDBTR==1) | (df.MHSUITHK==1) | (df.ADWRSTHK==1)])))

Num adults w/ suicidal thoughts: 4153
Num adults w/ suicidal plans: 1522
Num adults w/ suicidal attempt: 946
Num adults w/ thoughts of death: 4909
Num adults w/ thoughts of death or suicide: 5914


In [13]:
df[df.MHSUITRY==1][['MHSUIPLN','MHSUITHK']].describe()

Unnamed: 0,MHSUIPLN,MHSUITHK
count,406.0,406.0
mean,0.864532,1.0
std,0.342645,0.0
min,0.0,1.0
25%,1.0,1.0
50%,1.0,1.0
75%,1.0,1.0
max,1.0,1.0


In [14]:
df[df.MHSUITRY==0][['MHSUIPLN','MHSUITHK']].describe()

Unnamed: 0,MHSUIPLN,MHSUITHK
count,41829.0,41831.0
mean,0.011762,0.052091
std,0.107815,0.222212
min,0.0,0.0
25%,0.0,0.0
50%,0.0,0.0
75%,0.0,0.0
max,1.0,1.0


In [15]:
df[df.ADWRSATP==1][['ADWRSPLN','ADWRSTHK','ADWRDLOT','ADWRDBTR']].describe()

Unnamed: 0,ADWRSPLN,ADWRSTHK,ADWRDLOT,ADWRDBTR
count,761.0,761.0,761.0,761.0
mean,1.254928,1.0,1.093298,1.148489
std,0.436107,0.0,0.291041,3.482328
min,1.0,1.0,1.0,1.0
25%,1.0,1.0,1.0,1.0
50%,1.0,1.0,1.0,1.0
75%,2.0,1.0,1.0,1.0
max,2.0,1.0,2.0,97.0


In [16]:
df[df.ADWRSATP==2][['ADWRSPLN','ADWRSTHK','ADWRDLOT','ADWRDBTR']].describe()

Unnamed: 0,ADWRSPLN,ADWRSTHK,ADWRDLOT,ADWRDBTR
count,2357.0,2357.0,2357.0,2357.0
mean,1.817989,1.0,1.240136,1.13322
std,2.766133,0.0,2.7722,1.935743
min,1.0,1.0,1.0,1.0
25%,1.0,1.0,1.0,1.0
50%,2.0,1.0,1.0,1.0
75%,2.0,1.0,1.0,1.0
max,97.0,1.0,97.0,94.0


In [17]:
df[df.YOWRSATP==1][['YOWRSPLN','YOWRSTHK','YOWRDLOT','YOWRDBTR']].describe()

Unnamed: 0,YOWRSPLN,YOWRSTHK,YOWRDLOT,YOWRDBTR
count,545.0,545.0,545.0,545.0
mean,1.322936,1.0,1.592661,1.033028
std,4.121135,0.0,7.035568,0.178873
min,1.0,1.0,1.0,1.0
25%,1.0,1.0,1.0,1.0
50%,1.0,1.0,1.0,1.0
75%,1.0,1.0,1.0,1.0
max,97.0,1.0,97.0,2.0


In [18]:
df[df.YOWRSATP==2][['YOWRSPLN','YOWRSTHK','YOWRDLOT','YOWRDBTR']].describe()

Unnamed: 0,YOWRSPLN,YOWRSTHK,YOWRDLOT,YOWRDBTR
count,964.0,964.0,964.0,964.0
mean,2.479253,1.0,1.797718,1.582988
std,8.629478,0.0,7.933495,6.727014
min,1.0,1.0,1.0,1.0
25%,1.0,1.0,1.0,1.0
50%,2.0,1.0,1.0,1.0
75%,2.0,1.0,1.0,1.0
max,97.0,1.0,97.0,97.0


Judging from the numbers in the codebook, only respondents who report suicidal ideation are then asked about plans and attempts. The outputs above agree: everyone who reported an attempt also reported suicidal thoughts.

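As one would suspect, a large share of those who reported a suicide attempt also reported a suicide plan: about 86% of adults with an attempt in the past year (MHSUITRY == 1) had also made a plan, compared with about 1% of those without an attempt.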
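The percentages come from the `mean` row of the `describe()` outputs: for a 0/1 column, the mean within a group is the share of that group answering yes. A small sketch of the same idea with a single `groupby`, on made-up data:

```python
import pandas as pd

# Made-up 0/1 answers for ten respondents.
toy = pd.DataFrame({
    "MHSUITRY": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "MHSUIPLN": [1, 1, 1, 0, 0, 0, 0, 0, 1, 0],
})

# Share of respondents with a plan, split by whether they reported an attempt.
plan_rate = toy.groupby("MHSUITRY")["MHSUIPLN"].mean()
print(plan_rate)
```

Note that this only works on the recoded MHSUI* columns. On the raw 1/2 columns the mean mixes in codes such as 97, which is why the ADWR* and YOWR* means above are harder to read.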

In [21]:
def mkTrainTest(df):
    # NSDUH missing/skip style codes. 186-188 and 180 are presumably what the
    # HTINCHE2/WTPOUND2 missing codes become after the binning below
    # (e.g. 998 -> 188 and 9998 -> 180).
    rem_val = [81,83,85,89,91,93,94,97,98,99,981,983,985,989,991,993,994,997,998,999,9981,9983,9985,9989,9991,9993,
               9994,9997,9998,9999,186,187,188,180]

    dfc = df.copy()

    # Bin height, weight and BMI into coarse ordinal categories
    dfc['HTINCHE2'] = np.floor((dfc['HTINCHE2']-55)/5)
    dfc['WTPOUND2'] = np.floor((dfc['WTPOUND2']-75)/55)
    dfc['BMI2'] = np.floor((dfc['BMI2']-9.3)/14.83)

    # Merge category codes; assigning the result avoids chained in-place edits
    dfc['EDUSCHLGO'] = dfc['EDUSCHLGO'].replace(11,1)
    dfc['BOOKED'] = dfc['BOOKED'].replace(3,1)
    # Recode to binary: 1 -> 0, 2-5 -> 1
    for col in ['SNYSELL','SNYSTOLE','SNYATTAK']:
        dfc[col] = dfc[col].replace(1,0).replace([2,3,4,5],1)

    # Turn every missing/skip code into NaN (applied to all columns)
    dfc = dfc.replace(rem_val,np.nan)
    df1 = dfc[dfc.MHSUITHK==1]
    df0 = dfc[dfc.MHSUITHK==0]

    # Undersample the "no" group to the size of the "yes" group
    # (not seeded, so the sample differs between runs)
    df0p = df0.sample(len(df1))

    drop_col = ['MHSUITHK','SUICTHNK','FILEDATE']
    X_all = pd.concat([df0p.drop(columns=drop_col), df1.drop(columns=drop_col)])
    y_all = pd.concat([df0p.MHSUITHK, df1.MHSUITHK])

    return train_test_split(X_all, y_all, test_size=0.3, random_state=0)
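The class balancing step in `mkTrainTest` can be isolated into a small helper. The sketch below is a seeded variant on made-up data, so that repeated runs give the same sample. The helper name `balanced_undersample` and the `seed` parameter are assumptions for illustration and are not part of the notebook, and the sketch assumes the positive class is the smaller one.

```python
import pandas as pd

def balanced_undersample(frame, target, seed=0):
    """Keep all rows with target == 1 and an equally sized random sample of target == 0."""
    positives = frame[frame[target] == 1]
    negatives = frame[frame[target] == 0]
    # Assumes there are at least as many negatives as positives.
    negatives = negatives.sample(n=len(positives), random_state=seed)
    return pd.concat([negatives, positives])

# Made-up table: 3 positives, 17 negatives.
toy = pd.DataFrame({"MHSUITHK": [1] * 3 + [0] * 17, "x": range(20)})
balanced = balanced_undersample(toy, "MHSUITHK")
print(balanced["MHSUITHK"].value_counts())
```

Fixing the seed makes the downstream model outputs reproducible, at the cost of always discarding the same negative rows.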

In [11]:
df[['HTINCHE2','WTPOUND2']].describe()

Unnamed: 0,HTINCHE2,WTPOUND2
count,56276.0,56276.0
mean,96.318608,508.800714
std,163.520676,1795.827742
min,55.0,75.0
25%,64.0,135.0
50%,67.0,165.0
75%,70.0,200.0
max,998.0,9998.0
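The maxima of 998 and 9998 are codes, not real heights or weights, which is why the means are so inflated. The short check below works through the binning arithmetic from `mkTrainTest` for the height and weight style codes listed in `rem_val`, to show where the extra entries 186, 187, 188 and 180 come from. The two helper functions are hypothetical names that just restate the formulas.

```python
import math

def height_bin(inches):
    """Same formula as the HTINCHE2 binning: 5-inch bins starting at 55."""
    return math.floor((inches - 55) / 5)

def weight_bin(pounds):
    """Same formula as the WTPOUND2 binning: 55-pound bins starting at 75."""
    return math.floor((pounds - 75) / 55)

height_codes = [985, 994, 997, 998]      # taken from rem_val
weight_codes = [9985, 9994, 9997, 9998]  # taken from rem_val

height_bins = [height_bin(c) for c in height_codes]
weight_bins = [weight_bin(c) for c in weight_codes]
print(height_bins, weight_bins)
```

Since the binning runs before the codes are replaced with NaN, the binned images of the codes have to be in the removal list as well.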


In [6]:
def featFindBeta(X,y):
    # Run RFE on blocks of 1000 columns at a time and pool the selected dummy features
    list_p = list(X)
    find_p = []
    rfe = RFE(LogisticRegression())

    for start in range(0, len(list_p), 1000):
        use_p = list_p[start:start+1000]   # the last block may be shorter

        use_X = pd.get_dummies(X[use_p],columns=use_p)
        rfe = rfe.fit(use_X,y.values.ravel())
        tmp_p = np.array(list(use_X))
        find_p.extend(tmp_p[rfe.support_])

    return find_p

In [73]:
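def featFind(X,y,use_col,X_test):
    logreg = LogisticRegression()
    rfe = RFE(logreg)
    # Work on copies so the recoding below does not touch (a view of) X or X_test
    use_X = X[use_col].copy()
    X_teu = X_test[use_col].copy()
    dum_p = []

    for param in use_col:
        if use_X[param].max() == 2:
            # yes/no question coded 1/2: recode "no" to 0
            use_X[param] = use_X[param].replace(2,0)
            X_teu[param] = X_teu[param].replace(2,0)
        elif use_X[param].max() > 2:
            # more than two levels: one-hot encode
            dum_p.append(param)

    use_X = pd.get_dummies(use_X,columns=dum_p)
    X_teu = pd.get_dummies(X_teu,columns=dum_p)
    # Give the test set the same dummy columns as the training set
    X_teu = X_teu.reindex(columns=use_X.columns,fill_value=0)
    tmp_p = np.array(list(use_X))
    rfe = rfe.fit(use_X,y.values.ravel())
    return tmp_p[rfe.support_],use_X[tmp_p[rfe.support_]],X_teu[tmp_p[rfe.support_]]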
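One thing to watch with `pd.get_dummies` applied separately to the training and test sets is that the two can end up with different columns when a category is missing from one of them. A minimal sketch of aligning them with `reindex`, on made-up data:

```python
import pandas as pd

# Made-up data: level 3 only occurs in the test set, levels 1 and 5 only in training.
train = pd.DataFrame({"HEALTH": [1, 2, 5, 2]})
test = pd.DataFrame({"HEALTH": [2, 3]})

train_d = pd.get_dummies(train, columns=["HEALTH"])
test_d = pd.get_dummies(test, columns=["HEALTH"])

# Keep exactly the training columns: missing ones are filled with 0,
# columns unseen during training are dropped.
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(test_aligned.columns))
```

Without the alignment, selecting the RFE-chosen columns from the test set raises a `KeyError` whenever one of them did not occur in the test data.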

In [71]:
X_train,X_test,y_train,y_test = mkTrainTest(df)

In [8]:
use_col = ['SERVICE','HEALTH']

featFind(X_train,y_train,use_col,X_test)[0]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._update_inplace(new_data)


array(['HEALTH_1.0', 'HEALTH_2.0', 'HEALTH_5.0'],
      dtype='|S10')

In [None]:
use_col = ['CATAG6','NOMARR2','SERVICE','HEALTH','MOVSINPYR2','SEXATRACT','DIFFHEAR','DIFFSEE','DIFFTHINK','DIFFWALK',
           'DIFFDRESS','DIFFERAND','IRSEX','IRMARIT','IREDUHIGHST2','NEWRACE2','EDUHIGHCAT','EDUSCHLGO','WRKSTATWK2',
           'EDFAM18','PRVHLTIN','HLCNOTYR','IRPINC3','IFAMIN3','INCOME','POVERTY','MAIIN102','BOOKED','TXEVRRCVD',
           'PREGNANT','HTINCHE2','WTPOUND2','INHOSPYR','NMVSOEST','AUINPYR','AUOPTYR','AURXYR','AUUNMTYR','SNYSELL',
           'SNYSTOLE','SNYATTAK','SNRLGIMP','DSTWORST','ADDPREV','AMDELT']

In [93]:
use_col = ['CATAG6','HEALTH','MOVSINPYR2','SEXATRACT','IRMARIT','IREDUHIGHST2','NEWRACE2','WRKSTATWK2','IRPINC3',
           'SNRLGIMP','WTPOUND2','BMI2']

use_param, use_X, X_ntest = featFind(X_train,y_train,use_col,X_test)
use_param

array(['CATAG6_2', 'CATAG6_3', 'CATAG6_6', 'HEALTH_1.0', 'HEALTH_2.0',
       'HEALTH_4.0', 'HEALTH_5.0', 'MOVSINPYR2_2.0', 'MOVSINPYR2_3.0',
       'SEXATRACT_1.0', 'SEXATRACT_2.0', 'SEXATRACT_3.0', 'SEXATRACT_4.0',
       'SEXATRACT_5.0', 'IRMARIT_1.0', 'IRMARIT_2.0', 'IRMARIT_3.0',
       'IREDUHIGHST2_1', 'IREDUHIGHST2_2', 'IREDUHIGHST2_3',
       'IREDUHIGHST2_5', 'NEWRACE2_2', 'NEWRACE2_4', 'NEWRACE2_5',
       'NEWRACE2_7', 'WRKSTATWK2_4.0', 'WRKSTATWK2_6.0', 'IRPINC3_5',
       'IRPINC3_7', 'SNRLGIMP_1.0', 'SNRLGIMP_3.0', 'BMI2_0.0', 'BMI2_1.0',
       'BMI2_2.0', 'BMI2_3.0'],
      dtype='|S15')

In [94]:
logit_model=sm.Logit(y_train,use_X)
result=logit_model.fit()
print(result.summary2())

Optimization terminated successfully.
         Current function value: 0.569151
         Iterations 29
                                          Results: Logit
Model:                          Logit                      Pseudo R-squared:           0.179      
Dependent Variable:             MHSUITHK                   AIC:                        4194.0661  
Date:                           2019-01-13 18:49           BIC:                        4410.8931  
No. Observations:               3623                       Log-Likelihood:             -2062.0    
Df Model:                       34                         LL-Null:                    -2511.2    
Df Residuals:                   3588                       LLR p-value:                1.1025e-166
Converged:                      1.0000                     Scale:                      1.0000     
No. Iterations:                 29.0000                                                           
------------------------------------------------

In [95]:
pval = result.pvalues
for tmp_p in use_param[pval < 0.05]:
    print(tmp_p)
    print('num w/ attribute: '+str(len(use_X[use_X[tmp_p]==1])))

CATAG6_2
num w/ attribute: 1527
CATAG6_3
num w/ attribute: 718
CATAG6_6
num w/ attribute: 239
HEALTH_1.0
num w/ attribute: 654
HEALTH_2.0
num w/ attribute: 1256
HEALTH_4.0
num w/ attribute: 475
HEALTH_5.0
num w/ attribute: 115
MOVSINPYR2_2.0
num w/ attribute: 305
MOVSINPYR2_3.0
num w/ attribute: 262
SEXATRACT_2.0
num w/ attribute: 365
SEXATRACT_3.0
num w/ attribute: 260
SEXATRACT_4.0
num w/ attribute: 70
SEXATRACT_5.0
num w/ attribute: 85
IRMARIT_1.0
num w/ attribute: 1176
IRMARIT_2.0
num w/ attribute: 82
IRMARIT_3.0
num w/ attribute: 385
IREDUHIGHST2_5
num w/ attribute: 76
NEWRACE2_2
num w/ attribute: 404
NEWRACE2_5
num w/ attribute: 140
NEWRACE2_7
num w/ attribute: 563
WRKSTATWK2_4.0
num w/ attribute: 221
IRPINC3_5
num w/ attribute: 223
IRPINC3_7
num w/ attribute: 253
SNRLGIMP_1.0
num w/ attribute: 868
SNRLGIMP_3.0
num w/ attribute: 1208


In [102]:
def redParam(use_col,X_train,y_train,X_test,pLim=0.04,nLim=50):
    use_param, use_X, X_ntest = featFind(X_train,y_train,use_col,X_test)
    logit_model=sm.Logit(y_train,use_X)
    result=logit_model.fit()
    
    final_param = []
    for tmp_p in use_param[result.pvalues <= pLim]:
        if len(use_X[use_X[tmp_p]==1]) > nLim:
            final_param.append(tmp_p)
            
    return final_param, use_X[final_param], X_ntest[final_param]
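The filter inside `redParam` keeps a dummy feature only if its p-value is small enough and enough respondents actually have the attribute. The sketch below shows that rule in isolation with made-up p-values and a made-up design matrix. The helper name `keep_features` and the feature names `a`, `b`, `c` are hypothetical.

```python
import pandas as pd

def keep_features(pvalues, design, p_lim=0.04, n_lim=50):
    """Return features with p-value <= p_lim and more than n_lim rows equal to 1."""
    counts = design.eq(1).sum()                  # respondents with each attribute
    mask = (pvalues <= p_lim) & (counts > n_lim)
    return mask[mask].index.tolist()

# Made-up inputs: 100 respondents, three dummy features.
design = pd.DataFrame({
    "a": [1] * 60 + [0] * 40,
    "b": [1] * 80 + [0] * 20,
    "c": [1] * 10 + [0] * 90,
})
pvalues = pd.Series({"a": 0.01, "b": 0.20, "c": 0.001})

selected = keep_features(pvalues, design)
print(selected)
```

Here `b` fails on its p-value and `c` fails on its count, so only `a` survives. The count threshold guards against coefficients that look significant but rest on very few respondents.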

In [103]:
final_param, X_trf, X_tef = redParam(use_col,X_train,y_train,X_test)
logit_model=sm.Logit(y_train,X_trf)
result=logit_model.fit()
print(result.summary2())

Optimization terminated successfully.
         Current function value: 0.569151
         Iterations 29
Optimization terminated successfully.
         Current function value: 0.572385
         Iterations 6
                          Results: Logit
Model:              Logit            Pseudo R-squared: 0.174      
Dependent Variable: MHSUITHK         AIC:              4195.5030  
Date:               2019-01-13 18:52 BIC:              4344.1844  
No. Observations:   3623             Log-Likelihood:   -2073.8    
Df Model:           23               LL-Null:          -2511.2    
Df Residuals:       3599             LLR p-value:      4.5977e-170
Converged:          1.0000           Scale:            1.0000     
No. Iterations:     6.0000                                        
------------------------------------------------------------------
                    Coef.  Std.Err.    z    P>|z|   [0.025  0.975]
------------------------------------------------------------------
CATAG6_2         

In [117]:
y_pred = np.round(result.predict(X_tef))
print('fraction wrong: '+str(np.sum(np.abs(y_pred-y_test))/len(y_test)))
confusion_matrix(y_test, y_pred)

fraction wrong: 0.322601416613


array([[551, 219],
       [282, 501]])

Type I errors (false positives, i.e. suicidal ideation predicted where none was reported): 219

Type II errors (false negatives, i.e. reported suicidal ideation that was not predicted): 282
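As a cross-check, the error rate and the per-class precision and recall can be recomputed by hand from the four cells of the confusion matrix (rows are the true class, columns the predicted class). The variable names below are just for illustration.

```python
# Cells of the confusion matrix shown above.
tn, fp = 551, 219   # true class 0: predicted 0, predicted 1
fn, tp = 282, 501   # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp
error_rate = (fp + fn) / total       # fraction of wrong predictions
precision_pos = tp / (tp + fp)       # of those predicted 1, how many are truly 1
recall_pos = tp / (tp + fn)          # of those truly 1, how many were predicted 1
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

print(round(error_rate, 4), round(precision_pos, 2), round(recall_pos, 2))
```

These values agree with the "fraction wrong" printed above and with the classification report.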

In [120]:
print(classification_report(y_test, y_pred))

             precision    recall  f1-score   support

        0.0       0.66      0.72      0.69       770
        1.0       0.70      0.64      0.67       783

avg / total       0.68      0.68      0.68      1553



In [128]:
result.params.sort_values(ascending=False)

SEXATRACT_3.0     1.375488
SEXATRACT_4.0     1.253534
SEXATRACT_2.0     1.104603
SEXATRACT_5.0     1.081130
HEALTH_5.0        0.997226
MOVSINPYR2_3.0    0.910212
CATAG6_2          0.744024
MOVSINPYR2_2.0    0.399951
WRKSTATWK2_4.0    0.377482
HEALTH_4.0        0.332115
SNRLGIMP_1.0      0.296529
CATAG6_3          0.271338
SNRLGIMP_3.0     -0.317221
NEWRACE2_2       -0.421091
HEALTH_2.0       -0.433873
IRPINC3_5        -0.472115
NEWRACE2_7       -0.472203
IRMARIT_1.0      -0.508152
CATAG6_6         -0.555914
IRPINC3_7        -0.596877
IREDUHIGHST2_5   -0.618605
NEWRACE2_5       -0.749869
HEALTH_1.0       -0.892824
IRMARIT_2.0      -0.911975
dtype: float64
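Logit coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below does this for four of the coefficients listed above. The model was fitted without an explicit intercept term, so these ratios are best treated as rough guides rather than precise effect sizes.

```python
import math

# A few coefficients copied from the output above.
coefs = {
    "SEXATRACT_3.0": 1.375488,
    "HEALTH_5.0": 0.997226,
    "HEALTH_1.0": -0.892824,
    "IRMARIT_2.0": -0.911975,
}

# exp(coefficient) = multiplicative change in the odds of reporting suicidal ideation
odds_ratios = {name: math.exp(value) for name, value in coefs.items()}

for name, ratio in odds_ratios.items():
    print(name, round(ratio, 2))
```

A ratio above 1 means higher odds of reporting suicidal ideation for respondents with that attribute, and a ratio below 1 means lower odds. The largest coefficient corresponds to roughly four times the odds.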

While more work is needed to better predict who will experience suicidal ideation, this analysis provides some insights:

The strongest predictor of suicidal ideation is sexual orientation, specifically reporting some level of attraction to the same sex. After that, being in fair or poor health, moving three or more times in the last year, being 18 to 34 years old, being unemployed and looking for work, and not placing much value in one's religious beliefs are also positively associated with suicidal thoughts.

Some factors that are negatively associated with suicidal thoughts need further investigation (such as being widowed and having completed school only through the 9th grade).

Factors including being in excellent or very good health, being Asian, Hispanic or Black, having a personal income of $75k+, being married, placing importance in one's religious beliefs, and being 65 or older predict the absence of suicidal thoughts.

While not new to this analysis, it is worth noting that suicidal thoughts are a common experience, and any analysis of suicide needs to take this into account to properly address the phenomenon. Also note that this analysis focused on suicidal ideation, which is not the same as having a suicide plan or making an attempt.