# Heart Disease Risk Predictions

## Data Used: UCI Heart Disease Dataset
This directory contains 4 databases concerning heart disease diagnosis.
All attributes are numeric-valued. The data was collected from the
following four locations:

     1. Cleveland Clinic Foundation
     2. Hungarian Institute of Cardiology, Budapest
     3. V.A. Medical Center, Long Beach, CA
     4. University Hospital, Zurich, Switzerland

## Number of Instances

     1. Cleveland: 303
     2. Hungarian: 294
     3. Switzerland: 123
     4. Long Beach VA: 200


# Attribute Information:
      1. age:age in years       
      2. sex:(1 = male; 0 = female)       
      3. cp:chest pain type
          -- Value 1: typical angina
          -- Value 2: atypical angina
          -- Value 3: non-anginal pain
          -- Value 4: asymptomatic
      4. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
      5. chol:serum cholesterol in mg/dl
      6. fbs:(fasting blood sugar > 120 mg/dl)  (1 = true; 0 = false)    
      7. restecg:
          -- Value 0: normal
          -- Value 1: having ST-T wave abnormality 
          -- Value 2: showing probable or definite left ventricular hypertrophy
      8. thalach:maximum heart rate achieved
      9. exang:exercise induced angina (1 = yes; 0 = no)     
      10. oldpeak:ST depression induced by exercise relative to rest   
      11. slope:the slope of the peak exercise ST segment
        -- Value 1: upsloping
        -- Value 2: flat
        -- Value 3: downsloping     
      12. ca: number of major vessels (0-3) colored by fluoroscopy
      13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
      14. category:diagnosis of heart disease[0-4]       (the predicted attribute)

# Relevant Information:
     This database contains 76 attributes, but all published experiments
     refer to using a subset of 14 of them.  In particular, the Cleveland
     database is the only one that has been used by ML researchers to 
     this date.  The "goal" field (listed as `category` above) refers to the
     presence of heart disease in the patient.  It is integer valued from 0 (no presence) to 4.
     Experiments with the Cleveland database have concentrated on simply
     attempting to distinguish presence (values 1,2,3,4) from absence (value
     0).  


## Class Distribution:
          Database:    0   1   2   3   4 Total
          Cleveland: 164  55  36  35  13   303
          Hungarian: 188  37  26  28  15   294
        Switzerland:   8  48  32  30   5   123
      Long Beach VA:  51  56  41  42  10   200

In [283]:
import pandas
import numpy
import matplotlib.pyplot as plt

In [284]:
df=pandas.read_csv('Preprocessed/data_combined.csv')
print df[:5]

   AGE  SEX  CP THRESTBPS CHOL FBS RESTECG THALACH EXANG OLDPEAK SLOPE CA  \
0   63    1   1       145  233   1       2     150     0     2.3     3  0   
1   67    1   4       160  286   0       2     108     1     1.5     2  3   
2   67    1   4       120  229   0       2     129     1     2.6     2  2   
3   37    1   3       130  250   0       0     187     0     3.5     3  0   
4   41    0   2       130  204   0       2     172     0     1.4     1  0   

  THAL  CATEGORY  
0    6         0  
1    3         2  
2    7         1  
3    3         0  
4    3         0  


In [285]:
print df.dtypes

AGE           int64
SEX           int64
CP            int64
THRESTBPS    object
CHOL         object
FBS          object
RESTECG      object
THALACH      object
EXANG        object
OLDPEAK      object
SLOPE        object
CA           object
THAL         object
CATEGORY      int64
dtype: object


In [286]:
print df['CATEGORY'].value_counts()

0    411
1    265
2    109
3    107
4     28
Name: CATEGORY, dtype: int64


## Missing Attribute Values (WEKA tool)
1. THRESTBPS (6%)
2. RESTECG (2 values)
3. CHOL (3%)
4. FBS (10%)
5. THALACH (6%)
6. EXANG (6%)
7. OLDPEAK (7%)
8. SLOPE (34%)
9. CA (66%)
10. THAL (53%)

## Replacing missing values for THRESTBPS

In [287]:
print df['THRESTBPS'].value_counts().head()

120    131
130    115
140    102
110     59
?       59
Name: THRESTBPS, dtype: int64


In [288]:
#resting blood pressure is generally in the range 120-140; 120 is also the most frequent value, so use it for '?'
df['THRESTBPS'] = df['THRESTBPS'].replace(['?'],'120')
df['THRESTBPS'] = df['THRESTBPS'].astype('int64')
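
The replacement above hard-codes 120, which is also the most frequent value in the column. A minimal sketch of the same idea as a reusable helper, assuming the missing marker is the string `?`. The name `fill_with_mode` and the demo series are hypothetical and not part of the notebook:

```python
import pandas as pd

def fill_with_mode(series, missing="?"):
    """Replace the `missing` marker with the most frequent observed value."""
    observed = series[series != missing]
    mode_value = observed.mode().iloc[0]  # first mode if there are ties
    return series.replace(missing, mode_value)

# Hypothetical demo data standing in for df['THRESTBPS']
demo = pd.Series(["120", "130", "?", "120", "140", "?"])
filled = fill_with_mode(demo).astype("int64")
```

Computing the mode from the data avoids a hard-coded constant if the input file changes.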

## Replacing missing values for FBS

In [289]:
#print df.columns
print df['FBS'].value_counts()
print "male:\n",df[df['SEX']==1]['FBS'].value_counts()
print "Female:\n",df[df['SEX']==0]['FBS'].value_counts()#directly replace with 0

0    692
1    138
?     90
Name: FBS, dtype: int64
male:
0    528
1    119
?     79
Name: FBS, dtype: int64
Female:
0    164
1     19
?     11
Name: FBS, dtype: int64


In [290]:
#randomly fill the missing values: 0 with probability 0.8 and 1 with probability 0.2
v=df.FBS.values=='?'
df.loc[v, 'FBS'] = numpy.random.choice(('0','1'), v.sum(), p=(0.8,0.2))
print df['FBS'].value_counts()
df['FBS']=df['FBS'].astype('int64')

0    765
1    155
Name: FBS, dtype: int64
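
The 0.8/0.2 split used above approximates the observed ratio of 692 zeros to 138 ones (roughly 0.83/0.17). A minimal sketch that derives the sampling probabilities from the observed values instead of typing them in, assuming `?` marks a missing entry. The helper name `fill_by_observed_frequency`, the `seed` parameter, and the demo series are hypothetical:

```python
import numpy as np
import pandas as pd

def fill_by_observed_frequency(series, missing="?", seed=0):
    """Fill missing entries by sampling from the observed value frequencies."""
    rng = np.random.RandomState(seed)  # fixed seed so the fill is repeatable
    mask = (series == missing).values
    freqs = series[~mask].value_counts(normalize=True)
    out = series.copy()
    out[mask] = rng.choice(freqs.index.values, size=mask.sum(), p=freqs.values)
    return out

# Hypothetical demo data standing in for df['FBS']
demo = pd.Series(["0"] * 8 + ["1"] * 2 + ["?"] * 5)
filled = fill_by_observed_frequency(demo)
```

The same helper would apply to the other categorical columns that are filled by random sampling, and the seed makes reruns give the same counts.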


## Replacing missing values in CHOL

In [291]:
df['CHOL'].value_counts().head()
#apart from 0, the values are spread out with no dominant value,
#so replace '?' with the mean of the known values (the 172 zeros are left unchanged)

0      172
?       30
254     10
220     10
216      9
Name: CHOL, dtype: int64

In [292]:
df['CHOL']=df['CHOL'].replace('?','-69')#temporarily replacing ? with -69
df['CHOL']=df['CHOL'].astype('int64')
k=int(df[df['CHOL']!=-69]['CHOL'].mean())
df['CHOL']=df['CHOL'].replace(-69,k)


print df['CHOL'].unique() #completed !--!

[233 286 229 250 204 236 268 354 254 203 192 294 256 263 199 168 239 275
 266 211 283 284 224 206 219 340 226 247 167 230 335 234 177 276 353 243
 225 302 212 330 175 417 197 198 290 253 172 273 213 305 216 304 188 282
 185 232 326 231 269 267 248 360 258 308 245 270 208 264 321 274 325 235
 257 164 141 252 255 201 222 260 182 303 265 309 307 249 186 341 183 407
 217 288 220 209 227 261 174 281 221 205 240 289 318 298 564 246 322 299
 300 293 277 214 207 223 160 394 184 315 409 244 195 196 126 313 259 200
 262 215 228 193 271 210 327 149 295 306 178 237 218 242 319 166 180 311
 278 342 169 187 157 176 241 131 132 161 173 194 297 292 339 147 291 358
 412 238 163 280 202 328 129 190 179 272 100 468 320 312 171 365 344  85
 347 251 287 156 117 466 338 529 392 329 355 603 404 518 285 279 388 336
 491 331 393   0 153 316 458 384 349 142 181 310 170 369 165 337 333 139
 385]
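
The `-69` sentinel works, but the same mean fill can be written without a placeholder value. A minimal sketch, assuming `?` is the only non-numeric entry in the column. The name `fill_with_mean` and the demo series are hypothetical, and the mean is truncated with `int()` as in the cell above:

```python
import pandas as pd

def fill_with_mean(series):
    """Convert to numbers, turning non-numeric markers into NaN, then fill with the mean."""
    numeric = pd.to_numeric(series, errors="coerce")  # '?' becomes NaN
    mean_value = int(numeric.mean())                  # NaN values are ignored by mean()
    return numeric.fillna(mean_value).astype("int64")

# Hypothetical demo data standing in for df['CHOL']
demo = pd.Series(["200", "?", "300", "250"])
filled = fill_with_mean(demo)
```

Like the original cell, this treats zeros as real readings when computing the mean.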


## Replacing missing values in RESTECG

In [293]:
print df['RESTECG'].value_counts()

#replacing with the most frequently occurring value for the attribute
df['RESTECG']=df['RESTECG'].replace('?','0')
#print df['RESTECG'].unique()
#print df['RESTECG'].value_counts()
df['RESTECG'] = df['RESTECG'].astype('int64')



print "after replacing\n",df['RESTECG'].value_counts()

0    551
2    188
1    179
?      2
Name: RESTECG, dtype: int64
after replacing
0    553
2    188
1    179
Name: RESTECG, dtype: int64


## Replacing missing values in THALACH

In [294]:
df['THALACH'].value_counts().head()

?      55
150    43
140    41
120    35
130    30
Name: THALACH, dtype: int64

In [295]:
df['THALACH']=df['THALACH'].replace('?','-69')#temporarily replacing ? with -69
df['THALACH']=df['THALACH'].astype('int64')
k=int(df[df['THALACH']!=-69]['THALACH'].mean())
print k
df['THALACH']=df['THALACH'].replace(-69,k)

137


In [296]:
df['THALACH'].value_counts().head()

137    60
150    43
140    41
120    35
130    30
Name: THALACH, dtype: int64

## Replacing missing values in EXANG

In [297]:
#exang:exercise induced angina (1 = yes; 0 = no) 
print df['EXANG'].value_counts()

0    528
1    337
?     55
Name: EXANG, dtype: int64


In [298]:
k=528.0/(337.0+528.0)
print k

0.610404624277


In [299]:
v=df.EXANG.values=='?'
df.loc[v,'EXANG'] = numpy.random.choice(('0','1'), v.sum(), p=(0.61,0.39))
print df['EXANG'].value_counts()
df['EXANG']=df["EXANG"].astype('int64')

0    560
1    360
Name: EXANG, dtype: int64


## Replacing missing values in OLDPEAK

In [300]:
print df['OLDPEAK'].value_counts().head()

0      370
1       83
2       76
?       62
1.5     48
Name: OLDPEAK, dtype: int64


In [301]:
df['OLDPEAK']=df['OLDPEAK'].replace('?','-69')#temporarily replacing ? with -69
df['OLDPEAK']=df['OLDPEAK'].astype('float64')
k=df[df['OLDPEAK']!=-69]['OLDPEAK'].mean()
print k
df['OLDPEAK']=df['OLDPEAK'].replace(-69,numpy.round(k,1))

0.878787878788


In [302]:
print df['OLDPEAK'].value_counts().head()

0.0    370
1.0     83
2.0     76
0.9     66
1.5     48
Name: OLDPEAK, dtype: int64


## Replacing missing values in SLOPE

In [303]:
print df['SLOPE'].value_counts()

2    345
?    309
1    203
3     63
Name: SLOPE, dtype: int64


In [304]:
#k=203.0/(345.0+203.0+63.0)
#print k

In [305]:
v=df.SLOPE.values=='?'
df.loc[v,'SLOPE'] = numpy.random.choice(('2','1','3'), v.sum(), p=(0.6,0.30,0.10))
print df['SLOPE'].value_counts()
df['SLOPE']=df['SLOPE'].astype('int64')

2    520
1    300
3    100
Name: SLOPE, dtype: int64


## Replacing missing values in CA

In [306]:
print df["CA"].value_counts()
k=(41.0)/(181+67+41+20)
print k

?    611
0    181
1     67
2     41
3     20
Name: CA, dtype: int64
0.132686084142


In [307]:
v=df.CA.values=='?'
df.loc[v,'CA'] = numpy.random.choice(('0','1','2','3'), v.sum(), p=(0.60,0.20,0.13,0.07))
df['CA']=df['CA'].astype('int64')
print df['CA'].value_counts()

0    557
1    200
2    108
3     55
Name: CA, dtype: int64


## Replacing missing values in THAL

In [308]:
print df['THAL'].value_counts()
#random sampling is not used directly here; THAL is first compared against SEX and CATEGORY

?    486
3    196
7    192
6     46
Name: THAL, dtype: int64


In [309]:
print df[df['THAL']=='3']['SEX'].value_counts()
print df[df['THAL']=='7']['SEX'].value_counts()

1    110
0     86
Name: SEX, dtype: int64
1    171
0     21
Name: SEX, dtype: int64


In [310]:
print "THAL:3=====>\n",df[df['THAL']=='3']['CATEGORY'].value_counts()
print "THAL:7=====>\n",df[df['THAL']=='7']['CATEGORY'].value_counts()
print "THAL:6=====>\n",df[df['THAL']=='6']['CATEGORY'].value_counts()

THAL:3=====>
0    138
1     30
2     14
3     12
4      2
Name: CATEGORY, dtype: int64
THAL:7=====>
1    63
3    43
0    38
2    37
4    11
Name: CATEGORY, dtype: int64
THAL:6=====>
1    13
2    12
0    11
3     7
4     3
Name: CATEGORY, dtype: int64


In [311]:
#mark missing THAL values with a sentinel, then fill them from the label:
#THAL=3 dominates among CATEGORY==0 and THAL=7 among CATEGORY>=1 (see the counts above)
#note: this uses the target column, so the filled values carry label information
df['THAL']=df['THAL'].replace('?',-1)
df.loc[(df['THAL']==-1)&(df['CATEGORY']!=0),'THAL']='7'
df.loc[(df['THAL']==-1)&(df['CATEGORY']==0),'THAL']='3'
print df['THAL'].value_counts()
df['THAL']=df['THAL'].astype('int64')

7    454
3    420
6     46
Name: THAL, dtype: int64


In [312]:
print df.dtypes

AGE            int64
SEX            int64
CP             int64
THRESTBPS      int64
CHOL           int64
FBS            int64
RESTECG        int64
THALACH        int64
EXANG          int64
OLDPEAK      float64
SLOPE          int64
CA             int64
THAL           int64
CATEGORY       int64
dtype: object


In [313]:
dummies = pandas.get_dummies(df["CP"],prefix="CP")
df = df.join(dummies)

dummies = pandas.get_dummies(df["RESTECG"],prefix="RESTECG")
df      = df.join(dummies)

dummies = pandas.get_dummies(df["SLOPE"],prefix="SLOPE")
df      = df.join(dummies)

dummies = pandas.get_dummies(df["THAL"],prefix="THAL")
df      = df.join(dummies)

#dummies = pandas.get_dummies(df["EXANG"],prefix="EXANG")
#df = df.join(dummies)

#del df['SEX']
del df['CP']
del df['RESTECG']
del df['SLOPE']
del df['THAL']
#del df['EXANG']

In [314]:
print df.dtypes

AGE            int64
SEX            int64
THRESTBPS      int64
CHOL           int64
FBS            int64
THALACH        int64
EXANG          int64
OLDPEAK      float64
CA             int64
CATEGORY       int64
CP_1           uint8
CP_2           uint8
CP_3           uint8
CP_4           uint8
RESTECG_0      uint8
RESTECG_1      uint8
RESTECG_2      uint8
SLOPE_1        uint8
SLOPE_2        uint8
SLOPE_3        uint8
THAL_3         uint8
THAL_6         uint8
THAL_7         uint8
dtype: object
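
The join-then-delete steps above can also be expressed in one call, since `pandas.get_dummies` accepts a `columns` argument and drops the encoded columns itself. A minimal sketch, with a hypothetical helper name `one_hot` and a small made-up frame in place of `df`:

```python
import pandas as pd

def one_hot(frame, columns):
    """One-hot encode `columns`, drop the originals, and store dummies as int64."""
    encoded = pd.get_dummies(frame, columns=columns)
    new_cols = [c for c in encoded.columns if c not in frame.columns]
    encoded[new_cols] = encoded[new_cols].astype("int64")
    return encoded

# Hypothetical demo data with two categorical columns
demo = pd.DataFrame({"AGE": [63, 67, 41], "CP": [1, 4, 2], "THAL": [6, 3, 3]})
encoded = one_hot(demo, ["CP", "THAL"])
```

The cast to `int64` inside the helper plays the same role as the `uint8` conversion loop in the next cell.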


In [315]:
for g in df.columns:
    if df[g].dtype=='uint8':
        df[g]=df[g].astype('int64')

In [316]:
#collapse diagnosis values 1-4 into a single "disease present" class (binary target)
df.loc[df['CATEGORY']>0,'CATEGORY']=1

In [317]:
stdcols = ["AGE","THRESTBPS","CHOL","THALACH","OLDPEAK"]
nrmcols = ["CA"]
stddf   = df.copy()
stddf[stdcols] = stddf[stdcols].apply(lambda x: (x-x.mean())/x.std())
stddf[nrmcols] = stddf[nrmcols].apply(lambda x: (x-x.mean())/(x.max()-x.min()))
#stddf[stdcols] = stddf[stdcols].apply(lambda x: (x-x.mean())/(x.max()-x.min()))


for g in stdcols:
    print g,max(stddf[g]),min(stddf[g])
    
for g in nrmcols:
    print g,max(stddf[g]),min(stddf[g])    

AGE 2.49229867231 -2.70681396757
THRESTBPS 3.67440591791 -7.03102349108
CHOL 3.70670590542 -1.82755513195
THALACH 2.56523324659 -3.08339929343
OLDPEAK 5.04825042025 -3.30258023693
CA 0.789492753623 -0.210507246377
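
Two different scalings are applied above: a z-score for the continuous columns and a mean normalization (divide by the range) for `CA`. A minimal sketch of both as named functions, using a small made-up frame. The names `zscore` and `mean_normalize` are hypothetical:

```python
import pandas as pd

def zscore(col):
    """Center on the mean and divide by the sample standard deviation."""
    return (col - col.mean()) / col.std()

def mean_normalize(col):
    """Center on the mean and divide by the range, so max - min becomes 1."""
    return (col - col.mean()) / (col.max() - col.min())

# Hypothetical demo data
demo = pd.DataFrame({"AGE": [40.0, 50.0, 60.0], "CA": [0.0, 1.0, 3.0]})
scaled = demo.copy()
scaled["AGE"] = zscore(demo["AGE"])
scaled["CA"] = mean_normalize(demo["CA"])
```

After mean normalization the difference between the largest and smallest value is exactly 1, which matches the `CA` extremes printed above (0.789 and -0.211).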


In [318]:
print stddf.dtypes

AGE          float64
SEX            int64
THRESTBPS    float64
CHOL         float64
FBS            int64
THALACH      float64
EXANG          int64
OLDPEAK      float64
CA           float64
CATEGORY       int64
CP_1           int64
CP_2           int64
CP_3           int64
CP_4           int64
RESTECG_0      int64
RESTECG_1      int64
RESTECG_2      int64
SLOPE_1        int64
SLOPE_2        int64
SLOPE_3        int64
THAL_3         int64
THAL_6         int64
THAL_7         int64
dtype: object


In [319]:
from sklearn.model_selection import train_test_split


In [320]:
df_copy=stddf.copy()
df_copy=df_copy.drop(['CATEGORY'],axis=1)

dat=df_copy.values
#print dat.shape

print type(dat),dat.shape

<type 'numpy.ndarray'> (920, 22)


In [321]:
labels=df['CATEGORY'].values
print labels[:5],type(labels)

[0 1 1 0 0] <type 'numpy.ndarray'>


In [322]:
print df['CATEGORY'].value_counts()

1    509
0    411
Name: CATEGORY, dtype: int64


In [323]:
x_train,x_test,y_train,y_test=train_test_split(dat,labels, test_size=0.25, random_state=42)

In [324]:
print "x_train:",x_train.shape
print "y_train:",y_train.shape
print
print "x_test:",x_test.shape
print "y_test:",y_test.shape

x_train: (690, 22)
y_train: (690,)

x_test: (230, 22)
y_test: (230,)


In [325]:
#training and testing
#SVM
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=5)
clf.fit(x_train,y_train)
print "SVM:",clf.score(x_test,y_test)*100,"%"
svmpred=clf.predict(x_test)
#print svmpred




#from sklearn.model_selection import cross_val_score
#scores = cross_val_score(clf,dat,labels, cv=5)
#print scores


from sklearn import linear_model
lrcv=linear_model.LogisticRegressionCV(fit_intercept=True,penalty='l2',dual=False)
lrcv.fit(x_train,y_train)
print "Logistic Regression:",lrcv.score(x_test,y_test)*100,"%"


SVM: 89.1304347826 %
Logistic Regression: 88.2608695652 %


In [326]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

clf = ExtraTreesClassifier()
clf = clf.fit(dat,labels)
g=clf.feature_importances_
c=stddf.drop(['CATEGORY'],axis=1).columns

print "Importance of various features"
for k in range(len(c)):
    print c[k],g[k]
    
    
model = SelectFromModel(clf, prefit=True)
X_new = model.transform(dat)
print X_new.shape


tx_train,tx_test,ty_train,ty_test=train_test_split(X_new,labels, test_size=0.25, random_state=42)


tclf = svm.SVC(gamma=0.001, C=5)
tclf.fit(tx_train,ty_train)
print "after feature sel SVM:",tclf.score(tx_test,ty_test)*100,"%"
tsvmpred=tclf.predict(tx_test)
#print tsvmpred


lrcv=linear_model.LogisticRegressionCV(fit_intercept=True,penalty='l2',dual=False)
lrcv.fit(tx_train,ty_train)
print "Logistic Regression:",lrcv.score(tx_test,ty_test)*100,"%"


Importance of various features
AGE 0.0430948093414
SEX 0.0335623949641
THRESTBPS 0.0399753743733
CHOL 0.065922258698
FBS 0.00864986982423
THALACH 0.0428641967858
EXANG 0.0511826428599
OLDPEAK 0.0347466261144
CA 0.0356597495659
CP_1 0.00700858796546
CP_2 0.0421112202428
CP_3 0.0105986298357
CP_4 0.0489146696255
RESTECG_0 0.011312980557
RESTECG_1 0.0140054280326
RESTECG_2 0.0131781616231
SLOPE_1 0.011029097367
SLOPE_2 0.0124039872418
SLOPE_3 0.00490804312616
THAL_3 0.317612451723
THAL_6 0.00548549291523
THAL_7 0.145773327217
(920, 5)
after feature sel SVM: 89.5652173913 %
Logistic Regression: 90.0 %
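
`SelectFromModel` was called without a `threshold`, and for a tree ensemble its default is the mean of the feature importances. The sketch below applies that rule by hand to the importances printed above, to show why five columns survive. The values are copied from the output and rounded to six decimals, and the variable names are hypothetical:

```python
import numpy as np

names = ["AGE", "SEX", "THRESTBPS", "CHOL", "FBS", "THALACH", "EXANG", "OLDPEAK",
         "CA", "CP_1", "CP_2", "CP_3", "CP_4", "RESTECG_0", "RESTECG_1",
         "RESTECG_2", "SLOPE_1", "SLOPE_2", "SLOPE_3", "THAL_3", "THAL_6", "THAL_7"]
importances = np.array([
    0.043095, 0.033562, 0.039975, 0.065922, 0.008650, 0.042864, 0.051183,
    0.034747, 0.035660, 0.007009, 0.042111, 0.010599, 0.048915, 0.011313,
    0.014005, 0.013178, 0.011029, 0.012404, 0.004908, 0.317612, 0.005485,
    0.145773])

threshold = importances.mean()  # about 1/22, since the importances sum to about 1
selected = [n for n, w in zip(names, importances) if w > threshold]
```

Because `ExtraTreesClassifier()` is created without a `random_state`, the importances and therefore the selected columns can differ between runs.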


In [327]:
import keras 
import tensorflow
%matplotlib inline
from sklearn import metrics
import matplotlib.pyplot as plt # side-stepping mpl backend
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers.core import Dropout, Flatten, Activation, Dense
import math

In [328]:
def make_model(activ,opti,ip,layers,trainx,trainy,testx,testy):
    #two hidden Dense layers plus a single output unit, all using the activation `activ`
    #(with 'relu' the output is therefore not bounded to [0,1])
    model = Sequential()
    model.add(Dense(layers[0], input_dim=ip, kernel_initializer='uniform', activation=activ))
    model.add(Dense(layers[1], kernel_initializer='uniform', activation=activ))
    model.add(Dense(1, kernel_initializer='uniform', activation=activ))
    model.compile(loss='mse', optimizer=opti, metrics=['accuracy'])
    model.fit(trainx,trainy,epochs=600,batch_size=512,verbose=2,validation_data=(testx,testy))

    #evaluate returns [loss, accuracy]; the score printed here is 100*(1-MSE), not the accuracy
    trainScore = model.evaluate(trainx,trainy, verbose=0)
    print "Train Score: ",100-trainScore[0]*100
    testScore = model.evaluate(testx,testy, verbose=0)
    print "Test Score: ",100-testScore[0]*100

    return model


In [329]:
#without k best features,sigmoid and rmsprop
m1=make_model('sigmoid','rmsprop',x_train.shape[1],[x_train.shape[1],16],x_train,y_train,x_test,y_test)


Train on 690 samples, validate on 230 samples
Epoch 1/600
0s - loss: 0.2512 - acc: 0.4565 - val_loss: 0.2501 - val_acc: 0.4174
Epoch 2/600
0s - loss: 0.2500 - acc: 0.4681 - val_loss: 0.2492 - val_acc: 0.5826
Epoch 3/600
0s - loss: 0.2495 - acc: 0.5435 - val_loss: 0.2483 - val_acc: 0.5826
Epoch 4/600
0s - loss: 0.2491 - acc: 0.5435 - val_loss: 0.2476 - val_acc: 0.5826
Epoch 5/600
0s - loss: 0.2488 - acc: 0.5435 - val_loss: 0.2473 - val_acc: 0.5826
Epoch 6/600
0s - loss: 0.2487 - acc: 0.5435 - val_loss: 0.2469 - val_acc: 0.5826
Epoch 7/600
0s - loss: 0.2485 - acc: 0.5435 - val_loss: 0.2464 - val_acc: 0.5826
Epoch 8/600
0s - loss: 0.2483 - acc: 0.5435 - val_loss: 0.2462 - val_acc: 0.5826
Epoch 9/600
0s - loss: 0.2482 - acc: 0.5435 - val_loss: 0.2457 - val_acc: 0.5826
Epoch 10/600
0s - loss: 0.2481 - acc: 0.5435 - val_loss: 0.2455 - val_acc: 0.5826
Epoch 11/600
0s - loss: 0.2481 - acc: 0.5435 - val_loss: 0.2455 - val_acc: 0.5826
Epoch 12/600
0s - loss: 0.2480 - acc: 0.5435 - val_loss: 0.24

In [330]:
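pr=m1.predict(x_test)
k=0 #number of misclassified test samples
for u in range(len(x_test)):
    #use the network output only when it is confident; otherwise fall back to the SVM prediction
    if round(pr[u][0],1)>=0.3 and round(pr[u][0],1)<=0.8:
        g=svmpred[u]
    else:
        g=round(pr[u][0],0)
    if g!=y_test[u]:
        print "expected",y_test[u],"predicted:",pr[u][0]," ",svmpred[u]
        k=k+1
print "error",k*100/len(y_test) #percentage of misclassified test samples (integer division)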

expected 1 predicted: 0.127955   0
expected 1 predicted: 0.22189   1
expected 0 predicted: 0.718349   1
expected 0 predicted: 0.966084   1
expected 0 predicted: 0.880699   1
expected 1 predicted: 0.126082   1
expected 1 predicted: 0.0665667   0
expected 1 predicted: 0.220216   0
expected 1 predicted: 0.162432   0
expected 0 predicted: 0.623086   1
expected 1 predicted: 0.150062   0
expected 1 predicted: 0.653981   0
expected 0 predicted: 0.845451   1
expected 0 predicted: 0.887515   1
expected 1 predicted: 0.11685   0
expected 1 predicted: 0.137596   0
expected 1 predicted: 0.525477   0
expected 0 predicted: 0.937237   1
expected 1 predicted: 0.490873   0
expected 1 predicted: 0.235449   0
expected 1 predicted: 0.562639   0
expected 0 predicted: 0.930634   1
expected 0 predicted: 0.687741   1
expected 0 predicted: 0.92892   1
expected 1 predicted: 0.0514194   0
error 10
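
The loop above combines two models: the network's output is used when it is close to 0 or 1, and the SVM prediction is used when the output falls in an uncertain middle band. A minimal vectorized sketch of that rule, assuming NumPy arrays of network outputs and SVM labels. The function name `combine_predictions`, its parameter names, and the demo arrays are hypothetical:

```python
import numpy as np

def combine_predictions(nn_prob, svm_pred, low=0.3, high=0.8):
    """Use the SVM label when the rounded network output lies in [low, high]."""
    nn_prob = np.asarray(nn_prob, dtype=float)
    svm_pred = np.asarray(svm_pred)
    rounded = np.round(nn_prob, 1)                    # same one-decimal rounding as the loop
    uncertain = (rounded >= low) & (rounded <= high)  # band where the network is not trusted
    return np.where(uncertain, svm_pred, np.round(nn_prob)).astype(int)

# Hypothetical demo values: two confident outputs and two uncertain ones
nn_prob = [0.127955, 0.718349, 0.966084, 0.5]
svm_pred = [0, 1, 1, 0]
combined = combine_predictions(nn_prob, svm_pred)
error_rate = 100.0 * np.mean(combined != np.array([1, 0, 0, 0]))
```

Keeping the band limits as parameters makes it easier to compare the different thresholds used in the later cells (0.2 and 0.75).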


In [331]:
print tx_train.shape

(690, 5)


In [332]:
m2=make_model('sigmoid','rmsprop',tx_train.shape[1],[tx_train.shape[1]*10,16],tx_train,ty_train,tx_test,ty_test)


Train on 690 samples, validate on 230 samples
Epoch 1/600
0s - loss: 0.2494 - acc: 0.5435 - val_loss: 0.2470 - val_acc: 0.5826
Epoch 2/600
0s - loss: 0.2486 - acc: 0.5435 - val_loss: 0.2459 - val_acc: 0.5826
Epoch 3/600
0s - loss: 0.2482 - acc: 0.5435 - val_loss: 0.2456 - val_acc: 0.5826
Epoch 4/600
0s - loss: 0.2482 - acc: 0.5435 - val_loss: 0.2456 - val_acc: 0.5826
Epoch 5/600
0s - loss: 0.2482 - acc: 0.5435 - val_loss: 0.2456 - val_acc: 0.5826
Epoch 6/600
0s - loss: 0.2481 - acc: 0.5435 - val_loss: 0.2453 - val_acc: 0.5826
Epoch 7/600
0s - loss: 0.2480 - acc: 0.5435 - val_loss: 0.2450 - val_acc: 0.5826
Epoch 8/600
0s - loss: 0.2480 - acc: 0.5435 - val_loss: 0.2447 - val_acc: 0.5826
Epoch 9/600
0s - loss: 0.2479 - acc: 0.5435 - val_loss: 0.2446 - val_acc: 0.5826
Epoch 10/600
0s - loss: 0.2479 - acc: 0.5435 - val_loss: 0.2443 - val_acc: 0.5826
Epoch 11/600
0s - loss: 0.2480 - acc: 0.5435 - val_loss: 0.2446 - val_acc: 0.5826
Epoch 12/600
0s - loss: 0.2478 - acc: 0.5435 - val_loss: 0.24

In [333]:
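pr=m2.predict(tx_test)
k=0 #number of misclassified test samples
for u in range(len(tx_test)):
    #use the network output only when it is confident; otherwise fall back to the SVM prediction
    if round(pr[u][0],1)>=0.2 and round(pr[u][0],1)<=0.75:
        g=tsvmpred[u]
    else:
        g=round(pr[u][0],0)
    if g!=ty_test[u]:
        print "expected",ty_test[u],"predicted:",pr[u][0]," ",tsvmpred[u]
        k=k+1
print "error",k*100/len(ty_test) #percentage of misclassified test samples (integer division)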

expected 1 predicted: 0.450232   0
expected 1 predicted: 0.309441   0
expected 0 predicted: 0.644813   1
expected 0 predicted: 0.892456   1
expected 0 predicted: 0.967451   1
expected 1 predicted: 0.128892   0
expected 1 predicted: 0.367067   0
expected 1 predicted: 0.440748   0
expected 0 predicted: 0.409086   1
expected 1 predicted: 0.0799265   0
expected 1 predicted: 0.261614   0
expected 0 predicted: 0.758133   0
expected 0 predicted: 0.705016   1
expected 1 predicted: 0.450232   0
expected 1 predicted: 0.122519   0
expected 1 predicted: 0.137053   0
expected 0 predicted: 0.935495   1
expected 1 predicted: 0.323703   0
expected 0 predicted: 0.605108   1
expected 1 predicted: 0.243381   0
expected 0 predicted: 0.865637   1
expected 0 predicted: 0.941152   1
expected 0 predicted: 0.652864   1
expected 1 predicted: 0.0661002   0
error 10


In [334]:
#without k best features,relu and rmsprop
m3=make_model('relu','rmsprop',x_train.shape[1],[x_train.shape[1],16],x_train,y_train,x_test,y_test)


Train on 690 samples, validate on 230 samples
Epoch 1/600
0s - loss: 0.5423 - acc: 0.4565 - val_loss: 0.5724 - val_acc: 0.4174
Epoch 2/600
0s - loss: 0.5328 - acc: 0.4565 - val_loss: 0.5631 - val_acc: 0.4174
Epoch 3/600
0s - loss: 0.5241 - acc: 0.4565 - val_loss: 0.5534 - val_acc: 0.4174
Epoch 4/600
0s - loss: 0.5150 - acc: 0.4565 - val_loss: 0.5425 - val_acc: 0.4174
Epoch 5/600
0s - loss: 0.5046 - acc: 0.4565 - val_loss: 0.5297 - val_acc: 0.4174
Epoch 6/600
0s - loss: 0.4925 - acc: 0.4565 - val_loss: 0.5152 - val_acc: 0.4174
Epoch 7/600
0s - loss: 0.4786 - acc: 0.4565 - val_loss: 0.4985 - val_acc: 0.4174
Epoch 8/600
0s - loss: 0.4629 - acc: 0.4565 - val_loss: 0.4804 - val_acc: 0.4174
Epoch 9/600
0s - loss: 0.4457 - acc: 0.4565 - val_loss: 0.4602 - val_acc: 0.4174
Epoch 10/600
0s - loss: 0.4267 - acc: 0.4565 - val_loss: 0.4385 - val_acc: 0.4174
Epoch 11/600
0s - loss: 0.4063 - acc: 0.4565 - val_loss: 0.4154 - val_acc: 0.4174
Epoch 12/600
0s - loss: 0.3850 - acc: 0.4565 - val_loss: 0.39

In [335]:
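pr=m3.predict(x_test)
k=0 #number of misclassified test samples
for u in range(len(x_test)):
    #use the network output only when it is confident; otherwise fall back to the SVM prediction
    if round(pr[u][0],1)>=0.2 and round(pr[u][0],1)<=0.75:
        g=svmpred[u]
    else:
        g=round(pr[u][0],0)
    if g!=y_test[u]:
        print "expected",y_test[u],"predicted:",pr[u][0]," ",svmpred[u]
        k=k+1
print "error",k*100/len(y_test) #percentage of misclassified test samples (integer division)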

expected 1 predicted: 0.16935   0
expected 0 predicted: 0.756986   0
expected 1 predicted: 0.129864   1
expected 0 predicted: 0.766857   1
expected 0 predicted: 1.0738   1
expected 0 predicted: 0.90131   1
expected 1 predicted: 0.0   1
expected 1 predicted: 0.0   0
expected 1 predicted: 0.0   0
expected 1 predicted: 0.0   0
expected 0 predicted: 0.881396   1
expected 1 predicted: 0.301764   0
expected 0 predicted: 0.819366   0
expected 0 predicted: 0.823425   0
expected 0 predicted: 0.43728   1
expected 0 predicted: 0.653403   1
expected 1 predicted: 0.0   0
expected 1 predicted: 0.581686   0
expected 1 predicted: 0.117042   1
expected 1 predicted: 0.699658   0
expected 0 predicted: 0.920601   1
expected 1 predicted: 0.435114   0
expected 1 predicted: 0.537698   0
expected 1 predicted: 0.702074   0
expected 0 predicted: 0.703102   1
expected 0 predicted: 0.840468   0
expected 1 predicted: 0.0789016   1
expected 0 predicted: 0.951252   1
expected 0 predicted: 0.825774   1
expected 1 pre

# Try K folds on SVM and neural networks

In [336]:
#selecting the best  partition for testing and training....

from sklearn.model_selection import KFold
kf = KFold(n_splits=7)
kf.get_n_splits(dat)

clf1 = svm.SVC(gamma=0.001, C=100)
score=-9

count =1

sumy=0


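for train_index, test_index in kf.split(dat):
    fX_train, fX_test = dat[train_index],dat[test_index]
    fy_train, fy_test = labels[train_index],labels[test_index]
    clf1.fit(fX_train,fy_train)
    g=clf1.score(fX_test,fy_test)*100
    #note: folds 1 and 2 are skipped here, so only folds 3-7 are printed and averaged
    if count>2:
        print g
        sumy=sumy+g
    #track the best score over all seven folds
    if g>score:
        score=g
    count=count+1
count=count-3 #number of folds included in sumy (5)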

92.4242424242
94.6564885496
91.6030534351
90.8396946565
96.1832061069


In [337]:
print "cross validation score :",sumy/(count)

print "best score :",score

cross validation score : 93.1413370345
best score : 96.1832061069
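
The score above averages only the last five of the seven folds. A minimal sketch of a cross-validation estimate that averages every fold, using scikit-learn's `cross_val_score`. The synthetic data from `make_classification` stands in for `dat` and `labels`, and shuffling with a fixed `random_state` is an assumption of this sketch, not something the notebook does:

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for (dat, labels); the sizes here are arbitrary
X_demo, y_demo = make_classification(n_samples=210, n_features=10, random_state=0)

# Shuffle before splitting so each fold mixes rows from the whole data set
cv = KFold(n_splits=7, shuffle=True, random_state=42)
scores = cross_val_score(svm.SVC(gamma=0.001, C=100), X_demo, y_demo, cv=cv)
mean_accuracy = scores.mean() * 100
```

Shuffling may matter for the combined file if its rows are grouped by hospital, because unshuffled folds would then each be dominated by a single site.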


In [338]:
#selecting the best  partition for testing and training....

from sklearn.model_selection import KFold
kf = KFold(n_splits=7)
kf.get_n_splits(X_new)

clf = svm.SVC(gamma=0.001, C=100)
score=-9

count =1

sumy=0

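for train_index, test_index in kf.split(X_new):
    fX_train, fX_test = X_new[train_index],X_new[test_index]
    fy_train, fy_test = labels[train_index],labels[test_index]
    clf.fit(fX_train,fy_train)
    g=clf.score(fX_test,fy_test)*100
    #note: folds 1 and 2 are skipped here, so only folds 3-7 are printed and averaged
    if count>2:
        print g
        sumy=sumy+g
    #track the best score over all seven folds
    if g>score:
        score=g
    count=count+1
count=count-3 #number of folds included in sumy (5)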

92.4242424242
94.6564885496
91.6030534351
90.8396946565
96.1832061069


In [339]:
print "cross validation score best features:",sumy/(count)

print "best score best features:",score

cross validation score best features: 93.1413370345
best score best features: 96.1832061069


In [340]:
from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(y_test,clf1.predict(x_test)).ravel()
print "True Negative",tn
print "False Positive",fp
print "False Negative",fn
print "True Positive",tp


print "precision :",tp*100/(tp+fp)

True Negative 85
False Positive 11
False Negative 14
True Positive 120
precision : 91
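
Precision is only one of the figures that can be read off the four counts above. A minimal sketch that also derives recall, accuracy and F1 from the same confusion matrix. The helper name `binary_metrics` is hypothetical, and explicit float division is used so the result does not depend on the Python version:

```python
def binary_metrics(tn, fp, fn, tp):
    """Return precision, recall, accuracy and F1 from confusion-matrix counts."""
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    accuracy = (tp + tn) / float(tn + fp + fn + tp)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}

# Counts taken from the output above
m = binary_metrics(tn=85, fp=11, fn=14, tp=120)
```

For these counts the precision is 120/131, about 0.916, which the integer division above prints as 91. Recall is 120/134, about 0.896.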



#note: btrain_x, btrain_y, btest_x and btest_y are not defined anywhere in this notebook
m1=make_model('sigmoid','rmsprop',22,[220,22],btrain_x,btrain_y,btest_x,btest_y)


pr=m1.predict(x_test)
k=0 #number of misclassified test samples
for u in range(len(x_test)):
    if round(pr[u][0],1)>=0.2 and round(pr[u][0],1)<=0.75:
        g=svmpred[u]
    else:
        g=round(pr[u][0],0)
    if g!=y_test[u]:
        print "expected",y_test[u],"predicted:",pr[u][0]," ",svmpred[u]
        k=k+1
    #if round(pr[u][0],0)==y_test[u]:
    #    k=k+1
print "error",k*100/len(y_test) #k counts misclassifications, so this is the error percentage

In [341]:
from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier(n_estimators=120)

rf.fit(x_train,y_train)
print "Random Forest:",rf.score(x_test,y_test)*100,"%"


Random Forest: 88.2608695652 %


# The accuracy of the SVM improved, but the neural network showed no improvement

1. SVM:94.02%
2. Logistic:89%
3. Neural networks:92.5%