# Heart Disease Risk Predictions

## Data Used: UCI Heart Disease Dataset
The UCI repository provides 4 databases concerning heart disease diagnosis.
All attributes are numeric-valued. The data was collected from the
following four locations:

     1. Cleveland Clinic Foundation
     2. Hungarian Institute of Cardiology, Budapest
     3. V.A. Medical Center, Long Beach, CA
     4. University Hospital, Zurich, Switzerland

## Number of Instances: 
          Database:       # of instances:
          1. Cleveland: 303
          2. Hungarian: 294
          3. Switzerland: 123
          4. Long Beach VA: 200
      
      

## Attribute Information:
      1. age: age in years
      2. sex: (1 = male; 0 = female)
      3. cp: chest pain type
          -- Value 1: typical angina
          -- Value 2: atypical angina
          -- Value 3: non-anginal pain
          -- Value 4: asymptomatic
      4. trestbps: resting blood pressure (in mm Hg)
      5. chol: serum cholesterol in mg/dl
      6. fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
      7. restecg: resting electrocardiographic results
          -- Value 0: normal
          -- Value 1: having ST-T wave abnormality
          -- Value 2: showing probable or definite left ventricular hypertrophy
      8. thalach: maximum heart rate achieved
      9. exang: exercise induced angina (1 = yes; 0 = no)
      10. oldpeak: ST depression induced by exercise relative to rest
      11. slope: the slope of the peak exercise ST segment
          -- Value 1: upsloping
          -- Value 2: flat
          -- Value 3: downsloping
      12. ca: number of major vessels (0-3) colored by fluoroscopy
      13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
      14. category: diagnosis of heart disease [0-4] (the predicted attribute)


## Class Distribution:
        Database:      0   1   2   3   4 Total
          Cleveland: 164  55  36  35  13   303
          Hungarian: 188  37  26  28  15   294
        Switzerland:   8  48  32  30   5   123
      Long Beach VA:  51  56  41  42  10   200

In [2]:
import pandas
import numpy
import matplotlib.pyplot as plt

In [3]:
df=pandas.read_csv('Preprocessed/data_combined.csv')
print df[:15]

    AGE  SEX  CP THRESTBPS CHOL FBS RESTECG THALACH EXANG OLDPEAK SLOPE CA  \
0    63    1   1       145  233   1       2     150     0     2.3     3  0   
1    67    1   4       160  286   0       2     108     1     1.5     2  3   
2    67    1   4       120  229   0       2     129     1     2.6     2  2   
3    37    1   3       130  250   0       0     187     0     3.5     3  0   
4    41    0   2       130  204   0       2     172     0     1.4     1  0   
5    56    1   2       120  236   0       0     178     0     0.8     1  0   
6    62    0   4       140  268   0       2     160     0     3.6     3  2   
7    57    0   4       120  354   0       0     163     1     0.6     1  0   
8    63    1   4       130  254   0       2     147     0     1.4     2  1   
9    53    1   4       140  203   1       2     155     1     3.1     3  0   
10   57    1   4       140  192   0       0     148     0     0.4     2  0   
11   56    0   2       140  294   0       2     153     0     1.

In [4]:
print df.dtypes

AGE           int64
SEX           int64
CP            int64
THRESTBPS    object
CHOL         object
FBS          object
RESTECG      object
THALACH      object
EXANG        object
OLDPEAK      object
SLOPE        object
CA           object
THAL         object
CATEGORY      int64
dtype: object
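Most columns load as `object` because missing entries are stored as the string `?`. As a minimal sketch, assuming the same `?` marker, the `na_values` parameter of `read_csv` lets pandas parse such columns as numbers with `NaN` for the gaps. The inline CSV below is made up for illustration and is not the project's data file.

```python
import io
import pandas as pd

# Made-up three-row sample using the same '?' marker for missing values.
csv_text = """AGE,CHOL,FBS
63,233,1
67,?,0
41,204,?
"""

# na_values tells read_csv to treat '?' as missing, so columns parse as numbers.
sample = pd.read_csv(io.StringIO(csv_text), na_values="?")

print(sample.dtypes)
print(sample.isna().sum())
```

Columns that contain a gap come back as `float64`, since `NaN` is a floating-point value.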


In [5]:
print df['CATEGORY'].value_counts()

0    411
1    265
2    109
3    107
4     28
Name: CATEGORY, dtype: int64


## Missing Attribute Values (WEKA Tool)
1. THRESTBPS (6%)
2. RESTECG (2 values)
3. CHOL (3%)
4. FBS (10%)
5. THALACH (6%)
6. EXANG (6%)
7. OLDPEAK (7%)
8. SLOPE (34%)
9. CA (66%)
10. THAL (53%)
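The same percentages can be checked directly in pandas. The sketch below assumes missing entries are the literal string `?`, as in the combined CSV, and uses a made-up frame rather than the real data.

```python
import pandas as pd

# Made-up frame; '?' marks a missing entry, as in the combined CSV.
sample = pd.DataFrame({
    "CA":   ["0", "?", "?", "2"],
    "THAL": ["3", "?", "7", "6"],
    "AGE":  [63, 67, 41, 56],
})

# Comparing to '?' gives a boolean frame; the column mean is the missing share.
missing_pct = (sample == "?").mean() * 100
print(missing_pct.sort_values(ascending=False))
```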

## Replacing missing values for THRESTBPS

In [6]:
print df['THRESTBPS'].value_counts().head()

120    131
130    115
140    102
110     59
?       59
Name: THRESTBPS, dtype: int64


In [7]:
#resting blood pressure is generally in the range 120-140, so missing values are filled with 120
df['THRESTBPS'] = df['THRESTBPS'].replace(['?'],'120')
df['THRESTBPS'] = df['THRESTBPS'].astype('int64')
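The fill value 120 is hard-coded here. An alternative, sketched below on made-up readings, is to derive the fill value from the observed data, for example the median. This is a swapped-in technique for comparison, not what the notebook does.

```python
import pandas as pd

# Made-up readings with '?' for missing, stored as strings like the raw column.
bps = pd.Series(["145", "160", "?", "120", "130", "140", "?"])

# errors='coerce' turns anything non-numeric (here '?') into NaN.
numeric = pd.to_numeric(bps, errors="coerce")

# Fill gaps with the median of the observed readings, then store as integers.
median = numeric.median()
filled = numeric.fillna(median).round().astype("int64")
print(median, filled.tolist())
```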

## Replacing missing values for FBS

In [8]:
#print df.columns
print df['FBS'].value_counts()
print "male:\n",df[df['SEX']==1]['FBS'].value_counts()
print "Female:\n",df[df['SEX']==0]['FBS'].value_counts()#directly replace with 0

0    692
1    138
?     90
Name: FBS, dtype: int64
male:
0    528
1    119
?     79
Name: FBS, dtype: int64
Female:
0    164
1     19
?     11
Name: FBS, dtype: int64


In [9]:
#randomly fill missing values: 0 with probability 0.8, 1 with probability 0.2
v=df.FBS.values=='?'
df.loc[v, 'FBS'] = numpy.random.choice(('0','1'), v.sum(), p=(0.8,0.2))
print df['FBS'].value_counts()
df['FBS']=df['FBS'].astype('int64')

0    759
1    161
Name: FBS, dtype: int64
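The 80/20 split is close to the observed shares (692/830 ≈ 0.83 for `0`, 138/830 ≈ 0.17 for `1`). A small helper can read these shares from the column itself and use a seeded generator so that reruns give the same fill. The function name `fill_by_observed_frequency` and the sample column are hypothetical, introduced only for this sketch.

```python
import numpy as np
import pandas as pd

def fill_by_observed_frequency(series, missing="?", seed=0):
    """Replace `missing` entries by sampling from the observed value shares.

    Hypothetical helper, not part of the original notebook.
    """
    is_missing = series == missing
    # Relative frequency of each observed (non-missing) value.
    shares = series[~is_missing].value_counts(normalize=True)
    rng = np.random.default_rng(seed)  # seeded so reruns give the same fill
    draws = rng.choice(shares.index.to_numpy(), size=int(is_missing.sum()),
                       p=shares.to_numpy())
    filled = series.copy()
    filled[is_missing] = draws
    return filled

# Made-up column: 8 zeros, 2 ones, 5 missing.
fbs = pd.Series(["0"] * 8 + ["1"] * 2 + ["?"] * 5)
filled = fill_by_observed_frequency(fbs)
print(filled.value_counts())
```

The same helper would apply to any categorical column filled by random sampling, such as EXANG, SLOPE or CA.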


## Replacing missing values in CHOL

In [10]:
df['CHOL'].value_counts().head()
#observed values are spread evenly, with no dominant value...
#so missing entries are replaced with the mean of the column

0      172
?       30
254     10
220     10
216      9
Name: CHOL, dtype: int64

In [11]:
df['CHOL']=df['CHOL'].replace('?','-69')#temporarily replacing ? with -69
df['CHOL']=df['CHOL'].astype('int64')
k=int(df[df['CHOL']!=-69]['CHOL'].mean())
df['CHOL']=df['CHOL'].replace(-69,k)


print df['CHOL'].unique() #completed !--!

[233 286 229 250 204 236 268 354 254 203 192 294 256 263 199 168 239 275
 266 211 283 284 224 206 219 340 226 247 167 230 335 234 177 276 353 243
 225 302 212 330 175 417 197 198 290 253 172 273 213 305 216 304 188 282
 185 232 326 231 269 267 248 360 258 308 245 270 208 264 321 274 325 235
 257 164 141 252 255 201 222 260 182 303 265 309 307 249 186 341 183 407
 217 288 220 209 227 261 174 281 221 205 240 289 318 298 564 246 322 299
 300 293 277 214 207 223 160 394 184 315 409 244 195 196 126 313 259 200
 262 215 228 193 271 210 327 149 295 306 178 237 218 242 319 166 180 311
 278 342 169 187 157 176 241 131 132 161 173 194 297 292 339 147 291 358
 412 238 163 280 202 328 129 190 179 272 100 468 320 312 171 365 344  85
 347 251 287 156 117 466 338 529 392 329 355 603 404 518 285 279 388 336
 491 331 393   0 153 316 458 384 349 142 181 310 170 369 165 337 333 139
 385]
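The fill above uses the overall column mean. A per-class variant, where each gap receives the mean of its own CATEGORY, could look like the sketch below on made-up rows. Two caveats, both assumptions rather than facts from the notebook: the 172 zero readings may themselves be missing values encoded as 0, and filling from the label is only possible on labelled data.

```python
import pandas as pd

# Made-up sample; NaN stands for a '?' already converted to missing.
sample = pd.DataFrame({
    "CATEGORY": [0, 0, 0, 1, 1, 1],
    "CHOL":     [200.0, 220.0, float("nan"), 260.0, float("nan"), 300.0],
})

# Mean of the observed CHOL values within each CATEGORY, aligned row by row.
class_mean = sample.groupby("CATEGORY")["CHOL"].transform("mean")

# Fill each gap with the mean of its own class and store as integers.
sample["CHOL"] = sample["CHOL"].fillna(class_mean).round().astype("int64")
print(sample)
```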


## Replacing missing values in RESTECG

In [12]:
print df['RESTECG'].value_counts()

#replace missing entries with the most frequently occurring value of the attribute
df['RESTECG']=df['RESTECG'].replace('?','0')
#print df['RESTECG'].unique()
#print df['RESTECG'].value_counts()
df['RESTECG'] = df['RESTECG'].astype('int64')

0    551
2    188
1    179
?      2
Name: RESTECG, dtype: int64
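Rather than reading the most frequent value off the printed counts, it can be taken from the column with `mode()`. A minimal sketch on a made-up column:

```python
import pandas as pd

# Made-up column with '?' for missing.
restecg = pd.Series(["0", "2", "0", "?", "1", "0"])

# mode() returns the most frequent value(s); take the first among observed entries.
most_common = restecg[restecg != "?"].mode().iloc[0]
filled = restecg.replace("?", most_common).astype("int64")
print(most_common, filled.tolist())
```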


## Replacing missing values in THALACH

In [13]:
df['THALACH'].value_counts().head()

?      55
150    43
140    41
120    35
130    30
Name: THALACH, dtype: int64

In [14]:
df['THALACH']=df['THALACH'].replace('?','-69')#temporarily replacing ? with -69
df['THALACH']=df['THALACH'].astype('int64')
k=int(df[df['THALACH']!=-69]['THALACH'].mean())
print k
df['THALACH']=df['THALACH'].replace(-69,k)

137


In [15]:
df['THALACH'].value_counts().head()

137    60
150    43
140    41
120    35
130    30
Name: THALACH, dtype: int64

## Replacing missing values in EXANG

In [16]:
#exang:exercise induced angina (1 = yes; 0 = no) 
print df['EXANG'].value_counts()

0    528
1    337
?     55
Name: EXANG, dtype: int64


In [17]:
k=528.0/(337.0+528.0)
print k

0.610404624277


In [18]:
v=df.EXANG.values=='?'
df.loc[v,'EXANG'] = numpy.random.choice(('0','1'), v.sum(), p=(0.61,0.39))
print df['EXANG'].value_counts()
df['EXANG']=df["EXANG"].astype('int64')

0    560
1    360
Name: EXANG, dtype: int64


## Replacing missing values in OLDPEAK

In [19]:
print df['OLDPEAK'].value_counts().head()

0      370
1       83
2       76
?       62
1.5     48
Name: OLDPEAK, dtype: int64


In [20]:
df['OLDPEAK']=df['OLDPEAK'].replace('?','-69')#temporarily replacing ? with -69
df['OLDPEAK']=df['OLDPEAK'].astype('float64')
k=df[df['OLDPEAK']!=-69]['OLDPEAK'].mean()
print k
df['OLDPEAK']=df['OLDPEAK'].replace(-69,numpy.round(k,1))

0.878787878788


In [21]:
print df['OLDPEAK'].value_counts()

 0.0    370
 1.0     83
 2.0     76
 0.9     66
 1.5     48
 3.0     28
 0.5     19
 1.2     17
 2.5     16
 0.8     15
 1.4     15
 0.6     14
 1.6     14
 0.2     14
 1.8     12
 0.4     10
 0.1      9
 4.0      8
 2.6      7
 2.8      7
 2.2      5
 0.7      5
 1.9      5
 1.3      5
 0.3      5
 1.1      4
 2.4      4
 3.6      4
 3.4      3
 3.5      2
-1.0      2
-0.5      2
 3.2      2
 4.2      2
 2.1      2
 2.3      2
 1.7      2
 3.1      1
 2.9      1
 4.4      1
-0.1      1
-0.7      1
 3.8      1
 5.6      1
 5.0      1
 3.7      1
-1.5      1
-2.0      1
-0.9      1
-2.6      1
-1.1      1
-0.8      1
 6.2      1
Name: OLDPEAK, dtype: int64


## Replacing missing values in SLOPE

In [22]:
print df['SLOPE'].value_counts()

2    345
?    309
1    203
3     63
Name: SLOPE, dtype: int64


In [23]:
#k=203.0/(345.0+203.0+63.0)
#print k

In [24]:
v=df.SLOPE.values=='?'
df.loc[v,'SLOPE'] = numpy.random.choice(('2','1','3'), v.sum(), p=(0.6,0.30,0.10))
print df['SLOPE'].value_counts()
df['SLOPE']=df['SLOPE'].astype('int64')

2    544
1    289
3     87
Name: SLOPE, dtype: int64


## Replacing missing values in CA

In [25]:
print df["CA"].value_counts()
k=(41.0)/(181+67+41+20)
print k

?    611
0    181
1     67
2     41
3     20
Name: CA, dtype: int64
0.132686084142


In [26]:
v=df.CA.values=='?'
df.loc[v,'CA'] = numpy.random.choice(('0','1','2','3'), v.sum(), p=(0.60,0.20,0.13,0.07))
df['CA']=df['CA'].astype('int64')
print df['CA'].value_counts()

0    538
1    199
2    123
3     60
Name: CA, dtype: int64
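The probabilities passed to `numpy.random.choice` are rounded versions of the observed shares. The arithmetic for all four values, using the counts printed above, is sketched here:

```python
# Observed CA counts copied from the value_counts() output above.
counts = {"0": 181, "1": 67, "2": 41, "3": 20}

total = sum(counts.values())  # number of non-missing rows
shares = {value: n / total for value, n in counts.items()}

for value, share in shares.items():
    print(value, round(share, 3))
```

Since 611 of the 920 rows are filled this way, roughly two thirds of the final CA column is sampled rather than measured.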


## Replacing missing values in THAL

In [27]:
print df['THAL'].value_counts()
#can't use random sampling directly here

?    486
3    196
7    192
6     46
Name: THAL, dtype: int64


In [28]:
print df[df['THAL']=='3']['SEX'].value_counts()
print df[df['THAL']=='7']['SEX'].value_counts()

1    110
0     86
Name: SEX, dtype: int64
1    171
0     21
Name: SEX, dtype: int64


In [29]:
print "THAL:3=====>\n",df[df['THAL']=='3']['CATEGORY'].value_counts()
print "THAL:7=====>\n",df[df['THAL']=='7']['CATEGORY'].value_counts()
print "THAL:6=====>\n",df[df['THAL']=='6']['CATEGORY'].value_counts()

THAL:3=====>
0    138
1     30
2     14
3     12
4      2
Name: CATEGORY, dtype: int64
THAL:7=====>
1    63
3    43
0    38
2    37
4    11
Name: CATEGORY, dtype: int64
THAL:6=====>
1    13
2    12
0    11
3     7
4     3
Name: CATEGORY, dtype: int64


In [30]:
df['THAL']=df['THAL'].replace('?',-1)
'''
df['THAL']=df['THAL'].replace('?',-1)
for row in df.iterrows():
    if row['THAL']==-1 and row['CATEGORY']>=1:
        df.loc[row.Index, 'ifor'] = 7
        
    elif row['THAL']==-1 and row['CATEGORY']==0:
        df.loc[row.Index, 'ifor'] = 3
'''
df.loc[(df['THAL']==-1)&(df['CATEGORY']!=0),'THAL']='7'
#print df['THAL'].value_counts()
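#note: choice() without a size argument returns a single value, so every remaining row receives the same fill
df.loc[(df['THAL']==-1)&(df['CATEGORY']==0),'THAL']=numpy.random.choice(('7','3'), p=(0.20,0.80))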
print df['THAL'].value_counts()
df['THAL']=df['THAL'].astype('int64')

7    454
3    420
6     46
Name: THAL, dtype: int64
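In the cell above, `numpy.random.choice` is called without a `size`, so it returns one value that is assigned to every remaining row. The counts agree with this: the `3` count rises from 196 to 420, meaning all 224 missing rows with CATEGORY 0 received `3`. To draw an independent value per row, the number of rows has to be passed as `size`. A sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

# Made-up frame: 200 healthy rows (CATEGORY 0) whose THAL is missing (-1).
sample = pd.DataFrame({"THAL": [-1] * 200, "CATEGORY": [0] * 200})

mask = (sample["THAL"] == -1) & (sample["CATEGORY"] == 0)

# size=mask.sum() draws an independent value for every masked row.
sample.loc[mask, "THAL"] = rng.choice([7, 3], size=int(mask.sum()), p=[0.2, 0.8])
print(sample["THAL"].value_counts())
```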


In [31]:
print df.dtypes

AGE            int64
SEX            int64
CP             int64
THRESTBPS      int64
CHOL           int64
FBS            int64
RESTECG        int64
THALACH        int64
EXANG          int64
OLDPEAK      float64
SLOPE          int64
CA             int64
THAL           int64
CATEGORY       int64
dtype: object


In [32]:
dummies = pandas.get_dummies(df["CP"],prefix="CP")
df = df.join(dummies)

dummies = pandas.get_dummies(df["RESTECG"],prefix="RESTECG")
df      = df.join(dummies)

dummies = pandas.get_dummies(df["SLOPE"],prefix="SLOPE")
df      = df.join(dummies)

dummies = pandas.get_dummies(df["THAL"],prefix="THAL")
df      = df.join(dummies)


del df['CP']
del df['RESTECG']
del df['SLOPE']
del df['THAL']
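The encode, join and delete steps can also be written as one call: `get_dummies` accepts a `columns` list and drops the originals itself. A sketch on made-up rows; passing `dtype='int64'` also makes a later integer conversion unnecessary.

```python
import pandas as pd

# Made-up rows with two of the categorical columns from the notebook.
sample = pd.DataFrame({
    "AGE":   [63, 67, 41],
    "CP":    [1, 4, 2],
    "SLOPE": [3, 2, 1],
})

# columns= encodes the listed columns and drops the originals in one step;
# dtype='int64' yields integer indicators directly.
encoded = pd.get_dummies(sample, columns=["CP", "SLOPE"], dtype="int64")
print(encoded.columns.tolist())
```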

In [33]:
print df.dtypes

AGE            int64
SEX            int64
THRESTBPS      int64
CHOL           int64
FBS            int64
THALACH        int64
EXANG          int64
OLDPEAK      float64
CA             int64
CATEGORY       int64
CP_1           uint8
CP_2           uint8
CP_3           uint8
CP_4           uint8
RESTECG_0      uint8
RESTECG_1      uint8
RESTECG_2      uint8
SLOPE_1        uint8
SLOPE_2        uint8
SLOPE_3        uint8
THAL_3         uint8
THAL_6         uint8
THAL_7         uint8
dtype: object


In [34]:
for g in df.columns:
    if df[g].dtype=='uint8':
        df[g]=df[g].astype('int64')

In [35]:
df.dtypes
df.loc[df['CATEGORY']>0,'CATEGORY']=1
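Collapsing categories 1 to 4 into a single positive class turns the task into binary classification. The resulting class balance follows from the counts printed earlier and gives a floor for the accuracies reported later:

```python
# CATEGORY counts copied from the value_counts() output in cell 5.
counts = {0: 411, 1: 265, 2: 109, 3: 107, 4: 28}

negatives = counts[0]
positives = sum(n for category, n in counts.items() if category > 0)
total = negatives + positives

# Accuracy of always predicting the larger class; a floor for any classifier.
baseline = max(negatives, positives) / total
print(positives, total, round(baseline, 4))
```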

In [36]:
stdcols = ["AGE","THRESTBPS","CHOL","THALACH","OLDPEAK"]
nrmcols = ["CA"]
stddf   = df.copy()
stddf[stdcols] = stddf[stdcols].apply(lambda x: (x-x.mean())/x.std())
stddf[nrmcols] = stddf[nrmcols].apply(lambda x: (x-x.mean())/(x.max()-x.min()))
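The two lambdas apply different rescalings: a z-score for the continuous columns and a mean normalization (divide by the range) for CA. The sketch below runs both on made-up values; note that pandas `std()` uses the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Made-up values for one continuous column and one small-range column.
sample = pd.DataFrame({"AGE": [40.0, 50.0, 60.0], "CA": [0.0, 1.0, 3.0]})

# z-score: subtract the mean, divide by the sample standard deviation (ddof=1).
z = (sample["AGE"] - sample["AGE"].mean()) / sample["AGE"].std()

# mean normalization: subtract the mean, divide by the range (max - min).
ca = sample["CA"]
m = (ca - ca.mean()) / (ca.max() - ca.min())

print(z.tolist())
print(m.round(3).tolist())
```

After the z-score the column has mean 0 and unit standard deviation. After mean normalization it has mean 0 and a total spread of 1.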

In [37]:
print stddf.dtypes

AGE          float64
SEX            int64
THRESTBPS    float64
CHOL         float64
FBS            int64
THALACH      float64
EXANG          int64
OLDPEAK      float64
CA           float64
CATEGORY       int64
CP_1           int64
CP_2           int64
CP_3           int64
CP_4           int64
RESTECG_0      int64
RESTECG_1      int64
RESTECG_2      int64
SLOPE_1        int64
SLOPE_2        int64
SLOPE_3        int64
THAL_3         int64
THAL_6         int64
THAL_7         int64
dtype: object


In [38]:
from sklearn.model_selection import train_test_split


In [39]:
df_copy=stddf.copy()
df_copy=df_copy.drop(['CATEGORY'],axis=1)

dat=df_copy.values
#print dat.shape

print type(dat),dat

<type 'numpy.ndarray'> [[ 1.00683792  1.          0.73041283 ...,  0.          1.          0.        ]
 [ 1.43125528  1.          1.53332004 ...,  1.          0.          0.        ]
 [ 1.43125528  1.         -0.60776585 ...,  0.          0.          1.        ]
 ..., 
 [ 0.1580032   1.         -0.50071155 ...,  0.          1.          0.        ]
 [ 0.47631622  1.         -0.60776585 ...,  1.          0.          0.        ]
 [ 0.90073358  1.         -0.60776585 ...,  0.          0.          1.        ]]


In [40]:
labels=df['CATEGORY'].values
print labels,type(labels)

[0 1 1 0 0 0 1 0 1 1 0 0 1 0 0 0 1 0 0 0 0 0 1 1 1 0 0 0 0 1 0 1 1 0 0 0 1
 1 1 0 1 0 0 0 1 1 0 1 0 0 0 0 1 0 1 1 1 1 0 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1
 1 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 1 0 1 1 1 1 1
 1 0 1 1 0 0 0 1 1 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0 0 1 1 1 0 0 1 0 1 0 1 1 0
 0 0 0 0 0 1 1 1 1 1 1 0 0 1 0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 1 0 0 1 1 0 0 1
 0 0 1 1 1 0 1 1 1 0 1 0 0 0 1 0 0 0 0 0 1 1 1 0 1 0 1 0 1 1 0 0 0 0 0 0 0
 0 1 1 0 0 0 1 1 0 1 1 0 0 1 1 1 0 0 0 0 0 1 0 1 1 1 1 0 0 1 0 0 0 0 0 0 0
 1 0 1 0 0 1 1 1 1 1 0 1 0 1 0 1 0 0 0 1 0 1 0 1 0 1 1 1 0 0 0 1 0 1 1 1 0
 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 1 1 

In [41]:
x_train,x_test,y_train,y_test=train_test_split(dat,labels, test_size=0.25, random_state=42)

In [42]:
print "x_train:",x_train.shape
print "y_train:",y_train.shape
print
print "x_test:",x_test.shape
print "y_test:",y_test.shape

x_train: (690, 22)
y_train: (690,)

x_test: (230, 22)
y_test: (230,)
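Two refinements are possible here, shown as a sketch on made-up data of the same shape. `stratify` keeps the class ratio equal in both parts, and fitting the scaling statistics on the training part only keeps information from the test rows out of the preprocessing. The notebook instead scales selected columns on the full data before splitting, and `StandardScaler` here is applied to all made-up features for simplicity.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Made-up stand-in for the notebook's data: 920 rows, 22 features, binary labels.
features = rng.normal(loc=5.0, scale=2.0, size=(920, 22))
labels = np.array([0] * 411 + [1] * 509)

# stratify keeps the 0/1 ratio the same in both parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.25, random_state=42, stratify=labels)

# Fit the scaling statistics on the training part only, then reuse them.
scaler = StandardScaler().fit(x_tr)
x_tr_scaled = scaler.transform(x_tr)
x_te_scaled = scaler.transform(x_te)

print(x_tr.shape, x_te.shape, y_tr.mean(), y_te.mean())
```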


In [43]:
#training and testing
#SVM
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=5)
clf.fit(x_train,y_train)
print "SVM:",clf.score(x_test,y_test)*100,"%"


from sklearn import linear_model
lrcv=linear_model.LogisticRegressionCV(fit_intercept=True,penalty='l2',dual=False)
lrcv.fit(x_train,y_train)
print "Logistic Regression:",lrcv.score(x_test,y_test)*100,"%"


SVM: 89.5652173913 %
Logistic Regression: 88.6956521739 %
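Both scores come from a single 75/25 split, so part of the gap between them may be due to which rows landed in the test set. Cross-validation gives a mean and a spread over several splits. The sketch below uses synthetic data from `make_classification` as a stand-in, so its numbers say nothing about the heart disease data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data; the real features and labels are not reproduced here.
X, y = make_classification(n_samples=300, n_features=22, random_state=0)

# Same hyperparameters as the notebook's SVC, scored on 5 folds instead of one split.
scores = cross_val_score(SVC(gamma=0.001, C=5), X, y, cv=5)
print(scores.mean(), scores.std())
```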


In [44]:
import keras 
import tensorflow
%matplotlib inline
from sklearn import metrics
import matplotlib.pyplot as plt # side-stepping mpl backend
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers.core import Dropout, Flatten, Activation, Dense
from keras.layers.convolutional import Convolution2D, Convolution1D,MaxPooling1D

Using TensorFlow backend.


In [45]:
model = Sequential()
model.add(Dense(512, input_dim=22, init='uniform', activation='sigmoid'))
#model.add(Dropout(0.2))
model.add(Dense(30, init='uniform', activation='sigmoid'))
model.add(Dense(1, init='uniform', activation='sigmoid'))
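The model is a plain feed-forward network: 22 inputs, hidden layers of 512 and 30 units, and one sigmoid output. The NumPy sketch below reproduces the forward pass and the parameter count for that layout. It is an illustration of the architecture and not Keras code, and the ±0.05 initialization range is an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
layer_sizes = [22, 512, 30, 1]  # input, two hidden layers, output

# One (weights, bias) pair per Dense layer; small uniform initial weights.
params = [(rng.uniform(-0.05, 0.05, size=(n_in, n_out)), np.zeros(n_out))
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x):
    """Apply each Dense layer followed by a sigmoid, as in the model above."""
    for weights, bias in params:
        x = sigmoid(x @ weights + bias)
    return x

n_params = sum(w.size + b.size for w, b in params)
batch = rng.normal(size=(4, 22))  # four made-up input rows
print(n_params, forward(batch).shape)
```

With 27,197 trainable parameters and 690 training rows, the network has far more parameters than examples, which is worth keeping in mind when comparing train and test scores.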



In [46]:
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])

In [47]:
model.fit(x_train, y_train, epochs=600, batch_size=512,  verbose=2 ,validation_data=(x_test, y_test))

Train on 690 samples, validate on 230 samples
Epoch 1/600
0s - loss: 0.2501 - acc: 0.4623 - val_loss: 0.2463 - val_acc: 0.5826
Epoch 2/600
0s - loss: 0.2481 - acc: 0.5435 - val_loss: 0.2442 - val_acc: 0.5826
Epoch 3/600
0s - loss: 0.2480 - acc: 0.5435 - val_loss: 0.2433 - val_acc: 0.5826
Epoch 4/600
0s - loss: 0.2477 - acc: 0.5435 - val_loss: 0.2431 - val_acc: 0.5826
Epoch 5/600
0s - loss: 0.2474 - acc: 0.5435 - val_loss: 0.2429 - val_acc: 0.5826
Epoch 6/600
0s - loss: 0.2471 - acc: 0.5435 - val_loss: 0.2428 - val_acc: 0.5826
Epoch 7/600
0s - loss: 0.2466 - acc: 0.5435 - val_loss: 0.2427 - val_acc: 0.5826
Epoch 8/600
0s - loss: 0.2462 - acc: 0.5435 - val_loss: 0.2428 - val_acc: 0.5826
Epoch 9/600
0s - loss: 0.2458 - acc: 0.5435 - val_loss: 0.2428 - val_acc: 0.5826
Epoch 10/600
0s - loss: 0.2453 - acc: 0.5435 - val_loss: 0.2424 - val_acc: 0.5826
Epoch 11/600
0s - loss: 0.2447 - acc: 0.5435 - val_loss: 0.2417 - val_acc: 0.5826
Epoch 12/600
0s - loss: 0.2440 - acc: 0.5435 - val_loss: 0.24

<keras.callbacks.History at 0x7f51cafe6950>

In [48]:
import math

trainScore = model.evaluate(x_train, y_train, verbose=0)
print('Train Score: %.2f MSE (%.2f RMSE)' % (trainScore[0], math.sqrt(trainScore[0])))
testScore = model.evaluate(x_test, y_test, verbose=0)
print('Test Score: %.2f MSE (%.2f RMSE)' % (testScore[0], math.sqrt(testScore[0])))

Train Score: 0.07 MSE (0.27 RMSE)
Test Score: 0.10 MSE (0.32 RMSE)
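The cell above prints only the loss. Because the model was compiled with `metrics=['accuracy']`, `model.evaluate` returns the loss followed by the accuracy, so `testScore[1]` should hold the test accuracy. The same figure can be derived from predicted probabilities by thresholding at 0.5, as in this sketch with made-up values. The helper name is hypothetical.

```python
import numpy as np

def accuracy_from_probabilities(probs, y_true, threshold=0.5):
    """Share of rows whose thresholded probability matches the 0/1 label."""
    predicted = (np.asarray(probs).ravel() >= threshold).astype(int)
    return float((predicted == np.asarray(y_true)).mean())

# Made-up sigmoid outputs and labels, shaped like model.predict() output (n, 1).
probs = np.array([[0.9], [0.2], [0.6], [0.4]])
y_true = np.array([1, 0, 0, 0])
print(accuracy_from_probabilities(probs, y_true))
```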


## So far, the neural network performs best
1. SVM: 89.57%
2. Logistic regression: 88.70%
3. Neural network: 91.88%

## TO DO:
### 1. PCA
### 2. Remove outliers (the neural network may then perform better)