# Assignment #2, Heart Disease
### Deadline: Jan 11, 2021, 13:00 hrs

Machine learning is increasingly used to diagnose diseases at an early stage, which can also help doctors reduce the cost of health care. For this purpose, data has been collected in the [UCI machine learning data set](http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data).

The features (attributes) are:

- age: age in years
- sex: sex (1 = male; 0 = female)
- cp: chest pain type
- trestbpss: resting blood pressure (in mm Hg on admission to the hospital)
- chol: serum cholesterol in mg/dl
- fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
- restecg: resting electrocardiographic results
- thalach: maximum heart rate achieved
- exang: exercise-induced angina (1 = yes; 0 = no)
- oldpeak: ST depression induced by exercise relative to rest
- slope: the slope of the peak exercise ST segment
- ca: number of major vessels (0-3) colored by fluoroscopy
- thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

The target is:

- num: diagnosis of heart disease (angiographic disease status), with integer values from 0 to 4

All the above 13 attributes are used to predict the target. 

Read the given data and then apply a suitable machine learning technique (Naive Bayes, KNN, Decision Tree, Random Forest) to predict the heart disease stage.
Next, tune the hyperparameters to achieve the best scores and record the corresponding parameters. At the end, you will submit the working notebook along with a summary of the models and the best parameters that helped you find the best model.

# Reading Data

In [1]:
import numpy as np
import pandas as pd

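filename = 'processed.cleveland.data'
column_names = ["age", "sex", "cp", "trestbpss", "chol", "fbs", "restecg",
                "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

# The file has no header row, so the column names are supplied explicitly
data = pd.read_table(filename, sep=",", names=column_names)

# Missing values are marked with '?'; convert them to NaN and drop those rows
data = data.replace('?', np.nan)
data = data.dropna()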

data.head(10)

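| | age | sex | cp | trestbpss | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63.0 | 1.0 | 1.0 | 145.0 | 233.0 | 1.0 | 2.0 | 150.0 | 0.0 | 2.3 | 3.0 | 0.0 | 6.0 | 0 |
| 1 | 67.0 | 1.0 | 4.0 | 160.0 | 286.0 | 0.0 | 2.0 | 108.0 | 1.0 | 1.5 | 2.0 | 3.0 | 3.0 | 2 |
| 2 | 67.0 | 1.0 | 4.0 | 120.0 | 229.0 | 0.0 | 2.0 | 129.0 | 1.0 | 2.6 | 2.0 | 2.0 | 7.0 | 1 |
| 3 | 37.0 | 1.0 | 3.0 | 130.0 | 250.0 | 0.0 | 0.0 | 187.0 | 0.0 | 3.5 | 3.0 | 0.0 | 3.0 | 0 |
| 4 | 41.0 | 0.0 | 2.0 | 130.0 | 204.0 | 0.0 | 2.0 | 172.0 | 0.0 | 1.4 | 1.0 | 0.0 | 3.0 | 0 |
| 5 | 56.0 | 1.0 | 2.0 | 120.0 | 236.0 | 0.0 | 0.0 | 178.0 | 0.0 | 0.8 | 1.0 | 0.0 | 3.0 | 0 |
| 6 | 62.0 | 0.0 | 4.0 | 140.0 | 268.0 | 0.0 | 2.0 | 160.0 | 0.0 | 3.6 | 3.0 | 2.0 | 3.0 | 3 |
| 7 | 57.0 | 0.0 | 4.0 | 120.0 | 354.0 | 0.0 | 0.0 | 163.0 | 1.0 | 0.6 | 1.0 | 0.0 | 3.0 | 0 |
| 8 | 63.0 | 1.0 | 4.0 | 130.0 | 254.0 | 0.0 | 2.0 | 147.0 | 0.0 | 1.4 | 2.0 | 1.0 | 7.0 | 2 |
| 9 | 53.0 | 1.0 | 4.0 | 140.0 | 203.0 | 1.0 | 2.0 | 155.0 | 1.0 | 3.1 | 3.0 | 0.0 | 7.0 | 1 |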


In [2]:
print("(Rows, Columns): " + str(data.shape))
print('Column Names: ')
for co in data.columns:
    print(co,end='   ')


(Rows, Columns): (297, 14)
Column Names: 
age   sex   cp   trestbpss   chol   fbs   restecg   thalach   exang   oldpeak   slope   ca   thal   target   

In [3]:
data.describe()

| | age | sex | cp | trestbpss | chol | fbs | restecg | thalach | exang | oldpeak | slope | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 297.0 | 297.0 | 297.0 | 297.0 | 297.0 | 297.0 | 297.0 | 297.0 | 297.0 | 297.0 | 297.0 | 297.0 |
| mean | 54.542088 | 0.676768 | 3.158249 | 131.693603 | 247.350168 | 0.144781 | 0.996633 | 149.599327 | 0.326599 | 1.055556 | 1.602694 | 0.946128 |
| std | 9.049736 | 0.4685 | 0.964859 | 17.762806 | 51.997583 | 0.352474 | 0.994914 | 22.941562 | 0.469761 | 1.166123 | 0.618187 | 1.234551 |
| min | 29.0 | 0.0 | 1.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 25% | 48.0 | 0.0 | 3.0 | 120.0 | 211.0 | 0.0 | 0.0 | 133.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 50% | 56.0 | 1.0 | 3.0 | 130.0 | 243.0 | 0.0 | 1.0 | 153.0 | 0.0 | 0.8 | 2.0 | 0.0 |
| 75% | 61.0 | 1.0 | 4.0 | 140.0 | 276.0 | 0.0 | 2.0 | 166.0 | 1.0 | 1.6 | 2.0 | 2.0 |
| max | 77.0 | 1.0 | 4.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 3.0 | 4.0 |
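
The summary above has no `ca` or `thal` columns. The likely reason is that the `?` placeholders make pandas read those two columns as strings, and replacing `?` with NaN afterwards does not convert them back to numbers, so `describe()` skips them. A minimal sketch of one alternative, assuming the `na_values` option of `pd.read_csv` is acceptable here: the first two rows are copied from the `head()` output, while the third row (with a `?` in `ca`) and the names `sample` and `demo` are made up for illustration.

```python
import io

import pandas as pd

columns = ["age", "sex", "cp", "trestbpss", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

# Rows 1-2 come from the head() output; row 3 is a hypothetical record
# with a missing 'ca' value, added only to show how '?' is handled.
sample = io.StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2\n"
    "50.0,0.0,3.0,120.0,200.0,0.0,0.0,150.0,0.0,1.0,2.0,?,3.0,0\n"
)

# Treat '?' as missing while parsing, so 'ca' and 'thal' stay numeric
demo = pd.read_csv(sample, names=columns, na_values="?")
print(demo.dtypes[["ca", "thal"]])

# Drop incomplete rows, as in the notebook
demo = demo.dropna()
print(demo.shape)
print(demo.describe().columns.tolist())
```

With the columns parsed as floats, `describe()` would be expected to report all 14 columns on the real file as well.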


# Machine Learning

In [4]:
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

In [5]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X,y,train_size = 0.7, random_state = 30)
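
Because the target has five classes and some of them may be rare, a plain random split can leave the test set with a different class mix than the training set. A minimal sketch of a stratified split using the `stratify` argument of `train_test_split`; the labels and class counts below are made up for illustration and are not the Cleveland data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels for five classes (illustrative counts only)
y_demo = np.repeat([0, 1, 2, 3, 4], [150, 60, 40, 30, 20])
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# stratify=y_demo keeps the class proportions similar in both parts
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=30, stratify=y_demo
)

train_share = np.bincount(y_tr, minlength=5) / len(y_tr)
test_share = np.bincount(y_te, minlength=5) / len(y_te)
print("train class shares:", train_share.round(3))
print("test class shares: ", test_share.round(3))
```

In the notebook this would amount to passing `stratify=y` to the existing call, if a stratified split is wanted.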

In [6]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

# Decision Tree

In [7]:
from sklearn.metrics import classification_report 
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(random_state=20)
model.fit(x_train, y_train)
print('Training Accuracy:\t',model.score(x_train, y_train))
print('Testing Accuracy:\t',model.score(x_test, y_test))

Training Accuracy:	 1.0
Testing Accuracy:	 0.5555555555555556
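
The gap between a training accuracy of 1.0 and a testing accuracy of about 0.56 suggests that the unconstrained tree is overfitting. One possible way to approach the tuning step is a cross-validated grid search with `GridSearchCV`. The sketch below runs on synthetic stand-in data from `make_classification` rather than the Cleveland data, and the grid values are illustrative assumptions, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the heart disease data):
# 297 rows, 13 features, 5 classes, to mirror the shape used above.
X_demo, y_demo = make_classification(
    n_samples=297, n_features=13, n_informative=6, n_redundant=2,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=30
)

# Illustrative grid: limiting depth and leaf size constrains the tree
param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_leaf": [1, 5, 10],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=20),
    param_grid,
    cv=5,                # 5-fold cross-validation on the training part only
    scoring="accuracy",
)
search.fit(x_tr, y_tr)

print("Best parameters:  ", search.best_params_)
print("Best CV accuracy: ", search.best_score_)
print("Testing accuracy: ", search.score(x_te, y_te))
```

The same pattern should carry over to the notebook by fitting `search` on `x_train` and `y_train` instead of the synthetic arrays.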
