This is a beginner-friendly notebook for predicting heart disease using supervised machine learning.
Here we are given attributes such as age, gender, and blood pressure, and we have to predict whether the patient has heart disease. Along with the information about each patient, we are also given a label that tells us whether the patient suffers from the disease or not.

So let's get to it. For any supervised machine learning task, the most commonly followed steps are:
1. Exploratory Data Analysis.
2. Data preprocessing.
3. Model training.
4. Model evaluation and testing.

# Load the Data.

**Let's Start...**
First let's load the data so that we can manipulate it using Python. We do this using the *pandas* library.

In [None]:
#importing libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # for plotting
import seaborn as sns # for more advanced plots

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



In [None]:
#here we load the data into a pandas DataFrame.
df=pd.read_csv("/kaggle/input/heart-disease-uci/heart.csv") #loading the data
df.head(5) #let's see the first 5 rows of the dataset

In [None]:
df.tail(5) #let's see the last 5 rows

In [None]:
#let's see the shape of the dataset
print("Number of rows:",df.shape[0])
print("Number of columns:",df.shape[1])

# Exploratory Data Analysis (EDA).

**So what is EDA? Why is it often called the most crucial step in data science?**

Here we try to understand the data by plotting multiple graphs that might give us some insights or interesting patterns. Understanding the data before building a model is very important.

**Why is it important?**

It is important because only after understanding the data can you apply machine learning algorithms in an effective way.

So let's start with the types of attributes. There are 2 broad types. *Numeric* attributes hold quantities, like age or the marks of a test. *Categorical* attributes hold categories, like gender or the grade obtained in a test.

Let's first look at the meaning of each attribute.

**Numeric Attributes:**

* age: Age in years
* trestbps: Resting blood pressure in mmHg.
* chol: Serum cholesterol in mg/dl.
* thalach: Maximum heart rate achieved.
* oldpeak: ST depression induced by exercise relative to rest.

**Categorical attributes:**

* sex: Gender (Male(1), Female(0)).
* cp: Chest pain type (Typical angina(0), Atypical angina(1), Non-anginal pain(2), Asymptomatic(3)).
* fbs: Fasting blood sugar, given as a comparison. If value > 120 mg/dl then true(1) else false(0).
* restecg: Resting electrocardiographic result (Normal(0), ST-T wave abnormality(1), Left ventricular hypertrophy(2)).
* exang: Exercise-induced angina (No(0), Yes(1)).
* slope: Slope of the peak exercise ST segment (Upsloping(0), Flat(1), Downsloping(2)).
* thal: Thalassemia (Unknown(0), Normal(1), Fixed defect(2), Reversible defect(3)).
* ca: Number of major vessels colored by fluoroscopy (0, 1, 2, 3, 4).


In [None]:
#ok let's start with some simple analysis:
#let's get the count, mean, standard deviation, min/max and quartiles of each column
df.describe()

In [None]:
#Those are just numbers...
#let's start plotting...
#let's get the count of males and females in the dataset.
df['sex'].hist()
#looks like we have more male patients than female patients.

In [None]:
# Let's plot such a histogram for every attribute:
fig, axis = plt.subplots(7,2,figsize=(10, 15))
df.hist(ax=axis)
plt.show()
#looking closely at the graphs below, you can see the difference between the histograms of numeric and categorical attributes.
#The categorical ones have distinct, separated bars while the bars of numeric attributes are connected.

In [None]:
# we have heard that males are more prone to heart disease than females.
#Let's see what our dataset shows:
#Visualizing heart disease counts along with gender
plt.figure(figsize=(5,5))

#here we use count plot from seaborn which can plot the frequencies
ax = sns.countplot(x = df['target'].apply(lambda x:'Heart Disease' if x == 1 else 'Normal'), hue=df['sex'])
ax.set_xlabel('Patient Condition')
plt.show()
#Note: the dataset contains more men than women, so raw counts alone cannot show who is more prone.
#Compare the share of each sex that has heart disease before drawing a conclusion.

Let's have a look at the correlation plot.

**What is correlation?**
* This value is calculated between two attributes.
* It ranges from -1 to 1.
* Positive correlation (value: 0 to 1): if the value of the first attribute increases, the value of the second attribute also increases. E.g. grade and marks: the grade of a person depends strongly on the test marks, so the correlation will be high.
* Negative correlation (value: 0 to -1): if the value of the first attribute increases, the value of the second attribute decreases. E.g. marks and number of wrong answers: the more wrong answers, the lower the marks.
* No correlation (value near 0): the attributes are unrelated. E.g. email ID and grade: the grade of a person does not depend on their email ID at all.
* The higher the magnitude of the value, the more strongly the attributes are related to each other.
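
To make the idea of correlation concrete, here is a minimal sketch that computes the Pearson correlation by hand. The arrays `marks`, `grade_points` and `hours_idle` are made-up toy values (not taken from the heart dataset), and `pearson` is a helper written only for this illustration.

```python
import numpy as np

# Toy data (hypothetical values, not taken from the heart dataset).
marks = np.array([35, 50, 62, 71, 88, 95])
grade_points = np.array([1, 2, 3, 3, 4, 5])   # rises as marks rise
hours_idle = np.array([9, 8, 6, 5, 3, 1])     # falls as marks rise

def pearson(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / np.sqrt((a ** 2).sum() * (b ** 2).sum()))

r_pos = pearson(marks, grade_points)
r_neg = pearson(marks, hours_idle)
print(f"marks vs grade_points: {r_pos:.2f}")  # close to +1
print(f"marks vs hours_idle:   {r_neg:.2f}")  # close to -1
```

`df.corr()` in pandas computes this same Pearson coefficient by default for every pair of columns.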

In [None]:
#Let's plot this
plt.figure(figsize=(15,10))
sns.heatmap(df.corr(),cmap='Greens',annot=True)

From the above graph we can draw the following conclusions regarding the target column:
1. Thalach and target: if the maximum heart rate increases, the possibility of heart disease increases (positive correlation).
2. Exang and target: if the patient doesn't have exercise-induced angina, the possibility of heart disease increases (negative correlation).
3. Ca and target: if the patient has fewer major vessels colored by fluoroscopy, the possibility of heart disease increases (negative correlation).

In a similar way we can extract other correlations as well.

In [None]:
#Let's plot pair plots.
# This is mostly done for numerical columns, where you can see all the points for each pair of columns.
# Using this as well we can find some interesting patterns.
columns=['age','trestbps','chol','thalach','oldpeak','target']#selecting columns
sns.pairplot(df[columns],hue='target',corner=True,diag_kind='hist')

# Data Preprocessing

Data preprocessing has 3 sub-steps, namely data cleaning, data transformation, and data reduction.
![t*](https://media.geeksforgeeks.org/wp-content/uploads/20190312184006/Data-Preprocessing.png)

Source: [Geeks for geeks](https://www.geeksforgeeks.org/data-preprocessing-in-data-mining/)

## Data cleaning

In [None]:
#Checking for missing values
df.isna().sum()

Great, there are no missing values in the dataset.
Let's go and look for noisy data.

**What is noisy data?**

Noisy data is data that carries no meaningful information, typically because of errors in data collection or data entry. Rows containing such values often show up as outliers. Outliers mostly occur in numeric data. E.g. suppose that in our data the age of a person is 29 but by mistake it got registered as 129. That point is beyond the normal range of age, hence it is an outlier. For categorical data, consider the target column, where 1 signifies that the patient has heart disease and 0 signifies that the patient does not have heart disease; a value of 2 would be undefined. However, as the check on the categorical columns further down shows, our dataset contains no such undefined values.

**How to detect Outliers?**

Well, there are several ways to do that, such as box plots and the z-score technique.

**How to deal with Outliers?**

We can either apply binning or delete the outliers. In this notebook, we delete them. In general, however, the right choice depends on the nature of the outliers.
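
As a small illustration of the two detection methods and of deletion, the sketch below applies the z-score rule and the box plot (IQR) rule to a made-up list of ages. The values in `ages` are hypothetical; 129 stands in for the data-entry error described above. The 1.5 × IQR fence is the usual box plot convention, assumed here.

```python
import numpy as np

# Hypothetical ages; 129 plays the role of the mistyped value.
ages = np.array([29, 34, 37, 41, 43, 45, 48, 50, 52, 54,
                 55, 57, 58, 60, 61, 63, 65, 67, 70, 129])

# z-score rule: distance from the mean in units of standard deviation.
z = np.abs((ages - ages.mean()) / ages.std())
z_outliers = ages[z >= 3]

# Box plot (IQR) rule: points more than 1.5 * IQR outside the quartiles.
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1
iqr_outliers = ages[(ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)]

# Deleting the flagged values keeps only the in-range ones.
ages_clean = ages[z < 3]

print("z-score outliers:", z_outliers)
print("IQR outliers:    ", iqr_outliers)
print("rows kept:       ", len(ages_clean))
```

On very small samples the z-score rule can miss obvious errors, because a single extreme value also inflates the standard deviation it is measured against.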

Let's look for outliers.

In [None]:
#Checking the number of unique values of the categorical attributes.
#If the number of unique values matches the number of categories listed above, then we do not have categorical outliers in the data.
numeric_cols=['age','trestbps','chol','thalach','oldpeak']#numeric attributes
cat_cols=['sex','cp','fbs','restecg','exang','slope','ca','thal']#categorical attributes

df[cat_cols].nunique()

In [None]:
# The results show that there are no categorical outliers.
# There are two values for the sex attribute (1 and 0) and the code above reports the same number.
#This applies to the other categorical attributes as well.

#let's go for box plots of the numeric attributes.
plt.figure(figsize=(10,19))
sns.boxplot(x="variable", y="value", data=pd.melt(df[numeric_cols]))
plt.show()

In [None]:
#It turns out there are some outliers in trestbps, chol, and thalach.
#let's apply the z-score technique as well to detect outliers.
#values with a z-score greater than 3 or less than -3 are treated as outliers
#However, this threshold of 3 and -3 is set by us and we can change it.
from scipy import stats
z = np.abs(stats.zscore(df)) #absolute z-score of every column, including the categorical ones
print(z)

In [None]:
#ok, here we can't spot any outliers :(
#we can't make out which ones are outliers from such a big table...
#let the computer do it for us...
df_outliers= df[(z >= 3).any(axis=1)] #keep rows where any column has a z-score of 3 or more
df_outliers

In [None]:
print(df_outliers.shape)
#16 rows!! that's too many as we have a small dataset of only 303 rows.
#we would increase the threshold slightly
df_outliers= df[(z >= 3.5).any(axis=1)]
df_outliers

In [None]:
#We remove these 6 rows with outlier values now
df_clean=df[(z <3.5).all(axis=1)] #keep only rows where every column has a z-score below 3.5
df_clean.shape

In [None]:
#let's check for duplicate samples.
# When present in large quantities, duplicate samples may reduce the effectiveness of the models.
duplicate = df_clean[df_clean.duplicated()]
duplicate
#There is only one duplicate row, so there is no need to delete it.

## Data Transformation.

Here we transform the data into suitable forms to avoid biases.
* Categorical attributes: apply one-hot encoding.
* Numeric attributes: standardize them to zero mean and unit variance.
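
The two transformations listed above can be sketched on a tiny made-up table. The DataFrame `toy` and its values are hypothetical; the column names only mirror the heart data for readability.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical mini-dataset, not rows from heart.csv.
toy = pd.DataFrame({
    "age": [29.0, 45.0, 54.0, 61.0, 77.0],
    "cp": [0, 2, 1, 3, 2],
})

# One-hot encoding: every category of `cp` becomes its own indicator column.
encoded = pd.get_dummies(toy, columns=["cp"])

# Standardization: subtract the column mean, divide by the standard deviation.
encoded[["age"]] = StandardScaler().fit_transform(encoded[["age"]])

print(encoded.columns.tolist())
print(encoded["age"].round(2).tolist())
```

Standardized values are centered on 0 with a standard deviation of 1, but they are not confined to a fixed interval: the youngest and oldest entries here land beyond -1 and 1.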

In [None]:
#applying one-hot encoding to categorical columns
data = pd.get_dummies(df_clean,columns =['cp','restecg','slope','ca','thal'])
#we only apply one-hot encoding to categorical columns with more than 2 unique values.
#hence the above set does not contain the columns sex, fbs and exang, which have only 2 unique values, 0 and 1.
data.head()


In [None]:
#let's scale the numeric columns
#Scaling keeps features with large ranges from dominating and helps models such as logistic regression, SVM and KNN converge faster.
numeric_cols=['age','trestbps','chol','thalach','oldpeak']#numeric attributes
from sklearn.preprocessing import StandardScaler
standardScaler = StandardScaler()
#StandardScaler subtracts each column's mean and divides by its standard deviation (zero mean, unit variance); values are not bounded to [-1, 1].
data[numeric_cols] = standardScaler.fit_transform(data[numeric_cols])
data.head()

## Data Reduction

Here, we select the important features. This reduces the dataset size, and selecting important features can also improve the performance of the model.
For this, we apply the *select features from a model* technique, available in sklearn as `SelectFromModel`.

In this technique, we fit a machine learning model, and the model assigns an importance value to each feature.
Features with an importance value above a certain threshold are kept as important features.
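
Here is a minimal sketch of how this thresholding behaves, using synthetic data rather than the heart dataset. The arrays `X` and `y` are generated so that only the first feature carries signal; `forest` and the random seeds are choices made for this illustration. For tree-based models, `SelectFromModel` uses the mean importance as its default threshold.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)

# Synthetic data: the label depends only on feature 0, the other four are noise.
X = rng.normal(size=(300, 5))
y = (X[:, 0] > 0).astype(int)

forest = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X, y)

# prefit=True reuses the fitted model; the default threshold is the mean importance.
selector = SelectFromModel(forest, prefit=True)
kept_mask = selector.get_support()
X_reduced = selector.transform(X)

print("importances:", forest.feature_importances_.round(3))
print("kept mask:  ", kept_mask)
print("new shape:  ", X_reduced.shape)
```

Setting `random_state` makes the selected features repeatable between runs; without it, features whose importance sits near the threshold may be kept in one run and dropped in the next.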

In [None]:
# for this to work we should first divide the dataset in features and labels.
# features are normally denoted by "x" and labels by "y".
y=data['target']
y=np.array(y)
x=data.drop(columns=['target']) # removed the label column from dataframe and passed it to x

In [None]:
# Let's work on feature selection
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression #used later, in the model training section
from sklearn.ensemble import ExtraTreesClassifier

clf = ExtraTreesClassifier(n_estimators=500).fit(x, y)
selector = SelectFromModel(clf, prefit=True)
x_columns=x.columns #all columns of x
columns=selector.get_support() #boolean mask: True for each selected column
selected_columns=list(x_columns[columns]) #creating list of selected columns
print(selected_columns)

In [None]:
#selecting these important columns
x_reduced=selector.transform(x)

That's great, now our data is ready to be given to a machine learning model.

# Model Training and Evaluation

We won't be training one model but 4 models and comparing their performance.
Those 4 models are Logistic Regression, Random Forest, SVM, and KNN.

We will evaluate each model's performance using a commonly used metric:
* Accuracy: The percentage of correct predictions on the test data.


We use *5-fold cross validation* to get reliable results.
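
The sketch below shows the mechanics of 5-fold cross validation with accuracy on synthetic data. It is an illustration under stated assumptions, not the notebook's exact setup: it swaps in `StratifiedKFold` with `shuffle=True` instead of plain `KFold`, and `X`, `y` and the seeds are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)

# Synthetic two-class data (hypothetical, not the heart dataset).
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# shuffle=True mixes the rows before splitting; stratification keeps the
# class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_accuracy = []
for train_idx, test_idx in cv.split(X, y):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    # Accuracy = correct predictions / total predictions on the held-out fold.
    fold_accuracy.append(accuracy_score(y[test_idx], y_pred))

mean_accuracy = float(np.mean(fold_accuracy))
print("per-fold accuracy:", np.round(fold_accuracy, 3))
print(f"mean accuracy:     {mean_accuracy:.3f}")
```

Plain `KFold` without shuffling splits the rows in their stored order. If a dataset happens to be sorted by label, some folds can end up dominated by a single class, which is why a shuffled, stratified split is a common alternative.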

In [None]:
#splitting the dataset for Training and testing and using 5-fold Cross validation.
from sklearn.model_selection import KFold
kf = KFold(n_splits=5)
kf.get_n_splits(x_reduced)

#metrics for SVM
SVM_accuracy=[]

#metrics for Random Forest
RF_accuracy=[]

#metrics for KNN
KNN_accuracy=[]

#metrics for Logistic Regression
LR_accuracy=[]


In [None]:
#initializing the models
#importing libraries of the selected algorithms
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
#importing libraries of performance Metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

#Making the classifier Objects
clf_svm=SVC() #SVM object
clf_rf=RandomForestClassifier(max_depth=100, random_state=0)#Random Forest Object
clf_knn = KNeighborsClassifier(n_neighbors=30)#KNN object
clf_lr=LogisticRegression(C= 1, class_weight= None, penalty= 'l2', solver= 'newton-cg')#Logistic regression model

In [None]:
i=1 #counts the current fold
#starting the 5-fold cross validation
for train_index, test_index in kf.split(x_reduced):
    print("\nNumber of fold: %d"%i)
    i+=1
    #Splitting the data
    X_train, X_test = x_reduced[train_index], x_reduced[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    #Training and Evaluating SVM
    model=clf_svm.fit(X_train,y_train)
    y_pred=model.predict(X_test)
    SVM_accuracy.append(accuracy_score(y_test,y_pred))
    print("Working on SVM")
    
    #Training and Evaluating Random Forest
    model=clf_rf.fit(X_train,y_train)
    y_pred=model.predict(X_test)
    RF_accuracy.append(accuracy_score(y_test,y_pred))
    print("Working on Random Forest")
    
    #Training and Evaluating KNN
    model=clf_knn.fit(X_train,y_train)
    y_pred=model.predict(X_test)
    KNN_accuracy.append(accuracy_score(y_test,y_pred))
    print("Working on KNN")
    
    #Training and Evaluating LR
    model=clf_lr.fit(X_train,y_train)
    y_pred=model.predict(X_test)
    LR_accuracy.append(accuracy_score(y_test,y_pred))
    print("Working on Logistic Regression")

In [None]:
#visualizing average results:
#(stored in result_rows so the preprocessed `data` DataFrame is not overwritten)
result_rows=[
    ["SVM", np.mean(SVM_accuracy)],
    ["RF", np.mean(RF_accuracy)],
    ["KNN", np.mean(KNN_accuracy)],
    ["LR", np.mean(LR_accuracy)],
]
#converting results to dataframe
results=pd.DataFrame(result_rows,columns=["Algorithms","Accuracy"])
results


In [None]:
#Let's plot this accuracy
results_new=results.set_index('Algorithms')
results_new['Accuracy'].plot(kind='bar')
plt.ylabel("Accuracy")

From the above comparison, Logistic Regression gives the best accuracy, at 78%. There is still a lot that can be done, such as developing new features, tuning hyperparameters, and improving the feature selection technique.