# Introduction

This dataset gives a number of variables along with a target condition of having or not having heart disease. It contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date. The "goal" field refers to the presence of heart disease in the patient.

In this notebook we first explore the dataset, using a range of visualization tools to see which variables carry useful information. We then train and evaluate machine learning models on it.

## Loading appropriate libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # used to plot the graphs
import seaborn as sns # statistical plots built on top of matplotlib
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import cross_val_score
from sklearn.metrics import precision_recall_curve
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_curve
from sklearn.metrics import f1_score
from sklearn.metrics import auc
from sklearn.svm import SVC
%matplotlib inline

## Causes of Heart Disease

1. Excess weight, especially around the stomach area, increases a woman's risk of developing cardiovascular disease, and lack of physical activity makes it worse.
2. Diabetes damages blood vessels, so it is a major factor in developing cardiovascular disease.
3. Unhealthy food and lack of exercise lead to heart disease. So can high blood pressure, infections, and birth defects.
4. Smoking is one of the biggest causes of cardiovascular disease. Just a few cigarettes a day can damage the blood vessels and reduce the amount of oxygen available in the blood.

## Loading the Data

In [None]:
df=pd.read_csv('../input/heart-disease-uci/heart.csv')
df.head(5)

It's a clean, easy to understand set of data. However, the meaning of some of the column headers is not obvious. Here's what they mean:

1. age: The person's age in years
2. sex: The person's sex (1 = male, 0 = female)
3. cp: The chest pain experienced (Value 0: typical angina, Value 1: atypical angina, Value 2: non-anginal pain, Value 3: asymptomatic)
4. trestbps: The person's resting blood pressure (mm Hg on admission to the hospital)
5. chol: The person's cholesterol measurement in mg/dl
6. fbs: The person's fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)
7. restecg: Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)
8. thalach: The person's maximum heart rate achieved
9. exang: Exercise induced angina (1 = yes; 0 = no)
10. oldpeak: ST depression induced by exercise relative to rest ('ST' relates to positions on the ECG plot)
11. slope: The slope of the peak exercise ST segment (Value 1: upsloping, Value 2: flat, Value 3: downsloping)
12. ca: The number of major vessels (0-3)
13. thal: A blood disorder called thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect)
14. target: Heart disease (0 = no, 1 = yes)


In [None]:
df.describe()

The `describe` function summarizes the numerical columns of the dataset. For each column it reports count, mean, std, min, max and the 25%, 50% and 75% quantiles.
As seen in this output, most columns are categorical codes, so their summary statistics are of limited use. The continuous columns that need to be treated separately are age, trestbps, chol and thalach.

Reading up on heart disease risk factors led me to the following: high cholesterol, high blood pressure, diabetes, weight, family history and smoking [3]. According to another source [4], the major factors that can't be changed are increasing age, male gender and heredity. Note that thalassemia, one of the variables in this dataset, is hereditary. Major factors that can be modified are smoking, high cholesterol, high blood pressure, physical inactivity, being overweight and having diabetes. Other factors include stress, alcohol and poor diet/nutrition.

Now I will check the whole dataset for null values and sum them up. This tells us how many values are missing in the data.


In [None]:
df.isnull().sum().sum()

There is no missing data in this dataset.
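
The single total above does not say where missing values sit when there are any. As a small sketch on a made-up frame (the `toy` values are hypothetical, not taken from heart.csv), the per-column counts can be inspected before summing:

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; the notebook itself works on `df` from heart.csv.
toy = pd.DataFrame({
    "age": [63, 37, np.nan, 56],
    "chol": [233, np.nan, np.nan, 236],
    "target": [1, 1, 0, 0],
})

missing_per_column = toy.isnull().sum()   # number of NaN values in each column
total_missing = missing_per_column.sum()  # grand total, like isnull().sum().sum()
print(missing_per_column)
print("total missing:", total_missing)
```

On the real data the same two lines applied to `df` would show a zero for every column.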

In [None]:
import seaborn as sns
plt.figure(figsize=(10,8), dpi= 80)
sns.heatmap(df.corr(), cmap='RdYlGn', center=0)

# Decorations
plt.title('Correlogram of the heart disease dataset', fontsize=22)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()

From the above correlation plot we see that cp (chest pain), thalach and slope are highly correlated with the target.
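
Reading the strongest correlations off a heatmap by eye is error-prone. A possible complement is to rank the features by the absolute value of their correlation with the target. The sketch below uses synthetic columns (`strong`, `inverse` and `noise` are invented names for illustration); on the notebook's data the same two lines would be applied to `df`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)

# Synthetic features, for illustration only.
toy = pd.DataFrame({
    "strong": target + rng.normal(0, 0.3, size=n),    # follows the target closely
    "inverse": -target + rng.normal(0, 0.5, size=n),  # negatively related to the target
    "noise": rng.normal(0, 1, size=n),                # unrelated to the target
    "target": target,
})

# Correlation of every feature with the target, ordered by absolute strength.
corr_with_target = toy.corr()["target"].drop("target")
ranked = corr_with_target.reindex(
    corr_with_target.abs().sort_values(ascending=False).index
)
print(ranked)
```

Sorting by absolute value keeps strongly negative correlations near the top, which a plain descending sort would push to the bottom.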

## Data Visualization

### TARGET

In [None]:
sns.distplot(df['target'],rug=True)
plt.show()

In [None]:
import plotly
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
col = "target"
colors = ['gold', 'blue']
# Count rows per class; rename_axis + reset_index(name=...) yields the columns [col, "count"] on old and new pandas versions
grouped = df[col].value_counts().rename_axis(col).reset_index(name="count")

## plot
trace = go.Pie(labels=grouped[col], values=grouped['count'], pull=[0.05, 0],
               marker=dict(colors=colors, line=dict(color='#000000', width=2)))
layout = {'title': 'Target(0 = No, 1 = Yes)'}
fig = go.Figure(data = [trace], layout = layout)
iplot(fig)


From the above plot we see that 45.5% of the people do not have heart disease and 54.5% do.

### SEX

In [None]:
sns.distplot(df['sex'],rug=True)
plt.show()

In [None]:
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
col = "sex"
# Count rows per class; rename_axis + reset_index(name=...) yields the columns [col, "count"] on old and new pandas versions
grouped = df[col].value_counts().rename_axis(col).reset_index(name="count")

## plot
trace = go.Pie(labels=grouped[col], values=grouped['count'], pull=[0.05, 0])
layout = {'title': 'Male(1), Female(0)'}
fig = go.Figure(data = [trace], layout = layout)
iplot(fig)

In the heart disease UCI dataset only 31.7% of the people are female and the rest are male. Next we check whether males or females are more likely to have heart disease.

In [None]:
plt.style.use('fivethirtyeight')
plt.figure(figsize=(8,5))
sns.countplot(x=df.target,hue=df.sex)
plt.legend(labels=['Female', 'Male'])

##### From the above plot we see that the rate of heart disease is higher in females than in males.
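
A count plot shows absolute numbers, and since the two groups differ in size, the rates are easier to compare in a normalized table. This is a minimal sketch with invented rows (the `toy` frame is hypothetical); `pd.crosstab` with `normalize="index"` turns each row into proportions.

```python
import pandas as pd

# Hypothetical rows: 4 females (sex = 0) and 6 males (sex = 1).
toy = pd.DataFrame({
    "sex":    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "target": [1, 1, 1, 0, 1, 1, 0, 0, 0, 0],
})

# Each row sums to 1: the share of target = 0 and target = 1 within that sex.
rate_by_sex = pd.crosstab(toy["sex"], toy["target"], normalize="index")
print(rate_by_sex)
```

In this toy frame 3 of 4 females and 2 of 6 males have the disease. The real proportions would come from running the same call on `df`.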

### Women are 4 times more likely to die from heart disease than breast cancer

* Cardiovascular disease is the leading cause of death in women in Australia with 90% of women having one risk factor. 
* The causes include high blood pressure, high cholesterol, smoking, diabetes, weight and family history.
* A woman's risk also goes up if she's had a miscarriage or had her ovaries or uterus removed.
*  Women's hearts are affected by stress and depression more than men's. Depression makes it difficult to maintain a healthy lifestyle.

### CHEST PAIN

In [None]:
sns.distplot(df['cp'],rug=True)
plt.show()

In [None]:
plt.style.use('fivethirtyeight')
plt.figure(figsize=(8,5))
sns.countplot(x=df.target,hue=df.cp)
plt.legend(labels=['0: typical angina', '1: atypical angina','2: non-anginal pain','3: asymptomatic'])

In the above plot we see that 27.2% of the people with chest pain type 0, 82% with type 1, 79.3% with type 2 and 69.5% with type 3 have heart disease. From this we observe that people with chest pain type 1 or type 2 are more likely to be affected by heart disease.
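
Percentages like these can be computed directly instead of being read off the bars. Because the target is coded 0/1, its mean within a group is the share of that group with heart disease. The rows below are made up for illustration, and the column names `patients`, `disease_rate` and `disease_pct` are my own.

```python
import pandas as pd

# Hypothetical rows, not taken from heart.csv.
toy = pd.DataFrame({
    "cp":     [0, 0, 0, 0, 1, 1, 2, 2, 2, 3],
    "target": [0, 0, 0, 1, 1, 1, 1, 1, 0, 1],
})

# Group size and share of target = 1 for every chest pain type.
summary = toy.groupby("cp")["target"].agg(patients="count", disease_rate="mean")
summary["disease_pct"] = (100 * summary["disease_rate"]).round(1)
print(summary)
```

Keeping the group size next to the percentage helps to spot rates that rest on only a handful of patients.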

### THALACH

In [None]:
sns.distplot(df['thalach'],rug=True)
plt.show()

People with a maximum heart rate above 140 are more likely to have heart disease. We therefore conclude that it is worth checking our heart rate monthly and, if it is above 140, consulting a doctor and paying closer attention to our health.

### FASTING BLOOD SUGAR

In [None]:
plt.style.use('fivethirtyeight')
plt.figure(figsize=(8,5))
sns.countplot(x=df.target,hue=df.fbs)
# fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
plt.title('fbs(> 120 mg/dl)')
plt.legend(labels=['0: False', '1: True'])

51.2% of the people with fasting blood sugar above 120 mg/dl and 55% of the people with fasting blood sugar below 120 mg/dl are affected by heart disease.

### AGE

In [None]:
col='age'
d1=df[df['target']==0]
d2=df[df['target']==1]
# Count patients per age; rename_axis + reset_index(name=...) yields the columns [col, 'count'] on old and new pandas versions
v1=d1[col].value_counts().rename_axis(col).reset_index(name='count')
v1['percent']=100*v1['count']/v1['count'].sum()
v1=v1.sort_values(col)
v2=d2[col].value_counts().rename_axis(col).reset_index(name='count')
v2['percent']=100*v2['count']/v2['count'].sum()
v2=v2.sort_values(col)
trace1 = go.Scatter(x=v1[col], y=v1["count"], name=0, marker=dict(color="blue"),mode='lines+markers')
trace2 = go.Scatter(x=v2[col], y=v2["count"], name=1, marker=dict(color="red"),mode='lines+markers')
data = [trace1, trace2]
layout={'title':"Target With Respect to age",'xaxis':{'title':"Age"}}
fig = go.Figure(data, layout=layout)
iplot(fig)

From this plot we see that people aged between 40 and 65 are more likely to be affected by heart disease, so people in this age range should be more conscious of their health.

## Symptoms

* Chest pain, chest tightness, chest pressure and chest discomfort (angina)
* Shortness of breath
* Pain, numbness, weakness or coldness in your legs or arms if the blood vessels in those parts of your body are narrowed
* Pain in the neck, jaw, throat, upper abdomen or back.
* Heart failure is also an outcome of heart disease, and breathlessness can occur when the heart becomes too weak to circulate blood.
* Some heart conditions occur with no symptoms at all, especially in older adults and individuals with diabetes.

In [None]:
chest_pain=pd.get_dummies(df['cp'],prefix='cp',drop_first=True)
df=pd.concat([df,chest_pain],axis=1)
df.drop(['cp'],axis=1,inplace=True)
sp=pd.get_dummies(df['slope'],prefix='slope')
th=pd.get_dummies(df['thal'],prefix='thal')
frames=[df,sp,th]
df=pd.concat(frames,axis=1)
df.drop(['slope','thal'],axis=1,inplace=True)
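
The encoding cell above relies on `pd.get_dummies`, once with `drop_first=True` (for `cp`) and twice without it (for `slope` and `thal`). The toy example below, with invented values, shows the difference between the two settings.

```python
import pandas as pd

# Hypothetical chest pain codes.
toy = pd.DataFrame({"cp": [0, 1, 2, 3, 1]})

full = pd.get_dummies(toy["cp"], prefix="cp")                      # one column per category
reduced = pd.get_dummies(toy["cp"], prefix="cp", drop_first=True)  # first category dropped

print(list(full.columns))
print(list(reduced.columns))
```

With `drop_first=True` the dropped category is implied when all remaining indicator columns are zero, which avoids perfectly collinear columns. That matters mostly for linear models; tree-based models such as the ones trained later are usually not sensitive to it.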

In [None]:
X = df.drop(['target'], axis = 1)
y = df.target.values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)
from sklearn.preprocessing import StandardScaler
sc_X=StandardScaler()
X_train=sc_X.fit_transform(X_train)
X_test=sc_X.transform(X_test)

StandardScaler transforms each feature so that it has mean 0 and standard deviation 1. Each value has its feature's mean subtracted and is then divided by that feature's standard deviation. Here the mean and standard deviation are computed on the training set only (`fit_transform`) and then reused to transform the test set (`transform`).
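
To make the arithmetic concrete, here is the same standardization written out with plain NumPy on a tiny made-up matrix (the `*_toy` arrays are hypothetical). It is a sketch of the formula z = (x - mean) / std applied per column, using the population standard deviation (`ddof=0`), which is what StandardScaler uses.

```python
import numpy as np

# Two hypothetical features (say age and chol); rows are patients.
X_train_toy = np.array([[50.0, 200.0], [60.0, 240.0], [70.0, 280.0]])
X_test_toy = np.array([[60.0, 200.0]])

# Statistics come from the training data only.
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (X_train_toy - mean) / std
test_scaled = (X_test_toy - mean) / std  # test set reuses the training mean and std

print(train_scaled)
print(test_scaled)
```

Reusing the training statistics for the test set keeps information about the test data out of the preprocessing step.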

# Training & Testing of Model

### 1. XGBoost

> XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solve many data science problems in a fast and accurate way. The same code runs on major distributed environment (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.
>
> XGBoost is a decision-tree-based ensemble Machine Learning algorithm that uses a gradient boosting framework. In prediction problems involving unstructured data (images, text, etc.) ... A wide range of applications: Can be used to solve regression, classification, ranking, and user-defined prediction problems.

In [None]:
import xgboost as xgb
from xgboost import XGBClassifier
# nthread is an older alias of n_jobs and seed of random_state, so each is set only once
alg = XGBClassifier(learning_rate=0.01, n_estimators=2000, max_depth=8,
                    min_child_weight=0, gamma=0, subsample=0.52, colsample_bytree=0.6,
                    objective='binary:logistic', scale_pos_weight=1,
                    random_state=27, reg_alpha=5, reg_lambda=2, booster='gbtree',
                    n_jobs=-1, max_delta_step=0, colsample_bylevel=0.6, colsample_bynode=0.6)
alg.fit(X_train, y_train)
print('train accuracy', alg.score(X_train, y_train))
print('test accuracy', alg.score(X_test, y_test))

In [None]:
import scikitplot as skplt
xgb_prob = alg.predict_proba(X_test)

In [None]:
skplt.metrics.plot_roc(y_test, xgb_prob)
skplt.metrics.plot_ks_statistic(y_test, xgb_prob)
skplt.metrics.plot_cumulative_gain(y_test, xgb_prob)
skplt.metrics.plot_lift_curve(y_test, xgb_prob)

In [None]:
probas_list1 = [alg.predict_proba(X_test)]
xy=['xgb']
skplt.metrics.plot_calibration_curve(y_test,
                                      probas_list1,
                                    xy)

In [None]:
from yellowbrick.classifier import ClassificationReport,ConfusionMatrix
classes=[0,1]
visualizer = ClassificationReport(alg, classes=classes)
#visualizer.fit(X_train, y_train)  
visualizer.score(X_test, y_test)  
g = visualizer.poof()

### 2. Random Forest
 
> Random forest is a supervised learning algorithm which is used for both classification and regression. However, it is mainly used for classification problems. As we know, a forest is made up of trees, and more trees means a more robust forest. Similarly, the random forest algorithm creates decision trees on data samples, gets the prediction from each of them and finally selects the best solution by means of voting. It is an ensemble method which is better than a single decision tree because it reduces over-fitting by averaging the result.
>
> The random forest is a classification algorithm consisting of many decision trees. It uses bagging and feature randomness when building each individual tree to try to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree.

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=600,random_state=0, n_jobs= -1)
rf = rf.fit(X_train, y_train)
print('train accuracy',rf.score(X_train, y_train))
print('test accuracy',rf.score(X_test,y_test))

In [None]:
import scikitplot as skplt
rdf_prob = rf.predict_proba(X_test)

In [None]:
skplt.metrics.plot_roc(y_test, rdf_prob)
skplt.metrics.plot_ks_statistic(y_test, rdf_prob)
skplt.metrics.plot_cumulative_gain(y_test, rdf_prob)
skplt.metrics.plot_lift_curve(y_test, rdf_prob)
plt.show()

In [None]:
probas_list1 = [rf.predict_proba(X_test)]
xy=['rdf']
skplt.metrics.plot_calibration_curve(y_test,
                                      probas_list1,
                                    xy)

In [None]:
from yellowbrick.classifier import ClassificationReport,ConfusionMatrix
classes=[0,1]
visualizer = ClassificationReport(rf, classes=classes)
#visualizer.fit(X_train, y_train)  
visualizer.score(X_test, y_test)  
g = visualizer.poof()

### Plots Used for showing the Classification Results

#### ROC curve
* A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.
* A precision-recall curve is a plot of the precision (y-axis) and the recall (x-axis) for different thresholds, much like the ROC curve. A no-skill classifier is one that cannot discriminate between the classes and would predict a random class or a constant class in all cases
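
To show what "varying the discrimination threshold" means in practice, here is a minimal NumPy sketch that builds ROC points by hand and integrates them with the trapezoidal rule. The helper `roc_points` and the toy labels and scores are my own, and the sketch assumes there are no tied scores. It is not the code behind `skplt.metrics.plot_roc`.

```python
import numpy as np

def roc_points(y_true, scores):
    """Return FPR and TPR arrays obtained by lowering the threshold one sample at a time.

    Assumes binary labels (0/1) and no tied scores.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="mergesort")  # highest score first
    y_sorted = y_true[order]
    tps = np.cumsum(y_sorted == 1)  # true positives accepted so far
    fps = np.cumsum(y_sorted == 0)  # false positives accepted so far
    tpr = np.concatenate([[0.0], tps / tps[-1]])
    fpr = np.concatenate([[0.0], fps / fps[-1]])
    return fpr, tpr

y_toy = [0, 0, 1, 1]
s_toy = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_toy, s_toy)

# Area under the curve via the trapezoidal rule.
auc_toy = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
print(fpr, tpr, auc_toy)
```

For these four samples the area comes out at 0.75. A classifier that ranks every positive above every negative would reach 1.0, and random ranking hovers around 0.5.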

#### Lift curve
> **A lift curve is a way of visualizing the performance of a classification model. It shows the ratio of the model's result to that of a random guess ('model cumulative sum' / 'random guess'). Lift curves are closely related to, and frequently confused with, cumulative gains charts.**
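
As a rough illustration of that ratio, the sketch below computes the lift for the top fraction of samples ranked by predicted score. The helper `lift_at` and the toy data are assumptions for illustration, not what scikit-plot runs internally.

```python
import numpy as np

def lift_at(y_true, scores, fraction):
    """Lift when targeting the top `fraction` of samples ranked by score."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(scores, dtype=float), kind="mergesort")
    k = max(1, int(round(fraction * len(y_true))))
    top_rate = y_true[order][:k].mean()  # positive rate among the k best-scored samples
    base_rate = y_true.mean()            # positive rate of a random pick
    return top_rate / base_rate

# Hypothetical labels and scores, already listed from highest to lowest score.
y_toy = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
s_toy = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

print(lift_at(y_toy, s_toy, 0.2))
print(lift_at(y_toy, s_toy, 1.0))
```

A lift of 2.5 in the top 20% means the model finds positives there 2.5 times as often as random selection would. When the whole population is targeted the lift is always 1.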

#### Calibration curve
> **In classification, a calibration curve (also called a reliability diagram) plots, for each bin of predicted probabilities, the observed fraction of positive samples against the mean predicted probability. A well-calibrated model follows the diagonal.**
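
In the classification setting of this notebook, the calibration plot compares predicted probabilities with how often the positive class actually occurs. The sketch below shows one way such points could be computed with equal-width bins. The helper `calibration_bins` and the toy data are hypothetical. For the models above, `prob_pos` would correspond to the second column of `predict_proba(X_test)`.

```python
import numpy as np

def calibration_bins(y_true, prob_pos, n_bins=5):
    """Mean predicted probability and observed positive fraction per non-empty bin."""
    y_true = np.asarray(y_true, dtype=float)
    prob_pos = np.asarray(prob_pos, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitizing against the interior edges yields bin ids 0 .. n_bins - 1.
    bin_ids = np.digitize(prob_pos, edges[1:-1])
    mean_pred, frac_pos = [], []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():  # skip bins that received no predictions
            mean_pred.append(prob_pos[mask].mean())
            frac_pos.append(y_true[mask].mean())
    return np.array(mean_pred), np.array(frac_pos)

y_toy = [0, 0, 0, 1, 0, 1, 1, 1]
p_toy = [0.1, 0.1, 0.3, 0.3, 0.7, 0.7, 0.9, 0.9]
mean_pred, frac_pos = calibration_bins(y_toy, p_toy, n_bins=5)
print(mean_pred)
print(frac_pos)
```

Points above the diagonal indicate that the model underestimates the probability of the positive class in that bin, and points below indicate overestimation.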

#### KS plot
> **Kolmogorov-Smirnov chart measures performance of classification models. K-S is a measure of the degree of separation between the positive and negative distributions.**
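
The degree of separation can be written down as the largest gap between the cumulative score distributions of the two classes. Below is a small NumPy sketch of that statistic. The helper `ks_statistic` and the toy scores are illustrative assumptions, not the scikit-plot implementation.

```python
import numpy as np

def ks_statistic(y_true, scores):
    """Largest gap between the empirical score CDFs of the negative and positive class."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    thresholds = np.unique(scores)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Share of each class scoring at or below every observed threshold.
    cdf_pos = np.array([(pos <= t).mean() for t in thresholds])
    cdf_neg = np.array([(neg <= t).mean() for t in thresholds])
    return float(np.max(np.abs(cdf_neg - cdf_pos)))

ks_perfect = ks_statistic([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])  # classes fully separated
ks_mixed = ks_statistic([0, 1, 0, 1], [0.1, 0.2, 0.3, 0.4])    # classes interleaved
print(ks_perfect, ks_mixed)
```

The statistic ranges from 0 (identical score distributions) to 1 (perfect separation). The two toy cases give 1.0 and 0.5.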

#### Cumulative gain plot
> **The cumulative gains chart shows the percentage of the overall number of cases in a given category "gained" by targeting a percentage of the total number of cases.**
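
Put as code, the chart answers the question: if we contact the top x% of cases ranked by the model, what share of all positives do we capture? The following sketch computes both axes. The helper `cumulative_gain` and the toy data are hypothetical.

```python
import numpy as np

def cumulative_gain(y_true, scores):
    """Share of samples targeted vs. share of all positives captured, best scores first."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(scores, dtype=float), kind="mergesort")
    hits = np.cumsum(y_true[order] == 1)  # positives found so far
    pct_targeted = np.arange(1, len(y_true) + 1) / len(y_true)
    pct_gained = hits / hits[-1]
    return pct_targeted, pct_gained

# Hypothetical labels and scores, already listed from highest to lowest score.
y_toy = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
s_toy = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
pct_targeted, pct_gained = cumulative_gain(y_toy, s_toy)
print(np.column_stack([pct_targeted, pct_gained]))
```

In this toy example the top 20% of cases already contain half of all positives. A random ordering would capture about 20% of them at that point, which is the diagonal baseline in the chart.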

## Prevention

* Quit smoking.
* Control other health conditions, such as high blood pressure, high cholesterol and diabetes.
* Exercise at least 30 minutes a day on most days of the week.
* Eat a diet that's low in salt and saturated fat.
* Maintain a healthy weight.
* Reduce and manage stress.
* Practice good hygiene.

### Summary
We started with data exploration, where we got a feeling for the dataset, checked for missing data and looked at which features are related to the target. During this process we used Plotly, seaborn and matplotlib for the visualizations. During preprocessing we one-hot encoded the categorical features and standardized the inputs. Afterwards we trained XGBoost and random forest models and evaluated them on a held-out test set with ROC, KS, cumulative gain, lift and calibration plots and a classification report. There is still room for improvement, such as applying cross-validation, doing more extensive feature engineering by comparing and plotting the features against each other, and identifying and removing noisy features. You could also try some ensemble learning.

## I hope this kernel is helpful for you