# **INTRODUCTION**

### In this kernel, I analyze the Heart Disease dataset. The goal is to better understand how the various factors in the dataset relate to heart disease. The original database contains 76 attributes, but all published experiments use a subset of 14 of them, so we will use these 14 features for the analysis.


# Columns:
* **age**: age in years
* **sex**: (1 = male; 0 = female)
* **cp**: chest pain type
* **trestbps**: resting blood pressure (in mm Hg on admission to the hospital)
* **chol**: serum cholesterol in mg/dl
* **fbs**: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
* **restecg**: resting electrocardiographic results
* **thalach**: maximum heart rate achieved
* **exang**: exercise induced angina (1 = yes; 0 = no)
* **oldpeak**: ST depression induced by exercise relative to rest
* **slope**: the slope of the peak exercise ST segment
* **ca**: number of major vessels (0-3) colored by fluoroscopy
* **thal**: 3 = normal; 6 = fixed defect; 7 = reversible defect
* **target**: 1 or 0

### Import the required libraries


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # Graphs
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

# **1. Load the dataset**

In [None]:
df = pd.read_csv("../input/heart.csv")

### Now that our data is loaded, let us see what kind of data we have.

# **2. Understand the data using descriptive Statistics**

In [None]:
df.columns


In [None]:
df.shape

In [None]:
df.head(10)

In [None]:
df.dtypes

In [None]:
df.describe()

>  1. There are 303 rows and 14 columns.
>  2. Every attribute is of int type except 'oldpeak', which is of float type.
>  3. The average age of people in the dataset is about 54 years, the minimum age is 29 and the maximum age is 77.

### Let us check whether there are any null values in our dataset.

In [None]:
df.isnull().sum()

### As we can see, there are no null values, so we can use our data for further analysis.

# **3. Understanding the data using Visualisation**

In [None]:
df.hist(figsize=(12,12))

In [None]:
df.plot(kind='box',subplots=True,layout=(4,4),sharex=False,sharey=False,figsize=(18,18))

### Let us see whether there is any relationship between the attributes.

In [None]:
df.corr()

### The table above is hard to read at a glance, so let us draw a correlation plot for a better understanding.

In [None]:
fig=plt.figure(figsize=(15,15))
ax=fig.add_subplot(111)
cax=ax.matshow(df.corr(),vmin=-1,vmax=1)
fig.colorbar(cax)
ticks=np.arange(0,14,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(df.columns)
ax.set_yticklabels(df.columns)

### The plot shows these pairs of related attributes:

* **Positive relationships**<br>
cp and target<br>
thalach and slope<br>
thalach and target<br>
slope and target<br>

* **Negative relationships**<br>
oldpeak and slope<br>
cp and exang<br>
age and thalach
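
Reading pairs off a colour matrix is error-prone, so one option is to sort the correlations with the target numerically. The sketch below uses a small synthetic frame as a stand-in for `df` (the column names `feat_pos`, `feat_neg` and `feat_noise` are made up for illustration); the same one-liner should work on the real `df`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the heart dataset (hypothetical values, not real data).
rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)
demo = pd.DataFrame({
    "feat_pos": target + rng.normal(0, 0.5, size=n),    # rises with target
    "feat_neg": -target + rng.normal(0, 0.5, size=n),   # falls with target
    "feat_noise": rng.normal(0, 1, size=n),             # unrelated to target
    "target": target,
})

# Correlation of every feature with the target, sorted from most negative to most positive.
corr_with_target = demo.corr()["target"].drop("target").sort_values()
print(corr_with_target)
```

Features at the top of the printed list are the most negatively correlated with the target and those at the bottom the most positively correlated.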

# **4. Further analysis**

- ## Let us see how many people have heart disease

In [None]:
df.groupby('target').size()

 - people free from heart disease = 138
 - people with heart disease      = 165

In [None]:
# Share of rows with target == 1 out of all rows
p_risk = (len(df.loc[df['target'] == 1]) / len(df)) * 100
print("Percentage of people at risk : ", p_risk)

- ## Now let us see whether the sex of a person affects the risk.

In [None]:
abc = pd.crosstab(df['sex'],df['target'])
abc

- Number of males free from risk   = 114
- Number of females free from risk = 24
- Number of males at risk          = 93
- Number of females at risk        = 72

In [None]:
# Each comparison needs its own parentheses because & binds tighter than ==
female_risk_percent = (len(df.loc[(df['sex']==0) & (df['target']==1)])/len(df.loc[df['sex']==0]))*100
male_risk_percent = (len(df.loc[(df['sex']==1) & (df['target']==1)])/len(df.loc[df['sex']==1]))*100
print('percentage males at risk : ',male_risk_percent)
print('percentage females at risk : ',female_risk_percent)

### In this dataset, a larger share of females (72 of 96, 75%) than males (93 of 207, about 45%) have target = 1. Let us plot sex against target for a clearer view.

In [None]:
abc.plot(kind='bar', stacked=False, color=['#f5b7b1','#a9cce3'])
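
The per-sex percentages can also be read directly from a row-normalised crosstab. This is a minimal sketch that rebuilds a frame with the same sex/target counts as the crosstab above (the `demo` frame and `counts` dict are illustrative stand-ins for `df`).

```python
import pandas as pd

# Rebuild a frame with the same sex/target counts as the crosstab above.
counts = {(0, 0): 24, (0, 1): 72, (1, 0): 114, (1, 1): 93}  # (sex, target): count
rows = [(sex, target) for (sex, target), k in counts.items() for _ in range(k)]
demo = pd.DataFrame(rows, columns=["sex", "target"])

# normalize="index" divides each row by its row total, giving percentages per sex.
rates = pd.crosstab(demo["sex"], demo["target"], normalize="index") * 100
print(rates.round(1))
```

On the real data, `pd.crosstab(df['sex'], df['target'], normalize='index')` should give the same table in one call.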

- ## We should also see how the risk of heart disease varies with age

Let us draw a bar plot of age against target.

In [None]:
xyz = pd.crosstab(df.age,df.target)
xyz.plot(kind='bar',stacked=False,figsize=(15,8))

### We can see that people between the ages of 40 and 55 are at higher risk of heart disease in this dataset.
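
Raw counts per age mix up how many people of that age are in the sample with how often they have target = 1. One way to separate the two is to group ages into bands and look at the rate per band. The sketch below uses a small hypothetical sample, and the band edges are an arbitrary choice for illustration.

```python
import pandas as pd

# Small hypothetical sample (not taken from heart.csv).
demo = pd.DataFrame({
    "age":    [34, 41, 45, 52, 54, 58, 61, 63, 67, 70],
    "target": [1,  1,  1,  1,  0,  0,  1,  0,  0,  0],
})

# Group ages into bands; right=False makes each band include its lower edge.
bins = [29, 40, 55, 65, 78]
labels = ["29-39", "40-54", "55-64", "65-77"]
demo["age_band"] = pd.cut(demo["age"], bins=bins, labels=labels, right=False)

# Share of target == 1 and group size per band.
summary = demo.groupby("age_band", observed=False)["target"].agg(["mean", "size"])
print(summary)
```

Showing the group size next to the rate makes it easier to tell a high rate in a well-populated band from one based on only a few people.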

- ## Let us see how chest pain type is related to heart disease.

In [None]:
pqr = pd.crosstab(df.cp,df.target)
pqr

In [None]:
pqr.plot(kind='bar',figsize=(12,5))

### We can see that a person with chest pain type 2 has a higher chance of heart disease, while a person with chest pain type 0 has a much lower risk.

- ## See the relationship between thal and the risk of heart disease

In [None]:
mno = pd.crosstab(df.thal,df.target)
mno

In [None]:
mno.plot(kind='bar', stacked=False, color=['#2471a3','#ec7063'],figsize=(12,5))

### We can see that thal type 2 is associated with a much higher risk of heart disease.

### **We could analyze the data further, but first let us do some feature selection and create models for our data.**

# **5. Splitting data into train and test sets.**

In [None]:
array = df.values
X = array[:, 0:13]
y = array[:, 13]

seed = 7
tsize = 0.2

We have selected the first 13 columns as features and the 14th column (target) as the label.<br>
We choose target as the label because:<br>
target = 0  &nbsp;&nbsp; no risk of heart disease<br>
target = 1  &nbsp;&nbsp; risk of heart disease
<br>
Now let us split our data for training our model.

In [None]:
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=tsize, random_state=seed)

We divided our data as:<br>
Training data = 80%<br>
Testing data = 20%<br>
<br>

Now let us preprocess our data.

# **6. Preprocess your data for Machine Learning** 

### From sklearn, we choose StandardScaler to preprocess our data. StandardScaler standardizes each attribute to mean = 0 and SD = 1, which works best when the attribute is roughly Gaussian.

In [None]:
from sklearn.preprocessing import StandardScaler

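scale = StandardScaler()
# Fit the scaler on the training data only
X_train_scale = scale.fit_transform(X_train)
X_train = pd.DataFrame(X_train_scale)
# Reuse the training mean and SD on the test data (transform, not fit_transform)
X_test_scale = scale.transform(X_test)
X_test = pd.DataFrame(X_test_scale)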
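
To make the scaling step concrete, here is a minimal sketch on tiny hypothetical matrices (`X_train_demo` and `X_test_demo` are made-up values). It checks that `StandardScaler` applies z = (x - mean) / SD with the statistics learned from the training data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny hypothetical feature matrices (two features, arbitrary values).
X_train_demo = np.array([[50.0, 120.0], [60.0, 140.0], [70.0, 160.0]])
X_test_demo = np.array([[55.0, 130.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(X_train_demo)  # learns mean and SD from training data
test_scaled = scaler.transform(X_test_demo)        # reuses the training mean and SD

# The same result by hand: z = (x - mean) / SD, using the population SD (ddof=0).
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)
manual_test = (X_test_demo - mean) / std

print(test_scaled)
print(np.allclose(test_scaled, manual_test))
```

Calling `fit_transform` on the test set instead would compute a separate mean and SD from the test rows, so train and test features would no longer be on the same scale.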

# **7. Now let us create various models for training our data**

### **In the code below, I use various classification algorithms to train the model.**

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC

In [None]:
models=[]
models.append(('LR  :', LogisticRegression()))
models.append(('LDA :', LinearDiscriminantAnalysis()))
models.append(('KNN :', KNeighborsClassifier()))
models.append(('CART:', DecisionTreeClassifier()))
models.append(('NB  :', GaussianNB()))
models.append(('SVM :', SVC()))


In [None]:
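results = []
names = []
score = 'accuracy'
seed = 7
folds = 10
# Note: this re-split starts from the raw X, so the models below see unscaled features
X_train, X_validation, y_train, y_validation = train_test_split(X,y,test_size=0.2,random_state=seed)


for name, model in models:
    # shuffle=True is needed for random_state to have an effect in KFold
    kfold = KFold(n_splits=folds, shuffle=True, random_state=seed)
    # Pass the KFold object so that the 10 folds are actually used
    cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=score)
    results.append(cv_results)
    names.append(name)
    msg ="%s %f (%f)" % (name,cv_results.mean()*100,cv_results.std()*100)
    print(msg)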
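
If scaling should be part of the cross-validation, one common pattern is to wrap the scaler and the model in a scikit-learn pipeline so that the scaler is re-fitted on the training part of every fold. The sketch below is not the setup used in this kernel; it runs on synthetic data from `make_classification` as a stand-in for the heart data, so its scores say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the heart data (303 rows, 13 features).
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=7)

# The scaler is re-fitted on the training part of every fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(pipe, X_demo, y_demo, cv=kfold, scoring="accuracy")

print("%.2f (%.2f)" % (scores.mean() * 100, scores.std() * 100))
```

The same pipeline object could be placed in the `models` list in place of a bare classifier.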
    

### **We can see that Linear Discriminant Analysis and Logistic Regression have almost the same accuracy, but the standard deviation of LR is lower than that of LDA, so we will use LR for further examination.**

### Let us draw a box plot to compare the different algorithms.

In [None]:
qwerty =['LR', 'LDA', 'KNN', 'CART', 'NB', 'SVM'] 

fig = plt.figure(figsize=(10,10))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(qwerty)
plt.show()

# **8. Finally, let us predict on our test data with the trained model.**

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

In [None]:
LR = LogisticRegression()
LR.fit(X_train, y_train)
predictions = LR.predict(X_validation)
print(accuracy_score(y_validation, predictions)*100)
print(classification_report(y_validation, predictions))

### We got an overall accuracy of 73.7% for our trained model.


## **Even though the accuracy is lower than I expected, I am very happy because this is the completion of my first project. If you have come this far, please upvote and comment so that I can improve.**