# Introduction

Of all the applications of machine learning, diagnosing a serious disease with a black box is always going to be a hard sell. If the output of a model is a particular course of treatment (potentially with side effects), or surgery, or the *absence* of treatment, people are going to want to know **why**.

This dataset gives a number of variables along with a target condition of having or not having heart disease. Below, the data is first used in a simple random forest model, and then the model is investigated using ML explainability tools and techniques.


# About Data

It's a clean, easy-to-understand dataset. However, the meaning of some of the column headers is not obvious. Here's what they mean:

- **age:** The person's age in years
- **sex:** The person's sex (1 = male, 0 = female)
- **cp:** The chest pain experienced (Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic)
- **trestbps:** The person's resting blood pressure (mm Hg on admission to the hospital)
- **chol:** The person's cholesterol measurement in mg/dl
- **fbs:** The person's fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false) 
- **restecg:** Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)
- **thalach:** The person's maximum heart rate achieved
- **exang:** Exercise induced angina (1 = yes; 0 = no)
- **oldpeak:** ST depression induced by exercise relative to rest ('ST' relates to positions on the ECG plot. See more [here](https://litfl.com/st-segment-ecg-library/))
- **slope:** The slope of the peak exercise ST segment (Value 1: upsloping, Value 2: flat, Value 3: downsloping)
- **ca:** The number of major vessels (0-3)
- **thal:** A blood disorder called thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect)
- **target:** Heart disease (0 = no, 1 = yes)
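
The integer codes above are compact but hard to read in tables and plots. Below is a minimal sketch of decoding a few columns with `Series.map`, assuming the encodings listed above. It uses a small made-up frame rather than the real file, and the `*_label` column names and lookup dictionaries are illustrative, not part of the dataset:

```python
import pandas as pd

# Made-up rows for illustration only; not taken from heart.csv.
sample = pd.DataFrame({"sex": [1, 0, 1], "exang": [0, 1, 0], "target": [1, 0, 1]})

# Hypothetical lookup tables built from the column descriptions above.
sex_labels = {0: "female", 1: "male"}
yes_no = {0: "no", 1: "yes"}

# assign() returns a new frame and leaves the coded columns untouched.
decoded = sample.assign(
    sex_label=sample["sex"].map(sex_labels),
    exang_label=sample["exang"].map(yes_no),
    target_label=sample["target"].map(yes_no),
)
print(decoded)
```

Keeping the original numeric columns alongside the labels means the numeric values stay available for correlation and modelling.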


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Import the data
data = pd.read_csv("/kaggle/input/heart-disease-uci/heart.csv")

In [None]:
# Summarize the columns: dtypes and non-null counts
data.info()

In [None]:
# Show the first 10 rows
data.head(10)

In [None]:
# Import data visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Show the pairwise correlations between columns as a heatmap
f, ax = plt.subplots(figsize=(18, 18))
sns.heatmap(data.corr(), annot=True, linewidths=.5, fmt='.2f', ax=ax)
plt.show()

In [None]:
# Line plot of trestbps (resting blood pressure) and thalach (maximum heart rate achieved)
# Note: the two attributes are negatively correlated; the plot is shown for demonstration only
data.trestbps.plot(kind="line", color="g", label="trestbps", linewidth=1, grid=True, linestyle=":")
data.thalach.plot(color="r", label="thalach", linewidth=1, grid=True, linestyle="-")

plt.legend(loc="upper right")  # show which line is which
plt.title("Line Plot")
plt.xlabel("Row index")  # call the function; assigning to plt.xlabel would overwrite it
plt.ylabel("Value")

plt.show()


In [None]:
# Scatter plot of trestbps (resting blood pressure) against thalach (maximum heart rate achieved)
data.plot(kind='scatter',x='trestbps',y='thalach',color='blue')
plt.title('Scatter Plot')
plt.show()

In [None]:
# Histogram of resting blood pressure
data.trestbps.plot(kind="hist",bins=50,figsize=(15,15))
plt.title("Histogram Plot")
plt.show()

In [None]:
# Filter rows with a boolean mask: resting blood pressure above 130 and cholesterol above 210
data[(data["trestbps"] > 130) & (data["chol"] > 210)]

In [None]:
# Same kind of filter using np.logical_and: age above 60 and resting blood pressure above 170
data_logicfilter = data[np.logical_and(data["age"] > 60, data["trestbps"] > 170)]
data_logicfilter
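
Both filtering styles above build a boolean mask and keep the rows where it is true. Here is a minimal sketch, on made-up rows that are not taken from the dataset, suggesting that `&`, `np.logical_and`, and `DataFrame.query` select the same rows. Each comparison is parenthesized because `&` binds more tightly than `>`:

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({"age": [45, 63, 70, 61], "trestbps": [120, 180, 150, 172]})

# Two equivalent ways of building the same element-wise mask.
mask_operator = (toy["age"] > 60) & (toy["trestbps"] > 170)
mask_numpy = np.logical_and(toy["age"] > 60, toy["trestbps"] > 170)

by_operator = toy[mask_operator]
by_numpy = toy[mask_numpy]
# query() expresses the same condition as a string.
by_query = toy.query("age > 60 and trestbps > 170")

print(by_operator)
```

Which form to use is largely a matter of taste; `query` tends to read better once several conditions are chained.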

In [None]:
# Iterate over the first 20 rows of trestbps with a loop
for index,value in data[["trestbps"]][0:20].iterrows():
    print(index,":",value)

In [None]:
# Compute the mean age with a lambda function
mean_of = lambda total, count: total / count
print(mean_of(data["age"].sum(), len(data["age"])))
# Compare with the built-in mean() method
print(data["age"].mean())

In [None]:
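# Square every chest pain code with map()
cp_values = data["cp"]  # do not name this `list`; that would shadow the built-in
squared = map(lambda x: x**2, cp_values)
print(list(squared))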
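
`map` is lazy: it returns an iterator that computes values only when consumed, and it can be consumed only once. Here is a minimal sketch with made-up values (the variable names are illustrative) comparing it with the equivalent list comprehension. It avoids binding any variable to the name `list`, since that would shadow the built-in and make a later `list(...)` call fail:

```python
cp_codes = [3, 2, 1, 0]  # made-up values for illustration

squared_lazy = map(lambda x: x ** 2, cp_codes)  # a map object; nothing is computed yet
squared_from_map = list(squared_lazy)           # consuming the iterator builds the list
squared_from_comprehension = [x ** 2 for x in cp_codes]

print(squared_from_map)
print(squared_from_map == squared_from_comprehension)

# The map object is now exhausted, so a second pass yields nothing.
remaining = list(squared_lazy)
print(remaining)
```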

In [None]:

# Iterator example
data_iter = data["chol"]
it = iter(data_iter)
print(next(it))    # print the first value and advance the iterator
print(*it)         # print all remaining values
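
An iterator remembers its position, so whatever follows a `next` call sees only the values that have not been consumed yet. Here is a minimal sketch with made-up cholesterol values, also showing the optional default argument of `next`, which avoids a `StopIteration` once the iterator is empty:

```python
chol_values = [233, 250, 204, 236]  # made-up values for illustration

it = iter(chol_values)
first = next(it)           # consumes the first value
rest = list(it)            # consumes everything that is left
leftover = next(it, None)  # iterator is exhausted, so the default is returned

print(first)
print(rest)
print(leftover)
```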



In [None]:

# zip example
data1 = data["age"]
data2 = data["sex"]
data_zip = zip(data1, data2)
print(data_zip)  # a lazy zip object, not the pairs themselves
data_zip_list = list(data_zip)  # convert to a list of (age, sex) tuples
print(data_zip_list)

In [None]:
# Unzip the data
un_zip = zip(*data_zip_list)
un_list1, un_list2 = list(un_zip)  # each unzipped column comes back as a tuple
print(un_list1)
print(un_list2)
print(type(un_list2))
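
Zipping and unzipping are inverses of each other: `zip(*pairs)` transposes a list of pairs back into its columns. Here is a minimal round-trip sketch on made-up values, with illustrative variable names:

```python
ages = [63, 37, 41]  # made-up values for illustration
sexes = [1, 1, 0]

pairs = list(zip(ages, sexes))       # list of (age, sex) tuples
ages_back, sexes_back = zip(*pairs)  # transpose back into two tuples

print(pairs)
print(ages_back, sexes_back)
print(list(ages_back) == ages and list(sexes_back) == sexes)
```

The recovered columns are tuples rather than lists, so wrap them in `list(...)` if they need to be mutable.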

In [None]:
# Label each row with a list comprehension, using the mean cholesterol as the threshold
threshold = data["chol"].sum() / len(data["chol"])
print(threshold, data["chol"].mean())
data["chol_threshold"] = ["High Chol" if i > threshold else "Low Chol" for i in data["chol"]]
data.loc[:20, ["chol_threshold", "chol"]]  # loc is covered in more detail later
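
The same labelling can also be done without a Python-level loop. Below is a minimal sketch, on a made-up frame rather than the real data, that uses `np.where` as a vectorized alternative to the list comprehension. It swaps in a different technique and is not the approach used in the cell above:

```python
import numpy as np
import pandas as pd

# Made-up cholesterol values for illustration only.
toy = pd.DataFrame({"chol": [180, 240, 300, 200]})

threshold = toy["chol"].mean()  # 230.0 for these four values

# np.where picks the first label where the condition holds, else the second.
toy["chol_threshold"] = np.where(toy["chol"] > threshold, "High Chol", "Low Chol")
print(toy)
```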