# Heart Disease UCI: data analysis and machine learning classification

**This notebook covers the following steps:**

* Take a first look at the data.
* Detailed data analysis.
* Extracting and deducing information during the analysis.
* Data cleaning and transformation.
* Data splitting.
* Applying machine learning algorithms to classify the data.


<h1 align="center"> EDA: Heart Disease UCI and applied ML </h1>
<center><img src="https://storage.googleapis.com/kaggle-datasets-images/538044/983409/21d3eec8c2d5c04b7014f61ae3b516be/dataset-card.jpg?t=2020-03-03-06-25-57" width="60%" ></center>


In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

# Reading Data 

In [None]:
data=pd.read_csv("../input/heart-disease-uci/heart.csv")

# View the data.

In [None]:
data.head()

In [None]:
list(data.columns)

**There are thirteen features and one target as below:**

* age: The person's age in years
* sex: The person's sex (1 = male, 0 = female)
* cp: The chest pain experienced (Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic)
* trestbps: The person's resting blood pressure (mm Hg on admission to the hospital)
* chol: The person's cholesterol measurement in mg/dl
* fbs: The person's fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)
* restecg: Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)
* thalach: The person's maximum heart rate achieved
* exang: Exercise induced angina (1 = yes; 0 = no)
* oldpeak: ST depression induced by exercise relative to rest
* slope: The slope of the peak exercise ST segment (Value 1: upsloping, Value 2: flat, Value 3: downsloping)
* ca: The number of major vessels (0-3)
* thal: A blood disorder called thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect)

* target: Heart disease (0 = no, 1 = yes)

In [None]:
data.shape

In [None]:
data.info()

In [None]:
data_dtype = data.dtypes
data_dtype.value_counts()

In [None]:
data.isnull().sum()

In [None]:
# Let's look at the values contained in each column.
for i in data.columns:
  print(i, data[i].unique())

In [None]:
# Pass the column by keyword; recent seaborn releases no longer accept it positionally.
sns.countplot(x="target", data=data)
plt.title("Countplot for diagnosis")
plt.show()

In [None]:
# Number of samples in each class.
data["target"].value_counts()

In [None]:
# Correlation of each feature with the target, sorted from highest to lowest.
data.corr()['target'].sort_values(ascending=False)

# EDA : General data exploration

In [None]:
data.hist(figsize=(18,10))
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(20,20))
sns.heatmap(data.corr(), annot=True, linewidths=.5, ax=ax)
plt.show()

# EDA for each column

# EDA : age column

In [None]:
from scipy.stats import norm

In [None]:
#descriptive statistics summary
data['age'].describe()

In [None]:
data["age"].value_counts()

In [None]:
sns.boxplot(x='age', data=data)
plt.xlabel('Age Column Distribution')

> **From the results, we note that most of the data is concentrated between 45 and 64 years.**

> **The ages in the data range from 29 to 77 years.**
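The box and whiskers read off the plot can also be checked numerically. The following is a minimal sketch on hypothetical ages (not the real column), assuming the common convention of fences at 1.5 times the interquartile range:

```python
import pandas as pd

# Hypothetical ages, used only to illustrate the calculation.
ages = pd.Series([29, 41, 45, 52, 54, 56, 58, 61, 64, 77])

q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are the ones drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), outliers={outliers.tolist()}")
```

Replacing `ages` with `data['age']` should give the numbers behind the box plot above.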

In [None]:
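# Let's see the distribution of the age.
# Histogram with a density curve (distplot is deprecated in recent seaborn releases).
sns.histplot(data['age'], kde=True, stat='density')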

In [None]:
#skewness and kurtosis
print("Skewness: %f" % data['age'].skew())
print("Kurtosis: %f" % data['age'].kurt())

In [None]:
data.head(10)

> **Let's look at the age distribution (kernel density estimate) for each target class.**

In [None]:
fig = sns.FacetGrid(data, hue="target", aspect=4)
# fill=True replaces the deprecated shade=True argument.
fig.map(sns.kdeplot, 'age', fill=True)
oldest = data['age'].max()
fig.set(xlim=(0, oldest))
fig.add_legend()

> **From the plot, we note that cases of the disease start to appear at a noticeable rate at about age 33, become more frequent between approximately 43 and 46 years, reach their largest number between approximately 50 and 58 years, and then gradually decline.**

In [None]:
sns.countplot(x='sex', data=data, hue='target')

> **The following plot shows the ages of men and women, colored by target.**

In [None]:
sns.catplot(data=data,x="sex" , y="age" , hue="target" , palette="husl")

# EDA : trestbps column

**The ideal blood pressure for adults ranges from 90/60 mm Hg to 120/80 mm Hg. Blood pressure fluctuates significantly in the short term (from one heartbeat to the next, from minute to minute, from hour to hour, and from day to day) and also changes over longer periods such as weeks, months, seasons, and even years. For an accurate evaluation, a doctor should therefore rely on the average of two or more readings, taken on three or more visits.**

In [None]:
data["trestbps"].describe()

In [None]:
sns.boxplot(x='trestbps', data=data)
plt.xlabel('trestbps Column Distribution')

**From the figure, we notice that most of the data is concentrated between 120 and 140 mm Hg, but there are also some values far from this cluster, reaching up to 200 mm Hg.**

In [None]:
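# Let's see the distribution of the data.
# Histogram with a density curve (distplot is deprecated in recent seaborn releases).
sns.histplot(data['trestbps'], kde=True, stat='density')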

In [None]:
#skewness and kurtosis
print("Skewness: %f" % data['trestbps'].skew())
print("Kurtosis: %f" % data['trestbps'].kurt())

> **From the results and the figure, we notice that the data has a positive skew, but it is very slight.**

> **The light right tail suggests that a small percentage of the values are outliers.**
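To illustrate how a right tail shows up in the skewness value, here is a small sketch on hypothetical readings (not taken from the dataset). A symmetric sample has a skewness of about zero, while a single large reading pushes it above zero:

```python
import pandas as pd

# Hypothetical readings; the second series has one large value forming a right tail.
symmetric = pd.Series([110, 120, 130, 140, 150])
right_tailed = pd.Series([110, 120, 130, 140, 200])

print("symmetric skew:   ", symmetric.skew())
print("right-tailed skew:", right_tailed.skew())
```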

In [None]:
data["trestbps"].unique()

> **Here we divide the data into categories to make plotting and extracting information easier.**

In [None]:
bins  =   [90,101,112,123,134,145,156,167,178,189,200,211]
# With right=False each bin includes its left edge, so the first bin is 90-100 and the last is 200-210.
labels = ['90-100', '101-111','112-122','123-133','134-144','145-155','156-166','167-177','178-188','189-199','200-210']
data['Max_pressure'] = pd.cut(data['trestbps'],right=False , bins= bins,labels = labels)
data['Max_pressure']
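The behavior of `pd.cut` with `right=False` at the bin edges can be checked on a few hand-picked values. This is a minimal sketch; the values, bins, and labels below are hypothetical and chosen only to sit on and outside the edges:

```python
import pandas as pd

# Hypothetical values chosen to sit on and outside the bin edges.
values = pd.Series([89, 90, 100, 101, 210, 211])
bins = [90, 101, 211]
labels = ["90-100", "101-210"]

# right=False makes each bin include its left edge and exclude its right edge.
binned = pd.cut(values, bins=bins, labels=labels, right=False)
print(binned.tolist())
```

Values below the first edge or at and above the last edge are not assigned to any bin and come back as `NaN`, which is worth keeping in mind when the same bins are reused for another column.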

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x="Max_pressure",hue="target" , data=data,palette="rocket")

**Note that blood pressure values between 112 and 144 show the highest incidence of the disease, as shown in the figure.**

In [None]:
fig = sns.FacetGrid(data, hue="Max_pressure", aspect=4)
# fill=True replaces the deprecated shade=True argument.
fig.map(sns.kdeplot, 'age', fill=True)
oldest = data['age'].max()
fig.set(xlim=(0, oldest))
fig.add_legend()


> **We note a sharp rise in the values from 43 to 68 years, and this age range has the highest incidence of the disease.**

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x="sex",hue="Max_pressure" , data=data,palette="rocket")

> **We note that males appear more often than females in the elevated resting blood pressure categories (mm Hg on admission to the hospital).**

In [None]:
# Now we will see the shape of the relationship between the column and the target column.
sns.swarmplot(x=data['target'],
              y=data['trestbps'])

# EDA : cp column

In [None]:
data["cp"].describe()

In [None]:
# Reveal the different values included in the column.
data["cp"].unique()

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x='cp', data=data, hue='target')

> **We note from the results that people with chest pain of type 1 and type 2 are more likely to have the disease.**

In [None]:
plt.figure(figsize=(10,6))
sns.countplot(x="sex",hue="cp" , data=data,palette="rocket")

> **In general, we note that males are more likely than females to suffer from chest pain, and specifically from chest pain of the first type.**

In [None]:
fig = sns.FacetGrid(data, hue="sex", aspect=4)
# fill=True replaces the deprecated shade=True argument.
fig.map(sns.kdeplot, 'cp', fill=True)
max_cp = data['cp'].max()
fig.set(xlim=(0, max_cp))
fig.add_legend()

In [None]:
# Pass figsize to plot(); a separate plt.figure() call would leave an empty figure behind.
data.groupby('cp')['target'].value_counts(normalize=False).unstack().plot(kind='bar', figsize=(10,6))
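Raw counts can be hard to compare when the chest pain types have different sizes. As a possible complement, `normalize=True` gives the share of each target value within every group. This is a sketch on hypothetical rows; only the column names mirror the dataset:

```python
import pandas as pd

# Hypothetical rows; the column names mirror the dataset.
toy = pd.DataFrame({
    "cp":     [0, 0, 0, 0, 1, 1, 2, 2],
    "target": [0, 0, 0, 1, 1, 1, 1, 0],
})

# Share of each target value within every chest pain type (each row sums to 1).
shares = (
    toy.groupby("cp")["target"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)
print(shares)
```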

# EDA : thalach column
* The person's maximum heart rate achieved

> **The resting heart rate for a normal person ranges between 60-100 beats per minute**

In [None]:
data["thalach"].describe()

In [None]:
sns.boxplot(x='thalach', data=data)
plt.xlabel('thalach Column Distribution')

> **We notice that most heart rate values fall between 130 and 170, with some outliers.**

In [None]:
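# Let's see the distribution of the data.
# Histogram with a density curve (distplot is deprecated in recent seaborn releases).
sns.histplot(data['thalach'], kde=True, stat='density')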

In [None]:
#skewness and kurtosis
print("Skewness: %f" % data['thalach'].skew())
print("Kurtosis: %f" % data['thalach'].kurt())

In [None]:
data["thalach"].unique()

In [None]:
# Now we will see the shape of the relationship between the column and the target column.
sns.swarmplot(x=data['target'],
              y=data['thalach'])

> **It is clear that this column contains many distinct values, so we divide them into categories to make it easier to extract information.**

In [None]:

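bins  =   [90,101,112,123,134,145,156,167,178,189,200,211]
# With right=False each bin includes its left edge, so the first bin is 90-100 and the last is 200-210.
labels = ['90-100', '101-111','112-122','123-133','134-144','145-155','156-166','167-177','178-188','189-199','200-210']
# Values outside the range [90, 211) are not assigned to any bin and become NaN.
data['Max_Heart_Rate'] = pd.cut(data['thalach'],right=False , bins= bins,labels = labels)
data['Max_Heart_Rate']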

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x='Max_Heart_Rate', data=data, hue='target')

> **People with a maximum heart rate between 145 and 188 are more likely to have the disease, as shown in the figure.**

In [None]:
plt.figure(figsize=(10,6))
sns.countplot(x="sex",hue="Max_Heart_Rate" , data=data,palette="rocket")

> **From the figure, we conclude that men are more likely than women to have a higher heart rate, which is a concern for men.**

In [None]:
fig = sns.FacetGrid(data, hue="Max_Heart_Rate", aspect=4)
# fill=True replaces the deprecated shade=True argument.
fig.map(sns.kdeplot, 'age', fill=True)
oldest = data['age'].max()
fig.set(xlim=(0, oldest))
fig.add_legend()

> **Here we try to show the relationship between age and high heart rate, but the results are not clear, so we add a new column that holds the age groups.**

In [None]:
bins  =   [29,41,53,65,78]
labels = ['29-40', '41-52','53-64','65-77']
data['Age_Group'] = pd.cut(data['age'],right=False , bins= bins,labels = labels)
data['Age_Group']

In [None]:
data.head()

In [None]:
plt.figure(figsize=(20,10))
sns.countplot(x='Age_Group', data=data, hue='Max_Heart_Rate')

> **Here the results are clearer: the age group most susceptible to an increased heart rate is 53 to 64 years, followed by 41 to 52 years.**

In [None]:
plt.figure(figsize=(10,6))
sns.countplot(x="Age_Group",hue="sex" , data=data,palette="rocket")

> **Here, we note that men greatly outnumber women in the 53 to 64 age group, followed by the 41 to 52 age group, where men also outnumber women. These are also the age groups most affected by an increased heart rate, and thus by heart disease, with men at the forefront. This strongly confirms the results we obtained previously.**

> **We have now analyzed the features most strongly associated with the target column, so I will stop the analysis here and start developing the models.**

# Data preprocessing 

In [None]:
data.head()

In [None]:
# Restore the data to its original form
# by dropping the helper columns we created during the analysis.
data = data.drop(["Max_Heart_Rate", "Max_pressure", "Age_Group"], axis=1)

# Outliers.

> **We will get rid of the outliers that we observed in the previous columns.**

In [None]:
# Here we extracted the columns with outliers.
df_out = data[["age", "trestbps", "chol", "thalach", "oldpeak"]]
df_out.describe()

In [None]:
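infor = data.describe()

# Keep rows within three standard deviations of the mean (upper bound for
# trestbps, chol and oldpeak; lower bound for thalach). All conditions are
# combined into one mask built on `data`, so the indexes always match.
mask = (
    (data.trestbps < infor.loc["mean", "trestbps"] + 3 * infor.loc["std", "trestbps"])
    & (data.chol < infor.loc["mean", "chol"] + 3 * infor.loc["std", "chol"])
    & (data.thalach > infor.loc["mean", "thalach"] - 3 * infor.loc["std", "thalach"])
    & (data.oldpeak < infor.loc["mean", "oldpeak"] + 3 * infor.loc["std", "oldpeak"])
)
new_data = data[mask]
new_data.head()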
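The three-standard-deviation rule used above can also be wrapped in a small helper. The sketch below is an illustration only: `within_three_sigma` is a hypothetical name, the data is made up, and unlike the cell above it checks both sides of the mean:

```python
import pandas as pd

def within_three_sigma(frame: pd.DataFrame, column: str) -> pd.Series:
    """Boolean mask: True where `column` lies within mean +/- 3 std."""
    mean = frame[column].mean()
    std = frame[column].std()
    return (frame[column] - mean).abs() <= 3 * std

# Hypothetical cholesterol values with one extreme reading.
toy = pd.DataFrame({"chol": [200] * 10 + [210] * 10 + [600]})
mask_toy = within_three_sigma(toy, "chol")
filtered = toy[mask_toy]
print(len(toy), "->", len(filtered))
```

Masks for several columns can be combined with `&` before indexing, so that every condition is evaluated against the same DataFrame.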

# Data Splitting

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [None]:
target=new_data["target"]
features=new_data.drop(["target"],axis=1)

In [None]:
# Standardize the features (zero mean, unit variance).
# Note: the scaler is fitted on all rows here, before the train/test split.
scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
X = scaler.fit_transform(features)

In [None]:
# The test set is kept small (10%) because the dataset itself is very small.
x_train,x_test,y_train,y_test=train_test_split(X,target,test_size=0.1,random_state=0)

In [None]:
print(x_train.shape,x_test.shape,y_train.shape,y_test.shape)
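In the cells above the scaler sees every row before the split. A common alternative, sketched here on synthetic data with hypothetical variable names, is to split first and learn the scaling parameters from the training rows only, so that no information from the test rows leaks into preprocessing:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix and labels standing in for the real data.
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_toy = np.array([0, 1] * 50)

X_train_raw, X_test_raw, y_train_toy, y_test_toy = train_test_split(
    X_raw, y_toy, test_size=0.1, random_state=0, stratify=y_toy
)

# Learn the mean and standard deviation from the training rows only,
# then apply the same transformation to the test rows.
toy_scaler = StandardScaler()
X_train_scaled = toy_scaler.fit_transform(X_train_raw)
X_test_scaled = toy_scaler.transform(X_test_raw)

print(X_train_scaled.shape, X_test_scaled.shape)
```

The `stratify` argument is optional; it is used here so that both classes keep the same proportions in the small test split.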

# Let's start by developing machine learning algorithms for classification.

# 1. SVC Algorithm

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

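# Note: max_iter=20 caps the solver at 20 iterations, so it may stop before converging.
SVCModel = SVC(kernel='rbf',  # other options: 'linear', 'poly', 'sigmoid', 'precomputed'
               max_iter=20, C=0.01, gamma='auto')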

SVCModel.fit(x_train, y_train)

In [None]:
# Calculating details
print('SVCModel Train Score is : ', SVCModel.score(x_train, y_train))
print('SVCModel Test Score is : ', SVCModel.score(x_test, y_test))
# Calculating predictions
y_pred = SVCModel.predict(x_test)
print('Predicted values for SVCModel are : ', y_pred[:20])
print('Real values for SVCModel are : \n', y_test[:20])

In [None]:
#Calculating Confusion Matrix
CM = confusion_matrix(y_test, y_pred)
print('Confusion Matrix is : \n', CM)

# Drawing the confusion matrix, with the count written in each cell
sns.heatmap(CM, annot=True, fmt='d')
plt.show()
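The confusion matrix contains everything needed to compute accuracy, precision, and recall by hand. A minimal sketch with a hypothetical 2x2 matrix (the numbers are not results from this notebook), assuming scikit-learn's layout where rows are true classes and columns are predicted classes:

```python
import numpy as np

# Hypothetical 2x2 confusion matrix in scikit-learn's layout:
# rows are true classes, columns are predicted classes.
cm = np.array([[12, 3],
               [2, 14]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the predicted positives, how many were correct
recall = tp / (tp + fn)      # of the actual positives, how many were found
print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}")
```

For a medical screening task, recall on the disease class is often the number to watch, since a false negative means a missed patient.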

# 2. Random Forest Algorithm

In [None]:
from sklearn.ensemble import RandomForestClassifier
RandomForestClassifierModel = RandomForestClassifier(
    n_estimators=100, criterion='gini', max_depth=7,
    min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0,
    max_features='sqrt',  # same behavior as the older 'auto' setting for classifiers
    max_leaf_nodes=7, min_impurity_decrease=0.0,
    # min_impurity_split is omitted: it was removed from recent scikit-learn releases
    bootstrap=True, oob_score=False, n_jobs=-1,
    random_state=0, verbose=0, warm_start=True)
RandomForestClassifierModel.fit(x_train, y_train)

In [None]:
#Calculating Details
print('RandomForestClassifierModel Train Score is : ' , RandomForestClassifierModel.score(x_train, y_train))
print('RandomForestClassifierModel Test Score is : ' , RandomForestClassifierModel.score(x_test, y_test))
print('RandomForestClassifierModel features importances are : ' , RandomForestClassifierModel.feature_importances_)
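The raw importance array is easier to read when each value is paired with its column name. The sketch below shows the idea on synthetic data; the column names and the model settings are hypothetical and chosen so that only one feature carries signal:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: only the first column carries signal.
rng = np.random.default_rng(0)
toy_features = pd.DataFrame(
    rng.normal(size=(200, 3)),
    columns=["informative", "noise_a", "noise_b"],
)
toy_target = (toy_features["informative"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(toy_features, toy_target)

# Pair every importance with its column name and sort from most to least important.
importances = pd.Series(forest.feature_importances_, index=toy_features.columns)
print(importances.sort_values(ascending=False))
```

In this notebook the same pairing should work with `pd.Series(RandomForestClassifierModel.feature_importances_, index=features.columns)`, since the scaled array keeps the column order of `features`.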

In [None]:
# Calculating predictions
y_pred = RandomForestClassifierModel.predict(x_test)
y_pred_prob = RandomForestClassifierModel.predict_proba(x_test)
print('Predicted values for RandomForestClassifierModel are : ', y_pred[:10])
print('Real values of y_test are : \n', y_test[:10])
print('Prediction probabilities for RandomForestClassifierModel are : ', y_pred_prob[:10])

In [None]:
#Calculating Confusion Matrix
CM = confusion_matrix(y_test, y_pred)
print('Confusion Matrix is : \n', CM)

# Drawing the confusion matrix, with the count written in each cell
sns.heatmap(CM, annot=True, fmt='d')
plt.show()

# Conclusion

> **This notebook is now complete. We explored and analyzed the data, examined individual columns and extracted information from them, and then trained machine learning algorithms for classification, with fairly good results.**

> **Regarding the results of the machine learning algorithms, it is certainly possible to reach higher accuracy, but the current results are acceptable for now.**
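Because the test split here is small, a single accuracy score can vary a lot from one split to another. One way to get a steadier estimate is k-fold cross-validation. The sketch below uses synthetic data and hypothetical settings, not the notebook's actual features or tuned models:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the heart disease features.
X_toy, y_toy = make_classification(n_samples=300, n_features=13, random_state=0)

# The pipeline refits the scaler inside every fold, using that fold's training rows only.
model = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
scores = cross_val_score(model, X_toy, y_toy, cv=5)
print("fold accuracies:", scores.round(3))
print("mean accuracy:  ", round(scores.mean(), 3))
```

The spread between the fold accuracies gives a rough idea of how much a single train/test split can be trusted.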

# Thank you very much for your time