<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300" alt="cognitiveclass.ai logo">
</center>

# Forecasting Deaths from Heart Failure Based on Medical Measurements

Estimated time needed: **30** minutes

## Objectives

After completing this lab you will be able to:

*   Be confident about your data analysis skills


<p>
You can find the "Heart Disease Dataset UCI" at the following link: <br><a href="https://www.kaggle.com/datasets/ketangangal/heart-disease-dataset-uci" target="_blank">https://www.kaggle.com/datasets/ketangangal/heart-disease-dataset-uci</a>. <br><br>
The data is available under the <a href="https://creativecommons.org/publicdomain/zero/1.0/" target="_blank">CC0: Public Domain</a> license. <br><br>
We will use this dataset in this lab. It contains medical information about patients who may suffer from heart disease. Compared to our previous dataset, this one has similar columns, except for the <code>target</code> column, which indicates the presence of heart disease in the patient. It is integer-valued: 0 = no disease and 1 = disease.
</p>

You will need the following libraries:


In [1]:
!pip install dython
!conda install --yes -c conda-forge imbalanced-learn

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from dython.nominal import associations
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split
from imblearn.pipeline import make_pipeline
from sklearn.metrics import ConfusionMatrixDisplay  # plot_confusion_matrix was removed in scikit-learn 1.2
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
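from sklearn.metrics import accuracy_score
from sklearn.tree import plot_tree
from imblearn.over_sampling import RandomOverSampler
from sklearn import *  # wildcard import kept from the original; it exposes sklearn submodules such as tree and metrics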

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
If an error appears, please restart the kernel or run this block again.
</div>

<b>Importing the Data</b>


You will need to download the dataset. If you are running locally, please comment out the following cell and set <code>path</code> to your local copy of the file.


In [3]:
path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0N40EN/heart_disease.csv'

Load the csv:


In [4]:
df = pd.read_csv(path)

Set the number of decimal places displayed for float values:

In [5]:
pd.options.display.float_format = '{:.2f}'.format

We use the method <code>head()</code> to display the first 5 rows of the dataframe:


In [6]:
df.head()

| Unnamed: 0 | age | sex | chest_pain_type | resting_blood_pressure | cholestoral | fasting_blood_sugar | rest_ecg | Max_heart_rate | exercise_induced_angina | oldpeak | slope | vessels_colored_by_flourosopy | thalassemia | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 52 | Male | Typical angina | 125 | 212 | Lower than 120 mg/ml | ST-T wave abnormality | 168 | No | 1.0 | Downsloping | Two | Reversable Defect | 0 |
| 1 | 53 | Male | Typical angina | 140 | 203 | Greater than 120 mg/ml | Normal | 155 | Yes | 3.1 | Upsloping | Zero | Reversable Defect | 0 |
| 2 | 70 | Male | Typical angina | 145 | 174 | Lower than 120 mg/ml | ST-T wave abnormality | 125 | Yes | 2.6 | Upsloping | Zero | Reversable Defect | 0 |
| 3 | 61 | Male | Typical angina | 148 | 203 | Lower than 120 mg/ml | ST-T wave abnormality | 161 | No | 0.0 | Downsloping | One | Reversable Defect | 0 |
| 4 | 62 | Female | Typical angina | 138 | 294 | Greater than 120 mg/ml | ST-T wave abnormality | 106 | No | 1.9 | Flat | Three | Fixed Defect | 0 |


<details>
<summary><b>Click to see attribute information</b></summary>

Input features (column names):

1. `age` - patient's age in years
2. `sex` - patient's sex ('Male', 'Female')
3. `chest_pain_type` - chest pain type ('typical angina', 'atypical angina', 'non-anginal pain', 'asymptomatic')
4. `resting_blood_pressure` - resting blood pressure
5. `cholestoral` - serum cholesterol in mg/dl
6. `fasting_blood_sugar` - fasting blood sugar > 120 mg/dl
7. `rest_ecg` - resting electrocardiographic results ('normal', 'ST-T wave abnormality', 'Left ventricular hypertrophy')
8. `Max_heart_rate` - maximum heart rate achieved
9. `exercise_induced_angina` - exercise induced angina ('Yes', 'No')
10. `oldpeak` - ST depression induced by exercise relative to rest
11. `slope` - the slope of the peak exercise ST segment ('Upsloping', 'Downsloping', 'Flat')
12. `vessels_colored_by_flourosopy` - number of major vessels colored by fluoroscopy
13. `thalassemia` - normal; fixed defect; reversible defect

Output feature (desired target):

14. `target` - does the patient have heart disease? (binary)
</details>

<b>Question 1</b>: Display the data type of each column using the attribute `dtypes`.


<b>Question 2</b>: Check whether this dataset contains NaN values.
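A minimal sketch for Questions 1 and 2. It uses a tiny hand-made frame called `toy` instead of the lab's `df`, and one NaN is planted deliberately so the check has something to find; both are assumptions for illustration only. For a whole DataFrame the attribute is `dtypes`, and `isna().sum()` counts missing values per column.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the lab's `df`; not the real dataset.
toy = pd.DataFrame({
    "age": [52, 53, 70],
    "sex": ["Male", "Male", "Female"],
    "oldpeak": [1.0, np.nan, 2.6],  # one NaN planted on purpose
})

# Question 1: data type of each column
column_types = toy.dtypes
print(column_types)

# Question 2: missing values per column, and whether any exist at all
nan_counts = toy.isna().sum()
has_nan = bool(toy.isna().values.any())
print(nan_counts)
print("Any NaN:", has_nan)
```

On the real data the same two calls would be `df.dtypes` and `df.isna().sum()`.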


<b>Question 3:</b> Check the correlation (numerical values) and association (objects) of each pair of columns.
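One possible approach to Question 3, sketched on made-up rows. Numeric columns use the Pearson correlation from pandas. For categorical columns the lab imports dython's `associations`, which handles mixed column types in one call; the hand-written `cramers_v` helper below is a hypothetical stand-in (Cramér's V without bias correction) that only shows the idea and may not match dython's numbers exactly.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(a, b):
    """Cramér's V between two categorical Series (0 = no association, 1 = perfect)."""
    table = pd.crosstab(a, b)
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else 0.0


# Illustrative rows only; the categorical values are chosen to be perfectly associated.
toy = pd.DataFrame({
    "age": [52, 53, 70, 61, 62, 58],
    "Max_heart_rate": [168, 155, 125, 161, 106, 140],
    "sex": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "exercise_induced_angina": ["No", "No", "No", "Yes", "Yes", "Yes"],
})

# Correlation between numeric columns
numeric_corr = toy.select_dtypes(include="number").corr()
print(numeric_corr)

# Association between two categorical columns
v = cramers_v(toy["sex"], toy["exercise_induced_angina"])
print("Cramér's V:", v)
```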


<b>Question 4:</b> Divide the dataset into input features and the target.


<b>Question 5</b>: Create a column transformer using `OrdinalEncoder()` and `StandardScaler()` and visualize it.
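A sketch covering Questions 4 and 5 on a five-row subset of columns taken from the `head()` output above. Selecting columns by dtype is an assumption about how you might want to route them: text columns go to `OrdinalEncoder`, numeric columns to `StandardScaler`.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# A few columns from the first five rows shown earlier in the lab.
toy = pd.DataFrame({
    "age": [52, 53, 70, 61, 62],
    "sex": ["Male", "Male", "Male", "Male", "Female"],
    "oldpeak": [1.0, 3.1, 2.6, 0.0, 1.9],
    "slope": ["Downsloping", "Upsloping", "Upsloping", "Downsloping", "Flat"],
    "target": [0, 0, 0, 0, 0],
})

# Question 4: inputs versus target
X = toy.drop(columns="target")
y = toy["target"]

# Question 5: route columns by dtype
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()
numeric_cols = X.select_dtypes(include="number").columns.tolist()

preprocessor = make_column_transformer(
    (OrdinalEncoder(), categorical_cols),  # text categories -> integer codes
    (StandardScaler(), numeric_cols),      # numeric -> zero mean, unit variance
)
X_prepared = preprocessor.fit_transform(X)

print(preprocessor)      # text representation of the transformer
print(X_prepared.shape)
```

In a notebook, calling `sklearn.set_config(display="diagram")` and then evaluating `preprocessor` as the last line of a cell renders it as a diagram instead of text.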


<b>Question 6:</b> Split the dataset into training and test sets, holding out 30% of the rows for testing (`test_size=0.3`).
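A minimal sketch of the split in Question 6, using placeholder arrays in place of the lab's inputs and target. Reading "0.3" as the test fraction, the use of `stratify`, and the `random_state` value are assumptions; stratifying keeps the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples, 4 features, balanced binary target.
X = np.arange(400).reshape(100, 4)
y = np.array([0, 1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,     # 30% of rows go to the test set
    stratify=y,        # preserve the 0/1 ratio in both splits
    random_state=42,   # reproducible shuffle
)

print(X_train.shape, X_test.shape)
```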


<b>Question 7:</b> Create a pipeline using the `LogisticRegression()` model and show its accuracy and recall score.
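One way Question 7 could look, sketched on synthetic data with a made-up rule for the target (these are not real patient records). scikit-learn's own `make_pipeline` is used here; the lab imports the imblearn variant, which is called the same way and additionally accepts samplers.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Synthetic stand-in for the lab data (NOT real patient records).
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "age": rng.integers(30, 80, n),
    "Max_heart_rate": rng.integers(90, 200, n),
    "sex": rng.choice(["Male", "Female"], n),
})
# Made-up rule so that the target is learnable.
signal = (toy["age"] - 55) / 10 - (toy["Max_heart_rate"] - 145) / 30
toy["target"] = (signal + rng.normal(0, 0.5, n) > 0).astype(int)

X, y = toy.drop(columns="target"), toy["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

preprocessor = make_column_transformer(
    (OrdinalEncoder(), ["sex"]),
    (StandardScaler(), ["age", "Max_heart_rate"]),
)
model = make_pipeline(preprocessor, LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)  # share of true disease cases that were found
print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

Recall matters here because a missed disease case (a false negative) is usually costlier than a false alarm.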


<b>Question 8:</b> Calculate the cross-validation score using 4 folds, compute the mean and standard deviation of the scores, and predict the output.
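A sketch of Question 8 on generated data. `make_classification` is a stand-in for the preprocessed heart data, and the scaler plus logistic regression pipeline is an assumed model; with `cv=4`, `cross_val_score` returns one accuracy per fold and `cross_val_predict` returns an out-of-fold prediction for every row.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Generated stand-in data, not the lab's dataset.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# One accuracy value per fold
scores = cross_val_score(model, X, y, cv=4)
mean_score, std_score = scores.mean(), scores.std()
print(f"accuracy: {mean_score:.2f} +/- {std_score:.2f}")

# Each row is predicted by a model that did not see it during training
y_cv_pred = cross_val_predict(model, X, y, cv=4)
```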

<b>Question 9:</b> Plot the confusion matrix to evaluate the correctness of the classification.


<b>Question 10</b>: Check whether the classes in the target column are roughly balanced; if they are not, use `RandomOverSampler()`.


<b>Question 11</b>: Test different classifiers including `VotingClassifier()` and calculate their accuracy.


<b>Question 12</b>: Compare the accuracy of classifiers and build a plot to visualize it.
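Questions 11 and 12 could be approached roughly as follows. The choice of three base classifiers, hard voting, and the generated dataset are all assumptions for illustration; the lab's imports offer several more models that can be added to the dictionary in the same way.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line inside a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Generated stand-in data, not the lab's dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

classifiers = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
}
# The voting ensemble takes a majority vote over the base models above.
classifiers["Voting"] = VotingClassifier(
    estimators=list(classifiers.items()), voting="hard"
)

# Question 11: test accuracy of each classifier
accuracies = {}
for name, clf in classifiers.items():
    model = make_pipeline(StandardScaler(), clf)
    model.fit(X_train, y_train)
    accuracies[name] = model.score(X_test, y_test)
    print(f"{name}: {accuracies[name]:.2f}")

# Question 12: bar chart of the accuracies
fig, ax = plt.subplots()
ax.bar(list(accuracies), list(accuracies.values()))
ax.set_ylabel("Test accuracy")
ax.set_ylim(0, 1)
ax.set_title("Classifier comparison (toy data)")
```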


<b>Question 13</b>: Create a pipeline based on a Decision Tree, then calculate and visualize its accuracy. Use `max_depth = 3` so that the nodes can be seen clearly.

<b>Question 14</b>: Visualize the Decision Tree using the `plot_tree` function.
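A sketch for Questions 13 and 14 on generated data. The feature names and class labels are placeholders; with the real pipeline, the names would come from the column transformer. The fitted tree is pulled out of the pipeline as its last step before being passed to `plot_tree`.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line inside a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Generated stand-in data with placeholder feature names.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Question 13: shallow tree so every node stays readable
model = make_pipeline(
    StandardScaler(), DecisionTreeClassifier(max_depth=3, random_state=0)
)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"accuracy: {accuracy:.2f}")

# Question 14: draw the fitted tree (the last step of the pipeline)
fitted_tree = model[-1]
fig, ax = plt.subplots(figsize=(14, 6))
plot_tree(
    fitted_tree,
    feature_names=feature_names,
    class_names=["no disease", "disease"],
    filled=True,
    ax=ax,
)
```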

<b>Question 15</b>: Write a `create_ensemble()` function that creates an ensemble from a predetermined number of classifiers. Make a pipeline with it, fit it, and calculate its accuracy.

<b>Question 16</b>: Write a `make_prediction()` function that returns an answer to whether the patient has heart disease. Its input should be a DataFrame and a classifier.
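One possible shape for the two functions in Questions 15 and 16. The signatures, the generated step names, hard voting, and the answer strings are all assumptions, and the data is synthetic with a made-up target rule; treat this as a sketch under those assumptions, not the reference solution.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier


def create_ensemble(classifiers, voting="hard"):
    """Wrap a list of unfitted classifiers in a VotingClassifier."""
    estimators = [
        (f"{type(clf).__name__.lower()}_{i}", clf)
        for i, clf in enumerate(classifiers)
    ]
    return VotingClassifier(estimators=estimators, voting=voting)


def make_prediction(patients, classifier):
    """Return a readable answer for each row of `patients` using a fitted model."""
    labels = classifier.predict(patients)
    return ["Heart disease" if label == 1 else "No heart disease" for label in labels]


# Synthetic stand-in for the lab data (NOT real patient records).
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "age": rng.integers(30, 80, n),
    "Max_heart_rate": rng.integers(90, 200, n),
    "sex": rng.choice(["Male", "Female"], n),
})
signal = (toy["age"] - 55) / 10 - (toy["Max_heart_rate"] - 145) / 30
toy["target"] = (signal + rng.normal(0, 0.5, n) > 0).astype(int)

X, y = toy.drop(columns="target"), toy["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

preprocessor = make_column_transformer(
    (OrdinalEncoder(), ["sex"]),
    (StandardScaler(), ["age", "Max_heart_rate"]),
)
ensemble = create_ensemble([
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(max_depth=3, random_state=0),
    RandomForestClassifier(n_estimators=50, random_state=0),
])
model = make_pipeline(preprocessor, ensemble)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"ensemble accuracy: {accuracy:.2f}")

answers = make_prediction(X_test.head(3), model)
print(answers)
```

Passing the whole fitted pipeline as the classifier means `make_prediction` can take raw rows with the original column names, since preprocessing happens inside the pipeline.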


<b>Question 17</b>: Create a new ensemble of your own list of classifiers using the first function.  

<b>Question 18</b>: Predict the output with your own data, using the second function and the ensemble you just obtained.

<b>Sources</b>


<a href="https://www.kaggle.com/datasets/ketangangal/heart-disease-dataset-uci" target="_blank">https://www.kaggle.com/datasets/ketangangal/heart-disease-dataset-uci</a>.

### Thank you for completing this lab!

## Author

<a href="https://author.skills.network/instructors/bohdan_kuno">Bohdan Kuno</a>

### Other Contributors

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2">Prof. Yaroslav Vyklyuk, DrSc, PhD</a>

<a href="https://author.skills.network/instructors/nataliya_boyko">Ass. Prof. Nataliya Boyko, PhD</a>

## Change Log

| Date (YYYY-MM-DD) | Version | Changed By | Change Description                                         |
| ----------------- | ------- | ---------- | ---------------------------------------------------------- |
|2023-04-01|01|Bohdan Kuno|Lab created|


<hr>

<h3 align="center"> © IBM Corporation 2020. All rights reserved. </h3>
