# In Simple Steps: Classification Tree for Beginners

* This is a Classification Tree project on a heart disease prediction dataset. I got this dataset from the UC Irvine Machine Learning Repository.  
* UCI link for this dataset: "https://archive.ics.uci.edu/ml/datasets/Heart+Disease"  
* The goal of this project is to build a Classification Tree model that predicts, from the features in this dataset, whether or not a patient has heart disease.

**Credits to the creators of this dataset**

**Importing necessary libraries**

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split, cross_val_score
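# plot_confusion_matrix was removed in scikit-learn 1.2; on newer versions,
# ConfusionMatrixDisplay.from_estimator (also in sklearn.metrics) is the replacement.
from sklearn.metrics import confusion_matrix, plot_confusion_matrix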

%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from tqdm import tqdm # Progress bar

**Setting display options**

In [4]:
pd.options.display.max_columns = 100
pd.options.display.max_colwidth = 100
pd.options.display.precision = 5
pd.options.display.float_format = '{:.3f}'.format

## Step 1 - Importing data

In [5]:
df = pd.read_csv('processed.cleveland.data.csv', header = None)

# This data file can also be imported directly from the UCI portal with the following 2 lines of code.
# url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
# df = pd.read_csv(url, header = None)

df

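| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.300 | 3 | 0 | 6 | 0 |
| 1 | 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.500 | 2 | 3 | 3 | 2 |
| 2 | 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.600 | 2 | 2 | 7 | 1 |
| 3 | 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.500 | 3 | 0 | 3 | 0 |
| 4 | 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.400 | 1 | 0 | 3 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 45 | 1 | 1 | 110 | 264 | 0 | 0 | 132 | 0 | 1.200 | 2 | 0 | 7 | 1 |
| 299 | 68 | 1 | 4 | 144 | 193 | 1 | 0 | 141 | 0 | 3.400 | 2 | 2 | 7 | 2 |
| 300 | 57 | 1 | 4 | 130 | 131 | 0 | 0 | 115 | 1 | 1.200 | 2 | 1 | 7 | 3 |
| 301 | 57 | 0 | 2 | 130 | 236 | 0 | 2 | 174 | 0 | 0.000 | 2 | 1 | 3 | 1 |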


The dataset does not have column names, so the above dataframe shows column numbers instead. Replacing the column numbers with column names makes the dataframe easier to understand and therefore easier to work with. I collected the following attribute info from the dataset web page and will use the attribute names (with minor changes) as the column names.

**Attribute Information:**

1. age: age in years
2. sex: sex (1 = male; 0 = female)
3. cp: chest pain type
    -- Value 1: typical angina
    -- Value 2: atypical angina
    -- Value 3: non-anginal pain
    -- Value 4: asymptomatic
4. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
5. chol: serum cholesterol in mg/dl
6. fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
7. restecg: resting electrocardiographic results
    -- Value 0: normal
    -- Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    -- Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8. thalach: maximum heart rate achieved
9. exang: exercise induced angina (1 = yes; 0 = no)
10. oldpeak: ST depression induced by exercise relative to rest
11. slope: the slope of the peak exercise ST segment
    -- Value 1: upsloping
    -- Value 2: flat
    -- Value 3: downsloping
12. ca: number of major vessels (0-3) colored by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
14. num (the predicted attribute): diagnosis of heart disease (angiographic disease status)
    -- Value 0: < 50% diameter narrowing
    -- Value 1: > 50% diameter narrowing
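
Several of the attributes above are stored as integer codes, which can be hard to read in a raw dataframe. The sketch below shows one possible way to translate codes into the labels listed above using `Series.map`. The dictionaries `cp_labels` and `slope_labels` and the tiny `sample` frame are hypothetical helpers written for this illustration only; the `cp` and `slope` values are taken from the first three rows of the data shown earlier.

```python
import pandas as pd

# Hypothetical lookup tables, copied from the attribute descriptions above.
cp_labels = {
    1: "typical angina",
    2: "atypical angina",
    3: "non-anginal pain",
    4: "asymptomatic",
}
slope_labels = {1: "upsloping", 2: "flat", 3: "downsloping"}

# Small stand-in for the real data: cp and slope codes of the first three rows.
sample = pd.DataFrame({"cp": [1, 4, 4], "slope": [3, 2, 2]})

# Series.map looks up each code in the dictionary; codes that are not in the
# dictionary would become NaN.
readable = sample.assign(
    cp_label=sample["cp"].map(cp_labels),
    slope_label=sample["slope"].map(slope_labels),
)
print(readable)
```

The labels are only for reading the data; the numeric codes are left untouched in their original columns.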

In [6]:
df.columns = ["age","sex","cp","restbp","chol","fbs","restecg","thalach","exang","oldpeak","slope","ca","thal","hd"]
df.head(3)

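| | age | sex | cp | restbp | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | hd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0 | 6 | 0 |
| 1 | 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3 | 3 | 2 |
| 2 | 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2 | 7 | 1 |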
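
Note that the `hd` column in the output above contains values other than 0 and 1 (for example 2), while the goal is a yes/no prediction. A minimal sketch of one common way to get a binary target follows, assuming that 0 means no heart disease and any value above 0 means heart disease is present. The series `hd` below is a small stand-in built from values visible in the outputs above, and `has_hd` is a hypothetical name.

```python
import pandas as pd

# Stand-in for df["hd"]: values that appear in the outputs shown above.
hd = pd.Series([0, 2, 1, 3], name="hd")

# Assumption: 0 = no heart disease, anything greater than 0 = heart disease.
has_hd = (hd > 0).astype(int)
print(has_hd.tolist())
```

On the full dataframe the same expression would be applied to `df["hd"]`; whether to collapse the levels this way is a modelling choice rather than a requirement.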


## Step 2 - Missing data
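
As a rough illustration of this step, the sketch below shows one possible way to find and drop missing values with pandas. It assumes missing entries are marked with `?`, as they are, to the best of my knowledge, in the UCI copy of this file; a local CSV may differ. The second row is the second data row shown earlier with its `ca` value replaced by `?` purely for illustration, and the names `raw`, `sample`, `missing_per_column` and `clean` are hypothetical.

```python
import io
import pandas as pd

# Two rows in the same layout as the data file. The '?' in the second row is
# an illustrative edit, not a value taken from the real file.
raw = io.StringIO(
    "63,1,1,145,233,1,2,150,0,2.3,3,0,6,0\n"
    "67,1,4,160,286,0,2,108,1,1.5,2,?,3,2\n"
)

# na_values tells read_csv to treat '?' as a missing value (NaN).
sample = pd.read_csv(raw, header=None, na_values="?")

# Count the missing values in each column and show only the affected columns.
missing_per_column = sample.isna().sum()
print(missing_per_column[missing_per_column > 0])

# One simple option: drop every row that has at least one missing value.
clean = sample.dropna()
print(len(sample), "rows before,", len(clean), "rows after")
```

Dropping rows is reasonable when only a handful are affected; imputing the missing values is an alternative when too much data would be lost.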

## Step 3 - Formatting the data for Decision Trees

## Step 4