## **Setting Up**

The cell below imports the libraries which our program needs in order to run. The comments describe what each one is used for.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # for data visualisation purposes
from sklearn.tree import DecisionTreeClassifier, plot_tree # Our model and a handy tool for visualising trees
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import seaborn as sns
from scipy import stats
from scipy.stats import norm, boxcox
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## **Introduction**

In this project, I'm going to explore how to use a Decision Tree Classifier and a Decision Tree Regressor to predict the level of chest pain which a patient has. This will help in determining whether or not a patient has chest pain and how serious it is. I have chosen the Decision Tree Classifier as I believe it is the best model for this task, because the data is mostly numerical rather than categorical. I would like the model to be highly accurate (80% or above), which also means I would like my mean absolute error to be low. I plan to use the following features: sex, chol (cholesterol), thalach, age, trestbps and fbs, as they are key factors which contribute to chest pain. The data which I have picked describes factors which can cause heart disease.

## **Gather Data and Explore Data**
I will limit my search to data which is in .csv format, contains mostly numerical values with few categorical values, and doesn't have many missing values.

Once I have found a file, I will add it to the input folder.

Now that we have a file containing data, I will get it into a Pandas DataFrame and take a peek.

I selected this data as it contains mostly numerical data, which makes it easier to use for predictions. The data contains many related features which will help make accurate predictions. I am predicting the level of chest pain which a patient has, which will help patients know the level of chest pain they have. I will be using features to do this. Features are the factors which have an impact on predictions; their values help in getting accurate predictions. The cp column will be the one used for the prediction target.


In [None]:
#input data
train_file_path = '../input/heart-disease-uci/heart.csv'

#Create a new Pandas DataFrame with our training data
Heart = pd.read_csv(train_file_path)

#Heart.columns
Heart.describe(include='all')
#Heart.head()

## **Prepare Data**
In this example, we want to predict the level of chest pain which a patient has. Therefore the Chest pain column is our prediction target.

Before I separate our prediction target from the rest of the data, we need to do some preparation so that there aren't any rows with missing values as our machine learning model will not be able to handle them.

Choosing our features first will help reduce the total number of rows we need to drop (remove).

I would like to choose a selection of features that are relevant to the predictions and don't have any missing values.

In [None]:
# Reducing the data to factors which we want to keep
# The features we chose have similar 'count' values when we describe() them
selected_features = ['sex', 'chol', 'thalach','age','trestbps',
                     'fbs','cp']

# Create a new training set with the features which we wanted to keep
prepared_data = Heart[selected_features]

# Drop rows (axis=0) that contain missing values
# (drop from prepared_data, not Heart, so the feature selection above is kept)
prepared_data = prepared_data.dropna(axis=0)

# Check that you still have a good 'count' value. The value should be the same for all columns.
# If your count is very low then you may need to remove features with the lowest count.
prepared_data.describe(include='all')
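To see why choosing the features first reduces the number of rows we need to drop, here is a minimal sketch on made-up data. The `toy` DataFrame and its `notes` column are hypothetical and are not part of the heart disease data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'notes' is a column we do not need, and it has many gaps
toy = pd.DataFrame({
    'age':   [63, 37, 41, 56],
    'chol':  [233, 250, np.nan, 236],
    'notes': [np.nan, 'ok', np.nan, np.nan],
})

# Dropping rows first removes every row with a gap in ANY column
rows_all_columns = len(toy.dropna(axis=0))

# Selecting the columns we need first means only gaps in those columns matter
rows_selected = len(toy[['age', 'chol']].dropna(axis=0))

print("Rows kept when dropping across all columns:", rows_all_columns)
print("Rows kept when selecting features first:   ", rows_selected)
```

In this toy example, selecting `age` and `chol` first keeps more rows, because the gaps in the unused `notes` column no longer count.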

## **Graphs For Features**
I will now show distribution plots and probability plots for the numerical features which I am selecting, before and after skewness correction.

In [None]:
def skewnessCorrector(dataset,columnName):
    print('''Before Correcting''')
    (mu, sigma) = norm.fit(dataset[columnName])
    print("Mu before correcting {} : {}, Sigma before correcting {} : {}".format(
        columnName.capitalize(), mu, columnName.capitalize(), sigma))
    plt.figure(figsize=(20, 10))
    plt.subplot(1, 2, 1)
    sns.distplot(dataset[columnName], fit=norm, color="lightcoral");
    plt.title(columnName.capitalize() +
              " Distplot before Skewness Correction", color="black")
    plt.subplot(1, 2, 2)
    stats.probplot(dataset[columnName], plot=plt)
    plt.show()
    # Applying BoxCox Transformation (the values must be strictly positive)
    # boxcox returns the transformed values and the fitted lambda
    dataset[columnName], fitted_lambda = boxcox(
        dataset[columnName])
    
    print('''After Correcting''')
    (mu, sigma) = norm.fit(dataset[columnName])
    print("Mu after correcting {} : {}, Sigma after correcting {} : {}".format(
        columnName.capitalize(), mu, columnName.capitalize(), sigma))
    plt.figure(figsize=(20, 10))
    plt.subplot(1, 2, 1)
    sns.distplot(dataset[columnName], fit=norm, color="orange");
    plt.title(columnName.capitalize() +
              " Distplot After Skewness Correction", color="black")
    plt.subplot(1, 2, 2)
    stats.probplot(dataset[columnName], plot=plt)
    plt.show()

col = ['age','chol','thalach','trestbps']
for column in col:
    skewnessCorrector(Heart, column)
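As a rough illustration of what the Box-Cox transformation does to skewed data, the sketch below uses randomly generated values rather than the heart disease data. The names `skewed`, `transformed` and `fitted_lambda` are only for this example.

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(0)

# Synthetic right-skewed data; Box-Cox requires strictly positive values
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# boxcox returns the transformed values and the lambda it fitted
transformed, fitted_lambda = boxcox(skewed)

print("Skewness before:", skew(skewed))
print("Skewness after: ", skew(transformed))
print("Fitted lambda:  ", fitted_lambda)
```

A skewness closer to 0 after the transformation suggests the distribution has become more symmetric, which is what the "after" plots above are meant to show.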


## **Separating Features From Target**
Now that we have a set of data (as a Pandas DataFrame) without any missing values, I will be separating the features we will use for training from the target.

In [None]:
#Separate out the prediction target
y = prepared_data.cp

# Drop the target column (axis=1) from the original dataframe and use the rest as our feature data
X = prepared_data.drop('cp', axis=1)

#Taking a look at the data one more time
X.head()
#y.head()

## **One Hot Encode**
I will now start one hot encoding. It creates new columns which indicate the presence of each possible category value in the original data.

In [None]:
# One hot encode the features. This will only act on columns containing non-numerical values.
one_hot_X = pd.get_dummies(X)

one_hot_X.head()
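To show what one hot encoding looks like, here is a small sketch on a made-up table. The text values in the `sex` column are hypothetical and only chosen so there is a non-numerical column to encode.

```python
import pandas as pd

# Hypothetical toy data with one numerical and one text column
toy = pd.DataFrame({
    'age': [63, 37, 41],
    'sex': ['male', 'female', 'male'],
})

# get_dummies leaves numerical columns alone and expands text columns
encoded = pd.get_dummies(toy)

print(encoded)
```

The `sex` column is replaced by `sex_female` and `sex_male`, one column per category. If every column is already numerical, `pd.get_dummies` returns the data unchanged.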

## **Training Decision Tree Classifier**
Now that we have data our model can digest, I will now train a model to get some predictions. We're going to use a Decision Tree Classifier which is different from the Decision Tree Regressor in that it makes categorical predictions instead of continuous numerical predictions.

In this case, the category we want to predict is the level of chest pain which a patient has, with the output being Level 3, Level 2 or Level 1, or 0 if they have none. Decision Tree Classifiers are also able to work with non-numerical prediction targets.

I will also be splitting the data into training and testing sets. Splitting the data into two subsets is important because you need to have data that your model hasn't seen, which helps show how accurate the Decision Tree model really is.


In [None]:
Cp = DecisionTreeClassifier(max_depth=3)

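# Split the one hot encoded data into training and validation sets
train_X, val_X, train_y, val_y = train_test_split(one_hot_X, y, random_state = 1)

# Train the model on the one hot encoded data
# Note: the model is fitted on the full data set here, not only on train_X
Cp.fit(one_hot_X, y)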

print(Cp.classes_)

# Plotting the tree to see what it looks like 
plt.figure(figsize = (20,10))
# list(...) passes the feature names as a plain list
plot_tree(Cp,
          feature_names=list(one_hot_X.columns),
          class_names=['0 CP','1 CP','2 CP','3 CP'],
          filled=True)

#Shows the decision tree
plt.show()
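The idea of checking a model on data it hasn't seen can be sketched as follows. This is only an illustration on synthetic data from `make_classification`, not the heart disease data, and all of the `demo_` names are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 6 features and 4 classes
demo_X, demo_y = make_classification(
    n_samples=300, n_features=6, n_informative=4,
    n_classes=4, n_clusters_per_class=1, random_state=1)

# Hold back 25% of the rows (the default) as validation data
demo_train_X, demo_val_X, demo_train_y, demo_val_y = train_test_split(
    demo_X, demo_y, random_state=1)

# Fit on the training rows only
demo_model = DecisionTreeClassifier(max_depth=3, random_state=1)
demo_model.fit(demo_train_X, demo_train_y)

# Compare accuracy on rows the model has seen and rows it has not
train_acc = accuracy_score(demo_train_y, demo_model.predict(demo_train_X))
val_acc = accuracy_score(demo_val_y, demo_model.predict(demo_val_X))

print("Training accuracy:  ", train_acc)
print("Validation accuracy:", val_acc)
```

The validation accuracy is usually the more honest of the two numbers, since the model never saw those rows during training.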

## **Evaluate Model Performance**
I will now print the predictions in a table below.

In [None]:
print("Making predictions for the first 5 people in the training set.")

# Predict for every row; the table below shows the first five
pred = Cp.predict(one_hot_X)

print("The predictions are:")

#Merge actual target values and predictions in with a copy of the features to see how we went.
#Using a copy keeps X free of the target column, so it cannot leak into later models.
results = X.copy()
results['Cp'] = y
results['Predicted'] = pred

results.head()

## **Accuracy**
I will now calculate how accurate my predictions are. Since the model is scored on the same data it was trained on, this is a training accuracy.

In [None]:
#Calculates score
acc_tree = accuracy_score(y, pred)
#Will print the score
print(acc_tree)

## **Mean Absolute Error**
I will now calculate the MAE. In statistics, mean absolute error (MAE) is a measure of errors between paired observations expressing the same phenomenon. Examples of Y versus X include comparisons of predicted versus observed, subsequent time versus initial time, and one technique of measurement versus an alternative technique of measurement. This will allow me to see how far my predictions are from the actual values on average.

In [None]:
#calculates MAE
mean_absolute_error(y, pred)
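To make the MAE more concrete, here is a small worked example. MAE is the average of the absolute differences between the actual and predicted values. The `actual` and `predicted` arrays below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

# Made-up chest pain levels and predictions
actual = np.array([0, 1, 2, 3, 0])
predicted = np.array([0, 2, 2, 0, 1])

# Absolute errors are 0, 1, 0, 3, 1, so the mean is 5 / 5 = 1.0
manual_mae = np.mean(np.abs(actual - predicted))
sklearn_mae = mean_absolute_error(actual, predicted)

# Accuracy only counts exact matches: 2 out of 5
accuracy = accuracy_score(actual, predicted)

print("Manual MAE: ", manual_mae)
print("sklearn MAE:", sklearn_mae)
print("Accuracy:   ", accuracy)
```

Note that MAE treats predicting 0 when the actual value is 3 as three times worse than being off by one level, whereas accuracy counts both simply as wrong.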

## **Using A Decision Tree Regressor**
I will now be using a Decision Tree Regressor to predict the chest pain type. I have chosen this model as it makes accurate predictions with numerical data. A Decision Tree Regressor is different from the Decision Tree Classifier in that it makes continuous numerical predictions instead of categorical predictions.

In [None]:
#Making model
CP = DecisionTreeRegressor(max_depth=3)

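#Splits the one hot encoded data into training and validation sets
train_X, val_X, train_y, val_y = train_test_split(one_hot_X, y, random_state = 1)

# Fit model (on the full data set, not only on train_X)
CP.fit(one_hot_X, y)
print(CP)

# get predicted chest pain type for every row
# (these are predictions on the data the model was fitted on)
val_predictions = CP.predict(one_hot_X)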

plt.figure(figsize = (20,10))
# class_names only applies to classifiers, so it is left out for the regressor
plot_tree(CP,
          feature_names=list(one_hot_X.columns),
          filled=True)

#Shows the decision tree
plt.show()
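The difference between the two kinds of predictions can be seen in a tiny sketch. The `toy_X` and `toy_y` values are made up for illustration and the trees are kept very shallow on purpose.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Made-up data: one feature, target levels 0 to 3
toy_X = np.array([[1], [2], [3], [4], [5], [6]])
toy_y = np.array([0, 0, 1, 2, 3, 3])

clf = DecisionTreeClassifier(max_depth=1, random_state=0).fit(toy_X, toy_y)
reg = DecisionTreeRegressor(max_depth=1, random_state=0).fit(toy_X, toy_y)

clf_pred = clf.predict(toy_X)  # always one of the levels 0, 1, 2, 3
reg_pred = reg.predict(toy_X)  # leaf averages, which can fall between levels

print("Classifier predictions:", clf_pred)
print("Regressor predictions: ", reg_pred)
```

Because the regressor predicts the average target value in each leaf, its output can land between two levels, which is worth keeping in mind when comparing its MAE with the classifier's.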

## **Evaluating Model Performance**
The Decision Tree Regressor performed slightly better than the Decision Tree Classifier, with an MAE of 0.63 rather than 0.66. I believe that this is the case because Decision Tree Regressors are better suited to numerical data, whereas Decision Tree Classifiers can handle both categorical and numerical data. I will then start tuning the hyperparameters for the Decision Tree Regressor in order to make it more accurate and decrease the MAE.


In [None]:
print("Making predictions for the first 5 people in the training set.")

# Predict for every row; the table below shows the first five
pred = CP.predict(one_hot_X)

print("The predictions are:")

#Merge actual target values and predictions in with a copy of the features to see how we went.
results = X.copy()
results['CP'] = y
results['Predicted'] = pred

results.head()

## **Mean Absolute Error For Model 2**
I will now calculate the MAE for this model.

In [None]:
#Calculates MAE
mean_absolute_error(y, val_predictions)

## **Hyper Parameter Tuning**
I will now start my hyperparameter tuning, where I will try to get more accurate predictions by changing settings such as max_leaf_nodes. So far, the accuracy has not been very good, so I believe the hyperparameter tuning will help give better predictions, with a higher accuracy and a lower MAE.

In [None]:
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # Fit on the training split and score on the validation split
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae


# Loop over candidate_max_leaf_nodes to find the ideal tree size
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]

scores = {leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y) for leaf_size in candidate_max_leaf_nodes}
best_tree_size = min(scores, key=scores.get)

# Create the final model using best_tree_size
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=1)

# fit the final model
final_model.fit(one_hot_X, y)

plt.figure(figsize = (20,10))
# class_names only applies to classifiers, so it is left out for the regressor
plot_tree(final_model,
          feature_names=list(one_hot_X.columns),
          filled=True)


#Shows the decision tree
plt.show()
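As an alternative to the manual loop above, scikit-learn's `GridSearchCV` can try each candidate value using cross-validation. This is a different technique from the single train/validation split used in this notebook, and the sketch below runs on synthetic data from `make_regression`, not the heart disease data. The `demo_` names are made up for this example.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data with 6 features
demo_X, demo_y = make_regression(
    n_samples=200, n_features=6, noise=10.0, random_state=0)

# Try each max_leaf_nodes value with 5-fold cross-validation.
# scikit-learn maximises scores, so MAE is reported as a negative number.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={'max_leaf_nodes': [5, 25, 50, 100]},
    scoring='neg_mean_absolute_error',
    cv=5,
)
search.fit(demo_X, demo_y)

best_leaf_nodes = search.best_params_['max_leaf_nodes']
best_mae = -search.best_score_

print("Best max_leaf_nodes:", best_leaf_nodes)
print("Cross-validated MAE:", best_mae)
```

Cross-validation averages the error over several splits, so the chosen tree size depends less on which rows happened to land in one validation set.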


## **Conclusion**
The purpose of this investigation was to predict the level of chest pain a patient has based on the factors given. The Decision Tree predicted the level of chest pain which the patient had. The quality of the predictions has been average, with an accuracy score of 60% and an MAE of 0.6600660066006601. The accuracy of the predictions is lower than I would like it to be. I would have liked my predictions to be 80% accurate, as said in the introduction. Unfortunately the model has not met that standard, though it is somewhat accurate. The features which had a strong effect on the predictions are the following: age, cholesterol and trestbps. I noticed that the values in these data columns varied based on the chest pain type, and these patterns allowed them to have a strong effect on the predictions. The best model was model 2 (Decision Tree Regressor), which had a lower MAE (0.63) compared to model 1 (0.66). I would have hoped to tune the hyperparameters more, but due to the time I wasn't able to do this. I could have also made a random forest model to see if I would get more accurate predictions and a lower MAE.