# Heart Disease Machine Learning Project

**Problem Definition**
### A project that models heart disease from biometric factors in order to predict its incidence.
 * Labels are binary: 0 (no heart disease) and 1 (heart disease)
    
**Data**
Taken from Kaggle.com. This is the Cleveland heart disease data set hosted by the UCI Machine Learning Repository.
https://www.kaggle.com/ronitf/heart-disease-uci

**Evaluation**
With only 303 rows, this is a small dataset, but the split between target labels 0 (no heart disease) and 1 (heart disease) is fairly even.
Also, with 14 columns (13 features plus the target) there is reasonable depth to analyze.

Our benchmark goal will be 95% accuracy  
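
To put the 95% target in context, it helps to know what a model that always predicts the most common label would score. A minimal sketch, using a hypothetical list of labels rather than the real `target` column (`majority_baseline_accuracy` is an illustrative helper, not part of the notebook):

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical, illustrative labels (not the real data set).
example_labels = [1, 1, 1, 0, 0, 1, 0, 1, 0, 1]
baseline = majority_baseline_accuracy(example_labels)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

Once the data is loaded, the same helper could be called on the `target` column to see how far above the trivial baseline any model needs to land.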
    
**Features**
* age 
* sex
* chest pain type (4 values) (cp)
    - 0 = typical angina (chest pain related to lack of blood in heart)
    - 1 = atypical angina (chest pain not related to heart)
    - 2 = non-anginal pain (typical esophageal spasm not related to heart)
    - 3 = asymptomatic
* resting blood pressure (trestbps)
    - one of the key risk factors outlined by the CDC
    - Over 120 is cause for concern
* serum cholesterol in mg/dl (chol)
    - Another key CDC risk factor 
    - Over 200 is high and over 240 is irregular
* fasting blood sugar > 120 mg/dl (fbs)
    - high blood sugar can damage blood vessels and the nerves that control your heart
    - Over 200 is irregular
* resting electrocardiographic results (values 0,1,2) (restecg)
    - can tell where the heart's blood supply is blocked or interrupted by a build-up of fatty substances.
    - Over 100 is irregular
    - 0  = No signs of irregularity
    - 1 = ST-T wave abnormality (signals non-normal heart beat)
    - 2 = definite sign of left-ventricular hypertrophy
* maximum heart rate achieved (thalach)
    - 220 minus age is the threshold
* exercise induced angina (exang)
    - chest pain brought on by exercise
* oldpeak = ST depression induced by exercise relative to rest
* slope = the slope of the peak exercise ST segment
* number of major vessels (0-3) colored by fluoroscopy (ca)
* thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
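
The thresholds quoted in the list above can be turned into simple flags for a quick sanity check of individual records. A minimal sketch: the function name and the example values are hypothetical, and the cut-offs are simply the ones stated in the list.

```python
def flag_risk_factors(age, trestbps, chol, thalach):
    """Flag a single record against the thresholds listed above."""
    return {
        "high_resting_bp": trestbps > 120,             # over 120 is cause for concern
        "high_cholesterol": chol > 200,                # over 200 is high
        "irregular_cholesterol": chol > 240,           # over 240 is irregular
        "above_max_heart_rate": thalach > 220 - age,   # 220 minus age is the threshold
    }

# Hypothetical example record.
flags = flag_risk_factors(age=54, trestbps=130, chol=250, thalach=150)
print(flags)
```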

In [None]:
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split, cross_validate
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
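from sklearn.metrics import RocCurveDisplay
# plot_roc_curve was removed in scikit-learn 1.2; alias its replacement so later cells keep working
plot_roc_curve = RocCurveDisplay.from_estimator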

## Data Exploration (EDA)

#### In this phase we want to learn as much about the data set as possible and become a subject matter expert.
* What are the questions and goals of the research?
* What is the data like? 
* Are there relationships between features?
* Is there a missing component to the data?
* Are there outliers? Are they worth keeping?
* Do any features need augmentation, removal, etc?




In [None]:
df = pd.read_csv('../input/heart-disease-dataset/heart.csv')
df.head(5)

In [None]:
#more detail about target distribution
df.target.value_counts()

In [None]:
# Are there null values
df.isna().sum()

In [None]:
# It looks clean. Let's look at the rough relationship between target and features

# Mean of every feature within each target group
target_breakdown = df.groupby('target').mean()

# Scale each feature by its larger group mean so features on different scales are comparable
target_breakdown = target_breakdown / target_breakdown.max()
transposed_for_graph = target_breakdown.T
transposed_for_graph.plot.bar(title='Comparison of Features by Target Label Group', xlabel='Feature',
                              ylabel='Group mean as a fraction of the larger group mean')
plt.savefig('Feature-and-Label-Comparison.jpg')

##### There are some clear relationships between the target and features: oldpeak, ca, cp, sex, and slope in particular.
* Women are much more likely to have heart disease in this sample.
* Incidence of chest pain appears highly correlated with heart disease.
* Exang appears more often in the group without heart disease, as does a higher number of vessels colored by fluoroscopy.
* High ECG, slope and maximum heart rate values appear to indicate a higher probability of heart disease.
* Several factors may simply be adding noise. Resting blood pressure, cholesterol and age do not appear strongly related to the target labels, as odd as that may seem.

At 13 features, we should have enough depth to discard the 3 that aren't as strongly correlated. So why not create a separate test group without them, or drop these features altogether?

We have only looked at one kind of correlation so far: each feature against the target. The features could also be correlated with each other in ways that, taken together, help predict heart disease. We now need to see how every feature correlates with every other feature.

In [None]:
# Matrix of variable correlation
correlation_matrix = df.corr()

# Set font sizes before plotting so they apply to this figure
plt.rc('axes', titlesize=24)
plt.rc('axes', labelsize=14)
plt.rc('xtick', labelsize=13)    # fontsize of the tick labels
plt.rc('ytick', labelsize=13)

fig, ax = plt.subplots(figsize=(20,20))
ax = sns.heatmap(correlation_matrix, annot=True, fmt='.2f', ax=ax)
ax.set(title='Correlation Between Heart Disease Variables')
ax.set_yticklabels(ax.get_yticklabels(), rotation=45)
plt.savefig('Variable-Correlation-Matrix.jpg')

You can see that age has some of the highest correlation with cholesterol and resting blood pressure. 

It is possible that some combination of these factors will play a part in predicting heart disease.

It is also possible that a certain subset of the data points could be of interest, in a more limited scope than general correlation.

However, machine learning makes it easy to test this hypothesis. We will create a second test group without the three features that have the least correlation with our target label.
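
Picking the least correlated features does not have to be done by eye from the heatmap. A minimal sketch on a synthetic frame (the column names `strong`, `weak` and `noise` are made up for illustration; the notebook would run the same two lines on `df`):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the heart-disease frame (assumption for illustration only).
rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)
demo = pd.DataFrame({
    "strong": target + rng.normal(0, 0.3, size=n),   # tracks the target closely
    "weak": target + rng.normal(0, 3.0, size=n),     # mostly noise
    "noise": rng.normal(0, 1.0, size=n),             # unrelated to the target
    "target": target,
})

# Absolute correlation of every feature with the target, weakest first.
target_corr = demo.corr()["target"].drop("target").abs().sort_values()
weakest = target_corr.index[:2].tolist()
print(target_corr)
print("Least correlated features:", weakest)
```

Sorting by absolute value matters here: a strongly negative correlation is just as informative as a strongly positive one.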

## Modelling our Data

**We are going to do a preliminary run with 3 models**

In [None]:
X = df.drop('target', axis=1)
y = df['target']

test_group2 = df.drop(['age', 'chol', 'trestbps'], axis=1)
X2 = test_group2.drop('target', axis=1)
y2 = test_group2['target']

np.random.seed(42) # Enables reproducibility and consistency between models and tests.
# Note: both calls draw from the same seeded generator in sequence, so the two splits contain different rows.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.2)



In [None]:
y2_test.value_counts(), y_test.value_counts(), X_train[:3], X2_train[:3]

They are not identical sets, but the label composition of the two test groups is similar, so the results should be roughly comparable.
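
If identical rows in both test groups are preferred, one option is to split both feature sets in a single `train_test_split` call so they share the same partition. A minimal sketch on synthetic data; the column names and the `random_state` value are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's data frame.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])
demo["target"] = rng.integers(0, 2, size=50)

X_full = demo.drop("target", axis=1)
X_reduced = demo.drop(["target", "d"], axis=1)   # same rows, one feature removed
y_demo = demo["target"]

# Passing several arrays to one call splits them all on the same rows.
(X_full_train, X_full_test,
 X_red_train, X_red_test,
 y_tr, y_te) = train_test_split(X_full, X_reduced, y_demo,
                                test_size=0.2, random_state=42)

print(X_full_test.index.equals(X_red_test.index))
```

With a shared partition, any difference in scores between the two groups comes from the dropped features rather than from which rows happened to land in the test set.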


In [None]:
# For the function we will set up in the next cell, we need dictionaries of the models and the tests each will face
models_dict = {'KNeighbors': KNeighborsClassifier(),
              'RandomForest': RandomForestClassifier(),
              'LogisticRegression': LogisticRegression()}

In [None]:
def model_test_results(X, y, X_train, X_test, y_train, y_test):
    """
    Fit every model in the global models_dict and collect its test metrics.

    X: the complete set of feature data
    y: the complete set of target data
    X_train: feature data for model training
    X_test: feature data for model testing
    y_train: target labels for the training set
    y_test: target labels for the test set
    """
    test_result_repository = []
    for k, v in models_dict.items():

        v.fit(X_train, y_train)
        y_preds = v.predict(X_test)
        model_test = {
                     f'{k} accuracy_score': accuracy_score(y_test, y_preds),
                     f'{k} precision_score': precision_score(y_test, y_preds),
                     f'{k} recall_score': recall_score(y_test, y_preds),
                     f'{k} f1_score': f1_score(y_test, y_preds),
                     f'{k} confusion_matrix': confusion_matrix(y_test, y_preds),
                     f'{k} classification_report': classification_report(y_test, y_preds),
                     # The ROC curve is drawn over the full data set, training rows included
                     f'{k} plot_roc_curve': plot_roc_curve(v, X, y),
                     # cross_validate refits clones of the model on 5 folds of the full data
                     f'{k} cross_validate': cross_validate(v, X, y, cv=5,
                                                     scoring=['accuracy','recall', 'precision'])}
        test_result_repository.append(model_test)
    return test_result_repository

results = model_test_results(X,y,X_train,X_test,y_train,y_test)


It looks like Random Forest and Logistic Regression far outperformed K-Nearest Neighbors in terms of false positives. Let's look at some of the other metrics we measured.

In [None]:
for k, v in results[0].items():
    print(k)
    print(v, '\n\n')
    

In [None]:
for k, v in results[1].items():
    print(k)
    print(v, '\n\n')

In [None]:
for k, v in results[2].items():
    print(k)
    print(v, '\n\n')

In [None]:
#### While it looks like Random Forest Classifier and Logistic Regression both performed well, Logistic Regression outperformed its counterpart in all measures, and was more consistent in the cross validation test.
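
Consistency across folds can be read off the `cross_validate` output by comparing the mean and standard deviation of each test score. A minimal sketch on synthetic data from `make_classification` (an assumption for illustration; the notebook would use the dictionaries already stored under the `... cross_validate` keys of `results`):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic binary classification data standing in for X and y.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

cv_results = cross_validate(LogisticRegression(max_iter=1000), X_demo, y_demo,
                            cv=5, scoring=["accuracy", "recall", "precision"])

# Score arrays are stored under 'test_<metric>'; a small std means stable folds.
summary = {
    metric: (np.mean(cv_results[f"test_{metric}"]), np.std(cv_results[f"test_{metric}"]))
    for metric in ["accuracy", "recall", "precision"]
}
for metric, (mean, std) in summary.items():
    print(f"{metric}: mean={mean:.3f}, std={std:.3f}")
```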

In [None]:
model_stats_plt = pd.DataFrame(data = [
                                       [results[0]['KNeighbors accuracy_score'],results[0]['KNeighbors recall_score']],
                                       [results[1]['RandomForest accuracy_score'], results[1]['RandomForest recall_score']],
                                       [results[2]['LogisticRegression accuracy_score'], results[2]['LogisticRegression recall_score']]
                                       ],
                               index=['KNeighbors', 'RandomForest', 'LogisticRegression'],
                               columns=['Accuracy', 'Recall'])
figure = model_stats_plt.plot.bar(figsize=(10,10), title='Model Test Comparison', xlabel='Model');

In [None]:
#We still have another test set. Luckily we can put the function we created before to quick use.

results2 = model_test_results(X2,y2, X2_train, X2_test, y2_train, y2_test)

In [None]:
# The ROC curves are almost identical, let's look closer at the data

for k, v in results2[0].items():
    print(k)
    print(v, '\n\n')

In [None]:
# KNeighbors seems to have improved. Let's see about RandomForest
for k, v in results2[1].items():
    print(k)
    print(v, '\n\n')

In [None]:
for k, v in results2[2].items():
    print(k)
    print(v, '\n\n')

The results are a wash, with a slight decrease in performance. One point of interest is that in all models the percentage of false negatives dropped and recall rose despite this decrease. Recall increased by 8%, 2% and 2% for the K-Nearest Neighbors, Random Forest, and Logistic Regression models respectively.

The implication is significant. If we could test for general accuracy with one model, and then re-test the negatives with a model that excels at catching false negatives, we could reduce the risk of error.
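
The two-stage idea can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic, both stages use the same features, and the choice of Logistic Regression for stage 1 and K-Nearest Neighbors for stage 2 is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data standing in for the heart-disease set.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Stage 1: general-purpose model. Stage 2: a second model that re-checks negatives.
first_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
second_model = KNeighborsClassifier().fit(X_tr, y_tr)

first_preds = first_model.predict(X_te)
second_preds = second_model.predict(X_te)

# Keep every stage-1 positive; a stage-1 negative becomes positive if stage 2 flags it.
combined_preds = np.where(first_preds == 1, 1, second_preds)

print("Stage 1 recall: ", recall_score(y_te, first_preds))
print("Combined recall:", recall_score(y_te, combined_preds))
```

Because the combined prediction is positive whenever either stage says so, recall can only stay the same or rise. The cost is that false positives can only stay the same or rise too, so precision should be checked alongside it.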

Since our first test group outperformed in most measures, let's focus on hyperparameter tuning on this data set first.
We will only be using the Logistic Regression and Random Forest models from here on out.

In [None]:
random_forest_grid = {'n_estimators': np.arange(10,1000,50),
                     'max_depth':[None,1,2,4,10],
                     'min_samples_split': np.arange(2,10,1),
                     'min_samples_leaf': np.arange(2,10,1)}
logistic_regression_grid = {'C':np.logspace(-4, 4,20),
                           'solver': ['liblinear']}
#Random Forest object
rfc = RandomForestClassifier()
#Logistic Regression Object
lrc = LogisticRegression()

#Randomized Search for Random Forest
rf_RandomizedSearchCV = RandomizedSearchCV(rfc, random_forest_grid, cv=5, verbose=True, n_iter=20)
# 20 iterations sample only about 0.3% of the 6,400 possible combinations

#Randomized Search for Logistic Regression
lrc_RandomizedSearchCV = RandomizedSearchCV(lrc, logistic_regression_grid, cv=5, verbose=True, n_iter=20)
# For the logistic regression grid, 20 tests will exhaust all possibilities

In [None]:
np.random.seed(42)
lrc_RandomizedSearchCV.fit(X_train, y_train)

In [None]:
lrc_RandomizedSearchCV.best_params_

In [None]:
lrc_RandomizedSearchCV.score(X_test, y_test)

In [None]:
np.random.seed(42)
rf_RandomizedSearchCV.fit(X_train, y_train)

In [None]:
rf_RandomizedSearchCV.best_params_

In [None]:
rf_RandomizedSearchCV.score(X_test, y_test)

Neither score changed significantly between the tuned searches and the baseline runs. Also, despite Random Forest being searched over far more hyperparameter variations, it is still below the accuracy of the Logistic Regression model.
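
The difference in search coverage between the two models can be counted directly from the grids. A minimal sketch that repeats the two grids defined earlier; `grid_size` is a hypothetical helper introduced here for illustration:

```python
import numpy as np

random_forest_grid = {'n_estimators': np.arange(10, 1000, 50),
                      'max_depth': [None, 1, 2, 4, 10],
                      'min_samples_split': np.arange(2, 10, 1),
                      'min_samples_leaf': np.arange(2, 10, 1)}
logistic_regression_grid = {'C': np.logspace(-4, 4, 20),
                            'solver': ['liblinear']}

def grid_size(grid):
    """Number of distinct parameter combinations in a grid of lists/arrays."""
    return int(np.prod([len(values) for values in grid.values()]))

n_iter = 20  # same value passed to RandomizedSearchCV above
for name, grid in [('RandomForest', random_forest_grid),
                   ('LogisticRegression', logistic_regression_grid)]:
    total = grid_size(grid)
    print(f"{name}: {total} combinations, {min(n_iter, total) / total:.2%} sampled")
```

With so little of the Random Forest grid explored, raising `n_iter` or narrowing the grid around the best parameters found so far would be a reasonable next step before concluding that it cannot match Logistic Regression.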