In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


## Preamble
This is an ongoing notebook that I'm using to teach myself machine learning on the side. Most explanations are not meant to teach others but to teach myself. Things are incomplete and will be updated when I get the time. I welcome comments, suggestions, pointers, what-have-you.

# Introduction & Exploratory Analysis
### About this dataset

A detailed description of the data is given [here](https://www.kaggle.com/carlosdg/a-detail-description-of-the-heart-disease-dataset/), and I encourage anyone to check out that post as it's going to have more details than I give below.

Features:

*     Age : Age of the patient
*     Sex : Sex of the patient
*     cp : Chest pain type (0 = asymptomatic, 1 = atypical angina, 2 = non-anginal pain, 3 = typical angina)
*     trtbps : resting blood pressure (in mm Hg). Judging by the numbers on the histogram below, it's systolic blood pressure, for which the [Mayo Clinic](https://www.mayoclinic.org/diseases-conditions/high-blood-pressure/in-depth/blood-pressure/art-20050982) indicates anything higher than 130 is considered "high blood pressure".
*     chol : cholesterol in mg/dl, fetched via BMI sensor. The [Mayo Clinic](https://www.mayoclinic.org/diseases-conditions/high-blood-cholesterol/diagnosis-treatment/drc-20350806) gives below 200 mg/dL as "desirable".
*     fbs : Whether or not the patient's fasting blood sugar is above 120 mg/dl, 1 = true, 0 = false. Once again from the [Mayo Clinic:](https://www.mayoclinic.org/diseases-conditions/diabetes/diagnosis-treatment/drc-20371451) 100 to 125 is considered prediabetic. 

The rest of the variables are from a stress test to test patient blood flow.

*     restecg : resting electrocardiographic results.
             - 0: showing probable or definite left ventricular hypertrophy by Estes' criteria
             - 1: normal
             - 2: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
*     exang: Whether or not there was exercise-induced angina, which is chest pain caused by reduced blood flow to the heart. 1 = yes, 0 = no
*     thalach : maximum heart rate achieved during the test.
*     oldpeak: decrease of the ST segment during exercise, compared to the same segment at rest.
*     slope: slope of the ST segment during the most demanding part of the exercise; 0 = descending, 1 = flat, 2 = ascending
*     thal: results of the blood flow observed in the test. 0 = Null, removed from the data (see Data problems below), 1 = no blood flow in some parts of the heart, 2 = normal blood flow, 3 = abnormal blood flow
*     ca: number of major vessels that were examined in the test. (0-3)
*     target : 0 = less chance of heart attack, 1 = more chance of heart attack


### Other Information


### Data problems
We're going to drop rows that contain incorrect information (quoted from the source above):
>    A few more things to consider:
>    data #93, 159, 164, 165 and 252 have ca=4 which is incorrect. In the original Cleveland dataset they are NaNs (so they should be removed)
>    data #49 and 282 have thal = 0, also incorrect. They are also NaNs in the original dataset.
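
Dropping by row number relies on the quoted numbers lining up with the DataFrame's 0-based index, which I haven't verified. As a hedged alternative, here is a minimal sketch that filters on the values themselves. The `toy` frame is a made-up stand-in for `heart.csv` (only the `ca` and `thal` column names come from the dataset), and the valid ranges are the ones listed in the feature descriptions above.

```python
import pandas as pd

# Toy stand-in for heart.csv: only the two columns that carry the bad codes.
toy = pd.DataFrame({
    "ca":   [0, 4, 2, 3, 1],
    "thal": [2, 1, 0, 3, 2],
})

# Keep rows whose codes fall inside the documented ranges:
# ca in 0-3 and thal in 1-3 (ca=4 and thal=0 stand for missing values).
valid = toy["ca"].between(0, 3) & toy["thal"].between(1, 3)
cleaned = toy[valid]

print(cleaned)
print("dropped rows:", list(toy.index[~valid]))
```

The same mask applied to `data` would remove every row with `ca=4` or `thal=0`, wherever it sits in the file.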



In [None]:
data = pd.read_csv('/kaggle/input/heart-disease-uci/heart.csv')

In [None]:
# drop the rows flagged above: ca=4 (93, 159, 164, 165, 252) and thal=0 (49, 282)
data = data.drop([49, 93, 159, 164, 165, 252, 282])

In [None]:
data

In [None]:
data.info()

Histograms for each feature, stacked by target class:

In [None]:
def make_some_hists(df):
    """Plot a stacked histogram of every feature, coloured by target class."""
    features = df.columns.drop('target')

    for col in features:
        plt.figure(figsize=(7, 4))
        sns.histplot(data=df, x=col, hue='target', multiple="stack")
        plt.show()

In [None]:
make_some_hists(data)

In [None]:
# want to one-hot encode the chest pain type and electrocardiogram results,
# so first replace the numeric codes with readable labels
data['cp'] = data['cp'].replace({
    0: 'asymptomatic',
    1: 'atypical angina',
    2: 'non-anginal pain',
    3: 'typical angina',
})

data['restecg'] = data['restecg'].replace({
    0: 'left ventricular hypertrophy',
    1: 'normal',
    2: 'T/ST abnormalities',
})

data


In [None]:
def onehot_encode(df, column, prefix):
    df = df.copy()
    
    dummies = pd.get_dummies(df[column], prefix=prefix)
    df = pd.concat([df,dummies],axis=1)
    df = df.drop(column, axis=1)
    
    return df

In [None]:
X1 = onehot_encode(data, column='cp',prefix='cp')
X2 = onehot_encode(X1, column='restecg', prefix='restecg')
X2
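
To convince myself what the encoding step does, here is a small sketch on a made-up frame (the values and the `toy` name are illustrative only). It performs the same three operations as `onehot_encode`: build the dummy columns, attach them, and drop the original column.

```python
import pandas as pd

# Made-up rows; only the column name 'cp' and its labels mirror the notebook.
toy = pd.DataFrame({
    "cp": ["asymptomatic", "typical angina", "asymptomatic"],
    "age": [63, 37, 41],
})

# One indicator column per distinct label, named '<prefix>_<label>'.
dummies = pd.get_dummies(toy["cp"], prefix="cp")
encoded = pd.concat([toy.drop("cp", axis=1), dummies], axis=1)

print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one active indicator, so the dummy columns for a feature always sum to 1. For linear models that redundancy can be removed with `pd.get_dummies(..., drop_first=True)`, though I have not done that here.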


Everything else we're going to treat as a numeric (ordinal) variable. It's possible that chest pain types have some order of severity, in which case we could leave them as they are or rearrange them into an appropriate ranking. It is also possible that slope and thal (the slope of the ST segment, and the blood flow result, respectively) should be treated as categorical - I'm not a doctor! For the purposes of this exercise, I'm going to leave them as they are.

Taking a quick look at the target:

In [None]:
plt.figure(figsize=(12,9))
sns.histplot(X2,x="target")
plt.show()



# Preprocessing


In [None]:
# data processing imports
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [None]:
def preprocess(df):
    df = df.copy()
    
    # split into target vector and feature matrix
    y = df['target']
    X = df.drop('target', axis=1)
    
    # split into training and testing
    # small-ish dataset so we'll go with an 80% split
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=13)
    
    # scale the feature matrix with a standard scaler
    scaler = StandardScaler()
    scaler.fit(X_train)
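    # keep the original row labels so X and y stay aligned by index
    X_train = pd.DataFrame(scaler.transform(X_train), columns=X.columns, index=X_train.index)
    X_test = pd.DataFrame(scaler.transform(X_test), columns=X.columns, index=X_test.index)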
    
    return X_train, X_test, y_train, y_test

In [None]:
X_train, X_test, y_train, y_test = preprocess(X2)
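
A note to myself on why the scaler is fitted on the training split only: the test rows should be transformed with statistics they did not contribute to, otherwise information leaks from test to train. The sketch below uses synthetic data (everything here is made up except the 80% split and `random_state=13`, which mirror `preprocess`).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic features
y = rng.integers(0, 2, size=200)                      # synthetic labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=13)

scaler = StandardScaler().fit(X_tr)  # mean and std come from training rows only
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)      # test rows reuse the training mean and std

print("train means:", X_tr_s.mean(axis=0).round(3))
print("test means: ", X_te_s.mean(axis=0).round(3))
```

The scaled training columns have mean 0 and unit standard deviation by construction; the test columns are only approximately centred, which is expected.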

# Training and Modelling
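We're going to try several classification models: K-Nearest Neighbours, logistic regression, ridge and SGD linear models, support vector classifiers, a decision tree, and a random forest.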

In [None]:
# import models
from sklearn.neighbors import KNeighborsClassifier
# the target is binary, so use the classifier variants of the ridge and SGD models
from sklearn.linear_model import LogisticRegression, RidgeClassifier, SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier


In [None]:
# make a dictionary of the models to iterate through to train and test each one 
models = {
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(),
    "Ridge Classifier": RidgeClassifier(),
    "Stochastic Gradient Descent Classifier": SGDClassifier(),
    "Support Vector Classifier": SVC(),
    "Linear Support Vector Classifier": LinearSVC(),
    "Decision Tree Classifier": DecisionTreeClassifier(),
    "Random Forest Classifier": RandomForestClassifier()
}

In [None]:
# train
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name + " trained.")

# Evaluation
We'll get results from a loop and then dig into a couple of the better models.

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import confusion_matrix

In [None]:
# .score() returns mean accuracy for classifiers (it is R^2 only for regressors)
for name, model in models.items():
    print("{} test score: {:.5f}".format(name, model.score(X_test, y_test)))
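
What `.score` reports depends on the kind of estimator: scikit-learn classifiers return mean accuracy, while regressors return R^2. A minimal sketch on synthetic data (all names and values below are made up for illustration) that checks both against the explicit metric functions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import accuracy_score, r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic binary target

clf = LogisticRegression().fit(X, y)  # a classifier
reg = Ridge().fit(X, y)               # a regressor fitted on the same 0/1 labels

clf_score = clf.score(X, y)
reg_score = reg.score(X, y)

print("classifier .score:", clf_score, "| accuracy:", accuracy_score(y, clf.predict(X)))
print("regressor  .score:", reg_score, "| R^2:     ", r2_score(y, reg.predict(X)))
```

So numbers from a regressor and a classifier are not directly comparable, even when they come out of the same loop.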

### Logistic Regression
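Logistic regression tends to behave better when the features are not strongly correlated with one another (low multicollinearity). Judging by the heatmap below, most feature pairs look only weakly correlated; the exceptions to expect are one-hot columns derived from the same original feature, which are negatively correlated by construction.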

In [None]:
sns.heatmap(X_train.corr(), center =0)

In [None]:
logmodel = LogisticRegression()
logmodel.fit(X_train, y_train)

In [None]:
print("Logistic Regression accuracy: {:.5f}".format(logmodel.score(X_test, y_test)))

In [None]:
y_pred = logmodel.predict(X_test)
y_pred

In [None]:
print(classification_report(y_test, y_pred))

Notes on what the classification report is presenting:
* *Precision* measures the accuracy of the positive predictions, defined as the ratio between true positive predictions and all positive predictions - in simpler terms, what fraction of the patients for which the presence of heart disease was detected by the model actually had heart disease?

* *Recall* measures the amount of true positives that were actually detected, defined as the ratio between true positive predictions and the amount of positives present in the data. What fraction of the patients who actually have heart disease were detected by the model?

* *f1-score* is a combination of these. It presents a single metric for model evaluation by taking the harmonic mean of the two measures: $ f_{1} =  \frac{2}{ \frac{1}{p} + \frac{1}{r} } $, where *p* is the precision and *r* is the recall. The harmonic mean is frequently used for averaging rates, since it gives more weight to lower values, and hence will only be high when both precision and recall are high.

* *Support* is simply the number of true instances of each class in the test labels.

In the first row, where the logistic model is predicting the no-heart-disease group (class 0), the precision is 0.88 and the recall is 0.69.
* 88% of the people that the model predicted *did not have* heart disease actually did not have it (remember that 0 means no heart disease).

* 69% of the cases that actually *did not have* heart disease were correctly detected by the model.

The second row is the same for the positive class, or those who did have heart disease.
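
The definitions above can be checked by hand on a tiny example. The labels below are made up (they are not the notebook's predictions); the sketch counts the confusion-matrix cells, applies the formulas for the positive class, and compares against scikit-learn's own functions.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up ground truth and predictions (1 = heart disease, 0 = no heart disease).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_hat  = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)             # of the predicted positives, how many are real
recall = tp / (tp + fn)                # of the real positives, how many were found
f1 = 2 / (1 / precision + 1 / recall)  # harmonic mean of the two

print("precision:", precision, "sklearn:", precision_score(y_true, y_hat))
print("recall:   ", recall, "sklearn:", recall_score(y_true, y_hat))
print("f1:       ", f1, "sklearn:", f1_score(y_true, y_hat))
```

Here 4 of the 5 positive predictions are correct (precision 0.8) and 4 of the 6 actual positives are found (recall of about 0.67).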



In [None]:
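# plot a ROC curve
# use the predicted probability of the positive class rather than the hard 0/1
# predictions, so the curve is traced over many thresholds instead of one
y_score = logmodel.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_score)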

In [None]:
def plot_roc(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0,1],[0,1], 'k--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    
plot_roc(fpr, tpr)
plt.show()
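
A ROC curve is built by sweeping a decision threshold over continuous scores, so it matters what gets passed to `roc_curve`. The sketch below uses synthetic data from `make_classification` (not the heart dataset; all names are illustrative) to compare hard labels with predicted probabilities, and summarises the curve with the area under it.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for the real data.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=13)
clf = LogisticRegression().fit(X_tr, y_tr)

labels = clf.predict(X_te)               # hard 0/1 predictions
scores = clf.predict_proba(X_te)[:, 1]   # probability of the positive class

fpr_l, tpr_l, thr_l = roc_curve(y_te, labels)
fpr_s, tpr_s, thr_s = roc_curve(y_te, scores)
auc = roc_auc_score(y_te, scores)

print("ROC points from hard labels:  ", len(thr_l))
print("ROC points from probabilities:", len(thr_s))
print("AUC from probabilities: {:.3f}".format(auc))
```

With hard labels there are only two distinct score values, so the curve collapses to a single corner point between (0, 0) and (1, 1). Probabilities give a curve with many more points, and the AUC condenses it into one number between 0 and 1.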