<h1 style="text-align:center">My Heart Disease Classification Approach</h1>

<div style="text-align:center;"><img src="https://images.unsplash.com/photo-1571172964276-91faaa704e1f?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1650&q=80" /></div>

**Context:** 
> This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to
this date. The "goal" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4.

**About the Data:**
* age - age in years
* sex - (1 = male; 0 = female)
* chest pain type (4 values)
* resting blood pressure
* serum cholesterol in mg/dl
* fasting blood sugar > 120 mg/dl
* resting electrocardiographic results (values 0,1,2)
* maximum heart rate achieved
* exercise induced angina
* oldpeak = ST depression induced by exercise relative to rest
* the slope of the peak exercise ST segment
* number of major vessels (0-3) colored by fluoroscopy
* thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
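
The list above gives descriptions only, while the analysis below refers to short column names such as `cp` and `thalach`. The following sketch pairs the two. The pairing is an assumption based on the order of the list and the column names used in the EDA section; `COLUMN_DESCRIPTIONS` and `describe_column` are hypothetical helper names, not part of the dataset.

```python
# Assumed mapping from heart.csv column names to the descriptions listed above.
COLUMN_DESCRIPTIONS = {
    "age": "age in years",
    "sex": "1 = male; 0 = female",
    "cp": "chest pain type (4 values)",
    "trestbps": "resting blood pressure",
    "chol": "serum cholesterol in mg/dl",
    "fbs": "fasting blood sugar > 120 mg/dl",
    "restecg": "resting electrocardiographic results (values 0, 1, 2)",
    "thalach": "maximum heart rate achieved",
    "exang": "exercise induced angina",
    "oldpeak": "ST depression induced by exercise relative to rest",
    "slope": "slope of the peak exercise ST segment",
    "ca": "number of major vessels (0-3) colored by fluoroscopy",
    "thal": "3 = normal; 6 = fixed defect; 7 = reversible defect",
}


def describe_column(name):
    """Return the description for a column, or a fallback for unknown names."""
    return COLUMN_DESCRIPTIONS.get(name, "no description available")


print(describe_column("thalach"))
```

A lookup like this can be handy for labelling plots with a readable description instead of the abbreviated column name.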

# Imports

In [None]:
import numpy as np 
import pandas as pd 

# Data Visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(style='whitegrid')

# Modeling
from sklearn.model_selection import train_test_split

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import roc_auc_score

from sklearn.model_selection import RandomizedSearchCV

# Exploratory Data Analysis

In [None]:
df = pd.read_csv("/kaggle/input/heart-disease-uci/heart.csv")

In [None]:
df

## target

In [None]:
df['target'].value_counts()

In [None]:
b = sns.countplot(x='target', data=df)
b.set_title("Target Distribution");

The classes are close to balanced: rows with a `target` of `1` slightly outnumber rows with a `target` of `0`.
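
To put a number on the class balance, the counts can be turned into shares. This is a minimal sketch on toy labels, not on `heart.csv`; `class_balance` and `toy_target` are hypothetical names introduced for illustration.

```python
import pandas as pd


def class_balance(labels):
    """Return the share of each class as a fraction of all rows."""
    return pd.Series(labels).value_counts(normalize=True).sort_index()


# Toy labels for illustration only; these are not values from heart.csv.
toy_target = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
shares = class_balance(toy_target)
print(shares)
```

On the real data the same idea would be `df['target'].value_counts(normalize=True)`.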

## age

In [None]:
df['age'].describe()

In [None]:
b = sns.histplot(df['age'], kde=True)  # distplot is deprecated in recent seaborn releases
b.set_title("Age Distribution");

In [None]:
b = sns.boxplot(y = 'age', data = df)
b.set_title("Age Distribution");

The box plot shows almost no points beyond the whiskers, so `age` is largely free of outliers. That's great!
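
The visual check can be backed by the same rule the box plot uses by default: a point counts as an outlier when it lies more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to toy values, not to `heart.csv`; `iqr_outliers` and `toy_ages` are hypothetical names.

```python
import pandas as pd


def iqr_outliers(values, whisker=1.5):
    """Return the values lying more than `whisker` * IQR beyond the quartiles."""
    s = pd.Series(values, dtype="float64")
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return s[(s < lower) | (s > upper)]


# Toy values for illustration only; 120 is deliberately extreme.
toy_ages = [29, 41, 45, 52, 54, 56, 58, 61, 63, 67, 120]
outliers = iqr_outliers(toy_ages)
print(outliers.tolist())
```

On the real data, `iqr_outliers(df['age'])` would list the points drawn beyond the whiskers, assuming the plot uses the default whisker length.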

In [None]:
b = sns.boxplot(y='age', x='target', data=df)
b.set_title("Age Distribution for Target")
plt.xlabel("0 = No Heart Disease, 1 = Heart Disease");

## sex

In [None]:
df['sex'].value_counts()

In [None]:
b = sns.countplot(x='sex', data=df)
b.set_title("sex Distribution");

Our data consists of roughly 2/3 `male` and 1/3 `female` patients.

In [None]:
pd.crosstab(df['target'], df['sex']).plot(kind="bar", figsize=(10,6))

plt.title("Target distribution for Sex")
plt.xlabel("0 = No Heart Disease, 1 = Heart Disease")
plt.ylabel("Count")
plt.legend(["female", "male"])
plt.xticks(rotation=0);

## cp

In [None]:
df['cp'].value_counts()

In [None]:
b = sns.countplot(x='cp', data=df)
b.set_title("cp Distribution");

In [None]:
pd.crosstab(df['target'], df['cp']).plot(kind="bar", figsize=(10,6))

plt.title("Target distribution for cp")
plt.xlabel("0 = No Heart Disease, 1 = Heart Disease")
plt.ylabel("Count")
plt.legend(["0", "1", "2", "3"])
plt.xticks(rotation=0);

## trestbps

In [None]:
df['trestbps'].describe()

In [None]:
b = sns.histplot(df['trestbps'], kde=True)  # distplot is deprecated in recent seaborn releases
b.set_title("trestbps Distribution");

In [None]:
b = sns.boxplot(y = 'trestbps', data = df)
b.set_title("trestbps Distribution");

In [None]:
b = sns.boxplot(y='trestbps', x='target', data=df)
b.set_title("trestbps Distribution for Target")
plt.xlabel("0 = No Heart Disease, 1 = Heart Disease");

## chol

In [None]:
df['chol'].describe()

In [None]:
b = sns.histplot(df['chol'], kde=True)  # distplot is deprecated in recent seaborn releases
b.set_title("chol Distribution");

In [None]:
b = sns.boxplot(y = 'chol', data = df)
b.set_title("chol Distribution");

The box plot shows a few outliers for `chol`.

In [None]:
b = sns.boxplot(y='chol', x='target', data=df)
b.set_title("chol Distribution for Target")
plt.xlabel("0 = No Heart Disease, 1 = Heart Disease");

## fbs

In [None]:
df['fbs'].value_counts()

In [None]:
b = sns.countplot(x='fbs', data=df)
b.set_title("fbs Distribution");

In [None]:
pd.crosstab(df['target'], df['fbs']).plot(kind="bar", figsize=(10,6))

plt.title("Target distribution for fbs")
plt.xlabel("0 = No Heart Disease, 1 = Heart Disease")
plt.ylabel("Count")
plt.legend(["0", "1"])
plt.xticks(rotation=0);

## restecg

In [None]:
df['restecg'].value_counts()

In [None]:
b = sns.countplot(x='restecg', data=df)
b.set_title("restecg Distribution");

In [None]:
pd.crosstab(df['target'], df['restecg']).plot(kind="bar", figsize=(10,6))

plt.title("Target distribution for restecg")
plt.xlabel("0 = No Heart Disease, 1 = Heart Disease")
plt.ylabel("Count")
plt.legend(["0", "1", "2"])
plt.xticks(rotation=0);

## thalach

In [None]:
df['thalach'].describe()

In [None]:
b = sns.histplot(df['thalach'], kde=True)  # distplot is deprecated in recent seaborn releases
b.set_title("thalach Distribution");

In [None]:
b = sns.boxplot(y = 'thalach', data = df)
b.set_title("thalach Distribution");

In [None]:
b = sns.boxplot(y='thalach', x='target', data=df)
b.set_title("thalach Distribution for Target")
plt.xlabel("0 = No Heart Disease, 1 = Heart Disease");

## exang

In [None]:
df['exang'].value_counts()

In [None]:
b = sns.countplot(x='exang', data=df)
b.set_title("exang Distribution");

In [None]:
pd.crosstab(df['target'], df['exang']).plot(kind="bar", figsize=(10,6))

plt.title("Target distribution for exang")
plt.xlabel("0 = No Heart Disease, 1 = Heart Disease")
plt.ylabel("Count")
plt.legend(["0", "1"])
plt.xticks(rotation=0);

## oldpeak

In [None]:
df['oldpeak'].describe()

In [None]:
b = sns.histplot(df['oldpeak'], kde=True)  # distplot is deprecated in recent seaborn releases
b.set_title("oldpeak Distribution");

In [None]:
b = sns.boxplot(y = 'oldpeak', data = df)
b.set_title("oldpeak Distribution");

In [None]:
b = sns.boxplot(y='oldpeak', x='target', data=df)
b.set_title("oldpeak Distribution for Target")
plt.xlabel("0 = No Heart Disease, 1 = Heart Disease");

## slope

In [None]:
df['slope'].value_counts()

In [None]:
b = sns.countplot(x='slope', data=df)
b.set_title("slope Distribution");

In [None]:
pd.crosstab(df['target'], df['slope']).plot(kind="bar", figsize=(10,6))

plt.title("Target distribution for slope")
plt.xlabel("0 = No Heart Disease, 1 = Heart Disease")
plt.ylabel("Count")
plt.legend(["0", "1", "2"])
plt.xticks(rotation=0);

## ca

In [None]:
df['ca'].value_counts()

In [None]:
b = sns.countplot(x='ca', data=df)
b.set_title("ca Distribution");

In [None]:
pd.crosstab(df['target'], df['ca']).plot(kind="bar", figsize=(10,6))

plt.title("Target distribution for ca")
plt.xlabel("0 = No Heart Disease, 1 = Heart Disease")
plt.ylabel("Count")
plt.legend(["0", "1", "2", "3", "4"])
plt.xticks(rotation=0);

## thal

In [None]:
df['thal'].value_counts()

In [None]:
b = sns.countplot(x='thal', data=df)
b.set_title("thal Distribution");

In [None]:
pd.crosstab(df['target'], df['thal']).plot(kind="bar", figsize=(10,6))

plt.title("Target distribution for thal")
plt.xlabel("0 = No Heart Disease, 1 = Heart Disease")
plt.ylabel("Count")
plt.legend(["0", "1", "2", "3"])
plt.xticks(rotation=0);

## Missing values

Let's check for missing values:

In [None]:
df.isna().sum()

The dataset has no missing values. That's great!

# Modeling

In [None]:
# Everything except target variable
X = df.drop("target", axis=1)

# Target variable
y = df['target'].values

In [None]:
# Random seed for reproducibility
np.random.seed(42)

# Split into train & test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [None]:
# Put models in a dictionary
models = {"KNN": KNeighborsClassifier(),
          "Logistic Regression": LogisticRegression(max_iter=10000), 
          "Random Forest": RandomForestClassifier(),
          "SVC" : SVC(probability=True),
          "DecisionTreeClassifier" : DecisionTreeClassifier(),
          "AdaBoostClassifier" : AdaBoostClassifier(),
          "GradientBoostingClassifier" : GradientBoostingClassifier(),
          "GaussianNB" : GaussianNB(),
          "LinearDiscriminantAnalysis" : LinearDiscriminantAnalysis(),
          "QuadraticDiscriminantAnalysis" : QuadraticDiscriminantAnalysis()}

# Create function to fit and score models
def fit_and_score(models, X_train, X_test, y_train, y_test):
    """
    Fits and evaluates given machine learning models.
    models : a dict of different Scikit-Learn machine learning models
    X_train : training data
    X_test : testing data
    y_train : labels associated with training data
    y_test : labels associated with test data
    """
    # Random seed for reproducible results
    np.random.seed(42)
    # Make a dict to keep model scores, keyed by model name
    model_scores = {}
    # Loop through models
    for name, model in models.items():
        # Fit the model to the data
        model.fit(X_train, y_train)
        # Predict hard class labels (0 or 1) for the test set
        y_pred = model.predict(X_test)
        # Score with ROC AUC and store it under the model's name.
        # Note: y_pred holds hard labels, not probabilities, so for this
        # binary target the value equals balanced accuracy.
        model_scores[name] = roc_auc_score(y_test, y_pred)
    return model_scores

In [None]:
model_scores = fit_and_score(models=models,
                             X_train=X_train,
                             X_test=X_test,
                             y_train=y_train,
                             y_test=y_test)
model_scores


`Logistic Regression` and `LinearDiscriminantAnalysis` tie for the best score, `0.8685344827586206` (about 0.869).
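
`fit_and_score` passes the hard class predictions from `model.predict` to `roc_auc_score`. With hard labels the ROC curve has a single operating point, so the score reduces to balanced accuracy. A probability-based AUC would use `predict_proba` instead. The sketch below compares the two on synthetic data from `make_classification`, not on `heart.csv`; the names `X_demo`, `y_demo`, `auc_from_labels` and `auc_from_proba` are hypothetical, and the numbers it prints say nothing about the models above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data for illustration only; heart.csv is not used here.
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=10000).fit(X_tr, y_tr)

# AUC computed from hard 0/1 predictions, as fit_and_score does
auc_from_labels = roc_auc_score(y_te, model.predict(X_te))
# AUC computed from the predicted probability of class 1
auc_from_proba = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
# Balanced accuracy of the same hard predictions
bal_acc = balanced_accuracy_score(y_te, model.predict(X_te))

print(f"AUC from labels:        {auc_from_labels:.3f}")
print(f"Balanced accuracy:      {bal_acc:.3f}")
print(f"AUC from probabilities: {auc_from_proba:.3f}")
```

The first two printed values match, which is why the label-based score is best read as balanced accuracy. Switching `fit_and_score` to probabilities would change the reported scores, so the two variants should not be compared directly.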