# Heart Disease Machine Learning Model

> This is an analysis of the heart disease problem using different machine learning models. It predicts whether or not a patient has heart disease. For the analysis we use tools such as pandas, matplotlib, and scikit-learn.

To reach this goal we follow these steps:
1. Problem Definition
2. Data
3. Evaluation
4. Features
5. Modeling
6. Experimentation

## 1. Problem Definition
> Given a patient's clinical parameters, we train a model to predict whether or not the patient has heart disease.
This is a binary classification problem: the label is either 0 (no heart disease) or 1 (heart disease), and we use different machine learning tools to make the classifier's predictions as accurate as we can.

## 2. Data
We collected the data from Kaggle: https://www.kaggle.com/ronitf/heart-disease-uci <br>
and from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/heart+disease

**Data Dictionary**<br>
We use only 14 attributes to predict whether or not the patient has heart disease.
* age: age in years
* sex: sex (1 = male; 0 = female)
* cp: chest pain type
    - Value 1: typical angina
    - Value 2: atypical angina
    - Value 3: non-anginal pain
    - Value 4: asymptomatic
* trestbps: resting blood pressure (in mm Hg on admission to the hospital)
* chol: serum cholesterol in mg/dl
* fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
* restecg: resting electrocardiographic results
    - Value 0: normal
    - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
* thalach: maximum heart rate achieved
* exang: exercise induced angina (1 = yes; 0 = no)
* oldpeak: ST depression induced by exercise relative to rest
* slope: the slope of the peak exercise ST segment
    - Value 1: upsloping
    - Value 2: flat
    - Value 3: downsloping
* ca: number of major vessels (0-3) colored by fluoroscopy
* thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
* num: diagnosis of heart disease (angiographic disease status); the code in this notebook reads this label from the `target` column
    - Value 0: < 50% diameter narrowing
    - Value 1: > 50% diameter narrowing (in any major vessel: attributes 59 through 68 are vessels)

## 3. Evaluating the Model
Evaluating the model starts with evaluating the dataset: checking whether the table (here, the CSV file) has any empty cells, looking at the mean values and the shape of the dataset, and being clear about why we are building this model and what we are predicting.

While evaluating this dataset, we ask:
- Which parameters are responsible for heart disease?
- What are we predicting when we train the model?
- What is the average age of the people who suffer from heart disease?
- Is this dataset large enough to train the model?
- How accurate is the trained model?
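The checks listed above can be sketched with pandas on a tiny made-up table. The column names follow the data dictionary, but the values are invented for illustration and are not rows from `heart.csv`; the variable names are assumptions as well.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: invented values, NOT rows from heart.csv
toy = pd.DataFrame({
    "age":    [63, 41, 57, 52, 67, 45],
    "chol":   [233, 204, np.nan, 199, 286, 260],
    "target": [1, 1, 0, 1, 0, 0],
})

# Does the table contain empty cells, and in which columns?
missing_per_column = toy.isna().sum()

# Shape of the dataset as (rows, columns)
n_rows, n_cols = toy.shape

# Average age of patients with (1) and without (0) heart disease
mean_age_by_target = toy.groupby("target")["age"].mean()

print(missing_per_column)
print(n_rows, n_cols)
print(mean_age_by_target)
```

On the real dataset the same three calls would be applied to `df` once it has been loaded.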


In [None]:
# Making our tools ready

# importing the data analysis and plotting tools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# importing the machine learning models
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# importing the model selection and evaluation tools
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import roc_curve, RocCurveDisplay
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

# plot_roc_curve was removed in scikit-learn 1.2; keep the old name working
plot_roc_curve = RocCurveDisplay.from_estimator

In [None]:
# loading the dataset
df = pd.read_csv('../input/heart-disease-uci/heart.csv')
df.head()

In [None]:
df.describe()

In [None]:
df.shape

In [None]:
df['target'].value_counts()

## 1. Evaluate the Data

### 1.1 Comparing Sex and Target

sex: sex (1 = male; 0 = female)

In [None]:
# Sex vs Target

pd.crosstab(df.sex, df.target)

In [None]:
pd.crosstab(df.sex, df.target).plot(
    kind='bar',
    rot=0,
    ylabel='Frequency',
    xlabel='Sex',
    title='Frequency graph between the Sex and Target',
    colormap='tab20c'
)
plt.legend(['No-Disease', 'Disease']);

So, from the above graph we can see that the gap between the disease and no-disease counts is larger for females than for males. There are also fewer females than males in both the no-disease and the disease groups.
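Because the two groups differ in size, raw counts can be hard to compare. A minimal sketch, assuming invented toy values rather than the real dataset, of how `pd.crosstab` with `normalize="index"` turns the counts into per-sex proportions:

```python
import pandas as pd

# Hypothetical toy data: invented values, NOT rows from heart.csv
toy = pd.DataFrame({
    "sex":    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "target": [1, 1, 1, 0, 1, 1, 1, 0, 0, 0],
})

# Raw counts per (sex, target) pair
counts = pd.crosstab(toy.sex, toy.target)

# Row-wise proportions: each row sums to 1
rates = pd.crosstab(toy.sex, toy.target, normalize="index")

print(counts)
print(rates)
```

In this toy table both sexes have three disease cases, yet the disease rate is 0.75 for females and 0.5 for males, which the raw counts alone do not show.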

### 1.2 Comparing Chest Pain Type and Target

* cp: chest pain type
    - Value 1: typical angina
    - Value 2: atypical angina
    - Value 3: non-anginal pain
    - Value 4: asymptomatic

In [None]:
pd.crosstab(df.cp, df.target)

In [None]:
pd.crosstab(df.cp, df.target).plot(
    kind='bar',
    rot=0,
    xlabel='Chest Pain',
    ylabel='Frequency',
    title='Frequency Graph between the Chest Pain and Target',
    colormap='tab20c'
)
plt.legend(['No-Disease', 'Disease']);

From the above graph we can see that patients with non-anginal chest pain are more likely to have heart disease, while patients with typical angina mostly do not have heart disease.

### 1.3 Exploring the Relation between the Independent Variables

#### Resting Blood Pressure vs Age

In [None]:
pd.crosstab(df.age, df.trestbps)

In [None]:
# Let's try to make this more understandable and visual
plt.figure(figsize=(10, 6))
# Relation between the resting blood pressure and age with positive target
plt.scatter(df.trestbps[df.target == 1], df.age[df.target == 1])
# Relation between the resting blood pressure and age with negative target
plt.scatter(df.trestbps[df.target == 0], df.age[df.target == 0])
plt.xlabel('Resting Blood Pressure (mm Hg)')
plt.ylabel('Age (Years)')
plt.title('Relation between the Resting Blood Pressure vs Age')
plt.legend(['Disease', 'No-Disease']);

**What do we learn from this graph?**

Most of the patients with heart disease are between 40 and 60 years of age. Looking at the graph more closely, the heart disease cases cluster at a resting blood pressure of 100-140 mm Hg, while patients with a resting blood pressure of 160-200 mm Hg mostly do not have heart disease.

## Correlation Matrix

> A correlation matrix is simply a table which displays the correlation coefficients for different variables. The measure is best used in variables that demonstrate a linear relationship between each other. The fit of the data can be visually represented in a scatterplot. For more detail, see: https://www.displayr.com/what-is-a-correlation-matrix/

Let's compute the correlation matrix to understand the relation between the dependent variable and the independent variables, and among the independent variables themselves.

In [None]:
df.corr()

In the above result, every value lies between -1 and +1. A positive value indicates a positive relationship (the two variables rise together) and a negative value indicates a negative relationship (one rises as the other falls).
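To make the -1 to +1 range concrete, here is a minimal sketch of the Pearson correlation coefficient, which is the default method of `df.corr()`. The helper name `pearson_r` and the example values are assumptions for illustration only.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return (x_centered * y_centered).sum() / np.sqrt(
        (x_centered ** 2).sum() * (y_centered ** 2).sum()
    )

# Invented example values, not taken from the dataset
x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])     # y rises in step with x
r_down = pearson_r(x, [10, 8, 6, 4, 2])   # y falls in step with x

print(r_up, r_down)
```

A perfectly increasing linear relationship gives +1 and a perfectly decreasing one gives -1; real columns land somewhere in between.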

In [None]:
# Let's visualise this correlation matrix using the seaborn heatmap
corr_matrix = df.corr()
plt.figure(figsize=(15, 10))
sns.heatmap(corr_matrix,
            fmt='.2f',
            linewidths=0.5,
            annot=True,
            cmap='Blues');

We have performed enough EDA to evaluate the dataset and get to know the data. Now let's train some machine learning models and experiment with them to reach the goal we stated in the problem definition.

## 5. Modeling
We use different machine learning models to solve our classification problem:
1. Logistic Regression
2. K-Neighbors Classifier
3. Random Forest Classifier

After creating the models, we perform some hyperparameter tuning to make them more mature and more accurate before deploying them into a real environment, and we create a report for the classification problem. This report includes:
1. Accuracy
2. Precision
3. Recall
4. F1 Score
5. ROC Curve
6. Area Under Curve (AUC)

So, let's get our data ready for training and testing our machine learning models.

In [None]:
np.random.seed(42)
# Create feature and label data
X = df.drop("target", axis=1)
y = df["target"]

# Split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Let's create a function that fits the different machine learning models on our training data, scores them on our test data, and lets us find the best score for our classification problem.

In [None]:
# Create a dictionary of different models
model = { "Logistic Model": LogisticRegression(),
         "K Neighbors": KNeighborsClassifier(),
         "Random Forest": RandomForestClassifier()        
}

def fit_and_score(model, X_train, X_test, y_train, y_test):
    '''
    Fit each classification model and return its accuracy on the test data.
    model : dictionary mapping a model name to a classification model
    X_train : training data (no label)
    X_test : testing data (no label)
    y_train : training label
    y_test : test label
    '''
    np.random.seed(42)

    model_score = {}

    # Use a separate loop variable so the `model` dictionary is not shadowed
    for name, clf in model.items():
        clf.fit(X_train, y_train)
        model_score[name] = clf.score(X_test, y_test)

    return model_score

In [None]:
model_score=fit_and_score(model=model,
                         X_train=X_train,
                         X_test=X_test,
                         y_train=y_train,
                         y_test=y_test)
model_score

In [None]:
pd.DataFrame(model_score.items(), columns=['Model', 'Score'])

So, from the above dataframe we can see that the Logistic Regression model gives the best score among the three models.

At this stage the best score for our classification problem is around 89%, given by the Logistic Regression model. So, next we try to bring the other models up to this score by performing some hyperparameter tuning.

In [None]:
# Let's make our score more visual
model_df = pd.DataFrame(model_score.items(), columns=['Model', 'Score'])
model_df.plot(kind='bar', colormap='tab20c')
plt.ylabel('Score')
plt.title('Model Predicting Score Graph')
plt.xticks([0, 1, 2], ['Logistic Regression', 'K Neighbors', 'Random Forest']);

So, finally we tune the hyperparameters of our models. We use RandomizedSearchCV and GridSearchCV, both of which use cross-validation.

**So, first we use RandomizedSearchCV**
* Logistic Regression Parameter Tuning : https://www.kaggle.com/joparga3/2-tuning-parameters-for-logistic-regression
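The difference between the two search strategies can be sketched in plain Python. This only mimics the idea (every combination versus a fixed-size random subset) and is not how scikit-learn samples internally; the values are borrowed from the random forest grid defined further down in this notebook.

```python
import itertools
import random

# Values borrowed from the random forest grid used later in this notebook
grid = {
    "n_estimators": [10, 30, 50, 100],
    "min_samples_leaf": [1, 5, 10, 20],
    "min_samples_split": [2, 5, 10],
}

# Grid search: every combination of the listed values
names = list(grid)
all_combos = [dict(zip(names, values))
              for values in itertools.product(*grid.values())]

# Randomized search: a fixed-size random subset of those combinations
rng = random.Random(42)
sampled_combos = rng.sample(all_combos, k=20)

print(len(all_combos), "candidates for grid search")
print(len(sampled_combos), "candidates for randomized search")
```

With 5-fold cross-validation every candidate is fitted five times, so the 48 grid candidates cost 240 fits while the 20 sampled candidates cost 100.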

In [None]:
# Create grids for RandomizedSearchCV

# Random search grid for Logistic Regression
# (the liblinear solver supports both the l1 and the l2 penalty;
#  the default lbfgs solver rejects l1)
rs_lr_grid = {'penalty': ['l2', 'l1'],
              'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'solver': ['liblinear'],
              'random_state': np.arange(0, 42, 5)}
# Random search grid for K-Neighbors Classifier
rs_knn_grid = {
    'n_neighbors': np.arange(1, 150, 5)
}
# Random search grid for Random Forest Classifier
# (n_jobs only controls parallelism, it does not change the fitted model)
rs_rf_grid = {
    'n_estimators': [10, 30, 50, 100],
    'min_samples_leaf': [1, 5, 10, 20],
    'min_samples_split': [2, 5, 10],
    'n_jobs': [None, 1]
}

In [None]:
# Let's fit the model and score them

def rs_fit_score(grid, model):
    '''
    Return an unfitted RandomizedSearchCV object for the given classification model
    grid : parameter grid to sample from
    model : classification model
    '''
    rs_model = RandomizedSearchCV(model,
                                  param_distributions=grid,
                                  cv=5,
                                  n_iter=20,
                                  verbose=True)
    return rs_model

In [None]:
# Fitting Logistic Model
np.random.seed(42)

lr_model = rs_fit_score(rs_lr_grid, LogisticRegression())
lr_model.fit(X_train, y_train)

In [None]:
lr_model.best_params_

In [None]:
lr_score = lr_model.score(X_test, y_test)

In [None]:
# Fitting K-Neighbors Model
np.random.seed(42)
knn_model = rs_fit_score(rs_knn_grid, KNeighborsClassifier())
knn_model.fit(X_train, y_train)

In [None]:
knn_model.best_params_

In [None]:
knn_score = knn_model.score(X_test, y_test)

In [None]:
# Fitting Random Forest Classifier
np.random.seed(42)
rf_model = rs_fit_score(rs_rf_grid, RandomForestClassifier())
rf_model.fit(X_train, y_train)

In [None]:
rf_model.best_params_

In [None]:
rf_score = rf_model.score(X_test, y_test)

In [None]:
new_model_score = {'Logistic Regression': lr_score,
                   'K Neighbors': knn_score,
                   'Random Forest': rf_score}
pd.DataFrame(new_model_score.items(), columns=['Model', 'Score'])

Let's compare both dataframes, i.e. the dataframe before hyperparameter tuning and the dataframe after hyperparameter tuning.

In [None]:
new_model_df = pd.DataFrame(new_model_score.items(), columns=['Model', 'Score'])
combine_model = pd.DataFrame(model_score.keys(), index=[0, 1, 2], columns=['Model'])
combine_model['Old Score'] = model_score.values()
combine_model['New Score'] = new_model_score.values()
combine_model

In [None]:
combine_model.plot(kind='bar', cmap='tab20c')
plt.xticks([0, 1, 2], combine_model['Model']);
plt.ylabel('Score')
plt.title('Hyperparameter Tuning Graph');

So, from the above graph we can see that after some hyperparameter tuning, Logistic Regression still has the highest accuracy among the three classification models. The Random Forest Classifier's score increases after tuning but is still lower than that of Logistic Regression.

Let's perform GridSearchCV.

### GridSearchCV

In [None]:
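# Create grids for GridSearchCV

# Grid for Logistic Regression
# (a list of grids, because no single solver accepts every penalty here:
#  liblinear handles l1/l2, and lbfgs handles penalty=None, which ignores C)
gs_lr_grid = [{'C': np.logspace(-4, 4, 20),
               'penalty': ['l1', 'l2'],
               'solver': ['liblinear']},
              {'penalty': [None],
               'solver': ['lbfgs']}]
# Grid for K-Neighbors Classifier
gs_knn_grid = {
    'n_neighbors': np.arange(1, 150, 5)
}
# Grid for Random Forest Classifier
# (n_estimators starts at 10 because a forest needs at least one tree)
gs_rf_grid = {
    'n_estimators': np.arange(10, 101, 10),
    'min_samples_leaf': [1, 5, 10, 20],
    'min_samples_split': [2, 5, 10, 30],
    'n_jobs': [None, 1, 2]
}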

In [None]:
# Let's fit the model and score them

def gs_fit_score(grid, model):
    '''
    Return an unfitted GridSearchCV object for the given classification model
    grid : parameter grid to search exhaustively
    model : classification model
    '''
    gs_model = GridSearchCV(model,
                            param_grid=grid,
                            cv=5,
                            verbose=True)
    return gs_model

In [None]:
gs_lg_model = gs_fit_score(gs_lr_grid, LogisticRegression())
gs_lg_model.fit(X_train, y_train);

In [None]:
gs_lg_model.best_params_

In [None]:
gs_lg_model.score(X_test, y_test)

In [None]:
knn_lg_model = gs_fit_score(gs_knn_grid, KNeighborsClassifier())
knn_lg_model.fit(X_train, y_train);

In [None]:
knn_lg_model.best_params_

In [None]:
knn_lg_model.score(X_test, y_test)

In [None]:
# Use the grid search helper here; note that the full grid takes a while to fit
rf_lf_model = gs_fit_score(gs_rf_grid, RandomForestClassifier())
rf_lf_model.fit(X_train, y_train);

In [None]:
rf_lf_model.best_params_

In [None]:
rf_lf_model.score(X_test, y_test)

So, after performing GridSearchCV we see that no model becomes more accurate even after changing its parameters. In the end, we keep the Logistic Regression model with an accuracy of 89%.

So, what's next? Now we prepare a classification report, which consists of:
* Accuracy
* Precision
* Recall
* F1-Score
* ROC curve and AUC
* Classification Report
* Confusion Matrix

and we use cross-validation wherever we find it useful...
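As a rough sketch of how cross-validation could be combined with these metrics, the block below runs `cross_val_score` once per metric. It uses synthetic data from `make_classification` as a stand-in (an assumption, so that the example is self-contained), and the names `X_demo`, `y_demo` and `cv_scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): 300 rows, 13 features, binary label
X_demo, y_demo = make_classification(n_samples=300, n_features=13,
                                     random_state=42)

clf = LogisticRegression(max_iter=1000)

# One mean score per metric, each averaged over 5 folds
cv_scores = {
    metric: cross_val_score(clf, X_demo, y_demo, cv=5, scoring=metric).mean()
    for metric in ["accuracy", "precision", "recall", "f1"]
}

for metric, score in cv_scores.items():
    print(f"{metric}: {score:.3f}")
```

Averaging over folds gives a steadier estimate than a single train/test split, which matters for a dataset of only a few hundred rows.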

Let's plot the ROC curve and compute the AUC score to check whether our Logistic Regression model works accurately. The curve plots the true positive rate against the false positive rate.

In [None]:
y_pred = lr_model.predict(X_test)

In [None]:
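# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_estimator(lr_model, X_test, y_test)
plt.title("ROC Curve and AUC");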

So, from this ROC curve we get an AUC of 0.93. The curve compares the true positive rate (an actual 1 predicted as 1) with the false positive rate (an actual 0 predicted as 1), so an AUC this close to 1 is good enough for a first try.
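One way to read an AUC value: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A minimal sketch of that pairwise view, where the helper `pairwise_auc` and the four example scores are invented for illustration:

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))

# Invented labels and predicted probabilities, for illustration only
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(y_true, y_score)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly: 0.75
```

This double loop is only practical for small inputs; for real data `sklearn.metrics.roc_auc_score` computes the same quantity.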

So, now we compute the ```confusion matrix``` to check how many false positives and false negatives we are getting.

In [None]:
confusion_matrix(y_test, y_pred)

So, from this we find that 25 results are true negatives, where the model correctly predicted 0, and 4 results are false positives, where we wanted 0 but got 1. Similarly, 3 results are false negatives, where we wanted 1 but got 0, and 29 times our model predicted the positive result correctly.
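The four counts quoted above are enough to work out the headline metrics by hand. A short sketch using the standard definitions (the variable names are chosen for illustration):

```python
# Counts quoted in the paragraph above
tn, fp, fn, tp = 25, 4, 3, 29

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# prints: accuracy=0.885 precision=0.879 recall=0.906 f1=0.892
```

The accuracy of about 0.885 agrees with the roughly 89% score reported for the Logistic Regression model.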

To make this easier to understand, let's plot it as a seaborn heatmap...

In [None]:
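conf_mat = confusion_matrix(y_test, y_pred)
# Rows are the true labels, columns are the predicted labels
labels = ['No-Disease', 'Disease']
sns.heatmap(conf_mat, annot=True, fmt='d', cbar=False, cmap='Blues', xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted'); plt.ylabel('True')
plt.title('Confusion Matrix');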

Let's use the ```classification report``` to look deeper into the confusion matrix

In [None]:
print(classification_report(y_test, y_pred))

Under construction....