I welcome any criticism or comments about the code; this is my first project on Kaggle.


# Predicting heart disease with Sklearn

This notebook uses Python and several helpful libraries to predict whether a patient has heart disease by training machine learning models on the dataset from https://www.kaggle.com/sulianova/cardiovascular-disease-dataset

## Data and features
Data description

There are 3 types of input features:

- Objective: factual information
- Examination: results of medical examination
- Subjective: information given by the patient

Features:

- Age: Objective Feature | age | int (days)
- Height: Objective Feature | height | int (cm) |
- Weight: Objective Feature | weight | float (kg) |
- Gender: Objective Feature | gender | categorical code | 1 - women, 2 - men
- Systolic blood pressure: Examination Feature | ap_hi | int |
- Diastolic blood pressure: Examination Feature | ap_lo | int |
- Cholesterol: Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
- Glucose: Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
- Smoking: Subjective Feature | smoke | binary |
- Alcohol intake: Subjective Feature | alco | binary |
- Physical activity: Subjective Feature | active | binary |
- Presence or absence of cardiovascular disease: Target Variable | cardio | binary | 1 = disease, 0 = no disease
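
Since `age` is stored in days and `gender` as a numeric code, it can help to derive readable versions of both before plotting. The snippet below is a minimal sketch on made-up rows; the column names `age_years` and `gender_label` are hypothetical and do not exist in the dataset, and the 365.25 days-per-year conversion is an approximation.

```python
import pandas as pd

# Toy rows that follow the column layout described above (values are made up).
sample = pd.DataFrame({
    "age": [18393, 20228, 23713],   # age in days
    "gender": [2, 1, 1],            # 1 = women, 2 = men
})

# Assumption: 365.25 days per year is a close enough approximation here.
sample["age_years"] = (sample["age"] / 365.25).round(1)

# Map the categorical gender codes to readable labels.
sample["gender_label"] = sample["gender"].map({1: "woman", 2: "man"})

print(sample)
```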

In [None]:
# imports for data analysis and plotting
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# import models to use from sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# import functions for model evaluation and tuning
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV
from sklearn.metrics import classification_report, confusion_matrix, f1_score, recall_score, precision_score, plot_roc_curve

### Import the data and view it

In [None]:
data = pd.read_csv('/kaggle/input/cardiovascular-disease-dataset/cardio_train.csv', sep = ';')
data

### Drop the `id` column, because it carries no useful information

In [None]:
data.drop(labels = 'id', axis = 1, inplace = True)
data

In [None]:
# Check how many samples of each class there are and plot it 
data['cardio'].value_counts().plot(kind = 'bar');

In [None]:
# check datatypes in our data
data.info()

In [None]:
# view information about our data
data.describe()

In [None]:
# use pd.crosstab to check the heart disease frequency according to gender and plot it
pd.crosstab(data['cardio'], data['gender']).plot(kind = 'bar')
plt.xlabel('0 = no heart disease, 1 = heart disease')
plt.legend(['woman','man'])
plt.show()

#### We can see that, in raw counts, more women than men have heart disease in this dataset
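
Raw counts also depend on how many women and men are in the data, so a rate per gender is a useful complement. The sketch below uses made-up toy rows (not the real dataset) to show how `pd.crosstab` with `normalize="columns"` turns counts into within-gender shares.

```python
import pandas as pd

# Made-up toy data, only to show the mechanics (not taken from the dataset).
toy = pd.DataFrame({
    "gender": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    "cardio": [1, 1, 1, 0, 0, 0, 1, 1, 0, 0],
})

# Raw counts: group 1 has more positive cases simply because it is larger.
counts = pd.crosstab(toy["cardio"], toy["gender"])

# normalize="columns" divides each column by its total, giving the share of
# each cardio value within each gender.
rates = pd.crosstab(toy["cardio"], toy["gender"], normalize="columns")

print(counts)
print(rates)
```

In this toy example the counts differ between the two groups while the rates are identical, which is why both views are worth checking.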

### View the distribution of age using a histogram (remember that age is in days)

In [None]:
data['age'].hist(bins = 40)

### Make a correlation matrix and plot it using seaborn 

In [None]:
corr_matrix = data.corr()
fig, ax = plt.subplots(figsize = (15,10))
ax = sns.heatmap(
    corr_matrix, 
    annot = True, 
    linewidths = 0.5,
    fmt = '0.2f', 
    cmap = 'GnBu'
)

### We can see a positive correlation between gender and whether the patient smokes. Let's look at it in a bar graph

In [None]:
pd.crosstab(data['smoke'], data['gender']).plot(kind = 'bar')
plt.xlabel('0 = no smoke, 1 = smoke')
plt.legend(['woman','man'])
plt.show()

#### In proportion, there are many more male smokers than female smokers 

## Creating models

In [None]:
# Split data into X and y
X = data.drop(labels = 'cardio', axis = 1)
y = data['cardio']

In [None]:
# split the data into training and test datasets
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
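
As an alternative to seeding NumPy's global random state with `np.random.seed`, `train_test_split` accepts a `random_state` argument, and `stratify` keeps the class ratio equal in both parts. The following is a sketch on made-up toy arrays (`X_toy` and `y_toy` are hypothetical names), not a change to the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up toy data: 100 samples, 3 features, a 70/30 class split.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 70 + [1] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    random_state=42,   # makes this split repeatable on its own
    stratify=y_toy,    # keeps the class ratio the same in both parts
)

print(y_tr.mean(), y_te.mean())
```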

#### In this notebook we are going to build, test and tune 2 sklearn machine learning models:
 - `RandomForestClassifier()`
 - `LogisticRegression()`

### Create and fit a stock random forest classifier

In [None]:
np.random.seed(7)
clf = RandomForestClassifier()
clf.fit(X_train, y_train)

In [None]:
# evaluate the stock model on test data
clf.score(X_test, y_test)

### Improve this score by tuning the hyperparameters with `RandomizedSearchCV()`

In [None]:
# grid of hyperparameters to tune
random_forest_grid = {
    'n_estimators': np.arange(10,1000, 50),
    'max_depth': [None, 3, 5, 10],
    'min_samples_split': np.arange(2,20,2),
    'min_samples_leaf': np.arange(1,20,2)
}
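
For a sense of scale, the grid above can be counted directly. This short sketch copies the same value ranges into a separate dictionary (the name `grid` is only for illustration) and compares the number of possible combinations with the 25 candidates a randomized search with `n_iter = 25` would sample.

```python
import numpy as np

# Same value ranges as the grid above.
grid = {
    "n_estimators": np.arange(10, 1000, 50),
    "max_depth": [None, 3, 5, 10],
    "min_samples_split": np.arange(2, 20, 2),
    "min_samples_leaf": np.arange(1, 20, 2),
}

# Number of distinct combinations an exhaustive grid search would have to try.
n_combinations = 1
for values in grid.values():
    n_combinations *= len(values)

n_iter = 25  # number of candidates sampled by the randomized search
print(n_combinations)
print(f"{n_iter / n_combinations:.2%} of the grid is sampled")
```

Only a small fraction of the grid is explored, so the reported best parameters are the best among the sampled candidates, not necessarily the best in the whole grid.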

In [None]:
np.random.seed(7)

random_search_rf = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions = random_forest_grid,
    cv = 5,
    n_iter = 25,
    verbose = True,
    n_jobs = -1
)

# Fit random search for random forest classifier
random_search_rf.fit(X_train, y_train)

In [None]:
# check which are the best params
random_search_rf.best_params_

In [None]:
# evaluate the model on the test data using the score method
random_search_rf.score(X_test, y_test)

### Evaluating the RandomForestClassifier model

In [None]:
# make some predictions to calculate evaluation metrics
y_preds = random_search_rf.predict(X_test)

In [None]:
y_preds

### ROC curve and Area under the curve
An AUC of 0.8 is acceptable, but not excellent.
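
The AUC can also be computed as a number without plotting. The sketch below uses synthetic data from `make_classification` and a plain logistic regression as stand-ins (neither is this notebook's data or tuned model) to show how `roc_curve` and `roc_auc_score` work from predicted probabilities. Note that `plot_roc_curve` was removed in recent scikit-learn releases; `RocCurveDisplay.from_estimator` is the replacement there.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the cardiovascular dataset.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# The ROC curve is built from predicted probabilities of the positive class,
# not from the hard 0/1 predictions.
proba = model.predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, proba)
auc = roc_auc_score(y_te, proba)

print(f"AUC = {auc:.3f}")

# In scikit-learn 1.0 and later, the same curve can be drawn with
# RocCurveDisplay.from_estimator(model, X_te, y_te).
```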

In [None]:
plot_roc_curve(random_search_rf, X_test, y_test);

### Making a confusion matrix and plotting it using `sns.heatmap`

In [None]:
# make a function for plotting the confusion matrix for later use
sns.set(font_scale = 1.5)
def conf_matrix(y_true, y_preds):
    fig, ax = plt.subplots(figsize = (5,5))
    ax = sns.heatmap(
        confusion_matrix(y_true,y_preds),
        annot=True,
        cbar = False,
        fmt = 'g'
    ) 
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')

In [None]:
# plotting the confusion matrix of our randomforestclassifier model
conf_matrix(y_test,y_preds)

The confusion matrix shows a high number of false-negative predictions. Let's look at the precision and recall for each class with a classification report:
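
To read the false negatives off a confusion matrix in code, the four cells can be unpacked with `ravel()`. This is a small sketch on made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical), showing how the counts relate to recall and precision.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only.
y_true_demo = [0, 0, 1, 1, 1]
y_pred_demo = [0, 1, 0, 0, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

recall = tp / (tp + fn)        # share of actual positives that were found
precision = tp / (tp + fp)     # share of positive predictions that were right

print(tn, fp, fn, tp)
print(recall, precision)
```

A high false-negative count lowers recall for the disease class, which matters here because a false negative is a patient with heart disease predicted as healthy.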

### Classification report

In [None]:
print(classification_report(y_test, y_preds))

### Evaluation metrics calculated using cross validation

In [None]:
# Check the best params for the RandomForestClassifier
random_search_rf.best_params_

In [None]:
# Create a RandomForestClassifier instance with the best params found by the search
rf_clf = RandomForestClassifier(
    n_estimators = 910,
    min_samples_split = 4,
    min_samples_leaf = 15,
    max_depth = 10
)
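
Typing the best parameters in by hand works, but the values go stale if the search is re-run. A possible alternative is to unpack `best_params_` directly, sketched below on synthetic data with a deliberately tiny search (the names `search` and `best_rf` are hypothetical and the data is not this notebook's).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data and a deliberately tiny search so it runs quickly.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [10, 20], "max_depth": [3, 5, None]},
    n_iter=3,
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)

# Unpack the winning parameters instead of typing them in by hand.
best_rf = RandomForestClassifier(**search.best_params_)

print(search.best_params_)
```

With the default `refit = True`, the fitted search object also exposes `best_estimator_`, which is already trained on the data passed to `fit`.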

In [None]:
# Use cross_validation and the scoring parameter to evaluate the classifier and make a function for later use
def cv_classification_report(classifier, X, y):
    
    cv_accuracy = cross_val_score(classifier, X, y, scoring = 'accuracy', n_jobs = -1)
    cv_accuracy = np.mean(cv_accuracy)
    
    cv_precision = cross_val_score(classifier, X, y, scoring = 'precision', n_jobs = -1)
    cv_precision = np.mean(cv_precision)
    
    cv_recall = cross_val_score(classifier,X,y,scoring = 'recall', n_jobs = -1)
    cv_recall = np.mean(cv_recall)
    
    cv_f1 = cross_val_score(classifier, X, y, scoring = 'f1', n_jobs = -1)
    cv_f1 = np.mean(cv_f1)
    
    return {
    'Accuracy': cv_accuracy,
    'Precision': cv_precision,
    'Recall': cv_recall,
    'F1 Score': cv_f1
    }
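
The function above refits the model once per metric. As a sketch of an alternative, `cross_validate` accepts a list of scorers and evaluates all of them on the same folds in one pass. The example runs on synthetic data with a logistic regression as a stand-in, and `cv_summary` is a hypothetical name.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data, not the cardiovascular dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

scoring = ["accuracy", "precision", "recall", "f1"]

# cross_validate fits each fold once and scores it with every metric,
# instead of refitting the model once per metric.
results = cross_validate(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5, scoring=scoring
)

# Average each metric over the folds.
cv_summary = {name: results[f"test_{name}"].mean() for name in scoring}
print(cv_summary)
```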

In [None]:
# use the function
cv_metrics = cv_classification_report(rf_clf, X, y)

In [None]:
# view the cross-validated metrics
cv_metrics

In [None]:
# save the metrics in a pandas dataframe and plot it in a bar graph
# the variable name is for 'cross-validated random forest classifier metrics'
cv_rfc_metrics_df = pd.DataFrame(cv_metrics, index = [0]) 

sns.set(font_scale = 1.3)

cv_rfc_metrics_df.T.plot.bar(title = 'Cross-validated random forest classifier metrics', legend = False)
plt.yticks(np.linspace(0,1,11));

### Create and fit a stock LogisticRegression classifier

#### Preprocessing the data
Models trained with gradient-based solvers, such as logistic regression, converge more reliably when the features are scaled
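
One common way to keep the scaling step tied to the training data is to wrap the scaler and the model in a pipeline, so the scaler's statistics are learned only from the rows passed to `fit`. The following is a minimal sketch on synthetic data, assuming `make_pipeline` from `sklearn.pipeline`; the name `scaled_lr` is hypothetical and this is not the approach used in the cells of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the cardiovascular dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# The scaler's mean and standard deviation are learned from the training rows only.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr, y_tr)

# The same statistics are reused unchanged when scoring the held-out rows.
test_accuracy = scaled_lr.score(X_te, y_te)
print(test_accuracy)
```

A pipeline like this can also be passed to `cross_val_score` or `RandomizedSearchCV`, in which case the scaler is refit inside every fold.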

In [None]:
# create an instance of the scaler
std = StandardScaler()

# use StandardScaler to scale X
X_scaled = std.fit_transform(X)

In [None]:
# split into train and test datasets (the s in the variable names is for scaled)
X_train_s, X_test_s, y_train, y_test = train_test_split(X_scaled, y)

In [None]:
lr_stock = LogisticRegression()
lr_stock.fit(X_train_s, y_train)

In [None]:
# Evaluate the stock model on the test data using the score method
lr_stock.score(X_test_s, y_test)

### Let's improve the model by tuning the hyperparameters using RandomizedSearchCV

In [None]:
# grid with hyperparameters to tune
logistic_regression_grid = {
    'C': np.logspace(-4,4,20),
    'solver': ['liblinear']
}

rs_logistic_regression = RandomizedSearchCV(
    LogisticRegression(),
    param_distributions = logistic_regression_grid,
    cv = 5,
    n_iter = 20,
    verbose = True,
    n_jobs = -1
)

# Fit the random hyperparameter search for logistic regression
rs_logistic_regression.fit(X_train_s, y_train)

In [None]:
# check the best hyperparameters
rs_logistic_regression.best_params_

In [None]:
# evaluate the model on the test data using the score method
rs_logistic_regression.score(X_test_s, y_test)

### Evaluating the Logistic regression model

In [None]:
# make a logistic regression classifier model with the best params
lr_clf = LogisticRegression(
    solver = 'liblinear',
    C =  29.763514416313132
)

# Fit the model
lr_clf.fit(X_train_s, y_train)

In [None]:
# make predictions on test data to evaluate
lr_y_preds = lr_clf.predict(X_test_s)
lr_y_preds

### ROC curve and Area under the curve

In [None]:
plot_roc_curve(lr_clf, X_test_s, y_test);

### Confusion matrix

In [None]:
conf_matrix(y_test,lr_y_preds)

### Classification Report 

In [None]:
print(classification_report(y_test,lr_y_preds))

In [None]:
# cross-validate on the full scaled dataset so the metrics are comparable with the random forest ones
cv_lr_metrics = cv_classification_report(lr_clf, X_scaled, y)

In [None]:
cv_lr_metrics

In [None]:
# save the metrics in a pandas dataframe and plot it in a bar graph
cv_lr_metrics_df = pd.DataFrame(cv_lr_metrics, index = [0]) 

sns.set(font_scale = 1.3)

cv_lr_metrics_df.T.plot.bar(title = 'Cross-validated logistic regression classifier metrics', legend = False)
plt.yticks(np.linspace(0,1,11));

## Model comparison
Now that we have 2 classifiers, a random forest and a logistic regression, we can compare them. The cross-validated metrics for each model are stored in 2 variables:

In [None]:
# evaluation metrics for the random forest classifier
cv_rfc_metrics_df

In [None]:
# evaluation metrics for the logistic regression classifier
cv_lr_metrics_df

We can see that the two models are very close, but the RandomForestClassifier beats the LogisticRegression on every metric.
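
A side-by-side table makes this kind of comparison easier to read than two separate outputs. The sketch below shows the mechanics with placeholder numbers that are NOT results from this notebook; `rf_metrics`, `lr_metrics`, and `comparison` are hypothetical names.

```python
import pandas as pd

# Placeholder numbers for illustration only; they are NOT results from this notebook.
rf_metrics = {"Accuracy": 0.80, "Precision": 0.82, "Recall": 0.78, "F1 Score": 0.80}
lr_metrics = {"Accuracy": 0.79, "Precision": 0.81, "Recall": 0.77, "F1 Score": 0.79}

# One column per model, one row per metric.
comparison = pd.DataFrame({
    "random forest": rf_metrics,
    "logistic regression": lr_metrics,
})

# Positive values mean the random forest scored higher on that metric.
comparison["difference"] = comparison["random forest"] - comparison["logistic regression"]

print(comparison)
```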

The RandomForestClassifier model is still in a variable:

In [None]:
rf_clf