### Importing Libraries

In [226]:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# To tune model, get different metric scores and split data
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    classification_report,
    make_scorer,
)

from sklearn.tree import DecisionTreeClassifier

# To build a logistic regression model
from sklearn.linear_model import LogisticRegression

# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# To suppress the warnings
import warnings

warnings.filterwarnings("ignore")

# This will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black


**1. Load the cardiac dataset. Identify the correct percentage of the positive (Yes) and negative (No) class distribution.**

In [227]:
import pandas as pd

cardiac = pd.read_csv("./datasets/Cardiac.csv")
df = cardiac.copy()
df["UnderRisk"].value_counts(normalize=True)

no     0.786277
yes    0.213723
Name: UnderRisk, dtype: float64


**2. Prepare the data according to the following instructions in a sequential manner.**
1. Encode target variable (Replace yes with 1 and no with 0)
2. Split the data into temp and test in the 80:20 ratio. Use the parameter ‘stratify’ while splitting the data
3. Split the temp set into train and validation in the 75:25 ratio. Use the parameter ‘stratify’ while splitting the data
4. Create dummies for X_train, X_val, and X_test. Use drop_first = True while creating dummies. 

In [228]:
# split into X and Y
X = df.drop(["UnderRisk"], axis=1)
y = df.UnderRisk
print("The shape of X is", X.shape)

The shape of X is (889, 12)



In [229]:
# y.apply(lambda x: 1 if x == "yes" else 0)

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(y)
le.transform(["yes", "no"])

array([1, 0])


In [230]:
# Splitting data into training, validation and test set:
# first we split data into 2 parts, say temporary and test

X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)


In [231]:
# then we split the temporary set into train and validation

X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)

(533, 12) (178, 12) (178, 12)
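
The shapes above follow from applying the two ratios in sequence: 20% of the full data is held out for test, then 25% of the remainder for validation, which leaves roughly 60% for training. As a rough check, assuming `train_test_split` rounds the held-out share up to a whole row (which is consistent with the output shown), the arithmetic can be sketched in plain Python. `split_sizes` is a hypothetical helper written for this illustration, not an sklearn function.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (rows kept, rows held out) for a fractional test_size.

    Assumption: the held-out count is rounded up to a whole row.
    """
    n_held_out = math.ceil(n_rows * test_size)
    return n_rows - n_held_out, n_held_out


# 889 rows in total, as reported by X.shape
n_temp, n_test = split_sizes(889, 0.20)      # temp vs. test
n_train, n_val = split_sizes(n_temp, 0.25)   # train vs. validation

print(n_train, n_val, n_test)
```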



In [232]:
# Creating Dummy Variables

X_train = pd.get_dummies(X_train, drop_first=True)
X_val = pd.get_dummies(X_val, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
print(X_train.shape, X_val.shape, X_test.shape)

(533, 13) (178, 13) (178, 13)
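
Here all three sets end up with 13 columns, but encoding each set separately only gives matching columns when every category appears in every split. A minimal sketch of one common safeguard, aligning another set to the training columns with `reindex`, is shown below. The toy frames and their column names (`Age`, `Smoker`, `Diet`) are made up for illustration and are not taken from the cardiac dataset.

```python
import pandas as pd

# Hypothetical toy data: the validation frame lacks "yes" and "vegan"
train = pd.DataFrame(
    {
        "Age": [40, 55, 61],
        "Smoker": ["yes", "no", "yes"],
        "Diet": ["veg", "mixed", "vegan"],
    }
)
val = pd.DataFrame(
    {"Age": [47, 52], "Smoker": ["no", "no"], "Diet": ["veg", "mixed"]}
)

train_d = pd.get_dummies(train, drop_first=True)
val_d = pd.get_dummies(val, drop_first=True)

# Give the validation set exactly the training columns, in the same order;
# dummy columns that never occurred in validation are filled with 0
val_d = val_d.reindex(columns=train_d.columns, fill_value=0)

print(list(train_d.columns))
print(list(val_d.columns))
```

This only patches missing columns. If a split lacks the category that `drop_first` removes in the training set, the encodings can still disagree, so creating the dummies before splitting is a safer alternative when the instructions allow it.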



**3. Here, we are attempting to predict whether a person will have a cardiac arrest based on their medical background. Which of the following would be the most appropriate evaluation metric for this prediction?**

Ans: Recall

In medical cases, correctly identifying the patients who actually have the disease is more important than correctly clearing the patients who do not. A missed case (false negative, FN) is costlier than a false alarm (false positive, FP), so reducing the number of FN is more important than reducing FP. Recall, defined as TP / (TP + FN), penalizes exactly those missed cases, so it is the right metric.
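
To make the metric concrete, here is a small worked example in plain Python. The labels are made up for illustration (they are not the cardiac data), and the counts are computed by hand rather than with sklearn so that each term is visible.

```python
# Toy labels: 1 = at risk, 0 = not at risk (illustrative values only)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # at risk, flagged
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # at risk, missed
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy, flagged
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy, cleared

accuracy = (tp + tn) / len(pairs)
recall = tp / (tp + fn)
precision = tp / (tp + fp)

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
```

In this toy case the accuracy looks moderate even though three of the four at-risk patients are missed, which is the situation recall is meant to expose.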

**4. Now that we have decided on our evaluation metric, we can go ahead to build models. Since cardiac arrest prediction is a classification problem, we can start with logistic regression.**

Train the models as per the following instructions.

1. Build a logistic regression model on the train set using the sklearn implementation with default parameters and random_state=1, and check the performance of the model on the train set.
2. Oversample the train set using SMOTE with the parameters listed below, build a logistic regression model on the oversampled data, and check the performance of the model on the oversampled train set.

SMOTE parameters: sampling_strategy=1, k_neighbors=5, random_state=1

Which of the following statements is true when comparing the performance of the two models on the train data and the oversampled train data?

Ans: Accuracy of the model on an oversampled set has decreased whereas recall and precision of the oversampled data have increased.

In [233]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf


In [234]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")


In [235]:
from sklearn.linear_model import LogisticRegression

# Fit the model on original data i.e. before upsampling
lr = LogisticRegression(random_state=1)
lr.fit(X_train, y_train)

# Calculating different metrics on train set
log_reg_model_train_perf = model_performance_classification_sklearn(
    lr, X_train, y_train
)
print("Training performance:")
log_reg_model_train_perf

Training performance:


   Accuracy    Recall  Precision        F1
0  0.787992  0.035088   0.571429  0.066116



In [236]:
# Confusion matrix for the logistic regression model fitted above
confusion_matrix_sklearn(lr, X_val, y_val)

In [None]:
# SMOTE to upsample smaller class

print("Before UpSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before UpSampling, counts of label '0': {} \n".format(sum(y_train == 0)))

sm = SMOTE(
    sampling_strategy=1, k_neighbors=5, random_state=1
)  # Synthetic Minority Over Sampling Technique
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)

print("After UpSampling, counts of label '1': {}".format(sum(y_train_over == 1)))
print("After UpSampling, counts of label '0': {} \n".format(sum(y_train_over == 0)))
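
SMOTE does not duplicate minority rows. It creates synthetic rows by interpolating between a minority sample and one of its `k_neighbors` nearest minority neighbours. The sketch below shows only that interpolation step in plain Python; the neighbour search is omitted, the two points are invented, and `interpolate` is a hypothetical helper rather than part of the imblearn API.

```python
import random


def interpolate(sample, neighbour, rng):
    """Return a synthetic point on the segment between sample and neighbour."""
    gap = rng.random()  # random fraction in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]


rng = random.Random(1)
minority_sample = [1.0, 2.0]      # an existing minority-class row (made up)
minority_neighbour = [3.0, 6.0]   # one of its nearest minority neighbours

synthetic = interpolate(minority_sample, minority_neighbour, rng)
print(synthetic)
```

Because every synthetic row lies between two real minority rows, metrics computed on the oversampled train set describe a partly artificial sample, which is worth keeping in mind when comparing them with metrics on the original data.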

In [None]:
# fit model on upsampled data
log_reg_over = LogisticRegression(random_state=1)

# Training the basic logistic regression model with training set
log_reg_over.fit(X_train_over, y_train_over)

# Calculating different metrics on train set
log_reg_over_train_perf = model_performance_classification_sklearn(
    log_reg_over, X_train_over, y_train_over
)
print("Training performance:")
log_reg_over_train_perf

**5. Let's try some other algorithms before settling on a final model.**

Train a bagging classifier on the oversampled data using the sklearn implementation with default parameters and random_state=1 and check the model performance on the validation set.
Which of the following options gives the correct range of the evaluation metrics?

Ans:
Recall: In a range of 0.55 to 0.65 
Precision: In a range of 0.20 to 0.30

In [None]:
from sklearn.ensemble import BaggingClassifier

# Bagging Classifier - the default base estimator is a decision tree
bagging_estimator_over = BaggingClassifier(random_state=1)
bagging_estimator_over.fit(X_train_over, y_train_over)

# Calculating different metrics on validation set
bag_over_val_perf = model_performance_classification_sklearn(
    bagging_estimator_over, X_val, y_val
)
print("Validation performance:")
bag_over_val_perf

**6. Let’s try one more model and see how our model performs. Along with the evaluation metrics, we should check how many observations are correctly predicted by our model.**

Train a random forest classifier on the original training set using the sklearn implementation with default parameters and random_state=1, and assess the model performance on the training set.

Identify the correct range of number of cases that are correctly predicted as ‘yes’ by the random forest classifier:

Ans: 15 to 25

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Train the random forest classifier
rf_estimator = RandomForestClassifier(random_state=1)
rf_estimator.fit(X_train, y_train)

In [None]:
# Calculating different metrics on train set
rf_estimator_train_perf = model_performance_classification_sklearn(
    rf_estimator, X_train, y_train
)
print("Training performance:")
rf_estimator_train_perf

In [None]:
# creating confusion matrix
confusion_matrix_sklearn(rf_estimator, X_train, y_train)

**7. One model might not give the desired outcome, so we can try different models and compare their performance. Let’s try different models.**

Train the models as per the following instructions. 
- Train Bagging classifier using BaggingClassifier(random_state=1)
- Train Random forest classifier using RandomForestClassifier(random_state=1)
- Train Logistic regression using LogisticRegression(random_state=1)
- Train Decision tree using DecisionTreeClassifier(random_state=1)

Loop through all the above models to get the mean cross-validated scores. Use the following code for the CV results on **over sampled data** - 
```
scoring = "recall"
kfold = StratifiedKFold(
    n_splits=5, shuffle=True, random_state=1
)  # Setting number of splits equal to 5

cv_result = cross_val_score(
    estimator=model, X=X_train_over, y=y_train_over, scoring=scoring, cv=kfold
)
```
Which of the following statements are true about the cross-validated recall scores on the oversampled data? 
- A. The average cross-validated recall score for the bagging classifier and the random forest is approximately the same. 
- B. The difference between the CV recall scores for logistic regression and decision tree is in the range of 1-5. 
- C. The CV recall score for logistic regression lies in the range of 0.75 to 0.90
- D. The CV recall score for decision trees lies in the range of 0.75 to 0.90

Ans: A, C, D

In [None]:
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("LR", LogisticRegression(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))

results = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models

# loop through all models to get the mean cross validated score
print("\nCross-Validation Performance:\n")
for name, model in models:
    scoring = "recall"

    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5

    cv_result = cross_val_score(
        estimator=model, X=X_train_over, y=y_train_over, scoring=scoring, cv=kfold
    )

    results.append(cv_result)

    names.append(name)

    print("{}: {}".format(name, cv_result.mean() * 100))

**8. Building the model with default parameters might not give a satisfactory outcome. Let’s try to identify the best combination of the hyperparameters.**

Train an AdaBoost classifier using the oversampled data and tune the model using random search. 

Use the following code to define the parameters - 

```
param_grid = {
    "n_estimators": np.arange(10, 110, 10),
    "learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_grid,
    n_jobs=-1,
    n_iter=50,
    scoring=scorer,
    cv=5,
    random_state=1,
)
```
Which of the following is the best combination of the hyperparameters obtained on tuning the Adaboost classifier with oversampled data?

Ans: The best combination of parameters is {'n_estimators': 50, 'learning_rate': 0.01, 'base_estimator': DecisionTreeClassifier(max_depth=1, random_state=1)}
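
It helps to know how much of the search space a random search with `n_iter=50` actually covers. The sketch below counts the combinations in the grid defined above using plain Python; the tree depths stand in for the three `DecisionTreeClassifier` candidates, and the variable names are chosen only for this illustration.

```python
from itertools import product

n_estimators = list(range(10, 110, 10))    # 10, 20, ..., 100
learning_rate = [0.1, 0.01, 0.2, 0.05, 1]
max_depth = [1, 2, 3]                      # one entry per candidate base tree

# Every combination an exhaustive grid search would have to evaluate
all_combinations = list(product(n_estimators, learning_rate, max_depth))

n_iter = 50
coverage = n_iter / len(all_combinations)

print(len(all_combinations), f"{coverage:.0%}")
```

Under these settings the random search tries about a third of the grid, so the reported best combination is the best among the sampled candidates rather than a guaranteed optimum over the full grid.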

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RandomizedSearchCV

model = AdaBoostClassifier(random_state=1)

param_grid = {
    "n_estimators": np.arange(10, 110, 10),
    "learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
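    # Note: scikit-learn 1.2 renamed this parameter to "estimator";
    # "base_estimator" was removed in scikit-learn 1.4
    "base_estimator": [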
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}

# Type of scoring used to compare parameter combinations
scorer = make_scorer(recall_score)

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_grid,
    n_jobs=-1,
    n_iter=50,
    scoring=scorer,
    cv=5,
    random_state=1,
)

randomized_cv.fit(X_train_over, y_train_over)
print(randomized_cv.best_score_)

In [None]:
print(randomized_cv.best_params_)

**9. We can further check how our model performs on oversampled and undersampled data.**

Train the Adaboost classifier with undersampled and oversampled data. Assess the model performance for Adaboost with oversampled data on the oversampled train data and for Adaboost with undersampled data on the undersampled train data. 

Which of the following statements is true about the performance of the model? 

Ans: The performance of the model trained with undersampled data is better than the performance of the model trained with oversampled data.

In [None]:
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)


ada_under = AdaBoostClassifier(random_state=1)
ada_under.fit(X_train_un, y_train_un)
model_performance_classification_sklearn(ada_under, X_train_un, y_train_un)
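
Random undersampling balances the classes by discarding majority rows until both classes have the same count. A minimal plain-Python sketch of that idea follows. `random_undersample` is a hypothetical helper, not the imblearn implementation, and the class counts (114 positive, 419 negative) are an assumption inferred from a stratified split of a roughly 21% positive class over 533 training rows, not printed output.

```python
import random
from collections import Counter


def random_undersample(labels, rng):
    """Return sorted row indices that keep an equal number of rows per class."""
    counts = Counter(labels)
    n_keep = min(counts.values())  # size of the smallest class
    kept = []
    for cls in counts:
        cls_indices = [i for i, y in enumerate(labels) if y == cls]
        kept.extend(rng.sample(cls_indices, n_keep))
    return sorted(kept)


# Assumed training class counts (see the note above)
labels = [1] * 114 + [0] * 419

kept = random_undersample(labels, random.Random(1))
balanced_counts = Counter(labels[i] for i in kept)

print(len(labels), "->", len(kept), dict(balanced_counts))
```

Under these assumed counts the undersampled train set is far smaller than the oversampled one, so train-set metrics for the two models are computed on very different sample sizes.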

In [None]:
ada_over = AdaBoostClassifier(random_state=1)
ada_over.fit(X_train_over, y_train_over)
model_performance_classification_sklearn(ada_over, X_train_over, y_train_over)

**10. It is important to understand the features that are critical in making the right predictions. Let’s find out which features are important for our model.**

Plot the feature importance of the variables for the Adaboost classifier trained with undersampled data. 

Which of the following are the most important features?

Ans: Gender and HighBP

In [None]:
import matplotlib.pyplot as plt

model1 = AdaBoostClassifier(random_state=1)
model1.fit(X_train_un, y_train_un)

feature_names = X_train_un.columns
importances = model1.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()