In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Fetal Health Classification

This kernel shows the process I designed to build a model that classifies the outcome of a Cardiotocogram (CTG) exam, which reflects the well-being of the fetus. According to the exercise's notes, the dataset is highly imbalanced, so **accuracy is not a good metric to measure the model's performance**. The notes also recommend stratifying the data when splitting.

The recommended metrics to measure the performance are:
* Area under the ROC curve
* F1 score
* Area under the Precision-Recall curve

I will use the F1 score and the area under the ROC curve.
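
To see why accuracy can mislead on imbalanced labels, here is a minimal from-scratch sketch. The 80/15/5 class counts are hypothetical (they are not taken from this dataset), and `f1_for_class` and `macro_f1` are illustrative helpers rather than library functions. A "model" that always predicts the majority class scores well on accuracy but poorly on macro F1.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: treat the class F1 as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)


# Hypothetical imbalanced labels: 80 / 15 / 5 samples for classes 1, 2, 3
y_true = [1] * 80 + [2] * 15 + [3] * 5
y_pred = [1] * 100  # always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.2f}")                  # accuracy = 0.80
print(f"macro F1 = {macro_f1(y_true, y_pred):.2f}")  # macro F1 = 0.30
```

The two minority classes each contribute an F1 of 0, which drags the macro average down even though 80% of the predictions are correct.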

Because the author suggests stratifying the data when splitting it into training and testing sets, I will run the experiment on both stratified and non-stratified splits to compare the results.

This notebook is divided as follows:

1. Exploratory data analysis
    1. Reading the data
    2. Exploring the data
2. Data split and scale
3. Create the ML models with their grid search parameters
4. Tune and Fit the models, with grid search and predict on validation set
5. Choose the winner model

## 1. Exploratory data analysis
### 1.1. Reading the data

In [None]:
# Let's start by importing the core libraries
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

In [None]:
# Now I'm setting the configuration to display the dataframes' info
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

In [None]:
# I will use the following two variables to have better visibility when printing the info I want to display by adding separators. (print(divider_length * divider_shape))
divider_length = 80
divider_shape = '='

In [None]:
# reading the dataset
dataset = pd.read_csv('../input/fetal-health-classification/fetal_health.csv')

### 1.2. Exploring the data

The goal of this section is to understand the data we read in the previous section and determine whether additional operations, such as cleaning or imputation, are needed.

In [None]:
print(divider_length * divider_shape)
print('ANALYSING Dataset')
print('Size: ' + str(dataset.shape))
print(divider_length * divider_shape)
print('Dataset columns: ')
print(dataset.columns)
print(divider_length * divider_shape)
print('Dataset head: ')
print(dataset.head(10))
print(divider_length * divider_shape)
print('Dataset info: ')
dataset.info()

- We are dealing with a dataset of 2126 rows and 22 columns
- The fetal_health column (last column in the dataset) holds the dataset labels. The rest of the columns will be used as the features
- All of the columns' data types are float64 (numeric)
- There are no null values in the dataset! (which is very unusual, but good news for a data scientist)

In [None]:
# double-checking that there are no null values in the dataset
print(divider_length * divider_shape)
print('Number of missing values per column in the dataset:')
train_cols_with_nulls = dataset.isnull().sum()
print(train_cols_with_nulls[train_cols_with_nulls > 0])


Let's check how imbalanced the dataset is:

In [None]:
print(divider_length * divider_shape)
print('Checking how imbalanced the dataset is:')
print(dataset['fetal_health'].value_counts())

The dataset is heavily imbalanced towards label 1.

Now let's check whether there are duplicate rows:

In [None]:
print(divider_length * divider_shape)
print('Check number of duplicate rows:')
print(dataset.duplicated().sum())

There are 13 duplicate rows. Let's now check the duplicates per label:

In [None]:
print(divider_length * divider_shape)
print('Label counts among duplicates: ')
print(dataset.loc[dataset.duplicated(), 'fetal_health'].value_counts())


Let's remove the duplicates. Given the small number of duplicates per label, the dataset is expected to remain imbalanced after removing them.

In [None]:
#%% Removing duplicates
dataset.drop_duplicates(inplace=True)
# Given the low number of duplicates per label, the dataset is still imbalanced
print(divider_length * divider_shape)
print('Checking how imbalanced the dataset is after removing duplicates:')
print(dataset['fetal_health'].value_counts())


As expected, the dataset without dups is still imbalanced.

There is no need to impute data since there are no null values in the dataset. Let's proceed to the next step.

## 2. Data split and scale

The exercise's notes suggest using 30% of the data for testing and stratifying the split given the imbalance. I will work with two splits, one stratified and one not.
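
To illustrate what a stratified split does conceptually, here is a from-scratch sketch. It is not how scikit-learn implements `stratify=y`; `stratified_split` is a hypothetical helper and the 100/20/10 label counts are made up. The idea is simply to sample the same fraction from every class so that the test set keeps the class proportions of the full data.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.3, seed=0):
    """Return (train_idx, test_idx), sampling test_size of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # random order within the class
        n_test = round(len(indices) * test_size)   # same fraction for every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Hypothetical labels: 100 / 20 / 10 samples for classes 1, 2, 3
labels = [1] * 100 + [2] * 20 + [3] * 10
train_idx, test_idx = stratified_split(labels)

print('train:', sorted(Counter(labels[i] for i in train_idx).items()))
print('test: ', sorted(Counter(labels[i] for i in test_idx).items()))
```

With a plain random split, a rare class can end up over- or under-represented in the test set by chance, which makes the per-class metrics noisier.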

In [None]:
# Dividing the data into features (X) and labels (y)
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values


In [None]:
# Splitting the dataset into training and validation sets. Using the same random_state makes both splits reproducible; note that the _no_stra and _stra partitions will generally not contain the same rows
from sklearn.model_selection import train_test_split
X_train_no_stra, X_validation_no_stra, y_train_no_stra, y_validation_no_stra = train_test_split(X, y, test_size=0.3, random_state=0)
X_train_stra, X_validation_stra, y_train_stra, y_validation_stra = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

In [None]:
# Let's save the dataset classes for future use
classes = np.unique(y)
print(divider_length * divider_shape)
print(classes)

Let's use sklearn's StandardScaler to scale our features.

In [None]:
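# Feature Scaling: fit on the training data only, then apply the same transformation to the validation data
from sklearn.preprocessing import StandardScaler
sc_no_stra = StandardScaler()
X_train_no_stra = sc_no_stra.fit_transform(X_train_no_stra)
X_validation_no_stra = sc_no_stra.transform(X_validation_no_stra)

# A separate scaler for the stratified split, so each split keeps its own fitted statistics
sc_stra = StandardScaler()
X_train_stra = sc_stra.fit_transform(X_train_stra)
X_validation_stra = sc_stra.transform(X_validation_stra)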


Now the data is ready to be fed into the ML models.
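
As a rough illustration of the scaling step, the sketch below standardizes a single feature by hand. The helper names and the feature values are hypothetical. It mirrors the `fit_transform` / `transform` pattern: the mean and the (population) standard deviation are learned from the training values only and then reused for the validation values, so no information from the validation set leaks into the scaling.

```python
from statistics import fmean, pstdev


def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    return fmean(column), pstdev(column)


def transform(column, mean, std):
    """Standardize values using previously learned statistics."""
    return [(x - mean) / std for x in column]


# Hypothetical values of a single feature
train_feature = [1.0, 2.0, 3.0, 4.0, 5.0]
validation_feature = [6.0]

mean, std = fit_scaler(train_feature)  # statistics come from the training data only
train_scaled = transform(train_feature, mean, std)
validation_scaled = transform(validation_feature, mean, std)

print(train_scaled)       # centered on 0 with unit standard deviation
print(validation_scaled)  # scaled with the training statistics
```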

## 3. Create the ML models with their grid search parameters

I will use the following models: Logistic Regression, K-Nearest Neighbors, Support Vector Machines, Decision Trees, Random Forest and CatBoost.


In [None]:
# Let's import the required libraries

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from catboost import CatBoostClassifier

Now, let's create the models together with the grid parameters that will be used to tune each one. I'm saving all of the models in a list that I will traverse in the next steps.

In [None]:
all_models = [
    {'model': LogisticRegression(random_state=0),
      'grid_params': {'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
                     },
    },
    {'model': KNeighborsClassifier(),
     'grid_params': {'n_neighbors': [5, 10, 15],
                      'metric': ['euclidean', 'manhattan', 'chebyshev', 'minkowski'],
                   }
    },
    {'model': SVC(random_state = 0, probability=True),
      'grid_params': [{'C': [0.25, 0.5, 0.75, 1],
                      'kernel': ['linear'],
                      },
                      {'C': [0.25, 0.5, 0.75, 1],
                      'kernel': ['poly'],
                      'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
                      'degree': [2, 3, 4, 5],
                      },
                      {'C': [0.25, 0.5, 0.75, 1],
                      'kernel': ['rbf'],
                      'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
                      },
                      {'C': [0.25, 0.5, 0.75, 1],
                      'kernel': ['sigmoid'],
                      'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
                      },
                    ]
    },
    {'model': DecisionTreeClassifier(random_state = 0),
      'grid_params': {'criterion': ['gini', 'entropy'],
                      'splitter': ['best', 'random'],
                      'max_depth': [5, 10, 20, None],
                      'max_features': ['sqrt', 'log2', None],   # 'auto' (an alias of 'sqrt' here) was removed in scikit-learn 1.3
                    }
    },
    {'model': RandomForestClassifier(random_state = 0),
      'grid_params': {'n_estimators': [100, 250, 500, 1000],
                      'criterion': ['gini', 'entropy'],
                      'max_depth': [5, 10, 20, None],
                      'max_features': ['sqrt', 'log2', None],   # 'auto' (an alias of 'sqrt' here) was removed in scikit-learn 1.3
                    }
    },
    {'model': CatBoostClassifier(early_stopping_rounds=100, verbose=False, random_state=0, loss_function='MultiClass'),
     'grid_params': {'learning_rate': [0.001, 0.01, 0.1, 0.25, 0.5, 0.75, 1],
                     'iterations': [100, 500, 1000],
                    }
    }
]

## 4. Tune and Fit the models, with grid search and predict on validation set

Now it is time to traverse the model list and run the grid search for each model. I will save each grid search's best result in the final_scores list.
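
To get a feel for the cost of this step, the number of fits can be estimated from the grids themselves. The sketch below uses a hypothetical helper, `count_candidates`, on the SVC and KNN grids defined earlier: an exhaustive grid search tries every combination in each dictionary and fits each one once per fold.

```python
from math import prod


def count_candidates(param_grid):
    """Number of parameter combinations in a dict grid or a list of dict grids."""
    grids = param_grid if isinstance(param_grid, list) else [param_grid]
    return sum(prod(len(values) for values in grid.values()) for grid in grids)


gammas = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
svc_grid = [
    {'C': [0.25, 0.5, 0.75, 1], 'kernel': ['linear']},
    {'C': [0.25, 0.5, 0.75, 1], 'kernel': ['poly'], 'gamma': gammas, 'degree': [2, 3, 4, 5]},
    {'C': [0.25, 0.5, 0.75, 1], 'kernel': ['rbf'], 'gamma': gammas},
    {'C': [0.25, 0.5, 0.75, 1], 'kernel': ['sigmoid'], 'gamma': gammas},
]
knn_grid = {'n_neighbors': [5, 10, 15],
            'metric': ['euclidean', 'manhattan', 'chebyshev', 'minkowski']}

n_folds = 10
for name, grid in [('SVC', svc_grid), ('KNeighborsClassifier', knn_grid)]:
    n_candidates = count_candidates(grid)
    # each candidate is fitted once per fold; the final refit adds one more fit
    print(name, n_candidates, 'candidates ->', n_candidates * n_folds, 'cross-validation fits')
```

Since every model is tuned twice here (once per split), the totals double, which is worth keeping in mind before enlarging any grid.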

In [None]:
#%% create grid search and fit the models
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from mlxtend.plotting import plot_confusion_matrix

average = 'macro'                # Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
multi_class = 'ovo'              # Computes the average AUC of all possible pairwise combinations of classes. Insensitive to class imbalance when average == 'macro'
scoring = ['f1_macro', 'roc_auc_ovo']  # The two scores that I will use according to the exercise's notes

kfolds = KFold(n_splits=10)      # 10 folds
n_jobs = -1                      # use all available cores

final_scores = []

from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, classification_report, roc_auc_score, ConfusionMatrixDisplay

for model in all_models:        # traverse the model list (for each model)
    for stratify_data in [False, True]:   # choose whether to use the stratify or no stratify dataset
        if not stratify_data:
            X_train, X_validation, y_train, y_validation = X_train_no_stra, X_validation_no_stra, y_train_no_stra, y_validation_no_stra
        else:
            X_train, X_validation, y_train, y_validation = X_train_stra, X_validation_stra, y_train_stra, y_validation_stra

        model_name = type(model['model']).__name__
        print(divider_length * divider_shape)
        print('Finding parameters for ' + model_name + ' with stratify_data = ' + str(stratify_data))
        grid_search = GridSearchCV(estimator = model['model'],
                                   param_grid = model['grid_params'],
                                   scoring = scoring,
                                   refit='f1_macro',       # F1 will be the main score to refit the estimator
                                   cv = kfolds,
                                   n_jobs = n_jobs,
                                  )
        grid_search.fit(X_train, y_train)                   # fit the grid search
        best_classifier = grid_search.best_estimator_       # get the best estimator
        best_score = grid_search.best_score_                # get the best score (f1)
        best_parameters = grid_search.best_params_          # get the best parameters
        
        # Metrics on train data. Just for fun.
        y_pred_train = best_classifier.predict(X_train)
        y_pred_proba_train = best_classifier.predict_proba(X_train)

        # calculate the training accuracy
        train_accuracy = accuracy_score(y_true = y_train, y_pred = y_pred_train)

        # calculate the training f-score
        train_f1 = f1_score(y_true = y_train, y_pred = y_pred_train, average=average)

        # calculate the training roc_auc
        train_roc_auc = roc_auc_score(y_true = y_train, y_score = y_pred_proba_train, average=average, multi_class=multi_class)

        # calculate the confusion matrix for training data
        train_cm = confusion_matrix(y_true = y_train, y_pred = y_pred_train)

    
        
        # Metrics on validation data. The results are reported on these metrics
        y_pred_validation = best_classifier.predict(X_validation)
        y_pred_proba_validation = best_classifier.predict_proba(X_validation)

        # calculate the validation accuracy
        validation_accuracy = accuracy_score(y_true = y_validation, y_pred = y_pred_validation)

        # calculate the validation f-score
        validation_f1 = f1_score(y_true = y_validation, y_pred = y_pred_validation, average=average)

        # calculate the validation roc_auc
        validation_roc_auc = roc_auc_score(y_true = y_validation, y_score = y_pred_proba_validation, average=average, multi_class=multi_class)

        # calculate the confusion matrix for validation data
        validation_cm = confusion_matrix(y_true = y_validation, y_pred = y_pred_validation)

        # consolidated data in classification_report
        validation_report = classification_report(y_true = y_validation, y_pred = y_pred_validation)
    
        print(divider_length * divider_shape)
        print('Saving results for the best ' + model_name + ' with stratify_data = ' + str(stratify_data))
    
        # the best model is stored as a dictionary within the final_scores list
        final_scores.append({
                             'model': best_classifier,
                             'model_name': model_name,
                             'stratify_data':stratify_data,
                             'model_parameters': best_parameters,
                             'gs_cv_score': best_score,
                             'train_accuracy': train_accuracy,
                             'train_f1': train_f1,
                             'train_roc_auc': train_roc_auc,
                             'training_conf_mat': train_cm,
                             'validation_accuracy': validation_accuracy,
                             'validation_f1': validation_f1,
                             'validation_roc_auc': validation_roc_auc,
                             'validation_conf_mat': validation_cm,
                             'validation_report': validation_report,
                             }
                            )
        # printing the results I care about the most: model's hyperparameters, F1, ROC_AUC, confusion matrix and classification report
        print('Best parameters: ' + str(best_parameters))
        print('F1: ' + str(round(validation_f1, 2)))
        print('ROC_AUC: ' + str(round(validation_roc_auc, 2)))
        fig, ax = plot_confusion_matrix(conf_mat=validation_cm, class_names=classes, figsize=(3, 3), cmap=plt.cm.Blues)
        plt.xlabel('Predictions', fontsize=12)
        plt.ylabel('Actuals', fontsize=12)
        plt.title('CM for ' + model_name + ' stratify=' + str(stratify_data), fontsize=14)
        plt.show()
        print(validation_report)
    
        print('\a')


Now we can load the results stored in the final_scores list into a dataframe and print the columns we are interested in.

In [None]:
# save results in a dataframe and print scores per model
print(divider_length * divider_shape)
print('Best scores per model')
df_final_scores = pd.DataFrame(final_scores)
print_columns = ['model_name', 'stratify_data', 'model_parameters', 'gs_cv_score', 
                 'train_accuracy', 'train_f1', 'train_roc_auc', 
                 'validation_accuracy', 'validation_f1', 'validation_roc_auc']
print(df_final_scores[print_columns])


## 5. Choose the winner model

Now we can determine the winning model. Let's take a closer look at both the F1 and ROC_AUC metrics to reach a conclusion.

In [None]:
print(divider_length * divider_shape)
print_columns = ['model_name', 'stratify_data', 'validation_f1', 'validation_roc_auc']
print(df_final_scores[print_columns])

The ROC_AUC metric measures how good the model is by relating the true positive rate to the false positive rate across decision thresholds; an AUC of 1 describes a perfect model. However, this value is largely insensitive to class imbalance, so on its own it may hide poor performance on the minority classes. On the other hand, F1 is a good metric for judging an estimator's performance on imbalanced datasets, since it is the harmonic mean of precision and recall. That said, we can pick the best model as the one with the highest F1 and, if there is a tie, use ROC_AUC to break it.
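
As an aside on what the AUC number means, a binary AUC can be read as the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The sketch below computes it that way on made-up labels and scores; `binary_auc` is an illustrative helper, not a library function. The `'ovo'` setting used in this notebook averages pairwise AUCs of this kind over all pairs of classes.

```python
def binary_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted scores for the positive class
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 positive/negative pairs are ordered correctly
print(binary_auc(y_true, scores))  # 0.75
```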

According to the previous table, CatBoost with the non-stratified dataset is the winner. Let's confirm the winner with a few lines of code.
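
Note that picking the row with the highest F1 alone does not apply the ROC_AUC tie-breaker. One possible way to encode the full rule on a list of score dictionaries shaped like `final_scores` is a tuple key, sketched below. The model names and scores are made-up placeholders, not results from this notebook.

```python
# Made-up placeholder scores, shaped like the dictionaries stored in final_scores
example_scores = [
    {'model_name': 'ModelA', 'stratify_data': False, 'validation_f1': 0.90, 'validation_roc_auc': 0.97},
    {'model_name': 'ModelB', 'stratify_data': True, 'validation_f1': 0.92, 'validation_roc_auc': 0.96},
    {'model_name': 'ModelC', 'stratify_data': False, 'validation_f1': 0.92, 'validation_roc_auc': 0.98},
]

# Tuples compare element by element: F1 decides first, ROC AUC only breaks ties
winner = max(example_scores, key=lambda s: (s['validation_f1'], s['validation_roc_auc']))
print(winner['model_name'], winner['stratify_data'])  # ModelC False
```

Exact ties between floating-point F1 values are rare, so in practice one might round the F1 score (for example to two decimals) inside the key before comparing.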

In [None]:
print(divider_length * divider_shape)
print('Best model')
print_columns = ['model_name', 'stratify_data', 'model_parameters', 
                 'validation_f1', 'validation_roc_auc']
winner_model_index = df_final_scores['validation_f1'].idxmax()
print(df_final_scores.loc[winner_model_index, print_columns])
fig, ax = plot_confusion_matrix(conf_mat=df_final_scores.loc[winner_model_index, 'validation_conf_mat'], class_names=classes, figsize=(3, 3), cmap=plt.cm.Blues)
plt.xlabel('Predictions', fontsize=12)
plt.ylabel('Actuals', fontsize=12)
plt.title('Winner Confusion Matrix', fontsize=14)
plt.show()
print('Validation classification report')
print(df_final_scores.loc[winner_model_index, 'validation_report'])
