In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
from matplotlib import pyplot as plt 
import seaborn as sns
from keras.backend import clear_session
from keras.utils import to_categorical
from keras.layers import Dense
from keras.models import Sequential
from keras.metrics import AUC
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV
from sklearn.metrics import confusion_matrix, f1_score, precision_recall_curve, classification_report
from tensorflow.random import set_seed
from tensorflow import get_logger


get_logger().setLevel('ERROR')

# We use these random seeds to ensure reproducibility
np.random.seed(1)
set_seed(1)

The challenge is to classify the health of a fetus as Normal, Suspect or Pathological based on cardiotocogram exam data. The dataset consists of a single CSV file based on the research work of Ayres-de-Campos et al. [Ayres de Campos et al. (2000) SisPorto 2.0 A Program for Automated Analysis of Cardiotocograms. J Matern Fetal Med 5:311-318]. Let's start with some exploratory data analysis.

In [None]:
df = pd.read_csv('/kaggle/input/fetal-health-classification/fetal_health.csv')

df.info()

df.head(n=5)

The dataset is composed of 2126 observations, and all the columns (21 features and one outcome) are numerical. No value appears to be missing. Let's see whether some features are irrelevant, whether there are outliers to deal with, and whether some standardization is needed.
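
As a minimal sketch, the checks that `df.info()` summarises (shape, missing values, numeric types) can also be made explicit. The tiny DataFrame below is a made-up stand-in: its two column names are taken from this notebook, but its values are invented so that the snippet runs without the Kaggle input file.

```python
import pandas as pd

# Toy stand-in for the real table: invented values, column names from the notebook
toy = pd.DataFrame({
    'histogram_tendency': [1.0, 0.0, -1.0, 0.0],
    'fetal_health': [1.0, 1.0, 2.0, 3.0],
})

n_rows, n_cols = toy.shape
n_missing = int(toy.isna().sum().sum())   # total number of missing cells
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in toy.dtypes)

print(n_rows, n_cols, n_missing, all_numeric)
```

Replacing `toy` with `df` would run the same checks on the real data.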

In [None]:
%matplotlib inline
df.hist(bins=20, figsize=(20, 15))
plt.show()

From these histograms, we can notice several points to address in order to improve the predictive power of a machine learning model.
- The variance differs a lot from one feature to another, so standardization is likely to help
- The feature "histogram_tendency" takes the values -1, 0 and 1, but we do not know what they correspond to (are they categories?). Without extra information (there is nothing about it in the documentation), it is probably safer not to use this feature
A last idea to consider:
- Regarding the outcome we want to predict: there are far more healthy foetuses than suspect or pathological ones. It is therefore worth stratifying the train/test split, so that both sets keep the same proportions of 'healthy', 'suspect' and 'pathological' foetuses. Because of this imbalance, we will avoid using accuracy as a metric.
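
To illustrate what stratification does, here is a small sketch on invented, imbalanced labels (not the real dataset). Note that a stratified split keeps the class proportions identical in both subsets; it does not rebalance the classes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented, imbalanced labels: 80% class 0, 15% class 1, 5% class 2
labels = np.array([0] * 160 + [1] * 30 + [2] * 10)
features = np.arange(labels.size).reshape(-1, 1)  # dummy feature column

_, _, y_tr, y_te = train_test_split(
    features, labels, test_size=0.30, stratify=labels, random_state=42)

def class_shares(y):
    """Fraction of samples in each of the three classes."""
    return np.bincount(y, minlength=3) / y.size

print('full :', class_shares(labels))
print('train:', class_shares(y_tr))
print('test :', class_shares(y_te))
```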

Let's try and compare two approaches:
1) Developing a neural network without performing any feature engineering.
2) Comparing different ensemble techniques after performing some feature selection.

This notebook shows the first approach. Let's go!

In [None]:
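# Let's remove "histogram_tendency", split the data into a train and a test set, and standardize it
df = df.drop(['histogram_tendency'], axis=1)

X, y = df.drop(['fetal_health'], axis=1), df['fetal_health']

# One-hot encode the outcome to match the 3-unit output layer of the network
encoder = OneHotEncoder()
y = encoder.fit_transform(y.values.reshape(-1,1)).toarray()

# Split first, stratifying on the outcome so that both sets keep the same class proportions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, stratify=y, random_state = 42)

# Fit the scaler on the training set only, so that no test-set statistics leak into training
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)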

def compile_model():
    clear_session()

    model = Sequential()

    model.add(Dense(20, input_shape=(20,), activation='relu'))
    model.add(Dense(20, activation='relu'))
    model.add(Dense(20, activation='relu'))
    model.add(Dense(20, activation='relu'))
    model.add(Dense(20, activation='relu'))
    model.add(Dense(3, activation='sigmoid'))
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=[AUC(multi_label=True)])
    return model


loss, val_loss = [[],[]]

model = compile_model()
our_model = model.fit(X_train, y_train, validation_data=(X_test, y_test), 
                      epochs=100, verbose=0)

loss.append(our_model.history['loss'])
val_loss.append(our_model.history['val_loss'])
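
The output layer above applies a sigmoid to each of its 3 units. As a side note, and only as a NumPy sketch on invented values, the snippet below compares a sigmoid with a softmax on the same pre-activations: independent sigmoids do not sum to 1 across classes, whereas a softmax does, which is why softmax is the more common pairing with `categorical_crossentropy` when each sample belongs to exactly one class.

```python
import numpy as np

logits = np.array([2.0, 0.5, -1.0])   # invented pre-activation values for 3 classes

sigmoid = 1.0 / (1.0 + np.exp(-logits))

softmax = np.exp(logits - logits.max())   # subtract the max for numerical stability
softmax = softmax / softmax.sum()

print(sigmoid.sum(), softmax.sum())
print(np.argmax(sigmoid) == np.argmax(softmax))
```

Both activations are increasing in the pre-activation, so the predicted class (the argmax) is the same; only the interpretation of the scores as probabilities differs.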



Let's get a first sense of how easily the model can overfit if we are not careful.

In [None]:
fig, ax = plt.subplots(figsize=(5,5))

ax.plot(loss[0], color='k', label='Train set')
ax.plot(val_loss[0], color='r', label='Test set')
ax.set_xlabel('Number of epochs')
ax.set_ylabel('Loss')
ax.set_xlim(0,30)
ax.legend()


When the number of epochs increases from 1 to ~5, the loss function decreases on both the training and the test set. However, when the number of epochs is > 5, the loss function keeps decreasing on the training set but increases on the test set, which is a sign that the model is **overfitting**. We will keep the number of epochs equal to 4!
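
Reading the stopping point off the plot can also be automated. The sketch below uses an invented loss curve and a hypothetical helper, `early_stopping_epoch`, which is not part of this notebook; it returns the epoch with the lowest loss seen before the loss fails to improve for `patience` consecutive epochs. Keras also ships an `EarlyStopping` callback for this purpose. In practice, the curve would preferably come from a validation split kept separate from the final test set.

```python
import numpy as np

def early_stopping_epoch(curve, patience=2):
    """Return the 1-based epoch with the lowest loss seen before `patience`
    consecutive epochs without improvement."""
    best, best_epoch, waited = float('inf'), 0, 0
    for epoch, value in enumerate(curve, start=1):
        if value < best:
            best, best_epoch, waited = value, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

# Invented validation-loss curve: falls, reaches a minimum, then rises again
val_curve = np.array([0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55])

stop_at = early_stopping_epoch(val_curve, patience=2)
print(stop_at)
```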

Let's see how our model performs:

In [None]:
model = compile_model()
our_model = model.fit(X_train, y_train, epochs=4, verbose=0)
y_pred0 = model.predict(X_test)

def print_f1score(model, X_train, y_train, y_test, y_pred):
    # Macro averaging gives each class the same weight. With one label per sample,
    # micro averaging would simply reproduce the accuracy, which we want to avoid.
    training_score = f1_score(np.argmax(y_train,axis=1), 
                              np.argmax(model.predict(X_train),axis=1), average='macro')

    test_score = f1_score(np.argmax(y_test,axis=1), 
                          np.argmax(y_pred,axis=1), average='macro')

    print('Macro f1-score on the training set: %s'%training_score)
    print('Macro f1-score on the test set: %s'%test_score)

print_f1score(model, X_train, y_train, y_test, y_pred0)
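
The averaging mode passed to `f1_score` matters on imbalanced data. As a small sketch on invented labels: when each sample has exactly one label, the micro-averaged F1 coincides with accuracy, whereas the macro average gives every class the same weight and therefore penalises a model that ignores the minority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Invented labels: 90 majority-class samples, 10 minority-class samples
y_true = np.array([0] * 90 + [1] * 10)
y_hat = np.zeros_like(y_true)   # a useless model that always predicts class 0

acc = accuracy_score(y_true, y_hat)
micro = f1_score(y_true, y_hat, average='micro')
macro = f1_score(y_true, y_hat, average='macro', zero_division=0)

print(acc, micro, macro)
```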


In [None]:
# Check out these nice tricks of Dennis T to plot confusion matrix
# [https://medium.com/@dtuk81/confusion-matrix-visualization-fc31e3f30fea]

def print_confusion_matrix(model, X_train, y_train, y_test, y_pred):
    train_confusion = confusion_matrix(np.argmax(y_train,axis=1), np.argmax(model.predict(X_train),axis=1))
    test_confusion = confusion_matrix(np.argmax(y_test,axis=1), np.argmax(y_pred,axis=1))

    fig, ax = plt.subplots(1,2,figsize=(12,5))

    sns.heatmap(train_confusion/np.sum(train_confusion), ax=ax[0], annot=True, fmt='.2%', cmap='Reds')
    ax[0].set_xlabel('Predicted labels')
    ax[0].set_ylabel('Actual labels')
    ax[0].set_title('Confusion matrix (train set)')

    sns.heatmap(test_confusion/np.sum(test_confusion), ax=ax[1], annot=True, fmt='.2%', cmap='Reds')
    ax[1].set_title('Confusion matrix (test set)')
    ax[1].set_xlabel('Predicted labels')
    ax[1].set_ylabel('Actual labels')

print_confusion_matrix(model, X_train, y_train, y_test, y_pred0)


In [None]:
# Share of Suspect cases correctly labeled (recall of the Suspect class)
a = confusion_matrix(np.argmax(y_test, axis=1), np.argmax(y_pred0, axis=1))
tmp1 = np.round(100*a[1,1]/np.sum(a[1,:]),1)

# Share of Pathological cases correctly labeled (recall of the Pathological class)
tmp2 = np.round(100*a[2,2]/np.sum(a[2,:]),1)

print('The confusion matrix shows us that, on unseen data:')
print(tmp1,'% of the Suspect foetuses are correctly labeled')
print(tmp2,'% of the Pathological foetuses are correctly labeled')
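
The ratio computed above (diagonal entry divided by the row sum of the confusion matrix) is the per-class recall. A minimal sketch on invented labels shows that it matches `recall_score(..., average=None)` from scikit-learn:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Invented labels for three classes (0 = Normal, 1 = Suspect, 2 = Pathological)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_hat = np.array([0, 0, 0, 1, 1, 1, 0, 2, 2, 1])

cm = confusion_matrix(y_true, y_hat)
recall_from_cm = np.diag(cm) / cm.sum(axis=1)   # correctly labeled / actual, per class
recall_sklearn = recall_score(y_true, y_hat, average=None)

print(recall_from_cm)
```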

I am sure that we can do better! One way to reduce the proportion of false negatives and false positives is hyperparameter tuning. In a neural network like this one, the number of degrees of freedom is quite large: it is possible to modify the architecture of the network (number of layers, number of neurons per layer, etc.), the activation functions, the optimizer, and so on. For the sake of simplicity, I will not tweak the architecture here. Let's look for a better activation function and optimizer!
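
For a sense of scale, the number of trainable parameters of the architecture that is kept fixed here can be worked out by hand. The sketch below assumes the layer widths used in this notebook (20 inputs, five hidden layers of 20 units, 3 outputs) and plain fully connected layers with biases.

```python
# Layer widths: 20 inputs, five hidden layers of 20 units, 3 output units
widths = [20, 20, 20, 20, 20, 20, 3]

# A Dense layer with n_in inputs and n_out units has n_in * n_out weights and n_out biases
n_params = sum(n_in * n_out + n_out
               for n_in, n_out in zip(widths[:-1], widths[1:]))

print(n_params)
```

This gives 2163 parameters, which is more than the number of training rows left after a 70/30 split of 2126 observations.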

In [None]:
def my_new_model(act, opt):
    clear_session()
    model = Sequential()
    model.add(Dense(20, input_shape=(20,), activation=act))
    model.add(Dense(20, activation=act))
    model.add(Dense(20, activation=act))
    model.add(Dense(20, activation=act))
    model.add(Dense(20, activation=act))
    model.add(Dense(3, activation='sigmoid'))
    model.compile(optimizer=opt, loss='categorical_crossentropy')
    return model

new_model = KerasClassifier(build_fn=my_new_model, verbose=0)

parameters = dict(opt = ['adam', 'sgd', 'adamax'], 
                  act=['relu', 'softmax', 'tanh', 'selu'],  
                  batch_size=[32, 64, 128, 256, 512])

random_search = RandomizedSearchCV(new_model, param_distributions=parameters, 
                                   n_iter=30, scoring='roc_auc', random_state=123)

res = random_search.fit(X_train, y_train)

print(res.best_params_)
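
To see how much of the search space a randomized search with `n_iter=30` covers, the same kind of draw can be reproduced with scikit-learn's `ParameterGrid` and `ParameterSampler`. This is only a sketch of the sampling step: no model is trained, and the exact candidates drawn inside `RandomizedSearchCV` are not claimed to be identical.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

parameters = dict(opt=['adam', 'sgd', 'adamax'],
                  act=['relu', 'softmax', 'tanh', 'selu'],
                  batch_size=[32, 64, 128, 256, 512])

n_total = len(ParameterGrid(parameters))   # 3 * 4 * 5 combinations in the full grid

# When every parameter is given as a list, candidates are drawn without replacement
sampled = list(ParameterSampler(parameters, n_iter=30, random_state=123))
n_unique = len({tuple(sorted(p.items())) for p in sampled})

print(n_total, len(sampled), n_unique)
```

With 60 possible combinations, 30 iterations evaluate half of the grid.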

In [None]:
# Let's see how we perform with the selected optimizer and activation function!

optimized_model = my_new_model('relu', 'adam')

optimized_model.fit(X_train, y_train, epochs=14, batch_size=32, verbose=0)

y_pred = optimized_model.predict(X_test)

print_f1score(optimized_model, X_train, y_train, y_test, y_pred)

In [None]:
print_confusion_matrix(optimized_model, X_train, y_train, y_test, y_pred)

In [None]:
# Share of Suspect cases correctly labeled (recall of the Suspect class)
a = confusion_matrix(np.argmax(y_test, axis=1), np.argmax(y_pred, axis=1))
tmp1 = np.round(100*a[1,1]/np.sum(a[1,:]),1)

# Share of Pathological cases correctly labeled (recall of the Pathological class)
tmp2 = np.round(100*a[2,2]/np.sum(a[2,:]),1)

print('Thanks to the hyperparameter tuning, on unseen data, we now have:')
print(tmp1,'% of the Suspect foetuses are correctly labeled')
print(tmp2,'% of the Pathological foetuses are correctly labeled')

It is definitely better than what we achieved before. Further, the confusion matrices derived from the model predictions on the training set and on the test set are rather similar, which is a good hint that the model has not overfitted.
Let's take a look at the precision-recall curves for both models (before and after hyperparameter tuning).

In [None]:
precision = dict()
recall = dict()
precision_opt = dict()
recall_opt = dict()

for i in range(3):
    precision[i], recall[i], _ = precision_recall_curve(y_test[:,i],y_pred0[:,i])
    precision_opt[i], recall_opt[i], _ = precision_recall_curve(y_test[:,i],y_pred[:,i])

fig, ax = plt.subplots()

colors = ['k','r','b']
labels = ['Healthy','Suspect','Pathological']

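for i in range(3):
    # Dashed lines: baseline model; solid lines: model after hyperparameter tuning
    ax.plot(recall[i],precision[i],color=colors[i],label=labels[i]+' (before tuning)',linestyle='--')
    ax.plot(recall_opt[i],precision_opt[i],color=colors[i],label=labels[i]+' (after tuning)',linestyle='-')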

ax.set_xlabel('Recall TP/(FN+TP)')
ax.set_ylabel('Precision TP/(FP+TP)')

plt.legend()
plt.show()
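
The curves above can be summarised by one number per class. Average precision is one common summary of the area under a precision-recall curve. The sketch below uses invented one-hot labels and invented scores (standing in for `y_test` and the model outputs) to show the per-class computation with `average_precision_score`.

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)

# Invented one-hot labels for 3 classes, standing in for y_test
classes = rng.integers(0, 3, size=200)
y_onehot = np.eye(3)[classes]

# Invented scores: the "sharp" model separates the classes much better than the "noisy" one
scores_noisy = y_onehot + rng.normal(scale=1.0, size=y_onehot.shape)
scores_sharp = y_onehot + rng.normal(scale=0.1, size=y_onehot.shape)

ap_noisy = [average_precision_score(y_onehot[:, i], scores_noisy[:, i]) for i in range(3)]
ap_sharp = [average_precision_score(y_onehot[:, i], scores_sharp[:, i]) for i in range(3)]

for i, name in enumerate(['Healthy', 'Suspect', 'Pathological']):
    print(name, round(ap_noisy[i], 3), round(ap_sharp[i], 3))
```

Passing `y_test[:, i]` together with `y_pred0[:, i]` or `y_pred[:, i]` instead of the invented arrays would give the corresponding numbers for the two models of this notebook.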

**Conclusion**:
The f1-score and the area under the precision-recall curve have increased with the hyperparameter tuning. The advantage of this approach is its simplicity, as it does not require much effort on feature engineering and/or selection. It can definitely be improved by tuning other aspects of the model, for instance its architecture (adding batch normalization, a dropout layer, etc.). Further, an effort should be made to validate the model performance (for instance, by analysing it across different initial random seeds).

**COMING SOON** I will soon publish another approach based on feature selection and different ensemble techniques. The idea is to compare a voting classifier, a random forest classifier and an extreme gradient boosting algorithm. Preliminary results suggest that these models have better predictive power.