![Fetal Health Classification](https://stream.org/wp-content/uploads/Scientist-Fetus-Embryo-healthy-Life-Baby-Science-Studies-900.jpg)

Image source: Google Images

**Aim:**

To classify fetal health as ***Normal, Suspect or Pathological*** from the results of a Cardiotocogram (CTG) exam. Flagging at-risk cases early can help reduce child and maternal mortality.

**Approach:**

1. Load the dataset and identify the independent and dependent (target) variables.
2. Since this is a multi-class problem, binarise the target variable.
3. Examine the distribution of the target variable.
4. Examine the correlation between the variables.
5. Select the independent variables that contribute most to the model and build a dataframe from those variables only.
6. Scale the independent variables.
7. Decide the number of folds to use in ***Stratified K-Fold***.
8. Compute metrics such as the ***ROC curve, F1 Score, Precision and Recall*** for the classification model across the stratified folds.

In [None]:
import numpy as np
from numpy import interp
import pandas as pd
import os
import math

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
from statistics import mean
from sklearn import model_selection
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing
from sklearn.metrics import roc_curve, auc, f1_score, precision_recall_curve, classification_report
from sklearn.multiclass import OneVsRestClassifier

In [None]:
fetal_health_df = pd.read_csv("../input/fetal-health-classification/fetal_health.csv")
fetal_health_df.drop_duplicates(inplace = True)

In [None]:
fetal_health_df.head()

In [None]:
fetal_health_df.info()

In [None]:
fetal_health_df.shape

Divide the data into independent and dependent variables.

We binarize the target variable (`y`) because `roc_curve` only handles binary targets, so each class has to be evaluated one-vs-rest on its own indicator column.

In [None]:
X = fetal_health_df.drop(columns=['fetal_health'])
y = fetal_health_df['fetal_health'].to_numpy()
y = preprocessing.label_binarize(y, classes=[1.0, 2.0, 3.0])
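
To make the binarization concrete, here is a small sketch on a hand-made label array. The values in `toy_labels` are illustrative and not taken from the dataset. Each row of the result is a one-hot indicator with one column per class, in the order given by `classes`, and `argmax` recovers the 0-based class index.

```python
import numpy as np
from sklearn.preprocessing import label_binarize

# Hypothetical labels using the same 1.0 / 2.0 / 3.0 coding as fetal_health
toy_labels = np.array([1.0, 3.0, 2.0, 1.0])

# One indicator column per class, ordered as in `classes`
toy_binarized = label_binarize(toy_labels, classes=[1.0, 2.0, 3.0])
print(toy_binarized)

# argmax along the columns recovers the 0-based class index of each row
toy_class_index = toy_binarized.argmax(axis=1)
print(toy_class_index)
```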

# Understand the distribution of the dependent variable

In [None]:
plt.figure(figsize=(12, 6))
# Pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x='fetal_health', data=fetal_health_df, palette='viridis')
plt.title('Dependent variable distribution plot')
plt.xlabel('Fetal Health')

The target variable **'fetal_health'** is imbalanced.
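
One way to put a number on the imbalance is to look at the share of each class. The sketch below uses a made-up series (`toy_target`) as a stand-in; in the notebook the same calls could be applied to `fetal_health_df['fetal_health']`.

```python
import pandas as pd

# Hypothetical target column: 8 samples of class 1.0, one each of 2.0 and 3.0
toy_target = pd.Series([1.0] * 8 + [2.0] * 1 + [3.0] * 1)

# Fraction of samples in each class, ordered by class label
class_share = toy_target.value_counts(normalize=True).sort_index()
print(class_share)

# Ratio between the most and least frequent class
imbalance_ratio = class_share.max() / class_share.min()
print(imbalance_ratio)
```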

# Understand the correlation between different variables in the dataset

In [None]:
correlation = fetal_health_df.corr()
plt.figure(figsize=(20, 12))
sns.heatmap(correlation, cmap="coolwarm", annot=True)

In [None]:
col_names = X.columns

# Feature Selection
Train a `DecisionTreeClassifier` on the data and select features using the feature importances produced by the model.

In [None]:
feature_selection_classifier = DecisionTreeClassifier()
sfm = SelectFromModel(estimator=feature_selection_classifier)
X_transformed = sfm.fit_transform(X, y)
support = sfm.get_support()

In [None]:
# Keep the column names whose support flag is True (names chosen to avoid shadowing X and y)
selected_cols = [col for col, keep in zip(col_names, support) if keep]
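
When no `threshold` is passed, `SelectFromModel` with a tree-based estimator keeps the features whose importance is at least the mean importance. The sketch below checks this on synthetic data; `X_toy`, `y_toy` and the `make_classification` settings are assumptions for illustration and are not the CTG data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the CTG dataset)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)

selector = SelectFromModel(estimator=DecisionTreeClassifier(random_state=0))
selector.fit(X_toy, y_toy)

# Importances of the fitted tree, one value per feature
importances = selector.estimator_.feature_importances_

# Reproduce the default rule by hand: keep features at or above the mean importance
manual_mask = importances >= importances.mean()
print(selector.threshold_, selector.get_support())
```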

In [None]:
X_selected = fetal_health_df.loc[:, fetal_health_df.columns.isin(selected_cols)]

In [None]:
X_selected.head()

## Scaling the independent variables

In [None]:
scaler = preprocessing.StandardScaler() 
X_scaled = scaler.fit_transform(X_selected) 
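
As a quick illustration of what `StandardScaler` does, the sketch below scales a tiny made-up matrix (`toy_features`) whose columns differ by a factor of 100. After scaling, every column has zero mean and unit standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales
toy_features = np.array([[1.0, 100.0],
                         [2.0, 200.0],
                         [3.0, 300.0]])

# Subtract each column's mean and divide by its standard deviation
toy_scaled = StandardScaler().fit_transform(toy_features)

print(toy_scaled.mean(axis=0))  # per-column mean after scaling
print(toy_scaled.std(axis=0))   # per-column standard deviation after scaling
```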

## Deciding the number of folds for Stratified K-Fold

Since the target variable distribution is non-uniform, we will use Stratified KFold for evaluating our model over different sets of data.

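To decide the number of folds, we will borrow Sturges' rule, which is normally used to choose the number of histogram bins, as a heuristic:

> Number of Bins = 1 + ⌈log₂(N)⌉

where N is the number of samples.

In [None]:
# Sturges' rule uses the base-2 logarithm of the sample count
n_bins = 1 + math.ceil(math.log2(len(X_selected)))
print(n_bins)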
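
Sturges' rule is usually written with a base-2 logarithm, and the choice of base changes the resulting count noticeably. The sketch below compares the base-2 form with a natural-log variant for a few sample sizes; both helper names are hypothetical.

```python
import math

def sturges_count(n_samples: int) -> int:
    """Sturges' rule with a base-2 logarithm: 1 + ceil(log2(n))."""
    return 1 + math.ceil(math.log2(n_samples))

def natural_log_count(n_samples: int) -> int:
    """Variant using the natural logarithm: 1 + round(ln(n))."""
    return 1 + round(math.log(n_samples))

# Compare the two counts for a few sample sizes
for n in (100, 1000, 2000):
    print(n, sturges_count(n), natural_log_count(n))
```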

In [None]:
# Fix the seed so the shuffled folds are reproducible between runs
stratified_kf = model_selection.StratifiedKFold(n_splits=n_bins, shuffle=True, random_state=42)
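
The point of stratification is that every fold keeps roughly the same class proportions as the full dataset. A minimal sketch on made-up labels (`toy_y`, with an 80 / 15 / 5 split chosen for illustration) shows the per-class counts in each test fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 80 / 15 / 5 samples per class
toy_y = np.array([0] * 80 + [1] * 15 + [2] * 5)
toy_X = np.zeros((len(toy_y), 1))  # feature values do not affect the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Count how many samples of each class land in every test fold
fold_counts = [
    np.bincount(toy_y[test_idx], minlength=3).tolist()
    for _, test_idx in skf.split(toy_X, toy_y)
]
print(fold_counts)
```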

# Calculate ROC curve, F1 score, Precision and Recall

In [None]:
# One-vs-rest random forest; class_weight='balanced' compensates for the imbalanced target
fetal_health_classifier = OneVsRestClassifier(RandomForestClassifier(n_estimators=1000, class_weight='balanced', random_state=42))

fig1 = plt.figure(figsize=[12, 12])
tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)  # common FPR grid used to average the fold curves
all_f1_score = []
precision_dict = dict()
recall_dict = dict()

# Stratify on the class index recovered from the binarized target
for i, (train_index, test_index) in enumerate(stratified_kf.split(X_scaled, y.argmax(1))):
    X_train, X_test = X_scaled[train_index], X_scaled[test_index]
    y_train, y_test = y[train_index], y[test_index]

    fetal_health_classifier.fit(X_train, y_train)
    y_pred = fetal_health_classifier.predict(X_test)
    prediction_proba = fetal_health_classifier.predict_proba(X_test)

    # Column 1 corresponds to label 2.0 (Suspect), so these curves are Suspect vs. rest
    fpr, tpr, t = roc_curve(y_test[:, 1], prediction_proba[:, 1])
    precision_dict[i], recall_dict[i], _ = precision_recall_curve(y_test[:, 1], prediction_proba[:, 1])

    # Weighted F1 over all three indicator columns
    f1score = f1_score(y_test, y_pred, average='weighted')
    all_f1_score.append(f1score)

    # Interpolate this fold's TPR onto the common grid so folds can be averaged
    tprs.append(np.interp(mean_fpr, fpr, tpr))
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
    plt.plot(fpr, tpr, lw=2, alpha=0.3, label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))

mean_f1_score = mean(all_f1_score)
print("Mean F1-score across all folds: ", mean_f1_score)

# Diagonal reference line for a random classifier
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='black')
mean_tpr = np.mean(tprs, axis=0)
mean_auc = auc(mean_fpr, mean_tpr)
print("Mean ROC AUC across all folds: ", mean_auc)
plt.plot(mean_fpr, mean_tpr, color='blue', label=r'Mean ROC (AUC = %0.2f)' % (mean_auc), lw=2, alpha=1)

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(loc="lower right")
plt.show()
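
The loop above interpolates each fold's ROC curve onto a shared FPR grid so that the curves can be averaged point by point. The sketch below shows that single step on made-up scores (`y_true` and `y_score` are illustrative, not model output), using a coarse 5-point grid for readability.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical binary labels and predicted scores for one fold
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# ROC points and area under the curve for this fold
fpr, tpr, _ = roc_curve(y_true, y_score)
fold_auc = auc(fpr, tpr)
print(fold_auc)

# Resample the curve on a common FPR grid; curves from several folds
# evaluated on the same grid can then be averaged with np.mean(..., axis=0)
grid_fpr = np.linspace(0, 1, 5)
grid_tpr = np.interp(grid_fpr, fpr, tpr)
print(grid_tpr)
```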

In [None]:
fig2 = plt.figure(figsize=[12, 12])

# One precision-recall curve per fold (computed on indicator column 1, i.e. label 2.0)
for i in range(len(precision_dict)):
    plt.plot(recall_dict[i], precision_dict[i], lw=2, label='Fold %d' % i)

plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend(loc="best")
plt.title("Precision vs. recall curve")
plt.show()
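
A precision-recall curve can also be summarised by a single number per fold. One option, not used in the notebook above, is scikit-learn's `average_precision_score`; the sketch below applies it to made-up scores (`pr_true` and `pr_score` are illustrative).

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

# Hypothetical binary labels and predicted scores
pr_true = np.array([0, 0, 1, 1])
pr_score = np.array([0.1, 0.4, 0.35, 0.8])

# Precision and recall at each score threshold
precision, recall, thresholds = precision_recall_curve(pr_true, pr_score)

# Average precision: precision weighted by the increase in recall at each threshold
avg_precision = average_precision_score(pr_true, pr_score)
print(avg_precision)
```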