# Main Objectives
The goal of this project is to build several machine learning models that predict and classify fetal health as accurately as possible. The work is broken down into the following milestones:  
1. Data Cleaning, Exploration and Feature Engineering.
2. Modeling.
3. Selection of best model.  

The best model could help medical personnel automate the diagnosis of fetal and maternal health from the information gathered by the exam, saving time and budget. It could also help identify the metrics that are most impactful or most strongly correlated with a pathology, and support the early detection of disease in both patients.

In [None]:
! pip install imbalanced-learn

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from imblearn.over_sampling import SMOTE

In [None]:
df=pd.read_csv('../input/fetal-health-classification/fetal_health.csv')

In [None]:
df.head()

In [None]:
df.shape

Checking whether the dataframe contains any null values:

In [None]:
df.isnull().sum().sum()

Confirming the non-null count and data type of each feature:

In [None]:
df.info()

As we can see above, all features, including the label, are numerical. No categorical variables need to be processed in this project, so we will only explore the distribution and meaning of each numerical feature.

# Feature Engineering 

Let's see the distribution of the label:

In [None]:
df.fetal_health.value_counts()

In [None]:
sns.countplot(x='fetal_health',data=df)

Plotting a pie chart with the appropriate class names:

In [None]:
# Map the numeric codes to class names, then count the records in each class.
class_names = {1.0: 'Normal', 2.0: 'Suspect', 3.0: 'Pathological'}
pie1 = df['fetal_health'].map(class_names).value_counts()
# Plot the counts directly; the index of the Series supplies the slice labels.
pie1.plot(kind='pie', title='Pie chart of fetal health',
          autopct='%1.1f%%', shadow=False, legend=False, fontsize=14, figsize=(12,12))

Let's see the histogram of each feature:

In [None]:
df.iloc[:,:-1].hist(figsize=[20,25], layout=[7,3])

Looking carefully at the histograms, at least 8 features are extremely skewed and contain a significant number of outliers, which suggests scaling them with robust scaling. However, all of these values are correct and were confirmed by the publisher of the dataset.  
We can also see that the features were already processed: some were created by binning or by encoding ordinal categorical variables and therefore take only a small set of possible values, namely light_decelerations, prolongued_decelerations, severe_decelerations, histogram_number_of_zeroes and histogram_tendency. The label was encoded in the same way, which is why it holds numbers instead of the class names.  
The remaining features are continuous; some are already standardized and others are not. To keep the classifiers from being affected by differences in scale, we will standardize every feature.
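
The difference between the two scaling options mentioned above can be illustrated with a small sketch. The data here is hypothetical (an exponential sample plus three injected outliers), not taken from the dataset; it only shows that standardization centers on the mean and divides by the standard deviation, while robust scaling centers on the median and divides by the interquartile range.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical skewed feature: mostly small values plus a few large outliers.
rng = np.random.default_rng(0)
x = np.concatenate([rng.exponential(1.0, 500), [40.0, 55.0, 70.0]]).reshape(-1, 1)

x_std = StandardScaler().fit_transform(x)  # centers on the mean, divides by the std
x_rob = RobustScaler().fit_transform(x)    # centers on the median, divides by the IQR

q75, q25 = np.percentile(x_rob, [75, 25])
print("standardized: mean=%.3f std=%.3f" % (x_std.mean(), x_std.std()))
print("robust:       median=%.3f IQR=%.3f" % (np.median(x_rob), q75 - q25))
```

With robust scaling the outliers keep large scaled values but do not influence the center or the spread used for scaling.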

Below are the 5 features mentioned and the number of unique values each one contains:

In [None]:
print('Number of unique values in light_decelerations feature:',len(df.light_decelerations.unique()))
print('Number of unique values in prolongued_decelerations feature:',len(df.prolongued_decelerations.unique()))
print('Number of unique values in severe_decelerations feature:',len(df.severe_decelerations.unique()))
print('Number of unique values in histogram_number_of_zeroes feature:',len(df.histogram_number_of_zeroes.unique()))
print('Number of unique values in histogram_tendency feature:',len(df.histogram_tendency.unique()))

Statistical summary of features using 'describe table':

In [None]:
df.iloc[:,:-1].describe().T

The following is a summary of the 9 most skewed features:

In [None]:
df[['fetal_movement', 'histogram_number_of_zeroes', 'histogram_variance', 'light_decelerations',
   'mean_value_of_long_term_variability','mean_value_of_short_term_variability','percentage_of_time_with_abnormal_long_term_variability',
   'prolongued_decelerations','severe_decelerations']].describe().T

Let's see the box plot of each feature related to the label:

In [None]:
plt.figure(figsize=(25,35))
i=1
for feat in df.iloc[:,:-1].columns:
    plt.subplot(7,3,i)
    sns.boxplot(x='fetal_health',y=feat,data=df)
    i+=1

In some of these box plots the IQR looks almost null. This is because those features have only a few unique values, which is a product of binning and encoding ordinal categorical variables.

We can treat the label as an ordinal, roughly continuous variable: the larger the number, the more likely it is that the fetus has a health problem. We can therefore correlate it with the features and interpret a positive Pearson correlation as a feature that increases together with the severity of the health problem.  
Based on this assumption, let's make a heat map showing the Pearson correlation of each feature with the label:

In [None]:
plt.figure(figsize=(5, 12))
heatmap = sns.heatmap(df.corr()[['fetal_health']].sort_values(by='fetal_health', ascending=True), vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Features Correlating with Fetal Health', fontdict={'fontsize':18}, pad=22)
heatmap.set_ylim([0,22])

Although no feature has a strong correlation with the label, we can get an idea of how each one impacts the outcome.
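
As a reminder of what the heat map reports, the following sketch computes the Pearson coefficient by hand and compares it with the value returned by `DataFrame.corr()`. The `toy` frame and the `pearson` helper are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one feature and an ordinal label coded 1, 2, 3.
toy = pd.DataFrame({
    "feature": [0.1, 0.4, 0.3, 0.9, 1.2, 1.1, 2.0, 1.8],
    "label":   [1, 1, 1, 2, 2, 2, 3, 3],
})

def pearson(x, y):
    """Pearson r: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

r_manual = pearson(toy["feature"], toy["label"])
r_pandas = toy.corr()["label"]["feature"]  # same pattern as df.corr()[['fetal_health']]
print(round(r_manual, 4), round(r_pandas, 4))
```

Pearson correlation measures linear association only, so a low value does not rule out a non-linear relationship with the label.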

In [None]:
features = df.iloc[:,:-1]
label=df['fetal_health']

In [None]:
features.shape, label.shape

Since all the computed correlations were fairly low, we could create polynomial features to capture relationships between the original features, which expands the information given to the predictive model. For a second-degree expansion of $n$ original features, either of the following formulas gives the total number of features, omitting the bias component:

$ Features= 2n + \sum \limits _{i=1} ^{n-1} i $  

$ Features= 2n + \frac{n(n-1)}{2}$

This process increases the number of features more than tenfold, to 252, which also increases model complexity and the risk of overfitting through the curse of dimensionality, but let's evaluate the performance of models with and without these extra features.
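
The count can be checked with a few lines of arithmetic. This sketch assumes 21 original features (every column except the label), which is the value consistent with the 252 quoted above; the helper name `n_poly_features_degree2` is hypothetical.

```python
from math import comb

def n_poly_features_degree2(n: int) -> int:
    """Count the degree-2 polynomial features of n inputs, without the bias term."""
    linear = n                 # x_i
    squares = n                # x_i^2
    interactions = comb(n, 2)  # x_i * x_j with i < j, equal to n(n-1)/2
    return linear + squares + interactions

n = 21  # assumed number of original features
total = n_poly_features_degree2(n)
print(total)                          # closed form
print(2 * n + sum(range(1, n)))       # summation form, 1 + 2 + ... + (n-1)
```

Both forms agree because the sum of the first $n-1$ integers equals $\frac{n(n-1)}{2}$.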

In [None]:
from sklearn.preprocessing import PolynomialFeatures

In [None]:
pf = PolynomialFeatures(degree=2, include_bias=False)
df3 = pf.fit_transform(features)

In [None]:
df3.shape

The following keeps the name of each column, which is crucial when we want to see the importance and impact of each one on the prediction:

In [None]:
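# Build one name per output column from the exponents in pf.powers_,
# for example 'a^1xb^1' for the interaction of columns a and b.
target_feature_names = [
    'x'.join('{}^{}'.format(name, power)
             for name, power in zip(features.columns, powers) if power != 0)
    for powers in pf.powers_
]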
output_df = pd.DataFrame(df3, columns = target_feature_names)

In [None]:
output_df.head()

The last step of feature engineering is oversampling. Because the label is unbalanced, predictions will tend to be biased towards the most frequent class, which is clearly undesirable, so SMOTE will be used to obtain the same number of instances per class.

In [None]:
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(output_df, label)

In [None]:
X_res.shape

In [None]:
y_res.value_counts()

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res , random_state=42, test_size = 0.3)
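
In the cells above the data is resampled before it is split, so synthetic rows interpolated from records that later land in the test set can also appear in the training set. A common alternative is to split first and resample only the training part. The sketch below illustrates that order on hypothetical data; it uses plain random oversampling from `sklearn.utils.resample` as a stand-in for SMOTE to avoid extra dependencies, and all variable names are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced data: 80 / 15 / 5 rows for classes 1, 2, 3.
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
y = pd.Series([1] * 80 + [2] * 15 + [3] * 5, name="label")

# 1) Split first, stratified so that every class appears in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)

# 2) Oversample the training part only, up to the size of its largest class.
train = X_tr.assign(label=y_tr)
target = train["label"].value_counts().max()
balanced = pd.concat([
    resample(group, replace=True, n_samples=target, random_state=42)
    for _, group in train.groupby("label")
])
X_tr_bal, y_tr_bal = balanced.drop(columns="label"), balanced["label"]

print(y_tr_bal.value_counts().to_dict())  # equal counts per class
print(y_te.value_counts().to_dict())      # test set keeps the original imbalance
```

With this order the test set contains only original records, so the reported metrics reflect the class proportions the model would face in practice.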

# Modeling

The following models will be built and compared using their corresponding performance metrics:  
1. Logistic Regression with Ridge (L2) regularization.
2. SVC with RBF kernel.
3. Random Forest with the best number of trees.
4. Voting Classifier combining the three models.

Before building the different models, let's import some metrics in order to compare the performance of each one:

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import label_binarize

In [None]:
from sklearn.preprocessing import StandardScaler
s = StandardScaler()

In [None]:
X_train_s = s.fit_transform(X_train)
X_test_s = s.transform(X_test)

## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegressionCV

lr_l2 = LogisticRegressionCV(Cs=10, cv=4, penalty='l2', solver='liblinear').fit(X_train_s, y_train)

We can see below the 3 sets of coefficients generated by the model.

In [None]:
pd.DataFrame(lr_l2.coef_.T, columns=[0,1,2])

Printing the class predicted for each instance:

In [None]:
y_pred_lr=lr_l2.predict(X_test_s).T
pd.DataFrame(y_pred_lr,columns=['Class predicted']).head(10)

Printing the probabilities that the instances belong to each one of the classes for Logistic regression:

In [None]:
pd.DataFrame(lr_l2.predict_proba(X_test_s),columns=['1','2','3']).head(10)

Classification report showing all measures of precision, recall, f1 score and accuracy:

In [None]:
print(classification_report(y_test,y_pred_lr))

The model performs well, but let's compare it with the next models:

## Support Vector Classifier

First, grid search will be used to find the best hyperparameters for the model. Because the label is multiclass, decision_function_shape is set to 'ovr', which stands for 'one versus rest', and probability is set to True in order to obtain the probability that each instance belongs to each class.

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

svm_model=SVC(kernel='rbf', decision_function_shape='ovr', probability=True)
tuned_parameters = {'gamma': [0.01,0.1,1,10],'C':[0.01,0.1,1,10]}

model_svm = GridSearchCV(svm_model, tuned_parameters,cv=4,scoring='accuracy')
model_svm.fit(X_train_s, y_train)

Printing the best estimator, with the hyperparameters found by grid search:

In [None]:
print(model_svm.best_estimator_)

In [None]:
print(accuracy_score(y_test, model_svm.predict(X_test_s)))

Building a new model with those hyperparameters:

In [None]:
svc= SVC(kernel='rbf',C=10,gamma=0.01,decision_function_shape='ovr',probability=True)
svc.fit(X_train_s,y_train)
y_pred_svm=svc.predict(X_test_s)
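
The cell above copies the selected values of `C` and `gamma` by hand. As a hedged alternative, `GridSearchCV` refits the best configuration on the whole training set by default (`refit=True`), so the fitted model is available as `best_estimator_`. The sketch below shows this on hypothetical synthetic data; the variable names are assumptions and the grid is smaller than the one used above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Hypothetical three-class data standing in for the scaled training set.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

search = GridSearchCV(SVC(kernel="rbf"),
                      {"gamma": [0.01, 0.1, 1], "C": [0.1, 1, 10]},
                      cv=4, scoring="accuracy")
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already trained on all of X_tr.
best_svc = search.best_estimator_
print(search.best_params_)
print(best_svc.score(X_te, y_te) == search.score(X_te, y_te))
```

Reusing `best_estimator_` avoids a mismatch between the reported hyperparameters and the model that is evaluated if the grid or the data changes.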

In [None]:
y_pred_svm=svc.predict(X_test_s).T
pd.DataFrame(y_pred_svm,columns=['Class predicted']).head(10)

Printing the probabilities that the instances belong to each one of the classes for SVC:

In [None]:
pd.DataFrame(svc.predict_proba(X_test_s),columns=['1','2','3']).head(10)

In [None]:
print(classification_report(y_test,y_pred_svm))

SVC performed a bit better than Logistic Regression.

## Random Forest 

Tree-based models do not require feature scaling or encoding. However, the features are already engineered, and in order to compare the models under the same conditions the random forest will be trained in the same way as the prior models.

In [None]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=RuntimeWarning)

The number of trees will be selected by computing the out-of-bag error of models with 15 to 400 trees and plotting the corresponding errors. warm_start is set to True so that each fit only adds trees to the existing ones, which reduces execution time.

In [None]:
from sklearn.ensemble import RandomForestClassifier
RF = RandomForestClassifier(oob_score=True,
                            random_state=42,
                            warm_start=True,
                            n_jobs=-1)
oob_list = list()
for n_trees in [15, 20, 30, 40, 50, 100, 150, 200, 300, 400]:
    RF.set_params(n_estimators=n_trees)
    RF.fit(X_train_s, y_train)
    oob_error = 1 - RF.oob_score_
    oob_list.append(pd.Series({'n_trees': n_trees, 'oob': oob_error}))

rf_oob_df = pd.concat(oob_list, axis=1).T.set_index('n_trees')
rf_oob_df

In [None]:
sns.set_context('talk')
sns.set_style('white')

ax = rf_oob_df.plot(legend=False, marker='o', figsize=(14, 7), linewidth=5)
ax.set(ylabel='out-of-bag error');

We can see that the model has the lowest error when the number of trees is around 300, so we will build a new model with this setting:

In [None]:
RF_300 = RandomForestClassifier(n_estimators=300
          ,oob_score=True 
          ,random_state=42
          ,n_jobs=-1)

RF_300.fit(X_train_s,y_train)
oob_error300 = 1 - RF_300.oob_score_
oob_error300

In [None]:
y_pred_rf=RF_300.predict(X_test_s)

Printing the class predicted for each instance and then the probabilities:

In [None]:
y_pred_rf=RF_300.predict(X_test_s).T
pd.DataFrame(y_pred_rf,columns=['Class predicted']).head(10)

In [None]:
pd.DataFrame(RF_300.predict_proba(X_test_s),columns=['1','2','3']).head(10)

In [None]:
print(classification_report(y_test,y_pred_rf))

So far Random Forest has the best performance. Finally, combining the three models with a Voting Classifier might increase performance or at least reduce variance:   

## Voting Classifier 

In [None]:
from sklearn.ensemble import VotingClassifier

# The combined model: logistic regression, SVC and Random Forest
estimators = [('LR_L2', lr_l2), ('SVM', svc), ('RF', RF_300)]

VC = VotingClassifier(estimators, voting='soft')
VC = VC.fit(X_train_s, y_train)

In [None]:
y_pred_VC = VC.predict(X_test_s)
print(classification_report(y_test, y_pred_VC))
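
To make the two voting modes concrete, the following sketch applies both rules by hand. The probability arrays are hypothetical, not outputs of the models above: soft voting averages the class probabilities of the models, while hard voting takes a majority vote over their individual predictions.

```python
import numpy as np

# Hypothetical class probabilities from three models for four samples
# (columns correspond to classes 1, 2, 3).
proba_lr = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.4, 0.35, 0.25], [0.1, 0.2, 0.7]])
proba_svm = np.array([[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.3, 0.45, 0.25], [0.2, 0.2, 0.6]])
proba_rf = np.array([[0.5, 0.4, 0.1], [0.1, 0.2, 0.7], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]])
classes = np.array([1, 2, 3])
stacked = np.stack([proba_lr, proba_svm, proba_rf])  # (models, samples, classes)

# Soft voting: average the probabilities, then pick the most probable class.
soft = classes[stacked.mean(axis=0).argmax(axis=1)]

# Hard voting: each model votes for its own most probable class, majority wins.
votes = stacked.argmax(axis=2)  # (models, samples)
hard = classes[[np.bincount(col, minlength=3).argmax() for col in votes.T]]

print("soft:", soft)  # soft: [1 3 2 3]
print("hard:", hard)  # hard: [1 2 2 3]
```

The second sample shows how the two rules can disagree: two models narrowly prefer class 2, but one model is confident enough in class 3 to win the average.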

This model had the same performance with voting set to 'soft' and to 'hard', and it is similar to the SVC. Finally, in order to compare the metrics of every model, let's summarize all of them in a table.  
Note: since the label is multiclass, the weighted average was computed for the following metrics: precision, recall, F1-score and area under the curve.

In [None]:
metrics = list()
models = ['Logistic Regression', 'Support Vector Classifier', 'Random Forest', 'Voting Classifier']
predictions=[y_pred_lr, y_pred_svm, y_pred_rf, y_pred_VC]

for lab,i in zip(models, predictions):
    precision, recall, fscore, _ = score(y_test, i, average='weighted')
    accuracy = accuracy_score(y_test, i)
    auc = roc_auc_score(label_binarize(y_test, classes=[1,2,3]),
                        label_binarize(i, classes=[1,2,3]),
                        average='weighted')
    metrics.append(pd.Series({'precision':precision, 'recall':recall,
                              'fscore':fscore, 'accuracy':accuracy,
                              'auc':auc}, name=lab))
    
metrics = pd.concat(metrics, axis=1)

In [None]:
metrics
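
The `auc` column in the table is computed from binarized hard predictions, which reduces each ROC curve to a single operating point. A hedged alternative is to pass the predicted probabilities to `roc_auc_score` with `multi_class='ovr'`, which uses the full ranking of the scores. The sketch below compares both on hypothetical synthetic data; `clf` and the other names are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Hypothetical three-class data, relabeled to 1, 2, 3 like the fetal_health label.
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=3, random_state=42)
y = y + 1
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# AUC from hard class predictions (one operating point per class).
auc_hard = roc_auc_score(label_binarize(y_te, classes=[1, 2, 3]),
                         label_binarize(clf.predict(X_te), classes=[1, 2, 3]),
                         average="weighted")

# AUC from predicted probabilities (one-vs-rest, weighted by class support).
auc_proba = roc_auc_score(y_te, clf.predict_proba(X_te),
                          multi_class="ovr", average="weighted")

print("AUC from hard predictions:", round(auc_hard, 4))
print("AUC from probabilities:   ", round(auc_proba, 4))
```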

We can see above that all models performed very well: even the worst accuracy is almost 96%, not far from the 97.5% of the best one. However, since we are dealing with a medical setting where the health of patients matters most, the recommended model is Random Forest because of its highest metrics, relatively fast training and easy interpretability. From here on we will compute the metrics and plots for the chosen model.  
Let's plot the confusion matrix:

## Plots of the Best Model

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

In [None]:
disp = ConfusionMatrixDisplay(confusion_matrix=confusion_matrix(y_test, y_pred_rf), display_labels=RF_300.classes_)
disp.plot(cmap='Blues')

The model misclassified some instances that belong to classes 1 and 2, which is why precision and recall for these classes are about 97%, but overall it classified the vast majority correctly.
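
The link between the confusion matrix and the per-class precision and recall can be sketched as follows. The matrix here is hypothetical and does not reproduce the values obtained above; it follows the scikit-learn convention in which rows are true classes and columns are predicted classes.

```python
import numpy as np

# Hypothetical 3x3 confusion matrix for classes 1, 2, 3
# (rows: true class, columns: predicted class).
cm = np.array([[480, 12, 3],
               [10, 485, 5],
               [2, 4, 490]])

tp = np.diag(cm)                 # correctly classified instances per class
precision = tp / cm.sum(axis=0)  # divide by the column totals (all predicted as the class)
recall = tp / cm.sum(axis=1)     # divide by the row totals (all that truly are the class)

for cls, p, r in zip([1, 2, 3], precision, recall):
    print("class %d: precision=%.3f recall=%.3f" % (cls, p, r))
```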

Plotting ROC Curve and Precision-Recall Curve:

In [None]:
from sklearn.metrics import roc_curve, precision_recall_curve
sns.set_context('talk')

In [None]:
y_prob = RF_300.predict_proba(X_test_s)

In [None]:
y_test_b=label_binarize(y_test, classes=[1,2,3])

In [None]:
from itertools import cycle
from sklearn.metrics import auc
fpr = dict()
tpr = dict()
roc_auc = dict()

n_class = 3
lw = 2
plt.figure(figsize=(10,8))

for i in range(n_class):
    fpr[i], tpr[i], _ = roc_curve(y_test_b[:, i], y_prob[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
    
colors = cycle(['aqua', 'darkorange', 'cornflowerblue'])
for i, color in zip(range(n_class), colors):
    # Column i of y_prob corresponds to class i + 1 (classes are 1, 2, 3).
    plt.plot(fpr[i], tpr[i], color=color,lw=lw,label='ROC curve of class {0} (area = {1:0.4f})'
             ''.format(i + 1, roc_auc[i]))
    
plt.plot([0, 1], [0, 1], 'k--', lw=lw)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Fetal Health Classification')
plt.legend(loc='best')
plt.show()

In [None]:
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score

precision = dict()
recall = dict()
average_precision = dict()
lines = []
labels = []
plt.figure(figsize=(10,8))

for i in range(n_class):
    precision[i], recall[i], _ = precision_recall_curve(y_test_b[:, i],y_prob[:, i])
    average_precision[i] = average_precision_score(y_test_b[:, i], y_prob[:, i])

for i, color in zip(range(n_class), colors):
    # Column i of y_prob corresponds to class i + 1; the value shown is the average precision.
    plt.plot(recall[i], precision[i], color=color, lw=2, label='Precision-recall for class {0} (AP = {1:0.4f})'
             ''.format(i + 1, average_precision[i]))

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Fetal Health Classification')
plt.legend(loc='lower left')
plt.show()

The effect of balancing the classes of the label by oversampling was evident: it avoided a tendency towards predicting the majority class (class 1, Normal), and it made the confusion matrix easier to compare because each class has approximately the same number of instances. Another highlight is that, since Random Forest is a tree-based model, we could simply have used the features without encoding or scaling, which makes the model easier and faster to build. The polynomial transformation, in contrast, had a significant and worthwhile effect: as the feature importance plot below shows, the two strongest predictors were created in this process, which supports the use of polynomial features when training models.

In [None]:
feat=pd.DataFrame(RF_300.feature_importances_,index=X_res.columns, columns=['Importance']).sort_values(by='Importance',ascending=False).head(15)
ax=feat.plot(kind='bar', figsize=(16,6))
ax.set(ylabel='Feature Importance')
ax.set(xlabel='Features')

Finally, although the label was balanced in the end, it is always preferable for the original dataset to contain a large number of records for each class, so that the population is well represented and the predictions are more accurate and reliable. In the original dataset the least frequent class, which surprisingly is 'Pathological', has only 176 records. Having at least 1000 of them would be more representative and would let us balance the label by undersampling, which gives better results than synthesizing records from existing ones.