# Fetal Health Condition Dataset

This **notebook** explains how to use the [Fetal Health Classification](https://www.kaggle.com/andrewmvd/fetal-health-classification) dataset and a **Random Forest Classifier** to automatically detect the **fetal health condition** from cardiotocogram (CTG) information.


# Dataset Information

2126 fetal **cardiotocograms** (CTGs) were automatically processed and their diagnostic features measured. The CTGs were also classified by three expert obstetricians, and a **consensus classification label** was assigned to each of them. Each CTG was classified with respect to both a **morphologic pattern (A, B, C, ...)** and a **fetal state (N, S, P)**. The dataset can therefore be used for either 10-class or 3-class experiments.

## Inputs

The dataset contains a total of 21 inputs, described below:

1. fetal heart rate (FHR) baseline (beats per minute);
2. number of accelerations per second;
3. number of fetal movements per second;
4. number of uterine contractions per second;
5. number of light decelerations per second;
6. number of severe decelerations per second;
7. number of prolonged decelerations per second;
8. percentage of time with abnormal short term variability;
9. mean value of short term variability;
10. percentage of time with abnormal long term variability;
11. mean value of long term variability;
12. width of FHR histogram;
13. minimum of FHR histogram;
14. maximum of FHR histogram;
15. number of histogram peaks;
16. number of histogram zeros;
17. histogram mode;
18. histogram mean;
19. histogram median;
20. histogram variance; and
21. histogram tendency.

## Target Variable

This notebook uses the **fetal state** as the **target variable**. As mentioned above, the fetal state falls into one of 3 classes (**N**: Normal, **S**: Suspect, or **P**: Pathologic).



In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt

import matplotlib
matplotlib.rcParams['mathtext.fontset'] = 'stix'
matplotlib.rcParams['font.family'] = 'sans-serif'
matplotlib.rcParams['font.size'] = 12

In [None]:
# preprocessing libraries
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder

# model selection libraries
from sklearn.model_selection import train_test_split

# machine learning libraries
from sklearn.ensemble import RandomForestClassifier

# postprocessing and checking-results libraries
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
from imblearn.over_sampling import RandomOverSampler

In [None]:
def plotConfusionMatrix(dtrue,dpred,classes,title = 'Confusion Matrix',
                        width = 0.75,cmap = plt.cm.Blues):
    """Plot a row-normalized confusion matrix (each row sums to 1)."""

    # raw counts, then divide each row by the number of true samples in that class
    cm = confusion_matrix(dtrue,dpred)
    cm = cm.astype(float) / cm.sum(axis = 1)[:,np.newaxis]

    # the figure size scales with the number of classes
    nclasses = len(classes)
    fig,ax = plt.subplots(figsize = (nclasses * width,nclasses * width))
    im = ax.imshow(cm,interpolation = 'nearest',cmap = cmap)

    ax.set(xticks = np.arange(cm.shape[1]),
           yticks = np.arange(cm.shape[0]),
           xticklabels = classes,
           yticklabels = classes,
           title = title,
           aspect = 'equal')

    ax.set_ylabel('True',labelpad = 20)
    ax.set_xlabel('Predicted',labelpad = 20)

    plt.setp(ax.get_xticklabels(),rotation = 90,ha = 'right',
             va = 'center',rotation_mode = 'anchor')

    # annotate each cell; use white text on dark cells for readability
    fmt = '.2f'
    thresh = cm.max() / 2.0

    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j,i,format(cm[i,j],fmt),ha = 'center',va = 'center',
                    color = 'white' if cm[i,j] > thresh else 'black')
    plt.tight_layout()
    plt.show()
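The row normalization used in `plotConfusionMatrix` divides each row by the number of true samples in that class, so every diagonal entry becomes the recall of that class. The following is a minimal NumPy sketch of that step; the count matrix is made up for illustration and is not a result from this dataset.

```python
import numpy as np

# Hypothetical raw confusion matrix (rows: true class, columns: predicted class).
# These counts are invented for illustration only.
cm_counts = np.array([[90,  8,  2],
                      [10, 35,  5],
                      [ 1,  3, 16]])

# Divide every row by its total so that each row sums to 1.
row_totals = cm_counts.sum(axis=1, keepdims=True)
cm_norm = cm_counts / row_totals

# After row normalization the diagonal holds the per-class recall.
recall_per_class = np.diag(cm_norm)

print(np.round(cm_norm, 2))
print(np.round(recall_per_class, 2))
```

Because each row is scaled independently, a small class and a large class are displayed on the same 0 to 1 scale, which is convenient for an imbalanced dataset.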

The variables `X` and `y` hold the **input** features and the **label** vector, respectively. Duplicate rows are dropped before the features and labels are separated.

In [None]:
df = pd.read_csv('../input/fetal-health-classification/fetal_health.csv')
df.drop_duplicates(inplace = True)

y = LabelEncoder().fit_transform(df['fetal_health'])
X = df.drop(columns = ['fetal_health'])
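`LabelEncoder` assigns integer codes according to the sorted order of the distinct label values. The sketch below shows this on a toy array; it assumes the `fetal_health` column stores the three states as 1.0, 2.0 and 3.0, and the `state_names` mapping is a hypothetical helper that is not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy label column; assumes the states are coded 1.0, 2.0, 3.0 (N, S, P).
raw_labels = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 1.0])

encoder = LabelEncoder()
encoded = encoder.fit_transform(raw_labels)

# classes_ keeps the original values in sorted order; their positions are the codes.
print(encoder.classes_)
print(encoded)

# Hypothetical mapping from integer codes to readable state names.
state_names = {0: 'N', 1: 'S', 2: 'P'}
readable = [state_names[int(code)] for code in encoded]
print(readable)
```

Under that assumption, code 0 corresponds to **N**, 1 to **S** and 2 to **P**, which matches the tick labels used in the plots.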

The dataset is **imbalanced**, as the bar chart below shows.

In [None]:
count = np.zeros(3)
for i in range(3):
    count[i] = np.where(y == i)[0].size
    
plt.subplots(figsize = (6.0,6.0))
plt.bar(np.arange(3),count,color = 'orange',edgecolor = 'black')
plt.xticks(np.arange(3),('N','S','P'))
plt.xlabel('Fetal State')
plt.ylabel('Number of Instances')
plt.show()
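The imbalance can also be expressed numerically. The sketch below counts the instances per class with `np.bincount` and reports the class proportions and the majority-to-minority ratio; the toy labels are invented and do not reflect the real class counts.

```python
import numpy as np

# Toy encoded labels (0 = N, 1 = S, 2 = P); invented for illustration.
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 2])

counts = np.bincount(y_toy, minlength=3)        # instances per class
fractions = counts / counts.sum()               # class proportions
imbalance_ratio = counts.max() / counts.min()   # majority-to-minority ratio

for name, n, frac in zip(('N', 'S', 'P'), counts, fractions):
    print(f'{name}: {n} instances ({frac:.0%})')
print(f'imbalance ratio: {imbalance_ratio:.1f}')
```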

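`StandardScaler` standardizes each input feature to zero mean and unit variance (it does not rescale samples to unit norm); `Xnorm` is the **standardized** version of the **input** matrix `X`. Tree-based models such as Random Forests are largely insensitive to feature scaling, so this step is not strictly required here, but it keeps the pipeline reusable with scale-sensitive classifiers. Note that the scaler is fitted on the full dataset before the train/test split, so test-set statistics influence the transformation.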

In [None]:
scaler = StandardScaler().fit(X)
Xnorm = scaler.transform(X)
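As a sanity check on what `StandardScaler` does, the sketch below standardizes a synthetic two-feature matrix and verifies that each column ends up with mean 0 and standard deviation 1. It also illustrates a common variant in which the scaler is fitted on the training portion only and then applied to the held-out rows; the data and the 140/60 split are assumptions made for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(21)

# Synthetic matrix with two features on very different scales.
X_toy = np.column_stack([rng.normal(140.0, 10.0, size=200),
                         rng.normal(0.003, 0.001, size=200)])

# Stand-in for a train/test split: first 140 rows train, last 60 rows test.
X_train_toy, X_test_toy = X_toy[:140], X_toy[140:]

# Fit on the training rows only, then apply the same transform to both parts.
scaler_toy = StandardScaler().fit(X_train_toy)
X_train_std = scaler_toy.transform(X_train_toy)
X_test_std = scaler_toy.transform(X_test_toy)

print(X_train_std.mean(axis=0).round(6))  # close to 0 for each column
print(X_train_std.std(axis=0).round(6))   # close to 1 for each column
```

The held-out rows are transformed with the training mean and variance, so their column statistics are only approximately 0 and 1.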

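70% of the samples in `Xnorm` are used to train the classifier; the remaining 30% are held out for testing. The split is stratified so that both parts keep the original class proportions.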

In [None]:
Xtrain,Xtest,ytrain,ytest = train_test_split(Xnorm,y,test_size = 0.30,stratify = y,shuffle = True,random_state = 21)
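The effect of `stratify` can be checked on a small example. The sketch below splits toy labels with a 70/20/10 class mix and compares the class proportions in the two parts; the data are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 70 N, 20 S, 10 P (invented for illustration).
y_toy = np.repeat([0, 1, 2], [70, 20, 10])
X_toy = np.arange(y_toy.size).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, stratify=y_toy, shuffle=True, random_state=21)

# Class proportions in each part should match the full dataset.
train_fractions = np.bincount(y_tr, minlength=3) / y_tr.size
test_fractions = np.bincount(y_te, minlength=3) / y_te.size

print(train_fractions.round(2))
print(test_fractions.round(2))
```

Without stratification, a rare class such as **P** could end up under-represented in the test set purely by chance.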

To compensate for the class imbalance, `RandomOverSampler` is used to equalize the number of samples in the classes **N**, **S** and **P**. It is applied to the training set only, so the test set keeps the original class distribution.

In [None]:
Xtrain,ytrain = RandomOverSampler(random_state = 21).fit_resample(Xtrain,ytrain)
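Conceptually, random oversampling duplicates randomly chosen minority-class samples until every class is as large as the majority class. The NumPy-only sketch below illustrates that idea; `random_oversample` is a hypothetical helper written for this example and is not the `imblearn` implementation.

```python
import numpy as np

def random_oversample(X, y, seed=21):
    """Hypothetical helper: duplicate minority samples until all classes match the largest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()

    # Start with every original sample, then add random duplicates per minority class.
    keep = [np.arange(y.size)]
    for cls, n in zip(classes, counts):
        if n < target:
            idx = np.flatnonzero(y == cls)
            keep.append(rng.choice(idx, size=target - n, replace=True))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

# Toy training set: 7 N, 2 S, 1 P (invented for illustration).
y_toy = np.repeat([0, 1, 2], [7, 2, 1])
X_toy = np.arange(y_toy.size).reshape(-1, 1)

X_bal, y_bal = random_oversample(X_toy, y_toy)
print(np.bincount(y_bal))
```

Because the added rows are exact copies, resampling after the train/test split keeps duplicated samples out of the test set.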

In [None]:
clf = RandomForestClassifier(random_state = 21).fit(Xtrain,ytrain)

In [None]:
ypred = clf.predict(Xtest)

The model reaches an accuracy of about 94% on the held-out test set, as the `classification_report` below shows.

In [None]:
print(classification_report(ytest,ypred))
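On an imbalanced test set, overall accuracy is best read together with the per-class recall. The sketch below shows one way to extract both from `classification_report` by passing `target_names` and `output_dict=True`; the toy labels are invented and the numbers are not results from this notebook.

```python
from sklearn.metrics import classification_report

# Toy true and predicted labels (0 = N, 1 = S, 2 = P); invented for illustration.
y_true_toy = [0, 0, 0, 1, 1, 2]
y_pred_toy = [0, 0, 1, 1, 1, 2]

# output_dict=True returns a nested dictionary instead of a formatted string.
report = classification_report(y_true_toy, y_pred_toy,
                               target_names=['N', 'S', 'P'],
                               output_dict=True)

print(f"accuracy: {report['accuracy']:.3f}")
for name in ('N', 'S', 'P'):
    print(f"{name}: recall = {report[name]['recall']:.3f}")
```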

In [None]:
plotConfusionMatrix(ytest,ypred,classes = np.array(['N','S','P']),width = 1.5,cmap = plt.cm.binary)