# Fetal Health Prediction

**Author: Victor Mayowa (MB;BS, Ilorin)**

**Source: [Kaggle](https://www.kaggle.com/datasets/andrewmvd/fetal-health-classification)**

<ul>
<li><a href="#title">Title page</a></li>
<li><a href="#toc">Table of contents</a></li>
<li><a href="#abbreviation">List of abbreviations</a></li>
<li><a href="#abstract">Summary</a></li>
<li><a href="#background">Background for the study</a></li>
<li><a href="#aim">Aims</a></li>
<li><a href="#methodology">Proposed methodology</a></li>
<li><a href="#ethic">Ethical considerations</a></li>
<li><a href="#reference">List of references</a></li>
<li><a href="#appendix">Appendices</a></li>
</ul>

In [1]:
!pip install -r module.txt

ERROR: Invalid requirement: '_libgcc_mutex             0.1                 conda_forge    conda-forge' (from line 4 of module.txt)


#### List of Abbreviations

#### Summary

#### Abstract
Classify fetal health in order to prevent child and maternal mortality.

#### Context

Reduction of child mortality is reflected in several of the United Nations' Sustainable Development Goals and is a key indicator of human progress.
The UN expects that by 2030, countries end preventable deaths of newborns and children under 5 years of age, with all countries aiming to reduce under‑5 mortality to at least as low as 25 per 1,000 live births.

Alongside child mortality stands maternal mortality, which accounted for **295 000 deaths** during and following pregnancy and childbirth in 2017. The vast majority of these deaths **(94%)** occurred in low-resource settings, and most could have been prevented.

In light of the above, **cardiotocograms (CTGs)** are a simple and affordable way to assess fetal health, allowing healthcare professionals to act in time to prevent child and maternal mortality. The equipment works by sending ultrasound pulses and reading the response, thus shedding light on fetal heart rate (FHR), fetal movements, uterine contractions and more.



#### Data Summary

This dataset contains **2126 records** of features extracted from cardiotocogram exams, which were then classified by three expert obstetricians into **3 classes:**

* Normal
* Suspect
* Pathological

#### Data Loading

In [2]:
# install all required libraries
#!pip install -U dataprep

In [3]:
#!pip install ydata-profiling

In [4]:
#!pip install xgboost


Collecting xgboost
  Downloading xgboost-2.0.1-py3-none-win_amd64.whl (99.7 MB)
Installing collected packages: xgboost
Successfully installed xgboost-2.0.1


In [5]:
import pandas as pd
import numpy as np
import seaborn as sns
import scipy
import matplotlib.pyplot as plt
#from ydata_profiling import ProfileReport
import warnings

warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)

In [6]:
import torch
import math
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, BaggingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
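from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier  # used in the grid-search parameter dictionary below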
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import precision_recall_fscore_support, f1_score, accuracy_score, classification_report, confusion_matrix, RocCurveDisplay, PrecisionRecallDisplay

In [7]:
df = pd.read_csv('fetal_health.csv')

In [8]:
df.head(5)

Unnamed: 0,baseline value,accelerations,fetal_movement,uterine_contractions,light_decelerations,severe_decelerations,prolongued_decelerations,abnormal_short_term_variability,mean_value_of_short_term_variability,percentage_of_time_with_abnormal_long_term_variability,mean_value_of_long_term_variability,histogram_width,histogram_min,histogram_max,histogram_number_of_peaks,histogram_number_of_zeroes,histogram_mode,histogram_mean,histogram_median,histogram_variance,histogram_tendency,fetal_health
0,120.0,0.0,0.0,0.0,0.0,0.0,0.0,73.0,0.5,43.0,2.4,64.0,62.0,126.0,2.0,0.0,120.0,137.0,121.0,73.0,1.0,2.0
1,132.0,0.006,0.0,0.006,0.003,0.0,0.0,17.0,2.1,0.0,10.4,130.0,68.0,198.0,6.0,1.0,141.0,136.0,140.0,12.0,0.0,1.0
2,133.0,0.003,0.0,0.008,0.003,0.0,0.0,16.0,2.1,0.0,13.4,130.0,68.0,198.0,5.0,1.0,141.0,135.0,138.0,13.0,0.0,1.0
3,134.0,0.003,0.0,0.008,0.003,0.0,0.0,16.0,2.4,0.0,23.0,117.0,53.0,170.0,11.0,0.0,137.0,134.0,137.0,13.0,1.0,1.0
4,132.0,0.007,0.0,0.008,0.0,0.0,0.0,16.0,2.4,0.0,19.9,117.0,53.0,170.0,9.0,0.0,137.0,136.0,138.0,11.0,1.0,1.0


In [9]:
df.shape


(2126, 22)

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2126 entries, 0 to 2125
Data columns (total 22 columns):
 #   Column                                                  Non-Null Count  Dtype  
---  ------                                                  --------------  -----  
 0   baseline value                                          2126 non-null   float64
 1   accelerations                                           2126 non-null   float64
 2   fetal_movement                                          2126 non-null   float64
 3   uterine_contractions                                    2126 non-null   float64
 4   light_decelerations                                     2126 non-null   float64
 5   severe_decelerations                                    2126 non-null   float64
 6   prolongued_decelerations                                2126 non-null   float64
 7   abnormal_short_term_variability                         2126 non-null   float64
 8   mean_value_of_short_term_variability  

In [11]:
# Shift the labels from 1/2/3 to 0/1/2 (XGBClassifier expects classes numbered from 0)
df['fetal_health'] = df['fetal_health'].astype(int) - 1
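
The shift to zero-based labels can also be done with the `LabelEncoder` that is already imported, which additionally keeps the inverse mapping. The following is a minimal sketch, not the notebook's own method: the toy `raw_labels` series and the `class_names` dictionary (Normal, Suspect, Pathological in the order the classes are listed above) are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for df['fetal_health'] (assumption: raw labels are 1.0, 2.0, 3.0)
raw_labels = pd.Series([2.0, 1.0, 1.0, 3.0, 1.0], name="fetal_health")

encoder = LabelEncoder()
encoded = encoder.fit_transform(raw_labels)    # sorted classes, numbered from 0
decoded = encoder.inverse_transform(encoded)   # back to the original codes

# Hypothetical readable names for the zero-based codes
class_names = {0: "Normal", 1: "Suspect", 2: "Pathological"}
named = [class_names[int(code)] for code in encoded]

print(encoded.tolist())
print(named)
```

Keeping the fitted encoder around makes it easy to translate predictions back to the original coding when reporting results.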

In [12]:
data = df.drop_duplicates()

In [13]:
data.shape

(2113, 22)
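
The drop from 2126 to 2113 rows means that `drop_duplicates()` removed 13 exact duplicate rows. As a small sketch of how that count can be checked before dropping, the toy frame below (hypothetical values, three columns only) contains one duplicate row:

```python
import pandas as pd

# Toy frame with hypothetical values; row 2 repeats row 1 exactly
toy = pd.DataFrame({
    "baseline value": [120.0, 132.0, 132.0, 134.0],
    "accelerations": [0.0, 0.006, 0.006, 0.003],
    "fetal_health": [1, 0, 0, 0],
})

# duplicated() flags every row that is identical to an earlier row
n_duplicates = int(toy.duplicated().sum())
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, toy.shape, deduplicated.shape)
```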

In [14]:
data['fetal_health'].value_counts()

In [None]:
X = data.drop("fetal_health", axis=1)
y = data["fetal_health"]

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
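
The split above is not stratified. If the three classes turn out to be imbalanced (an assumption here, since the class counts are not shown in this notebook), passing `stratify=y` keeps the class proportions the same in the training and test sets. A minimal sketch on toy data with an assumed 80/15/5 class mix:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with an assumed 80/15/5 class imbalance
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 4))
y_toy = np.array([0] * 80 + [1] * 15 + [2] * 5)

# stratify=y_toy preserves the class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

train_counts = np.bincount(y_tr, minlength=3)
test_counts = np.bincount(y_te, minlength=3)
print(train_counts, test_counts)
```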

In [None]:
# Feature Selection using SelectKBest
k = 21  # Number of top features to keep; the data has 21 predictors, so k=21 keeps all of them
selector = SelectKBest(score_func=f_classif, k=k)
X_train_selected = selector.fit_transform(X_train_scaled, y_train)
X_test_selected = selector.transform(X_test_scaled)

# Train and evaluate a Random Forest classifier with SelectKBest
rf_with_selectk = RandomForestClassifier(random_state=42)
rf_with_selectk.fit(X_train_selected, y_train)
y_pred_selectk = rf_with_selectk.predict(X_test_selected)
accuracy_selectk = accuracy_score(y_test, y_pred_selectk)

# Print results
print("Accuracy with SelectKBest:", accuracy_selectk)
print("Classification Report with SelectKBest:\n", classification_report(y_test, y_pred_selectk))
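
To see which features `SelectKBest` actually ranks highest, the fitted selector exposes `scores_` and `get_support()`. The sketch below uses synthetic data from `make_classification` and made-up column names (`feature_0` to `feature_7`) rather than the CTG features, and picks `k=3` purely for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic 3-class data standing in for the CTG features (assumption)
X_arr, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=42,
)
X_toy = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])

selector = SelectKBest(score_func=f_classif, k=3).fit(X_toy, y_toy)

# ANOVA F-scores per feature, highest first
scores = pd.Series(selector.scores_, index=X_toy.columns).sort_values(ascending=False)
# Names of the k columns the selector keeps
selected = X_toy.columns[selector.get_support()].tolist()

print(scores.round(1))
print(selected)
```

Choosing `k` smaller than the number of predictors is what makes the step a real selection; comparing accuracy across several values of `k` would show whether any features can be dropped.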

In [None]:
# Dimensionality Reduction using PCA
# Try different values of n_components and evaluate performance
for n_components in [10, 15, 20]:
    pca = PCA(n_components=n_components)
    X_train_pca = pca.fit_transform(X_train_scaled)
    X_test_pca = pca.transform(X_test_scaled)

    # Train and evaluate a Random Forest classifier with PCA
    rf_with_pca = RandomForestClassifier(random_state=42)
    rf_with_pca.fit(X_train_pca, y_train)
    y_pred_pca = rf_with_pca.predict(X_test_pca)
    accuracy_pca = accuracy_score(y_test, y_pred_pca)

    # Print results for each value of n_components
    print(f"Accuracy with PCA (n_components={n_components}):", accuracy_pca)
    print(f"Classification Report with PCA (n_components={n_components}):\n", classification_report(y_test, y_pred_pca))
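
Instead of trying fixed values of `n_components`, the cumulative explained variance can guide the choice. The sketch below is an illustration on synthetic data (21 features, to mirror the number of predictors here); the 95% threshold is an arbitrary assumption, not a value taken from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data with 21 features standing in for the scaled training set
X_toy, _ = make_classification(
    n_samples=300, n_features=21, n_informative=6, n_redundant=10, random_state=42
)
X_scaled = StandardScaler().fit_transform(X_toy)

# Fit with all components and accumulate the explained variance ratio
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_for_95 = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

# A float n_components asks PCA to pick the count that exceeds that variance share
pca_95 = PCA(n_components=0.95).fit(X_scaled)

print(n_for_95, pca_95.n_components_)
```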

#### Data Preprocessing

#### Exploratory Data Analysis

In [None]:
#!conda install dask 

In [None]:
#from dataprep.eda import create_report, plot, plot_correlation, plot_missing

#### Model development

classifiers = dict(
    k_nn=KNeighborsClassifier(n_neighbors=2), 
    mlp=MLPClassifier(alpha=1, max_iter=100),
    svm=SVC(probability=True),
    random_forest=RandomForestClassifier(random_state=42),
    gradientboost=GradientBoostingClassifier(),
    adaboost=AdaBoostClassifier(),
    xgboost=XGBClassifier(),
)
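
A dictionary of classifiers like the one above can be scored in a single loop to get untuned baselines before any hyperparameter search. This is a sketch under assumptions: it uses synthetic data, only two of the classifiers to keep it fast, and wraps k-NN in a scaling pipeline, which the notebook itself does not do.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for X and y (assumption)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=42
)

baseline_classifiers = dict(
    k_nn=make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=2)),
    random_forest=RandomForestClassifier(n_estimators=50, random_state=42),
)

# 5-fold cross-validated accuracy for each untuned classifier
baseline_scores = {}
for name, clf in baseline_classifiers.items():
    scores = cross_val_score(clf, X_toy, y_toy, cv=5, scoring="accuracy")
    baseline_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```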

In [15]:
classifiers = {
    'SVC': SVC(),
    'RandomForest': RandomForestClassifier(),
    'AdaBoost': AdaBoostClassifier(),
    'Bagging': BaggingClassifier(),
    'GradientBoost': GradientBoostingClassifier(),
    'XGBClassifier': XGBClassifier(),
    'MLP': MLPClassifier()
}

# Define the hyperparameters for each classifier
parameters = {
    'SVC': {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']},
    'RandomForest': {'n_estimators': [50, 100, 200], 'max_depth': [None, 5, 10]},
    'AdaBoost': {'n_estimators': [50, 100, 200], 'learning_rate': [0.01, 0.1, 1]},
    'Bagging': {'n_estimators': [50, 100, 200], 'max_samples': [0.5, 1.0]},
    'GradientBoost': {'n_estimators': [100, 200], 'learning_rate': [1e-1, 5e-2, 1e-2, 5e-3], 'max_depth': [3, 4, 5]},
    'XGBClassifier': {'booster': ['gbtree', 'gblinear', 'dart'], 'eta': [0.1, 0.2, 0.3, 0.4], 'alpha': [0.1, 0.2, 0.3, 0.4, 0.6]},
    'MLP': {'hidden_layer_sizes': [(50,50,50), (50,100,50), (100,)],
            'activation': ['tanh', 'relu'],
            'solver': ['sgd', 'adam'],
            'alpha': [0.0001, 0.05],
            'learning_rate': ['constant','adaptive']}
}
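
Before running a search, it helps to know how many model fits each grid implies. The sketch below counts the candidates in the grids defined above with `ParameterGrid` and multiplies by the number of folds; `cv_folds = 5` is an assumption matching the `cv=5` used in the grid search later in this notebook.

```python
from sklearn.model_selection import ParameterGrid

# Same grids as defined above, copied so the example is self-contained
parameters = {
    'SVC': {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']},
    'RandomForest': {'n_estimators': [50, 100, 200], 'max_depth': [None, 5, 10]},
    'AdaBoost': {'n_estimators': [50, 100, 200], 'learning_rate': [0.01, 0.1, 1]},
    'Bagging': {'n_estimators': [50, 100, 200], 'max_samples': [0.5, 1.0]},
    'GradientBoost': {'n_estimators': [100, 200], 'learning_rate': [1e-1, 5e-2, 1e-2, 5e-3], 'max_depth': [3, 4, 5]},
    'XGBClassifier': {'booster': ['gbtree', 'gblinear', 'dart'], 'eta': [0.1, 0.2, 0.3, 0.4], 'alpha': [0.1, 0.2, 0.3, 0.4, 0.6]},
    'MLP': {'hidden_layer_sizes': [(50, 50, 50), (50, 100, 50), (100,)],
            'activation': ['tanh', 'relu'],
            'solver': ['sgd', 'adam'],
            'alpha': [0.0001, 0.05],
            'learning_rate': ['constant', 'adaptive']},
}

cv_folds = 5  # assumed number of cross-validation folds

# Number of parameter combinations per classifier
grid_sizes = {name: len(ParameterGrid(grid)) for name, grid in parameters.items()}
total_fits = sum(grid_sizes.values()) * cv_folds

for name, size in grid_sizes.items():
    print(f"{name}: {size} candidates, {size * cv_folds} fits")
print("Total fits:", total_fits)
```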


In [None]:
import pandas as pd
from sklearn.metrics import precision_score, recall_score, fbeta_score, f1_score, accuracy_score

# Fit a grid search per classifier and score it on the held-out test set.
# `classifiers` and `parameters` are the dictionaries defined above.

# Create an empty DataFrame to store the results
results_df = pd.DataFrame(columns=['Classifier', 'Metric', 'Value'])
metrics = ['Precision', 'Recall', 'F-beta Score', 'F1 Score', 'Accuracy']

# Iterate through classifiers and get evaluation metrics
for classifier_name, classifier in classifiers.items():
    # Refit the search here so the predictions come from this classifier,
    # not from whichever search object happened to be fitted last
    grid_search = GridSearchCV(classifier, parameters[classifier_name], cv=5, scoring='accuracy')
    grid_search.fit(X_train, y_train)
    y_pred = grid_search.predict(X_test)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    fbeta = fbeta_score(y_test, y_pred, beta=1, average='weighted')  # beta=1 makes this identical to F1
    f1 = f1_score(y_test, y_pred, average='weighted')
    accuracy = accuracy_score(y_test, y_pred)
    
    metrics_values = [precision, recall, fbeta, f1, accuracy]
    
    # Create a DataFrame for the current classifier's metrics
    classifier_df = pd.DataFrame({
        'Classifier': [classifier_name]*len(metrics),
        'Metric': metrics,
        'Value': metrics_values
    })
    
    # Append the classifier's metrics to the results DataFrame
    results_df = pd.concat([results_df, classifier_df], axis=0)

# Index the results by (Classifier, Metric)
results_df.set_index(['Classifier', 'Metric'], inplace=True)

# Display the results DataFrame
print(results_df)
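
Weighted averages can hide weak performance on a rare class, which matters when the rare class is the clinically important one. As a sketch on hand-made toy labels (not results from this dataset), per-class precision and recall plus a confusion matrix show where the errors fall:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support

# Toy labels (assumption): 0 = Normal, 1 = Suspect, 2 = Pathological
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 0])

# average=None returns one value per class instead of a single summary
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, average=None, labels=[0, 1, 2]
)
_, weighted_recall, _, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted"
)
accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

print("Per-class recall:", recall)
print("Weighted recall:", weighted_recall, "Accuracy:", accuracy)
print(cm)
```

In this toy case the weighted recall equals the accuracy, while the per-class view reveals that half of the class 2 samples were missed.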


In [None]:
classification_grid_parameters = {
    SVC(): {
        'C': [0.0005, 0.001, 0.002, 0.01, 0.1, 1, 10],
        'gamma': [0.001, 0.01, 0.1, 1],
        'kernel': ['rbf', 'poly', 'sigmoid']
    },
    RandomForestClassifier(): {
        'n_estimators': [10, 40, 70, 100],
        'max_depth': [3, 5, 7],
        'min_samples_split': [0.2, 0.5, 0.7, 2],
        'min_samples_leaf': [0.2, 0.5, 1, 2],
        'max_features': [0.2, 0.5, 1, 2],
    },
    AdaBoostClassifier(): {
        'n_estimators': [50, 100, 200],
        'learning_rate': [0.01, 0.1, 1]
    },
    BaggingClassifier(): {
        'n_estimators': [10, 30, 50, 60],
        'max_samples': [0.1, 0.3, 0.5, 0.8, 1.],
        'max_features': [0.2, 0.5, 1, 2],
    },
    DecisionTreeClassifier(): {
        'max_depth': [3, 5, 7, None],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
    },
    GradientBoostingClassifier(): {},
    MLPClassifier(): {
        'hidden_layer_sizes': [(200,), (300,), (400,), (128, 128), (256, 256)],
        'alpha': [0.001, 0.005, 0.01, 1.],
        'batch_size': [128, 256, 512, 1024],
        'learning_rate': ['constant', 'adaptive'],
        'max_iter': [100, 200, 300, 400, 500]
    }
}
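
Some of these grids are large: the `RandomForestClassifier` grid alone has 4 × 3 × 4 × 4 × 4 = 768 combinations. One option, sketched below under assumptions, is the already-imported `RandomizedSearchCV`, which samples a fixed number of candidates from the same grid. The synthetic data, `n_iter=8`, and `cv=3` are illustrative choices, not settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, RandomizedSearchCV

# Synthetic 3-class data standing in for the training set (assumption)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=42
)

# Random forest grid copied from the dictionary above
rf_grid = {
    'n_estimators': [10, 40, 70, 100],
    'max_depth': [3, 5, 7],
    'min_samples_split': [0.2, 0.5, 0.7, 2],
    'min_samples_leaf': [0.2, 0.5, 1, 2],
    'max_features': [0.2, 0.5, 1, 2],
}
full_grid_size = len(ParameterGrid(rf_grid))

# Sample 8 candidates instead of fitting the full grid
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=rf_grid,
    n_iter=8,
    cv=3,
    scoring='accuracy',
    random_state=42,
)
search.fit(X_toy, y_toy)
n_sampled = len(search.cv_results_['params'])

print(f"Full grid: {full_grid_size} candidates, sampled: {n_sampled}")
print("Best parameters:", search.best_params_)
```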

In [None]:
metrics = ['precision', 'recall', 'fbeta_score', 'accuracy']
res = np.empty((len(classifiers), len(metrics)*3+1))

In [None]:
res

In [None]:
classifier

In [None]:
# Define the classifiers and their respective hyperparameters
classifiers = {
    'SVC': SVC(),
    'RandomForest': RandomForestClassifier(),
    'AdaBoost': AdaBoostClassifier(),
    'Bagging': BaggingClassifier(),
    'GradientBoost': GradientBoostingClassifier(),
    'MLP': MLPClassifier()
}

# Define the hyperparameters for each classifier
parameters = {
    'SVC': {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']},
    'RandomForest': {'n_estimators': [50, 100, 200], 'max_depth': [None, 5, 10]},
    'AdaBoost': {'n_estimators': [50, 100, 200], 'learning_rate': [0.01, 0.1, 1]},
    'Bagging': {'n_estimators': [50, 100, 200], 'max_samples': [0.5, 1.0]},
    'GradientBoost': {'n_estimators': [50, 100, 200], 'learning_rate': [0.01, 0.1, 1], 'max_depth': [3, 4, 5]},
    'MLP': {'hidden_layer_sizes': [(50,50,50), (50,100,50), (100,)],
            'activation': ['tanh', 'relu'],
            'solver': ['sgd', 'adam'],
            'alpha': [0.0001, 0.05],
            'learning_rate': ['constant','adaptive']}
}

# Iterate through classifiers and perform GridSearchCV
for classifier_name, classifier in classifiers.items():
    print(f"Grid Search for {classifier_name}:")
    grid_search = GridSearchCV(classifier, parameters[classifier_name], cv=5, scoring='accuracy')
    grid_search.fit(X_train, y_train)
    print(f"Best Parameters: {grid_search.best_params_}")
    print(f"Best Cross-validated Score: {grid_search.best_score_}")
    print(f"Test Accuracy: {grid_search.score(X_test, y_test)}")
    print("\n")
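
The loop above passes the unscaled `X_train` to every classifier, including `SVC` and `MLPClassifier`, which are sensitive to feature scale. One way to add scaling without leaking validation-fold statistics is to put the scaler inside a `Pipeline` and search over that. The following is a sketch on synthetic data; the step names `scaler` and `svc` are arbitrary labels chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class data standing in for X and y (assumption)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# The scaler is refit on the training part of every CV fold
pipeline = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)
test_accuracy = search.score(X_te, y_te)

print("Best parameters:", search.best_params_)
print(f"Test accuracy: {test_accuracy:.3f}")
```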

In [None]:
from sklearn.metrics import make_scorer

def grid_search(self, classifier, params, n_jobs=2, verbose=1):
    """
    Performs GridSearchCV over `params` for `classifier`
    and returns the tuple: (best_estimator, best_params, best_score).
    Written as a method: `self` must expose X_train, y_train, X_test and y_test.
    """
    score = accuracy_score
    grid = GridSearchCV(estimator=classifier, param_grid=params, scoring=make_scorer(score), n_jobs=n_jobs, verbose=verbose, cv=3)
    grid_result = grid.fit(self.X_train, self.y_train)
    y_pred = grid.predict(self.X_test)
    accuracy = accuracy_score(y_true=self.y_test, y_pred=y_pred)
    print("Grid Search Accuracy: {:.2f}%".format(accuracy*100))
    return grid_result.best_estimator_, grid_result.best_params_, grid_result.best_score_
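
`make_scorer` is not limited to accuracy. As a hedged sketch, the example below builds a macro-averaged F1 scorer, which weights every class equally and may be a more informative target when classes are imbalanced. The synthetic data and its 80/15/5 class weights are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced 3-class data (assumed class weights)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3,
    weights=[0.8, 0.15, 0.05], random_state=42,
)

# Extra keyword arguments to make_scorer are forwarded to the metric function
macro_f1 = make_scorer(f1_score, average="macro")

clf = RandomForestClassifier(n_estimators=50, random_state=42)
accuracy_scores = cross_val_score(clf, X_toy, y_toy, cv=3, scoring="accuracy")
macro_f1_scores = cross_val_score(clf, X_toy, y_toy, cv=3, scoring=macro_f1)

print(f"Accuracy: {accuracy_scores.mean():.3f}")
print(f"Macro F1: {macro_f1_scores.mean():.3f}")
```

The same scorer object can be passed as `scoring=` to `GridSearchCV`.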


In [None]:
import os
import pickle

# These functions are written as methods (they take `self`) and rely on
# `self.grid_search` and `classification_grid_parameters` defined above.
def set_best_estimators(self):
    # fetal health classes being predicted
    classes = ["Normal", "Suspect", "Pathological"]
    # number of parallel jobs during the grid search
    n_jobs = 4
    best_estimators = []
    for model, params in classification_grid_parameters.items():
        if model.__class__.__name__ == "KNeighborsClassifier":
            params['n_neighbors'] = [len(classes)]
        # run the search inside the loop so every model is tuned, not only the last one
        best_estimator, best_params, cv_best_score = self.grid_search(model, params=params, n_jobs=n_jobs)
        best_estimators.append((best_estimator, best_params, cv_best_score))
        print(f"{best_estimator.__class__.__name__} achieved {cv_best_score:.3f} cross validation accuracy score!")
    with open("best_classifiers.pickle", "wb") as f:
        pickle.dump(best_estimators, f)
    return best_estimators

def get_best_estimators(self):
    """
    Loads the estimators pickled in `best_classifiers.pickle`.
    Note that if you want to use different or more estimators,
    you can fine tune the parameters in the `grid_search` function
    and run it again (may take hours).
    """
    if os.path.exists("best_classifiers.pickle"):
        with open("best_classifiers.pickle", "rb") as f:
            best_estimators = pickle.load(f)
        print(best_estimators)
    else:
        best_estimators = self.set_best_estimators()

    return best_estimators
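
The pickle round trip used above can be checked on a small model before relying on it. This sketch is an illustration only: it uses synthetic data, a temporary directory, and a hypothetical file name `best_classifier.pickle`, then confirms that the restored model predicts the same labels as the original.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 3-class data standing in for the training set (assumption)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=42
)
model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_toy, y_toy)

# Write the fitted model to disk and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "best_classifier.pickle")  # hypothetical file name
    with open(model_path, "wb") as fh:
        pickle.dump(model, fh)
    with open(model_path, "rb") as fh:
        restored = pickle.load(fh)

# The restored model should reproduce the original predictions exactly
same_predictions = bool((model.predict(X_toy) == restored.predict(X_toy)).all())
print("Predictions match after reload:", same_predictions)
```

Pickled estimators should only be loaded from trusted sources and with the same scikit-learn version that saved them.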


#### Model Evaluation

#### Model saving

#### Model Deployment