## Introduction

In this notebook we predict whether the health of a fetus is normal, suspect, or pathological from cardiotocography (CTG) data. To do this we implement several machine learning classifiers and compare them using standard evaluation methods.

Our goal is to predict fetal health condition from CTG data as reliably as possible, that is, with the highest accuracy and the fewest misclassifications.

## Project Plan

1. Import the required libraries
2. Import the data into the notebook
3. Preprocess the data to format it for analysis
4. Generate data visualizations for an initial evaluation
5. Process the data for the ML models
6. Develop the ML models using grid search
7. Test and evaluate the ML classifiers using the parameters found by grid search

The classification models used will be:

1. K Nearest Neighbours
2. Support Vector Machine
3. Logistic Regression
4. Random Forest

The evaluation methods used will be:

1. Confusion Matrix
2. Classification Report (Precision, Recall and F1-Score)
3. Accuracy Rate
 

## Import Libraries

These are the libraries required for data processing, data visualization, developing ML models, and developing evaluation metrics.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

## Importing the Data

In [None]:
#Read data from csv file
df = pd.read_csv('../input/fetal-health-classification/fetal_health.csv')

In [None]:
#Preview raw data
df.head(10).T

## Preprocess Data

To make the data easier to work with, we will rename the columns with shorter labels.

In [None]:
#Rename columns for easier use and representation
col_names = ['FHR', 'ACC', 'FM', 'UC', 'LD', 'SD', 'PD', 'ASTV', 'MSTV',
               'ALTV', 'MLTV', 'Hist_Width', 'Hist_Min', 'Hist_Max', 'Hist_Peaks', 'Hist_Zeros', 
                'Hist_Mode', 'Hist_Mean', 'Hist_Median', 'Hist_Variance', 'Hist_Tendency', 'FH']
df.columns = col_names

|Old Column Name                                       | New Column Name    |
|------------------------------------------------------|--------------------|
|baseline value                                        |  FHR               |
|accelerations                                         |  ACC               |
|fetal_movement                                        |  FM                |
|uterine_contractions                                  |  UC                |
|light_decelerations                                   |  LD                |
|severe_decelerations                                  |  SD                |
|prolongued_decelerations                              |  PD                |
|abnormal_short_term_variability                       |  ASTV              |
|mean_value_of_short_term_variability                  |  MSTV              |
|percentage_of_time_with_abnormal_long_term_variability|  ALTV              |
|mean_value_of_long_term_variability                   |  MLTV              |
|histogram_width                                       |  Hist_Width        |
|histogram_min                                         |  Hist_Min          |
|histogram_max                                         |  Hist_Max          |
|histogram_number_of_peaks                             |  Hist_Peaks        |
|histogram_number_of_zeroes                            |  Hist_Zeros        |
|histogram_mode                                        |  Hist_Mode         |
|histogram_mean                                        |  Hist_Mean         |
|histogram_median                                      |  Hist_Median       |
|histogram_variance                                    |  Hist_Variance     |
|histogram_tendency                                    |  Hist_Tendency     |
|fetal_health                                          |  FH                |



In [None]:
#Check for null entries in each column (df.columns.isna() would only inspect the column labels)
null_count = df.isna().sum()
print("Number of null entries:\n", null_count)

## Data Analysis

Before constructing our models, we will conduct an exploratory analysis of the data, drawing inferences mainly from data visualizations. This lets us see whether there are any important correlations we can leverage while developing our classifiers.

In [None]:
#Basic data structure (data types and number of entries)
df.info()

In [None]:
#Summary statistics for the data
df.describe().T

In [None]:
#Plot histograms of all given features
hist_plot = df.hist(figsize = (25,25))
plt.show()

In [None]:
# Plot class counts for fetal health (target variable)
plt.rcParams['figure.figsize'] = (7,7)
# Pass the column by keyword; recent seaborn releases treat a positional argument as `data`
sns.countplot(x='FH', data=df)
plt.show()
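
The count plot gives a visual impression of how unevenly the three classes are represented. A minimal sketch for putting numbers on that imbalance follows. It assumes the target is coded 1.0, 2.0 and 3.0 for normal, suspect and pathological; `class_proportions` is a hypothetical helper, and the small series is made-up data rather than the notebook's `df`.

```python
import pandas as pd


def class_proportions(target: pd.Series) -> pd.DataFrame:
    """Return the count and share of each class, sorted by class label."""
    counts = target.value_counts().sort_index()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# Made-up target values for illustration only
toy_target = pd.Series([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0], name="FH")
summary = class_proportions(toy_target)
print(summary)
```

On the notebook's data the call would be `class_proportions(df['FH'])`. A strongly skewed result is a hint that accuracy alone may flatter a model that favours the majority class.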

In [None]:
#Generate pairplot for data
plt.rcParams['figure.figsize'] = (20,20)
sns.pairplot(data=df, hue='FH',diag_kind='hist')

In [None]:
#Plot the distribution of each feature within each fetal health class
violin_features = ['FHR', 'ACC', 'FM', 'UC', 'LD', 'SD', 'PD', 'ASTV', 'MSTV', 'ALTV', 'MLTV']
for feature in violin_features:
    # x and y are passed by keyword, as recent seaborn releases require
    sns.violinplot(x='FH', y=feature, data=df)
    plt.show()

In [None]:
# Plot heatmap to determine correlation between all features
# plt.subplots returns a (figure, axes) pair, so unpack both
fig, ax = plt.subplots(figsize=(15,15))
sns.heatmap(df.corr(), annot=True, ax=ax)
plt.show()

In [None]:
#Get correlation of all features to fetus health (target variable)
# plt.subplots returns a (figure, axes) pair, so unpack both
fig, ax = plt.subplots(figsize=(25,2))
sns.heatmap(df.corr().sort_values(by=["FH"], ascending=False).head(1), annot=True, ax=ax)
plt.show()
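
The single-row heatmap can be hard to read with this many features. A minimal sketch of a numeric alternative follows: it ranks features by the absolute value of their Pearson correlation with the target. `rank_target_correlations` is a hypothetical helper and the small frame is made-up data, not the CTG dataset.

```python
import pandas as pd


def rank_target_correlations(data: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of each feature with the target, strongest first."""
    corr = data.corr()[target].drop(target)
    # Sort by magnitude so strong negative correlations are not pushed to the bottom
    return corr.sort_values(key=lambda s: s.abs(), ascending=False)


# Made-up data for illustration only
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [5, 4, 3, 1, 2],
    "c": [1, 3, 2, 3, 1],
    "FH": [1, 2, 3, 4, 5],
})
ranked = rank_target_correlations(toy, "FH")
print(ranked)
```

On the notebook's data the call would be `rank_target_correlations(df, 'FH')`.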

## Process Data

Now we will prepare the data for use in all of our ML models. This involves three steps:
1. Split the data into X (feature data) and y (target variable)
2. Scale the feature data (using a standard scaler)
3. Split X and y into their respective training and testing sets

In [None]:
#Split data into X and y
X_raw = df.drop('FH', axis=1)
y = df['FH']

In [None]:
#Scale X data
scale_X = StandardScaler()
# Reuse the feature names from X_raw; mutating col_names would fail when the cell is re-run
X = pd.DataFrame(scale_X.fit_transform(X_raw), columns=X_raw.columns)

In [None]:
#Preview scaled data
X.head()

In [None]:
#Split data into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
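
Fitting the scaler on every row before splitting lets statistics from the test rows influence the training features. A minimal sketch of one alternative follows, assuming scikit-learn's `Pipeline` inside a grid search so that the scaler is refitted on each training fold only. The synthetic data from `make_classification` stands in for the CTG features, and the step names `scale` and `knn` are arbitrary choices for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the CTG features and a three-class target
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=5,
                                     n_classes=3, random_state=1)

# Split first; stratify keeps class proportions similar in both sets
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          stratify=y_demo, random_state=1)

# The scaler is fitted only on the data the pipeline is trained on
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Pipeline parameters are addressed as <step name>__<parameter>
param_grid = {"knn__n_neighbors": [3, 5, 7], "knn__weights": ["uniform", "distance"]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

test_accuracy = search.score(X_te, y_te)
print(search.best_params_, test_accuracy)
```

The trade-off is a slightly longer parameter grid syntax in exchange for cross-validation scores that are not informed by held-out rows.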

## Develop and Train ML Models

### 1. K-Nearest Neighbours (KNN) Classifier

In [None]:
# define model and parameters for gridsearch
knn = KNeighborsClassifier()
k_list = np.arange(1,30,2)
weights = ['uniform', 'distance']
# 'minkowski' with the default p=2 is identical to 'euclidean', so it is omitted
metric = ['euclidean', 'manhattan']

#define grid search
grid = dict(n_neighbors=k_list,weights=weights,metric=metric)
cv = RepeatedStratifiedKFold(n_splits=20, n_repeats=5, random_state=1)
grid_search = GridSearchCV(estimator=knn, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)

In [None]:
grid_result = grid_search.fit(X_train, y_train)

In [None]:
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

### 2. Support Vector Machines (SVM) Classifier

In [None]:
# define model and parameters for gridsearch
svm_clf = SVC()
kernel = ['linear','poly', 'rbf', 'sigmoid']
C = [100, 50, 10, 1.0, 0.1, 0.01, 0.001]
gamma = ['scale']

# define grid search
grid = dict(kernel=kernel,C=C,gamma=gamma)
cv = RepeatedStratifiedKFold(n_splits=20, n_repeats=5, random_state=1)
grid_search = GridSearchCV(estimator=svm_clf, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)

In [None]:
grid_result = grid_search.fit(X_train, y_train)

In [None]:
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

### 3. Logistic Regression Classifier

In [None]:
# define model and parameters for gridsearch
log_reg = LogisticRegression()
solvers = ['newton-cg','lbfgs','liblinear','sag','saga']
# Not every solver supports every penalty; unsupported pairs fail to fit
# and are scored 0 through error_score=0 below
penalty = ['l1','l2','elasticnet','none']
C = [100, 50, 10, 1.0, 0.1, 0.01, 0.001]

# define grid search
grid = dict(solver=solvers,penalty=penalty,C=C)
cv = RepeatedStratifiedKFold(n_splits=20, n_repeats=5, random_state=1)
grid_search = GridSearchCV(estimator=log_reg, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)

In [None]:
grid_result = grid_search.fit(X_train, y_train)

In [None]:
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
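
The full cross product of solvers and penalties includes pairs that scikit-learn does not support, so part of the search is spent on fits that fail. A minimal sketch of a grid restricted to supported pairs follows, using a list of dictionaries. The pairing reflects the solver notes in the scikit-learn documentation; the shortened `C_values` list, the omission of `liblinear` and `elasticnet`, and the synthetic data are simplifications for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

C_values = [10, 1.0, 0.1]

# One dictionary per penalty, listing only solvers that support it
valid_grid = [
    {"penalty": ["l2"], "solver": ["newton-cg", "lbfgs", "sag", "saga"], "C": C_values},
    {"penalty": ["l1"], "solver": ["saga"], "C": C_values},
]

# Synthetic stand-in for the scaled CTG features
X_demo, y_demo = make_classification(n_samples=150, n_features=6, n_informative=4,
                                     n_classes=3, random_state=1)

# error_score='raise' surfaces any invalid combination instead of scoring it as 0
lr_search = GridSearchCV(LogisticRegression(max_iter=5000), valid_grid, cv=3,
                         scoring="accuracy", error_score="raise")
lr_search.fit(X_demo, y_demo)

n_candidates = len(lr_search.cv_results_["params"])
print(n_candidates, lr_search.best_params_)
```

GridSearchCV treats each dictionary as its own grid, so only the listed combinations are evaluated.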

### 4. Random Forest Classifier

In [None]:
# define model and parameters for gridsearch
# A fixed random_state makes the forest reproducible between runs
rf_clf = RandomForestClassifier(random_state=1)
n_estimators = [100,150,200]
max_features = ['sqrt', 'log2']
bootstrap = [True]
max_depth = [50,60,70,80]

# define grid search
grid = dict(n_estimators=n_estimators,max_features=max_features,bootstrap=bootstrap,max_depth=max_depth)
cv = RepeatedStratifiedKFold(n_splits=20, n_repeats=5, random_state=1)
grid_search = GridSearchCV(estimator=rf_clf, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)

In [None]:
grid_result = grid_search.fit(X_train, y_train)

In [None]:
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

## Test and Evaluate Models

In [None]:
# Fit knn classifier with gridsearch results
knn = KNeighborsClassifier(metric='manhattan',n_neighbors=7,weights='distance')
knn.fit(X_train, y_train)

# Run prediction
y_pred_knn = knn.predict(X_test)

In [None]:
# Fit svm classifier with gridsearch results
svm_clf = SVC(C=50,gamma='scale',kernel='rbf')
svm_clf.fit(X_train, y_train)

# Run prediction
y_pred_svm = svm_clf.predict(X_test)

In [None]:
# Fit logistic regression classifier with gridsearch results
log_reg = LogisticRegression(C=0.1,penalty='l2',solver='newton-cg')
log_reg.fit(X_train, y_train)

# Run prediction
y_pred_lgr = log_reg.predict(X_test)

In [None]:
# Fit random forest classifier with gridsearch results
# A fixed random_state makes the fitted forest reproducible between runs
rf_clf = RandomForestClassifier(bootstrap=True, max_depth=80, max_features='sqrt', n_estimators=150, random_state=1)
rf_clf.fit(X_train, y_train)

# Run prediction
y_pred_rfc = rf_clf.predict(X_test)

In [None]:
#Compile final predictions from all models
pred_model_names = ["KNN Model","SVM Model","Logistic Regression","Random Forest"]
y_pred_list = [y_pred_knn,y_pred_svm,y_pred_lgr,y_pred_rfc]

In [None]:
#Evaluate predictions by each classifier
print("="*70)
# Pair each model name with its predictions instead of tracking an index by hand
for model_name, y_pred in zip(pred_model_names, y_pred_list):
    print(model_name)
    print("-"*65)
    print("Confusion Matrix \n", confusion_matrix(y_test,y_pred))
    print("-"*65)
    print("Classification Report \n", classification_report(y_test,y_pred))
    print("-"*65)
    print('Accuracy Score:',accuracy_score(y_test,y_pred))
    print("="*70)
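
The printed reports are detailed but awkward to compare side by side. A minimal sketch of a compact comparison table follows. It adds balanced accuracy and macro F1 next to plain accuracy, since both weight the three classes equally; `summarize_predictions` is a hypothetical helper and the labels below are made up.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score


def summarize_predictions(y_true, predictions):
    """Build one row of summary metrics per model from a name -> predictions dict."""
    rows = {
        name: {
            "accuracy": accuracy_score(y_true, y_pred),
            "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
            "macro_f1": f1_score(y_true, y_pred, average="macro"),
        }
        for name, y_pred in predictions.items()
    }
    return pd.DataFrame.from_dict(rows, orient="index")


# Made-up labels for illustration only
toy_true = [1, 1, 1, 1, 2, 2, 3, 3]
toy_predictions = {
    "model_a": [1, 1, 1, 1, 1, 2, 1, 3],  # misses one suspect and one pathological case
    "perfect": list(toy_true),
}
summary = summarize_predictions(toy_true, toy_predictions)
print(summary)
```

In the notebook the call would be `summarize_predictions(y_test, dict(zip(pred_model_names, y_pred_list)))`.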