# Overview of Problem 

## Aim
Classify fetal health into three classes in order to help prevent child and maternal mortality:

- Normal
- Suspect
- Pathological

## Dataset
- 2126 fetal cardiotocograms (CTG) 

### Features

#### Response Variable
- fetal_health:
    - 1 = normal
    - 2 = suspect
    - 3 = pathological

#### Predictor Variables
- baseline value: Baseline fetal heart rate (FHR)
- accelerations: Number of accelerations per second
- fetal_movement: Number of fetal movements per second
- uterine_contractions: Number of uterine contractions per second
- light_decelerations: Number of light decelerations per second
- severe_decelerations: Number of severe decelerations per second
- prolongued_decelerations: Number of prolonged decelerations per second
- abnormal_short_term_variability: Percentage of time with abnormal short-term variability
- mean_value_of_short_term_variability: Mean value of short-term variability
- percentage_of_time_with_abnormal_long_term_variability: Percentage of time with abnormal long-term variability
- Mean value of long-term variability
- histogram_width: Width of the histogram made using all values from a record
- Histogram minimum value
- histogram_max: Histogram maximum value
- histogram_number_of_peaks: Number of peaks in the exam histogram
- histogram_number_of_zeroes: Number of zeroes in the exam histogram
- histogram_mode
- histogram_mean
- histogram_median
- histogram_variance
- Histogram trend

# General Thoughts Before Starting

## Common Algorithms For Multi-Class Classification 
- k-Nearest Neighbors.
- Naive Bayes.
- Decision Trees.
- Ensemble Models: Random Forest.
- Boosting Models: AdaBoost and XGBoost

### Measure of Focus
Common measures used to evaluate classification problems include AUC, F1 score, precision and recall.

When analysing the data, it becomes apparent that there is considerable class imbalance, so accuracy is not recommended as the primary metric.

In this problem, the cost of failing to identify risks in childbirth (i.e. case 3 of the response variable, Pathological) is high, so recall for case 3 should be the metric of focus. It answers the question: of all actual pathological cases, what proportion was identified correctly?
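
To make the metric concrete, here is a minimal sketch of per-class recall using scikit-learn's `recall_score`. The label arrays are made-up illustrative values, not results from this dataset, and the `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.metrics import recall_score

# Hypothetical labels for illustration only (1 = normal, 2 = suspect, 3 = pathological)
y_true_demo = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3, 1])
y_pred_demo = np.array([1, 1, 2, 2, 1, 3, 3, 1, 3, 1])

# Recall for case 3 = correctly predicted case-3 observations / all actual case-3 observations
recall_class3 = recall_score(y_true_demo, y_pred_demo, labels=[3], average=None)[0]
print(f"Recall for case 3: {recall_class3:.2f}")
```

In this toy example, three of the four actual case-3 observations are predicted as case 3, so the recall is 0.75 regardless of how well the other classes are predicted.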

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Importing Packages

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.tree import plot_tree
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import MinMaxScaler

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier


from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, precision_score, recall_score

# Initial Loading and Data Analysis

In [None]:
df = pd.read_csv('/kaggle/input/fetal-health-classification/fetal_health.csv')
df.head()

In [None]:
print(f"Shape of Dataset: {df.shape}")

22 columns (21 predictor variables and 1 response variable) and 2126 rows

In [None]:
df.info()

No missing values

In [None]:
df.describe().T

In [None]:
df.head()

In [None]:
df.nunique()

In [None]:
print("Number of unique values by column \n")

for i in df:
    print("{}:".format(i), df[i].nunique(), "unique values \n", df[i].unique(), "\n")

## Setting Target Variable (Initial Loading and Data Analysis)

In [None]:
# axis=0 drops labels from the index; axis=1 drops labels from the columns
X = df.drop(['fetal_health'], axis=1)
y = df.fetal_health

## EDA

### Analysing Target Variable

In [None]:
vis_fetal_health = y.value_counts().plot(figsize=(20, 5), kind="bar", color = ["green", "orange", "red"])
plt.title("Fetal health count")
plt.xlabel("Fetal health")
plt.ylabel("Cases")


#Counting labels
print("Breakdown of unique values:\n",y.value_counts())

In [None]:
plt.title("Fetal state")

plt.pie(y.value_counts(),labels=["Normal", "Suspect", "Pathological"], colors = ["green", "orange", "red"],autopct="%1.1f%%",radius=1.2)
plt.xlabel("Fetal health")
plt.ylabel("Cases")
plt.show()

#### Discussion on target variable
This is an imbalanced dataset (78% of observations are 'Normal'), so we have to be careful with our evaluation metric: accuracy (total correct / total) is misleading here.
- For example, if we classified every observation as 1 (i.e. Normal), our accuracy would be 78%.

For this reason, and because identifying case 3 when it occurs (i.e. high recall) is of significant importance, we will use recall for case 3 as our primary evaluation metric.

Other important metrics include the F1 score (which combines precision and recall) and recall for case 2 ('Suspect'), because misclassified suspect cases (particularly those misclassified as 'Normal') can lead to 'Pathological' issues being missed.
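
The accuracy trap can be reproduced in a few lines. This is a sketch under stated assumptions: the 78% share of 'Normal' comes from the text, while the 14% / 8% split between the other two classes is assumed purely for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical mix per 100 observations: 78 normal (from the text),
# 14 suspect and 8 pathological (assumed split, for illustration only)
y_true_mix = np.array([1] * 78 + [2] * 14 + [3] * 8)

# A "model" that always predicts the majority class (1 = normal)
y_pred_majority = np.ones_like(y_true_mix)

baseline_accuracy = accuracy_score(y_true_mix, y_pred_majority)
baseline_recall = recall_score(y_true_mix, y_pred_majority, labels=[1, 2, 3], average=None)

print(f"Accuracy: {baseline_accuracy:.2f}")
print("Recall per class (1, 2, 3):", baseline_recall)
```

The baseline reaches 78% accuracy while its recall for cases 2 and 3 is zero, which is why accuracy alone cannot be trusted here.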

### Correlation

In [None]:
# Correlation between different variables
corr = df.corr()
# Set up the matplotlib plot configuration
f, ax = plt.subplots(figsize=(18, 15))
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
# Configure a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)
# Draw the heatmap
sns.heatmap(corr, annot=True, mask = mask, cmap=cmap)

#### Discussion on Correlation
- Histogram mode, mean and median are highly correlated, as expected since all three measure the centre of the same histogram. They are also correlated with the baseline value. This correlation is analysed below.
- Target variable: 
    - positive correlation: prolongued_decelerations, abnormal_short_term_variability, percentage_of_time_with_abnormal_long_term_variability
    - negative correlation: accelerations
    - the target variable correlations are analysed below this section

## Predictor Variable Correlations
Histogram mean, mode, median and baseline value

In [None]:
sns.regplot(x=df['histogram_mean'], y=df['histogram_median'], line_kws={"color":'black'})

In [None]:
sns.regplot(x=df['histogram_mean'], y=df['histogram_mode'], line_kws={"color":'black'})

In [None]:
sns.regplot(x=df['baseline value'], y=df['histogram_mode'], line_kws={"color":'black'})

In [None]:
sns.lmplot(x='histogram_mean', y='histogram_mode', hue='fetal_health', data=df)

In [None]:
sns.lmplot(x='baseline value', y='histogram_mode', hue='fetal_health', data=df)

In [None]:
sns.lmplot(x='baseline value', y='histogram_mean', hue='fetal_health', data=df)

The histogram mean, median and mode are highly correlated, so we keep only one of them (mode) and drop the other two in the feature engineering step.

Although mode and baseline value are also correlated (correlation of 0.71), we keep both variables because they diverge when the mode is small (as shown above).
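
Highly correlated pairs can also be listed programmatically rather than read off the heatmap. The sketch below uses synthetic data and a 0.9 cut-off; the column names and the threshold are assumptions for illustration, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(140, 10, size=500)

# Synthetic stand-ins: two noisy copies of the same signal plus an unrelated column
demo = pd.DataFrame({
    "hist_mean_demo": base + rng.normal(0, 1, size=500),
    "hist_median_demo": base + rng.normal(0, 1, size=500),
    "unrelated_demo": rng.normal(0, 1, size=500),
})

# Keep only the strict upper triangle so each pair appears once and the diagonal is ignored
corr_abs = demo.corr().abs()
upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))

# Flatten to (feature_a, feature_b) -> correlation and keep pairs above the assumed threshold
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > 0.9]
print(high_pairs)
```

One feature from each listed pair is then a candidate for removal, which is the same decision made by eye above.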

## Target Variable Correlation

In [None]:
Pos_Num_feature = df.corr()["fetal_health"].sort_values(ascending=False).to_frame()
Neg_Num_feature = df.corr()["fetal_health"].sort_values(ascending=True).to_frame()

In [None]:
Pos_Num_feature[1:6], Neg_Num_feature[0:5]

In [None]:
#Correlation b/w all features and Target Variable
Pos_Num_feature[1:], Neg_Num_feature

#### Discussion on Target Variable Correlation
Based on above correlation, let's analyse these further:
- prolongued_decelerations: 0.484859 correlation
- abnormal_short_term_variability: 0.471191 correlation
- percentage_of_time_with_abnormal_long_term_variability: 0.426146 correlation
- accelerations: -0.364066 correlation

### Graphing Prominent Variables

### Discrete Variables that are highly correlated to fetal health

#### Prolongued Decelerations

In [None]:
counts_df = df.groupby(["prolongued_decelerations", "fetal_health"])["fetal_health"].count().unstack()
# Transpose so fetal_health categories add up to 1, divide by the total number (transposed), then transpose one more time for plotting
fetal_health_percents_df = counts_df.T.div(counts_df.T.sum()).T

In [None]:
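fig, ax = plt.subplots()

# Stacked bars: share of each fetal_health class per prolongued_decelerations value
fetal_health_percents_df.plot(kind="bar", stacked=True, color=["green", "orange", "red"], ax=ax)

# Raw counts per class for the same variable
sns.catplot(x="prolongued_decelerations", hue="fetal_health", data=df, kind="count",
            palette=sns.color_palette(["green", "orange", "red"]))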


The majority of prolongued_decelerations values are 0.0; when the value is higher, pathological fetal health occurs frequently.

### Continuous Variables that are highly correlated to fetal health

#### abnormal_short_term_variability

In [None]:
sns.stripplot(x="fetal_health", y="abnormal_short_term_variability", data=df)

#### percentage_of_time_with_abnormal_long_term_variability

In [None]:
sns.stripplot(x="fetal_health", y="percentage_of_time_with_abnormal_long_term_variability", data=df)

Although abnormal long-term variability is somewhat spread out for case 3, all observations above 80 are case 3.

#### Accelerations

In [None]:
sns.stripplot(x="fetal_health", y="accelerations", data=df)

Higher acceleration values are associated with normal fetal health.

### Distribution of All Variables

In [None]:
df_hist_plot = df.hist(figsize = (20,15), color = "#000054")

- A distribution can be right (positively) skewed, left (negatively) skewed, or symmetric, as with the normal distribution.

    - A left-skewed distribution (negatively skewed) has a long left tail.
    - A right-skewed distribution (positively skewed) has a long right tail.
    - A normal distribution has zero skewness and looks like a bell curve.
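
Skewness can be quantified as well as read off the histograms. The following sketch applies pandas' `skew()` to synthetic samples (not the CTG data); the column names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic samples: an exponential sample has a long right tail, a normal sample is symmetric
samples = pd.DataFrame({
    "right_skewed": rng.exponential(scale=1.0, size=10_000),
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=10_000),
})
# Mirroring the right-skewed sample produces a long left tail
samples["left_skewed"] = -samples["right_skewed"]

# Positive values indicate a long right tail, negative a long left tail, near zero is symmetric
skewness = samples.skew()
print(skewness.round(2))
```

On the real data, `df.skew()` would give the same kind of per-column summary for the histograms plotted above.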

#### Histogram Variance Outliers
Outliers are present in Histogram Variance, let's have a closer look at these

In [None]:
df[df['histogram_variance'] > 180]

Removing outliers is risky as they may contain valuable information about the data. Here we can see that extreme values of histogram variance largely coincide with fetal health being pathological, so we will keep these outliers.

## Looking at the range of values

In [None]:
df_box_plot = df.boxplot(vert=False, color = "#000054")

The above plot shows the range of our features, which sit on very different scales. To fit a KNN model, which relies on distances between observations, we must scale the features to the same range. This is not required for decision trees and ensemble methods built on them, as tree splits depend only on the ordering of values within a feature and not on its scale.
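
A small sketch shows why scale matters for a distance-based model. The four points below are hypothetical: the first column is on a heart-rate-like scale and the second on a per-second-rate scale. The helper `nearest_to_first` is introduced here for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical points: [heart-rate-like value, per-second-rate-like value]
points = np.array([
    [120.0, 0.000],   # query point
    [121.0, 0.019],   # close in column 0, far in column 1
    [125.0, 0.000],   # further in column 0, identical in column 1
    [160.0, 0.010],
])

def nearest_to_first(data):
    """Return the row index (excluding row 0) closest to row 0 in Euclidean distance."""
    distances = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(distances)) + 1

nearest_raw = nearest_to_first(points)
nearest_scaled = nearest_to_first(MinMaxScaler().fit_transform(points))
print("Nearest neighbour on raw data:", nearest_raw)
print("Nearest neighbour after scaling:", nearest_scaled)
```

On the raw data the large-valued column dominates the distance, so the nearest neighbour of the query point changes once both columns are brought to the same range.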

# Feature Engineering

In [None]:
# Keep histogram_mode and drop the other two highly correlated variables (reasons discussed above)
X = X.drop(['histogram_median', 'histogram_mean'], axis=1)

In [None]:
X.head()

## Splitting Train/Test


In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X,y, test_size=0.2, random_state=1, stratify = y)

In [None]:
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape

## Dealing with Missing Data

No missing values, so neither imputation nor dropping features is required.

### Normalize Data
The reasons are discussed above, under the box plots.

Note: Decision trees and tree-based ensemble methods do not require feature scaling, as their splits depend only on the ordering of values within a feature and not on its scale.
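
When scaling, the scaler's minimum and maximum are usually learned from the training data only and then reused on the test data, so that no information leaks from the test set. A minimal sketch with made-up one-column data (the `_demo` names are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data for illustration
train_demo = np.array([[100.0], [120.0], [140.0]])
test_demo = np.array([[110.0], [150.0]])

# fit() learns min=100 and max=140 from the training data only
scaler_demo = MinMaxScaler().fit(train_demo)

train_scaled = scaler_demo.transform(train_demo)
test_scaled = scaler_demo.transform(test_demo)

print("Train:", train_scaled.ravel())
print("Test:", test_scaled.ravel())
```

Scaled training values fall in [0, 1], while a test value beyond the training maximum maps above 1. That is expected and is not a reason to refit the scaler on the test set.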

In [None]:
scaler = MinMaxScaler()

In [None]:
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)

# Training

## Selecting Models

### MODEL 1: k-Nearest Neighbors

In [None]:
knn_classification = KNeighborsClassifier(n_neighbors = 3)

# fit the model using fit() on train data
# Used scaled data for KNN
knn_model = knn_classification.fit(scaled_X_train, Y_train)

In [None]:
knn_pred = knn_model.predict(scaled_X_test)


In [None]:
print(classification_report(Y_test,knn_pred))

In [None]:
ax= plt.subplot()
sns.heatmap(confusion_matrix(Y_test, knn_pred), annot=True, fmt="d", ax=ax, cmap="Blues");

ax.set_xlabel("Predicted");
ax.set_ylabel("Actual"); 
ax.set_title("Confusion Matrix"); 
ax.xaxis.set_ticklabels(["Normal", "Suspect", "Pathological"]);
ax.yaxis.set_ticklabels(["Normal", "Suspect", "Pathological"]);

Recall for the case of focus (3) is 77%. The model overpredicts case 1: overall accuracy is high, but precision and recall are low for the classes that are underrepresented in the data (imbalance in the dataset). Although recall for case 2 is not the primary metric of focus, it still matters, especially when suspect cases are misclassified as normal, because the cost of missing ill fetal health is high. Here, 91% of the misclassified suspect cases were predicted as normal.

###  MODEL 2: Decision Tree

In [None]:
decisionTreeClassifier = DecisionTreeClassifier()

In [None]:
decisionTreeClassifier.fit(X_train,Y_train)

In [None]:
decisionTreeClassifier_pred = decisionTreeClassifier.predict(X_test)

In [None]:
print(classification_report(Y_test,decisionTreeClassifier_pred))

In [None]:
ax= plt.subplot()
sns.heatmap(confusion_matrix(Y_test, decisionTreeClassifier_pred), annot=True, fmt="d", ax=ax, cmap="Blues");

ax.set_xlabel("Predicted");
ax.set_ylabel("Actual"); 
ax.set_title("Confusion Matrix"); 
ax.xaxis.set_ticklabels(["Normal", "Suspect", "Pathological"]);
ax.yaxis.set_ticklabels(["Normal", "Suspect", "Pathological"]);

Recall for case 3 is stronger (87%) and recall for case 2 has improved (76%).

As decision trees are prone to overfitting, let's see if random forest can improve this score

###  MODEL 3: Random Forest (Bagging Ensemble Method)

The random forest algorithm is an ensemble method that trains multiple learners (i.e. decision trees) on bootstrap samples of the training data and aggregates their votes to decide each prediction (bagging = bootstrapping + aggregation). The idea is to reduce overfitting.
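
The bagging idea can be sketched by hand with plain decision trees. This is an illustration on synthetic data from `make_classification`, not how `RandomForestClassifier` is implemented internally; the number of trees (25) and all variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data for illustration (labels are 0, 1, 2)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

rng = np.random.default_rng(0)
trees = []
for seed in range(25):
    # Bootstrapping: sample rows with replacement, same size as the original data
    boot_idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # Each tree also considers a random subset of features at every split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    trees.append(tree.fit(X_demo[boot_idx], y_demo[boot_idx]))

# Aggregation: every tree votes and the most common class wins
votes = np.stack([tree.predict(X_demo) for tree in trees])
bagged_pred = np.array([np.bincount(col, minlength=3).argmax() for col in votes.T])

print("Training accuracy of the bagged vote:", (bagged_pred == y_demo).mean())
```

Because each tree sees a different resample of the data, their individual errors partly cancel out in the vote, which is the mechanism behind the reduction in overfitting.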

In [None]:
rfc = RandomForestClassifier()

In [None]:
rfc.fit(X_train,Y_train)

In [None]:
rfc_pred = rfc.predict(X_test)

In [None]:
print(classification_report(Y_test,rfc_pred))

In [None]:
ax= plt.subplot()
sns.heatmap(confusion_matrix(Y_test, rfc_pred), annot=True, fmt="d", ax=ax, cmap="Blues");

ax.set_xlabel("Predicted");
ax.set_ylabel("Actual"); 
ax.set_title("Confusion Matrix"); 
ax.xaxis.set_ticklabels(["Normal", "Suspect", "Pathological"]);
ax.yaxis.set_ticklabels(["Normal", "Suspect", "Pathological"]);

Recall for class 3 has improved, as has the overall F1 score. Let's try a second type of ensemble method, boosting. We will look at AdaBoost (which uses stumps as weak learners) and XGBoost.

###  MODEL 4: AdaBoost (Boosting Ensemble Method 1)

#### Build weak learner

In [None]:
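# Build the weak learner: a decision stump (a tree with max_depth=1)
base_estimator = DecisionTreeClassifier(criterion='entropy', max_depth=1)
# n_estimators is the number of weak learners to use.
# The weak learner is passed positionally because its keyword was renamed
# from `base_estimator` to `estimator` in scikit-learn 1.2.
AdaBoost = AdaBoostClassifier(base_estimator, n_estimators=400, learning_rate=1)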

In [None]:
AdaBoostModel = AdaBoost.fit(X_train, Y_train)

In [None]:
ada_pred = AdaBoostModel.predict(X_test)

In [None]:
print(classification_report(Y_test,ada_pred))

In [None]:
ax= plt.subplot()
sns.heatmap(confusion_matrix(Y_test, ada_pred), annot=True, fmt="d", ax=ax, cmap="Blues");

ax.set_xlabel("Predicted");
ax.set_ylabel("Actual"); 
ax.set_title("Confusion Matrix"); 
ax.xaxis.set_ticklabels(["Normal", "Suspect", "Pathological"]);
ax.yaxis.set_ticklabels(["Normal", "Suspect", "Pathological"]);

Case 3 recall is solid (91%).

That said, recall for case 2 is poor (71%). Although this is not the main metric of focus, it is important: the confusion matrix shows that 100% of the false negatives for 'Suspect' are classified as 'Normal'. A 'Suspect' classification is a call to action for doctors to look further at risks in childbirth, so misclassifying these cases as 'Pathological' would be preferable to misclassifying them as 'Normal'.
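
The share of missed 'Suspect' cases that end up as 'Normal' can be read directly from a confusion matrix row. The matrix below is hypothetical, with made-up counts, and only illustrates the arithmetic.

```python
import numpy as np

# Hypothetical confusion matrix (rows = actual, columns = predicted),
# ordered Normal, Suspect, Pathological. Counts are made up for illustration.
cm_demo = np.array([
    [320, 10, 2],
    [12, 40, 3],
    [1, 2, 30],
])

suspect_row = cm_demo[1]                               # all actual 'Suspect' cases
missed_suspect = suspect_row.sum() - suspect_row[1]    # false negatives for 'Suspect'
share_missed_as_normal = suspect_row[0] / missed_suspect
suspect_recall = suspect_row[1] / suspect_row.sum()

print(f"Recall for 'Suspect': {suspect_recall:.2f}")
print(f"Share of missed 'Suspect' cases predicted as 'Normal': {share_missed_as_normal:.2f}")
```

Two models with the same 'Suspect' recall can differ on this share, so it is worth checking alongside recall.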

Let's try another ensemble boosting method (XGBoost)

###  MODEL 5: XGBoost (Boosting Ensemble Method 2)


In [None]:
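# Recent XGBoost versions expect class labels 0..n_classes-1, so shift labels 1/2/3 to 0/1/2 for fitting
XGB = XGBClassifier()
XGB_Model = XGB.fit(X_train, (Y_train - 1).astype(int))

In [None]:
# Shift predictions back to the original 1/2/3 labels
XGB_pred = XGB_Model.predict(X_test) + 1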

In [None]:
print(classification_report(Y_test,XGB_pred))

In [None]:
ax= plt.subplot()
sns.heatmap(confusion_matrix(Y_test, XGB_pred), annot=True, fmt="d", ax=ax, cmap="Blues");

ax.set_xlabel("Predicted");
ax.set_ylabel("Actual"); 
ax.set_title("Confusion Matrix"); 
ax.xaxis.set_ticklabels(["Normal", "Suspect", "Pathological"]);
ax.yaxis.set_ticklabels(["Normal", "Suspect", "Pathological"]);

Again, recall for case 3 is strong (94%), and recall for case 2 is also higher than for random forest (80%). As discussed above, this matters because it reduces the number of suspect cases misclassified as normal.

# Summary

XGBoost scores the highest for recall on outcome 3 (94%). This is important as identifying Pathological cases is the most important outcome in this problem (in order to prevent child and maternal mortality).

Further, a lot of models were scoring in the 80-90% range for case 3 recall but scoring poorly for case 2 recall (60-70% range), largely misclassifying these instances as 'Normal'. It is important to have strong recall for case 2 as it acts as an alarm/call to action for doctors to look further.

Therefore XGBoost is the best model for this problem.
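
A comparison like this can be collected in one loop instead of being read off separate reports. The sketch below runs on synthetic, imbalanced data from `make_classification`, so its numbers say nothing about the CTG results. The class weights, the `candidates` dictionary and the omission of AdaBoost and XGBoost (to keep the example light) are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced 3-class data (assumed class weights), relabelled 1/2/3
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, n_informative=6, n_classes=3,
    weights=[0.78, 0.14, 0.08], random_state=0,
)
y_syn = y_syn + 1

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=1, stratify=y_syn
)

candidates = {
    "knn": KNeighborsClassifier(n_neighbors=3),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
}

# Record recall for case 2 (suspect) and case 3 (pathological) per model
comparison = {}
for name, model in candidates.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    recall_2, recall_3 = recall_score(y_te, pred, labels=[2, 3], average=None)
    comparison[name] = {"recall_suspect": recall_2, "recall_pathological": recall_3}

for name, scores in comparison.items():
    print(name, {k: round(float(v), 2) for k, v in scores.items()})
```

Keeping both recalls side by side makes the trade-off described above visible at a glance when choosing a model.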

# Example Output

In [None]:
type(XGB_pred)

In [None]:
type(Y_test)

In [None]:
output = pd.DataFrame(XGB_pred, columns = ['Pred'])
actual = pd.DataFrame(Y_test)

In [None]:
output = list(output['Pred'])
actual = list(actual['fetal_health'])

In [None]:
d = {'Pred':output,'Actual':actual}

In [None]:
d

In [None]:
results = pd.DataFrame(d)

In [None]:
results

In [None]:
results['Correct?'] = np.where(results['Pred'] == results['Actual'], 'Correct', 'Incorrect')
results.head()

In [None]:
# Count (rather than sum) the predictions in each group
summary = results.groupby(['Correct?']).count()['Pred']
print("Summary of Classification:\n{}".format(summary))

In [None]:
results.to_csv('fetal_health_results.csv')