## Subash Chandra Biswal (U77884251) ##
# Assignment 1 - Cardiotocography


## Introduction and Overview


Author: J. P. Marques de Sá, J. Bernardes, D. Ayres de Campos.  
Source: UCI  
Please cite: Ayres de Campos et al. (2000) SisPorto 2.0 A Program for Automated Analysis of Cardiotocograms. J Matern Fetal Med 5:311-318, UCI    

2126 fetal cardiotocograms (CTGs) were automatically processed and the respective diagnostic features measured. The CTGs were also classified by three expert obstetricians, and a consensus classification label was assigned to each of them. Classification was with respect to both a morphologic pattern (A, B, C, ...) and a fetal state (N, S, P). The dataset can therefore be used for either 10-class or 3-class experiments.  

Attribute Information:  
LB - FHR baseline (beats per minute)  
AC - # of accelerations per second  
FM - # of fetal movements per second  
UC - # of uterine contractions per second  
DL - # of light decelerations per second  
DS - # of severe decelerations per second  
DP - # of prolonged decelerations per second  
ASTV - percentage of time with abnormal short term variability  
MSTV - mean value of short term variability  
ALTV - percentage of time with abnormal long term variability  
MLTV - mean value of long term variability  
Width - width of FHR histogram  
Min - minimum of FHR histogram  
Max - maximum of FHR histogram  
Nmax - # of histogram peaks  
Nzeros - # of histogram zeros  
Mode - histogram mode  
Mean - histogram mean  
Median - histogram median  
Variance - histogram variance  
Tendency - histogram tendency  
CLASS - FHR pattern class code (1 to 10)  
NSP - fetal state class code (N=normal(1); S=suspect(2); P=pathologic(3))  

## Install and import necessary packages

In [1]:
# import packages
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier 
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier

# set random seed to ensure that results are repeatable
np.random.seed(1)

## Load data 

In [2]:
df = pd.read_csv('./cardiotocography_csv.csv')
df.head(5)

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V28,V29,V30,V31,V32,V33,V34,V35,Class,NSP
0,23,240,357,120,120,0,0,0,73,0.5,...,0,0,0,0,0,0,1,0,9,2
1,45,5,632,132,132,4,0,4,17,2.1,...,0,0,0,1,0,0,0,0,6,1
2,45,177,779,133,133,2,0,5,16,2.1,...,0,0,0,1,0,0,0,0,6,1
3,45,411,1192,134,134,2,0,6,16,2.4,...,0,0,0,1,0,0,0,0,6,1
4,45,533,1147,132,132,4,0,5,16,2.4,...,0,0,0,0,0,0,0,0,2,1


## Explore the dataset

In [3]:
# Explore the dataset: first five rows, column names, summary statistics, and dtypes
print(df.head())
print(df.columns)
print(df.describe())
df.info()  # info() prints its report directly and returns None, so no print() is needed

   V1   V2    V3   V4   V5  V6  V7  V8  V9  V10  ...  V28  V29  V30  V31  V32  \
0  23  240   357  120  120   0   0   0  73  0.5  ...    0    0    0    0    0   
1  45    5   632  132  132   4   0   4  17  2.1  ...    0    0    0    1    0   
2  45  177   779  133  133   2   0   5  16  2.1  ...    0    0    0    1    0   
3  45  411  1192  134  134   2   0   6  16  2.4  ...    0    0    0    1    0   
4  45  533  1147  132  132   4   0   5  16  2.4  ...    0    0    0    0    0   

   V33  V34  V35  Class  NSP  
0    0    1    0      9    2  
1    0    0    0      6    1  
2    0    0    0      6    1  
3    0    0    0      6    1  
4    0    0    0      2    1  

[5 rows x 37 columns]
Index(['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',
       'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21',
       'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V29', 'V30', 'V31',
       'V32', 'V33', 'V34', 'V35', 'Class', 'NSP'],
      dtype='object')
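
Since the dataset supports a 3-class experiment on NSP, a quick look at the class balance is a natural part of exploration. The following is a minimal sketch on hypothetical toy labels (the `nsp` series below is made up for illustration and is not the real column), showing absolute and relative class frequencies:

```python
import pandas as pd

# Hypothetical toy labels standing in for the NSP column (1=N, 2=S, 3=P).
nsp = pd.Series([1, 1, 1, 2, 1, 3, 1, 2], name="NSP")

# Absolute and relative class frequencies, ordered by class code.
counts = nsp.value_counts().sort_index()
shares = nsp.value_counts(normalize=True).sort_index()

print(counts)
print(shares.round(3))
```

On the real data the same two calls could be applied to `df['NSP']`.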
 

## Clean/transform data (where necessary)

In [4]:
# based on findings from data exploration, clean up the column names by stripping leading/trailing whitespace characters
df.columns = [s.strip() for s in df.columns] 
df.columns

Index(['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',
       'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21',
       'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V29', 'V30', 'V31',
       'V32', 'V33', 'V34', 'V35', 'Class', 'NSP'],
      dtype='object')
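
To illustrate what the whitespace clean-up does, here is a small sketch on a hypothetical frame whose column names carry stray spaces (the names and values are invented for the example). It uses the vectorized `Index.str.strip()` accessor, which should be equivalent to the list comprehension above:

```python
import pandas as pd

# Hypothetical frame with leading/trailing spaces in the column names.
toy = pd.DataFrame({" V1": [1], "V2 ": [2], "NSP": [3]})

# Strip whitespace from every column name in one vectorized call.
toy.columns = toy.columns.str.strip()

print(list(toy.columns))
```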

Drop the columns we are not using as predictors (see previous notebooks; we are given a subset of input variables to consider). The Class variable is also a target variable, with 10 classes, but we are using NSP as our target variable, so we can drop Class.

In [5]:
df = df.drop(columns=['Class'])

The V25 variable is categorical with 3 levels. We encode it as dummy variables, dropping the first level so that it serves as the reference category.

In [6]:
# translate V25 categories into dummy variables (drop_first=True drops the first level)
df = df.join(pd.get_dummies(df['V25'], prefix='V25', drop_first=True))
df.drop('V25', axis=1, inplace=True)

df.head(3)

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V29,V30,V31,V32,V33,V34,V35,NSP,V25_0,V25_1
0,23,240,357,120,120,0,0,0,73,0.5,...,0,0,0,0,0,1,0,2,0,1
1,45,5,632,132,132,4,0,4,17,2.1,...,0,0,1,0,0,0,0,1,1,0
2,45,177,779,133,133,2,0,5,16,2.1,...,0,0,1,0,0,0,0,1,1,0
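
The effect of `drop_first=True` can be seen on a toy column. This is only a sketch: the levels -1, 0 and 1 are an assumption chosen for illustration, and `toy` is a hypothetical frame, not the real data.

```python
import pandas as pd

# Hypothetical toy column with three levels; the levels -1, 0, 1 are an assumption.
toy = pd.DataFrame({"V25": [-1, 0, 1, 0, 1]})

# drop_first=True removes the dummy for the smallest level (-1 here),
# which then acts as the implicit reference category.
dummies = pd.get_dummies(toy["V25"], prefix="V25", drop_first=True)

# Join the dummies and drop the original column without mutating in place.
encoded = toy.join(dummies).drop(columns="V25")

print(list(encoded.columns))
```

A row whose dummies are all zero belongs to the dropped reference level, so no information is lost by removing that column.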


In [7]:
df.isna().sum()

V1       0
V2       0
V3       0
V4       0
V5       0
V6       0
V7       0
V8       0
V9       0
V10      0
V11      0
V12      0
V13      0
V14      0
V15      0
V16      0
V17      0
V18      0
V19      0
V20      0
V21      0
V22      0
V23      0
V24      0
V26      0
V27      0
V28      0
V29      0
V30      0
V31      0
V32      0
V33      0
V34      0
V35      0
NSP      0
V25_0    0
V25_1    0
dtype: int64

The NSP target variable has 3 classes: Normal (1), Suspect (2), and Pathologic (3). Of these, suspect and pathologic are the states of concern and need immediate intervention to control the symptoms. We therefore combine them into a single class so that the model targets these conditions, assigning 1 to suspect/pathologic and 0 to normal.

In [7]:
df['NSP'] = np.where(df['NSP']==1, 0, 1)
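
As a sanity check on the recoding, the sketch below applies the same `np.where` rule to hypothetical toy labels and cross-tabulates the original codes against the binary ones. The `toy` frame and the `NSP_binary` column name are illustrative assumptions; the notebook itself overwrites `NSP` in place.

```python
import numpy as np
import pandas as pd

# Hypothetical toy labels using the original coding: 1=N, 2=S, 3=P.
toy = pd.DataFrame({"NSP": [1, 2, 3, 1, 2]})

# Normal (1) becomes 0; suspect (2) and pathologic (3) both become 1.
toy["NSP_binary"] = np.where(toy["NSP"] == 1, 0, 1)

# Cross-tabulate to confirm each original code maps to exactly one binary label.
check = pd.crosstab(toy["NSP"], toy["NSP_binary"])
print(check)
```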

In [8]:
df.head(5)

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V29,V30,V31,V32,V33,V34,V35,NSP,V25_0,V25_1
0,23,240,357,120,120,0,0,0,73,0.5,...,0,0,0,0,0,1,0,1,0,1
1,45,5,632,132,132,4,0,4,17,2.1,...,0,0,1,0,0,0,0,0,1,0
2,45,177,779,133,133,2,0,5,16,2.1,...,0,0,1,0,0,0,0,0,1,0
3,45,411,1192,134,134,2,0,6,16,2.4,...,0,0,1,0,0,0,0,0,0,1
4,45,533,1147,132,132,4,0,5,16,2.4,...,0,0,0,0,0,0,0,0,0,1


## Split data into training and test sets

In [9]:
# construct datasets for analysis
target = 'NSP'
predictors = list(df.columns)
predictors.remove(target)
X = df[predictors]
y = df[target]

In [10]:
# create the training set and the test set 
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, random_state=1)
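
The split above is a plain random split. Because the binary target is likely imbalanced, one possible variation (not what this notebook does) is a stratified split, which keeps the class proportions the same in both partitions. A minimal sketch on hypothetical toy data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 negatives, 20 positives.
X_toy = pd.DataFrame({"feature": np.arange(100)})
y_toy = pd.Series([0] * 80 + [1] * 20, name="NSP")

# stratify=y_toy keeps the class proportions equal in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=1, stratify=y_toy
)

# Share of positives in each partition.
print(y_tr.mean(), y_te.mean())
```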

## Save the Data ##

In [11]:
X_train.to_csv('./X_train.csv', index=False)
y_train.to_csv('./y_train.csv', index=False)
X_test.to_csv('./X_test.csv', index=False)
y_test.to_csv('./y_test.csv', index=False)
X.to_csv('./X.csv', index=False)
y.to_csv('./y.csv', index=False)
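
When the saved files are read back in a later notebook, note that `pd.read_csv` returns a DataFrame even for a single-column target file. The sketch below shows one way to recover a Series; it writes a hypothetical toy series to a temporary directory rather than touching the files saved above.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy target series standing in for y_train.
y_toy = pd.Series([0, 1, 1, 0], name="NSP")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "y_train.csv")
    y_toy.to_csv(path, index=False)

    # read_csv yields a one-column DataFrame; select the column to get a Series.
    y_loaded = pd.read_csv(path)["NSP"]

print(type(y_loaded).__name__, y_loaded.tolist())
```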