**Fetal health classification with 92% Accuracy**

 **TASK**: **Fetal health classification: an approach to reduce fetal mortality**

Fetal mortality is a public health issue that puts the life of the mother or the baby at risk. This notebook uses several machine learning classification techniques to predict the risk level of fetal health. It classifies each record into one of three classes:
1. Normal
2. Suspect
3. Pathological

# Importing necessary libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas_profiling as pp
import plotly.express as px
%matplotlib inline

In [None]:
dataset = pd.read_csv('../input/fetal-health-classification/fetal_health.csv')
dataset.head()

# Let us now look at an overview of the dataset

In [None]:
dataset.columns

In [None]:
sns.heatmap(dataset.isnull())

This shows that there are no missing values in the dataset.

In [None]:
dataset.isnull().sum()

In [None]:
dataset.info()

In [None]:
dataset.describe()

In [None]:
pp.ProfileReport(dataset)

In [None]:
dataset['fetal_health'].value_counts()

The target feature appears to be imbalanced. We need to fix this, but before doing so, let us look at the feature importance.
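
To put a number on the imbalance, it can help to look at class proportions rather than raw counts. The following is a minimal sketch on made-up labels (the `toy_labels` list and the `class_proportions` helper are illustrative assumptions, not part of the notebook):

```python
from collections import Counter

def class_proportions(labels):
    """Return {class: fraction of rows} for an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

# Made-up labels for illustration only (1 = Normal, 2 = Suspect, 3 = Pathological).
toy_labels = [1.0] * 8 + [2.0] * 1 + [3.0] * 1
proportions = class_proportions(toy_labels)
print(proportions)
```

On the real data, `dataset['fetal_health'].value_counts(normalize=True)` should give the same kind of summary directly.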

In [None]:
X1 = dataset.iloc[:,:-1]
y1 = dataset.iloc[:,-1]

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
fit_best_features = SelectKBest(score_func=f_classif,k=10)
best_features=fit_best_features.fit(X1,y1)

In [None]:
dataset_scores = pd.DataFrame(best_features.scores_)
dataset_cols = pd.DataFrame(X1.columns)

In [None]:
featurescores = pd.concat([dataset_cols,dataset_scores],axis=1)
featurescores.columns=['column','scores']

In [None]:
featurescores

In [None]:
print(featurescores.nlargest(13,'scores'))

These are the 13 features with the highest scores.
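
The scores above come from `f_classif`, which computes a one-way ANOVA F-value for each feature: the variance between class means divided by the variance within classes. As a rough sketch of that idea for a single feature (the `anova_f` helper and the toy numbers are assumptions made up for illustration):

```python
def anova_f(groups):
    """One-way ANOVA F-statistic for one feature, given its values per class."""
    all_values = [v for g in groups for v in g]
    n_total, n_groups = len(all_values), len(groups)
    grand_mean = sum(all_values) / n_total
    # Between-class sum of squares: how far each class mean is from the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-class sum of squares: spread of the values around their own class mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    return (ss_between / (n_groups - 1)) / (ss_within / (n_total - n_groups))

# Toy values of one feature, grouped by class (made up for illustration).
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value = anova_f(toy_groups)
print(f_value)
```

A feature whose class means are far apart relative to the spread inside each class gets a high score, which is why it is ranked as more useful for separating the classes.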

Now let us see the correlation among the features.

In [None]:
corr_data=dataset.corr()
fig, axes = plt.subplots(figsize=(15,10))
sns.heatmap(corr_data,annot=True)

# EDA

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='fetal_health',data=dataset)

In [None]:
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(dataset['baseline value'],bins=40,kde=True)

In [None]:
sns.barplot(x='fetal_health',y='baseline value',data=dataset)

In [None]:
sns.boxplot(x='fetal_health',y='accelerations',data=dataset)

In [None]:
sns.violinplot(x='fetal_health',y='fetal_movement',data=dataset)

In [None]:
px.area(dataset, x="accelerations", y="uterine_contractions",color='fetal_health')

In [None]:
px.bar(dataset, x="fetal_health", y='uterine_contractions', color="fetal_health")

In [None]:
px.pie(dataset, values='prolongued_decelerations', names='fetal_health', title='prolongued_decelerations')

In [None]:
px.sunburst(dataset, path=['fetal_health'], values='abnormal_short_term_variability')

In [None]:
sns.barplot(x='fetal_health',y='mean_value_of_short_term_variability',data=dataset)

In [None]:
px.area(dataset, x="prolongued_decelerations", y="abnormal_short_term_variability",color='fetal_health')

In [None]:
sns.barplot(x='fetal_health',y='percentage_of_time_with_abnormal_long_term_variability',data=dataset)

In [None]:
px.area(dataset, x="mean_value_of_short_term_variability", y="baseline value",color='fetal_health')

**Let us separate our dependent and independent variables.**

In [None]:
X = dataset[['baseline value','accelerations','uterine_contractions','prolongued_decelerations', 'abnormal_short_term_variability',
       'mean_value_of_short_term_variability']]
y = dataset['fetal_health']

**Splitting the data into training and testing datasets.**

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2,random_state=101)

Now we split the training dataset again into two parts: a training set and a validation set. I do this because the fetal_health dataset is imbalanced and the training data has to be balanced.

**If both the training set and the validation set are balanced, the validation scores will not reflect performance on the imbalanced testing set. If only the training set is balanced and the validation set is left imbalanced, the validation set has the same class distribution as the testing set, so validation accuracy is a more reliable guide to test accuracy during hyperparameter tuning.**

In [None]:
from sklearn.model_selection import train_test_split
X_X_train,X_val,y_y_train,y_val = train_test_split(X_train,y_train,test_size=0.2,random_state=101)
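
Two successive 80/20 splits leave roughly 64% of the rows for training, 16% for validation and 20% for testing. A small sketch of that arithmetic (the `split_sizes` helper and the row count of 1000 are assumptions for illustration; the rounding-up of the held-out part mirrors my understanding of how `train_test_split` rounds):

```python
import math

def split_sizes(n_rows, test_size=0.2, val_size=0.2):
    """Return (n_train, n_val, n_test) after a test split followed by a validation split."""
    # First split: hold out the test rows.
    n_test = math.ceil(n_rows * test_size)
    n_remaining = n_rows - n_test
    # Second split: hold out validation rows from what is left.
    n_val = math.ceil(n_remaining * val_size)
    n_train = n_remaining - n_val
    return n_train, n_val, n_test

sizes = split_sizes(1000)
print(sizes)
```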

**Standardization**

In [None]:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_X_train=ss.fit_transform(X_X_train)
X_val=ss.transform(X_val)
X_test=ss.transform(X_test)
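
The important detail above is that the scaler is fitted on the training part only, and the same mean and standard deviation are then reused for the validation and test parts. A minimal single-column sketch of that pattern (the helper names and the toy numbers are assumptions for illustration; `pstdev` is used because `StandardScaler` divides by the population standard deviation):

```python
import statistics

def fit_scaler(values):
    """Learn the mean and population standard deviation from training values."""
    return statistics.mean(values), statistics.pstdev(values)

def transform(values, mean, std):
    """Apply previously learned statistics to any split."""
    return [(v - mean) / std for v in values]

train_col = [1.0, 2.0, 3.0, 4.0, 5.0]   # toy training column
test_col = [3.0, 6.0]                   # toy test column

mean, std = fit_scaler(train_col)        # statistics come from training data only
train_scaled = transform(train_col, mean, std)
test_scaled = transform(test_col, mean, std)
print(train_scaled)
print(test_scaled)
```

Fitting on the training rows only keeps information about the validation and test rows out of the preprocessing step.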



Balancing the dataset using RandomOverSampler

In [None]:
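from imblearn.over_sampling import RandomOverSampler
# Named `ros` to avoid shadowing the standard-library `os` module
ros = RandomOverSampler()
# fit_resample replaces fit_sample, which was removed from imbalanced-learn
X_train_res, y_train_res = ros.fit_resample(X_X_train, y_y_train)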

In [None]:
from collections import Counter
print('Original dataset shape {}'.format(Counter(y_y_train)))
print('Resampled dataset shape {}'.format(Counter(y_train_res)))
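
Random oversampling simply repeats randomly chosen rows of the smaller classes until every class has as many rows as the largest one. A hand-written sketch of the idea (this is not the imbalanced-learn implementation; `random_oversample` and the toy data are assumptions for illustration):

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes match the majority count."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows belonging to this class.
        pool = [r for r, l in zip(rows, labels) if l == cls]
        # Draw with replacement until the class reaches the target size.
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels

toy_rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
toy_labels = [1, 1, 1, 1, 2, 3]
res_rows, res_labels = random_oversample(toy_rows, toy_labels)
resampled_counts = Counter(res_labels)
print(resampled_counts)
```

Because the added rows are exact copies, oversampling is applied to the training part only; copies in the validation or test part would make the scores look better than they are.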

In [None]:
from collections import Counter
print('Test dataset shape {}'.format(Counter(y_test)))

# Model building

In [None]:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

Hyperparameter tuning

In [None]:
model_param = {
    'DecisionTreeClassifier':{
        'model':DecisionTreeClassifier(),
        'param':{
            'criterion': ['gini','entropy'],
            # 'auto' (an alias of 'sqrt' for this estimator) was removed in scikit-learn 1.3
            'splitter':['best', 'random'],'max_features':['sqrt', 'log2']
        }
    },
        'KNeighborsClassifier':{
        'model':KNeighborsClassifier(),
        'param':{
            'n_neighbors': [5,10,15,20,25]
            ,'weights':['uniform', 'distance'],'algorithm':['auto', 'ball_tree', 'kd_tree', 'brute'],
            'leaf_size':[10,20,30,40,50]
        }
    },
        'SVC':{
        'model':SVC(),
        'param':{
            'kernel':['rbf','linear','sigmoid'],
            'C': [0.1, 1, 10, 100]
         
        }
    }
    
}

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
scores =[]
for model_name, mp in model_param.items():
    model_selection = GridSearchCV(estimator=mp['model'],param_grid=mp['param'],cv=5,return_train_score=False)
    model_selection.fit(X_X_train,y_y_train)
    scores.append({
        'model': model_name,
        'best_score': model_selection.best_score_,
        'best_params': model_selection.best_params_
    })
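
Grid search tries every combination of the listed values, and each combination is fitted once per cross-validation fold, so the cost grows quickly. A small sketch that counts the work for the KNN and SVC grids above (the `count_candidates` helper is an assumption for illustration):

```python
from itertools import product

def count_candidates(param_grid):
    """Number of parameter combinations a full grid search would try."""
    return len(list(product(*param_grid.values())))

knn_grid = {
    'n_neighbors': [5, 10, 15, 20, 25],
    'weights': ['uniform', 'distance'],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'leaf_size': [10, 20, 30, 40, 50],
}
svc_grid = {
    'kernel': ['rbf', 'linear', 'sigmoid'],
    'C': [0.1, 1, 10, 100],
}

cv_folds = 5
knn_candidates = count_candidates(knn_grid)
svc_candidates = count_candidates(svc_grid)
# One fit per candidate per fold (the final refit on all data is not counted here).
knn_fits = knn_candidates * cv_folds
svc_fits = svc_candidates * cv_folds
print(knn_candidates, knn_fits)
print(svc_candidates, svc_fits)
```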

In [None]:
df_model_score = pd.DataFrame(scores,columns=['model','best_score','best_params'])
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is no longer accepted by pandas
df_model_score

In [None]:
knn_model = KNeighborsClassifier(leaf_size=10,n_neighbors=10,weights='distance')
knn_model.fit(X_train_res,y_train_res)

In [None]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=knn_model, X=X_val ,y=y_val,cv=10)
print("accuracy is {:.2f} %".format(accuracies.mean()*100))
print("std is {:.2f} %".format(accuracies.std()*100))
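
One thing to keep in mind: `cross_val_score` clones the estimator and refits it on folds of whatever data it receives, so the call above scores fresh models trained on validation rows rather than the model that was fitted on `X_train_res`. If the goal is to check the already fitted model on the held-out validation set, a direct prediction can be scored instead. A minimal sketch on synthetic data (all `*_demo` and `X_tr`/`X_va` names are assumptions for illustration, not the notebook's variables):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class data standing in for the real features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=101
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=101
)

# Fit once on the training part ...
demo_model = KNeighborsClassifier(n_neighbors=10, weights='distance')
demo_model.fit(X_tr, y_tr)

# ... and score that same fitted model on the untouched validation part.
val_accuracy = accuracy_score(y_va, demo_model.predict(X_va))
print("validation accuracy is {:.2f} %".format(val_accuracy * 100))
```

The same pattern would apply to the other models in this section.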

In [None]:
decision_model = DecisionTreeClassifier(criterion='entropy',max_features='log2')
decision_model.fit(X_train_res,y_train_res)


In [None]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=decision_model, X=X_val ,y=y_val,cv=10)
print("accuracy is {:.2f} %".format(accuracies.mean()*100))
print("std is {:.2f} %".format(accuracies.std()*100))

In [None]:
svc_modell = SVC(C=10,kernel='rbf')
svc_modell.fit(X_train_res,y_train_res)


In [None]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=svc_modell, X=X_val ,y=y_val,cv=10)
print("accuracy is {:.2f} %".format(accuracies.mean()*100))
print("std is {:.2f} %".format(accuracies.std()*100))

In [None]:
random_model = RandomForestClassifier()
random_model.fit(X_train_res,y_train_res)


In [None]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=random_model, X=X_val,y=y_val,cv=10)
print("accuracy is {:.2f} %".format(accuracies.mean()*100))
print("std is {:.2f} %".format(accuracies.std()*100))

In [None]:
xgb_model = XGBClassifier()
xgb_model.fit(X_train_res,y_train_res)


In [None]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=xgb_model, X=X_val ,y=y_val,cv=10)
print("accuracy is {:.2f} %".format(accuracies.mean()*100))
print("std is {:.2f} %".format(accuracies.std()*100))

In [None]:
from sklearn.metrics import confusion_matrix,accuracy_score

In [None]:
from sklearn.ensemble import AdaBoostClassifier
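# The base learner is passed positionally because the keyword was renamed from base_estimator to estimator in newer scikit-learn
adaboost_model = AdaBoostClassifier(RandomForestClassifier(max_depth=1))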
adaboost_model.fit(X_train_res,y_train_res)

In [None]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=adaboost_model, X=X_val ,y=y_val,cv=5)
print("accuracy is {:.2f} %".format(accuracies.mean()*100))
print("std is {:.2f} %".format(accuracies.std()*100))

In [None]:
from sklearn.ensemble import BaggingClassifier
bagging_model = BaggingClassifier()
bagging_model.fit(X_train_res,y_train_res)

In [None]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=bagging_model, X=X_val ,y=y_val,cv=5)
print("accuracy is {:.2f} %".format(accuracies.mean()*100))
print("std is {:.2f} %".format(accuracies.std()*100))

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
extra_model = ExtraTreesClassifier()
extra_model.fit(X_train_res,y_train_res)

In [None]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=extra_model, X=X_val ,y=y_val,cv=5)
print("accuracy is {:.2f} %".format(accuracies.mean()*100))
print("std is {:.2f} %".format(accuracies.std()*100))

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
gradiant_model = GradientBoostingClassifier()
gradiant_model.fit(X_train_res,y_train_res)

In [None]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=gradiant_model, X=X_val ,y=y_val,cv=5)
print("accuracy is {:.2f} %".format(accuracies.mean()*100))
print("std is {:.2f} %".format(accuracies.std()*100))

**Model prediction**

In [None]:
knn_pre = knn_model.predict(X_test)
acc_knn = accuracy_score(knn_pre,y_test)

In [None]:
xgb_pre = xgb_model.predict(X_test)
acc_xgb = accuracy_score(xgb_pre,y_test)

In [None]:
svm_pre = svc_modell.predict(X_test)
acc_svm = accuracy_score(svm_pre,y_test)

In [None]:
decision_pre = decision_model.predict(X_test)
acc_decision  = accuracy_score(decision_pre,y_test)

In [None]:
random_pre = random_model.predict(X_test)
acc_random = accuracy_score(random_pre,y_test)

In [None]:
adaboost_pre = adaboost_model.predict(X_test)
acc_adaboost = accuracy_score(adaboost_pre,y_test)

In [None]:
gradiant_model_pre = gradiant_model.predict(X_test)
acc_gradient = accuracy_score(gradiant_model_pre,y_test)

In [None]:
bagging_pre = bagging_model.predict(X_test)
acc_bagging = accuracy_score(bagging_pre,y_test)

In [None]:
extra_pre = extra_model.predict(X_test)
acc_extra = accuracy_score(extra_pre,y_test)

# Accuracies of all models.

In [None]:
print(acc_knn)
print(acc_xgb)
print(acc_svm)
print(acc_decision)
print(acc_random)
print(acc_adaboost)
print(acc_extra)
print(acc_gradient)
print(acc_bagging)

# Classification report of best models.

**Random forest classification report**

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test,random_pre))  # argument order is (y_true, y_pred)

**XGB classifier classification report**

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test,xgb_pre))  # argument order is (y_true, y_pred)

**Extra tree classifier classification report**

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test,extra_pre))  # argument order is (y_true, y_pred)

**Bagging classifier classification report**

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test,bagging_pre))  # argument order is (y_true, y_pred)
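
When reading these reports, note that `classification_report` expects the true labels first and the predictions second. Accuracy does not depend on the order, but precision and recall trade places if the two arguments are swapped. A small hand-computed sketch for one class (the `precision_recall` helper and the toy labels are assumptions for illustration):

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, computed from paired label lists."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    predicted_pos = sum(p == positive for p in y_pred)
    actual_pos = sum(t == positive for t in y_true)
    # The toy data below has at least one predicted and one actual positive,
    # so neither division is by zero.
    return true_pos / predicted_pos, true_pos / actual_pos

toy_true = [3, 3, 3, 3, 1, 1, 1, 1, 1, 1]
toy_pred = [3, 3, 1, 1, 3, 1, 1, 1, 1, 1]

correct_order = precision_recall(toy_true, toy_pred, positive=3)
swapped_order = precision_recall(toy_pred, toy_true, positive=3)
print(correct_order)
print(swapped_order)
```

For the pathological class in particular, recall (how many truly pathological cases are caught) is usually the number to watch, so getting the order right matters.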

**The random forest classifier and the extra trees classifier appear to be the best-performing models, with 92% accuracy.**
