# <center><u>Fetal Health Classification</u></center>
   #### <p>This dataset contains 2126 records of features extracted from cardiotocogram exams, which were then classified by three expert obstetricians into 3 classes:
* Normal
* Suspect
* Pathological</p>
 
<p><h3 style="display: inline;">Target :</h3> In this task, we will classify the data into the three categories above using several classification algorithms, aiming for the lowest prediction error.</p>

### Table of Contents :
1. Importing Data and Libraries
2. Exploratory Data Analysis (EDA)
3. Data Pre-processing
4. Modeling & Hyperparameter Tuning<br />
    * Logistic Regression<br />
    * Random Forest Classifier<br />
    * Gradient Boosting Classifier <br />
    * XGBoost Classifier <br />
5. Model Stacking
6. Plotting a Learning Curve

<h2 style='color:blue'>1. Import Necessary Libraries and Dataset</h2>

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Splitting the data
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold

# Algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

# Model Stacking
from sklearn.ensemble import StackingClassifier

# For Hyper-parameter Tuning the model
from sklearn.model_selection import GridSearchCV

# For checking Model Performance
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import learning_curve

import warnings
warnings.simplefilter(action="ignore")

In [None]:
data = pd.read_csv('../input/fetal-health-classification/fetal_health.csv')
data.head().T

 <h2 style='color:blue'>2. Exploratory Data Analysis (EDA)</h2>
 <p>EDA and data visualization give a basic overview of the quality and nature of the available information before you study it in more detail. A statistical model may or may not be used, but EDA is primarily for seeing what the data can tell us beyond formal modeling or hypothesis testing.</p>
<p>In this step, we will get basic information about the data, such as the mean, standard deviation, quartiles, and min-max values of all the numeric features.</p>
<p>We will also try to understand the data using various plots.</p>

In [None]:
data.info()

In [None]:
data.describe().T

### Analyze & Visualize the Target Variable

In [None]:
data['fetal_health'].unique()

In [None]:
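# Pass the column as x explicitly; recent seaborn versions treat the first positional argument as `data`
sns.countplot(x=data['fetal_health'])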

In [None]:
data['fetal_health'].value_counts()

### Histogram

In [None]:
hist_plot = data.hist(figsize=(20,20))

### Correlation Matrix

In [None]:
corr = data.corr()

plt.figure(figsize=(12,10))
sns.heatmap(corr, annot=True, cmap='rainbow')
plt.show()

<h2 style='color:blue'>3. Data Pre-processing</h2>

From the correlation matrix, we can see that 'histogram_mode', 'histogram_mean' and 'histogram_median' are highly correlated with each other. Also, 'histogram_min' and 'histogram_width' are strongly negatively correlated. Because these columns carry largely redundant information, we keep 'histogram_mean' and 'histogram_width' and remove the 'histogram_mode', 'histogram_median' and 'histogram_min' columns from the dataset.
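
Reading correlated pairs off a large heatmap is error-prone, so the selection above can also be done programmatically. The following is a minimal sketch on toy data, not the notebook's own method: the helper name `highly_correlated_pairs` and the 0.9 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df, threshold=0.9):
    """Return (col_a, col_b, r) for column pairs with |r| >= threshold.

    Hypothetical helper; the threshold is an illustrative assumption.
    """
    corr = df.corr()
    # Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (if any survive stacking) compare as False and are filtered out here.
    pairs = pairs[pairs.abs() >= threshold]
    return [(a, b, float(r)) for (a, b), r in pairs.items()]


# Toy data: "b" nearly copies "a", "c" nearly mirrors it, "d" is independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.05, size=200),
    "c": -base + rng.normal(scale=0.05, size=200),
    "d": rng.normal(size=200),
})

pairs = highly_correlated_pairs(toy, threshold=0.9)
for col_a, col_b, r in pairs:
    print(f"{col_a} ~ {col_b}: r = {r:.3f}")
```

On the real data the same helper could be called as `highly_correlated_pairs(data.drop(columns='fetal_health'))`; which column of each pair to drop remains a judgment call.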

In [None]:
data = data.drop(['histogram_min','histogram_median','histogram_mode'], axis=1)
data

### Find Missing Values :
<i>Real-world data often has many missing values, caused for example by data corruption or a failure to record data. Handling missing data is an important preprocessing step, because many machine learning algorithms do not support missing values.</i>
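
Besides listing which columns contain missing values, it often helps to know how many values are missing in each. Here is a small sketch on a hypothetical toy frame (the column names `x`, `y`, `z` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few gaps.
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 4.0],
    "y": [10.0, 20.0, 30.0, 40.0],
    "z": [np.nan, np.nan, 1.0, 2.0],
})

# Number of missing entries per column.
missing_counts = toy.isna().sum()
# Fraction of missing entries per column.
missing_share = toy.isna().mean()
# Names of the columns that have at least one missing entry.
cols_with_missing = missing_counts[missing_counts > 0].index.tolist()

print(missing_counts)
print(missing_share)
print(cols_with_missing)
```

If every count is zero, no imputation or row dropping is needed before modeling.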

In [None]:
## List the columns that contain missing values
## (isnull() is an alias of isna(), so both checks report the same columns)
nv = data.columns[data.isnull().any()]
print('Null values = ', nv)

mv = data.columns[data.isna().any()]
print('Missing values = ', mv)

### Splitting the Data

In [None]:
# Splitting data into 75% train set and 25% test set

X = data.drop(['fetal_health'], axis=1)
y = data['fetal_health']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=42)
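
If the `value_counts()` output from the EDA step shows that the three classes are unevenly represented, a purely random split can leave the rare classes under-represented in the test set. One possible variation, sketched here on a hypothetical imbalanced target, is to pass `stratify` to `train_test_split` so that both sets keep roughly the same class proportions. The class sizes below are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 160 / 30 / 10 samples for classes 1 / 2 / 3.
y_toy = np.array([1] * 160 + [2] * 30 + [3] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the class proportions (approximately) equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

for cls in np.unique(y_toy):
    print(cls, "train share:", np.mean(y_tr == cls), "test share:", np.mean(y_te == cls))
```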

<h2 style='color:blue'>4. Modeling and Hyperparameter Tuning</h2>

<h3 style='color: green'>Logistic Regression</h3>

In [None]:
model = LogisticRegression()
model.fit(X_train, y_train)
model.score(X_test, y_test)

### Find out the best parameters using GridSearchCV

In [None]:
params = {"tol": [0.0001,0.0002,0.0003],
          "intercept_scaling": [1, 2, 3, 4]
         }

In [None]:
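# random_state only takes effect when shuffle=True; recent scikit-learn
# versions raise an error if it is set without shuffling
cv_method = StratifiedKFold(n_splits=3,
                            shuffle=True,
                            random_state=42)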

In [None]:
GridSearchCV_LR = GridSearchCV(estimator=LogisticRegression(), 
                       param_grid=params,
                       cv=cv_method,
                       n_jobs=2,
                       scoring="accuracy"
                      )

In [None]:
GridSearchCV_LR.fit(X_train, y_train)

In [None]:
best_params_LR = GridSearchCV_LR.best_params_
best_params_LR
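
Rather than copying the reported values by hand into a new model, the tuned parameters can be reused directly. The sketch below uses synthetic data from `make_classification` and a hypothetical grid over `C` and `tol` (both are assumptions for illustration, not the notebook's data or grid). It shows two equivalent routes: unpacking `best_params_` into a fresh estimator, or using `best_estimator_`, which `GridSearchCV` has already refit on the full training data because `refit=True` is the default.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic 3-class data standing in for the real training set.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=42
)

# Hypothetical grid, chosen only to keep the example small.
param_grid = {"C": [0.1, 1, 10], "tol": [0.0001, 0.001]}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    cv=cv,
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

# Route 1: unpack the winning parameters into a fresh model and fit it.
tuned = LogisticRegression(max_iter=1000, **search.best_params_)
tuned.fit(X_toy, y_toy)

# Route 2: reuse the estimator that GridSearchCV already refit (refit=True by default).
best = search.best_estimator_

print(search.best_params_, round(search.best_score_, 3))
```

Either route avoids a mismatch between the parameters the search selected and the ones typed into the final model.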

In [None]:
lr = LogisticRegression(C=10, intercept_scaling=1, tol=0.0001, penalty="l2", solver="liblinear", random_state=42)
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

### Prediction

In [None]:
pred = lr.predict(X_test)

### Classification Report

In [None]:
print("Classification Report")
print(classification_report(y_test, pred))

### Confusion Matrix

In [None]:
ax= plt.subplot()
sns.heatmap(confusion_matrix(y_test, pred), annot=True, ax = ax, cmap = "BuPu");

# labels, title and ticks
ax.set_xlabel("Predicted labels")
ax.set_ylabel("True labels")
ax.set_title("Confusion Matrix")
ax.xaxis.set_ticklabels(["Normal", "Suspect", "Pathological"])

<h3 style='color: green'>Random Forest Classifier</h3>

In [None]:
model = RandomForestClassifier()
model.fit(X_train, y_train)
model.score(X_test, y_test)

### Find out the best parameters using GridSearchCV

In [None]:
params_RF = {"min_samples_split": [2, 6, 20],
             "min_samples_leaf": [1, 4, 16],
             "n_estimators" :[100,150, 200, 250],
             "criterion": ["gini"]             
            }

In [None]:
GridSearchCV_RF = GridSearchCV(estimator=RandomForestClassifier(), 
                                param_grid=params_RF, 
                                cv=cv_method,
                                n_jobs=2,
                                scoring="accuracy"
                                )

In [None]:
GridSearchCV_RF.fit(X_train, y_train)

In [None]:
best_params_RF = GridSearchCV_RF.best_params_
best_params_RF

In [None]:
rf = RandomForestClassifier(criterion="gini", n_estimators = 100, min_samples_leaf=1, min_samples_split=2, random_state=42)
rf.fit(X_train, y_train)
rf.score(X_test, y_test)

In [None]:
pred_rf = rf.predict(X_test)

### Classification Report

In [None]:
print("Classification Report")
print(classification_report(y_test, pred_rf))

### Confusion Matrix

In [None]:
ax= plt.subplot()
sns.heatmap(confusion_matrix(y_test, pred_rf), annot=True, ax = ax, cmap = "BuPu")

# labels, title and ticks
ax.set_xlabel("Predicted labels")
ax.set_ylabel("True labels")
ax.set_title("Confusion Matrix")
ax.xaxis.set_ticklabels(["Normal", "Suspect", "Pathological"])

<h3 style='color: green'>Gradient Boosting Classifier</h3>

In [None]:
gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)
gbc.score(X_test, y_test)  # score the gradient boosting model, not the earlier random forest `model`

In [None]:
pred_gbc = gbc.predict(X_test)

### Classification Report

In [None]:
print("Classification Report")
print(classification_report(y_test, pred_gbc))

### Confusion Matrix

In [None]:
ax = plt.subplot()
sns.heatmap(confusion_matrix(y_test, pred_gbc), annot=True, ax = ax, cmap = "BuPu")

# labels, title and ticks
ax.set_xlabel("Predicted labels")
ax.set_ylabel("True labels")
ax.set_title("Confusion Matrix")
ax.xaxis.set_ticklabels(["Normal", "Suspect", "Pathological"])

<h3 style='color: green'>XGBoost Classifier</h3>

In [None]:
xgb = XGBClassifier()
xgb.fit(X_train, y_train)
xgb.score(X_test, y_test)
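
Depending on the installed XGBoost version, `XGBClassifier` may reject class labels that do not start at 0. Assuming the target uses the codes 1.0, 2.0 and 3.0 (an assumption; check `data['fetal_health'].unique()`), one possible workaround is to encode the labels before fitting and decode the predictions afterwards. The sketch below shows only the encoding round trip on a toy label array and does not call XGBoost itself.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels using the assumed 1.0 / 2.0 / 3.0 coding.
y_toy = np.array([1.0, 2.0, 3.0, 1.0, 1.0, 2.0])

encoder = LabelEncoder()
# Map the sorted unique labels to 0, 1, 2.
y_encoded = encoder.fit_transform(y_toy)

# A model trained on y_encoded predicts 0/1/2; inverse_transform maps
# those predictions back to the original codes.
restored = encoder.inverse_transform(y_encoded)

print(y_encoded)
print(restored)
```

With this approach the classifier would be fit on the encoded training labels, and its predictions passed through `encoder.inverse_transform` before being compared with `y_test`.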

In [None]:
pred_xgb = xgb.predict(X_test)

In [None]:
print("Classification Report")
print(classification_report(y_test, pred_xgb))

In [None]:
ax = plt.subplot()
sns.heatmap(confusion_matrix(y_test, pred_xgb), annot=True, ax = ax, cmap = "BuPu")

# labels, title and ticks
ax.set_xlabel("Predicted labels")
ax.set_ylabel("True labels")
ax.set_title("Confusion Matrix")
ax.xaxis.set_ticklabels(["Normal", "Suspect", "Pathological"])

<h2 style='color:blue'>5. Model Stacking</h2>

In [None]:
estimators = [
    ('rf', RandomForestClassifier(criterion="gini", n_estimators = 100, min_samples_leaf=1, min_samples_split=2, random_state=42)),
    ('gb', GradientBoostingClassifier()),
    ('xgb', XGBClassifier()
    )
]

In [None]:
clf = StackingClassifier(estimators=estimators, final_estimator=RandomForestClassifier(criterion="gini", n_estimators = 100, min_samples_leaf=1, min_samples_split=2, random_state=42), cv=5)
clf.fit(X_train, y_train).score(X_test, y_test)

In [None]:
pred_clf = clf.predict(X_test)

In [None]:
print("Classification Report")
print(classification_report(y_test, pred_clf))

In [None]:
ax = plt.subplot()
sns.heatmap(confusion_matrix(y_test, pred_clf), annot=True, ax = ax, cmap = "BuPu")

# labels, title and ticks
ax.set_xlabel("Predicted labels")
ax.set_ylabel("True labels")
ax.set_title("Confusion Matrix")
ax.xaxis.set_ticklabels(["Normal", "Suspect", "Pathological"])

<h2 style='color:blue'>6. Plotting a Learning Curve</h2>

In [None]:
def plot_learning_curve(estimator, title, x, y, ylim=None, cv=None,
                        n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
    
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
        
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, x, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="#80CBC4",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="#00897B",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

### Logistic Regression Curve

In [None]:
plot_learning_curve(GridSearchCV_LR.best_estimator_,title = "Logistic Regression learning curve", x = X_train, y = y_train, cv = cv_method)

### Random Forest Curve

In [None]:
plot_learning_curve(GridSearchCV_RF.best_estimator_,title = "Random Forest learning curve", x = X_train, y = y_train, cv = cv_method)

### Gradient Boosting Classifier Curve

In [None]:
plot_learning_curve(gbc,title = "Gradient Boosting Classifier learning curve", x = X_train, y = y_train, cv = cv_method)

### XGBoost Classifier Curve

In [None]:
plot_learning_curve(xgb, title = "XGBoost Classifier learning curve", x = X_train, y = y_train, cv = cv_method)

### Stacked Model Curve

In [None]:
plot_learning_curve(clf, title = "Stacked Model learning curve", x = X_train, y = y_train, cv = cv_method)