1. <a href="#intro"> Introduction </a>
2. <a href="#load"> Loading data </a>
3. <a href="#exp"> Exploring Data </a>
4. <a href="#prep"> Data preparation </a>
5. <a href="#model"> Build Model </a>
    * <a href="#first"> First experiment </a>
    * <a href="#cross"> Cross validation </a>
    * <a href="#grid"> Grid Search Cross Validation</a>
    * <a href="#final"> Final training </a>
6. <a href="#eval"> Evaluate Model </a>

# <a id="intro"> Introduction </a>

_Dataset information:_ https://archive.ics.uci.edu/ml/datasets/haberman's+survival

The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.


_Attribute Information:_

1. Age of patient at time of operation (numerical)
2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute)
   * 1 = the patient survived 5 years or longer
   * 2 = the patient died within 5 years
   
Our task here is to build a model that predicts whether a patient will survive 5 years or longer after breast cancer surgery, given these 3 available features.

IMPORTANT NOTE: By no means am I trying to solve a "real world" breast cancer survival prediction problem, and this notebook should not be used for any medical application. This is just a brief exploration of a classification problem.

# <a id="load">Loading data</a>

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, fbeta_score
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
import random

In [None]:
names = ['age', 'year_operation', 'axillary_nodes', 'survival']
df = pd.read_csv("/kaggle/input/habermans-survival-data-set/haberman.csv", names=names)

In [None]:
print(df.shape)

df.head()

# <a id="exp"> Exploring Data </a>

In [None]:
df.info()

In [None]:
df.describe()

There are indeed only 3 predictive features and 306 sample rows. Probably, we can deal with this problem using a small network, with a small batch size and using some regularization strategy, in order to avoid overfitting.

In [None]:
sns.pairplot(df, hue="survival")

In [None]:
scaler = MinMaxScaler()

# scale features to [0, 1] for plotting only, so the boxplots share one axis
features = df.iloc[:, :-1]
df_scaled = pd.DataFrame(scaler.fit_transform(features), columns=features.columns)

sns.set_theme(style="ticks", palette="pastel")
sns.boxplot(data=df_scaled)

In [None]:
df.hist()

_age_ seems to have a Gaussian-like distribution, while _axillary nodes_ looks more like an exponential one.

Also, our target feature - _survival_ - is clearly imbalanced: there are many more samples from class 1 than from class 2. Let's check how imbalanced it is.

In [None]:
# count samples per class and their share of the dataset
counts = df.survival.value_counts()
pd.DataFrame({'survived_qty': counts, 'survived_pct': (counts / len(df)).round(3)})

About 73.5% of the samples (225 of 306) are from class 1, i.e. patients who survived 5 years or longer.

# <a id="prep">Data Preparation</a>

In [None]:
# split into input and target
X, y = df.values[:, :-1], df.values[:, -1]

In [None]:
# ensure all data are floating point
X = X.astype('float32')

# label encode the classes 1/2 to 0/1
y = LabelEncoder().fit_transform(y)
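
A quick way to see what the encoding step does to the labels is the small sketch below. The labels in it are illustrative values, not rows from the dataset. `LabelEncoder` assigns integers in sorted order of the original values, so class 1 (survived) becomes 0 and class 2 (died) becomes 1. This matters later because scikit-learn's `recall_score` and `precision_score` treat label 1 as the positive class by default.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Illustrative labels only (not taken from the dataset): 1 = survived, 2 = died
raw_labels = np.array([1, 2, 1, 1, 2])

encoder = LabelEncoder()
encoded = encoder.fit_transform(raw_labels)

# classes_ holds the original values, ordered by their encoded integer
mapping = {int(original): int(code) for code, original in enumerate(encoder.classes_)}

print("mapping:", mapping)
print("encoded:", encoded.tolist())
```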

In [None]:
# split into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=3)

# set number of input features
n_features = X_train.shape[1]

# <a id="model"> Build Model </a>

### <a id="first"> First experiment </a>

Let's build a simple MLP, with an arbitrarily chosen architecture and parameters, just to get a feel for how such a model behaves on our problem.

In [None]:
def create_model():
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_features,)))
    model.add(Dense(1, activation='sigmoid'))

    model.compile(optimizer='adam', 
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

In [None]:
model = create_model()

history = model.fit(X_train, 
                    y_train, 
                    epochs=200, 
                    batch_size=16, 
                    verbose=0, 
                    validation_data=(X_test,y_test))

In [None]:
# per-epoch metrics recorded by fit()
losses = pd.DataFrame(history.history)

In [None]:
losses[['loss','val_loss']].plot()

In [None]:
losses[['accuracy','val_accuracy']].plot()

In [None]:
model.evaluate(X_test, y_test)

In [None]:
predictions = (model.predict(X_test) > 0.5).astype("int32")

# classification_report
print(classification_report(y_test, predictions))

# confusion matrix
pd.DataFrame(confusion_matrix(y_test, predictions))

This simple test, without any tuning, achieved 77% accuracy on the test set and an acceptable learning curve. Since this is above the 73% share that class 1 holds in the full dataset (the accuracy obtained by always predicting the majority class), it is an indication that our approach is promising for this task.
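
The comparison with the class 1 share can be made explicit with a few lines of arithmetic. The sketch below only reuses numbers already reported in this notebook (306 samples, 225 of them in class 1, 77% test accuracy); the variable names are chosen here for illustration.

```python
# Class counts from the exploration step: 225 of 306 samples are class 1
n_total = 306
n_majority = 225

# Accuracy of a trivial classifier that always predicts the majority class
baseline_accuracy = n_majority / n_total

# Test accuracy reported for the first experiment
model_accuracy = 0.77

gain = model_accuracy - baseline_accuracy
print(f"baseline: {baseline_accuracy:.3f}  model: {model_accuracy:.3f}  gain: {gain:.3f}")
```

The margin over the baseline is only a few percentage points, so it is worth confirming with cross-validation rather than relying on a single split.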

### <a id="cross">Cross validation</a>

Let's apply a <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html" target='_blank'>stratified k-fold cross-validation</a> to get a more reliable estimate of our model's performance, i.e. we are going to fit k models, each evaluated on a different held-out fold, and check the mean accuracy.

In [None]:
# set up the objects needed for 10-fold cross-validation
kfold = StratifiedKFold(10)
scores = list()
n_features = X.shape[1]

# perform kfold
for fold, (train_K, test_K) in enumerate(kfold.split(X, y)):
    # split data
    X_train, X_test, y_train, y_test = X[train_K], X[test_K], y[train_K], y[test_K]

    # define model (same as before)
    model = create_model()

    # fit model
    model.fit(X_train, y_train, epochs=200, batch_size=16, verbose=0)

    # predict in test set
    predictions = (model.predict(X_test) > 0.5).astype("int32")

    # evaluate predictions
    score = accuracy_score(y_test, predictions)

    print(f"fold {fold+1}, score: {round(score,2)}")
    scores.append(score)

# summarize all scores
print(f"Mean Accuracy: {round(np.mean(scores),2)} ({round(np.std(scores),2)})")
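
To see what "stratified" means in practice, the sketch below runs `StratifiedKFold` on synthetic labels that only mimic the class counts of this dataset (225 vs. 81, the second number being 306 minus 225). The feature matrix is a placeholder of zeros, since only the labels drive the split. Each test fold should receive roughly the same number of minority samples.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels mimicking the class counts of the dataset (assumption: 225 vs 81)
y_demo = np.array([0] * 225 + [1] * 81)
# Placeholder features; the split depends only on the labels
X_demo = np.zeros((len(y_demo), 3))

kfold = StratifiedKFold(n_splits=10)

# Count minority-class samples in each held-out fold
minority_counts = [int(y_demo[test_idx].sum()) for _, test_idx in kfold.split(X_demo, y_demo)]
fold_sizes = [len(test_idx) for _, test_idx in kfold.split(X_demo, y_demo)]

print("fold sizes:     ", fold_sizes)
print("minority counts:", minority_counts)
```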

### <a id="grid"> Grid Search CV </a>

In order to optimize the hyperparameters of our model, let's use the grid search capability of the scikit-learn library (<a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html" target="_blank">GridSearchCV class</a>) to tune Keras deep learning models.

Keras models can be used alongside scikit-learn by wrapping them with the <a href="https://www.tensorflow.org/api_docs/python/tf/keras/wrappers/scikit_learn/KerasClassifier" target="_blank">KerasClassifier</a>. We just have to define a function that builds a Keras model and pass it to the build_fn argument of KerasClassifier. Let's build the framework to optimize epochs and batch_size:

In [None]:
model = KerasClassifier(build_fn=create_model)

In [None]:
# set metric scores to monitor in gridsearch (need to use scikit-learn make_scorer function)

my_scores = {'accuracy' :make_scorer(accuracy_score),
             'recall'   :make_scorer(recall_score),
             'precision':make_scorer(precision_score),
             'f1'       :make_scorer(fbeta_score, beta = 1)}

In [None]:
# parameter grid to search
param_grid = dict(epochs=[100,150,200,250], batch_size=[8,16,32,64])

grid = GridSearchCV(estimator=model,
                    param_grid=param_grid,
                    n_jobs=-1,
                    refit='accuracy',
                    scoring = my_scores,
                    cv=3,
                    verbose=0)

grid_result = grid.fit(X, y)

The best parameters and score (based on the refit argument) can now be accessed:

In [None]:
print(grid_result.best_score_)
print(grid_result.best_params_)

And in a more detailed analysis:

In [None]:
pd.DataFrame(grid_result.cv_results_).columns.to_list()

In [None]:
# setting a dataframe with top 5 results from gridsearchCV

pd.DataFrame(grid_result.cv_results_)[['params',
                                       'mean_test_accuracy',
                                       'mean_test_recall',
                                       'mean_test_precision',
                                       'mean_test_f1',
                                       'rank_test_accuracy']].sort_values('rank_test_accuracy').head(5)
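
The multi-metric pattern used above can be tried in isolation without Keras. The sketch below is a stand-in under stated assumptions: it swaps the wrapped network for a `LogisticRegression` and uses synthetic data from `make_classification`, neither of which is part of this notebook. It shows how each key of the `scoring` dict produces `mean_test_<name>` and `rank_test_<name>` columns in `cv_results_`, and how `refit` decides which metric defines `best_params_` and `best_score_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: 3 features, binary target)
X_demo, y_demo = make_classification(
    n_samples=120, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

# One scorer per metric to monitor; the dict keys become column suffixes
scoring = {"accuracy": make_scorer(accuracy_score),
           "recall": make_scorer(recall_score)}

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),  # stand-in for the Keras wrapper
    param_grid={"C": [0.1, 1.0, 10.0]},
    scoring=scoring,
    refit="accuracy",  # metric used to pick best_params_ and best_score_
    cv=3,
)
search.fit(X_demo, y_demo)

results = search.cv_results_
print(search.best_params_, round(search.best_score_, 3))
print(sorted(k for k in results if k.startswith("mean_test_")))
```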

### <a id="final">Final training</a>

Our mean accuracy, according to the applied cross-validation strategy, is now about 77%.

A deeper discussion with domain experts should provide guidance on whether this performance level is acceptable. Assuming that it is, the next step is to train our verified model on the full dataset, using the optimized parameters found ('batch_size': 8, 'epochs': 250).

In [None]:
X, y = df.values[:, :-1], df.values[:, -1]
X = X.astype('float32')
y = LabelEncoder().fit_transform(y)
n_features = X.shape[1]

In [None]:
model = create_model()

In [None]:
BS = grid_result.best_params_['batch_size']
ep = grid_result.best_params_['epochs']

model.fit(X,y, epochs=ep, batch_size=BS, verbose=0)

# <a id="eval"> Evaluate Model </a>

Now, we are in a position to apply the trained model to make predictions on new data.

To simulate a new data input, let's pick a random sample from our dataset:

In [None]:
# randrange excludes the upper bound, so the index is always valid
random_ind = random.randrange(len(X))

new_data = X[random_ind]
exp_out = y[random_ind]
pred = (model.predict(new_data.reshape(1,3)) > 0.5).astype('int32')[0][0]

In [None]:
print(f"\n new_data: {new_data}\n \n expected output: {exp_out} \n \n predicted output: {pred}\n")
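
The thresholding and reshaping used for predictions can be checked without a trained model. In the sketch below the probabilities and the sample are made-up values standing in for the output of `model.predict` and for one row of `X`. A single sigmoid unit returns an array of shape `(n_samples, 1)`, which is why the notebook indexes with `[0][0]` to get a scalar label.

```python
import numpy as np

# Made-up sigmoid outputs, shaped like the result of a one-unit output layer
probs = np.array([[0.12], [0.50], [0.73]])

# Strict comparison: a probability of exactly 0.5 maps to class 0
labels = (probs > 0.5).astype("int32")

# A single sample is 1-D and must become a batch of one row before prediction
sample = np.array([30.0, 64.0, 1.0], dtype="float32")  # illustrative values
batch = sample.reshape(1, -1)

# Scalar label for the first prediction, as done for `pred` above
first_label = labels[0][0]

print(labels.ravel().tolist(), batch.shape, first_label)
```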