**Reference from**
> https://machinelearningmastery.com/imbalanced-classification-is-hard/

> **In this project, I will use a small breast cancer survival dataset, referred to generally as the
Haberman Dataset. The dataset describes breast cancer patient data and the outcome is patient
survival: specifically, whether the patient survived for five years or longer, or whether the
patient did not survive. This is a standard dataset used in the study of imbalanced classification.
According to the dataset description, the breast cancer surgery operations were conducted
between 1958 and 1970 at the University of Chicago’s Billings Hospital. There are 306 examples
in the dataset, and there are 3 input variables.**

> **My goal: given a patient's breast cancer surgery details, what is the probability that
the patient survives for five years or more?**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
# load the Haberman Breast Cancer Survival dataset
haberman = pd.read_csv("../input/haberman.csv/haberman.csv")

In [None]:
# peek of the dataset
haberman.head()

> Here:
 age: The age of the patient at the time of the operation.
 year: The two-digit year of the operation.
 nodes: The number of positive axillary nodes detected, a measure of whether the cancer has spread.

In [None]:
# shape of the dataset
haberman.shape

In [None]:
# summarize each column
haberman.describe()

> Looking at the age, we can see that the youngest patient was 30 and the oldest was 78; that is quite a range. The mean patient age was about 52 years. If the occurrence of cancer is somewhat random, we might expect the age distribution to be Gaussian. We can see that all operations were performed between 1958 and 1969. If the number of breast cancer patients is somewhat fixed over time, we might expect this variable to have a uniform distribution. We can see nodes have values between 0 and 52. This might be a cancer diagnostic related to lymphatic nodes.

In [None]:
# check data types
haberman.dtypes

> All variables are integers. Therefore, it might be helpful to look at each variable as a
histogram to get an idea of the variable distribution. This might be helpful in case we choose
models later that are sensitive to the data distribution or scale of the data, in which case, we
might need to transform or rescale the data.

In [None]:
# create histograms of each variable
from matplotlib import pyplot
haberman.hist()
pyplot.show()

> We can see that age appears
to have a Gaussian distribution, as we might have expected. We can also see that year has
a uniform distribution, mostly, with an outlier in the first year showing nearly double the
number of operations. We can see nodes has an exponential type distribution with perhaps most
examples showing 0 nodes, with a long tail of values after that. A transform to un-bunch this
distribution might help some models later on. Finally, we can see the two-class values with an
unequal class distribution, showing perhaps 2 or 3 times more survival than non-survival cases.

In [None]:
# check how imbalanced the dataset actually is
from collections import Counter
# summarize the class distribution
target = haberman['status'].values
counter = Counter(target)
for k,v in counter.items():
    per = v / len(target) * 100
    print('Class=%d, Count=%d, Percentage=%.3f%%' % (k, v, per))

> We can see that
class 1 for survival has the most examples at 225, or about 74 percent of the dataset.
We can also see class 2 for non-survival has fewer examples at 81, or about 26 percent of the dataset. The
class distribution is skewed, but it is not severely imbalanced.
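
> As a quick check of the arithmetic, the sketch below recomputes the class percentages from counts instead of from the file. The counts are assumptions: 225 for class 1 as stated above, and 81 for class 2 so that the total matches the 306 examples mentioned in the introduction.

```python
from collections import Counter

# assumed labels: 225 survivors (class 1) and 81 non-survivors (class 2), 306 in total
labels = [1] * 225 + [2] * 81
counts = Counter(labels)
# share of each class as a percentage of all examples
percentages = {k: v / len(labels) * 100 for k, v in counts.items()}
for k in sorted(percentages):
    print('Class=%d, Count=%d, Percentage=%.3f%%' % (k, counts[k], percentages[k]))
# number of majority examples per minority example
imbalance_ratio = counts[1] / counts[2]
print('Ratio=%.2f' % imbalance_ratio)
```

> Under these assumptions there are a little under three survival cases for every non-survival case, which fits the description of a skewed but not severely imbalanced dataset.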

**It is customary for an imbalanced dataset to model the minority class as a positive class. In
this dataset, the positive class represents non-survival. This means that we will be predicting
the probability of non-survival and will need to calculate the complement of the predicted
probability in order to get the probability of survival. As such, we can map the 1 class values
(survival) to the negative case with a 0 class label, and the 2 class values (non-survival) to the
positive case with a class label of 1.**
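
> A minimal sketch of this mapping on a toy target, assuming the original coding of 1 for survival and 2 for non-survival. LabelEncoder sorts the distinct labels and uses their position as the new label, so 1 becomes 0 and 2 becomes 1. The array `y_raw` is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# toy target using the dataset's original coding: 1 = survival, 2 = non-survival
y_raw = np.array([1, 2, 1, 1, 2])
encoder = LabelEncoder()
y_enc = encoder.fit_transform(y_raw)
# classes_ holds the original labels in sorted order; the index of each is its new label
print(encoder.classes_)
print(y_enc)
# the original coding can be recovered if needed
y_back = encoder.inverse_transform(y_enc)
print(y_back)
```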

In [None]:
# retrieve numpy array
haberman = haberman.values
# split into input and output elements
X, y = haberman[:, :-1], haberman[:, -1]

# label encode the target variable to have the classes 0 and 1
from sklearn.preprocessing import LabelEncoder
y = LabelEncoder().fit_transform(y)

> I will evaluate candidate models using repeated stratified k-fold cross-validation. The k-fold
cross-validation procedure provides a good general estimate of model performance that is not
too optimistically biased, at least compared to a single train-test split. We will use k = 10,
meaning each fold will contain 306/10 or about 30 examples.
Stratified means that each fold will contain the same mixture of examples by class, that is
about 74 percent to 26 percent survival and non-survival. Repeated means that the evaluation
process will be performed multiple times to help avoid fluke results and better capture the
variance of the chosen model. I will use three repeats. This means a single model will be
fit and evaluated 10 × 3 (30) times and the mean and standard deviation of these runs will be
reported.
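
> The procedure can be inspected without any model. The sketch below uses stand-in data (306 rows with 81 positives, which is an assumption that mirrors the class counts) and reports how many folds are evaluated, how large the test folds are, and how the positive rate varies across folds.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# assumed stand-in data: 306 rows, 81 of them positive (about 26 percent)
y_demo = np.array([0] * 225 + [1] * 81)
X_demo = np.zeros((len(y_demo), 3))
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
fold_sizes, fold_rates = [], []
for train_ix, test_ix in cv.split(X_demo, y_demo):
    # size of the held-out fold and its share of positive examples
    fold_sizes.append(len(test_ix))
    fold_rates.append(y_demo[test_ix].mean())
print('folds evaluated: %d' % len(fold_sizes))
print('test fold size: %d to %d' % (min(fold_sizes), max(fold_sizes)))
print('positive rate per fold: %.3f to %.3f' % (min(fold_rates), max(fold_rates)))
```

> Each test fold should hold roughly 30 rows with a positive rate near 26 percent, and the 30 evaluations come from 10 folds repeated 3 times.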

> Given that we are interested in predicting a probability of survival, we need a performance
metric that evaluates the skill of a model based on the predicted probabilities. In this case,
we will use the Brier score that calculates the mean squared error between the predicted
probabilities and the expected probabilities.
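
> To make the metric concrete, here is a small worked example. The Brier score is the mean of (predicted probability - outcome)^2, and the Brier skill score (BSS) used in the following code rescales it against a reference forecast as 1 - BS_model / BS_ref. The outcomes and probabilities below are hypothetical, and the reference forecast is taken to be the positive-class rate of this toy sample.

```python
import numpy as np

def brier_score(y_true, y_prob):
    # mean squared difference between predicted probability and the 0/1 outcome
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))

# hypothetical outcomes (1 = non-survival) and predicted probabilities
y_true = [0, 0, 0, 1]
y_prob = [0.1, 0.2, 0.3, 0.8]
# reference forecast: always predict the base rate of the positive class
base_rate = float(np.mean(y_true))
bs_model = brier_score(y_true, y_prob)
bs_ref = brier_score(y_true, [base_rate] * len(y_true))
# skill relative to the reference: 0 means no better than it, 1 means perfect
bss = 1.0 - bs_model / bs_ref
print('BS model=%.4f, BS reference=%.4f, BSS=%.4f' % (bs_model, bs_ref, bss))
```

> Here the model's Brier score is 0.045 against 0.1875 for the reference, giving a BSS of 0.76. A model that does worse than the reference gets a negative BSS.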

In [None]:
from sklearn.metrics import brier_score_loss
from numpy import mean
from numpy import std
# calculate brier skill score (BSS)
def brier_skill_score(y_true, y_prob):
    # calculate reference brier score
    ref_probs = [0.26471 for _ in range(len(y_true))]
    bs_ref = brier_score_loss(y_true, ref_probs)
    # calculate model brier score
    bs_model = brier_score_loss(y_true, y_prob)
    # calculate skill score
    return 1.0 - (bs_model / bs_ref)

> Next, we can make use of the brier_skill_score() function to evaluate a model using
repeated stratified k-fold cross-validation. To use our custom performance metric, we can
use the make_scorer() scikit-learn function that takes our custom function and
creates a metric that we can use to evaluate models with the scikit-learn API.
We will set the
needs_proba argument to True to ensure that models that are evaluated make predictions using
the predict_proba() function, so that they give probabilities instead of class labels.
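
> One way to see what the scorer does is to write it by hand. scikit-learn also accepts a plain callable with the signature (estimator, X, y) as the scoring argument, so the sketch below calls predict_proba() itself and passes the positive-class column to the metric. The name `bss_scorer`, the synthetic data, and the choice to compute the reference rate from the labels being scored are all assumptions for illustration, not the notebook's method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import cross_val_score

def brier_skill_score(y_true, y_prob):
    # reference forecast: the positive-class rate of the labels being scored
    # (an assumption here; the notebook hard-codes 0.26471)
    ref_probs = np.full(len(y_true), np.mean(y_true))
    bs_ref = brier_score_loss(y_true, ref_probs)
    return 1.0 - brier_score_loss(y_true, y_prob) / bs_ref

def bss_scorer(estimator, X, y):
    # hypothetical hand-written scorer: take the probability of the positive class
    # (second column of predict_proba) and pass it to the metric
    y_prob = estimator.predict_proba(X)[:, 1]
    return brier_skill_score(y, y_prob)

# synthetic data: one informative feature, roughly a third of the labels positive
rng = np.random.RandomState(1)
X_demo = rng.normal(size=(200, 1))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0.5).astype(int)
scores = cross_val_score(LogisticRegression(), X_demo, y_demo, scoring=bss_scorer, cv=5)
print('Mean BSS: %.3f' % scores.mean())
```

> Because the feature is informative, the mean BSS should come out above zero. A plain callable like this does not depend on the arguments of make_scorer(), which may be convenient across scikit-learn versions.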

In [None]:
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define the model evaluation metric
    # note: newer scikit-learn releases replace needs_proba=True with response_method='predict_proba'
    metric = make_scorer(brier_skill_score, needs_proba=True)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)
    return scores

In [None]:
# summarize the loaded dataset
print(X.shape, y.shape, Counter(y))

> We will evaluate the baseline strategy of predicting the distribution of positive
examples in the training set as the probability of each case in the test set. This can be
implemented automatically using the DummyClassifier class and setting the strategy to
‘prior’, which will predict the prior probability of each class in the training dataset.
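
> A short sketch of what the ‘prior’ strategy returns, using stand-in labels (225 negatives and 81 positives, an assumption that mirrors the class counts). The features are ignored, and every row gets the class frequencies observed during fit.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# assumed stand-in labels: 225 negatives and 81 positives; features are ignored by the dummy model
y_demo = np.array([0] * 225 + [1] * 81)
X_demo = np.zeros((len(y_demo), 3))
baseline = DummyClassifier(strategy='prior')
baseline.fit(X_demo, y_demo)
# every row receives the same probabilities: the class frequencies seen during fit
probs = baseline.predict_proba(X_demo[:2])
print(probs)
```

> Since the positive-class probability is essentially the same as the reference forecast inside brier_skill_score(), the baseline's BSS should land close to 0.0.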

In [None]:
from sklearn.dummy import DummyClassifier
# define the reference model
model = DummyClassifier(strategy='prior')
# evaluate the model
scores = evaluate_model(X, y, model)
print('Mean BSS: %.3f (%.3f)' % (mean(scores), std(scores)))

> I will evaluate a suite of models that are known to be effective at predicting probabilities.
Specifically, these are models that are fit under a probabilistic framework and explicitly predict a
calibrated probability for each example.
I will compare the algorithms based on their mean score, as well as on their distribution of scores.
I can define a function to create the models that I want to evaluate, each with its default
configuration or configured so as not to produce a warning.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.gaussian_process import GaussianProcessClassifier

In [None]:
# define models to test
def get_models():
    models, names = list(), list()
    # LR
    models.append(LogisticRegression(solver='lbfgs'))
    names.append('LR')
    # LDA
    models.append(LinearDiscriminantAnalysis())
    names.append('LDA')
    # QDA
    models.append(QuadraticDiscriminantAnalysis())
    names.append('QDA')
    # GNB
    models.append(GaussianNB())
    names.append('GNB')
    # MNB
    models.append(MultinomialNB())
    names.append('MNB')
    # GPC
    models.append(GaussianProcessClassifier())
    names.append('GPC')
    return models, names

> Running the cell below will first summarize the mean and standard deviation of the BSS for each
algorithm (larger scores are better), and then plot the distribution of scores.

In [None]:
# define models
models, names = get_models()
results = list()
# evaluate each model
for i in range(len(models)):
    # evaluate the model and store results
    scores = evaluate_model(X, y, models[i])
    results.append(scores)
    # summarize and store
    print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
# plot the results
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

> In my case, the results suggest that only two of the algorithms are not skillful, showing
negative scores, and that perhaps the LR and LDA algorithms are the best performing.

**It can be a good practice to scale data for some algorithms if the variables have different units
of measure, as they do in this case. Algorithms like the LR and LDA are sensitive to the
distribution of the data and assume a Gaussian distribution for the input variables, which we
don’t have in all cases.
Nevertheless, we can test the algorithms with standardization, where each variable is shifted
to a zero mean and unit standard deviation. We will drop the MNB algorithm as it does not
support negative input values. We can achieve this by wrapping each model in a Pipeline
where the first step is a StandardScaler, which will correctly be fit on the training dataset and
applied to the test dataset within each k-fold cross-validation evaluation, preventing any data
leakage.**
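
> The leakage point can be checked directly. In the sketch below, synthetic arrays stand in for one training fold and one test fold (the column scales are invented to loosely resemble age, year and nodes). After fitting the Pipeline, the scaler's statistics match the training rows only, and the test rows are transformed with those same statistics.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for a training fold and a test fold, with a different scale per column
rng = np.random.RandomState(1)
X_train = rng.normal(loc=[50.0, 60.0, 4.0], scale=[10.0, 3.0, 7.0], size=(100, 3))
y_train = (X_train[:, 2] > 4.0).astype(int)
X_test = rng.normal(loc=[50.0, 60.0, 4.0], scale=[10.0, 3.0, 7.0], size=(20, 3))

pipeline = Pipeline(steps=[('t', StandardScaler()), ('m', LogisticRegression(solver='lbfgs'))])
pipeline.fit(X_train, y_train)
# the scaler's statistics come from the training rows only
scaler = pipeline.named_steps['t']
print(scaler.mean_)
# the test rows are transformed with the training statistics, not their own
X_test_scaled = scaler.transform(X_test)
probs = pipeline.predict_proba(X_test)
print(probs[:3])
```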

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [None]:
# define models to test
def get_models():
    models, names = list(), list()
    # LR
    models.append(LogisticRegression(solver='lbfgs'))
    names.append('LR')
    # LDA
    models.append(LinearDiscriminantAnalysis())
    names.append('LDA')
    # QDA
    models.append(QuadraticDiscriminantAnalysis())
    names.append('QDA')
    # GNB
    models.append(GaussianNB())
    names.append('GNB')
    # GPC
    models.append(GaussianProcessClassifier())
    names.append('GPC')
    return models, names

In [None]:
# define models
models, names = get_models()
results = list()
# evaluate each model
for i in range(len(models)):
    # create a pipeline
    pipeline = Pipeline(steps=[('t', StandardScaler()),('m',models[i])])
    # evaluate the model and store results
    scores = evaluate_model(X, y, pipeline)
    results.append(scores)
    # summarize and store
    print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
# plot the results
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

> We can see that the standardization has not had much of an impact on the
algorithms, except the GPC. The performance of the GPC with standardization has shot up
and is now the best-performing technique.

**Model Evaluation With Power Transform**
> Power transforms, such as the Box-Cox and Yeo-Johnson transforms, are designed to change
the distribution to be more Gaussian. We can use the
PowerTransformer scikit-learn class to perform the Yeo-Johnson transform and automatically
determine the best parameters to apply based on the dataset. Importantly, this transformer
will also standardize the dataset as part of the transform.

> We have zero values in our dataset, therefore we will scale the dataset prior to the power
transform using a MinMaxScaler. Again, we can use this transform in a Pipeline to ensure it
is fit on the training dataset and applied to the train and test datasets correctly, without data
leakage.
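
> To get a feel for what the two transforms do together, the sketch below applies them to a synthetic right-skewed variable standing in for nodes (an assumption; it is not the real column). The helper `skewness` is a hypothetical convenience function that computes the third standardized moment.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PowerTransformer

def skewness(x):
    # sample skewness: third standardized moment
    x = np.asarray(x, dtype=float)
    return float(np.mean((x - x.mean()) ** 3) / x.std() ** 3)

# synthetic right-skewed counts standing in for the nodes variable (an assumption)
rng = np.random.RandomState(1)
nodes_like = rng.exponential(scale=4.0, size=(300, 1)).round()

# scale to [0, 1], then apply the default Yeo-Johnson transform with standardization
transform = Pipeline(steps=[('t1', MinMaxScaler()), ('t2', PowerTransformer())])
transformed = transform.fit_transform(nodes_like)
print('skew before: %.2f, after: %.2f' % (skewness(nodes_like[:, 0]), skewness(transformed[:, 0])))
print('mean after: %.3f, std after: %.3f' % (transformed.mean(), transformed.std()))
```

> The transformed values should be less skewed than the input and come out with zero mean and unit standard deviation, which is why no separate StandardScaler step is needed after the PowerTransformer.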

In [None]:
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import MinMaxScaler

In [None]:
# define models to test
def get_models():
    models, names = list(), list()
    # LR
    models.append(LogisticRegression(solver='lbfgs'))
    names.append('LR')
    # LDA
    models.append(LinearDiscriminantAnalysis())
    names.append('LDA')
    # GPC
    models.append(GaussianProcessClassifier())
    names.append('GPC')
    return models, names

In [None]:
# define models
models, names = get_models()
results = list()
# evaluate each model
for i in range(len(models)):
    # create a pipeline
    steps = [('t1', MinMaxScaler()), ('t2', PowerTransformer()),('m',models[i])]
    pipeline = Pipeline(steps=steps)
    # evaluate the model and store results
    scores = evaluate_model(X, y, pipeline)
    results.append(scores)
    # summarize and store
    print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
# plot the results
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

**We can see a further lift in model skill for the three models that were evaluated.
We can see that the LR appears to have outperformed the other two methods.**

**Make Prediction on New Data**
> We will select the Logistic Regression model with a power transform on the input data as our
final model. We can define and fit this model on the entire training dataset.

In [None]:
# fit the model
steps = [('t1', MinMaxScaler()),('t2', PowerTransformer()),('m',LogisticRegression(solver='lbfgs'))]
model = Pipeline(steps=steps)
model.fit(X, y)
# some survival cases
print('Survival Cases:')
data = [[31,59,2], [31,65,4], [34,60,1]]
for row in data:
    # make prediction
    yhat = model.predict_proba([row])
    # get percentage of survival
    p_survive = yhat[0, 0] * 100
    # summarize
    print('>data=%s, Survival=%.3f%%' % (row, p_survive))
# some non-survival cases
print('Non-Survival Cases:')
data = [[44,64,6], [34,66,9], [38,69,21]]
for row in data:
    # make prediction
    yhat = model.predict_proba([row])
    # get percentage of survival
    p_survive = yhat[0, 0] * 100
    # summarize
    print('>data=%s, Survival=%.3f%%' % (row, p_survive))

> **We can see that for the chosen survival cases, the probability of survival was
high, between 76 percent and 86 percent. Then some cases of non-survival are used as input to
the model and the probability of survival is predicted. As we might have hoped, the probability
of survival is modest, hovering around 52 percent to 63 percent.**
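
> The survival probability is read from the first column of predict_proba() because the columns follow the order of the model's classes_ attribute, which is [0, 1] after label encoding. The sketch below checks this on a synthetic stand-in model (the data and the name `model_demo` are assumptions) and confirms that the survival probability equals the complement of the non-survival probability.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic stand-in: one feature where larger values make class 1 (non-survival) more likely
rng = np.random.RandomState(1)
X_demo = rng.normal(size=(200, 1))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0.5).astype(int)
model_demo = LogisticRegression(solver='lbfgs').fit(X_demo, y_demo)

# columns of predict_proba follow the order of classes_: [0 (survival), 1 (non-survival)]
yhat = model_demo.predict_proba([[0.0]])
p_survive = yhat[0, 0]
p_not_survive = yhat[0, 1]
print(model_demo.classes_)
print('Survival=%.3f%%, complement of non-survival=%.3f%%' % (p_survive * 100, (1.0 - p_not_survive) * 100))
```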