## Synopsis

This dataset, from the University of Wisconsin, was derived from 569 digitised images of breast mass tissue. The problem presented is a binary classification task with two outcomes, malignant and benign. The data consists of 30 real-valued features.

> "Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image." [1](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29)


## Goal

Accurately determine the diagnosis (M = malignant, B = benign) of potentially cancerous breast tissue cells based on their features.



<h1 id='Initialisation'>
1. Initialisation
<a class="anchor-link" href='https://www.kaggle.com/mr11235/breast-cancer-wisconsin-classification'>¶</a>
</h1>

In [None]:
#Numpy, Pandas
import numpy as np
import pandas as pd

!pip install seaborn --upgrade

#Visualisation libraries
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

plt.rcParams["figure.figsize"] = [10, 6]

In [None]:
breast_cancer_data = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')

breast_cancer_data.head()

In [None]:
breast_cancer_data.shape

<h1 id='Preparing training/Test Sets'>
1.1 Preparing training/Test Sets
<a class="anchor-link" href='https://www.kaggle.com/mr11235/breast-cancer-wisconsin-classification'>¶</a>
</h1>

The dataset is split approximately 70%/30% into train and test sets. Cross-validation is used to evaluate the models, so I did not set aside a separate validation set. Both splits are stratified with respect to the diagnosis: the dataset is imbalanced, so it is important that the same class proportions appear in the training and test data. The id column carries no useful information for the learning algorithms, so it is dropped, along with the all-NaN 'Unnamed: 32' column. The diagnosis column is mapped to [0, 1], for the benign and malignant outcomes respectively. Here, I:

* Drop the id and 'Unnamed: 32' columns, as they carry no predictive information
* Split the data into train/test sets, stratifying on the diagnosis
* Map the target variable to numeric values for analysis
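
As a minimal sketch of what stratification buys, the example below splits a synthetic label vector (not the real dataset; the class counts are illustrative and only mimic a roughly 63%/37% imbalance) and compares the malignant share in each part. The names `X_demo` and `y_demo` are made up for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 0 = benign, 1 = malignant,
# with a roughly 63%/37% imbalance (illustrative counts).
y_demo = np.array([0] * 357 + [1] * 212)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # dummy single feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=101
)

# With stratify, the malignant share is almost identical in both parts.
print('overall:', round(y_demo.mean(), 3))
print('train  :', round(y_tr.mean(), 3))
print('test   :', round(y_te.mean(), 3))
```

Without `stratify`, the two shares can drift apart by chance, which matters more the smaller the test set is.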

In [None]:
X=breast_cancer_data.drop('diagnosis', axis=1)
y=breast_cancer_data['diagnosis'].copy()

X.head()

In [None]:
breast_cancer_data.drop('Unnamed: 32', axis=1, inplace=True)
breast_cancer_data.drop('id', axis=1, inplace=True)

from sklearn.model_selection import train_test_split

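#Same test_size, stratify and random_state as the X/y split below, so that train and X_train contain the same rows
train, test = train_test_split(breast_cancer_data,
                               test_size=0.3, 
                               stratify = breast_cancer_data['diagnosis'],
                               random_state=101)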

X=breast_cancer_data.drop('diagnosis', axis=1).copy()
y=breast_cancer_data['diagnosis'].copy()
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.3, 
                                                    stratify = y, 
                                                    random_state=101)
#train size: 70%
#test size: 30%

#Work on an explicit copy so the mapping does not trigger a SettingWithCopyWarning
train = train.copy()
train['diagnosis'] = train['diagnosis'].map({'M':1, 'B':0})

<h1 id='Missing Data'>
1.2 Missing Data
<a class="anchor-link" href='https://www.kaggle.com/mr11235/breast-cancer-wisconsin-classification'>¶</a>
</h1>

There is one column with all of its entries as NaN: 'Unnamed: 32'. It was dropped in the previous step, so the check below should confirm that no missing values remain.

In [None]:
breast_cancer_data.isnull().sum().any()
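
A more specific check is to list the columns in which every entry is missing. The sketch below does this on a small toy frame with hypothetical values (not the real dataset); `demo`, `empty_cols` and `cleaned` are names made up for the illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one entirely empty column,
# similar in spirit to 'Unnamed: 32'.
demo = pd.DataFrame({
    'radius_mean': [17.99, 20.57, 19.69],
    'texture_mean': [10.38, 17.77, 21.25],
    'Unnamed: 32': [np.nan, np.nan, np.nan],
})

# Columns in which every entry is missing
empty_cols = demo.columns[demo.isna().all()].tolist()
print(empty_cols)

# Drop them and confirm that nothing is missing afterwards
cleaned = demo.drop(columns=empty_cols)
print(cleaned.isna().sum().any())
```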

<h1 id='Training Data Analysis'>
2. Training Data Analysis
<a class="anchor-link" href='https://www.kaggle.com/mr11235/breast-cancer-wisconsin-classification'>¶</a>
</h1>

Goals:

* Look at the proportion of each diagnosis
* Gather statistical insight about the training data (mainly the distribution of every feature, to determine a scaler for the data)
* See which features correlate the most with the diagnosis

In [None]:
train.head()

In [None]:
corr_values = X_train.corrwith(train['diagnosis']).sort_values(ascending=False)
corr_values = pd.DataFrame(corr_values, columns = ['corr_w_diagnosis'])
corr_values = corr_values.reset_index().rename(columns={'index':'features'})
corr_values['features'] = corr_values['features'].astype(str)
corr_values

There are many features which have a low correlation to the target variable. However, the features are also correlated with each other (multicollinearity), which the pairwise correlations above do not pick up on. In other words, how do the features affect each other in producing each diagnosis? Could a cell's mean texture and mean fractal dimension together strongly influence the diagnosis? As I'm not an expert in this field, I calculate the **variance inflation factor** (VIF) for each feature and use it to decide which features to remove from the training and testing datasets. I could also have created a correlation matrix, but I wanted a more robust method, especially for a topic like this (predicting a cancer diagnosis is not a trivial thing in the real world!).
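
To make the idea concrete, here is a from-scratch sketch of the usual definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. It is an illustration on synthetic data, not the statsmodels implementation used in this notebook; the helper name `vif` and the choice to add an intercept are assumptions of this sketch.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of a 2-D array X.

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on all the other columns (plus an intercept).
    """
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    factors = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])  # add intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r_squared = 1.0 - residual.var() / target.var()
        factors.append(1.0 / (1.0 - r_squared))
    return np.array(factors)

# Synthetic example: 'b' is almost a copy of 'a', 'c' is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)
c = rng.normal(size=500)
vif_demo = vif(np.column_stack([a, b, c]))
print(vif_demo)
```

The two nearly identical columns get a large VIF, while the independent column stays close to 1.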

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_corr = pd.DataFrame()
vif_corr['features'] = X_train.columns
vif_corr['vif_value'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif_corr.sort_values('vif_value', ascending = False)
vif_corr['features'] = vif_corr['features'].astype(str)
vif_corr = pd.merge(corr_values, vif_corr, on = 'features')
vif_corr.sort_values(by = 'corr_w_diagnosis', ascending = False)

It seems that the VIF ranking of the features roughly matches the correlation ranking. There are exceptions, though: concave points_worst, for example, has a much lower VIF value, so it might not actually be that useful at all.

In [None]:
vif_corr.sort_values(by = 'vif_value', ascending = False)

I filter out the features with both a low correlation to the target and a low VIF value (less than 0.6 and 200 respectively; these thresholds can be changed if needed).

In [None]:
features_to_drop = vif_corr.loc[(vif_corr['corr_w_diagnosis'] < 0.6) & (vif_corr['vif_value'] < 200)]
features_to_drop

In [None]:
list(features_to_drop['features'].values)

In [None]:
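import plotly.express as px

#Count each class explicitly, sorted by label, so the names always match the values (0 = Benign, 1 = Malignant)
diagnosis_counts = train['diagnosis'].value_counts().sort_index()

fig = px.pie(values=diagnosis_counts.values, 
             names = ['Benign', 'Malignant'], 
             title='Proportion of diagnoses in training data', 
             width=800, 
             height=400
            )
fig.show()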

In [None]:
train['diagnosis'].map({1:'Malignant', 0:'Benign'})

Visualising all of the features vs the diagnosis, separated into mean, standard-error and worst groups.

In [None]:
#splitting feature columns into mean, se and worst groups
#(the columns are ordered: diagnosis, 10 mean, 10 se, 10 worst)
feature_list = list(train.columns)
feature_array = np.array_split(feature_list, 3)

mean = list(feature_array[0])
mean.remove('diagnosis')

se = list(feature_array[1])

worst = list(feature_array[2])

feature_lists = [mean, se, worst]
for feat in feature_lists:
    fig, axs = plt.subplots(ncols = 1, nrows = 10, sharey = True, figsize = (12,40))
    for i in range(0,10):
        sns.scatterplot( x = feat[i], y = 'diagnosis', data = train, ax = axs[i] )

Now let's look at the distribution of every feature:

In [None]:
pd.options.plotting.backend = 'matplotlib'
X_train.hist(bins=30, figsize=(20,15))
plt.suptitle('Distribution of features before PowerTransformer transform')    #suptitle titles the whole figure, not just the last subplot
plt.show()

After testing StandardScaler, MinMaxScaler and PowerTransformer, I found that my trained models performed better when PowerTransformer was used to scale the data.
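
As a rough illustration of what PowerTransformer does to a skewed feature, the sketch below applies it to a synthetic log-normal sample (not a column of the real dataset) and compares the skewness before and after. The helper `skewness` is written here for the example only.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

def skewness(values):
    """Sample skewness: the third standardised moment."""
    values = np.asarray(values, dtype=float)
    centred = values - values.mean()
    return (centred ** 3).mean() / (centred ** 2).mean() ** 1.5

# Synthetic right-skewed feature, standing in for a skewed measurement.
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

# PowerTransformer defaults to the Yeo-Johnson transform and
# standardises the output to zero mean and unit variance.
transformed = PowerTransformer().fit_transform(skewed)

print('skew before:', round(skewness(skewed[:, 0]), 2))
print('skew after :', round(skewness(transformed[:, 0]), 2))
```

StandardScaler and MinMaxScaler only shift and rescale a feature, so they leave its skewness unchanged; that is one plausible reason the power transform helped here.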

In [None]:
from sklearn.preprocessing import PowerTransformer

X_train_hist = X_train.copy()

#PowerTransformer transform 
scaler = PowerTransformer()
hist_data = pd.DataFrame(scaler.fit_transform(X_train))

hist_data.hist(bins = 30, figsize = (20,15))
plt.suptitle('Distribution of features after PowerTransformer transform')    #suptitle titles the whole figure, not just the last subplot
plt.show()

<h1 id='Transforming Data'>
2.1. Observations
<a class="anchor-link" href='https://www.kaggle.com/mr11235/breast-cancer-wisconsin-classification'>¶</a>
</h1>

* This is an imbalanced dataset so data should be stratified accordingly.
* For most individual features, the value ranges of benign and malignant cases overlap considerably.
* 3 of the most correlated features to the diagnosis are 'worst' features.
* There are 4 features which are negatively correlated to the target.

<h1 id='Transforming Data'>
3. Transforming Data
<a class="anchor-link" href='https://www.kaggle.com/mr11235/breast-cancer-wisconsin-classification'>¶</a>
</h1>

I made custom transformers for the X and y data that do the following:

* Drop rows with missing (or infinite) values
* Drop the 'diagnosis', 'id' and 'Unnamed: 32' columns if they are still present
* Drop the features with a low correlation to the target and a low VIF value
* Map the target variable to numeric values

In [None]:
def custom_X_transformer(X):
    #Drop irrelevant columns if they haven't already been dropped
    #(errors='ignore' skips columns that are not present, replacing the bare try/except blocks)
    X = X.drop(columns=['diagnosis', 'id', 'Unnamed: 32'], errors='ignore')

    #Drop features with low correlation and VIF values
    X = X.drop(columns=list(features_to_drop['features'].values), errors='ignore')

    #Drop rows with missing or infinite values
    #(done after the column drops so an all-NaN column cannot remove every row)
    X = X.replace([np.inf, -np.inf], np.nan)
    X = X.dropna()

    return X

def custom_y_transformer(y):
    #Map diagnosis to numerical values
    y = y.map({'M':1, 'B':0})
    y = y.dropna()
    
    return y

In [None]:
X_train_tr = custom_X_transformer(X_train)
X_test_tr = custom_X_transformer(X_test)

y_train_tr = custom_y_transformer(y_train)
y_test_tr = custom_y_transformer(y_test)

<h1 id='Model Training'>
4. Model Training
<a class="anchor-link" href='https://www.kaggle.com/mr11235/breast-cancer-wisconsin-classification'>¶</a>
</h1>

So, on to the models. In a real-world scenario, we do not want to give the all-clear to someone who is later diagnosed with breast cancer. In the hyperparameter tuning section, I will attempt to minimize the number of false negatives and so increase the **sensitivity** of the trained models below. In short, the models will predict a malignant outcome more often than the training data indicates. However, of the algorithms used here, only <code>LogisticRegression()</code>, <code>RandomForestClassifier()</code> and <code>SVC()</code> support <code>class_weight</code> as a hyperparameter. The rest of the algorithms will be trained without class weight tuning.
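
The effect of class weights on sensitivity can be sketched on synthetic data. The example below is an illustration only (overlapping Gaussian blobs, not the real dataset): it fits LogisticRegression with and without the `{0: 1, 1: 30}` weighting and compares recall on the malignant class. The names `X_demo`, `y_demo`, `plain` and `weighted` are made up for this sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic, overlapping two-class data (not the real dataset):
# 300 'benign' points around 0 and 100 'malignant' points around 1.5.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(300, 2)),
                    rng.normal(1.5, 1.0, size=(100, 2))])
y_demo = np.array([0] * 300 + [1] * 100)

plain = LogisticRegression().fit(X_demo, y_demo)
weighted = LogisticRegression(class_weight={0: 1, 1: 30}).fit(X_demo, y_demo)

# Sensitivity = recall on the malignant class, measured on the same data
recall_plain = recall_score(y_demo, plain.predict(X_demo))
recall_weighted = recall_score(y_demo, weighted.predict(X_demo))
print('sensitivity without class weights:', round(recall_plain, 3))
print('sensitivity with class weights   :', round(recall_weighted, 3))
```

The trade-off is that the weighted model also flags more benign cases as malignant, so sensitivity rises at the cost of more false positives.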

I have chosen a handful of classifier models to train, keeping the list short to conserve running time. In particular, I chose two boosting algorithms to see if they would be an improvement over other stacked models or deep learning models.

* Define LogisticRegression, RandomForestClassifier, SVC, KNeighborsClassifier, AdaBoostClassifier and GradientBoostingClassifier pipelines
* Scale the data using PowerTransformer
* Review their cross validated accuracy scores with default hyperparameters
* Tune hyperparameters of each model 
* Stack models with default hyperparameters
* Define and train a multi-layer-perceptron model on the data
* Evaluate the best of these models on the test set

In [None]:
np.seterr(divide = 'ignore')    #Using powerTransformer gave a 'Divide by zero' error, as a naive workaround I just removed the error and still got good results.

In [None]:
#Weight malignant diagnoses more heavily than benign ones, to decrease the number of false-negatives
#Weights proportional to inverse class frequency would be ~ 0:3.7, 1:6.3
class_weights = {0:1, 1:30}

In [None]:
#Algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
#Boosting algorithms
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier

def get_models():
    models = {
        'LogisticRegression': LogisticRegression(class_weight = class_weights),
        'RandomForestClassifier': RandomForestClassifier(class_weight = class_weights),
        'SVC': SVC(class_weight = class_weights),
        'KNeighborsClassifier': KNeighborsClassifier(),
        'AdaBoostClassifier': AdaBoostClassifier(),
        'GradientBoostingClassifier': GradientBoostingClassifier()
    }
    
    return models, class_weights

I decided not to use Principal Component Analysis in my pipelines as its use had a negligible effect on the final models.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer
from sklearn.decomposition import PCA

def get_pipelines():
    models = get_models()[0].values()
    model_names = get_models()[0].keys()
    pipe_dict = {}
    for model, name in zip(models, model_names):
        pipe_dict[name] = Pipeline([
                                    ('scaler', PowerTransformer()),
                                    ('classifier', model) 
                                    ])
    return pipe_dict

<h1 id='Training Base Models'>
4.1 Training Base Models
<a class="anchor-link" href='https://www.kaggle.com/mr11235/breast-cancer-wisconsin-classification'>¶</a>
</h1>

In [None]:
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict

def train_base_models(X_train, y_train):
    models = get_models()[0]    #Gets only the models from get_models, not the class_weights.
    pipelines = list(get_pipelines().values())    
    names = list(models.keys())
    for pipe, name in zip(pipelines, names):
        pipe.fit(X_train, y_train)
        score = cross_val_score(pipe, X_train, y_train, cv = 3)    
        
        print( 'Accuracy scores for {}: {}'.format(name, score) )
        print( 'Average accuracy score for {}: {}'.format(name, sum(score)/3) )
        print('')

In [None]:
train_base_models(X_train_tr, y_train_tr)

<h1 id='Hyperparameter Tuning Using BayesSearchCV'>
4.2 Hyperparameter Tuning Using BayesSearchCV
<a class="anchor-link" href='https://www.kaggle.com/mr11235/breast-cancer-wisconsin-classification'>¶</a>
</h1>

After training the baseline models, it's time to improve their accuracy with hyperparameter tuning. I did not use RandomizedSearchCV here because, while it was less time-consuming to run, its results were often poorer than those of the base models. GridSearchCV would simply have taken too much time to run. Instead, I used BayesSearchCV, as it is more time-efficient and gave better accuracy for the tuned models. Here I use skopt, but hyperopt is also a good alternative library. After playing around with the hyperparameters and training the models, I found 



In [None]:
from skopt.space import Real, Integer, Categorical
from skopt.utils import use_named_args

#0: benign, 1:malignant
#The class weights set in get_models() weight a malignant diagnosis 30x more heavily than a benign one. This reduces the number of false-negatives.

lr_search = {
    'classifier__C': Real(1e-2, 1e2),
    'classifier__max_iter': Integer(50, 10000),
    'classifier__solver': Categorical(['lbfgs', 'liblinear'])
}
rf_search = {
    'classifier__n_estimators': Integer(50, 300),
    'classifier__min_impurity_decrease': Real(1e-2, 1e-1),
    'classifier__max_depth': Integer(1, 100),    #max_depth must be an integer
    'classifier__min_samples_split': Real(1e-6, 1),
    'classifier__min_samples_leaf': Real(1e-3, 0.5)
}
svc_search = {
    'classifier__C': Real(1e-2, 1e3),
    'classifier__kernel': Categorical(['linear', 'rbf']),
    'classifier__gamma': Categorical(['scale', 'auto'])
}
knn_search = {
    'classifier__n_neighbors': Integer(1, 50),
    'classifier__leaf_size': Integer(1, 100),
    'classifier__p': Categorical([1,2])
}
adaboost_search = {
    'classifier__n_estimators': Integer(1, 200),
    'classifier__learning_rate': Real(1e-2, 1e-1)
}
gradboost_search = {
    'classifier__n_estimators': Integer(1, 200),
    'classifier__loss': Categorical(['deviance', 'exponential']),    #'deviance' is named 'log_loss' from scikit-learn 1.1 onwards
    'classifier__learning_rate': Real(1e-3, 1e-1),
    'classifier__min_samples_split': Integer(2, 10),
    'classifier__min_samples_leaf': Real(1e-3, 0.5)
}

In [None]:
from skopt import BayesSearchCV

from sklearn.utils._testing import ignore_warnings    #sklearn.utils.testing was removed in scikit-learn 0.24
from sklearn.exceptions import ConvergenceWarning

@ignore_warnings(category = ConvergenceWarning)
def bayes_param_search(X_train, y_train):
    models = get_models()[0]
    pipelines = list(get_pipelines().values())
    searches = [lr_search, rf_search, svc_search, knn_search, adaboost_search, gradboost_search]
    best_params = []    #stores the best_params_ of every model pipeline
    for i in range( 0, len(searches) ):
        opt = BayesSearchCV(
            pipelines[i],
            searches[i],
            cv=3,    
            n_iter=100,    #number of settings that are tried, i found that any more than this was unnecessary.
            random_state=2021
        )
        opt.fit(X_train, y_train)
        best_params.append(opt.best_params_)

        print( 'valid score: {}'.format(opt.best_score_) )
        print( 'best params: {}'.format(str(opt.best_params_)) )
        print('')
        
    return best_params

In [None]:
@ignore_warnings(category = ConvergenceWarning)
def train_opt_models(X_train, y_train, X_test, y_test):
    opt_models = []
    best_params_list = bayes_param_search(X_train, y_train)
    models = get_models()[0]
    pipelines = list(get_pipelines().values())   #retreives all of the pipelines which are included in the bayes search
    names = list(models.keys())
    for name, pipe, params in zip(names, pipelines, best_params_list):
        pipe = pipe.set_params(**params)
        pipe.fit(X_train, y_train)
        
        #scoring opt model        
        train_score = cross_val_score(pipe, X_train, y_train, cv = 3)
        test_score = pipe.score(X_test, y_test)

        print( 'Accuracy scores for {}: {}'.format(str(name), train_score) )
        print( 'Average accuracy score for {}: {}'.format(str(name), sum(train_score)/3) )
        print( 'Testing accuracy score: {}'.format(test_score) )
        print('')

        #store model for later use if needed
        opt_models.append(pipe)
        
    return opt_models

In [None]:
train_opt_models(X_train_tr, y_train_tr, X_test_tr, y_test_tr)

<h1 id='Stacking Base Models'>
4.3 Stacking Base Models
<a class="anchor-link" href='https://www.kaggle.com/mr11235/breast-cancer-wisconsin-classification'>¶</a>
</h1>

This next function stacks the following two pairs of baseline models:

* AdaBoostClassifier
* RandomForestClassifier

and, 

* SVC
* AdaBoostClassifier

After testing different combinations of models, with and without class weights, and with different final estimators (i.e. the model used to combine all of the estimators), I quickly realised that trying all possible combinations of model stacking would be very cumbersome. So, in the end I chose some 'good' performing model stacks and combined them with an AdaBoostClassifier final estimator.

In [None]:
from sklearn.ensemble import StackingClassifier
from sklearn.pipeline import make_pipeline

@ignore_warnings(category = ConvergenceWarning)
def train_model_stacker(X_train, y_train, X_test, y_test):
    stacked_models = []
    class_weights = get_models()[1]
    stacker_names = ['estimator0', 'estimator1']
    estimators0 = [
        ('AdaBoostClassifier', AdaBoostClassifier()),
        ('RandomForestClassifier', make_pipeline( PowerTransformer(), RandomForestClassifier(class_weight = class_weights) ))]
    estimators1 = [
        ('SVC', SVC(class_weight = class_weights)),
        ('AdaBoostClassifier', make_pipeline( PowerTransformer(), AdaBoostClassifier() ))
    ]
    estimators_list = [estimators0, estimators1]
    
    for estimator, name in zip(estimators_list, stacker_names):
        stacker = StackingClassifier(
                                    estimators = estimator,
                                    final_estimator =  AdaBoostClassifier()
                                    )
        stacker.fit(X_train, y_train)    #use the function arguments rather than the global variables
        
        #train score
        train_score = cross_val_score(stacker, X_train, y_train, cv = 3)
        print( 'Training accuracy scores for {}: {}'.format(name, train_score) )
        print( 'Average training accuracy score for {}: {}'.format(name, sum(train_score)/3) )
        
        #test score
        test_score = stacker.score(X_test, y_test)
        print( 'Testing accuracy scores for {}: {}'.format(name, test_score) )
        print('')
        
        #store model for later use if needed
        stacked_models.append(stacker)

    return stacked_models

In [None]:
train_model_stacker(X_train_tr, y_train_tr, X_test_tr, y_test_tr)

<h1 id='MLP Model'>
4.4 MLP Model
<a class="anchor-link" href='https://www.kaggle.com/mr11235/breast-cancer-wisconsin-classification'>¶</a>
</h1>

The final model is a simple deep learning classifier. Again, I tweaked its inputs to improve performance:

Layers:
* <code>units</code> (the number of neurons in each layer) are set to (12, 8, 1). Finding the units that gave the best accuracy took some tinkering. 
* The first layer's <code>input_dim</code> is set to the number of features in the training set, as is the norm; the output node is the 'classifier' node (it outputs a probability between 0 and 1, which is later thresholded at 0.5)
* <code>kernel_initializer</code> controls how the weights of each layer are initialised. 'random_uniform' draws the initial weights from a uniform distribution.

Fitting the model:
* <code>epochs</code> is the number of passes of the training data through the model. After tinkering with this value, I found that going above 200 had no impact.
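
To show what the (12, 8, 1) architecture amounts to, here is a plain NumPy sketch of its forward pass and parameter count. It is an illustration, not the Keras model itself: the input width of 10 and the uniform range of ±0.05 are assumptions made for this example, and the helper names are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_layer(rng, n_in, n_out, limit=0.05):
    """Weights drawn uniformly from [-limit, limit]; biases start at zero."""
    return rng.uniform(-limit, limit, size=(n_in, n_out)), np.zeros(n_out)

def forward(x, layers):
    """Hidden layers use ReLU; the single output unit uses a sigmoid."""
    *hidden, (w_out, b_out) = layers
    for w, b in hidden:
        x = relu(x @ w + b)
    return sigmoid(x @ w_out + b_out)

n_features = 10  # hypothetical input width
rng = np.random.default_rng(0)
sizes = [n_features, 12, 8, 1]
layers = [init_layer(rng, a, b) for a, b in zip(sizes[:-1], sizes[1:])]

n_params = sum(w.size + b.size for w, b in layers)
probs = forward(rng.normal(size=(4, n_features)), layers)
print(n_params)      # (10*12 + 12) + (12*8 + 8) + (8*1 + 1) = 245
print(probs.shape)   # (4, 1)
```

Each row of `probs` is a value strictly between 0 and 1, which is why a threshold is needed to turn it into a 0/1 diagnosis.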

In [None]:
#Weight malignant diagnoses more heavily than benign ones, to decrease the number of false-negatives
class_weight = {0:1 , 1:30}

In [None]:
from keras.models import Sequential
from keras.layers import Dense
import tensorflow as tf
from sklearn.model_selection import train_test_split

def train_MLP_model(X_train, y_train, X_test, y_test):
    
    X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.3, random_state=2021)
    
    model = Sequential()
    model.add(Dense(units = 12,
                    input_dim = len(X_train.columns),
                    kernel_initializer = 'random_uniform',
                    activation ='relu'
                   )
             )
    model.add(Dense(units = 8,
                    kernel_initializer = 'random_uniform',
                    activation ='relu'
                   )
             )
    model.add(Dense(1,
                    kernel_initializer = 'random_uniform',
                    activation ='sigmoid'
                   )
             )
    model.compile(
        optimizer = 'adam', 
        loss = 'binary_crossentropy', 
        metrics = ['accuracy']
    )
    history = model.fit(
                X_train,
                y_train,
                batch_size = 16,
                epochs = 100,
                verbose = 1,
                class_weight = class_weight
            )
    
    #validation score (on the part of the training data held out above)
    train_score = model.evaluate(X_valid, y_valid)
    print( 'Validation accuracy score : {}'.format(train_score[1]) )
    
    #test score
    test_score = model.evaluate(X_test, y_test)
    print( 'Test accuracy score : {}'.format(test_score[1]) )
    
    #use for final predictions
    preds = model.predict(X_test)
    
    return (model, train_score, test_score, preds)

In [None]:
#Scaling X data
transformer = PowerTransformer()
scaled_X_train = pd.DataFrame(transformer.fit_transform(X_train_tr))
#Only transform the test set: the transformer must be fitted on the training data alone
scaled_X_test = pd.DataFrame(transformer.transform(X_test_tr))

In [None]:
#Testing MLP model
MLP_model = train_MLP_model(scaled_X_train, y_train_tr, scaled_X_test, y_test_tr)
MLP_model

In [None]:
preds = [1 * (x[0]>=0.5) for x in MLP_model[3]]

In [None]:
from sklearn.metrics import confusion_matrix

test_confusion_matrix = confusion_matrix(y_test_tr, preds)

In [None]:
test_confusion_matrix
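
Since the stated aim is to limit false negatives, the confusion matrix is most useful when read as sensitivity and specificity. The sketch below assumes the scikit-learn layout (rows are true classes, columns are predicted classes, ordered 0 then 1) and uses hypothetical counts, not results from this notebook; `sensitivity_specificity` is a helper written for the example.

```python
import numpy as np

def sensitivity_specificity(cm):
    """cm is a 2x2 matrix with rows = true class and columns = predicted
    class, ordered [benign (0), malignant (1)]."""
    (tn, fp), (fn, tp) = np.asarray(cm)
    sensitivity = tp / (tp + fn)   # share of malignant cases that were caught
    specificity = tn / (tn + fp)   # share of benign cases correctly cleared
    return sensitivity, specificity

# Hypothetical counts, for illustration only
demo_cm = [[95, 12],
           [2, 62]]
sens, spec = sensitivity_specificity(demo_cm)
print(round(sens, 3), round(spec, 3))
```

The same function can be applied to `test_confusion_matrix` to check how many malignant cases the model missed.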