## Python: Exploring features + Machine Learning analysis in the Breast Cancer Wisconsin dataset

The open-source Diagnostic Wisconsin Breast cancer dataset (__[available via Kaggle](https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data/discussion)__) contains features of the cell nuclei of malignant and benign tumor cells. The features are computed based on Whole Slide Images of a breast mass. The project's primary challenge is to determine whether a sample is malignant or benign, based on the characteristics in the dataset. 

### Preprocessing

In [62]:
# import libraries
import pandas as pd
import numpy as np

In [171]:
# Explore data (load in dataframe)
data = pd.read_csv("data.csv")
print(data.shape)
print(data.keys())

(569, 33)
Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'],
      dtype='object')


The dataset consist of 569 tissue samples with 33 columns. However, not every column is a feature. The first step is to "clean" the data by removing trivial columns and preparing the prediction labels.

In [172]:
data.drop(['Unnamed: 32','id'],axis=1,inplace=True)
data.diagnosis=[1 if each=="M" else 0 for each in data.diagnosis]

In [169]:
# dataset
X = data.loc[:,data.columns!="diagnosis"]
# target
y = data.loc[:, "diagnosis"]
y = y[:,np.newaxis]

In [44]:
print(y.value_counts())

0    357
1    212
Name: diagnosis, dtype: int64


### Training

For our model, we are going to make use of the simple classifiers from ```scikit-learn```, namely the Support Vector Machine, Decision Tree, Gaussian Naive Bayes, Logistic Regression, Random Forest and K-Nearest Neighbors classifiers. However, to make the final model more robust, we are adding some additional steps: nested cross validation, statistical-based feature selection and hyperparameter optimization using a grid search method. Finally, a ROC curve is computed to determine the model performance

In [164]:
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from scipy import stats
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve

import matplotlib.pyplot as plt
import utils

# Reload external modules
from importlib import reload
reload(utils)
from utils import *

import warnings
warnings.filterwarnings('ignore')

In [165]:
# Set parameters
outer_cv = 5
inner_cv = 3
epoch = 30
alpha = 0.05

In [167]:
# Using a Stratified split to balance uneven classes
skf = StratifiedKFold(n_splits=outer_cv)

scores = dict()
save_models = dict()
# Split the dataset into train and test
for i, (train_index, test_index) in enumerate(skf.split(X,y)):
    print(f"Outer Fold {i}")
    X_train, X_test =  X.iloc[train_index,:], X.iloc[test_index,:]
    y_train, y_test = y[train_index], y[test_index]
    
    # Statistical-based Feature selection
    feat_select = list()
    for feature in X_train.columns:
        values = X_train[feature]
        # Determine whether data is normal distributed
        _, p = stats.shapiro(values)
        class_0 = np.where(y_train==0)[0]
        class_1 = np.where(y_train==1)[0]
        if p > alpha:
            # Student t-test
            _, p_val = stats.ttest_ind(X_train.iloc[class_0,:][feature], X_train.iloc[class_1,:][feature])
        else:
            # Mann-Whitney U test
            _, p_val = stats.mannwhitneyu(X_train.iloc[class_0,:][feature], X_train.iloc[class_1,:][feature])
        # Finally, select feature if significant
        if p_val < alpha:
            feat_select.append(feature)
    # Drop insignificant features
    X_train, X_test =  X_train[feat_select], X_test[feat_select]
    
    # Initialize models and parameter grid
    models, model_names = initiate_models()
    grids = param_grid()
    
    for model, name, params in zip(models, model_names, grids):
        print(f"Grid search for {name}")
        grid = GridSearchCV(model, params, scoring='accuracy', cv=inner_cv, verbose= False)
        grid.fit(X_train, y_train)
        
        if name in save_models:
            save_models[name].append(grid)
        else:
            save_models[name] = [grid]
            
        # Predictions on test set
        y_pred = grid.predict(X_test)
        # Evaluation metrics
        if name in scores:
            scores[name]['accuracy'].append(accuracy_score(y_test, y_pred))
            scores[name]['accuracy'].append(roc_auc_score(y_test, y_pred))
        else:
            scores[name] = {"accuracy": [accuracy_score(y_test, y_pred)],
                            "auc": [roc_auc_score(y_test, y_pred)]}

Outer Fold 0
Grid search for svm
Grid search for tree
Grid search for NB
Grid search for logistic
Grid search for rf
Grid search for knn
Outer Fold 1
Grid search for svm
Grid search for tree
Grid search for NB
Grid search for logistic
Grid search for rf
Grid search for knn
Outer Fold 2
Grid search for svm
Grid search for tree
Grid search for NB
Grid search for logistic
Grid search for rf
Grid search for knn
Outer Fold 3
Grid search for svm
Grid search for tree
Grid search for NB
Grid search for logistic
Grid search for rf
Grid search for knn
Outer Fold 4
Grid search for svm
Grid search for tree
Grid search for NB
Grid search for logistic
Grid search for rf
Grid search for knn


## Evaluation

In [168]:
model_names = list(scores.keys())
test_accuracy = [scores[name]['accuracy'] for name in model_names]
test_auc = [scores[name]['auc'] for name in model_names]
print(f'Test accuracy: {np.mean(test_accuracy)}')
print(f'Test AUC: {np.mean(test_auc)}')

Test accuracy: 0.873224455187359
Test AUC: 0.8296484332350693
