# GUC Brain ML workshop

by AbdElRhman ElMoghazy


In this tutorial we will learn how to classify the breast cancer dataset using ensemble learning techniques. This tutorial is part of the GUC Brain ML workshop and covers the following:

* [Data Exploration and Analysis](#section-one)
* [Feature Selection](#section-two)
* [Model choice and hyperparameter optimization](#section-three)
* [Model Evaluation](#section-seven)

In [None]:
import numpy as np
import pandas as pd
import optuna
import os
from sklearn import preprocessing
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn import metrics
from sklearn.metrics import accuracy_score

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

data = pd.read_csv("../input/breast-cancer-wisconsin-data/data.csv")

<a id="section-one"></a>

### Exploratory Data Analysis
We will start by exploring the data statistically and visually. We will explore the distribution of the data using the describe() function from pd.DataFrame and check for anomalies or possible outliers.
Using the info() function from pd.DataFrame, we can see the data type of each feature and whether it contains Null or NaN values.
We will then learn how to deal with categorical data (although the only categorical column here is the target). Then we will explore correlations between features and two methods for outlier detection: the box plot (IQR) method and the z-score method.

In [None]:
data.head(10)

In [None]:
data.info()

In [None]:
data.describe()

#### Categorical Variables:
 Variables that consist of a finite set of discrete categories or labels. They are divided into two main types:
 * Nominal: there is no natural priority or order among the categories (you can use the get_dummies function from pandas) <br>https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
 * Ordinal: the categories can be ordered or ranked. We can use OrdinalEncoder from Sklearn for features; LabelEncoder is intended for the target variable <br>
 https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html
 https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
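
As a minimal sketch of the two encodings, the toy frame below one-hot encodes a nominal column with `pd.get_dummies` and encodes an ordinal column with `OrdinalEncoder`, passing the category order explicitly. The `color` and `size` columns are made up for illustration and are not part of the breast cancer data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: 'color' is nominal, 'size' is ordinal
toy = pd.DataFrame({
    "color": ["red", "green", "red", "blue"],
    "size": ["small", "large", "medium", "small"],
})

# Nominal: one new indicator column per category
color_dummies = pd.get_dummies(toy["color"], prefix="color")

# Ordinal: integer codes that follow the order we specify
size_encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_codes = size_encoder.fit_transform(toy[["size"]]).ravel()

print(color_dummies)
print(size_codes)
```

Passing `categories` explicitly matters for ordinal data: without it, the encoder orders the categories alphabetically, which would not match the intended small < medium < large ranking.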
 

Let's now check the unique labels/categories in the 'diagnosis' column and how often each occurs, using the value_counts() method of a pandas Series.

In [None]:
data.diagnosis.value_counts()

The column consists of two categories only, B and M. Let's explore further if we can order those or just one-hot encode them.

#### Note:
One-hot encoding a feature adds a new feature for each unique category. If the diagnosis feature has only the two categories "B" and "M", you get two new columns, B and M: the B column holds 1 wherever diagnosis = "B", and the M column holds 1 wherever diagnosis = "M".

#### Example

One-hot encoding:

| diagnosis | B | M |
|---|---|---|
| B | 1 | 0 |
| M | 0 | 1 |
| M | 0 | 1 |
| B | 1 | 0 |
| B | 1 | 0 |

Label encoding, if B is ranked lower than M:

| diagnosis | diagnosis_new |
|---|---|
| B | 1 |
| M | 2 |
| M | 2 |
| B | 1 |
| B | 1 |


But how can we really know whether the categories are ranked? This comes from domain knowledge of the data; for example, most of the features are described on the main dataset page. In real-life problems you can also determine this from your own knowledge of the problem and of how the data was collected.
In this data, B means benign and M means malignant. Since diagnosis is the target variable, I will label encode it to get a single output for each row.
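
One detail worth checking: `LabelEncoder` assigns integer codes in sorted order of the labels, starting at 0, so B and M should map to 0 and 1 rather than the 1 and 2 used in the illustration above. A small sketch on a made-up list of labels (the `sample` list is an assumption for illustration, not taken from the dataset):

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample of diagnosis labels
sample = ["B", "M", "M", "B", "B"]

enc = LabelEncoder()
codes = enc.fit_transform(sample)

print(enc.classes_)                    # label stored at each integer code
print(codes)                           # encoded version of `sample`
print(enc.inverse_transform([0, 1]))   # map codes back to the labels
```

Keeping the fitted encoder around is useful later, because `classes_` and `inverse_transform` let you translate predictions back into B and M.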

In [None]:
label_enc = preprocessing.LabelEncoder()
data.diagnosis = label_enc.fit_transform(data.diagnosis)
labels = data.diagnosis
train = data.drop(['diagnosis', 'Unnamed: 32', "id"], axis = 1)

In [None]:
corr = data.corr()
f, ax = plt.subplots(figsize=(25, 25))
cmap = sns.diverging_palette(2000, 13, as_cmap=True)
sns.heatmap(corr, cmap=cmap, center=0,square=True, linewidths=.5)

In [None]:
data.columns

In [None]:
sns.pairplot(data[['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean']], hue = 'diagnosis')

#### Outliers Detection

There are many methods that can be used to detect outliers in a dataset. In this workshop we will discuss the following:
* Box Plot method
* Standardization (z-score) method

##### Box plot: consists of five main components:
* Q1, the first quartile (median of the first half of the data)
* Q2, the median of the data
* Q3, the third quartile (median of the second half of the data)
* Max value
* Min value

##### Main equations in box plots:
$$ IQR = Q3 - Q1 $$
$$ \text{Outliers}: \quad x > Q3 + 1.5 \times IQR \quad \text{or} \quad x < Q1 - 1.5 \times IQR $$
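
The IQR rule can be sketched in a few lines of NumPy. The numbers below are made up for illustration, and `np.percentile` computes the quartiles by linear interpolation, which can differ slightly from the "median of each half" definition given above.

```python
import numpy as np

# Made-up measurements with one obviously large value
values = np.array([10.0, 12.0, 12.5, 13.0, 13.5, 14.0, 15.0, 40.0])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are flagged as outliers
is_outlier = (values < lower_fence) | (values > upper_fence)
print(values[is_outlier])
```

In a box plot, the points flagged this way are the ones drawn individually beyond the whiskers.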

##### Z-score method
The z-score is the number of standard deviations by which a data point lies away from the mean. Put more simply, it is the distance of a point from the mean, measured in standard deviations.
$$ z = {x - \text{mean} \over \text{std}} $$
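
To make the formula concrete, the sketch below computes z-scores by hand on made-up numbers and compares them with `scipy.stats.zscore`. The threshold of 2.5 mirrors the cut-off used in this workshop's code; it is a convention, not a fixed rule.

```python
import numpy as np
from scipy import stats

# Made-up measurements with one obviously large value
values = np.array([10.0, 12.0, 12.5, 13.0, 13.5, 14.0, 15.0, 40.0])

# Manual z-score using the population standard deviation (ddof=0)
z_manual = (values - values.mean()) / values.std()
# scipy.stats.zscore also uses ddof=0 by default
z_scipy = stats.zscore(values)

threshold = 2.5
# Use the absolute value so that unusually small points are caught too
is_outlier = np.abs(z_manual) > threshold
print(values[is_outlier])
```

Note that a single extreme value also inflates the mean and the standard deviation, so on very small samples the z-score method can miss outliers that the IQR rule catches.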

In [None]:
fig = plt.figure(figsize = (20,10))
ax = fig.gca()
sns.boxplot(data= train[['area_mean', 'area_worst']], orient="h", palette="Set1", ax = ax)

In [None]:
train.boxplot()

In [None]:
from scipy import stats

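# flag rows where any feature lies more than 2.5 standard deviations from its mean, in either direction
rows = np.any(np.abs(stats.zscore(train.values)) > 2.5, axis=1)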
outliers = train.loc[rows]
outliers.shape

<a id="section-two"></a>

### Feature Selection
Feature selection is one of the essential steps in the Machine Learning pipeline. It can help mitigate the 'curse of dimensionality' when the number of features is huge.
There are many feature selection techniques including:

* Model-based feature selection, which requires training a secondary Machine Learning model
* Statistical feature selection, which relies on statistical methods (e.g. hypothesis testing, correlations)
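
As a sketch of the statistical approach, the example below scores each feature with an ANOVA F-test and keeps the top ones. It assumes scikit-learn's built-in copy of the breast cancer data (`load_breast_cancer`) rather than the Kaggle CSV, and `k=10` is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

# scikit-learn's built-in copy of the Wisconsin breast cancer data
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Keep the 10 features with the highest ANOVA F-score against the target
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

# get_support() returns a boolean mask over the original columns
selected_names = list(X.columns[selector.get_support()])
print(X_selected.shape)
print(selected_names)
```

Unlike the model-based approach, this scores every feature independently, so it is fast but does not account for redundancy between highly correlated features.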

In [None]:
# fit an Extra Trees model to the breast cancer training data

x_train, x_test, y_train, y_test = train_test_split(train.values, labels.values, test_size=0.2, random_state=42 )
clf = ExtraTreesClassifier()
clf.fit(x_train,y_train)
# get the relative importance of each feature
z = clf.feature_importances_
# make a dataframe to display every importance value next to its column name
df = pd.DataFrame()

df["values"] = z
df['column'] = list(train.columns.values)
# sort in descending order so the least important features end up at the bottom
df.sort_values(by='values', ascending=False, inplace = True)
df.head(100)

<a id="section-three"></a>

### Model Selection

Let's now try our first model before diving into further Feature Engineering.

#### Ensembling
An ensemble of weak classifiers often gives better results than a single strong classifier. Ensemble learning has three main types:
* Bagging
* Boosting
* Stacking

In this workshop we will try bagging and boosting ensembles and leave the third type for another workshop.
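
To illustrate the difference between the two families, here is a minimal sketch. As an assumption for illustration, scikit-learn's `BaggingClassifier` and `GradientBoostingClassifier` stand in for the ExtraTrees and XGBoost models used in this workshop, and the data comes from scikit-learn's built-in `load_breast_cancer`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Bagging: many trees trained independently on bootstrap samples, then combined by voting
# (BaggingClassifier uses a decision tree as its default base estimator)
bagging = BaggingClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Boosting: trees trained one after another, each correcting the errors of the previous ones
boosting = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

bagging_acc = bagging.score(X_te, y_te)
boosting_acc = boosting.score(X_te, y_te)
print(f"Bagging test accuracy:  {bagging_acc:.3f}")
print(f"Boosting test accuracy: {boosting_acc:.3f}")
```

Bagging mainly reduces variance by averaging independent models, while boosting mainly reduces bias by fitting models sequentially.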


First we need to split the data into training and testing sets to be able to evaluate the model later. We will use train_test_split from sklearn.model_selection.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(train.values, labels.values, test_size=0.2, random_state=42 )

In [None]:
clf = XGBClassifier(random_state=0)

clf.fit(x_train, y_train)
print('Accuracy of classifier on training set: {:.2f}'.format(clf.score(x_train, y_train) * 100))
print('Accuracy of classifier on test set: {:.2f}'.format(clf.score(x_test, y_test) * 100))

In [None]:
clf_et = ExtraTreesClassifier(n_estimators=950, random_state=0)

clf_et.fit(x_train, y_train)
print('Accuracy of classifier on training set: {:.2f}'.format(clf_et.score(x_train, y_train) * 100))
print('Accuracy of classifier on test set: {:.2f}'.format(clf_et.score(x_test, y_test) * 100))

#### Hyperparameters Tuning

Many Machine Learning models can be used with this type of data. A model that gives good results on a similar dataset is usually a good starting point, and for this kind of tabular data, tree-based ensemble methods often give the best results. Here we tune the XGBoost hyperparameters with Optuna, which runs the objective function below for a number of trials and keeps the best-scoring parameters.

In [None]:
def objective(trial,data=train.values,target=labels.values):
    
    train_x, test_x, train_y, test_y = train_test_split(data, target, test_size=0.15,random_state=42)
    
    param = {
        # 'gpu_hist' trains on the GPU to speed up training; recent XGBoost releases
        # deprecate it in favour of tree_method='hist' together with device='cuda'
        'tree_method':'gpu_hist',
        # suggest_float(..., log=True) replaces the deprecated suggest_loguniform
        'lambda': trial.suggest_float('lambda', 1e-3, 10.0, log=True),
        'alpha': trial.suggest_float('alpha', 1e-3, 10.0, log=True),
        'colsample_bytree': trial.suggest_categorical('colsample_bytree', [0.3,0.4,0.5,0.6,0.7,0.8,0.9, 1.0]),
        'subsample': trial.suggest_categorical('subsample', [0.4,0.5,0.6,0.7,0.8,1.0]),
        'learning_rate': trial.suggest_categorical('learning_rate', [0.008,0.009,0.01,0.012,0.014,0.016,0.018, 0.02]),
        'n_estimators': trial.suggest_int('n_estimators', 100, 4000, step=100),
        'max_depth': trial.suggest_categorical('max_depth', [5,7,9,11,13,15,17,20]),
        'random_state': trial.suggest_categorical('random_state', [24, 48,2020]),
    }
    
    # early_stopping_rounds is passed to the constructor here, which assumes XGBoost 1.6 or later;
    # older versions take it as an argument of fit() instead
    model = XGBClassifier(**param, early_stopping_rounds=100)
    model.fit(train_x,train_y,eval_set=[(test_x,test_y)],verbose=False)
    preds = model.predict(test_x)
    # the accuracy on the held-out split is the value Optuna tries to maximize
    acc = accuracy_score(test_y, preds)
    return acc


study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)

In [None]:
param = {'lambda': 7.726177577712451, 
         'alpha': 0.020954967406242572, 
         'colsample_bytree': 0.8, 
         'subsample': 0.4, 
         'learning_rate': 0.018, 
         'n_estimators': 1000, 
         'max_depth': 9, 
         'random_state': 48, 
         'min_child_weight': 2}

In [None]:
clf = XGBClassifier(**param)

clf.fit(x_train, y_train)

print('Accuracy of classifier on training set: {:.2f}'.format(clf.score(x_train, y_train) * 100))
print('Accuracy of classifier on test set: {:.2f}'.format(clf.score(x_test, y_test) * 100))

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
from sklearn.utils.multiclass import unique_labels
from warnings import simplefilter
from collections import defaultdict

<a id="section-seven"></a>

### Model Evaluation
We evaluate the tuned classifier on the test set using a confusion matrix.

In [None]:
def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Blues):
    """
    This function plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    if normalize:
        # Turn counts into per-class fractions (each row sums to 1)
        cm = cm.astype(float) / cm.sum(axis=1, keepdims=True)
    # Use the class names passed by the caller as tick labels
    classes = [str(c) for c in classes]

    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax


np.set_printoptions(precision=2)
# label_enc.classes_ holds the original labels in encoded order: index 0 is 'B', index 1 is 'M'
plot_confusion_matrix(y_test, clf.predict(x_test), classes=label_enc.classes_, normalize=False,
                      title='Confusion matrix')

plt.show()