# Predicting Breast Cancer Diagnosis Using XGBClassifier

This dataset is from the UCI Machine Learning Repository and was downloaded from [Kaggle](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data).

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

Ten real-valued features are computed for each cell nucleus:

a) **radius** (mean of distances from center to points on the perimeter)<br>
b) **texture** (standard deviation of gray-scale values)<br>
c) **perimeter**<br>
d) **area**<br>
e) **smoothness** (local variation in radius lengths)<br>
f) **compactness** (perimeter^2 / area - 1.0)<br>
g) **concavity** (severity of concave portions of the contour)<br>
h) **concave points** (number of concave portions of the contour)<br>
i) **symmetry**<br>
j) **fractal dimension** ("coastline approximation" - 1)<br>
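
As a quick sanity check on the compactness formula in item (f), the sketch below evaluates it for an ideal circle. The function name `compactness` and the example radius are illustrative assumptions, not part of the dataset's tooling.

```python
import math

def compactness(perimeter: float, area: float) -> float:
    """Compactness as defined above: perimeter^2 / area - 1.0."""
    return perimeter ** 2 / area - 1.0

# Hypothetical example: a perfect circle of radius 10 (arbitrary units).
r = 10.0
circle = compactness(2 * math.pi * r, math.pi * r ** 2)

# For any circle the ratio perimeter^2 / area equals 4*pi, so the value
# does not depend on the radius.
print(round(circle, 4))
```

Less regular outlines have a longer perimeter for the same area, so they score higher than the circle does.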

Column names ending in "se" or "worst" refer to the standard error or the largest observed value of that feature, respectively.
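
This naming convention can be used to group columns by statistic. The sketch below works on a small hand-written sample of column names that appear later in this notebook; the helper `group_by_suffix` is hypothetical.

```python
from collections import defaultdict

def group_by_suffix(columns):
    """Group column names by the text after the last underscore."""
    groups = defaultdict(list)
    for name in columns:
        base, _, suffix = name.rpartition("_")
        groups[suffix].append(base)
    return dict(groups)

# A few column names taken from the feature-importance tables in this notebook.
sample_columns = [
    "texture_mean", "area_mean",
    "area_se",
    "radius_worst", "perimeter_worst", "concave points_worst",
]
groups = group_by_suffix(sample_columns)
print(groups["worst"])  # ['radius', 'perimeter', 'concave points']
```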

The target column is the binary "diagnosis" column.

# Summary

#### XGBClassifier

Recall is reported for the malignant class.

* Unscaled
    * Test accuracy: 0.9825
    * Recall: 0.95
* Scaled
    * Test accuracy: 0.9649
    * Recall: 0.95
* Unscaled after dropping low-importance columns
    * Test accuracy: 0.9649
    * Recall: 0.92
* Scaled after dropping low-importance columns
    * Test accuracy: 0.9649
    * Recall: 0.95
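
The accuracy and recall figures above follow directly from each model's confusion matrix. As a minimal sketch, the snippet below recomputes them for the unscaled model from the counts in its confusion matrix printed later in the notebook (72, 0, 2, 40). The variable names are illustrative.

```python
# Confusion-matrix counts for the unscaled model, with malignant as the positive class.
tn, fp = 72, 0   # actually benign: predicted benign / predicted malignant
fn, tp = 2, 40   # actually malignant: predicted benign / predicted malignant

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)        # share of malignant cases that were caught
precision = tp / (tp + fp)     # share of malignant predictions that were right

print(f"accuracy={accuracy:.4f} recall={recall:.4f} precision={precision:.4f}")
```

Recall is the metric to watch here, since a false negative means a malignant case was classified as benign.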
        

In [1]:
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import warnings
import matplotlib.pyplot as plt
import numpy as np

from xgboost import XGBClassifier

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report

In [43]:
cancer = pd.read_csv('breast_cancer.csv')

cancer = cancer.drop(['Unnamed: 32', 'id'], axis = 1)

diag_map = {'B':0, 'M': 1}

cancer['diagnosis'] = cancer['diagnosis'].map(diag_map)

In [44]:
X = cancer.drop(columns = 'diagnosis')
y = cancer['diagnosis']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 20, stratify = y)

In [45]:
scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Unscaled

In [46]:
xgb_grid = {
    "learning_rate": [0.01, 0.1, 0.5],
    "n_estimators": [50, 100, 150],
    # max_features is not an XGBoost parameter; see the warning in the output below.
    "max_features": [0.5, 0.7, 0.9],
    "subsample": [0.7, 0.9],
    "max_depth": [3, 5],
}

model_xgb_grid = GridSearchCV(XGBClassifier(), param_grid = xgb_grid, verbose = 1, n_jobs = -1)
model_xgb_grid.fit(X_train, y_train)

print(model_xgb_grid.best_params_)

Fitting 5 folds for each of 108 candidates, totalling 540 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    5.4s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   13.5s
[Parallel(n_jobs=-1)]: Done 504 tasks      | elapsed:   23.4s


Parameters: { max_features } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


{'learning_rate': 0.1, 'max_depth': 3, 'max_features': 0.5, 'n_estimators': 100, 'subsample': 0.7}


[Parallel(n_jobs=-1)]: Done 540 out of 540 | elapsed:   24.2s finished
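
The warning above appears because `max_features` is a scikit-learn tree parameter, not one that XGBoost recognizes, so its three values most likely only multiply the number of fits. XGBoost's own column-subsampling parameter is `colsample_bytree`. A hedged sketch of an equivalent grid follows. The name `xgb_grid_alt` is hypothetical, and the results in this notebook were not produced with it.

```python
from itertools import product

# Same search space as above, with XGBoost's column-subsampling parameter
# (colsample_bytree) in place of the unrecognized max_features.
xgb_grid_alt = {
    "learning_rate": [0.01, 0.1, 0.5],
    "n_estimators": [50, 100, 150],
    "colsample_bytree": [0.5, 0.7, 0.9],
    "subsample": [0.7, 0.9],
    "max_depth": [3, 5],
}

# Number of candidate settings GridSearchCV would evaluate for this grid.
n_candidates = len(list(product(*xgb_grid_alt.values())))
print(n_candidates)  # 108
```

Passing `xgb_grid_alt` as `param_grid` would keep the same 108 candidates while letting the column-subsampling values take effect.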


In [47]:
# Note: learning_rate and subsample differ from the grid-search result above (0.1 and 0.7).
model_xgb = XGBClassifier(learning_rate = 0.5, 
                          max_depth = 3, 
                          max_features = 0.5, 
                          n_estimators = 100,
                          subsample = 0.9)
model_xgb.fit(X_train, y_train)
y_pred_xgb = model_xgb.predict(X_test)

confusion_df = pd.DataFrame(
    confusion_matrix(y_test, y_pred_xgb),
    index=["Actually Ben.", "Actually Mal.",],
    columns=["Predicted Ben.", "Predicted Mal.",],
)


print('Training Score: {}'.format(model_xgb.score(X_train, y_train)))
print('Test Score: {}'.format(model_xgb.score(X_test, y_test)))

print(classification_report(y_test, y_pred_xgb))
print(confusion_df)

Parameters: { max_features } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


Training Score: 1.0
Test Score: 0.9824561403508771
              precision    recall  f1-score   support

           0       0.97      1.00      0.99        72
           1       1.00      0.95      0.98        42

    accuracy                           0.98       114
   macro avg       0.99      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114

               Predicted Ben.  Predicted Mal.
Actually Ben.              72               0
Actually Mal.               2              40


In [48]:
importances = model_xgb_grid.best_estimator_.feature_importances_
im_df = pd.DataFrame({"feat": X_train.columns, "importance": importances})
im_df.sort_values("importance", ascending=False)

| | feat | importance |
|---|---|---|
| 22 | perimeter_worst | 0.377811 |
| 27 | concave points_worst | 0.107975 |
| 20 | radius_worst | 0.088973 |
| 23 | area_worst | 0.074238 |
| 7 | concave points_mean | 0.073565 |
| 1 | texture_mean | 0.028169 |
| 6 | concavity_mean | 0.026674 |
| 13 | area_se | 0.024636 |
| 3 | area_mean | 0.02455 |
| 21 | texture_worst | 0.023002 |


# Scaled

In [49]:
model_xgb_grid_s = GridSearchCV(XGBClassifier(), param_grid = xgb_grid, verbose = 1, n_jobs = -1)
model_xgb_grid_s.fit(X_train_scaled, y_train)

print(model_xgb_grid_s.best_params_)

Fitting 5 folds for each of 108 candidates, totalling 540 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  76 tasks      | elapsed:    3.2s
[Parallel(n_jobs=-1)]: Done 376 tasks      | elapsed:   16.3s


Parameters: { max_features } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


{'learning_rate': 0.1, 'max_depth': 3, 'max_features': 0.5, 'n_estimators': 100, 'subsample': 0.7}


[Parallel(n_jobs=-1)]: Done 540 out of 540 | elapsed:   20.2s finished


In [59]:
model_xgb_scale = XGBClassifier(learning_rate = 0.1, 
                          max_depth = 3, 
                          max_features = 0.5, 
                          n_estimators = 100, 
                          subsample = 0.9)
model_xgb_scale.fit(X_train_scaled, y_train)
y_pred_xgb_s = model_xgb_scale.predict(X_test_scaled)

confusion_df = pd.DataFrame(
    confusion_matrix(y_test, y_pred_xgb_s),
    index=["Actually Ben.", "Actually Mal.",],
    columns=["Predicted Ben.", "Predicted Mal.",],
)

print('Training Score: {}'.format(model_xgb_scale.score(X_train_scaled, y_train)))
print('Test Score: {}'.format(model_xgb_scale.score(X_test_scaled, y_test)))

print(classification_report(y_test, y_pred_xgb_s))
print(confusion_df)

Parameters: { max_features } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


Training Score: 1.0
Test Score: 0.9649122807017544
              precision    recall  f1-score   support

           0       0.97      0.97      0.97        72
           1       0.95      0.95      0.95        42

    accuracy                           0.96       114
   macro avg       0.96      0.96      0.96       114
weighted avg       0.96      0.96      0.96       114

               Predicted Ben.  Predicted Mal.
Actually Ben.              70               2
Actually Mal.               2              40


With a training score of 1.0 against a test score of 0.9649, this model is overfit.

In [52]:
importances = model_xgb_grid_s.best_estimator_.feature_importances_
im_df = pd.DataFrame({"feat": X_train.columns, "importance": importances})
im_df.sort_values("importance", ascending=False)

| | feat | importance |
|---|---|---|
| 22 | perimeter_worst | 0.377811 |
| 27 | concave points_worst | 0.107975 |
| 20 | radius_worst | 0.088973 |
| 23 | area_worst | 0.074238 |
| 7 | concave points_mean | 0.073565 |
| 1 | texture_mean | 0.028169 |
| 6 | concavity_mean | 0.026674 |
| 13 | area_se | 0.024635 |
| 3 | area_mean | 0.02455 |
| 21 | texture_worst | 0.023002 |


# Dropping low-importance columns

Only features with an importance of at least 0.1 are kept: `perimeter_worst` and `concave points_worst`.
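
A minimal sketch of applying this threshold, using the top importances reported in the tables above. The values are hard-coded so the snippet is self-contained, and the dictionary and variable names are illustrative.

```python
# Top feature importances as reported by the grid-searched model above.
importances = {
    "perimeter_worst": 0.377811,
    "concave points_worst": 0.107975,
    "radius_worst": 0.088973,
    "area_worst": 0.074238,
    "concave points_mean": 0.073565,
}

threshold = 0.1

# Keep only the features whose importance meets the threshold.
selected = [feat for feat, imp in importances.items() if imp >= threshold]
print(selected)  # ['perimeter_worst', 'concave points_worst']
```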

## Unscaled

In [24]:
cancer_dropped_u = cancer[['perimeter_worst','concave points_worst', 'diagnosis']]


X = cancer_dropped_u.drop(columns = 'diagnosis')

y = cancer_dropped_u['diagnosis']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 20)

In [25]:
xgb_grid = {
    "learning_rate": [0.01, 0.1, 0.5],
    "n_estimators": [50, 100, 150],
    "max_features": [0.5, 0.7, 0.9],
    "subsample": [0.7, 0.9],
    "max_depth": [3, 5],
}

model_xgb_grid = GridSearchCV(XGBClassifier(), param_grid = xgb_grid, verbose = 1, n_jobs = -1)
model_xgb_grid.fit(X_train, y_train)

print(model_xgb_grid.best_params_)

Fitting 5 folds for each of 108 candidates, totalling 540 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 128 tasks      | elapsed:    1.8s


Parameters: { max_features } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


{'learning_rate': 0.01, 'max_depth': 3, 'max_features': 0.5, 'n_estimators': 150, 'subsample': 0.9}


[Parallel(n_jobs=-1)]: Done 533 out of 540 | elapsed:    7.1s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done 540 out of 540 | elapsed:    7.2s finished


In [35]:
# Note: learning_rate and n_estimators differ from the grid-search result above (0.01 and 150).
model_xgb_op = XGBClassifier(learning_rate = 0.1, 
                          max_depth = 3, 
                          max_features = 0.5, 
                          n_estimators = 100, 
                          subsample = 0.9)
model_xgb_op.fit(X_train, y_train)
y_pred_xgb = model_xgb_op.predict(X_test)

confusion_df = pd.DataFrame(
    confusion_matrix(y_test, y_pred_xgb),
    index=["Actually Ben.", "Actually Mal.",],
    columns=["Predicted Ben.", "Predicted Mal.",],
)


print('Training Score: {}'.format(model_xgb_op.score(X_train, y_train)))
print('Test Score: {}'.format(model_xgb_op.score(X_test, y_test)))

print(classification_report(y_test, y_pred_xgb))
print(confusion_df)

Parameters: { max_features } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


Training Score: 0.9736263736263736
Test Score: 0.9649122807017544
              precision    recall  f1-score   support

           0       0.94      1.00      0.97        66
           1       1.00      0.92      0.96        48

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114

               Predicted Ben.  Predicted Mal.
Actually Ben.              66               0
Actually Mal.               4              44


## Scaled

In [53]:
scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [54]:
model_xgb_grid_s = GridSearchCV(XGBClassifier(), param_grid = xgb_grid, verbose = 1, n_jobs = -1)
model_xgb_grid_s.fit(X_train_scaled, y_train)

print(model_xgb_grid_s.best_params_)

Fitting 5 folds for each of 108 candidates, totalling 540 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  76 tasks      | elapsed:    3.2s
[Parallel(n_jobs=-1)]: Done 376 tasks      | elapsed:   17.9s


Parameters: { max_features } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


{'learning_rate': 0.1, 'max_depth': 3, 'max_features': 0.5, 'n_estimators': 100, 'subsample': 0.7}


[Parallel(n_jobs=-1)]: Done 540 out of 540 | elapsed:   21.4s finished


In [56]:
model_xgb_scale_op = XGBClassifier(learning_rate = 0.1, 
                          max_depth = 3, 
                          max_features = 0.5, 
                          n_estimators = 100, 
                          subsample = 0.9)
model_xgb_scale_op.fit(X_train_scaled, y_train)
y_pred_xgb_s = model_xgb_scale_op.predict(X_test_scaled)

confusion_df = pd.DataFrame(
    confusion_matrix(y_test, y_pred_xgb_s),
    index=["Actually Ben.", "Actually Mal.",],
    columns=["Predicted Ben.", "Predicted Mal.",],
)

print('Training Score: {}'.format(model_xgb_scale_op.score(X_train_scaled, y_train)))
print('Test Score: {}'.format(model_xgb_scale_op.score(X_test_scaled, y_test)))

print(classification_report(y_test, y_pred_xgb_s))
print(confusion_df)

Parameters: { max_features } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


Training Score: 1.0
Test Score: 0.9649122807017544
              precision    recall  f1-score   support

           0       0.97      0.97      0.97        72
           1       0.95      0.95      0.95        42

    accuracy                           0.96       114
   macro avg       0.96      0.96      0.96       114
weighted avg       0.96      0.96      0.96       114

               Predicted Ben.  Predicted Mal.
Actually Ben.              70               2
Actually Mal.               2              40
