# Predicting Breast Cancer Diagnosis

This dataset comes from the UCI Machine Learning Repository and was downloaded from [Kaggle](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data).

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

Ten real-valued features are computed for each cell nucleus:

a) **radius** (mean of distances from center to points on the perimeter)<br>
b) **texture** (standard deviation of gray-scale values)<br>
c) **perimeter**<br>
d) **area**<br>
e) **smoothness** (local variation in radius lengths)<br>
f) **compactness** (perimeter^2 / area - 1.0)<br>
g) **concavity** (severity of concave portions of the contour)<br>
h) **concave points** (number of concave portions of the contour)<br>
i) **symmetry**<br>
j) **fractal dimension** ("coastline approximation" - 1)<br>

Column names ending in "mean", "se", or "worst" give the mean, the standard error, and the largest value of that feature for each image, respectively. Per the UCI description, "worst" is the mean of the three largest values.

The target column is the binary "diagnosis" column.

# Summary

#### [LogisticRegression](LogisticRegression_Breast_Cancer.ipynb) 
    * Unscaled
        Test accuracy: 0.9561
        Recall: 0.90
    * Scaled
        Test accuracy: 0.9736
        Recall: 0.95
        
#### [RandomForestClassifier](RandomForest_Breast_Cancer.ipynb)
    * Unscaled
        Test accuracy: 0.9649
        Recall: 0.98
    * Scaled
        Test accuracy: 0.9736
        Recall: 0.98
        
#### [KNN](KNN_Breast_Cancer.ipynb)
    * Unscaled
        Test accuracy: 0.9298
        Recall: 0.83
    * Scaled
        Test accuracy: 0.9561
        Recall: 0.93
        
#### [SVC](SVC_Breast_Cancer.ipynb)
    * Unscaled
        Test accuracy: 0.9561
        Recall: 0.90
    * Scaled
        Test accuracy:  0.9824
        Recall: 0.98
        
#### [XGBClassifier](XGB_Breast_Cancer.ipynb)
    * Unscaled
        Test accuracy: 0.9824
        Recall: 0.95
    * Scaled
        Test accuracy: 0.9649
        Recall: 0.95

In [1]:
#!pip install category_encoders
#!pip install xgboost
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import warnings
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [2]:
# function from https://gist.github.com/AdamSpannbauer/c99c366b0c7d5b6c4920a46c32d738e5

def print_vif(x):
    """Utility for checking multicollinearity assumption
    
    :param x: input features to check using VIF. This is assumed to be a pandas.DataFrame
    :return: None; the VIFs are printed as a pandas Series
    """
    # Silence numpy FutureWarning about .ptp
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        x = sm.add_constant(x)
    
    vifs = []
    for i in range(x.shape[1]):
        vif = variance_inflation_factor(x.values, i)
        vifs.append(vif)

    print('VIF results\n-------------------------------')
    print(pd.Series(vifs, index=x.columns))
    print('-------------------------------\n')
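
For intuition about what `print_vif` reports: the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. With exactly two predictors, R² is simply the squared Pearson correlation between them. The sketch below works through that special case in plain Python; the helper names and the sample values are hypothetical and are not taken from the dataset.

```python
import math

def pearson_r(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def vif_two_predictors(a, b):
    """VIF when there are exactly two predictors, where R^2 equals r^2."""
    r = pearson_r(a, b)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical, illustrative values (not rows from breast_cancer.csv).
radius = [10.0, 12.0, 14.0, 17.0, 20.0]
perimeter = [64.0, 77.0, 91.0, 110.0, 131.0]  # nearly proportional to radius
texture = [18.0, 12.0, 21.0, 15.0, 19.0]      # only weakly related to radius

vif_collinear = vif_two_predictors(radius, perimeter)
vif_independent = vif_two_predictors(radius, texture)

print(f"radius vs perimeter: VIF = {vif_collinear:.1f}")
print(f"radius vs texture:   VIF = {vif_independent:.2f}")
```

A VIF near 1 means the feature is barely explained by the others, while a very large value signals strong multicollinearity.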

In [18]:
cancer = pd.read_csv('breast_cancer.csv')

# Data Exploration and Model Preparation

In [19]:
cancer.shape

(569, 33)

In [20]:
cancer.head(2)

|  | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | NaN |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | NaN |


In [21]:
cancer['Unnamed: 32'].isna().mean()

1.0

The column "Unnamed: 32" contains nothing but NaNs, so I will drop that column along with the id column.

In [22]:
cancer = cancer.drop(['Unnamed: 32', 'id'], axis = 1)

Next, the "diagnosis" column (the target) needs to be converted from strings to numbers.

In [23]:
diag_map = {'B':0, 'M': 1}

cancer['diagnosis'] = cancer['diagnosis'].map(diag_map)

In [24]:
cancer['diagnosis'].value_counts()

0    357
1    212
Name: diagnosis, dtype: int64

There is not a large class imbalance, but even so, I will stratify the train/test split by diagnosis so that both sets keep the same class proportions.

In [25]:
X = cancer.drop('diagnosis', axis = 1)
y = cancer['diagnosis']

I will perform a train_test_split to isolate a testing set and a training set.

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 20, stratify = y)

In [27]:
y_train.value_counts()

0    285
1    170
Name: diagnosis, dtype: int64

In [28]:
y_test.value_counts()

0    72
1    42
Name: diagnosis, dtype: int64
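
As a quick sanity check on the stratified split, the class counts printed by `value_counts()` above can be compared directly. This is a small arithmetic sketch using only those reported counts; the helper name `malignant_share` is made up for illustration.

```python
# Class counts reported by value_counts() above (0 = benign, 1 = malignant).
full = {0: 357, 1: 212}
train = {0: 285, 1: 170}
test = {0: 72, 1: 42}

def malignant_share(counts):
    """Fraction of samples labelled malignant (class 1)."""
    return counts[1] / (counts[0] + counts[1])

for name, counts in [("full", full), ("train", train), ("test", test)]:
    total = sum(counts.values())
    print(f"{name:>5}: n = {total:3d}, malignant share = {malignant_share(counts):.3f}")
```

All three shares come out at roughly 37%, which is what stratifying on `y` is meant to achieve.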

I will scale the data and try all models with scaled and un-scaled data.

In [29]:
scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
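
To make explicit what the scaling step does, here is a minimal plain-Python sketch of standardization for a single feature: the mean and standard deviation are learned from the training values only and then reused for the test values. `StandardScaler` uses the population standard deviation, which `statistics.pstdev` mirrors here. The function names and the numbers are hypothetical.

```python
import statistics

def fit_standardizer(values):
    """Learn the mean and population standard deviation from training values."""
    return statistics.fmean(values), statistics.pstdev(values)

def standardize(values, mean, std):
    """Apply z = (x - mean) / std using previously learned statistics."""
    return [(v - mean) / std for v in values]

# Hypothetical area-like values, not rows from the dataset.
train_area = [400.0, 550.0, 700.0, 1000.0, 1350.0]
test_area = [500.0, 1200.0]

mean, std = fit_standardizer(train_area)          # statistics come from train only
train_scaled = standardize(train_area, mean, std)
test_scaled = standardize(test_area, mean, std)   # test reuses the train statistics

print([round(v, 3) for v in train_scaled])
print([round(v, 3) for v in test_scaled])
```

Reusing the training statistics for the test set keeps information about the test data out of the preprocessing step.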

For each model type, I will run a GridSearchCV to find the best hyperparameters. Using those hyperparameters, I will then fit the model and report the training score, test score, confusion matrix, and classification report. I am interested not only in the highest test accuracy but also in the highest recall for the malignant class, because high recall means few false negatives. I do not want to tell someone they are healthy when they actually have a malignant breast mass: that person should be receiving care, but would instead be overlooked.
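
Since recall is a stated priority, one possible variation (not what the cells in this notebook do) is to ask the grid search to rank candidates by recall and to put the scaler inside a `Pipeline`, so that each cross-validation fold fits its own scaler. The sketch below assumes scikit-learn's bundled copy of the dataset instead of `breast_cancer.csv`, and uses a deliberately small grid; the step names `scale` and `clf` are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The bundled dataset encodes malignant as 0 and benign as 1, so flip the
# labels to match this notebook's mapping (1 = malignant).
X, y = load_breast_cancer(return_X_y=True)
y = 1 - y

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=20, stratify=y
)

# Scaling lives inside the pipeline, so it is refitted on each CV training fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid only; step parameters are addressed with the "clf__" prefix.
param_grid = {"clf__C": [0.1, 1, 10]}

# scoring="recall" ranks candidates by recall on the positive class (label 1).
search = GridSearchCV(pipe, param_grid=param_grid, scoring="recall", cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print("Mean CV recall: {:.3f}".format(search.best_score_))
print("Test recall: {:.3f}".format(search.score(X_test, y_test)))
```

Because the split here is made on the bundled data, the numbers it prints are not expected to match the results reported in this notebook.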

# LogisticRegression

## Unscaled

In [30]:
lr_grid = {
    'C': [0.1, 1, 10, 20],
    'solver': ['newton-cg', 'lbfgs', 'liblinear','sag', 'saga'],
    'max_iter': [100, 1000, 10000]
}

model_lr_grid = GridSearchCV(LogisticRegression(max_iter = 1000), param_grid = lr_grid, verbose = 1, n_jobs = -1)
model_lr_grid.fit(X_train, y_train)

print(model_lr_grid.best_params_)

Fitting 5 folds for each of 60 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 165 tasks      | elapsed:   11.3s
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:   26.8s finished


{'C': 20, 'max_iter': 100, 'solver': 'newton-cg'}
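
The "60 candidates, totalling 300 fits" message follows from the size of the grid: every combination of one value per key is a candidate, and each candidate is fitted once per cross-validation fold. A small sketch that reproduces the count with `itertools.product`, assuming 5 folds as the log reports:

```python
from itertools import product

lr_grid = {
    "C": [0.1, 1, 10, 20],
    "solver": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
    "max_iter": [100, 1000, 10000],
}

# Every combination of one value per key is one candidate.
keys = list(lr_grid)
candidates = [dict(zip(keys, combo)) for combo in product(*lr_grid.values())]

n_folds = 5  # matches "Fitting 5 folds" in the log above
n_fits = len(candidates) * n_folds

print(len(candidates), "candidates,", n_fits, "fits")
print(candidates[0])
```

Note that `max_iter` values in the grid override the `max_iter = 1000` passed to the `LogisticRegression` constructor.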


In [15]:
%time
model_lr = LogisticRegression(C= 20, solver = 'newton-cg', max_iter = 100)
model_lr.fit(X_train, y_train)

y_pred_lr = model_lr.predict(X_test)

confusion_df = pd.DataFrame(
    confusion_matrix(y_test, y_pred_lr),
    index=["Actually Ben.", "Actually Mal.",],
    columns=["Predicted Ben.", "Predicted Mal.",],
)


print('Training Score: {}'.format(model_lr.score(X_train, y_train)))
print('Test Score: {}'.format(model_lr.score(X_test, y_test)))
print(classification_report(y_test, y_pred_lr))
print(confusion_df)

Wall time: 0 ns
Training Score: 0.9736263736263736
Test Score: 0.956140350877193
              precision    recall  f1-score   support

           0       0.95      0.99      0.97        72
           1       0.97      0.90      0.94        42

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114

               Predicted Ben.  Predicted Mal.
Actually Ben.              71               1
Actually Mal.               4              38
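
The accuracy, recall, and precision in the report above follow directly from the four cells of the confusion matrix. As a worked check, using the counts printed above with malignant as the positive class:

```python
# Counts from the confusion matrix above (rows = actual, columns = predicted).
tn, fp = 71, 1   # actually benign
fn, tp = 4, 38   # actually malignant

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)       # share of malignant cases that were caught
precision = tp / (tp + fp)    # share of malignant predictions that were right

print(f"accuracy  = {accuracy:.4f}")
print(f"recall    = {recall:.4f}")
print(f"precision = {precision:.4f}")
```

The four false negatives are what pull recall down to about 0.90 even though accuracy is close to 0.96.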


## Scaled

In [31]:
model_lr_grid_s = GridSearchCV(LogisticRegression(), param_grid = lr_grid, verbose = 1, n_jobs = -1)
model_lr_grid_s.fit(X_train_scaled, y_train)
print(model_lr_grid_s.best_params_)

Fitting 5 folds for each of 60 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  68 tasks      | elapsed:    2.6s


{'C': 20, 'max_iter': 100, 'solver': 'newton-cg'}


[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:    8.7s finished


In [32]:
%time
model_lr_scale = LogisticRegression(C= 20, solver = 'newton-cg', max_iter = 100)
model_lr_scale.fit(X_train_scaled, y_train)

y_pred_lr_s = model_lr_scale.predict(X_test_scaled)

confusion_df = pd.DataFrame(
    confusion_matrix(y_test, y_pred_lr_s),
    index=["Actually Ben.", "Actually Mal.",],
    columns=["Predicted Ben.", "Predicted Mal.",],
)


print('Training Score: {}'.format(model_lr_scale.score(X_train_scaled, y_train)))
print('Test Score: {}'.format(model_lr_scale.score(X_test_scaled, y_test)))
print(classification_report(y_test, y_pred_lr_s))
print(confusion_df)

Wall time: 0 ns
Training Score: 0.989010989010989
Test Score: 0.9736842105263158
              precision    recall  f1-score   support

           0       0.97      0.99      0.98        72
           1       0.98      0.95      0.96        42

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114

               Predicted Ben.  Predicted Mal.
Actually Ben.              71               1
Actually Mal.               2              40


# RandomForestClassifier

## Unscaled

In [47]:
rfc_grid = {
    'n_estimators': [5, 10, 50],
    'criterion': ['gini', 'entropy'],
    'max_depth': [5, 10, 20],
    'min_samples_leaf': [5, 20, 50]
}

model_rfc_grid = GridSearchCV(RandomForestClassifier(), param_grid = rfc_grid, verbose = 1, n_jobs = -1)
model_rfc_grid.fit(X_train, y_train)

print(model_rfc_grid.best_params_)

Fitting 5 folds for each of 54 candidates, totalling 270 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  56 tasks      | elapsed:    3.8s
[Parallel(n_jobs=-1)]: Done 263 out of 270 | elapsed:    9.2s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done 270 out of 270 | elapsed:    9.3s finished


{'criterion': 'entropy', 'max_depth': 5, 'min_samples_leaf': 5, 'n_estimators': 50}


In [44]:
%time
model_rfc = RandomForestClassifier(criterion = 'entropy', 
                                   max_depth = 5, 
                                   n_estimators = 10, 
                                   min_samples_leaf = 5, 
                                   random_state = 20)
model_rfc.fit(X_train, y_train)

y_pred_rfc = model_rfc.predict(X_test)

confusion_df = pd.DataFrame(
    confusion_matrix(y_test, y_pred_rfc),
    index=["Actually Ben.", "Actually Mal.",],
    columns=["Predicted Ben.", "Predicted Mal.",],
)

print('Training Score: {}'.format(model_rfc.score(X_train, y_train)))
print('Test Score: {}'.format(model_rfc.score(X_test, y_test)))
print(classification_report(y_test, y_pred_rfc))
print(confusion_df)

Wall time: 999 µs
Training Score: 0.9758241758241758
Test Score: 0.9649122807017544
              precision    recall  f1-score   support

           0       0.99      0.96      0.97        72
           1       0.93      0.98      0.95        42

    accuracy                           0.96       114
   macro avg       0.96      0.97      0.96       114
weighted avg       0.97      0.96      0.97       114

               Predicted Ben.  Predicted Mal.
Actually Ben.              69               3
Actually Mal.               1              41


## Scaled

In [41]:
model_rfc_grid_s = GridSearchCV(RandomForestClassifier(), param_grid = rfc_grid, verbose = 1, n_jobs = -1)
model_rfc_grid_s.fit(X_train_scaled, y_train)

print(model_rfc_grid_s.best_params_)

Fitting 5 folds for each of 54 candidates, totalling 270 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 128 tasks      | elapsed:    2.8s


{'criterion': 'entropy', 'max_depth': 5, 'min_samples_leaf': 5, 'n_estimators': 10}


[Parallel(n_jobs=-1)]: Done 270 out of 270 | elapsed:    6.4s finished


In [42]:
%time
model_rfc_scale = RandomForestClassifier(criterion = 'entropy', 
                                         max_depth = 5, 
                                         n_estimators = 50, 
                                         min_samples_leaf = 5,
                                        random_state = 20)
model_rfc_scale.fit(X_train_scaled, y_train)

y_pred_rfc_s = model_rfc_scale.predict(X_test_scaled)

confusion_df = pd.DataFrame(
    confusion_matrix(y_test, y_pred_rfc_s),
    index=["Actually Ben.", "Actually Mal.",],
    columns=["Predicted Ben.", "Predicted Mal.",],
)

print('Training Score: {}'.format(model_rfc_scale.score(X_train_scaled, y_train)))
print('Test Score: {}'.format(model_rfc_scale.score(X_test_scaled, y_test)))
print(classification_report(y_test, y_pred_rfc_s))
print(confusion_df)

Wall time: 0 ns
Training Score: 0.989010989010989
Test Score: 0.9736842105263158
              precision    recall  f1-score   support

           0       0.99      0.97      0.98        72
           1       0.95      0.98      0.96        42

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114

               Predicted Ben.  Predicted Mal.
Actually Ben.              70               2
Actually Mal.               1              41


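A Random Forest with scaled data gives a test accuracy of roughly 97%, against roughly 96% unscaled. Because tree splits do not depend on feature scale, this small difference most likely reflects the different `n_estimators` used in the two fitted models (10 unscaled, 50 scaled) rather than the scaling itself. The scaled model overfits slightly, as Random Forests are prone to do: training accuracy is about 1.5 percentage points above test accuracy.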

# KNearestNeighbors

## Unscaled

In [45]:
knn_grid = {
    'n_neighbors': [5, 10, 50],
    'weights': ['uniform', 'distance'],
}

model_knn_grid = GridSearchCV(KNeighborsClassifier(), param_grid = knn_grid, verbose = 1, n_jobs = -1)
model_knn_grid.fit(X_train, y_train)

print(model_knn_grid.best_params_)

Fitting 5 folds for each of 6 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.


{'n_neighbors': 10, 'weights': 'distance'}


[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:    2.7s finished


In [49]:
model_knn = KNeighborsClassifier(weights = 'distance', n_neighbors = 10)
model_knn.fit(X_train, y_train)

y_pred_knn = model_knn.predict(X_test)

confusion_df = pd.DataFrame(
    confusion_matrix(y_test, y_pred_knn),
    index=["Actually 0", "Actually 1",],
    columns=["Predicted 0", "Predicted 1",],
)


print('Training Score: {}'.format(model_knn.score(X_train, y_train)))
print('Test Score: {}'.format(model_knn.score(X_test, y_test)))
print(classification_report(y_test, y_pred_knn))
print(confusion_df)

Training Score: 1.0
Test Score: 0.9298245614035088
              precision    recall  f1-score   support

           0       0.91      0.99      0.95        72
           1       0.97      0.83      0.90        42

    accuracy                           0.93       114
   macro avg       0.94      0.91      0.92       114
weighted avg       0.93      0.93      0.93       114

            Predicted 0  Predicted 1
Actually 0           71            1
Actually 1            7           35


## Scaled

In [50]:
model_knn_grid = GridSearchCV(KNeighborsClassifier(), param_grid = knn_grid, verbose = 1, n_jobs = -1)
model_knn_grid.fit(X_train_scaled, y_train)

print(model_knn_grid.best_params_)

Fitting 5 folds for each of 6 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.


{'n_neighbors': 10, 'weights': 'distance'}


[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:    3.5s finished


In [54]:
model_knn_scale = KNeighborsClassifier(weights = 'distance', n_neighbors = 10)
model_knn_scale.fit(X_train_scaled, y_train)

y_pred_knn_s = model_knn_scale.predict(X_test_scaled)

confusion_df = pd.DataFrame(
    confusion_matrix(y_test, y_pred_knn_s),
    index=["Actually 0", "Actually 1",],
    columns=["Predicted 0", "Predicted 1",],
)

# Score the model fitted on the scaled data (model_knn was fitted on unscaled data).
print('Training Score: {}'.format(model_knn_scale.score(X_train_scaled, y_train)))
print('Test Score: {}'.format(model_knn_scale.score(X_test_scaled, y_test)))
print(classification_report(y_test, y_pred_knn_s))
print(confusion_df)

Training Score: 1.0
Test Score: 0.956140350877193
              precision    recall  f1-score   support

           0       0.96      0.97      0.97        72
           1       0.95      0.93      0.94        42

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114

            Predicted 0  Predicted 1
Actually 0           70            2
Actually 1            3           39


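KNeighborsClassifier with unscaled data gives about 93% test accuracy and a recall of only 0.83, the worst of the models tested so far. Scaling helps considerably: the confusion matrix for the scaled model shows 109 of 114 test cases classified correctly (about 96%), and its recall of 0.93 comes close to that of the other models.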
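
One likely reason scaling matters so much for a distance-based model such as KNN is that features with large numeric ranges dominate the Euclidean distance. The sketch below illustrates this with `area_mean` and `smoothness_mean` from the two rows shown by `cancer.head(2)` earlier; restricting the distance to these two features is a simplification for illustration.

```python
import math

# area_mean and smoothness_mean for the two rows shown by cancer.head(2).
row_a = {"area_mean": 1001.0, "smoothness_mean": 0.1184}
row_b = {"area_mean": 1326.0, "smoothness_mean": 0.08474}

# Squared difference contributed by each feature to the Euclidean distance.
squared_diffs = {k: (row_a[k] - row_b[k]) ** 2 for k in row_a}
distance = math.sqrt(sum(squared_diffs.values()))
area_share = squared_diffs["area_mean"] / sum(squared_diffs.values())

print(f"Euclidean distance: {distance:.4f}")
print(f"Share of squared distance from area_mean: {area_share:.10f}")
```

On raw values the distance is almost entirely determined by `area_mean`, so smaller-scale features have next to no say in which neighbors are chosen until the data are standardized.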

# Support Vector Classifier

## Unscaled

In [55]:
svc_grid = {
    'kernel': ['linear', 'poly', 'rbf'],
    'C': [1, 10, 20],
    'degree': [2, 3, 5]
}

model_svc_grid = GridSearchCV(SVC(), param_grid = svc_grid, verbose = 1, n_jobs = -1)
model_svc_grid.fit(X_train, y_train)

print(model_svc_grid.best_params_)

Fitting 5 folds for each of 27 candidates, totalling 135 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   12.1s
[Parallel(n_jobs=-1)]: Done 135 out of 135 | elapsed:  1.3min finished


{'C': 10, 'degree': 2, 'kernel': 'linear'}


In [58]:
%time
model_svc = SVC(kernel = 'linear', C = 10)
model_svc.fit(X_train, y_train)

y_pred_svc = model_svc.predict(X_test)

confusion_df = pd.DataFrame(
    confusion_matrix(y_test, y_pred_svc),
    index=["Actually Ben.", "Actually Mal.",],
    columns=["Predicted Ben.", "Predicted Mal.",],
)

print(model_svc.score(X_train, y_train))
print(model_svc.score(X_test, y_test))
print(classification_report(y_test, y_pred_svc))
print(confusion_df)

Wall time: 0 ns
0.978021978021978
0.956140350877193
              precision    recall  f1-score   support

           0       0.95      0.99      0.97        72
           1       0.97      0.90      0.94        42

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114

               Predicted Ben.  Predicted Mal.
Actually Ben.              71               1
Actually Mal.               4              38


## Scaled

In [59]:
model_svc_grid_s = GridSearchCV(SVC(), param_grid = svc_grid, verbose = 1, n_jobs = -1)
model_svc_grid_s.fit(X_train_scaled, y_train)

print(model_svc_grid_s.best_params_)

Fitting 5 folds for each of 27 candidates, totalling 135 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.


{'C': 10, 'degree': 2, 'kernel': 'rbf'}


[Parallel(n_jobs=-1)]: Done 135 out of 135 | elapsed:    0.3s finished


In [65]:
model_svc_scale = SVC(kernel = 'rbf', C = 1)
model_svc_scale.fit(X_train_scaled, y_train)

y_pred_svc_s = model_svc_scale.predict(X_test_scaled)

confusion_df = pd.DataFrame(
    confusion_matrix(y_test, y_pred_svc_s),
    index=["Actually Ben.", "Actually Mal.",],
    columns=["Predicted Ben.", "Predicted Mal.",],
)

print('Training Score: {}'.format(model_svc_scale.score(X_train_scaled, y_train)))
print('Test Score: {}'.format(model_svc_scale.score(X_test_scaled, y_test)))
print(classification_report(y_test, y_pred_svc_s))
print(confusion_df)

Training Score: 0.989010989010989
Test Score: 0.9824561403508771
              precision    recall  f1-score   support

           0       0.99      0.99      0.99        72
           1       0.98      0.98      0.98        42

    accuracy                           0.98       114
   macro avg       0.98      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114

               Predicted Ben.  Predicted Mal.
Actually Ben.              71               1
Actually Mal.               1              41


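Scaling gives better results for SVC: about 98% test accuracy with a recall of 0.98, compared with about 96% and 0.90 unscaled. Note that the fitted model uses `C = 1`, not the `C = 10` selected by the grid search.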

This model is the best so far.

# Boosting

## Unscaled

In [66]:
xgb_grid = {
    "learning_rate": [0.01, 0.1, 0.5],
    "n_estimators": [50, 100, 150],
    "max_features": [0.5, 0.7, 0.9],
    "subsample": [0.7, 0.9],
    "max_depth": [3, 5],
}

model_xgb_grid = GridSearchCV(XGBClassifier(), param_grid = xgb_grid, verbose = 1, n_jobs = -1)
model_xgb_grid.fit(X_train, y_train)

print(model_xgb_grid.best_params_)

Fitting 5 folds for each of 108 candidates, totalling 540 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   11.6s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   22.0s
[Parallel(n_jobs=-1)]: Done 504 tasks      | elapsed:   32.9s


Parameters: { max_features } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


{'learning_rate': 0.1, 'max_depth': 3, 'max_features': 0.5, 'n_estimators': 100, 'subsample': 0.7}


[Parallel(n_jobs=-1)]: Done 540 out of 540 | elapsed:   33.8s finished
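
The "might not be used" warning above indicates that XGBoost does not recognise `max_features`, so that key triples the search without changing any model. If the intent was to subsample columns, XGBoost's own parameter for that is `colsample_bytree`. The sketch below shows a possible alternative grid under that assumption; `xgb_grid_alt` is a hypothetical name and this is not the grid the results in this notebook were produced with.

```python
from itertools import product

# Same grid as above, with max_features swapped for colsample_bytree.
xgb_grid_alt = {
    "learning_rate": [0.01, 0.1, 0.5],
    "n_estimators": [50, 100, 150],
    "colsample_bytree": [0.5, 0.7, 0.9],  # assumption: intended meaning of max_features
    "subsample": [0.7, 0.9],
    "max_depth": [3, 5],
}

n_candidates = len(list(product(*xgb_grid_alt.values())))
# Simply dropping the ignored key instead would leave 3 * 3 * 2 * 2 combinations.
n_without = n_candidates // len(xgb_grid_alt["colsample_bytree"])

print(n_candidates, "candidates with colsample_bytree,", n_without, "without it")

# It would be passed to the search the same way as before, for example:
# GridSearchCV(XGBClassifier(), param_grid=xgb_grid_alt, verbose=1, n_jobs=-1)
```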


In [68]:
model_xgb = XGBClassifier(learning_rate = 0.5, 
                          max_depth = 3, 
                          max_features = 0.5, 
                          n_estimators = 100, 
                          subsample = 0.9)
model_xgb.fit(X_train, y_train)
y_pred_xgb = model_xgb.predict(X_test)

confusion_df = pd.DataFrame(
    confusion_matrix(y_test, y_pred_xgb),
    index=["Actually Ben.", "Actually Mal.",],
    columns=["Predicted Ben.", "Predicted Mal.",],
)


print('Training Score: {}'.format(model_xgb.score(X_train, y_train)))
print('Test Score: {}'.format(model_xgb.score(X_test, y_test)))

print(classification_report(y_test, y_pred_xgb))
print(confusion_df)

Parameters: { max_features } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


Training Score: 1.0
Test Score: 0.9824561403508771
              precision    recall  f1-score   support

           0       0.97      1.00      0.99        72
           1       1.00      0.95      0.98        42

    accuracy                           0.98       114
   macro avg       0.99      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114

               Predicted Ben.  Predicted Mal.
Actually Ben.              72               0
Actually Mal.               2              40


## Scaled

In [69]:
model_xgb_grids = GridSearchCV(XGBClassifier(), param_grid = xgb_grid, verbose = 1, n_jobs = -1)
model_xgb_grids.fit(X_train_scaled, y_train)

print(model_xgb_grids.best_params_)

Fitting 5 folds for each of 108 candidates, totalling 540 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    5.4s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   13.7s
[Parallel(n_jobs=-1)]: Done 504 tasks      | elapsed:   23.5s


Parameters: { max_features } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


{'learning_rate': 0.1, 'max_depth': 3, 'max_features': 0.5, 'n_estimators': 100, 'subsample': 0.7}


[Parallel(n_jobs=-1)]: Done 540 out of 540 | elapsed:   24.2s finished


In [70]:
model_xgb_scale = XGBClassifier(learning_rate = 0.1, 
                          max_depth = 3, 
                          max_features = 0.5, 
                          n_estimators = 100, 
                          subsample = 0.9)
model_xgb_scale.fit(X_train_scaled, y_train)
y_pred_xgb_s = model_xgb_scale.predict(X_test_scaled)

confusion_df = pd.DataFrame(
    confusion_matrix(y_test, y_pred_xgb_s),
    index=["Actually Ben.", "Actually Mal.",],
    columns=["Predicted Ben.", "Predicted Mal.",],
)

print('Training Score: {}'.format(model_xgb_scale.score(X_train_scaled, y_train)))
print('Test Score: {}'.format(model_xgb_scale.score(X_test_scaled, y_test)))

print(classification_report(y_test, y_pred_xgb_s))
print(confusion_df)

Parameters: { max_features } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


Training Score: 1.0
Test Score: 0.9649122807017544
              precision    recall  f1-score   support

           0       0.97      0.97      0.97        72
           1       0.95      0.95      0.95        42

    accuracy                           0.96       114
   macro avg       0.96      0.96      0.96       114
weighted avg       0.96      0.96      0.96       114

               Predicted Ben.  Predicted Mal.
Actually Ben.              70               2
Actually Mal.               2              40
