# Predicting Breast Cancer Diagnosis Using Logistic Regression

The dataset comes from the UCI Machine Learning Repository and was downloaded from [Kaggle](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data).

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

Ten real-valued features are computed for each cell nucleus:

a) **radius** (mean of distances from center to points on the perimeter)<br>
b) **texture** (standard deviation of gray-scale values)<br>
c) **perimeter**<br>
d) **area**<br>
e) **smoothness** (local variation in radius lengths)<br>
f) **compactness** (perimeter^2 / area - 1.0)<br>
g) **concavity** (severity of concave portions of the contour)<br>
h) **concave points** (number of concave portions of the contour)<br>
i) **symmetry**<br>
j) **fractal dimension** ("coastline approximation" - 1)<br>

Column names ending in "se" or "worst" hold the standard error and the "worst" value (the mean of the three largest values) of that feature, respectively; names ending in "mean" hold its mean.
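As a quick illustration of this naming scheme, the sketch below rebuilds the 30 feature column names from the ten base features and the three suffixes. The spellings `concave points` (with a space) and `fractal_dimension` (with an underscore) follow the column names that appear later in this notebook; the assumption that all "mean" columns come first, then "se", then "worst" is inferred from the row indices shown in the coefficient tables.

```python
# Ten base features, spelled as they appear in the CSV column names.
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
suffixes = ["mean", "se", "worst"]

# Assumed order: all "_mean" columns, then "_se", then "_worst".
feature_columns = [
    f"{feat}_{suffix}" for suffix in suffixes for feat in base_features
]

print(len(feature_columns))
print(feature_columns[0], feature_columns[10], feature_columns[27])
```

Under that ordering, position 27 is `concave points_worst`, which matches the index shown next to that feature in the coefficient tables.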

The target is the binary "diagnosis" column (B = benign, M = malignant).

# Summary

#### LogisticRegression
    * Unscaled
        Test accuracy: 0.9561
        Recall (malignant): 0.90
    * Scaled
        Test accuracy: 0.9737
        Recall (malignant): 0.95

    * Unscaled after dropping low-importance columns
        Test accuracy: 0.9561
        Recall (malignant): 0.90
    * Scaled after dropping low-importance columns
        Test accuracy: 0.9737
        Recall (malignant): 0.98

Scaling the features with StandardScaler before fitting LogisticRegression gives about 97% test accuracy in identifying malignant breast masses. Dropping low-importance columns does not improve accuracy for either model, but it does raise the recall of the scaled model from 0.95 to 0.98. These models can require a high number of iterations to converge, so they may be computationally expensive.
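The accuracy and recall figures in this summary can be recomputed by hand from the confusion matrices printed further down in the notebook. The sketch below does that arithmetic, treating malignant as the positive class; the helper names `accuracy` and `recall` are introduced here for illustration only.

```python
# Confusion-matrix counts copied from the notebook outputs, as
# (true negatives, false positives, false negatives, true positives),
# with malignant (1) as the positive class.
results = {
    "unscaled": (71, 1, 4, 38),
    "scaled": (71, 1, 2, 40),
    "unscaled, columns dropped": (71, 1, 4, 38),
    "scaled, columns dropped": (70, 2, 1, 41),
}


def accuracy(tn, fp, fn, tp):
    """Share of all test samples that were classified correctly."""
    return (tn + tp) / (tn + fp + fn + tp)


def recall(tn, fp, fn, tp):
    """Share of truly malignant samples that were predicted malignant."""
    return tp / (tp + fn)


for name, counts in results.items():
    print(f"{name}: accuracy={accuracy(*counts):.4f}, recall={recall(*counts):.2f}")
```

Both scaled models classify 111 of the 114 test samples correctly; the model with dropped columns trades one extra false positive for one fewer false negative, which is why its recall is higher at the same accuracy.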

In [1]:
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import warnings
import matplotlib.pyplot as plt
import numpy as np

from sklearn.linear_model import LogisticRegression

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report

In [2]:
cancer = pd.read_csv('breast_cancer.csv')

cancer = cancer.drop(['Unnamed: 32', 'id'], axis = 1)

In [3]:
diag_map = {'B':0, 'M': 1}

cancer['diagnosis'] = cancer['diagnosis'].map(diag_map)

In [4]:
X = cancer.drop(columns = 'diagnosis')
y = cancer['diagnosis']

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 20, stratify = y)
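To show what `stratify = y` does, here is a small sketch on synthetic labels only. It assumes the file contains 357 benign and 212 malignant rows (569 in total); those counts are not printed anywhere in this notebook, so treat them as an assumption. The `_demo` variable names are hypothetical and chosen so they do not overwrite the notebook's own variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumed class counts: 357 benign (0) and 212 malignant (1).
y_demo = np.array([0] * 357 + [1] * 212)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature, not real data

X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=20, stratify=y_demo
)

# Stratification keeps the malignant share nearly identical in both splits.
print(len(y_tr_demo), len(y_te_demo))
print(np.bincount(y_te_demo))
print(y_demo.mean(), y_tr_demo.mean(), y_te_demo.mean())
```

With these counts the test split holds 114 rows, which agrees with the `support` column in the classification reports.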

In [6]:
scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
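The fit-then-transform pattern above matters because the test set is scaled with statistics learned from the training set only. The following NumPy sketch reproduces that arithmetic on hypothetical toy data; it is meant as an illustration of the idea, not as a replacement for `StandardScaler`.

```python
import numpy as np

# Hypothetical toy data: 4 training rows and 2 test rows, 2 features.
X_train_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
X_test_toy = np.array([[2.5, 250.0], [5.0, 500.0]])

# "fit": per-column mean and population standard deviation (ddof=0),
# computed from the training rows only.
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)

# "transform": apply the same training statistics to both splits.
X_train_z = (X_train_toy - mean) / std
X_test_z = (X_test_toy - mean) / std

print(X_train_z.mean(axis=0))  # approximately 0 for each column
print(X_train_z.std(axis=0))   # approximately 1 for each column
print(X_test_z)                # test rows are not forced to mean 0 / std 1
```

After scaling, both toy columns carry identical values even though their raw magnitudes differ by a factor of 100, which is the property that makes coefficients comparable across features.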

## Unscaled

In [8]:
lr_grid = {
    'C': [0.1, 1, 10, 20],
    'solver': ['newton-cg', 'lbfgs', 'liblinear','sag', 'saga'],
    'max_iter': [100, 1000, 10000]
}

model_lr_grid = GridSearchCV(LogisticRegression(max_iter = 1000), param_grid = lr_grid, verbose = 1, n_jobs = -1)
model_lr_grid.fit(X_train, y_train)

print(model_lr_grid.best_params_)

Fitting 5 folds for each of 60 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 165 tasks      | elapsed:   10.8s


{'C': 20, 'max_iter': 100, 'solver': 'newton-cg'}


[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:   24.7s finished
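The "60 candidates, totalling 300 fits" line in the log follows directly from the size of the grid. A minimal sketch of that count, using only the standard library (the names `lr_grid_demo` and `n_folds` are illustrative):

```python
from itertools import product

lr_grid_demo = {
    "C": [0.1, 1, 10, 20],
    "solver": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
    "max_iter": [100, 1000, 10000],
}

# Every combination of one value per parameter is one candidate.
candidates = [
    dict(zip(lr_grid_demo, combo)) for combo in product(*lr_grid_demo.values())
]

n_folds = 5  # matches the "Fitting 5 folds" line in the log
print(len(candidates))            # 4 * 5 * 3 candidates
print(len(candidates) * n_folds)  # one fit per candidate per fold
```

Adding a fourth `max_iter` value, as the later grid does, raises this to 80 candidates and 400 fits.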


In [9]:
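%time
# Note: a bare `%time` line times an empty statement (hence "Wall time: 0 ns" in the output);
# putting `%%time` on the first line of the cell would time the whole cell instead.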
model_lr = LogisticRegression(C= 20, solver = 'newton-cg', max_iter = 100)
model_lr.fit(X_train, y_train)

y_pred_lr = model_lr.predict(X_test)

confusion_df = pd.DataFrame(
    confusion_matrix(y_test, y_pred_lr),
    index=["Actually Ben.", "Actually Mal.",],
    columns=["Predicted Ben.", "Predicted Mal.",],
)


print('Training Score: {}'.format(model_lr.score(X_train, y_train)))
print('Test Score: {}'.format(model_lr.score(X_test, y_test)))
print(classification_report(y_test, y_pred_lr))
print(confusion_df)

Wall time: 0 ns
Training Score: 0.9736263736263736
Test Score: 0.956140350877193
              precision    recall  f1-score   support

           0       0.95      0.99      0.97        72
           1       0.97      0.90      0.94        42

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114

               Predicted Ben.  Predicted Mal.
Actually Ben.              71               1
Actually Mal.               4              38


In [13]:
coef = model_lr_grid.best_estimator_.coef_[0]
im_df = pd.DataFrame({"feat": X_train.columns, "coef_sq": coef ** 2})
im_df.sort_values("coef_sq", ascending=False)

| | feat | coef_sq |
|---|---|---|
| 27 | concave points_worst | 22.96185 |
| 28 | symmetry_worst | 18.73375 |
| 26 | concavity_worst | 18.18486 |
| 0 | radius_mean | 16.14987 |
| 24 | smoothness_worst | 10.57438 |
| 6 | concavity_mean | 8.086143 |
| 11 | texture_se | 7.256004 |
| 7 | concave points_mean | 6.560612 |
| 8 | symmetry_mean | 4.850327 |
| 15 | compactness_se | 3.625108 |


## Scaled

In [11]:
model_lr_grid_s = GridSearchCV(LogisticRegression(), param_grid = lr_grid, verbose = 1, n_jobs = -1)
model_lr_grid_s.fit(X_train_scaled, y_train)
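# Report the scaled search; the output recorded here was printed from model_lr_grid (the unscaled search).
print(model_lr_grid_s.best_params_)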

Fitting 5 folds for each of 60 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 205 tasks      | elapsed:    2.4s


{'C': 20, 'max_iter': 100, 'solver': 'newton-cg'}


[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:    7.3s finished


In [17]:
%time
model_lr_scale = LogisticRegression(C= 20, solver = 'newton-cg', max_iter = 100)
model_lr_scale.fit(X_train_scaled, y_train)

y_pred_lr_s = model_lr_scale.predict(X_test_scaled)

confusion_df = pd.DataFrame(
    confusion_matrix(y_test, y_pred_lr_s),
    index=["Actually Ben.", "Actually Mal.",],
    columns=["Predicted Ben.", "Predicted Mal.",],
)


print('Training Score: {}'.format(model_lr_scale.score(X_train_scaled, y_train)))
print('Test Score: {}'.format(model_lr_scale.score(X_test_scaled, y_test)))
print(classification_report(y_test, y_pred_lr_s))
print(confusion_df)

Wall time: 0 ns
Training Score: 0.989010989010989
Test Score: 0.9736842105263158
              precision    recall  f1-score   support

           0       0.97      0.99      0.98        72
           1       0.98      0.95      0.96        42

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114

               Predicted Ben.  Predicted Mal.
Actually Ben.              71               1
Actually Mal.               2              40


In [18]:
coef = model_lr_grid_s.best_estimator_.coef_[0]
im_df = pd.DataFrame({"feat": X_train.columns, "coef_sq": coef **2})
im_df.sort_values("coef_sq", ascending=False)

| | feat | coef_sq |
|---|---|---|
| 10 | radius_se | 0.968698 |
| 21 | texture_worst | 0.897413 |
| 20 | radius_worst | 0.716231 |
| 27 | concave points_worst | 0.664492 |
| 26 | concavity_worst | 0.650671 |
| 23 | area_worst | 0.62061 |
| 22 | perimeter_worst | 0.568371 |
| 13 | area_se | 0.553395 |
| 7 | concave points_mean | 0.505014 |
| 24 | smoothness_worst | 0.492587 |


# Dropping low-importance columns

Threshold = 0.1
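Assuming the threshold is applied to the squared coefficients (`coef_sq`) computed earlier, the list of columns to drop could be derived programmatically instead of typed by hand. The sketch below uses hypothetical feature names and coefficient values purely for illustration; it is not the notebook's actual selection.

```python
import numpy as np
import pandas as pd

# Hypothetical feature names and coefficients, for illustration only.
features = ["radius_mean", "texture_mean", "area_se", "symmetry_worst"]
coef = np.array([1.2, -0.1, 0.25, -0.8])

im_df_demo = pd.DataFrame({"feat": features, "coef_sq": coef ** 2})

# Keep the names of all features whose squared coefficient is below the threshold.
threshold = 0.1
low_importance = im_df_demo.loc[im_df_demo["coef_sq"] < threshold, "feat"].tolist()

print(low_importance)
```

The resulting list could then be passed to `cancer.drop(columns=low_importance)`, which avoids typos and duplicate entries in a hand-written list.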

## Unscaled

In [27]:
cancer_dropped_u = cancer.drop(columns=['concave points_se',
                                        'symmetry_se',
                                        'compactness_mean',
                                        'texture_mean',
                                        'concavity_se',
                                        'fractal_dimension_mean',
                                        'area_mean',
                                        'perimeter_mean',
                                        'concave points_mean',
                                        'area_se',
                                        'perimeter_worst',
                                        'area_worst'])

X = cancer_dropped_u.drop(columns = 'diagnosis')

y = cancer_dropped_u['diagnosis']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 20, stratify = y)

In [28]:
lr_grid = {
    'C': [0.1, 1, 10, 20],
    'solver': ['newton-cg', 'lbfgs', 'liblinear','sag', 'saga'],
    'max_iter': [100, 1000, 10000, 100000]
}

model_lr_grid = GridSearchCV(LogisticRegression(max_iter = 1000), param_grid = lr_grid, verbose = 1, n_jobs = -1)
model_lr_grid.fit(X_train, y_train)

print(model_lr_grid.best_params_)

Fitting 5 folds for each of 80 candidates, totalling 400 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 165 tasks      | elapsed:    5.1s


{'C': 20, 'max_iter': 100, 'solver': 'newton-cg'}


[Parallel(n_jobs=-1)]: Done 400 out of 400 | elapsed:   16.7s finished


In [29]:
%time
model_lr_op = LogisticRegression(C= 20, solver = 'newton-cg', max_iter = 100)
model_lr_op.fit(X_train, y_train)

y_pred_lr = model_lr_op.predict(X_test)

confusion_df = pd.DataFrame(
    confusion_matrix(y_test, y_pred_lr),
    index=["Actually Ben.", "Actually Mal.",],
    columns=["Predicted Ben.", "Predicted Mal.",],
)


print('Training Score: {}'.format(model_lr_op.score(X_train, y_train)))
print('Test Score: {}'.format(model_lr_op.score(X_test, y_test)))
print(classification_report(y_test, y_pred_lr))
print(confusion_df)

Wall time: 0 ns
Training Score: 0.967032967032967
Test Score: 0.956140350877193
              precision    recall  f1-score   support

           0       0.95      0.99      0.97        72
           1       0.97      0.90      0.94        42

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114

               Predicted Ben.  Predicted Mal.
Actually Ben.              71               1
Actually Mal.               4              38


## Scaled

In [23]:
cancer_dropped_s = cancer.drop(columns=['concave points_se',
                                        'smoothness_se',
                                        'symmetry_se',
                                        'compactness_mean',
                                        'texture_se',
                                        'smoothness_mean',
                                        'compactness_worst',
                                        'concavity_se',
                                        'fractal_dimension_worst',
                                        'symmetry_mean'])

X = cancer_dropped_s.drop(columns = 'diagnosis')

y = cancer_dropped_s['diagnosis']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 20, stratify = y)

In [24]:
scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [25]:
model_lr_grid_s = GridSearchCV(LogisticRegression(), param_grid = lr_grid, verbose = 1, n_jobs = -1)
model_lr_grid_s.fit(X_train_scaled, y_train)
print(model_lr_grid_s.best_params_)

Fitting 5 folds for each of 80 candidates, totalling 400 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 280 tasks      | elapsed:    3.6s


{'C': 10, 'max_iter': 100, 'solver': 'saga'}


[Parallel(n_jobs=-1)]: Done 400 out of 400 | elapsed:    8.6s finished


In [30]:
%time
# Note: the grid search selected max_iter=100, while this refit uses max_iter=1000.
model_lr_scale_op = LogisticRegression(C= 10, solver = 'saga', max_iter = 1000)
model_lr_scale_op.fit(X_train_scaled, y_train)

y_pred_lr_s = model_lr_scale_op.predict(X_test_scaled)

confusion_df = pd.DataFrame(
    confusion_matrix(y_test, y_pred_lr_s),
    index=["Actually Ben.", "Actually Mal.",],
    columns=["Predicted Ben.", "Predicted Mal.",],
)


print('Training Score: {}'.format(model_lr_scale_op.score(X_train_scaled, y_train)))
print('Test Score: {}'.format(model_lr_scale_op.score(X_test_scaled, y_test)))
print(classification_report(y_test, y_pred_lr_s))
print(confusion_df)

Wall time: 0 ns
Training Score: 0.989010989010989
Test Score: 0.9736842105263158
              precision    recall  f1-score   support

           0       0.99      0.97      0.98        72
           1       0.95      0.98      0.96        42

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114

               Predicted Ben.  Predicted Mal.
Actually Ben.              70               2
Actually Mal.               1              41


