## Task: Diagnosing breast cancer

We found breast cancer patient data on the internet. Can we train a model that makes the diagnosis?
 
=> Source of the data: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

The goal is to predict the type of tumor (column: `diagnosis`). A well-performing model could help avoid some unnecessary surgeries, or at least put many worried patients at ease.

- **M:** Malignant (cancerous) tumor
- **B:** Benign (non-cancerous) tumor

**Task:**
 - Train a model.
 - The `diagnosis` column cannot be predicted directly: it contains the labels `M` and `B` and must first be converted into a column of 0 and 1 values.
 - Evaluate several classification models. Which one works best?
 - Note that you may need to scale the data for some models.
 - Make sure to set aside test data - only use it at the very end to determine accuracy on never-before-seen data.
   - **Important:** For the test accuracy to be meaningful, we may only measure it *after* the model has been selected and its parameters tuned!
 - The `id` column is likely not informative for our model. You should consider removing this column from the data.
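
The steps in the task list can be sketched as follows. This is only one possible approach, not the reference solution. It assumes that scikit-learn's bundled `load_breast_cancer` dataset may stand in for `cancer.csv` (its column names differ slightly and it has no `id` column), and the three candidate models are an arbitrary selection for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled copy of the data stands in for cancer.csv.
# Its target is coded 0 = malignant, 1 = benign, so flip it to 1 = malignant.
data = load_breast_cancer(as_frame=True)
X = data.data
y = (data.target == 0).astype(int)

# Set the test data aside before doing anything else.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42, stratify=y
)

# Scale-sensitive models get a StandardScaler in front; the forest does not need one.
candidates = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000)),
    "k-nearest neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "random forest": RandomForestClassifier(random_state=42),
}

# Compare the candidates by cross-validation on the training data only.
cv_scores = {
    name: cross_val_score(estimator, X_train, y_train, cv=5).mean()
    for name, estimator in candidates.items()
}
for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")

# Only now, with the model chosen, touch the test data once.
best_name = max(cv_scores, key=cv_scores.get)
best_model = candidates[best_name].fit(X_train, y_train)
test_accuracy = best_model.score(X_test, y_test)
print(f"selected {best_name}, test accuracy {test_accuracy:.3f}")
```

Putting the scaler inside the pipeline means it is re-fitted on each training fold, so no information from the validation fold leaks into the scaling.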

**Bonus questions:**
 - For some models, PCA (Principal Component Analysis) might be worthwhile.
 - How many columns does the model actually need? Can it work with fewer columns? This would mean our doctor would need to measure and enter fewer values into the computer, saving valuable time!
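
Both bonus questions can be explored with the same loop. The sketch below again assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for `cancer.csv`; the helper `cv_accuracy` and the tested values of `k` are illustrative choices, and `SelectKBest` with `f_classif` is just one of several ways to pick columns.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: bundled dataset instead of cancer.csv; 1 = malignant after the flip.
data = load_breast_cancer(as_frame=True)
X = data.data
y = (data.target == 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42, stratify=y
)


def cv_accuracy(reducer):
    """Mean 5-fold accuracy of scaler -> reducer -> logistic regression (hypothetical helper)."""
    pipeline = make_pipeline(StandardScaler(), reducer, LogisticRegression(max_iter=10000))
    return cross_val_score(pipeline, X_train, y_train, cv=5).mean()


results = {}
for k in (2, 5, 10, 30):
    # PCA: k linear combinations of all 30 columns.
    results[("pca", k)] = cv_accuracy(PCA(n_components=k))
    # SelectKBest: the k original columns with the highest ANOVA F-score.
    results[("select", k)] = cv_accuracy(SelectKBest(f_classif, k=k))

for (method, k), score in results.items():
    print(f"{method:>6} k={k:<2} cv accuracy {score:.3f}")
```

Note the difference between the two: PCA shrinks the model's input, but every component is computed from all original columns, so the doctor would still have to measure everything. Only column selection reduces the number of values that need to be entered.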

In [16]:
# Matplotlib config

%matplotlib inline
%config InlineBackend.figure_formats = ['svg']
%config InlineBackend.rc = {'figure.figsize': (5.0, 4.0)}

import pandas as pd
import numpy as np

df = pd.read_csv("cancer.csv")
df.drop("id", axis = 1, inplace = True)   # the id carries no diagnostic information
df["diagnosis"] = df["diagnosis"] == "M"  # True = malignant, False = benign

df.head()


| | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | True | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | True | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | True | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | True | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | True | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


In [19]:
from sklearn.model_selection import train_test_split, GridSearchCV, RepeatedKFold
from sklearn.linear_model import LogisticRegression

X = df.drop("diagnosis", axis = 1)
y = df["diagnosis"]

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state = 42)

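# The grid holds a single value, so GridSearchCV acts as plain repeated k-fold cross-validation here.
model = GridSearchCV(LogisticRegression(), param_grid = {'max_iter': [10000]}, cv = RepeatedKFold())
model.fit(X_train, y_train)

print(model.best_score_)            # mean cross-validated accuracy on the training data
print(model.score(X_test, y_test))  # accuracy on the held-out test data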

0.9472527472527471
0.956140350877193


In [20]:
# Absolute correlation of every column with the diagnosis, strongest first
df.corr()["diagnosis"].abs().sort_values(ascending = False)

diagnosis                  1.000000
concave points_worst       0.793566
perimeter_worst            0.782914
concave points_mean        0.776614
radius_worst               0.776454
perimeter_mean             0.742636
area_worst                 0.733825
radius_mean                0.730029
area_mean                  0.708984
concavity_mean             0.696360
concavity_worst            0.659610
compactness_mean           0.596534
compactness_worst          0.590998
radius_se                  0.567134
perimeter_se               0.556141
area_se                    0.548236
texture_worst              0.456903
smoothness_worst           0.421465
symmetry_worst             0.416294
texture_mean               0.415185
concave points_se          0.408042
smoothness_mean            0.358560
symmetry_mean              0.330499
fractal_dimension_worst    0.323872
compactness_se             0.292999
concavity_se               0.253730
fractal_dimension_se       0.077972
smoothness_se              0

In [24]:
from sklearn.model_selection import train_test_split, GridSearchCV, RepeatedKFold
from sklearn.linear_model import LogisticRegression

# Instead of all columns (X = df.drop("diagnosis", axis = 1)), keep only four
# columns that correlate strongly with the diagnosis.
X = df[["concave points_worst", "perimeter_worst", "perimeter_mean", "radius_mean"]]
y = df["diagnosis"]

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state = 42)

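# Same setup as before: single-value grid, repeated k-fold cross-validation.
model = GridSearchCV(LogisticRegression(), param_grid = {'max_iter': [10000]}, cv = RepeatedKFold())
model.fit(X_train, y_train)

print(model.best_score_)            # mean cross-validated accuracy on the training data
print(model.score(X_test, y_test))  # accuracy on the held-out test data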

0.9296703296703295
0.9736842105263158
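
The correlation ranking used for the column choice was computed on the full DataFrame, which includes the rows that later serve as test data. To keep the test set truly unseen, the ranking can be computed on the training rows only. A minimal sketch, assuming scikit-learn's bundled `load_breast_cancer` data in place of `cancer.csv` (so the column names differ, for example `worst concave points` instead of `concave points_worst`) and an arbitrary cutoff of four columns:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Assumption: bundled dataset instead of cancer.csv; 1 = malignant after the flip.
data = load_breast_cancer(as_frame=True)
X = data.data
y = (data.target == 0).astype(int).rename("diagnosis")

# Split first, so the test rows cannot influence the column choice.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42
)

# Rank columns by absolute correlation with the diagnosis, using training rows only.
ranking = X_train.corrwith(y_train).abs().sort_values(ascending=False)
top_columns = ranking.head(4).index.tolist()
print(ranking.head(4))

# Apply the same column choice to both splits.
X_train_small = X_train[top_columns]
X_test_small = X_test[top_columns]
```

The reduced frames `X_train_small` and `X_test_small` can then be passed to the same `GridSearchCV` setup as in the cell above.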
