## Task: Diagnosing breast cancer

We have found patient data for breast cancer on the internet. Can we train a model that can make the diagnosis?
 
=> Source of the data: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

The goal is to predict the type of tumor (column: `diagnosis`). With a well-performing model, we could potentially avoid some surgeries or at least ease many patients' worries.

- **M:** Malignant tumor
- **B:** Benign tumor

**Task:**
 - Train a model.
 - You cannot directly predict the `diagnosis` column. First, it needs to be converted into a column with 0 and 1 values.
 - Evaluate several classification models. Which one works best?
 - Note that you may need to scale the data for some models.
 - Make sure to set aside test data - only use it at the very end to determine accuracy on never-before-seen data.
   - **Important:** To make this meaningful, we can only assess accuracy on the test data *after* selecting the model (and tuning the parameters)!
 - The `id` column is likely not informative for our model. You should consider removing this column from the data.
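
The steps in this list could be sketched roughly as follows. This is only one possible approach, not a reference solution. It assumes scikit-learn's bundled copy of the Wisconsin data (`load_breast_cancer`) as a stand-in for `cancer.csv`, and the four candidate models and their settings are arbitrary choices made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the bundled dataset stands in for cancer.csv (it has no id column).
data = load_breast_cancer(as_frame=True)
X = data.data
# The bundled target uses 0 = malignant, 1 = benign; flip it so that 1 = malignant.
y = (data.target == 0).astype(int)

# Set the test data aside first; it is not touched again in this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=42
)

# Scale-sensitive models get a StandardScaler inside a pipeline, so the scaler
# is fitted only on the training part of each cross-validation fold.
candidates = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "k-nearest neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "SVM (rbf)": make_pipeline(StandardScaler(), SVC()),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Compare the candidates by cross-validation on the training data only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_train, y_train, cv=cv)
    cv_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Putting the scaler inside the pipeline keeps information from the validation folds out of the scaling step, which would otherwise make the cross-validation scores slightly optimistic.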

**Bonus questions:**
 - For some models, PCA (Principal Component Analysis) might be worthwhile.
 - How many columns does the model actually need? Can it work with fewer columns? This would mean our doctor would need to measure and enter fewer values into the computer, saving valuable time!
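
One way to explore both bonus questions is sketched below. It again assumes scikit-learn's bundled copy of the dataset in place of `cancer.csv`. The helper `search`, the logistic regression classifier, and the list of candidate counts are illustrative choices, not part of the task.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for cancer.csv.
data = load_breast_cancer(as_frame=True)
X = data.data
y = (data.target == 0).astype(int)  # 1 = malignant, 0 = benign

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=42
)

def search(reducer, param_name, values):
    """Hypothetical helper: cross-validate scale -> reduce -> classify for each value."""
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("reduce", reducer),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    grid = GridSearchCV(pipe, param_grid={param_name: values}, cv=5)
    grid.fit(X_train, y_train)
    return grid

counts = [1, 2, 4, 8, 16, 30]
pca_search = search(PCA(), "reduce__n_components", counts)
kbest_search = search(SelectKBest(score_func=f_classif), "reduce__k", counts)

for label, result in [("PCA", pca_search), ("SelectKBest", kbest_search)]:
    for params, score in zip(result.cv_results_["params"], result.cv_results_["mean_test_score"]):
        print(f"{label} {params}: {score:.3f}")

# Names of the columns kept by the best SelectKBest configuration.
mask = kbest_search.best_estimator_.named_steps["reduce"].get_support()
selected_columns = list(X.columns[mask])
print(selected_columns)
```

Note that PCA reduces the number of inputs to the classifier, but each component is a combination of all original columns, so every value still has to be measured. Only column selection (here `SelectKBest`) actually reduces what the doctor has to enter.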

In [None]:
# Matplotlib config

%matplotlib inline
%config InlineBackend.figure_formats = ['svg']
%config InlineBackend.rc = {'figure.figsize': (5.0, 4.0)}

import pandas as pd
import numpy as np

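df = pd.read_csv("cancer.csv")
# The id column carries no diagnostic information, so drop it.
df = df.drop("id", axis = 1)
# Encode the target as 1 (malignant, "M") and 0 (benign, "B").
df["diagnosis"] = (df["diagnosis"] == "M").astype(int)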

df.head()

In [None]:
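from sklearn.model_selection import train_test_split, GridSearchCV, RepeatedKFold
from sklearn.svm import SVC

# Use a small subset of columns as features.
X = df[["concave points_worst", "perimeter_worst", "perimeter_mean", "radius_mean"]]

# Alternative: use every column except the target.
# X = df.drop("diagnosis", axis = 1)
y = df["diagnosis"]

# Hold out 20% as test data; it is not used during the grid search.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state = 42)

# Grid search over RBF-kernel SVM hyperparameters with repeated k-fold cross-validation.
# Note: the features are not scaled here, and RBF SVMs are sensitive to feature scale.
model = GridSearchCV(SVC(), param_grid = {
    "kernel": ["rbf"],
    "gamma": [0.01, 0.1, 1, 10],
    "C": [0.01, 0.1, 0.5, 1, 5, 10]
}, cv = RepeatedKFold(), n_jobs = -1)  # n_jobs = -1 uses all available CPU cores

model.fit(X_train, y_train)

# Best mean cross-validated accuracy on the training data (not the test accuracy).
print(model.best_score_)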
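
Once the model and its parameters are fixed, the held-out test data can be used for a final check. The following is a minimal, self-contained sketch of that last step. It assumes scikit-learn's bundled copy of the dataset in place of `cancer.csv` (its column names differ, for example `worst perimeter` instead of `perimeter_worst`), and it adds a `StandardScaler` and a plain 5-fold cross-validation as illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the bundled dataset stands in for cancer.csv.
data = load_breast_cancer(as_frame=True)
columns = ["worst concave points", "worst perimeter", "mean perimeter", "mean radius"]
X = data.data[columns]
y = (data.target == 0).astype(int)  # 1 = malignant, 0 = benign

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42
)

# Model selection happens on the training data only.
search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC()),
    param_grid={
        "svc__kernel": ["rbf"],
        "svc__gamma": [0.01, 0.1, 1, 10],
        "svc__C": [0.01, 0.1, 0.5, 1, 5, 10],
    },
    cv=5,
)
search.fit(X_train, y_train)

# Final step: a single evaluation on the never-before-seen test data.
y_pred = search.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])

print("best parameters:", search.best_params_)
print(f"cross-validation accuracy: {search.best_score_:.3f}")
print(f"test accuracy: {test_accuracy:.3f}")
# Rows are true classes, columns are predictions; cm[1, 0] counts malignant
# tumors that were predicted as benign.
print(cm)
```

For a diagnostic task the confusion matrix is worth reading alongside the accuracy, because a malignant tumor classified as benign is a far more serious error than the reverse.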