# Breast cancer detection

## Scenario
You have been hired as a Data Scientist by a biotech startup that is developing sensors for early detection of breast cancer. During the startup's recent fundraising round, several investors asked whether the sensor performance could be further improved through the use of Deep Learning, "big data" and other such buzzwords. That is where you come in.

Before diving into more complicated techniques (which require significantly more time and resources), you start by evaluating more common Machine Learning methods on public (and readily available) datasets. The goal is to provide a benchmark of baseline performance and to determine whether the concept is valid.

In [None]:
# make the code compatible with both Python 2 and 3
# (__future__ imports must come before any other statement in the cell)
from __future__ import print_function, division

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# scikit-learn
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# classifiers
from sklearn.linear_model import LogisticRegression, RidgeClassifier, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.neural_network import MLPClassifier

### Breast cancer data
We'll load the breast cancer data that comes bundled with scikit-learn.

More info for the dataset can be found on the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names

In [None]:
# load breast cancer data
# - X = (n_samples, n_features) feature array
# - y = (n_samples) target vector
X, y = load_breast_cancer(return_X_y=True)

# get summary stats
print("# of samples:", X.shape[0])
print("# of features:", X.shape[1])

**Discussion:** What do you know about this dataset? Do the specifics of the dataset help you in any way (e.g. narrowing down the type of models to consider)?
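As a starting point for this discussion, here is a minimal sketch that looks at the metadata bundled with the dataset. It assumes the object returned by `load_breast_cancer()` (without `return_X_y=True`) exposes `target_names` and `feature_names`; the variable names `data` and `class_counts` are only illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# load the full dataset object (not just X, y) to access the metadata
data = load_breast_cancer()

# class labels and how many samples fall into each class
class_counts = dict(zip(data.target_names, np.bincount(data.target)))
print("classes:", class_counts)

# a few of the feature names
print("features:", list(data.feature_names[:5]), "...")

# the features live on very different scales
feature_means = data.data.mean(axis=0)
print("smallest feature mean: {:.4f}".format(feature_means.min()))
print("largest feature mean:  {:.1f}".format(feature_means.max()))
```

The class counts show whether the classes are balanced, and the spread in feature means hints at whether scaling might matter later on.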

### Splitting the data
Before going further, we should split the data into training and testing sets so we can use some supervised learning methods. The idea is to learn a model using only the data in the training set, and then evaluate the model on a new (untouched/unseen) testing set.

There are many ways to split data into training and testing. To start, we'll simply use a function from scikit-learn that splits the data randomly into specified sizes.

In [None]:
# split the data into training and testing sets (60/40 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

# check the sizes of the training and testing sets

# inspect the data
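One possible way to do these checks is sketched below. It is not the only option: `random_state=0` and `stratify=y` are assumptions added here to make the split reproducible and to keep the class proportions similar in both sets.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# random_state fixes the shuffle so the split is reproducible;
# stratify=y keeps the class proportions the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y)

# check the sizes of the training and testing sets
print("train:", X_train.shape, "test:", X_test.shape)

# inspect the data: fraction of samples labelled 1 in each set
print("fraction of class 1 (train): {:.3f}".format(y_train.mean()))
print("fraction of class 1 (test):  {:.3f}".format(y_test.mean()))
```

Without `random_state`, every run produces a different split, so the scores reported later in the notebook will vary from run to run.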


### Train a predictive model
We'll start by training some common classification methods to predict whether a tumor is malignant or benign. The simplest method to start with is Logistic Regression.

In [None]:
# initialize the Logistic Regression model
model = LogisticRegression()

# train the model on the training data
model.fit(X_train, y_train)

# check the accuracy [-] of the trained model
print("Score (train): {:.3f}".format(model.score(X_train, y_train)))
print("Score (test):  {:.3f}".format(model.score(X_test, y_test)))

**Discussion:** How did the model perform? Is the performance "good enough" or do we need to go further? How could we decide (since we are not experts on breast cancer detection in this scenario)?
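One way to put the accuracy in context is to compare it against a trivial baseline and to look at which kinds of errors the model makes. The sketch below is one option under stated assumptions: it uses `DummyClassifier`, `confusion_matrix` and `recall_score` from scikit-learn, a fixed `random_state`, and a raised `max_iter`, none of which appear in the cells above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y)

# baseline: always predict the most frequent class in the training set
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("Baseline score (test): {:.3f}".format(baseline.score(X_test, y_test)))

# max_iter is raised (an assumption) to give the solver more iterations
# on the unscaled data
model = LogisticRegression(max_iter=10000).fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Model score (test):    {:.3f}".format(model.score(X_test, y_test)))

# rows = true class, columns = predicted class (0 = malignant, 1 = benign)
cm = confusion_matrix(y_test, y_pred)
print(cm)

# fraction of malignant tumors that the model correctly flags
malignant_recall = recall_score(y_test, y_pred, pos_label=0)
print("Recall (malignant): {:.3f}".format(malignant_recall))
```

In a screening setting, a missed malignant tumor is likely more costly than a false alarm, so recall on the malignant class may be a more relevant number than overall accuracy.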

### Effect of pre-processing 
Let's evaluate the effect(s) of pre-processing using some standard techniques.

In [None]:
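# split the data into training and testing sets (60/40 split)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

# scale the data so each feature has mean=0 and variance=1
# (fit the scaler on the training set only, then apply it to both sets,
# so no information from the testing set leaks into the pre-processing)
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)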

# train a Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)
print("Score (train): {:.3f}".format(model.score(X_train, y_train)))
print("Score (test):  {:.3f}".format(model.score(X_test, y_test)))

**Discussion:** What was the effect(s) of the pre-processing?
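Because each run uses a different random split, a single pair of scores can be misleading when comparing "with" and "without" scaling. A minimal sketch of a more stable comparison is shown below, assuming 5-fold cross-validation and a scikit-learn pipeline (both choices are assumptions, not part of the notebook so far). The pipeline refits the scaler on the training portion of every fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# same classifier, without and with standardization
# (max_iter is raised to give the solver more iterations on unscaled data)
raw = LogisticRegression(max_iter=10000)
scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))

# 5-fold cross-validated accuracy for both variants
raw_scores = cross_val_score(raw, X, y, cv=5)
scaled_scores = cross_val_score(scaled, X, y, cv=5)

print("Unscaled: {:.3f} +/- {:.3f}".format(raw_scores.mean(), raw_scores.std()))
print("Scaled:   {:.3f} +/- {:.3f}".format(scaled_scores.mean(), scaled_scores.std()))
```

If the difference between the two means is smaller than the spread across folds, the effect of scaling is hard to distinguish from noise for this model.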

### Comparing classification methods
Now let's compare the performance of a range of classification methods. Note that predictive accuracy is only one metric for comparing the methods. For example, keeping training times down may be preferable to a 1-2% increase in accuracy.

In the code cells below, try out some other common classification methods (which we've already imported at the start of the notebook):
- RidgeClassifier
- SGDClassifier
- SVC
- NuSVC
- LinearSVC
- MLPClassifier

In [None]:
# Ridge Classifier
model = RidgeClassifier()
model.fit(X_train, y_train)
print("Score (train): {:.3f}".format(model.score(X_train, y_train)))
print("Score (test):  {:.3f}".format(model.score(X_test, y_test)))

In [None]:
# SGD Classifier (SGD = Stochastic Gradient Descent)
model = ???


In [None]:
# Support Vector Classifier (SVC)
model = ???


In [None]:
# NuSVC
model = ???


In [None]:
# LinearSVC
model = ???


In [None]:
# MLPClassifier (MLP = Multi-Layer Perceptron)
model = ???


**Discussion:** What do the results seem to indicate about the performance of the various methods? Are there any limitations to the analysis (e.g. should you try other methods or other data)?
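One limitation of the cells above is that each score comes from a single random split. The sketch below shows one way to compare the methods more systematically, recording both cross-validated accuracy and wall-clock time. The hyperparameters (`random_state=0`, `max_iter` values) and the 5-fold setup are assumptions chosen for illustration, not tuned settings.

```python
import time

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression, RidgeClassifier, SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC, NuSVC

X, y = load_breast_cancer(return_X_y=True)

# classifiers to compare (mostly default settings)
models = {
    "LogisticRegression": LogisticRegression(max_iter=10000),
    "RidgeClassifier": RidgeClassifier(),
    "SGDClassifier": SGDClassifier(random_state=0),
    "SVC": SVC(),
    "NuSVC": NuSVC(),
    "LinearSVC": LinearSVC(max_iter=10000),
    "MLPClassifier": MLPClassifier(max_iter=1000, random_state=0),
}

results = {}
for name, clf in models.items():
    # scale inside the pipeline so each fold is scaled using its own training data
    pipeline = make_pipeline(StandardScaler(), clf)
    start = time.time()
    scores = cross_val_score(pipeline, X, y, cv=5)
    elapsed = time.time() - start
    results[name] = (scores.mean(), scores.std(), elapsed)
    print("{:<20s} accuracy: {:.3f} +/- {:.3f}  time: {:.2f} s".format(
        name, scores.mean(), scores.std(), elapsed))
```

The timing covers five fits plus scoring, so it is only a rough indication of relative cost on a dataset this small.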

### Next steps
What should you do next? Do you have enough results to draw any conclusions or recommend something to your company?

Going back to the main point, do these results answer the question of whether your company should be integrating topics such as Deep Learning or "big data"? Why or why not? If not, what else is needed before you can answer the question?
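One piece of evidence that may help with the "big data" part of the question is a learning curve, which shows how the score changes as the model sees more training samples. The sketch below is a rough illustration using scikit-learn's `learning_curve`; the choice of model, the five training sizes, and the shuffling seed are all assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))

# train on 10%, 32.5%, ..., 100% of each training fold (5-fold cross-validation)
train_sizes, train_scores, test_scores = learning_curve(
    pipeline, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5),
    shuffle=True, random_state=0)

# average the scores over the folds for each training size
for n, tr, te in zip(train_sizes,
                     train_scores.mean(axis=1),
                     test_scores.mean(axis=1)):
    print("n_train = {:3d}  train: {:.3f}  validation: {:.3f}".format(n, tr, te))
```

If the validation score has already flattened out at the largest training size, collecting more data of the same kind may bring little improvement for this model. If it is still rising, more data could be worth pursuing.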