# Breast cancer detection

## Scenario
You have been hired as a Data Scientist by a biotech startup that is developing sensors for early detection of breast cancer. During the startup's recent fundraising round, several investors asked whether the sensor performance could be further improved through the use of Deep Learning, "big data" and other such buzzwords. This is where you come in.

Before diving into more complicated techniques (which require significantly more time and resources), you start by evaluating more common Machine Learning methods on public (and readily available) datasets. The goal is to establish a baseline performance benchmark and to determine whether the concept is valid.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# scikit-learn
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# classifiers
from sklearn.linear_model import LogisticRegression, RidgeClassifier, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.neural_network import MLPClassifier

### Breast cancer data
We'll load the breast cancer data that comes bundled with scikit-learn.

In [None]:
# load breast cancer data
# - X = (n_samples, n_features) feature array
# - y = (n_samples) target vector
X, y = load_breast_cancer(return_X_y=True)

# get summary stats
print("# of samples:", X.shape[0])
print("# of features:", X.shape[1])
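
Before training anything, it can help to check how the two classes are balanced, since accuracy is easier to interpret when the class ratio is known. The cell below is a minimal sketch: it reloads the dataset as a full `Bunch` object (rather than with `return_X_y=True`) so the class names are available. The names `data` and `counts` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Load the full Bunch object to get the class names alongside the data
data = load_breast_cancer()

# Count the samples in each class (labels are the integers 0 and 1)
counts = np.bincount(data.target)
for name, count in zip(data.target_names, counts):
    share = count / counts.sum()
    print("{:<10} {:>4d} ({:.1%})".format(str(name), int(count), share))
```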

### Pre-processing
Before we train any models, we'll perform some standard pre-processing.

In [None]:
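# split the data into training and testing sets (60/40 split)
# - random_state=0 is an arbitrary fixed seed so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
print("# of samples (training):", X_train.shape[0])
print("# of samples (testing): ", X_test.shape[0])

# scale the features so they have mean=0 and variance=1
# - fit the scaler on the training set only, so that test-set
#   statistics do not leak into training
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)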
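
One way to keep the scaler's statistics confined to the training data is to bundle the scaler and the classifier into a single scikit-learn pipeline. The following is a sketch under a few assumptions: the variable names (`X_raw`, `X_tr`, `pipe`, and so on), the fixed seed, and the stratified split are illustrative choices, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Work on a fresh, unscaled copy of the data
X_raw, y_raw = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.4, random_state=0, stratify=y_raw
)

# fit() fits the scaler on X_tr only, then trains the classifier on the scaled X_tr
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

# score() reuses the training-set mean and variance to scale X_te
print("Score (test): {:.3f}".format(pipe.score(X_te, y_te)))

# The fitted scaler's statistics come from the training rows only
fitted_scaler = pipe.named_steps["standardscaler"]
print("Scaler fitted on training data only:",
      np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0)))
```

Because the pipeline behaves like a single estimator, it can be passed anywhere a classifier is expected, which makes it harder to scale the test data by accident.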

### Train classification model
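We'll start with one of the most common classification methods, logistic regression, and then try several other classifiers for comparison. Each cell reports the mean accuracy (`model.score`) on the training and testing sets.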

In [None]:
# Logistic Regression
model = LogisticRegression()
model.fit(X_train, y_train)
print("Score (train): {:.3f}".format(model.score(X_train, y_train)))
print("Score (test):  {:.3f}".format(model.score(X_test, y_test)))
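
For a classifier, `score` returns the mean accuracy, which does not distinguish between the two kinds of error. In a detection setting it may be worth looking at how many malignant cases are missed. The cell below is a rough sketch of one way to do that with a confusion matrix and the recall of the malignant class (label 0 in this dataset). It retrains its own model on a fresh split so that it is self-contained; the variable names and the fixed seed are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_raw, y_raw = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.4, random_state=0, stratify=y_raw
)

# Scale and classify in one estimator, then predict on the held-out data
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Rows are true classes, columns are predicted classes (0 = malignant, 1 = benign)
cm = confusion_matrix(y_te, y_pred, labels=[0, 1])
print(cm)

# Recall for the malignant class: fraction of malignant cases that were detected
malignant_recall = recall_score(y_te, y_pred, pos_label=0)
print("Recall (malignant): {:.3f}".format(malignant_recall))
```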

In [None]:
# Ridge Classifier
model = RidgeClassifier()
model.fit(X_train, y_train)
print("Score (train): {:.3f}".format(model.score(X_train, y_train)))
print("Score (test):  {:.3f}".format(model.score(X_test, y_test)))

In [None]:
# SGD Classifier
model = SGDClassifier(max_iter=100)
model.fit(X_train, y_train)
print("Score (train): {:.3f}".format(model.score(X_train, y_train)))
print("Score (test):  {:.3f}".format(model.score(X_test, y_test)))

In [None]:
# Support Vector Machine (SVM)
model = SVC()
model.fit(X_train, y_train)
print("Score (train): {:.3f}".format(model.score(X_train, y_train)))
print("Score (test):  {:.3f}".format(model.score(X_test, y_test)))

In [None]:
# Nu-Support Vector Classifier (NuSVC)
model = NuSVC()
model.fit(X_train, y_train)
print("Score (train): {:.3f}".format(model.score(X_train, y_train)))
print("Score (test):  {:.3f}".format(model.score(X_test, y_test)))

In [None]:
# Linear Support Vector Classifier (LinearSVC)
model = LinearSVC()
model.fit(X_train, y_train)
print("Score (train): {:.3f}".format(model.score(X_train, y_train)))
print("Score (test):  {:.3f}".format(model.score(X_test, y_test)))

In [None]:
# Multi-Layer Perceptron (MLP), aka Artificial Neural Network (ANN)
model = MLPClassifier(max_iter=500)
model.fit(X_train, y_train)
print("Score (train): {:.3f}".format(model.score(X_train, y_train)))
print("Score (test):  {:.3f}".format(model.score(X_test, y_test)))
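
The scores above come from a single train/test split, so small differences between models may just reflect which samples landed in the test set. A possible way to get a steadier comparison is k-fold cross-validation, sketched below for three of the classifiers. The choice of 5 folds, the subset of models, and the names `candidates` and `results` are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_raw, y_raw = load_breast_cancer(return_X_y=True)

# A few of the classifiers tried above, with default settings
candidates = {
    "Logistic Regression": LogisticRegression(),
    "Ridge Classifier": RidgeClassifier(),
    "SVM (RBF kernel)": SVC(),
}

results = {}
for name, estimator in candidates.items():
    # Scaling lives inside the pipeline, so it is refit on each training fold
    pipeline = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(pipeline, X_raw, y_raw, cv=5)
    results[name] = scores.mean()
    print("{:<20} {:.3f} +/- {:.3f}".format(name, scores.mean(), scores.std()))
```

The spread across folds gives a rough sense of how much of the gap between two models is noise.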