# Machine Learning in Public Health Workshop

I'm going to walk through how we might work towards building a first machine learning model. But first, let's briefly discuss what machine learning is. This is going to be a rough crash course -- ask questions if you feel there's a gap!


## Motivation for Machine Learning
For many problems, it's sufficient to write a deterministic program that solves them. For example, if we want to find the sum of all numbers between 1 and 1 million, we could write a Python program such as:

```python
# range() stops one short of its second argument, so this covers 1 through 1,000,000.
sum(range(1, 1000001))
```

However, there are some problems that are more difficult to solve via computing. For example, how would you write a program to detect a face? This is a task that seems easy for most people -- easy to the extent that it is unconscious. But writing a program to perform such a task is difficult!

This difficulty is the motivation for machine learning.

## What is Machine Learning?
If we don't know the process for detecting faces, what if we learn that process? Let's frame our facial detection problem a bit:

Given an image, we want to determine whether or not a face exists in that image.

Assuming we've already given up on directly programming the process of detecting a face, how might we go about learning that process?

In machine learning, we typically give some data to an algorithm for "learning," and then use what that algorithm learned to make predictions about new data.

In the facial detection example, we might provide many images, each with a label - "Yes" if it contains a face or "No" if it does not (in practice this could be 1 and 0, respectively).

This data is used to _train_ a machine learning model.

## Model Training
We call the process of a model learning from data "training." During training, we typically give a model some data (often with the correct answers -- hence the label) and let it tweak itself according to some algorithm to get answers right more often. Think of this as homework: you often have some answers for homework questions so that you can try the problem yourself and then score yourself to see how well you did. This lets you improve your own score before it really counts (i.e. on the test).

In our facial detection example, you would give the model an image and let the model predict whether or not a face was in the image. Then you'd score the model's prediction against the real label (we'll talk about how we score in a bit). The training algorithm will then tweak the model in light of this score to do more of the things that get right answers and fewer of the things that get wrong answers.

Note: I'm being intentionally vague here about "algorithm" because there are many such algorithms. Specific algorithms are outside the scope of this workshop, but if you're interested, feel free to talk to me about it afterwards!

## Model Evaluation
You don't give your models homework for practice (in the form of training data) for no reason. Eventually, you'll want to evaluate to see how well your model performs. Just as (most) teachers wouldn't give you the answer key to the test, the model should be given new data *without* labels to see how well it performs with respect to the true labels.

Just like a real test, the model gets a score. But we don't improve our model from this score like before. This score tells us a bit about how "good" our model is.
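The homework/test analogy can be sketched in plain Python. This is only an illustration of the idea of holding data out, not scikit-learn's implementation; the `split_train_test` helper and the toy records are hypothetical and made up for this example.

```python
import random

def split_train_test(records, test_fraction=0.2, seed=0):
    """Shuffle the records and hold out a fraction of them for testing."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(records)       # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test shuffled records become the "test", the rest the "homework".
    return shuffled[n_test:], shuffled[:n_test]

# Toy labelled data: (feature, label) pairs invented for illustration.
records = [(x, int(x >= 5)) for x in range(10)]
train, test = split_train_test(records, test_fraction=0.2)

print("Training records:", len(train))
print("Held-out records:", len(test))
```

The model only ever learns from `train`; `test` stays hidden until it is time to score.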

This score concept keeps coming up; how do we actually compute it?

## Scoring (Loss Function)
We score models with what's known as a loss function. This is just something that tells you how "wrong" a prediction is. For instance, predicting that a picture of a Honda is a picture of a snake is more wrong than predicting that it's a picture of a Subaru. The loss function defines how to assign a numerical value to the "wrongness" of a prediction.
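To make "wrongness" concrete, here is a small sketch of two common loss functions for yes/no problems. The function names are chosen for this example, and which loss a given model actually uses depends on the model, so treat this as an illustration rather than a description of any particular library.

```python
import math

def zero_one_loss(true_label, predicted_label):
    """Return 1 if the prediction is wrong and 0 if it is right."""
    return int(true_label != predicted_label)

def log_loss(true_label, predicted_probability):
    """Penalty that grows as the probability given to the true class shrinks.

    true_label is 0 or 1; predicted_probability is the model's estimate
    that the label is 1.
    """
    p_true = predicted_probability if true_label == 1 else 1.0 - predicted_probability
    return -math.log(p_true)

# The true label is 1 in all three cases; only the model's confidence differs.
confident_and_right = log_loss(1, 0.9)
unsure = log_loss(1, 0.5)
confident_and_wrong = log_loss(1, 0.1)

print(confident_and_right, unsure, confident_and_wrong)
```

The zero-one loss only says right or wrong, while the log loss also punishes being confidently wrong more than being unsure.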

## Deployment
Once you decide your model's evaluation score is good enough, you might decide to deploy and use it for some application! Just like any other code, be wary of bugs -- they can be oh-so-subtle in machine learning. With great power comes a lot of testing.

## Workshop
We'll be looking at breast cancer data taken from the UCI machine learning repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29 [1].
This dataset provides several features about tumors:

1. Sample code number: id number 
2. Clump Thickness: 1 - 10 
3. Uniformity of Cell Size: 1 - 10 
4. Uniformity of Cell Shape: 1 - 10 
5. Marginal Adhesion: 1 - 10 
6. Single Epithelial Cell Size: 1 - 10 
7. Bare Nuclei: 1 - 10 
8. Bland Chromatin: 1 - 10 
9. Normal Nucleoli: 1 - 10 
10. Mitoses: 1 - 10 
11. Class: (2 for benign, 4 for malignant)

A description of these parameters can be found below [2]:
```
Clump thickness: Benign cells tend to be grouped in monolayers, while cancerous cells are often grouped in multilayers. 

Uniformity of cell size/shape: Cancer cells tend to vary in size and shape. That is why these parameters are valuable in determining whether the cells are cancerous or not. 

Marginal adhesion: Normal cells tend to stick together. Cancer cells tend to lose this ability. So loss of adhesion is a sign of malignancy. 

Single epithelial cell size: Is related to the uniformity mentioned above. Epithelial cells that are significantly enlarged may be malignant. 

Bare nuclei: This is a term used for nuclei that are not surrounded by cytoplasm (the rest of the cell). Those are typically seen in benign tumours. 

Bland Chromatin: Describes a uniform "texture" of the nucleus seen in benign cells. In cancer cells the chromatin tends to be more coarse. 

Normal nucleoli: Nucleoli are small structures seen in the nucleus. In normal cells the nucleolus is usually very small if visible at all. In cancer cells the nucleoli become more prominent, and sometimes there are more of them.
```

We'll dig into this dataset and try to predict whether a tumor is benign or malignant based on these features.

Note: I have made some slight modifications to the dataset to make it easier to use.

---
[1] O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear 
      programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.

[2] https://answers.yahoo.com/question/index?qid=20101204013824AAWTufG

---

In [1]:
%ls

Machine Learning in Public Health.ipynb
breast_cancer.csv


In [2]:
import pandas as pd

data = pd.read_csv('breast_cancer.csv')
data.head()

Unnamed: 0,Id Number,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class
0,1000025,5,1,1,1,2,1.0,3,1,1,2
1,1002945,5,4,4,5,7,10.0,3,2,1,2
2,1015425,3,1,1,1,2,2.0,3,1,1,2
3,1016277,6,8,8,1,3,4.0,3,7,1,2
4,1017023,4,1,1,3,2,1.0,3,1,1,2


Lots of simple numerical fields. It should be simple to get a model going right away! We'll start out by applying a model called LogisticRegression. LogisticRegression essentially tries to divide classes of points (in our case benign vs. malignant tumors) with a line:

<img src=http://mlpy.sourceforge.net/docs/3.5/_images/elasticnetc.png>

The algorithm starts with an initial line and then repeatedly rotates and shifts it slightly until it separates the two classes as well as it can, e.g. benign tumors are on one side of the line and malignant tumors are on the other side of the line. Luckily with scikit-learn, training such a model is easy.
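The "start with a line and nudge it" idea can be sketched with plain gradient descent on a toy dataset. This is a simplified illustration under stated assumptions: scikit-learn's `LogisticRegression` uses a different optimizer and adds regularization by default, and the helper names, learning rate, step count, and data points below are all made up for this example.

```python
import math

def sigmoid(z):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(points, labels, learning_rate=0.05, steps=5000):
    """Fit weights w and bias b so that sigmoid(w . x + b) approximates the labels."""
    n_features = len(points[0])
    w = [0.0] * n_features   # the initial "line": all weights zero
    b = 0.0
    for _ in range(steps):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(points, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            error = p - y    # positive when the model leans too far towards class 1
            for j in range(n_features):
                grad_w[j] += error * x[j]
            grad_b += error
        # Nudge the line a little in the direction that reduces the average loss.
        w = [wi - learning_rate * g / len(points) for wi, g in zip(w, grad_w)]
        b -= learning_rate * grad_b / len(points)
    return w, b

def predict(w, b, x):
    """Return 1 if x falls on the positive side of the line, else 0."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)

# Two small, well-separated clusters of 2D points (invented data).
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 5.5), (6.5, 7.0)]
labels = [0, 0, 0, 1, 1, 1]

w, b = train_logistic(points, labels)
predictions = [predict(w, b, x) for x in points]
print("weights:", w, "bias:", b)
print("predictions:", predictions)
```

Each pass makes a small adjustment to `w` (the tilt of the line) and `b` (its offset), which is the "rotate and move" behaviour described above.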

Do note that in general there are many different models that can solve this problem well. It's often hard to decide what model to use. In general, you should try many different kinds of models, but a good place to start is [scikit-learn's guide](http://scikit-learn.org/stable/tutorial/machine_learning_map/).

Let's try logistic regression:

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

target_columns = ["Class"]
feature_columns = [col for col in data.columns if col not in target_columns]

X = data[feature_columns]
Y = data[target_columns]

Xtr, Xte, Ytr, Yte = train_test_split(X, Y, test_size=0.2)

model = LogisticRegression()
model.fit(Xtr, Ytr)

print("Train Score: {}".format(model.score(Xtr, Ytr)))
print("Test Score: {}".format(model.score(Xte, Yte)))

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

That's strange: NaN in our input! That means that somewhere we have missing data. If we look in the documentation (here: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names) we can see that missing attribute values are mentioned. For now, let's simply remove the offending column ('Bare Nuclei'):

In [6]:
clean_data = data.drop(columns=['Bare Nuclei'])
clean_data.head()

Unnamed: 0,Id Number,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bland Chromatin,Normal Nucleoli,Mitoses,Class
0,1000025,5,1,1,1,2,3,1,1,2
1,1002945,5,4,4,5,7,3,2,1,2
2,1015425,3,1,1,1,2,3,1,1,2
3,1016277,6,8,8,1,3,3,7,1,2
4,1017023,4,1,1,3,2,3,1,1,2


Okay, let's try building a model one more time!

In [7]:
target_columns = ["Class"]
feature_columns = [col for col in clean_data.columns if col not in target_columns]

X = clean_data[feature_columns]
Y = clean_data[target_columns]

Xtr, Xte, Ytr, Yte = train_test_split(X, Y, test_size=0.2)

model = LogisticRegression()
model.fit(Xtr, Ytr)

print("Train Score: {}".format(model.score(Xtr, Ytr)))
print("Test Score: {}".format(model.score(Xte, Yte)))

Train Score: 0.6565295169946332
Test Score: 0.65


  y = column_or_1d(y, warn=True)


Wow! With a simple first approach, we're already 66% accurate at detecting whether a tumor is malignant or benign. That seems pretty good for a first pass. But how good is it really? Let's take a look at the split between classes:

In [8]:
num_benign = (clean_data['Class'] == 2).sum()
num_malignant = (clean_data['Class'] == 4).sum()

print("Num Benign: {}".format(num_benign))
print("Num Malignant: {}".format(num_malignant))

Num Benign: 458
Num Malignant: 241


It's kind of suspicious that the train and test accuracies are both pretty close to the proportion of benign tumors in the data. Let's look at the coefficients to see if anything was learned:
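As a quick sanity check, we can compute the accuracy of a "model" that ignores its input and always answers benign, using the class counts printed above. The variable `baseline_accuracy` is just a name chosen for this sketch.

```python
# Class counts copied from the output above.
num_benign = 458
num_malignant = 241
total = num_benign + num_malignant

# Accuracy of always predicting the majority class (benign).
baseline_accuracy = num_benign / total
print("Always-benign accuracy: {:.4f}".format(baseline_accuracy))
```

Any model worth keeping should beat this number; a score that merely matches it suggests the model has learned nothing beyond the class balance.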

In [16]:
for name, coef in zip(X.columns, model.coef_[0]):
  print("{} {}".format(name.ljust(30), coef))

Id Number                      -6.355527513495812e-07
Clump Thickness                8.13452412616266e-12
Uniformity of Cell Size        1.095795841180402e-11
Uniformity of Cell Shape       1.0560595550634325e-11
Marginal Adhesion              8.445561129199733e-12
Single Epithelial Cell Size    5.973667456503636e-12
Bland Chromatin                7.477649410691694e-12
Normal Nucleoli                9.143176062822985e-12
Mitoses                        2.589681030384016e-12


Notice how all of these are extremely small! This means that our model is really just predicting that every tumor is benign. That's not very useful! Also notice that the weight on the ID is the largest in magnitude (by several orders of magnitude). The ID is just an identifier with values in the millions, so it shouldn't tell us anything about the tumor. Perhaps that ID is screwing up learning! Let's remove that as well:
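To see how lopsided this is, we can multiply two of the coefficients above by the matching feature values from the first row of the data. This is a rough back-of-the-envelope sketch: it looks at only two terms of the model's linear score and leaves out the intercept, which isn't shown in the output.

```python
# Coefficients copied from the output above.
id_coef = -6.355527513495812e-07
clump_coef = 8.13452412616266e-12

# Feature values from the first row of the dataset.
id_value = 1000025
clump_value = 5

# Each feature's contribution to the linear score is coefficient * value.
id_term = id_coef * id_value
clump_term = clump_coef * clump_value

print("Id Number term:       {:.3g}".format(id_term))
print("Clump Thickness term: {:.3g}".format(clump_term))
print("Ratio of magnitudes:  {:.3g}".format(abs(id_term) / abs(clump_term)))
```

The ID contributes many orders of magnitude more to the score than an actual measurement does, so the real features have essentially no say in the prediction.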

In [14]:
clean_data_without_id = clean_data.drop(columns=['Id Number'])
clean_data_without_id.head()

Unnamed: 0,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bland Chromatin,Normal Nucleoli,Mitoses,Class
0,5,1,1,1,2,3,1,1,2
1,5,4,4,5,7,3,2,1,2
2,3,1,1,1,2,3,1,1,2
3,6,8,8,1,3,3,7,1,2
4,4,1,1,3,2,3,1,1,2


In [19]:
target_columns = ["Class"]
feature_columns = [col for col in clean_data_without_id.columns if col not in target_columns]

X = clean_data_without_id[feature_columns]
Y = clean_data_without_id[target_columns]

Xtr, Xte, Ytr, Yte = train_test_split(X, Y, test_size=0.2)

model = LogisticRegression()
model.fit(Xtr, Ytr)

print("Train Score: {}".format(model.score(Xtr, Ytr)))
print("Test Score: {}".format(model.score(Xte, Yte)))

Train Score: 0.9570661896243292
Test Score: 0.9642857142857143


  y = column_or_1d(y, warn=True)


Wow! Such a simple change brings us up in accuracy dramatically! Let's take a look at the coefficients now and match them to their feature name:

In [20]:
for name, weight in zip(X.columns, model.coef_[0]):
  print("{} {}".format(name.ljust(30), weight))

Clump Thickness                0.30411694253731925
Uniformity of Cell Size        0.15529520593654014
Uniformity of Cell Shape       0.3664941998387165
Marginal Adhesion              0.1750852383206438
Single Epithelial Cell Size    0.098101382267995
Bland Chromatin                0.28336172072437005
Normal Nucleoli                0.17089611007351646
Mitoses                        0.192557263169656


From this we can see that Uniformity of Cell Shape has the largest coefficient, so it seems to be the strongest indicator. How might we further improve?

Let's try three other ways to improve our accuracy:

1. Try to add the Bare Nuclei feature back
2. Try to rescale features
3. Try a different model

## Your Turn
Grab a partner to work with -- it's incredibly useful to work with someone when learning new things. Don't fret if you can't figure out how to do something.

### Bare Nuclei Feature
This feature caused problems before because a few of the values were NaN. How might we handle those? There are generally two approaches: 1) throw the record out or 2) fill it with some value.

It's hard to say which might be better in this case, but let's just fill the NaN values with something sensible like 0.
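The two approaches can be compared side by side on a tiny, made-up table. This sketch uses plain Python so the mechanics are visible; the `rows` data and the `is_missing` helper are hypothetical. In pandas, the corresponding methods are `dropna` and `fillna`.

```python
import math

# Hypothetical mini-dataset with two missing 'Bare Nuclei' values.
rows = [
    {"Bare Nuclei": 1.0, "Class": 2},
    {"Bare Nuclei": math.nan, "Class": 2},
    {"Bare Nuclei": 10.0, "Class": 4},
    {"Bare Nuclei": math.nan, "Class": 4},
]

def is_missing(value):
    """True only for float NaN values."""
    return isinstance(value, float) and math.isnan(value)

# Option 1: throw out every record that has a missing value.
dropped = [row for row in rows if not any(is_missing(v) for v in row.values())]

# Option 2: keep every record and replace missing values with a constant.
filled = [{k: (0.0 if is_missing(v) else v) for k, v in row.items()} for row in rows]

print("Records after dropping:", len(dropped))
print("Records after filling: ", len(filled))
```

Dropping loses records (and their labels), while filling keeps them at the cost of inventing a value the model will treat as real.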

In [23]:
filled_data = data.drop(columns='Id Number').fillna(0)
filled_data.head()

Unnamed: 0,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class
0,5,1,1,1,2,1.0,3,1,1,2
1,5,4,4,5,7,10.0,3,2,1,2
2,3,1,1,1,2,2.0,3,1,1,2
3,6,8,8,1,3,4.0,3,7,1,2
4,4,1,1,3,2,1.0,3,1,1,2


In [24]:
target_columns = ["Class"]
feature_columns = [col for col in filled_data.columns if col not in target_columns]

X = filled_data[feature_columns]
Y = filled_data[target_columns]

In [None]:
# Your turn! Train a LogisticRegression model on X and Y (Hint: train/test?).

### Rescale Features
Sometimes, learning is difficult when features aren't appropriately scaled. It's much more difficult to learn when your points are far away from where your line starts!

<img src=http://mlpy.sourceforge.net/docs/3.5/_images/elasticnetc.png>

Imagine if your classification line started nowhere near these points -- learning might be slower! Furthermore, many models make strong assumptions about the distribution of your data, so it's a common practice to encourage learning by transforming data to have zero mean and unit variance. There are other types of data transformations that are commonly used -- feel free to read about some of them (and perhaps even try them): http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing.
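Here is what "zero mean and unit variance" means for a single column, sketched with the standard library. The `standardize` helper is a name made up for this example, and the values are the first five Clump Thickness entries shown earlier. It divides by the population standard deviation, which as far as I know matches what `StandardScaler` does per column.

```python
import statistics

def standardize(values):
    """Shift values to mean 0 and rescale them to standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)   # population standard deviation
    return [(v - mean) / std for v in values]

# First five Clump Thickness values from the table above.
clump_thickness = [5, 5, 3, 6, 4]
scaled = standardize(clump_thickness)

print("scaled values:", scaled)
print("mean after scaling:", statistics.fmean(scaled))
print("std after scaling: ", statistics.pstdev(scaled))
```

After scaling, every feature lives on a comparable scale centered at zero, which puts the points close to where the line starts.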

In [25]:
from sklearn.preprocessing import StandardScaler

target_columns = ["Class"]
feature_columns = [col for col in filled_data.columns if col not in target_columns]

X = filled_data[feature_columns]
Y = filled_data[target_columns]

X_scaled = StandardScaler().fit_transform(X)

In [52]:
# Your turn! Train and evaluate a LogisticRegression model on X_scaled and Y.

Whew! We've done a lot of model training. Let's try one last thing: new models. We've been using LogisticRegression for everything. Let's try two other models: Support Vector Machines and Random Forests. These have the exact same interface as LogisticRegression.

* SVC: http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
* Random Forest: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [26]:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

In [54]:
# Your turn! Train and evaluate an SVC and RandomForestClassifier!