# Machine Learning in Public Health Workshop

I'm going to walk through how we might work towards building a first machine learning model. But first, let's briefly discuss what machine learning is. This is going to be a rough crash course -- ask questions if you feel there's a gap!


**Disclaimer: this is a super fast crash course. In practice, applying data science to public health is challenging. It's an important problem, but it requires deep collaboration between domain experts in the health field and data scientists to build valuable solutions. Take this knowledge to collaborate with domain experts -- not to "disrupt" their industry.**


## Motivation for Machine Learning
For many problems, it's sufficient to write a deterministic program. For example, if we want to find the sum of all numbers between 1 and 1 million, we could write a Python program such as:

```python
# range(1, 1_000_001) covers 1 through 1,000,000 inclusive
sum(range(1, 1_000_001))
```

However, there are some problems that are more difficult to solve via computing. For example, how would you write a program to detect a face? This is a task that seems easy for most people -- easy to the extent that it is unconscious. But writing a program to perform such a task is difficult!

This difficulty is the motivation for machine learning.

## What is Machine Learning?
If we don't know the process for detecting faces, what if we learn that process? Let's frame our facial detection problem a bit:

Given an image, we want to determine whether or not a face exists in that image.

Assuming we've already given up on directly programming the process of detecting a face, how might we go about learning that process?

In machine learning, we typically give some data to an algorithm for "learning" and then use what that algorithm learned to make predictions about new data.

In the facial detection example, we might provide many images, each with a label - "Yes" if it contains a face or "No" if it does not (in practice this could be 1 and 0, respectively).

This data is used to _train_ a machine learning model.

## Model Training
We call the process of a model learning from data "training." In this, we typically give a model some data (often with the correct answers -- hence the label) and let it tweak itself according to some algorithm to get answers right more often. Think of this as homework: you often have some answers for homework questions so that you can try the problem yourself and then score yourself to see how well you did. This lets you improve your own score before it really counts (i.e. on the test).

In our facial detection example, you would give the model an image and let the model predict whether or not a face was in the image. Then you'd score the model prediction against the real label (we'll talk about how we score in a bit). The training algorithm will then tweak the model in light of this score to do more of the things that get right answers and fewer of the things that get wrong answers.
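
To make the predict / score / tweak cycle concrete, here is a minimal sketch using gradient descent on a one-feature logistic model. Gradient descent is just one possible training algorithm, chosen here as an assumption, and the toy data and the `train` helper are made up for illustration.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit a one-feature logistic model with plain gradient descent.

    xs: list of feature values; ys: list of labels (0 or 1).
    """
    w, b = 0.0, 0.0  # the model's adjustable "knobs"
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # prediction minus true label
            grad_w += error * x
            grad_b += error
        # Tweak the knobs in the direction that reduces the error.
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b

# Toy data (made up): small values are labeled 0, large values 1.
xs = [1, 2, 3, 7, 8, 9]
ys = [0, 0, 0, 1, 1, 1]
w, b = train(xs, ys)
predictions = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x in xs]
print(predictions)
```

Each pass through the loop is one round of "homework": predict, compare against the answers, adjust.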

Note: I'm being intentionally vague here about "algorithm" because there are many such algorithms. Specific algorithms are outside the scope of this workshop, but if you're interested, feel free to talk to me about it afterwards!

## Model Evaluation
You don't give your models homework for practice (in the form of training data) for no reason. Eventually, you'll want to evaluate to see how well your model performs. Just as (most) teachers wouldn't give you the answer key to the test, the model should be given new data *without* labels to see how well it performs with respect to the true labels.

Generally, you may hear folks talk about 3 datasets:

* Training: this is what your model sees and learns from
* Validation (AKA dev): this is what you use to help guide your model selection choices
* Testing: this is what you use to estimate the final performance of your model.

In this workshop, we'll forget about a test set for simplicity. But in real-world usage, it's an important component!
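
As a rough illustration of the three datasets, the sketch below shuffles some rows and cuts them into three pieces using only the standard library. The `three_way_split` helper and the 70/15/15 proportions are assumptions for illustration; with scikit-learn you would more likely call `train_test_split` twice.

```python
import random

def three_way_split(rows, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle rows and cut them into train / validation / test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed -> reproducible split
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test_rows = shuffled[:n_test]
    val_rows = shuffled[n_test:n_test + n_val]
    train_rows = shuffled[n_test + n_val:]
    return train_rows, val_rows, test_rows

train_rows, val_rows, test_rows = three_way_split(range(100))
print(len(train_rows), len(val_rows), len(test_rows))  # 70 15 15
```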

Just like a real test, the model gets a score. But we don't improve our model from this score like before. This score tells us a bit about how "good" our model is.

This score concept keeps coming up; how do we do that? Usually with what we call a loss function. The loss function should give a notion of the "wrongness" of your model's prediction.
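
Here is a small sketch of two common loss functions for yes/no problems, written from scratch so you can see the arithmetic. The labels and predicted probabilities are made up; which loss a given library uses internally depends on the model.

```python
import math

def zero_one_loss(y_true, y_pred):
    """Fraction of predictions that are wrong (0 = perfect)."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

def log_loss(y_true, p_positive):
    """Average negative log-likelihood for 0/1 labels.

    p_positive[i] is the predicted probability that example i has label 1.
    Confident wrong answers are punished much harder than hesitant ones.
    """
    total = 0.0
    for t, p in zip(y_true, p_positive):
        total += -math.log(p if t == 1 else 1.0 - p)
    return total / len(y_true)

labels = [1, 0, 1, 0]
print(zero_one_loss(labels, [1, 0, 0, 0]))     # 0.25 (one of four is wrong)
print(log_loss(labels, [0.9, 0.1, 0.9, 0.1]))  # all confident and right: small
print(log_loss(labels, [0.9, 0.1, 0.1, 0.1]))  # one confident miss: much larger
```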

There are many qualities you should evaluate in a model that are hard to summarize in a single number, though!

* Does the model perform well overall?
* Does the model have any egregious predictions (e.g. what edge cases are there)?
* Does the model treat various groups fairly?
* Does the model have desired system performance characteristics?
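
One simple way to look past a single overall number is to break the score down by group. The sketch below does this for accuracy; the labels, predictions, and group names are made up, and accuracy per group is only one of many possible fairness checks.

```python
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: accuracy} so weak spots are not averaged away."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}

# Made-up labels, predictions, and group memberships for illustration.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(overall)                                    # 0.75
print(accuracy_by_group(y_true, y_pred, groups))  # {'A': 1.0, 'B': 0.5}
```

An overall accuracy of 0.75 hides that this toy model is perfect on group A and no better than a coin flip on group B.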


## Deployment
Once you decide your model's evaluation score is good enough, you might decide to deploy and use it for some application! Just like any other code, be wary of bugs -- they can be oh-so-subtle in machine learning. With great power comes a lot of testing.

If you want to deploy something we do today -- I recommend using something like [Algorithmia](https://algorithmia.com/developers/model-deployment/scikit) as it will be super easy to get something up and running quickly!

## Continued Improvement
After deploying a model, you often want to monitor it to make sure it maintains performance. There are tons of considerations here -- there's only so much we can go through in our limited time together! But a quick summary is: take a lot of what you validate before deployment and do that on a continuous basis!


## Workshop
We'll be looking at breast cancer data taken from the UCI machine learning repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29 [1].
This dataset provides several features about tumors:

1. Sample code number: id number 
2. Clump Thickness: 1 - 10 
3. Uniformity of Cell Size: 1 - 10 
4. Uniformity of Cell Shape: 1 - 10 
5. Marginal Adhesion: 1 - 10 
6. Single Epithelial Cell Size: 1 - 10 
7. Bare Nuclei: 1 - 10 
8. Bland Chromatin: 1 - 10 
9. Normal Nucleoli: 1 - 10 
10. Mitoses: 1 - 10 
11. Class: (2 for benign, 4 for malignant)

A description of these parameters can be found below [2]:
```
Clump thickness: Benign cells tend to be grouped in monolayers, while cancerous cells are often grouped in multilayers. 

Uniformity of cell size/shape: Cancer cells tend to vary in size and shape. That is why these parameters are valuable in determining whether the cells are cancerous or not. 

Marginal adhesion: Normal cells tend to stick together. Cancer cells tend to lose this ability. So loss of adhesion is a sign of malignancy. 

Single epithelial cell size: Is related to the uniformity mentioned above. Epithelial cells that are significantly enlarged may be a malignant cell. 

Bare nuclei: This is a term used for nuclei that is not surrounded by cytoplasm (the rest of the cell). Those are typically seen in benign tumours. 

Bland Chromatin: Describes a uniform "texture" of the nucleus seen in benign cells. In cancer cells the chromatin tends to be more coarse. 

Normal nucleoli: Nucleoli are small structures seen in the nucleus. In normal cells the nucleolus is usually very small if visible at all. In cancer cells the nucleoli become more prominent, and sometimes there are more of them.
```

We'll dig into this dataset and try to predict whether a tumor is benign or malignant based on these features. It's often tempting to immediately start throwing models at the data, but the first step is always to understand your data, so let's take a look!

Note: I have made some slight modifications to the dataset to make it easier to use.

---
[1] O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear 
      programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.

[2] https://answers.yahoo.com/question/index?qid=20101204013824AAWTufG

---

In [None]:
%ls

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv('breast_cancer.csv')

# Hold out 30% of the rows as a validation set (no fixed seed, so the split varies between runs)
data_tr, data_va = train_test_split(data, test_size=0.3)


data_tr.head()

First off, it seems clear that we just have numerical features -- that will make our first attempt at a model easier as many ML libraries expect numeric features (e.g. if we had text, we may need to convert to a numeric representation for some libraries/models).

Let's take a look at some other qualities of this data!

In [None]:
data_tr.isnull().mean()

Here we can see that the only feature that has any missing values is `Bare Nuclei`. Proper handling of missing values can be difficult at times because, depending on [how it is missing, we have to adjust our strategy for how to deal with it](https://www.theanalysisfactor.com/missing-data-mechanism/).

(A common theme in good data science work: know your assumptions).

In [None]:
data_tr.describe()

As discussed above, we can see most of these features range from 1-10. But we also see differences in the distributions (see the rows `25%`, `50%`, and `75%`). Some of these features seem to have a lot of data on the same values (e.g. if both `25%` and `50%` have the same value, then you can be pretty sure at least 25% of the data has that value!). We also get a little insight into our label distribution (there are only 2 labels, `2` and `4`). It looks like it's reasonably balanced -- can you figure out the proportion of each label yourself?
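
One way to approach that question: because the label only takes the values `2` and `4`, the `mean` row for `Class` pins down the proportions. A minimal sketch, with a hypothetical helper and made-up labels (not the workshop data):

```python
def malignant_fraction(class_mean):
    """Recover the share of label 4 from the mean of a 2/4-coded column.

    If a fraction p of rows is 4 and the rest is 2, then
    mean = 2 * (1 - p) + 4 * p = 2 + 2 * p, so p = (mean - 2) / 2.
    """
    return (class_mean - 2) / 2

# Toy labels (made up): 6 benign, 2 malignant.
labels = [2, 2, 2, 4, 2, 4, 2, 2]
mean = sum(labels) / len(labels)
print(mean)                      # 2.5
print(malignant_fraction(mean))  # 0.25
```

With pandas you could also count the labels directly, for example with `value_counts(normalize=True)` on the `Class` column.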

Let's look at how all these correlate with each other to see how they relate.

In [None]:
data_tr.corr()

Notice the diagonal is 1.0 -- this is because a variable is always perfectly correlated with itself. We can also see each variable's correlation with the class: `Bare Nuclei`, `Uniformity of Cell Size`, and `Uniformity of Cell Shape` stand out as the most strongly correlated with it. We can also recognize that the latter 2 are highly correlated with each other.

What does that mean? It means that most of the information we could learn from one of them is provided by the other. It's tempting to always throw as many features as possible at a machine learning model (and in hackathon mode, that might be the right thing to ship quickly). But when building a real product for people, simple is often better :).
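
If you're curious what a correlation table is computing for each pair of columns, here is a from-scratch sketch of the Pearson correlation coefficient. The three toy columns are made up: `shape` tracks `size` closely, while `other` does not.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values for illustration only.
size = [1, 2, 3, 8, 9, 10]
shape = [1, 3, 2, 8, 10, 9]
other = [5, 1, 9, 2, 8, 4]

print(round(pearson(size, shape), 3))  # close to 1: nearly redundant features
print(round(pearson(size, other), 3))  # close to 0: little shared information
```

When two features are as tightly linked as `size` and `shape` here, adding the second one gives the model very little it didn't already know.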

I think we've seen enough to give it a quick go. Let's create a first model on the `Uniformity of Cell Size` feature. We'll start out by applying a model called LogisticRegression. LogisticRegression essentially tries to divide classes of points (in our case benign vs. malignant tumors) with a line:

<img src=http://mlpy.sourceforge.net/docs/3.5/_images/elasticnetc.png>

The algorithm starts with an initial line and then just slightly rotates and moves until it can separate the two classes such that e.g. benign tumors are on one side of the line and malignant tumors are on the other side of the line. Luckily with scikit-learn, training such a model is easy.
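
With a single feature, the "line" collapses to one cutoff value: everything below it is predicted benign, everything above it malignant. The sketch below shows this with hypothetical parameters `w` and `b` (they are not fitted on the workshop data).

```python
import math

def predict_proba(x, w, b):
    """Probability of the positive class for feature value x."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def decision_threshold(w, b):
    """Feature value where the predicted probability is exactly 0.5."""
    return -b / w

# Hypothetical parameters, NOT fitted on the workshop data.
w, b = 2.0, -7.0
print(decision_threshold(w, b))  # 3.5

for x in (1, 3, 4, 10):
    # Predict malignant (4) when the probability reaches 0.5, else benign (2).
    label = 4 if predict_proba(x, w, b) >= 0.5 else 2
    print(x, round(predict_proba(x, w, b), 3), label)
```

Training amounts to searching for the `w` and `b` that put this cutoff in the most useful place.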

Do note that in general there are many different models that can solve this problem well. It's often hard to decide what model to use. In general, you should try many different kinds of models, but a good place to start is [scikit-learn's guide](http://scikit-learn.org/stable/tutorial/machine_learning_map/).

In [None]:
from sklearn.linear_model import LogisticRegression

def featurize(data, feature_columns, target_column):
    """Extract features/labels from a given dataset."""
    return data[feature_columns], data[target_column]

In [None]:
target_column = "Class"
feature_columns = ["Uniformity of Cell Size"]

Xtr, Ytr = featurize(data_tr, feature_columns, target_column)
model = LogisticRegression().fit(Xtr, Ytr)
print("Train Score: {}".format(model.score(Xtr, Ytr)))

Xva, Yva = featurize(data_va, feature_columns, target_column)
print("Validation Score: {}".format(model.score(Xva, Yva)))

Wow! We already have a pretty high score on our held-out set. We could call it done here. But let's dig into what we're getting wrong a little bit to understand. First, let's take a look at a finer-grained metric: the confusion matrix.

In [None]:
from sklearn import metrics

def build_cm(model, data, feature_columns, target_column):
    """Build a confusion matrix for given model / data.
    
    rows -> true label
    columns -> predicted label
    
    First value -> 2, Second value -> 4
    """
    X, Y = featurize(data, feature_columns, target_column)
    predictions = model.predict(X)
    return metrics.confusion_matrix(Y, predictions, labels=[2, 4])


cm = build_cm(model, data_tr, feature_columns, target_column)
print(cm)

We can see here that there is a lot more of the label `2` (~3x as much) than of the label `4`. Label imbalance can often hurt the learning process, as the model may learn to just predict the more common class more often.
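
A quick sanity check when labels are imbalanced is to ask how well a "model" would do if it always predicted the most common label. The sketch below uses made-up labels with a 3:1 ratio, similar to the rough ratio mentioned above; the helper name is hypothetical.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)

# Made-up labels with a 3:1 imbalance.
labels = [2] * 75 + [4] * 25
print(majority_baseline_accuracy(labels))  # 0.75
```

Any real model should comfortably beat this number, and accuracy alone won't tell you whether it does so by actually finding the rarer class.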

In [None]:
cm = build_cm(model, data_va, feature_columns, target_column)
# With labels=[2, 4], class 4 (malignant) is treated as the "positive" class
tn, fp, fn, tp = cm.ravel()

print(f"Num True Labels [2, 4]: [{tn + fp}, {fn + tp}]")
print(f"Num Predicted Labels [2, 4]: [{tn + fn}, {tp + fp}]")
print(f"Precision: {tp / (tp + fp)}")
print(f"Recall / Sensitivity: {tp / (tp + fn)}")
print(f"Specificity: {tn / (tn + fp)}")

These statistics tell us that our model may be biased toward predicting a negative label. As such, our model is still having some trouble identifying people with malignant tumors. Let's take a closer look at the model's mispredictions.

In [None]:
def build_misclassified(model, data, feature_columns, target_column):
    """Build dataset of misclassified rows for a model and dataset."""
    X, Y = featurize(data, feature_columns, target_column)
    predictions = model.predict(X)
    index = predictions != Y
    
    misclassified = data[index].copy()
    misclassified["predicted_class"] = predictions[index]
    return misclassified

build_misclassified(model, data_tr, feature_columns, target_column).head()

In [None]:
m = build_misclassified(model, data_tr, feature_columns, target_column)
m[m["Class"] == 4]

One thing we see pretty quickly is there are a lot of rows that have `Bare Nuclei` value of 10 -- same with a couple of other features. Let's see if there's a trend we can possibly exploit.

In [None]:
m[m["Class"] == 2]

No values of 10 in the other mispredictions. Also, we notice that `Bare Nuclei` has NaN values here. Let's take a look at some of our statistics from earlier broken down by label.

In [None]:
data_tr[data_tr["Class"] == 2].isnull().mean()

In [None]:
data_tr[data_tr["Class"] == 2].describe()

In [None]:
data_tr[data_tr["Class"] == 4].isnull().mean()

In [None]:
data_tr[data_tr["Class"] == 4].describe()

We can see there are huge differences in the mean for most of these features, and it may be that `Bare Nuclei` isn't missing completely at random -- it seems to be missing at a much higher rate for benign tumors.
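
To compare missingness between the two classes in one step, you can compute the missing rate per label. Here is a minimal standard-library sketch on made-up rows, where `None` marks a missing value and the helper name is hypothetical.

```python
from collections import defaultdict

def missing_rate_by_class(rows, feature, label):
    """Return {class: fraction of rows where `feature` is missing (None)}."""
    missing = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        total[row[label]] += 1
        missing[row[label]] += int(row[feature] is None)
    return {c: missing[c] / total[c] for c in total}

# Made-up rows for illustration; None marks a missing value.
rows = [
    {"Bare Nuclei": 1, "Class": 2},
    {"Bare Nuclei": None, "Class": 2},
    {"Bare Nuclei": 2, "Class": 2},
    {"Bare Nuclei": None, "Class": 2},
    {"Bare Nuclei": 10, "Class": 4},
    {"Bare Nuclei": 8, "Class": 4},
    {"Bare Nuclei": 10, "Class": 4},
    {"Bare Nuclei": None, "Class": 4},
]
print(missing_rate_by_class(rows, "Bare Nuclei", "Class"))  # {2: 0.5, 4: 0.25}
```

On the real DataFrame, the same comparison could likely be written as `data_tr["Bare Nuclei"].isnull().groupby(data_tr["Class"]).mean()`.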

We should then be careful with its use as a feature. If we were to fill its value, what should we fill it with? There are [many options](https://www.kaggle.com/residentmario/simple-techniques-for-missing-data-imputation). We'll do the naive thing first and give it a sentinel value of -1:

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer


target_column = "Class"
feature_columns = ["Bare Nuclei", "Uniformity of Cell Size"]

Xtr, Ytr = featurize(data_tr, feature_columns, target_column)
model = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value=-1)),
    ('logreg', LogisticRegression())
]).fit(Xtr, Ytr)

print("Train Score: {}".format(model.score(Xtr, Ytr)))

Xva, Yva = featurize(data_va, feature_columns, target_column)
print("Validation Score: {}".format(model.score(Xva, Yva)))

Awesome! We have now brought our score up a lot -- let's take a deeper look:

In [None]:
cm = build_cm(model, data_va, feature_columns, target_column)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"Num True Labels [2, 4]: [{tn + fp}, {fn + tp}]")
print(f"Num Predicted Labels [2, 4]: [{tn + fn}, {tp + fp}]")
print(f"Precision: {tp / (tp + fp)}")
print(f"Recall / Sensitivity: {tp / (tp + fn)}")
print(f"Specificity: {tn / (tn + fp)}")

This looks a lot better!

There may be more ways to improve this model -- now it's your turn. What else should we do to improve this model?