## Chapter 11: ML

DS = mostly turning business problems into data problems, collecting + 
understanding + cleaning + formatting data, after which ML is almost an afterthought.

### Modeling

**model** = a specification of a mathematical (or probabilistic) relationship that exists between different variables.

* if trying to raise \$ for a social networking site, might build a business model (likely in a spreadsheet) that takes inputs like “# of users”, “ad revenue per user”, + “# of employees” + outputs estimated annual profit for the next several years.
* cookbook recipe entails a model that relates inputs like “# of eaters” and “hungriness” to quantities of ingredients needed. 
* Poker on TV = estimate each player’s “win probability” in real time based on a model that takes into account revealed cards so far + the distribution of cards in the deck

* The business model is probably based on simple mathematical relationships: profit = revenue - expenses, revenue = units sold * average price, etc. 
* Recipe model is probably based on trial + error (tried different combos  of ingredients until found one well-liked) 
* Poker model is based on **probability theory**, rules of poker, + some reasonably innocuous assumptions about the random process by which cards are dealt.
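The business model above can be sketched as a tiny function. This is a minimal illustration, not a real financial model: `cost_per_employee` and all sample figures are hypothetical, and payroll is assumed to be the only expense.

```python
def estimated_annual_profit(num_users, ad_revenue_per_user,
                            num_employees, cost_per_employee):
    """Toy business model: profit = revenue - expenses."""
    revenue = num_users * ad_revenue_per_user
    # Assumption: payroll is the only expense
    expenses = num_employees * cost_per_employee
    return revenue - expenses

# Hypothetical inputs, for illustration only
profit = estimated_annual_profit(num_users=100_000, ad_revenue_per_user=2.0,
                                 num_employees=10, cost_per_employee=15_000)
print(profit)
```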

### What Is Machine Learning?
ML = creating + using models that are *learned from data* (in other contexts called predictive modeling or data mining), typically w/ goal to use *existing* data to develop models we can use to predict various outcomes for *new* data: 

* Predicting whether an email message is spam or not
* Predicting whether a credit card transaction is fraudulent
* Predicting which advertisement a shopper is most likely to click on
* Predicting which football team is going to win the Super Bowl

**supervised models** = have a set of data labeled w/ correct answers to learn from vs. **unsupervised models** = no such labels vs. **semisupervised** = only some data are labeled vs. **online** = model needs to continuously adjust to newly arriving data

There are entire universes of models that might describe a relationship we’re interested in. In most cases, we choose a **parameterized family of models** + then use data to learn parameters that are, in some way, optimal.

* might assume person’s height is (roughly) a linear function of weight + then use data to learn what that linear function is.
* might assume a decision tree = good way to diagnose diseases patients have + then use data to learn the “optimal” tree.
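As a sketch of "choose a parameterized family, then learn the parameters": the family here is all lines `height = slope * weight + intercept`, and least squares picks the two parameters. The (weight, height) pairs are made up for illustration, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Made-up (weight in kg, height in cm) pairs, for illustration only
weights = [50, 60, 70, 80, 90]
heights = [155, 162, 171, 178, 184]

slope, intercept = fit_line(weights, heights)
print(f"height ~= {slope:.2f} * weight + {intercept:.1f}")
```

"Optimal" here means minimizing the sum of squared errors. Other definitions of optimal would give a different line.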

### Overfitting and Underfitting
Common danger in ML = **overfitting** = producing model that performs
well on training data but *generalizes poorly* to any *new* data
* could involve **learning noise** in the data or involve **learning to ID specific inputs rather than whatever factors are *actually* predictive for the desired output**

Other side = **underfitting** = producing a model that doesn’t perform well even on training data (typically when this happens you decide your model isn’t good enough + keep looking for a better one)

Models that are *too complex* lead to overfitting + don’t generalize well beyond training data. To make sure models aren’t too complex, most fundamental approach = **using different data to train a model + to *test*
the model**

Simplest way = split data set, so that (for example) 2/3 is used to train the model, after which measure model’s performance on the remaining 1/3

In [1]:
import random

def split_data(data, prob):
    """Split data into 2 lists: roughly fraction prob in the 1st, the rest in the 2nd"""
    results = [],[] # tuple of 2 lists
    for row in data:
        # add into 1st list if below given prob, add to 2nd list if not
        results[0 if random.random() < prob else 1].append(row)
    return results

Often given a **matrix `x` of input** variables + a **vector `y` of output** variables. In that case, need to make sure to put corresponding values together in either the training or test data:

In [2]:
def train_test_split(x,y,test_pct):
    """Split paired inputs x + outputs y into train + test sets"""
    data = list(zip(x,y))  # pair each input row with its label so they stay together
    train, test = split_data(data, 1 - test_pct) # split into sets
    # zip(*pairs) UNZIPS a list of (x, y) pairs into 1 tuple of x's + 1 tuple of y's
    x_train, y_train = zip(*train)
    x_test, y_test = zip(*test)
    return x_train, y_train, x_test, y_test

## can now do something like
#model = someModel()
#x_train, y_train, x_test, y_test = train_test_split(xs,ys,.33) # make sets
#model.train(x_train,y_train)
#performance = model.test(x_test,y_test)
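The random split above gives only *approximately* the requested fractions. A variant (not the version from these notes) shuffles row indices so the split sizes are exact and reproducible. The name `shuffled_train_test_split`, the `seed` parameter, and the toy data are assumptions for illustration.

```python
import random

def shuffled_train_test_split(x, y, test_pct, seed=0):
    """Split paired inputs/outputs by shuffling row indices (exact split sizes)."""
    rng = random.Random(seed)            # local generator, so results are reproducible
    indices = list(range(len(x)))
    rng.shuffle(indices)
    n_test = round(len(x) * test_pct)    # number of rows held out for testing
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [x[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    x_test = [x[i] for i in test_idx]
    y_test = [y[i] for i in test_idx]
    return x_train, y_train, x_test, y_test

xs = list(range(100))
ys = [2 * x for x in xs]                 # toy labels: each y is twice its x
x_train, y_train, x_test, y_test = shuffled_train_test_split(xs, ys, 0.33)
print(len(x_train), len(x_test))
```

Because the same index list drives both `x` and `y`, each input stays paired with its own label.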

If model was overfit to training, it will hopefully perform really poorly
on test. If performs well on test, you can be more confident it’s fitting rather than overfitting.

However, there are a couple of ways this can go wrong.
* 1) if there are common patterns in the test *and* train data that wouldn’t generalize to an even larger data set.
    * ex: data set consists of user activity, 1 row per user per week. 
    * In such a case, most users appear in both training + test data, + certain models might learn to ID *users* rather than *discover relationships involving attributes*
    * not a huge worry, but can happen
* 2) Bigger problem = use test/train split not *just* to *judge* a model but *also to choose from among many models*. 
    * In that case, although each *individual* model may not be overfit, the “choose model that performs best on test” is a **meta-training** that makes the test set function as a *second training set*
    * *Of course* the model that performed best on test is going to perform well on the test set
    * In such a situation, split the data into *THREE* parts: training set for building models, **validation set** for *choosing* among trained models, + test set for *judging* the final model
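A three-way split can be sketched like this. The 60/20/20 proportions, the `seed` parameter, and the function name are assumptions, not something these notes prescribe.

```python
import random

def train_validation_test_split(rows, train_pct=0.6, validation_pct=0.2, seed=0):
    """Shuffle rows and cut them into train / validation / test lists."""
    rows = list(rows)                       # copy so the caller's data is untouched
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train_pct)
    n_validation = round(len(rows) * validation_pct)
    train = rows[:n_train]
    validation = rows[n_train:n_train + n_validation]
    test = rows[n_train + n_validation:]    # whatever remains is the test set
    return train, validation, test

train, validation, test = train_validation_test_split(range(1000))
print(len(train), len(validation), len(test))
```

Fit every candidate model on `train`, pick the winner on `validation`, and touch `test` only once, to judge the final choice.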
    
### Correctness
Ex: Cheap, noninvasive test can be given to a newborn + predicts w/ > 98% accuracy whether the newborn will ever develop leukemia. Test predicts leukemia if + only if baby is named "Luke". Test is indeed > 98% accurate, but nonetheless, is incredibly stupid, + a good illustration of why we **don’t typically use “accuracy” to measure how good a model is**

Imagine building a model to make a binary judgment. Is this email spam? Should we hire a candidate? Is an air traveler secretly a terrorist? Given a set of labeled data + such a predictive model, every DP lies in 1 of
4 categories:
* **True positive**: “message is spam + we correctly predicted spam.”
* **False positive (Type 1 Error)**: “message = *not* spam, but we predicted spam.”
* **False negative (Type 2 Error)**: “message = spam, but we predicted not spam.”
* **True negative**: “message = *not* spam, + we correctly predicted not spam.”

We often represent these as counts in a **confusion matrix**. See how the leukemia test fits into this framework. These days approximately 5/1000 babies = named Luke, + the lifetime **prevalence** of leukemia = ~1.4%, or 14/1000 people. If we believe these **2 factors are independent** + apply my “Luke is for leukemia” test to 1M people, we’d expect to see this confusion matrix:

| | leukemia | no leukemia | total |
|---|---|---|---|
| “Luke” (predicted leukemia) | 70 (TP) | 4,930 (FP) | 5,000 |
| not “Luke” (predicted no leukemia) | 13,930 (FN) | 981,070 (TN) | 995,000 |
| total | 14,000 | 986,000 | 1,000,000 |
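The four counts in the confusion matrix above follow directly from the independence assumption. A quick check, using only the rates quoted in the text (the variable names are mine):

```python
population = 1_000_000
p_luke = 5 / 1000        # share of babies named Luke (from the text)
p_leukemia = 14 / 1000   # lifetime prevalence of leukemia (from the text)

# Independence: P(Luke and leukemia) = P(Luke) * P(leukemia)
tp = round(population * p_luke * p_leukemia)              # Luke, leukemia
fp = round(population * p_luke * (1 - p_leukemia))        # Luke, no leukemia
fn = round(population * (1 - p_luke) * p_leukemia)        # not Luke, leukemia
tn = round(population * (1 - p_luke) * (1 - p_leukemia))  # not Luke, no leukemia

print(tp, fp, fn, tn)    # 70 4930 13930 981070
```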

In [8]:
def model_acc(tp,fp,fn,tn):
    correct = tp+tn
    total = tp+tn+fp+fn
    return correct/total
print(model_acc(70,4930,13930,981070))

0.98114


Seems pretty impressive, but this is clearly not a good test, which means we probably **shouldn’t put a lot of credence in raw accuracy.**

It’s common to look at the combo of **precision and recall**. 
* **Precision** = how accurate "positive" predictions were = TP / (TP + FP)
* **Recall** = what fraction of all actual positives our model identified = TP / (TP + FN):

In [9]:
def precision(tp,fp,fn,tn):
    return tp / (tp+fp)

def recall(tp,fp,fn,tn):
    return tp / (tp+fn)

print("precision: ",precision(70,4930,13930,981070),"\nrecall: ",
     recall(70,4930,13930,981070))

precision:  0.014 
recall:  0.005


Both are terrible, reflecting that this is a terrible model.

Sometimes precision + recall are combined into **F1 score**

In [11]:
def f1(tp,fp,fn,tn):
    p = precision(tp,fp,fn,tn) # tp/(tp+fp) = correct pos / all predicted pos
    r = recall(tp,fp,fn,tn) # tp/(tp+fn) = correct pos / all real pos
    return 2*p*r / (p+r)
print(f1(70,4930,13930,981070))

0.00736842105263158


**F1 = harmonic mean** of precision + recall --> necessarily lies between them
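To see the harmonic-mean claim concretely, F1 can also be written purely in counts as 2·TP / (2·TP + FP + FN). This follows algebraically from 2pr / (p + r). The sketch below checks both forms on the leukemia numbers (`f1_from_counts` is a hypothetical helper name):

```python
def f1_from_counts(tp, fp, fn):
    """F1 written directly in terms of counts: 2*TP / (2*TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)

tp, fp, fn = 70, 4930, 13930
p = tp / (tp + fp)                      # precision
r = tp / (tp + fn)                      # recall
harmonic_mean = 2 / (1 / p + 1 / r)     # harmonic mean of precision + recall

print(harmonic_mean, f1_from_counts(tp, fp, fn))
```

Note that true negatives never appear, so F1 ignores how many of the easy "no" cases the model gets right.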

Usually choice of a model involves a **trade-off between precision + recall**. 
* model that predicts “yes” when it’s even a *little* bit confident will probably have high recall but low precision;
* model that predicts “yes” only when it’s *extremely* confident is likely to have a low recall and a high precision.

Alternatively, think of this as a trade-off between FP vs FN
* Saying “yes” too often gives lots of FP's
* Saying “no” too often gives lots of FN's

Imagine there were 10 risk factors for leukemia, + that more you had = more likely you were to develop leukemia. In that case, imagine a continuum of tests: “predict leukemia if at least 1 risk factor,” “predict leukemia if at least 2 risk factors,” + so on. As you increase the **threshold**, you increase test’s precision (since people w/ more risk factors = more likely to develop the disease), + decrease test’s recall (fewer + fewer of eventual disease-sufferers will meet threshold). In cases like this, **choosing the right threshold is a matter of finding the right trade-off**
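The threshold trade-off can be simulated on synthetic data. Everything about the population below is an assumption made up for illustration: 10,000 people, each risk factor present w/ probability 0.3, and disease probability 0.02 + 0.08 per risk factor.

```python
import random

rng = random.Random(0)

# Synthetic population (an assumption for illustration): each of 10 risk factors
# is present with probability 0.3, and disease risk rises with the factor count.
people = []
for _ in range(10_000):
    n_factors = sum(rng.random() < 0.3 for _ in range(10))
    has_disease = rng.random() < 0.02 + 0.08 * n_factors
    people.append((n_factors, has_disease))

def precision_recall(threshold):
    """Precision and recall of 'predict disease if n_factors >= threshold'."""
    tp = sum(1 for n, sick in people if n >= threshold and sick)
    fp = sum(1 for n, sick in people if n >= threshold and not sick)
    fn = sum(1 for n, sick in people if n < threshold and sick)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall

for threshold in range(1, 7):
    p, r = precision_recall(threshold)
    print(f"threshold {threshold}: precision {p:.3f}, recall {r:.3f}")
```

Recall can only fall as the threshold rises, since fewer people are flagged. Precision tends to rise here only because risk was built to increase w/ the number of factors.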

### The Bias-Variance Trade-off
Another way of thinking about overfitting = trade-off between **bias and
variance.**

Both = measures of what would happen if you were to retrain a model many times on different training sets (from the same larger population).
* Ex: degree-0 polynomial model will make a lot of mistakes for pretty much any training set (drawn from the same population), which means that it has a high **bias**
* However, any 2 randomly chosen training sets should give pretty similar models (since any 2 randomly chosen training sets should have pretty similar average values), so we say it has a low variance. 
* **High bias, low variance typically = underfitting**

On the other hand
* degree-9 polynomial model can fit a training set perfectly = very low bias 
* but very high variance (since any 2 training sets would likely give rise to very different models)
* **Low bias, high variance typically = overfitting.**
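A small experiment makes the two cases concrete. Assumptions, all mine: the population is y = 2x + 1 plus Gaussian noise, each training set uses the same 10 x values, and the degree-9 fit is done by Lagrange interpolation (the unique degree-9 polynomial through 10 points) rather than by any fitting routine from these notes.

```python
import random
from statistics import mean, pvariance

rng = random.Random(0)
XS = list(range(10))                       # ten fixed x values

def sample_training_set():
    """Assumed population: y = 2x + 1 plus Gaussian noise (illustrative only)."""
    return [2 * x + 1 + rng.gauss(0, 1) for x in XS]

def degree0_predict(ys, x):
    """Degree-0 polynomial: always predict the training mean."""
    return mean(ys)

def degree9_predict(ys, x):
    """Degree-9 polynomial through all ten points (Lagrange interpolation)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(XS, ys)):
        weight = 1.0
        for k, xk in enumerate(XS):
            if k != i:
                weight *= (x - xk) / (xi - xk)
        total += yi * weight
    return total

def training_error(predict, ys):
    """Mean squared error on the training points themselves."""
    return mean((predict(ys, x) - y) ** 2 for x, y in zip(XS, ys))

training_sets = [sample_training_set() for _ in range(200)]
x_new = 0.5                                # a point between two training inputs

for name, predict in [("degree 0", degree0_predict), ("degree 9", degree9_predict)]:
    avg_train_error = mean(training_error(predict, ys) for ys in training_sets)
    prediction_variance = pvariance([predict(ys, x_new) for ys in training_sets])
    print(f"{name}: training MSE {avg_train_error:.3f}, "
          f"variance of prediction at x={x_new}: {prediction_variance:.3f}")
```

Degree 0 should show a large training error but nearly identical predictions across training sets. Degree 9 should show zero training error but predictions that swing widely from one training set to the next.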

Thinking about model problems this way can help figure out what to do when a model doesn’t work so well. If a model has **high bias** (performs poorly even on training data), 1 thing to try = **adding more features.** Going from the degree 0 model to the degree 1 model can be a big improvement.

If a model has **high variance**, can similarly **remove features**. But another solution = to **obtain more data** (if possible).
* Ex: Fit a degree 9 polynomial to different size samples. 
* Model fit based on 10 DP's = all over the place, as we saw before. 
* If instead trained on 100 DP's = much less overfitting
* Model trained from 1K DP's looks very similar to a degree 1 model.

*Holding model complexity constant*, more data = harder it is to overfit. On the other hand, **more data won’t help with bias = if a model doesn’t use enough features to capture regularities in the data, throwing more data at it won’t help**

### Feature Extraction and Selection

As mentioned, when data doesn’t have enough features, model is likely to underfit + when data has too many features, it’s easy to overfit

**Features** = inputs provided to a model + in simplest case, are given to you. 
* ex: To predict salary based on years of experience, "years of experience" is the only feature you have.
* Although, might also consider adding years of experience squared, cubed, + so on if it helps build a better model
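Deriving the squared + cubed terms from the single raw feature takes one line. `polynomial_features` is a hypothetical helper name:

```python
def polynomial_features(years_experience, max_degree=3):
    """Return [x, x**2, ..., x**max_degree] for a single numeric input."""
    return [years_experience ** d for d in range(1, max_degree + 1)]

print(polynomial_features(4))   # [4, 16, 64]
```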

Things become more interesting as data becomes more complicated. Imagine trying to build a spam filter to predict junk or not. Most models won’t know what to do w/ a raw email (collection of text). You’ll have to **extract features**. 
* Does email contain the word “Viagra”? How many times does the letter d appear? What was the domain of the sender?
* The first = simply a yes/no, typically encoded as 1/0. 
* The second = a number.
* The third = a choice from a discrete set of options.

Pretty much always, we’ll extract features from data that fall into 1 of these 3 categories. What’s more, the *type* of features we have constrains the type of models we can use.
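The three email features above can be extracted like this. The function name, the dict keys, and the sample message are hypothetical. Counting `d` case-insensitively is my choice, not something the notes specify.

```python
def extract_features(sender, body):
    """Turn a raw email into the three kinds of features described above."""
    text = body.lower()
    return {
        "contains_viagra": 1 if "viagra" in text else 0,      # yes/no, encoded as 1/0
        "num_d": text.count("d"),                             # a number
        "sender_domain": sender.rsplit("@", 1)[-1].lower(),   # one of a discrete set
    }

# Hypothetical message, for illustration only
features = extract_features("deals@example.com", "Discount Viagra, order today!")
print(features)
```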

**Naive Bayes classifier** = suited to yes/no features, **Regression models** require numeric features (could include dummy variables = 0s/1s),
**decision trees** can deal w/ numeric *or* categorical data.
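Dummy variables are how a categorical feature (like sender domain) gets fed to a model that needs numbers. A minimal sketch, w/ a hypothetical `one_hot` helper + a made-up list of domains:

```python
def one_hot(value, categories):
    """Encode a categorical value as a list of 0/1 dummy variables."""
    return [1 if value == category else 0 for category in categories]

# Hypothetical set of sender domains
domains = ["example.com", "example.org", "example.net"]
print(one_hot("example.org", domains))   # [0, 1, 0]
```

A value outside the known categories encodes as all zeros here. Whether that is acceptable depends on the model.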

Can *create* features, or sometimes instead look for ways to *remove* features
* Ex: inputs might be vectors of several hundred #'s + depending on the situation, might be appropriate to distill these down to handful of important dimensions (Dimensionality Reduction) + use only those small # of features. 
* Or it might be appropriate to use a technique like **regularization**, that penalizes models the more features they use
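One common form of regularization is a ridge (L2) penalty, chosen here as an example rather than taken from these notes. Strictly, it penalizes large coefficients rather than counting features, which discourages a model from leaning heavily on many of them. All names + the toy data are assumptions.

```python
def squared_error(beta, x, y):
    """Squared error of the linear prediction dot(beta, x) for one data point."""
    prediction = sum(b * x_i for b, x_i in zip(beta, x))
    return (y - prediction) ** 2

def ridge_penalty(beta, alpha):
    """alpha * sum of squared coefficients, skipping the constant term beta[0]."""
    return alpha * sum(b ** 2 for b in beta[1:])

def penalized_error(beta, xs, ys, alpha):
    """Total squared error plus the ridge penalty."""
    total_error = sum(squared_error(beta, x, y) for x, y in zip(xs, ys))
    return total_error + ridge_penalty(beta, alpha)

# Toy data: first entry of each x is the constant 1, and y = 2 * x[1] exactly
xs = [[1, 1.0], [1, 2.0]]
ys = [2.0, 4.0]
print(penalized_error([0, 2.0], xs, ys, alpha=0.0))   # perfect fit, no penalty
print(penalized_error([0, 2.0], xs, ys, alpha=0.5))   # same fit, now penalized
```

Larger `alpha` pushes the optimal coefficients toward 0, trading a bit of training fit for a simpler model.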

*How do we choose features*? = combo of experience + domain expertise (receive lots of emails + probably sense the presence of certain words might be a good indicator of spamminess + # of `d`’s is likely not a good indicator of spamminess)

But in general you’ll have to try different things, which is part of the fun.