# DAML 09 - Real World Problems

Michal Grochmal <michal.grochmal@city.ac.uk>

Just so we have this in one place,
let's group together all the higher level issues that concern machine learning models.
These issues happen almost every time we want
to use a machine learning model to solve a real world problem (i.e. not a toy problem).
We have already discussed several of them but a list works better as a reference.

## Bias, Variance, and the Size of Training Data

The *bias vs variance* trade-off argues that a model that is not complex enough
will *underfit* the data, and a model that is too complex will *overfit* the data.
We control model complexity through model hyperparameters,
and we can estimate a good complexity by trying several hyperparameters and
cross-validating their performance.

Yet, the model complexity found by tuning the hyperparameters is not constant
when we add more data points.
In other words, we can prove that a problem is solvable by sampling our data points;
but we cannot argue that a model tuned to use certain hyperparameters
will perform (generalize) as well on new data.

If we have enough data to worry about the training time,
we can use a **learning curve** to estimate how well our hyperparameters,
tuned on small samples, will perform on the full dataset.
The intuition is that *the more variance a dataset holds,
the more samples are needed to explain this variance*.
Therefore, if we train and tune our model on increasingly bigger samples of our dataset,
the score of the model on its training data and its score on the test set will converge.
Once these scores converge we can be reasonably confident that we have found the number of samples
needed to account for the variance in the data.
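
A learning curve can be sketched with `sklearn`'s `learning_curve` helper.
The snippet below is only an illustration: the digits dataset, the `SVC` model and its `gamma` value
are assumptions standing in for our own data and tuned model.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

# Assumption: digits and an SVC stand in for "our dataset" and "our model".
X, y = load_digits(return_X_y=True)

# Train on 5 increasingly bigger fractions of the training folds.
train_sizes, train_scores, valid_scores = learning_curve(
    SVC(gamma=0.001), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)

# Average over the cross-validation folds for each sample size.
train_mean = train_scores.mean(axis=1)
valid_mean = valid_scores.mean(axis=1)

for n, tr, va in zip(train_sizes, train_mean, valid_mean):
    print(f"{int(n):5d} samples: train={tr:.3f} validation={va:.3f}")
```

If the two columns stop getting closer as the sample size grows, adding more data is unlikely to change the picture much.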

## Validate your Model!

The difficult part of the art of machine learning is not making a model work;
it is proving that it works and that it will keep working on new data.
Moreover, depending on the problem we may want to validate a model for different things,
e.g. in a fraud detection model we want the recall
of fraud data points to be the most important validation measure.
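
As a rough illustration of validating for a specific measure,
the `scoring` argument of `cross_val_score` lets us ask for recall instead of accuracy.
The data below is synthetic (`make_classification` with about 5% positives),
an assumption standing in for real fraud data, and the logistic regression is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for fraud data: roughly 5% of samples are class 1.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)

model = LogisticRegression(max_iter=1000)

# Same model, same folds, two different questions.
accuracy = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
recall = cross_val_score(model, X, y, cv=5, scoring="recall").mean()

print(f"accuracy={accuracy:.3f} recall of the rare class={recall:.3f}")
```

On imbalanced data the accuracy can look comfortable while the recall of the rare class tells a different story,
which is why the validation measure must follow the problem.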

If your model will work with new data, just cross-validating it is not enough.
Cross-validation allows us to select the best hyperparameters,
and gives us a good estimate of how well a model performs;
but it does not give us an estimate of how badly our model can perform on new data,
i.e. we do not have a generalization baseline.

To estimate how our model performs against new data, we need to separate our data
into training and test sets, and only then perform cross-validation on the training set alone.
The resulting model's generalization can then be evaluated on the test set.
In other words, we now have a test set, and several folds
which are the training and validation sets.
This ensures that the model sees only the training set during the tuning of its parameters,
and sees only the training and validation sets during the tuning of its hyperparameters.
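
The procedure can be sketched as follows.
The dataset (digits), the model (k nearest neighbors) and the grid of `n_neighbors` values are assumptions
made for the sake of the example.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# Hold out the test set first; it is not touched until the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Cross-validation (training and validation folds) happens inside the training set only.
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]}, cv=5)
grid.fit(X_train, y_train)

cv_score = grid.best_score_               # used to pick the hyperparameters
test_score = grid.score(X_test, y_test)   # generalization estimate on unseen data

print("best hyperparameters:", grid.best_params_)
print(f"cross-validation score={cv_score:.3f} test score={test_score:.3f}")
```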

## Scaling Data

Machine Learning algorithms work on numbers, and assume that if a number is bigger
it means that this number is more important.
Yet, that is often not what we actually want.
Most models are sensitive to the magnitude of the features,
and scaling the features to have a similar magnitude will,
more often than not, give better results.
Borrowing from `sklearn`, two common ways of scaling data are:

- [StandardScaler][scal] subtracts the mean and then divides by the standard deviation
  to achieve mean zero and variance one for all features.

- [Normalizer][norm] rescales each sample (row), not each feature, to have unit norm.
  To force all features to have values between zero and one,
  a form that neural networks often need, use `MinMaxScaler` instead.
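
A quick way to check what a scaler does is to apply it to a tiny array and look at the result.
The values below are made up for illustration, and `MinMaxScaler` is included for comparison.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# Two features on very different scales (hypothetical values).
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 5000.0]])

standard = StandardScaler().fit_transform(X)
print(standard.mean(axis=0))  # each feature (column) is centered on zero
print(standard.std(axis=0))   # and has unit standard deviation

normalized = Normalizer().fit_transform(X)
print(np.linalg.norm(normalized, axis=1))  # each sample (row) has unit length

minmax = MinMaxScaler().fit_transform(X)
print(minmax.min(axis=0), minmax.max(axis=0))  # each feature spans zero to one
```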

[scal]: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
[norm]: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html

## Ensembles, Voting and OVO vs OVR

Ensemble methods are powerful.
Depending on how you set up the ensemble it can boost the performance of models
or work around limitations of certain models.

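A *bagging* technique (used by, but not limited to, random forests)
takes several models, trains each and predicts by majority vote across all models.
AdaBoost is a related ensemble that uses *boosting* instead:
its models are trained one after another, each focusing on the errors of the previous ones.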
The internal models are often trained on subsets of the data, or with different
randomization and hyperparameters.
Bagging increases the performance and generalization of the models,
and makes up for models which suffer from in-built overfitting (i.e. low generalization).
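
A small sketch of bagging, assuming decision trees as the internal models and the digits dataset as data
(both are choices made for this example only).
Note that `sklearn`'s `BaggingClassifier` averages predicted probabilities when the internal model provides them
and falls back to voting otherwise.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# A single, fully grown tree tends to overfit.
tree = DecisionTreeClassifier(random_state=0)

# 50 trees, each trained on a random sample holding 80% of the training data.
bagged = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                           n_estimators=50, max_samples=0.8, random_state=0)

tree_score = cross_val_score(tree, X, y, cv=5).mean()
bagged_score = cross_val_score(bagged, X, y, cv=5).mean()

print(f"single tree={tree_score:.3f} bagged trees={bagged_score:.3f}")
```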

*One vs one* (OVO) and *one vs rest* (OVR) techniques are used to allow binary classifiers
to perform multiclass classification.
OVO trains a classifier for every pair of classes.
This means that a big number of binary classifiers will be trained,
each on a subset of the data (only the samples for its two classes).
At prediction time OVO runs all competing classifiers and decides on the class with the most wins.
In OVR (also called OVA, One vs All) the number of trained classifiers is the number of classes:
each classifier is trained on the samples of one class as the positive class
and all other samples as the negative class.
OVR then selects the answer by picking the class with the highest score.
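
The difference in the number of trained classifiers is easy to verify with the wrappers in `sklearn.multiclass`.
In this sketch the binary classifier (`LinearSVC`) and the 10-class digits dataset are assumptions chosen for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)  # 10 classes
X = X / 16.0                         # pixel values go from 0 to 16

ovo = OneVsOneClassifier(LinearSVC(max_iter=5000, random_state=0)).fit(X, y)
ovr = OneVsRestClassifier(LinearSVC(max_iter=5000, random_state=0)).fit(X, y)

n_ovo = len(ovo.estimators_)  # one classifier per pair of classes: 10 * 9 / 2
n_ovr = len(ovr.estimators_)  # one classifier per class

print("OVO classifiers:", n_ovo)
print("OVR classifiers:", n_ovr)
```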

## Probabilities versus Decision Functions

Being able to explain why your model classifies things the way it does
may or may not be important for the problem you are solving.
Classification (and often regression) can be performed in two ways:
by constructing a decision function and deciding upon classes/values based
on this function; or by assigning probabilities to each class/value and
deciding based on the highest probability.

Probabilities can be computed for every class, and therefore are easy to explain to humans,
e.g. a misclassification where the wrong class has 51% confidence
and the correct one has 47% confidence makes it easy to argue that the model
needs a little more tuning (but does not need a rewrite).
With a decision function the explanation of model errors can become hairy,
since most decision functions compute their score in very high dimensions.

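Some models based on decision functions are (in `sklearn` most of these expose a `decision_function` method; the neural network classifier `MLPClassifier` is an exception and only exposes `predict_proba`):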

- SVM
- Logistic Regression (some regression algorithms *are* their own decision functions)
- Neural Networks

Models based on probabilities are (these have a `predict_proba` method in `sklearn`):

- k Nearest Neighbors
- Naive Bayes
- Decision Trees and Random Forests

Most classifiers that work by building a decision function are *binary classifiers*,
and require the use of OVO or OVR techniques to perform multiclass classification.
Probability based classifiers perform multiclass classification by default.
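
To see the two kinds of output side by side, we can ask an SVM for its decision scores
and a k nearest neighbors classifier for its probabilities.
The iris dataset and the model settings here are assumptions for the example.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 3 classes

svm = SVC(decision_function_shape="ovr").fit(X, y)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

sample = X[:1]
scores = svm.decision_function(sample)  # one unbounded score per class
probas = knn.predict_proba(sample)      # one probability per class, summing to one

print("decision scores:", scores)
print("probabilities:  ", probas)
```

The probabilities can be read directly, whilst the decision scores only make sense relative to each other.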

Dimensionality reduction has only the *explained variance* measure (in PCA)
that may give some insight into the model; manifold methods lack an explanation feature.
Some clustering methods may provide probabilities, but these have less
value for explaining the model since the cluster centers may be incorrect.
Instead, techniques such as *silhouette scores* better visualize (and explain) a clustering model.
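
Both measures are one call away in `sklearn`.
A minimal sketch, assuming the iris dataset, two principal components and three clusters (all arbitrary choices):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

X, _ = load_iris(return_X_y=True)

# Fraction of the total variance kept by each principal component.
pca = PCA(n_components=2).fit(X)
explained = pca.explained_variance_ratio_

# Silhouette: from -1 (samples in the wrong cluster) to 1 (well separated clusters).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
silhouette = silhouette_score(X, labels)

print("explained variance ratio:", explained)
print(f"silhouette score: {silhouette:.3f}")
```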

## Online Learning

One thing we have not touched on yet is the concept of *online learning*.
The name distinguishes it from *offline* (or batch) learning.
In *offline learning* we can work with all data we will ever feed the model
with at once, i.e. we can load a dataset in memory, use cross-validation
to tune a model over this dataset and test against a test set.
Offline learning is incredibly common in data analysis.

The problem starts when we plan to build a model (say, classifier) on top of data that
is continuously entering or flowing through the system.
We never have the full dataset in such a case:
we may have all data until today but even that will not necessarily be complete.
What we need is a model that can learn and *re-learn* from new data;
such a model is an online learning model.

To perform online learning we need to be capable of tuning model
parameters to new data without looking back at the previous data.
In other words,
the model parameters must *represent the data seen until now* and if we change such
a parameter slightly it will not affect the overall model too much.

It turns out that most models that predict probabilities cannot be used for online learning.
These models make an estimate of a probability distribution,
and it is not possible to re-estimate a probability distribution knowing only
the distribution and new data points.
Well, it is possible to draw points from the distribution,
add the new ones and re-estimate but that would still require us to know how many points to
draw (e.g. the same number of points as in the old data).

Decision functions, when cleverly parametrized, are much easier to re-tune to new data.
A small change to a parameter in the decision function can bring the model closer
to new data with very little computational effort.

But how small a change?  That answer depends on the problem.
This small (or not so small) change is called the *learning rate* of an online learning model.
A big learning rate will make the model forget about old data quickly,
whereas a very small learning rate will make the model have a lot of inertia when adapting to new data.
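
As a sketch of the idea, `sklearn`'s `SGDClassifier` (a linear decision function tuned by gradient descent)
can be updated batch by batch through `partial_fit`.
The batch size of 100, the constant learning rate `eta0=0.01` and the use of the digits dataset
as a pretend data stream are all assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier

X, y = load_digits(return_X_y=True)
X = X / 16.0
classes = np.unique(y)  # partial_fit must know every class from the first call

# eta0 is the learning rate: how far each new batch may move the parameters.
model = SGDClassifier(learning_rate="constant", eta0=0.01, random_state=0)

# Pretend the data arrives in batches of 100 samples;
# the model never looks back at a batch it has already seen.
for start in range(0, 1500, 100):
    batch = slice(start, start + 100)
    model.partial_fit(X[batch], y[batch], classes=classes)

# The last 297 samples were never used for training.
holdout_score = model.score(X[1500:], y[1500:])
print(f"score on data not yet seen: {holdout_score:.3f}")
```

Trying a few values of `eta0` on such a stream is one way to get a feeling for how quickly the model forgets old data.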

### Do you need Online Learning?

There is a huge cost in achieving actual online learning: one cannot use most probabilistic models,
one must define and test a learning rate, and several models
can only use a subset of their capabilities as online learning models
(e.g. SVMs can only use the linear kernel because it has a tunable decision function).
And in most problems a model does not receive data all the time.

Most machine learning problems will receive data at known intervals,
e.g. a daily snapshot of a database or a time series of the last 30 days of trades.
This means that you can retrain your model with the new data every time you receive it.
You may need to slightly re-tune the hyperparameters
but the grid search should be close to the current hyperparameters,
since the new data is unlikely to be very different from the old one.

Retraining a new model at certain intervals does not need to create downtime:
you can automate the training and, only when the training and validation finish,
point a load balancer to the new model.
Training a new model at every batch also allows you to test it for obvious inconsistencies.
Most machine learning is performed as a service,
where a trained model sits in memory, waits for input and sends back predictions,
so training a model at server startup is a completely valid and often used technique.

In summary, you will only need an online learning model if either:

- The entire dataset cannot fit in memory;
  in this case you need online learning to train the model using parts of the dataset at a time.

- You need very quick adaptation to new data, e.g. algorithmic trading.

### Can I do Online Dimensionality Reduction and Clustering?

Absolutely.  Moreover, it would sometimes be impossible to do online classification
without the ability to perform online dimensionality reduction.
The [Incremental PCA][ipca] performs PCA in an online fashion:
it understands the idea of batch processing and updates its eigenvectors batch by batch.

Using the same concept (batches) the `k-means` algorithm has a [Mini Batch k-means][mbkm] variant.
Akin to the incremental PCA, the mini batch k-means can deal with very big datasets in an online fashion.
In general, in `sklearn` any model that has a `partial_fit` method can perform online learning.
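
A minimal sketch of both models fed batch by batch.
The digits dataset split into 10 batches plays the role of a dataset too big for memory,
and the number of components and clusters are arbitrary assumptions.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import IncrementalPCA

X, _ = load_digits(return_X_y=True)
batches = np.array_split(X, 10)  # stand-in for chunks read from disk

ipca = IncrementalPCA(n_components=10)
kmeans = MiniBatchKMeans(n_clusters=10, n_init=3, random_state=0)

# First pass: update the principal components one batch at a time.
for batch in batches:
    ipca.partial_fit(batch)

# Second pass: cluster the reduced batches, again one at a time.
for batch in batches:
    kmeans.partial_fit(ipca.transform(batch))

print("components:", ipca.components_.shape)
print("cluster centers:", kmeans.cluster_centers_.shape)
```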

[ipca]: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.IncrementalPCA.html
[mbkm]: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html

## Models Rot

An ML model is, as the name suggests,
a mathematical technique which estimates the behavior of the real world.
Unfortunately (fortunately?) the real world changes, and if our model
does not change in response it will soon perform worse and worse.

Of course, this is not relevant to self-contained datasets,
such as the ones seen in competitions or in toy problems.
But most models are expected to work on real, and actual, data.
Recent data may have new trends,
and a performance estimate without that trend will be overoptimistic.
In other words, the performance of your model will decrease over time if you do not update it.

### Monitor your model

Since the performance will decrease over time you need to know when it decreases to a point
at which your model is not good enough anymore to perform its job.
Even if you do not retrain your model in, say, daily batches,
you still need to check the model's performance against new data;
i.e. you need to cross check whether a model trained on new data
would classify in the same manner as the current running model.

Another reason to monitor your model is that you cannot test for every behavior during
model validation (if you could you would not be using ML to solve the problem!).
A new model trained on new data, or an online model that has been battered with
bad input for a while, may perform abysmally once deployed.
In such a case you need a way of reverting to a previous model.
For offline learning this is often easy as long as you stored the previous data.
For online learning you will need to store snapshots of your model at certain intervals.
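
One possible shape for such a check is sketched below.
Everything in it is hypothetical: the digits dataset is cut into an "old" part, a "new" labeled part
and a "live" unlabeled part, and k nearest neighbors plays the role of the deployed model.

```python
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# Hypothetical timeline: data the running model was trained on, then newer data.
X_old, y_old = X[:1000], y[:1000]
X_new, y_new = X[1000:1500], y[1000:1500]
X_live = X[1500:]  # recent inputs for which we have no labels yet

current = KNeighborsClassifier().fit(X_old, y_old)    # the model in production
candidate = KNeighborsClassifier().fit(X_new, y_new)  # a model trained on new data

# Score of the running model on new labeled data: a drop here signals rot.
new_data_score = current.score(X_new, y_new)

# Fraction of live inputs on which both models give the same answer.
agreement = (current.predict(X_live) == candidate.predict(X_live)).mean()

print(f"current model on new data: {new_data_score:.3f}")
print(f"agreement with candidate:  {agreement:.3f}")
```

What counts as too low a score or too little agreement depends on the problem and needs to be decided beforehand.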

## Save your model

First of all ask whether it makes sense to save the model.
Often retraining on the most recent data, or a previous data snapshot at server startup
is more convenient and even faster for some models (e.g. KNN).

Since we are working with Python we can simply use Python's default way of storing
serializable memory objects: [pickle][pick].
The default `pickle.dumps` and `pickle.loads` work on pretty much all `sklearn` models,
and, since `NumPy` arrays are serializable, `pickle` works as well on most other ML libraries.

That said, `pickle` may result in quite bloated objects,
which was one of the reasons [joblib][jobl] was developed.
The bloat is most visible on objects that carry large `NumPy` arrays,
which `joblib.dump` handles much more efficiently by storing each array as a binary lump.
Older versions of `sklearn` shipped a `joblib` copy at `sklearn.externals.joblib`;
that copy has since been removed, so install and import `joblib` directly.
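
A short sketch of both ways of saving a model.
It assumes the standalone `joblib` package is installed; the model, the dataset
and the temporary file name `model.joblib` are placeholders for this example.

```python
import os
import pickle
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier().fit(X, y)

# pickle: serialize to bytes in memory and restore.
restored = pickle.loads(pickle.dumps(model))

# joblib: serialize to a file and restore (the path is a throwaway temporary file).
path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(model, path)
loaded = joblib.load(path)

# Both copies should predict exactly as the original model does.
same = bool((restored.predict(X) == loaded.predict(X)).all())
print("restored models agree:", same)
```

Only load pickled (or joblib) files that you trust, since loading them can execute arbitrary code.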

[pick]: https://docs.python.org/3/library/pickle.html
[jobl]: https://pythonhosted.org/joblib/index.html

## Types of Machine Learning Models

We saw that we can divide machine learning problems/models into four categories:

- Classification
- Regression
- Dimensionality Reduction
- Clustering

We also saw that the line between *classification* and *regression* is pretty thin,
i.e. that these models only differ in how they present their outputs.
Yet, there are even thinner divisions between other forms of machine learning techniques.
Some other types of machine learning problems you may see out there are:

- *Anomaly detection* is a binary classification problem where *normal*
  is one of the classes and *abnormal* the other.
  Tuning between precision and recall for these classes is often automated in a way that
  it is easy to change an abnormal activity into a normal sample upon human inspection.

- *Association Rule Learning* searches for rules of the form "if A then B"
  between the items or variables of a dataset.
  It is close to a clustering problem in that it groups things that occur together,
  simplified to a level which can be easily explained.
  Hierarchical clustering shares this ease of explanation,
  although it is a clustering technique and not a rule learner.

- *Reinforcement Learning* is an online learning technique in which an agent
  learns from the rewards (or penalties) it receives for its actions,
  often on top of a non-linear classification/regression model with a *variable learning rate*.
  It attempts to reuse the knowledge obtained in earlier interactions when acting in later ones.
  Reinforcement learning is what we often consider robot-AI, and, to be fair,
  it is quite popular in robot development.

Finally there are also **genetic algorithms** and **swarm intelligence**,
which are completely different techniques.
Both take analogies from nature and use them to iteratively build and tune ML models.
Experiments have shown that both techniques
work, but their application to new problems is tricky to develop.
In essence, the use of these algorithms in a practical manner is viable,
and even very successful, on specific problems.
Yet the mathematical foundation for the convergence of genetic algorithms
and swarm intelligence is far from fully developed.
This may change in the following years (perhaps decades);
if you want a field that is not extensively explored to work in, there's your chance.