In [None]:
%matplotlib inline
import matplotlib
import seaborn as sns
matplotlib.rcParams['savefig.dpi'] = 144

# Overview of Machine Learning
<!-- requirement: images/plot_classifier_comparison.png -->
<!-- requirement: images/ml_map.png -->

**Machine Learning**: A computer program learns if it improves its __performance__ on some __task__ with __experience__.

* __Task__: Machine learning tasks can be divided into two major types. 

    1. The first type is __supervised__, where we make predictions based on labeled data. Supervised machine learning tasks can be further broken down into 
        1. __regression__ tasks, where we make predictions that are continuous and ordered (ex: housing prices), and 
        1. __classification__ tasks, where we make predictions that are categorical and unordered (ex: whether a patient has a disease). 
    1. The second type is __unsupervised__, which we use to gain insight into the underlying data structure or to preprocess our data prior to building supervised models. Examples of unsupervised tasks include 
        1. clustering and 
        1. dimension reduction. 
        
* __Performance__: We need to come up with a __metric__ to evaluate our model. Some metrics are errors (that we want to minimize) and others are scores (that we want to maximize). The type of metric we will choose depends on our task.

    1. Regression: sum of squared errors, absolute error, $R^2$
    1. Classification: accuracy, precision and recall, ROC curve and AUC, log loss (cross entropy), Gini index
    
* __Experience__: Typically, we optimize our model by minimizing a loss or cost function. Doing so usually requires taking the derivative of the cost function with respect to the model __parameters__. If we do this iteratively (usually in __batches__ via __gradient descent__), we say our model "learns with experience." In some cases, however, we can solve for the optimal parameters directly without using gradient descent (ex: linear regression). Strictly speaking, our program is not "learning" in this scenario. We use iterative optimization techniques when data are streaming in or when the sheer quantity of data makes a direct solution computationally difficult. 
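
To make the contrast between a direct solution and iterative learning concrete, here is a minimal sketch using NumPy. The synthetic data, learning rate, batch size, and epoch count are all assumptions chosen for illustration, not values from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = 3*x + 2 + noise
X = rng.uniform(-1, 1, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.1, size=200)

# Append a column of ones so the intercept is just another parameter
A = np.hstack([X, np.ones((X.shape[0], 1))])

# Direct least-squares solution: no iteration, no "experience"
theta_closed, *_ = np.linalg.lstsq(A, y, rcond=None)

# Mini-batch gradient descent on the mean squared error
theta_gd = np.zeros(2)
learning_rate = 0.1
batch_size = 20
for epoch in range(200):
    order = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        residual = A[idx] @ theta_gd - y[idx]
        grad = 2.0 * A[idx].T @ residual / len(idx)  # gradient of the batch MSE
        theta_gd -= learning_rate * grad

print("closed form:     ", theta_closed)
print("gradient descent:", theta_gd)
```

Both approaches should land on nearly the same slope and intercept. The iterative version only ever touches one batch at a time, which is what makes it usable when the data do not fit in memory.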

## Decision-making flowchart

A few factors to consider:
1. What are you trying to predict?  Is it a classification, regression, or clustering problem? Should you use a linear or non-linear model?
1. How does the algorithm scale to larger datasets in terms of both **memory** and **time**?  Does this make the computation infeasible?
1. Is there an **iterative** (versus **batch**) version of this algorithm?  Is **online learning** or **streaming** possible?
1. Do you have to worry about explicability?
1. Are there many features?  If there are, do you want to reduce the dimension (e.g. PCA or Lasso)?
1. **Accuracy**: does the algorithm tend to underfit (e.g. it doesn't satisfy the asymptotic approximation property)?  A fixed model form is less able to take advantage of more data than a flexible model form.
1. Does the "prior" you're introducing with a **parametric** model make sense for your dataset?
1. Are you working with timeseries data and wish to do forecasting?
1. Are you building a classifier with unbalanced classes? How are you dealing with outliers? Anomalies? 
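
As one illustration of the online learning question above, some scikit-learn estimators expose a `partial_fit` method that updates the model one chunk at a time. The sketch below assumes a synthetic dataset and an arbitrary chunk size of 100 rows. It is meant to show the pattern rather than a tuned model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic data standing in for a stream (an assumption for illustration)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, y_train = X[:1500], y[:1500]
X_test, y_test = X[1500:], y[1500:]

clf = SGDClassifier(random_state=0)
classes = np.unique(y)  # the full set of labels must be given on the first call

# Feed the training data in chunks, as if it were arriving over time
chunk_size = 100
for start in range(0, len(X_train), chunk_size):
    stop = start + chunk_size
    clf.partial_fit(X_train[start:stop], y_train[start:stop], classes=classes)

online_accuracy = clf.score(X_test, y_test)
print("held-out accuracy after streaming:", online_accuracy)
```

Estimators without `partial_fit` generally need the whole training set at once, which matters when you weigh memory and time constraints.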

Performance considerations can often be nuanced.

* Neural Networks train faster than SVMs (at least three reasons why?)
* SVMs predict faster than Neural Networks (why?)

## Comparing ML Algorithms

It's important to be able to clearly understand and explain the theoretical and practical differences between machine learning algorithms. This notebook (for now) is a collection of resources to help you understand the landscape of algorithms as a whole.

This flowchart ([interactive source](http://scikit-learn.org/stable/tutorial/machine_learning_map/)) has some good examples of the kinds of criteria one needs to be thinking about when choosing an algorithm.
![Machine learning flowchart from the scikit-learn documentation](images/ml_map.png)

### Notes on individual models:

#### Decision Trees
* Can be used for outlier detection
* Easy to explain
* Random forests scale well if you want to update the model with new data
* Can be slow to predict (if the tree has lots of branches)
* Work well for high-dimensional feature spaces
* Gradient boosting and random forests perform well with large datasets 

#### Support Vector Machines
* Good with limited data
* Can be used for outlier detection 
* Better performance with outliers compared to linear regression
* Cannot stream data in
* Fast when predicting
* Good with non-linear problems

#### Naive Bayes
* Easy to explain
* Works well if you have small amounts of data
* Usually computationally cheap, both to train and to predict
* Works under the assumption that your features are independent given the class (this is where the "naive" part comes from)
* Bayesian methods are an important concept in generative models

#### Neural Network
* Good for non-linear classification/regression problems
* Slow to train
* Fast to run
* Forms the basis of deep, hierarchical modeling. 

#### K Nearest Neighbors
* Simple to explain
* Slow and memory inefficient
* Not great for high-dimensional feature spaces (distances become less informative as the dimension grows, and the computation gets expensive)

#### Linear Regression (Ridge or Lasso) and Logistic Regression
* Good for interpreting the relationship between independent and dependent variables
* Can assess feature importance with ridge and lasso (lasso can shrink coefficients to exactly zero)
* Make an assumption about the distribution of the noise around your signal

## Overfitting

After you have selected an ML algorithm, you need to ensure that your model generalizes (that is, that you are not overfitting the model to your data). Overfitting is when the model captures the noise rather than the underlying signal in the data. It occurs when the model is too complex: it is trained with too many features relative to the number of observations, or it has an involved architecture. To prevent overfitting:

1. Break your data into __training__ and __test sets__ and perform __(cross-)validation__ to find the best __hyperparameters__ (that dictate your model's architecture). 
1. Cut down on the number of features by performing a __feature importance__ analysis. You can do this with
    1. Optimizing $\alpha$ in the ridge or lasso regression algorithm
    1. Clustering
    1. PCA
    1. Decision trees
    1. Dropping or shuffling columns

_After you cross-validate your model, remember to retrain it on the entire dataset!_ 
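
The workflow above can be sketched with scikit-learn's `GridSearchCV`. The synthetic dataset, the grid of $\alpha$ values, and the choice of lasso are assumptions made for illustration. Any estimator and hyperparameter grid could take their place.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data: 30 features, only 5 of which carry signal (an assumption)
X, y = make_regression(n_samples=300, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)

# Hold out a test set that the hyperparameter search never sees
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# 5-fold cross-validation over the regularization strength alpha
search = GridSearchCV(Lasso(max_iter=10000),
                      param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
                      cv=5)
search.fit(X_train, y_train)

best_alpha = search.best_params_["alpha"]
test_r2 = search.score(X_test, y_test)  # R^2 on the untouched test set
n_kept = int((search.best_estimator_.coef_ != 0).sum())
print("best alpha:", best_alpha, "| test R^2:", test_r2,
      "| non-zero coefficients:", n_kept)

# Once the hyperparameters are chosen, retrain on the entire dataset
final_model = Lasso(alpha=best_alpha, max_iter=10000).fit(X, y)
```

Counting the non-zero lasso coefficients gives a rough view of which features the model kept, which ties the hyperparameter search back to feature importance.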

## Scaling, Normalization, and Standardization

It is important to scale your features when you are
1. Working with algorithms that use Euclidean distance measures (ex: K-Nearest Neighbor and K-Means)
1. Performing dimension reduction techniques like PCA, where you want to find the direction of maximum variance in your dataset. Features on larger scales will have a greater influence on the principal components. 

Ways you can scale your features include:
1. Subtracting out the mean and dividing by the standard deviation.
1. Rescaling to a fixed range (e.g. $[0, 1]$).
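
Both scaling approaches are available in scikit-learn as `StandardScaler` and `MinMaxScaler`. The sketch below uses two synthetic features on very different scales (an assumption for illustration) and shows how scaling changes what PCA reports as the leading component.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)

# Two independent features on very different scales
X = np.column_stack([rng.normal(0, 1, 500),
                     rng.normal(50, 1000, 500)])

# Subtract the mean and divide by the standard deviation
X_std = StandardScaler().fit_transform(X)

# Rescale each feature to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Share of variance captured by the first principal component
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_[0]
std_ratio = PCA(n_components=2).fit(X_std).explained_variance_ratio_[0]
print("first component, raw features:         ", raw_ratio)
print("first component, standardized features:", std_ratio)
```

On the raw data, the first component is almost entirely the large-scale feature. After standardization, the two features contribute on comparable terms.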

## Ensemble methods
Combining models can improve the accuracy of your predictions. For decision trees, remember that progressing from decision trees -> random forests -> gradient-boosting trees
1. increases accuracy
1. decreases explicability
1. increases computation time/memory footprint (boosting stages are trained sequentially, so they can't be parallelized the way the independent trees of a random forest can)

### Bagging
Random forests demonstrate the idea of bootstrap aggregation, or bagging. Each individual model has an equal vote in the ensemble, and you have the freedom to skew the weak learners towards higher variance, knowing that the averaging will (ideally) wash out the randomness and prevent overfitting.
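
As a rough illustration, the sketch below compares a single fully grown decision tree (a high-variance learner) with a random forest that averages many such trees. The noisy two-moons dataset and the number of trees are assumptions chosen for the example. The exact scores will vary with the data.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data, where an unpruned tree is prone to overfitting
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)

# One fully grown tree
tree_score = cross_val_score(
    DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()

# Many trees, each fit on a bootstrap sample, with equal votes
forest_score = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=0), X, y, cv=5).mean()

print("single tree   CV accuracy:", tree_score)
print("random forest CV accuracy:", forest_score)
```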

### Boosting
Gradient boosting trees demonstrate the power of training on the residual from your own predictions in order to achieve very good prediction metrics. Other boosting algorithms offer similar benefits, as well as similar drawbacks. Beware of overfitting here.
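
The idea of training on your own residuals can be written out by hand for squared-error loss. This is a simplified sketch, not how any particular library implements boosting. The synthetic data, tree depth, learning rate, and number of rounds are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic one-dimensional regression problem (an assumption)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant model
trees = []
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(100):
    residual = y - prediction  # what the current ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    prediction += learning_rate * tree.predict(X)  # small corrective step
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print("training MSE before boosting:", mse_history[0])
print("training MSE after boosting: ", mse_history[-1])
```

The training error keeps falling as rounds are added, which is exactly why overfitting is a risk. In practice you would track the error on a validation set and stop when it no longer improves.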

### Blending
The FeatureUnion full_model you implemented in ml.py is an example of combining the predictions of many (usually simple) models and passing them as features to a final regressor or classifier. Using a linear model is equivalent to taking a weighted average of the contributors, whereas a more complex final estimator is capable of combining the component models nonlinearly. In general, the explicability of these techniques is slightly better than a "black box" algorithm like a neural network.

You might notice that in e.g. Kaggle competitions, many of the winning entries are based on ensemble models. Ensembles tend to outperform individual models, but require careful tuning of each of the components, as well as more computational power.
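
The `full_model` from `ml.py` is not reproduced here. As a stand-in, the sketch below uses scikit-learn's `StackingRegressor`, which follows the same pattern of feeding the predictions of several simple models into a final estimator. The component models and the synthetic data are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=10, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

names = ["ridge", "tree", "knn"]
stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge()),
        ("tree", DecisionTreeRegressor(max_depth=5, random_state=0)),
        ("knn", KNeighborsRegressor()),
    ],
    # A linear final estimator amounts to a weighted average of the components
    final_estimator=LinearRegression(),
)
stack.fit(X_train, y_train)

# One weight per component model
blend_weights = dict(zip(names, stack.final_estimator_.coef_))
stack_r2 = stack.score(X_test, y_test)
print("blend weights:", blend_weights)
print("test R^2:", stack_r2)
```

Inspecting the weights of a linear final estimator is one way to retain some explicability: it shows how much each component contributes to the blend.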

This image ([source code](http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html)) shows how various scikit-learn classifiers fit their models for a few different configurations of training data.
![Comparison of various scikit-learn classifiers from the official documentation](images/plot_classifier_comparison.png)

## Modeling for purposes other than predictive power
You might think after learning about complex classification models: "why would anyone ever use plain old logistic regression?" It's important to remember that the probabilistic output of logistic regression is valuable in certain circumstances. In general, consider that beyond just comparing error metrics with other algorithms, you may sometimes want to compare output with entirely different methods of analysis - for example, if you're working with an actuary.

Finally, remember that there is often value in understanding your features regardless of their predictive power. Certain algorithms will reveal more about the input features than others.

**Spoilers**

Training comparison
1. SVMs require solving the dual Lagrangian (quadratic optimization) as opposed to the primal Lagrangian
1. Multiclass classification for SVMs requires one-vs-one or one-vs-all; either is more time-consuming than a NN
1. NN training is an embarrassingly parallel problem. Parallelizing SVM training is not trivial (although possible)

Prediction comparison
1. A linear SVM requires only a dot product calculation to check which side of the decision boundary the data point is on. A NN requires propagating the data through the network (multiple matrix multiplications). For kernel SVMs, the prediction cost scales with the number of support vectors.
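
The dot-product claim can be checked directly against scikit-learn's `LinearSVC`. The sketch below assumes a synthetic binary dataset and also fits an RBF-kernel `SVC` to show how many support vectors a kernel prediction would have to visit.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Linear SVM: prediction is the sign of w . x + b
linear_svm = LinearSVC(max_iter=10000).fit(X, y)
manual_scores = X @ linear_svm.coef_.ravel() + linear_svm.intercept_[0]
manual_labels = (manual_scores > 0).astype(int)  # labels here are 0 and 1

print("matches decision_function:",
      np.allclose(manual_scores, linear_svm.decision_function(X)))
print("matches predict:", (manual_labels == linear_svm.predict(X)).all())

# Kernel SVM: each prediction evaluates the kernel against every support vector
rbf_svm = SVC(kernel="rbf").fit(X, y)
n_support = rbf_svm.support_vectors_.shape[0]
print("support vectors kept by the RBF SVM:", n_support, "of", len(X))
```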

*Copyright &copy; 2015 The Data Incubator.  All rights reserved.*