# Most common ML applications in production today

1. Recommenders
2. Fraud detection
3. Click prediction
4. Forecasting
5. Churn prediction
6. Lead scoring

From Carlos Guestrin's [Data Science Summit 2016 Keynote](https://www.youtube.com/watch?v=wLXEJkiTsLc&index=51&list=PLykRMO7ZuHwONAMHcteqniITxlLaZpFoy). Guestrin's point was that you only need to prepare your data sets (customer profiles, product details, activity data) once to be able to build all six of the above applications.

## Feature engineering 

This probably doesn't belong here, but it's worth noting. Think in terms of converting activity features into event counts:

<table style="display:inline-block;text-align:left;">
    <tr>
        <td>Bought lots of baby products</td>
        <td>count # activity==buy &amp;&amp; category==baby</td>
    </tr>
    <tr>
        <td>Bought this item recently &amp; people who buy it buy again quickly</td>
        <td>last bought &lt; 30 days &amp;&amp; count # repeat buy &lt; 30 days</td>
    </tr>
    <tr>
        <td>Clicked on products like this in this section</td>
        <td>count # activity==click &amp;&amp; category==baby &amp;&amp; session==current</td>
    </tr>
</table>
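
A minimal sketch of the "description to count" idea in the table above, using only the standard library. The event log, its field names (`activity`, `category`), and the helper `count_events` are made up for illustration; they are not from the talk.

```python
from collections import Counter

# Hypothetical activity log; field names are illustrative only.
events = [
    {"user_id": 1, "activity": "buy", "category": "baby"},
    {"user_id": 1, "activity": "buy", "category": "baby"},
    {"user_id": 1, "activity": "click", "category": "baby"},
    {"user_id": 2, "activity": "buy", "category": "toys"},
    {"user_id": 2, "activity": "buy", "category": "baby"},
]

def count_events(events, **conditions):
    """Count events per user where every named field equals the given value."""
    counts = Counter()
    for event in events:
        if all(event.get(key) == value for key, value in conditions.items()):
            counts[event["user_id"]] += 1
    return counts

# "Bought lots of baby products" becomes a per-user count.
baby_buys = count_events(events, activity="buy", category="baby")
print(baby_buys)  # Counter({1: 2, 2: 1})
```

Each row of the table is then just a different set of conditions passed to the same counting function.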

**Example:**

Initial data (raw activity data):

<table style="display:inline-block;text-align:left;">
  <tr>
    <td>user_id</td>
    <td>timestamp</td>
    <td>action</td>
    <td>item</td>
    <td>price</td>
  </tr>
  <tr>
    <td>536365</td>
    <td>2016-05-19</td>
    <td>viewed</td>
    <td>Panda</td>
    <td>20.99</td>
  </tr>
  <tr>
    <td>536365</td>
    <td>2016-05-26</td>
    <td>purchased</td>
    <td>Elephant</td>
    <td>35.99</td>
  </tr>
</table>

Post-transformation (training features: counts):

<table style="display:inline-block;text-align:left;">
  <tr>
    <td>user_id</td>
    <td>days since most recent event</td>
    <td># events in last 30 days</td>
    <td># events in last 60 days</td>
    <td># events in last 90 days</td>
  </tr>
  <tr>
    <td>536365</td>
    <td>32</td>
    <td>0</td>
    <td>64</td>
    <td>92</td>
  </tr>
</table>

The idea is to eventually create reusable feature engineering pipelines.
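
One way such a reusable transformation might look, as a rough sketch. It takes the two raw rows from the example (user and date only) and produces recency and windowed counts. The function name `count_features`, the output column names, and the reference date `today` are assumptions; `today` was picked so the most recent event is 32 days old.

```python
from datetime import date

def count_features(rows, today, windows=(30, 60, 90)):
    """rows: iterable of (user_id, event_date). Returns {user_id: feature dict}."""
    ages_by_user = {}
    for user_id, day in rows:
        ages_by_user.setdefault(user_id, []).append((today - day).days)

    features = {}
    for user_id, ages in ages_by_user.items():
        row = {"days_since_last_event": min(ages)}
        for window in windows:
            # Number of events that happened within the last `window` days.
            row[f"events_last_{window}d"] = sum(1 for age in ages if age <= window)
        features[user_id] = row
    return features

rows = [(536365, date(2016, 5, 19)), (536365, date(2016, 5, 26))]
features = count_features(rows, today=date(2016, 6, 27))
print(features[536365])
# {'days_since_last_event': 32, 'events_last_30d': 0,
#  'events_last_60d': 2, 'events_last_90d': 2}
```

The counts here only reflect the two sample rows, so they will not match a table built from a user's full history.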

Deciding what activities to count can be done manually... Or automatically.

**Input counts > Learn boosted trees from counts > Use decision paths in trees to define important non-linear count features**

(This is a common technique: the leaf that each tree assigns a sample to becomes a categorical feature for a downstream model.)
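
A sketch of the last step only (decision paths to features). It assumes the boosted trees have already been learned; the two trees below are hand-written stand-ins, and the feature names `buys_30d` and `clicks_30d` are hypothetical.

```python
# Hand-written stand-ins for learned trees. In practice these would come
# from a fitted boosted-tree model. Each internal node tests one count.
TREES = [
    {"feature": "buys_30d", "threshold": 2,
     "left": {"leaf": 0},
     "right": {"feature": "clicks_30d", "threshold": 5,
               "left": {"leaf": 1}, "right": {"leaf": 2}}},
    {"feature": "clicks_30d", "threshold": 10,
     "left": {"leaf": 0}, "right": {"leaf": 1}},
]
N_LEAVES = [3, 2]  # number of leaves in each tree above

def leaf_index(tree, x):
    """Follow the decision path for sample x and return the leaf it ends in."""
    node = tree
    while "leaf" not in node:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]

def path_features(x):
    """One-hot encode the leaf reached in each tree and concatenate."""
    encoded = []
    for tree, n_leaves in zip(TREES, N_LEAVES):
        one_hot = [0] * n_leaves
        one_hot[leaf_index(tree, x)] = 1
        encoded.extend(one_hot)
    return encoded

x = {"buys_30d": 4, "clicks_30d": 3}
print(path_features(x))  # [0, 1, 0, 1, 0]
```

Each leaf corresponds to a conjunction of threshold tests on the counts, which is what makes the resulting features non-linear.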

# Choosing a model

**Do you have an output?**

<table style="display:inline-block;text-align:left;">
  <tr>
    <td>No (Unsupervised)</td>
    <td>Yes (Supervised)</td>
  </tr>
  <tr>
    <td>PCA</td>
    <td>Anything else</td>
  </tr>
</table>

**Do you have time series or text data?**

<table style="display:inline-block;text-align:left;">
  <tr>
    <td>Time series</td>
    <td>Text</td>
  </tr>
  <tr>
    <td>ARIMA (of which ARMA, AR, and MA are all subtypes)</td>
    <td>Bag-of-Words (better for short texts)<br>
    TF-IDF (better for long texts)</td>
  </tr>
</table>
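
For the text column, a small sketch of the difference between bag-of-words and TF-IDF. The two documents are made up, and the weighting shown (term frequency times `log(N / df)`, no smoothing) is just one common variant; libraries often use slightly different formulas.

```python
import math
from collections import Counter

docs = ["the panda ate", "the elephant ate the panda"]
tokenized = [doc.split() for doc in docs]

# Bag-of-words: raw term counts per document.
bow = [Counter(tokens) for tokens in tokenized]

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))
n_docs = len(docs)

def tf_idf(counts):
    """Term frequency times log(N / df), an unsmoothed TF-IDF variant."""
    total = sum(counts.values())
    return {term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()}

weights = [tf_idf(counts) for counts in bow]
print(bow[1]["the"])       # 2
print(weights[1]["the"])   # 0.0, because "the" appears in every document
```

Terms that appear in every document get zero weight, while a term that is specific to one document ("elephant") keeps a positive weight.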

**Is your output quantitative or qualitative?**

<table style="display:inline-block;text-align:left;">
  <tr>
    <td>Quantitative (Regression)</td>
    <td>Qualitative (Classification)</td>
  </tr>
  <tr>
    <td>Linear regression (which can be improved by ridge and lasso)</td>
    <td>Logistic regression (which can be improved by ridge and lasso)<br>
    Naive Bayes (Gaussian, Bernoulli, and Multinomial)</td>
  </tr>
 </table>
 
Either: 

* KNN
* Simple decision tree
* Random forest (which is a type of bagging)
* Boosting (which we only know how to apply to decision trees)

Remember to use the correct form of each (regressor or classifier). Also remember that you can convert your quantitative outcomes into qualitative ones, for example by binning them.
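
A sketch of turning a quantitative outcome into a qualitative one by binning. The cutoffs and labels are arbitrary choices for illustration, as is the helper name `to_category`.

```python
def to_category(value, cutoffs, labels):
    """Return the label of the first bin whose upper cutoff is >= value.

    `labels` has one more entry than `cutoffs`; the last label catches
    everything above the highest cutoff.
    """
    for cutoff, label in zip(cutoffs, labels):
        if value <= cutoff:
            return label
    return labels[-1]

prices = [20.99, 35.99, 120.00]
price_bands = [to_category(p, cutoffs=[25, 100], labels=["low", "mid", "high"])
               for p in prices]
print(price_bands)  # ['low', 'mid', 'high']
```

After this step a regression problem (predict the price) becomes a classification problem (predict the price band).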

**Is your goal interpretation or prediction?**

<table style="display:inline-block;text-align:left;">
  <tr>
    <td>&nbsp;</td>
    <td>Regression</td>
    <td>Classification</td>
  </tr>
  <tr>
    <td>Interpretation</td>
    <td>Linear regression (lasso is particularly helpful)<br>
    Simple decision tree</td>
    <td>Logistic regression (lasso is particularly helpful)<br>
    Simple decision tree</td>
  </tr>
  <tr>
    <td>Prediction</td>
    <td>KNN (especially if we have fewer than 5 features)<br>
    Random forest</td>
    <td>KNN (again, if we have fewer than 5 features)<br>
    Random forest<br>
    Naive Bayes (very good for text data)</td>
  </tr>
</table>

# Linear and logistic regression

<table style="display:inline-block;text-align:left;">
    <tr><td>Interpretable?</td><td>Yes</td></tr>
    <tr><td>Predictive?</td><td>No</td></tr>
    <tr><td>Easy to compute?</td><td>Yes</td></tr>
    <tr><td>Need to standardize data?</td><td>No</td></tr>
    <tr><td>Need to remove outliers?</td><td>Yes</td></tr>
</table>

## Other advantages

Can be run on sparse data.

## Other disadvantages

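Assumes a linear association between the features and the outcome (the log-odds of the outcome, for logistic regression). Linear regression also assumes normally distributed error terms.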

## Ridge regression

Computationally easy. Shrinks coefficients toward zero, but never exactly to zero.

## Lasso regression

Computationally harder (no closed-form solution in general). Can shrink coefficients exactly to zero, which effectively selects features.

Inputs for both ridge and lasso need to be standardized, because the penalty treats all coefficients on the same scale. Outputs from both ridge and lasso can be used as inputs for another regression.
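
To make the ridge/lasso difference concrete, here is a sketch for the simplest case: one standardized feature, a centered outcome, and no intercept. In that special case both have closed forms. It assumes the ridge objective is `sum((y - b*x)**2) + lam * b**2` and the lasso objective is `0.5 * sum((y - b*x)**2) + lam * abs(b)`; other texts scale the penalty differently. The data are made up.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def ridge_coef(x, y, lam):
    """Single-feature ridge: the penalty inflates the denominator."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)

def lasso_coef(x, y, lam):
    """Single-feature lasso: soft-threshold the correlation term."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    shrunk = max(abs(sxy) - lam, 0.0)
    return (shrunk if sxy >= 0 else -shrunk) / sxx

x = standardize([1.0, 2.0, 3.0, 4.0, 5.0])
y_raw = [2.1, 3.9, 6.2, 7.8, 10.1]
y = [v - mean(y_raw) for v in y_raw]

for lam in (0, 5, 100):
    print(lam, ridge_coef(x, y, lam), lasso_coef(x, y, lam))
```

With a large enough penalty the lasso coefficient is exactly `0.0`, while the ridge coefficient only gets smaller and stays non-zero.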

# KNN

<table style="display:inline-block;text-align:left;">
    <tr><td>Interpretable?</td><td>No</td></tr>
    <tr><td>Predictive?</td><td>Yes</td></tr>
    <tr><td>Easy to compute?</td><td>Yes</td></tr>
    <tr><td>Need to standardize data?</td><td>Yes</td></tr>
    <tr><td>Need to remove outliers?</td><td>???</td></tr>
</table>

## Other advantages

Easily captures non-linearity.

## Other disadvantages

Performs poorly on sparse data with more than 4 features (the curse of dimensionality).
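
A minimal KNN classifier sketch with standardization, using only the standard library (`math.dist` needs Python 3.8+). The toy data (age, income) and labels are invented. Without standardizing, the income column would dominate the Euclidean distance simply because its numbers are larger.

```python
import math
from collections import Counter
from statistics import mean, pstdev

def fit_scaler(rows):
    """Return a function that standardizes a row using the training columns."""
    stats = [(mean(col), pstdev(col)) for col in zip(*rows)]
    def scale(row):
        return [(v - m) / s for v, (m, s) in zip(row, stats)]
    return scale

def knn_predict(train_x, train_y, query, k=3):
    """Majority label among the k training points closest to the query."""
    nearest = sorted(zip(train_x, train_y),
                     key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical training data: [age, income].
train_raw = [[25, 30000], [30, 32000], [28, 31000],
             [50, 90000], [55, 95000], [52, 88000]]
train_y = ["stay", "stay", "stay", "churn", "churn", "churn"]

scale = fit_scaler(train_raw)
train_x = [scale(row) for row in train_raw]

print(knn_predict(train_x, train_y, scale([27, 30500])))  # stay
```

Note that the query is scaled with the training statistics, not its own.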

# Simple decision tree

<table style="display:inline-block;text-align:left;">
    <tr><td>Interpretable?</td><td>Yes</td></tr>
    <tr><td>Predictive?</td><td>No</td></tr>
    <tr><td>Easy to compute?</td><td>Yes</td></tr>
    <tr><td>Need to standardize data?</td><td>No</td></tr>
    <tr><td>Need to remove outliers?</td><td>???</td></tr>
</table>

# Random forest

<table style="display:inline-block;text-align:left;">
    <tr><td>Interpretable?</td><td>No</td></tr>
    <tr><td>Predictive?</td><td>Yes</td></tr>
    <tr><td>Easy to compute?</td><td>-ish</td></tr>
    <tr><td>Need to standardize data?</td><td>No</td></tr>
    <tr><td>Need to remove outliers?</td><td>???</td></tr>
</table>

# Boosting

<table style="display:inline-block;text-align:left;">
    <tr><td>Interpretable?</td><td>No</td></tr>
    <tr><td>Predictive?</td><td>Yes</td></tr>
    <tr><td>Easy to compute?</td><td>No</td></tr>
    <tr><td>Need to standardize data?</td><td>No</td></tr>
    <tr><td>Need to remove outliers?</td><td>???</td></tr>
</table>

# ARIMA

<table style="display:inline-block;text-align:left;">
    <tr><td>Interpretable?</td><td>If small p and q</td></tr>
    <tr><td>Predictive?</td><td>Yes</td></tr>
    <tr><td>Easy to compute?</td><td>No</td></tr>
    <tr><td>Need to standardize data?</td><td>???</td></tr>
    <tr><td>Need to remove outliers?</td><td>???</td></tr>
</table>

## AR

Good for smooth patterns.

## MA

Good for handling shocks in the system.

## ARMA

Combines AR and MA.

## ARIMA

Combines AR and MA, and uses differencing (the "I") to remove trends such as a linear trend.
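
A sketch of the differencing step that separates ARIMA from ARMA. The two series are synthetic, and `difference` is an illustrative helper rather than a library function.

```python
def difference(series, d=1):
    """Apply first-order differencing d times (the 'I' in ARIMA)."""
    for _ in range(d):
        series = [later - earlier for earlier, later in zip(series, series[1:])]
    return series

linear = [3 + 2 * t for t in range(6)]   # 3, 5, 7, 9, 11, 13
quadratic = [t * t for t in range(6)]    # 0, 1, 4, 9, 16, 25

print(difference(linear))          # [2, 2, 2, 2, 2]
print(difference(quadratic, d=2))  # [2, 2, 2, 2]
```

One round of differencing flattens a linear trend; a quadratic trend needs two. The ARMA part is then fit to the differenced series.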

# Naive Bayes

<table style="display:inline-block;text-align:left;">
    <tr><td>Interpretable?</td><td>No</td></tr>
    <tr><td>Predictive?</td><td>Yes</td></tr>
    <tr><td>Easy to compute?</td><td>Yes</td></tr>
    <tr><td>Need to standardize data?</td><td>???</td></tr>
    <tr><td>Need to remove outliers?</td><td>???</td></tr>
</table>

## Other advantages

Can be updated quickly.

## Other disadvantages

Assumes features are conditionally independent given the class (a strong assumption). Probability predictions are unreliable.
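
Why it can be updated quickly: a multinomial Naive Bayes model is essentially a set of counts, so new labeled data is folded in by incrementing them. The sketch below assumes add-one (Laplace) smoothing; the class name `CountingNB` and the toy messages are made up.

```python
import math
from collections import Counter, defaultdict

class CountingNB:
    """Multinomial Naive Bayes stored as raw counts, with add-one smoothing."""

    def __init__(self):
        self.class_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()

    def update(self, words, label):
        """Fold one labeled document into the model; no refit needed."""
        self.class_counts[label] += 1
        self.word_counts[label].update(words)
        self.vocab.update(words)

    def predict(self, words):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            # Log prior plus the sum of log likelihoods, treating each
            # word as independent given the class.
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

nb = CountingNB()
nb.update("cheap pills buy now".split(), "spam")
nb.update("meeting notes attached".split(), "ham")
nb.update("buy cheap now".split(), "spam")

print(nb.predict("cheap pills".split()))    # spam
print(nb.predict("meeting notes".split()))  # ham
```

The per-word product in `predict` is where the independence assumption shows up.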

# Combining models

## Combining regression models

Generalized Additive Models.

## Combining classification models

scikit-learn's `VotingClassifier`.
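
The idea behind hard voting, as a hand-rolled sketch rather than the scikit-learn class itself. The three rule-based "classifiers" and the transaction fields are hypothetical stand-ins for fitted models.

```python
from collections import Counter

def majority_vote(classifiers, x):
    """Hard voting: each classifier casts one vote; the most common label wins."""
    votes = [clf(x) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Three hypothetical rule-based classifiers standing in for fitted models.
def by_amount(t):
    return "fraud" if t["amount"] > 1000 else "ok"

def by_country(t):
    return "fraud" if t["country"] != t["home_country"] else "ok"

def by_hour(t):
    return "fraud" if t["hour"] < 5 else "ok"

classifiers = [by_amount, by_country, by_hour]
transaction = {"amount": 1500, "country": "FR", "home_country": "US", "hour": 14}

print(majority_vote(classifiers, transaction))  # fraud (two votes to one)
```

Using an odd number of voters on a two-class problem avoids ties.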

# Handling computationally expensive algorithms

* [96-nodes, Postgres](https://databeta.wordpress.com/2009/05/14/bigdata-node-density/)
* [snow and Rmpi](https://cran.r-project.org/web/views/HighPerformanceComputing.html)
* Hadoop and MapReduce
* Amazon EC2