# Most common ML applications in production today

1. Recommenders
2. Fraud detection
3. Click prediction
4. Forecasting
5. Churn prediction
6. Lead scoring

From Carlos Guestrin's [Data Science Summit 2016 Keynote](https://www.youtube.com/watch?v=wLXEJkiTsLc&index=51&list=PLykRMO7ZuHwONAMHcteqniITxlLaZpFoy). Guestrin's point was that you only need to prepare your data sets (customer profiles, product details, activity data) once to build all six of the applications above.

## Feature engineering 

This may not strictly belong here, but it is worth writing down. Think in terms of converting activity features into event counts:

* **Bought lots of baby products:** count of events where `activity == buy && category == baby`
* **Bought this item recently, and people who buy it buy again quickly:** `last bought < 30 days`, plus count of repeat buys within 30 days
* **Clicked on products like this in this session:** count of events where `activity == click && category == baby && session == current`
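The bullets above can be sketched in pandas as boolean masks summed per user. Everything concrete here is an assumption for illustration: the `activity` frame, its column names (`user_id`, `activity`, `category`, `session`, `timestamp`), the reference time `now`, the `current_sessions` set, and the sample rows. The second bullet is simplified to "buys in the last 30 days"; the item-level repeat-buy count is left out.

```python
import pandas as pd

# Hypothetical activity log; column names and rows are invented for illustration.
activity = pd.DataFrame({
    'user_id':   [1, 1, 1, 2, 2],
    'activity':  ['buy', 'click', 'buy', 'buy', 'click'],
    'category':  ['baby', 'baby', 'toys', 'baby', 'baby'],
    'session':   ['s1', 's2', 's2', 's3', 's4'],
    'timestamp': pd.to_datetime(['2016-05-01', '2016-05-20', '2016-05-20',
                                 '2016-03-01', '2016-05-25']),
})
now = pd.Timestamp('2016-05-26')   # assumed "current" time
current_sessions = {'s2', 's4'}    # assumed set of sessions that count as current

# One boolean mask per event-count feature.
is_baby_buy = (activity['activity'] == 'buy') & (activity['category'] == 'baby')
is_baby_click_now = ((activity['activity'] == 'click')
                     & (activity['category'] == 'baby')
                     & activity['session'].isin(current_sessions))
is_recent_buy = ((activity['activity'] == 'buy')
                 & (activity['timestamp'] > now - pd.Timedelta(days=30)))

# Summing a boolean column per user counts the matching events.
features = (
    activity.assign(
        baby_buys=is_baby_buy,
        baby_clicks_this_session=is_baby_click_now,
        buys_last_30_days=is_recent_buy,
    )
    .groupby('user_id')[['baby_buys', 'baby_clicks_this_session', 'buys_last_30_days']]
    .sum()
)
print(features)
```

Keeping each condition as a named mask makes it easy to add or drop a count feature without touching the aggregation step.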

**Example:**

Initial data (raw activity data):

    import pandas as pd

    user_id = [536365, 536365]
    raw_activity_data = pd.DataFrame({
        'timestamp': [pd.Timestamp('20160519'), pd.Timestamp('20160526')],
        'action': ['viewed', 'purchased'],
        'item': ['Panda', 'Elephant'],
        'price': [20.99, 35.99]
    }, index=user_id)

    raw_activity_data

| user_id | action | item | price | timestamp |
|---|---|---|---|---|
| 536365 | viewed | Panda | 20.99 | 2016-05-19 |
| 536365 | purchased | Elephant | 35.99 | 2016-05-26 |


Post-transformation (training features as counts):

    user_id = [536365]
    training_features_count = pd.DataFrame({
        'days since recent event': 32,
        '# events in last 30 days': 0,
        '# events in last 60 days': 64,
        '# events in last 90 days': 92
    }, index=user_id)

    training_features_count

| user_id | # events in last 30 days | # events in last 60 days | # events in last 90 days | days since recent event |
|---|---|---|---|---|
| 536365 | 0 | 64 | 92 | 32 |


The idea is to eventually create reusable feature engineering pipelines.
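One way such a reusable step might look is a small function that turns a raw event log into recency and windowed-count features. This is only a sketch: the function name `window_count_features`, the `user_id` and `timestamp` column names, and the reference date `now` are assumptions, and the counts it produces come from the two sample rows only, so they will not match the illustrative numbers in the table above.

```python
import pandas as pd

def window_count_features(events, now, windows=(30, 60, 90)):
    """Turn a raw event log into per-user recency and windowed-count features.

    `events` needs `user_id` and `timestamp` columns (assumed names).
    """
    # Age of every event in whole days, relative to the reference time.
    age_days = (now - events['timestamp']).dt.days
    out = pd.DataFrame({
        'days since recent event': age_days.groupby(events['user_id']).min(),
    })
    for w in windows:
        # Count events that happened within the last `w` days.
        in_window = (age_days >= 0) & (age_days < w)
        out[f'# events in last {w} days'] = in_window.groupby(events['user_id']).sum()
    return out

# Same two events as the raw activity data, with user_id as a column.
events = pd.DataFrame({
    'user_id': [536365, 536365],
    'timestamp': [pd.Timestamp('20160519'), pd.Timestamp('20160526')],
    'action': ['viewed', 'purchased'],
})
# Hypothetical reference date, 32 days after the most recent event.
features = window_count_features(events, now=pd.Timestamp('20160627'))
print(features)
```

Because the windows are a parameter, the same function can be reused across applications that need different look-back periods.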

Deciding which activities to count can be done manually or automatically.

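**Input counts > Learn boosted trees from counts > Use decision paths in trees to define important non-linear count features**

(This is a fairly common technique: each root-to-leaf path in a tree is a learned combination of count thresholds, and the path a sample follows can be fed to a simpler model as a feature.)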
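A minimal sketch of that pipeline with scikit-learn follows. It is an assumption that scikit-learn is the tool of choice here, and the data is synthetic (`make_classification` stands in for a real table of count features). The leaf each sample reaches in each tree identifies its decision path, and one-hot encoding those leaves yields the non-linear features for a linear model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for a table of count features (not real data).
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
# Fit the trees and the linear model on separate halves to limit overfitting.
X_trees, X_lin, y_trees, y_lin = train_test_split(X, y, test_size=0.5, random_state=0)

gbt = GradientBoostingClassifier(n_estimators=20, max_depth=3, random_state=0)
gbt.fit(X_trees, y_trees)

# apply() returns the index of the leaf each sample lands in, per tree.
# For binary classification the last axis has length 1, so drop it.
leaves = gbt.apply(X_lin)[:, :, 0]

# Each (tree, leaf) pair corresponds to one decision path, i.e. one feature.
encoder = OneHotEncoder(handle_unknown='ignore')
path_features = encoder.fit_transform(leaves)

linear = LogisticRegression(max_iter=1000)
linear.fit(path_features, y_lin)
print(leaves.shape, path_features.shape)
```

The tree count and depth are arbitrary choices for the sketch; deeper trees produce longer paths and therefore more specific count combinations.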

# Choosing a model

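1. Do you have an output? (If yes, the problem is supervised; if no, it is unsupervised.)
2. Is your output quantitative or qualitative? (Quantitative means regression; qualitative means classification.)
3. Is your goal interpretation or prediction? (Interpretation favors simpler models; prediction allows more flexible ones.)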
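The three questions can be read as a small decision procedure. The function below is a rough heuristic sketch, not a rule: the name `suggest_model_family`, its parameters, and the example model families it returns are all assumptions added for illustration.

```python
def suggest_model_family(has_output, output_type=None, goal='prediction'):
    """Map the three questions to a broad family of methods (rough heuristic)."""
    # Question 1: no output variable means there is nothing to supervise on.
    if not has_output:
        return 'unsupervised: clustering or dimensionality reduction'
    if output_type not in ('quantitative', 'qualitative'):
        raise ValueError("output_type must be 'quantitative' or 'qualitative'")
    # Question 2: the output type decides regression vs classification.
    task = 'regression' if output_type == 'quantitative' else 'classification'
    # Question 3: the goal decides how flexible the model can be.
    if goal == 'interpretation':
        style = 'simple, interpretable model (linear or logistic regression, small tree)'
    else:
        style = 'flexible model (boosted trees, random forest)'
    return f'supervised {task}: {style}'

print(suggest_model_family(True, 'qualitative', 'interpretation'))
```

For churn prediction, for example, the output is qualitative (churned or not), so the call above would apply when the goal is to explain why customers leave.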

# Handling computationally expensive algorithms

* [96-nodes, Postgres](https://databeta.wordpress.com/2009/05/14/bigdata-node-density/)
* [snow and Rmpi](https://cran.r-project.org/web/views/HighPerformanceComputing.html)
* Hadoop and MapReduce
* Amazon EC2