|<img src="https://patch.com/img/cdn20/users/22965231/20180313/040305/styles/raw/public/processed_images/shutterstock_110928791-1520971084-4672.jpg" width="200" /> | <img src="https://pbs.twimg.com/media/EeBsCTPX0AAICkt.png" width="400" />|
| -- | -- |

# NYC Taxi analysis with Saturn Cloud

The notebooks in this repo showcase an end-to-end data science workflow executed on [Saturn Cloud](https://www.saturncloud.io/).

> Saturn Cloud is a data science and machine learning platform for scalable Python analytics. It automates DevOps and ML infrastructure with cloud-based Jupyter and Dask for big data, while leveraging Docker and Kubernetes so that your work is reproducible, shareable, and ready for production. 

The following tasks are performed:

1. Ingest 1.6 billion CSV records and write to optimized Parquet files
1. Create train/test sets for machine learning tasks
1. Train a number of machine learning models
1. Deploy models via REST API
1. Create dashboard with exploratory analysis and ML model diagnostics

The dashboard features a live-scoring widget using a model hosted with Saturn Cloud.

<img src="img/pipeline.png" width="800">

If you don't already use Saturn Cloud, [see how to get started here](https://www.saturncloud.io/docs/getting-started/).

# Dashboard

The dashboard is hosted using a Saturn Cloud [Deployment](https://www.saturncloud.io/docs/deployments/) and presents analysis of ride volume, fare collections, and a detailed analysis of driver tips. The "ML" tab shows various diagnostics for the machine learning models, including metric summaries and longitudinal analysis for model drift detection. It also features a live-scoring widget backed by a model that is itself hosted with a Saturn Deployment.
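
As a rough illustration of what a longitudinal drift check can look like, the sketch below computes RMSE per week from scored predictions. It is not the dashboard's actual code: the column names (`pickup_datetime`, `actual`, `predicted`), the weekly grouping, and the toy data are all assumptions made for this example.

```python
import numpy as np
import pandas as pd


def rmse_by_period(scored: pd.DataFrame, freq: str = "W") -> pd.Series:
    """RMSE of predictions per time period (column names are hypothetical)."""
    squared_error = (scored["predicted"] - scored["actual"]) ** 2
    periods = scored["pickup_datetime"].dt.to_period(freq)
    # Mean squared error within each period, then the square root
    return np.sqrt(squared_error.groupby(periods).mean())


# Toy data, made up for illustration only
scored = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(
        ["2019-01-01", "2019-01-02", "2019-01-08", "2019-01-09"]
    ),
    "actual": [0.20, 0.10, 0.20, 0.10],
    "predicted": [0.20, 0.10, 0.30, 0.20],
})

weekly_rmse = rmse_by_period(scored)
print(weekly_rmse)
```

An error that rises steadily from one period to the next would be one signal that the model may need retraining.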

![](img/dashboard1.png)

![](img/dashboard2.png)

![](img/dashboard3.png)


In [23]:
import pandas as pd
import numpy as np
import dask.dataframe as dd
import holoviews as hv
import hvplot.pandas

In [232]:
%%html
<style>table {display: inline-block}</style>

# Machine learning model training

Various machine learning models were trained using Saturn Cloud's [Jupyter workspaces](https://www.saturncloud.io/docs/getting-started/spinning/jupyter/) along with [one-click Dask and RAPIDS (GPU) clusters](https://www.saturncloud.io/docs/getting-started/spinning/dask/). Models were also trained with Spark outside Saturn Cloud for performance comparisons.

| Tool | Hardware | 
| -- | -- |
| Scikit-learn | Single node |
| Dask | CPU cluster |
| Spark | CPU cluster* |
| RAPIDS | GPU cluster |

Models:
- Elastic net regression + hyperparameter tuning
- XGBoost regression
- Random forest classification

For various reasons, not every model was run with every library. This matrix records which combinations were run:

|                | Scikit-learn | Dask | Spark | RAPIDS |
|----------------|--------------|------|-------|--------|
| Elastic net    | x            | x    | x     |        |
| XGBoost        | x            | x    |       |        |
| Random forest  | x            |      | x     | x      |

_\*Spark execution is not supported in Saturn Cloud_. These models were trained on a Hadoop cluster for runtime comparison purposes.
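
The matrix above can also be written as a small data structure, which makes the combinations that were not run easy to list. This is only an illustrative sketch: the names `TOOLS`, `RUN_MATRIX`, and `not_run` are invented here and do not appear in the notebooks.

```python
# Which model/tool combinations were run, transcribed from the matrix above
TOOLS = ["Scikit-learn", "Dask", "Spark", "RAPIDS"]
RUN_MATRIX = {
    "Elastic net": {"Scikit-learn", "Dask", "Spark"},
    "XGBoost": {"Scikit-learn", "Dask"},
    "Random forest": {"Scikit-learn", "Spark", "RAPIDS"},
}


def not_run(matrix, tools):
    """Return (model, tool) pairs that have no 'x' in the matrix."""
    return [
        (model, tool)
        for model, ran in matrix.items()
        for tool in tools
        if tool not in ran
    ]


missing = not_run(RUN_MATRIX, TOOLS)
for model, tool in missing:
    print(f"{tool} {model}: not run")
```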

In [174]:
metrics = dd.read_csv('s3://saturn-titan/nyc-taxi/ml_results/metrics/*.csv').compute()
metrics = metrics[(metrics.ml_task == 'high_tip') | (metrics.model != 'random_forest')]  # keep random forest rows only for the 'high_tip' classification task
metrics['fit_minutes'] = metrics.fit_seconds / 60

# compare to scikit time
scikit = metrics[metrics.tool == 'scikit'].drop(['tool', 'value'], axis=1)
metrics = pd.merge(metrics, scikit, on=['ml_task', 'model', 'metric'], suffixes=['', '_scikit'], how='left')
metrics['scikit_improvement'] = np.where(metrics.fit_seconds <= metrics.fit_seconds_scikit,
                                           metrics.fit_seconds_scikit / metrics.fit_seconds,
                                           -(metrics.fit_seconds / metrics.fit_seconds_scikit))

# compare to spark time
spark = metrics[metrics.tool == 'spark'].drop(['tool', 'value', 'scikit_improvement'], axis=1)
metrics = pd.merge(metrics, spark, on=['ml_task', 'model', 'metric'], suffixes=['', '_spark'], how='left')
metrics['spark_improvement'] = np.where(metrics.fit_seconds <= metrics.fit_seconds_spark,
                                           metrics.fit_seconds_spark / metrics.fit_seconds,
                                           -(metrics.fit_seconds / metrics.fit_seconds_spark))
# format: replace underscores with spaces and capitalize every string value
metrics = metrics.applymap(lambda x: x.replace('_', ' ').capitalize() if isinstance(x, str) else x)

# ordering
metrics['tool'] = metrics['tool'].astype('category').cat.reorder_categories(['Scikit', 'Spark', 'Dask', 'Rapids'])
metrics = metrics.sort_values(['ml_task', 'model', 'tool'])

In [175]:
def runtime_plot(model, title):
    return (
        metrics[metrics.model == model].hvplot.barh(
            x='tool', 
            y='fit_minutes',
            color='tool',
            cmap=['gray', '#fda061', 'lightblue'],
            title=title,
            ylabel='Train time in minutes (lower is better)',
            xlabel='',
        ).opts(fontsize={'labels': '120%', 'ticks': '130%', 'title': '200%'})
    )

## Runtimes by model

All models achieved comparable regression/classification performance. We explore the runtimes of each model here.
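
To see how close the metric values are, the sketch below reports the range of each metric across tools. The numbers are copied from the "Full results" table in this notebook. The helper name `metric_range` is made up for this example.

```python
# Metric values copied from the "Full results" table in this notebook
results = {
    ("Random forest", "Auc"): {"Scikit": 0.558524, "Spark": 0.536425, "Rapids": 0.504997},
    ("Elastic net", "Rmse"): {"Scikit": 0.207689, "Spark": 0.207875, "Dask": 0.207601},
    ("Xgboost", "Rmse"): {"Scikit": 0.206786, "Dask": 0.206804},
}


def metric_range(values_by_tool):
    """Difference between the largest and smallest metric value across tools."""
    values = list(values_by_tool.values())
    return max(values) - min(values)


ranges = {key: round(metric_range(vals), 6) for key, vals in results.items()}
for (model, metric), spread in ranges.items():
    print(f"{model} ({metric}): range across tools = {spread}")
```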

In [176]:
runtime_plot('Elastic net', 'Elastic Net + Hyperparameter Tuning')

In [177]:
runtime_plot('Xgboost', 'XGBoost Regressor')

**NOTE**: PySpark does not support XGBoost

In [178]:
runtime_plot('Random forest', 'Random Forest Classifier')

## Compare to scikit-learn

Each number is the factor by which fit time changes relative to scikit-learn. A positive value of x means the model trained x times faster than scikit-learn; a negative value of -x means it trained x times slower.
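
The sign convention can be sketched as a small helper that mirrors the `np.where` expression used to build `scikit_improvement` earlier in this notebook. The function name `signed_speedup` is invented for this example. The fit times are taken from the full results table.

```python
def signed_speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Fold change in fit time relative to a baseline.

    Positive: the candidate is that many times faster than the baseline.
    Negative: the candidate is that many times slower.
    """
    if candidate_seconds <= baseline_seconds:
        return baseline_seconds / candidate_seconds
    return -(candidate_seconds / baseline_seconds)


# Random forest fit times in seconds, from the full results table
scikit_rf, spark_rf, rapids_rf = 574.561481, 1117.825969, 1.436731

rapids_vs_scikit = signed_speedup(scikit_rf, rapids_rf)  # about 400
spark_vs_scikit = signed_speedup(scikit_rf, spark_rf)    # about -1.95
print(round(rapids_vs_scikit), round(spark_vs_scikit, 2))
```

With this definition, equal runtimes give 1.0 and no value ever falls strictly between -1 and 1, which is worth keeping in mind when reading the bar charts.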

Highlights:

> Dask hyperparameter tuning (elastic net) is **113x faster** than scikit-learn

> RAPIDS random forest is **400x faster** than scikit-learn

> Spark random forest is **2x slower** than scikit-learn

In [206]:
by_scikit = metrics[metrics.tool != 'Scikit'].copy()
by_scikit['tool'] = by_scikit.tool.cat.remove_unused_categories()

by_scikit.hvplot.barh(
    x='model',
    by='tool',
#     color='tool',
    cmap=['#fda061', 'lightgreen', 'lightblue'],
    height=400,
    y='scikit_improvement',
    xlabel='',
    ylabel='Speed vs. scikit-learn (higher is better)',
    title='Speed increase versus scikit-learn'
).opts(fontsize={'labels': '120%', 'ticks': '130%', 'title': '200%'})

**Not run**: RAPIDS and Spark XGBoost, RAPIDS elastic net, Dask random forest

In [224]:
np.round(by_scikit[['tool', 'model', 'scikit_improvement']])

|   | tool | model | scikit_improvement |
|---|------|-------|--------------------|
| 2 | Spark | Random forest | -2.0 |
| 0 | Rapids | Random forest | 400.0 |
| 7 | Spark | Elastic net | 3.0 |
| 3 | Dask | Elastic net | 113.0 |
| 4 | Dask | Xgboost | 17.0 |


## Compare to Spark

Each number is the factor by which fit time changes relative to Spark. A positive value of x means the model trained x times faster than Spark; a negative value of -x means it trained x times slower.

> RAPIDS random forest is **778 times faster** than Spark

> Dask hyperparameter tuning (elastic net) is **39 times faster** than Spark

In [227]:
by_spark = metrics[(metrics.tool != 'Spark') & (metrics.spark_improvement.notnull())].copy()
by_spark['tool'] = by_spark.tool.cat.remove_unused_categories()

by_spark.hvplot.barh(
    x='model',
    by='tool',
    cmap=['gray', 'lightgreen', 'lightblue'],
    height=400,
    y='spark_improvement',
    xlabel='',
    ylabel='Speed vs. Spark (higher is better)',
    title='Speed increase versus Spark'
).opts(fontsize={'labels': '120%', 'ticks': '130%', 'title': '200%'})

**Not run**: RAPIDS elastic net, Dask random forest

In [228]:
np.round(by_spark[['tool', 'model', 'spark_improvement']])

|   | tool | model | spark_improvement |
|---|------|-------|-------------------|
| 1 | Scikit | Random forest | 2.0 |
| 0 | Rapids | Random forest | 778.0 |
| 5 | Scikit | Elastic net | -3.0 |
| 3 | Dask | Elastic net | 39.0 |


## Full results

In [135]:
metrics

|   | ml_task | tool | model | metric | value | fit_seconds | fit_minutes | fit_seconds_scikit | fit_minutes_scikit | scikit_improvement | fit_seconds_spark | fit_minutes_spark | fit_seconds_scikit_spark | fit_minutes_scikit_spark | spark_improvement |
|---|---------|------|-------|--------|-------|-------------|-------------|--------------------|--------------------|--------------------|-------------------|-------------------|--------------------------|--------------------------|-------------------|
| 1 | High tip | Scikit | Random forest | Auc | 0.558524 | 574.561481 | 9.576025 | 574.561481 | 9.576025 | 1.0 | 1117.825969 | 18.630433 | 574.561481 | 9.576025 | 1.945529 |
| 2 | High tip | Spark | Random forest | Auc | 0.536425 | 1117.825969 | 18.630433 | 574.561481 | 9.576025 | -1.945529 | 1117.825969 | 18.630433 | 574.561481 | 9.576025 | 1.0 |
| 0 | High tip | Rapids | Random forest | Auc | 0.504997 | 1.436731 | 0.023946 | 574.561481 | 9.576025 | 399.90899 | 1117.825969 | 18.630433 | 574.561481 | 9.576025 | 778.034498 |
| 5 | Tip | Scikit | Elastic net | Rmse | 0.207689 | 8226.132331 | 137.102206 | 8226.132331 | 137.102206 | 1.0 | 2865.376522 | 47.756275 | 8226.132331 | 137.102206 | -2.870873 |
| 7 | Tip | Spark | Elastic net | Rmse | 0.207875 | 2865.376522 | 47.756275 | 8226.132331 | 137.102206 | 2.870873 | 2865.376522 | 47.756275 | 8226.132331 | 137.102206 | 1.0 |
| 3 | Tip | Dask | Elastic net | Rmse | 0.207601 | 73.07067 | 1.217845 | 8226.132331 | 137.102206 | 112.57776 | 2865.376522 | 47.756275 | 8226.132331 | 137.102206 | 39.213771 |
| 6 | Tip | Scikit | Xgboost | Rmse | 0.206786 | 9851.059192 | 164.18432 | 9851.059192 | 164.18432 | 1.0 |  |  |  |  |  |
| 4 | Tip | Dask | Xgboost | Rmse | 0.206804 | 587.82643 | 9.797107 | 9851.059192 | 164.18432 | 16.758449 |  |  |  |  |  |
