# Rayflow prototyping

Explore [MLflow recipes][1] with [Ray][2] using a large dataset.

### Task

Build an MLflow Recipe that ingests, splits and transforms data, then trains, tunes and evaluates a regression model to predict the **fare_amount** of a _yellow taxi_ trip in NYC.

### Constraints

* Focus is on integrating Ray and MLflow Recipes
* Model predicts *fare_amount* given passenger_count, trip_distance, rate_code_id and payment_type only
* Feature engineering is kept to a minimum

### Results

* [x] Ingest data
* [x] Split data (without using Ray)
* [x] Transform data
* [x] Train model
* ~~[ ] Tune model~~
* [ ] Evaluate model
* [ ] Register model
* [ ] Generate predictions

### General observations

* Ray is a super powerful framework and I only scratched the surface of its functionality
* MLflow recipes are rather limited
    - They force a strict file layout on the project
    - There is no way to add custom steps
    - It's unclear how to pass arguments to custom functions
* Documentation of MLflow recipes is awful
    - One essentially has to read the source code to understand what's going on
    - It's unclear how data flows between steps: Is it passed as in-memory DataFrames? Is it automatically loaded from disk?
* MLflow Recipe examples could be more detailed
* Data dictionary of NYC Taxi dataset is outdated

### Key insights

* Ray is great; use it with a general-purpose orchestrator


[1]: https://mlflow.org/docs/latest/recipes.html
[2]: https://docs.ray.io/en/latest/

## Setup

In [None]:
from mlflow.recipes import Recipe

## Orchestrate

Let's start cooking!

In [None]:
recipe = Recipe(profile="local")

In [None]:
recipe.clean()

### Ingest data

* Data was compressed using zip, which (apparently) is not natively supported by Ray (it's an odd choice of archive format anyway)
* Archives need to be unzipped semi-automatically before ingestion
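
The unzip step could look roughly like the sketch below. It uses only the standard library; the function name `extract_archives` and its arguments are hypothetical and not taken from this project.

```python
import zipfile
from pathlib import Path


def extract_archives(source_dir: Path, target_dir: Path) -> list[Path]:
    """Extract every .zip archive in source_dir into target_dir.

    Hypothetical helper: files that already exist in target_dir are not
    extracted again, so the function can be re-run safely.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(source_dir.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            for member in zf.namelist():
                if member.endswith("/"):
                    continue  # directory entry, nothing to extract
                destination = target_dir / member
                if not destination.exists():
                    zf.extract(member, target_dir)
                extracted.append(destination)
    return extracted
```

The returned paths can then be handed to whatever reader the ingest step uses.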

In [None]:
recipe.run(step="ingest")

### Split data into train, validation and test

* Splitting data with Ray's built-in functions (`train_test_split`) does not work, since Recipes expect a _single_ pandas Series as output
* Similarly, defining a custom filter function using Ray is painful, since Recipes expect a boolean Series to index into the dataset
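
To illustrate the two shapes described above, here is a minimal pandas-only sketch. The function names, the label strings (`TRAINING`, `VALIDATION`, `TEST`), the split ratios and the convention that `True` means "keep the row" are assumptions for illustration; check them against the MLflow Recipes version in use.

```python
import numpy as np
import pandas as pd


def assign_split_labels(
    dataset: pd.DataFrame,
    ratios=(0.75, 0.125, 0.125),
    seed: int = 42,
) -> pd.Series:
    """Return a single Series holding one split label per row (assumed contract)."""
    rng = np.random.default_rng(seed)
    labels = rng.choice(
        ["TRAINING", "VALIDATION", "TEST"], size=len(dataset), p=ratios
    )
    return pd.Series(labels, index=dataset.index, name="split")


def create_dataset_filter(dataset: pd.DataFrame) -> pd.Series:
    """Return a boolean Series; True marks rows to keep (assumed convention)."""
    return (dataset["fare_amount"] > 0) & (dataset["trip_distance"] > 0)
```

Both functions work on a pandas DataFrame, which is why the split step in this notebook does not use Ray.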

In [None]:
recipe.run(step="split")

### Transform data

* Recipes demand that custom transform functions return a scikit-learn-like transformer (w.r.t. function signatures), which makes it tedious to implement non-sklearn logic (the last time I wrote custom transformers was 7 years ago)
* Transformations on the target column have to be done elsewhere, i.e. in the _split_ step, since the target column is silently dropped from DataFrames by Recipes
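
For reference, a scikit-learn-like transformer only needs `fit` and `transform` with the usual signatures. The sketch below is not the transformer used in this project: the class `ClipTripDistance`, its clipping logic and the entry point `transformer_fn` are hypothetical names chosen for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ClipTripDistance(BaseEstimator, TransformerMixin):
    """Clip trip_distance at an upper quantile learned during fit."""

    def __init__(self, quantile: float = 0.99):
        self.quantile = quantile

    def fit(self, X: pd.DataFrame, y=None):
        # Learn the clipping threshold from the training data only.
        self.upper_ = float(X["trip_distance"].quantile(self.quantile))
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        X = X.copy()  # leave the caller's DataFrame untouched
        X["trip_distance"] = X["trip_distance"].clip(upper=self.upper_)
        return X


def transformer_fn():
    """Hypothetical entry point that returns an unfitted transformer."""
    return ClipTripDistance(quantile=0.99)
```

Note that `fit` receives only the feature columns here, which matches the observation that the target column is not available in this step.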

In [None]:
recipe.run(step="transform")

### Train model

* As with the transform step, the custom training function has to return something compatible with a scikit-learn estimator, which makes its implementation tedious
* Installing LightGBM support via `poetry add ray[lightgbm]` is not enough to actually use LightGBM; one also needs to install the lightgbm package itself via `poetry add lightgbm`
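
As a rough idea of what "compatible with a scikit-learn estimator" means in practice, the sketch below wraps plain numpy logic in a class with `fit` and `predict`. It is a stand-in, not the model trained in this notebook: `LeastSquaresRegressor` and `estimator_fn` are hypothetical names, and ordinary least squares is used only because it needs no extra dependencies.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin


class LeastSquaresRegressor(BaseEstimator, RegressorMixin):
    """Ordinary least squares via numpy, standing in for non-sklearn training logic."""

    @staticmethod
    def _design_matrix(X) -> np.ndarray:
        X = np.asarray(X, dtype=float)
        # Prepend a column of ones so the first weight acts as the intercept.
        return np.column_stack([np.ones(len(X)), X])

    def fit(self, X, y):
        y = np.asarray(y, dtype=float)
        self.weights_, *_ = np.linalg.lstsq(self._design_matrix(X), y, rcond=None)
        return self

    def predict(self, X) -> np.ndarray:
        return self._design_matrix(X) @ self.weights_


def estimator_fn():
    """Hypothetical entry point that returns an unfitted estimator."""
    return LeastSquaresRegressor()
```

Any training backend can be hidden behind the same two methods, at the cost of writing this kind of wrapper by hand.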

In [None]:
recipe.run(step="train")

### Tune model

In [None]:
recipe.run(step="tune")

### Evaluate model

In [None]:
recipe.run(step="evaluate")

### Register model

In [None]:
recipe.run(step="register")

### Generate predictions

In [None]:
recipe.run(step="predict")