# Building a Scalable ML Pipeline with Fugue

In this demo, we will show how to use Fugue to do preprocessing, hyperparameter tuning and distributed inference.

You will find minimal presence of Fugue-specific code in this notebook, because the philosophy of Fugue is that you should try to use native Python and SQL to solve problems, and use Fugue as glue only when necessary.

In this particular case, we want to demonstrate that SQL is a great language for ML preprocessing including data cleaning and featurization. Fugue SQL is mostly standard SQL but with enhanced syntax to let you do more with less code.

We are not competing with other solutions; we only provide general ideas of what should and can be done for a scalable ML pipeline.


## Links 

Fugue is a pure abstraction layer that makes code portable across different computing frameworks such as Pandas, Spark and Dask. It allows users to write code that is compatible with all 3 frameworks. It guarantees consistency regardless of scale and provides a unified framework for compute.

[Fugue Repo](https://github.com/fugue-project/fugue)

[Fugue Slack](https://join.slack.com/t/fugue-project/shared_invite/zt-jl0pcahu-KdlSOgi~fP50TZWmNxdWYQ)

In [None]:
!pip install fuggle

## Setup environment

Now we use the fuggle package (a Fugue environment customized for Kaggle) and set the default execution engine to Spark. Yes, even on Kaggle machines, Spark can still be faster than pandas (if you use it appropriately).

`setup` creates shortcuts for using Spark, Dask and Pandas. It also enables syntax highlighting for ``%%fsql`` (Fugue SQL) cells in the notebook.

In [None]:
from fuggle import setup, PlotBar
setup("spark")

We import a few dependencies and create a simple Python function to draw histograms. It will be used in later steps.

In [None]:
import seaborn as sns
import zipfile
import os
import pandas as pd

def hist(df:pd.DataFrame, x:str, by=None) -> None:
    if by is not None:
        sns.histplot(df, x=x, hue=by)
    else:
        sns.histplot(df, x=x)

## Convert all files to parquet

Now we extract all zip files, load the CSV files and save them as parquet. Parquet is the preferred data format for scalable solutions:

1. It has an explicit schema, so no framework needs to infer the schema. This improves consistency and reduces overhead.
2. It is columnar storage and is normally smaller than CSV.
3. All major computing frameworks support parquet and optimize its read performance.
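
To make the first point concrete, here is a minimal sketch using only Python's standard `csv` module. The column names are borrowed from the orders data and the values are made up. Everything read back from a CSV is a string, so the reader has to guess the types, while a format with an explicit schema stores the types with the data.

```python
import csv
import io

# Write two rows: an integer id and a float that is missing in the second row.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["order_id", "days_since_prior_order"])
writer.writerow([1, 7.0])
writer.writerow([2, ""])

# Read the rows back: the CSV file itself carries no type information.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
value_types = {type(value).__name__ for row in rows for value in row.values()}
print(rows[0], value_types)
```

Whether `""` should become a null, and whether `order_id` should be an integer or a string, is left to each framework's inference rules. That is the inconsistency an explicit schema avoids.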

In [None]:
file_list = [
    '/kaggle/input/instacart-market-basket-analysis/aisles.csv.zip',
    '/kaggle/input/instacart-market-basket-analysis/orders.csv.zip',
    '/kaggle/input/instacart-market-basket-analysis/sample_submission.csv.zip',
    '/kaggle/input/instacart-market-basket-analysis/order_products__train.csv.zip',
    '/kaggle/input/instacart-market-basket-analysis/products.csv.zip',  
    '/kaggle/input/instacart-market-basket-analysis/order_products__prior.csv.zip',    
    '/kaggle/input/instacart-market-basket-analysis/departments.csv.zip']

for file_name in file_list:
    with zipfile.ZipFile(file=file_name) as target_zip:
        target_zip.extractall()
    name = file_name.split("/")[-1][:-4]
    pd.read_csv(name).to_parquet(name[:-4]+".parquet")
    os.remove(name)
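
The two `[:-4]` slices in the loop above strip `.zip` and then `.csv` from the file name. A slightly more defensive sketch of the same naming rule is shown below; `parquet_name` is a hypothetical helper introduced only for illustration, not part of the notebook's pipeline.

```python
from pathlib import PurePosixPath


def parquet_name(zip_path: str) -> str:
    """Map '.../orders.csv.zip' to 'orders.parquet' (hypothetical helper)."""
    name = PurePosixPath(zip_path).name
    # Strip the suffixes one by one and fail loudly on unexpected file names.
    for suffix in (".zip", ".csv"):
        if not name.endswith(suffix):
            raise ValueError(f"expected {suffix!r} suffix in {zip_path!r}")
        name = name[: -len(suffix)]
    return name + ".parquet"


example = parquet_name("/kaggle/input/instacart-market-basket-analysis/orders.csv.zip")
print(example)
```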

## Explore and transform order data

Now you will see Fugue SQL. It looks like standard SQL, but notice that:

1. There is additional syntax such as `LOAD`, `DROP` and `SAVE`. These are the features needed to make SQL a real programming language.
2. You can assign each step's output to a variable, just like in Python.
3. Most of the time, though, you don't specify an explicit variable, which means the step consumes the output of the previous step. This is called anonymity, a very important feature of Fugue SQL that makes your expressions simpler. It is optional, so if you don't like it, you can assign a variable for each step.

In [None]:
%%fsql
-- load the raw orders data from file into a dataframe
LOAD "/kaggle/working/orders.parquet"

-- consume the loaded dataframe, convert eval_set -> eval, and remove eval_set
SELECT *, 
    CASE WHEN eval_set='prior' THEN 0
         WHEN eval_set='train' THEN 1
         ELSE 2 END AS eval
DROP COLUMNS eval_set

-- for each user_id, compute the total number of orders, as a new column of the previous dataframe
-- question for you: why do we use a window function here instead of MAX with GROUP BY?
SELECT *, MAX(order_number) OVER (PARTITION BY user_id) AS num_orders

-- invert the order_number: inv_order_number starts from 0, and smaller means newer
SELECT *, num_orders - order_number AS inv_order_number

-- save the processed dataframe into a new file, load it back as res, for visualization
res = SAVE AND USE OVERWRITE "/kaggle/working/orders_all.parquet"
-- print res
PRINT

-- for each user_id, the latest order must be either train or test, it can't be prior (eval=0)
TAKE 1 ROW FROM res PREPARTITION BY user_id PRESORT order_number DESC
SELECT COUNT(*) AS ct WHERE eval=0
-- make sure the count is 0
PRINT

-- print the size of prior, train, test set
SELECT eval, COUNT(DISTINCT order_id) AS ct FROM res GROUP BY eval
PRINT

-- draw the distribution of the number of orders per user
SELECT eval, user_id, MAX(order_number) AS num_orders FROM res GROUP BY eval, user_id
OUTPUT USING hist(x="num_orders")
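
As for the question in the cell above: a window function keeps every order row and attaches the per-user maximum to it, whereas `GROUP BY` collapses each user to a single row, so the per-order columns would be lost and a join back would be needed. The sketch below illustrates the difference on a toy table. It uses Python's built-in `sqlite3` only to stay self-contained (this is an assumption of the sketch, it needs SQLite 3.25 or newer for window functions, and it is plain SQL rather than Fugue SQL).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (user_id INTEGER, order_number INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2)],
)

# Window function: one output row per order, with the user's total attached.
windowed = conn.execute(
    """
    SELECT user_id, order_number,
           MAX(order_number) OVER (PARTITION BY user_id) AS num_orders
    FROM orders
    ORDER BY user_id, order_number
    """
).fetchall()

# GROUP BY: one output row per user, the individual orders are gone.
grouped = conn.execute(
    """
    SELECT user_id, MAX(order_number) AS num_orders
    FROM orders
    GROUP BY user_id
    ORDER BY user_id
    """
).fetchall()

# Because every order row is still there, inv_order_number is a simple subtraction.
inverted = [(user, number, total - number) for user, number, total in windowed]
print(windowed, grouped, inverted, sep="\n")
```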

## Explore and transform Order-Product data



In [None]:
%%fsql
-- union prior and train data so that it's easy for the following steps
SELECT *, 0 AS eval FROM (LOAD "/kaggle/working/order_products__prior.parquet")
UNION ALL
SELECT *, 1 AS eval FROM (LOAD "/kaggle/working/order_products__train.parquet")

-- normalize the position in cart -> cart_pos
-- notice: we use window again instead of group by
order_products = 
    SELECT *, (add_to_cart_order-1)/MAX(add_to_cart_order) OVER (PARTITION BY order_id) AS cart_pos
    
-- load products metadata and join with the main dataframe
products = LOAD "/kaggle/working/products.parquet"

SELECT order_products.*, department_id, aisle_id
    FROM order_products INNER JOIN products ON order_products.product_id = products.product_id
    
-- save the processed order-product dataframe, load back for visualization
res = SAVE AND USE OVERWRITE "/kaggle/working/order_products_all.parquet"
PRINT

-- print the size of prior and train data, and compare with the previous output to make sure they match
SELECT eval, COUNT(DISTINCT order_id) AS ct FROM res GROUP BY eval
PRINT

-- draw the distribution of number of products per order
SELECT order_id, COUNT(product_id) AS num_products FROM res GROUP BY order_id
SELECT num_products, COUNT(*) AS ct GROUP BY num_products
OUTPUT USING PlotBar(x="num_products", title="products per order")

## Join all data together into a single dataframe

Notice that this may not be a good idea, for two reasons:

1. The joined data has a lot of redundancy and may take a lot of disk space (or memory, depending on how you keep it).
2. The machine may not be big enough to hold the data, and loading it back for compute may also be slow.

But the advantages are also obvious:

1. You no longer need to deal with multiple data sources, so the logic can be simplified.
2. You no longer need to join data sources to extract information, so you gain speed at the cost of space.

You need to make the decision case by case. In this case, the joined dataframe is small in the parquet format, so joining is a great idea. In practice (outside Kaggle), you may be working on one of the main cloud providers with practically unlimited storage, where joining is also a good idea as long as you clean up the temporary files after the job.

Also, in general, if permanent storage is not a concern, it's a great idea to save intermediate data there, so that even if your compute environment is restarted, you can still resume from the last step. Reading a large amount of data from disk isn't as slow as you might think.
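
The resume-from-the-last-step idea can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: `run_or_resume` is a hypothetical helper, and a small JSON file stands in for the parquet files that the notebook saves with `SAVE AND USE`.

```python
import json
import os
import tempfile


def run_or_resume(path, compute):
    """Return the result saved at `path` if present, otherwise compute and save it."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    result = compute()
    with open(path, "w") as f:
        json.dump(result, f)
    return result


calls = []


def expensive_step():
    # Stand-in for a slow transformation; we record each real execution.
    calls.append("ran")
    return {"rows": 3}


checkpoint = os.path.join(tempfile.mkdtemp(), "step1.json")
first = run_or_resume(checkpoint, expensive_step)   # computes and saves
second = run_or_resume(checkpoint, expensive_step)  # loads from disk, no recompute
print(first, second, len(calls))
```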

In [None]:
%%fsql
-- load back the dataframes generated by the last two steps
orders = LOAD "/kaggle/working/orders_all.parquet"
products = LOAD "/kaggle/working/order_products_all.parquet"

-- inner join
data =
    SELECT orders.*, product_id,reordered,department_id,aisle_id,cart_pos
    FROM orders INNER JOIN products ON orders.order_id=products.order_id
    
-- save and print a few rows to make sure it's fine
SAVE AND USE OVERWRITE "/kaggle/working/all.parquet"
PRINT

## Preparation for building featurization logic

Featurization is one of the most important steps in machine learning. You can't expect to extract the best features in one pass; it must be an iterative process. You extract, test, think and adjust.

So for building a practical machine learning pipeline, making the featurization step isolated and extensible is even more important than which features you extract. The higher-level design must be clean and flexible, so that later on you have more time to focus on which features you want.

You need to pay attention to a few general rules:

1. For faster iteration and prototyping, you should sample the original data to make it small enough to work on.
2. For robustness, you need to make your logic testable. A smaller dataset helps. You can also separate your featurization logic from the infrastructure and from each other, and implement core feature logic in simple Python so you can unit test it.
3. Before using it, you should test the entire featurization logic at full scale to make sure you cover all edge cases and there are no performance issues.

It has to be an iterative process. Be patient, but don't be slow.

In [None]:
%%fsql
-- load the entire dataset
all = LOAD "/kaggle/working/all.parquet"

-- take 1% users as the sample
users = SAMPLE 1 PERCENT SEED 0 FROM (SELECT DISTINCT user_id)

-- use the sample users to filter the original dataset, so we make sure each user's data is complete
SELECT * FROM all LEFT SEMI JOIN users ON all.user_id = users.user_id PERSIST

-- since the sample_set will be very small, we can use YIELD DATAFRAME here so later steps can consume sample_set directly
YIELD DATAFRAME AS sample_set
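
The reason for sampling users first and then filtering with a semi join, instead of sampling rows directly, is that every sampled user keeps their complete order history. A pure-Python sketch of the same idea on made-up rows is shown below. Note that it picks an exact number of users, which may differ from how the execution engine implements `SAMPLE`, and that `sample_by_user` is a hypothetical helper.

```python
import random

# Made-up data: 100 users with 3 rows each.
rows = [{"user_id": u, "order_id": u * 10 + i} for u in range(100) for i in range(3)]


def sample_by_user(rows, fraction, seed=0):
    """Keep all rows of a seeded random subset of users."""
    users = sorted({row["user_id"] for row in rows})
    rng = random.Random(seed)
    k = max(1, round(len(users) * fraction))
    chosen = set(rng.sample(users, k))
    # Equivalent of the semi join: keep a row only if its user was chosen.
    return [row for row in rows if row["user_id"] in chosen]


sample = sample_by_user(rows, 0.05)
sampled_users = {row["user_id"] for row in sample}
print(len(sampled_users), len(sample))
```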

Different features will be generated by different featurization logic, and each featurizer should output certain keys along with the feature values. In this case, we will have user-level features, product-level features, user-product-level features, etc. So a big final join must happen (this is common in any kind of featurization task). To make it more convenient, we write a simple Fugue extension `join_all` that can be invoked in Fugue SQL code.

In [None]:
from fugue import WorkflowDataFrames, WorkflowDataFrame, FugueWorkflow

def join_all(dag:FugueWorkflow, dfs:WorkflowDataFrames, how:str="inner") -> WorkflowDataFrame:
    return dag.join(dfs, how=how)

recent_n=5

Before using `join_all` in the pipeline, we can test whether it works. Here we use `%%fsql native`, meaning that we test the logic using NativeExecutionEngine, which uses pandas rather than Spark. Since Fugue guarantees that the same expression generates the same result regardless of execution engine, this is a practical approach to testing your logic without heavy dependencies, and it can also become a unit test that stays in your codebase.

In [None]:
%%fsql native
a = CREATE [[0,1,10],[0,2,20]] SCHEMA user:int,prod:int,v1:int
b = CREATE [[0,1000]] SCHEMA user:int,v2:int
c = CREATE [[0,1,100],[0,2,200]] SCHEMA user:int,prod:int,v3:int
SUB a,b,c USING join_all(how="inner")
PRINT
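
To reason about what the test above should print, it can help to write the join by hand. The sketch below assumes that the join matches rows on the column names the dataframes share (that is our reading of the test, not something this notebook states), and it uses plain lists of dicts in place of dataframes. `natural_inner_join` is a hypothetical helper.

```python
from functools import reduce


def natural_inner_join(left, right):
    """Inner join two lists of dicts on the column names they share."""
    if not left or not right:
        return []
    keys = [k for k in left[0] if k in right[0]]
    # Index the right side by its key values for fast lookup.
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    joined = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            joined.append({**row, **match})
    return joined


a = [{"user": 0, "prod": 1, "v1": 10}, {"user": 0, "prod": 2, "v1": 20}]
b = [{"user": 0, "v2": 1000}]
c = [{"user": 0, "prod": 1, "v3": 100}, {"user": 0, "prod": 2, "v3": 200}]

# Join a with b on user, then the result with c on user and prod.
joined = reduce(natural_inner_join, [a, b, c])
print(joined)
```

Under that assumption, the user-level value `v2` is repeated on every user-product row, which is the shape we want for the final feature table.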

## Featurization

This is a highly iterative step. But since we use the sample_set, it should be very fast.

In [None]:
%%fsql
labels = SELECT user_id, product_id, 1 AS label FROM sample_set WHERE eval=1 AND reordered=1
prior_data = SELECT * FROM sample_set WHERE eval=0

-- average cart position for each product
p_cart_pos = SELECT product_id, AVG(cart_pos) AS p_cart_pos FROM prior_data GROUP BY product_id

-- user-product level features
up_features = 
    SELECT user_id, product_id, 
           AVG(cart_pos) AS up_cart_pos,
           MAX(reordered) AS reordered, -- as long as there is one reordered, it's reordered
           COUNT(*)/(MAX(inv_order_number)+1) AS up_order_freq
    FROM prior_data GROUP BY user_id, product_id

-- for each product, the fraction of its buyers who ever reordered it
p_reorder_ratio = SELECT product_id, SUM(reordered)/COUNT(*) AS p_reorder_ratio FROM up_features GROUP BY product_id

-- for each user, the fraction of their products that were ever reordered
u_reorder_ratio = SELECT user_id, SUM(reordered)/COUNT(*) AS u_reorder_ratio FROM up_features GROUP BY user_id

-- truncate the data to keep only the most recent recent_n orders; this entire cell is treated as a jinja template, so variables set in previous cells can be used
recent = SELECT * FROM prior_data WHERE inv_order_number<{{recent_n}}

-- user-product level recent features
recent_features =
    SELECT user_id, product_id,
           AVG(cart_pos) AS up_recent_cart_pos,
           MAX(reordered) AS up_recent_reordered,
           COUNT(*)/{{recent_n}} AS up_recent_order_freq
    FROM recent GROUP BY user_id, product_id

-- join all non-recent features using inner join
up = SUB up_features,p_cart_pos,p_reorder_ratio,u_reorder_ratio USING join_all

-- construct the final feature set by left joining recent features; nulls are filled by COALESCE
features=
    SELECT up.*,
        COALESCE(up_recent_cart_pos,1.0) AS up_recent_cart_pos,
        COALESCE(up_recent_reordered,0) AS up_recent_reordered,
        COALESCE(up_recent_order_freq,0.0) AS up_recent_order_freq
    FROM up LEFT OUTER JOIN recent_features 
        ON up.user_id=recent_features.user_id AND up.product_id=recent_features.product_id

-- we only keep the features whose users are in positive labels
training_features = SELECT * FROM features LEFT SEMI JOIN labels ON features.user_id=labels.user_id

-- final join with label to get negative labels, now we get the training data
training =
    SELECT training_features.*, COALESCE(label,0) AS label
    FROM training_features 
        LEFT OUTER JOIN labels ON training_features.user_id=labels.user_id AND training_features.product_id=labels.product_id
    PERSIST
    YIELD DATAFRAME

PRINT
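
Following the earlier advice to keep core feature logic unit-testable, here is how one of the features above, `up_order_freq = COUNT(*)/(MAX(inv_order_number)+1)`, could be written and checked in plain Python. This is a sketch for testing the formula on a handful of rows, not a replacement for the SQL; `user_product_order_freq` is a hypothetical name.

```python
from collections import defaultdict


def user_product_order_freq(rows):
    """rows: iterable of (user_id, product_id, inv_order_number) from prior orders."""
    count = defaultdict(int)
    oldest = defaultdict(int)
    for user_id, product_id, inv_order_number in rows:
        key = (user_id, product_id)
        count[key] += 1
        oldest[key] = max(oldest[key], inv_order_number)
    # Purchases divided by the number of orders since the first purchase.
    return {key: count[key] / (oldest[key] + 1) for key in count}


prior_rows = [
    (1, 7, 1), (1, 7, 3), (1, 7, 4),  # product 7: bought 3 times, first seen 4 orders back
    (1, 8, 1),                        # product 8: bought once, in the most recent prior order
]
freq = user_product_order_freq(prior_rows)
print(freq)
```

For product 7 the value is 3 / (4 + 1) = 0.6, and for product 8 it is 1 / (1 + 1) = 0.5.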

## Hyperparameter Tuning

With the training data generated from the previous step, we are ready to try different models with different parameters.

First of all, we will install `fugue_tune`, a library for parameter tuning. Here it is installed through the `fugue_incubator` package.

In [None]:
!pip install 'fugue_incubator==0.0.8'

Now, let's convert the training dataframe to a pandas dataframe, dropping irrelevant columns.

In [None]:
tdf = training.as_pandas().drop(["user_id","product_id"],axis=1)
tdf

`fugue_tune` supports hybrid search: you can define a search space that combines model sweeping, grid search and Bayesian optimization. Notice the `space` expression in the following code; that is how you define a search space for sklearn-compatible models. In this case, we are going to try the simplest `LogisticRegression` and the LGBM classifier.

In `LGBMClassifier`, we define a few hyperparameters, and because we want to do Bayesian optimization on them we use `Rand` or `RandInt`. If we wanted to grid search a parameter instead, we could write, for example, `num_leaves=Grid(20,30,40)`.

Fugue treats grid search differently: it fully parallelizes the grid search, and then does sequential Bayesian optimization within each independent task, to maximize parallelism. A Kaggle machine has only 4 cores, so in the following case we don't use grid search, because it would not work well. But on a real cluster, grid search can be a lot faster than Bayesian optimization.

In general, grid search requires more computing resources but takes less time, while Bayesian optimization is the opposite. So when you define the space using `fugue_tune`, try to balance time and cost for your problem.
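
The time versus cost trade-off can be made concrete with some back-of-the-envelope arithmetic. The sketch below is not how `fugue_tune` schedules work internally. It simply assumes that every grid combination becomes one independent task, that each task runs the same number of sequential search steps, and that every trial takes the same time. The grid values are hypothetical.

```python
from itertools import product
from math import ceil

# Hypothetical grid: every combination becomes one independent task.
grid = {"num_leaves": [20, 30, 40], "max_depth": [5, 10]}
tasks = [dict(zip(grid, combo)) for combo in product(*grid.values())]


def cost(num_tasks, iters_per_task, workers):
    """Return (total trials, wall-clock time measured in trials)."""
    total_trials = num_tasks * iters_per_task
    wall_clock = ceil(num_tasks / workers) * iters_per_task
    return total_trials, wall_clock


no_grid = cost(1, 30, workers=4)                 # sequential search only
grid_on_kaggle = cost(len(tasks), 30, workers=4)
grid_on_cluster = cost(len(tasks), 30, workers=len(tasks))
print(len(tasks), no_grid, grid_on_kaggle, grid_on_cluster)
```

Under these assumptions, the 6-task grid on 4 cores needs 6 times the compute and twice the wall-clock time of the purely sequential search, while with at least 6 workers it explores 6 times as many trials in the same wall-clock time.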

In [None]:
%%time
from fugue_tune import Grid, Rand, RandInt, Choice
from fugue_tune.sklearn import sk_space as ss, suggest_sk_model
from fugue_tune.hyperopt import HyperoptRunner

from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression

# sum() takes an iterable, so the candidate spaces are wrapped in a list
space = sum([
    ss(LogisticRegression, max_iter=1000),
    ss(LGBMClassifier,num_leaves=RandInt(20,100), max_depth=RandInt(5,50), learning_rate=Rand(0.8,2.0), n_estimators=RandInt(100,300), random_state=0)
])


suggest_sk_model(
    space,
    tdf,
    scoring="f1", # score for cross validation
    serialize_path = "/tmp", # we need specify a temp path for storing some intermediate data
    objective_runner = HyperoptRunner(max_iter=30, seed=0)
)


With `suggest_sk_model` we get a suggestion for the most promising hyperparameter combinations. Unsurprisingly, `LGBMClassifier` outperforms `LogisticRegression`. Note that at this point the model is trained and cross-validated on only 1% of the entire training data. We use the small sample to achieve faster iteration.

Once we have the suggested model, we should retrain it with the full training data. So should we also do hyperparameter tuning on the entire training data? It depends on how fast you want the iteration to be, and how many compute resources you have.

## Apply the featurization on the entire dataset

Applying the process to the entire dataset is straightforward: you only need to change the input data. Fugue is scale agnostic, so the main part of the featurization logic does not change.

In real situations, the following step may be done on a powerful cluster, and you also need to specify how many compute resources you want to use. For example, if you use Spark, you may use a small cluster for the previous steps, but at this step you should increase the number of instances, cores and memory accordingly.

In [None]:
%%fsql
source = LOAD "/kaggle/working/all.parquet"
labels = SELECT user_id, product_id, 1 AS label FROM source WHERE eval=1 AND reordered=1
prior_data = SELECT * FROM source WHERE eval=0

up_features = 
    SELECT user_id, product_id, 
           AVG(cart_pos) AS up_cart_pos,
           MAX(reordered) AS reordered,
           COUNT(*)/(MAX(inv_order_number)+1) AS up_order_freq
    FROM prior_data GROUP BY user_id, product_id

p_cart_pos = SELECT product_id, AVG(cart_pos) AS p_cart_pos FROM prior_data GROUP BY product_id

p_reorder_ratio = SELECT product_id, SUM(reordered)/COUNT(*) AS p_reorder_ratio FROM up_features GROUP BY product_id
u_reorder_ratio = SELECT user_id, SUM(reordered)/COUNT(*) AS u_reorder_ratio FROM up_features GROUP BY user_id

recent = SELECT * FROM prior_data WHERE inv_order_number<{{recent_n}}

recent_features =
    SELECT user_id, product_id,
           AVG(cart_pos) AS up_recent_cart_pos,
           MAX(reordered) AS up_recent_reordered,
           COUNT(*)/{{recent_n}} AS up_recent_order_freq
    FROM recent GROUP BY user_id, product_id

up = SUB up_features,p_cart_pos,p_reorder_ratio,u_reorder_ratio USING join_all

features=
    SELECT up.*,
        COALESCE(up_recent_cart_pos,1.0) AS up_recent_cart_pos,
        COALESCE(up_recent_reordered,0) AS up_recent_reordered,
        COALESCE(up_recent_order_freq,0.0) AS up_recent_order_freq
    FROM up LEFT OUTER JOIN recent_features 
        ON up.user_id=recent_features.user_id AND up.product_id=recent_features.product_id


training_features = SELECT * FROM features LEFT SEMI JOIN labels ON features.user_id=labels.user_id

-- do a random split using a seeded random number: about 20% of the rows go to the test set
data =
    SELECT training_features.*, COALESCE(label,0) AS label, rand(0)<0.2 AS is_test
    FROM training_features 
        LEFT OUTER JOIN labels ON training_features.user_id=labels.user_id AND training_features.product_id=labels.product_id
    PERSIST

-- select test data, clean, and save to a single file so pandas can consume directly
SELECT * FROM data WHERE is_test
DROP COLUMNS user_id, product_id, is_test
SAVE OVERWRITE SINGLE "/kaggle/working/test.parquet"

-- select train data, clean, and save to a single file so pandas can consume directly
SELECT * FROM data WHERE NOT is_test
DROP COLUMNS user_id, product_id, is_test
SAVE OVERWRITE SINGLE  "/kaggle/working/train.parquet"
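
The `rand(0)<0.2` expression above gives each row its own random draw, so the test share is close to 20% but not exactly 20%. A plain-Python sketch of the same flagging idea is shown below, with Python's `random` module standing in for the SQL `rand` function.

```python
import random


def split_flags(num_rows, test_fraction=0.2, seed=0):
    """One independent draw per row; True means the row goes to the test set."""
    rng = random.Random(seed)
    return [rng.random() < test_fraction for _ in range(num_rows)]


is_test = split_flags(10_000)
test_share = sum(is_test) / len(is_test)

# The same seed reproduces the same split.
same_again = split_flags(10_000) == is_test
print(round(test_share, 3), same_again)
```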

Now, we just need the simplest pandas operations.

In [None]:
train = pd.read_parquet("/kaggle/working/train.parquet")
train_x = train.drop("label",axis=1)
train_y = train["label"]
test = pd.read_parquet("/kaggle/working/test.parquet")
test_x = test.drop("label",axis=1)
test_y = test["label"]

Based on the suggested hyperparameters, we create an `LGBMClassifier` and train it on 80% of the entire training data. We also save the model to disk, because that is the best way to make it available to distributed inference.

In [None]:
%%time
from joblib import dump, load
from sklearn.metrics import f1_score

model = LGBMClassifier(random_state=0, num_leaves=23, max_depth=15, learning_rate=1.2983994130341636, n_estimators=132)
model.fit(train_x,train_y)
dump(model, "/kaggle/working/model")

pred = load("/kaggle/working/model").predict(test_x)
f1_score(test_y, pred)

## Distributed inference

Distributed inference is a much simpler problem than distributed training and tuning. Using Fugue, it is even simpler. You can treat it as a transformation problem, so you only need to write a transformer to do the inference. And to create a transformer, you only write the simplest Python code, with no dependency on Fugue.

In [None]:
# schema: *,pred:int
def infer(df:pd.DataFrame, model_path:str) -> pd.DataFrame:
    model = load(model_path)
    return df.assign(pred=model.predict(df))

infer(test_x, "/kaggle/working/model")

Now we can use this extension in Fugue SQL. `PREPARTITION` is added to ensure the data is evenly distributed, which gives better load balance.

In [None]:
%%fsql
LOAD "/kaggle/working/test.parquet"
DROP COLUMNS label
TRANSFORM PREPARTITION 16 USING infer(model_path="/kaggle/working/model")
PRINT
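
Conceptually, the cell above splits the data into partitions and runs `infer` on each one independently, with every worker loading the model from the shared path. The sketch below imitates that pattern with the standard library only. It makes heavy simplifying assumptions: threads stand in for cluster workers, a JSON file holding a threshold stands in for the saved model, and `save_model`, `infer_partition` and `partition` are hypothetical helpers.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def save_model(path, threshold):
    # Stand-in for dump(model, path): the "model" is just a threshold.
    with open(path, "w") as f:
        json.dump({"threshold": threshold}, f)


def infer_partition(rows, model_path):
    # Each partition loads the model by path, like infer() does with load().
    with open(model_path) as f:
        threshold = json.load(f)["threshold"]
    return [{**row, "pred": int(row["score"] >= threshold)} for row in rows]


def partition(rows, num_partitions):
    # Round-robin split so the partitions have nearly equal size.
    return [rows[i::num_partitions] for i in range(num_partitions)]


model_path = os.path.join(tempfile.mkdtemp(), "model.json")
save_model(model_path, 0.5)
rows = [{"id": i, "score": i / 10} for i in range(10)]

with ThreadPoolExecutor(max_workers=4) as pool:
    parts = list(pool.map(lambda p: infer_partition(p, model_path), partition(rows, 4)))

# Merge the partition outputs back into one result set.
predictions = sorted((row for part in parts for row in part), key=lambda r: r["id"])
print([row["pred"] for row in predictions])
```

Because the function only receives rows and a path, nothing in it depends on how the partitions were created or where they run. That is what lets the same function be reused unchanged across execution engines.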

## Summary

In this notebook, we have demonstrated how to use Fugue in different steps of an ML pipeline. We also showed that Fugue SQL is a great language for machine learning. It helps you write scale-agnostic code. And with the Fugue framework, it's much easier to test each step of your pipeline.

It is worth noting that Fugue also has a functional API. If you are not a fan of SQL, you may consider using it.

You may have noticed that Fugue-related code is minimal in the entire notebook. We solved this problem mostly with standard SQL and native Python.