# Sklearn Solution

## Simple

In [4]:
import pandas as pd
import json
from sklearn.datasets import load_iris

# Get the data
iris = load_iris()
features = iris.feature_names
df = pd.DataFrame(iris.data, columns=features)
df["target"] = iris.target

df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


### Create a pipeline for production

In [5]:
from goldilox import Pipeline
from catboost import CatBoostClassifier

pipeline = Pipeline.from_sklearn(CatBoostClassifier(verbose=0)).fit(df[features], df["target"])

# I/O Example
raw = pipeline.raw
print(f"predict for {json.dumps(raw, indent=4)}")
pipeline.inference(raw)

predict for {
    "sepal length (cm)": 5.1,
    "sepal width (cm)": 3.5,
    "petal length (cm)": 1.4,
    "petal width (cm)": 0.2
}


| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | prediction |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |


### Validation    
We can see that the pipeline is valid, but it cannot handle missing values if they appear in production.

In [6]:
pipeline.validate()

True
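
One way to find out how a prediction function reacts to missing values before it reaches production is to blank out one field of a known-good example at a time. The sketch below is not goldilox's validation logic; `probe_missing` and the toy `predict` function are hypothetical stand-ins used only to illustrate the idea.

```python
import json

def probe_missing(predict, example):
    """Call `predict` with each field set to None in turn and report which fields fail."""
    failures = {}
    for key in example:
        probe = dict(example)
        probe[key] = None  # simulate a missing value for this field
        try:
            result = predict(probe)
            json.dumps(result)  # a served response must be JSON serializable
        except Exception as error:
            failures[key] = type(error).__name__
    return failures

# Hypothetical stand-in for a model: the arithmetic fails on None.
def predict(row):
    return {"score": row["petal length (cm)"] * row["petal width (cm)"]}

example = {
    "sepal length (cm)": 5.1,
    "sepal width (cm)": 3.5,
    "petal length (cm)": 1.4,
    "petal width (cm)": 0.2,
}
print(probe_missing(predict, example))
```

Fields that show up in the report are the ones that would need imputation or a default value before the model can be trusted with incomplete requests.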

### Variables and description
We can add variables which we want to associate with the pipeline, and a description.
* A great place to put the training params, evaluation results, version, branch, etc.

In [12]:
pipeline.description = "CatBoost on the iris dataset with sklearn"
pipeline.variables["var1"] = 1
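
As a sketch of what such metadata might look like, the snippet below collects training parameters, evaluation results and version information into one JSON-serializable dictionary. The helper `build_metadata` and all key names are hypothetical, and the accuracy value is a placeholder rather than a measured result. Each entry could then be stored the same way as `var1` above.

```python
import json
import platform
from datetime import datetime, timezone

def build_metadata(params, metrics, version, branch):
    """Collect run information into a flat, JSON-serializable dict."""
    metadata = {
        "version": version,
        "branch": branch,
        "python": platform.python_version(),
        "trained_at": datetime.now(timezone.utc).isoformat(),
    }
    # Prefix the keys so parameters and metrics cannot collide.
    metadata.update({f"param_{k}": v for k, v in params.items()})
    metadata.update({f"metric_{k}": v for k, v in metrics.items()})
    json.dumps(metadata)  # fail early if a value cannot be serialized
    return metadata

metadata = build_metadata(
    params={"verbose": 0},
    metrics={"accuracy": 0.95},  # placeholder value, not a measured result
    version="0.0.1",             # hypothetical version string
    branch="main",               # hypothetical branch name
)
for key, value in metadata.items():
    print(key, "=", value)
```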

### Serve

In [13]:
print(f"Saved to: {pipeline.save('../tests/models/server.pkl')}")
print("Check out the docs: http://127.0.0.1:5000\n")
!gl serve ../tests/models/server.pkl

Saved to: ../tests/models/server.pkl
Check out the docs: http://127.0.0.1:5000

[2021-11-16 18:53:50 +0100] [74866] [INFO] Starting gunicorn 20.1.0
[2021-11-16 18:53:50 +0100] [74866] [INFO] Listening at: http://127.0.0.1:5000 (74866)
[2021-11-16 18:53:50 +0100] [74866] [INFO] Using worker: uvicorn.workers.UvicornH11Worker
[2021-11-16 18:53:50 +0100] [74872] [INFO] Booting worker with pid: 74872
[2021-11-16 18:53:50 +0100] [74872] [INFO] Started server process [74872]
[2021-11-16 18:53:50 +0100] [74872] [INFO] Waiting for application startup.
[2021-11-16 18:53:50 +0100] [74872] [INFO] Application startup complete.
^C
[2021-11-16 18:53:58 +0100] [74866] [INFO] Handling signal: int
[2021-11-16 18:53:58 +0100] [74866] [INFO] Shutting down: Master
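
While the server is running, a client in a separate process can send records as JSON over HTTP. The sketch below uses only the standard library. The `/inference` route and the list-of-records payload shape are assumptions, so check the docs page at the address above for the actual routes; `build_request` and `query` are hypothetical helper names.

```python
import json
import urllib.request

def build_request(base_url, records, route="/inference"):
    """Build a JSON POST request; `route` is an assumed endpoint name."""
    body = json.dumps(records).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + route,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def query(base_url, records):
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(build_request(base_url, records), timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))

request = build_request(
    "http://127.0.0.1:5000",
    [{"sepal length (cm)": 5.1, "sepal width (cm)": 3.5,
      "petal length (cm)": 1.4, "petal width (cm)": 0.2}],
)
print(request.get_method(), request.full_url)
```

Only `build_request` runs here; calling `query` with the same arguments would perform the actual round trip once the server is up.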


# Vaex solution

Vaex solutions are much more **powerful** and make feature engineering and scaling easier.    
In this example we do some simple feature engineering and map the results to labels, so they are easier to consume on the frontend side.

* We do not need to implement transformers or estimators for each feature engineering step. Instead, we create simple functions which do what we want.

## Simple

In [20]:
import vaex
import warnings
from vaex.ml.datasets import load_iris_1e5
from vaex.ml.catboost import CatBoostModel
from goldilox import Pipeline
import numpy as np
import json

warnings.filterwarnings("ignore")


df = load_iris_1e5()
target = "class_"

# feature engineering example
df["petal_ratio"] = df["petal_length"] / df["petal_width"] 

booster = CatBoostModel(features=["petal_length", "petal_width", "sepal_length", "sepal_width", "petal_ratio"],
                        target=target,
                        prediction_name="predictions",
                        params={"num_boost_round":500, "verbose":0, "objective":"MultiClass"})
                        
booster.fit(df)
df = booster.transform(df)

# post model processing example
@vaex.register_function()
def argmax(ar, axis=1):
    return np.argmax(ar, axis=axis)
df.add_function("argmax", argmax)
df["prediction"] = df["predictions"].argmax()

df["label"] = df["prediction"].map({0: "setosa", 1: "versicolor", 2: "virginica"})

# Vaex remembers all the transformations; this is an sklearn.pipeline alternative
pipeline = Pipeline.from_vaex(df, description="simple Catboost")
pipeline.raw.pop(target) # (optional) we don't expect to get the class_ in queries
assert pipeline.validate()
print("Pipeline raw data example:")
print(json.dumps(pipeline.raw, indent=4))
print("")
print("Pipeline output example:")
pipeline.inference(pipeline.raw).to_records()
df.head(2)

Pipeline raw data example:
{
    "sepal_length": 5.9,
    "sepal_width": 3.0,
    "petal_length": 4.2,
    "petal_width": 1.5
}

Pipeline output example:


| # | sepal_length | sepal_width | petal_length | petal_width | class_ | petal_ratio | predictions | prediction | label |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.9 | 3 | 4.2 | 1.5 | 1 | 2.8 | array([6.09742414e-07, 9.99998377e-01, 1.013054... | 1 | versicolor |
| 1 | 6.1 | 3 | 4.6 | 1.4 | 1 | 3.28571 | array([5.37810099e-07, 9.99998682e-01, 7.801238... | 1 | versicolor |
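
The post-processing step reduces each row of class probabilities to the index of the largest value and then maps that index to a name. A plain-Python sketch of the same idea, independent of Vaex, is shown below; the helpers `argmax` and `to_label` are hypothetical and the probabilities are illustrative values.

```python
LABELS = {0: "setosa", 1: "versicolor", 2: "virginica"}

def argmax(values):
    """Return the index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)

def to_label(probabilities):
    """Map one row of class probabilities to its class index and name."""
    index = argmax(probabilities)
    return index, LABELS[index]

print(to_label([0.00121612, 0.99772101, 0.00106287]))
```

Returning the label alongside the index means the frontend does not need its own copy of the class mapping.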


### Serve

In [2]:
print(f"Saved to: {pipeline.save('../tests/models/server.pkl')}")
print("Check out the docs: http://127.0.0.1:5000\n")

!gl serve ../tests/models/server.pkl

Saved to: ../tests/models/server.pkl
Check out the docs: http://127.0.0.1:5000

[2021-11-16 18:54:44 +0100] [74906] [INFO] Starting gunicorn 20.1.0
[2021-11-16 18:54:44 +0100] [74906] [INFO] Listening at: http://127.0.0.1:5000 (74906)
[2021-11-16 18:54:44 +0100] [74906] [INFO] Using worker: uvicorn.workers.UvicornH11Worker
[2021-11-16 18:54:44 +0100] [74911] [INFO] Booting worker with pid: 74911
[2021-11-16 18:54:44 +0100] [74911] [INFO] Started server process [74911]
[2021-11-16 18:54:44 +0100] [74911] [INFO] Waiting for application startup.
[2021-11-16 18:54:44 +0100] [74911] [INFO] Application startup complete.
^C
[2021-11-16 18:54:54 +0100] [74906] [INFO] Handling signal: int
[2021-11-16 18:54:55 +0100] [74906] [INFO] Shutting down: Master


## Advanced   
Let's have a look at an advanced training function, which we want to re-run when new data arrives.     
To implement this, we must put everything within a function which receives a DataFrame and returns a Vaex DataFrame.

The function:    
First, we run a "*random_split*" experiment and save the results.    
Next, we train the model on the entire dataset.    
Finally, we add the evaluation as a variable so we can recall how good the model was.


* This way we can change the pipeline training and outputs without changing our infrastructure at all.
* This also creates a model for production which has learned from the entire dataset.
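
The evaluate-then-refit pattern described above can be sketched without any ML library. The toy nearest-mean classifier, the helper names (`train`, `accuracy`, `fit_with_holdout`) and the synthetic data are all hypothetical; only the order of the steps mirrors the description.

```python
import random

def train(rows):
    """Toy model: nearest class mean on a single numeric feature."""
    sums, counts = {}, {}
    for value, label in rows:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    means = {label: sums[label] / counts[label] for label in sums}
    return lambda value: min(means, key=lambda label: abs(means[label] - value))

def accuracy(model, rows):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(model(value) == label for value, label in rows) / len(rows)

def fit_with_holdout(rows, test_size=0.2, seed=0):
    """Score on a held-out split, then refit on all rows for production."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * test_size)
    test, train_rows = shuffled[:cut], shuffled[cut:]
    score = accuracy(train(train_rows), test)  # estimate from unseen rows
    final_model = train(rows)                  # production model sees everything
    return final_model, {"accuracy": score}

# Synthetic, well separated data: class 0 near 1.0, class 1 near 5.0.
rows = [(1.0 + 0.01 * i, 0) for i in range(50)] + [(5.0 + 0.01 * i, 1) for i in range(50)]
model, variables = fit_with_holdout(rows)
print(variables, model(1.2), model(4.9))
```

The stored score describes the model trained on the smaller split, so it is a slightly conservative estimate for the final model that was fitted on all rows.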

In [31]:
import warnings
from vaex.ml.datasets import load_iris
from goldilox import Pipeline

warnings.filterwarnings("ignore") # silence library warnings

def fit(df):
    import vaex
    import numpy as np
    from vaex.ml.catboost import CatBoostModel
    from sklearn.metrics import accuracy_score
    from goldilox import Pipeline

    train, test = df.ml.train_test_split(test_size=0.2, verbose=False)        
    target = "class_"
    prediction_name = "predictions"
    
    train["petal_ratio"] = train["petal_length"] / train["petal_width"] 
    
    features = ["petal_length", "petal_width", "sepal_length", "sepal_width", "petal_ratio"]    
    params = {"num_boost_round":500, "verbose":0, "objective":"MultiClass"}
    booster = CatBoostModel(features=features,
                        target=target,
                        prediction_name=prediction_name,
                        params=params)    
    booster.fit(train)    

    @vaex.register_function()
    def argmax(ar, axis=1):
        return np.argmax(ar, axis=axis)

    train = booster.transform(train)
    train.add_function("argmax", argmax)
    train["prediction"] = train["predictions"].argmax()
    
    """
    Using the pipeline to get predictions on a new dataset.
    This is very helpful if we did many feature engineering transformations.
    """
    pipeline = Pipeline.from_vaex(train) 
    accuracy = accuracy_score(pipeline.inference(test)["prediction"].values,
                              test[target].values)
    
    # Re-train on the entire dataset
    booster = CatBoostModel(features=features,
                        target=target,
                        prediction_name=prediction_name,
                        params=params)
    processed = pipeline.inference(df) # all feature engineering (including the model, which we will overwrite)
    booster.fit(processed)
    df = booster.transform(processed)
    df.add_function("argmax", argmax)
    df["prediction"] = df[prediction_name].argmax()
    # The "label" helps the frontend app understand what the result actually was
    df["label"] = df["prediction"].map({0: "setosa", 1: "versicolor", 2: "virginica"})
    df.variables["accuracy"] = accuracy
    return df

df = load_iris()
pipeline = Pipeline.from_vaex(df, fit=fit).fit(df)
pipeline.validate()

True

### Persistence

In [32]:
from tempfile import TemporaryDirectory

path = str(TemporaryDirectory().name) + "/model.pkl"
pipeline.save(path)
pipeline = Pipeline.from_file(path)

pipeline.inference(pipeline.raw)

| # | sepal_length | sepal_width | petal_length | petal_width | class_ | petal_ratio | predictions | prediction | label |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.9 | 3 | 4.2 | 1.5 | 1 | 2.8 | array([0.00121612, 0.99772101, 0.00106287]) | 1 | versicolor |
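
A save/load round trip can also be checked inside a `with` block, which keeps the temporary directory alive for exactly as long as it is needed. The sketch below uses `pickle` on a plain dictionary as a stand-in for the pipeline; it makes no assumption about goldilox's own file format.

```python
import pickle
from pathlib import Path
from tempfile import TemporaryDirectory

# Hypothetical stand-in for a pipeline object.
artifact = {"description": "stand-in for a pipeline", "variables": {"var1": 1}}

with TemporaryDirectory() as directory:  # removed automatically on exit
    path = Path(directory) / "model.pkl"
    path.write_bytes(pickle.dumps(artifact))
    restored = pickle.loads(path.read_bytes())
    existed_inside = path.exists()

print(restored == artifact, existed_inside, path.exists())
```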


### Re-train

In [36]:
from vaex.ml.datasets import load_iris_1e5

df = load_iris_1e5() # iris 670 times
pipeline.fit(df)
assert pipeline.validate()
print(f"Accuracy: {pipeline.variables['accuracy']} - very good as we duplicated the original data (:")

Accuracy: 1.0 - very good as we duplicated the original data (:
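
The perfect score is expected: when every row is copied many times and the copies are split at random, practically every test row has an identical twin in the training set. A small standard-library sketch of that overlap follows; the numbers 150 and 670 mirror the comment above, and the seed is arbitrary.

```python
import random

originals = range(150)                              # ids of the distinct iris rows
rows = [i for i in originals for _ in range(670)]   # each row duplicated 670 times

random.Random(0).shuffle(rows)
cut = int(len(rows) * 0.2)                          # 20% held out, as in the fit function
test, train = rows[:cut], rows[cut:]

seen_in_train = set(train)
leaked = sum(row in seen_in_train for row in test) / len(test)
print(f"{leaked:.3f} of test rows also appear in the training split")
```

For an honest estimate on duplicated data, the split would have to be made on the distinct rows before duplication.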


### Serve

* Note that when we train this way, the "*raw*" example contains the target variable "class_", which we do not expect in production. This is not an issue: we can either "pop" it from pipeline.raw or just ignore it, and predictions still work!

In [7]:
pipeline.raw.pop("class_", None) # optional, predictions also work if we leave it in
print(f"Saved to: {pipeline.save('../tests/models/server.pkl')}")
print("Check out the docs: http://127.0.0.1:5000\n")

!gl serve ../tests/models/server.pkl

Saved to: ../tests/models/server.pkl
Check out the docs: http://127.0.0.1:5000

[2021-11-16 19:01:25 +0100] [75207] [INFO] Starting gunicorn 20.1.0
[2021-11-16 19:01:25 +0100] [75207] [INFO] Listening at: http://127.0.0.1:5000 (75207)
[2021-11-16 19:01:25 +0100] [75207] [INFO] Using worker: uvicorn.workers.UvicornH11Worker
[2021-11-16 19:01:25 +0100] [75213] [INFO] Booting worker with pid: 75213
[2021-11-16 19:01:26 +0100] [75213] [INFO] Started server process [75213]
[2021-11-16 19:01:26 +0100] [75213] [INFO] Waiting for application startup.
[2021-11-16 19:01:26 +0100] [75213] [INFO] Application startup complete.
^C
[2021-11-16 19:01:45 +0100] [75207] [INFO] Handling signal: int
[2021-11-16 19:01:45 +0100] [75207] [INFO] Shutting down: Master
