
Pipeline proposal #1332

Open · xdssio wants to merge 5 commits into master
Conversation

@xdssio (Collaborator) commented Apr 29, 2021

This is an implementation of a new Pipeline that wraps the vaex state together with a few standard production needs.

General idea:
Every transformation you apply to the dataframe is saved in the state, as long as you start from the same "raw" data you will receive in production. This lets you use the same infrastructure to solve all the problems.
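For context, a minimal sketch of the existing vaex state mechanism this Pipeline builds on (state_get/state_set and the iris loader are existing vaex APIs; the 'ratio' column is just an illustration):

import vaex.ml.datasets

df = vaex.ml.datasets.load_iris()
df['ratio'] = df['sepal_length'] / df['petal_length']  # a virtual column

state = df.state_get()                # captures the transformations as a state
fresh = vaex.ml.datasets.load_iris()  # the "raw" data again
fresh.state_set(state)                # replays the transformations
assert 'ratio' in fresh.get_column_names()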

  • Keep an example of the data.
  • fit, transform, fit_transform for the sklearn API.
  • An inference function that outputs what you would need in a server.
    • Figure out and handle missing values, missing columns, and extra columns.
    • Never filter the data.
  • Read many input types, like bytes, JSON, dict, list, and so on.
  • save/load

An Imputer transformer, which is used by default in the pipeline but can also be used on its own in many ways by providing a strategy.
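A minimal usage sketch, assuming the strategy-based API proposed here (the strategy spec and names are illustrative, not the final interface):

imputer = Imputer(strategy={'sepal_length': 'mean', 'class_': 'common'})  # hypothetical spec
imputer.fit(train)                # learn a fill-value per column
train = imputer.transform(train)  # fill missing values using the learned values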

Added to the Dataframe:

  • to_records implementation for easy testing (the use of records is very common in the industry); see the sketch after this list.
    • [{key: value, key: value, ...}, ...]
  • countna: counts all missing values in the dataframe.
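A minimal sketch of the two proposed DataFrame additions (the output shapes are illustrative, based on the descriptions above):

df = vaex.ml.datasets.load_iris()
records = df.to_records()  # proposed: a list of dicts, one per row
# e.g. [{'sepal_length': 5.9, 'petal_length': 5.1, ...}, ...]
n_missing = df.countna()   # proposed: total count of missing values in df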

Missing

  • predict, to complete the sklearn API.
  • partial_fit, if possible.
  • show: a way to view all the steps.
  • An API to manipulate steps.
  • A way to "restart" a state in case your "raw" data in production differs from the raw data you started with.

Examples:

import vaex.ml.datasets
from vaex.ml.lightgbm import LightGBMModel

train, test = vaex.ml.datasets.load_iris().split_random(0.8)
train['average_length'] = (train['sepal_length'] + train['petal_length']) / 2
booster = LightGBMModel(features=['average_length', 'petal_width'], target='class_')
booster.fit(train)
train = booster.transform(train)

pipeline = Pipeline.from_dataframe(train)  # Pipeline is the class proposed in this PR
assert "lightgbm_prediction" in pipeline.inference(test)
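Since inference is meant to accept many input types (see the feature list above), a call with a records-style input might look like this (hedged; the exact accepted formats are part of the proposal):

# a single record as a dict; missing columns would be handled by the Imputer
pipeline.inference({'sepal_length': 5.9, 'petal_length': 4.2, 'petal_width': 1.5})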

With fit:

def fit(df):
    # feature engineering plus model training, captured as a single step
    df['average_length'] = (df['sepal_length'] + df['petal_length']) / 2
    booster = LightGBMModel(features=['average_length', 'petal_width'], target='class_')
    booster.fit(df)
    df = booster.transform(df)
    return df

train, test = vaex.ml.datasets.load_iris().split_random(0.8)

pipeline = Pipeline.from_dataframe(train, fit=fit)
pipeline.fit(train)
pipeline.inference(test)  # predictions
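save/load is also listed among the features; a hedged sketch of what that might look like (method names and the file format are assumptions, not the final API):

pipeline.save('pipeline.json')             # persist the state plus the data example
pipeline = Pipeline.load('pipeline.json')  # restore it elsewhere, e.g. in a server
pipeline.inference(test)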

@JovanVeljanoski (Member) left a comment

Hi,

It is a bit difficult to review such a massive PR that contains multiple key points. So here I am focusing on the Imputer class.

In general I like it and I think we should have this. But there are various things that need to be improved.

Here are some general comments to start with:

  • uuid4 is not defined.
  • TransformerBase is not defined.
  • In the methods, ar typically stands for "array". For vaex, it is probably better and more consistent to use expr, which stands for "expression", instead. That will help with maintaining this going forward.
  • What you call "mode" is in fact the median, and what you call "common" is in fact the mode. This makes reviewing the code further quite confusing, so better to address it immediately. (See the short illustration after this list.)
  • Do we really need separate methods for things like mode, median, etc., since they are standard things in vaex and one-liners? Maybe for the mode/common it would make sense, if it is reusable.
  • We have a dedicated docstring for the "prefix" at the top of the transformations.py file. See how the other classes use it.
  • All the methods are public, but none have docstrings. I think any public method (especially those that are newly introduced) should have at least a basic docstring explaining the intent of the method and its params/outputs.
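For clarity, the statistical terms being mixed up (plain numpy/scipy, unrelated to the PR's code):

import numpy as np
from scipy import stats

values = np.array([1, 2, 2, 2, 9])
print(np.median(values))        # 2.0 -> median: the middle value when sorted
print(stats.mode(values).mode)  # 2   -> mode: the most common value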

I'd say let's address this first before moving forward. I am also fine with splitting this PR into 2-3 PRs, but that is up to you.

@JovanVeljanoski (Member) commented

Ok, I've taken over the development here (temporarily?) since I like this and want to push it forward!
It is a fairly big PR so I'm doing this in steps.

So I have refactored the Imputer and improved the tests a bit. It was already in good shape. The (rough) changelog:

  • Added docstrings to all methods, and a class docstring with an example.
  • Fixed missing imports.
  • Changed some variable and function/method names to make them a bit more understandable and easier to support going forward.
  • Removed some of the methods that were too low-level, making it easier (I hope) to follow the code.
  • Added an additional test for the state transfer - currently failing (see the explanation below).
  • Improved the tests -> testing against fixed values (just in case).

There is one problem, as follows:
When doing .transform(df_test), if df_test is missing a column that needs to be imputed by the Imputer, the current behaviour is that the column will be initialized with a constant value (the value being the fill-value for that column).

This is currently not possible when doing state_transfer. I wonder if a fix here is possible; it would be quite nice to have this feature (somewhat specifically tied to the Imputer). I wonder if @maartenbreddels can think of something :)
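A hedged illustration of the behaviour described above (the Imputer API and column names are illustrative):

# df_test has no 'petal_width' column at all
imputer.fit(df_train)                # learns a fill-value per feature
df_out = imputer.transform(df_test)
# 'petal_width' is created as a constant column holding the fill-value,
# which is what cannot currently be expressed via state_transfer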

Also, inspired by @xdssio, I moved the __repr__ method to the base Transformer class; it now shows the name of the Transformer class and its arguments with their values.
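A sketch of what such a __repr__ could look like (vaex.ml transformers are traitlets.HasTraits; this is an assumed implementation, not the actual diff):

import traitlets

class Transformer(traitlets.HasTraits):
    def __repr__(self):
        # class name plus 'arg=value' pairs for all configured traits
        args = ', '.join(f'{k}={v!r}' for k, v in self.trait_values().items())
        return f'{self.__class__.__name__}({args})'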

xdssio and others added 5 commits August 4, 2021 15:32

  • pipeline - to wrap state handling
  • imputer - fillna automatically
  • countna - small feature to count missing values on the entire dataset
  • to_records - used for more standard JSON input/output