
Pipeline proposal #1332

Open · xdssio wants to merge 5 commits into master
Conversation

@xdssio (Collaborator) commented Apr 29, 2021

This is an implementation of a new Pipeline that wraps the vaex state together with a few standard production needs.

General idea:
Every transformation you apply to the dataframe is saved in the state, as long as you start from the same "raw" data you will receive in production. This lets you use the same infrastructure to solve all the problems.
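For context, a minimal sketch of the existing vaex state mechanism this Pipeline builds on (state_get/state_set and the iris loader are existing vaex APIs; the 'ratio' column is just an illustration):

import vaex.ml.datasets

df = vaex.ml.datasets.load_iris()
df['ratio'] = df['sepal_length'] / df['petal_length']  # a virtual column

state = df.state_get()                # captures the transformations as a state
fresh = vaex.ml.datasets.load_iris()  # the "raw" data again
fresh.state_set(state)                # replays the transformations
assert 'ratio' in fresh.get_column_names()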

  • Keep an example of the data.
  • fit, transform, fit_transform for the sklearn API.
  • An inference function that outputs what you would need in a server.
    • Figure out and handle missing values, missing columns, and extra columns.
    • Never filter the data.
  • Read many input types, like bytes, JSON, dict, list, and so on.
  • save/load

An Imputer transformer, which is used by default in the pipeline but can also be used on its own in many ways by providing a strategy.
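A minimal usage sketch, assuming the strategy-based API proposed here (the strategy spec and names are illustrative, not the final interface):

imputer = Imputer(strategy={'sepal_length': 'mean', 'class_': 'common'})  # hypothetical spec
imputer.fit(train)                # learn a fill-value per column
train = imputer.transform(train)  # fill missing values using the learned values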

Added to the Dataframe:

  • to_records implementation for easy testing (the use of records is very common in the industry); see the sketch after this list.
    • [{key: value, key: value, ...}, ...]
  • countna: counts all missing values in the dataframe.
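A minimal sketch of the two proposed DataFrame additions (the output shapes are illustrative, based on the descriptions above):

df = vaex.ml.datasets.load_iris()
records = df.to_records()  # proposed: a list of dicts, one per row
# e.g. [{'sepal_length': 5.9, 'petal_length': 5.1, ...}, ...]
n_missing = df.countna()   # proposed: total count of missing values in df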

Missing

  • predict, to complete the sklearn API.
  • partial_fit, if possible.
  • show: a way to view all the steps.
  • An API to manipulate steps.
  • A way to "restart" a state in case your "raw" data in production differs from the raw data you started with.

Examples:

import vaex.ml.datasets
from vaex.ml.lightgbm import LightGBMModel

train, test = vaex.ml.datasets.load_iris().split_random(0.8)
train['average_length'] = (train['sepal_length'] + train['petal_length']) / 2
booster = LightGBMModel(features=['average_length', 'petal_width'], target='class_')
booster.fit(train)
train = booster.transform(train)

pipeline = Pipeline.from_dataframe(train)  # Pipeline is the class proposed in this PR
assert "lightgbm_prediction" in pipeline.inference(test)
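Since inference is meant to accept many input types (see the feature list above), a call with a records-style input might look like this (hedged; the exact accepted formats are part of the proposal):

# a single record as a dict; missing columns would be handled by the Imputer
pipeline.inference({'sepal_length': 5.9, 'petal_length': 4.2, 'petal_width': 1.5})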

With fit:

def fit(df):
    # feature engineering plus model training, captured as a single step
    df['average_length'] = (df['sepal_length'] + df['petal_length']) / 2
    booster = LightGBMModel(features=['average_length', 'petal_width'], target='class_')
    booster.fit(df)
    df = booster.transform(df)
    return df

train, test = vaex.ml.datasets.load_iris().split_random(0.8)

pipeline = Pipeline.from_dataframe(train, fit=fit)
pipeline.fit(train)
pipeline.inference(test)  # predictions
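save/load is also listed among the features; a hedged sketch of what that might look like (method names and the file format are assumptions, not the final API):

pipeline.save('pipeline.json')             # persist the state plus the data example
pipeline = Pipeline.load('pipeline.json')  # restore it elsewhere, e.g. in a server
pipeline.inference(test)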

@JovanVeljanoski (Member) left a comment

Hi,

It is a bit difficult to review such a massive PR that contains multiple key points. So here I am focusing on the Imputer class.

In general I like it and I think we should have this. But there are various things that need to be improved.

Here are some general comments to start with:

  • uuid4 is not defined.
  • TransformerBase is not defined.
  • In the methods, ar typically stands for "array". For vaex, it is probably better and more consistent to use expr, which stands for "expression", instead. That will help with maintaining this going forward.
  • What you call "mode" is in fact the median, and what you call "common" is in fact the mode. This makes reviewing the code further quite confusing, so better to address it immediately. (See the short illustration after this list.)
  • Do we really need separate methods for things like mode, median, etc., since they are standard things in vaex and one-liners? Maybe for the mode/common it would make sense, if it is reusable.
  • We have a dedicated docstring for the "prefix" at the top of the transformations.py file. See how the other classes use it.
  • All the methods are public, but none have docstrings. I think any public method (especially those that are newly introduced) should have at least a basic docstring explaining the intent of the method and its params/outputs.
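For clarity, the statistical terms being mixed up (plain numpy/scipy, unrelated to the PR's code):

import numpy as np
from scipy import stats

values = np.array([1, 2, 2, 2, 9])
print(np.median(values))        # 2.0 -> median: the middle value when sorted
print(stats.mode(values).mode)  # 2   -> mode: the most common value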

I'd say let's address this first before moving forward. I am also fine with splitting this PR into 2-3 PRs, but that is up to you.

@JovanVeljanoski (Member) commented

Ok, I've taken over the development here (temporarily?) since I like this and want to push it forward!
It is a fairly big PR so I'm doing this in steps.

So I have refactored the Imputer and improved the tests a bit. It was already in good shape. The (rough) changelog:

  • Added docstrings to all methods, and a class docstring with an example.
  • Fixed missing imports.
  • Changed some variable and function/method names to make them a bit more understandable and easier to support going forward.
  • Removed some of the methods that were too low-level, making it easier (I hope) to follow the code.
  • Added an additional test for the state transfer - currently failing (see the explanation below).
  • Improved the tests -> testing against fixed values (just in case).

There is one problem, as follows:
When doing .transform(df_test), if df_test is missing a column that needs to be imputed by the Imputer, the current behaviour is that the column will be initialized with a constant value (the value being the fill-value for that column).

This is currently not possible when doing state_transfer. I wonder if a fix here is possible; it would be quite nice to have this feature (somewhat specifically tied to the Imputer). I wonder if @maartenbreddels can think of something :)
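A hedged illustration of the behaviour described above (the Imputer API and column names are illustrative):

# df_test has no 'petal_width' column at all
imputer.fit(df_train)                # learns a fill-value per feature
df_out = imputer.transform(df_test)
# 'petal_width' is created as a constant column holding the fill-value,
# which is what cannot currently be expressed via state_transfer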

Also, inspired by @xdssio, I moved the __repr__ method to the base Transformer class; it now shows the name of the Transformer class and its arguments with their values.
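A sketch of what such a __repr__ could look like (vaex.ml transformers are traitlets.HasTraits; this is an assumed implementation, not the actual diff):

import traitlets

class Transformer(traitlets.HasTraits):
    def __repr__(self):
        # class name plus 'arg=value' pairs for all configured traits
        args = ', '.join(f'{k}={v!r}' for k, v in self.trait_values().items())
        return f'{self.__class__.__name__}({args})'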

xdssio and others added 5 commits August 4, 2021 15:32

  • pipeline - to wrap state handling
  • imputer - fillna automatically
  • countna - small feature to count missing values on the entire dataset
  • to_records - used for more standard JSON input/output