Thanks for the amazing initiative!
I am a bit taken aback by how many abstractions ibis-ml duplicates from native ibis. I would expect ibis-ml to be as lightweight an extension of ibis as possible, but that doesn't seem to be the case. Ibis-ml defines its own abstractions, which are not compatible with core ibis.
Here are a few examples that I came across.
Selectors
Ibis-ml has its own abstraction for selectors. For example, the following cast:
ml.Cast(ml.has_type("boolean"), "int8"),
could have been:
ml.Cast(s.of_type("boolean"), "int8"),
Casing
Ibis-ml uses CamelCase for its steps (e.g. ml.Cast), whereas Ibis uses snake_case.
Ibis Pipelines
Most importantly, Ibis pipelines are already lazy and backend independent. So why not reuse those as ML recipes directly?
Ibis-ml could simply either:
- enrich the existing backend transforms with the ML functionality, or
- provide its own proxy backend, which would dispatch to a concrete backend depending on the input data passed to the fit method.
For example:
## Option 1: an Ibis Table is already a deferred recipe, so use it as such:
rcp = (
df
.drop(["approved", "day"])
.mutate(day=_.cast("string"))
.mutate(s.across(s.endswith("_id"), _.cast("string")))
.fill_na(s.of_type("string"))
.mutate(s.across(s.of_type("boolean"), _.cast("int8")))
.ordinal_encode(s.of_type("string"), min_frequency=0.01)
)
tr = rcp.fit() # or rcp.fit(df), or rcp.fit(df_from_other_backend)
## Option 2:
# Start with an ml.recipe pseudo backend
rcp = (
ml.recipe
.drop(["approved", "day"])
.mutate(day=_.cast("string"))
.mutate(s.across(s.endswith("_id"), _.cast("string")))
.fill_na(s.of_type("string"))
.mutate(s.across(s.of_type("boolean"), _.cast("int8")))
.ordinal_encode(s.of_type("string"), min_frequency=0.01)
)
tr = rcp.fit(df)
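To make Option 2 a bit more concrete, here is a minimal, self-contained sketch of what such a recording proxy could look like. Everything in it is hypothetical: `Recipe` and `ToyTable` are illustration-only names, not part of ibis or ibis-ml, and `ToyTable` is just a stand-in for a real backend table. The sketch only records method calls and replays them on whatever object is passed to `fit`; learning statistics (e.g. category frequencies for ordinal encoding) is deliberately left out.

```python
from typing import Any


class Recipe:
    """Hypothetical deferred recipe: records method calls instead of running them."""

    def __init__(self, steps=()):
        self.steps = tuple(steps)

    def __getattr__(self, name: str):
        # Only reached for attributes that do not exist on Recipe itself.
        # Let dunder lookups (copy, pickle, ...) fail in the normal way.
        if name.startswith("__"):
            raise AttributeError(name)

        def record(*args: Any, **kwargs: Any) -> "Recipe":
            # Return a new Recipe so that recipes stay immutable and reusable.
            return Recipe(self.steps + ((name, args, kwargs),))

        return record

    def fit(self, table: Any) -> Any:
        """Replay the recorded calls on the given table, whatever its backend."""
        for name, args, kwargs in self.steps:
            table = getattr(table, name)(*args, **kwargs)
        return table


class ToyTable:
    """Stand-in for a backend table; it only tracks column names."""

    def __init__(self, columns):
        self.columns = list(columns)

    def drop(self, names):
        return ToyTable(c for c in self.columns if c not in names)

    def rename(self, mapping):
        return ToyTable(mapping.get(c, c) for c in self.columns)


# Build the recipe without any data, then bind it to a table at fit time.
rcp = Recipe().drop(["approved", "day"]).rename({"user_id": "user"})
result = rcp.fit(ToyTable(["approved", "day", "user_id", "amount"]))
print(result.columns)  # ['user', 'amount']
```

Because the recipe holds no reference to a table, the same `rcp` could be fit against tables from different backends, which is the dispatch-on-input behaviour described above.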
Does this make sense?