Description
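Currently (as of v0.1.3), the categorical encoders do not handle the infrequent ("other") categories when min_frequency is supplied: categories below the threshold are encoded as all zeros (OneHotEncode) or as NaN (OrdinalEncode). For example: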
import pandas as pd
import ibis
import ibis_ml as ml
con = ibis.duckdb.connect()
df = pd.DataFrame({
    'cat1': ['AA', 'BBB', 'AA', 'BBB', 'CCC'],
    'cat2': ['X', 'Y', 'Y', 'X', 'Z'],
    'value': [10, 20, 30, 40, 50]
})
tbl = con.create_table("tmp", df, overwrite=True)
tr_ohe = ml.Recipe(
    ml.OneHotEncode(ml.string(), min_frequency=2),
).fit(tbl.drop("value"), tbl.value)
tr_ohe.to_ibis(tbl).to_pandas()
#    value  cat1_AA  cat1_BBB  cat2_X  cat2_Y
# 0     10        1         0       1       0
# 1     20        0         1       0       1
# 2     30        1         0       0       1
# 3     40        0         1       1       0
# 4     50        0         0       0       0
tr_oe = ml.Recipe(
    ml.OrdinalEncode(ml.string(), min_frequency=2),
    ml.FillNA(ml.integer(), "OTHER")  # <-- this does not work!!!
).fit(tbl.drop("value"), tbl.value)
tr_oe.to_ibis(tbl).to_pandas()
#    value  cat1  cat2
# 0     10   0.0   0.0
# 1     20   1.0   1.0
# 2     30   0.0   1.0
# 3     40   1.0   0.0
# 4     50   NaN   NaN
(Note that there is a separate issue here: filling NaN does not work on integer columns.)
Would it be possible to extend the encoders so that the "other" value can be specified? More concretely:
- For OHE, add the ability to generate a column that is 1 for all the swallowed categories.
- For OHE and OE, add an "others_value" argument to indicate which value the swallowed categories should assume.
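To make the requested semantics concrete, here is a minimal pure-Python sketch of what the two options could produce for the cat1 column above. It is not ibis-ml code: the helper names (frequent_categories, one_hot_with_other, ordinal_with_other), the other_column parameter, and the defaults "OTHER" and -1 are all hypothetical and only illustrate the idea. Only min_frequency exists today, and "others_value" is the name proposed in this issue.

```python
from collections import Counter


def frequent_categories(values, min_frequency):
    """Return the categories seen at least `min_frequency` times, sorted."""
    counts = Counter(values)
    return sorted(c for c, n in counts.items() if n >= min_frequency)


def one_hot_with_other(values, min_frequency, other_column="OTHER"):
    """One-hot encode, adding one extra column that is 1 for swallowed categories.

    `other_column` is a hypothetical parameter name used for illustration.
    """
    cats = frequent_categories(values, min_frequency)
    known = set(cats)
    rows = []
    for v in values:
        row = {c: int(v == c) for c in cats}
        row[other_column] = int(v not in known)
        rows.append(row)
    return rows


def ordinal_with_other(values, min_frequency, others_value=-1):
    """Ordinal encode, mapping swallowed categories to `others_value` instead of NaN."""
    cats = frequent_categories(values, min_frequency)
    codes = {c: i for i, c in enumerate(cats)}
    return [codes.get(v, others_value) for v in values]


cat1 = ["AA", "BBB", "AA", "BBB", "CCC"]
ohe = one_hot_with_other(cat1, min_frequency=2)
oe = ordinal_with_other(cat1, min_frequency=2)
print(ohe[4])  # the row for 'CCC'
print(oe)
```

With this behaviour the last row ('CCC') would be distinguishable from a row of all zeros in the OHE output, and the OE output would stay an integer column instead of being promoted to float by NaN.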