feat: Add "OTHER" category to the encoders when min_frequency is supplied #178

Open
@vspinu

As of v0.1.3, the categorical encoders do not handle the infrequent ("other") categories when min_frequency is supplied: categories below the threshold are silently dropped (OHE) or left as NaN (OE). For example:

    import pandas as pd
    import ibis
    import ibis_ml as ml

    con = ibis.duckdb.connect()

    df = pd.DataFrame({
        'cat1': ['AA', 'BBB', 'AA', 'BBB', 'CCC'],
        'cat2': ['X', 'Y', 'Y', 'X', 'Z'],
        'value': [10, 20, 30, 40, 50]
    })

    tbl = con.create_table("tmp", df, overwrite=True)

    tr_ohe = ml.Recipe(
        ml.OneHotEncode(ml.string(), min_frequency=2),
    ).fit(tbl.drop("value"), tbl.value)

    tr_ohe.to_ibis(tbl).to_pandas()
    #    value  cat1_AA  cat1_BBB  cat2_X  cat2_Y
    # 0     10        1         0       1       0
    # 1     20        0         1       0       1
    # 2     30        1         0       0       1
    # 3     40        0         1       1       0
    # 4     50        0         0       0       0


    tr_oe = ml.Recipe(
        ml.OrdinalEncode(ml.string(), min_frequency=2),
        ml.FillNA(ml.integer(), "OTHER")       # <-- this does not work!!!
    ).fit(tbl.drop("value"), tbl.value)

    tr_oe.to_ibis(tbl).to_pandas()
    #    value  cat1  cat2
    # 0     10   0.0   0.0
    # 1     20   1.0   1.0
    # 2     30   0.0   1.0
    # 3     40   1.0   0.0
    # 4     50   NaN   NaN

(Note the additional issue that filling NaN values does not work on integer columns.)

Would it be possible to extend the encoders so that the "other" value can be specified? More concretely:

  • For OHE, add the ability to generate a column that is 1 for all the swallowed categories.
  • For OHE and OE, add an "others_value" argument that indicates which value the swallowed categories should assume.
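
To make the requested semantics concrete, here is a minimal pure-Python sketch of the behavior on the data above. It is not ibis_ml code: the helpers `lump_infrequent`, `one_hot`, and `ordinal`, the `"OTHER"` label, and the `-1` code are hypothetical names and defaults chosen only for illustration.

```python
from collections import Counter


def lump_infrequent(values, min_frequency, others_value="OTHER"):
    """Replace categories seen fewer than min_frequency times (hypothetical helper)."""
    counts = Counter(values)
    return [v if counts[v] >= min_frequency else others_value for v in values]


def one_hot(values, prefix):
    """One-hot encode a list of labels into a dict of 0/1 columns."""
    levels = sorted(set(values))
    return {f"{prefix}_{lvl}": [int(v == lvl) for v in values] for lvl in levels}


def ordinal(values, min_frequency, others_value=-1):
    """Ordinal-encode frequent categories; map the rest to others_value."""
    counts = Counter(values)
    frequent = sorted(c for c, n in counts.items() if n >= min_frequency)
    codes = {c: i for i, c in enumerate(frequent)}
    return [codes.get(v, others_value) for v in values]


cat1 = ["AA", "BBB", "AA", "BBB", "CCC"]
cat2 = ["X", "Y", "Y", "X", "Z"]

# OHE: the swallowed categories get their own indicator column.
cat1_lumped = lump_infrequent(cat1, min_frequency=2)
cat2_lumped = lump_infrequent(cat2, min_frequency=2)
encoded = {**one_hot(cat1_lumped, "cat1"), **one_hot(cat2_lumped, "cat2")}
print(encoded["cat1_OTHER"])  # [0, 0, 0, 0, 1]
print(encoded["cat2_OTHER"])  # [0, 0, 0, 0, 1]

# OE: the swallowed categories get an explicit code instead of NaN.
print(ordinal(cat1, min_frequency=2))  # [0, 1, 0, 1, -1]
print(ordinal(cat2, min_frequency=2))  # [0, 1, 1, 0, -1]
```

With this behavior, row 4 is no longer all zeros in the OHE output, and the OE columns stay integer-typed because no NaN is introduced.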
