# Weight Of Evidence encoding

>The following code is for demonstration only. It is **NOT for production** because of system security concerns; please **DO NOT** use it directly in production.

It is recommended to use [jupyter](https://jupyter.org/) to run this tutorial.

Binning creates buckets of independent variables based on ranking methods. It helps us convert continuous variables into categorical ones.
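As a minimal sketch of turning a continuous variable into a categorical one, the snippet below buckets a small array by quartiles using plain NumPy. The sample values and the choice of quantile binning are assumptions made for illustration; this is not the method SecretFlow applies internally.

```python
import numpy as np

# Hypothetical continuous feature; the values are made up for illustration.
values = np.array([0.1, 0.4, 0.5, 0.9, 1.3, 2.0, 2.2, 3.7])

# The three interior quartile boundaries define four buckets of similar size.
split_points = np.quantile(values, [0.25, 0.5, 0.75])

# side="left" makes each bucket left-open and right-closed: (a, b].
bin_index = np.searchsorted(split_points, values, side="left")

print(split_points)
print(bin_index)
```

Each entry of `bin_index` is the bucket number that replaces the original continuous value.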

WOE binning bins numeric variables and factors with respect to a dichotomous (binary) target variable. For each bin, the WOE and IV values are computed as follows:

```
bin_total = bin_positives + bin_negatives
total_labels = total_positives + total_negatives
bin_woe = log((bin_positives / total_positives) / (bin_negatives / total_negatives))
bin_iv = ((bin_positives / total_positives) - (bin_negatives / total_negatives)) * bin_woe
```
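The formulas above can be checked by hand on a toy example. The sketch below evaluates them in plain Python for three bins; the counts and the helper name `woe_iv` are hypothetical, and the code assumes every bin contains at least one positive and one negative sample (empty bins would need special handling that is not shown here).

```python
import math


def woe_iv(bin_positives, bin_negatives):
    """Return per-bin WOE and IV lists, following the formulas above."""
    total_positives = sum(bin_positives)
    total_negatives = sum(bin_negatives)
    woes, ivs = [], []
    for pos, neg in zip(bin_positives, bin_negatives):
        pos_rate = pos / total_positives
        neg_rate = neg / total_negatives
        woe = math.log(pos_rate / neg_rate)
        woes.append(woe)
        ivs.append((pos_rate - neg_rate) * woe)
    return woes, ivs


# Hypothetical label counts for three bins of a single feature.
bin_positives = [10, 30, 60]
bin_negatives = [40, 40, 20]

woes, ivs = woe_iv(bin_positives, bin_negatives)
feature_iv = sum(ivs)  # the feature IV is the sum of its bin IVs

print(woes)
print(ivs)
print(feature_iv)
```

A bin with relatively more positives than negatives gets a positive WOE, and each bin IV is non-negative because both factors in the product share the same sign.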


Currently we provide WOE encoding for vertically partitioned datasets.

Let's first load a sample dataset.

In [None]:
import secretflow as sf
from secretflow.component.core import CompVDataFrame
from secretflow.utils.simulation.datasets import load_linear

import os

os.environ["JAX_PLATFORMS"] = "cpu"



In [None]:
try:
    # Shut down any SecretFlow instance left over from a previous run.
    sf.shutdown()
except Exception:
    pass
sf.init(['alice', 'bob'], address='local')
alice, bob = sf.PYU('alice'), sf.PYU('bob')
spu = sf.SPU(
    sf.utils.testing.cluster_def(
        ['alice', 'bob'],
        {"protocol": "REF2K", "field": "FM128", "fxp_fraction_bits": 40},
    )
)



In [None]:
parts = {
    bob: (1, 11),
    alice: (11, 22),
}
vdf = load_linear(parts=parts)

In [None]:
label_data = vdf['y']
y = sf.reveal(label_data.partitions[alice].data).values

Now, we are ready to perform WOE binning and substitution.

In [None]:
from secretflow.preprocessing.binning.vert_woe_binning import VertWoeBinning
from secretflow.preprocessing.binning.vert_bin_substitution import VertBinSubstitution
from secretflow.component.core.dist_data.vtable_utils import build_schema

binning = VertWoeBinning(spu)
# Note: WOE binning currently only works on CompVDataFrame.
vcomp_table = CompVDataFrame.from_pandas(vdf, build_schema(vdf, labels=set(["y"])))
bin_rules = binning.binning(
    vcomp_table,
    binning_method="chimerge",
    bin_num=4,
    bin_names={alice: [], bob: ["x5", "x7"]},
    label_name="y",
)

woe_sub = VertBinSubstitution()
vcomp_table = woe_sub.substitution(vcomp_table, bin_rules)

vdf = vcomp_table.to_pandas()
# this is for demo only, be careful with reveal
print(sf.reveal(vdf.partitions[alice].data))
print(sf.reveal(vdf.partitions[bob].data))



           x11       x12       x13       x14       x15       x16       x17  \
0     0.241531 -0.705729 -0.020094 -0.486932  0.851992  0.035219 -0.796096   
1    -0.402727  0.115744  0.468149 -0.697152  0.386395  0.712798  0.239583   
2     0.872675 -0.559321  0.390246  0.000472  0.225594 -0.639674  0.279511   
3    -0.644718 -0.409382  0.141747 -0.797517  0.314084 -0.802476  0.348878   
4    -0.949669 -0.940787 -0.951708  0.187475  0.272346  0.124419  0.853226   
...        ...       ...       ...       ...       ...       ...       ...   
9995 -0.031331 -0.078700 -0.020636 -0.575713  0.210120 -0.288943 -0.262945   
9996  0.047039  0.965614 -0.921435 -0.092970  0.205778  0.155392  0.922683   
9997  0.269438 -0.115586  0.928880  0.430016  0.269042 -0.331772  0.520971   
9998  0.999325  0.433372 -0.805999  0.311548  0.072405  0.973399 -0.123470   
9999 -0.203443  0.772931 -0.146181 -0.195646  0.274590  0.803816 -0.312047   

           x18       x19       x20  y  
0     0.810261  0.04830

Sometimes we may need the IV values. Releasing bin IVs can leak label information; see issue https://github.com/secretflow/secretflow/issues/565.
Currently, bin IV values are saved on the label holder's device. The label holder can choose to:

1. share no iv information
2. share some chosen iv information

We will demonstrate how to share the feature ivs.

Recall that `bin_rules` is a dictionary `{PYU: PYUObject}`, where each `PYUObject` is itself a dictionary of the following form:
```
{
    "variables":[
        {
            "name": str, # feature name
            "type": str, # "string" for a discrete feature, "numeric" for a continuous one
            "categories": list[str], # categories of a discrete feature
            "split_points": list[float], # left-open, right-closed split points
            "total_counts": list[int], # total sample count in each bin
            "else_counts": int, # count of np.nan samples
            "filling_values": list[float], # WOE value for each bin
            "else_filling_value": float, # WOE value for np.nan samples
        },
        # ... other features
    ],
    # label holder's PYUObject only
    # warning: giving bin IVs to another party will leak the positive samples in each bin.
    # it is up to the label holder whether to give feature IVs, bin IVs, or all info to workers.
    # for more information, see: https://github.com/secretflow/secretflow/issues/565

    # in the following comments, "safe" means label distribution info is not leaked.
    "feature_iv_info": [
        {
            "name": str, # feature name
            "ivs": list[float], # IV value for each bin; not safe to share with workers in any case
            "else_iv": float, # IV for NaN values; may be shared with workers
            "feature_iv": float, # sum of bin IVs; safe to share with workers when bin num > 2
        }
        }
    ]
}
```
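To make the rule layout concrete, here is a minimal sketch of how one numeric entry of `"variables"` could be applied to raw values outside SecretFlow. The rule contents and the helper `substitute_numeric` are hypothetical, and the code assumes that `"filling_values"` holds one more entry than `"split_points"`; it is not the implementation behind `VertBinSubstitution`.

```python
import numpy as np

# Hypothetical rule for one numeric feature, shaped like an entry of "variables".
rule = {
    "name": "x5",
    "type": "numeric",
    "split_points": [-0.5, 0.0, 0.5],
    "filling_values": [-0.8, -0.1, 0.2, 0.9],  # assumed: len(split_points) + 1 entries
    "else_filling_value": 0.0,
}


def substitute_numeric(values, rule):
    """Replace each value with the WOE of its bin; NaN gets else_filling_value."""
    values = np.asarray(values, dtype=float)
    is_nan = np.isnan(values)
    # Use a placeholder for NaN so the bin lookup never sees it.
    safe = np.where(is_nan, 0.0, values)
    # side="left" gives left-open, right-closed bins: (a, b].
    idx = np.searchsorted(rule["split_points"], safe, side="left")
    filled = np.asarray(rule["filling_values"], dtype=float)[idx]
    return np.where(is_nan, rule["else_filling_value"], filled)


encoded = substitute_numeric([-1.0, -0.5, -0.4, 0.5, 0.6, np.nan], rule)
print(encoded)
```

A value equal to a split point falls into the bin on its left, which matches the left-open, right-closed convention noted in the schema.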

In [None]:
# alice is label holder
dict_pyu_object = bin_rules[alice]


def extract_name_and_feature_iv(list_of_feature_iv_info):
    return [(d["name"], d["feature_iv"]) for d in list_of_feature_iv_info]


feature_ivs = alice(
    lambda dict_pyu_object: extract_name_and_feature_iv(
        dict_pyu_object["feature_iv_info"]
    )
)(dict_pyu_object)

In [None]:
# we can give the feature_ivs to bob
feature_ivs.to(bob)
# and/or we can reveal it to see it
sf.reveal(feature_ivs)

[('x5', 0.37848298069087766), ('x7', 0)]

Congratulations!
In this tutorial we have learned how to:

1. perform WOE encoding
2. share some IV information with other parties
