# Weight Of Evidence encoding

>The following code is for demonstration only. It is **NOT for production** use because of system security concerns; please **DO NOT** use it directly in production.

It is recommended to use [jupyter](https://jupyter.org/) to run this tutorial.

Binning creates buckets of independent variables based on ranking methods. It helps us convert continuous variables into categorical ones.
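
As a rough illustration of rank-based binning, the sketch below computes equal-frequency split points for a small list of values and assigns each value to a bucket. It is plain quantile binning in pure Python, not the `chimerge` method used later in this tutorial, and the helper names `quantile_split_points` and `assign_bin` are hypothetical.

```python
import bisect


def quantile_split_points(values, bin_num):
    """Return bin_num - 1 split points so each bin holds a similar number of samples."""
    ordered = sorted(values)
    n = len(ordered)
    # Cut after every n / bin_num ranked samples.
    return [ordered[(n * k) // bin_num - 1] for k in range(1, bin_num)]


def assign_bin(value, split_points):
    """Map a value to a bin index; bins are left-open and right-closed."""
    return bisect.bisect_left(split_points, value)


# Hypothetical continuous feature values.
values = [0.1, 0.4, 0.35, 0.8, 0.05, 0.6, 0.9, 0.2]
splits = quantile_split_points(values, bin_num=4)
bins = [assign_bin(v, splits) for v in values]

print(splits)
print(bins)
```

Each value is replaced by a bucket index, so the continuous feature becomes a categorical one with four levels.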

WOE binning bins numeric variables and factors with respect to a dichotomous (binary) target variable. For each bin, the weight of evidence (WOE) and the information value (iv) are defined as follows:

```
bin_total = bin_positives + bin_negatives
total_labels = total_positives + total_negatives
bin_woe = log((bin_positives / total_positives) / (bin_negatives / total_negatives))
bin_iv = ((bin_positives / total_positives) - (bin_negatives / total_negatives)) * bin_woe
```
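
The formulas above can be sketched in plain Python as follows. This is a minimal single-machine illustration, not SecretFlow's secure implementation; the function name `woe_iv` and the per-bin counts are hypothetical, and the sketch assumes every bin contains at least one positive and one negative sample (otherwise the logarithm is undefined).

```python
import math


def woe_iv(bin_positives, bin_negatives):
    """Compute per-bin WOE and iv from per-bin positive and negative counts."""
    total_positives = sum(bin_positives)
    total_negatives = sum(bin_negatives)
    woes, ivs = [], []
    for pos, neg in zip(bin_positives, bin_negatives):
        pos_rate = pos / total_positives
        neg_rate = neg / total_negatives
        bin_woe = math.log(pos_rate / neg_rate)
        woes.append(bin_woe)
        ivs.append((pos_rate - neg_rate) * bin_woe)
    return woes, ivs


# Hypothetical counts for three bins of one feature.
woes, ivs = woe_iv(bin_positives=[10, 30, 60], bin_negatives=[40, 30, 30])
feature_iv = sum(ivs)  # the feature iv is the sum of the bin ivs

print(woes)
print(ivs)
print(feature_iv)
```

A bin whose positive and negative rates are equal (the second bin here) gets a WOE of zero and contributes nothing to the feature iv.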


Currently we provide WOE encoding for vertically partitioned datasets.

Let's first load a sample dataset.

In [1]:
import secretflow as sf
from secretflow.data.vertical import VDataFrame
from secretflow.utils.simulation.datasets import load_linear

In [2]:
sf.shutdown()
sf.init(['alice', 'bob'], address='local')
alice, bob = sf.PYU('alice'), sf.PYU('bob')
spu = sf.SPU(sf.utils.testing.cluster_def(['alice', 'bob']))

2023-10-07 18:03:23,909	INFO worker.py:1538 -- Started a local Ray instance.


In [3]:
parts = {
    bob: (1, 11),
    alice: (11, 22),
}
vdf = load_linear(parts=parts)

In [4]:
label_data = vdf['y']
y = sf.reveal(label_data.partitions[alice].data).values

Now, we are ready to perform WOE binning and substitution.

In [5]:
from secretflow.preprocessing.binning.vert_woe_binning import VertWoeBinning
from secretflow.preprocessing.binning.vert_bin_substitution import VertBinSubstitution

binning = VertWoeBinning(spu)
bin_rules = binning.binning(
    vdf,
    binning_method="chimerge",
    bin_num=4,
    bin_names={alice: [], bob: ["x5", "x7"]},
    label_name="y",
)

woe_sub = VertBinSubstitution()
vdf = woe_sub.substitution(vdf, bin_rules)

# this is for demo only, be careful with reveal
print(sf.reveal(vdf.partitions[alice].data))
print(sf.reveal(vdf.partitions[bob].data))

           x11       x12       x13       x14       x15       x16       x17  \
0     0.241531 -0.705729 -0.020094 -0.486932  0.851992  0.035219 -0.796096   
1    -0.402727  0.115744  0.468149 -0.697152  0.386395  0.712798  0.239583   
2     0.872675 -0.559321  0.390246  0.000472  0.225594 -0.639674  0.279511   
3    -0.644718 -0.409382  0.141747 -0.797517  0.314084 -0.802476  0.348878   
4    -0.949669 -0.940787 -0.951708  0.187475  0.272346  0.124419  0.853226   
...        ...       ...       ...       ...       ...       ...       ...   
9995 -0.031331 -0.078700 -0.020636 -0.575713  0.210120 -0.288943 -0.262945   
9996  0.047039  0.965614 -0.921435 -0.09

Sometimes we may need the iv values. Releasing bin ivs can leak label information; see issue https://github.com/secretflow/secretflow/issues/565.
Currently, the bin iv values are saved on the label holder's device. The label holder can choose to:

1. share no iv information
2. share some chosen iv information

We will demonstrate how to share the feature ivs.

Recall that `bin_rules` is a dictionary `{PYU: PYUObject}`, where each `PYUObject` is itself a dictionary of the following form:
```
{
    "variables":[
        {
            "name": str, # feature name
            "type": str, # "string" for a discrete feature, "numeric" for a continuous one
            "categories": list[str], # categories for discrete feature
            "split_points": list[float], # left-open, right-closed split points
            "total_counts": list[int], # total sample count in each bin
            "else_counts": int, # count of np.nan samples
            "filling_values": list[float], # woe value for each bin
            "else_filling_value": float, # woe value for np.nan samples
        },
        # ... other features
    ],
    # present in the label holder's PYUObject only
    # warning: giving bin_ivs to another party will leak the positive samples in each bin.
    # it is up to the label holder whether to give feature ivs, bin ivs, or all info to workers.
    # for more information, look at: https://github.com/secretflow/secretflow/issues/565

    # in the following comments, "safe" means that label distribution info is not leaked.
    "feature_iv_info" :[
        {
            "name": str, # feature name
            "ivs": list[float], # iv value for each bin, not safe to share with workers in any case
            "else_iv": float, # iv for nan values, may be shared with workers
            "feature_iv": float, # sum of bin ivs, safe to share with workers when bin num > 2
        }
    ]
}
```
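
To make the rule structure concrete, the sketch below applies a single numeric rule to a few values by hand. It is not the `VertBinSubstitution` implementation: the rule values and the helper `substitute` are hypothetical, and the sketch assumes that `filling_values` has exactly one more entry than `split_points`.

```python
import bisect
import math

# Hypothetical rule for one numeric feature, following the schema above.
rule = {
    "name": "x5",
    "type": "numeric",
    "split_points": [-0.5, 0.0, 0.5],
    "filling_values": [-0.8, -0.2, 0.3, 0.9],  # assumed: one woe value per bin
    "else_filling_value": 0.0,
}


def substitute(value, rule):
    """Replace a raw value with the woe value of the bin it falls into."""
    if value is None or math.isnan(value):
        return rule["else_filling_value"]
    # bisect_left gives left-open, right-closed bins:
    # a value equal to a split point belongs to the bin on its left.
    idx = bisect.bisect_left(rule["split_points"], value)
    return rule["filling_values"][idx]


raw = [-0.7, -0.5, 0.1, 2.0, float("nan")]
encoded = [substitute(v, rule) for v in raw]
print(encoded)
```

Note that `-0.5` lands in the first bin because the split points are right-closed, and the `nan` sample receives `else_filling_value`.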

In [6]:
# alice is label holder
dict_pyu_object = bin_rules[alice]


def extract_name_and_feature_iv(list_of_feature_iv_info):
    return [(d["name"], d["feature_iv"]) for d in list_of_feature_iv_info]


feature_ivs = alice(
    lambda dict_pyu_object: extract_name_and_feature_iv(
        dict_pyu_object["feature_iv_info"]
    )
)(dict_pyu_object)

In [7]:
# we can give the feature_ivs to bob
feature_ivs.to(bob)
# and/or we can reveal it to see it
sf.reveal(feature_ivs)

[('x5', 0.37848298069087766), ('x7', 0)]
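
Once the feature ivs are revealed, they are ordinary Python values and can be post-processed locally. The sketch below ranks the features by iv and keeps those above a cut-off; the value of `iv_threshold` is a hypothetical choice for illustration, not a SecretFlow default.

```python
# Revealed output from the cell above.
feature_ivs = [("x5", 0.37848298069087766), ("x7", 0)]

iv_threshold = 0.02  # hypothetical cut-off, tune for your own data

# Sort features from most to least informative.
ranked = sorted(feature_ivs, key=lambda item: item[1], reverse=True)
selected = [name for name, iv in ranked if iv >= iv_threshold]

print(ranked)
print(selected)
```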

Congratulations!
In this tutorial we have learnt how to

1. do WOE encoding
2. share some iv information with other parties
