In [1]:
import sys
sys.path.append('..')

import numpy as np
from pctl_scale import PercentileScaler
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older release.
from sklearn.datasets import load_boston

## Boston Demo data
We will use the `RM` variable, the average number of rooms per dwelling.
It is a **ratio-scale** variable.

In [2]:
tmp = load_boston()
x = tmp.data[:, 5]  # column 5 is RM (average number of rooms per dwelling)

We will introduce some missing values and shuffle the data


In [3]:
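x[:100] = np.nan      # mark the first 100 observations as missing
np.random.shuffle(x)  # shuffle in place so the NaNs are spread across the array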

## Keep Missing Values
By default, `PercentileScaler` ignores missing values while fitting. You can disable this with `naignore=None`, but changing the default is not recommended.
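
To see why ignoring missing values during fitting matters, here is a minimal NumPy sketch. It is not `PercentileScaler`'s actual implementation, only an illustration of how a plain percentile and a NaN-aware percentile behave on the same toy array (the array `values` is made up for this example).

```python
import numpy as np

# Toy data with one missing observation (hypothetical values).
values = np.array([1.0, 2.0, 3.0, 4.0, np.nan])

# A plain percentile propagates the NaN, so a fitted bound would be unusable.
plain = np.percentile(values, 50)

# The NaN-aware variant uses the observed values only.
nan_aware = np.nanpercentile(values, 50)

print(plain)      # nan
print(nan_aware)  # 2.5
```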

For demonstration purposes we impute `-123`.

In [4]:
naimpute = -123

scaler = PercentileScaler(upper=.95, lower=.05, naimpute=naimpute)
y = scaler.fit_transform(x)

The fitted percentiles differ from those of the full data, because we removed observations (i.e. set them to NaN).

In [5]:
print("{0:3.0f}% percentile value: {1:8.4f}".format(scaler.lower*100, scaler.pctl_lower))
print("{0:3.0f}% percentile value: {1:8.4f}".format(scaler.upper*100, scaler.pctl_upper))

  5% percentile value:   5.1627
 95% percentile value:   7.6898
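
The shift in percentiles can be reproduced on a small, deterministic array. The sketch below uses plain NumPy rather than `PercentileScaler`, and the array `full` and the choice of which entries to blank out are assumptions made for illustration.

```python
import numpy as np

# Hypothetical complete data: 1.0, 2.0, ..., 10.0
full = np.arange(1, 11, dtype=float)

# Copy the data and set the four smallest values to NaN.
partial = full.copy()
partial[:4] = np.nan

# 5% and 95% percentiles over the observed (non-NaN) values only.
lo_full, hi_full = np.nanpercentile(full, [5, 95])
lo_part, hi_part = np.nanpercentile(partial, [5, 95])

print("full data:    {0:.2f} {1:.2f}".format(lo_full, hi_full))  # 1.45 9.55
print("with missing: {0:.2f} {1:.2f}".format(lo_part, hi_part))  # 5.25 9.75
```

Removing the low values raises both bounds, which is why the fitted percentiles depend on which observations are missing.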


Let's check what happened to the missing values.

In [6]:
y[np.isnan(x)]

array([-123., -123., -123., -123., -123., -123., -123., -123., -123.,
       -123., -123., -123., -123., -123., -123., -123., -123., -123.,
       -123., -123., -123., -123., -123., -123., -123., -123., -123.,
       -123., -123., -123., -123., -123., -123., -123., -123., -123.,
       -123., -123., -123., -123., -123., -123., -123., -123., -123.,
       -123., -123., -123., -123., -123., -123., -123., -123., -123.,
       -123., -123., -123., -123., -123., -123., -123., -123., -123.,
       -123., -123., -123., -123., -123., -123., -123., -123., -123.,
       -123., -123., -123., -123., -123., -123., -123., -123., -123.,
       -123., -123., -123., -123., -123., -123., -123., -123., -123.,
       -123., -123., -123., -123., -123., -123., -123., -123., -123.,
       -123.])

## What should we impute?

* If `PercentileScaler` is used to scale a variable into the $(0,1)$ interval, then `naimpute=0` would be OK.
* A model/algorithm might then ignore the missing value, because multiplying a model parameter by `0` contributes nothing.
* Such behavior can be interpreted as an implied dummy variable for missing values.
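
The points above can be sketched with a toy linear model. Everything in this block is hypothetical (the scaled values, `intercept`, and `weight` are invented for illustration, and the imputation is done with plain NumPy rather than `PercentileScaler`); it only shows that a feature imputed with `0` drops out of the weighted sum.

```python
import numpy as np

naimpute = 0.0

# Hypothetical feature already scaled into (0, 1); NaN marks a missing value.
scaled = np.array([0.2, np.nan, 0.9])
imputed = np.where(np.isnan(scaled), naimpute, scaled)

# Hypothetical linear model: prediction = intercept + weight * feature.
intercept, weight = 1.5, 2.0
pred = intercept + weight * imputed

# For the missing row the feature term is weight * 0 = 0,
# so the prediction falls back to the intercept alone.
print(pred[1] == intercept)  # True
```

In this sketch the missing observation is handled as if a separate "is missing" indicator had switched the feature off.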
