[FEA] Add "median" to TargetEncoder #4722

Merged
merged 31 commits into from
Sep 7, 2022
Commits
cf87af4
Merge pull request #15 from rapidsai/branch-0.15
daxiongshu Jul 26, 2020
e3b7848
Merge pull request #18 from rapidsai/branch-0.17
daxiongshu Nov 1, 2020
e6d8ec3
Merge pull request #19 from rapidsai/branch-0.17
daxiongshu Nov 17, 2020
8b1b7c3
Merge pull request #20 from rapidsai/branch-0.18
daxiongshu Dec 28, 2020
7a51c5a
Merge pull request #22 from rapidsai/branch-0.19
daxiongshu Feb 18, 2021
d7a2f60
Merge pull request #23 from rapidsai/branch-0.19
daxiongshu Feb 27, 2021
6366e9e
Merge pull request #27 from rapidsai/branch-21.06
daxiongshu May 19, 2021
7757342
Merge branch 'rapidsai:branch-21.10' into branch-21.10
daxiongshu Aug 31, 2021
6349067
Merge pull request #32 from rapidsai/branch-22.02
daxiongshu Dec 16, 2021
6c408e5
Merge branch 'rapidsai:branch-22.02' into branch-22.02
daxiongshu Jan 11, 2022
3f4b89d
Merge branch 'rapidsai:branch-22.04' into branch-22.04
daxiongshu Feb 14, 2022
e305ae3
Merge branch 'rapidsai:branch-22.04' into branch-22.04
daxiongshu Feb 16, 2022
d3ee54a
Merge branch 'rapidsai:branch-22.04' into branch-22.04
daxiongshu Feb 18, 2022
39311f9
Merge branch 'rapidsai:branch-22.04' into branch-22.04
daxiongshu Feb 23, 2022
591ad28
Merge branch 'rapidsai:branch-22.04' into branch-22.04
daxiongshu Mar 5, 2022
c6ae54d
Merge branch 'rapidsai:branch-22.06' into branch-22.06
daxiongshu Apr 8, 2022
7f340a6
Merge branch 'rapidsai:branch-22.06' into branch-22.06
daxiongshu Apr 9, 2022
c821d78
Merge branch 'rapidsai:branch-22.06' into branch-22.06
daxiongshu Apr 14, 2022
0e29796
first commit
daxiongshu Apr 16, 2022
d6b031d
docstring
daxiongshu Apr 17, 2022
c281b4d
start for loop
daxiongshu Apr 18, 2022
24864fa
TODO: change impute_and_sort
daxiongshu Apr 19, 2022
02c5ddf
basic works
daxiongshu May 4, 2022
e2a534d
fit_transform test passed
daxiongshu May 4, 2022
2aa8cca
transform test passed
daxiongshu May 4, 2022
b68853b
Merge branch 'branch-22.06' of https://github.com/rapidsai/cuml into …
daxiongshu May 4, 2022
86a3397
Merge branch 'rapidsai-branch-22.0t push origin fea_TE_median6' into …
daxiongshu May 4, 2022
c835151
Merge branch 'rapidsai:branch-22.06' into fea_TE_median
daxiongshu Jun 1, 2022
f9015bb
Merge pull request #44 from daxiongshu/branch-22.10
daxiongshu Sep 1, 2022
dd38fd0
get_stat_func
daxiongshu Sep 2, 2022
1dbe878
fix style
daxiongshu Sep 2, 2022
47 changes: 37 additions & 10 deletions python/cuml/preprocessing/TargetEncoder.py
@@ -22,6 +22,16 @@
 import warnings


+def get_stat_func(stat):
+    def func(ds):
+        if hasattr(ds, stat):
+            return getattr(ds, stat)()
+        else:
+            # implement stat
+            raise ValueError(f'{stat} function is not implemented.')
+    return func
+
+
 class TargetEncoder:
     """
     A cudf based implementation of target encoding [1]_, which converts
@@ -52,8 +62,9 @@ class TargetEncoder:
         in `fit()` or `fit_transform()` functions.
     output_type : {'cupy', 'numpy', 'auto'}, default = 'auto'
         The data type of output. If 'auto', it matches input data.
-    stat : {'mean','var'}, default = 'mean'
-        The statistic used in encoding, mean or variance of the target.
+    stat : {'mean','var','median'}, default = 'mean'
+        The statistic used in encoding, mean, variance or median of the
+        target.

     References
     ----------
@@ -93,8 +104,8 @@ def __init__(self, n_folds=4, smooth=0, seed=42,
                    " or 'numpy' or 'auto', "
                    "got {0}.".format(output_type))
             raise ValueError(msg)
-        if stat not in {'mean', 'var'}:
-            msg = ("stat should be either 'mean' or 'var'."
+        if stat not in {'mean', 'var', 'median'}:
+            msg = ("stat should be 'mean', 'var' or 'median'."
                    f"got {stat}.")
             raise ValueError(msg)

@@ -232,15 +243,15 @@ def _fit_transform(self, x, y, fold_ids):
         self.n_folds = min(self.n_folds, len(train))
         train[self.fold_col] = self._make_fold_column(len(train), fold_ids)

-        self.mean = train[self.y_col].mean()
+        self.y_stat_val = get_stat_func(self.stat)(train[self.y_col])
Contributor Author

@dantegd what do you think of the change here? Thank you.

+        if self.stat in ['median']:
+            return self._fit_transform_for_loop(train, x_cols)
Comment on lines +247 to +248
Member

I think a better design would be to put the logic for calculating each statistic in its own function (not only the median or the other for-loop-based ones) and to call those functions from the init. That would also make the code easier to test.

Contributor Author

Thank you for the comment. Do you mean something like the following to separate the logic?

def get_stat_func(stat):
    def func(ds):
        if hasattr(ds, stat):
            return getattr(ds, stat)()
        else:
            # implement stat; until then, fail loudly
            raise ValueError(f'{stat} function is not implemented.')
    return func

func = get_stat_func(self.stat)
self.y_stat_val = func(train[self.y_col])
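
A minimal sketch of how this dispatch behaves, runnable without a GPU. `TinySeries` is a hypothetical stand-in for a `cudf.Series` (it is not part of cuml or cudf) and exposes only `mean` and `median`. Raising `ValueError` for an unsupported statistic is an assumption about how that case should be reported.

```python
import statistics


class TinySeries:
    """Hypothetical stand-in for cudf.Series, exposing only two statistics."""

    def __init__(self, values):
        self.values = list(values)

    def mean(self):
        return statistics.mean(self.values)

    def median(self):
        return statistics.median(self.values)


def get_stat_func(stat):
    # Return a callable that looks up the method named `stat` on its argument.
    def func(ds):
        if hasattr(ds, stat):
            return getattr(ds, stat)()
        # Assumption: report unsupported statistics with a ValueError.
        raise ValueError(f'{stat} function is not implemented.')
    return func


ds = TinySeries([1, 22, 15, 17])
median_value = get_stat_func('median')(ds)  # average of 15 and 17
mean_value = get_stat_func('mean')(ds)

try:
    get_stat_func('var')(ds)  # TinySeries has no `var` method
    var_error = None
except ValueError as exc:
    var_error = str(exc)
```

Because the statistic is resolved by name, each one can be exercised in isolation with a small input, which is the kind of testability the review comment above asks for.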


self.mean = train[self.y_col].mean()
if self.stat == 'var':
y_cols = [self.y_col, self.y_col2]
train[self.y_col2] = self._make_y_column(y*y)
self.mean2 = train[self.y_col2].mean()
var = self.mean2 - self.mean**2
n = train.shape[0]
self.var = var * n / (n-1)
else:
y_cols = [self.y_col]
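
The `'var'` lines shown in this hunk rely on the identity Var(y) = E[y^2] - E[y]^2, followed by the n / (n - 1) factor that turns the population variance into the sample variance. A small check of that arithmetic, as a sketch that uses made-up target values and the Python standard library instead of cudf:

```python
import math
import statistics

y = [1.0, 22.0, 15.0, 17.0]  # made-up target values for illustration
n = len(y)

mean = sum(y) / n                      # E[y]
mean2 = sum(v * v for v in y) / n      # E[y^2]
population_var = mean2 - mean ** 2     # E[y^2] - E[y]^2
sample_var = population_var * n / (n - 1)

# statistics.variance computes the sample (n - 1) variance directly.
reference = statistics.variance(y)
matches = math.isclose(sample_var, reference)
```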

@@ -277,6 +288,23 @@ def _fit_transform(self, x, y, fold_ids):
         del encode_each_fold
         return self._impute_and_sort(train), train

+    def _fit_transform_for_loop(self, train, x_cols):
+
+        def _rename_col(df, col):
+            df.columns = [col]
+            return df.reset_index()
+
+        res = []
+        for f in train[self.fold_col].unique().values_host:
+            mask = train[self.fold_col] == f
+            dg = train.loc[~mask].groupby(x_cols).agg({self.y_col: self.stat})
+            dg = _rename_col(dg, self.out_col)
+            res.append(train.loc[mask].merge(dg, on=x_cols, how='left'))
+        res = cudf.concat(res, axis=0)
+        self.encode_all = train.groupby(x_cols).agg({self.y_col: self.stat})
+        self.encode_all = _rename_col(self.encode_all, self.out_col)
+        return self._impute_and_sort(res), train
+
     def _make_y_column(self, y):
         """
         Create a target column given y
@@ -387,9 +415,8 @@ def _impute_and_sort(self, df):
         """
         Impute and sort the result encoding in the same row order as input
         """
-        impute_val = self.var if self.stat == 'var' else self.mean
         df[self.out_col] = df[self.out_col].nans_to_nulls()
-        df[self.out_col] = df[self.out_col].fillna(impute_val)
+        df[self.out_col] = df[self.out_col].fillna(self.y_stat_val)
         df = df.sort_values(self.id_col)
         res = df[self.out_col].values.copy()
         if self.output_type == 'numpy':
18 changes: 17 additions & 1 deletion python/cuml/tests/test_target_encoder.py
@@ -54,7 +54,7 @@ def test_targetencoder_transform():

 @pytest.mark.parametrize('n_samples', [5000, 500000])
 @pytest.mark.parametrize('dtype', [np.int32, np.int64, np.float32, np.float64])
-@pytest.mark.parametrize('stat', ['mean', 'var'])
+@pytest.mark.parametrize('stat', ['mean', 'var', 'median'])
 def test_targetencoder_random(n_samples, dtype, stat):

     x = cp.random.randint(0, 1000, n_samples).astype(dtype)
@@ -277,3 +277,19 @@ def test_get_params():
     p2 = encoder.get_params()
     for k, v in params.items():
         assert v == p2[k]
+
+
+def test_targetencoder_median():
Member

The test is fine. Building on my comment about separating the calculation logic from the init, we could also add lower-level tests for each stat. What do you think?

+    train = cudf.DataFrame({'category': ['a', 'a', 'a', 'a',
+                                         'b', 'b', 'b', 'b'],
+                            'label': [1, 22, 15, 17, 70, 9, 99, 56]})
+    encoder = TargetEncoder(stat='median')
+    train_encoded = encoder.fit_transform(train.category, train.label)
+    answer = np.array([17., 15., 17., 15., 56., 70., 56., 70.])
+    assert array_equal(train_encoded, answer)
+
+    encoder = TargetEncoder(stat='median')
+    encoder.fit(train.category, train.label)
+    train_encoded = encoder.transform(train.category)
+
+    assert array_equal(train_encoded, answer)
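
The expected `answer` in `test_targetencoder_median` can be reproduced by hand. The sketch below is plain Python rather than cudf, and it assumes that the 8 rows are assigned to the default 4 folds in interleaved order (row `i` goes to fold `i % 4`); that fold assignment is an assumption, not something shown in this diff. For each fold, every row is encoded with the median of the labels from the other folds that share its category, mirroring the loop in `_fit_transform_for_loop`.

```python
import statistics

categories = ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b']
labels = [1, 22, 15, 17, 70, 9, 99, 56]
n_folds = 4

# Assumption: interleaved fold assignment, row i -> fold i % n_folds.
fold_ids = [i % n_folds for i in range(len(labels))]

encoded = [None] * len(labels)
for fold in sorted(set(fold_ids)):
    for i, (cat, row_fold) in enumerate(zip(categories, fold_ids)):
        if row_fold != fold:
            continue
        # Labels of the same category taken from all other folds.
        others = [
            labels[j]
            for j in range(len(labels))
            if fold_ids[j] != fold and categories[j] == cat
        ]
        # A category unseen outside the fold would need imputation;
        # that case does not occur in this data.
        encoded[i] = float(statistics.median(others))

print(encoded)
# [17.0, 15.0, 17.0, 15.0, 56.0, 70.0, 56.0, 70.0]
```

With one row per category in each fold, this reduces to a leave-one-out median within each category, which is why the encoded values are always one of the remaining three labels.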