Quantile encoder #303

cmougan · 2021-05-31T06:10:53Z

This PR (#302) implements two methods from a paper recently published at a conference (MDAI 2021).

Quantile Encoder: Tackling High Cardinality Categorical Features in Regression Problems (Carlos Mougan, David Masip, Jordi Nin, Oriol Pujol)

Encoding methods (the full technical development can be followed in the paper):

Quantile Encoder
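
For readers who have not opened the paper, here is a minimal pure-Python sketch of the core idea as the title describes it: each category is replaced by a quantile of the target values observed for that category. The helper names (`quantile`, `fit_quantile_encoding`) and the toy data are hypothetical, and any regularization described in the paper is deliberately left out, so this is an illustration rather than the code in this PR.

```python
from collections import defaultdict


def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def fit_quantile_encoding(categories, targets, q=0.5):
    """Map each category to the q-quantile of its target values (no smoothing)."""
    grouped = defaultdict(list)
    for cat, y in zip(categories, targets):
        grouped[cat].append(y)
    return {cat: quantile(ys, q) for cat, ys in grouped.items()}


# Toy data (hypothetical): a categorical feature and a numeric target.
cities = ["a", "a", "a", "b", "b"]
prices = [10.0, 20.0, 90.0, 5.0, 7.0]

mapping = fit_quantile_encoding(cities, prices, q=0.5)
encoded = [mapping[c] for c in cities]
```

With `q=0.5` this is a per-category median, which is less sensitive to outliers such as the `90.0` above than a mean-based target encoding would be.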

Tests are implemented and pass.
Follows scikit-learn API semantics.
Docs are extended.

If I missed something, or if you have any comments, please let me know :)

cmougan · 2021-06-14T10:33:10Z

@janmotl @wdm0006

PaulWestenthanner · 2021-10-08T16:08:52Z

Hi @cmougan,
Sorry for the late reply. I've read through the paper today. Great piece of work!
I've noticed you've pushed the code to sktools in the meantime (which also uses this repo). Is this PR still relevant? Would you prefer to have the code merged into here and remove it from sktools?

cmougan · 2021-10-09T13:37:22Z

Hi @PaulWestenthanner

Many thanks for your answer.

We would be very happy to contribute to your package. It makes perfect sense to have it as another method in category_encoders.

So, yes please, let's merge.

The only reason to keep it alive in sktools is that the conference proceedings paper points there. Still, we would like to merge.

Let us know what we can do.

category_encoders/quantile_encoder.py

docs/source/index.rst

PaulWestenthanner · 2021-10-09T22:24:47Z

Hi @cmougan
I just reviewed your code and suggested some minor changes, many of them rather cosmetic. Overall the code is quite nice and will be a good improvement to the library.
Apart from the comments, I'd like to point out that I'm not quite happy with the copy-paste of some functionality (e.g. dropping low-variance columns). However, I think it's okay to just copy-paste here, since most other encoders did the same and we'll need a separate issue to clean it up.
Regarding the SummaryEncoder: if you want to you can move it over here as well.

cmougan · 2021-10-10T18:06:40Z

@david26694

cmougan · 2021-10-10T18:21:17Z

Many thanks @PaulWestenthanner for the detailed review.

I just pushed the Summary Encoder. If I have not missed anything, everything should be there.

category_encoders/quantile_encoder.py

PaulWestenthanner · 2021-10-11T19:36:29Z

category_encoders/quantile_encoder.py

+        for quantile in self.quantiles:
+            for col in self.cols:
+                percentile = round(quantile * 100)
+                X[f"{col}_{percentile}"] = X[col]


If you just add this in the for loop below (either before or after the col_names.append(...)), you don't need the two loops here and don't need to calculate the percentile.
Note that the comment on coinciding quantiles below also applies here.
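
As a rough illustration of the single-pass idea, the sketch below builds each derived column name once and uses it both for copying the column and for recording the name. The loop "below" is not shown in this thread, so the function name `expand_columns`, the `col_names` list, and the dict-of-lists stand-in for a DataFrame are all assumptions, not the PR's actual code.

```python
def expand_columns(X, cols, quantiles):
    """Copy each column once per quantile and return the new column names.

    X is a plain dict of lists standing in for a DataFrame (assumption).
    """
    col_names = []
    for quantile in quantiles:
        percentile = round(quantile * 100)
        for col in cols:
            # Build the name once, then reuse it for the copy and the bookkeeping.
            name = "{}_{}".format(col, percentile)
            X[name] = list(X[col])
            col_names.append(name)
    return col_names


X = {"city": ["a", "b"]}
names = expand_columns(X, ["city"], [0.25, 0.5])
```

If two quantiles round to the same percentile, the second copy silently overwrites the first under the same name, which is why the coinciding-quantiles remark applies here too.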

docs/source/summary.rst

tests/test_quantile_encoder.py

PaulWestenthanner · 2021-10-11T20:06:46Z

Thanks for the update. I've added a second round of comments.
Note that formatting the __init__.py introduced some merge conflicts. Please resolve them as well.

category_encoders/quantile_encoder.py

Throw error in case of two quantiles with same percentile
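
A minimal sketch of such a check, assuming column names are derived from round(quantile * 100) as in the snippet quoted earlier in this review. The function name `check_distinct_percentiles` and the error message are hypothetical; the validation actually committed in the PR may differ.

```python
def check_distinct_percentiles(quantiles):
    """Raise ValueError if two quantiles round to the same integer percentile."""
    seen = {}
    for q in quantiles:
        percentile = round(q * 100)
        if percentile in seen:
            raise ValueError(
                "quantiles {} and {} both map to percentile {}".format(
                    seen[percentile], q, percentile
                )
            )
        seen[percentile] = q
    return sorted(seen)


ok = check_distinct_percentiles([0.25, 0.75])

try:
    check_distinct_percentiles([0.251, 0.25])
    collided = False
except ValueError:
    collided = True
```

Failing early keeps two different quantiles from being written to the same output column.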

cmougan · 2021-10-12T19:41:40Z

Hi @PaulWestenthanner, I believe we have now covered your comments. We also reviewed the code for anything we might have left out.

Do you see anything else worth improving, or anything we missed?

Many thanks for your suggestions, we really appreciate them :)

PaulWestenthanner · 2021-10-12T22:40:03Z

The code looks fine to me now. I'm happy to merge as soon as we get the pipeline to run. The issue at the moment is that we support Python 3.5, which does not support f-strings yet. Since f-strings are awesome, I'd recommend using the future-fstrings package. See this SO post for reference: https://stackoverflow.com/questions/55182209/f-string-invalid-syntax-in-python-3-5
You'll probably need to add this package to the requirements(-dev).txt
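
An alternative that needs no extra dependency is to spell the same string with str.format or %-formatting, both of which work on Python 3.5. The variable names below are illustrative only and mirror the column-name pattern from the reviewed snippet.

```python
col = "city"
percentile = 25

# f-string form (Python 3.6+ only): f"{col}_{percentile}"
name_format = "{}_{}".format(col, percentile)   # works on Python 3.5
name_percent = "%s_%d" % (col, percentile)      # also works on Python 3.5
```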

PaulWestenthanner · 2021-10-14T16:05:00Z

Maybe we should drop support for python 3.5 since it is no longer officially supported anyway. But that would be out of scope for this PR

Refactor summary encoder

cmougan · 2021-10-15T22:46:00Z

@PaulWestenthanner thanks again!

cmougan · 2021-10-16T09:26:43Z

Hi @PaulWestenthanner,

After some discussion and testing with @david26694, we noticed that the summary encoder does not pass most of the tests. We see the following options:

1. Fix the summary encoder to pass all the tests. This seems like a lot of work.
2. Use the same strategy as with QE and copy-paste a lot of code.
3. Don't add the Summary Encoder in this PR.
4. Merge without passing the tests.

PaulWestenthanner · 2021-10-17T21:21:05Z

Hi @cmougan @david26694

Thanks for adding the tests. This was a good catch!
I took the time to fix the tests. This was mainly because the iterative structure did not comply with other encoders.
Please find my code here: https://github.com/PaulWestenthanner/categorical-encoding/tree/quantileEncoder
It's a little mix of option 1 and 2 since a little code is copied. However, the part of the code that pre-processes X and y and checks for nulls and so on is copied for more or less any encoder in the repo. This issue should be tackled at some point but not now.

I'd suggest we merge my branch into yours and then into master?

fixed tests for summary encoder

cmougan · 2021-10-17T21:30:21Z

Hi @PaulWestenthanner @david26694

Wow, impressive coding.

Just merged your branch.

PaulWestenthanner · 2021-10-19T18:34:04Z

The tests still fail for python 3.5 since f-strings are used. I fixed that in the code but forgot the tests. Would you mind fixing this @cmougan ?

cmougan · 2021-10-19T20:13:22Z

@PaulWestenthanner and now?

category_encoders/quantile_encoder.py

PaulWestenthanner · 2021-10-19T22:12:01Z

Sorry, I missed one of the tests. I've added a comment at the appropriate position.

cmougan · 2021-10-20T06:28:50Z

@PaulWestenthanner I was trying to debug that, very unsuccessfully. Many thanks.

PaulWestenthanner · 2021-10-20T06:43:58Z

Great! Thanks for your effort. LGTM & Merge

cmougan added 5 commits May 31, 2021 08:01

#302 quantileEncoder and SummaryEncoder

284f378

#302 test for QE and SE - passing

55e00f4

Quantile Encoder and Summary Encoder

591257d

Quantile Encoder and Summary Encoder update docs

21bbb24

#302 Quantile Encoder and Summary Encoder update docs

da6de04

cmougan added 2 commits June 17, 2021 08:39

doc QE

56ca905

remove summary encoder

e3ea3e7

PaulWestenthanner mentioned this pull request Oct 8, 2021

Seeking maintainers #248

Open

PaulWestenthanner self-requested a review October 9, 2021 20:37

PaulWestenthanner self-assigned this Oct 9, 2021

PaulWestenthanner reviewed Oct 9, 2021

cmougan and others added 6 commits October 10, 2021 10:55

Update quantile_encoder.py

40a8a1c

remove unnecesary imports

b06f108

qe cosmetic issues

c72e73f

m bio

828e518

formatting

4df3bf1

summary encoder

64d1d5c

cmougan added 2 commits October 10, 2021 20:08

summary encoder

2032815

e

7a6da5b

PaulWestenthanner reviewed Oct 11, 2021

category_encoders/quantile_encoder.py (outdated, resolved)

PaulWestenthanner reviewed Oct 11, 2021

category_encoders/quantile_encoder.py (outdated, resolved)

change name Summary Encoder

ae7478a

cmougan and others added 4 commits October 12, 2021 17:43

Merge branch 'master' into quantileEncoder

d9ff993

test_summary_quantile

70d46e5

Throw error in case of two quantiles with same percentile

3d11c8e

Merge pull request #1 from david26694/quantileEncoder

3d5c91c

Throw error in case of two quantiles with same percentile

david26694 and others added 6 commits October 15, 2021 16:18

Refactor summary encoder

bbe1a15

Fix failing tests QE

86173c6

Add default arguments to SE

60ddb4f

Parametrise summary encoder

979c774

Add summary encoder in all QE tests

29f7c0d

Merge pull request #2 from david26694/quantileEncoder

741a21e

Refactor summary encoder

fixed tests for summary encoder

6618f26

Merge pull request #3 from PaulWestenthanner/quantileEncoder

b4b814f

fixed tests for summary encoder

cmougan added 2 commits October 19, 2021 21:09

add future string to support python3.5 for summary encoder test

b9bf00f

remove fstring from QE tests

3292dd1

PaulWestenthanner reviewed Oct 19, 2021

category_encoders/quantile_encoder.py (resolved)

handling coinciding quantiles for SE

c88e53e

PaulWestenthanner merged commit 66d89c2 into scikit-learn-contrib:master Oct 20, 2021

PaulWestenthanner mentioned this pull request Oct 20, 2021

[new features]: Quantile Encoder #302

Closed
