Quantile encoder #303

Merged
merged 32 commits into from Oct 20, 2021

cmougan
Contributor

@cmougan cmougan commented May 31, 2021

This PR (#302) implements two methods from a paper recently published at a conference (MDAI 2021).

Quantile Encoder: Tackling High Cardinality Categorical Features in Regression Problems (Carlos Mougan, David Masip, Jordi Nin, Oriol Pujol)

Encoding methods (the full technical development can be followed in the paper):

  • Quantile Encoder

Tests are implemented and pass.
The code follows scikit-learn API semantics.
The docs are extended.

If I missed something or you have any comments, please let me know :)
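
As a rough illustration of the idea behind the encoder, and not the code in this PR, each category can be replaced by a quantile of the target values observed for it, shrunk toward the global quantile. The smoothing formula, the parameter names `q` and `m`, and the helper names below are assumptions made for this sketch; the paper remains the reference for the actual method.

```python
from collections import defaultdict


def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def fit_quantile_encoding(categories, targets, q=0.5, m=1.0):
    """Map each category to a smoothed quantile of its target values.

    Assumed smoothing (hypothetical, not taken from this thread):
        (n * category_quantile + m * global_quantile) / (n + m)
    where n is the number of rows in the category.
    """
    groups = defaultdict(list)
    for cat, y in zip(categories, targets):
        groups[cat].append(y)
    prior = quantile(list(targets), q)  # global quantile used as the prior
    return {
        cat: (len(ys) * quantile(ys, q) + m * prior) / (len(ys) + m)
        for cat, ys in groups.items()
    }


# Toy data: a high-valued and a low-valued category.
cities = ["a", "a", "a", "b", "b"]
prices = [10.0, 20.0, 30.0, 100.0, 200.0]
encoding = fit_quantile_encoding(cities, prices, q=0.5, m=1.0)
```

With a larger `m`, categories with few rows are pulled more strongly toward the global quantile, which is the usual motivation for this kind of smoothing on high-cardinality features.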

@cmougan
Contributor Author

cmougan commented Jun 14, 2021

@janmotl @wdm0006

@PaulWestenthanner
Collaborator

Hi @cmougan,
Sorry for the late reply. I've read through the paper today. Great piece of work!
I've noticed you've pushed the code to sktools in the meantime (which also uses this repo). Is this PR still relevant? Would you prefer to have the code merged here and removed from sktools?

@cmougan
Contributor Author

cmougan commented Oct 9, 2021

Hi @PaulWestenthanner

Many thanks for your answer.

We would be very happy to contribute to your package. It makes perfect sense to have it as another method in category_encoders.

So, yes please, let's merge.

The only reason to keep it alive in sktools is that the conference proceedings paper points there. Still, we would like to merge.

Let us know what we can do.

@PaulWestenthanner PaulWestenthanner self-assigned this Oct 9, 2021
category_encoders/quantile_encoder.py (outdated, resolved)
category_encoders/quantile_encoder.py (outdated, resolved)
category_encoders/quantile_encoder.py (resolved)
category_encoders/quantile_encoder.py (outdated, resolved)
category_encoders/quantile_encoder.py (outdated, resolved)
category_encoders/quantile_encoder.py (outdated, resolved)
category_encoders/quantile_encoder.py (outdated, resolved)
category_encoders/quantile_encoder.py (outdated, resolved)
docs/source/index.rst (resolved)
@PaulWestenthanner
Collaborator

Hi @cmougan
I just reviewed your code and suggested some minor changes, many of them rather cosmetic. Overall the code is quite nice and will be a good improvement to the library.
Apart from the comments, I'd like to point out that I'm not quite happy with the copy-pasting of some functionality (e.g. dropping low-variance columns). However, I think it's okay to just copy-paste here, since most other encoders did the same and we'll need a separate issue to clean it up.
Regarding the SummaryEncoder: if you want to, you can move it over here as well.

@cmougan
Contributor Author

cmougan commented Oct 10, 2021

@david26694

@cmougan
Contributor Author

cmougan commented Oct 10, 2021

Many thanks @PaulWestenthanner for the detailed review.

I just pushed the Summary Encoder. If I have not missed anything, everything should be there.

category_encoders/quantile_encoder.py (outdated, resolved)
category_encoders/quantile_encoder.py (outdated, resolved)
category_encoders/quantile_encoder.py (outdated, resolved)
for quantile in self.quantiles:
    for col in self.cols:
        percentile = round(quantile * 100)
        X[f"{col}_{percentile}"] = X[col]
Collaborator

If you just add this in the for loop below (either before or after the col-names.append(...)), you don't need the two loops here and don't need to calculate the percentile.
Note that the comment on coinciding quantiles below also applies here.
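
One possible reading of this suggestion, as a sketch only: build each output name in a single pass and fail early when two quantiles round to the same percentile. The function `summary_column_names` is hypothetical and is not taken from the PR; only the `{col}_{percentile}` naming pattern comes from the snippet quoted above.

```python
def summary_column_names(cols, quantiles):
    """Return one output column name per (quantile, column) pair.

    Hypothetical helper: raises ValueError when two quantiles round to the
    same percentile, since their output columns would overwrite each other.
    """
    names = []
    for quantile in quantiles:
        percentile = round(quantile * 100)
        for col in cols:
            name = "{}_{}".format(col, percentile)
            if name in names:
                raise ValueError("quantiles coincide after rounding: " + name)
            names.append(name)
    return names


names = summary_column_names(["city"], [0.25, 0.75])
```

For instance, quantiles 0.499 and 0.501 both round to percentile 50, so without such a check the second encoded column would silently replace the first.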

docs/source/summary.rst (outdated, resolved)
tests/test_quantile_encoder.py (outdated, resolved)
@PaulWestenthanner
Collaborator

Thanks for the update. I've added a second round of comments.
Note that formatting the __init__.py introduced some merge conflicts. Please resolve them as well.

@cmougan
Contributor Author

cmougan commented Oct 12, 2021

Hi @PaulWestenthanner, I believe we have just covered your comments. We also reviewed the code for things we might have left out.

Do you see anything else worth improving, or anything we missed?

Many thanks for your suggestions, we really appreciate them :)

@PaulWestenthanner
Collaborator

The code looks fine to me now. I'm happy to merge as soon as we get the pipeline to run. The issue at the moment is that we support Python 3.5, which does not support f-strings yet. Since f-strings are awesome, I'd recommend using the future-fstrings package. See this SO post for reference: https://stackoverflow.com/questions/55182209/f-string-invalid-syntax-in-python-3-5
You'll probably need to add this package to the requirements(-dev).txt.
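
For comparison, a dependency-free alternative is `str.format`, which also parses on Python 3.5. This is only a sketch of the equivalent spelling for the column-name pattern quoted earlier in this review; the values of `col` and `percentile` are made up, and it is not necessarily how the issue was resolved in the PR.

```python
col = "city"
percentile = round(0.25 * 100)

# f-string form (a syntax error before Python 3.6):
#   name = f"{col}_{percentile}"
# Equivalent form that also works on Python 3.5:
name = "{}_{}".format(col, percentile)
```

The trade-off is readability versus one more development requirement.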

@PaulWestenthanner
Collaborator

Maybe we should drop support for Python 3.5, since it is no longer officially supported anyway. But that would be out of scope for this PR.

@cmougan
Contributor Author

cmougan commented Oct 15, 2021

@PaulWestenthanner thanks again!

@cmougan
Contributor Author

cmougan commented Oct 16, 2021

Hi @PaulWestenthanner,

After some discussion and testing with @david26694, we noticed that the Summary Encoder does not pass most of the tests. We see the following options:

    1. Fix the Summary Encoder to pass all the tests; this seems like a lot of work.
    2. Use the same strategy as with the Quantile Encoder (QE) and copy-paste a lot of code.
    3. Don't add the Summary Encoder in this PR.
    4. Merge without passing the tests.

@PaulWestenthanner
Collaborator

Hi @cmougan @david26694

Thanks for adding the tests. This was a good catch!
I took the time to fix the tests. They failed mainly because the iterative structure did not comply with the other encoders.
Please find my code here: https://github.com/PaulWestenthanner/categorical-encoding/tree/quantileEncoder
It's a bit of a mix of options 1 and 2, since a little code is copied. However, the part of the code that pre-processes X and y, checks for nulls and so on is copied for more or less every encoder in the repo. This issue should be tackled at some point, but not now.

I'd suggest we merge my branch into yours and then into master?

@cmougan
Contributor Author

cmougan commented Oct 17, 2021

Hi @PaulWestenthanner @david26694

Wow, impressive coding.

Just merged your branch.

@PaulWestenthanner
Collaborator

The tests still fail for Python 3.5 since f-strings are used. I fixed that in the code but forgot the tests. Would you mind fixing this, @cmougan?

@cmougan
Contributor Author

cmougan commented Oct 19, 2021

@PaulWestenthanner and now?

@PaulWestenthanner
Collaborator

Sorry, I missed one of the tests. I've added a comment at the appropriate position.

@cmougan
Contributor Author

cmougan commented Oct 20, 2021

@PaulWestenthanner I was trying to debug that, very unsuccessfully. Many thanks.

@PaulWestenthanner
Collaborator

Great! Thanks for your effort. LGTM & Merge

@PaulWestenthanner PaulWestenthanner merged commit 66d89c2 into scikit-learn-contrib:master Oct 20, 2021

3 participants