Skip to content

ENH dataframe in/out for all samplers#644

Merged
glemaitre merged 5 commits intoscikit-learn-contrib:masterfrom
glemaitre:is/639
Nov 17, 2019
Merged

ENH dataframe in/out for all samplers#644
glemaitre merged 5 commits intoscikit-learn-contrib:masterfrom
glemaitre:is/639

Conversation

@glemaitre
Copy link
Copy Markdown
Member

@glemaitre glemaitre commented Nov 16, 2019

Reference Issue

closes #639

What does this implement/fix? Explain your changes.

Let dataframe out if they got in.

TODO:

  • Add some documentation.

Any other comments?

@lgtm-com
Copy link
Copy Markdown

lgtm-com Bot commented Nov 16, 2019

This pull request introduces 1 alert when merging e936bdd into 45b538c - view on LGTM.com

new alerts:

  • 1 for Mismatch in multiple assignment

@codecov
Copy link
Copy Markdown

codecov Bot commented Nov 17, 2019

Codecov Report

Merging #644 into master will increase coverage by 0.54%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #644      +/-   ##
=========================================
+ Coverage   97.96%   98.5%   +0.54%     
=========================================
  Files          83      82       -1     
  Lines        4867    4888      +21     
=========================================
+ Hits         4768    4815      +47     
+ Misses         99      73      -26
Impacted Files Coverage Δ
imblearn/over_sampling/_smote.py 97.23% <100%> (ø) ⬆️
imblearn/utils/estimator_checks.py 96.11% <100%> (+7.22%) ⬆️
imblearn/over_sampling/tests/test_smote_nc.py 100% <100%> (ø) ⬆️
...ling/_prototype_selection/_random_under_sampler.py 100% <100%> (ø) ⬆️
imblearn/over_sampling/_random_over_sampler.py 100% <100%> (ø) ⬆️
...rototype_selection/_instance_hardness_threshold.py 100% <0%> (ø) ⬆️
...election/tests/test_instance_hardness_threshold.py 100% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 45b538c...2f93b40. Read the comment docs.

@chkoar
Copy link
Copy Markdown
Member

chkoar commented Nov 17, 2019

I thought that you were against on using pandas. With your implementation you assume that the user has pandas installed. Of course it is passing pandas pandas df. Shouldn't we try/catch in case that pandas in not installed and is passing something that has a loc property?

@glemaitre
Copy link
Copy Markdown
Member Author

glemaitre commented Nov 17, 2019 via email

@lgtm-com
Copy link
Copy Markdown

lgtm-com Bot commented Nov 17, 2019

This pull request introduces 1 alert when merging b1d2d56 into 45b538c - view on LGTM.com

new alerts:

  • 1 for Mismatch in multiple assignment

@lgtm-com
Copy link
Copy Markdown

lgtm-com Bot commented Nov 17, 2019

This pull request introduces 1 alert when merging 9431bb8 into 45b538c - view on LGTM.com

new alerts:

  • 1 for Mismatch in multiple assignment

@glemaitre
Copy link
Copy Markdown
Member Author

@chkoar Would you have a bit of time too look at it.

@lgtm-com
Copy link
Copy Markdown

lgtm-com Bot commented Nov 17, 2019

This pull request introduces 1 alert when merging 2f93b40 into 153f6e0 - view on LGTM.com

new alerts:

  • 1 for Mismatch in multiple assignment

@chkoar
Copy link
Copy Markdown
Member

chkoar commented Nov 17, 2019

I will give it a go soon

Comment thread imblearn/base.py
Comment on lines +83 to +88
if self._columns is not None:
import pandas as pd
X_ = pd.DataFrame(output[0], columns=self._columns)
else:
X_ = output[0]

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not doing this in _check_X_y?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block required the resampled data so _check_X_y is called before that the resampling happened.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. It is like we need to extracted it in an internal transformation method. But I am ok as is.

@chkoar
Copy link
Copy Markdown
Member

chkoar commented Nov 17, 2019

Is there a case where something that will pass, it will have .columns, .loc and will not be a DataFrame? Probably is a corner case. In that case we will have an ugly error. Is there a need to test/catch this? If no then I think I am ok.

@glemaitre
Copy link
Copy Markdown
Member Author

Is there a case where something that will pass, it will have .columns, .loc and will not be a DataFrame? Probably is a corner case. In that case we will have an ugly error. Is there a need to test/catch this? If no then I think I am ok.

We always ducktype like this in scikit-learn for dataframe. So if it passes in imblearn it will break in scikit-learn just with the _safe_indexing.

@chkoar
Copy link
Copy Markdown
Member

chkoar commented Nov 17, 2019

Go ahead!

@glemaitre glemaitre merged commit 158258e into scikit-learn-contrib:master Nov 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add support for dataframe in/out

2 participants