Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH dataframe in/out for all samplers #644

Merged
merged 5 commits into from
Nov 17, 2019

Conversation

glemaitre
Copy link
Member

@glemaitre glemaitre commented Nov 16, 2019

Reference Issue

closes #639

What does this implement/fix? Explain your changes.

Let dataframe out if they got in.

TODO:

  • Add some documentation.

Any other comments?

@lgtm-com
Copy link

lgtm-com bot commented Nov 16, 2019

This pull request introduces 1 alert when merging e936bdd into 45b538c - view on LGTM.com

new alerts:

  • 1 for Mismatch in multiple assignment

@codecov
Copy link

codecov bot commented Nov 17, 2019

Codecov Report

Merging #644 into master will increase coverage by 0.54%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #644      +/-   ##
=========================================
+ Coverage   97.96%   98.5%   +0.54%     
=========================================
  Files          83      82       -1     
  Lines        4867    4888      +21     
=========================================
+ Hits         4768    4815      +47     
+ Misses         99      73      -26
Impacted Files Coverage Δ
imblearn/over_sampling/_smote.py 97.23% <100%> (ø) ⬆️
imblearn/utils/estimator_checks.py 96.11% <100%> (+7.22%) ⬆️
imblearn/over_sampling/tests/test_smote_nc.py 100% <100%> (ø) ⬆️
...ling/_prototype_selection/_random_under_sampler.py 100% <100%> (ø) ⬆️
imblearn/over_sampling/_random_over_sampler.py 100% <100%> (ø) ⬆️
...rototype_selection/_instance_hardness_threshold.py 100% <0%> (ø) ⬆️
...election/tests/test_instance_hardness_threshold.py 100% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 45b538c...2f93b40. Read the comment docs.

@chkoar
Copy link
Member

chkoar commented Nov 17, 2019

I thought that you were against on using pandas. With your implementation you assume that the user has pandas installed. Of course it is passing pandas pandas df. Shouldn't we try/catch in case that pandas in not installed and is passing something that has a loc property?

@glemaitre
Copy link
Member Author

glemaitre commented Nov 17, 2019 via email

@lgtm-com
Copy link

lgtm-com bot commented Nov 17, 2019

This pull request introduces 1 alert when merging b1d2d56 into 45b538c - view on LGTM.com

new alerts:

  • 1 for Mismatch in multiple assignment

@lgtm-com
Copy link

lgtm-com bot commented Nov 17, 2019

This pull request introduces 1 alert when merging 9431bb8 into 45b538c - view on LGTM.com

new alerts:

  • 1 for Mismatch in multiple assignment

@glemaitre
Copy link
Member Author

@chkoar Would you have a bit of time too look at it.

@lgtm-com
Copy link

lgtm-com bot commented Nov 17, 2019

This pull request introduces 1 alert when merging 2f93b40 into 153f6e0 - view on LGTM.com

new alerts:

  • 1 for Mismatch in multiple assignment

@chkoar
Copy link
Member

chkoar commented Nov 17, 2019

I will give it a go soon

Comment on lines +83 to +88
if self._columns is not None:
import pandas as pd
X_ = pd.DataFrame(output[0], columns=self._columns)
else:
X_ = output[0]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not doing this in _check_X_y?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block required the resampled data so _check_X_y is called before that the resampling happened.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. It is like we need to extracted it in an internal transformation method. But I am ok as is.

@chkoar
Copy link
Member

chkoar commented Nov 17, 2019

Is there a case where something that will pass, it will have .columns, .loc and will not be a DataFrame? Probably is a corner case. In that case we will have an ugly error. Is there a need to test/catch this? If no then I think I am ok.

@glemaitre
Copy link
Member Author

Is there a case where something that will pass, it will have .columns, .loc and will not be a DataFrame? Probably is a corner case. In that case we will have an ugly error. Is there a need to test/catch this? If no then I think I am ok.

We always ducktype like this in scikit-learn for dataframe. So if it passes in imblearn it will break in scikit-learn just with the _safe_indexing.

@chkoar
Copy link
Member

chkoar commented Nov 17, 2019

Go ahead!

@glemaitre glemaitre merged commit 158258e into scikit-learn-contrib:master Nov 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for dataframe in/out
2 participants