Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEA add ValueDifferenceMetric as a pairwise metric #796

Merged
merged 23 commits into from Feb 14, 2021

Conversation

glemaitre
Copy link
Member

@glemaitre glemaitre commented Feb 13, 2021

  • Implementation
  • Tests
  • Documentation
  • Reference
  • If public, user guide and more explanation regarding this distance.
  • Decide if it should be public or private

@pep8speaks
Copy link

pep8speaks commented Feb 13, 2021

Hello @glemaitre! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 23:1: E402 module level import not at top of file
Line 87:1: E402 module level import not at top of file

Line 764:17: W503 line break before binary operator

Comment last updated at 2021-02-14 18:28:41 UTC

@chkoar
Copy link
Member

chkoar commented Feb 13, 2021

Could you add in this branch some bench code so we could test vdm performance in order to improve it?

@glemaitre
Copy link
Member Author

Could you add in this branch some bench code so we could test vdm performance in order to improve it?

Yep. We could do that. I made some profiling and I am actually not sure that we can speed-up the computation.
Indeed, the expensive part was the encoding (that I put outside and a requirement as data structure).

@codecov
Copy link

codecov bot commented Feb 13, 2021

Codecov Report

Merging #796 (5526c8a) into master (b00e31b) will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #796   +/-   ##
=======================================
  Coverage   98.62%   98.62%           
=======================================
  Files          89       89           
  Lines        5881     5881           
  Branches      494      494           
=======================================
  Hits         5800     5800           
  Misses         80       80           
  Partials        1        1           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a66fb7d...f545386. Read the comment docs.

@glemaitre glemaitre changed the title FEA add SMOTEN for nominal categorical features only FEA add ValueDifferenceMetric as a pairwise metric Feb 13, 2021
@glemaitre
Copy link
Member Author

@chkoar I am wondering if this metric should be public or private. Indeed, it required some OrdinalEncoder and proper dtype that we will manage internally but I don't know if we want a user to use it outside.

@chkoar
Copy link
Member

chkoar commented Feb 14, 2021

@chkoar I am wondering if this metric should be public or private.

We could add it without a leading _ but without a reference to the docs.
I played with the DistanceMetric API but it seems slow.

@glemaitre
Copy link
Member Author

I played with the DistanceMetric API but it seems slow.

It seems x20 slower than the current implementation. I think that we are fine to go indeed.
The only issue is that we will have a matrix of (n_samples, n_samples) leaving in memory.
But it might be best than not having SMOTEN :)

@glemaitre
Copy link
Member Author

We could add it without a leading _ but without a reference to the docs.

I am starting to think that it could leave in the documentation as well.
It is not too tricky to use it publicly. The only requirement is to use an OrdinalEncoder. I think it is quite explicit in the doc.

Copy link
Member

@chkoar chkoar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case that we will use fit, probably we could inherit from the base estimator since it estimates from data.

class ValueDifferenceMetric(BaseEstimator):
    def __init__(self, k=1, r=2):...

    def fit(self, X, y):
        # learning unique classes here

    def pairwise(self, X, Y=None):...

On the other hand, another option would be to require the data in the init.

class ValueDifferenceMetric:
    def __init__(self, X, y, k=1, r=2):...

    def pairwise(self, X, Y=None):...

Additionally we could implement the callable API

vdm = ValueDifferenceMetric(...)
distance = vdm(x1, x2)

Design wise, I would be in favor for the following in order to use the DistanceMetric API but is way to slow.

class ValueDifferenceMetric:
    def __init__(self, X, y):...

    def __call__(self, x1, x2, k=1, r=2):... 

vdm = ValueDifferenceMetric(X,y)

metric = DistanceMetric.get_metric(vdm)
metric.pairwise(X)

# or

knn = KNearestNeighbors(metric=metric)

All these assuming that X is ordinal encoded with ints.

imblearn/metrics/pairwise.py Outdated Show resolved Hide resolved
imblearn/metrics/tests/test_pairwise.py Outdated Show resolved Hide resolved
imblearn/metrics/pairwise.py Outdated Show resolved Hide resolved
imblearn/metrics/pairwise.py Outdated Show resolved Hide resolved
@chkoar
Copy link
Member

chkoar commented Feb 14, 2021

The only issue is that we will have a matrix of (n_samples, n_samples) leaving in memory.

Can we get rid of that after the sampling?

@glemaitre
Copy link
Member Author

Can we get rid of that after the sampling?

yes, it is just temporary for the sampling for the NN search.

@glemaitre
Copy link
Member Author

@chkoar I think that I would like to see this PR merge as is and open another one for SMOTEN.
I think that we have a nice coverage regarding the basic usage case in the test.

Do you see anything else to add?

doc/metrics.rst Outdated Show resolved Hide resolved
imblearn/metrics/_classification.py Outdated Show resolved Hide resolved
@chkoar
Copy link
Member

chkoar commented Feb 14, 2021

@chkoar I think that I would like to see this PR merge as is and open another one for SMOTEN.

Agreed.

glemaitre and others added 4 commits February 14, 2021 18:46
Co-authored-by: Christos Aridas <chkoar@users.noreply.github.com>
@glemaitre
Copy link
Member Author

OK I think this is good to be merged. I fixed the issue with what's new. @chkoar Feel free to merge.

@glemaitre glemaitre merged commit ce4e1f7 into scikit-learn-contrib:master Feb 14, 2021
@chkoar
Copy link
Member

chkoar commented Feb 15, 2021

OK I think this is good to be merged. I fixed the issue with what's new. @chkoar Feel free to merge.

Done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants