New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] Add a new class RegressorChain similar to ClassifierChain #9257

Merged
merged 73 commits into from Dec 11, 2017

Conversation

Projects
None yet
6 participants
@thechargedneutron
Contributor

thechargedneutron commented Jun 30, 2017

Reference Issue

#9246

What does this implement/fix? Explain your changes.

The changes made are as follows

  1. Removed local variable class_
  2. Removed decision_function
  3. Removed predict_proba

Any other comments?

Suggestions are welcomed.

thechargedneutron added some commits Jun 30, 2017

@TomDLT

This comment has been minimized.

Show comment
Hide comment
@TomDLT

TomDLT Jun 30, 2017

Member

Would it be possible to avoid redundant code between ClassifierChain and RegressorChain?
Something like:

class BaseChain(BaseEstimator)
    def __init__(self, base_estimator, order=None, cv=None, random_state=None):
        ...
    def fit(self, X, Y):
        ...
    def predict(self, X):
        ...

class ClassifierChain(BaseChain):
    """docstring"""
    @if_delegate_has_method('base_estimator')
    def predict_proba(self, X):
        ...
    @if_delegate_has_method('base_estimator')
    def decision_function(self, X):
        ...

class RegressorChain(BaseChain):
    """docstring"""
    ...

You will also need to fix the tests, maybe add new tests, add a docstring in RegressorChain, add an entry in the documentation, and add RegressorChain in doc/modules/classes.rst.

Basically, look for every appearance of ClassifierChain (with for example git grep ClassifierChain), and add something similar with RegressorChain if it's relevant.

Your code is also not PEP8 compliant.

Member

TomDLT commented Jun 30, 2017

Would it be possible to avoid redundant code between ClassifierChain and RegressorChain?
Something like:

class BaseChain(BaseEstimator)
    def __init__(self, base_estimator, order=None, cv=None, random_state=None):
        ...
    def fit(self, X, Y):
        ...
    def predict(self, X):
        ...

class ClassifierChain(BaseChain):
    """docstring"""
    @if_delegate_has_method('base_estimator')
    def predict_proba(self, X):
        ...
    @if_delegate_has_method('base_estimator')
    def decision_function(self, X):
        ...

class RegressorChain(BaseChain):
    """docstring"""
    ...

You will also need to fix the tests, maybe add new tests, add a docstring in RegressorChain, add an entry in the documentation, and add RegressorChain in doc/modules/classes.rst.

Basically, look for every appearance of ClassifierChain (with for example git grep ClassifierChain), and add something similar with RegressorChain if it's relevant.

Your code is also not PEP8 compliant.

@TomDLT TomDLT changed the title from [MRG] Solves #9246 by making necessary changes in ClassifierChain to [MRG] Add a new class RegressorChain similar to ClassifierChain Jun 30, 2017

@TomDLT TomDLT changed the title from [MRG] Add a new class RegressorChain similar to ClassifierChain to [WIP] Add a new class RegressorChain similar to ClassifierChain Jun 30, 2017

Progress Updated
* Update classes.rst

* Update multiclass.rst

* Update multioutput.py

* Update testing.py
@thechargedneutron

This comment has been minimized.

Show comment
Hide comment
@thechargedneutron

thechargedneutron Jun 30, 2017

Contributor

Work Progress:

  1. Edited multioutput.py to make it compatible with PEP8. (Number of characters > 79 ignored)
  2. Added the function in docs. (Except in whats_new. Do I have to add there also??)
  3. Added "docstring" in RegressorChain.

Need help in adding more tests.
Could "generate_multilabel_dataset_with_correlations()" be used to generate tests for Regression, I doubt.

Contributor

thechargedneutron commented Jun 30, 2017

Work Progress:

  1. Edited multioutput.py to make it compatible with PEP8. (Number of characters > 79 ignored)
  2. Added the function in docs. (Except in whats_new. Do I have to add there also??)
  3. Added "docstring" in RegressorChain.

Need help in adding more tests.
Could "generate_multilabel_dataset_with_correlations()" be used to generate tests for Regression, I doubt.

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Jul 1, 2017

Member
Member

jnothman commented Jul 1, 2017

@jmschrei

This comment has been minimized.

Show comment
Hide comment
@jmschrei

jmschrei Jul 3, 2017

Member

I would also support create a BaseChain class that is common to both regressors and classifiers. It seems to fit with the other models we have.

Member

jmschrei commented Jul 3, 2017

I would also support create a BaseChain class that is common to both regressors and classifiers. It seems to fit with the other models we have.

@TomDLT TomDLT added the New Feature label Jul 11, 2017

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Jul 31, 2017

Member

Are you still working on this @thechargedneutron?

Member

jnothman commented Jul 31, 2017

Are you still working on this @thechargedneutron?

@thechargedneutron

This comment has been minimized.

Show comment
Hide comment
@thechargedneutron

thechargedneutron Jul 31, 2017

Contributor

Yes sir, since you were focussing on new release, I did not commit. I wish to continue the work. I'll update the PR in a day.

Contributor

thechargedneutron commented Jul 31, 2017

Yes sir, since you were focussing on new release, I did not commit. I wish to continue the work. I'll update the PR in a day.

@thechargedneutron

This comment has been minimized.

Show comment
Hide comment
@thechargedneutron

thechargedneutron Jul 31, 2017

Contributor

The common BaseChain class has been added. I tried resolving the conflicting files but somehow did not work. Kindly review the changes and remove the conflicts.
Also I need suggestions on how to add tests.
Thanks

Contributor

thechargedneutron commented Jul 31, 2017

The common BaseChain class has been added. I tried resolving the conflicting files but somehow did not work. Kindly review the changes and remove the conflicts.
Also I need suggestions on how to add tests.
Thanks

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Jul 31, 2017

Member
Member

jnothman commented Jul 31, 2017

@thechargedneutron

This comment has been minimized.

Show comment
Hide comment
@thechargedneutron

thechargedneutron Aug 1, 2017

Contributor

The tests have been added. SVR was not added in the RegressorChain() tests as the function decision_function() is deprecated and will be removed in 0.19. If there's a mistake or something more is to be added, kindly let me know.
Thanks.
[EDIT 1]: Please let me know the reason why the Continuous Integration tests are failing every time I commit. Advice given in the irc channel is not working either.

Contributor

thechargedneutron commented Aug 1, 2017

The tests have been added. SVR was not added in the RegressorChain() tests as the function decision_function() is deprecated and will be removed in 0.19. If there's a mistake or something more is to be added, kindly let me know.
Thanks.
[EDIT 1]: Please let me know the reason why the Continuous Integration tests are failing every time I commit. Advice given in the irc channel is not working either.

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Aug 1, 2017

Member
Member

jnothman commented Aug 1, 2017

@thechargedneutron

This comment has been minimized.

Show comment
Hide comment
@thechargedneutron

thechargedneutron Aug 2, 2017

Contributor

Is there anything else to be worked upon? And why are the CI build failing?
Thanks.

Contributor

thechargedneutron commented Aug 2, 2017

Is there anything else to be worked upon? And why are the CI build failing?
Thanks.

@jnothman

This was the start of a review...

Show outdated Hide outdated sklearn/multioutput.py
Show outdated Hide outdated sklearn/multioutput.py
@jnothman

Tests are failing, in the first instance, because you have not imported RegressorChain into test_multioutput.py

Another issue is that classes_ should not be set in RegressorChain but should be in ClassifierChain

Show outdated Hide outdated sklearn/tests/test_multioutput.py
@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Aug 2, 2017

Member
Member

jnothman commented Aug 2, 2017

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Aug 2, 2017

Member
Member

jnothman commented Aug 2, 2017

@thechargedneutron

This comment has been minimized.

Show comment
Hide comment
@thechargedneutron

thechargedneutron Aug 2, 2017

Contributor

Yes, I have done all the changes suggested by you, also jaccard_similarity_score has been replaced by mean_squared_error. Any more suggestions before I commit the changes?

Contributor

thechargedneutron commented Aug 2, 2017

Yes, I have done all the changes suggested by you, also jaccard_similarity_score has been replaced by mean_squared_error. Any more suggestions before I commit the changes?

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Aug 2, 2017

Member
Member

jnothman commented Aug 2, 2017

@thechargedneutron

This comment has been minimized.

Show comment
Hide comment
@thechargedneutron

thechargedneutron Aug 2, 2017

Contributor

No, I did not try it at my local machine.
And I was not aware of the fact that jaccard_similarity_score if better if greater. So we can replace assert_greater by assert_less, and I think assert_equal and assert_not_equal should remain as it is. Am I right?

Contributor

thechargedneutron commented Aug 2, 2017

No, I did not try it at my local machine.
And I was not aware of the fact that jaccard_similarity_score if better if greater. So we can replace assert_greater by assert_less, and I think assert_equal and assert_not_equal should remain as it is. Am I right?

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Aug 2, 2017

Member
Member

jnothman commented Aug 2, 2017

@thechargedneutron

This comment has been minimized.

Show comment
Hide comment
@thechargedneutron

thechargedneutron Aug 2, 2017

Contributor

Would a value of 0.1 be appropriate ? I tried finding if mean_squared_error is used anywhere else too, but it seems, its not used with a boundary.

Contributor

thechargedneutron commented Aug 2, 2017

Would a value of 0.1 be appropriate ? I tried finding if mean_squared_error is used anywhere else too, but it seems, its not used with a boundary.

@jnothman

Otherwise LGTM

Each model makes a prediction in the order specified by the chain using
all of the available features provided to the model plus the predictions
of models that are earlier in the chain.

This comment has been minimized.

@jnothman

jnothman Dec 7, 2017

Member

This should say "Read more in the User Guide"

@jnothman

jnothman Dec 7, 2017

Member

This should say "Read more in the User Guide"

Each model makes a prediction in the order specified by the chain using
all of the available features provided to the model plus the predictions
of models that are earlier in the chain.

This comment has been minimized.

@jnothman

jnothman Dec 7, 2017

Member

This should say "Read more in the User Guide"

@jnothman

jnothman Dec 7, 2017

Member

This should say "Read more in the User Guide"

This comment has been minimized.

@thechargedneutron

thechargedneutron Dec 7, 2017

Contributor

Followed the same procedure as already there in the codebase.
Read more in the :ref:User Guide <regressorchain>. I am not sure whether this is what you intended as the hyper-link currently may not exist.

@thechargedneutron

thechargedneutron Dec 7, 2017

Contributor

Followed the same procedure as already there in the codebase.
Read more in the :ref:User Guide <regressorchain>. I am not sure whether this is what you intended as the hyper-link currently may not exist.

This comment has been minimized.

@TomDLT

TomDLT Dec 7, 2017

Member

You need to add .. regressorchain: before

Regressor chain
===============

in doc/modules/multiclass.rst

Same for classifier chain

@TomDLT

TomDLT Dec 7, 2017

Member

You need to add .. regressorchain: before

Regressor chain
===============

in doc/modules/multiclass.rst

Same for classifier chain

@scikit-learn scikit-learn deleted a comment from codecov bot Dec 7, 2017

@thechargedneutron

This comment has been minimized.

Show comment
Hide comment
@thechargedneutron

thechargedneutron Dec 7, 2017

Contributor

Only doc changes and test fails!! The same problem of one test failing. Let me try with an empty commit.

Contributor

thechargedneutron commented Dec 7, 2017

Only doc changes and test fails!! The same problem of one test failing. Let me try with an empty commit.

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Dec 7, 2017

Member

I agree that test failure is strange:


=================================== FAILURES ===================================
_________________________ test_base_chain_random_order _________________________
    def test_base_chain_random_order():
        # Fit base chain with random order
        X, Y = generate_multilabel_dataset_with_correlations()
        for chain in [ClassifierChain(LogisticRegression()),
                      RegressorChain(LinearRegression())]:
            chain_random = clone(chain).set_params(order='random', random_state=42)
            chain_random.fit(X, Y)
            chain_fixed = clone(chain).set_params(order=chain_random.order_)
            chain_fixed.fit(X, Y)
            assert_not_equal(list(chain_random.order), list(range(4)))
            assert_equal(len(chain_random.order_), 4)
            assert_equal(len(set(chain_random.order_)), 4)
            # Randomly ordered chain should behave identically to a fixed order
            # chain with the same order.
            for est1, est2 in zip(chain_random.estimators_,
                                  chain_fixed.estimators_):
>               assert_array_almost_equal(est1.coef_, est2.coef_)
E               AssertionError: 
E               Arrays are not almost equal to 6 decimals
E               
E               (mismatch 97.0%)
E                x: array([  7.186580e-03,  -9.998066e-03,   3.580353e-03,  -1.627024e-03,
E                       -3.538778e-03,  -2.532519e-02,   1.074851e-03,   3.214291e-02,
E                        1.339362e+11,  -1.486778e-03,   1.718521e-03,  -4.677074e+11,...
E                y: array([  7.186547e-03,  -1.018109e-02,   3.780135e-03,  -2.089836e-03,
E                       -3.220238e-03,  -2.520867e-02,   1.489337e-03,   3.183672e-02,
E                       -8.141693e+10,  -1.886845e-03,   1.585007e-03,  -5.611828e+11,...
X          = array([[ 0.83122917, -1.19111403,  1.94512639, ..., -1.04616065,
         0.19652082, -0.71217646],
       [-1.7883219...266, -1.00619016],
       [-1.08628488, -1.20595377, -1.19448098, ...,  0.88972995,
        -1.08459736, -1.89365993]])
Y          = array([[0, 0, 1, 1],
       [0, 1, 1, 0],
       [1, 0, 1, 1],
       ..., 
       [1, 1, 1, 0],
       [0, 0, 1, 0],
       [0, 1, 0, 0]])
chain      = RegressorChain(base_estimator=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False),
        cv=None, order=None, random_state=None)
chain_fixed = RegressorChain(base_estimator=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False),
        cv=None, order=array([1, 3, 0, 2]), random_state=None)
chain_random = RegressorChain(base_estimator=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False),
        cv=None, order='random', random_state=42)
est1       = LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
est2       = LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
/home/travis/build/scikit-learn/scikit-learn/sklearn/tests/test_multioutput.py:471: AssertionError
====== 1 failed, 8351 passed, 26 skipped, 4860 warnings in 544.24 seconds ======

It only occurs on one platform, only sometimes, and only in the regression case. Which would suggest it is an indeterminacy in LinearRegression which is quite unbelievable. @TomDLT any ideas?

I doubt it will change anything, but there should be a check there that chain_fixed.order_ and chain_random.order_ are identical. Could you add that?

Member

jnothman commented Dec 7, 2017

I agree that test failure is strange:


=================================== FAILURES ===================================
_________________________ test_base_chain_random_order _________________________
    def test_base_chain_random_order():
        # Fit base chain with random order
        X, Y = generate_multilabel_dataset_with_correlations()
        for chain in [ClassifierChain(LogisticRegression()),
                      RegressorChain(LinearRegression())]:
            chain_random = clone(chain).set_params(order='random', random_state=42)
            chain_random.fit(X, Y)
            chain_fixed = clone(chain).set_params(order=chain_random.order_)
            chain_fixed.fit(X, Y)
            assert_not_equal(list(chain_random.order), list(range(4)))
            assert_equal(len(chain_random.order_), 4)
            assert_equal(len(set(chain_random.order_)), 4)
            # Randomly ordered chain should behave identically to a fixed order
            # chain with the same order.
            for est1, est2 in zip(chain_random.estimators_,
                                  chain_fixed.estimators_):
>               assert_array_almost_equal(est1.coef_, est2.coef_)
E               AssertionError: 
E               Arrays are not almost equal to 6 decimals
E               
E               (mismatch 97.0%)
E                x: array([  7.186580e-03,  -9.998066e-03,   3.580353e-03,  -1.627024e-03,
E                       -3.538778e-03,  -2.532519e-02,   1.074851e-03,   3.214291e-02,
E                        1.339362e+11,  -1.486778e-03,   1.718521e-03,  -4.677074e+11,...
E                y: array([  7.186547e-03,  -1.018109e-02,   3.780135e-03,  -2.089836e-03,
E                       -3.220238e-03,  -2.520867e-02,   1.489337e-03,   3.183672e-02,
E                       -8.141693e+10,  -1.886845e-03,   1.585007e-03,  -5.611828e+11,...
X          = array([[ 0.83122917, -1.19111403,  1.94512639, ..., -1.04616065,
         0.19652082, -0.71217646],
       [-1.7883219...266, -1.00619016],
       [-1.08628488, -1.20595377, -1.19448098, ...,  0.88972995,
        -1.08459736, -1.89365993]])
Y          = array([[0, 0, 1, 1],
       [0, 1, 1, 0],
       [1, 0, 1, 1],
       ..., 
       [1, 1, 1, 0],
       [0, 0, 1, 0],
       [0, 1, 0, 0]])
chain      = RegressorChain(base_estimator=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False),
        cv=None, order=None, random_state=None)
chain_fixed = RegressorChain(base_estimator=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False),
        cv=None, order=array([1, 3, 0, 2]), random_state=None)
chain_random = RegressorChain(base_estimator=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False),
        cv=None, order='random', random_state=42)
est1       = LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
est2       = LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
/home/travis/build/scikit-learn/scikit-learn/sklearn/tests/test_multioutput.py:471: AssertionError
====== 1 failed, 8351 passed, 26 skipped, 4860 warnings in 544.24 seconds ======

It only occurs on one platform, only sometimes, and only in the regression case. Which would suggest it is an indeterminacy in LinearRegression which is quite unbelievable. @TomDLT any ideas?

I doubt it will change anything, but there should be a check there that chain_fixed.order_ and chain_random.order_ are identical. Could you add that?

@thechargedneutron

This comment has been minimized.

Show comment
Hide comment
@thechargedneutron

thechargedneutron Dec 7, 2017

Contributor

@jnothman Yeah, added that. Works fine locally.

Contributor

thechargedneutron commented Dec 7, 2017

@jnothman Yeah, added that. Works fine locally.

@TomDLT

This comment has been minimized.

Show comment
Hide comment
@TomDLT

TomDLT Dec 8, 2017

Member

This error is weird.
I set up a conda environment with the same versions as Travis' 3rd job (python 3.6.2) but could not reproduce locally (Linux-4.4.0-98-generic-x86_64-with-debian-stretch-sid).

Which would suggest it is an indeterminacy in LinearRegression which is quite unbelievable.

Indeed !

Member

TomDLT commented Dec 8, 2017

This error is weird.
I set up a conda environment with the same versions as Travis' 3rd job (python 3.6.2) but could not reproduce locally (Linux-4.4.0-98-generic-x86_64-with-debian-stretch-sid).

Which would suggest it is an indeterminacy in LinearRegression which is quite unbelievable.

Indeed !

@thechargedneutron

This comment has been minimized.

Show comment
Hide comment
@thechargedneutron

thechargedneutron Dec 8, 2017

Contributor

@TomDLT Tests failing. I guess some more change needs to be done apart from adding ABCMeta. What needs to be done in addition to it. I don't have much experience in this.

Contributor

thechargedneutron commented Dec 8, 2017

@TomDLT Tests failing. I guess some more change needs to be done apart from adding ABCMeta. What needs to be done in addition to it. I don't have much experience in this.

TomDLT added some commits Dec 11, 2017

@TomDLT

This comment has been minimized.

Show comment
Hide comment
@TomDLT

TomDLT Dec 11, 2017

Member

I removed _BaseChain from all_estimators by adding an abstract method, which fixed the estimator_check error.


About Travis' failure on python 3.6.2, it seems that LinearRegression is not deterministic, as shown in #10283.

To be able to merge this PR, I changed LinearRegression into Ridge.

Member

TomDLT commented Dec 11, 2017

I removed _BaseChain from all_estimators by adding an abstract method, which fixed the estimator_check error.


About Travis' failure on python 3.6.2, it seems that LinearRegression is not deterministic, as shown in #10283.

To be able to merge this PR, I changed LinearRegression into Ridge.

TomDLT added some commits Dec 11, 2017

@TomDLT

TomDLT approved these changes Dec 11, 2017

@jnothman jnothman merged commit 87759c1 into scikit-learn:master Dec 11, 2017

6 checks passed

ci/circleci Your tests passed on CircleCI!
Details
codecov/patch 100% of diff hit (target 96.13%)
Details
codecov/project 96.14% (+0.01%) compared to 84fa352
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
lgtm analysis: Python No alert changes
Details
@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman
Member

jnothman commented Dec 11, 2017

@@ -425,3 +427,13 @@ averaged together.
Jesse Read, Bernhard Pfahringer, Geoff Holmes, Eibe Frank,
"Classifier Chains for Multi-label Classification", 2009.
.. regressorchain:

This comment has been minimized.

@amueller

amueller Dec 11, 2017

Member

needs _ in the front

@amueller

amueller Dec 11, 2017

Member

needs _ in the front

This comment has been minimized.

@thechargedneutron

thechargedneutron Dec 11, 2017

Contributor

Should I submit a new PR with these changes?

@thechargedneutron

thechargedneutron Dec 11, 2017

Contributor

Should I submit a new PR with these changes?

This comment has been minimized.

@amueller

amueller Dec 11, 2017

Member

yeah, sounds good. And mention that in #10258 and check the circle CI build whether the links work. Maybe also add a "see also" from the regressor to the classifier and vice versa.

@amueller

amueller Dec 11, 2017

Member

yeah, sounds good. And mention that in #10258 and check the circle CI build whether the links work. Maybe also add a "see also" from the regressor to the classifier and vice versa.

jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment