Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] Fix estimators to work if sample_weight parameter is pandas Series type #7825

Merged
merged 18 commits into from Dec 3, 2016

Conversation

@kathyxchen
Copy link
Contributor

@kathyxchen kathyxchen commented Nov 4, 2016

Reference Issue

Finishes work in #5642

What does this implement/fix? Explain your changes.

This addresses the comments made on @pradyu1993's commit

Any other comments?

May still warrant a PEP8 linting. (Having trouble with my usual development machine so once I get back on that I'll run the linter & all). Let me know if there are any suggestions on tests or improvements to how the check is made. Thanks!

@jnothman
Copy link
Member

@jnothman jnothman commented Nov 5, 2016

Travis is indeed reporting linter failures: https://travis-ci.org/scikit-learn/scikit-learn/jobs/173378047

@kathyxchen
Copy link
Contributor Author

@kathyxchen kathyxchen commented Nov 6, 2016

@jnothman done!
Also, the sample_weight parameter is used in many places throughout the linear_model/ridge.py file. Should a modification to accept pandas.Series be applied to all of these, or is that outside the scope of this issue? @amueller

@jnothman
Copy link
Member

@jnothman jnothman commented Nov 7, 2016

Also, the sample_weight parameter is used in many places throughout the linear_model/ridge.py file.

Do you mean in functions like sklearn.linear_model.ridge_regression? Or do you mean checking that it works for all solvers? At the moment we only have common tests for estimator classes. I think that checking support in functions is a good idea, but perhaps in a separate PR.

estimator = Estimator()
if has_fit_parameter(estimator, "sample_weight"):
try:
# default solver liblinear doesn't support this parameter

This comment has been minimized.

@jnothman

jnothman Nov 7, 2016
Member

solver='liblinear' does support sample_weight . Where did you get this idea from?

This comment has been minimized.

@kathyxchen

kathyxchen Nov 7, 2016
Author Contributor

This was part of the code taken from the previous PR. I should have double checked it first!

This comment has been minimized.

@jnothman

jnothman Nov 7, 2016
Member

Ah and it may have been true... over a year ago. :)

This comment has been minimized.

@amueller

amueller Nov 9, 2016
Member

It was, I think ;)

X = pd.DataFrame([[1, 1], [1, 2], [1, 3], [2, 1], [2, 2], [2, 3]])
y = pd.Series([1, 1, 1, 2, 2, 2])
weights = pd.Series([1] * 6)
for estimator_name, Estimator in all_estimators():

This comment has been minimized.

@jnothman

jnothman Nov 7, 2016
Member

This should be implemented like the other checks in sklearn.utils.estimator_checks, not as a separate test.

@kathyxchen
Copy link
Contributor Author

@kathyxchen kathyxchen commented Nov 14, 2016

@jnothman @amueller: I keep running into failed test cases that I don't encounter when I run nosetests utils/tests/test_estimator_checks.py on my local machine. Do you have an idea as to why this may be the case, or any tips for how I ought to set up my environment to avoid this issue?

@jnothman
Copy link
Member

@jnothman jnothman commented Nov 14, 2016

I generally find Travis logs much easier to check than Appveyor. At a glance, the problems on Travis may be related to Python 2, though I see the same failure in Python 3 AppVeyor.

@jnothman
Copy link
Member

@jnothman jnothman commented Nov 16, 2016

Please fix the PEP8 issues Travis is complaining about

Copy link
Member

@jnothman jnothman left a comment

Could you please also rename this to MRG if you think the work is complete enough for a full review.

weights = pd.Series([1] * 6)
try:
estimator.fit(X, y, sample_weight=weights)
except:

This comment has been minimized.

@jnothman

jnothman Nov 16, 2016
Member

You should (almost) never use except:. Use except Exception:

@@ -380,6 +381,27 @@ def check_estimator_sparse_data(name, Estimator):


@ignore_warnings(category=(DeprecationWarning, UserWarning))
def check_pandas_series(name, Estimator):

This comment has been minimized.

@jnothman

jnothman Nov 16, 2016
Member

please mention sample_weight in the name

accept_sparse=("csr", "csc"),
multi_output=True,
y_numeric=True)
# Loosely based on _solve_cholesky_kernel (called in KernelRidge.fit)

This comment has been minimized.

@jnothman

jnothman Nov 16, 2016
Member

This is far too confusing. Just raise an exception directly if sample_weight is a pandas Series.

@@ -72,6 +91,15 @@ def test_check_estimator():
# check that fit does input validation
msg = "TypeError not raised by fit"
assert_raises_regex(AssertionError, msg, check_estimator, BaseBadClassifier)
# check that sample_weights in fit accepts pandas.Series type
try:
from pandas import Series

This comment has been minimized.

@kathyxchen

kathyxchen Nov 16, 2016
Author Contributor

Here, I check if I am able to import pandas before the new estimator check (skip if not). Is there a better way to handle this?
pep8 complained about an unused import in my previous commit because I would just do a try-catch for an ImportError but would do nothing with the import itself. Now, I am using it in msg to avoid that lint error. Should I be handling this in a different way?

This comment has been minimized.

@amueller

amueller Nov 16, 2016
Member

This is fine. Maybe again raise a skip if there is no pandas. I feel we should do that whenever there is a pandas import in the tests but that goes beyond this issue.

This comment has been minimized.

@jnothman

jnothman Nov 16, 2016
Member

Raising a skip if there is no pandas requires splitting test_check_estimator into more functions. Which may be a good idea, but isn't really pertinent to this issue.

There should be a way to disable the linter catching a line... Look it up?

This comment has been minimized.

@amueller

amueller Nov 16, 2016
Member

right...

@kathyxchen kathyxchen changed the title [WIP] Fix estimators to work if sample_weight parameter is pandas Series type [MRG] Fix estimators to work if sample_weight parameter is pandas Series type Nov 16, 2016
Copy link
Member

@amueller amueller left a comment

Looks good apart from minor comments.

@@ -957,6 +957,8 @@ def fit(self, X, y, sample_weight=None):
"""
X, y = check_X_y(X, y, ['csr', 'csc', 'coo'], dtype=np.float64,
multi_output=True, y_numeric=True)
if sample_weight is not None and not isinstance(sample_weight, float):

This comment has been minimized.

@amueller

amueller Nov 16, 2016
Member

Why could sample_weight be a float?

This comment has been minimized.

@kathyxchen

kathyxchen Nov 22, 2016
Author Contributor

This was based on the docstring (line 951) which said that sample_weight could accept a float

This comment has been minimized.

@amueller

amueller Dec 1, 2016
Member

I'm not sure why we support this but fair enough.

@@ -380,6 +381,27 @@ def check_estimator_sparse_data(name, Estimator):


@ignore_warnings(category=(DeprecationWarning, UserWarning))
def check_sample_weights_pandas_series(name, Estimator):

This comment has been minimized.

@amueller

amueller Nov 16, 2016
Member

Why UserWarning?

"'sample_weight' parameter is type "
"{1}".format(name, pd.Series))
except ImportError:
pass

This comment has been minimized.

@amueller

amueller Nov 16, 2016
Member

Maybe we should raise a SkipTest("Pandas not installed, not testing pandas series as class weight")? That would be more explicit.

@@ -72,6 +91,15 @@ def test_check_estimator():
# check that fit does input validation
msg = "TypeError not raised by fit"
assert_raises_regex(AssertionError, msg, check_estimator, BaseBadClassifier)
# check that sample_weights in fit accepts pandas.Series type
try:
from pandas import Series

This comment has been minimized.

@amueller

amueller Nov 16, 2016
Member

This is fine. Maybe again raise a skip if there is no pandas. I feel we should do that whenever there is a pandas import in the tests but that goes beyond this issue.

@kathyxchen
Copy link
Contributor Author

@kathyxchen kathyxchen commented Nov 22, 2016

@jnothman @amueller: Adding the SkipTest causes a failure in the doctest run for contributing.rst (here). Should I remove the SkipTest again (in favor of a 'pass') or change the contributing.rst example?

@jnothman
Copy link
Member

@jnothman jnothman commented Nov 22, 2016

Right, so this is due to a discrepancy between how we suggest others run our common tests (with check_estimator) and how we do internally. It would be good to modify check_estimator to catch and warn upon SkipTest, whether in this PR or separately.

Copy link
Member

@jnothman jnothman left a comment

LGTM

except SkipTest as message:
# the only SkipTest thrown currently results from not
# being able to import pandas.
warnings.warn(message, ImportWarning)

This comment has been minimized.

@jnothman

jnothman Nov 23, 2016
Member

I'm not certain whether ImportWarning is correct. Should we have a SkipTestWarning or some such?

This comment has been minimized.

@kathyxchen

kathyxchen Nov 23, 2016
Author Contributor

Should I add a SkipTestWarning to the exceptions.py file that inherits from the base Warning class? (Would it be in the scope of this PR?)

This comment has been minimized.

@amueller

amueller Dec 1, 2016
Member

ImportWarning is ignored by default. I'd either use UserWarning or add a SkipTestWarning. Feel free to add this to the PR, or change to UserWarning. otherwise LGTM!

@jnothman jnothman changed the title [MRG] Fix estimators to work if sample_weight parameter is pandas Series type [MRG+1] Fix estimators to work if sample_weight parameter is pandas Series type Nov 23, 2016
Copy link
Member

@amueller amueller left a comment

I don't think ImportWarning is good as it's ignored by default.

@@ -957,6 +957,8 @@ def fit(self, X, y, sample_weight=None):
"""
X, y = check_X_y(X, y, ['csr', 'csc', 'coo'], dtype=np.float64,
multi_output=True, y_numeric=True)
if sample_weight is not None and not isinstance(sample_weight, float):

This comment has been minimized.

@amueller

amueller Dec 1, 2016
Member

I'm not sure why we support this but fair enough.

@@ -379,6 +385,28 @@ def check_estimator_sparse_data(name, Estimator):
raise


@ignore_warnings(category=(DeprecationWarning))

This comment has been minimized.

@amueller

amueller Dec 1, 2016
Member

You don't need the parentheses around DeprecationWarning, they don't do anything.

except SkipTest as message:
# the only SkipTest thrown currently results from not
# being able to import pandas.
warnings.warn(message, ImportWarning)

This comment has been minimized.

@amueller

amueller Dec 1, 2016
Member

ImportWarning is ignored by default. I'd either use UserWarning or add a SkipTestWarning. Feel free to add this to the PR, or change to UserWarning. otherwise LGTM!

Copy link
Member

@jnothman jnothman left a comment

LGTM. Please add a changelog entry to doc/whats_new.rst, under your choice of bug fix or enhancement (I can't decide!).

@kathyxchen kathyxchen force-pushed the kathyxchen:estimators-accept-series branch from 8a54e04 to 5c7b166 Dec 3, 2016
@amueller
Copy link
Member

@amueller amueller commented Dec 3, 2016

thanks :)

@amueller amueller merged commit 04b67e2 into scikit-learn:master Dec 3, 2016
3 checks passed
3 checks passed
ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
sergeyf added a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017
…eries type (scikit-learn#7825)

* addressed comments in the PR about parameters in check_array

* update the test case for the evaluation of estimators with pandas series

* bug fix, need to check for *not* None explicitly

* updated with isinstance check if the documentation says there is acceptance of floats

* ran pep8 linter on modified files

* moving the test case to estimators_check

* add a predict function into the testing pandas.Series class

* avoid running anything beyond the newly added meta checks

* check if pandas is installed before running the specific test

* changed the order of the try-catch to check for sample_weight param beforehand

* pass on import error rather than printing something to std out

* improve test case naming and pd.Series check in the bad estimator class

* address a pep8 linter error with unused import

* pep8 warning disabled for potential unused import

* throw a warning when SkipTest is raised

* add a SkipTestWarning

* updated the whats_new.rst with this issue

* rebase and fix a spacing issue
Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
…eries type (scikit-learn#7825)

* addressed comments in the PR about parameters in check_array

* update the test case for the evaluation of estimators with pandas series

* bug fix, need to check for *not* None explicitly

* updated with isinstance check if the documentation says there is acceptance of floats

* ran pep8 linter on modified files

* moving the test case to estimators_check

* add a predict function into the testing pandas.Series class

* avoid running anything beyond the newly added meta checks

* check if pandas is installed before running the specific test

* changed the order of the try-catch to check for sample_weight param beforehand

* pass on import error rather than printing something to std out

* improve test case naming and pd.Series check in the bad estimator class

* address a pep8 linter error with unused import

* pep8 warning disabled for potential unused import

* throw a warning when SkipTest is raised

* add a SkipTestWarning

* updated the whats_new.rst with this issue

* rebase and fix a spacing issue
NelleV added a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
…eries type (scikit-learn#7825)

* addressed comments in the PR about parameters in check_array

* update the test case for the evaluation of estimators with pandas series

* bug fix, need to check for *not* None explicitly

* updated with isinstance check if the documentation says there is acceptance of floats

* ran pep8 linter on modified files

* moving the test case to estimators_check

* add a predict function into the testing pandas.Series class

* avoid running anything beyond the newly added meta checks

* check if pandas is installed before running the specific test

* changed the order of the try-catch to check for sample_weight param beforehand

* pass on import error rather than printing something to std out

* improve test case naming and pd.Series check in the bad estimator class

* address a pep8 linter error with unused import

* pep8 warning disabled for potential unused import

* throw a warning when SkipTest is raised

* add a SkipTestWarning

* updated the whats_new.rst with this issue

* rebase and fix a spacing issue
paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
…eries type (scikit-learn#7825)

* addressed comments in the PR about parameters in check_array

* update the test case for the evaluation of estimators with pandas series

* bug fix, need to check for *not* None explicitly

* updated with isinstance check if the documentation says there is acceptance of floats

* ran pep8 linter on modified files

* moving the test case to estimators_check

* add a predict function into the testing pandas.Series class

* avoid running anything beyond the newly added meta checks

* check if pandas is installed before running the specific test

* changed the order of the try-catch to check for sample_weight param beforehand

* pass on import error rather than printing something to std out

* improve test case naming and pd.Series check in the bad estimator class

* address a pep8 linter error with unused import

* pep8 warning disabled for potential unused import

* throw a warning when SkipTest is raised

* add a SkipTestWarning

* updated the whats_new.rst with this issue

* rebase and fix a spacing issue
maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
…eries type (scikit-learn#7825)

* addressed comments in the PR about parameters in check_array

* update the test case for the evaluation of estimators with pandas series

* bug fix, need to check for *not* None explicitly

* updated with isinstance check if the documentation says there is acceptance of floats

* ran pep8 linter on modified files

* moving the test case to estimators_check

* add a predict function into the testing pandas.Series class

* avoid running anything beyond the newly added meta checks

* check if pandas is installed before running the specific test

* changed the order of the try-catch to check for sample_weight param beforehand

* pass on import error rather than printing something to std out

* improve test case naming and pd.Series check in the bad estimator class

* address a pep8 linter error with unused import

* pep8 warning disabled for potential unused import

* throw a warning when SkipTest is raised

* add a SkipTestWarning

* updated the whats_new.rst with this issue

* rebase and fix a spacing issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants
You can’t perform that action at this time.