
sklearn.utils._param_validation.InvalidParameterError: The 'zero_division' parameter of precision_score must be a float among {0.0, 1.0, nan} or a str among {'warn'}. Got nan instead #27563

Closed
jolespin opened this issue Oct 10, 2023 · 4 comments · Fixed by #27573

jolespin commented Oct 10, 2023

Describe the bug

I'm trying to use precision_score with np.nan for zero_division. It doesn't work with cross_val_score, but it does work when I do manual cross-validation with the same train/validation pairs.

Steps/Code to Reproduce

Here's the data files to reproduce:
sklearn_data.pkl.zip

import pickle
import numpy as np
from sklearn.metrics import precision_score
from sklearn.model_selection import cross_val_score

# Load in data
with open("sklearn_data.pkl", "rb") as f:
    objects = pickle.load(f)


# > objects.keys()
# dict_keys(['estimator', 'X', 'y', 'scoring', 'cv', 'n_jobs'])

estimator = objects["estimator"]
X = objects["X"]
y = objects["y"]
scoring = objects["scoring"]
cv = objects["cv"]
n_jobs = objects["n_jobs"]

# > scoring
# make_scorer(precision_score, pos_label=Case_0, zero_division=nan)

# > y.unique()
# ['Control', 'Case_0']
# Categories (2, object): ['Case_0', 'Control']

# First I checked to make sure that there are both classes in all the training and validation pairs
pos_label = "Case_0"
control_label = "Control"
for index_training, index_validation in cv:
    assert y.iloc[index_training].nunique() == 2
    assert y.iloc[index_validation].nunique() == 2
    assert pos_label in y.values
    assert control_label in y.values

# If I run manually:
scores = list()
for index_training, index_validation in cv:
    estimator.fit(X.iloc[index_training], y.iloc[index_training])
    y_hat = estimator.predict(X.iloc[index_validation])
    score = precision_score(y_true=y.iloc[index_validation], y_pred=y_hat, pos_label=pos_label)
    scores.append(score)
# > print(np.mean(scores))
# 0.501156937317928

# If I use cross_val_score:
cross_val_score(estimator=estimator, X=X, y=y, cv=cv, scoring=scoring, n_jobs=n_jobs)
/Users/jespinoz/anaconda3/envs/soothsayer_py3.9_env2/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:839: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details: 
Traceback (most recent call last):
  File "/Users/jespinoz/anaconda3/envs/soothsayer_py3.9_env2/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 136, in __call__
    score = scorer._score(
  File "/Users/jespinoz/anaconda3/envs/soothsayer_py3.9_env2/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 355, in _score
    return self._sign * self._score_func(y_true, y_pred, **scoring_kwargs)
  File "/Users/jespinoz/anaconda3/envs/soothsayer_py3.9_env2/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 201, in wrapper
    validate_parameter_constraints(
  File "/Users/jespinoz/anaconda3/envs/soothsayer_py3.9_env2/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 95, in validate_parameter_constraints
    raise InvalidParameterError(
sklearn.utils._param_validation.InvalidParameterError: The 'zero_division' parameter of precision_score must be a float among {0.0, 1.0, nan} or a str among {'warn'}. Got nan instead.

Expected Results

0.501156937317928

Actual Results

/Users/jespinoz/anaconda3/envs/soothsayer_py3.9_env2/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:839: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details: 
Traceback (most recent call last):
  File "/Users/jespinoz/anaconda3/envs/soothsayer_py3.9_env2/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 136, in __call__
    score = scorer._score(
  File "/Users/jespinoz/anaconda3/envs/soothsayer_py3.9_env2/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 355, in _score
    return self._sign * self._score_func(y_true, y_pred, **scoring_kwargs)
  File "/Users/jespinoz/anaconda3/envs/soothsayer_py3.9_env2/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 201, in wrapper
    validate_parameter_constraints(
  File "/Users/jespinoz/anaconda3/envs/soothsayer_py3.9_env2/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 95, in validate_parameter_constraints
    raise InvalidParameterError(
sklearn.utils._param_validation.InvalidParameterError: The 'zero_division' parameter of precision_score must be a float among {0.0, 1.0, nan} or a str among {'warn'}. Got nan instead.

Versions

System:
    python: 3.9.16 | packaged by conda-forge | (main, Feb  1 2023, 21:42:20)  [Clang 14.0.6 ]
executable: /Users/jespinoz/anaconda3/envs/soothsayer_py3.9_env2/bin/python
   machine: macOS-13.4.1-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.3.1
          pip: 22.0.3
   setuptools: 60.7.1
        numpy: 1.24.4
        scipy: 1.8.0
       Cython: 0.29.27
       pandas: 1.4.0
   matplotlib: 3.7.1
       joblib: 1.3.2
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /Users/jespinoz/anaconda3/envs/soothsayer_py3.9_env2/lib/libopenblasp-r0.3.18.dylib
        version: 0.3.18
threading_layer: openmp
   architecture: Haswell
    num_threads: 16

       user_api: openmp
   internal_api: openmp
         prefix: libomp
       filepath: /Users/jespinoz/anaconda3/envs/soothsayer_py3.9_env2/lib/libomp.dylib
        version: None
    num_threads: 16
jolespin added the Bug and Needs Triage labels on Oct 10, 2023
glemaitre (Member) commented:

Your manual cross-validation does not use scoring. If it were using scoring, it would raise. For instance:

In [1]: from sklearn.metrics import precision_score, make_scorer
In [2]: from sklearn.ensemble import RandomForestClassifier
In [3]: import numpy as np
In [4]: from sklearn.datasets import make_classification
In [5]: X, y = make_classification(random_state=0)
In [6]: classifier = RandomForestClassifier(random_state=0).fit(X, y)
In [7]: scoring = make_scorer(precision_score, zero_division=np.nan)
In [8]: scoring(classifier, X, y)
Out[8]: 1.0
In [9]: scoring = make_scorer(precision_score, zero_division="nan")
In [9]: scoring(classifier, X, y)
---------------------------------------------------------------------------
InvalidParameterError                     Traceback (most recent call last)
Cell In[12], line 1
----> 1 scoring(classifier, X, y)

File ~/Documents/packages/scikit-learn/sklearn/metrics/_scorer.py:265, in _BaseScorer.__call__(self, estimator, X, y_true, sample_weight, **kwargs)
    262 if sample_weight is not None:
    263     _kwargs["sample_weight"] = sample_weight
--> 265 return self._score(partial(_cached_call, None), estimator, X, y_true, **_kwargs)

File ~/Documents/packages/scikit-learn/sklearn/metrics/_scorer.py:361, in _PredictScorer._score(self, method_caller, estimator, X, y_true, **kwargs)
    359 y_pred = method_caller(estimator, "predict", X)
    360 scoring_kwargs = {**self._kwargs, **kwargs}
--> 361 return self._sign * self._score_func(y_true, y_pred, **scoring_kwargs)

File ~/Documents/packages/scikit-learn/sklearn/utils/_param_validation.py:201, in validate_params.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    198 to_ignore += ["self", "cls"]
    199 params = {k: v for k, v in params.arguments.items() if k not in to_ignore}
--> 201 validate_parameter_constraints(
    202     parameter_constraints, params, caller_name=func.__qualname__
    203 )
    205 try:
    206     with config_context(
    207         skip_parameter_validation=(
    208             prefer_skip_nested_validation or global_skip_validation
    209         )
    210     ):

File ~/Documents/packages/scikit-learn/sklearn/utils/_param_validation.py:95, in validate_parameter_constraints(parameter_constraints, params, caller_name)
     89 else:
     90     constraints_str = (
     91         f"{', '.join([str(c) for c in constraints[:-1]])} or"
     92         f" {constraints[-1]}"
     93     )
---> 95 raise InvalidParameterError(
     96     f"The {param_name!r} parameter of {caller_name} must be"
     97     f" {constraints_str}. Got {param_val!r} instead."
     98 )

InvalidParameterError: The 'zero_division' parameter of precision_score must be a float among {0.0, 1.0, nan} or a str among {'warn'}. Got 'nan' instead.

So the problematic line is:

make_scorer(precision_score, pos_label=Case_0, zero_division=nan)

Could you give some information regarding what nan value is used there?

glemaitre added the Needs Info label and removed the Needs Triage label on Oct 11, 2023
glemaitre (Member) commented:

@jeremiedbb I was checking how we check for np.nan:

Options(Real, {0.0, 1.0, np.nan}),

I assume it should be fine because the check is effectively np.nan in {0.0, 1.0, np.nan}, so it should not be a problem.
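For context, plain-Python containment checks object identity before equality, which is why the very same np.nan object can satisfy such a check even though nan never compares equal to itself. A small illustration with generic Python (not the actual Options implementation):

import numpy as np

allowed = {0.0, 1.0, np.nan}

# The same float object passes the containment check via identity,
# even though nan is never equal to itself.
print(np.nan == np.nan)         # False
print(np.nan in allowed)        # True

# A *different* nan object fails: identity differs and equality is False.
print(float("nan") in allowed)  # False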

jolespin (Author) commented Oct 11, 2023

There's some strange behavior going on.

Could you give some information regarding what nan value is used there?

I actually used np.nan, but it just shows as nan when viewing the object.
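(As a side note on why the printed scorer is ambiguous: its repr appears to format kwargs with str(), the same way pos_label=Case_0 is shown without quotes, so a float nan and the string "nan" render identically there. A quick illustration:)

import numpy as np

# str() renders the float nan and the string "nan" the same way,
# so they cannot be told apart in output like zero_division=nan.
print(f"zero_division={np.nan}")   # zero_division=nan  (float)
print(f"zero_division={'nan'}")    # zero_division=nan  (str)

# repr() does distinguish them.
print(repr(np.nan), repr("nan"))   # nan 'nan'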

This one is a little more explicit:

import pickle
import numpy as np
from sklearn.metrics import precision_score, make_scorer
from sklearn.model_selection import cross_val_score

# Load in data
with open("sklearn_data.pkl", "rb") as f:
    objects = pickle.load(f)


# > objects.keys()
# dict_keys(['estimator', 'X', 'y', 'scoring', 'cv', 'n_jobs'])

estimator = objects["estimator"]
X = objects["X"]
y = objects["y"]
scoring = objects["scoring"]
cv = objects["cv"]
n_jobs = objects["n_jobs"]

# The pickled scorer, reconstructed explicitly so the zero_division value is unambiguous:
scoring = make_scorer(precision_score, pos_label="Case_0", zero_division=np.nan)

# > y.unique()
# ['Control', 'Case_0']
# Categories (2, object): ['Case_0', 'Control']

# First I checked to make sure that there are both classes in all the training and validation pairs
pos_label = "Case_0"
control_label = "Control"
for index_training, index_validation in cv:
    assert y.iloc[index_training].nunique() == 2
    assert y.iloc[index_validation].nunique() == 2
    assert pos_label in y.values
    assert control_label in y.values

# If I run manually using precision_score function
scores = list()
for index_training, index_validation in cv:
    estimator.fit(X.iloc[index_training], y.iloc[index_training])
    y_hat = estimator.predict(X.iloc[index_validation])
    score = precision_score(y_true=y.iloc[index_validation], y_pred=y_hat, pos_label=pos_label)
    scores.append(score)
# > print(np.mean(scores))
# 0.501156937317928

# Now using the scorer
scores = list()
for index_training, index_validation in cv:
    estimator.fit(X.iloc[index_training], y.iloc[index_training])
    y_hat = estimator.predict(X.iloc[index_validation])
    # score = precision_score(y_true = y.iloc[index_validation], y_pred=y_hat, pos_label=pos_label)
    score = scoring(estimator, X=X.iloc[index_validation], y_true=y.iloc[index_validation])
    scores.append(score)
print(np.mean(scores))
# 0.501156937317928

# If I use cross_val_score:
cross_val_score(estimator=estimator, X=X, y=y, cv=cv, scoring=scoring, n_jobs=n_jobs)
# /Users/jespinoz/anaconda3/envs/soothsayer_py3.9_env2/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:839: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details: 
# Traceback (most recent call last):
#   File "/Users/jespinoz/anaconda3/envs/soothsayer_py3.9_env2/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 136, in __call__
#     score = scorer._score(
#   File "/Users/jespinoz/anaconda3/envs/soothsayer_py3.9_env2/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 355, in _score
#     return self._sign * self._score_func(y_true, y_pred, **scoring_kwargs)
#   File "/Users/jespinoz/anaconda3/envs/soothsayer_py3.9_env2/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 201, in wrapper
#     validate_parameter_constraints(
#   File "/Users/jespinoz/anaconda3/envs/soothsayer_py3.9_env2/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 95, in validate_parameter_constraints
#     raise InvalidParameterError(
# sklearn.utils._param_validation.InvalidParameterError: The 'zero_division' parameter of precision_score must be a float among {0.0, 1.0, nan} or a str among {'warn'}. Got nan instead.
...
[A bunch of these]

glemaitre (Member) commented Oct 11, 2023

OK, so here is a minimal reproducer:

# %%
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, make_scorer

X, y = make_classification(weights=[0.3, 0.7], random_state=0)
X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
y = pd.Series(y, name='target').apply(lambda x: 'class_1' if x == 1 else 'class_0')

classifier = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
scoring = make_scorer(precision_score, pos_label='class_0', zero_division=np.nan)
print(scoring(classifier, X, y))

# %%
from sklearn.model_selection import cross_val_score

print(cross_val_score(classifier, X, y, scoring=scoring, n_jobs=2))

The culprit is n_jobs=2. With a single job we don't have any issue. I assume that each subprocess has a different object representing np.nan, therefore the np.nan in {0.0, 1.0, np.nan} check fails.
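That hypothesis is easy to sanity-check: with n_jobs=2 the scorer and its kwargs have to be serialized to reach the worker, and a pickle round-trip yields a new float object holding nan, so neither identity nor equality matches the np.nan stored in the constraint. A minimal sketch of the effect, with plain pickle standing in for whatever serialization joblib actually performs:

import pickle
import numpy as np

allowed = {0.0, 1.0, np.nan}

# In the parent process the original object is found via identity.
print(np.nan in allowed)                     # True

# A pickle round-trip produces a *different* float object holding nan:
# identity fails and nan == nan is False, so membership fails as well.
roundtripped = pickle.loads(pickle.dumps(np.nan))
print(roundtripped is np.nan)                # False
print(roundtripped == np.nan)                # False
print(roundtripped in allowed)               # False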

We need to make the _NanConstraint public and use it instead of Options.
