
Enforce positive sample_weight #15531

Open
rth opened this issue Nov 3, 2019 · 14 comments

@rth
Member

rth commented Nov 3, 2019

As discussed in #12464 (comment) and #15358 (comment), in some specific use cases negative sample_weight values may be meaningful; in most cases, however, they should never happen.

So we may want to:

  • add a force_positive=None parameter to _check_sample_weight.
  • add an assume_positive_sample_weights=True config parameter to sklearn.set_config / get_config.

By default, force_positive=None would error on negative sample weights, but this check could be disabled globally with sklearn.set_config(assume_positive_sample_weights=False).

With _check_sample_weight(..., force_positive=True) the check would always be performed, irrespective of the config parameter.

If there are no objections to this, please tag this issue as "Help Wanted".

@rth rth changed the title Enforce positive sample weight Enforce positive sample_weight Nov 3, 2019
@rth rth added the Enhancement label Nov 4, 2019
@jnothman
Member

jnothman commented Nov 4, 2019 via email

@rth rth added the help wanted label Nov 4, 2019
@lorentzenchr
Member

I think zero should always be allowed as a valid value for sample weights, but sum(sw) > 0 should hold.

As for our dear HEP friends who want negative sw: in linear models, sample weights are often implemented by scaling X and y by sqrt(sw). This makes supporting negative values difficult.
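The difficulty can be illustrated with NumPy alone: for non-negative weights, rescaling X and y by sqrt(sw) reduces weighted least squares to ordinary least squares, but a single negative weight makes the rescaling produce NaNs (the variable names below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=20)
sw = rng.uniform(0.5, 2.0, size=20)

# Weighted least squares via the sqrt(sw) rescaling used in linear models.
sq = np.sqrt(sw)
coef_rescaled, *_ = np.linalg.lstsq(X * sq[:, None], y * sq, rcond=None)

# Same fit via the weighted normal equations (X' W X) beta = X' W y.
coef_direct = np.linalg.solve(X.T @ (X * sw[:, None]), X.T @ (sw * y))
assert np.allclose(coef_rescaled, coef_direct)

# A negative weight breaks the trick: sqrt of a negative number is NaN.
sw[0] = -1.0
assert np.isnan(np.sqrt(sw)[0])
```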

@rth rth changed the title Enforce positive sample_weight Enforce strictly positive sample_weight Nov 4, 2019
@rth rth changed the title Enforce strictly positive sample_weight Enforce positive sample_weight Nov 4, 2019
@NicolasHug
Member

I like the config, but do we currently have a need for the first parameter? If we don't, maybe it's not worth implementing it (now).

Also, erroring by default on negative sw would technically be a backward incompatible change, though we may disguise it as a bug fix

@glemaitre
Member

Also, erroring by default on negative sw would technically be a backward incompatible change, though we may disguise it as a bug fix

It would depend on the estimator. Some estimators should support negative sample weights, while others currently error or should error (such as SVM, which was failing to validate the weights). We should probably have a common test and a specific tag indicating whether or not an estimator supports negative weights.

That said, I am not against being able to activate/deactivate the support for negative sample weights. If negative weights are really uncommon, we should probably error most of the time and allow overriding this behaviour with the config, unless the estimator does not support negative weights (maybe via the tag).

@adrinjalali
Member

If there are estimators which support negative sample weight, then it should be an estimator tag, and not a global config. @glemaitre which ones logically support a negative sample weight now?

@NicolasHug
Member

If there are estimators which support negative sample weight, then it should be an estimator tag, and not a global config

I think the config still makes sense since negative sample weights don't make sense in the vast majority of applications. You really want to know what you're doing when using negative SW.

Whether estimators can support them is another thing, independent of the application.

@adrinjalali
Member

Another issue with the global config is that assume_positive_sample_weights=False kinda implies that negative sample weights are okay and we support them, but we don't.

@rth
Member Author

rth commented Nov 4, 2019

assume_positive_sample_weights=False kinda implies that negative sample weights are okay and we support them, but we don't.

It's the same idea as the assume_finite global config parameter: it only removes one check, and it does not add any expectation that models would work with infinite input there (or negative sample weights here).
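For reference, the existing assume_finite switch works exactly this way and can be toggled through the public config API (sklearn.set_config / sklearn.config_context):

```python
import sklearn

# assume_finite skips the NaN/inf check in input validation; it does not
# promise that estimators produce sensible results on non-finite data.
assert sklearn.get_config()["assume_finite"] is False  # default value

with sklearn.config_context(assume_finite=True):
    assert sklearn.get_config()["assume_finite"] is True

# Outside the context manager the default is restored.
assert sklearn.get_config()["assume_finite"] is False
```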

@rth
Member Author

rth commented Nov 4, 2019

which ones logically support a negative sample weight now?

See #3774 (comment) for more context.

This issue is really about adding checks that would benefit the vast majority of users, while avoiding breaking existing functionality for a small fraction of users in HEP.

We definitely don't want support for negative sample weights in more models (unless someone has a very compelling argument for it) and I expect that most models simply won't work with negative sample weights irrespective of this config parameter.

@glemaitre
Member

glemaitre commented Sep 2, 2021

#20880 will add the parameter only_non_negative to _check_sample_weight. For the moment we will update LinearRegression, AdaBoostClassifier, AdaBoostRegressor, and KernelDensity, which are the estimators that currently check for non-negative weights.

However, we should make a larger audit of our estimators with common tests. Some estimators should not accept negative weights and will break explicitly (for instance, due to taking the sqrt of negative weights) or perhaps more silently with nonsensical results.

@simonandras
Contributor

simonandras commented Sep 11, 2021

There is a test, test_has_fit_parameter(), where assertions are made about the existence of the sample_weight parameter of some estimators' fit method. This list (or something similar elsewhere) could be completed with all the estimators, to keep track of where we have a sample_weight parameter (and where we don't). In addition, for all estimators there should be a decision about the negative-weight case. If it is not supported, an error would be raised in all cases. This would also be tested. What do you think?

I created a list of all estimators (and their inherited classes), helper functions, and metric functions that use a sample_weight or sample_weights named parameter anywhere. Using this, one can make decisions about each function or class in question.
sample_weight.txt

@glemaitre
Member

This would also be tested. What do you think?

Indeed, we should create a common test that by default tests all estimators. We should skip the test when sample_weight is not in the signature of fit. In addition, we also need a tag to specify whether weights can be negative. In this manner, it will allow testing for consistency.
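A minimal sketch of the skip logic such a common test could build on (has_fit_parameter is the existing helper; the negative-weight tag itself would still be hypothetical):

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.utils.validation import has_fit_parameter

# A common test would first skip estimators whose fit() does not accept
# sample_weight, then consult a (hypothetical) estimator tag such as
# "allow_negative_sample_weight" to decide which behaviour to assert.
assert has_fit_parameter(LinearRegression, "sample_weight")
assert not has_fit_parameter(PCA, "sample_weight")
```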

I created a list of all estimators (and its inherited classes)

Could you paste the list directly in the issue and hide it with the <details> HTML tag?

@simonandras
Contributor

The list is updated.

```
# List of all estimators (and their inherited classes), helper functions, and
# metric functions that use a sample_weight or sample_weights named parameter
# anywhere. A name followed by () indicates a function; a name without
# parentheses is a class. Matches in test files are excluded.
Estimators (the indented classes are inherited):
	BaseDecisionTree : sklearn/tree/_classes.py
	  - DecisionTreeClassifier : sklearn/tree/_classes.py
	    - ExtraTreeClassifier : sklearn/tree/_classes.py
	  - DecisionTreeRegressor : sklearn/tree/_classes.py
	    - ExtraTreeRegressor: sklearn/tree/_classes.py
	LinearSVC : sklearn/svm/_classes.py
	LinearSVR : sklearn/svm/_classes.py
	BaseLibSVM : sklearn/svm/_base.py
	  - BaseSVC : sklearn/svm/_base.py
	    - SVC : sklearn/svm/_base.py
	    - NuSVC : sklearn/svm/_base.py
	  - SVR : sklearn/svm/_classes.py
	  - NuSVR : sklearn/svm/_classes.py
	  - OneClassSVM : sklearn/svm/_classes.py
	SplineTransformer : sklearn/preprocessing/_polynomial.py
	StandardScaler : sklearn/preprocessing/_data.py
	KernelDensity : sklearn/neighbors/_kde.py
	GaussianNB : sklearn/naive_bayes.py
	_BaseDiscreteNB : sklearn/naive_bayes.py
	  - MultinomialNB : sklearn/naive_bayes.py
	  - ComplementNB : sklearn/naive_bayes.py
	  - BernoulliNB : sklearn/naive_bayes.py
	  - CategoricalNB : sklearn/naive_bayes.py
	_MultiOutputEstimator : sklearn/multioutput.py
	  - MultiOutputRegressor : sklearn/multioutput.py
	  - MultiOutputClassifier : sklearn/multioutput.py
	_BaseScorer : sklearn/metrics/_scorer.py
	  - _PredictScorer : sklearn/metrics/_scorer.py
	  - _ProbaScorer : sklearn/metrics/_scorer.py
	  - _ThresholdScorer : sklearn/metrics/_scorer.py
	BaseSGDClassifier : sklearn/linear_model/_stochastic_gradient.py
	  - SGDClassifier : sklearn/linear_model/_stochastic_gradient.py
	  - PassiveAggressiveClassifier : sklearn/linear_model/_passive_aggressive.py
	BaseSGDRegressor : sklearn/linear_model/_stochastic_gradient.py
	  - SGDRegressor : sklearn/linear_model/_stochastic_gradient.py
	  - PassiveAggressiveRegressor : sklearn/linear_model/_passive_aggressive.py
	SGDOneClassSVM : sklearn/linear_model/_stochastic_gradient.py
	_BaseRidge : sklearn/linear_model/_ridge.py
	  - Ridge : sklearn/linear_model/_ridge.py
	  - RidgeClassifier : sklearn/linear_model/_ridge.py
	_RidgeGCV : sklearn/linear_model/_ridge.py
	_BaseRidgeCV : sklearn/linear_model/_ridge.py
	  - RidgeCV : sklearn/linear_model/_ridge.py
	  - RidgeClassifierCV : sklearn/linear_model/_ridge.py
	RANSACRegressor : sklearn/linear_model/_ransac.py
	QuantileRegressor : sklearn/linear_model/_quantile.py
	LogisticRegression : sklearn/linear_model/_logistic.py
	  - LogisticRegressionCV : sklearn/linear_model/_logistic.py
	HuberRegressor : sklearn/linear_model/_huber.py
	GeneralizedLinearRegressor : sklearn/linear_model/_glm/glm.py
	  - PoissonRegressor : sklearn/linear_model/_glm/glm.py
	  - GammaRegressor : sklearn/linear_model/_glm/glm.py
	  - TweedieRegressor : sklearn/linear_model/_glm/glm.py
	ElasticNet : sklearn/linear_model/_coordinate_descent.py
	  - Lasso : sklearn/linear_model/_coordinate_descent.py
	    - MultiTaskElasticNet : sklearn/linear_model/_coordinate_descent.py
	      - MultiTaskLasso : sklearn/linear_model/_coordinate_descent.py
	LinearModelCV : sklearn/linear_model/_coordinate_descent.py
	  - LassoCV : sklearn/linear_model/_coordinate_descent.py
	  - ElasticNetCV : sklearn/linear_model/_coordinate_descent.py
	  - MultiTaskElasticNetCV : sklearn/linear_model/_coordinate_descent.py
	  - MultiTaskLassoCV : sklearn/linear_model/_coordinate_descent.py
	BayesianRidge : sklearn/linear_model/_bayes.py
	LinearRegression : sklearn/linear_model/_base.py
	KernelRidge : sklearn/kernel_ridge.py
	IsotonicRegression : sklearn/isotonic.py
	BaseWeightBoosting : sklearn/ensemble/_weight_boosting.py
	  - AdaBoostClassifier : sklearn/ensemble/_weight_boosting.py
	  - AdaBoostRegressor : sklearn/ensemble/_weight_boosting.py
	_BaseVoting : sklearn/ensemble/_voting.py
	  - VotingClassifier : sklearn/ensemble/_voting.py
	  - VotingRegressor : sklearn/ensemble/_voting.py
	_BaseStacking : sklearn/ensemble/_stacking.py
	  - StackingClassifier : sklearn/ensemble/_stacking.py
	  - StackingRegressor : sklearn/ensemble/_stacking.py
	BaseBagging : sklearn/ensemble/_bagging.py
	  - BaggingClassifier : sklearn/ensemble/_bagging.py
	  - BaggingRegressor : sklearn/ensemble/_bagging.py
	  - IsolationForest : sklearn/ensemble/_iforest.py
	BaseHistGradientBoosting : sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py
	  - HistGradientBoostingRegressor : sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py
	  - HistGradientBoostingClassifier : sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py
	BaseGradientBoosting : sklearn/ensemble/_gb.py
	  - GradientBoostingClassifier : sklearn/ensemble/_gb.py
	  - GradientBoostingRegressor : sklearn/ensemble/_gb.py
	BaseForest : sklearn/ensemble/_forest.py
	  - ForestClassifier : sklearn/ensemble/_forest.py
	    - RandomForestClassifier : sklearn/ensemble/_forest.py
	    - ExtraTreesClassifier : sklearn/ensemble/_forest.py
	  - ForestRegressor : sklearn/ensemble/_forest.py
	    - RandomForestRegressor : sklearn/ensemble/_forest.py
	    - ExtraTreesRegressor : sklearn/ensemble/_forest.py
	  - RandomTreesEmbedding : sklearn/ensemble/_forest.py
	DummyClassifier : sklearn/dummy.py
	DummyRegressor : sklearn/dummy.py
	EllipticEnvelope : sklearn/covariance/_elliptic_envelope.py
	KMeans : sklearn/cluster/_kmeans.py
	  - MiniBatchKMeans : sklearn/cluster/_kmeans.py
	DBSCAN : sklearn/cluster/_dbscan.py
	CalibratedClassifierCV : sklearn/calibration.py
	_SigmoidCalibration : sklearn/calibration.py
	ClassifierMixin : sklearn/base.py
	RegressorMixin : sklearn/base.py
Helper functions (in sklearn/utils):
	_check_sample_weight() : sklearn/utils/validation.py
	_weighted_percentile() : sklearn/utils/stats.py
	count_nonzero() : sklearn/utils/sparsefuncs.py
	class_distribution() : sklearn/utils/multiclass.py
	_incremental_mean_and_var() : sklearn/utils/extmath.py
Estimator helper functions and classes (the indented classes are inherited):
	make_dataset() : sklearn/linear_model/_base.py
	_preprocess_data() : sklearn/linear_model/_base.py
	_rescale_data() : sklearn/linear_model/_base.py
	_weights_scorer() : sklearn/inspection/_permutation_importance.py
	_calculate_permutation_scores() : sklearn/inspection/_permutation_importance.py
	permutation_importance() : sklearn/inspection/_permutation_importance.py
	fit_binary() : sklearn/linear_model/_stochastic_gradient.py
	sag_solver() : sklearn/linear_model/_sag.py
	_solve_cholesky_kernel() : sklearn/linear_model/_ridge.py
	ridge_regression() : sklearn/linear_model/_ridge.py
	_ridge_regression() : sklearn/linear_model/_ridge.py
	_average_binary_score() : sklearn/metrics/_base.py
	_logistic_loss_and_grad() : sklearn/linear_model/_logistic.py
	_logistic_loss() : sklearn/linear_model/_logistic.py
	_multinomial_loss() : sklearn/linear_model/_logistic.py
	_logistic_grad_hess() : sklearn/linear_model/_logistic.py
	_multinomial_loss_grad() : sklearn/linear_model/_logistic.py
	_multinomial_grad_hess() : sklearn/linear_model/_logistic.py
	_logistic_regression_path() : sklearn/linear_model/_logistic.py
	_log_reg_scoring_path() : sklearn/linear_model/_logistic.py
	_huber_loss_and_gradient() : sklearn/linear_model/_huber.py
	_pre_fit() : sklearn/linear_model/_base.py
	isotonic_regression() : sklearn/isotonic.py
	_fit_single_estimator() : sklearn/ensemble/_base.py
	_parallel_build_estimators() : sklearn/ensemble/_bagging.py
	BaseLoss : sklearn/ensemble/_hist_gradient_boosting/loss.py
	  - LeastSquares : sklearn/ensemble/_hist_gradient_boosting/loss.py
	  - LeastAbsoluteDeviation : sklearn/ensemble/_hist_gradient_boosting/loss.py
	  - Poisson : sklearn/ensemble/_hist_gradient_boosting/loss.py
	  - BinaryCrossEntropy : sklearn/ensemble/_hist_gradient_boosting/loss.py
	  - CategoricalCrossEntropy : sklearn/ensemble/_hist_gradient_boosting/loss.py
	LossFunction : sklearn/ensemble/_gb_losses.py
	  - RegressionLossFunction : sklearn/ensemble/_gb_losses.py
	    - LeastSquaresError : sklearn/ensemble/_gb_losses.py
	    - LeastAbsoluteError : sklearn/ensemble/_gb_losses.py
	    - HuberLossFunction : sklearn/ensemble/_gb_losses.py
	    - QuantileLossFunction : sklearn/ensemble/_gb_losses.py
	  - ClassificationLossFunction : sklearn/ensemble/_gb_losses.py
	    - BinomialDeviance : sklearn/ensemble/_gb_losses.py
	    - MultinomialDeviance : sklearn/ensemble/_gb_losses.py
	    - ExponentialLoss : sklearn/ensemble/_gb_losses.py
	_parallel_build_trees() : sklearn/ensemble/_forest.py
	k_means() : sklearn/cluster/_kmeans.py
	_kmeans_single_elkan() : sklearn/cluster/_kmeans.py
	_kmeans_single_lloyd() : sklearn/cluster/_kmeans.py
	_labels_inertia() : sklearn/cluster/_kmeans.py
	_labels_inertia_threadpool_limit() : sklearn/cluster/_kmeans.py
	_mini_batch_step() : sklearn/cluster/_kmeans.py
	dbscan() : sklearn/cluster/_dbscan.py
	_fit_classifier_calibrator_pair() : sklearn/calibration.py
	_fit_calibrator() : sklearn/calibration.py
	_sigmoid_calibration() : sklearn/calibration.py
	_fit_estimator() : sklearn/multioutput.py
	_partial_fit_estimator() : sklearn/multioutput.py
	_path_residuals() : sklearn/linear_model/_coordinate_descent.py
	_fit_liblinear() : sklearn/svm/_base.py
Metrics (in sklearn/metrics):
	mean_absolute_error() : sklearn/metrics/_regression.py
	mean_pinball_loss() : sklearn/metrics/_regression.py
	mean_absolute_percentage_error() : sklearn/metrics/_regression.py
	mean_squared_error() : sklearn/metrics/_regression.py
	mean_squared_log_error() : sklearn/metrics/_regression.py
	median_absolute_error() : sklearn/metrics/_regression.py
	explained_variance_score() : sklearn/metrics/_regression.py
	r2_score() : sklearn/metrics/_regression.py
	mean_tweedie_deviance() : sklearn/metrics/_regression.py
	mean_poisson_deviance() : sklearn/metrics/_regression.py
	mean_gamma_deviance() : sklearn/metrics/_regression.py
	d2_tweedie_score() : sklearn/metrics/_regression.py
	average_precision_score() : sklearn/metrics/_ranking.py
	_binary_uninterpolated_average_precision() : sklearn/metrics/_ranking.py
	det_curve() : sklearn/metrics/_ranking.py
	_binary_roc_auc_score() : sklearn/metrics/_ranking.py
	roc_auc_score() : sklearn/metrics/_ranking.py
	_multiclass_roc_auc_score() : sklearn/metrics/_ranking.py
	_binary_clf_curve() : sklearn/metrics/_ranking.py
	precision_recall_curve() : sklearn/metrics/_ranking.py
	roc_curve() : sklearn/metrics/_ranking.py
	label_ranking_average_precision_score() : sklearn/metrics/_ranking.py
	coverage_error() : sklearn/metrics/_ranking.py
	label_ranking_loss() : sklearn/metrics/_ranking.py
	dcg_score() : sklearn/metrics/_ranking.py
	ndcg_score() : sklearn/metrics/_ranking.py
	top_k_accuracy_score() : sklearn/metrics/_ranking.py
	accuracy_score() : sklearn/metrics/_classification.py
	confusion_matrix() : sklearn/metrics/_classification.py
	multilabel_confusion_matrix() : sklearn/metrics/_classification.py
	cohen_kappa_score() : sklearn/metrics/_classification.py
	jaccard_score() : sklearn/metrics/_classification.py
	matthews_corrcoef() : sklearn/metrics/_classification.py
	zero_one_loss() : sklearn/metrics/_classification.py
	f1_score() : sklearn/metrics/_classification.py
	fbeta_score() : sklearn/metrics/_classification.py
	precision_recall_fscore_support() : sklearn/metrics/_classification.py
	precision_score() : sklearn/metrics/_classification.py
	recall_score() : sklearn/metrics/_classification.py
	balanced_accuracy_score() : sklearn/metrics/_classification.py
	classification_report() : sklearn/metrics/_classification.py
	hamming_loss() : sklearn/metrics/_classification.py
	log_loss() : sklearn/metrics/_classification.py
	hinge_loss() : sklearn/metrics/_classification.py
	brier_score_loss() : sklearn/metrics/_classification.py
	_weighted_sum() : sklearn/metrics/_classification.py
Metric plot (in sklearn/metrics/_plot):
	RocCurveDisplay : sklearn/metrics/_plot/roc_curve.py
	plot_roc_curve() : sklearn/metrics/_plot/roc_curve.py
	PrecisionRecallDisplay : sklearn/metrics/_plot/precision_recall_curve.py
	plot_precision_recall_curve() : sklearn/metrics/_plot/precision_recall_curve.py
	DetCurveDisplay : sklearn/metrics/_plot/det_curve.py
	plot_det_curve() : sklearn/metrics/_plot/det_curve.py
	ConfusionMatrixDisplay : sklearn/metrics/_plot/confusion_matrix.py
	plot_confusion_matrix() : sklearn/metrics/_plot/confusion_matrix.py
Examples (in examples):
	plot_decision_function() : examples/svm/plot_weighted_samples.py
	_mean_frequency_by_risk_group() : examples/linear_model/plot_poisson_regression_non_normal_loss.py
Found "weights", "class_weights", etc. instead of "sample_weights", but with the use
of _check_sample_weight or something related (this search is not exhaustive):
	has_fit_parameter() : sklearn/utils/validation.py
	compute_sample_weight() : sklearn/utils/class_weight.py
	mean_variance_axis() : sklearn/utils/sparsefuncs.py
	incr_mean_variance_axis() : sklearn/utils/sparsefuncs.py
	_ValidationScoreCallback : sklearn/linear_model/_stochastic_gradient.py
```

@simonandras
Contributor

I think we could start with the estimators that have a sample_weight parameter in their fit method. It is a narrower list and can be obtained with a simpler search.

The code for listing:

```
from sklearn.utils import all_estimators
from sklearn.utils.validation import has_fit_parameter

for estimator_name, estimator in all_estimators():
    if has_fit_parameter(estimator, "sample_weight"):
        print(estimator_name)
```

The list:

AdaBoostClassifier
AdaBoostRegressor
BaggingClassifier
BaggingRegressor
BayesianRidge
BernoulliNB
CalibratedClassifierCV
CategoricalNB
ComplementNB
DBSCAN
DecisionTreeClassifier
DecisionTreeRegressor
DummyClassifier
DummyRegressor
ElasticNet
ElasticNetCV
ExtraTreeClassifier
ExtraTreeRegressor
ExtraTreesClassifier
ExtraTreesRegressor
GammaRegressor
GaussianNB
GradientBoostingClassifier
GradientBoostingRegressor
HistGradientBoostingClassifier
HistGradientBoostingRegressor
HuberRegressor
IsolationForest
IsotonicRegression
KMeans
KernelDensity
KernelRidge
Lasso
LassoCV
LinearRegression
LinearSVC
LinearSVR
LogisticRegression
LogisticRegressionCV
MiniBatchKMeans
MultiOutputClassifier
MultiOutputRegressor
MultinomialNB
NuSVC
NuSVR
OneClassSVM
Perceptron
PoissonRegressor
QuantileRegressor
RANSACRegressor
RandomForestClassifier
RandomForestRegressor
RandomTreesEmbedding
Ridge
RidgeCV
RidgeClassifier
RidgeClassifierCV
SGDClassifier
SGDOneClassSVM
SGDRegressor
SVC
SVR
SplineTransformer
StackingClassifier
StackingRegressor
StandardScaler
TweedieRegressor
VotingClassifier
VotingRegressor

8 participants