Incorrect predictions from isotonic regression in the presence of repeated X values #9432
Isotonic regression returns an incorrect solution in the presence of repeated X values.
The problem seems to be that fit() begins by taking the weighted average of points with the same x value. However, rather than associating the resulting average point with the sum of the weights, it uses the average of the weights (in _make_unique(), in _isotonic.pyx, lines 103 and 116). When fitting the isotonic regression, the combined points then count for less than they should (according to the objective), and incorrect fitted values are returned.
It is conceivable that this is the intended behaviour (i.e. if the objective in the documentation refers to the output of _make_unique, rather than the original data), but it seems to conflict with the documentation, which explicitly refers to the possibility of equality among X values. The docs also mention that ties are broken using the "secondary method from Leeuw, 1977", but that too seems to sum the weights rather than averaging them (page 2, 6th line from the bottom).
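To spell out why the summed weight is the one that preserves the objective, here is a short derivation. It assumes the tied points in a group G (all sharing one x value) are constrained to a single fitted value c; the symbols G, c, W_G and \bar{y}_G are my notation, not taken from the documentation.

```latex
\sum_{i \in G} w_i (y_i - c)^2
  = W_G \,(\bar{y}_G - c)^2 + \sum_{i \in G} w_i (y_i - \bar{y}_G)^2,
\qquad
W_G = \sum_{i \in G} w_i,
\quad
\bar{y}_G = \frac{1}{W_G} \sum_{i \in G} w_i y_i .
```

The second term does not depend on c, so replacing the group by the single point \bar{y}_G leaves the minimiser unchanged only if that point carries weight W_G, the sum of the original weights. Any other weight, including the average, changes how strongly the group pulls on neighbouring points when blocks are pooled.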
If I have missed something, and this is the intended behaviour, I think we should clarify the documentation, and possibly add an option to not average in this way. However, I would advocate for summing rather than averaging as the default.
Steps/Code to Reproduce
This code presents a simple example that returns an incorrect result (according to the stated objective). It then uses the sample_weight option to artificially up-weight the points which get combined (at x=1). This results in a lower squared error on the original data, demonstrating that the default behaviour does not achieve the minimum.
I believe the output of isotonic regression should predict [0, 0.33, 0.33, 0.33, 1] at the original input points, with squared error 0.667. _make_unique should produce x = (0, 1, 2, 3), y = (0, 0.5, 0, 1), sample_weight = (1, 2, 1, 1).
The actual predicted values at these points are [0, 0.25, 0.25, 0.25, 1], with squared error 0.688. _make_unique is producing x = (0, 1, 2, 3), y = (0, 0.5, 0, 1), sample_weight = (1, 1, 1, 1).
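The two outcomes above can be checked independently of scikit-learn with a small pure-Python sketch. This is not the library's implementation: make_unique, pava and fit_predict are hypothetical helpers written for illustration, and the data x = [0, 1, 1, 2, 3], y = [0, 0, 1, 0, 1] is an assumption inferred from the values reported above.

```python
from itertools import groupby


def make_unique(x, y, w, combine="sum"):
    """Collapse points sharing an x value into one point (x must be sorted).

    combine="sum" gives the merged point the total weight of its group;
    combine="mean" gives it the average weight (the behaviour reported here).
    """
    xs, ys, ws = [], [], []
    for key, group in groupby(zip(x, y, w), key=lambda t: t[0]):
        pts = list(group)
        total = sum(p[2] for p in pts)
        xs.append(key)
        ys.append(sum(p[1] * p[2] for p in pts) / total)  # weighted mean of y
        ws.append(total if combine == "sum" else total / len(pts))
    return xs, ys, ws


def pava(y, w):
    """Weighted pool-adjacent-violators for a non-decreasing fit."""
    blocks = []  # each block: [weighted mean, total weight, point count]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, n1 + n2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


def fit_predict(x, y, combine):
    """Fit on the de-duplicated data, then predict at the original x values."""
    ux, uy, uw = make_unique(x, y, [1.0] * len(x), combine)
    fitted = dict(zip(ux, pava(uy, uw)))
    return [fitted[xi] for xi in x]


def squared_error(y, pred):
    return sum((a - b) ** 2 for a, b in zip(y, pred))


# Assumed data, chosen to match the values quoted in this report.
x = [0, 1, 1, 2, 3]
y = [0, 0, 1, 0, 1]

pred_sum = fit_predict(x, y, "sum")    # middle values about 0.33
pred_mean = fit_predict(x, y, "mean")  # middle values 0.25

print(pred_sum, squared_error(y, pred_sum))    # error about 0.667
print(pred_mean, squared_error(y, pred_mean))  # error 0.6875
```

With summed weights the sketch gives the expected predictions and the lower squared error; with averaged weights it gives the values currently returned.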