Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fixes gh-12218, TypeError converting int to float inside stats.ks_2samp #12280

Merged
merged 2 commits into from Jun 28, 2020

Conversation

pvanmulbregt
Copy link
Contributor

A multiplication of large integers fed as input to np.sqrt exceeded its ability
to convert to float, generating a TypeError inside stats.ks_2samp() for large
samples. If both sample sizes were larger than about 2^21 (~cube root of 2^64,
or about 2million), this condition would be triggered.

Reference issue

Closes gh-12218

What does this implement/fix?

Converts some large integers to floats before multiplying them. Python could handle the large result as an int, but np.sqrt() couldn't.

A multiplication of large integers fed as input to np.sqrt exceeded its ability
to convert to float, generating a TypeError inside stats.ks_2samp() for large
samples.  If both sample sizes were larger than about 2^21 (~cube root of 2^64,
or about 2million), this condition would be triggered.
Copy link
Member

@WarrenWeckesser WarrenWeckesser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank @pvanmulbregt, this looks good. I made one small suggestion inline.

@@ -6731,13 +6731,16 @@ def ks_2samp(data1, data2, alternative='two-sided', mode='auto'):

if mode == 'asymp':
# The product n1*n2 is large. Use Smirnov's asymptoptic formula.
# Ensure float to avoid overflow in multiplication
# sorted because the one-sided formula is not symmetric in n1, n2
m, n = sorted(np.array([n1, n2], float), reverse=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to create a numpy array here. This is quite a bit faster:

Suggested change
m, n = sorted(np.array([n1, n2], float), reverse=True)
m, n = sorted([float(n1), float(n2)], reverse=True)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Replace the use of np.array([]) with [] as the np array doesn't provide
any additional value.
@WarrenWeckesser WarrenWeckesser merged commit f974e4b into scipy:master Jun 28, 2020
@WarrenWeckesser
Copy link
Member

Thanks @pvanmulbregt, merged.

@tylerjereddy tylerjereddy added this to the 1.6.0 milestone Jun 28, 2020
@tylerjereddy tylerjereddy added defect A clear bug or issue that prevents SciPy from being installed or used as expected backport-candidate This fix should be ported by a maintainer to previous SciPy versions. labels Jun 28, 2020
@tylerjereddy tylerjereddy modified the milestones: 1.6.0, 1.5.1 Jul 4, 2020
tylerjereddy pushed a commit to tylerjereddy/scipy that referenced this pull request Jul 4, 2020
…ats.ks_2samp (scipy#12280)

A multiplication of large integers fed as input to np.sqrt exceeded its ability
to convert to float, generating a TypeError inside stats.ks_2samp() for large
samples.  If both sample sizes were larger than about 2^21 (~cube root of 2^64,
or about 2million), this condition would be triggered.
@tylerjereddy tylerjereddy removed the backport-candidate This fix should be ported by a maintainer to previous SciPy versions. label Jul 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
defect A clear bug or issue that prevents SciPy from being installed or used as expected scipy.stats
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Type error in stats.ks_2samp when alternative != 'two-sided-
3 participants