Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor chi-squared distribution #77

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

mp4096
Copy link
Contributor

@mp4096 mp4096 commented Oct 19, 2017

Ok, this is a larger one:

Mode

Fix formula for the mode. Math.NET implementation has an error for freedom < 2 and faulty unit tests as well.

Pdf and log density

When investigating this unit test:

// TODO: figure out why this test fails:
test::check_continuous_distribution(&try_create(1.0), 0.0, 10.0);

I realised the problem two-fold: First, the old pdf implementation returned Inf at 0 instead of 0. Second, for freedom == 1.0 the pdf is very steep around 0 and cannot be integrated with the hard-coded step size. So I implemented a branching in the pdf and log density (also see Wikipedia) and changed freedom to 1.5.

Actually, Math.NET has a little bit more branching in the chi-squared pdf, but the bulk of it is covered in statrs's gamma.rs implementation. I'm not sure if we want to cover the case of freedom = Inf. @boxtown What's your opinion on this? If yay, I'll add this branching to pdf() and ln_pdf() and uncomment the unit tests. If nay, I'll just delete the commented unit tests.

Misc

And of course added more unit tests, but this is kind of self-explanatory.

@codecov
Copy link

codecov bot commented Oct 19, 2017

Codecov Report

Merging #77 into master will increase coverage by 0.42%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #77      +/-   ##
==========================================
+ Coverage   92.24%   92.67%   +0.42%     
==========================================
  Files          44       44              
  Lines        7187     7299     +112     
==========================================
+ Hits         6630     6764     +134     
+ Misses        557      535      -22
Impacted Files Coverage Δ
src/distribution/chi_squared.rs 93.78% <100%> (+35.75%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 95fb051...2df72f2. Read the comment docs.

if x > 0.0 {
self.g.ln_pdf(x)
} else {
-f64::INFINITY
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

conventionally we've been using f64::NEG_INFINITY, I'm not actually 100% sure -f64::INFINITY == f64::NEG_INFINITY

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I completely missed this one.

BTW, I didn't know that these constants are defined like this 😁 :

pub const INFINITY: f64 = 1.0f64 / 0.0f64
pub const NEG_INFINITY: f64 = -1.0f64 / 0.0f64

@@ -310,7 +310,11 @@ impl Continuous<f64, f64> for ChiSquared {
///
/// where `k` is the degrees of freedom and `Γ` is the gamma function
fn pdf(&self, x: f64) -> f64 {
self.g.pdf(x)
if x > 0.0 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can be more specific, x == 0.0 is defined (at least according to Wikipedia) as long as freedom != 1.0, so we can probably do something like if x > 0.0 || (self.freedom != 1.0 && x == 0.0). May want to consider prec::almost_eq instead of == or != for floats but I'm just spitballing

Copy link
Contributor Author

@mp4096 mp4096 Oct 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I screwed up here. Should've done better research.

The thing is a little bit more complicated. Wikipedia silently assumes only integer degrees of freedom. If we take freedom as a positive real, then we have 3 cases:

  1. 0 < freedom < 2: pdf(0) = +∞, hence 0 is excluded from the support
  2. freedom == 2: pdf(0) = 0.5
  3. 2 < freedom: pdf(0) = 0

This is obvious from the pdf formula (see Wikipedia), specifically the x^(k/2 - 1) term. All other terms are non-zero for any freedom.


Anyway, I've tried out scipy's implementation of chi-squared. Its behaviour is exactly as above (i.e. +∞, 0.5 and 0).


So the question is, what to do for the case when 0 < freedom < 2: Should we let the pdf function return +∞ (as scipy does)? Or should we handle it as a separate case?

My personal preference right now is to return +∞, but document this behaviour in the documentation.


Edit: wording.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's fine by me as long as it's clear in the documentation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok! Just to make sure: You prefer to return +∞ when 0 < freedom < 2 and pdf is evaluated at 0.

Then I'll rewrite the unit tests and update the docs.

test_almost(2.5, 0.045412171451573920401, 1e-15, |x| x.pdf(5.5));
test_almost(2.5, 1.8574923023527248767e-24, 1e-36, |x| x.pdf(110.1));
test_case(2.5, 0.0, |x| x.pdf(f64::INFINITY));
// test_case(f64::INFINITY, 0.0, |x| x.pdf(0.0));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does pdf return for freedom == f64::INFINITY normally? I'd like to have this case defined but am not necessarily sure we need an explicit branch in the code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It returns a NaN. I guess as a result of 0 * Inf when computing the pdf.

TBH I don't like the idea of returning 0 instead of NaN because it doesn't make sense for me... Shouldn't a pdf integrate to 1 over support? How can it be upheld when pdf() returns 0 everywhere?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. I think maybe being explicit about the NaN would be good, like adding a specific branch and note it in the docstring

@boxtown
Copy link
Collaborator

boxtown commented Oct 30, 2017 via email

@boxtown
Copy link
Collaborator

boxtown commented Jan 9, 2018

Hey @mp4096 what's the status on this PR? Do you need any help bringing it across the finish line?

@vks vks mentioned this pull request May 15, 2021
7 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants