Add simple metalog implementation #1444

Draft
wants to merge 29 commits into
base: main
Conversation

Hazelfire
Collaborator

@Hazelfire Hazelfire commented Dec 7, 2022

Adds a simple metalog implementation. Doesn't include any fitting yet. Based on https://www.npmjs.com/package/@quri/metalog.

image

Todo:

  • Work out a way to make this much faster

@Hazelfire
Collaborator Author

This implementation works like a charm, except that it is extremely slow! Performance has been coming up a lot lately. I'll see what I can do on that end.

@Hazelfire Hazelfire marked this pull request as draft December 7, 2022 04:47
@codecov-commenter

codecov-commenter commented Dec 22, 2022

Codecov Report

Merging #1444 (435ceab) into develop (04259c3) will decrease coverage by 0.31%.
The diff coverage is n/a.

❗ Current head 435ceab differs from pull request most recent head afffedf. Consider uploading reports for the commit afffedf to get more accurate results

@@             Coverage Diff             @@
##           develop    #1444      +/-   ##
===========================================
- Coverage    71.58%   71.27%   -0.31%     
===========================================
  Files           92       92              
  Lines         4656     4763     +107     
  Branches       853      883      +30     
===========================================
+ Hits          3333     3395      +62     
- Misses        1318     1363      +45     
  Partials         5        5              

see 13 files with indirect coverage changes

@Hazelfire
Collaborator Author

I could probably add fitting functionality to this as well. But this as it stands is a simple implementation.

@OAGr
Contributor

OAGr commented Dec 22, 2022

@NunoSempere can you review this, against other Metalog applications? I remember Eli/Misha had a different one they thought was good. We should make sure that it seems correct.

@OAGr
Contributor

OAGr commented Dec 22, 2022

This returns errors for me, in some cases? (Like the example in the tests)
image

My guess is that this is invalid, but the error here seems wrong.

@Hazelfire
Collaborator Author

Hazelfire commented Dec 22, 2022

Ok, here are my notes with this implementation:

  • The main source of error is the way that cdf is calculated. It runs a mixture of Newton's method and bisection to calculate the inverse of quantile. I originally based it on rmetalog, but rmetalog had two issues:
    1. It was really slow.
    2. It was sometimes not accurate enough. I could calculate the quantile and then the cdf and get numbers that were 0.5 off.
  • I therefore reimplemented cdf. This means that we almost, but not quite, match rmetalog. Because it now passes tests for getting very close to the inverse of quantile, I think mine is more accurate.
  • However, there are a couple of issues with my (and rmetalog's) implementation:
    1. Because cdf works by approximating an inverse of quantile, the cdf function is not always strictly monotonic, although it is most of the time. I think this is what causes the "Xs is not sorted" error.
    2. My implementation of cdf does not conform to the PDF as accurately as I would like. It matches to 1 decimal place but no further.
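
For readers unfamiliar with the approach in the notes above, here is a minimal sketch of inverting a quantile function with Newton's method safeguarded by bisection. It is not the code from @quri/metalog or rmetalog: `invertQuantile`, its parameters, and the tolerance values are assumptions made for illustration. The example uses the logistic quantile function (the 2-term metalog) because its cdf is known in closed form.

```typescript
// Hypothetical helper: invert a strictly increasing function `quantile`
// on (0, 1). A Newton step is used when it stays inside the current
// bracket [lo, hi]; otherwise the bracket is bisected.
function invertQuantile(
  quantile: (q: number) => number,
  derivative: (q: number) => number,
  target: number,
  tolerance = 1e-12,
  maxIterations = 200
): number {
  let lo = 0;
  let hi = 1;
  let q = 0.5;
  for (let i = 0; i < maxIterations; i++) {
    const error = quantile(q) - target;
    // Shrink the bracket so that it always contains the root.
    if (error > 0) hi = q;
    else lo = q;
    if (Math.abs(error) < tolerance || hi - lo < tolerance) break;
    const newton = q - error / derivative(q);
    // Fall back to bisection if Newton leaves the bracket (or is NaN).
    q = newton > lo && newton < hi ? newton : (lo + hi) / 2;
  }
  return q;
}

// Example: logistic distribution with location 1 and scale 2.
const mu = 1;
const scale = 2;
const logisticQuantile = (q: number) => mu + scale * Math.log(q / (1 - q));
const logisticDerivative = (q: number) => scale / (q * (1 - q));
const cdfAt3 = invertQuantile(logisticQuantile, logisticDerivative, 3);
console.log(cdfAt3, 1 / (1 + Math.exp(-(3 - mu) / scale)));
```

Each call stops as soon as its own error falls below the tolerance, so two nearby inputs can end with slightly different residuals. That is one way small non-monotonic wobbles could arise in an approximated cdf.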

@NunoSempere
Contributor

NunoSempere commented Dec 22, 2022

@Hazelfire, can you:

A. Add a README.md to https://github.com/Hazelfire/metalog/ with some documentation.

In particular, I am curious about

  • What the array a represents. I'm thinking these might be equally spaced points in the cdf, but I'm not sure
  • I was under the impression that the metalog fit any points in a cdf? E.g., (1%: 10, 5%: 20, 90%: 30) (e.g., see here: http://www.metalogdistributions.com/fittodata.html). But it seems that this is not correct. Or at least, I don't see how you can represent this in an array, so something has to give.

B. Give me your thoughts on speed/accuracy tradeoff.

In particular, why wouldn't it be acceptable to expose the iterations/accuracy threshold/etc. as parameters, which would allow for faster initial iteration, but also for slower computing when finalizing?

Here: https://github.com/Hazelfire/metalog/blob/main/index.ts#L60 I'm seeing that you just bake the numbers in.

C. Give me your thoughts about why we get a non-monotonic cdf, and consequences

So a non-monotonic cdf is really bad, because it implies negative probability density for some regions.

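At first I thought: That's fine, you can take a non-monotonic cdf and convert it to a monotonic (though not strictly increasing) cdf by:

// Example cdf values with one dip (0.2 after 0.3)
const xs = [0, 0.1, 0.2, 0.3, 0.2, 1]
for (let i = 1; i < xs.length; i++) {
  // Clamp any value that falls below its predecessor
  if (xs[i] < xs[i - 1]) {
    xs[i] = xs[i - 1]
  }
}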

Then we no longer get regions in which the probability density is negative, but we can still get regions in which it is 0. That is a bit weird, since I don't see how you could get this when just specifying different points in a cdf.

Is this something we expect to get only when working with little accuracy? Or is this something that you expect would remain?

D. As a minor nitpick, I would have named things such that their apparent meaning is a bit more clear. E.g., have arrays be xs and ys when they represent the xs and ys of a cdf, and points also be x and y. I thought this was kinda worth flagging for this function https://github.com/Hazelfire/metalog/blob/main/index.ts#L30, because you use a, so then you have to use a_i. And your variable y is possibly an x coordinate?

@NunoSempere
Contributor

> @NunoSempere can you review this, against other Metalog applications?

Will do, though I've first left some comments above.

@Hazelfire
Collaborator Author

Hazelfire commented Jan 9, 2023

Hey Nuno! Added a really basic README for you.

> • What the array a represents. I'm thinking these might be equally spaced points in the cdf, but I'm not sure

The array a represents the a parameter of the metalog distribution. See Metalog's Wikipedia page for details. It's shown there as a_1, a_2, ..., a_n.
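
As a concrete illustration of how the coefficients a_1, ..., a_n are used, here is a sketch of the metalog quantile function following the standard published definition. The names `basisTerm` and `metalogQuantile` are made up for this example and are not necessarily the ones used in the package.

```typescript
// Value of the k-th metalog basis term at cumulative probability y.
// k is the 1-based index of the coefficient a_k.
function basisTerm(k: number, y: number): number {
  const logit = Math.log(y / (1 - y));
  const centered = y - 0.5;
  if (k === 1) return 1;
  if (k === 2) return logit;
  if (k === 3) return centered * logit;
  if (k === 4) return centered;
  return k % 2 === 1
    ? Math.pow(centered, (k - 1) / 2) // odd k >= 5
    : Math.pow(centered, k / 2 - 1) * logit; // even k >= 6
}

// Quantile function: the weighted sum of the basis terms.
function metalogQuantile(a: number[], y: number): number {
  if (!(y > 0 && y < 1)) {
    throw new RangeError("y must be strictly between 0 and 1");
  }
  return a.reduce(
    (sum, coefficient, i) => sum + coefficient * basisTerm(i + 1, y),
    0
  );
}

// At y = 0.5 every term except the first vanishes, so the median is a_1.
const medianValue = metalogQuantile([10, 2, 0.5], 0.5);
// With two terms the metalog is a logistic distribution.
const upperValue = metalogQuantile([10, 2], 0.9);
console.log(medianValue, upperValue);
```

So a is a list of weights on fixed basis functions of the probability y, not a list of points on the cdf.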

> • I was under the impression that the metalog fit any points in a cdf? E.g., (1%: 10, 5%: 20, 90%: 30) (e.g., see here: http://www.metalogdistributions.com/fittodata.html). But it seems that this is not correct. Or at least, I don't see how you can represent this in an array, so something has to give.

It does! You can calculate the a array from points in a CDF. I started that at the bottom of here: https://github.com/Hazelfire/metalog/blob/main/index.ts#L97-L128. I think I originally wanted to leave it out of the first revision of this PR, but I can add it in if we think that's a good idea
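
To illustrate how the a array can be calculated from points on a CDF: the quantile function is linear in a, so fitting reduces to linear least squares. The sketch below solves the normal equations directly. It is an illustration under that assumption, not the code at the link above, and `fitMetalog`, `basisRow`, and `solveLinear` are hypothetical names. It does not check that the resulting coefficients give a valid (monotonic) distribution.

```typescript
type Point = { x: number; q: number };

// Basis terms of the metalog quantile function evaluated at probability q.
function basisRow(q: number, terms: number): number[] {
  const logit = Math.log(q / (1 - q));
  const centered = q - 0.5;
  const row: number[] = [];
  for (let k = 1; k <= terms; k++) {
    if (k === 1) row.push(1);
    else if (k === 2) row.push(logit);
    else if (k === 3) row.push(centered * logit);
    else if (k === 4) row.push(centered);
    else if (k % 2 === 1) row.push(Math.pow(centered, (k - 1) / 2));
    else row.push(Math.pow(centered, k / 2 - 1) * logit);
  }
  return row;
}

// Solve the square system A z = b by Gaussian elimination with partial pivoting.
function solveLinear(A: number[][], b: number[]): number[] {
  const n = b.length;
  const M = A.map((row, i) => [...row, b[i]]);
  for (let col = 0; col < n; col++) {
    let pivot = col;
    for (let r = col + 1; r < n; r++) {
      if (Math.abs(M[r][col]) > Math.abs(M[pivot][col])) pivot = r;
    }
    if (Math.abs(M[pivot][col]) < 1e-14) throw new Error("singular system");
    [M[col], M[pivot]] = [M[pivot], M[col]];
    for (let r = col + 1; r < n; r++) {
      const factor = M[r][col] / M[col][col];
      for (let c = col; c <= n; c++) M[r][c] -= factor * M[col][c];
    }
  }
  // Back substitution on the upper-triangular system.
  const z = new Array<number>(n).fill(0);
  for (let r = n - 1; r >= 0; r--) {
    let acc = M[r][n];
    for (let c = r + 1; c < n; c++) acc -= M[r][c] * z[c];
    z[r] = acc / M[r][r];
  }
  return z;
}

// Least-squares coefficients from the normal equations (Y^T Y) a = Y^T x.
function fitMetalog(points: Point[], terms: number): number[] {
  if (points.length < terms) {
    throw new Error("need at least as many points as terms");
  }
  const Y = points.map((p) => basisRow(p.q, terms));
  const YtY = Array.from({ length: terms }, (_, i) =>
    Array.from({ length: terms }, (_, j) =>
      Y.reduce((acc, row) => acc + row[i] * row[j], 0)
    )
  );
  const Ytx = Array.from({ length: terms }, (_, i) =>
    Y.reduce((acc, row, r) => acc + row[i] * points[r].x, 0)
  );
  return solveLinear(YtY, Ytx);
}

const fittedA = fitMetalog(
  [{ x: 10, q: 0.01 }, { x: 20, q: 0.05 }, { x: 30, q: 0.9 }],
  3
);
console.log(fittedA);
```

With as many terms as points the fitted quantile function passes through every point. With fewer terms than points it is a least-squares approximation instead.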

> B. Give me your thoughts on speed/accuracy tradeoff.
>
> In particular, why wouldn't it be acceptable to expose the iterations/accuracy threshold/etc. as parameters, which would allow for faster initial iteration, but also for slower computing when finalizing?

As far as I can tell right now, there's no tradeoff: my implementation is both much more accurate and much faster. I can expose the accuracy parameters if you want, but I don't think we need this to be much faster or much more accurate. The speedup comes from a faster method; there is a little bit of detail on that in the new README.

> C. Give me your thoughts about why we get a non-monotonic cdf, and consequences

It's annoying, but the reason we get this is that we can only approximate the CDF of a metalog distribution, using Newton's method. Because of this, the approximations are going to be slightly off, in a way that is very slightly inconsistent from point to point. I think this becomes a problem with Squiggle because it puts the x coordinates out of order. (I think it assumes the function is monotonic when ordering the coordinates; I can't remember exactly how it works.)

> So a non-monotonic cdf is really bad, because it implies negative probability density for some regions.

The pdf function is not calculated from just subtracting two cdfs, so the pdf, at least as implemented in the package, has no negative probability density anywhere (This hasn't been tested, but it should be correct)
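
For reference, the standard relationship between a quantile function and its density can be written out as follows. This is general calculus applied to the first four metalog terms, not a description of how the package computes its pdf. Here M denotes the quantile function, y the cumulative probability, and f the density.

```latex
x = M(y) = a_1 + a_2 \ln\frac{y}{1-y}
         + a_3\left(y - \tfrac12\right)\ln\frac{y}{1-y}
         + a_4\left(y - \tfrac12\right)

M'(y) = \frac{a_2}{y(1-y)}
      + a_3\left[\frac{y - \tfrac12}{y(1-y)} + \ln\frac{y}{1-y}\right]
      + a_4

f\bigl(M(y)\bigr) = \frac{1}{M'(y)}, \qquad 0 < y < 1
```

The density is positive exactly where M'(y) > 0, so whether a given a yields a valid distribution depends on the coefficients themselves, independently of how the cdf is approximated.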

> Is this something we expect to get only when working with little accuracy? Or is this something that you expect would remain?

I expect it to remain unless I work out a way to do some crazy floating point manipulation code.

> D. As a minor nitpick, I would have named things such that their apparent meaning is a bit more clear. E.g., have arrays be xs and ys when they represent the xs and ys of a cdf, and points also be x and y. I thought this was kinda worth flagging for this function https://github.com/Hazelfire/metalog/blob/main/index.ts#L30, because you use a, so then you have to use a_i. And your variable y is possibly an x coordinate?

I use a and a_i because I think that follows the standard notation better. My variable y is a probability. I should probably name it q or something...

@Hazelfire
Collaborator Author

Interestingly, the pdf is not always strictly positive. I don't like that. Let me look into that.

@Hazelfire Hazelfire marked this pull request as draft January 9, 2023 03:41
@Hazelfire
Collaborator Author

Hazelfire commented Jan 9, 2023

I implemented fitting to cdf in @quri/metalog!

@Hazelfire
Collaborator Author

Putting some ideas here for metalog fitting syntax:

One possible syntax that is consistent (but feels kind of off) is:

metalog({p5: -1, p20: 4, p90: 6}, 2)

where the last number is the number of terms. However, I'm leaning more towards:

metalog([{x: -1, q: 0.05}, {x: 4, q: 0.2}, {x: 6, q: 0.9}], 2)

which can be constructed programmatically more easily.
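
To make the comparison concrete, here is a small sketch of converting the first form into the second. `pointsFromPercentiles` and the `QuantilePoint` type are hypothetical and written only for this example; they are not part of Squiggle or @quri/metalog.

```typescript
type QuantilePoint = { x: number; q: number };

// Hypothetical helper: turn {p5: -1, p20: 4, p90: 6} into
// [{x: -1, q: 0.05}, {x: 4, q: 0.2}, {x: 6, q: 0.9}], sorted by q.
function pointsFromPercentiles(
  percentiles: Record<string, number>
): QuantilePoint[] {
  return Object.entries(percentiles)
    .map(([key, x]) => {
      // Keys must look like "p5" or "p97.5".
      const match = /^p(\d+(?:\.\d+)?)$/.exec(key);
      if (match === null) {
        throw new Error(`unrecognised percentile key: ${key}`);
      }
      const q = Number(match[1]) / 100;
      if (!(q > 0 && q < 1)) {
        throw new Error(`percentile out of range: ${key}`);
      }
      return { x, q };
    })
    .sort((left, right) => left.q - right.q);
}

const convertedPoints = pointsFromPercentiles({ p5: -1, p20: 4, p90: 6 });
console.log(convertedPoints);
```

The object form needs its keys parsed and validated, whereas the array form can be built directly with map or a loop, which is the practical advantage of the second syntax.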

@Hazelfire
Collaborator Author

Hazelfire commented Jan 16, 2023

Currently, I've opted towards this syntax:

metalog([{x: -1, q: 0.05}, {x: 4, q: 0.2}, {x: 6, q: 0.9}], 2)

I'm now trying to make the implementation more robust and give better error messages (particularly fixing "Xs is not sorted", which shows up a lot).

@NunoSempere
Contributor

NunoSempere commented Mar 24, 2023

Hey @Hazelfire, coming back to the previous points I raised:

  • What's the status of the pdf not being strictly positive in the domain? I assume that this is now solved?
  • Nice that you now have a syntax like metalog([{x: -2, q: 0.1}, {x: -1, q: 0.3}, {x: 0, q: 0.9}]); much better than just specifying equally spaced points in the cdf. Could you add some documentation to the website? Not sure if I missed this.

Also, could you update https://github.com/Hazelfire/metalog/ so that it is up to date, if it isn't already?

Also, why is a "terms" parameter necessary? Can't you just look at the length of the points array?

Thanks for doing this, really appreciate it.

@OAGr
Contributor

OAGr commented Apr 6, 2023

Any updates here?

@OAGr OAGr marked this pull request as draft April 6, 2023 21:49
@OAGr
Contributor

OAGr commented Apr 6, 2023

Changing to draft - my impression is that there are remaining issues. (If you think it's done, Sam, please respond to Nuno's comments, and also say so)

@Hazelfire
Collaborator Author

Hey! I've been busy doing other things. Just to chat:

> Hey @Hazelfire, coming back to the previous points I raised:
>
> What's the status of the pdf not being strictly positive in the domain? I assume that this is now solved?

Yes it is. I fixed it by graphing metalog in a different way that's way faster and doesn't require approximation.

> Nice that you now have a syntax like metalog([{x: -2, q: 0.1}, {x: -1, q: 0.3}, {x: 0, q: 0.9}]); much better than just specifying equally spaced points in the cdf. Could you add some documentation to the website? Not sure if I missed this.
>
> Also, could you update https://github.com/Hazelfire/metalog/ so that it is up to date, if it isn't already?

Only thing that's missing here is documentation... I think? I can add documentation if you'd like.

> Thanks for doing this, really appreciate it.
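
The reply above does not say what the different way of graphing is. One approach that needs no cdf approximation, shown here purely as an assumption, is to sample probabilities y on a grid and compute x = M(y) and the density 1 / M'(y) directly, which gives the cdf and pdf curves parametrically. All names below (`termShape`, `quantileAndSlope`, `sampleCurve`) are hypothetical.

```typescript
// Shape of the k-th metalog basis term:
// (y - 0.5)^power, optionally multiplied by ln(y / (1 - y)).
function termShape(k: number): { power: number; hasLog: boolean } {
  if (k <= 2) return { power: 0, hasLog: k === 2 };
  if (k <= 4) return { power: 1, hasLog: k === 3 };
  return k % 2 === 1
    ? { power: (k - 1) / 2, hasLog: false }
    : { power: k / 2 - 1, hasLog: true };
}

// Quantile M(y) and its derivative M'(y) for coefficients a.
function quantileAndSlope(a: number[], y: number): { x: number; slope: number } {
  const logit = Math.log(y / (1 - y));
  const logitSlope = 1 / (y * (1 - y));
  const centered = y - 0.5;
  let x = 0;
  let slope = 0;
  a.forEach((coefficient, i) => {
    const { power, hasLog } = termShape(i + 1);
    const poly = Math.pow(centered, power);
    const polySlope = power === 0 ? 0 : power * Math.pow(centered, power - 1);
    x += coefficient * poly * (hasLog ? logit : 1);
    // Product rule for the terms that carry the log factor.
    slope += coefficient * (hasLog ? polySlope * logit + poly * logitSlope : polySlope);
  });
  return { x, slope };
}

// Sample the curve at y = 1/(n+1), ..., n/(n+1); no inversion is involved.
function sampleCurve(a: number[], sampleCount: number) {
  const sampled: { x: number; cdf: number; pdf: number }[] = [];
  for (let i = 1; i <= sampleCount; i++) {
    const y = i / (sampleCount + 1);
    const { x, slope } = quantileAndSlope(a, y);
    if (!(slope > 0)) {
      throw new Error(`infeasible coefficients: M'(y) <= 0 at y = ${y}`);
    }
    sampled.push({ x, cdf: y, pdf: 1 / slope });
  }
  return sampled;
}

const sampledCurve = sampleCurve([1, 2], 99);
console.log(sampledCurve[49]);
```

In this scheme the x values come out sorted whenever M'(y) is positive on the grid, and a non-positive slope is reported as an explicit error.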

@OAGr
Contributor

OAGr commented Apr 11, 2023

I'm still having a lot of trouble getting this to work on a bunch of test cases.

image

a = metalog([{x: 0, q: 0.17}, {x: 1.4, q: 0.2}, {x: 2, q: 0.3}, {x: 4, q: 0.5}, {x: 7, q: 0.7}], {terms:8})

f(t) = cdf(a, t)

For example.

Would be curious to get a better understanding here - do we expect this to fail in ~30% of cases, or any idea? Is this something that's fixable with a better underlying algorithm?

Labels
Language Regarding Squiggle language semantics, distributions and function registry
Projects
Status: 🔖 Later
5 participants