Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Idea: Reduce GC pressure by representing domains more memory efficient #123

Closed
pvdz opened this issue Jul 22, 2016 · 6 comments
Closed

Idea: Reduce GC pressure by representing domains more memory efficient #123

pvdz opened this issue Jul 22, 2016 · 6 comments

Comments

@pvdz
Copy link
Contributor

pvdz commented Jul 22, 2016

Right now, about 40~50% of the time spent is GC. We can't really control that directly so we must try to alleviate the pressure on it. I think by far the biggest pressure comes from generating lots of objects, especially the arrays for the domains. So let's change that.

image

One part has already been done here; domains only containing values lower than 31 will become bitfields. This already brought some huge savings.

The other part is the larger domains, which we can't efficiently express in terms of bitfields. Let's try to turn them into strings.

Currently the SUP is 100000000, which is well above 16bit but well below 32bit. This means every (range) value of a domain can't be encoded in one 16bit char, but we can have two of them. Or maybe even better, use the UTF way of using a high-bit for multi-byte overflow.

I am worried that this will create a lot of string concat overhead on its own. Perhaps this turns out to be superduper fast though so we'll just have to give this a spin. Not a trivial change to flush through the system but it has high potentials.

@jonnor
Copy link
Contributor

jonnor commented Jul 22, 2016

Is there a fixed number of domain values (not variable) per var? If so, maybe can keep an Unit32Array addressed by the index of the trie that corresponds to the var?

@pvdz
Copy link
Contributor Author

pvdz commented Jul 22, 2016

Not entirely sure what you're asking. A var can have a domain which can contain any number of values between SUB and SUP, inclusive. Those constants are 0 and 100000000 respectfully (they are arbitrary and have been since FD.js). As an optimization, the array only stores range pairs, so instead of an array with 100000000 elements you get two elements [0, 100000000]. If the range is not continuous you get multiple such ranges.

I can think of some implementations with Tries, but they all feel useless (or worse) to me.

@pvdz
Copy link
Contributor Author

pvdz commented Jul 22, 2016

To illustrate, this ticket proposes to encode those domains as strings. A domain [99, 103] would encode as the string "\0c\0g". Obviously most domains don't encode in a legible string, but that's completely irrelevant.

@pvdz
Copy link
Contributor Author

pvdz commented Jul 22, 2016

The utf encoding is interesting. So there are two ways of encoding the string;

  • encode them as lo-hi 16bit values pairs (per number, so two pairs per range in the domain).

This has the advantage of being able to search efficiently in the string since you know how many bytes each range will take up. You can get the middle of the domain (in terms of ranges, at least) without having to walk through the domain.

The downside is, of course, that it takes more space than is required. Many values will be below 16bit which means our encoded strings would have a lot of wasted space.

  • encode them in a UTF fashion, which uses the high bit to determine whether the next byte is also part of the current value. In our case that means a number can be encoded as one or two 16bit "words".

The dis/advantages are opposite of the fixed encoding. String length is unpredictable since every number may be encoded as either one or two "words". Far less wasted space.

Note that this wastes no space in our case since our highest number is still well below 32bit so numbers cannot end up encoded as three "words".

From the top of my head we currently don't really use predictability to get to the middle of a domain. There are strategies that would use it, though, which we currently don't actually use to the best of my knowledge.

I think it's something that we could switch relatively easily as long as all domains are de/constructed with central methods.

@pvdz
Copy link
Contributor Author

pvdz commented Jul 22, 2016

I'm going to pursue fixed sized domains because I don't think the overhead is super relevant at the moment while I think it does make a lot of the related code easier to work with.

@pvdz
Copy link
Contributor Author

pvdz commented Aug 1, 2016

Implemented the string domains but they didn't make much of a difference.

@pvdz pvdz closed this as completed Aug 1, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants