-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Idea: Reduce GC pressure by representing domains more memory efficient #123
Comments
Is there a fixed number of domain values (not variable) per var? If so, maybe can keep an Unit32Array addressed by the |
Not entirely sure what you're asking. A var can have a domain which can contain any number of values between I can think of some implementations with Tries, but they all feel useless (or worse) to me. |
To illustrate, this ticket proposes to encode those domains as strings. A domain |
The utf encoding is interesting. So there are two ways of encoding the string;
This has the advantage of being able to search efficiently in the string since you know how many bytes each range will take up. You can get the middle of the domain (in terms of ranges, at least) without having to walk through the domain. The downside is, of course, that it takes more space than is required. Many values will be below 16bit which means our encoded strings would have a lot of wasted space.
The dis/advantages are opposite of the fixed encoding. String length is unpredictable since every number may be encoded as either one or two "words". Far less wasted space. Note that this wastes no space in our case since our highest number is still well below 32bit so numbers cannot end up encoded as three "words". From the top of my head we currently don't really use predictability to get to the middle of a domain. There are strategies that would use it, though, which we currently don't actually use to the best of my knowledge. I think it's something that we could switch relatively easily as long as all domains are de/constructed with central methods. |
I'm going to pursue fixed sized domains because I don't think the overhead is super relevant at the moment while I think it does make a lot of the related code easier to work with. |
Implemented the string domains but they didn't make much of a difference. |
Right now, about 40~50% of the time spent is GC. We can't really control that directly so we must try to alleviate the pressure on it. I think by far the biggest pressure comes from generating lots of objects, especially the arrays for the domains. So let's change that.
One part has already been done here; domains only containing values lower than 31 will become bitfields. This already brought some huge savings.
The other part is the larger domains, which we can't efficiently express in terms of bitfields. Let's try to turn them into strings.
Currently the
SUP
is 100000000, which is well above 16bit but well below 32bit. This means every (range) value of a domain can't be encoded in one 16bit char, but we can have two of them. Or maybe even better, use the UTF way of using a high-bit for multi-byte overflow.I am worried that this will create a lot of string concat overhead on its own. Perhaps this turns out to be superduper fast though so we'll just have to give this a spin. Not a trivial change to flush through the system but it has high potentials.
The text was updated successfully, but these errors were encountered: