Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow Append()ing normal ShortStrings (length ≦ 255) to Ropes, switch DOM implementation to using that, and use that for generating reference links #157

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

sideshowbarker
Copy link
Contributor

@sideshowbarker sideshowbarker commented Aug 9, 2023

Relates to #153. This change reverts 0ad4342 and increases to 255 the length of strings that can be passed directly to Rope≫Append() without getting truncated. Otherwise, without this change, any string passed directly to Rope≫Append() whose length exceeds a particular system-dependent limit (which in practice seems to be 15) will get silently truncated.

This change allows us to append strings to Ropes in the way we’d intuitively expect to — rather than instead needing to forever remember that we otherwise have to:

  1. Create an additional CutRope (e.g., with ExtractedData)
  2. Append our strings to that CutRope.
  3. Use Rope≫AppendDestructively() to append that CutRope to whatever Rope we’re working with.

Further, the change switches the DOM implementation to append strings directly when setting attribute values and when appending text nodes and comments. That in turn allows passing non-constant strings to Element≫AppendChild() and Document≫AppendChild() and Element≫SetAttribute().

Otherwise, without that change, the strings we use for attribute values and text nodes and comments either must be constants, or else we have to create CutRopes, append strings to them, and then use TText.CreateDestructively() (to create text nodes and comments) and Element≫SetAttributeDestructively() (to set attribute values).

The change also includes a commit that switches the code for generating reference links to using a normal Append() and simple string concatenation — rather than needing to append a Scratch Rope built using an ExtractedData CutRope.

This change reverts 0ad4342 and increases to 255 the length of strings
that can be passed directly to Rope≫Append() without getting truncated.
Otherwise, without this change, any string passed directly to
Rope≫Append() whose length exceeds a particular system-dependent limit
(which in practice seems to be 15) will get silently truncated.

Relates to #153.
This change makes the DOM implementation append strings directly when
setting attribute values and when appending text nodes and comments.
That in turn allows passing non-constant strings to Element≫AppendChild()
and Document≫AppendChild() and Element≫SetAttribute().

Otherwise, without this change, the strings we use for attribute values and
text nodes and comments either must be constants, or else we have to create
CutRopes, append strings to them, and then use TText.CreateDestructively()
(to create text nodes and comments) and Element≫SetAttributeDestructively()
(to set attribute values).
@sideshowbarker sideshowbarker changed the title Allow Append()ing normal ShortStrings (length ≦ 255) to Ropes, and use it for generating reference links Allow Append()ing normal ShortStrings (length ≦ 255) to Ropes, and use it in the DOM implementation, and use all that for generating reference links Aug 10, 2023
@sideshowbarker sideshowbarker changed the title Allow Append()ing normal ShortStrings (length ≦ 255) to Ropes, and use it in the DOM implementation, and use all that for generating reference links Allow Append()ing normal ShortStrings (length ≦ 255) to Ropes, switch the DOM implementation to using that, and then se all that for generating reference links Aug 10, 2023
@sideshowbarker sideshowbarker changed the title Allow Append()ing normal ShortStrings (length ≦ 255) to Ropes, switch the DOM implementation to using that, and then se all that for generating reference links Allow Append()ing normal ShortStrings (length ≦ 255) to Ropes, switch DOM implementation to using that, and then se all that for generating reference links Aug 10, 2023
@sideshowbarker sideshowbarker changed the title Allow Append()ing normal ShortStrings (length ≦ 255) to Ropes, switch DOM implementation to using that, and then se all that for generating reference links Allow Append()ing normal ShortStrings (length ≦ 255) to Ropes, switch DOM implementation to using that, and use that for generating reference links Aug 10, 2023
@sideshowbarker sideshowbarker force-pushed the sideshowbarker/rope-append-any-string branch from 9226f89 to c6fabf1 Compare August 10, 2023 08:32
@domenic
Copy link
Member

domenic commented Aug 10, 2023

Haven't gotten a chance to look at this in detail, but I appreciate all the work here!

I do wonder, what will happen if we pass a >255 character string? Silent or loud failure?

…vely

Currently the longest string we’re appending directly to Ropes is the
40-character SourceGitSHA string for the snapshots, and the second-
longest string is the 30-character “This section is non-normative.”.

This change switches that SourceGitSHA-handling code to instead using
Scratch.Append(@SourceGitSHA) and TText.CreateDestructively(Scratch) —
so that we can reduce down to 30 the size of ShortStrings we allow to be
appended directly to Ropes.
The larger the size of ShortStrings we allow to be appended directly to
Ropes, the more memory Wattsi consumes at runtime.

We don’t need the size to be anywhere near as big as 255 — the existing
code never actually needs to directly append strings with lengths any
longer than 30. If we ever do run into need to append string lengths
longer than 30, we could at that time just bump UTF8InlineSize up to
whatever new size we actually need.

Upping the size to 30 from the old size of 15 seems to increase the
memory consumption by about 15–20%, or around 80–100MB. (In comparison,
upping it to 255 seems to roughly double the memory consumption.)
@sideshowbarker
Copy link
Contributor Author

I do wonder, what will happen if we pass a >255 character string? Silent or loud failure?

I’ve pushed two more commits that update things a bit.

The main relevant commit reduces the string-length limit all the way down to 30, and adds some code that, if we try to directly pass a string whose length is bigger than 30, causes Wattsi to fail with an error message. For example, if we try to pass the string “This string has more than 30 characters”, it would fail with this error message:

Error: Operation attempted with string length > UTF8InlineSize: "This string has more than 30 c…"

The other commit rewrites part of the SourceGitSHA-handling code (for snapshots) to use the Scratch.Append(@SourceGitSHA) and TText.CreateDestructively(Scratch) for appending — because those SourceGitSHA values are 40 characters, so we’d otherwise need to set the limit at 40.

And the point of choosing 30 as the limit is, that’s the longest length of any strings the current code is appending directly. So we can safely set it to that, and we could bump it up later if we ever run into need to do anything with longer strings.

And the point of setting it to 30 instead of 255 is that the larger we make it, the more memory Wattsi consumes at runtime. I haven’t done any serious profiling, but what I have found from some limited testing is that at 255, it seems to roughly double the amount of memory that Wattsi consumes at runtime. But if we set it at 30, it seems to only increases the memory consumption by about 15–20%, or around 80–100MB (on my system).

I’ve tested this and found that after all the changes in this PR, Wattsi produces output identical to what the current code in the main branch outputs. I also found that it doesn’t significantly increase the build time — though it does seem to increase the build time slightly, but just by maybe half a second or less.

Anyway, I meant to say this earlier, but I think the point of making this change is not only to help those of us who’ve had to waste time learning the hard way about this unwelcome quirk in the code — and don’t want to have to re-learn it again the next time we touch the code — but to also help any future contributors who come along later to not have to waste their time unnecessarily suffering through what we’ve had to suffer through.

The thing in the existing code seems to just be a performance optimization for solving a performance problem that we don’t actually have in practice. So we can safely dispense with that optimization, and let ourselves and future contributors be able to get strings into attribute values and text nodes in the same simple way we can with DOM implementations in any other tools/runtimes — rather than making everybody continue to do the oddball thing the current code forces.

Also for the record here, a table showing string lengths used in the current code
String length Used how many times in code
1 845208
2 95595
3 63117
4 135
5 36835
6 653
7 662
8 539
9 5821
10 1208
11 484
12 1454
13 232
14 175
15 150
16 1498
17 20
18 47
19 91
20 33
21 17
22 63
23 2096
24 8
25 19
30 222

@domenic
Copy link
Member

domenic commented Aug 14, 2023

Hmm. Well, a 15-20% memory increase and 0.5 second time spent increase is not trivial. Would it be possible instead to just have an error when you try to Append() incorrectly?

@sideshowbarker
Copy link
Contributor Author

sideshowbarker commented Aug 16, 2023

Would it be possible instead to just have an error when you try to Append() incorrectly?

I opened #158 with a change that adds a Contributing section to the README.md, and that takes the existing error message we emit when the string is longer than the system-set limit (15), and updates that message to instead be:

Use Ropes instead; see https://github.com/whatwg/wattsi/#how-to-work-with-strings

@domenic
Copy link
Member

domenic commented Aug 17, 2023

Ah, sorry for not being clear. What I'm wondering is if it's possible to catch whatever went wrong in 3b11b2d .

If I recall, the issue was that I used Append() with a pointer to a string, but that string was... a bad string? I don't think I ever understood what was bad about it, and why sometimes (e.g. FData.Append(@NewData)) it's OK but sometimes not. (Maybe it's related to the {$IFOPT C+} AssertStringIsConstant(NewData); {$ENDIF} lines? Are those just not enabled in our current config?)

To recap, I think we've discovered two classes of failures:

  1. Append() with a UTF8String that is too long. (This is what was wrong with 4cabebb.)
  2. Append() with a pointer to a UTF8String that is... not constant? (This is what was wrong with bd1cd44.)

We have nice errors for (1). We don't have nice errors for (2). Can we add them? Is there a big performance cost to doing so? (As is hinted by the fact that there seems to be some related conditionally-compiled out asserts.)

It's totally fine if there's no way to detect it, or detect it cheaply! I just wanted to check, since you've done so much great work here already.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants