refactor(rome_rowan): rework `SyntaxTokenText` #4721

Conaclos · 2023-07-22T10:17:09Z

Summary

This PR addresses some concerns of @ematipico.

All inner_text functions now return a SyntaxTokenText.
I thus removed Quoted and StaticStringValue.

My main issue, with SyntaxTokenText was the creativeness of the returned range. We lose the absolute location of the text in a source.
To address this issue, I added a (private) absolute offset, which is the starting position of the (private) token in the source.
This allows computing the absolute range from the (private) relative range.

I also modified slice to accept a range relative to the current slice of text. This seems more consistent from a user point of view and this mimics what we are used to for string slicing.

I take the opportunity to remove SyntaxTokenTextSlice in the formatter since SyntaxTokenText is now sufficient.

Test Plan

In progress...

netlify · 2023-07-22T10:17:14Z

✅ Deploy Preview for docs-rometools canceled.

Built without sensitive environment variables

Name	Link
🔨 Latest commit	`3da0087`
🔍 Latest deploy log	https://app.netlify.com/sites/docs-rometools/deploys/64bd15ff75565e0008c776b4

Conaclos · 2023-07-23T12:01:53Z

crates/rome_formatter/src/format_element.rs

@@ -392,4 +386,4 @@ static_assert!(std::mem::size_of::<crate::format_element::Tag>() == 16usize);

 #[cfg(not(debug_assertions))]
 #[cfg(target_pointer_width = "64")]
-static_assert!(std::mem::size_of::<crate::FormatElement>() == 24usize);
+static_assert!(std::mem::size_of::<crate::FormatElement>() == 32usize);


I don't understand why the size changed.
Moreover, I cannot reproduce this test locally.
Before and after this PR, the size of FormatElement is evaluated to 40 bytes on my machine.

Did you measure the size with a release build? Some elements contain debug only data

Regressing the size has negative consequences on overall performance and memory consumption

I accept with your point. However, I just merged SyntaxTokenTextSlice into SyntaxTokenText. This should take the same space.

Did you measure the size with a release build? Some elements contain debug only data

You are right!

Old layout of format_element::FormatElement::SyntaxTokenTextSlice (24 bytes):

#[repr(C)] // alignment=8 size=24 struct FormatElement_SyntaxTokenTextSlice { variant: i8, // which element of FormatElement is this? padding_0: [u8; 3], // for text_size alignment text_size: u32, // rome_text_size::TextSize::raw // Begin rome_rowan::SyntaxTokenText (alignment=8 size=16) green_token_ptr: u64, // rome_rowan::GreenToken::ptr (rome_rowan::ThinArc::ptr (pointer)) range_start: u32, // rome_text_size::TextRange::start (rome_text_size::TextSize::raw) range_end: u32, // rome_text_size::TextRange::start (rome_text_size::TextSize::raw) // End rome_rowan::SyntaxTokenText (alignment=8 size=16) }

New layout of format_element::FormatElement::SyntaxTokenTextSlice (32 bytes):

#[repr(C)] // alignment=8 size=32 struct FormatElement_SyntaxTokenTextSlice { variant: i8, // which element of FormatElement is this? padding_0: [u8; 7], // for rome_rowan::SyntaxTokenText alignment // Begin rome_rowan::SyntaxTokenText (alignment=8 size=24) text_size: u32, // rome_text_size::TextSize::raw padding_1: [u8; 4], // for green_token_ptr alignment green_token_ptr: u64, // rome_rowan::GreenToken::ptr (rome_rowan::ThinArc::ptr (pointer)) range_start: u32, // rome_text_size::TextRange::start (rome_text_size::TextSize::raw) range_end: u32, // rome_text_size::TextRange::start (rome_text_size::TextSize::raw) // End rome_rowan::SyntaxTokenText (alignment=8 size=24) }

Before, the space between format_element::FormatElement's discriminator and the rome_rowan::SyntaxTokenText was used to fit text_size. After, this space cannot be used. +4 bytes

After, text_size forces padding inside rome_rowan::SyntaxTokenText. The size of the members is 20 bytes, but because the struct needs to be 8-byte-aligned (because rome_rowan::GreenToken is 8-byte-aligned because pointers are 8-byte-aligned), 4 padding bytes are added. +4 bytes

+4 bytes + +4 bytes = +8 bytes

This patch helped my investigations (building with cargo +nightly build --release):

diff --git a/crates/rome_formatter/src/format_element.rs b/crates/rome_formatter/src/format_element.rs index fcb0e92d33..3d6d3e3250 100644 --- a/crates/rome_formatter/src/format_element.rs +++ b/crates/rome_formatter/src/format_element.rs @@ -16,6 +16,7 @@ use std::rc::Rc; /// /// Use the helper functions like [crate::builders::space], [crate::builders::soft_line_break] etc. defined in this file to create elements. #[derive(Clone, Eq, PartialEq)] +#[rustc_layout(debug)] pub enum FormatElement { /// A space token, see [crate::builders::space] for documentation. Space, diff --git a/crates/rome_formatter/src/lib.rs b/crates/rome_formatter/src/lib.rs index 21a5d90997..c2322681cd 100644 --- a/crates/rome_formatter/src/lib.rs +++ b/crates/rome_formatter/src/lib.rs @@ -1,3 +1,5 @@ +#![feature(rustc_attrs)] + //! Infrastructure for code formatting //! //! This module defines [FormatElement], an IR to format code documents and provides a mean to print diff --git a/crates/rome_rowan/src/lib.rs b/crates/rome_rowan/src/lib.rs index b4f5e659fb..7fc598c5fb 100644 --- a/crates/rome_rowan/src/lib.rs +++ b/crates/rome_rowan/src/lib.rs @@ -1,3 +1,5 @@ +#![feature(rustc_attrs)] + //! A generic library for lossless syntax trees. //! See `examples/s_expressions.rs` for a tutorial. #![forbid( diff --git a/crates/rome_rowan/src/syntax_token_text.rs b/crates/rome_rowan/src/syntax_token_text.rs index d5eb1ddbf1..7977189bb3 100644 --- a/crates/rome_rowan/src/syntax_token_text.rs +++ b/crates/rome_rowan/src/syntax_token_text.rs @@ -5,6 +5,7 @@ use std::{borrow::Borrow, fmt::Formatter}; /// Reference to the text of a SyntaxToken without having to worry about the lifetime of `&str`. #[derive(Eq, Clone)] +#[rustc_layout(debug)] pub struct SyntaxTokenText { // Using a green token to ensure this type is Send + Sync. token: GreenToken,

Thanks for the write-up!

I see two possibilities:

Replacing TextRange with a SmallTextRange struct in SyntaxTokenText to shrink SyntaxTokenText to 16 bytes.
This is based on the observation that a relative range does not need u32 positions. u16 seems enough.

Introducing GreenTokenText and reverting to the original layout of FormatElement (using GreenTokenText).

Any opinions?

ematipico · 2023-07-23T22:13:01Z

!bench_formatter

ematipico · 2023-07-23T22:23:46Z

crates/rome_rowan/src/syntax_token_text.rs

 /// Reference to the text of a SyntaxToken without having to worry about the lifetime of `&str`.
 #[derive(Eq, Clone)]
 pub struct SyntaxTokenText {
+    // Absolute start location of `token`
+    token_start: TextSize,


@Conaclos this is the reason why the FormatElement increased in size.

FormatElement is an enum, and its size is equal to the biggest type among all variants. Which means that SyntaxTokenText is now 32kb

I just moved source_position (renamed token_start) from SyntaxTokenTextSlice to SyntaxTokenText:

struct SyntaxTokenText { + token_start: TextSize token: GreenToken range: TextRange } enum FormatElement { - SyntaxTokenTextSlice { - source_position: TextSize, - slice: SyntaxTokenText, - }, + SyntaxTokenText(SyntaxTokenText), }

FormatElement::SyntaxTokenTextSlice and FormatElement::SyntaxTokenText should have the same size?

#4721 (comment)

github-actions · 2023-07-23T22:44:03Z

Formatter Benchmark Results

group                                    main                                   pr
-----                                    ----                                   --
formatter/big5-added.json                1.00    483.0±0.44µs    35.0 MB/sec    1.02    491.4±0.85µs    34.4 MB/sec
formatter/canada.json                    1.00    210.1±5.65ms    10.2 MB/sec    1.05    220.7±1.45ms     9.7 MB/sec
formatter/checker.ts                     1.00    398.6±6.21ms     6.5 MB/sec    1.00    396.9±2.61ms     6.5 MB/sec
formatter/compiler.js                    1.00    218.9±2.00ms     4.8 MB/sec    1.03    225.4±1.01ms     4.6 MB/sec
formatter/d3.min.js                      1.00    170.7±1.68ms  1572.6 KB/sec    1.03    175.8±3.13ms  1527.0 KB/sec
formatter/db.json                        1.00     14.3±0.08ms    12.8 MB/sec    1.02     14.5±0.09ms    12.6 MB/sec
formatter/dojo.js                        1.00     11.8±0.08ms     5.8 MB/sec    1.00     11.9±0.07ms     5.8 MB/sec
formatter/eucjp.json                     1.00    813.9±1.80µs    48.1 MB/sec    1.01    825.5±0.34µs    47.4 MB/sec
formatter/ios.d.ts                       1.00    259.1±1.77ms     7.2 MB/sec    1.01    260.4±3.70ms     7.2 MB/sec
formatter/jquery.min.js                  1.00     48.1±0.42ms  1760.7 KB/sec    1.02     48.9±0.43ms  1729.8 KB/sec
formatter/math.js                        1.00    331.6±3.19ms  1999.8 KB/sec    1.03    341.9±1.67ms  1939.7 KB/sec
formatter/package-lock.json              1.00      6.1±0.02ms    22.8 MB/sec    1.01      6.1±0.05ms    22.5 MB/sec
formatter/parser.ts                      1.00      8.1±0.04ms     6.0 MB/sec    1.01      8.2±0.03ms     5.9 MB/sec
formatter/pixi.min.js                    1.00    184.4±1.35ms     2.4 MB/sec    1.02    188.5±1.26ms     2.3 MB/sec
formatter/react-dom.production.min.js    1.00     55.7±0.60ms     2.1 MB/sec    1.03     57.3±0.67ms     2.0 MB/sec
formatter/react.production.min.js        1.01      2.8±0.01ms     2.2 MB/sec    1.00      2.7±0.01ms     2.2 MB/sec
formatter/router.ts                      1.00      3.3±0.01ms     9.8 MB/sec    1.00      3.3±0.03ms     9.7 MB/sec
formatter/tex-chtml-full.js              1.00    424.7±2.17ms     2.1 MB/sec    1.02    432.7±1.78ms     2.1 MB/sec
formatter/three.min.js                   1.00    221.2±2.10ms     2.7 MB/sec    1.02    225.3±1.91ms     2.6 MB/sec
formatter/typescript.js                  1.00   1428.5±7.95ms     6.7 MB/sec    1.01  1444.9±10.05ms     6.6 MB/sec
formatter/vue.global.prod.js             1.00     73.7±0.79ms  1673.3 KB/sec    1.03     76.3±1.66ms  1617.4 KB/sec

ematipico · 2023-07-24T09:42:57Z

There are a couple of things that are not clear in this PR, and I would like to clarify them:

what's the issue with inner_string_text? (please share an example so we can understand)
what's the issue with SyntaxTokenText not returning the range you need? (please share an example so we can understand)

Also, can you explain the relation between the two issues?

If they are unrelated, we could open two different PRs.

ematipico

We should avoid increasing memory, if the value we get in return is not enough

ematipico · 2023-07-24T09:38:06Z

crates/rome_json_syntax/src/lib.rs

+    let mut text = token.token_text_trimmed();
+    if token_kind == JsonSyntaxKind::JSON_STRING_LITERAL {
+        // remove string delimiters
+        let len = text.len() - TextSize::from(2);


This is unsafe because text.len() can be less than 2. You should use .saturing_sub

This should not be the case because it is a string literal?

It's still an unsafe operation, which is done without the proper checks. If you think it's a safe operation, you should add a // SAFETY comment to explain why it's safe.

Conaclos · 2023-07-24T16:14:03Z

* what's the issue with `inner_string_text`? (please share an example so we can understand)

There is no issue except that code is duplicated in several places. I just renamed it to inner_text. However, we can keep the previous name...

* what's the issue with `SyntaxTokenText` not returning the range you need? (please share an example so we can understand)

Let's me introduce some context to clarify the situation.

In some places, we need the text of a token, however, the borrow-checker requires allocating a new string. To avoid the allocation, SyntaxTokenText was introduced.

SyntaxTokenText is used in high-level API of the JSON, JS and JSX linter and formatter. This is also used to return the inner text of a JSON string literal.

I expanded its uses to JS and JSX string literals to be more consistent with the JSON code base. This makes possible the deletion of Quoted and StaticStringValue utility types.

SyntaxTokenText offers a range() method. The prefix Syntax suggests that this range is absolute, similar to SyntaxToken, SyntaxNode, SyntaxNodeText, ...

However, it is not.
This is really misleading.
For instance, let's use the following example:

let x;

The range of the token "x" is (4,5).
The range of its token text is (0,1).

I think the method must either be removed (and SYntaxTokenText renamed to TokenText or GreenTokenText) or reworked to return an absolute range. I chose the second path because it is very convenient in high-level code
to access to the absolute range of a text.
Otherwise, the user should fall back to SyntaxToken that is error-prone in the case where the token can be a string literal.

Moreover, there is the need to track the absolute range in other places like the formatter where a struct SyntaxTokenTextSlice and an enum variant FormatElement::SyntaxTokenTextSlice consisted of a source_location and a SyntaxTokenText. This looked like a good opportunity to merge SyntaxTokenTextSlice into SyntaxTokenText.

ematipico · 2023-07-24T19:37:25Z

There is no issue except that code is duplicated in several places. I just renamed it to inner_text. However, we can keep the previous name...

We can solve the repetition with a trait. Still, I think the changes can be moved into a separate PR. Doing so will allow to focus on what you're trying to solve.

Otherwise, the user should fall back to SyntaxToken that is error-prone in the case where the token can be a string literal.

When you say "user", do you mean us developers?

I'd like to propose another solution instead, where we remove the API that exposes the range of SyntaxTokenText. This is misleading, a developer should never rely on a range of substring in the first place. When dealing with string literals, the quotes should always be part of the range.

If there's a particular case where we need to extract a sub-range of a string literal, in that case, it's up to the developer to compute it.

github-actions bot added A-Linter Area: linter A-Core Area: core A-Formatter Area: formatter L-JavaScript Langauge: JavaScript A-Parser Area: parser A-Project Area: project configuration and loading L-JSON Language: JSON labels Jul 22, 2023

Conaclos marked this pull request as draft July 22, 2023 10:19

Conaclos added 2 commits July 22, 2023 12:24

refactor(rome_rowan): rework SyntaxTokenText abstraction

5537b88

refactor: merge GreenTokenText and SyntaxTokenText

3da0087

Conaclos commented Jul 23, 2023

View reviewed changes

Conaclos marked this pull request as ready for review July 23, 2023 12:02

Conaclos requested review from ematipico and MichaReiser July 23, 2023 12:02

ematipico reviewed Jul 23, 2023

View reviewed changes

ematipico suggested changes Jul 24, 2023

View reviewed changes

Conaclos mentioned this pull request Jul 25, 2023

refactor(rome_rowan): rework SyntaxTokenText into TokenText #4726

Merged

Conaclos closed this Jul 27, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(rome_rowan): rework `SyntaxTokenText` #4721

refactor(rome_rowan): rework `SyntaxTokenText` #4721

Conaclos commented Jul 22, 2023 •

edited

Loading

netlify bot commented Jul 22, 2023 •

edited

Loading

Conaclos Jul 23, 2023

MichaReiser Jul 23, 2023

Conaclos Jul 23, 2023

Conaclos Jul 23, 2023

strager Jul 24, 2023 •

edited

Loading

Conaclos Jul 24, 2023 •

edited

Loading

ematipico commented Jul 23, 2023

ematipico Jul 23, 2023 •

edited

Loading

Conaclos Jul 23, 2023 •

edited

Loading

ematipico Jul 24, 2023

github-actions bot commented Jul 23, 2023

ematipico commented Jul 24, 2023

ematipico left a comment

ematipico Jul 24, 2023

Conaclos Jul 24, 2023

ematipico Jul 24, 2023

Conaclos commented Jul 24, 2023 •

edited

Loading

ematipico commented Jul 24, 2023

refactor(rome_rowan): rework SyntaxTokenText #4721

refactor(rome_rowan): rework SyntaxTokenText #4721

Conversation

Conaclos commented Jul 22, 2023 • edited Loading

Summary

Test Plan

netlify bot commented Jul 22, 2023 • edited Loading

✅ Deploy Preview for docs-rometools canceled.

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

strager Jul 24, 2023 • edited Loading

Choose a reason for hiding this comment

Conaclos Jul 24, 2023 • edited Loading

Choose a reason for hiding this comment

ematipico commented Jul 23, 2023

ematipico Jul 23, 2023 • edited Loading

Choose a reason for hiding this comment

Conaclos Jul 23, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

github-actions bot commented Jul 23, 2023

Formatter Benchmark Results

ematipico commented Jul 24, 2023

ematipico left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Conaclos commented Jul 24, 2023 • edited Loading

ematipico commented Jul 24, 2023

refactor(rome_rowan): rework `SyntaxTokenText` #4721

refactor(rome_rowan): rework `SyntaxTokenText` #4721

Conaclos commented Jul 22, 2023 •

edited

Loading

netlify bot commented Jul 22, 2023 •

edited

Loading

strager Jul 24, 2023 •

edited

Loading

Conaclos Jul 24, 2023 •

edited

Loading

ematipico Jul 23, 2023 •

edited

Loading

Conaclos Jul 23, 2023 •

edited

Loading

Conaclos commented Jul 24, 2023 •

edited

Loading