refactor(rome_rowan): rework SyntaxTokenText
#4721
@@ -392,4 +386,4 @@ static_assert!(std::mem::size_of::<crate::format_element::Tag>() == 16usize);

 #[cfg(not(debug_assertions))]
 #[cfg(target_pointer_width = "64")]
-static_assert!(std::mem::size_of::<crate::FormatElement>() == 24usize);
+static_assert!(std::mem::size_of::<crate::FormatElement>() == 32usize);
I don't understand why the size changed.
Moreover, I cannot reproduce this test locally.
Before and after this PR, the size of `FormatElement` is evaluated to 40 bytes on my machine.
Did you measure the size with a release build? Some elements contain debug-only data.
Regressing the size has negative consequences on overall performance and memory consumption.
I accept your point. However, I just merged `SyntaxTokenTextSlice` into `SyntaxTokenText`. This should take the same space.
> Did you measure the size with a release build? Some elements contain debug only data

You are right!
Old layout of `format_element::FormatElement::SyntaxTokenTextSlice` (24 bytes):
#[repr(C)]
// alignment=8 size=24
struct FormatElement_SyntaxTokenTextSlice {
variant: i8, // which element of FormatElement is this?
padding_0: [u8; 3], // for text_size alignment
text_size: u32, // rome_text_size::TextSize::raw
// Begin rome_rowan::SyntaxTokenText (alignment=8 size=16)
green_token_ptr: u64, // rome_rowan::GreenToken::ptr (rome_rowan::ThinArc::ptr (pointer))
range_start: u32, // rome_text_size::TextRange::start (rome_text_size::TextSize::raw)
range_end: u32, // rome_text_size::TextRange::end (rome_text_size::TextSize::raw)
// End rome_rowan::SyntaxTokenText (alignment=8 size=16)
}
New layout of `format_element::FormatElement::SyntaxTokenTextSlice` (32 bytes):
#[repr(C)]
// alignment=8 size=32
struct FormatElement_SyntaxTokenTextSlice {
variant: i8, // which element of FormatElement is this?
padding_0: [u8; 7], // for rome_rowan::SyntaxTokenText alignment
// Begin rome_rowan::SyntaxTokenText (alignment=8 size=24)
text_size: u32, // rome_text_size::TextSize::raw
padding_1: [u8; 4], // for green_token_ptr alignment
green_token_ptr: u64, // rome_rowan::GreenToken::ptr (rome_rowan::ThinArc::ptr (pointer))
range_start: u32, // rome_text_size::TextRange::start (rome_text_size::TextSize::raw)
range_end: u32, // rome_text_size::TextRange::end (rome_text_size::TextSize::raw)
// End rome_rowan::SyntaxTokenText (alignment=8 size=24)
}
Before, the space between the `format_element::FormatElement` discriminant and the `rome_rowan::SyntaxTokenText` was used to fit `text_size`. After, this space cannot be used. +4 bytes
After, `text_size` forces padding inside `rome_rowan::SyntaxTokenText`. The size of the members is 20 bytes, but the struct needs to be 8-byte-aligned (`rome_rowan::GreenToken` is 8-byte-aligned because pointers are 8-byte-aligned), so 4 padding bytes are added. +4 bytes
+4 bytes + +4 bytes = +8 bytes
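The arithmetic above can be checked with plain `#[repr(C)]` stand-ins. This is only a sketch: the struct names are hypothetical, a `u64` stands in for the `GreenToken` pointer, and `#[repr(C)]` pins a field order that the real `repr(Rust)` types do not guarantee. The sizes assume a 64-bit target where `u64` is 8-byte-aligned.

```rust
use std::mem::size_of;

/// Old variant: `text_size` sits in the gap after the discriminant.
#[allow(dead_code)]
#[repr(C)]
pub struct OldVariant {
    variant: i8,          // enum discriminant
    text_size: u32,       // fits in the padding before the pointer
    green_token_ptr: u64, // stand-in for the `GreenToken` pointer
    range_start: u32,
    range_end: u32,
}

/// New `SyntaxTokenText`: 20 bytes of fields, padded to 24.
#[allow(dead_code)]
#[repr(C)]
pub struct NewTokenText {
    text_size: u32,
    green_token_ptr: u64, // forces 4 padding bytes after `text_size`
    range_start: u32,
    range_end: u32,
}

/// New variant: 7 padding bytes follow the discriminant.
#[allow(dead_code)]
#[repr(C)]
pub struct NewVariant {
    variant: i8,
    text: NewTokenText,
}

/// Returns (old variant size, new token text size, new variant size).
pub fn layout_sizes() -> (usize, usize, usize) {
    (
        size_of::<OldVariant>(),
        size_of::<NewTokenText>(),
        size_of::<NewVariant>(),
    )
}
```

Under those assumptions `layout_sizes()` gives 24, 24 and 32, which matches the +8 bytes computed above.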
This patch helped my investigations (building with `cargo +nightly build --release`):
diff --git a/crates/rome_formatter/src/format_element.rs b/crates/rome_formatter/src/format_element.rs
index fcb0e92d33..3d6d3e3250 100644
--- a/crates/rome_formatter/src/format_element.rs
+++ b/crates/rome_formatter/src/format_element.rs
@@ -16,6 +16,7 @@ use std::rc::Rc;
///
/// Use the helper functions like [crate::builders::space], [crate::builders::soft_line_break] etc. defined in this file to create elements.
#[derive(Clone, Eq, PartialEq)]
+#[rustc_layout(debug)]
pub enum FormatElement {
/// A space token, see [crate::builders::space] for documentation.
Space,
diff --git a/crates/rome_formatter/src/lib.rs b/crates/rome_formatter/src/lib.rs
index 21a5d90997..c2322681cd 100644
--- a/crates/rome_formatter/src/lib.rs
+++ b/crates/rome_formatter/src/lib.rs
@@ -1,3 +1,5 @@
+#![feature(rustc_attrs)]
+
//! Infrastructure for code formatting
//!
//! This module defines [FormatElement], an IR to format code documents and provides a mean to print
diff --git a/crates/rome_rowan/src/lib.rs b/crates/rome_rowan/src/lib.rs
index b4f5e659fb..7fc598c5fb 100644
--- a/crates/rome_rowan/src/lib.rs
+++ b/crates/rome_rowan/src/lib.rs
@@ -1,3 +1,5 @@
+#![feature(rustc_attrs)]
+
//! A generic library for lossless syntax trees.
//! See `examples/s_expressions.rs` for a tutorial.
#![forbid(
diff --git a/crates/rome_rowan/src/syntax_token_text.rs b/crates/rome_rowan/src/syntax_token_text.rs
index d5eb1ddbf1..7977189bb3 100644
--- a/crates/rome_rowan/src/syntax_token_text.rs
+++ b/crates/rome_rowan/src/syntax_token_text.rs
@@ -5,6 +5,7 @@ use std::{borrow::Borrow, fmt::Formatter};
/// Reference to the text of a SyntaxToken without having to worry about the lifetime of `&str`.
#[derive(Eq, Clone)]
+#[rustc_layout(debug)]
pub struct SyntaxTokenText {
// Using a green token to ensure this type is Send + Sync.
token: GreenToken,
Thanks for the write-up!
I see two possibilities:
- Replacing `TextRange` with a `SmallTextRange` struct in `SyntaxTokenText` to shrink `SyntaxTokenText` to 16 bytes. This is based on the observation that a relative range does not need `u32` positions; `u16` seems enough.
- Introducing `GreenTokenText` and reverting to the original layout of `FormatElement` (using `GreenTokenText`).
Any opinions?
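To make the first option concrete, here is a rough sketch of what a `u16`-based range could look like. Everything here is hypothetical: the fields and methods of `SmallTextRange` and the `CompactTokenText` type are invented for illustration, a `u64` stands in for the `GreenToken` pointer, and `#[repr(C)]` is only used to make the size predictable.

```rust
use std::convert::TryFrom;
use std::mem::size_of;

/// Hypothetical compact range: offsets relative to the token start, stored as `u16`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SmallTextRange {
    start: u16,
    end: u16,
}

impl SmallTextRange {
    /// Returns `None` if the range is inverted or an offset does not fit in `u16`.
    pub fn new(start: u32, end: u32) -> Option<Self> {
        if start > end {
            return None;
        }
        Some(Self {
            start: u16::try_from(start).ok()?,
            end: u16::try_from(end).ok()?,
        })
    }

    pub fn start(self) -> u32 {
        u32::from(self.start)
    }

    pub fn end(self) -> u32 {
        u32::from(self.end)
    }
}

/// Stand-in for `SyntaxTokenText` with the first option applied.
#[allow(dead_code)]
#[repr(C)]
pub struct CompactTokenText {
    token: u64,            // stand-in for the `GreenToken` pointer
    token_start: u32,      // absolute start of the token
    range: SmallTextRange, // 4 bytes instead of 8
}

/// Size of the compact stand-in: 8 + 4 + 4 = 16 bytes, no padding.
pub fn compact_size() -> usize {
    size_of::<CompactTokenText>()
}
```

One thing to weigh: a token longer than 65,535 bytes would not fit. The sketch returns `None` in that case, so callers would need a fallback.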
!bench_formatter
/// Reference to the text of a SyntaxToken without having to worry about the lifetime of `&str`.
#[derive(Eq, Clone)]
pub struct SyntaxTokenText {
    // Absolute start location of `token`
    token_start: TextSize,
@Conaclos this is the reason why `FormatElement` increased in size.
`FormatElement` is an enum, and its size is determined by its largest variant (plus the discriminant). This means the `SyntaxTokenText` variant now takes 32 bytes.
I just moved `source_position` (renamed `token_start`) from `SyntaxTokenTextSlice` to `SyntaxTokenText`:
 struct SyntaxTokenText {
+    token_start: TextSize,
     token: GreenToken,
     range: TextRange,
 }
 enum FormatElement {
-    SyntaxTokenTextSlice {
-        source_position: TextSize,
-        slice: SyntaxTokenText,
-    },
+    SyntaxTokenText(SyntaxTokenText),
 }
`FormatElement::SyntaxTokenTextSlice` and `FormatElement::SyntaxTokenText` should have the same size?
Formatter Benchmark Results
There are a couple of things that are not clear in this PR, and I would like to clarify them:
Also, can you explain the relation between the two issues? If they are unrelated, we could open two different PRs.
We should avoid increasing memory usage if the value we get in return is not enough.
let mut text = token.token_text_trimmed();
if token_kind == JsonSyntaxKind::JSON_STRING_LITERAL {
    // remove string delimiters
    let len = text.len() - TextSize::from(2);
This is unsafe because `text.len()` can be less than 2. You should use `.saturating_sub`.
This should not be the case because it is a string literal?
It's still an unsafe operation, which is done without the proper checks. If you think it's a safe operation, you should add a `// SAFETY` comment to explain why it's safe.
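For reference, a checked version of the delimiter stripping could look like the sketch below. It is not the PR's code: it works on a plain `&str` with `usize` lengths instead of `SyntaxTokenText` and `TextSize`, and the function name `strip_string_delimiters` is made up. Note that "unsafe" in this thread means "can underflow", not Rust's `unsafe` keyword.

```rust
/// Strips one leading and one trailing double quote, if both are present.
/// Returns `None` instead of underflowing when the text is too short
/// or is not delimited by double quotes.
pub fn strip_string_delimiters(text: &str) -> Option<&str> {
    // `checked_sub` makes the "length is at least 2" assumption explicit.
    let inner_len = text.len().checked_sub(2)?;
    if !(text.starts_with('"') && text.ends_with('"')) {
        return None;
    }
    // Both quotes are one byte long, so `1..1 + inner_len` lies on char boundaries.
    Some(&text[1..1 + inner_len])
}
```

`saturating_sub` would also avoid the underflow, but it silently turns a malformed token into an empty length. Returning `None` keeps the broken invariant visible to the caller.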
There is no issue except that code is duplicated in several places. I just renamed it to
Let me introduce some context to clarify the situation. In some places, we need the text of a token; however, the borrow checker requires allocating a new string. To avoid the allocation,
I expanded its uses to JS and JSX string literals to be more consistent with the JSON code base. This makes possible the deletion of
However, it is not. let x; The range of the token "x" is I think the method must either be removed (and Moreover, there is the need to track the absolute range in other places like the formatter where a struct
We can solve the repetition with a trait. Still, I think the changes can be moved into a separate PR. Doing so will let us focus on what you're trying to solve.
When you say "user", do you mean us developers? I'd like to propose another solution instead, where we remove the API that exposes the range of If there's a particular case where we need to extract a sub-range of a string literal, in that case, it's up to the developer to compute it.
Summary
This PR addresses some concerns of @ematipico.
All `inner_text` functions now return a `SyntaxTokenText`.
I thus removed `Quoted` and `StaticStringValue`.
My main issue with `SyntaxTokenText` was the relativeness of the returned range: we lose the absolute location of the text in the source.
To address this issue, I added a (private) absolute offset, which is the starting position of the (private) token in the source.
This allows computing the absolute range from the (private) relative range.
I also modified `slice` to accept a range relative to the current slice of text. This seems more consistent from a user's point of view and mimics what we are used to for string slicing.
I took the opportunity to remove `SyntaxTokenTextSlice` in the formatter since `SyntaxTokenText` is now sufficient.
Test Plan
In progress...
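The offset bookkeeping described in the summary could be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: `TokenText` is a hypothetical stand-in for `SyntaxTokenText`, a `String` replaces the `GreenToken`, and `Range<u32>` replaces `TextRange`.

```rust
use std::ops::Range;

/// Hypothetical stand-in for `SyntaxTokenText`.
#[derive(Clone, Debug)]
pub struct TokenText {
    token_start: u32,  // absolute offset of the token in the source
    token: String,     // full token text (a `GreenToken` in the real type)
    range: Range<u32>, // current slice, relative to the token start
}

impl TokenText {
    pub fn new(token_start: u32, token: &str) -> Self {
        let len = token.len() as u32;
        Self {
            token_start,
            token: token.to_string(),
            range: 0..len,
        }
    }

    /// Text of the current slice.
    pub fn text(&self) -> &str {
        &self.token[self.range.start as usize..self.range.end as usize]
    }

    /// Absolute range of the current slice in the source.
    pub fn absolute_range(&self) -> Range<u32> {
        self.token_start + self.range.start..self.token_start + self.range.end
    }

    /// Slices relative to the current text, like `&s[a..b]` on a `&str`.
    /// Panics if `range` is out of bounds of the current text.
    pub fn slice(&self, range: Range<u32>) -> Self {
        let len = self.range.end - self.range.start;
        assert!(
            range.start <= range.end && range.end <= len,
            "range out of bounds"
        );
        Self {
            token_start: self.token_start,
            token: self.token.clone(),
            range: self.range.start + range.start..self.range.start + range.end,
        }
    }
}
```

For a token `"hello"` (quotes included) starting at offset 10, `slice(1..6)` yields `hello` with absolute range `11..16`, and slicing that result again with `1..3` yields `el` at `12..14`.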