
refactor(editor): optimize code block highlighting #12817


Open · wants to merge 11 commits into canary

Conversation

@golok727 (Contributor) commented Jun 14, 2025

Limitation:
If a new line is added, the subsequent lines need retokenization. (Will fix this in the next PR.)

Summary by CodeRabbit

  • New Features

    • Introduced dynamic, on-demand syntax highlighting for code blocks, providing more accurate and responsive tokenization.
    • Added support for incremental tokenization with state preservation, improving performance when editing large code blocks.
    • Added new syntax highlighting integration using Shiki for enhanced theme and language support.
    • Added a new property to display line indices within code blocks for improved context.
  • Improvements

    • Enhanced syntax highlighting precision by aligning tokens with their exact positions within code lines.
    • Improved handling of language and theme changes with cancellation support for ongoing language loading.
    • Added line-based token retrieval and rendering enhancements for inline code units.
    • Introduced a new method to retrieve full line content for better text processing.
  • Bug Fixes

    • Resolved issues with outdated or incorrect token highlighting during code editing.

@golok727 golok727 requested a review from a team as a code owner June 14, 2025 06:03

coderabbitai bot commented Jun 14, 2025

Walkthrough

This change introduces a new tokenizer architecture for code blocks, shifting from static, precomputed token arrays to a dynamic, stateful tokenization model. New classes and interfaces for tokenization and Shiki integration are added, and existing components are refactored to use the tokenizer for line-by-line syntax highlighting with improved state management and caching.
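
For orientation, here is a minimal TypeScript sketch of the tokenizer contracts the walkthrough refers to, assembled from the review excerpts quoted later on this page. The shapes may differ slightly from the actual tokenizer/types.ts, the generic State parameter is inferred from the review discussion, and initial-state handling is omitted.

// Sketch of the tokenizer contracts; see tokenizer/types.ts in the diff for the real definitions.
export interface TokenizerState {
  // Compared to decide whether previously cached lines are still valid.
  equals(other: TokenizerState): boolean;
}

export type LineKey = {
  lineContent: string;
  lineIndex: number;
};

export type Token = {
  content: string;
  offset: number; // column offset of the token within its line
  color?: string;
};

export type TokenizationResult = {
  lineTokens: Token[];
  endState: TokenizerState; // carried into the next line's tokenization
};

export interface TokensProvider<State extends TokenizerState = TokenizerState> {
  tokenize(line: string, state: State): TokenizationResult;
}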

Changes

  • blocksuite/affine/blocks/code/src/code-block-inline.ts: modified the renderer function in CodeBlockUnitSpecExtension to accept and bind a lineIndex property for rendered components.
  • blocksuite/affine/blocks/code/src/code-block.ts: replaced static token arrays with a tokenizer$ signal and dynamic tokenization; introduced asynchronous language loading, caching, and stateful tokenization using CodeTokenizer and ShikiTokenProvider; updated rendering and effects accordingly.
  • blocksuite/affine/blocks/code/src/highlight/affine-code-unit.ts: refactored to use the tokenizer for dynamic token retrieval per line, added a lineIndex property, and adjusted token offset logic for accurate highlighting within code units; added a getter for the closest v-line element.
  • blocksuite/framework/std/src/inline/components/v-line.ts: added a lineContent getter to the VLine class for retrieving the concatenated inserted strings of the line's elements.
  • blocksuite/affine/blocks/code/src/highlight/shiki.ts: new module integrating Shiki syntax highlighting with a custom tokenizer state and provider, enabling incremental and stateful tokenization compatible with the new architecture.
  • blocksuite/affine/blocks/code/src/tokenizer/index.ts: new index module re-exporting everything from the new tokenizer.ts and types.ts modules.
  • blocksuite/affine/blocks/code/src/tokenizer/tokenizer.ts: new CodeTokenizer class implementing stateful, cached, line-by-line tokenization with cache validation and state management.
  • blocksuite/affine/blocks/code/src/tokenizer/types.ts: new TypeScript interfaces and types for tokenizer state, tokens, tokenization results, and token providers.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CodeBlockComponent
    participant CodeTokenizer
    participant ShikiTokenProvider

    User->>CodeBlockComponent: Edits code or changes language
    CodeBlockComponent->>ShikiTokenProvider: (If needed) Load language/theme
    CodeBlockComponent->>CodeTokenizer: Create or update tokenizer
    loop For each visible line
        CodeBlockComponent->>CodeTokenizer: tokenizeLine({lineContent, lineIndex})
        CodeTokenizer->>ShikiTokenProvider: tokenize(lineContent, state)
        ShikiTokenProvider-->>CodeTokenizer: TokenizationResult (tokens, endState)
        CodeTokenizer-->>CodeBlockComponent: Tokens for line
        CodeBlockComponent->>AffineCodeUnit: Render with tokens and lineIndex
    end
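
In TypeScript terms, the loop in the diagram boils down to roughly the following sketch; only tokenizeLine's argument shape and its per-line token output appear in this PR, while the surrounding names (LineTokenizer, highlightAllLines) are illustrative.

// Rough sketch of the per-line highlighting flow shown above.
type Token = { content: string; offset: number; color?: string };

interface LineTokenizer {
  tokenizeLine(key: { lineContent: string; lineIndex: number }): Token[];
}

function highlightAllLines(tokenizer: LineTokenizer, lines: string[]): Token[][] {
  return lines.map((lineContent, lineIndex) => {
    // The tokenizer reuses cached tokens when lineContent and the incoming
    // state are unchanged; otherwise it re-tokenizes and stores the end state.
    return tokenizer.tokenizeLine({ lineContent, lineIndex });
  });
}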

Suggested labels

app:core

Suggested reviewers

  • Flrande
  • L-Sun

Poem

In the warren where the code lines flow,
A tokenizer hops, with state in tow.
Shiki’s colors shimmer, bright and keen,
Each line is parsed, each token seen.
Caches burrow, deltas gleam—
Syntax highlighting, a coder’s dream!
🐇✨


@golok727 golok727 marked this pull request as draft June 14, 2025 06:08

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🔭 Outside diff range comments (1)
blocksuite/affine/blocks/code/src/code-block.ts (1)

108-115: ⚠️ Potential issue

Old tokenizer lingers when language is cleared

If modelLang is null we return without resetting this.tokenizer$; the previous tokenizer keeps highlighting with an out-of-date grammar.

-    if (modelLang === null) {
-      return;
-    }
+    if (modelLang === null) {
+      this.tokenizer$.value = null;
+      return;
+    }
🧹 Nitpick comments (3)
blocksuite/affine/blocks/code/src/code-block.ts (1)

167-172: Clear tokenizer cache on text edits

The second effect touches tokenizer$.value but never invalidates cached tokens when the document changes, leading to stale highlighting after large edits/line re-orders. Consider tokenizer?.clearCache() whenever text.deltas$ changes.
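
As a sketch, assuming the @preact/signals-core primitives that the tokenizer$ signal suggests and a clearCache() method on the tokenizer (both taken from the review text rather than verified APIs), the invalidation could look roughly like this:

// Sketch only: re-run whenever the text deltas change and drop the tokenizer's cache.
import { effect, type ReadonlySignal } from '@preact/signals-core';

type Tokenizer = { clearCache(): void };

function watchTextEdits(
  deltas$: ReadonlySignal<unknown>,
  tokenizer$: ReadonlySignal<Tokenizer | null>
): () => void {
  return effect(() => {
    void deltas$.value;              // subscribe this effect to text edits
    tokenizer$.peek()?.clearCache(); // invalidate stale line tokens without re-subscribing
  });
}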

blocksuite/affine/blocks/code/src/highlight/affine-code-unit.ts (1)

57-60: Heavy structuredClone call

Deep-cloning every line’s tokens on each render can be costly. A shallow clone (lineTokens.map(t => ({ ...t }))), or a plain slice when only .content is mutated, is usually sufficient and avoids browser-support quirks with structuredClone.
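
For instance, a shallow per-token copy is enough when only content is rewritten before rendering:

// Shallow clone: copies each token object but shares nested values, which is
// sufficient when only `content` is reassigned afterwards.
type Token = { content: string; offset: number; color?: string };

function cloneLineTokens(lineTokens: Token[]): Token[] {
  return lineTokens.map(token => ({ ...token }));
}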

blocksuite/affine/blocks/code/src/highlight/tokenizer.ts (1)

27-34: Unbounded cache may grow indefinitely

_tokenizedLines is never pruned, so very large files or long editing sessions will leak memory. Add a simple LRU policy or reset the cache once it grows past a size threshold.
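
One possible shape, sketched as a small Map-based LRU keyed by line index; the cap value and names are illustrative and not part of this PR.

// Illustrative LRU cache relying on Map insertion order.
class LruLineCache<V> {
  private readonly entries = new Map<number, V>();

  constructor(private readonly maxEntries = 5000) {}

  get(lineIndex: number): V | undefined {
    const value = this.entries.get(lineIndex);
    if (value !== undefined) {
      // Re-insert so this line becomes the most recently used entry.
      this.entries.delete(lineIndex);
      this.entries.set(lineIndex, value);
    }
    return value;
  }

  set(lineIndex: number, value: V): void {
    this.entries.delete(lineIndex);
    this.entries.set(lineIndex, value);
    if (this.entries.size > this.maxEntries) {
      // The first key in iteration order is the least recently used one.
      const oldest = this.entries.keys().next().value as number;
      this.entries.delete(oldest);
    }
  }

  clear(): void {
    this.entries.clear();
  }
}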

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a1abb60 and 69861b7.

📒 Files selected for processing (5)
  • blocksuite/affine/blocks/code/src/code-block-inline.ts (1 hunks)
  • blocksuite/affine/blocks/code/src/code-block.ts (6 hunks)
  • blocksuite/affine/blocks/code/src/highlight/affine-code-unit.ts (4 hunks)
  • blocksuite/affine/blocks/code/src/highlight/tokenizer.ts (1 hunks)
  • blocksuite/framework/std/src/inline/components/v-line.ts (1 hunks)
🧰 Additional context used
🪛 Biome (1.9.4)
blocksuite/affine/blocks/code/src/highlight/tokenizer.ts

[error] 86-86: Change to an optional chain.

Unsafe fix: Change to an optional chain.

(lint/complexity/useOptionalChain)

🔇 Additional comments (2)
blocksuite/affine/blocks/code/src/code-block-inline.ts (1)

25-29: Confirm renderer signature compatibility

renderer: ({ delta, lineIndex }) => … introduces lineIndex, but InlineSpecExtension’s renderer type may still expect only { delta }. Please re-check the extension declaration to avoid a TS compile error in CI.
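
If the extension's renderer type does need widening, the change is probably along these lines. This is a hypothetical sketch with placeholder types (Delta, Rendered), not the actual InlineSpecExtension declaration.

// Hypothetical: widen the renderer props to carry an optional lineIndex.
type Delta = { insert: string; attributes?: Record<string, unknown> };
type Rendered = unknown;

type CodeUnitRendererProps = {
  delta: Delta;
  lineIndex?: number; // new field; optional keeps existing renderers compiling
};

type CodeUnitRenderer = (props: CodeUnitRendererProps) => Rendered;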

blocksuite/affine/blocks/code/src/highlight/tokenizer.ts (1)

57-65: Verify getInternalStack/equals API availability

GrammarState#getInternalStack() and .equals() are not part of Shiki’s public typings as of v1. If they change, this will break at runtime. Double-check the API or guard with optional chaining.
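
A defensive sketch of that guard, assuming (as the comment implies) that equality is checked via the internal stack; the helper name and types are illustrative.

// Sketch: if the non-public helpers are missing, report "not equal" so the
// caller falls back to a full re-tokenization instead of throwing.
type MaybeGrammarState = {
  getInternalStack?: () => { equals?: (other: unknown) => boolean } | undefined;
};

function grammarStatesEqual(a: MaybeGrammarState, b: MaybeGrammarState): boolean {
  const stackA = a.getInternalStack?.();
  const stackB = b.getInternalStack?.();
  if (!stackA || !stackB) return false; // unknown state: treat as changed
  return stackA.equals?.(stackB) ?? false;
}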

@golok727 golok727 marked this pull request as ready for review June 14, 2025 07:28

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (3)
blocksuite/affine/blocks/code/src/highlight/shiki.ts (1)

24-42: Be explicit about single-line token extraction & guard against Shiki API changes

codeToTokens returns an array-of-lines; we currently grab tokens[0] without checking length.
If Shiki ever changes the return shape (e.g. returns an empty array for blank lines) this will throw.

-    return {
-      lineTokens: res.tokens[0],
-      endState: new ShikiTokenizerState(res.grammarState),
-    };
+    const firstLineTokens = res.tokens[0] ?? [];
+    return {
+      lineTokens: firstLineTokens,
+      endState: new ShikiTokenizerState(res.grammarState),
+    };
blocksuite/affine/blocks/code/src/tokenizer/tokenizer.ts (2)

74-80: Tight loop could walk the entire buffer each call

_guessStateForLine walks backwards one line at a time until it hits a cached state.
For large files this is O(N) per miss and can be noticeable during fast scrolling.

Consider storing the last line index whose endState is known and jump straight to it, or maintain a prefix-map of the latest valid state to achieve amortised O(1).
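
A sketch of that bookkeeping, with illustrative names; the real cache lives in CodeTokenizer's _tokenizedLines map.

// Illustrative: track the highest contiguous line whose end state is cached so a
// miss can jump straight to it instead of walking backwards line by line.
class EndStateIndex<State> {
  private readonly endStates = new Map<number, State>();
  private lastContiguousLine = -1;

  record(lineIndex: number, state: State): void {
    this.endStates.set(lineIndex, state);
    // Advance the watermark while the prefix of known states stays contiguous.
    while (this.endStates.has(this.lastContiguousLine + 1)) {
      this.lastContiguousLine += 1;
    }
  }

  // Closest line at or before `lineIndex` whose end state is known, in O(1).
  nearestKnown(lineIndex: number): { line: number; state: State } | null {
    const line = Math.min(lineIndex, this.lastContiguousLine);
    if (line < 0) return null;
    // Safe cast: every line up to the watermark has a recorded state.
    return { line, state: this.endStates.get(line) as State };
  }

  invalidateFrom(lineIndex: number): void {
    for (const key of [...this.endStates.keys()]) {
      if (key >= lineIndex) this.endStates.delete(key);
    }
    this.lastContiguousLine = Math.min(this.lastContiguousLine, lineIndex - 1);
  }
}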



75-77: Biome hint – optional chaining improves readability

The static analyser suggestion is minor but valid:

-      const prevLineCache = this._tokenizedLines.get(lineIndex - 1);
-      if (prevLineCache && prevLineCache.token.endState) {
-        return prevLineCache.token.endState as State;
+      const prevLineCache = this._tokenizedLines.get(lineIndex - 1);
+      const endState = prevLineCache?.token.endState as State | undefined;
+      if (endState) {
+        return endState;
       }

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 69861b7 and 3545866.

📒 Files selected for processing (6)
  • blocksuite/affine/blocks/code/src/code-block.ts (6 hunks)
  • blocksuite/affine/blocks/code/src/highlight/affine-code-unit.ts (4 hunks)
  • blocksuite/affine/blocks/code/src/highlight/shiki.ts (1 hunks)
  • blocksuite/affine/blocks/code/src/tokenizer/index.ts (1 hunks)
  • blocksuite/affine/blocks/code/src/tokenizer/tokenizer.ts (1 hunks)
  • blocksuite/affine/blocks/code/src/tokenizer/types.ts (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • blocksuite/affine/blocks/code/src/tokenizer/index.ts
  • blocksuite/affine/blocks/code/src/tokenizer/types.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • blocksuite/affine/blocks/code/src/highlight/affine-code-unit.ts
  • blocksuite/affine/blocks/code/src/code-block.ts
🧰 Additional context used
🧬 Code Graph Analysis (2)
blocksuite/affine/blocks/code/src/highlight/shiki.ts (1)
blocksuite/affine/blocks/code/src/tokenizer/types.ts (3)
  • TokenizerState (1-3)
  • TokensProvider (22-24)
  • TokenizationResult (17-20)
blocksuite/affine/blocks/code/src/tokenizer/tokenizer.ts (1)
blocksuite/affine/blocks/code/src/tokenizer/types.ts (5)
  • TokenizerState (1-3)
  • TokenizationResult (17-20)
  • TokensProvider (22-24)
  • LineKey (5-9)
  • Token (11-15)
🪛 Biome (1.9.4)
blocksuite/affine/blocks/code/src/tokenizer/tokenizer.ts

[error] 76-76: Change to an optional chain.

Unsafe fix: Change to an optional chain.

(lint/complexity/useOptionalChain)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (4)
blocksuite/affine/blocks/code/src/tokenizer/types.ts (1)

5-8: Consider marking LineKey fields as readonly.

LineKey is mainly used as a cache-key. Marking its properties immutable prevents accidental mutation that could silently break cache invariants.

-export type LineKey = {
-  lineContent: string;
-  lineIndex: number;
-};
+export type LineKey = {
+  readonly lineContent: string;
+  readonly lineIndex: number;
+};
blocksuite/affine/blocks/code/src/tokenizer/tokenizer.ts (3)

74-80: Tiny readability win: use optional chaining.

Biome’s hint is valid – the manual null-check can be collapsed:

-      const prevLineCache = this._tokenizedLines.get(lineIndex - 1);
-      if (prevLineCache && prevLineCache.token.endState) {
-        return prevLineCache.token.endState as State;
-      }
+      const maybeState = this._tokenizedLines.get(lineIndex - 1)?.token.endState;
+      if (maybeState) {
+        return maybeState as State;
+      }

This removes one level of nesting and clarifies intent.



90-95: Empty-array sentinel may hide errors.

getLineTokens returns [] when a line hasn’t been tokenized.
Call-sites cannot distinguish “not yet tokenized” from “line is empty”. Returning undefined (or throwing) forces explicit handling and avoids subtle rendering bugs.

-    return this._tokenizedLines.get(index)?.token.lineTokens ?? [];
+    const cached = this._tokenizedLines.get(index);
+    if (!cached) return undefined; // or throw
+    return cached.token.lineTokens;

58-71: Potential memory growth: expose a pruning strategy.

_tokenizedLines grows unbounded during long-lived editing sessions. Consider:

• Exposing a prune(afterLine: number) method that drops caches before a given index, or
• Switching to an LRU map with a reasonable cap.

Prevents unnecessary memory usage in large documents.
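
A sketch of the first option, written as a standalone helper over the same Map shape as _tokenizedLines; the function and parameter names are illustrative.

// Drop cached entries for lines before a given index, e.g. lines that have
// scrolled far out of the viewport; keeps recent lines hot while bounding memory.
function pruneTokenCache<V>(
  tokenizedLines: Map<number, V>,
  beforeLine: number
): void {
  for (const lineIndex of [...tokenizedLines.keys()]) {
    if (lineIndex < beforeLine) {
      tokenizedLines.delete(lineIndex);
    }
  }
}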

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3545866 and 3d790fd.

📒 Files selected for processing (3)
  • blocksuite/affine/blocks/code/src/highlight/shiki.ts (1 hunks)
  • blocksuite/affine/blocks/code/src/tokenizer/tokenizer.ts (1 hunks)
  • blocksuite/affine/blocks/code/src/tokenizer/types.ts (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • blocksuite/affine/blocks/code/src/highlight/shiki.ts
🧰 Additional context used
🧬 Code Graph Analysis (1)
blocksuite/affine/blocks/code/src/tokenizer/tokenizer.ts (1)
blocksuite/affine/blocks/code/src/tokenizer/types.ts (5)
  • TokenizerState (1-3)
  • TokenizationResult (16-19)
  • TokensProvider (21-24)
  • LineKey (5-8)
  • Token (10-14)
🪛 Biome (1.9.4)
blocksuite/affine/blocks/code/src/tokenizer/tokenizer.ts

[error] 76-76: Change to an optional chain.

Unsafe fix: Change to an optional chain.

(lint/complexity/useOptionalChain)

🔇 Additional comments (2)
blocksuite/affine/blocks/code/src/tokenizer/types.ts (1)

10-14: Align naming with common token interfaces (value vs content).

Most syntax-highlighting libs (Shiki, Prism, etc.) use content or value consistently. If the rest of the pipeline (e.g. AffineCodeUnit) expects value, a mismatch here becomes a foot-gun. Double-check downstream usages; consider renaming before this type propagates broadly.

blocksuite/affine/blocks/code/src/tokenizer/tokenizer.ts (1)

9-16: Great – HighlightedLine now carries the generic State.

The earlier unsafe casts are gone, restoring proper type safety.
Nice catch!


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (5)
blocksuite/affine/blocks/code/src/tokenizer/types.ts (5)

1-3: Clarify the equals contract & document invariants

equals sits at the heart of cache-invalidation logic, yet the interface doesn’t specify what constitutes equality (structural, referential, locale-sensitive, etc.). A brief JSDoc will prevent divergent implementations and subtle bugs.

 export interface TokenizerState {
+  /**
+   * Determine whether two states are equivalent for the purpose of
+   * incremental tokenisation. Implementations **must** satisfy reflexive,
+   * symmetric and transitive properties because the result is used as a
+   * cache key.
+   */
   equals(other: TokenizerState): boolean;
 }

5-8: Make LineKey immutable to avoid accidental mutation

LineKey is presumably used as a Map/WeakMap key. Mutating properties after insertion would break lookup semantics. Marking it Readonly prevents this class of bugs at compile time.

-export type LineKey = {
-  lineContent: string;
-  lineIndex: number;
-};
+export type LineKey = Readonly<{
+  lineContent: string;
+  lineIndex: number;
+}>;

10-14: Consider a readonly Token interface for safer downstream use

Tokens are typically treated as value objects. Freezing their fields helps avoid accidental in-place edits that desynchronise caches or renderers.

-export type Token = {
-  content: string;
-  offset: number;
-  color?: string;
-};
+export interface Token {
+  readonly content: string;
+  readonly offset: number;
+  readonly color?: string;
+}

16-19: Rename lineTokens → tokens for concision (optional)

Within the TokenizationResult context, all tokens belong to a single line; the prefix doesn’t add information and slightly lengthens property access.

-export type TokenizationResult = {
-  lineTokens: Token[];
-  endState: TokenizerState;
-};
+export interface TokenizationResult {
+  tokens: Token[];
+  endState: TokenizerState;
+}

21-24: Evaluate need for async tokenisation

Some providers (e.g., WASM-backed highlighters or remote services) are inherently asynchronous. If future providers will need async tokenization, adding a Promise-based overload now may avoid breaking changes later.

tokenize(line: string, state: State): TokenizationResult | Promise<TokenizationResult>;

Not urgent, but worth weighing while the API is still greenfield.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3d790fd and 23b57e0.

📒 Files selected for processing (3)
  • blocksuite/affine/blocks/code/src/code-block.ts (7 hunks)
  • blocksuite/affine/blocks/code/src/tokenizer/tokenizer.ts (1 hunks)
  • blocksuite/affine/blocks/code/src/tokenizer/types.ts (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • blocksuite/affine/blocks/code/src/tokenizer/tokenizer.ts
  • blocksuite/affine/blocks/code/src/code-block.ts


codecov bot commented Jun 16, 2025

Codecov Report

Attention: Patch coverage is 3.33333% with 87 lines in your changes missing coverage. Please review.

Project coverage is 55.59%. Comparing base (a1abb60) to head (23b57e0).

Files with missing lines (patch coverage · lines missing):
  • ...uite/affine/blocks/code/src/tokenizer/tokenizer.ts: 0.00% · 35 missing ⚠️
  • blocksuite/affine/blocks/code/src/code-block.ts: 0.00% · 22 missing ⚠️
  • ...fine/blocks/code/src/highlight/affine-code-unit.ts: 6.25% · 15 missing ⚠️
  • ...ocksuite/affine/blocks/code/src/highlight/shiki.ts: 13.33% · 13 missing ⚠️
  • ...ksuite/affine/blocks/code/src/code-block-inline.ts: 0.00% · 1 missing ⚠️
  • ...uite/framework/std/src/inline/components/v-line.ts: 0.00% · 1 missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           canary   #12817      +/-   ##
==========================================
- Coverage   55.91%   55.59%   -0.33%     
==========================================
  Files        2652     2654       +2     
  Lines      125440   125511      +71     
  Branches    19948    19904      -44     
==========================================
- Hits        70142    69777     -365     
+ Misses      53550    53426     -124     
- Partials     1748     2308     +560     
Flags (coverage Δ):
  • server-test: 78.99% <ø> (-0.69%) ⬇️
  • unittest: 31.58% <3.33%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown.

