
feat: tokenizer extension position #3594

Draft · wants to merge 5 commits into base: master
Conversation

UziTech (Member) commented Jan 18, 2025

Marked version: 15.0.6

Description

Add a position property for tokenizer extensions so they can pick where they are run in the lexer.

TODO:

  • Check benchmark speed (initial check seems to be fine)
  • Write tests
  • Write docs
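A minimal sketch of what registering such an extension might look like. The `position` value and the `++underline++` syntax here are illustrative assumptions, not part of this PR:

```javascript
// Hypothetical extension object: `position` names the built-in tokenizer
// step it should run at, instead of only a coarse `level`.
const underlineExtension = {
  name: 'underline',
  position: 'emStrong', // assumed position name; run at the emStrong step
  tokenizer(src) {
    // Match ++text++ at the start of the remaining source.
    const match = /^\+\+([^+]+)\+\+/.exec(src);
    if (match) {
      return { type: 'underline', raw: match[0], text: match[1] };
    }
  },
};

// With marked this would be registered as usual:
//   marked.use({ extensions: [underlineExtension] });
```

The tokenizer itself works the same as in existing extensions; only the placement hint is new.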

Contributor

  • Test(s) exist to ensure functionality and minimize regression (if no tests added, list tests covering this PR); or,
  • no tests required for this PR.
  • If submitting new feature, it has been documented in the appropriate places.

Committer

In most cases, this should be a different person than the contributor.


vercel bot commented Jan 18, 2025

marked-website: ✅ Ready (preview updated Mar 1, 2025, 6:14pm UTC)

calculuschild (Contributor)
I haven't run any benchmarks on this, but I'm curious how much of a slowdown this adds, assuming no extensions. Do you have any measurements?

As an alternative that might be more flexible, though it would mean more restructuring: I wonder what you think about extracting each of the parser/renderer steps into an array of "extension-like objects".

I.e., instead of:

```js
beforeCode();
// code
if (token = this.tokenizer.code(src)) {
  // ...
}

beforeFences();
// fences
if (token = this.tokenizer.fences(src)) {
  // ...
}
```

something like (pseudocode):

```js
const tokenizers = {
  code: () => { if (token = this.tokenizer.code(src)) { /* ... */ } },
  fences: () => { if (token = this.tokenizer.fences(src)) { /* ... */ } },
  // ...
};

for (const [tokenizerName, tokenizerFunction] of Object.entries(tokenizers)) {
  runExtensionBefore(tokenizerName);
  tokenizerFunction();
}
```

I'm fairly certain there is some speed impact when calling the tokenizers from an array/map/object like this, but it might make the whole project more flexible for this type of "extension position" customization.
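A runnable toy of the loop sketched above, with all names hypothetical (`extensionsByPosition`, the two-step tokenizer map); it only demonstrates the dispatch ordering, not marked's real lexer:

```javascript
// Toy model of the proposed restructuring: lexer steps live in an ordered
// map, and extensions registered for a position run just before that
// built-in step.
const extensionsByPosition = {
  // An extension "positioned" at fences: tries ::: fences first.
  fences: [(src) => src.startsWith(':::')],
};

function lex(src) {
  const calls = []; // records the order in which steps were tried
  const tokenizers = {
    code: (s) => { calls.push('code'); return false; },
    fences: (s) => { calls.push('fences'); return s.startsWith('~~~'); },
  };
  for (const [name, fn] of Object.entries(tokenizers)) {
    for (const ext of extensionsByPosition[name] ?? []) {
      calls.push(`ext-before-${name}`);
      if (ext(src)) return calls; // extension consumed the source
    }
    if (fn(src)) return calls; // built-in tokenizer consumed it
  }
  return calls;
}
```

Lexing a `~~~` fence here tries code, then the fences-positioned extension, then the built-in fences tokenizer, in that order.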

UziTech (Member, Author) commented Feb 26, 2025

I did try to move the tokenizers to some sort of array but there were two problems.

  1. Some tokenizers require a lot of extra logic that is not easy to move out of the lexer.
  2. I couldn't think of an easy way to expose the array to the user so they can pick where to put their tokenizers.

I think just adding a position property like this is easiest for the user, and even though it adds a lot of boilerplate, we are not likely to add or remove tokenizers from the lexer.

calculuschild (Contributor) left a comment

Implementation seems good to me assuming there's no alternative to the boilerplate.

Remaining comments:

  1. Should this include a documentation update?
  2. Does this need a test for an extension at one of the custom positions?
  3. Is there any significant performance impact? Not sure if the extra function calls between each step are going to be an issue.

```js
if (ext.position && ![...tokenizerBlockPositions, ...tokenizerInlinePositions].includes(ext.position)) {
  throw new Error(`extension position must be one of '${tokenizerBlockPositions.join("', '")}', '${tokenizerInlinePositions.join("', '")}'`);
}
if (!ext.level && !ext.position) {
```
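The check in this diff can be exercised standalone; the position lists below are illustrative stand-ins, not the real lists from marked's lexer:

```javascript
// Sketch of the validation: reject a `position` that names no known
// tokenizer step. List contents are assumptions for illustration.
const tokenizerBlockPositions = ['code', 'fences', 'heading'];
const tokenizerInlinePositions = ['escape', 'emStrong'];

function validateExtension(ext) {
  const allPositions = [...tokenizerBlockPositions, ...tokenizerInlinePositions];
  if (ext.position && !allPositions.includes(ext.position)) {
    throw new Error(`extension position must be one of '${allPositions.join("', '")}'`);
  }
}
```

A valid position passes silently; an unknown one throws with the full list of accepted names in the message.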
calculuschild (Contributor):
Should we enforce compatible level and position entries? I.e., a block level doesn't make sense with an inline position. Or should these properties be mutually exclusive to avoid unexpected behavior?

UziTech (Member, Author):
They are mutually exclusive. They can both be specified in case an extension wants to support many versions of marked, but versions of marked that allow position will ignore level if position is provided.
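That precedence rule could be sketched as follows (the function name and return shape are hypothetical, not from this PR):

```javascript
// An extension may ship both keys for cross-version compatibility:
// a position-aware marked ignores `level` when `position` is present,
// while older versions simply never read `position` and use `level`.
function resolvePlacement(ext, supportsPosition) {
  if (supportsPosition && ext.position) {
    return { position: ext.position };
  }
  return { level: ext.level };
}
```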

calculuschild (Contributor) commented Feb 28, 2025

> I did try to move the tokenizers to some sort of array but there were two problems.

Ok, fair enough. Looking back, this is what I tried to do back in 2021 and I think we concluded arrays just add too much slowdown anyway. #1872

Also looking back, we had discussed a "before" parameter 4 years ago. Hoping it wasn't also a lag issue that made us drop it #2043 (comment)

UziTech (Member, Author) commented Feb 28, 2025

> Remaining comments:

Yeah, those three are in the TODO section in this PR's description. I will work on them soon.

> Also looking back, we had discussed a "before" parameter 4 years ago. Hoping it wasn't also a lag issue that made us drop it

Looks like we just figured we would get it done when it was actually needed.

Development

Successfully merging this pull request may close these issues: Select Extension priority

2 participants