Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Custom extras #519

Merged
merged 33 commits into from
Feb 25, 2024
Merged

Custom extras #519

merged 33 commits into from
Feb 25, 2024

Conversation

Crozzers
Copy link
Contributor

@Crozzers Crozzers commented Jul 2, 2023

This PR closes #382 by adding support for custom extras.
@JeffAbrahamson, @Einenlum and @kjaymiller, you've all expressed interest in this feature, so I'd love to hear any thoughts from you all.

How it works

Extras

Extras must inherit from the new Extra class. See the Strike extra for a basic example of the implementation.
Each extra must define the following properties:

  • name
    • A string name the user will be able to pass in the existing extras=[] param, like fenced-code-blocks.
  • order
  • run()
    • The function that will be called with text for the extra to process
  • test()
    • Checks whether the extra should be run against a block of text. For example, the fenced-code-blocks extra might check '```' in text

Extra options

The new extras support passing in extra parameters, like with existing extras. These are passed to the initialiser of the extra via the extras dict in the Markdown class (see _setup_extras)

Stages

Converting markdown to HTML happens in distinct stages, such as preprocessing, hashing HTML, processing lists, etc. These distinct stages are given numeric IDs in the Stage class which denote the order they run in.

The Markdown class methods that correspond to each stage get decorated with @Stage.mark. For example:

class Markdown:
    ...
    @Stage.mark(Stage.PREPROCESS)
    def preprocess(self, text):
        ...

Now, when preprocess is called, a wrapper function will find all registered extras that are in the same band as this stage and execute them.

Order of execution

As mentioned previously, each stage has its own numeric ID, each 100 apart from the last. Calling Stage.before/after(Stage.WHATEVER) will return that stage's ID minus/plus 5. Conescutive calls increment an internal counter and will continue subtracting/adding 5. For example:

Stage.PREPROCESS  # 50
Stage.after(Stage.PREPROCESS)  # [55]
Stage.after(Stage.PREPROCESS)  # [60]

This means that multiple extras can be registered for the same stage and will be executed in the order in which they were registered.

Known flaws

  • Calling Stage.before 10 times on the same stage will result in extras overflowing into the next stage
    • So your Stage.after([some stage]) will actually mean Stage.before([next stage])
    • This can be mitigated by calling Stage.after([stage], step=1) to increase the number of calls that can be made before this happens
  • Registering extras is a bit of a mess at the moment
    • All subclasses of Extra are auto-detected and registered during _setup_extras. I intend to change so extras must explicitly register themselves
  • Not all extras have been converted to new format
    • Notably the footnotes extra and a few others are tightly integrated and I haven't been able to untangle them as of yet
  • Not all extras have been completely seperated from the Markdown class so there are still shims in place

The way this works is by creating subclasses of the `Extra` class. These subclasses will have an order and a name. The name is the same one
specified in the `extras` list/dict given to the Markdown init function.

The order will be at which point the function will be executed. This is done by attaching the extra to a "Stage", a distinct step in the
markdown process (eg: forming paragraphs, processing links... etc). You can set the extra to run before or after the stage.

At the moment, extras are automatically registered, activated and executed by the Markdown class.

TODO:
* More elegant way to register and init extras
* Optimise `Stage.mark`
* Convert more extras to new class format
Stay up to date with master
Also converted `mermaid` extra as part of this process. As a plus,
you no longer need fenced-code-blocks activated to use mermaid.

Also added the ability for extras to be triggered before or after another extra
All block extras have now been converted
Still TODO is things like `footnotes` extra.
Also need to think of a system for extras to replace parts of the standard syntax
Merge master to pull in trentm#514 for conversion to new Extra format
At time of commit, Python 3.5 is 2 years 9 months EOL.
@Crozzers
Copy link
Contributor Author

Crozzers commented Jul 2, 2023

  • Registering extras is a bit of a mess at the moment
    • All subclasses of Extra are auto-detected and registered during _setup_extras. I intend to change so extras must explicitly register themselves

Changed in 5478960 so that extras must manually registered by calling the Extra.register function.
During Markdown._setup_extras it will now iterate over all registered extras and instantiate any that are found in self.extras using the parameters from self.extras

I have also dropped Python 3.5 support since it was throwing errors in the test suite due to some type hints. Python 3.5 was EOL 2 years and 9 months ago so this shouldn't be too much of an issue

@Crozzers Crozzers marked this pull request as ready for review July 8, 2023 11:07
@nicholasserra
Copy link
Collaborator

nicholasserra commented Jul 23, 2023

Somehow this slipped through the cracks for me. I'll get it into my queue

…nto custom-extras

Resolve merge conflicts for PR
@Crozzers
Copy link
Contributor Author

Crozzers commented Jul 23, 2023

No worries! Might also be worth pulling in some downstream project maintainers for feedback. Off the top of my head, @mhils (pdoc) and @rouilj (roundup) come to mind. Since both of these projects use the library, I imagine there will be some ideas for what the feature should look like.

@rouilj
Copy link

rouilj commented Jul 23, 2023

Thanks for the ping.

Markdown2 is just one of three supported markdown engines. Each is configured a little differently with different capabilities. The things I would use this for (recognizing autolinked phrases like issue1) etc are done by changing the text seen by markdown2 so it includes the proper markdown syntax for the links.

So I try to do all customization via markdown syntax. I do enable options (e.g. fenced blocks, line breaks on newline) that are supported on all the engines.

That said, I do have custom code for markdown and mistune that handle setting the rel
attributes on links or try to make embedded html safer.

Not sure if the customization methods of other engines may be useful feedback.
But you can see what other contributers have done at:

https://github.com/roundup-tracker/roundup/blob/master/roundup/cgi/templating.py#L122-L235

@mhils
Copy link
Contributor

mhils commented Jul 24, 2023

Thanks for the heads-up! This looks like a sensible refactor, no complaints really. :)

From an API standpoint I'm not sure if having a dedicated test method is particularly useful (extras could just do this check as the first part of run if they want to improve performance?`), but overall this all looks good.

@Crozzers
Copy link
Contributor Author

Apologies for the slow reply, Thanks all for the feedback!


@rouilj I took a look at the examples you linked, and tried my hand at implementing a few of them. All worked out pretty well, although a bit more manual work was needed since this library lacks facilities like the TreeProcessor base class. Instead of iterating over a list of elements, I parsed the output text for <a href... and replaced the hashed links. Not nearly as elegant but definitely achievable.

The developer experience could be made even better by perhaps implementing a LinkPostProcessor base class that finds all <a> tags, parses the HREF and REL attrs and passes it to an internal sub method that can edit the text for final output.

class LinkPostProcessor(Extra, ABC):
  order = Stage.after(Stage.LINKS)
  def test(self, text):
    return True
  @abstractmethod
  def sub(self, match):
    ...
  def run(self, text):
    return re.sub(r'(<a.*?)(?P<href>href="md5-[a-z0-9]{32}")(.*>)', self.sub, text)

class MyExtra(LinkPostProcessor):
  name = 'my-extra'
  def sub(self, text):
    return f'[REMOVED A LINK WITH WITH HREF {match.group("href"}]'

I could see such convenience classes being quite useful for quickly creating extras. Would likely implement as and when needed, probably to simplify existing extras that function in similar ways.
It's not a direct replacement for TreeProcessor and the like but hopefully that sort of idea will do.


@mhils The separated test method was something that I saw in Markdown's extension API docs and replicated here. Personally I think separating this functionality from the run function is cleaner and the explicit method is a good reminder to think about performance.
That said, if one didn't care for performance, including a def text(*_): return True method for each extra would be tedious so perhaps having the base implementation always return True with the option for overriding it would be a good compromise?

Allows extras to check exactly when they've been triggered - before or after certain stages
Ended up implementing a `ItalicsAndBoldProcessor` ABC extra class for both
to piggy-back off of. Works by hashing text that we don't want processed and then doing nothing
with the other text. Hashed text is then replaced after I&B has run
Resolve merge conflicts for PR
Resolve merge conflicts for PR
@mhils
Copy link
Contributor

mhils commented Dec 1, 2023

Thanks @Crozzers! 🍰

Personally I think separating this functionality from the run function is cleaner and the explicit method is a good reminder to think about performance.

This sounds reasonable to me - I think an explicit reminder may be a good idea. :) In that case it should probably not have a default implementation, otherwise that defeats that purpose. 😄

We'd have our first real use case for custom extras over in mitmproxy/pdoc#639, so while no hurries please I'm looking forward to this landing at some point. 😃

@Crozzers
Copy link
Contributor Author

Crozzers commented Jan 3, 2024

@nicholasserra would it be alright to revisit this PR? With all the recent changes to different extras it's getting a bit cumbersome to keep porting them back to this branch. There are also a couple of issues that would benefit from this:

Many thanks

@nicholasserra
Copy link
Collaborator

Sure thing i'll get this on my priority list thank you

@nicholasserra nicholasserra added Priority High priority tickets Feature Feature request labels Jan 3, 2024
@nicholasserra nicholasserra self-assigned this Jan 3, 2024
Merge recent changes to `breaks` functionality
@nicholasserra
Copy link
Collaborator

This is looking pretty fancy to me. I like it. Anythings better than the free for all we have right now. Still wrapping my head around the numerical aspect of the stages though. What's that getting us that something like an ordered list or dict isn't? I need to read through the code a bit longer to properly grasp what's happening.

But overall I think this should be great and we'll land it as-is or with minor tweaks. Just give me a little more time to think on it.

@Crozzers
Copy link
Contributor Author

Still wrapping my head around the numerical aspect of the stages though. What's that getting us that something like an ordered list or dict isn't?

An ordered list would be simpler and easier and would dodge this flaw in the number system:

Calling Stage.before 10 times on the same stage will result in extras overflowing into the next stage

The only issue is: if an extra is tied to a particular stage and that stage gets skipped for some reason, do we still want to run that extra?

With the numbering system, we can say that anything where 100 <= order <= 200 is tied to HASH_HTML and we can skip them. With an ordered list, it would look like this:

[Stage.PREPROCESS, FencedCodeBlocks, Stage.HASH_HTML, FencedCodeBlocks, Stage.LINK_DEFS]

FencedCodeBlocks is registered as before LINK_DEFS and after PREPROCESS, but with an ordered list we can't tell and therefore can't skip it.

We could instead have ordered lists for each extra. For example:

# I've manually filled out the lists here but this would be filled by calling `Stage.before/after`
class Stage:
    # tuple containing the `before` and `after` lists respectively. Could be done as a namedtuple
    PREPROCESS = ([], [FencedCodeBlocks])
    HASH_HTML = ([], [])
    LINK_DEFS = ([FencedCodeBlocks], [])

This solves the issue of numbers overflowing whilst still allowing us to tie extras to particular stages. The only downside is if an extra sets its order as Stage.before(FencedCodeBlocks), it's a bit more tedious to insert it in all the right lists, but this only happens at startup so isn't a huge concern.

I'll play around with this and see what happens

… ordering

Lists let us have a theoreticaly infinite number of extras registered, with no overflowing into the next stage
@Crozzers
Copy link
Contributor Author

@nicholasserra, I've refactored the ordering system for extras. I've moved most of the ordering logic into the Extra class and made Stage into a proper enum. The big change is that the numbering system is now gone.

The type of the Extra.order attribute is now a tuple of two iterables. This represents "before" and "after". EG:

class MyExtra(Extra):
  # run before preprocessing, after HASH_HTML and after the FencedCodeBlocks extra is run
  order = [Stage.PREPROCESS], [Stage.HASH_HTML, FencedCodeBlocks]

When you call MyExtra.register, it will calculate when this extra should be triggered and store this info in Extra._exec_order. When all the extras are registered, it looks something like this:

{<Stage.PREPROCESS: 1>: (
                         [],  # before
                         [<class 'markdown2.Mermaid'>,  # after
                          <class 'markdown2.Wavedrom'>,
                          <class 'markdown2.FencedCodeBlocks'>]),
 <Stage.HASH_HTML: 2>: (
                         [],  # before
                         [<class 'markdown2.MarkdownInHTML'>]),  # after
[...]
 <Stage.ITALIC_AND_BOLD: 14>: (
                         [<class 'markdown2.Underline'>,  # before
                          <class 'markdown2.Strike'>,
                          <class 'markdown2.MiddleWordEm'>,
                          <class 'markdown2.CodeFriendly'>],
                         [<class 'markdown2.Breaks'>,  # after
                          <class 'markdown2.MiddleWordEm'>,
                          <class 'markdown2.CodeFriendly'>,
                          <class 'markdown2.TelegramSpoiler'>])}

As mentioned previously, the Stage class is now an IntEnum containing the different stages. This was done so that extras could check the current stage of execution, and whether they were being called before/after a particular stage. This is useful for extras that need to be called before AND after a stage.

Calling Extra.deregister now also removes it from Extra._exec_order, allowing users to potentially change the order of an extra on the fly.

Finally, in the changelog I've put these changes under the most recent header, which is 2.4.13, but I suspect this change may warrant a bump to 2.5.0

@nicholasserra
Copy link
Collaborator

@Crozzers This is all looking amazing. I think i'm ready to merge this in. Thank you for all the work on this!

I see you have some other open PRs. How would you like to proceed here? Merge this in now, and then fix up the others (if needed), or should I merge those others in and this PR can be rebased? Your call.

I think i'll do a release of 2.4.13 before this lands. Then we can move these changes into 2.5.0 as you recommended. I'll just edit the changelog manually when it comes time.

@Crozzers
Copy link
Contributor Author

None of the other PRs are particularly high priority and #568 requires some further investigation so I'd say merge this one first and I can update the others if needed. They're small patches anyway so shouldn't need much doing to them.

Thanks for taking the time to review this PR, it ended up as a bit of a monster

@nicholasserra
Copy link
Collaborator

Ok then i'm gonna do a release of the current changes, then merge this, then manually bump us to 2.5.0 after

@nicholasserra nicholasserra merged commit 44e0277 into trentm:master Feb 25, 2024
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature Feature request Priority High priority tickets
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature request] Permit custom extras
4 participants