Skip to content
This repository

Add SmartyPants extension as part of Python-Markdown #12

Closed
david-a-wheeler opened this Issue · 24 comments

4 participants

David A. Wheeler Waylan Limberg Jonathan Eunice Dmitry Shachnev
David A. Wheeler

This is a feature request. It'd be nice if there was a built-in (batteries included) extension to implement SmartyPants quoting by turning on a simple extension.

I notice that someone is already using SmartyPants with Markdown for Python, though not as an extension:
http://byrneswoder.com/blog/one-secret-to-generating-clean-html-from-text/

Waylan Limberg
Owner
waylan commented

I'm not completely opposed to this, but what's the benefit of:

markdown.markdown(text, extensions=['smartypants'])

over:

smartypants(markdown.markdown(text))

The latter works today and requires less typing. Of course, I realize in some more complex situations (like using smartypants with the codehilite extension) things may not work as well. Therefore, I could see a smartypants extension which implemented all the various features as inline patterns inserted into the parser.

However, this is not high enough on my priority list to devote the time to. Of course, patches and/or merge requests are always welcome.

David A. Wheeler

That doesn't work so nicely on the command line, or with other extensions (as you noted).

Waylan Limberg
Owner
waylan commented
Waylan Limberg
Owner
waylan commented

Based on the reasons stated previously, I will not be implementing this. If and when someone else writes the extension, they can do a merge request and I'll reconsider. Until then, I'm closing this.

Waylan Limberg waylan closed this
Jonathan Eunice

Here is such an extension: https://bitbucket.org/jeunice/mdx_smartypants

Anyone can use the code directly. I would be happy to put it in whatever format or license would make it most palatable to the Python-Markdown team to include. Smartypants seems like such a natural fit with Markdown. That it's not part of the core is just odd.

Waylan Limberg
Owner
waylan commented

Aside from the named html entities addition, how is this better that smartypants(markdown(text))?

Either way, the smartypants lib is still needed. And I'm "Meh" to the named html entities addition.

In other words, does this improve the markdown lib enough to justify an additional maintenance load? As I stated previously, I might be interested if smartypants we re-implemented as an extension using the extension API (inlinepatterns etc). If all the extension is doing is providing a wrapper, then why hide the fact in an extension?

Oh and it is not part of the "core" because smartypants.pl (the original) is not part of markdown.pl (the original markdown implementation). Of course, it can be an extension (which is not the "core") which even ships with the "core" if there is real value in doing so. I just don't see that value yet.

In any event, I've added the extension to the wiki

Jonathan Eunice
Jonathan Eunice

I'd prefer smartypants functions to work right out-of-the-box with Python-Markdown, but in lieu of that, I've submitted the package to PyPI (http://pypi.python.org/pypi/mdx_smartypants/) so that users can auto-install it with pip install mdx_smartypants (or failing that, fall back to the older easy_install mdx_smartypants).

Dmitry Shachnev

Aside from the named html entities addition, how is this better that smartypants(markdown(text))?

I see some benefits in extension approach:

  • no need to "tokenize" HTML and then build it back;
  • no hacks to prevent smartypants from touching code/pre blocks needed;
  • an extension can use Python-Markdown's escaping mechanism (i.e. to make \-- produce --, not –).
Dmitry Shachnev
mitya57 commented

Ah, and another benefit will be ability to use it in Python-Markdown’s own docs. I’m willing to implement and maintain this extension as part of Python-Markdown if @waylan agrees.

Waylan Limberg
Owner
waylan commented

@mitya57 to be clear, the benefits you mention do not apply to the existing implementation @jonathaneunice linked to above. However, an extension that reimplemented smartypants using Python-Markdown's extension API (probably a few different inline patterns) would be acceptable to me. I realize that is what you were referring to, but I just wanted to make it clear that I was questioning the value of @jonathaneunice's implementation, not smartypants in general.

So, yeah, I'm all for a smartypants extension if it is done right. I just don't have the time to devote to it.

Dmitry Shachnev
mitya57 commented

WIP version available in my smarty-extension branch (suggestions for a better name?).
I didn't manage to make Markdown not convert &symbol; to &symbol;, so using unicode instead of HTML symbols (I don't like that much), so help welcomed here.

Dmitry Shachnev
mitya57 commented

@waylan can you please review my implementation (linked above)? Please don't merge it yet, as there is no documentation and branch is not clean, but let me know if you do/don't like the code :)

Things that need to be done:

  • documentation
  • add smartypants license
  • make it configurable (i.e. specify needed fixers)
Waylan Limberg
Owner
waylan commented

Took a quick glance and generally looks good. A few concerns though. I find is strange that you are monkeypatching the Pattern class rather than using a subclass. In fact, for the dashes and ellipses a single subclass would easily work. And IMO the code would be a little easier to read. Remember that any extensions included with the standard library should be a model of how extensions work.

I'd rather see:

emDashesPattern = SubstituteTextPattern(r'(?<!-)---(?!-)', mdash)

In fact, the SubstituteTextPattern (or whatever better name you come up with) could probably eventually go right in inlinpatterns.py for use by any extension.

Regarding the many quote patterns, I wonder if a subclass could handle them as well - rather than a factory function. I'm undecided about that. But I'd like to see what is possible.

Curious why you chose to use tables in your tests. I would think horizontal rules would be a more interesting test. Especially some of the more interesting patterns people use (try 3 dashes, space, 2 dashes, space, 3 dashes ...). And that is all supported int he standard library.

Waylan Limberg
Owner
waylan commented

Oh, one more thing, according to the documentation, smartypants is supposed to use HTML entities in the output. Shouldn't we be doing the same?

Dmitry Shachnev
mitya57 commented

I'd rather see:

emDashesPattern = SubstituteTextPattern(r'(?<!-)---(?!-)', mdash)

In fact, the SubstituteTextPattern (or whatever better name you come up with) could probably eventually go right in inlinepatterns.py for use by any extension.

Done now.

Regarding the many quote patterns, I wonder if a subclass could handle them as well - rather than a factory function. I'm undecided about that. But I'd like to see what is possible.

These are now also SubsituteTextPatterns.

Curious why you chose to use tables in your tests. I would think horizontal rules would be a more interesting test. Especially some of the more interesting patterns people use (try 3 dashes, space, 2 dashes, space, 3 dashes ...). And that is all supported int he standard library.

Done.

Oh, one more thing, according to the documentation, smartypants is supposed to use HTML entities in the output. Shouldn't we be doing the same?

Please see my previous comment. Actually, I don't see any advantages of not using unicode now (the original smartypants was written back in 2004, probably there were some advantages at that point).

Dmitry Shachnev
mitya57 commented

Updated the branch with configuration, docs and license headers.

This is not yet ready for merging because it seems that match.group() cannot be called when the regexp matched STX or ETX, otherwise we can get failures like this one (in test suite):

-<a href="http://example.com">Link</a>\u2019s test</p>
+\x03\u2019s test</p>

The only way to solve this I can come up with is cloning each regexp into two: one with ?=[\x03-\x04] and one with ?![\x03-\x04], and calling match.group() only for the latter. Any better suggestions?

Waylan Limberg
Owner
waylan commented

@mitya57 I have a few observations:

It appears that you forgot to commit the documentation. I see the links to the file, but the smarty.txt file is missing.

Also it appears that either your master branch or smarty-extension branch is not up-to-date. If it was, then Github's compare feature would be helpful in seeing a snapshot of all your changes. Not a big deal, but it would be helpful.

I reviewed the entities issue. The problem is that the serializer is escaping any html entities. In fact, ElementTree apparently has no support for them. We handle them by storing them in the htmlStash. If we use entities here, that is what we would need to do. The pattern class would probably need to be a subclass of HtmlPattern and should mark the stashed entities as safe so they still work in "safe_mode" (...store(someEntity, safe=True); see util.py for details). So much for a reusable SubstituteTextPattern.

I think we should stick with entities. Some people use weird encodings in their servers/browsers/file systems, and have little or no understanding of the issues, let alone how to change the settings. This is especially true with English users who only ever work with text using ASCII characters. As ASCII converts to unicode, they don't ever notice a problem. If we start adding in random unicode chars, weird things might happen. I didn't check, maybe all the smartypants chars are ASCII anyway, but HTMLentities serve this purpose just fine. Might as well stick with them.

Finally, regarding your failing test; I was going to suggest running your pattern earlier, but it needs to run after "escape" at least, and that uses STX and ETX. Can you give me a simple example of the problem outside of the markdown codebase. I'm not sure I understand how the problem is related to match.group(). Oh, and I believe you should be using the \u0002 format rather than \x02 or it may fail to match properly in Python 3.

Dmitry Shachnev
mitya57 commented

@waylan Thanks for the review!

It appears that you forgot to commit the documentation. I see the links to the file, but the smarty.txt file is missing.

Will commit tomorrow (it's on a different machine).

Also it appears that either your master branch or smarty-extension branch is not up-to-date. If it was, then Github's compare feature would be helpful in seeing a snapshot of all your changes. Not a big deal, but it would be helpful.

Master updated.

I reviewed the entities issue. [snip]

Thanks for the suggestions, switched to entities and it works!

Finally, regarding your failing test; I was going to suggest running your pattern earlier, but it needs to run after "escape" at least, and that uses STX and ETX. Can you give me a simple example of the problem outside of the markdown codebase.

This seems to not happen outside markdown codebase. It seems that return value of handleMatch() can contain neither STX nor ETX, otherwise they stay in the document and don't get cleaned in the process of HTML conversion. I don't know why this happens. Also, for me \x0* and \u000* are equal in both Python 2 & 3.

Waylan Limberg
Owner
waylan commented

@mitya57 I just made some inline comments on commit be77195.

Regarding \x0* and \u000* I've tested in multiple platforms in the past, and on at least one (I forget which) \x0* failed to match properly when used in a regex in Python 3. So let's use \u000* for everything.

And it just occured to me what your problem most likely is with STX and ETX. Your Pattern class needs to be a subclass of HtmlPattern. Then, before passing any text into the htmlStash, you need to pass the text to self.unescape() first (which is defined in and inherited from HtmlPattern). STX and ETX wrap any existing placeholders for already stashed raw html - they act as start and stop deliminators. If a placehold gets stashed, then it will never get swapped back out. Therefore, we need to swap it back out before stashing it.

Also, as STX and ETX act as start and stop deliminators, make sure your regex isn't matching only part of an existing placeholder. If you break an existing placeholder up, it will never be able to be swapped back out later. As your regexes appear to only make reference to the STX and ETX chars as non-matches, I don't think this is a problem, but I haven't taken the time to actually go over the regular expressions you are using - so I thought it might be worth mentioning.

Hope this helps.

Dmitry Shachnev
mitya57 commented

@waylan Thanks for the comments, fixed 3 of 4 and commented on the 4th.

Then, before passing any text into the htmlStash, you need to pass the text to self.unescape() first (which is defined in and inherited from HtmlPattern).

I was not passing output of match.group() to htmlStash, I was using it directly. Escaping it doesn't help anyway :(

Most SmartyPants regexps match a quote, a character before it and/or a character after it. It breaks when one of those characters is STX/ETX.

Dmitry Shachnev

Sorry that it took too long, I was busy with some university work.

Now I've finally managed to fix the STX/ETX issue by using lookbehind experssions. The remaining problem is that quotes before/after inline markup are not recognized properly, i.e. a single quote after a link or emphasized text becomes an opening one instead of a closing one. Any thoughts about that?

If @waylan is OK with having that "bug", I'll make a clean branch and propose a formal pull request.

Waylan Limberg
Owner
waylan commented

@mitya57 I suppose the reason for the quote-after-inline-markup bug is a result of the fact that inline patterns provide no information about where the text comes from (tag, text, tail, etc). With that info, you could have different behavior if the text was from the elm.text, or elm.tail etc. This shortcoming of the API has always annoyed me.

I'm open to suggestions that won't break existing extensions.

Dmitry Shachnev

Actually, the original SmartyPants library suffers from the same issue, so we don't regress here. I've now created pull request #231 for your review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.