Add SmartyPants extension as part of Python-Markdown #12

Closed
david-a-wheeler opened this Issue May 27, 2011 · 24 comments

This is a feature request. It'd be nice if there was a built-in (batteries included) extension to implement SmartyPants quoting by turning on a simple extension.

I notice that someone is already using SmartyPants with Markdown for Python, though not as an extension:
http://byrneswoder.com/blog/one-secret-to-generating-clean-html-from-text/

Owner

waylan commented Jun 1, 2011

I'm not completely opposed to this, but what's the benefit of:

markdown.markdown(text, extensions=['smartypants'])

over:

smartypants(markdown.markdown(text))

The latter works today and requires less typing. Of course, I realize in some more complex situations (like using smartypants with the codehilite extension) things may not work as well. Therefore, I could see a smartypants extension which implemented all the various features as inline patterns inserted into the parser.

However, this is not high enough on my priority list to devote the time to. Of course, patches and/or merge requests are always welcome.

That doesn't work so nicely on the command line, or with other extensions (as you noted).

Owner

waylan commented Jun 3, 2011

Regarding the command line, markdown outputs to stdout, so any decent
commandline script should be able to take that in stdin with a pipe.
If smartypants doesn't do that, I'd consider that an issue for
smartypants not markdown.

Another issue with making smartypants an extension is how to make it
work on raw html. Markdown does nothing to alter raw html, but users
will expect smartypants to run on the raw html. Sure, an extension
could run as a postprocessor extension on the full text, but then we
have the same issues with other extensions as we have now (when
wrapping the markdown output with a call to smartypants). So what's
the point?

I'm not saying it's not worth doing, just that it will be more work
than I have time for if it is done right. Of course, we'll accept
patches/merge requests.


\X/ /-\ `/ |_ /-\ ||
Waylan Limberg

Owner

waylan commented Jul 21, 2011

Based on the reasons stated previously, I will not be implementing this. If and when someone else writes the extension, they can do a merge request and I'll reconsider. Until then, I'm closing this.

waylan closed this Jul 21, 2011

Here is such an extension: https://bitbucket.org/jeunice/mdx_smartypants

Anyone can use the code directly. I would be happy to put it in whatever format or license would make it most palatable to the Python-Markdown team to include. Smartypants seems like such a natural fit with Markdown. That it's not part of the core is just odd.

Owner

waylan commented May 7, 2012

Aside from the named html entities addition, how is this better than smartypants(markdown(text))?

Either way, the smartypants lib is still needed. And I'm "Meh" to the named html entities addition.

In other words, does this improve the markdown lib enough to justify an additional maintenance load? As I stated previously, I might be interested if smartypants were re-implemented as an extension using the extension API (inlinepatterns etc). If all the extension is doing is providing a wrapper, then why hide the fact in an extension?

Oh and it is not part of the "core" because smartypants.pl (the original) is not part of markdown.pl (the original markdown implementation). Of course, it can be an extension (which is not the "core") which even ships with the "core" if there is real value in doing so. I just don't see that value yet.

In any event, I've added the extension to the wiki

It’s better IMO because it yields a simple, highly-useful function right
out of the box that otherwise requires unwieldy alternatives.

It’s certainly possible to import a couple of separate namedentities and
smartypants modules, then do named_entities(smartyPants(markdown(text))) or
smarter_smartypants(markdown(text)). Unfortunately, that means one must
configure some extension usage (footnotes and tables, e.g.) as part of
markdown configuration, and other extension usage (nice typography,
readable HTML entities, etc.) somewhere else. That non-parallel structure
is more verbose and complex.

I note that several extensions already bundled with markdown—Meta and HTML
Tidy for instance—just pre- or post-process. HTML Tidy, in fact, similarly
depends on an external module for its core operation. It’s unclear to me
why mdx_smartypants, which fits a very similar niche (producing “pretty”
and/or high-quality HTML output) is more challenging / less useful. Indeed,
I’d argue that it’s more valuable.

Publishers and web designers use enhanced-typography glyphs all the time.
Layout systems like Microsoft Word and Wordpress have “smart quotes” as a
matter of course. If markdown doesn’t, or makes it more difficult, that
makes it less capable of rendering attractive output than the competition.
I like markdown and would like to see it used in more places. Extending it
so that it’s more effective at producing great-looking results, with little
effort, seems part of that mission. Indeed, that seems like Gruber’s
original mission.

Trying to reinvent a mature, well-accepted wheel like smartypants or do
some kind of fancy ties to APIs found only in Python-Markdown—that seems
beside the point, as well as counter-productive from both effort and
quality-delivered points of view. I just want to get beautiful, easily-read
output from my markup text, and make it easy for others to do the same.
Channeling the Zen of Python:

  1. Beautiful is better than ugly.
  2. Simple is better than complex.
  3. Readability counts.

Smartypants out-of-the-box improves markdown along those axes. If it makes
it more palatable / less threatening from a maintenance effort point of
view, I’ll help debug smartypants-related issues that might arise.

jse

I'd prefer smartypants functions to work right out-of-the-box with Python-Markdown, but in lieu of that, I've submitted the package to PyPI (http://pypi.python.org/pypi/mdx_smartypants/) so that users can auto-install it with pip install mdx_smartypants (or failing that, fall back to the older easy_install mdx_smartypants).

Collaborator

mitya57 commented Apr 6, 2013

Aside from the named html entities addition, how is this better than smartypants(markdown(text))?

I see some benefits in extension approach:

  • no need to "tokenize" HTML and then build it back;
  • no hacks to prevent smartypants from touching code/pre blocks needed;
  • an extension can use Python-Markdown's escaping mechanism (i.e. to make \-- produce --, not &ndash;).
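To make the first two points concrete, here is a rough sketch of the work a post-processing wrapper has to do itself: split the finished HTML into tags and text, and keep track of pre/code so their contents are left alone. The `educate_html` helper and the `educate` callback are hypothetical names for illustration; this is not how smartypants or Python-Markdown is actually written.

```python
import re

# Split on anything that looks like a tag; the capture group keeps the tags.
TAG_RE = re.compile(r'(<[^>]+>)')
TAG_NAME_RE = re.compile(r'</?\s*(\w+)')
SKIP_TAGS = {'pre', 'code'}


def educate_html(html, educate):
    """Apply `educate` to text outside tags, skipping pre/code content."""
    out = []
    skip_depth = 0  # > 0 while inside a pre or code element
    for token in TAG_RE.split(html):
        if token.startswith('<') and token.endswith('>'):
            match = TAG_NAME_RE.match(token)
            name = match.group(1).lower() if match else ''
            if name in SKIP_TAGS:
                if token.startswith('</'):
                    skip_depth = max(0, skip_depth - 1)
                else:
                    skip_depth += 1
            out.append(token)
        elif skip_depth:
            out.append(token)  # inside pre/code: leave untouched
        else:
            out.append(educate(token))
    return ''.join(out)
```

An extension built on inline patterns would not need any of this bookkeeping, since the parser already knows which spans are code.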
Collaborator

mitya57 commented May 2, 2013

Ah, and another benefit will be the ability to use it in Python-Markdown’s own docs. I’m willing to implement and maintain this extension as part of Python-Markdown if @waylan agrees.

Owner

waylan commented May 2, 2013

@mitya57 to be clear, the benefits you mention do not apply to the existing implementation @jonathaneunice linked to above. However, an extension that reimplemented smartypants using Python-Markdown's extension API (probably a few different inline patterns) would be acceptable to me. I realize that is what you were referring to, but I just wanted to make it clear that I was questioning the value of @jonathaneunice's implementation, not smartypants in general.

So, yeah, I'm all for a smartypants extension if it is done right. I just don't have the time to devote to it.

Collaborator

mitya57 commented May 6, 2013

WIP version available in my smarty-extension branch (suggestions for a better name?).
I didn't manage to stop Markdown from converting &symbol; to &amp;symbol;, so I'm using unicode characters instead of HTML entities for now (I don't like that much). Help is welcome here.

Collaborator

mitya57 commented May 12, 2013

@waylan can you please review my implementation (linked above)? Please don't merge it yet, as there is no documentation and the branch is not clean, but let me know if you do/don't like the code :)

Things that need to be done:

  • documentation
  • add smartypants license
  • make it configurable (i.e. specify needed fixers)
Owner

waylan commented May 12, 2013

Took a quick glance and it generally looks good. A few concerns though. I find it strange that you are monkeypatching the Pattern class rather than using a subclass. In fact, for the dashes and ellipses a single subclass would easily work. And IMO the code would be a little easier to read. Remember that any extensions included with the standard library should be a model of how extensions work.

I'd rather see:

emDashesPattern = SubstituteTextPattern(r'(?<!-)---(?!-)', mdash)

In fact, the SubstituteTextPattern (or whatever better name you come up with) could probably eventually go right in inlinepatterns.py for use by any extension.
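For anyone wondering what the lookarounds in that regex are for, here is a minimal sketch using only the stdlib `re` module. The `substitute_dashes` function and the two-hyphen companion pattern are assumptions made for illustration, not code from the branch.

```python
import re

# The regex quoted above: exactly three hyphens, not part of a longer run.
EM_DASH_RE = re.compile(r'(?<!-)---(?!-)')
# Hypothetical companion pattern: exactly two hyphens.
EN_DASH_RE = re.compile(r'(?<!-)--(?!-)')


def substitute_dashes(text):
    """Replace an isolated '---' with an em dash and an isolated '--' with an en dash."""
    text = EM_DASH_RE.sub('\u2014', text)
    return EN_DASH_RE.sub('\u2013', text)
```

Because of the lookbehind and lookahead, a run of four or more hyphens matches neither pattern and is left as typed.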

Regarding the many quote patterns, I wonder if a subclass could handle them as well - rather than a factory function. I'm undecided about that. But I'd like to see what is possible.

Curious why you chose to use tables in your tests. I would think horizontal rules would be a more interesting test. Especially some of the more interesting patterns people use (try 3 dashes, space, 2 dashes, space, 3 dashes ...). And that is all supported in the standard library.

Owner

waylan commented May 12, 2013

Oh, one more thing, according to the documentation, smartypants is supposed to use HTML entities in the output. Shouldn't we be doing the same?

Collaborator

mitya57 commented May 13, 2013

I'd rather see:

emDashesPattern = SubstituteTextPattern(r'(?<!-)---(?!-)', mdash)

In fact, the SubstituteTextPattern (or whatever better name you come up with) could probably eventually go right in inlinepatterns.py for use by any extension.

Done now.

Regarding the many quote patterns, I wonder if a subclass could handle them as well - rather than a factory function. I'm undecided about that. But I'd like to see what is possible.

These are now also SubstituteTextPatterns.

Curious why you chose to use tables in your tests. I would think horizontal rules would be a more interesting test. Especially some of the more interesting patterns people use (try 3 dashes, space, 2 dashes, space, 3 dashes ...). And that is all supported in the standard library.

Done.

Oh, one more thing, according to the documentation, smartypants is supposed to use HTML entities in the output. Shouldn't we be doing the same?

Please see my previous comment. Actually, I don't see any advantages of not using unicode now (the original smartypants was written back in 2004, probably there were some advantages at that point).

Collaborator

mitya57 commented May 13, 2013

Updated the branch with configuration, docs and license headers.

This is not yet ready for merging because it seems that match.group() cannot be called when the regexp matched STX or ETX, otherwise we can get failures like this one (in test suite):

-<a href="http://example.com">Link</a>\u2019s test</p>
+�\x03\u2019s test</p>

The only way to solve this I can come up with is cloning each regexp into two: one with ?=[\x03-\x04] and one with ?![\x03-\x04], and calling match.group() only for the latter. Any better suggestions?

Owner

waylan commented May 16, 2013

@mitya57 I have a few observations:

It appears that you forgot to commit the documentation. I see the links to the file, but the smarty.txt file is missing.

Also it appears that either your master branch or smarty-extension branch is not up-to-date. If it was, then Github's compare feature would be helpful in seeing a snapshot of all your changes. Not a big deal, but it would be helpful.

I reviewed the entities issue. The problem is that the serializer is escaping any html entities. In fact, ElementTree apparently has no support for them. We handle them by storing them in the htmlStash. If we use entities here, that is what we would need to do. The pattern class would probably need to be a subclass of HtmlPattern and should mark the stashed entities as safe so they still work in "safe_mode" (...store(someEntity, safe=True); see util.py for details). So much for a reusable SubstituteTextPattern.
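The escaping behaviour is easy to reproduce with the stdlib ElementTree alone. Python-Markdown has its own serializer, so treat this as an analogy rather than a trace of its code path:

```python
import xml.etree.ElementTree as ET

# An entity placed in element text is just characters to ElementTree,
# so the ampersand gets escaped on output.
p = ET.Element('p')
p.text = 'wait &mdash; what'
escaped = ET.tostring(p, encoding='unicode')

# A literal unicode em dash has nothing to escape and passes through.
q = ET.Element('p')
q.text = 'wait \u2014 what'
literal = ET.tostring(q, encoding='unicode')

print(escaped)  # <p>wait &amp;mdash; what</p>
print(literal)
```

That is why an entity has to bypass the tree (for instance via a stash of raw HTML) if it is to reach the output unchanged.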

I think we should stick with entities. Some people use weird encodings in their servers/browsers/file systems, and have little or no understanding of the issues, let alone how to change the settings. This is especially true with English users who only ever work with text using ASCII characters. As ASCII converts to unicode, they don't ever notice a problem. If we start adding in random unicode chars, weird things might happen. I didn't check, maybe all the smartypants chars are ASCII anyway, but HTML entities serve this purpose just fine. Might as well stick with them.

Finally, regarding your failing test; I was going to suggest running your pattern earlier, but it needs to run after "escape" at least, and that uses STX and ETX. Can you give me a simple example of the problem outside of the markdown codebase? I'm not sure I understand how the problem is related to match.group(). Oh, and I believe you should be using the \u0002 format rather than \x02 or it may fail to match properly in Python 3.

Collaborator

mitya57 commented May 18, 2013

@waylan Thanks for the review!

It appears that you forgot to commit the documentation. I see the links to the file, but the smarty.txt file is missing.

Will commit tomorrow (it's on a different machine).

Also it appears that either your master branch or smarty-extension branch is not up-to-date. If it was, then Github's compare feature would be helpful in seeing a snapshot of all your changes. Not a big deal, but it would be helpful.

Master updated.

I reviewed the entities issue. [snip]

Thanks for the suggestions, switched to entities and it works!

Finally, regarding your failing test; I was going to suggest running your pattern earlier, but it needs to run after "escape" at least, and that uses STX and ETX. Can you give me a simple example of the problem outside of the markdown codebase?

This doesn't seem to happen outside the markdown codebase. It seems that the return value of handleMatch() must contain neither STX nor ETX; otherwise they stay in the document and don't get cleaned up during HTML conversion. I don't know why this happens. Also, for me \x0* and \u000* are equal in both Python 2 & 3.

Owner

waylan commented May 19, 2013

@mitya57 I just made some inline comments on commit be77195.

Regarding \x0* and \u000* I've tested in multiple platforms in the past, and on at least one (I forget which) \x0* failed to match properly when used in a regex in Python 3. So let's use \u000* for everything.

And it just occurred to me what your problem most likely is with STX and ETX. Your Pattern class needs to be a subclass of HtmlPattern. Then, before passing any text into the htmlStash, you need to pass the text to self.unescape() first (which is defined in and inherited from HtmlPattern). STX and ETX wrap any existing placeholders for already stashed raw html - they act as start and stop delimiters. If a placeholder gets stashed, then it will never get swapped back out. Therefore, we need to swap it back out before stashing it.
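A toy model of the placeholder mechanism may make the failure easier to picture. `ToyStash`, its methods, and the `stash:N` placeholder format are all made up for this sketch; the real htmlStash differs in its details.

```python
import re

STX = '\u0002'  # start delimiter of a placeholder
ETX = '\u0003'  # stop delimiter of a placeholder
PLACEHOLDER_RE = re.compile(STX + r'stash:(\d+)' + ETX)


class ToyStash:
    """Hypothetical stand-in for an HTML stash, for illustration only."""

    def __init__(self):
        self.items = []

    def store(self, html):
        """Keep `html` aside and return a placeholder to put in the text."""
        self.items.append(html)
        return '%sstash:%d%s' % (STX, len(self.items) - 1, ETX)

    def restore(self, text):
        """Swap every intact placeholder back in a single pass."""
        return PLACEHOLDER_RE.sub(lambda m: self.items[int(m.group(1))], text)


stash = ToyStash()
link = stash.store('<a href="#">Link</a>')
# Stashing text that still contains a placeholder hides it from the
# single restore pass, so STX/ETX leak into the output.
outer = stash.store(link + '&rsquo;s test')
leaked = stash.restore(outer)
```

In this model `leaked` still carries the raw placeholder for the link, which is the kind of leftover STX/ETX that shows up in the failing test. A placeholder whose STX or ETX has been split off by a regex match would be skipped by `restore` in the same way.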

Also, as STX and ETX act as start and stop delimiters, make sure your regex isn't matching only part of an existing placeholder. If you break an existing placeholder up, it will never be able to be swapped back out later. As your regexes appear to only make reference to the STX and ETX chars as non-matches, I don't think this is a problem, but I haven't taken the time to actually go over the regular expressions you are using - so I thought it might be worth mentioning.

Hope this helps.

Collaborator

mitya57 commented May 19, 2013

@waylan Thanks for the comments, fixed 3 of 4 and commented on the 4th.

Then, before passing any text into the htmlStash, you need to pass the text to self.unescape() first (which is defined in and inherited from HtmlPattern).

I was not passing output of match.group() to htmlStash, I was using it directly. Escaping it doesn't help anyway :(

Most SmartyPants regexps match a quote, a character before it and/or a character after it. It breaks when one of those characters is STX/ETX.

Collaborator

mitya57 commented Jul 24, 2013

Sorry that it took so long, I was busy with some university work.

Now I've finally managed to fix the STX/ETX issue by using lookbehind expressions. The remaining problem is that quotes before/after inline markup are not recognized properly, i.e. a single quote after a link or emphasized text becomes an opening one instead of a closing one. Any thoughts about that?

If @waylan is OK with having that "bug", I'll make a clean branch and propose a formal pull request.

Owner

waylan commented Jul 25, 2013

@mitya57 I suppose the quote-after-inline-markup bug results from the fact that inline patterns provide no information about where the text comes from (tag, text, tail, etc). With that info, you could have different behavior depending on whether the text came from elm.text or elm.tail, etc. This shortcoming of the API has always annoyed me.
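As a rough illustration of why the missing context matters, here is a stdlib-only sketch. The `educate` rule is a deliberately naive assumption (a single quote at the start of a string or after whitespace opens, any other closes), not the extension's actual regexes:

```python
import re
import xml.etree.ElementTree as ET

# Naive rule, assumed for illustration: a quote with no non-space
# character before it is an opening quote.
OPENING_RE = re.compile(r"(?<!\S)'")


def educate(text):
    """Curl single quotes using only what is visible in `text`."""
    text = OPENING_RE.sub('\u2018', text)
    return text.replace("'", '\u2019')


p = ET.fromstring("<p><em>Link</em>'s test</p>")
em = p.find('em')

whole = educate("Link's test")  # quote follows a letter: closing quote
tail_only = educate(em.tail)    # same quote seen without the "Link" before it
```

Seen in isolation, the tail starts with the quote, so the naive rule curls it the wrong way. Knowing that the text was a tail following an inline element would let a pattern choose the closing quote instead.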

I'm open to suggestions that won't break existing extensions.

Collaborator

mitya57 commented Jul 26, 2013

Actually, the original SmartyPants library suffers from the same issue, so we don't regress here. I've now created pull request #231 for your review.
