Markdown #431

Closed
wants to merge 17 commits into from

gaborcsardi

From #365. This is not ready yet for merging, but it is close. :)

This PR adds markdown parsing for fields: title, description, details, references, concept, note, seealso, keywords, return, author, section, format, source, param, slot, field, method.

It should leave Rd notation unaffected. One glitch I noticed is that the commonmark parser removes leading whitespace. This is mostly fine for Rd, where whitespace is not significant except within \preformatted{}, so I have a workaround for that. (FIXME: any other case where leading whitespace matters?)

It is possible that some Rd notation is picked up as markdown, but all the examples I could come up with were quite artificial. Nevertheless we should have a @nomarkdown tag that forbids Markdown parsing.

For the supported notation, see the test cases and the readme at maxygen: https://github.com/gaborcsardi/maxygen

TODO:

  • Write a vignette, once we have decided about all the details.
  • Add a @nomarkdown tag. Or @noMd is better.

Feedback welcome!

@@ -7,6 +7,10 @@

* Fixed bug in `@noRd`, where usage would cause error (#418).

* Most fields can now be written using Markdown markup instead of the

FWIW I think it's more robust to always put new items at the top of the list

Done.

@gaborcsardi

What do you think about the @nomarkdown tag? Is that OK?

@hadley

hadley commented Nov 12, 2015

I think it should be @noMarkdown to be consistent with other tags, or maybe just @noMd to be consistent with @noRd

@gaborcsardi

OK, @noMd it is.

@gaborcsardi

Here is a problem I just found. The Md emphasis markers are picked up from different lines, and this is sometimes a problem. E.g.

    #' Title
    #'
    #' Description with some *keywords* included.
    #' So far so good. \preformatted{ *these are not
    #'   emphasised*. Or are they?
    #' }

results in

    \description{
    Description with some \emph{keywords} included. So far so
    good. \preformatted{ \emph{these are not emphasised}. Or are
    they? }
    }

This is of course correct for Md, but it will screw up Rd. _ is similar, I guess.
Possible solutions:

  • escape (replace, really) * and _ within \preformatted and \eqn before
    the markdown parsing, and then put them back.
  • do not parse markdown within \preformatted and \eqn at all.

The second is more sensible, I think. I would remove them completely before the parsing,
and then put them back at the appropriate place after. You would not want to write Md within
\preformatted and \eqn, anyway, right?
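
A minimal sketch of the second option (remove, parse, put back), in base R. The names `toy_markdown`, `protect`, `restore` and the `<<RD-n>>` placeholder format are made up for illustration. `toy_markdown` is only a stand-in that turns `*text*` into `\emph{text}`, it is not the commonmark parser, and the regular expression only handles non-nested `\preformatted{}` and `\eqn{}`.

```r
# Toy stand-in for the Markdown step: converts *text* to \emph{text} only.
toy_markdown <- function(x) {
  gsub("\\*([^*]+)\\*", "\\\\emph{\\1}", x)
}

# Cut out non-nested \preformatted{...} and \eqn{...} and leave numbered
# placeholders behind (hypothetical placeholder format).
protect <- function(x) {
  pattern <- "\\\\(preformatted|eqn)\\{[^{}]*\\}"
  m <- gregexpr(pattern, x)
  saved <- regmatches(x, m)[[1]]
  regmatches(x, m) <- list(sprintf("<<RD-%d>>", seq_along(saved)))
  list(text = x, saved = saved)
}

# Put the saved fragments back, in order, in place of the placeholders.
restore <- function(x, saved) {
  m <- gregexpr("<<RD-[0-9]+>>", x)
  regmatches(x, m) <- list(saved)
  x
}

input <- "Some *keywords*. \\preformatted{ *these are not\n  emphasised*. }"

unprotected <- toy_markdown(input)          # emphasis leaks into \preformatted
p <- protect(input)
output <- restore(toy_markdown(p$text), p$saved)
cat(output, "\n")
```

With the fragment protected, only `*keywords*` is converted and the contents of `\preformatted{}` come back untouched.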

Thoughts? @hadley @jeroenooms ?

@gaborcsardi

I chatted with @jeroenooms about this. Here are three good options to make sure that Rd markup still works as expected after merging this PR.

  1. The first one is simple: we require a @md tag for docs that are in Markdown. If @md is present, everything is parsed as md. This could be specified at the package level or the R object level.
  2. The second one is simple conceptually, but a bit harder to implement. We would protect all Rd commands from the Markdown parser, so they would not be parsed at all. This also means that if you have an \itemize{}, you cannot use Markdown in it. The implementation would just replace Rd tags with markers before calling the md parser, and then put them back afterwards. This is similar to how Markdown is not interpreted within HTML.
  3. This is similar to the previous one, but we would only apply it to Rd tags that are potentially dangerous: \code, \preformatted, \deqn and \eqn, etc. The complete list can be assembled from https://developer.r-project.org/parseRd.pdf. We would protect tags which most probably do not contain markdown markup, but might contain text, code or markup that is accidentally interpreted by the markdown parser. The only annoying thing with this solution is that if your Rd docs are still picked up as md, there might be no errors or warnings; your docs output will just be incorrect. But I don't think this would happen too often.

I think I prefer number three. It is potentially (slightly) dangerous, but it would allow a seamless transition, and I think it would work just fine in almost all cases.

@hadley What do you think?

@hadley

hadley commented Dec 2, 2015

3 sounds good to me.

@gaborcsardi

This is actually quite tricky. Ideally you would want to parse the text with Rd first, so that you can be sure that you find all the Rd tags, without any errors. But of course we don't want to do that, and with the markdown markup included, the text might not even parse as Rd.

So the plan is to have a very simple parser, that only cares about \{}% characters. It would find the Rd macro arguments that we want to ignore, and replace them with markers, before the Md parsing. And then put them back in the end.
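
A rough sketch of such a scanner in base R, only to illustrate the idea. The function name `find_arg_end` is hypothetical, and the rules are simplified assumptions: a backslash escapes the next character, `%` starts a comment that runs to the end of the line, and everything else except braces is ignored.

```r
# Return the position of the "}" that closes the "{" at position `start`,
# or NA if the argument is never closed.
find_arg_end <- function(text, start) {
  chars <- strsplit(text, "")[[1]]
  stopifnot(chars[start] == "{")
  n <- length(chars)
  depth <- 0L
  i <- start
  while (i <= n) {
    ch <- chars[i]
    if (ch == "\\") {
      i <- i + 1L                                     # skip the escaped character
    } else if (ch == "%") {
      while (i <= n && chars[i] != "\n") i <- i + 1L  # skip the comment
    } else if (ch == "{") {
      depth <- depth + 1L
    } else if (ch == "}") {
      depth <- depth - 1L
      if (depth == 0L) return(i)
    }
    i <- i + 1L
  }
  NA_integer_
}

text <- "\\code{f(\\{ x) {y}} and *more*"
open <- as.integer(regexpr("{", text, fixed = TRUE))
close <- find_arg_end(text, open)
substr(text, open, close)   # the whole argument, including nested braces
```

Once the end of the argument is known, the span can be swapped for a marker before the Md parsing and restored afterwards.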

I am a bit afraid that writing this in R will be somewhat slow, but we'll see. I'll include the list of all Rd macros, and how we would handle them, in another comment.

@gaborcsardi

Summary

The tables below contain all Rd macros, according to https://developer.r-project.org/parseRd.pdf and Writing R Extensions.

+ in the MD column means that we will parse the contents of the argument(s) of the macro as Markdown. - means that we will not, and roxygen will treat them verbatim.

For some macros, I am not quite sure what to do, these are marked with a question mark, and there is a short discussion about them at the end.

Sectioning macros

| Macro | Text type | Roxy tag | MD |
|---|---|---|---|
| \arguments | latex | implicit | + |
| \author | latex | @author | + |
| \concept | latex | @concepts | +? |
| \description | latex | @description | + |
| \details | latex | @details | + |
| \docType | latex | @docType | - |
| \encoding | latex | @encoding | - |
| \format | latex | @format | + |
| \keyword | latex | @keywords | - |
| \name | latex | @name | - |
| \note | latex | @note | + |
| \references | latex | @references | + |
| \section | latex | @section | + |
| \seealso | latex | @seealso | + |
| \source | latex | @source | + |
| \title | latex | @title | + |
| \value | latex | @value | + |
| \examples | R | @examples | - |
| \usage | R | @usage | - |
| \alias | verbatim | @aliases | - |
| \Rdversion | verbatim | | - |
| \synopsis | verbatim | | - |
| \Sexpr | R | | - |
| \RdOpts | verbatim | | - |

Markup macros within sections taking LaTeX-like text

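| Macro | Text type | Roxy tag | MD |
|---|---|---|---|
| \acronym | latex | | -? |
| \bold | latex | | + |
| \cite | latex | | + |
| \command | latex | | +? |
| \dfn | latex | | + |
| \dQuote | latex | | + |
| \email | latex | | - |
| \emph | latex | | + |
| \file | latex | | - |
| \item | latex | | + |
| \linkS4class | latex | | - |
| \pkg | latex | | - |
| \sQuote | latex | | + |
| \strong | latex | | + |
| \var | latex | | - |
| \describe | latex | | + |
| \enumerate | latex | | + |
| \itemize | latex | | + |
| \enc | latex | | + |
| \if | latex | | +2 |
| \ifelse | latex | | +23 |
| \method | latex | | - |
| \S3method | latex | | - |
| \S4method | latex | | - |
| \tabular | latex | | + |
| \subsection | latex | | + |
| \link | latex | | - |
| \href | verb+latex | | +2 |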

+2 means that MD is parsed in the second argument only. +23 means that MD is parsed in the second and third arguments, but not in the first, etc.
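
As an illustration of the `+2` / `+23` notation, the per-macro decision could be kept in a small lookup table. This is only a sketch: `md_args` and `parse_as_markdown` are hypothetical names, the table lists just a few macros, and treating unknown macros as verbatim is an assumption.

```r
# Positions (1-based) of the arguments that get Markdown parsing.
# integer(0) means that no argument of the macro is parsed.
md_args <- list(
  emph         = 1L,          # "+"
  `if`         = 2L,          # "+2"
  ifelse       = c(2L, 3L),   # "+23"
  href         = 2L,          # "+2"
  code         = integer(0),  # "-"
  preformatted = integer(0)   # "-"
)

parse_as_markdown <- function(macro, arg_index) {
  args <- md_args[[macro]]
  if (is.null(args)) return(FALSE)  # unknown macro: leave verbatim (assumption)
  arg_index %in% args
}

parse_as_markdown("ifelse", 1L)  # first argument of \ifelse is not parsed
parse_as_markdown("ifelse", 3L)  # third argument is
```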

Markup macros within sections taking R-like text, or verbatim text.

Note that some macros do not take arguments; these are not important for our purposes: \cr, \dots, \ldots, \R, \tab.

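| Macro | Text type | Roxy tag | MD |
|---|---|---|---|
| \code | R | | -? |
| \dontshow | R | | - |
| \donttest | R | | - |
| \testonly | R | | - |
| \dontrun | verbatim | | - |
| \env | verbatim | | - |
| \kbd | verbatim | | - |
| \option | verbatim | | - |
| \out | verbatim | | - |
| \preformatted | verbatim | | - |
| \samp | verbatim | | - |
| \special | verbatim | | - |
| \url | verbatim | | - |
| \verb | verbatim | | - |
| \deqn | verbatim | | - |
| \eqn | verbatim | | - |
| \newcommand | verbatim | | -? |
| \renewcommand | verbatim | | -? |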

New functions, not in the original docs

| Macro | Text type | Roxy tag | MD |
|---|---|---|---|
| \figure | special? | | - |

User defined macros

These are in share/Rd/macros/system.Rd and can be used in Rd files in general. Here is the current R-devel version: https://github.com/wch/r-source/blob/trunk/share/Rd/macros/system.Rd

| Macro | MD |
|---|---|
| \CRANpkg | - |
| \PR | - |
| \sspace | - |
| \packageTitle | - |
| \packageDescription | - |
| \packageAuthor | - |
| \packageMaintainer | - |
| \packageDESCRIPTION | - |
| \packageIndices | - |
| \doi | - |
Comments on some macros

  • \concept usually doesn't need MD markup, but maybe sometimes
    it does, so I would allow it for now.
  • I guess we don't need MD within \acronym.
  • Is it dangerous to parse MD within \command?
  • \code is tricky, because it is usually R code, so we don't want to parse it, but \link and \var are still interpreted. Anyway, within code, people can write Rd syntax.
  • \newcommand and \renewcommand are tricky, too. For now we don't parse them.

@gaborcsardi

@hadley @jeroenooms

This is pretty close now I think.

  1. I implemented the escaping of the "fragile" Rd tags: \code, \preformatted, etc. Within these there is no markdown parsing.
  2. I also implemented the @noMd tag. It works at the block level, just like @noRd, and uses a special marker in the tags environment, called markdown-support. This somewhat abuses the tags, but it also seemed like a natural place to put it. I can certainly put it in another environment if you don't like this solution.

For the fragile tag parsing, I modified the rdComplete C code, because I needed almost the same code, to find the end of the arguments after an Rd tag.

Apart from writing the vignette, what would be the best form for documentation? Is a brief manual page, pointing to the vignette, enough? Just trying to avoid duplication, if possible....

Please test if you have time and you are brave. :) Just kidding, I think it will work mostly out of the box, and if not, you can always use @noMd.

Btw. do we (temporarily?) want to have an option to turn on markdown parsing and leave it off by default? Just to be on the safe side.

@gaborcsardi

I fixed some bugs, and found a new one. More precisely, it is not really a bug, but it is still annoying. These docs:
https://github.com/hadley/ggplot2/blob/9b5e097e0aafdca19b5e8f9d2153177eeba809fb/R/data.R#L117
are interpreted as an ordered list:

❯ cat(markdown_xml("qqqq\n  1. eeee fff ggg\nrrrr xvxcvxcvxcv"))
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document SYSTEM "CommonMark.dtd">
<document xmlns="http://commonmark.org/xml/1.0">
  <paragraph>
    <text>qqqq</text>
  </paragraph>
  <list type="ordered" start="1" delim="period" tight="true">
    <item>
      <paragraph>
        <text>eeee fff ggg</text>
        <softbreak />
        <text>rrrr xvxcvxcvxcv</text>
      </paragraph>
    </item>
  </list>
</document>

But of course sometimes these are really meant to be ordered lists. So I am not sure what to do about this, if anything. (Other than making sure that there is no error at least, like for ggplot2.)

@gaborcsardi

This is the diff between original roxygen and roxygen with this PR, for ggplot2: https://gist.github.com/gaborcsardi/baed9ffd55c072425bb926595b144324

  • Leading whitespace is deleted (you don't see this in the diff, because in the diff I ignored the whitespace).
  • $ _ and % are not escaped
  • A whole \section{}{} is removed
  • four leading spaces are picked up as \preformatted{} when they ideally shouldn't
  • the spurious ordered list, mentioned in Markdown #431 (comment)

The first three I can fix, those are bugs.

The last two are supposedly features in commonmark, which are better fixed in ggplot2 I think.

I will fix these and then run it on more packages to see other possible bugs.

@gaborcsardi

So, if we don't mind deviating from commonmark, we can preprocess the text, so that

  • Ordered lists are only picked up after an empty line
  • The same for block quotes

I.e. this is an ordered list:

blah blah blah

1. first
2. second

but this is not:

blah blah blah April 3,
2016. 
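
A sketch of what such a preprocessing step could look like for ordered lists, in base R. It is only an illustration of the idea, not code from this PR: `escape_midparagraph_lists` is a hypothetical helper that backslash-escapes a list delimiter when the marker directly follows a non-blank line that is not part of a list, so that the Markdown parser sees plain text. Block quotes and many corner cases are ignored.

```r
escape_midparagraph_lists <- function(text) {
  lines <- strsplit(text, "\n", fixed = TRUE)[[1]]
  # Up to three spaces, digits, then "." or ")", then a space or end of line.
  marker <- "^( {0,3}[0-9]{1,9})([.)])( |$)"
  in_list <- FALSE
  prev_blank <- TRUE
  for (i in seq_along(lines)) {
    is_blank <- !nzchar(trimws(lines[i]))
    is_marker <- grepl(marker, lines[i])
    if (is_marker && !prev_blank && !in_list) {
      # Mid-paragraph marker: "2016." becomes "2016\."
      lines[i] <- sub(marker, "\\1\\\\\\2\\3", lines[i])
    } else if (is_marker) {
      in_list <- TRUE          # list started after a blank line, or continues
    } else if (is_blank) {
      in_list <- FALSE
    }
    prev_blank <- is_blank
  }
  paste(lines, collapse = "\n")
}

real_list <- escape_midparagraph_lists("blah blah blah\n\n1. first\n2. second")
not_a_list <- escape_midparagraph_lists("blah blah blah April 3,\n2016. ")
cat(not_a_list, "\n")
```

The first input is returned unchanged, while in the second the `2016.` line gets an escaped delimiter.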

@jeroenooms?

@gaborcsardi

Btw. @hadley, thanks for the +1, please comment next time, because it seems that I don't get a notification about +1s. I am not sure if this is for the good or bad. Thanks.

@jeroen

jeroen commented Apr 4, 2016

I suggest we stick with the commonmark standard and put the responsibility of using proper markdown with the package authors. I think these edge cases are quite rare, and defining a different markdown only makes things more confusing and error prone. People will have to deal with the same issue when using markdown in vignettes or elsewhere.

We double-escape them before running the markdown parser.
Ideally we should keep leading ws in general, because that's what
Roxygen currently does. But it is also tricky, because the
ws might be followed by an ordered or unordered list, or a >
for a block quote, etc.

The parsed XML does not have the ws any more, so we need to
handle it before the markdown run.
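
A small sketch of why double-escaping helps, assuming that "them" refers to backslash escapes in front of `$`, `_` and `%` (the note does not say so explicitly). A Markdown parser treats a backslash before ASCII punctuation as an escape and drops it, so `\%` would reach Rd as a bare `%`. Both function names are hypothetical, and `toy_md_unescape` only imitates that one aspect of Markdown parsing.

```r
# Double the backslash in front of $, _ and %, so that one level of
# escaping survives the Markdown pass.
double_escape <- function(x) {
  gsub("\\\\([$_%])", "\\\\\\\\\\1", x)
}

# Toy stand-in for the parser's escape handling: a backslash before
# ASCII punctuation is dropped.
toy_md_unescape <- function(x) {
  gsub("\\\\([[:punct:]])", "\\1", x)
}

rd <- "Costs 100\\% of \\$x"                   # Rd text: Costs 100\% of \$x

lost <- toy_md_unescape(rd)                    # escapes are gone
kept <- toy_md_unescape(double_escape(rd))     # escapes survive
cat(lost, "\n")
cat(kept, "\n")
```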
@hadley

hadley commented Apr 4, 2016

Yeah, I'd say stick with commonmark as well.

@gaborcsardi

OK, I fixed all the bugs I noticed. Still needs some testing.

I think we should turn it off by default, because I cannot reliably fix the removal of the leading whitespace. Sometimes leading whitespace is meaningful for markdown, e.g.

* This is 
* A list
    - With an embedded
    - List inside

I cannot "escape" the leading whitespace in this case, because then commonmark does not parse the list properly. So I think it is better to turn it off by default for now.

@hadley Is there a standard way of specifying Roxygen options for the whole package? If not, then maybe we could have one using the RoxygenNote field?

In addition it would make sense to have an @md tag as well, to turn it on for a single chunk.

@gaborcsardi

@hadley If there is no way currently to set options for the whole package, how about using the RoxygenNote DESCRIPTION field for that? We could have something like:

RoxygenNote: 5.0.1, markdown = TRUE
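
Reading such a field could look roughly like this. It is a sketch of the proposed format only: `parse_roxygen_note` is a hypothetical helper, the `version, key = value` layout is just the suggestion above, and option values are kept as strings.

```r
parse_roxygen_note <- function(note) {
  # First comma-separated piece is the version, the rest are key = value pairs.
  parts <- trimws(strsplit(note, ",", fixed = TRUE)[[1]])
  opts <- list()
  for (p in parts[-1]) {
    kv <- trimws(strsplit(p, "=", fixed = TRUE)[[1]])
    if (length(kv) == 2L) opts[[kv[1]]] <- kv[2]
  }
  list(version = parts[1], options = opts)
}

note <- parse_roxygen_note("5.0.1, markdown = TRUE")
note$version            # the roxygen version part
note$options$markdown   # the option value, still a string
```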

I would do this in another PR.

@hadley

hadley commented Apr 6, 2016

Maybe for this version we should just let people opt in with @md? And then in the next version we could think about global switches?

@gaborcsardi

@hadley I was thinking about that, too. It is kinda painful to opt in for every single chunk.

@gaborcsardi

But sure, for now, we can do that.

@gaborcsardi

OK, it is opt-in now. There is an @md tag instead of the @noMd.
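
For illustration, an opted-in block might look like the following. The function `add` and its docs are made up, and which Markdown constructs are supported is an assumption here; the test cases and the maxygen readme are the reference for that.

```r
#' Add two numbers
#'
#' Returns the sum of `x` and `y`. With `@md` present, *emphasis* and
#' **strong** text are written in Markdown, while Rd markup such as
#' \code{x + y} should be left as it is.
#'
#' @param x,y Numeric vectors.
#' @return A numeric vector.
#' @md
#' @export
add <- function(x, y) x + y
```

Blocks without `@md` are not touched by the Markdown parser.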

This way I think it is pretty safe to merge, in the end I did not modify any existing test cases, and all of them pass.

There is a single item on the TODO list, the vignette about markdown mentioned in #431 (comment). Maybe I can just write that in another PR, and we can start testing this. :)

EDIT: I mean, start using this. :)

@gaborcsardi

@hadley Btw. when is the next Roxygen release planned?

Just because I am excited to start using this. :) And also, I need to write the vignette before that.....

Anyway, if you missed the previous comment because you are not watching this repo, this is now ready to be merged I think.

@jeroen

jeroen commented Apr 14, 2016

@gaborcsardi I'm looking into supporting readthedocs for R documentation with karthik. They also support commonmark as an input format http://docs.readthedocs.org/en/latest/getting_started.html.

@gaborcsardi

@jeroenooms That would be amazing!

Somewhat different from this PR, though. Depending on what exactly you want. I guess there you want some Rd -> markdown translation, whereas this PR is the opposite way.

Anyway, getting something out from Rd and/or roxygen that you can put on readthedocs would be super nice.

@hadley

hadley commented Aug 29, 2016

@gaborcsardi I'm working on roxygen2 again, aiming for a release around September 16. I'm travelling quite a bit so I'm not sure I'll be tackling any big problems for this release, but I'm definitely available to give feedback and discuss options.

@gaborcsardi

OK. I'll rebase this, and also write the vignette.

Since this is opt-in, I think it is fairly safe to merge.

It would be still nice to turn it on for the whole package. :)

@gaborcsardi

@hadley A technical issue. I deleted my fork in the meantime, so I cannot change this PR. I'll close it down and open another one.
