Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for task list checkboxes outside p #81

Merged

Conversation

Mr0grog
Copy link
Contributor

@Mr0grog Mr0grog commented Sep 4, 2023

Initial checklist

  • I read the support docs
  • I read the contributing guide
  • I agree to follow the code of conduct
  • I searched issues and couldn’t find anything (or linked relevant results below)
  • If applicable, I’ve added docs and tests

Description of changes

Previously, list items with leading checkboxes would only be treated as task list items if the checkbox was wrapped in a <p> element, like:

<li><p><input type="checkbox"> Other content</p></li>

This adds support for detecting task lists even when the checkbox is not nested in a <p>:

<li><input type="checkbox"> Other content</li>

This does not add support for some other complex scenarios, such as the checkbox being inside other phrasing content, being more deeply nested, or having comments in front of it, since the implementation gets much more complex:

<li><strong><input type="checkbox"> Other content</strong></li>
<li><p><strong><input type="checkbox"> Other content</strong></p></li>
<li><!-- Comment --><input type="checkbox"> Other content</li>

See discussion on #80 for more about that. There is a prototype of support for deeper nesting, but not comments, at Mr0grog@9a386bc, which I am happy to add here (or clean up some) if you like it.

Fixes #80.

Side Questions

  1. I thought about writing some kind of warning messages for situations that are probably malformed, e.g. there is a leading checkbox, but followed by non-phrasing content (so you’d wind up with a Mdast tree that is in some sense invalid, or at least doesn’t stringify correctly). I was thinking of the error messages in remark-parse’s emitErrors option (obviously not exactly the same since this is not a Unified plugin). Anyway, I didn’t see tooling for anything like this, and wanted to make sure I wasn’t missing it.

  2. I couldn’t find an easy/obvious way to narrow down which tests were running, which would have been helpful while debugging some issues (instead, I wound up hacking around in the test/index.js file). Is there a way to do this that I’m missing? Otherwise it seems like the Node.js test runner has support, but the way tests are run and written here doesn’t work with it:

    • Need to run tests with node --test, not node test/index.js (i.e. don’t run an actual test file), then you can use --test-name-pattern to filter tests.
    • Need to define test suites with describe() and it() instead of test() with subtests. I think the issue with the latter is that it can’t know in advance whether you are defining subtests, so it can’t filter based on them.

    ^^ If those are useful changes, I’m happy to do a separate, quick PR for them.

Previously, list items with leading checkboxes would only be treated as task list items if the checkbox was wrapped in a `<p>` element, like:

    <li><p><input type="checkbox"> Other content</p></li>

This adds support for detecting task lists even when the checkbox is not nested in a `<p>`:

    <li><input type="checkbox"> Other content</li>

This does *not* add support for some other complex scenarios, such as the checkbox being inside other phrasing content, being more deeply nested, or having comments in front of it:

    <li><strong><input type="checkbox"> Other content</strong></li>
    <li><p><strong><input type="checkbox"> Other content</strong></p></li>
    <li><!-- Comment --><input type="checkbox"> Other content</li>

Fixes syntax-tree#80.
@github-actions github-actions bot added 👋 phase/new Post is being triaged automatically 🤞 phase/open Post is being triaged manually and removed 👋 phase/new Post is being triaged automatically labels Sep 4, 2023
Comment on lines 34 to 41
if (
checkbox &&
checkbox.type === 'element' &&
checkbox.tagName === 'input' &&
checkbox.properties &&
(checkbox.properties.type === 'checkbox' ||
checkbox.properties.type === 'radio')
) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since Node 16+ is required, this could probably be stremlined a bit with optional chaining (e.g. checkbox?.properties?.type === 'checkbox'). I didn't see that in use anywhere else, though, so I left the full check here.

Copy link
Member

@wooorm wooorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for such a good PR!

lib/handlers/li.js Outdated Show resolved Hide resolved
Comment on lines 44 to 52
clone = {...node, children: [...node.children]}
if (checkboxParent === node) {
clone.children.splice(0, 1)
} else {
clone.children.splice(0, 1, {
...checkboxParent,
children: checkboxParent.children.slice(1)
})
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a) some comments are useful
b) operating on the clone reads a bit complex, to me. Maybe that’s just me. What do you think about this?

Suggested change
clone = {...node, children: [...node.children]}
if (checkboxParent === node) {
clone.children.splice(0, 1)
} else {
clone.children.splice(0, 1, {
...checkboxParent,
children: checkboxParent.children.slice(1)
})
}
// The checkbox is the head of `li`:
if (checkboxParent === node) {
clone = {...node, children: node.children.slice(1)}
}
// The checkbox is in a paragraph (which is `checkboxParent`):
else {
clone = {
...node,
children: [
{...checkboxParent, children: checkboxParent.children.slice(1)},
...checkboxParent.children.slice(1)
]
}
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! Yes, I definitely should have added some comments here. I tried both of these styles and liked on the one I committed slightly more, but I don’t have strong feelings. Will change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: will wait for feedback on the below item (about the checkbox with no text after it) before making the changes here, since there might be more to adjust depending on what the right output should be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I gather the main concern here is readability, I wound up extracting the code for finding and removing the checkbox into a separate function. I also wound up rewriting this recursively, which I think makes the logic much clearer to read (if you don’t think so, I can just roll back to f0acebf, which is still a separate function but has your recommended implementation and some more comments and description).

<li><input type="checkbox"> India</li>
<li>
<input type="checkbox"> Juliet
</li>
</ol>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a test without anything after it! Particularly without whitespace, like the current new India one

Copy link
Contributor Author

@Mr0grog Mr0grog Sep 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, that’s an interesting edge case I didn’t think about! What would you expect for output here? The current code creates an empty task list item, e.g:

{
  type: "listItem",
  checked: false,
  children: []
}

…which renders like a normal list item, without even a checkbox (adding the Juliet item before it just so it’s clear):

* [ ] Juliet
*

(The checkbox disappears because there’s no paragraph node at the start of the children array, and that’s required to render it in mdast-util-task-list-gfm-item.)

Is that OK? This is kind of a weird situation (maybe like the things I was thinking there could be warnings about). I’m not sure thare’s any really reasonable result here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the wait. Yes, I think that’s fine. I think no checkbox is rendered because a paragraph is required by GFM. And those have to be filled with something? As far as I remember?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, great, will leave this case functioning as it currently does and get the rest of this done this week. 👍

I think no checkbox is rendered because a paragraph is required by GFM. And those have to be filled with something? As far as I remember?

Yeah, that’s what’s happening. FWIW, what I was trying to get at was whether you’d want to output a different tree in this case because the above current behavior doesn’t render/outputs an invalid mdast tree. For example, putting in a non-breaking space:

* [ ] Juliet
* [ ]  

Or a comment:

* [ ] Juliet
* [ ] <!-- No text -->

Or just backing out and treating it like a checkbox in the middle of text gets treated, rather than as a task list item:

* [ ] Juliet
* \[ ]

The first two will successfully render as a checklist item when converting back to HTML; the last one does not, but matches behavior elsewhere in this module. There are lots of creative options along these lines.

BUT all of these are doing some kind of funny behavior to preserve the intent of the input. I definitely understand there are reasons you might consider the current garbage-in-garbage-out behavior better or more straightforward.

@wooorm
Copy link
Member

wooorm commented Sep 5, 2023

As for tests: if I am personally investigating something small, I copy/paste the example from the readme in an example.js, and debug there.

Typically tests are smaller, less fixture bases than this project. Then adding some console.logs somewhere in there is what I do.

For these big fixture projects, you can also pass UPDATE to regenerate fixtures: UPDATE=1 node test/index.js regenerates. Then you can use git to diff the changes that are happening!

@wooorm
Copy link
Member

wooorm commented Sep 5, 2023

malformed

The input here can be arbitrary HTML. It could be any unsafe page on the web. Perhaps we could start thinking about warnings where we can’t make a good choice. But I imagine that there are too many, and human interaction is needed if that goal, scraping the web and turning it into markdown, can’t be reached.

What’s the reason you are using this project?

Co-authored-by: Titus <tituswormer@gmail.com>
Signed-off-by: Rob Brackett <rob@robbrackett.com>
@Mr0grog
Copy link
Contributor Author

Mr0grog commented Sep 6, 2023

Typically tests are smaller, less fixture bases than this project. Then adding some console.logs somewhere in there is what I do.

This also applies to the smaller, non-fixture tests here! Personally, I sometimes use a debugger with tests and sometimes use console.log, but in both cases, being able to narrow down which tests run is super helpful (e.g. console.log in the library code will still spit out a lot of extra noise if more than one test leverages the code you are investigating, which is usually the case). Anyway, not a big deal. I just figured it might be a nice and simple change, since you get a lot of nice CLI support just by swapping out what function gets called to declare a test:

- test('core', async function (t) {
+ describe('core', async function () {
-   await t.test('should expose the public api', async function () {
+   it('should expose the public api', async function () {

For these big fixture projects, you can also pass UPDATE to regenerate fixtures:

I don’t think that would have helped me a whole lot here (I wrote the tests before the code), but good to know!

malformed

The input here can be arbitrary HTML. It could be any unsafe page on the web. …I imagine that there are too many [possible cases]

For sure! I guess I was thinking about warnings in very specific cases where it’s clear what might be intended, even though you won’t get it because of Markdown’s rules (or at least Mdast’s rules; “Markdown” is pretty flavor-based when you get to the details), or because the input produces an invalid/malformed/whatever output tree.

For example, it’s easy to detect a situation like the following because the Mdast tree we convert to cannot be correctly rendered (it produces a task list item node that does not start with a paragraph, and we expect this not to work per #80 (comment)). We don’t have to look at the input; just the output immediately before returning from the li handler:

<!-- Renders like a perfectly normal checklist item in HTML, but the checkbox will just disappear in Markdown.
     Where did it go? Why? I imagine it would be pretty opaque from an end-user perspective. -->
<li><input type="checkbox"> <!-- comment -->Kilo</li>

What’s the reason you are using this project?

For this, my main usage is https://github.com/Mr0grog/google-docs-to-markdown. (I’ve used other parts of Mdast for other things, but not the HTML → MD stuff here.)

Copy link
Member

@wooorm wooorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your patience! I think there’s just this one point?

<li><input type="checkbox"> India</li>
<li>
<input type="checkbox"> Juliet
</li>
</ol>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the wait. Yes, I think that’s fine. I think no checkbox is rendered because a paragraph is required by GFM. And those have to be filled with something? As far as I remember?

This adds some more explanatory comments and moves the checkbox-related code into its own function, so the complex stuff is better encapsulated.
The main goal here is to make the logic here easier to read and follow. It also makes it easier to update to handle more complex scenarios if desired in the future. This does make the code slightly more vulnerable to getting slowed down by weird trees (that aren't valid HTML) with `<p>` elements deeply nested inside each other.
@Mr0grog
Copy link
Contributor Author

Mr0grog commented Sep 27, 2023

@wooorm I think this should be good to go now! I added a list item with just a checkbox to both the ol and ul fixtures, and also did some work on making the code a bit more readable and documented.

I wound up extracting the code for finding and removing the checkbox into a separate function, since I think isolating it and giving it a name helps make things easier to follow. I also rewrote it in a recursive form that I think makes the logic much clearer to read (if you don’t think so, I can just roll back to f0acebf, which is still a separate function but has your recommended implementation and some more comments and description).

Benchmarking on my machine, it doesn't seem to make a notable performance difference, although it does technically make the code more vulnerable to slowdowns from weird (not valid in HTML, and will never be returned by hast-util-from-html unless there's a bug) trees like:

<li>
  <p>
    <p>
      <p>
         <!-- and so on many levels deep -->
         <input type="checkbox"> Text
      </p>
    </p>
  </p>
</li>

Please let me know if that's ok, whether there's anything else I need to do here, and if you need me to squash the commits.

@wooorm wooorm merged commit 183cbf9 into syntax-tree:main Sep 30, 2023
2 checks passed
@wooorm wooorm changed the title Support task list checkboxes outside p elements Add support for task list checkboxes outside p Sep 30, 2023
@wooorm wooorm added the 💪 phase/solved Post is done label Sep 30, 2023
@github-actions

This comment has been minimized.

@github-actions github-actions bot removed the 🤞 phase/open Post is being triaged manually label Sep 30, 2023
@wooorm
Copy link
Member

wooorm commented Sep 30, 2023

Thanks so much! 10.1.0!

@Mr0grog Mr0grog deleted the 80-some-checklists-dont-use-paragraphs branch October 4, 2023 16:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💪 phase/solved Post is done
Development

Successfully merging this pull request may close these issues.

Tasklists don't work if checkbox is a direct child of a list item
2 participants