Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New sibling nodes added by a transformer during traversal are never traversed #91

Closed
rgrove opened this issue Dec 6, 2013 · 9 comments
Labels

Comments

@rgrove
Copy link
Owner

rgrove commented Dec 6, 2013

Nokogiri doesn't update previously-created NodeSets with new siblings. See #90 for background.

@JasonBarnabe
Copy link

Did you ever file this against Nokogiri?

@rgrove
Copy link
Owner Author

rgrove commented Feb 10, 2014

No, I plan to work around this in Sanitize, but every time I have a few minutes and sit down to think about this I get all depressed about how much time I spend dealing with Nokogiri issues, and then I start fantasizing about switching to a better, HTML5-compliant parser, and then I check on the status of Gumbo and see that it's stalled, and then I feel tempted to start writing a custom parser, and then I realize what a terrible idea that is given how little time I already have to work on Sanitize, and then I'm out of free time and the bug hasn't gotten fixed.

Sorry. I'll try to get to it soon.

@rgrove
Copy link
Owner Author

rgrove commented Feb 11, 2014

I thought I had a straightforward solution for this:

  1. Keep track of each child we visit while traversing a node's children.
  2. After traversing a node's children, refresh the NodeSet of children. Traverse any children in the refreshed set that we haven't already visited.
  3. Repeat until all the node's children have been traversed.

While this guarantees that we visit any new nodes that were added during traversal, it has a couple of problems.

First, it's easy to get into an infinite loop: just write a transformer that adds a sibling to every node it encounters, and we'll keep traversing and adding and traversing and adding forever. But writing a transformer like this would be dumb, so I'm not too worried about this case.

The bigger problem is that when Sanitize cleans up a non-whitelisted node, it (by default) replaces the original node with its children. This means that during traversal we can potentially end up re-traversing nodes we've already visited many times simply because they keep getting re-parented shallower in the tree.

This is going to be trickier than I thought.

@JasonBarnabe
Copy link

Would it help if you restricted the solution to only traverse new siblings that are added after the current node? That's all I really need for what I describe in #90 - a situation where one node is split into many by a transformer - and seems reasonable to me as long as the behaviour is documented.

I'm not familiar with the internals of your or Nokogiri's code, but it sounds like all you would have to do is ensure that the next node to be processed actually is the previous node's next sibling.

@rgrove
Copy link
Owner Author

rgrove commented Feb 11, 2014

That was actually the first solution I tried, but it turns out not to be reliable since the current node may be deleted or moved during traversal. If it was deleted and we try to get the next sibling, then there is no next sibling. If it was moved, then the next sibling may be under a completely different parent so it's not what we want.

I played around with verifying that the current node's parent is still the same parent it was originally and, if not, getting the next sibling of the previous sibling that we traversed, but this falls apart with nodes that are children of document fragments, since their parent references don't actually point to the fragment (document fragments are weird).

Without the ability to trust the current node as a stable reference point, I'm not sure there's any reliable way to make forward-only traversal work.

@JasonBarnabe
Copy link

Without the ability to trust the current node as a stable reference point, I'm not sure there's any reliable way to make forward-only traversal work.

Well again, speaking selfishly about what I'm trying to do, can't you just make the assumption and document it? "New nodes created by a transformer will only be traversed if they are added after the original node and the original node retains its position in the document". That doesn't seem like a particularly onerous restriction, and I'm having trouble thinking up a case where that wouldn't be enough.

As another alternative, what if you allowed the transformer to indicate by way of the return value which additional nodes it wants traversed (i.e. the nodes it just created)?

@rgrove
Copy link
Owner Author

rgrove commented Feb 11, 2014

Well again, speaking selfishly about what I'm trying to do, can't you just make the assumption and document it? "New nodes created by a transformer will only be traversed if they are added after the original node and the original node retains its position in the document".

Yes, I could do that to address your particular use case (although with some degree of uncertainty when it comes to children of document fragments). But I'm not comfortable with this approach. Since people use Sanitize for security purposes, I think it needs to be consistent and reliable. This restriction could create a situation that silently results in behavior that the author of a transformer doesn't expect and might not notice.

Seems like we'd end up with the same problem we're trying to fix (silently failing to traverse certain nodes), but with less consistency about when it happens. At least right now you can be certain that any siblings you add during traversal will not be traversed.

As another alternative, what if you allowed the transformer to indicate by way of the return value which additional nodes it wants traversed (i.e. the nodes it just created)?

This would essentially be the same as calling Sanitize.clean_node! on the additional nodes you want to traverse, which is the workaround you're already using, right? In any case, yeah, I may end up implementing some kind of configuration-based solution if I don't have any other brilliant ideas.

rgrove added a commit that referenced this issue May 19, 2014
@rgrove rgrove closed this as completed in 6208727 Jun 21, 2014
@JasonBarnabe
Copy link

This seems to work fine in 3.0.0. Thanks!

@rgrove
Copy link
Owner Author

rgrove commented Jun 26, 2014

Glad to hear it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants