Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add implementations for node_replace_with/node_insert_before/node_insert_after, minor tweaks #25

Merged
merged 7 commits into from
Aug 3, 2020

Conversation

kkszysiu
Copy link
Contributor

@kkszysiu kkszysiu commented Jun 28, 2020

I need such implementation in the project I'm working at. Might be helpful as well for others.

Also please review if the implementation is correct.

Thanks!

@kkszysiu kkszysiu changed the title Add implementations for node_replace_with / node_insert_before / node_insert_after, minor tweaks Add implementations for node_replace_with/node_insert_before/node_insert_after, minor tweaks Jun 28, 2020
@kkszysiu
Copy link
Contributor Author

@lexborisov, can you also take a look if it is a proper way to do something like this with Modest?
Thanks!

@rushter
Copy link
Owner

rushter commented Jun 29, 2020

Looks good. Can you please update CHANGES.rst and add your recent changes to it?

@kkszysiu kkszysiu force-pushed the insert_after_before_replace_nodes branch from 4ead4c6 to 23724a3 Compare June 29, 2020 16:10
@kkszysiu
Copy link
Contributor Author

@rushter , done! I've rebased with master, created new version row and added new methods there.

@rushter
Copy link
Owner

rushter commented Jun 29, 2020

@kkszysiu Does it make sense to keep separate methods for nodes?

I think we can check the type of the argument. We can also convert strings to nodes and remove the plain-text limitation.

So, instead of two methods, we could use one with an extra argument for backward compatibility:

def replace_with(self, value, bool parse=False):

If value is a string and parse is True, treats the input as HTML and parses it.

What do you think?

@kkszysiu
Copy link
Contributor Author

@rushter that's a good question. I was thinking about it at first to just make them single methods but I was afraid I will degrade the performance. As far as I understand then, Cython would need to add another guard in such case as it wouldn't be able to assume we are passing string (I'm not an expert in Cython in any way, that was just my understanding).

I was trying to avoid performance degradation in this case, after all that's why I chosen selectolax over lxml at first place, so that's why I've implemented it as separate methods, but if you think it makes sense to keep them as single methods and the performance woudln't be affected much, then I can surely change it! :)

@rushter
Copy link
Owner

rushter commented Jun 29, 2020

I think fused types can take care of type dispatching.

Usually, type checking and conversion is not the most expensive operation. The lxml library is slow because of the underlying libxml2 library.

@kkszysiu
Copy link
Contributor Author

@rushter, yeah. That makes sense. Okay then, let me rewrite it as a single methods, it would be much cleaner.

@kkszysiu
Copy link
Contributor Author

@rushter , one more thing. Do you think it would be better to add parse param or just allow people to pass string or Node?
Something like that:
def replace_with(self, value):
Where value could be str or Node and inside the method we will check what was passed.

Otherwise we will kind of limit/assume that user wants to have root.body.child.html, right?
But what if user actually want to insert some other node?

For example in my case I have bunch of dummy tags in my HTML tree and I'm replacing them with actual, real elements from another parsed tree, so such method makes much more sense to me, but not sure about real world usage.

@rushter
Copy link
Owner

rushter commented Jun 29, 2020

@rushter , one more thing. Do you think it would be better to add parse param or just allow people to pass string or Node?
Something like that:
def replace_with(self, value):
Where value could be str or Node and inside the method we will check what was passed.

Otherwise we will kind of limit/assume that user wants to have root.body.child.html, right?
But what if user actually want to insert some other node?

For example in my case I have bunch of dummy tags in my HTML tree and I'm replacing them with actual, real elements from another parsed tree, so such method makes much more sense to me, but not sure about real world usage.

We should definitely allow passing Node objects. I was concerned about backward compatibility. What's why I proposed an extra parameter. But after your comment, I think it's safe to use def replace_with(self, value). Since old implementation allowed only plain-text, I think our new implementation won't break anything trying to convert it to HTML.

I think we just need to make sure that our new function accepts simple strings with no HTML tags (only text). Selectolax may not parse it correctly.

@kkszysiu
Copy link
Contributor Author

@rushter , we can check isinstance, in case it is a string, then we will use old logic, in case it is Node, then we will use node replacing logic, something like this I assume would work:

    def replace_with(self, value):
        if isinstance(value, (str, bytes, unicode)):
            cdef myhtml_tree_node_t* text_node
            if isinstance(value, (str, unicode)):
                bytes_val = value.encode(_ENCODING)
            elif isinstance(value, bytes):
                bytes_val = value

            text_node = myhtml_node_create(self.parser.html_tree, MyHTML_TAG__TEXT, MyHTML_NAMESPACE_HTML)
            myhtml_node_text_set(text_node, <char*> bytes_val, len(bytes_val), MyENCODING_UTF_8)
            myhtml_node_insert_before(self.node, text_node)
            myhtml_node_delete(self.node)
        elif isinstance(value, Node):
            cdef myhtml_tree_node_t *node = value.node
            myhtml_node_insert_before(self.node, node)
            myhtml_node_delete(self.node)
        else:
            raise TypeError("Expected a string or Node instance, but %s found" % type(value).__name__)

Right?

@rushter
Copy link
Owner

rushter commented Jun 29, 2020

@kkszysiu Yes, please document the difference between string and Node objects in the function description.

My idea was to allow passing HTML as a string, but your approach works for me too.

@kkszysiu
Copy link
Contributor Author

Yes yes, will document everything and fix tests. Thank you!

@ppwwyyxx
Copy link

ppwwyyxx commented Aug 1, 2020

Any updates?

@kkszysiu
Copy link
Contributor Author

kkszysiu commented Aug 2, 2020

Hey @rushter , @ppwwyyxx actually yes.
I've tried to implement it in a way @rushter described, but generally it requires to initialise parser inside node.pxi module, which is not a trivial task.

I've also tried to use my approach, where I will just recognise which data is passed to method, but cython is not supporting it.

Do you guys have any idea how we can implement it the other way? If not then maybe merging the current approach I've prepared would be enough for now?
I am already using this approach in my project without any issues.

Thanks!

@rushter
Copy link
Owner

rushter commented Aug 2, 2020

I've also tried to use my approach, where I will just recognise which data is passed to method, but cython is not supporting it.

What was the problem? It definitely can accept multiple types.

@kkszysiu
Copy link
Contributor Author

kkszysiu commented Aug 2, 2020

@rushter , maybe then I'm doing something wrong, here is example branch: https://github.com/rushter/selectolax/compare/master...kkszysiu:insert_after_before_replace_nodes2?expand=1

And then when I'm trying to compile it, I'm getting:
selectolax/node.pxi:597:24: Cannot convert Python object to 'myhtml_tree_node_t *'

When I directly pass Node as argument it works.

Any ideas?

@rushter
Copy link
Owner

rushter commented Aug 2, 2020

@kkszysiu Most likely, you just need to make it obvious for the compiler that you are working with the right object.

node = <myhtml_tree_node_t*> value.node

@kkszysiu
Copy link
Contributor Author

kkszysiu commented Aug 2, 2020

Oh, let me try @rushter !

Copy link
Owner

@rushter rushter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Please fix some minor issues and I will make a new release of the library.

Parameters
----------
value : str
The text to replace the Node with.
value : str|Node
Copy link
Owner

@rushter rushter Aug 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
value : str|Node
value : str, bytes or Node

https://numpydoc.readthedocs.io/en/latest/format.html#sections

value : str
The text to replace the Node with.
value : str|Node
The text or Node instance to replace the Node with.
Copy link
Owner

@rushter rushter Aug 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The text or Node instance to replace the Node with.
The text or Node instance to replace the Node with.
When a text string is passed, it's treated as text. All HTML tags will be escaped.
Convert and pass the ``Node`` object when you want to work with HTML.

It would also be good add a simple example of how to pass a node object.

Example of the docstring:

This is not an easy concept for new users.

Please update the docstrings for all the new functions.

@@ -115,11 +115,17 @@ cdef class _Attributes:
return "<%s attributes, %s items>" % (tag_name, len(self))


ctypedef fused str_or_Node:
str
Node
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your type is missing the bytes object. Our function can handle bytes as well, but Cython won't allow the bytes object.
Please add bytes and unicode (for Python 2) here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank yo @rushter !
Yes, you were right. Adding explicit type fixed the issue.
I've used basestring instead of str/unicode for backward compatibility with Py2 and correct support of Py3.

According to the Cython docs:

The basestring type represents both the types str and unicode, i.e. all Python text string types in Python 2 and Python 3.

So we should be good. Tests are passing.

@rushter rushter merged commit 71d1322 into rushter:master Aug 3, 2020
@rushter
Copy link
Owner

rushter commented Aug 3, 2020

Thank you for you contribution. I've published a new release.

@ppwwyyxx
Copy link

ppwwyyxx commented Aug 3, 2020

I noticed an issue regarding reference. The following code works:

from selectolax.parser import HTMLParser

html_parser = HTMLParser('<div>Get <span alt="Laptop"><img src="/jpg"> <div></div></span></div>')
img_node = html_parser.css_first('img')

html_parser2 = HTMLParser('<div>Test</div>')
img_node.replace_with(html_parser2.body.child)
assert html_parser.body.child.html == '<div>Get <span alt="Laptop"><div>Test</div> <div></div></span></div>'

but the following fails:

from selectolax.parser import HTMLParser

html_parser = HTMLParser('<div>Get <span alt="Laptop"><img src="/jpg"> <div></div></span></div>')
img_node = html_parser.css_first('img')

img_node.replace_with(HTMLParser('<div>Test</div>').body.child)
assert html_parser.body.child.html == '<div>Get <span alt="Laptop"><div>Test</div> <div></div></span></div>'

since this is python, I would expect the two to work the same and are memory safe

@ppwwyyxx
Copy link

ppwwyyxx commented Aug 3, 2020

It seems that a myhtml_tree_node_t cannot have shared ownership. So for a statement A.replace_with(B), it has to do one of the following:

  • Let B keep ownership -- cause strange failures like above
  • Let A steal the ownership -- needs to invalidate B in this call, and document the unintuitive behavior clearly
  • Let A copy the node -- more expensive

@rushter
Copy link
Owner

rushter commented Aug 3, 2020

Ok, I've opened a new issue

@kkszysiu
Copy link
Contributor Author

kkszysiu commented Aug 4, 2020

Oh, I haven't experienced this issue myself. I will try to came up with a solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants