Add implementations for node_replace_with/node_insert_before/node_insert_after, minor tweaks #25

kkszysiu · 2020-06-28T12:54:18Z

I need such implementation in the project I'm working at. Might be helpful as well for others.

Also please review if the implementation is correct.

Thanks!

kkszysiu · 2020-06-29T05:22:05Z

@lexborisov, can you also take a look if it is a proper way to do something like this with Modest?
Thanks!

rushter · 2020-06-29T15:48:10Z

Looks good. Can you please update CHANGES.rst and add your recent changes to it?

…_insert_after, minor tweaks

kkszysiu · 2020-06-29T16:16:27Z

@rushter , done! I've rebased with master, created new version row and added new methods there.

rushter · 2020-06-29T16:39:29Z

@kkszysiu Does it make sense to keep separate methods for nodes?

I think we can check the type of the argument. We can also convert strings to nodes and remove the plain-text limitation.

So, instead of two methods, we could use one with an extra argument for backward compatibility:

def replace_with(self, value, bool parse=False):

If value is a string and parse is True, treats the input as HTML and parses it.

What do you think?

kkszysiu · 2020-06-29T17:49:49Z

@rushter that's a good question. I was thinking about it at first to just make them single methods but I was afraid I will degrade the performance. As far as I understand then, Cython would need to add another guard in such case as it wouldn't be able to assume we are passing string (I'm not an expert in Cython in any way, that was just my understanding).

I was trying to avoid performance degradation in this case, after all that's why I chosen selectolax over lxml at first place, so that's why I've implemented it as separate methods, but if you think it makes sense to keep them as single methods and the performance woudln't be affected much, then I can surely change it! :)

rushter · 2020-06-29T17:57:06Z

I think fused types can take care of type dispatching.

Usually, type checking and conversion is not the most expensive operation. The lxml library is slow because of the underlying libxml2 library.

kkszysiu · 2020-06-29T18:10:07Z

@rushter, yeah. That makes sense. Okay then, let me rewrite it as a single methods, it would be much cleaner.

kkszysiu · 2020-06-29T18:54:40Z

@rushter , one more thing. Do you think it would be better to add parse param or just allow people to pass string or Node?
Something like that:
def replace_with(self, value):
Where value could be str or Node and inside the method we will check what was passed.

Otherwise we will kind of limit/assume that user wants to have root.body.child.html, right?
But what if user actually want to insert some other node?

For example in my case I have bunch of dummy tags in my HTML tree and I'm replacing them with actual, real elements from another parsed tree, so such method makes much more sense to me, but not sure about real world usage.

rushter · 2020-06-29T19:05:06Z

@rushter , one more thing. Do you think it would be better to add parse param or just allow people to pass string or Node?
Something like that:
def replace_with(self, value):
Where value could be str or Node and inside the method we will check what was passed.

Otherwise we will kind of limit/assume that user wants to have root.body.child.html, right?
But what if user actually want to insert some other node?

For example in my case I have bunch of dummy tags in my HTML tree and I'm replacing them with actual, real elements from another parsed tree, so such method makes much more sense to me, but not sure about real world usage.

We should definitely allow passing Node objects. I was concerned about backward compatibility. What's why I proposed an extra parameter. But after your comment, I think it's safe to use def replace_with(self, value). Since old implementation allowed only plain-text, I think our new implementation won't break anything trying to convert it to HTML.

I think we just need to make sure that our new function accepts simple strings with no HTML tags (only text). Selectolax may not parse it correctly.

kkszysiu · 2020-06-29T19:21:21Z

@rushter , we can check isinstance, in case it is a string, then we will use old logic, in case it is Node, then we will use node replacing logic, something like this I assume would work:

    def replace_with(self, value):
        if isinstance(value, (str, bytes, unicode)):
            cdef myhtml_tree_node_t* text_node
            if isinstance(value, (str, unicode)):
                bytes_val = value.encode(_ENCODING)
            elif isinstance(value, bytes):
                bytes_val = value

            text_node = myhtml_node_create(self.parser.html_tree, MyHTML_TAG__TEXT, MyHTML_NAMESPACE_HTML)
            myhtml_node_text_set(text_node, <char*> bytes_val, len(bytes_val), MyENCODING_UTF_8)
            myhtml_node_insert_before(self.node, text_node)
            myhtml_node_delete(self.node)
        elif isinstance(value, Node):
            cdef myhtml_tree_node_t *node = value.node
            myhtml_node_insert_before(self.node, node)
            myhtml_node_delete(self.node)
        else:
            raise TypeError("Expected a string or Node instance, but %s found" % type(value).__name__)

Right?

rushter · 2020-06-29T19:25:02Z

@kkszysiu Yes, please document the difference between string and Node objects in the function description.

My idea was to allow passing HTML as a string, but your approach works for me too.

kkszysiu · 2020-06-29T19:30:06Z

Yes yes, will document everything and fix tests. Thank you!

ppwwyyxx · 2020-08-01T23:40:30Z

Any updates?

kkszysiu · 2020-08-02T02:49:12Z

Hey @rushter , @ppwwyyxx actually yes.
I've tried to implement it in a way @rushter described, but generally it requires to initialise parser inside node.pxi module, which is not a trivial task.

I've also tried to use my approach, where I will just recognise which data is passed to method, but cython is not supporting it.

Do you guys have any idea how we can implement it the other way? If not then maybe merging the current approach I've prepared would be enough for now?
I am already using this approach in my project without any issues.

Thanks!

rushter · 2020-08-02T09:40:37Z

I've also tried to use my approach, where I will just recognise which data is passed to method, but cython is not supporting it.

What was the problem? It definitely can accept multiple types.

kkszysiu · 2020-08-02T18:24:50Z

@rushter , maybe then I'm doing something wrong, here is example branch: https://github.com/rushter/selectolax/compare/master...kkszysiu:insert_after_before_replace_nodes2?expand=1

And then when I'm trying to compile it, I'm getting:
selectolax/node.pxi:597:24: Cannot convert Python object to 'myhtml_tree_node_t *'

When I directly pass Node as argument it works.

Any ideas?

rushter · 2020-08-02T18:38:02Z

@kkszysiu Most likely, you just need to make it obvious for the compiler that you are working with the right object.

node = <myhtml_tree_node_t*> value.node

kkszysiu · 2020-08-02T18:39:04Z

Oh, let me try @rushter !

… insert_after

rushter

Looks good. Please fix some minor issues and I will make a new release of the library.

rushter · 2020-08-02T21:46:55Z

selectolax/node.pxi

        Parameters
        ----------
-        value : str
-            The text to replace the Node with.
+        value : str|Node


Suggested change

value : str|Node

value : str, bytes or Node

https://numpydoc.readthedocs.io/en/latest/format.html#sections

rushter · 2020-08-02T21:56:44Z

selectolax/node.pxi

-        value : str
-            The text to replace the Node with.
+        value : str|Node
+            The text or Node instance to replace the Node with.


Suggested change

The text or Node instance to replace the Node with.

The text or Node instance to replace the Node with.

When a text string is passed, it's treated as text. All HTML tags will be escaped.

Convert and pass the ``Node`` object when you want to work with HTML.

It would also be good add a simple example of how to pass a node object.

Example of the docstring:

selectolax/selectolax/node.pxi

Line 471 in f4738f9

Examples

This is not an easy concept for new users.

Please update the docstrings for all the new functions.

rushter · 2020-08-02T21:58:59Z

selectolax/node.pxi

@@ -115,11 +115,17 @@ cdef class _Attributes:
        return "<%s attributes, %s items>" % (tag_name, len(self))


+ctypedef fused str_or_Node:
+    str
+    Node


Your type is missing the bytes object. Our function can handle bytes as well, but Cython won't allow the bytes object.
Please add bytes and unicode (for Python 2) here.

Thank yo @rushter !
Yes, you were right. Adding explicit type fixed the issue.
I've used basestring instead of str/unicode for backward compatibility with Py2 and correct support of Py3.

According to the Cython docs:

The basestring type represents both the types str and unicode, i.e. all Python text string types in Python 2 and Python 3.

So we should be good. Tests are passing.

rushter · 2020-08-03T09:00:52Z

Thank you for you contribution. I've published a new release.

ppwwyyxx · 2020-08-03T19:27:20Z

I noticed an issue regarding reference. The following code works:

from selectolax.parser import HTMLParser

html_parser = HTMLParser('<div>Get <span alt="Laptop"><img src="/jpg"> <div></div></span></div>')
img_node = html_parser.css_first('img')

html_parser2 = HTMLParser('<div>Test</div>')
img_node.replace_with(html_parser2.body.child)
assert html_parser.body.child.html == '<div>Get <span alt="Laptop"><div>Test</div> <div></div></span></div>'

but the following fails:

from selectolax.parser import HTMLParser

html_parser = HTMLParser('<div>Get <span alt="Laptop"><img src="/jpg"> <div></div></span></div>')
img_node = html_parser.css_first('img')

img_node.replace_with(HTMLParser('<div>Test</div>').body.child)
assert html_parser.body.child.html == '<div>Get <span alt="Laptop"><div>Test</div> <div></div></span></div>'

since this is python, I would expect the two to work the same and are memory safe

ppwwyyxx · 2020-08-03T19:41:43Z

It seems that a myhtml_tree_node_t cannot have shared ownership. So for a statement A.replace_with(B), it has to do one of the following:

Let B keep ownership -- cause strange failures like above
Let A steal the ownership -- needs to invalidate B in this call, and document the unintuitive behavior clearly
Let A copy the node -- more expensive

rushter · 2020-08-03T19:49:27Z

Ok, I've opened a new issue

kkszysiu · 2020-08-04T02:26:41Z

Oh, I haven't experienced this issue myself. I will try to came up with a solution.

kkszysiu changed the title ~~Add implementations for node_replace_with / node_insert_before / node_insert_after, minor tweaks~~ Add implementations for node_replace_with/node_insert_before/node_insert_after, minor tweaks Jun 28, 2020

Add implementations for node_replace_with / node_insert_before / node…

23724a3

…_insert_after, minor tweaks

kkszysiu force-pushed the insert_after_before_replace_nodes branch from 4ead4c6 to 23724a3 Compare June 29, 2020 16:10

Update CHANGES.rst

1e67147

rushter mentioned this pull request Jul 8, 2020

Node.Replace_with automatically encodes everything #26

Closed

kkszysiu added 4 commits August 3, 2020 03:47

Allow passing Node or str instance to replace_with, insert_before and…

edaf8db

… insert_after

Explicitely set type for node

660573f

Update CHANGES.rst

4c185e4

Fix tests

ec8d7ef

rushter requested changes Aug 2, 2020

View reviewed changes

Improved docstrings, add more examples, fixes after code review

d6e7208

rushter merged commit 71d1322 into rushter:master Aug 3, 2020

rushter mentioned this pull request Aug 3, 2020

New replace_with method does not work as expected #27

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add implementations for node_replace_with/node_insert_before/node_insert_after, minor tweaks #25

Add implementations for node_replace_with/node_insert_before/node_insert_after, minor tweaks #25

kkszysiu commented Jun 28, 2020 •

edited

kkszysiu commented Jun 29, 2020

rushter commented Jun 29, 2020

kkszysiu commented Jun 29, 2020

rushter commented Jun 29, 2020

kkszysiu commented Jun 29, 2020

rushter commented Jun 29, 2020

kkszysiu commented Jun 29, 2020

kkszysiu commented Jun 29, 2020

rushter commented Jun 29, 2020

kkszysiu commented Jun 29, 2020

rushter commented Jun 29, 2020

kkszysiu commented Jun 29, 2020

ppwwyyxx commented Aug 1, 2020

kkszysiu commented Aug 2, 2020

rushter commented Aug 2, 2020

kkszysiu commented Aug 2, 2020

rushter commented Aug 2, 2020

kkszysiu commented Aug 2, 2020

rushter left a comment

rushter Aug 2, 2020 •

edited

rushter Aug 2, 2020 •

edited

rushter Aug 2, 2020

kkszysiu Aug 2, 2020

rushter commented Aug 3, 2020

ppwwyyxx commented Aug 3, 2020

ppwwyyxx commented Aug 3, 2020

rushter commented Aug 3, 2020

kkszysiu commented Aug 4, 2020

Add implementations for node_replace_with/node_insert_before/node_insert_after, minor tweaks #25

Add implementations for node_replace_with/node_insert_before/node_insert_after, minor tweaks #25

Conversation

kkszysiu commented Jun 28, 2020 • edited

kkszysiu commented Jun 29, 2020

rushter commented Jun 29, 2020

kkszysiu commented Jun 29, 2020

rushter commented Jun 29, 2020

kkszysiu commented Jun 29, 2020

rushter commented Jun 29, 2020

kkszysiu commented Jun 29, 2020

kkszysiu commented Jun 29, 2020

rushter commented Jun 29, 2020

kkszysiu commented Jun 29, 2020

rushter commented Jun 29, 2020

kkszysiu commented Jun 29, 2020

ppwwyyxx commented Aug 1, 2020

kkszysiu commented Aug 2, 2020

rushter commented Aug 2, 2020

kkszysiu commented Aug 2, 2020

rushter commented Aug 2, 2020

kkszysiu commented Aug 2, 2020

rushter left a comment

Choose a reason for hiding this comment

rushter Aug 2, 2020 • edited

Choose a reason for hiding this comment

rushter Aug 2, 2020 • edited

Choose a reason for hiding this comment

rushter Aug 2, 2020

Choose a reason for hiding this comment

kkszysiu Aug 2, 2020

Choose a reason for hiding this comment

rushter commented Aug 3, 2020

ppwwyyxx commented Aug 3, 2020

ppwwyyxx commented Aug 3, 2020

rushter commented Aug 3, 2020

kkszysiu commented Aug 4, 2020

kkszysiu commented Jun 28, 2020 •

edited

rushter Aug 2, 2020 •

edited

rushter Aug 2, 2020 •

edited