-
Notifications
You must be signed in to change notification settings - Fork 62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add implementations for node_replace_with/node_insert_before/node_insert_after, minor tweaks #25
Add implementations for node_replace_with/node_insert_before/node_insert_after, minor tweaks #25
Conversation
@lexborisov, can you also take a look if it is a proper way to do something like this with Modest? |
Looks good. Can you please update |
…_insert_after, minor tweaks
4ead4c6
to
23724a3
Compare
@rushter , done! I've rebased with master, created new version row and added new methods there. |
@kkszysiu Does it make sense to keep separate methods for nodes? I think we can check the type of the argument. We can also convert strings to nodes and remove the plain-text limitation. So, instead of two methods, we could use one with an extra argument for backward compatibility:
What do you think? |
@rushter that's a good question. I was thinking about it at first to just make them single methods but I was afraid I will degrade the performance. As far as I understand then, Cython would need to add another guard in such case as it wouldn't be able to assume we are passing string (I'm not an expert in Cython in any way, that was just my understanding). I was trying to avoid performance degradation in this case, after all that's why I chosen selectolax over lxml at first place, so that's why I've implemented it as separate methods, but if you think it makes sense to keep them as single methods and the performance woudln't be affected much, then I can surely change it! :) |
I think fused types can take care of type dispatching. Usually, type checking and conversion is not the most expensive operation. The |
@rushter, yeah. That makes sense. Okay then, let me rewrite it as a single methods, it would be much cleaner. |
@rushter , one more thing. Do you think it would be better to add parse param or just allow people to pass string or Node? Otherwise we will kind of limit/assume that user wants to have root.body.child.html, right? For example in my case I have bunch of dummy tags in my HTML tree and I'm replacing them with actual, real elements from another parsed tree, so such method makes much more sense to me, but not sure about real world usage. |
We should definitely allow passing I think we just need to make sure that our new function accepts simple strings with no HTML tags (only text). Selectolax may not parse it correctly. |
@rushter , we can check isinstance, in case it is a string, then we will use old logic, in case it is Node, then we will use node replacing logic, something like this I assume would work: def replace_with(self, value):
if isinstance(value, (str, bytes, unicode)):
cdef myhtml_tree_node_t* text_node
if isinstance(value, (str, unicode)):
bytes_val = value.encode(_ENCODING)
elif isinstance(value, bytes):
bytes_val = value
text_node = myhtml_node_create(self.parser.html_tree, MyHTML_TAG__TEXT, MyHTML_NAMESPACE_HTML)
myhtml_node_text_set(text_node, <char*> bytes_val, len(bytes_val), MyENCODING_UTF_8)
myhtml_node_insert_before(self.node, text_node)
myhtml_node_delete(self.node)
elif isinstance(value, Node):
cdef myhtml_tree_node_t *node = value.node
myhtml_node_insert_before(self.node, node)
myhtml_node_delete(self.node)
else:
raise TypeError("Expected a string or Node instance, but %s found" % type(value).__name__) Right? |
@kkszysiu Yes, please document the difference between string and Node objects in the function description. My idea was to allow passing HTML as a string, but your approach works for me too. |
Yes yes, will document everything and fix tests. Thank you! |
Any updates? |
Hey @rushter , @ppwwyyxx actually yes. I've also tried to use my approach, where I will just recognise which data is passed to method, but cython is not supporting it. Do you guys have any idea how we can implement it the other way? If not then maybe merging the current approach I've prepared would be enough for now? Thanks! |
What was the problem? It definitely can accept multiple types. |
@rushter , maybe then I'm doing something wrong, here is example branch: https://github.com/rushter/selectolax/compare/master...kkszysiu:insert_after_before_replace_nodes2?expand=1 And then when I'm trying to compile it, I'm getting: When I directly pass Node as argument it works. Any ideas? |
@kkszysiu Most likely, you just need to make it obvious for the compiler that you are working with the right object.
|
Oh, let me try @rushter ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Please fix some minor issues and I will make a new release of the library.
selectolax/node.pxi
Outdated
Parameters | ||
---------- | ||
value : str | ||
The text to replace the Node with. | ||
value : str|Node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
value : str|Node | |
value : str, bytes or Node |
https://numpydoc.readthedocs.io/en/latest/format.html#sections
value : str | ||
The text to replace the Node with. | ||
value : str|Node | ||
The text or Node instance to replace the Node with. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The text or Node instance to replace the Node with. | |
The text or Node instance to replace the Node with. | |
When a text string is passed, it's treated as text. All HTML tags will be escaped. | |
Convert and pass the ``Node`` object when you want to work with HTML. |
It would also be good add a simple example of how to pass a node object.
Example of the docstring:
selectolax/selectolax/node.pxi
Line 471 in f4738f9
Examples |
This is not an easy concept for new users.
Please update the docstrings for all the new functions.
@@ -115,11 +115,17 @@ cdef class _Attributes: | |||
return "<%s attributes, %s items>" % (tag_name, len(self)) | |||
|
|||
|
|||
ctypedef fused str_or_Node: | |||
str | |||
Node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your type is missing the bytes
object. Our function can handle bytes as well, but Cython won't allow the bytes
object.
Please add bytes
and unicode
(for Python 2) here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank yo @rushter !
Yes, you were right. Adding explicit type fixed the issue.
I've used basestring instead of str/unicode for backward compatibility with Py2 and correct support of Py3.
According to the Cython docs:
The basestring type represents both the types str and unicode, i.e. all Python text string types in Python 2 and Python 3.
So we should be good. Tests are passing.
Thank you for you contribution. I've published a new release. |
I noticed an issue regarding reference. The following code works: from selectolax.parser import HTMLParser
html_parser = HTMLParser('<div>Get <span alt="Laptop"><img src="/jpg"> <div></div></span></div>')
img_node = html_parser.css_first('img')
html_parser2 = HTMLParser('<div>Test</div>')
img_node.replace_with(html_parser2.body.child)
assert html_parser.body.child.html == '<div>Get <span alt="Laptop"><div>Test</div> <div></div></span></div>' but the following fails: from selectolax.parser import HTMLParser
html_parser = HTMLParser('<div>Get <span alt="Laptop"><img src="/jpg"> <div></div></span></div>')
img_node = html_parser.css_first('img')
img_node.replace_with(HTMLParser('<div>Test</div>').body.child)
assert html_parser.body.child.html == '<div>Get <span alt="Laptop"><div>Test</div> <div></div></span></div>' since this is python, I would expect the two to work the same and are memory safe |
It seems that a
|
Ok, I've opened a new issue |
Oh, I haven't experienced this issue myself. I will try to came up with a solution. |
I need such implementation in the project I'm working at. Might be helpful as well for others.
Also please review if the implementation is correct.
Thanks!