-
Notifications
You must be signed in to change notification settings - Fork 10.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bandit: allow-list lxml usages #6265
Conversation
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## master #6265 +/- ##
=======================================
Coverage 88.90% 88.91%
=======================================
Files 161 161
Lines 11793 11796 +3
Branches 1914 1914
=======================================
+ Hits 10485 10488 +3
Misses 964 964
Partials 344 344
|
TextareaElement, | ||
) | ||
from parsel.selector import create_root_node | ||
from lxml.html import FormElement # nosec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: We need to change import style because is not possible to use # no sec otherwise? Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isort did that automatically, not sure why, I imagine it is a style choice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it! Thanks!
@@ -120,7 +115,7 @@ def _get_form( | |||
formxpath: Optional[str], | |||
) -> FormElement: | |||
"""Find the wanted form element within the given response.""" | |||
root = create_root_node(response.text, HTMLParser, base_url=get_base_url(response)) | |||
root = response.selector.root |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please explain this? Also does this work with all supported parsel versions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of the code later on relies on attributes from lxml.etree.Element
not exposed by Selector
(e.g. name
), so using the selector all the way was not possible.
Selector.root
was already available in the lowest parsel version Scrapy supports.
In most cases, the actual loading of the document is done by parsel, not Scrapy.
form.py
was 1 exception, but I have refactored it to use the response selector instead. The application ofget_base_url
inunified.py
was a bug fix detected in the process.Another exception is
iterparse_lxml
, which now usesresolve_entities=False
as parsel does.And then there was the sitemap code, which was already disabling entity resolution.