
Bandit: allow-list lxml usages #6265

Merged: 1 commit merged into scrapy:master on Mar 1, 2024
Conversation

Gallaecio (Member)

In most cases, the actual loading of the document is done by parsel, not Scrapy.

form.py was one exception, but I have refactored it to use the response selector instead. The application of get_base_url in unified.py is a bug fix detected in the process.

Another exception is iterparse_lxml, which now uses resolve_entities=False as parsel does.

And then there was the sitemap code, which was already disabling entity resolution.
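As an illustration (not code from this PR), the effect of disabling entity resolution, which is what parsel and the sitemap code do, can be sketched with a hypothetical document:

```python
from lxml import etree  # nosec -- illustrative; this import is exactly what Bandit flags

# A document defining an internal entity. With lxml's default parser
# (resolve_entities=True), &e; is expanded into its replacement text;
# with resolve_entities=False, it is left unresolved in the tree.
xml = b'<!DOCTYPE r [<!ENTITY e "expanded">]><r>&e;</r>'

default_root = etree.fromstring(xml)
safe_root = etree.fromstring(xml, parser=etree.XMLParser(resolve_entities=False))

print(default_root.text)  # "expanded"
print(safe_root.text)     # None: the entity stays as an unresolved reference node
```

Entity expansion on untrusted input is the main reason Bandit flags lxml parsing (e.g. "billion laughs" style payloads), which is why disabling it justifies the allow-listing.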


codecov bot commented Mar 1, 2024

Codecov Report

Merging #6265 (6cbf33d) into master (aa1bf69) will increase coverage by 0.00%.
The diff coverage is 92.85%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6265   +/-   ##
=======================================
  Coverage   88.90%   88.91%           
=======================================
  Files         161      161           
  Lines       11793    11796    +3     
  Branches     1914     1914           
=======================================
+ Hits        10485    10488    +3     
  Misses        964      964           
  Partials      344      344           
Files                               Coverage Δ
scrapy/http/request/form.py         97.76% <100.00%> (+0.03%) ⬆️
scrapy/linkextractors/lxmlhtml.py   97.03% <100.00%> (ø)
scrapy/selector/unified.py          100.00% <100.00%> (ø)
scrapy/utils/_compression.py        91.89% <ø> (ø)
scrapy/utils/sitemap.py             96.15% <100.00%> (ø)
scrapy/utils/versions.py            100.00% <100.00%> (ø)
scrapy/utils/iterators.py           91.97% <50.00%> (ø)

TextareaElement,
)
from parsel.selector import create_root_node
from lxml.html import FormElement # nosec
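For context, Bandit reports lxml imports under its import blacklist (B410), and a trailing `# nosec` comment on the flagged line is how individual findings are allow-listed. A minimal sketch:

```python
# Bandit flags bare lxml imports (B410); a trailing "# nosec" comment
# tells Bandit to skip the finding on that line. Note that the marker
# must be the single token "nosec" -- "# no sec" (with a space) is not
# recognized by Bandit.
from lxml.html import FormElement  # nosec

print(FormElement)
```

This also suggests why the import was split onto its own line: `# nosec` applies per line, so the lxml import cannot share a line (or a parenthesized import group) with imports that should remain checked.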
Member

Question: did we need to change the import style because it is not possible to use # nosec otherwise? Thanks

Member (Author)

isort did that automatically; I am not sure why, but I imagine it is a style choice.

Member

Got it! Thanks!

@@ -120,7 +115,7 @@ def _get_form(
formxpath: Optional[str],
) -> FormElement:
"""Find the wanted form element within the given response."""
-    root = create_root_node(response.text, HTMLParser, base_url=get_base_url(response))
+    root = response.selector.root
Member

Can you please explain this? Also does this work with all supported parsel versions?

Member (Author)

Some of the code later on relies on attributes from lxml.etree.Element not exposed by Selector (e.g. name), so using the selector all the way was not possible.

Selector.root was already available in the lowest parsel version Scrapy supports.

@Gallaecio Gallaecio merged commit bf14935 into scrapy:master Mar 1, 2024
26 checks passed
Gallaecio added a commit that referenced this pull request May 13, 2024