
Bandit: allow-list lxml usages #6265

Merged: 1 commit merged into scrapy:master on Mar 1, 2024
Conversation

Gallaecio (Member)

In most cases, the actual loading of the document is done by parsel, not Scrapy.

form.py was one exception, but I have refactored it to use the response selector instead. The application of get_base_url in unified.py is a bug fix detected in the process.

Another exception is iterparse_lxml, which now uses resolve_entities=False as parsel does.

And then there was the sitemap code, which was already disabling entity resolution.
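As an illustration (not code from this PR), the effect of disabling entity resolution, which is what parsel and the sitemap code do, can be sketched with a hypothetical document:

```python
from lxml import etree  # nosec -- illustrative; this import is exactly what Bandit flags

# A document defining an internal entity. With lxml's default parser
# (resolve_entities=True), &e; is expanded into its replacement text;
# with resolve_entities=False, it is left unresolved in the tree.
xml = b'<!DOCTYPE r [<!ENTITY e "expanded">]><r>&e;</r>'

default_root = etree.fromstring(xml)
safe_root = etree.fromstring(xml, parser=etree.XMLParser(resolve_entities=False))

print(default_root.text)  # "expanded"
print(safe_root.text)     # None: the entity stays as an unresolved reference node
```

Entity expansion on untrusted input is the main reason Bandit flags lxml parsing (e.g. "billion laughs" style payloads), which is why disabling it justifies the allow-listing.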


codecov bot commented Mar 1, 2024

Codecov Report

Merging #6265 (6cbf33d) into master (aa1bf69) will increase coverage by 0.00%.
The diff coverage is 92.85%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6265   +/-   ##
=======================================
  Coverage   88.90%   88.91%           
=======================================
  Files         161      161           
  Lines       11793    11796    +3     
  Branches     1914     1914           
=======================================
+ Hits        10485    10488    +3     
  Misses        964      964           
  Partials      344      344           
Files                               Coverage Δ
scrapy/http/request/form.py         97.76% <100.00%> (+0.03%) ⬆️
scrapy/linkextractors/lxmlhtml.py   97.03% <100.00%> (ø)
scrapy/selector/unified.py          100.00% <100.00%> (ø)
scrapy/utils/_compression.py        91.89% <ø> (ø)
scrapy/utils/sitemap.py             96.15% <100.00%> (ø)
scrapy/utils/versions.py            100.00% <100.00%> (ø)
scrapy/utils/iterators.py           91.97% <50.00%> (ø)

TextareaElement,
)
from parsel.selector import create_root_node
from lxml.html import FormElement # nosec
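For context, Bandit reports lxml imports under its import blacklist (B410), and a trailing `# nosec` comment on the flagged line is how individual findings are allow-listed. A minimal sketch:

```python
# Bandit flags bare lxml imports (B410); a trailing "# nosec" comment
# tells Bandit to skip the finding on that line. Note that the marker
# must be the single token "nosec" -- "# no sec" (with a space) is not
# recognized by Bandit.
from lxml.html import FormElement  # nosec

print(FormElement)
```

This also suggests why the import was split onto its own line: `# nosec` applies per line, so the lxml import cannot share a line (or a parenthesized import group) with imports that should remain checked.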
Member

Question: did we need to change the import style because it is not possible to use # nosec otherwise? Thanks

Member (Author)

isort did that automatically; I am not sure why, but I imagine it is a style choice.

Member

Got it! Thanks!

@@ -120,7 +115,7 @@ def _get_form(
formxpath: Optional[str],
) -> FormElement:
"""Find the wanted form element within the given response."""
-    root = create_root_node(response.text, HTMLParser, base_url=get_base_url(response))
+    root = response.selector.root
Member

Can you please explain this? Also does this work with all supported parsel versions?

Member (Author)

Some of the code later on relies on attributes from lxml.etree.Element not exposed by Selector (e.g. name), so using the selector all the way was not possible.

Selector.root was already available in the lowest parsel version Scrapy supports.

@Gallaecio Gallaecio merged commit bf14935 into scrapy:master Mar 1, 2024
26 checks passed
Gallaecio added a commit that referenced this pull request May 13, 2024