What exactly is '--blockRules' blocking? Entire URLs where an element like an iframe matches a regex, or only the matching part of a page? #574

steph-nb · 2024-05-15T15:03:04Z

Hi,
to me it looks like '--blockRules' blocks entire pages when the URL of a subelement, such as the content of an iframe, matches one of the passed regexes.
Is that correct?
Or what is the exact mechanism?

And if my assumption is true, would it be possible to have an option that excludes only the matching elements but still collects the rest of the page?

Many thanks

tw4l · 2024-05-16T16:41:39Z

Hi @steph-nb, the block rules target requests from specific URLs, so if you have a page at example.com with an iframe loading content from othersite.com and add a block rule matching othersite.com, the overall page at example.com should still be captured but the iframe content from othersite.com should be blocked.

If you're seeing behavior that deviates from this, I'm happy to look into it further!
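The request-level behaviour described above can be illustrated with a minimal sketch. This is not browsertrix-crawler's actual implementation. The function names `should_block` and `capture` and the request list are hypothetical, chosen only to show that a rule matching othersite.com drops that one request while the requests for example.com are kept.

```python
import re


def should_block(request_url: str, block_rules: list[str]) -> bool:
    """Return True if the request URL matches any block rule regex."""
    return any(re.search(rule, request_url) for rule in block_rules)


def capture(requests: list[str], block_rules: list[str]) -> list[str]:
    """Keep, in order, every request that no block rule matches."""
    return [url for url in requests if not should_block(url, block_rules)]


# Hypothetical requests issued while loading a page at example.com,
# including one made by an iframe that loads from othersite.com.
page_requests = [
    "https://example.com/",
    "https://example.com/style.css",
    "https://othersite.com/embed/123",
]

captured = capture(page_requests, [r"othersite\.com"])
print(captured)
```

In this sketch only the iframe request is filtered out; the page itself and its stylesheet remain in the captured list.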

steph-nb · 2024-05-17T06:14:42Z

Hi @tw4l, many thanks for your answer. I am not yet sure whether the behaviour of browsertrix-crawler or my syntax for crawler_extra_args in browsertrix is wrong.
How would you enter multiple regexes for blockRules in crawler_extra_args in the values.yaml of browsertrix, so that all matching content is blocked on any page visited?

For example I want to use these regexes:

BR and thanks a lot!

steph-nb · 2024-06-07T13:27:21Z

Hi @tw4l ,
I tried several more ways to configure this parameter via the values.yaml of browsertrix.
Here are some examples:
a)
crawler_extra_args: '--rolloverSize 100000000 --blockRules [".youtube.",".facebook.",".stats\.i-web\.ch.",".stats4\.i-web\.ch.",".onLogin.",".start_date.",".matomo."]'

b)
crawler_extra_args: '--rolloverSize 100000000 --blockRules ["youtube"]'

Both variants always resulted in blocking much more than just the intended page elements.
See for instance:

Question 1:
How would you pass this parameter via values.yaml?

Question 2:
If my approaches are already correct, could you maybe rework the functionality so that it really excludes only the matching elements?

Many thanks and BR
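One possible explanation for the over-blocking can be checked in isolation. The sketch below assumes that the bracketed value from example b) reaches the crawler as a single regex string rather than being parsed as a list; how browsertrix actually forwards crawler_extra_args is not confirmed in this thread. Under that assumption, `["youtube"]` is a regex character class that matches any URL containing one of the characters `"`, `y`, `o`, `u`, `t`, `b` or `e`. Python's `re` module is used only for illustration; character classes behave the same way in most regex dialects. The URL list is hypothetical.

```python
import re

# Hypothetical URLs requested during a crawl.
urls = [
    "https://www.youtube.com/embed/abc",
    "https://example.com/about",
    "https://example.org/style.css",
]

# Assumption: the whole bracketed value is treated as ONE regex.
# Square brackets then denote a character class, not a list.
as_single_regex = '["youtube"]'

# A plain pattern without brackets or quotes, for comparison.
plain = r"youtube"

matched_by_class = [u for u in urls if re.search(as_single_regex, u)]
matched_by_plain = [u for u in urls if re.search(plain, u)]

print(len(matched_by_class), "URLs match the bracketed value")
print(len(matched_by_plain), "URL matches the plain pattern")
```

If the assumption holds, every URL above would be blocked by the bracketed value, while the plain pattern would block only the YouTube embed.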
