
What exactly is '--blockRules' blocking? Entire URLs where an element like an iframe matches a regex, or only the matching part of a page? #574

Open
steph-nb opened this issue May 15, 2024 · 3 comments

Comments

@steph-nb

Hi,
to me it looks like '--blockRules' blocks entire pages when the URL of a sub-element, such as an iframe's content, matches one of the passed regexes.
Is that correct?
Or what is the exact mechanism?

And if my assumption is correct, would it be possible to have an option that excludes only the matching elements but still collects the rest of the page?

Many thanks

@tw4l
Contributor

tw4l commented May 16, 2024

Hi @steph-nb, the block rules target requests from specific URLs, so if you have a page at example.com with an iframe loading content from othersite.com and add a block rule matching othersite.com, the overall page at example.com should still be captured but the iframe content from othersite.com should be blocked.

If you're seeing behavior that deviates from this, I'm happy to look into it further!
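
The behaviour described above can be sketched as a per-request filter. This is only an illustration under stated assumptions, not browsertrix-crawler's actual implementation: the helper name `filter_requests`, the sample URLs, and the use of an unanchored regex search are all hypothetical choices made for the example.

```python
import re


def filter_requests(request_urls, block_patterns):
    """Split request URLs into (allowed, blocked) lists.

    Hypothetical helper for illustration; not part of browsertrix-crawler.
    A request counts as blocked if any pattern matches anywhere in its URL.
    """
    compiled = [re.compile(p) for p in block_patterns]
    allowed, blocked = [], []
    for url in request_urls:
        if any(rx.search(url) for rx in compiled):
            blocked.append(url)
        else:
            allowed.append(url)
    return allowed, blocked


# Requests a browser might issue while loading the page at example.com
# (sample data, assumed for the example).
requests_for_page = [
    "https://example.com/",
    "https://example.com/style.css",
    "https://othersite.com/embed/widget",
]

allowed, blocked = filter_requests(requests_for_page, [r"othersite\.com"])
print("allowed:", allowed)
print("blocked:", blocked)
```

In this sketch the page itself and its stylesheet stay in the allowed list, and only the request to othersite.com is dropped, which mirrors the expectation that the page at example.com is still captured.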

@steph-nb
Author

steph-nb commented May 17, 2024

Hi @tw4l, many thanks for your answer. I am not yet sure whether the behaviour of browsertrix-crawler is really wrong, or whether my syntax for crawler_extra_args in browsertrix is.
How would you pass multiple regexes to blockRules in crawler_extra_args of the values.yaml in browsertrix, so that all matching content is blocked on any page visited?

For example I want to use these regexes:
image

BR and thanks a lot!

@steph-nb
Author

steph-nb commented Jun 7, 2024

Hi @tw4l ,
I tried several ways of configuring this parameter via the values.yaml of browsertrix.
Here are some examples:
a)
crawler_extra_args: '--rolloverSize 100000000 --blockRules [".youtube.",".facebook.",".stats\.i-web\.ch.",".stats4\.i-web\.ch.",".onLogin.",".start_date.",".matomo."]'

b)
crawler_extra_args: '--rolloverSize 100000000 --blockRules ["youtube"]'

Each attempt blocked much more than just the intended page elements.
See for instance:
image
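
One thing that can be checked independently of the crawler is how broadly patterns like the ones above match. The sketch below assumes the patterns are applied as unanchored regex searches against each request URL, which is an assumption and not confirmed in this thread. The sample URLs and the helper `matching_urls` are hypothetical, and the narrower pattern at the end is only a suggestion for comparison.

```python
import re


def matching_urls(pattern, urls):
    """Return the URLs in which the pattern matches anywhere (unanchored search)."""
    rx = re.compile(pattern)
    return [u for u in urls if rx.search(u)]


# Sample URLs, assumed for the example.
urls = [
    "https://www.youtube.com/embed/abc123",        # the embed that should be blocked
    "https://example.com/blog/youtube-tips.html",  # ordinary page with "youtube" in its path
    "https://example.com/share?via=youtube&id=1",  # ordinary page with "youtube" in its query
    "https://example.com/about",
]

# Patterns as written in examples a) and b).
broad_a = matching_urls(".youtube.", urls)
broad_b = matching_urls("youtube", urls)

# A narrower, host-anchored alternative (hypothetical suggestion).
narrow = matching_urls(r"^https?://(www\.)?youtube\.com/", urls)

print("'.youtube.' matches:", len(broad_a))
print("'youtube' matches:  ", len(broad_b))
print("anchored matches:   ", len(narrow))
```

Under these assumptions both patterns from the examples match every URL that merely contains "youtube", while the host-anchored pattern matches only the embed. If the crawler matches in a similar way, that could account for some of the extra blocking, but it does not by itself explain whole pages going missing.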

Question 1:
How would you pass this parameter via values.yaml?

Question 2:
If my configuration is already correct, could you perhaps rework the functionality so that only the matching elements are excluded?

Many thanks and BR
