playwrightUtils for blocking patterns #2810
-
I am currently using https://crawlee.dev/api/playwright-crawler/namespace/playwrightUtils#blockRequests with extraUrlPatterns to block certain URLs, such as www.googletagmanager.com and images.taboola.com. However, the stats from my proxy provider suggest that my scraper is still letting these requests through. Am I doing something wrong, or does this not work the way I think it does? I recall blocking requests more effectively in Puppeteer, and I have tried a few things in Playwright, but none of them seem to work correctly. Any ideas?
-
Hi @tsrseerist and thanks for your interest in Crawlee! Could you please provide a minimal code snippet that reproduces the error so that we can troubleshoot this more efficiently?
-
Sure:
I use a router for my scraping, but this is basically all I am doing.
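Something along these lines (simplified; the blocked hosts are the ones mentioned above, and the handler body and start URL are just placeholders):

```ts
import { PlaywrightCrawler, createPlaywrightRouter, playwrightUtils } from 'crawlee';

const router = createPlaywrightRouter();

router.addDefaultHandler(async ({ page, request, log }) => {
    // Block the unwanted hosts on top of Crawlee's default blocked patterns.
    await playwrightUtils.blockRequests(page, {
        extraUrlPatterns: ['www.googletagmanager.com', 'images.taboola.com'],
    });

    log.info(`Scraping ${request.url}`);
    // ...actual scraping logic...
});

const crawler = new PlaywrightCrawler({
    requestHandler: router,
});

await crawler.run(['https://example.com']);
```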
-
Sure:
There's really not much to it; just something basic like this should work. Also, let me know if I am using extraUrlPatterns the correct way too.
-
Can you try calling `blockRequests` in a pre-navigation hook (https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlerOptions#preNavigationHooks) instead of in the request handler?
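The request handler only runs after the initial navigation has finished, so requests fired during the page load can slip through before blocking set up there takes effect. Roughly something like this, with the extra patterns and start URL assumed from the question above:

```ts
import { PlaywrightCrawler, playwrightUtils } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        // Runs before page.goto(), so the blocking rules are already in place
        // when the initial page load triggers its third-party requests.
        async ({ page }) => {
            await playwrightUtils.blockRequests(page, {
                extraUrlPatterns: ['www.googletagmanager.com', 'images.taboola.com'],
            });
        },
    ],
    requestHandler: async ({ page, request, log }) => {
        log.info(`Scraping ${request.url}`);
        // ...scraping logic as before...
    },
});

await crawler.run(['https://example.com']);
```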