
How would I crawl a single site with multiple pages in parallel? #24

Closed
tinonetic opened this issue Jan 12, 2021 · 2 comments

@tinonetic

Hi,

Thanks for the product!

Apologies for the many questions.

How would I crawl a single site with multiple pages in parallel?
Do I need AbotX, or will Abot do?
Do I need to loop through the list of sites if I can only do 3 at a time on the free version?
Is it ideal to have this in a job that keeps track of runs?
Also, the documentation doesn't say in which part of the code I get the crawled data. Is it in crawlEngine.SiteCrawlCompleted, after the lock(crawlCounts){...} statement?

Example

        private static async Task DemoParallelCrawlerEngine()
        {
            var siteToCrawlProvider = new SiteToCrawlProvider();
            siteToCrawlProvider.AddSitesToCrawl(new List<SiteToCrawl>
            {
                new SiteToCrawl{ Uri = new Uri("YOURSITE1") },
                new SiteToCrawl{ Uri = new Uri("YOURSITE2") },
                new SiteToCrawl{ Uri = new Uri("YOURSITE3") },
                new SiteToCrawl{ Uri = new Uri("YOURSITE4") },
                new SiteToCrawl{ Uri = new Uri("YOURSITE5") }
            });

            var config = GetSafeConfig();
            config.MaxConcurrentSiteCrawls = 3;
                
            var crawlEngine = new ParallelCrawlerEngine(
                config, 
                new ParallelImplementationOverride(config, 
                    new ParallelImplementationContainer()
                    {
                        SiteToCrawlProvider = siteToCrawlProvider,
                        WebCrawlerFactory = new WebCrawlerFactory(config)//Same config will be used for every crawler
                    })
                );                
            
            var crawlCounts = new Dictionary<Guid, int>();
            var siteStartingEvents = 0;
            var allSitesCompletedEvents = 0;
            crawlEngine.CrawlerInstanceCreated += (sender, eventArgs) =>
            {
                var crawlId = Guid.NewGuid();
                eventArgs.Crawler.CrawlBag.CrawlId = crawlId;
            };
            crawlEngine.SiteCrawlStarting += (sender, args) =>
            {
                Interlocked.Increment(ref siteStartingEvents);
            };
            crawlEngine.SiteCrawlCompleted += (sender, eventArgs) =>
            {
                lock (crawlCounts)
                {
                    crawlCounts.Add(eventArgs.CrawledSite.SiteToCrawl.Id, eventArgs.CrawledSite.CrawlResult.CrawlContext.CrawledCount);
                }
            };
            crawlEngine.AllCrawlsCompleted += (sender, eventArgs) =>
            {
                Interlocked.Increment(ref allSitesCompletedEvents);
            };

            await crawlEngine.StartAsync();
        }
@sjdirect
Owner

sjdirect commented Jan 12, 2021

Abot's PoliteWebCrawler alone can crawl multiple PAGES of a single site concurrently. AbotX's ParallelCrawlerEngine manages multiple PoliteWebCrawler instances, effectively allowing you to crawl multiple SITES concurrently.
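[Editor's note: the page-level concurrency described above can be sketched as follows. This is a minimal sketch against the Abot2 API; the site URL and the specific config values are placeholder assumptions, not recommendations.]

```csharp
using System;
using System.Threading.Tasks;
using Abot2.Crawler;
using Abot2.Poco;

class SingleSitePageConcurrencyDemo
{
    static async Task Main()
    {
        // MaxConcurrentThreads controls how many PAGES of the one site a
        // single PoliteWebCrawler fetches at the same time. No AbotX needed.
        var config = new CrawlConfiguration
        {
            MaxConcurrentThreads = 10,                   // 10 simultaneous page requests
            MaxPagesToCrawl = 100,                       // stop after 100 pages
            MinCrawlDelayPerDomainMilliSeconds = 1000    // stay polite to the host
        };

        var crawler = new PoliteWebCrawler(config);
        crawler.PageCrawlCompleted += (sender, e) =>
            Console.WriteLine($"{e.CrawledPage.Uri} -> {e.CrawledPage.HttpResponseMessage?.StatusCode}");

        // Placeholder seed URL
        await crawler.CrawlAsync(new Uri("http://YOURSITEHERE.com"));
    }
}
```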

This example shows how to get the content of a crawled page.

    var crawler = new PoliteWebCrawler(config);
    crawler.PageCrawlCompleted += PageCrawlCompleted; // Several events available...
    var crawlResult = await crawler.CrawlAsync(new Uri("http://!!!!!!!!YOURSITEHERE!!!!!!!!!.com"));

    private static void PageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
    {
        var httpStatus = e.CrawledPage.HttpResponseMessage.StatusCode;
        var rawPageText = e.CrawledPage.Content.Text;
    }

@tinonetic
Author

tinonetic commented Jan 13, 2021

Thank you for the info. Very helpful.

One last clarification: if I want to crawl only specific pages of a website, not the whole site, do I have to let it crawl the entire site?

How do I direct it to crawl, say, paged content? For example:

www.mysite.com/puppies?page=1
www.mysite.com/puppies?page=2
www.mysite.com/puppies?page=3
www.mysite.com/puppies?page=4
www.mysite.com/puppies?page=5

...and I do not want it to crawl

www.mysite.com/contact
www.mysite.com/puppies/blog
www.mysite.com/services

Thank you for your patience!
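[Editor's note: Abot exposes a per-page decision hook, ShouldCrawlPageDecisionMaker, which runs for every discovered link before it is fetched and can veto it. A minimal sketch of the restriction asked about above, assuming the goal is to allow only the paginated /puppies listing; the path and query checks are illustrative assumptions:]

```csharp
var config = new CrawlConfiguration { MaxPagesToCrawl = 50 };
var crawler = new PoliteWebCrawler(config);

// Runs for every discovered link BEFORE it is fetched.
// Allow only /puppies?page=N; decline /contact, /puppies/blog, /services, etc.
crawler.ShouldCrawlPageDecisionMaker = (pageToCrawl, crawlContext) =>
{
    var uri = pageToCrawl.Uri;
    var isPuppiesListing =
        uri.AbsolutePath.Equals("/puppies", StringComparison.OrdinalIgnoreCase)
        && uri.Query.StartsWith("?page=");

    // The seed page itself must stay allowed or the crawl never starts.
    if (pageToCrawl.IsRoot || isPuppiesListing)
        return new CrawlDecision { Allow = true };

    return new CrawlDecision { Allow = false, Reason = "Outside /puppies?page=N" };
};

await crawler.CrawlAsync(new Uri("http://www.mysite.com/puppies?page=1"));
```

Note that the crawler still has to fetch the allowed pages to discover links on them; the hook only prevents it from descending into the excluded paths.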
