
How would I crawl a single site with multiple pages in parallel? #24

Closed
tinonetic opened this issue Jan 12, 2021 · 2 comments

@tinonetic

Hi,

Thanks for the product!

Apologies for the many questions.

How would I crawl a single site with multiple pages in parallel?
Do I need AbotX, or will Abot do?
Do I need to loop through the list of sites if I can only do 3 at a time on the free version?
Is it ideal to have this in a job that keeps track of runs?
Also, the documentation doesn't say in which part of the code I get the crawled data. Is it in crawlEngine.SiteCrawlCompleted, after the lock(crawlCounts){...} statement?

Example

        private static async Task DemoParallelCrawlerEngine()
        {
            var siteToCrawlProvider = new SiteToCrawlProvider();
            siteToCrawlProvider.AddSitesToCrawl(new List<SiteToCrawl>
            {
                new SiteToCrawl{ Uri = new Uri("YOURSITE1") },
                new SiteToCrawl{ Uri = new Uri("YOURSITE2") },
                new SiteToCrawl{ Uri = new Uri("YOURSITE3") },
                new SiteToCrawl{ Uri = new Uri("YOURSITE4") },
                new SiteToCrawl{ Uri = new Uri("YOURSITE5") }
            });

            var config = GetSafeConfig();
            config.MaxConcurrentSiteCrawls = 3;
                
            var crawlEngine = new ParallelCrawlerEngine(
                config, 
                new ParallelImplementationOverride(config, 
                    new ParallelImplementationContainer()
                    {
                        SiteToCrawlProvider = siteToCrawlProvider,
                        WebCrawlerFactory = new WebCrawlerFactory(config)//Same config will be used for every crawler
                    })
                );                
            
            var crawlCounts = new Dictionary<Guid, int>();
            var siteStartingEvents = 0;
            var allSitesCompletedEvents = 0;
            crawlEngine.CrawlerInstanceCreated += (sender, eventArgs) =>
            {
                var crawlId = Guid.NewGuid();
                eventArgs.Crawler.CrawlBag.CrawlId = crawlId;
            };
            crawlEngine.SiteCrawlStarting += (sender, args) =>
            {
                Interlocked.Increment(ref siteStartingEvents);
            };
            crawlEngine.SiteCrawlCompleted += (sender, eventArgs) =>
            {
                lock (crawlCounts)
                {
                    crawlCounts.Add(eventArgs.CrawledSite.SiteToCrawl.Id, eventArgs.CrawledSite.CrawlResult.CrawlContext.CrawledCount);
                }
            };
            crawlEngine.AllCrawlsCompleted += (sender, eventArgs) =>
            {
                Interlocked.Increment(ref allSitesCompletedEvents);
            };

            await crawlEngine.StartAsync();
        }
@sjdirect
Owner

sjdirect commented Jan 12, 2021

Abot's PoliteWebCrawler alone can crawl multiple PAGES of a single site concurrently. AbotX's ParallelCrawlerEngine manages multiple PoliteWebCrawler instances, effectively allowing you to crawl multiple SITES concurrently.
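[Editor's note: the page-level concurrency described above can be sketched as follows. This is a minimal sketch against the Abot2 API; the site URL and the specific config values are placeholder assumptions, not recommendations.]

```csharp
using System;
using System.Threading.Tasks;
using Abot2.Crawler;
using Abot2.Poco;

class SingleSitePageConcurrencyDemo
{
    static async Task Main()
    {
        // MaxConcurrentThreads controls how many PAGES of the one site a
        // single PoliteWebCrawler fetches at the same time. No AbotX needed.
        var config = new CrawlConfiguration
        {
            MaxConcurrentThreads = 10,                   // 10 simultaneous page requests
            MaxPagesToCrawl = 100,                       // stop after 100 pages
            MinCrawlDelayPerDomainMilliSeconds = 1000    // stay polite to the host
        };

        var crawler = new PoliteWebCrawler(config);
        crawler.PageCrawlCompleted += (sender, e) =>
            Console.WriteLine($"{e.CrawledPage.Uri} -> {e.CrawledPage.HttpResponseMessage?.StatusCode}");

        // Placeholder seed URL
        await crawler.CrawlAsync(new Uri("http://YOURSITEHERE.com"));
    }
}
```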

This example shows how to get the content of a crawled page.

    var crawler = new PoliteWebCrawler(config);
    crawler.PageCrawlCompleted += PageCrawlCompleted; // Several events available...
    var crawlResult = await crawler.CrawlAsync(new Uri("http://!!!!!!!!YOURSITEHERE!!!!!!!!!.com"));

    private static void PageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
    {
        var httpStatus = e.CrawledPage.HttpResponseMessage.StatusCode;
        var rawPageText = e.CrawledPage.Content.Text;
    }

@tinonetic
Author

tinonetic commented Jan 13, 2021

Thank you for the info. Very helpful.

One last clarification: if I want to crawl only specific pages of a website, not the whole site, do I have to let it crawl the entire site?

How do I direct it to crawl, say, paged content? For example:

www.mysite.com/puppies?page=1
www.mysite.com/puppies?page=2
www.mysite.com/puppies?page=3
www.mysite.com/puppies?page=4
www.mysite.com/puppies?page=5

...and I do not want it to crawl

www.mysite.com/contact
www.mysite.com/puppies/blog
www.mysite.com/services

Thank you for your patience!
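[Editor's note: Abot exposes a per-page decision hook, ShouldCrawlPageDecisionMaker, which runs for every discovered link before it is fetched and can veto it. A minimal sketch of the restriction asked about above, assuming the goal is to allow only the paginated /puppies listing; the path and query checks are illustrative assumptions:]

```csharp
var config = new CrawlConfiguration { MaxPagesToCrawl = 50 };
var crawler = new PoliteWebCrawler(config);

// Runs for every discovered link BEFORE it is fetched.
// Allow only /puppies?page=N; decline /contact, /puppies/blog, /services, etc.
crawler.ShouldCrawlPageDecisionMaker = (pageToCrawl, crawlContext) =>
{
    var uri = pageToCrawl.Uri;
    var isPuppiesListing =
        uri.AbsolutePath.Equals("/puppies", StringComparison.OrdinalIgnoreCase)
        && uri.Query.StartsWith("?page=");

    // The seed page itself must stay allowed or the crawl never starts.
    if (pageToCrawl.IsRoot || isPuppiesListing)
        return new CrawlDecision { Allow = true };

    return new CrawlDecision { Allow = false, Reason = "Outside /puppies?page=N" };
};

await crawler.CrawlAsync(new Uri("http://www.mysite.com/puppies?page=1"));
```

Note that the crawler still has to fetch the allowed pages to discover links on them; the hook only prevents it from descending into the excluded paths.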
