Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor Crawler #37

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
Open

Refactor Crawler #37

wants to merge 7 commits into from

Conversation

tmaier
Copy link
Contributor

@tmaier tmaier commented Jun 10, 2014

As a result from #33, I reconsidered the current structure or PolipusCrawler.

Especially PolipusCrawler#takeover is a very long method where lots is going on at the same time.
PolipusCrawler itself has lots of methods and is responsible for everything whats going on in Polipus.

I consider this pull request more a proof of concept and a starting point for a discussion.

I would like to move all methods of PolipusCrawler to its own classes or plugins so that every class has its own responsibility.
For now, I moved most of PolipusCrawler#takeover to Worker#run and split itself again in smaller methods.
This would allow a more thorough testing of only specific features of polipus without running the full stack.

The delegator used in Worker is more a temporary solution.

We could allow plugins to hook into should_be_visited? and add then a robots plugin, a follow_links_like plugin and a store_pages plugin.

A statistics plugin would replace incr_pages and incr_errors and hook into on_after_download and on_page_error

@coveralls
Copy link

Coverage Status

Coverage increased (+0.42%) when pulling 67dca2c on tmaier:crawler-refactor into 0106246 on taganaka:master.

@taganaka
Copy link
Owner

👍
I'm really supportive in refactoring and moving code around as long as it will be helpful for a better maintainability & testing. I'm in general not a huge fan of having everything incapsulated/abstracted/delegated (Matryoshka doll style) just because guides and books says that :)

So let's the topic open.

What about to start to write a WISH/TODO list about features and improvements we envision in the next releases? So that we have a better context on what should be next and how to move forward?

@tmaier
Copy link
Contributor Author

tmaier commented Jun 13, 2014

The WISHES and TODOs

Feel free to add your point in this comment or to cancel a line if you're against it.
Reorder the points freely

  1. Improve code style and refactor (@tmaier)
    • Get code climate to "A" everywhere
    • Get rid of long methods. I know the source is Anemone - which is influenced Polipus very much
    • I would prefer more small classes over one big class PolipusCrawler which does everything.
  2. Add a dry run (@tmaier)
    • We cache the pages and we can really do lots of nice and fancy stuff in our #on_page_downloaded blocks.
    • It would be great if we could re-run the blocks on the pages in the pages collection
  3. Allow better debugging of our blocks (@tmaier)
    Why did it fail now? allow to step into the on_page_downloaded block somehow and debug it
    For example with pry.
  4. Handle the case when one worker fails (@tmaier)
    • When one worker fails, the one worker just stops working. The whole crawl process is still going on. Just the one worker never picks up another job.
    • Either let the whole crawl session die (of course without loosing data) or let the one worker restart
  5. Consider to replace parts of Polipus::HTTP with excon (@tmaier)
    • Would we gain speed?
    • Would we be able to remove code which we would need to maintain from Polipus?

@taganaka
Copy link
Owner

Very good points! Thanks alot for your thoughts and your help

Going to open a separated issue/thread for each items so that it is easily to keep track of them.

On top of my mind:

  1. General storage performance improvements (@taganaka)
    • Bloom filtered storage: storage.exists?(page) is invoked very often and every time it turns into an hit on DB slowing down the process. A bloom filter added on top of the storage logic might help here
    • Buffered writes: storage.add(page) performs a single insert/write each time is invoked. We might get some advantages here by implementing the support for bulk writes/inserts whether the underlying driver allows such operations (https://github.com/mongodb/mongo-ruby-driver/wiki/Bulk-Write-Operations)
  2. Redesign of Queue Overflow Manager (@taganaka)
    • It is at the moment very inefficient for a large crawling session (> 20 ec2 instances, huge websites, tons of inbound links ). The current strategy is not fast enough to guarantee Redis to not be overloaded by tons of urls
  3. Better built in statistics tools
    • counters on 2XX 3XX 4XX 5XX http codes, avg for downloaded pages per H/m/s etc, ETA..etc
  4. Get rid of s3_storage

@tmaier
Copy link
Contributor Author

tmaier commented Jun 16, 2014

I also don't really know what to do with the current plugin implementation.
First of all, a plugin does not have access to the page he is currently processing.

Next, the existing plugins Cleaner and Sleeper: For me it is not really obvious why they are plugins and everything else are options of PolipusCrawler. See polipus.rb#L23-L80

I would propose to allow the plugins access to page and also to move every single configurable feature to the plugin architecture. This way, someone could replace single features with his own implementation or simply get a slimmer crawler when he does not need some of the features provided.

I imagine it like the Middleware of Rack or Sidekiq.

@taganaka
Copy link
Owner

I also don't really know what to do with the current plugin implementation.

Me either :) My initial concept was to create an architecture where user's code could run into polipus scope. But I didn't invest much time. I'm also fine to drop the current implementation and explore a Middleware-like implementation (that actually seems a very good idea!!!)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants