Refactor Crawler #37

tmaier · 2014-06-10T07:25:58Z

As a result from #33, I reconsidered the current structure or PolipusCrawler.

Especially PolipusCrawler#takeover is a very long method where lots is going on at the same time.
PolipusCrawler itself has lots of methods and is responsible for everything whats going on in Polipus.

I consider this pull request more a proof of concept and a starting point for a discussion.

I would like to move all methods of PolipusCrawler to its own classes or plugins so that every class has its own responsibility.
For now, I moved most of PolipusCrawler#takeover to Worker#run and split itself again in smaller methods.
This would allow a more thorough testing of only specific features of polipus without running the full stack.

The delegator used in Worker is more a temporary solution.

We could allow plugins to hook into should_be_visited? and add then a robots plugin, a follow_links_like plugin and a store_pages plugin.

A statistics plugin would replace incr_pages and incr_errors and hook into on_after_download and on_page_error

coveralls · 2014-06-10T07:33:27Z

Coverage increased (+0.42%) when pulling 67dca2c on tmaier:crawler-refactor into 0106246 on taganaka:master.

taganaka · 2014-06-11T03:00:16Z

👍
I'm really supportive in refactoring and moving code around as long as it will be helpful for a better maintainability & testing. I'm in general not a huge fan of having everything incapsulated/abstracted/delegated (Matryoshka doll style) just because guides and books says that :)

So let's the topic open.

What about to start to write a WISH/TODO list about features and improvements we envision in the next releases? So that we have a better context on what should be next and how to move forward?

tmaier · 2014-06-13T15:54:56Z

The WISHES and TODOs

Feel free to add your point in this comment or to ~~cancel~~ a line if you're against it.
Reorder the points freely

Improve code style and refactor (@tmaier)
- Get code climate to "A" everywhere
- Get rid of long methods. I know the source is Anemone - which is influenced Polipus very much
- I would prefer more small classes over one big class PolipusCrawler which does everything.
Add a dry run (@tmaier)
- We cache the pages and we can really do lots of nice and fancy stuff in our #on_page_downloaded blocks.
- It would be great if we could re-run the blocks on the pages in the pages collection
Allow better debugging of our blocks (@tmaier)
Why did it fail now? allow to step into the on_page_downloaded block somehow and debug it
For example with pry.
Handle the case when one worker fails (@tmaier)
- When one worker fails, the one worker just stops working. The whole crawl process is still going on. Just the one worker never picks up another job.
- Either let the whole crawl session die (of course without loosing data) or let the one worker restart
Consider to replace parts of Polipus::HTTP with excon (@tmaier)
- Would we gain speed?
- Would we be able to remove code which we would need to maintain from Polipus?

taganaka · 2014-06-14T20:01:36Z

Very good points! Thanks alot for your thoughts and your help

Going to open a separated issue/thread for each items so that it is easily to keep track of them.

On top of my mind:

General storage performance improvements (@taganaka)
- Bloom filtered storage: storage.exists?(page) is invoked very often and every time it turns into an hit on DB slowing down the process. A bloom filter added on top of the storage logic might help here
- Buffered writes: storage.add(page) performs a single insert/write each time is invoked. We might get some advantages here by implementing the support for bulk writes/inserts whether the underlying driver allows such operations (https://github.com/mongodb/mongo-ruby-driver/wiki/Bulk-Write-Operations)
Redesign of Queue Overflow Manager (@taganaka)
- It is at the moment very inefficient for a large crawling session (> 20 ec2 instances, huge websites, tons of inbound links ). The current strategy is not fast enough to guarantee Redis to not be overloaded by tons of urls
Better built in statistics tools
- counters on 2XX 3XX 4XX 5XX http codes, avg for downloaded pages per H/m/s etc, ETA..etc
Get rid of s3_storage
- Polipus will be shipped with mongo store as built in storage mechanism. Storage alternatives will be provided trough separated gems (eg: https://github.com/taganaka/polipus-storage-mysql). Less code to maintains, less dependencies

tmaier · 2014-06-16T06:59:43Z

I also don't really know what to do with the current plugin implementation.
First of all, a plugin does not have access to the page he is currently processing.

Next, the existing plugins Cleaner and Sleeper: For me it is not really obvious why they are plugins and everything else are options of PolipusCrawler. See polipus.rb#L23-L80

I would propose to allow the plugins access to page and also to move every single configurable feature to the plugin architecture. This way, someone could replace single features with his own implementation or simply get a slimmer crawler when he does not need some of the features provided.

I imagine it like the Middleware of Rack or Sidekiq.

taganaka · 2014-06-16T19:34:48Z

I also don't really know what to do with the current plugin implementation.

Me either :) My initial concept was to create an architecture where user's code could run into polipus scope. But I didn't invest much time. I'm also fine to drop the current implementation and explore a Middleware-like implementation (that actually seems a very good idea!!!)

tmaier added 7 commits June 10, 2014 07:57

Move PolipusCrawler to own file

0591066

Remove private for now

ec3b519

replace @internal_queue with #queue

d9ec277

Add Worker

8190cb5

Add #process(page)

ade9a7d

Refactor #process(page)

fe79d7a

Replace @options with #options

67dca2c

tmaier mentioned this pull request Sep 30, 2014

Thread seems to hang in HTTP Call #9

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor Crawler #37

Refactor Crawler #37

tmaier commented Jun 10, 2014

coveralls commented Jun 10, 2014

taganaka commented Jun 11, 2014

tmaier commented Jun 13, 2014

taganaka commented Jun 14, 2014

tmaier commented Jun 16, 2014

taganaka commented Jun 16, 2014

Refactor Crawler #37

Are you sure you want to change the base?

Refactor Crawler #37

Conversation

tmaier commented Jun 10, 2014

coveralls commented Jun 10, 2014

taganaka commented Jun 11, 2014

tmaier commented Jun 13, 2014

The WISHES and TODOs

taganaka commented Jun 14, 2014

tmaier commented Jun 16, 2014

taganaka commented Jun 16, 2014