[WIP] Added Cassandra backend #225

Closed
wants to merge 14 commits into from

Conversation

voith
Contributor

@voith voith commented Nov 11, 2016

This PR is a rebase of #128. Although I have completely changed the design and refactored the code, I have kept @wpxgit's commits (squashed) because this work was originally initiated by him.

I have tried to follow the DRY methodology as much as possible, so I had to refactor some existing code.

I have serialized dicts using Pickle; as a result, this backend won't have the problems discussed in #211.

The PR includes unit tests and some integration tests with the backends integration testing framework.

It's good that Frontera has an integration test framework for testing backends in single-threaded mode. However, a similar framework for the distributed mode is very much needed.

I am open to all sorts of suggestions :)

@voith
Contributor Author

voith commented Nov 11, 2016

I'm marking this PR as WIP because there is some manual testing that I still have to do for the distributed backend. However, I won't be changing much code, so this PR is ready for some early review.

@wpxgit since this is your code too, would you mind helping me test this?

@codecov-io

codecov-io commented Nov 11, 2016

Current coverage is 71.89% (diff: 87.83%)

Merging #225 into master will increase coverage by 1.70%

@@             master       #225   diff @@
==========================================
  Files            68         72     +4   
  Lines          4690       5116   +426   
  Methods           0          0          
  Messages          0          0          
  Branches        636        679    +43   
==========================================
+ Hits           3292       3678   +386   
- Misses         1256       1285    +29   
- Partials        142        153    +11   

Powered by Codecov. Last update f4abced...9c316ea

@sibiryakov
Member

Hi, Voith

Thank you very much for starting this! Frontera will definitely benefit from Cassandra support.

I expect the major use case to be the distributed backends run mode, so the main source of inspiration should be HBaseBackend, not sqla.
We need to test it with at least 4 spiders and 2 strategy workers (SWs).
Also, storing data only in a Python-readable format does more harm than good for an application involving Cassandra, because such applications are likely to be multi-platform.

A.

On Nov 11, 2016, at 17:25, Voith Mascarenhas notifications@github.com wrote:


Commit Summary

Added Cassandra Backend
refactored some sqlalchemy code to reuse in cassandra backend
added unit tests for cassandra backend
added pickle field to serialize dicts in cassandra
refactored cassandra metadata and added tests for it
fix pickledict bugs in py2
refactored cassandra states and added tests for it
added cassandra queue and tests for it
Added Lifo, fifo, dfs, bfs backend and correspoding tests for it
fixed connection issue in tests
Added cassandra revisiting backend and tests for it
added unitests for utcnow_timestamp
updated cassandra docs
added tests for cassandra distributed backend

File Changes

M .travis.yml (2)
M docs/source/topics/frontera-settings.rst (91)
M docs/source/topics/frontier-backends.rst (49)
M frontera/contrib/backends/__init__.py (124)
A frontera/contrib/backends/cassandra/__init__.py (145)
A frontera/contrib/backends/cassandra/components.py (239)
A frontera/contrib/backends/cassandra/models.py (117)
A frontera/contrib/backends/cassandra/revisiting.py (101)
M frontera/contrib/backends/sqlalchemy/__init__.py (67)
M frontera/contrib/backends/sqlalchemy/components.py (35)
M frontera/contrib/backends/sqlalchemy/revisiting.py (33)
M frontera/settings/default_settings.py (15)
M frontera/utils/misc.py (9)
M requirements/tests.txt (1)
M setup.py (7)
A tests/contrib/backends/cassandra/test_backend_cassandra.py (302)
A tests/contrib/backends/cassandra/wait_for_cluster_up.py (30)
M tests/test_utils_misc.py (15)

Patch Links:

https://github.com/scrapinghub/frontera/pull/225.patch
https://github.com/scrapinghub/frontera/pull/225.diff


@voith
Contributor Author

voith commented Nov 15, 2016

@sibiryakov Thanks for your input!
I just started testing it in distributed mode. I'm having some trouble getting it running after the Kafka changes introduced in #223.

> I expect the major use case to be the distributed backends run mode, so the main source of inspiration should be HBaseBackend, not sqla.

I will add a separate distributed backend based on the HBase backend, but I'll keep the other backends, such as LIFO and FIFO, for quick test runs.

> Also, storing data only in a Python-readable format does more harm than good for an application involving Cassandra, because such applications are likely to be multi-platform.

I did not think about the multi-platform issues. I'll use the existing encoder in this case.
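
To make the portability concern above concrete, here is a minimal sketch comparing a pickled dict with a JSON-encoded one. It uses the standard-library `json` module only as a stand-in; which encoder "the existing encoder" refers to is not shown in this thread, and the `meta` dict and `decode_json` helper are hypothetical.

```python
import json
import pickle

# Hypothetical request metadata; the real field names may differ.
meta = {"depth": 2, "origin": "seed"}

# Pickle output is a Python-specific byte stream; clients written in
# other languages cannot read it without a pickle implementation.
as_pickle = pickle.dumps(meta, protocol=2)

# JSON output is plain UTF-8 text that any platform can parse.
as_json = json.dumps(meta, sort_keys=True).encode("utf-8")


def decode_json(blob):
    """Decode a UTF-8 JSON blob back into a Python dict."""
    return json.loads(blob.decode("utf-8"))
```

Both forms round-trip inside Python, but only the JSON blob can be read by a non-Python consumer of the same Cassandra table.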

@sibiryakov
Member

How is it going, @voith?

@voith
Contributor Author

voith commented Nov 18, 2016

@sibiryakov I did not get time to work on this. I will work on it over this weekend. I hope to have something by the end of this weekend.

@MichaelVIU

@voith: it is great to see that you are implementing Cassandra for Frontera!
I have taken a short look at your code; it looks great and clean.
I have only one hint: I kept a close eye on performance when implementing Cassandra as a storage backend. After several tests, it looked like execute_concurrent_with_args was the fastest way to insert that amount of data into Cassandra, especially when inserting links. But perhaps batch queries are fast enough; some time has gone by and I'm not sure whether I tested them...

@alex: sorry, I was really busy the last few months. Cassandra for Frontera was on my todo list, but I decided to go with another solution for myself and haven't found time to complete this. But it's great to see that someone else has picked up the baton...
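
For reference, a minimal sketch of the concurrent-insert approach mentioned above, using `execute_concurrent_with_args` from the DataStax Python driver (`cassandra-driver`). The `links` table, its columns, the SHA-1 fingerprint, and the helper names are assumptions made for illustration; they are not taken from this PR.

```python
import hashlib


def to_params(urls):
    """Build (fingerprint, url) parameter tuples, one per INSERT."""
    # Using a SHA-1 hex digest as the fingerprint is an assumption here.
    return [(hashlib.sha1(u.encode("utf-8")).hexdigest(), u) for u in urls]


def insert_links(session, urls, concurrency=50):
    """Insert links concurrently and return the exceptions of failed writes."""
    # Imported lazily so to_params() stays usable without cassandra-driver.
    from cassandra.concurrent import execute_concurrent_with_args

    # Hypothetical table and columns; adjust to the real data model.
    statement = session.prepare(
        "INSERT INTO links (fingerprint, url) VALUES (?, ?)"
    )
    results = execute_concurrent_with_args(
        session, statement, to_params(urls),
        concurrency=concurrency, raise_on_first_error=False,
    )
    # Each result is a (success, result_or_exc) pair.
    return [outcome for success, outcome in results if not success]
```

Here `session` would be a connected `cassandra.cluster.Session`. Whether this is faster than batch queries for a given workload would still need to be measured.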

@voith
Contributor Author

voith commented Nov 24, 2016

@wpxgit thanks for your feedback. I'll look into the execute_concurrent_with_args suggestion.

@sibiryakov I have been a little busy of late. I'm sorry for keeping this on hold! I'll see if I get some time this weekend.

@sibiryakov
Member

@voith NP! Let me know if you need anything; you took on a pretty interesting initiative.

@sibiryakov
Member

@voith any news on this?

@voith
Contributor Author

voith commented Jan 19, 2017

@sibiryakov the last time I worked on this, I came across this post, which states that using Cassandra as a queue is an anti-pattern. I was a little discouraged after reading it. I don't know if it's worth adding Cassandra as a backend. I would still be willing to work on this if you can convince me that this is not a major issue.

@sibiryakov
Member

@voith It's all about the implementation, and we never discussed it. I suggest reading the comments on this article, where people point out that using Cassandra for queues isn't impossible; you just need to take some details into account and design accordingly.

This PR implies designing the data model, and probably testing it with at least tens of gigabytes of data.

Worth looking into http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra
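
As one possible illustration of "designing accordingly", the sketch below pairs a queue table that uses leveled compaction with time-bucketed partitions, so that consumed rows can be removed a whole partition at a time. This is an assumption about how such a data model might look, not a design proposed in this thread; the table, its columns, and the bucket width are all hypothetical.

```python
from datetime import datetime, timezone

BUCKET_SECONDS = 3600  # hypothetical bucket width: one partition per hour

# Hypothetical schema. The compaction option is standard CQL; the table
# and column names are made up for this sketch.
CREATE_QUEUE_TABLE = """
CREATE TABLE IF NOT EXISTS queue (
    partition_id int,
    bucket bigint,
    score double,
    fingerprint text,
    url text,
    PRIMARY KEY ((partition_id, bucket), score, fingerprint)
) WITH compaction = {'class': 'LeveledCompactionStrategy'}
"""


def time_bucket(ts, width=BUCKET_SECONDS):
    """Map a timezone-aware datetime to the start of its bucket (epoch seconds)."""
    epoch = int(ts.timestamp())
    return epoch - epoch % width
```

With this layout a consumer reads one `(partition_id, bucket)` partition at a time and can delete the partition once it is drained, rather than deleting rows one by one. Whether that is enough to keep reads fast would have to be verified under load.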

@voith
Contributor Author

voith commented Jan 23, 2017

@sibiryakov thank you for getting my hopes up again. I will try to take a stab at this over the weekend.

@voith
Contributor Author

voith commented Feb 21, 2017

I'm closing this as I no longer have the motivation to continue it.

@voith voith closed this Feb 21, 2017
@voith voith deleted the PR-128-cassandra-backend branch February 21, 2017 17:27
@sibiryakov
Member

Thanks for trying, anyway!

@dyangelo-grullon

So, what's missing really? This feature is pretty desirable.

@sibiryakov
Member

Well, it has to work. Someone has to implement a queue suitable for crawling multiple domains and test it on a crawl of at least 10M domains, to make sure the queue operates fast enough.
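
As a rough, in-memory illustration of what "suitable for crawling from multiple domains" could mean, the sketch below picks the next batch by score while capping the number of requests per host, so that one large site cannot fill a whole batch. The function name, its parameters, and the capping rule are assumptions for illustration; a real implementation would run against the storage backend.

```python
from collections import defaultdict
from urllib.parse import urlparse


def next_batch(queued, max_n, max_per_host):
    """Return up to max_n URLs, best score first, at most max_per_host per host.

    queued is an iterable of (score, url) pairs.
    """
    per_host = defaultdict(int)
    batch = []
    for score, url in sorted(queued, key=lambda item: item[0], reverse=True):
        host = urlparse(url).hostname
        if per_host[host] >= max_per_host:
            continue  # this host already has its share of the batch
        per_host[host] += 1
        batch.append(url)
        if len(batch) == max_n:
            break
    return batch
```

For example, with three queued URLs from one host and one from another, a batch of three with a cap of two per host takes the two best URLs from the first host and then the URL from the second.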


5 participants