Elassandra benefits #47

Closed
vgkowski opened this issue Sep 27, 2016 · 4 comments

Comments

@vgkowski

Congrats on this solution, which looks very interesting. I already use Cassandra and Elasticsearch for different storage needs (C* as the source of truth and ES as the serving layer for user analytics), and I can see the benefit of deep integration from the data pipeline point of view, because there is no need to ingest the data twice. I would like to know whether the following assumptions are right:

  • Since data is stored in C*, which manages primary keys, Elassandra natively provides idempotence. For example, when writing from a Spark job, if one worker crashes the bulk insert is replayed, but duplicate entries are overwritten, so there are no duplicates in ES?
  • Since you provide a bidirectional relation between C* and ES, can both ES-Hadoop and the Spark-Cassandra connector be used to write to Elassandra?
  • Since the number of shards is linked to the number of partitions in Cassandra, will index shards naturally grow with the number of C* nodes or vnodes? Does that mean search performance would be better than with a static number of shards per index in standard ES?
  • I am always sceptical of forks, because keeping up with the upstream product is hard and time-consuming. I heard in your C* Summit video that few C* classes are modified but many ES classes are (more than 1000?). Aren't you afraid of falling behind the upstream versions? How many people are deeply involved in this project? For example, C* 2.2 reaches end of life next month, but the tick-tock releases aren't very stable...
    Again, I think the idea behind your project is very clever and interesting. The main risk is that low adoption will reduce the capacity to keep it in step with upstream C* and ES.
@vroyer
Collaborator

vroyer commented Sep 27, 2016

Hi,

Thanks for your comments. Here are some explanations of your four points:
1-Yes, using the Cassandra primary key avoids duplicates. Of course, if you index documents without any id, you could get duplicates because of automatic id generation.
2-Yes, both the ES and C* Spark connectors should work. The ES-Spark connector with pushdown is better for read operations that include filtering or aggregation, while the C*-Spark connector is probably better for write operations (no JSON generation and parsing).
3-Search/aggregation performance cannot be better than Elasticsearch (it is the Elasticsearch search code), but Elassandra is easier to scale by adding nodes. C* bootstrapping automatically increases the number of shards, increasing your overall throughput.
4-Keeping the code changes to a minimum was also a big challenge: only 170 ES classes out of 3900 are modified, and fewer than 10 classes are modified in C*. Unfortunately, the current bottleneck is mainly my free time right now...

Regards,
Vincent.
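
To illustrate point 1, here is a minimal sketch in plain Python (not Elassandra or Spark code). A dict stands in for the index, and the function names, the `user_id`/`day` key columns, and the sample rows are all hypothetical. It only shows why replaying a bulk write keyed by primary key leaves no duplicates, while replaying with automatically generated ids does.

```python
import uuid

def bulk_write_keyed(store, rows, key_fields):
    """Upsert each row under its primary key; a replay overwrites in place."""
    for row in rows:
        pk = tuple(row[f] for f in key_fields)
        store[pk] = row

def bulk_write_auto_id(store, rows):
    """Store each row under a freshly generated id; a replay adds new entries."""
    for row in rows:
        store[uuid.uuid4().hex] = row

# Hypothetical batch written by one Spark worker.
rows = [
    {"user_id": 1, "day": "2016-09-27", "clicks": 3},
    {"user_id": 2, "day": "2016-09-27", "clicks": 5},
]

keyed = {}
bulk_write_keyed(keyed, rows, ("user_id", "day"))
bulk_write_keyed(keyed, rows, ("user_id", "day"))  # replay after a crash

auto = {}
bulk_write_auto_id(auto, rows)
bulk_write_auto_id(auto, rows)  # replay after a crash

print(len(keyed), len(auto))
```

After the replay the keyed store still holds two rows, whereas the auto-id store holds four.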

@ddorian

ddorian commented Sep 27, 2016

as the number of shard is linked to the number of partition in Cassandra

That's incorrect. The number of shards is linked to the number of nodes: each index has one shard on each node. This is more efficient because results from multiple shards do not have to be merged within a single node.
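
A small sketch of the arithmetic behind this, under stated assumptions: the helper names and the fixed shard count of 5 are hypothetical, replicas are ignored, and the fixed-count case assumes shards are spread as evenly as possible across nodes.

```python
def shards_one_per_node(num_nodes: int) -> int:
    """Shard count when each index has exactly one shard on every node."""
    return num_nodes

def max_shards_on_a_node(num_shards: int, num_nodes: int) -> int:
    """Most shards any node holds when a fixed shard count is spread evenly."""
    return -(-num_shards // num_nodes)  # ceiling division

FIXED_SHARDS = 5  # hypothetical static shard count chosen at index creation

for nodes in (3, 6):
    print(
        f"nodes={nodes} "
        f"one-per-node shards={shards_one_per_node(nodes)} "
        f"fixed-count max shards per node={max_shards_on_a_node(FIXED_SHARDS, nodes)}"
    )
```

With one shard per node, the shard count follows the cluster size and no node ever merges results from more than one local shard. With a fixed count, some nodes may hold several shards when the cluster is small.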

@vgkowski
Author

OK, thanks. Do you have any large-scale reference deployments?

@vroyer
Collaborator

vroyer commented Sep 27, 2016

Not really, I have deployed on 6 nodes.
Any feedback from a larger deployment is welcome.
Thanks.

@vroyer vroyer closed this as completed Mar 9, 2017