Elassandra benefits #47

vgkowski · 2016-09-27T13:11:18Z

Congrats on this solution, which looks very interesting. I am already using Cassandra and Elasticsearch to serve different storage needs (C* as the source of truth and ES as the serving layer for user analytics), and I can see the benefit of deep integration from the data pipeline point of view, because there is no need to ingest the data twice. But I would like to know whether the following assumptions are right:

1-As data is stored in C*, which manages primary keys, Elassandra natively provides idempotence. For example, when writing from a Spark job, if one worker crashes the bulk insert will be replayed, but duplicated entries will be overwritten, so there will be no duplicates in ES?
2-As you provide a bidirectional relation between C* and ES, you can use both ES-Hadoop and the Spark-Cassandra connector to write to Elassandra?
3-As the number of shards is linked to the number of partitions in Cassandra, index shards will natively grow with the number of C* nodes or vnodes? Does that mean search performance would be better than with a static number of shards per index in standard ES?
4-I am always sceptical about forks, because following the evolution of the source product is very hard and time-consuming. I heard in your C* Summit video that few C* classes are modified but lots of ES classes are (more than 1000?). Aren't you afraid of falling behind the versions of the original products? How many resources are committed to this project? For example, C* 2.2 reaches end of life next month, but the tick-tock releases aren't very stable...
Again, I think the idea behind your project is very clever and interesting. The main risk is low adoption, which would reduce the capacity to keep it in sync with upstream C* and ES.

vroyer · 2016-09-27T17:09:18Z

Hi,

Thanks for your comments. Here are some explanations for your four points:
1-Yes, using the Cassandra primary key avoids duplicates. Of course, if you index documents without any id, you could get duplicates because of automatic id generation.
2-Yes, both the ES and C* Spark connectors should work. The ES-Spark connector with pushdown is better for read operations that include filtering or aggregation, while the C*-Spark connector is probably better for write operations (no JSON generation and parsing).
3-Search/aggregation performance cannot be better than Elasticsearch's (as it is the Elasticsearch search code), but Elassandra is easier to scale by adding nodes. C* bootstrapping automatically increases the number of shards, increasing your overall throughput.
4-Keeping the code changes to a minimum was also a big challenge: there are only 170 modified classes out of 3900 in ES, and fewer than 10 modified classes in C*. Unfortunately, the main bottleneck right now is my free time...

Regards,
Vincent.
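
To illustrate point 1, here is a toy sketch of why keyed writes survive a replay. It is not Elassandra code: a plain dict stands in for the index, and `replay_bulk`, `pk` and the sample documents are hypothetical names chosen for the example.

```python
import uuid


def replay_bulk(index, docs, use_primary_key):
    """Write docs into a dict that stands in for a search index."""
    for doc in docs:
        # With a primary key, the document id is derived from the row key,
        # so a replayed write overwrites the earlier copy.
        # Without one, a fresh random id is generated on every write.
        doc_id = doc["pk"] if use_primary_key else uuid.uuid4().hex
        index[doc_id] = doc
    return index


batch = [{"pk": "user-1", "clicks": 3}, {"pk": "user-2", "clicks": 7}]

# Same batch written twice, simulating a replay after a worker crash.
keyed = {}
replay_bulk(keyed, batch, use_primary_key=True)
replay_bulk(keyed, batch, use_primary_key=True)

auto = {}
replay_bulk(auto, batch, use_primary_key=False)
replay_bulk(auto, batch, use_primary_key=False)

print(len(keyed), len(auto))  # 2 4
```

The keyed index still holds two documents after the replay, while the auto-id index holds four, which is the duplicate case mentioned in point 1.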

ddorian · 2016-09-27T18:53:18Z

as the number of shards is linked to the number of partitions in Cassandra

That's incorrect. Number of shards is linked to number of nodes. Each index will have 1 shard on each node. This is more efficient because you don't have to merge results from multiple shards inside 1 node.
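
As a rough sketch of the merge step this refers to, assuming each node returns one list of hits already sorted by descending score: the function `merge_hits` and the sample scores are made up for illustration and are not taken from Elasticsearch or Elassandra.

```python
import heapq


def merge_hits(per_shard_hits, size):
    """Merge per-shard hit lists (each sorted by descending score)
    into a global top-`size` list."""
    merged = heapq.merge(*per_shard_hits, key=lambda hit: -hit[0])
    return list(merged)[:size]


# One shard per node: a 3-node cluster yields 3 partial results to merge.
per_node = [
    [(0.9, "doc-a"), (0.4, "doc-b")],  # node 1
    [(0.8, "doc-c")],                  # node 2
    [(0.7, "doc-d"), (0.1, "doc-e")],  # node 3
]

top = merge_hits(per_node, size=3)
print(top)  # [(0.9, 'doc-a'), (0.8, 'doc-c'), (0.7, 'doc-d')]
```

With several shards per node, the list passed to `merge_hits` would grow by that factor, which is the extra merge work that one shard per node avoids.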

vgkowski · 2016-09-27T19:44:05Z

OK, thanks. Do you have any large-scale references?

vroyer · 2016-09-27T22:18:07Z

Not really, I have deployed on 6 nodes.
Any feedback from a larger deployment is welcome.
Thanks.

vroyer closed this as completed Mar 9, 2017
