Facebook Presto #1139

Closed
ChrisFeldmeier opened this issue Mar 29, 2016 · 44 comments

Comments

@ChrisFeldmeier

commented Mar 29, 2016

Is it possible to connect Facebook's prestodb.io to scylla? Has anyone tested this yet?

@avikivity

Contributor

commented Mar 29, 2016

It's not clear whether the connector requires Thrift or whether Thrift is optional. If Thrift is optional, prestodb should work.

@ChrisFeldmeier

Author

commented Mar 29, 2016

I don't know. The documentation doesn't seem to say anything about that. I have created an issue and am waiting for an answer: prestodb/presto#4894

@dorlaor

Contributor

commented May 2, 2016

@tzach can you please post your latest Presto experience here?

@tzach

Contributor

commented May 2, 2016

Presto uses CQL, and from my preliminary tests it just works with Scylla.
I used the Cassandra connector and followed the instructions at
https://prestodb.io/docs/current/connector/cassandra.html

Here are some basic commands I tested with:

presto:default> SELECT * FROM cassandra.mykeyspace.users where user_id >= 2 and user_id <= 3;
 user_id | fname  | lname 
---------+--------+-------
       2 | dor    | laor  
       3 | shlomi | laor  
(2 rows)

presto:default> SELECT * FROM cassandra.mykeyspace.users where regexp_like(user_id, 'd.*')

presto:default> SELECT * FROM cassandra.mykeyspace.users where regexp_like(fname, 'd.*');
 user_id | fname | lname 
---------+-------+-------
       2 | dor   | laor  
(1 row)

presto:default> select * from (select * from cassandra.mykeyspace.users where lname='laor') where fname='shlomi';
 user_id | fname  | lname 
---------+--------+-------
       3 | shlomi | laor  
(1 row)

presto:default> select lname, count(*) from cassandra.mykeyspace.users group by lname;
  lname   | _col1 
----------+-------
 livyatan |     1 
 laor     |     2 
 livne    |     1 
 kivity   |     1 
(4 rows)

Data used for the above, provisioned directly in Scylla:

cqlsh> CREATE KEYSPACE mykeyspace WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
cqlsh> USE mykeyspace;
cqlsh:mykeyspace> CREATE TABLE users (user_id int PRIMARY KEY, fname text, lname text);
cqlsh:mykeyspace> insert into users (user_id , fname, lname) values (1, 'tzach', 'livyatan');
cqlsh:mykeyspace> insert into users (user_id , fname, lname) values (2, 'dor', 'laor');
cqlsh:mykeyspace> insert into users (user_id , fname, lname) values (3, 'shlomi', 'laor');
cqlsh:mykeyspace> insert into users (user_id , fname, lname) values (4, 'shlomi', 'livne');
cqlsh:mykeyspace> insert into users (user_id , fname, lname) values (6, 'avi', 'kivity');
@tzach

Contributor

commented May 29, 2016

@ChrisFeldmeier did you manage to run Presto with Scylla?

@avikivity

Contributor

commented May 29, 2016

Closing, as @tzach showed presto+scylla works.

@avikivity avikivity closed this May 29, 2016
@ChrisFeldmeier

Author

commented May 29, 2016

Not tested yet.


@gslin

commented Jun 27, 2016

I built a cluster today, and it seems Presto 0.149 tries to use Thrift:

ubuntu@ip-172-31-21-172:~/presto-server-0.149$ ../presto                                                      
presto> USE cassandra.kktest;
presto:kktest> SELECT * FROM api_login LIMIT 1;
Query 20160627_020144_00005_decmp failed: java.io.IOException: Unable to connect to server 172.31.15.98:9160
presto:kktest> 

As expected, port 9042 (Native) is open, and port 9160 (Thrift) is not:

ubuntu@ip-172-31-21-172:~/presto-server-0.149$ telnet 172.31.15.98 9042
Trying 172.31.15.98...
Connected to 172.31.15.98.
Escape character is '^]'.
^]
telnet> q
Connection closed.
ubuntu@ip-172-31-21-172:~/presto-server-0.149$ telnet 172.31.15.98 9160
Trying 172.31.15.98...
telnet: Unable to connect to remote host: Connection refused

May I ask @tzach about the settings?
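
For anyone repeating this check on a machine without telnet, the same probe can be sketched in a few lines of Python. This is only an illustration: the helper names `port_is_open` and `describe_ports` are made up for this example, and the port numbers are the ones shown in the session above (9042 for the native CQL protocol, 9160 for Thrift).

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError on refusal, timeout, or unreachable host.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def describe_ports(host: str) -> dict:
    """Probe the two ports discussed in this thread (hypothetical helper)."""
    ports = {"native (CQL)": 9042, "thrift": 9160}
    return {name: port_is_open(host, port) for name, port in ports.items()}
```

Calling `describe_ports("172.31.15.98")` against the node above would be expected to report the native port as reachable and the Thrift port as not, matching the telnet output.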

@tzach

Contributor

commented Jun 27, 2016

Hi @gslin, thanks for testing.

I restarted with Presto 0.149 and Scylla 1.2, and it's working for me.

Here is a gist with my trivial setup (one Presto node):
https://gist.github.com/tzach/31e44b23926e92e3ff3d28630c2ba422

And here is a gist with some tests I ran, entering data via cqlsh and querying via Presto:
https://gist.github.com/tzach/7d3a4540264418fdb15aa9fa159e0188

I'm not sure why Presto even tries to access Thrift in your setup.
Maybe the difference is in etc/catalog/cassandra.properties?

@ustczen

commented Jul 9, 2016

Hi @tzach, many thanks for your setup tutorial. I get the expected output with your sample test.
However, when I tested Presto + Scylla with tons of data, I hit the same problem as @gslin, and Presto threw this exception:
2016-07-09T23:41:36.532+0800 ERROR Query-20160709_154134_00001_3376r-164 com.facebook.presto.cassandra.CassandraThriftConnectionFactory Unable to connect to server 127.0.0.1:9160
java.io.IOException: Unable to connect to server 127.0.0.1:9160
    at com.facebook.presto.cassandra.CassandraThriftConnectionFactory.createConnection(CassandraThriftConnectionFactory.java:88)
    at com.facebook.presto.cassandra.CassandraThriftConnectionFactory.getClientFromAddressList(CassandraThriftConnectionFactory.java:68)
    at com.facebook.presto.cassandra.CassandraThriftConnectionFactory.create(CassandraThriftConnectionFactory.java:53)
    at com.facebook.presto.cassandra.CassandraThriftClient.getRangeMap(CassandraThriftClient.java:38)
    at com.facebook.presto.cassandra.CassandraTokenSplitManager.getSplits(CassandraTokenSplitManager.java:65)
    at com.facebook.presto.cassandra.CassandraSplitManager.getSplitsByTokenRange(CassandraSplitManager.java:99)
    at com.facebook.presto.cassandra.CassandraSplitManager.getSplits(CassandraSplitManager.java:82)
    at com.facebook.presto.split.SplitManager.getSplits(SplitManager.java:45)
    at com.facebook.presto.sql.planner.DistributedExecutionPlanner$Visitor.visitTableScan(DistributedExecutionPlanner.java:112)
    at com.facebook.presto.sql.planner.DistributedExecutionPlanner$Visitor.visitTableScan(DistributedExecutionPlanner.java:92)
    at com.facebook.presto.sql.planner.plan.TableScanNode.accept(TableScanNode.java:135)
    at com.facebook.presto.sql.planner.DistributedExecutionPlanner$Visitor.visitFilter(DistributedExecutionPlanner.java:162)
    at com.facebook.presto.sql.planner.DistributedExecutionPlanner$Visitor.visitFilter(DistributedExecutionPlanner.java:92)
    at com.facebook.presto.sql.planner.plan.FilterNode.accept(FilterNode.java:71)
    at com.facebook.presto.sql.planner.DistributedExecutionPlanner$Visitor.visitLimit(DistributedExecutionPlanner.java:258)
    at com.facebook.presto.sql.planner.DistributedExecutionPlanner$Visitor.visitLimit(DistributedExecutionPlanner.java:92)
    at com.facebook.presto.sql.planner.plan.LimitNode.accept(LimitNode.java:86)
    at com.facebook.presto.sql.planner.DistributedExecutionPlanner.plan(DistributedExecutionPlanner.java:78)
    at com.facebook.presto.sql.planner.DistributedExecutionPlanner.plan(DistributedExecutionPlanner.java:83)
    at com.facebook.presto.execution.SqlQueryExecution.planDistribution(SqlQueryExecution.java:303)
    at com.facebook.presto.execution.SqlQueryExecution.start(SqlQueryExecution.java:226)
    at com.facebook.presto.execution.QueuedExecution.lambda$start$1(QueuedExecution.java:62)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
    at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
    at org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
    at org.apache.cassandra.thrift.TFramedTransportFactory.openTransport(TFramedTransportFactory.java:41)
    at com.facebook.presto.cassandra.CassandraThriftConnectionFactory.createConnection(CassandraThriftConnectionFactory.java:84)
    ... 24 more
Caused by: java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)

@tzach

Contributor

commented Jul 10, 2016

Hi @ustczen, thanks for reproducing this issue.
If I'm reading the stack trace correctly, Presto tries to fall back to Thrift rather than use CQL. Scylla does not support Thrift yet, so the client fails. Scylla will support Thrift very soon (in the upcoming 1.3 release), but as long as Presto uses CQL, it should work now.

Can you please elaborate on your setup, and how much data you loaded before encountering this issue? Also, can you paste your catalog/cassandra.properties file?

Thanks

@gslin

commented Jul 10, 2016

I can reproduce it this way:

Create keyspace and table in cqlsh:

CREATE KEYSPACE mykeyspace WITH replication = {'class':'SimpleStrategy', 'replication_factor':1};
USE mykeyspace;
CREATE TABLE access_log (id UUID PRIMARY KEY, ip TEXT, ident TEXT, user TEXT, timestamp INT, method TEXT, url TEXT, status INT, byte BIGINT, referer TEXT, useragent TEXT);

Use the following command to import:

cat access.log | ./combined-to-sql.sh | cqlsh

Then run presto-cli with:

SELECT COUNT(*) FROM cassandra.mykeyspace.access_log;

Then I get the Thrift error message.

@tzach

Contributor

commented Jul 10, 2016

@gslin thanks.
I was able to reproduce your use case. It looks like Presto uses both CQL and Thrift for some reason and chooses one based on the case. I will try to dig into Presto to understand whether it can be updated to use CQL only.

@duarten can you use this case to test the new Scylla Thrift feature?

@dorlaor

Contributor

commented Jul 10, 2016

@tzach and @gslin, you can try the latest nightly binaries, which contain the Thrift code.


@ustczen

commented Jul 10, 2016

@tzach, my etc/catalog/cassandra.properties follows your post exactly.
I imported about 100 million lines into Scylla before I hit the exception above. I guess Presto uses Thrift under special circumstances.
I would appreciate it if you could estimate the release date of Scylla 1.3.

@tzach

Contributor

commented Jul 10, 2016

@ustczen thanks for reporting this issue.
Scylla 1.3 is planned for later this month.

@dorlaor

Contributor

commented Jul 10, 2016

@ustczen, don't wait for 1.3; test it using our nightly build.
That way, if something isn't in place, we'll fix it for 1.3.


@duarten

Contributor

commented Jul 10, 2016

@tzach Sounds like a good candidate, I'll give it a spin asap.

@tzach

Contributor

commented Jul 11, 2016

@ustczen @gslin thanks for pointing us to this issue.
@duarten found that Presto uses the Thrift describe_splits verb, which was not part of the initial Scylla Thrift implementation. This means Scylla's Presto support is limited for now, and we will fix it following the 1.3 release. See #1445 for more.

@tzach tzach added this to the 1.4 milestone Jul 11, 2016
@tzach tzach added the bug label Jul 11, 2016
@cawallin

commented Jul 21, 2016

Hi, I work on Presto -- the Cassandra connector for Presto uses CQL if you are selecting fewer than cassandra.limit-for-partition-key-select partition keys (200 by default), but if you are selecting more than that, it uses Thrift to figure out how data is partitioned on Cassandra.
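
The rule described above can be sketched as a small decision function. This is an illustration of the behavior as stated in the comment, not the connector's actual code: the function name `split_strategy` is hypothetical, and how the connector treats a count exactly equal to the limit is an assumption here, since the comment only covers "less than" and "more than".

```python
# Default value of cassandra.limit-for-partition-key-select quoted in the comment.
DEFAULT_LIMIT_FOR_PARTITION_KEY_SELECT = 200


def split_strategy(num_partition_keys: int,
                   limit: int = DEFAULT_LIMIT_FOR_PARTITION_KEY_SELECT) -> str:
    """Return which path the connector is described as taking (hypothetical helper).

    Below the limit, partitions are selected directly over CQL. Otherwise the
    connector asks Cassandra over Thrift how the data is partitioned.
    Treating the boundary case (== limit) as "thrift" is an assumption.
    """
    if num_partition_keys < limit:
        return "cql"
    return "thrift"
```

Under this reading, the five-row users table earlier in the thread stays on the CQL path, while a full scan over roughly 100 million rows crosses the limit and needs the Thrift port, which is consistent with the errors reported above.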

@penberg penberg reopened this Jul 21, 2016
@penberg

Contributor

commented Jul 21, 2016

I think @duarten is working on it.

@duarten

Contributor

commented Jul 27, 2016

@cawallin Thanks for the explanation!

@penberg Yep, df346d8 should fix this.

I still have to make sure it's working well with Presto though.

@penberg

Contributor

commented Jul 27, 2016

@duarten Can you please double-check that all the relevant commits are backported to 1.3? I think they are but it's very easy to make mistakes with that.

@duarten

Contributor

commented Jul 27, 2016

@penberg They are, yes.

@tzach tzach removed this from the 1.3 milestone Aug 18, 2016
avikivity added a commit that referenced this issue Aug 23, 2016
Size estimates for a particular column family are recorded every 5
minutes. However, when a user calls the describe_splits(_ex) verbs,
they may want to see estimates for a recently created and updated
column family; this is legitimate and common in testing. However, a
client may also call describe_splits(_ex) very frequently and
recording the estimates on every call is wasteful and, worse, can
cause clients to give up. This patch fixes this by only recording
estimates if the first attempt to query them produces no results.

Refs #1139

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1471900595-4715-1-git-send-email-duarte@scylladb.com>
@duarten

Contributor

commented Aug 23, 2016

@tzach I think I fixed the issue (it reliably works for me). Please retest with a build that includes commit 440c1b2 when possible.

@tzach

Contributor

commented Aug 23, 2016

Thanks @duarten I will test with 1.3 AMI once it is ready

@penberg

Contributor

commented Aug 23, 2016

@tzach That commit is not in 1.3...

@avikivity

Contributor

commented Aug 23, 2016

@duarten how does that commit explain the problem?

@duarten

Contributor

commented Aug 23, 2016

@avikivity Without it, the client just seems to give up and not report the whole number of rows.

@avikivity

Contributor

commented Aug 23, 2016

Ah, so it's just performance?

Maybe we should update in the background, so we already return immediately.

@duarten

Contributor

commented Aug 23, 2016

Right. We update the size estimates in the background, every 5 minutes,
it's just that for some use cases, mostly tests, users might want to see
the stats immediately.


@tzach

Contributor

commented Aug 23, 2016

@duarten does that mean the Presto count will give the right result after 5 minutes?

@tzach

Contributor

commented Aug 29, 2016

Created a Docker image for quick test of Scylla+Presto
https://hub.docker.com/r/tzachl/scylla-and-presto-image/

penberg added a commit that referenced this issue Aug 29, 2016
(cherry picked from commit 440c1b2)
@duarten

Contributor

commented Aug 29, 2016

@tzach Sorry, had forgotten about your question. It wouldn't work, as the thrift verb was always causing the size estimates to be recorded, on the assumption describe_splits_ex wouldn't be called often (Titan calls it with a big range, while Presto calls it often with small ranges), and that just slowed things down enough to cause the Presto client to give up. The fix was to query first, and only trigger a recording of the estimates if the query returns an empty result set. On a related note, I opened #1616.
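
The "query first, record only on an empty result" fix described above can be sketched as follows. This is a simplified model, not Scylla's implementation: the class `SizeEstimates`, its method names, and the in-memory list standing in for the stored estimates are all hypothetical, and the periodic 5-minute background refresh is left out.

```python
from typing import Callable, List


class SizeEstimates:
    """Toy model of stored size estimates behind describe_splits_ex."""

    def __init__(self, compute: Callable[[], List[int]]):
        self._compute = compute          # stands in for the expensive recording step
        self._stored: List[int] = []     # stands in for the stored estimates
        self.record_calls = 0            # counts how often the expensive step ran

    def record(self) -> None:
        self.record_calls += 1
        self._stored = self._compute()

    def describe_splits_before_fix(self) -> List[int]:
        # Old behavior as described: every call triggers a recording.
        self.record()
        return list(self._stored)

    def describe_splits_after_fix(self) -> List[int]:
        # New behavior as described: query first, record only if nothing is stored.
        if not self._stored:
            self.record()
        return list(self._stored)
```

With a client that calls the verb many times over small ranges, as Presto is said to do, the first variant pays the recording cost on every call, while the second pays it at most once until the stored estimates are next refreshed.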

@slivne slivne modified the milestones: 1.3, 1.4 Aug 30, 2016
@slivne

Contributor

commented Sep 20, 2016

@duarten is this resolved? If so, can you reference the commit in head or the associated issue that has this info?

@duarten

Contributor

commented Sep 20, 2016

The last commit I did as part of this issue was 9ec939f. For me, Presto integration has been working reliably since then.

@penberg

Contributor

commented Sep 20, 2016

And that commit is in the upcoming 1.3.1 release, so I think we can just close this issue.

@slivne

Contributor

commented Sep 20, 2016

music to my ears ... closing

@slivne slivne closed this Sep 20, 2016
duarten added a commit to duarten/scylla that referenced this issue Sep 29, 2016
This patch fixes a bug where queries such as the following are not
handled properly:

"SELECT * FROM ks.cf WHERE token(id) >
9207857967443869328 AND token(id) <= -9223372036854775808"

Here -9223372036854775808 represents the minimum token, which we were
just translating into a token with kind::key, thus returning incorrect
results.

Ref scylladb#1139
Ref scylladb#693

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
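
One way to read the token query in this commit message: compared literally, no token can be both greater than 9207857967443869328 and at most -9223372036854775808, so the filter would match nothing. Under Cassandra's convention, a range whose upper bound is the minimum token is taken to run to the end of the ring. The sketch below illustrates that convention only. The helper name `token_in_range` is hypothetical, and this is an assumption about the intended semantics rather than a description of the actual patch.

```python
MIN_TOKEN = -(2 ** 63)  # -9223372036854775808, the minimum token in the query above


def token_in_range(token: int, start: int, end: int) -> bool:
    """Membership test for the half-open token range (start, end].

    Assumption: end == MIN_TOKEN means "up to the end of the ring",
    so only the lower bound applies in that case.
    """
    if end == MIN_TOKEN:
        return token > start
    return start < token <= end
```

For example, `token_in_range(9207857967443869329, 9207857967443869328, MIN_TOKEN)` is true under this reading, whereas the literal comparison `start < token <= MIN_TOKEN` is false for every token.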
duarten added a commit to duarten/scylla that referenced this issue Sep 29, 2016
This patch re-implements the describe_splits_ex() verb to more closely
follow Cassandra's implementation, on which some clients rely.

Ref scylladb#1139
Ref scylladb#693

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
duarten added a commit to duarten/scylla that referenced this issue Sep 30, 2016
This patch fixes a bug where queries such as the following are not
handled properly:

"SELECT * FROM ks.cf WHERE token(id) >
9207857967443869328 AND token(id) <= -9223372036854775808"

Here -9223372036854775808 represents the minimum token, which we were
just translating into a token with kind::key, thus returning incorrect
results.

Ref scylladb#1139
Ref scylladb#693
Fixes scylladb#1717

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
avikivity added a commit that referenced this issue Oct 11, 2016
"This patch-set re-implements the describe_splits_ex() verb to more closely
follow Cassandra's implementation, on which some clients rely.

Ref #1139
Ref #693"

* 'describe-splits/v2' of github.com:duarten/scylla:
  thrift: Implement describe_splits_ex based on Cassandra
  storage_service: Implement get_splits() function
  sstables: Add function to get key samples
  sstables/key: Add to_partition_key function
  size_estimates_recorder: Increase estimate accuracy
  sstables: Get estimates for a particular range
  sstables/key: Make key::kind public