
Performance issues #7

Closed
CharlesHD opened this issue Jul 3, 2018 · 13 comments
Labels: bug Something isn't working

@CharlesHD (Author) commented Jul 3, 2018

Hi,
first of all, thanks for datahike. I really think this software has its place in the Clojure ecosystem.
So I started using it, and I created a small (110k eids) file database.
The query engine exhibits some strange performance behaviour. I'll use my own database's logic as an example, since I haven't tried to reproduce it on another database.

(time (ds/q '[:find ?p ?v
                   :where
                   [2 ?p ?v]] @db ))
"Elapsed time: 5.704363 msecs"
#{[:similarity/name "w2v"] [:w2v/iterations 10] [:w2v/window-size 5] [:w2v/layer-size 20] [:w2v/seed 42] [:w2v/stop-words []] [:w2v/min-word-frequency 5]}
;; using reference directly is really fast

(time (ds/q '[:find ?sim :where
                    [?id :trajectory/dataset "huffpost"]
                    [?id :trajectory/query "moon"]
                    [?id :trajectory/sim ?sim]] @db))
"Elapsed time: 9.537365 msecs"
#{[2]}
;; Getting reference is really fast

(time (ds/q '[:find ?sim ?p ?v 
                   :where
                  [?id :trajectory/dataset "huffpost"]
                  [?id :trajectory/query "moon"]
                  [?id :trajectory/sim ?sim]
                  [?sim ?p ?v]] @db))
"Elapsed time: 3908.029339 msecs"
#{[2 :w2v/iterations 10] [2 :w2v/min-word-frequency 5] [2 :similarity/name "w2v"] [2 :w2v/stop-words []] [2 :w2v/layer-size 20] [2 :w2v/seed 42] [2 :w2v/window-size 5]}
;; Doing the union is really slow 

(time (ds/q '[:find ?sim ?p ?v :where
                                     [?id :trajectory/sim ?sim]
                                     [?sim ?p ?v]] @db))
"Elapsed time: 4246.928569 msecs" 
#{[25490 :w2v/min-word-frequency 5] [84456 :w2v/seed 42] [41442 :w2v/window-size 5] [98356 :w2v/layer-size 20]  ...}
;; not doing the union gives a similar time (somewhat slower)

Any idea what may cause that tremendous overhead in the union?

@whilo (Member) commented Jul 3, 2018

That is odd. I will take a look into it. Could you try reordering the :where clauses as a quick check?

@CharlesHD (Author) commented Jul 4, 2018

I'll try some reordering this morning.

Update: OK, I've done some reordering. As expected in Datalog, it's way worse to have the unrestricted clause first. This is not on the same machine, so the query that gave 4s yesterday gives ~8s today.

(time (ds/q '[:find ?sim ?p ?v :where
                                     [?sim ?p ?v]
                                     [?id :trajectory/dataset "huffpost"]
                                     [?id :trajectory/query "moon"]
                                     [?id :trajectory/sim ?sim]] @db))
"Elapsed time: 30940.753258 msecs"
;; that's ~4 times worse

Other orderings with [?sim ?p ?v] in last position give the same order of magnitude (~8s).

@whilo (Member) commented Jul 4, 2018

Ok, thanks! I will have to study the query engine then and maybe a profiler helps.

@whilo (Member) commented Jul 14, 2018

I would like to do this over the next week. Can you provide a dataset for reproduction? You can also send me your dataset confidentially via mail for reproduction only, if this is ok for you.

@CharlesHD (Author) commented Jul 15, 2018

Sure, I'll craft you a dataset tomorrow.
EDIT: I got some work overload. I'll give it to you in the next hours.
EDIT2 @whilo: Do you have a mail address or something for me to send you the DB? Is a zip of the datahike directory OK for you to import the db?

@whilo (Member) commented Jul 25, 2018

Yes, a zip of the datahike directory is OK. Just send it to christian at topiq.es.

@tonsky (Contributor) commented Dec 23, 2018

The datascript engine does a very straightforward thing here:

[?id :trajectory/sim ?sim]
[?sim ?p ?v]

Between those two clauses it needs to make a join. The first relation is already calculated; it is a single tuple (or might be two or three tuples, doesn't matter). The second one needs to be calculated. It has no "clues" for the query engine (none it can use anyway). ?p and ?v are unknown, and ?sim is sort of known, but the engine doesn't know how to use that knowledge. For the query engine this clause is opaque, so it resolves to all datoms in the DB. Then the join happens, which unifies the values of ?sim in both relations. The long execution times come from the second relation being equal to the whole DB in size.
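In pseudocode, the behaviour described above looks roughly like this (a simplified sketch for illustration only, not datascript's actual join code; `sim-eids` is a hypothetical stand-in for the ?sim values of the first relation):

```clojure
(require '[datascript.core :as d])

;; Simplified sketch of the naive strategy: the opaque clause
;; [?sim ?p ?v] is resolved against the ENTIRE database, and only
;; the join afterwards filters the result down to the bound ?sim ids.
(defn naive-join [db sim-eids]
  (let [all-triples (for [datom (d/datoms db :eavt)]   ; whole DB!
                      [(:e datom) (:a datom) (:v datom)])]
    (filter (fn [[e _ _]] (contains? sim-eids e)) all-triples)))
```

The `filter` step is cheap; the cost is materializing `all-triples`, which is proportional to the size of the database rather than to the handful of matching entities.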

It seems obvious what should be done if ?sim is known to have just one possible value. If it has multiple values, though, it's not clear what the strategy should be. We can query the index for each value, but at some point this becomes slower than actually walking the whole DB. Identifying this threshold is non-trivial.
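The per-value index strategy mentioned above can be sketched like this (again a hypothetical illustration, not engine code; `d/datoms` with an entity id seeks the :eavt index directly):

```clojure
(require '[datascript.core :as d])

;; Per-value strategy: instead of walking the whole DB, seek the
;; :eavt index once per bound ?sim entity id. Cost is proportional
;; to the number of bound values times their datom counts.
(defn sim-triples [db sim-eids]
  (into #{}
        (mapcat (fn [eid]
                  (for [datom (d/datoms db :eavt eid)]
                    [eid (:a datom) (:v datom)])))
        sim-eids))
```

With one or a few ?sim values this is vastly cheaper than a full scan; with very many values the repeated index seeks eventually lose to a single sequential walk, which is exactly the threshold problem described above.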

For now I can recommend that you use :where clauses for filtering and the pull API for attribute retrieval:

(ds/q '[:find [(pull ?sim [*]) ...]
        :where
        [?id :trajectory/dataset "huffpost"]
        [?id :trajectory/query "moon"]
        [?id :trajectory/sim ?sim]]
      @db)

@whilo (Member) commented Dec 25, 2018

@tonsky Thanks for replying! That makes sense. I think in general better query planning would be helpful, but Datomic is also not doing a lot in this direction yet, I think. I guess this has to do with not having indices per attribute (tables). Even if there were only a few ?sim values it would probably be better to seek them individually, but this first requires some reasonable statistics with respect to the indices.

@CharlesHD (Author)

@tonsky thanks for the explanation. It makes sense indeed, even if I would expect the query engine to be able to use the restriction given by clause 1 on clause 2.

Fortunately, the pull API can express this kind of restriction, even if it is less fun to use in my opinion.

Now that the reason is exposed, I'll leave the issue open until the datahike team makes a decision about whether or not to fix this query behaviour in datahike.

@kordano added the "bug" label Jan 16, 2019
@whilo (Member) commented Jan 21, 2019

It is arguable whether it is a bug. Datomic kind of takes the position that developers should be aware of the sizes of their relations. I agree that the query engine should automatically figure that out, but it is not obvious what the tradeoffs are exactly. It could be that tracking index statistics on these flat indices is too expensive, I have not thought that through yet.

@kordano (Member) commented Jan 22, 2019

How should we approach these kinds of issues?

@whilo (Member) commented Jan 23, 2019

For this we need a query planner. I am brainstorming about this as well for the probabilistic inference engine (anytime AI). For datahike it means that we need to quantify costs first, then take the query clauses and index statistics, build a graph of the query, and try to push constraints that minimize index operations (cost) as close to the index grounding as possible. We also need to track statistics about the attributes and entities in the database accordingly. It is a non-trivial task, but it can probably be approached in steps. It would help to read up a bit on SQL query planner implementations if you have time; e.g. Postgres or MySQL should be fine. Feel free to post papers that you find and we can discuss related work to study. I think we first need to address the schema handling properly, though.
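As a very first step toward such a planner, one could greedily order clauses by how many of their variables are still unbound, before any real index statistics exist. This is a toy sketch under that assumption, not datahike code; all names here are hypothetical:

```clojure
(require '[clojure.string :as str])

;; Toy greedy clause ordering: score a clause by the number of its
;; still-unbound variables (constants and bound vars restrict the
;; index scan), then repeatedly pick the cheapest clause. A real
;; planner would use index statistics instead of this crude score.
(defn free-vars [bound clause]
  (filter #(and (symbol? %)
                (str/starts-with? (name %) "?")
                (not (contains? bound %)))
          clause))

(defn order-clauses [clauses]
  (loop [bound #{} remaining (set clauses) ordered []]
    (if (empty? remaining)
      ordered
      (let [c (apply min-key #(count (free-vars bound %)) remaining)]
        (recur (into bound (free-vars #{} c))
               (disj remaining c)
               (conj ordered c))))))

;; Applied to the slow query from this thread, this pushes the fully
;; unrestricted clause [?sim ?p ?v] to the end of the ordering:
;; (order-clauses '[[?sim ?p ?v]
;;                  [?id :trajectory/dataset "huffpost"]
;;                  [?id :trajectory/query "moon"]
;;                  [?id :trajectory/sim ?sim]])
```

This captures only clause shape, not relation sizes; the statistics and cost quantification described above are what would let the planner decide between per-value index seeks and a full scan.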

@whilo (Member) commented Feb 12, 2019

We have this issue on our agenda (including a potential query planner) and will work with the folks of 3DF on this, but are not yet committed to any solution. I will close the issue for now, because we will stick to the Datomic/datascript default and leave it to the programmer.

@whilo closed this as completed Feb 12, 2019
4 participants