
Performance issues #7

Closed
CharlesHD opened this issue Jul 3, 2018 · 13 comments
Labels: bug Something isn't working

@CharlesHD (Author) commented Jul 3, 2018

Hi,
first of all, thanks for datahike. I really think this software has its place in the Clojure ecosystem.
So I started using it, and I created a small (110k eids) file database.
The query engine exhibits some strange performance behaviour. I'll use my own database's logic as an example, since I haven't tried to reproduce it on another database.

(time (ds/q '[:find ?p ?v
                   :where
                   [2 ?p ?v]] @db ))
"Elapsed time: 5.704363 msecs"
#{[:similarity/name "w2v"] [:w2v/iterations 10] [:w2v/window-size 5] [:w2v/layer-size 20] [:w2v/seed 42] [:w2v/stop-words []] [:w2v/min-word-frequency 5]}
;; using reference directly is really fast

(time (ds/q '[:find ?sim :where
                    [?id :trajectory/dataset "huffpost"]
                    [?id :trajectory/query "moon"]
                    [?id :trajectory/sim ?sim]] @db))
"Elapsed time: 9.537365 msecs"
#{[2]}
;; Getting reference is really fast

(time (ds/q '[:find ?sim ?p ?v 
                   :where
                  [?id :trajectory/dataset "huffpost"]
                  [?id :trajectory/query "moon"]
                  [?id :trajectory/sim ?sim]
                  [?sim ?p ?v]] @db))
"Elapsed time: 3908.029339 msecs"
#{[2 :w2v/iterations 10] [2 :w2v/min-word-frequency 5] [2 :similarity/name "w2v"] [2 :w2v/stop-words []] [2 :w2v/layer-size 20] [2 :w2v/seed 42] [2 :w2v/window-size 5]}
;; Doing the union is really slow 

(time (ds/q '[:find ?sim ?p ?v :where
                                     [?id :trajectory/sim ?sim]
                                     [?sim ?p ?v]] @db))
"Elapsed time: 4246.928569 msecs" 
#{[25490 :w2v/min-word-frequency 5] [84456 :w2v/seed 42] [41442 :w2v/window-size 5] [98356 :w2v/layer-size 20]  ...}
;; not doing the union gives a similar time (somewhat slower)

Any idea what may cause that tremendous overhead in the union?

@whilo (Member) commented Jul 3, 2018

That is odd. I will take a look into it. Could you try reordering the :where clauses as a quick check?

@CharlesHD (Author) commented Jul 4, 2018

I'll try some reordering this morning.

Update: OK, I've done some reordering. As expected in Datalog, it's way worse to have the unrestricted clause first. This is not on the same machine, so the query that gave 4s yesterday gives ~8s today.

(time (ds/q '[:find ?sim ?p ?v :where
                                     [?sim ?p ?v]
                                     [?id :trajectory/dataset "huffpost"]
                                     [?id :trajectory/query "moon"]
                                     [?id :trajectory/sim ?sim]] @db))
"Elapsed time: 30940.753258 msecs"
;; that's ~4 times worse

Other orderings with [?sim ?p ?v] in last position give the same order of magnitude (~8s).

@whilo (Member) commented Jul 4, 2018

Ok, thanks! I will have to study the query engine then and maybe a profiler helps.

@whilo (Member) commented Jul 14, 2018

I would like to do this over the next week. Can you provide a dataset for reproduction? You can also send me your dataset confidentially via mail for reproduction only, if this is ok for you.

@CharlesHD (Author) commented Jul 15, 2018

Sure, I'll craft you a dataset tomorrow.
EDIT: I got some work overload. I'll give it to you in the next hours.
EDIT2 @whilo: Do you have a mail address or something for me to send you the DB? Is a zip of the datahike directory OK for you to import the db?

@whilo (Member) commented Jul 25, 2018

Yes, a zip of the datahike directory is OK. Just send it to christian at topiq.es.

@tonsky (Contributor) commented Dec 23, 2018

The datascript engine does a very straightforward thing here:

[?id :trajectory/sim ?sim]
[?sim ?p ?v]

Between those two clauses it needs to make a join. The first relation is already calculated; it is a single tuple (or might be two or three tuples, doesn't matter). The second one needs to be calculated. It has no "clues" for the query engine (none it can use anyway). ?p and ?v are unknown, and ?sim is sort of known, but the engine doesn't know how to use that knowledge. For the query engine this clause is opaque, so it resolves to all datoms in the DB. Then the join happens, which unifies the values of ?sim in both relations. The long execution times come from the second relation being equal to the whole DB in size.
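In pseudocode, the behaviour described above looks roughly like this (a simplified sketch for illustration only, not datascript's actual join code; `sim-eids` is a hypothetical stand-in for the ?sim values of the first relation):

```clojure
(require '[datascript.core :as d])

;; Simplified sketch of the naive strategy: the opaque clause
;; [?sim ?p ?v] is resolved against the ENTIRE database, and only
;; the join afterwards filters the result down to the bound ?sim ids.
(defn naive-join [db sim-eids]
  (let [all-triples (for [datom (d/datoms db :eavt)]   ; whole DB!
                      [(:e datom) (:a datom) (:v datom)])]
    (filter (fn [[e _ _]] (contains? sim-eids e)) all-triples)))
```

The `filter` step is cheap; the cost is materializing `all-triples`, which is proportional to the size of the database rather than to the handful of matching entities.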

It seems obvious what should be done if ?sim is known to have just one possible value. If it has multiple values, though, it's not clear what the strategy should be. We can query the index for each value, but at some point this becomes slower than actually walking the whole DB. Identifying this threshold is non-trivial.
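The per-value index strategy mentioned above can be sketched like this (again a hypothetical illustration, not engine code; `d/datoms` with an entity id seeks the :eavt index directly):

```clojure
(require '[datascript.core :as d])

;; Per-value strategy: instead of walking the whole DB, seek the
;; :eavt index once per bound ?sim entity id. Cost is proportional
;; to the number of bound values times their datom counts.
(defn sim-triples [db sim-eids]
  (into #{}
        (mapcat (fn [eid]
                  (for [datom (d/datoms db :eavt eid)]
                    [eid (:a datom) (:v datom)])))
        sim-eids))
```

With one or a few ?sim values this is vastly cheaper than a full scan; with very many values the repeated index seeks eventually lose to a single sequential walk, which is exactly the threshold problem described above.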

For now I can recommend that you use :where clauses for filtering and the pull API for attribute retrieval:

(ds/q '[:find [(pull ?sim [*]) ...]
        :where
        [?id :trajectory/dataset "huffpost"]
        [?id :trajectory/query "moon"]
        [?id :trajectory/sim ?sim]]
      @db)

@whilo (Member) commented Dec 25, 2018

@tonsky Thanks for replying! That makes sense. I think in general better query planning would be helpful, but Datomic is also not doing a lot in this direction yet, I think. I guess this has to do with not having indices per attribute (tables). Even if there were only a few ?sim values it would probably be better to seek them individually, but this first requires some reasonable statistics with respect to the indices.

@CharlesHD (Author)

@tonsky thanks for the explanation. It makes sense indeed, even if I would expect the query engine to be able to use the restriction given by clause 1 on clause 2.

Fortunately, the pull API can express this kind of restriction, even if it is less fun to use in my opinion.

Now that the reason is exposed, I'll leave the issue open until the datahike team makes a decision about whether or not to fix this query behaviour in datahike.

@kordano added the "bug" label Jan 16, 2019
@whilo (Member) commented Jan 21, 2019

It is arguable whether it is a bug. Datomic kind of takes the position that developers should be aware of the sizes of their relations. I agree that the query engine should automatically figure that out, but it is not obvious what the tradeoffs are exactly. It could be that tracking index statistics on these flat indices is too expensive, I have not thought that through yet.

@kordano (Member) commented Jan 22, 2019

How should we approach these kinds of issues?

@whilo (Member) commented Jan 23, 2019

For this we need a query planner. I am brainstorming about this as well for the probabilistic inference engine (anytime AI). For datahike it means that we need to quantify costs first, then take the query clauses and index statistics, build a graph of the query, and try to push constraints that minimize index operations (cost) as close to the index grounding as possible. We also need to track statistics about the attributes and entities in the database accordingly. It is a non-trivial task, but it can probably be approached in steps. It would help to read up a bit on SQL query planner implementations if you have time; e.g. Postgres or MySQL should be fine. Feel free to post papers that you find and we can discuss related work to study. I think we first need to address the schema handling properly, though.
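As a very first step toward such a planner, one could greedily order clauses by how many of their variables are still unbound, before any real index statistics exist. This is a toy sketch under that assumption, not datahike code; all names here are hypothetical:

```clojure
(require '[clojure.string :as str])

;; Toy greedy clause ordering: score a clause by the number of its
;; still-unbound variables (constants and bound vars restrict the
;; index scan), then repeatedly pick the cheapest clause. A real
;; planner would use index statistics instead of this crude score.
(defn free-vars [bound clause]
  (filter #(and (symbol? %)
                (str/starts-with? (name %) "?")
                (not (contains? bound %)))
          clause))

(defn order-clauses [clauses]
  (loop [bound #{} remaining (set clauses) ordered []]
    (if (empty? remaining)
      ordered
      (let [c (apply min-key #(count (free-vars bound %)) remaining)]
        (recur (into bound (free-vars #{} c))
               (disj remaining c)
               (conj ordered c))))))

;; Applied to the slow query from this thread, this pushes the fully
;; unrestricted clause [?sim ?p ?v] to the end of the ordering:
;; (order-clauses '[[?sim ?p ?v]
;;                  [?id :trajectory/dataset "huffpost"]
;;                  [?id :trajectory/query "moon"]
;;                  [?id :trajectory/sim ?sim]])
```

This captures only clause shape, not relation sizes; the statistics and cost quantification described above are what would let the planner decide between per-value index seeks and a full scan.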

@whilo (Member) commented Feb 12, 2019

We have this issue on our agenda (including a potential query planner) and will work with the folks of 3DF on this, but are not yet committed to any solution. I will close the issue for now, because we will stick to the Datomic/datascript default and leave it to the programmer.

@whilo closed this as completed Feb 12, 2019
4 participants