create_node|edge_attr_index for SQLGraph #223
Co-authored-by: Jordão Bragantini <jordao.bragantini@gmail.com>
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #223      +/-   ##
==========================================
+ Coverage   88.45%   88.50%   +0.04%
==========================================
  Files          55       55
  Lines        3890     3993     +103
  Branches      674      700      +26
==========================================
+ Hits         3441     3534      +93
- Misses        267      275       +8
- Partials      182      184       +2
Maybe we need drop functions.
SQLGraph lets you create indexes on node or edge attributes to keep repeated
filters fast:
@yfukai this is awesome. Could you briefly mention what kind of speed-up we can expect with this? 2x, 10x?
I benchmarked the performance and added the result to the doc!
Co-authored-by: Jordão Bragantini <jordao.bragantini@gmail.com>
…nto sql_indexing
yfukai left a comment
I benchmarked the performance improvement from indexing. Code:
import tracksdata as td
import tempfile
import time

if __name__ == "__main__":
    for node_count in [1_000_000, 100_000_000]:
        print(f"\nBenchmarking SQLGraph with {node_count} nodes")
        graph_db_file = tempfile.NamedTemporaryFile(suffix=".db", delete=False).name
        graph: td.graph.SQLGraph = td.graph.SQLGraph(
            drivername="sqlite",
            database=graph_db_file,
            overwrite=True,
        )
        graph.add_node_attr_key("attr1", 0)
        graph.bulk_add_nodes([{td.DEFAULT_ATTR_KEYS.T: i, "attr1": i % 100} for i in range(node_count)])
        print("Finished adding nodes.")

        # measure time to filter nodes by attr1 without an index
        start_time = time.time()
        filtered_graph = graph.filter(td.NodeAttr("attr1") == 0).subgraph()
        end_time = time.time()
        time_without_index = end_time - start_time
        print(f"Time to filter nodes without index: {time_without_index:.2f} seconds")

        # create the index, then repeat the same filter
        graph.ensure_node_attr_index("attr1")
        start_time = time.time()
        filtered_graph = graph.filter(td.NodeAttr("attr1") == 0).subgraph()
        end_time = time.time()
        time_with_index = end_time - start_time
        print(f"Time to filter nodes with index: {time_with_index:.2f} seconds")
        print(f"Speedup factor: {time_without_index / time_with_index:.2f}x")
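To see why the index helps, a minimal sketch using only Python's standard-library sqlite3 module can compare SQLite's query plan before and after an index is created. The table name nodes, the column attr1, and the index name ix_nodes_attr1 are illustrative assumptions, not the schema that SQLGraph actually generates.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str) -> str:
    """Return SQLite's query plan for `sql` as a single string."""
    rows = conn.execute(f"EXPLAIN QUERY PLAN {sql}").fetchall()
    # The last column of each plan row holds the human-readable detail text.
    return "; ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for a node table; SQLGraph's real schema may differ.
conn.execute("CREATE TABLE nodes (node_id INTEGER PRIMARY KEY, t INTEGER, attr1 INTEGER)")
conn.executemany(
    "INSERT INTO nodes (t, attr1) VALUES (?, ?)",
    ((i, i % 100) for i in range(10_000)),
)

query = "SELECT node_id FROM nodes WHERE attr1 = 0"

plan_before = query_plan(conn, query)  # no index yet: the table is scanned
conn.execute("CREATE INDEX ix_nodes_attr1 ON nodes (attr1)")
plan_after = query_plan(conn, query)  # the planner can now use the index

matches = conn.execute(query).fetchall()
print("before:", plan_before)
print("after: ", plan_after)
print("matching rows:", len(matches))
```

The exact plan wording varies between SQLite versions, but the plan after indexing should name ix_nodes_attr1 instead of reporting a scan of the whole table.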
@yfukai, that's an amazing speedup!
Sure! Can we use "create_{node|edge}_attr_index" then? This matches the actual SQL statement (CREATE INDEX).
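For reference, the SQL statement in question is CREATE INDEX, and its counterpart for removal is DROP INDEX. A minimal sketch with the standard-library sqlite3 module, assuming a hypothetical nodes table and index name (not SQLGraph's actual schema or method implementation), shows how IF NOT EXISTS makes creation idempotent and what a drop would look like:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table for illustration only.
conn.execute("CREATE TABLE nodes (node_id INTEGER PRIMARY KEY, attr1 INTEGER)")


def index_names(conn: sqlite3.Connection, table: str) -> list[str]:
    """List the names of explicitly created indexes on `table`."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = ? AND sql IS NOT NULL",
        (table,),
    ).fetchall()
    return sorted(row[0] for row in rows)


conn.execute("CREATE INDEX ix_nodes_attr1 ON nodes (attr1)")

# A second plain CREATE INDEX with the same name raises an error ...
try:
    conn.execute("CREATE INDEX ix_nodes_attr1 ON nodes (attr1)")
    duplicate_rejected = False
except sqlite3.OperationalError:
    duplicate_rejected = True

# ... whereas IF NOT EXISTS turns the repeated call into a no-op.
conn.execute("CREATE INDEX IF NOT EXISTS ix_nodes_attr1 ON nodes (attr1)")
after_create = index_names(conn, "nodes")

# Removing the index again, which is what a drop function would wrap.
conn.execute("DROP INDEX IF EXISTS ix_nodes_attr1")
after_drop = index_names(conn, "nodes")

print(duplicate_rejected, after_create, after_drop)
```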
Title changed from "ensure_node|edge_attr_index for SQLGraph" to "create_node|edge_attr_index for SQLGraph"
@yfukai, that's even better.
This pull request introduces a new feature for the SQLGraph backend: the ability to create explicit database indexes on node and edge attributes to improve query performance, especially for frequently filtered attributes. The documentation and tests have been updated to reflect and validate this functionality.

SQLGraph indexing improvements:
- Added ensure_node_attr_index and ensure_edge_attr_index to SQLGraph for creating indexes on node and edge attribute columns, including support for composite and unique indexes. (src/tracksdata/graph/_sql_graph.py)
- Documentation updates. (docs/concepts.md)
- Documentation updates. (README.md)

Testing and validation:
- Tests for the new indexing methods. (src/tracksdata/graph/_test/test_graph_backends.py)
- Added sqlalchemy import to support index inspection in tests. (src/tracksdata/graph/_test/test_graph_backends.py)
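As an illustration of the composite and unique index options mentioned above, the following sketch uses the standard-library sqlite3 module rather than SQLGraph or sqlalchemy. The edges table, its columns, and the index names are assumptions made for the example; PRAGMA index_list and PRAGMA index_info stand in for the sqlalchemy-based inspection used in the tests.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical edge table; SQLGraph's real schema may differ.
conn.execute(
    "CREATE TABLE edges ("
    "edge_id INTEGER PRIMARY KEY, source_id INTEGER, target_id INTEGER, weight REAL)"
)

# Composite index: one index over two columns, useful when filtering on both.
conn.execute("CREATE INDEX ix_edges_source_weight ON edges (source_id, weight)")
# Unique index: additionally rejects duplicate (source_id, target_id) pairs.
conn.execute("CREATE UNIQUE INDEX ux_edges_pair ON edges (source_id, target_id)")


def describe_indexes(conn, table):
    """Map index name -> (is_unique, [column names]) for `table`."""
    result = {}
    for row in conn.execute(f"PRAGMA index_list('{table}')").fetchall():
        name, unique = row[1], bool(row[2])
        info = conn.execute(f"PRAGMA index_info('{name}')").fetchall()
        result[name] = (unique, [col[2] for col in info])
    return result


indexes = describe_indexes(conn, "edges")

conn.execute("INSERT INTO edges (source_id, target_id, weight) VALUES (1, 2, 0.5)")
try:
    # Same (source_id, target_id) pair again: the unique index should refuse it.
    conn.execute("INSERT INTO edges (source_id, target_id, weight) VALUES (1, 2, 0.9)")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False

print(indexes, duplicate_allowed)
```

A unique index doubles as a constraint, so creating one on existing data fails if duplicates are already present; that is a property of SQL databases in general, not something specific to this PR.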