
Neo4j connector to support cypher queries using Trino table functions #15587

Closed
wants to merge 6 commits into from

Conversation


@sashi-gh sashi-gh commented Jan 4, 2023

Description

Neo4j uses a custom query language called Cypher that is optimized for interacting with a graph database. It is not compatible with SQL.

Neo4j supports SQL queries on a graph database through the "Neo4j Connector for BI" JDBC driver:
https://neo4j.com/bi-connector/
This driver translates SQL queries into graph-optimized Cypher queries. However, this approach has limitations and is not suitable for all use cases: the SQL queries can get verbose, do not natively support graph semantics, and are not very performant.

Neo4j also has another driver, 'neo4j-jdbc', that supports native Cypher queries over JDBC.
https://github.com/neo4j-contrib/neo4j-jdbc
However, this driver couldn't be configured using Trino's generic JDBC driver.

So, I created a bespoke connector for Neo4j that can run Cypher queries through Trino table functions. This could be a valuable tool for users who want to query Neo4j as part of a data mesh strategy.
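For concreteness, a query against this connector would be composed roughly as in the sketch below, which builds the pass-through SQL around a raw Cypher string. The `neo4j.system.query` function name follows the examples later in this thread; the helper itself is illustrative, not part of the PR.

```java
// Illustrative sketch: compose the Trino SQL that passes a raw Cypher query
// through the proposed neo4j.system.query table function.
public class CypherPassThrough
{
    public static String wrapCypher(String cypher)
    {
        // Single quotes inside the Cypher text must be doubled for SQL.
        String escaped = cypher.replace("'", "''");
        return "SELECT * FROM TABLE(neo4j.system.query(query => '" + escaped + "'))";
    }

    public static void main(String[] args)
    {
        System.out.println(wrapCypher("MATCH (n:Person) RETURN n.name AS name"));
    }
}
```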

Additional context and related issues

This connector is limited to running Cypher queries through table functions. It derives most of its functionality from the trino-base-jdbc module. It does not support DDL or DML statements, and it does not support any pushdown features.

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(X) Release notes are required, with the following suggested text:

# Section
* New Neo4j connector to support cypher queries using Trino table functions  ({issue}`issuenumber`)


cla-bot bot commented Jan 4, 2023

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

Member

@ebyhr ebyhr left a comment

Could you please add docs and more tests?

@sashi-gh
Author

@ebyhr, thanks for your comment.
I looked at the BaseJdbcConnectorTest and here are my thoughts -

  • The Neo4j connector I've created supports only pass-through queries in the Cypher language. It is not a full-fledged connector like others, where incoming SQL is mapped to Cypher. So maybe not all the tests in BaseJdbcConnectorTest are relevant to this Neo4j connector, even after customizing the test behavior using TestingConnectorBehavior
  • I plan to write tests like this -

If the original test sql was -

SELECT shippriority, clerk, totalprice FROM orders

The neo4j test would be -

SELECT shippriority, clerk, totalprice FROM TABLE(neo4j.system.query(query => 'MATCH (n:Orders) return n.orderkey as orderkey,n.custkey as custkey,n.orderstatus as orderstatus,n.totalprice as totalprice,n.orderdate as orderdate,n.orderpriority as orderpriority,n.clerk as clerk,n.shippriority as shippriority,n.comment as comment'))

So, any reference to a simple table name in the original test query would be replaced by a pass-through table function query that returns the same data

  • The problem with the above approach is that I'll have to rewrite all relevant tests from BaseConnectorTest in my test class, because the base queries are not directly executable with the Neo4j connector. This could pose problems as tests are added, updated, or removed in the base class
  • Instead of rewriting the tests, I was thinking a better approach would be to modify BaseConnectorTest (and BaseJdbcConnectorTest) to allow subclasses to define the table format of the query. The default behavior would be to use the table name as-is, but a subclass could overwrite the table name with a pass-through table function query. This would also let future pass-through-only connectors define their own test SQL behavior
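The proposed override hook could be sketched like this; the class and method names are hypothetical and not the actual Trino test API:

```java
// Hypothetical sketch of the proposed test hook: the base test routes every
// table reference through an overridable method. Names are illustrative only.
public class TableReferenceHook
{
    // Default behavior: use the table name as-is.
    public static String tableReference(String tableName)
    {
        return tableName;
    }

    // A pass-through-only connector test could substitute a PTF query instead.
    public static String neo4jTableReference(String tableName)
    {
        return "TABLE(neo4j.system.query(query => 'MATCH (n:" + tableName + ") RETURN n'))";
    }

    public static void main(String[] args)
    {
        // The base test's "FROM orders" becomes a pass-through table function.
        System.out.println("SELECT shippriority, clerk, totalprice FROM "
                + neo4jTableReference("Orders"));
    }
}
```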

Thoughts?

@github-actions

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Feb 22, 2023
@bitsondatadev
Member

@ebyhr can you give this a look?


import io.trino.plugin.jdbc.BaseJdbcConfig;

public class Neo4jJdbcConfig
Member

In this case, can we use BaseJdbcConfig directly?

Author

Yes, that should work

Member

So, can we remove this class?

Author

Removed

Comment on lines 467 to 474
return query.transformQuery(sql -> {
    if (sql.toLowerCase(Locale.getDefault()).startsWith("select")) {
        int firstIndex = sql.toLowerCase(Locale.getDefault()).indexOf("from (");
        int lastIndex = sql.lastIndexOf(")");
        sql = sql.substring(firstIndex + 6, lastIndex);
    }
    log.debug("Neo4j Cypher sql query - " + sql);
    return sql;
Member

Instead, can we capture this in a Neo4jQueryBuilder?

Author

That is probably a better approach

Author

Created Neo4jQueryBuilder
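For reference, the unwrapping logic from the snippet above can be restated as a standalone, testable helper. This is a sketch rather than the actual Neo4jQueryBuilder code, and it uses Locale.ROOT so case-folding does not depend on the JVM default locale:

```java
import java.util.Locale;

// Sketch of the query-unwrapping step: strip the outer
// "SELECT ... FROM ( ... )" wrapper to recover the inner Cypher text.
// Not the actual Neo4jQueryBuilder implementation.
public class CypherUnwrap
{
    public static String unwrap(String sql)
    {
        String lower = sql.toLowerCase(Locale.ROOT);
        if (lower.startsWith("select")) {
            int firstIndex = lower.indexOf("from (");
            int lastIndex = sql.lastIndexOf(')');
            // Only unwrap when both delimiters are actually present.
            if (firstIndex >= 0 && lastIndex > firstIndex) {
                return sql.substring(firstIndex + "from (".length(), lastIndex);
            }
        }
        return sql;
    }

    public static void main(String[] args)
    {
        System.out.println(unwrap("SELECT * FROM (MATCH (n) RETURN n)"));
    }
}
```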

private static final Logger log = Logger.get(Neo4jClient.class);
private static final int MAX_RESULT_SET_INFO_CACHE_ENTRIES = 10000;
private final Type jsonType;
private final Cache<PreparedQuery, Neo4jResultSetInfo> cachedResultSetInfo;
Member

Why do we need this cache?

Author

The cache helps query performance: we avoid fetching result-set metadata from Neo4j when we have already seen the same query in the past.

Member

Would CachingJdbcClient help here? Or can we add this caching layer as a follow-up PR?

Author

It could help, depending on the queries. It's not as helpful as in a full-fledged connector, because this connector only supports PTF queries, so the underlying Cypher query cannot be parameterized. Do you see any problem with the implementation?
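The caching idea discussed here can be sketched with JDK types alone. The PR itself uses a bounded Guava Cache keyed by the prepared query, so the names and the unbounded map below are simplifications:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch of the result-set-metadata cache, assuming identical query
// text always yields the same result shape. The real code uses a bounded
// Guava Cache; this stand-in uses an unbounded ConcurrentHashMap.
public class ResultSetInfoCache
{
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private int fetches;

    // Stand-in for fetching column metadata from Neo4j (hypothetical).
    private String fetchResultSetInfo(String cypher)
    {
        fetches++;
        return "columns-for:" + cypher;
    }

    public String resultSetInfo(String cypher)
    {
        // Only the first occurrence of a query text triggers a metadata fetch.
        return cache.computeIfAbsent(cypher, this::fetchResultSetInfo);
    }

    public int fetchCount()
    {
        return fetches;
    }
}
```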

@cla-bot cla-bot bot added the cla-signed label May 18, 2023
<parent>
<groupId>io.trino</groupId>
<artifactId>trino-root</artifactId>
<version>406-SNAPSHOT</version>
Member

Can we update the version?

Author

You mean sync the fork to the latest version (419-snapshot)?

Author

Updated to latest version

<artifactId>neo4j-jdbc-bolt</artifactId>
<version>4.0.6</version>
<exclusions>
<exclusion>
Member

It would be nice to mention why we need this exclusion.

Author

Removed the exclusion

<dependency>
<groupId>org.neo4j</groupId>
<artifactId>neo4j-jdbc-bolt</artifactId>
<version>4.0.6</version>
Member

Can we extract the version as a Maven property?

Author

Updated

<dependency>
<groupId>org.neo4j.driver</groupId>
<artifactId>neo4j-java-driver</artifactId>
<version>4.4.9</version>
Member

Should this be compatible with the bolt dependency?

</dependency>

<!-- for testing -->

Member

The empty line can be removed.

{
super(jdbcClient, jdbcQueryEventListeners);
this.jdbcClient = jdbcClient;
this.jdbcQueryEventListeners = jdbcQueryEventListeners;
Member

Can we add a requireNonNull check for all the arguments? And for jdbcQueryEventListeners, can we use ImmutableSet.copyOf so that we keep an immutable copy of the listeners?

Author

Yes, these checks are in the super class

import java.util.Optional;
import java.util.Set;

public class Neo4jMetadata
Member

Why do we need to extend Neo4jMetadata? If the goal is to restrict pushdown operations, the same can be achieved in Neo4jClient itself instead of handling it here.

Author

It looks like Neo4jMetadata is called first, and Neo4jClient is only called later for the actual pushdown execution. Since this connector doesn't do any pushdown, the sooner we short-circuit, the better.
Let me know if my assumption is wrong.

extends DefaultQueryBuilder
{
/**
* This connector only supports table functions, so wrap the user-entered table function query in a Neo4j CALL subquery with projected columns
Member

Does this mean we don't support normal SELECT * FROM kinds of queries in this Neo4j connector?

Author

Yes, this connector is only for running Cypher queries in a PTF

import static java.util.Objects.requireNonNull;

public class Neo4jLoader
extends AbstractTestingTrinoClient<Void>
Member

Does Neo4jLoader need to extend AbstractTestingTrinoClient? Can we do something like DruidQueryRunner: run SELECT * FROM a TPC-H table and use the materialized result to load data into Neo4j?

Author

We could do that too. I followed the approach taken in the Elasticsearch connector. With either approach, we'll still have to take the result and insert the data into Neo4j, as done here in Neo4jLoadingSession. Is anything wrong with this?

}

@Test
public void testCreateNode()
Member

Can we add these tests as part of a BaseNeo4jTest?

Author

Created BaseNeo4jTest

@kokosing
Member

kokosing commented Jun 7, 2023

@sashi-gh Are you willing to continue working on it? We are very interested in this connector, so we can take it over if you don't mind. We will address all the remaining comments and merge it. Your authorship of the code will remain.

@sashi-gh sashi-gh closed this Jun 7, 2023
@ragnard
Member

ragnard commented Sep 19, 2023

I'm also interested in a connector for Neo4j. I've had a look at the work done by @sashi-gh and have continued working on the branch here (diff: master...ragnard:trino:neo4j-connector)

So far I've mainly updated to 427-SNAPSHOT and done some testing/review.

@kokosing Would the project be interested in continued work on this?

@ragnard
Member

ragnard commented Sep 20, 2023

FWIW I've started working on another approach for a Neo4j connector that is not based on JDBC.

In short, I think the mismatch between JDBC and Neo4j, and some implementation decisions made in the Neo4j JDBC driver, bring more problems than they're worth.

My branch is here, will submit PR for discussion when it's in a decent shape:
master...ragnard:trino:neo4j-bolt-connector

As of now, I have a POC with all read operations implemented, where node labels and properties are exposed as tables and columns, and I'm working on a system.query PTF for executing raw Cypher.
