
Mixing RDF and Neo4J #3

Open
pudo opened this issue Aug 18, 2014 · 8 comments

@pudo

pudo commented Aug 18, 2014

@jmatsushita I think I need a bit of advice from you.

The SQL implementation of grano is meeting a few limitations when querying the graph in depth, so I'm trying to think of alternative approaches. At the moment, the best thing I was able to come up with is a wild, poly-backend mix of RDF, ElasticSearch and Neo4J.

RDF would be stored in hashed flat files and contain full provenance and alternative values. This would be the master data, but it wouldn't be easily queryable. Queries would therefore be handled by ElasticSearch for simple stuff (like prefix lookups) and Neo4J for more complex queries (I still want to wrap it in an MQL front-end because exposing CYPHER to the web seems insane).

I still want to have a bespoke web interface for entering, exploring and editing data (i.e. in news orgs), but that would basically interact with a simplified web API...

I'm wondering a) what do you think of this? b) would that bring our projects closer together, or are you thinking of ways to handle provenance inside of Neo4J? I just can't get myself to trust that unholy thing as a main data store :)

@jmatsushita
Member

Hey! I don't think I can manage provenance inside neo4j in the way that I want to (at the statement level) so my plan for Open Oil is to move to a triple store next.

It would help if you gave me an example of a query that you'd like to run that is currently not working well.

The gist of my response is:
1/ Are you sure you can't design a relational database structure that would be able to produce provenance quads and have reasonable performance for most queries?
2/ If you start using a graph database then you should probably move your main storage to it because otherwise it feels like it's going to be a bit of a mess to maintain.
3/ I don't think neo4j can manage even "simple provenance" on both nodes AND relationships (see the provenance v0.1b paragraph). It certainly can't when it has to deal with statements on properties.
4/ Choosing a sane full stack based on a triple store which can do quads efficiently, has good abstractions and APIs, and a good community is something that I'm interested in! I've added "stacks" as a topic of interest.

I think ideally whatever storage you choose should be able to generate an RDF representation (I should be able to generate RDF from Neo4j, for instance: iilab/openoil@1d2c213), and your RDF generation Gist suggests you can already do that. Whether you need a triple store or not depends on how granular the provenance data needs to be. The moment you start storing some aspect of your data in a triple store, though, is the moment you need to consider storing your main data in the triple store as well.

From a data modelling standpoint, if you generate RDF it should probably be as quads or named graphs, but you could still implement that logic with your current backend (perhaps by introducing a new table just to capture the meta-statements, instead of the current approach of an extra metadata field in your main graph table).
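To make the quad idea concrete, here's a minimal sketch of serialising rows from a relational backend as N-Quads, where the fourth term (the named graph) carries the provenance. All URIs and names below are made up for illustration, not taken from grano or Open Oil:

```python
# Hypothetical sketch: rows of (subject, predicate, object, graph),
# where the graph term points at a provenance record. String objects
# only, to keep the example tiny; real RDF also needs IRIs/typed literals.
def to_nquads(rows):
    lines = []
    for s, p, o, g in rows:
        lines.append(f'<{s}> <{p}> "{o}" <{g}> .')
    return "\n".join(lines)

rows = [
    ("http://ex.org/acme", "http://ex.org/name", "ACME Inc.",
     "http://ex.org/prov/source-42"),  # graph term = provenance record
]
print(to_nquads(rows))
```

The point is that the "meta-statements table" and the named-graph column are the same idea: one extra term per statement that identifies where it came from.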

I think having both ElasticSearch and Neo4j is probably not necessary, but I'd need to understand what you mean by "prefix lookup". Neo4j embeds a Lucene full-text search engine that should cover most of your needs, I think, if you were to use it.

I like the idea of building an MQL query endpoint for Neo4j. But maybe for now it's simpler to convert your MQL-generating widgets into Cypher-generating widgets? Do you have some examples of common structures of the MQL queries you end up generating? I don't claim to be able to solve the generalised problem, but maybe I could take a shot at something that works for Grano's use cases. It would be interesting then to look into MQL-to-SPARQL. Gremlin currently feels like the closest thing to being able to pivot between graph query languages (https://groups.google.com/forum/#!topic/gremlin-users/7QS7j9aA5NA), but it doesn't support MQL as far as I'm aware. I suspect MQL is simple, though...
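As a rough illustration of what an MQL-to-Cypher shim might look like, here's a hedged sketch that handles only the simplest case: an exact-match property pattern. The function name and pattern shape are hypothetical; real MQL is far richer than this:

```python
# Translate a tiny MQL-style dict pattern into a parameterised Cypher
# query. Only exact-match property filters are covered; no nesting,
# no relationship traversal. Illustrative sketch only.
def mql_to_cypher(pattern):
    pattern = dict(pattern)              # don't mutate the caller's dict
    node_type = pattern.pop("type", None)
    label = f":{node_type}" if node_type else ""
    props = ", ".join(f"{k}: ${k}" for k in sorted(pattern))
    match = f"MATCH (n{label} {{{props}}})" if props else f"MATCH (n{label})"
    # Values travel as query parameters, which is what makes this safer
    # than exposing raw Cypher: no string-splicing of user input.
    return f"{match} RETURN n", pattern

query, params = mql_to_cypher({"type": "Person", "name": "Ada"})
print(query)   # MATCH (n:Person {name: $name}) RETURN n
print(params)  # {'name': 'Ada'}
```

The parameterisation is the whole argument: a translator like this decides what query shapes exist at all, instead of trying to filter arbitrary Cypher strings after the fact.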

Seems like kicking Cayley's tires might be instructive. But also give a shot to:

  • virtuoso - because it still seems to be the most trusted and active triple store out there,
  • orientdb - because graph + document sounds cool,
  • titan - because not being able to distribute the graph across multiple machines seems very weird (although that probably won't solve anything regarding the logical distribution of graph data instances).

Hope that helps.

@pudo
Author

pudo commented Aug 19, 2014

Thanks, Jun, for that comprehensive reply and the valuable thoughts!

To begin, here's where I'm currently stuck: doing bidirectional queries on my SQL-based graph. When you store a graph in SQL, it's always directed, because the edge/link table has to have a source and a target column. This isn't a huge problem if you know which way the links you're looking for point, but if you don't know (or don't care), the only way to query is either a big UNION or an impossibly large join table (for my 30k-link dataset, the full bidirectional join is 19 million rows). This leads me to think I may have reached the limits of what I can do in SQL.
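To make the UNION problem concrete, here's a minimal sketch with a made-up two-edge table (the schema and names are hypothetical, not grano's actual ones):

```python
import sqlite3

# Hypothetical directed edge table, as in any SQL-backed graph.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (source TEXT, target TEXT, kind TEXT)")
conn.executemany(
    "INSERT INTO edge VALUES (?, ?, ?)",
    [("alice", "acme", "works_for"), ("acme", "bob", "employs")],
)

# Direction-agnostic neighbour lookup: the UNION scans the table
# from both sides, and chaining several such hops multiplies the
# intermediate rows, which is where the 19-million-row joins come from.
def neighbours(node):
    rows = conn.execute(
        """
        SELECT target AS other FROM edge WHERE source = ?
        UNION
        SELECT source AS other FROM edge WHERE target = ?
        """,
        (node, node),
    )
    return sorted(r[0] for r in rows)

print(neighbours("acme"))  # ['alice', 'bob']
```

One UNION per hop is tolerable; a multi-hop, direction-agnostic traversal turns into a product of such unions, which is exactly the blow-up described above.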

Now, to your questions:

  1. I'm pretty sure that can be done, but the more I think about it, the more it looks like a quad store (i.e. basically one big statement table with four main columns, plus some minor ones). So essentially that would leave me implementing a quad store myself.

  2. It's only a mess if I can't block all writes to the graph store that don't come from the main store's processing queue. If that can be guaranteed, it shouldn't be impossible. The main thing here is never actually exposing the CYPHER endpoint.

  3. Agreed.

  4. Let's do that, then!

Since I wrote this issue, I've realised that simple indexes are not optional: they're needed to do any kind of meaningful update on the dataset. So the flat-file option is pretty much dead, unless I want to start implementing my own indexes (and at that point I'm just building a graph database).

The next most conservative thing to do, IMO, would be to use a single-table relational DB as the quad store. That won't be performant enough for full-blown SPARQL, but it should be OK as a sort of master data store from which more query-friendly forms (like Neo4J, or even in-memory graphs) can be derived. I do understand your reluctance about this kind of double storage, but I just don't feel the one-size-fits-all thing exists.
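A minimal sketch of that single-table idea, with the usual trick of permutation indexes so a lookup by any position avoids a full scan (schema and data are made up for illustration):

```python
import sqlite3

# Hypothetical single-table quad store: subject/predicate/object/context.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE quad (
        s TEXT NOT NULL, p TEXT NOT NULL,
        o TEXT NOT NULL, c TEXT NOT NULL
    )
""")
# Permutation indexes, so queries bound on any term have an index to use.
for cols in ("spoc", "posc", "ospc", "cspo"):
    conn.execute(f"CREATE INDEX idx_{cols} ON quad ({', '.join(cols)})")

conn.execute("INSERT INTO quad VALUES (?, ?, ?, ?)",
             ("acme", "name", "ACME Inc.", "source-42"))

# Lookup by object: served by the ospc index rather than a table scan.
row = conn.execute("SELECT s FROM quad WHERE o = ?",
                   ("ACME Inc.",)).fetchone()
print(row[0])  # acme
```

Single lookups like this stay cheap; it's the multi-hop graph traversals (self-joins on this table) that hit the join blow-up discussed earlier, which is why this works better as a master store than as the query engine.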

On MQL: I just don't think one can safely expose CYPHER; there are probably a dozen ways to circumvent the term filter you proposed previously. And once you're parsing the thing in any meaningful way anyway, why not go and build a nice query API like MQL? In any case, it's a bit of a gimmick and won't determine any of the major architecture choices.

Finally, the databases: as a general thing, I would like to keep this reasonably simple to deploy on a reasonably-sized machine. Titan hits me very badly on this front.

I've worked with Virtuoso in the past and sworn to myself that I'll never touch it again, even with a stick. It's a mad piece of software that needs to die. That's less a rational argument than an irrational fear :)

That leaves Orient and Cayley: Cayley is really awesome from what I've seen, including native quad support. But it's just very, very new and it looks like some fairly basic bits are missing (e.g. I think it may only do bulk import at launch at this time?). Orient looks really cool, going to explore that more today. Haven't seen any mentions of quads yet, though.

@ahdinosaur

hey. :)

i'm working on something similar to uf6 (open-app, openappjs), and for the data subsystem i was thinking of building on top of leveldb and levelgraph, as the level and levelgraph ecosystems allow you to "build your own database" using modules.

@pudo
Author

pudo commented Aug 20, 2014

Hey Michael! Levelgraph looks pretty cool, it'd be very interesting to find out if there is a) quad support and b) non-JS bindings or a REST API!

@ahdinosaur

a) you can add additional properties to triples like so. they can be accessed during query filters, but the additional properties are not by default indexed like the triples are. there's also an issue up about proper named graphs: levelgraph/levelgraph#43.
b) it's built in JS and either leveldb on the server or IndexedDB in the browser. i don't know of any REST APIs for it, but it wouldn't be hard to do.

in general, the "build your own database" approach comes with fewer "batteries included", so it really depends on what you want. just figured i might as well share.

@jmatsushita
Member

Hey @pudo

Re. the limits of SQL: makes a lot of sense. Seems, as you say in 1/, that you're indeed hitting the "I am building my own triple (quad) store" barrier.

Re. 2/: that sounds messy. I didn't mean impossible, I meant increased complexity and more failure modes.

Re. 3/ & 4/ awesome! Let's!

Re. Indices and rebuilding a graph database. Yup.

Re. double storage: yes, you read my mind. This gets negative points from me because of the increased system complexity and having to manage multiple "conceptual views" (the graph system which holds the data, and the meta-graph system which holds the provenance) on top of what is essentially just the graph (if it allows quads). I think at some point you'll also run into "relational" integrity issues when your main graph evolves and needs to stay properly linked to your meta-graph. There's also something paradoxical going on here: if you add an index to the triples in your current SQL main database, you're in fact implementing a quad. And if you store quads in a secondary database, aren't you essentially duplicating the triple information from your main database and adding a "column"?

Re. one size fits all: the problem, as I see it, with moving your main storage to a graph DB is mostly about changing a lot of the plumbing you've invested so much effort in, but I don't see why it wouldn't "fit". What are the things that would not work on a graph DB? (Sorry, I could probably answer this myself by reading more of the Grano code, but I'm sure it'll be faster if you explain.)

Re. building a quad store in SQL: maybe provenance queries, which I guess are our main use case for now, could be predictable enough in their structure to optimise your quad store for that type of query. But I worry that you'd run into the same giant-join performance problems for anything that tries to treat the quad graph really as a graph, rather than as a provenance store. Then again, maybe that's enough. We'd need to be a bit more granular about the types of queries we think would be run for the kinds of applications we're interested in, don't you think?

Moving the choosing a graph database discussion (and @ahdinosaur's contributions) to a new issue! #4

@pudo
Author

pudo commented Aug 21, 2014

You're convincing me, @jmatsushita - maybe it is worth finding another "one size fits all" solution. I'm beginning to be somewhat attracted to RDF, as the standardized SPARQL Query/Update mechanism means you can swap out backends easily (or so the theory goes).

For grano specifically, there might be a migration path where I move the core graph to a triplestore first and keep the rest (users, projects, ...) in a sqlite database before finally moving everything over.

@jmatsushita
Member

Awesome. Yes keeping users and projects aside makes a lot of sense to me.

