Initial Spark support #1045
Here's an old repo that might be tangentially useful on the input side. It's the simplest proof-of-concept I could concoct that wraps a Titan 0.5 InputFormat in a Spark RDD: https://github.com/dalaro/titan-spark-test. I haven't touched it in a couple of months, though.
From what I gathered, this wouldn't even be necessary, since all we need is the Hadoop InputFormat, right @okram?
I wrote an email to you guys, but here it is for the issue record:
Reading is indeed pretty simple. I ran into a thorny classpath conflict around Netty that affects Spark, but I made some temporary hacks to get it working, though I need to return to this to see what else is affected by the conflict. Anyway, with Netty classpath conflicts out of the way, I could read from Titan-Cassandra after preloading GotG through OLTP:
I'm fuzzy on how the write side will work. I suspect the worst case is probably something like writing Gryo from Spark to disk/HDFS and then executing a separate BulkLoaderVertexProgram computation that reads the Gryo files. The BulkLoader Marko and I worked on is still in titan09, though I haven't touched it in a while.
I also just noticed that vertices without relations do not appear in Spark count or valueMap output. This could be due to CassandraInputFormat (i.e. the Titan bits involved here). I'm not sure at the moment.
@dalaro -- what do you mean by "vertices without relations"? You mean edge-less vertices? I don't have any test cases that test with edge-less vertices, so perhaps it's a Spark thing... eek. We may have to create a new
@okram right. At its storage level, Titan doesn't really allow for a relationless vertex. There's a hidden system relation on every extant vertex, even if a vertex has no user-visible relations. I suspect that I'm not translating that into a TP3-compatible analog in the InputFormat (if one exists in TP3).
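To make the suspected failure mode concrete, here is a toy model in Python. It is a sketch of the suspicion only, not Titan's or TinkerPop's actual code: the relation type name `HIDDEN_EXISTS`, the row layout, and both reader functions are hypothetical. It shows how a reader that emits a vertex only when it finds a user-visible relation would silently drop a vertex whose sole relation is the hidden system one.

```python
# Hypothetical stand-in for the hidden system relation; not a real Titan identifier.
SYSTEM_TYPES = {"HIDDEN_EXISTS"}


def visible_relations(relations):
    """Filter out hidden system relations, keeping only user-visible ones."""
    return [r for r in relations if r["type"] not in SYSTEM_TYPES]


def read_dropping(rows):
    """Suspected behavior: emit a vertex only if it has a user-visible relation."""
    vertices = {}
    for vertex_id, relations in rows.items():
        visible = visible_relations(relations)
        if visible:
            vertices[vertex_id] = visible
    return vertices


def read_keeping(rows):
    """Alternative: any stored row yields a vertex, possibly with no visible relations."""
    return {
        vertex_id: visible_relations(relations)
        for vertex_id, relations in rows.items()
        if relations
    }


# Illustrative storage rows keyed by vertex id.
rows = {
    1: [{"type": "HIDDEN_EXISTS"}, {"type": "name", "value": "saturn"}],
    2: [{"type": "HIDDEN_EXISTS"}],  # vertex with no user-visible relations
}

dropped = read_dropping(rows)
kept = read_keeping(rows)
```

In this model `dropped` loses vertex 2 while `kept` retains it with an empty relation list, which is the difference that would show up in a count.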
An edge-less vertex is simply a vertex with |
Working with Marko, I did the following on this issue today:
This gives us two ways to write to Titan from GraphComputers:
This stuff is still pretty raw and experimental. Known problems:
But this is just the proof-of-concept stage. Here's what I can do now (b4e62ef). Feedback in general and in particular from @dkuppitz would be welcome.
BulkLoaderVertexProgram and the Hadoop Input/OutputFormats now use harmonized config key prefixes: titanmr.{ioformat,bulkload}.conf. Also added a ConfigElement.getPath overload that includes the root element name (the default behavior is still to exclude the root). This commit fixes some DEBUG logging statements needlessly emitted at ERROR. For #1045
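As a rough illustration of what a shared prefix scheme allows, here is a minimal Python sketch of pulling embedded settings out of a flat job configuration. It is not Titan's implementation: the two prefixes come from the commit message above, while the helper `extract_prefixed` and the key suffixes (`storage.backend`, `storage.hostname`) are assumptions chosen for the example.

```python
# Prefixes named in the commit message: titanmr.{ioformat,bulkload}.conf
IOFORMAT_PREFIX = "titanmr.ioformat.conf."
BULKLOAD_PREFIX = "titanmr.bulkload.conf."


def extract_prefixed(job_conf, prefix):
    """Return entries of job_conf whose keys start with prefix, with the prefix stripped."""
    return {
        key[len(prefix):]: value
        for key, value in job_conf.items()
        if key.startswith(prefix)
    }


# Hypothetical flat job configuration; the key suffixes are illustrative only.
job_conf = {
    "titanmr.ioformat.conf.storage.backend": "cassandra",
    "titanmr.ioformat.conf.storage.hostname": "localhost",
    "titanmr.bulkload.conf.storage.backend": "cassandra",
    "unrelated.key": "ignored",
}

ioformat_conf = extract_prefixed(job_conf, IOFORMAT_PREFIX)
bulkload_conf = extract_prefixed(job_conf, BULKLOAD_PREFIX)
```

With both components following the same `titanmr.<component>.conf.` shape, one helper can serve the input/output formats and the bulk loader alike.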
The netty dependency issue was fixed in 0c36a4f. I tweaked the configuration keys and changed BLVP's ResultGraph default to Here are the changes from the config key tweaks:
Here's
Here's
@dalaro --- if there are aspects you think are wrong because of TP3, please file issues on our issue tracker. Thanks -- that was cool to see it work!
As a first step toward #1021, it seems we can get Spark support in Titan by simply reusing the existing Hadoop InputFormats in Spark.
While this may not be the most efficient approach, it would give us an easy first integration to investigate Spark support and get some feedback in the Titan 0.9M2 release.
It is unclear exactly what is needed here, but @okram and @dkuppitz can help.