
Support Neo4jIO #4488

Merged: 14 commits merged into spotify:main on Sep 23, 2022

Conversation

@turb turb (Contributor) commented Aug 5, 2022

Support the experimental org.apache.beam.sdk.io.neo4j.Neo4jIO introduced in Beam 2.38. This is inspired by JdbcIO.

Request for comment: Beam only provides readAll, hence the usage of sc.parallelize(Seq(())).applyTransform to simulate a read.

  1. I don't know if it is the proper way to do this.
  2. Maybe @mattcasters knows why a simple read is not provided in Beam.

@codecov

codecov bot commented Aug 5, 2022

Codecov Report

Merging #4488 (083129c) into main (36935c1) will decrease coverage by 0.30%.
The diff coverage is 40.00%.

❗ Current head 083129c differs from pull request most recent head 2054563. Consider uploading reports for the commit 2054563 to get more accurate results

@@            Coverage Diff             @@
##             main    #4488      +/-   ##
==========================================
- Coverage   60.48%   60.18%   -0.31%     
==========================================
  Files         275      279       +4     
  Lines        9882    10086     +204     
  Branches      438      840     +402     
==========================================
+ Hits         5977     6070      +93     
- Misses       3905     4016     +111     
Impacted Files Coverage Δ
...rc/main/scala/com/spotify/scio/neo4j/Neo4jIO.scala 29.41% <29.41%> (ø)
...in/scala/com/spotify/scio/neo4j/Neo4jOptions.scala 100.00% <100.00%> (ø)
.../spotify/scio/neo4j/syntax/SCollectionSyntax.scala 100.00% <100.00%> (ø)
.../spotify/scio/neo4j/syntax/ScioContextSyntax.scala 100.00% <100.00%> (ø)
...om/spotify/scio/elasticsearch/CoderInstances.scala 44.11% <0.00%> (-5.89%) ⬇️
...om/spotify/scio/elasticsearch/CoderInstances.scala 42.42% <0.00%> (-5.86%) ⬇️
...com/spotify/scio/bigquery/types/TypeProvider.scala 47.22% <0.00%> (-2.78%) ⬇️
...la/com/spotify/scio/bigquery/client/BigQuery.scala 22.44% <0.00%> (-2.56%) ⬇️
...rc/main/scala/com/spotify/scio/util/ScioUtil.scala 59.25% <0.00%> (-2.28%) ⬇️
...n/scala/com/spotify/scio/extra/annoy/package.scala 80.00% <0.00%> (-2.06%) ⬇️
... and 43 more


@mattcasters

Maybe @mattcasters knows why a simple read is not provided in Beam.

Frankly I couldn't come up with a good way to parallelise a Cypher query in a generic way. If you were reading just nodes or relationships, you could split up the natural internal ID address space. For a generic Cypher query it's not that trivial.
The "solution" then is to provide sets of parameters to the query which is then executed multiple times in parallel. With a bit of luck you can get to the same embarrassingly parallel solution.

@turb
Contributor Author

turb commented Aug 5, 2022

Frankly I couldn't come up with a good way to parallelise a Cypher query in a generic way.

Thanks for your answer!

I may be mistaken, but that indeed prevents using Splittable DoFn to get parallel IO. However, it is still possible to have a naïve read that simply runs on a single node. I think that's what JdbcIO does: it only includes an outputParallelization parameter to shuffle the output.

With readAll, if we have only one cypher query to execute, we need to provide a one-discarded-element PCollection, which is kind of ugly.

To put it differently: this is not about the parallelisation feature, but about the availability of read in the API, for convenience.

@RustedBones RustedBones (Contributor) left a comment

Thanks a lot for the contribution!
This looks quite good.

Maybe we can also create a very basic use-case in scio-example if you know one

@RustedBones
Contributor

@turb for parallel reading in jdbc, we have the jdbcShardedSelect:
It takes a shardColumn and a numShards, with

  • shardColumn being uniform and linear, so that we can know the min and max values
  • numShards to split the shardColumn space above into x buckets

we can potentially do something like this

sc.parallelize(Seq(Range(min, shard1), Range(shard1, shard2), Range(shard2, max))) // to be computed ?
  .applyTransform(
    beam.Neo4jIO
      .readAll[Range, T]()
      ...
      .withCypher(shardedCypher) // ensures we have a $min $max filter
      .withParametersFunction((r: Range) => Map("min" -> r.min, "max" -> r.max))
  )
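The ranges marked `// to be computed ?` could come from a small helper. A minimal sketch in plain Scala, assuming a numeric shardColumn with known bounds (all names here are hypothetical, not scio API):

```scala
// Hypothetical helper: split the inclusive [min, max] shardColumn space
// into numShards contiguous (lower, upper) buckets.
def shardRanges(min: Long, max: Long, numShards: Int): Seq[(Long, Long)] = {
  val span = max - min
  (0 until numShards).map { i =>
    val lo = min + span * i / numShards
    // The last bucket absorbs any rounding remainder so that max is covered.
    val hi = if (i == numShards - 1) max else min + span * (i + 1) / numShards
    (lo, hi)
  }
}
```

Integer division makes the buckets near-equal, with the remainder folded into the last one; the resulting pairs would feed the parameters function above.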

@turb
Contributor Author

turb commented Aug 11, 2022

Maybe we can also create a very basic use-case in scio-example if you know one

Added one in fixup 7669f0a

@turb
Contributor Author

turb commented Aug 11, 2022

for parallel reading in jdbc, we have the jdbcShardedSelect

I suppose @mattcasters has an opinion on it.

However, since the Beam impl is still experimental, is it a good idea to build this kind of pattern on top of it yet?

@RustedBones
Contributor

IMHO it's more of a further direction.
I'd prefer incremental development, where we start with the proposed read and write and continue the discussion in another issue/PR.

@mattcasters

I'd be happy to consider any and all changes to Neo4jIO. It was really hard to get it across the finish line because of all the new rules and regulations in effect in the Beam project which do not apply to existing code like JdbcIO. Because of this, the two IOs are not the same. Neo4jIO was a bit of a challenge since there was no other clean IO code available to follow. Months were spent on it.
However, now that the code exists we can expand functionality. A Neo4j shard selector would indeed be something cool to consider. Feel free to create an issue in the beam project and I'd be happy to investigate.

@RustedBones RustedBones (Contributor) left a comment

I took the liberty of pushing some integration tests for neo4j onto your branch.
Playing with the API raised a few questions.

Comment on lines 94 to 101
"""UNWIND $origin AS origin
|MATCH
| (m:Movie {name: origin.movie}),
| (c:Country {name: origin.country})
|CREATE (m)-[:ORIGIN]->(c)
|""".stripMargin,
"origin",
{ case (movie, country) => Map("movie" -> movie, "country" -> country) }
Contributor

Same thing here

.saveAsNeo4j(neo4jOptions, cypher)(x => Map(...))

As a neo4j newbie, I'm also confused by the requirement of the UNWIND and the obligation to declare the unwindMapName.
As I understand it, this is to enable batching.

Can this be hidden from the user?

MATCH
  (m:Movie {name: $movie}),
  (c:Country {name: $country})
CREATE (m)-[:ORIGIN]->(c)

could be automatically transformed with a carefully picked unwind alias, replacing the $ parameters
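A rough sketch of what that automatic transformation might look like, as a naive string rewrite (hypothetical helper with an assumed alias name `rows`; it deliberately ignores `$` occurrences inside string literals):

```scala
// Naive sketch: turn a per-row cypher into an UNWIND batch form by
// wrapping it and rewriting each $param into <alias>.param.
// Illustration only; a real implementation would need a cypher-aware rewrite.
def toUnwindCypher(cypher: String, alias: String = "rows"): String = {
  val body = cypher.replaceAll("""\$(\w+)""", alias + ".$1")
  s"UNWIND $$$alias AS $alias\n$body"
}
```

The batching then happens over the `$rows` list parameter, so the user only writes the single-row form of the query.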

Contributor Author

As a neo4j newbie, I'm also confused by the requirement of the UNWIND and the obligation to declare the unwindMapName.
As I understand it, this is to enable batching.

Can this be hidden from the user?

Maybe so. However, I'll notify (bother) @mattcasters again, since he is the best expert we can find here, I believe!


Just to clarify the UNWIND situation: without UNWIND you really can't get decent performance with Neo4j. It is also painful to note that the dialect for UNWIND statements is a bit different from regular Cypher. However, I'm told this is something my colleagues in engineering are working on for the upcoming version 5 of Neo4j. I propose to wait until that situation becomes a bit clearer before making additional changes. Fingers crossed.

@turb
Contributor Author

turb commented Aug 19, 2022

I took the liberty of pushing some integration tests for neo4j onto your branch.

That's great! Many thanks.

I have locally updated the Neo4j docker image pulled by Testcontainers to 4.4.10 from 3.5.0, as the latter has no Apple Silicon version and hence runs forever and times out on my laptop. Should I share this change?

@RustedBones RustedBones (Contributor) left a comment

As many of our APIs actually use typeclasses, I think we can leverage this for the Neo4jIO as well:

  • RowMapper would be the read typeclass to transform Record to T
  • ParametersBuilder would extract the Map[String, AnyRef] parameter map from T

I'm less sure about the 2nd, and whether we should actually leave it as a function instead of a typeclass.

@clairemcginty @farzad-sedghi

Comment on lines +33 to +36
object WriteParam {
private[neo4j] val BeamDefaultBatchSize = 5000L
}
final case class WriteParam(batchSize: Long = WriteParam.BeamDefaultBatchSize)
Contributor

If we use batchSize as a write param, we can have a common Neo4jOptions

Comment on lines 96 to 105
private[neo4j] val unwindMapName: String = cypher match {
case Neo4jIO.UnwindParameterRegex(name) => name
case _ =>
throw new IllegalArgumentException(
s"""Expected unwind cypher with parameter but got:
|$cypher
|See: https://neo4j.com/docs/cypher-manual/current/clauses/unwind/#unwind-creating-nodes-from-a-list-parameter
|""".stripMargin
)
}
Contributor

Trying to be smart here: we can skip passing unwindMapName manually and extract it from the query.
Since we only support writeUnwind, I think this is safe.
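A minimal sketch of such an extraction in plain Scala, assuming a regex in the spirit of the `UnwindParameterRegex` used above (the actual pattern in the PR may differ):

```scala
// Hypothetical stand-in for Neo4jIO.UnwindParameterRegex: captures the
// parameter name from an "UNWIND $<name> AS ..." clause anywhere in the query.
// (?i) case-insensitive, (?s) so . also matches newlines in multi-line cyphers.
val UnwindParameterRegex = """(?is).*UNWIND\s+\$(\w+)\s+AS\s.*""".r

// Extract the unwind map name, if the query has the expected shape.
def unwindMapName(cypher: String): Option[String] = cypher match {
  case UnwindParameterRegex(name) => Some(name)
  case _                          => None
}
```

With this, the write API only needs the cypher string; a query without the expected UNWIND clause can be rejected up front.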


val modifiedEntities = entities.map(e => e.copy(property = e.property.map(_ + " modified")))

modifiedEntities.write(Neo4jWrite(Neo4jWriteOptions(neo4jConf, writeCypher, "rows", setter)))
modifiedEntities.saveAsNeo4j(neo4jConf, writeCypher)
Contributor Author

Adapted Neo4JExample with the new interface.

Contributor

Thanks, I missed that part. I also noticed that there already exists a typeclass out there, as you mentioned in the comment.

@RustedBones
Contributor

Thanks a lot for fixing the issues @turb
I think the PR is good to go now. I'll open an issue in magnolify to support the RowMapper automatic derivation as neotype core package comes with a lot of extras.

@turb
Contributor Author

turb commented Aug 23, 2022

Thanks a lot for fixing the issues @turb I think the PR is good to go now. I'll open an issue in magnolify to support the RowMapper automatic derivation as neotype core package comes with a lot of extras.

Great! Should we squash the fixups, or will this be managed by the merger?

@RustedBones
Contributor

we'll do a squash merge for the whole PR

* @param cypher
* Neo4J cypher query
*/
def neo4jCypher[T: RowMapper: Coder](neo4jOptions: Neo4jOptions, cypher: String): SCollection[T] =
Contributor

Instead of an implicit RowMapper, it'd be more idiomatic to provide a function Record => T and just pass that directly to the read method. See e.g. https://github.com/spotify/scio/blob/main/scio-jdbc/src/main/scala/com/spotify/scio/jdbc/JdbcOptions.scala#L79

If a mapper could be easily derived via magnolify we should introduce a type there and potentially also support that as a shorthand.


EmptyTap
}

override def tap(params: ReadP): Tap[Nothing] =
Contributor

Is it for some reason not possible to provide a tap to read these results?

Contributor

It is too difficult to get a query that represents the written data, I think.

neo4jOptions: Neo4jOptions,
unwindCypher: String,
batchSize: Long = WriteParam.BeamDefaultBatchSize
)(implicit params: ParametersBuilder[T]): ClosedTap[Nothing] =
Contributor

Similar to RowMapper, it'd be more idiomatic to provide this as a function rather than an implicit parameter.

@RustedBones RustedBones (Contributor) left a comment

We now have a single Neo4jIO supporting both read and write.
Conversion is handled by magnolify typeclass for Value.

The Record result is a 'neighbor' of Value, and allows index-based access.
Here we lose this feature: single values or tuples aren't supported on the read side.

build.sbt (review comment outdated, resolved)
Comment on lines 37 to 38
implicit private def recordConverter(record: Record): Value =
Values.value(record.asMap(identity))
Contributor

By doing this, we lose the index-based access capability.

.withTransactionConfig(neo4jOptions.transactionConfig)
.withCypher(cypher)
.withRowMapper(neo4jType.from(_))
.withCoder(CoderMaterializer.beam(sc, coder))
Contributor

coder was missing

@RustedBones RustedBones added this to the 0.12.0 milestone Sep 19, 2022
@RustedBones RustedBones merged commit 6a11cf4 into spotify:main Sep 23, 2022
5 participants