spark_read_jdbc returns columns with quotes instead of backticks #3196

Open
crogers923 opened this issue Nov 1, 2021 · 2 comments


@crogers923

spark_read_jdbc quotes column names with double quotes instead of backticks in the query it generates. This causes the query to return the column names as string literals instead of the data.

sc <- spark_connect(master = "yarn")
spark_read_jdbc(sc, "tbl_nm", options = list(
  url = "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;..",
  driver = "com.simba.googlebigquery.jdbc.Driver",
  dbtable = "project.dataset.tbl_nm",
  partitionColumn = "COLUMN1",
  lowerBound = 1,
  upperBound = 100,
  numPartitions = 100
))

Generates (as an example):
SELECT "COLUMN1", "COLUMN2", "COLUMN3" FROM project.dataset.tbl_nm where "COLUMN1" >= 1;

This results in the error: "No matching signature for operator >= for argument types: STRING, INT64", because BigQuery standard SQL treats double-quoted tokens as string literals rather than column identifiers.

This should instead be:
SELECT `COLUMN1`, `COLUMN2`, `COLUMN3` FROM project.dataset.tbl_nm where `COLUMN1` >= 1;
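
One possible workaround, not something sparklyr provides out of the box: Spark's JDBC source delegates identifier quoting to a JdbcDialect, and since no built-in dialect matches the Simba BigQuery driver, the default double-quote behaviour is used. A minimal Scala sketch of a dialect that quotes with backticks might look like the following (the object name BigQueryDialect is made up, and it assumes the driver's URLs start with jdbc:bigquery; it has to run on the JVM side, e.g. pasted into spark-shell or compiled into a jar on the cluster classpath, before the read):

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

object BigQueryDialect extends JdbcDialect {
  // Apply this dialect to Simba BigQuery JDBC URLs
  override def canHandle(url: String): Boolean =
    url.toLowerCase.startsWith("jdbc:bigquery")

  // Quote identifiers with backticks (BigQuery standard SQL) instead of
  // Spark's default double quotes
  override def quoteIdentifier(colName: String): String =
    s"`$colName`"
}

// Must be registered before spark_read_jdbc() runs
JdbcDialects.registerDialect(BigQueryDialect)

With such a dialect registered, the generated query should quote the partition column as `COLUMN1`, so the predicate compares the column value rather than a string literal.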

@gboyega1

gboyega1 commented Jul 13, 2022

Hello. How were you able to implement this connectivity between sparklyr and BigQuery? I keep getting the following error:
Error: java.lang.NoSuchMethodError: 'java.util.List com.google.common.base.Splitter.splitToList(java.lang.CharSequence)'
Do you have any idea what I might be doing wrong?

@gboyega1

Hello again. I found that the Guava jar versions used by Spark and the Simba BigQuery connector are different (14.0.1 and 31.1 respectively), so I had to replace the one shipped with Spark. I've also since had to replace or add a few other jars in the Spark jars folder. Now I have the following error message:

Failed to fetch data: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 24.0 failed 4 times, most recent failure: Lost task 0.3 in stage 24.0 (TID 39) (cluster-xxx.location-yyy.project-zzz.internal executor 2): java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to long.
  at com.simba.googlebigquery.exceptions.ExceptionConverter.toSQLException(Unknown Source)
  at com.simba.googlebigquery.utilities.conversion.TypeConverter.toLong(Unknown Source)
  at com.simba.googlebigquery.jdbc.common.SForwardResultSet.getLong(Unknown Source)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$9(JdbcUtils.scala:446)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$9$adapted(JdbcUtils.scala:445)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:367)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:349)
  at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
  at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
  at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:349)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:131)
  at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:750)

I've read a few blogs suggesting that "implementing a custom JDBC dialect" would solve this; unfortunately, I have no idea how that is done, and my coding ability is limited to R, some Python and SQL. I wonder if you have a more straightforward workaround, or can explain how to go about implementing the custom JDBC dialect.
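
For reference, a custom JDBC dialect is a small Scala object registered with Spark. The general shape is sketched below, hedged heavily: BigQueryTypeDialect is a made-up name, and the right type mapping depends on the BigQuery type of the column that fails the toLong conversion, so the NUMERIC-to-DecimalType(38, 9) line is only an illustrative guess.

import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types._

object BigQueryTypeDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.toLowerCase.startsWith("jdbc:bigquery")

  // Tell Spark which Catalyst type to use for a JDBC type reported by the
  // driver, so it stops calling getLong on a column that isn't a long.
  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    sqlType match {
      case Types.NUMERIC => Some(DecimalType(38, 9)) // illustrative mapping only
      case _             => None                     // fall back to Spark's defaults
    }
}

JdbcDialects.registerDialect(BigQueryTypeDialect)

As far as I know this has to be compiled into a jar (or pasted into spark-shell) on the JVM side; there is no pure-R way to define the dialect, but once it is on the cluster classpath the sparklyr code itself stays unchanged.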
