Import Error with Databricks #48
Hi,
Seems to be a problem with the transaction log. It's hard to be sure without looking at the full stack trace.
The library is built using Spark 2.1. I've been meaning to upgrade to 2.2, and I'll be doing that over the next few days.
I would say try aiming it at a 2.1 cluster, or build a custom package and see how it goes.
Regards,
Sam
---
Wow! Thank you for the incredibly fast response! I'm launching a 2.1 cluster now, will let you know how it goes. EDIT: In case this matters, here is the complete error:
---
Update: So using a v2.1 cluster fixed the initial issue, in that I can read data from BigQuery into Spark. The table does indeed display (!) Writing to the other database is still producing this weird error, and this time no data came through (besides the schema):
Here is the complete error:
---
Hey, sorry, I'm at a conference so won't have a chance to really look at this until after the conference.
So just to clarify, this is a fresh cluster with the Google creds JSON in it (it must be, since you wouldn't be able to read if it wasn't)?
Does it fail every time, or only after a while? Also, is it a structured streaming save or a batch save, i.e. a one-off save to BigQuery?
…On Wed, 15 Nov 2017 at 00:08, mayankshah891 ***@***.***> wrote:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in stage 3.0 (TID 8, 10.102.229.60, executor 1): java.lang.IllegalStateException: Found known file 'data-000000000002.avro' with index 2, which isn't less than or equal to than endFileNumber 0!
Here is the complete error:
Py4JJavaError Traceback (most recent call last)
in ()
----> 1 Events_data.write.format("jdbc").option("url", "jdbc:mapd:ec2-18-220-187-39.us-east-2.compute.amazonaws.com:9091:mapd").option("driver", "com.mapd.jdbc.MapDDriver").option("dbtable", "InAppEvents3").option("user", "mapd").option("password", "i-053bdfd74c7ab374b").mode("overwrite").save()
/databricks/spark/python/pyspark/sql/readwriter.py in save(self, path, format, mode, partitionBy, **options)
546 self.format(format)
547 if path is None:
--> 548 self._jwrite.save()
549 else:
550 self._jwrite.save(path)
/databricks/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/databricks/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(
Py4JJavaError: An error occurred while calling o186.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in stage 3.0 (TID 8, 10.102.229.60, executor 1): java.lang.IllegalStateException: Found known file 'data-000000000002.avro' with index 2, which isn't less than or equal to than endFileNumber 0!
    at com.google.common.base.Preconditions.checkState(Preconditions.java:177)
    at com.google.cloud.hadoop.io.bigquery.DynamicFileListRecordReader.setEndFileMarkerFile(DynamicFileListRecordReader.java:327)
    at com.google.cloud.hadoop.io.bigquery.DynamicFileListRecordReader.nextKeyValue(DynamicFileListRecordReader.java:177)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:199)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:578)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:670)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:670)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1963)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1963)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1442)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1430)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1429)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1429)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:803)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:803)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:803)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1657)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1612)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1601)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1937)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1950)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1963)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1977)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:926)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:924)
    at org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.apply$mcV$sp(Dataset.scala:2323)
    at org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.apply(Dataset.scala:2323)
    at org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.apply(Dataset.scala:2323)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:87)
    at org.apache.spark.sql.execution.SQLExecution$.withFileAccessAudit(SQLExecution.scala:60)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:70)
    at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2777)
    at org.apache.spark.sql.Dataset.foreachPartition(Dataset.scala:2322)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.saveTable(JdbcUtils.scala:670)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:90)
    at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:488)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$save$1.apply$mcV$sp(DataFrameWriter.scala:217)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$save$1.apply(DataFrameWriter.scala:209)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$save$1.apply(DataFrameWriter.scala:209)
    at org.apache.spark.sql.execution.SQLExecution$.withFileAccessAudit(SQLExecution.scala:53)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:209)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalStateException: Found known file 'data-000000000002.avro' with index 2, which isn't less than or equal to than endFileNumber 0!
    at com.google.common.base.Preconditions.checkState(Preconditions.java:177)
    at com.google.cloud.hadoop.io.bigquery.DynamicFileListRecordReader.setEndFileMarkerFile(DynamicFileListRecordReader.java:327)
    at com.google.cloud.hadoop.io.bigquery.DynamicFileListRecordReader.nextKeyValue(DynamicFileListRecordReader.java:177)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:199)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:578)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:670)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:670)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1963)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1963)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    ... 1 more
---
It's a fresh cluster, and I have the credentials saved via JSON correctly (I'm assuming so, because I was able to read at least once). Now every time I fire up the cluster, it fails to read the table initially. It is a batch save; I was going to run it every night. This is the error:
---
That seems to be a problem with the data being read in; the connector doesn't read or use Avro.
Is the data being read Avro?
…On Wed, Nov 15, 2017 at 5:52 PM, mayankshah891 ***@***.***> wrote:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, 10.102.245.245, executor 0): java.lang.IllegalStateException: Found known file 'data-000000000004.avro' with index 4, which isn't less than or equal to than endFileNumber 2!
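[Editor's note: one way to check whether the export shards on disk are Avro is to look for the Avro object container magic header, which is the 4 bytes `Obj` followed by `0x01`. A minimal sketch; the sample file below is generated only to demonstrate the check, and in practice you would point `is_avro_file` at one of the `data-*.avro` shards:]

```python
import os
import tempfile

def is_avro_file(path):
    """Return True if the file begins with the Avro object-container magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == b"Obj\x01"

# Demonstrate with a dummy file that carries the Avro magic header.
sample = os.path.join(tempfile.mkdtemp(), "sample.avro")
with open(sample, "wb") as f:
    f.write(b"Obj\x01" + b"\x00" * 16)

print(is_avro_file(sample))  # True
```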
---
How do I check if the data is Avro? It's a table in BigQuery (obviously); the person who put it there says he doesn't know what Avro is, so I'm assuming not? Again, it worked once and just hasn't worked since, but I still have the print statement showing the table in Databricks, so I'm 100% sure it worked at least once.
---
Can you make sure you are using my connector and not the Spotify one? I believe they convert to Avro.
---
Actually, my mistake: digging into the code, I see the Google Hadoop connector is using Avro:
https://github.com/samelamin/spark-bigquery/blob/ed716c8/src/main/scala/com/samelamin/spark/bigquery/BigQuerySQLContext.scala#L89
I am not near my laptop at the moment, but I will have a look tomorrow.
In the meantime, try a restart/fresh cluster just to make sure it's nothing to do with the state. The connector definitely works on Databricks; I am just not sure why only you are facing the problem. My only guess is that it's something with the cluster setup.
---
Okay, thank you so much for your time, and for building this in the first place. I'll try some different clusters and setups and we'll see. Thank you again.
---
I've tried a bunch of different clusters, with mixed results. I can read in the data sometimes, but if I try to call .show(200) on the data, it fails. Again the error seems to be something to do with partitions? Databricks seems to believe it is done reading the data, but then another file pops up that is past the endFileNumber, and then it throws the error.
Someone else was running into an error like this here, but I haven't cracked how to solve it myself yet: https://stackoverflow.com/questions/46404081/illegalstateexception-on-google-cloud-dataproc
EDIT: To recap: I'm reading from a BigQuery table that is being updated daily. It is a batch read, which I then attempt to view fully and write to another server. So far it intermittently fails at each step of the process.
---
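[Editor's note: the exception itself comes from a `Preconditions.checkState` call in the connector's `DynamicFileListRecordReader`: once the reader has seen the end-marker file that fixes `endFileNumber`, any data shard with a higher index is treated as impossible. A plain-Python sketch of that invariant; class and method names here are illustrative, not the connector's actual API:]

```python
class DynamicFileListModel:
    """Illustrative model of the reader's invariant: once an end marker has
    fixed end_file_number, a shard with a larger index is a state error."""

    def __init__(self):
        self.end_file_number = None

    def set_end_marker(self, end_file_number):
        # Corresponds to the reader observing the export's end-marker file.
        self.end_file_number = end_file_number

    def add_shard(self, index):
        # Corresponds to the reader discovering a new data-*.avro shard.
        if self.end_file_number is not None and index > self.end_file_number:
            raise RuntimeError(
                f"Found known file with index {index}, which isn't less than "
                f"or equal to endFileNumber {self.end_file_number}!")

reader = DynamicFileListModel()
reader.set_end_marker(0)   # one pass saw the export as complete at file 0
try:
    reader.add_shard(2)    # a later pass finds a new shard anyway
except RuntimeError as e:
    print(e)               # the same shape of message as the Spark error
```

If the export is still being appended to while Spark re-reads it, a later pass can see shards past the end marker recorded on an earlier pass, which would match the intermittent failures described above.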
Sam, I think I figured out the problem. The BigQuery table I'm reading from is streaming new data every few seconds, so the lazy transformations in Spark are breaking the code because the file size keeps changing! I just need to call .persist or .cache to hold the data in memory, and then it works! (Well, sort of works; it's not writing properly, but I think that can be fixed as well.)
Tentatively I think this issue can be closed. Thanks again.
---
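[Editor's note: the lazy-read race described above can be illustrated in plain Python, with a generator over a mutating list standing in for an uncached DataFrame over a table that is still receiving rows. This is an analogy only, not PySpark itself:]

```python
# A plain-Python analogy for Spark's lazy evaluation over a changing source.
source = [1, 2, 3]      # stands in for the BigQuery export being streamed into

def scan():
    # Lazy "read": nothing is materialised until the generator is consumed,
    # like an uncached DataFrame re-running its source read on every action.
    return (row for row in source)

source.append(4)        # the "stream" writes new data before the action runs
print(list(scan()))     # [1, 2, 3, 4] -- the action sees the mutated source

# Materialise once, the analogue of df.persist()/df.cache() plus an action:
snapshot = list(scan())
source.append(5)        # further writes no longer affect the snapshot
print(snapshot)         # [1, 2, 3, 4]
```

In PySpark the corresponding fix is roughly `df.persist()` (or `df.cache()`) followed by an action such as `df.count()` to force materialisation before the downstream `.show()` or `.write`.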
Hi, no worries, glad we got there in the end.
Had no idea you were streaming!
Can you close the issue please?
---
I am attempting to read a table from BigQuery and write it to another database.
On my first few tries, I was able to read the table from BigQuery, but when I attempted to write said table to the database, I got the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 13.0 failed 4 times, most recent failure: Lost task 1.3 in stage 13.0 (TID 41, 10.102.241.112, executor 0): java.lang.IllegalStateException: Found known file 'data-000000000002.avro' with index 2, which isn't less than or equal to than endFileNumber 0!
Then, I checked our server, and it had actually worked! So I figured never mind, let me run the process all the way through again.
However, now I can't even read the initial table from BigQuery, with the exact same code. It returns this error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 20.0 failed 4 times, most recent failure: Lost task 0.3 in stage 20.0 (TID 53, 10.102.249.174, executor 0): java.lang.IllegalStateException: Found known file 'data-000000000002.avro' with index 2, which isn't less than or equal to than endFileNumber 1!
Any clues? I'm using Databricks to import a table from BigQuery. The cluster I'm using is Databricks 3.3, which includes Spark 2.2.0 and Scala 2.11.
Thank you so much for your time and expertise