Import Error with Databricks #48
Hi,
Seems to be a problem with the transaction log. It's hard to be sure without looking at the full stack trace.
The library is built using Spark 2.1. I've been meaning to upgrade to 2.2, and I'll be doing that over the next few days.
I would say try aiming it at a 2.1 cluster, or build a custom package and see how it goes.
Regards,
Sam
---
Wow! Thank you for the incredibly fast response! I'm launching a 2.1 cluster now, will let you know how it goes. EDIT: In case this matters, here is the complete error:
---
Update: So using a v2.1 cluster fixed the initial issue, in that I can read data from BigQuery into Spark. The table does indeed display (!) Writing to the other database is still producing this weird error, and this time no data came through (besides the schema):
Here is the complete error:
---
Hey, sorry, I'm at a conference so won't have a chance to really look at this until after the conference.
So just to clarify, this is a fresh cluster with the Google creds JSON in it (it must be, since you wouldn't be able to read if it wasn't)?
Does it fail every time, or only after a while? Also, is it a structured streaming save or a batch save, i.e. a one-off save to BigQuery?
…On Wed, 15 Nov 2017 at 00:08, mayankshah891 ***@***.***> wrote:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in stage 3.0 (TID 8, 10.102.229.60, executor 1): java.lang.IllegalStateException: Found known file 'data-000000000002.avro' with index 2, which isn't less than or equal to than endFileNumber 0!
Here is the complete error:
Py4JJavaError Traceback (most recent call last)
in ()
----> 1 Events_data.write.format("jdbc").option("url", "jdbc:mapd:ec2-18-220-187-39.us-east-2.compute.amazonaws.com:9091:mapd").option("driver", "com.mapd.jdbc.MapDDriver").option("dbtable", "InAppEvents3").option("user", "mapd").option("password", "i-053bdfd74c7ab374b").mode("overwrite").save()
/databricks/spark/python/pyspark/sql/readwriter.py in save(self, path, format, mode, partitionBy, **options)
546 self.format(format)
547 if path is None:
--> 548 self._jwrite.save()
549 else:
550 self._jwrite.save(path)
/databricks/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/databricks/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(
Py4JJavaError: An error occurred while calling o186.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in stage 3.0 (TID 8, 10.102.229.60, executor 1): java.lang.IllegalStateException: Found known file 'data-000000000002.avro' with index 2, which isn't less than or equal to than endFileNumber 0!
    at com.google.common.base.Preconditions.checkState(Preconditions.java:177)
    at com.google.cloud.hadoop.io.bigquery.DynamicFileListRecordReader.setEndFileMarkerFile(DynamicFileListRecordReader.java:327)
    at com.google.cloud.hadoop.io.bigquery.DynamicFileListRecordReader.nextKeyValue(DynamicFileListRecordReader.java:177)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:199)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:578)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:670)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:670)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1963)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1963)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1442)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1430)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1429)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1429)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:803)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:803)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:803)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1657)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1612)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1601)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1937)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1950)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1963)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1977)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:926)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:924)
    at org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.apply$mcV$sp(Dataset.scala:2323)
    at org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.apply(Dataset.scala:2323)
    at org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.apply(Dataset.scala:2323)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:87)
    at org.apache.spark.sql.execution.SQLExecution$.withFileAccessAudit(SQLExecution.scala:60)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:70)
    at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2777)
    at org.apache.spark.sql.Dataset.foreachPartition(Dataset.scala:2322)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.saveTable(JdbcUtils.scala:670)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:90)
    at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:488)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$save$1.apply$mcV$sp(DataFrameWriter.scala:217)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$save$1.apply(DataFrameWriter.scala:209)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$save$1.apply(DataFrameWriter.scala:209)
    at org.apache.spark.sql.execution.SQLExecution$.withFileAccessAudit(SQLExecution.scala:53)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:209)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalStateException: Found known file 'data-000000000002.avro' with index 2, which isn't less than or equal to than endFileNumber 0!
    at com.google.common.base.Preconditions.checkState(Preconditions.java:177)
    at com.google.cloud.hadoop.io.bigquery.DynamicFileListRecordReader.setEndFileMarkerFile(DynamicFileListRecordReader.java:327)
    at com.google.cloud.hadoop.io.bigquery.DynamicFileListRecordReader.nextKeyValue(DynamicFileListRecordReader.java:177)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:199)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:578)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:670)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:670)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1963)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1963)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    ... 1 more
---
It's a fresh cluster, and I have the credentials saved via JSON correctly (I'm assuming so, because I was able to read at least once). Now every time I fire up the cluster, it fails to read the table initially. It is a batch save; I was going to run it every night. This is the error:
---
That seems to be a problem with the data being read in; the connector doesn't read or use Avro.
Is the data being read Avro?
…On Wed, Nov 15, 2017 at 5:52 PM, mayankshah891 ***@***.***> wrote:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, 10.102.245.245, executor 0): java.lang.IllegalStateException: Found known file 'data-000000000004.avro' with index 4, which isn't less than or equal to than endFileNumber 2!
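[Editor's note: one way to check whether the export shards on disk are Avro is to look for the Avro object container magic header, which is the 4 bytes `Obj` followed by `0x01`. A minimal sketch; the sample file below is generated only to demonstrate the check, and in practice you would point `is_avro_file` at one of the `data-*.avro` shards:]

```python
import os
import tempfile

def is_avro_file(path):
    """Return True if the file begins with the Avro object-container magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == b"Obj\x01"

# Demonstrate with a dummy file that carries the Avro magic header.
sample = os.path.join(tempfile.mkdtemp(), "sample.avro")
with open(sample, "wb") as f:
    f.write(b"Obj\x01" + b"\x00" * 16)

print(is_avro_file(sample))  # True
```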
---
How do I check if the data is Avro? It's a table in BigQuery (obviously); the person who put it there says he doesn't know what Avro is, so I'm assuming not? Again, it worked once and just hasn't worked since, but I still have the print statement showing the table in Databricks, so I'm 100% sure it worked at least once.
---
Can you make sure you are using my connector and not the Spotify one? I believe they convert to Avro.
---
Actually, my mistake: digging into the code, I see the Google Hadoop connector is using Avro:
https://github.com/samelamin/spark-bigquery/blob/ed716c8/src/main/scala/com/samelamin/spark/bigquery/BigQuerySQLContext.scala#L89
I am not near my laptop at the moment, but I will have a look tomorrow.
In the meantime, try a restart/fresh cluster just to make sure it's nothing to do with the state. The connector definitely works on Databricks; I am just not sure why only you are facing the problem. My only guess is that it's something with the cluster setup.
---
Okay, thank you so much for your time, and for building this in the first place. I'll try some different clusters and setups and we'll see. Thank you again.
---
I've tried a bunch of different clusters, with mixed results. I can read in the data sometimes, but if I try to call .show(200) on the data, it fails. Again the error seems to be something to do with partitions? Databricks seems to believe it is done reading the data, but then another file pops up that is past the endFileNumber, and then it throws the error.
Someone else was running into an error like this here, but I haven't cracked how to solve it myself yet: https://stackoverflow.com/questions/46404081/illegalstateexception-on-google-cloud-dataproc
EDIT: To recap: I'm reading from a BigQuery table that is being updated daily. It is a batch read, which I then attempt to view fully and write to another server. So far it intermittently fails at each step of the process.
---
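[Editor's note: the exception itself comes from a `Preconditions.checkState` call in the connector's `DynamicFileListRecordReader`: once the reader has seen the end-marker file that fixes `endFileNumber`, any data shard with a higher index is treated as impossible. A plain-Python sketch of that invariant; class and method names here are illustrative, not the connector's actual API:]

```python
class DynamicFileListModel:
    """Illustrative model of the reader's invariant: once an end marker has
    fixed end_file_number, a shard with a larger index is a state error."""

    def __init__(self):
        self.end_file_number = None

    def set_end_marker(self, end_file_number):
        # Corresponds to the reader observing the export's end-marker file.
        self.end_file_number = end_file_number

    def add_shard(self, index):
        # Corresponds to the reader discovering a new data-*.avro shard.
        if self.end_file_number is not None and index > self.end_file_number:
            raise RuntimeError(
                f"Found known file with index {index}, which isn't less than "
                f"or equal to endFileNumber {self.end_file_number}!")

reader = DynamicFileListModel()
reader.set_end_marker(0)   # one pass saw the export as complete at file 0
try:
    reader.add_shard(2)    # a later pass finds a new shard anyway
except RuntimeError as e:
    print(e)               # the same shape of message as the Spark error
```

If the export is still being appended to while Spark re-reads it, a later pass can see shards past the end marker recorded on an earlier pass, which would match the intermittent failures described above.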
Sam, I think I figured out the problem. The BigQuery table I'm reading from is streaming new data every few seconds, so the lazy transformations in Spark are breaking the code because the file size keeps changing! I just need to call .persist or .cache to hold the data in memory, and then it works! (Well, sort of works; it's not writing properly, but I think that can be fixed as well.)
Tentatively I think this issue can be closed. Thanks again.
---
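[Editor's note: the lazy-read race described above can be illustrated in plain Python, with a generator over a mutating list standing in for an uncached DataFrame over a table that is still receiving rows. This is an analogy only, not PySpark itself:]

```python
# A plain-Python analogy for Spark's lazy evaluation over a changing source.
source = [1, 2, 3]      # stands in for the BigQuery export being streamed into

def scan():
    # Lazy "read": nothing is materialised until the generator is consumed,
    # like an uncached DataFrame re-running its source read on every action.
    return (row for row in source)

source.append(4)        # the "stream" writes new data before the action runs
print(list(scan()))     # [1, 2, 3, 4] -- the action sees the mutated source

# Materialise once, the analogue of df.persist()/df.cache() plus an action:
snapshot = list(scan())
source.append(5)        # further writes no longer affect the snapshot
print(snapshot)         # [1, 2, 3, 4]
```

In PySpark the corresponding fix is roughly `df.persist()` (or `df.cache()`) followed by an action such as `df.count()` to force materialisation before the downstream `.show()` or `.write`.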
Hi, no worries, glad we got there in the end.
Had no idea you were streaming!
Can you close the issue please?
---
I am attempting to read a table from BigQuery and write it to another database.
On my first few tries, I was able to read the table from BigQuery, but when I attempted to write said table to the database, I got the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 13.0 failed 4 times, most recent failure: Lost task 1.3 in stage 13.0 (TID 41, 10.102.241.112, executor 0): java.lang.IllegalStateException: Found known file 'data-000000000002.avro' with index 2, which isn't less than or equal to than endFileNumber 0!
Then, I checked our server, and it had actually worked! So I figured never mind, let me run the process all the way through again.
However, now I can't even read the initial table from BigQuery, with the exact same code. It returns this error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 20.0 failed 4 times, most recent failure: Lost task 0.3 in stage 20.0 (TID 53, 10.102.249.174, executor 0): java.lang.IllegalStateException: Found known file 'data-000000000002.avro' with index 2, which isn't less than or equal to than endFileNumber 1!
Any clues? I'm using Databricks to import a table from BigQuery. The cluster I'm using is Databricks 3.3, which includes Spark 2.2.0 and Scala 2.11.
Thank you so much for your time and expertise