
spark_read_delta fails when connected through Databricks Connect #3091

Closed
tsengj opened this issue Jun 1, 2021 · 2 comments


tsengj commented Jun 1, 2021

spark_read_delta fails when connected through Databricks Connect.


spark_read_delta works when I'm in an R notebook within Databricks.
spark_read_delta also works when I create a table within Databricks and then run spark_read_delta (from my RStudio Desktop) to import the SQL-created table below:

CREATE TABLE tbl
USING delta
AS SELECT *
FROM delta.<path>
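
(For reference, a minimal sketch of reading the table created above back by name rather than by path — this assumes the table name tbl from the SQL statement; spark_read_table is sparklyr's name-based reader:)

library(sparklyr)
# Load the catalog table "tbl" (created by the CREATE TABLE statement above)
# into the Spark connection; reading by name avoids spelling out the
# underlying storage path.
tbl_by_name <- spark_read_table(sc, "tbl")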

But when I am running RStudio Desktop, connected successfully through Databricks Connect, spark_read_delta fails when I issue a command to Databricks to read the table directly from its mounted path:

sc <- spark_connect(spark_home = "c:/programdata/anaconda3/lib/site-packages/pyspark", method = "databricks")
tbl <- spark_read_delta(sc, path = "/mnt/dataLake/<ms storage blob delta path>/")

Error message below:

Error: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: H:%5Cmnt%5CdataLake%5C<path>%5C
	at org.apache.hadoop.fs.Path.initialize(Path.java:205)
	at org.apache.hadoop.fs.Path.<init>(Path.java:171)
	at com.databricks.sql.transaction.tahoe.DeltaValidation$.validateDeltaRead(DeltaValidation.scala:94)
	at org.apache.spark.sql.DataFrameReader.preprocessDeltaLoading(DataFrameReader.scala:308)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:356)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:278)
	at com.databricks.service.SparkServiceRPCHandler$$anon$1.call(SparkServiceRPCHandler.scala:99)
	at com.databricks.service.SparkServiceRPCHandler$$anon$1.call(SparkServiceRPCHandler.scala:80)
	at com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4724)
	at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3522)
	at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2315)
	at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278)
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2193)
	at com.google.common.cache.LocalCache.get(LocalCache.java:3932)
	at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4721)
	at com.databricks.service.SparkServiceRPCHandler$.getOrLoadAnonymousRelation(SparkServiceRPCHandler.scala:80)
	at com.databricks.service.SparkServiceRPCHandler.execute0(SparkServiceRPCHandler.scala:711)
	at com.databricks.service.SparkServiceRPCHandler.$anonfun$executeRPC0$1(SparkServiceRPCHandler.scala:474)
	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
	at com.databricks.service.SparkServiceRPCHandler.executeRPC0(SparkServiceRPCHandler.scala:370)
	at com.databricks.service.SparkServiceRPCHandler$$anon$2.call(SparkServiceRPCHandler.scala:321)
	at com.databricks.service.SparkServiceRPCHandler$$anon$2.call(SparkServiceRPCHandler.scala:307)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at com.databricks.service.SparkServiceRPCHandler.$anonfun$executeRPC$1(SparkServiceRPCHandler.scala:357)
	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
	at com.databricks.service.SparkServiceRPCHandler.executeRPC(SparkServiceRPCHandler.scala:334)
	at com.databricks.service.SparkServiceRPCServlet.doPost(SparkServiceRPCServer.scala:152)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:791)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:550)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
	at org.eclipse.jetty.server.Server.handle(Server.java:516)
	at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:388)
	at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:633)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:380)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:273)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
	at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:375)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:773)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:905)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: H:%5Cmnt%5CdataLake%5C<path>%5C
	at java.net.URI.checkPath(URI.java:1849)
	at java.net.URI.<init>(URI.java:745)
	at org.apache.hadoop.fs.Path.initialize(Path.java:202)
	... 50 more

My R session

> sessionInfo()
R version 4.0.5 (2021-03-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252    LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] sparklyr_1.6.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6        pillar_1.6.0      compiler_4.0.5    dbplyr_2.1.1      plyr_1.8.6        r2d3_0.2.5        base64enc_0.1-3   tools_4.0.5      
 [9] uuid_0.1-4        digest_0.6.27     jsonlite_1.7.2    lifecycle_1.0.0   tibble_3.1.1      gtable_0.3.0      pkgconfig_2.0.3   rlang_0.4.10     
[17] rstudioapi_0.13   DBI_1.1.1         parallel_4.0.5    yaml_2.2.1        xfun_0.22         gridExtra_2.3     withr_2.4.2       dplyr_1.0.5      
[25] stringr_1.4.0     httr_1.4.2        askpass_1.1       generics_0.1.0    vctrs_0.3.8       htmlwidgets_1.5.3 rprojroot_2.0.2   grid_4.0.5       
[33] tidyselect_1.1.0  glue_1.4.2        forge_0.2.0       R6_2.5.0          fansi_0.4.2       tidyr_1.1.3       reshape2_1.4.4    purrr_0.3.4      
[41] magrittr_2.0.1    htmltools_0.5.1.1 ellipsis_0.3.1    assertthat_0.2.1  config_0.3.1      utf8_1.2.1        tinytex_0.31      stringi_1.5.3    
[49] openssl_1.4.3     crayon_1.4.1     

nviraj commented Jun 1, 2021

I think this is related either to the path or to how Databricks Connect is configured.
The error message showing an "H:" drive points to a path issue.

If you are using DBFS, what happens when you use this instead?
tbl <- spark_read_delta(sc, path = "dbfs:/mnt/dataLake/<ms storage blob delta path>/")

Links which might be helpful:
How to specify dbfs path?
Other storage systems
Related issue (but it should have worked in 1.6.2, which you are using)

P.S. I switched to Databricks very recently, so I could be wrong. Apologies in advance if that's the case.


yitao-li commented Jun 1, 2021

I agree with what nviraj mentioned above. If path is meant to be a DBFS path, then prefixing it with the dbfs:/ scheme is required; otherwise it is indistinguishable from a local path. And because you are on Windows, the /mnt/dataLake/... path you specified gets interpreted as a relative path on the local file system, which is clearly not what you want.
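
A minimal sketch of the corrected call, assuming the Delta table lives under the same DBFS mount as in the original report (the placeholder path is kept as-is):

library(sparklyr)

# Connect via Databricks Connect, as in the original report.
sc <- spark_connect(
  spark_home = "c:/programdata/anaconda3/lib/site-packages/pyspark",
  method = "databricks"
)

# Prefix the mount path with the dbfs:/ scheme so Spark resolves it on DBFS
# instead of treating it as a relative path on the local Windows file system
# (the cause of the "Relative path in absolute URI: H:..." error above).
tbl <- spark_read_delta(sc, path = "dbfs:/mnt/dataLake/<ms storage blob delta path>/")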
