[BUG] - Databricks absolute path must be written differently #516

Closed
ghost opened this issue May 23, 2023 · 6 comments
Labels
bug: Something isn't working

Comments


ghost commented May 23, 2023

Description

I am trying to distribute my ingestion areas across several containers in an Azure Data Lake, mounted on a DBFS path on the Databricks platform (e.g. /mnt/starlake).
To do so, I set the environment variable SL_AREA_PENDING=dbfs://mnt/pending_area for the pending area, for example.
The problem: when you specify an absolute path on Databricks with the filesystem scheme included, you have to write it with a single '/'.

%fs ls dbfs://mnt/starlake gives us this error:
IllegalArgumentException: Hostname not allowed in dbfs uri. Please use 'dbfs:/' instead of 'dbfs://' in uri: dbfs://mnt/starlake

While %fs ls dbfs:/mnt/starlake works correctly.
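
For what it's worth, a small JVM-side illustration of why the double slash fails; this is just generic java.net.URI parsing (the object name is mine), not Databricks' actual validation code:

```scala
import java.net.URI

object UriDemo {
  def main(args: Array[String]): Unit = {
    // With "dbfs://mnt/starlake", everything after "//" up to the next "/"
    // is parsed as an authority, so "mnt" becomes a hostname.
    val doubleSlash = new URI("dbfs://mnt/starlake")
    println(doubleSlash.getHost) // mnt
    println(doubleSlash.getPath) // /starlake

    // With "dbfs:/mnt/starlake" there is no authority component at all:
    // the full "/mnt/starlake" is the path, and the host is null.
    val singleSlash = new URI("dbfs:/mnt/starlake")
    println(singleSlash.getHost) // null
    println(singleSlash.getPath) // /mnt/starlake
  }
}
```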

However, I can't change SL_AREA_PENDING to dbfs:/mnt/pending_area either, because according to your code (DatasetArea.scala, line 49), a path is recognized as absolute only if it contains "://".
So if I set the variable this way, starlake treats the path as relative and concatenates SL_DATASETS with SL_AREA_PENDING, producing "dbfs:/mnt/datasets_area/dbfs:/mnt/pending_area", which fails at runtime.

It might therefore be useful to change this condition to contains(":/"), or to add a specific condition for Databricks; see the sketch below.
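
For illustration, a minimal sketch of the check involved; the object and method names are hypothetical, not starlake's actual code:

```scala
object PathCheck {
  // Current behaviour: only URIs containing "://" count as absolute,
  // so "dbfs:/mnt/pending_area" is wrongly treated as relative.
  def isAbsoluteCurrent(path: String): Boolean = path.contains("://")

  // Proposed behaviour: also accept single-slash scheme URIs.
  def isAbsoluteProposed(path: String): Boolean = path.contains(":/")

  def main(args: Array[String]): Unit = {
    val p = "dbfs:/mnt/pending_area"
    println(isAbsoluteCurrent(p))  // false -> gets prefixed with SL_DATASETS
    println(isAbsoluteProposed(p)) // true  -> kept as-is
  }
}
```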

ghost added the bug label May 23, 2023
hayssams (Contributor) commented May 23, 2023

Thank you for reporting this @ametwalli1
When the hostname is not present, the correct syntax for an xdfs-compatible filesystem is xdfs:///.
Could you try using dbfs:///mnt/pending_area instead and let me know how this works for you.

ghost (Author) commented May 23, 2023

Thank you very much @hayssams, using dbfs:///mnt/pending_area as SL_AREA_PENDING works fine!

However, I have run into another issue.

When running the import with this setting, starlake pushes all the source files without filtering by domain. This filtering used to happen by default in the DatasetArea.scala path function, which appended the /$domain variable to the end of the path, but it no longer does.

So, to create the domain directories in my pending area, I thought dbfs:///mnt/pending_area/$domain would work in my cluster environment. It didn't change anything ($domain was ignored), so I tried dbfs:///mnt/pending_area/{domain} instead, and it created a literal {domain} directory.

Do you know how I can use variables in my cluster environment?

hayssams (Contributor) commented

Could you try {{domain}} instead and let me know?

ghost (Author) commented May 24, 2023

dbfs:///mnt/pending_area/{{domain}} also creates a literal {{domain}} directory.
The only accepted way to write a variable seems to be /$domain, but I guess some regex validation on your side causes it to be silently ignored.

A better way to solve this would be to append /$domain to the end of the area in DatasetArea.scala, line 50, as sketched below.
Or maybe there is another way to introduce variables.
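
To make the suggestion concrete, a minimal sketch of the kind of change I have in mind, assuming hadoop-common on the classpath; the names are hypothetical, not the actual starlake code:

```scala
import org.apache.hadoop.fs.Path

object DomainPath {
  // Hypothetical sketch: always suffix the configured area with the domain,
  // whether the area was given as an absolute URI or a relative name.
  def pendingPath(pendingArea: String, domain: String): Path =
    new Path(new Path(pendingArea), domain)

  def main(args: Array[String]): Unit =
    // prints dbfs:///mnt/pending_area/sales
    println(pendingPath("dbfs:///mnt/pending_area", "sales"))
}
```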

hayssams (Contributor) commented

Can we set up a call at 5:30 PM?

ghost (Author) commented May 24, 2023

Sure! Here is my email: a.metwalli@groupeonepoint.com

ghost mentioned this issue May 24, 2023
hayssams linked a pull request May 24, 2023 that will close this issue