
Executors don't resolve dependencies #141

Closed
sbernauer opened this issue Sep 13, 2022 · 19 comments · Fixed by #281

Comments

@sbernauer
Member

sbernauer commented Sep 13, 2022

Affected version

0.5.0

Current and expected behavior

Following https://iceberg.apache.org/docs/latest/getting-started/

Current

Use

  deps:
    packages:
      # - org.apache.hadoop:hadoop-aws:3.3.3
      - org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:0.14.1

Driver logs:

:: loading settings :: url = jar:file:/stackable/spark-3.3.0-bin-hadoop3/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /stackable/.ivy2/cache
The jars for the packages stored in: /stackable/.ivy2/jars
org.apache.iceberg#iceberg-spark-runtime-3.3_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-ddd36894-c46d-4d2f-82b2-8a916f718eba;1.0
        confs: [default]
        found org.apache.iceberg#iceberg-spark-runtime-3.3_2.12;0.14.1 in central
downloading https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/0.14.1/iceberg-spark-runtime-3.3_2.12-0.14.1.jar ...
        [SUCCESSFUL ] org.apache.iceberg#iceberg-spark-runtime-3.3_2.12;0.14.1!iceberg-spark-runtime-3.3_2.12.jar (11980ms)
:: resolution report :: resolve 496ms :: artifacts dl 11983ms
        :: modules in use:
        org.apache.iceberg#iceberg-spark-runtime-3.3_2.12;0.14.1 from central in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   1   |   1   |   1   |   0   ||   1   |   1   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-ddd36894-c46d-4d2f-82b2-8a916f718eba
        confs: [default]
        1 artifacts copied, 0 already retrieved (29791kB/17ms)

Executors do not pull the dependencies and fail with java.lang.ClassNotFoundException: org.apache.iceberg.spark.source.SparkWrite$WriterFactory (a class that should come with org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:0.14.1).

Expected

Drivers and executors both pull the dependencies.

Possible solution

No response

Additional context

No response

Environment

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: write-iceberg-table
spec:
  version: "1.0"
  sparkImage: docker.stackable.tech/stackable/pyspark-k8s:3.3.0-stackable0.2.0
  mode: cluster
  mainApplicationFile: local:///stackable/spark/jobs/write-iceberg-table.py
  deps:
    packages:
      # - org.apache.hadoop:hadoop-aws:3.3.3
      - org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:0.14.1
  sparkConf:
    spark.hadoop.fs.s3a.aws.credentials.provider: org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider
    spark.sql.extensions: org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
    spark.sql.catalog.spark_catalog: org.apache.iceberg.spark.SparkSessionCatalog
    spark.sql.catalog.spark_catalog.type: hive
    spark.sql.catalog.local: org.apache.iceberg.spark.SparkCatalog
    spark.sql.catalog.local.type: hadoop
    spark.sql.catalog.local.warehouse: /tmp/warehouse
  volumes:
    - name: script
      configMap:
        name: write-iceberg-table-script
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    volumeMounts:
      - name: script
        mountPath: /stackable/spark/jobs
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    volumeMounts:
      - name: script
        mountPath: /stackable/spark/jobs
      # - name: job-deps
      #   mountPath: /dependencies
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: write-iceberg-table-script
data:
  write-iceberg-table.py: |
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("write-iceberg-table").getOrCreate()

    #df = spark.read.parquet("s3a://public-backup-nyc-tlc/trip-data/yellow_tripdata_2020-04.parquet")
    #df.show(10)

    print("FOO creating table")
    spark.sql("CREATE TABLE local.db.table (id bigint, data string) USING iceberg")
    spark.sql("INSERT INTO local.db.table VALUES (1, 'a'), (2, 'b'), (3, 'c')")
    spark.sql("SELECT * FROM local.db.table")
    spark.sql("SELECT * FROM local.db.table.snapshots")

Would you like to work on fixing this bug?

maybe

@sbernauer
Member Author

@soenkeliebau
Member

Stupid question probably, but can we influence the executor pods?

For example, add an init container with coursier to fetch all dependencies stated in deps.

https://get-coursier.io/docs/cli-fetch

@razvan
Member

razvan commented Oct 10, 2022

The operator lays out the pod templates for both the driver and executor, so this should be easy to do.

@adwk67
Member

adwk67 commented Oct 10, 2022

This may overlap a bit with #117.

@adwk67
Member

adwk67 commented Feb 2, 2023

Found https://issues.apache.org/jira/browse/SPARK-35084

Fixed in apache/spark#38828, due for inclusion in 3.4.0 (not released yet).

@adwk67
Member

adwk67 commented Feb 2, 2023

For example add an init container with coursier to fetch all dependencies of stuff stated in deps..
https://get-coursier.io/docs/cli-fetch

Maybe I'm doing something wrong, but I tested this quickly and SparkWrite wasn't picked up with cs:

cs fetch org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:0.14.1
https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/0.14.1/iceberg-spark-runtime-3.3_2.12-0.14.1.jar
  100.0% [##########] 29.1 MiB (4.6 MiB / s)
~/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/0.14.1/iceberg-spark-runtime-3.3_2.12-0.14.1.jar

@adwk67
Member

adwk67 commented Feb 21, 2023

According to this, new releases come out every couple of months: as 3.3.0 was released in July 2022, 3.4.0 could be on its way in the medium term. I wasn't able to find a roadmap date for it, though. This issue applies to JVM dependencies, not Python ones (which we install ourselves).

Currently the workaround for this is to use an image (based on a Stackable one) with the resolved dependencies "baked in" (we do this in the stackablectl datalake demo).
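The baked-in workaround can be sketched as a small derived image. This is a minimal sketch only, assuming the base image ships curl and that the jars directory matches the Spark home seen in the Ivy log above; it is not the exact Dockerfile used in the demo:

```dockerfile
# Sketch of the "baked in" workaround (paths are assumptions): extend a
# Stackable image and pre-download the Iceberg runtime jar into Spark's
# jars directory, so neither driver nor executor has to resolve it at runtime.
FROM docker.stackable.tech/stackable/pyspark-k8s:3.3.0-stackable0.2.0
RUN curl -fsSL -o /stackable/spark-3.3.0-bin-hadoop3/jars/iceberg-spark-runtime-3.3_2.12-0.14.1.jar \
    https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/0.14.1/iceberg-spark-runtime-3.3_2.12-0.14.1.jar
```

With such an image, the deps.packages entry in the SparkApplication can be dropped entirely.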

@fhennig
Member

fhennig commented Feb 21, 2023

Thanks for the update, Andrew!

@lfrancke I'd suggest that we move this into "track" and wait for upstream, instead of spending time on developing a workaround.

@lfrancke
Member

Sounds good to me!

@fhennig
Member

fhennig commented Feb 21, 2023

Thanks for the quick response, I'm moving the ticket then!

@razvan
Member

razvan commented Mar 1, 2023

There is a 3.4.0-rc1 version now.

@sbernauer
Member Author

3.4.0 is officially released.

@soenkeliebau
Member

@lfrancke the new version should fix this, but we didn't want to move it to the next column until the LTS discussion.

@lfrancke
Member

lfrancke commented May 4, 2023

I'm fine with 3.4, or do you see any reason not to support it?

@razvan
Member

razvan commented May 31, 2023

Dependency resolution is still broken. See the PR above.

Proposal: discuss the idea of introducing a mechanism (via an init container) that provisions dependencies on both drivers and executors before submitting the application. Note that Vector logging already does some pre-provisioning and already sets the [driver|executor].extraClassPath properties.
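A rough sketch of that proposal, written as a fragment of the executor pod template. The tools image name, mount path, and command are assumptions for illustration, not the implemented fix:

```yaml
# Assumed sketch: an init container resolves the packages from deps.packages
# with coursier and copies the jars into a shared emptyDir; the main container
# would then pick them up via spark.executor.extraClassPath=/dependencies/*.
spec:
  volumes:
    - name: job-deps
      emptyDir: {}
  initContainers:
    - name: provision-deps
      image: docker.stackable.tech/stackable/tools:latest  # hypothetical image with cs on the PATH
      command:
        - /bin/bash
        - -c
        # cs fetch prints one resolved jar path per line (transitive deps included)
        - cs fetch org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:0.14.1 | xargs -I{} cp {} /dependencies/
      volumeMounts:
        - name: job-deps
          mountPath: /dependencies
  containers:
    - name: spark
      volumeMounts:
        - name: job-deps
          mountPath: /dependencies
```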

@razvan
Member

razvan commented Jun 7, 2023

Possible next steps:

bors bot pushed a commit that referenced this issue Jun 20, 2023
Part of #141

Reminder: cherry-pick to `release-23.4` after `main` merge.
@sbernauer
Member Author

Update: While setting up JupyterHub with spark-k8s accessing HDFS, we found a setup that correctly resolved dependencies.
We can take this as a working example and figure out what we do differently.

access-hdfs-with-pyspark-and-iceberg.ipynb.txt


@razvan razvan self-assigned this Sep 11, 2023
@lfrancke
Member

I'm sorry I lost track. Can you briefly explain why this is closed again?

@sbernauer
Member Author

We closed it because the reported bug has been fixed. You can now pull in e.g. Iceberg, but not JDBC drivers, if I understood correctly.
The implementation PR is #281.
