
Executors don't resolve dependencies #141

Closed
sbernauer opened this issue Sep 13, 2022 · 19 comments · Fixed by #281

Comments

@sbernauer
Member

sbernauer commented Sep 13, 2022

Affected version

0.5.0

Current and expected behavior

Following https://iceberg.apache.org/docs/latest/getting-started/

Current

Use

  deps:
    packages:
      # - org.apache.hadoop:hadoop-aws:3.3.3
      - org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:0.14.1

Driver logs:

:: loading settings :: url = jar:file:/stackable/spark-3.3.0-bin-hadoop3/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /stackable/.ivy2/cache
The jars for the packages stored in: /stackable/.ivy2/jars
org.apache.iceberg#iceberg-spark-runtime-3.3_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-ddd36894-c46d-4d2f-82b2-8a916f718eba;1.0
        confs: [default]
        found org.apache.iceberg#iceberg-spark-runtime-3.3_2.12;0.14.1 in central
downloading https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/0.14.1/iceberg-spark-runtime-3.3_2.12-0.14.1.jar ...
        [SUCCESSFUL ] org.apache.iceberg#iceberg-spark-runtime-3.3_2.12;0.14.1!iceberg-spark-runtime-3.3_2.12.jar (11980ms)
:: resolution report :: resolve 496ms :: artifacts dl 11983ms
        :: modules in use:
        org.apache.iceberg#iceberg-spark-runtime-3.3_2.12;0.14.1 from central in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   1   |   1   |   1   |   0   ||   1   |   1   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-ddd36894-c46d-4d2f-82b2-8a916f718eba
        confs: [default]
        1 artifacts copied, 0 already retrieved (29791kB/17ms)

Executors do not pull the dependencies and fail with java.lang.ClassNotFoundException: org.apache.iceberg.spark.source.SparkWrite$WriterFactory (a class that should come with org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:0.14.1).

Expected

Drivers and executors both pull the dependencies.

Possible solution

No response

Additional context

No response

Environment

---
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: write-iceberg-table
spec:
  version: "1.0"
  sparkImage: docker.stackable.tech/stackable/pyspark-k8s:3.3.0-stackable0.2.0
  mode: cluster
  mainApplicationFile: local:///stackable/spark/jobs/write-iceberg-table.py
  deps:
    packages:
      # - org.apache.hadoop:hadoop-aws:3.3.3
      - org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:0.14.1
  sparkConf:
    spark.hadoop.fs.s3a.aws.credentials.provider: org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider
    spark.sql.extensions: org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
    spark.sql.catalog.spark_catalog: org.apache.iceberg.spark.SparkSessionCatalog
    spark.sql.catalog.spark_catalog.type: hive
    spark.sql.catalog.local: org.apache.iceberg.spark.SparkCatalog
    spark.sql.catalog.local.type: hadoop
    spark.sql.catalog.local.warehouse: /tmp/warehouse
  volumes:
    - name: script
      configMap:
        name: write-iceberg-table-script
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    volumeMounts:
      - name: script
        mountPath: /stackable/spark/jobs
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    volumeMounts:
      - name: script
        mountPath: /stackable/spark/jobs
      # - name: job-deps
      #   mountPath: /dependencies
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: write-iceberg-table-script
data:
  write-iceberg-table.py: |
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("write-iceberg-table").getOrCreate()

    #df = spark.read.parquet("s3a://public-backup-nyc-tlc/trip-data/yellow_tripdata_2020-04.parquet")
    #df.show(10)

    print("FOO creating table")
    spark.sql("CREATE TABLE local.db.table (id bigint, data string) USING iceberg")
    spark.sql("INSERT INTO local.db.table VALUES (1, 'a'), (2, 'b'), (3, 'c')")
    spark.sql("SELECT * FROM local.db.table")
    spark.sql("SELECT * FROM local.db.table.snapshots")

Would you like to work on fixing this bug?

maybe

@sbernauer
Member Author

@soenkeliebau
Member

Stupid question probably, but can we influence the executor pods?

For example, add an init container with coursier to fetch all dependencies stated in deps.

https://get-coursier.io/docs/cli-fetch

@razvan
Member

razvan commented Oct 10, 2022

The operator lays out the pod templates for both the driver and executor, so this should be easy to do.

@adwk67
Member

adwk67 commented Oct 10, 2022

This may overlap a bit with #117.

@adwk67
Member

adwk67 commented Feb 2, 2023

Found https://issues.apache.org/jira/browse/SPARK-35084

Fixed in apache/spark#38828, due for inclusion in 3.4.0 (not released yet).

@adwk67
Member

adwk67 commented Feb 2, 2023

For example add an init container with coursier to fetch all dependencies of stuff stated in deps..
https://get-coursier.io/docs/cli-fetch

Maybe I'm doing something wrong, but I tested this quickly and SparkWrite wasn't picked up with cs:

cs fetch org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:0.14.1
https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/0.14.1/iceberg-spark-runtime-3.3_2.12-0.14.1.jar
  100.0% [##########] 29.1 MiB (4.6 MiB / s)
~/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/0.14.1/iceberg-spark-runtime-3.3_2.12-0.14.1.jar

@adwk67
Member

adwk67 commented Feb 21, 2023

According to this, new releases come out every couple of months: as 3.3.0 was released in July 2022, 3.4.0 could be on its way in the medium term. I wasn't able to find a roadmap date for it, though. This issue applies to JVM dependencies, not Python ones (which we install ourselves).

Currently the workaround for this is to use an image (based on a Stackable one) with the resolved dependencies "baked in" (we do this in the stackablectl datalake demo).
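The baked-in workaround can be sketched as a small derived image. This is a minimal sketch only, assuming the base image ships curl and that the jars directory matches the Spark home seen in the Ivy log above; it is not the exact Dockerfile used in the demo:

```dockerfile
# Sketch of the "baked in" workaround (paths are assumptions): extend a
# Stackable image and pre-download the Iceberg runtime jar into Spark's
# jars directory, so neither driver nor executor has to resolve it at runtime.
FROM docker.stackable.tech/stackable/pyspark-k8s:3.3.0-stackable0.2.0
RUN curl -fsSL -o /stackable/spark-3.3.0-bin-hadoop3/jars/iceberg-spark-runtime-3.3_2.12-0.14.1.jar \
    https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/0.14.1/iceberg-spark-runtime-3.3_2.12-0.14.1.jar
```

With such an image, the deps.packages entry in the SparkApplication can be dropped entirely.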

@fhennig
Member

fhennig commented Feb 21, 2023

Thanks for the update, Andrew!

@lfrancke I'd suggest that we move this into "track" and wait for upstream, instead of spending time on developing a workaround.

@lfrancke
Member

Sounds good to me!

@fhennig
Member

fhennig commented Feb 21, 2023

Thanks for the quick response, I'm moving the ticket then!

@razvan
Member

razvan commented Mar 1, 2023

There is a 3.4.0-rc1 version now.

@sbernauer
Member Author

3.4.0 is officially released.

@soenkeliebau
Member

@lfrancke the new version should fix this, but we didn't want to move it to the next column until the LTS discussion.

@lfrancke
Member

lfrancke commented May 4, 2023

I'm fine with 3.4, or do you see any reason not to support it?

@razvan
Member

razvan commented May 31, 2023

Dependency resolution is still broken. See the PR above.

Proposal: discuss the idea of introducing a mechanism (via an init container) that provisions dependencies on both drivers and executors before submitting the application. Note that Vector logging already does some pre-provisioning and already sets the [driver|executor].extraClassPath properties.
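A rough sketch of that proposal, written as a fragment of the executor pod template. The tools image name, mount path, and command are assumptions for illustration, not the implemented fix:

```yaml
# Assumed sketch: an init container resolves the packages from deps.packages
# with coursier and copies the jars into a shared emptyDir; the main container
# would then pick them up via spark.executor.extraClassPath=/dependencies/*.
spec:
  volumes:
    - name: job-deps
      emptyDir: {}
  initContainers:
    - name: provision-deps
      image: docker.stackable.tech/stackable/tools:latest  # hypothetical image with cs on the PATH
      command:
        - /bin/bash
        - -c
        # cs fetch prints one resolved jar path per line (transitive deps included)
        - cs fetch org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:0.14.1 | xargs -I{} cp {} /dependencies/
      volumeMounts:
        - name: job-deps
          mountPath: /dependencies
  containers:
    - name: spark
      volumeMounts:
        - name: job-deps
          mountPath: /dependencies
```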

@razvan
Member

razvan commented Jun 7, 2023

Possible next steps:

bors bot pushed a commit that referenced this issue Jun 20, 2023
Part of #141

Reminder: cherry-pick to `release-23.4` after `main` merge.
@sbernauer
Member Author

Update: While setting up JupyterHub with spark-k8s accessing HDFS, we found a setup that correctly resolved dependencies.
We can take this as a working example and figure out what we do differently.

access-hdfs-with-pyspark-and-iceberg.ipynb.txt


@razvan razvan self-assigned this Sep 11, 2023
@lfrancke
Member

I'm sorry I lost track. Can you briefly explain why this is closed again?

@sbernauer
Member Author

We closed it because the reported bug has been fixed. You can now pull in e.g. Iceberg, but not JDBC drivers, if I understood correctly.
The implementation PR is #281.
