# Building and Installing the Iceberg ProtocolSwitchingFileIO JAR in Databricks

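This project provides `ProtocolSwitchingFileIO`, a file I/O implementation for Apache Iceberg that maps one storage protocol to another (for example, `wasbs://` to `abfss://`) and delegates the actual reads and writes to an Azure-capable `FileIO` such as `ADLSFileIO`.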

---

## Building the JAR

To package the project as a JAR file:

1. Ensure you have [Maven](https://maven.apache.org/install.html) installed and configured on your **local** machine.
2. Clone this repository and navigate into the `ProtocolSwitchingFileIO` directory.
3. Run the following command to build the JAR:

   ```bash
   mvn clean package
   ```

   This will generate the JAR file in the `target` directory. For this project, the file will be named `ProtocolSwitchingFileIO-1.0.0.jar`.

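Before uploading, it can help to confirm that the build step above actually packaged the class. The sketch below is one way to do that from Python, relying only on the fact that a JAR is a ZIP archive. The helper name `jar_contains_class` is hypothetical, and the class name is taken from the `io-impl` setting used later in this notebook; adjust both the path and the class name if your build differs.

```python
import os
import zipfile


def jar_contains_class(jar_path, class_name):
    """Return True if the JAR has a .class entry for the fully qualified class name."""
    # A JAR stores org.example.Foo as the archive entry org/example/Foo.class.
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()


# Assumed output location of `mvn clean package`, relative to the project directory.
jar_path = os.path.join("target", "ProtocolSwitchingFileIO-1.0.0.jar")
if os.path.exists(jar_path):
    found = jar_contains_class(jar_path, "org.apache.iceberg.tools.ProtocolSwitchingFileIO")
    print("Class packaged in JAR:", found)
else:
    print("JAR not found; run the Maven build first:", jar_path)
```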
---

## Uploading to Databricks

1. **Upload the Library to Workspace Files**
   - Open your Databricks workspace in a browser.
   - In the left-hand menu, click **Workspace**.
   - Navigate to the folder where you want to upload the JAR file.
   - Click the kebab menu (three vertical dots) in the upper-right corner and select **Import**.
   - In the **Import** dialog:
     - Choose **File** as the import source.
     - Drag and drop the `ProtocolSwitchingFileIO-1.0.0.jar` file or browse to select it.
   - Click **Import**.

2. **Install the Library on a Cluster**
   - Go to the **Compute** tab in Databricks.
   - Click the name of your cluster in the list.
   - Click the **Libraries** tab and then **Install new**.
   - In the **Install library** dialog:
     - Select **Workspace** as the library source.
     - Browse to the uploaded JAR file or specify its path, such as `/Workspace/Users/your-email@example.com/path-to-library/ProtocolSwitchingFileIO-1.0.0.jar`.
   - Click **Install**.

For additional instructions and alternative workflows, refer to [Databricks' official documentation](https://docs.databricks.com/en/libraries/workspace-files-libraries.html).

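The cluster installation described above can also be scripted. The following is a minimal sketch, assuming the Databricks Libraries REST API (`POST /api/2.0/libraries/install`) and a personal access token; it only builds the request and does not send it. The function name `build_install_request` and all four argument values are hypothetical placeholders, so check the request shape against the Databricks API reference for your workspace before relying on it.

```python
import json
import urllib.request


def build_install_request(host, token, cluster_id, jar_path):
    """Build (but do not send) a request that installs one workspace JAR on a cluster."""
    payload = {"cluster_id": cluster_id, "libraries": [{"jar": jar_path}]}
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/libraries/install",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values for illustration only; replace every one of them.
request = build_install_request(
    host="https://your-workspace.example.com",
    token="your-access-token",
    cluster_id="your-cluster-id",
    jar_path="/Workspace/Users/your-email@example.com/path-to-library/ProtocolSwitchingFileIO-1.0.0.jar",
)
print(request.get_method(), request.full_url)
# To submit the request, pass it to urllib.request.urlopen(request).
```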
In [0]:
import os
from pyspark.sql import SparkSession

# Polaris account and configuration
account = "aa0000.west-us-2.azure"  # Replace with your Snowflake account identifier
# Credentials are read from environment variables. The fallback strings are
# placeholders only and will not authenticate against a real catalog.
principal_client_id = os.getenv("POLARIS_CLIENT_ID", "default_client_id")
principal_secret = os.getenv("POLARIS_SECRET", "default_secret")
catalog_name = "protocalswitchingtest"  # Replace with your catalog name
role = "PRINCIPAL_ROLE:data_engineer"  # Replace with the principal role you are using


# Create Spark session
# Note: the runtime artifact suffix (3.5_2.13) must match the Spark and Scala
# versions of your cluster. If a session already exists, getOrCreate() returns it,
# and startup-only settings such as spark.jars.packages and spark.sql.extensions
# may need to be set in the cluster's Spark config instead.
spark = (
    SparkSession.builder.appName('Testing_ProtocolSwitchingFileIO')
    .config('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.13:1.6.1,org.apache.iceberg:iceberg-azure-bundle:1.6.1')
    .config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
    .config('spark.sql.defaultCatalog', 'opencatalog')
    .config('spark.sql.catalog.opencatalog', 'org.apache.iceberg.spark.SparkCatalog')
    .config('spark.sql.catalog.opencatalog.type', 'rest')
    .config('spark.sql.catalog.opencatalog.header.X-Iceberg-Access-Delegation', 'vended-credentials')
    .config('spark.sql.catalog.opencatalog.uri', f'https://{account}.snowflakecomputing.com/polaris/api/catalog')
    .config('spark.sql.catalog.opencatalog.credential', f'{principal_client_id}:{principal_secret}')
    .config('spark.sql.catalog.opencatalog.warehouse', catalog_name)
    .config('spark.sql.catalog.opencatalog.scope', role)
    .getOrCreate()
)

# Custom IO Configuration
# Note: These configurations are specific to the ProtocolSwitchingFileIO JAR.
# While these could be included in the Spark session builder config above, they are separated here
# for clarity and to highlight their association with the ProtocolSwitchingFileIO functionality.
spark.conf.set("spark.sql.catalog.opencatalog.io-impl", "org.apache.iceberg.tools.ProtocolSwitchingFileIO")
spark.conf.set("spark.sql.catalog.opencatalog.io-impl-delegate", "org.apache.iceberg.azure.adlsv2.ADLSFileIO")
spark.conf.set("spark.sql.catalog.opencatalog.protocol.mapping.wasb://", "abfs://")
spark.conf.set("spark.sql.catalog.opencatalog.protocol.mapping.wasbs://", "abfss://")

# Optional: Disable Vectorization for Compatibility
# Note: This can impact performance. Only disable if required for compatibility.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.conf.set("spark.sql.iceberg.vectorization.enabled", "false")

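To make the `protocol.mapping.*` settings above more concrete, here is a conceptual sketch of what a prefix-based protocol switch could look like. It is not the JAR's actual implementation, which is not shown in this notebook; the function name `switch_protocol` and the example storage location are hypothetical, and the sketch assumes the mapping simply swaps the leading scheme and leaves the rest of the location untouched.

```python
# Same scheme pairs as the protocol.mapping.* settings in the cell above.
PROTOCOL_MAPPING = {
    "wasb://": "abfs://",
    "wasbs://": "abfss://",
}


def switch_protocol(location, mapping=PROTOCOL_MAPPING):
    """Return the location with its leading scheme swapped when a mapping matches."""
    for source_prefix, target_prefix in mapping.items():
        if location.startswith(source_prefix):
            return target_prefix + location[len(source_prefix):]
    # No mapping applies, so the location passes through unchanged.
    return location


# Hypothetical storage location, used only to illustrate the rewrite.
example = "wasbs://container@account.blob.core.windows.net/warehouse/data.parquet"
print(switch_protocol(example))
```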
In [0]:
# Test to Verify JAR Import and Usage
from py4j.java_gateway import java_import

java_import(spark._jvm, "org.apache.iceberg.tools.ProtocolSwitchingFileIO")

try:
    # Attempt to initialize the class
    wrapper = spark._jvm.org.apache.iceberg.tools.ProtocolSwitchingFileIO()
    print("Class initialized successfully:", wrapper)
except Exception as e:
    print("Error initializing ProtocolSwitchingFileIO class:", e)

try:
    # Fetch the class version
    version = spark._jvm.org.apache.iceberg.tools.ProtocolSwitchingFileIO.getVersion()
    print(f"ProtocolSwitchingFileIO version: {version}")
except Exception as e:
    print("Error fetching ProtocolSwitchingFileIO version:", e)

In [0]:
# Simple SQL Commands for Testing
# Replace DATABASE.SCHEMA.TABLENAME with actual values
try:
    spark.sql("SHOW NAMESPACES").show()
    df = spark.sql("SELECT * FROM DATABASE.SCHEMA.TABLENAME LIMIT 10")
    df.show(10)  # Display the first 10 rows
except Exception as e:
    print("Error executing SQL commands:", e)