# Snowpark Connect for Spark — Workspace Notebook Spike

**Goal:** Test whether Snowpark Connect for Spark can be installed and used
within a Snowflake Workspace Notebook.

**Key questions:**
1. Can we `pip install snowpark-connect` in the Workspace?
2. Can the local Spark Connect gRPC server bind to localhost inside the SPCS container?
3. Can we authenticate via an SPCS OAuth token or a PAT, supplied through config.toml or a direct connection URL?
4. Can we run PySpark DataFrame operations on Snowflake warehouse compute?
5. Can we connect a Scala Spark client to the running server?

**Docs:**
- [Overview](https://docs.snowflake.com/en/developer-guide/snowpark-connect/snowpark-connect-overview)
- [Jupyter / VS Code setup](https://docs.snowflake.com/en/developer-guide/snowpark-connect/snowpark-connect-workloads-jupyter)

---

## Contents

1. [Install Snowpark Connect](#1)
2. [Environment Checks](#2)
3. [Auth Strategy: SPCS OAuth Token](#3)
4. [Auth Strategy: PAT (as used in R/ADBC)](#4)
5. [Auth Strategy: Direct URL with token (OSS client path)](#5)
6. [Start Spark Connect Session](#6)
7. [PySpark DataFrame Test](#7)
   - [7b. Spark ↔ Snowpark Python Interop](#7b)
8. [Server Architecture Check](#8)
9. [Findings](#9)

---
<a id="1"></a>
## 1. Install Snowpark Connect

Try installing with the `[jdk]` extra first (bundles Java). If it conflicts
with an existing JDK (e.g. from the Scala prototype's micromamba install),
we'll try without `[jdk]` and point to our existing Java.
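
The fallback order described above can also be written as a loop over candidate requirement specs. This is a minimal sketch rather than the notebook's actual cell: `install_first` and its `runner` parameter are hypothetical names, and the injectable `runner` is there only so the logic can be exercised without touching the network.

```python
import subprocess
import sys
from typing import Callable, Optional, Sequence


def pip_install(spec: str) -> subprocess.CompletedProcess:
    """Run `pip install <spec>` in the current interpreter's environment."""
    return subprocess.run(
        [sys.executable, "-m", "pip", "install", spec, "-q"],
        capture_output=True, text=True, timeout=300,
    )


def install_first(
    specs: Sequence[str],
    runner: Callable[[str], subprocess.CompletedProcess] = pip_install,
) -> Optional[str]:
    """Try each spec in order; return the first that installs, else None."""
    for spec in specs:
        result = runner(spec)
        if result.returncode == 0:
            return spec
        # Keep only the tail of stderr, which usually holds the actual error
        print(f"{spec} failed: {(result.stderr or 'no stderr')[-500:]}")
    return None
```

In the notebook this would be called as `install_first(["snowpark-connect[jdk]", "snowpark-connect"])`, which keeps the retry logic in one place if a third candidate is ever needed.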

In [None]:
import subprocess, sys, time

t0 = time.time()

# Attempt 1: with bundled JDK
result = subprocess.run(
    [sys.executable, "-m", "pip", "install", "snowpark-connect[jdk]", "-q"],
    capture_output=True, text=True, timeout=300
)
elapsed = time.time() - t0

if result.returncode == 0:
    print(f"Installed snowpark-connect[jdk] in {elapsed:.1f}s")
else:
    print(f"snowpark-connect[jdk] failed ({elapsed:.1f}s):")
    print(result.stderr[-500:] if result.stderr else "no stderr")
    print("\nRetrying without [jdk] extra...")
    
    t0 = time.time()
    result2 = subprocess.run(
        [sys.executable, "-m", "pip", "install", "snowpark-connect", "-q"],
        capture_output=True, text=True, timeout=300
    )
    elapsed2 = time.time() - t0
    if result2.returncode == 0:
        print(f"Installed snowpark-connect (no jdk) in {elapsed2:.1f}s")
    else:
        print(f"snowpark-connect also failed ({elapsed2:.1f}s):")
        print(result2.stderr[-500:] if result2.stderr else "no stderr")

In [None]:
# Verify installation
try:
    import snowflake.snowpark_connect
    print("snowpark_connect imported OK")
    print(f"  Location: {snowflake.snowpark_connect.__file__}")
except ImportError as e:
    print(f"Import failed: {e}")

try:
    import pyspark
    print(f"PySpark version: {pyspark.__version__}")
except ImportError as e:
    print(f"PySpark not available: {e}")

---
<a id="2"></a>
## 2. Environment Checks

Check the Python version, Java availability, CPU architecture, whether
localhost port binding is possible, and which auth tokens are present.
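
The port-binding check can be isolated into a small helper. This is a sketch with a hypothetical name, `can_bind`; note that a successful bind only shows the port is free and bindable, not that any server is running on it.

```python
import socket


def can_bind(host: str, port: int) -> bool:
    """Return True if a TCP socket can be bound to host:port right now."""
    # The context manager closes the socket whether or not bind() succeeds
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
            return True
        except OSError:
            return False
```

Passing port `0` asks the OS for any free ephemeral port, which is a quick way to tell "binding is blocked entirely" apart from "this specific port is taken".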

In [None]:
import platform, shutil, socket, os

print("=== Environment ===")
print(f"Python:       {sys.version}")
print(f"Architecture: {platform.machine()}")
print(f"OS:           {platform.platform()}")

# Java
java_path = shutil.which("java")
print(f"\nJava binary:  {java_path or 'NOT FOUND'}")
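if java_path:
    # `java -version` normally writes its banner to stderr, not stdout
    jv = subprocess.run([java_path, "-version"], capture_output=True, text=True)
    banner = (jv.stderr or jv.stdout).splitlines()
    print(f"Java version: {banner[0] if banner else 'unknown'}")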

java_home = os.environ.get("JAVA_HOME", "")
print(f"JAVA_HOME:    {java_home or 'NOT SET'}")

# Port binding test
print("\n=== Port Binding Test ===")
test_port = 15002  # Spark Connect default
try:
    # The context manager closes the socket even when bind() raises
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", test_port))
    print(f"localhost:{test_port} — bind OK (port available)")
except OSError as e:
    print(f"localhost:{test_port} — bind FAILED: {e}")

# SPCS token
spcs_token_path = "/snowflake/session/token"
print(f"\n=== Auth Tokens ===")
print(f"SPCS token:   {'EXISTS' if os.path.isfile(spcs_token_path) else 'NOT FOUND'}")
print(f"SNOWFLAKE_PAT env: {'SET' if os.environ.get('SNOWFLAKE_PAT') else 'NOT SET'}")

---
<a id="3"></a>
## 3. Auth Strategy: SPCS OAuth Token

Inside a Workspace Notebook, the container provides an OAuth token at
`/snowflake/session/token`. Let's see if we can write a `config.toml`
that Snowpark Connect accepts using this token.

**Note:** This is speculative — the docs show user/password auth in
config.toml. We'll try token-based auth and see what happens.
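
Interpolating values straight into a TOML template breaks if a value contains a double quote or a backslash. The sketch below shows one way to quote values first. `toml_string` and `render_connection` are hypothetical helpers, and using JSON string escaping as a stand-in for TOML basic-string escaping is an assumption that holds for ordinary printable text such as account names and tokens. Skipping empty values is also a choice made here, not something the docs prescribe.

```python
import json


def toml_string(value: str) -> str:
    """Quote a value as a TOML basic string.

    Assumption: JSON escaping is used as a stand-in; it matches TOML
    basic-string escaping for ordinary printable text.
    """
    return json.dumps(value, ensure_ascii=False)


def render_connection(name: str, fields: dict) -> str:
    """Render one [connections.<name>] table, skipping empty values."""
    lines = [f"[connections.{name}]"]
    for key, value in fields.items():
        if value:
            lines.append(f"{key} = {toml_string(value)}")
    return "\n".join(lines) + "\n"
```

For example, `render_connection("spark-connect", {"account": "ACME", "authenticator": "oauth"})` yields a `[connections.spark-connect]` table with one quoted line per non-empty field.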

In [None]:
from snowflake.snowpark.context import get_active_session

session = get_active_session()

# Extract connection details from the active Python Snowpark session
def _safe(fn):
    try:
        v = fn()
        return v.strip('"') if v else ""
    except Exception:
        return ""

conn_info = {
    "account": _safe(lambda: session.sql("SELECT CURRENT_ACCOUNT()").collect()[0][0]),
    "user": _safe(lambda: session.sql("SELECT CURRENT_USER()").collect()[0][0]),
    "role": _safe(session.get_current_role),
    "database": _safe(session.get_current_database),
    "schema": _safe(session.get_current_schema),
    "warehouse": _safe(session.get_current_warehouse),
    "host": os.environ.get("SNOWFLAKE_HOST", ""),
}

# Read SPCS token
spcs_token = ""
if os.path.isfile("/snowflake/session/token"):
    with open("/snowflake/session/token") as f:
        spcs_token = f.read().strip()

print("Connection info extracted:")
for k, v in conn_info.items():
    print(f"  {k}: {v}")
print(f"  spcs_token: {'SET (' + str(len(spcs_token)) + ' chars)' if spcs_token else 'NOT FOUND'}")

In [None]:
# Write a config.toml for Snowpark Connect
# Try with token auth first (speculative), fall through to PAT/password later
import pathlib

config_dir = pathlib.Path.home() / ".snowflake"
config_dir.mkdir(parents=True, exist_ok=True)
config_file = config_dir / "config.toml"

# Strategy A: Use token with authenticator=oauth
toml_content = f'''[connections.spark-connect]
host = "{conn_info['host']}"
account = "{conn_info['account']}"
user = "{conn_info['user']}"
token = "{spcs_token}"
authenticator = "oauth"
warehouse = "{conn_info['warehouse']}"
database = "{conn_info['database']}"
schema = "{conn_info['schema']}"
role = "{conn_info['role']}"
'''

config_file.write_text(toml_content)
config_file.chmod(0o600)
print(f"Wrote {config_file}")
print("Auth strategy: SPCS OAuth token")

# Show (redacted)
for line in toml_content.splitlines():
    if 'token' in line.lower() and '=' in line:
        key = line.split('=')[0].strip()
        print(f"  {key} = <redacted>")
    else:
        print(f"  {line}")

---
<a id="4"></a>
## 4. Auth Strategy: PAT (as used in R/ADBC)

If OAuth token auth doesn't work with Snowpark Connect, try PAT.
We know PATs work for ADBC connections from inside Workspace Notebooks
(proven in the R integration). Snowpark Connect's OSS client path
explicitly supports PAT auth.

**Skip this cell if Section 3 worked.**
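
The preference order across Sections 3 and 4 (SPCS token first, PAT as fallback) can be sketched as a single lookup. `choose_auth` is a hypothetical helper; the token path, the `SNOWFLAKE_PAT` variable, and the two authenticator names are the ones used elsewhere in this notebook.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Tuple


def choose_auth(
    token_path: str = "/snowflake/session/token",
    env: Mapping[str, str] = os.environ,
) -> Optional[Tuple[str, str]]:
    """Return (authenticator, secret) for the first credential found, else None."""
    path = Path(token_path)
    if path.is_file():
        token = path.read_text().strip()
        if token:
            return ("oauth", token)
    pat = env.get("SNOWFLAKE_PAT", "")
    if pat:
        return ("programmatic_access_token", pat)
    return None
```

The returned pair maps directly onto the `authenticator` and `token` keys written to config.toml, so the two config-writing cells could share one template.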

In [None]:
# Run this cell only if OAuth auth (Section 3) didn't work.
# It needs a PAT in the SNOWFLAKE_PAT environment variable; create one first if you don't have one.

# from scala_helpers import ...  # or create PAT via SQL
# pat = session.sql("ALTER USER SET ... ADD PAT ...")  # etc.

# For now, check if a PAT is already available
pat = os.environ.get("SNOWFLAKE_PAT", "")
if pat:
    toml_pat = f'''[connections.spark-connect]
host = "{conn_info['host']}"
account = "{conn_info['account']}"
user = "{conn_info['user']}"
token = "{pat}"
authenticator = "programmatic_access_token"
warehouse = "{conn_info['warehouse']}"
database = "{conn_info['database']}"
schema = "{conn_info['schema']}"
role = "{conn_info['role']}"
'''
    config_file.write_text(toml_pat)
    config_file.chmod(0o600)
    print("Rewrote config.toml with PAT auth")
else:
    print("No PAT available in SNOWFLAKE_PAT env var.")
    print("To test PAT auth, set os.environ['SNOWFLAKE_PAT'] = '<your-pat>'")

---
<a id="5"></a>
## 5. Auth Strategy: Direct URL with token (OSS client path)

The Snowpark Connect OSS client docs show connecting via a URL with
embedded PAT: `sc://host/;token=...;token_type=PAT`. This bypasses
config.toml entirely.

The cell below looks up the Snowpark Connect host for this account in
`SYSTEM$ALLOWLIST()`, since that host is what the URL would point at.
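
Assembling a URL of the form quoted above could look like the sketch below. `build_connect_url` is a hypothetical helper, and percent-encoding the token is an assumption: it prevents a `;` or `/` inside the token from being read as a URL delimiter, but this spike has not confirmed that the endpoint expects it.

```python
from typing import Optional
from urllib.parse import quote


def build_connect_url(host: str, token: str, port: Optional[int] = None) -> str:
    """Build sc://host[:port]/;token=...;token_type=PAT."""
    authority = host if port is None else f"{host}:{port}"
    # Assumption: encode every reserved character in the token
    encoded = quote(token, safe="")
    return f"sc://{authority}/;token={encoded};token_type=PAT"
```

With PySpark 3.4 or later, such a URL would typically be handed to `SparkSession.builder.remote(url).getOrCreate()`. Whether that works from inside the Workspace container is untested here.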

In [None]:
# Discover the Snowpark Connect host for this account
try:
    rows = session.sql("""
        SELECT t.VALUE:type::VARCHAR as type,
               t.VALUE:host::VARCHAR as host,
               t.VALUE:port as port
        FROM TABLE(FLATTEN(input => PARSE_JSON(SYSTEM$ALLOWLIST()))) AS t
        WHERE type = 'SNOWPARK_CONNECT'
    """).collect()
    if rows:
        for r in rows:
            print(f"Snowpark Connect endpoint: {r['HOST']}:{r['PORT']}")
    else:
        print("No SNOWPARK_CONNECT entry in SYSTEM$ALLOWLIST()")
        print("Snowpark Connect may not be enabled for this account.")
except Exception as e:
    print(f"Could not query SYSTEM$ALLOWLIST(): {e}")

---
<a id="6"></a>
## 6. Start Spark Connect Session

This is the critical test — can `init_spark_session()` start the local
gRPC server and create a SparkSession inside the Workspace container?
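
Before starting the session, the cell below guesses `JAVA_HOME` from the location of the `java` binary. That guess can be made slightly more defensive, as in this sketch. `guess_java_home` is a hypothetical helper, and it assumes the conventional `<JAVA_HOME>/bin/java` layout.

```python
import shutil
from pathlib import Path
from typing import Optional


def guess_java_home(java_bin: Optional[str] = None) -> Optional[str]:
    """Derive JAVA_HOME from a java binary located at <JAVA_HOME>/bin/java."""
    java_bin = java_bin or shutil.which("java")
    if not java_bin:
        return None
    # resolve() follows symlinks such as /usr/bin/java -> .../jdk/bin/java
    resolved = Path(java_bin).resolve()
    if resolved.parent.name != "bin":
        return None  # layout does not look like <JAVA_HOME>/bin/java
    return str(resolved.parent.parent)
```

Returning `None` for an unexpected layout avoids exporting a `JAVA_HOME` that points at an unrelated directory.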

In [None]:
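# Imported here too so this cell runs even if the auth cells above were skipped
import os, pathlib, shutil, time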

# Ensure JAVA_HOME is set (use micromamba JDK if available)
if not os.environ.get("JAVA_HOME") and shutil.which("java"):
    java_bin = shutil.which("java")
    java_home_guess = str(pathlib.Path(java_bin).resolve().parent.parent)
    os.environ["JAVA_HOME"] = java_home_guess
    print(f"Set JAVA_HOME={java_home_guess}")

try:
    from snowflake import snowpark_connect
    
    t0 = time.time()
    spark = snowpark_connect.server.init_spark_session()
    elapsed = time.time() - t0
    
    print(f"SparkSession created in {elapsed:.1f}s")
    print(f"  Spark version: {spark.version}")
    print(f"  Type:          {type(spark).__name__}")
    print("\n*** Spark Connect session is RUNNING ***")
    
except Exception as e:
    print(f"Failed to start Spark Connect session: {type(e).__name__}: {e}")
    import traceback
    traceback.print_exc()

---
<a id="7"></a>
## 7. PySpark DataFrame Test

If the SparkSession is running, test basic PySpark DataFrame operations
that execute on the Snowflake warehouse.
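
For the transformation test further down (filter on `score`, add two columns, sort descending), it helps to know the expected rows in advance. The sketch below computes them in plain Python from the same three input rows. `expected_transform` is a hypothetical helper used only as a reference, not part of PySpark.

```python
rows = [
    {"id": 1, "name": "Alice", "score": 95.0},
    {"id": 2, "name": "Bob", "score": 87.5},
    {"id": 3, "name": "Carol", "score": 92.0},
]


def expected_transform(rows):
    """Plain-Python equivalent of the filter/withColumn/orderBy chain."""
    kept = [r for r in rows if r["score"] > 88]
    shaped = [{**r, "grade": "A", "name_upper": r["name"].upper()} for r in kept]
    return sorted(shaped, key=lambda r: r["score"], reverse=True)


expected = expected_transform(rows)
```

Once the Spark result exists, `[r.asDict() for r in transformed.collect()]` should be comparable to `expected`, assuming the column types round-trip unchanged through Snowflake.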

In [None]:
# Test 1: Create a DataFrame from local data
from pyspark.sql import Row

df = spark.createDataFrame([
    Row(id=1, name="Alice", score=95.0),
    Row(id=2, name="Bob", score=87.5),
    Row(id=3, name="Carol", score=92.0),
])

print("=== Local DataFrame ===")
df.show()
print(f"Count: {df.count()}")

In [None]:
# Test 2: Spark SQL against Snowflake
result = spark.sql("SELECT CURRENT_USER() AS user")
result.show()

In [None]:
# Test 3: Read Snowflake data via Spark SQL
# Note: SHOW ... LIMIT N fails (Spark parser doesn't support LIMIT on SHOW).
# INFORMATION_SCHEMA works fine.
try:
    info_df = spark.sql("SELECT TABLE_NAME, TABLE_TYPE, ROW_COUNT FROM INFORMATION_SCHEMA.TABLES LIMIT 5")
    info_df.show(truncate=False)
except Exception as e:
    print(f"INFORMATION_SCHEMA query failed: {e}")
    print("\nTrying a simple SELECT with VALUES...")
    spark.sql("SELECT * FROM VALUES (1, 'a'), (2, 'b') AS t(id, letter)").show()

In [None]:
# Test 4: DataFrame transformations (pushed down to Snowflake)
from pyspark.sql.functions import col, lit, upper

transformed = (
    df.filter(col("score") > 88)
      .withColumn("grade", lit("A"))
      .withColumn("name_upper", upper(col("name")))
      .orderBy(col("score").desc())
)
print("=== Transformed (pushed to Snowflake warehouse) ===")
transformed.show()

# Convert to Pandas
pdf = transformed.toPandas()
print(f"\nAs Pandas DataFrame ({type(pdf).__name__}):\n{pdf}")

---
<a id="7b"></a>
## 7b. Spark ↔ Snowpark Python Interop

Can we share data between the native Snowpark Python session and the
Spark Connect session? Both connect to the same Snowflake account —
test writing from one and reading from the other via transient tables.
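
Fixed table names could collide if two people run this spike against the same schema. One option, sketched here under that assumption, is to add a random suffix. `scratch_table_name` is a hypothetical helper, and it only accepts plain unquoted identifiers as the prefix.

```python
import re
import uuid


def scratch_table_name(prefix: str) -> str:
    """Return an unquoted-identifier-safe table name with a random suffix."""
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", prefix):
        raise ValueError(f"prefix is not a plain identifier: {prefix!r}")
    # Eight hex characters are enough to separate concurrent notebook runs
    return f"{prefix}_{uuid.uuid4().hex[:8]}".upper()
```

The generated name would be substituted into both the `CREATE OR REPLACE TRANSIENT TABLE` statement and the matching `DROP TABLE IF EXISTS` cleanup.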

In [None]:
# Test 5: Spark -> Snowpark Python (write from Spark, read from Snowpark)
try:
    spark.sql("CREATE OR REPLACE TRANSIENT TABLE _SPARK_INTEROP_TEST AS SELECT 1 AS id, 'from_spark' AS source").collect()
    print("Spark: wrote _SPARK_INTEROP_TEST")
    
    snowpark_df = session.table("_SPARK_INTEROP_TEST")
    snowpark_df.show()
    print(f"Snowpark Python read it: {type(snowpark_df).__name__}")
except Exception as e:
    print(f"Spark -> Snowpark interop failed: {e}")

# Test 6: Snowpark Python -> Spark (write from Snowpark, read from Spark)
try:
    session.sql("CREATE OR REPLACE TRANSIENT TABLE _SNOWPARK_INTEROP_TEST AS SELECT 2 AS id, 'from_snowpark' AS source").collect()
    print("\nSnowpark Python: wrote _SNOWPARK_INTEROP_TEST")
    
    spark_df = spark.sql("SELECT * FROM _SNOWPARK_INTEROP_TEST")
    spark_df.show()
    print(f"Spark read it: {type(spark_df).__name__}")
except Exception as e:
    print(f"Snowpark -> Spark interop failed: {e}")

# Cleanup
try:
    session.sql("DROP TABLE IF EXISTS _SPARK_INTEROP_TEST").collect()
    session.sql("DROP TABLE IF EXISTS _SNOWPARK_INTEROP_TEST").collect()
    print("\nCleanup done")
except Exception as e:
    print(f"\nCleanup failed: {e}")

---
<a id="8"></a>
## 8. Server Architecture Check

Snowpark Connect uses a **remote** Spark Connect server on Snowflake's
infrastructure — there is no local gRPC server in the container.
The PySpark client connects directly to the Snowpark Connect endpoint.

This means a Scala Spark Connect client would need to connect to the
remote endpoint (not localhost), which requires different auth setup.

In [None]:
# Check if the Spark Connect gRPC server is actually listening
import socket

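def check_port(host, port, timeout=2):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection cleans up the socket on failure; `with` closes it on success
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # socket.error is an alias of OSError
        return False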

port_15002 = check_port("127.0.0.1", 15002)
print(f"Spark Connect server on localhost:15002: {'LISTENING' if port_15002 else 'NOT LISTENING'}")

if not port_15002:
    print("\nThe server may use a different port. Checking common alternatives...")
    for p in [15001, 15003, 4040, 18080]:
        if check_port("127.0.0.1", p):
            print(f"  Port {p}: LISTENING")

---
<a id="9"></a>
## 9. Findings

Record results after running the cells above.

| Question | Result | Notes |
|----------|--------|-------|
| pip install works? | **YES** | `snowpark-connect[jdk]`, PySpark 3.5.6 |
| Import succeeds? | **YES** | OpenTelemetry warning (harmless) |
| Port binding works? | **YES** | localhost:15002 available, but server is remote |
| Java needed locally? | **NO** | PySpark client doesn't need local JDK |
| OAuth token auth? | **YES** | config.toml with SPCS token + `authenticator=oauth` |
| PAT auth? | Not tested | OAuth worked; PAT likely works (proven for ADBC) |
| init_spark_session() | **YES** | 17.9s, Spark 3.5.6, connects to remote endpoint |
| PySpark createDataFrame | **YES** | Local data → Snowflake → show/count |
| Spark SQL (standard) | **YES** | SELECT, INFORMATION_SCHEMA, CURRENT_USER() |
| Spark SQL (SF-specific) | **PARTIAL** | CURRENT_ROLE() unsupported; SHOW LIMIT fails |
| DataFrame transforms | **YES** | filter/withColumn/upper/orderBy pushed down |
| toPandas() | TBD | |
| Spark ↔ Snowpark interop | TBD | Transient table sharing |
| Local gRPC server? | **NO** | Server is remote (Snowflake infra), not localhost |
| Scala client | **N/A** | No local server to connect to |
| Snowpark Connect endpoint | **YES** | `AK32940.snowpark.pdxaac.snowflakecomputing.com:443` |

In [None]:
# Summary: disk usage of the root filesystem, plus snowpark-connect package metadata
import shutil, subprocess, sys

total, used, free = shutil.disk_usage("/")
print(f"Disk: {used / (1024**3):.1f} GB used / {total / (1024**3):.1f} GB total / {free / (1024**3):.1f} GB free")

# Show the installed package's name, version, and location (pip show does not report size)
result = subprocess.run(
    [sys.executable, "-m", "pip", "show", "snowpark-connect"],
    capture_output=True, text=True
)
if result.returncode == 0:
    for line in result.stdout.splitlines():
        if line.startswith(("Name:", "Version:", "Location:")):
            print(line)

In [None]:
# Cleanup: stop Spark session if running
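try:
    spark.stop()
    print("Spark session stopped")
except NameError:
    # `spark` was never created, e.g. Section 6 failed or was skipped
    print("No Spark session to stop")
except Exception as e:
    print(f"Stopping the Spark session failed: {e}")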