# Run Your Spark Code on Snowflake

**For Spark/Databricks users:** This notebook shows your existing PySpark code running on Snowflake. The only change is how the session is created.

No clusters to manage. No data movement. Just your code, running on Snowflake's engine.

---

## What You'll See

1. Standard PySpark DataFrame operations (the code you already write)
2. The same code running on Snowflake via Snowpark Connect
3. Results computed by Snowflake's compute engine

## Setup: Connect to Snowflake

This creates a SparkSession that runs on Snowflake instead of a local cluster.

In [None]:
# Initialize Snowpark Connect - this replaces your SparkSession.builder code
from snowflake import snowpark_connect

spark = snowpark_connect.server.init_spark_session()
print(f"Connected! Spark API version: {spark.version}")
print("Execution engine: Snowflake")

---

## Your Familiar PySpark Code

Everything below is standard PySpark - the same code you'd write for Databricks or any Spark cluster.

### Create a DataFrame

Just like you always do - `spark.createDataFrame()`

In [None]:
# Create sample sales data - standard PySpark
from pyspark.sql import Row

sales_data = [
    Row(order_id=1, customer="Alice", product="Laptop", amount=1200.00, region="West"),
    Row(order_id=2, customer="Bob", product="Mouse", amount=25.00, region="East"),
    Row(order_id=3, customer="Alice", product="Keyboard", amount=75.00, region="West"),
    Row(order_id=4, customer="Charlie", product="Monitor", amount=350.00, region="East"),
    Row(order_id=5, customer="Alice", product="Webcam", amount=89.00, region="West"),
    Row(order_id=6, customer="Bob", product="Laptop", amount=1100.00, region="East"),
    Row(order_id=7, customer="Diana", product="Mouse", amount=25.00, region="Central"),
    Row(order_id=8, customer="Charlie", product="Keyboard", amount=80.00, region="East"),
    Row(order_id=9, customer="Diana", product="Monitor", amount=400.00, region="Central"),
    Row(order_id=10, customer="Eve", product="Laptop", amount=1300.00, region="West"),
]

df = spark.createDataFrame(sales_data)
df.show()

### Filter and Select

Standard DataFrame operations - `.filter()`, `.select()`

In [None]:
# Filter for high-value orders
from pyspark.sql import functions as F

high_value = (
    df
    .filter(F.col("amount") > 100)
    .select("customer", "product", "amount")
)

high_value.show()

### Aggregations

`.groupBy()` and `.agg()` - exactly like you'd write them

In [None]:
# Customer lifetime value
customer_summary = (
    df
    .groupBy("customer")
    .agg(
        F.count("order_id").alias("total_orders"),
        F.sum("amount").alias("total_spent"),
        F.avg("amount").alias("avg_order")
    )
    .orderBy(F.desc("total_spent"))
)

customer_summary.show()
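
As a sanity check on the output above, the same aggregation can be reproduced locally in plain Python, with no Spark or Snowflake involved. This is only a sketch of the expected numbers for the sample data; the `(customer, amount)` tuple layout and the variable names are assumptions made for brevity.

```python
from collections import defaultdict

# (customer, amount) pairs copied from the sample sales data above
orders = [
    ("Alice", 1200.00), ("Bob", 25.00), ("Alice", 75.00), ("Charlie", 350.00),
    ("Alice", 89.00), ("Bob", 1100.00), ("Diana", 25.00), ("Charlie", 80.00),
    ("Diana", 400.00), ("Eve", 1300.00),
]

# Collect every order amount under its customer
amounts_by_customer = defaultdict(list)
for customer, amount in orders:
    amounts_by_customer[customer].append(amount)

# One (customer, total_orders, total_spent, avg_order) tuple per customer,
# sorted by total_spent descending to mirror orderBy(F.desc("total_spent"))
summary = sorted(
    ((c, len(a), sum(a), sum(a) / len(a)) for c, a in amounts_by_customer.items()),
    key=lambda row: row[2],
    reverse=True,
)

for row in summary:
    print(row)
```

If the table printed by `customer_summary.show()` lists the customers in the same order with the same totals, the aggregation ran as expected.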

In [None]:
# Sales by region
region_summary = (
    df
    .groupBy("region")
    .agg(
        F.count("*").alias("orders"),
        F.sum("amount").alias("revenue")
    )
    .orderBy(F.desc("revenue"))
)

region_summary.show()

### Window Functions

Running totals, rankings - the patterns you use for analytics

In [None]:
# Window function: running total per customer
from pyspark.sql.window import Window

customer_window = Window.partitionBy("customer").orderBy("order_id")

with_running_total = (
    df
    .withColumn("running_total", F.sum("amount").over(customer_window))
    .withColumn("order_rank", F.row_number().over(customer_window))
)

with_running_total.filter(F.col("customer") == "Alice").show()
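
In Spark, a window with an `orderBy` and no explicit frame defaults to a frame that runs from the start of the partition to the current row, which is why `F.sum("amount")` produces a running total here. The plain-Python sketch below works out the values to expect for Alice's rows; it does not use Spark, and the variable names are illustrative only.

```python
from itertools import accumulate

# Alice's orders as (order_id, amount), already sorted by order_id
alice_orders = [(1, 1200.00), (3, 75.00), (5, 89.00)]

# Cumulative sum of the amounts, in order_id order
running_totals = list(accumulate(amount for _, amount in alice_orders))

# (order_id, amount, running_total, order_rank); rank starts at 1 like row_number()
expected = [
    (order_id, amount, total, rank)
    for rank, ((order_id, amount), total) in enumerate(
        zip(alice_orders, running_totals), start=1
    )
]

for row in expected:
    print(row)
```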

### Spark SQL

Register as a temp view and query with SQL - works exactly the same

In [None]:
# Register as temp view
df.createOrReplaceTempView("sales")

# Query with Spark SQL
result = spark.sql("""
    SELECT 
        product,
        COUNT(*) as times_sold,
        SUM(amount) as total_revenue,
        AVG(amount) as avg_price
    FROM sales
    GROUP BY product
    ORDER BY total_revenue DESC
""")

result.show()
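
To cross-check the numbers without a Snowflake connection, the same query text can be run against an in-memory SQLite table using Python's standard library. SQLite is only a stand-in here, not part of Snowpark Connect, and the two-column table is a trimmed-down assumption that keeps just the fields the query reads.

```python
import sqlite3

# (product, amount) pairs copied from the sample sales data
rows = [
    ("Laptop", 1200.00), ("Mouse", 25.00), ("Keyboard", 75.00), ("Monitor", 350.00),
    ("Webcam", 89.00), ("Laptop", 1100.00), ("Mouse", 25.00), ("Keyboard", 80.00),
    ("Monitor", 400.00), ("Laptop", 1300.00),
]

# Load the rows into a throwaway in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# Same SELECT as the Spark SQL cell above
result_rows = conn.execute("""
    SELECT
        product,
        COUNT(*) AS times_sold,
        SUM(amount) AS total_revenue,
        AVG(amount) AS avg_price
    FROM sales
    GROUP BY product
    ORDER BY total_revenue DESC
""").fetchall()
conn.close()

for row in result_rows:
    print(row)
```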

### Write to Snowflake Table

Save your results directly to Snowflake - no data movement needed

In [None]:
# Write aggregated results to a Snowflake table
customer_summary.write.mode("overwrite").saveAsTable("CUSTOMER_SUMMARY")

# Verify it's there
spark.sql("SELECT * FROM CUSTOMER_SUMMARY").show()

---

## What Just Happened?

All that PySpark code you just ran? **It executed on Snowflake's compute engine.**

| What You Wrote | Where It Ran |
|----------------|--------------|
| `spark.createDataFrame()` | Snowflake |
| `.filter()`, `.select()` | Snowflake |
| `.groupBy().agg()` | Snowflake |
| Window functions | Snowflake |
| `spark.sql()` | Snowflake |
| `.write.saveAsTable()` | Snowflake |

### What You Didn't Have To Do

- Spin up a Spark cluster
- Configure executors and memory
- Move data from Snowflake to Spark
- Move results back to Snowflake
- Manage Spark versions and dependencies

### What Changed In Your Code?

Just the session initialization:

```python
# Before (Databricks/Spark)
spark = SparkSession.builder.appName("MyApp").getOrCreate()

# After (Snowflake)
from snowflake import snowpark_connect
spark = snowpark_connect.server.init_spark_session()
```

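**Everything else in this notebook stays exactly the same.** For your own code, check the compatibility guide linked under Next Steps, since support for individual Spark APIs may vary.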
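
If you want one notebook to run against either engine, the session swap can be isolated in a small helper. The sketch below is one possible arrangement, not an official pattern: the `SPARK_BACKEND` environment variable and both function names are hypothetical, and the imports are deferred so that only the selected backend's package needs to be installed.

```python
import os


def pick_backend(env=None):
    """Read the hypothetical SPARK_BACKEND variable; defaults to "snowflake"."""
    env = os.environ if env is None else env
    backend = env.get("SPARK_BACKEND", "snowflake").strip().lower()
    if backend not in ("snowflake", "local"):
        raise ValueError(f"Unknown SPARK_BACKEND: {backend!r}")
    return backend


def get_spark_session(env=None):
    """Create a SparkSession for the selected backend."""
    backend = pick_backend(env)
    if backend == "snowflake":
        # Same call as the Setup cell in this notebook
        from snowflake import snowpark_connect
        return snowpark_connect.server.init_spark_session()
    # Otherwise fall back to the classic builder from the "Before" snippet
    from pyspark.sql import SparkSession
    return SparkSession.builder.appName("MyApp").getOrCreate()
```

With this in place, a notebook would start with `spark = get_spark_session()` and leave the rest of its DataFrame code untouched.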

---

## Next Steps

1. **Try your own code** - Copy a PySpark notebook you have and change only the session initialization
2. **Check compatibility** - [Snowpark Connect Compatibility Guide](https://docs.snowflake.com/en/developer-guide/snowpark-connect/compatibility)
3. **Assess your codebase** - [Snowpark Migration Accelerator](https://docs.snowflake.com/en/developer-guide/snowpark-migration-accelerator) analyzes your Spark code for compatibility