# Run PySpark on Snowflake

Your PySpark code. Snowflake's compute. One-line change.

---

**What this notebook demonstrates:**

1. Standard PySpark DataFrame operations running on Snowflake via Snowpark Connect
2. The only code change required: session initialization
3. Direct read/write to Snowflake tables - no data movement

## Required Packages

After importing this notebook, add the following package from the **Packages** menu:

- `snowpark_connect`

See [Import Python packages](https://docs.snowflake.com/en/user-guide/ui-snowsight/notebooks-import-packages) for details.

## Setup

In [None]:
from snowflake import snowpark_connect

spark = snowpark_connect.server.init_spark_session()

# Import PySpark AFTER initializing the session
from pyspark.sql import Row
from pyspark.sql import functions as F
from pyspark.sql.window import Window

print(f"Spark version: {spark.version}")

---

## PySpark Operations

Standard DataFrame API - nothing Snowflake-specific below.

### Create a DataFrame

In [None]:
sales_data = [
    Row(order_id=1, customer="Alice", product="Laptop", amount=1200.00, region="West"),
    Row(order_id=2, customer="Bob", product="Mouse", amount=25.00, region="East"),
    Row(order_id=3, customer="Alice", product="Keyboard", amount=75.00, region="West"),
    Row(order_id=4, customer="Charlie", product="Monitor", amount=350.00, region="East"),
    Row(order_id=5, customer="Alice", product="Webcam", amount=89.00, region="West"),
    Row(order_id=6, customer="Bob", product="Laptop", amount=1100.00, region="East"),
    Row(order_id=7, customer="Diana", product="Mouse", amount=25.00, region="Central"),
    Row(order_id=8, customer="Charlie", product="Keyboard", amount=80.00, region="East"),
    Row(order_id=9, customer="Diana", product="Monitor", amount=400.00, region="Central"),
    Row(order_id=10, customer="Eve", product="Laptop", amount=1300.00, region="West"),
]

df = spark.createDataFrame(sales_data)
df.show()

### Filter and Select

In [None]:
high_value = (
    df
    .filter(F.col("amount") > 100)
    .select("customer", "product", "amount")
)

high_value.show()

### Aggregations

In [None]:
customer_summary = (
    df
    .groupBy("customer")
    .agg(
        F.count("order_id").alias("total_orders"),
        F.sum("amount").alias("total_spent"),
        F.avg("amount").alias("avg_order")
    )
    .orderBy(F.desc("total_spent"))
)

customer_summary.show()

In [None]:
region_summary = (
    df
    .groupBy("region")
    .agg(
        F.count("*").alias("orders"),
        F.sum("amount").alias("revenue")
    )
    .orderBy(F.desc("revenue"))
)

region_summary.show()
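
As a quick sanity check on the two aggregation cells above, the same totals can be recomputed in plain Python, with no Spark or Snowflake involved. This is only a cross-check sketch: the `orders` list is a hand-copied subset of the columns in `sales_data`, and `summarize` is a hypothetical helper, not part of PySpark or Snowpark Connect.

```python
from collections import defaultdict

# (customer, region, amount) copied by hand from sales_data above
orders = [
    ("Alice", "West", 1200.00),
    ("Bob", "East", 25.00),
    ("Alice", "West", 75.00),
    ("Charlie", "East", 350.00),
    ("Alice", "West", 89.00),
    ("Bob", "East", 1100.00),
    ("Diana", "Central", 25.00),
    ("Charlie", "East", 80.00),
    ("Diana", "Central", 400.00),
    ("Eve", "West", 1300.00),
]


def summarize(rows, key_index):
    """Group rows by one field; return (key, count, total) sorted by total, descending."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for row in rows:
        key = row[key_index]
        counts[key] += 1
        totals[key] += row[2]
    return sorted(
        ((key, counts[key], totals[key]) for key in totals),
        key=lambda item: item[2],
        reverse=True,
    )


customer_check = summarize(orders, key_index=0)  # mirrors customer_summary
region_check = summarize(orders, key_index=1)    # mirrors region_summary

for name, count, total in customer_check:
    print(f"{name:<8} orders={count} total={total:.2f} avg={total / count:.2f}")
```

If the `total_spent` and `revenue` columns printed by `customer_summary.show()` and `region_summary.show()` match these figures, the queries run on Snowflake agree with the local computation.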

### Window Functions

In [None]:
customer_window = Window.partitionBy("customer").orderBy("order_id")

with_running_total = (
    df
    .withColumn("running_total", F.sum("amount").over(customer_window))
    .withColumn("order_rank", F.row_number().over(customer_window))
)

with_running_total.filter(F.col("customer") == "Alice").show()
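
To make the window semantics concrete: with an ordered window and no explicit frame, Spark sums from the start of the partition up to the current row, which produces a running total. The plain-Python sketch below reproduces that for Alice's partition only. The `alice_orders` list is hand-copied from `sales_data`, and nothing here calls PySpark or Snowpark Connect.

```python
from itertools import accumulate

# Alice's orders as (order_id, amount), copied by hand from sales_data above
alice_orders = [(1, 1200.00), (3, 75.00), (5, 89.00)]

# Sort by order_id, matching the window's orderBy("order_id")
alice_orders.sort(key=lambda order: order[0])

# Cumulative sum of amounts in window order
running_totals = list(accumulate(amount for _, amount in alice_orders))

# enumerate(..., start=1) plays the role of row_number()
for rank, ((order_id, amount), total) in enumerate(
    zip(alice_orders, running_totals), start=1
):
    print(
        f"order_id={order_id} amount={amount:.2f} "
        f"running_total={total:.2f} order_rank={rank}"
    )
```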

### Spark SQL

In [None]:
df.createOrReplaceTempView("sales")

result = spark.sql("""
    SELECT
        product,
        COUNT(*) AS times_sold,
        SUM(amount) AS total_revenue,
        AVG(amount) AS avg_price
    FROM sales
    GROUP BY product
    ORDER BY total_revenue DESC
""")

result.show()

### Write to Table

In [None]:
customer_summary.write.mode("overwrite").saveAsTable("CUSTOMER_SUMMARY")

spark.sql("SELECT * FROM CUSTOMER_SUMMARY").show()

---

## Summary

Every operation above executed on Snowflake's compute engine. The DataFrame API calls were translated to Snowflake SQL and run on a Snowflake warehouse.

**What you didn't need:**
- Spark cluster provisioning or tuning
- Data connectors or ETL to move data
- Separate infrastructure to manage

**What's different:** One import and one session initializer. For the code in this notebook, that is the entire migration.

---

## Next Steps

- **Try your own code** - Swap the session initializer in an existing notebook
- **Check compatibility** - [Snowpark Connect API Coverage](https://docs.snowflake.com/en/developer-guide/snowpark-connect/compatibility)
- **Assess at scale** - [Snowpark Migration Accelerator](https://docs.snowflake.com/en/developer-guide/snowpark-migration-accelerator) scans your codebase for compatibility
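
For the first item above, one way to swap the session initializer in an existing notebook is to wrap it in a small helper. This is a minimal sketch, not an official pattern: `get_spark_session` is a hypothetical name, the fallback to a regular `SparkSession` is an assumption about how you might want the same notebook to run outside Snowflake, and the Snowpark Connect call is the same one used in the Setup cell.

```python
def get_spark_session():
    """Return a Spark session, preferring Snowpark Connect when it is installed.

    Hypothetical helper for illustration; not part of PySpark or Snowpark Connect.
    """
    try:
        # Same initializer as the Setup cell in this notebook
        from snowflake import snowpark_connect

        return snowpark_connect.server.init_spark_session()
    except ImportError:
        # Assumed fallback: a regular PySpark session when the package is absent
        from pyspark.sql import SparkSession

        return SparkSession.builder.getOrCreate()
```

In an existing notebook, replacing the line that builds the session with `spark = get_spark_session()` should leave the rest of the DataFrame code unchanged, provided the APIs it uses are covered by Snowpark Connect.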