# Product Category Analysis - Snowpark

This notebook demonstrates DataFrame transformations using **Snowpark Python**.

## What is Snowpark?

Snowpark allows you to:
- Write Python code that executes **inside Snowflake**
- Use **DataFrame APIs** similar to pandas/Spark
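- Push computation down to Snowflake's engine: DataFrame operations are translated to SQL and run in Snowflake when an action such as `show()` or `count()` is called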
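
As a rough illustration of what pushing computation down means, the toy class below records operations and only produces SQL when asked, which is the general pattern behind lazily evaluated DataFrame APIs. `LazyFrame` is a hypothetical stand-in written for this sketch; it is not Snowpark's implementation or API.

```python
class LazyFrame:
    """Toy lazily evaluated frame (hypothetical, for illustration only)."""

    def __init__(self, table, predicates=()):
        self.table = table
        self.predicates = tuple(predicates)

    def filter(self, predicate):
        # Returns a new plan with one more predicate; nothing is executed here.
        return LazyFrame(self.table, self.predicates + (predicate,))

    def to_sql(self):
        # The accumulated plan becomes a single SQL statement only on request.
        sql = f"SELECT * FROM {self.table}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(self.predicates)
        return sql


plan = LazyFrame("EXT_ORDERS").filter("STATUS = 'SHIPPED'").filter("QUANTITY > 1")
print(plan.to_sql())
```

Chaining two filters yields one statement with both predicates, so the engine receives the whole plan at once instead of one step at a time.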

## Key Features Demonstrated:
- Snowpark session initialization
- Reading from Iceberg tables as DataFrames
- DataFrame joins, aggregations, and window functions
- Writing results back to Snowflake

## 1. Initialize Snowpark Session

In a Snowflake Notebook, the session is automatically available.

In [None]:
from snowflake.snowpark.context import get_active_session
from snowflake.snowpark import functions as F
from snowflake.snowpark import Window

session = get_active_session()

print("Snowpark session initialized successfully!")
print(f"Current database: {session.get_current_database()}")
print(f"Current schema: {session.get_current_schema()}")

## 2. Load Iceberg Tables as DataFrames

Snowpark can read directly from Snowflake tables (including Iceberg tables) using `session.table()`.

In [None]:
orders_df = session.table("TLV_BUILD_HOL.EXTERNAL_ICEBERG.EXT_ORDERS")
products_df = session.table("TLV_BUILD_HOL.EXTERNAL_ICEBERG.EXT_PRODUCTS")
customers_df = session.table("TLV_BUILD_HOL.EXTERNAL_ICEBERG.EXT_CUSTOMERS")

print(f"Orders count: {orders_df.count()}")
print(f"Products count: {products_df.count()}")
print(f"Customers count: {customers_df.count()}")

In [None]:
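# A bare expression is only displayed when it is the last line of a cell,
# so print both schemas explicitly.
print("Orders Schema:")
print(orders_df.schema)

print("\nProducts Schema:")
print(products_df.schema)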

In [None]:
orders_df.show(5)
products_df.show(5)

## 3. Product Category Performance Analysis

Filter to completed or shipped orders, join them with products, and aggregate by category and subcategory.

In [None]:
category_performance = (
    orders_df
    .filter(F.col("STATUS").isin(["COMPLETED", "SHIPPED"]))
    .join(products_df, "PRODUCT_ID")
    .group_by("CATEGORY", "SUBCATEGORY")
    .agg(
        F.count("ORDER_ID").alias("ORDER_COUNT"),
        F.count_distinct("CUSTOMER_ID").alias("UNIQUE_CUSTOMERS"),
        F.sum("QUANTITY").alias("UNITS_SOLD"),
        F.sum(F.col("QUANTITY") * F.col("UNIT_PRICE")).alias("GROSS_REVENUE"),
        F.sum(
            F.col("QUANTITY") * F.col("UNIT_PRICE") * (1 - F.col("DISCOUNT_PCT") / 100)
        ).alias("NET_REVENUE"),
        F.sum(F.col("QUANTITY") * F.col("COST_PRICE")).alias("TOTAL_COST"),
        F.avg("QUANTITY").alias("AVG_QUANTITY_PER_ORDER")
    )
)

print("Category Performance:")
category_performance.order_by(F.desc("NET_REVENUE")).show(10)
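
To make the aggregation concrete, here is a minimal plain-Python sketch of the same filter, join, and group-by logic on a few in-memory rows. The sample rows and the `summarize_by_category` helper are hypothetical, and which table holds each column is an assumption. The sketch also assumes `DISCOUNT_PCT` is a percentage between 0 and 100, as the division by 100 in the cell above implies. The Snowpark version runs as SQL inside Snowflake; this only illustrates the arithmetic.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical sample rows, made up for illustration only.
orders = [
    {"ORDER_ID": 1, "CUSTOMER_ID": "C1", "PRODUCT_ID": "P1", "QUANTITY": 2,
     "UNIT_PRICE": Decimal("50.00"), "DISCOUNT_PCT": Decimal("10"), "STATUS": "COMPLETED"},
    {"ORDER_ID": 2, "CUSTOMER_ID": "C1", "PRODUCT_ID": "P1", "QUANTITY": 1,
     "UNIT_PRICE": Decimal("50.00"), "DISCOUNT_PCT": Decimal("0"), "STATUS": "SHIPPED"},
    {"ORDER_ID": 3, "CUSTOMER_ID": "C2", "PRODUCT_ID": "P1", "QUANTITY": 4,
     "UNIT_PRICE": Decimal("50.00"), "DISCOUNT_PCT": Decimal("0"), "STATUS": "CANCELLED"},
]
products = {
    "P1": {"CATEGORY": "Audio", "SUBCATEGORY": "Headphones", "COST_PRICE": Decimal("30.00")},
}


def _new_group():
    return {"ORDER_COUNT": 0, "CUSTOMERS": set(), "UNITS_SOLD": 0,
            "GROSS_REVENUE": Decimal(0), "NET_REVENUE": Decimal(0), "TOTAL_COST": Decimal(0)}


def summarize_by_category(orders, products):
    groups = defaultdict(_new_group)
    for order in orders:
        if order["STATUS"] not in ("COMPLETED", "SHIPPED"):
            continue  # same filter as the Snowpark cell
        product = products.get(order["PRODUCT_ID"])
        if product is None:
            continue  # inner join: orders without a matching product are dropped
        group = groups[(product["CATEGORY"], product["SUBCATEGORY"])]
        gross = order["QUANTITY"] * order["UNIT_PRICE"]
        group["ORDER_COUNT"] += 1
        group["CUSTOMERS"].add(order["CUSTOMER_ID"])
        group["UNITS_SOLD"] += order["QUANTITY"]
        group["GROSS_REVENUE"] += gross
        group["NET_REVENUE"] += gross * (1 - order["DISCOUNT_PCT"] / 100)
        group["TOTAL_COST"] += order["QUANTITY"] * product["COST_PRICE"]
    # Replace the customer set with its size, mirroring count_distinct.
    summary = {}
    for key, group in groups.items():
        row = {name: value for name, value in group.items() if name != "CUSTOMERS"}
        row["UNIQUE_CUSTOMERS"] = len(group["CUSTOMERS"])
        summary[key] = row
    return summary


summary = summarize_by_category(orders, products)
print(summary[("Audio", "Headphones")])
```

The cancelled order is excluded, and the customer who ordered twice is counted once in `UNIQUE_CUSTOMERS`.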

## 4. Calculate Profit Margins and Rankings

Add derived columns and use **Window functions** for ranking.

In [None]:
window_revenue = Window.order_by(F.desc("NET_REVENUE"))
window_category = Window.partition_by("CATEGORY").order_by(F.desc("NET_REVENUE"))

product_analysis = (
    category_performance
    .with_column("GROSS_PROFIT", F.col("NET_REVENUE") - F.col("TOTAL_COST"))
    # Assumes NET_REVENUE is non-zero; Snowflake raises a division-by-zero error otherwise.
    .with_column(
        "PROFIT_MARGIN_PCT",
        F.round(F.col("GROSS_PROFIT") / F.col("NET_REVENUE") * 100, 2)
    )
    .with_column(
        "REVENUE_PER_CUSTOMER",
        F.round(F.col("NET_REVENUE") / F.col("UNIQUE_CUSTOMERS"), 2)
    )
    .with_column("OVERALL_REVENUE_RANK", F.dense_rank().over(window_revenue))
    .with_column("CATEGORY_REVENUE_RANK", F.dense_rank().over(window_category))
)

print("Product Analysis with Rankings:")
product_analysis.order_by(F.desc("NET_REVENUE")).show(15)
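
The ranking step relies on `dense_rank` semantics: tied values share a rank, and the next distinct value gets the next consecutive rank with no gaps. A minimal plain-Python sketch of that behaviour, with an optional partition key, follows. The `dense_rank_desc` helper and the sample figures are hypothetical and exist only to illustrate the window logic.

```python
from collections import defaultdict


def dense_rank_desc(rows, value_key, partition_key=None):
    """Return one dense rank per row (1 = highest value), optionally per partition."""
    def partition_of(row):
        # None stands for a single global partition.
        return row[partition_key] if partition_key else None

    # Collect the distinct values seen in each partition.
    distinct = defaultdict(set)
    for row in rows:
        distinct[partition_of(row)].add(row[value_key])

    # Map each distinct value to its position in descending order.
    rank_of = {
        part: {value: rank for rank, value in enumerate(sorted(values, reverse=True), start=1)}
        for part, values in distinct.items()
    }
    return [rank_of[partition_of(row)][row[value_key]] for row in rows]


rows = [
    {"CATEGORY": "Audio", "SUBCATEGORY": "Headphones", "NET_REVENUE": 300},
    {"CATEGORY": "Audio", "SUBCATEGORY": "Speakers", "NET_REVENUE": 200},
    {"CATEGORY": "Video", "SUBCATEGORY": "Monitors", "NET_REVENUE": 300},
    {"CATEGORY": "Video", "SUBCATEGORY": "Projectors", "NET_REVENUE": 100},
]

overall = dense_rank_desc(rows, "NET_REVENUE")
per_category = dense_rank_desc(rows, "NET_REVENUE", partition_key="CATEGORY")
print(overall)
print(per_category)
```

In the global ranking the two rows with revenue 300 tie for rank 1. Within each category the ranking restarts at 1.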

## 5. Category Summary Statistics

In [None]:
category_summary = (
    product_analysis
    .group_by("CATEGORY")
    .agg(
        F.count("SUBCATEGORY").alias("SUBCATEGORY_COUNT"),
        F.sum("ORDER_COUNT").alias("TOTAL_ORDERS"),
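        # Sums per-subcategory distinct counts, so a customer who bought in
        # several subcategories is counted once per subcategory.
        F.sum("UNIQUE_CUSTOMERS").alias("TOTAL_CUSTOMERS"),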
        F.sum("UNITS_SOLD").alias("TOTAL_UNITS"),
        F.round(F.sum("NET_REVENUE"), 2).alias("TOTAL_REVENUE"),
        F.round(F.sum("GROSS_PROFIT"), 2).alias("TOTAL_PROFIT"),
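        # Total profit over total revenue: a revenue-weighted margin,
        # not a simple mean of the subcategory margins.
        F.round(
            F.sum("GROSS_PROFIT") / F.sum("NET_REVENUE") * 100, 2
        ).alias("AVG_PROFIT_MARGIN")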
    )
    .order_by(F.desc("TOTAL_REVENUE"))
)

print("Category Summary:")
category_summary.show()
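
Two of these aggregates need care when read. Summing the per-subcategory `UNIQUE_CUSTOMERS` values counts a customer once for every subcategory they bought from. `AVG_PROFIT_MARGIN` is total profit over total revenue, which is a revenue-weighted margin, not a simple mean of subcategory margins. A small plain-Python sketch with made-up numbers shows both effects.

```python
# Hypothetical per-subcategory figures for one category, made up for illustration.
subcategories = [
    {"customers": {"C1", "C2"}, "net_revenue": 1000.0, "gross_profit": 100.0},
    {"customers": {"C2", "C3"}, "net_revenue": 100.0, "gross_profit": 50.0},
]

# Summing distinct counts double-counts C2, who bought in both subcategories.
summed_customers = sum(len(s["customers"]) for s in subcategories)
distinct_customers = len(set().union(*(s["customers"] for s in subcategories)))

# Ratio of sums (what the cell above computes) versus mean of ratios.
total_revenue = sum(s["net_revenue"] for s in subcategories)
total_profit = sum(s["gross_profit"] for s in subcategories)
weighted_margin = round(total_profit / total_revenue * 100, 2)
simple_mean_margin = round(
    sum(s["gross_profit"] / s["net_revenue"] * 100 for s in subcategories) / len(subcategories),
    2,
)

print(summed_customers, distinct_customers)
print(weighted_margin, simple_mean_margin)
```

If a true distinct customer count per category is needed, it has to be computed from the order-level data, not from the subcategory summary.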

## 6. Top Performing Products

In [None]:
top_products = (
    orders_df
    .filter(F.col("STATUS").isin(["COMPLETED", "SHIPPED"]))
    .join(products_df, "PRODUCT_ID")
    .group_by("PRODUCT_ID", "PRODUCT_NAME", "CATEGORY", "SUBCATEGORY")
    .agg(
        F.count("ORDER_ID").alias("TIMES_ORDERED"),
        F.sum("QUANTITY").alias("UNITS_SOLD"),
        # Gross revenue: discounts are not applied here, unlike NET_REVENUE in section 3.
        F.round(F.sum(F.col("QUANTITY") * F.col("UNIT_PRICE")), 2).alias("TOTAL_REVENUE"),
        F.round(F.avg("QUANTITY"), 2).alias("AVG_QUANTITY_PER_ORDER")
    )
    # dense_rank keeps ties together, so this filter can return more than 10 rows.
    .with_column("RANK", F.dense_rank().over(Window.order_by(F.desc("TOTAL_REVENUE"))))
    .filter(F.col("RANK") <= 10)
    .order_by("RANK")
)

print("Top 10 Products by Gross Revenue:")
top_products.show()
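
Because `dense_rank` gives tied products the same rank, the `RANK <= 10` filter can return more than ten rows when revenues tie. The plain-Python sketch below contrasts that with a strict cut-off in the style of `row_number`. The `top_n_dense` and `top_n_strict` helpers and the sample revenues are hypothetical.

```python
def top_n_dense(revenues, n):
    """Keep every item whose dense rank (by revenue, descending) is <= n."""
    top_values = set(sorted(set(revenues.values()), reverse=True)[:n])
    return {name: revenue for name, revenue in revenues.items() if revenue in top_values}


def top_n_strict(revenues, n):
    """Keep at most n items; ties are broken alphabetically by name."""
    ordered = sorted(revenues.items(), key=lambda item: (-item[1], item[0]))
    return dict(ordered[:n])


# Made-up revenues with a tie at the top.
revenues = {"Widget": 500.0, "Gadget": 500.0, "Gizmo": 300.0, "Doohickey": 100.0}

print(top_n_dense(revenues, 2))
print(top_n_strict(revenues, 2))
```

With `n = 2`, the dense variant keeps three products because two of them tie for rank 1, while the strict variant keeps exactly two. Which behaviour is preferable depends on whether ties should be reported or cut.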

## 7. Save Results to Snowflake Table

Write the analysis results back to Snowflake using `write.save_as_table()`. With `mode("overwrite")`, each run replaces the table's existing contents.

In [None]:
final_analysis = (
    product_analysis
    .select(
        "CATEGORY",
        "SUBCATEGORY",
        "ORDER_COUNT",
        "UNIQUE_CUSTOMERS",
        "UNITS_SOLD",
        F.round("GROSS_REVENUE", 2).alias("GROSS_REVENUE"),
        F.round("NET_REVENUE", 2).alias("NET_REVENUE"),
        F.round("TOTAL_COST", 2).alias("TOTAL_COST"),
        F.round("GROSS_PROFIT", 2).alias("GROSS_PROFIT"),
        "PROFIT_MARGIN_PCT",
        "REVENUE_PER_CUSTOMER",
        "OVERALL_REVENUE_RANK",
        "CATEGORY_REVENUE_RANK",
        F.current_timestamp().alias("ANALYSIS_TIMESTAMP")
    )
)

final_analysis.write.mode("overwrite").save_as_table(
    "TLV_BUILD_HOL.DATA_ENG_DEMO.PRODUCT_CATEGORY_ANALYSIS"
)

print("Results saved to PRODUCT_CATEGORY_ANALYSIS table!")

## 8. Verify Results

In [None]:
result_df = session.table("TLV_BUILD_HOL.DATA_ENG_DEMO.PRODUCT_CATEGORY_ANALYSIS")
print(f"Rows written: {result_df.count()}")
result_df.order_by("OVERALL_REVENUE_RANK").show(10)

## Summary

### What We Demonstrated:

| Feature | Snowpark Python |
|---------|----------------|
| Session Init | `get_active_session()` |
| Read Tables | `session.table("DB.SCHEMA.TABLE")` |
| DataFrame Ops | `filter()`, `join()`, `group_by()`, `agg()` |
| Window Functions | `Window.order_by()`, `Window.partition_by()` |
| Write Tables | `df.write.save_as_table(...)` |

### Key Benefits of Snowpark:

1. **No Data Movement** - Processing happens inside Snowflake
2. **Familiar APIs** - Similar to pandas/PySpark
3. **Unified Governance** - Same security/access controls as SQL
4. **Works out of the box** - In a Snowflake Notebook the session is already available, and this analysis needs no packages beyond Snowpark

### Output:
- `PRODUCT_CATEGORY_ANALYSIS` table with category/subcategory metrics
- Ready for dashboards, reporting, or downstream pipelines