
## Multiple Interfaces
You can interact with structured data (tables) in multiple ways using Spark.

There are three ways to query and transform tables:
1. Executing SQL queries
2. Working with the data using PySpark (python)
3. Using SQL + PySpark (python) to create a DataFrame

**Method 1: Executing SQL queries**

The `%sql` magic command runs the cell as Spark SQL. This basic query returns every column and row of the `orders` table.

In [0]:
%sql
-- use spark SQL to get results from the orders table
SELECT * FROM samples.tpch.orders

**Method 2: Working with the data using PySpark (python)**

We can express queries using the Spark DataFrame API, which provides a Python-friendly, programmatic way to manipulate data.

The following cell returns a DataFrame containing the same results as those retrieved above.

In [0]:
# Load the 'orders' table into a DataFrame
df = spark.table('samples.tpch.orders')

# Display the contents of the DataFrame called df
df.display()

**Method 3: Using SQL + PySpark (python) to create a DataFrame**

PySpark also supports running SQL queries directly. This is helpful for users coming from a SQL background or when you want to express complex logic in standard SQL syntax.

You can run SQL queries with spark.sql(), and the result will be returned as a DataFrame that can be further transformed using PySpark (python).

In [0]:
# Run a SQL query to retrieve data from the 'orders' table
df = spark.sql("""
    SELECT 
      o_orderstatus,
      year(o_orderdate) AS YEAR,
      COUNT(*) AS order_count
    FROM samples.tpch.orders
    GROUP BY ALL
""")

# Display the results of the query
df.display()

### Creating a SQL Temporary View from a DataFrame

Once we have a DataFrame, we can register it as a temporary view using `.createOrReplaceTempView()`.
This enables us to query the data using **SQL**, just like a table. The view is scoped to the current Spark session and is not saved to the catalog.

This is especially powerful because it allows you to **seamlessly switch between Python and SQL within your notebook**: use PySpark (python) when you want programmatic control, and SQL when you want to write familiar declarative queries.

In [0]:
# Create a temporary view from the aggregated orders DataFrame
df.createOrReplaceTempView('order_status_by_year')

**Now you can query this DataFrame through its temporary view using SQL!**

In [0]:
%sql
-- use Spark SQL to get results from the temporary view
SELECT * FROM order_status_by_year

### PySpark DataFrames Are Not Modified In-Place
In PySpark, DataFrames are not modified in place. When you apply a transformation, such as filtering rows or adding a new column, the original DataFrame stays the same and a new DataFrame is returned with the changes applied. To keep the result, assign it to a variable (often the same one).

In [0]:
# Load the orders table to a dataframe (df)
df = spark.table("samples.tpch.orders")

# Apply a transformation without assigning the result; the filtered DataFrame is discarded
df.filter(df.o_orderstatus == 'F')

# Show that the original DataFrame still contains all statuses
display(df)

# Therefore you will often see syntax like this, which reassigns the variable to the filtered result
df = df.filter(df.o_orderstatus == 'F')

### Data Manipulation Using PySpark

Now that we’ve explored how to load and query data using both **SQL and PySpark (python)**, the next step is to perform some light data manipulation using PySpark.

PySpark provides a rich set of functions for transforming, cleaning, and enriching your data — all accessible through the `pyspark.sql.functions` module.

We import this module with the alias `F` (i.e., `from pyspark.sql import functions as F`) for two main reasons:

  ✅ It keeps our code clean and concise, especially when chaining multiple transformations.

  ✅ It allows us to clearly distinguish between DataFrame column references (`F.col(...)`) and literal values (`F.lit(...)`), as well as access useful helpers like `F.concat_ws`, `F.year`, `F.round`, and more.

In the following examples, we’ll apply a few common transformations to the `orders` table using this approach.

In [0]:
from pyspark.sql import functions as F

# Load the orders table
df = spark.table("samples.tpch.orders")

# Step 1: Filter for orders with status 'F' (fulfilled)
df = df.filter(df.o_orderstatus == 'F')

# Step 2: Add a new column that calculates estimated shipping delay (pretend logic)
df = df.withColumn("estimated_delay_days", F.lit(7))

# Step 3: Concatenate order priority and status into a single label
df = df.withColumn("priority_status", F.concat_ws(" - ", F.col("o_orderpriority"), F.col("o_orderstatus")))

# Step 4: Rename o_orderdate to order_date for readability
df = df.withColumnRenamed("o_orderdate", "order_date")

# Step 5: Add rounded_price column
df = df.withColumn("rounded_price", F.round("o_totalprice", 0))

# Step 6: Add a flag for high-value orders
df = df.withColumn("is_high_value", F.col("o_totalprice") > 100000)

# Step 7: Add year and month
df = df.withColumn("order_year", F.year("order_date"))
df = df.withColumn("order_month", F.month("order_date"))

# Step 8: Add a ship deadline of 5 days
df = df.withColumn("ship_deadline", F.date_add("order_date", 5))

# Display the transformed DataFrame
df.display()


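We can also apply all of these transformations using a chained syntax, which keeps the code compact and readable. Because Spark evaluates transformations lazily, chaining does not change what is computed; it is the same set of operations written as a single expression.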

**The operations include:**

🔍 Filtering for only fulfilled orders (o_orderstatus = 'F')

🧮 Adding new columns such as a static shipping delay, a high-value flag, and a rounded price

🧾 Concatenating columns to create a readable status label

🗓️ Extracting date components like year and month from the order date

📦 Renaming columns for clarity

📅 Calculating a future shipping deadline

In [0]:
# Perform all transformations in a single chain
df = (
    spark.table("samples.tpch.orders")
    .filter(F.col("o_orderstatus") == "F")  # Step 1: Filter for fulfilled orders
    .withColumn("estimated_delay_days", F.lit(7))  # Step 2: Add static column
    .withColumn("priority_status", F.concat_ws(" - ", F.col("o_orderpriority"), F.col("o_orderstatus")))  # Step 3
    .withColumnRenamed("o_orderdate", "order_date")  # Step 4
    .withColumn("rounded_price", F.round("o_totalprice", 0))  # Step 5
    .withColumn("is_high_value", F.col("o_totalprice") > 100000)  # Step 6
    .withColumn("order_year", F.year("order_date"))  # Step 7a
    .withColumn("order_month", F.month("order_date"))  # Step 7b
    .withColumn("ship_deadline", F.date_add("order_date", 5))  # Step 8
)

# Display the final result
df.display()

### Joining and Aggregating with SQL

Now that we’ve enriched our orders DataFrame with new fields, we’ll use Spark SQL to join it with the customer table. This allows us to analyze customer behavior across different market segments and order years.

Specifically, we will:
  1. Join orders with customer on the customer key
  2. Group the results by mktsegment and order_year
  3. Calculate:
      - The **number of high-value orders**
      - The **total order price** for each group



In [0]:
# Register the transformed orders DataFrame as a temporary view
df.createOrReplaceTempView("transformed_orders")

# Use SQL to join with customer and perform aggregations
result_df = spark.sql("""
    SELECT 
        c.c_mktsegment AS mktsegment,
        o.order_year,
        COUNT(CASE WHEN o.is_high_value THEN 1 END) AS high_value_order_count,
        SUM(o.o_totalprice) AS total_order_price
    FROM transformed_orders o
    JOIN samples.tpch.customer c
        ON o.o_custkey = c.c_custkey
    GROUP BY c.c_mktsegment, o.order_year
    ORDER BY c.c_mktsegment, o.order_year
""")

# Display the results
result_df.display()