# Spline Demo

You can access the Spline services at the following URLs:
- Spline Web UI: http://localhost:9090
- Spline Server: http://localhost:8080

Both [Execution Events](http://localhost:9090/app/events/list) and [Data Sources](http://localhost:9090/app/data-sources/list) are empty at this point because no Spark job has run yet.
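Before going further, it can help to confirm that both services respond. The following is a minimal sketch using only the Python standard library; the helper name `check_services` and the rule "any response below HTTP 400 counts as up" are assumptions made for illustration, not part of Spline.

```python
import urllib.error
import urllib.request

# URLs taken from the list above.
SPLINE_SERVICES = {
    "Spline Web UI": "http://localhost:9090",
    "Spline Server": "http://localhost:8080",
}


def check_services(services, fetch=urllib.request.urlopen, timeout=5):
    """Return {service name: bool}, True when the URL answers with status < 400.

    `fetch` is injectable so the function can be exercised without a network.
    """
    status = {}
    for name, url in services.items():
        try:
            with fetch(url, timeout=timeout) as response:
                status[name] = response.status < 400
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, DNS failure or an HTTP error status.
            status[name] = False
    return status


# Usage:
# for name, is_up in check_services(SPLINE_SERVICES).items():
#     print(f"{name}: {'up' if is_up else 'not reachable'}")
```

If either service is reported as not reachable, the containers are probably still starting or the ports differ from the ones listed above.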

Next, we will create a Spark session with lineage tracking enabled. The first run may take a while because Spark also downloads the required packages.

In [1]:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

# Ref: https://github.com/AbsaOSS/spline-getting-started/blob/f4866aa/spline-on-databricks/README.md
spark = (SparkSession
    .builder
    .appName("Spline Demo")
    .config("spark.driver.extraJavaOptions", "--add-opens=java.base/sun.net.www.protocol.jar=ALL-UNNAMED")
    .config("spark.spline.lineageDispatcher.http.producer.url", "http://host.docker.internal:8080/producer")
    .config("spark.jars.packages", "za.co.absa.spline.agent.spark:spark-3.5-spline-agent-bundle_2.12:2.2.0")
    .getOrCreate()
)
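# Enable lineage tracking programmatically by calling the Spline agent's initializer through the JVM gateway.
spark.sparkContext._jvm.za.co.absa.spline.harvester.SparkLineageInitializer.enableLineageTracking(spark._jsparkSession)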

JavaObject id=o34
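As an alternative to the explicit `enableLineageTracking` call, the Spline agent can, to my understanding, also be registered through Spark's `spark.sql.queryExecutionListeners` setting. The sketch below collects the options from the cell above into a dictionary and adds that listener entry; the listener class name is an assumption to verify against the Spline agent documentation for your version, and `apply_conf` is a hypothetical helper, not a Spark or Spline API.

```python
# Options copied from the cell above, plus the listener setting.
SPLINE_CONF = {
    "spark.driver.extraJavaOptions": "--add-opens=java.base/sun.net.www.protocol.jar=ALL-UNNAMED",
    "spark.spline.lineageDispatcher.http.producer.url": "http://host.docker.internal:8080/producer",
    "spark.jars.packages": "za.co.absa.spline.agent.spark:spark-3.5-spline-agent-bundle_2.12:2.2.0",
    # Assumption: registering this listener stands in for the explicit
    # SparkLineageInitializer.enableLineageTracking(...) call.
    "spark.sql.queryExecutionListeners": "za.co.absa.spline.harvester.listener.SplineQueryExecutionListener",
}


def apply_conf(builder, conf):
    """Apply every key/value pair to a builder exposing .config(key, value)."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# Usage (requires pyspark):
# from pyspark.sql import SparkSession
# spark = apply_conf(SparkSession.builder.appName("Spline Demo"), SPLINE_CONF).getOrCreate()
```

Keeping the options in one dictionary makes it easier to reuse the same settings across notebooks or to print them when debugging a session that does not report lineage.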

## Read datasets

Let's read sample product, customer and raw sales data.

In [2]:
input_dir = "/home/jovyan/data"
output_dir = "/home/jovyan/output"
product = spark.read.option("header", True).csv(f"{input_dir}/product")
customer = spark.read.option("header", True).csv(f"{input_dir}/customer")
sales_raw = spark.read.option("header", True).csv(f"{input_dir}/sales_raw")

Generate datasets for US customers and SG customers.

In [3]:
(customer
    .filter(f.col("country") == 'US')
    .drop(f.col("country"))
    .write.mode("overwrite")
    .option("header", True)
    .csv(f"{output_dir}/customer_us")
)

(customer
    .filter(f.col("country") == 'SG')
    .drop(f.col("country"))
    .write.mode("overwrite")
    .option("header", True)
    .csv(f"{output_dir}/customer_sg")
)

### Data Sources
The [Execution Events](http://localhost:9090/app/events/list) page should now show 2 executions, and `customer`, `customer_sg` and `customer_us` should appear under [Data Sources](http://localhost:9090/app/data-sources/list).

![](demo_images/customer_1.png)

We can also see the lineage at column level by opening the execution plan details. Note that this is at the [individual execution plan](https://github.com/AbsaOSS/spline/discussions/1331#discussioncomment-9646428) level and not end-to-end.

We can see that the `country` column is dropped while `name` is carried over.

![](demo_images/customer_2.png)
![](demo_images/customer_3.png)
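To make the idea of column-level lineage concrete, here is a toy model of what the filter-and-drop step does to the columns. It is only an illustration and not Spline's data model; the input schema `id`, `name`, `country` is an assumption based on the columns this demo references, and `columns_after_drop` is a hypothetical helper.

```python
def columns_after_drop(input_columns, dropped):
    """Map each surviving output column to the input column it comes from."""
    return {col: col for col in input_columns if col not in dropped}


# Assumed schema of the `customer` dataset (only the columns used in this demo).
customer_columns = ["id", "name", "country"]

# The filter reads `country`, but the column is dropped before the write,
# so it has no counterpart in the output.
lineage = columns_after_drop(customer_columns, dropped={"country"})
print(lineage)  # {'id': 'id', 'name': 'name'}
```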

## Generate Joined Dataset
Next, we will check how lineage works when multiple sources are joined.

In [4]:
sales_report = (sales_raw
    .join(product, sales_raw.product_id == product.id)
    .join(customer, sales_raw.customer_id == customer.id)
    .select(
        customer["name"].alias("customer_name"),
        product["name"].alias("product_name"),
        sales_raw["qty"],
    )
)
sales_report.write.mode("overwrite").option("header", True).csv(f"{output_dir}/sales_report")
sales_report.show()

+-------------+-------------+---+
|customer_name| product_name|qty|
+-------------+-------------+---+
|        Alice|Awesome Apple|  1|
|          Bob|Awesome Apple| 10|
|          Bob|   Big Banana|  3|
+-------------+-------------+---+



In the Spline Web UI, the execution event and execution plan for `sales_report` should look similar to this.
![Sales Execution Event](demo_images/sales_1.png "Sales Execution Event")
![Sales Execution Plan](demo_images/sales_2.png "Sales Execution Plan")
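Taken together, the three writes in this demo form a small dataset-level graph. The sketch below records it as a plain dictionary and answers the question "which outputs depend on a given source?". It is an illustration of the concept only, not how Spline stores or queries lineage, and `downstream_of` is a hypothetical helper.

```python
# Output dataset -> source datasets, as written by the cells in this demo.
LINEAGE = {
    "customer_us": ["customer"],
    "customer_sg": ["customer"],
    "sales_report": ["sales_raw", "product", "customer"],
}


def downstream_of(source, lineage):
    """Return the sorted names of outputs that read directly from `source`."""
    return sorted(target for target, sources in lineage.items() if source in sources)


print(downstream_of("customer", LINEAGE))  # ['customer_sg', 'customer_us', 'sales_report']
print(downstream_of("product", LINEAGE))   # ['sales_report']
```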