# DST-Spark Template Notebook

This notebook provides a template for connecting to Spark in two ways:
1.  **Connecting to the shared DST-Spark cluster master.**
2.  **Running a standalone local Spark session for testing.**

## Option 1: Connect to the DST-Spark Cluster

This is the standard way to work. It uses the `spark_connector` module to connect your notebook to the shared Spark master node. All your computations will run on the cluster.

In [4]:
import importlib
from spark_connector import get_spark_session

# Stop any existing session; otherwise Spark will not pick up the new configuration.
try:
    spark.stop()
except NameError:
    # No session has been created in this kernel yet.
    pass

# This function reads your environment variables and returns a
# pre-configured Spark session connected to the master. The arguments
# below limit the resources the session may use.
# After creation, check http://srvpython16:8080/#running-app to confirm the app is running.
# To use the default settings instead, call it without arguments:
# spark = get_spark_session()

spark = get_spark_session(
    cores_max=2,
    executor_cores=1,
    executor_memory="2g",
    memory_overhead="1g",
    dynamic_allocation=False
)
print("Successfully connected to Spark Master!")
print(f"Spark version: {spark.version}")
print(f"Spark master URL: {spark.conf.get('spark.master')}")
print(f"Number of cores: {spark.sparkContext.getConf().get('spark.cores.max')}")

# You can now use the 'spark' session for your work
spark.sql("SELECT 'Hello from Spark Master!' as message").show()


Successfully connected to Spark Master!
Spark version: 4.0.0
Spark master URL: spark://localhost:7077
Number of cores: 2


+--------------------+
|             message|
+--------------------+
|Hello from Spark ...|
+--------------------+
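
The keyword arguments passed to `get_spark_session` above resemble standard Spark configuration properties. The internals of `spark_connector` are not shown in this notebook, so the following is only a sketch of how such arguments could map onto real Spark property names. The function `build_spark_conf` is hypothetical and is not part of `spark_connector`.

```python
def build_spark_conf(cores_max=2, executor_cores=1, executor_memory="2g",
                     memory_overhead="1g", dynamic_allocation=False):
    """Hypothetical helper: map resource limits to Spark property names.

    The property names are standard Spark settings; the mapping itself is
    an assumption about what a connector module might do.
    """
    return {
        "spark.cores.max": str(cores_max),
        "spark.executor.cores": str(executor_cores),
        "spark.executor.memory": executor_memory,
        "spark.executor.memoryOverhead": memory_overhead,
        # Spark expects lowercase "true"/"false" for boolean properties.
        "spark.dynamicAllocation.enabled": str(dynamic_allocation).lower(),
    }


conf = build_spark_conf()
for key, value in sorted(conf.items()):
    print(f"{key}={value}")
```

Each pair could be applied with `SparkSession.builder.config(key, value)`. With `spark.cores.max` set to 2 and `spark.executor.cores` set to 1, a standalone master can start at most two single-core executors for the application.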



                                                                                

### Example: Read and Write data from MinIO

This example writes a small DataFrame as a Delta table to the `shared` bucket in MinIO and reads it back. You need write permission on the bucket for this to work.

In [5]:
# Create a sample DataFrame
data = [("Alice", 34), ("Bob", 45), ("Catherine", 29)]
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)

s3_delta = "s3a://shared/template_notebook_delta"

print(f"Writing Delta to {s3_delta}...")
df.write.format("delta").mode("overwrite").save(s3_delta)
print("Write complete.")

print(f"Reading Delta from {s3_delta}...")
read_df = spark.read.format("delta").load(s3_delta)
read_df.show()

Writing Delta to s3a://shared/template_notebook_delta...


25/09/22 08:18:13 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
25/09/22 08:18:16 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

Write complete.
Reading Delta from s3a://shared/template_notebook_delta...


                                                                                

+---------+---+
|     name|age|
+---------+---+
|      Bob| 45|
|Catherine| 29|
|    Alice| 34|
+---------+---+



In [None]:
# Target: a private (personal) bucket
s3_delta = "s3a://home-victor/template_notebook_test_data"


print(f"Writing Delta to {s3_delta}...")
df.write.format("delta").mode("overwrite").save(s3_delta)
print("Write complete.")

print(f"Reading Delta from {s3_delta}...")
# Do not chain .show() here: it returns None, which would break read_df.show() below.
read_df = spark.read.format("delta").load(s3_delta)
read_df.show()

In [None]:
# Target: someone else's bucket (requires write permission there)
s3_delta = "s3a://home-x11/silver/template_notebook_test_data" # home-x11 => Thomas

print(f"Writing Delta to {s3_delta}...")
df.write.format("delta").mode("overwrite").save(s3_delta)
print("Write complete.")

print(f"Reading Delta from {s3_delta}...")
read_df = spark.read.format("delta").load(s3_delta)
read_df.show()
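
The three cells above repeat the same write-then-read steps and differ only in the target path. As a sketch, the steps could be wrapped in one function. The name `roundtrip_delta` is hypothetical and is not part of `spark_connector`; it uses only the DataFrame reader and writer calls already shown above.

```python
def roundtrip_delta(spark, df, path):
    """Overwrite the Delta table at `path` with `df`, then read it back."""
    print(f"Writing Delta to {path}...")
    df.write.format("delta").mode("overwrite").save(path)
    print("Write complete.")

    print(f"Reading Delta from {path}...")
    # Return the DataFrame itself. Call .show() on the result separately,
    # because .show() returns None.
    return spark.read.format("delta").load(path)
```

With an active session, `read_df = roundtrip_delta(spark, df, "s3a://shared/template_notebook_delta")` followed by `read_df.show()` would reproduce the first example. Spark does not guarantee row order on read, which is why the output above lists Bob before Alice.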

## Option 2: Run a Local Spark Session

This is useful for small tests or when you don't need the power of the cluster. This Spark session runs entirely inside the Jupyter container and does not connect to the Spark master or workers. It won't be able to access data on MinIO unless you configure it with S3 credentials.

In [None]:
from pyspark.sql import SparkSession

# This creates a completely local Spark session
local_spark = (
    SparkSession.builder
    .appName("LocalJupyterTest")
    .master("local[*]") # Use all available local cores
    .getOrCreate()
)

print("Successfully created a local Spark session!")
print(f"Spark version: {local_spark.version}")
print(f"Spark master URL: {local_spark.conf.get('spark.master')}")

# You can now use the 'local_spark' session
local_spark.sql("SELECT 'Hello from Local Spark!' as message").show()
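
As noted above, a local session cannot reach MinIO until it is given S3 credentials. The sketch below collects the Hadoop S3A properties such a session would typically need. The property names are standard S3A settings, but the environment variable names (`MINIO_ENDPOINT`, `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`) and the function `s3a_conf_from_env` are assumptions made for illustration; check which variables your environment actually provides.

```python
import os


def s3a_conf_from_env(env=os.environ):
    """Build S3A settings from environment variables (hypothetical names).

    Raises KeyError if one of the expected variables is missing.
    """
    return {
        "spark.hadoop.fs.s3a.endpoint": env["MINIO_ENDPOINT"],
        "spark.hadoop.fs.s3a.access.key": env["MINIO_ACCESS_KEY"],
        "spark.hadoop.fs.s3a.secret.key": env["MINIO_SECRET_KEY"],
        # MinIO is usually addressed by path (endpoint/bucket) rather than
        # by virtual-host-style bucket names.
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


# Demonstration with placeholder values instead of real credentials.
settings = s3a_conf_from_env({
    "MINIO_ENDPOINT": "http://minio.example:9000",
    "MINIO_ACCESS_KEY": "example-access-key",
    "MINIO_SECRET_KEY": "example-secret-key",
})
print(sorted(settings))  # print the property names only, never the secrets
```

Each pair could be passed to `SparkSession.builder.config(key, value)` before `getOrCreate()`. The session also needs the `hadoop-aws` library on its classpath for `s3a://` paths to resolve.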

## Clean up

It's good practice to stop your Spark sessions when you're finished to release resources on the cluster.

In [None]:
# Stop the sessions when you're done
print("Stopping Spark Master session...")
if 'spark' in locals():
    spark.stop()

print("Stopping Local Spark session...")
if 'local_spark' in locals():
    local_spark.stop()

print("Done.")
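
If a cell fails halfway through, the clean-up cell above may never run and the session keeps its cluster resources. One possible way to guard against that is a small context manager that always stops the session. The name `stopping` is hypothetical; the sketch relies only on the session having a `stop()` method.

```python
from contextlib import contextmanager


@contextmanager
def stopping(session):
    """Yield `session` and always call session.stop() afterwards."""
    try:
        yield session
    finally:
        # Runs on normal exit and when the body raises an exception.
        session.stop()
```

For example, `with stopping(get_spark_session()) as spark:` followed by the work you want to run would release the session even if an exception is raised inside the block.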