# Spark over YT in Jupyter

This notebook shows how to use [Spark over YTsaurus](https://ytsaurus.tech/docs/en/user-guide/data-processing/spyt/overview) from a `Jupyter` notebook. It covers the following steps:
1. Setting up a new standalone `Spyt` cluster.
2. Connecting to the `Spyt` cluster from `PySpark` and running queries.
3. Stopping the temporary `Spyt` cluster.

## Prepare Spyt cluster

If you already have a running `Spyt` cluster, set the variable `user_spark_discovery_path` below to its discovery path. Otherwise, leave it as `None`.

In [4]:
user_spark_discovery_path = None

In [5]:
import os
import random
import subprocess
import sys
import uuid

import yt.wrapper as yt

In [6]:
working_dir = f"//tmp/examples/spark-over-yt-in-jupyter_{uuid.uuid4()}"
yt.create("map_node", working_dir, recursive=True)
print(working_dir)

//tmp/examples/spark-over-yt-in-jupyter_392d7561-f2c5-4b2c-bdee-aacff4207ba9


If `user_spark_discovery_path` is not specified, the notebook launches a new small `Spyt` cluster, uses the working directory as its discovery path, and shuts the cluster down at the end.
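
The launch cell that follows reads the proxy address from the `YT_PROXY` environment variable and fails with a bare `KeyError` if it is unset. A minimal sketch of a friendlier check, assuming a hypothetical helper `require_env` that is not part of `yt.wrapper` or `spyt`:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message.

    Hypothetical helper for illustration; not part of yt.wrapper or spyt.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            "export it before launching the cluster."
        )
    return value


# Usage in the notebook (commented out so the sketch runs without a cluster):
# proxy = require_env("YT_PROXY")
```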

In [8]:
spark_discovery_path = user_spark_discovery_path

if spark_discovery_path is None:
    # No existing cluster was given: launch a temporary one in the working directory.
    spark_discovery_path = working_dir
    # Pick a random port for the Spark shuffle service.
    port = random.randint(27100, 27200)

    proxy = os.environ["YT_PROXY"]
    spark_launch_command = [
        "spark-launch-yt",
        "--proxy", proxy,
        "--operation-alias", f"spark-over-yt-in-jupyt-example-{uuid.uuid4()}",
        "--worker-cores", "2",
        "--worker-num", "1",
        "--worker-memory", "64G",
        "--discovery-path", spark_discovery_path,
        "--params", f'{{spark_conf={{spark.shuffle.service.port={port}}}}}',
        "--abort-existing"
    ]
    print(" ".join(spark_launch_command))

    # Run the launcher, wait for it to finish, and capture its output.
    process = subprocess.Popen(spark_launch_command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, env=os.environ)
    stdout, stderr = process.communicate()
    print("Exit code: ", process.returncode)
    print("Stdout: ", stdout.decode())
    print("Stderr: ", stderr.decode())


spark-launch-yt --proxy http://playground.yt.nebius.yt --operation-alias spark-over-yt-in-jupyt-example-040c8ea6-0715-4502-9636-f764fce5f44c --worker-cores 2 --worker-num 1 --worker-memory 64G --discovery-path //tmp/examples/spark-over-yt-in-jupyter_392d7561-f2c5-4b2c-bdee-aacff4207ba9 --params {spark_conf={spark.shuffle.service.port=27166}} --abort-existing


Exit code:  0
Stdout:  
Stderr:  2025-01-21 19:43:58,337 - INFO - spyt.standalone - 2.4.4 cluster version will be launched
2025-01-21 19:43:58,337 - INFO - spyt.standalone - Tmpfs is enabled, spills will be created at RAM
2025-01-21 19:43:58,337 - INFO - spyt.standalone - No disk account is specified
2025-01-21 19:43:58,337 - INFO - spyt.standalone - Launcher files will be placed to tmpfs
2025-01-21 19:43:58,719	INFO	Operation started: https://playground.yt.nebius.yt/playground/operations/fc901496-187c6be9-134403e8-3ad0c5c2/details
2025-01-21 19:43:58,733	INFO	( 0 min) operation fc901496-187c6be9-134403e8-3ad0c5c2 initializing
2025-01-21 19:44:03,780	INFO	( 0 min) Unrecognized spec: {'enable_partitioned_data_balancing': false}
2025-01-21 19:44:03,804	INFO	( 0 min) operation fc901496-187c6be9-134403e8-3ad0c5c2: running=0     completed=0     pending=3     failed=0     aborted=0     lost=0     total=3     blocked=0    
2025-01-21 19:44:08,905	INFO	( 0 min) operation fc901496-187c6be9-1344
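
The `--params` value in the launch command is a YSON-like map written with escaped braces inside an f-string, which is easy to mistype. As a sketch, a hypothetical helper `build_spark_params` (not part of `spyt`) can assemble the same string from a dictionary. Joining several entries with `;` follows the YSON text format; whether `spark-launch-yt` accepts more than the single-key case used in this notebook is an assumption.

```python
def build_spark_params(spark_conf: dict) -> str:
    """Build the value for --params from a dict of Spark settings.

    Hypothetical helper for illustration. For a single key it reproduces
    the string used in the launch command of this notebook.
    """
    # YSON text maps separate entries with ';' (assumed to apply here as well).
    entries = ";".join(f"{key}={value}" for key, value in spark_conf.items())
    # Doubled braces in an f-string produce literal '{' and '}'.
    return f"{{spark_conf={{{entries}}}}}"


print(build_spark_params({"spark.shuffle.service.port": 27166}))
# prints: {spark_conf={spark.shuffle.service.port=27166}}
```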

## Run query

Let's run a query with the `pyspark` library, using a helper from `spyt`.

`spyt.spark_session` is a helper function to connect to the Spyt cluster. The session object behaves the same way as the original one from [pyspark](https://spark.apache.org/docs/latest/api/python/index.html).



Let's find all the running squirrels in a dataset about squirrels from New York's Central Park.

In [12]:
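from spyt import spark_session

# Open a session against the cluster; it is closed when the block exits.
with spark_session(discovery_path=spark_discovery_path) as spark:
    df = spark.read.yt("//home/samples/squirrels")
    # Keep only the rows where the squirrel was seen running.
    filtered = df.filter(df["running"] == True)
    filtered.show()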

log4j:WARN No appenders could be found for logger (tech.ytsaurus.spyt.patch.SparkPatchAgent).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.


25/01/21 19:44:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


25/01/21 19:44:52 WARN Utils: Service 'sparkDriver' could not bind on port 27001. Attempting port 27002.

25/01/21 19:44:52 WARN Utils: Service 'SparkUI' could not bind on port 27001. Attempting port 27002.

25/01/21 19:44:53 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 27001. Attempting port 27002.

2025-01-21 19:44:54,625 - INFO - spyt.client - SPYT Cluster version: 2.4.4


2025-01-21 19:44:54,626 - INFO - spyt.client - SPYT library version: 2.4.4


25/01/21 19:44:58 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


+-----------------+----------------+--------------+-------+-----+----------+-----------------------+--------+-----------------+-------------------+-----------+------------+--------------------------------+-----------------+-------+-------+--------+------+--------+--------------------+-----+-----+-----+----------+-------------+----------+-----------+---------+------------------+
|              lat|             lon|   squirrel_id|hectare|shift|      date|hectare_squirrel_number|     age|primary_fur_color|highlight_fur_color|color_notes|    location|above_ground_sighter_measurement|specific_location|running|chasing|climbing|eating|foraging|    other_activities| kuks|quaas|moans|tail_flags|tail_twitches|approaches|indifferent|runs_from|other_interactions|
+-----------------+----------------+--------------+-------+-----+----------+-----------------------+--------+-----------------+-------------------+-----------+------------+--------------------------------+-----------------+-------+-------

2025-01-21 19:45:01,273 - INFO - py4j.clientserver - Closing down clientserver connection
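
The session returned by `spark_session` exposes the usual DataFrame API, so the filter above can be extended with standard `pyspark` operations. A minimal sketch, assuming the `running` and `primary_fur_color` columns that appear in the table header above; the function name `count_running_by_color` is hypothetical.

```python
def count_running_by_color(spark, table_path="//home/samples/squirrels"):
    """Count running squirrels per primary fur color.

    `spark` is a session obtained from `spyt.spark_session`.
    Returns a DataFrame with one row per fur color and a `count` column.
    """
    df = spark.read.yt(table_path)
    return (
        df.filter(df["running"] == True)  # noqa: E712 (Column comparison)
        .groupBy("primary_fur_color")
        .count()
    )


# Usage in the notebook (commented out so the sketch runs without a cluster):
# with spark_session(discovery_path=spark_discovery_path) as spark:
#     count_running_by_color(spark).show()
```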


### Run query with UDF

You can also run Spark queries that use a [UDF](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.udf.html) (user-defined function).

In [15]:
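from pyspark.sql.functions import udf

@udf
def to_upper(s):
    # Upper-case the value; nulls are passed through as None.
    if s is not None:
        return s.upper()
    return None

with spark_session(discovery_path=spark_discovery_path) as spark:
    df = spark.read.yt("//home/samples/squirrels")
    # Apply the UDF to the "age" column and show the first rows as a pandas DataFrame.
    display(df.select(to_upper("age")).limit(20).toPandas())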

log4j:WARN No appenders could be found for logger (tech.ytsaurus.spyt.patch.SparkPatchAgent).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.


25/01/21 19:45:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


25/01/21 19:45:04 WARN Utils: Service 'sparkDriver' could not bind on port 27001. Attempting port 27002.

25/01/21 19:45:04 WARN Utils: Service 'SparkUI' could not bind on port 27001. Attempting port 27002.

25/01/21 19:45:05 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 27001. Attempting port 27002.

2025-01-21 19:45:06,678 - INFO - spyt.client - SPYT Cluster version: 2.4.4


2025-01-21 19:45:06,680 - INFO - spyt.client - SPYT library version: 2.4.4


|   | to_upper(age) |
|---|---------------|
| 0 |               |
| 1 |               |
| 2 |               |
| 3 | ADULT         |
| 4 | ADULT         |
| 5 | ADULT         |
| 6 | ADULT         |
| 7 | ADULT         |
| 8 | ADULT         |
| 9 | ADULT         |


2025-01-21 19:45:15,733 - INFO - py4j.clientserver - Closing down clientserver connection
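
Because a UDF body runs on the cluster workers, it can help to keep the logic in a plain Python function that is checked locally before it is wrapped. The sketch below assumes hypothetical names `to_upper_py` and `make_to_upper_udf`; `pyspark` is imported lazily so the local check needs no Spark installation.

```python
def to_upper_py(s):
    """Plain-Python logic of the UDF: upper-case a string, pass None through."""
    return s.upper() if s is not None else None


def make_to_upper_udf():
    """Wrap the plain function as a Spark UDF with an explicit return type."""
    # Imported here so that to_upper_py stays testable without pyspark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    return udf(to_upper_py, StringType())


# Local check of the logic, no cluster required.
assert to_upper_py("Adult") == "ADULT"
assert to_upper_py(None) is None
```

For simple transformations like this one, the built-in `pyspark.sql.functions.upper` does the same job without a Python UDF.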


## Stop temporary SPYT cluster

In [17]:
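# Stop the cluster only if this notebook launched it.
if not user_spark_discovery_path:
    # The running operation is registered under the discovery path.
    operations = yt.list(f"{spark_discovery_path}/discovery/operation")
    assert len(operations) == 1
    yt.abort_operation(operation=operations[0])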
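
A bare `assert` reports little when the discovery directory holds zero or several operations. As a sketch, the same two calls, `yt.list` and `yt.abort_operation`, can be wrapped in a hypothetical helper `stop_temporary_cluster` that raises a descriptive error instead:

```python
def stop_temporary_cluster(yt_client, discovery_path):
    """Abort the single operation registered under a discovery path.

    `yt_client` is anything that exposes `list` and `abort_operation`,
    for example the `yt.wrapper` module used in this notebook.
    Returns the id of the aborted operation.
    """
    operations = yt_client.list(f"{discovery_path}/discovery/operation")
    if len(operations) != 1:
        raise RuntimeError(
            f"Expected exactly one operation under {discovery_path}, "
            f"found {len(operations)}: {operations}"
        )
    yt_client.abort_operation(operation=operations[0])
    return operations[0]


# Usage in the notebook (commented out so the sketch runs without a cluster):
# if not user_spark_discovery_path:
#     stop_temporary_cluster(yt, spark_discovery_path)
```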