# Call a Ray-based Web Service from a PySpark UDF

For completeness, this notebook demonstrates a more conventional alternative: the PySpark UDF makes an explicit remote call to a separate service. This approach is conceptually simpler and has advantages in production, where it may be easier to manage the Spark and Ray processes separately. See also the discussion in [Spark-RayUDF.ipynb](Spark-RayUDF.ipynb), the notebook used in my Spark + AI Summit 2020 talk.

You can learn more about Ray [here](http://ray.io).

[Dean Wampler](mailto:dean@anyscale.com)

> **Note:** Requires Java 8!

In [26]:
!java -version

java version "1.8.0_221"
Java(TM) SE Runtime Environment (build 1.8.0_221-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.221-b11, mixed mode)


In [27]:
import json
import time  # used by DataGovernanceSystem for sleeping and up-time tracking
import pyspark
import ray
from ray.util import named_actors

In [28]:
from pyspark.sql.types import DataType, BooleanType, NullType, IntegerType, StringType, MapType

In [29]:
from pyspark.sql.functions import udf

Define a `DataGovernanceSystem` Ray actor that represents our governance system. All it does is add each reported id to an internal collection. A more realistic implementation would forward the ids, along with other useful metadata, asynchronously to a real governance system, like [Apache Atlas](http://atlas.apache.org/#/).

In [30]:
@ray.remote
class DataGovernanceSystem:
    def __init__(self, name = 'DataGovernanceSystem'):
        self.name = name
        self.ids = []
        self.start_time = time.time()

    def log(self, id_to_log):
        """
        Log record ids that have been processed.
        Simulate an expensive operation by sleeping for 0.1 seconds
        """
        time.sleep(0.1)
        self.ids.append(id_to_log)

    def get_ids(self):
        """Return the ids logged. Don't call this if the list is long!"""
        return self.ids

    def get_count(self):
        """Return the count of ids logged."""
        return len(self.ids)

    def reset(self):
        """Forget all ids that have been logged."""
        self.ids = []

    def get_start_time(self):
        return self.start_time

    def get_up_time(self):
        return time.time() - self.start_time
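
The actor above keeps every id in memory and sleeps inside `log`. The "forward asynchronously" idea mentioned earlier can be sketched without Ray, using only the standard library. This is a minimal illustration under stated assumptions, not the notebook's implementation: `AsyncForwarder` and its `send` callable are hypothetical names, and `send` stands in for whatever client a real governance system would provide.

```python
import queue
import threading


class AsyncForwarder:
    """Buffer ids in memory and deliver them from a background thread."""

    def __init__(self, send):
        self._send = send  # hypothetical callable that delivers one id
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def log(self, id_to_log):
        """Return immediately; the slow delivery happens on the worker thread."""
        self._queue.put(id_to_log)

    def _run(self):
        while True:
            item = self._queue.get()
            try:
                self._send(item)
            finally:
                self._queue.task_done()

    def flush(self):
        """Block until every queued id has been delivered."""
        self._queue.join()


delivered = []
forwarder = AsyncForwarder(delivered.append)  # list.append plays the remote system
for i in range(3):
    forwarder.log(i)
forwarder.flush()
print(delivered)  # [0, 1, 2]
```

With a single worker thread and a FIFO queue, ids are delivered in the order they were logged. Inside a Ray actor the same shape could apply, with `log` enqueueing and a background task draining the queue.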

Define a simple `Record` type with a `record_id` field, used for logging to `DataGovernanceSystem`, and an opaque `data` field with everything else.

In [None]:
class Record:
    def __init__(self, record_id, data):
        self.record_id = record_id
        self.data = data
    def __str__(self):
        return f'Record(record_id={self.record_id},data={self.data})'

Now initialize Ray. The `address='auto'` argument tells Ray to connect to an already running cluster that this node belongs to, so Ray can find what it needs locally.

In [6]:
ray.init(address='auto', ignore_reinit_error=True)



{'node_ip_address': '192.168.1.149',
 'raylet_ip_address': '192.168.1.149',
 'redis_address': '192.168.1.149:14668',
 'object_store_address': '/tmp/ray/session_2020-05-16_12-00-17_506849_39838/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-05-16_12-00-17_506849_39838/sockets/raylet',
 'webui_url': 'localhost:8265',
 'session_dir': '/tmp/ray/session_2020-05-16_12-00-17_506849_39838'}

In [7]:
print(f'Click here to open the Ray Dashboard: http://{ray.get_webui_url()}')

Click here to open the Ray Dashboard: http://localhost:8265


In [10]:
actor_name = 'dgs'
gov = DataGovernanceSystem.remote(actor_name)
named_actors.register_actor(actor_name, gov)
gov

Actor(DataGovernanceSystem, 45b95b1c0100)

In [11]:
test_records = [Record(i, f'data: {i}') for i in range(3)] 
for record in test_records:
    print(record)
    gov.log.remote(record.record_id)

Record(record_id=0,data=data: 0)
Record(record_id=1,data=data: 1)
Record(record_id=2,data=data: 2)


In [12]:
def gov_status():
    gov = named_actors.get_actor(name='dgs')
    print(f'count:   {ray.get(gov.get_count.remote())}')
    print(f'ids:     {ray.get(gov.get_ids.remote())}')
    print(f'up time: {ray.get(gov.get_up_time.remote())}')

In [13]:
gov_status()

count:   3
ids:     [0, 1, 2]
up time: 5.274358034133911


Reset the governance actor's state:

In [14]:
gov.reset.remote()
gov_status()

count:   0
ids:     []
up time: 11.258324146270752


In [15]:
@ray.remote
def ray_log_record(id):
    from ray.util import named_actors
    gov = named_actors.get_actor(name='dgs')
    gov.log.remote(id)   # Will run asynchronously, returning a future.
    count = ray.get(gov.get_count.remote())  # but this blocks!
    return count
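
The two comments in `ray_log_record` describe a fire-and-forget call followed by a blocking one. The sketch below imitates that behavior with the standard library only, assuming the actor handles calls from one caller one at a time, in submission order. `ActorSketch` is a hypothetical stand-in, not a Ray API; a single-thread executor plays the role of the actor's mailbox.

```python
from concurrent.futures import ThreadPoolExecutor


class ActorSketch:
    """Hypothetical stand-in for an actor: one call at a time, in order."""

    def __init__(self):
        self._ids = []
        self._mailbox = ThreadPoolExecutor(max_workers=1)

    def log(self, id_to_log):
        # Returns a Future immediately; the append runs later on the worker.
        return self._mailbox.submit(self._ids.append, id_to_log)

    def get_count(self):
        # Queued behind any earlier log() calls.
        return self._mailbox.submit(len, self._ids)

    def shutdown(self):
        self._mailbox.shutdown(wait=True)


gov_sketch = ActorSketch()
gov_sketch.log('a')                       # does not block
count = gov_sketch.get_count().result()   # blocks until earlier calls finish
gov_sketch.shutdown()
print(count)  # 1
```

Because the count request queues behind the log request, the returned count already includes the id just logged. The cost is that every record waits for a round trip.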

In [16]:
def log_record(id):
    """
    This function will become a UDF for Spark. Since each Spark task runs in a separate process, 
    we'll initialize Ray, connecting to the running cluster, if it is not already initialized.
    """
    did_initialization = 0
    if not ray.is_initialized():
        ray.init(address='auto', redis_password='5241590000000000')
        did_initialization = 1
    count = ray.get(ray_log_record.remote(id))
    return {'initialized': did_initialization, 'count': count}

In [17]:
spark = pyspark.sql.SparkSession.builder \
    .master("local[*]") \
    .appName("Data Governance Example") \
    .getOrCreate()

In [18]:
log_record_udf = udf(lambda id: log_record(id), MapType(StringType(), IntegerType()))

In [19]:
num_records=50

In [20]:
records = [Record(i, f'str: {i}') for i in range(num_records)] 

In [21]:
df = spark.createDataFrame(records, ['id', 'data'])

In [22]:
df_ray = df.select('id', 'data', log_record_udf('id').alias('logged'))

In [23]:
display(df_ray)

DataFrame[id: string, data: bigint, logged: map<string,int>]
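
Note that the schema reports `id` as a string and `data` as a bigint, the reverse of how `Record` was constructed. A likely explanation, stated here as an assumption about Spark's schema inference and not confirmed by this notebook, is that attributes of plain Python objects are read in alphabetical order (`data`, then `record_id`) and the supplied column names are applied by position. The sketch below reproduces that effect in plain Python.

```python
class Record:
    def __init__(self, record_id, data):
        self.record_id = record_id
        self.data = data


record = Record(0, 'str: 0')

# Assumed inference step: attributes sorted by name, so 'data' precedes 'record_id'.
inferred = sorted(vars(record).items())

# The column names are then applied positionally.
names = ['id', 'data']
row = {name: value for name, (_, value) in zip(names, inferred)}
print(row)  # {'id': 'str: 0', 'data': 0}
```

If that assumption holds, building the DataFrame from explicit tuples such as `(r.record_id, r.data)` would keep the columns aligned with their names.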

In [24]:
%time df_ray.show(n=num_records, truncate=False)

+-------+----+-------------------------------+
|id     |data|logged                         |
+-------+----+-------------------------------+
|str: 0 |0   |[count -> 1, initialized -> 1] |
|str: 1 |1   |[count -> 2, initialized -> 0] |
|str: 2 |2   |[count -> 3, initialized -> 0] |
|str: 3 |3   |[count -> 4, initialized -> 0] |
|str: 4 |4   |[count -> 5, initialized -> 0] |
|str: 5 |5   |[count -> 6, initialized -> 0] |
|str: 6 |6   |[count -> 14, initialized -> 1]|
|str: 7 |7   |[count -> 17, initialized -> 0]|
|str: 8 |8   |[count -> 20, initialized -> 0]|
|str: 9 |9   |[count -> 23, initialized -> 0]|
|str: 10|10  |[count -> 26, initialized -> 0]|
|str: 11|11  |[count -> 29, initialized -> 0]|
|str: 12|12  |[count -> 13, initialized -> 1]|
|str: 13|13  |[count -> 16, initialized -> 0]|
|str: 14|14  |[count -> 19, initialized -> 0]|
|str: 15|15  |[count -> 22, initialized -> 0]|
|str: 16|16  |[count -> 25, initialized -> 0]|
|str: 17|17  |[count -> 28, initialized -> 0]|
|str: 18|18  

In [25]:
gov_status()

count:   50
ids:     ['str: 0', 'str: 1', 'str: 2', 'str: 3', 'str: 4', 'str: 5', 'str: 24', 'str: 25', 'str: 26', 'str: 27', 'str: 28', 'str: 29', 'str: 12', 'str: 6', 'str: 18', 'str: 13', 'str: 7', 'str: 19', 'str: 14', 'str: 8', 'str: 20', 'str: 15', 'str: 9', 'str: 21', 'str: 16', 'str: 10', 'str: 22', 'str: 17', 'str: 11', 'str: 23', 'str: 30', 'str: 36', 'str: 42', 'str: 31', 'str: 37', 'str: 43', 'str: 32', 'str: 38', 'str: 44', 'str: 33', 'str: 39', 'str: 45', 'str: 34', 'str: 40', 'str: 46', 'str: 35', 'str: 41', 'str: 47', 'str: 48', 'str: 49']
up time: 57.627371072769165
(pid=39848) 2020-05-16 12:10:34,686 INFO master.py:122 -- Starting router with name 'SERVE_ROUTER_ACTOR'
(pid=39848) 2020-05-16 12:10:34,689 INFO master.py:143 -- Starting HTTP proxy with name 'SERVE_PROXY_ACTOR'
(pid=39848) 2020-05-16 12:10:34,693 INFO master.py:168 -- Starting metric monitor with name 'SERVE_METRIC_MONITOR_ACTOR'
(pid=39848) 2020-05-16 1