# Call a Ray-based Web Service from a PySpark UDF

For completeness, this notebook demonstrates a more conventional alternative: the PySpark UDF makes an explicit remote call to a separate service. This approach is conceptually simpler and has advantages in production, where it may be easier to manage the two processes separately. See also the discussion in [Spark-RayUDF.ipynb](../Spark-RayUDF.ipynb), the notebook used in my Spark + AI Summit 2020 talk.

> **Note:** Run all the cells in the [DataGovernanceServer.ipynb](DataGovernanceServer.ipynb) notebook **before** running this notebook.

To learn more about Ray:
* [Ray.io](http://ray.io)
* [Ray Serve](https://docs.ray.io/en/master/rayserve/overview.html)

[Dean Wampler](mailto:dean@anyscale.com)

> **Note:** Requires Java 8!

In [1]:
!java -version

java version "1.8.0_221"
Java(TM) SE Runtime Environment (build 1.8.0_221-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.221-b11, mixed mode)


In [2]:
import json, requests
import pyspark
import ray
from ray.util import named_actors

In [3]:
from pyspark.sql.types import DataType, BooleanType, NullType, IntegerType, StringType, MapType

In [4]:
from pyspark.sql.functions import udf

Define a simple `Record` type with a `record_id` field, used for logging to `DataGovernanceSystem`, and an opaque `data` field with everything else.

In [5]:
class Record:
    def __init__(self, record_id, data):
        self.record_id = record_id
        self.data = data
    def __str__(self):
        return f'Record(record_id={self.record_id},data={self.data})'

In [9]:
port = 8100
address = f'http://localhost:{port}'
timeout = 2.0

In [10]:
test_records = [Record(i, f'data: {i}') for i in range(3)] 
for record in test_records:
    print(record)
    response = requests.put(f'{address}/log?id={record.record_id}', timeout=timeout)
    print(f'log response = {response.json()}')

Record(record_id=0,data=data: 0)
log response = {'message': 'sent async log request for 0'}
Record(record_id=1,data=data: 1)
log response = {'message': 'sent async log request for 1'}
Record(record_id=2,data=data: 2)
log response = {'message': 'sent async log request for 2'}


In [11]:
def gov_status():
    count = requests.get(f'{address}/count', timeout=timeout)
    print(f'count:    {count.json()}')
    ids = requests.get(f'{address}/ids', timeout=timeout)
    print(f'ids:      {ids.json()}')
    up_time = requests.get(f'{address}/up_time', timeout=timeout)
    print(f'up time:  {up_time.json()}')

In [12]:
gov_status()

count:    {'count': 3}
ids:      {'ids': ['0', '1', '2']}
up time:  {'up_time': 620.9878726005554}


Reset the server:

In [16]:
requests.put(f'{address}/reset', timeout=timeout)

<Response [200]>

In [14]:
gov_status()

count:    {'count': 0}
ids:      {'ids': []}
up time:  {'up_time': 672.7385516166687}


In [28]:
def log_record(id):
    """
    This function will become a UDF for Spark. Compare with ``log_record`` in ../Spark-RayUDF.ipynb.
    """
    response = requests.put(f'{address}/log?id={id}', timeout=timeout)
    return {'response': response.ok}  # A different return value compared to the other log_record.
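
Because `log_record` runs inside Spark tasks, an unreachable server or a timeout would raise an exception and fail the task. The following is a minimal sketch of a more defensive variant, not part of the original notebook. The name `log_record_safe` is hypothetical, and it uses the standard library's `urllib` instead of `requests` so that it has no third-party dependency. With `requests`, the equivalent would be to catch `requests.exceptions.RequestException`.

```python
import urllib.error
import urllib.parse
import urllib.request

def log_record_safe(record_id, address='http://localhost:8100', timeout=2.0):
    """Hypothetical variant of log_record that reports failures instead of raising."""
    # URL-encode the id, since values such as 'str: 0' contain spaces and colons.
    query = urllib.parse.urlencode({'id': record_id})
    # PUT with no body, mirroring requests.put(f'{address}/log?id={id}').
    request = urllib.request.Request(f'{address}/log?{query}', method='PUT')
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return {'response': 200 <= response.status < 300}
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return {'response': False}
```

The return value keeps the same `{'response': bool}` shape, so the `MapType(StringType(), BooleanType())` return type used for the UDF would still apply.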

In [20]:
spark = pyspark.sql.SparkSession.builder \
    .master("local[*]") \
    .appName("Data Governance Example with Ray Serve") \
    .getOrCreate()

In [29]:
log_record_udf = udf(log_record, MapType(StringType(), BooleanType()))

In [30]:
num_records = 50

In [31]:
records = [Record(i, f'str: {i}') for i in range(num_records)] 

In [32]:
df = spark.createDataFrame(records, ['id', 'data'])

In [33]:
df_ray = df.select('id', 'data', log_record_udf('id').alias('logged'))

In [34]:
display(df_ray)

DataFrame[id: string, data: bigint, logged: map<string,boolean>]
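
The schema printed above shows `id: string` and `data: bigint`, the reverse of the `Record` fields. A likely explanation, stated here as an assumption about how PySpark infers a schema from plain Python objects, is that it reads the instance attributes sorted by name (`data`, then `record_id`) and then applies the supplied column names positionally. The sketch below mimics that ordering in plain Python, without Spark, and shows that building explicit tuples removes the ambiguity.

```python
class Record:
    def __init__(self, record_id, data):
        self.record_id = record_id
        self.data = data

record = Record(0, 'str: 0')

# Assumption: attributes are taken in sorted-name order, so 'data' precedes 'record_id'.
inferred_values = [value for _, value in sorted(vars(record).items())]
mislabeled = dict(zip(['id', 'data'], inferred_values))
print(mislabeled)  # {'id': 'str: 0', 'data': 0}

# Explicit tuples fix the order before the column names are applied.
explicit = dict(zip(['id', 'data'], (record.record_id, record.data)))
print(explicit)  # {'id': 0, 'data': 'str: 0'}
```

With Spark, the corresponding change would be `spark.createDataFrame([(r.record_id, r.data) for r in records], ['id', 'data'])`. The outputs shown in this notebook were produced without that change, which is why the `id` column holds the `str: N` strings.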

In [35]:
%time df_ray.show(n=num_records, truncate=False)

+-------+----+------------------+
|id     |data|logged            |
+-------+----+------------------+
|str: 0 |0   |[response -> true]|
|str: 1 |1   |[response -> true]|
|str: 2 |2   |[response -> true]|
|str: 3 |3   |[response -> true]|
|str: 4 |4   |[response -> true]|
|str: 5 |5   |[response -> true]|
|str: 6 |6   |[response -> true]|
|str: 7 |7   |[response -> true]|
|str: 8 |8   |[response -> true]|
|str: 9 |9   |[response -> true]|
|str: 10|10  |[response -> true]|
|str: 11|11  |[response -> true]|
|str: 12|12  |[response -> true]|
|str: 13|13  |[response -> true]|
|str: 14|14  |[response -> true]|
|str: 15|15  |[response -> true]|
|str: 16|16  |[response -> true]|
|str: 17|17  |[response -> true]|
|str: 18|18  |[response -> true]|
|str: 19|19  |[response -> true]|
|str: 20|20  |[response -> true]|
|str: 21|21  |[response -> true]|
|str: 22|22  |[response -> true]|
|str: 23|23  |[response -> true]|
|str: 24|24  |[response -> true]|
|str: 25|25  |[response -> true]|
|str: 26|26  |

In [36]:
gov_status()

count:    {'count': 51}
ids:      {'ids': ['str: 0', 'str: 0', 'str: 1', 'str: 2', 'str: 3', 'str: 4', 'str: 5', 'str: 18', 'str: 19', 'str: 20', 'str: 21', 'str: 22', 'str: 23', 'str: 24', 'str: 6', 'str: 12', 'str: 25', 'str: 7', 'str: 13', 'str: 26', 'str: 8', 'str: 14', 'str: 27', 'str: 9', 'str: 15', 'str: 28', 'str: 10', 'str: 16', 'str: 29', 'str: 11', 'str: 17', 'str: 30', 'str: 36', 'str: 42', 'str: 31', 'str: 37', 'str: 43', 'str: 32', 'str: 38', 'str: 44', 'str: 33', 'str: 39', 'str: 45', 'str: 34', 'str: 40', 'str: 46', 'str: 35', 'str: 41', 'str: 47', 'str: 48', 'str: 49']}
up time:  {'up_time': 1017.429967880249}
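
The server reports 51 logged ids for 50 records, and `'str: 0'` appears twice. One plausible cause, offered as an assumption, not a diagnosis, is that Spark invoked the UDF more than once for that row. Spark does not guarantee that a UDF is called exactly once per row, so a UDF with side effects can be repeated. A small helper such as the hypothetical `find_duplicates` below makes such repeats easy to spot in the list returned by `/ids`.

```python
from collections import Counter

def find_duplicates(ids):
    """Return a dict of each id seen more than once, mapped to its count."""
    counts = Counter(ids)
    return {record_id: n for record_id, n in counts.items() if n > 1}

# A shortened sample taken from the start of the ids list shown above.
sample_ids = ['str: 0', 'str: 0', 'str: 1', 'str: 2', 'str: 3']
print(find_duplicates(sample_ids))  # {'str: 0': 2}
```

If duplicate log entries matter, one option is to make the server's `/log` endpoint idempotent, for example by storing ids in a set.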
