# Spark ML serving in Azure Machine Learning (by Spark installation)

In this notebook, I install a single instance of Apache Spark in a container and serve Spark ML inference by running pyspark on an Azure Machine Learning online endpoint.

To run this notebook:

1. Create a new "Machine Learning" resource in the [Azure Portal](https://portal.azure.com/).
2. Install the Azure Machine Learning CLI v2 on Ubuntu as follows:

```
# install Azure CLI
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
# install AML CLI extension
az extension add --name ml
```

> Note : See [here](https://tsmatz.wordpress.com/2019/03/04/spark-ml-pipeline-serving-inference-by-azure-machine-learning-service/) for other deployment options for Spark ML models.

## Connect to Azure Machine Learning workspace

Log in to Azure and prepare to connect to the Azure Machine Learning (AML) workspace.<br>
Fill in your subscription ID, AML workspace name, and resource group name below.

In [None]:
!az login

In [None]:
!az account set -s {AZURE_SUBSCRIPTION_ID}

In [None]:
my_resource_group = "{AML_RESOURCE_GROUP_NAME}"
my_workspace = "{AML_WORKSPACE_NAME}"

## Train model and register into AML

In this example, we use a Spark ML pipeline model, ```flight_model```, which was trained in [this Databricks exercise](https://tsmatz.github.io/azure-databricks-exercise/exercise03-sparkml-pipeline.html).<br>
The pipeline takes flight and weather information as input (such as airline carrier, departure/arrival time, departure/arrival wind speed, and departure/arrival visibility) and predicts, as 0 or 1, whether the flight's arrival will be delayed by more than 15 minutes.
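
As a rough illustration of the input this pipeline expects, the sketch below lists the column names that appear in the sample request near the end of this notebook and checks one record against them. The helper `missing_fields` is hypothetical and not part of the original exercise.

```python
# Field names copied from the sample request used later in this notebook.
EXPECTED_FIELDS = [
    "MONTH", "DAY_OF_WEEK", "UNIQUE_CARRIER", "ORIGIN", "DEST",
    "CRS_DEP_TIME", "CRS_ARR_TIME",
    "RelativeHumidityOrigin", "AltimeterOrigin", "DryBulbCelsiusOrigin",
    "WindSpeedOrigin", "VisibilityOrigin", "DewPointCelsiusOrigin",
    "RelativeHumidityDest", "AltimeterDest", "DryBulbCelsiusDest",
    "WindSpeedDest", "VisibilityDest", "DewPointCelsiusDest",
]

def missing_fields(record):
    """Return the expected field names that are absent from one input record."""
    return [name for name in EXPECTED_FIELDS if name not in record]

# Build a dummy record and drop one field to see it reported
sample = {name: 0 for name in EXPECTED_FIELDS}
sample.pop("WindSpeedDest")
print(missing_fields(sample))
```

A check like this can be run on the client before calling the endpoint, so that a missing column is reported before any request is sent.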

Unpack ```flight_model.zip``` in this folder.

In [None]:
!sudo apt-get install -y unzip

In [2]:
!unzip flight_model.zip

Archive:  flight_model.zip
   creating: flight_model/
   creating: flight_model/metadata/
  inflating: flight_model/metadata/part-00000  
 extracting: flight_model/metadata/_SUCCESS  
   creating: flight_model/stages/
   creating: flight_model/stages/5_DecisionTreeClassifier_a1f5c7b0f6f3/
   creating: flight_model/stages/5_DecisionTreeClassifier_a1f5c7b0f6f3/metadata/
  inflating: flight_model/stages/5_DecisionTreeClassifier_a1f5c7b0f6f3/metadata/part-00000  
 extracting: flight_model/stages/5_DecisionTreeClassifier_a1f5c7b0f6f3/metadata/_SUCCESS  
   creating: flight_model/stages/5_DecisionTreeClassifier_a1f5c7b0f6f3/data/
  inflating: flight_model/stages/5_DecisionTreeClassifier_a1f5c7b0f6f3/data/_committed_3287493269964804398  
  inflating: flight_model/stages/5_DecisionTreeClassifier_a1f5c7b0f6f3/data/part-00000-tid-3287493269964804398-cb7b99f7-874c-429d-a3b2-4d2bd2a6d998-374-1-c000.snappy.parquet  
 extracting: flight_model/stages/5_DecisionTreeClassifier_a1f5c7b0f6f3
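
Before registering, it may help to confirm that the archive was unpacked where the next command expects it. The following is a minimal sketch that only checks for the `metadata` and `stages` subfolders visible in the output above; `looks_like_pipeline_model` is a hypothetical helper, and the check does not validate the model itself.

```python
import os
import tempfile

def looks_like_pipeline_model(model_dir):
    """Check for the 'metadata' and 'stages' subfolders of a saved pipeline."""
    return all(
        os.path.isdir(os.path.join(model_dir, sub))
        for sub in ("metadata", "stages")
    )

# Demonstration on a throwaway folder that mimics the unpacked layout
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "flight_model")
    os.makedirs(os.path.join(demo, "metadata"))
    incomplete = looks_like_pipeline_model(demo)   # 'stages' is still missing
    os.makedirs(os.path.join(demo, "stages"))
    complete = looks_like_pipeline_model(demo)
print(incomplete, complete)
```

In the notebook, the same function would be called as `looks_like_pipeline_model("./flight_model")`.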

Now we register this trained Spark MLlib model into Azure Machine Learning.

In [3]:
!az ml model create --name fight_delay_model \
  --version 1 \
  --path ./flight_model \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace

Uploading flight_model (0.64 MBs): 100%|█| 641715/641715 [00:00<00:00, 3226118.9

{
  "creation_context": {
    "created_at": "2022-08-31T02:14:09.691301+00:00",
    "created_by": "Tsuyoshi Matsuzaki",
    "created_by_type": "User",
    "last_modified_at": "2022-08-31T02:14:09.691301+00:00",
    "last_modified_by": "Tsuyoshi Matsuzaki",
    "last_modified_by_type": "User"
  },
  "id": "azureml:/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourceGroups/AML-rg/providers/Microsoft.MachineLearningServices/workspaces/ws01/models/fight_delay_model/versions/1",
  "name": "fight_delay_model",
  "path": "azureml://subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourceGroups/AML-rg/workspaces/ws01/datastores/workspaceblobstore/paths/LocalUpload/263238d665167a65d872a835a7289865/flight_model",
  "properties": {},
  "resourceGroup": "AML-rg",
  "tags": {},
  "type": "custom_model",
  "version": "1"
}

## Create AML environment for Spark ML serving

Next, create an AML environment for Spark ML serving.

In this example, the image is built from the AML minimal inferencing image (```mcr.microsoft.com/azureml/minimal-ubuntu20.04-py38-cpu-inference:latest```).<br>
Apache Spark and pyspark are then installed and configured.

The ```azureml-defaults``` package listed in ```requirements.txt``` below is needed for Azure ML inferencing with an entry script.

In [4]:
import os
context_folder = './docker-context-sparkml'
os.makedirs(context_folder, exist_ok=True)

In [5]:
%%writefile docker-context-sparkml/Dockerfile
FROM mcr.microsoft.com/azureml/minimal-ubuntu20.04-py38-cpu-inference:latest

USER root:root

# Install Java
RUN mkdir -p /usr/share/man/man1
RUN apt-get update -y && \
    apt-get install -y openjdk-8-jdk
ENV JAVA_HOME='/usr/lib/jvm/java-8-openjdk-amd64'

# Install Apache Spark
RUN wget -q https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz && \
   tar xzf spark-3.1.2-bin-hadoop3.2.tgz -C /opt && \
   mv /opt/spark-3.1.2-bin-hadoop3.2 /opt/spark && \
   rm spark-3.1.2-bin-hadoop3.2.tgz
ENV SPARK_HOME=/opt/spark
ENV PYSPARK_PYTHON=python
ENV PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python

# Install additional packages
WORKDIR /
COPY requirements.txt .
RUN pip install -r requirements.txt && rm requirements.txt

Writing docker-context-sparkml/Dockerfile
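
The Dockerfile above sets `JAVA_HOME`, `SPARK_HOME`, and `PYSPARK_PYTHON`. As a sketch only, a small check like the one below could be run inside the container (for example from a debugging shell) to confirm these variables are visible to Python. `missing_vars` is a hypothetical helper, not something Azure ML provides.

```python
import os

# Variables set by the ENV instructions in the Dockerfile above
REQUIRED_VARS = ("JAVA_HOME", "SPARK_HOME", "PYSPARK_PYTHON")

def missing_vars(environ=os.environ):
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]

# Demonstration with a dictionary instead of the real environment
demo_env = {
    "JAVA_HOME": "/usr/lib/jvm/java-8-openjdk-amd64",
    "SPARK_HOME": "/opt/spark",
}
print(missing_vars(demo_env))
```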


In [6]:
%%writefile docker-context-sparkml/requirements.txt
azureml-defaults
numpy
pyspark

Writing docker-context-sparkml/requirements.txt


Register the new image as an AML environment.

In [7]:
%%writefile env_sparkml_serving.yml
$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: sparkml-serving-env
build:
  path: docker-context-sparkml
description: environment for SparkML serving

Writing env_sparkml_serving.yml


In [8]:
!az ml environment create --file env_sparkml_serving.yml \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace

Uploading docker-context-sparkml (0.0 MBs): 100%|█| 773/773 [00:00<00:00, 22976.

{
  "build": {
    "dockerfile_path": "Dockerfile",
    "path": "https://ws019192117290.blob.core.windows.net/azureml-blobstore-bdaba7c0-9940-43e5-a272-86b92d91b7de/LocalUpload/fbd5c3deea08489d938b69b1ce6e49b9/docker-context-sparkml/"
  },
  "creation_context": {
    "created_at": "2022-08-31T02:14:40.661368+00:00",
    "created_by": "Tsuyoshi Matsuzaki",
    "created_by_type": "User",
    "last_modified_at": "2022-08-31T02:14:40.661368+00:00",
    "last_modified_by": "Tsuyoshi Matsuzaki",
    "last_modified_by_type": "User"
  },
  "description": "environment for SparkML serving",
  "id": "azureml:/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourceGroups/AML-rg/providers/Microsoft.MachineLearningServices/workspaces/ws01/environments/sparkml-serving-env/versions/1",
  "name": "sparkml-serving-env",
  "os_type": "linux",
  "resourceGroup": "AML-rg",
  "tags": {},
  "version": "1"
}


## Create Inference Script

Create the following inferencing code, which will be deployed with the inference image.<br>
The script is saved as ```./script/inference.py```.

> Note : To use additional packages, specify ```config()``` when creating the session builder.<br>
> For instance, when you use Azure storage, you can add the package configuration as follows. (See [here](https://tsmatz.wordpress.com/2020/12/08/apache-spark-on-azure-kubernetes-service-aks/) for details.)
> ```
> spark = (SparkSession.builder
>     .appName("flight_delay_serving")
>     # spark.jars.packages takes one comma-separated list;
>     # setting the same key twice would keep only the last value
>     .config("spark.jars.packages",
>         "org.apache.hadoop:hadoop-azure:3.2.0,"
>         "com.microsoft.azure:azure-storage:8.6.3")
>     .getOrCreate())
> ```
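
Spark reads `spark.jars.packages` as a single comma-separated list of Maven coordinates, and calling `config()` twice with the same key keeps only the last value. The sketch below builds that value from separate coordinates; `jars_packages` is a hypothetical helper, and the two coordinates are the ones from the note above.

```python
def jars_packages(*coordinates):
    """Join Maven coordinates (group:artifact:version) into one comma-separated value."""
    for coord in coordinates:
        if len(coord.split(":")) != 3:
            raise ValueError(f"expected group:artifact:version, got {coord!r}")
    return ",".join(coordinates)

packages = jars_packages(
    "org.apache.hadoop:hadoop-azure:3.2.0",
    "com.microsoft.azure:azure-storage:8.6.3",
)
print(packages)
```

The resulting string would then be passed once, as `.config("spark.jars.packages", packages)`.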

In [9]:
import os
script_folder = "./script"
os.makedirs(script_folder, exist_ok=True)

In [10]:
%%writefile script/inference.py
import os
import json
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel

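def init():
    # Called once when the container starts: create the Spark session and load the model
    global spark
    global loaded_model
    spark = SparkSession.builder.appName("flight_delay_serving").getOrCreate()
    # AZUREML_MODEL_DIR is the folder that holds the registered model inside the container
    model_path = os.path.join(
        os.getenv("AZUREML_MODEL_DIR"),
        "flight_model"
    )
    loaded_model = PipelineModel.load(model_path)

def run(raw_data):
    # Called for each request: raw_data is the JSON request body as a string
    try:
        input_list = json.loads(raw_data)["data"]
        # Build a Spark DataFrame from the list of row dictionaries
        sc = spark.sparkContext
        input_rdd = sc.parallelize(input_list)
        input_df = input_rdd.toDF()
        # Run the whole pipeline and collect the predictions on the driver
        pred_df = loaded_model.transform(input_df)
        pred_list = pred_df.collect()
        pred_array = [int(x["prediction"]) for x in pred_list]
        return pred_array
    except Exception as e:
        # Errors are returned in the response body as a string
        result = str(e)
        return "Internal Exception : " + result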

Writing script/inference.py
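
The `run()` function in the entry script reads the rows from a top-level `"data"` array in the request body. The sketch below isolates that parsing step so it can be tried locally without Spark; `parse_rows` is a hypothetical helper, and the extra type check is an addition that the entry script itself does not perform.

```python
import json

def parse_rows(raw_data):
    """Return the list of row dictionaries under the top-level "data" key."""
    body = json.loads(raw_data)
    rows = body["data"]
    if not isinstance(rows, list) or not all(isinstance(r, dict) for r in rows):
        raise ValueError('"data" must be a list of JSON objects')
    return rows

# Shortened request body with only two of the columns, for illustration
rows = parse_rows('{"data": [{"MONTH": 1, "ORIGIN": "ABQ"}, {"MONTH": 1, "ORIGIN": "BNA"}]}')
print(len(rows), rows[0]["ORIGIN"])
```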


## Create a managed endpoint

Now create a managed endpoint. **Fill in the endpoint name below**; it must be unique because it becomes part of the DNS name in the scoring URL.

Afterwards, we will deploy the inferencing container image to this endpoint.
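
One way to get a name that is unlikely to collide is to append a random suffix. The sketch below does that and checks the result against a naming rule that is an assumption here (starts with a letter, then letters, digits, or hyphens, 3 to 32 characters in total); please confirm the exact rule in the Azure documentation. `make_endpoint_name` and the `sparkml` prefix are illustrative only.

```python
import re
import uuid

# Assumed naming rule, not taken from this notebook:
# a letter first, then letters, digits, or hyphens, 3 to 32 characters in total.
NAME_RULE = re.compile(r"^[A-Za-z][A-Za-z0-9-]{2,31}$")

def make_endpoint_name(prefix="sparkml"):
    """Build '<prefix>-<8 random hex characters>' and validate it."""
    suffix = uuid.uuid4().hex[:8]
    name = f"{prefix}-{suffix}"
    if not NAME_RULE.match(name):
        raise ValueError(f"invalid endpoint name: {name!r}")
    return name

candidate_name = make_endpoint_name()
print(candidate_name)
```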

In [11]:
%%writefile managed_endpoint.yml
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: {FILL_UNIQUE_ENDPOINT_NAME}
auth_mode: key

Writing managed_endpoint.yml


In [12]:
!az ml online-endpoint create --file managed_endpoint.yml \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace

{
  "auth_mode": "key",
  "id": "/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourceGroups/AML-rg/providers/Microsoft.MachineLearningServices/workspaces/ws01/onlineEndpoints/sparkml-test01",
  "identity": {
    "principal_id": "4b5b1518-1046-4e06-aa12-bf738c4b21dd",
    "tenant_id": "72f988bf-86f1-41af-91ab-2d7cd011db47",
    "type": "system_assigned"
  },
  "kind": "Managed",
  "location": "eastus",
  "mirror_traffic": {},
  "name": "sparkml-test01",
  "properties": {
    "AzureAsyncOperationUri": "https://management.azure.com/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/providers/Microsoft.MachineLearningServices/locations/eastus/mfeOperationsStatus/oe:bdaba7c0-9940-43e5-a272-86b92d91b7de:b235a268-79d1-46d7-827e-06e1da806fd9?api-version=2022-02-01-preview",
    "azureml.onlineendpointid": "/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourcegroups/aml-rg/providers/microsoft.machinelearningservices/workspaces/ws01/onlineendpoints/sparkml-test01"
  },
  "provisio

## Deploy inferencing for Spark ML serving

Now we deploy the inferencing image to this endpoint.

This deployment uses the custom environment created above, in which Apache Spark is installed and configured.

In [13]:
%%writefile managed_deployment.yml
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: sparkml-deployment-v1
endpoint_name: {FILL_UNIQUE_ENDPOINT_NAME}
model: azureml:fight_delay_model@latest
code_configuration:
  code: ./script
  scoring_script: inference.py
environment: azureml:sparkml-serving-env@latest
instance_type: Standard_DS3_v2
instance_count: 1

Writing managed_deployment.yml
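
Both `managed_endpoint.yml` and `managed_deployment.yml` carry the same `{FILL_UNIQUE_ENDPOINT_NAME}` placeholder. As a sketch, the placeholder could be filled from one Python variable so the two files cannot drift apart; `fill_endpoint_name` is a hypothetical helper, and the template below is a shortened copy of the deployment YAML.

```python
PLACEHOLDER = "{FILL_UNIQUE_ENDPOINT_NAME}"

def fill_endpoint_name(yaml_text, endpoint_name):
    """Replace the placeholder with the endpoint name; fail if it is absent."""
    if PLACEHOLDER not in yaml_text:
        raise ValueError("placeholder not found in YAML text")
    return yaml_text.replace(PLACEHOLDER, endpoint_name)

# Shortened template for illustration
template = (
    "name: sparkml-deployment-v1\n"
    "endpoint_name: {FILL_UNIQUE_ENDPOINT_NAME}\n"
)
filled = fill_endpoint_name(template, "sparkml-test01")
print(filled)
```

To apply this to the real files, read each YAML file, pass its text through the function, and write the result back before running the `az ml` commands.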


In [14]:
!az ml online-deployment create --file managed_deployment.yml \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace \
  --all-traffic

All traffic will be set to deployment sparkml-deployment-v1 once it has been provisioned.
If you interrupt this command or it times out while waiting for the provisioning, you can try to set all the traffic to this deployment later once its has been provisioned.
Check: endpoint sparkml-test01 exists
Uploading script (0.0 MBs): 100%|██████████| 851/851 [00:00<00:00, 82012.61it/s]

Creating/updating online deployment sparkml-deployment-v1 ..........................................................................................Done (7m 56s)
{
  "app_insights_enabled": false,
  "code_configuration": {
    "code": "/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourceGroups/AML-rg/providers/Microsoft.MachineLearningServices/workspaces/ws01/codes/ff3c3f96-fe3c-43be-9ea4-be9713038ae0/versions/1",
    "scoring_script": "inference.py"
  },
  "endpoint_name": "sparkml-test01",
  "environment": "azureml:/subscriptions/b3ae1c15-4fef-4362-8c3a-5d804cdeb18d/resourceGroups/AML-

## Get endpoint URL and key

After the deployment has completed, get the endpoint URL with the following command. (**Fill in the endpoint name below.**)

In [15]:
endpoint_name = "{FILL_UNIQUE_ENDPOINT_NAME}"

In [16]:
!az ml online-endpoint show \
  --name $endpoint_name \
  --query scoring_uri \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace

"https://sparkml-test01.eastus.inference.ml.azure.com/score"

Get the authorization key for this endpoint with the following command.

In [17]:
!az ml online-endpoint get-credentials \
  --name $endpoint_name \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace

{
  "primaryKey": "dcSSJx9Oc6NDlkQCN3BV3sWDQQtOKg3n",
  "secondaryKey": "BxyoTwj0wA1OmbOmDWXogXbgrwL9Z1rV"
}
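
If the command output is captured as text, the key can be read from the JSON instead of being copied by hand. The sketch below is one possible approach: `parse_credentials` is a hypothetical helper, and the sample text uses made-up placeholder keys. It slices from the first `{` to the last `}` so that stray terminal control codes around the JSON do not break parsing.

```python
import json

def parse_credentials(cli_output):
    """Extract the JSON object with primaryKey/secondaryKey from CLI output text."""
    start = cli_output.index("{")
    end = cli_output.rindex("}") + 1
    return json.loads(cli_output[start:end])

# Made-up output with placeholder keys (not real credentials)
sample_output = (
    '{\n'
    '  "primaryKey": "PRIMARY-PLACEHOLDER",\n'
    '  "secondaryKey": "SECONDARY-PLACEHOLDER"\n'
    '}\n'
)
keys = parse_credentials(sample_output)
authorization_key = keys["primaryKey"]
print(sorted(keys))
```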

## Test web service

Now predict flight delays by calling this endpoint (online inferencing).

In the following example, the predicted results for both rows are ```0``` (which means "not delayed").

In [18]:
endpoint_url = "{FILL_ENDPOINT_URL}"
# Example : endpoint_url = "https://sparkml-test01.eastus.inference.ml.azure.com/score"
authorization_key = "{FILL_AUTHORIZATION_KEY}"
# Example : authorization_key = "dcSSJx9Oc6NDlkQCN3BV3sWDQQtOKg3n"

In [19]:
import requests

headers = {
    "Content-Type":"application/json",
    "Authorization":("Bearer " + authorization_key),
}
input_data = """
{
  "data": [
    {
      "MONTH": 1,
      "DAY_OF_WEEK": 1,
      "UNIQUE_CARRIER": "AA",
      "ORIGIN": "ABQ",
      "DEST": "DFW",
      "CRS_DEP_TIME": 9,
      "CRS_ARR_TIME": 12,
      "RelativeHumidityOrigin": 23.0,
      "AltimeterOrigin": 30.55,
      "DryBulbCelsiusOrigin": 9.4,
      "WindSpeedOrigin": 3.0,
      "VisibilityOrigin": 10.0,
      "DewPointCelsiusOrigin": -10.6,
      "RelativeHumidityDest": 35.0,
      "AltimeterDest": 30.6,
      "DryBulbCelsiusDest": 7.2,
      "WindSpeedDest": 7.0,
      "VisibilityDest": 10.0,
      "DewPointCelsiusDest": -7.2
    },
    {
      "MONTH": 1,
      "DAY_OF_WEEK": 1,
      "UNIQUE_CARRIER": "AA",
      "ORIGIN": "BNA",
      "DEST": "DFW",
      "CRS_DEP_TIME": 12,
      "CRS_ARR_TIME": 15,
      "RelativeHumidityOrigin": 78.5,
      "AltimeterOrigin": 30.05,
      "DryBulbCelsiusOrigin": 10.8,
      "WindSpeedOrigin": 1.5,
      "VisibilityOrigin": 8.0,
      "DewPointCelsiusOrigin": 7.1,
      "RelativeHumidityDest": 86.0,
      "AltimeterDest": 29.86,
      "DryBulbCelsiusDest": 9.4,
      "WindSpeedDest": 18.0,
      "VisibilityDest": 6.0,
      "DewPointCelsiusDest": 7.2
    }
  ]
}
"""
http_res = requests.post(
    endpoint_url,
    data=input_data,
    headers=headers)
print("Predicted : ", http_res.text)

Predicted :  [0, 0]
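
The response body is JSON: a list of 0/1 predictions on success, or a string starting with `Internal Exception : ` when the entry script catches an error. The sketch below turns such a body into labels; `interpret` and the label text for `1` are assumptions made for illustration.

```python
import json

# 0 means "not delayed"; 1 is taken to mean an arrival delay over 15 minutes
LABELS = {0: "not delayed", 1: "delayed"}

def interpret(response_text):
    """Map a JSON list of 0/1 predictions to labels; raise if the script returned an error string."""
    result = json.loads(response_text)
    if isinstance(result, str):
        # The entry script returns its error message as a plain string
        raise RuntimeError(result)
    return [LABELS[int(p)] for p in result]

print(interpret("[0, 0]"))  # prints ['not delayed', 'not delayed']
```

With the request above, this would be called as `interpret(http_res.text)`.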


## Clean up

Delete the endpoint, together with its deployment.

In [None]:
!az ml online-endpoint delete \
  --name $endpoint_name \
  --resource-group $my_resource_group \
  --workspace-name $my_workspace \
  --yes