# whylogs + PySpark

First, you'll need to point to an existing Spark installation. Make sure that the Spark version matches the pyspark version in the environment.

In [1]:
import os

In [2]:
%env SPARK_HOME={os.getcwd()}/spark-3.1.1-bin-hadoop2.7

env: SPARK_HOME=/Volumes/Workspace/notebooks/spark-3.1.1-bin-hadoop2.7


## whylogs Spark jar

You'll need to load a fat jar for whylogs. This jar contains all the required dependencies (some are shaded) for running whylogs in Spark.

Spark will then inject the `whyspark` module into the path.

You can also build this jar yourself from the `spark-bundle` module at `https://github.com/whylabs/whylogs-java/tree/mainline/spark-bundle`.

In [3]:
whylogs_jar = "https://repo.maven.apache.org/maven2/ai/whylabs/whylogs-java-spark_3.1.1-scala_2.12/0.1.7-b2/whylogs-java-spark_3.1.1-scala_2.12-0.1.7-b2.jar"

In [4]:
import pyspark
pyspark.__version__

'3.1.1'
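
As a quick sanity check on the version requirement, the sketch below pulls the Spark version out of a `SPARK_HOME` path or a jar file name and compares it with a PySpark version string. The helper names `extract_spark_version` and `versions_match` are hypothetical, and the regular expression assumes the name embeds the version as `spark-X.Y.Z` or `spark_X.Y.Z`, as the names in this notebook do.

```python
import re
from typing import Optional

# Assumption: the directory or jar name embeds "spark-X.Y.Z" or "spark_X.Y.Z".
_SPARK_VERSION = re.compile(r"spark[-_](\d+\.\d+\.\d+)")


def extract_spark_version(name: str) -> Optional[str]:
    """Return the first X.Y.Z Spark version found in `name`, or None."""
    match = _SPARK_VERSION.search(name)
    return match.group(1) if match else None


def versions_match(name: str, pyspark_version: str) -> bool:
    """True when the version embedded in `name` equals `pyspark_version`."""
    return extract_spark_version(name) == pyspark_version


# Values copied from this notebook.
spark_home = "/Volumes/Workspace/notebooks/spark-3.1.1-bin-hadoop2.7"
jar_name = "whylogs-java-spark_3.1.1-scala_2.12-0.1.7-b2.jar"

print(versions_match(spark_home, "3.1.1"))
print(versions_match(jar_name, "3.1.1"))
```

In the notebook itself you would pass `os.environ["SPARK_HOME"]` and `pyspark.__version__` instead of the literal strings.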

In [5]:
import sys
sys.executable

'/Users/andy/miniconda3/envs/whylogs-spark/bin/python'

In [6]:
%env PYSPARK_PYTHON={sys.executable}
%env PYSPARK_DRIVER_PYTHON={sys.executable}

env: PYSPARK_PYTHON=/Users/andy/miniconda3/envs/whylogs-spark/bin/python
env: PYSPARK_DRIVER_PYTHON=/Users/andy/miniconda3/envs/whylogs-spark/bin/python


In [7]:
spark = pyspark.sql.SparkSession.builder \
                .appName("whylogs") \
                .config("spark.pyspark.driver.python", sys.executable) \
                .config("spark.pyspark.python", sys.executable) \
                .config("spark.executor.userClassPathFirst", "true") \
                .config("spark.submit.pyFiles", whylogs_jar) \
                .config("spark.jars", whylogs_jar) \
                .getOrCreate() 

## Using Spark bridge

`whyspark` is the module that is bundled in the above jar. It's a thin bridge into whylogs Spark.

In [8]:
import whyspark
whyspark # should show that the py file is from the jar

<module 'whyspark' from '/private/var/folders/58/g0k4klhn3915_fs0s1gc7h7r0000gn/T/spark-b8bb7cd8-a685-4c48-8978-efbb774b6a36/userFiles-345c96e5-cb29-4a00-b4a4-ac829432d3cd/whylogs-java-spark_3.1.1-scala_2.12-0.1.7-b2.jar/whyspark/__init__.py'>

In [9]:
import pandas as pd

# Adjust timestamp
The dataset's timestamps are in the past, so we need to shift them to more recent dates. This lets us use the data in WhyLabs, whose streaming window covers the last 7 days.

In [10]:
def adjust_offset(pdf: pd.DataFrame):
    offset = pd.Timestamp.today().round(freq='D') - pdf.order_estimated_delivery_date.max() - pd.Timedelta(days=1)
    pdf['order_estimated_delivery_date'] = pdf.order_estimated_delivery_date + offset
    return pdf
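
To make the offset arithmetic concrete, here is a minimal standard-library sketch of the same idea: every date is shifted by one constant offset so that the latest date lands on the day before "today". The function name `shift_dates` is hypothetical, and the sketch takes `today` as an explicit calendar date rather than rounding a timestamp as the pandas version above does.

```python
from datetime import date, timedelta
from typing import List, Sequence


def shift_dates(dates: Sequence[date], today: date) -> List[date]:
    """Shift all dates by one offset so the latest one becomes today - 1 day."""
    offset = today - max(dates) - timedelta(days=1)
    return [d + offset for d in dates]


# Two old dates, two days apart, re-dated relative to a fixed "today".
old_dates = [date(2018, 1, 1), date(2018, 1, 3)]
shifted = shift_dates(old_dates, today=date(2021, 6, 10))
print(shifted)
```

Because the offset is constant, the spacing between records is preserved; only their position on the calendar changes.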

In [11]:
# Read the example dataset (brazillian_data_demo.parquet).
# Set the path below to your local copy of the file before running this cell.
import pandas as pd
pdf = pd.read_parquet("")

FileNotFoundError: [Errno 2] No such file or directory: ''

In [None]:
pdf['delivery_confidence'] = pdf['delivery_confidence'].astype(int)
pdf['delivery_prediction'] = pdf['delivery_prediction'].astype(int)
pdf['delivery_status'] = pdf['delivery_status'].astype(int)

In [None]:
pdf = adjust_offset(pdf)

# Mark output field
Append ` (output)` to the names of the prediction, status, and confidence columns to mark them as model outputs.

In [None]:
pdf = pdf.rename(columns={"delivery_prediction":"delivery_prediction (output)", "delivery_status": "delivery_status (output)", "delivery_confidence": "delivery_confidence (output)"})
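
The rename mapping above can also be generated instead of typed by hand. The sketch below is one way to do that; `output_rename_map` is a hypothetical helper, not part of whylogs or pandas.

```python
from typing import Dict, Iterable


def output_rename_map(columns: Iterable[str], suffix: str = " (output)") -> Dict[str, str]:
    """Map each column name to the same name with `suffix` appended."""
    return {name: f"{name}{suffix}" for name in columns}


output_columns = ["delivery_prediction", "delivery_status", "delivery_confidence"]
rename_map = output_rename_map(output_columns)
print(rename_map)
```

The resulting dictionary can be passed as `columns=rename_map` to `pdf.rename`.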

In [None]:
df = spark.createDataFrame(pdf)

In [None]:
from pyspark.sql.functions import col

In [None]:
df.printSchema()

# Create a basic profiling session
First, we wrap a profiling session around the DataFrame.

In [None]:
from whyspark import new_profiling_session

session = new_profiling_session(df, "my-model-name") 

## Classification Model

This dataset contains the training features, the predictions, the actuals (or targets), and the score for each prediction. The model is a binary classifier.

Key fields to note:
* `targets`
* `predictions`
* `scores`
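
To illustrate how targets and predictions relate, the sketch below counts (target, prediction) pairs for a small made-up binary example and derives the accuracy from them. It only illustrates the kind of metric these fields make possible and is not whylogs' implementation; `confusion_counts` and `accuracy` are hypothetical names, and the data is invented. In this notebook, the targets correspond to `delivery_status (output)`, the predictions to `delivery_prediction (output)`, and the scores to `delivery_confidence (output)`.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def confusion_counts(targets: Sequence[int], predictions: Sequence[int]) -> Dict[Tuple[int, int], int]:
    """Count how often each (target, prediction) pair occurs."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    return dict(Counter(zip(targets, predictions)))


def accuracy(counts: Dict[Tuple[int, int], int]) -> float:
    """Fraction of rows where the prediction equals the target."""
    total = sum(counts.values())
    correct = sum(n for (target, pred), n in counts.items() if target == pred)
    return correct / total


# Invented toy data for illustration only.
targets = [1, 0, 1, 1, 0]
predictions = [1, 0, 0, 1, 1]

counts = confusion_counts(targets, predictions)
print(counts)
print(accuracy(counts))
```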

In [None]:
classificationSession = session.withTimeColumn('order_estimated_delivery_date').withClassificationModel("delivery_prediction (output)", "delivery_status (output)", "delivery_confidence (output)") 

In [None]:
# Quick validation
classificationSession.aggProfiles().count()

## Regression model

You need the prediction (numerical value) and the confidence (numerical value).

In [None]:
regressionModel = session.withTimeColumn('order_estimated_delivery_date').withRegressionModel("delivery_prediction (output)", "delivery_confidence (output)")

In [None]:
# quick validation
regressionModel.aggProfiles().count()

## Run the profiling

Note that the result has three entries, one for each of the three days.


## Publish to WhyLabs service

You'll need the following information:
* Organization ID (`org_id`)
* Model ID (`model_id`)
* API Key specific to the organization

Please reach out to the WhyLabs team if you don't have the information.

You can pass these via the method parameters or pass them as environment variables.
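
If you go the environment-variable route, it can help to check that all three values are set before publishing. The sketch below is a minimal check under that assumption; the variable names are the ones used in this notebook, and `missing_settings` is a hypothetical helper rather than part of whylogs.

```python
import os
from typing import List, Mapping

# Environment variable names as used in this notebook.
REQUIRED_SETTINGS = ("WHYLABS_API_KEY", "WHYLABS_ORG_ID", "WHYLABS_MODEL_ID")


def missing_settings(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or empty in `env`."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {"WHYLABS_API_KEY": "placeholder-key", "WHYLABS_ORG_ID": ""}
print(missing_settings(example_env))
```

Calling `missing_settings()` with no arguments inspects the real process environment; an empty list means all three values are present.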

In [None]:
session.log?

In [None]:
# Here we pass the credentials via environment variables.
# Replace the placeholder with your own API key.

%env WHYLABS_API_KEY=<your-api-key>
%env WHYLABS_ORG_ID=org-a9N6PX
%env WHYLABS_MODEL_ID=model-2

In [None]:
regressionModel.log?

In [None]:
classificationSession.log?