# whylogs + PySpark

First, point `SPARK_HOME` at an existing Spark installation. Make sure the Spark version matches the `pyspark` version installed in your Python environment.
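A quick way to catch a version mismatch early is to compare the version embedded in the `SPARK_HOME` directory name with the installed `pyspark` version. The sketch below assumes the directory follows the standard `spark-X.Y.Z-bin-...` naming of Spark binary downloads; the helper names `spark_home_version` and `versions_match` are hypothetical and not part of whylogs or PySpark.

```python
import re
from typing import Optional


def spark_home_version(spark_home: str) -> Optional[str]:
    """Extract 'X.Y.Z' from a directory name such as 'spark-3.0.1-bin-hadoop2.7'."""
    match = re.search(r"spark-(\d+\.\d+\.\d+)", spark_home)
    return match.group(1) if match else None


def versions_match(spark_home: str, pyspark_version: str) -> bool:
    """Return True when SPARK_HOME and pyspark agree on major.minor."""
    home_version = spark_home_version(spark_home)
    if home_version is None:
        return False  # version could not be inferred from the path
    return home_version.split(".")[:2] == pyspark_version.split(".")[:2]


# Example with the path used in this notebook; in practice pass pyspark.__version__.
print(versions_match("/Users/andy/Workspace/spark/spark-3.0.1-bin-hadoop2.7", "3.0.1"))
```

Comparing only major and minor is a deliberate simplification; tighten it to the full version string if your setup requires an exact match.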

In [1]:
%env SPARK_HOME=/Users/andy/Workspace/spark/spark-3.0.1-bin-hadoop2.7

env: SPARK_HOME=/Users/andy/Workspace/spark/spark-3.0.1-bin-hadoop2.7


## whylogs Spark jar

You'll need to load a fat jar for whylogs. This jar contains all the required dependencies (some are shaded) for running whylogs in Spark.

Because the jar is also passed via `spark.submit.pyFiles`, Spark adds it to the Python path, which makes the bundled `whyspark` module importable.
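If the `whyspark` import fails later, it can help to confirm that the jar actually contains Python sources. A jar is a zip archive, so the standard `zipfile` module can list its entries. This is a minimal sketch: the helper `python_entries` is hypothetical, and the assumption that the module lives at the top level of the jar (as `whyspark/` or `whyspark.py`) is not confirmed by this notebook.

```python
import zipfile
from typing import List


def python_entries(jar_path: str, package: str = "whyspark") -> List[str]:
    """List .py files belonging to `package` inside a jar (a jar is a zip archive)."""
    with zipfile.ZipFile(jar_path) as jar:
        return [
            name
            for name in jar.namelist()
            if name.endswith(".py")
            and (name == f"{package}.py" or name.startswith(f"{package}/"))
        ]


# Usage (with the jar path defined in this notebook):
# python_entries(whylogs_jar)
```

An empty result would suggest that the jar was built without the Python bridge or that the module sits under a different path.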

In [2]:
whylogs_jar = "/Volumes/Workspace/whylogs-java/spark-bundle/build/libs/whylogs-spark-bundle_2.12-0.1.2-b0.jar"

In [3]:
import pyspark

In [4]:
import sys
sys.executable

'/Users/andy/miniconda3/envs/whylogs-spark/bin/python'

In [5]:
spark = pyspark.sql.SparkSession.builder \
                .appName("whylogs") \
                .config("spark.pyspark.python", sys.executable) \
                .config("spark.executor.userClassPathFirst", "true") \
                .config("spark.submit.pyFiles", whylogs_jar) \
                .config("spark.jars", whylogs_jar) \
                .getOrCreate() 

In [6]:
spark

## Using the Spark bridge

`whyspark` is the Python module bundled in the jar above. It is a thin bridge to the whylogs Spark integration.

In [7]:
from whyspark import new_profiling_session

In [8]:
df = spark.read.csv("full-data.csv")

In [9]:
profile_df = new_profiling_session(df, "demo").aggProfiles()

## whylogs data

whylogs output is stored as Spark `BinaryType`, so `show()` displays only raw bytes. Use the whylogs Python library to deserialize and inspect the profiles.

In [10]:
profile_df.show()

+--------------------+
|         why_profile|
+--------------------+
|[E4 DF 32 0A 2A 0...|
+--------------------+



In [11]:
## Save to Parquet
profile_df.write.mode('overwrite').parquet("whylogs")

In [12]:
pdf = profile_df.toPandas()

In [13]:
pdf

|   | why_profile |
|---|---|
| 0 | [228, 223, 50, 10, 42, 8, 1, 16, 1, 26, 4, 100... |


In [14]:
import whylogs

In [15]:
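from packaging.version import Version
# Compare numeric release tuples; plain string comparison misorders versions such as '0.10.0'.
assert Version(whylogs.__version__).release >= (0, 2, 0)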
print(whylogs.__version__)

0.2.0-dev0


In [16]:
from whylogs import DatasetProfile

In [17]:
pdf['profile'] = pdf['why_profile'].apply(DatasetProfile.parse_delimited)
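The name `parse_delimited` suggests that each binary value holds one or more length-prefixed messages, which is also why the call returns a list. The sketch below assumes the prefix is a protobuf-style base-128 varint; the notebook does not confirm this, and `read_varint` is a hypothetical helper written for illustration, not a whylogs function.

```python
from typing import Tuple


def read_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode a base-128 varint starting at `pos`; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift  # low 7 bits carry the payload
        if not byte & 0x80:              # a clear high bit marks the last byte
            return value, pos
        shift += 7


# First bytes of the why_profile cell shown earlier: E4 DF 32 0A 2A
length, offset = read_varint(bytes([0xE4, 0xDF, 0x32, 0x0A, 0x2A]))
print(length, offset)  # 831460 3
```

Under that assumption, the first three bytes would encode a message length of 831460 bytes, with the serialized profile starting at offset 3.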

In [18]:
pdf['profile'].get(0)[0].to_summary()

properties {
  schema_major_version: 1
  schema_minor_version: 1
  session_id: "demo"
  session_timestamp: 1612846518762
  data_timestamp: -1
  tags {
    key: "Name"
    value: "demo"
  }
  tags {
    key: "name"
    value: ""
  }
}
columns {
  key: "_c0"
  value {
    counters {
      count: 98680
    }
    schema {
      inferred_type {
        type: FRACTIONAL
        ratio: 0.9999898662342926
      }
      type_counts {
        key: "FRACTIONAL"
        value: 98679
      }
      type_counts {
        key: "STRING"
        value: 1
      }
    }
    number_summary {
      count: 98679
      min: 1000.0
      max: 40000.0
      mean: 15030.950100831991
      stddev: 8756.094826211576
      histogram {
        start: 1000.0
        end: 40000.004
        counts: 2038
        counts: 3392
        counts: 2688
        counts: 9344
        counts: 4992
        counts: 5632
        counts: 9664
        counts: 3200
        counts: 6784
        counts: 3584
        counts: 6272
        c