>### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*<br> 
>*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Getting_Started)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=Getting_Started) to leverage the power of whylogs and WhyLabs together!*

# Getting Started with Metric UDFs

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/basic/Getting_Started.ipynb)

whylogs provides a standard way to log any kind of data.

In this example, we show how to log data with whylogs and generate statistical summaries called *profiles*. These profiles can be used in a number of ways, such as:

* Data Visualization
* Data Validation
* Tracking changes in your datasets

## Table of Contents

In this example, we'll explore the basics of logging data with whylogs:

- Installing whylogs
- Profiling data
- Interacting with the profile
- Writing/Reading profiles to/from disk

## Installing whylogs

whylogs is available as a Python package. You can get the latest version from PyPI with `pip install whylogs`. The cell below pins the version used in this example:

In [None]:
# Note: you may need to restart the kernel to use updated packages.
%pip install whylogs==1.3.2.dev0

Minimal requirements:

- Python 3.7 through Python 3.10
- Windows, Linux x86_64, and macOS 10+
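
As a quick sanity check against the requirements listed above, the following sketch reports whether the running interpreter falls inside the stated Python range and which whylogs version, if any, is installed. The helper names `python_is_supported` and `installed_whylogs_version` are hypothetical and not part of whylogs. Note that `importlib.metadata` is only in the standard library from Python 3.8 onward.

```python
import sys
from importlib import metadata

# Supported range copied from the requirements listed above
MIN_PYTHON = (3, 7)
MAX_PYTHON = (3, 10)


def python_is_supported(version_info=sys.version_info) -> bool:
    """True when the interpreter's (major, minor) falls within the listed range."""
    return MIN_PYTHON <= tuple(version_info[:2]) <= MAX_PYTHON


def installed_whylogs_version():
    """Return the installed whylogs version string, or None if it is not installed."""
    try:
        return metadata.version("whylogs")
    except metadata.PackageNotFoundError:
        return None


print("Python supported:", python_is_supported())
print("whylogs version:", installed_whylogs_version())
```

If the second line prints `None`, the install cell above has not yet taken effect in this kernel, and a restart may be needed.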

## Loading a Pandas DataFrame

Before we can log data, we need the data itself. Let's create a simple pandas DataFrame:

In [1]:
import pandas as pd

# Some toy data showing animal names, the count of their legs, and weight
data = {
    "animal": ["cat", "hawk", "snake", "cat", "squid"],
    "legs": [4, 2, 0, 4, 10],
    "weight": [4.3, 1.8, 1.3, 4.1, 3.1],
}

df = pd.DataFrame(data)
max_row_count = df.shape[0]
max_row_count

# we will sample individual rows from this dataframe, here is what they will look like
for i in range(max_row_count):
    print(df.iloc[i:i+1])

sample_normal_record = df.iloc[0:1]
sample_outlier_record = df.iloc[2:3]
print(f"This is our example normal record {sample_normal_record}")
print(f"This is our example outlier record {sample_outlier_record}")

  animal  legs  weight
0    cat     4     4.3
  animal  legs  weight
1   hawk     2     1.8
  animal  legs  weight
2  snake     0     1.3
  animal  legs  weight
3    cat     4     4.1
  animal  legs  weight
4  squid    10     3.1
This is our example normal record   animal  legs  weight
0    cat     4     4.3
This is our example outlier record   animal  legs  weight
2  snake     0     1.3
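
The cell above selects rows with a slice (`df.iloc[i:i+1]`) rather than a single position (`df.iloc[i]`). A minimal sketch of the difference, using a shortened copy of the toy data: the slice keeps a one-row DataFrame with its column layout intact, while a single position returns a Series indexed by column name.

```python
import pandas as pd

# Shortened copy of the toy data, just for illustration
df = pd.DataFrame({"animal": ["cat", "hawk"], "legs": [4, 2], "weight": [4.3, 1.8]})

one_row_frame = df.iloc[0:1]  # slice: still a DataFrame, with a single row
one_row_series = df.iloc[0]   # single position: a Series indexed by column name

print(type(one_row_frame).__name__, one_row_frame.shape)
print(type(one_row_series).__name__, one_row_series.shape)
```

Keeping the record as a DataFrame means each sampled row has the same columns as the full dataset when it is logged later.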


## Profiling with whylogs and a UDF

To profile your data, call whylogs' `log` function, then retrieve the resulting profile with `profile()`. Here we also register a metric UDF and pass a schema that includes it:

In [8]:
from typing import Optional
import whylogs as why
from whylogs.experimental.core.udf_schema import udf_schema
from whylogs.experimental.core.metrics.udf_metric import register_metric_udf

# Suppose we expect the leg count in our data to be between 2 and 8 inclusive.
# We can define a metric UDF on the legs column that outputs a label for outliers:
@register_metric_udf(col_name="legs")
def detect_leg_count_extremes(legs: int) -> Optional[str]:
    if legs < 2 or legs > 8:  # the expected range overlooks animals with no legs
        return "outlier"
    return None  # no label by default, since this metric UDF only flags extremes

# udf_schema() wires in any registered UDFs and returns a custom schema you can pass
# into whylogs profiling. If the `detect_leg_count_extremes` function is defined elsewhere,
# it is sufficient to import it and then call udf_schema() in that Python file.
custom_schema = udf_schema()

# Now log the sample records with the custom_schema that attaches the metric UDF defined above
normal_results = why.log(sample_normal_record, schema=custom_schema, trace_id="0")
outlier_results = why.log(sample_outlier_record, schema=custom_schema, trace_id="2")
normal_profile = normal_results.profile()
outlier_profile = outlier_results.profile()
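
To see which rows of the toy data the rule would label, here is a minimal sketch that applies an undecorated copy of the same function to every value of the `legs` column. It runs without whylogs and only illustrates the labeling rule, not how whylogs invokes the UDF internally.

```python
from typing import List, Optional


def detect_leg_count_extremes(legs: int) -> Optional[str]:
    # Same rule as the registered UDF: expected leg counts are 2 through 8 inclusive
    if legs < 2 or legs > 8:
        return "outlier"
    return None


leg_counts = [4, 2, 0, 4, 10]  # the "legs" column of the toy data
labels: List[Optional[str]] = [detect_leg_count_extremes(n) for n in leg_counts]
print(labels)  # [None, None, 'outlier', None, 'outlier']
```

The snake (0 legs) and the squid (10 legs) are labeled, which matches the choice of row 2 as the example outlier record.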

## Inspecting Profiles

Once you're done logging the data, you can generate a profile view with `view()` and inspect it as a pandas DataFrame:

In [6]:
normal_prof_df = normal_profile.view().to_pandas()
normal_prof_df

| column | cardinality/est | cardinality/lower_1 | cardinality/upper_1 | counts/inf | counts/n | counts/nan | counts/null | distribution/max | distribution/mean | distribution/median | ... | udf/detect_leg_count_extremes:counts/inf | udf/detect_leg_count_extremes:counts/n | udf/detect_leg_count_extremes:counts/nan | udf/detect_leg_count_extremes:counts/null | udf/detect_leg_count_extremes:types/boolean | udf/detect_leg_count_extremes:types/fractional | udf/detect_leg_count_extremes:types/integral | udf/detect_leg_count_extremes:types/object | udf/detect_leg_count_extremes:types/string | udf/detect_leg_count_extremes:types/tensor |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| animal | 1.0 | 1.0 | 1.00005 | 0 | 1 | 0 | 0 |  | 0.0 |  | ... |  |  |  |  |  |  |  |  |  |  |
| legs | 1.0 | 1.0 | 1.00005 | 0 | 1 | 0 | 0 | 4.0 | 4.0 | 4.0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| weight | 1.0 | 1.0 | 1.00005 | 0 | 1 | 0 | 0 | 4.3 | 4.3 | 4.3 | ... |  |  |  |  |  |  |  |  |  |  |


In [7]:
outlier_prof_df = outlier_profile.view().to_pandas()
outlier_prof_df

| column | cardinality/est | cardinality/lower_1 | cardinality/upper_1 | counts/inf | counts/n | counts/nan | counts/null | distribution/max | distribution/mean | distribution/median | ... | udf/detect_leg_count_extremes:distribution/q_95 | udf/detect_leg_count_extremes:distribution/q_99 | udf/detect_leg_count_extremes:distribution/stddev | udf/detect_leg_count_extremes:frequent_items/frequent_strings | udf/detect_leg_count_extremes:types/boolean | udf/detect_leg_count_extremes:types/fractional | udf/detect_leg_count_extremes:types/integral | udf/detect_leg_count_extremes:types/object | udf/detect_leg_count_extremes:types/string | udf/detect_leg_count_extremes:types/tensor |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| animal | 1.0 | 1.0 | 1.00005 | 0 | 1 | 0 | 0 |  | 0.0 |  | ... |  |  |  |  |  |  |  |  |  |  |
| legs | 1.0 | 1.0 | 1.00005 | 0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | ... |  |  | 0.0 | [FrequentItem(value='outlier', est=1, upper=1,... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| weight | 1.0 | 1.0 | 1.00005 | 0 | 1 | 0 | 0 | 1.3 | 1.3 | 1.3 | ... |  |  |  |  |  |  |  |  |  |  |
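
The two outputs above differ in the `udf/detect_leg_count_extremes:types/string` column for the `legs` row: it is 0.0 for the normal record and 1.0 for the outlier record, because the UDF returned a string label only in the second case. One way to read that value programmatically is sketched below. The helper `labeled_count` is hypothetical, and the two small stand-in DataFrames copy only the relevant values from the outputs above so that the sketch runs without whylogs. In the notebook you would pass `normal_prof_df` and `outlier_prof_df` instead.

```python
import pandas as pd

STRING_COUNT_COL = "udf/detect_leg_count_extremes:types/string"


def labeled_count(profile_df: pd.DataFrame, column: str = "legs") -> int:
    """Return how many string labels the UDF produced for `column` (0 if none)."""
    if STRING_COUNT_COL not in profile_df.columns:
        return 0
    value = profile_df.loc[column, STRING_COUNT_COL]
    return 0 if pd.isna(value) else int(value)


# Stand-ins for the profile DataFrames, holding only the column of interest
rows = ["animal", "legs", "weight"]
normal_stand_in = pd.DataFrame({STRING_COUNT_COL: [None, 0.0, None]}, index=rows)
outlier_stand_in = pd.DataFrame({STRING_COUNT_COL: [None, 1.0, None]}, index=rows)

print(labeled_count(normal_stand_in))   # 0
print(labeled_count(outlier_stand_in))  # 1
```

A nonzero count is a simple signal that the record was flagged, which could then be used to decide whether to record additional context for that trace.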


In [3]:
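import whylogs as why

# Record a debug event under the same trace_id ("2") that was used when logging the outlier record.
# write_local_file=True asks whylogs to also write the event to a local file.
why.log_debug_event(debug_event={"content": "outlier detected"}, trace_id="2", write_local_file=True)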