# Local Profile Store with Constraints

Hey there! In this example we will see how to set up the `LocalStore` and use it to track changes in our incoming data. `LocalStore` is an implementation of `ProfileStore` that manages listing, reading and writing whylogs Dataset Profiles locally.

## Setting up
The first thing you'll need to do to start using the `LocalStore` is instantiate it and call its `list` method to check whether any profiles have already been written.

In [None]:
from whylogs.api.store import LocalStore

store = LocalStore()
store.list()

[]

Since an empty list was returned, we haven't used the profile store in this location so far. We can already check, though, that a new directory called `profile_store` was created in our working directory:

In [None]:
import os 
"profile_store" in os.listdir(os.getcwd())

True

## Logging profiles

Now that we have our `LocalStore` configured, let's write some data to it. To emulate a real use case while keeping this notebook simple, we will create a rolling logger and run it for 2 minutes. At every interval we choose (one minute here), the logger rotates and persists a merged profile to the `LocalStore`. Meanwhile, we log the same pandas DataFrame repeatedly to emulate multiple log calls that are not in sync with the rotation schedule. This mimics a real streaming case, where a long-lived logging application receives many requests and rotates profiles to the `LocalStore` on a fixed time interval.

In [None]:
import pandas as pd 

df = pd.DataFrame({"column_1": [1,2,3,45], "column_2": [1,2,2,None], "column_3": ["strings", "more", "strings", ""]})

In [None]:
import time
import whylogs as why

with why.logger(mode="rolling", interval=1, when="M", base_name="base_model_name") as logger:
    logger.append_store(store=store)

    for _ in range(60):
        logger.log(df)
        time.sleep(2)

You should now see new profiles in your `LocalStore`. Let's check whether that is actually the case:

In [None]:
dataset_id = store.list()[0]
os.listdir(f"profile_store/{dataset_id}")

['profile_2022-11-29_14:11:3_019528a6-f5a9-4be5-8c53-709d9fec2f19.bin',
 'profile_2022-11-29_14:11:0_a2f92dac-98a8-4fa2-b62c-b88733def863.bin']
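
The relationship between log calls and rotations can be sketched with plain Python. This is only an illustration of the arithmetic, not how whylogs schedules rotations internally: it assumes, as a simplification, that rotation windows are counted from the first log call, and all constant names are made up for this sketch.

```python
from collections import Counter

ROWS_PER_CALL = 4          # rows in the example DataFrame
CALL_EVERY_SECONDS = 2     # time.sleep(2) between log calls
NUM_CALLS = 60             # range(60) in the loop
ROTATE_EVERY_SECONDS = 60  # interval=1, when="M"

# Simplifying assumption: rotation windows start at the first log call.
calls_per_window = Counter(
    (call_index * CALL_EVERY_SECONDS) // ROTATE_EVERY_SECONDS
    for call_index in range(NUM_CALLS)
)

# Each window holds one merged profile covering all calls that fell into it.
rows_per_window = {
    window: calls * ROWS_PER_CALL
    for window, calls in sorted(calls_per_window.items())
}
total_rows = sum(rows_per_window.values())

print(rows_per_window)  # {0: 120, 1: 120}
print(total_rows)       # 240
```

Under that assumption the 60 calls fall into two one-minute windows, which is consistent with the two `.bin` files listed above. In practice the exact split depends on when the logger was started relative to its rotation schedule.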

## Read profiles from the store

The next step is fetching profiles back from the `LocalStore`. You can do that by passing either a `DatasetIdQuery`, which fetches all existing profiles for that `dataset_id`, or a `DateQuery`, which fetches all profiles written within a specific datetime range.

In [None]:
from whylogs.api.store import DateQuery, DatasetIdQuery

name_query = DatasetIdQuery(dataset_id=dataset_id)

profile_view = store.get(query=name_query)

In [None]:
profile_view.to_pandas()

column,cardinality/est,cardinality/lower_1,cardinality/upper_1,counts/inf,counts/n,counts/nan,counts/null,distribution/max,distribution/mean,distribution/median,...,distribution/stddev,frequent_items/frequent_strings,ints/max,ints/min,type,types/boolean,types/fractional,types/integral,types/object,types/string
column_1,4.0,4.0,4.0002,0,240,0,0,45.0,12.75,3.0,...,18.671909,"[FrequentItem(value='2', est=60, upper=60, low...",45.0,1.0,SummaryType.COLUMN,0,0,240,0,0
column_2,2.0,2.0,2.0001,0,240,60,60,2.0,1.666667,2.0,...,0.472719,,,,SummaryType.COLUMN,0,180,0,0,0
column_3,2.0,2.0,2.0001,0,240,0,0,,0.0,,...,0.0,"[FrequentItem(value='strings', est=120, upper=...",,,SummaryType.COLUMN,0,0,0,0,240
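
As a sanity check, a few of the numbers in the merged view above can be reproduced with the standard library alone. The sketch below assumes the 60 log calls from the loop all landed in the store and simply repeats the DataFrame columns 60 times. The `summary` dictionary and its keys are only a convenience for this illustration and are not produced by whylogs.

```python
import statistics

NUM_CALLS = 60  # log calls made in the rolling-logger loop
column_1 = [1, 2, 3, 45] * NUM_CALLS
column_2 = [1, 2, 2, None] * NUM_CALLS

# Distribution statistics are computed over non-null values only.
non_null_2 = [value for value in column_2 if value is not None]

summary = {
    "column_1": {
        "counts/n": len(column_1),
        "distribution/mean": statistics.mean(column_1),
        "distribution/stddev": round(statistics.stdev(column_1), 6),
    },
    "column_2": {
        "counts/n": len(column_2),
        "counts/null": column_2.count(None),
        "distribution/mean": round(statistics.mean(non_null_2), 6),
        "distribution/stddev": round(statistics.stdev(non_null_2), 6),
    },
}

for column, stats in summary.items():
    print(column, stats)
```

The counts (240 rows, 60 nulls), the means (12.75 and 1.666667) and the sample standard deviations (18.671909 and 0.472719) line up with the `counts/n`, `counts/null`, `distribution/mean` and `distribution/stddev` columns of the table.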


The second approach is to query by date range. Since we have only written two profiles so far, and both fall within the queried range, we will end up with the same result as before. The nice thing about this approach is that it lets users fetch profiles for a moving reference window, as we will demonstrate below:

In [None]:
from datetime import datetime, timedelta


date_query = DateQuery(
    dataset_id=dataset_id,
    start_date=datetime.utcnow() - timedelta(days=7),
    end_date=datetime.utcnow()
)

timed_profile_view = store.get(query=date_query)

In [None]:
timed_profile_view.to_pandas()

column,cardinality/est,cardinality/lower_1,cardinality/upper_1,counts/inf,counts/n,counts/nan,counts/null,distribution/max,distribution/mean,distribution/median,...,distribution/stddev,frequent_items/frequent_strings,ints/max,ints/min,type,types/boolean,types/fractional,types/integral,types/object,types/string
column_1,4.0,4.0,4.0002,0,240,0,0,45.0,12.75,3.0,...,18.671909,"[FrequentItem(value='2', est=60, upper=60, low...",45.0,1.0,SummaryType.COLUMN,0,0,240,0,0
column_2,2.0,2.0,2.0001,0,240,60,60,2.0,1.666667,2.0,...,0.472719,,,,SummaryType.COLUMN,0,180,0,0,0
column_3,2.0,2.0,2.0001,0,240,0,0,,0.0,,...,0.0,"[FrequentItem(value='strings', est=120, upper=...",,,SummaryType.COLUMN,0,0,0,0,240


So every time you run `get` with this `DateQuery`, you will get the previous 7 days' worth of data. This can be useful, for instance, to compare a reference profile against a newly logged one. Let's see how to do that in the next section.

>**IMPORTANT**: Please note that even if we pass a millisecond-granular datetime in the `DateQuery` range, `store.get` will always search for profiles on a daily basis. We decided to do that to simplify the API usage and to return a more statistically significant merged profile view when reading. If this does not fit your current needs for the `LocalStore`, please submit an issue on our GitHub repo, and feel free to ask others on our [community Slack](http://join.slack.whylabs.ai/).
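
To make the daily granularity concrete, here is a small standard-library sketch of what "searching on a daily basis" can mean for a 7-day window. It is not the `LocalStore` implementation: `days_in_range` is a hypothetical helper, and treating both the start and end day as included is an assumption made for illustration.

```python
from datetime import date, datetime, timedelta, timezone
from typing import List


def days_in_range(start: datetime, end: datetime) -> List[date]:
    """Hypothetical helper: list the calendar days touched by [start, end]."""
    first, last = start.date(), end.date()
    return [first + timedelta(days=offset) for offset in range((last - first).days + 1)]


# A fixed, millisecond-granular "now" keeps the example reproducible.
end = datetime(2022, 11, 29, 14, 11, 3, 250_000, tzinfo=timezone.utc)
start = end - timedelta(days=7)

covered = days_in_range(start, end)
print(covered[0], covered[-1], len(covered))  # 2022-11-22 2022-11-29 8

# Shifting the start by half a second does not change which days are covered.
shifted = days_in_range(start + timedelta(milliseconds=500), end)
print(shifted == covered)  # True
```

Under this reading, sub-day precision in the query bounds only matters when it moves a bound across midnight.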

## Validating profiles with the Local Store

Now let's use this new functionality to validate incoming profiles! This will be useful to trigger some actions when receiving incoming data, for example. In order to do that, we will need a set of fixed rules for the Validator, as well as a well-defined set of Constraints, so we can do comparisons while profiling and also after profiling. So let's get to it.

In [None]:
from whylogs.core.relations import Not, Predicate

X = Predicate()

name_condition = {"is_not_value": Not(X.equals("John"))}
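
For intuition, the condition above behaves like a named boolean check applied to each value. The following plain-Python stand-in does not use the whylogs `Predicate` API at all; `conditions` and `failing_values` are hypothetical names used only to show which values would fail the check.

```python
from typing import Callable, Dict, List

# Plain-Python stand-in for Not(X.equals("John")); not the whylogs Predicate API.
conditions: Dict[str, Callable[[str], bool]] = {
    "is_not_value": lambda value: not (value == "John"),
}


def failing_values(values: List[str]) -> Dict[str, List[str]]:
    """Return, per condition name, the values that do not satisfy it."""
    return {
        name: [value for value in values if not check(value)]
        for name, check in conditions.items()
    }


print(failing_values(["strings", "more", "strings", ""]))  # {'is_not_value': []}
print(failing_values(["John", "more", "strings", ""]))     # {'is_not_value': ['John']}
```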

After defining the conditions that we wish to validate, we need to set the callback that will be triggered when a condition fails. We will simply print something to the screen for this example, but in a real usage scenario you could stop your processes or trigger an alert to your central communications channel, for example :)

In [None]:
from typing import Any

def do_something_important(validator_name: str, condition_name: str, value: Any):
    print(f"Validator {validator_name} failed! Condition name {condition_name} failed for value {value}")
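
A callback does not have to print. As a rough sketch of the "trigger an alert" idea, the variant below keeps the same three-argument signature as the callback above but records each failure and exposes a check that a monitoring job could poll. `failures`, `MAX_FAILURES`, `record_failure` and `should_alert` are hypothetical names, and the threshold is arbitrary.

```python
from typing import Any, List, Tuple

failures: List[Tuple[str, str, Any]] = []
MAX_FAILURES = 3  # hypothetical threshold before raising an alert


def record_failure(validator_name: str, condition_name: str, value: Any) -> None:
    """Keep a record of each failed validation instead of printing it."""
    failures.append((validator_name, condition_name, value))


def should_alert() -> bool:
    """Report whether enough failures have accumulated to warrant an alert."""
    return len(failures) >= MAX_FAILURES


# Calling the callback by hand, with made-up arguments, to show the behaviour.
for bad_value in ["John", "John", "John"]:
    record_failure("example_validator", "is_not_value", bad_value)

print(len(failures))   # 3
print(should_alert())  # True
```

Collecting failures this way keeps the callback cheap, which matters because it runs inline with logging.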


Lastly, we need to create the validator, which takes the condition we defined along with the callback function.

In [None]:
from whylogs.core.validators import ConditionValidator


name_validator = ConditionValidator(
    name="no_one_named_john",
    conditions=name_condition,
    actions=[do_something_important],
)

And now we will map the condition to specific columns: 

In [None]:
validators = {
    "column_3": [name_validator]
}

Finally, we can log incoming data again. For this example, we will log data for approximately 2 minutes, which will write 2 new profiles to our `LocalStore`. Then we will introduce data that doesn't match the validator condition, and we will see the callback being executed while logging! Please note that both DataFrames follow the **same schema**; the second one differs only in that it contains an invalid value.

In [None]:
from whylogs.core.schema import DatasetSchema
import whylogs as why

schema = DatasetSchema(validators=validators)
with why.logger(schema=schema, mode="rolling", base_name="base_model_name", interval=1, when="M") as logger:
    logger.append_store(store=store)

    for _ in range(60):
        logger.log(df)
        time.sleep(2)

    new_df = pd.DataFrame({"column_1": [1,2,3,45], "column_2": [1,2,2,None], "column_3": ["John", "more", "strings", ""]})

    logger.log(new_df)

Validator no_one_named_john failed! Condition name is_not_value failed for value John


As we can see, the validation failed, since we introduced a DataFrame with an invalid data point! Let's check what the stored profile looks like:

In [None]:
name_query = DatasetIdQuery(dataset_id="base_model_name")
profile_view = store.get(query=name_query)
profile_view.to_pandas()

column,cardinality/est,cardinality/lower_1,cardinality/upper_1,counts/inf,counts/n,counts/nan,counts/null,distribution/max,distribution/mean,distribution/median,...,distribution/stddev,frequent_items/frequent_strings,ints/max,ints/min,type,types/boolean,types/fractional,types/integral,types/object,types/string
column_1,4.0,4.0,4.0002,0,484,0,0,45.0,12.75,3.0,...,18.652247,"[FrequentItem(value='2', est=121, upper=121, l...",45.0,1.0,SummaryType.COLUMN,0,0,484,0,0
column_2,2.0,2.0,2.0001,0,484,121,121,2.0,1.666667,2.0,...,0.472055,,,,SummaryType.COLUMN,0,363,0,0,0
column_3,3.0,3.0,3.00015,0,484,0,0,,0.0,,...,0.0,"[FrequentItem(value='strings', est=241, upper=...",,,SummaryType.COLUMN,0,0,0,0,484


## Comparing profiles with the Profile Store

The last thing we want to demonstrate is the ability to fetch ever-moving reference profiles from the `LocalStore` and compare them to recently profiled data. We will use whylogs' Constraints module along with the `LocalStore` to show how users might benefit from this. To do that, we will make two queries to the store: one will be our reference, and the other will aggregate only today's worth of data.

In [None]:
store = LocalStore()

# Use UTC consistently for both queries, matching the earlier DateQuery example
today_query = DateQuery(start_date=datetime.utcnow(), dataset_id="base_model_name")
reference_query = DateQuery(start_date=datetime.utcnow() - timedelta(days=7), end_date=datetime.utcnow(), dataset_id="base_model_name")

today_profile = store.get(query=today_query)
reference_profile = store.get(query=reference_query)

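With both profiles read, what we need to do now is define our constraints suite. For demonstration purposes, we will check that the null percentage of `column_2` stays below 0.4 and that the values of `column_1` are greater than the reference mean. Since today's data also contains values below that mean, we expect the second constraint to fail.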

In [None]:
from whylogs.core.constraints import ConstraintsBuilder
from whylogs.core.constraints.factories import greater_than_number, null_percentage_below_number

reference_mean = reference_profile.get_column("column_1").get_metric("distribution").avg

builder = ConstraintsBuilder(dataset_profile_view=today_profile)
builder.add_constraint(null_percentage_below_number(column_name="column_2", number=0.4))
builder.add_constraint(greater_than_number(column_name="column_1", number=reference_mean))

constraints = builder.build()
print(constraints.validate())
print(constraints.generate_constraints_report())

False
[ReportResult(name='null percentage of column_2 lower than 0.4', passed=1, failed=0, summary=None), ReportResult(name='column_1 greater than number 12.75', passed=0, failed=1, summary=None)]
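
The report above can be cross-checked by hand. The sketch below rebuilds the raw column values in plain Python and applies the two checks directly. It rests on two assumptions that are not stated by the library here: that both queries returned all 121 log calls written so far (60 + 60 with `df`, plus 1 with `new_df`), and that the "greater than number" check compares the column minimum against the threshold.

```python
NUM_CALLS = 121  # 60 + 60 calls with df, plus 1 call with new_df
column_1 = [1, 2, 3, 45] * NUM_CALLS
column_2 = [1, 2, 2, None] * NUM_CALLS

# Stand-in for the reference mean read from the reference profile.
reference_mean = sum(column_1) / len(column_1)

null_fraction = column_2.count(None) / len(column_2)

checks = {
    "null percentage of column_2 lower than 0.4": null_fraction < 0.4,
    # Assumption: the check passes only if the column minimum exceeds the number.
    f"column_1 greater than number {reference_mean}": min(column_1) > reference_mean,
}

print(all(checks.values()))  # False
for name, passed in checks.items():
    print(name, "passed" if passed else "failed")
```

With a null fraction of 0.25 the first check passes, while a minimum of 1 against a mean of 12.75 makes the second one fail, which agrees with the `False` result and the report shown above.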


So we have just validated our recently read profile against the reference, which covers 7 days' worth of profiles. We can also mix checks based on the reference profile with others that only look at the newest profile.
Hopefully this short demonstration notebook shows some of the features the `LocalStore` offers to help you make the most of whylogs and make your data and ML pipelines more robust and responsible. To learn more, check out our [other examples](https://github.com/whylabs/whylogs/tree/mainline/python/examples).

In [None]:
# cleaning up
import shutil
shutil.rmtree("./profile_store")