- Difference between: Condition Count Metrics vs Metric Constraints
- Metric Constraints
- Condition Count Metrics
    - count metrics
    - callbacks
- Complex constraints
    - Same Column, Multiple Metrics
    - Constraints on Condition Counts
    - Different Columns, Condition Counts
- Constraints Report

## The Constraints

Let's define a list of constraints to apply to our data in the tables below.

For each of these constraints, we show the feature it is applied to, the parameters it takes, and a brief explanation of what it checks for.

### Completeness Constraints

| constraint  | feature | parameters  | semantic                                              |
|-------------|---------|-------------|-------------------------------------------------------|
| is_complete | _id_    | column name | Checks that there are no missing values in the column |

### Consistency Constraints

| constraint                        | feature               | parameters                         | semantic                                                                    |
|-----------------------------------|-----------------------|------------------------------------|-----------------------------------------------------------------------------|
| is_unique                         | _id_                  | column name                        | Checks that there are no duplicate values in a column.                      |
| matches_pattern                   | _listing_url_         | column name, regex pattern         | Checks that all values match regex pattern (if link is from airbnb domain)  |
| is_in_range                       | _latitude, longitude_ | column name, lower and upper bound | Checks that column is inside a range defined by a lower and upper bound     |
| is_less_than                      | _availability_365_    | column name, value                 | Checks that maximum value of column is less than number                     |
| is_nullable_fractional            | _bedrooms_            | column name                        | Checks that column contains only fractional values (null values acceptable) |
| is_non_negative                   | _bedrooms_            | column name                        | Checks that column contains only non negative values                        |
| matches_date_format               | _last_review_         | column name, date pattern          | Checks that all values match date pattern (Y-m-d)                           |
| frequent_strings_on_reference_set | _room_type_           | column name, reference set         | Checks that all values are in reference set                                 |

### Statistics Constraints

| constraint           | feature             | parameters                          | semantic                                                                               |
|----------------------|---------------------|-------------------------------------|----------------------------------------------------------------------------------------|
| stddev_between_range | _reviews_per_month_ | column name, lower and upper bounds | Checks that standard deviation must be between range defined by upper and lower bounds |
| has_mode             | _room_type_         | column name, value                  | Checks that most frequent item is _value_ ("Entire home/apt")                          |

## Standard Metrics vs. Condition Count Metrics 


If we take a closer look at our constraints, we may find a problem.

First, we need to remember that whylogs profiles contain summarized information about our data. Profiling is a __lossy__ process: once we have the profiles, we no longer have access to the complete set of data. For almost all of the listed constraints, this is fine. Standard metrics in our profile give us access to the standard deviation, maximum value and frequent strings, so we can easily build constraints such as `stddev_between_range`, `is_less_than` and `frequent_strings_on_reference_set`.

For other constraints, we're not so lucky. For example, to check whether every `listing_url` matches a URL regex pattern, we need to verify every single row of the data. How do we do that if we don't have access to the complete data?

The answer is that you need to define a __Condition Count Metric__ to be tracked __before__ logging your data. This metric counts the number of times the values of a given column meet a user-defined condition. When the profile is generated, you'll have that information to check against the constraints you create.
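
To make the idea concrete, here is a minimal plain-Python sketch of what a condition count amounts to: a pair of counters updated as each value streams past, so the pass rate is still known after the rows are gone. The `ConditionCounter` class and its field names are hypothetical and only illustrate the concept; they are not the whylogs implementation.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class ConditionCounter:
    """Hypothetical stand-in for a condition count: keeps only two integers."""

    condition: Callable[[str], bool]
    total: int = 0
    matches: int = 0

    def track(self, values: Iterable[str]) -> None:
        # Each value is inspected once and then discarded.
        for value in values:
            self.total += 1
            if self.condition(value):
                self.matches += 1


airbnb_url = re.compile(r"^https://www\.airbnb\.com/rooms")
counter = ConditionCounter(condition=lambda v: airbnb_url.match(v) is not None)
counter.track([
    "https://www.airbnb.com/rooms/17878",
    "https://www.airbnb.com/rooms/48726",
    "http://example.com/rooms/1",  # made-up value that violates the pattern
])

# A constraint can later be checked against the counters alone.
all_match = counter.matches == counter.total
print(counter.total, counter.matches, all_match)
```

The raw URLs are never stored; only `total` and `matches` survive, which is all a later constraint needs.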

### Actionables

For every constraint, a report is generated, whose results can be displayed visually in a dashboard or accessed programmatically to perform any downstream actions. Additionally, for Condition Count Metrics, you can define actions that are triggered mid-logging whenever a value fails the condition. This enables you to react to critical scenarios without having to wait for the logging process to finish. These actions are user-defined, so you can use them any way you see fit.

In this example, we'll define two types of actions: (a) Act: an emergency action, such as halting your pipeline or pulling an andon cord, and (b) Alert: sending an alert through email or Slack, for example.

The table below shows the metric each constraint is applied to, as well as the resulting actionables:


| constraint                        | Applicable Metric      | actionables        |
|-----------------------------------|------------------------|--------------------|
| is_complete                       | Counts Metric          | report             |
| is_unique                         | Frequent Items Metric  | report             |
| matches_pattern                   | Condition Count Metric | act, alert, report |
| is_in_range                       | Distribution Metric    | report             |
| is_less_than                      | Distribution Metric    | report             |
| is_nullable_fractional            | Types Metric           | report             |
| is_non_negative                   | Distribution Metric    | report             |
| matches_date_format               | Condition Count Metric | alert, report      |
| frequent_strings_on_reference_set | Frequent Items Metric  | report             |
| stddev_between_range              | Distribution Metric    | report             |
| has_mode                          | Frequent Items Metric  | report             |



## Glossary

OK, the terms are beginning to pile up. Let's pause and make the relationship between these concepts clearer.

Here's a diagram showing how __Profiles__, __Condition Count Metrics__ and __Metric Constraints__ relate to each other:

![alt text](mermaid-diagram-2022-11-30-173216.png "Title")

- A whylogs __profile__ contains a set of metrics that summarize the original data. 
- __Metric Constraints__ validate __Metrics__ present on the __Profile__ to generate a Report. This report will tell us whether the data meets the data quality constraints we defined.
- Those metrics can be standard metrics, such as Distribution, Frequent Items, or Types metrics. They can also be __Condition Count Metrics__.
- Condition Count Metrics count the number of times a certain relation passed or failed. For example, the number of rows in which the column `Name` was equal to `Bob`.
- For a profile to contain Condition Count Metrics, you first have to specify a Condition __before__ logging your data, and pass it to your Dataset Schema.
- The Dataset Schema configures the behavior for tracking metrics in whylogs.
- Conditions also enable you to trigger actions while data is being logged, whenever a value fails the condition. To do that, you need to specify a set of actions in your Condition object, along with the Relation that will trigger those actions. This is the same Relation that is used to track the Condition Count Metrics in the profile.
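
The last point can be sketched without any library: a condition bundles a relation with a list of callbacks, and every value that fails the relation is counted as a failure and, as in the outputs shown in this example, triggers the callbacks right away. `SimpleCondition` and `track` below are hypothetical names used for illustration; the callback signature mirrors the one used in this example, but the whylogs internals may differ.

```python
import datetime
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Action = Callable[[str, str, Any], None]


def is_iso_date(value: Any) -> bool:
    """Return True if value is a string in Y-m-d format."""
    try:
        datetime.datetime.strptime(value, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False


@dataclass
class SimpleCondition:
    """Hypothetical condition: a relation plus actions to run on failure."""

    relation: Callable[[Any], bool]
    actions: List[Action] = field(default_factory=list)


def track(name: str, condition: SimpleCondition, values: List[Any]) -> Dict[str, int]:
    counts = {"total": 0, "matches": 0}
    for value in values:
        counts["total"] += 1
        if condition.relation(value):
            counts["matches"] += 1
        else:
            # Actions run immediately, before logging has finished.
            for action in condition.actions:
                action("condition_count", name, value)
    return counts


alerts: List[Any] = []
condition = SimpleCondition(is_iso_date, actions=[lambda v, c, x: alerts.append(x)])
counts = track("is_date_format", condition, ["2020-12-26", "26/12/2020", None])
print(counts, alerts)
```

The counters end up in the summary, while the actions give you a hook for reacting to each offending value as it is seen.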

## Defining the Conditions

In [1]:
import pandas as pd

df_total = pd.read_csv("http://data.insideairbnb.com/brazil/rj/rio-de-janeiro/2021-01-26/data/listings.csv.gz")
selected_columns = [
    "name", "description", "listing_url", "last_review",
    "number_of_reviews_ltm", "number_of_reviews_l30d", "id",
    "latitude", "longitude", "availability_365", "bedrooms",
    "bathrooms", "reviews_per_month", "room_type",
]
# Keep a small, hand-picked sample of rows so the examples stay readable.
df = df_total[selected_columns].iloc[[0, 6, 39, 282, 1]].reset_index(drop=True)

In [2]:
df_total.columns

Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'description',
       'neighborhood_overview', 'picture_url', 'host_id', 'host_url',
       'host_name', 'host_since', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'neighbourhood',
       'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude',
       'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms',
       'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'calendar_upd

## Condition Counts

In [26]:
from typing import Any

def pull_andon_cord(validator_name, condition_name: str, value: Any):
    print("Validator: {}\n    Condition name {} failed for value {}".format(validator_name, condition_name, value))
    print("    Pulling andon cord....")
    # Do something here to respond to the constraint violation
    return

def send_slack_alert(validator_name, condition_name: str, value: Any):
    print("Validator: {}\n    Condition name {} failed for value {}".format(validator_name, condition_name, value))
    print("    Sending slack alert....")
    # Do something here to respond to the constraint violation
    return

def send_email_alert(validator_name, condition_name: str, value: Any):
    print("Validator: {}\n    Condition name {} failed for value {}".format(validator_name, condition_name, value))
    print("    Sending email alert....")
    # Do something here to respond to the constraint violation
    return


In [27]:
import datetime
from typing import Any, Dict
from whylogs.core.relations import Predicate
from whylogs.core.metrics.condition_count_metric import (
    Condition,
    ConditionCountConfig,
    ConditionCountMetric,
)
from whylogs.core.resolvers import Resolver
from whylogs.core.datatypes import DataType
from whylogs.core.schema import ColumnSchema, DatasetSchema
from whylogs.core.metrics import Metric
import whylogs as why

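def date_format(x: Any) -> bool:
    """Return True if x is a string in Y-m-d format."""
    pattern = "%Y-%m-%d"
    try:
        datetime.datetime.strptime(x, pattern)
        return True
    except (TypeError, ValueError):
        # TypeError covers non-string values such as numbers or None.
        return False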

X = Predicate()

date_format_condition = {
    "is_date_format": Condition(X.is_(date_format),actions=[send_slack_alert])
}


class CustomResolver(Resolver):
    def resolve(self, name: str, why_type: DataType, column_schema: ColumnSchema) -> Dict[str, Metric]:
        return {"condition_count": ConditionCountMetric.zero(column_schema.cfg)}

config = ConditionCountConfig(conditions=date_format_condition)
resolver = CustomResolver()
schema = DatasetSchema(default_configs=config, resolvers=resolver)

n_view = why.log(df, schema=schema).profile().view()
print(n_view.to_pandas())


Validator: condition_count
    Condition name is_date_format failed for value Very Nice 2Br in Copacabana w. balcony, fast WiFi
    Sending slack alert....
Validator: condition_count
    Condition name is_date_format failed for value Rio de Janeiro Copacabana Ipanema
    Sending slack alert....
Validator: condition_count
    Condition name is_date_format failed for value Butterfly Vanazul Guest House
    Sending slack alert....
Validator: condition_count
    Condition name is_date_format failed for value Kaza Rio Hostel room up to 6 people
    Sending slack alert....
Validator: condition_count
    Condition name is_date_format failed for value Beautiful Modern Decorated Studio in Copa
    Sending slack alert....
Validator: condition_count
    Condition name is_date_format failed for value Discounts for long term stays. <br />- Large balcony (25 square meters) which allows for being outside while staying home. Apt is impeccably clean.<br />- High speed WiFi (20MB)<br /><br /><b>The spac

In [25]:
X = Predicate()

from typing import Any

def do_something_important(validator_name, condition_name: str, value: Any):
    print("Validator: {}\n    Condition name {} failed for value {}".format(validator_name, condition_name, value))
    return

class CustomResolver(Resolver):
    def resolve(self, name: str, why_type: DataType, column_schema: ColumnSchema) -> Dict[str, Metric]:
        return {"condition_count": ConditionCountMetric.zero(column_schema.cfg)}


# Raw string avoids invalid escape sequences; dots are escaped so they match literally.
conditions = {"url_matches_airbnb_domain": Condition(X.matches(r"^https://www\.airbnb\.com/rooms"), actions=[pull_andon_cord, send_email_alert, send_slack_alert])}

config = ConditionCountConfig(conditions=conditions)
resolver = CustomResolver()
schema = DatasetSchema(default_configs=config, resolvers=resolver)

n_view = why.log(df, schema=schema).profile().view()
print(n_view.to_pandas())

Validator: condition_count
    Condition name url_matches_airbnb_domain failed for value Very Nice 2Br in Copacabana w. balcony, fast WiFi
Pulling andon cord....
Validator: condition_count
    Condition name url_matches_airbnb_domain failed for value Very Nice 2Br in Copacabana w. balcony, fast WiFi
Sending email alert....
Validator: condition_count
    Condition name url_matches_airbnb_domain failed for value Very Nice 2Br in Copacabana w. balcony, fast WiFi
Sending slack alert....
Validator: condition_count
    Condition name url_matches_airbnb_domain failed for value Rio de Janeiro Copacabana Ipanema
Pulling andon cord....
Validator: condition_count
    Condition name url_matches_airbnb_domain failed for value Rio de Janeiro Copacabana Ipanema
Sending email alert....
Validator: condition_count
    Condition name url_matches_airbnb_domain failed for value Rio de Janeiro Copacabana Ipanema
Sending slack alert....
Validator: condition_count
    Condition name url_matches_airbnb_domain 

In [5]:
import whylogs as why
profile = why.log(df).profile()
profile_view = profile.view()

In [6]:
df

Unnamed: 0,name,description,listing_url,last_review,number_of_reviews_ltm,number_of_reviews_l30d,id,latitude,longitude,availability_365,bedrooms,bathrooms,reviews_per_month,room_type
0,"Very Nice 2Br in Copacabana w. balcony, fast WiFi",Discounts for long term stays. <br />- Large b...,https://www.airbnb.com/rooms/17878,2020-12-26,13,0,17878,-22.96592,-43.17896,286,2.0,,2.01,Entire home/apt
1,Rio de Janeiro Copacabana Ipanema,"Apartamento tem três dormitórios, dois banheir...",https://www.airbnb.com/rooms/48726,2019-08-08,0,0,48726,-22.98414,-43.1945,90,1.0,,1.07,Private room
2,Butterfly Vanazul Guest House,"Vanazul, is a Hostel for lovers of nature, adv...",https://www.airbnb.com/rooms/92549,2014-04-22,0,0,92549,-22.99893,-43.27047,0,1.0,,0.01,Shared room
3,Kaza Rio Hostel room up to 6 people,Kaza Rio's location is the ideal for those who...,https://www.airbnb.com/rooms/452500,2017-04-03,0,0,452500,-22.91586,-43.20939,365,1.0,,0.07,Hotel room
4,Beautiful Modern Decorated Studio in Copa,"Our apartment is a little gem, everyone loves ...",https://www.airbnb.com/rooms/25026,2020-02-15,2,0,25026,-22.97712,-43.19045,357,1.0,,1.84,Entire home/apt


In [7]:
profile_view.to_pandas()

column,cardinality/est,cardinality/lower_1,cardinality/upper_1,counts/inf,counts/n,counts/nan,counts/null,distribution/max,distribution/mean,distribution/median,...,distribution/stddev,frequent_items/frequent_strings,ints/max,ints/min,type,types/boolean,types/fractional,types/integral,types/object,types/string
availability_365,5.0,5.0,5.00025,0,5,0,0,365.0,219.6,286.0,...,165.415537,"[FrequentItem(value='357', est=1, upper=1, low...",365.0,0.0,SummaryType.COLUMN,0,0,5,0,0
bathrooms,0.0,0.0,0.0,0,5,5,5,,0.0,,...,0.0,,,,SummaryType.COLUMN,0,0,0,0,0
bedrooms,2.0,2.0,2.0001,0,5,0,0,2.0,1.2,1.0,...,0.447214,,,,SummaryType.COLUMN,0,5,0,0,0
description,5.0,5.0,5.00025,0,5,0,0,,0.0,,...,0.0,"[FrequentItem(value='Vanazul, is a Hostel for ...",,,SummaryType.COLUMN,0,0,0,0,5
id,5.0,5.0,5.00025,0,5,0,0,452500.0,127335.8,48726.0,...,184098.943661,"[FrequentItem(value='92549', est=1, upper=1, l...",452500.0,17878.0,SummaryType.COLUMN,0,0,5,0,0
last_review,5.0,5.0,5.00025,0,5,0,0,,0.0,,...,0.0,"[FrequentItem(value='2020-12-26', est=1, upper...",,,SummaryType.COLUMN,0,0,0,0,5
latitude,5.0,5.0,5.00025,0,5,0,0,-22.91586,-22.968394,-22.97712,...,0.031711,,,,SummaryType.COLUMN,0,5,0,0,0
listing_url,5.0,5.0,5.00025,0,5,0,0,,0.0,,...,0.0,[FrequentItem(value='https://www.airbnb.com/ro...,,,SummaryType.COLUMN,0,0,0,0,5
longitude,5.0,5.0,5.00025,0,5,0,0,-43.17896,-43.208754,-43.1945,...,0.036177,,,,SummaryType.COLUMN,0,5,0,0,0
name,5.0,5.0,5.00025,0,5,0,0,,0.0,,...,0.0,[FrequentItem(value='Beautiful Modern Decorate...,,,SummaryType.COLUMN,0,0,0,0,5


In [8]:
from whylogs.core.constraints.metric_constraints import MetricConstraint, MetricsSelector
from whylogs.core.constraints import ConstraintsBuilder
from whylogs.core.relations import Require


def is_complete(column_name: str) -> MetricConstraint:
    """Checks that there are no missing values in the column.

    Parameters
    ----------
    column_name : str
        Column the constraint is applied to
    """

    constraint = MetricConstraint(
        name=f"{column_name} is complete",
        condition=Require("null").equals(0),
        metric_selector=MetricsSelector(column_name=column_name, metric_name="counts"),
    )
    return constraint


builder = ConstraintsBuilder(dataset_profile_view=profile_view)
builder.add_constraint(is_complete(column_name="id"))

constraints = builder.build()
constraints.generate_constraints_report()


[ReportResult(name='id is complete', passed=1, failed=0, summary=None)]

In [21]:
from whylogs.core.constraints.metric_constraints import MetricConstraint, MetricsSelector
from whylogs.core.constraints import ConstraintsBuilder
from whylogs.core.relations import Require
from whylogs.core.configs import SummaryConfig

def is_unique(column_name: str) -> MetricConstraint:
    """Checks that there are no duplicate values in a column.

    Parameters
    ----------
    column_name : str
        Column the constraint is applied to
    """

    def unique(metric) -> bool:
        frequent_strings = metric.to_summary_dict(SummaryConfig())["frequent_strings"]
        return frequent_strings[0].est==1
        

    constraint = MetricConstraint(
        name=f"{column_name} is unique",
        condition=Require().is_(unique),
        metric_selector=MetricsSelector(column_name=column_name, metric_name="frequent_items"),
    )
    return constraint


builder = ConstraintsBuilder(dataset_profile_view=profile_view)
builder.add_constraint(is_unique(column_name="room_type"))

constraints = builder.build()
constraints.generate_constraints_report()

[ReportResult(name='room_type is unique', passed=0, failed=1, summary=None)]
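
The reasoning behind `is_unique` is that a column has no duplicates exactly when its most frequent value occurs once. A small sketch with exact counts from `collections.Counter` shows the idea; the frequent items in the profile carry estimated counts rather than exact ones, so treat this as an illustration of the logic, not of the metric itself. The values are the five rows of `df` shown earlier.

```python
from collections import Counter
from typing import Hashable, Iterable


def is_unique(values: Iterable[Hashable]) -> bool:
    """True when the most frequent value appears exactly once."""
    counts = Counter(values)
    if not counts:
        return True  # assumption: an empty column counts as unique
    _, top_count = counts.most_common(1)[0]
    return top_count == 1


room_type = ["Entire home/apt", "Private room", "Shared room", "Hotel room", "Entire home/apt"]
ids = [17878, 48726, 92549, 452500, 25026]

print(is_unique(room_type))  # "Entire home/apt" appears twice
print(is_unique(ids))
```

This is why the constraint fails for `room_type` above, while the same check on `id` would be expected to pass.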

In [24]:
from whylogs.core.configs import SummaryConfig
from whylogs.core.metric_getters import ProfileGetter
from whylogs.core.relations import Not, Require
from whylogs.core.metrics.metrics import Metric
from whylogs.core.preprocessing import PreprocessedColumn
from typing import List
from whylogs.core.metrics import DistributionMetric, FrequentItemsMetric

def column_pair_mean_a_less_or_equal_than_mean_b(column_a: str, column_b: str, profile) -> MetricConstraint:
    """Checks that the mean of column A is less than or equal to the mean of column B."""
    dist_metrics = MetricsSelector(metric_name="distribution", column_name=column_a)
    condition=Require("mean").less_or_equals(ProfileGetter(profile, column_name=column_b, path="distribution/mean"))

    constraint_name = f"{column_a} mean is less or equal than {column_b} mean"
    constraint = MetricConstraint(name=constraint_name, condition=condition, metric_selector=dist_metrics)
    return constraint

builder.add_constraint(column_pair_mean_a_less_or_equal_than_mean_b("number_of_reviews_l30d","number_of_reviews_ltm",profile))

constraints = builder.build()
constraints.generate_constraints_report()

[ReportResult(name='id is complete', passed=1, failed=0, summary=None),
 ReportResult(name='number_of_reviews_l30d mean is less or equal than number_of_reviews_ltm mean', passed=1, failed=0, summary=None)]

In [9]:
from typing import Union
from whylogs.core.metrics import DistributionMetric

def is_in_range(column_name: str, lower: Union[float, int], upper: Union[float, int], skip_missing: bool = True) -> MetricConstraint:
    """Checks that all values of the column lie within [lower, upper].

    Parameters
    ----------
    column_name : str
        Column the constraint is applied to
    lower : float or int
        Lower bound of the range (inclusive)
    upper : float or int
        Upper bound of the range (inclusive)
    skip_missing: bool
        If skip_missing is True, missing distribution metrics will make the check pass.
        If False, the check will fail on missing metrics
    """

    def in_range(metric: DistributionMetric) -> bool:
        if not metric.kll.value.is_empty():
            return metric.min >= lower and metric.max <= upper
        else:
            return True if skip_missing else False

    constraint = MetricConstraint(
        name=f"{column_name} is in range [{lower},{upper}]",
        # Use the helper above so that skip_missing is actually honored.
        condition=Require().is_(in_range),
        metric_selector=MetricsSelector(column_name=column_name, metric_name="distribution"),
    )
    return constraint

builder.add_constraint(is_in_range(column_name="latitude",lower=-24,upper=-22))
builder.add_constraint(is_in_range(column_name="longitude",lower=-44,upper=-43))

constraints = builder.build()
constraints.generate_constraints_report()

[ReportResult(name='id is complete', passed=1, failed=0, summary=None),
 ReportResult(name='latitude is in range [-24,-22]', passed=1, failed=0, summary=None),
 ReportResult(name='longitude is in range [-44,-43]', passed=1, failed=0, summary=None)]
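
The `skip_missing` flag described in the docstring decides what happens when a column has no numeric values to summarize. Here is a minimal sketch of that decision, using a hypothetical `Summary` dataclass in place of the distribution metric; the latitude bounds are taken from the five rows of `df`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Summary:
    """Hypothetical stand-in for a distribution summary (min/max only)."""

    min: Optional[float] = None
    max: Optional[float] = None

    def is_empty(self) -> bool:
        return self.min is None or self.max is None


def in_range(summary: Summary, lower: float, upper: float, skip_missing: bool = True) -> bool:
    if summary.is_empty():
        # No values were tracked: pass or fail according to skip_missing.
        return skip_missing
    return summary.min >= lower and summary.max <= upper


latitude = Summary(min=-22.99893, max=-22.91586)
print(in_range(latitude, lower=-24, upper=-22))
print(in_range(Summary(), lower=-24, upper=-22, skip_missing=False))
```

Because only the minimum and maximum are needed, the check works on the summary alone and never touches individual rows.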

In [10]:
from whylogs.core.constraints.factories import smaller_than_number

builder.add_constraint(smaller_than_number(column_name="availability_365",number=366))

constraints = builder.build()
constraints.generate_constraints_report()


[ReportResult(name='id is complete', passed=1, failed=0, summary=None),
 ReportResult(name='latitude is in range [-24,-22]', passed=1, failed=0, summary=None),
 ReportResult(name='longitude is in range [-44,-43]', passed=1, failed=0, summary=None),
 ReportResult(name='availability_365 smaller than number 366', passed=1, failed=0, summary=None)]

In [11]:
from whylogs.core.constraints.factories import column_is_nullable_fractional

builder.add_constraint(column_is_nullable_fractional(column_name="bedrooms"))

constraints = builder.build()
constraints.generate_constraints_report()


[ReportResult(name='id is complete', passed=1, failed=0, summary=None),
 ReportResult(name='latitude is in range [-24,-22]', passed=1, failed=0, summary=None),
 ReportResult(name='longitude is in range [-44,-43]', passed=1, failed=0, summary=None),
 ReportResult(name='availability_365 smaller than number 366', passed=1, failed=0, summary=None),
 ReportResult(name='bedrooms is nullable fractional', passed=1, failed=0, summary=None)]

In [12]:
def column_is_non_negative(column_name: str, skip_missing: bool = True) -> MetricConstraint:
    """Checks if a column is non negative

    Parameters
    ----------
    column_name : str
        Column the constraint is applied to
    skip_missing: bool
        If skip_missing is True, missing distribution metrics will make the check pass.
        If False, the check will fail on missing metrics
    """

    def is_not_negative(metric: DistributionMetric) -> bool:
        if not metric.kll.value.is_empty():
            return metric.min >= 0
        else:
            return True if skip_missing else False

    constraint = MetricConstraint(
        name=f"{column_name} is non negative",
        # Use the helper above so that skip_missing is actually honored.
        condition=Require().is_(is_not_negative),
        metric_selector=MetricsSelector(column_name=column_name, metric_name="distribution"),
    )
    return constraint

builder.add_constraint(column_is_non_negative(column_name="bedrooms"))
constraints = builder.build()
constraints.generate_constraints_report()

[ReportResult(name='id is complete', passed=1, failed=0, summary=None),
 ReportResult(name='latitude is in range [-24,-22]', passed=1, failed=0, summary=None),
 ReportResult(name='longitude is in range [-44,-43]', passed=1, failed=0, summary=None),
 ReportResult(name='availability_365 smaller than number 366', passed=1, failed=0, summary=None),
 ReportResult(name='bedrooms is nullable fractional', passed=1, failed=0, summary=None),
 ReportResult(name='bedrooms is non negative', passed=1, failed=0, summary=None)]

In [13]:
from whylogs.core.constraints.factories import stddev_between_range

builder.add_constraint(stddev_between_range(column_name="reviews_per_month", lower=0.8, upper=1.1))
constraints = builder.build()
constraints.generate_constraints_report()

[ReportResult(name='id is complete', passed=1, failed=0, summary=None),
 ReportResult(name='latitude is in range [-24,-22]', passed=1, failed=0, summary=None),
 ReportResult(name='longitude is in range [-44,-43]', passed=1, failed=0, summary=None),
 ReportResult(name='availability_365 smaller than number 366', passed=1, failed=0, summary=None),
 ReportResult(name='bedrooms is nullable fractional', passed=1, failed=0, summary=None),
 ReportResult(name='bedrooms is non negative', passed=1, failed=0, summary=None),
 ReportResult(name='reviews_per_month standard deviation between 0.8 and 1.1 (inclusive)', passed=1, failed=0, summary=None)]
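
As a sanity check on the bounds, the standard deviation of the five `reviews_per_month` values in `df` can be computed directly with the standard library. This assumes the profile reports the sample (n-1) standard deviation, which is consistent with the `availability_365` row of the profile shown earlier.

```python
import statistics

reviews_per_month = [2.01, 1.07, 0.01, 0.07, 1.84]  # the five rows of df
availability_365 = [286, 90, 0, 365, 357]

stddev = statistics.stdev(reviews_per_month)  # sample standard deviation
print(round(stddev, 4))
print(0.8 <= stddev <= 1.1)

# Cross-check against the profile, which lists 165.415537 for availability_365.
print(round(statistics.stdev(availability_365), 6))
```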

In [14]:
from whylogs.core.constraints.factories import frequent_strings_in_reference_set

reference_set = {"Entire home/apt","Private room","Shared room","Hotel room"}
builder.add_constraint(frequent_strings_in_reference_set(column_name="room_type", reference_set=reference_set))

constraints = builder.build()
constraints.generate_constraints_report()

[ReportResult(name='id is complete', passed=1, failed=0, summary=None),
 ReportResult(name='latitude is in range [-24,-22]', passed=1, failed=0, summary=None),
 ReportResult(name='longitude is in range [-44,-43]', passed=1, failed=0, summary=None),
 ReportResult(name='availability_365 smaller than number 366', passed=1, failed=0, summary=None),
 ReportResult(name='bedrooms is nullable fractional', passed=1, failed=0, summary=None),
 ReportResult(name='bedrooms is non negative', passed=1, failed=0, summary=None),
 ReportResult(name='reviews_per_month standard deviation between 0.8 and 1.1 (inclusive)', passed=1, failed=0, summary=None),
 ReportResult(name="room_type values in set {'Private room', 'Entire home/apt', 'Shared room', 'Hotel room'}", passed=1, failed=0, summary=None)]

In [18]:
from typing import Any
from whylogs.core.configs import SummaryConfig

def has_mode(column_name: str, value: Any) -> MetricConstraint:
    """Checks that the most frequent item of a string column is the given value.

    Parameters
    ----------
    column_name : str
        Column the constraint is applied to.
    value : Any
        Value expected to be the most frequent item in the column
    """
    frequent_strings = MetricsSelector(metric_name="frequent_items", column_name=column_name)

    def mode(metric):
        frequent_strings = metric.to_summary_dict(SummaryConfig())["frequent_strings"]
        return frequent_strings[0].value==value

    constraint_name = f"{column_name} mode is {value}"
    constraint = MetricConstraint(name=constraint_name, condition=Require().is_(mode), metric_selector=frequent_strings)
    return constraint

builder.add_constraint(has_mode(column_name="room_type", value="Entire home/apt"))
constraints = builder.build()
constraints.generate_constraints_report()


[ReportResult(name='id is complete', passed=1, failed=0, summary=None),
 ReportResult(name='latitude is in range [-24,-22]', passed=1, failed=0, summary=None),
 ReportResult(name='longitude is in range [-44,-43]', passed=1, failed=0, summary=None),
 ReportResult(name='availability_365 smaller than number 366', passed=1, failed=0, summary=None),
 ReportResult(name='bedrooms is nullable fractional', passed=1, failed=0, summary=None),
 ReportResult(name='bedrooms is non negative', passed=1, failed=0, summary=None),
 ReportResult(name='reviews_per_month standard deviation between 0.8 and 1.1 (inclusive)', passed=1, failed=0, summary=None),
 ReportResult(name="room_type values in set {'Private room', 'Entire home/apt', 'Shared room', 'Hotel room'}", passed=1, failed=0, summary=None),
 ReportResult(name='room_type mode is Entire home/apt', passed=1, failed=0, summary=None)]
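
Since the report is a plain list of results, downstream actions can be driven from it programmatically. The sketch below uses a hypothetical `ReportResult` named tuple that mirrors the fields printed above, so it runs without whylogs; with the real report, the same loop should apply as long as the entries expose `name`, `passed` and `failed`.

```python
from typing import List, NamedTuple, Optional


class ReportResult(NamedTuple):
    """Hypothetical mirror of the entries printed in the report above."""

    name: str
    passed: int
    failed: int
    summary: Optional[dict] = None


def failed_constraints(report: List[ReportResult]) -> List[str]:
    """Return the names of all constraints with at least one failure."""
    return [result.name for result in report if result.failed > 0]


report = [
    ReportResult(name="id is complete", passed=1, failed=0),
    ReportResult(name="room_type is unique", passed=0, failed=1),
]

failures = failed_constraints(report)
if failures:
    print("Constraints failed:", ", ".join(failures))
```

In a pipeline, the `print` call would typically be replaced by whatever alerting or halting logic fits your setup.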

In [47]:
df_total[-10:]['room_type']

26096    Entire home/apt
26097    Entire home/apt
26098       Private room
26099    Entire home/apt
26100    Entire home/apt
26101       Private room
26102    Entire home/apt
26103    Entire home/apt
26104    Entire home/apt
26105    Entire home/apt
Name: room_type, dtype: object

In [25]:

# df_total[['listing_url','number_of_reviews_ltm', 'number_of_reviews_l30d']]

Unnamed: 0,name,description,listing_url,last_review,first_review,id,latitude,longitude,availability_365,bedrooms,bathrooms,reviews_per_month,room_type
0,"Very Nice 2Br in Copacabana w. balcony, fast WiFi",Discounts for long term stays. <br />- Large b...,https://www.airbnb.com/rooms/17878,2020-12-26,2010-07-15,17878,-22.96592,-43.17896,286,2.0,,2.01,Entire home/apt
1,Beautiful Modern Decorated Studio in Copa,"Our apartment is a little gem, everyone loves ...",https://www.airbnb.com/rooms/25026,2020-02-15,2010-06-07,25026,-22.97712,-43.19045,357,1.0,,1.84,Entire home/apt
2,Cosy flat close to Ipanema beach,This cosy apartment is just a few steps away ...,https://www.airbnb.com/rooms/35636,2020-03-15,2013-10-22,35636,-22.98816,-43.19359,300,1.0,,2.05,Entire home/apt
3,COPACABANA SEA BREEZE - RIO - 20 X Superhost,Our newly renovated studio is located in the b...,https://www.airbnb.com/rooms/35764,2021-01-24,2010-10-03,35764,-22.98127,-43.19046,84,1.0,,2.79,Entire home/apt
4,"Modern 2bed,Top end of Copacabana","<b>The space</b><br />Stay in this, Modern,cle...",https://www.airbnb.com/rooms/41198,2016-02-09,2013-02-11,41198,-22.97962,-43.1923,365,2.0,,0.19,Entire home/apt


In [3]:
from typing import Any, Callable, Dict, List, Optional, Tuple

isinstance(list(),List)

True