# Intro to Segmentation

Sometimes, certain subgroups of data can behave very differently from the overall dataset. When monitoring the health of a dataset, it’s often helpful to have visibility at the subgroup level to better understand how these subgroups contribute to trends in the overall dataset. whylogs supports data segmentation for this purpose.

Data segmentation is done at the point of profiling a dataset.

Segmentation can be done by a single feature or by multiple features simultaneously. For example, you could create a separate profile for each gender in your dataset ("M" or "F"), or for each combination of, say, Gender and City Code.
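
To make the idea concrete, here is a minimal sketch in plain Python (no whylogs involved) of what segmenting by one feature versus several features means. The rows and the helper `split_into_segments` are made up for illustration and are not part of the whylogs API.

```python
from collections import defaultdict

# Toy rows standing in for the real dataset (values are invented for illustration).
rows = [
    {"Gender": "M", "City Code": 1, "Total Amount": 250.0},
    {"Gender": "F", "City Code": 1, "Total Amount": 80.0},
    {"Gender": "M", "City Code": 2, "Total Amount": 52.0},
    {"Gender": "M", "City Code": 1, "Total Amount": 477.0},
]


def split_into_segments(rows, features):
    """Group rows by the tuple of values they hold for `features`."""
    segments = defaultdict(list)
    for row in rows:
        key = tuple(row[f] for f in features)
        segments[key].append(row)
    return dict(segments)


# One feature: one segment per observed gender.
by_gender = split_into_segments(rows, ["Gender"])
# Two features: one segment per observed (gender, city) combination.
by_gender_and_city = split_into_segments(rows, ["Gender", "City Code"])

print(sorted(by_gender))
print(sorted(by_gender_and_city))
```

Only combinations that actually occur in the data produce a segment, so the number of segments can be smaller than the product of the unique values of each feature.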

Segments can be specified in two different ways:
- At the feature level (e.g., a column name such as "Gender" or "Product Category")
- At the feature-value level (e.g., a specific value of a given column, such as "Product Category":"Books")



# Table of Contents

- Intro to Segmentation
- Segmentation on different features (Feature level)
- Segmentation on Key-values (Feature-value level)
- Auto segmentation
- Merging the Profiles

# Segmentation on different features

Let's use some sample data for the following steps of this notebook.
We'll be using data from the [Retail Case Study Data](https://www.kaggle.com/darpan25bajaj/retail-case-study-data). The data was modified to contain features for a specific task: predicting whether a transaction is a purchase cancellation or not. For this example, we'll look only at the input features and target label, not at the prediction output itself.

In the dataset, we have information on the transaction itself, such as total amount and item price, as well as on the product (category and subcategory) and the customer (Age, Gender, City Code).

We'll use data for one given day, logging a single batch of data.

In [None]:
!pip install whylogs
!pip install pybars3

In [1]:
import pandas as pd
daily_df = pd.read_csv("https://whylabs-public.s3.us-west-2.amazonaws.com/whylogs_examples/retail-daily-features.csv")


In [2]:
daily_df.head()

Unnamed: 0,Transaction ID,Customer ID,Product Subcategory Code,Product Category Code,Item Price,Total Tax,Total Amount,Store Type,Product Category,Product Subcategory,Date of Birth,Gender,City Code,Age at Transaction Date,Purchase Canceled,Transaction Day of Week,Transaction Week,Transaction Batch
0,T25601292314,C268458,12,6,114.9,24.129,253.929,TeleShop,Home and kitchen,Tools,1976-10-08,M,1.0,36.0,0.0,0,0,0
1,T1465175267,C271344,3,5,107.7,22.617,238.017,e-Shop,Books,Comics,1970-01-29,F,5.0,43.0,0.0,0,0,0
2,T4968790114,C272305,4,3,14.6,7.665,80.665,e-Shop,Electronics,Mobiles,1975-08-25,F,10.0,37.0,0.0,0,0,0
3,T50504166310,C275057,4,4,15.7,4.9455,52.0455,MBR,Bags,Women,1980-09-17,M,7.0,32.0,0.0,0,0,0
4,T10877729712,C270074,10,5,144.1,45.3915,477.6915,e-Shop,Books,Non-Fiction,1983-02-20,M,10.0,30.0,0.0,0,0,0


Let's first segment our profiles according to the customer's `Gender` and `Product Category`.
Let's take a look at the possible categories of each feature:

In [3]:
daily_df['Gender'].unique().tolist()

['M', 'F']

In [4]:
daily_df['Product Category'].unique().tolist()

['Home and kitchen', 'Books', 'Electronics', 'Bags', 'Footwear', 'Clothing']

We can obtain the profile's segments upon logging the dataframe by specifying the column names we want to segment on:

In [5]:

from datetime import datetime

from whylogs import get_or_create_session

session = get_or_create_session()

# Column names to segment on: one profile per observed (Gender, Product Category) pair
features_to_segment = ['Gender', 'Product Category']
now = datetime.today()

with session.logger("segment-test", dataset_timestamp=now) as logger:
    logger.log_dataframe(daily_df, segments=features_to_segment)
    # Keep the segmented profiles in memory so we can inspect them below
    profile_segments = logger.segmented_profiles
    

WARN: Missing config


After the `with` statement is closed, the profiles are written to disk, but let's also store them in-memory as `profile_segments` to make them easier to inspect.

`profile_segments` is a dict that maps each segment combination to its profile. We can tell the profiles apart by inspecting the `tags` of each one, which show the `Gender` and `Product Category` values the profile was segmented on.

In [6]:
for k, prof in profile_segments.items():
    print(prof.tags)

{'whylogs.tag.Gender': 'F', 'whylogs.tag.Product Category': 'Bags', 'name': 'segment-test'}
{'whylogs.tag.Gender': 'F', 'whylogs.tag.Product Category': 'Books', 'name': 'segment-test'}
{'whylogs.tag.Gender': 'F', 'whylogs.tag.Product Category': 'Clothing', 'name': 'segment-test'}
{'whylogs.tag.Gender': 'F', 'whylogs.tag.Product Category': 'Electronics', 'name': 'segment-test'}
{'whylogs.tag.Gender': 'F', 'whylogs.tag.Product Category': 'Footwear', 'name': 'segment-test'}
{'whylogs.tag.Gender': 'F', 'whylogs.tag.Product Category': 'Home and kitchen', 'name': 'segment-test'}
{'whylogs.tag.Gender': 'M', 'whylogs.tag.Product Category': 'Bags', 'name': 'segment-test'}
{'whylogs.tag.Gender': 'M', 'whylogs.tag.Product Category': 'Books', 'name': 'segment-test'}
{'whylogs.tag.Gender': 'M', 'whylogs.tag.Product Category': 'Clothing', 'name': 'segment-test'}
{'whylogs.tag.Gender': 'M', 'whylogs.tag.Product Category': 'Electronics', 'name': 'segment-test'}
{'whylogs.tag.Gender': 'M', 'whylogs.tag
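
Since every segment carries its column values in `tags`, picking out one segment amounts to a dictionary comparison. The sketch below shows one possible helper for that. `find_segments` and `TAG_PREFIX` are hypothetical names, and the objects it runs on are simple stand-ins that only mimic the `.tags` attribute printed above, so the snippet runs without whylogs.

```python
from types import SimpleNamespace

TAG_PREFIX = "whylogs.tag."  # prefix taken from the tags printed above


def find_segments(profiles, wanted):
    """Return the profiles whose tags match every column/value pair in `wanted`."""
    return [
        prof
        for prof in profiles
        if all(prof.tags.get(TAG_PREFIX + col) == val for col, val in wanted.items())
    ]


# Hypothetical stand-ins that only carry a `.tags` dict, copied from the output above.
stand_ins = [
    SimpleNamespace(tags={"whylogs.tag.Gender": "F", "whylogs.tag.Product Category": "Bags", "name": "segment-test"}),
    SimpleNamespace(tags={"whylogs.tag.Gender": "F", "whylogs.tag.Product Category": "Clothing", "name": "segment-test"}),
    SimpleNamespace(tags={"whylogs.tag.Gender": "M", "whylogs.tag.Product Category": "Clothing", "name": "segment-test"}),
]

matches = find_segments(stand_ins, {"Gender": "M", "Product Category": "Clothing"})
print([m.tags for m in matches])
```

With the real profiles, passing `profile_segments.values()` as the first argument should behave the same way, assuming each profile exposes `tags` as a dict like the ones printed above.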

Each profile is a statistical fingerprint of the data, segmented on the related columns. Let's take a look at a simple summary that can be extracted from the profile for, say, male customers who bought (or cancelled) Clothing products:

In [7]:
product_label = 'Clothing'
gender_label = 'M'
for k, prof in profile_segments.items():
    if product_label==prof.tags['whylogs.tag.Product Category'] and gender_label==prof.tags['whylogs.tag.Gender']:
        profile_summary = prof.flat_summary()['summary']
        target_profile = prof
        print(profile_summary.head())

                     column  count  null_count  bool_count  numeric_count  \
0                 City Code   66.0         0.0         0.0           66.0   
1                Store Type   66.0         0.0         0.0            0.0   
2   Age at Transaction Date   66.0         0.0         0.0           66.0   
3             Date of Birth   66.0         0.0         0.0            0.0   
4  Product Subcategory Code   66.0         0.0         0.0           66.0   

    max       mean   min    stddev  nunique_numbers  ...  stddev_token_length  \
0  10.0   5.469697   1.0  2.735632             10.0  ...             0.000000   
1   0.0   0.000000   0.0  0.000000              0.0  ...             0.361298   
2  41.0  29.803030  20.0  6.043992             22.0  ...             0.000000   
3   0.0   0.000000   0.0  0.000000              0.0  ...             0.000000   
4   4.0   2.530303   1.0  1.338423              3.0  ...             0.000000   

   quantile_0.0000  quantile_0.0100  quantile_0.05

To make this more readable, let's use the `NotebookProfileViewer` to display simple statistics for a given feature.
We might be interested in looking at the `Purchase Canceled` feature:

In [8]:
from whylogs.viz import NotebookProfileViewer
feature_name = "Purchase Canceled"

print("Feature Statistics for:\n    Feature:{}\nFor profile segment of:\n    Gender: {}\n    Product Category: {}".format(feature_name,gender_label,product_label))
visualization = NotebookProfileViewer()
visualization.set_profiles(target_profile=target_profile)
visualization.feature_statistics(feature_name=feature_name)

Feature Statistics for:
    Feature:Purchase Canceled
For profile segment of:
    Gender: M
    Product Category: Clothing


Around 6% of transactions were cancellations for male customers and clothing products. This is lower than the mean for other segments and than the overall rate of about 9.6% (the mean of `Purchase Canceled` in the merged profile summary at the end of this notebook). You can verify this by inspecting the other segments!
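
The comparison between one segment and the whole dataset can be sketched in plain Python. The rows below are invented, so the rates are not those of the retail dataset, and `cancellation_rates` is a hypothetical helper. Missing labels are skipped, on the assumption that a mean over the non-null values is what we want to compare.

```python
from collections import defaultdict

# Invented rows; `None` marks a missing target label.
rows = [
    {"Gender": "M", "Product Category": "Clothing", "Purchase Canceled": 0.0},
    {"Gender": "M", "Product Category": "Clothing", "Purchase Canceled": 0.0},
    {"Gender": "M", "Product Category": "Clothing", "Purchase Canceled": 1.0},
    {"Gender": "M", "Product Category": "Clothing", "Purchase Canceled": 0.0},
    {"Gender": "F", "Product Category": "Books", "Purchase Canceled": 1.0},
    {"Gender": "F", "Product Category": "Books", "Purchase Canceled": None},
    {"Gender": "F", "Product Category": "Books", "Purchase Canceled": 0.0},
]


def cancellation_rates(rows, features, target="Purchase Canceled"):
    """Mean of `target` per segment, ignoring rows where the target is missing."""
    totals = defaultdict(lambda: [0.0, 0])  # segment key -> [sum, count]
    for row in rows:
        value = row[target]
        if value is None:
            continue
        key = tuple(row[f] for f in features)
        totals[key][0] += value
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


per_segment = cancellation_rates(rows, ["Gender", "Product Category"])
overall = cancellation_rates(rows, [])[()]  # no features: a single, global segment

for key, rate in sorted(per_segment.items()):
    print(key, f"{rate:.2f}", "below overall" if rate < overall else "at or above overall")
```

With real profiles, the `mean` column of `flat_summary()['summary']` for `Purchase Canceled` plays the role of `rate` here.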

# Segmentation on Key-values

The second way to define segments is to specify the exact values of given columns you want to segment on. We do this by specifying key-value pairs.

Suppose we are only interested in transactions from a given type of store, `e-Shop`.

In [9]:
daily_df['Store Type'].value_counts()

e-Shop            375
TeleShop          184
MBR               176
Flagship store    174
Name: Store Type, dtype: int64

As before, we specify `features_to_segment`, but this time we pass a list of lists. Each inner list holds one or more dicts with a `key` and a `value` field, like this:

In [10]:
import numpy as np
from whylogs import get_or_create_session
session = get_or_create_session()

features_to_segment = [[{"key": "Store Type", "value": "e-Shop"}]]

now = datetime.today()

with session.logger("segment-test", dataset_timestamp=now) as logger:
    logger.log_dataframe(daily_df,segments=features_to_segment)
    profile_segments = logger.segmented_profiles



As before, we can take a look at the segments. In this case, we only have the segment related to transactions that took place in the e-Shop:

In [11]:
for k, prof in profile_segments.items():
    print(prof.tags)

{'whylogs.tag.Store Type': 'e-Shop', 'name': 'segment-test'}


When segmenting at the feature-value level in Python, each nested list holds one or more conditions that together define one segment. In the example above, we get only the segment for which `Store Type` is equal to `e-Shop`. We could define additional segments, such as:


In [12]:
features_to_segment = [[{"key": "Store Type", "value": "e-Shop"}],[{"key": "Store Type", "value": "TeleShop"},{"key": "Product Subcategory", "value": "Comics"}]]
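
One way to read such a specification is sketched below in plain Python. It assumes that all conditions inside one nested list must hold at the same time, and that rows matching no segment are simply left out. Both are assumptions about the semantics, not a description of how whylogs implements it. `row_matches`, `assign_segments` and the rows are hypothetical.

```python
features_to_segment = [
    [{"key": "Store Type", "value": "e-Shop"}],
    [
        {"key": "Store Type", "value": "TeleShop"},
        {"key": "Product Subcategory", "value": "Comics"},
    ],
]

# Invented rows, reusing category names that appear in the dataset.
rows = [
    {"Store Type": "e-Shop", "Product Subcategory": "Comics"},
    {"Store Type": "e-Shop", "Product Subcategory": "Mobiles"},
    {"Store Type": "TeleShop", "Product Subcategory": "Comics"},
    {"Store Type": "TeleShop", "Product Subcategory": "Tools"},
    {"Store Type": "MBR", "Product Subcategory": "Women"},
]


def row_matches(conditions, row):
    """True if the row satisfies every key/value condition of one segment."""
    return all(row.get(cond["key"]) == cond["value"] for cond in conditions)


def assign_segments(segment_specs, rows):
    """Map each segment (by position in the spec) to the rows that satisfy it."""
    return {
        index: [row for row in rows if row_matches(conditions, row)]
        for index, conditions in enumerate(segment_specs)
    }


assigned = assign_segments(features_to_segment, rows)
for index, matched in assigned.items():
    print(index, len(matched))
```

Under this reading, the second segment only contains TeleShop transactions for Comics, and the MBR row belongs to no segment at all.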

# Auto segmentation

In addition to manual segmentation, we can also automatically estimate the most important features and values on which to segment. This is done in the whylogs library using entropy-based methods. The intuition is that the columns with the most entropy with respect to a target feature will probably be interesting ones to segment on.
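
To give a feel for what an entropy-based score looks like, here is a small sketch of information gain, a common score of this kind: how much the entropy of the target drops once the data is split by a candidate column. This is an illustration of the intuition only. Whether whylogs computes exactly this quantity is an assumption this notebook does not confirm, and the rows are invented.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def information_gain(rows, feature, target):
    """Entropy of `target` minus its weighted entropy after splitting on `feature`."""
    base = entropy([row[target] for row in rows])
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Invented rows: `City Code` fully determines the target, `Gender` tells us nothing.
rows = [
    {"City Code": 1, "Gender": "M", "Purchase Canceled": 0},
    {"City Code": 1, "Gender": "F", "Purchase Canceled": 0},
    {"City Code": 2, "Gender": "M", "Purchase Canceled": 1},
    {"City Code": 2, "Gender": "F", "Purchase Canceled": 1},
]

scores = {
    feature: information_gain(rows, feature, "Purchase Canceled")
    for feature in ["City Code", "Gender"]
}
best = max(scores, key=scores.get)
print(scores, best)
```

In this toy case `City Code` gets the highest score, so it would be the more interesting column to segment on.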

To obtain the columns that have the most entropy with respect to our feature of interest (`Purchase Canceled`), we can pass the dataframe to the `estimate_segments` method:

In [13]:
from whylogs import get_or_create_session
sess = get_or_create_session()

auto_segments = sess.estimate_segments(daily_df, max_segments=20,target_field="Purchase Canceled",name="demo1")

auto_segments

['City Code']

Note that we can also specify the maximum number of segments. In this case, we specify a maximum of 20 segments.

If no target field is specified, the method will find a suitable field based on the maximum entropy column.

`estimate_segments` returns a list of column names, which in turn can be passed as the `segments` argument to `log_dataframe`:

In [14]:
import numpy as np
from whylogs import get_or_create_session
session = get_or_create_session()

now = datetime.today()

with session.logger("segment-test", dataset_timestamp=now) as logger:
    logger.log_dataframe(daily_df,segments=auto_segments)
    profile_segments = logger.segmented_profiles


Once again, let's check the segmented profiles. In this case, we have 10 different segments, one for each `City Code`:

In [15]:
for k, prof in profile_segments.items():
    print(prof.tags)

{'whylogs.tag.City Code': '1.0', 'name': 'segment-test'}
{'whylogs.tag.City Code': '2.0', 'name': 'segment-test'}
{'whylogs.tag.City Code': '3.0', 'name': 'segment-test'}
{'whylogs.tag.City Code': '4.0', 'name': 'segment-test'}
{'whylogs.tag.City Code': '5.0', 'name': 'segment-test'}
{'whylogs.tag.City Code': '6.0', 'name': 'segment-test'}
{'whylogs.tag.City Code': '7.0', 'name': 'segment-test'}
{'whylogs.tag.City Code': '8.0', 'name': 'segment-test'}
{'whylogs.tag.City Code': '9.0', 'name': 'segment-test'}
{'whylogs.tag.City Code': '10.0', 'name': 'segment-test'}


# Merging the Profiles

In case you want the complete profile from the segmented ones, you can make use of the fact that DatasetProfiles are mergeable.
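
Why merging works can be illustrated with a deliberately tiny stand-in for a profile. The `Summary` class below is hypothetical and tracks only count, sum, min and max, all of which can be combined exactly. Real profiles track far more, so treat this as a sketch of the mergeability idea rather than of how `DatasetProfile` works internally.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Summary:
    """A toy, mergeable summary of one numeric column."""
    count: int
    total: float
    minimum: float
    maximum: float

    @classmethod
    def of(cls, values):
        return cls(len(values), sum(values), min(values), max(values))

    def merge(self, other):
        # Each field can be combined without revisiting the raw values.
        return Summary(
            self.count + other.count,
            self.total + other.total,
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
        )

    @property
    def mean(self):
        return self.total / self.count


# Invented values for one column, already split by segment.
values_by_segment = {"1.0": [10.0, 30.0], "2.0": [5.0], "3.0": [20.0, 40.0, 15.0]}

per_segment = [Summary.of(values) for values in values_by_segment.values()]
combined = reduce(lambda left, right: left.merge(right), per_segment)

all_values = [v for values in values_by_segment.values() for v in values]
print(combined == Summary.of(all_values), combined.mean)
```

Merging the per-segment summaries gives the same result as summarizing all values at once, which is the property the segmented profiles rely on.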

Let's take the last example. If you want to get the complete profile back from the profiles for each `City Code`, you can just use the `.merge` method of each Profile object:

In [16]:
from functools import reduce
profiles = [prof for _,prof in profile_segments.items()]
merged = reduce(lambda x, y: x.merge(y), profiles)

`merged` is now the profile for the complete original DataFrame. Let's take a look at the merged profile's summary:

In [17]:
merged.flat_summary()['summary']

Unnamed: 0,column,count,null_count,bool_count,numeric_count,max,mean,min,stddev,nunique_numbers,...,stddev_token_length,quantile_0.0000,quantile_0.0100,quantile_0.0500,quantile_0.2500,quantile_0.5000,quantile_0.7500,quantile_0.9500,quantile_0.9900,quantile_1.0000
0,Date of Birth,908.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,,,,,,,,,
1,City Code,908.0,0.0,0.0,908.0,10.0,5.390969,1.0,2.887491,10.0,...,0.0,1.0,1.0,1.0,3.0,5.0,8.0,10.0,10.0,10.0
2,Product Category,908.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.76799,,,,,,,,,
3,Transaction Week,908.0,0.0,0.0,908.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Purchase Canceled,908.0,72.0,0.0,836.0,1.0,0.095694,0.0,0.294347,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
5,Transaction Batch,908.0,0.0,0.0,908.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Product Subcategory Code,908.0,0.0,0.0,908.0,12.0,6.008811,1.0,3.756645,12.0,...,0.0,1.0,1.0,1.0,3.0,5.0,10.0,12.0,12.0,12.0
7,Transaction ID,908.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,,,,,,,,,
8,Total Tax,908.0,0.0,0.0,908.0,78.2775,24.808725,0.861,18.5278,809.0,...,0.0,0.861,1.701,3.4545,9.9225,20.643,36.855,61.698002,75.809998,78.277496
9,Product Subcategory,908.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.425731,,,,,,,,,
