# PC Algorithm for Distribution Shift Detection in Tabular Data

The PC algorithm can detect the origins of distribution shifts in tabular data, whether continuous or discrete, with the help of a domain index variable. The detector treats a distribution shift as an intervention by the domain index on the root-cause node. It then runs PC, whose conditional independence (CI) tests recover the causal graph quickly, and reads the root cause of the anomaly off that graph. The detector supports both discrete and continuous variables. It handles nonlinear relationships by discretizing the continuous variables with K-means clustering and then using the discrete PC algorithm for CI testing and causal discovery.
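
The discretization step mentioned above can be sketched with a small one-dimensional K-means (Lloyd's algorithm) in plain NumPy. This is only an illustration of the idea, not the library's implementation: the function name `kmeans_discretize`, the quantile-based initialization, and the toy data are all assumptions made for this example.

```python
import numpy as np

def kmeans_discretize(x, n_states, n_iter=50):
    """Map a 1-D continuous array to integer states 0..n_states-1 (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    # Start centroids at evenly spaced quantiles so every cluster begins non-empty.
    centroids = np.quantile(x, (np.arange(n_states) + 0.5) / n_states)
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        labels = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
        new_centroids = centroids.copy()
        for k in range(n_states):
            if np.any(labels == k):
                new_centroids[k] = x[labels == k].mean()
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Sort centroids and recompute labels so that state indices increase with value.
    centroids = np.sort(centroids)
    labels = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
    return labels, centroids

# Toy data: two well-separated groups of 100 points each.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-5.0, 0.5, 100), rng.normal(5.0, 0.5, 100)])
labels, centroids = kmeans_discretize(x, n_states=2)
print(np.bincount(labels), centroids)
```

Once every continuous column is mapped to a small number of states in this way, CI tests designed for discrete data can be applied without assuming a linear relationship between variables.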

In [1]:
import numpy as np
import pandas as pd
from causalai.data.data_generator import DataGenerator
from causalai.application import TabularDistributionShiftDetector
from causalai.application.common import distshift_detector_preprocess

### Generate tabular data with two domains with distribution shifts

We add a distribution shift on **node b**. Because of the causal influences, the anomaly on node b propagates along the causal graph to node d as well. However, node d is not the cause of the distribution shift, so the algorithm should return only node b as the root cause of the anomaly.
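
The propagation argument above can be checked with a small, library-independent simulation. The sketch below assumes a linear SEM with standard normal noise that mirrors the structure used in this notebook. It uses partial correlation as a stand-in CI test, whereas the detector itself works on discretized data with discrete CI tests. The helper names `simulate` and `partial_corr` are hypothetical.

```python
import numpy as np

def simulate(n, shift, seed):
    """Draw n samples from a -> b, a -> c, (b, c) -> d, with an additive shift on b's input."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(n)
    b = (a + shift) + rng.standard_normal(n)  # the shift enters only at node b
    c = a + rng.standard_normal(n)
    d = b + c + rng.standard_normal(n)
    return np.column_stack([a, b, c, d])

def partial_corr(x, y, controls):
    """Correlation of x and y after regressing both on the control variables."""
    Z = np.column_stack([np.ones(len(x))] + list(controls))
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return float(np.corrcoef(rx, ry)[0, 1])

n = 1000
normal = simulate(n, shift=0.0, seed=0)
abnormal = simulate(n, shift=1.0, seed=1)
data = np.vstack([normal, abnormal])
domain = np.repeat([0.0, 1.0], n)  # 0 = normal domain, 1 = abnormal domain
a, b, c, d = data.T

# Marginally, both b and d move between domains, while a and c do not.
mean_shift = abnormal.mean(axis=0) - normal.mean(axis=0)
print(dict(zip("abcd", mean_shift.round(2))))

# The domain index stays dependent on b given its parent a ...
pc_b = partial_corr(domain, b, [a])
# ... but is (nearly) uncorrelated with d once a and b are controlled for.
pc_d = partial_corr(domain, d, [a, b])
print(round(pc_b, 3), round(pc_d, 3))
```

Conditioning on b alone would not be enough here: b is a collider on the path from the domain index through b and a to c and d, so a is added to the conditioning set to block that path. This is the kind of conditioning-set search that PC performs automatically.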

In [2]:
# Structural functions: identity in the normal domain, shifted by +1 in the abnormal domain
fn_normal = lambda x: x
fn_abnormal = lambda x: x + 1

coef = 1.0
sem_normal = {
        'a': [], 
        'b': [('a', coef, fn_normal)], 
        'c': [('a', coef, fn_normal)],
        'd': [('b', coef, fn_normal), ('c', coef, fn_normal)]
        }

sem_abnormal = {
        'a': [], 
        'b': [('a', coef, fn_abnormal)], 
        'c': [('a', coef, fn_normal)],
        'd': [('b', coef, fn_normal), ('c', coef, fn_normal)]
        }

# Number of samples per domain
T = 1000
data_array_normal, var_names, graph_gt = DataGenerator(sem_normal, T=T, seed=0, discrete=False)
data_array_abnormal, var_names, graph_gt = DataGenerator(sem_abnormal, T=T, seed=1, discrete=False)

In [3]:
df_normal = pd.DataFrame(data=data_array_normal, columns=var_names)
df_abnormal = pd.DataFrame(data=data_array_abnormal, columns=var_names)
# Domain index: 0 for the T normal samples, 1 for the T abnormal samples
c_idx = np.array([0]*T + [1]*T)

In [4]:
data_obj, var_names = distshift_detector_preprocess(
    data=[df_normal, df_abnormal],
    domain_index=c_idx,
    domain_index_name='domain_index',
    n_states=2
    )

In [5]:
var_names

['a', 'b', 'c', 'd', 'domain_index']

### Run the tabular distribution shift detector

In [6]:
model = TabularDistributionShiftDetector(
    data_obj=data_obj,
    var_names=var_names,
    domain_index_name='domain_index',
    prior_knowledge=None)

In [7]:
root_causes, graph = model.run(
    pvalue_thres=0.01,
    max_condition_set_size=4,
    return_graph=True
)

The distribution shifts are from the nodes: {'b'}


In [8]:
print(root_causes)

{'b'}


In [9]:
print(graph)

{'a': {'c', 'b', 'd'}, 'b': {'a', 'd'}, 'c': {'a', 'd'}, 'd': {'c', 'b', 'a'}, 'domain_index': {'b'}}
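
One way to inspect the printed graph is to compare it against the ground-truth SEM defined earlier. The sketch below assumes that each dictionary entry lists the nodes adjacent to its key, which is an interpretation of the printed output rather than documented behavior. The helper `undirected_edges` is hypothetical.

```python
# Graph exactly as printed above.
graph = {
    'a': {'c', 'b', 'd'},
    'b': {'a', 'd'},
    'c': {'a', 'd'},
    'd': {'c', 'b', 'a'},
    'domain_index': {'b'},
}

# Parents of each node in the ground-truth SEM used to generate the data.
true_parents = {'a': [], 'b': ['a'], 'c': ['a'], 'd': ['b', 'c']}

def undirected_edges(adjacency, skip=()):
    """Collect adjacencies as unordered pairs, ignoring any node named in `skip`."""
    return {
        frozenset((node, nbr))
        for node, nbrs in adjacency.items() if node not in skip
        for nbr in nbrs if nbr not in skip
    }

recovered = undirected_edges(graph, skip={'domain_index'})
truth = undirected_edges(true_parents)

extra = recovered - truth    # adjacencies in the output that the SEM does not contain
missing = truth - recovered  # SEM edges absent from the output
domain_neighbors = graph['domain_index']

print(sorted(map(sorted, extra)), sorted(map(sorted, missing)), domain_neighbors)
```

Under this reading, every edge of the SEM appears in the output, together with one additional a-d adjacency, and the only node attached to `domain_index` is b, which matches `root_causes`.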
