# PC Algorithm for Root Cause Analysis of Microservice Failure

The Peter-Clark (PC) algorithm is one of the most general-purpose algorithms for causal discovery: it can be used for both tabular and time series data, of both continuous and discrete types. As proposed in CD-NOD [1], the PC algorithm can be tailored for root cause analysis by treating the failure as an intervention on the root cause, which conditional independence tests can then detect quickly; [2] applies causal discovery to root cause analysis of failures in microservices. Let us see how the PC algorithm, with slight modifications to the PriorKnowledge sets, can be adapted for root cause analysis of continuous microservice monitoring metrics.

References:

[1] Huang, Biwei, Kun Zhang, Jiji Zhang, Joseph Ramsey, Ruben Sanchez-Romero, Clark Glymour, and Bernhard Schölkopf. "Causal discovery from heterogeneous/nonstationary data." The Journal of Machine Learning Research 21, no. 1 (2020): 3482-3534.

[2] Ikram, Azam, Sarthak Chakraborty, Subrata Mitra, Shiv Saini, Saurabh Bagchi, and Murat Kocaoglu. "Root Cause Analysis of Failures in Microservices through Causal Discovery." Advances in Neural Information Processing Systems 35 (2022): 31158-31170.
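
To make the "failure as an intervention" idea concrete, here is a minimal sketch on synthetic data. It is not the causalai implementation and not the exact procedure of [1] or [2]: the variable names (`failure`, `root`, `child`) and the Fisher-z partial correlation test are assumptions chosen for illustration. A binary failure indicator is appended to the data; a metric is a root-cause candidate if it stays dependent on the indicator even after conditioning on the other metrics, whereas a downstream metric becomes independent of the indicator once the root cause is conditioned on.

```python
import numpy as np
from scipy import stats


def fisher_z_pvalue(x, y, z=None):
    """p-value for H0: x is linearly independent of y given the columns of z."""
    n = len(x)
    k = 0
    if z is not None:
        # Regress the conditioning variables (plus an intercept) out of x and y.
        design = np.column_stack([np.ones(n), z])
        x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
        y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
        k = design.shape[1] - 1
    r = np.clip(np.corrcoef(x, y)[0, 1], -0.999999, 0.999999)
    stat = np.sqrt(n - k - 3) * np.arctanh(r)
    return 2 * stats.norm.sf(abs(stat))


rng = np.random.default_rng(0)
n = 1000
# Hypothetical failure indicator: 0 for the normal period, 1 for the incident.
failure = np.concatenate([np.zeros(n), np.ones(n)])
# The failure acts as an intervention that shifts only the root-cause metric.
root = rng.normal(size=2 * n) + 2.0 * failure
# A downstream metric inherits the shift through the root cause.
child = root + rng.normal(scale=0.5, size=2 * n)

p_root = fisher_z_pvalue(failure, root)
p_child = fisher_z_pvalue(failure, child)
p_root_given_child = fisher_z_pvalue(failure, root, child)
p_child_given_root = fisher_z_pvalue(failure, child, root)

print(f"failure vs root:         p = {p_root:.3g}")
print(f"failure vs child:        p = {p_child:.3g}")
print(f"failure vs root | child: p = {p_root_given_child:.3g}")
print(f"failure vs child | root: p = {p_child_given_root:.3g}")
```

Both metrics depend on the failure indicator marginally, but only the dependence on `root` should survive conditioning. That asymmetry is what lets conditional independence tests separate the root cause from the metrics that merely inherit the anomaly.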

In [1]:
import numpy as np
import pandas as pd
from scipy.stats import truncexpon, halfnorm
from causalai.application import RootCauseDetector
from causalai.application.common import rca_preprocess

### Generate cloud monitoring metrics data

We create a distribution shift in the marginal (external noise) of the Caching Service. Because the Caching Service is an upstream cause of the Product Service in the causal graph, the anomaly propagates downstream to the Product Service.
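
As a quick illustration of what such a shift looks like, the sketch below samples the Caching Service noise with and without a constant offset of 2 and pushes it through a simplified downstream equation. The helper `caching_noise` and the two-parent `product_latency` are assumptions made for brevity; the notebook's own generator, defined next, uses more services.

```python
import numpy as np
from scipy.stats import halfnorm, ks_2samp

num_samples = 1000
rng = np.random.default_rng(0)


def caching_noise(shift=0.0):
    # Intrinsic latency of the Caching Service, optionally shifted by a constant.
    return shift + halfnorm.rvs(size=num_samples, loc=0.1, scale=0.1, random_state=rng)


def product_latency(caching):
    # Simplified downstream equation: wait for the slower dependency, then add own noise.
    shipping = halfnorm.rvs(size=num_samples, loc=0.1, scale=0.2, random_state=rng)
    own_noise = halfnorm.rvs(size=num_samples, loc=0.1, scale=0.2, random_state=rng)
    return np.maximum(shipping, caching) + own_noise


caching_normal = caching_noise()
caching_anomalous = caching_noise(shift=2.0)

# Shift at the source and the shift it induces downstream.
caching_shift = caching_anomalous.mean() - caching_normal.mean()
product_shift = product_latency(caching_anomalous).mean() - product_latency(caching_normal).mean()

# A two-sample Kolmogorov-Smirnov test confirms the marginal has changed.
ks_pvalue = ks_2samp(caching_normal, caching_anomalous).pvalue

print(f"mean shift in Caching Service: {caching_shift:.2f}")
print(f"mean shift in Product Service: {product_shift:.2f}")
print(f"KS test p-value:               {ks_pvalue:.3g}")
```

The downstream metric shifts by a similar amount even though its own noise distribution is unchanged, which is why looking only at the top-level latency cannot tell us where the incident started.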

In [2]:
def create_observed_latency_data(unobserved_intrinsic_latencies):
    observed_latencies = {}
    observed_latencies['Product DB'] = unobserved_intrinsic_latencies['Product DB']
    observed_latencies['Customer DB'] = unobserved_intrinsic_latencies['Customer DB']
    observed_latencies['Order DB'] = unobserved_intrinsic_latencies['Order DB']
    observed_latencies['Shipping Cost Service'] = unobserved_intrinsic_latencies['Shipping Cost Service']
    observed_latencies['Caching Service'] = np.random.choice([0, 1], size=(len(observed_latencies['Product DB']),),
                                                             p=[.5, .5]) * \
                                            observed_latencies['Product DB'] \
                                            + unobserved_intrinsic_latencies['Caching Service']
    observed_latencies['Product Service'] = np.maximum(np.maximum(observed_latencies['Shipping Cost Service'],
                                                                  observed_latencies['Caching Service']),
                                                       observed_latencies['Customer DB']) \
                                            + unobserved_intrinsic_latencies['Product Service']

    return pd.DataFrame(observed_latencies)


def unobserved_intrinsic_latencies_normal(num_samples):
    return {
        'Product Service': halfnorm.rvs(size=num_samples, loc=0.1, scale=0.2),
        'Shipping Cost Service': halfnorm.rvs(size=num_samples, loc=0.1, scale=0.2),
        'Caching Service': halfnorm.rvs(size=num_samples, loc=0.1, scale=0.1),
        'Order DB': truncexpon.rvs(size=num_samples, b=5, scale=0.2),
        'Customer DB': truncexpon.rvs(size=num_samples, b=6, scale=0.2),
        'Product DB': truncexpon.rvs(size=num_samples, b=10, scale=0.2)
    }

def unobserved_intrinsic_latencies_anomalous(num_samples):
    return {
        'Product Service': halfnorm.rvs(size=num_samples, loc=0.1, scale=0.2),
        'Shipping Cost Service': halfnorm.rvs(size=num_samples, loc=0.1, scale=0.2),
        'Caching Service': 2 + halfnorm.rvs(size=num_samples, loc=0.1, scale=0.1),
        'Order DB': truncexpon.rvs(size=num_samples, b=5, scale=0.2),
        'Customer DB': truncexpon.rvs(size=num_samples, b=6, scale=0.2),
        'Product DB': truncexpon.rvs(size=num_samples, b=10, scale=0.2)
    }

In [3]:
normal_data = create_observed_latency_data(unobserved_intrinsic_latencies_normal(1000))
outlier_data = create_observed_latency_data(unobserved_intrinsic_latencies_anomalous(1000))

In [4]:
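# Candidate root-cause metrics: the services and databases below Product Service.
lower_level_columns = ['Customer DB', 'Shipping Cost Service', 'Caching Service', 'Product DB']
# Product Service latency over both periods (normal first, then anomalous);
# it is passed to rca_preprocess as the time metric.
upper_level_metric = normal_data['Product Service'].tolist() + outlier_data['Product Service'].tolist()
# Keep only the candidate columns in each period.
outlier_data = outlier_data[lower_level_columns]
normal_data = normal_data[lower_level_columns]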

In [5]:
data_obj, var_names = rca_preprocess(
    data=[normal_data, outlier_data],
    time_metric=upper_level_metric,
    time_metric_name='time'
)

### Run root cause analysis

In [6]:
model = RootCauseDetector(
    data_obj=data_obj,
    var_names=var_names,
    time_metric_name='time',
    prior_knowledge=None
)

In [7]:
root_causes, graph = model.run(
    pvalue_thres=0.001,
    max_condition_set_size=4,
    return_graph=True
)

The root cause(s) of the incident are: {'Caching Service'}


In [8]:
print(root_causes)

{'Caching Service'}


In [9]:
print(graph)

{'Customer DB': set(), 'Shipping Cost Service': set(), 'Caching Service': set(), 'Product DB': set(), 'time': {'Caching Service'}}
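
The printed graph can be inspected programmatically. The sketch below copies the dictionary from the output above and assumes, without confirmation from the library documentation, that each key maps to the set of variables with an edge into it; under that reading, the entry for the `'time'` node holds the detected root causes. The names `linked_to_time` and `no_incoming_edges` are illustrative.

```python
# Graph copied verbatim from the printed output above.
graph = {
    'Customer DB': set(),
    'Shipping Cost Service': set(),
    'Caching Service': set(),
    'Product DB': set(),
    'time': {'Caching Service'},
}
time_metric_name = 'time'

# Variables attached to the time node; here this matches the reported root causes.
linked_to_time = graph[time_metric_name]

# Metrics for which no incoming edge was recorded.
no_incoming_edges = sorted(
    name for name, linked in graph.items()
    if name != time_metric_name and not linked
)

print(linked_to_time)
print(no_incoming_edges)
```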
