Merge pull request #13 from salesforce/revise_docs
Revise the docs
yangwenzhuo08 committed Apr 20, 2023
2 parents f4ef007 + aba275f commit 9d37fbf
Showing 3 changed files with 87 additions and 84 deletions.
95 changes: 48 additions & 47 deletions README.md
@@ -22,26 +22,27 @@
With the rapidly growing adoption of microservices architectures, multi-service applications have become the standard
paradigm in real-world IT applications. A multi-service application usually contains hundreds of interacting
services, making it harder to detect service failures and identify the root causes. Root cause analysis (RCA)
methods leverage the KPI metrics monitored on those services to determine the root causes when a system failure
is detected, helping engineers and SREs in the troubleshooting process.

PyRCA is a Python machine-learning library designed for metric-based RCA, offering multiple state-of-the-art RCA
algorithms and an end-to-end pipeline for building RCA solutions. PyRCA includes two types of algorithms: 1.
Identifying anomalous metrics in parallel with the observed anomaly via metric data analysis, e.g., ε-diagnosis,
and 2. Identifying root causes based on a topology/causal graph representing the causal relationships between
the observed metrics, e.g., Bayesian inference, Random Walk. Besides, PyRCA provides a convenient tool
for building causal graphs from the observed time series data and domain knowledge, helping users to develop
topology/causal graph based solutions quickly. PyRCA also provides a benchmark for evaluating various RCA
methods, which is valuable for industry and academic research.

The following list shows the supported RCA methods and features in our library:
methods usually leverage the KPI metrics, traces or logs monitored on those services to determine the root causes
when a system failure is detected, helping engineers and SREs in the troubleshooting process.

PyRCA is a Python machine-learning library designed for root cause analysis, offering multiple state-of-the-art RCA
algorithms and an end-to-end pipeline for building RCA solutions. Currently, PyRCA mainly focuses on metric-based RCA
including two types of algorithms: 1. Identifying anomalous metrics in parallel with the observed anomaly via
metric data analysis, e.g., ε-diagnosis, and 2. Identifying root causes based on a topology/causal graph representing
the causal relationships between the observed metrics, e.g., Bayesian inference, Random Walk. Besides, PyRCA
provides a convenient tool for building causal graphs from the observed time series data and domain knowledge,
helping users to develop graph-based solutions quickly. PyRCA also provides a benchmark for evaluating
various RCA methods, which is valuable for industry and academic research.

The following list shows the supported RCA methods in our library:
1. ε-Diagnosis
2. Bayesian Inference-based Root Cause Analysis
3. Random Walk-based Root Cause Analysis
4. Ψ-PC-based Root Cause Analysis
5. Causal Inference-based Root Cause Analysis (CIRCA)

We will continue improving this library to make it more comprehensive in the future.
We will continue improving this library to make it more comprehensive. In the future,
PyRCA will also support trace- and log-based RCA methods.

## Installation

@@ -56,8 +57,8 @@ cloning the PyRCA repo, navigating to the root directory, and calling

## Getting Started

PyRCA provides a unified interface for training RCA models and finding root causes, you only need
to specify
PyRCA provides a unified interface for training RCA models and finding root causes. To apply
a certain RCA method, you only need to specify:

- **The selected RCA method**: e.g., ``BayesianNetwork``, ``EpsilonDiagnosis``.
- **The RCA configuration**: e.g., ``BayesianNetworkConfig``, ``EpsilonDiagnosisConfig``.
@@ -66,8 +67,8 @@ to specify
- **Some detected anomalous KPI metrics**: Some RCA methods require the anomalous KPI metrics detected by
a certain anomaly detector.

Let's take ``BayesianNetwork`` as an example. Suppose that ``graph_df`` is a pandas dataframe encoding
the causal graph representing causal relationships between metrics (how to construct such causal graph
Let's take ``BayesianNetwork`` as an example. Suppose that ``graph_df`` is a pandas dataframe of
the graph representing causal relationships between metrics (how to construct such causal graph
will be discussed later), and ``df`` is a pandas dataframe containing the historical observed time series
data (e.g., the index is the timestamp and each column represents one monitored metric). To train a
``BayesianNetwork``, you can simply run the following code:
@@ -79,8 +80,8 @@ model.train(df)
model.save("model_folder")
```

After the model is trained, you can use it for root cause analysis given a list of detected anomalous
metrics by a certain anomaly detector, e.g.,
After the model is trained, you can use it to find root causes of an incident given a list of anomalous
metrics detected by a certain anomaly detector, e.g.,

```python
from pyrca.analyzers.bayesian import BayesianNetwork
@@ -89,7 +90,7 @@ results = model.find_root_causes(["observed_anomalous_metric", ...])
print(results.to_dict())
```

For other RCA methods, you can use similar code for discovering root causes. For example, if you want
For other RCA methods, you can write code similar to the above to find root causes. For example, if you want
to try ``EpsilonDiagnosis``, you can initialize ``EpsilonDiagnosis`` as follows:

```python
@@ -98,31 +99,31 @@ model = EpsilonDiagnosis(config=EpsilonDiagnosis.config_class(alpha=0.01))
model.train(normal_data)
```

Here ``normal_data`` is the historical observed time series data without anomalies. To find root causes,
Here ``normal_data`` is the historically observed time series data without anomalies. To identify root causes,
you can run:

```python
results = model.find_root_causes(abnormal_data)
print(results.to_dict())
```

where ``abnormal_data`` is the time series data in an incident window.
where ``abnormal_data`` is the time series data collected in an incident window.

As mentioned above, some RCA methods require causal graphs as their inputs. To construct such causal
As mentioned above, some RCA methods such as ``BayesianNetwork`` require causal graphs as their inputs. To construct such causal
graphs from the observed time series data, you can utilize our tool by running ``python -m pyrca.tools``.
This command will launch a Dash app for time series data analysis and causal discovery.
![alt text](https://github.com/salesforce/PyRCA/raw/main/docs/_static/dashboard.png)

The dashboard allows you to try different causal discovery methods, change causal discovery parameters,
The dashboard allows you to try different causal discovery methods, adjust causal discovery parameters,
add domain knowledge constraints (e.g., root/leaf nodes, forbidden/required links), and visualize
the generated causal graph. It makes easier for manually updating causal graphs with domain knowledge.
If you satisfy with the results after several iterations, you can download the results that can be
used by the RCA methods supported in PyRCA.
the generated causal graphs. It makes it easier to manually revise causal graphs based on domain knowledge.
You can download the graph generated by this tool if you are satisfied with it. The graph can be used by the RCA
methods supported in PyRCA.

Instead of using this dashboard, you can also write code for causal discovery. The package
``pyrca.graphs.causal`` includes several causal discovery methods you can use. All of these methods
are adjusted to support domain knowledge constraints. Suppose ``df`` is the monitored time series data
and you want to apply PC for discovering causal graphs, then the following code will help:
Instead of using this dashboard, you can also write code for building such graphs. The package
``pyrca.graphs.causal`` includes several popular causal discovery methods you can use. All of these methods
support domain knowledge constraints. Suppose ``df`` is the observed time series data
and you want to apply the PC algorithm for building causal graphs, then the following code will help:

```python
from pyrca.graphs.causal.pc import PC
@@ -156,19 +157,19 @@ This domain knowledge file states that:
3. There is no connection from A to E, and
4. There is a connection from A to C.

You can modify this file according to your domain knowledge for generating more reliable causal
You can write your domain knowledge file based on this template for generating more reliable causal
graphs.
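
To make the effect of such constraints concrete, the sketch below applies forbidden and required links to a set of candidate edges. It is a library-independent illustration, not PyRCA's implementation: the helper name ``apply_domain_knowledge``, the edge-set representation, and the candidate edges are all assumptions made for this example. Only the two constraints stated in items 3 and 4 above are used.

```python
def apply_domain_knowledge(edges, forbids=(), requires=()):
    """Return a new edge set that honors forbidden and required links.

    Hypothetical helper for illustration only; PyRCA applies such
    constraints inside its own causal discovery methods.
    """
    constrained = set(edges)
    # Drop every link that domain knowledge rules out.
    constrained -= set(forbids)
    # Add every link that domain knowledge says must exist.
    constrained |= set(requires)
    return constrained


# Candidate edges (parent, child), e.g. proposed by a discovery algorithm.
# These values are made up for the example.
candidate_edges = {("A", "B"), ("A", "E"), ("B", "C")}

edges = apply_domain_knowledge(
    candidate_edges,
    forbids=[("A", "E")],   # item 3: no connection from A to E
    requires=[("A", "C")],  # item 4: a connection from A to C
)
print(sorted(edges))
```

In this sketch a link listed as both forbidden and required ends up in the graph, because required links are added last; a real tool might instead reject such a contradictory file.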

## Application Example

[Here](https://github.com/salesforce/PyRCA/tree/main/pyrca/applications/example) is an example
of applying ``BayesianNetwork`` to build a solution for RCA. The "config" folder includes the setups
for the stats-based anomaly detector and the domain knowledge. The "models" folder stores the causal
graph and the trained Bayesian network. The ``RCAEngine`` in the "rca.py" file implements all the
methods for building causal graphs, training Bayesian networks and finding root causes by utilizing
the modules provides by PyRCA. You can directly use this class if the stats-based anomaly detector
and Bayesian inference are suitable to solve your RCA problems. For example, you can build and train
a Bayesian network via the following code given a time series dataframe ``df``:
[Here](https://github.com/salesforce/PyRCA/tree/main/pyrca/applications/example) is a real-world example
of applying ``BayesianNetwork`` to build a solution for RCA, which is adapted from our internal use cases.
The "config" folder includes the settings for the stats-based anomaly detector and the domain knowledge.
The "models" folder stores the causal graph and the trained Bayesian network. The ``RCAEngine`` class in the "rca.py"
file implements the methods for building causal graphs, training Bayesian networks and finding root causes
by utilizing the modules provided by PyRCA. You can directly use this class if the stats-based anomaly detector
and Bayesian inference are suitable for your RCA problems. For example, given a time series dataframe ``df``,
you can build and train a Bayesian network via the following code:

```python
from pyrca.applications.example.rca import RCAEngine
@@ -191,9 +192,9 @@ result = engine.find_root_causes_bn(anomalies=["conn_pool", "apt"])
pprint.pprint(result)
```

The inputs of ``find_root_causes_bn`` is a list of the detected anomalous metrics by the stats-based
anomaly detector. This method will estimate the probabilities of being a root cause and extract
the paths from the potential root cause nodes to the leaf nodes.
The input of ``find_root_causes_bn`` is a list of the anomalous metrics detected by the stats-based
anomaly detector. This method will estimate the probability of a node being a root cause and extract
the paths from a potential root cause node to the leaf nodes.
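
The path extraction step can be pictured with a small, self-contained sketch. It is not the code used by ``RCAEngine``: the function ``paths_to_leaves``, the adjacency-dict representation, and the example graph (including the node names ``db_cpu`` and ``error_rate`` and the direction of every edge) are assumptions made for illustration. It enumerates every directed path from a candidate root cause to the leaf nodes of an acyclic graph.

```python
def paths_to_leaves(children, start):
    """Enumerate all directed paths from `start` to leaf nodes.

    `children` maps each node to a list of its direct children.
    The graph is assumed to be acyclic (a DAG).
    """
    kids = children.get(start, [])
    if not kids:
        # A node without children is a leaf: the path ends here.
        return [[start]]
    paths = []
    for child in kids:
        for tail in paths_to_leaves(children, child):
            paths.append([start] + tail)
    return paths


# Hypothetical causal graph; edges are invented for this example.
graph = {
    "db_cpu": ["conn_pool"],
    "conn_pool": ["apt", "error_rate"],
}

paths = paths_to_leaves(graph, "db_cpu")
for path in paths:
    print(" -> ".join(path))
```

Listing the paths alongside the root cause probabilities lets an engineer see how a suspected root cause could propagate to the metrics that raised the alert.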

## Benchmarks

@@ -206,12 +207,12 @@ the appropriate license headers whenever you make a commit.

To add a new RCA method into the library, you may follow the steps below:
1. Create a new python script file for this RCA method in the ``pyrca/analyzers`` folder.
2. Create the configuration class that inherits from ``pyrca.base.BaseConfig``.
3. Create the method class that inherits from ``pyrca.analyzers.base.BaseRCA``. The constructor for the new
2. Create the configuration class inheriting from ``pyrca.base.BaseConfig``.
3. Create the method class inheriting from ``pyrca.analyzers.base.BaseRCA``. The constructor for the new
method takes the new configuration instance as its input.
4. Implement the ``train`` function that trains or initializes the new method.
5. Implement the ``find_root_causes`` function that returns a ``pyrca.analyzers.base.RCAResults``
instance storing root cause analysis results.
instance for root cause analysis results.
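
The five steps above can be sketched as a skeleton. To keep the example runnable without PyRCA installed, it uses stand-in classes: in a real contribution the config would inherit from ``pyrca.base.BaseConfig``, the method from ``pyrca.analyzers.base.BaseRCA``, and ``find_root_causes`` would return an ``RCAResults`` instance rather than a plain list. The names ``ZScoreConfig`` and ``ZScoreRCA``, the ``top_k`` parameter, and the z-score ranking rule are assumptions chosen only to give the skeleton something to compute.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class ZScoreConfig:
    """Stand-in config (step 2); hypothetical, not part of PyRCA."""
    top_k: int = 3


class ZScoreRCA:
    """Stand-in method class (step 3); hypothetical, not part of PyRCA."""

    config_class = ZScoreConfig

    def __init__(self, config):
        # The constructor takes the configuration instance as its input.
        self.config = config
        self.stats = {}

    def train(self, normal_data):
        # Step 4: record the mean and standard deviation of each metric.
        for name, values in normal_data.items():
            self.stats[name] = (mean(values), pstdev(values))

    def find_root_causes(self, abnormal_data):
        # Step 5: rank metrics by how far they drift from normal behavior.
        scores = {}
        for name, values in abnormal_data.items():
            mu, sigma = self.stats[name]
            scores[name] = abs(mean(values) - mu) / (sigma or 1.0)
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[: self.config.top_k]


# Made-up data: metric name -> list of observed values.
normal = {"cpu": [10, 12, 11, 13], "latency": [100, 102, 98, 100]}
abnormal = {"cpu": [11, 12], "latency": [150, 160]}

model = ZScoreRCA(config=ZScoreRCA.config_class(top_k=1))
model.train(normal)
result = model.find_root_causes(abnormal)
print(result)
```

Keeping the same ``config_class``, ``train``, and ``find_root_causes`` shape as the existing analyzers is what lets a new method plug into the unified interface described in Getting Started.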

To add a new causal discovery method, you may follow the following steps:
1. Create a new python script file for this causal discovery method in the ``pyrca/graphs/causal`` folder.
59 changes: 30 additions & 29 deletions docs/index.rst
@@ -12,17 +12,18 @@ Introduction
With the rapidly growing adoption of microservices architectures, multi-service applications have become the standard
paradigm in real-world IT applications. A multi-service application usually contains hundreds of interacting
services, making it harder to detect service failures and identify the root causes. Root cause analysis (RCA)
methods leverage the KPI metrics monitored on those services to determine the root causes when a system failure
is detected, helping engineers and SREs in the troubleshooting process.

PyRCA is a Python machine-learning library designed for metric-based RCA, offering multiple state-of-the-art RCA
algorithms and an end-to-end pipeline for building RCA solutions. PyRCA includes two types of algorithms: 1.
Identifying anomalous metrics in parallel with the observed anomaly via metric data analysis, e.g., ε-diagnosis,
and 2. Identifying root causes based on a topology/causal graph representing the causal relationships between
the observed metrics, e.g., Bayesian inference, Random Walk. Besides, PyRCA provides a convenient tool
for building causal graphs from the observed time series data and domain knowledge, helping users to develop
topology/causal graph based solutions quickly. PyRCA also provides a benchmark for evaluating various RCA
methods, which is valuable for industry and academic research.
methods usually leverage the KPI metrics, traces or logs monitored on those services to determine the root causes
when a system failure is detected, helping engineers and SREs in the troubleshooting process.

PyRCA is a Python machine-learning library designed for root cause analysis, offering multiple state-of-the-art RCA
algorithms and an end-to-end pipeline for building RCA solutions. Currently, PyRCA mainly focuses on metric-based RCA
including two types of algorithms: 1. Identifying anomalous metrics in parallel with the observed anomaly via
metric data analysis, e.g., ε-diagnosis, and 2. Identifying root causes based on a topology/causal graph representing
the causal relationships between the observed metrics, e.g., Bayesian inference, Random Walk. Besides, PyRCA
provides a convenient tool for building causal graphs from the observed time series data and domain knowledge,
helping users to develop graph-based solutions quickly. PyRCA also provides a benchmark for evaluating
various RCA methods, which is valuable for industry and academic research. We will continue improving this library
to make it more comprehensive. In the future, PyRCA will also support trace- and log-based RCA methods.

Installation
############
@@ -39,8 +40,8 @@ cloning the PyRCA repo, navigating to the root directory, and calling
Getting Started
###############

PyRCA provides a unified interface for training RCA models and finding root causes, you only need
to specify
PyRCA provides a unified interface for training RCA models and finding root causes. To apply
a certain RCA method, you only need to specify:

- **The selected RCA method**: e.g., :py:mod:`pyrca.analyzers.bayesian.BayesianNetwork`,
:py:mod:`pyrca.analyzers.epsilon_diagnosis.EpsilonDiagnosis`.
@@ -51,8 +52,8 @@ to specify
- **Some detected anomalous KPI metrics**: Some RCA methods require the anomalous KPI metrics detected by
a certain anomaly detector.

Let's take ``BayesianNetwork`` as an example. Suppose that ``graph_df`` is a pandas dataframe encoding
the causal graph representing causal relationships between metrics (how to construct such causal graph
Let's take ``BayesianNetwork`` as an example. Suppose that ``graph_df`` is a pandas dataframe of
the graph representing causal relationships between metrics (how to construct such causal graph
will be discussed later), and ``df`` is a pandas dataframe containing the historical observed time series
data (e.g., the index is the timestamp and each column represents one monitored metric). To train a
``BayesianNetwork``, you can simply run the following code:
@@ -64,8 +65,8 @@ data (e.g., the index is the timestamp and each column represents one monitored
model.train(df)
model.save("model_folder")
After the model is trained, you can use it for root cause analysis given a list of detected anomalous
metrics by a certain anomaly detector, e.g.,
After the model is trained, you can use it to find root causes of an incident given a list of anomalous
metrics detected by a certain anomaly detector, e.g.,

.. code-block:: python
@@ -74,7 +75,7 @@ metrics by a certain anomaly detector, e.g.,
results = model.find_root_causes(["observed_anomalous_metric", ...])
print(results.to_dict())
For other RCA methods, you can use similar code for discovering root causes. For example, if you want
For other RCA methods, you can write code similar to the above to find root causes. For example, if you want
to try ``EpsilonDiagnosis``, you can initialize ``EpsilonDiagnosis`` as follows:

.. code-block:: python
@@ -83,7 +84,7 @@ to try ``EpsilonDiagnosis``, you can initialize ``EpsilonDiagnosis`` as follows:
model = EpsilonDiagnosis(config=EpsilonDiagnosis.config_class(alpha=0.01))
model.train(normal_data)
Here ``normal_data`` is the historical observed time series data without anomalies. To find root causes,
Here ``normal_data`` is the historically observed time series data without anomalies. To find root causes,
you can run:

.. code-block:: python
@@ -93,22 +94,22 @@ you can run:
where ``abnormal_data`` is the time series data in an incident window.

As mentioned above, some RCA methods require causal graphs as their inputs. To construct such causal
As mentioned above, some RCA methods such as ``BayesianNetwork`` require causal graphs as their inputs. To construct such causal
graphs from the observed time series data, you can utilize our tool by running ``python -m pyrca.tools``.
This command will launch a Dash app for time series data analysis and causal discovery.

.. image:: _static/dashboard.png

The dashboard allows you to try different causal discovery methods, change causal discovery parameters,
The dashboard allows you to try different causal discovery methods, adjust causal discovery parameters,
add domain knowledge constraints (e.g., root/leaf nodes, forbidden/required links), and visualize
the generated causal graph. It makes easier for manually updating causal graphs with domain knowledge.
If you satisfy with the results after several iterations, you can download the results that can be
used by the RCA methods supported in PyRCA.
the generated causal graphs. It makes it easier to manually revise causal graphs based on domain knowledge.
You can download the graph generated by this tool if you are satisfied with it. The graph can be used by the RCA
methods supported in PyRCA.

Instead of using this dashboard, you can also write code for causal discovery. The package
:py:mod:`pyrca.graphs.causal` includes several causal discovery methods you can use. All of these methods
are adjusted to support domain knowledge constraints. Suppose ``df`` is the monitored time series data
and you want to apply PC for discovering causal graphs, then the following code will help:
Instead of using this dashboard, you can also write code for building such graphs. The package
``pyrca.graphs.causal`` includes several popular causal discovery methods you can use. All of these methods
support domain knowledge constraints. Suppose ``df`` is the observed time series data
and you want to apply the PC algorithm for building causal graphs, then the following code will help:

.. code-block:: python
@@ -143,7 +144,7 @@ This domain knowledge file states that:
3. There is no connection from A to E, and
4. There is a connection from A to C.

You can modify this file according to your domain knowledge for generating more reliable causal
You can write your domain knowledge file based on this template for generating more reliable causal
graphs.

Library Design