From aba275ff10131584f6364d86663655fe9d13595c Mon Sep 17 00:00:00 2001 From: yangwenzhuo08 Date: Thu, 20 Apr 2023 15:30:41 +0800 Subject: [PATCH] Revise the docs --- README.md | 95 ++++++++++++++++++++++---------------------- docs/index.rst | 59 +++++++++++++-------------- docs/pyrca.tools.rst | 17 ++++---- 3 files changed, 87 insertions(+), 84 deletions(-) diff --git a/README.md b/README.md index a2a60c9..5f6c206 100644 --- a/README.md +++ b/README.md @@ -22,26 +22,27 @@ With the rapidly growing adoption of microservices architectures, multi-service applications become the standard paradigm in real-world IT applications. A multi-service application usually contains hundreds of interacting services, making it harder to detect service failures and identify the root causes. Root cause analysis (RCA) -methods leverage the KPI metrics monitored on those services to determine the root causes when a system failure -is detected, helping engineers and SREs in the troubleshooting process. - -PyRCA is a Python machine-learning library designed for metric-based RCA, offering multiple state-of-the-art RCA -algorithms and an end-to-end pipeline for building RCA solutions. PyRCA includes two types of algorithms: 1. -Identifying anomalous metrics in parallel with the observed anomaly via metric data analysis, e.g., ε-diagnosis, -and 2. Identifying root causes based a topology/causal graph representing the causal relationships between -the observed metrics, e.g., Bayesian inference, Random Walk. Besides, PyRCA provides a convenient tool -for building causal graphs from the observed time series data and domain knowledge, helping users to develop -topology/causal graph based solutions quickly. PyRCA also provides a benchmark for evaluating various RCA -methods, which is valuable for industry and academic research. 
- -The following list shows the supported RCA methods and features in our library: +methods usually leverage the KPI metrics, traces or logs monitored on those services to determine the root causes +when a system failure is detected, helping engineers and SREs in the troubleshooting process. + +PyRCA is a Python machine-learning library designed for root cause analysis, offering multiple state-of-the-art RCA +algorithms and an end-to-end pipeline for building RCA solutions. Currently, PyRCA mainly focuses on metric-based RCA, +including two types of algorithms: 1. Identifying anomalous metrics in parallel with the observed anomaly via +metric data analysis, e.g., ε-diagnosis, and 2. Identifying root causes based on a topology/causal graph representing +the causal relationships between the observed metrics, e.g., Bayesian inference, Random Walk. Besides, PyRCA +provides a convenient tool for building causal graphs from the observed time series data and domain knowledge, +helping users to develop graph-based solutions quickly. PyRCA also provides a benchmark for evaluating +various RCA methods, which is valuable for industry and academic research. + +The following list shows the supported RCA methods in our library: 1. ε-Diagnosis 2. Bayesian Inference-based Root Cause Analysis 3. Random Walk-based Root Cause Analysis 4. Ψ-PC-based Root Cause Analysis 5. Causal Inference-based Root Cause Analysis (CIRCA) -We will continue improving this library to make it more comprehensive in the future. +We will continue improving this library to make it more comprehensive. In the future, +PyRCA will also support trace- and log-based RCA methods. ## Installation @@ -56,8 +57,8 @@ cloning the PyRCA repo, navigating to the root directory, and calling ## Getting Started -PyRCA provides a unified interface for training RCA models and finding root causes, you only need -to specify +PyRCA provides a unified interface for training RCA models and finding root causes. 
To apply +a certain RCA method, you only need to specify: - **The select RCA method**: e.g., ``BayesianNetwork``, ``EpsilonDiagnosis``. - **The RCA configuration**: e.g., ``BayesianNetworkConfig``, ``EpsilonDiagnosisConfig``. @@ -66,8 +67,8 @@ to specify - **Some detected anomalous KPI metrics**: Some RCA methods require the anomalous KPI metrics detected by certain anomaly detector. -Let's take ``BayesianNetwork`` as an example. Suppose that ``graph_df`` is a pandas dataframe encoding -the causal graph representing causal relationships between metrics (how to construct such causal graph +Let's take ``BayesianNetwork`` as an example. Suppose that ``graph_df`` is a pandas dataframe of +the graph representing causal relationships between metrics (how to construct such causal graph will be discussed later), and ``df`` is a pandas dataframe containing the historical observed time series data (e.g., the index is the timestamp and each column represents one monitored metric). To train a ``BayesianNetwork``, you can simply run the following code: @@ -79,8 +80,8 @@ model.train(df) model.save("model_folder") ``` -After the model is trained, you can use it for root cause analysis given a list of detected anomalous -metrics by a certain anomaly detector, e.g., +After the model is trained, you can use it to find root causes of an incident given a list of anomalous +metrics detected by a certain anomaly detector, e.g., ```python from pyrca.analyzers.bayesian import BayesianNetwork @@ -89,7 +90,7 @@ results = model.find_root_causes(["observed_anomalous_metric", ...]) print(results.to_dict()) ``` -For other RCA methods, you can use similar code for discovering root causes. For example, if you want +For other RCA methods, you can write similar code as above for finding root causes. 
For example, if you want to try ``EpsilonDiagnosis``, you can initalize ``EpsilonDiagnosis`` as follows: ```python @@ -98,7 +99,7 @@ model = EpsilonDiagnosis(config=EpsilonDiagnosis.config_class(alpha=0.01)) model.train(normal_data) ``` -Here ``normal_data`` is the historical observed time series data without anomalies. To find root causes, +Here ``normal_data`` is the historically observed time series data without anomalies. To identify root causes, you can run: ```python @@ -106,23 +107,23 @@ results = model.find_root_causes(abnormal_data) print(results.to_dict()) ``` -where ``abnormal_data`` is the time series data in an incident window. +where ``abnormal_data`` is the time series data collected in an incident window. -As mentioned above, some RCA methods require causal graphs as their inputs. To construct such causal +As mentioned above, some RCA methods, such as ``BayesianNetwork``, require causal graphs as their inputs. To construct such causal graphs from the observed time series data, you can utilize our tool by running ``python -m pyrca.tools``. This command will launch a Dash app for time series data analysis and causal discovery. ![alt text](https://github.com/salesforce/PyRCA/raw/main/docs/_static/dashboard.png) -The dashboard allows you to try different causal discovery methods, change causal discovery parameters, +The dashboard allows you to try different causal discovery methods, adjust causal discovery parameters, add domain knowledge constraints (e.g., root/leaf nodes, forbidden/required links), and visualize -the generated causal graph. It makes easier for manually updating causal graphs with domain knowledge. -If you satisfy with the results after several iterations, you can download the results that can be -used by the RCA methods supported in PyRCA. +the generated causal graphs. It makes it easier to manually revise causal graphs based on domain knowledge. +You can download the graph generated by this tool if you are satisfied with it. 
The graph can be used by the RCA +methods supported in PyRCA. -Instead of using this dashboard, you can also write code for causal discovery. The package -``pyrca.graphs.causal`` includes several causal discovery methods you can use. All of these methods -are adjusted to support domain knowledge constraints. Suppose ``df`` is the monitored time series data -and you want to apply PC for discovering causal graphs, then the following code will help: +Instead of using this dashboard, you can also write code for building such graphs. The package +``pyrca.graphs.causal`` includes several popular causal discovery methods you can use. All of these methods +support domain knowledge constraints. Suppose ``df`` is the observed time series data +and you want to apply the PC algorithm for building causal graphs, then the following code will help: ```python from pyrca.graphs.causal.pc import PC @@ -156,19 +157,19 @@ This domain knowledge file states that: 3. There is no connection from A to E, and 4. There is a connection from A to C. -You can modify this file according to your domain knowledge for generating more reliable causal +You can write your domain knowledge file based on this template for generating more reliable causal graphs. ## Application Example -[Here](https://github.com/salesforce/PyRCA/tree/main/pyrca/applications/example) is an example -of applying ``BayesianNetwork`` to build a solution for RCA. The "config" folder includes the setups -for the stats-based anomaly detector and the domain knowledge. The "models" folder stores the causal -graph and the trained Bayesian network. The ``RCAEngine`` in the "rca.py" file implements all the -methods for building causal graphs, training Bayesian networks and finding root causes by utilizing -the modules provides by PyRCA. You can directly use this class if the stats-based anomaly detector -and Bayesian inference are suitable to solve your RCA problems. 
For example, you can build and train -a Bayesian network via the following code given a time series dataframe ``df``: +[Here](https://github.com/salesforce/PyRCA/tree/main/pyrca/applications/example) is a real-world example +of applying ``BayesianNetwork`` to build a solution for RCA, which is adapted from our internal use cases. +The "config" folder includes the settings for the stats-based anomaly detector and the domain knowledge. +The "models" folder stores the causal graph and the trained Bayesian network. The ``RCAEngine`` class in the "rca.py" +file implements the methods for building causal graphs, training Bayesian networks and finding root causes +by utilizing the modules provided by PyRCA. You can directly use this class if the stats-based anomaly detector +and Bayesian inference are suitable for your RCA problems. For example, given a time series dataframe ``df``, +you can build and train a Bayesian network via the following code: ```python from pyrca.applications.example.rca import RCAEngine @@ -191,9 +192,9 @@ result = engine.find_root_causes_bn(anomalies=["conn_pool", "apt"]) pprint.pprint(result) ``` -The inputs of ``find_root_causes_bn`` is a list of the detected anomalous metrics by the stats-based -anomaly detector. This method will estimate the probabilities of being a root cause and extract -the paths from the potential root cause nodes to the leaf nodes. +The input of ``find_root_causes_bn`` is a list of the anomalous metrics detected by the stats-based +anomaly detector. This method will estimate the probability of a node being a root cause and extract +the paths from a potential root cause node to the leaf nodes. ## Benchmarks @@ -206,12 +207,12 @@ the appropriate license headers whenever you make a commit. To add a new RCA method into the library, you may follow the steps below: 1. Create a new python script file for this RCA method in the ``pyrca/analyzers`` folder. -2. 
Create the configuration class that inherits from ``pyrca.base.BaseConfig``. -3. Create the method class that inherits from ``pyrca.analyzers.base.BaseRCA``. The constructor for the new +2. Create the configuration class inheriting from ``pyrca.base.BaseConfig``. +3. Create the method class inheriting from ``pyrca.analyzers.base.BaseRCA``. The constructor for the new method takes the new configuration instance as its input. 4. Implement the ``train`` function that trains or initializes the new method. 5. Implement the ``find_root_causes`` function that returns a ``pyrca.analyzers.base.RCAResults`` -instance storing root cause analysis results. +instance for root cause analysis results. To add a new causal discovery method, you may follow the following steps: 1. Create a new python script file for this RCA method in the ``pyrca/graphs/causal`` folder. diff --git a/docs/index.rst b/docs/index.rst index 86e6644..1e18ab0 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -12,17 +12,18 @@ Introduction With the rapidly growing adoption of microservices architectures, multi-service applications become the standard paradigm in real-world IT applications. A multi-service application usually contains hundreds of interacting services, making it harder to detect service failures and identify the root causes. Root cause analysis (RCA) -methods leverage the KPI metrics monitored on those services to determine the root causes when a system failure -is detected, helping engineers and SREs in the troubleshooting process. - -PyRCA is a Python machine-learning library designed for metric-based RCA, offering multiple state-of-the-art RCA -algorithms and an end-to-end pipeline for building RCA solutions. PyRCA includes two types of algorithms: 1. -Identifying anomalous metrics in parallel with the observed anomaly via metric data analysis, e.g., ε-diagnosis, -and 2. 
Identifying root causes based a topology/causal graph representing the causal relationships between -the observed metrics, e.g., Bayesian inference, Random Walk. Besides, PyRCA provides a convenient tool -for building causal graphs from the observed time series data and domain knowledge, helping users to develop -topology/causal graph based solutions quickly. PyRCA also provides a benchmark for evaluating various RCA -methods, which is valuable for industry and academic research. +methods usually leverage the KPI metrics, traces or logs monitored on those services to determine the root causes +when a system failure is detected, helping engineers and SREs in the troubleshooting process. + +PyRCA is a Python machine-learning library designed for root cause analysis, offering multiple state-of-the-art RCA +algorithms and an end-to-end pipeline for building RCA solutions. Currently, PyRCA mainly focuses on metric-based RCA, +including two types of algorithms: 1. Identifying anomalous metrics in parallel with the observed anomaly via +metric data analysis, e.g., ε-diagnosis, and 2. Identifying root causes based on a topology/causal graph representing +the causal relationships between the observed metrics, e.g., Bayesian inference, Random Walk. Besides, PyRCA +provides a convenient tool for building causal graphs from the observed time series data and domain knowledge, +helping users to develop graph-based solutions quickly. PyRCA also provides a benchmark for evaluating +various RCA methods, which is valuable for industry and academic research. We will continue improving this library +to make it more comprehensive. In the future, PyRCA will also support trace- and log-based RCA methods. 
Installation ############ @@ -39,8 +40,8 @@ cloning the PyRCA repo, navigating to the root directory, and calling Getting Started ############### -PyRCA provides a unified interface for training RCA models and finding root causes, you only need -to specify +PyRCA provides a unified interface for training RCA models and finding root causes. To apply +a certain RCA method, you only need to specify: - **The select RCA method**: e.g., :py:mod:`pyrca.analyzers.bayesian.BayesianNetwork`, :py:mod:`pyrca.analyzers.epsilon_diagnosis.EpsilonDiagnosis`. @@ -51,8 +52,8 @@ to specify - **Some detected anomalous KPI metrics**: Some RCA methods require the anomalous KPI metrics detected by certain anomaly detector. -Let's take ``BayesianNetwork`` as an example. Suppose that ``graph_df`` is a pandas dataframe encoding -the causal graph representing causal relationships between metrics (how to construct such causal graph +Let's take ``BayesianNetwork`` as an example. Suppose that ``graph_df`` is a pandas dataframe of +the graph representing causal relationships between metrics (how to construct such causal graph will be discussed later), and ``df`` is a pandas dataframe containing the historical observed time series data (e.g., the index is the timestamp and each column represents one monitored metric). To train a ``BayesianNetwork``, you can simply run the following code: @@ -64,8 +65,8 @@ data (e.g., the index is the timestamp and each column represents one monitored model.train(df) model.save("model_folder") -After the model is trained, you can use it for root cause analysis given a list of detected anomalous -metrics by a certain anomaly detector, e.g., +After the model is trained, you can use it to find root causes of an incident given a list of anomalous +metrics detected by a certain anomaly detector, e.g., .. 
code-block:: python @@ -74,7 +75,7 @@ metrics by a certain anomaly detector, e.g., results = model.find_root_causes(["observed_anomalous_metric", ...]) print(results.to_dict()) -For other RCA methods, you can use similar code for discovering root causes. For example, if you want +For other RCA methods, you can write similar code as above for finding root causes. For example, if you want to try ``EpsilonDiagnosis``, you can initalize ``EpsilonDiagnosis`` as follows: .. code-block:: python @@ -83,7 +84,7 @@ to try ``EpsilonDiagnosis``, you can initalize ``EpsilonDiagnosis`` as follows: model = EpsilonDiagnosis(config=EpsilonDiagnosis.config_class(alpha=0.01)) model.train(normal_data) -Here ``normal_data`` is the historical observed time series data without anomalies. To find root causes, +Here ``normal_data`` is the historically observed time series data without anomalies. To find root causes, you can run: .. code-block:: python @@ -93,22 +94,22 @@ you can run: where ``abnormal_data`` is the time series data in an incident window. -As mentioned above, some RCA methods require causal graphs as their inputs. To construct such causal +As mentioned above, some RCA methods such as ``BayesianNetwork`` require causal graphs as their inputs. To construct such causal graphs from the observed time series data, you can utilize our tool by running ``python -m pyrca.tools``. This command will launch a Dash app for time series data analysis and causal discovery. .. image:: _static/dashboard.png -The dashboard allows you to try different causal discovery methods, change causal discovery parameters, +The dashboard allows you to try different causal discovery methods, adjust causal discovery parameters, add domain knowledge constraints (e.g., root/leaf nodes, forbidden/required links), and visualize -the generated causal graph. It makes easier for manually updating causal graphs with domain knowledge. 
-If you satisfy with the results after several iterations, you can download the results that can be -used by the RCA methods supported in PyRCA. +the generated causal graphs. It makes it easier to manually revise causal graphs based on domain knowledge. +You can download the graph generated by this tool if you are satisfied with it. The graph can be used by the RCA +methods supported in PyRCA. -Instead of using this dashboard, you can also write code for causal discovery. The package -:py:mod:`pyrca.graphs.causal` includes several causal discovery methods you can use. All of these methods -are adjusted to support domain knowledge constraints. Suppose ``df`` is the monitored time series data -and you want to apply PC for discovering causal graphs, then the following code will help: +Instead of using this dashboard, you can also write code for building such graphs. The package +``pyrca.graphs.causal`` includes several popular causal discovery methods you can use. All of these methods +support domain knowledge constraints. Suppose ``df`` is the observed time series data +and you want to apply the PC algorithm for building causal graphs, then the following code will help: .. code-block:: python @@ -143,7 +144,7 @@ This domain knowledge file states that: 3. There is no connection from A to E, and 4. There is a connection from A to C. -You can modify this file according to your domain knowledge for generating more reliable causal +You can write your domain knowledge file based on this template for generating more reliable causal graphs. Library Design diff --git a/docs/pyrca.tools.rst b/docs/pyrca.tools.rst index a8b54bc..a7280e8 100644 --- a/docs/pyrca.tools.rst +++ b/docs/pyrca.tools.rst @@ -6,16 +6,17 @@ pyrca.tools package :undoc-members: :show-inheritance: -To launch the app for data analysis and causal graph construction, you can run ``python -m pyrca.tools``: +To launch the app for data analysis and causal discovery, you can run ``python -m pyrca.tools``: .. 
image:: _static/dashboard_1.png The "Data Analysis" tab allows you to upload your time series data, visualize all the metrics, -analyze some basic stats such as means and variances, and tune the hyperparameters for stats-threshold -based anomaly detectors. The time series data should be in a CSV format, where the first column is -the timestamp and the other columns are the metrics. PyRCA supports a basic stats-based anomaly detector +check some basic stats such as means and variances, and tune the hyperparameters for stats-threshold +based anomaly detectors. PyRCA supports a basic stats-based anomaly detector :py:mod:`pyrca.outliers.stats` that you can apply for detecting anomalous spikes in the data. If this anomaly detector is not suitable for your use cases, you can also try Merlion for other anomaly detectors. +Note that the time series data should be in a CSV format, where the first column is +the timestamp and the other columns are the metrics. The "Causal Discovery" tab is used to build causal graphs estimated from time series data. @@ -23,14 +24,14 @@ The "Causal Discovery" tab is used to build causal graphs estimated from time se To build a causal graph, you can follow the steps below: -1. Upload the time series data and the domain knowledge file (optional, in the YAML format). -2. Choose the uploaded time series data you want to use for building the causal graph that describes +1. Upload the time series data, and the domain knowledge file (optional, in the YAML format). +2. Choose the uploaded time series data you want to use for building the graph that describes the causal relationships between different metrics. -3. Select a causal discovery method, e.g., PC and FGES, and adjust the corresponding parameters if +3. Select a causal discovery method, e.g., PC or FGES, and adjust the corresponding parameters if necessary. For example, you may change "max_degree" and "penalty_discount" for FGES. 4. Select the uploaded domain knowledge file if there exists. 
5. Click the "Run" button to generate the first version of the causal graph. The figure on the right - hand side will show the causal graph, so that you can manually check if there are missing links or + hand side will show the causal graph, where you can manually check if there are missing links or incorrect links. 6. If the generated causal graph has errors, you can add additional constraints, e.g., root/leaf nodes, required/forbidden links, in the "Edit Domain Knowledge" card.