Commit

Reference documentation update
This patch updates ./docs/README.md to include reference
documentation for all of the new Starlark functions
supported by Crashd, including:

* Configuration functions
* Provider functions
* Resource enumeration function
* Command functions
* Default Values
* OS data and functions
* Argument data

Signed-off-by: Vladimir Vivien <vivienv@vmware.com>
vladimirvivien committed Aug 4, 2020
1 parent 7c208e2 commit 6118ba0
Showing 4 changed files with 852 additions and 525 deletions.
264 changes: 191 additions & 73 deletions README.md
![](https://github.com/vmware-tanzu/crash-diagnostics/workflows/Crash%20Diagnostics%20Build/badge.svg)

# Crashd - Crash Diagnostics

Crash Diagnostics (Crashd) is a tool that helps human operators easily interact with, and collect information from, infrastructures running on Kubernetes, for tasks such as automated diagnosis and troubleshooting.

## Crashd Features
* Crashd uses the [Starlark language](https://github.com/google/starlark-go/blob/master/doc/spec.md), a Python dialect, to express and invoke automation functions
* Easily automate interaction with infrastructures running Kubernetes
* Interact with and capture information from compute resources such as machines (via SSH)
* Automatically execute commands on compute nodes to capture results
* Capture object and cluster logs from the Kubernetes API server
* Easily extract data from Cluster-API managed clusters

## How Does it Work?
Crashd executes script files, written in Starlark, that interact with a specified infrastructure and its cluster resources. Starlark script files contain predefined Starlark functions that can interact with, and collect diagnostics and other information from, the servers in the cluster.

For detail on the design of Crashd, see the Google Doc design document [here](https://docs.google.com/document/d/1pqYOdTf6ZIT_GSis-AVzlOTm3kyyg-32-seIfULaYEs/edit?usp=sharing).

## Installation
There are two ways to get started with Crashd: download a pre-built binary, or pull down the code and build it locally.

### Download binary
1. Download the latest [binary release](https://github.com/vmware-tanzu/crash-diagnostics/releases/) for your platform
2. Extract the `tarball` from the release
```
tar -xvf <RELEASE_TARBALL_NAME>.tar.gz
```
3. Move the binary to your operating system's `PATH`
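Steps 2 and 3 can be sketched end to end as follows. This is an illustration only: it builds a tiny stand-in tarball locally (named `crashd_demo.tar.gz`, a made-up name) so the commands can run without a network download; substitute the real release tarball you downloaded.

```shell
set -e
workdir="$(mktemp -d)"
cd "$workdir"

# Stand-in for the downloaded release: a tiny executable in a tarball
printf '#!/bin/sh\necho crashd' > crashd
chmod +x crashd
tar -czf crashd_demo.tar.gz crashd
rm crashd

# Step 2: extract the tarball
tar -xvf crashd_demo.tar.gz

# Step 3: move the binary to a directory on your PATH
mkdir -p "$HOME/.local/bin"
mv crashd "$HOME/.local/bin/crashd"
"$HOME/.local/bin/crashd"
```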

### Compiling from source
Crashd is written in Go and requires version 1.11 or later. Clone the source from its repo or download it to your local directory. From the project's root directory, compile the code with the following:

```
GO111MODULE=on go build -o crashd .
```

Or, you can run a versioned build using the `build.go` source code:

```
go run .ci/build/build.go

Build amd64/darwin OK: .build/amd64/darwin/crashd
Build amd64/linux OK: .build/amd64/linux/crashd
```

## Getting Started
A Crashd script consists of a collection of Starlark functions stored in a file. For instance, the following script (saved as diagnostics.crsh) collects system information from a list of provided hosts using SSH. The collected data is then bundled as a tar.gz file at the end:

```python
# Crashd global config
crshd = crashd_config(workdir="{0}/crashd".format(os.home))

# Enumerate compute resources
# Define a host list provider with configured SSH
hosts=resources(
    provider=host_list_provider(
        hosts=["170.10.20.30", "170.40.50.60"],
        ssh_config=ssh_config(
            username=os.username,
            private_key_path="{0}/.ssh/id_rsa".format(os.home),
        ),
    ),
)

# collect data from hosts
capture(cmd="sudo df -i", resources=hosts)
capture(cmd="sudo crictl info", resources=hosts)
capture(cmd="df -h /var/lib/containerd", resources=hosts)
capture(cmd="sudo systemctl status kubelet", resources=hosts)
capture(cmd="sudo systemctl status containerd", resources=hosts)
capture(cmd="sudo journalctl -xeu kubelet", resources=hosts)

# archive collected data
archive(output_file="diagnostics.tar.gz", source_paths=[crshd.workdir])
```

The previous code snippet connects to the two hosts specified in the `host_list_provider`, executes commands remotely over SSH, and captures and stores the results.

> See the complete list of supported [functions here](./docs/README.md).
### Running the script
To run the script, do the following:

```
$> crashd run diagnostics.crsh
```

If you want to output debug information, use the `--debug` flag as shown:

```
$> crashd run --debug diagnostics.crsh
DEBU[0000] creating working directory /home/user/crashd
DEBU[0000] run: executing command on 2 resources
DEBU[0000] run: executing command on localhost using ssh: [sudo df -i]
DEBU[0000] ssh.run: /usr/bin/ssh -q -o StrictHostKeyChecking=no -i /home/user/.ssh/id_rsa -p 22 user@localhost "sudo df -i"
DEBU[0001] run: executing command on 170.10.20.30 using ssh: [sudo df -i]
...
```

## Compute Resource Providers
Crashd uses the concept of a provider to enumerate compute resources. Each provider implementation is responsible for enumerating the compute resources on which Crashd can execute commands using a transport (i.e. SSH). Crashd comes with several providers, including:

* *Host List Provider* - uses an explicit list of host addresses (see previous example)
* *Kubernetes Nodes Provider* - extracts host information from Kubernetes API node objects
* *CAPV Provider* - uses Cluster-API to discover machines in a vSphere cluster
* *CAPA Provider* - uses Cluster-API to discover machines running on AWS
* More providers coming!
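
The provider abstraction can be pictured with a small, hypothetical Python model. This is not Crashd's actual API (`host_list_provider` and `resources` are Starlark functions implemented in Go); it only sketches the idea that each provider knows how to enumerate hosts, and `resources` delegates to whichever provider it is given:

```python
from dataclasses import dataclass, field
from typing import List, Protocol

class Provider(Protocol):
    """Anything that can enumerate compute resources."""
    def enumerate(self) -> List[str]: ...

@dataclass
class HostListProvider:
    # Explicit host addresses, as in the Getting Started example
    hosts: List[str] = field(default_factory=list)

    def enumerate(self) -> List[str]:
        return list(self.hosts)

def resources(provider: Provider) -> List[str]:
    # resources() just asks the configured provider for its hosts;
    # swapping in a different provider changes where hosts come from.
    return provider.enumerate()

nodes = resources(HostListProvider(hosts=["170.10.20.30", "170.40.50.60"]))
print(nodes)  # ['170.10.20.30', '170.40.50.60']
```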


## Accessing script parameters
Crashd scripts can access external values that can be used as script parameters.
### Environment variables
Crashd scripts can access environment variables at runtime using the `os.getenv` method:
```python
kube_capture(what="logs", namespaces=[os.getenv("KUBE_DEFAULT_NS")])
```
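
The same pattern in plain Python looks like this (illustrative only; `KUBE_DEFAULT_NS` is just an example variable name), with a fallback for when the variable is unset:

```python
import os

# Read the namespace from the environment, falling back to "default"
# when the variable is unset or empty.
ns = os.getenv("KUBE_DEFAULT_NS") or "default"
namespaces = [ns]
print(namespaces)
```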

### Command-line arguments
Scripts can also access command-line arguments passed as key/value pairs using the `--args` flag. For instance, the following command starts a script with two arguments:

```
crashd run --args="kube_ns=kube-system username=$(whoami)" diagnostics.crsh
```

Values from `--args` can be accessed as shown below:

```python
kube_capture(what="logs", namespaces=["default", args.kube_ns])
```
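
A rough sketch of how such a key/value string can become an object with attribute access. This mirrors the behavior described above but is not Crashd's actual implementation:

```python
from types import SimpleNamespace

def parse_args(raw: str) -> SimpleNamespace:
    """Turn a string like 'k1=v1 k2=v2' into a namespace: ns.k1 == 'v1'."""
    pairs = dict(item.split("=", 1) for item in raw.split())
    return SimpleNamespace(**pairs)

args = parse_args("kube_ns=kube-system username=alice")
print(args.kube_ns)   # kube-system
print(args.username)  # alice
```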
## More Examples
### SSH Connection via a jump host
The SSH configuration function can be configured with a jump user and jump host. This is useful for providers that require a host proxy for the SSH connection, as shown in the following example:
```python
ssh=ssh_config(username=os.username, jump_user=args.jump_user, jump_host=args.jump_host)
hosts=host_list_provider(hosts=["some.host", "172.100.100.20"], ssh_config=ssh)
```

### Connecting to Kubernetes nodes with SSH
The following uses the `kube_nodes_provider` to connect to Kubernetes nodes and execute remote commands against those nodes using SSH:

```python
# SSH configuration
ssh=ssh_config(
    username=os.username,
    private_key_path="{0}/.ssh/id_rsa".format(os.home),
    port=args.ssh_port,
    max_retries=5,
)

# enumerate nodes as compute resources
nodes=resources(
    provider=kube_nodes_provider(
        kube_config=kube_config(path=args.kubecfg),
        ssh_config=ssh,
    ),
)

# exec `uptime` command on each node
uptimes = run(cmd="uptime", resources=nodes)

# print `run` result from first node
print(uptimes[0].result)
```

### Retrieving Kubernetes API objects and logs
The following example uses `kube_capture` to connect to a Kubernetes API server and retrieve Kubernetes API objects and logs. The retrieved data is then saved to the filesystem as shown below:

```python
nspaces=[
    "capi-kubeadm-bootstrap-system",
    "capi-kubeadm-control-plane-system",
    "capi-system capi-webhook-system",
    "cert-manager tkg-system",
]

conf=kube_config(path=args.kubecfg)

# capture Kubernetes API object and store in files
kube_capture(what="logs", namespaces=nspaces, kube_config=conf)
kube_capture(what="objects", kinds=["services", "pods"], namespaces=nspaces, kube_config=conf)
kube_capture(what="objects", kinds=["deployments", "replicasets"], namespaces=nspaces, kube_config=conf)
```

### Interacting with Cluster-API managed machines running on vSphere (CAPV)
As mentioned, Crashd provides the `capv_provider`, which allows scripts to interact with Cluster-API managed clusters running on a vSphere infrastructure (CAPV). The following shows an abbreviated snippet of a Crashd script that retrieves diagnostics information from the management cluster machines of a CAPV-initiated cluster:

```python
# enumerates management cluster nodes
nodes = resources(
    provider=capv_provider(
        ssh_config=ssh_config(username="capv", private_key_path=args.private_key),
        kube_config=kube_config(path=args.mc_config)
    )
)

# execute and capture commands output from management nodes
capture(cmd="sudo df -i", resources=nodes)
capture(cmd="sudo crictl info", resources=nodes)
capture(cmd="sudo cat /var/log/cloud-init-output.log", resources=nodes)
capture(cmd="sudo cat /var/log/cloud-init.log", resources=nodes)
...

```

The previous snippet interacts with management cluster machines. The provider can also be configured to enumerate workload machines (by specifying the name of a workload cluster), as shown in the following example:

```python
# enumerates workload cluster nodes
nodes = resources(
    provider=capv_provider(
        workload_cluster=args.cluster_name,
        ssh_config=ssh_config(username="capv", private_key_path=args.private_key),
        kube_config=kube_config(path=args.mc_config)
    )
)

# execute and capture commands output from workload nodes
capture(cmd="sudo df -i", resources=nodes)
capture(cmd="sudo crictl info", resources=nodes)
...
```

### All Examples
See all script examples in the [./examples](./examples) directory.

## Roadmap
This project has numerous possibilities ahead of it. Read about our evolving [roadmap here](ROADMAP.md).
20 changes: 9 additions & 11 deletions ROADMAP.md
# Crash Diagnostics Roadmap
This project has been in development through several releases. The release cadence is designed to allow the implemented features to mature over time and lessen technical debt. Each release series will consist of alpha and beta releases (when necessary) before each major release, to allow time for the code to be properly exercised by the community.

This roadmap presents short- and medium-term views of the type of design and functionality that the tool should support prior to a `1.0` release.

The following additional features are also planned for this series.


## v0.3.x-Releases
This series of releases will see a redesign of the internals of Crash Diagnostics, moving away from a custom configuration format and adopting the [Starlark](https://github.com/bazelbuild/starlark) language (a dialect of Python):
* Refactor the internal implementation to use Starlark
* Introduce/implement several Starlark functions to replace the directives from the previous file format
* Develop the ability to extract data/logs from Cluster-API managed clusters

See the Google Doc design document [here](https://docs.google.com/document/d/1pqYOdTf6ZIT_GSis-AVzlOTm3kyyg-32-seIfULaYEs/edit?usp=sharing).


## v0.4.x-Releases
This series of releases will explore optimization features:
* Parsing and execution optimization (i.e. parallel execution)
* Uniform retry strategies (smart enough to requeue actions when they fail)

## v0.5.x-Releases
Exploring other interesting ideas:
* Automated diagnostics (would be nice)
* And more...

7 changes: 2 additions & 5 deletions TODO.md
This tag/version reflects migration to github
* [ ] Cloud API recipes (i.e. recipes to debug CAPV)

# v0.3.0
* Redesign the script/configuration language for Crash Diagnostics
* Refactor internals and implement support for the [Starlark](https://github.com/bazelbuild/starlark) language
