Crash-Diagnostics is a tool to help investigate, analyze, and troubleshoot unresponsive or crashed Kubernetes clusters.
Latest commit 6fca28a Nov 26, 2019
Type Name Latest commit message Commit time
.ci Build and release automation with travis, gorelaser Oct 10, 2019
archiver Adds source and package documentation Sep 19, 2019
buildinfo Build and release automation with travis, gorelaser Oct 10, 2019
cmd Documentation updates for RUN command Oct 20, 2019
docs Documentation update for variable expansion escape Nov 25, 2019
exec Fix for variable expansion clash with expansion escape Nov 25, 2019
k8s Adds source and package documentation Sep 19, 2019
script Fix for variable expansion clash with expansion escape Nov 25, 2019
ssh Changes and tests for quoted SSH commands Oct 23, 2019
.gitignore Build and release automation with travis, gorelaser Oct 10, 2019
.goreleaser.yml Build and release automation with travis, gorelaser Oct 10, 2019
.travis.yml Fix for encrypted env in travis Oct 11, 2019
CHANGELOG.md Documentation update for variable expansion escape Nov 25, 2019
CODE-OF-CONDUCT.md OSS compliance docs and updates Sep 27, 2019
CONTRIBUTING.md OSS compliance docs and updates Sep 27, 2019
Diagnostics.file OSS compliance docs and updates Sep 27, 2019
LICENSE.txt OSS compliance docs and updates Sep 27, 2019
NOTICE.txt OSS compliance docs and updates Sep 27, 2019
README.md v0.1.0 Release and doc update Nov 1, 2019
ROADMAP.md Documentation update for variable expansion escape Nov 25, 2019
TODO.md Documentation update for variable expansion escape Nov 25, 2019
go.mod Migration to GitHub/VMware-Tanzu Oct 2, 2019
go.sum Adds support for Go style templating in script file Aug 29, 2019
main.go Migration to GitHub/VMware-Tanzu Oct 2, 2019

README.md

Crash Recovery and Diagnostics for Kubernetes

Crash Recovery and Diagnostics for Kubernetes (Crash Diagnostics for short) helps human operators investigate and troubleshoot unhealthy or unresponsive Kubernetes clusters. The project aims to automate the diagnosis of problem clusters that may be in an unstable state, including clusters that are completely inoperable. In its introductory release, Crash Diagnostics lets cluster operators automatically collect machine state and other information from each node in a cluster. The collected information is then bundled in a tar file for further analysis.

Collecting troubleshooting information

To specify the resources to collect from cluster machines, a series of commands is declared in a file called a diagnostics file. Like a Dockerfile, the diagnostics file is a collection of line-by-line directives with commands that are executed on each specified cluster machine. The output of the commands is then added to a tar file and saved for further analysis.

For instance, when the following diagnostics file (saved as Diagnostics.file) is executed, it will collect information from the two cluster machines specified with the FROM directive:

FROM  192.168.176.100:22 192.168.176.102:22 
AUTHCONFIG username:${remoteuser}  private-key:${HOME}/.ssh/id_rsa 
WORKDIR /tmp/crashout 

# copy log files 
COPY /var/log/kube-apiserver.log 
COPY /var/log/kube-scheduler.log 
COPY /var/log/kube-controller-manager.log 
COPY /var/log/kubelet.log 
COPY /var/log/kube-proxy.log 

# Capture service status output 
CAPTURE journalctl -l -u kubelet 
CAPTURE journalctl -l -u kube-apiserver 

# Collect docker-related logs 
CAPTURE journalctl -l -u docker 
CAPTURE /bin/sh -c "docker ps | grep apiserver" 

OUTPUT ./crash-out.tar.gz 
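
To make the line-by-line format concrete, here is a minimal Go sketch of how such a file could be split into directives. It is not the project's actual parser: the Directive type and the parseDirectives function are hypothetical names introduced for illustration, and the sketch assumes only what the example above shows (one directive per line, `#` comments, blank lines ignored).

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// Directive is a hypothetical type for this sketch: one parsed line of a
// diagnostics file, split into its directive name and raw argument text.
type Directive struct {
	Line int    // 1-based line number in the file
	Name string // e.g. FROM, COPY, CAPTURE
	Args string // everything after the directive name, trimmed
}

// parseDirectives reads the script line by line, skipping blank lines and
// lines that start with '#'. It does not validate directive names.
func parseDirectives(src string) ([]Directive, error) {
	var out []Directive
	sc := bufio.NewScanner(strings.NewReader(src))
	for n := 1; sc.Scan(); n++ {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		name, args := line, ""
		if i := strings.IndexAny(line, " \t"); i >= 0 {
			name, args = line[:i], strings.TrimSpace(line[i+1:])
		}
		out = append(out, Directive{Line: n, Name: name, Args: args})
	}
	return out, sc.Err()
}

func main() {
	src := `FROM 192.168.176.100:22 192.168.176.102:22
WORKDIR /tmp/crashout

# copy log files
COPY /var/log/kubelet.log
CAPTURE journalctl -l -u kubelet
OUTPUT ./crash-out.tar.gz
`
	ds, err := parseDirectives(src)
	if err != nil {
		panic(err)
	}
	for _, d := range ds {
		fmt.Printf("%d: %s %q\n", d.Line, d.Name, d.Args)
	}
}
```

Keeping the argument text raw leaves quoting and variable handling to each directive, which seems a reasonable split for a format where CAPTURE takes a full shell command and COPY takes a path.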

Features

  • Simple declarative script with flexible format
  • Support for multiple directives to execute user-provided commands
  • Ability to declare or use existing environment variables in commands
  • Easily transfer files from cluster machines
  • Execute commands on remote machines and capture the results
  • Automatically collect information from multiple machines

See the complete list of supported directives here.

Running Diagnostics

The tool is compiled into a single binary named crash-diagnostics. By default, the following command searches for and executes a diagnostics script file named ./Diagnostics.file:

crash-diagnostics run

Flag --file can be used to specify a different diagnostics file:

crash-diagnostics run --file test-diagnostics.file

The output file generated by the tool can be specified using flag --output (which overrides the value in the script):

crash-diagnostics run --file test-diagnostics.file --output test-cluster.tar.gz

When you use the --debug flag, you should see log messages on the screen similar to the following:

$> crash-diagnostics run --debug

DEBU[0000] Parsing script file
DEBU[0000] Parsing [1: FROM local]
DEBU[0000] FROM parsed OK
DEBU[0000] Parsing [2: WORKDIR /tmp/crashdir]
...
DEBU[0000] Archiving [/tmp/crashdir] in out.tar.gz
DEBU[0000] Archived /tmp/crashdir/local/df_-i.txt
DEBU[0000] Archived /tmp/crashdir/local/lsof_-i.txt
DEBU[0000] Archived /tmp/crashdir/local/netstat_-an.txt
DEBU[0000] Archived /tmp/crashdir/local/ps_-ef.txt
DEBU[0000] Archived /tmp/crashdir/local/var/log/syslog
INFO[0000] Created archive out.tar.gz
INFO[0002] Created archive out.tar.gz
INFO[0002] Output done

Compile and Run

crash-diagnostics is written in Go and requires Go 1.11 or later. Clone the source from its repo or download it to a local directory. From the project's root directory, compile the code with the following:

GO111MODULE=on go install .

This should place the compiled crash-diagnostics binary in $(go env GOPATH)/bin. You can test this with:

crash-diagnostics --help

If this does not work, ensure that your Go environment is set up properly.

Roadmap

This project has numerous possibilities ahead of it. Read about our evolving roadmap here.

Contributing

New contributors will need to sign a CLA (contributor license agreement). Details are described in our contributing documentation.

License

This project is available under the Apache License, Version 2.0.
