Capstone Project of the February 2025 cohort of the PNC/TEKsystems Early Career SRE bootcamp. This is a disaster recovery testing platform that offers a CLI to run multiple types of failure simulations, built-in recovery automation, and system health monitoring/visualization.
Prerequisite Installation
Environment Set-Up
Usage
Credits
Before beginning, you must have the following installed:
- Docker Desktop with Kubernetes
- Python
- Pip (Package Manager)
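Not sure whether these are already installed? Assuming they are on your PATH, each of the following should print a version number:

```
docker --version
python --version
python -m pip --version
```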
If you do not, follow the installation instructions below.
- Navigate to Docs.Docker.com
- Download the correct installation for your machine
- Run the executable installer
- Follow the set-up instructions. The recommended defaults will work for this project.
- Restart your machine if necessary
- Start up Docker
- Click on the Settings icon in the top toolbar
- Click on 'Kubernetes' in the left sidebar.
- Turn on the 'Enable Kubernetes' option
- Click 'Apply & Restart'
- Click 'Install'
Once installation has finished, you should see 'Kubernetes running' along the bottom of the application window.
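As an additional check from the terminal (Docker Desktop bundles the kubectl CLI), the cluster's single node should report a Ready status:

```
# The docker-desktop node should show STATUS Ready
kubectl get nodes
```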
- Navigate to Python.org
- Download the correct installation for your machine
- Run the executable installer
- Follow the set-up instructions. The recommended defaults will work for this project.
- Restart your machine if necessary
Once Python is installed, run the following command to ensure Pip is installed and up to date:

```
python -m pip install --upgrade pip
```

Skip this section and go to the next section if you would prefer to set up a virtual environment.
- Navigate to the main project folder on your local machine
- Run `ls` to ensure that you are in the folder containing requirements.txt
- Run `pip install -r requirements.txt`
To set up a virtual environment instead:
- Navigate to the main project folder on your local machine
- Run `ls` to ensure that you are in the folder containing requirements.txt
- Run `python -m venv .venv`
- Run `source .venv/Scripts/activate` (on macOS/Linux the script is at `.venv/bin/activate`). You are now in your virtual environment and should see (.venv) at the start of each line in your terminal
- Run `python -m pip install -r requirements.txt`
  a. You may use `pip install -r requirements.txt`; however, using `python -m` as a preface ensures you are using the `pip` associated with the currently active Python interpreter (the one inside the virtual environment). This is a good practice.
- Check the correct packages are installed by running `python -m pip freeze`
- Open your main project folder in VS Code
- Open the Command Palette with the shortcut Ctrl+Shift+P
- Search for and select 'Python: Select Interpreter'
- Select 'Enter interpreter path...'
- Select 'Find...'
- Browse to '.venv\Scripts' within your capstone folder and select 'python.exe'
- When you are done working in the virtual environment, simply run `deactivate` in your terminal
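If you are ever unsure which interpreter a terminal is using, a quick standard-library check is:

```
# Prints the path of the active Python interpreter;
# with the venv active it should point inside .venv
python -c "import sys; print(sys.executable)"
```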
Once all prerequisites are installed and Docker Desktop with Kubernetes is running, navigate to the main project folder and run the apply_all.py file using the following command:
```
python apply_all.py
```

- Press Enter to monitor the default pod with the infrastructure metrics scraper
- Enter 'y' to confirm
- Press Enter to use the default scrape interval of 5 seconds, or enter a different integer value
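Once the script finishes, you can watch the pods come up until everything reports Running (pod names will vary with your manifests):

```
# Press Ctrl + C to stop watching once all pods report Running
kubectl get pods --watch
```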
Kafka in this cluster is configured to create the required topics automatically. However, the commands below can be used to view and manage topics, and to test the Kafka broker if needed.

```
# Confirm the Kafka broker is running
kubectl get pods -l app=kafka
# Launch a Kafka client pod; you should get a shell prompt inside the Kafka container
kubectl run -it kafka-client --image=bitnami/kafka:3.6.0 --rm --restart=Never -- bash
# List Topics
kafka-topics.sh --bootstrap-server kafka-0.kafka-headless.default.svc.cluster.local:9094 --list
# Create Test Topic
kafka-topics.sh --bootstrap-server kafka-0.kafka-headless.default.svc.cluster.local:9094 --create --topic test-topic --partitions 1 --replication-factor 1
# Send a Test Message
kafka-console-producer.sh --bootstrap-server kafka-0.kafka-headless.default.svc.cluster.local:9094 --topic test-topic
>hello from kafka!
# Press `Ctrl + C` to exit the producer
# Read messages back with the console consumer (--from-beginning shows messages sent earlier)
kafka-console-consumer.sh --bootstrap-server kafka-0.kafka-headless.default.svc.cluster.local:9094 --topic test-topic --from-beginning
# Press `Ctrl + C` to exit the consumer
```

By default, Kafka is configured to automatically create topics when a producer or consumer references a topic name that doesn't exist.
This can be convenient during development, but you may want to disable it in certain scenarios.
You can prevent Kafka from creating topics automatically by setting the following environment variable in your Kafka manifest (e.g., kafka-statefulset.yaml):
```
- name: KAFKA_CFG_AUTO_CREATE_TOPICS_ENABLE
  value: "false"
```
Once your prerequisites are installed, your environment is set up, and your Kubernetes cluster is running, you are ready to start using the application! In the main project folder, run the following command:
```
python run_experiment.py
```

Follow the prompts and execute the experiments you would like. Entering 0 at any point will return you to the main menu.
Running Network Partition experiment with default parameters:

```
# Run experiment
python run_experiment.py
# Select experiment from the menu
====== CHAOS EXPERIMENT LAUNCHER ======
Available experiments:
1. Process Termination
2. Pod Termination
3. Resource Exhaustion (CPU stress test)
4. Network Partition
0. Exit
Select an option: 4
# Confirm selection
Targeting the below pod for network partition:
Pod: python-proxy-app-54b68fb4c7-r6qcz in namespace default
Status: Running
UID: 468306a3-a24e-4978-b6e1-e225994d5416
Confirm selection? (y/n): y
# Enter experiment parameters (optional)
====== NETWORK PARTITION EXPERIMENT ======
Parameters (press Enter to use default):
Target service to block [default: mysql-primary]:
Port to block [default: 3306]:
Protocol (tcp/udp/icmp) [default: tcp]:
Duration in seconds [default: 60]:
Running Network Partition experiment with the following parameters:
Pod: default/python-proxy-app-54b68fb4c7-r6qcz
UID: 468306a3-a24e-4978-b6e1-e225994d5416
Target Service: mysql-primary
Port: 3306
Protocol: tcp
Duration: 60 seconds
Execute experiment? (y/n): y
# Run the experiment
Executing experiment...
Experiment completed successfully!
Press Enter to continue...
```
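While a partition experiment is running, one way to observe its effect is to stream the proxy's logs from a second terminal. This assumes the target pod above belongs to a Deployment named python-proxy-app (inferred from the pod name; adjust to match your cluster):

```
# Stream the proxy logs while the experiment runs
kubectl logs -f deployment/python-proxy-app
```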
Running Resource Exhaustion experiment with custom parameters:

```
# Run experiment
python run_experiment.py
# Select experiment from the menu
====== CHAOS EXPERIMENT LAUNCHER ======
Available experiments:
1. Process Termination
2. Pod Termination
3. Resource Exhaustion (CPU stress test)
4. Network Partition
0. Exit
Select an option: 3
# Specify target pod
Enter pod name (or part of name) to target (0 to return to main menu): mysql-p
Pod: mysql-primary-0 in namespace default
Status: Running
UID: 6e802b8a-d3ec-443e-9619-a1dc2c62f23e
Confirm selection? (y/n): y
# Enter experiment parameters (optional)
====== RESOURCE EXHAUSTION EXPERIMENT ======
Parameters (press Enter to use default):
Exhaust CPU? (y/n) [default: y]: n
Exhaust memory? (y/n) [default: n]: y
Memory intensity percentage (1-100) [default: 80]: 95
Duration in seconds [default: 30]: 90
Running Resource Exhaustion experiment with the following parameters:
Pod: default/mysql-primary-0
UID: 6e802b8a-d3ec-443e-9619-a1dc2c62f23e
CPU Exhaustion: Disabled
Memory Exhaustion: Enabled (Intensity: 95%)
Duration: 90 seconds
Execute experiment? (y/n): y
# Run the experiment
Executing experiment...
Experiment completed successfully!
Press Enter to continue...
```
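During a resource exhaustion run, you can optionally watch live usage numbers from a second terminal. Note that kubectl top depends on the Kubernetes metrics-server, which is not enabled by default on Docker Desktop:

```
# Show current CPU and memory usage for the stressed pod
kubectl top pod mysql-primary-0
```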
To view the system health dashboards in Grafana:
- Open a new tab in your browser and type in "localhost:32000"
- Username: admin, Password: admin
- Click on "Dashboards" on the left hand side
- Click on "Capstone"
- Click on "Database Recovery System"
Done!
To explore the raw metrics and event data stored in MongoDB, in your terminal...
- Type in this command: `kubectl get pods`
- Locate the mongodb pod and copy the name
- Type in this command: `kubectl exec -it <mongodb pod name> -- sh`
- Log into mongosh using this command: `mongosh -u root -p`
- When prompted, type in the password: `root`
- From there, you can see databases with: `show dbs`
- Type: `use metrics_db`
- From there, you can see collections with: `show collections`
- To view a collection, type: `db.<collection name>.find().pretty()`
- At this point you can enter your query
```
# Find all the documents from the network partition experiments
db.chaos_events.find({source: "network_partition"})

# Find the number of failover events that have occurred
db.proxy_logs.countDocuments({event: "failover"})

# Find the number of errors
db.chaos_events.countDocuments({event: "error"})
```

To explore the summary data stored in MySQL, in your terminal...
- Type in this command: `kubectl exec -it mysql-summary-records-0 -- sh`
- Log into MySQL using this command: `mysql -u root -p`
- When prompted, type in the password: `root`
- From there, you can see databases with: `show databases;`
- Type: `use summary_db;`
- From there, you can see tables with: `show tables;`
- To view a table, type: `select * from <table name>;`
- At this point you can enter your query
```
# Find average CPU and average memory utilization for the containers
SELECT AVG(cpu_percent) AS avg_cpu, AVG(mem_percent) AS avg_mem FROM infra_metrics WHERE metric_level = 'container';

# Calculate the total number of scrapes
SELECT COUNT(*) AS total FROM infra_metrics WHERE metric_level = 'container';

# Find all the entries where CPU or memory spike
SELECT * FROM infra_metrics WHERE cpu_percent > 90 OR mem_percent > 50;
```

- To exit the run_experiment.py script, enter 0 while on the main menu
- To close down the cluster, run `python delete_all.py` from the main project folder
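If you want to confirm the teardown completed, check that no project pods remain (assuming everything ran in the default namespace):

```
# After delete_all.py finishes, this should eventually report: No resources found
kubectl get pods
```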
Designed and built by Lucas Baker, Rachel Cox, Henry Hewitt, and Lukas McCain for the February 2025 cohort of the PNC/TEKsystems Early Career SRE bootcamp.