<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>

# 10.0 Monitoring GPUs within a Kubernetes Cluster
## (part of Lab 3)

In this notebook, you'll learn to monitor and manage GPU resources across a K8s cluster using [NVIDIA Data Center GPU Manager (DCGM)](https://developer.nvidia.com/dcgm).  

**[10.1 Deploy Prometheus](#10.1-Deploy-Prometheus)<br>**
&nbsp;&nbsp;&nbsp;&nbsp;[10.1.1 Configuration File](#10.1.1-Configuration-File)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[10.1.2 Exercise: Override a Configuration Value](#10.1.2-Exercise:-Override-a-Configuration-Value)<br>
**[10.2 Deploy `dcgm-exporter`](#10.2-Deploy-dcgm-exporter)<br>**
**[10.3 Explore Prometheus and Grafana](#10.3-Explore-Prometheus-and-Grafana)<br>**
&nbsp;&nbsp;&nbsp;&nbsp;[10.3.1 Exercise: Set Up a Dashboard](#10.3.1-Exercise:-Set-Up-a-Dashboard)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[10.3.2 Shutdown](#10.3.2-Shutdown)<br>


Monitoring a system includes collecting and storing metrics, visualizing results, and alerting on specific observed conditions.
DCGM is a suite of tools that includes active health monitoring, comprehensive diagnostics, system alerts, and governance policies for the GPU cluster. 
In this notebook, metrics are collected with the open-source tool [Prometheus](https://prometheus.io/) and visualized as rich dashboards with [Grafana](https://grafana.com/).  

To gather GPU telemetry in Kubernetes, we will use [dcgm-exporter](https://docs.nvidia.com/datacenter/cloud-native/kubernetes/dcgme2e.html#gpu-telemetry), which exposes GPU metrics in a format that can be scraped by Prometheus and visualized using Grafana.

### Notebook Dependencies
The steps in this notebook assume that you are starting with a K8s cluster that is GPU-enabled with feature discovery.  Let's ensure that by stopping and restarting a cluster and bringing it to a known state.

In [1]:
# Delete and restart K8s
!minikube delete
!minikube start --driver=none
# Install the GPU device plugin with Helm
!helm repo add nvdp https://nvidia.github.io/k8s-device-plugin \
   && helm repo update
!helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version 0.13.0
# Install GPU feature discovery with Helm
!helm repo add nvgfd https://nvidia.github.io/gpu-feature-discovery \
    && helm repo update
!helm upgrade -i nvgfd nvgfd/gpu-feature-discovery \
  --version 0.7.0 \
  --namespace gpu-feature-discovery \
  --create-namespace

🙄  "minikube" profile does not exist, trying anyways.
💀  Removed all traces of the "minikube" cluster.
😄  minikube v1.23.2 on Ubuntu 20.04 (docker/amd64)
✨  Using the none driver based on user configuration
👍  Starting control plane node minikube in cluster minikube
🤹  Running on localhost (CPUs=4, Memory=15818MB, Disk=297738MB) ...
ℹ️  OS release is Ubuntu 20.04.5 LTS
🐳  Preparing Kubernetes v1.22.2 on Docker 23.0.2 ...
    ▪ Generating certificates and keys ...
    ▪ Booting up control plane ...

Check the list of Helm charts installed with the `helm list` command (see the [Helm documentation](https://helm.sh/docs/helm/helm_list/)). The `--filter` option allows filtering by name.  Use the `--output` option to specify the output format ("json", "table", or "yaml").  

In [2]:
# check the list of charts
!helm list -A  --output table

NAME 	NAMESPACE            	REVISION	UPDATED                                	STATUS  	CHART                      	APP VERSION
nvdp 	nvidia-device-plugin 	1       	2023-07-27 07:43:08.720989373 +0000 UTC	deployed	nvidia-device-plugin-0.13.0	0.13.0     
nvgfd	gpu-feature-discovery	1       	2023-07-27 07:43:09.398965504 +0000 UTC	deployed	gpu-feature-discovery-0.7.0	0.7.0      


In [3]:
# Filter the list of charts
!helm list -A --filter "nvdp" --output yaml

- app_version: 0.13.0
  chart: nvidia-device-plugin-0.13.0
  name: nvdp
  namespace: nvidia-device-plugin
  revision: "1"
  status: deployed
  updated: 2023-07-27 07:43:08.720989373 +0000 UTC
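
The `json` output format is convenient when the release list needs to be processed in a script. The sketch below is illustrative only: it assumes the JSON records carry the same keys as the YAML output above (`name`, `namespace`, `status`, `chart`), and the `SAMPLE` string is a hand-written subset rather than captured output. The hypothetical helper `filter_releases` mimics `--filter`, which treats its argument as a regular expression.

```python
import json
import re
from typing import List


def filter_releases(helm_json: str, pattern: str) -> List[str]:
    """Return the names of releases whose name matches the regular expression."""
    releases = json.loads(helm_json)
    return [r["name"] for r in releases if re.search(pattern, r["name"])]


# Hand-written sample shaped like `helm list -A --output json` (assumed keys)
SAMPLE = json.dumps([
    {"name": "nvdp", "namespace": "nvidia-device-plugin",
     "status": "deployed", "chart": "nvidia-device-plugin-0.13.0"},
    {"name": "nvgfd", "namespace": "gpu-feature-discovery",
     "status": "deployed", "chart": "gpu-feature-discovery-0.7.0"},
])

print(filter_releases(SAMPLE, "nvdp"))
```

In a notebook cell, the real input could come from `!helm list -A --output json` instead of the sample string.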


---
# 10.1 Deploy Prometheus

The first step is to deploy Prometheus to the cluster, as `dcgm-exporter` depends on Prometheus. If we do this out of order we will get an error.  

<img  src="images/k8s/prometheus-architecture.png">


In the previous notebook, deploying with Helm took two steps: add the appropriate repository, then install with the required options.  Prometheus needs an additional intermediate step, because we modify the configuration values before installation.  Our steps are:
1. Add the Prometheus repository
2. Get the `kube-prometheus-stack` values file and modify it for our configuration
3. Install Prometheus with Helm using the updated values

In [4]:
# Add the prometheus-community repo
!helm repo add prometheus-community \
    https://prometheus-community.github.io/helm-charts \
    && helm repo update

"prometheus-community" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "nvdp" chart repository
...Successfully got an update from the "nvgfd" chart repository
...Successfully got an update from the "prometheus-community" chart repository
Update Complete. ⎈Happy Helming!⎈


In [5]:
# Find the exact name
!helm search repo prometheus --version 44.3

NAME                                      	CHART VERSION	APP VERSION	DESCRIPTION                                       
prometheus-community/kube-prometheus-stack	44.3.1       	v0.62.0    	kube-prometheus-stack collects Kubernetes manif...


In [6]:
# Use inspect to export the values YAML file
!helm inspect values prometheus-community/kube-prometheus-stack \
    --version 44.3 > ./kube-prometheus-stack.values

## 10.1.1 Configuration File
The [kube-prometheus-stack.values](kube-prometheus-stack.values) file is quite large. Take a look to get a sense of the many configuration settings available at deployment. Depending on your own use case, you may need to modify the file before deployment.  In this class, a file with the modifications is already preloaded.  You can see the changes by running a `diff` of the two files.

In [7]:
CONFIG_DIR = "/dli/task/kubernetes-config"
!diff kube-prometheus-stack.values $CONFIG_DIR/kube-prometheus-stack-v44.values

765c765,769
< 
---
>   grafana.ini:
>     server:
>       domain: ""
>       root_url: ""
>       serve_from_subpath: true
774c778
<     enabled: false
---
>     enabled: true
799c803
<     path: /
---
>     path: /grafana
2789c2793
<     serviceMonitorSelectorNilUsesHelmValues: true
---
>     serviceMonitorSelectorNilUsesHelmValues: false
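
The same kind of comparison can be done in Python with the standard `difflib` module. This is a minimal sketch, not the course's method: the two lists below are hand-written stand-ins for a few lines of the default and preloaded values files, and `changed_lines` is a hypothetical helper that keeps only the removed and added lines.

```python
import difflib

# Hand-written stand-ins for the two values files (the real files are far larger)
default_values = [
    "grafana:",
    "  ingress:",
    "    enabled: false",
    "    path: /",
]
preloaded_values = [
    "grafana:",
    "  ingress:",
    "    enabled: true",
    "    path: /grafana",
]


def changed_lines(old, new):
    """Return only the removed ('-') and added ('+') lines of a unified diff."""
    diff = difflib.unified_diff(old, new, lineterm="")
    return [
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


for line in changed_lines(default_values, preloaded_values):
    print(line)
```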


One area of the configuration file of particular interest to us is the Grafana configuration.  Here are the Grafana settings in the preloaded version of the values file:

In [8]:
!cat $CONFIG_DIR/kube-prometheus-stack-v44.values | grep -A 60 grafana:

grafana:
  enabled: true
  namespaceOverride: ""

  ## ForceDeployDatasources Create datasource configmap even if grafana deployment has been disabled
  ##
  forceDeployDatasources: false

  ## ForceDeployDashboard Create dashboard configmap even if grafana deployment has been disabled
  ##
  forceDeployDashboards: false

  ## Deploy default dashboards
  ##
  defaultDashboardsEnabled: true

  ## Timezone for the default dashboards
  ## Other options are: browser or a specific timezone, i.e. Europe/Luxembourg
  ##
  defaultDashboardsTimezone: utc

  adminPassword: prom-operator
  grafana.ini:
    server:
      domain: ""
      root_url: ""
      serve_from_subpath: true
  rbac:
    ## If true, Grafana PSPs will be created
    ##
    pspEnabled: false

  ingress:
    ## If true, Grafana Ingress will be created
    ##
    enabled: true

    ## IngressClassName for Grafana Ingress.
    ## Should be provided if Ingress is enable.
    ##
    # ingressClassName: nginx

    ## Annotations for 

A few more changes to the configuration are needed.  To access the Grafana webpage, the "domain", "root_url", and "hosts" parameters have to point to your particular GPU instance. Each student GPU instance has a unique URL, which we need to extract.  You could modify the values file directly, but as an exercise, you'll instead override the values in the `helm install` command using the `--set` option. 

## 10.1.2 Exercise: Override a Configuration Value
To override a value in the configuration YAML file, use the `--set` option during installation. 
A key is referenced by its position in the hierarchy, with the levels separated by dots. Take care to escape any literal dots inside a key name!
As an example, the reference to the "Grafana server domain" is in the hierarchy `grafana`->`grafana.ini`->`server`->`domain`. Therefore, the `--set` option is of the form:

```
--set grafana.'grafana\.ini'.server.domain="your.domain.here"
```
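
The escaping rule can be sketched in a few lines of Python. The helper name `helm_set_key` is hypothetical; it simply joins the levels of the hierarchy with dots and puts a backslash in front of any dot that belongs to a key name.

```python
def helm_set_key(path):
    """Join key segments with dots, escaping literal dots inside a segment."""
    return ".".join(segment.replace(".", "\\.") for segment in path)


key = helm_set_key(["grafana", "grafana.ini", "server", "domain"])
print(key)
```

When the resulting key is used on a shell command line, the backslash itself has to survive shell parsing, which is why the example above wraps `grafana\.ini` in single quotes.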

Using the helper cell below, determine the "domain", "root_url", and "host" values.  Then complete the `helm install` command with the correct values and run it to deploy Prometheus. 

There is no precise solution to look at because every student has a unique host URL.  If you get stuck, you can look at the [example solution](solutions/ex10.1.2.ipynb), which should give you an idea of the correct pattern. <br>*Note: the example solution will not be your exact solution!*

In [9]:
%%js
var root_url = 'http://' + window.location.hostname + '/grafana';
element.append(root_url);

<IPython.core.display.Javascript object>
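
If it helps to see how the pieces of the URL relate, the sketch below splits a root URL with the standard `urllib.parse` module. The hostname `my-instance.example.com` is a placeholder, and using the bare hostname for both "domain" and "host" is an assumption; compare against the example solution before relying on it.

```python
from urllib.parse import urlparse


def grafana_settings(root_url):
    """Derive candidate 'domain' and 'host' values from a Grafana root URL."""
    host = urlparse(root_url).hostname  # hostname only: no scheme, port, or path
    return {"domain": host, "root_url": root_url, "host": host}


# Placeholder URL; your instance has its own unique hostname
settings = grafana_settings("http://my-instance.example.com/grafana")
print(settings)
```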

With the configuration changes in place, go ahead and deploy the Prometheus application. Then we can verify that Prometheus is deployed with the `kubectl get pods` command using the option `--namespace prometheus`.

In [10]:
# TODO Replace the FIXMEs with the correct setting values and deploy Prometheus
# This should take around 30 seconds
!helm install prometheus-community/kube-prometheus-stack \
   --version 44.3 \
   --create-namespace --namespace prometheus \
   --generate-name \
   --values $CONFIG_DIR/kube-prometheus-stack-v44.values \
   --set grafana.'grafana\.ini'.server.domain="FIXME" \
   --set grafana.'grafana\.ini'.server.root_url="FIXME" \
   --set grafana.ingress.hosts[0]="FIXME"

Error: INSTALLATION FAILED: Ingress.extensions "kube-prometheus-stack-1690443793-grafana" is invalid: spec.rules[0].host: Invalid value: "FIXME": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')
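
The error above is what you get if the placeholders are left in place: the ingress host must be a lowercase RFC 1123 subdomain. A candidate value can be checked before running `helm install` with the regular expression quoted in the error message. This is a minimal sketch; the helper name `is_valid_host` is hypothetical, and the 253-character limit is the usual DNS subdomain length limit rather than something shown in the message.

```python
import re

# Regular expression copied from the validation error message above
RFC1123_SUBDOMAIN = re.compile(
    r"[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*"
)


def is_valid_host(host):
    """Return True if the whole string is a lowercase RFC 1123 subdomain."""
    return len(host) <= 253 and RFC1123_SUBDOMAIN.fullmatch(host) is not None


for candidate in ["FIXME", "example.com", "http://example.com/grafana"]:
    print(candidate, is_valid_host(candidate))
```

Note that a full URL fails the check: the ingress host takes only the hostname, without the scheme or the `/grafana` path.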


In [11]:
# Check prometheus pods. Should be "Running" after a "ContainerCreating" status
!kubectl get pods --namespace prometheus 

NAME                                                              READY   STATUS              RESTARTS   AGE
kube-prometheus-stack-1690-operator-7ff78d85c9-tjjzt              0/1     ContainerCreating   0          1s
kube-prometheus-stack-1690443793-grafana-6d4bc86d79-xmkts         0/3     ContainerCreating   0          1s
kube-prometheus-stack-1690443793-kube-state-metrics-5b48fcnjxgk   0/1     ContainerCreating   0          1s
kube-prometheus-stack-1690443793-prometheus-node-exporter-qvsm8   0/1     Pending             0          1s
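
Rather than rerunning the cell and reading the table by eye, the status column can be checked programmatically. The sketch below is one possible approach, assuming the default table layout shown above; `pods_not_running` is a hypothetical helper, and the sample text is copied from the output of the previous cell.

```python
def pods_not_running(table):
    """Return (name, status) pairs for pods whose STATUS is not 'Running'."""
    lines = [ln for ln in table.strip().splitlines() if ln.strip()]
    status_idx = lines[0].split().index("STATUS")  # locate the column by header
    waiting = []
    for ln in lines[1:]:
        cols = ln.split()
        if cols[status_idx] != "Running":
            waiting.append((cols[0], cols[status_idx]))
    return waiting


SAMPLE = """
NAME                                                              READY   STATUS              RESTARTS   AGE
kube-prometheus-stack-1690-operator-7ff78d85c9-tjjzt              0/1     ContainerCreating   0          1s
kube-prometheus-stack-1690443793-grafana-6d4bc86d79-xmkts         0/3     ContainerCreating   0          1s
kube-prometheus-stack-1690443793-kube-state-metrics-5b48fcnjxgk   0/1     ContainerCreating   0          1s
kube-prometheus-stack-1690443793-prometheus-node-exporter-qvsm8   0/1     Pending             0          1s
"""

for name, status in pods_not_running(SAMPLE):
    print(status, name)
```

An empty result means every listed pod reports `Running`. Pods that finish normally (status `Completed`) would also be reported by this simple check.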


---
# 10.2 Deploy `dcgm-exporter`

The [dcgm-exporter](https://github.com/NVIDIA/dcgm-exporter) project was built to expose DCGM GPU metrics to Prometheus. Now that Prometheus is deployed, we can deploy `dcgm-exporter`.

In [12]:
! helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts \
  && helm repo update

"gpu-helm-charts" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "nvgfd" chart repository
...Successfully got an update from the "nvdp" chart repository
...Successfully got an update from the "gpu-helm-charts" chart repository
...Successfully got an update from the "prometheus-community" chart repository
Update Complete. ⎈Happy Helming!⎈


In [13]:
! helm install gpu-helm-charts/dcgm-exporter \
   --create-namespace --namespace grafana \
   --generate-name 
!kubectl create -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/dcgm-exporter.yaml

NAME: dcgm-exporter-1690443815
LAST DEPLOYED: Thu Jul 27 07:43:35 2023
NAMESPACE: grafana
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Get the application URL by running these commands:
  export POD_NAME=$(kubectl get pods -n grafana -l "app.kubernetes.io/name=dcgm-exporter,app.kubernetes.io/instance=dcgm-exporter-1690443815" -o jsonpath="{.items[0].metadata.name}")
  kubectl -n grafana port-forward $POD_NAME 8080:9400 &
  echo "Visit http://127.0.0.1:8080/metrics to use your application"
daemonset.apps/dcgm-exporter created
service/dcgm-exporter created
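
The `/metrics` endpoint mentioned in the notes above serves plain text in the Prometheus exposition format, one sample per line. The following sketch shows how such text can be read. The sample lines are hand-written for illustration (the label names and values are assumptions, not captured from the class instance), and the parser is deliberately minimal: it skips comment lines and does not handle optional timestamps or escaped quotes in label values.

```python
import re

# Hand-written sample in the Prometheus text exposition format (illustrative values)
SAMPLE_METRICS = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 37
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node-a"} 52
"""

SAMPLE_LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # skip anything this simple pattern does not cover
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


for name, labels, value in parse_metrics(SAMPLE_METRICS):
    print(name, labels["gpu"], value)
```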


To expose Grafana on the class instance, we need to patch the Grafana service using the [kubectl patch](https://kubernetes.io/docs/tasks/manage-kubernetes-objects/update-api-object-kubectl-patch/) command.  The patch changes the service type to `NodePort` and specifies the port, overriding the previous settings.

In [14]:
# List the Grafana patch
!cat $CONFIG_DIR/grafana-patch.yaml

spec:
  type: NodePort
  ports:
    - port: 80
      nodePort: 31091
      name: grafana
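
`kubectl patch` accepts the patch as either YAML or JSON, so the same change can be built as a Python dictionary and serialized. This is a sketch of an equivalent patch, not the file used in the class; the range check reflects the default Kubernetes NodePort range of 30000-32767.

```python
import json

# Same content as grafana-patch.yaml, expressed as a Python dictionary
patch = {
    "spec": {
        "type": "NodePort",
        "ports": [{"port": 80, "nodePort": 31091, "name": "grafana"}],
    }
}

# The default NodePort range in Kubernetes is 30000-32767
node_port = patch["spec"]["ports"][0]["nodePort"]
assert 30000 <= node_port <= 32767

patch_json = json.dumps(patch)
print(patch_json)
```

The printed string could be passed to `kubectl patch svc` through the `--patch` option in place of the YAML file contents.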


In the next few cells, we will: 
1. Check the status of the Grafana service using the `kubectl get svc` command
1. Make the patch using `kubectl patch svc`
1. Check the status again to see if there is a change after applying the patch

In [15]:
# Check the status - note the TYPE and PORT for patching
!kubectl get svc --namespace prometheus -l app.kubernetes.io/name=grafana 

NAME                                       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
kube-prometheus-stack-1690443793-grafana   ClusterIP   10.104.134.199   <none>        80/TCP    4s


In [16]:
%%bash
# get the GRAFANA_NAME
GRAFANA_NAME=$(kubectl get svc --namespace prometheus -l app.kubernetes.io/name=grafana -o custom-columns=NAME:.metadata.name --no-headers)

# Apply the patch
kubectl patch svc $GRAFANA_NAME \
   -n prometheus \
   --patch "$(cat /dli/task/kubernetes-config/grafana-patch.yaml)"

service/kube-prometheus-stack-1690443793-grafana patched


In [17]:
# Check the status again - note the TYPE and PORT changes
!kubectl get svc --namespace prometheus -l app.kubernetes.io/name=grafana 

NAME                                       TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
kube-prometheus-stack-1690443793-grafana   NodePort   10.104.134.199   <none>        80:31091/TCP   4s
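
The `PORT(S)` column packs several values into one string: before the patch it reads `80/TCP`, and afterwards `80:31091/TCP`, meaning service port 80 is reachable on node port 31091. A small, hypothetical parser makes the structure explicit (it handles a single port entry only, not a comma-separated list):

```python
def parse_port_spec(spec):
    """Split a single PORT(S) entry such as '80:31091/TCP' into its parts."""
    ports, protocol = spec.split("/")
    if ":" in ports:
        port, node_port = ports.split(":")
        return {"port": int(port), "node_port": int(node_port), "protocol": protocol}
    return {"port": int(ports), "node_port": None, "protocol": protocol}


print(parse_port_spec("80/TCP"))        # before the patch (ClusterIP)
print(parse_port_spec("80:31091/TCP"))  # after the patch (NodePort)
```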


---
# 10.3 Explore Prometheus and Grafana

The Grafana interface to Prometheus metrics is now exposed.  
[Open Grafana!](/grafana/)

Grafana greets you with a dark blue page. To log in, use: 
- Username: `admin` 
- Password: `prom-operator` 

The password was originally set in the `kube-prometheus-stack.values` file. If successful, your page should look similar to this:

<img src="images/k8s/grafana_page1.png">

What's next?  Set up a dashboard by importing the NVIDIA DCGM Exporter Dashboard.

## 10.3.1 Exercise: Set Up a Dashboard
For this exercise, follow the instructions in the [GPU telemetry documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/dcgm-exporter.html?highlight=grafana#dcgm-dashboard-in-grafana) section titled "DCGM Dashboard in Grafana". Please STOP after this section and do NOT continue into the next section titled "Viewing Metrics for Running Applications", as this will cause errors later in the course code. 

Basic steps from the instructions:
1. Log in
2. Select "+Import" from the Dashboards menu
3. Load `https://grafana.com/grafana/dashboards/12239`
4. Select "Prometheus" in the "Prometheus" slot
5. Click "Import" to see the dashboard
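
The dashboard panels are driven by PromQL queries against Prometheus, and the same queries can be sent to the Prometheus HTTP API at `/api/v1/query`. The sketch below only builds the request URL. The base URL `http://localhost:9090` is a placeholder (this notebook does not expose Prometheus at that address), and the query `avg(DCGM_FI_DEV_GPU_UTIL)` is just an example expression.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def instant_query_url(base_url, promql):
    """Build the URL for a Prometheus instant query."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": promql})


# Placeholder base URL; substitute wherever Prometheus is reachable in your setup
url = instant_query_url("http://localhost:9090", "avg(DCGM_FI_DEV_GPU_UTIL)")
print(url)
```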

## 10.3.2 Shutdown
Clean up your environment by shutting down K8s.

In [18]:
# Shut down K8s
!minikube delete
# Kill any containers still running (docker prints a usage error if there are none)
!docker kill $(docker ps -q)
# Check for clean environment - this should be empty
!docker ps

🔄  Uninstalling Kubernetes v1.22.2 using kubeadm ...
🔥  Deleting "minikube" in none ...
💀  Removed all traces of the "minikube" cluster.
"docker kill" requires at least 1 argument.
See 'docker kill --help'.

Usage:  docker kill [OPTIONS] CONTAINER [CONTAINER...]

Kill one or more running containers
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES


---
<h2 style="color:green;">Congratulations!</h2>

In this notebook, you have:
- Deployed a Prometheus server
- Modified initialization configurations with settings in `helm install`
- Patched a service with K8s for your environment
- Explored tools for monitoring activity on your production application

Next, you'll deploy a conversational AI Riva application on K8s. <br>
Move on to [Deploy Riva](011_K8s_Deploy_Riva.ipynb).
