
Box specs and assembly

  • 1+ Nvidia 1080Ti GPUs (any modern Nvidia GPU is fine);
  • AMD Threadripper processor;
  • Ample airflow for the GPUs.

Basic steps after installing Ubuntu

Update packages:

sudo apt update
sudo apt upgrade

In my case I also had to enable the universe repository:

# append "universe" to the end of each deb line
sudo nano /etc/apt/sources.list
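
The manual edit above can also be scripted. The following is a minimal sketch, not part of the original setup: `add_universe` is a hypothetical helper name, and it assumes GNU sed. It appends `universe` to every active `deb` line that does not list it yet, and leaves commented lines alone.

```shell
# add_universe FILE: append " universe" to each active "deb" / "deb-src" line
# in FILE that does not already mention the universe component.
# Hypothetical helper (assumes GNU sed); try it on a copy of sources.list first.
add_universe() {
    sed -i -E '/^deb(-src)? /{/(^|[[:space:]])universe([[:space:]]|$)/!s/[[:space:]]*$/ universe/;}' "$1"
}
```

A cautious way to use it is to copy /etc/apt/sources.list somewhere, run `add_universe` on the copy, inspect the result, and only then replace the original as root. On recent Ubuntu releases `sudo add-apt-repository universe` should achieve the same result.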

Optionally set up a simple firewall with ufw.

Creating users and importing their keys

A group for all of the users to share folders:

sudo addgroup ds

Create a user and set up their SSH keys (note that I use the user's GitHub alias as the username):

GROUP='ds' && \
sudo useradd $USER && \
sudo adduser $USER $GROUP && \
sudo mkdir -p /home/$USER/.ssh/ && \
sudo touch /home/$USER/.ssh/authorized_keys && \
sudo chown -R $USER:$USER /home/$USER/.ssh/ && \
sudo wget -O -$USER.keys | sudo tee -a /home/$USER/.ssh/authorized_keys
# sudo adduser $USER sudo
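
The commands above create the .ssh directory and the authorized_keys file but do not set their modes, and sshd may ignore keys that sit in a directory or file writable by other users. Below is a small sketch of how the modes could be set. `prepare_ssh_dir` is a hypothetical helper, not something from the original setup, and it leaves ownership to the chown line above.

```shell
# prepare_ssh_dir HOME_DIR: create HOME_DIR/.ssh and an empty authorized_keys
# file with the conventional restrictive modes (700 for the directory,
# 600 for the file). Hypothetical helper; it does not change ownership.
prepare_ssh_dir() {
    ssh_dir="$1/.ssh"
    mkdir -p "$ssh_dir" &&
    chmod 700 "$ssh_dir" &&
    touch "$ssh_dir/authorized_keys" &&
    chmod 600 "$ssh_dir/authorized_keys"
}
```

For a real user this would be run as root, e.g. `prepare_ssh_dir /home/SOME_USER`, followed by the chown step.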

Basic monitoring and productivity

Use sudo without entering a password each time:

# opens /etc/sudoers for editing; add the line below manually at the bottom of the file
sudo visudo
# username ALL=(ALL) NOPASSWD: ALL

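Prohibit password login (key-based SSH only):

# note: this line additionally allows root to log in over SSH (with keys only, once passwords are disabled below); skip it if you do not need that
sudo sed -i 's/PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
sudo sed -i 's/#PasswordAuthentication yes/PasswordAuthentication no/' /etc/ssh/sshd_config
# restart the SSH daemon to apply the changes
sudo systemctl restart ssh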
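
The sed substitution above only matches the exact text `#PasswordAuthentication yes`, so it silently does nothing if the line is already uncommented or formatted differently. The sketch below is one way to make the change more robust. `disable_password_auth` is a hypothetical helper and assumes GNU sed and grep.

```shell
# disable_password_auth FILE: force "PasswordAuthentication no" in an
# sshd_config-style FILE, whether the existing directive is commented out
# or not. Hypothetical helper; appends the directive if no such line exists.
disable_password_auth() {
    pattern='^[[:space:]]*#?[[:space:]]*PasswordAuthentication[[:space:]].*'
    if grep -Eq "$pattern" "$1"; then
        sed -i -E "s/$pattern/PasswordAuthentication no/" "$1"
    else
        printf 'PasswordAuthentication no\n' >> "$1"
    fi
}
```

When applying it to /etc/ssh/sshd_config from a root shell, `sshd -t` can be used afterwards to check the file for syntax errors before the daemon is restarted.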


Install basic monitoring and productivity tools:

sudo apt install python3-pip
sudo apt install tmux
sudo apt install glances
sudo pip3 install gpustat
sudo apt install lm-sensors
sudo apt install ncdu
sudo apt install unzip
sudo apt install openvpn
sudo apt install traceroute

Installing NVIDIA drivers

sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt install nvidia-driver-390
# reboot

Mounting drives

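Ensure that all members of the group can read and write in the folder and that new files inherit the group. Note that mode 2770 (setgid) alone does not stop users from deleting other people's files; that requires the sticky bit as well (3770):

sudo mkdir /mnt/nvme
sudo chown -R :ds /mnt/nvme/
sudo chmod 2770 /mnt/nvme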
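
To make the meaning of the leading mode digit concrete, here is a minimal sketch of a shared-directory helper. `make_shared_dir` is a hypothetical name; the default of 3770 (setgid plus sticky) is my assumption about what fits the stated goal, not the mode used in the original commands.

```shell
# make_shared_dir DIR [MODE]: create DIR and set its mode (default 3770).
#   2xxx (setgid): new files inherit the directory's group
#   1xxx (sticky): only a file's owner (or root) may delete or rename it
# Hypothetical helper; the group is assigned separately with chown.
make_shared_dir() {
    mkdir -p "$1" && chmod "${2:-3770}" "$1"
}
```

For the folder above this would be `make_shared_dir /mnt/nvme` run as root, with the chown step unchanged.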

Mount a pre-formatted NVMe drive with data on it:

sudo mount /dev/nvme1n1p1 /mnt/nvme
# look up the partition UUID for the fstab entry below
sudo blkid /dev/nvme1n1p1
echo 'UUID=379eade4-cf4e-4b42-bb49-efa025e81650 /mnt/nvme ext4 defaults 0 0' | sudo tee -a /etc/fstab
sudo mount -a
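
Running the `tee -a` line twice leaves a duplicate entry in /etc/fstab. The sketch below shows one way to make the append idempotent. `add_fstab_entry` is a hypothetical helper that checks the mount point column before writing.

```shell
# add_fstab_entry FILE LINE: append LINE to FILE unless an uncommented entry
# with the same mount point (second column) is already present.
# Hypothetical helper; run it from a root shell for the real /etc/fstab.
add_fstab_entry() {
    mp=$(printf '%s\n' "$2" | awk '{print $2}')
    if awk -v mp="$mp" '$1 !~ /^#/ && $2 == mp {found=1} END {exit !found}' "$1"; then
        echo "skip: $mp is already listed in $1" >&2
    else
        printf '%s\n' "$2" >> "$1"
    fi
}
```

Called with the fstab path and the same UUID line as above, a second invocation only prints the skip message.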

Tools for DL environment

Docker CE

Just follow the relevant steps from the official Docker CE installation instructions for Ubuntu.

sudo apt-get update

sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \

curl -fsSL | sudo apt-key add -
sudo apt-key fingerprint 0EBFCD88

# for newer / odd Ubuntu versions you may have to tweak here
sudo add-apt-repository \
   "deb [arch=amd64] \
   $(lsb_release -cs) \

sudo apt-get update
sudo apt-get install docker-ce
sudo docker run hello-world

Add all of your users to the docker group so that they can run docker without sudo:

sudo usermod -aG docker $USER
# do not forget to log out and back in for the group change to take effect

Clean up:

# delete all containers
docker rm $(docker ps -a -q)
# delete all images
docker rmi $(docker images -q)
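
When there are no containers or images, the command substitutions above expand to nothing and docker exits with an error about missing arguments. A minimal sketch of a variant that simply does nothing in that case is shown below. The two function names are hypothetical.

```shell
# remove_all_containers / remove_all_images: same effect as the clean-up
# commands above, but a no-op when the list of IDs is empty.
# Hypothetical helpers.
remove_all_containers() {
    ids=$(docker ps -a -q)
    if [ -n "$ids" ]; then
        # word splitting on $ids is intentional: one argument per container ID
        docker rm $ids
    fi
}

remove_all_images() {
    ids=$(docker images -q)
    if [ -n "$ids" ]; then
        docker rmi $ids
    fi
}
```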

Nvidia-docker 2

Just follow the official nvidia-docker installation instructions.

Add the package repositories:

curl -s -L | \
  sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update

Install nvidia-docker2 and reload the Docker daemon configuration:

sudo apt-get install -y nvidia-docker2
sudo pkill -SIGHUP dockerd

Test nvidia-smi with an official CUDA image:

docker run --runtime=nvidia --rm nvidia/cuda:9.0-cudnn7-devel nvidia-smi

Clean up:

# delete all containers
docker rm $(docker ps -a -q)
# delete all images
docker rmi $(docker images -q)

Basic dockerfile

Located in Dockerfile.

Key dependencies:

  • Builds from official Ubuntu + CUDA image;
  • Python from miniconda;
  • Keras, TF and PyTorch;
  • Basic DS / ML libraries;

Connect to other cluster machines via 10 Gb/s LAN

You need 2 machines, each with a 10 Gbit/s port. Connect them with a Cat 6A patch cord.

Find the physical Ethernet devices on both machines:

ip a l

Make sure that neither device has an IP allocated (i.e. they are NOT the primary Internet connections), then assign an address to the device on each machine:

sudo ip ad add dev enp7s0
sudo ip ad add dev p2p1

Test speed.

Port forwarding / ports available

Ports forwarded for DS / ML work:

  • Host ssh port 8027;
  • Jupyter ports 8882 8883 8884;
  • TensorBoard ports 6001 6002 6003;
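
As an illustration of how such a port list might be passed to a container, the sketch below builds docker `-p` flags from a list of ports. `publish_flags` is a hypothetical helper, and it assumes that each service listens on the same port number inside the container as on the host, which the list above does not state.

```shell
# publish_flags PORT...: print a "-p PORT:PORT" flag for every port given,
# each followed by a single space. Hypothetical helper that assumes
# identical host and container port numbers.
publish_flags() {
    for port in "$@"; do
        printf '%s %s:%s ' -p "$port" "$port"
    done
}
```

For example, `docker run $(publish_flags 8882 6001) SOME_IMAGE` would publish one Jupyter port and one TensorBoard port, where SOME_IMAGE is a placeholder.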

LVM array on the second box

Device level

See all compatible drives:

sudo lvmdiskscan

LVM physical devices

sudo lvmdiskscan -l
sudo pvscan
sudo pvs
sudo pvdisplay

Volume groups

The vgscan command scans the system for available volume groups and rebuilds the cache file when necessary. It is a good command to use when importing a volume group into a new system.

sudo vgscan
sudo vgs -o +devices,lv_path
sudo vgdisplay -v

Logical volumes

sudo lvscan
sudo lvs
sudo lvs --segments
sudo lvdisplay -m


Register both partitions as LVM physical volumes:

sudo pvcreate /dev/nvme0n1p1 /dev/nvme1n1p1
# sudo pvremove
sudo lvmdiskscan -l
# WARNING: only considering LVM devices
# /dev/nvme0n1p1 [     931.51 GiB] LVM physical volume
# /dev/nvme1n1p1 [     931.51 GiB] LVM physical volume
# 0 LVM physical volume whole disks
# 2 LVM physical volumes


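Create a volume group from both physical volumes and a logical volume on top of it:

sudo vgcreate nvme_drives /dev/nvme0n1p1 /dev/nvme1n1p1
# to rename a volume group: vgrename OLD_NAME NEW_NAME, e.g. vgrename fileserver data
sudo lvcreate --name nvme_lv --size 1.8T nvme_drives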
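
The 1.8T figure comes from adding up the two physical volumes. As a sketch, the sum can be computed from the `lvmdiskscan -l` listing itself. `total_pv_gib` is a hypothetical helper and assumes the output format shown in the comments above, with sizes reported in GiB.

```shell
# total_pv_gib: read `lvmdiskscan -l` output on stdin and print the combined
# size, in GiB, of all lines describing an LVM physical volume.
# Hypothetical helper; only lines of the form "[ <size> GiB] LVM physical volume" count.
total_pv_gib() {
    awk '/GiB\] LVM physical volume/ {
        sub(/.*\[ */, "")      # drop everything up to and including "[" and padding
        sub(/ GiB\].*/, "")    # drop the unit and the rest of the line
        sum += $0
    } END { printf "%.2f\n", sum }'
}
```

With the listing above, `sudo lvmdiskscan -l | total_pv_gib` prints 1863.02, i.e. roughly 1.82 TiB, so a 1.8T logical volume fits. If the goal is simply to use all remaining space in the volume group, `lvcreate -l 100%FREE` is an alternative to a fixed size.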

Creating FS and mounting

sudo mkfs.ext4 /dev/nvme_drives/nvme_lv
sudo mount /dev/nvme_drives/nvme_lv /mnt/nvme


Download and install

Possible options:

1) Download precompiled versions;

2) The hard way: build everything yourself.

1. Prometheus:

$ mkdir -p $GOPATH/src/
$ cd $GOPATH/src/
$ git clone
$ cd prometheus
$ make build

2. Alertmanager:

$ mkdir -p $GOPATH/src/
$ cd $GOPATH/src/
$ git clone
$ cd alertmanager
$ make build

3. Node_exporter:

$ go get
$ cd ${GOPATH-$HOME/go}/src/
$ make

4. Nvidia_gpu_prometheus_exporter:

$ go get
$ cd ${GOPATH-$HOME/go}/src/
$ make

Yaml files

1. prometheus.yml

Prometheus config file; the example that I use:

global:
  # Set the scrape interval to every 15 seconds
  scrape_interval: 15s
  # Evaluate rules every 15 seconds
  evaluation_interval: 15s

# Alertmanager configuration
alerting:
  alertmanagers:
  - scheme: http
    static_configs:
    - targets:
      # Alertmanager port
      - "localhost:9093"

# Rules yaml file (additional metrics + alert rules)
rule_files:
  - "alert_rules.yml"

# A scrape configuration
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  # Where to listen to additional metric exporters
  # GPU metrics (Nvidia_gpu_prometheus_exporter)
  - job_name: 'gpu'
    static_configs:
    - targets: ['localhost:9445']

  # Node metrics (Node_exporter)
  - job_name: 'node'
    static_configs:
    - targets: ['localhost:9100']

2. alertmanager.yml

Alertmanager config file, example:

global:
  smtp_from: SOME_EMAIL
  smtp_auth_username: USERNAME
  smtp_auth_password: PASSWORD
  smtp_auth_identity: IDENTITY(EMAIL)

route:
  group_by: [alertname]
  receiver: email-me
  # How long to wait before sending a notification again if it has already
  # been sent successfully for an alert
  repeat_interval: 2h

receivers:
- name: email-me
  email_configs:
  - to:,

3. alert_rules.yml

Metric rules for Alertmanager (when to fire an alarm), example:

groups:
- name: box_2_stats
  rules:
  - alert:
    # Fire when average gpu temperature for past 2 minutes more than 90 celsius
    expr: avg_over_time(nvidia_gpu_temperature_celsius[2m]) > 90
    for: 30s
    annotations:
      description: "{{ $value }} celsius mean GPU temperature for past 2 minutes!"

  - alert:
    # Fire when average cpu temperature for past 2 minutes more than 75 celsius
    expr: avg_over_time(node_hwmon_temp_celsius[2m]) > 75
    for: 30s
    annotations:
      description: "{{ $value }} celsius mean CPU temperature for past 2 minutes!"

  - alert: RAM
    # Fire when available RAM < 500 MB
    expr: round((node_memory_MemAvailable_bytes) / 1024 / 1024) < 500
    for: 1m
    annotations:
      description: "Only {{ $value }} MB RAM available!"

  - alert: NVME_MEM
    # Alert when less than 10 GB available on NVME disk
    expr: round(node_filesystem_avail_bytes{mountpoint="/mnt/nvme"} / 1024 / 1024 / 1024) < 10
    for: 5m
    annotations:
      description: "{{ $value }} space available on /mnt/nvme!"

  - alert: SDE1_MEM
    expr: round(node_filesystem_avail_bytes{device="/dev/sde1"} / 1024 / 1024 / 1024) < 5
    for: 5m
    annotations:
      description: "{{ $value }} space available on SDE1!"

  - alert: MD
    # Fire when disk is not available
    expr: (node_md_disks_active - node_md_disks) > 0
    for: 10m
    annotations:
      description: "Something wrong with disk"

  - alert: IOWAIT
    # Fire when iowait > 0.85 per second for past 5 minutes
    expr: rate(node_cpu_seconds_total{mode="iowait"}[5m]) > 0.85
    for: 5m
    annotations:
      description: "High iowait value! ({{ $value }} mean for past five minutes)"

Run everything

I use this .sh script in tmux to run everything at once:

cd node_exporter-0.18.1.linux-amd64/ && ./node_exporter \
& nvidia-docker run -p 9445:9445 -ti mindprince/nvidia_gpu_prometheus_exporter:0.1 \
& cd alertmanager-0.17.0.linux-amd64/ && ./alertmanager --config.file=alertmanager.yml \
& cd prometheus-2.10.0.linux-amd64/ && ./prometheus --config.file=./prometheus.yml --web.listen-address=""

Kill everything

Ctrl+C in the tmux session stops the processes.

To make sure everything is off, use pgrep -f "alertmanager|node_exporter|prometheus" and then kill -TERM the remaining processes.

Nvidia_gpu_prometheus_exporter can be stopped by shutting down its docker container.
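
Matching processes by name with pgrep -f can also catch unrelated commands whose command line happens to contain one of the words. As an alternative sketch, the PIDs can be recorded when the processes are started. `start_bg` and `stop_all` are hypothetical helpers, and the PID-file approach is my assumption, not how the original script works.

```shell
# start_bg PIDFILE COMMAND...: run COMMAND in the background and append its
# PID to PIDFILE.
# stop_all PIDFILE: send SIGTERM to every PID recorded in PIDFILE, then
# empty the file. Both helpers are hypothetical.
start_bg() {
    pidfile=$1
    shift
    "$@" &
    echo $! >> "$pidfile"
}

stop_all() {
    while read -r pid; do
        # ignore PIDs that have already exited
        kill -TERM "$pid" 2>/dev/null || true
    done < "$1"
    : > "$1"
}
```

A possible use is `start_bg /tmp/monitoring.pids ./node_exporter` from inside the node_exporter directory, and later `stop_all /tmp/monitoring.pids`. The PID file should be given as an absolute path so that it stays the same after changing directories.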

Use VS Code remote SSH development on Windows 10

  • Docker: set up port forwarding and docker container ports:
    -- E.g. my port is 8022;
    -- Router port forwarding (i.e. your port will be 8023) or a local ssh tunnel, i.e. => 8023;
    -- Expose the port within the Docker container in EXPOSE;
    -- Do Docker port forwarding when launching a container, i.e. -p 8023:22;
    -- Turn on the ssh daemon within the container (service ssh start) and test it; this should be done each time. See the Dockerfiles;
    -- Create the /keras/.ssh/authorized_keys file within the container and paste your public key there.
  • VS Code setup on Windows:
    -- Download and install VS Code (it used to say that you needed their bleeding-edge build, but the normal build works as well now);
    -- Install the ssh remote development plugin;
    -- Create a VS Code ssh config (I had to google their forums):

Host example-remote-linux-machine-with-identity-file
    User keras
    Port 8022
    IdentityFile D:\CATS\ARE\FLUFFY\private_key_in_open_ssh_format.ppk

    -- You will have the following problems on Windows 10:
      --- You will have to create the USER/.ssh folder;
      --- You will have to set up permissions for the ssh private key file;
      --- Some other similar failure that I do not remember.
  • Useful extensions that I think are important:
    -- Python;
    -- Linting (flake8).
  • OpenSSH format:
    -- If you use PuTTY to create keys, you may need to use PuTTYgen to convert the key to the standard OpenSSH format.