Example Applications

CSI Ephemeral Volume Example

# replace <bucket-name> with your pre-provisioned GCS bucket name
GCS_BUCKET_NAME=your-bucket-name
sed -i "s/<bucket-name>/$GCS_BUCKET_NAME/g" ./examples/ephemeral/deployment.yaml
sed -i "s/<bucket-name>/$GCS_BUCKET_NAME/g" ./examples/ephemeral/deployment-non-root.yaml
sed -i "s/<bucket-name>/$GCS_BUCKET_NAME/g" ./examples/ephemeral/deployment-two-vols.yaml

# install Deployments using CSI Ephemeral Inline volumes
kubectl apply -f ./examples/ephemeral/deployment.yaml
kubectl apply -f ./examples/ephemeral/deployment-non-root.yaml
kubectl apply -f ./examples/ephemeral/deployment-two-vols.yaml

# clean up
kubectl delete -f ./examples/ephemeral/deployment.yaml
kubectl delete -f ./examples/ephemeral/deployment-non-root.yaml
kubectl delete -f ./examples/ephemeral/deployment-two-vols.yaml
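
# For reference, the inline volume in these example manifests follows the general
# shape sketched below. This is NOT the exact content of the example YAML files:
# the Pod/container names and image are illustrative, and it assumes the CSI driver
# is installed and the gcs-csi Kubernetes service account has access to the bucket.
# The driver name (gcsfuse.csi.storage.gke.io) and the gke-gcsfuse/volumes annotation
# come from the Cloud Storage FUSE CSI driver documentation.
# --dry-run=client only validates the manifest locally and creates nothing.
cat <<EOF | kubectl apply --dry-run=client -f -
apiVersion: v1
kind: Pod
metadata:
  name: gcsfuse-ephemeral-sketch
  annotations:
    gke-gcsfuse/volumes: "true"
spec:
  serviceAccountName: gcs-csi
  containers:
  - name: reader
    image: busybox
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: gcs-fuse-csi-ephemeral
      mountPath: /data
  volumes:
  - name: gcs-fuse-csi-ephemeral
    csi:
      driver: gcsfuse.csi.storage.gke.io
      volumeAttributes:
        bucketName: $GCS_BUCKET_NAME
EOF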

Static Provisioning Example

# replace <bucket-name> with your pre-provisioned GCS bucket name
GCS_BUCKET_NAME=your-bucket-name
sed -i "s/<bucket-name>/$GCS_BUCKET_NAME/g" ./examples/static/pv-pvc-deployment.yaml
sed -i "s/<bucket-name>/$GCS_BUCKET_NAME/g" ./examples/static/pv-pvc-deploymen-non-root.yaml

# install PV/PVC and a Deployment
kubectl apply -f ./examples/static/pv-pvc-deployment.yaml
kubectl apply -f ./examples/static/pv-pvc-deployment-non-root.yaml
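
# (optional) confirm the PersistentVolume and PersistentVolumeClaim bind successfully
# before expecting the Deployment Pods to start
kubectl get pv
kubectl get pvc --all-namespaces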

# clean up
# the PV deletion will not delete your GCS bucket
kubectl delete -f ./examples/static/pv-pvc-deployment.yaml
kubectl delete -f ./examples/static/pv-pvc-deployment-non-root.yaml

Batch Job Example

# replace <bucket-name> with your pre-provisioned GCS bucket name
GCS_BUCKET_NAME=your-bucket-name
sed -i "s/<bucket-name>/$GCS_BUCKET_NAME/g" ./examples/batch-job/job.yaml

# install a Job using CSI Ephemeral Inline volume
kubectl apply -f ./examples/batch-job/job.yaml
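
# (optional) wait for the Job to complete and inspect its output before cleaning up;
# replace <job-name> with the Job name defined in ./examples/batch-job/job.yaml
kubectl wait --for=condition=complete job/<job-name> --timeout=600s
kubectl logs job/<job-name>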

# clean up
kubectl delete -f ./examples/batch-job/job.yaml

PyTorch Application Example

This example is inspired by the TensorFlow example in the Cloud Storage FUSE repo. The training jobs in this repo run exactly the same code as the Cloud Storage FUSE repo, with GKE-specific settings.

Prerequisites

If you are using a GKE Autopilot cluster, you do not need to do anything in this step.

# if you are using a Standard cluster, add a new node pool with GPUs:
CLUSTER_NAME=cluster-name
ZONE=node-pool-zone
gcloud container node-pools create gpu-test-pool \
  --accelerator type=nvidia-tesla-a100,count=2 \
  --zone ${ZONE} --cluster ${CLUSTER_NAME} \
  --num-nodes 1 \
  --machine-type a2-highgpu-2g

# install the nvidia driver
# see the GKE doc for details: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
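
# (optional) confirm the GPU nodes registered and the driver installer is running;
# the DaemonSet name below matches the manifest applied above, adjust it if your
# driver installer version uses a different name
kubectl get nodes -l cloud.google.com/gke-accelerator
kubectl get daemonset nvidia-driver-installer -n kube-system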

Prepare the training dataset

Follow the steps below to download the dataset from Kaggle, then unzip it and upload it to a GCS bucket. You only need to do this step once.

# replace <bucket-name> with your pre-provisioned GCS bucket name
GCS_BUCKET_NAME=your-bucket-name
sed -i "s/<bucket-name>/$GCS_BUCKET_NAME/g" ./examples/pytorch/data-loader-job.yaml

# replace <kaggle-key> with your Kaggle API key
# Go to https://www.kaggle.com to create a Kaggle account if necessary, then see the
# "Authentication" section of https://www.kaggle.com/docs/api for how to get your
# Kaggle API key. The format is {"username":"xxx","key":"xxx"}.
KAGGLE_KEY=your-kaggle-key
sed -i "s/<kaggle-key>/$KAGGLE_KEY/g" ./examples/pytorch/data-loader-job.yaml

# create the Kubernetes service account and namespace used by the example manifests
kubectl create serviceaccount gcs-csi --namespace default
kubectl create namespace gcs-csi-example

# prepare the data
kubectl apply -f ./examples/pytorch/data-loader-job.yaml
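
# (optional) follow the data loader Job and confirm the dataset landed in the bucket;
# replace <data-loader-job-name> and <namespace> with the Job name and namespace
# used in data-loader-job.yaml
kubectl logs -f job/<data-loader-job-name> -n <namespace>
gcloud storage ls gs://$GCS_BUCKET_NAME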

# clean up
kubectl delete -f ./examples/pytorch/data-loader-job.yaml

PyTorch training job

# replace <bucket-name> with your pre-provisioned GCS bucket name
GCS_BUCKET_NAME=your-bucket-name
sed -i "s/<bucket-name>/$GCS_BUCKET_NAME/g" ./examples/pytorch/train-job-pytorch.yaml

# start the pytorch training job
kubectl apply -f ./examples/pytorch/train-job-pytorch.yaml
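
# (optional) watch the training Pod and follow its logs; the same commands apply to
# the DLC variant below. Replace <training-pod-name> with the Pod created by the Job.
kubectl get pods -o wide
kubectl logs -f <training-pod-name>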

# clean up
kubectl delete -f ./examples/pytorch/train-job-pytorch.yaml

PyTorch training job in Deep Learning Container (DLC)

# replace <bucket-name> with your pre-provisioned GCS bucket name
GCS_BUCKET_NAME=your-bucket-name
sed -i "s/<bucket-name>/$GCS_BUCKET_NAME/g" ./examples/pytorch/train-job-pytorch-dlc.yaml

# start the pytorch training job
kubectl apply -f ./examples/pytorch/train-job-pytorch-dlc.yaml

# clean up
kubectl delete -f ./examples/pytorch/train-job-pytorch-dlc.yaml

TensorFlow Application Example

This example is inspired by the TensorFlow example in the Cloud Storage FUSE repo. The training jobs in this repo run exactly the same code as the Cloud Storage FUSE repo, with GKE-specific settings.

Prerequisites

See the Prerequisites for the PyTorch applications above. The prerequisites are the same for TensorFlow applications.

Prepare the training dataset

Follow the imagenet2012 training dataset documentation to download the dataset from ImageNet. You need to manually download the dataset to a local filesystem, then unzip it and upload it to your bucket.
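
If you use the Cloud SDK for the upload, the commands below are a minimal sketch; the local path is a placeholder for wherever you extracted the dataset.

# replace <bucket-name> with your pre-provisioned GCS bucket name
GCS_BUCKET_NAME=your-bucket-name
gcloud storage cp -r /path/to/imagenet2012 gs://$GCS_BUCKET_NAME/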

TensorFlow training job in Deep Learning Container (DLC)

# replace <bucket-name> with your pre-provisioned GCS bucket name
GCS_BUCKET_NAME=your-bucket-name
sed -i "s/<bucket-name>/$GCS_BUCKET_NAME/g" ./examples/pytorch/train-job-tensorflow-dlc.yaml

# start the tensorflow training job
kubectl apply -f ./examples/tensorflow/train-job-tensorflow-dlc.yaml

# clean up
kubectl delete -f ./examples/tensorflow/train-job-tensorflow-dlc.yaml

Jupyter Notebook Example (without experimental read cache)

# replace <bucket-name> with your pre-provisioned GCS bucket name
GCS_BUCKET_NAME=your-bucket-name
sed -i "s/<bucket-name>/$GCS_BUCKET_NAME/g" ./examples/jupyter/jupyter-notebook-server.yaml

# install a Jupyter Notebook server using CSI Ephemeral Inline volume
kubectl apply -f ./examples/jupyter/jupyter-notebook-server.yaml
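
# (optional) wait until the notebook Pod is Ready before port-forwarding;
# the Pod name matches the port-forward target below
kubectl wait --for=condition=Ready pod/jupyter-notebook-server --timeout=300s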

# access the Jupyter Notebook via http://localhost:8888
kubectl port-forward jupyter-notebook-server 8888:8888

# clean up
kubectl delete -f ./examples/jupyter/jupyter-notebook-server.yaml

Jupyter Notebook Example (with experimental read cache)

Prerequisites

  1. Your node pool must be created with ephemeral local SSDs, as described in https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/local-ssd#node-pool

  2. Your node pool must have a GPU accelerator. This example uses nvidia-tesla-t4, but you can use another GPU type (just make sure to update the nodeSelector in the YAML below if so). See the quick check after this list.
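
Both prerequisites can be spot-checked with kubectl. The accelerator label below is standard on GKE GPU nodes; the local SSD label name is an assumption, so inspect the node labels directly if it matches nothing.

# check for GPU nodes with the expected accelerator type
kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-tesla-t4

# check for nodes provisioned with ephemeral local SSD (label name is an assumption)
kubectl get nodes -l cloud.google.com/gke-ephemeral-storage-local-ssd=true
kubectl get nodes --show-labels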

Steps

# 1. replace <bucket-name> with your pre-provisioned GCS bucket name
GCS_BUCKET_NAME=your-bucket-name
sed -i "s/<bucket-name>/$GCS_BUCKET_NAME/g" ./examples/jupyter/jupyter-experimental-readcache.yaml

# 2. install a Jupyter Notebook server using experimental gcsfuse read cache
kubectl apply -f ./examples/jupyter/jupyter-experimental-readcache.yaml

# 3. get service IPs
kubectl get services -n example

# 4. open Jupyter
#  a. copy the EXTERNAL-IP of the tensorflow-jupyter server
#  b. open the IP address in a browser
#  c. enter the token "jupyter" (from the yaml)

# 5. (optional) clean up
kubectl delete -f ./examples/jupyter/jupyter-experimental-readcache.yaml