# Perform time series forecasting on Google Kubernetes Engine with NVIDIA GPUs

In this example, we look at a real-world **time series forecasting** problem using data from [the M5 Forecasting Competition](https://www.kaggle.com/competitions/m5-forecasting-accuracy). Walmart provides historical sales data from multiple stores in three states, and our job is to predict sales over a future 28-day period.

## Prerequisites

### Prepare GKE cluster

To run the example, you will need a working Google Kubernetes Engine (GKE) cluster with access to NVIDIA GPUs. Use the following resources to set up a cluster:

* [Set up a GKE cluster with access to NVIDIA GPUs](https://docs.rapids.ai/deployment/stable/cloud/gcp/gke/)
* [Install the Dask-Kubernetes operator](https://kubernetes.dask.org/en/latest/operator_installation.html)
* [Install Kubeflow](https://www.kubeflow.org/docs/started/installing-kubeflow/)

Kubeflow is not strictly necessary, but we highly recommend it because it provides a notebook environment for running this notebook inside the Kubernetes cluster. (You may choose any installation method; we tested this example after installing Kubeflow from manifests.) When creating the notebook environment, use the following configuration:

* 2 CPUs, 16 GiB of memory
* 1 NVIDIA GPU
* 40 GiB disk volume

After uploading all the notebooks in the example, run this notebook (`start_here.ipynb`) in the notebook environment.

Note: We will use the worker pods to speed up the training stage. The preprocessing steps will run solely on the scheduler node.

### Prepare a bucket in Google Cloud Storage

Create a new bucket in Google Cloud Storage. Make sure that the worker pods in the Kubernetes cluster have read/write access to this bucket. You can grant this access in one of the following ways:

1. Option 1: Specify an additional scope when provisioning the GKE cluster.

   When you are provisioning a new GKE cluster, add the `storage-rw` scope.
   This option is only available if you are creating a new cluster from scratch. If you are using an existing GKE cluster, see Option 2.

   Example:
```
gcloud container clusters create my-new-cluster --accelerator type=nvidia-tesla-t4 \
   --machine-type n1-standard-32 --zone us-central1-c --release-channel stable \
   --num-nodes 5 --scopes=gke-default,storage-rw
```

2. Option 2: Grant bucket access to the associated service account.

   Find out which service account is associated with your GKE cluster, then grant that service account access to the bucket: navigate to the Cloud Storage console, open the Bucket Details page for the bucket, open the Permissions tab, and click Grant Access.
   
Enter the name of the bucket that your cluster has read-write access to:

In [1]:
bucket_name = "<Put the name of the bucket here>"

### Install Python packages in the notebook environment

In [1]:
!pip install kaggle gcsfs dask-kubernetes optuna

In [4]:
# Test if the bucket is accessible
import gcsfs

fs = gcsfs.GCSFileSystem()
fs.ls(f"{bucket_name}/")

[]
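
Listing the bucket only confirms read access, while the worker pods need read/write access. The following is a minimal sketch of a round-trip check, assuming `fs` is an fsspec-style filesystem such as the `gcsfs.GCSFileSystem()` object created above. The helper name `check_read_write` and the probe file name are hypothetical and not part of the original example.

```python
import uuid


def check_read_write(fs, bucket_name):
    """Write a small probe object to the bucket, read it back, then delete it.

    `fs` is assumed to be an fsspec-style filesystem (for example
    gcsfs.GCSFileSystem()). Returns True if the payload round-trips.
    """
    # Hypothetical probe name; the random suffix avoids clobbering real objects.
    probe_path = f"{bucket_name}/rw_probe_{uuid.uuid4().hex}.txt"
    payload = b"ok"

    # Writing raises an exception if the credentials lack write access.
    with fs.open(probe_path, "wb") as f:
        f.write(payload)

    try:
        with fs.open(probe_path, "rb") as f:
            return f.read() == payload
    finally:
        # Remove the probe object so the bucket is left unchanged.
        fs.rm(probe_path)
```

In the notebook, this could be called as `check_read_write(fs, bucket_name)`. An exception during the write step suggests that the bucket permissions described above are not yet in place.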

## Obtain the time series data set from Kaggle

If you do not yet have a Kaggle account, create one now. Then follow the instructions in the [Kaggle Public API documentation](https://www.kaggle.com/docs/api) to obtain an API key. This step is needed to download the training data from the M5 Forecasting Competition. Once you have obtained the API key, fill in the following:

In [5]:
kaggle_username = "<Put your Kaggle username here>"
kaggle_api_key = "<Put your Kaggle API key here>"

Now we are ready to download the data set:

In [7]:
%env KAGGLE_USERNAME=$kaggle_username
%env KAGGLE_KEY=$kaggle_api_key

!kaggle competitions download -c m5-forecasting-accuracy
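
Note that `%env` echoes each assignment, so the API key ends up in the saved notebook output. As an alternative, here is a minimal sketch that sets the same two environment variables through `os.environ` without printing them. The helper name `set_kaggle_credentials` and its placeholder check are hypothetical additions, not part of the original example.

```python
import os


def set_kaggle_credentials(username, api_key):
    """Export Kaggle credentials to the process environment without echoing them."""
    # Guard against running with the unedited "<Put your ... here>" placeholders.
    for value in (username, api_key):
        if not value or value.startswith("<"):
            raise ValueError("Replace the placeholder Kaggle username and API key first.")

    # Same variable names as the %env lines above.
    os.environ["KAGGLE_USERNAME"] = username
    os.environ["KAGGLE_KEY"] = api_key
```

Calling `set_kaggle_credentials(kaggle_username, kaggle_api_key)` before the `!kaggle` command should have the same effect, since shell commands started from the notebook inherit the kernel's environment.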

Let's unzip the ZIP archive and see what's inside.

In [8]:
!unzip m5-forecasting-accuracy.zip -d data/

Archive:  m5-forecasting-accuracy.zip
  inflating: data/calendar.csv       
  inflating: data/sales_train_evaluation.csv  
  inflating: data/sales_train_validation.csv  
  inflating: data/sample_submission.csv  
  inflating: data/sell_prices.csv    
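
If the `unzip` command is not available in your notebook image, the same step can be sketched with Python's standard `zipfile` module. The helper name `extract_archive` is hypothetical; the archive name and the `data/` destination follow the command above.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="data"):
    """Extract every member of a ZIP archive and return the sorted member names."""
    dest = Path(dest_dir)
    # Create the destination directory if it does not exist yet.
    dest.mkdir(parents=True, exist_ok=True)

    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(dest)
        return sorted(zf.namelist())
```

For example, `extract_archive("m5-forecasting-accuracy.zip", "data")` would be the counterpart of the shell command shown above.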


In [9]:
!ls -lh data/*.csv

-rw-r--r-- 1 root users 102K Jun  1  2020 data/calendar.csv
-rw-r--r-- 1 root users 117M Jun  1  2020 data/sales_train_evaluation.csv
-rw-r--r-- 1 root users 115M Jun  1  2020 data/sales_train_validation.csv
-rw-r--r-- 1 root users 5.0M Jun  1  2020 data/sample_submission.csv
-rw-r--r-- 1 root users 194M Jun  1  2020 data/sell_prices.csv
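
Several of these files are over 100 MB, so it can help to look at only the header and a few rows before starting the preprocessing. The following is a minimal sketch using the standard `csv` module; the helper name `peek_csv` is hypothetical and is not used by the preprocessing notebooks.

```python
import csv


def peek_csv(path, n_rows=3):
    """Return the header and the first n_rows data rows of a CSV file."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        # zip() stops after n_rows, so the rest of the file is never read.
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows
```

For example, `header, rows = peek_csv("data/calendar.csv")` returns the column names and the first three rows of the calendar file.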


## Next steps

We are now ready to run the preprocessing steps. Run the six preprocessing notebooks in order to process the raw data into a form that can be used for model training, and then run the training and evaluation notebook:

* `preprocessing_part1.ipynb`
* `preprocessing_part2.ipynb`
* `preprocessing_part3.ipynb`
* `preprocessing_part4.ipynb`
* `preprocessing_part5.ipynb`
* `preprocessing_part6.ipynb`
* `training_and_evaluation.ipynb`
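
The notebooks are meant to be opened and run one at a time. If you prefer to run them unattended, the order above could be scripted roughly as follows. This is only a sketch: it assumes that `jupyter nbconvert` is installed in the notebook environment and that each notebook runs without manual edits, which may not hold if a notebook asks you to fill in values such as the bucket name. The helper names are hypothetical.

```python
import subprocess


def notebook_run_order(n_preprocessing_parts=6):
    """Return the notebook file names in the order listed above."""
    names = [f"preprocessing_part{i}.ipynb" for i in range(1, n_preprocessing_parts + 1)]
    names.append("training_and_evaluation.ipynb")
    return names


def run_notebooks(names):
    """Execute each notebook in place, stopping at the first failure."""
    for name in names:
        # check=True raises CalledProcessError if a notebook fails,
        # so later notebooks never run on incomplete data.
        subprocess.run(
            ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", name],
            check=True,
        )
```

Calling `run_notebooks(notebook_run_order())` would execute all seven notebooks sequentially.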
