# TensorFlow Keras CIFAR-10 Training with MLRun

<a id="section_1"></a>
## 1. Import

In [1]:
import mlrun

<a id="section_2"></a>
## 2. Setup MLRun project

In [2]:
# Initialize the MLRun environment and save the project name and artifacts path:
project_name, artifact_path = mlrun.set_environment(
    project="keras-cifar10", user_project=True
)

> 2022-05-24 15:08:47,127 [info] loaded project keras-cifar10-jovyan from MLRun DB


<a id="section_3"></a>
## 3. Training the Model Using MLRun

Next, we create the MLRun function and run it several times for hyperparameter tuning. Once the function is running, you can click the link to see the logs collected in **MLRun**. To use **TensorBoard**, you need a TensorBoard instance running on the `/User` directory. The model can be trained in one of three ways:
1. **Locally** - To run locally, set the `local` parameter to `True`.
2. As a **Job** - To run as a job, set the `kind` parameter to `"job"`.
3. As an **MPIJob** - To run as an MPIJob, set the `kind` parameter to `"mpijob"`. An MPIJob sets up Horovod automatically.
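
The three options above come down to two settings, `local` and `kind`. The following is a minimal sketch of how they combine; `resolve_run_mode` and `VALID_KINDS` are hypothetical names used only for illustration and are not part of MLRun. It assumes that a local run takes precedence over the runtime kind.

```python
# Hypothetical helper, not part of MLRun: it only illustrates how the
# `local` and `kind` settings described above select where training runs.
VALID_KINDS = {"job", "mpijob"}


def resolve_run_mode(local: bool, kind: str) -> str:
    """Return "local", "job", or "mpijob" for the given settings."""
    if kind not in VALID_KINDS:
        raise ValueError(f"kind must be one of {sorted(VALID_KINDS)}, got {kind!r}")
    # Assumption: when `local` is True the runtime kind is not used.
    return "local" if local else kind


print(resolve_run_mode(local=False, kind="job"))
```

Validating `kind` early like this catches a typo such as `"mpi-job"` before any function is created.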

In [3]:
local = False  # True or False
kind = "job"  # "job" or "mpijob"

In [4]:
# Create the MLRun function from the training notebook:
training_function = mlrun.code_to_function(
    name="cifar10-trainer",
    project=project_name,
    filename="keras-cifar10-original-train-code-with-2-added-lines.ipynb",
    kind=kind,
    image="mlrun/ml-models:0.10.0",
)

# Apply further configuration for deployment:
training_function.apply(mlrun.platforms.auto_mount())

<mlrun.runtimes.kubejob.KubejobRuntime at 0x7f4dc16fbee0>

### Run the training job with a fixed set of parameters

In [5]:
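# Define a grid with a single value per parameter, so only one combination runs:
grid_params = {
    "batch_size": [64],
    "lr": [1e-3],
    "epochs": [5],
}

# Create the task and run it, using the `local` flag set earlier:
task = mlrun.new_task("cifar10-simple-trainer").with_hyper_params(
    grid_params, selector="max.validation_accuracy"
)
training_run = training_function.run(
    task,
    handler="train",
    local=local,
    watch=True,
)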

> 2022-05-24 15:08:58,380 [info] starting run cifar10-simple-trainer uid=bbb7e5d1f7f047ca9868621c1c255652 DB=http://mlrun-api:8080
> 2022-05-24 15:08:59,040 [info] Job is running in the background, pod: cifar10-simple-trainer-g5lhs
batch size    === >>> 64
learning rate === >>> 0.001
epochs        === >>> 5
Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
2022-05-24 15:12:26.016144: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-05-24 15:12:26.044731: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
2022-05-24 15:12:26.586234: W tensorflow/python/util/util.cc:368] Se

| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| keras-cifar10-jovyan | ...1c255652 | 0 | May 24 15:09:21 | completed | cifar10-simple-trainer | kind=job<br>owner=jovyan<br>mlrun/client_version=0.10.0 | | | best_iteration=1<br>batch_size=64<br>epochs=5<br>lr=0.0010000000474974513<br>training_loss=0.81195068359375<br>training_accuracy=0.67181396484375<br>validation_loss=1.2986965971632887<br>validation_accuracy=0.5799000407941044 | training_loss.html<br>training_accuracy.html<br>validation_loss.html<br>validation_accuracy.html<br>loss_summary.html<br>accuracy_summary.html<br>lr_values.html<br>model<br>iteration_results<br>parallel_coordinates |

> 2022-05-24 15:30:36,460 [info] run executed, status=completed
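
The `starting run` and `run executed` log lines give the wall-clock time of this run. Below is a small sketch that computes it with the standard library; `LOG_TIME_FORMAT` and `elapsed_between` are illustrative names, and the timestamp layout is assumed from the log lines shown here rather than from any MLRun specification.

```python
from datetime import datetime, timedelta

# Assumed layout of the timestamps in the log lines above,
# for example "2022-05-24 15:08:58,380".
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_between(start: str, end: str) -> timedelta:
    """Return the time elapsed between two log timestamps."""
    started = datetime.strptime(start, LOG_TIME_FORMAT)
    finished = datetime.strptime(end, LOG_TIME_FORMAT)
    return finished - started


# Timestamps copied from the "starting run" and "run executed" lines.
elapsed = elapsed_between("2022-05-24 15:08:58,380", "2022-05-24 15:30:36,460")
print(elapsed)
```

For this run the result is a little under 22 minutes, measured from job submission to completion.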


### Run the training job with hyperparameter tuning

In [6]:
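# Define the hyperparameter grid (2 batch sizes x 2 learning rates x 1 epoch setting):
grid_params = {
    "batch_size": [64, 128],
    "lr": [1e-2, 1e-3],
    "epochs": [10],
}

# Create the task and run it; the selector keeps the iteration with the highest validation accuracy:
task = mlrun.new_task("cifar10-hp-trainer").with_hyper_params(
    grid_params, selector="max.validation_accuracy"
)
training_run = training_function.run(
    task,
    handler="train",
    local=local,
    watch=True,
)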

> 2022-05-24 15:30:44,742 [info] starting run cifar10-hp-trainer uid=86337f5c1f06423bb27aa41a2e52ff61 DB=http://mlrun-api:8080
> 2022-05-24 15:30:44,968 [info] Job is running in the background, pod: cifar10-hp-trainer-s9kxx
batch size    === >>> 64
learning rate === >>> 0.01
epochs        === >>> 10
Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
2022-05-24 15:32:20.012030: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-05-24 15:32:20.014045: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
2022-05-24 15:32:20.208497: W tensorflow/python/util/util.cc:368] Sets are n

| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| keras-cifar10-jovyan | ...2e52ff61 | 0 | May 24 15:30:53 | completed | cifar10-hp-trainer | kind=job<br>owner=jovyan<br>mlrun/client_version=0.10.0 | | | best_iteration=3<br>batch_size=64<br>epochs=10<br>lr=0.0010000000474974513<br>training_loss=0.75482177734375<br>training_accuracy=0.78131103515625<br>validation_loss=0.6332879112170527<br>validation_accuracy=0.7853999518738768 | training_loss.html<br>training_accuracy.html<br>validation_loss.html<br>validation_accuracy.html<br>loss_summary.html<br>accuracy_summary.html<br>lr_values.html<br>model<br>iteration_results<br>parallel_coordinates |

> 2022-05-24 18:19:00,671 [info] run executed, status=completed
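
To make the grid search concrete, the sketch below expands the same `grid_params` into individual parameter sets and applies a selector string such as `max.validation_accuracy`. It is plain Python and not MLRun's implementation: `expand_grid`, `parse_selector`, and `select_best` are hypothetical helpers, the enumeration order may differ from MLRun's iteration numbering, and the scores are made-up placeholders rather than results from this notebook.

```python
from itertools import product

grid_params = {
    "batch_size": [64, 128],
    "lr": [1e-2, 1e-3],
    "epochs": [10],
}


def expand_grid(grid):
    """Return one parameter dict per combination of the listed values."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


def parse_selector(selector):
    """Split a selector such as 'max.validation_accuracy' into (operation, metric)."""
    operation, metric = selector.split(".", 1)
    if operation not in ("max", "min"):
        raise ValueError(f"unsupported selector operation: {operation!r}")
    return operation, metric


def select_best(results, selector):
    """Return the index of the result dict that wins under the selector."""
    operation, metric = parse_selector(selector)
    pick = max if operation == "max" else min
    return pick(range(len(results)), key=lambda i: results[i][metric])


combinations = expand_grid(grid_params)
print(len(combinations))

# Placeholder scores for illustration only; NOT taken from the runs above.
placeholder_results = [{"validation_accuracy": s} for s in (0.1, 0.4, 0.3, 0.2)]
best = select_best(placeholder_results, "max.validation_accuracy")
print(combinations[best])
```

The grid yields 2 × 2 × 1 = 4 combinations, so this search trains four models where the fixed-parameter run trained one.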


# Run a single training job with fixed parameters (no hyperparameter grid):
training_function.run(
    name="cifar10-trainer-training",
    handler="train",
    params={
        "batch_size": 64,
        "lr": 1e-2,
        "epochs": 5,
    },
    local=local,
)