# EfficientNet Distributed Training with an Existing Experiment

## Environment Setup

Before any experiment can be run, we need to set up and initialize the environment: make sure that all required Python modules are installed and configured.
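
One way to confirm that the modules are available before running the rest of the notebook is to probe for them with `importlib.util.find_spec`. The sketch below is an illustration and not part of the original notebook; the helper name `missing_modules` is hypothetical, and the module list is taken from the imports used in this notebook.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Modules imported by this notebook (top-level names only).
REQUIRED = ['os', 'datetime', 'kfp', 'ipython_secrets']

missing = missing_modules(REQUIRED)
if missing:
    print('Missing modules: ' + ', '.join(missing))
else:
    print('All required modules are available.')
```

Probing with `find_spec` does not import the modules, so the check stays cheap and has no side effects.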

### Imports
Import the required Python modules.

In [28]:
import os
from datetime import datetime
import kfp
from ipython_secrets import get_secret

In [29]:
client = kfp.Client()
APPLICATION_NAME = 'efficientnet1'
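try:
    exp = client.get_experiment(experiment_name=APPLICATION_NAME)
except Exception as err:
    # The cells below need `exp`, so stop here if the experiment cannot be found.
    raise RuntimeError(f'Experiment {APPLICATION_NAME!r} is not available') from err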

### Define variables for experiment
At the beginning of the script we define all necessary variables, so that the whole experiment configuration is kept in one place.

In [19]:
## Globals
REMOTE_MINIO_SERVER = get_secret('REMOTE_MINIO_SERVER')
ACCESS_KEY = get_secret('ACCESS_KEY')
SECRET_KEY = get_secret('SECRET_KEY')

In [23]:
ARTIFACTS_ROOT = '/mnt/s3/'
BASEDIR = os.path.join(ARTIFACTS_ROOT, 'santosh-test')
DATASET_DIR = os.path.join(BASEDIR, 'datasets')
MODEL_DIR = os.path.join(ARTIFACTS_ROOT, 'models')

In [24]:
## Experiment-specific params
TAG = 'v25'
MODEL_VERSION = '1'
# The timestamp suffix is month_day_seconds, e.g. pneumothorax_03_04_40.h5
MODEL_FNAME = 'pneumothorax_' + datetime.now().strftime('%m_%d_%S') + '.h5'
DATASET_NAME = 'normal_pneumothorax'
LABELS = 'normal,pneumothorax'
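
The format string `%m_%d_%S` used for `MODEL_FNAME` expands to month, day and seconds, so two runs on the same day can end up with the same file name. The sketch below shows what the format yields for a fixed timestamp and, as an assumed alternative that is not part of the original notebook, a variant that includes the year, hour and minute. The helper name `model_fname` and its parameters are hypothetical.

```python
from datetime import datetime


def model_fname(prefix, now, fmt='%m_%d_%S'):
    """Build a model file name from a prefix and a formatted timestamp."""
    return prefix + now.strftime(fmt) + '.h5'


# A fixed timestamp, chosen only for illustration.
stamp = datetime(2020, 3, 4, 12, 30, 40)

# Format used in this notebook: month_day_seconds.
print(model_fname('pneumothorax_', stamp))
# Assumed alternative: a full timestamp, unique down to the second.
print(model_fname('pneumothorax_', stamp, fmt='%Y%m%d_%H%M%S'))
```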

### Run the pipeline

The code below submits a run of the pipeline defined in `argo-distr-training.yaml` to the experiment and passes the pipeline parameters to it.

In [25]:

run = client.run_pipeline(exp.id, f'Training model {TAG}: {datetime.now():%m%d-%H%M}', 'argo-distr-training.yaml',
                          params={
                              'datasetDir': DATASET_DIR,
                              'datasetName': DATASET_NAME,
                              'labels': LABELS,
                              'remoteMinioServer': REMOTE_MINIO_SERVER,
                              'accessKey': ACCESS_KEY,
                              'secretKey': SECRET_KEY,
                              'batchSize': 32,
                              'width': 150,
                              'height': 150,
                              'epochs': 1,
                              'dropoutRate': 0.2,
                              'learningRate': 0.00002,
                              'trainInput': os.path.join(DATASET_DIR, DATASET_NAME),
                              'modelVersion': MODEL_VERSION,
                              'modelDir': MODEL_DIR,
                              'modelFname': MODEL_FNAME,
                          })
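
When several runs differ only in a few hyperparameters, it can help to build the parameter dictionary in a helper and override individual values per run. The following is a minimal sketch under that assumption; `build_params`, `DEFAULT_HYPERPARAMS` and the argument names are hypothetical, and the keys and default values simply mirror the ones passed to `run_pipeline` in this notebook.

```python
import os

# Default hyperparameters, mirroring the values used in this notebook.
DEFAULT_HYPERPARAMS = {
    'batchSize': 32,
    'width': 150,
    'height': 150,
    'epochs': 1,
    'dropoutRate': 0.2,
    'learningRate': 0.00002,
}


def build_params(dataset_dir, dataset_name, labels, **overrides):
    """Return pipeline parameters, rejecting overrides that are not known hyperparameters."""
    unknown = set(overrides) - set(DEFAULT_HYPERPARAMS)
    if unknown:
        raise KeyError(f'Unknown hyperparameters: {sorted(unknown)}')
    params = {
        'datasetDir': dataset_dir,
        'datasetName': dataset_name,
        'labels': labels,
        'trainInput': os.path.join(dataset_dir, dataset_name),
    }
    params.update(DEFAULT_HYPERPARAMS)
    params.update(overrides)
    return params


# Example: same dataset, but three epochs and a larger batch.
params = build_params('/mnt/s3/santosh-test/datasets', 'normal_pneumothorax',
                      'normal,pneumothorax', epochs=3, batchSize=64)
print(params['epochs'], params['batchSize'], params['trainInput'])
```

The remaining entries (Minio credentials and model settings) could be merged into the returned dictionary in the same way before it is passed as `params`. Rejecting unknown keys catches a misspelled parameter name before the run is submitted.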

In [26]:
%%time
# Block until the run completes; the timeout is given in seconds
print(f"Waiting for run: {run.id}...")
result = client.wait_for_run_completion(run.id, timeout=720).run.status
print(f"Finished with: {result}")

Waiting for run: cf417e92-5fff-11ea-9586-122862a16a39...
Finished with: Succeeded
CPU times: user 33.2 ms, sys: 2.26 ms, total: 35.4 ms
Wall time: 1min 10s
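
The status reported after waiting is a plain string, and a run may finish in a state other than `Succeeded`. A minimal sketch of a guard that stops the notebook in that case follows; the helper name `ensure_succeeded` is hypothetical, and it assumes the status string `Succeeded` shown in the output above.

```python
def ensure_succeeded(status):
    """Raise if the reported run status is anything other than 'Succeeded'."""
    # Compare case-insensitively; a missing status (None) also counts as a failure.
    if (status or '').lower() != 'succeeded':
        raise RuntimeError(f'Pipeline run finished with status: {status}')


# In the notebook this would be called as ensure_succeeded(result).
ensure_succeeded('Succeeded')
print('Run succeeded.')
```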


In [396]:
NB_MODEL_FILE = f"/home/jovyan/data/models/{MODEL_VERSION}/{MODEL_FNAME}"
!ls $NB_MODEL_FILE

/home/jovyan/data/models/1/pneumothorax_03_04_40.h5
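
The same check can be done in Python rather than with the `ls` shell escape, which also makes it easy to fail early when the file is missing. This is only a sketch; `check_model_file` is a hypothetical helper, and the demonstration uses a temporary file instead of the real model path.

```python
import os
import tempfile


def check_model_file(path):
    """Return the size of the file at `path` in bytes, or raise if it is missing."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f'Model file not found: {path}')
    return os.path.getsize(path)


# Demonstration with a temporary stand-in for NB_MODEL_FILE.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'pneumothorax_demo.h5')
    with open(demo_path, 'wb') as fh:
        fh.write(b'\x00' * 16)
    print(check_model_file(demo_path))
```

In the notebook, the call would be `check_model_file(NB_MODEL_FILE)`.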
