# Part 1: Preprocessing Data

This part prepares the training set and stores key information about it in MLRun. We were not able to run the workflow with functions under our own names, so this pipeline is based heavily on MLRun tutorial 1. We apologize in advance for any confusing naming.

We have completed the following:

- Created an MLRun function that automates data processing
- Stored data artifacts in a central database
- Ran the code on a Kubernetes cluster

<a id="gs-tutorial-1-mlrun-intro"></a>

<a id="gs-tutorial-1-mlrun-envr-init"></a>

## Step 1: Initializing the MLRun Environment

Run the following code to initialize the MLRun environment, use a "getting-started-tutorial-&lt;username&gt;"
project, and store the project artifacts in the default artifacts path:

In [1]:
from os import path
import mlrun

# Set the base project name
project_name_base = 'getting-started-tutorial'
# Initialize the MLRun environment and save the project name and artifacts path
project_name, artifact_path = mlrun.set_environment(project=project_name_base,
                                                    user_project=True)
                                                    
# Display the current project name and artifacts path
print(f'Project name: {project_name}')
print(f'Artifacts path: {artifact_path}')

Project name: getting-started-tutorial-jovyan
Artifacts path: /home/jovyan/data
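
The output above suggests that `user_project=True` appends the current username to the base project name. The following sketch mimics that naming rule in plain Python. `build_project_name` is a hypothetical helper, not part of MLRun, and the fallback to the OS login name is an assumption about where the username comes from.

```python
import getpass


def build_project_name(base, username=None):
    """Mimic the '<base>-<username>' naming seen in the output above.

    Hypothetical helper for illustration; not an MLRun API.
    """
    # Fall back to the OS login name when no username is given (assumption).
    user = username or getpass.getuser()
    return f"{base}-{user}"


print(build_project_name("getting-started-tutorial", "jovyan"))
```

A per-user suffix like this keeps runs from different users of the same cluster in separate projects.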


<a id="gs-tutorial-1-step-create-basic-function"></a>

## Step 2: Creating a Basic Function

Develop an MLRun function by converting a local function used for data preprocessing.

<a id="gs-tutorial-1-define-local-func"></a>

In [2]:
import pandas as pd

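# Ingest a data set and encode its label column
def prep_data(source_url, label_column):
    # Read the CSV file into a pandas DataFrame
    df = pd.read_csv(source_url)
    # Replace each label with its integer category code
    df[label_column] = df[label_column].astype('category').cat.codes
    # Return the DataFrame and its row count
    return df, df.shape[0]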
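
To make the label encoding concrete, here is a minimal sketch of what `astype('category').cat.codes` does on a small series. The label values are hypothetical and are not taken from our dataset. Note that pandas assigns the code `-1` to missing values, which is worth keeping in mind when reading an encoded label column.

```python
import pandas as pd

# Hypothetical labels; None stands in for a missing value.
labels = pd.Series(["flood", "drought", None, "flood"])

# Categories are sorted, so 'drought' maps to 0 and 'flood' maps to 1.
# Missing values have no category and receive the code -1.
codes = labels.astype("category").cat.codes

print(codes.tolist())
```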

<a id="gs-tutorial-1-create-and-run-an-mlrun-function"></a>

In [3]:
# mlrun: start-code

In [4]:
import mlrun
def prep_data(context, source_url: mlrun.DataItem, label_column='label'):

    # Convert the DataItem to a pandas DataFrame
    df = source_url.as_df()
    df[label_column] = df[label_column].astype('category').cat.codes    
    
    # Record the DataFrame length after the run
    context.log_result('num_rows', df.shape[0])

    # Store the data set in your artifacts database
    context.log_dataset('cleaned_data', df=df, index=False, format='csv')

In [5]:
# mlrun: end-code
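
The handler only needs an object with an `as_df` method and a context with `log_result` and `log_dataset`, so its logic can be checked without an MLRun service. The sketch below does this with two hypothetical stand-in classes, `FakeDataItem` and `FakeContext`, and a copy of the handler named `prep_data_plain` (renamed so it does not shadow the real one). The CSV content is made up for illustration.

```python
import io

import pandas as pd


class FakeDataItem:
    """Hypothetical stand-in for mlrun.DataItem that wraps CSV text."""

    def __init__(self, csv_text):
        self._csv_text = csv_text

    def as_df(self):
        return pd.read_csv(io.StringIO(self._csv_text))


class FakeContext:
    """Hypothetical stand-in for the MLRun context; records what is logged."""

    def __init__(self):
        self.results = {}
        self.datasets = {}

    def log_result(self, key, value):
        self.results[key] = value

    def log_dataset(self, key, df=None, **kwargs):
        self.datasets[key] = df


def prep_data_plain(context, source_url, label_column="label"):
    # Same body as the handler above, minus the mlrun type annotation.
    df = source_url.as_df()
    df[label_column] = df[label_column].astype("category").cat.codes
    context.log_result("num_rows", df.shape[0])
    context.log_dataset("cleaned_data", df=df, index=False, format="csv")


ctx = FakeContext()
prep_data_plain(ctx, FakeDataItem("Year,label\n1905,quake\n1907,flood\n1914,quake\n"))
print(ctx.results)
print(ctx.datasets["cleaned_data"]["label"].tolist())
```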

In [6]:
# Convert the local prep_data function to an MLRun project function
data_prep_func = mlrun.code_to_function(name='prep_data', kind='job', image='mlrun/mlrun')
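
The `# mlrun: start-code` and `# mlrun: end-code` cells mark which part of the notebook is packaged into the function. As a rough mental model only, the selection can be sketched as below. `extract_marked_code` and the sample cells are hypothetical, and MLRun's actual notebook conversion is more involved than this.

```python
START_MARKER = "# mlrun: start-code"
END_MARKER = "# mlrun: end-code"


def extract_marked_code(cells):
    """Return the source of the cells between the start and end markers.

    Illustrative only; this is not how MLRun is implemented.
    """
    collected = []
    inside = False
    for cell in cells:
        stripped = cell.strip()
        if stripped == START_MARKER:
            inside = True
        elif stripped == END_MARKER:
            inside = False
        elif inside:
            collected.append(cell)
    return "\n\n".join(collected)


# Hypothetical notebook cells, mirroring the layout used above.
cells = [
    "print('not packaged')",
    "# mlrun: start-code",
    "def prep_data(context):\n    pass",
    "# mlrun: end-code",
    "print('also not packaged')",
]
print(extract_marked_code(cells))
```

Under this model, the first local `prep_data` definition sits outside the markers and is not part of the packaged function.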

<a id="gs-tutorial-1-run-mlrun-function-locally"></a>

In [7]:
# Set the source-data URL. We uploaded our dataset to the data directory, which is the only
# directory where user changes persist in the Docker deployment.
source_url = '/home/jovyan/data/preprocessed-2.csv'


In [8]:
# Run the `data_prep_func` MLRun function locally
prep_data_run = data_prep_func.run(name='prep_data',
                                   handler=prep_data,
                                   inputs={'source_url': source_url},
                                   local=True)

> 2021-07-06 21:08:33,732 [info] starting run prep_data uid=0a5446c13eef4248a1cc7d65b5b9d7b7 DB=http://mlrun-api:8080


project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
getting-started-tutorial-jovyan,...b5b9d7b7,0,Jul 06 21:08:33,completed,prep_data,kind=; owner=jovyan; host=mlrun-kit-jupyter-6879c4d97f-hz9c2,source_url,,num_rows=429,cleaned_data


to track results use .show() or .logs() or in CLI: 
!mlrun get run 0a5446c13eef4248a1cc7d65b5b9d7b7 --project getting-started-tutorial-jovyan , !mlrun logs 0a5446c13eef4248a1cc7d65b5b9d7b7 --project getting-started-tutorial-jovyan
> 2021-07-06 21:08:33,894 [info] run executed, status=completed


<a id="gs-tutorial-1-get-run-object-info"></a>

In [9]:
# Confirm that the MLRun function ran successfully
prep_data_run.state()

'completed'

In [10]:
# Confirm that the cleaned data set was generated
prep_data_run.outputs['cleaned_data']

'store://artifacts/getting-started-tutorial-jovyan/prep_data_cleaned_data:0a5446c13eef4248a1cc7d65b5b9d7b7'
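
The returned value is a store URI rather than a file path. Judging from the string printed above, it appears to follow the pattern `store://artifacts/<project>/<key>:<uid>`. The sketch below splits such a URI into its parts. `parse_store_uri` is a hypothetical helper, and the layout is inferred from this output only, not from the MLRun specification.

```python
def parse_store_uri(uri):
    """Split 'store://artifacts/<project>/<key>:<uid>' into its parts.

    Hypothetical helper; the layout is inferred from the URI printed above.
    """
    prefix = "store://artifacts/"
    if not uri.startswith(prefix):
        raise ValueError(f"not an artifact store URI: {uri}")
    project, _, rest = uri[len(prefix):].partition("/")
    key, _, uid = rest.partition(":")
    return project, key, uid


uri = (
    "store://artifacts/getting-started-tutorial-jovyan/"
    "prep_data_cleaned_data:0a5446c13eef4248a1cc7d65b5b9d7b7"
)
print(parse_store_uri(uri))
```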

<a id="gs-tutorial-1-read-output"></a>

## Step 3: Reading and Storing the Output

Inspect the cleaned dataset that the run produced.

In [11]:
dataset = mlrun.run.get_dataitem(prep_data_run.outputs['cleaned_data'])

Call the `dataset.as_df` method to get the data as a pandas DataFrame:

In [12]:
dataset.as_df()

Unnamed: 0,Year,Country,Region,Continent,Latitude,Longitude,CPI,total_affected,label
0,1905,4,2,0,32.040,76.160,3.522300,20000,0
1,1907,3,0,0,38.500,69.900,3.652756,12000,0
2,1914,5,1,0,-3.924,101.820,3.926712,40,0
3,1917,3,0,0,28.000,104.000,5.022539,1800,0
4,1917,5,1,0,-7.000,116.000,5.022539,15000,0
...,...,...,...,...,...,...,...,...,...
424,1966,4,2,0,28.700,78.900,12.696028,15,-1
425,1967,5,1,0,5.121,96.286,13.048062,10014,-1
426,1971,5,1,0,-7.200,109.100,15.838255,6892,-1
427,2019,5,1,0,-3.450,128.347,100.000000,247449,-1


<a id="gs-tutorial-1-save-artifcats-in-run-specific-paths"></a>

### Saving the Artifacts in Run-Specific Paths

In [13]:
out = artifact_path

prep_data_run = data_prep_func.run(name='prep_data',
                                   handler=prep_data,
                                   inputs={'source_url': source_url},
                                   local=True,
                                   artifact_path=path.join(out, '{{run.uid}}'))

> 2021-07-06 21:08:34,022 [info] starting run prep_data uid=f4b2e1861abd455bb58e287850ac381a DB=http://mlrun-api:8080


project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
getting-started-tutorial-jovyan,...50ac381a,0,Jul 06 21:08:34,completed,prep_data,kind=; owner=jovyan; host=mlrun-kit-jupyter-6879c4d97f-hz9c2,source_url,,num_rows=429,cleaned_data


to track results use .show() or .logs() or in CLI: 
!mlrun get run f4b2e1861abd455bb58e287850ac381a --project getting-started-tutorial-jovyan , !mlrun logs f4b2e1861abd455bb58e287850ac381a --project getting-started-tutorial-jovyan
> 2021-07-06 21:08:34,187 [info] run executed, status=completed
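
The `{{run.uid}}` part of `artifact_path` is a template placeholder, so each run writes its artifacts to a directory named after its own unique ID. The sketch below shows the path before and after substitution, using the artifacts path from Step 1 and the uid from the log above. The `str.replace` call is only a stand-in for the substitution that MLRun performs at run time, which is an assumption about the mechanism.

```python
import posixpath

artifact_path = "/home/jovyan/data"           # value printed in Step 1
run_uid = "f4b2e1861abd455bb58e287850ac381a"  # uid from the log above

# The template that is passed to run() as artifact_path
template = posixpath.join(artifact_path, "{{run.uid}}")

# Stand-in for the substitution MLRun performs at run time (assumption)
resolved = template.replace("{{run.uid}}", run_uid)

print(template)
print(resolved)
```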


<a id="gs-tutorial-1-step-run-func-on-cluster"></a>

## Step 4: Running the Function on a Cluster


In [14]:
from mlrun.platforms import auto_mount

In [15]:
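# Apply the environment's default storage mount so the job pod can reach the data
data_prep_func.apply(auto_mount())
# Run on the cluster (local=False); the handler is passed by name
prep_data_run = data_prep_func.run(name='prep_data',
                                   handler='prep_data',
                                   inputs={'source_url': source_url},
                                   local=False)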

> 2021-07-06 21:08:34,218 [info] starting run prep_data uid=1a8569be3a204a1d85b10f74ebe67db2 DB=http://mlrun-api:8080
> 2021-07-06 21:08:34,326 [info] Job is running in the background, pod: prep-data-2bgwv
> 2021-07-06 21:08:40,047 [info] run executed, status=completed
final state: completed


project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
getting-started-tutorial-jovyan,...ebe67db2,0,Jul 06 21:08:39,completed,prep_data,kind=job; owner=jovyan; host=prep-data-2bgwv,source_url,,num_rows=429,cleaned_data


to track results use .show() or .logs() or in CLI: 
!mlrun get run 1a8569be3a204a1d85b10f74ebe67db2 --project getting-started-tutorial-jovyan , !mlrun logs 1a8569be3a204a1d85b10f74ebe67db2 --project getting-started-tutorial-jovyan
> 2021-07-06 21:08:40,457 [info] run executed, status=completed


In [16]:
print(prep_data_run.outputs)

{'num_rows': 429, 'cleaned_data': 'store://artifacts/getting-started-tutorial-jovyan/prep_data_cleaned_data:1a8569be3a204a1d85b10f74ebe67db2'}
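
The outputs dictionary mixes plain results (`num_rows`) with artifact references (`cleaned_data`). A small sketch of separating the two is shown below. `split_outputs` is a hypothetical helper that treats any string starting with `store://` as an artifact reference, which is a convention inferred from the output above.

```python
def split_outputs(outputs):
    """Separate plain results from artifact URIs in a run's outputs dict.

    Hypothetical helper; treats any 'store://' string as an artifact reference.
    """
    results, artifacts = {}, {}
    for key, value in outputs.items():
        if isinstance(value, str) and value.startswith("store://"):
            artifacts[key] = value
        else:
            results[key] = value
    return results, artifacts


# The outputs printed above, copied here so the sketch runs on its own
outputs = {
    "num_rows": 429,
    "cleaned_data": "store://artifacts/getting-started-tutorial-jovyan/"
    "prep_data_cleaned_data:1a8569be3a204a1d85b10f74ebe67db2",
}
results, artifacts = split_outputs(outputs)
print(results)
print(sorted(artifacts))
```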


<a id="gs-tutorial-1-step-ui-jobs-view"></a>

<a id="gs-tutorial-1-step-schedule-jobs"></a>