# Welcome to Versatile Data Kit 


## Introduction
Welcome to our introductory guide. 
Here, we'll walk you through creating, developing, and deploying a Data Job using the Jupyter UI. 
Please go through this guide carefully to get a solid understanding of how to use VDK.

## Step 0: Explore VDK's Functionalities
You can execute any shell command in a notebook cell by prefixing it with "!". For example, "!vdk --help" gives an overview of VDK's capabilities.


In [None]:
!vdk --help

These commands are also readily accessible in the VDK menu located on the top toolbar. 
This guide will delve into each of these options for a comprehensive understanding.
We'll also provide you with information on utilizing VDK within Notebooks.

## Step 1: Login (Might be omitted)

TO BE ADDED


## Step 2: [Option 1] Create a Data Job
Now that we have explored VDK's capabilities, let's create our data job (if you want to update an existing job see <i>"Step 2: [Option 2]"</i>). 

Go to the VDK menu in the toolbar and click the Create button. You will then see the following inputs: 
* name: The data job name. It must be between 6 and 45 characters long and contain only lowercase characters or dashes.
* team: The name of the team to which the job should belong.
* path: Relative Jupyter path to the parent directory of the Data Job.

If no path is specified, the Jupyter root directory is used. 
A new directory named after the job will be created inside that
path, containing a sample job. 
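
The naming rule above can be checked before opening the Create dialog. The following is a minimal sketch, not VDK's own validation: `is_valid_job_name` is a hypothetical helper, and the pattern is only an interpretation of "between 6 and 45 characters: lowercase or dash". Whether digits are accepted is not stated in this guide, so the sketch rejects them.

```python
import re

# Assumption: "lowercase or dash" means the letters a-z and the "-" character.
# This pattern is an interpretation of the guide, not VDK's actual rule.
JOB_NAME_PATTERN = re.compile(r"[a-z-]{6,45}")


def is_valid_job_name(name: str) -> bool:
    """Return True if the name is 6 to 45 lowercase letters or dashes."""
    return JOB_NAME_PATTERN.fullmatch(name) is not None


# Try a few hypothetical names.
for candidate in ["my-first-job", "job", "My-Job-Name"]:
    print(f"{candidate!r}: {is_valid_job_name(candidate)}")
```

The service remains the authority on which names are accepted; a local check like this only helps catch obvious mistakes early.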


Once you've finished entering the inputs, proceed by clicking the "Ok" button.
Shortly after, an alert will update you on the operation's status. 
If this takes longer than expected, you have the option to check the status manually using the "Check Status" button located on the right side of the toolbar.

<font color='red'>**ATTENTION!**</font>  Please be aware that VDK will attempt to create the Data Job on your local system and in the cloud, provided there's a designated REST API URL for the Control Service.
You'll receive relevant updates clarifying whether the job has been created just on your local system, or in both local and cloud environments.

<font color='red'>**ATTENTION!**</font> Should you encounter difficulties while creating a job, consider referencing our YouTube tutorial: 'Example Tutorial'(add link to tutorial which includes creating a job from Jupyter UI)

Once the operation is complete, feel free to navigate through the newly created data job directory.

> _Data Job directory can contain any files, however, there are some files that are treated in a specific way:_
> 
> * **SQL files (.sql)** - called SQL steps - are directly executed as queries against a database.
> * **Notebook files (.ipynb)** - can contain labelled VDK cells (called notebook steps), which are executed as steps in a VDK Data Job.
> * **Python files (.py)** - called Python steps - are Python scripts that define a run function that takes the job_input object as an argument.
> * **config.ini** is needed in order to schedule a job.
> * **requirements.txt** is needed when your Python steps use external python libraries.


## Step 2: [Option 2] Download a Data Job
To update an existing data job, you first need to download it. The VDK menu, located on the toolbar, provides the necessary functionality.

Simply navigate to the "Download Job" option within the menu and provide the requested inputs:

* name: The data job name. It must be between 6 and 45 characters long and contain only lowercase characters or dashes.
* team: The name of the team to which the job belongs.
* path: Relative Jupyter path to the parent directory where the Data Job will be downloaded.

If no path is specified, the Jupyter root directory is used. 

Upon completion of the input fields, click the "Ok" button to advance. 
Shortly after, an alert will update you on the operation's status. 
If this takes longer than expected, you have the option to check the status manually using the "Check Status" button located on the right side of the toolbar.

## Step 3: Develop Data Job (in the notebook)

### What exactly is a Data Job?
A Data Job is a data processing unit that allows data engineers to implement automated pull ingestion (E in ELT) or batch data transformation into a Data Warehouse (T in ELT). At its core, it is a directory with different scripts and files inside it.
> _Data Job directory can contain any files, however, there are some files that are treated in a specific way:_
> 
> * **SQL files (.sql)** - called SQL steps - are directly executed as queries against a database.
> * **Notebook files (.ipynb)** - can contain labelled VDK cells (called notebook steps), which are executed as steps in a VDK Data Job.
> * **Python files (.py)** - called Python steps - are Python scripts that define a run function that takes the job_input object as an argument.
> * **config.ini** is needed in order to schedule a job.
> * **requirements.txt** is needed when your Python or Notebook steps use external python libraries.
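
To make the shape of a Python step concrete, here is a minimal sketch of a file that defines `run(job_input)`. The file name, the `name` argument, and the `FakeJobInput` class are hypothetical and exist only so the step can be tried outside VDK. The sketch assumes that `job_input.get_arguments()` returns the run arguments as a dictionary.

```python
# 10_say_hello.py (hypothetical file name for a Python step)


def run(job_input):
    """Entry point of the step; VDK passes in the job_input object."""
    # Assumption: get_arguments() returns the arguments given to the run as a dict.
    arguments = job_input.get_arguments()
    greeting = f"Hello, {arguments.get('name', 'world')}!"
    print(greeting)
    return greeting


class FakeJobInput:
    """Stand-in for local experiments only; this class is not part of VDK."""

    def __init__(self, arguments):
        self._arguments = arguments

    def get_arguments(self):
        return dict(self._arguments)


if __name__ == "__main__":
    # Outside VDK we have to build the input object ourselves.
    run(FakeJobInput({"name": "VDK"}))
```

When the job is executed by VDK, only the `run` function matters; the stand-in class is there purely to make the sketch self-contained.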

VDK supports having many Python, SQL and Notebook steps in a single Data Job. 
These steps are carried out in an ascending alphabetical sequence based on the file names, and Notebook steps are run from the beginning to the end of the file. For an effective execution order while retaining informative file names, it's beneficial to prefix your file names with numbers.
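
The effect of numeric prefixes on ordering can be previewed with a short sketch. The file names below are hypothetical, and plain Python string sorting is used as a stand-in for the alphabetical ordering described above; VDK's exact comparison may differ in edge cases.

```python
# Hypothetical contents of a Data Job directory.
job_files = [
    "20_transform.sql",
    "10_ingest.py",
    "30_report.ipynb",
    "config.ini",
    "requirements.txt",
]

# Only SQL, Python and Notebook files are treated as steps.
STEP_EXTENSIONS = (".sql", ".py", ".ipynb")

steps = sorted(name for name in job_files if name.endswith(STEP_EXTENSIONS))
print(steps)

# Without prefixes, the alphabetical order follows the names, not the intent:
# "report" sorts before "transform".
print(sorted(["transform.sql", "ingest.py", "report.ipynb"]))

# Prefixes are compared as text, so "10_" sorts before "2_".
# Zero-padding ("02_", "10_") avoids this surprise.
print(sorted(["2_second.sql", "10_tenth.sql"]))
```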

<font color='red'>**ATTENTION!**</font> 
If this is your initial experience developing a notebook job, we highly recommend starting with our tutorial: "How to Build a Job Using VDK Notebook Integration."(youtube video link to be added)

#### Notebook steps
As previously stated, a single notebook can comprise multiple Data Job steps - these are distinct code segments executed by VDK. Within the notebook, a universal VDK variable named 'job_input' is provided. For more information, please refer to the following link: https://github.com/vmware/versatile-data-kit/blob/8c8b752a1f4400841331b2dda42cc1b6ef7a61af/projects/vdk-core/src/vdk/api/job_input.py#L355. It's entirely feasible to develop a Data Job within the notebook utilizing any cell, but it's important to bear in mind some guidelines.

##### Execution Order and Identifying Cells
*  *Cells can be labeled with the "vdk" tag.*
*  *While this tag doesn't impact the data job's development phase, it is crucial for subsequent automated operations, such as a "VDK Run."*
*  *Only the cells critical to the Data Job should be tagged.*
*  *Untagged cells will be overlooked during deployment and other execution processes.*
*  *You can easily recognize VDK cells by their unique color scheme, the presence of the VDK logo, and an exclusive numbering system.*
*  *VDK cells in the notebook will be executed according to the numbering when executing the notebook data job with VDK.*
*  *You can delete the cells that are not tagged with "vdk" 
    as they are not essential to the data job's execution.
    However, removing VDK cells will result in a different data job execution.* 
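
To review which cells would take part in a run, it can help to list the tagged cells of a notebook. The sketch below works on an in-memory notebook structure with hypothetical cell contents. It relies on the notebook file format storing cell tags under `metadata.tags`; how VDK itself selects cells internally is not shown here, and `vdk_step_sources` is an illustrative helper, not a VDK function.

```python
# A tiny, hypothetical notebook in the JSON structure used by .ipynb files.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "code", "metadata": {"tags": ["vdk"]},
         "source": ["rows = [1, 2, 3]\n", "print(rows)"]},
        {"cell_type": "code", "metadata": {},
         "source": "print('scratch work, not part of the job')"},
        {"cell_type": "markdown", "metadata": {}, "source": ["# Notes"]},
        {"cell_type": "code", "metadata": {"tags": ["vdk"]},
         "source": ["print('second step')"]},
    ],
}


def source_text(cell):
    """Cell source may be stored as one string or as a list of lines."""
    source = cell.get("source", "")
    return source if isinstance(source, str) else "".join(source)


def vdk_step_sources(nb):
    """Return the source of every code cell tagged "vdk", top to bottom."""
    steps = []
    for cell in nb.get("cells", []):
        tags = cell.get("metadata", {}).get("tags", [])
        if cell.get("cell_type") == "code" and "vdk" in tags:
            steps.append(source_text(cell))
    return steps


for number, source in enumerate(vdk_step_sources(notebook), start=1):
    print(f"--- VDK cell {number} ---")
    print(source)
```

For a real notebook, the same dictionary can be obtained by loading the `.ipynb` file with `json.load`.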

##### Tips: 
* *Before running the job, it is recommended to review the cells
    to ensure a clear understanding of the data job run.  
    This will help to ensure the desired outcome.* 

## Step 4: Run Data Job locally (in the notebook)

You can run a Data Job locally in the notebook using the 'Run' option from the VDK menu.
After navigating to the option, provide the requested inputs:

* path: Relative Jupyter path to the Data Job directory.
* arguments: Arguments to pass to the job, in valid JSON format. They are passed to each step and can be used as parameters in SQL queries, where they are replaced automatically. Properties can also be used as parameters in SQL; for the same key, arguments take priority.

If no path is specified, the Jupyter root directory is used. 
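
The arguments input and its priority over properties can be illustrated with a small sketch. It is not VDK's implementation: `resolve_parameters`, the property names, and the query are hypothetical, and the `{name}` placeholder syntax is an assumption made for this example.

```python
import json


def resolve_parameters(properties: dict, arguments_json: str) -> dict:
    """Merge properties and arguments; arguments win for the same key."""
    arguments = json.loads(arguments_json) if arguments_json.strip() else {}
    if not isinstance(arguments, dict):
        raise ValueError("arguments must be a JSON object, e.g. {\"key\": \"value\"}")
    # Later entries override earlier ones, so arguments take priority.
    return {**properties, **arguments}


# Hypothetical job properties and the text typed into the "arguments" input.
properties = {"target_table": "sales_prod", "country": "BG"}
arguments_json = '{"country": "US"}'

parameters = resolve_parameters(properties, arguments_json)

# Assumption: parameters appear in SQL as {name} placeholders.
query = "SELECT * FROM {target_table} WHERE country = '{country}'".format(**parameters)
print(query)
```

Checking the JSON locally with `json.loads` before starting the run is a quick way to catch quoting mistakes such as single quotes around keys.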

Upon completion of the input fields, click the "Ok" button to advance. 
Shortly after, an alert will update you on the operation's status. 
If this takes longer than expected, you have the option to check the status manually using the "Check Status" button located on the right side of the toolbar.

<font color='red'>**ATTENTION!**</font> To access the detailed logs of the Run operation, you can proceed to the job's parent directory. There, you'll find a file named 'vdk_logs'. This file contains the comprehensive logs associated with the operation.

## Step 5: Schedule data job 

All data jobs must contain a job configuration file called config.ini. 
You can change the schedule of the job by editing the `schedule_cron` option in that file. 
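
As an illustration, the sketch below reads and updates `schedule_cron` using Python's standard `configparser` module. The sample file content is hypothetical, and the section names (`owner`, `job`) are assumptions; check the config.ini generated for your own job for the actual layout.

```python
import configparser
import io

# Hypothetical config.ini content; section names are assumptions.
SAMPLE_CONFIG = """
[owner]
team = my-team

[job]
schedule_cron = 0 5 * * *
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE_CONFIG)

schedule = parser.get("job", "schedule_cron")
fields = schedule.split()
# A standard cron expression has five fields:
# minute, hour, day of month, month, day of week.
print(f"Current schedule: {schedule} ({len(fields)} fields)")

# Change the schedule to 02:30 every Monday and render the updated file.
parser.set("job", "schedule_cron", "30 2 * * 1")
updated = io.StringIO()
parser.write(updated)
print(updated.getvalue())
```

In practice it is usually simpler to edit config.ini by hand in the Jupyter file browser; the sketch mainly shows where the option lives and what a five-field cron expression looks like.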

## Step 6: Deploy data job

Now that we are done with the modifications to the Data Job, we will deploy it in the Control Service.
Deployment takes both the build/code and the deployment-specific properties, then builds and packages them. Once a Data Job is deployed, it is ready for immediate execution in the execution environment and can be scheduled to run periodically.

To deploy a Data Job, go to the "Deploy" option within the VDK menu and provide the requested inputs: 
* name: The data job name. It must be between 6 and 45 characters long and contain only lowercase characters or dashes.
* team: The name of the team to which the job belongs.
* path: Relative Jupyter path to the parent directory of the Data Job.
* reason: Mandatory input containing the reason for the deployment.

Upon completion of the input fields, click the "Ok" button to advance.
Shortly after, an alert will update you on the operation's status. 
This will submit the code of the Data Job to the Control Service and will create a Data Job Deployment. 
The Deployment process is asynchronous: even though the command completes quickly, it takes a while until the Data Job is deployed and ready for execution.
If you have configured `notified_on_job_deploy` in the `config.ini` of the Data Job, you will receive an email notification.
You can validate that the Data Job Deployment is completed by checking VDK's Operations UI.

<font color='red'>**ATTENTION!**</font> 
If this is your initial experience deploying a notebook job, we highly recommend starting with our tutorial: "How to deploy and configure job using notebook"(youtube video link to be added)