# Welcome to Versatile Data Kit 


## Table of Contents
- [Introduction](#introduction)
- [What exactly is a Data Job?](#data-job)
- [Step 1: Login](#login)
- [Step 2: [Option 1] Create a Data Job](#create)
- [Step 2: [Option 2] Download a Data Job](#download)
- [Step 3: Develop Data Job](#develop)
- [Step 4: Run Data Job locally](#run)
- [Step 5: Schedule Data Job](#schedule)
- [Step 6: Deploy Data Job](#deploy)

## Introduction
Welcome to our introductory guide. 
Here, we'll walk you through the process of creating, developing, and deploying a Data Job using the Jupyter UI. 
Please take the time to go through this guide carefully so that you get a clear understanding of how to use VDK.

## What exactly is a Data Job? <a name="data-job"></a>

<p>A Data Job is a data processing unit that allows data engineers to implement automated pull ingestion (E in ELT) or batch data transformation into a Data Warehouse (T in ELT). At its core, it is a directory with different scripts and files inside it.</p>

<h4>Data Job directory can contain any files, however, there are some files that are treated in a specific way.</h4>

 
<details>
        <summary><strong style="color: Tomato;">Click to see them!</strong></summary>
        <ul>
            <li><strong style="color: blue;">SQL files (.sql)</strong> - called SQL steps - are directly executed as queries against a database.</li>
            <li><strong style="color: green;">Notebook files (.ipynb)</strong> - can contain labelled VDK cells (called notebook steps), which are executed as steps in a VDK Data Job.</li>
            <li><strong style="color: red;">Python files (.py)</strong> - called Python steps - are Python scripts that define a <code>run</code> function that takes the <code>job_input</code> object as an argument.</li>
            <li><strong style="color: purple;">config.ini</strong> is needed in order to schedule a job.</li>
            <li><strong style="color: brown;">requirements.txt</strong> is needed when your Python or Notebook steps use external python libraries.</li>
        </ul>
        <img src="./images/data-job-directory.png" width="1000" length="546" alt="Data Job Directory">
</details>



   <p>VDK supports having many Python, SQL and Notebook steps in a single Data Job. These steps are carried out in an ascending alphabetical sequence based on the file names, and Notebook steps are run from the beginning to the end of the file. For an effective execution order while retaining informative file names, it's beneficial to prefix your file names with numbers.</p>
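<p>The effect of numeric prefixes on ordering can be sketched with plain Python. This is only an illustration: the file names are hypothetical, and it assumes the ordering behaves like the ordinary lexicographic comparison that <code>sorted</code> performs on strings.</p>

```python
# Hypothetical step file names; the numeric prefixes are zero-padded on purpose.
padded = ["20_transform.sql", "10_ingest.py", "30_report.ipynb"]
padded_order = sorted(padded)
print(padded_order)

# Without padding, "10_..." sorts before "1_..." because '0' compares lower than '_'.
unpadded = ["2_transform.sql", "10_report.ipynb", "1_ingest.py"]
unpadded_order = sorted(unpadded)
print(unpadded_order)
```

<p>In the second list the report step would sort ahead of the ingest step, which is why prefixes of equal width (<code>10_</code>, <code>20_</code>, <code>30_</code>) are a safer choice.</p>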


## Step 1: Login <a name="login"></a>

<p>First, you need to log in to be able to create and deploy jobs.</p> 
<img src="./images/login.png" width="1000" length="546" alt="Login">


## Step 2: [Option 1] Create a Data Job <a name="create"></a>


<p>Now that we have explored VDK's capabilities, let's create our data job (if you want to update an existing job see <i>"Step 2: [Option 2]"</i>). </p>

<img src="./images/create.png" width="1000" length="546" alt="Create a Data Job">

<p>If no path is specified, the Jupyter root directory is used. 
A new directory named after the job is created inside that
path, containing a sample job. </p>

<font color='green'>**WARNING!**</font> Should you encounter difficulties while creating a job, consider referencing our tutorial: 'Example Tutorial'(add link to tutorial which includes creating a job from Jupyter UI)

## Step 2: [Option 2] Download a Data Job <a name="download"></a>


<img src="./images/download.png" width="1000" length="546" alt="Download Job">

<p>If no path is specified, the Jupyter root directory is used. </p>

<details>
    <summary><strong style="color: Tomato;">Convert Job To Notebook [Optional]</strong></summary>
    <p>After downloading a job, you can convert it from a directory-type job to a notebook job.</p>
    <img src="./images/convert.png" width="1000" length="546" alt="Convert Job">
    <p>In this way you can work with notebooks instead of .py and .sql files. To learn more about how to work with notebook steps check <i>Step 3</i>.</p>
</details>

## Step 3: Develop Data Job <a name="develop"></a>

<font color='green'>**TIP!**</font>
If this is your first time developing a notebook job, we highly recommend starting with our tutorial: "How to Build a Job Using VDK Notebook Integration." (tutorial link to be added)

<details>
    <summary><strong style="color: DarkRed;">Python steps</strong></summary>
    <p>This is the structure of a Python step:</p>
    
<pre style="background-color: #f5f5f5; padding: 15px; border: 1px solid #ccc; border-radius: 5px;"><code>
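from vdk.api.job_input import IJobInput

def run(job_input: IJobInput):
    # ... gather data; the dictionary below is only a placeholder payload ...
    data_to_ingest = {"example_column": "example_value"}
    job_input.send_object_for_ingestion(data_to_ingest, "table_from_the_data_lake_to_ingest_into")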
</code></pre>
    
<br>
<p>For every Python script, VDK provides an object, <code>job_input</code>, that has methods for:</p>
<ul>
        <li>Executing queries</li>
        <li>Ingesting data into the Data Lake</li>
        <li>Integrating Data Lake data into a dimensional model in Data Warehouse</li>
</ul>
</details>
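<p>A slightly fuller Python step might combine the first two capabilities listed above: run a query, then ingest the results. The sketch below is hedged: <code>FakeJobInput</code> is a hypothetical stand-in written for this guide so the example runs without VDK, and the table and column names are invented. Only the two method names, <code>execute_query</code> and <code>send_object_for_ingestion</code>, mirror the real <code>job_input</code> object.</p>

```python
class FakeJobInput:
    """Hypothetical stand-in for job_input so the sketch runs without VDK."""

    def __init__(self, rows):
        self.rows = rows
        self.ingested = []

    def execute_query(self, sql):
        # A real job_input would run the query against the configured database.
        return self.rows

    def send_object_for_ingestion(self, payload, destination_table):
        # A real job_input would queue the payload for ingestion.
        self.ingested.append((destination_table, payload))


def run(job_input):
    # Query a hypothetical source table, then ingest one object per row.
    rows = job_input.execute_query("SELECT id, name FROM source_table")
    for user_id, name in rows:
        job_input.send_object_for_ingestion(
            payload={"id": user_id, "name": name},
            destination_table="users_copy",
        )


fake = FakeJobInput(rows=[(1, "Ada"), (2, "Linus")])
run(fake)
print(fake.ingested)
```

<p>In a real Data Job you would keep only the <code>run</code> function in the step file; VDK supplies <code>job_input</code> when it executes the step.</p>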

<details>
    <summary><strong style="color: Blue;">SQL steps</strong></summary>
    <p>SQL scripts are standard SQL scripts.</p>
    
<p>Common uses of SQL steps are:</p>
    <ul>
        <li>Aggregating data from other tables to a new one</li>
        <li>Creating a table or a view that is needed for the Python steps</li>
    </ul>
</details>
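<p>To make the first use case concrete, here is a rough sketch of an aggregation query of the kind a SQL step might contain. It uses Python's built-in <code>sqlite3</code> module as a stand-in database purely so the example can run anywhere; the table names and data are hypothetical, and the SQL dialect of your actual database may differ.</p>

```python
import sqlite3

# In-memory database used only as a stand-in for the real target database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 10.0), ("bob", 5.0), ("alice", 2.5)],
)

# This string is what the body of a hypothetical .sql step file could look like.
aggregate_step = """
CREATE TABLE customer_totals AS
SELECT customer, SUM(amount) AS total
FROM orders
GROUP BY customer
"""
conn.execute(aggregate_step)

totals = conn.execute(
    "SELECT customer, total FROM customer_totals ORDER BY customer"
).fetchall()
print(totals)
```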

<details>
    <summary><strong style="color: Green;">Notebook Steps</strong></summary>
    <p>A single notebook can comprise multiple Data Job steps: distinct code segments executed by VDK. Within the notebook, VDK provides an object, <code>job_input</code>, that has methods for:</p>
<ul>
        <li>Executing queries</li>
        <li>Ingesting data into the Data Lake</li>
        <li>Integrating Data Lake data into a dimensional model in Data Warehouse</li>
</ul>
    
<h4>Execution Order and Identifying Cells</h4>
    <ul>
        <li>Cells can be labeled with the "vdk" tag.</li>
        <li>While this tag doesn't impact the data job's development phase, it is crucial for subsequent automated operations, such as a "VDK Run."</li>
        <li>Only the cells critical to the Data Job should be tagged.</li>
        <li>Untagged cells will be overlooked during deployment and other execution processes.</li>
        <li>You can easily recognize VDK cells by their unique color scheme, the presence of the VDK logo, and an exclusive numbering system.</li>
        <li>VDK cells in the notebook will be executed according to the numbering when executing the notebook data job with VDK.</li>
        <li>You can delete the cells that are not tagged with "vdk" as they are not essential to the data job's execution. However, removing VDK cells will result in a different data job execution.</li>
    </ul>
        <br>
<img src="./images/cell-tags.png" width="1000" length="546" alt="Cell tags">
    
<h5>Tips:</h5>
    <ul>
        <li>Before running the job, it is recommended to review the cells to ensure a clear understanding of the data job run. This will help to ensure the desired outcome.</li>
    </ul>
</details>
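<p>Cell tags are stored in the notebook file itself, under each cell's <code>metadata.tags</code> list. The sketch below illustrates how the tagged cells could be picked out of a notebook's JSON. It is not VDK's actual implementation: the notebook content is a made-up minimal example, and <code>vdk_cells</code> is a hypothetical helper.</p>

```python
import json

# A made-up, minimal notebook document with two tagged and two untagged cells.
notebook_json = """
{
  "cells": [
    {"cell_type": "code", "metadata": {"tags": ["vdk"]}, "source": ["job_input.execute_query('SELECT 1')"]},
    {"cell_type": "code", "metadata": {}, "source": ["print('scratch work')"]},
    {"cell_type": "markdown", "metadata": {}, "source": ["# Notes"]},
    {"cell_type": "code", "metadata": {"tags": ["vdk"]}, "source": ["print('second step')"]}
  ]
}
"""


def vdk_cells(notebook):
    """Return the code cells tagged "vdk", in the order they appear in the file."""
    return [
        cell
        for cell in notebook["cells"]
        if cell["cell_type"] == "code"
        and "vdk" in cell.get("metadata", {}).get("tags", [])
    ]


steps = vdk_cells(json.loads(notebook_json))
for number, cell in enumerate(steps, start=1):
    print(number, "".join(cell["source"]))
```

<p>The untagged scratch cell and the markdown cell are skipped, which matches the behaviour described above: only tagged cells become part of the Data Job.</p>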

## Step 4: Run Data Job locally <a name="run"></a>

<img src="./images/run.png" width="1000" length="546" alt="Run a job">

<p>If no path is specified, the Jupyter root directory is used. </p>

## Step 5: Schedule Data Job <a name="schedule"></a>

<p>All Data Jobs must contain a job configuration file called <code>config.ini</code>. 
You can change the schedule of the job by editing the <code>schedule_cron</code> option. </p>
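<p>As a rough illustration, the snippet below parses a sample configuration with Python's standard <code>configparser</code> and reads the schedule. The section names (<code>[owner]</code>, <code>[job]</code>), the team name, and the cron value are assumptions made for this example; check the <code>config.ini</code> generated for your own job for the exact layout.</p>

```python
import configparser

# Sample config.ini content; section names and values are assumptions.
sample_config = """
[owner]
team = my-team

[job]
schedule_cron = 0 6 * * *
"""

parser = configparser.ConfigParser()
parser.read_string(sample_config)

schedule = parser["job"]["schedule_cron"]
# A standard cron expression has five fields:
# minute, hour, day of month, month, day of week.
fields = schedule.split()
print(schedule, len(fields))
```

<p>In standard cron syntax, <code>0 6 * * *</code> means once a day at 06:00.</p>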

## Step 6: Deploy Data Job <a name="deploy"></a>

<p>Now that we are done with the modifications to the Data Job, we will deploy it in the Control Service.
Deployment takes both the build/code and the deployment-specific properties, then builds and packages them. Once a Data Job is deployed, it is ready for immediate execution in the execution environment and can be scheduled to run periodically.</p>

<img src="./images/deploy.png" width="1000" length="546" alt="Deploy a job">

<p>The deployment process is asynchronous: even though the command completes quickly, it takes a while until the Data Job is deployed and ready for execution.
If you have configured <code>notified_on_job_deploy</code> in the <code>config.ini</code> of the Data Job, you will get an email notification.
You can validate that the Data Job deployment is complete by checking VDK's Operations UI.</p>

<font color='green'>**TIP!**</font>
If this is your first time deploying a notebook job, we highly recommend starting with our tutorial: "How to deploy and configure job using notebook" (youtube video link to be added)