# Welcome to the Versatile Data Kit Demo Example!

## Workshop Steps
Now that you have opened up the MyBinder environment and are reading this, you are already on the right track! Inside this environment,
you will also find:
* sample scripts: This is a folder containing the base of the scripts that you will be working with to finish the exercise. Please look for the triple exclamation points (!!!) as that means that you are being asked to write some code to get things to work!
* README.md: This is just the README file you saw on the GitHub page.
* requirements.txt: This is a list of the required libraries that were installed upon startup.
* setup.ipynb: The file you are reading right now! Think of this as your home page.
* Other system files - postBuild and start: No need to worry about these. They are needed for the setup.

### Step 1: Explore VDK's Functionalities
Running a simple command in this notebook, "!vdk --help", lists VDK's available commands and options.


In [None]:
!vdk --help

### Step 2: Create a Data Job
Now that we have explored VDK's capabilities, let's create our data job. 

Keep in mind that we would like to have a sub-folder for the data job, so that our Streamlit script stays outside of it, in the main directory.

<font color='red'>**ATTENTION!**</font>
Based on the information above, try creating a data job titled "ingest", followed by your last name, your favorite sports team, and your favorite drink. For example, "ingest-userov". You can choose any team name that you want, but please create the job in the home directory, /home/jovyan. This will create a sub-folder for the data job.

Here's an example command, but <font color='red'>**ATTENTION!**</font>, please change `<unique-suffix>` to your own suffix before running it.

In [None]:
!vdk create -n ingest-<unique-suffix> -t team-awesome -p /home/jovyan
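
If you prefer to build the job name in Python rather than typing it by hand, the naming rule above can be sketched as follows. The helper `make_job_name` is hypothetical and not part of VDK; it only lowercases the parts you give it and joins them with hyphens.

```python
import re


def make_job_name(prefix: str, *parts: str) -> str:
    """Join the prefix and the given parts into a lowercase, hyphen-separated name."""
    cleaned = []
    for part in (prefix, *parts):
        # Replace every run of characters that are not letters or digits with a hyphen.
        slug = re.sub(r"[^a-z0-9]+", "-", part.lower()).strip("-")
        if slug:
            cleaned.append(slug)
    return "-".join(cleaned)


job_name = make_job_name("ingest", "Userov")
print(job_name)  # prints: ingest-userov
```

In a notebook cell, IPython expands `{job_name}` inside `!` commands, so the resulting variable can be reused in the `vdk create` call.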

### Step 3: Ingestion Job

Now that you have created a data job, please go inside the subfolder and set up the structure of your data job. Here's the general idea.

We want the data job to contain the following files:
* aa_ingest_rates.py: a Python file that we will use to ingest Polish currency rates
* config.ini: our configuration file
* requirements.txt: where we place our dependencies
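
To give a feel for what a Python step such as aa_ingest_rates.py could look like, here is a minimal sketch. It is not the workshop's actual script. It assumes the step reads a `destination_table` argument and sends rows through the job input's `send_object_for_ingestion` method; the default table name, the row fields, and the `FakeJobInput` class are hypothetical and exist only so the sketch runs without VDK installed.

```python
def run(job_input):
    """VDK calls run() for each Python step and passes in the job input object."""
    # In a real data job, job_input would be VDK's IJobInput (vdk.api.job_input).
    args = job_input.get_arguments()
    table = args.get("destination_table", "exchange_rates_series")  # assumed default

    # Placeholder rows, NOT real exchange rates; a real step would fetch them from a rates source.
    rows = [
        {"currency": "EUR", "date": "2000-01-03", "rate": 1.0},
        {"currency": "USD", "date": "2000-01-03", "rate": 2.0},
    ]
    for row in rows:
        job_input.send_object_for_ingestion(payload=row, destination_table=table)


class FakeJobInput:
    """Hypothetical stand-in for the VDK job input, used only to exercise run()."""

    def __init__(self, arguments):
        self._arguments = arguments
        self.sent = []

    def get_arguments(self):
        return self._arguments

    def send_object_for_ingestion(self, payload, destination_table):
        # Record what would have been ingested instead of writing to a database.
        self.sent.append((destination_table, payload))


fake = FakeJobInput({"destination_table": "exchange_rates_series_userov"})
run(fake)
print(len(fake.sent), "rows sent to", fake.sent[0][0])
```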
 


When you create a data job, VDK automatically downloads some template scripts and files, so that you can get accustomed to the data job's structure. They are very helpful in getting you ready to run your own data jobs, so please check them out! However, since we will not be starting from scratch in this example, let's go ahead and delete them. Alternatively, you can explore the 'vdk create --no-sample' option if you do not want these samples downloaded. Let's delete the following files:
* The SQL script: our example does not do anything with SQL.
* The sample Python script: we already have moved four sample Python scripts, so we won't be needing this.
* README.md: We already have a README for the entire example, so we can get rid of this.
* requirements.txt: Each data job would need this file if the data job relies on external libraries that VDK does not have. In our case, MyBinder installed those upon startup, so we won't be needing this either.

As such, please run the code below to delete them:

<font color='red'>**ATTENTION!**</font> Please change `<unique-suffix>` to the suffix of your data job.

In [None]:
! rm -rf ~/ingest-<unique-suffix>/*

! mv jobs/ingest-currency-exchange-rate/* ~/ingest-<unique-suffix>/ 
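
For readers who would rather do the same clean-up from Python, the two shell commands above can be approximated as below. This is a sketch under assumptions: `replace_job_files` is a hypothetical helper, the file names are placeholders, and the demo runs inside a temporary directory so nothing in your home directory is touched.

```python
import shutil
import tempfile
from pathlib import Path


def replace_job_files(sample_dir: Path, job_dir: Path) -> list:
    """Empty job_dir, then move every entry from sample_dir into it."""
    # Note: unlike `rm -rf dir/*`, iterdir() also returns hidden (dot) files.
    for entry in job_dir.iterdir():
        if entry.is_dir():
            shutil.rmtree(entry)
        else:
            entry.unlink()
    for entry in sample_dir.iterdir():
        shutil.move(str(entry), str(job_dir / entry.name))
    return sorted(p.name for p in job_dir.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "jobs" / "ingest-currency-exchange-rate"
    job = Path(tmp) / "ingest-userov"
    sample.mkdir(parents=True)
    job.mkdir()
    (job / "10_sample.sql").write_text("select 1;")  # placeholder template file
    (sample / "aa_ingest_rates.py").write_text("# step\n")
    (sample / "config.ini").write_text("[owner]\n")
    result = replace_job_files(sample, job)
    print(result)  # prints: ['aa_ingest_rates.py', 'config.ini']
```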

Great! Now you're all set up with the data job:
* You have created a data job.
* You have deleted the template files that you do not need.
* You have moved the sample scripts we provided to the data job sub-folder.


Now let's run the ingest data job. Because we are sharing a database with other participants, let's give our table a unique suffix, as we did for the data job.

In [None]:
! vdk run ingest-<unique-suffix> --arguments '{"destination_table": "exchange_rates_series_<unique-suffix>"}'
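
Hand-writing JSON inside shell quotes is easy to get wrong. As a sketch, the command above can be assembled in Python so that the JSON and the quoting are generated for you. `build_run_command` is a hypothetical helper, and `userov` is only an example suffix.

```python
import json
import shlex


def build_run_command(job_name: str, arguments: dict) -> str:
    """Return a `vdk run` command line with the arguments encoded as shell-quoted JSON."""
    payload = json.dumps(arguments)
    return f"vdk run {shlex.quote(job_name)} --arguments {shlex.quote(payload)}"


suffix = "userov"  # replace with your own suffix
command = build_run_command(
    f"ingest-{suffix}",
    {"destination_table": f"exchange_rates_series_{suffix}"},
)
print(command)
# prints: vdk run ingest-userov --arguments '{"destination_table": "exchange_rates_series_userov"}'
```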

In [None]:
! vdk sqlite-query -q "select * from exchange_rates_series_<unique-suffix>"

### Step 4: Processing job

Please open up jobs/process-exchange-rate. Inside it, you will see the code template already populated. Let's explore.

First, the Python script defines VDK's "run" function. This is how VDK knows which code is part of its execution path.<br>
In our case, the function sets default variables.<br>
Then we have two SQL files: one to create the table and one to populate it.<br>
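
The flow described above (apply defaults, create the table, populate it) can be sketched with Python's built-in `sqlite3` module standing in for VDK. Everything specific here is an assumption: the default table names, the column names, and the averaging query are illustrative and are not taken from the workshop's SQL files.

```python
import sqlite3

# Assumed defaults, overridden by whatever arguments are passed in.
DEFAULTS = {
    "source_table": "exchange_rates_series",
    "destination_table": "aggregate_rates",
}

CREATE_SQL = "CREATE TABLE IF NOT EXISTS {destination_table} (currency TEXT, avg_rate REAL)"
POPULATE_SQL = (
    "INSERT INTO {destination_table} "
    "SELECT currency, AVG(rate) FROM {source_table} GROUP BY currency ORDER BY currency"
)


def process(connection, arguments=None):
    """Apply defaults, then run the create-table and populate-table statements in order."""
    # Table names cannot be bound as SQL parameters, so they are formatted in;
    # only do this with trusted names.
    params = {**DEFAULTS, **(arguments or {})}
    connection.execute(CREATE_SQL.format(**params))
    connection.execute(POPULATE_SQL.format(**params))
    query = f"SELECT currency, avg_rate FROM {params['destination_table']} ORDER BY currency"
    return connection.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE exchange_rates_series (currency TEXT, date TEXT, rate REAL)")
# Placeholder values, not real exchange rates.
conn.executemany(
    "INSERT INTO exchange_rates_series VALUES (?, ?, ?)",
    [("EUR", "2000-01-03", 1.0), ("EUR", "2000-01-04", 3.0), ("USD", "2000-01-03", 2.0)],
)
result = process(conn)
print(result)  # prints: [('EUR', 2.0), ('USD', 2.0)]
```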

Let's create and populate our processing job: 

In [None]:
!vdk create -n process-<unique-suffix> -t team-awesome -p /home/jovyan

In [None]:
! rm -rf ~/process-<unique-suffix>/*

! mv jobs/process-exchange-rate/* ~/process-<unique-suffix>/ 

In [None]:
! vdk run process-<unique-suffix> --arguments '{"source_table": "exchange_rates_series_<unique-suffix>", "destination_table": "aggregate_rates_<unique-suffix>"}'

In [None]:
! vdk sqlite-query -q "select * from aggregate_rates_<unique-suffix>"

### Step 5: Deploy

Since the correlation analysis that we perform is on a weekly basis, it makes sense to schedule our data job to run once per week. VDK allows the **automatic execution of data jobs by deploying them on a cloud server**, which handles the regular execution on a schedule that the user defines. The deployment configurations are entered in the **"config.ini"** file that is required for deployment.  
Let's open it up and examine the contents.

In the first section [owner], we have specified the **team owning the data job**. In the second section [job], we defined the schedule of execution. It is in cron format (you can use [this website](https://crontab.guru/#*/20_*_*_*_*) to translate a cron schedule into a human-readable form). In this case, we want the job to run every Monday at 00:01 US time (UTC-5 here). Since VDK uses UTC for schedule execution, the cron schedule indicates 05:01 UTC.
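
The time conversion behind the schedule can be checked with a few lines of Python. This is a sketch: `weekly_cron_in_utc` is a hypothetical helper, and it assumes a fixed UTC-5 offset, so it ignores daylight saving time.

```python
from datetime import datetime, timedelta, timezone


def weekly_cron_in_utc(local_hour, local_minute, utc_offset_hours, cron_weekday):
    """Convert a weekly local-time schedule into a five-field cron expression in UTC."""
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    # 2024-01-01 is a Monday; shift it to the requested weekday (cron: 0 = Sunday, 1 = Monday).
    day_shift = (cron_weekday - 1) % 7
    local = datetime(2024, 1, 1, local_hour, local_minute, tzinfo=local_zone)
    local += timedelta(days=day_shift)
    utc = local.astimezone(timezone.utc)
    # Python counts Monday as 0, cron counts Sunday as 0.
    utc_cron_weekday = (utc.weekday() + 1) % 7
    return f"{utc.minute} {utc.hour} * * {utc_cron_weekday}"


cron = weekly_cron_in_utc(0, 1, -5, 1)  # Monday 00:01 at UTC-5
print(cron)  # prints: 1 5 * * 1
```

Note that a late-evening local time can roll over to the next weekday in UTC, which is why the helper recomputes the weekday after converting.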

The config file could also include a [contacts] section which specifies whether any **notifications** are sent to specific emails upon job execution success, failure or deployment. In our case, we have left those empty.

The last part of the config file contains the **VDK configuration settings** - the type of DB to which we will be ingesting, the DB location, schema and catalogue. 
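
To make the layout concrete, the sketch below parses a small sample with Python's `configparser`. The section names follow the description above, but the key names and values are assumptions made for illustration; check your own config.ini for the actual contents.

```python
import configparser

# Illustrative sample only; key names and values are assumptions.
SAMPLE_CONFIG = """
[owner]
team = team-awesome

[job]
schedule_cron = 1 5 * * 1

[contacts]
notified_on_job_failure_user_error =
notified_on_job_success =

[vdk]
db_default_type = SQLITE
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_CONFIG)

team = config["owner"]["team"]
schedule = config["job"]["schedule_cron"]
print(team, "|", schedule, "|", config["vdk"]["db_default_type"])
# prints: team-awesome | 1 5 * * 1 | SQLITE
```

Empty values in the [contacts] section are read back as empty strings, which matches leaving the notification lists blank.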

For a full list and explanations of the configuration settings you could enter into the "config.ini" file of a data job, you can run the following command:

In [None]:
!vdk config-help


Let's now deploy the data job!

Run the commands below, but first **remember to replace `<unique-suffix>` with your own suffix** after the "-n" and in the directory path.

In [None]:
!vdk deploy -n ingest-<unique-suffix> -t team-awesome  -r "Initial deploy" -p /home/jovyan/ingest-<unique-suffix>

In [None]:
!vdk deploy -n process-<unique-suffix> -t team-awesome  -r "Initial deploy" -p /home/jovyan/process-<unique-suffix>

In [None]:
! vdk deploy --show -n ingest-<unique-suffix> -t team-awesome

We can now inspect the data job in Git: 

Go to https://github.com/vdk-ml-community/data-jobs

If there is an issue, revert to an earlier version:

In [None]:
! vdk deploy --update --job-version <old-version> -n ingest-<unique-suffix> -t team-awesome

### Step 6: Extend (Anonymize)

Go to https://github.com/vdk-ml-community/vdk-demo/tree/main/plugins/vdk-poc-anonymize

In [None]:
! pip install -e plugins/vdk-poc-anonymize

In [None]:
! vdk run ingest-<unique-suffix> --arguments '{"destination_table": "exchange_rates_series_<unique-suffix>"}'

In [None]:
! vdk sqlite-query -q "select * from exchange_rates_series_<unique-suffix>"
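
The plugin's source is at the link above, and the sketch below is not taken from it. It only illustrates one common way to anonymize a record before it is stored: replacing selected values with a SHA-256 digest. The `anonymize` function and the choice of fields are hypothetical.

```python
import hashlib


def anonymize(payload: dict, sensitive_keys: set) -> dict:
    """Return a copy of payload with the values of sensitive_keys replaced by SHA-256 digests."""
    result = {}
    for key, value in payload.items():
        if key in sensitive_keys:
            # Hash the string form of the value; the original value is not kept.
            result[key] = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
        else:
            result[key] = value
    return result


row = {"currency": "EUR", "rate": 1.0}  # placeholder values
masked = anonymize(row, {"currency"})
print(masked)
```

The original `row` is left unchanged, so the same record can still be inspected before and after masking.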

### Step 7: Extend (SQL validation)

Go to https://github.com/vdk-ml-community/vdk-demo/tree/main/plugins/vdk-validate

In [None]:
! pip install -e plugins/vdk-validate

In [None]:
! vdk run process-<unique-suffix> --arguments '{"source_table": "exchange_rates_series_<unique-suffix>", "destination_table": "aggregate_rates_<unique-suffix>"}'
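
As a rough illustration of what checking SQL before running it can mean, the sketch below asks an in-memory SQLite database to compile a statement without executing it. This is an assumption-based example, not the vdk-validate plugin's implementation; `is_valid_sql` and the sample schema are hypothetical.

```python
import sqlite3


def is_valid_sql(statement: str, schema_sql: str = "") -> bool:
    """Return True if SQLite can compile the statement against the given schema."""
    conn = sqlite3.connect(":memory:")
    try:
        if schema_sql:
            conn.executescript(schema_sql)
        # EXPLAIN compiles the statement and reports its plan without running it.
        conn.execute("EXPLAIN " + statement)
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


schema = "CREATE TABLE rates (currency TEXT, rate REAL);"
print(is_valid_sql("SELECT currency FROM rates", schema))
print(is_valid_sql("SELEC currency FROM rates", schema))
```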
