<a href="https://colab.research.google.com/github/antoniivanov/vdk-demo/blob/main/ingest/Ingest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# Ingest Guide

This notebook is a guide to ingesting data from different data sources using the [Versatile Data Kit (VDK)](https://github.com/vmware/versatile-data-kit).




<img src="https://bit.ly/start-ingest-guide-jpeg" width="600" />



<a name="prerequisites"></a>
## 1. Prerequisites

### 1.1 Good to Know Before You Start


This tutorial is designed to be accessible, but you'll find it easier if you're familiar with:

- **Python and SQL**: Basic commands and queries.
- **Data Concepts**: Simple data modeling and API usage.
- **Tools**: Comfort with the command line and Jupyter Notebook.

### 1.2 Useful notebook shortcuts


To run a code cell, use any of the following:

* Click the **Play icon** in the left gutter of the cell;
* Type **Cmd/Ctrl+Enter** to run the cell in place;
* Type **Shift+Enter** to run the cell and move focus to the next cell (adding one if none exists); or
* Type **Alt+Enter** to run the cell and insert a new code cell immediately below it.

There are additional options for running some or all cells in the **Runtime** menu on top.


### 1.3 Install Versatile Data Kit and enable plugins

In [None]:
!pip install quickstart-vdk vdk-notebook vdk-ipython==0.2.5 vdk-data-sources vdk-singer tap-rest-api-msdk

<a name="configuration"></a>
## 2. Configuration

In [None]:
%env DB_DEFAULT_TYPE=sqlite
%env INGEST_METHOD_DEFAULT=sqlite
%env INGESTER_WAIT_TO_FINISH_AFTER_EVERY_SEND=true

<a name="init"></a>
## 3. Initialize new VDK job (input)

In [None]:
"""
vdk.plugin.ipython extension introduces a magic command for Jupyter.
The command enables the user to load VDK for the current notebook.
VDK provides the job_input API, which has methods for:
    * executing queries to an OLAP database;
    * ingesting data into a database;
    * processing data into a database.
Type help(job_input) to see its documentation.

"""

# NOTE: This cell may fail the first time it is run. Run it again and it should succeed.

%reload_ext vdk.plugin.ipython
%reload_VDK
job_input = VDK.get_initialized_job_input()

<a name="explore"></a>

### 3.1 Explore what you can do (Task 1)

![image.png](https://github.com/vmware/versatile-data-kit/assets/2536458/80ba93a9-e2cf-4067-bd09-90807e06aa33)

In [None]:
# See all methods with help:
help(job_input)

#### 3.1.1 Access job arguments



In [None]:
print(job_input.get_arguments())

#### 3.1.2 Execute SQL Queries

In [None]:
%%vdksql
create table hello_world as
select 'Hello World!' as hello, 'English' as language
union all
select '¡Hola Mundo!', 'Spanish'
union all
select 'こんにちは世界', 'Japanese'
union all
select 'Bonjour le monde', 'French'
union all
select 'Hallo Welt', 'German'
union all
select 'Привет мир', 'Russian'

In [None]:
%%vdksql
select * from hello_world
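
Because the default database in this notebook is SQLite, the same `create table ... as select ... union all` pattern can also be tried outside VDK with Python's standard `sqlite3` module. The following is a minimal sketch under that assumption; it uses a throwaway in-memory database and only three of the rows, and it does not touch the database VDK writes to.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Same CREATE TABLE ... AS SELECT ... UNION ALL pattern, shortened to three rows.
# String literals use single quotes, as in standard SQL.
conn.execute(
    """
    create table hello_world as
    select 'Hello World!' as hello, 'English' as language
    union all
    select '¡Hola Mundo!', 'Spanish'
    union all
    select 'Hallo Welt', 'German'
    """
)

rows = conn.execute("select hello, language from hello_world").fetchall()
for hello, language in rows:
    print(f"{language}: {hello}")
conn.close()
```

The column names (`hello`, `language`) come from the aliases in the first `select`; the later `select` statements only need to supply values in the same order.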

#### 3.1.3 Manage state properties or secrets

In [None]:
import time
job_input.set_all_properties({"last_time_run": time.time()})
job_input.set_all_secrets({"secret": "my secret"})

print(job_input.get_all_properties())
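
One common reason to store a value such as `last_time_run` is to let a later run know how long ago the previous one happened. The sketch below illustrates that idea with a plain dictionary standing in for the result of `job_input.get_all_properties()`. The helper name `seconds_since_last_run` is hypothetical and is not part of VDK.

```python
import time
from typing import Optional


def seconds_since_last_run(properties: dict, now: float) -> Optional[float]:
    """Return the seconds elapsed since the stored timestamp, or None on a first run."""
    last = properties.get("last_time_run")
    return None if last is None else now - last


# A plain dict stands in for the job's stored properties.
properties = {}
now = time.time()
print(seconds_since_last_run(properties, now))  # nothing stored yet

# Pretend the previous run stored its timestamp 90 seconds ago.
properties["last_time_run"] = now - 90.0
elapsed = seconds_since_last_run(properties, now)
print(round(elapsed))
```

In the notebook you could pass `job_input.get_all_properties()` instead of the stand-in dictionary.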

#### 3.1.4 Check the available data sources

In [None]:
!vdk data-sources --list

One particularly important data source is **singer-tap**.

[Singer Taps](https://www.singer.io/#taps) extract data from many different sources. Versatile Data Kit makes it easy to reuse them.

To list some of the available Singer taps, run:

In [None]:
!vdk singer --list-taps

<a name="ingest"></a>
## 4. Ingesting data (Task 2)

We will ingest user data from an HTTP API (https://jsonplaceholder.typicode.com/users) into a database (SQLite in this case).

Feel free to pick any other data source or destination, but the instructions below are based on this scenario.

### Main Concepts

Before diving into the tutorial, let's get acquainted with some key terms:

- **Data Source**
A Data Source is like a bridge to your data. It handles connecting, reading, and maintaining a relationship with a specific set of data, like a database or an API.

- **Data Source Stream**
Think of a Data Source Stream as a lane on that bridge. Each lane (or stream) can carry specific types of data, like users, orders, etc. Streams allow data to flow in an organized manner and can be processed in parallel.

- **Data Source Payload**
The Payload is essentially the vehicle traveling on our bridge's lane. It carries the actual data, along with some extra information like what time it left and where it's headed (metadata), to help us understand the data better.
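
To make the three terms concrete, here is a small plain-Python sketch of how they relate. The classes `SketchPayload`, `SketchStream`, and `SketchSource` are illustrative stand-ins invented for this example; they are not the VDK classes and make no claim about how VDK implements them.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List


@dataclass
class SketchPayload:
    """Stand-in for a payload: one record plus metadata about it."""
    data: Dict[str, Any]
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class SketchStream:
    """Stand-in for a stream: a named sequence of records of one kind."""
    name: str
    records: List[Dict[str, Any]]

    def read(self) -> Iterator[SketchPayload]:
        # Wrap every record in a payload that remembers where it came from.
        for index, record in enumerate(self.records):
            yield SketchPayload(data=record, metadata={"stream": self.name, "index": index})


@dataclass
class SketchSource:
    """Stand-in for a data source: a collection of streams."""
    streams: List[SketchStream]


demo_source = SketchSource(streams=[
    SketchStream("users", [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]),
    SketchStream("orders", [{"id": 10, "user_id": 1}]),
])

# Read every stream of the source and collect the payloads.
payloads = [p for s in demo_source.streams for p in s.read()]
for p in payloads:
    print(p.metadata["stream"], p.data)
```

Each stream is read independently here, which is what makes it possible in principle to process streams in parallel.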



### 4.1 Install HTTP API data source

We will use the singer-tap data source with the REST API tap [tap-rest-api-msdk](https://pypi.org/project/tap-rest-api-msdk/), so we need to install it first. If you ran the install cell in section 1.3, the package is already present and this cell changes nothing.

In [None]:
!pip install tap-rest-api-msdk

### 4.2 Ingestion using configuration (toml)

Now, let's configure our source and destination for the data flow.

We will use the ipython magic ***%%vdkingest*** to define and **trigger** our ingestion pipeline.
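
The configuration in the next cell sets `records_path = "$.[*]"`. Assuming the tap interprets this as a JSONPath expression, `$` is the root of the response and `[*]` matches every element of an array, so each element of the top-level JSON array becomes one record. The sketch below illustrates that selection with the standard `json` module and a made-up two-user response; it does not call the real API or the tap.

```python
import json

# Made-up response body shaped like a top-level JSON array of user objects.
response_body = '[{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]'

parsed = json.loads(response_body)

# For a top-level array, selecting "every element of the root array"
# amounts to taking the elements of the parsed list.
records = list(parsed)

for record in records:
    print(record["id"], record["name"])
```

If an API wrapped its records in an object, for example under a `data` key, the path would need to point at that nested array instead.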


In [None]:
%%vdkingest

# Data Source Configuration
[sources.users]
# Data Source Name
name = "singer-tap"
# The singer tap we will use
config.tap_name = "tap-rest-api-msdk"

  # API Configuration for the Source
  [sources.users.config.tap_config]
  api_url = "https://jsonplaceholder.typicode.com"

  # Stream Configuration for the API endpoint /users
  [[sources.users.config.tap_config.streams]]
  name = "users"
  path = "/users"
  records_path = "$.[*]"
  num_inference_records = 200

# Data Destination Configuration
[destinations.sqlite]
method = "sqlite"

# Data Flows from Source to Destination
[[flows]]
from = "users"
to = "sqlite"


Let's verify the data by querying the database.




In [None]:
%%vdksql
select * from users

### 4.3 Ingestion using Python

Now let's trigger the same ingestion programmatically from Python.

In [None]:
# sources and destinations definitions
from vdk.plugin.data_sources.mapping.data_flow import DataFlowInput
from vdk.plugin.data_sources.mapping.definitions import DestinationDefinition
from vdk.plugin.data_sources.mapping.definitions import SourceDefinition
from vdk.plugin.data_sources.mapping.definitions import DataFlowMappingDefinition

# data source configuration
config = dict(tap_name="tap-rest-api-msdk",
              tap_config={
                  "api_url": "https://jsonplaceholder.typicode.com",
                  "streams": [
                      {
                          "name": "users",
                          "path": "/users",
                          "records_path": "$.[*]",
                          "num_inference_records": 200,
                      }
                  ],
              },
              tap_auto_discover_schema=True)

source = SourceDefinition(id="users", name="singer-tap", config=config)

sqlite_destination = DestinationDefinition(id="sqlite", method="sqlite")


In [None]:
# define the data flow mapping from the source to the destination
mapping = DataFlowMappingDefinition(source, sqlite_destination)

# execute the actual ingestion using the mapping defined above
with DataFlowInput(job_input) as flow_input:
    flow_input.start(mapping)


Let's verify the data by querying the database.


In [None]:
%%vdksql
select * from users

### 4.4 Remapping and simple transformation

Sometimes we need to apply a simple mapping or transformation on the way from source to destination. Let's see how to do that.
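
Conceptually, a mapping function receives one payload and returns the payload that should be written instead. The sketch below shows that shape in plain Python. `SketchPayload` and its `destination_table` field are hypothetical stand-ins for this example only, and the function builds a new dictionary rather than changing the input, which is a design choice of the sketch and not a VDK requirement.

```python
from dataclasses import dataclass, replace
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class SketchPayload:
    """Stand-in payload holding only the data and a destination table name."""
    data: Dict[str, Any]
    destination_table: Optional[str] = None


def add_column_and_reroute(p: SketchPayload) -> SketchPayload:
    # Build a new dict instead of mutating p.data, so the input stays untouched.
    new_data = {**p.data, "new_column": "new_column"}
    # Return a copy of the payload that targets a different table.
    return replace(p, data=new_data, destination_table="users_with_column")


original = SketchPayload(data={"id": 1, "name": "Ann"})
mapped = add_column_and_reroute(original)
print(mapped)
```

Returning a fresh payload keeps the mapping free of side effects, which makes it easier to test in isolation.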

In [None]:
from vdk.plugin.data_sources.data_source import DataSourcePayload

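def map_func(p: DataSourcePayload):
    # add a constant column to every record
    p.data["new_column"] = "new_column"
    # send the records to a different table than the default one
    new_table = "users_with_column"
    return DataSourcePayload(p.data, p.metadata, p.state, new_table)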


with DataFlowInput(job_input) as flow_input:
    flow_input.start(DataFlowMappingDefinition(source, sqlite_destination, map_func))



Now let's look at the data with the new column.

In [None]:
%%vdksql
select new_column, * from users_with_column

# Congratulations! 🎉

You've successfully completed the Data Ingestion Guide with VDK! We hope you found this guide useful.

## Your Feedback Matters!

We continuously strive to improve, and your feedback is invaluable to us. Please take a moment to complete our survey. It will only take a few minutes.

### [**👉 Complete the Survey Here 👈**](https://bit.ly/vdk-ingest-guide-survey)

Thank you for participating in this tutorial!
