Releases: vmware/versatile-data-kit

Versatile Data Kit 0.12

29 Mar 10:40
9ca0f3d

Major features include:

Open-sourcing VDK Operations UI

VDK Operations UI enables data practitioners to efficiently manage (operate and monitor) their data jobs.
It has been used internally at VMware for some time, and the team open-sourced it last month.

Check out more details at the Operations UI VEP

Look forward to the official launch soon.

Documentation Improvements

Significantly simplified and improved the main README and CONTRIBUTING.md, thanks to @gary-tai and @zverulacis.

VDK Meta Jobs Preparation for Alpha release

Implemented a limit on the number of jobs that can be started at once:

META_JOBS_MAX_CONCURRENT_RUNNING_JOBS=<number>
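
To illustrate what such a limit means in practice, here is a minimal sketch of a scheduling step that never lets more than the configured number of jobs run at once. It is not the plugin's actual implementation: `read_limit`, `plan_starts`, and the fallback value of 2 are hypothetical and exist only for this illustration. Only the option name comes from the release.

```python
from collections import deque


def read_limit(env, default=2):
    """Read the limit from a mapping of environment variables.

    The default of 2 is an arbitrary choice for this sketch.
    """
    return int(env.get("META_JOBS_MAX_CONCURRENT_RUNNING_JOBS", default))


def plan_starts(pending, running, max_running):
    """Start as many pending jobs as the limit allows.

    pending: deque of job names waiting to start (mutated in place).
    running: set of job names currently running (mutated in place).
    Returns the list of jobs started by this call.
    """
    started = []
    while pending and len(running) < max_running:
        job = pending.popleft()
        running.add(job)
        started.append(job)
    return started
```

Calling `plan_starts` again after a job finishes (and has been removed from `running`) starts the next pending job.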

Learn more about the VDK Meta Jobs features in VDK Meta Jobs VEP

Started initiative to support multiple python versions

We are working on introducing an optional python_version property to the Control Service API, which allows users to specify the Python version they want to use for their job deployment. This means users no longer have to rely on the service administrator to make changes to the configuration and can deploy their jobs with the version they need.

See more information in the Multiple Python Versions VEP

Started initiative to create Secrets Interface

Until now, the recommended way to store secrets in VDK has been the Properties API. Although it works well, it does not really meet the criteria for storing restricted data, and likely not confidential data either.

The team is working on a Secrets interface, similar to the Properties interface, backed by HashiCorp Vault.

See more information in the Vault Integration For Secrets Storage VEP

Full Changelog: v0.11...v0.12

Versatile Data Kit 0.11

22 Feb 15:25
e538bc7

Major features include:

Introduce data quality checks (pre-alpha) (for scd1 template)

Allows quality checks to be run before the data is inserted into the target table.
Currently, the checks done in the processing step do not verify whether the semantics of the data are correct. Bad data could therefore end up in the target table, which is usually unwanted.

Example:

    def sample_check(tmp_table_name):
        # Fails (returns False) when the staging table name contains "bad".
        return "bad" not in tmp_table_name

    template_args["check"] = sample_check
    job_input.execute_template(
        template_name="load/dimension/scd1",
        template_args=template_args,
    )

Jobs Query API (GraphQL) wildcard matching filter for team and job names

When querying information about jobs, users of the Jobs Query API can now use wildcard matching (for example, *search*) in the GraphQL filters for job name and team name, in addition to the existing exact matching of search strings.
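
The matching semantics can be illustrated with a small sketch. This is not the Control Service implementation. It assumes that `*` matches any run of characters, that a pattern without `*` must match exactly, and that matching is case-sensitive. The function name `matches` is hypothetical.

```python
import re


def matches(pattern: str, value: str) -> bool:
    """Return True if value matches pattern, where '*' matches any characters.

    A pattern without '*' degrades to an exact comparison.
    """
    # Escape every literal part so only '*' acts as a wildcard.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, value) is not None
```

With this sketch, `matches("*search*", "my-search-job")` is true, while `matches("search", "my-search-job")` is false because no wildcard is present.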

Provide User Agent when using VDK CLI

Users want to be able to determine where requests originated when analyzing telemetry data about VDK Control Service usage. The user agent can now be set explicitly:

export VDK_CONTROL_SERVICE_USER_AGENT=foo

or in config.ini

[vdk]
vdk_control_service_user_agent=foo

If not set, it defaults to "vdk-control-cli/{version} ({os.name}; {sys.platform})" + {python version}.
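
As a rough sketch of how such a default could be composed and overridden: the release does not state exactly how the Python version is appended, so the `Python/<version>` suffix below is an assumption. The function names are hypothetical as well.

```python
import os
import platform
import sys


def default_user_agent(cli_version: str) -> str:
    """Build a default user agent in the documented shape."""
    base = f"vdk-control-cli/{cli_version} ({os.name}; {sys.platform})"
    # Assumption: the Python version is appended as "Python/<version>".
    return f"{base} Python/{platform.python_version()}"


def resolve_user_agent(env, cli_version: str) -> str:
    """Prefer the explicitly configured value, else fall back to the default."""
    return env.get("VDK_CONTROL_SERVICE_USER_AGENT") or default_user_agent(cli_version)
```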

New plugin: vdk-notebook

A new VDK plugin that supports running data jobs that consist of .ipynb files. See the VDK Notebook plugin page for more information.

vdk-ipython

This extension introduces a magic command for Jupyter. The command lets users load job_input for their current data job and use it freely while working in Jupyter.
See the VDK ipython plugin page for more information.

Installation

Check the installation page

Versatile Data Kit 0.10

10 Jan 21:26
c6ee7a7

Summary

Major features include:

vdk-jobs-troubleshooting - new plugin

Introduces thread-dump capabilities in the Data Jobs

See more details in the plugin home page and the VDK Enhancement Proposal

Support for Python 3.11

Introduces support for Python 3.11 in vdk-core and other plugins

Package versions

See installation instructions here.
The versions of VDK components released under VDK 0.10 are:

Main components

control-service 1.5.707959356
vdk-core==0.3.723457889

Plugins

vdk-lineage-model==0.0.723435904
vdk-meta-jobs==0.1.723435904
vdk-sqlite==0.1.730902357
vdk-jobs-troubleshooting==0.2.741769066
vdk-lineage==0.3.723435904
vdk-control-cli==1.3.736732752

What's Changed

  • control-service: add docs on using different versions of k8s by @murphp15 in #1473
  • control-service: fix secret in helm chart by @murphp15 in #1379
  • control-service: graphql wildcard matching filter for team and job names by @mrMoZ1 in #1459
  • control-service: latest graphql version by @murphp15 in #1384
  • control-service: migrate from springfox to springdocs by @murphp15 in #1424
  • control-service: release helm chart with correct image tag by @murphp15 in #1383
  • control-service: reset termination status when job is disabled by @doks5 in #1405
  • control-service: run ci on gradle version change by @murphp15 in #1371
  • control-service: run release test on dependency version change by @tozka in #1400
  • control-service: set registry name correctly. by @murphp15 in #1331
  • control-service: use correct secret type by @murphp15 in #1370
  • control-service: user-agent tag should have the correct format by @murphp15 in #1412
  • examples: clarify README sample anonymize plugin by @tozka in #1394
  • Update README.md by @dimirapetrova in #1373
  • Update README.md for INSERT by @dimirapetrova in #1364
  • vdk-control-cli and some plugins: Support for 3.11 by @tozka in #1409
  • vdk-control-cli: address vulnerability in python dependency by @tozka in #1470
  • vdk-control-cli: allow cli users to explicitly set the user agent tag by @murphp15 in #1403
  • vdk-core: get_managed_connection should return opened connection by @tozka in #1410
  • vdk-core: support for 3.11 by @tozka in #1395
  • vdk-gitlab: upgrade the gitlab runner to latest version by @tozka in #1398
  • vdk-gitlab-runners: increase concurrent pipelines by @tozka in #1396
  • vdk-jobs-trobleshooting: Introduce plugin API and configuration by @doks5 in #1447
  • vdk-jobs-troubleshooting: add thread-dump utility by @doks5 in #1456
  • vdk-jobs-troubleshooting: improve robustness of the plugin by @dakodakov in #1487
  • vdk-jobs-troubleshooting: release the plugin by @dakodakov in #1481
  • vdk-jupyter: splitting functionalities of vdk-notebook Cell class by @duyguHsnHsn in #1465
  • vdk-jupyter: add create job command to jupyter front-end extension by @duyguHsnHsn in #1478
  • vdk-jupyter: add delete job command to jupyter by @duyguHsnHsn in #1488
  • vdk-jupyter: changes on diagrams and definition in notebook-plugin section in VEP by @duyguHsnHsn in #1427
  • vdk-jupyter: create notebook-plugin by @duyguHsnHsn in #1411
  • vdk-jupyter: deleting the yarn.lock file because of security issue by @duyguHsnHsn in #1382
  • vdk-jupyter: notebook-plugin by @duyguHsnHsn in #1415
  • vdk-jupyter: python subprocess security problem by @duyguHsnHsn in #1463
  • vdk-jupyter: run VDK job by @duyguHsnHsn in #1454
  • vdk-jupyter: VEP - adding the definition of Notebook step by @duyguHsnHsn in #1386
  • vdk-lineage, vdk-lineage-model, vdk-meta-jobs: support for Python 3.11 by @tozka in #1448
  • vdk-plugins: introduce vdk-jobs-troubleshooting plugin by @doks5 in #1428
  • vdk-sqlite: support for Python 3.11 by @tozka in #1466
  • vdk-trino: support for Python 3.11 by @tozka in #1471
  • vep-1416: address feedback and update proposal by @doks5 in #1491
  • versatile-data-kit: VEP-1416 vdk-troubleshooting-tools by @doks5 in #1423

Full Changelog: v0.9...v0.10

Versatile Data Kit 0.9

30 Nov 13:58
e32160e

Summary

Major features include:

vdk-meta-jobs new plugin

Using this plugin, you can specify dependencies between data jobs as a directed acyclic graph (DAG).

For example

def run(job_input):
    jobs = [
        {
            "job_name": "name-of-job",
            "team_name": "team-of-job",
            "fail_meta_job_on_error": True,  # or False
            "depends_on": ["name-of-job1", "name-of-job2"],
        },
        # ... more job definitions
    ]
    MetaJobInput().run_meta_job(jobs)
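
To make the DAG idea concrete, the sketch below derives a valid execution order from job definitions of the shape shown above, using the standard library's `graphlib` (Python 3.9+). It only illustrates dependency ordering and does not reflect how the plugin schedules jobs. `execution_order` is a hypothetical helper.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(jobs):
    """Return job names ordered so that every job comes after its dependencies.

    Raises graphlib.CycleError if the dependencies are not acyclic.
    """
    # Map each job to the set of jobs it depends on (its predecessors).
    graph = {job["job_name"]: set(job.get("depends_on", [])) for job in jobs}
    return list(TopologicalSorter(graph).static_order())
```

A dependency cycle (for example, two jobs depending on each other) is rejected with `CycleError`, which is what "acyclic" requires.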

See more details in the plugin home page

Control Service security hardening

  • Options for jobs to run in read-only file system
  • Credentials configuration that lets the Control Service use private images
  • Use a separate file system for storing temporary user-supplied files by Control Service
  • Enhanced job upload validation for zip exploits and unallowed files

Data Job Upload validation allow list

During installation of the Control Service, administrators can limit which types of files can be uploaded as part of a data job.
A new configuration option called uploadValidationFileTypesAllowList has been added.
It is a comma-separated list of file types.

For example, setting

uploadValidationFileTypesAllowList=image/png,text/plain

means that only PNG images and plain text files can be uploaded. All other upload requests will fail.
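
The allow-list check can be sketched as follows. The Control Service is not written in Python, so this is purely illustrative. The helper names are hypothetical, and treating an empty list as "no restriction" is an assumption of this sketch.

```python
def parse_allow_list(raw: str) -> set:
    """Split a comma-separated list of file types into a normalized set."""
    return {item.strip().lower() for item in raw.split(",") if item.strip()}


def is_upload_allowed(detected_type: str, allow_list: set) -> bool:
    """Check a detected file type against the allow list."""
    # Assumption: an empty allow list means no restriction is configured.
    if not allow_list:
        return True
    return detected_type.lower() in allow_list
```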

See more details in helm chart documentation

vdk-logging-format - new plugin

This plugin allows for the configuration of the format of VDK logs.

Previously there were separate plugins for each format, but they are now deprecated in favour of this one.

The plugin introduces a new configuration option LOGGING_FORMAT with possible values 'json', 'ltsv', 'text'

export LOGGING_FORMAT=JSON
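
For readers unfamiliar with the three formats, the sketch below shows roughly what selecting a format by name could look like with the standard `logging` module. The field names, the class name `SelectableFormatter`, and the case-insensitive handling of the value are assumptions. The plugin's real output may differ.

```python
import json
import logging

SUPPORTED_FORMATS = ("json", "ltsv", "text")


class SelectableFormatter(logging.Formatter):
    """Format log records as JSON, LTSV, or plain text, chosen by name."""

    def __init__(self, format_name: str):
        super().__init__()
        # Assumption: the configured value is matched case-insensitively.
        self.format_name = format_name.lower()
        if self.format_name not in SUPPORTED_FORMATS:
            raise ValueError(f"unsupported logging format: {format_name}")

    def format(self, record: logging.LogRecord) -> str:
        fields = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if self.format_name == "json":
            return json.dumps(fields)
        if self.format_name == "ltsv":
            # LTSV: tab-separated "label:value" pairs.
            return "\t".join(f"{key}:{value}" for key, value in fields.items())
        return f"{fields['level']} {fields['logger']} {fields['message']}"
```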

Control Service helm chart support for Postgres

The Bitnami PostgreSQL chart has been added as an option for the embedded database that stores control-service metadata.

Users can now enable it at install time (replace <chart> with the Control Service chart reference):

helm install vdk-control-service <chart> --set postgresql.enabled=true --set cockroachdb.enabled=false

Package versions

See installation instructions here.
The versions of VDK components released under VDK 0.9 are:

Main components

control-service 1.5.707959356
vdk-core==0.3.692414765

Plugins

vdk-logging-json==0.1.693641831
vdk-meta-jobs==0.1.684477187
vdk-postgres==0.0.692283840
vdk-trino==0.4.703555598

What's Changed

  • control-service: Container read-only file system by @gageorgiev in #1291
  • control-service: Expose LOGGING_FORMAT through helm chart by @gageorgiev in #1329
  • control-service: a directory can be manually set as a location to store databjobs when processing them to git. by @murphp15 in #1290
  • control-service: add empty dir storage by @murphp15 in #1293
  • control-service: add support for allowlist in helm chart. by @murphp15 in #1283
  • control-service: add tests for some zip exploits by @tozka in #1266
  • control-service: builder base image in helm by @murphp15 in #1359
  • control-service: builder images load secrets from k8s by @murphp15 in #1358
  • control-service: create the secret in the correct namespace. by @murphp15 in #1318
  • control-service: deprecated jobsList endpoint cleanup by @ivakoleva in #1296
  • control-service: fix helm template by @murphp15 in #1295
  • control-service: fix ingress template by @murphp15 in #1277
  • control-service: helm chart for private builder by @murphp15 in #1336
  • control-service: namespace can be null by @murphp15 in #1349
  • control-service: postgresql embedded by @ivakoleva in #1273
  • control-service: refactor db query to mitigate race condition by @mrMoZ1 in #1269
  • control-service: release newer version of job builder by @murphp15 in #1362
  • control-service: set registry name correctly. by @murphp15 in #1323
  • control-service: test cleanup with the goal of making tests easier to run locally by @murphp15 in #1343
  • control-service: upload validation by @tozka in #1268
  • vdk-jupyter: Expand details on extensions design by @duyguHsnHsn in #1304
  • quickstart-vdk: Include vdk-logging-format by @gageorgiev in #1313
  • vdk-audit: set python requires >= 3.8 by @tozka in #1289
  • vdk-control-api-auth: Fix error message formatting by @gageorgiev in #1303
  • vdk-control-cli: fix cicd by @mrMoZ1 in #1327
  • vdk-control-cli: update doc for deployment of multiple jobs w/single command by @mrMoZ1 in #1325
  • vdk-core: Allow for modification of dynamic params by @doks5 in #1267
  • vdk-core: resolve library error classification on startup by @mrMoZ1 in #1241
  • vdk-events: add presentation slides of DSC event by @tozka in #1335
  • vdk-jupyter: introduce JupterLab extension by @duyguHsnHsn in #1338
  • vdk-logging-format: Fix path to readme in setup.py by @gageorgiev in #1322
  • vdk-logging-format: Join JSON and LTSV logging plugins into one by @gageorgiev in #1312
  • vdk-logging-json, vdk-logging-ltsv: Delete deprecated plugins by @gageorgiev in #1319
  • vdk-meta-jobs: Initial implementation by @tozka in #1249
  • vdk-postgres: add ingest plugin by @tozka in #1314
  • vdk-trino: Fix typo in the documentation by @tozka in #1340

Full Changelog: v0.8...v0.9

Versatile Data Kit 0.8

26 Oct 08:08
5556e2d

Summary

Major features include:

New plugin: VDK Audit

This plugin provides the ability to audit and potentially limit user operations. It requires Python 3.8 or newer. These operations can be deep within the Python runtime or standard libraries, such as dynamic code compilation, module imports, or OS command invocations.

If we want to forbid some os.* operations we can do it like this:

export AUDIT_HOOK_ENABLED=true
export AUDIT_HOOK_FORBIDDEN_EVENTS_LIST='os.removexattr;os.rename;os.rmdir;os.scandir'
export AUDIT_HOOK_EXIT_ON_FORBIDDEN_EVENT=true

vdk run <job-name>
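
Under the hood, Python 3.8 introduced `sys.addaudithook`, which is what makes this kind of auditing possible. The sketch below shows how a semicolon-separated list of forbidden events could be turned into an audit hook. It is not the plugin's code: the helper names are hypothetical, and raising `RuntimeError` on a forbidden event is an assumption standing in for whatever the plugin does when AUDIT_HOOK_EXIT_ON_FORBIDDEN_EVENT is enabled.

```python
import sys


def parse_forbidden_events(raw: str) -> frozenset:
    """Split a semicolon-separated list of audit event names."""
    return frozenset(event.strip() for event in raw.split(";") if event.strip())


def make_audit_hook(forbidden, fail_on_forbidden, on_violation=print):
    """Build a hook with the (event, args) signature expected by sys.addaudithook."""

    def hook(event, args):
        if event in forbidden:
            on_violation(f"forbidden audit event: {event}")
            if fail_on_forbidden:
                # Assumption: abort the operation by raising from the hook.
                raise RuntimeError(f"operation not allowed: {event}")

    return hook


def install(raw_events: str, fail_on_forbidden: bool) -> None:
    """Register the hook. Audit hooks cannot be removed once added."""
    sys.addaudithook(
        make_audit_hook(parse_forbidden_events(raw_events), fail_on_forbidden)
    )
```

Because audit hooks stay installed for the lifetime of the interpreter, `install` is defined here but deliberately not called.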

See more details in the vdk-audit plugin page

Any version of python in VDK Control Service

Jobs deployed by the Control Service can now automatically use any version of Python, not just 3.7.

Insert only impala load template

This template can be used to load raw data from a Data Lake into a target table in a Data Warehouse. In summary, it appends all records from the source table to the target table. As with all other SQL modeling templates, the schema is validated, and table refresh and statistics computation are performed when necessary.

Example:

def run(job_input):
    # . . .
    template_args = {
        'source_schema': 'source',
        'source_view': 'view_source',
        'target_schema': 'target',
        'target_table': 'destination_table'
    }
    job_input.execute_template('insert', template_args)

See more details in the template documentation page

Package versions

See installation instructions here.
The versions of VDK components released under VDK 0.8 are:

Main components

control-service 1.5.671965442
vdk-core==0.3.662978536

Plugins

vdk-ingest-http==0.2.670842377
vdk-impala==0.4.672320306

What's Changed

  • control-service: CVE fix - upgrade commons-text by @tozka in #1255
  • control-service: Dynamic python site-packages directory detection by @mivanov1988 in #1247
  • control-service: fix cicd deployment by @tozka in #1226
  • control-service: fix integration tests by @tozka in #1211
  • control-service: fix race condition in test by @murphp15 in #1227
  • control-service: refactor job cancellation method due to 404 errors by @mrMoZ1 in #1114
  • control-service: remove executables from secure job builder by @mivanov1988 in #1202
  • control-plane: better error logging for transient error in tests by @murphp15 in #1222
  • control-service: improve docs and local runability of integration tests by @murphp15 in #1217
  • control-service: upgrade java client k8s version by @murphp15 in #1216
  • vdk-core: errors occurred and the state (handled or not) context missing by @ivakoleva in #1182
  • vdk-core: errors occurred and the state (handled or not) context missing by @tozka in #1212
  • vdk-core: platform error no longer logged when skipping execution steps by @mrMoZ1 in #1223
  • vdk-impala: Fix parsing while analysing profile for lineage information by @kostoww in #1206
  • vdk-impala: Refactor query classifier for data lineage by @kostoww in #1239
  • vdk-impala: improve explanation in readme by @tozka in #1248
  • vdk-impala: stop using errors.get_exception_message by @tozka in #1224
  • vdk-impala: update documentation with link by @tozka in #1237
  • vdk-ingest-http: Adopt simplejson in place of json by @doks5 in #1229
  • vdk-ingest-http: Move data conversion above size calc by @doks5 in #1245
  • vdk-ingest-http: fix default value for backoff factor, add retry test by @dakodakov in #1218
  • vdk-plugins: fix broken link by @tozka in #1204
  • vdk-plugins: introduced vdk-audit plugin by @mivanov1988 in #1221
  • vdk-plugins: run tests on release of vdk-core by @tozka in #1210
  • vdk-plugins: set dind tempalte job for default build of plugins by @tozka in #1225
  • versatile-data-kit: required approving reviewers update by @ivakoleva in #1220
  • versatile-data-kit: update contributing.md by @tozka in #1214

Full Changelog: v0.7...v0.8

v0.7

28 Sep 13:49
05c45d7

Summary

Major features include:

VDK Template running state detection capability

Since template executions are autonomous data job runs, we need to be able to determine at any time whether a template is running, for example to distinguish between root data job finalization and template finalization.

For example, if we want to send telemetry somewhere:

    @hookimpl
    def finalize_job(self, context: JobContext) -> None:
        template = context.core_context.state.get(ExecutionStateStoreKeys.TEMPLATE_NAME)
        if template:
            telemetry.send(phase="finalize_template", template_name=template)
        else:
            telemetry.send(phase="finalize_job", job_name=context.name)

New Logging configuration LOG_LEVEL_MODULE

Enables users to temporarily override the log level per module, for example to increase the verbosity of a certain module while debugging or prototyping.

For example assuming default log level is INFO we can enable verbose logs for 2 modules "vdk.api" and "custom.module":

export LOG_LEVEL_MODULE="vdk.api=DEBUG;custom.module=DEBUG" 
vdk run job-name 

Or in specific job config.ini:

[vdk]
log_level_module=vdk.api=DEBUG;custom.module=DEBUG
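
The value format (`module=LEVEL` pairs separated by `;`) maps naturally onto the standard `logging` module. The following is a minimal sketch of how such a value could be parsed and applied. The function names are hypothetical and this is not the vdk-core implementation.

```python
import logging


def parse_log_level_module(raw: str) -> dict:
    """Parse 'a.b=DEBUG;c.d=INFO' into {'a.b': 'DEBUG', 'c.d': 'INFO'}."""
    levels = {}
    for entry in raw.split(";"):
        if not entry.strip():
            continue  # tolerate empty segments such as a trailing ';'
        module, _, level = entry.partition("=")
        levels[module.strip()] = level.strip().upper()
    return levels


def apply_log_levels(raw: str) -> None:
    """Set the level of each named logger; other loggers keep their level."""
    for module, level in parse_log_level_module(raw).items():
        logging.getLogger(module).setLevel(level)
```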

New plugin backend for Properties: from local file system

A simple plugin that lets a developer or presenter quickly store properties on the local file system.

It can be used to store secrets/configuration for a dev/demo session, that does not require a prerequisite of the entire Control Service installed and running.
It can be used to test a job run locally only without updating the state of the deployed job.

Example:

export PROPERTIES_DEFAULT_TYPE="fs-properties-client"

or in specific job config.ini

[vdk]
properties_default_type=fs-properties-client

Now properties are stored in a local file. The file location can be further configured using FS_PROPERTIES_FILENAME and FS_PROPERTIES_DIRECTORY
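
Conceptually, a file-backed properties store is little more than reading and writing a dictionary on disk. The sketch below illustrates the idea only. The class name, the JSON format, and the default file name are hypothetical and are not taken from the plugin.

```python
import json
import tempfile
from pathlib import Path


class FileProperties:
    """Keep all properties of a job in a single JSON file."""

    def __init__(self, directory, filename="properties.json"):
        # Hypothetical default file name, chosen for this sketch only.
        self._path = Path(directory) / filename

    def get_all(self) -> dict:
        if not self._path.exists():
            return {}
        return json.loads(self._path.read_text())

    def set_all(self, properties: dict) -> None:
        self._path.write_text(json.dumps(properties))


def demo() -> dict:
    """Write properties with one instance and read them back with another."""
    with tempfile.TemporaryDirectory() as directory:
        FileProperties(directory).set_all({"api_url": "http://localhost"})
        return FileProperties(directory).get_all()
```

Because the state lives in a local file, it survives between local runs without touching the properties of a deployed job.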

Cookiecutter for new plugins

Creating a new plugin skeleton is now very easy. Run

cookiecutter https://github.com/tozka/cookiecutter-vdk-plugin.git

and follow the instructions.

Add the ability to cancel remaining job steps

Now a job (or a template) can be canceled from any step, and all remaining steps in the job (or template) will be skipped.
For example, if a data job processes data from a source that reports no new entries since the last run, the remaining steps can be skipped.

Example:

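from vdk.api.job_input import IJobInput


def run(job_input: IJobInput):
    data = get_last_delta()  # placeholder for your own logic that fetches new entries
    if not data:
        # Nothing new since the last run: skip all remaining steps.
        job_input.skip_remaining_steps()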
Package versions

See installation instructions here.
The versions of VDK components released under VDK 0.7 are:

Main components

control-service 1.5.622899758

vdk-control-cli==1.3.626767210
vdk-core==0.3.652866366

Plugins

vdk-properties-fs==0.0.651770458
vdk-kerberos-auth==0.3.631374202
vdk-impala==0.4.651849986

What's Changed

  • vdk-control-cli: Drop requirement pluggy to be 0.* by @gageorgiev in #1116
  • vdk-core: Add log before query result fetch by @doks5 in #1195
  • vdk-core: Fix issue with serializing Decimal values during payload check by @gageorgiev in #946
  • vdk-core: add ability to cancel remaining job steps by @mrMoZ1 in #1188
  • vdk-core: add new configuration log_level_module by @tozka in #1167
  • vdk-core: added default values to write termination message method by @mivanov1988 in #1185
  • vdk-core: avoid circular references in print results by @tozka in #1176
  • vdk-core: extend classification error test by @tozka in #1180
  • vdk-core: fix error classification of vdk code by @tozka in #1173
  • vdk-core: fix flakey test in test checking logs output by @murphp15 in #1194
  • vdk-core: template running state detection capability by @ivakoleva in #941
  • vdk-csv: Updates on vdk-csv README by @duyguHsnHsn in #952
  • vdk-impala: Add validation for queries that doesn't provide lineage info by @kostoww in #1175
  • vdk-impala: fix error classification in impala by @tozka in #1178
  • vdk-impala: fix impala template empty source view usr err by @mrMoZ1 in #1189
  • vdk-impala: fixed platform error missclasified when running template by @mrMoZ1 in #944
  • vdk-impala: improve vdk-impala documentation by @tozka in #948
  • vdk-kerberos-auth: Pinned minikerberos in vdk-kerberos-auth plugin by @mivanov1988 in #1168
  • vdk-kerberos-auth: add KerberosClient for authenticating API calls by @tozka in #879
  • vdk-plugins: improve plugin project creation with cookiecutter by @tozka in #942
  • vdk-properties-fs: new plugin for local FS properties storage by @ivakoleva in #1190
  • vep: Jupyter Notebook Integration Goals and Requirements by @duyguHsnHsn in #1165
  • vep: Jupyter Notebook Integration by @duyguHsnHsn in #1113
  • versatile-data-kit: Without and with VDK image by @zverulacis in #1184
  • versatile-data-kit: set automatic java formatter by @tozka in #757
  • versatile-data-kit: simplify release process by @tozka in #951
  • versatile-data-kit: update contact instructions by @tozka in #1172

Full Changelog: v0.6...v0.7

Versatile Data Kit 0.6

23 Aug 13:03
7d3da40

Summary

Major features include:

Configuration auto-wiring improvement: detect non vdk_ prefixed environment variables

Previously, a configuration option had to be prefixed with "vdk_" when set as an environment variable in order to be recognized.
This was very error-prone, since the options are documented without the prefix.

Now they can be set without the prefix as well.

The following are equivalent:

export VDK_DB_DEFAULT_TYPE='impala'
export DB_DEFAULT_TYPE='impala'

If both are set, the "prefixed" variable has a higher priority.
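
The lookup rule can be summarized in a few lines. This is a sketch of the described priority only, not the vdk-core code, and `resolve_option` is a hypothetical helper.

```python
def resolve_option(name: str, env):
    """Look up a config option in env, preferring the VDK_-prefixed variable."""
    key = name.upper()
    prefixed = env.get(f"VDK_{key}")
    if prefixed is not None:
        return prefixed  # the prefixed variable has the higher priority
    return env.get(key)
```

For example, with both `VDK_DB_DEFAULT_TYPE='impala'` and `DB_DEFAULT_TYPE='trino'` set, `resolve_option("db_default_type", os.environ)` would return `'impala'`.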

New plugin/library: vdk-lineage-model

VDK Lineage Model plugin aims to abstract emitting lineage data from VDK data jobs, so that different lineage loggers can be configured at run time in any plugin that supports emitting lineage data

Check out more at the plugin page.

New export-csv command

Alongside vdk ingest-csv, which lets users import (or ingest) CSV data into a table, users can now export the result of a SQL query to CSV with a single command:

vdk export-csv -q "select * from my_table" --file 'output.csv'

Check out more at the plugin page.

In memory properties client

Until now, properties required the Control Service in order to work. For prototyping and testing, you often do not need to connect to external services.

A new configuration value can now be set, either in a specific job's config file (config.ini):

[vdk]
properties_default_type = memory

or as an environment variable:

export PROPERTIES_DEFAULT_TYPE="memory"

Properties are then kept entirely in memory, which means they are "deleted" after the job's run.

New example: Ingest and anonymize

An example of how to anonymize any data being ingested using VDK with a plugin.

Check out more at the example page

New example: Airflow integration

An example of how to create dependencies between data jobs in Airflow.

Check out more at the example page

Package versions

See installation instructions here.
The versions of VDK components released under VDK 0.6 are:

Main components

control-service 1.5.620438292
vdk-core==0.3.620677184

Plugins

airflow-provider-vdk==0.0.602273476
vdk-lineage-model==0.0.581430542
vdk-kerberos-auth==0.3.584577337
vdk-ingest-http==0.2.616713987
vdk-impala==0.4.613570906
vdk-lineage==0.3.604201902
vdk-trino==0.4.605101952

What's Changed

  • airflow-provider-vdk: Add hidden fields to VDK Connection by @doks5 in #883
  • control-service: Atomic job cancellation by @gageorgiev in #860
  • control-service: Fluentd integration for data jobs by @mivanov1988 in #940
  • control-service: Secure job builder image by @gageorgiev in #936
  • control-service: add default jwt jwk uri by @mrMoZ1 in #873
  • control-service: fix the examples in swagger by @tozka in #945
  • control-service: fix vdk-server startup issues by @mrMoZ1 in #908
  • control-service: increase integration test builder memory by @mrMoZ1 in #929
  • control-service: upgrade docker container used in cicd by @mrMoZ1 in #911
  • vdk-airflow: populate readme by @tozka in #924
  • vdk-control-cli: remove hidden flag for CLI commands by @tozka in #902
  • vdk-control-cli: use latest dependencies version during build by @tozka in #903
  • vdk-core,vdk-impala,vdk-lineage,vdk-trino: Support for pluggy 1.0 by @gageorgiev in #931
  • vdk-core: Add printed output to set-default and reset-default by @gageorgiev in #884
  • vdk-core: BaseVdkError exception propagation flaw fix by @ivakoleva in #917
  • vdk-core: Improve ingestion error logging by @gageorgiev in #930
  • vdk-core: add memory properties client by @tozka in #921
  • vdk-core: add option to disable version check by @tozka in #876
  • vdk-core: detect non vdk_ prefixed environment values for config by @tozka in #874
  • vdk-core: execution result missing exception and blamee fix by @ivakoleva in #938
  • vdk-core: hide native cursor from execute hook by @tozka in #875
  • vdk-core: make db_default_type case insensitive by @tozka in #935
  • vdk-core: show log_level_vdk in help by @tozka in #905
  • vdk-core: step loading failure misclassified as Platform error fix by @ivakoleva in #920
  • vdk-core: termination message now idempotent by @mrMoZ1 in #909
  • vdk-core: vdk_exception hook exit code fix by @ivakoleva in #912
  • vdk-core: vdk_exception hook exit code fix by @ivakoleva in #915
  • vdk-csv: add export-csv command by @duyguHsnHsn in #934
  • vdk-examples: add ingest and anonymize example by @tozka in #922
  • vdk-impala, vdk-trino: Remove deprecated use of result field by @gageorgiev in #933
  • vdk-impala: Add performance logs by @VladimirPetkov1 in #939
  • vdk-impala: Add support for lineage in vdk-impala by @VladimirPetkov1 in #932
  • vdk-ingest-http: reduce verbosity of ingestion logs by @tozka in #943
  • vdk-kerberos-auth: Separate async event loop by @doks5 in #885
  • vdk-lineage-model: Extract Lineage Model in separate plugin by @VladimirPetkov1 in #896
  • vdk-server: Pin kubernetes API version by @doks5 in #919
  • vdk-server: fix for vdk server crashing on startup by @mrMoZ1 in #907
  • vdk-trino, vdk-linage: Switch to vdk-lineage-model by @VladimirPetkov1 in #898
  • vdk-trino: fix broken tests by @tozka in #900
  • versatile-data-kit: Add Data lifecycle image and minor changes by @zverulacis in #887
  • versatile-data-kit: Add getting started, ask for help, PR checklist by @zverulacis in #881
  • versatile-data-kit: Add intro part to contributing.md from the template by @zverulacis in #880
  • versatile-data-kit: Airflow Documentation by @gageorgiev in #857
  • versatile-data-kit: add link to csv example doc by @tozka in #893
  • versatile-data-kit: add logo image by @tozka in #877
  • versatile-data-kit: make easier slack instructions by @tozka in #925
  • versatile-data-kit: update link in examples by @tozka in #892
  • versatile-data-kit: update logo for dark mode by @tozka in #878

Full Changelog: v0.5...v0.6

Versatile Data Kit 0.5

22 Jun 08:48
4c1a580

Summary

Major features include:

New managed db_connection_execute_operation hook

The hook enables users to add behavior to existing SQL queries without modifying the code itself.
It is invoked for each query, before and after execution, which makes it possible to track the full execution. For example:

@hookimpl(hookwrapper=True)
def db_connection_execute_operation(execution_cursor: ExecutionCursor) -> Optional[int]:
    start = time.time()
    outcome = yield  # yield so that the query is executed
    end = time.time()
    log.info(f"duration: {end - start}")

Airflow Provider VDK release (beta)

Users can integrate with Apache Airflow to orchestrate Data Jobs in a DAG (workflow).
Check out more at airflow-provider-vdk

What's Changed

  • airflow-provider-vdk: Adopt auth plugin by @doks5 in #856
  • airflow-provider-vdk: Example DAG by @gageorgiev in #847
  • airflow-provider-vdk: Fix VDKSensor templating issue, improve example DAG by @gageorgiev in #852
  • control-service: clear execution fail alert when failing with user error by @mrMoZ1 in #850
  • control-service: fix graphql team filter not retrieving special chars by @mrMoZ1 in #863
  • control-service: improve api message on oom job execution errors by @mrMoZ1 in #861
  • documentation improvements by @zverulacis in #853
  • vdk-control-cli: Adopt new auth exceptions by @doks5 in #846
  • vdk-core: Add unit test for destination_table in empty queue by @doks5 in #865
  • vdk-core: Fix destination_table referenced early by @doks5 in #864
  • vdk-core: Split execution summary into chunks by @doks5 in #867
  • vdk-core: add new managed db_connection_execute_operation hook by @tozka in #805
  • vdk-core: fix buggy (false positive) connection unit test by @tozka in #841
  • vdk-control-api-auth: New VDK Auth exceptions by @doks5 in #845
  • vdk-heartbeat: pipelines-control-service-integration-tests image rebuild by @ivakoleva in #848
  • vdk-plugins: Add Managed Database Connection cycle plugin by @tozka in #859
  • vdk-test-utils: enable back tests by @tozka in #855

Full Changelog: v0.4...v0.5

Versatile Data Kit 0.4

25 May 05:51
f0d06c6

Summary

Major features include:

Standalone Data Job run

Until now, the only way to run a data job was with the CLI command "vdk run". Now users can also run a job entirely programmatically.

For example:

with StandaloneDataJobFactory.create(
    data_job_directory=Path(__file__), extra_plugins=[hook_tracker]
) as job_input:
    print(job_input.get_name())

Check out more in the new API documentation here

New Plugin: vdk-control-api-auth

A new library plugin, not a runnable plugin, that is intended to be used as a dependency for other plugins, which need to authenticate users against the Control Service.

Check more in the plugin documentation

What's Changed

  • Scenario 3 - Created the Energy Scenario by @alod83 in #781
  • [vdk-plugins] vdk-control-api-auth: Add api-token flow by @doks5 in #822
  • [vdk-plugins] vdk-control-api-auth: Add authorization code flow by @doks5 in #834
  • [vdk-plugins] vdk-control-api-auth: Enable plugin release by @doks5 in #837
  • [vdk-plugins] vdk-control-api-auth: Fix query key type by @doks5 in #838
  • airflow-provider-vdk: Fix CICD release step by @gageorgiev in #824
  • airflow-provider-vdk: VDKOperator execute method by @gageorgiev in #823
  • airflow-provider-vdk: VDKOperator initial structure by @gageorgiev in #820
  • airflow-provider-vdk: VDKSensor poke method by @gageorgiev in #818
  • control-service: Allow for jobs with no schedule to be deployed by @gageorgiev in #835
  • control-service: Kerberos authentication IT by @mrMoZ1 in #798
  • control-service: cicd unit tests should run on pull requests by @mrMoZ1 in #830
  • control-service: kerberos authentication IT test by @mrMoZ1 in #831
  • vdk-control-api-auth: Add core auth logic by @doks5 in #815
  • vdk-control-cli: Adopt new vdk-control-api-auth library by @doks5 in #840
  • vdk-control-cli: Fix command printed on successful deploy by @gageorgiev in #839
  • vdk-control-cli: Make schedule_cron config param optional by @gageorgiev in #827
  • vdk-core: New feature: StandaloneDataJob by @mrdavidlaing in #793
  • vdk-core: encapsulate router-specific properties logic by @ivakoleva in #817
  • vdk-core: new version check built-in plugin false positive fix by @ivakoleva in #816
  • vdk-core: properties write pre-processing support by @ivakoleva in #819
  • vdk-heartbeat: null datetime conversion fix by @ivakoleva in #813
  • versatile-data-kit: allow commit with any newer version of python by @tozka in #826
  • versatile-data-kit: link examples wiki in the git examples by @tozka in #812
  • versatile-data-kit: update readme with clear slack instructions by @tozka in #806

New Contributors

Full Changelog: v0.3...v0.4

Versatile Data Kit 0.3

20 Apr 08:40
c5be8a0

Summary

Major features include:

Support for Kerberos Authentication provider in the Control Service

Alongside support for OAuth2, organizations can now integrate with their Kerberos infrastructure.
Users can specify Kerberos as an authentication provider for accessing VDK Control Service.

For more information on how to configure Kerberos see VDK helm documentation here

A new plugin: vdk-lineage (alpha)

The VDK Lineage plugin collects lineage information (input data -> job -> output data) for any SQL query executed using VDK, regardless of the database, and sends it to a pre-configured destination using the OpenLineage standard.

We have also introduced a utility command, vdk marquez-server --start, which starts the Marquez UI locally so that the lineage can be visualized.
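To make the "input data -> job -> output data" idea concrete, here is a minimal sketch of a lineage record loosely following the general shape of an OpenLineage run event. It is not the vdk-lineage plugin's code. The function name, namespace, table names and producer URL are hypothetical, and the event is simplified (for instance, it omits the schema URL a full event would carry).

```python
import json
import uuid
from datetime import datetime, timezone


def build_lineage_event(job_name, input_tables, output_tables,
                        namespace="example-namespace"):
    """Build a simplified lineage record: inputs -> job -> outputs.

    All names here are illustrative assumptions, not taken from vdk-lineage.
    """
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        # Datasets the query read from.
        "inputs": [{"namespace": namespace, "name": t} for t in input_tables],
        # Datasets the query wrote to.
        "outputs": [{"namespace": namespace, "name": t} for t in output_tables],
        # Hypothetical identifier of the tool that produced the event.
        "producer": "https://example.invalid/hypothetical-producer",
    }


event = build_lineage_event(
    "example-job", ["staging.orders"], ["warehouse.orders"]
)
print(json.dumps(event, indent=2))
```

A record like this would be emitted once per executed query, so a lineage backend can stitch the input and output datasets into a graph.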

For more information check out vdk-lineage plugin documentation

Support for Kubernetes 1.23

Now VDK Control Service can work seamlessly with the newest versions of Kubernetes and make use of its features:

  • VDK Control Service can now work with CronJob controller V2 (alongside V1).
  • With the TTL Controller, any jobs launched by VDK Control Service can be cleaned up after a preconfigured time.
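As a rough illustration of the two points above, the sketch below builds a batch/v1 CronJob manifest whose Jobs carry ttlSecondsAfterFinished, the field the Kubernetes TTL controller uses to clean up finished Jobs. This is not how the Control Service builds its manifests. The function name, job name, schedule, image and TTL value are all hypothetical.

```python
import json


def build_cronjob_manifest(name, schedule, image, ttl_seconds):
    """Return a batch/v1 CronJob manifest with a TTL on its finished Jobs.

    Illustrative only; every argument value used below is an assumption.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            "jobTemplate": {
                "spec": {
                    # The TTL controller deletes the Job this many seconds
                    # after it finishes.
                    "ttlSecondsAfterFinished": ttl_seconds,
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            "containers": [{"name": name, "image": image}],
                        }
                    },
                }
            },
        },
    }


manifest = build_cronjob_manifest(
    "example-data-job", "0 * * * *", "example.invalid/data-job:latest", 3600
)
print(json.dumps(manifest, indent=2))
```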

Users can override the VDK version of deployed data jobs

Users can now specify the VDK version through either the API or the CLI when deploying a Data Job.
For example, with the CLI it's as simple as vdk deploy --update --vdk-version old-vdk-version

This enables canary or rolling deployments of VDK.
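A canary rollout could be scripted along these lines. The sketch only builds and prints the commands and does not run them. The --update and --vdk-version flags come from the example above. The -n (job name) and -t (team) options, the helper name, and the job and team names are assumptions added for illustration.

```python
import shlex


def build_deploy_command(job_name, team, vdk_version):
    """Build the argument list for pinning one job to a given VDK version.

    -n and -t are assumed options for selecting the job and its team.
    """
    return [
        "vdk", "deploy", "--update",
        "-n", job_name,
        "-t", team,
        "--vdk-version", vdk_version,
    ]


# Hypothetical subset of jobs that should receive the version first.
canary_jobs = ["example-job-a", "example-job-b"]

commands = [
    shlex.join(build_deploy_command(job, "example-team", "old-vdk-version"))
    for job in canary_jobs
]
for command in commands:
    print(command)
```

Once the canary jobs run cleanly, the same loop can be repeated over the remaining jobs to complete the rollout.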

Introducing VEP (VDK Enhancement Proposal) process and first VEP

Versatile Data Kit has a process in place for proposing and adding large changes in an efficient and consistent manner.

For more information check the process here.

We have also used the process for our first major feature change: Apache Airflow Integration

Package versions

See installation instructions here.
The versions of VDK components released under VDK 0.3 are:

Main components

control-service 1.5.520417292

vdk-control-cli==1.3.520417292
vdk-core==0.2.520417292
vdk-heartbeat==0.6.520417292

Plugins

vdk-trino==0.3.520417292
vdk-lineage==0.2.520417292
vdk-kerberos-auth==0.3.520417292
vdk-impala==0.3.520417292

What's Changed

  • VEP-554: Apache Airflow Integration by @mivanov1988 in #748 and @doks5 in #786
  • airflow-provider-vdk: Initial Airflow provider structure by @gageorgiev in #772
  • airflow-provider-vdk: Job execution status and logs method by @gageorgiev in #796
  • airflow-provider-vdk: Start and cancel job execution methods by @gageorgiev in #778
  • airflow-provider-vdk: VDKSensor initial structure by @gageorgiev in #800
  • control-service: Adopt kubernetes-client 14.0.1 by @gageorgiev in #761
  • control-service: add kerberos auth properties to helm chart by @mrMoZ1 in #764
  • control-service: Adopt use of the V1CronJob API by @gageorgiev in #767
  • control-service: Bump pipelines-control-service version by @doks5 in #762
  • control-service: Set TTLAfterFinished period for K8s CronJobs by @gageorgiev in #776
  • control-service: Update CHANGELOG.md by @doks5 in #760
  • control-service: add OAuth2 enable/disable flag by @mrMoZ1 in #765
  • control-service: add kerberos auth provider by @mrMoZ1 in #755
  • control-service: builder job configurable security context by @mivanov1988 in #708
  • control-service: configurable builder job service account by @mivanov1988 in #791
  • control-service: fix builder security context by @mivanov1988 in #784
  • control-service: fix concatAddresses NPE by @mivanov1988 in #782
  • control-service: fix job builder unit tests by @mivanov1988 in #792
  • control-service: fix log link to set endTime always by @tozka in #735
  • vdk-control-cli: Adopt click version 8 by @ivakoleva in #770
  • vdk-control-cli: set vdk version and enabled when deploying new job by @tozka in #752
  • vdk-core: JobInput get_name and get_job_directory implementation by @ivakoleva in #745
  • vdk-core: Verify payload after pre-processing it by @YanaZhivkova in #777
  • vdk-core: clarify run descriptions on --arguments option by @tozka in #731
  • vdk-core: ensure sql args are subsituted in correct priority by @tozka in #749
  • vdk-core: lowercase env variables are inferred as configuration by @tozka in #751
  • vdk-core: minor refactoring in managed_cursor to reduce long method by @tozka in #803
  • vdk-core: print query duration by @mrMoZ1 in #804
  • vdk-core: refactor test to use job_path method by @tozka in #747
  • vdk-core: update plugin hook diagrams by @tozka in #775
  • vdk-core: Adopt click version 8.0 by @doks5 in #769
  • vdk-heartbeat: Fix initial job executions with specific vdk version by @YanaZhivkova in #758
  • vdk-heartbeat: Handle execution end_time not string by @doks5 in #750
  • vdk-impala: unify names of templates betwen trino and impala by @tozka in #787
  • vdk-kerberos-auth: support kerberos auth for all CLI commands by @tozka in #774
  • vdk-kerberos-auth: upgrade minikerberos and requests-kerberos to latest by @ivakoleva in #742
  • vdk-lineage: introducing POC (pre-alpha) implementation by @tozka in #783
  • vdk-plugins: Introduce vdk-control-api-auth plugin by @doks5 in #801
  • vdk-snowflake: Enable support for Python 3.10 by @gageorgiev in #746
  • vdk-trino: add link to template examples by @tozka in #788
  • vdk-trino: collect lineage for select/insert and rename table only by @philip-alexiev in #756
  • vdk-trino: fix ingesting value with bool type failing by @tozka in #753
  • vdk: add VDK enhancement proposal (VEP) spec template by @tozka in #727
  • versatile-data-kit: Update CONTRIBUTING.md with links to coding standard by @tozka in #794

New Contributors

Full Changelog: 0.2...v0.3