OpenLineage parser for Meltano

This is a simple CLI for parsing meltano logs, subject to a lot of breaking changes and improvements.

The goal

Meltano loads data from sources to destination, but there is currently little integration with lineage. It should, however, be possible to output some open-lineage type metadata. This cli tool parses key files in a Meltano project, and creates openlineage-compliant records. It can optionally post these records to a Marquez API endpoint.

What it currently does

Currently, the tool finds the following information:

the name of the tap and target
the name of the stream
The start and end time for the run
The schema (of the source at least)
run metrics (number or rows etc) if the tap/target is SDK-based
Run result (success/error)
Tap/Target config such as host

What might be possible

Perhaps the biggest improvement of this tool would be to automatically generate URIs of the source/target locations. Such URIs typically look something like snowflake://xy12345.north-europe.azure/db-name/schema-name/table-name or sftp://ftp.example.com/data-folder/file1.csv. It is not possible to reliably parse data like that from the config, so the practical solution is probably to encourage some standardization of log messages so that the tap/target can emit these URIs themselves, to be picked up by the log parser.

It might also be possible to provide exact column name mapping. This would have to be a two-fold process:

Mappers can create new columns, and we would need to parse these in order to calculate new columns
targets can change the names of columns, typically in order to comply with conventions or requirements in the target system

The log parser should be able to parse mappings, but in order to catch column renamings in the target, we would require logging standards.

Improvements to meltano and the SDK

It is obvious that a good lineage parser requires good logging and good standardization. Some ideas so far:

Meltano could generate even clearer messages for when a run starts and when it ends
We need to agree on some standard (optional) log fields that can be parsed for lineage purposes
Perhaps dedicated "lineage log lines" could be useful

How to use the CLI

When standing inside a meltano project, run meltano-lineage to parse the logs and generate summary output. By default this output will be printed to stdout, but it can optionally be written directly to file by passing a filename to the --output or -o option, or it can publish lineage directly to a Marquez endpoint by using the --publish flag.

The script can be run without arguments (just meltano-lineage), but it accepts the following arguments:

--environment: Which meltano environment to use. Defaults to the project default.
--oufile: path to metadata file to be written. Default openlineage.json
--publish: flag to post the records to a Marquez endpoint, specified by the --url option.
--url: A Marquez endpoint to post records to. Default http://localhost:5000/api/v1/lineage. Not used without the --publish flag.

Requirements

In order for the parser to work, you need a Meltano project with a logging.yaml file that specifies logging to file with JSON formatting and at least INFO level logging. Because the tool parses logs, meltano run must have been invoked first.

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
cli		cli
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cli

cli

.gitignore

.gitignore

.pre-commit-config.yaml

.pre-commit-config.yaml

README.md

README.md

pyproject.toml

pyproject.toml

Repository files navigation

OpenLineage parser for Meltano

The goal

What it currently does

What might be possible

Improvements to meltano and the SDK

How to use the CLI

Requirements

About

Releases

Packages

Languages

radbrt/meltano-meta

Folders and files

Latest commit

History

Repository files navigation

OpenLineage parser for Meltano

The goal

What it currently does

What might be possible

Improvements to meltano and the SDK

How to use the CLI

Requirements

About

Resources

Stars

Watchers

Forks

Languages