
Test data management tool for any data source, batch or real-time. Generate, validate and clean up data all in one tool.


Data Caterer - Test Data Management Tool

Overview

A test data management tool with automated data generation, validation and clean up.

Basic data flow for Data Caterer

Generate data for databases, files, messaging systems or HTTP requests via the UI, Scala/Java SDK or YAML input, with execution handled by Spark. Run data validations after generating data to ensure it is consumed correctly. Clean up generated data, or consumed data in downstream data sources, to keep your environments tidy. Define alerts to get notified when failures occur, and deep dive into issues from the generated report.

Full docs can be found here.

Scala/Java examples can be found here.

A demo of the UI can be found here.

Features

Basic flow

Quick start

Follow the quick start instructions here.

Integrations

Supported data sources

Data Caterer supports the below data sources. Check here for the full roadmap.

| Data Source Type | Data Source |
|------------------|-------------|
| Cloud Storage | AWS S3 |
| Cloud Storage | Azure Blob Storage |
| Cloud Storage | GCP Cloud Storage |
| Database | BigQuery |
| Database | Cassandra |
| Database | MySQL |
| Database | Postgres |
| Database | Elasticsearch |
| Database | MongoDB |
| File | CSV |
| File | Delta Lake |
| File | JSON |
| File | Iceberg |
| File | ORC |
| File | Parquet |
| File | Hudi |
| HTTP | REST API |
| Messaging | Kafka |
| Messaging | RabbitMQ |
| Messaging | Solace |
| Messaging | ActiveMQ |
| Messaging | Pulsar |
| Metadata | Data Contract CLI |
| Metadata | Great Expectations |
| Metadata | Marquez |
| Metadata | OpenAPI/Swagger |
| Metadata | OpenMetadata |
| Metadata | Open Data Contract Standard (ODCS) |
| Metadata | Amundsen |
| Metadata | Datahub |
| Metadata | Solace Event Portal |

Additional Details

Run Configurations

Different ways to run Data Caterer based on your use case:

Types of run configurations

Design

Design motivations and details can be found here.

Roadmap

Check here for the full list of roadmap items.

Pricing

Data Caterer uses a usage-based pricing model for the latest application version, with different tiers depending on how much you use it. Pricing also covers support and feature requests. The current open-source version will remain available for those who want to continue using it.

Find out more details here.

Mildly Quick Start

Generate and validate data

```scala
// Generate unique account IDs into Postgres, then validate the Parquet output
// produced by the consuming application against the generated data.
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")  // name and url
  .fields(field.name("account_id").regex("ACC[0-9]{10}").unique(true))

val parquetValidation = parquet("output_parquet", "/data/parquet/customer")
  .validation(
    validation.upstreamData(postgresTask)                // join against the generated Postgres data
      .joinFields("account_id")
      .withValidation(validation.count().isEqual(1000))  // expect all 1000 records to arrive
  )
  .validationWait(waitCondition.file("/data/parquet/customer"))  // wait for the file before validating
```
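As a plain-Scala sketch (no Data Caterer dependency, names here are illustrative), the guarantees requested above amount to: every generated `account_id` matches `ACC[0-9]{10}`, values are unique, and the downstream count validation then passes.

```scala
import scala.util.Random

// Hypothetical stand-in for a regex-based generator producing ACC[0-9]{10}.
def generateAccountId(rng: Random): String =
  "ACC" + (1 to 10).map(_ => rng.nextInt(10)).mkString

val rng = new Random(42)
// unique(true) modeled as a set: retry until 1000 distinct values exist.
val accountIds = scala.collection.mutable.LinkedHashSet.empty[String]
while (accountIds.size < 1000) accountIds += generateAccountId(rng)

assert(accountIds.forall(_.matches("ACC[0-9]{10}"))) // regex constraint holds
assert(accountIds.size == 1000)                      // count validation passes
```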

Generate same data across data sources

```scala
// Generate the same account_id values in Postgres and Kafka by declaring
// a foreign key relationship between the two tasks.
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .fields(field.name("account_id").regex("ACC[0-9]{10}"))

val kafkaTask = kafka("my_kafka", "localhost:29092")
  .topic("account-topic")
  .fields(...)

plan.addForeignKeyRelationship(
  postgresTask, List("account_id"),
  List(kafkaTask -> List("account_id"))
)
```
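In plain-Scala terms (a sketch independent of the SDK, names illustrative), a foreign key relationship means both sinks receive the same key values:

```scala
import scala.util.Random

// Generate one set of keys, then feed it to both "sinks".
val rng = new Random(7)
val accountIds = (1 to 100).map(_ => "ACC" + (1 to 10).map(_ => rng.nextInt(10)).mkString)

val postgresRows  = accountIds.map(id => Map("account_id" -> id)) // rows written to Postgres
val kafkaMessages = accountIds.map(id => Map("key" -> id))        // messages on account-topic

val pgKeys    = postgresRows.map(_("account_id")).toSet
val kafkaKeys = kafkaMessages.map(_("key")).toSet
assert(pgKeys == kafkaKeys) // referential consistency across both data sources
```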

Generate data and clean up

```scala
// Generate between 0 and 5 transaction records per account_id
// (count.recordsPerField(5, "account_id") would generate exactly 5 instead).
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerFieldGenerator(generator.min(0).max(5), "account_id"))

// Re-run with generation disabled to delete the generated records.
val conf = configuration
  .enableDeleteGeneratedRecords(true)
  .enableGenerateData(false)
```

Records that downstream systems ingested from the generated data can be cleaned up as well, by declaring how the downstream keys relate to the generated keys:

```scala
// Delete records in Cassandra that were derived from the generated
// Postgres data, tracked via the account_id foreign key.
val cassandraTxns = cassandra("ingested_data", "localhost:9042")
  .table("account", "transactions")

val deletePlan = plan.addForeignKeyRelationship(
  postgresTask, List("account_id"),
  List(),
  List(cassandraTxns -> List("account_id"))
)
```

If the downstream key is a transformation of the generated value, express the mapping in SQL:

```scala
// Cassandra stores account_number = SUBSTR(account_id, 3), so the delete
// plan maps generated keys through the same SQL transformation.
val deletePlanTransformed = plan.addForeignKeyRelationship(
  postgresTask, List("account_id"),
  List(),
  List(cassandraTxns -> List("SUBSTR(account_id, 3) AS account_number"))
)
```
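For reference, SQL `SUBSTR` is 1-indexed, so `SUBSTR(account_id, 3)` keeps the string from the third character onward, which is `substring(2)` in Scala. A plain-Scala sketch (sample keys are illustrative) of how the delete plan matches downstream rows back to generated data:

```scala
// SQL SUBSTR(account_id, 3) expressed in Scala.
def toAccountNumber(accountId: String): String = accountId.substring(2)

val generated    = Set("ACC0000000001", "ACC0000000002")          // generated Postgres keys
val ingestedKeys = Set("C0000000001", "C0000000002", "X999999999") // keys found in Cassandra

// Only downstream rows derived from generated data are deleted.
val toDelete = ingestedKeys.filter(k => generated.exists(toAccountNumber(_) == k))
assert(toDelete == Set("C0000000001", "C0000000002"))
```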

Generate data with schema from metadata source

```scala
// Use an Open Data Contract Standard (ODCS) file as the schema source
parquet("customer_parquet", "/data/parquet/customer")
  .fields(metadataSource.openDataContractStandard("/data/odcs/full-example.odcs.yaml"))

// Use an OpenAPI/Swagger spec as the schema source for HTTP requests
http("my_http")
  .fields(metadataSource.openApi("/data/http/petstore.json"))
```

Validate data using validations from metadata source

```scala
// Run an existing Great Expectations suite as Data Caterer validations
parquet("customer_parquet", "/data/parquet/customer")
  .validations(metadataSource.greatExpectations("/data/great-expectations/taxi-expectations.json"))
```
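Conceptually (a plain-Scala sketch, not the actual Data Caterer internals; the column name mirrors the Great Expectations taxi example), an expectation such as `expect_column_values_to_not_be_null` translates to a predicate over the rows being checked:

```scala
// Hypothetical rows with an optional passenger_count column.
val rows: Seq[Map[String, Option[Int]]] = Seq(
  Map("passenger_count" -> Some(1)),
  Map("passenger_count" -> Some(3))
)

// expect_column_values_to_not_be_null("passenger_count") as a predicate.
val notNull = rows.forall(_.get("passenger_count").flatten.isDefined)
assert(notNull)
```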
