DSG - DataHub Schema Generator

Overview

DSG (DataHub Schema Generator) is a command-line tool that uses AI to generate and manage DataHub dataset schemas. It leverages OpenAI's language models to create dataset schemas based on user descriptions, then posts them directly to DataHub.

Features

  • AI-assisted dataset generation
  • Support for Azure OpenAI and OpenAI APIs
  • Local history management of generated datasets
  • Direct integration with DataHub REST API
  • View, manage, and deploy past dataset generations
  • Add new global glossary terms to DataHub

Installation

Prerequisites

  • Go 1.24 or higher
  • DataHub instance with API access
  • OpenAI API key or Azure OpenAI access

Building from source

git clone https://github.com/rubiojr/dsg
cd dsg
go build

Alternatively, install the latest version from the main branch directly:

go install github.com/rubiojr/dsg@main

Usage

Environment Setup

You can set the following environment variables or pass them as command-line flags:

# DataHub configuration
export DATAHUB_GMS_URL=http://localhost:8080
export DATAHUB_GMS_TOKEN="your-datahub-token"

# OpenAI configuration
export OPENAI_API_KEY="your-openai-api-key"
export OPENAI_MODEL="gpt-4o"  # or another model

# For Azure OpenAI
export OPENAI_USE_AZURE=true
export OPENAI_API_BASE="https://your-azure-openai-endpoint"
export AZURE_OPENAI_DEPLOYMENT="deployment-name"
export AZURE_OPENAI_API_VERSION="2024-08-01-preview"
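
Before running dsg, it can help to confirm that the variables above are actually set. The following is a minimal sketch, not part of dsg: `dsg_missing_env` is a hypothetical helper, and the choice of which variables count as required is an assumption (dsg may apply defaults or accept the equivalent flags instead).

```shell
# Sketch: list dsg-related environment variables that are unset or empty.
# Assumption: the variables checked here are all required; dsg itself may
# apply defaults or take the values from command-line flags.
# Prints one missing name per line and returns 1 if anything is missing.
dsg_missing_env() {
    required="DATAHUB_GMS_URL DATAHUB_GMS_TOKEN OPENAI_API_KEY"
    # Azure settings are only checked when OPENAI_USE_AZURE is "true".
    if [ "${OPENAI_USE_AZURE:-}" = "true" ]; then
        required="$required OPENAI_API_BASE AZURE_OPENAI_DEPLOYMENT AZURE_OPENAI_API_VERSION"
    fi
    missing=0
    for name in $required; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `dsg_missing_env || echo "set the variables listed above first"` prints the missing names followed by the reminder.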

Basic Commands

Add Glossary Terms

dsg add-term --name <term> --definition <definition> # URN is auto-generated

Generate a Dataset Schema

dsg generate

This opens an interactive prompt where you describe the dataset you want to create. After writing your description, press Ctrl+D to submit. The AI generates a schema, which is then posted to DataHub automatically.
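
For repeatable descriptions, the prompt text can be assembled by a small script. The sketch below is illustrative only: `make_prompt` is a hypothetical helper, not a dsg feature, and it simply prints a description in the same shape as the customer example later in this README.

```shell
# Sketch: print a dataset description in the style used in this README.
# make_prompt is a hypothetical helper, not part of dsg.
# Usage: make_prompt DATASET_NAME FIELD [FIELD...]
make_prompt() {
    name=${1:-}
    shift
    printf 'Create a dataset named "%s" with fields for:\n' "$name"
    # Emit one "- field" bullet per remaining argument.
    for field in "$@"; do
        printf '%s\n' "- $field"
    done
}
```

If dsg also accepts its description on piped standard input (an untested assumption, suggested only by the Ctrl+D instruction), something like `make_prompt customer_profiles customer_id email | dsg generate` may work. Otherwise, paste the output into the interactive prompt.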

To reuse the prompt from a previous generation:

dsg generate --prompt-from <ID> # see history command

View Generation History

dsg history

Shows a list of previously generated schemas.

View Details of a Specific Generation

dsg show 1  # Show details for history ID 1

Post an Existing Schema to DataHub

dsg post 1  # Post schema with history ID 1 to DataHub
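
Since posting changes DataHub, it may be worth reviewing an entry before deploying it. The wrapper below is a sketch, assuming history IDs consist only of digits. `review_and_post` is a hypothetical name, and it calls nothing beyond the `dsg show` and `dsg post` commands documented here.

```shell
# Sketch: show a history entry, ask for confirmation, then post it.
# review_and_post is a hypothetical wrapper around `dsg show` and `dsg post`.
# Assumption: history IDs consist only of digits.
review_and_post() {
    id=${1:-}
    # Reject empty or non-numeric IDs before calling dsg.
    case $id in
        ''|*[!0-9]*) echo "invalid history ID: $id" >&2; return 2 ;;
    esac
    dsg show "$id" || return 1
    printf 'Post entry %s to DataHub? [y/N] ' "$id"
    # Treat end-of-input as "no".
    read -r answer || answer=""
    case $answer in
        y|Y) dsg post "$id" ;;
        *) echo "Skipped."; return 0 ;;
    esac
}
```

Calling `review_and_post 1` prints the details for history ID 1 and posts only after a `y` answer.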

Delete a History Entry

dsg delete 1  # Delete history entry with ID 1

Clear All History

dsg clear

Examples

Generating a Customer Dataset

$ dsg generate
Creating a DataHub dataset using NLP...

Writing temp prompt file to /tmp/XXXXXprompt...
Write the input for AI, hit Ctrl-D when finished:

Create a dataset named "customer_profiles" with fields for:
- customer_id (unique identifier) Tags: unique
- first_name Glossary Term: DSR.PrivateData
- last_name
- email Glossary Term: DSR.PrivateData
- signup_date
- last_purchase_date
- loyalty_tier (bronze, silver, gold, platinum)
- total_spend (numerical value)
- preferred_payment_method
^D

Understood!
Processing input and generating the dataset (may take a while)...
🤖 finished!
1 datasets created! ☑

The description is sent to DataHub, which creates the following dataset:

[Image: dataset]

License

MIT
