DSG (DataHub Schema Generator) is a command-line tool that generates and manages DataHub dataset schemas with AI. It uses OpenAI language models, through either the OpenAI or the Azure OpenAI API, to turn a plain-language description into a dataset schema, then posts the result directly to DataHub.
- AI-assisted dataset generation
- Support for Azure OpenAI and OpenAI APIs
- Local history management of generated datasets
- Direct integration with DataHub REST API
- View, manage, and deploy past dataset generations
- Add new global glossary terms to DataHub
- Go 1.24 or higher
- DataHub instance with API access
- OpenAI API key or Azure OpenAI access
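The Go version requirement can be checked before building. The following is a minimal POSIX shell sketch; the function name `go_version_ok` is hypothetical and not part of dsg, and it only compares the major and minor version numbers.

```shell
#!/bin/sh
# Hypothetical helper (not part of dsg): succeed if a Go version string
# such as "go1.24.1" is at least 1.24.
go_version_ok() {
    ver=${1#go}              # strip the leading "go"      -> "1.24.1"
    major=${ver%%.*}         # text before the first dot   -> "1"
    rest=${ver#*.}           # text after the first dot    -> "24.1"
    minor=${rest%%[!0-9]*}   # leading digits of the minor -> "24"
    [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "${minor:-0}" -ge 24 ]; }
}
```

With Go installed, it could be used as `go_version_ok "$(go version | awk '{print $3}')" || echo "Go 1.24 or higher is required" >&2`.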
git clone https://github.com/rubiojr/dsg
cd dsg
go build
Or install the latest version directly:
go install github.com/rubiojr/dsg@main
You can set the following environment variables or pass them as command-line flags:
# DataHub configuration
export DATAHUB_GMS_URL=http://localhost:8080
export DATAHUB_GMS_TOKEN="your-datahub-token"
# OpenAI configuration
export OPENAI_API_KEY="your-openai-api-key"
export OPENAI_MODEL="gpt-4o" # or another model
# For Azure OpenAI
export OPENAI_USE_AZURE=true
export OPENAI_API_BASE="https://your-azure-openai-endpoint"
export AZURE_OPENAI_DEPLOYMENT="deployment-name"
export AZURE_OPENAI_API_VERSION="2024-08-01-preview"
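Because a missing variable only shows up once dsg tries to reach DataHub or the model API, it can help to check the environment first. The sketch below is an assumption-laden example, not part of dsg: `missing_vars` and `dsg_required_vars` are hypothetical helpers, and which variables are strictly required is a guess based on the list above (for example, a local DataHub instance may not need a token).

```shell
#!/bin/sh
# Hypothetical helper: print the name of every listed variable that is
# unset or empty. Returns non-zero if at least one is missing.
missing_vars() {
    status=0
    for name in "$@"; do
        eval "value=\${$name:-}"   # indirect lookup of the variable named in $name
        if [ -z "$value" ]; then
            echo "$name"
            status=1
        fi
    done
    return $status
}

# Hypothetical helper: list the variables assumed to be required.
# The Azure settings are only included when OPENAI_USE_AZURE is "true".
dsg_required_vars() {
    echo "DATAHUB_GMS_URL DATAHUB_GMS_TOKEN OPENAI_API_KEY"
    if [ "${OPENAI_USE_AZURE:-}" = "true" ]; then
        echo "OPENAI_API_BASE AZURE_OPENAI_DEPLOYMENT AZURE_OPENAI_API_VERSION"
    fi
}
```

Running `missing_vars $(dsg_required_vars)` prints one variable name per line for anything that still needs to be exported, and prints nothing when the environment is complete.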
dsg add-term --name <term> --definition <definition> # URN is auto-generated
dsg generate
This opens an interactive prompt where you describe the dataset you want to create. After writing your description, press Ctrl+D to submit. The AI generates a schema, which is then posted to DataHub automatically.
Generate using a previously used prompt:
dsg generate --prompt-from <ID> # see history command
dsg history
Shows a list of previously generated schemas.
dsg show 1 # Show details for history ID 1
dsg post 1 # Post schema with history ID 1 to DataHub
dsg delete 1 # Delete history entry with ID 1
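The per-entry commands above take a single history ID. To post several entries in one go, a small wrapper can loop over them. This is a sketch only: `post_entries` and the `DSG` override variable are hypothetical and not part of dsg; the only dsg command used is `dsg post <ID>` as shown above.

```shell
#!/bin/sh
# Hypothetical helper: post each given history ID to DataHub, stopping at
# the first failure. Set DSG to use a different binary, or a stub such as
# "echo" for a dry run.
post_entries() {
    for id in "$@"; do
        "${DSG:-dsg}" post "$id" || {
            echo "posting history entry $id failed" >&2
            return 1
        }
    done
}
```

For example, `post_entries 1 3 4` posts history entries 1, 3, and 4 in order, and `DSG=echo post_entries 1 3 4` only prints the commands that would run.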
dsg clear
$ dsg generate
Creating a DataHub dataset using NLP...
Writing temp prompt file to /tmp/XXXXXprompt...
Write the input for AI, hit Ctrl-D when finished:
Create a dataset named "customer_profiles" with fields for:
- customer_id (unique identifier) Tags: unique
- first_name Glossary Term: DSR.PrivateData
- last_name
- email Glossary Term: DSR.PrivateData
- signup_date
- last_purchase_date
- loyalty_tier (bronze, silver, gold, platinum)
- total_spend (numerical value)
- preferred_payment_method
^D
Understood!
Processing input and generating the dataset (may take a while)...
🤖 finished!
1 datasets created! ☑
This is sent to DataHub and the following dataset is created: