SQL Data Dictionary Generator with AI

A comprehensive tool for generating data dictionaries and running database profiling scripts, with an intuitive Streamlit UI. Now with AI-powered natural language queries and automated documentation!

Features

Core Features

  • Multi-Database Support: PostgreSQL, MySQL, SQLite, SQL Server, MongoDB, Neo4j
  • Comprehensive Data Dictionary Generation:
    • Table and column information
    • Data types and constraints
    • Primary and foreign keys
    • Indexes and relationships
    • Row counts and statistics
  • Data Profiling:
    • NULL value analysis
    • Duplicate detection
    • Data completeness scoring
    • Column statistics (min, max, avg)
    • Value distribution analysis
  • Custom SQL Query Execution: Run custom profiling queries
  • Multiple Export Formats: JSON and Markdown
  • Interactive Streamlit UI: User-friendly web interface

AI-Powered Features (New!)

  • Natural Language to SQL: Ask questions in plain English, get SQL queries
    • Uses local Ollama/Llama3.2 (on-premises, privacy-first)
    • Automatic schema context injection
    • Confidence scoring for generated queries
    • Auto-execution with safety limits
  • AI-Enhanced Documentation:
    • Automated table and column explanations
    • Relationship documentation
    • Database-wide summaries
    • Human-readable purpose and usage notes
  • On-Premises AI: No cloud dependencies, all processing happens locally

Testing

E2E Tests for AI Features

Comprehensive test suite for AI documentation generation:

  • 33 Mocked tests: Fast validation without external dependencies (~1 second)
  • 17 Real Ollama E2E tests: Full integration testing with actual Ollama service (~87 seconds)
  • 100% passing: All tests validated
# Run mocked tests (fast)
pytest tests/test_ai_documentation.py -v

# Run real Ollama E2E tests
pytest tests/test_ai_documentation_e2e.py -v

# Run all tests
pytest tests/ -v

# Health check on Azure VM
./test_deployment_health.sh

# Run E2E tests on Azure VM
./run_e2e_tests_azure.sh

See E2E_TESTS_SUMMARY.md and AZURE_E2E_TESTING.md for details.

Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager
  • Ollama (for AI features - optional but recommended)

Setup

  1. Clone or download this repository:
cd sql2doc
  2. Install dependencies:
pip install -r requirements.txt
  3. (Optional) Set up Ollama for AI features:
# Install Ollama from https://ollama.ai
# Download and install for your platform

# Pull the Llama 3.2 model
ollama pull llama3.2

# Start Ollama service (if not running)
ollama serve

Note: AI features will gracefully degrade if Ollama is not available. The core functionality works without AI.
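
If you want to confirm the AI layer is reachable before relying on it, the check can be scripted with the modules documented later in this README. A minimal sketch; the sqlite:///example.db connection string is a placeholder:

from src.database_connector import DatabaseConnector
from src.nl_query_generator import NaturalLanguageQueryGenerator

connector = DatabaseConnector()
engine = connector.connect("sqlite:///example.db")  # placeholder connection string

# is_available() reports whether the local Ollama service can be reached.
nl_gen = NaturalLanguageQueryGenerator(engine, model="llama3.2")
if nl_gen.is_available():
    print("Ollama is reachable; AI features are enabled.")
else:
    print("Ollama is not running; core (non-AI) features remain available.")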

Usage

Running the Application

Start the Streamlit application:

streamlit run app.py

The application will open in your default web browser at http://localhost:8501.

Connecting to a Database

  1. Use the sidebar to configure your database connection
  2. Select database type (PostgreSQL, MySQL, or SQLite)
  3. Enter connection details:
    • PostgreSQL: postgresql://username:password@host:port/database
    • MySQL: mysql+pymysql://username:password@host:port/database
    • SQLite: sqlite:///path/to/database.db
  4. Click "Connect"
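
Connections can also be established programmatically with DatabaseConnector. A minimal sketch that builds the connection string from environment variables; the DB_USER, DB_PASSWORD, DB_HOST, DB_PORT, and DB_NAME variable names are illustrative, not required by the tool:

import os

from src.database_connector import DatabaseConnector

# Keep credentials out of source control by reading them from the environment.
conn_str = (
    f"postgresql://{os.environ['DB_USER']}:{os.environ['DB_PASSWORD']}"
    f"@{os.environ.get('DB_HOST', 'localhost')}:{os.environ.get('DB_PORT', '5432')}"
    f"/{os.environ['DB_NAME']}"
)

connector = DatabaseConnector()
engine = connector.connect(conn_str)
print(connector.get_database_type())  # e.g. "postgresql"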

Generating a Data Dictionary

  1. Navigate to the "Data Dictionary" tab
  2. Choose whether to include row counts (slower for large databases)
  3. Click "Generate Dictionary"
  4. Browse through tables and view detailed information
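
The same dictionary can be generated outside the UI with DictionaryBuilder (documented below). A minimal sketch, assuming engine comes from an earlier DatabaseConnector.connect(...) call; "orders" is a placeholder table name:

from src.dictionary_builder import DictionaryBuilder

builder = DictionaryBuilder(engine)  # engine from DatabaseConnector.connect(...)

# Full dictionary for every table; row counts make this slower on large databases.
dictionary = builder.build_full_dictionary()

# Or restrict to a single table while experimenting.
orders_dictionary = builder.build_table_dictionary("orders")

# Export for sharing or version control.
builder.export_to_markdown(dictionary, "data_dictionary.md")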

Running Data Profiling

  1. Go to the "Table Profiling" tab
  2. Select a table from the dropdown
  3. Click "Run Profiling"
  4. View comprehensive data quality metrics:
    • Row counts
    • NULL value analysis
    • Completeness scores
    • Column-level statistics
    • Value distributions
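
The same checks are available programmatically through DataProfiler (documented below). A minimal sketch; "orders" and "status" are placeholder table and column names:

from src.profiling_scripts import DataProfiler

profiler = DataProfiler(engine)  # engine from DatabaseConnector.connect(...)

profile = profiler.profile_table("orders")                # full table profile
nulls = profiler.check_null_values("orders")              # NULL value analysis
duplicates = profiler.check_duplicates("orders")          # duplicate detection
completeness = profiler.calculate_completeness("orders")  # completeness score
distribution = profiler.get_value_distribution("orders", "status")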

Running Custom Queries

  1. Navigate to the "Custom Query" tab
  2. Enter your SQL query
  3. Click "Execute Query"
  4. View and download results
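
Custom queries can also be run through DataProfiler.run_custom_query (documented below). A minimal sketch with a placeholder query:

from src.profiling_scripts import DataProfiler

profiler = DataProfiler(engine)
results = profiler.run_custom_query("SELECT COUNT(*) AS row_count FROM orders")
print(results)  # inspect the results before downloading or reporting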

Exporting Data

  1. Go to the "Export" tab
  2. Choose export format:
    • JSON: Complete machine-readable format
    • Markdown: Human-readable documentation format
  3. Download the generated file

Using AI Features

Natural Language Queries

  1. Navigate to the "AI Query (NL)" tab
  2. Ensure Ollama is running (check status indicator)
  3. Ask a question in plain English:
    • "Show me the top 10 customers by order value"
    • "What are the largest tables in the database?"
    • "Find all foreign key relationships"
  4. View generated SQL with confidence score
  5. Execute automatically or review first
  6. Download results as CSV
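
The same workflow is exposed by NaturalLanguageQueryGenerator (documented below). A minimal sketch; keys other than "sql" in the result, such as a confidence field, depend on the implementation:

from src.nl_query_generator import NaturalLanguageQueryGenerator

nl_gen = NaturalLanguageQueryGenerator(engine, model="llama3.2")

if nl_gen.is_available():
    result = nl_gen.ask("Show me the top 10 customers by order value")
    print(result["sql"])  # review the generated SQL before trusting the results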

AI-Enhanced Documentation

  1. Generate a data dictionary first (Data Dictionary tab)
  2. Navigate to the "AI Documentation" tab
  3. Click "Enhance Dictionary with AI"
  4. View AI-generated:
    • Table descriptions and purposes
    • Relationship explanations
    • Usage notes
  5. Export enhanced documentation
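
Programmatically, the enhancement step combines DictionaryBuilder and SchemaExplainer (both documented below). A minimal sketch; whether the Markdown exporter accepts the AI-enhanced structure is an assumption:

from src.dictionary_builder import DictionaryBuilder
from src.schema_explainer import SchemaExplainer

builder = DictionaryBuilder(engine)
dictionary = builder.build_full_dictionary()

explainer = SchemaExplainer(engine, model="llama3.2")
if explainer.is_available():  # requires a running Ollama service
    enhanced = explainer.enhance_dictionary(dictionary)
    summary = explainer.generate_database_summary(dictionary)
    print(summary)
    builder.export_to_markdown(enhanced, "enhanced_dictionary.md")  # assumed to accept the enhanced dict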

Project Structure

sql2doc/
├── app.py                      # Streamlit UI application
├── requirements.txt            # Python dependencies
├── pytest.ini                  # Pytest configuration
├── .gitignore                  # Git ignore rules
├── test_deployment_health.sh   # Deployment health check (Azure VM)
├── run_e2e_tests_azure.sh      # E2E test runner (Azure VM)
├── src/
│   ├── __init__.py
│   ├── database_connector.py  # Database connection management
│   ├── schema_fetcher.py      # Schema information retrieval
│   ├── dictionary_builder.py  # Data dictionary generation
│   ├── profiling_scripts.py   # Data profiling functionality
│   ├── nl_query_generator.py  # Natural language to SQL (AI)
│   └── schema_explainer.py    # AI-powered documentation
└── tests/
    ├── __init__.py
    ├── test_database_connector.py
    ├── test_schema_fetcher.py
    ├── test_profiling_scripts.py
    ├── test_ai_documentation.py      # Mocked AI documentation tests
    └── test_ai_documentation_e2e.py  # Real Ollama E2E tests

Module Documentation

DatabaseConnector

Manages database connections for multiple SQL database types.

from src.database_connector import DatabaseConnector

connector = DatabaseConnector()
engine = connector.connect("postgresql://user:pass@localhost:5432/mydb")

Key Methods:

  • connect(connection_string): Establish database connection
  • disconnect(): Close database connection
  • is_connected(): Check connection status
  • get_database_type(): Get database dialect name

SchemaFetcher

Retrieves schema information from SQL databases.

from src.schema_fetcher import SchemaFetcher

fetcher = SchemaFetcher(engine)
tables = fetcher.get_all_tables()
columns = fetcher.get_table_columns('table_name')

Key Methods:

  • get_all_tables(): List all tables
  • get_table_columns(table_name): Get column details
  • get_primary_keys(table_name): Get primary key columns
  • get_foreign_keys(table_name): Get foreign key relationships
  • get_indexes(table_name): Get index information
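
Beyond tables and columns, the same fetcher exposes keys and indexes. A short sketch; "orders" is a placeholder table name:

from src.schema_fetcher import SchemaFetcher

fetcher = SchemaFetcher(engine)
primary_keys = fetcher.get_primary_keys("orders")
foreign_keys = fetcher.get_foreign_keys("orders")
indexes = fetcher.get_indexes("orders")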

DictionaryBuilder

Compiles comprehensive data dictionaries from schema information.

from src.dictionary_builder import DictionaryBuilder

builder = DictionaryBuilder(engine)
dictionary = builder.build_full_dictionary()
builder.export_to_json(dictionary, 'output.json')

Key Methods:

  • build_full_dictionary(): Generate complete data dictionary
  • build_table_dictionary(table_name): Generate dictionary for specific table
  • export_to_json(dictionary, file_path): Export to JSON
  • export_to_markdown(dictionary, file_path): Export to Markdown

DataProfiler

Executes data profiling scripts for quality assessment.

from src.profiling_scripts import DataProfiler

profiler = DataProfiler(engine)
profile = profiler.profile_table('table_name')

Key Methods:

  • profile_table(table_name): Complete table profiling
  • profile_column(table_name, column_name): Column-level profiling
  • check_null_values(table_name): NULL value analysis
  • check_duplicates(table_name): Duplicate detection
  • calculate_completeness(table_name): Completeness score
  • get_value_distribution(table_name, column_name): Value distribution
  • run_custom_query(query): Execute custom SQL query

NaturalLanguageQueryGenerator (AI)

Converts natural language questions to SQL queries using local LLM.

from src.nl_query_generator import NaturalLanguageQueryGenerator

nl_gen = NaturalLanguageQueryGenerator(engine, model="llama3.2")
result = nl_gen.ask("Show me the top 10 customers")
print(result['sql'])

Key Methods:

  • generate_sql(question): Generate SQL from natural language
  • execute_query(sql, limit): Execute generated SQL safely
  • ask(question, execute): Complete workflow (generate + execute)
  • is_available(): Check if Ollama is running
  • get_database_schema(): Get schema context for LLM

SchemaExplainer (AI)

Generates AI-powered documentation and explanations for database schemas.

from src.schema_fetcher import SchemaFetcher
from src.schema_explainer import SchemaExplainer

explainer = SchemaExplainer(engine, model="llama3.2")
columns = SchemaFetcher(engine).get_table_columns("users")  # column details for the table
explanation = explainer.explain_table("users", columns)
enhanced = explainer.enhance_dictionary(dictionary)         # dictionary from DictionaryBuilder

Key Methods:

  • explain_table(table_name, columns): Generate table explanation
  • explain_column(table, column, type): Explain specific column
  • generate_relationship_explanation(table, fks): Explain relationships
  • enhance_dictionary(dictionary): Add AI docs to full dictionary
  • generate_database_summary(dictionary): Create database overview
  • is_available(): Check if Ollama is running

Running Tests

Execute the test suite:

pytest

Run with coverage:

pytest --cov=src --cov-report=html

Run specific test file:

pytest tests/test_database_connector.py

Example Workflows

Workflow 1: Generate Complete Documentation

  1. Connect to your database
  2. Generate data dictionary with row counts
  3. Export to Markdown for documentation
  4. Share with team or include in project docs

Workflow 2: Data Quality Assessment

  1. Connect to database
  2. Select table for profiling
  3. Review NULL value analysis
  4. Check for duplicates
  5. Examine value distributions for key columns
  6. Export results for reporting

Workflow 3: Custom Analysis

  1. Connect to database
  2. Navigate to Custom Query tab
  3. Write custom profiling SQL, for example (replace your_table and your_column with real names):
    SELECT
        COUNT(*) AS total_rows,
        COUNT(DISTINCT your_column) AS unique_values,
        COUNT(*) - COUNT(your_column) AS null_count
    FROM your_table;
  4. Execute and download results

Security Considerations

  • On-Premises First: Designed for on-prem deployments with local data
  • No Cloud Dependencies: All processing happens locally
  • Connection Security: Supports encrypted database connections
  • Credential Management: Use environment variables for sensitive credentials
  • Query Safety: Custom queries are intended for read-only analysis; avoid statements that modify data

Best Practices

  1. Start Small: Test with a small database first
  2. Row Counts: Disable row counts for very large databases
  3. Regular Profiling: Run profiling periodically to track data quality
  4. Export Documentation: Keep data dictionaries up-to-date in version control
  5. Custom Queries: Use profiling queries to identify data issues early

Troubleshooting

Connection Issues

  • Verify database is running and accessible
  • Check connection string format
  • Ensure proper network access and firewall rules
  • Verify database user permissions

Performance Issues

  • Disable row counts for large databases
  • Profile tables individually rather than all at once
  • Consider database indexes for profiling queries
  • Use query limits for value distributions

Test Failures

  • Ensure all dependencies are installed
  • Check Python version compatibility
  • Verify pytest configuration

Contributing

This project follows best practices for full-stack development with a focus on:

  • Local-first architecture
  • On-premises deployment
  • Security and privacy
  • Code maintainability
  • Comprehensive testing

License

MIT License - See LICENSE file for details

Support

For issues, questions, or contributions, please refer to the project repository.

Acknowledgments

Built with:

  • SQLAlchemy for database abstraction
  • Streamlit for interactive UI
  • Pandas for data manipulation
  • Pytest for testing
  • Ollama for local LLM inference
  • Llama 3.2 for AI-powered features
  • Vanna AI framework (concepts integrated)

Note: This tool is designed for authorized database access and profiling. Always ensure you have proper permissions before connecting to production databases.
