A comprehensive tool for generating data dictionaries and running database profiling scripts, with an intuitive Streamlit UI. Now with AI-powered natural language queries and automated documentation!
- Multi-Database Support: PostgreSQL, MySQL, SQLite, SQL Server, MongoDB, Neo4j
- Comprehensive Data Dictionary Generation:
- Table and column information
- Data types and constraints
- Primary and foreign keys
- Indexes and relationships
- Row counts and statistics
- Data Profiling:
- NULL value analysis
- Duplicate detection
- Data completeness scoring
- Column statistics (min, max, avg)
- Value distribution analysis
- Custom SQL Query Execution: Run custom profiling queries
- Multiple Export Formats: JSON and Markdown
- Interactive Streamlit UI: User-friendly web interface
- Natural Language to SQL: Ask questions in plain English, get SQL queries
- Uses local Ollama/Llama3.2 (on-premises, privacy-first)
- Automatic schema context injection
- Confidence scoring for generated queries
- Auto-execution with safety limits
- AI-Enhanced Documentation:
- Automated table and column explanations
- Relationship documentation
- Database-wide summaries
- Human-readable purpose and usage notes
- On-Premises AI: No cloud dependencies, all processing happens locally
Comprehensive test suite for AI documentation generation:
- 33 Mocked tests: Fast validation without external dependencies (~1 second)
- 17 Real Ollama E2E tests: Full integration testing with actual Ollama service (~87 seconds)
- 100% passing: All tests validated
# Run mocked tests (fast)
pytest tests/test_ai_documentation.py -v
# Run real Ollama E2E tests
pytest tests/test_ai_documentation_e2e.py -v
# Run all tests
pytest tests/ -v
# Health check on Azure VM
./test_deployment_health.sh
# Run E2E tests on Azure VM
./run_e2e_tests_azure.sh
See E2E_TESTS_SUMMARY.md and AZURE_E2E_TESTING.md for details.
- Python 3.8 or higher
- pip package manager
- Ollama (for AI features - optional but recommended)
- Clone or download this repository:
cd sql2doc
- Install dependencies:
pip install -r requirements.txt
- (Optional) Set up Ollama for AI features:
# Install Ollama from https://ollama.ai
# Download and install for your platform
# Pull the Llama 3.2 model
ollama pull llama3.2
# Start Ollama service (if not running)
ollama serve
Note: AI features will gracefully degrade if Ollama is not available. The core functionality works without AI.
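To confirm from Python that the AI modules can reach Ollama, both AI classes expose an is_available() check; a minimal sketch, assuming engine is an already-connected SQLAlchemy engine (see DatabaseConnector below):
from src.nl_query_generator import NaturalLanguageQueryGenerator
nl_gen = NaturalLanguageQueryGenerator(engine, model="llama3.2")
if not nl_gen.is_available():  # Ollama not reachable; core features still work
    print("Ollama is not available - AI features will be skipped")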
Start the Streamlit application:
streamlit run app.py
The application will open in your default web browser at http://localhost:8501.
- Use the sidebar to configure your database connection
- Select database type (PostgreSQL, MySQL, or SQLite)
- Enter connection details:
- PostgreSQL: postgresql://username:password@host:port/database
- MySQL: mysql+pymysql://username:password@host:port/database
- SQLite: sqlite:///path/to/database.db
- Click "Connect"
- Navigate to the "Data Dictionary" tab
- Choose whether to include row counts (slower for large databases)
- Click "Generate Dictionary"
- Browse through tables and view detailed information
- Go to the "Table Profiling" tab
- Select a table from the dropdown
- Click "Run Profiling"
- View comprehensive data quality metrics:
- Row counts
- NULL value analysis
- Completeness scores
- Column-level statistics
- Value distributions
- Navigate to the "Custom Query" tab
- Enter your SQL query
- Click "Execute Query"
- View and download results
- Go to the "Export" tab
- Choose export format:
- JSON: Complete machine-readable format
- Markdown: Human-readable documentation format
- Download the generated file
- Navigate to the "AI Query (NL)" tab
- Ensure Ollama is running (check status indicator)
- Ask a question in plain English:
- "Show me the top 10 customers by order value"
- "What are the largest tables in the database?"
- "Find all foreign key relationships"
- View generated SQL with confidence score
- Execute automatically or review first
- Download results as CSV
- Generate a data dictionary first (Data Dictionary tab)
- Navigate to the "AI Documentation" tab
- Click "Enhance Dictionary with AI"
- View AI-generated:
- Table descriptions and purposes
- Relationship explanations
- Usage notes
- Export enhanced documentation
sql2doc/
├── app.py # Streamlit UI application
├── requirements.txt # Python dependencies
├── pytest.ini # Pytest configuration
├── .gitignore # Git ignore rules
├── src/
│   ├── __init__.py
│   ├── database_connector.py # Database connection management
│   ├── schema_fetcher.py # Schema information retrieval
│   ├── dictionary_builder.py # Data dictionary generation
│   ├── profiling_scripts.py # Data profiling functionality
│   ├── nl_query_generator.py # Natural language to SQL (AI)
│   └── schema_explainer.py # AI-powered documentation
└── tests/
    ├── __init__.py
    ├── test_database_connector.py
    ├── test_schema_fetcher.py
    ├── test_profiling_scripts.py
    ├── test_ai_documentation.py
    └── test_ai_documentation_e2e.py
Manages database connections for multiple SQL database types.
from src.database_connector import DatabaseConnector
connector = DatabaseConnector()
engine = connector.connect("postgresql://user:pass@localhost:5432/mydb")
Key Methods:
- connect(connection_string): Establish database connection
- disconnect(): Close database connection
- is_connected(): Check connection status
- get_database_type(): Get database dialect name
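A minimal sketch of the full connection lifecycle (the connection string is illustrative):
from src.database_connector import DatabaseConnector
connector = DatabaseConnector()
engine = connector.connect("postgresql://user:pass@localhost:5432/mydb")
if connector.is_connected():
    print(connector.get_database_type())  # dialect name, e.g. "postgresql"
connector.disconnect()  # close the connection when finished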
Retrieves schema information from SQL databases.
from src.schema_fetcher import SchemaFetcher
fetcher = SchemaFetcher(engine)
tables = fetcher.get_all_tables()
columns = fetcher.get_table_columns('table_name')
Key Methods:
- get_all_tables(): List all tables
- get_table_columns(table_name): Get column details
- get_primary_keys(table_name): Get primary key columns
- get_foreign_keys(table_name): Get foreign key relationships
- get_indexes(table_name): Get index information
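A minimal sketch that walks every table and pulls keys and indexes, assuming get_all_tables() returns table names:
from src.schema_fetcher import SchemaFetcher
fetcher = SchemaFetcher(engine)
for table in fetcher.get_all_tables():
    pks = fetcher.get_primary_keys(table)   # primary key columns
    fks = fetcher.get_foreign_keys(table)   # foreign key relationships
    idx = fetcher.get_indexes(table)        # index information
    print(table, pks, fks, idx)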
Compiles comprehensive data dictionaries from schema information.
from src.dictionary_builder import DictionaryBuilder
builder = DictionaryBuilder(engine)
dictionary = builder.build_full_dictionary()
builder.export_to_json(dictionary, 'output.json')
Key Methods:
- build_full_dictionary(): Generate complete data dictionary
- build_table_dictionary(table_name): Generate dictionary for specific table
- export_to_json(dictionary, file_path): Export to JSON
- export_to_markdown(dictionary, file_path): Export to Markdown
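A minimal sketch of the Markdown export and the single-table variant (the table name 'customers' is illustrative):
from src.dictionary_builder import DictionaryBuilder
builder = DictionaryBuilder(engine)
dictionary = builder.build_full_dictionary()
builder.export_to_markdown(dictionary, 'data_dictionary.md')  # human-readable docs
table_dict = builder.build_table_dictionary('customers')      # dictionary for one table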
Executes data profiling scripts for quality assessment.
from src.profiling_scripts import DataProfiler
profiler = DataProfiler(engine)
profile = profiler.profile_table('table_name')
Key Methods:
- profile_table(table_name): Complete table profiling
- profile_column(table_name, column_name): Column-level profiling
- check_null_values(table_name): NULL value analysis
- check_duplicates(table_name): Duplicate detection
- calculate_completeness(table_name): Completeness score
- get_value_distribution(table_name, column_name): Value distribution
- run_custom_query(query): Execute custom SQL query
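Beyond whole-table profiling, the individual checks can be called directly; a minimal sketch with illustrative table and column names:
from src.profiling_scripts import DataProfiler
profiler = DataProfiler(engine)
nulls = profiler.check_null_values('customers')                  # NULL value analysis
dupes = profiler.check_duplicates('customers')                   # duplicate detection
score = profiler.calculate_completeness('customers')             # completeness score
dist = profiler.get_value_distribution('customers', 'country')   # value distribution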
Converts natural language questions to SQL queries using local LLM.
from src.nl_query_generator import NaturalLanguageQueryGenerator
nl_gen = NaturalLanguageQueryGenerator(engine, model="llama3.2")
result = nl_gen.ask("Show me the top 10 customers")
print(result['sql'])
Key Methods:
- generate_sql(question): Generate SQL from natural language
- execute_query(sql, limit): Execute generated SQL safely
- ask(question, execute): Complete workflow (generate + execute)
- is_available(): Check if Ollama is running
- get_database_schema(): Get schema context for LLM
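A sketch of the review-then-run workflow, assuming Ollama is running locally, that ask() accepts execute=False to skip auto-execution, and an illustrative row limit:
from src.nl_query_generator import NaturalLanguageQueryGenerator
nl_gen = NaturalLanguageQueryGenerator(engine, model="llama3.2")
if nl_gen.is_available():  # only proceed if Ollama is reachable
    result = nl_gen.ask("What are the largest tables in the database?", execute=False)
    print(result['sql'])   # review the generated SQL before running it
    rows = nl_gen.execute_query(result['sql'], limit=100)  # illustrative safety limit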
Generates AI-powered documentation and explanations for database schemas.
from src.schema_explainer import SchemaExplainer
explainer = SchemaExplainer(engine, model="llama3.2")
explanation = explainer.explain_table("users", columns)
enhanced = explainer.enhance_dictionary(dictionary)
Key Methods:
- explain_table(table_name, columns): Generate table explanation
- explain_column(table, column, type): Explain specific column
- generate_relationship_explanation(table, fks): Explain relationships
- enhance_dictionary(dictionary): Add AI docs to full dictionary
- generate_database_summary(dictionary): Create database overview
- is_available(): Check if Ollama is running
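A sketch of generating a database-wide overview from an existing dictionary, assuming Ollama is reachable and that dictionary came from DictionaryBuilder.build_full_dictionary():
from src.schema_explainer import SchemaExplainer
explainer = SchemaExplainer(engine, model="llama3.2")
if explainer.is_available():
    summary = explainer.generate_database_summary(dictionary)  # database-wide overview
    print(summary)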
Execute the test suite:
pytest
Run with coverage:
pytest --cov=src --cov-report=html
Run specific test file:
pytest tests/test_database_connector.py
- Connect to your database
- Generate data dictionary with row counts
- Export to Markdown for documentation
- Share with team or include in project docs
- Connect to database
- Select table for profiling
- Review NULL value analysis
- Check for duplicates
- Examine value distributions for key columns
- Export results for reporting
- Connect to database
- Navigate to Custom Query tab
- Write custom profiling SQL:
SELECT column_name,
       COUNT(*) as total,
       COUNT(DISTINCT column_name) as unique_values,
       COUNT(*) - COUNT(column_name) as nulls
FROM information_schema.columns
GROUP BY column_name;
- Execute and download results (a programmatic equivalent is sketched after this list)
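The same query can be executed outside the UI through DataProfiler.run_custom_query (listed in the module reference above); a minimal sketch, assuming a connected engine:
from src.profiling_scripts import DataProfiler
profiler = DataProfiler(engine)
sql = """
SELECT column_name,
       COUNT(*) as total,
       COUNT(DISTINCT column_name) as unique_values,
       COUNT(*) - COUNT(column_name) as nulls
FROM information_schema.columns
GROUP BY column_name;
"""
results = profiler.run_custom_query(sql)  # returns the query results for review or export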
- On-Premises First: Designed for on-prem deployments with local data
- No Cloud Dependencies: All processing happens locally
- Connection Security: Supports encrypted database connections
- Credential Management: Use environment variables for sensitive credentials (see the sketch after this list)
- Query Safety: Custom queries are intended for read-only use; review them before running against production data
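A minimal sketch of the environment-variable approach; the variable name DB_URL is illustrative:
import os
from src.database_connector import DatabaseConnector
connection_string = os.environ["DB_URL"]  # e.g. postgresql://user:pass@host:5432/mydb
connector = DatabaseConnector()
engine = connector.connect(connection_string)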
- Start Small: Test with a small database first
- Row Counts: Disable row counts for very large databases
- Regular Profiling: Run profiling periodically to track data quality
- Export Documentation: Keep data dictionaries up-to-date in version control
- Custom Queries: Use profiling queries to identify data issues early
- Verify database is running and accessible
- Check connection string format
- Ensure proper network access and firewall rules
- Verify database user permissions
- Disable row counts for large databases
- Profile tables individually rather than all at once
- Consider database indexes for profiling queries
- Use query limits for value distributions
- Ensure all dependencies are installed
- Check Python version compatibility
- Verify pytest configuration
This project follows best practices for full-stack development with a focus on:
- Local-first architecture
- On-premises deployment
- Security and privacy
- Code maintainability
- Comprehensive testing
MIT License - See LICENSE file for details
For issues, questions, or contributions, please refer to the project repository.
Built with:
- SQLAlchemy for database abstraction
- Streamlit for interactive UI
- Pandas for data manipulation
- Pytest for testing
- Ollama for local LLM inference
- Llama 3.2 for AI-powered features
- Vanna AI framework (concepts integrated)
Note: This tool is designed for authorized database access and profiling. Always ensure you have proper permissions before connecting to production databases.