Skip to content

Transform natural language questions into production-ready SQL with ontological context and warm, human-centered design.

License

Notifications You must be signed in to change notification settings

ravishan16/converSQL

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

37 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

New Documentation

We have introduced new documentation to help you understand the modular architecture and migration path.

Please refer to the following documents:

converSQL logo

converSQL

CI Code Coverage License: MIT Built with Streamlit

Transform natural language questions into production-ready SQL with ontological context and warm, human-centered design.

Why converSQL

The challenge

  • Business teams wait on backlogs of custom SQL while analysts juggle endless report tweaks.
  • Complex domains like mortgage analytics demand institutional knowledge that traditional BI tools cannot encode.
  • Open data is abundant, but combining it with AI safely and accurately remains tedious.

Our approach

  • Ontology-first modeling captures relationships, risk logic, and business vocabulary once and reuses it everywhere.
  • Adapter-based AI orchestration lets you swap Claude, Bedrock, Gemini, or local engines without touching the UI.
  • Streamlit experience design bridges analysts and executives with curated prompts, cached schemas, and explainable results.

Flagship implementation: Single Family Loan Analytics

The reference app ships with 9M+ rows of Fannie Mae loan performance data. Ask the AI for โ€œhigh-risk California loans under 620 credit scoreโ€ and get DuckDB-ready SQL plus rich metrics at a glance.

Spotlight features

  • ๐Ÿง  110+ fields grouped into 15 ontology domains with risk heuristics baked into prompts.
  • โšก CSV โžœ Parquet pipeline with enforced types, 10ร— compression, and predicate pushdown via DuckDB.
  • ๐Ÿ” Google OAuth guardrails with optional Cloudflare D1 logging.
  • ๐Ÿค– Multi-provider AI adapters (Bedrock, Claude, Gemini) with graceful fallbacks and prompt caching.
SELECT LOAN_ID, STATE, CSCORE_B, OLTV, DTI, DLQ_STATUS, CURRENT_UPB
FROM data
WHERE STATE = 'CA'
  AND CSCORE_B < 620
  AND CSCORE_B IS NOT NULL
ORDER BY CSCORE_B ASC, OLTV DESC
LIMIT 20;

Architecture at a glance

Streamlit UI (app.py)
 โ””โ”€ Core orchestration (src/core.py)
     โ”œโ”€ DuckDB execution
     โ”œโ”€ Cached schema + ontology context
     โ””โ”€ Data sync checks (scripts/sync_data.py)
 โ””โ”€ AI service (src/ai_service.py)
     โ”œโ”€ Adapter registry (src/ai_engines/*)
     โ”œโ”€ Prompt construction with risk framework
     โ””โ”€ Clean SQL post-processing

๐Ÿ—๏ธ Architecture

converSQL follows a clean, layered architecture designed for extensibility:

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                     Application Layer                        โ”‚
โ”‚  (Streamlit UI โ€ข Query Builder โ€ข Ontology Explorer)         โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                         โ”‚
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                      AI Engine Layer                         โ”‚
โ”‚  (Adapter Pattern: Bedrock โ€ข Claude โ€ข Gemini โ€ข Ollama)      โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                         โ”‚
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                   Intelligence Layer                         โ”‚
โ”‚  (Ontology โ€ข Schema Context โ€ข Business Rules)               โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                         โ”‚
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                      Data Layer                              โ”‚
โ”‚  (Parquet Files โ€ข DuckDB โ€ข R2 Storage โ€ข Query Execution)    โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

The Data Engineering Pipeline

Our showcase implementation demonstrates a complete data engineering workflow:

  1. Ingestion: Fannie Mae's pipe-separated loan performance files
  2. Transformation: Schema enforcement with explicit data types (VARCHAR, Float, Int16, etc.)
  3. Storage: Parquet format with SNAPPY compression (10x size reduction)
  4. Performance: DuckDB for blazing-fast analytical queries
  5. Ontology: Structured metadata linking business concepts to database schema

๐Ÿ“„ Learn more about the data pipeline โ†’

Brand palette

Token Hex Description
--color-background #FAF6F0 Ivory linen canvas across the app
--color-background-alt #FDFDFD Porcelain surfaces for cards and modals
--color-text-primary #3A3A3A Charcoal Plum headings
--color-text-secondary #7C6F64 Warm Taupe body copy
--color-accent-primary #DDBEA9 Soft Clay primary accent
--color-accent-primary-darker #B45F4D Terracotta hover and emphasis
--color-border-light #E4C590 Gold Sand borders, dividers, and tags

Quick start

  1. Install prerequisites
    git clone https://github.com/ravishan16/converSQL.git
    cd converSQL
    pip install -r requirements.txt
  2. Configure environment
    cp .env.example .env
    # Enable one AI block (CLAUDE_API_KEY, AWS_* for Bedrock, or GEMINI_API_KEY)
    # Provide Google OAuth or set ENABLE_AUTH=false for local dev
  3. Launch the app
    streamlit run app.py

Key documentation

  • Architecture โ€“ layered design and component interactions.
  • Data pipeline โ€“ ingest, transformation, and Parquet strategy.
  • AI engines โ€“ adapter contracts and extension guides.
  • Environment setup โ€“ required variables for auth, data, and providers.

Developer workflow

  • make setup โ€“ clean install + cache purge.
  • make test-unit / make test โ€“ pytest with coverage that mirrors CI.
  • make format and make lint โ€“ Black (120 cols), isort, flake8, mypy.
  • Cached helpers such as scan_parquet_files() trigger scripts/sync_data.py when Parquet is missingโ€”keep data/processed/ warm during tests.

Contributing

  1. Fork and branch: git checkout -b feature/my-update.
  2. Run formatting + tests before committing.
  3. Open a PR describing the change, provider credentials (if applicable), and test strategy.

See CONTRIBUTING.md for templates, AI adapter expectations, and review checklists.

Broader use cases

  • Financial services โ€“ credit risk, portfolio concentrations, regulatory stress tests.
  • Healthcare โ€“ patient outcomes, clinical trial cohorts, claims analytics.
  • E-commerce โ€“ customer segments, inventory velocity, supply chain exceptions.
  • Any ontology-driven domain โ€“ define your schema metadata and let converSQL converse.

Roadmap snapshot

  • โœ… Multi-AI adapter support with prompt caching and fallbacks.
  • โœ… Mortgage analytics reference implementation.
  • ๐Ÿ”„ Ollama adapter and enhanced SQL validation.
  • ๐Ÿ”ฎ Upcoming: multi-table joins, query explanations, historical learning, self-serve ontology editor.

License

Released under the MIT License.

Acknowledgments

  • Fannie Mae for the Single Family Loan Performance dataset.
  • The DuckDB, Streamlit, and Anthropic/AWS/Google teams for exceptional tooling.
  • The converSQL community for ideas, issues, and adapters.

Stay connected

  • โญ Star the repo to follow releases.
  • ๐Ÿ’ฌ Join discussions or open issues at github.com/ravishan16/converSQL/issues.
  • ๐Ÿ“จ Share what you buildโ€”data should feel conversational.

About

Transform natural language questions into production-ready SQL with ontological context and warm, human-centered design.

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •