
feat: Text to SQL support #1399

Open. Wants to merge 20 commits into base: main.
Conversation

ishaandatta (Contributor)
Support for text-to-SQL for MySQL (https://github.com/users/imartinez/projects/3?pane=issue&itemId=44266590)

This uses SQLAlchemy to initialise a client connection to the database.

A SQLDatabase instance is then initialised and passed to the llama_index object SQLTableNodeMapping, which creates one node per table schema (allowing query-time retrieval of the schema).

Finally, the llama_index SQLTableRetrieverQueryEngine is used to:
1. Convert the text to an appropriate SQL query
2. Execute the SQL query
3. Use the result as context to answer the prompt
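The three steps can be sketched with a minimal, self-contained example. Note the assumptions: this uses only the standard library with an in-memory SQLite database, and the hardcoded query in `text_to_sql` is a stand-in for the LLM-generated SQL that the PR's SQLTableRetrieverQueryEngine would produce.

```python
import sqlite3

def text_to_sql(question: str) -> str:
    # Step 1 (placeholder): a real implementation would prompt an LLM
    # with the retrieved table schemas to generate this query.
    return "SELECT name, salary FROM employee ORDER BY salary DESC LIMIT 1"

def answer(question: str, conn: sqlite3.Connection) -> str:
    sql = text_to_sql(question)          # 1. convert text to SQL
    rows = conn.execute(sql).fetchall()  # 2. execute the SQL query
    context = "; ".join(map(str, rows))  # 3. result becomes LLM context
    return f"Context for the LLM: {context}"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employee VALUES (?, ?)",
                 [("Ada", 120), ("Grace", 140)])
print(answer("Who is the highest paid employee?", conn))
# → Context for the LLM: ('Grace', 140)
```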

@lopagela lopagela left a comment


The current state of the Text to SQL support does not properly take care of:

  • Multiple SQL dialects
  • Execution without Text to SQL (i.e. vanilla privateGPT) is broken
  • The declaration of new dependencies is not justified

I think this feature must be tested across the different privateGPT modes (with, and without, DB support installed).

fern/docs.yml Outdated
Comment on lines 61 to 62
- page: SQL Databases
path: ./docs/pages/manual/nlsql.mdx
Contributor:

I don't think this should be under the Storage section; it fits better under Advanced Setup or something like that.

Comment on lines 21 to 30
## Other Databases
To get started with exploring the use of other databases, set the `sqldatabase.dialect` and `sqldatabase.driver` properties in the `settings.yaml` file.

```yaml
sqldatabase:
  dialect: <dialect_here>
  driver: <driver_here>
```

Refer to [SQLAlchemy Engine Configuration](https://docs.sqlalchemy.org/en/20/core/engines.html) for constructing the DB connection string.
Contributor:

It would be great if that section could be a bit more user friendly: for example, add links to the SQLAlchemy reference (e.g. the list of dialects, https://docs.sqlalchemy.org/en/20/dialects/) and at least one dialect-specific link (to postgres or sqlite, for example, which are very common).
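The connection string the docs point to follows SQLAlchemy's `dialect+driver://user:password@host/database` URL format; a small sketch of how the settings could compose into such a URL (the helper name and values here are hypothetical, not part of the PR):

```python
def connection_url(dialect: str, driver: str, user: str,
                   password: str, host: str, database: str) -> str:
    # Compose a SQLAlchemy-style connection URL from the individual settings.
    return f"{dialect}+{driver}://{user}:{password}@{host}/{database}"

url = connection_url("mysql", "pymysql", "app", "secret", "localhost", "hr")
print(url)  # → mysql+pymysql://app:secret@localhost/hr
```

Swapping the dialect/driver pair (e.g. `postgresql` + `psycopg2`, or `sqlite` + `pysqlite`) is what the configurable `sqldatabase.dialect` and `sqldatabase.driver` settings are meant to enable.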

Comment on lines 8 to 12
from sqlalchemy import (
    MetaData,
)
from sqlalchemy.engine import create_engine
from sqlalchemy.engine.base import Engine
Contributor:

Did you try without the local mode?

Are these dependencies shipped from privateGPT by default?

It looks to me like these additional dependencies are brought in by default, while this should be an opt-in feature, I think. That would also mean this NLSQLComponent should be conditionally injected into the app context (basically, the app should still be able to run without the SQL component).

Ideally, the app should load this component only if the configuration specifies a valid SQL configuration (via some "on/off" switch in the configuration, maybe).

Contributor Author:

@lopagela I have moved them inside the component, behind a check on `settings.context_database.enabled`.

Comment on lines 225 to 226
dialect: Literal["mysql"]
driver: Literal["pymysql"]
Contributor:

This goes against the documentation: Pydantic will fail to validate any other values, and consequently one will not be able to change the dialect and driver.
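The point about `Literal` can be illustrated with the standard library alone; the validator below is a hypothetical stand-in for the check Pydantic performs on a `Literal`-typed field, not code from the PR:

```python
from typing import Literal, get_args

# Mirrors the field annotation under review: only "mysql" is allowed.
DialectField = Literal["mysql"]

def validate_dialect(value: str) -> str:
    # Emulate Literal validation: reject anything outside the allowed values.
    allowed = get_args(DialectField)
    if value not in allowed:
        raise ValueError(f"dialect must be one of {allowed}, got {value!r}")
    return value

validate_dialect("mysql")  # passes
try:
    # Rejected, even though postgresql is a perfectly valid SQLAlchemy dialect.
    validate_dialect("postgresql")
except ValueError as e:
    print(e)
```

This is why the field likely needs to be a plain `str` (or a wider `Literal`) if other dialects and drivers are to be supported, as the documentation implies.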

Comment on lines 231 to 234
db_user: str | None = Field(
    "root",
    description="Username to be used for accessing the SQL Database Server. If user is None, set to 'root'.",
)
Contributor:

The default should be left as None, as root is not a standard default user in the SQL norm (actually, the SQL standard says nothing about a default user, but implementations usually define one, e.g. postgres for PostgreSQL).

Contributor Author:

MySQL's default admin user is called root.

Regardless, with regard to supporting other SQL databases, I'm making username and password both default to None.

Comment on lines 235 to 238
db_password: str = Field(
    "",
    description="Password to be used for accessing the SQL Database Server. If password is None, set to empty string.",
)
Contributor:

The description is misleading: the default is a blank string, which is different from None.

pyproject.toml Outdated
Comment on lines 19 to 21
gradio = "4.4.1"
pymysql = "^1.1.0"
cryptography = "^41.0.7"
Contributor:

This does not belong in the default dependencies; gradio is in the ui target.

Additionally, I think pymysql should not be added to this project's dependencies, as it is a driver specific to MySQL: users might not use MySQL, and might need other dependencies instead (psycopg2, psycopg, and many others; cf. the SQLAlchemy dialects).

settings.yaml Outdated
Comment on lines 51 to 55
db_host: ${HOSTNAME:localhost}
db_user: ${USERNAME:root}
db_password: ${PASSWORD:}
database: ${DATABASE:}
tables: ["${TABLES:}"]
Contributor:

The env vars are specific to the SQL/DB world, so I'd personally prefix them with something like SQL_ or DB_.

Additionally, TABLES is plural, while its usage shows that only a single name is supported, which is misleading.

Contributor Author:

The tables variable accepts a list of table names (you can set up multiple tables as context; SQLTableRetrieverQueryEngine will select the right table to query).

It's used in the code like:

```python
for table_name in self.metadata_obj.tables:
    table_schema_objs.append(SQLTableSchema(table_name=table_name))
```

Contributor:

@ishaandatta you are right: tables does accept a list of strings, but the env var TABLES seems to accept only one table.

Let me ask this: how would you pass several table names in the env variable TABLES?

Contributor Author:

Right, makes sense. I'm not actually sure how to define an array/list in the env file.
I figured square brackets were sufficient to convey the usage, as I tested it and it works for multiple tables.

I tested it with the following value:
tables: ["EMPLOYEE", "DEPARTMENT"]

Currently, if TABLES is not given, no default is set, as per https://docs.privategpt.dev/manual/general-configuration/configuration#environment-variables-expansion

Would this be better: db_tables: ${TABLES_LIST}?
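One common workaround for list-valued env vars is to accept a single comma-separated string and split it in code. The helper below is a hypothetical sketch of that approach, not part of the PR:

```python
import os

def tables_from_env(var: str = "TABLES", default: str = "") -> list[str]:
    # Read a comma-separated env var and split it into table names,
    # dropping surrounding whitespace and empty entries.
    raw = os.environ.get(var, default)
    return [name.strip() for name in raw.split(",") if name.strip()]

os.environ["TABLES"] = "EMPLOYEE, DEPARTMENT"
print(tables_from_env())  # → ['EMPLOYEE', 'DEPARTMENT']
```

This sidesteps the question of YAML list syntax inside `${...}` expansion entirely, since the env var stays a plain string.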

Comment on lines 6 to 15
```yaml
sqldatabase:
  dialect: mysql
  driver: pymysql
  host: <db_hostname_here>
  user: <db_username_here>
  password: <db_password_here>
  database: <database_name_here>
  tables: <list_of_table_names_here>
```
Collaborator:

This is very rigid. Why not move it to a parameter similar to context_files (like context_database(s))? Conceptually speaking, there is no big difference between talking to your documents and talking to your database(s).

I understand that would require a big refactor of this PR, but it would set a pattern for future features (like context_websites).

github-actions bot commented Jan 2, 2024

Stale pull request

@ishaandatta
Contributor Author:

@lopagela I have addressed the changes, please take another look.

@ricklettow

Thank you for your efforts!

I have hit a bump when running this on a large database schema. It appears that SQLAlchemy is running reflect() multiple times (1. during initial startup; 2. before the first query, in ObjectIndex.from_objects()). Subsequent queries perform better, but startup takes over 30 minutes and the first query takes another 30 minutes. I am limiting to a single table; however, that table has relationships to other tables that are recursively parsed.

```python
sql_database = SQLDatabase(engine, include_tables=tables)

obj_index = ObjectIndex.from_objects(
    table_schema_objs,
    table_node_mapping,
    VectorStoreIndex,
    service_context=service_context,
)
```

Both calls above trigger a reflection. Is this by design? Can the reflection results be persisted for quick loading when the schema changes infrequently? It seems to be on the llama-index side, but I was wondering if a different approach could be used.

Thank you.
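The persistence idea in the comment above can be sketched with the standard library. This is a hypothetical approach, not code from the PR or llama-index: `sqlite3` stands in for SQLAlchemy's reflection step, and a JSON file caches the discovered table names so a restart can skip the expensive introspection.

```python
import json
import sqlite3
from pathlib import Path

def reflect_table_names(conn: sqlite3.Connection) -> list[str]:
    # Stand-in for schema reflection: list all tables in the database.
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [r[0] for r in rows]

def cached_table_names(conn: sqlite3.Connection, cache: Path) -> list[str]:
    if cache.exists():
        # Cache hit: skip the expensive reflection entirely.
        return json.loads(cache.read_text())
    names = reflect_table_names(conn)
    cache.write_text(json.dumps(names))
    return names

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER)")
conn.execute("CREATE TABLE department (id INTEGER)")
cache = Path("schema_cache.json")
print(cached_table_names(conn, cache))  # → ['department', 'employee']
cache.unlink()  # clean up the cache file
```

A real implementation would need cache invalidation when the schema changes, but for schemas that change infrequently this is exactly the trade-off the comment suggests.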
