Auto-create tables and schemas #2090
Conversation
```diff
@@ -91,3 +91,19 @@ class ComponentException(BaseException):

     :param msg: msg for BaseException
     """


 class UnsupportedDatatype(BaseException):
```
Nice
```python
    """A factory for pil image # noqa."""

    @staticmethod
    def check(data: t.Any) -> bool:
```
Would it make sense (in future) to add a function `check=my_checking_function` to `DataType`?
This approach complicates `DataType`; additionally, we need to discover all possible datatypes to implement auto-schema. If we add the check to the datatype itself, we must instantiate the datatype in order to discover it, which is inelegant for types used in functions, such as arrays or torch tensors. Therefore, implementing a factory class method is preferable: only the data types that implement this factory will be discovered.
WDYT?
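The factory-style discovery described above could be sketched roughly as follows. This is an illustrative sketch only: `DataTypeFactory`, `PILImageFactory`, and `infer_datatype` are assumed names for this example, not the actual superduperdb API.

```python
import typing as t


class DataTypeFactory:
    # Subclasses register themselves, so the schema inferrer can walk
    # them without instantiating any datatype up front.
    registry: t.List[t.Type['DataTypeFactory']] = []

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        DataTypeFactory.registry.append(cls)

    @staticmethod
    def check(data: t.Any) -> bool:
        raise NotImplementedError


class PILImageFactory(DataTypeFactory):
    @staticmethod
    def check(data: t.Any) -> bool:
        # The real code would test for a PIL.Image instance; checking
        # the type name here just keeps the sketch dependency-free.
        return type(data).__name__ == 'Image'


def infer_datatype(data: t.Any) -> t.Optional[t.Type[DataTypeFactory]]:
    # Only datatypes that implement a factory are discoverable.
    for factory in DataTypeFactory.registry:
        if factory.check(data):
            return factory
    return None
```

With this shape, adding auto-schema support for a new datatype means adding one factory subclass, without touching `DataType` itself.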
superduperdb/base/datalayer.py (Outdated)
```diff
@@ -1076,6 +1087,16 @@ def infer_schema(
         """
         return self.databackend.infer_schema(data, identifier)

+    def set_cfg(self, cfg: Config):
```
Why not do this?

```python
@cfg.setter
def cfg(self, value):
    ...
```
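For context, a property-with-setter version of the suggestion might look like the minimal sketch below, on a stand-in `Datalayer` class (the real superduperdb `Datalayer` holds much more state):

```python
class Datalayer:
    def __init__(self, cfg):
        self._cfg = cfg

    @property
    def cfg(self):
        return self._cfg

    @cfg.setter
    def cfg(self, value):
        # Note: attributes derived from the original config (e.g. a
        # compute backend built at construction time) are NOT rebuilt
        # here, which is the risk of mutating cfg after construction.
        self._cfg = value
```

A setter keeps the call site as plain attribute assignment (`db.cfg = new_cfg`) instead of a `set_cfg` method, but it does not by itself solve the problem of derived attributes going stale.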
Also, do we really want to do this? That would mess things up: `self.compute` depends on the config used to build the `Datalayer`. So you set the `cfg` and it doesn't match the attributes of the `db`.
There is an underlying issue here. If we start a `datalayer` by invoking `superduperdb(**new_config)`, the value of the global config is actually incorrect at this point, because it has not been updated with `**new_config`. Therefore, when we send tasks to the compute layer, we should send the config with which we actually constructed the `datalayer`, not the one from the configuration file; otherwise there will be a mismatch between the client side and the compute side.
I believe the correct approach is to launch the `datalayer` in each compute job using the received `CFG`, rather than merely retrieving it from a configuration file.
WDYT?
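The idea above can be sketched as follows: each compute job receives the caller's config and builds its datalayer from it, instead of re-reading a configuration file. `build_datalayer`, `run_compute_job`, and the `Config` fields are assumptions for this sketch, not the real superduperdb API.

```python
import dataclasses


@dataclasses.dataclass
class Config:
    auto_schema: bool = True
    bytes_encoding: str = 'BYTES'


def build_datalayer(cfg: Config) -> dict:
    # Stand-in for constructing a real Datalayer from cfg.
    return {'cfg': cfg}


def run_compute_job(cfg: Config, task):
    # The job builds its datalayer from the CFG it received, so the
    # client side and compute side see the same configuration.
    db = build_datalayer(cfg)
    return task(db)


result = run_compute_job(
    Config(auto_schema=False),
    lambda db: db['cfg'].auto_schema,
)
# result is False: the job used the caller's config, not a file default
```

Serializing the config alongside each submitted task is the key point; any file-based default on the workers is ignored.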
```diff
@@ -262,6 +263,7 @@ class Config(BaseConfig):
     logging_type: LogType = LogType.SYSTEM

     bytes_encoding: BytesEncoding = BytesEncoding.BYTES
+    auto_schema: bool = True
```
Ok.
Description

Related Issues

Checklist

- Did `make unit_testing` and `make integration-testing` run successfully?

Additional Notes or Comments