From 8fe0d621725de60a1f334f64656e0bcd6b72e52f Mon Sep 17 00:00:00 2001 From: Andrey Petrov Date: Wed, 20 Oct 2010 16:01:53 -0700 Subject: [PATCH] Adding idea regarding a schemaless sql structure framework. --- idea/arbitrarily-structured-data-in-rdbms.md | 75 ++++++++++++++++++++ 1 file changed, 75 insertions(+) create mode 100644 idea/arbitrarily-structured-data-in-rdbms.md diff --git a/idea/arbitrarily-structured-data-in-rdbms.md b/idea/arbitrarily-structured-data-in-rdbms.md new file mode 100644 index 0000000..b3d6f44 --- /dev/null +++ b/idea/arbitrarily-structured-data-in-rdbms.md @@ -0,0 +1,75 @@ +# Arbitrarily-Structured Data in Relational Databases + +This approach is similar to [FriendFeed's schemaless database framework](http://bret.appspot.com/entry/how-friendfeed-uses-mysql). The key difference is in the data locality. + +## Hypothesis + +In an evolving relational (SQL) database schema, we store two types of data: Data we will be querying against and data we will be displaying. There is often a subset of display data which will not be used for querying in the foreseeable future, and this is the data whose structure changes most often. + +## Solution + +Store query data and display data separately such that display data is less strictly-structured and thus more easily evolved. + +Imagine a standardized table structure where each table has the following columns: ``id``, ``time_created``, ``time_updated``, ``_data``, and additional "index columns". + +The ``_data`` column contains a dictionary of arbitrary data serialized into JSON (or could be zlib-compressed Pickle if it were Python-specific). Index columns are columns which you query against. + +### Example + +A typical *user* table might have the following columns (using an SQLAlchemy declarative model): + + class User(Model): + + id = Column(types.Integer, primary_key=True) + time_created = Column(types.DateTime, default=datetime.now, nullable=False) + time_updated = Column(types.DateTime, onupdate=datetime.now) + + is_admin = Column(types.Boolean, default=False, nullable=False) + + email = Column(types.String(255), nullable=False, index=True, unique=True) + display_name = Column(types.String(64)) + + password_hash = Column(types.String(40), nullable=False) + password_salt = Column(types.String(8), nullable=False) + +(We could replace the primary_key ``id`` with the ``email`` column, but this is not important.) + +In our example, this table will have two types of queries: + + -- Load the user object from the current session (where we store the user_id) + SELECT * FROM user WHERE id = :user_id; + + -- Check the given password against the email address, for login + SELECT password_hash, password_salt FROM user WHERE email = :user_email; + +In the schemaless model, the table would look like this: + + class User(SchemalessModel): + + id = Column(types.Integer, primary_key=True) + time_created = Column(types.DateTime, default=datetime.now, nullable=False) + time_updated = Column(types.DateTime, onupdate=datetime.now) + _data = Column(types.JSON) + + email = Column(types.String(255), nullable=False, index=True, unique=True) + +Where the ``_data`` column would contain data like this: + + { + 'display_name': 'Andrey Petrov', + 'is_admin': 1, + 'password_hash': 'cSKSsy315E4EroxeDQrsxjTb6ijBxxbK', + 'password_salt': 'vS5Otm', + } + +And perhaps we would build a framework on top of SQLAlchemy which would let us access columns as ``user.display_name`` or ``user.email`` regardless whether it's an extracted indexed property or a buried _data element. + + +### Process + +1. Build a table with just a free-structure ``_data`` field. +2. Determine queries, extract relevant properties into indexed columns: + 1. ALTER TABLE to add the column + 2. Run full-scan query to populate new column with data + 3. Add relevant index onto said column + 4. Deprecate property from ``_data`` (optional, we could just assume that proper columns always supercede ``_data`` attributes)