Adding idea regarding a schemaless sql structure framework.

shazow · Oct 20, 2010 · 8fe0d62 · 8fe0d62
1 parent f1b34c4
commit 8fe0d62
Showing 1 changed file with 75 additions and 0 deletions.
diff --git a/idea/arbitrarily-structured-data-in-rdbms.md b/idea/arbitrarily-structured-data-in-rdbms.md
@@ -0,0 +1,75 @@
+# Arbitrarily-Structured Data in Relational Databases
+
+This approach is similar to [FriendFeed's schemaless database framework](http://bret.appspot.com/entry/how-friendfeed-uses-mysql). The key difference is in the data locality.
+
+## Hypothesis
+
+In an evolving relational (SQL) database schema, we store two types of data: Data we will be querying against and data we will be displaying. There is often a subset of display data which will not be used for querying in the foreseeable future, and this is the data whose structure changes most often.
+
+## Solution
+
+Store query data and display data separately such that display data is less strictly-structured and thus more easily evolved.
+
+Imagine a standardized table structure where each table has the following columns: ``id``, ``time_created``, ``time_updated``, ``_data``, and additional "index columns".
+
+The ``_data`` column contains a dictionary of arbitrary data serialized into JSON (or could be zlib-compressed Pickle if it were Python-specific). Index columns are columns which you query against.
+
+### Example
+
+A typical *user* table might have the following columns (using an SQLAlchemy declarative model):
+
+    class User(Model):
+
+        id = Column(types.Integer, primary_key=True)
+        time_created = Column(types.DateTime, default=datetime.now, nullable=False)
+        time_updated = Column(types.DateTime, onupdate=datetime.now)
+
+        is_admin = Column(types.Boolean, default=False, nullable=False)
+
+        email = Column(types.String(255), nullable=False, index=True, unique=True)
+        display_name = Column(types.String(64))
+
+        password_hash = Column(types.String(40), nullable=False)
+        password_salt = Column(types.String(8), nullable=False)
+
+(We could replace the primary_key ``id`` with the ``email`` column, but this is not important.)
+
+In our example, this table will have two types of queries:
+
+    -- Load the user object from the current session (where we store the user_id)
+    SELECT * FROM user WHERE id = :user_id;
+
+    -- Check the given password against the email address, for login
+    SELECT password_hash, password_salt FROM user WHERE email = :user_email;
+
+In the schemaless model, the table would look like this:
+
+    class User(SchemalessModel):
+
+        id = Column(types.Integer, primary_key=True)
+        time_created = Column(types.DateTime, default=datetime.now, nullable=False)
+        time_updated = Column(types.DateTime, onupdate=datetime.now)
+        _data = Column(types.JSON)
+
+        email = Column(types.String(255), nullable=False, index=True, unique=True)
+
+Where the ``_data`` column would contain data like this:
+
+    {
+        'display_name': 'Andrey Petrov',
+        'is_admin': 1,
+        'password_hash': 'cSKSsy315E4EroxeDQrsxjTb6ijBxxbK',
+        'password_salt': 'vS5Otm',
+    }
+
+And perhaps we would build a framework on top of SQLAlchemy which would let us access columns as ``user.display_name`` or ``user.email`` regardless whether it's an extracted indexed property or a buried _data element.
+
+
+### Process
+
+1. Build a table with just a free-structure ``_data`` field.
+2. Determine queries, extract relevant properties into indexed columns:
+    1. ALTER TABLE to add the column
+    2. Run full-scan query to populate new column with data
+    3. Add relevant index onto said column
+    4. Deprecate property from ``_data`` (optional, we could just assume that proper columns always supercede ``_data`` attributes)