This repository has been archived by the owner on Feb 8, 2022. It is now read-only.

Adding idea regarding a schemaless sql structure framework.
shazow committed Oct 20, 2010
1 parent f1b34c4 commit 8fe0d62
75 changes: 75 additions & 0 deletions idea/arbitrarily-structured-data-in-rdbms.md
# Arbitrarily-Structured Data in Relational Databases

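This approach is similar to [FriendFeed's schemaless database framework](http://bret.appspot.com/entry/how-friendfeed-uses-mysql). The key difference is data locality: here the indexed columns live in the same table as the serialized data, rather than in separate index tables.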

## Hypothesis

In an evolving relational (SQL) database schema, we store two types of data: data we will query against and data we will only display. There is often a subset of display data which will not be used for querying in the foreseeable future, and this is the data whose structure changes most often.

## Solution

Store query data and display data separately, so that display data is less strictly structured and therefore easier to evolve.

Imagine a standardized table structure where each table has the following columns: ``id``, ``time_created``, ``time_updated``, ``_data``, and additional "index columns".

The ``_data`` column contains a dictionary of arbitrary data serialized as JSON (or as a zlib-compressed pickle, if the application is Python-only). Index columns are the columns you query against.
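As a rough sketch of the two serialization options mentioned above, the helpers below round-trip a dictionary through JSON and through a zlib-compressed pickle. The function names are hypothetical and use only the Python standard library; they are not part of any existing framework.

```python
import json
import pickle
import zlib


def dump_json(data):
    """Serialize a dict to a compact JSON string for the _data column."""
    return json.dumps(data, separators=(",", ":"), sort_keys=True)


def load_json(raw):
    """Parse the _data column back into a dict (empty dict for NULL/empty)."""
    return json.loads(raw) if raw else {}


def dump_pickle(data):
    """Python-only alternative: a zlib-compressed pickle, stored as a blob."""
    return zlib.compress(pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL))


def load_pickle(blob):
    """Inverse of dump_pickle. Only unpickle data the application wrote itself."""
    return pickle.loads(zlib.decompress(blob)) if blob else {}


profile = {"display_name": "Andrey Petrov", "is_admin": True}
json_value = dump_json(profile)
pickle_value = dump_pickle(profile)
```

JSON keeps the column readable from other languages and from SQL tooling, while the pickle variant can store arbitrary Python objects at the cost of portability and of being unsafe to load from untrusted sources.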

### Example

A typical *user* table might have the following columns (using an SQLAlchemy declarative model):

    class User(Model):

        id = Column(types.Integer, primary_key=True)
        time_created = Column(types.DateTime, default=datetime.now, nullable=False)
        time_updated = Column(types.DateTime, onupdate=datetime.now)

        is_admin = Column(types.Boolean, default=False, nullable=False)

        email = Column(types.String(255), nullable=False, index=True, unique=True)
        display_name = Column(types.String(64))

        password_hash = Column(types.String(40), nullable=False)
        password_salt = Column(types.String(8), nullable=False)

(We could use ``email`` as the primary key instead of ``id``, but that is not important here.)

In our example, this table will have two types of queries:

    -- Load the user object from the current session (where we store the user_id)
    SELECT * FROM user WHERE id = :user_id;

    -- Check the given password against the email address, for login
    SELECT password_hash, password_salt FROM user WHERE email = :user_email;

In the schemaless model, the table would look like this:

    class User(SchemalessModel):

        id = Column(types.Integer, primary_key=True)
        time_created = Column(types.DateTime, default=datetime.now, nullable=False)
        time_updated = Column(types.DateTime, onupdate=datetime.now)
        _data = Column(types.JSON)

        email = Column(types.String(255), nullable=False, index=True, unique=True)

Where the ``_data`` column would contain data like this:

    {
        "display_name": "Andrey Petrov",
        "is_admin": true,
        "password_hash": "cSKSsy315E4EroxeDQrsxjTb6ijBxxbK",
        "password_salt": "vS5Otm"
    }
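To illustrate how the login query from earlier might look against this layout, here is a minimal sketch using the standard-library ``sqlite3`` module rather than SQLAlchemy. The lookup still goes through the indexed ``email`` column, and the password fields are then unpacked from ``_data``. The ``load_credentials`` helper and the example email address are assumptions made for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user ("
    " id INTEGER PRIMARY KEY,"
    " email TEXT NOT NULL UNIQUE,"  # index column: the only thing we query on
    " _data TEXT)"                  # everything else, serialized as JSON
)
conn.execute(
    "INSERT INTO user (email, _data) VALUES (?, ?)",
    (
        "andrey@example.com",  # hypothetical address
        json.dumps({
            "display_name": "Andrey Petrov",
            "password_hash": "cSKSsy315E4EroxeDQrsxjTb6ijBxxbK",
            "password_salt": "vS5Otm",
        }),
    ),
)


def load_credentials(conn, email):
    """Return (password_hash, password_salt) for an email, or None if unknown."""
    row = conn.execute(
        "SELECT _data FROM user WHERE email = ?", (email,)
    ).fetchone()
    if row is None:
        return None
    data = json.loads(row[0] or "{}")
    return data.get("password_hash"), data.get("password_salt")


credentials = load_credentials(conn, "andrey@example.com")
```

The trade-off is that the whole ``_data`` value is read and parsed even though only two fields are needed.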

And perhaps we would build a framework on top of SQLAlchemy which would let us access columns as ``user.display_name`` or ``user.email`` regardless of whether it's an extracted indexed property or a buried ``_data`` element.
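One way such uniform attribute access could work is sketched below in plain Python, without SQLAlchemy, so that only the lookup logic is shown. ``SchemalessRecord`` and ``index_columns`` are hypothetical names. Real columns are checked first and ``_data`` is the fallback, which matches the idea that proper columns take precedence.

```python
import json


class SchemalessRecord:
    """Hypothetical base class: index columns win, otherwise fall back to _data."""

    index_columns = ()  # names of extracted, indexed columns

    def __init__(self, _data=None, **columns):
        unknown = set(columns) - set(self.index_columns)
        if unknown:
            raise TypeError(f"not index columns: {sorted(unknown)}")
        # Write through __dict__ so our own __setattr__ is bypassed.
        self.__dict__["_columns"] = dict(columns)
        self.__dict__["_data"] = json.loads(_data) if _data else {}

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        columns = self.__dict__.get("_columns", {})
        data = self.__dict__.get("_data", {})
        if name in columns:
            return columns[name]
        if name in data:
            return data[name]
        raise AttributeError(name)

    def __setattr__(self, name, value):
        # Route writes to the matching store.
        if name in self.index_columns:
            self._columns[name] = value
        else:
            self._data[name] = value


class User(SchemalessRecord):
    index_columns = ("id", "email")


user = User(
    id=1,
    email="andrey@example.com",  # hypothetical address
    _data='{"display_name": "Andrey Petrov"}',
)
user.is_admin = True  # not an index column, so it lands in _data
```

With this in place, ``user.email`` and ``user.display_name`` read the same way from the caller's side, and promoting a property to an index column only requires adding its name to ``index_columns``.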


### Process

1. Build a table with just a free-structure ``_data`` field.
2. Determine queries, extract relevant properties into indexed columns:
    1. ``ALTER TABLE`` to add the column
    2. Run a full-scan query to populate the new column with data
    3. Add the relevant index onto said column
    4. Deprecate the property from ``_data`` (optional, we could just assume that proper columns always supersede ``_data`` attributes)
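The steps above can be sketched as follows, using the standard-library ``sqlite3`` module as a stand-in for whichever database is actually in use. The ``extract_column`` helper, the index naming scheme, and the sample rows are assumptions for illustration. A production migration would likely batch the updates instead of loading every row at once.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# Step 1: a table with just the free-structure _data field.
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, _data TEXT)")
conn.executemany(
    "INSERT INTO user (_data) VALUES (?)",
    [
        (json.dumps({"email": "a@example.com", "display_name": "A"}),),
        (json.dumps({"email": "b@example.com", "display_name": "B"}),),
    ],
)


def extract_column(conn, table, prop, sql_type="TEXT", unique=False):
    """Promote a _data property to an indexed column (hypothetical helper).

    Identifiers are interpolated into the SQL, so pass trusted names only.
    """
    # Step 2.1: add the column.
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {prop} {sql_type}")
    # Step 2.2: full scan to populate it from _data.
    rows = conn.execute(f"SELECT id, _data FROM {table}").fetchall()
    for row_id, raw in rows:
        data = json.loads(raw or "{}")
        value = data.pop(prop, None)  # Step 2.4: drop the property from _data
        conn.execute(
            f"UPDATE {table} SET {prop} = ?, _data = ? WHERE id = ?",
            (value, json.dumps(data), row_id),
        )
    # Step 2.3: index the new column.
    kind = "UNIQUE INDEX" if unique else "INDEX"
    conn.execute(f"CREATE {kind} ix_{table}_{prop} ON {table} ({prop})")
    conn.commit()


extract_column(conn, "user", "email", unique=True)
```

After the call, ``SELECT ... WHERE email = ?`` can use the new index, and ``_data`` no longer carries a duplicate copy of the property.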
