This repository has been archived by the owner on Feb 8, 2022. It is now read-only.
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Adding idea regarding a schemaless sql structure framework.
- Loading branch information
Showing
1 changed file
with
75 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,75 @@ | ||
# Arbitrarily-Structured Data in Relational Databases | ||
|
||
This approach is similar to [FriendFeed's schemaless database framework](http://bret.appspot.com/entry/how-friendfeed-uses-mysql). The key difference is in the data locality. | ||
|
||
## Hypothesis | ||
|
||
In an evolving relational (SQL) database schema, we store two types of data: Data we will be querying against and data we will be displaying. There is often a subset of display data which will not be used for querying in the foreseeable future, and this is the data whose structure changes most often. | ||
|
||
## Solution | ||
|
||
Store query data and display data separately such that display data is less strictly-structured and thus more easily evolved. | ||
|
||
Imagine a standardized table structure where each table has the following columns: ``id``, ``time_created``, ``time_updated``, ``_data``, and additional "index columns". | ||
|
||
The ``_data`` column contains a dictionary of arbitrary data serialized into JSON (or could be zlib-compressed Pickle if it were Python-specific). Index columns are columns which you query against. | ||
|
||
### Example | ||
|
||
A typical *user* table might have the following columns (using an SQLAlchemy declarative model): | ||
|
||
class User(Model): | ||
|
||
id = Column(types.Integer, primary_key=True) | ||
time_created = Column(types.DateTime, default=datetime.now, nullable=False) | ||
time_updated = Column(types.DateTime, onupdate=datetime.now) | ||
|
||
is_admin = Column(types.Boolean, default=False, nullable=False) | ||
|
||
email = Column(types.String(255), nullable=False, index=True, unique=True) | ||
display_name = Column(types.String(64)) | ||
|
||
password_hash = Column(types.String(40), nullable=False) | ||
password_salt = Column(types.String(8), nullable=False) | ||
|
||
(We could replace the primary_key ``id`` with the ``email`` column, but this is not important.) | ||
|
||
In our example, this table will have two types of queries: | ||
|
||
-- Load the user object from the current session (where we store the user_id) | ||
SELECT * FROM user WHERE id = :user_id; | ||
|
||
-- Check the given password against the email address, for login | ||
SELECT password_hash, password_salt FROM user WHERE email = :user_email; | ||
|
||
In the schemaless model, the table would look like this: | ||
|
||
class User(SchemalessModel): | ||
|
||
id = Column(types.Integer, primary_key=True) | ||
time_created = Column(types.DateTime, default=datetime.now, nullable=False) | ||
time_updated = Column(types.DateTime, onupdate=datetime.now) | ||
_data = Column(types.JSON) | ||
|
||
email = Column(types.String(255), nullable=False, index=True, unique=True) | ||
|
||
Where the ``_data`` column would contain data like this: | ||
|
||
{ | ||
'display_name': 'Andrey Petrov', | ||
'is_admin': 1, | ||
'password_hash': 'cSKSsy315E4EroxeDQrsxjTb6ijBxxbK', | ||
'password_salt': 'vS5Otm', | ||
} | ||
|
||
And perhaps we would build a framework on top of SQLAlchemy which would let us access columns as ``user.display_name`` or ``user.email`` regardless whether it's an extracted indexed property or a buried _data element. | ||
|
||
|
||
### Process | ||
|
||
1. Build a table with just a free-structure ``_data`` field. | ||
2. Determine queries, extract relevant properties into indexed columns: | ||
1. ALTER TABLE to add the column | ||
2. Run full-scan query to populate new column with data | ||
3. Add relevant index onto said column | ||
4. Deprecate property from ``_data`` (optional, we could just assume that proper columns always supercede ``_data`` attributes) |