
Denormalization


Performance denormalization is a common design choice. Heretofore, it has been a challenge, due to the lack of Data Independence at the Logical level.

Here, we briefly review a common example of Physical Data Independence, and then introduce Logical Data Independence, with an example.

Background: Early Guess, Long Dev, Late Discovery

Agile advocates that projects begin implementation, even in the face of unknowns.  Some of these include data size and transaction volumes.  Optimizing around these can have a big effect on performance.

The problem is that these unknowns often are not discovered until late "go-live" testing.  And that's after the long phase of dev iterations.

That's where data independence comes in: choose technologies and approaches that enable us to maximize the changes we can make without recoding. We want our programs to be independent of  database design decisions - data independence.

Let's take two examples.  One is familiar, but a great reminder of how technology can support data independence.

Example 1: Add Missing Index

Back in the day, pre-relational DBMSs required that you explicitly name indices on each and every query, in each and every program.  Then, we would discover, late in the project, that a certain access was common, and required an index.

Discovering this and adding the index took little time. The problem was that all the existing programs did not know about the index - they were not independent of this critical aspect of database design. So, we would have to review all of them and make the required revisions. And that took a lot of time.

Relational: Physical Data Independence

With the advent of relational databases, this huge problem was eliminated.  Knowledge of the index structure was extracted out of each application, replaced by the query optimizer.  We can term this physical data independence.

Example 2: Add Performance Denormalization

But index structure is not the only critical design decision we need to make early.  Another is "performance denormalization".  Consider a rollup, like Customer.Balance:

  • Should I store a rollup as a physical database column?

This provides near-instant access, but requires all accessing programs to maintain it, consistently.

  • Or derive it on the fly?

This approach eliminates the consistency issues, but could perform slowly.

The performance implication can be substantial, particularly in the case of chained rollups - one rollup depends on another.  In our example, the Customer Balance depends on another rollup, Order.OrderTotal (a rollup from OrderDetails).
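To make the trade-off concrete, here is a minimal SQLAlchemy sketch of the "derive on the fly" choice, including the chained case.  It is hedged: class and column names such as OrderDetail.UnitPrice are illustrative assumptions, not taken from the project.

from sqlalchemy import Column, ForeignKey, Integer, Numeric, func, select
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import column_property

Base = declarative_base()

class OrderDetail(Base):
    __tablename__ = 'OrderDetail'
    Id = Column(Integer, primary_key=True)
    OrderId = Column(Integer, ForeignKey('Order.Id'))
    Quantity = Column(Integer)
    UnitPrice = Column(Numeric)

class Order(Base):
    __tablename__ = 'Order'
    Id = Column(Integer, primary_key=True)
    CustomerId = Column(Integer, ForeignKey('Customer.Id'))
    # OrderTotal derived on the fly - a rollup of this order's OrderDetails
    OrderTotal = column_property(
        select([func.sum(OrderDetail.Quantity * OrderDetail.UnitPrice)])
        .where(OrderDetail.OrderId == Id))

class Customer(Base):
    __tablename__ = 'Customer'
    Id = Column(Integer, primary_key=True)
    # Balance derived on the fly - a rollup of a rollup (chained), so every
    # read re-computes the nested aggregation over Orders and OrderDetails
    Balance = column_property(
        select([func.sum(Order.OrderTotal)])
        .where(Order.CustomerId == Id))
    # The stored alternative - Balance = Column(Numeric) - reads near-instantly,
    # but every program that writes Orders must then maintain it, consistently.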

Such chained rollups can literally prevent deployment.  I personally had a situation where the Service Level Agreement was 3 seconds, and the result was 3 minutes.

No Right Answer - Need Logical Data Independence

As with most interesting questions, there's no right answer.  We don't know everything, yet development must start.

The only way to address this is with Data Independence.  Only this time it's not an index, it's the data content itself.  We'll call this logical data independence.

Example: Only Sales Reps can have Orders

Let's consider an example from the data model, where some Employees are "commissioned", and we want to ensure that Orders relate only to commissioned Employees.

We could add logic to "add order", but we also want to prevent employees from being reclassified as non-commissioned if they are assigned orders.  So, our design is:

  1. Define Employee.order_count
  2. A constraint to ensure order_count can be > 0 only for commissioned employees (a condition sketched just below)
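Expressed as a condition over each Employee row, the constraint is simply the following (a hedged sketch in plain Python; the declarative rule form appears in the Logic Bank section below):

def orders_require_commissioned(row) -> bool:
    # valid when the employee is commissioned, or has no orders at all
    return row.IsCommissioned or row.order_count == 0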

Let's imagine a common scenario where we cannot alter the schema, so can't introduce a new column.

Retrieval: Object Model Virtual Attributes

Many projects use ORMs (Object-Relational Mappers) to provide an abstraction level for data access.  These can provide a better programmatic interface (better attribute / object names, type checking / code completion, etc.), but many also provide virtual attributes - the ability to define attributes that are not stored in the database, but computed by the ORM.

SQLAlchemy is one such example, where such support is termed "hybrid attributes".  In addition to sums, you can elegantly define counts, as shown below.

The models.py file contains the Python classes that map to our tables, here, Employee (note: the file can be generated). Some quick background orientation:

  • lines 110-115 define stored columns, so you can code anEmployee.Salary

  • line 119 defines a self-relationship, so you can code anEmployee.Manager to get an employee's manager, and anEmployee.Manages to get the list of employees managed

Line 123 defines our hybrid (virtual) attribute as Python code.  Note you can set breakpoints (the red dot).
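For orientation, here is a minimal sketch of what such a hybrid attribute can look like (hedged - the project's generated models.py differs in details such as the full column list):

from sqlalchemy import Column, Integer, Numeric
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.ext.hybrid import hybrid_property

Base = declarative_base()

class Employee(Base):
    __tablename__ = 'Employee'
    Id = Column(Integer, primary_key=True)
    Salary = Column(Numeric)
    IsCommissioned = Column(Integer)

    @hybrid_property
    def order_count(self):
        # virtual attribute - not stored; delegates to the mapped
        # expression order_count_sql, assigned below
        return self.order_count_sql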

It references order_count_sql, defined as follows:

Employee.order_count_sql = column_property(
    select([func.count(Order.Id)])
    .where(Employee.Id == Order.EmployeeId))  # correlated count of this employee's Orders
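For example (a hedged snippet - the database URL and session setup are illustrative), an application reads the derived count with ordinary attribute access:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine('sqlite:///nw.db')      # hypothetical database file
session = sessionmaker(bind=engine)()

an_employee = session.query(Employee).first()
print(an_employee.order_count)                 # same code whether stored or derived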

So now, all of our apps can access this property, without any knowledge of whether it is stored, or derived.  Logical data independence! Well... almost...

Update: Logic Bank

We are all set for retrieval, but what about update?  Recall we need to enforce our constraint that only Commissioned Employees can have Orders.

SQLAlchemy support includes "setters" as well (see line 129).  We can leverage this in Logic Bank.
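A minimal sketch of what the rule declarations can look like, using Logic Bank's Rule API (hedged - the project's actual rules file may differ):

from logic_bank.logic_bank import Rule

def declare_logic():
    # derive the virtual attribute: a count of this employee's Orders
    Rule.count(derive=Employee.order_count, as_count_of=Order)

    # constraint: only commissioned employees can have Orders
    Rule.constraint(validate=Employee,
                    as_condition=lambda row: row.IsCommissioned or row.order_count == 0,
                    error_msg="Only commissioned employees can have Orders")

These declarations are registered once, at session setup (LogicBank.activate), after which the rule engine watches every SQLAlchemy update.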

So, this nicely addresses the Use Cases we imagined:

  1. Place Order - the Logic Bank rule engine will:
    1. Adjust Employee.order_count, per the count rule
    2. Verify our constraint
  2. Update Employee.IsCommissioned
    1. Verify our constraint

Win #1: No Consistency Issues With "Derive on the Fly"

Our design eliminated the consistency issues by not storing the order_count. Our "Derive on the Fly" approach also deals with situations where the schema was locked, and there was not even a choice to do performance denormalization.

Win #2: Performance Denormalization Requires No Recoding, No Consistency Issues

If you discover performance issues, you can add order_count to your schema.
No application recoding is required (beyond a few lines in the class file above), and the system will run faster, possibly much faster.
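A hedged sketch of that change: the virtual definitions are replaced by a stored column, and readers of anEmployee.order_count are unaffected:

from sqlalchemy import Column, Integer, Numeric
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Employee(Base):
    __tablename__ = 'Employee'
    Id = Column(Integer, primary_key=True)
    Salary = Column(Numeric)
    IsCommissioned = Column(Integer)
    order_count = Column(Integer)   # now a stored (denormalized) column
    # the hybrid_property / order_count_sql definitions above are removed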

But what about consistency?

Logic Bank automatically enforces your derivations and constraints, for all applications using SQLAlchemy. It does so with no application action required - eliminating the possibility of error.

In fact, you can add the logic after the apps are written, and it will automatically be enforced for all future updates. Your independence extends to the business logic itself, not just the schema.

Win #3: Automatic Reuse Catches Missed Use Case

But there's a bonus.  As inevitably happens, there are Use Cases we forgot to consider:

  1. Move Order from Employee-1 to Employee-2 - the Logic Bank Rule Engine will:
    1. Adjust Employee-1.order_count down by 1
    2. Adjust Employee-2.order_count up by 1; this change triggers...
    3. Verify our constraint

This is a pretty big deal.  By using a declarative approach, our logic is automatically re-used over all transactions that touch the data - even ones we might have overlooked.

Summary

Automation often levies a "flexibility tax", reducing the flexibility required to achieve project objectives. But done right, as in the case of SQLAlchemy and Logic Bank, you get the time-saving agility of automation, yet retain control of critical design decisions.

And it's not just flexibility, it's data independence - the ability to make those design decisions after the apps are all written.