# Cloud Data Warehouse

## Data Warehouse

### A Business Perspective
You are in charge of a retailer’s data infrastructure. Let’s look at some business activities.

- Customers should be able to find goods & make orders
- Inventory staff should be able to stock, retrieve, and re-order goods
- Delivery staff should be able to pick up & deliver goods
- HR should be able to assess the performance of sales staff
- Marketing should be able to see the effect of different sales channels
- Management should be able to monitor sales growth

Ask yourself: Can I build a database to support these activities? Are all of the above activities of the same nature?
Let's take a closer look at details that may affect your data infrastructure.

1. Retailer has a nation-wide presence → Scale?
2. Acquired smaller retailers, brick & mortar shops, online store → Single database? Complexity?
3. Has support call center & social media accounts → Tabular data?
4. Customers, Inventory Staff and Delivery Staff expect the system to be fast & stable → Performance?
5. HR, Marketing & Sales Reports want a lot of information but have not decided yet on everything they need → Clear Requirements?

Ok, maybe one single relational database won’t suffice :)

### Definition

- A data warehouse is a copy of transaction data specifically structured for query and analysis. 
- A data warehouse is a subject-oriented, integrated, nonvolatile, and time-variant collection of data in support of management's decisions. 
- A data warehouse is a system that retrieves and consolidates data periodically from the source systems into a dimensional or normalized data store. It usually keeps years of history and is queried for business intelligence or other analytical activities. It is typically updated in batches, not every time a transaction happens in the source system. 



### Operational vs Analytical Business Processes 
**Operational Processes (Make it work!)**  
- Find goods & make orders (for customers)   
- Stock and find goods (for inventory staff)   
- Pick up & deliver goods (for delivery staff)   

**Analytical Processes (What is going on?)**
- Assess the performance of sales staff (for HR) 
- See the effect of different sales channels (for marketing) 
- Monitor sales growth (for management) 

**Operational Databases: strengths** 
- Excellent for operations 
- No redundancy, high integrity 

**Operational Databases: weaknesses for analytics**
- Too slow for analytics, too many joins
- Too hard to understand 

<img src="images/image1.png" alt="Drawing" style="width: 600px;"/>

**? -> ETL**   
**? -> Dimensional Model**
- Extract the data from the source systems used for operations, Transform the data, and Load it into a dimensional model 
- The dimensional model is designed to a) make it easy for business users to work with the data, b) improve analytical query performance 
- The technologies used for storing dimensional models are different from traditional technologies 
- Business-user-facing applications are needed, with clear visuals, aka Business Intelligence (BI) apps 

#### Dimensional Model (Easy to understand, Fast analytical query performance) 
**Star Schema**
- Joins with dimensions only 
- Good for OLAP, not OLTP

**Compared to 3NF**
- 3NF needs lots of expensive joins
- 3NF is hard to explain to business users

#### Facts & Dimensions

**Fact tables:** Record business events, like an order, a phone call, or a book review.
- Fact table columns record events in quantifiable metrics, like the quantity of an item, the duration of a call, or a book rating.

**Dimension tables:** Record the context of the business events, e.g. who, what, where, why, etc.
- Dimension table columns contain attributes like the store at which an item is purchased or the customer who made the call.

**Fact or Dimension Dilemma**
- **If you're unsure whether a column is a fact or a dimension, the simplest rule is that a fact is usually numeric & additive.**   

- Example facts: 
    - A comment on an article represents an event, but we cannot easily make a statistic out of its content per se **(Not a good fact)** 
    - An invoice number is numeric, but adding it does not make sense **(Not a good fact)** 
    - The total amount of an invoice could be added to compute total sales **(A good fact)**   

- Example dimensions: 
    - Date & time are always a dimension 
    - Physical locations and their attributes are good candidates for dimensions 
    - Human roles like customers and staff are always good candidates for dimensions 
    - Goods sold are always good candidates for dimensions   
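
To make the fact/dimension split concrete, here is a minimal sketch of a star schema built with Python's standard `sqlite3` module. The table and column names (`fact_sales`, `dim_store`, `sales_amount`, etc.) and the sample rows are made up for illustration; they are not taken from any real system.

```python
import sqlite3

# In-memory database; all table names and rows below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);

-- The fact table holds numeric, additive metrics plus one key per dimension.
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    store_key    INTEGER REFERENCES dim_store(store_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    quantity     INTEGER,
    sales_amount REAL
);

INSERT INTO dim_date    VALUES (20240101, 2024, 1), (20240201, 2024, 2);
INSERT INTO dim_store   VALUES (1, 'Paris', 'France'), (2, 'Boston', 'US');
INSERT INTO dim_product VALUES (1, 'Book'), (2, 'Lamp');
INSERT INTO fact_sales  VALUES
    (20240101, 1, 1, 2, 30.0),
    (20240101, 2, 2, 1, 45.0),
    (20240201, 1, 2, 3, 135.0);
""")

# An analytical query joins the fact table with only the dimension it needs.
sales_by_country = conn.execute("""
    SELECT s.country, SUM(f.sales_amount)
    FROM fact_sales AS f
    JOIN dim_store AS s ON f.store_key = s.store_key
    GROUP BY s.country
    ORDER BY s.country
""").fetchall()

print(sales_by_country)  # [('France', 165.0), ('US', 45.0)]
```

Note that `sales_amount` and `quantity` can be summed meaningfully, while the dimension columns (`city`, `country`, `name`) only serve to group and filter.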

<img src="images/image2.png" alt="Drawing" style="width: 400px;"/>

**Extract:** Query the 3NF DB  
**Transform:**   
   - Join tables together  
   - Change types  
   - Add new columns  
**Load:** Insert into fact & dimension tables  
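
The three steps above can be sketched end to end with two in-memory SQLite databases, one standing in for the 3NF source and one for the warehouse. This is only an illustration under assumed names: the schemas (`payment`, `customer`, `city`, `dim_customer`, `fact_sales`) and the sample row are hypothetical.

```python
import sqlite3

# Hypothetical 3NF source database (stands in for the operational system).
source = sqlite3.connect(":memory:")
source.executescript("""
CREATE TABLE city     (city_id INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, city_id INTEGER);
CREATE TABLE payment  (payment_id INTEGER PRIMARY KEY, customer_id INTEGER,
                       amount TEXT, paid_at TEXT);
INSERT INTO city     VALUES (1, 'Paris', 'France');
INSERT INTO customer VALUES (10, 'Ada', 1);
INSERT INTO payment  VALUES (100, 10, '19.99', '2024-03-05 14:30:00');
""")

# Hypothetical dimensional model in the warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT,
                           city TEXT, country TEXT);
CREATE TABLE fact_sales   (payment_id INTEGER, customer_key INTEGER,
                           date_key INTEGER, sales_amount REAL);
""")

# Extract: query the 3NF DB (the join is expressed in the query itself).
rows = source.execute("""
    SELECT p.payment_id, c.customer_id, c.name, ci.city, ci.country,
           p.amount, p.paid_at
    FROM payment AS p
    JOIN customer AS c ON p.customer_id = c.customer_id
    JOIN city AS ci    ON c.city_id = ci.city_id
""").fetchall()

for payment_id, customer_id, name, city, country, amount, paid_at in rows:
    # Transform: change types and add a new column (an integer date key).
    sales_amount = float(amount)
    date_key = int(paid_at[:10].replace("-", ""))  # '2024-03-05' -> 20240305

    # Load: insert into the dimension and fact tables.
    warehouse.execute(
        "INSERT OR IGNORE INTO dim_customer VALUES (?, ?, ?, ?)",
        (customer_id, name, city, country),
    )
    warehouse.execute(
        "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
        (payment_id, customer_id, date_key, sales_amount),
    )
warehouse.commit()

loaded_facts = warehouse.execute("SELECT * FROM fact_sales").fetchall()
loaded_customers = warehouse.execute("SELECT * FROM dim_customer").fetchall()
```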

## DWH Architecture
1. Kimball's Bus Architecture 
2. Independent Data Marts 
3. Inmon's Corporate Information Factory (CIF) 
4. Hybrid Bus & CIF 

### Kimball's Bus Architecture

<img src="images/image3.png" alt="Drawing" style="width: 600px;"/>

<img src="images/image4.png" alt="Drawing" style="width: 600px;"/>

**ETL: A Closer Look** 
- Extracting: 
    - Get the data from its source
    - Possibly deleting old state 
- Transforming: 
    - Integrates many sources together 
    - Possibly cleansing: inconsistencies, duplication, missing values, etc..
    - Possibly producing diagnostic metadata 
- Loading: 
    - Structuring and loading the data into the dimensional data model 
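
As a rough illustration of the transforming step, the sketch below integrates two sources, cleanses inconsistencies, duplicates, and missing values, and produces simple diagnostic metadata. The record layout (`order_id`, `country`, `amount`) and the `cleanse` helper are assumptions made for this example only.

```python
def cleanse(records):
    """Normalize, de-duplicate, and filter records.

    Returns (clean_rows, diagnostics), where diagnostics counts what was dropped.
    """
    seen = set()
    clean = []
    diagnostics = {"duplicates": 0, "missing_amount": 0}
    for rec in records:
        key = rec["order_id"]
        if key in seen:  # duplication
            diagnostics["duplicates"] += 1
            continue
        seen.add(key)
        if rec.get("amount") is None:  # missing value
            diagnostics["missing_amount"] += 1
            continue
        clean.append({
            "order_id": key,
            "country": rec["country"].strip().upper(),  # fix inconsistent spelling
            "amount": float(rec["amount"]),             # unify types
        })
    return clean, diagnostics


# Two hypothetical sources integrated into one stream.
web_orders = [
    {"order_id": "W1", "country": " us ", "amount": "10.5"},
    {"order_id": "W1", "country": "US", "amount": "10.5"},  # duplicate
]
store_orders = [
    {"order_id": "S1", "country": "fr", "amount": None},    # missing amount
    {"order_id": "S2", "country": "FR", "amount": 20},
]

clean_rows, report = cleanse(web_orders + store_orders)
print(report)  # {'duplicates': 1, 'missing_amount': 1}
```

Whether a row with a missing value should be dropped, defaulted, or routed to a review queue is a design decision; dropping is used here only to keep the sketch short.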


## Data Marts
<img src="images/image5.png" alt="Drawing" style="width: 600px;"/>

**Independent Data Marts**
- Departments have independent ETL processes & dimensional models 
- These separate & smaller dimensional models are called "Data Marts" 
- Different fact tables for the same events, no conformed dimensions 
- Uncoordinated efforts can lead to inconsistent views 
- This architecture tends to emerge from departmental autonomy, but it is generally discouraged 

## Inmon's Corporate Information Factory (CIF)

<img src="images/image6.png" alt="Drawing" style="width: 600px;"/>

**Inmon's Corporate Information Factory (CIF) Data Marts**
- 2 ETL processes 
    - Source systems → 3NF DB 
    - 3NF DB → Departmental data marts 
- The 3NF DB acts as an enterprise-wide data store. 
    - Single integrated source of truth for data marts 
    - Could be accessed by end-users if needed 
- Data marts are dimensionally modelled and, unlike Kimball's dimensional models, they are mostly aggregated. 

## Hybrid Kimball Bus & Inmon CIF
<img src="images/image7.png" alt="Drawing" style="width: 600px;"/>

## OLAP Cubes
 
- An OLAP cube is an aggregation of a fact metric on a number of dimensions 
- E.g. Movie, Branch, Month 
- Easy to communicate to business users 
- Common OLAP operations include: roll-up, drill-down, slice, & dice 

<img src="images/image8.png" alt="Drawing" style="width: 600px;"/>

**OLAP Cubes: Roll-Up and Drill-Down** 
- **Roll-up:** Sum up the sales of each city by country, e.g. US, France (fewer columns in the branch dimension) 
- **Drill-down:** Decompose the sales of each city into smaller districts (more columns in the branch dimension) 
- **The OLAP cubes should store the finest grain of data (atomic data), in case we need to drill down to the lowest level, e.g. Country → City → District → Street, etc.** 
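
Roll-up and drill-down can be sketched as the same aggregation run at a coarser or finer level of the branch hierarchy. The example below uses Python's `sqlite3` with a hypothetical `sales` table that stores atomic, district-level rows; the `revenue_by` helper and all data are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Atomic (finest-grain) data: one row per district. All rows are hypothetical.
CREATE TABLE sales (country TEXT, city TEXT, district TEXT, revenue REAL);
INSERT INTO sales VALUES
    ('US',     'Boston', 'Back Bay', 100.0),
    ('US',     'Boston', 'Fenway',    50.0),
    ('US',     'Austin', 'Downtown',  70.0),
    ('France', 'Paris',  'Marais',    80.0);
""")


def revenue_by(*levels):
    """Aggregate revenue at the given hierarchy levels (trusted column names only)."""
    cols = ", ".join(levels)
    query = f"SELECT {cols}, SUM(revenue) FROM sales GROUP BY {cols} ORDER BY {cols}"
    return conn.execute(query).fetchall()


by_city = revenue_by("country", "city")                  # starting level
by_country = revenue_by("country")                       # roll-up: city -> country
by_district = revenue_by("country", "city", "district")  # drill-down: city -> district

print(by_country)  # [('France', 80.0), ('US', 220.0)]
```

Drill-down to districts is only possible because the table keeps district-level rows; had only city totals been stored, that level would be lost.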
  
### Roll-up
<img src="images/image9.png" alt="Drawing" style="width: 500px;"/>

### Slice 
<img src="images/image10.png" alt="Drawing" style="width: 500px;"/>

### Dice
<img src="images/image11.png" alt="Drawing" style="width: 500px;"/>
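
A small sketch of slice and dice on a cube held as a plain Python dictionary keyed by (movie, branch, month). Slice fixes one dimension to a single value and so reduces the cube from N to N-1 dimensions; dice restricts some dimensions to a set of values and keeps the same number of dimensions. The data and the helper names `slice_cube` and `dice_cube` are hypothetical.

```python
# Hypothetical 3-dimensional cube: (movie, branch, month) -> revenue.
cube = {
    ("Avatar", "NY",    "Feb"): 25_000,
    ("Avatar", "NY",    "Mar"): 5_000,
    ("Avatar", "Paris", "Mar"): 15_000,
    ("Up",     "NY",    "Mar"): 7_000,
    ("Up",     "Paris", "Feb"): 8_000,
}


def slice_cube(cube, month):
    """Slice: fix the month dimension to one value; the result has 2 dimensions."""
    return {
        (movie, branch): revenue
        for (movie, branch, m), revenue in cube.items()
        if m == month
    }


def dice_cube(cube, movies, branches, months):
    """Dice: keep only the given values per dimension; the result stays 3-dimensional."""
    return {
        key: revenue
        for key, revenue in cube.items()
        if key[0] in movies and key[1] in branches and key[2] in months
    }


march = slice_cube(cube, "Mar")
sub_cube = dice_cube(cube, movies={"Avatar"}, branches={"NY", "Paris"}, months={"Feb", "Mar"})

print(march)
# {('Avatar', 'NY'): 5000, ('Avatar', 'Paris'): 15000, ('Up', 'NY'): 7000}
```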

## OLAP Cubes: Query Optimization
- Business users will typically want to slice, dice, roll up, and drill down all the time 
- Each such combination will potentially go through the whole facts table (suboptimal) 
- The "GROUP BY CUBE (movie, branch, month)" clause will make one pass through the facts table and will aggregate all possible combinations of groupings, of length 0, 1, 2 and 3 
  Example:
    - Total revenue 
    - Revenue by movie; revenue by branch; revenue by month 
    - Revenue by movie and branch; revenue by branch and month; revenue by movie and month
    - Revenue by movie, branch, and month 
- Saving/materializing the output of the CUBE operation and using it is usually enough to answer all forthcoming aggregations from business users without having to process the whole facts table again 
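
To show what the CUBE operation computes, here is a pure-Python sketch that makes a single pass over a small, made-up facts list and aggregates every combination of the three dimensions (2^3 = 8 groupings). It mimics the result of the SQL clause; it is not how any particular database implements it, and the `group_by_cube` helper is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("movie", "branch", "month")

# Hypothetical facts table.
facts = [
    {"movie": "Avatar", "branch": "NY",    "month": "Jan", "revenue": 100},
    {"movie": "Avatar", "branch": "Paris", "month": "Jan", "revenue": 50},
    {"movie": "Up",     "branch": "NY",    "month": "Feb", "revenue": 30},
]


def group_by_cube(facts, dimensions, metric):
    """Aggregate `metric` over all groupings of length 0..len(dimensions)."""
    groupings = [
        combo
        for size in range(len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]
    totals = {grouping: defaultdict(int) for grouping in groupings}
    for row in facts:  # single pass through the facts table
        for grouping in groupings:
            key = tuple(row[dim] for dim in grouping)
            totals[grouping][key] += row[metric]
    return totals


# "Materialize" the cube once, then answer later questions from it.
cube = group_by_cube(facts, DIMENSIONS, "revenue")

print(cube[()][()])                               # total revenue: 180
print(cube[("movie",)][("Avatar",)])              # revenue by movie: 150
print(cube[("branch", "month")][("NY", "Jan")])   # revenue by branch, month: 100
```

Once `cube` is stored, each later roll-up, slice, or dice is a dictionary lookup rather than another scan of `facts`.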


## RECAP

**The Last Mile: Delivering the analytics to users**
- Data is available...
    - In an understandable & performant dimensional model 
    - With Conformed Dimensions or separate Data Marts 
    - For users to report and visualize 
        - By interacting directly with the model 
        - Or in most cases, through a BI application 
- OLAP cubes are a very convenient way of slicing, dicing and drilling down 
- How do we serve these OLAP cubes? 
    - Approach 1: Pre-aggregate the OLAP cubes and save them in a special-purpose non-relational database (MOLAP) 
    - Approach 2: Compute the OLAP cubes on the fly from the existing relational databases where the dimensional model resides (ROLAP) 
- Demo: Column format in ROLAP 
    - Use a PostgreSQL database with a columnar table extension 
    - Load a dataset in a normal table 
    - Load the same dataset in a columnar table 
    - Compare the performance of a fact-aggregating query on both tables 
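
The intuition behind the demo can be sketched without a database: the same made-up dataset is held once row by row and once column by column, and an aggregation over a single metric only needs to touch one column in the columnar layout. This is not the PostgreSQL demo itself, only a toy illustration of the two layouts; all names and data are hypothetical.

```python
# Row layout: each record keeps all of its fields together.
rows = [(i, f"movie_{i % 5}", i * 1.5) for i in range(1000)]

# Column layout: each field is stored contiguously in its own list.
columns = {
    "sale_id": [r[0] for r in rows],
    "movie":   [r[1] for r in rows],
    "revenue": [r[2] for r in rows],
}

# Aggregating over rows walks every record and picks out one field.
row_total = sum(record[2] for record in rows)

# Aggregating over a column reads only the values it needs.
col_total = sum(columns["revenue"])

print(row_total == col_total)  # True: same answer, different access pattern
```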