# Overview
The data lakehouse, or simply lakehouse, is a proposed next generation of data store architecture. According to the original [whitepaper](http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf), the lakehouse architecture aims to replace the data warehouse.

In this notebook we cover
1. History Of Data Architectures
2. Current Implementations

# 1. History Of Data Architectures
Data store architectures have evolved through three generations over the last 50 years. The data lakehouse is the third generation.

Below we see an illustration of these architectures:

<center><img src='images/datastore-generation-comparision.png'></center>

In the next few sections we review these concepts in detail.

## 1.1. Generation I (Data Warehouse / Data Mart - 1970s)
In the 1970s, ACNielsen (a data provider) offered its clients a Data Mart to store information digitally and enhance their sales efforts. A "Data Mart" is an archive of stored, normally structured data, typically used and controlled by a specific community or department. It is normally smaller and more focused than a Data Warehouse and today is often a subdivision of one. Data Marts were the first evolutionary step toward the Data Warehouses and Data Lakes that followed.

In the 1980s, IBM researchers Barry Devlin and Paul Murphy developed the "business data warehouse". In essence, the data warehousing concept was intended to provide an architectural model for the flow of data from operational systems to decision support environments. The concept attempted to address the various problems associated with this flow, mainly its high cost. Without a data warehousing architecture, an enormous amount of redundancy was required to support multiple decision support environments. In larger corporations, it was typical for multiple decision support environments to operate independently. Though each environment served different users, they often required much of the same stored data.

Generally speaking, a data warehouse has the following features:
- subject oriented - information specific to a given subject (not all data in the company)
- integrated - standardized format / naming
- time-variant - contains historical information
- nonvolatile - data only flows in; it is neither changed nor deleted
- summarized - provides insights for data analytics / BI
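To make the time-variant, nonvolatile, and summarized properties concrete, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for a warehouse. The `sales_history` table, its columns, and the figures are hypothetical and exist only for illustration.

```python
import sqlite3

# An in-memory SQLite database stands in for a warehouse (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_history (snapshot_date TEXT, region TEXT, revenue REAL)"
)

# Time-variant and nonvolatile: each load appends a dated snapshot;
# earlier rows are never updated or deleted.
conn.executemany(
    "INSERT INTO sales_history VALUES (?, ?, ?)",
    [
        ("2020-01-31", "north", 100.0),
        ("2020-01-31", "south", 80.0),
        ("2020-02-29", "north", 120.0),
        ("2020-02-29", "south", 70.0),
    ],
)

# Summarized: analysts query aggregates rather than individual operational records.
monthly_totals = conn.execute(
    "SELECT snapshot_date, SUM(revenue) FROM sales_history "
    "GROUP BY snapshot_date ORDER BY snapshot_date"
).fetchall()
print(monthly_totals)
```

Because history is kept rather than overwritten, the same table can answer both "what is revenue now" and "what was it last month".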

Generally speaking, this generation faced the following problems:
1. Coupled compute and storage
2. On-premises implementations
3. Schema-on-write for unstructured data

## 1.2. Generation II (Data Lakes - 2000s)
The term "Data Lake" was coined by James Dixon, founder and former CTO of Pentaho, in 2010. According to Dixon:

> “If you think of a Data Mart as a store of bottled water, cleansed and packaged and structured for easy consumption, the Data Lake is a large body of water in a more natural state. The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

This generation arose to solve the problems of the previous generation and became practical once the cloud turned into an affordable mainstream solution. To be clear, Data Lakes do not replace data warehouses and data marts; they are used in tandem.

Generally speaking, data lakes offer the following features:
- Low Storage Cost
- Scalability
- Support for structured, semi-structured, and unstructured data
- Schema on read
- Reliability
- Cloud/clustered implementations
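The difference between schema-on-write (Generation I) and schema-on-read (Generation II) can be sketched in a few lines of plain Python. This is an illustrative toy, not how any particular product works: `SCHEMA`, `write_with_schema`, `read_with_schema`, and the sample records are all hypothetical names and data.

```python
import json

# Hypothetical schema for illustration: column name -> expected Python type.
SCHEMA = {"id": int, "amount": float}

def write_with_schema(table, record):
    """Schema-on-write: validate before storing and reject anything that does not fit."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"unexpected columns: {sorted(record)}")
    for column, expected_type in SCHEMA.items():
        if not isinstance(record[column], expected_type):
            raise ValueError(f"{column} must be {expected_type.__name__}")
    table.append(record)

def read_with_schema(raw_lines):
    """Schema-on-read: raw text is stored as-is; the schema is applied only when querying."""
    rows = []
    for line in raw_lines:
        data = json.loads(line)
        try:
            rows.append({"id": int(data["id"]), "amount": float(data["amount"])})
        except (KeyError, TypeError, ValueError):
            continue  # records that do not fit this reader's schema are skipped
    return rows

# Warehouse style: only records that already match the schema get in.
warehouse_table = []
write_with_schema(warehouse_table, {"id": 1, "amount": 9.5})

# Lake style: anything can be stored; structure is imposed at read time.
lake_files = [
    '{"id": 1, "amount": "9.5"}',
    '{"id": "2", "amount": 3, "note": "extra field is fine"}',
    '{"comment": "free-form record"}',
]
lake_rows = read_with_schema(lake_files)
print(lake_rows)
```

The lake accepts all three raw records, but only the two that can be coerced into the reader's schema show up in the query result. That flexibility is what makes storing semi-structured and unstructured data cheap, and it is also why consistency becomes the reader's problem.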

Generally speaking, this generation faced the following problems:
- Consistency (ACID)
- Advanced analytics
- Continued need for a warehouse
- Data duplication costs
- Support for ML

## 1.3. Generation III (Data Lakehouse - 2020s)
In January 2021, a [publication](http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf) written by researchers at UC Berkeley, Stanford, and Databricks laid out the problems with Generation II and proposed a new architecture. It presents a hybrid approach that aims to offer the best features of both the data warehouse and the data lake.

Generally speaking, a data lakehouse offers the following additional features:
- Built-in support for Data Management and ETL
- Support for ML
- SQL performance
- Data versioning
- ACID compliance
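Data versioning and atomic commits are commonly built on an append-only log of table versions. The sketch below is a deliberately simplified in-memory toy that illustrates the idea only; `ToyVersionedTable` and its methods are hypothetical and do not correspond to the API of Delta Lake or any other product.

```python
import copy

class ToyVersionedTable:
    """In-memory toy: every successful commit appends a new snapshot to a log."""

    def __init__(self):
        self._log = [[]]  # version 0 is the empty table

    @property
    def version(self):
        return len(self._log) - 1

    def commit(self, new_rows):
        """Append rows atomically: all rows are validated first, so a bad
        batch leaves the table unchanged instead of partially written."""
        for row in new_rows:
            if "id" not in row:
                raise ValueError("every row needs an 'id'")
        snapshot = copy.deepcopy(self._log[-1]) + copy.deepcopy(new_rows)
        self._log.append(snapshot)
        return self.version

    def read(self, version=None):
        """Read the latest snapshot, or an earlier one ('time travel')."""
        index = self.version if version is None else version
        return copy.deepcopy(self._log[index])

table = ToyVersionedTable()
table.commit([{"id": 1}, {"id": 2}])              # creates version 1
try:
    table.commit([{"id": 3}, {"name": "no id"}])  # rejected as a whole
except ValueError:
    pass
table.commit([{"id": 4}])                         # creates version 2

print(table.read())           # latest version
print(table.read(version=1))  # earlier version
```

The failed batch leaves no trace, not even the valid `{"id": 3}` row, which is the all-or-nothing behaviour that atomicity refers to. Keeping every snapshot is wasteful in a toy like this; real systems would store only the changes between versions.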

# 2. Current Implementations

1. Databricks Delta Lake - An open source offering with both commercial and non-commercial options
2. Dremio - A commercial closed source offering
3. Snowflake - A commercial closed source offering
4. DataLakeHouse.io - A commercial closed source offering based on Snowflake

# Reference Materials
- [A brief history of data lakes](https://www.dataversity.net/brief-history-data-lakes/#)
- [Wikipedia - Data Warehouses](https://en.wikipedia.org/wiki/Data_warehouse#History)
- [Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics](http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf)