# Reviewing Architectures

### Introduction

We are moving through a couple of different patterns for accessing data for analytical purposes.  It's worth reviewing them.

### Data Lakes vs Data Warehouses

When we first stored our data in S3, we learned about storing data in a data lake.

### Data Lake

* A data lake is a repository or system of data, generally consisting of blobs or files.  It can include structured data (such as exports from an OLTP database), semi-structured data like JSON, unstructured data (HTML or emails), and binary data like images.

* Consumers of a data lake generally need technical skills, so they tend to be engineers or data scientists.

* One of the benefits of a data lake built on storage like S3 or Google Cloud Storage is that a lot of data can be stored cheaply.  However, this can also lead to disorganized data that is rarely used, which is where the term "data swamp" comes from.

* For querying data in S3 files, we used Athena.  However, PySpark (hosted on EMR, Elastic MapReduce) can also be used for larger datasets.
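
The points above can be sketched in miniature. The example below is only an illustration: a temporary local directory stands in for an S3 bucket, and the folder names, file names, and the `read_object` and `build_demo_lake` helpers are all hypothetical. It shows the "schema-on-read" idea behind a data lake, where files of different shapes are stored as-is and only parsed when someone reads them.

```python
import csv
import io
import json
import tempfile
from pathlib import Path


def read_object(path: Path):
    """Parse a file at read time based on its extension (schema-on-read)."""
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".csv":
        return list(csv.DictReader(io.StringIO(text)))
    if path.suffix == ".json":
        return json.loads(text)
    return text  # unstructured content (HTML, emails) is returned as raw text


def build_demo_lake(root: Path) -> None:
    """Write three differently shaped files under hypothetical prefixes."""
    for prefix in ("orders", "events", "pages"):
        (root / prefix).mkdir()
    (root / "orders" / "2024-01-01.csv").write_text(
        "order_id,amount\n1,20.50\n2,5.00\n", encoding="utf-8"
    )
    (root / "events" / "click.json").write_text(
        json.dumps({"user": "ana", "action": "click"}), encoding="utf-8"
    )
    (root / "pages" / "home.html").write_text(
        "<html><body>Welcome</body></html>", encoding="utf-8"
    )


# A temporary directory stands in for the bucket (an assumption of this sketch).
with tempfile.TemporaryDirectory() as tmp:
    lake = Path(tmp)
    build_demo_lake(lake)
    orders = read_object(lake / "orders" / "2024-01-01.csv")
    event = read_object(lake / "events" / "click.json")
    page = read_object(lake / "pages" / "home.html")

# Structure is imposed only now, by the reader of the data.
total = sum(float(row["amount"]) for row in orders)
print(total)
```

Notice that nothing stopped us from writing any kind of file into the lake. That flexibility is the benefit, and it is also how a lake can drift into a swamp.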

### Data Warehouse

* With a data warehouse, data is typically stored in a relational database designed for analytics, like Redshift or Snowflake.  Here, the data is structured, just like it would be in Postgres.

* Consumers of data in a data warehouse can be less technical, like business analysts or marketing teams, as the data is more accessible through SQL or a data dashboard.

* A downside of data warehousing can be cost.  Remember that with a data lake we can simply store data in blob storage (S3 files), separating storage from compute.  With a data warehouse like Redshift, however, organizations typically pay for a more costly server.

* While storing relational data provides more order, it can also be costly to transform the data into that structure before loading it.
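
To make the contrast with the lake concrete, here is a minimal sketch in which an in-memory SQLite database stands in for a warehouse like Redshift or Snowflake. That substitution is an assumption made so the example runs anywhere, and the `orders` table and its columns are hypothetical. It shows the two properties described above: the structure is enforced when data is written, and the data is then accessible with plain SQL.

```python
import sqlite3

# SQLite stands in for the warehouse (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        region   TEXT NOT NULL,
        amount   REAL NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO orders (order_id, region, amount) VALUES (?, ?, ?)",
    [(1, "east", 20.0), (2, "west", 5.0), (3, "east", 10.0)],
)

# The kind of query an analyst or a dashboard might run.
revenue_by_region = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()

# A row that does not fit the schema is rejected at write time.
try:
    conn.execute(
        "INSERT INTO orders (order_id, region, amount) VALUES (4, NULL, 1.0)"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(revenue_by_region, rejected)
```

The rejected insert is the "order" a warehouse provides, and getting messy source data into a shape that passes those checks is where the transformation cost comes from.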

#### Warehouse with Staging Layer

* S3 can also be used as a staging layer, where data lands before it is loaded into a data warehouse.  Why store data in a staging layer before moving it to the data warehouse?

1. It could be useful for keeping historical snapshots of the data.
2. It can be useful for keeping unstructured data (like HTML or emails) or semi-structured data (like JSON) that may not be easily loaded into a tool like Redshift.

In this way, the raw data is stored first, and the relevant components are extracted from it afterward.
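
Here is a rough sketch of that flow. A temporary directory stands in for the S3 staging layer and an in-memory SQLite database stands in for the warehouse; both are assumptions made for illustration. The `stage_snapshot` and `load_snapshot` helpers, the file naming scheme, and the `users` table are hypothetical as well.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def stage_snapshot(staging_dir: Path, snapshot_date: str, records: list) -> Path:
    """Write the raw records, untouched, to a dated file in the staging layer."""
    path = staging_dir / f"users_{snapshot_date}.json"
    path.write_text(json.dumps(records), encoding="utf-8")
    return path


def load_snapshot(conn: sqlite3.Connection, path: Path) -> int:
    """Extract only the fields the warehouse table needs and upsert them."""
    records = json.loads(path.read_text(encoding="utf-8"))
    rows = [(r["id"], r["email"]) for r in records]
    conn.executemany(
        "INSERT OR REPLACE INTO users (id, email) VALUES (?, ?)", rows
    )
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")

day_one = [{"id": 1, "email": "ana@example.com", "profile": {"theme": "dark"}}]
day_two = [{"id": 1, "email": "ana@new.example.com", "profile": {"theme": "light"}}]

with tempfile.TemporaryDirectory() as tmp:
    staging = Path(tmp)
    for date, records in [("2024-01-01", day_one), ("2024-01-02", day_two)]:
        load_snapshot(conn, stage_snapshot(staging, date, records))

    # The staging layer still holds every historical snapshot, nested fields included.
    staged_files = sorted(p.name for p in staging.glob("users_*.json"))
    first_snapshot = json.loads(
        (staging / staged_files[0]).read_text(encoding="utf-8")
    )

# The warehouse holds only the latest, flattened view.
warehouse_rows = conn.execute("SELECT id, email FROM users").fetchall()
print(staged_files, warehouse_rows)
```

Both reasons from the list show up here: the old email address survives in the first snapshot even though the warehouse row was overwritten, and the nested `profile` field stays in staging without ever needing a warehouse column.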

### The Lakehouse

The lakehouse is meant to provide the best of both worlds: cheaper storage for less structured data (for use by data scientists), alongside more structured data in a data warehouse.  The idea is that both of these data sources are catalogued by the same service, like AWS Lake Formation.
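
The shared catalog idea can be pictured with a toy example. This is not the AWS Lake Formation API; it is a hypothetical in-memory catalog, and the `CatalogEntry` class, the dataset names, and the locations are all made up for illustration. The point is only that one lookup covers datasets living in the lake and datasets living in the warehouse.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    store: str        # "lake" or "warehouse"
    location: str     # hypothetical S3 prefix or table name
    data_format: str


# One catalog describing both kinds of storage.
CATALOG = [
    CatalogEntry("raw_events", "lake", "s3://example-bucket/events/", "json"),
    CatalogEntry("web_pages", "lake", "s3://example-bucket/pages/", "html"),
    CatalogEntry("orders", "warehouse", "analytics.public.orders", "table"),
]


def find(name: str) -> CatalogEntry:
    """Look up a dataset by name, wherever it happens to be stored."""
    for entry in CATALOG:
        if entry.name == name:
            return entry
    raise KeyError(name)


def names_in(store: str) -> list:
    """List the dataset names held in one kind of storage."""
    return [entry.name for entry in CATALOG if entry.store == store]


print(find("orders").location)
print(names_in("lake"))
```

A data scientist and a business analyst would both start from the same catalog, even though one ends up reading files and the other ends up querying a table.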

Below is a diagram of the AWS lakehouse.  See if you can make sense of the architecture.  Feel free to read more about it [here](https://aws.amazon.com/blogs/big-data/build-a-lake-house-architecture-on-aws/).

<img src="./aws-lakehouse.png">

### Downsides to the Lakehouse Approach

Of course, there are downsides to the lakehouse approach too.  Here are some of them.

* Complexity: The lakehouse architecture can introduce complexity to your data management ecosystem. Integrating data lakes and data warehouses requires careful design and implementation, and it demands additional skills from the team.

* Data Quality and Governance: As data lakes allow for the ingestion of raw, unstructured data, ensuring data quality and governance can be more difficult in a lakehouse architecture. 

* Performance Trade-offs: While a lakehouse architecture offers flexibility in storing and querying diverse data types, it may introduce performance trade-offs. As data volumes grow, complex queries spanning both structured and unstructured data can impact query performance. 

* Vendor Lock-in: Depending on the specific technologies and cloud providers chosen for the lakehouse architecture, there may be a risk of vendor lock-in. If you heavily rely on proprietary tools or specific cloud services, it could limit your ability to switch providers or adapt to evolving technologies in the future.

### Resources

[AWS Lakehouse](https://aws.amazon.com/blogs/big-data/build-a-lake-house-architecture-on-aws/)

[Snowflake on Lakehouse](https://www.snowflake.com/guides/what-data-lakehouse)

[What and Why staging](https://www.startdataengineering.com/post/what-and-why-staging/)

[Datalake vs staging](https://qmetrix.com.au/data-lake-vs-staging-layer-difference/)

[McKinsey Datalake](https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/a-smarter-way-to-jump-into-data-lakes)