# Normalization versus De-Normalization

- Normalization: splitting a table into smaller, related tables to reduce redundancy and protect data integrity
- De-Normalization: keeping data together in fewer, wider tables instead of splitting it up, which makes reads simpler but increases the risk of redundant and inconsistent data

# The First Three Normal Forms (1NF to 3NF)

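- 1st Normal Form (1NF): Every column holds a single atomic value, with no repeating groups and no duplicate rows
- 2nd Normal Form (2NF): The table is in 1NF and every non-key column depends on the whole primary key, not just on part of a composite key
- 3rd Normal Form (3NF): The table is in 2NF and no non-key column depends on another non-key column (no transitive dependencies), so each fact is stored in only one place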
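
As a minimal sketch of the forms above, the snippet below uses Python's built-in `sqlite3` module to move a flat table into 3NF. The tables (`orders_flat`, `customers`, `orders`) and their sample rows are hypothetical and are not taken from the diagrams in this document. In the flat table the customer's name and city depend on `customer_id` rather than on the order, which is the kind of transitive dependency 3NF removes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical flat table: customer_name and customer_city depend on
    -- customer_id, not on the order itself (a transitive dependency).
    CREATE TABLE orders_flat (
        order_id      INTEGER PRIMARY KEY,
        customer_id   INTEGER,
        customer_name TEXT,
        customer_city TEXT,
        amount        REAL
    );
    INSERT INTO orders_flat VALUES
        (1, 10, 'Ada',   'London',   25.0),
        (2, 10, 'Ada',   'London',   40.0),
        (3, 11, 'Grace', 'New York', 15.0);

    -- 3NF: customer attributes are stored once, keyed by customer_id.
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        city        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        amount      REAL NOT NULL
    );
    INSERT INTO customers
        SELECT DISTINCT customer_id, customer_name, customer_city
        FROM orders_flat;
    INSERT INTO orders
        SELECT order_id, customer_id, amount FROM orders_flat;
""")

# Each customer now appears exactly once.
customer_rows = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]

# A join rebuilds the original flat view, so no information was lost.
rebuilt = conn.execute("""
    SELECT o.order_id, c.customer_id, c.name, c.city, o.amount
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()
original = conn.execute("SELECT * FROM orders_flat ORDER BY order_id").fetchall()
```

Comparing `rebuilt` with `original` is a quick way to check that a decomposition is lossless before dropping the flat table.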

### Not in 3NF:

![alt text](images/Poor_Design.png "Bad Design")

### In 3NF:

![alt text](images/Better_Design.png "Better Design")

# YouTube Videos

- [Designing a traditional relational database](https://www.youtube.com/watch?v=I_rxqSJAj6U)
- [Data Warehouse Design](https://www.youtube.com/watch?v=--OJpdPeH80)
- [ETL Design](https://youtu.be/sLhInuwdwcc)
- [Star Schema vs Snowflake](https://youtu.be/Qq4yhhAk9fc)
- [OLAP vs OLTP](https://youtu.be/AiZWeSUjylU)
- [Slowly Changing Dimensions](https://youtu.be/1FZ7et0pN4c)

# Architectural Considerations

- [Architectural Considerations](https://medium.com/ssense-tech/principled-data-engineering-part-i-architectural-overview-6d4bdf89b657)
- Database
- Data Warehouse
- Data Lake
    - **Raw Data:** A “raw data” bucket which contains the raw, untransformed, and lossless incoming data from all our sources such as event streams from microservices, transactional database snapshots, log dumps, and data from third party sources via FTP or API calls. Data here can be in various formats such as CSV, JSON, flat text files etc., and may not have defined schemas. This bucket plays the key role of guaranteeing no data loss and replayability of our pipelines. In other words, if our pipelines downstream change or fail, our raw data store guarantees that all the original source data remains intact and available for re-processing.
    - **Interim Data:** An “interim data” bucket which imports the aforementioned raw data and performs the most minimal transformations required to homogenize its structure, impose schemas, and allow cataloging. A format that enforces a schema, such as Parquet, is recommended here. This eliminates a lot of dangerous ambiguities of schema-less data, which can lead to data loss and poor governance. The minimal transformations performed at this step also allow for some critical type management such as homogenizing date formats and number types (decimals, floats, doubles, etc.), and handling null values.
    - **Business Data:** A “business data” bucket which presents transformed datasets to end-users. Data here conforms to semantically meaningful naming conventions and each dataset corresponds to a specific business need. Furthermore, the datasets here have more refined schemas that make sense to our end-users — the consumers of this data.
- Pipeline
    - Desired characteristics:
        - Replayable or reproducible
        - Idempotent: running the pipeline several times produces the same result as running it once, which prevents duplicate or corrupt data
    - Types of pipelines:
        - ETL: In an ETL pipeline, data is extracted from a source, transformed to the required shape, and inserted into the target. The advantage of ETL is that data enters your system in the shape you want it to be in, and can easily be modeled for analytics. This works exceedingly well when the data is coming from consistent and trusted sources, but can quickly become too brittle in cases where there is a possibility of a schema change, or the data cannot be re-extracted. Imagine a scenario wherein data is extracted from a third-party API, transformed and then loaded into a data warehouse. Some time later the process needs to be rerun, but the source data is no longer available because the third party aggregates its data after a certain amount of time (to reduce storage fees). In other words, the original data is lost because your system never kept an immutable copy of it.
        - ELT: ELT addresses this by extracting and loading the raw data immediately into storage. From there, you can rerun the transformation portion of the pipelines to your heart’s content. The drawback here is that the raw data can potentially be schemaless and unstructured. Managing this raw data becomes the main difficulty in this configuration, and in an era of ever increasing compliance, proper cataloging is critical.
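
The raw-to-interim step and the idempotency requirement described above can be sketched together in a few lines. This is only an illustration under stated assumptions: `RAW_EVENTS`, `to_interim`, `run_pipeline`, and the `payments` table are hypothetical names, an in-memory list stands in for the raw data bucket, and SQLite stands in for the target store. The raw payloads are never modified, the interim step homogenizes date and number types, and the load uses a keyed `INSERT OR REPLACE` so that replaying the pipeline leaves the same final state.

```python
import json
import sqlite3
from datetime import datetime

# "Raw" zone: payloads are kept exactly as received (hypothetical samples).
RAW_EVENTS = [
    '{"id": 1, "ts": "2024-01-05", "amount": "19.90"}',
    '{"id": 2, "ts": "05/01/2024", "amount": 7}',
    '{"id": 1, "ts": "2024-01-05", "amount": "19.90"}',  # duplicate delivery
]


def to_interim(raw_line: str) -> tuple[int, str, float]:
    """Impose a schema: integer id, ISO-8601 date string, float amount."""
    record = json.loads(raw_line)
    ts = record["ts"]
    # Try each date format we assume the sources may use.
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            day = datetime.strptime(ts, fmt).date().isoformat()
            break
        except ValueError:
            continue
    else:
        raise ValueError(f"unrecognized date format: {ts!r}")
    return int(record["id"]), day, float(record["amount"])


def run_pipeline(conn: sqlite3.Connection) -> None:
    """Transform the raw events and load them with a keyed upsert."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments ("
        "id INTEGER PRIMARY KEY, day TEXT NOT NULL, amount REAL NOT NULL)"
    )
    rows = [to_interim(line) for line in RAW_EVENTS]
    # Replacing by primary key means reruns and duplicates cannot add rows.
    conn.executemany(
        "INSERT OR REPLACE INTO payments (id, day, amount) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
run_pipeline(conn)
first = conn.execute("SELECT * FROM payments ORDER BY id").fetchall()
run_pipeline(conn)  # replay from the same raw data
second = conn.execute("SELECT * FROM payments ORDER BY id").fetchall()
```

Because the raw events are left untouched, the transformation in `to_interim` can be changed and rerun at any time, which is the property the ELT description relies on.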

# Data Governance

- [Data Governance](https://medium.com/ssense-tech/principled-data-engineering-part-ii-data-governance-30297abb2446)

# OLAP

- Typically modeled as a star schema
- De-normalized (fact and dimension tables)
- Faster analysis and search because related data is already combined into fewer, wider tables
- Requires more data storage due to redundancy
- Simpler joins
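
A star schema of the kind listed above can be sketched with Python's built-in `sqlite3` module. The table names (`fact_sales`, `dim_product`, `dim_date`) and the sample rows are hypothetical. The point is the shape: one fact table holding the measures, one join per dimension, and an aggregate on top.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables hold descriptive attributes (hypothetical names).
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        name        TEXT,
        category    TEXT
    );
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY,
        year     INTEGER,
        month    INTEGER
    );
    -- The fact table holds measures plus one foreign key per dimension.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        date_key    INTEGER REFERENCES dim_date(date_key),
        quantity    INTEGER,
        revenue     REAL
    );
    INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'),
                                   (2, 'Phone',  'Electronics'),
                                   (3, 'Desk',   'Furniture');
    INSERT INTO dim_date VALUES (20240101, 2024, 1), (20240201, 2024, 2);
    INSERT INTO fact_sales VALUES (1, 20240101, 2, 2000.0),
                                  (2, 20240101, 1, 500.0),
                                  (3, 20240201, 4, 800.0),
                                  (1, 20240201, 1, 1000.0);
""")

# A typical analytical query: one join per dimension, then aggregate.
revenue_by_category = conn.execute("""
    SELECT p.category, d.month, SUM(f.revenue) AS revenue
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_date    AS d ON d.date_key = f.date_key
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()
```

Note that `category` is stored directly on `dim_product` rather than in its own table. That repetition is the redundancy the list above mentions, and it is what keeps the query down to one join per dimension.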

# OLTP

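- Typically modeled as a normalized relational schema (the snowflake schema applies the same normalization to the dimension tables of a star schema)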
- Normalized (1st to 3rd normal form)
- Faster inserts, updates, deletes, and improved data quality by reducing redundancy
- Requires less space due to reduction of redundancy
- Analytical queries are slower than on a star schema (OLAP) because more tables must be joined
- More complex joins
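
To illustrate why a normalized design makes updates cheaper and safer, the sketch below changes one customer's email in two hypothetical layouts: a normalized pair of tables (`customers`, `orders`) and a de-normalized `orders_wide` table. All names and rows are made up for the example, and `rowcount` is used only to show how many rows each update has to touch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized (OLTP-style): the email is stored once per customer.
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id)
    );
    -- De-normalized: the email is repeated on every order row.
    CREATE TABLE orders_wide (
        order_id INTEGER PRIMARY KEY, customer_id INTEGER, email TEXT
    );
    INSERT INTO customers VALUES (1, 'old@example.com');
    INSERT INTO orders VALUES (100, 1), (101, 1), (102, 1);
    INSERT INTO orders_wide VALUES (100, 1, 'old@example.com'),
                                   (101, 1, 'old@example.com'),
                                   (102, 1, 'old@example.com');
""")

with conn:  # commits on success, rolls back if an exception is raised
    normalized_rows = conn.execute(
        "UPDATE customers SET email = ? WHERE customer_id = ?",
        ("new@example.com", 1),
    ).rowcount
    denormalized_rows = conn.execute(
        "UPDATE orders_wide SET email = ? WHERE customer_id = ?",
        ("new@example.com", 1),
    ).rowcount
```

The normalized update touches a single row, while the de-normalized one has to rewrite every order for that customer. If one of those rows is missed, the data becomes inconsistent.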

# Tools

- ETL and workflow tools
    - AWS Glue
    - Apache Beam
    - Apache Airflow (Python)
    - Prefect (Python)
    - Papermill (Python)
- SQL-like tools
    - AWS Athena / PyAthena
    - Apache Spark SQL
    - Apache Hive
- Dataframe-like
    - PySpark dataframe
    - [Koalas](https://github.com/databricks/koalas) (since Spark 3.2, included in PySpark as the pandas API on Spark, `pyspark.pandas`)