# 01 - Introduction to Data Warehousing

# Introduction

The concept of Data Warehousing originated at IBM in the 1980s. The goal of the initial research was to provide a framework for transferring data from operational systems to business intelligence departments while avoiding the cost and technical challenges of high redundancy.

---

## Why Analysts cannot work directly on business databases

Business databases must stay clean at all costs: giving Data Analysts or Data Scientists direct access to them introduces a breach.

Moreover, most of the time, unstructured data (i.e., data not stored in any kind of database) is also required for an effective analysis.

A Warehousing solution allows the company to aggregate and store the data it needs for analysis, without altering the databases used for operations.

---

## Data Warehouse VS Data Lake

Both terms come up often when discussing Big Data; however, they are very different.

A Data Lake is a big pool of raw data with no defined purpose: we store this unstructured data in anticipation of future use.

A Data Warehouse holds **processed** and **structured** data, ready to be used for advanced analytics.

Most of the time, data that ends up in the Warehouse was previously stored in the Lake. 

- Step 1: Data is collected and stored in its raw form in a Data Lake.
- Step 2: Data is extracted from the Lake, cleaned and processed.
- Step 3: Data is loaded into the Warehouse, ready to be queried.
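The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample records, the `clean` and `load` helpers and the `orders` table are all made up, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3

# Step 1: raw records as they might sit in a Data Lake (made-up sample data).
RAW_LAKE_RECORDS = [
    {"order_id": "1", "amount": " 19.90 ", "country": "fr"},
    {"order_id": "2", "amount": "n/a", "country": "DE"},  # unusable amount
    {"order_id": "3", "amount": "5.00", "country": "de"},
]


def clean(records):
    """Step 2: keep only parsable rows and normalise their types."""
    cleaned = []
    for record in records:
        try:
            amount = float(record["amount"].strip())
        except ValueError:
            continue  # drop rows whose amount cannot be parsed
        cleaned.append((int(record["order_id"]), amount, record["country"].upper()))
    return cleaned


def load(rows, connection):
    """Step 3: load the structured rows into a warehouse table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL, country TEXT)"
    )
    connection.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    connection.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for a real warehouse
load(clean(RAW_LAKE_RECORDS), warehouse)

# The data is now structured and ready to be queried.
totals = warehouse.execute(
    "SELECT country, SUM(amount) FROM orders GROUP BY country ORDER BY country"
).fetchall()
print(totals)  # [('DE', 5.0), ('FR', 19.9)]
```

In a real pipeline the raw records would be read from object storage and the load step would target the warehouse itself; only the shape of the three steps carries over.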

---

## Data Warehouse VS traditional databases

Roughly, a Data Warehouse **is** a relational database. It's just a little more than that.

### Key differences:

1. The Warehouse can hold data from many databases
2. Any data stored in the Warehouse is stored for **analytics purposes only**
3. Data within a warehouse has been processed to simplify the analysis and avoid the need for SQL queries that spread over 300 lines
4. Whereas databases are optimized for extracting rows (or observations), data warehouses are optimized for operations on columns (or fields).

In a nutshell: warehouses are optimized for performant analysis.
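To make the row versus column distinction more concrete, here is a toy sketch in plain Python. It only illustrates the two layouts and their typical access patterns; the sample data and the `to_columns` helper are hypothetical, and real engines add compression, indexing and much more.

```python
# The same three observations, stored row by row (made-up sample data).
ROWS = [
    {"order_id": 1, "amount": 20.0, "country": "FR"},
    {"order_id": 2, "amount": 5.0, "country": "DE"},
    {"order_id": 3, "amount": 7.5, "country": "FR"},
]


def to_columns(rows):
    """Pivot row-oriented records into one list per field."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns


COLUMNS = to_columns(ROWS)

# Operational access pattern: fetch one complete observation.
second_order = ROWS[1]

# Analytical access pattern: aggregate one field over every observation.
# With the columnar layout this reads a single list and never touches
# the other fields.
total_amount = sum(COLUMNS["amount"])
print(second_order, total_amount)
```

The row layout keeps everything about one order together, which suits lookups and updates; the column layout keeps everything about one field together, which suits aggregations.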

**A warehouse is the perfect candidate for the `LOAD` destination in ETL pipelines.**

[A nice article on Alooma's blog](https://www.alooma.com/blog/database-vs-data-warehouse)

# Cloud vendors

- BigQuery, owned by Google, and part of the Google Cloud Platform
- Redshift, owned by Amazon and part of the AWS platform
- Snowflake
- ...

As always when choosing between vendors, the cost structure is one of the most important aspects to check. For instance, BigQuery storage is **much** cheaper than Redshift's, but querying data on a provisioned Redshift cluster is **free** of per-query charges (you pay for the cluster itself), whereas it costs about $5 per TB of data scanned on BigQuery. Depending on your needs, one solution might be more suitable than the other.
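As a rough illustration of per-query pricing, the arithmetic can be sketched as follows. The $5/TB default is simply the figure quoted above, and the function name and the workload are hypothetical; always check the vendor's current price list.

```python
def on_demand_query_cost(terabytes_scanned, price_per_tb=5.0):
    """Estimated cost in dollars of scanning `terabytes_scanned` TB."""
    return terabytes_scanned * price_per_tb


# Hypothetical workload: 200 queries a month, each scanning 0.25 TB.
monthly_cost = on_demand_query_cost(200 * 0.25)
print(monthly_cost)  # 250.0
```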

# Amazon Redshift

Redshift is the Data Warehousing solution from Amazon Web Services. Like every service in the AWS family, Redshift is **Cloud-based**: you only pay for compute and storage, and you don't have to take care of maintenance costs or of scaling the hardware to support an increasing load.

### Hands-on

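**Reading from Redshift into a PySpark DataFrame**

    # Fill in the {PORT}, {USER} and {PASSWORD} placeholders before running.
    # REDSHIFT_IAM_ROLE is the ARN of an IAM role, attached to the cluster,
    # that can access the S3 location given as `tempdir`.
    df = spark.read \
        .format("com.databricks.spark.redshift") \
        .option("url", "jdbc:redshift://example.coyf2i236.eu-central-1.redshift.amazonaws.com:{PORT}/agcdb?user={USER}&password={PASSWORD}") \
        .option("dbtable", "table_name") \
        .option("tempdir", "bucket") \
        .option("aws_iam_role", REDSHIFT_IAM_ROLE) \
        .load()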

**Writing to Redshift from a PySpark DataFrame**

    df.write \
        .format('com.databricks.spark.redshift') \
        .option('url', REDSHIFT_URI) \
        .option('dbtable', REDSHIFT_TABLE) \
        .option('aws_iam_role', REDSHIFT_IAM_ROLE) \
        .option('tempformat', 'csv') \
        .option('tempdir', REDSHIFT_TEMP_DIR) \
        .mode('error') \
        .save()

As Spark uses an S3 bucket to store the intermediary files, both Spark and Redshift need to have access to that bucket.

→ Ensure that the Redshift cluster has assumed an IAM role that gives it access to the `tempdir` S3 bucket (or use `forward_spark_s3_credentials` option)

→ By default, Spark uses the Avro format for the intermediary files in S3. Using CSV can significantly improve loading performance, and also allows columns to have names with characters other than ASCII letters.

→ By default, every `string` column is loaded into Redshift as a 256-byte `VARCHAR`. To gain performance or flexibility, it is possible to override this default by setting a `redshift_type` metadata entry on the DataFrame column. See docs below for implementation in Scala and Python.

[Amazon Redshift](https://docs.databricks.com/data/data-sources/aws/amazon-redshift.html#setting-a-custom-column-type)
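As a minimal sketch of the `redshift_type` metadata mentioned above, the helper below re-aliases each column to its own name with the metadata attached. It assumes `df` is a PySpark DataFrame; the helper name `with_redshift_types`, the column names and the widths are hypothetical, so refer to the linked documentation for the authoritative version.

```python
def with_redshift_types(df, column_types):
    """Return a DataFrame whose columns carry a `redshift_type` metadata entry.

    df           -- a PySpark DataFrame
    column_types -- dict mapping a column name to a Redshift type string
    """
    for name, redshift_type in column_types.items():
        metadata = {"redshift_type": redshift_type}
        # Column.alias accepts a `metadata` keyword; aliasing a column to
        # its own name attaches the metadata without changing the data.
        df = df.withColumn(name, df[name].alias(name, metadata=metadata))
    return df


# Hypothetical column names and widths, for illustration only.
COLUMN_TYPES = {
    "country_code": "VARCHAR(2)",
    "url": "VARCHAR(2083)",
}
# Usage, before calling df.write:
# df = with_redshift_types(df, COLUMN_TYPES)
```

Columns that are not listed in the mapping keep the default behavior described above.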
