# Cloud Computing | AWS

## What Is Cloud Computing?
Cloud computing: the practice of using a network of remote servers hosted on the Internet to store, manage, and process data, rather than a local server or a personal computer.

The arrival of cloud computing completely changed the way we deploy our technology, providing powerful access to instant and scalable computing power to enterprises, startups, and developers alike. Whether you need servers to host a web application, reliable storage for your data, or machines to train machine learning models, it's easy to see the advantage of relying on the cloud rather than utilizing your personal computer or local servers.

For one, you no longer have to invest in lots of hardware upfront. There is no need to worry about whether you are paying for more than you'll need, or what to do if you need to scale up substantially later on. Cloud computing makes this as easy as clicking a few buttons to scale your resources up or down.

Provisioning the resources you need through the cloud is also significantly faster than gathering and building the hardware required to provide the same support. This allows you and your team or company to develop and experiment at a much faster rate.

Lastly, you can provide efficient access to your applications around the world by spreading your deployments to multiple regions.

## Amazon Web Services (AWS)

Amazon Web Services is one of the largest providers in the cloud computing industry, with over 140 services in compute, storage, databases, networking, developer tools, security, and more. In this lesson, we'll learn about a few essential tools and services in AWS and practice using them. These services can be accessed in three different ways: the AWS Management Console, the Command Line Interface (CLI), or Software Development Kits (SDKs), which can be used in combination.

We'll start with the AWS Management Console, which is the web user interface. The AWS CLI is a useful way to control and automate your services with code, and SDKs allow you to easily integrate services with your applications through APIs built around specific languages and platforms.

## Implementing a Data Warehouse on AWS
Pre-requisites 
- Relational database design and SQL 
- Programming in Python 
- Dimensional modelling and creating OLAP cubes 
- Basic ETL in Python 
- AWS services: IAM, VPC, S3, EC2


#### **ETL PROCESS**

**Step 1:** Data Sources: Different types, skill sets, upgrades, locations, etc. (high heterogeneity)  
**Step 2:** ETL: Many processes, a "grid" of machines with different schedules and pipeline complexities  
**Step 3:** DWH: More resources need to be added as data increases. We have different workloads; some need one machine and some need many (scalability & elasticity)   
**Step 4:** Business Intelligence Apps & Visualizations: Also need a hybrid deployment of tools for interaction, reporting, visualizations, etc.  

#### Choices for Implementing a Data Warehouse

**On-premise:** 
1. Heterogeneity, scalability, elasticity of the tools, technologies, and processes 
2. Need for diverse IT staff skills & multiple locations.
3. Cost of ownership 

**On-cloud:** 
- Lower barrier to entry
- Add resources as you need them; it's OK to change your mind
- Scalability & elasticity out of the box 
- Operational cost might be high and heterogeneity/complexity won't disappear, but...   
    - **On-Cloud we have two options**
        - **Cloud-Managed:** Re-use of expertise; far fewer IT staff for security, upgrades, etc., and much lower OpEx. Deal with complexity using techniques like "Infrastructure as Code".
            - **(Amazon RDS, Amazon DynamoDB, Amazon S3)**
        - **Self-Managed:** Always "catch-all" option if needed 
            - **(EC2 + Postgresql, EC2 + Cassandra, EC2 + Unix FS)**
            
<img src="images/image13.png" alt="Drawing" style="width: 600px;"/>

## Amazon Redshift Technology (Postgres inside, with modifications)

- Most relational databases execute multiple queries in parallel if they have access to many cores/servers 
- However, every query is always executed on a single CPU of a single machine
- Acceptable for OLTP, mostly updates and few rows retrieval 

<img src="images/image14.png" alt="Drawing" style="width: 600px;"/>

- Massively Parallel Processing (MPP) databases parallelize the execution of one query on multiple CPUs/machines
- How? A table is partitioned and the partitions are processed in parallel
- Amazon Redshift is a cloud-managed, column-oriented, MPP database
- Other examples include Teradata Aster, Oracle Exadata and Azure SQL
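
As a rough illustration of the MPP idea (not of Redshift's internals), the sketch below splits a small in-memory "table" into partitions, aggregates each partition in its own worker, and then combines the partial results. The function names `partition` and `parallel_sum` are hypothetical and exist only for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_partitions):
    """Split rows into n_partitions chunks in round-robin order."""
    return [rows[i::n_partitions] for i in range(n_partitions)]


def parallel_sum(rows, n_partitions=4):
    """Sum a column by aggregating every partition in a separate worker."""
    parts = partition(rows, n_partitions)
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partial_sums = list(pool.map(sum, parts))
    # Combining the partial results plays the role of the coordinating node.
    return sum(partial_sums)


if __name__ == "__main__":
    sales = list(range(1, 101))  # stand-in for one numeric column
    print(parallel_sum(sales))
```

A single-query engine would scan all rows in one worker; here each worker only sees its own partition, which is the property MPP databases exploit.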

<img src="images/image15.png" alt="Drawing" style="width: 600px;"/>

## Redshift Architecture: The Cluster

Redshift Cluster:
- 1 Leader node
    - Coordinates compute nodes 
    - Handles external communication 
    - Optimizes query execution 

- 1+ Compute node
    - Each with own CPU, memory, and disk (determined by the node type)
    - Scale up: get more powerful nodes
    - Scale out: get more nodes 

    - **Node Slices:** 
        - Each compute node is logically divided into a number of slices 
        - A cluster with n slices can process n partitions of a table simultaneously
  
1. **The total number of nodes in a Redshift cluster is equal to:** The number of AWS EC2 instances used in the cluster 
2. **Each slice in a Redshift cluster is:** At least 1 CPU with dedicated storage and memory for the slice 
3. **If a Redshift cluster has 4 nodes, each containing 8 slices (32 slices in total), what is the maximum number of partitions per table?** 32 partitions
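
The arithmetic behind the last question can be sketched in a few lines. This is a simplified model that assumes every node has the same number of slices and that each slice handles one partition at a time; `total_slices` and `assign_partitions` are hypothetical helper names.

```python
def total_slices(n_nodes, slices_per_node):
    """Total number of slices offered by all compute nodes."""
    return n_nodes * slices_per_node


def assign_partitions(n_partitions, n_nodes, slices_per_node):
    """Map each partition index to a (node, slice) pair, one partition per slice."""
    capacity = total_slices(n_nodes, slices_per_node)
    if n_partitions > capacity:
        raise ValueError(f"{n_partitions} partitions exceed {capacity} slices")
    return {p: (p // slices_per_node, p % slices_per_node)
            for p in range(n_partitions)}


if __name__ == "__main__":
    print(total_slices(4, 8))  # 32
    placement = assign_partitions(32, n_nodes=4, slices_per_node=8)
    print(placement[31])       # (3, 7): last slice of the last node
```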

<img src="images/image16.png" alt="Drawing" style="width: 600px;"/>
<img src="images/image17.png" alt="Drawing" style="width: 600px;"/>
<img src="images/image18.png" alt="Drawing" style="width: 600px;"/>

## SQL to SQL ETL
<img src="images/image20.png" alt="Drawing" style="width: 600px;"/>
<img src="images/image21.png" alt="Drawing" style="width: 600px;"/>

## Redshift & ETL in Context
<img src="images/image22.png" alt="Drawing" style="width: 800px;"/>

**Ques:** Why do you think we might need to copy data already stored in S3 to another S3 staging bucket during the ETL process?  
**Ans:**  Because it would be transformed before insertion into the DWH 

## Ingesting at Scale: **Use COPY** 
- To transfer data from an S3 staging area to Redshift, use the **COPY** command
- Inserting data row by row using **INSERT** will be very slow
- If the file is large, it is better to break it up into **multiple files**
- Ingest in **Parallel**
    - Either using a **common prefix**
    - Or a **manifest file**
- Other considerations:
    - Better to ingest from the same AWS region
    - Better to compress all the CSV files
        - One can also specify the delimiter to be used 


### Redshift ETL Examples: 

#### Common Prefix
<img src="images/image23.png" alt="Drawing" style="width: 800px;"/>
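
A minimal sketch of what a common-prefix load might look like, written as a Python helper that only builds the SQL text. The table name, bucket path, role ARN, and the helper name `build_copy_sql` are all hypothetical placeholders; `COPY`, `IAM_ROLE`, `REGION`, `DELIMITER`, and `GZIP` are Redshift keywords. The helper does no SQL escaping, so it should only be fed trusted values.

```python
def build_copy_sql(table, s3_prefix, iam_role_arn, region,
                   delimiter=",", gzip=True):
    """Build a COPY statement that loads every S3 object sharing s3_prefix."""
    parts = [
        f"COPY {table}",
        f"FROM '{s3_prefix}'",          # all keys starting with this prefix are loaded
        f"IAM_ROLE '{iam_role_arn}'",   # role the cluster assumes to read from S3
        f"REGION '{region}'",
        f"DELIMITER '{delimiter}'",
    ]
    if gzip:
        parts.append("GZIP")            # input files are gzip-compressed
    return "\n".join(parts) + ";"


if __name__ == "__main__":
    print(build_copy_sql(
        table="sporting_event_ticket",                      # hypothetical table
        s3_prefix="s3://my-bucket/tickets/split/part",      # hypothetical prefix
        iam_role_arn="arn:aws:iam::123456789012:role/myRedshiftRole",
        region="us-west-2",
        delimiter=";",
    ))
```

Because every split file shares the prefix, a single statement is enough and the cluster can load the files in parallel.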

#### Manifest File
<img src="images/image24.png" alt="Drawing" style="width: 800px;"/>
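
When the files do not share a prefix, a manifest lists them explicitly. The sketch below builds a manifest in the JSON shape Redshift expects (an `entries` list of `url`/`mandatory` objects) and the matching `COPY ... MANIFEST` statement. Bucket names, the role ARN, and the helper names are hypothetical.

```python
import json


def build_manifest(urls, mandatory=True):
    """Return manifest JSON listing each S3 object that COPY should load."""
    entries = [{"url": url, "mandatory": mandatory} for url in urls]
    return json.dumps({"entries": entries}, indent=2)


def build_manifest_copy_sql(table, manifest_url, iam_role_arn):
    """Build a COPY statement that reads its file list from a manifest."""
    return (
        f"COPY {table}\n"
        f"FROM '{manifest_url}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "MANIFEST;"
    )


if __name__ == "__main__":
    manifest = build_manifest([
        "s3://my-bucket-alpha/2013-10-04-custdata",   # hypothetical objects
        "s3://my-bucket-beta/2013-10-05-custdata",
    ])
    print(manifest)  # upload this text to S3, then point COPY at it
    print(build_manifest_copy_sql(
        "customer",
        "s3://my-bucket/cust.manifest",
        "arn:aws:iam::123456789012:role/myRedshiftRole",
    ))
```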


### Redshift ETL Automatic Compression Optimization 
- The optimal compression strategy for each column type is different 
- Redshift gives the user control over the compression of each column 
- The COPY command makes automatic best-effort compression decisions for each column 

### ETL from Other Sources 
- It is also possible to ingest directly using SSH from EC2 machines
- Other than that:
    - S3 needs to be used as a staging area
    - Usually, an EC2 ETL worker needs to run the ingestion jobs, orchestrated by a dataflow product like Airflow, Luigi, NiFi, StreamSets, or AWS Data Pipeline

### ETL Out of Redshift 
- Redshift is accessible, like any relational database, as a JDBC/ODBC source 
    - Naturally used by BI apps 
- However, we may need to extract data out of Redshift to pre-aggregated OLAP cubes

<img src="images/image25.png" alt="Drawing" style="width: 800px;"/>
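
One way to extract query results out of Redshift is its `UNLOAD` command, which writes the result of a query to files in S3. The helper below only assembles the SQL text; the query, S3 prefix, role ARN, and the name `build_unload_sql` are hypothetical. Single quotes inside the query are doubled because the whole query is passed to `UNLOAD` as a quoted string.

```python
def build_unload_sql(query, s3_prefix, iam_role_arn):
    """Build an UNLOAD statement that exports a query result to S3."""
    escaped = query.replace("'", "''")  # the query itself sits inside quotes
    return (
        f"UNLOAD ('{escaped}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}';"
    )


if __name__ == "__main__":
    print(build_unload_sql(
        "SELECT region, SUM(amount) FROM sales WHERE year = '2018' GROUP BY region",
        "s3://my-bucket/cubes/sales_by_region_",          # hypothetical prefix
        "arn:aws:iam::123456789012:role/myRedshiftRole",  # hypothetical role
    ))
```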

### Building a Redshift Cluster: More Details 
- The cluster created by the Quick Launcher is a fully-functional one, but we need more functionality... 
- Security: 
    - The cluster is accessible only from the virtual private cloud 
    - We need to access it from our Jupyter workspace
- Access to S3:
    - The cluster needs to access an S3 bucket

### Configuring Redshift for S3 and external access 
- Naturally, we can accomplish our goal by going through lots of screenshots or "click-around" instructions
- That said, we take this as an opportunity to introduce an important technique for modern data engineers, namely Infrastructure-as-Code (IaC)
- An advantage of being in the cloud is the ability to create infrastructure, i.e. machines, users, roles, folders and processes, using code
- IaC lets you automate, maintain, deploy, replicate and share complex infrastructures as easily as you maintain code (undreamt of in an on-premise deployment), e.g. "creating a machine is as easy as opening a file"
- To be honest, IaC sits on the border between data engineering and DevOps

### We have a number of options to achieve IaC on AWS 
- AWS CLI scripts
- AWS SDKs
- AWS CloudFormation (templates)

### Which of the following are advantages of Infrastructure-as-Code over creating infrastructure by clicking-around? 
- Sharing: One can share all the steps with others easily 
- Reproducibility: One can be sure that no steps are forgotten
- Multiple deployments: One can create a test environment identical to the production environment
- Maintainability: If a change is needed, one can keep track of the changes by comparing the code


### Boto3 
Boto3 is the Python SDK for programmatically accessing AWS. It enables developers to create, configure, and manage AWS services. See the official Boto3 documentation for details.
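
As a hedged sketch of IaC with Boto3, the code below assembles the keyword arguments for the Redshift client's `create_cluster` call and keeps the actual AWS call in a separate function. The cluster identifier, database name, credentials, role ARN, node type, and region are placeholder assumptions, and `boto3` is imported lazily so the parameter helper can be tried without AWS access.

```python
def redshift_cluster_params(cluster_id, db_name, user, password, role_arn,
                            node_type="dc2.large", num_nodes=4):
    """Assemble keyword arguments for the Redshift create_cluster API call."""
    params = {
        "ClusterIdentifier": cluster_id,
        "DBName": db_name,
        "MasterUsername": user,
        "MasterUserPassword": password,
        "NodeType": node_type,
        "IamRoles": [role_arn],  # role that grants the cluster access to S3
    }
    if num_nodes > 1:
        params["ClusterType"] = "multi-node"
        params["NumberOfNodes"] = num_nodes
    else:
        params["ClusterType"] = "single-node"
    return params


def create_cluster(params, region="us-west-2"):
    """Create the cluster; needs boto3 installed and AWS credentials configured."""
    import boto3  # third-party SDK, imported only when a real call is made
    client = boto3.client("redshift", region_name=region)
    return client.create_cluster(**params)


if __name__ == "__main__":
    params = redshift_cluster_params(
        cluster_id="dwh-cluster",   # hypothetical values throughout
        db_name="dwh",
        user="dwhuser",
        password="CHANGE_ME",       # read from a secret store in practice
        role_arn="arn:aws:iam::123456789012:role/dwhRole",
    )
    print(sorted(params))
```

Keeping the parameters in code means the same cluster definition can be reviewed, versioned, and re-created for a test environment.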

### Optimizing Table Design

- When a table is partitioned up into many pieces and distributed across slices in different machines, this is done blindly 
- If one has an idea about the frequent access pattern of a table, one can choose a more clever strategy 
- The 2 possible strategies are: 
    - Distribution Style 
        - EVEN distribution  
        - ALL distribution 
        - AUTO distribution 
        - KEY distribution 
    - Sorting key 



## Distribution Style: EVEN
<img src="images/image26.png" alt="Drawing" style="width: 800px;"/>
<img src="images/image27.png" alt="Drawing" style="width: 800px;"/>
<img src="images/image28.png" alt="Drawing" style="width: 800px;"/>
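
EVEN distribution assigns rows to slices in round-robin order, regardless of their values. The toy model below shows the resulting balance; `distribute_even` is a hypothetical name and the code says nothing about how Redshift implements this internally.

```python
def distribute_even(rows, n_slices):
    """Assign rows to slices round-robin, ignoring the row contents."""
    slices = [[] for _ in range(n_slices)]
    for i, row in enumerate(rows):
        slices[i % n_slices].append(row)
    return slices


if __name__ == "__main__":
    rows = list(range(10))  # stand-in for ten table rows
    for slice_id, content in enumerate(distribute_even(rows, 4)):
        print(slice_id, content)
```

Slice sizes differ by at most one row, so the load is balanced, but rows that join with each other may land on different slices.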

## Distribution Style: ALL
<img src="images/image29.png" alt="Drawing" style="width: 800px;"/>
<img src="images/image30.png" alt="Drawing" style="width: 800px;"/>

## Distribution Style: AUTO
- Leave decision to Redshift 
- "Small enough" tables are distributed with an ALL strategy 
- Large tables are distributed with EVEN strategy 


## Distribution Style: KEY
<img src="images/image31.png" alt="Drawing" style="width: 800px;"/>

- Rows with the same dist key value are placed in the same slice
- This can lead to a skewed distribution if some values of the dist key are more frequent than others
- However, it is very useful when a dimension table is too big to be distributed with the ALL strategy. In that case, we distribute both the fact table and the dimension table using the same dist key.
- If two tables are distributed on the joining keys, Redshift collocates the rows from both tables on the same slices
<img src="images/image32.png" alt="Drawing" style="width: 800px;"/>
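
The effect of KEY distribution can be sketched with a toy hash-based placement. Redshift's real hash function is not shown in these notes, so `zlib.crc32` is used here purely as a deterministic stand-in, and `slice_for_key` and `distribute_by_key` are hypothetical names.

```python
import zlib


def slice_for_key(key, n_slices):
    """Pick a slice from the dist key value (crc32 is a stand-in hash)."""
    return zlib.crc32(str(key).encode("utf-8")) % n_slices


def distribute_by_key(rows, key_name, n_slices):
    """Place every row on the slice chosen by its dist key value."""
    slices = [[] for _ in range(n_slices)]
    for row in rows:
        slices[slice_for_key(row[key_name], n_slices)].append(row)
    return slices


if __name__ == "__main__":
    # Product 1 is far more frequent than the others, so its slice is overloaded.
    facts = [{"product_key": k, "amount": 10} for k in [1, 1, 1, 1, 2, 3]]
    dims = [{"product_key": k, "name": f"product-{k}"} for k in [1, 2, 3]]
    fact_slices = distribute_by_key(facts, "product_key", 4)
    dim_slices = distribute_by_key(dims, "product_key", 4)
    print([len(s) for s in fact_slices])  # uneven sizes indicate skew
    print([len(s) for s in dim_slices])
```

Because both tables use the same key and the same placement function, a fact row and its matching dimension row always end up on the same slice, so the join needs no data movement between slices.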

### Sorting Key 
- One can define one or more of a table's columns as its sort key
- Upon loading, rows are sorted before distribution to slices
- Minimizes the query time since each node already has contiguous ranges of rows based on the sorting key
- Useful for columns that are used frequently in sorting, like the date dimension and its corresponding foreign key in the fact table
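
To make the distribution style and sort key options concrete, the helper below generates a `CREATE TABLE` statement with `DISTSTYLE`, `DISTKEY`, and `SORTKEY` clauses. The table and column names are invented for illustration, `build_create_table` is a hypothetical helper, and the code only produces SQL text; it does not validate column types or connect to a cluster.

```python
def build_create_table(name, columns, diststyle="AUTO", distkey=None, sortkey=None):
    """Build Redshift DDL with distribution style and optional sort key."""
    diststyle = diststyle.upper()
    if diststyle not in {"AUTO", "EVEN", "ALL", "KEY"}:
        raise ValueError(f"unknown distribution style: {diststyle}")
    if (diststyle == "KEY") != (distkey is not None):
        raise ValueError("distkey must be given exactly when diststyle is KEY")
    cols = ",\n".join(f"    {col} {col_type}" for col, col_type in columns)
    ddl = f"CREATE TABLE {name} (\n{cols}\n)\nDISTSTYLE {diststyle}"
    if distkey:
        ddl += f"\nDISTKEY ({distkey})"
    if sortkey:
        ddl += f"\nSORTKEY ({', '.join(sortkey)})"
    return ddl + ";"


if __name__ == "__main__":
    print(build_create_table(
        "fact_sales",                      # hypothetical fact table
        [("date_key", "INTEGER"),
         ("product_key", "INTEGER"),
         ("amount", "DECIMAL(10,2)")],
        diststyle="KEY",
        distkey="product_key",             # same key as the large dimension table
        sortkey=["date_key"],              # frequently filtered and sorted on
    ))
```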
