<a href="https://colab.research.google.com/github/rzl-ds/gu511/blob/master/012_dbs_5_redshift.ipynb" target="_parent">
    <img src="https://colab.research.google.com/assets/colab-badge.svg"/>
</a>

# amazon `redshift`

##### cheating ahead

starting up a cluster is simple, but takes a long time. let's get started asap.

1. open the "Amazon `redshift`" service from the console
1. click the orange "create cluster" button
1. choose a unique cluster identifier (this is the cluster's name)
1. select the "Free trial" option.
    + **make sure** the instance type is `dc2.large`, and the number of nodes is 1
1. pick names and passwords
1. click **off** the "use defaults" for additional configurations
1. under "Network and security", change "Publicly accessible" to `Yes`
1. under "Backup", change the "Snapshot retention" value to a `Custom value` of `0` days
1. click "Create Cluster"

## columnar databases

recall from the overview lecture on databases that **columnar** databases are databases in which we have decided to store data internally as *columns* of data rather than *rows* of data.

we gave a hand-wavy description of this as a transposed `csv` format:

```
10,12,11,22,...
Smith,Jones,Johnson,Jones,...
Joe,Mary,Cathy,Bob,...
40000,50000,44000,55000,...
```

which to a computer really looks like one long string with a few long chunks that are all covering the same material:

```
10,12,11,22,...|Smith,Jones,Johnson,Jones,...|Joe,Mary,Cathy,Bob,...|40000,50000,44000,55000,...
```

this storage decision has significant implications for the way we interact with our database. in particular, a database set up this way should be **much better** at performing analytical queries and aggregations (any query which performs aggregations across an entire dataset, or relies on a subset of all the columns in the database).
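to make that concrete, here is a minimal sketch in plain `python` (not how `redshift` stores anything internally) that holds the four example records both ways and counts how many values each layout has to touch to average one column. the column names `id`, `last_name`, `first_name`, and `salary` are made up for illustration.

```python
# the same four records stored two ways (values taken from the example above;
# the column names are assumptions made for illustration)
rows = [
    (10, "Smith", "Joe", 40000),
    (12, "Jones", "Mary", 50000),
    (11, "Johnson", "Cathy", 44000),
    (22, "Jones", "Bob", 55000),
]

# columnar layout: one contiguous list per column
columns = {
    "id": [r[0] for r in rows],
    "last_name": [r[1] for r in rows],
    "first_name": [r[2] for r in rows],
    "salary": [r[3] for r in rows],
}

# row store: every field of every record is read to average one column
values_touched_row_store = sum(len(r) for r in rows)
avg_from_rows = sum(r[3] for r in rows) / len(rows)

# column store: only the salary column is read
values_touched_column_store = len(columns["salary"])
avg_from_columns = sum(columns["salary"]) / len(columns["salary"])

print(avg_from_rows, avg_from_columns)                        # 47250.0 47250.0
print(values_touched_row_store, values_touched_column_store)  # 16 4
```

same answer, a quarter of the reads. the gap grows with the number of columns the query does *not* need.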

this means they are great (and common, importantly) choices for data warehouses

`aws`'s first and most popular data warehouse is `redshift`, a database which primarily supports columnar data storage (but has more general application, as well)

## reasons to use `redshift`

### distributing storage and computing

a columnar database *could* be operated on one single server, just like any other relational database. the default setup of `redshift` -- to have one node -- is operating just like this.

that being said, `redshift` is built to scale out to include any number of worker nodes. these workers allow you to store more `redshift` data (by storing across several different hard drives) and to perform more computations (using several different CPUs).

#### distributed storage

when you send a new record to a `redshift` table, `redshift` will write values to some existing columnar store. in a single-worker environment, it simply appends to the column storage file on that worker's hard drive.

in a distributed setting with multiple workers, you have a few options for how you distribute those incoming records. `redshift` calls these "distribution styles" and offers a few alternatives:

1. `ALL`: every record will be written to every worker
1. `EVEN`: records are dealt out to the workers in round-robin fashion, regardless of their values
1. `KEY`: every record will be sent to a pre-assigned worker based on a particular column's value
    + particularly useful if you plan to `join` on that column, so that you don't have to share records between workers (can join on the same server)
    + this will be good if you have a column with high cardinality and even distribution of values (e.g. unique customer `id`s)
    + this will be very bad if you have "data skew" (certain values with many more records)
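the three styles can be sketched in a few lines of `python`. this is a toy model only: the worker count, the `customer_id` column, and the use of `crc32` as the hash are all assumptions for illustration, not what `redshift` actually does internally.

```python
import zlib
from itertools import cycle

N_WORKERS = 4  # hypothetical cluster size


def distribute_all(records):
    """ALL: every worker gets a full copy of the table."""
    return {w: list(records) for w in range(N_WORKERS)}


def distribute_even(records):
    """EVEN: deal records out to workers one at a time, round-robin."""
    placement = {w: [] for w in range(N_WORKERS)}
    for worker, record in zip(cycle(range(N_WORKERS)), records):
        placement[worker].append(record)
    return placement


def distribute_key(records, key):
    """KEY: the worker is a deterministic function of the key column."""
    placement = {w: [] for w in range(N_WORKERS)}
    for record in records:
        # crc32 is a stand-in hash, chosen only because it is deterministic
        worker = zlib.crc32(str(record[key]).encode()) % N_WORKERS
        placement[worker].append(record)
    return placement


def sizes(placement):
    """number of records held by each worker."""
    return [len(placement[w]) for w in range(N_WORKERS)]


# high-cardinality key: 1000 unique customer ids
unique = [{"customer_id": i} for i in range(1000)]
# skewed key: 90% of the records share a single value
skewed = [{"customer_id": 7 if i < 900 else i} for i in range(1000)]

print("ALL         ", sizes(distribute_all(unique)))
print("EVEN        ", sizes(distribute_even(unique)))
print("KEY (unique)", sizes(distribute_key(unique, "customer_id")))
print("KEY (skewed)", sizes(distribute_key(skewed, "customer_id")))
```

in the skewed case at least 900 of the 1000 records land on one worker, which is the "data skew" problem: that worker does most of the work while the others sit idle.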

#### distributed compute

relative to traditional `rdbms` systems like `postgres`, there are a number of tweaks and improvements in the actual query computation engine in `redshift` that allow for you to do faster querying:

+ MPP (massively parallel processing): `redshift` is built assuming it may have multiple workers performing query tasks and (with some help from you the user) can distribute stored data and calculation requests across those workers to get a "times N" speedup
    + for `KEY` distribution setups this is particularly useful
+ result caching: if a query is run often and results don't change frequently, this can make long or complex queries feel nearly instantaneous
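the MPP idea can be sketched as "every worker aggregates its own slice, then someone combines the small partial results". the `price` slices below are invented, and threads stand in for separate machines, so treat this as an illustration of the pattern rather than of the `redshift` engine.

```python
from concurrent.futures import ThreadPoolExecutor

# pretend each list is the slice of a `price` column held by one worker
# (the values are invented for illustration)
worker_slices = [
    [10.0, 20.0, 30.0],
    [5.0, 15.0],
    [40.0],
]


def partial_aggregate(values):
    """what a single worker computes over its own slice: (sum, count)."""
    return sum(values), len(values)


# every worker aggregates its own slice at the same time ...
with ThreadPoolExecutor(max_workers=len(worker_slices)) as pool:
    partials = list(pool.map(partial_aggregate, worker_slices))

# ... and a coordinator combines the (tiny) partial results
total = sum(s for s, _ in partials)
count = sum(c for _, c in partials)
average = total / count

print(partials)  # [(60.0, 3), (20.0, 2), (40.0, 1)]
print(average)   # 20.0
```

note that the workers ship back `(sum, count)` pairs rather than averages: an average of averages would be wrong when the slices have different sizes.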

### `s3` integration

this may seem like a small thing, but you have the ability to directly query certain file formats (including `csv`, `json`, `parquet`, and `avro`) sitting in `s3` from the `redshift` workers. this means you can treat `s3` as a *data lake*, a dumping ground for data you *may* access in the future, and be able to investigate it via traditional `sql`

### scale of querying

the reported scale of data which you could query with `redshift` is

+ 1 petabyte (1 million gigabytes) of `redshift`-based data
+ 1 exabyte (1 billion gigabytes) of `s3`-based data

## billing

before we go firing off an `aws` service it's good to have a rough idea of how much it will cost. amazon has [a free trial for `redshift`](https://aws.amazon.com/redshift/free-trial/) which is separate from their free-tier offering for reasons that are mostly baffling.

check out how obvious amazon makes this deal:

<br><div align="center"><img src="http://drive.google.com/uc?export=view&id=1u8IDxc8hw-k1z1rUFO10Xg1unpRT0_nY" width="1600px"></div>

the deal is that you get 750 free hours of `dc2.large` instances per month for two months the very first time you spin up a `redshift` cluster.

it's worth noting that `24 * 31 = 744`, so 750 hours is just over the threshold for *one* node to be on full-time for a month.

an offer of "750 hours for two months", then, amounts to a free one-node cluster with 160GB of storage to stay on for two months.

not bad, and also not overwhelming

if you choose to leave your cluster up when that free trial ends, [one `dc2.large` instance in `us-east-1` costs 0.25 USD per hour](https://aws.amazon.com/redshift/pricing/#Pricing_calculator), i.e. 6 USD per day or about 180 USD per month.
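the arithmetic above, written out. the rate and the 750-hour allowance are the numbers quoted in this section; the assumption that the allowance is counted in *node*-hours (so a two-node cluster burns it twice as fast) is mine, so check the free trial page before relying on it.

```python
HOURLY_RATE_USD = 0.25       # one dc2.large in us-east-1, as quoted above
FREE_HOURS_PER_MONTH = 750   # free trial allowance, as quoted above

hours_in_longest_month = 24 * 31
daily_cost = HOURLY_RATE_USD * 24
monthly_cost_30_days = daily_cost * 30


def billable_hours(n_nodes, hours_up):
    """node-hours beyond the free allowance in one trial month.

    assumes the allowance is measured in node-hours (an assumption,
    not something stated in this lecture).
    """
    return max(0, n_nodes * hours_up - FREE_HOURS_PER_MONTH)


print(hours_in_longest_month)                      # 744
print(daily_cost, monthly_cost_30_days)            # 6.0 180.0
print(billable_hours(1, hours_in_longest_month))   # 0
print(billable_hours(2, hours_in_longest_month))   # 738
```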

so you should consider shutting this off pretty quick.

*note: for scans of `s3` you pay for a tool called `redshift` `spectrum`, and you pay per byte scanned rather than for storage or server time*

## creating a cluster

enough of the yapping, let's create a cluster

### start up the cluster

we already cheated ahead and did this -- is it up and running yet? check the `redshift` "Clusters" dashboard (left menu pane) looking for a cluster that has status `available`

### installing a query workbench client

in our `rds` lecture, we installed `datagrip`, and we will re-use that now. if you haven't installed `datagrip`, please refer to the instructions in that lecture.

in [the `aws` `redshift` documentation](https://docs.aws.amazon.com/redshift/latest/mgmt/connecting-using-workbench.html), they walk you through an installation of the `sql workbench j` program, which you could also use.

because we've already installed `datagrip`, we will not install `sql workbench j` in this class, but the instructions and exercises for doing so are left below as notes in case you are interested in a `datagrip` alternative.

**[slide is deprecated]**

#### install `sqlworkbench`

[`aws` recommends](https://docs.aws.amazon.com/redshift/latest/mgmt/connecting-using-workbench.html) using `sqlworkbench j` for connecting to and querying `redshift`: http://www.sql-workbench.eu/. bask in the overwhelming beauty of that expansive beige website!

windows users can also use `aginity`, a slightly more polished product: https://www.aginity.com/main/workbench/

for now, though, we will install `sqlworkbench` as instructed.

**[slide is deprecated]**

**<div align="center">walkthrough: install `sql workbench j`</div>**

1. create a directory on your computer where you want to save this application
    1. e.g. on a mac, `mkdir -p ~/Applications/sqlworkbenchj`
1. go to http://www.sql-workbench.eu/downloads.html or search for "sql workbench j"
1. choose one of the "Generic package for all systems including all optional libraries" or "Generic package for all systems" links (the former plays nice with MSFT excel) and download that file
1. unzip the downloaded `.zip` file
   1. e.g. `cd ~/Applications/sqlworkbenchj && unzip Workbench-Build125.zip`
1. go to https://docs.aws.amazon.com/redshift/latest/mgmt/configure-jdbc-connection.html or query "redshift jdbc connection" and download a `redshift` `jdbc` driver.
1. open `sqlworkbenchj`
    1. on windows, double-click the `exe`. on mac, from the terminal run `bash sqlworkbench.sh`

**[slide is deprecated]**

+ mac users: if you are given a warning when you double-click the app that it is from an "unidentified developer", you may need to jump through an extra hoop to open it
+ first, try `ctrl + click`-ing the app. if this provides you with an option to "open", you're good to go.
+ otherwise, we need to temporarily allow apps from unidentified developers.
    1. open a terminal and run `sudo spctl --master-disable`
    1. open "System Preferences > Security and Privacy", click the lock icon in the bottom left to edit, and change the "Allow apps downloaded from:" value to be "anywhere"
    1. try `ctrl + click`-ing the app again
    1. open a terminal and run `sudo spctl --master-enable`
+ if none of this works, come get me and we will troubleshoot

### permission stuff

your cluster may still be spinning up, and we can't use it until it is in an `available` state.

in order to use it when it's `available`, though, we still need to update the default permissions.

in particular, we will need to

+ grant the `redshift` **service** permissions to read from the `s3` service, and
+ update the inbound traffic rules for this cluster's security group to allow us to connect from our laptop

**<div align="center">allowing `redshift` to read `s3`</div>**

1. create a new role from the `iam` service `role` sub-menu
1. keep the selected type of trusted entity as "AWS service" and select `redshift` from the list below. use the **"Redshift - Customizable"** use case and click "Next: Permissions"
1. search for and attach the `AmazonS3ReadOnlyAccess` policy
1. name this role `allow_redshift_s3_read_role`
1. go back to the `redshift` service "Clusters" menu. check the box next to your cluster and click the "manage IAM roles" button. select this new role and "Apply changes"
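for the curious, the role the console wizard builds in the steps above boils down to two pieces: a *trust policy* saying the `redshift` service may assume the role, and an attached managed policy granting `s3` read access. the sketch below only builds and prints that JSON locally; it makes no calls to `aws`, and the exact document the console generates may differ in formatting.

```python
import json

# role name used in the walkthrough above
ROLE_NAME = "allow_redshift_s3_read_role"

# trust policy: lets the redshift service assume this role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "redshift.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# the aws-managed policy attached in step 3
attached_policy_arn = "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"

print(ROLE_NAME)
print(json.dumps(trust_policy, indent=2))
print(attached_policy_arn)
```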

**<div align="center">allowing connections to `redshift` from the outside world</div>**

1. if you're not on the "Properties" tab already, go to your cluster's main page: from the "Clusters" dashboard, click on the link for your cluster's name. then, click on the "Properties" tab
1. find the security group item (in "Network and security" section) and click on the link
1. click on the "Inbound" tab on the lower half of the security group page
1. click the "Edit" button and add a new inbound connection of "Redshift" type from either your current IP address or all IP addresses.

<br><div align="center"><img src="http://drive.google.com/uc?export=view&id=1-MSK32WlUoKrfQyn6aeLI-MG8e1EWIS_" width="1500px"></div>

### connect via `datagrip`

all right. now that all that is set up, we should be able to make a connection from `datagrip` to our `redshift` cluster. there's no data to query there yet, so our first act will be to add some publicly available data from `s3`.

**<div align="center">collect necessary identifiers</div>**

there are two pieces of information we will need eventually: one right now to connect, and another to copy some data later. open two new tabs, or an editor in which you can paste the following:

1. your `redshift` cluster's `jdbc` string
    1. go to `redshift`, the "clusters" sub-menu, select your cluster's name link, and find the "JDBC URL" field
        + example value: `jdbc:redshift://gu511-redshift-demo-2020.csneoatl4kff.us-east-1.redshift.amazonaws.com:5439/dev`
1. the `arn` of the `iam` `role` allowing `redshift` to read `s3`
    1. this is in the "Cluster permissions" block of the "Properties" tab in `redshift` service.
        + example value: `arn:aws:iam::134461086921:role/allow_redshift_s3_read_role_2020`
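if you ever need the individual pieces of those two identifiers (say, to connect from a script rather than `datagrip`), they can be pulled apart with the standard library. this is a small sketch using the example values above; reading the cluster identifier and region out of the hostname is an assumption that happens to hold for this example's naming pattern.

```python
from urllib.parse import urlparse

# example values copied from the list above
jdbc_url = (
    "jdbc:redshift://gu511-redshift-demo-2020.csneoatl4kff"
    ".us-east-1.redshift.amazonaws.com:5439/dev"
)
role_arn = "arn:aws:iam::134461086921:role/allow_redshift_s3_read_role_2020"

# drop the leading "jdbc:" so the remainder parses like an ordinary url
parsed = urlparse(jdbc_url[len("jdbc:"):])
host = parsed.hostname
port = parsed.port
database = parsed.path.lstrip("/")

# assumption: hostname looks like <cluster>.<random id>.<region>.redshift...
host_parts = host.split(".")
cluster_identifier = host_parts[0]
region = host_parts[2]

# arn layout: arn:partition:service:region:account-id:resource
arn_parts = role_arn.split(":", 5)
account_id = arn_parts[4]
role_name = arn_parts[5].split("/", 1)[1]

print(host, port, database)
print(cluster_identifier, region)
print(account_id, role_name)
```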

**<div align="center">connect to `redshift` from `datagrip`</div>**

1. back in `datagrip`, create a new connection profile
    1. "File > New... > datasource > redshift"
1. at the bottom of the page, if there is a link to download a driver, do so
1. name it whatever you want (suggestion: cluster name)
1. paste the `jdbc` URL in the `URL` field at the bottom
1. fill in username and password
1. click the "Test Connection" button
    1. if you don't get a green checkmark, drop into the zoom room
1. click the "OK" button

### get data

the getting started guide for `redshift` outlines a simple set of `create table` instructions that will build a handful of transactional sales tables in `redshift` by copying them from `csv`s stored in publicly-accessible `s3`.

let's follow those steps

**<div align="center">load data from `csv`s on `s3` to tables in `redshift`</div>**

1. copy the `arn` for your `iam` `role` above to your clipboard
1. head to https://docs.aws.amazon.com/redshift/latest/gsg/rs-gsg-create-sample-db.html (google search "aws redshift step 6 load sample data s3") and follow the steps there
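the guide gives you the exact statements to run, so use those. purely to show the *shape* of a `copy` statement and where your role `arn` goes, here is a small `python` helper that assembles one as text. the bucket path is hypothetical, the `arn` is the example value from earlier, and the `|` delimiter and `us-west-2` region defaults are assumptions you should replace with whatever the guide specifies.

```python
def build_copy_statement(table, s3_path, iam_role_arn,
                         delimiter="|", region="us-west-2"):
    """assemble a redshift `copy` statement as a string.

    the delimiter and region defaults are assumptions for illustration.
    """
    return (
        f"copy {table} from '{s3_path}'\n"
        f"iam_role '{iam_role_arn}'\n"
        f"delimiter '{delimiter}' region '{region}';"
    )


statement = build_copy_statement(
    table="users",
    # hypothetical path: substitute the one given in the guide
    s3_path="s3://my-example-bucket/tickit/allusers_pipe.txt",
    iam_role_arn="arn:aws:iam::134461086921:role/allow_redshift_s3_read_role_2020",
)
print(statement)
```

paste the printed statement into a `datagrip` console to run it, one `copy` per table.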

### querying

the tutorial above should have provided you with a number of examples of how you can construct simple queries.

there are a few final notes on querying in columnar databases that are important to remember

**avoid `select *`**: the entire goal of columnar databases is to query *columns*. when a query executes it scans the records in the various columns you requested, and returns as many of them as is necessary. the scanning may be smart (informed by the MPP elements of the `redshift` query engine) but it will still scan the entire contents of each requested column.

in particular, this means that you should be as explicit and restrictive as possible in your queries. the computational overhead of pulling in unused columns is significant.

**you don't have `index`es, but you do have `sort` columns**: in traditional `sql`, the best way to improve performance of queries which utilize `where` clauses is to create an `index`. `redshift` has no indexes, but in their place we have `sortkey`s.

where the `distkey` of a table determines the way the `sql` engine will *distribute* records, the `sortkey` of a table will determine the order in which those distributed records are recorded. `redshift` can use `sortkey`s in `where` clauses to dramatically speed up the scanning of those columns. if you plan to perform a lot of `where` filtering on a particular column, consider making that column a `sortkey`.
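why does sorting help without an index? `redshift` stores columns in blocks and keeps the minimum and maximum value of each block as metadata, so a range filter can skip any block whose range cannot match. the sketch below mimics that with tiny four-value blocks (real blocks are far larger, and the `dateid`-like values are invented).

```python
BLOCK_SIZE = 4  # toy block size; real blocks hold far more values


def build_blocks(values):
    """split a column into blocks and record (min, max, block) for each."""
    blocks = [values[i:i + BLOCK_SIZE] for i in range(0, len(values), BLOCK_SIZE)]
    return [(min(b), max(b), b) for b in blocks]


def scan(blocks, lo, hi):
    """return (matching values, number of blocks actually read)."""
    matches, blocks_read = [], 0
    for block_min, block_max, block in blocks:
        if block_max < lo or block_min > hi:
            continue  # metadata proves nothing in this block can match
        blocks_read += 1
        matches.extend(v for v in block if lo <= v <= hi)
    return matches, blocks_read


# the values 1..20 in a deliberately interleaved (unsorted) arrival order
arrival_order = [v for offset in range(5) for v in range(1 + offset, 21, 5)]

unsorted_blocks = build_blocks(arrival_order)
sorted_blocks = build_blocks(sorted(arrival_order))

# equivalent of: where dateid between 9 and 12
unsorted_matches, unsorted_reads = scan(unsorted_blocks, 9, 12)
sorted_matches, sorted_reads = scan(sorted_blocks, 9, 12)

print(sorted(unsorted_matches), unsorted_reads)  # [9, 10, 11, 12] 5
print(sorted_matches, sorted_reads)              # [9, 10, 11, 12] 1
```

same rows come back either way, but the sorted layout reads one block out of five.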

**columns should be encoded**: this is an in-the-weeds detail, but if you are choosing to create tables for your own work you should be encoding your columns. the compression allows for much faster reads of encoded columns. [amazon docs](https://docs.aws.amazon.com/redshift/latest/dg/t_Compressing_data_on_disk.html) offer instructions on how to choose the proper encoding
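as a feel for why columns compress so well, here is a sketch of run-length encoding, one of the simpler encodings. values in a single column share a type and often repeat, so storing `(value, count)` pairs can be much smaller than storing every value. this is an illustration of the idea only; which encoding suits which column is what the linked docs are for.

```python
from itertools import groupby


def run_length_encode(values):
    """collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(pairs):
    """expand (value, count) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


# an invented last-name column with long runs of repeated values
column = ["Jones"] * 4 + ["Smith"] * 3 + ["Johnson"] * 2

encoded = run_length_encode(column)
print(encoded)                    # [('Jones', 4), ('Smith', 3), ('Johnson', 2)]
print(len(column), len(encoded))  # 9 3
```

notice that the runs only exist because equal values sit next to each other, which is one more reason sorted columns are friendlier than unsorted ones.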

**`join`s may be expensive**: one last reminder: if you are `join`ing records that have been distributed to different workers, `redshift` will spend the majority of its computational overhead identifying which records on each server need to be joined with which records on all other servers. if you are planning to do frequent joins on a given column, consider making it a `distkey`
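a toy count of the data movement involved: below, customers are placed on workers by hashing their id, and orders are placed either the same way (`KEY` on customer id) or round-robin (`EVEN`). an order has to travel over the network whenever it does not already sit on its customer's worker. the tables, the worker count, and the `crc32` hash are all invented for illustration, and the real planner has more strategies than this.

```python
import zlib

N_WORKERS = 4  # hypothetical cluster size


def worker_for_key(value):
    """deterministic worker assignment; crc32 is a stand-in hash."""
    return zlib.crc32(str(value).encode()) % N_WORKERS


customers = list(range(100))  # customer ids
# (order id, customer id): four orders per customer
orders = [(order_id, order_id % 100) for order_id in range(400)]

# customers table distributed by KEY on customer id
customer_home = {c: worker_for_key(c) for c in customers}

# orders distributed by KEY on customer id: (worker, customer id)
orders_key = [(worker_for_key(cust), cust) for _, cust in orders]
# orders distributed EVEN: round-robin by arrival order
orders_even = [(order_id % N_WORKERS, cust) for order_id, cust in orders]


def rows_to_move(placed_orders):
    """orders that are not already on the same worker as their customer."""
    return sum(1 for worker, cust in placed_orders if worker != customer_home[cust])


print("KEY :", rows_to_move(orders_key))
print("EVEN:", rows_to_move(orders_even))
```

with matching `distkey`s nothing moves at all; with `EVEN` a large share of the orders table has to be shipped between workers before the join can even start.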

**check on the query summary in `redshift`**: on the `redshift` cluster page, you can click on the `query` tab and see the summary of all the previous queries. in particular, focus on the query explanation

**caching**: I mentioned it above, but query results are cached. try running any computationally expensive query multiple times -- for example

```sql
select max(dateid) from sales;
```

in addition to just *feeling* the change in query time, view the query execution time values on the `redshift` cluster query tab.

<strong><em><div align="center">you know, `redshift`, that phenomenon where you are getting further from the light</div></em></strong>
<div align="center"><img src="http://en.es-static.us/upl/2012/06/500px-Redshift_blueshift.png" width="800px"></div>

# END OF LECTURE

next lecture: [parallelization and `gpu` analytics](013_parallelization_and_gpu_analytics.ipynb)