# A Curious Moon

> Rob Conery

Notes by _Tobias Reaper_

---
---

## Chapter 1 - Enceladus

### ETL

Extraction: Pulling relevant data out of various systems.

Transformation: Cleaning the data so it passes these checks:

* Correct typing
* Completeness
* Accuracy - numbers are within the range of what is to be expected

Loading: Push data into nicely normalized tables so it can be queried.

ETL tooling, in increasing order of complexity:

1. Shell scripts / Makefiles until the complexity forces a slow-down
2. Use Python until transform / loading times are too long
3. Push to a system like Kafka

An important note:

> Import everything as `text` - add types later. Get everything in the database first; everything else can wait.

```sql
drop table if exists master_plan;
create table master_plan(
  start_time_utc text,
  duration text,
  date text,
  team text,
  spass_type text,
  target text,
  request_name text,
  library_definition text,
  title text,
  description text
);
COPY master_plan
FROM '/Users/Tobias/workshop/vela/ds/kb/books/curious_moon/cassini_data/data/master_plan.csv' WITH DELIMITER ',' HEADER CSV;
```
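
With everything loaded as `text`, the transformation checks listed above (typing, completeness, accuracy) can be run as plain queries before any types are added. The following is a rough sketch rather than the book's own script: which columns count as important and the `DD-Mon-YY` shape assumed for the `date` column are both assumptions.

```sql
-- Completeness: how many rows are missing a value in a few columns?
-- (the choice of columns here is an assumption)
select
  count(*) as total_rows,
  count(*) filter (where coalesce(trim(mp.start_time_utc), '') = '') as missing_start_time,
  count(*) filter (where coalesce(trim(mp.team), '') = '') as missing_team,
  count(*) filter (where coalesce(trim(mp.target), '') = '') as missing_target
from master_plan as mp;

-- Typing: list date values that do not look like DD-Mon-YY,
-- the shape this sketch assumes the column uses
select mp.date, count(*) as occurrences
from master_plan as mp
where mp.date !~ '^[0-9]{1,2}-[A-Za-z]{3}-[0-9]{2}$'
group by mp.date
order by occurrences desc;

-- Accuracy: eyeball the distinct values of a low-cardinality column
-- to spot misspellings or unexpected entries
select mp.team, count(*) as occurrences
from master_plan as mp
group by mp.team
order by mp.team;
```

None of these queries change the data; they only show where the later casts are likely to fail.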

### Schema

Postgres organizes its objects in a hierarchy:

* Cluster
* Database
* One or more schemas - the default is public
* Tables, views, functions, and other relations - these are all attached to a schema

Create the schema at the top of the script and schema-qualify the table name everywhere it appears, including the `COPY` near the bottom:

```sql
create schema if not exists import;
drop table if exists import.master_plan;
create table import.master_plan(
  ...
);
COPY import.master_plan
...
```
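
To see where a table sits in the hierarchy described above, the catalog can be queried directly. This is a small sketch that assumes the `import` schema and `import.master_plan` table from the script have already been created.

```sql
-- Which schemas do unqualified names resolve to?
show search_path;
select current_schema();

-- List the schemas in the current database that this user can see
select schema_name
from information_schema.schemata
order by schema_name;

-- List the tables attached to the import schema
select table_schema, table_name
from information_schema.tables
where table_schema = 'import';

-- import is not on the default search_path, so the table
-- has to be referenced by its schema-qualified name
select count(*) from import.master_plan;
```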

### Makefile

* Targets are the artifacts to build or the processes to run
* Each target is built by a recipe, which is basically one or more shell commands
* A target can list prerequisites: other targets that must be built first

```Makefile
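# Database name and file locations used by the recipes
DB=enceladus
BUILD=${CURDIR}/build.sql
SCRIPTS=${CURDIR}/scripts
CSV='${CURDIR}/data/master_plan.csv'
MASTER=$(SCRIPTS)/import.sql
NORMALIZE=$(SCRIPTS)/normalize.sql

# These targets name processes, not files
.PHONY: all master import normalize clean

# Recipe lines must be indented with a tab, not spaces

# Run the assembled build.sql against the database
all: normalize
	psql $(DB) -f $(BUILD)

# Append import.sql to build.sql
master:
	@cat $(MASTER) >> $(BUILD)

# Append the COPY statement that loads the CSV
import: master
	@echo "COPY import.master_plan FROM $(CSV) WITH DELIMITER ',' HEADER CSV;" >> $(BUILD)

# Append the normalization SQL
normalize: import
	@cat $(NORMALIZE) >> $(BUILD)

# Remove the generated script
clean:
	@rm -rf $(BUILD)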
```

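* Running `make` with no arguments executes `all`, because it is the first target in the file
* `all` depends on `normalize`, which depends on `import`, which depends on `master`
* Each one adds a bit of SQL to `build.sql`, except for `all` (which runs the script) and `clean` (which deletes it)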

The key point is that the various components of the SQL script can be split into separate files.

* `import.sql`: creates the import schema and loads the CSV
* `normalize.sql`: splits the import table into lookups, etc.

### Review and solidify

> Record myself explaining (ELI5) the following topics:

* [ ] Creating schema and tables
* [ ] Loading CSVs
* [ ] Writing a Makefile

---
---

## Chapter 2 - In Orbit

### Normalization

Here are the columns in `master_plan.csv`:

* start_time_utc
* duration
* date
* team
* spass_type
* target
* request_name
* library_definition
* title
* description

To remove the repeated strings, five lookup tables can be created:

* teams
* spass_types
* targets
* requests
* library_definitions

To build a lookup table, three things are needed:

1. Get all distinct values in the import table
2. Create a new table using this data
3. Add a primary key for use with a foreign key constraint
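
The three steps above can be sketched for a single lookup table as follows. This assumes the data lives in `import.master_plan`; the column name `description` is a placeholder chosen for this example.

```sql
-- Steps 1 and 2: pull the distinct team names into a new table
drop table if exists teams;
select distinct mp.team as description
into teams
from import.master_plan as mp;

-- Step 3: add a surrogate primary key that a foreign key can reference
alter table teams
add column id serial primary key;

-- Sanity check: one row per distinct team value
select id, description
from teams
order by id;
```

The same pattern would repeat for the other lookup tables, changing only the source column and the table name.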

The central table that references each lookup table through a foreign key is called a `fact` table, and the lookup tables are its dimensions. This type of structure is known as a `star schema`.

### Importing Events

To start, create a fact table called `events` in the public schema:

```sql
create table events(
  id serial primary key,
  time_stamp timestamptz not null,
  title varchar(500),
  description text,
  event_type_id int,
  spass_type_id int,
  target_id int,
  team_id int,
  request_id int
);
```

And insert data into it:

```sql
insert into events(
  time_stamp,
  title,
  description
)

select
  import.master_plan.date::timestamptz,
  import.master_plan.title,
  import.master_plan.description
from import.master_plan;
```

This fails because 2014 was not a leap year, so 29 February 2014 is not a valid date:

    psql:events.sql:23: ERROR:  date/time field value out of range: "29-Feb-14"
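
The error only reports the first bad value it hits. One way to list every value that would fail the cast is a small helper that traps the exception. This is a sketch: `casts_to_timestamptz` is a hypothetical function written for these notes, not something built into Postgres or taken from the book.

```sql
-- Hypothetical helper: true when the text casts cleanly to timestamptz.
-- Created in pg_temp so it disappears when the session ends.
create or replace function pg_temp.casts_to_timestamptz(candidate text)
returns boolean as $$
begin
  perform candidate::timestamptz;
  return true;
exception when others then
  return false;
end;
$$ language plpgsql;

-- List every date value that would make the insert fail
select mp.date, count(*) as occurrences
from import.master_plan as mp
where not pg_temp.casts_to_timestamptz(mp.date)
group by mp.date
order by occurrences desc;
```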

Postgres stores every `timestamptz` value as UTC and converts it to the session's time zone when it is queried:

    Tobias=# select '2020-01-01'::timestamptz;
        timestamptz
    ------------------------
    2020-01-01 00:00:00-08
    (1 row)

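Convert a timestamp to a specific zone using `at time zone X`. Applied to a `timestamptz`, it returns a `timestamp` without time zone holding the local time in zone X: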

    Tobias=# select '2020-01-01'::timestamptz at time zone 'UTC';
        timezone
    ---------------------
    2020-01-01 08:00:00
    (1 row)
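
To make the display behavior concrete, the session time zone can be switched and the same instant selected again. A short sketch follows; `America/Los_Angeles` is an assumption, picked only because the session above shows a `-08` offset.

```sql
-- The session time zone controls how timestamptz values are displayed
show timezone;

-- The same instant, displayed under two different session settings
set timezone to 'UTC';
select '2020-01-01 00:00:00-08'::timestamptz;  -- 2020-01-01 08:00:00+00

set timezone to 'America/Los_Angeles';
select '2020-01-01 00:00:00-08'::timestamptz;  -- 2020-01-01 00:00:00-08

-- at time zone on a timestamptz returns a plain timestamp:
-- the wall-clock time in the named zone
select '2020-01-01 00:00:00-08'::timestamptz at time zone 'UTC';  -- 2020-01-01 08:00:00

-- Go back to the configured default
reset timezone;
```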


### Lookup tables

Get distinct values for each of the lookup tables with `distinct`, sending the results into a new table with `into`.
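
As a sketch of how a lookup table built with `distinct` and `into` connects back to the fact table, the example below builds `targets` and points `events.target_id` at it. It assumes the `events` table defined earlier already exists; the column name `description` and the constraint name are placeholders chosen here.

```sql
-- Build the lookup table from the distinct target names
-- (cascade also drops the foreign key below if this is re-run)
drop table if exists targets cascade;
select distinct mp.target as description
into targets
from import.master_plan as mp;

alter table targets
add column id serial primary key;

-- Point the fact table's target_id at the lookup table
alter table events
add constraint events_target_id_fkey
foreign key (target_id) references targets(id);

-- Once target_id is populated, the names come back with a join
select e.time_stamp, e.title, t.description as target
from events as e
join targets as t on t.id = e.target_id
limit 10;
```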



### Review and solidify

> Record myself explaining (ELI5) the following topics:

* [ ] Ch 1 topic 1
* [ ] Ch 1 topic 2
* [ ] ...