# Setup

Run `docker compose build && docker compose up -d` to start the postgres and pgadmin containers.

If `psql` is installed locally, it can connect to the container, since port 5432 is exposed:

```sh
psql \
    -U postgres \
    -h localhost \
    -p 5432
```

after which `psql` prompts for the password that's set in `.env`.


## table edit

In [1]:
%load_ext sql

In [2]:
from getpass import getpass

In [3]:
user = 'postgres'
pw = getpass()
host = 'localhost'
port = '5432'
db_name = 'enceladus'
conn_str = f'postgresql://{user}:{pw}@{host}:{port}/{db_name}'

 ········


In [4]:
%sql $conn_str
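
The connection string in cell 3 interpolates the password as typed, which can produce a malformed URL if the password contains characters such as `@` or `/`. A minimal sketch of one way around this, using the standard library's `urllib.parse.quote_plus`; the helper name `build_conn_str` and the sample password are hypothetical, not part of the original notebook:

```python
from urllib.parse import quote_plus


def build_conn_str(user, pw, host, port, db_name):
    """Build a postgresql:// URL, percent-encoding the user and password."""
    # quote_plus escapes characters such as '@', ':' and '/' that would
    # otherwise be read as URL delimiters.
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(pw)}"
        f"@{host}:{port}/{db_name}"
    )


# Hypothetical password, for illustration only.
example = build_conn_str("postgres", "p@ss/word", "localhost", "5432", "enceladus")
print(example)
```

In the notebook, `pw` would still come from `getpass()`; only the string assembly changes.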

In [None]:
%%sql
drop table if exists enceladus;
create table enceladus(
    id serial primary key,
    the_date date,
    title varchar(100),
    description text
);

- `drop table if exists` makes the script safe to re-run: without `if exists`, dropping a table that isn't there raises an error
- `serial` is an auto-incrementing integer backed by a sequence, used here as the primary key
    - shorthand specific to PostgreSQL
    - the longhand is `create sequence id_sq` and then `nextval('id_sq')` as the column default
- primary key added separately: `alter table enceladus add primary key (id);`

## master schedule

Due to budget cuts, each Cassini flyby could operate only one or two sensors instead of the originally planned ensemble. Thus the master schedule was born, to plan the entire operation down to the second.

## Importing CSVs

Three criteria to check on every import:

- correct typing - each column must have a defined type
- completeness - not every row may be complete
- accuracy - even if every field is filled in and correctly typed, the values might not make sense in context; can't have negative kelvins.

Any import of CSVs or other data sources must define a protocol for when any one of these criteria is not met, and that protocol requires input from stakeholders, e.g. data producer and data consumer.
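
The three criteria can be checked row by row before anything reaches the database. A minimal sketch using only the standard library; the column names `sensor` and `temp_kelvin`, the sample rows, and the `check_row` helper are hypothetical, and what to do with a failing row is exactly the protocol the stakeholders need to agree on:

```python
import csv
import io

# Hypothetical sample data: one good row, one incomplete, one mistyped,
# and one that parses but is physically impossible (negative kelvins).
RAW = """sensor,temp_kelvin
cirs,72.5
uvis,
inms,warm
cda,-4.0
"""


def check_row(row):
    """Return a list of problems found in one CSV row (empty if none)."""
    problems = []
    value = row["temp_kelvin"]
    if value == "":
        problems.append("completeness: temp_kelvin is missing")
        return problems
    try:
        kelvin = float(value)
    except ValueError:
        problems.append(f"typing: {value!r} is not a number")
        return problems
    if kelvin < 0:
        problems.append(f"accuracy: {kelvin} K is below absolute zero")
    return problems


# Map each sensor to the problems found in its row.
report = {
    row["sensor"]: check_row(row)
    for row in csv.DictReader(io.StringIO(RAW))
}
for sensor, problems in report.items():
    print(sensor, problems or "ok")
```

Here the checks only report; rejecting, quarantining, or repairing a bad row is a separate decision.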

### Mechanics

Start simple, and add complexity only when the simple approach stops working. In order of complexity:

- shell scripts/Make files
- python's pandas for more complex functions
- kafka for streaming?

Import everything as _text_ first; get the data into the db, _then_ get the typing right.

`COPY ... FROM` reads a file from the server's disk into a table.

*Idempotency* must be maintained so that the pipeline can be repeated and still arrive at the same result; re-running the load should not create a second table or duplicate rows.

## Idempotency

`build.sql` contains our load script, where we load the csv entirely as text

In [None]:
%%sql
-- build.sql part 0
create schema if not exists import;
drop table if exists import.master_plan;

-- build.sql part 1
-- the table lives in the import schema; part 0 already dropped any old copy
create table import.master_plan(
    start_time_utc text,
    ... text,
    ...,
);
-- build.sql part 2
COPY import.master_plan
FROM 'path/to/master-plan.csv' -- note single quotes
-- comma-delimited CSV with a header row
WITH DELIMITER ',' HEADER CSV;

1. create the `import` schema for better organization
2. create the empty table with all text types
3. load our csv with `COPY FROM`
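
The three steps above are what make the load repeatable. A minimal sketch of the same drop, create, load pattern, using Python's built-in `sqlite3` as a stand-in for Postgres so that it runs without a server. SQLite has no schemas and no `COPY`, so the table name `import_master_plan`, the `target` column, and the sample rows are all hypothetical:

```python
import csv
import io
import sqlite3

# Hypothetical stand-in for master-plan.csv.
RAW_CSV = """start_time_utc,target
2004-01-01T00:00:00,Saturn
2004-01-02T00:00:00,Enceladus
"""


def load(conn):
    """Drop, recreate, and reload the raw table; safe to call repeatedly."""
    rows = list(csv.reader(io.StringIO(RAW_CSV)))[1:]  # skip the header row
    with conn:  # commit on success, roll back on error
        conn.execute("drop table if exists import_master_plan")
        conn.execute(
            "create table import_master_plan (start_time_utc text, target text)"
        )
        conn.executemany("insert into import_master_plan values (?, ?)", rows)


def row_count(conn):
    """Count the rows currently in the raw table."""
    return conn.execute("select count(*) from import_master_plan").fetchone()[0]


conn = sqlite3.connect(":memory:")
load(conn)
first = row_count(conn)
load(conn)  # second run replaces the table instead of appending to it
second = row_count(conn)
print(first, second)
```

Without the `drop table if exists` line, the second call would fail on `create table`; with a plain `insert` into an existing table, it would double the rows.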

Next we define the _schema_, which acts as a namespace. The namespace hierarchy in postgres goes:

- cluster - the set of databases managed by one server instance
- database
- schemas
  - default schema is `public`
  - in bigquery this is called _dataset_
- tables, views, functions; all fall under a schema

So, much like we don't commit directly to `main` in git, we don't load raw data directly into `public`; we create a schema for our raw text table: `create schema if not exists import;`

Execute in psql: `psql enceladus -f build.sql`

Confirm the table loaded with `select * from import.master_plan limit 5;` (don't forget the `;`).

## Make

A makefile consists of these components:

- target - top-level names for things you want to happen
- recipe - commands under a target that accomplish the thing you want to happen
  - must be indented with a _tab_, not spaces, which many editors insert by default
- prerequisite - other targets that need to happen first; a target may list any number of them
- variables - can be assigned at the top to parametrize the makefile

Common targets:

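- all: conventional default target; `make` with no arguments runs the first target in the makefile, which by convention is `all`
- clean: teardown, removing build artifacts and cleaning out the build dir. Deletes `build.sql`; works if we use our makefile to build a new `build.sql` each time it's called
- .PHONY: declares targets that don't produce a file of the same name, so make always runs their recipes even if a file called e.g. `clean` exists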

Parametrize

- `${CURDIR}` expands to the absolute path of the directory where `make` is running, which is useful for giving `psql` and `COPY` absolute paths to `build.sql` and the csv

Starting makefile:

```make
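DB=enceladus
BUILD=${CURDIR}/build.sql
SCRIPTS=${CURDIR}/scripts
CSV='${CURDIR}/data/master_plan.csv'
MASTER=$(SCRIPTS)/import.sql
NORMALIZE=$(SCRIPTS)/normalize.sql

# none of these targets produce a file with the same name
.PHONY: all master import normalize clean

# run the assembled script
all: normalize
	psql $(DB) -f $(BUILD)

# append the schema and table definitions
master:
	@cat $(MASTER) >> $(BUILD)

# append the COPY statement; the echo must stay on a single recipe line
import: master
	@echo "COPY import.master_plan FROM $(CSV) WITH DELIMITER ',' HEADER CSV;" >> $(BUILD)

normalize: import
	@cat $(NORMALIZE) >> $(BUILD)

clean:
	@rm -rf $(BUILD)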
```

`make clean && make` will now build the script from scratch, and re-run it

Make also allows us to compartmentalize the sql commands:

- import.sql - create the import schema and the raw text table; the makefile appends the `COPY` that loads the csv
- normalize.sql - split the raw imported table into whatever smaller tables we need

## Database metadata

For table and column metadata, query these columns:

- `table_schema`
- `table_name`
- `column_name`
- `data_type`

from `information_schema.columns`.

Postgres keeps internal metadata in these schemas:

- `information_schema`
- `pg_catalog`

In [None]:
%%sql
SELECT 
   distinct table_name, table_schema, column_name, data_type
   
FROM 
   information_schema.columns
WHERE
    table_schema not in ('information_schema', 'pg_catalog')
order by table_name;