# Orbit


In [2]:
import os
from dotenv import load_dotenv

In [3]:
load_dotenv("../.env")
user = os.environ.get('POSTGRES_USER')
pw = os.environ.get('POSTGRES_PASSWORD')
db_name = os.environ.get('POSTGRES_DB')
host = 'localhost'
port = 5432
conn_str = f'postgresql://{user}:{pw}@{host}:{port}/{db_name}'

In [5]:
%load_ext sql

In [6]:
%sql $conn_str

In [22]:
%%sql
SELECT 
   distinct table_name, table_schema, column_name, data_type
   
FROM 
   information_schema.columns
WHERE
    table_schema not in ('information_schema', 'pg_catalog')
order by table_name;

 * postgresql://postgres:***@localhost:5432/enceladus
105 rows affected.


table_name,table_schema,column_name,data_type
event_types,public,description,text
event_types,public,id,integer
events,public,description,text
events,public,event_type_id,integer
events,public,id,integer
events,public,request_id,integer
events,public,spass_type_id,integer
events,public,target_id,integer
events,public,team_id,integer
events,public,time_stamp,timestamp with time zone


## Normalization

Normalization reduces repetition and thus disk usage. Essentially we're creating lookup tables. To make lookup tables:

1. get all distinct values from import
1. create new table with those distincts
1. add primary key for use with foreign key constraint

Fields like team, spass, and targets don't have many distinct values, but requests and libs have hundreds or thousands. Even so, we can make lookup tables for each of those types. They all need to relate back to the source table, i.e. the `fact` table, a la *star schema*.
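
The three steps above can be sketched in plain Python. This is only an illustration of the idea, not the SQL used later; the rows and team names are made up, and `build_lookup` / `normalize` are hypothetical helper names.

```python
def build_lookup(rows, column):
    """Map each distinct non-null value in `column` to an integer id."""
    distinct = sorted({r[column] for r in rows if r[column] is not None})
    # ids start at 1, similar to what a serial column would hand out
    return {value: i for i, value in enumerate(distinct, start=1)}


def normalize(rows, column, lookup):
    """Replace the text column with an integer `<column>_id` key."""
    out = []
    for r in rows:
        new = {k: v for k, v in r.items() if k != column}
        new[f"{column}_id"] = lookup.get(r[column])  # None when there is no match
        out.append(new)
    return out


# Hypothetical rows standing in for import.master_plan
rows = [
    {"title": "obs 1", "team": "alpha"},
    {"title": "obs 2", "team": "beta"},
    {"title": "obs 3", "team": "alpha"},
    {"title": "obs 4", "team": None},
]

teams = build_lookup(rows, "team")
events = normalize(rows, "team", teams)
print(teams)   # {'alpha': 1, 'beta': 2}
print(events[0])  # {'title': 'obs 1', 'team_id': 1}
```

The repeated text now lives once in `teams`, and each event row carries only the integer.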

## Importing events

Create the fact table for events in the public schema, where it will be globally accessible. Since this is not imported directly from CSV, we can type the fields.

In [None]:
%%sql
create table events(
    id serial primary key,
    time_stamp timestamptz not null,
    title varchar(500),
    description text,
    event_type_id int,
    spass_type_id int,
    target_id int,
    team_id int,
    request_id int
);

- No `not null` constraints, _except on `time_stamp`_; every other field should be able to accept null
- When pulling from `import.master_plan`, remember to cast the fields, e.g. `date::timestamptz` to convert our string date into a timezone-aware timestamp
- How do we know that a field can be safely cast to the type we want? We don't, really, until we examine it or try:
  - `select date::timestamptz from import.master_plan` will fail
  - something inside is formatted wrong
  - definitely ran into this multiple times when loading into BigQuery
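
One way to "examine it" before casting is to pull the column into Python and list the values that don't parse. A rough sketch: `find_bad_timestamps` is a hypothetical helper, the sample values and the `%Y-%m-%d` format are assumptions, and `strptime` is stricter than Postgres's own parser, so this only approximates what the cast would accept.

```python
from datetime import datetime


def find_bad_timestamps(values, fmt):
    """Return (index, value) pairs that do not parse with `fmt`."""
    bad = []
    for i, value in enumerate(values):
        if value is None:
            continue  # NULL casts cleanly to NULL, so it is not a problem row
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            bad.append((i, value))
    return bad


# Hypothetical sample; the real column has to be checked by inspection
sample = ["2004-05-15", "2004-05-16", "not a date", None, "2004-13-01"]
bad = find_bad_timestamps(sample, "%Y-%m-%d")
print(bad)  # [(2, 'not a date'), (4, '2004-13-01')]
```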

## Datetimes

Datetimes are a pain to deal with. NASA apparently uses a `year-dayofyear` format to avoid leap-year headaches.
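
For reference, Python's `strptime` can read a day-of-year value with `%j`. The strings below are hypothetical examples of a `year-dayofyear` timestamp, not values taken from the import.

```python
from datetime import datetime

fmt = "%Y-%jT%H:%M:%S"  # %j is the day of the year, 001 through 366

# Hypothetical values: the same day number in a leap and a non-leap year
leap = datetime.strptime("2004-136T18:40:00", fmt)
non_leap = datetime.strptime("2005-136T18:40:00", fmt)

print(leap.isoformat())      # 2004-05-15T18:40:00
print(non_leap.isoformat())  # 2005-05-16T18:40:00
```

Day 136 lands on a different calendar date depending on whether the year has a February 29.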

Postgres always stores `timestamptz` values in UTC. On retrieval it converts them to the session's `TimeZone` setting, which by default comes from the server config (typically picked up from the server's system timezone at install time).

When casting to `timestamptz`, specify the timezone; otherwise, again, Postgres will assume the session timezone, which may not be what we need. Specify with `'2001-01-01'::timestamp at time zone 'UTC'`, which reads the naive timestamp as UTC and returns a `timestamptz`.
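
The same distinction shows up with Python datetimes, which makes it easy to see why the assumed zone matters. A small sketch, using a made-up server zone at UTC-8; attaching `timezone.utc` to a naive value is roughly analogous to `::timestamp at time zone 'UTC'`.

```python
from datetime import datetime, timedelta, timezone

naive = datetime(2001, 1, 1, 12, 0, 0)  # no timezone attached

# Read the naive value as UTC
as_utc = naive.replace(tzinfo=timezone.utc)

# Read the same naive value as a hypothetical server zone at UTC-8
server_tz = timezone(timedelta(hours=-8))
as_server = naive.replace(tzinfo=server_tz)

# The two readings are different instants, 8 hours apart
print(as_server - as_utc)  # 8:00:00

# Showing the UTC instant in the server zone changes only how it is rendered
print(as_utc.astimezone(server_tz).isoformat())  # 2001-01-01T04:00:00-08:00
```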

Instead of `date`, import `start_time_utc` and specify the UTC timezone:


In [None]:
%%sql
insert into events(
    time_stamp,
    title,
    description
)
select
    start_time_utc::timestamp at time zone 'UTC',
    title,
    description
from import.master_plan;

## Lookup tables

The point of these is to replace each distinct value with an integer that maps to the actual text in that field's lookup. E.g. for `team`, the `teams` lookup holds the unique team names, each with an integer `id` as its primary key. Now our models can rely on this lookup and use an integer to represent each team instead of the full varchar. This could potentially speed up compute, since comparisons are done on numbers instead of text.

Caveat: with cheaper storage and the decoupling of storage from compute, this has a lower cost impact than it used to.

Execution:

In [None]:
%%sql
-- idempotency
drop table if exists teams;
-- create lookup
select distinct(team) as description 
into teams 
from import.master_plan;

-- add primary key
alter table teams
add id serial primary key;

Repeat for other lookups, with this pattern:

- using integer `id` as primary
- `description` as text
- not using repetitive naming scheme, e.g. `teams.team`
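
Since every lookup follows the same pattern, the statements could be generated rather than hand-copied. A sketch only: `lookup_sql` is a hypothetical helper, and the pair list holds just the two mappings that appear in these notes (`teams` from `team`, `event_types` from `library_definition`); the rest would need filling in. Names are interpolated directly into the SQL, so this is only suitable for trusted, hard-coded identifiers.

```python
def lookup_sql(table, source_column, source="import.master_plan"):
    """Build the drop / select-into / add-key statements for one lookup."""
    return (
        f"drop table if exists {table};\n"
        f"select distinct({source_column}) as description\n"
        f"into {table}\n"
        f"from {source};\n"
        f"alter table {table}\n"
        f"add id serial primary key;"
    )


# (lookup table, source column) pairs known from these notes
lookups = [("teams", "team"), ("event_types", "library_definition")]

script = "\n\n".join(lookup_sql(table, col) for table, col in lookups)
print(script)
```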

Relating the lookups back to the events fact table can be tricky. Or, well, just a lot of joins.

```sql
insert into events(
    time_stamp,
    title,
    description,
    event_type_id,
    target_id,
    ...
)
select
    start_time_utc::timestamp at time zone 'UTC',
    ...
    event_types.id as event_type_id,
    targets.id as target_id,
    ...
from import.master_plan
left join event_types
    on event_types.description = import.master_plan.library_definition
...
-- repeat for each lookup
;
```

Left join is particularly important; this keeps all data in the `from` table, and pads nonmatches with nulls.
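
To see the difference concretely, here is a small comparison of left and inner joins. It runs against an in-memory SQLite database as a stand-in for Postgres (an assumption made so the example is self-contained), with made-up rows and simplified table definitions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table master_plan (title text, team text);
    create table teams (id integer primary key, description text);
    insert into master_plan values
        ('obs 1', 'alpha'), ('obs 2', null), ('obs 3', 'gamma');
    insert into teams (description) values ('alpha'), ('beta');
""")

query = """
    select m.title, t.id
    from master_plan m
    {join} join teams t on t.description = m.team
    order by m.title
"""

left_rows = conn.execute(query.format(join="left")).fetchall()
inner_rows = conn.execute(query.format(join="inner")).fetchall()

print(left_rows)   # [('obs 1', 1), ('obs 2', None), ('obs 3', None)]
print(inner_rows)  # [('obs 1', 1)]
```

The inner join silently drops the two events without a matching team; the left join keeps them with a null id.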

Now that we have all our lookups, we can leverage them in our `create table events` statement with `references`:

```sql
create table events(
    event_type_id int references event_types(id),
    ...)
```

`references` creates a foreign key constraint on the `event_type_id` field: every non-null value here must be one of the `id` values in the `event_types` lookup. It's a form of data validation.
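
A quick way to watch the constraint reject bad data, again using in-memory SQLite as a stand-in for Postgres (an assumption for the sake of a runnable example). SQLite only enforces foreign keys after `pragma foreign_keys = on`; Postgres enforces them without any such switch. The event type row is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("pragma foreign_keys = on")  # SQLite-only: enforcement is off by default
conn.executescript("""
    create table event_types (id integer primary key, description text);
    create table events (
        id integer primary key,
        event_type_id int references event_types(id)
    );
    insert into event_types (description) values ('hypothetical type');
""")

conn.execute("insert into events (event_type_id) values (1)")     # id exists
conn.execute("insert into events (event_type_id) values (null)")  # null is allowed

try:
    conn.execute("insert into events (event_type_id) values (99)")  # no such id
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("select count(*) from events").fetchone()[0]
print(rejected, row_count)  # True 2
```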

Now put it all together.

- `scripts/create_table_master_plan.sql` creates the import schema and the master_plan table
- `scripts/normalize.sql` creates the lookups, creates the events table, and jams data in from master_plan via the lookups
- need to fill out the scripts
