# INMS - sniffing the sky

The Ion and Neutral Mass Spectrometer (INMS) "sniffs" space to analyze which chemicals are out there. It ionizes incoming matter with an electron beam and produces a mass spectrum via a quadrupole (a set of four steel rods with electrical currents running through them). Positional data is recorded as part of the analysis metadata. Cross-referencing it with the mission plan lets us find the closest flybys.

In [1]:
import os
from dotenv import load_dotenv

In [2]:
load_dotenv("../.env")
user = os.environ.get('POSTGRES_USER')
pw = os.environ.get('POSTGRES_PASSWORD')
db_name = os.environ.get('POSTGRES_DB')
host = 'localhost'
port = 5432
conn_str = f'postgresql://{user}:{pw}@{host}:{port}/{db_name}'

In [3]:
%load_ext sql

In [6]:
%sql $conn_str

## dataset structure

The dataset is large, so we need to be selective about which files to use. INMS data is split into folders by year, month, and day of year, each holding a mess of CSVs. We can limit our search by working out the year and day of year for each suspected flyby, using our `enceladus_events` materialized view:

In [7]:
%%sql
select 
    date_part('year', date), 
    to_char(time_stamp, 'DDD')
from enceladus_events
where event like '%closest%'
order by time_stamp;

 * postgresql://postgres:***@localhost:5432/enceladus
24 rows affected.


date_part,to_char
2005.0,48
2005.0,68
2005.0,195
2008.0,72
2008.0,224
2008.0,283
2008.0,305
2009.0,306
2009.0,306
2009.0,325
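
Each year and day-of-year pair can be turned into a folder to look in. The following is a minimal Python sketch, assuming a `<root>/<year>/<day of year>` layout like the sample path `2005/048/...` used further down. The helper names and the `inms` root folder are hypothetical.

```python
from datetime import date
from pathlib import Path

def day_of_year(d: date) -> int:
    """Day of year, counting from 1 for January 1st."""
    return d.timetuple().tm_yday

def flyby_dir(root: Path, year: int, doy: int) -> Path:
    """Folder expected to hold one day's CSVs (the layout is an assumption)."""
    return root / f"{year:04d}" / f"{doy:03d}"

flyby = date(2005, 2, 17)  # date of the first flyby in the results
print(flyby_dir(Path("inms"), flyby.year, day_of_year(flyby)).as_posix())
# inms/2005/048
```

Zero-padding the day to three digits matters here, because the query output shows `48` while the folder in the sample path is named `048`.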


The dataset includes an FMT file, a manifest that describes the CSV columns:

- `ALT_T` - altitude of spacecraft above target body within 1 hour of closest flyby
- `TARGET` - target body

Search for `TARGET = 'ENCELADUS'` and look for the lowest `ALT_T`?
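
Before loading anything into Postgres, the same idea can be tried on a single file. This is a rough Python sketch, assuming the CSV has a header row with `TARGET` and `ALT_T` columns as named in the manifest. The sample rows are made up and `lowest_altitude` is a hypothetical helper.

```python
import csv
import io

def lowest_altitude(lines, target="ENCELADUS", target_col="TARGET", alt_col="ALT_T"):
    """Return the smallest altitude among rows whose target matches, or None."""
    best = None
    for row in csv.DictReader(lines):
        if (row.get(target_col) or "").strip().upper() != target:
            continue
        try:
            alt = float(row[alt_col])
        except (KeyError, TypeError, ValueError):
            continue  # blank or non-numeric altitude, e.g. extra header rows
        if best is None or alt < best:
            best = alt
    return best

# Made-up sample data standing in for one INMS CSV.
sample = io.StringIO(
    "SCLK,TARGET,ALT_T\n"
    "1,ENCELADUS,500.37\n"
    "2,TITAN,100.0\n"
    "3,ENCELADUS,168.012\n"
    "4,ENCELADUS,\n"
)
print(lowest_altitude(sample))  # 168.012
```

Rows with a missing or unparseable altitude are skipped rather than treated as zero, so they cannot masquerade as a very close flyby.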

## CSVs and bash

- `ls ./**/*.csv | wc -l`
  - lists all CSVs and pipes the output to `wc`, which counts only lines
  - returns the number of CSVs
  - loop over each CSV and run `COPY FROM`?
  - or concatenate into a single CSV, then import?
  - the second is better: it avoids a partially copied table if one file fails
- `cat ./**/*.csv > inms.csv`
  - what are the CSV headers?
    - rows 1-3 are headers
    - remove from each CSV?
    - import as is and remove with SQL
  - do all CSVs have the same headers?
- `csvkit` has useful tools for importing CSVs into a database
  - `csvsql` takes a CSV and generates a `CREATE TABLE` SQL statement
    - it defaults every text column to VARCHAR
    - in Postgres, TEXT and VARCHAR are stored the same way; TEXT simply has no length limit to check, so VARCHAR gains us nothing here
    - use `sed` to convert VARCHAR to text
   
```bash
csvsql 2005/048/200504800_L1A_05.csv \
    -i postgresql \
    --tables "import.inms" \
    --no-constraints \
    --overwrite | sed 's/VARCHAR/text/g' > import.sql
```

Use the generated `import.sql` as a template and edit it by hand:

- add `DROP ... IF EXISTS` at the top
- add `COPY FROM` at the end
- fix the table name: we want the table in the `import` schema, but running `import.sql` as generated creates a table literally named `import.inms` in the `public` schema, so we edit the file manually to correct this
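
The manual edits can also be scripted. This is a minimal Python sketch, assuming `csvsql` emitted the table name as a single quoted identifier (`"import.inms"`), which is what would put it in `public`. The `build_import_sql` helper and the CSV path are placeholders, not part of the original workflow.

```python
def build_import_sql(create_stmt: str,
                     table: str = "import.inms",
                     csv_path: str = "/path/to/inms.csv") -> str:
    """Wrap a generated CREATE TABLE statement with DROP and COPY steps."""
    # Unquote the name so Postgres reads it as schema.table, not one identifier.
    fixed = create_stmt.replace(f'"{table}"', table).strip()
    return "\n".join([
        f"DROP TABLE IF EXISTS {table};",
        fixed,
        f"COPY {table} FROM '{csv_path}' WITH (FORMAT csv);",
    ]) + "\n"

# Shortened stand-in for the csvsql output; the real statement has many columns.
generated = 'CREATE TABLE "import.inms" (\n\tsclk text,\n\talt_t text\n);\n'
print(build_import_sql(generated))
```

The sketch assumes the `import` schema already exists. Server-side `COPY ... FROM` reads the file on the database host, so the path has to be visible to the Postgres server.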

### error: extra data

Starting in 2015, the CSVs gained a number of new columns, which broke the import because we used a 2005 CSV as the template for our table. The manifest explains the new columns.
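
This kind of drift can be spotted before importing by comparing the header row of every file. Here is a small Python sketch of that check, using in-memory stand-ins for the real CSVs. The file names, column names and `group_by_header` helper are all made up for illustration.

```python
import csv
import io
from collections import defaultdict

def group_by_header(files):
    """Map each distinct header row (as a tuple) to the names of files using it."""
    groups = defaultdict(list)
    for name, handle in files:
        header = next(csv.reader(handle), [])  # first row only
        groups[tuple(header)].append(name)
    return dict(groups)

# In-memory stand-ins for real files.
files = [
    ("2005.csv", io.StringIO("sclk,target,alt_t\n1,ENCELADUS,500\n")),
    ("2008.csv", io.StringIO("sclk,target,alt_t\n2,ENCELADUS,50\n")),
    ("2015.csv", io.StringIO("sclk,extra,target,alt_t\n3,x,ENCELADUS,49\n")),
]
for header, names in group_by_header(files).items():
    print(len(header), names)
```

More than one group in the result means the files do not share a layout. With real data, the `(name, handle)` pairs could come from `Path(".").glob("**/*.csv")` with each file opened using `newline=""`.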

Options:

1. cut the extra columns - `cut -d ',' -f2 --complement inms.csv`
1. load 2015 as a separate table?
   
Best to cut the extra columns from the 2015 CSVs after visual inspection in Excel:


```bash
cat 2015/**/*.csv > 2015.csv
cut -d ',' -f1,3-47,50-77,82-83 2015.csv > inms_2.csv

# manually move all other years into ./good/
cat good/**/*.csv > inms_1.csv

# combine to one
cat inms_1.csv inms_2.csv > inms.csv
```
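
One caveat with `cut` is that it splits on every comma, including commas inside quoted fields. If that turns out to matter, the same column selection can be done with a CSV-aware reader. This is a hedged Python sketch: `parse_fields` and `select_columns` are hypothetical helpers that accept the same 1-based field spec as `cut -f`.

```python
import csv
import io

def parse_fields(spec):
    """Turn a cut-style spec such as "1,3-5" into zero-based column indexes."""
    indexes = []
    for part in spec.split(","):
        start, _, end = part.partition("-")
        first = int(start)
        last = int(end) if end else first
        indexes.extend(range(first - 1, last))
    return indexes

def select_columns(src, dst, spec):
    """Copy only the requested columns from src to dst, row by row."""
    keep = parse_fields(spec)
    writer = csv.writer(dst, lineterminator="\n")
    for row in csv.reader(src):
        writer.writerow([row[i] for i in keep if i < len(row)])

# Tiny made-up example: drop column 2, which contains a quoted comma.
src = io.StringIO('a,b,c,d\n1,"x,y",3,4\n')
dst = io.StringIO()
select_columns(src, dst, "1,3-4")
print(dst.getvalue())
```

For the spec used above, `len(parse_fields("1,3-47,50-77,82-83"))` gives 76, the number of columns kept from each 2015 row.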

- import with `psql enceladus < import.sql`
- remove all the header rows with SQL - `delete from import.inms where sclk='sclk';`
- remove all entries with no `sclk` (spacecraft clock) value - `delete from import.inms where sclk is null or sclk='';`
- look for all entries with Enceladus as the target and create a view

In [None]:
%%sql
-- remove header rows (artifact of concatenation)
delete from import.inms where sclk='sclk';

-- remove null timestamp entries
delete from import.inms where sclk is null or sclk='';

-- create view for enceladus
drop materialized view if exists flyby_altitudes;
create materialized view flyby_altitudes as
select
    (sclk::timestamp) as time_stamp,
    alt_t::numeric(10,3) as altitude
from import.inms
where target='ENCELADUS'
and alt_t IS NOT NULL;

## INMS data inspection

lowest point of each flyby (first attempt, which fails because `time_stamp` is selected without being grouped or aggregated):

In [9]:
%%sql
select time_stamp,
min(altitude)
from flyby_altitudes
group by DATE(time_stamp)
order by min(altitude);

 * postgresql://postgres:***@localhost:5432/enceladus
(psycopg2.errors.GroupingError) column "flyby_altitudes.time_stamp" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: select time_stamp,
               ^

[SQL: select time_stamp,
min(altitude)
from flyby_altitudes
group by DATE(time_stamp)
order by min(altitude);]
(Background on this error at: https://sqlalche.me/e/20/f405)


## nadirs

To find the closest approach of each flyby in `flyby_altitudes`, we look for `min(altitude)` grouped by week, given that flybys must be at least two weeks apart for Cassini to slingshot around Saturn or Titan.

In [12]:
%%sql
select
    date_part('year',time_stamp)::integer as year,
    date_part('week',time_stamp)::integer as week,
    min(altitude) as altitude
from flyby_altitudes
group by
    date_part('year',time_stamp),
    date_part('week',time_stamp) 
order by year, week

 * postgresql://postgres:***@localhost:5432/enceladus
23 rows affected.


year,week,altitude
2005,7,1272.075
2005,10,500.37
2005,28,168.012
2008,11,50.292
2008,33,53.353
2008,41,28.576
2008,44,173.044
2009,45,98.901
2009,47,1596.561
2010,17,3771.195
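
The grouping logic is easy to check outside the database. Below is a minimal Python sketch of the same idea, keyed on calendar year plus ISO week number (which is what `date_part('week', ...)` returns in Postgres). The sample readings are invented and `lows_by_week` is a hypothetical helper.

```python
from datetime import datetime

def lows_by_week(samples):
    """Lowest altitude per (calendar year, ISO week) from (timestamp, altitude) pairs."""
    lows = {}
    for ts, alt in samples:
        key = (ts.year, ts.isocalendar()[1])
        if key not in lows or alt < lows[key]:
            lows[key] = alt
    return lows

# Invented readings around two of the flyby dates.
samples = [
    (datetime(2005, 2, 17, 3, 30), 1272.075),
    (datetime(2005, 2, 17, 3, 31), 1300.0),
    (datetime(2005, 3, 9, 9, 8), 500.37),
]
print(lows_by_week(samples))
# {(2005, 7): 1272.075, (2005, 10): 500.37}
```

One thing to keep in mind: ISO weeks can straddle New Year, so pairing the ISO week with the calendar year may split or merge a flyby that falls in the first days of January. Postgres offers `date_part('isoyear', ...)` for that case.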


Next, find the exact `time_stamp` associated with each nadir using a CTE.

Due to the speed of Cassini, multiple timestamps are associated with each flyby's minimum altitude. Without any additional context, the next best thing is to take the midpoint between the earliest and latest of the timestamps associated with each nadir.

In [8]:
%%sql
-- flybys table
with lows_by_weeks as 
    (select 
        date_part('year', time_stamp) as year,
        date_part('week', time_stamp) as week,
        min(altitude) as alt
    from flyby_altitudes
    group by 
        date_part('year', time_stamp),
        date_part('week', time_stamp)
    ), 
nadirs as (
    select 
        f.time_stamp,
        l.alt
    from lows_by_weeks l
    inner join flyby_altitudes f
    on l.alt = f.altitude
)
select 
    min(time_stamp) + (max(time_stamp) - min(time_stamp))/2 time_stamp_avg,
    alt
from nadirs
group by 
    alt -- don't group by dates a second time
order by time_stamp_avg;

 * postgresql://postgres:***@localhost:5432/enceladus
23 rows affected.


time_stamp_avg,alt
2005-02-17 03:30:12.119000,1272.075
2005-03-09 09:08:03.472500,500.37
2005-07-14 19:55:22.330000,168.012
2008-03-12 19:06:11.509000,50.292
2008-08-11 21:06:18.574000,53.353
2008-10-09 19:06:39.724000,28.576
2008-10-31 17:14:51.429000,173.044
2009-11-21 02:09:56.371000,1596.561
2010-05-18 06:04:40.301000,437.292
2010-08-13 22:30:51.975000,2555.18
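
The `min + (max - min)/2` expression in the final select is a midpoint between the first and last matching timestamps, not a mean over all of them. The same arithmetic in Python, as a small sketch with invented timestamps and a hypothetical `midpoint` helper:

```python
from datetime import datetime

def midpoint(timestamps):
    """Halfway point between the earliest and latest timestamp."""
    first, last = min(timestamps), max(timestamps)
    return first + (last - first) / 2

# Invented readings that all share one minimum altitude.
readings = [
    datetime(2005, 2, 17, 3, 30, 12),
    datetime(2005, 2, 17, 3, 30, 13),
    datetime(2005, 2, 17, 3, 30, 16),
]
print(midpoint(readings))  # 2005-02-17 03:30:14
```

Writing it as an offset from the earliest timestamp avoids having to add two timestamps together, which neither Python nor Postgres allows.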


Duplicates showed up for 2011-10-01 and 2012-04-14 when the last query also grouped by date, so group by `alt` only in the final select.