# Explore Marine Data

This notebook explores [NCEI data](https://www.ncei.noaa.gov/cdo-web/datasets), specifically the [marine](https://www.ncei.noaa.gov/data/global-marine/) dataset, to find out what it contains.

In [1]:
import duckdb
import glob

## Download

Use the scripts in the data directory to download the data: run `aria2c -i urls.txt`, then decompress with the `.sh` script. You will need to create the directories yourself.

## Load Data

In [2]:
marine_dir = "/home/squirt/Documents/data/ncei/marine/marine_data/"

Load into DuckDB.

The columns in the sample data are not the same as the columns in the real data. I'm also not sure whether the columns are the same across all of the CSVs. The real world is messy.
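One way to settle whether the columns match across files is to read only the header row of each CSV and count the distinct headers. The sketch below uses only the standard library; the `header_variants` helper and the `marine_data` directory name are illustrative assumptions, not part of the original scripts.

```python
import csv
import glob
import os
from collections import Counter


def header_variants(csv_paths):
    """Count how many files share each distinct header (tuple of column names)."""
    variants = Counter()
    for path in csv_paths:
        with open(path, newline="") as handle:
            # Only the first row is read, so this stays cheap on large files.
            header = next(csv.reader(handle), [])
        variants[tuple(header)] += 1
    return variants


if __name__ == "__main__":
    # Hypothetical location: point this at the directory holding the CSVs.
    paths = glob.glob(os.path.join("marine_data", "*.csv"))
    for header, count in header_variants(paths).most_common():
        print(f"{count} file(s) with {len(header)} columns")
```

If this prints more than one line, the files do not all share one schema, and the differing headers can be compared directly.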

In [3]:
all_csv_columns = ["STATION", "DATE", "LATITUDE", "LONGITUDE", "ELEVATION", "NAME", "IMMA_VER", "ATTM_CT", "TIME_IND", "LL_IND", "SHIP_COURSE", "SHIP_SPD", "ID_IND", "COUNTRY_CODE", "WIND_DIR_IND", "WIND_DIR", "WIND_SPD_IND", "WIND_SPEED", "VV_IND", "VISIBILITY", "PRES_WX", "PAST_WX", "SEA_LVL_PRES", "CHAR_PPP", "AMT_PRES_TEND", "IND_FOR_TEMP", "AIR_TEMP", "IND_FOR_WBT", "WET_BULB_TEMP", "DPT_IND", "DEW_PT_TEMP", "SST_MM", "SEA_SURF_TEMP", "TOT_CLD_AMT", "LOW_CLD_AMT", "LOW_CLD_TYPE", "HGT_IND", "CLD_HGT", "MID_CLD_TYPE", "HI_CLD_TYPE", "WAVE_PERIOD", "WAVE_HGT", "SWELL_DIR", "SWELL_PERIOD", "SWELL_HGT", "TEN_BOX_NUM", "ONE_BOX_NUM", "DECK", "SOURCE_ID", "PLATFORM_ID", "DUP_STATUS", "DUP_CHK", "NIGHT_DAY_FLAG", "TRIM_FLAG", "NCDC_QC_FLAGS", "SOURCE_EXCLUSION_FLAG", "OB_SOURCE", "OB_PLATFORM", "FM_CODE_VER", "STA_WX_IND", "PAST_WX2", "DIR_OF_SWELL2", "PER_OF_SWELL2", "HGT_OF_SWELL2", "IND_FOR_PRECIP", "QC_IND", "QC_IND_FOR_FIELDS", "MQCS_VER"]

Make a DuckDB table to store the data.

In [4]:
conn = duckdb.connect(database=":memory:", read_only=False)
table_name = 'marine_climate_data'

# Create DuckDB table
table_columns = "Station VARCHAR, Time DATETIME, Lat DOUBLE, Lon DOUBLE, WindSpeed DOUBLE, AirTemp DOUBLE, WetTemp DOUBLE, SeaTemp DOUBLE, CloudAmount DOUBLE"
conn.execute(f"CREATE TABLE {table_name} ({table_columns})")

<duckdb.DuckDBPyConnection at 0x749dff71e870>

Read the CSV columns into DuckDB.

In [5]:
# Get CSV files
csv_files = glob.glob(marine_dir + "*.csv")

# Need to map the CSV columns to the DuckDB table columns
csv_columns = ["STATION", "DATE", "LATITUDE", "LONGITUDE", "WIND_SPEED", "AIR_TEMP", "WET_BULB_TEMP", "SEA_SURF_TEMP", "TOT_CLD_AMT"] 
temp_table = 'temp_table'

for csv_file in csv_files:
    # Create a temporary table from the CSV file
    conn.execute(f"CREATE TABLE {temp_table} AS SELECT * FROM read_csv_auto('{csv_file}')")

    # Drop table if columns not present (because I don't really understand this data)
    # Fetch the column names from the table
    table_info = conn.execute(f"PRAGMA table_info({temp_table})").fetchall()
    temp_table_columns = [column[1] for column in table_info]

    # Compare the table's columns with the csv_columns list
    if set(csv_columns) - set(temp_table_columns):
        # Drop the table if the columns don't match
        conn.execute(f"DROP TABLE IF EXISTS {temp_table}")
    else:
        # Insert data from temporary table into final table with column mapping and type conversion
        query = f"""
        INSERT INTO {table_name} (Station, Time, Lat, Lon, WindSpeed, AirTemp, WetTemp, SeaTemp, CloudAmount)
        SELECT 
            STATION, 
            TRY_CAST(REPLACE(DATE, 'T', ' ') AS DATETIME), 
            CAST(LATITUDE AS DOUBLE), 
            CAST(LONGITUDE AS DOUBLE), 
            CAST(WIND_SPEED AS DOUBLE), 
            CAST(AIR_TEMP AS DOUBLE), 
            CAST(WET_BULB_TEMP AS DOUBLE), 
            CAST(SEA_SURF_TEMP AS DOUBLE), 
            CAST(TOT_CLD_AMT AS DOUBLE)
        FROM {temp_table}
        """
        conn.execute(query)
    
        # Drop the temporary table so the next CSV file can reuse the name
        conn.execute(f"DROP TABLE {temp_table}")
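The loop above skips any file that lacks a required column, but it does so silently. A minimal sketch of a helper that reports which columns are missing, so skipped files can be logged, might look like this. The `missing_columns` name and the sample column lists are hypothetical and only illustrate the set-difference check.

```python
def missing_columns(required, available):
    """Return the required column names that are absent from `available`, in order."""
    available_set = set(available)
    return [name for name in required if name not in available_set]


# Illustrative inputs: a file that has no AIR_TEMP column.
required = ["STATION", "DATE", "AIR_TEMP"]
available = ["STATION", "DATE", "LATITUDE"]

missing = missing_columns(required, available)
if missing:
    print(f"skipping file, missing columns: {missing}")
```

Inside the loop, `available` would be the `temp_table_columns` list, and the printed message could include `csv_file` to show which files were dropped.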

To verify that the data was ingested, print the row count.

In [6]:
print(conn.execute(f"SELECT COUNT(*) FROM {table_name}").fetchall())

[(511470,)]


## Quick Investigation

Just want to peek at the data to see what we're working with. Can we train a Gaussian process?

In [9]:
import pandas as pd

Pull a few records from the database.

In [10]:
df = conn.execute(f"SELECT * FROM {table_name} LIMIT 10;").fetchdf()
print(df.head())

  Station                Time    Lat     Lon  WindSpeed  AirTemp  WetTemp  \
0   52841 2000-03-01 03:19:00  10.74  163.21        NaN      NaN      NaN   
1   52841 2000-03-01 05:03:00  10.74  163.21        NaN      NaN      NaN   
2   52841 2000-03-01 06:39:00  10.74  163.21        NaN      NaN      NaN   
3   52841 2000-03-01 07:56:00  10.79  163.00        NaN      NaN      NaN   
4   52841 2000-03-01 09:27:00  10.74  163.21        NaN      NaN      NaN   

   SeaTemp  CloudAmount  
0    281.0          NaN  
1    281.0          NaN  
2    281.0          NaN  
3    281.0          NaN  
4    280.0          NaN  


The target is AirTemp, so drop every row where it is NULL.
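Before deleting anything, it can help to see how many NULLs each column holds. The sketch below only builds the SQL text, relying on the standard behavior that `COUNT(col)` ignores NULLs. The `null_count_query` helper is a hypothetical name, and the table and column names are the ones defined earlier in this notebook.

```python
def null_count_query(table, columns):
    """Build one SELECT that reports the total row count and the NULL count per column."""
    # COUNT(col) skips NULLs, so COUNT(*) - COUNT(col) is the number of NULLs.
    parts = ["COUNT(*) AS total_rows"]
    parts += [f"COUNT(*) - COUNT({col}) AS {col}_nulls" for col in columns]
    return f"SELECT {', '.join(parts)} FROM {table}"


query = null_count_query("marine_climate_data", ["AirTemp", "SeaTemp", "WindSpeed"])
print(query)
```

The resulting string can be passed to `conn.execute(query).fetchall()` to get one row of counts.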

In [11]:
conn.execute(f"DELETE FROM {table_name} WHERE AirTemp IS NULL")

<duckdb.DuckDBPyConnection at 0x749dff71e870>

How many records are left?

In [12]:
print(conn.execute(f"SELECT COUNT(*) FROM {table_name}").fetchall())

[(308269,)]


That looks like enough data for a simple model. Again, the goal is not to build the best model but to get familiar with Gaussian processes.

What does the data look like now?

In [14]:
df = conn.execute(f"SELECT * FROM {table_name} LIMIT 20;").fetchdf()
print(df.head(20))

   Station                Time   Lat    Lon  WindSpeed  AirTemp  WetTemp  \
0     WHRN 2000-03-03 18:00:00  19.8  167.9        5.0    244.0    222.0   
1     WHRN 2000-03-04 00:00:00  19.3  165.1       77.0    244.0    227.0   
2     WHRN 2000-03-04 06:00:00  18.8  163.2      113.0    244.0    217.0   
3     KIRH 2000-03-04 12:00:00  19.2  168.8      129.0    250.0    239.0   
4     WHRN 2000-03-04 12:00:00  18.2  160.7      118.0    233.0    216.0   
5     KIRH 2000-03-05 00:00:00  18.3  164.1      129.0    300.0    256.0   
6     WCPU 2000-03-08 18:00:00  19.9  163.5       26.0    250.0    226.0   
7     WCPU 2000-03-09 00:00:00  19.8  161.0       36.0    283.0    261.0   
8     HPEW 2000-03-11 00:00:00  14.9  161.1       98.0    280.0    260.0   
9     WPGK 2000-03-11 11:00:00  19.2  169.1       36.0    261.0    244.0   
10    HPEW 2000-03-11 12:00:00  12.9  163.1      103.0    260.0    240.0   
11    WPGK 2000-03-11 18:00:00  18.7  166.3       26.0    256.0    244.0   
12    WPGK 2

The data looks good, which is surprising. I think we can train a very simple model on it.
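One thing to check before modeling is the units. The magnitudes in the output above (AirTemp around 244 to 300 at tropical latitudes) suggest the values may be stored in tenths of a unit, so that 244.0 would mean 24.4 °C. This is an assumption based only on the numbers shown here and should be confirmed against the dataset documentation. Under that assumption, a conversion sketch with a hypothetical `from_tenths` helper could be:

```python
def from_tenths(value):
    """Convert a reading stored in tenths of a unit; None (missing) passes through."""
    # ASSUMPTION: the source stores readings multiplied by 10. Verify before use.
    return None if value is None else value / 10.0


# Illustrative row using values from the first record printed above.
row = {"AirTemp": 244.0, "WetTemp": 222.0, "SeaTemp": None}
scaled = {name: from_tenths(value) for name, value in row.items()}
print(scaled)
```

If the scaling is confirmed, the same division could be applied in the `INSERT ... SELECT` step instead, so the stored table already holds physical units.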

## Save Cleaned Data

I think we can now build a data loader with this. 

In [15]:
save_path = f'{marine_dir}marine_climate_data.snappy.parquet'  # marine_dir already ends with "/"
conn.execute(f"COPY (SELECT * FROM {table_name}) TO '{save_path}' (FORMAT 'parquet', CODEC 'snappy')")

<duckdb.DuckDBPyConnection at 0x749dff71e870>