# Explore Marine Data

This notebook explores [NCEI Data](https://www.ncei.noaa.gov/cdo-web/datasets), specifically the [marine](https://www.ncei.noaa.gov/data/global-marine/) dataset. The goal is to find out what is in this dataset.

In [1]:
import duckdb
import glob

## Download

Use the scripts in the data dir to download the data: run `aria2c -i urls.txt` to fetch the files, then decompress them with the `.sh` script. You will need to create the directories yourself.

## Load Data

In [2]:
marine_dir = "/home/squirt/Documents/data/ncei/marine/marine_data/"

Load the data into DuckDB.

Note: the columns in the sample are not the same as the columns in the real data.

In [3]:
all_csv_columns = ["STATION", "DATE", "LATITUDE", "LONGITUDE", "ELEVATION", "NAME", "IMMA_VER", "ATTM_CT", "TIME_IND", "LL_IND", "SHIP_COURSE", "SHIP_SPD", "ID_IND", "COUNTRY_CODE", "WIND_DIR_IND", "WIND_DIR", "WIND_SPD_IND", "WIND_SPEED", "VV_IND", "VISIBILITY", "PRES_WX", "PAST_WX", "SEA_LVL_PRES", "CHAR_PPP", "AMT_PRES_TEND", "IND_FOR_TEMP", "AIR_TEMP", "IND_FOR_WBT", "WET_BULB_TEMP", "DPT_IND", "DEW_PT_TEMP", "SST_MM", "SEA_SURF_TEMP", "TOT_CLD_AMT", "LOW_CLD_AMT", "LOW_CLD_TYPE", "HGT_IND", "CLD_HGT", "MID_CLD_TYPE", "HI_CLD_TYPE", "WAVE_PERIOD", "WAVE_HGT", "SWELL_DIR", "SWELL_PERIOD", "SWELL_HGT", "TEN_BOX_NUM", "ONE_BOX_NUM", "DECK", "SOURCE_ID", "PLATFORM_ID", "DUP_STATUS", "DUP_CHK", "NIGHT_DAY_FLAG", "TRIM_FLAG", "NCDC_QC_FLAGS", "SOURCE_EXCLUSION_FLAG", "OB_SOURCE", "OB_PLATFORM", "FM_CODE_VER", "STA_WX_IND", "PAST_WX2", "DIR_OF_SWELL2", "PER_OF_SWELL2", "HGT_OF_SWELL2", "IND_FOR_PRECIP", "QC_IND", "QC_IND_FOR_FIELDS", "MQCS_VER"]
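Since the columns differ between files, it can help to check each file's header before loading anything. The following is a minimal sketch using only the standard library, assuming each CSV has a single header row. The helper names `read_header`, `missing_columns`, and `report_missing` are hypothetical and not part of the original notebook.

```python
import csv
import glob
import os


def read_header(csv_path):
    """Return the column names from the first row of a CSV file."""
    with open(csv_path, newline="") as handle:
        return next(csv.reader(handle), [])


def missing_columns(csv_path, wanted):
    """Return the wanted columns that are absent from the file's header."""
    present = set(read_header(csv_path))
    return [name for name in wanted if name not in present]


def report_missing(directory, wanted):
    """Map each CSV file in `directory` to the wanted columns it lacks."""
    report = {}
    for csv_path in sorted(glob.glob(os.path.join(directory, "*.csv"))):
        missing = missing_columns(csv_path, wanted)
        if missing:
            # Only files that lack at least one wanted column are reported.
            report[csv_path] = missing
    return report
```

Calling `report_missing(marine_dir, ["WIND_SPEED", "AIR_TEMP"])` would then list the files that cannot supply those columns, without reading any file past its first row.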

In [4]:
conn = duckdb.connect(database=":memory:", read_only=False)
csv_files = glob.glob(marine_dir + "*.csv")

table_name = 'marine_climate_data'

# Create DuckDB table
table_columns = "Station VARCHAR, Time DATETIME, Lat DOUBLE, Lon DOUBLE, WindSpeed DOUBLE, AirTemp DOUBLE, WetTemp DOUBLE, SeaTemp DOUBLE, CloudAmount DOUBLE"
conn.execute(f"CREATE TABLE {table_name} ({table_columns})")

csv_columns = ["STATION", "DATE", "LATITUDE", "LONGITUDE", "WIND_SPEED", "AIR_TEMP", "WET_BULB_TEMP", "SEA_SURF_TEMP", "TOT_CLD_AMT"]
csv_columns_string = ','.join(csv_columns)
temp_table = 'temp_table'

for csv_file in csv_files:
    print(f'processing: {csv_file}')
    # Load the CSV file into a temporary table so its columns can be mapped
    conn.execute(f"CREATE TABLE {temp_table} AS SELECT * FROM read_csv_auto('{csv_file}')")
    print(conn.execute(f"SELECT COUNT(*) FROM {temp_table}").fetchall())


    # Insert data from temporary table into final table with column mapping and type conversion
    query = f"""
    INSERT INTO {table_name} (Station, Time, Lat, Lon, WindSpeed, AirTemp, WetTemp, SeaTemp, CloudAmount)
    SELECT 
        STATION, 
        TRY_CAST(REPLACE(DATE, 'T', ' ') AS DATETIME), 
        CAST(LATITUDE AS DOUBLE), 
        CAST(LONGITUDE AS DOUBLE), 
        CAST(WIND_SPEED AS DOUBLE), 
        CAST(AIR_TEMP AS DOUBLE), 
        CAST(WET_BULB_TEMP AS DOUBLE), 
        CAST(SEA_SURF_TEMP AS DOUBLE), 
        CAST(TOT_CLD_AMT AS DOUBLE)
    FROM {temp_table}
    """
    conn.execute(query)
    
    # Drop the temporary table before processing the next CSV file
    conn.execute(f"DROP TABLE {temp_table}")

processing: /home/squirt/Documents/data/ncei/marine/marine_data/20_160_10_170.csv
[(458,)]
processing: /home/squirt/Documents/data/ncei/marine/marine_data/20_-160_10_-150.csv
[(5118,)]
processing: /home/squirt/Documents/data/ncei/marine/marine_data/90_-40_80_-30.csv
[(48,)]


BinderException: Binder Error: Referenced column "WIND_SPEED" not found in FROM clause!
Candidate bindings: "temp_table.ID_IND"
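The error above shows that at least one file has no `WIND_SPEED` column, so a fixed `SELECT` list cannot work for every file. One possible workaround is to build the `SELECT` list per file and substitute `NULL` for any column the file lacks. This is a sketch under assumptions, not a tested fix for this dataset: `build_select_list` and `COLUMN_MAP` are hypothetical names, and it uses a plain `TRY_CAST` for every column, so the `REPLACE(DATE, 'T', ' ')` step from the query above would need to be kept separately if the timestamp cast fails.

```python
def build_select_list(available, mapping):
    """Build a SQL SELECT list, using NULL for columns the file lacks.

    available: column names present in the current CSV file.
    mapping: (source_column, target_column, sql_type) tuples.
    """
    present = set(available)
    parts = []
    for source, target, sql_type in mapping:
        if source in present:
            # Column exists: convert it, yielding NULL on a failed cast.
            expr = f'TRY_CAST("{source}" AS {sql_type})'
        else:
            # Column is missing from this file: fill with a typed NULL.
            expr = f"CAST(NULL AS {sql_type})"
        parts.append(f"{expr} AS {target}")
    return ", ".join(parts)


# Hypothetical mapping mirroring the target table defined earlier.
COLUMN_MAP = [
    ("STATION", "Station", "VARCHAR"),
    ("DATE", "Time", "TIMESTAMP"),
    ("LATITUDE", "Lat", "DOUBLE"),
    ("LONGITUDE", "Lon", "DOUBLE"),
    ("WIND_SPEED", "WindSpeed", "DOUBLE"),
    ("AIR_TEMP", "AirTemp", "DOUBLE"),
    ("WET_BULB_TEMP", "WetTemp", "DOUBLE"),
    ("SEA_SURF_TEMP", "SeaTemp", "DOUBLE"),
    ("TOT_CLD_AMT", "CloudAmount", "DOUBLE"),
]
```

The returned string could be interpolated into the `INSERT ... SELECT` query in place of the hard-coded column list, with `available` taken from the temporary table's columns (for example, from the output of `DESCRIBE temp_table`).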

In [None]:
print(conn.execute(f"SELECT COUNT(*) FROM {table_name}").fetchall())