# Cross Validation for IDW Interpolation 
## Task 2A (IDW for continuous & discrete)

This notebook contains Python code that performs cross validation (CV) of Inverse Distance Weighting (IDW) interpolation in the arcpy environment for six water quality parameters:
- Dissolved oxygen (DO_mgl)
- Salinity (Sal_ppt)
- Turbidity (Turb_ntu)
- Water temperature (T_c)
- Secchi depth (Secc_m)
- Total Nitrogen (TN_mgl)

The analysis is conducted separately for each of the following water bodies:
- Guana Tolomato Matanzas (GTM)
- Estero Bay (EB)
- Charlotte Harbor (CH)
- Biscayne Bay (BB)
- Big Bend Seagrasses (BBS)

**Tasks:**  

- **Task 2A: Calculate the root mean square error (RMSE) and mean error (ME) for IDW results using both continuous and discrete data.**

- Task 2B: Calculate the RMSE and ME for IDW results using continuous data.
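
As a reference for the two accuracy metrics named in the tasks, the sketch below computes RMSE and ME from paired observed and predicted values. The function names and sample values are illustrative assumptions, and the sign convention (predicted minus observed) is also an assumption; the `idw_rk` module may define ME differently.

```python
import math

def rmse(observed, predicted):
    """Root mean square error between paired observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - o) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mean_error(observed, predicted):
    """Mean error (bias). Assumed convention: predicted minus observed,
    so a positive value means over-prediction on average."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p - o for o, p in zip(observed, predicted)) / len(observed)

# Illustrative values only, not taken from the project data
observed = [6.0, 6.5, 7.0]
predicted = [6.5, 6.0, 7.5]
print("RMSE:", rmse(observed, predicted))
print("ME:", mean_error(observed, predicted))
```

RMSE measures the typical size of the prediction error, while ME shows whether the predictions are biased high or low, which is why the two are reported together.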

Time periods (seasons) used for the Task 2A tests, covering one year before and one year after each storm event.
<br>
<div style="text-align: left;">
    <img src="../misc/TimePeriods.png" style="display: block; margin-left: 0; margin-right: auto; width: 600px;"/>
</div>

Summary of IDW and RK accuracy assessments.
<br>
<div style="text-align: left;">
    <img src="../misc/Table3.png" style="display: block; margin-left: 0; margin-right: auto; width: 600px;"/>
</div>

**Contents:**
* [1. Data Preprocessing](#reg_preprocessing)
    * [1.2 Subsetting Data](#reg_subset)
* [2.6 Fill NaN RowID with Unique ID](#reg_id)
* [3. Create Shapefiles](#reg_create_shp)
* [4. Cross Validation for IDW](#reg_cv_idw)

# 0. Loading packages

In [2]:
import pandas as pd
import numpy  as np
import arcpy
from arcpy.sa import *
import os, time, math, importlib, sys
path = r'E:\Projects\SEACAR_WQ_2024\git\misc'
sys.path.insert(0, path)
import idw_rk
# pyproj can be installed with: conda install conda-forge::pyproj
import pyproj, csv

importlib.reload(idw_rk)

import warnings
warnings.filterwarnings('ignore')

# Define a dedicated scratch folder to avoid overwriting files from parallel threads
arcpy.env.scratchWorkspace = r"E:\Projects\SEACAR_WQ_2024\scratch/IDW_all"

# 1. Data Preprocessing <a class="anchor" id="reg_preprocessing"></a>

## 1.1 Load csv files

In [3]:
gis_path = r'E:/Projects/SEACAR_WQ_2024/GIS_Data/'

dfCon = pd.read_csv(gis_path + 'OEAT_Continuous_WQ-2024-Feb-21.csv', low_memory=False)
dfDis = pd.read_csv(gis_path + 'OEAT_Discrete_WQ-2024-May-06.csv', low_memory=False)


## 1.2 Subsetting Data <a class="anchor" id="reg_subset"></a>

### Selecting data from 08:00 to 18:00 (daytime)

In [4]:
# Convert string to datetime
dfDis['SampleDate'] = pd.to_datetime(dfDis['SampleDate'], format='%Y-%m-%d %H:%M:%S.%f')
dfCon['SampleDate'] = pd.to_datetime(dfCon['SampleDate'], format='%Y-%m-%d %H:%M:%S.%f')


# Keep continuous samples taken between 08:00 and 18:00 (both ends inclusive)
start_time = pd.to_datetime('08:00').time()
end_time = pd.to_datetime('18:00').time()

sample_times = dfCon['SampleDate'].dt.time
dfConTime = dfCon[sample_times.between(start_time, end_time)]

# Concatenate the discrete data with the time-filtered continuous data
dfAll = pd.concat([dfDis, dfConTime], ignore_index=True)
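
The daytime filter above relies on `Series.between`, which includes both endpoints by default. The stdlib-only sketch below mirrors that boundary behavior on a few made-up timestamps; `is_daytime` and the sample values are hypothetical and serve only to show which edge cases are kept.

```python
from datetime import datetime, time

START, END = time(8, 0), time(18, 0)

def is_daytime(ts, start=START, end=END):
    """Return True when the time of day of ts lies in [start, end]."""
    return start <= ts.time() <= end

# Made-up timestamps around the two boundaries
samples = [
    datetime(2020, 6, 1, 7, 59, 59),   # just before the window: dropped
    datetime(2020, 6, 1, 8, 0, 0),     # exactly 08:00: kept
    datetime(2020, 6, 1, 18, 0, 0),    # exactly 18:00: kept
    datetime(2020, 6, 1, 18, 0, 1),    # just after the window: dropped
]
daytime = [ts for ts in samples if is_daytime(ts)]
print(len(daytime), "of", len(samples), "samples kept")
```

Only the continuous records are filtered this way; the discrete records are concatenated unfiltered.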

## 1.3 Calculating average values at unique observation points

In [5]:
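# Average ResultValue over all samples that share the same location, parameter, year, and season
group_cols = ['WaterBody', 'ParameterName', 'ParameterUnits', 'Year', 'Season',
              'Latitude_DD', 'Longitude_DD', 'WbodyAcronym']
dfAll_Mean = dfAll.groupby(group_cols)["ResultValue"].agg("mean").reset_index()
dfAll = dfAll_Mean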

## 1.4 Convert coordinate system to EPSG:3086

In [6]:
# Define the EPSG codes for source (EPSG:4326) and target (EPSG:3086) coordinate systems
source_epsg = 'EPSG:4326'
target_epsg = 'EPSG:3086'

# Create a PyProj Transformer for the conversion
transformer = pyproj.Transformer.from_crs(source_epsg, target_epsg, always_xy=True)

# Define a function to apply the transformation to each row of the DataFrame
def transform_coordinates(row):
    x, y = transformer.transform(row['Longitude_DD'], row['Latitude_DD'])
    return pd.Series({'x': x, 'y': y})

# Apply the transformation function to the DataFrame and create new columns for the converted coordinates
dfAll[['x', 'y']] = dfAll.apply(transform_coordinates, axis=1)

#### Save aggregated data to csv file

In [7]:
dfAll.to_csv(gis_path + 'OEAT_All_WQ-2024-Feb-21.csv', index=False)

# 2. Prepare for batch interpolation
## 2.1 Preset abbreviations for waterbody and parameter names

In [8]:
area_shortnames = {
    'Guana Tolomato Matanzas': 'GTM',
    'Estero Bay': 'EB',
    'Charlotte Harbor': 'CH',
    'Biscayne Bay': 'BB',
    'Big Bend Seagrasses':'BBS'
}

param_shortnames = {
    'Salinity': 'Sal_ppt',
    'Total Nitrogen': 'TN_mgl',
    'Dissolved Oxygen': 'DO_mgl',
    'Turbidity':'Turb_ntu',
    'Secchi Depth':'Secc_m',
    'Water Temperature':'T_c'
}

# Set input parameters
waterbody_names = [
    'Guana Tolomato Matanzas',
    'Estero Bay',
    'Charlotte Harbor',
    'Biscayne Bay',
    'Big Bend Seagrasses'
]

covariates_dict = {
    "GTM":"LDI",
    "EB":"bathymetry+LDI+popden",
    "CH":"bathymetry+LDI+popden+water_flow_wet",
    "BB":"bathymetry+LDI+popden",
    "BBS":"bathymetry+LDI"
}

parameter_names = ['Dissolved Oxygen', 'Salinity', 'Secchi Depth', 'Total Nitrogen', 'Turbidity', 'Water Temperature']
# years = unique_years
seasons = ['Fall', 'Spring', 'Summer', 'Winter']
# shp_folder = gis_path + r"shapefiles_All"
shp_folder = gis_path + r"shapefiles"
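
The processing log later in this notebook prints file names such as `SHP_GTM_TN_mgl_2015_Fall.shp`, which suggests that the abbreviations above are combined with the year and season to name each seasonal shapefile. The sketch below reproduces that pattern; `season_shapefile_name` is a hypothetical helper, and the naming rule is inferred from that single log line rather than taken from `idw_rk`.

```python
area_shortnames = {
    'Guana Tolomato Matanzas': 'GTM',
    'Estero Bay': 'EB',
    'Charlotte Harbor': 'CH',
    'Biscayne Bay': 'BB',
    'Big Bend Seagrasses': 'BBS',
}

param_shortnames = {
    'Salinity': 'Sal_ppt',
    'Total Nitrogen': 'TN_mgl',
    'Dissolved Oxygen': 'DO_mgl',
    'Turbidity': 'Turb_ntu',
    'Secchi Depth': 'Secc_m',
    'Water Temperature': 'T_c',
}

def season_shapefile_name(waterbody, parameter, year, season):
    """Build a shapefile name following the assumed SHP_<area>_<param>_<year>_<season> pattern.

    Raises KeyError when the waterbody or parameter has no abbreviation.
    """
    area = area_shortnames[waterbody]
    param = param_shortnames[parameter]
    return f"SHP_{area}_{param}_{year}_{season}.shp"

print(season_shapefile_name('Guana Tolomato Matanzas', 'Total Nitrogen', 2015, 'Fall'))
```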

## 2.2 Define the barrier files

In [9]:
barrier_folder = os.path.join(gis_path, 'Barriers')

# Collect the full path of every barrier shapefile in the folder
barriers = []
for file in os.listdir(barrier_folder):
    if file.endswith(".shp"):
        barriers.append(os.path.join(barrier_folder, file))

# Print the barrier files that were found
for barrier in barriers:
    print(barrier)

E:/Projects/SEACAR_WQ_2024/GIS_Data/Barriers\BBS_Barriers.shp
E:/Projects/SEACAR_WQ_2024/GIS_Data/Barriers\BB_Barriers.shp
E:/Projects/SEACAR_WQ_2024/GIS_Data/Barriers\CH_Barriers.shp
E:/Projects/SEACAR_WQ_2024/GIS_Data/Barriers\EB_Barriers.shp
E:/Projects/SEACAR_WQ_2024/GIS_Data/Barriers\GTM_Barriers.shp
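
Because each barrier file in the listing is named `<acronym>_Barriers.shp`, the paths can be indexed by waterbody acronym. The sketch below shows one way to build such a lookup; `barriers_by_acronym` is a hypothetical helper (it is not part of `idw_rk`), and the two paths are copied from the listing above for illustration.

```python
from pathlib import PureWindowsPath

def barriers_by_acronym(barrier_paths, suffix="_Barriers.shp"):
    """Map each waterbody acronym to its barrier shapefile path.

    Assumes file names of the form <acronym>_Barriers.shp; other files are ignored.
    """
    lookup = {}
    for path in barrier_paths:
        # PureWindowsPath splits on both '/' and '\', matching the mixed separators above
        name = PureWindowsPath(path).name
        if name.endswith(suffix):
            lookup[name[:-len(suffix)]] = path
    return lookup

example_paths = [
    r"E:/Projects/SEACAR_WQ_2024/GIS_Data/Barriers\BBS_Barriers.shp",
    r"E:/Projects/SEACAR_WQ_2024/GIS_Data/Barriers\GTM_Barriers.shp",
]
lookup = barriers_by_acronym(example_paths)
print(sorted(lookup))
```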


## 2.3 Define waterbody boundary for spatial extent and masking

In [10]:
waterbody_extent = os.path.join(gis_path, 'OEAT_Waterbody_Boundaries', 'OEAT_Waterbody_Boundary.shp')

unique_waterbodies = []
with arcpy.da.SearchCursor(waterbody_extent, ['WaterbodyA']) as cursor:
    for row in cursor:
        unique_waterbodies.append(row[0])

print("Unique Waterbodies:", unique_waterbodies)

Unique Waterbodies: ['BBS', 'BB', 'CH', 'EB', 'GTM']


## 2.4 Load the table of study periods, parameters, and seasons

In [11]:
seasons_all = pd.read_csv(gis_path + 'Seasons_all.csv', low_memory=False)

## 2.5 Define output folders

In [12]:
shpAll_folder = gis_path + r"shapefiles/shapefiles_All" 
idwAll_folder = gis_path + r"raster_output/idw_All"

# Preview dataset
dfAll

Unnamed: 0,WaterBody,ParameterName,ParameterUnits,Year,Season,Latitude_DD,Longitude_DD,WbodyAcronym,ResultValue,x,y
0,Big Bend Seagrasses,Dissolved Oxygen,mg/L,2015,Fall,29.008300,-82.825250,BBS,6.350000,514236.421551,556316.396318
1,Big Bend Seagrasses,Dissolved Oxygen,mg/L,2015,Fall,29.036716,-83.129066,BBS,6.200000,484670.524231,559226.858975
2,Big Bend Seagrasses,Dissolved Oxygen,mg/L,2015,Fall,29.046916,-83.033200,BBS,7.100000,493981.422006,560428.927830
3,Big Bend Seagrasses,Dissolved Oxygen,mg/L,2015,Fall,29.054833,-82.758666,BBS,6.500000,520659.055377,561546.969670
4,Big Bend Seagrasses,Dissolved Oxygen,mg/L,2015,Fall,29.056800,-83.059133,BBS,6.000000,491452.167490,561506.993606
...,...,...,...,...,...,...,...,...,...,...,...
77793,Guana Tolomato Matanzas,Water Temperature,Degrees C,2023,Summer,30.025360,-81.370918,GTM,29.150000,653237.586095,671395.945419
77794,Guana Tolomato Matanzas,Water Temperature,Degrees C,2023,Summer,30.026440,-81.369403,GTM,29.500000,653380.952596,671518.961956
77795,Guana Tolomato Matanzas,Water Temperature,Degrees C,2023,Summer,30.033611,-81.353027,GTM,29.766667,654940.976043,672348.548670
77796,Guana Tolomato Matanzas,Water Temperature,Degrees C,2023,Summer,30.050338,-81.371008,GTM,29.675000,653169.830710,674167.856772


## 2.6 Fill NaN RowID values with unique IDs (required by the IDW function) <a class="anchor" id="reg_id"></a>

In [13]:
idw_rk.fill_nan_rowids(dfAll, 'RowID')

# Keep RowID as integer
dfAll['RowID'] = dfAll['RowID'].astype(int)
dfAll

Unnamed: 0,WaterBody,ParameterName,ParameterUnits,Year,Season,Latitude_DD,Longitude_DD,WbodyAcronym,ResultValue,x,y,RowID
0,Big Bend Seagrasses,Dissolved Oxygen,mg/L,2015,Fall,29.008300,-82.825250,BBS,6.350000,514236.421551,556316.396318,1
1,Big Bend Seagrasses,Dissolved Oxygen,mg/L,2015,Fall,29.036716,-83.129066,BBS,6.200000,484670.524231,559226.858975,2
2,Big Bend Seagrasses,Dissolved Oxygen,mg/L,2015,Fall,29.046916,-83.033200,BBS,7.100000,493981.422006,560428.927830,3
3,Big Bend Seagrasses,Dissolved Oxygen,mg/L,2015,Fall,29.054833,-82.758666,BBS,6.500000,520659.055377,561546.969670,4
4,Big Bend Seagrasses,Dissolved Oxygen,mg/L,2015,Fall,29.056800,-83.059133,BBS,6.000000,491452.167490,561506.993606,5
...,...,...,...,...,...,...,...,...,...,...,...,...
77793,Guana Tolomato Matanzas,Water Temperature,Degrees C,2023,Summer,30.025360,-81.370918,GTM,29.150000,653237.586095,671395.945419,77794
77794,Guana Tolomato Matanzas,Water Temperature,Degrees C,2023,Summer,30.026440,-81.369403,GTM,29.500000,653380.952596,671518.961956,77795
77795,Guana Tolomato Matanzas,Water Temperature,Degrees C,2023,Summer,30.033611,-81.353027,GTM,29.766667,654940.976043,672348.548670,77796
77796,Guana Tolomato Matanzas,Water Temperature,Degrees C,2023,Summer,30.050338,-81.371008,GTM,29.675000,653169.830710,674167.856772,77797
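
The implementation of `idw_rk.fill_nan_rowids` is not shown in this notebook. As a rough illustration of the idea only, the sketch below fills missing IDs with the smallest unused positive integers, so that an all-missing column becomes 1, 2, 3, ... as in the output above. `fill_missing_ids` is a hypothetical function operating on a plain list, not the project's helper.

```python
def fill_missing_ids(ids):
    """Replace None entries with unused positive integers, keeping existing IDs."""
    used = {i for i in ids if i is not None}
    next_id = 1
    filled = []
    for current in ids:
        if current is None:
            # Advance to the next integer that is not already taken
            while next_id in used:
                next_id += 1
            used.add(next_id)
            current = next_id
        filled.append(current)
    return filled

print(fill_missing_ids([None, None, None]))
print(fill_missing_ids([None, 3, None, 1]))
```

Unique IDs matter here because cross validation has to match each prediction back to the single observation that was left out.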


# 3. Create Shapefiles <a class="anchor" id="reg_create_shp"></a>

In [14]:
# Merge the study periods of interest with the coordinate, RowID, and ResultValue columns
seasons_all_coord = idw_rk.merge_with_lat_long(seasons_all, dfAll)
seasons_all_coord

Unnamed: 0,WaterBody,Year,Season,Parameter,Filename,NumDataPoints,RMSE,ME,x,y,RowID,ResultValue
0,Guana Tolomato Matanzas,2015,Fall,Total Nitrogen,,0,,,669975.848287,626752.656623,75890,0.212233
1,Guana Tolomato Matanzas,2015,Fall,Total Nitrogen,,0,,,662275.840738,630059.187470,75891,1.034500
2,Guana Tolomato Matanzas,2015,Fall,Total Nitrogen,,0,,,667035.271306,631036.021679,75892,1.366500
3,Guana Tolomato Matanzas,2015,Fall,Total Nitrogen,,0,,,668862.259531,631692.835328,75893,0.192567
4,Guana Tolomato Matanzas,2015,Fall,Total Nitrogen,,0,,,665055.970903,631868.535738,75894,0.862000
...,...,...,...,...,...,...,...,...,...,...,...,...
20384,Big Bend Seagrasses,2022,Spring,Water Temperature,E:/Projects/SEACAR_WQ_2024/GIS_Data/raster_out...,27,-1.797693e+308,-1.797693e+308,374832.099897,689623.362197,26660,20.533333
20385,Big Bend Seagrasses,2022,Spring,Water Temperature,E:/Projects/SEACAR_WQ_2024/GIS_Data/raster_out...,27,-1.797693e+308,-1.797693e+308,371015.090649,692081.211171,26661,20.740000
20386,Big Bend Seagrasses,2022,Spring,Water Temperature,E:/Projects/SEACAR_WQ_2024/GIS_Data/raster_out...,27,-1.797693e+308,-1.797693e+308,401894.595993,699335.031753,26662,19.500000
20387,Big Bend Seagrasses,2022,Spring,Water Temperature,E:/Projects/SEACAR_WQ_2024/GIS_Data/raster_out...,27,-1.797693e+308,-1.797693e+308,401457.362103,702258.799507,26663,20.500000


In [None]:
# Skip this cell if the RK workflow has already created the shapefiles or is currently running
idw_rk.create_shp_season(seasons_all_coord, shpAll_folder)

# 4. Cross Validation for IDW <a class="anchor" id="reg_cv_idw"></a>

In [17]:
# Empty the IDW output folder
idw_rk.delete_all_files(idwAll_folder)

In [36]:
# Select the rows of the table to process
seasons_slct = seasons_all.copy()
# seasons_slct = seasons_all.iloc[:56].reset_index()
# Exclude Charlotte Harbor from this run
seasons_slct = seasons_slct[seasons_slct['WaterBody'] != 'Charlotte Harbor']
seasons_slct

Unnamed: 0,index,WaterBody,Year,Season,Parameter,Filename,NumDataPoints,RMSE,ME
0,0,Guana Tolomato Matanzas,2015,Fall,Total Nitrogen,,0,,
1,1,Guana Tolomato Matanzas,2015,Winter,Total Nitrogen,,0,,
2,2,Guana Tolomato Matanzas,2016,Spring,Total Nitrogen,,0,,
3,3,Guana Tolomato Matanzas,2016,Summer,Total Nitrogen,,0,,
4,4,Guana Tolomato Matanzas,2016,Fall,Total Nitrogen,,0,,
5,5,Guana Tolomato Matanzas,2016,Winter,Total Nitrogen,,0,,
6,6,Guana Tolomato Matanzas,2017,Spring,Total Nitrogen,,0,,
7,7,Guana Tolomato Matanzas,2017,Summer,Total Nitrogen,,13,,
8,8,Estero Bay,2016,Summer,Total Nitrogen,,0,,
9,9,Estero Bay,2016,Fall,Total Nitrogen,,0,,


In [None]:
importlib.reload(idw_rk)

# IDW is skipped for any season with fewer than 3 data points
idw_rk.idw_interpolation(seasons_slct, shpAll_folder, idwAll_folder, waterbody_extent, barrier_folder)

Processing file: SHP_GTM_TN_mgl_2015_Fall.shp
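
To make the cross validation step concrete, the sketch below runs a leave-one-out IDW cross validation in plain Python and returns RMSE and ME. It is a simplified illustration under stated assumptions, not the arcpy-based routine in `idw_rk`: barriers and the waterbody mask are ignored, the power of 2 is an assumed default, and ME is taken as predicted minus observed. The 3-point minimum follows the comment in the cell above.

```python
import math

def idw_predict(x, y, points, power=2.0):
    """Inverse-distance-weighted estimate at (x, y) from (px, py, value) tuples."""
    numerator = 0.0
    denominator = 0.0
    for px, py, value in points:
        distance = math.hypot(x - px, y - py)
        if distance == 0.0:
            return value  # a coincident sample determines the estimate
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator

def loo_cross_validate(points, power=2.0, min_points=3):
    """Leave each point out in turn, predict it from the rest, return (rmse, me).

    Returns None when there are fewer than min_points samples.
    """
    points = list(points)
    if len(points) < min_points:
        return None
    errors = []
    for i, (x, y, observed) in enumerate(points):
        others = points[:i] + points[i + 1:]
        errors.append(idw_predict(x, y, others, power) - observed)
    n = len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    me = sum(errors) / n
    return rmse, me

# Three made-up samples on a line: (x, y, value)
samples = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0), (2.0, 0.0, 3.0)]
print(loo_cross_validate(samples))
```

In this toy case the end points are over- and under-predicted by the same amount, so ME is near zero while RMSE is not. That is the situation the two metrics are meant to distinguish.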
