# Cross Validation for IDW Interpolation 
## Task 2 (continuous & discrete) cross-year for the four seasons

This notebook contains Python code that runs cross validation (CV) for Inverse Distance Weighting (IDW) interpolation in the arcpy environment. It covers six water quality parameters:
- Dissolved oxygen (DO_mgl)
- Salinity (Sal_ppt)
- Turbidity (Turb_ntu)
- Temperature (T_c)
- Secchi (Secc_m)
- Total Nitrogen (TN_mgl) 

The analysis is conducted separately for five water bodies:
- Guana Tolomato Matanzas (GTM)
- Estero Bay (EB)
- Charlotte Harbor (CH)
- Biscayne Bay (BB)
- Big Bend Seagrasses (BBS)

**Tasks:**  

**Calculate the root mean squared error (RMSE) and mean error (ME) of the IDW results, using continuous and discrete data pooled across years, for each of the four seasons.**


<br>
<div style="text-align: left;">
    <img src="CrossYear.png" style="display: block; margin-left: 0; margin-right: auto; width: 600px;"/>
</div>
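The two error metrics named in the task can be sketched in a few lines. This is a minimal illustration, not the code inside `idw_rk`; the function names are hypothetical, and the sign convention for ME (predicted minus observed, so positive values mean over-prediction) is an assumption.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error between paired observations and predictions."""
    errors = [p - o for o, p in zip(observed, predicted)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def mean_error(observed, predicted):
    """Mean error (bias); assumed convention: predicted minus observed."""
    errors = [p - o for o, p in zip(observed, predicted)]
    return sum(errors) / len(errors)

# Made-up values for illustration only
obs = [1.0, 2.0, 3.0]
pred = [1.5, 2.0, 2.0]
print(rmse(obs, pred), mean_error(obs, pred))
```

RMSE penalizes large misses regardless of direction, while ME shows whether the interpolation is biased high or low, so the two are usually read together.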


**Contents:**
* [1. Data Preprocess](#reg_preprocessing)
    * [1.1 Load csv files](#reg_preprocessing)
    * [1.2 Subsetting data](#reg_subset)
    * [1.3 Filter the data](#reg_studied)
    * [1.4 Calculating average values](#reg_average)
    * [1.5 Convert coordinate system](#reg_coordinate)
* [2. Prepare for batch interpolation](#reg_batch)
    * [2.1 Preset abbreviation](#reg_preset)
    * [2.2 Define the barrier files](#reg_barrier)
    * [2.3 Define waterbody boundary](#reg_boundary)
    * [2.4 Load the table of study periods,  parameters, and seasons](#reg_study)
    * [2.5 Define output folders](#reg_output)
    * [2.6 Fill NaN RowID with unique ID](#reg_id)
* [3. Create Shapefiles](#reg_create_shp)
* [4. Cross Validation for IDW](#reg_cv_idw)

## Loading packages

In [1]:
import pandas as pd
import numpy as np
import arcpy
from arcpy.sa import *
import os
import math

import importlib
import sys
# path = r'C:/Users/cong1/WQ/IDW/git/misc'
path = r'E:\Projects\SEACAR_WQ_2024\git\misc'

sys.path.insert(0, path)
import idw_rk
importlib.reload(idw_rk)

import pyproj

# Define a scratch folder to avoid overwriting from parallel threads
arcpy.env.scratchWorkspace = r"E:\Projects\SEACAR_WQ_2024\scratch/IDW_all"

## 1. Data Preprocessing <a class="anchor" id="reg_preprocessing"></a>
### 1.1 Load csv files

In [2]:
gis_path = r'E:/Projects/SEACAR_WQ_2024/GIS_Data/'

dfDis = pd.read_csv(gis_path + 'OEAT_Discrete_WQ-2024-Feb-15.csv', low_memory=False)
dfCon = pd.read_csv(gis_path + 'OEAT_Continuous_WQ-2024-Feb-21.csv', low_memory=False)

dfAll = pd.concat([dfDis, dfCon], ignore_index=True)

### 1.2 Subsetting data <a class="anchor" id="reg_subset"></a>
#### Selecting data from 09:00 to 17:00 (daytime)

In [3]:
# Convert string to datetime
dfAll['SampleDate'] = pd.to_datetime(dfAll['SampleDate'], format='%Y-%m-%d %H:%M:%S.%f')

# Keep samples taken between 09:00 and 17:00 (both ends inclusive)
start_time = '09:00'
end_time = '17:00'

dfAllTime = dfAll[dfAll['SampleDate'].dt.time.between(pd.to_datetime(start_time).time(), pd.to_datetime(end_time).time())]
dfAllTime.head()

Unnamed: 0,RowID,ProgramID,ParameterName,ParameterUnits,ProgramLocationID,ActivityType,SampleDate,Year,Month,RelativeDepth,ResultValue,Latitude_DD,Longitude_DD,ManagedAreaName,AreaID,SEACAR_QAQCFlagCode,WaterBody,WbodyAcronym,Season
0,1,4058,Total Nitrogen,mg/L,4-2018-01-01,Sample,2020-08-14 09:37:00,2020,8,Surface,0.173,25.8463,-80.1282,Biscayne Bay Aquatic Preserve,6,1Q/7Q,Biscayne Bay,BB,Summer
1,2,4058,Total Nitrogen,mg/L,42,Sample,2018-03-07 11:52:00,2018,3,Surface,0.584,25.8015,-80.1401,Biscayne Bay Aquatic Preserve,6,7Q/1Q,Biscayne Bay,BB,Spring
2,3,4058,Total Nitrogen,mg/L,42,Sample,2017-01-18 09:47:00,2017,1,Surface,0.446,25.8015,-80.1401,Biscayne Bay Aquatic Preserve,6,1Q/7Q,Biscayne Bay,BB,Winter
3,4,4058,Total Nitrogen,mg/L,9,Sample,2018-10-31 14:14:00,2018,10,Surface,0.425,25.8002,-80.1278,Biscayne Bay Aquatic Preserve,6,7Q/1Q,Biscayne Bay,BB,Fall
4,5,4058,Total Nitrogen,mg/L,4-2018-01-01,Sample,2019-10-16 09:44:00,2019,10,Surface,0.155,25.8463,-80.1282,Biscayne Bay Aquatic Preserve,6,1Q/7Q,Biscayne Bay,BB,Fall
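The `between` comparison in the cell above is inclusive at both ends. As a cross-check, the same daytime window can be selected with pandas' `DataFrame.between_time`, which needs a `DatetimeIndex`. The toy frame below is made up for illustration and only reuses the `SampleDate` and `ResultValue` column names.

```python
import pandas as pd

# Made-up samples around the 09:00 and 17:00 boundaries
toy = pd.DataFrame({
    "SampleDate": pd.to_datetime([
        "2020-08-14 08:59:59",
        "2020-08-14 09:00:00",
        "2020-08-14 12:30:00",
        "2020-08-14 17:00:00",
        "2020-08-14 17:00:01",
    ]),
    "ResultValue": [0.1, 0.2, 0.3, 0.4, 0.5],
})

# between_time filters on the index and keeps both end points by default
daytime = (
    toy.set_index("SampleDate")
       .between_time("09:00", "17:00")
       .reset_index()
)
print(daytime["ResultValue"].tolist())
```

Samples stamped exactly 09:00:00 and 17:00:00 are kept; anything one second outside the window is dropped.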


### 1.3 Filter the data<a class="anchor" id="reg_studied"></a>

In [4]:
# Load the table of cross-year season definitions
cross_year = pd.read_csv(gis_path + 'CrossYear.csv', low_memory=False)
cross_year

Unnamed: 0,WaterBody,Season,Year1,Year2,Year3
0,Charlotte Harbor,Spring,2017,2018,
1,Charlotte Harbor,Summer,2016,2017,
2,Charlotte Harbor,Fall,2016,2017,
3,Charlotte Harbor,Winter,2016,2017,2018.0
4,Big Bend Seagrasses,Spring,2021,2022,
5,Big Bend Seagrasses,Summer,2020,2021,
6,Big Bend Seagrasses,Fall,2020,2021,
7,Big Bend Seagrasses,Winter,2020,2021,2022.0
8,Estero Bay,Spring,2017,2018,
9,Estero Bay,Summer,2016,2017,


In [5]:
filtered_dfAllTime = idw_rk.filter_data_crossyear(cross_year, dfAllTime)
filtered_dfAllTime.head()

Unnamed: 0,RowID,ProgramID,ParameterName,ParameterUnits,ProgramLocationID,ActivityType,SampleDate,Year,Month,RelativeDepth,ResultValue,Latitude_DD,Longitude_DD,ManagedAreaName,AreaID,SEACAR_QAQCFlagCode,WaterBody,WbodyAcronym,Season
0,13244,513,Turbidity,NTU,WIN_21FLEECO_CHNEP_RSBOTAD69485,Sample,2018-03-19 11:30:00,2018,3,,1.69,26.74292,-82.2236,Gasparilla Sound-Charlotte Harbor Aquatic Pres...,18,7Q,Charlotte Harbor,CH,Spring
1,13425,513,Turbidity,NTU,WIN_21FLEECO_CHNEP_RSAD69484,Sample,2018-03-19 11:26:00,2018,3,Surface,1.51,26.74292,-82.2236,Gasparilla Sound-Charlotte Harbor Aquatic Pres...,18,9Q/7Q,Charlotte Harbor,CH,Spring
2,14244,513,Turbidity,NTU,WIN_21FLEECO_CHNEP_RSAD72270,Sample,2018-04-18 10:50:00,2018,4,Surface,4.61,26.74615,-82.24372,Gasparilla Sound-Charlotte Harbor Aquatic Pres...,18,9Q/7Q,Charlotte Harbor,CH,Spring
3,14245,513,Turbidity,NTU,WIN_21FLEECO_CHNEP_RSAD69480,Sample,2018-03-19 11:00:00,2018,3,Surface,1.01,26.74288,-82.20377,Gasparilla Sound-Charlotte Harbor Aquatic Pres...,18,7Q/9Q,Charlotte Harbor,CH,Spring
4,14883,513,Turbidity,NTU,WIN_21FLEECO_CHNEP_RSBOTAD69477,Sample,2018-03-19 10:35:00,2018,3,,1.13,26.7356,-82.18392,Gasparilla Sound-Charlotte Harbor Aquatic Pres...,18,7Q,Charlotte Harbor,CH,Spring
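The source of `idw_rk.filter_data_crossyear` is not shown in this notebook. One plausible way to implement such a filter, sketched here under that caveat, is to reshape the `Year1`..`Year3` columns into a long table and inner-join it with the samples. The function name `filter_by_cross_year` and the toy data are hypothetical, and the sketch matches on calendar year only, so it ignores any month-level handling the real helper may apply to winters that span two years.

```python
import pandas as pd

def filter_by_cross_year(cross_year, samples):
    """Keep samples whose (WaterBody, Season, Year) appears in the cross-year table."""
    # Reshape Year1..Year3 into one long 'Year' column, dropping empty Year3 cells
    long = cross_year.melt(
        id_vars=["WaterBody", "Season"],
        value_vars=["Year1", "Year2", "Year3"],
        value_name="Year",
    ).dropna(subset=["Year"])
    long["Year"] = long["Year"].astype(int)
    keys = long[["WaterBody", "Season", "Year"]].drop_duplicates()
    # The inner merge keeps only rows that match a defined study period
    return samples.merge(keys, on=["WaterBody", "Season", "Year"], how="inner")

# Made-up tables that mimic the column layout shown above
cross_year_toy = pd.DataFrame({
    "WaterBody": ["Charlotte Harbor", "Charlotte Harbor"],
    "Season": ["Winter", "Spring"],
    "Year1": [2016, 2017],
    "Year2": [2017, 2018],
    "Year3": [2018.0, None],
})
samples_toy = pd.DataFrame({
    "WaterBody": ["Charlotte Harbor"] * 4,
    "Season": ["Winter", "Winter", "Spring", "Spring"],
    "Year": [2018, 2019, 2016, 2017],
    "ResultValue": [1.0, 2.0, 3.0, 4.0],
})
kept = filter_by_cross_year(cross_year_toy, samples_toy)
print(kept[["Season", "Year", "ResultValue"]])
```

In the toy data, the Winter 2019 and Spring 2016 rows fall outside the defined periods and are dropped.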


In [6]:
# Check the filtered results
CH_Winter = filtered_dfAllTime[(filtered_dfAllTime['WaterBody'] == 'Charlotte Harbor') & (filtered_dfAllTime['Season'] == 'Winter')]['Year'].unique()
CH_Winter

array([2016, 2018, 2017], dtype=int64)

In [7]:
GTM_Fall = filtered_dfAllTime[(filtered_dfAllTime['WaterBody'] == 'Guana Tolomato Matanzas') & (filtered_dfAllTime['Season'] == 'Fall')]['Year'].unique()
GTM_Fall

array([2016, 2015], dtype=int64)

### 1.4 Calculating average values at unique observation points<a class="anchor" id="reg_average"></a>

In [8]:
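# Average ResultValue per location, parameter, and season (pooled over the selected years)
dfAll_Mean = filtered_dfAllTime.groupby(['WaterBody','ParameterName','ParameterUnits', 'Season','Latitude_DD','Longitude_DD','WbodyAcronym'])["ResultValue"].agg("mean").reset_index()
# From here on, dfAll refers to the aggregated table
dfAll = dfAll_Mean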

### 1.5 Convert coordinate system to EPSG:3086<a class="anchor" id="reg_coordinate"></a>

In [9]:
# Define the EPSG codes for source (EPSG:4326) and target (EPSG:3086) coordinate systems
source_epsg = 'EPSG:4326'
target_epsg = 'EPSG:3086'

# Create a PyProj Transformer for the conversion
transformer = pyproj.Transformer.from_crs(source_epsg, target_epsg, always_xy=True)

# Define a function to apply the transformation to each row of the DataFrame
def transform_coordinates(row):
    x, y = transformer.transform(row['Longitude_DD'], row['Latitude_DD'])
    return pd.Series({'x': x, 'y': y})

# Apply the transformation function to the DataFrame and create new columns for the converted coordinates
dfAll[['x', 'y']] = dfAll.apply(transform_coordinates, axis=1)

#### Save aggregated data to csv file

In [10]:
dfAll.to_csv(gis_path + 'OEAT_CrossYear_All_WQ-2024-May-2.csv', index=False)

## 2. Prepare for batch interpolation<a class="anchor" id="reg_batch"></a>
### 2.1 Preset abbreviation for waterbody and parameter name<a class="anchor" id="reg_preset"></a>

In [11]:
area_shortnames = {
    'Guana Tolomato Matanzas': 'GTM',
    'Estero Bay': 'EB',
    'Charlotte Harbor': 'CH',
    'Biscayne Bay': 'BB',
    'Big Bend Seagrasses':'BBS'
}

param_shortnames = {
    'Salinity': 'Sal_ppt',
    'Total Nitrogen': 'TN_mgl',
    'Dissolved Oxygen': 'DO_mgl',
    'Turbidity':'Turb_ntu',
    'Secchi Depth':'Secc_m',
    'Water Temperature':'T_c'
}
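These abbreviations show up in the shapefile names printed in Section 3, for example `SHP_CH_TN_mgl_Winter.shp`. The helper below is a hypothetical sketch of how such a name could be assembled from the two dictionaries; the pattern is inferred from that log output rather than taken from `idw_rk`.

```python
def shapefile_name(waterbody, parameter, season, areas, params):
    """Build a name following the SHP_<area>_<param>_<season>.shp pattern."""
    return f"SHP_{areas[waterbody]}_{params[parameter]}_{season}.shp"

# Subsets of the dictionaries defined above, repeated so the sketch runs on its own
areas = {"Charlotte Harbor": "CH", "Big Bend Seagrasses": "BBS"}
params = {"Total Nitrogen": "TN_mgl", "Dissolved Oxygen": "DO_mgl"}

name = shapefile_name("Charlotte Harbor", "Total Nitrogen", "Winter", areas, params)
print(name)
```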

### 2.2 Define the barrier files<a class="anchor" id="reg_barrier"></a>

In [12]:
barrier_folder = os.path.join(gis_path, 'Barriers')
barrier_folder

barriers = []
for file in os.listdir(barrier_folder):
    if file.endswith(".shp"):
        barriers.append(os.path.join(barrier_folder, file))

for barrier in barriers:
    print(barrier)

E:/Projects/SEACAR_WQ_2024/GIS_Data/Barriers\BBS_Barriers.shp
E:/Projects/SEACAR_WQ_2024/GIS_Data/Barriers\BB_Barriers.shp
E:/Projects/SEACAR_WQ_2024/GIS_Data/Barriers\CH_Barriers.shp
E:/Projects/SEACAR_WQ_2024/GIS_Data/Barriers\EB_Barriers.shp
E:/Projects/SEACAR_WQ_2024/GIS_Data/Barriers\GTM_Barriers.shp
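Because each barrier file is named `<acronym>_Barriers.shp`, a lookup keyed by waterbody acronym can be more convenient than a flat list when processing one waterbody at a time. The sketch below assumes that naming pattern holds for every file; `barriers_by_acronym` is a hypothetical helper, demonstrated on empty placeholder files in a temporary folder.

```python
import glob
import os
import tempfile

def barriers_by_acronym(folder):
    """Map a waterbody acronym to its barrier shapefile path."""
    suffix = "_Barriers.shp"
    mapping = {}
    for path in sorted(glob.glob(os.path.join(folder, "*" + suffix))):
        # Strip the suffix from the file name to recover the acronym
        acronym = os.path.basename(path)[: -len(suffix)]
        mapping[acronym] = path
    return mapping

# Placeholder files only; real shapefiles also have .dbf, .shx, etc. sidecars
with tempfile.TemporaryDirectory() as tmp:
    for file_name in ("BB_Barriers.shp", "CH_Barriers.shp", "CH_Barriers.dbf"):
        open(os.path.join(tmp, file_name), "w").close()
    found = barriers_by_acronym(tmp)
    print(sorted(found))
```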


### 2.3 Define waterbody boundary for spatial extent and masking<a class="anchor" id="reg_boundary"></a>

In [13]:
waterbody_extent = os.path.join(gis_path, 'OEAT_Waterbody_Boundaries', 'OEAT_Waterbody_Boundary.shp')

unique_waterbodies = []
with arcpy.da.SearchCursor(waterbody_extent, ['WaterbodyA']) as cursor:
    for row in cursor:
        unique_waterbodies.append(row[0])

print("Unique Waterbodies:", unique_waterbodies)

Unique Waterbodies: ['BBS', 'BB', 'CH', 'EB', 'GTM']


### 2.4 Load the table of study periods,  parameters, and seasons<a class="anchor" id="reg_study"></a>

In [14]:
crossyear_all = pd.read_csv(gis_path + 'CrossYear_all.csv', low_memory=False)
crossyear_all

Unnamed: 0,WaterBody,Season,Year1,Year2,Year3,Parameter,Filename,NumDataPoints,RMSE,ME
0,Charlotte Harbor,Spring,2017,2018,,Total Nitrogen,,,,
1,Charlotte Harbor,Summer,2016,2017,,Total Nitrogen,,,,
2,Charlotte Harbor,Fall,2016,2017,,Total Nitrogen,,,,
3,Charlotte Harbor,Winter,2016,2017,2018.0,Total Nitrogen,,,,
4,Charlotte Harbor,Spring,2017,2018,,Salinity,,,,
...,...,...,...,...,...,...,...,...,...,...
115,Biscayne Bay,Winter,2021,2022,2023.0,Secchi Depth,,,,
116,Biscayne Bay,Spring,2022,2023,,Water Temperature,,,,
117,Biscayne Bay,Summer,2021,2022,,Water Temperature,,,,
118,Biscayne Bay,Fall,2021,2022,,Water Temperature,,,,


### 2.5 Define output folders<a class="anchor" id="reg_output"></a>

In [15]:
shpAll_folder = gis_path + r"CrossYear_shapefiles_All" 
idwAll_folder = gis_path + r"CrossYear_idw_All"

# Preview dataset
dfAll

Unnamed: 0,WaterBody,ParameterName,ParameterUnits,Season,Latitude_DD,Longitude_DD,WbodyAcronym,ResultValue,x,y
0,Big Bend Seagrasses,Dissolved Oxygen,mg/L,Fall,29.008300,-82.825250,BBS,6.873333,514236.629541,556316.261436
1,Big Bend Seagrasses,Dissolved Oxygen,mg/L,Fall,29.125000,-82.841666,BBS,7.730000,512518.602025,569259.744247
2,Big Bend Seagrasses,Dissolved Oxygen,mg/L,Fall,29.149500,-83.079500,BBS,7.225000,489395.986664,571785.532572
3,Big Bend Seagrasses,Dissolved Oxygen,mg/L,Fall,29.161167,-83.047333,BBS,7.110000,492509.872329,573104.729928
4,Big Bend Seagrasses,Dissolved Oxygen,mg/L,Fall,29.198660,-82.772000,BBS,4.740000,519204.169372,577504.324908
...,...,...,...,...,...,...,...,...,...,...
3475,Guana Tolomato Matanzas,Water Temperature,Degrees C,Winter,30.050800,-81.367500,GTM,15.600000,653506.770955,674226.166167
3476,Guana Tolomato Matanzas,Water Temperature,Degrees C,Winter,30.050857,-81.367465,GTM,18.674001,653510.005839,674232.563906
3477,Guana Tolomato Matanzas,Water Temperature,Degrees C,Winter,30.083020,-81.342860,GTM,12.800000,655802.149885,677852.742967
3478,Guana Tolomato Matanzas,Water Temperature,Degrees C,Winter,30.160736,-81.360278,GTM,12.300000,653940.954353,686441.209858


### 2.6 Fill NaN RowID with unique ID, IDW function needs unique ID <a class="anchor" id="reg_id"></a>

In [16]:
idw_rk.fill_nan_rowids(dfAll, 'RowID')

# Keep RowID as integer
dfAll['RowID'] = dfAll['RowID'].astype(int)
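The implementation of `idw_rk.fill_nan_rowids` is not shown here. A minimal sketch of the idea, assuming the goal is simply to give every row a distinct integer ID, could look like the following. The function name `fill_missing_ids` and the toy frame are hypothetical.

```python
import pandas as pd

def fill_missing_ids(df, column):
    """Fill missing values in `column` with unused integer IDs (modifies df in place)."""
    if column not in df.columns:
        # No ID column yet: create one that is entirely missing
        df[column] = float("nan")
    missing = df[column].isna()
    # Start numbering above the largest existing ID so new IDs cannot collide
    start = 1 if missing.all() else int(df[column].max()) + 1
    df.loc[missing, column] = list(range(start, start + int(missing.sum())))
    return df

toy = pd.DataFrame({"RowID": [10.0, None, 12.0, None], "ResultValue": [1, 2, 3, 4]})
fill_missing_ids(toy, "RowID")
toy["RowID"] = toy["RowID"].astype(int)
print(toy["RowID"].tolist())
```

Numbering from above the current maximum keeps existing IDs unchanged, which matters if they are used to trace rows back to the source data.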

## 3. Create Shapefiles <a class="anchor" id="reg_create_shp"></a>

In [17]:
# Merge the study-period table with the coordinate, RowID, and ResultValue columns
crossyear_all_coord = idw_rk.merge_with_lat_long_new(crossyear_all, dfAll, "Season")
crossyear_all_coord

Unnamed: 0,WaterBody,Season,Year1,Year2,Year3,Parameter,Filename,NumDataPoints,RMSE,ME,x,y,RowID,ResultValue
0,Charlotte Harbor,Spring,2017,2018,,Total Nitrogen,,,,,591267.708988,272548.489270,2716,0.790000
1,Charlotte Harbor,Spring,2017,2018,,Total Nitrogen,,,,,589338.538587,275567.917324,2717,0.780000
2,Charlotte Harbor,Spring,2017,2018,,Total Nitrogen,,,,,591931.509870,275878.186824,2718,0.800000
3,Charlotte Harbor,Spring,2017,2018,,Total Nitrogen,,,,,591354.567250,276236.638167,2719,0.890000
4,Charlotte Harbor,Spring,2017,2018,,Total Nitrogen,,,,,582823.183862,277990.491726,2720,0.930000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3493,Biscayne Bay,Winter,2021,2022,2023.0,Water Temperature,,,,,784697.376005,213473.546119,2603,23.895922
3494,Biscayne Bay,Winter,2021,2022,2023.0,Water Temperature,,,,,785872.039658,216879.318151,2604,24.480743
3495,Biscayne Bay,Winter,2021,2022,2023.0,Water Temperature,,,,,787466.428428,218699.961879,2605,24.634917
3496,Biscayne Bay,Winter,2021,2022,2023.0,Water Temperature,,,,,784434.233710,221871.479681,2606,23.509827


In [18]:
idw_rk.create_shp_season_new(crossyear_all_coord, "Season", shpAll_folder, start_year_included=False)

Number of data rows for BBS, DO_mgl, None, Fall: 41
Shapefile for BBS, DO_mgl for season Fall has been saved as SHP_BBS_DO_mgl_Fall.shp
Number of data rows for BBS, Sal_ppt, None, Fall: 33
Shapefile for BBS, Sal_ppt for season Fall has been saved as SHP_BBS_Sal_ppt_Fall.shp
Number of data rows for BBS, Secc_m, None, Fall: 33
Shapefile for BBS, Secc_m for season Fall has been saved as SHP_BBS_Secc_m_Fall.shp
Number of data rows for BBS, TN_mgl, None, Fall: 36
Shapefile for BBS, TN_mgl for season Fall has been saved as SHP_BBS_TN_mgl_Fall.shp
Number of data rows for BBS, Turb_ntu, None, Fall: 38
Shapefile for BBS, Turb_ntu for season Fall has been saved as SHP_BBS_Turb_ntu_Fall.shp
Number of data rows for BBS, T_c, None, Fall: 41
Shapefile for BBS, T_c for season Fall has been saved as SHP_BBS_T_c_Fall.shp
Number of data rows for BBS, DO_mgl, None, Spring: 54
Shapefile for BBS, DO_mgl for season Spring has been saved as SHP_BBS_DO_mgl_Spring.shp
Number of data rows for BBS, Sal_ppt, None

Shapefile for CH, Turb_ntu for season Summer has been saved as SHP_CH_Turb_ntu_Summer.shp
Number of data rows for CH, T_c, None, Summer: 3
Shapefile for CH, T_c for season Summer has been saved as SHP_CH_T_c_Summer.shp
Number of data rows for CH, DO_mgl, None, Winter: 38
Shapefile for CH, DO_mgl for season Winter has been saved as SHP_CH_DO_mgl_Winter.shp
Number of data rows for CH, Sal_ppt, None, Winter: 11
Shapefile for CH, Sal_ppt for season Winter has been saved as SHP_CH_Sal_ppt_Winter.shp
Number of data rows for CH, Secc_m, None, Winter: 8
Shapefile for CH, Secc_m for season Winter has been saved as SHP_CH_Secc_m_Winter.shp
Number of data rows for CH, TN_mgl, None, Winter: 50
Shapefile for CH, TN_mgl for season Winter has been saved as SHP_CH_TN_mgl_Winter.shp
Number of data rows for CH, Turb_ntu, None, Winter: 53
Shapefile for CH, Turb_ntu for season Winter has been saved as SHP_CH_Turb_ntu_Winter.shp
Number of data rows for CH, T_c, None, Winter: 38
Shapefile for CH, T_c for se

## 4. Cross Validation for IDW <a class="anchor" id="reg_cv_idw"></a>

In [19]:
# Empty the IDW output folder
idw_rk.delete_all_files(idwAll_folder)

In [20]:
# Select a section of table to process
seasons_slct = crossyear_all.iloc[:]
seasons_slct.head()

Unnamed: 0,WaterBody,Season,Year1,Year2,Year3,Parameter,Filename,NumDataPoints,RMSE,ME
0,Charlotte Harbor,Spring,2017,2018,,Total Nitrogen,,,,
1,Charlotte Harbor,Summer,2016,2017,,Total Nitrogen,,,,
2,Charlotte Harbor,Fall,2016,2017,,Total Nitrogen,,,,
3,Charlotte Harbor,Winter,2016,2017,2018.0,Total Nitrogen,,,,
4,Charlotte Harbor,Spring,2017,2018,,Salinity,,,,


In [None]:
# If a shapefile has fewer than 3 data points, the IDW calculation is skipped
idw_rk.idw_interpolation_new(seasons_slct, shpAll_folder, idwAll_folder, waterbody_extent, barrier_folder, "Season", include_start_year=False)

Processing file: SHP_CH_TN_mgl_Spring.shp
File SHP_CH_TN_mgl_Spring.shp has completed 49 cross-validation iterations.
Shapefile not found for: SHP_CH_TN_mgl_Summer.shp
Shapefile not found for: SHP_CH_TN_mgl_Fall.shp
Processing file: SHP_CH_TN_mgl_Winter.shp
File SHP_CH_TN_mgl_Winter.shp has completed 50 cross-validation iterations.
Processing file: SHP_CH_Sal_ppt_Spring.shp
File SHP_CH_Sal_ppt_Spring.shp has completed 3 cross-validation iterations.
Processing file: SHP_CH_Sal_ppt_Summer.shp
File SHP_CH_Sal_ppt_Summer.shp has completed 3 cross-validation iterations.
Processing file: SHP_CH_Sal_ppt_Fall.shp
File SHP_CH_Sal_ppt_Fall.shp has completed 3 cross-validation iterations.
Processing file: SHP_CH_Sal_ppt_Winter.shp
File SHP_CH_Sal_ppt_Winter.shp has completed 11 cross-validation iterations.
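In the log above, the number of iterations equals the number of points in each shapefile, which is consistent with leave-one-out cross validation: each point is withheld in turn and predicted from the rest. The pure-Python sketch below illustrates that procedure under simplifying assumptions. It uses plain Euclidean distance with a power of 2, and it has no barriers, search radius, or raster output, so it is not equivalent to the arcpy-based `idw_rk.idw_interpolation_new`. All names and values are hypothetical.

```python
import math

def idw_predict(x, y, points, power=2.0):
    """Inverse-distance-weighted estimate at (x, y) from (px, py, value) tuples."""
    num = den = 0.0
    for px, py, value in points:
        d = math.hypot(x - px, y - py)
        if d == 0.0:
            return value  # exact hit on a sample location
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den

def loocv_idw(points, power=2.0):
    """Leave-one-out CV: predict each point from all the others; return (RMSE, ME)."""
    errors = []
    for i, (x, y, observed) in enumerate(points):
        others = points[:i] + points[i + 1:]
        # Error convention assumed here: predicted minus observed
        errors.append(idw_predict(x, y, others, power) - observed)
    n = len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    me = sum(errors) / n
    return rmse, me

# Three made-up points on a line with linearly increasing values
pts = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0), (2.0, 0.0, 3.0)]
rmse, me = loocv_idw(pts)
print(round(rmse, 4), round(me, 4))
```

With only three points, the two end points are predicted from neighbours on one side only, so their errors are large and of opposite sign. This shows why RMSE and ME can diverge, and why results for files with only 3 points should be read with caution.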
