# Cross Validation for IDW Interpolation 
## Task 2 (continuous & discrete) for four seasons

This document contains Python code that runs cross validation (CV) of Inverse Distance Weighting (IDW) interpolation in the arcpy environment for six water quality parameters:
- Dissolved oxygen (DO_mgl)
- Salinity (Sal_ppt)
- Turbidity (Turb_ntu)
- Temperature (T_c)
- Secchi (Secc_m)
- Total Nitrogen (TN_mgl) 

The analysis is conducted separately for each of five water bodies:
- Guana Tolomato Matanzas (GTM)
- Estero Bay (EB)
- Charlotte Harbor (CH)
- Biscayne Bay (BB)
- Big Bend Seagrasses (BBS)

**Tasks:**  

**Calculate the RMSE and Mean Error (ME) for IDW results using both continuous and discrete data**
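
For reference, the two scores can be written as below. This is a sketch of the usual definitions, assuming the error at held-out point $i$ is the cross-validated prediction $\hat{z}_i$ minus the observation $z_i$, over $n$ held-out points. The sign convention used inside `idw_rk` is not shown in this notebook and may be the opposite.

$$
\mathrm{ME} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{z}_i - z_i\right), \qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{z}_i - z_i\right)^{2}}
$$

An ME near zero suggests little systematic bias, while RMSE summarizes the typical error magnitude in the parameter's own units.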


<br>
<div style="text-align: left;">
    <img src="misc/FourSeasons.png" style="display: block; margin-left: 0; margin-right: auto; width: 900px;"/>
</div>


**Contents:**
* [1. Data Preprocess](#reg_preprocessing)
    * [1.1 Load csv files](#reg_preprocessing)
    * [1.2 Subsetting data](#reg_subset)
    * [1.3 Filter the data](#reg_studied)
    * [1.4 Calculating average values](#reg_average)
    * [1.5 Convert coordinate system](#reg_coordinate)
* [2. Prepare for batch interpolation](#reg_batch)
    * [2.1 Preset abbreviation](#reg_preset)
    * [2.2 Define the barrier files](#reg_barrier)
    * [2.3 Define waterbody boundary](#reg_boundary)
    * [2.4 Load the table of study periods, parameters, and seasons](#reg_study)
    * [2.5 Define output folders](#reg_output)
    * [2.6 Fill NaN RowID with unique ID](#reg_id)
* [3. Create Shapefiles](#reg_create_shp)
* [4. Cross Validation for IDW](#reg_cv_idw)

## Loading packages

In [76]:
import pandas as pd
import numpy as np
import arcpy
from arcpy.sa import *
import os
import math

import importlib
import sys
# path = r'C:/Users/cong1/WQ/IDW/git/misc'
path = r'E:\Projects\SEACAR_WQ_2024\git\misc'

sys.path.insert(0, path)
import idw_rk
importlib.reload(idw_rk)

import pyproj

# Define a scratch folder to avoid overwriting from parallel threads
arcpy.env.scratchWorkspace = r"E:\Projects\SEACAR_WQ_2024\scratch/IDW_4s"

## 1. Data Preprocessing <a class="anchor" id="reg_preprocessing"></a>
### 1.1 Load csv files

In [77]:
gis_path = r'E:/Projects/SEACAR_WQ_2024/GIS_Data/'

dfDis = pd.read_csv(gis_path + 'OEAT_Discrete_WQ-2024-Feb-15.csv', low_memory=False)
dfCon = pd.read_csv(gis_path + 'OEAT_Continuous_WQ-2024-Feb-21.csv', low_memory=False)

dfAll = pd.concat([dfDis, dfCon], ignore_index=True)

### 1.2 Subsetting Data <a class="anchor" id="reg_subset"></a>
#### Selecting data from 9:00 to 17:00 (daytime)

In [78]:
# Convert string to datetime
dfAll['SampleDate'] = pd.to_datetime(dfAll['SampleDate'], format='%Y-%m-%d %H:%M:%S.%f')

# Include data sampled from 09:00 to 17:00
start_time = '09:00'
end_time = '17:00'

dfAllTime = dfAll[dfAll['SampleDate'].dt.time.between(pd.to_datetime(start_time).time(), pd.to_datetime(end_time).time())]
dfAllTime.head()

Unnamed: 0,RowID,ProgramID,ParameterName,ParameterUnits,ProgramLocationID,ActivityType,SampleDate,Year,Month,RelativeDepth,ResultValue,Latitude_DD,Longitude_DD,ManagedAreaName,AreaID,SEACAR_QAQCFlagCode,WaterBody,WbodyAcronym,Season
0,1,4058,Total Nitrogen,mg/L,4-2018-01-01,Sample,2020-08-14 09:37:00,2020,8,Surface,0.173,25.8463,-80.1282,Biscayne Bay Aquatic Preserve,6,1Q/7Q,Biscayne Bay,BB,Summer
1,2,4058,Total Nitrogen,mg/L,42,Sample,2018-03-07 11:52:00,2018,3,Surface,0.584,25.8015,-80.1401,Biscayne Bay Aquatic Preserve,6,7Q/1Q,Biscayne Bay,BB,Spring
2,3,4058,Total Nitrogen,mg/L,42,Sample,2017-01-18 09:47:00,2017,1,Surface,0.446,25.8015,-80.1401,Biscayne Bay Aquatic Preserve,6,1Q/7Q,Biscayne Bay,BB,Winter
3,4,4058,Total Nitrogen,mg/L,9,Sample,2018-10-31 14:14:00,2018,10,Surface,0.425,25.8002,-80.1278,Biscayne Bay Aquatic Preserve,6,7Q/1Q,Biscayne Bay,BB,Fall
4,5,4058,Total Nitrogen,mg/L,4-2018-01-01,Sample,2019-10-16 09:44:00,2019,10,Surface,0.155,25.8463,-80.1282,Biscayne Bay Aquatic Preserve,6,1Q/7Q,Biscayne Bay,BB,Fall
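
The daytime filter compares only the time-of-day part of each timestamp, and `Series.between` includes both bounds by default, so a sample at exactly 17:00:00 is kept. A minimal sketch on toy timestamps (the values below are hypothetical and not taken from the SEACAR files):

```python
import pandas as pd

# Toy timestamps (hypothetical values)
demo = pd.DataFrame({
    "SampleDate": pd.to_datetime([
        "2020-08-14 08:59:59",
        "2020-08-14 09:00:00",
        "2020-08-14 12:30:00",
        "2020-08-14 17:00:00",
        "2020-08-14 17:00:01",
    ])
})

start = pd.to_datetime("09:00").time()
end = pd.to_datetime("17:00").time()

# Compare the time-of-day component only; both bounds are inclusive
is_daytime = demo["SampleDate"].dt.time.between(start, end)
daytime = demo[is_daytime]
kept = daytime["SampleDate"].dt.strftime("%H:%M:%S").tolist()
print(kept)
```

Samples one second outside either bound are dropped, which is worth keeping in mind if the continuous data are logged at sub-minute intervals.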


### 1.3 Filter the data<a class="anchor" id="reg_studied"></a>

In [79]:
# Load the table of four seasons definitions
seasons4 = pd.read_csv(gis_path + 'season_def/4 seasons.csv', low_memory=False)
seasons4

Unnamed: 0,WaterBody,SeasonNum,Season,Start Year,Start Month,Start Day,End Year,End Month,End Day,Start Date,End Date
0,Charlotte Harbor,1,Spring,2017,2,28,2017,6,11,2/28/2017,6/11/2017
1,Charlotte Harbor,2,Summer,2017,6,12,2017,9,11,6/12/2017,9/11/2017
2,Charlotte Harbor,3,Fall,2017,9,12,2017,11,28,9/12/2017,11/28/2017
3,Charlotte Harbor,4,Winter,2017,11,29,2018,2,27,11/29/2017,2/27/2018
4,Big Bend Seagrasses,1,Spring,2021,3,3,2021,6,7,3/3/2021,6/7/2021
5,Big Bend Seagrasses,2,Summer,2021,6,8,2021,9,7,6/8/2021,9/7/2021
6,Big Bend Seagrasses,3,Fall,2021,9,8,2021,12,2,9/8/2021,12/2/2021
7,Big Bend Seagrasses,4,Winter,2021,12,3,2022,3,2,12/3/2021,3/2/2022


In [80]:
# Keep only the samples that fall inside the season date ranges defined above
selected_dfAllTime = idw_rk.filter_by_date_range(dfAllTime, seasons4)
selected_dfAllTime.head()

Unnamed: 0,RowID,ProgramID,ParameterName,ParameterUnits,ProgramLocationID,ActivityType,SampleDate,Year,Month,RelativeDepth,...,Latitude_DD,Longitude_DD,ManagedAreaName,AreaID,SEACAR_QAQCFlagCode,WaterBody,WbodyAcronym,Season,Start Date,End Date
14091,8864,514,Total Nitrogen,mg/L,WIN_21FLKWAT_LEV-CEDARK-2-1,Sample,2022-01-31 16:36:00,2022,1,Surface,...,29.1416,-83.0083,Big Bend Seagrasses Aquatic Preserve,5,9Q/7Q,Big Bend Seagrasses,BBS,Winter,2021-12-03,2022-03-02
14099,8866,514,Total Nitrogen,mg/L,WIN_21FLKWAT_LEV-CEDARK-2-1,Sample,2022-02-28 15:47:00,2022,2,Surface,...,29.1416,-83.0083,Big Bend Seagrasses Aquatic Preserve,5,9Q/7Q,Big Bend Seagrasses,BBS,Winter,2021-12-03,2022-03-02
14112,8870,514,Total Nitrogen,mg/L,WIN_21FLKWAT_LEV-CEDARK-2-1,Sample,2021-03-29 15:15:00,2021,3,Surface,...,29.1416,-83.0083,Big Bend Seagrasses Aquatic Preserve,5,9Q/7Q,Big Bend Seagrasses,BBS,Spring,2021-03-03,2021-06-07
14160,8902,514,Total Nitrogen,mg/L,WIN_21FLKWAT_LEV-CEDARK-2-1,Sample,2021-04-30 16:10:00,2021,4,Surface,...,29.1416,-83.0083,Big Bend Seagrasses Aquatic Preserve,5,9Q/7Q,Big Bend Seagrasses,BBS,Spring,2021-03-03,2021-06-07
14172,8905,514,Total Nitrogen,mg/L,WIN_21FLKWAT_LEV-CEDARK-2-1,Sample,2021-05-31 12:27:00,2021,5,Surface,...,29.1416,-83.0083,Big Bend Seagrasses Aquatic Preserve,5,7Q/9Q,Big Bend Seagrasses,BBS,Spring,2021-03-03,2021-06-07
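
The body of `idw_rk.filter_by_date_range` is not shown in this notebook. One plausible way to do such a filter in pandas is sketched below; the function name `filter_by_season_ranges` and the toy rows are hypothetical, and the real helper may differ in how it treats the end date or the existing `Season` column.

```python
import pandas as pd

def filter_by_season_ranges(samples, seasons):
    """Keep samples whose date falls inside a season window of the same water body."""
    windows = seasons[["WaterBody", "Season", "Start Date", "End Date"]].copy()
    windows["Start Date"] = pd.to_datetime(windows["Start Date"], format="%m/%d/%Y")
    windows["End Date"] = pd.to_datetime(windows["End Date"], format="%m/%d/%Y")

    # Pair every sample with every window of its water body, then keep the matches
    paired = samples.merge(windows, on="WaterBody", how="inner")
    sample_day = paired["SampleDate"].dt.normalize()  # drop the time of day
    inside = (sample_day >= paired["Start Date"]) & (sample_day <= paired["End Date"])
    return paired[inside].reset_index(drop=True)

# Toy inputs (hypothetical sample rows; season windows copied from the table above)
samples = pd.DataFrame({
    "WaterBody": ["Charlotte Harbor"] * 3,
    "SampleDate": pd.to_datetime(
        ["2017-03-01 10:00:00", "2016-12-31 12:00:00", "2018-02-27 16:00:00"]
    ),
    "ResultValue": [0.5, 0.6, 0.7],
})
seasons = pd.DataFrame({
    "WaterBody": ["Charlotte Harbor"] * 2,
    "Season": ["Spring", "Winter"],
    "Start Date": ["2/28/2017", "11/29/2017"],
    "End Date": ["6/11/2017", "2/27/2018"],
})

selected = filter_by_season_ranges(samples, seasons)
print(selected[["SampleDate", "Season"]])
```

Normalizing the sample timestamp to midnight makes the end date inclusive for the whole day, which is an assumption of this sketch.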


### 1.4 Calculating average values at unique observation points<a class="anchor" id="reg_average"></a>

In [81]:
dfAll_Mean = selected_dfAllTime.groupby(['WaterBody','ParameterName','ParameterUnits','Season','Latitude_DD','Longitude_DD','WbodyAcronym'])["ResultValue"].agg("mean").reset_index()
dfAll = dfAll_Mean
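
To illustrate what the aggregation does, here is a small sketch with toy records (hypothetical values and a reduced set of grouping keys): repeated samples at one location and season collapse to a single row holding their mean.

```python
import pandas as pd

# Toy records (hypothetical values): the first two rows share a station and season
toy = pd.DataFrame({
    "WaterBody": ["Estero Bay", "Estero Bay", "Estero Bay"],
    "Season": ["Fall", "Fall", "Fall"],
    "Latitude_DD": [26.40, 26.40, 26.45],
    "Longitude_DD": [-81.85, -81.85, -81.90],
    "ResultValue": [30.0, 34.0, 28.0],
})

keys = ["WaterBody", "Season", "Latitude_DD", "Longitude_DD"]
toy_mean = toy.groupby(keys)["ResultValue"].agg("mean").reset_index()
print(toy_mean)
```

Note that `groupby` drops rows whose key columns contain NaN by default, and that columns outside the grouping keys (for example `Year` or `RowID`) do not survive the aggregation.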

### 1.5 Convert coordinate system to EPSG: 3086<a class="anchor" id="reg_coordinate"></a>

In [82]:
# Define the EPSG codes for source (EPSG:4326) and target (EPSG:3086) coordinate systems
source_epsg = 'EPSG:4326'
target_epsg = 'EPSG:3086'

# Create a PyProj Transformer for the conversion
transformer = pyproj.Transformer.from_crs(source_epsg, target_epsg, always_xy=True)

# Define a function to apply the transformation to each row of the DataFrame
def transform_coordinates(row):
    x, y = transformer.transform(row['Longitude_DD'], row['Latitude_DD'])
    return pd.Series({'x': x, 'y': y})

# Apply the transformation function to the DataFrame and create new columns for the converted coordinates
dfAll[['x', 'y']] = dfAll.apply(transform_coordinates, axis=1)

#### Save aggregated data to csv file

In [83]:
dfAll.to_csv(gis_path + 'OEAT_4Seasons_All_WQ-2024-May-2.csv', index=False)

## 2. Prepare for batch interpolation<a class="anchor" id="reg_batch"></a>
### 2.1 Preset abbreviation for waterbody and parameter name<a class="anchor" id="reg_preset"></a>

In [84]:
area_shortnames = {
    'Guana Tolomato Matanzas': 'GTM',
    'Estero Bay': 'EB',
    'Charlotte Harbor': 'CH',
    'Biscayne Bay': 'BB',
    'Big Bend Seagrasses':'BBS'
}

param_shortnames = {
    'Salinity': 'Sal_ppt',
    'Total Nitrogen': 'TN_mgl',
    'Dissolved Oxygen': 'DO_mgl',
    'Turbidity':'Turb_ntu',
    'Secchi Depth':'Secc_m',
    'Water Temperature':'T_c'
}
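
The abbreviations feed the output file names, which follow the pattern `SHP_<area>_<parameter>_<season>.shp` seen in the log of section 3. A minimal sketch of that naming (the helper `shapefile_name` is hypothetical and is not part of `idw_rk`):

```python
area_shortnames = {
    'Guana Tolomato Matanzas': 'GTM',
    'Estero Bay': 'EB',
    'Charlotte Harbor': 'CH',
    'Biscayne Bay': 'BB',
    'Big Bend Seagrasses': 'BBS',
}

param_shortnames = {
    'Salinity': 'Sal_ppt',
    'Total Nitrogen': 'TN_mgl',
    'Dissolved Oxygen': 'DO_mgl',
    'Turbidity': 'Turb_ntu',
    'Secchi Depth': 'Secc_m',
    'Water Temperature': 'T_c',
}

def shapefile_name(waterbody, parameter, season):
    """Build a shapefile name from full names (hypothetical helper)."""
    area = area_shortnames[waterbody]      # raises KeyError for unknown names
    param = param_shortnames[parameter]
    return f"SHP_{area}_{param}_{season}.shp"

print(shapefile_name('Big Bend Seagrasses', 'Dissolved Oxygen', 'Fall'))
```

Looking names up with `[]` rather than `.get()` makes a misspelled water body or parameter fail loudly instead of producing a file named with `None`.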

### 2.2 Define the barrier files<a class="anchor" id="reg_barrier"></a>

In [85]:
barrier_folder = os.path.join(gis_path, 'Barriers')
barrier_folder

barriers = []
for file in os.listdir(barrier_folder):
    if file.endswith(".shp"):
        barriers.append(os.path.join(barrier_folder, file))

for barrier in barriers:
    print(barrier)

E:/Projects/SEACAR_WQ_2024/GIS_Data/Barriers\BBS_Barriers.shp
E:/Projects/SEACAR_WQ_2024/GIS_Data/Barriers\BB_Barriers.shp
E:/Projects/SEACAR_WQ_2024/GIS_Data/Barriers\CH_Barriers.shp
E:/Projects/SEACAR_WQ_2024/GIS_Data/Barriers\EB_Barriers.shp
E:/Projects/SEACAR_WQ_2024/GIS_Data/Barriers\GTM_Barriers.shp


### 2.3 Define waterbody boundary for spatial extent and masking<a class="anchor" id="reg_boundary"></a>

In [86]:
waterbody_extent = os.path.join(gis_path, 'OEAT_Waterbody_Boundaries', 'OEAT_Waterbody_Boundary.shp')

unique_waterbodies = []
with arcpy.da.SearchCursor(waterbody_extent, ['WaterbodyA']) as cursor:
    for row in cursor:
        unique_waterbodies.append(row[0])

print("Unique Waterbodies:", unique_waterbodies)

Unique Waterbodies: ['BBS', 'BB', 'CH', 'EB', 'GTM']


### 2.4 Load the table of study periods, parameters, and seasons<a class="anchor" id="reg_study"></a>

In [87]:
seasons_all = pd.read_csv(gis_path + 'season_def/FourSeasons_all.csv', low_memory=False)
seasons_all

Unnamed: 0,WaterBody,Season,Start Year,End Year,Parameter,Filename,NumDataPoints,RMSE,ME
0,Charlotte Harbor,Spring,2017,2017,Total Nitrogen,,,,
1,Charlotte Harbor,Summer,2017,2017,Total Nitrogen,,,,
2,Charlotte Harbor,Fall,2017,2017,Total Nitrogen,,,,
3,Charlotte Harbor,Winter,2017,2018,Total Nitrogen,,,,
4,Charlotte Harbor,Spring,2017,2017,Salinity,,,,
5,Charlotte Harbor,Summer,2017,2017,Salinity,,,,
6,Charlotte Harbor,Fall,2017,2017,Salinity,,,,
7,Charlotte Harbor,Winter,2017,2018,Salinity,,,,
8,Charlotte Harbor,Spring,2017,2017,Dissolved Oxygen,,,,
9,Charlotte Harbor,Summer,2017,2017,Dissolved Oxygen,,,,


### 2.5 Define output folders<a class="anchor" id="reg_output"></a>

In [88]:
shpAll_folder = gis_path + r"shapefiles/FourSeasons_shapefiles_All" 
idwAll_folder = gis_path + r"raster_output/FourSeasons_IDW_All"

# Preview dataset
dfAll

Unnamed: 0,WaterBody,ParameterName,ParameterUnits,Season,Latitude_DD,Longitude_DD,WbodyAcronym,ResultValue,x,y
0,Big Bend Seagrasses,Dissolved Oxygen,mg/L,Fall,29.008300,-82.825250,BBS,6.873333,514236.421562,556316.395208
1,Big Bend Seagrasses,Dissolved Oxygen,mg/L,Fall,29.125000,-82.841666,BBS,7.730000,512518.355037,569259.880703
2,Big Bend Seagrasses,Dissolved Oxygen,mg/L,Fall,29.149500,-83.079500,BBS,7.225000,489395.665621,571785.712589
3,Big Bend Seagrasses,Dissolved Oxygen,mg/L,Fall,29.161167,-83.047333,BBS,7.110000,492509.553281,573104.900093
4,Big Bend Seagrasses,Dissolved Oxygen,mg/L,Fall,29.198660,-82.772000,BBS,5.055000,519203.925699,577504.453517
...,...,...,...,...,...,...,...,...,...,...
955,Charlotte Harbor,Water Temperature,Degrees C,Winter,26.667800,-82.094600,CH,20.448393,589298.249938,297332.007386
956,Charlotte Harbor,Water Temperature,Degrees C,Winter,26.685190,-82.227900,CH,20.400000,576028.292260,299064.878357
957,Charlotte Harbor,Water Temperature,Degrees C,Winter,26.689730,-82.104710,CH,20.500000,588256.497578,299750.771499
958,Charlotte Harbor,Water Temperature,Degrees C,Winter,26.712690,-82.247680,CH,20.000000,574020.228155,302089.400172


### 2.6 Fill NaN RowID with unique ID (the IDW function needs a unique ID per point) <a class="anchor" id="reg_id"></a>

In [89]:
idw_rk.fill_nan_rowids(dfAll, 'RowID')

# Keep RowID as integer
dfAll['RowID'] = dfAll['RowID'].astype(int)
dfAll.head()

Unnamed: 0,WaterBody,ParameterName,ParameterUnits,Season,Latitude_DD,Longitude_DD,WbodyAcronym,ResultValue,x,y,RowID
0,Big Bend Seagrasses,Dissolved Oxygen,mg/L,Fall,29.0083,-82.82525,BBS,6.873333,514236.421562,556316.395208,1
1,Big Bend Seagrasses,Dissolved Oxygen,mg/L,Fall,29.125,-82.841666,BBS,7.73,512518.355037,569259.880703,2
2,Big Bend Seagrasses,Dissolved Oxygen,mg/L,Fall,29.1495,-83.0795,BBS,7.225,489395.665621,571785.712589,3
3,Big Bend Seagrasses,Dissolved Oxygen,mg/L,Fall,29.161167,-83.047333,BBS,7.11,492509.553281,573104.900093,4
4,Big Bend Seagrasses,Dissolved Oxygen,mg/L,Fall,29.19866,-82.772,BBS,5.055,519203.925699,577504.453517,5
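
The implementation of `idw_rk.fill_nan_rowids` is not included in this notebook. A hypothetical equivalent is sketched below, assuming that missing IDs are numbered consecutively after the largest existing ID and that a missing column is created from scratch; the real helper may behave differently.

```python
import numpy as np
import pandas as pd

def fill_missing_rowids(df, col="RowID"):
    """Give every row a unique integer ID (hypothetical stand-in for the idw_rk helper)."""
    if col not in df.columns:
        df[col] = np.nan                      # column absent: treat every ID as missing
    missing = df[col].isna()
    start = 0 if missing.all() else int(df[col].max())
    # Number the missing rows consecutively after the largest existing ID
    df.loc[missing, col] = np.arange(start + 1, start + 1 + missing.sum())
    df[col] = df[col].astype(int)
    return df

# Toy frames (hypothetical values)
partly_filled = fill_missing_rowids(pd.DataFrame({"RowID": [10.0, np.nan, np.nan]}))
no_column = fill_missing_rowids(pd.DataFrame({"ResultValue": [1.0, 2.0, 3.0]}))
print(partly_filled["RowID"].tolist())
print(no_column["RowID"].tolist())
```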


## 3. Create Shapefiles <a class="anchor" id="reg_create_shp"></a>

In [90]:
# Merge the study-period table with the coordinate columns of the aggregated data
seasons_all_coord = idw_rk.merge_with_lat_long_new(seasons_all, dfAll, "Season")
seasons_all_coord

Unnamed: 0,WaterBody,Season,Start Year,End Year,Parameter,Filename,NumDataPoints,RMSE,ME,x,y,RowID,ResultValue
0,Charlotte Harbor,Spring,2017,2017,Total Nitrogen,,,,,,,,
1,Charlotte Harbor,Summer,2017,2017,Total Nitrogen,,,,,,,,
2,Charlotte Harbor,Fall,2017,2017,Total Nitrogen,,,,,,,,
3,Charlotte Harbor,Winter,2017,2018,Total Nitrogen,,,,,591037.221033,273012.265703,857,0.56
4,Charlotte Harbor,Winter,2017,2018,Total Nitrogen,,,,,592008.752819,274733.690037,858,0.76
...,...,...,...,...,...,...,...,...,...,...,...,...,...
962,Big Bend Seagrasses,Winter,2021,2022,Water Temperature,,,,,371688.623915,691953.508047,808,21.10
963,Big Bend Seagrasses,Winter,2021,2022,Water Temperature,,,,,371015.090649,692081.211171,809,20.70
964,Big Bend Seagrasses,Winter,2021,2022,Water Temperature,,,,,401894.595993,699335.031753,810,19.20
965,Big Bend Seagrasses,Winter,2021,2022,Water Temperature,,,,,401457.362103,702258.799507,811,20.10


In [91]:
idw_rk.create_shp_season_new(seasons_all_coord, "Season", shpAll_folder)

Number of data rows for BBS, DO_mgl, and season: 38
Shapefile for BBS, DO_mgl for and season Fall has been saved as SHP_BBS_DO_mgl_Fall.shp
Number of data rows for BBS, Sal_ppt, and season: 29
Shapefile for BBS, Sal_ppt for and season Fall has been saved as SHP_BBS_Sal_ppt_Fall.shp
Number of data rows for BBS, Secc_m, and season: 31
Shapefile for BBS, Secc_m for and season Fall has been saved as SHP_BBS_Secc_m_Fall.shp
Number of data rows for BBS, TN_mgl, and season: 33
Shapefile for BBS, TN_mgl for and season Fall has been saved as SHP_BBS_TN_mgl_Fall.shp
Number of data rows for BBS, Turb_ntu, and season: 34
Shapefile for BBS, Turb_ntu for and season Fall has been saved as SHP_BBS_Turb_ntu_Fall.shp
Number of data rows for BBS, T_c, and season: 38
Shapefile for BBS, T_c for and season Fall has been saved as SHP_BBS_T_c_Fall.shp
Number of data rows for BBS, DO_mgl, and season: 44
Shapefile for BBS, DO_mgl for and season Spring has been saved as SHP_BBS_DO_mgl_Spring.shp
Number of data r

In [97]:
dfAll.columns

Index(['WaterBody', 'ParameterName', 'ParameterUnits', 'Season', 'Latitude_DD',
       'Longitude_DD', 'WbodyAcronym', 'ResultValue', 'x', 'y', 'RowID'],
      dtype='object')

In [None]:
# 'Year' is dropped by the aggregation in 1.4, so the mask can only use the remaining columns
(dfAll['WaterBody'] == 'Charlotte Harbor') & (dfAll['ParameterName'] == 'Salinity')

## 4. Cross Validation for IDW <a class="anchor" id="reg_cv_idw"></a>
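
The cross validation itself runs inside `idw_rk.idw_interpolation_new`, whose body is not shown. As a rough illustration of the idea, the sketch below does leave-one-out CV with a plain IDW estimator in pure Python. It is an assumption-laden stand-in: it ignores barriers, the water body mask, and search neighborhoods, it does not use arcpy, and the leave-one-out scheme, the power of 2, and the predicted-minus-observed sign are all choices made for this example.

```python
import math

def idw_predict(x, y, known, power=2.0):
    """Inverse-distance-weighted estimate at (x, y) from (xi, yi, value) tuples."""
    weighted_sum = 0.0
    weight_total = 0.0
    for xi, yi, value in known:
        distance = math.hypot(x - xi, y - yi)
        if distance == 0.0:
            return value                      # exact hit on a sample location
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total

def loo_cross_validate(points, power=2.0):
    """Hold out each point in turn, predict it from the rest, return (RMSE, ME)."""
    errors = []
    for i, (x, y, observed) in enumerate(points):
        others = points[:i] + points[i + 1:]
        predicted = idw_predict(x, y, others, power)
        errors.append(predicted - observed)   # assumed sign: predicted minus observed
    n = len(errors)
    mean_error = sum(errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return rmse, mean_error

# Toy points (hypothetical): three collinear stations with a linear trend
points = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0), (2.0, 0.0, 3.0)]
rmse, mean_error = loo_cross_validate(points)
print(round(rmse, 4), round(mean_error, 4))
```

In this toy case the two end points are pulled toward the interior values by equal and opposite amounts, so ME cancels to roughly zero while RMSE stays well above zero. This is why the two scores are reported together.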

In [94]:
# Empty the IDW raster output folder
idw_rk.delete_all_files(idwAll_folder)

In [95]:
# Select a section of table to process
seasons_slct = seasons_all.iloc[:]
seasons_slct.head()

Unnamed: 0,WaterBody,Season,Start Year,End Year,Parameter,Filename,NumDataPoints,RMSE,ME
0,Charlotte Harbor,Spring,2017,2017,Total Nitrogen,,,,
1,Charlotte Harbor,Summer,2017,2017,Total Nitrogen,,,,
2,Charlotte Harbor,Fall,2017,2017,Total Nitrogen,,,,
3,Charlotte Harbor,Winter,2017,2018,Total Nitrogen,,,,
4,Charlotte Harbor,Spring,2017,2017,Salinity,,,,


In [73]:
# IDW is skipped when a shapefile has fewer than 3 data points
idw_rk.idw_interpolation_new(seasons_slct, shpAll_folder, idwAll_folder, waterbody_extent, barrier_folder, "Season")

Shapefile not found for: SHP_CH_TN_mgl_Spring.shp
Shapefile not found for: SHP_CH_TN_mgl_Summer.shp
Shapefile not found for: SHP_CH_TN_mgl_Fall.shp
Processing file: SHP_CH_TN_mgl_Winter.shp
File SHP_CH_TN_mgl_Winter.shp has completed 30 cross-validation iterations.
Processing file: SHP_CH_Sal_ppt_Spring.shp
File SHP_CH_Sal_ppt_Spring.shp has completed 3 cross-validation iterations.
Processing file: SHP_CH_Sal_ppt_Summer.shp
File SHP_CH_Sal_ppt_Summer.shp has completed 3 cross-validation iterations.
Processing file: SHP_CH_Sal_ppt_Fall.shp
File SHP_CH_Sal_ppt_Fall.shp has completed 3 cross-validation iterations.
Processing file: SHP_CH_Sal_ppt_Winter.shp
File SHP_CH_Sal_ppt_Winter.shp has completed 3 cross-validation iterations.
Processing file: SHP_CH_DO_mgl_Spring.shp
File SHP_CH_DO_mgl_Spring.shp has completed 3 cross-validation iterations.
Processing file: SHP_CH_DO_mgl_Summer.shp
File SHP_CH_DO_mgl_Summer.shp has completed 3 cross-validation iterations.
Processing file: SHP_CH_DO_m