#### The error analysis below uses the mean absolute error, expressed in feet, the unit in which groundwater depth is measured.
The test set is for YEAR = 2020.
The last year in the training set is 2019, and it contains CURRENT_DEPTH.
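
As a reminder of what the metric measures, here is a minimal sketch of the mean absolute error. The helper name `mean_absolute_error` and the sample depths are illustrative assumptions, not values from the dataset or functions from `lib`.

```python
import numpy as np


def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted|, in the same units as the inputs (feet here)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))


# Illustrative depths in feet (made up for this sketch, not taken from the data)
actual_depth = [50.0, 120.0, 200.0]
predicted_depth = [55.0, 110.0, 230.0]

# Absolute errors are 5, 10 and 30 feet, so the mean is 15 feet
mae = mean_absolute_error(actual_depth, predicted_depth)
```

Because the error is an average of absolute differences, it reads directly as "feet off, on average", which is why the charts below can be compared against the depth values themselves.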


In [1]:
import sys
sys.path.append('..')



import numpy as np
import pandas as pd
import altair as alt
from lib.supervised_tuning import get_model_errors, read_target_shifted_data
from lib.read_data import read_and_join_output_file
from lib.viz import (
    sjv_color_range_17,
    sjv_color_range_9,
    chart_error_distribution,
    chart_error_by_township,
    chart_error_by_depth,
    chart_depth_diff_error,
)





#### Get the actual data that has not been normalized

In [2]:
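# Load the joined output file, which holds the raw (non-normalized) values
full_df = read_and_join_output_file()
# Keep the GSE_GWE depth column for 2020 and 2021 (index level 1 is the year)
full_df = full_df[full_df.index.get_level_values(1).isin(['2020', '2021'])]['GSE_GWE']
# Pivot the years into columns, giving one row per township range
full_df = full_df.unstack(level=-1)
# Absolute change in depth between 2020 and 2021, in feet
full_df['depth_diff'] = np.abs(full_df['2020'] - full_df['2021'])
full_df.reset_index(inplace=True)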

Insights into the model

- Feature importance and feature ablation through SHAP are covered in the section called Explainability through SHAP.

- Failure/Error analysis is conducted below

In [3]:
test_model_errors_df, error_df = get_model_errors()

### Analyzing the pattern of errors made by the RandomForest and SVR regressors

In [5]:

chart_error_distribution(error_df)

In [6]:
model_name_list = [
    "CatBoostRegressor_absolute_error",
    "SVR_absolute_error",
    "RandomForestRegressor_absolute_error",
]

chart_error_by_depth(error_df, model_name_list)

Also check the township ranges with the highest mean absolute error.

In [7]:

errors_by_township_df, chart = chart_error_by_township(error_df, model_name_list, 20)
chart

The townships with the highest absolute errors are shown above:
- T15S R10E
- T10S R21E
- T27S R27E
- T20S R18E
- T22S R17E
- T22S R28E
- T22S R16E
- T27S R26E


#### Since the model feature importances and SHAP indicated that the previous depth is the biggest predictor of the future depth, check the previous year's depth for the above townships

In [8]:
# The target is shifted by one year: the 2021 current depth is the value to be predicted for the 2020 rows.
# The test year is 2020, so the model is predicting the 2021 current depth.
# Look into the 2020 current depth in these townships.
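
To make the shifted target concrete, here is a minimal sketch of how a one-year target shift can be built with pandas. It is not the implementation of `read_target_shifted_data`, which is not shown in this notebook. The `TARGET_DEPTH` column name and the toy values are assumptions for illustration.

```python
import pandas as pd

# Toy data with made-up depths in feet; years are strings, as in the notebook
toy = pd.DataFrame(
    {
        "TOWNSHIP_RANGE": ["T01N R02E"] * 3 + ["T01N R03E"] * 3,
        "YEAR": ["2019", "2020", "2021"] * 2,
        "CURRENT_DEPTH": [51.0, 52.2, 53.2, 23.0, 24.4, 32.7],
    }
).sort_values(["TOWNSHIP_RANGE", "YEAR"])

# Within each township, pair every year with the following year's depth
toy["TARGET_DEPTH"] = toy.groupby("TOWNSHIP_RANGE")["CURRENT_DEPTH"].shift(-1)

# The 2020 rows now carry the 2021 depth as their target
test_rows = toy[toy["YEAR"] == "2020"]

# The final year has no following year, so its target is missing
last_year_rows = toy[toy["YEAR"] == "2021"]
```

Under this kind of setup, a row labelled 2020 holds the 2020 depth as a feature and the 2021 depth as the value to predict, which is why the comparison below uses both the `2020` and `2021` columns.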


In [9]:
full_df

| | TOWNSHIP_RANGE | 2020 | 2021 | depth_diff |
|---|---|---|---|---|
| 0 | T01N R02E | 52.196000 | 53.193636 | 0.997636 |
| 1 | T01N R03E | 24.418788 | 32.676189 | 8.257401 |
| 2 | T01N R04E | 18.961667 | 16.672857 | 2.288810 |
| 3 | T01N R05E | 20.336154 | 19.476364 | 0.859790 |
| 4 | T01N R06E | 32.380000 | 33.198000 | 0.818000 |
| ... | ... | ... | ... | ... |
| 473 | T32S R26E | 197.730769 | 220.866667 | 23.135897 |
| 474 | T32S R27E | 119.037500 | 151.778571 | 32.741071 |
| 475 | T32S R28E | 191.171429 | 174.023077 | 17.148352 |
| 476 | T32S R29E | 344.578571 | 326.627273 | 17.951299 |


Plot the difference between the current-year depth and the target depth against the mean absolute error

In [10]:
chart_depth_diff_error(error_df, full_df)
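
The relationship in this chart can also be summarized numerically. The sketch below joins per-township errors to the depth differences and computes a Pearson correlation. It assumes `error_df` has a `TOWNSHIP_RANGE` column and a per-model error column such as `SVR_absolute_error`. The toy frames and their values are made up, and this is not what `chart_depth_diff_error` does internally.

```python
import pandas as pd

# Hypothetical stand-ins for full_df and error_df (values are illustrative)
toy_full_df = pd.DataFrame(
    {
        "TOWNSHIP_RANGE": ["A", "B", "C", "D"],
        "depth_diff": [1.0, 8.0, 23.0, 33.0],
    }
)
toy_error_df = pd.DataFrame(
    {
        "TOWNSHIP_RANGE": ["A", "B", "C", "D"],
        "SVR_absolute_error": [2.0, 9.0, 20.0, 35.0],
    }
)

# Align errors and depth changes on the township key
merged = toy_error_df.merge(toy_full_df, on="TOWNSHIP_RANGE", how="inner")

# Pearson correlation between the year-over-year depth change and the error
correlation = merged["depth_diff"].corr(merged["SVR_absolute_error"])
```

A correlation near 1 on the real data would support the reading that the models err most where the depth changed most between the current year and the target year. A weak correlation would point to other causes.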
