# Preparing the Data for Machine Learning
This notebook prepares the data for machine learning, using scikit-learn Pipelines to impute the missing data.

In [1]:
# For Deepnote to be able to use the custom libraries in the parent ../lib folder
import sys
sys.path.append('..')

In [2]:
import os
import pandas as pd
from lib.read_data import read_and_join_output_file
from lib.impute import create_transformation_pipelines
from lib.viz import draw_missing_data_chart, draw_corr_heatmap, draw_components_variance_chart, biplot, draw_feature_importance
from sklearn.model_selection import GroupShuffleSplit
from sklearn.decomposition import PCA
from sklearn import set_config

First we load and join all the datasets resulting from the ETL process and initialize some variables.

We have two potential targets for supervised and unsupervised learning, which we remove from the list of features:
* `GSE_GWE` - The depth to groundwater, i.e. the difference between the ground surface elevation and the groundwater elevation, in feet below ground surface
* `SHORTAGE_COUNT` - The number of reported well shortages

In [3]:
indv_feature_dict, all_features_df = read_and_join_output_file()
feature_columns = list(all_features_df.columns)
targets = ["GSE_GWE", "SHORTAGE_COUNT"]
RANDOM_SEED = 42
feature_columns = list(set(feature_columns) - set(targets))
X = all_features_df[feature_columns]
y = all_features_df[targets]

## Missing Data

In [4]:
X.sample(5)

TOWNSHIP_RANGE,YEAR,CROP_V,CROP_D5,PCT_OF_CAPACITY,VEGETATION_CALIFORNIA_COAST_LIVE_OAK,CROP_T19,SOIL_ALFISOLS_B,VEGETATION_NON-NATIVE_HARDWOOD_FOREST,CROP_F10,WELL_COUNT_AGRICULTURE,CROP_D6,...,SOIL_WATER_,SOIL_INCEPTISOLS_D,POPULATION_DENSITY,CROP_C6,SOIL_INCEPTISOLS_B,CROP_T30,CROP_T8,VEGETATION_KNOBCONE_PINE,TOPOFPERFORATEDINTERVAL_AVG,SOIL_ENTISOLS_B
T27S R27E,2021,,,14.92,,,,,,0.0,,...,,,,,,,,,,
T27S R21E,2019,,,58.399137,0.0,,,0.0,,0.0,,...,,,2.921441,,,,,0.0,,
T10N R21W,2020,,,90.576923,,,,,,0.0,,...,,,8.978908,,,,,,,
T10N R20W,2020,,,90.576923,,,,,,0.0,,...,,,8.978908,,,,,,,
T29S R27E,2017,,,,,,,,,1.0,,...,,,3649.770482,,,,,,640.0,


Let's look at the features with missing data.

In [5]:
draw_missing_data_chart(X)
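
The chart itself comes from the project's `draw_missing_data_chart` helper, whose implementation is not shown here. As a rough numeric counterpart, the share of missing values per feature can be computed directly with pandas. The sketch below uses a small made-up frame (`toy`) rather than the real `X`; only the column names are borrowed from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for X: column names come from the notebook, values are invented.
toy = pd.DataFrame(
    {
        "CROP_C6": [0.1, np.nan, 0.2, np.nan],
        "POPULATION_DENSITY": [10.0, 11.0, 12.0, np.nan],
        "WELL_COUNT_AGRICULTURE": [0.0, 1.0, 0.0, 2.0],
    }
)

# Fraction of missing values per column, most incomplete feature first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the real data, the same expression would be `X.isna().mean()`.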

### Data Missing for Specific Years
Data were collected for the years 2014 to 2021, but some datasets only have data for the specific years in which surveys were conducted or published. For example:
* The Soils survey only has data for 2016
* The Vegetation survey only has data for 2019
* Crops only have data for the years 2014, 2016 and 2018
* Population density is available only for the years 2014 - 2020
* The reservoir water `PCT_OF_CAPACITY` is available only for the years 2018 - 2020

In [6]:
crops_columns = [col for col in X if col.startswith('CROP_')]
crops_df = X[crops_columns].dropna()
soils_columns = [col for col in X if col.startswith('SOIL_')]
soils_df = X[soils_columns].dropna()
# Each label is paired with the DataFrame of the same dataset
print(f"Years present in the Crops dataset {list(crops_df.index.unique(level='YEAR'))}")
print(f"Years present in the Soils dataset {list(soils_df.index.unique(level='YEAR'))}")

Years present in the Crops dataset ['2014', '2016', '2018']
Years present in the Soils dataset ['2016']


### Data Missing for Specific Township-Ranges
The Well Completion Reports dataset has data for all the years but is missing data for some specific Township-Ranges. Typically, if no wells were drilled in a specific Township-Range during the 2014-2020 period, then there is no data for that Township-Range for any of the following features:
* `TOTALDRILLDEPTH_AVG`
* `WELLYIELD_AVG`
* `STATICWATERLEVEL_AVG`
* `TOPOFPERFORATEDINTERVAL_AVG`
* `BOTTOMOFPERFORATEDINTERVAL_AVG`
* `GROUNDSURFACEELEVATION_AVG`
* `TOTALCOMPLETEDDEPTH_AVG`

Wells can also be reported with incomplete data, which means that some of the above features can be missing for a Township-Range even if wells were reported in it.

In [7]:
all_township_ranges = set(X.index.unique(level="TOWNSHIP_RANGE"))
# TOWNSHIP_RANGE is an index level, not a column, so only the '_AVG' columns need to be selected
wells_columns = [col for col in X if col.endswith('_AVG')]
wells_df = X[wells_columns].dropna()
missing_township_ranges = all_township_ranges - set(wells_df.index.unique(level="TOWNSHIP_RANGE"))
print(f"There are {len(missing_township_ranges)} out of {len(all_township_ranges)} Township-Ranges with missing well completion report data: {missing_township_ranges}")

There are 169 out of 478 Township-Ranges with missing well completion report data: {'T24S R20E', 'T14S R25E', 'T30S R22E', 'T25S R26E', 'T13S R11E', 'T11N R17W', 'T10S R19E', 'T24S R28E', 'T03N R04E', 'T26S R20E', 'T25S R28E', 'T28S R21E', 'T17S R16E', 'T03S R09E', 'T23S R21E', 'T24S R21E', 'T05S R06E', 'T25S R17E', 'T09S R12E', 'T31S R25E', 'T12N R22W', 'T26S R19E', 'T31S R24E', 'T08S R11E', 'T23S R20E', 'T22S R23E', 'T24S R19E', 'T29S R20E', 'T12N R19W', 'T29S R23E', 'T12S R23E', 'T24S R22E', 'T01N R05E', 'T08S R12E', 'T03S R13E', 'T16S R12E', 'T05N R05E', 'T12S R16E', 'T32S R25E', 'T23S R17E', 'T19S R15E', 'T31S R22E', 'T06S R09E', 'T29S R22E', 'T11S R15E', 'T11N R18W', 'T12S R09E', 'T10N R23W', 'T10S R13E', 'T10N R19W', 'T12N R18W', 'T25S R20E', 'T12S R11E', 'T07S R16E', 'T11N R19W', 'T10N R18W', 'T08S R18E', 'T11S R11E', 'T30S R20E', 'T21S R19E', 'T04N R04E', 'T12N R20W', 'T27S R27E', 'T06N R10E', 'T26S R28E', 'T27S R19E', 'T29S R29E', 'T28S R20E', 'T12N R24W', 'T01S R05E', 'T13S 

## Train-Test Split
We split the dataset into training and test sets before imputation, so that the imputation is fitted on the training data only. Because the data are time series grouped at the Township-Range level, we cannot assign rows to the train and test sets at random: all the years of a given Township-Range must stay together in the same set.

In [8]:
tr_splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=RANDOM_SEED)
X_no_index = X.reset_index(drop=False)
y_no_index = y.reset_index(drop=False)
split = tr_splitter.split(X_no_index, y_no_index, groups=X_no_index["TOWNSHIP_RANGE"])
train_idx, test_idx = next(split)
X_train = X_no_index.loc[train_idx].set_index(['TOWNSHIP_RANGE', 'YEAR'], drop=True)
X_test = X_no_index.loc[test_idx].set_index(['TOWNSHIP_RANGE', 'YEAR'], drop=True)
y_train = y_no_index.loc[train_idx].set_index(['TOWNSHIP_RANGE', 'YEAR'], drop=True)
y_test = y_no_index.loc[test_idx].set_index(['TOWNSHIP_RANGE', 'YEAR'], drop=True)
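
A quick way to confirm that a grouped split behaves as intended is to check that no Township-Range ends up on both sides. The sketch below does this on a small invented dataset (`toy`, with made-up group labels `A` to `E`); it is an illustration of the check, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 5 made-up Township-Ranges observed over 3 years each.
toy = pd.DataFrame(
    {
        "TOWNSHIP_RANGE": np.repeat(["A", "B", "C", "D", "E"], 3),
        "YEAR": np.tile(["2014", "2015", "2016"], 5),
        "value": np.arange(15, dtype=float),
    }
)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(toy, groups=toy["TOWNSHIP_RANGE"]))

train_groups = set(toy.loc[train_idx, "TOWNSHIP_RANGE"])
test_groups = set(toy.loc[test_idx, "TOWNSHIP_RANGE"])

# No Township-Range should appear on both sides of the split.
overlap = train_groups & test_groups
print(f"train: {sorted(train_groups)}, test: {sorted(test_groups)}, overlap: {overlap}")
```

On the real data, the same check would compare the `TOWNSHIP_RANGE` index levels of `X_train` and `X_test`.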

Let's look at the first rows of the training and test sets.

In [9]:
X_train.head(16)

TOWNSHIP_RANGE,YEAR,CROP_V,CROP_D5,PCT_OF_CAPACITY,VEGETATION_CALIFORNIA_COAST_LIVE_OAK,CROP_T19,SOIL_ALFISOLS_B,VEGETATION_NON-NATIVE_HARDWOOD_FOREST,CROP_F10,WELL_COUNT_AGRICULTURE,CROP_D6,...,SOIL_WATER_,SOIL_INCEPTISOLS_D,POPULATION_DENSITY,CROP_C6,SOIL_INCEPTISOLS_B,CROP_T30,CROP_T8,VEGETATION_KNOBCONE_PINE,TOPOFPERFORATEDINTERVAL_AVG,SOIL_ENTISOLS_B
T01N R03E,2014,0.0,0.017239,,,0.0,,,0.0,2.0,7e-05,...,,,1742.9492,0.001229,,0.0,0.0,,150.0,
T01N R03E,2015,,,,,,,,,0.0,,...,,,1742.255307,,,,,,180.0,
T01N R03E,2016,0.054666,0.015746,,,0.0,0.0,,0.048395,2.0,0.0,...,0.0,0.0,1727.268143,0.000711,0.0,0.0,0.0,,233.5,0.0
T01N R03E,2017,,,,,,,,,0.0,,...,,,1755.123782,,,,,,195.0,
T01N R03E,2018,0.055522,0.010151,81.213158,,0.0,,,0.023093,0.0,0.0,...,,,1767.462488,0.000797,,0.0,0.0,,226.833333,
T01N R03E,2019,,,85.609615,0.000137,,,0.095148,,0.0,,...,,,1820.997426,,,,,0.0,214.75,
T01N R03E,2020,,,74.961538,,,,,,0.0,,...,,,1813.320489,,,,,,195.625,
T01N R03E,2021,,,64.5,,,,,,2.0,,...,,,,,,,,,227.833333,
T01N R04E,2014,0.0,0.0,,,0.01238,,,0.0,0.0,0.0,...,,,398.826329,0.0,,0.0,0.0,,,
T01N R04E,2015,,,,,,,,,0.0,,...,,,406.16734,,,,,,,


In [10]:
X_test.head(16)

TOWNSHIP_RANGE,YEAR,CROP_V,CROP_D5,PCT_OF_CAPACITY,VEGETATION_CALIFORNIA_COAST_LIVE_OAK,CROP_T19,SOIL_ALFISOLS_B,VEGETATION_NON-NATIVE_HARDWOOD_FOREST,CROP_F10,WELL_COUNT_AGRICULTURE,CROP_D6,...,SOIL_WATER_,SOIL_INCEPTISOLS_D,POPULATION_DENSITY,CROP_C6,SOIL_INCEPTISOLS_B,CROP_T30,CROP_T8,VEGETATION_KNOBCONE_PINE,TOPOFPERFORATEDINTERVAL_AVG,SOIL_ENTISOLS_B
T01N R02E,2014,0.0,0.010297,,,0.002588,,,0.0,1.0,0.0,...,,,2698.790108,0.005554,,0.0,0.0,,80.0,
T01N R02E,2015,,,,,,,,,0.0,,...,,,2714.296906,,,,,,,
T01N R02E,2016,0.00329,0.008706,,,0.0,0.0,,0.0,0.0,0.0,...,0.0,0.0,2727.545208,0.004971,0.0,0.0,0.0,,130.0,0.0
T01N R02E,2017,,,,,,,,,0.0,,...,,,2796.934562,,,,,,126.666667,
T01N R02E,2018,0.003806,0.005494,79.4,,0.0,,,0.0,1.0,0.0,...,,,2792.78195,0.00511,,0.0,0.0,,118.0,
T01N R02E,2019,,,91.2,0.002749,,,0.063803,,1.0,,...,,,2862.966172,,,,,0.0,60.0,
T01N R02E,2020,,,81.423077,,,,,,1.0,,...,,,2879.978198,,,,,,111.25,
T01N R02E,2021,,,64.5,,,,,,0.0,,...,,,,,,,,,,
T01N R11E,2014,0.0,0.0,,,0.0,,,0.0,0.0,0.0,...,,,30.872071,0.0,,0.0,0.0,,0.0,
T01N R11E,2015,,,,,,,,,1.0,,...,,,30.828991,,,,,,47.5,


## Data Imputation
### Imputation Strategies
To impute the missing data we will use the following strategies:
1. We assume little year-to-year variation in Crops, Soils and Vegetation. The missing Crops data will thus be imputed from the previous year (e.g. the 2015 data will be set to the 2014 data). For Soils and Vegetation, where we only have data for 1 year, the missing data will all be imputed from the available year.
2. The 2021 population density data will be estimated based on the 2020 population density and the 2019-2020 trend.
3. For the pre-2018 missing reservoir water `PCT_OF_CAPACITY` data, as California was affected by severe droughts during those years, we will impute missing data by taking the **minimum** `PCT_OF_CAPACITY` for that Township-Range in the data from 2018 onwards.
4. For the well completion reports' features with missing data we will use 2 distinct strategies:
    * For the `GROUNDSURFACEELEVATION_AVG` feature we will use the median value over all the years for that Township-Range. For Township-Ranges with no data at all in any year of the 2014-2020 period, we will use the median value over all Township-Ranges.
    * The other features will be set to 0, since these are well measurements and missing data are mainly due to no wells being drilled in that Township-Range and year.
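
The strategies above can be sketched with plain pandas on invented data. This is a minimal illustration, not the code in `lib.impute`, which is not shown here. The Township-Range labels and all values are made up, the helper `extrapolate_last` is hypothetical, and reading "trend" as a linear extrapolation of the last year-over-year change is an assumption.

```python
import numpy as np
import pandas as pd

names = ["TOWNSHIP_RANGE", "YEAR"]
years = [str(y) for y in range(2014, 2022)]
# Toy data for one made-up Township-Range; values are invented.
toy = pd.DataFrame(
    {
        "CROP_C6": [0.1, np.nan, 0.2, np.nan, 0.3, np.nan, np.nan, np.nan],
        "SOIL_ENTISOLS_B": [np.nan, np.nan, 0.5, np.nan, np.nan, np.nan, np.nan, np.nan],
        "POPULATION_DENSITY": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 17.0, np.nan],
        "PCT_OF_CAPACITY": [np.nan, np.nan, np.nan, np.nan, 80.0, 85.0, 75.0, 64.5],
        "WELLYIELD_AVG": [5.0, np.nan, 7.0, np.nan, np.nan, 6.0, np.nan, 4.0],
    },
    index=pd.MultiIndex.from_product([["T00N R00E"], years], names=names),
)
by_tr = toy.groupby(level="TOWNSHIP_RANGE")
imputed = toy.copy()

# Strategy 1: carry the last survey forward; back-fill covers years before the only survey.
for col in ["CROP_C6", "SOIL_ENTISOLS_B"]:
    imputed[col] = by_tr[col].transform(lambda s: s.ffill().bfill())


def extrapolate_last(series: pd.Series) -> pd.Series:
    """Fill a missing final year with last value + last observed change (assumed linear)."""
    filled = series.copy()
    known = series.dropna()
    if filled.isna().iloc[-1] and len(known) >= 2:
        filled.iloc[-1] = known.iloc[-1] + (known.iloc[-1] - known.iloc[-2])
    return filled


# Strategy 2: estimate 2021 from the 2020 value and the 2019-2020 change.
imputed["POPULATION_DENSITY"] = by_tr["POPULATION_DENSITY"].transform(extrapolate_last)

# Strategy 3: fill missing years with the minimum observed for the same Township-Range.
imputed["PCT_OF_CAPACITY"] = toy["PCT_OF_CAPACITY"].fillna(
    by_tr["PCT_OF_CAPACITY"].transform("min")
)

# Strategy 4b: well measurements default to 0 when no well was reported.
imputed["WELLYIELD_AVG"] = toy["WELLYIELD_AVG"].fillna(0.0)

# Strategy 4a: per-Township-Range median, then the overall median as a fallback.
elev = pd.Series(
    [100.0, np.nan, 104.0, np.nan, np.nan, np.nan, 50.0, 50.0, 50.0],
    index=pd.MultiIndex.from_product(
        [["T00N R00E", "T00N R01E", "T00N R02E"], years[:3]], names=names
    ),
    name="GROUNDSURFACEELEVATION_AVG",
)
tr_median = elev.groupby(level="TOWNSHIP_RANGE").transform("median")
elev_imputed = elev.fillna(tr_median).fillna(elev.median())

print(imputed)
print(elev_imputed)
```

In this sketch the fallback is the median of all observed values; taking the median of the per-Township-Range medians would be an equally plausible reading of the strategy.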

In [11]:
impute_pipeline, columns = create_transformation_pipelines(X_train)
X_train_impute = impute_pipeline.fit_transform(X_train)
X_test_impute = impute_pipeline.transform(X_test)
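
The pipeline is fitted on the training set and only applied to the test set, so no statistic computed from the test data leaks into the transformation. The internals of `create_transformation_pipelines` are not shown here; the sketch below illustrates the same fit/transform pattern with a generic scikit-learn `Pipeline`. The choice of `SimpleImputer` and `MinMaxScaler`, and the names `toy_pipeline`, `X_train_toy` and `X_test_toy`, are assumptions made for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data; NaN marks a missing value.
X_train_toy = np.array([[1.0], [3.0], [np.nan], [5.0]])
X_test_toy = np.array([[np.nan], [9.0]])

toy_pipeline = Pipeline(
    [
        ("impute", SimpleImputer(strategy="median")),
        ("scale", MinMaxScaler()),
    ]
)

# The median and the min/max range are learned from the training rows only.
train_out = toy_pipeline.fit_transform(X_train_toy)
# The test rows are transformed with those training statistics.
test_out = toy_pipeline.transform(X_test_toy)

learned_median = toy_pipeline.named_steps["impute"].statistics_[0]
print(f"median learned on train: {learned_median}")
print(f"transformed test values: {test_out.ravel()}")
```

The missing test value is filled with the training median, and the test value of 9 is scaled to more than 1 because the scaler's range was learned on training data alone.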

We combine the imputed training and test datasets into one dataset to visualize the results.

In [15]:
X_train_impute_df = pd.DataFrame(X_train_impute, index=X_train.index, columns=columns)
X_test_impute_df = pd.DataFrame(X_test_impute, index=X_test.index, columns=columns)
X_impute_df = pd.concat([X_train_impute_df, X_test_impute_df], axis=0)
X_impute_df.head(16)

TOWNSHIP_RANGE,YEAR,TOTALDRILLDEPTH_AVG,WELLYIELD_AVG,STATICWATERLEVEL_AVG,TOPOFPERFORATEDINTERVAL_AVG,BOTTOMOFPERFORATEDINTERVAL_AVG,TOTALCOMPLETEDDEPTH_AVG,VEGETATION_CALIFORNIA_COAST_LIVE_OAK,VEGETATION_NON-NATIVE_HARDWOOD_FOREST,VEGETATION_PINYON-JUNIPER,VEGETATION_CANYON_LIVE_OAK,...,CROP_T8,POPULATION_DENSITY,PCT_OF_CAPACITY,GROUNDSURFACEELEVATION_AVG,WELL_COUNT_AGRICULTURE,AVERAGE_YEARLY_PRECIPITATION,WELL_COUNT_INDUSTRIAL,AREA,WELL_COUNT_DOMESTIC,WELL_COUNT_PUBLIC
T01N R03E,2014,0.097778,0.015967,0.037145,0.098039,0.111111,0.105856,0.000137,0.095148,0.0,0.0,...,0.0,0.252406,0.640021,0.023626,0.029412,0.163573,0.0,0.861464,0.041667,0.0
T01N R03E,2015,0.095238,0.018423,0.025042,0.117647,0.08046,0.079848,0.000137,0.095148,0.0,0.0,...,0.0,0.252305,0.640021,0.018249,0.0,0.2179,0.0,0.861464,0.027778,0.0
T01N R03E,2016,0.114286,0.006927,0.022398,0.152614,0.103768,0.10488,0.000137,0.095148,0.0,0.0,...,0.0,0.250132,0.640021,0.024153,0.029412,0.209056,0.0,0.861464,0.055556,0.0
T01N R03E,2017,0.0,0.011975,0.030885,0.127451,0.082375,0.081749,0.000137,0.095148,0.0,0.0,...,0.0,0.254171,0.640021,0.023541,0.0,0.213645,0.0,0.861464,0.027778,0.0
T01N R03E,2018,0.083873,0.002165,0.034558,0.148257,0.093934,0.107605,0.000137,0.095148,0.0,0.0,...,0.0,0.255961,0.830381,0.020523,0.0,0.181012,0.0,0.861464,0.097222,0.0
T01N R03E,2019,0.078492,0.0,0.029772,0.140359,0.094732,0.094011,0.000137,0.095148,0.0,0.0,...,0.0,0.263724,0.880456,0.020479,0.0,0.367632,0.0,0.861464,0.097222,0.125
T01N R03E,2020,0.074325,0.013357,0.030607,0.127859,0.08932,0.089021,0.000137,0.095148,0.0,0.0,...,0.0,0.26261,0.759176,0.018852,0.0,0.194167,0.0,0.861464,0.111111,0.0
T01N R03E,2021,0.115159,0.062159,0.020987,0.148911,0.112197,0.11749,0.000137,0.095148,0.0,0.0,...,0.0,0.261497,0.640021,0.023682,0.029412,0.125331,0.0,0.861464,0.083333,0.0
T01N R04E,2014,0.0,0.0,0.0,0.0,0.0,0.0,0.00797,8.9e-05,0.0,0.0,...,0.0,0.057494,0.489434,0.000696,0.0,0.163573,0.0,0.880044,0.0,0.0
T01N R04E,2015,0.0,0.0,0.0,0.0,0.0,0.0,0.00797,8.9e-05,0.0,0.0,...,0.0,0.058558,0.489434,0.000696,0.0,0.2179,0.0,0.880044,0.0,0.0


In [16]:
draw_missing_data_chart(X_impute_df)

In [17]:
set_config(display="diagram")
display(impute_pipeline)

In [18]:
X_impute_df["CROP_C6"]

TOWNSHIP_RANGE  YEAR
T01N R03E       2014    0.001229
                2015    0.001229
                2016    0.000711
                2017    0.000711
                2018    0.000797
                          ...   
T32S R26E       2017    0.000000
                2018    0.000000
                2019    0.000000
                2020    0.000000
                2021    0.000000
Name: CROP_C6, Length: 3824, dtype: float64

In [19]:
X["CROP_C6"]

TOWNSHIP_RANGE  YEAR
T01N R02E       2014    0.005554
                2015         NaN
                2016    0.004971
                2017         NaN
                2018    0.005110
                          ...   
T32S R30E       2017         NaN
                2018    0.000000
                2019         NaN
                2020         NaN
                2021         NaN
Name: CROP_C6, Length: 3824, dtype: float64