# Imputing Missing Data
This notebook presents how missing data is imputed in the joined dataset.

In [None]:
# For Deepnote to be able to use the custom libraries in the parent ../lib folder
import sys
sys.path.append('..')

In [None]:
import pandas as pd
from lib.read_data import read_and_join_output_file
from lib.impute import create_transformation_pipelines, train_test_group_time_split
from lib.viz import draw_missing_data_chart
from sklearn.decomposition import PCA
from sklearn import set_config



First we load and join all the datasets resulting from the ETL process and initialize some variables.

We have two potential targets for supervised and unsupervised learning:
* `GSE_GWE` - Ground Surface Elevation to Groundwater Elevation: the depth to groundwater, in feet below ground surface
* `SHORTAGE_COUNT` - The number of reported well shortages

Since this is a time series dataset, the objective is to predict these values from historical data. For example, we want to predict the 2022 count of well shortages (`SHORTAGE_COUNT`) per Township-Range based on all past data, including past well shortage counts. We thus do not split these target variables from the data yet.

In [None]:
RANDOM_SEED = 42
df = read_and_join_output_file()

## Missing Data

In [None]:
df.sample(5)

AttributeError: 'tuple' object has no attribute 'sample'

Let's look at the features with missing data.

In [None]:
draw_missing_data_chart(df)
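As a numeric complement to the chart, the share of missing values per feature can be tabulated directly with pandas. This is a minimal sketch on toy data: the helper `missing_data_summary` and the column name `SOIL_X` are hypothetical and are not part of the `lib` package.

```python
import numpy as np
import pandas as pd


def missing_data_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    summary = pd.DataFrame({
        "missing_count": frame.isna().sum(),
        "missing_pct": frame.isna().mean() * 100,
    })
    # Keep only the columns that actually have gaps
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_pct", ascending=False)


# Toy data, made up for illustration only
toy = pd.DataFrame({
    "CROP_C": [0.0, np.nan, 0.1, np.nan],
    "SOIL_X": [np.nan, np.nan, 0.3, np.nan],
    "WELL_COUNT_DOMESTIC": [3.0, 2.0, 4.0, 2.0],
})
summary = missing_data_summary(toy)
print(summary)
```

On the real data the same helper could be called as `missing_data_summary(df)`, assuming `df` is a DataFrame.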

### Data Missing for Specific Years
Data were collected for the years 2014 to 2021, but some datasets only have data for specific years, when surveys were done or published. For example:
* The Soils survey only has data for 2016
* The Vegetation dataset only has data for 2019
* The Crops dataset only has data for the years 2014, 2016 and 2018
* Population density is available only for the years 2014-2020
* The reservoir water `PCT_OF_CAPACITY` is available only for the years 2018-2020

In [None]:
crops_columns = [col for col in df if col.startswith('CROP_')]
crops_df = df[crops_columns].dropna()
soils_columns = [col for col in df if col.startswith('SOIL_')]
soils_df = df[soils_columns].dropna()
# Print the years for which each dataset has complete rows
print(f"Years present in the Crops dataset {list(crops_df.index.unique(level='YEAR'))}")
print(f"Years present in the Soils dataset {list(soils_df.index.unique(level='YEAR'))}")

Years present in the Crops dataset ['2014', '2016', '2018']
Years present in the Soils dataset ['2016']


### Data Missing for Specific Township-Ranges
The Well Completion Reports dataset has data for all the years but is missing data for some specific Township-Ranges. Typically, if no wells were drilled in a specific Township-Range during the 2014-2020 period, then there is no data for that Township-Range for any of the following features:
* `TOTALDRILLDEPTH_AVG`
* `WELLYIELD_AVG`
* `STATICWATERLEVEL_AVG`
* `TOPOFPERFORATEDINTERVAL_AVG`
* `BOTTOMOFPERFORATEDINTERVAL_AVG`
* `GROUNDSURFACEELEVATION_AVG`
* `TOTALCOMPLETEDDEPTH_AVG`

Wells can also be reported with incomplete data, which means that some of the above features could be missing for some Township-Ranges even if wells were reported in those Township-Ranges.

In [None]:
all_township_ranges = set(df.index.unique(level="TOWNSHIP_RANGE"))
wells_columns = [col for col in df if col.endswith('_AVG') or col == "TOWNSHIP_RANGE"]
wells_df = df[wells_columns].dropna()
missing_township_ranges = all_township_ranges - set(wells_df.index.unique(level="TOWNSHIP_RANGE"))
print(f"There are {len(missing_township_ranges)} out of {len(all_township_ranges)} Township-Ranges with missing well completion report data: {missing_township_ranges}")

There are 169 out of 478 Township-Ranges with missing well completion report data: {'T08S R12E', 'T05S R06E', 'T08N R08E', 'T10S R08E', 'T12S R10E', 'T22S R29E', 'T11N R22W', 'T04S R15E', 'T12N R24W', 'T03N R03E', 'T28S R19E', 'T08S R11E', 'T10N R23W', 'T06S R07E', 'T25S R26E', 'T13S R11E', 'T22S R21E', 'T12N R22W', 'T23S R20E', 'T23S R18E', 'T28S R28E', 'T28S R24E', 'T06S R15E', 'T29S R23E', 'T02N R11E', 'T20S R14E', 'T12N R19W', 'T11N R19W', 'T23S R22E', 'T32S R26E', 'T15S R11E', 'T15S R10E', 'T12S R13E', 'T31S R24E', 'T14S R15E', 'T10S R19E', 'T27S R20E', 'T05N R05E', 'T30S R23E', 'T04N R04E', 'T19S R16E', 'T22S R23E', 'T06N R10E', 'T10S R13E', 'T08S R18E', 'T22S R16E', 'T12N R18W', 'T11S R23E', 'T31S R23E', 'T03S R13E', 'T12S R09E', 'T18S R14E', 'T11S R11E', 'T07N R06E', 'T30S R21E', 'T09S R11E', 'T22S R19E', 'T11N R24W', 'T10N R21W', 'T01S R04E', 'T20S R28E', 'T23S R21E', 'T32S R30E', 'T11N R17W', 'T14S R25E', 'T24S R20E', 'T03N R05E', 'T30S R22E', 'T30S R20E', 'T10N R22W', 'T06S 

## Train-Test Split
The dataset is a time series dataset. To do a train-test split, we use the custom strategy below:
1. To generate training and test time series, we split the data using a group split strategy based on Township-Ranges. 80% of the Township-Range time series are put in the training set and 20% are put in the test set.
2. We then split both the training and test sets by year to separate the X and y values. In both datasets, the X values are the data for the years 2014-2020 and the y values are the 2021 data.
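The two steps above can be sketched as follows. This is an illustrative approximation, not the implementation of `train_test_group_time_split` in `lib.impute`, which is not shown in this notebook. The function name `group_time_split_sketch`, its parameters, and the toy data are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def group_time_split_sketch(frame, group="TOWNSHIP_RANGE", time="YEAR",
                            test_size=0.2, target_year="2021", random_seed=42):
    """Split a (group, time)-indexed frame by group first, then by year."""
    # Step 1: draw a random subset of groups for the test set
    groups = frame.index.unique(level=group).to_numpy()
    rng = np.random.default_rng(random_seed)
    shuffled = rng.permutation(groups)
    n_test = max(1, int(round(len(shuffled) * test_size)))
    test_groups = set(shuffled[:n_test])

    in_test = frame.index.get_level_values(group).isin(test_groups)
    # Step 2: the target year becomes y, all the other years become X
    is_target = frame.index.get_level_values(time) == target_year

    X_train = frame[~in_test & ~is_target]
    y_train = frame[~in_test & is_target]
    X_test = frame[in_test & ~is_target]
    y_test = frame[in_test & is_target]
    return X_train, X_test, y_train, y_test


# Toy data: 10 made-up Township-Ranges, years 2014-2021 stored as strings
townships = [f"T{i:02d}N R01E" for i in range(1, 11)]
years = [str(year) for year in range(2014, 2022)]
index = pd.MultiIndex.from_product([townships, years],
                                   names=["TOWNSHIP_RANGE", "YEAR"])
toy = pd.DataFrame({"SHORTAGE_COUNT": np.arange(len(index), dtype=float)},
                   index=index)

X_train, X_test, y_train, y_test = group_time_split_sketch(toy)
print(len(X_train), len(X_test), len(y_train), len(y_test))
```

Splitting by group keeps every year of a given Township-Range on the same side of the split, so no Township-Range contributes rows to both the training and test sets.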


In [None]:
X_train, X_test, y_train, y_test = train_test_group_time_split(df, index=["TOWNSHIP_RANGE", "YEAR"], group="TOWNSHIP_RANGE", random_seed=RANDOM_SEED)

Let's look at the first two Township-Ranges of the training and test sets.
### The Training Set

In [None]:
X_train.head(14)

TOWNSHIP_RANGE,YEAR,CROP_C,CROP_C6,CROP_D10,CROP_D12,CROP_D13,CROP_D14,CROP_D15,CROP_D16,CROP_D3,CROP_D5,...,GROUNDSURFACEELEVATION_AVG,STATICWATERLEVEL_AVG,TOPOFPERFORATEDINTERVAL_AVG,TOTALDRILLDEPTH_AVG,TOTALCOMPLETEDDEPTH_AVG,WELLYIELD_AVG,WELL_COUNT_AGRICULTURE,WELL_COUNT_DOMESTIC,WELL_COUNT_INDUSTRIAL,WELL_COUNT_PUBLIC
T01N R03E,2014,0.0,0.001229,0.000626,0.000435,0.023414,0.00012,0.000221,0.005375,0.027307,0.017239,...,10.854,44.5,150.0,308.0,278.4,86.666667,2.0,3.0,0.0,0.0
T01N R03E,2015,,,,,,,,,,,...,7.765,30.0,180.0,300.0,210.0,100.0,0.0,2.0,0.0,0.0
T01N R03E,2016,0.0,0.000711,0.000968,0.000422,0.022713,0.0,0.000214,0.003037,0.032867,0.015746,...,11.156667,26.833333,233.5,360.0,275.833333,37.6,2.0,4.0,0.0,0.0
T01N R03E,2017,,,,,,,,,,,...,10.805,37.0,195.0,,215.0,65.0,0.0,2.0,0.0,0.0
T01N R03E,2018,0.0,0.000797,0.000968,0.006218,0.021478,0.0,0.0,0.001684,0.047618,0.010151,...,9.071429,41.4,226.833333,264.2,283.0,11.75,0.0,7.0,0.0,0.0
T01N R03E,2019,,,,,,,,,,,...,9.04625,35.666667,214.75,247.25,247.25,,0.0,7.0,0.0,1.0
T01N R03E,2020,,,,,,,,,,,...,8.11125,36.666667,195.625,234.125,234.125,72.5,0.0,8.0,0.0,0.0
T01N R04E,2014,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,0.0,0.0,0.0,0.0
T01N R04E,2015,,,,,,,,,,,...,,,,,,,0.0,0.0,0.0,0.0
T01N R04E,2016,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-2.32,8.0,58.0,,80.0,,0.0,1.0,0.0,0.0


In [None]:
y_train.head(2)

TOWNSHIP_RANGE,YEAR,CROP_C,CROP_C6,CROP_D10,CROP_D12,CROP_D13,CROP_D14,CROP_D15,CROP_D16,CROP_D3,CROP_D5,...,GROUNDSURFACEELEVATION_AVG,STATICWATERLEVEL_AVG,TOPOFPERFORATEDINTERVAL_AVG,TOTALDRILLDEPTH_AVG,TOTALCOMPLETEDDEPTH_AVG,WELLYIELD_AVG,WELL_COUNT_AGRICULTURE,WELL_COUNT_DOMESTIC,WELL_COUNT_INDUSTRIAL,WELL_COUNT_PUBLIC
T01N R03E,2021,,,,,,,,,,,...,10.88625,25.142857,227.833333,362.75,309.0,337.4,2.0,6.0,0.0,0.0
T01N R04E,2021,,,,,,,,,,,...,,,,,,,0.0,0.0,0.0,0.0


### The Test Set

In [None]:
X_test.head(14)

TOWNSHIP_RANGE,YEAR,CROP_C,CROP_C6,CROP_D10,CROP_D12,CROP_D13,CROP_D14,CROP_D15,CROP_D16,CROP_D3,CROP_D5,...,GROUNDSURFACEELEVATION_AVG,STATICWATERLEVEL_AVG,TOPOFPERFORATEDINTERVAL_AVG,TOTALDRILLDEPTH_AVG,TOTALCOMPLETEDDEPTH_AVG,WELLYIELD_AVG,WELL_COUNT_AGRICULTURE,WELL_COUNT_DOMESTIC,WELL_COUNT_INDUSTRIAL,WELL_COUNT_PUBLIC
T01N R02E,2014,0.0,0.005554,0.000176,0.0,0.008879,0.003796,1.5e-05,0.002058,0.01879,0.010297,...,56.905,25.0,80.0,,337.5,3.0,1.0,1.0,0.0,0.0
T01N R02E,2015,,,,,,,,,,,...,,,,,,,0.0,0.0,0.0,0.0
T01N R02E,2016,0.0,0.004971,0.001729,0.0,0.005217,0.003099,0.000704,0.000792,0.014696,0.008706,...,20.28,44.0,130.0,,150.0,20.0,0.0,1.0,0.0,0.0
T01N R02E,2017,,,,,,,,,,,...,41.086667,31.0,126.666667,210.0,196.666667,39.333333,0.0,3.0,0.0,0.0
T01N R02E,2018,0.0,0.00511,0.001337,0.0,0.002,0.003084,0.00069,0.000612,0.013697,0.005494,...,69.05,75.5,118.0,169.0,169.0,4.0,1.0,1.0,0.0,0.0
T01N R02E,2019,,,,,,,,,,,...,85.32,35.0,60.0,370.0,370.0,,1.0,1.0,0.0,0.0
T01N R02E,2020,,,,,,,,,,,...,40.662,47.2,111.25,182.0,181.0,40.0,1.0,4.0,0.0,0.0
T01N R11E,2014,0.0,0.0,0.0,0.027729,0.047432,0.0,0.0,0.0,0.0,0.0,...,167.573333,29.666667,0.0,,,27.0,0.0,3.0,0.0,0.0
T01N R11E,2015,,,,,,,,,,,...,167.4675,43.0,47.5,,210.0,22.0,1.0,3.0,0.0,0.0
T01N R11E,2016,0.0,0.0,0.0,0.044833,0.051287,0.0,0.0,0.0,0.0,0.0,...,103.7,121.0,205.0,,240.0,,0.0,1.0,0.0,0.0


In [None]:
y_test.head(2)

TOWNSHIP_RANGE,YEAR,CROP_C,CROP_C6,CROP_D10,CROP_D12,CROP_D13,CROP_D14,CROP_D15,CROP_D16,CROP_D3,CROP_D5,...,GROUNDSURFACEELEVATION_AVG,STATICWATERLEVEL_AVG,TOPOFPERFORATEDINTERVAL_AVG,TOTALDRILLDEPTH_AVG,TOTALCOMPLETEDDEPTH_AVG,WELLYIELD_AVG,WELL_COUNT_AGRICULTURE,WELL_COUNT_DOMESTIC,WELL_COUNT_INDUSTRIAL,WELL_COUNT_PUBLIC
T01N R02E,2021,,,,,,,,,,,...,,,,,,,0.0,0.0,0.0,0.0
T01N R11E,2021,,,,,,,,,,,...,,,,,,,0.0,0.0,0.0,0.0


## Data Imputation
### Imputation Strategies
To impute the missing data we will use the following strategies:
1. We assume little year-to-year variation in Crops, Soils and Vegetation. The missing Crops data will thus be imputed from the previous year (e.g. the 2015 data will be set to the 2014 values). For Soils and Vegetation, where we only have data for one year, all missing data will be imputed from the available year.
2. The 2021 population density data will be estimated from the 2020 population density and the 2019-2020 trend.
3. For the missing pre-2018 reservoir water `PCT_OF_CAPACITY` data, since California was affected by severe droughts during those years, we will impute missing data by taking the **minimum** `PCT_OF_CAPACITY` for that Township-Range in the 2018-2020 data.
4. For the well completion report features with missing data we will use 2 distinct strategies:
    * For the `GROUNDSURFACEELEVATION_AVG` feature we will use the median value over all the years for that Township-Range. For Township-Ranges with no data at all for any of the 2014-2020 years, we will use the median value over all Township-Ranges.
    * The other features will be set to 0, since these are well measurements and missing data are mainly due to no wells being drilled in that Township-Range and year.
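The strategies above can be illustrated with plain pandas operations on a toy frame. This is a rough sketch and not the pipeline built by `create_transformation_pipelines`, whose internals are not shown here. The helpers `impute_sketch` and `extrapolate_next`, the toy values, the linear reading of the 2019-2020 "trend", and the choice to compute the overall median from observed values only are all assumptions made for this example.

```python
import numpy as np
import pandas as pd

GROUP = "TOWNSHIP_RANGE"
ELEVATION = "GROUNDSURFACEELEVATION_AVG"


def extrapolate_next(previous: float, last: float) -> float:
    """Strategy 2 (assumed linear): last value plus the last yearly change."""
    return last + (last - previous)


def impute_sketch(frame: pd.DataFrame) -> pd.DataFrame:
    out = frame.sort_index().copy()

    # Strategy 1: carry Crops values forward from the previous available year
    crop_cols = [c for c in out.columns if c.startswith("CROP_")]
    out[crop_cols] = out.groupby(level=GROUP)[crop_cols].ffill()

    # Strategy 3: minimum observed PCT_OF_CAPACITY per Township-Range
    capacity_min = out.groupby(level=GROUP)["PCT_OF_CAPACITY"].transform("min")
    out["PCT_OF_CAPACITY"] = out["PCT_OF_CAPACITY"].fillna(capacity_min)

    # Strategy 4a: per Township-Range median, then the overall median
    # (assumption: overall median is taken over observed values only)
    overall_median = out[ELEVATION].median()
    group_median = out.groupby(level=GROUP)[ELEVATION].transform("median")
    out[ELEVATION] = out[ELEVATION].fillna(group_median).fillna(overall_median)

    # Strategy 4b: the other well measurements are set to 0
    other = [c for c in out.columns if c.endswith("_AVG") and c != ELEVATION]
    out[other] = out[other].fillna(0)
    return out


# Toy data, made up for illustration only
nan = np.nan
index = pd.MultiIndex.from_product(
    [["T01N R03E", "T01N R04E"], ["2016", "2017", "2018", "2019"]],
    names=[GROUP, "YEAR"],
)
toy = pd.DataFrame({
    "CROP_C": [0.1, nan, 0.3, nan, 0.0, nan, 0.2, nan],
    "PCT_OF_CAPACITY": [nan, nan, 0.8, 0.6, nan, nan, 0.5, 0.7],
    ELEVATION: [10.0, nan, 12.0, 14.0, nan, nan, nan, nan],
    "WELLYIELD_AVG": [50.0, nan, 40.0, nan, nan, nan, nan, nan],
}, index=index)

imputed = impute_sketch(toy)
print(imputed)
print(extrapolate_next(previous=100.0, last=104.0))
```

In a real train-test setting, statistics such as the overall median would be learned on the training set only and then applied to the test set, which is what the `fit_transform` / `transform` calls on the pipeline achieve.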

In [None]:
impute_pipeline, columns = create_transformation_pipelines(X_train)
X_train_impute = impute_pipeline.fit_transform(X_train)
X_test_impute = impute_pipeline.transform(X_test)

We display the fitted pipeline as a diagram, then wrap the imputed training and test arrays in DataFrames to visualize the results.

In [None]:
set_config(display="diagram")
display(impute_pipeline)

In [None]:
X_train_impute_df = pd.DataFrame(X_train_impute, index=X_train.index, columns=columns)
X_test_impute_df = pd.DataFrame(X_test_impute, index=X_test.index, columns=columns)

### The Training Set

In [None]:
X_train_impute_df.head(14)

TOWNSHIP_RANGE,YEAR,TOTALDRILLDEPTH_AVG,WELLYIELD_AVG,STATICWATERLEVEL_AVG,TOPOFPERFORATEDINTERVAL_AVG,BOTTOMOFPERFORATEDINTERVAL_AVG,TOTALCOMPLETEDDEPTH_AVG,VEGETATION_BLUE_OAK-GRAY_PINE,VEGETATION_CALIFORNIA_COAST_LIVE_OAK,VEGETATION_CANYON_LIVE_OAK,VEGETATION_HARD_CHAPARRAL,...,PCT_OF_CAPACITY,GROUNDSURFACEELEVATION_AVG,AVERAGE_YEARLY_PRECIPITATION,SHORTAGE_COUNT,GSE_GWE,AREA,WELL_COUNT_AGRICULTURE,WELL_COUNT_DOMESTIC,WELL_COUNT_INDUSTRIAL,WELL_COUNT_PUBLIC
T01N R03E,2014,0.097778,0.018246,0.037145,0.098039,0.111111,0.105856,3.7e-05,0.000137,0.0,0.000386,...,0.717075,0.023626,0.163573,0.0,0.043005,0.861464,0.029412,0.041667,0.0,0.0
T01N R03E,2015,0.095238,0.021053,0.025042,0.117647,0.08046,0.079848,3.7e-05,0.000137,0.0,0.000386,...,0.717075,0.018249,0.2179,0.0,0.050637,0.861464,0.0,0.027778,0.0,0.0
T01N R03E,2016,0.114286,0.007916,0.022398,0.152614,0.103768,0.10488,3.7e-05,0.000137,0.0,0.000386,...,0.717075,0.024153,0.209056,0.0,0.03578,0.861464,0.029412,0.055556,0.0,0.0
T01N R03E,2017,0.0,0.013684,0.030885,0.127451,0.082375,0.081749,3.7e-05,0.000137,0.0,0.000386,...,0.717075,0.023541,0.213645,0.0,0.033202,0.861464,0.0,0.027778,0.0,0.0
T01N R03E,2018,0.083873,0.002474,0.034558,0.148257,0.093934,0.107605,3.7e-05,0.000137,0.0,0.000386,...,0.800728,0.020523,0.181012,0.0,0.030798,0.861464,0.0,0.097222,0.0,0.0
T01N R03E,2019,0.078492,0.0,0.029772,0.140359,0.094732,0.094011,3.7e-05,0.000137,0.0,0.000386,...,0.859558,0.020479,0.367632,0.0,0.031116,0.861464,0.0,0.097222,0.0,0.125
T01N R03E,2020,0.074325,0.015263,0.030607,0.127859,0.08932,0.089021,3.7e-05,0.000137,0.0,0.000386,...,0.717075,0.018852,0.194167,0.0,0.031302,0.861464,0.0,0.111111,0.0,0.0
T01N R04E,2014,0.0,0.0,0.0,0.0,0.0,0.0,0.000236,0.00797,0.0,0.005475,...,0.630613,0.000696,0.163573,0.0,0.004547,0.880044,0.0,0.0,0.0,0.0
T01N R04E,2015,0.0,0.0,0.0,0.0,0.0,0.0,0.000236,0.00797,0.0,0.005475,...,0.630613,0.000696,0.2179,0.0,0.024279,0.880044,0.0,0.0,0.0,0.0
T01N R04E,2016,0.0,0.0,0.006678,0.037908,0.026054,0.030418,0.000236,0.00797,0.0,0.005475,...,0.630613,0.000696,0.209056,0.0,0.018545,0.880044,0.0,0.013889,0.0,0.0


### The Test Set

In [None]:
X_test_impute_df.head(14)

TOWNSHIP_RANGE,YEAR,TOTALDRILLDEPTH_AVG,WELLYIELD_AVG,STATICWATERLEVEL_AVG,TOPOFPERFORATEDINTERVAL_AVG,BOTTOMOFPERFORATEDINTERVAL_AVG,TOTALCOMPLETEDDEPTH_AVG,VEGETATION_BLUE_OAK-GRAY_PINE,VEGETATION_CALIFORNIA_COAST_LIVE_OAK,VEGETATION_CANYON_LIVE_OAK,VEGETATION_HARD_CHAPARRAL,...,PCT_OF_CAPACITY,GROUNDSURFACEELEVATION_AVG,AVERAGE_YEARLY_PRECIPITATION,SHORTAGE_COUNT,GSE_GWE,AREA,WELL_COUNT_AGRICULTURE,WELL_COUNT_DOMESTIC,WELL_COUNT_INDUSTRIAL,WELL_COUNT_PUBLIC
T01N R02E,2014,0.0,0.000632,0.020868,0.052288,0.07567,0.128327,0.010798,0.002749,0.0,0.000633,...,0.776467,0.103779,0.269794,0.0,0.076255,0.70961,0.014706,0.013889,0.0,0.0
T01N R02E,2015,0.0,0.0,0.0,0.0,0.0,0.0,0.010798,0.002749,0.0,0.000633,...,0.776467,0.090013,0.283231,0.0,0.074852,0.70961,0.0,0.0,0.0,0.0
T01N R02E,2016,0.0,0.004211,0.036728,0.084967,0.057471,0.057034,0.010798,0.002749,0.0,0.000633,...,0.776467,0.040032,0.336495,0.0,0.064935,0.70961,0.0,0.013889,0.0,0.0
T01N R02E,2017,0.066667,0.008281,0.025876,0.082789,0.063857,0.074778,0.010798,0.002749,0.0,0.000633,...,0.776467,0.076247,0.647971,0.0,0.063802,0.70961,0.0,0.041667,0.0,0.0
T01N R02E,2018,0.053651,0.000842,0.063022,0.077124,0.052874,0.064259,0.010798,0.002749,0.0,0.000633,...,0.776467,0.124917,0.237508,0.0,0.061015,0.70961,0.014706,0.013889,0.0,0.0
T01N R02E,2019,0.11746,0.0,0.029215,0.039216,0.137931,0.140684,0.010798,0.002749,0.0,0.000633,...,0.934363,0.153236,0.481746,0.0,0.060735,0.70961,0.014706,0.013889,0.0,0.0
T01N R02E,2020,0.057778,0.008421,0.039399,0.072712,0.055556,0.068821,0.010798,0.002749,0.0,0.000633,...,0.803537,0.075507,0.227616,0.0,0.069573,0.70961,0.014706,0.055556,0.0,0.0
T01N R11E,2014,0.0,0.005684,0.024763,0.0,0.093614,0.0,0.158617,0.0,6e-06,0.000828,...,0.626103,0.296399,0.235842,0.0,0.111325,0.714367,0.0,0.041667,0.0,0.0
T01N R11E,2015,0.0,0.004632,0.035893,0.031046,0.088602,0.079848,0.158617,0.0,6e-06,0.000828,...,0.626103,0.296215,0.249575,0.0,0.154312,0.714367,0.014706,0.041667,0.0,0.0
T01N R11E,2016,0.0,0.0,0.101002,0.133987,0.086207,0.091255,0.158617,0.0,6e-06,0.000828,...,0.626103,0.185226,0.548766,0.0,0.151455,0.714367,0.0,0.013889,0.0,0.0


In [None]:
draw_missing_data_chart(X_train_impute_df)
