# Lab 5 - Parcel Feature Extraction

Next, we will illustrate the construction of features related to our main task: finding the relationship between property development and water quality over time.  In a previous lab, you identified lakes for which we have complete information for the years from 2004 to 2015.  In this lab, we will

[Original Data and variable information](https://gisdata.mn.gov/organization/us-mn-state-metrogis?q=Metro+Regional+Parcel+Dataset&sort=score+desc%2C+metadata_modified+desc)

## Problem 1 - Feature construction

**Overview.** Remember that our target output file will have one row per year-lake combination. To attach property information, we will need to group and aggregate the parcel data to create features for each lake-year combination. When grouping the data, be sure to keep the variables needed to join to the water quality data, namely the lake ID and year. Since we are tracking property development and change over time, we will want to generate features that capture:

* Number of properties close to each lake,
* Summaries of the value of properties close to each lake,
* Aggregations on the size and type of the properties, and
* Other features that might impact water quality.
    
#### Task 1. Understanding parcel variables

Before we can construct features, we need to make sure we understand the parcel data. The metro parcel data is provided by the State of Minnesota, and the metadata can be found online. For example, searching for *metro parcel 2014* led to [this site](https://geo.btaa.org/catalog/304cf3d8-a53b-4ea9-b02a-f550bd68e320). Clicking the *Meta data* button in the top left brought up more information, and clicking *Download* opened the metadata [in a separate page](https://resources.gisdata.mn.gov/pub/gdrs/data/pub/us_mn_state_metrogis/plan_regonal_parcels_2014/metadata/metadata.html).

Look through **Section 4: Attributes** and identify variables that might impact the water quality of nearby lakes.

> <font color="orange"> Potentially Important Variables:
- USE(1-4)_DESC : property use types
- EMV_LAND, EMV_BLDG, EMV_TOTAL : estimated market value of land, buildings, and total
- ACRES_POLY, ACRES_DEED : acreage of the land
- BASEMENT : basement (Y/N)
- GARAGE, GARAGESQFT : garage (Y/N) and garage square footage; possibly a large concrete surface
- FIN_SQ_FT : finished square footage
- YEAR_BUILT
- NUM_UNITS
- GREEN_ACRE, OPEN_SPACE, AG_PRESERV : undeveloped land / open areas
- TOTAL_TAX, TAX_CAPAC
- SPEC_ASSES : special assessments </font>

In [None]:
import polars as pl
import polars.selectors as cs

In [2]:
(wq := pl.read_parquet('./data/water_quality_by_year.parquet'))

DNR_ID_Site_Number,LAKE_NAME,Year,latitude,longitude,avg_secchi_depth,avg_total_phosphorus
str,str,i64,str,str,f64,f64
"""82007700-01""","""Goggins Lake""",2006,"""45.13303508""","""-92.89283249""",0.956,0.102
"""19002700-01""","""Crystal Lake""",2014,"""44.72296805""","""-93.27036573""",1.96,0.0244
"""82012200-01""","""Pine Tree Lake""",2004,"""45.10231359""","""-92.95386928""",2.2625,0.027375
"""10001900-01""","""Bavaria Lake""",2013,"""44.83812233""","""-93.63778927""",1.2,0.034636
"""27062700-01""","""Northwood Lake""",2010,"""45.02556284""","""-93.39171496""",0.98,0.1369
…,…,…,…,…,…,…
"""27003501-01""","""Sweeney Lake""",2006,"""44.99052075""","""-93.34160616""",1.03,0.0939
"""82009700-01""","""La Lake""",2006,"""44.88725237""","""-92.9713984""",1.475,0.096333
"""82011602-01""","""Armstrong Lake""",2008,"""44.96252306""","""-92.93917709""",1.142857,0.054714
"""19002601-01""","""Marion Lake""",2012,"""44.65825741""","""-93.27557035""",1.964286,0.028571


In [3]:
# Load parcel data using lazy evaluation
(parcels_lazy := pl.scan_delta('./data/parcel.delta'))

In [4]:
parcels_lazy.collect_schema().names()

['ACRES_DEED',
 'ACRES_POLY',
 'AGPRE_ENRD',
 'AGPRE_EXPD',
 'AG_PRESERV',
 'BASEMENT',
 'BLDG_NUM',
 'BLOCK',
 'CITY',
 'CITY_USPS',
 'COOLING',
 'COUNTY_ID',
 'DWELL_TYPE',
 'EMV_BLDG',
 'EMV_LAND',
 'EMV_TOTAL',
 'FIN_SQ_FT',
 'GARAGE',
 'GARAGESQFT',
 'GREEN_ACRE',
 'HEATING',
 'HOMESTEAD',
 'HOME_STYLE',
 'LANDMARK',
 'LOT',
 'MULTI_USES',
 'NUM_UNITS',
 'OPEN_SPACE',
 'OWNER_MORE',
 'OWNER_NAME',
 'OWN_ADD_L1',
 'OWN_ADD_L2',
 'OWN_ADD_L3',
 'PARC_CODE',
 'PIN',
 'PLAT_NAME',
 'PREFIXTYPE',
 'PREFIX_DIR',
 'SALE_DATE',
 'SALE_VALUE',
 'SCHOOL_DST',
 'SPEC_ASSES',
 'STREETNAME',
 'STREETTYPE',
 'SUFFIX_DIR',
 'Shape_Area',
 'Shape_Leng',
 'TAX_ADD_L1',
 'TAX_ADD_L2',
 'TAX_ADD_L3',
 'TAX_CAPAC',
 'TAX_EXEMPT',
 'TAX_NAME',
 'TOTAL_TAX',
 'UNIT_INFO',
 'USE1_DESC',
 'USE2_DESC',
 'USE3_DESC',
 'USE4_DESC',
 'WSHD_DIST',
 'XUSE1_DESC',
 'XUSE2_DESC',
 'XUSE3_DESC',
 'XUSE4_DESC',
 'YEAR_BUILT',
 'ZIP',
 'ZIP4',
 'centroid_lat',
 'centroid_long',
 'Distance_Parcel_Lake_meters',
 'Yea

### Task 2. Feature Brainstorming

Our objective is to build a feature table with one row per lake-year, using grouped summary statistics. Here are effective strategies for feature construction:

1. **Numerical summaries:** Calculate group-level statistics (mean, median, standard deviation, IQR, etc.) for numeric variables.
2. **Categorical summaries:** For text data, consider:
   - **Success rates:** Compute proportions for binary variables (e.g., percent of homes with basements).
   - **Label cleaning:** Review and standardize unique labels to remove duplicates or inconsistencies.
   - **Broader categories:** Recode variables with many rare categories into a smaller, more meaningful set.
   - **Indicator columns:** Create indicator variables and aggregate them to show presence/absence or proportions (e.g., count of each property use type).

Review the variables you identified earlier and outline a feature construction strategy for each.

> <font color="orange"> **Feature Construction Strategy:**

**Numerical Variables - Use mean, median, std:**
- `EMV_LAND`, `EMV_BLDG`, `EMV_TOTAL`: Market value summaries (mean, median, std) - higher values may indicate more development
- `ACRES_POLY`, `ACRES_DEED`: Property size summaries - larger lots may have more runoff area
- `FIN_SQ_FT`: Finished square footage summaries - a rough proxy for building size and impervious surface
- `TOTAL_TAX`, `TAX_CAPAC`: Tax summaries - economic indicators

**Binary/Categorical Variables - Use proportions/counts:**
- `BASEMENT`: Proportion with basements (Y vs N)
- `GARAGE`: Proportion with garages (Y vs N)
- `HOMESTEAD`: Proportion homesteaded (Y vs N)

**Aggregation Strategy:**
- Group by: `Monit_MAP_CODE1` (lake ID), `Year`, `distance_category` (within_500m, between_501_1600m, over_1600m)
- For each group, calculate the features above
- This gives us development characteristics at different distances from each lake over time
</font>

### Problem 2 & 3

In [5]:
# Define variable categories for aggregation
numerical_vars = ['EMV_LAND', 'EMV_BLDG', 'EMV_TOTAL', 'ACRES_POLY', 'ACRES_DEED', 
                  'FIN_SQ_FT', 'TOTAL_TAX', 'TAX_CAPAC']
categorical_vars = ['BASEMENT', 'GARAGE', 'HOMESTEAD']

In [22]:
# Aggregate parcel features by Monit_MAP_CODE1 (lake ID), Year, and distance_category
# Handling nulls by excluding them from aggregations, then filling remaining nulls with 0

(parcel_features_agg := 
    parcels_lazy
    .group_by(['Monit_MAP_CODE1', 'Year', 'distance_category'])
    .agg([
        # Count of parcels
        pl.len().alias('parcel_count'),
        
        # Keep centroid coordinates for joining (should be the same for each lake)
        pl.col('centroid_lat').first().cast(pl.Utf8).alias('centroid_lat'),
        pl.col('centroid_long').first().cast(pl.Utf8).alias('centroid_long'),
        
        # Numerical variables - mean, median, std
        # * unpacks each comprehension so every expression is its own entry in the agg list
        *[pl.col(var).cast(pl.Float64, strict=False).mean().alias(f'{var.lower()}_mean') 
          for var in numerical_vars],
        *[pl.col(var).cast(pl.Float64, strict=False).median().alias(f'{var.lower()}_median') 
          for var in numerical_vars],
        *[pl.col(var).cast(pl.Float64, strict=False).std().alias(f'{var.lower()}_std') 
          for var in numerical_vars],
        
        # Categorical variables - proportions only
        *[((pl.col(var) == 'Y').sum() / pl.len()).alias(f'{var.lower()}_prop') 
          for var in categorical_vars],
    ])
    .filter(pl.col('distance_category') != 'over_1600m')
    .fill_null(0)
    .collect()
)

Monit_MAP_CODE1,Year,distance_category,parcel_count,centroid_lat,centroid_long,emv_land_mean,emv_bldg_mean,emv_total_mean,acres_poly_mean,acres_deed_mean,fin_sq_ft_mean,total_tax_mean,tax_capac_mean,emv_land_median,emv_bldg_median,emv_total_median,acres_poly_median,acres_deed_median,fin_sq_ft_median,total_tax_median,tax_capac_median,emv_land_std,emv_bldg_std,emv_total_std,acres_poly_std,acres_deed_std,fin_sq_ft_std,total_tax_std,tax_capac_std,basement_prop,garage_prop,homestead_prop
str,str,str,u32,str,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""27019101-01""","""2010""","""within_500m""",628,"""45.05562""","""-93.66374""",186875.636943,184299.681529,371175.318471,3.265366,0.0,1403.550955,4666.555732,3864.807325,168000.0,178000.0,361000.0,1.07,0.0,1385.5,4394.0,3620.0,111164.163827,148746.490288,220868.449547,8.060234,0.0,1027.558856,2906.496545,2390.060447,0.160828,0.740446,0.762739
"""82012600-01""","""2004""","""between_501_1600m""",215,"""45.08001""","""-92.91215""",157260.930233,194804.651163,352065.581395,3.473349,7.49614,0.0,2411.702326,0.0,135000.0,173300.0,319900.0,0.01,4.98,0.0,2176.0,0.0,115828.14719,168336.173934,247009.281914,6.813859,10.99898,0.0,2000.273417,0.0,0.0,0.0,1.0
"""70006100-01""","""2005""","""within_500m""",710,"""44.70308""","""-93.43135""",52947.323944,150016.338028,202963.661972,0.878817,0.476592,1234.538028,0.0,0.0,50000.0,149000.0,202900.0,0.33,0.0,1152.0,0.0,0.0,50821.360439,161632.157783,202380.25613,4.354212,4.627303,2321.917183,0.0,0.0,0.0,0.0,0.726761
"""27009800-01""","""2009""","""between_501_1600m""",1526,"""45.06936""","""-93.43876""",118206.946265,326085.452163,444292.398427,1.803447,0.0,164.61599,6960.385976,5505.431193,70000.0,184000.0,263300.0,0.36,0.0,0.0,3007.0,2638.0,521380.569294,1.4741e6,1.8354e6,3.086121,0.0,479.010541,35114.680324,26268.113026,0.0,0.116645,0.859764
"""82016300-01""","""2004""","""between_501_1600m""",3956,"""45.2748""","""-93.00235""",74425.215369,165477.35996,240248.260364,0.806663,0.70526,0.0,2391.807887,73.26542,50000.0,106600.0,158200.0,0.03,0.0,0.0,414.0,0.0,155327.202576,1.0273e6,1.1052e6,4.07602,3.90577,0.0,10384.18086,560.90959,0.0,0.0,0.976239
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""10012100-01""","""2009""","""within_500m""",34,"""44.80609""","""-93.92839""",323544.117647,90679.411765,414223.529412,29.847941,32.827647,685.882353,1322.588235,1690.294118,201300.0,32100.0,233200.0,4.685,5.38,0.0,997.0,1370.0,549296.132635,140110.994484,678552.168138,48.16007,54.190598,770.89349,1312.197835,1546.95957,0.0,0.264706,0.382353
"""70012001-01""","""2010""","""between_501_1600m""",314,"""44.72282""","""-93.54186""",327480.254777,160052.229299,487532.484076,14.44121,14.106752,1080.56051,0.0,0.0,160000.0,136400.0,392000.0,2.77,1.11,1176.0,0.0,0.0,583684.284265,140695.965044,567187.745085,25.45301,27.41116,921.159287,0.0,0.0,0.0,0.0,0.770701
"""10008400-01""","""2014""","""within_500m""",775,"""44.84445""","""-93.79866""",107982.193548,244994.064516,352976.258065,0.729471,0.319665,1943.891613,3890.769032,2585.695484,52900.0,199500.0,258700.0,0.37,0.0,1680.0,3086.0,2036.0,194263.52887,396439.804797,535754.231136,2.463982,1.864986,2523.030445,7035.596588,4750.245978,0.0,0.910968,0.850323
"""82051400-01""","""2008""","""between_501_1600m""",230,"""44.96037""","""-92.81981""",222336.521739,307055.652174,531903.043478,8.76113,8.782304,0.0,524779.565217,0.0,174200.0,155200.0,403150.0,2.67,2.58,0.0,397500.0,0.0,262648.250283,512673.462508,667969.402353,19.286222,14.796331,0.0,669176.501328,0.0,0.0,0.0,0.0


### Problem 4

In [13]:
# Pivot to get one row per lake-year with distance_category as column suffixes
# Get all feature columns (excluding grouping/coordinate columns)
feature_cols = [col for col in parcel_features_agg.columns 
                if col not in ['Monit_MAP_CODE1', 'Year', 'distance_category', 'centroid_lat', 'centroid_long']]

# First, get centroid coordinates (take first value per lake since they vary slightly by distance_category)
centroids = parcel_features_agg.group_by(['Monit_MAP_CODE1', 'Year']).agg([
    pl.col('centroid_lat').first(),
    pl.col('centroid_long').first()
])

# Pivot without centroids in the index
(parcel_features_wide := parcel_features_agg
 .pivot(
     values=feature_cols,
     index=['Monit_MAP_CODE1', 'Year'],
     on='distance_category'
 )
 .join(centroids, on=['Monit_MAP_CODE1', 'Year'], how='left')
)

Monit_MAP_CODE1,Year,parcel_count_between_501_1600m,parcel_count_within_500m,emv_land_mean_between_501_1600m,emv_land_mean_within_500m,emv_bldg_mean_between_501_1600m,emv_bldg_mean_within_500m,emv_total_mean_between_501_1600m,emv_total_mean_within_500m,acres_poly_mean_between_501_1600m,acres_poly_mean_within_500m,acres_deed_mean_between_501_1600m,acres_deed_mean_within_500m,fin_sq_ft_mean_between_501_1600m,fin_sq_ft_mean_within_500m,total_tax_mean_between_501_1600m,total_tax_mean_within_500m,tax_capac_mean_between_501_1600m,tax_capac_mean_within_500m,emv_land_median_between_501_1600m,emv_land_median_within_500m,emv_bldg_median_between_501_1600m,emv_bldg_median_within_500m,emv_total_median_between_501_1600m,emv_total_median_within_500m,acres_poly_median_between_501_1600m,acres_poly_median_within_500m,acres_deed_median_between_501_1600m,acres_deed_median_within_500m,fin_sq_ft_median_between_501_1600m,fin_sq_ft_median_within_500m,total_tax_median_between_501_1600m,total_tax_median_within_500m,tax_capac_median_between_501_1600m,tax_capac_median_within_500m,emv_land_std_between_501_1600m,emv_land_std_within_500m,emv_bldg_std_between_501_1600m,emv_bldg_std_within_500m,emv_total_std_between_501_1600m,emv_total_std_within_500m,acres_poly_std_between_501_1600m,acres_poly_std_within_500m,acres_deed_std_between_501_1600m,acres_deed_std_within_500m,fin_sq_ft_std_between_501_1600m,fin_sq_ft_std_within_500m,total_tax_std_between_501_1600m,total_tax_std_within_500m,tax_capac_std_between_501_1600m,tax_capac_std_within_500m,basement_prop_between_501_1600m,basement_prop_within_500m,garage_prop_between_501_1600m,garage_prop_within_500m,homestead_prop_between_501_1600m,homestead_prop_within_500m,centroid_lat,centroid_long
str,str,u32,u32,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,str
"""10008900-01""","""2008""",78,52,331946.153846,297788.461538,145596.153846,178771.153846,477542.307692,476559.615385,38.212179,22.059038,39.328974,20.327308,2008.884615,1601.576923,2916.589744,3274.673077,3365.782051,3783.326923,225400.0,265500.0,139500.0,154100.0,411150.0,388500.0,10.13,10.155,10.83,10.35,1608.0,1494.0,2485.0,3004.5,3131.0,3655.0,286480.627704,225184.249313,102550.722885,199966.838757,283906.651715,371205.175684,42.276111,28.382817,43.204548,25.641571,3891.821301,1462.008373,1938.226478,3174.669226,1908.508506,3627.618917,0.0,0.0,0.307692,0.307692,0.423077,0.461538,"""44.88105""","""-93.81521"""
"""82006400-01""","""2015""",234,52,164258.119658,169653.846154,91399.145299,97126.923077,255657.264957,266780.769231,7.130427,17.328846,0.0,0.0,838.512821,869.846154,0.0,0.0,2267.555556,2347.346154,134600.0,181800.0,49600.0,10000.0,255000.0,266900.0,1.7,12.76,0.0,0.0,664.0,0.0,0.0,0.0,2281.0,2646.0,142927.016544,119619.023442,109747.814311,115462.519367,198195.766279,208220.695548,10.814664,13.651852,0.0,0.0,881.764244,1023.736282,0.0,0.0,1950.398016,2070.038589,0.581197,0.461538,0.393162,0.423077,0.547009,0.730769,"""45.24157""","""-92.84204"""
"""02004500-01""","""2011""",5571,1363,66295.189374,73711.005136,129967.33082,145593.543654,196262.520194,219304.548789,0.477559,0.588019,0.143493,0.072443,1527.839347,1572.150404,2999.06839,3682.005869,1494.524143,1966.27146,60000.0,69300.0,105100.0,122000.0,160000.0,182100.0,0.25,0.28,0.0,0.0,1046.0,1210.0,2407.0,2812.0,1335.0,1574.0,220249.851501,103355.53104,584489.482686,292102.99951,781638.388401,353271.872372,2.237901,2.268408,1.681123,1.019385,10428.568266,3046.807874,7246.820982,8393.291175,3670.307144,4093.103994,0.823192,0.737344,0.832346,0.661775,0.807934,0.774762,"""45.1524""","""-93.14468"""
"""82001400-01""","""2010""",398,134,180091.457286,244904.477612,181873.869347,256440.298507,361965.326633,501344.776119,7.295025,5.346269,7.312161,5.735224,1620.798995,2150.507463,0.0,0.0,0.0,0.0,178500.0,191100.0,161900.0,222700.0,364100.0,459900.0,4.5,5.0,4.5,5.0,1756.0,2400.0,0.0,0.0,0.0,0.0,126564.13584,148271.88239,174963.467135,218112.232309,228506.115873,307130.215721,12.225293,10.140319,12.235351,11.284882,1224.805365,1414.019763,0.0,0.0,0.0,0.0,0.713568,0.820896,0.663317,0.80597,0.763819,0.791045,"""45.11073""","""-92.79517"""
"""10001900-01""","""2006""",1888,739,115452.224576,170871.718539,208476.536017,177405.006766,323928.760593,348276.725304,1.311933,1.73406,0.812405,0.813518,1820.840572,1584.26793,2623.261123,2949.391069,2281.774894,2675.151556,84600.0,131300.0,211550.0,126200.0,294900.0,246500.0,0.26,0.41,0.0,0.0,1907.0,1666.0,2691.0,1937.0,2371.0,1967.0,271022.949504,333368.596425,273031.188812,201379.677578,423292.218661,408995.364005,9.404847,11.672799,6.920276,4.69102,2170.293097,1301.820075,2337.331676,3039.795714,2050.735598,2735.727724,0.0,0.0,0.705508,0.575101,0.666843,0.594046,"""44.85162""","""-93.63664"""
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""10008900-01""","""2006""",77,50,276987.012987,250850.0,135588.311688,190120.0,412575.324675,440970.0,38.640779,22.8756,39.63026,21.4796,2004.337662,1625.44,2313.168831,2776.56,2659.766234,3133.84,200700.0,204650.0,144700.0,154600.0,358200.0,375400.0,10.13,10.43,10.83,11.35,1620.0,1494.0,2185.0,2486.0,2716.0,3018.5,260293.903901,185589.050357,95354.35846,237797.649359,260755.634322,374865.073005,42.484378,28.738936,43.247791,26.083118,3924.30344,1472.54162,1508.564518,2796.599766,1408.144738,3111.283031,0.0,0.0,0.584416,0.58,0.792208,0.82,"""44.89626""","""-93.8179"""
"""10004100-01""","""2015""",37,408,318254.054054,244701.470588,478324.324324,240530.882353,796578.378378,485232.352941,6.886757,1.816397,0.978378,1.183529,1571.297297,1679.335784,11639.675676,6177.727941,9211.459459,4817.666667,234200.0,138000.0,0.0,208100.0,252200.0,511700.0,1.65,0.42,0.0,0.0,0.0,1828.0,0.0,6507.0,0.0,5195.0,503053.732272,452360.302025,817972.930619,260381.24086,1.1547e6,611470.627735,10.780867,13.448753,4.149306,12.912583,1933.422266,1513.26909,17322.718651,6250.723759,14106.097895,5110.141938,0.432432,0.566176,0.486486,0.605392,0.432432,0.590686,"""44.88996""","""-93.68321"""
"""82000100-05""","""2012""",918,1344,39674.291939,78640.178571,69650.108932,71734.22619,109324.400871,150374.404762,0.622593,0.426384,0.0,0.0,769.305011,782.095238,0.0,0.0,967.030501,1392.072917,46000.0,51500.0,64900.0,47150.0,118600.0,105250.0,0.23,0.23,0.0,0.0,864.0,688.0,0.0,0.0,924.0,870.5,42381.98514,178461.000222,83511.661303,141436.84715,115867.816113,284202.843986,1.700025,0.846677,0.0,0.0,804.312145,1039.710829,0.0,0.0,1091.057638,2847.727071,0.577342,0.55506,0.511983,0.459821,0.729847,0.599702,"""44.92499""","""-92.76549"""
"""13005400-01""","""2005""",,6,,217966.666667,,217666.666667,,435633.333333,,20.025,,40.046667,,0.0,,2730.666667,,0.0,,85800.0,,298900.0,,439900.0,,1.265,,2.54,,0.0,,2344.0,,0.0,,204751.719569,,170401.510165,,43772.99015,,46.578975,,58.120571,,0.0,,939.850768,,0.0,,0.0,,0.0,,1.0,"""45.29343""","""-92.94182"""


In [14]:
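# Left-join parcel features onto water quality; fill_null(0) sits inside the
# walrus so that final_dataset itself (not just the displayed frame) is filled
(final_dataset := wq.join(
    parcel_features_wide.with_columns(pl.col('Year').cast(pl.Int64)),
    left_on=['DNR_ID_Site_Number', 'Year'],
    right_on=['Monit_MAP_CODE1', 'Year'],
    how='left'
).fill_null(0))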

DNR_ID_Site_Number,LAKE_NAME,Year,latitude,longitude,avg_secchi_depth,avg_total_phosphorus,parcel_count_between_501_1600m,parcel_count_within_500m,emv_land_mean_between_501_1600m,emv_land_mean_within_500m,emv_bldg_mean_between_501_1600m,emv_bldg_mean_within_500m,emv_total_mean_between_501_1600m,emv_total_mean_within_500m,acres_poly_mean_between_501_1600m,acres_poly_mean_within_500m,acres_deed_mean_between_501_1600m,acres_deed_mean_within_500m,fin_sq_ft_mean_between_501_1600m,fin_sq_ft_mean_within_500m,total_tax_mean_between_501_1600m,total_tax_mean_within_500m,tax_capac_mean_between_501_1600m,tax_capac_mean_within_500m,emv_land_median_between_501_1600m,emv_land_median_within_500m,emv_bldg_median_between_501_1600m,emv_bldg_median_within_500m,emv_total_median_between_501_1600m,emv_total_median_within_500m,acres_poly_median_between_501_1600m,acres_poly_median_within_500m,acres_deed_median_between_501_1600m,acres_deed_median_within_500m,fin_sq_ft_median_between_501_1600m,fin_sq_ft_median_within_500m,total_tax_median_between_501_1600m,total_tax_median_within_500m,tax_capac_median_between_501_1600m,tax_capac_median_within_500m,emv_land_std_between_501_1600m,emv_land_std_within_500m,emv_bldg_std_between_501_1600m,emv_bldg_std_within_500m,emv_total_std_between_501_1600m,emv_total_std_within_500m,acres_poly_std_between_501_1600m,acres_poly_std_within_500m,acres_deed_std_between_501_1600m,acres_deed_std_within_500m,fin_sq_ft_std_between_501_1600m,fin_sq_ft_std_within_500m,total_tax_std_between_501_1600m,total_tax_std_within_500m,tax_capac_std_between_501_1600m,tax_capac_std_within_500m,basement_prop_between_501_1600m,basement_prop_within_500m,garage_prop_between_501_1600m,garage_prop_within_500m,homestead_prop_between_501_1600m,homestead_prop_within_500m,centroid_lat,centroid_long
str,str,i64,str,str,f64,f64,u32,u32,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,str
"""82007700-01""","""Goggins Lake""",2006,"""45.13303508""","""-92.89283249""",0.956,0.102,137,69,171323.357664,204921.73913,199991.240876,147011.594203,371314.59854,351933.333333,4.568102,5.32913,8.489343,10.593043,1503.124088,1234.115942,2182.788321,2050.202899,0.0,0.0,155400.0,205400.0,210100.0,153500.0,375000.0,368600.0,0.23,0.02,5.01,5.29,1540.0,1314.0,2424.0,2366.0,0.0,0.0,102820.380406,128945.44089,193184.768868,121999.144809,228000.633558,195013.07925,8.09445,9.497451,11.352502,11.679387,1280.52188,1023.118395,1734.102317,1530.214754,0.0,0.0,0.0,0.0,0.0,0.0,0.89781,0.811594,"""45.12616""","""-92.8821"""
"""19002700-01""","""Crystal Lake""",2014,"""44.72296805""","""-93.27036573""",1.96,0.0244,92,1765,84493.478261,110654.787535,189930.434783,159819.320113,274423.913043,270474.107649,0.606739,0.430295,0.0,0.0,2573.119565,2228.563173,4132.423913,3317.18187,2552.086957,2401.149008,61000.0,61200.0,175400.0,159500.0,235900.0,238100.0,0.31,0.32,0.0,0.0,2278.5,2188.0,2792.5,2754.0,2022.0,1982.0,120359.683298,138105.254957,129628.188275,82338.486478,230124.674938,168611.758605,1.731464,0.769999,0.0,0.0,2101.404391,1129.582736,9548.042251,3914.574714,2962.754219,1966.305293,0.0,0.0,0.0,0.0,0.858696,0.850425,"""44.71658""","""-93.26527"""
"""82012200-01""","""Pine Tree Lake""",2004,"""45.10231359""","""-92.95386928""",2.2625,0.027375,424,273,160262.971698,173076.556777,319631.839623,301198.168498,479894.811321,474274.725275,2.91342,2.584652,6.124481,5.146777,0.0,0.0,3905.820755,3770.131868,0.0,0.0,150000.0,185000.0,283000.0,267300.0,447600.0,477300.0,0.205,0.07,2.895,2.63,0.0,0.0,3458.0,3632.0,0.0,0.0,135619.676884,111947.950582,260368.938186,332707.974039,309997.671648,398363.670937,10.417772,7.763072,15.262668,10.391988,0.0,0.0,2891.66139,3475.142043,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,"""45.1099""","""-92.96103"""
"""10001900-01""","""Bavaria Lake""",2013,"""44.83812233""","""-93.63778927""",1.2,0.034636,2547,1042,72883.078131,109553.071017,175628.857479,180958.733205,248511.935611,290511.804223,1.025014,1.198081,0.597275,0.287226,1718.336867,1842.429942,3252.698076,3984.12476,0.0,0.0,68800.0,95000.0,160200.0,118200.0,231700.0,181800.0,0.24,0.375,0.0,0.0,1877.0,1810.0,2922.0,2396.0,0.0,0.0,141142.755588,128616.137746,407088.401661,208778.039194,463765.839322,292861.637464,8.072278,8.997494,5.961486,1.279927,1417.958724,1393.877187,6406.97782,4097.902842,0.0,0.0,0.0,0.0,0.709462,0.747601,0.630153,0.588292,"""44.83054""","""-93.63881"""
"""27062700-01""","""Northwood Lake""",2010,"""45.02556284""","""-93.39171496""",0.98,0.1369,5222,1875,89336.844121,68345.6,194060.283416,158703.573333,283397.127537,227049.173333,0.537135,0.481685,0.0,0.0,1135.151666,1177.661867,5176.414975,3278.376,3374.308311,2333.544533,63000.0,69000.0,149000.0,144000.0,213000.0,213000.0,0.24,0.24,0.0,0.0,1150.0,1155.0,3025.0,3024.0,2137.5,2130.0,204400.048882,112070.986926,409851.697753,328313.70949,587272.905164,439743.841661,2.085727,1.244839,0.0,0.0,493.054281,433.938655,15629.617448,6881.721048,9262.755279,5505.824908,0.144006,0.200533,0.883187,0.949333,0.876867,0.9184,"""45.02571""","""-93.38206"""
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""27003501-01""","""Sweeney Lake""",2006,"""44.99052075""","""-93.34160616""",1.03,0.0939,2440,523,125399.467213,150560.994264,232077.868852,256435.946463,358035.122951,406996.940727,0.0,0.0,0.0,0.0,0.0,0.0,6573.886475,7319.674952,4402.980738,4994.862333,82000.0,85000.0,162500.0,184000.0,249000.0,277000.0,0.0,0.0,0.0,0.0,0.0,0.0,3172.0,3617.0,2435.0,2740.0,286699.497111,226708.255838,710930.272865,423115.63799,966584.895519,628517.068254,0.0,0.0,0.0,0.0,0.0,0.0,30269.880638,19208.833042,17256.728454,11329.321394,0.0,0.0,0.0,0.0,0.816803,0.848948,"""44.9865""","""-93.34648"""
"""82009700-01""","""La Lake""",2006,"""44.88725237""","""-92.9713984""",1.475,0.096333,2781,501,93180.798274,112948.502994,202558.216469,259695.808383,295739.014743,372644.311377,0.635275,0.956806,1.549252,1.685828,1578.735707,1405.073852,2956.481122,2123.548902,0.0,0.0,75000.0,70000.0,205600.0,177800.0,289000.0,284800.0,0.0,0.04,0.25,0.0,1597.0,1424.0,2936.0,1440.0,0.0,0.0,109975.232095,250488.55148,185227.284071,890056.135692,227804.745781,1.1210e6,3.409762,4.281412,6.848429,6.141352,884.938433,802.386615,2498.900249,3014.818921,0.0,0.0,0.002157,0.0,0.0,0.0,0.831356,0.618762,"""44.89271""","""-92.96569"""
"""82011602-01""","""Armstrong Lake""",2008,"""44.96252306""","""-92.93917709""",1.142857,0.054714,2667,282,116364.491939,120707.092199,267969.403825,256650.35461,385777.465317,377357.446809,1.06204,2.714965,0.858759,1.92805,0.0,0.0,382979.077615,372109.219858,0.0,0.0,50000.0,125000.0,152800.0,298400.0,205200.0,423100.0,0.19,0.31,0.09,0.31,0.0,0.0,205200.0,423100.0,0.0,0.0,276997.36808,142561.609156,787195.029759,154744.750513,1.0142e6,220745.106138,5.09203,10.547912,5.013436,7.179134,0.0,0.0,1.0130e6,205289.148706,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"""44.96116""","""-92.93529"""
"""19002601-01""","""Marion Lake""",2012,"""44.65825741""","""-93.27557035""",1.964286,0.028571,4506,821,75910.164225,135218.026797,187343.497559,138013.6419,263253.661784,273231.668697,0.889001,1.159671,0.0,0.0,2480.224368,1962.17296,3744.37217,3744.356882,2756.863959,3165.797808,51200.0,62600.0,165200.0,143100.0,223250.0,257700.0,0.33,0.41,0.0,0.0,2204.0,2228.0,2851.0,3412.0,2184.0,2691.0,165030.111358,143744.998357,321628.721818,113892.979865,433941.337424,189048.439065,4.398532,4.195714,0.0,0.0,3886.438021,1320.45439,9066.263693,2936.880054,7311.657233,2880.199301,0.0,0.0,0.0,0.0,0.823347,0.755177,"""44.6618""","""-93.26701"""


In [9]:
# Write final dataset to CSV
final_dataset.write_csv('./data/final_dataset_with_parcel_features.csv')

### Problem 5 - Put it together

In [21]:
# Define variable categories for aggregation
numerical_vars = ['EMV_LAND', 'EMV_BLDG', 'EMV_TOTAL', 'ACRES_POLY', 'ACRES_DEED', 
                  'FIN_SQ_FT', 'TOTAL_TAX', 'TAX_CAPAC']
categorical_vars = ['BASEMENT', 'GARAGE', 'HOMESTEAD']

# Aggregate parcel features by Monit_MAP_CODE1 (lake ID), Year, and distance_category
# Handling nulls by excluding them from aggregations, then filling remaining nulls with 0
(parcel_features_agg := 
    parcels_lazy
    .group_by(['Monit_MAP_CODE1', 'Year', 'distance_category'])
    .agg([
        # Count of parcels
        pl.len().alias('parcel_count'),
        
        # Keep centroid coordinates for joining (should be the same for each lake)
        pl.col('centroid_lat').first().cast(pl.Utf8).alias('centroid_lat'),
        pl.col('centroid_long').first().cast(pl.Utf8).alias('centroid_long'),
        
        # Numerical variables - mean, median, std
        # * unpacks each comprehension so every expression is its own entry in the agg list
        *[pl.col(var).cast(pl.Float64, strict=False).mean().alias(f'{var.lower()}_mean') 
          for var in numerical_vars],
        *[pl.col(var).cast(pl.Float64, strict=False).median().alias(f'{var.lower()}_median') 
          for var in numerical_vars],
        *[pl.col(var).cast(pl.Float64, strict=False).std().alias(f'{var.lower()}_std') 
          for var in numerical_vars],
        
        # Categorical variables - proportions only
        *[((pl.col(var) == 'Y').sum() / pl.len()).alias(f'{var.lower()}_prop') 
          for var in categorical_vars],
    ])
    .filter(pl.col('distance_category') != 'over_1600m')
    .fill_null(0)
    .collect()
)
# Collected to an eager DataFrame here; the pivot below is done eagerly

Monit_MAP_CODE1,Year,distance_category,parcel_count,centroid_lat,centroid_long,emv_land_mean,emv_bldg_mean,emv_total_mean,acres_poly_mean,acres_deed_mean,fin_sq_ft_mean,total_tax_mean,tax_capac_mean,emv_land_median,emv_bldg_median,emv_total_median,acres_poly_median,acres_deed_median,fin_sq_ft_median,total_tax_median,tax_capac_median,emv_land_std,emv_bldg_std,emv_total_std,acres_poly_std,acres_deed_std,fin_sq_ft_std,total_tax_std,tax_capac_std,basement_prop,garage_prop,homestead_prop
str,str,str,u32,str,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""27005800-01""","""2009""","""between_501_1600m""",5009,"""45.04987""","""-93.30589""",45277.859852,109993.012577,155270.87243,0.274079,0.0,126.580555,2535.434019,1735.155121,39900.0,105600.0,145500.0,0.14,0.0,0.0,1967.0,1460.0,77913.328591,65440.849562,119325.879804,1.257576,0.0,393.368811,8593.298366,4799.688574,0.004392,0.099421,0.789778
"""82000100-05""","""2014""","""between_501_1600m""",914,"""44.91691""","""-92.77126""",43178.993435,76605.689278,119784.682713,0.625952,0.0,766.868709,0.0,1073.881838,55000.0,71900.0,128900.0,0.23,0.0,864.0,0.0,1056.0,47349.748952,88952.863519,125018.517207,1.70475,0.0,806.815394,0.0,1167.691398,0.571116,0.507659,0.689278
"""02000400-01""","""2005""","""within_500m""",684,"""45.17134""","""-93.0572""",85128.070175,109346.19883,194474.269006,3.544971,0.0,1010.865497,0.0,1737.988304,81600.0,129300.0,217200.0,0.655,0.0,1109.0,0.0,2138.0,62214.618163,100727.724415,133537.090263,7.066759,0.0,930.349766,0.0,1423.058688,0.605263,0.0,0.581871
"""27012700-01""","""2015""","""within_500m""",116,"""45.18498""","""-93.50023""",102298.275862,195767.241379,298065.517241,9.229224,0.0,1577.758621,5308.327586,3065.732759,85000.0,260500.0,346400.0,1.265,0.0,1949.5,5660.0,3609.0,78501.7906,159132.163161,175079.266571,16.534975,0.0,1260.80582,4038.444287,1831.871988,0.655172,0.655172,0.698276
"""62002400-01""","""2014""","""within_500m""",1388,"""45.08429""","""-93.03137""",139206.195965,194682.420749,333888.616715,0.659229,0.646506,1347.933718,7312.932277,0.0,65100.0,104550.0,163950.0,0.23,0.23,1350.0,2226.0,0.0,352429.620566,517769.013292,795182.751206,1.574904,1.611071,892.755358,26272.951384,0.0,0.631124,0.81196,0.738473
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""82001502-01""","""2006""","""within_500m""",72,"""45.11161""","""-92.84383""",281872.222222,301938.888889,583811.111111,5.0925,14.531389,1110.388889,2454.777778,0.0,300300.0,177300.0,485150.0,0.57,6.0,1211.5,2669.0,0.0,171562.126243,899221.194264,969888.481401,9.213328,22.726833,876.746491,1672.663948,0.0,0.0,0.0,0.694444
"""10007800-01""","""2013""","""between_501_1600m""",49,"""44.80121""","""-93.89101""",203940.816327,61063.265306,265004.081633,29.682041,30.281429,832.040816,1735.142857,0.0,103900.0,76600.0,207700.0,10.0,10.0,920.0,1748.0,0.0,208958.683036,70431.564579,227116.215625,39.990707,40.615079,911.515015,1178.34637,0.0,0.0,0.408163,0.714286
"""10004200-01""","""2006""","""within_500m""",33,"""44.87804""","""-93.73797""",370393.939394,1.289e6,1.6594e6,35.321818,34.31697,1042.484848,1927.545455,2033.484848,220000.0,97000.0,408300.0,9.68,13.42,0.0,2169.0,2343.0,415447.063369,4.6828e6,5.0382e6,45.973158,43.069527,1247.105427,1833.298088,1927.287344,0.0,0.363636,0.69697
"""82046200-01""","""2006""","""within_500m""",258,"""44.98466""","""-92.872""",88075.193798,189047.286822,277122.48062,0.828566,1.468798,1360.72093,2058.868217,0.0,82400.0,184000.0,272700.0,0.03,0.645,1320.0,2033.0,0.0,64630.080011,122236.239398,162981.647407,2.884851,3.698765,856.093848,1308.84388,0.0,0.0,0.0,0.821705


In [38]:
(feature_cols := [col for col in parcel_features_agg.columns 
                  if col not in ['Monit_MAP_CODE1', 'Year', 'distance_category', 'centroid_lat', 'centroid_long']])
# Pivot to get one row per lake-year with distance_category as column suffixes
(parcel_features_wide := (
    parcel_features_agg
    .pivot(
        values=feature_cols,
        index=['Monit_MAP_CODE1', 'Year'],
        on='distance_category'
    )
))
# Left-join the wide parcel features onto the water quality data
(final_dataset := (
    wq.join(
        parcel_features_wide.with_columns(pl.col('Year').cast(pl.Int64)),
        left_on=['DNR_ID_Site_Number', 'Year'],
        right_on=['Monit_MAP_CODE1', 'Year'],
        how='left'
    )
    # Lake 13005300-01 (Big Comfort Lake) is present in the water quality data but has no parcel data, so drop it
    .filter(pl.col('DNR_ID_Site_Number') != '13005300-01')
))
final_dataset.write_csv('./data/final_dataset_with_parcel_features.csv')