# Lab 5 - Parcel Feature Extraction

Next, we will illustrate the construction of features related to our main task: finding the relationship between property development and water quality over time.  In a previous lab, you identified lakes for which we have complete information for the years 2004 to 2015.  In this lab, we will build parcel-based features for those lakes and join them to the water quality summaries.

[Original Data and variable information](https://gisdata.mn.gov/organization/us-mn-state-metrogis?q=Metro+Regional+Parcel+Dataset&sort=score+desc%2C+metadata_modified+desc)

## Problem 1 - Feature construction

**Overview.** Remember that our target output file will have one row per lake-year combination.  To attach property information, we will need to group and aggregate the parcel data to create features for each lake-year combination.  When grouping the data, be sure to keep the variables needed to join to the water quality data, namely the lake ID and year.  Since we want to track property development and change over time, we will generate features that capture

* Number of properties close to each lake,
* Summaries of the value of properties close to each lake,
* Aggregations on the size and type of the properties, and
* Other features that might impact water quality.
    
### Task 1. Understanding parcel variables

Before we can construct features, we need to make sure we understand the parcel data.  The metro parcel data is provided by the State of Minnesota and the metadata can be found online.  For example, searching for *metro parcel 2014* led to [this site](https://geo.btaa.org/catalog/304cf3d8-a53b-4ea9-b02a-f550bd68e320).  Clicking on the *Meta data* button in the top left brought up more information.  Clicking *Download* opened the metadata [in a separate page](https://resources.gisdata.mn.gov/pub/gdrs/data/pub/us_mn_state_metrogis/plan_regonal_parcels_2014/metadata/metadata.html).

Look through **Section 4: Attributes** and identify variables that might impact the water quality of nearby lakes.

> <font color="orange"> Your thoughts here </font>

Several parcel attributes are likely to influence lake water quality due to their relationship with development intensity, impervious surface coverage, and potential runoff contaminants. Based on the metro parcel metadata, the most relevant variables include:

* **Finished square footage (FIN_SQ_FT):** Larger homes often correlate with larger roofs/driveways → more impervious surfaces → more runoff.
* **Lot size / Acres (ACRES):** Smaller lots can indicate denser development, which tends to increase runoff.
* **Property value (TOTAL_VALUE, BLDG_VALUE, LAND_VALUE):** Higher-value properties often correlate with larger footprints or more landscaping fertilizer use.
* **Dwelling type (DWELL_TYPE):** Multi-family vs. single-family units affect the density of development.
* **Home style (HOME_STYLE):** Can correlate with roof shape/size → drainage patterns.
* **Garage / Basement (GARAGE, BASEMENT):** Indicators of home expansion, size, and land disturbance.
* **Cooling/Heating type:** Can indirectly reflect building size and infrastructure.
* **Homestead status (HOMESTEAD):** Non-homestead properties might be rentals or seasonal homes with different land-use patterns.
* **Tax exempt status (TAX_EXEMPT):** Schools, churches, and parks behave differently from residential properties.

Overall, these variables relate to density, impervious surface area, building footprint, and land management practices, all of which influence water quality.

### Task 2. Feature Brainstorming

Our objective is to build a feature table with one row per lake-year, using grouped summary statistics. Here are effective strategies for feature construction:

1. **Numerical summaries:** Calculate group-level statistics (mean, median, standard deviation, IQR, etc.) for numeric variables.
2. **Categorical summaries:** For text data, consider:
   - **Success rates:** Compute proportions for binary variables (e.g., percent of homes with basements).
   - **Label cleaning:** Review and standardize unique labels to remove duplicates or inconsistencies.
   - **Broader categories:** Recode variables with many rare categories into a smaller, more meaningful set.
   - **Indicator columns:** Create indicator variables and aggregate them to show presence/absence or proportions (e.g., count of each property use type).

Review the variables you identified earlier and outline a feature construction strategy for each.

> <font color="orange"> Your thoughts here </font>

For each variable chosen, we can define a feature strategy:

1. **Numerical variables**
   - Features: mean, median, std, min, max, and IQR for each lake-year.
   - Example variables:
     - FIN_SQ_FT (finished square footage)
     - ACRES (parcel size)
     - TOTAL_VALUE, BLDG_VALUE, LAND_VALUE (assessed values)
2. **Binary variables**
   - Variables such as GARAGE and BASEMENT are "Y" / "N" flags.
   - Feature strategy:
     - Convert "Y" → 1 and "N" → 0.
     - Compute the percent of parcels with a garage and the percent with a basement.
3. **Categorical variables**
   - Choose 1–2 of: DWELL_TYPE, HOME_STYLE, HEATING, COOLING, HOMESTEAD, TAX_EXEMPT.
   - Feature strategy:
     - Standardize labels (remove whitespace, collapse rare categories).
     - Create indicator (dummy) columns.
     - Aggregate sums / proportions per lake-year.

### Task 3. Initial querying with filter and select

First, you should build a query that filters the parcel data to

1. only include parcels within 1600 feet of the lakes we are studying, and
2. only include the lakes with complete information.

You should also select only the columns you will need for feature construction and joining to the water quality data.

In [1]:
# your query here

In [2]:
import polars as pl
from glob import glob

# Load ALL parcel parquet files
parcel_paths = glob("./data/parcel_combined/*.parquet")

parcel_lf = pl.scan_parquet(parcel_paths)

# Quick check
parcel_lf.collect_schema()


Schema([('BLDG_NUM', String),
        ('CITY', String),
        ('COUNTY_ID', String),
        ('EMV_BLDG', String),
        ('EMV_LAND', String),
        ('EMV_TOTAL', String),
        ('HOMESTEAD', String),
        ('NUM_UNITS', String),
        ('OWN_ADD_L1', String),
        ('OWN_ADD_L2', String),
        ('OWN_ADD_L3', String),
        ('PARC_CODE', String),
        ('PIN', String),
        ('SALE_DATE', String),
        ('SALE_VALUE', String),
        ('SCHOOL_DST', String),
        ('Shape_Area', String),
        ('Shape_Leng', String),
        ('TAX_ADD_L1', String),
        ('TAX_ADD_L2', String),
        ('TAX_ADD_L3', String),
        ('TAX_CAPAC', String),
        ('TAX_EXEMPT', String),
        ('TAX_NAME', String),
        ('TOTAL_TAX', String),
        ('WSHD_DIST', String),
        ('YEAR_BUILT', String),
        ('Year', String),
        ('ZIP', String),
        ('centroid_lat', String),
        ('centroid_long', String),
        ('lake_id', String),
        ('Distanc

In [3]:
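xref_lf = pl.scan_parquet("./data/xref_parquet/**/*.parquet")

# NOTE: lakes_w_complete_info (the lake IDs with complete 2004-2015 data from the
# previous lab) must be defined in this session before this cell runs.
xref_filtered = (
    xref_lf
    .filter(pl.col("lake_id").is_in(lakes_w_complete_info))
    .filter(pl.col("Distance_Parcel_Lake_meters") <= 1600)  # threshold is in meters here
)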


NameError: name 'lakes_w_complete_info' is not defined

In [None]:
xref_filtered.collect_schema()


NameError: name 'xref_filtered' is not defined

In [None]:
parcel_keyed = parcel_lf.with_columns(
    (pl.col("centroid_lat") + "_" + pl.col("centroid_long")).alias("coord_key")
)

xref_keyed = xref_filtered.with_columns(
    (pl.col("centroid_lat") + "_" + pl.col("centroid_long")).alias("coord_key")
)


In [None]:
parcel_paths = glob("./data/parcel_combined/*.parquet")
parcel_lf = pl.scan_parquet(parcel_paths)
parcel_schema = parcel_lf.collect_schema().names()


In [None]:
# Required columns
base_cols = [
    "lake_id", 
    "Year",
    "PIN",
    "EMV_TOTAL", "EMV_LAND", "EMV_BLDG",
    "Shape_Area",
    "Distance_Parcel_Lake_meters"
]

# Optional categorical variables
optional_cols = [
    "HOMESTEAD", 
    "TAX_EXEMPT",
    "DWELL_TYPE",
    "HOME_STYLE",
    "BASEMENT",
    "GARAGE",
    "COOLING",
    "HEATING",
]

# Keep only those that exist in the parquet schema
keep_cols = [c for c in base_cols + optional_cols if c in parcel_schema]

keep_cols


['lake_id',
 'Year',
 'PIN',
 'EMV_TOTAL',
 'EMV_LAND',
 'EMV_BLDG',
 'Shape_Area',
 'Distance_Parcel_Lake_meters',
 'HOMESTEAD',
 'TAX_EXEMPT']

In [None]:
parcel_near_lakes = (
    parcel_lf
    .filter(pl.col("lake_id").is_in(lakes_w_complete_info))
    .filter(pl.col("Distance_Parcel_Lake_meters") <= 1600)
    .select(keep_cols)
)

parcel_near_lakes.head().collect()


| lake_id (str) | Year (str) | PIN (str) | EMV_TOTAL (str) | EMV_LAND (str) | EMV_BLDG (str) | Shape_Area (str) | Distance_Parcel_Lake_meters (f64) | HOMESTEAD (str) | TAX_EXEMPT (str) |
|---|---|---|---|---|---|---|---|---|---|
| 02000400-01 | 2004 | 003-233122120056 | 178410.0 | 59220.0 | 110338.0 | null | 934.171939 | Y | N |
| 02000400-01 | 2004 | 003-233122120056 | 178410.0 | 59220.0 | 110338.0 | 1396.55365327 | 934.171939 | Y | N |
| 02000400-01 | 2004 | 003-233122210002 | 150411.0 | 67360.0 | 78077.0 | null | 805.338727 | Y | N |
| 02000400-01 | 2004 | 003-233122210002 | 150411.0 | 67360.0 | 78077.0 | 3740.92390391 | 805.338727 | Y | N |
| 02000400-01 | 2004 | 003-233122120017 | 172523.0 | 61720.0 | 99478.0 | null | 1049.627846 | Y | N |


The value, area, and year columns are stored as strings, so we cast them to numeric types before the next problems.

In [None]:
parcel_near_lakes_clean = parcel_near_lakes.with_columns([
    pl.col("EMV_TOTAL").cast(pl.Float64, strict=False),
    pl.col("EMV_LAND").cast(pl.Float64, strict=False),
    pl.col("EMV_BLDG").cast(pl.Float64, strict=False),
    pl.col("Shape_Area").cast(pl.Float64, strict=False),
    pl.col("Year").cast(pl.Int32, strict=False)
])


## Problem 2.  Numerical Summaries

Two important categories of property data involve the size (e.g., finished square footage) and value (e.g., assessed value and/or taxes paid).

**Tasks.** 

1. Identify 2-3 variables for each of these categories.
2. Write a query that computes the summary statistics for each of these variables for each lake-year.  
3. Write this summary table out to a CSV file named `parcel_numerical_summaries.csv`.  Again, you should partition by lake ID and year.

In [None]:
# Your code here

In [None]:
parcel_numeric_summary = (
    parcel_near_lakes_clean
    .group_by(["lake_id", "Year"])
    .agg([
        # Value summaries
        pl.col("EMV_TOTAL").mean().alias("mean_value_total"),
        pl.col("EMV_TOTAL").median().alias("median_value_total"),
        pl.col("EMV_TOTAL").std().alias("sd_value_total"),

        pl.col("EMV_LAND").mean().alias("mean_value_land"),
        pl.col("EMV_BLDG").mean().alias("mean_value_building"),

        # Size summaries
        pl.col("Shape_Area").mean().alias("mean_area"),
        pl.col("Shape_Area").median().alias("median_area"),
        pl.col("Shape_Area").std().alias("sd_area"),

        # Count of parcels
        pl.len().alias("num_parcels"),  # pl.count() is deprecated since polars 0.20.5
    ])
)


In [None]:
parcel_numeric_summary.collect().write_csv(
    "./data/parcel_numerical_summaries.csv"
)


## Problem 3.  Simple categorical summaries.

In this part, you will create summary statistics for some of the simpler categorical variables.

**Binary variables.** There are two examples of binary variables, listed below.  You will need to compute the percent of `Yes` for each.

* GARAGE: Garage Y/N
* BASEMENT: Basement Y/N

**Other categorical variables.** There are a number of other categorical variables.  You need to select one of these variables, inspect/clean your variable as needed, create indicator variables for each resulting label, and compute summary statistics for each label.

* HOMESTEAD: Homestead Status
* TAX_EXEMPT: Tax Exempt Status 
* DWELL_TYPE: Dwelling Type 
* HOME_STYLE: Home Style
* HEATING: Heating type
* COOLING: Cooling type

**Tasks.**
Create a query that

1. Selects one binary and two other categorical variables for feature construction,
2. Reads in the parcel data and selects the relevant columns (be sure to keep the lake ID and year),
3. Inspects the unique labels and recodes/cleans them as needed,
4. Creates a literal column of ones, and
5. Pivots to get the counts of each label per lake-year (do this once per category).

Write this summary table out to a CSV file named `parcel_categorical_summaries.csv`.  Again, you should partition by lake ID and year.

In [None]:
# Your code here

In [None]:
parcel_cat_clean = parcel_near_lakes_clean.with_columns([
    pl.col("HOMESTEAD")
        .fill_null("UNKNOWN")
        .str.strip_chars()
        .str.to_uppercase()
        .alias("HOMESTEAD"),

    pl.col("TAX_EXEMPT")
        .fill_null("UNKNOWN")
        .str.strip_chars()
        .str.to_uppercase()
        .alias("TAX_EXEMPT")
])


In [None]:
parcel_cat_clean.head().collect()


| lake_id (str) | Year (i32) | PIN (str) | EMV_TOTAL (f64) | EMV_LAND (f64) | EMV_BLDG (f64) | Shape_Area (f64) | Distance_Parcel_Lake_meters (f64) | HOMESTEAD (str) | TAX_EXEMPT (str) |
|---|---|---|---|---|---|---|---|---|---|
| 02000400-01 | 2004 | 003-233122120056 | 178410.0 | 59220.0 | 110338.0 | null | 934.171939 | Y | N |
| 02000400-01 | 2004 | 003-233122120056 | 178410.0 | 59220.0 | 110338.0 | 1396.553653 | 934.171939 | Y | N |
| 02000400-01 | 2004 | 003-233122210002 | 150411.0 | 67360.0 | 78077.0 | null | 805.338727 | Y | N |
| 02000400-01 | 2004 | 003-233122210002 | 150411.0 | 67360.0 | 78077.0 | 3740.923904 | 805.338727 | Y | N |
| 02000400-01 | 2004 | 003-233122120017 | 172523.0 | 61720.0 | 99478.0 | null | 1049.627846 | Y | N |


In [None]:
homestead_summary = (
    parcel_cat_clean
    .with_columns(pl.lit(1).alias("count"))
    .group_by(["lake_id", "Year", "HOMESTEAD"])
    .agg(pl.col("count").sum().alias("count"))
    .collect()                                 # IMPORTANT: switch to eager mode
    .pivot(
        on="HOMESTEAD",                        # `columns=` was renamed to `on=` in polars 1.0
        index=["lake_id", "Year"],
        values="count"
    )
    .fill_null(0)
)


In [None]:
homestead_summary.head()


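| lake_id (str) | Year (i32) | 3 (i32) | Y (i32) | UNKNOWN (i32) | N (i32) | 0 (i32) | P (i32) | 7 (i32) | 1 (i32) | 2 (i32) | 5 (i32) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 82002000-01 | 2007 | 8 | 0 | 91 | 0 | 1238 | 0 | 0 | 3288 | 12 | 45 |
| 27067000-01 | 2005 | 0 | 6543 | 0 | 814 | 0 | 0 | 0 | 0 | 0 | 0 |
| 82036500-01 | 2009 | 0 | 0 | 4 | 0 | 32 | 0 | 0 | 197 | 0 | 0 |
| 27015300-01 | 2014 | 0 | 384 | 0 | 142 | 0 | 0 | 0 | 0 | 0 | 0 |
| 10005900-01 | 2015 | 0 | 431 | 0 | 210 | 0 | 0 | 0 | 0 | 0 | 0 |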


In [None]:
tax_summary = (
    parcel_cat_clean
    .with_columns(pl.lit(1).alias("count"))
    .group_by(["lake_id", "Year", "TAX_EXEMPT"])
    .agg(pl.col("count").sum().alias("count"))
    .collect()                                 # switch to eager mode before pivoting
    .pivot(
        on="TAX_EXEMPT",                       # `columns=` was renamed to `on=` in polars 1.0
        index=["lake_id", "Year"],
        values="count"
    )
    .fill_null(0)
)


In [None]:
tax_summary.head()


| lake_id (str) | Year (i32) | N (i32) | Y (i32) | UNKNOWN (i32) |
|---|---|---|---|---|
| 27015300-01 | 2004 | 474 | 61 | 0 |
| 02013300-01 | 2010 | 365 | 13 | 0 |
| 19045100-01 | 2008 | 3979 | 356 | 4079 |
| 82009400-01 | 2004 | 7134 | 280 | 0 |
| 10001100-01 | 2011 | 723 | 28 | 0 |


In [None]:
parcel_categorical_summary = (
    homestead_summary
    .join(tax_summary, on=["lake_id", "Year"], how="inner")
)


In [None]:
parcel_categorical_summary.head()


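| lake_id (str) | Year (i32) | 3 (i32) | Y (i32) | UNKNOWN (i32) | N (i32) | 0 (i32) | P (i32) | 7 (i32) | 1 (i32) | 2 (i32) | 5 (i32) | N_right (i32) | Y_right (i32) | UNKNOWN_right (i32) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27015300-01 | 2004 | 0 | 390 | 4 | 141 | 0 | 0 | 0 | 0 | 0 | 0 | 474 | 61 | 0 |
| 02013300-01 | 2010 | 0 | 231 | 0 | 147 | 0 | 0 | 0 | 0 | 0 | 0 | 365 | 13 | 0 |
| 19045100-01 | 2008 | 0 | 7270 | 39 | 1097 | 0 | 8 | 0 | 0 | 0 | 0 | 3979 | 356 | 4079 |
| 82009400-01 | 2004 | 0 | 7414 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7134 | 280 | 0 |
| 10001100-01 | 2011 | 0 | 663 | 0 | 88 | 0 | 0 | 0 | 0 | 0 | 0 | 723 | 28 | 0 |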


In [None]:
parcel_categorical_summary.write_csv(
    "./data/parcel_categorical_summaries.csv"
)


In [None]:
# from Project_2_Lab_4_Filter_and_aggregate_water_quality_data_V2 import wq_final

# Importing wq_final from the previous lab did not work, so its code is copied here directly.

import polars as pl

# Path to water quality file (same as Lab 4)
wq_path = "./data/MinneMUDAC_raw_files/mces_lakes_1999_2014.v2.txt"

# Columns we need
wq_cols = [
    "DNR_ID_Site_Number",
    "END_DATE",
    "LAKE_NAME",
    "Secchi_Depth_RESULT",
    "Secchi_Depth_QUALIFIER",
    "Secchi_Depth_Units",
    "Total_Phosphorus_RESULT",
    "Total_Phosphorus_QUALIFIER",
    "Total_Phosphorus_Units",
    "longitude",
    "latitude"
]

# Load entire CSV lazily, THEN select the needed columns
wq_lf = (
    pl.scan_csv(
        wq_path,
        separator="\t",
        infer_schema_length=10000
    )
    .select(wq_cols)   # keep only the columns needed downstream
)


# Recreate filtered water-quality dataset
wq_filtered = (
    wq_lf
    .filter(
        (pl.col("Secchi_Depth_QUALIFIER") == "Approved") &
        (pl.col("Total_Phosphorus_QUALIFIER") == "Approved")
    )
    .filter(
        pl.col("Secchi_Depth_RESULT").is_not_null() &
        (pl.col("Secchi_Depth_RESULT") > 0) &
        pl.col("Total_Phosphorus_RESULT").is_not_null() &
        (pl.col("Total_Phosphorus_RESULT") > 0)
    )
    .with_columns(
        pl.col("END_DATE").str.slice(0, 4).cast(pl.Int32).alias("Year")
    )
    .filter((pl.col("Year") >= 2004) & (pl.col("Year") <= 2015))
)


# Compute aggregated yearly values
wq_final = (
    wq_filtered
    .group_by(["DNR_ID_Site_Number", "LAKE_NAME", "Year", "latitude", "longitude"])
    .agg([
        pl.col("Secchi_Depth_RESULT").mean().alias("avg_secchi"),
        pl.col("Total_Phosphorus_RESULT").mean().alias("avg_phosphorus")
    ])
)

# Rename to lake_id for joining with parcel data
wq_ready = wq_final.rename({"DNR_ID_Site_Number": "lake_id"})


In [None]:
wq_ready.head().collect()


| lake_id (str) | LAKE_NAME (str) | Year (i32) | latitude (f64) | longitude (f64) | avg_secchi (f64) | avg_phosphorus (f64) |
|---|---|---|---|---|---|---|
| 82001900-01 | South Twin Lake | 2007 | 45.078177 | -92.847089 | 1.110429 | 0.074429 |
| 82001900-01 | South Twin Lake | 2009 | 45.078177 | -92.847089 | 1.776429 | 0.061571 |
| 10000600-01 | Lotus Lake | 2004 | 44.86977 | -93.525561 | 1.25 | 0.044 |
| 82010300-01 | Olson Lake | 2010 | 45.018571 | -92.945426 | 2.330769 | 0.026077 |
| 82014800-01 | Plaisted Lake | 2012 | 45.151783 | -92.912533 | 1.617636 | 0.108182 |


## Problem 4.  Join all the summaries.

Finally, you need to join all the summaries created above, along with the water quality summaries created in a previous lab, into one overall summary file.  Write the resulting table to a CSV file named `water_quality_and_parcel_summaries_2004_to_2015.csv`.

In [None]:
# Your code here.

In [None]:
parcel_numeric_lf = parcel_numeric_summary.lazy()
parcel_categorical_lf = parcel_categorical_summary.lazy()


In [None]:
wq_plus_numeric = (
    wq_ready
    .join(parcel_numeric_lf, on=["lake_id", "Year"], how="left")
)

final_summary = (
    wq_plus_numeric
    .join(parcel_categorical_lf, on=["lake_id", "Year"], how="left")
)

final_summary.head().collect()


| lake_id (str) | LAKE_NAME (str) | Year (i32) | latitude (f64) | longitude (f64) | avg_secchi (f64) | avg_phosphorus (f64) | mean_value_total (f64) | median_value_total (f64) | sd_value_total (f64) | mean_value_land (f64) | mean_value_building (f64) | mean_area (f64) | median_area (f64) | sd_area (f64) | num_parcels (u32) | 3 (i32) | Y (i32) | UNKNOWN (i32) | N (i32) | 0 (i32) | P (i32) | 7 (i32) | 1 (i32) | 2 (i32) | 5 (i32) | N_right (i32) | Y_right (i32) | UNKNOWN_right (i32) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 82013401-01 | Lost Lake | 2013 | 45.052059 | -92.969924 | 0.883333 | 0.077 | 215169.694767 | 181400.0 | 264446.456601 | 90517.296512 | 124652.398256 | 1867.683184 | 1105.931443 | 4222.285991 | 2752 | 0 | 2030 | 10 | 712 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2752 |
| 19002100-01 | Alimagnet Lake | 2005 | 44.748126 | -93.248213 | 0.528 | 0.1234 | 277915.563457 | 234950.0 | 617634.25903 | 70622.073304 | 207293.490153 | 2760.246086 | 1186.866288 | 19735.621721 | 3656 | 0 | 3251 | 67 | 336 | 0 | 2 | 0 | 0 | 0 | 0 | 3525 | 68 | 63 |
| 10000500-01 | Courthouse Lake | 2006 | 44.78927 | -93.590028 | 4.7 | 0.018692 | 235191.396226 | 173000.0 | 1029300.0 | 70834.415094 | 164356.981132 | 7123.492723 | 790.86938 | 38011.022013 | 1325 | 0 | 1032 | 0 | 293 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 26 | 1299 |
| 19002100-01 | Alimagnet Lake | 2012 | 44.748126 | -93.248213 | 1.404545 | 0.068909 | 235652.852769 | 193900.0 | 775205.129508 | 64348.149179 | 171304.70359 | 2173.96852 | 1182.346208 | 12822.168073 | 3593 | 0 | 3145 | 0 | 448 | 0 | 0 | 0 | 0 | 0 | 0 | 3519 | 74 | 0 |
| 19009400-01 | Anderson Pond | 2011 | 44.886645 | -93.060364 | 1.24375 | 0.101125 | 282717.307692 | 161200.0 | 959243.70055 | 72038.942308 | 210678.365385 | 2181.308649 | 870.145778 | 8677.866677 | 2704 | 0 | 2163 | 0 | 541 | 0 | 0 | 0 | 0 | 0 | 0 | 2551 | 153 | 0 |


In [None]:
final_df = final_summary.collect()

final_df.write_csv(
    "./data/water_quality_and_parcel_summaries_2004_to_2015.csv"
)


## Problem 5.  Put it all together

It is often useful to package all of the data construction steps together in one convenient place.  Your last task is to

1. Gather all of your data construction code below.
    * You don't need to include exploratory code, e.g., exploring join mismatches; only the code necessary to combine, clean, and write your data.
2. Clean/refactor the code.
3. Be sure to display all important intermediate results.

In [None]:
# Your code here.

In [None]:
# Paths
wq_path = "./data/MinneMUDAC_raw_files/mces_lakes_1999_2014.v2.txt"

wq_cols = [
    "DNR_ID_Site_Number",
    "LAKE_NAME",
    "END_DATE",
    "Secchi_Depth_RESULT",
    "Secchi_Depth_QUALIFIER",
    "Secchi_Depth_Units",
    "Total_Phosphorus_RESULT",
    "Total_Phosphorus_QUALIFIER",
    "Total_Phosphorus_Units",
    "longitude",
    "latitude"
]

# Load lazily
wq_lf = (
    pl.scan_csv(
        wq_path,
        separator="\t",
        infer_schema_length=10000
    )
    .select(wq_cols)
)

# Main filtering + aggregation
wq_filtered = (
    wq_lf
    .filter(
        (pl.col("Secchi_Depth_QUALIFIER") == "Approved") &
        (pl.col("Total_Phosphorus_QUALIFIER") == "Approved") &
        (pl.col("Secchi_Depth_RESULT") > 0) &
        (pl.col("Total_Phosphorus_RESULT") > 0)
    )
    .with_columns(
        pl.col("END_DATE").str.slice(0, 4).cast(pl.Int32).alias("Year")
    )
    .filter((pl.col("Year") >= 2004) & (pl.col("Year") <= 2015))
)

wq_summary = (
    wq_filtered
    .group_by(["DNR_ID_Site_Number", "LAKE_NAME", "Year", "latitude", "longitude"])
    .agg([
        pl.col("Secchi_Depth_RESULT").mean().alias("avg_secchi"),
        pl.col("Total_Phosphorus_RESULT").mean().alias("avg_phosphorus")
    ])
)

# wq_summary is still a LazyFrame; nothing is computed until .collect() is called
wq_ready = wq_summary
wq_ready.head().collect()


| DNR_ID_Site_Number (str) | LAKE_NAME (str) | Year (i32) | latitude (f64) | longitude (f64) | avg_secchi (f64) | avg_phosphorus (f64) |
|---|---|---|---|---|---|---|
| 82010700-01 | Sunfish Lake | 2013 | 44.99924 | -92.891602 | 1.2935 | 0.0383 |
| 82010300-01 | Olson Lake | 2011 | 45.018571 | -92.945426 | 3.757222 | 0.023667 |
| 62007200-01 | Karth Lake | 2011 | 45.074001 | -93.149834 | 2.507692 | 0.035615 |
| 70005400-01 | Spring Lake | 2010 | 44.701811 | -93.468434 | 0.94 | 0.0766 |
| 19002601-01 | Marion Lake | 2013 | 44.658257 | -93.27557 | 1.992308 | 0.023923 |


In [None]:
wq_path = "./data/MinneMUDAC_raw_files/mces_lakes_1999_2014.v2.txt"

wq_lf = pl.scan_csv(
    wq_path,
    separator="\t",
    ignore_errors=True,
    infer_schema_length=20000
)


In [None]:
wq_clean = (
    wq_lf
    .with_columns([
        pl.col("END_DATE").str.strptime(pl.Date, strict=False),
        pl.col("Secchi_Depth_RESULT").cast(pl.Float64, strict=False),
        pl.col("Total_Phosphorus_RESULT").cast(pl.Float64, strict=False),
    ])
    .filter(
        (pl.col("Secchi_Depth_QUALIFIER") == "Approved") &
        (pl.col("Total_Phosphorus_QUALIFIER") == "Approved")
    )
    .with_columns(
        pl.col("END_DATE").dt.year().alias("Year")
    )
    .filter((pl.col("Year") >= 2004) & (pl.col("Year") <= 2015))
)


In [None]:
wq_summary = (
    wq_clean
    .group_by(["DNR_ID_Site_Number", "LAKE_NAME", "Year", "latitude", "longitude"])
    .agg([
        pl.col("Secchi_Depth_RESULT").mean().alias("avg_secchi"),
        pl.col("Total_Phosphorus_RESULT").mean().alias("avg_phosphorus")
    ])
)


In [None]:
wq_ready = (
    wq_summary
    .rename({"DNR_ID_Site_Number": "lake_id"})
    .select([
        "lake_id", "LAKE_NAME", "Year",
        "latitude", "longitude",
        "avg_secchi", "avg_phosphorus"
    ])
)


In [None]:
wq_ready.head().collect()


| lake_id (str) | LAKE_NAME (str) | Year (i32) | latitude (f64) | longitude (f64) | avg_secchi (f64) | avg_phosphorus (f64) |
|---|---|---|---|---|---|---|
| 82008700-01 | Regional Park Lake | 2008 | 44.805532 | -92.902484 | 2.199143 | 0.076143 |
| 82012300-01 | Bass Lake | 2007 | 45.097142 | -92.917847 | 1.974857 | 0.042257 |
| 27104501-01 | Normandale Lake | 2007 | 44.848844 | -93.352527 | 1.333333 | 0.042 |
| 10009300-01 | Oak Lake | 2004 | 44.955434 | -93.794864 | 1.496429 | 0.084143 |
| 10021800-01 | Grace Lake | 2011 | 44.824484 | -93.606231 | 0.925 | 0.146429 |


In [None]:
print(parcel_numeric_summary.collect_schema().names())  # LazyFrame: resolve the schema explicitly
print(parcel_categorical_summary.columns)               # eager DataFrame


ColumnNotFoundError: unable to find column "DWELL_TYPE"; valid columns: ["BLDG_NUM", "CITY", "COUNTY_ID", "EMV_BLDG", "EMV_LAND", "EMV_TOTAL", "HOMESTEAD", "NUM_UNITS", "OWN_ADD_L1", "OWN_ADD_L2", "OWN_ADD_L3", "PARC_CODE", "PIN", "SALE_DATE", "SALE_VALUE", "SCHOOL_DST", "Shape_Area", "Shape_Leng", "TAX_ADD_L1", "TAX_ADD_L2", "TAX_ADD_L3", "TAX_CAPAC", "TAX_EXEMPT", "TAX_NAME", "TOTAL_TAX", "WSHD_DIST", "YEAR_BUILT", "Year", "ZIP", "centroid_lat", "centroid_long", "lake_id", "Distance_Parcel_Lake_meters", "distance_category", "coord_key", "lake_id_right", "centroid_lat_right", "centroid_long_right", "Distance_Parcel_Lake_meters_right"]

Resolved plan until failure:

	---> FAILED HERE RESOLVING 'group_by' <---
INNER JOIN:
LEFT PLAN ON: [col("coord_key")]
   WITH_COLUMNS:
   [[([(col("centroid_lat")) + ("_")]) + (col("centroid_long"))].alias("coord_key")] 
    Parquet SCAN [./data/parcel_combined\parcel_0.parquet, ... 11 other sources]
    PROJECT */34 COLUMNS
RIGHT PLAN ON: [col("coord_key")]
   WITH_COLUMNS:
   [[([(col("centroid_lat")) + ("_")]) + (col("centroid_long"))].alias("coord_key")] 
    SELECT [col("lake_id"), col("centroid_lat"), col("centroid_long"), col("Distance_Parcel_Lake_meters")]
      FILTER [(col("Distance_Parcel_Lake_meters")) <= (1600.0)]
      FROM
        Parquet SCAN [./data/xref_parquet\lake_id=02000300-01\0.parquet, ... 439 other sources]
        PROJECT */5 COLUMNS
END INNER JOIN