# Lab 5 - Parcel Feature Extraction

Next, we will illustrate the construction of features related to our main task: finding the relationship between property development and water quality over time.  In a previous lab, you identified lakes for which we have complete information for the years from 2004 to 2015.  In this lab, we will

[Original Data and variable information](https://gisdata.mn.gov/organization/us-mn-state-metrogis?q=Metro+Regional+Parcel+Dataset&sort=score+desc%2C+metadata_modified+desc)

## Problem 1 - Feature construction

**Overview.** Remember that our target output file will have one row per year-lake combination.  To attach property information, we will need to group and aggregate the parcel data to create features for each lake-year combination.  When grouping the data, be sure to maintain the variables needed to join to the water quality data, namely the lake ID and year.  Since we are looking at tracking property development/change over time, we will want to generate features tracking

* Number of properties close to each lake,
* Summaries of the value of properties close to each lake,
* Aggregations on the size and type of the properties, and
* Other features that might impact water quality.
    
### Task 1. Understanding parcel variables

Before we can construct features, we need to make sure we understand the parcel data.  The metro parcel data is provided by the State of Minnesota, and the metadata can be found online.  For example, searching for *metro parcel 2014* led to [this site](https://geo.btaa.org/catalog/304cf3d8-a53b-4ea9-b02a-f550bd68e320).  Clicking the *Meta data* button in the top left brought up more information.  Clicking *Download* opened the metadata [in a separate page](https://resources.gisdata.mn.gov/pub/gdrs/data/pub/us_mn_state_metrogis/plan_regonal_parcels_2014/metadata/metadata.html).

Look through **Section 4: Attributes** and identify variables that might impact the water quality of nearby lakes.

> <font color="orange"> Your thoughts here </font>

Several parcel attributes are likely to influence lake water quality through their relationship with development intensity, impervious surface coverage, and potential runoff contaminants. Based on the metro parcel metadata, the most relevant variables include:

* Finished square footage (FIN_SQ_FT): Larger homes often correlate with larger roofs and driveways, which means more impervious surface and more runoff.
* Lot size / acres (ACRES): Smaller lots can indicate denser development, which tends to increase runoff.
* Property value (EMV_TOTAL, EMV_BLDG, EMV_LAND): Higher-value properties often correlate with larger footprints or more landscaping fertilizer use.
* Dwelling type (DWELL_TYPE): Multi-family versus single-family units affect the density of development.
* Home style (HOME_STYLE): Can correlate with roof shape and size, and therefore with drainage patterns.
* Garage / basement (GARAGE, BASEMENT): Indicators of home expansion, size, and land disturbance.
* Cooling / heating type (COOLING, HEATING): Can indirectly reflect building size and infrastructure.
* Homestead status (HOMESTEAD): Non-homestead properties might be rentals or seasonal homes with different land-use patterns.
* Tax exempt status (TAX_EXEMPT): Schools, churches, and parks behave differently from residential properties.

Overall, these variables relate to density, impervious surface area, building footprint, and land management practices, all of which influence water quality.

### Task 2. Feature Brainstorming

Our objective is to build a feature table with one row per lake-year, using grouped summary statistics. Here are effective strategies for feature construction:

1. **Numerical summaries:** Calculate group-level statistics (mean, median, standard deviation, IQR, etc.) for numeric variables.
2. **Categorical summaries:** For text data, consider:
   - **Success rates:** Compute proportions for binary variables (e.g., percent of homes with basements).
   - **Label cleaning:** Review and standardize unique labels to remove duplicates or inconsistencies.
   - **Broader categories:** Recode variables with many rare categories into a smaller, more meaningful set.
   - **Indicator columns:** Create indicator variables and aggregate them to show presence/absence or proportions (e.g., count of each property use type).

Review the variables you identified earlier and outline a feature construction strategy for each.

> <font color="orange"> Your thoughts here </font>

For each variable chosen, we can define a feature strategy:

1. **Numerical variables.** For each lake-year, compute the mean, median, standard deviation, minimum, maximum, and IQR. Example variables:
   - FIN_SQ_FT (finished square footage)
   - ACRES (parcel size)
   - EMV_TOTAL, EMV_BLDG, EMV_LAND (assessed values)
2. **Binary variables.** Variables such as GARAGE and BASEMENT are "Y" / "N" flags.
   - Convert "Y" → 1 and "N" → 0.
   - Compute the percent of parcels with a garage and the percent with a basement.
3. **Categorical variables.** Choose 1–2 of DWELL_TYPE, HOME_STYLE, HEATING, COOLING, HOMESTEAD, and TAX_EXEMPT.
   - Standardize labels (remove whitespace, collapse rare categories).
   - Create indicator (dummy) columns.
   - Aggregate sums or proportions per lake-year.

### Task 4. Initial querying with filter and select

First, you should build a query that filters the parcel data to include only

1. parcels within 1600 feet of the lakes we are studying, and 
2. lakes with complete information.  

You should also select only the columns you will need for feature construction and joining to the water quality data.

In [13]:
# your query here

Getting all the data 

In [None]:
import polars as pl
from glob import glob
from lake import lakes_w_complete_info


# Load ALL parcel parquet files
parcel_paths = glob("./data/parcel_combined/*.parquet")

parcel_lf = pl.scan_parquet(parcel_paths)

# Quick check
parcel_lf.collect_schema()


Schema([('BLDG_NUM', String),
        ('CITY', String),
        ('COUNTY_ID', String),
        ('EMV_BLDG', String),
        ('EMV_LAND', String),
        ('EMV_TOTAL', String),
        ('HOMESTEAD', String),
        ('NUM_UNITS', String),
        ('OWN_ADD_L1', String),
        ('OWN_ADD_L2', String),
        ('OWN_ADD_L3', String),
        ('PARC_CODE', String),
        ('PIN', String),
        ('SALE_DATE', String),
        ('SALE_VALUE', String),
        ('SCHOOL_DST', String),
        ('Shape_Area', String),
        ('Shape_Leng', String),
        ('TAX_ADD_L1', String),
        ('TAX_ADD_L2', String),
        ('TAX_ADD_L3', String),
        ('TAX_CAPAC', String),
        ('TAX_EXEMPT', String),
        ('TAX_NAME', String),
        ('TOTAL_TAX', String),
        ('WSHD_DIST', String),
        ('YEAR_BUILT', String),
        ('Year', String),
        ('ZIP', String),
        ('centroid_lat', String),
        ('centroid_long', String),
        ('lake_id', String),
        ('Distanc

Filtering the Cross-Reference Table to Parcels Near Study Lakes

The cross-reference dataset links individual parcels to lakes and provides the computed distance between each parcel centroid and the lake boundary. We load this file lazily because it contains hundreds of parquet partitions and is expensive to materialize. The filtering step keeps only parcels associated with lakes that have complete water-quality information (imported earlier from lake.py) and restricts parcels to those within 1600 meters of a lake. This ensures that subsequent feature construction focuses only on parcels close enough to plausibly influence lake water quality.

In [16]:
xref_lf = pl.scan_parquet("./data/xref_parquet/**/*.parquet")

xref_filtered = (
    xref_lf
    .filter(pl.col("lake_id").is_in(lakes_w_complete_info))
    .filter(pl.col("Distance_Parcel_Lake_meters") <= 1600)
)


In [None]:
xref_filtered.collect_schema()  # Quick check


Schema([('lake_id', String),
        ('centroid_lat', String),
        ('centroid_long', String),
        ('Distance_Parcel_Lake_meters', Float64),
        ('distance_category', String)])

Creating Coordinate Keys for Joining Parcel and Cross-Reference Data

Both the parcel dataset and the cross-reference dataset identify parcel locations using latitude and longitude fields. To enable an exact join between the two sources, a composite key is created by concatenating the centroid latitude and centroid longitude into a single string (coord_key). Because both fields are stored as strings, the key matches a parcel record to its lake-distance entry only when the two coordinate strings agree character for character, so it relies on both sources formatting coordinates identically.

In [18]:
parcel_keyed = parcel_lf.with_columns(
    (pl.col("centroid_lat") + "_" + pl.col("centroid_long")).alias("coord_key")
)

xref_keyed = xref_filtered.with_columns(
    (pl.col("centroid_lat") + "_" + pl.col("centroid_long")).alias("coord_key")
)


In [None]:
# parcel_near_lakes.head().collect() # Quick check


In [20]:
# parcel_lf was loaded above; record its column names for the column checks below
parcel_schema = parcel_lf.collect_schema().names()


Selecting Relevant Parcel Variables for Feature Engineering

To prepare the parcel dataset for feature construction, a set of required variables is defined. These include identifiers (lake_id, Year, PIN), property value metrics (EMV_TOTAL, EMV_LAND, EMV_BLDG), parcel size (Shape_Area), and the parcel’s distance from the lake.
Additional categorical fields (such as homestead status, dwelling type, garage, basement, heating/cooling) are also listed as optional variables.

Because schema differences exist across counties and years, not all fields appear in every parquet file. For this reason, the code dynamically filters the combined list and retains only the columns actually present in the loaded parcel schema. This ensures robustness and prevents runtime errors during downstream processing.

In [21]:
# Required columns
base_cols = [
    "lake_id", 
    "Year",
    "PIN",
    "EMV_TOTAL", "EMV_LAND", "EMV_BLDG",
    "Shape_Area",
    "Distance_Parcel_Lake_meters"
]

# Optional categorical variables
optional_cols = [
    "HOMESTEAD", 
    "TAX_EXEMPT",
    "DWELL_TYPE",
    "HOME_STYLE",
    "BASEMENT",
    "GARAGE",
    "COOLING",
    "HEATING",
]

# Keep only those that exist in the parquet schema
keep_cols = [c for c in base_cols + optional_cols if c in parcel_schema]

keep_cols


['lake_id',
 'Year',
 'PIN',
 'EMV_TOTAL',
 'EMV_LAND',
 'EMV_BLDG',
 'Shape_Area',
 'Distance_Parcel_Lake_meters',
 'HOMESTEAD',
 'TAX_EXEMPT']

Filtering Parcel Data to Lakes of Interest and Selecting Final Columns

The parcel dataset is now restricted to only those parcels that belong to lakes with complete water-quality information and are located within 1600 meters of the lake shoreline. This spatial filtering aligns the parcel features with the environmental footprint expected to influence lake conditions.

After filtering, only the previously validated keep_cols are selected to form a clean, consistent subset of the parcel data. A preview of the resulting table is shown to confirm that the filtering and column selection behaved as expected.

In [22]:
parcel_near_lakes = (
    parcel_lf
    .filter(pl.col("lake_id").is_in(lakes_w_complete_info))
    .filter(pl.col("Distance_Parcel_Lake_meters") <= 1600)
    .select(keep_cols)
)

parcel_near_lakes.head().collect()


| lake_id | Year | PIN | EMV_TOTAL | EMV_LAND | EMV_BLDG | Shape_Area | Distance_Parcel_Lake_meters | HOMESTEAD | TAX_EXEMPT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| str | str | str | str | str | str | str | f64 | str | str |
| "02000400-01" | "2004" | "003-143122440093" | "185922.0" | "64860.0" | "111775.0" | "1373.03067631" | 910.384483 | "Y" | "N" |
| "02000400-01" | "2004" | "003-143122440093" | "185922.0" | "64860.0" | "111775.0" | null | 910.384483 | "Y" | "N" |
| "02000400-01" | "2004" | "003-143122440091" | "185198.0" | "64860.0" | "109643.0" | "1697.87995791" | 883.881918 | "Y" | "N" |
| "02000400-01" | "2004" | "003-143122440091" | "185198.0" | "64860.0" | "109643.0" | null | 883.881918 | "Y" | "N" |
| "02000400-01" | "2004" | "003-143122430063" | "196189.0" | "70500.0" | "114418.0" | "1417.13489238" | 654.195209 | "Y" | "N" |


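Casting the Numeric Columns for the Remaining Problems

The value, area, and year columns are stored as strings, so they are cast to numeric types before aggregation. With strict=False, a value that cannot be parsed becomes null instead of raising an error. This ensures that:

* numeric aggregations (mean, median, standard deviation) work correctly,
* Polars does not raise type-mismatch errors during group-by operations, and
* downstream machine learning models receive numeric input.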

In [23]:
parcel_near_lakes_clean = parcel_near_lakes.with_columns([
    pl.col("EMV_TOTAL").cast(pl.Float64, strict=False),
    pl.col("EMV_LAND").cast(pl.Float64, strict=False),
    pl.col("EMV_BLDG").cast(pl.Float64, strict=False),
    pl.col("Shape_Area").cast(pl.Float64, strict=False),
    pl.col("Year").cast(pl.Int32, strict=False)
])


## Problem 2.  Numerical Summaries

Two important categories of property data involve the size (e.g., finished square footage) and value (e.g., assessed value and/or taxes paid).

**Tasks.** 

1. Identify 2-3 variables for each of these categories.
2. Write a query that computes the summary statistics for each of these variables for each lake-year.  
3. Write this summary table out to a CSV file named `parcel_numerical_summaries.csv`.  Again, you should partition by lake ID and year.

In [24]:
# Your code here

Creating Yearly Numerical Parcel Summaries per Lake

This step aggregates key numeric parcel attributes for every lake–year combination. After filtering to parcels within 1600 meters of the study lakes, we compute summary statistics that describe development intensity around each lake.

The query groups parcels by lake_id and Year, then produces:

Property value summaries (mean, median, standard deviation of total, land, and building values)

Parcel size summaries (mean, median, standard deviation of lot area)

A parcel count, capturing how many properties are near a lake in a given year

These numerical features serve as predictors for the water-quality models in later labs, capturing how surrounding property characteristics evolve over time.

In [25]:
parcel_numeric_summary = (
    parcel_near_lakes_clean
    .group_by(["lake_id", "Year"])
    .agg([
        # Value summaries
        pl.col("EMV_TOTAL").mean().alias("mean_value_total"),
        pl.col("EMV_TOTAL").median().alias("median_value_total"),
        pl.col("EMV_TOTAL").std().alias("sd_value_total"),

        pl.col("EMV_LAND").mean().alias("mean_value_land"),
        pl.col("EMV_BLDG").mean().alias("mean_value_building"),

        # Size summaries
        pl.col("Shape_Area").mean().alias("mean_area"),
        pl.col("Shape_Area").median().alias("median_area"),
        pl.col("Shape_Area").std().alias("sd_area"),

        # Count of parcels (pl.len() replaces pl.count(), deprecated in Polars 0.20.5)
        pl.len().alias("num_parcels"),
    ])
)


In [26]:
parcel_numeric_summary.collect().write_csv(
    "./data/parcel_numerical_summaries.csv"
)


## Problem 3.  Simple categorical summaries.

In this part, you will create summary statistics for some of the simpler categorical variables.

**Binary variables.** There are two examples of binary variables, listed below.  You will need to compute the percent of `Yes` for each.

* GARAGE: Garage Y/N
* BASEMENT: Basement Y/N

**Other categorical variables.** There are a number of other categorical variables.  You need to select one of these variables, inspect/clean your variable as needed, create indicator variables for each resulting label, and compute summary statistics for each label.

* HOMESTEAD: Homestead Status
* TAX_EXEMPT: Tax Exempt Status 
* DWELL_TYPE: Dwelling Type 
* HOME_STYLE: Home Style
* HEATING: Heating type
* COOLING: Cooling type

**Tasks.**
Create a query that

1. Selects one binary and two other categorical variables for feature construction,
2. Reads in the parcel data and selects the relevant columns (be sure to keep the lake ID and year),
3. Inspects unique labels and recodes/cleans as needed,
4. Creates a literal column of ones, and
5. Pivots to get the counts of each label per lake-year (do this once per category).

Write this summary table out to a CSV file named `parcel_categorical_summaries.csv`.  Again, you should partition by lake ID and year.

In [27]:
# Your code here

Cleaning and Standardizing Categorical Parcel Variables

Categorical attributes often contain inconsistent formatting (mixed case, extra spaces, missing values). Before constructing categorical features, these variables must be cleaned so that categories are reliably grouped.

This step standardizes two key variables:

* HOMESTEAD: Indicates whether a property is owner-occupied.
* TAX_EXEMPT: Identifies parcels exempt from property taxes.

For each variable, we:

1. Replace nulls with "UNKNOWN" so missing values form their own category.
2. Strip leading and trailing whitespace, removing inconsistencies in the raw data.
3. Convert text to uppercase, ensuring all categories use a common format.

In [28]:
parcel_cat_clean = parcel_near_lakes_clean.with_columns([
    pl.col("HOMESTEAD")
        .fill_null("UNKNOWN")
        .str.strip_chars()
        .str.to_uppercase()
        .alias("HOMESTEAD"),

    pl.col("TAX_EXEMPT")
        .fill_null("UNKNOWN")
        .str.strip_chars()
        .str.to_uppercase()
        .alias("TAX_EXEMPT")
])


In [29]:
parcel_cat_clean.head().collect()


| lake_id | Year | PIN | EMV_TOTAL | EMV_LAND | EMV_BLDG | Shape_Area | Distance_Parcel_Lake_meters | HOMESTEAD | TAX_EXEMPT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| str | i32 | str | f64 | f64 | f64 | f64 | f64 | str | str |
| "02000400-01" | 2004 | "003-143122440093" | 185922.0 | 64860.0 | 111775.0 | 1373.030676 | 910.384483 | "Y" | "N" |
| "02000400-01" | 2004 | "003-143122440093" | 185922.0 | 64860.0 | 111775.0 | null | 910.384483 | "Y" | "N" |
| "02000400-01" | 2004 | "003-143122440091" | 185198.0 | 64860.0 | 109643.0 | 1697.879958 | 883.881918 | "Y" | "N" |
| "02000400-01" | 2004 | "003-143122440091" | 185198.0 | 64860.0 | 109643.0 | null | 883.881918 | "Y" | "N" |
| "02000400-01" | 2004 | "003-143122430063" | 196189.0 | 70500.0 | 114418.0 | 1417.134892 | 654.195209 | "Y" | "N" |


Creating Summary Counts for the HOMESTEAD Categorical Variable

This step converts the cleaned HOMESTEAD labels into usable numerical features by aggregating category counts for each lake-year combination.

Key operations:

1. Create a constant indicator column (count = 1) so each row contributes one unit to its category total.
2. Group by lake_id, Year, and HOMESTEAD to compute how many parcels fall into each homestead category.
3. Switch to eager mode with collect() before pivoting, because pivot is only available on an in-memory DataFrame.
4. Pivot the table so each HOMESTEAD category becomes its own column (e.g., Y, N, UNKNOWN), with counts as values.
5. Replace missing values with zero, ensuring all combinations have defined counts.

This produces a wide-format table summarizing homestead status for each lake-year, which becomes part of the final feature set.

In [30]:
homestead_summary = (
    parcel_cat_clean
    .with_columns(pl.lit(1).alias("count"))
    .group_by(["lake_id", "Year", "HOMESTEAD"])
    .agg(pl.col("count").sum().alias("count"))
    .collect()                                 # IMPORTANT: switch to eager mode
    .pivot(
        on="HOMESTEAD",                        # `on` replaces the deprecated `columns` argument
        index=["lake_id", "Year"],
        values="count"
    )
    .fill_null(0)
)


In [31]:
homestead_summary.head()


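| lake_id | Year | Y | N | P | UNKNOWN | 0 | 5 | 2 | 1 | 3 | 7 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| str | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 |
| "10024900-01" | 2009 | 19 | 68 | 0 | 87 | 0 | 0 | 0 | 0 | 0 | 0 |
| "70009100-01" | 2009 | 421 | 322 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| "19045600-01" | 2010 | 4351 | 677 | 17 | 72 | 0 | 0 | 0 | 0 | 0 | 0 |
| "19002700-01" | 2012 | 1594 | 263 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| "62005400-01" | 2012 | 6440 | 2101 | 37 | 31 | 0 | 0 | 0 | 0 | 0 | 0 |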


Summarizing the TAX_EXEMPT Categorical Variable

This step mirrors the homestead summarization process, producing categorical feature counts for TAX_EXEMPT status across lake-year groups.

What this accomplishes:

1. Creates a constant indicator (count = 1) so each parcel contributes a single count to its category.
2. Groups by lake_id, Year, and TAX_EXEMPT to tally how many parcels fall into each exemption category.
3. Uses collect() to perform the pivot operation in eager mode (pivoting is not supported lazily).
4. Pivots the table so each TAX_EXEMPT category becomes a separate column containing the count of parcels in that category.
5. Replaces null values with zero to avoid missing entries in the feature matrix.

This creates a wide-format summary table reflecting the tax-exempt property distribution for each lake-year, which enhances the categorical feature set used later in modeling.

In [32]:
tax_summary = (
    parcel_cat_clean
    .with_columns(pl.lit(1).alias("count"))
    .group_by(["lake_id", "Year", "TAX_EXEMPT"])
    .agg(pl.col("count").sum().alias("count"))
    .collect()
    .pivot(
        on="TAX_EXEMPT",                       # `on` replaces the deprecated `columns` argument
        index=["lake_id", "Year"],
        values="count"
    )
    .fill_null(0)
)


In [33]:
tax_summary.head()


| lake_id | Year | Y | N | UNKNOWN |
| --- | --- | --- | --- | --- |
| str | i32 | i32 | i32 | i32 |
| "27008600-02" | 2009 | 58 | 1428 | 0 |
| "82048200-01" | 2010 | 48 | 778 | 0 |
| "27062700-01" | 2015 | 177 | 6918 | 0 |
| "10005900-01" | 2011 | 16 | 637 | 0 |
| "27008800-01" | 2009 | 92 | 2052 | 0 |


In [34]:
parcel_categorical_summary = (
    homestead_summary
    .join(tax_summary, on=["lake_id", "Year"], how="inner")
)


In [35]:
parcel_categorical_summary.head()


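| lake_id | Year | Y | N | P | UNKNOWN | 0 | 5 | 2 | 1 | 3 | 7 | Y_right | N_right | UNKNOWN_right |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| str | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 | i32 |
| "27008600-02" | 2009 | 1231 | 255 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 58 | 1428 | 0 |
| "82048200-01" | 2010 | 742 | 84 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 48 | 778 | 0 |
| "27062700-01" | 2015 | 6080 | 1015 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 177 | 6918 | 0 |
| "10005900-01" | 2011 | 335 | 318 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16 | 637 | 0 |
| "27008800-01" | 2009 | 1860 | 284 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 92 | 2052 | 0 |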


In [36]:
parcel_categorical_summary.write_csv(
    "./data/parcel_categorical_summaries.csv"
)


In [None]:
# wq_final could not be imported from the Lab 4 notebook, so its construction is reproduced here.
# from Project_2_Lab_4_Filter_and_aggregate_water_quality_data_V2 import wq_final

import polars as pl

# Path to water quality file (same as Lab 4)
wq_path = "./data/MinneMUDAC_raw_files/mces_lakes_1999_2014.v2.txt"

# Columns we need
wq_cols = [
    "DNR_ID_Site_Number",
    "END_DATE",
    "LAKE_NAME",
    "Secchi_Depth_RESULT",
    "Secchi_Depth_QUALIFIER",
    "Secchi_Depth_Units",
    "Total_Phosphorus_RESULT",
    "Total_Phosphorus_QUALIFIER",
    "Total_Phosphorus_Units",
    "longitude",
    "latitude"
]

# Load entire CSV lazily, THEN select the needed columns
wq_lf = (
    pl.scan_csv(
        wq_path,
        separator="\t",
        infer_schema_length=10000
    )
    .select(wq_cols)   # keep only the columns needed below
)


# Recreate filtered water-quality dataset
wq_filtered = (
    wq_lf
    .filter(
        (pl.col("Secchi_Depth_QUALIFIER") == "Approved") &
        (pl.col("Total_Phosphorus_QUALIFIER") == "Approved")
    )
    .filter(
        pl.col("Secchi_Depth_RESULT").is_not_null() &
        (pl.col("Secchi_Depth_RESULT") > 0) &
        pl.col("Total_Phosphorus_RESULT").is_not_null() &
        (pl.col("Total_Phosphorus_RESULT") > 0)
    )
    .with_columns(
        pl.col("END_DATE").str.slice(0, 4).cast(pl.Int32).alias("Year")
    )
    .filter((pl.col("Year") >= 2004) & (pl.col("Year") <= 2015))
)


# Compute aggregated yearly values
wq_final = (
    wq_filtered
    .group_by(["DNR_ID_Site_Number", "LAKE_NAME", "Year", "latitude", "longitude"])
    .agg([
        pl.col("Secchi_Depth_RESULT").mean().alias("avg_secchi"),
        pl.col("Total_Phosphorus_RESULT").mean().alias("avg_phosphorus")
    ])
)

# Rename to lake_id for joining with parcel data
wq_ready = wq_final.rename({"DNR_ID_Site_Number": "lake_id"})


In [38]:
wq_ready.head().collect()


| lake_id | LAKE_NAME | Year | latitude | longitude | avg_secchi | avg_phosphorus |
| --- | --- | --- | --- | --- | --- | --- |
| str | str | i32 | f64 | f64 | f64 | f64 |
| "82004400-01" | "West Boot Lake" | 2013 | 45.16316 | -92.83822 | 4.0735 | 0.0301 |
| "10021600-01" | "McKnight Lake" | 2013 | 44.837238 | -93.608794 | 0.53 | 0.1388 |
| "82036800-01" | "Klawitter Pond" | 2010 | 45.032726 | -92.90855 | 0.363636 | 0.154909 |
| "82031800-01" | "July Lake" | 2007 | 45.123734 | -92.907611 | 0.206571 | 0.279571 |
| "82001502-01" | "Loon Lake" | 2006 | 45.114143 | -92.8372 | 0.7318 | 0.156 |


## Problem 4.  Join all the summaries.

Finally, you need to join all the summaries created above, along with the water quality summaries created in a previous lab, into one overall summary file.  Write the resulting table to a CSV file named `water_quality_and_parcel_summaries_2004_to_2015.csv`.

In [39]:
# Your code here.

The categorical summary table was created in eager mode (the pivot required collect()), while the numeric summary is still a LazyFrame. The water-quality dataset is processed lazily, and a LazyFrame can only be joined to another LazyFrame, so we call .lazy() on both summaries. For the numeric summary, which is already lazy, the call changes nothing.

Why this matters:

Query optimization: Polars applies optimizations such as projection pushdown and predicate pushdown only to lazy queries.

Deferred execution: Keeping the pipeline lazy until the final write lets Polars plan the joins together before anything is computed.

Consistent types for chaining joins: The next steps join multiple tables; working entirely in LazyFrames keeps the pipeline clean.

This prepares the summary tables for the final integration step.

In [40]:
parcel_numeric_lf = parcel_numeric_summary.lazy()
parcel_categorical_lf = parcel_categorical_summary.lazy()


Joining Water-Quality Data with Parcel Features

This stage integrates all components of the dataset into one unified table for analysis.

Join water-quality summaries with numeric parcel features
We first merge wq_ready with parcel_numeric_lf using lake_id and Year.
A left join ensures all lake-year combinations with valid water-quality data remain in the dataset even if parcel data is missing for some years.

Join categorical parcel summaries
The intermediate result is then joined with parcel_categorical_lf on the same keys, again using a left join.
This step attaches counts of HOMESTEAD/TAX_EXEMPT categories to each lake-year row.

Why this order matters:

Water quality is the primary outcome; parcel features supplement it.

Joining numeric features first keeps the schema simpler before adding wide pivoted categorical features.

Using lazy joins maintains efficiency by allowing Polars to optimize the entire query graph.

The final output is a single, wide dataset with one row per lake-year containing:

Water-quality averages

Numeric parcel summaries

Categorical parcel distributions

Previewing the head confirms the structure before saving.

In [41]:
wq_plus_numeric = (
    wq_ready
    .join(parcel_numeric_lf, on=["lake_id", "Year"], how="left")
)

final_summary = (
    wq_plus_numeric
    .join(parcel_categorical_lf, on=["lake_id", "Year"], how="left")
)

final_summary.head().collect()


lake_id,LAKE_NAME,Year,latitude,longitude,avg_secchi,avg_phosphorus,mean_value_total,median_value_total,sd_value_total,mean_value_land,mean_value_building,mean_area,median_area,sd_area,num_parcels,Y,N,P,UNKNOWN,0,5,2,1,3,7,Y_right,N_right,UNKNOWN_right
str,str,i32,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,u32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32
"""13002300-01""","""Moody Lake""",2006,45.301616,-92.867172,1.078538,0.150077,225709.090909,245100.0,190046.797466,112612.121212,113096.969697,35004.354447,10511.355833,54127.336025,33,26,6,0,1,0,0,0,0,0,0,6,27,0
"""82010700-01""","""Sunfish Lake""",2009,44.99924,-92.891602,2.55,0.027286,332363.053435,210400.0,866316.308428,137390.992366,194972.061069,11103.912604,2320.549376,43541.149741,655,0,0,0,8,364,0,0,283,0,0,98,549,8
"""27071100-01""","""Westwood Lake""",2008,44.970826,-93.387652,1.192857,0.029857,401767.521535,229000.0,2197300.0,159175.13704,242592.384495,15076.70583,1220.617649,33448.17636,5108,4008,1100,0,0,0,0,0,0,0,0,120,4988,0
"""82008700-01""","""Regional Park Lake""",2006,44.805532,-92.902484,2.184333,0.092833,312014.729574,232200.0,328782.663131,165199.539701,146815.189873,20178.398515,1224.269041,67156.312439,869,713,144,0,12,0,0,0,0,0,0,68,801,0
"""19002400-01""","""Wood Lake""",2010,44.741118,-93.26586,2.293846,0.039923,346309.320695,144200.0,1515800.0,74258.886256,272050.434439,2395.41529,764.608769,17531.412596,2532,2061,446,2,23,0,0,0,0,0,0,73,2436,23


Joining Water-Quality Data with Numeric and Categorical Parcel Features

To create the final analysis dataset, we join the water-quality summaries with both numeric and categorical parcel features.

Convert parcel summary tables to LazyFrames so Polars can optimize the join operations across the entire pipeline.

First join: Attach numeric parcel summaries (parcel_numeric_lf) to the water-quality data (wq_ready) using lake_id and Year.

Second join: Attach categorical parcel summaries (parcel_categorical_lf) to the intermediate result using the same keys.

Both joins use left joins to ensure every lake-year with valid water-quality measurements remains in the final dataset even if some parcel features are missing.

A preview (head().collect()) confirms the combined structure before writing it.

This produces a complete, wide table containing water-quality metrics along with numeric and categorical parcel features for each lake-year.

In [42]:
parcel_numeric_lf = parcel_numeric_summary.lazy()
parcel_categorical_lf = parcel_categorical_summary.lazy()

wq_plus_numeric = (
    wq_ready
    .join(parcel_numeric_lf, on=["lake_id", "Year"], how="left")
)

final_summary = (
    wq_plus_numeric
    .join(parcel_categorical_lf, on=["lake_id", "Year"], how="left")
)

final_summary.head().collect()


lake_id,LAKE_NAME,Year,latitude,longitude,avg_secchi,avg_phosphorus,mean_value_total,median_value_total,sd_value_total,mean_value_land,mean_value_building,mean_area,median_area,sd_area,num_parcels,Y,N,P,UNKNOWN,0,5,2,1,3,7,Y_right,N_right,UNKNOWN_right
str,str,i32,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,u32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32
"""19003100-01""","""Orchard Lake""",2006,44.702254,-93.310894,2.5,0.035385,350876.899128,380600.0,232273.507048,118729.016189,232147.882939,7083.062706,1820.466434,46925.430852,1606,1212,346,8,40,0,0,0,0,0,0,133,1365,108
"""10001100-01""","""St. Joe Lake""",2008,44.875646,-93.622969,2.78125,0.019,400434.748011,386850.0,201454.647945,140867.506631,259567.241379,2654.589465,1635.939901,5248.366363,754,336,44,0,374,0,0,0,0,0,0,0,0,754
"""82001100-01""","""Bay Pond""",2008,44.994986,-92.832737,0.490909,0.329545,358904.59588,346300.0,201774.976647,179201.584786,177567.511886,26371.853471,11919.400771,52086.75197,631,0,0,0,0,162,0,0,469,0,0,30,601,0
"""82012000-01""","""Benz Lake""",2008,45.111286,-92.891117,1.524071,0.085286,477151.923077,465650.0,219416.168438,272551.442308,202749.519231,43142.760264,37005.360783,40346.522621,208,0,0,0,0,31,0,0,177,0,0,9,199,0
"""02002200-01""","""Island Lake""",2006,45.366919,-93.094885,1.492308,0.042385,188010.982659,187400.0,232319.776737,75080.732177,112930.250482,26387.825406,9513.057528,43914.607716,519,272,247,0,0,0,0,0,0,0,0,98,421,0


In [43]:
final_df = final_summary.collect()

final_df.write_csv(
    "./data/water_quality_and_parcel_summaries_2004_to_2015.csv"
)


## Problem 5.  Put it all together

It is often useful to package all of the data construction steps together in one convenient place.  Your last task is to

1. Gather all of your data construction code below.
    * You don't need to include exploratory code, e.g., exploring join mismatches; only the code necessary to combine, clean, and write your data.
2. Clean/refactor the code.
3. Be sure to display all important intermediate results.

In [44]:
# Your code here.

#### Preparing the Water-Quality Dataset (Filtering, Cleaning, and Yearly Aggregation)

This step constructs the clean lake-level water-quality table used throughout the project. The raw MCES file contains many measurements across dates, qualifiers, and unit types, so the workflow applies several transformations:

1. **Lazy loading of the raw file.** The water-quality text file is loaded with `scan_csv`, which lets Polars defer computation and optimize later joins and filters.
2. **Cleaning and validating measurements.**
    * Convert the `END_DATE` column into a proper date type.
    * Cast Secchi depth and phosphorus values to numeric.
    * Keep only measurements whose Secchi depth and phosphorus qualifiers are both `Approved`, to ensure reliability.
3. **Extracting and restricting years.** `Year` is extracted from `END_DATE`, and the dataset is restricted to the analysis window (2004–2015), matching the years used for parcel summaries.
4. **Aggregating to yearly lake-level metrics.** Because there may be multiple samples per lake per year, we compute yearly averages for Secchi depth and phosphorus. This produces one row per lake-year, which is required for joining with parcel summaries.
5. **Final formatting.** The lake ID is renamed to `lake_id`, and only the columns needed for downstream feature construction are kept.

This results in `wq_ready`, the standardized water-quality dataset used for all later joins and modeling steps.

In [None]:
# Imports used by the cells in this section
from glob import glob

import polars as pl

# Load and Prepare Water Quality Data
wq_path = "./data/MinneMUDAC_raw_files/mces_lakes_1999_2014.v2.txt"

wq_lf = (
    pl.scan_csv(
        wq_path,
        separator="\t",
        infer_schema_length=20000,
        ignore_errors=True
    )
)

# Filter → Clean → Extract year → Restrict to 2004–2015
wq_clean = (
    wq_lf
    .with_columns([
        pl.col("END_DATE").str.strptime(pl.Date, strict=False),
        pl.col("Secchi_Depth_RESULT").cast(pl.Float64, strict=False),
        pl.col("Total_Phosphorus_RESULT").cast(pl.Float64, strict=False),
    ])
    .filter(
        (pl.col("Secchi_Depth_QUALIFIER") == "Approved") &
        (pl.col("Total_Phosphorus_QUALIFIER") == "Approved")
    )
    .with_columns(
        pl.col("END_DATE").dt.year().alias("Year")
    )
    .filter((pl.col("Year") >= 2004) & (pl.col("Year") <= 2015))
)

# Yearly lake-level averages
wq_summary = (
    wq_clean
    .group_by(["DNR_ID_Site_Number", "LAKE_NAME", "Year", "latitude", "longitude"])
    .agg([
        pl.col("Secchi_Depth_RESULT").mean().alias("avg_secchi"),
        pl.col("Total_Phosphorus_RESULT").mean().alias("avg_phosphorus")
    ])
)

# Final formatted water-quality table
wq_ready = (
    wq_summary
    .rename({"DNR_ID_Site_Number": "lake_id"})
    .select([
        "lake_id", "LAKE_NAME", "Year",
        "latitude", "longitude",
        "avg_secchi", "avg_phosphorus"
    ])
)

#### Preparing Parcel Data for Feature Construction

This section loads, filters, and standardizes the parcel dataset so it can be aggregated and joined with the water-quality table.

1. **Load all parcel parquet files lazily.** All yearly parcel files are read with `scan_parquet`, allowing Polars to optimize operations before execution. The schema is collected to determine which columns are available, since different counties and years may include different fields.
2. **Select required and optional feature columns.**
    * Required columns include identifiers, parcel value fields, parcel size, and distance from the lake.
    * Optional categorical variables (homestead status, tax status, dwelling type, etc.) are included only if they appear in the schema. This prevents errors when some counties’ parcel files lack these columns.
3. **Filter the parcel data to the relevant lake-year combinations.** Two project constraints are applied:
    * Keep only parcels belonging to lakes with complete water-quality histories.
    * Restrict to parcels within 1600 meters of the lake shore, consistent with the project's definition of "near-lake development."
4. **Fix numeric data types for downstream aggregation.** Value fields and parcel size values are cast to `Float64`, and the year is coerced to an integer. This ensures consistent dtypes across files and prevents type-related errors during grouping, summarization, or joins.

The resulting `parcel_near_lakes_clean` is the standardized parcel dataset used to compute numerical and categorical features in the following steps.

In [None]:
parcel_paths = glob("./data/parcel_combined/*.parquet")
parcel_lf = pl.scan_parquet(parcel_paths)

parcel_schema = parcel_lf.collect_schema().names()

# Required columns
base_cols = [
    "lake_id", "Year", "PIN",
    "EMV_TOTAL", "EMV_LAND", "EMV_BLDG",
    "Shape_Area", "Distance_Parcel_Lake_meters",
]

# Optional categorical features (keep only those that exist)
optional_cols = [
    "HOMESTEAD", "TAX_EXEMPT", "DWELL_TYPE",
    "HOME_STYLE", "BASEMENT", "GARAGE",
]

keep_cols = [c for c in base_cols + optional_cols if c in parcel_schema]

# Filter parcels to lakes with complete info and within 1600m
parcel_near_lakes = (
    parcel_lf
    .filter(pl.col("lake_id").is_in(lakes_w_complete_info))
    .filter(pl.col("Distance_Parcel_Lake_meters") <= 1600)
    .select(keep_cols)
)

# Fix numeric dtypes
parcel_near_lakes_clean = parcel_near_lakes.with_columns([
    pl.col("EMV_TOTAL").cast(pl.Float64, strict=False),
    pl.col("EMV_LAND").cast(pl.Float64, strict=False),
    pl.col("EMV_BLDG").cast(pl.Float64, strict=False),
    pl.col("Shape_Area").cast(pl.Float64, strict=False),
    pl.col("Year").cast(pl.Int32, strict=False)
])

#### Numerical Parcel Feature Construction

This section aggregates parcel-level numeric variables into yearly lake-level summaries. These summary features quantify development intensity around each lake and will later serve as predictors for water-quality outcomes.

Key steps:

1. **Group the cleaned parcel data by lake and year.** Each row in the final model must represent a lake-year, so all parcel information within that lake and year is aggregated into a single record.
2. **Compute statistical summaries for property values.**
    * Mean, median, and standard deviation of total estimated market value (`EMV_TOTAL`).
    * Mean values for land (`EMV_LAND`) and building (`EMV_BLDG`) assessments.

    These metrics capture economic development patterns around each lake.
3. **Summarize parcel size (`Shape_Area`).** Average, median, and variability of parcel area reflect changes in density and land use.
4. **Count the number of parcels near the lake.** The parcel count serves as a direct measure of development intensity.
5. **Write the results to CSV for reuse.** Exporting ensures these numeric summaries can be joined with water-quality data and reused across labs without recomputation.

This table forms the numerical component of the feature set used later in machine-learning modeling.

In [None]:
parcel_numeric_summary = (
    parcel_near_lakes_clean
    .group_by(["lake_id", "Year"])
    .agg([
        pl.col("EMV_TOTAL").mean().alias("mean_value_total"),
        pl.col("EMV_TOTAL").median().alias("median_value_total"),
        pl.col("EMV_TOTAL").std().alias("sd_value_total"),

        pl.col("EMV_LAND").mean().alias("mean_value_land"),
        pl.col("EMV_BLDG").mean().alias("mean_value_building"),

        pl.col("Shape_Area").mean().alias("mean_area"),
        pl.col("Shape_Area").median().alias("median_area"),
        pl.col("Shape_Area").std().alias("sd_area"),

        pl.len().alias("num_parcels"),  # pl.count() is deprecated since Polars 0.20.5
    ])
)

parcel_numeric_summary.collect().write_csv(
    "./data/parcel_numerical_summaries.csv"
)




#### Categorical Parcel Feature Construction

This part transforms key categorical parcel attributes into quantitative features that describe how different types of land use vary across lakes and years. These categorical summaries complement the numerical parcel features created earlier.

1. **Cleaning categorical labels.** Categorical fields such as `HOMESTEAD` and `TAX_EXEMPT` often contain inconsistent formatting or missing values. Labels are standardized by:
    * Replacing nulls with `"UNKNOWN"`,
    * Stripping whitespace,
    * Converting to uppercase.

    This ensures consistent grouping and avoids splitting the same category into multiple labels.
2. **Counting category frequencies.** For each lake-year, the code:
    * Creates a helper column of ones,
    * Groups by lake, year, and category value,
    * Sums the helper column to compute how many parcels fall into each category.
3. **Pivoting to wide format.** The grouped counts are pivoted so each category value becomes its own column, named after the value itself (e.g., `Y`, `N`, `UNKNOWN`). This produces a feature matrix capturing the composition of parcel types around a lake.
4. **Combining summaries.** The homestead and tax-exempt summaries are joined on lake-year, forming a single categorical feature table. Because both pivots can produce the same column names, the join appends `_right` to the colliding tax-exempt columns (e.g., `N_right`, `Y_right`). These features quantify important land-use distinctions, such as primary residences vs. rentals or tax-exempt parcels, that may influence water-quality conditions.
5. **Final output.** The completed categorical summary table is written to CSV, and lazy versions are created for merging with the full dataset in the next step.

In [None]:
parcel_cat_clean = parcel_near_lakes_clean.with_columns([
    pl.col("HOMESTEAD").fill_null("UNKNOWN").str.strip_chars().str.to_uppercase(),
    pl.col("TAX_EXEMPT").fill_null("UNKNOWN").str.strip_chars().str.to_uppercase()
])

# Homestead pivot
homestead_summary = (
    parcel_cat_clean
    .with_columns(pl.lit(1).alias("count"))
    .group_by(["lake_id", "Year", "HOMESTEAD"])
    .agg(pl.sum("count"))
    .collect()
    .pivot(
        index=["lake_id", "Year"],
        on="HOMESTEAD",  # `columns` was renamed to `on` in Polars 1.0
        values="count"
    )
    .fill_null(0)
)

# Tax Exempt pivot
tax_summary = (
    parcel_cat_clean
    .with_columns(pl.lit(1).alias("count"))
    .group_by(["lake_id", "Year", "TAX_EXEMPT"])
    .agg(pl.sum("count"))
    .collect()
    .pivot(
        index=["lake_id", "Year"],
        on="TAX_EXEMPT",  # `columns` was renamed to `on` in Polars 1.0
        values="count"
    )
    .fill_null(0)
)

parcel_categorical_summary = (
    homestead_summary
    .join(tax_summary, on=["lake_id", "Year"], how="inner")
)

parcel_categorical_summary.write_csv(
    "./data/parcel_categorical_summaries.csv"
)

parcel_numeric_lf = parcel_numeric_summary.lazy()
parcel_categorical_lf = parcel_categorical_summary.lazy()





In [None]:
# Join Everything Together
wq_plus_numeric = (
    wq_ready
    .join(parcel_numeric_lf, on=["lake_id", "Year"], how="left")
)

final_summary = (
    wq_plus_numeric
    .join(parcel_categorical_lf, on=["lake_id", "Year"], how="left")
)

final_df = final_summary.collect()

final_df.write_csv(
    "./data/water_quality_and_parcel_summaries_2004_to_2015.csv"
)

# Display preview
final_df.head()

lake_id,LAKE_NAME,Year,latitude,longitude,avg_secchi,avg_phosphorus,mean_value_total,median_value_total,sd_value_total,mean_value_land,mean_value_building,mean_area,median_area,sd_area,num_parcels,N,Y,5,UNKNOWN,P,0,7,1,2,3,N_right,UNKNOWN_right,Y_right
str,str,i32,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,u32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32
"""82012000-01""","""Benz Lake""",2013,45.111286,-92.891117,1.22,0.12,360137.623762,355200.0,175012.421934,184822.772277,175314.851485,43588.291427,37912.122998,40556.587223,202,26,176,0,0,0,0,0,0,0,0,0,202,0
"""27004201-01""","""Twin Lake""",2007,45.050776,-93.33619,1.296875,0.09125,234480.469552,193600.0,510011.044283,75734.585473,158474.629494,1874.033669,942.496258,20227.458742,6815,957,5858,0,0,0,0,0,0,0,0,6629,0,186
"""27005800-01""","""Ryan Lake""",2008,45.040164,-93.321326,1.923077,0.038615,193701.958064,180000.0,173905.690163,52872.109158,140829.848905,1091.935015,583.723295,3875.915648,6486,1334,5152,0,0,0,0,0,0,0,0,6159,0,327
"""13005700-01""","""School Lake""",2009,45.303334,-92.912779,1.4,0.0475,273837.857143,268550.0,145372.219085,120114.285714,153723.571429,28278.543991,19298.097959,29989.775805,140,0,0,0,2,0,12,0,126,0,0,132,2,6
"""82002800-01""","""Staples Lake""",2006,45.183941,-92.854792,3.2004,0.0306,308547.65625,340700.0,193387.788769,184771.09375,123776.5625,50069.327119,28535.700269,54405.994362,128,56,70,0,2,0,0,0,0,0,0,84,0,44
