# Lab 5 - Parcel Feature Extraction

Next, we will illustrate the construction of features related to our main task: finding the relationship between property development and water quality over time.  In a previous lab, you identified lakes for which we have complete information for the years from 2004 to 2015.  In this lab, we will

[Original Data and variable information](https://gisdata.mn.gov/organization/us-mn-state-metrogis?q=Metro+Regional+Parcel+Dataset&sort=score+desc%2C+metadata_modified+desc)

## Problem 1 - Feature construction

**Overview.** Remember that our target output file will have one row per year-lake combination.  To attach property information, we will need to group and aggregate the parcel data to create features for each lake-year combination.  When grouping the data, be sure to maintain the variables needed to join to the water quality data, namely the lake ID and year.  Since we are looking at tracking property development/change over time, we will want to generate features tracking

* Number of properties close to each lake,
* Summaries of the value of properties close to each lake,
* Aggregations on the size and type of the properties, and
* Other features that might impact water quality.
    
### Task 1. Understanding parcel variables

Before we can construct features, we need to make sure we understand the parcel data.  The metro parcel data is provided by the State of Minnesota, and the metadata can be found online.  For example, searching for *metro parcel 2014* led to [this site](https://geo.btaa.org/catalog/304cf3d8-a53b-4ea9-b02a-f550bd68e320).  Clicking the *Meta data* button in the top left brought up more information, and clicking *Download* opened the metadata [in a separate page](https://resources.gisdata.mn.gov/pub/gdrs/data/pub/us_mn_state_metrogis/plan_regonal_parcels_2014/metadata/metadata.html).

Look through **Section 4: Attributes** and identify variables that might impact the water quality of nearby lakes.

> <font color="orange"> 

From the parcel data, I decided to take the columns below. They were chosen selectively, and I think several of them have a decent chance of having an effect on water quality.

I chose many here, but some may not be used in my final analysis.

Join strategy discussion.

- Monit_MAP_CODE1     - join key, only used for joining / showing which lake is which.
- Year                - join key.
- YEAR_BUILT          - avg, do newer or older homes have different impacts due to construction standards?
- distance_category   - count & group, understand effect of closeness to lake.
- TOTAL_TAX           - avg, do houses taxed at different rates have a different effect on water quality of lake? Used as a stand-in for property value, as it is based off a valuation that is updated periodically unlike sale value.
- HOMESTEAD           - percent yes, do residential homesteads affect water quality?
- GARAGE              - percent yes, does presence of a garage relate to impervious surface and runoff?
- FIN_SQ_FT           - avg, do larger homes have a greater impact on water quality?
- ACRES_DEED          - avg, do larger parcels contribute more or less to runoff and nutrient loading?
- ACRES_POLY          - avg, similar to ACRES_DEED, spatially derived parcel size.
- GREEN_ACRE          - avg, does agricultural preservation status relate to land management and water quality?
</font>

### Task 2. Feature Brainstorming

Our objective is to build a feature table with one row per lake-year, using grouped summary statistics. Here are effective strategies for feature construction:

1. **Numerical summaries:** Calculate group-level statistics (mean, median, standard deviation, IQR, etc.) for numeric variables.
2. **Categorical summaries:** For text data, consider:
   - **Success rates:** Compute proportions for binary variables (e.g., percent of homes with basements).
   - **Label cleaning:** Review and standardize unique labels to remove duplicates or inconsistencies.
   - **Broader categories:** Recode variables with many rare categories into a smaller, more meaningful set.
   - **Indicator columns:** Create indicator variables and aggregate them to show presence/absence or proportions (e.g., count of each property use type).

Review the variables you identified earlier and outline a feature construction strategy for each.

> <font color="orange"> The parcel files contain a lot of information, with many rows per lake. To get the table down to one row per lake-year, we must aggregate the features for each lake, as suggested.

Currently, our data is the xref file joined with the tax parcel data, which has many rows per lake per year, since each row represents one property near a given lake.

Our lake quality data is already in the format we want: one row per lake-year.

My plan is to first create summary statistics for the xref + parcel data at the level of each Monit_MAP_CODE1 / Year combination, and then join the result to the lake quality information. </font>

### Task 4. Initial querying with filter and select

First, you should build a query that filters the parcel data to include only
1. parcels within 1600 meters of the lakes we are studying, and 
2. the lakes with complete information.  

You should also select only the columns you will need for feature construction and joining to the water quality data.

In [2]:
import polars as pl
import polars.selectors as cs
from columns import complete_lakes  # local module; used below to keep only lakes with complete information

# Set the path to the combined parcel data.
parcel_path = 'data/parcel_combined.parquet'

## Note: this final project is about the management of large data sets, so the metrics created below are fairly basic.
## This notebook takes the parcel data and puts it in its final form before joining it to the water quality data that we managed in Lab 4.
## The final format for the tax parcel data is one row per lake-year combination; we also set up some metrics for the random forest used in the next lab.


In [3]:
# These are the only columns needed from the tax parcel data; they are used in the select step of the query below.

selected_parcel_columns = [
    'Monit_MAP_CODE1',
    'Year',
    'YEAR_BUILT',
    'distance_category',
    'TOTAL_TAX',
    'SALE_VALUE',
    'HOMESTEAD',
    'GARAGE',
    'FIN_SQ_FT',
    'ACRES_DEED',
    'ACRES_POLY',
    'GREEN_ACRE'
 ]

In [4]:
# Build one lazy query per year (2004 to 2015); these handle the reading and initial filtering of the parcel data throughout this file.

(parcel_queries := [
    pl.scan_parquet(parcel_path)
    .select(selected_parcel_columns)
    .filter(pl.col('Year') == str(year))
    .filter(
        (pl.col('distance_category') != 'over_1600_meters') 
        & (pl.col('Monit_MAP_CODE1').is_in(complete_lakes))
    )
    .fill_null('0')
    for year in range(2004, 2016)
 ])

[<LazyFrame at 0x7F35882DF380>,
 <LazyFrame at 0x7F35882DF980>,
 <LazyFrame at 0x7F3588171A60>,
 <LazyFrame at 0x7F35883A94F0>,
 <LazyFrame at 0x7F35897ACE60>,
 <LazyFrame at 0x7F358813F170>,
 <LazyFrame at 0x7F358819C500>,
 <LazyFrame at 0x7F358819C3E0>,
 <LazyFrame at 0x7F358819C050>,
 <LazyFrame at 0x7F358819C6B0>,
 <LazyFrame at 0x7F358819C650>,
 <LazyFrame at 0x7F358819C5C0>]

In [5]:
#SAMPLE OUTPUT OF THE PARCEL QUERIES

[q.limit(2).collect() for q in parcel_queries]

[shape: (2, 12)
 ┌────────────┬──────┬────────────┬────────────┬───┬───────────┬────────────┬───────────┬───────────┐
 │ Monit_MAP_ ┆ Year ┆ YEAR_BUILT ┆ distance_c ┆ … ┆ FIN_SQ_FT ┆ ACRES_DEED ┆ ACRES_POL ┆ GREEN_ACR │
 │ CODE1      ┆ ---  ┆ ---        ┆ ategory    ┆   ┆ ---       ┆ ---        ┆ Y         ┆ E         │
 │ ---        ┆ str  ┆ str        ┆ ---        ┆   ┆ str       ┆ str        ┆ ---       ┆ ---       │
 │ str        ┆      ┆            ┆ str        ┆   ┆           ┆            ┆ str       ┆ str       │
 ╞════════════╪══════╪════════════╪════════════╪═══╪═══════════╪════════════╪═══════════╪═══════════╡
 │ 02000500-0 ┆ 2004 ┆ 1993.0     ┆ between_50 ┆ … ┆ 0.0       ┆ 0.0        ┆ 0.23      ┆ N         │
 │ 1          ┆      ┆            ┆ 1_1600m    ┆   ┆           ┆            ┆           ┆           │
 │ 02000500-0 ┆ 2004 ┆ 1993.0     ┆ between_50 ┆ … ┆ 0.0       ┆ 0.0        ┆ 0.23      ┆ N         │
 │ 1          ┆      ┆            ┆ 1_1600m    ┆   ┆           ┆  

## Problem 2.  Numerical Summaries

Two important categories of property data involve the size (e.g., finished square footage) and value (e.g., assessed value and/or taxes paid).

**Tasks.** 

1. Identify 2-3 variables for each of these categories.
2. Write a query that computes the summary statistics for each of these variables for each lake-year.  
3. Write this summary table out to a CSV file named `parcel_numerical_summaries.csv`.  Again, you should partition by lake ID and year.

In [6]:
# Pull only the columns needed for the numerical summaries
num_summary_columns = [
    'Monit_MAP_CODE1',
    'Year',
    'TOTAL_TAX',
    'SALE_VALUE',
    'FIN_SQ_FT',
    'ACRES_DEED',
 ]

In [7]:
# Build the numerical summaries as lazy queries; these are joined with the other metrics later.
# Cast the value columns to Float64, then average each one per lake-year.

(parcel_summary := [
    year
    .select(num_summary_columns)
    .with_columns([
        cs.exclude(['Monit_MAP_CODE1','Year']).cast(pl.Float64)   
    ])
    .group_by([   
         'Monit_MAP_CODE1',
         'Year'
    ])
    .agg([
        cs.exclude(['Monit_MAP_CODE1','Year']).mean().name.prefix('mean_'),  # mean of every column except the keys, with a mean_ prefix on each name
    ])
    for year in parcel_queries
])

[<LazyFrame at 0x7F358819CD70>,
 <LazyFrame at 0x7F358819C2F0>,
 <LazyFrame at 0x7F358819CFE0>,
 <LazyFrame at 0x7F358819D010>,
 <LazyFrame at 0x7F358819D070>,
 <LazyFrame at 0x7F358819D0A0>,
 <LazyFrame at 0x7F358819D0D0>,
 <LazyFrame at 0x7F358819D100>,
 <LazyFrame at 0x7F358819D130>,
 <LazyFrame at 0x7F358819D1C0>,
 <LazyFrame at 0x7F358819D1F0>,
 <LazyFrame at 0x7F358819D220>]

In [8]:
#sample the outputs
[q.limit(7).collect() for q in parcel_summary]

[shape: (7, 6)
 ┌─────────────────┬──────┬────────────────┬─────────────────┬────────────────┬─────────────────┐
 │ Monit_MAP_CODE1 ┆ Year ┆ mean_TOTAL_TAX ┆ mean_SALE_VALUE ┆ mean_FIN_SQ_FT ┆ mean_ACRES_DEED │
 │ ---             ┆ ---  ┆ ---            ┆ ---             ┆ ---            ┆ ---             │
 │ str             ┆ str  ┆ f64            ┆ f64             ┆ f64            ┆ f64             │
 ╞═════════════════╪══════╪════════════════╪═════════════════╪════════════════╪═════════════════╡
 │ 82012200-01     ┆ 2004 ┆ 3852.674319    ┆ 93027.842181    ┆ 0.0            ┆ 5.741535        │
 │ 82005400-01     ┆ 2004 ┆ 1530.997753    ┆ 26066.741573    ┆ 0.0            ┆ 6.86818         │
 │ 27005300-01     ┆ 2004 ┆ 3263.128073    ┆ 114058.716327   ┆ 0.0            ┆ 0.0             │
 │ 19002200-01     ┆ 2004 ┆ 2464.189224    ┆ 137042.147649   ┆ 2233.314597    ┆ 0.0             │
 │ 27071100-01     ┆ 2004 ┆ 7442.508784    ┆ 160405.113127   ┆ 0.0            ┆ 0.0             │
 │ 82

# Collect each year's summary, concatenate, and write once to CSV.
# A LazyFrame has no write_csv method, and writing inside a loop would overwrite the file each year.
pl.concat([summary.collect() for summary in parcel_summary]).write_csv(
    'parcel_numerical_summaries.csv',
    include_header=True,
)

## Problem 3.  Simple categorical summaries.

In this part, you will create summary statistics for some of the simpler categorical variables.

**Binary variables.** There are two examples of binary variables, listed below.  You will need to compute the percent of `Yes` for each.

* GARAGE: Garage Y/N
* BASEMENT: Basement Y/N


**Other categorical variables.** There are a number of other categorical variables.  You need to select one of these variables, inspect/clean your variable as needed, create indicator variables for each resulting label, and compute summary statistics for each label.

* HOMESTEAD: Homestead Status
* TAX_EXEMPT: Tax Exempt Status 
* DWELL_TYPE: Dwelling Type 
* HOME_STYLE: Home Style
* HEATING: Heating type
* COOLING: Cooling type

**Tasks.**
Create a query that does the following:

1. Select one binary and two other categorical variables for feature construction,
2. Read in the parcel data and select the relevant columns (be sure to keep the lake ID and year),
3. Inspect unique labels and recode/clean as needed,
4. Create a literal column of ones, and
5. Pivot to get the counts of each label per lake-year (do this once per category).

Write this summary table out to a csv file named `parcel_categorical_summaries.csv`.  Again, you should partition by lake ID and year.

In [9]:
# Pull only the columns needed for the categorical summaries
cat_summary_columns = [
    'Monit_MAP_CODE1',
    'Year',
    'GARAGE',      # binary
    'HOMESTEAD',   # categorical
    'GREEN_ACRE',  # categorical
]


In [10]:
# Compute the summary statistics for the categorical variables.
# GARAGE: percent of 'Y' among valid ('Y' or 'N') responses only.
# HOMESTEAD and GREEN_ACRE: pl.when builds a 0/1 indicator for each 'Y' and 'N' label, and the indicators are summed into counts per lake-year.

(parcel_categorical_summ := [
    year
    .select(cat_summary_columns)
    .with_columns([
        pl.when(pl.col('GARAGE') == 'Y').then(1).otherwise(0).alias('GARAGE_YES'),
        pl.when((pl.col('GARAGE') == 'Y') | (pl.col('GARAGE') == 'N')).then(1).otherwise(0).alias('GARAGE_VALID'),   
        pl.when(pl.col('HOMESTEAD') == 'Y').then(1).otherwise(0).alias('HOMESTEAD_Y'),
        pl.when(pl.col('HOMESTEAD') == 'N').then(1).otherwise(0).alias('HOMESTEAD_N'),
        pl.when(pl.col('GREEN_ACRE') == 'Y').then(1).otherwise(0).alias('GREEN_ACRE_Y'),
        pl.when(pl.col('GREEN_ACRE') == 'N').then(1).otherwise(0).alias('GREEN_ACRE_N'),
    ])

    .group_by(['Monit_MAP_CODE1', 'Year'])
    .agg([
        (pl.col('GARAGE_YES').sum() / pl.col('GARAGE_VALID').sum() * 100).alias('pct_garage_yes'),
        pl.col('HOMESTEAD_Y').sum().alias('homestead_y_count'),
        pl.col('HOMESTEAD_N').sum().alias('homestead_n_count'),
        pl.col('GREEN_ACRE_Y').sum().alias('GREEN_ACRE_y_count'),
        pl.col('GREEN_ACRE_N').sum().alias('GREEN_ACRE_n_count'),
    ])
    .sort('GREEN_ACRE_y_count', descending=True)
    for year in parcel_queries
])

[<LazyFrame at 0x7F358819DAF0>,
 <LazyFrame at 0x7F358819DA60>,
 <LazyFrame at 0x7F358819DA30>,
 <LazyFrame at 0x7F358819C260>,
 <LazyFrame at 0x7F358819DA00>,
 <LazyFrame at 0x7F358819DB80>,
 <LazyFrame at 0x7F358819DA90>,
 <LazyFrame at 0x7F358819DBB0>,
 <LazyFrame at 0x7F358819DBE0>,
 <LazyFrame at 0x7F358819DC10>,
 <LazyFrame at 0x7F358819DC40>,
 <LazyFrame at 0x7F358819DC70>]

In [11]:
# Sample the categorical summaries for each year
[q.limit(7).collect() for q in parcel_categorical_summ]

[shape: (7, 7)
 ┌───────────────┬──────┬───────────────┬──────────────┬──────────────┬──────────────┬──────────────┐
 │ Monit_MAP_COD ┆ Year ┆ pct_garage_ye ┆ homestead_y_ ┆ homestead_n_ ┆ GREEN_ACRE_y ┆ GREEN_ACRE_n │
 │ E1            ┆ ---  ┆ s             ┆ count        ┆ count        ┆ _count       ┆ _count       │
 │ ---           ┆ str  ┆ ---           ┆ ---          ┆ ---          ┆ ---          ┆ ---          │
 │ str           ┆      ┆ f64           ┆ i32          ┆ i32          ┆ i32          ┆ i32          │
 ╞═══════════════╪══════╪═══════════════╪══════════════╪══════════════╪══════════════╪══════════════╡
 │ 19002601-01   ┆ 2004 ┆ NaN           ┆ 8678         ┆ 3086         ┆ 304          ┆ 8889         │
 │ 82008700-01   ┆ 2004 ┆ NaN           ┆ 12330        ┆ 278          ┆ 209          ┆ 395          │
 │ 82009700-01   ┆ 2004 ┆ NaN           ┆ 10560        ┆ 4            ┆ 90           ┆ 12           │
 │ 10005200-01   ┆ 2004 ┆ 69.254658     ┆ 516          ┆ 128       

In [12]:
# Collect all years, concatenate, and write once to CSV
(cat_summ_all := pl.concat([cat_summ.collect() for cat_summ in parcel_categorical_summ]))



| Monit_MAP_CODE1 | Year | pct_garage_yes | homestead_y_count | homestead_n_count | GREEN_ACRE_y_count | GREEN_ACRE_n_count |
| --- | --- | --- | --- | --- | --- | --- |
| str | str | f64 | i32 | i32 | i32 | i32 |
| 19002601-01 | 2004 |  | 8678 | 3086 | 304 | 8889 |
| 82008700-01 | 2004 |  | 12330 | 278 | 209 | 395 |
| 82009700-01 | 2004 |  | 10560 | 4 | 90 | 12 |
| 10005200-01 | 2004 | 69.254658 | 516 | 128 | 65 | 579 |
| 82005400-01 | 2004 |  | 445 | 0 | 64 | 0 |
| … | … | … | … | … | … | … |
| 19002100-01 | 2015 |  | 4678 | 771 | 0 | 5449 |
| 19002500-01 | 2015 |  | 2119 | 259 | 0 | 2378 |
| 82009200-01 | 2015 | 100.0 | 950 | 289 | 0 | 1239 |
| 27062700-01 | 2015 | 98.634368 | 11126 | 2057 | 0 | 13185 |


In [13]:
cat_summ_all.write_csv(
    'data/parcel_categorical_summaries.csv',
    include_header=True,
    )

## Problem 4.  Join all the summaries.

Finally, you need to join all the summaries created above, along with the water quality summaries created in a previous lab, into one overall summary file.  Write the resulting table to a CSV file named `water_quality_and_parcel_summaries_2004_to_2015.csv`.

In [14]:
# Lazily join parcel_summary and parcel_categorical_summ on ['Monit_MAP_CODE1', 'Year'].
# zip pairs the two lists year by year, since both were built from parcel_queries in the same order.
parcel_joined = [
    summary.join(
        cat_summ,
        on=['Monit_MAP_CODE1', 'Year'],
        how='inner'
    )
    for summary, cat_summ in zip(parcel_summary, parcel_categorical_summ)
]



In [15]:
# sample the outputs
[q.limit(7).collect() for q in parcel_joined]

[shape: (7, 11)
 ┌────────────┬──────┬────────────┬────────────┬───┬────────────┬───────────┬───────────┬───────────┐
 │ Monit_MAP_ ┆ Year ┆ mean_TOTAL ┆ mean_SALE_ ┆ … ┆ homestead_ ┆ homestead ┆ GREEN_ACR ┆ GREEN_ACR │
 │ CODE1      ┆ ---  ┆ _TAX       ┆ VALUE      ┆   ┆ y_count    ┆ _n_count  ┆ E_y_count ┆ E_n_count │
 │ ---        ┆ str  ┆ ---        ┆ ---        ┆   ┆ ---        ┆ ---       ┆ ---       ┆ ---       │
 │ str        ┆      ┆ f64        ┆ f64        ┆   ┆ i32        ┆ i32       ┆ i32       ┆ i32       │
 ╞════════════╪══════╪════════════╪════════════╪═══╪════════════╪═══════════╪═══════════╪═══════════╡
 │ 19002601-0 ┆ 2004 ┆ 2597.65942 ┆ 135206.335 ┆ … ┆ 8678       ┆ 3086      ┆ 304       ┆ 8889      │
 │ 1          ┆      ┆            ┆ 582        ┆   ┆            ┆           ┆           ┆           │
 │ 82008700-0 ┆ 2004 ┆ 2845.59619 ┆ 31040.4944 ┆ … ┆ 12330      ┆ 278       ┆ 209       ┆ 395       │
 │ 1          ┆      ┆ 2          ┆ 3          ┆   ┆            ┆ 

In [16]:
# Pull in the water quality data, one lazy query per year

(quality_query := [
    pl.scan_parquet('data/water_quality_by_year.parquet')
    .filter(pl.col('Year') == year)
    .fill_null('0')
    for year in range(2004, 2016)
 ])

[<LazyFrame at 0x7F358819EA80>,
 <LazyFrame at 0x7F358819EA20>,
 <LazyFrame at 0x7F358819EB10>,
 <LazyFrame at 0x7F358819EA50>,
 <LazyFrame at 0x7F358819EB40>,
 <LazyFrame at 0x7F35896C6C00>,
 <LazyFrame at 0x7F358819D700>,
 <LazyFrame at 0x7F358819E9F0>,
 <LazyFrame at 0x7F358819E030>,
 <LazyFrame at 0x7F358819EC90>,
 <LazyFrame at 0x7F358819ED20>,
 <LazyFrame at 0x7F358819E870>]

In [17]:
# sample the outputs
[q.limit(7).collect() for q in quality_query]

[shape: (7, 7)
 ┌────────────────┬──────┬────────────────┬───────────┬────────────┬────────────────┬───────────────┐
 │ DNR_ID_Site_Nu ┆ Year ┆ LAKE_NAME      ┆ latitude  ┆ longitude  ┆ avg_secchi_dep ┆ avg_total_pho │
 │ mber           ┆ ---  ┆ ---            ┆ ---       ┆ ---        ┆ th             ┆ sphorus       │
 │ ---            ┆ i32  ┆ str            ┆ f64       ┆ f64        ┆ ---            ┆ ---           │
 │ str            ┆      ┆                ┆           ┆            ┆ f64            ┆ f64           │
 ╞════════════════╪══════╪════════════════╪═══════════╪════════════╪════════════════╪═══════════════╡
 │ 02000500-01    ┆ 2004 ┆ George Watch   ┆ 45.176405 ┆ -93.089055 ┆ 0.705          ┆ 0.199         │
 │                ┆      ┆ Lake           ┆           ┆            ┆                ┆               │
 │ 10000200-01    ┆ 2004 ┆ Riley Lake     ┆ 44.83469  ┆ -93.516952 ┆ 2.364286       ┆ 0.046714      │
 │ 10001100-01    ┆ 2004 ┆ St. Joe Lake   ┆ 44.875646 ┆ -93.622969 

In [18]:
# Cast Year in quality_query to string and join to parcel_joined
final_joined = [
    pq.join(
        q.with_columns([pl.col('Year').cast(pl.Utf8)]),
        left_on=['Year', 'Monit_MAP_CODE1'],
        right_on=['Year', 'DNR_ID_Site_Number'],
        how='inner'
    )
    for pq, q in zip(parcel_joined, quality_query)
 ]

In [19]:
# sample the outputs
[q.limit(7).collect() for q in final_joined]

[shape: (7, 16)
 ┌────────────┬──────┬────────────┬────────────┬───┬───────────┬────────────┬───────────┬───────────┐
 │ Monit_MAP_ ┆ Year ┆ mean_TOTAL ┆ mean_SALE_ ┆ … ┆ latitude  ┆ longitude  ┆ avg_secch ┆ avg_total │
 │ CODE1      ┆ ---  ┆ _TAX       ┆ VALUE      ┆   ┆ ---       ┆ ---        ┆ i_depth   ┆ _phosphor │
 │ ---        ┆ str  ┆ ---        ┆ ---        ┆   ┆ f64       ┆ f64        ┆ ---       ┆ us        │
 │ str        ┆      ┆ f64        ┆ f64        ┆   ┆           ┆            ┆ f64       ┆ ---       │
 │            ┆      ┆            ┆            ┆   ┆           ┆            ┆           ┆ f64       │
 ╞════════════╪══════╪════════════╪════════════╪═══╪═══════════╪════════════╪═══════════╪═══════════╡
 │ 02000500-0 ┆ 2004 ┆ 2511.99043 ┆ 85214.2112 ┆ … ┆ 45.176405 ┆ -93.089055 ┆ 0.705     ┆ 0.199     │
 │ 1          ┆      ┆ 2          ┆ 51         ┆   ┆           ┆            ┆           ┆           │
 │ 10000200-0 ┆ 2004 ┆ 820.433675 ┆ 169311.453 ┆ … ┆ 44.83469  ┆ -

In [20]:
# Collect all years and concatenate before writing to CSV. Note that this reuses (and overwrites) the name cat_summ_all from In [12].
(cat_summ_all := pl.concat([cat_summ.collect() for cat_summ in final_joined]))

| Monit_MAP_CODE1 | Year | mean_TOTAL_TAX | mean_SALE_VALUE | mean_FIN_SQ_FT | mean_ACRES_DEED | pct_garage_yes | homestead_y_count | homestead_n_count | GREEN_ACRE_y_count | GREEN_ACRE_n_count | LAKE_NAME | latitude | longitude | avg_secchi_depth | avg_total_phosphorus |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| str | str | f64 | f64 | f64 | f64 | f64 | i32 | i32 | i32 | i32 | str | f64 | f64 | f64 | f64 |
| 02000500-01 | 2004 | 2511.990432 | 85214.211251 | 0.0 | 0.0 |  | 1800 | 813 | 63 | 2550 | George Watch Lake | 45.176405 | -93.089055 | 0.705 | 0.199 |
| 10000200-01 | 2004 | 820.433675 | 169311.453697 | 1284.034357 | 0.476291 | 66.666667 | 3328 | 1061 | 8 | 699 | Riley Lake | 44.83469 | -93.516952 | 2.364286 | 0.046714 |
| 10001100-01 | 2004 | 3612.712142 | 207885.570259 | 1335.454297 | 0.203411 | 87.858117 | 637 | 94 | 0 | 733 | St. Joe Lake | 44.875646 | -93.622969 | 2.629412 | 0.024824 |
| 10001900-01 | 2004 | 2250.115794 | 199070.782324 | 1029.328969 | 1.075806 | 66.85761 | 1548 | 885 | 31 | 2413 | Bavaria Lake | 44.838122 | -93.637789 | 1.98625 | 0.037313 |
| 10005200-01 | 2004 | 1632.563665 | 116258.10559 | 1151.194099 | 16.295776 | 69.254658 | 516 | 128 | 65 | 579 | Reitz Lake | 44.838626 | -93.744522 | 1.936364 | 0.109091 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 82013700-01 | 2014 | 1199.895512 | 212105.022103 | 1699.357669 | 0.197515 | 99.350649 | 1180 | 237 | 18 | 1403 | Fish Lake | 45.122425 | -92.977967 | 0.794167 | 0.115083 |
| 82015300-01 | 2014 | 0.0 | 142151.199438 | 1590.103933 | 0.0 | 100.0 | 610 | 100 | 34 | 676 | Sunset Lake | 45.135271 | -92.94157 | 3.28125 | 0.015875 |
| 82015900-01 | 2014 | 0.0 | 116348.852319 | 919.500119 | 0.0 | 100.0 | 2477 | 1720 | 4 | 4193 | Forest Lake | 45.282206 | -92.966546 | 1.6 | 0.02636 |
| 82033400-01 | 2014 | 0.0 | 106300.45977 | 1298.252874 | 0.0 | 100.0 | 146 | 28 | 26 | 148 | Kismet Lake | 45.095364 | -92.891733 | 1.625 | 0.032917 |


In [21]:
#write the data to csv.
cat_summ_all.write_csv(
    'data/water_quality_and_parcel_summaries_2004_to_2015.csv',
    include_header=True,
    )
