# Lab 1 - Getting Started

## Project Overview

In this project, you will use both `polars` and `pyspark` to process the data from the MinneMUDAC 2016 competition, Dive into Water Data. While the MinneMUDAC 2016 site is no longer live, a copy was obtained using the [Wayback Machine](https://web.archive.org) and has been provided in [the overview notebook](./MinneMUDAC_2016_Overview.ipynb). You should document your work in a Jupyter notebook, which will be used to submit your solution.

## Lab 1 Tasks

In this lab, you will perform the following tasks:

1. Download and unzip the data.
2. Investigate the columns in the various property data files.

### Task 1 - Data download and unzip

While the download links on the original site no longer work, you can access the data by clicking on [this link](https://mnscu-my.sharepoint.com/:u:/g/personal/bn8210wy_minnstate_edu/EdUePet8JsdKv5aUt9gvjoMBxQhXrOx73WpQyVNwLVDfkA?e=rR8qrc).

1. Move the zip file into your repository.
2. Unzip and move the files into your data folder.

**Hint.** Take a look at the Colab section of any module 5 lecture for an example.
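
If you prefer to unzip from Python rather than from the shell, the steps above can be sketched with the standard library. This is a minimal sketch, not the method from the lecture: the helper name `unzip_to` and the archive file name in the usage comment are hypothetical, so substitute whatever name your downloaded file has.

```python
import zipfile
from pathlib import Path


def unzip_to(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    dest = Path(dest_dir)
    # Create the destination folder (for example, `data`) if it does not exist yet
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
        return zf.namelist()


# Usage (hypothetical archive name):
# unzip_to("MinneMUDAC_raw_files.zip", "data")
```

The function returns the list of extracted member names, which is a quick way to confirm that all of the yearly parcel files arrived.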

#### Questions

1. Notice that we have multiple property files, one per year.  What verb(s) will be used to combine these files?
2. Why is it important to compare the columns of these files?
3. Use the `%%bash` cell magic along with `head path` to inspect the first few lines of one of the files, where `path` is replaced with the specific path to one of the files.  How are the columns separated?

> <font color="orange"> 1. year, zip, city, and possibly county_id will be used to combine these files. We will also use the centroid latitudes and longitudes to link parcels of land between the parcel data and the xref file.

It looks like in the latest data set, the MCES data set, the key column is called DNR ID site number, and in the xref file the key is MONITMAP code.

 </font>
2. If we start unioning the tax parcel data together and the files don't all have the same columns, or the columns that end up missing contain important information, we will run into further challenges down the road.
3. The columns are delimited by a pipe symbol (`|`).
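
For readers working outside a `%%bash` cell, the same inspection can be approximated in plain Python. The sketch below is an assumption-laden stand-in for `head`: the helper names `head` and `guess_separator`, the candidate separator list, and the UTF-8 decoding with replacement are all choices made for illustration, not something the lab prescribes.

```python
from itertools import islice


def head(path, n=5):
    """Return the first n lines of a text file without reading the rest."""
    # errors="replace" is an assumption: the encoding of the raw files is not documented
    with open(path, encoding="utf-8", errors="replace") as fh:
        return [line.rstrip("\n") for line in islice(fh, n)]


def guess_separator(header_line, candidates="|,\t;"):
    """Pick the candidate character that occurs most often in the header line."""
    # Ties (including "no candidate occurs at all") resolve to the earliest candidate
    return max(candidates, key=header_line.count)


# Usage:
# first_lines = head("data/MinneMUDAC_raw_files/2005_metro_tax_parcels.txt", n=3)
# guess_separator(first_lines[0])
```

Because only the first few lines are read, this stays cheap even for the large parcel files.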

In [10]:
import polars as pl
import polars.selectors as cs
from glob import glob
import re, os, functools
from columns import cols_to_keep, column_schema

## You should be able to read along with what is here; however, to run this code you will need to download the data manually from the link above (use the 3rd link).
## This document contains an example column exploration and also defines the time frame for this analysis.
## The point of this class was to explore the management of very large datasets, so the metrics used in the final model are kept fairly simple.

In [11]:
# Collect the paths of all .txt files in the data directory
(txt_paths := glob('data/**/*.txt', recursive=True))



['data/MinneMUDAC_raw_files/2002_metro_tax_parcels.txt',
 'data/MinneMUDAC_raw_files/2003_metro_tax_parcels.txt',
 'data/MinneMUDAC_raw_files/2004_metro_tax_parcels.txt',
 'data/MinneMUDAC_raw_files/2005_metro_tax_parcels.txt',
 'data/MinneMUDAC_raw_files/2006_metro_tax_parcels.txt',
 'data/MinneMUDAC_raw_files/2007_metro_tax_parcels.txt',
 'data/MinneMUDAC_raw_files/2008_metro_tax_parcels.txt',
 'data/MinneMUDAC_raw_files/2009_metro_tax_parcels.txt',
 'data/MinneMUDAC_raw_files/2010_metro_tax_parcels.txt',
 'data/MinneMUDAC_raw_files/2011_metro_tax_parcels.txt',
 'data/MinneMUDAC_raw_files/2012_metro_tax_parcels.txt',
 'data/MinneMUDAC_raw_files/2013_metro_tax_parcels.txt',
 'data/MinneMUDAC_raw_files/2014_metro_tax_parcels.txt',
 'data/MinneMUDAC_raw_files/2015_metro_tax_parcels.txt',
 'data/MinneMUDAC_raw_files/mces_lakes_1999_2014.txt',
 'data/MinneMUDAC_raw_files/mces_lakes_1999_2014_v2.txt',
 'data/MinneMUDAC_raw_files/Parcel_Lake_Monitoring_Site_Xref.txt']

In [12]:
# Collect the paths of the parcel files only, since there are many of them (sorted, so they are in year order).
(parcel_paths := sorted(glob('data/**/*parcels*.txt', recursive=True)))

['data/MinneMUDAC_raw_files/2002_metro_tax_parcels.txt',
 'data/MinneMUDAC_raw_files/2003_metro_tax_parcels.txt',
 'data/MinneMUDAC_raw_files/2004_metro_tax_parcels.txt',
 'data/MinneMUDAC_raw_files/2005_metro_tax_parcels.txt',
 'data/MinneMUDAC_raw_files/2006_metro_tax_parcels.txt',
 'data/MinneMUDAC_raw_files/2007_metro_tax_parcels.txt',
 'data/MinneMUDAC_raw_files/2008_metro_tax_parcels.txt',
 'data/MinneMUDAC_raw_files/2009_metro_tax_parcels.txt',
 'data/MinneMUDAC_raw_files/2010_metro_tax_parcels.txt',
 'data/MinneMUDAC_raw_files/2011_metro_tax_parcels.txt',
 'data/MinneMUDAC_raw_files/2012_metro_tax_parcels.txt',
 'data/MinneMUDAC_raw_files/2013_metro_tax_parcels.txt',
 'data/MinneMUDAC_raw_files/2014_metro_tax_parcels.txt',
 'data/MinneMUDAC_raw_files/2015_metro_tax_parcels.txt']

In [13]:
# Precompile regex to extract a 4-digit year from a path
(re_year := re.compile(r'(\d{4})'))


re.compile(r'(\d{4})', re.UNICODE)

In [14]:
# Read only the first row of one parcel file (similar to `head`) to explore its structure.
(pl.scan_csv(parcel_paths[3], has_header=True, ignore_errors=True, separator='|')
   .limit(1)
   .collect()
)


ACRES_DEED,ACRES_POLY,AGPRE_ENRD,AGPRE_EXPD,AG_PRESERV,BASEMENT,BLDG_NUM,BLOCK,CITY,CITY_USPS,COOLING,COUNTY_ID,DWELL_TYPE,EMV_BLDG,EMV_LAND,EMV_TOTAL,FIN_SQ_FT,GARAGE,GARAGESQFT,GREEN_ACRE,HEATING,HOMESTEAD,HOME_STYLE,LANDMARK,LOT,MULTI_USES,NUM_UNITS,OPEN_SPACE,OWNER_MORE,OWNER_NAME,OWN_ADD_L1,OWN_ADD_L2,OWN_ADD_L3,PARC_CODE,PIN,PLAT_NAME,PREFIXTYPE,PREFIX_DIR,SALE_DATE,SALE_VALUE,SCHOOL_DST,SPEC_ASSES,STREETNAME,STREETTYPE,SUFFIX_DIR,Shape_Area,Shape_Leng,TAX_ADD_L1,TAX_ADD_L2,TAX_ADD_L3,TAX_CAPAC,TAX_EXEMPT,TAX_NAME,TOTAL_TAX,UNIT_INFO,USE1_DESC,USE2_DESC,USE3_DESC,USE4_DESC,WSHD_DIST,XUSE1_DESC,XUSE2_DESC,XUSE3_DESC,XUSE4_DESC,YEAR_BUILT,Year,ZIP,ZIP4,centroid_long,centroid_lat
f64,f64,str,str,str,str,i64,i64,str,str,str,i64,str,f64,f64,f64,f64,str,str,str,str,str,str,str,i64,str,str,str,str,str,str,str,str,f64,str,str,str,str,str,f64,i64,f64,str,str,str,str,str,str,str,str,f64,str,str,f64,str,i64,str,str,str,str,str,str,str,str,f64,i64,i64,i64,f64,f64
0.0,8.03,,,"""N""","""N""",,,"""ST FRANCIS""",,,3,"""Misc. structure""",6600.0,17800.0,24400.0,2560.0,,,,,"""N""","""Misc. structure""",,,,,,,"""LEHNE TIRE SERVICE""","""24457 DOGWOOD ST NW""","""BETHEL""","""MN, 55005""",0.0,"""003-253424110001""",,,,,0.0,15,0.0,,,,,,"""24457 DOGWOOD ST NW""","""BETHEL""","""MN, 55005""",366.0,"""N""","""LEHNE TIRE SERVICE""",0.0,,3410,,,,"""UPPER RUM RIVER WMO""",,,,,1980.0,2005,,,-93.26739,45.41332


### Task 2 - Perform a column exploration on the property files

<img src="./img/column_master_file.png" width="800">

**Hints.**

1. Use `glob` to get a list of all the parcel files.
2. Use a list comprehension to build pairs of the form `(year, df)`, where `df` is a `polars` or `pyspark` data frame for each file. Keep in mind that the files are large, so we will need to leverage the lazy nature of `polars` and `pyspark`. You will want to use `scan_csv` in `polars`.  On the other hand, `pyspark` is lazy by default and will do minimal work on this step.
3. Perform a column exploration by creating an indicator summary table, with one row per column and one column per year/file.  The values in the table should be `1` if the column is present in that year's file and `0` otherwise.  You may need to pivot or reshape your data frame to get it into this format.
4. There is one problematic year.  Which year is it and what is the problem?  We will be skipping this year in future analysis.
5. Sort the summary table by the column names and inspect the results.  In particular, look for similar names with different spellings or capitalizations.
6. Identify the common columns that are present in all years.
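
Hint 5 asks you to look for names that differ only in capitalization. One way to surface those quickly is to group the column names by their case-folded form, as in the sketch below. The function name `case_variants` and the example names are hypothetical and are not taken from the parcel files.

```python
from collections import defaultdict


def case_variants(names):
    """Group names that are identical except for capitalization."""
    groups = defaultdict(set)
    for name in names:
        groups[name.casefold()].add(name)
    # Keep only the groups that contain more than one distinct spelling
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


# Hypothetical column names, for illustration only
example = ["PIN", "Pin", "ZIP", "Year", "YEAR", "CITY"]
print(case_variants(example))
```

This only catches capitalization differences. Names that are spelled differently still need a visual check of the sorted summary table.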

In [15]:
# Lambda that returns the first matched 4-digit year (as a string) or None
getYear = lambda p: (m.group(1) if (m := re_year.search(p)) else None)
# Test getYear across the discovered parcel paths
[getYear(p) for p in parcel_paths]



['2002',
 '2003',
 '2004',
 '2005',
 '2006',
 '2007',
 '2008',
 '2009',
 '2010',
 '2011',
 '2012',
 '2013',
 '2014',
 '2015']
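
Searching the whole path for the first four digits works for the paths listed above, but it would pick up a directory name if one ever contained a 4-digit number. A slightly stricter variant, sketched here under the assumption that every parcel file name starts with the year followed by an underscore, looks only at the file name. The names `YEAR_AT_START` and `year_from_path` are hypothetical.

```python
import re
from pathlib import PurePosixPath

# Assumes file names of the form "<year>_metro_tax_parcels.txt"
YEAR_AT_START = re.compile(r"^(\d{4})_")


def year_from_path(path):
    """Return the 4-digit year that prefixes the file name as an int, or None."""
    match = YEAR_AT_START.match(PurePosixPath(path).name)
    return int(match.group(1)) if match else None


print(year_from_path("data/MinneMUDAC_raw_files/2005_metro_tax_parcels.txt"))
```

Unlike `getYear`, this returns an integer, which is convenient for comparisons. Convert it back with `str(...)` before using it as a `polars` column name.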

In [16]:
# Create a dict of schemas, one per parcel file, with every column typed as String. This was saved into the columns.py file for easier use in later scripts.
(schemas := {
    path: {col: pl.String for col in pl.scan_csv(path, has_header = True, ignore_errors=True,  separator='|').limit(1).collect().columns}
    for path in parcel_paths
})


{'data/MinneMUDAC_raw_files/2002_metro_tax_parcels.txt': {'ACRES_DEED': String,
  'ACRES_POLY': String,
  'AGPRE_ENRD': String,
  'AGPRE_EXPD': String,
  'AG_PRESERV': String,
  'BASEMENT': String,
  'BLDG_NUM': String,
  'BLOCK': String,
  'CITY': String,
  'CITY_USPS': String,
  'COOLING': String,
  'COUNTY_ID': String,
  'DWELL_TYPE': String,
  'EMV_BLDG': String,
  'EMV_LAND': String,
  'EMV_TOTAL': String,
  'FIN_SQ_FT': String,
  'GARAGE': String,
  'GARAGESQFT': String,
  'GREEN_ACRE': String,
  'HEATING': String,
  'HOMESTEAD': String,
  'HOME_STYLE': String,
  'LANDMARK': String,
  'LOT': String,
  'MULTI_USES': String,
  'NUM_UNITS': String,
  'OPEN_SPACE': String,
  'OWNER_MORE': String,
  'OWNER_NAME': String,
  'OWN_ADD_L1': String,
  'OWN_ADD_L2': String,
  'OWN_ADD_L3': String,
  'OWN_NAME': String,
  'PARC_CODE': String,
  'PIN': String,
  'PIN_1': String,
  'PLAT_NAME': String,
  'PREFIXTYPE': String,
  'PREFIX_DIR': String,
  'SALE_DATE': String,
  'SALE_VALUE': Strin

In [17]:
# Build one unpivoted table per parcel file: scan one row, unpivot,
# rename 'variable' -> 'column', drop 'value', and add a column named after the year holding the literal 1.
# Each table says that the columns on the left occur in that year's parcel file (for example, the 2002 file).
# These tables are used below to identify column mismatches across year files.
(indicator_tables := [
    ( pl.scan_csv(p, separator='|', has_header=True, schema=schemas[p])
         .limit(1)
         .unpivot()
         .rename({'variable': 'column'})
         .drop('value')
         .with_columns(pl.lit(1).alias(getYear(p)))
         .collect()
    )
    for p in parcel_paths
])

[shape: (75, 2)
 ┌───────────────┬──────┐
 │ column        ┆ 2002 │
 │ ---           ┆ ---  │
 │ str           ┆ i32  │
 ╞═══════════════╪══════╡
 │ ACRES_DEED    ┆ 1    │
 │ ACRES_POLY    ┆ 1    │
 │ AGPRE_ENRD    ┆ 1    │
 │ AGPRE_EXPD    ┆ 1    │
 │ AG_PRESERV    ┆ 1    │
 │ …             ┆ …    │
 │ Year          ┆ 1    │
 │ ZIP           ┆ 1    │
 │ ZIP4          ┆ 1    │
 │ centroid_lat  ┆ 1    │
 │ centroid_long ┆ 1    │
 └───────────────┴──────┘,
 shape: (34, 2)
 ┌───────────────┬──────┐
 │ column        ┆ 2003 │
 │ ---           ┆ ---  │
 │ str           ┆ i32  │
 ╞═══════════════╪══════╡
 │ BLDG_NUM      ┆ 1    │
 │ CITY          ┆ 1    │
 │ COUNTY_ID     ┆ 1    │
 │ EMV_BLDG      ┆ 1    │
 │ EMV_LAND      ┆ 1    │
 │ …             ┆ …    │
 │ YEAR_BUILT    ┆ 1    │
 │ Year          ┆ 1    │
 │ ZIP           ┆ 1    │
 │ centroid_long ┆ 1    │
 │ centroid_lat  ┆ 1    │
 └───────────────┴──────┘,
 shape: (71, 2)
 ┌───────────────┬──────┐
 │ column        ┆ 2004 │
 │ ---        

In [18]:
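# Lambda to join two indicator tables on 'column' using a full outer join.
# Wrapping the join in a function keeps the later code tidy.
# Note: a full join keeps the right-hand key in 'column_right'; dropping it leaves a null
# 'column' for any name that appears only in the right-hand table.
join_next = lambda cur, nxt: cur.join(nxt, on='column', how='full').drop('column_right')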


In [19]:
# As an initial test, join the first two DataFrames from indicator_tables using join_next.
# A null means that the column on the left does not exist in that year's parcel file.
join_next(indicator_tables[0], indicator_tables[1])

column,2002,2003
str,i32,i32
"""ACRES_DEED""",1,
"""ACRES_POLY""",1,
"""AGPRE_ENRD""",1,
"""AGPRE_EXPD""",1,
"""AG_PRESERV""",1,
…,…,…
"""Year""",1,1
"""ZIP""",1,1
"""ZIP4""",1,
"""centroid_lat""",1,1


In [20]:
# Reduce all indicator tables into one using join_next.
# The result is used to identify column mismatches across year files.
# A 0 means that the column on the left does not exist in that year's parcel file.
(indicator_table := functools.reduce(join_next, indicator_tables)
     .filter(pl.col('column').is_not_null())
     .fill_null(0)
)

column,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
str,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32
"""ACRES_DEED""",1,0,1,1,1,1,1,1,1,1,1,1,1,1
"""ACRES_POLY""",1,0,1,1,1,1,1,1,1,1,1,1,1,1
"""AGPRE_ENRD""",1,0,1,1,1,1,1,1,1,1,1,1,1,1
"""AGPRE_EXPD""",1,0,1,1,1,1,1,1,1,1,1,1,1,1
"""AG_PRESERV""",1,0,1,1,1,1,1,1,1,1,1,1,1,1
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Year""",1,1,1,1,1,1,1,1,1,1,1,1,1,1
"""ZIP""",1,1,1,1,1,1,1,1,1,1,1,1,1,1
"""ZIP4""",1,0,1,1,1,1,1,1,1,1,1,1,1,1
"""centroid_lat""",1,1,1,1,1,1,1,1,1,1,1,1,1,1


In [21]:
# Summarize: sum all integer columns in indicator_table.
# Each sum is the number of listed columns that are present in that year's file.
# 2003 has very few, so we will cut our study to 2004-2015.
indicator_table.select(cs.integer()).sum()

2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32
75,34,70,70,70,70,70,70,70,70,70,70,70,70


In [22]:
# Set Polars display config for more rows/columns
pl.Config.set_tbl_rows(200)
pl.Config.set_tbl_cols(25)

polars.config.Config

#### Your conclusions

<font color="orange">
2003 was missing quite a bit of data, so I plan to take only the last 10 years of data, as it still gives us a lot to go on while ensuring that we don't run into the missing-data issue of 2003.
</font>
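
Restricting the study period can be done directly on the list of paths before any file is scanned. The sketch below is one possible way to do it. The helper name `paths_from` is hypothetical, and `first_year` is left as a parameter because the exact cutoff is a choice for the analyst.

```python
import re

# Same pattern as the notebook: the first run of four digits in the path
RE_YEAR = re.compile(r"(\d{4})")


def paths_from(paths, first_year):
    """Keep the paths whose first 4-digit run is a year >= first_year."""
    kept = []
    for path in paths:
        match = RE_YEAR.search(path)
        if match and int(match.group(1)) >= first_year:
            kept.append(path)
    return kept


# Shortened, illustrative list of paths
example_paths = [
    f"data/MinneMUDAC_raw_files/{year}_metro_tax_parcels.txt"
    for year in (2002, 2003, 2004, 2015)
]
print(paths_from(example_paths, first_year=2004))
```

Filtering the paths first means the excluded files are never opened, which matters with files of this size.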