# Lab 1 - Getting Started

## Project Overview

In this project, you will use both `polars` and `pyspark` to process the data from the MinneMUDAC 2016 competition, Dive into Water Data. While the MinneMUDAC 2016 site is no longer live, a copy was obtained using the [Wayback Machine](https://web.archive.org) and has been provided in [the overview notebook](./MinneMUDAC_2016_Overview.ipynb). You should document your work in a Jupyter notebook, which will be used to submit your solution.

## Lab 1 Tasks

In this lab, you will perform the following tasks:

1. Download and unzip the data.
2. Investigate the columns in the various property data files.

### Task 1 - Data download and unzip

While the download links on the original site no longer work, you can access the data by clicking on [this link](https://mnscu-my.sharepoint.com/:u:/g/personal/bn8210wy_minnstate_edu/EdUePet8JsdKv5aUt9gvjoMBxQhXrOx73WpQyVNwLVDfkA?e=rR8qrc).

1. Move the zip file into your repository.
2. Unzip and move the files into your data folder.

**Hint.** Take a look at the Colab section of any module 5 lecture for an example.

In [1]:
import polars as pl
import polars.selectors as cs
from glob import glob
import re
from functools import reduce


In [4]:
import zipfile
from pathlib import Path

Path("data").mkdir(exist_ok=True)

with zipfile.ZipFile("data/MinneMUDAC_raw_files.zip", "r") as z:
    for file in z.namelist():
        z.extract(file, "data")


In [10]:
# Perform the tasks listed above then use `glob` to verify the contents/paths.
# sorted(glob("./data/*.csv"))
sorted(glob("./data/MinneMUDAC_raw_files/*.txt"))


['./data/MinneMUDAC_raw_files\\2002_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files\\2003_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files\\2004_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files\\2005_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files\\2006_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files\\2007_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files\\2008_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files\\2009_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files\\2010_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files\\2011_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files\\2012_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files\\2013_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files\\2014_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files\\2015_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files\\Parcel_Lake_Monitoring_Site_Xref.txt',
 './data/MinneMUDAC_raw_files\\mces_lakes_1999_2014.txt']

#### Questions

1. Notice that we have multiple property files, one per year.  What verb(s) will be used to combine these files?
2. Why is it important to compare the columns of these files?
3. Use the `%%bash` cell magic along with `head ...` to inspect the first few lines of one of the files, where `...` is replaced with the path to that file.  How are the columns separated?

> <font color="orange"> Your answer here </font>

In [189]:
%%bash
# The `%%bash` magic must be the first line of the cell.
# Use `head ...` to inspect the first few lines of one of the files.
# Be sure to replace the ... with one of the paths.

head ...

In [15]:
%%bash
head ./data/MinneMUDAC_raw_files/2002_metro_tax_parcels.txt


ACRES_DEED|ACRES_POLY|AGPRE_ENRD|AGPRE_EXPD|AG_PRESERV|BASEMENT|BLDG_NUM|BLOCK|CITY|CITY_USPS|COOLING|COUNTY_ID|DWELL_TYPE|EMV_BLDG|EMV_LAND|EMV_TOTAL|FIN_SQ_FT|GARAGE|GARAGESQFT|GREEN_ACRE|HEATING|HOMESTEAD|HOME_STYLE|LANDMARK|LOT|MULTI_USES|NUM_UNITS|OPEN_SPACE|OWNER_MORE|OWNER_NAME|OWN_ADD_L1|OWN_ADD_L2|OWN_ADD_L3|OWN_NAME|PARC_CODE|PIN|PIN_1|PLAT_NAME|PREFIXTYPE|PREFIX_DIR|SALE_DATE|SALE_VALUE|SCHOOL_DST|SPEC_ASSES|STREET|STREETNAME|STREETTYPE|STRUC_TYPE|SUFFIX_DIR|Shape_Area|Shape_Leng|TAX_ADD_L1|TAX_ADD_L2|TAX_ADD_L3|TAX_ADD_LI|TAX_CAPAC|TAX_EXEMPT|TAX_NAME|TOTAL_TAX|UNIT_INFO|USE1_DESC|USE2_DESC|USE3_DESC|USE4_DESC|WSHD_DIST|XUSE1_DESC|XUSE2_DESC|XUSE3_DESC|XUSE4_DESC|YEAR_BUILT|Year|ZIP|ZIP4|centroid_lat|centroid_long
||||||14195||ANDOVER|||003||222460.0|55510.0|292596.0||||||Y|||||||||14195 ALDER ST NW||ANDOVER, MN 55304||0.0|003-253224440139|||||2000-11-17|295547.0|11||14195 ALDER ST NW|||RAMBLER BASEMENT||630.998818085|103.296560124|14195 ALDER ST NW||ANDOVER, MN 55304||2566
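
The `%%bash` magic may not be available on every system (the backslashes in the `glob` output above suggest a Windows machine), so here is a rough Python-only sketch of the same inspection. The helper names `peek` and `guess_separator` are made up for this illustration, and the UTF-8 encoding is an assumption about the files.

```python
from collections import Counter
from itertools import islice
from pathlib import Path


def peek(path, n=5):
    """Return the first n lines of a text file without reading the rest."""
    # Assumption: the file is (mostly) UTF-8; undecodable bytes are replaced.
    with open(path, encoding="utf-8", errors="replace") as f:
        return [line.rstrip("\n") for line in islice(f, n)]


def guess_separator(header, candidates="|,\t;"):
    """Return the candidate character that occurs most often in the header line."""
    counts = Counter(ch for ch in header if ch in candidates)
    return counts.most_common(1)[0][0] if counts else None


# Hypothetical usage; the path is one of the files listed by `glob` above.
path = Path("data/MinneMUDAC_raw_files/2002_metro_tax_parcels.txt")
if path.exists():
    lines = peek(path)
    print(lines[0][:80])                   # start of the header row
    print(repr(guess_separator(lines[0])))  # most frequent candidate separator
```

Counting candidate characters is only a heuristic; it is worth confirming the guess by eye against the first data row.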


### Task 2 - Perform a column exploration on the property files

<img src="./img/column_master_file.png" width="800">

**Hints.**

1. Use `glob` to get a list of all the parcel files.
2. Use a list comprehension to build pairs of the form `(year, df)`, where `df` is a `polars` or `pyspark` data frame for each file. Keep in mind that the files are large, so we will need to leverage the lazy nature of `polars` and `pyspark`. You will want to use `scan_csv` in `polars`.  On the other hand, `pyspark` is lazy by default and will do minimal work on this step.
3. Perform a column exploration by creating an indicator summary table, with one row per column and one column per year/file.  The values in the table should be `1` if the column is present in that year's file and `0` otherwise.  You may need to pivot or reshape your data frame to get it into this format.
4. There is one problematic year.  Which year is it and what is the problem?  We will be skipping this year in future analysis.
5. Sort the summary table by the column names and inspect the results.  In particular, look for similar names with different spellings or capitalizations.
6. Identify the common columns that are present in all years.

In [21]:
# Glob/path processing code here
parcel_paths = sorted(glob("./data/MinneMUDAC_raw_files/*_metro_tax_parcels.txt"))


Extract year + lazy load using Polars

In [22]:
year_frames = [
    (
        re.search(r"\d{4}", path).group(),  # extract year from filename
        pl.scan_csv(path, separator="|")    # lazily read
    )
    for path in parcel_paths
]

year_frames


[('2002', <LazyFrame at 0x2339E427740>),
 ('2003', <LazyFrame at 0x2339E425C10>),
 ('2004', <LazyFrame at 0x2339E4246E0>),
 ('2005', <LazyFrame at 0x2339E426630>),
 ('2006', <LazyFrame at 0x2339E427A10>),
 ('2007', <LazyFrame at 0x2339E426510>),
 ('2008', <LazyFrame at 0x2339E426600>),
 ('2009', <LazyFrame at 0x2339E4263F0>),
 ('2010', <LazyFrame at 0x2339E4261B0>),
 ('2011', <LazyFrame at 0x2339E425B20>),
 ('2012', <LazyFrame at 0x2339E425FD0>),
 ('2013', <LazyFrame at 0x2339E424950>),
 ('2014', <LazyFrame at 0x2339E425E50>),
 ('2015', <LazyFrame at 0x2339E4262D0>)]

Build the long-format “column presence” table

In [23]:
column_presence = pl.concat(
    [
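        pl.DataFrame({
            # Resolve the schema explicitly; `lf.columns` on a LazyFrame warns in recent polars
            "column": lf.collect_schema().names(),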
            "year": year,
            "present": 1
        })
        for year, lf in year_frames
    ]
)



Pivot to wide format (summary table)

In [25]:
summary = (
    column_presence
    .pivot(on="year", index="column", values="present")  # `on` replaces the deprecated `columns`
    .fill_null(0)
    .sort("column")
)

summary




column,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
str,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32
"""ACRES_DEED""",1,0,1,1,1,1,1,1,1,1,1,1,1,1
"""ACRES_POLY""",1,0,1,1,1,1,1,1,1,1,1,1,1,1
"""AGPRE_ENRD""",1,0,1,1,1,1,1,1,1,1,1,1,1,1
"""AGPRE_EXPD""",1,0,1,1,1,1,1,1,1,1,1,1,1,1
"""AG_PRESERV""",1,0,1,1,1,1,1,1,1,1,1,1,1,1
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Year""",1,1,1,1,1,1,1,1,1,1,1,1,1,1
"""ZIP""",1,1,1,1,1,1,1,1,1,1,1,1,1,1
"""ZIP4""",1,0,1,1,1,1,1,1,1,1,1,1,1,1
"""centroid_lat""",1,1,1,1,1,1,1,1,1,1,1,1,1,1


Identify the problematic year

In [26]:
summary.select(pl.exclude("column").sum()).transpose()


column_0
i32
75
34
71
70
70
…
70
70
70
74


In [33]:
# List of boolean expressions checking if each year-column == 1
exprs = [(pl.col(c) == 1) for c in summary.columns if c != "column"]

# Fold the per-year checks with logical AND: a row is kept only if every year has a 1
common_cols = summary.filter(
    pl.fold(
        True,
        lambda acc, x: acc & x,
        exprs
    )
)

common_cols


column,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
str,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32,i32
"""BLDG_NUM""",1,1,1,1,1,1,1,1,1,1,1,1,1,1
"""CITY""",1,1,1,1,1,1,1,1,1,1,1,1,1,1
"""COUNTY_ID""",1,1,1,1,1,1,1,1,1,1,1,1,1,1
"""EMV_BLDG""",1,1,1,1,1,1,1,1,1,1,1,1,1,1
"""EMV_LAND""",1,1,1,1,1,1,1,1,1,1,1,1,1,1
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""YEAR_BUILT""",1,1,1,1,1,1,1,1,1,1,1,1,1,1
"""Year""",1,1,1,1,1,1,1,1,1,1,1,1,1,1
"""ZIP""",1,1,1,1,1,1,1,1,1,1,1,1,1,1
"""centroid_lat""",1,1,1,1,1,1,1,1,1,1,1,1,1,1

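The common columns can also be cross-checked without the summary table by intersecting the per-year column lists as sets. The sketch below uses `reduce` (already imported at the top of the notebook) on invented column lists. In the lab, the dictionary could presumably be built from `year_frames`, which is an assumption about how the earlier cells were run.

```python
from functools import reduce

# Made-up column lists standing in for each year's schema.
columns_by_year = {
    "2002": ["CITY", "ZIP", "ZIP4", "Year"],
    "2003": ["CITY", "ZIP", "Year"],
    "2004": ["CITY", "ZIP", "ZIP4", "Year", "PIN"],
}

# Intersect the sets pairwise; what remains is present in every year.
common = sorted(
    reduce(set.intersection, (set(cols) for cols in columns_by_year.values()))
)
print(common)
```

If the length of this list matches the number of rows in `common_cols`, the two approaches agree.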

#### Your conclusions

<font color="orange">
Your thoughts here
</font>

Across the parcel files from 2002–2015, the number of columns varies by year, so the files do not share a single schema. The 2003 file stands out: it has only 34 columns, while the other years shown in the counts above have between 70 and 75, which makes 2003 the problematic year. Using the column-presence summary table, we identified a set of 31 columns that appear in every year. These common columns form the consistent core of the dataset and will be the basis for combining files in later analysis.