# Project 1 Part 1

## Chunking Files in Pandas – Part 1 (20 Points)

In this project, you will use `Pandas` to process the data from the MinneMUDAC 2016 competition, Dive into Water Data.  The data can be found at the [MinneMUDAC site](http://minneanalytics.org/minnemudac/data/).  You should document your work in a Jupyter notebook, which will be used to submit your solution.

1.	Download all of the data in csv file format.  Use `!head -n 5 file_name` to inspect one of the files.  Note that the fields in each file are separated with pipes, i.e. `|`.
2.	First, you will explore the columns of each file.  These files are pretty large, so you will need to use the `chunksize` option of the `Pandas` `read_csv` function.  Our first task is the exploration of the column labels.  Pick two of the files and do the following:<br>
    a.	Read in the first chunk of each file.  Remember that you need to use `read_csv` with `chunksize=500`, `sep='|'`, and `toolz.first`.<br>
    b.	Turn the column labels of each file into a `Python` set.  Use the `set` methods `union`, `intersection`, and `difference` to answer the following questions: Are the columns the same?  If not, which columns are in common?<br>
3.	Now you need to make a list of file names.  The easiest way to do this is using a list comprehension and string formatting.  Make a base file name string with a “hole” where the year goes.  Then use `range` and the string `format` method to create a list of file names.
4.	Now we are going to make a list of `Pandas` `read_csv` iterators.  Use a list comprehension to iterate through the list of file names and apply `read_csv` with `chunksize=500` and `sep='|'`.  (If your program stalls, you probably forgot the `chunksize`!)
5.	Now we will extract the first chunk from each file.  Use `toolz.first` in a list comprehension to create a list of `Pandas` data frames, each consisting of the first chunk of the respective file.
6. Now we are going to make a list of column name sets, one for each file.  Use a list comprehension, `df.columns`, and the `set` constructor to create a list of column name `set`s.
7. Next, we want the set of common column names.  You will need to use the accumulator pattern to accomplish this task.  The initial value of the accumulator is the first column name set (starting from the empty set, i.e. `set([])`, would leave the intersection empty), and you will update the accumulator by taking the `intersection` of the accumulator and the next column name set.
8. Finally, let's determine which files have extra columns.  Use a list comprehension on the `zip`ped file names and column name sets.  Use a filter that only keeps the entries with more columns than the common set (`len` will be helpful here). Keep a `tuple` with the file name and number of extra columns.
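
Step 3 can be sketched as follows. The folder `./data/MinneMUDAC_raw_files/` and the years 2002 through 2015 are assumptions taken from the solution cells further down; adjust both to match where you saved the files.

```python
# Base file name with a "hole" ({}) where the year goes.
# NOTE: this path is an assumption; change it to your own data folder.
base_name = './data/MinneMUDAC_raw_files/{}_metro_tax_parcels.txt'

# range excludes its stop value, so 2016 yields the years 2002..2015.
years = range(2002, 2016)

# Fill the hole once per year with the string format method.
file_names = [base_name.format(year) for year in years]

print(file_names[0])
print(len(file_names))
```

Unlike a `glob` pattern, this approach produces the names in year order without an extra sort, but it does not check that the files actually exist.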


#### Problem 1

In [3]:
!head -n 3 ./data/MinneMUDAC_raw_files/2002_metro_tax_parcels.txt

ACRES_DEED|ACRES_POLY|AGPRE_ENRD|AGPRE_EXPD|AG_PRESERV|BASEMENT|BLDG_NUM|BLOCK|CITY|CITY_USPS|COOLING|COUNTY_ID|DWELL_TYPE|EMV_BLDG|EMV_LAND|EMV_TOTAL|FIN_SQ_FT|GARAGE|GARAGESQFT|GREEN_ACRE|HEATING|HOMESTEAD|HOME_STYLE|LANDMARK|LOT|MULTI_USES|NUM_UNITS|OPEN_SPACE|OWNER_MORE|OWNER_NAME|OWN_ADD_L1|OWN_ADD_L2|OWN_ADD_L3|OWN_NAME|PARC_CODE|PIN|PIN_1|PLAT_NAME|PREFIXTYPE|PREFIX_DIR|SALE_DATE|SALE_VALUE|SCHOOL_DST|SPEC_ASSES|STREET|STREETNAME|STREETTYPE|STRUC_TYPE|SUFFIX_DIR|Shape_Area|Shape_Leng|TAX_ADD_L1|TAX_ADD_L2|TAX_ADD_L3|TAX_ADD_LI|TAX_CAPAC|TAX_EXEMPT|TAX_NAME|TOTAL_TAX|UNIT_INFO|USE1_DESC|USE2_DESC|USE3_DESC|USE4_DESC|WSHD_DIST|XUSE1_DESC|XUSE2_DESC|XUSE3_DESC|XUSE4_DESC|YEAR_BUILT|Year|ZIP|ZIP4|centroid_lat|centroid_long
||||||14195||ANDOVER|||003||222460.0|55510.0|292596.0||||||Y|||||||||14195 ALDER ST NW||ANDOVER, MN 55304||0.0|003-253224440139|||||2000-11-17|295547.0|11||14195 ALDER ST NW|||RAMBLER BASEMENT||630.998818085|103.296560124|14195 ALDER ST NW||ANDOVER, MN 55304||256

#### Problem 2

In [1]:
from glob import glob
import re

files = glob('./data/MinneMUDAC_raw_files/*_metro_tax_parcels.txt')
year = re.compile(r'(\d{4})')
files[:3]

['./data/MinneMUDAC_raw_files/2003_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files/2015_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files/2009_metro_tax_parcels.txt']

In [2]:
year_and_files = sorted([(int(year.search(f).group(1)), f) for f in files], key=lambda t: t[0])
year_and_files

[(2002, './data/MinneMUDAC_raw_files/2002_metro_tax_parcels.txt'),
 (2003, './data/MinneMUDAC_raw_files/2003_metro_tax_parcels.txt'),
 (2004, './data/MinneMUDAC_raw_files/2004_metro_tax_parcels.txt'),
 (2005, './data/MinneMUDAC_raw_files/2005_metro_tax_parcels.txt'),
 (2006, './data/MinneMUDAC_raw_files/2006_metro_tax_parcels.txt'),
 (2007, './data/MinneMUDAC_raw_files/2007_metro_tax_parcels.txt'),
 (2008, './data/MinneMUDAC_raw_files/2008_metro_tax_parcels.txt'),
 (2009, './data/MinneMUDAC_raw_files/2009_metro_tax_parcels.txt'),
 (2010, './data/MinneMUDAC_raw_files/2010_metro_tax_parcels.txt'),
 (2011, './data/MinneMUDAC_raw_files/2011_metro_tax_parcels.txt'),
 (2012, './data/MinneMUDAC_raw_files/2012_metro_tax_parcels.txt'),
 (2013, './data/MinneMUDAC_raw_files/2013_metro_tax_parcels.txt'),
 (2014, './data/MinneMUDAC_raw_files/2014_metro_tax_parcels.txt'),
 (2015, './data/MinneMUDAC_raw_files/2015_metro_tax_parcels.txt')]

In [3]:
from toolz import first
import pandas as pd

#### Problem 4

In [11]:
c_size = 500
df_iters = [(y, pd.read_csv(file, chunksize=c_size, sep='|')) for y, file in year_and_files if y >= 2004]
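
To see what `chunksize` changes, here is a minimal sketch on a tiny in-memory table. The three data rows are made up for illustration (they are not real parcel records), and the `first` fallback is only there in case `toolz` is not installed.

```python
import io
import pandas as pd

try:
    from toolz import first
except ImportError:
    # Fallback for this sketch only, assuming toolz is unavailable.
    def first(seq):
        return next(iter(seq))

# Made-up pipe-separated text standing in for one of the large files.
raw = "PIN|CITY|Year\n1|ANDOVER|2004\n2|BETHEL|2004\n3|ANOKA|2004\n"

# With chunksize set, read_csv returns an iterator of data frames
# instead of loading the whole table at once.
reader = pd.read_csv(io.StringIO(raw), sep='|', chunksize=2)

# first pulls one chunk (2 rows here) and leaves the rest unread.
chunk = first(reader)
print(chunk.shape)

# The iterator has advanced, so only the remaining rows are left.
remaining = list(reader)
print(len(remaining))
```

Because the iterator is consumed as it is read, calling `first` on the same reader a second time returns the next chunk, not the first one again.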

#### Problem 5

In [12]:
first_chunks = [(y, first(df_iter)) for y, df_iter in df_iters]
first_chunks[0][1].head()

Unnamed: 0,ACRES_DEED,ACRES_POLY,AGPRE_ENRD,AGPRE_EXPD,AG_PRESERV,BASEMENT,BLDG_NUM,BLOCK,CITY,CITY_USPS,...,XUSE1_DESC,XUSE2_DESC,XUSE3_DESC,XUSE4_DESC,YEAR_BUILT,Year,ZIP,ZIP4,centroid_lat,centroid_long
0,0.0,8.03,,,N,,,,SAINT FRANCIS,,...,,,,,1980.0,2004,,,45.41332,-93.26739
1,0.0,0.93,,,N,,24457.0,,SAINT FRANCIS,BETHEL,...,,,,,1974.0,2004,55005.0,,45.41354,-93.2701
2,0.0,8.75,,,N,,24442.0,,SAINT FRANCIS,BETHEL,...,,,,,1969.0,2004,55005.0,,45.41318,-93.27344
3,0.0,11.17,,,N,,410.0,,SAINT FRANCIS,BETHEL,...,,,,,1989.0,2004,55005.0,,45.41167,-93.27684
4,0.0,14.46,,,N,,480.0,,SAINT FRANCIS,BETHEL,...,,,,,1995.0,2004,55070.0,,45.41169,-93.27849


#### Problem 6

In [13]:
all_headers = [(y, chunk.columns) for y, chunk in first_chunks]
all_headers[0]

(2004,
 Index(['ACRES_DEED', 'ACRES_POLY', 'AGPRE_ENRD', 'AGPRE_EXPD', 'AG_PRESERV',
        'BASEMENT', 'BLDG_NUM', 'BLOCK', 'CITY', 'CITY_USPS', 'COOLING',
        'COUNTY_ID', 'DWELL_TYPE', 'EMV_BLDG', 'EMV_LAND', 'EMV_TOTAL',
        'FIN_SQ_FT', 'GARAGE', 'GARAGESQFT', 'GREEN_ACRE', 'HEATING',
        'HOMESTEAD', 'HOME_STYLE', 'ID', 'LANDMARK', 'LOT', 'MULTI_USES',
        'NUM_UNITS', 'OPEN_SPACE', 'OWNER_MORE', 'OWNER_NAME', 'OWN_ADD_L1',
        'OWN_ADD_L2', 'OWN_ADD_L3', 'PARC_CODE', 'PIN', 'PLAT_NAME',
        'PREFIXTYPE', 'PREFIX_DIR', 'SALE_DATE', 'SALE_VALUE', 'SCHOOL_DST',
        'SPEC_ASSES', 'STREETNAME', 'STREETTYPE', 'SUFFIX_DIR', 'Shape_Area',
        'Shape_Leng', 'TAX_ADD_L1', 'TAX_ADD_L2', 'TAX_ADD_L3', 'TAX_CAPAC',
        'TAX_EXEMPT', 'TAX_NAME', 'TOTAL_TAX', 'UNIT_INFO', 'USE1_DESC',
        'USE2_DESC', 'USE3_DESC', 'USE4_DESC', 'WSHD_DIST', 'XUSE1_DESC',
        'XUSE2_DESC', 'XUSE3_DESC', 'XUSE4_DESC', 'YEAR_BUILT', 'Year', 'ZIP',
        'ZIP4', 'centr

#### Problem 7 

In [14]:
col_sets = [(y, set(h)) for y, h in all_headers]
col_sets[0]

(2004,
 {'ACRES_DEED',
  'ACRES_POLY',
  'AGPRE_ENRD',
  'AGPRE_EXPD',
  'AG_PRESERV',
  'BASEMENT',
  'BLDG_NUM',
  'BLOCK',
  'CITY',
  'CITY_USPS',
  'COOLING',
  'COUNTY_ID',
  'DWELL_TYPE',
  'EMV_BLDG',
  'EMV_LAND',
  'EMV_TOTAL',
  'FIN_SQ_FT',
  'GARAGE',
  'GARAGESQFT',
  'GREEN_ACRE',
  'HEATING',
  'HOMESTEAD',
  'HOME_STYLE',
  'ID',
  'LANDMARK',
  'LOT',
  'MULTI_USES',
  'NUM_UNITS',
  'OPEN_SPACE',
  'OWNER_MORE',
  'OWNER_NAME',
  'OWN_ADD_L1',
  'OWN_ADD_L2',
  'OWN_ADD_L3',
  'PARC_CODE',
  'PIN',
  'PLAT_NAME',
  'PREFIXTYPE',
  'PREFIX_DIR',
  'SALE_DATE',
  'SALE_VALUE',
  'SCHOOL_DST',
  'SPEC_ASSES',
  'STREETNAME',
  'STREETTYPE',
  'SUFFIX_DIR',
  'Shape_Area',
  'Shape_Leng',
  'TAX_ADD_L1',
  'TAX_ADD_L2',
  'TAX_ADD_L3',
  'TAX_CAPAC',
  'TAX_EXEMPT',
  'TAX_NAME',
  'TOTAL_TAX',
  'UNIT_INFO',
  'USE1_DESC',
  'USE2_DESC',
  'USE3_DESC',
  'USE4_DESC',
  'WSHD_DIST',
  'XUSE1_DESC',
  'XUSE2_DESC',
  'XUSE3_DESC',
  'XUSE4_DESC',
  'YEAR_BUILT',
  'Year',

In [15]:
_ , common_cols = col_sets[0]
for _, s in col_sets[1:]:
    common_cols = common_cols.intersection(s)
len(common_cols)

70
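
The loop above is an instance of the accumulator pattern. A minimal sketch on made-up column sets shows the same idea, an equivalent `functools.reduce` form, and why the empty set is a poor starting value for an intersection. The names `toy_sets`, `common`, `common_reduce`, and `seeded_empty` are hypothetical.

```python
from functools import reduce

# Made-up column sets standing in for the per-file sets in col_sets.
toy_sets = [{'PIN', 'CITY', 'ID'}, {'PIN', 'CITY', 'ZIP'}, {'PIN', 'CITY'}]

# Accumulator pattern: seed with the first set, intersect with the rest.
common = toy_sets[0]
for s in toy_sets[1:]:
    common = common.intersection(s)

# Same computation with reduce; with no initial value the first set seeds it.
common_reduce = reduce(set.intersection, toy_sets)

# Seeding with the empty set always yields the empty set.
seeded_empty = reduce(set.intersection, toy_sets, set())

print(common)
print(seeded_empty)
```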

In [16]:
common_cols

{'ACRES_DEED',
 'ACRES_POLY',
 'AGPRE_ENRD',
 'AGPRE_EXPD',
 'AG_PRESERV',
 'BASEMENT',
 'BLDG_NUM',
 'BLOCK',
 'CITY',
 'CITY_USPS',
 'COOLING',
 'COUNTY_ID',
 'DWELL_TYPE',
 'EMV_BLDG',
 'EMV_LAND',
 'EMV_TOTAL',
 'FIN_SQ_FT',
 'GARAGE',
 'GARAGESQFT',
 'GREEN_ACRE',
 'HEATING',
 'HOMESTEAD',
 'HOME_STYLE',
 'LANDMARK',
 'LOT',
 'MULTI_USES',
 'NUM_UNITS',
 'OPEN_SPACE',
 'OWNER_MORE',
 'OWNER_NAME',
 'OWN_ADD_L1',
 'OWN_ADD_L2',
 'OWN_ADD_L3',
 'PARC_CODE',
 'PIN',
 'PLAT_NAME',
 'PREFIXTYPE',
 'PREFIX_DIR',
 'SALE_DATE',
 'SALE_VALUE',
 'SCHOOL_DST',
 'SPEC_ASSES',
 'STREETNAME',
 'STREETTYPE',
 'SUFFIX_DIR',
 'Shape_Area',
 'Shape_Leng',
 'TAX_ADD_L1',
 'TAX_ADD_L2',
 'TAX_ADD_L3',
 'TAX_CAPAC',
 'TAX_EXEMPT',
 'TAX_NAME',
 'TOTAL_TAX',
 'UNIT_INFO',
 'USE1_DESC',
 'USE2_DESC',
 'USE3_DESC',
 'USE4_DESC',
 'WSHD_DIST',
 'XUSE1_DESC',
 'XUSE2_DESC',
 'XUSE3_DESC',
 'XUSE4_DESC',
 'YEAR_BUILT',
 'Year',
 'ZIP',
 'ZIP4',
 'centroid_lat',
 'centroid_long'}

In [49]:
cols_not_in_common = [(y, s.difference(common_cols)) for y, s in col_sets]

In [51]:
col_sets = [(y, set(h)) for y, h in all_headers]
col_sets[0]

(2002,
 {'ACRES_DEED',
  'ACRES_POLY',
  'AGPRE_ENRD',
  'AGPRE_EXPD',
  'AG_PRESERV',
  'BASEMENT',
  'BLDG_NUM',
  'BLOCK',
  'CITY',
  'CITY_USPS',
  'COOLING',
  'COUNTY_ID',
  'DWELL_TYPE',
  'EMV_BLDG',
  'EMV_LAND',
  'EMV_TOTAL',
  'FIN_SQ_FT',
  'GARAGE',
  'GARAGESQFT',
  'GREEN_ACRE',
  'HEATING',
  'HOMESTEAD',
  'HOME_STYLE',
  'LANDMARK',
  'LOT',
  'MULTI_USES',
  'NUM_UNITS',
  'OPEN_SPACE',
  'OWNER_MORE',
  'OWNER_NAME',
  'OWN_ADD_L1',
  'OWN_ADD_L2',
  'OWN_ADD_L3',
  'OWN_NAME',
  'PARC_CODE',
  'PIN',
  'PIN_1',
  'PLAT_NAME',
  'PREFIXTYPE',
  'PREFIX_DIR',
  'SALE_DATE',
  'SALE_VALUE',
  'SCHOOL_DST',
  'SPEC_ASSES',
  'STREET',
  'STREETNAME',
  'STREETTYPE',
  'STRUC_TYPE',
  'SUFFIX_DIR',
  'Shape_Area',
  'Shape_Leng',
  'TAX_ADD_L1',
  'TAX_ADD_L2',
  'TAX_ADD_L3',
  'TAX_ADD_LI',
  'TAX_CAPAC',
  'TAX_EXEMPT',
  'TAX_NAME',
  'TOTAL_TAX',
  'UNIT_INFO',
  'USE1_DESC',
  'USE2_DESC',
  'USE3_DESC',
  'USE4_DESC',
  'WSHD_DIST',
  'XUSE1_DESC',
  'XUSE2_DES

In [52]:
_ , common_cols = col_sets[0]
for _, s in col_sets[1:]:
    common_cols = common_cols.intersection(s)
len(common_cols)

31

In [53]:
cols_not_in_common = [(y, s.difference(common_cols)) for y, s in col_sets]

In [54]:
num_cols_not_in_common = [(y, len(s)) for y, s in cols_not_in_common]
num_cols_not_in_common

[(2002, 44),
 (2003, 3),
 (2004, 40),
 (2005, 39),
 (2006, 39),
 (2007, 41),
 (2008, 41),
 (2009, 41),
 (2010, 40),
 (2011, 39),
 (2012, 39),
 (2013, 39),
 (2014, 43),
 (2015, 39)]

#### Problem 8

In [56]:
[(y, s) for y, s in cols_not_in_common if len(s) > 0]

[(2002,
  {'ACRES_DEED',
   'ACRES_POLY',
   'AGPRE_ENRD',
   'AGPRE_EXPD',
   'AG_PRESERV',
   'BASEMENT',
   'BLOCK',
   'CITY_USPS',
   'COOLING',
   'DWELL_TYPE',
   'FIN_SQ_FT',
   'GARAGE',
   'GARAGESQFT',
   'GREEN_ACRE',
   'HEATING',
   'HOME_STYLE',
   'LANDMARK',
   'LOT',
   'MULTI_USES',
   'OPEN_SPACE',
   'OWNER_MORE',
   'OWNER_NAME',
   'OWN_NAME',
   'PIN_1',
   'PLAT_NAME',
   'PREFIXTYPE',
   'PREFIX_DIR',
   'SPEC_ASSES',
   'STREET',
   'STREETNAME',
   'STREETTYPE',
   'STRUC_TYPE',
   'SUFFIX_DIR',
   'TAX_ADD_LI',
   'UNIT_INFO',
   'USE1_DESC',
   'USE2_DESC',
   'USE3_DESC',
   'USE4_DESC',
   'XUSE1_DESC',
   'XUSE2_DESC',
   'XUSE3_DESC',
   'XUSE4_DESC',
   'ZIP4'}),
 (2003, {'OWN_NAME', 'STREET', 'STRUC_TYPE'}),
 (2004,
  {'ACRES_DEED',
   'ACRES_POLY',
   'AGPRE_ENRD',
   'AGPRE_EXPD',
   'AG_PRESERV',
   'BASEMENT',
   'BLOCK',
   'CITY_USPS',
   'COOLING',
   'DWELL_TYPE',
   'FIN_SQ_FT',
   'GARAGE',
   'GARAGESQFT',
   'GREEN_ACRE',
   'HEATING',
  

#### Problem 7 (2004+)

In [40]:
col_sets = [(y, set(h)) for y, h in all_headers if y >= 2004]
col_sets[0]

(2004,
 {'ACRES_DEED',
  'ACRES_POLY',
  'AGPRE_ENRD',
  'AGPRE_EXPD',
  'AG_PRESERV',
  'BASEMENT',
  'BLDG_NUM',
  'BLOCK',
  'CITY',
  'CITY_USPS',
  'COOLING',
  'COUNTY_ID',
  'DWELL_TYPE',
  'EMV_BLDG',
  'EMV_LAND',
  'EMV_TOTAL',
  'FIN_SQ_FT',
  'GARAGE',
  'GARAGESQFT',
  'GREEN_ACRE',
  'HEATING',
  'HOMESTEAD',
  'HOME_STYLE',
  'ID',
  'LANDMARK',
  'LOT',
  'MULTI_USES',
  'NUM_UNITS',
  'OPEN_SPACE',
  'OWNER_MORE',
  'OWNER_NAME',
  'OWN_ADD_L1',
  'OWN_ADD_L2',
  'OWN_ADD_L3',
  'PARC_CODE',
  'PIN',
  'PLAT_NAME',
  'PREFIXTYPE',
  'PREFIX_DIR',
  'SALE_DATE',
  'SALE_VALUE',
  'SCHOOL_DST',
  'SPEC_ASSES',
  'STREETNAME',
  'STREETTYPE',
  'SUFFIX_DIR',
  'Shape_Area',
  'Shape_Leng',
  'TAX_ADD_L1',
  'TAX_ADD_L2',
  'TAX_ADD_L3',
  'TAX_CAPAC',
  'TAX_EXEMPT',
  'TAX_NAME',
  'TOTAL_TAX',
  'UNIT_INFO',
  'USE1_DESC',
  'USE2_DESC',
  'USE3_DESC',
  'USE4_DESC',
  'WSHD_DIST',
  'XUSE1_DESC',
  'XUSE2_DESC',
  'XUSE3_DESC',
  'XUSE4_DESC',
  'YEAR_BUILT',
  'Year',

In [41]:
_ , common_cols = col_sets[0]
for _, s in col_sets[1:]:
    common_cols = common_cols.intersection(s)
len(common_cols)

70

In [42]:
cols_not_in_common = [(y, s.difference(common_cols)) for y, s in col_sets]

In [43]:
num_cols_not_in_common = [(y, len(s)) for y, s in cols_not_in_common]
num_cols_not_in_common

[(2004, 1),
 (2005, 0),
 (2006, 0),
 (2007, 2),
 (2008, 2),
 (2009, 2),
 (2010, 1),
 (2011, 0),
 (2012, 0),
 (2013, 0),
 (2014, 4),
 (2015, 0)]

#### Problem 8 (2004+)

In [44]:
[(y, s) for y, s in cols_not_in_common if len(s) > 0]

[(2004, {'ID'}),
 (2007, {'Garage', 'Homestead'}),
 (2008, {'Garage', 'Homestead'}),
 (2009, {'Garage', 'Homestead'}),
 (2010, {'Garage'}),
 (2014, {'Shape_Le_1', 'Shape_STAr', 'Shape_STLe', 'TORRENS'})]
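
The cell above keeps the extra column names themselves. If you want the `tuple` of file name and number of extra columns that step 8 asks for, one possible sketch is shown below. The three file names and column sets are made-up stand-ins; in the notebook they would come from `year_and_files` and `col_sets`.

```python
# Made-up stand-ins for the real file names and column name sets.
file_names = ['2004_metro_tax_parcels.txt',
              '2005_metro_tax_parcels.txt',
              '2007_metro_tax_parcels.txt']
common_cols = {'PIN', 'CITY'}
col_name_sets = [{'PIN', 'CITY', 'ID'},
                 {'PIN', 'CITY'},
                 {'PIN', 'CITY', 'Garage', 'Homestead'}]

# common_cols is a subset of every column set, so the difference in
# lengths is the number of extra columns in that file.
extra_counts = [(name, len(cols) - len(common_cols))
                for name, cols in zip(file_names, col_name_sets)
                if len(cols) > len(common_cols)]

print(extra_counts)
```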