# Project 1 Part 4 - Creating a master parcel database

In this part of the project, we will use Python to read, process, and double all of the parcel data into a database.  Note that this is not our only alternative, and in Project 1 Part 4 b, we will look at another alternative, that is reading all the of original, raw files into their own database table, then using SQL to join/link/aggregate the tables.

## Chunking Files in Pandas – Part 1 (20 Points)

In this part of the project, you will use `Panda`’s to process the data from the MinneMUDAC 2016 competition Dive into Water Data.  The data can be found at the [MinneMUDAC site](http://minneanalytics.org/minnemudac/data/).  You should document your work in a Jupyter notebook, which will be used to submit your solution.  **For the rest of the parts of this project, we will limit ourselves to the years 2004-2014.**

1. Remind me why we want to skip 2003.

2003 has a lot less columns in common with the other years

2. Import the common columns list and translation dictionaries from the `.py` file you created in the last part of the project.

In [47]:
from project_data_Buske import common_columns

In [48]:
from project_data_Buske import latlong_to_code

In [49]:
from project_data_Buske import latlong_to_name

In [50]:
from project_data_Buske import code_to_name

In [51]:
from project_data_Buske import code_to_distance

In [52]:
from project_data_Buske import latlong_to_distance

3. Use glob and a list comprehension to get a list of file names for the years 2004-2014.

In [53]:
from glob import glob
files = glob('./data/MinneMUDAC_raw_files/20**_metro_tax_parcels.txt')[2:-1]
[file for file in files]

['./data/MinneMUDAC_raw_files/2004_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files/2005_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files/2006_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files/2007_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files/2008_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files/2009_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files/2010_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files/2011_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files/2012_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files/2013_metro_tax_parcels.txt',
 './data/MinneMUDAC_raw_files/2014_metro_tax_parcels.txt']

In [None]:
float_cols = ['EMV_BLDG', 'EMV_LAND', 'EMV_TOTAL']
string_cols = ['CITY', 'CITY_USPS', 'COOLING', 'DWELL_TYPE' ]
int_cols = ['COUNTY_ID' ]
nullable_int_cols = [ ]
date_cols = [ ]
messy_cols = ['BLOCK']

In [10]:
parc = pd.read_csv('./data/MinneMUDAC_raw_files/2004_metro_tax_parcels.txt',sep = '|', dtype={'centroid_lat':str, 'centroid_long':str})

  interactivity=interactivity, compiler=compiler, result=result)


In [44]:
parc.columns.shape

(71,)

In [39]:
#list((sorted(common_columns)))

4. Use the first chunk of the first file to prototype an expression that <br>
    a. Selects the common columns <br>
    b. Fixes any issues with the column names <br>
    c. Changes columns to the correct types (if necessary).  More information about the columns can be found [here](ftp://ftp.gisdata.mn.gov/pub/gdrs/data/pub/us_mn_state_metrogis/plan_regonal_prcls_open/metadata/metadata.html). It is **imperative** that you keep the lat and long columns as strings. <br>
    d. Use the translation dictionaries from the last part to add three new columns to the chunk: lake code, lake name, parcel distance to the lake.
    e. Filters to only properties that are within 1600 m (~1 mile) of the closest lake.

In [67]:
import pandas as pd
fileiterator = [pd.read_csv(file, chunksize = 50000, sep = '|', dtype={'centroid_lat': str, 'centroid_long': str}) for file in files]
fileiterator

[<pandas.io.parsers.TextFileReader at 0x16fa1aa90>,
 <pandas.io.parsers.TextFileReader at 0x16fcc8828>,
 <pandas.io.parsers.TextFileReader at 0x16fcc8c18>,
 <pandas.io.parsers.TextFileReader at 0x16fcc8f60>,
 <pandas.io.parsers.TextFileReader at 0x16fcc84e0>,
 <pandas.io.parsers.TextFileReader at 0x16fcc82b0>,
 <pandas.io.parsers.TextFileReader at 0x16f19e5c0>,
 <pandas.io.parsers.TextFileReader at 0x16f19e860>,
 <pandas.io.parsers.TextFileReader at 0x16f19ee48>,
 <pandas.io.parsers.TextFileReader at 0x16f19ecc0>,
 <pandas.io.parsers.TextFileReader at 0x16f19e240>]

In [55]:
from toolz import first

In [56]:
firstchunk = first(fileiterator[0])
firstchunk.head()

  if self.run_code(code, result):


Unnamed: 0,ACRES_DEED,ACRES_POLY,AGPRE_ENRD,AGPRE_EXPD,AG_PRESERV,BASEMENT,BLDG_NUM,BLOCK,CITY,CITY_USPS,...,XUSE1_DESC,XUSE2_DESC,XUSE3_DESC,XUSE4_DESC,YEAR_BUILT,Year,ZIP,ZIP4,centroid_lat,centroid_long
0,0.0,8.03,,,N,,,,SAINT FRANCIS,,...,,,,,1980.0,2004,,,45.41332,-93.26739
1,0.0,0.93,,,N,,24457.0,,SAINT FRANCIS,BETHEL,...,,,,,1974.0,2004,55005.0,,45.41354,-93.2701
2,0.0,8.75,,,N,,24442.0,,SAINT FRANCIS,BETHEL,...,,,,,1969.0,2004,55005.0,,45.41318,-93.27344
3,0.0,11.17,,,N,,410.0,,SAINT FRANCIS,BETHEL,...,,,,,1989.0,2004,55005.0,,45.41167,-93.27684
4,0.0,14.46,,,N,,480.0,,SAINT FRANCIS,BETHEL,...,,,,,1995.0,2004,55070.0,,45.41169,-93.27849


In [57]:
from dfply import *
from more_dfply import *

In [61]:
exp = (firstchunk 
         >> select(cols_to_keep)
         >> mutate(lat_long = pd.Series(zip(firstchunk.centroid_lat, firstchunk.centroid_long)))
         >> mutate(lake_code = recode(X.lat_long, latlong_to_code))
         >> mutate(lake_name = recode(X.lat_long, latlong_to_name))
         >> mutate(parcel_dist_to_lake = recode(X.lat_long, latlong_to_distance))
         >> filter_by(~X.lake_name.isna())
         >> filter_by(X.parcel_dist_to_lake <= 1600)
      )         
exp

Unnamed: 0,ACRES_DEED,ACRES_POLY,AGPRE_ENRD,AG_PRESERV,BASEMENT,CITY,COOLING,DWELL_TYPE,EMV_BLDG,EMV_LAND,...,XUSE3_DESC,XUSE4_DESC,YEAR_BUILT,Year,centroid_lat,centroid_long,lat_long,lake_code,lake_name,parcel_dist_to_lake
42707,0.0,28.29,,N,,LINO LAKES,,,0.0,209900.0,...,,,0.0,2004,45.21007,-93.0241,"(45.21007, -93.0241)",02000400-01,Peltier Lake,777.247256
42708,0.0,39.31,,N,,LINO LAKES,,,0.0,141520.0,...,,,0.0,2004,45.20913,-93.03251,"(45.20913, -93.03251)",02000400-01,Peltier Lake,777.247256
42709,0.0,27.69,,N,,LINO LAKES,,,0.0,261000.0,...,,,0.0,2004,45.20933,-93.03805,"(45.20933, -93.03805)",02000400-01,Peltier Lake,777.247256
42710,0.0,29.05,,N,,LINO LAKES,,,0.0,14700.0,...,,,0.0,2004,45.20952,-93.0438,"(45.20952, -93.0438)",02000400-01,Peltier Lake,777.247256
42711,0.0,37.53,,N,,LINO LAKES,,,0.0,148960.0,...,,,0.0,2004,45.20919,-93.04851,"(45.20919, -93.04851)",02000400-01,Peltier Lake,777.247256
42712,0.0,20.17,,N,,LINO LAKES,,,0.0,147940.0,...,,,0.0,2004,45.20927,-93.05237,"(45.20927, -93.05237)",02000400-01,Peltier Lake,777.247256
42713,0.0,16.14,,N,,LINO LAKES,,,0.0,147940.0,...,,,0.0,2004,45.2091,-93.05506,"(45.2091, -93.05506)",02000400-01,Peltier Lake,777.247256
42714,0.0,10.51,,N,,LINO LAKES,,,0.0,154700.0,...,,,0.0,2004,45.21015,-93.05759,"(45.21015, -93.05759)",02000400-01,Peltier Lake,777.247256
42715,0.0,14.38,,N,,LINO LAKES,,,0.0,7200.0,...,,,0.0,2004,45.20923,-93.06032,"(45.20923, -93.06032)",02000400-01,Peltier Lake,777.247256
42716,0.0,24.79,,N,,LINO LAKES,,,0.0,771215.0,...,,,0.0,2004,45.20659,-93.06247,"(45.20659, -93.06247)",02000400-01,Peltier Lake,777.247256


In [59]:
cols_to_keep = ['ACRES_DEED',
                'ACRES_POLY',
                'AGPRE_ENRD',
                'AG_PRESERV',
                'BASEMENT',
                'CITY',
                'COOLING',
                'DWELL_TYPE',
                'EMV_BLDG',
                'EMV_LAND',
                'FIN_SQ_FT',
                'GARAGE', 
                'GARAGESQFT', 
                'GREEN_ACRE',
                'HOMESTEAD',
                #'ID',
                'LANDMARK',
                'OWN_ADD_L1',
                'OWN_ADD_L2',
                'OWN_ADD_L3',
                'PARC_CODE',
                'PIN', 
                'SALE_VALUE', 
                'SPEC_ASSES',
                'TAX_CAPAC', 
                'TAX_EXEMPT', 
                'TOTAL_TAX', 
                'USE1_DESC',
                'USE2_DESC',
                'USE3_DESC',
                'USE4_DESC',
                'WSHD_DIST',
                'XUSE1_DESC',
                'XUSE2_DESC',
                'XUSE3_DESC',
                'XUSE4_DESC',
                'YEAR_BUILT',
                'Year',
                'centroid_lat',
                'centroid_long']

In [60]:
cols_to_drop = ['AGPRE_EXPD',#drop
                'BLDG_NUM',#drop
                'BLOCK',#drop
                'CITY_USPS',#drop
                'COUNTY_ID',#drop
                'EMV_TOTAL',#drop
                'HEATING', #drop
                'HOME_STYLE',#drop
                'LOT',#drop
                'MULTI_USES',#drop
                'NUM_UNITS',#drop
                'OPEN_SPACE',#drop
                'OWNER_MORE',#drop
                'OWNER_NAME',#drop
                'PLAT_NAME', #drop
                'PREFIXTYPE', #drop
                'PREFIX_DIR', #drop
                'SALE_DATE', #drop
                'SCHOOL_DST', #drop
                'STREETNAME', #drop
                'STREETTYPE', #drop
                'SUFFIX_DIR', #drop
                'Shape_Area', #drop
                'Shape_Leng', #drop
                'TAX_ADD_L1', #drop
                'TAX_ADD_L2', #drop
                'TAX_ADD_L3', #drop
                'TAX_NAME', #drop
                'UNIT_INFO', #drop
                'ZIP',#drop
                'ZIP4',#drop
                ]

5. Now convert your expression from the last problem to a function and test that this function works on the first few chunks of each file.

In [62]:
from functoolz import pipeable
read_parcel = lambda path: pd.read_csv(path, chunksize = 500, sep = '|', dtype={'centroid_lat':str, 'centroid_long':str})
read_parcel('./data/MinneMUDAC_raw_files/2004_metro_tax_parcels.txt')

<pandas.io.parsers.TextFileReader at 0x1971a20b8>

In [63]:
addcolumns = pipeable(lambda chunk: (chunk 
                                         >> select(cols_to_keep)
                                         >> mutate(lat_long = pd.Series(zip(chunk.centroid_lat, chunk.centroid_long)))
                                         >> mutate(lake_code = recode(X.lat_long, latlong_to_code))
                                         >> mutate(lake_name = recode(X.lat_long, latlong_to_name))
                                         >> mutate(parcel_dist_to_lake = recode(X.lat_long, latlong_to_distance))
                                         >> filter_by(~X.lake_name.isna())
                                         >> filter_by(X.parcel_dist_to_lake <= 1600)
                            )     )    

In [64]:
ex = read_parcel(files[2])
next(ex).head()

Unnamed: 0,ACRES_DEED,ACRES_POLY,AGPRE_ENRD,AGPRE_EXPD,AG_PRESERV,BASEMENT,BLDG_NUM,BLOCK,CITY,CITY_USPS,...,XUSE1_DESC,XUSE2_DESC,XUSE3_DESC,XUSE4_DESC,YEAR_BUILT,Year,ZIP,ZIP4,centroid_long,centroid_lat
0,0.0,0.16,,,N,Y,14195.0,1,ANDOVER,ANDOVER,...,,,,,2000.0,2006,55304.0,4187.0,-93.26607,45.22905
1,0.0,0.15,,,N,Y,14189.0,1,ANDOVER,ANDOVER,...,,,,,1999.0,2006,55304.0,4187.0,-93.26591,45.22892
2,0.0,0.14,,,N,Y,14177.0,1,ANDOVER,ANDOVER,...,,,,,1999.0,2006,55304.0,4187.0,-93.26566,45.22864
3,0.0,0.16,,,N,Y,14165.0,1,ANDOVER,ANDOVER,...,,,,,1999.0,2006,55304.0,4187.0,-93.26547,45.22829
4,0.0,0.13,,,N,Y,14159.0,1,ANDOVER,ANOKA,...,,,,,2000.0,2006,55304.0,4187.0,-93.26548,45.22811


In [65]:
addcolumns(firstchunk)

Unnamed: 0,ACRES_DEED,ACRES_POLY,AGPRE_ENRD,AG_PRESERV,BASEMENT,CITY,COOLING,DWELL_TYPE,EMV_BLDG,EMV_LAND,...,XUSE3_DESC,XUSE4_DESC,YEAR_BUILT,Year,centroid_lat,centroid_long,lat_long,lake_code,lake_name,parcel_dist_to_lake
42707,0.0,28.29,,N,,LINO LAKES,,,0.0,209900.0,...,,,0.0,2004,45.21007,-93.0241,"(45.21007, -93.0241)",02000400-01,Peltier Lake,777.247256
42708,0.0,39.31,,N,,LINO LAKES,,,0.0,141520.0,...,,,0.0,2004,45.20913,-93.03251,"(45.20913, -93.03251)",02000400-01,Peltier Lake,777.247256
42709,0.0,27.69,,N,,LINO LAKES,,,0.0,261000.0,...,,,0.0,2004,45.20933,-93.03805,"(45.20933, -93.03805)",02000400-01,Peltier Lake,777.247256
42710,0.0,29.05,,N,,LINO LAKES,,,0.0,14700.0,...,,,0.0,2004,45.20952,-93.0438,"(45.20952, -93.0438)",02000400-01,Peltier Lake,777.247256
42711,0.0,37.53,,N,,LINO LAKES,,,0.0,148960.0,...,,,0.0,2004,45.20919,-93.04851,"(45.20919, -93.04851)",02000400-01,Peltier Lake,777.247256
42712,0.0,20.17,,N,,LINO LAKES,,,0.0,147940.0,...,,,0.0,2004,45.20927,-93.05237,"(45.20927, -93.05237)",02000400-01,Peltier Lake,777.247256
42713,0.0,16.14,,N,,LINO LAKES,,,0.0,147940.0,...,,,0.0,2004,45.2091,-93.05506,"(45.2091, -93.05506)",02000400-01,Peltier Lake,777.247256
42714,0.0,10.51,,N,,LINO LAKES,,,0.0,154700.0,...,,,0.0,2004,45.21015,-93.05759,"(45.21015, -93.05759)",02000400-01,Peltier Lake,777.247256
42715,0.0,14.38,,N,,LINO LAKES,,,0.0,7200.0,...,,,0.0,2004,45.20923,-93.06032,"(45.20923, -93.06032)",02000400-01,Peltier Lake,777.247256
42716,0.0,24.79,,N,,LINO LAKES,,,0.0,771215.0,...,,,0.0,2004,45.20659,-93.06247,"(45.20659, -93.06247)",02000400-01,Peltier Lake,777.247256


In [18]:
ch = first(fileiterator[4])
addcolumns(ch)

AttributeError: 'TextFileReader' object has no attribute 'iloc'

In [None]:
ch2 = next(fileiterator[5])
addcolumns(ch2)

6. We need to make a unique primary key for each row in the combined parcel file.<br>
    a. There is a column that appears to be a unique parcel id.  Double check that this is a true primary key for each individual file. (To do this you need to verify that the number of unique values is the same as the number of rows for each of the parcel files.  **Hint:** For each file, use of the accumulator pattern with two accumualtors (one number and one data frame). <br>
    b. Explain why this column will not work as a primary key if we want to combine all years in one database. <br>
    c. Suppose we make a new column that consist of `str(year) + '-' + PIN`.  Explain why this should make a proper primary key for the combined data. <br>

In [68]:
firstchunks = [first(f) for f in fileiterator]

AttributeError: 'TextFileReader' object has no attribute 'iloc'

In [25]:
nrows = 0
unique_list = []
for ch in firstchunks:
    nrows = nrows + len(ch)
    print(nrows)
    for x in ch['PIN']:
        if x not in unique_list:
            unique_list.append(x)
    #print(sum(unique_list))    
    #pinset = set(ch['PIN'])
    #unique = unique 
    #unique = unique + ch['PIN'].intersection(unique)
    #unique = pinset.intersection(pinset)
    #print(unique)

NameError: name 'firstchunks' is not defined

In [135]:
#[(col, exp[col].is_unique) for col in exp]

This columm will not work as a primary key if we want to combine all the years because there isn't a difference between each year, we can't tell them apart.

str(year) + '-' + PIN should make a proper primary key because adding the year will make the key unique.

7. Make a function to add the key suggested in the last problem (`str(year) + '-' + PIN`) to a given chunk.

In [69]:
add_primary_key = pipeable(lambda start, df: (df
                                              >> mutate(id = np.arange(start, start + len(df))
                                              )))

In [70]:
c_size = 50000
from dfply import head
i = 0
(exp 
 >> add_primary_key(i*c_size) 
 >> head(3))

Unnamed: 0,ACRES_DEED,ACRES_POLY,AGPRE_ENRD,AG_PRESERV,BASEMENT,CITY,COOLING,DWELL_TYPE,EMV_BLDG,EMV_LAND,...,XUSE4_DESC,YEAR_BUILT,Year,centroid_lat,centroid_long,lat_long,lake_code,lake_name,parcel_dist_to_lake,id
42707,0.0,28.29,,N,,LINO LAKES,,,0.0,209900.0,...,,0.0,2004,45.21007,-93.0241,"(45.21007, -93.0241)",02000400-01,Peltier Lake,777.247256,0
42708,0.0,39.31,,N,,LINO LAKES,,,0.0,141520.0,...,,0.0,2004,45.20913,-93.03251,"(45.20913, -93.03251)",02000400-01,Peltier Lake,777.247256,1
42709,0.0,27.69,,N,,LINO LAKES,,,0.0,261000.0,...,,0.0,2004,45.20933,-93.03805,"(45.20933, -93.03805)",02000400-01,Peltier Lake,777.247256,2


In [71]:
process_chunk = pipeable(lambda i, df, chunksize=c_size: df >> addcolumns >> add_primary_key(i*c_size))
exp >> process_chunk(0) >> tail(1)

Unnamed: 0,ACRES_DEED,ACRES_POLY,AGPRE_ENRD,AG_PRESERV,BASEMENT,CITY,COOLING,DWELL_TYPE,EMV_BLDG,EMV_LAND,...,XUSE4_DESC,YEAR_BUILT,Year,centroid_lat,centroid_long,lat_long,lake_code,lake_name,parcel_dist_to_lake,id


In [88]:
from more_sqlalchemy import get_sql_types
i = 0
complete_first_chunk = exp >> process_chunk(i)
sql_types = common_parcel_types
#sql_types = get_sql_types(complete_first_chunk)
sql_types

{'ACRES_DEED': sqlalchemy.sql.sqltypes.Float,
 'ACRES_POLY': sqlalchemy.sql.sqltypes.Float,
 'AGPRE_ENRD': sqlalchemy.sql.sqltypes.DateTime,
 'AGPRE_EXPD': sqlalchemy.sql.sqltypes.DateTime,
 'AG_PRESERV': sqlalchemy.sql.sqltypes.String,
 'BASEMENT': sqlalchemy.sql.sqltypes.String,
 'BLDG_NUM': sqlalchemy.sql.sqltypes.Float,
 'BLOCK': sqlalchemy.sql.sqltypes.String,
 'CITY': sqlalchemy.sql.sqltypes.String,
 'CITY_USPS': sqlalchemy.sql.sqltypes.String,
 'COOLING': sqlalchemy.sql.sqltypes.String,
 'COUNTY_ID': sqlalchemy.sql.sqltypes.Integer,
 'DWELL_TYPE': sqlalchemy.sql.sqltypes.String,
 'EMV_BLDG': sqlalchemy.sql.sqltypes.Float,
 'EMV_LAND': sqlalchemy.sql.sqltypes.Float,
 'EMV_TOTAL': sqlalchemy.sql.sqltypes.Float,
 'FIN_SQ_FT': sqlalchemy.sql.sqltypes.Float,
 'GARAGE': sqlalchemy.sql.sqltypes.String,
 'GARAGESQFT': sqlalchemy.sql.sqltypes.Float,
 'GREEN_ACRE': sqlalchemy.sql.sqltypes.Float,
 'HEATING': sqlalchemy.sql.sqltypes.String,
 'HOMESTEAD': sqlalchemy.sql.sqltypes.String,
 'HO

In [81]:
from sqlalchemy import String, Float, Integer, DateTime
common_parcel_types1 = {'ACRES_DEED':String,
                       'ACRES_POLY':String,
                       'AGPRE_ENRD':String,
                       'AGPRE_EXPD':String,
                       'AG_PRESERV':String,
                       'BASEMENT':String,
                       'BLDG_NUM':String,
                       'BLOCK':String,
                       'CITY':String,
                       'CITY_USPS':String,
                       'COOLING':String,
                       'COUNTY_ID':String,
                       'DWELL_TYPE':String,
                       'EMV_BLDG':String,
                       'EMV_LAND':String,
                       'EMV_TOTAL':String,
                       'FIN_SQ_FT':String,
                       'GARAGE':String,
                       'GARAGESQFT':String,
                       'GREEN_ACRE':String,
                       'HEATING':String,
                       'HOMESTEAD':String,
                       'HOME_STYLE':String,
                       'ID':String,
                       'LANDMARK':String,
                       'LOT':String,
                       'MULTI_USES':String,
                       'NUM_UNITS':String,
                       'OPEN_SPACE':String,
                       'OWNER_MORE':String,
                       'OWNER_NAME':String,
                       'OWN_ADD_L1':String,
                       'OWN_ADD_L2':String,
                       'OWN_ADD_L3':String,
                       'PARC_CODE':String,
                       'PIN':String,
                       'PLAT_NAME':String,
                       'PREFIXTYPE':String,
                       'PREFIX_DIR':String,
                       'SALE_DATE':String,
                       'SALE_VALUE':String,
                       'SCHOOL_DST':String,
                       'SPEC_ASSES':String,
                       'STREETNAME':String,
                       'STREETTYPE':String,
                       'SUFFIX_DIR':String,
                       'Shape_Area':String,
                       'Shape_Leng':String,
                       'TAX_ADD_L1':String,
                       'TAX_ADD_L2':String,
                       'TAX_ADD_L3':String,
                       'TAX_CAPAC':String,
                       'TAX_EXEMPT':String,
                       'TAX_NAME':String,
                       'TOTAL_TAX':String,
                       'UNIT_INFO':String,
                       'USE1_DESC':String,
                       'USE2_DESC':String,
                       'USE3_DESC':String,
                       'USE4_DESC':String,
                       'WSHD_DIST':String,
                       'XUSE1_DESC':String,
                       'XUSE2_DESC':String,
                       'XUSE3_DESC':String,
                       'XUSE4_DESC':String,
                       'YEAR_BUILT':String,
                       'Year':String,
                       'ZIP':String,
                       'ZIP4':String,
                       'centroid_lat':String,
                       'centroid_long':String}

In [87]:
from sqlalchemy import String, Float, Integer, DateTime
common_parcel_types = {'ACRES_DEED':Float,
                       'ACRES_POLY':Float,
                       'AGPRE_ENRD':DateTime,
                       'AGPRE_EXPD':DateTime,
                       'AG_PRESERV':String,
                       'BASEMENT':String,
                       'BLDG_NUM':Float,
                       'BLOCK':String,
                       'CITY':String,
                       'CITY_USPS':String,
                       'COOLING':String,
                       'COUNTY_ID':Integer,
                       'DWELL_TYPE':String,
                       'EMV_BLDG':Float,
                       'EMV_LAND':Float,
                       'EMV_TOTAL':Float,
                       'FIN_SQ_FT':Float,
                       'GARAGE':String,
                       'GARAGESQFT':Float,
                       'GREEN_ACRE':Float,
                       'HEATING':String,
                       'HOMESTEAD':String,
                       'HOME_STYLE':Float,
                       'ID':String,
                       'LANDMARK':String,
                       'LOT':String,
                       'MULTI_USES':Float,
                       'NUM_UNITS':String,
                       'OPEN_SPACE':Float,
                       'OWNER_MORE':String,
                       'OWNER_NAME':String,
                       'OWN_ADD_L1':String,
                       'OWN_ADD_L2':String,
                       'OWN_ADD_L3':String,
                       'PARC_CODE':Float,
                       'PIN':String,
                       'PLAT_NAME':String,
                       'PREFIXTYPE':Float, 
                       'PREFIX_DIR':String,
                       'SALE_DATE':Integer, 
                       'SALE_VALUE':Float,
                       'SCHOOL_DST':Float, 
                       'SPEC_ASSES':Float,
                       'STREETNAME':String,
                       'STREETTYPE':String, 
                       'SUFFIX_DIR':String, 
                       'Shape_Area':Float, 
                       'Shape_Leng':Float, 
                       'TAX_ADD_L1':String, 
                       'TAX_ADD_L2':String, 
                       'TAX_ADD_L3':String, 
                       'TAX_CAPAC':Float, 
                       'TAX_EXEMPT':String, 
                       'TAX_NAME':String, 
                       'TOTAL_TAX':Float, 
                       'UNIT_INFO':String, 
                       'USE1_DESC':String,
                       'USE2_DESC':String,
                       'USE3_DESC':String,
                       'USE4_DESC':String,
                       'WSHD_DIST':String,
                       'XUSE1_DESC':String,
                       'XUSE2_DESC':String,
                       'XUSE3_DESC':String,
                       'XUSE4_DESC':String,
                       'YEAR_BUILT':Integer,
                       'Year':String,
                       'ZIP':Integer, 
                       'ZIP4':Integer,
                       'centroid_lat':String,
                       'centroid_long':String}

In [89]:
!rm ./databases/lakes.db

In [90]:
from sqlalchemy import create_engine
engine = create_engine('sqlite:///databases/lakes.db', echo=False)

In [91]:
schema = pd.io.sql.get_schema(complete_first_chunk, # dataframe
                              'lakes', # name in SQL db
                              keys='id', # primary key
                              con=engine, # connection
                              dtype=sql_types # SQL types
)
print(schema)
engine.execute(schema)


CREATE TABLE lakes (
	"ACRES_DEED" FLOAT, 
	"ACRES_POLY" FLOAT, 
	"AGPRE_ENRD" DATETIME, 
	"AG_PRESERV" VARCHAR, 
	"BASEMENT" VARCHAR, 
	"CITY" VARCHAR, 
	"COOLING" VARCHAR, 
	"DWELL_TYPE" VARCHAR, 
	"EMV_BLDG" FLOAT, 
	"EMV_LAND" FLOAT, 
	"FIN_SQ_FT" FLOAT, 
	"GARAGE" VARCHAR, 
	"GARAGESQFT" FLOAT, 
	"GREEN_ACRE" FLOAT, 
	"HOMESTEAD" VARCHAR, 
	"LANDMARK" VARCHAR, 
	"OWN_ADD_L1" VARCHAR, 
	"OWN_ADD_L2" VARCHAR, 
	"OWN_ADD_L3" VARCHAR, 
	"PARC_CODE" FLOAT, 
	"PIN" VARCHAR, 
	"SALE_VALUE" FLOAT, 
	"SPEC_ASSES" FLOAT, 
	"TAX_CAPAC" FLOAT, 
	"TAX_EXEMPT" VARCHAR, 
	"TOTAL_TAX" FLOAT, 
	"USE1_DESC" VARCHAR, 
	"USE2_DESC" VARCHAR, 
	"USE3_DESC" VARCHAR, 
	"USE4_DESC" VARCHAR, 
	"WSHD_DIST" VARCHAR, 
	"XUSE1_DESC" VARCHAR, 
	"XUSE2_DESC" VARCHAR, 
	"XUSE3_DESC" VARCHAR, 
	"XUSE4_DESC" VARCHAR, 
	"YEAR_BUILT" INTEGER, 
	"Year" VARCHAR, 
	centroid_lat VARCHAR, 
	centroid_long VARCHAR, 
	lat_long TEXT, 
	lake_code TEXT, 
	lake_name TEXT, 
	parcel_dist_to_lake FLOAT, 
	id BIGINT NOT NULL, 
	CON

<sqlalchemy.engine.result.ResultProxy at 0x171cf9e48>

In [92]:
rows_so_far = 0
nrows = rows_so_far
for f in files:
    c_size = 50000
    print('processing file {0}'.format(f))
    df_iter = enumerate(pd.read_csv(f, 
                                chunksize = c_size, sep = '|', 
                                    dtype={'centroid_lat':str, 'centroid_long':str}))
    for i, chunk in df_iter:
        processed_chunk = chunk >> process_chunk(nrows)
        print('writing chunk {0}'.format(i))
        nrows = nrows + len(chunk)
        processed_chunk.to_sql('lakes', 
                           con=engine, 
                           dtype=sql_types, 
                           index=False,
                           if_exists='append')

processing file ./data/MinneMUDAC_raw_files/2004_metro_tax_parcels.txt


  interactivity=interactivity, compiler=compiler, result=result)


writing chunk 0


StatementError: (builtins.ValueError) could not convert string to float: 'N' [SQL: 'INSERT INTO lakes ("ACRES_DEED", "ACRES_POLY", "AGPRE_ENRD", "AG_PRESERV", "BASEMENT", "CITY", "COOLING", "DWELL_TYPE", "EMV_BLDG", "EMV_LAND", "FIN_SQ_FT", "GARAGE", "GARAGESQFT", "GREEN_ACRE", "HOMESTEAD", "LANDMARK", "OWN_ADD_L1", "OWN_ADD_L2", "OWN_ADD_L3", "PARC_CODE", "PIN", "SALE_VALUE", "SPEC_ASSES", "TAX_CAPAC", "TAX_EXEMPT", "TOTAL_TAX", "USE1_DESC", "USE2_DESC", "USE3_DESC", "USE4_DESC", "WSHD_DIST", "XUSE1_DESC", "XUSE2_DESC", "XUSE3_DESC", "XUSE4_DESC", "YEAR_BUILT", "Year", centroid_lat, centroid_long, lat_long, lake_code, lake_name, parcel_dist_to_lake, id) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)'] [parameters: [{'ACRES_DEED': 0.0, 'ACRES_POLY': 28.29, 'AGPRE_ENRD': None, 'AG_PRESERV': 'N', 'BASEMENT': None, 'CITY': 'LINO LAKES', 'COOLING': None, 'DWELL_TYPE': None, 'EMV_BLDG': 0.0, 'EMV_LAND': 209900.0, 'FIN_SQ_FT': 0.0, 'GARAGE': None, 'GARAGESQFT': None, 'GREEN_ACRE': 'N', 'HOMESTEAD': 'N', 'LANDMARK': None, 'OWN_ADD_L1': 'ATTN: GENE HOULE  6497 N UPPER 48TH ST', 'OWN_ADD_L2': 'OAKDALE', 'OWN_ADD_L3': 'MN,  55128', 'PARC_CODE': 0.0, 'PIN': '003-013122110001', 'SALE_VALUE': 0.0, 'SPEC_ASSES': 0.0, 'TAX_CAPAC': 1119.0, 'TAX_EXEMPT': 'N', 'TOTAL_TAX': 1172.0, 'USE1_DESC': None, 'USE2_DESC': None, 'USE3_DESC': None, 'USE4_DESC': None, 'WSHD_DIST': 'RICE CREEK WATERSHED DISTRICT', 'XUSE1_DESC': None, 'XUSE2_DESC': None, 'XUSE3_DESC': None, 'XUSE4_DESC': None, 'YEAR_BUILT': 0.0, 'Year': 2004, 'centroid_lat': '45.21007', 'centroid_long': '-93.0241', 'lat_long': ('45.21007', '-93.0241'), 'lake_code': '02000400-01', 'lake_name': 'Peltier Lake', 'parcel_dist_to_lake': 777.2472562912886, 'id': 0}, {'ACRES_DEED': 0.0, 'ACRES_POLY': 39.31, 'AGPRE_ENRD': None, 'AG_PRESERV': 'N', 'BASEMENT': None, 'CITY': 'LINO LAKES', 'COOLING': None, 'DWELL_TYPE': None, 'EMV_BLDG': 0.0, 'EMV_LAND': 141520.0, 'FIN_SQ_FT': 0.0, 'GARAGE': None, 'GARAGESQFT': None, 'GREEN_ACRE': 'N', 'HOMESTEAD': 'N', 'LANDMARK': None, 'OWN_ADD_L1': '8101 20TH AVE N', 'OWN_ADD_L2': 'HUGO', 'OWN_ADD_L3': 'MN,  55038', 'PARC_CODE': 0.0, 'PIN': '003-013122210001', 'SALE_VALUE': 40270.0, 'SPEC_ASSES': 0.0, 'TAX_CAPAC': 771.0, 'TAX_EXEMPT': 'N', 'TOTAL_TAX': 808.0, 'USE1_DESC': None, 'USE2_DESC': None, 'USE3_DESC': None, 'USE4_DESC': None, 'WSHD_DIST': 'RICE CREEK WATERSHED DISTRICT', 'XUSE1_DESC': None, 'XUSE2_DESC': None, 'XUSE3_DESC': None, 'XUSE4_DESC': None, 'YEAR_BUILT': 0.0, 'Year': 2004, 'centroid_lat': '45.20913', 'centroid_long': '-93.03251', 'lat_long': ('45.20913', '-93.03251'), 'lake_code': '02000400-01', 'lake_name': 'Peltier Lake', 'parcel_dist_to_lake': 777.2472562912886, 'id': 1}, {'ACRES_DEED': 0.0, 'ACRES_POLY': 27.69, 'AGPRE_ENRD': None, 'AG_PRESERV': 'N', 'BASEMENT': None, 'CITY': 'LINO LAKES', 'COOLING': None, 'DWELL_TYPE': None, 'EMV_BLDG': 0.0, 'EMV_LAND': 261000.0, 'FIN_SQ_FT': 0.0, 'GARAGE': None, 'GARAGESQFT': None, 'GREEN_ACRE': 'N', 'HOMESTEAD': 'N', 'LANDMARK': None, 'OWN_ADD_L1': '8301 20TH AVE N', 'OWN_ADD_L2': 'HUGO', 'OWN_ADD_L3': 'MN,  55038', 'PARC_CODE': 0.0, 'PIN': '003-013122220002', 'SALE_VALUE': 112500.0, 'SPEC_ASSES': 0.0, 'TAX_CAPAC': 1422.0, 'TAX_EXEMPT': 'N', 'TOTAL_TAX': 1490.0, 'USE1_DESC': None, 'USE2_DESC': None, 'USE3_DESC': None, 'USE4_DESC': None, 'WSHD_DIST': 'RICE CREEK WATERSHED DISTRICT', 'XUSE1_DESC': None, 'XUSE2_DESC': None, 'XUSE3_DESC': None, 'XUSE4_DESC': None, 'YEAR_BUILT': 0.0, 'Year': 2004, 'centroid_lat': '45.20933', 'centroid_long': '-93.03805', 'lat_long': ('45.20933', '-93.03805'), 'lake_code': '02000400-01', 'lake_name': 'Peltier Lake', 'parcel_dist_to_lake': 777.2472562912886, 'id': 2}, {'ACRES_DEED': 0.0, 'ACRES_POLY': 29.05, 'AGPRE_ENRD': None, 'AG_PRESERV': 'N', 'BASEMENT': None, 'CITY': 'LINO LAKES', 'COOLING': None, 'DWELL_TYPE': None, 'EMV_BLDG': 0.0, 'EMV_LAND': 14700.0, 'FIN_SQ_FT': 0.0, 'GARAGE': None, 'GARAGESQFT': None, 'GREEN_ACRE': 'N', 'HOMESTEAD': 'N', 'LANDMARK': None, 'OWN_ADD_L1': '7445 341ST ST', 'OWN_ADD_L2': 'STACY', 'OWN_ADD_L3': 'MN,  55079', 'PARC_CODE': 0.0, 'PIN': '003-023122110001', 'SALE_VALUE': 0.0, 'SPEC_ASSES': 0.0, 'TAX_CAPAC': 90.0, 'TAX_EXEMPT': 'N', 'TOTAL_TAX': 105.0, 'USE1_DESC': None, 'USE2_DESC': None, 'USE3_DESC': None, 'USE4_DESC': None, 'WSHD_DIST': 'RICE CREEK WATERSHED DISTRICT', 'XUSE1_DESC': None, 'XUSE2_DESC': None, 'XUSE3_DESC': None, 'XUSE4_DESC': None, 'YEAR_BUILT': 0.0, 'Year': 2004, 'centroid_lat': '45.20952', 'centroid_long': '-93.0438', 'lat_long': ('45.20952', '-93.0438'), 'lake_code': '02000400-01', 'lake_name': 'Peltier Lake', 'parcel_dist_to_lake': 777.2472562912886, 'id': 3}, {'ACRES_DEED': 0.0, 'ACRES_POLY': 37.53, 'AGPRE_ENRD': None, 'AG_PRESERV': 'N', 'BASEMENT': None, 'CITY': 'LINO LAKES', 'COOLING': None, 'DWELL_TYPE': None, 'EMV_BLDG': 0.0, 'EMV_LAND': 148960.0, 'FIN_SQ_FT': 0.0, 'GARAGE': None, 'GARAGESQFT': None, 'GREEN_ACRE': 'N', 'HOMESTEAD': 'N', 'LANDMARK': None, 'OWN_ADD_L1': '2100 3RD AVE', 'OWN_ADD_L2': 'ANOKA', 'OWN_ADD_L3': 'MN,  55303', 'PARC_CODE': 0.0, 'PIN': '003-023122120001', 'SALE_VALUE': 24637.0, 'SPEC_ASSES': 0.0, 'TAX_CAPAC': 0.0, 'TAX_EXEMPT': 'Y', 'TOTAL_TAX': 0.0, 'USE1_DESC': None, 'USE2_DESC': None, 'USE3_DESC': None, 'USE4_DESC': None, 'WSHD_DIST': 'RICE CREEK WATERSHED DISTRICT', 'XUSE1_DESC': None, 'XUSE2_DESC': None, 'XUSE3_DESC': None, 'XUSE4_DESC': None, 'YEAR_BUILT': 0.0, 'Year': 2004, 'centroid_lat': '45.20919', 'centroid_long': '-93.04851', 'lat_long': ('45.20919', '-93.04851'), 'lake_code': '02000400-01', 'lake_name': 'Peltier Lake', 'parcel_dist_to_lake': 777.2472562912886, 'id': 4}, {'ACRES_DEED': 0.0, 'ACRES_POLY': 20.17, 'AGPRE_ENRD': None, 'AG_PRESERV': 'N', 'BASEMENT': None, 'CITY': 'LINO LAKES', 'COOLING': None, 'DWELL_TYPE': None, 'EMV_BLDG': 0.0, 'EMV_LAND': 147940.0, 'FIN_SQ_FT': 0.0, 'GARAGE': None, 'GARAGESQFT': None, 'GREEN_ACRE': 'N', 'HOMESTEAD': 'N', 'LANDMARK': None, 'OWN_ADD_L1': '2100 3RD AVE', 'OWN_ADD_L2': 'ANOKA', 'OWN_ADD_L3': 'MN,  55303', 'PARC_CODE': 0.0, 'PIN': '003-023122210001', 'SALE_VALUE': 26662.0, 'SPEC_ASSES': 0.0, 'TAX_CAPAC': 0.0, 'TAX_EXEMPT': 'Y', 'TOTAL_TAX': 0.0, 'USE1_DESC': None, 'USE2_DESC': None, 'USE3_DESC': None, 'USE4_DESC': None, 'WSHD_DIST': 'RICE CREEK WATERSHED DISTRICT', 'XUSE1_DESC': None, 'XUSE2_DESC': None, 'XUSE3_DESC': None, 'XUSE4_DESC': None, 'YEAR_BUILT': 0.0, 'Year': 2004, 'centroid_lat': '45.20927', 'centroid_long': '-93.05237', 'lat_long': ('45.20927', '-93.05237'), 'lake_code': '02000400-01', 'lake_name': 'Peltier Lake', 'parcel_dist_to_lake': 777.2472562912886, 'id': 5}, {'ACRES_DEED': 0.0, 'ACRES_POLY': 16.14, 'AGPRE_ENRD': None, 'AG_PRESERV': 'N', 'BASEMENT': None, 'CITY': 'LINO LAKES', 'COOLING': None, 'DWELL_TYPE': None, 'EMV_BLDG': 0.0, 'EMV_LAND': 147940.0, 'FIN_SQ_FT': 0.0, 'GARAGE': None, 'GARAGESQFT': None, 'GREEN_ACRE': 'N', 'HOMESTEAD': 'N', 'LANDMARK': None, 'OWN_ADD_L1': '2100 3RD AVE', 'OWN_ADD_L2': 'ANOKA', 'OWN_ADD_L3': 'MN,  55303', 'PARC_CODE': 0.0, 'PIN': '003-023122210001', 'SALE_VALUE': 26662.0, 'SPEC_ASSES': 0.0, 'TAX_CAPAC': 0.0, 'TAX_EXEMPT': 'Y', 'TOTAL_TAX': 0.0, 'USE1_DESC': None, 'USE2_DESC': None, 'USE3_DESC': None, 'USE4_DESC': None, 'WSHD_DIST': 'RICE CREEK WATERSHED DISTRICT', 'XUSE1_DESC': None, 'XUSE2_DESC': None, 'XUSE3_DESC': None, 'XUSE4_DESC': None, 'YEAR_BUILT': 0.0, 'Year': 2004, 'centroid_lat': '45.2091', 'centroid_long': '-93.05506', 'lat_long': ('45.2091', '-93.05506'), 'lake_code': '02000400-01', 'lake_name': 'Peltier Lake', 'parcel_dist_to_lake': 777.2472562912886, 'id': 6}, {'ACRES_DEED': 0.0, 'ACRES_POLY': 10.51, 'AGPRE_ENRD': None, 'AG_PRESERV': 'N', 'BASEMENT': None, 'CITY': 'LINO LAKES', 'COOLING': None, 'DWELL_TYPE': None, 'EMV_BLDG': 0.0, 'EMV_LAND': 154700.0, 'FIN_SQ_FT': 0.0, 'GARAGE': None, 'GARAGESQFT': None, 'GREEN_ACRE': 'N', 'HOMESTEAD': 'N', 'LANDMARK': None, 'OWN_ADD_L1': '8300 RONDEAU LAKE RD E', 'OWN_ADD_L2': 'HUGO', 'OWN_ADD_L3': 'MN,  55038', 'PARC_CODE': 0.0, 'PIN': '003-023122220003', 'SALE_VALUE': 0.0, 'SPEC_ASSES': 0.0, 'TAX_CAPAC': 795.0, 'TAX_EXEMPT': 'N', 'TOTAL_TAX': 946.0, 'USE1_DESC': None, 'USE2_DESC': None, 'USE3_DESC': None, 'USE4_DESC': None, 'WSHD_DIST': 'RICE CREEK WATERSHED DISTRICT', 'XUSE1_DESC': None, 'XUSE2_DESC': None, 'XUSE3_DESC': None, 'XUSE4_DESC': None, 'YEAR_BUILT': 0.0, 'Year': 2004, 'centroid_lat': '45.21015', 'centroid_long': '-93.05759', 'lat_long': ('45.21015', '-93.05759'), 'lake_code': '02000400-01', 'lake_name': 'Peltier Lake', 'parcel_dist_to_lake': 777.2472562912886, 'id': 7}  ... displaying 10 of 5182 total bound parameter sets ...  {'ACRES_DEED': 0.0, 'ACRES_POLY': 0.74, 'AGPRE_ENRD': None, 'AG_PRESERV': 'N', 'BASEMENT': None, 'CITY': 'LINO LAKES', 'COOLING': None, 'DWELL_TYPE': None, 'EMV_BLDG': 280864.0, 'EMV_LAND': 77550.0, 'FIN_SQ_FT': 0.0, 'GARAGE': None, 'GARAGESQFT': None, 'GREEN_ACRE': 'N', 'HOMESTEAD': 'Y', 'LANDMARK': None, 'OWN_ADD_L1': '1186 DURANGO POINT', 'OWN_ADD_L2': 'LINO LAKES', 'OWN_ADD_L3': 'MN,  55038', 'PARC_CODE': 0.0, 'PIN': '003-333122110013', 'SALE_VALUE': 379500.0, 'SPEC_ASSES': 0.0, 'TAX_CAPAC': 3721.0, 'TAX_EXEMPT': 'N', 'TOTAL_TAX': 4565.0, 'USE1_DESC': None, 'USE2_DESC': None, 'USE3_DESC': None, 'USE4_DESC': None, 'WSHD_DIST': 'RICE CREEK WATERSHED DISTRICT', 'XUSE1_DESC': None, 'XUSE2_DESC': None, 'XUSE3_DESC': None, 'XUSE4_DESC': None, 'YEAR_BUILT': 1999.0, 'Year': 2004, 'centroid_lat': '45.13731', 'centroid_long': '-93.08196', 'lat_long': ('45.13731', '-93.08196'), 'lake_code': '02000900-01', 'lake_name': 'Reshanau Lake', 'parcel_dist_to_lake': 649.7591706238178, 'id': 5180}, {'ACRES_DEED': 0.0, 'ACRES_POLY': 0.36, 'AGPRE_ENRD': None, 'AG_PRESERV': 'N', 'BASEMENT': None, 'CITY': 'LINO LAKES', 'COOLING': None, 'DWELL_TYPE': None, 'EMV_BLDG': 235610.0, 'EMV_LAND': 70500.0, 'FIN_SQ_FT': 0.0, 'GARAGE': None, 'GARAGESQFT': None, 'GREEN_ACRE': 'N', 'HOMESTEAD': 'Y', 'LANDMARK': None, 'OWN_ADD_L1': '1180 DURANGO POINT', 'OWN_ADD_L2': 'LINO LAKES', 'OWN_ADD_L3': 'MN,  55038', 'PARC_CODE': 0.0, 'PIN': '003-333122110014', 'SALE_VALUE': 342900.0, 'SPEC_ASSES': 0.0, 'TAX_CAPAC': 3210.0, 'TAX_EXEMPT': 'N', 'TOTAL_TAX': 3893.0, 'USE1_DESC': None, 'USE2_DESC': None, 'USE3_DESC': None, 'USE4_DESC': None, 'WSHD_DIST': 'RICE CREEK WATERSHED DISTRICT', 'XUSE1_DESC': None, 'XUSE2_DESC': None, 'XUSE3_DESC': None, 'XUSE4_DESC': None, 'YEAR_BUILT': 2002.0, 'Year': 2004, 'centroid_lat': '45.13752', 'centroid_long': '-93.08235', 'lat_long': ('45.13752', '-93.08235'), 'lake_code': '02000900-01', 'lake_name': 'Reshanau Lake', 'parcel_dist_to_lake': 649.7591706238178, 'id': 5181}]]

#### Note: If you are clever, you can do parts 8 in one double loop, which will save you from having to read the parcel files twice.

8. It is probably worth our time to test that our new key column is truely unique. (If not, we might be wasting out time loading the data into a database, only to have process fail hours in.) Test that the new column works by <br>
    a. Iterating over all the files.<br>
    b. Using an accumulator to count total number of rows across all parcel files. <br>
    c. Using an accumulator to accumulate a set of all unique values of our new key. <br>
    d. Verifying that we have as many total rows as unique keys.
    a. Selecting just this column. <br>
    b. Dumping this column into a temporary database <br>

9. If the last step succeeded, you can proceed to make a master parcel data database.  If not, you will need to figure out another primary key, probably an `id` column similar to the example in the lectures.