# Merging Sales with Building Footprints
<hr>
This notebook takes the list of sales downloaded from the Suffolk County Recorder, which does not include addresses or cities, and uses each sale's tax-map number to determine whether the property is on the North Fork or the South Fork.

In [4]:
import pandas as pd
import os
import re

This function takes a raw 19-digit tax map ID and formats it as a delimited sequence of district, section, block, and lot numbers.

In [45]:
def format_parcel(number):
    
    # Slice the 19 digits into district (4), section (3+2), block (2+2), and lot (3+3).
    district = number[:4]
    section_prefix = number[4:7]
    section_suffix = number[7:9]
    block_prefix = number[9:11]
    block_suffix = number[11:13]
    lot_prefix = number[13:16]
    lot_suffix = number[16:]
    format_string = '{}-{}.{}-{}.{}-{}.{}'.format(
        district,
        section_prefix,
        section_suffix,
        block_prefix,
        block_suffix,
        lot_prefix,
        lot_suffix
    )
    return format_string

And this function does the reverse.

In [3]:
def strip_parcel(number):
    
    raw_string = number.replace('-', '').replace('.', '')
    return raw_string
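As a quick sanity check, the two helpers should round-trip. The sketch below restates both functions in compact form so that it runs on its own; the 19-digit ID is a made-up placeholder, not a real parcel.

```python
def format_parcel(number):
    # Same slicing as above: a 4-digit district, then 3.2, 2.2 and 3.3 digit groups.
    return '{}-{}.{}-{}.{}-{}.{}'.format(
        number[:4], number[4:7], number[7:9], number[9:11],
        number[11:13], number[13:16], number[16:]
    )

def strip_parcel(number):
    # Drop the dash and dot delimiters to recover the raw digits.
    return number.replace('-', '').replace('.', '')

raw_id = '0900123000100002000'  # hypothetical ID, for illustration only
formatted = format_parcel(raw_id)
print(formatted)                          # 0900-123.00-01.00-002.000
print(strip_parcel(formatted) == raw_id)  # True
```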
    

### Aside A: Merging and Cleaning Raw CSVs into a Single File
<hr>
This step has already been completed and can be skipped for future use.

In [59]:
hamptons_file = '/Users/haru/Documents/Supplements/hamptons/2018-06/top_brokerages/data/gis/hamptons-footprints.csv'
northfork_file = '/Users/haru/Documents/Supplements/hamptons/2018-06/top_brokerages/data/gis/north-fork-footprints.csv'
root_dir = '/Users/haru/Documents/Supplements/hamptons/2018-06/top-sales/data/suffolk-county'

In [60]:
southFrame = pd.read_csv(hamptons_file, dtype=str)
northFrame = pd.read_csv(northfork_file, dtype=str)

In [72]:
Frames = []
for root,_,files in os.walk(root_dir):
    
    for f in files:
        
        if f.endswith('csv'):
            
            inp = os.path.join(root, f)
            dF = pd.read_csv(inp, dtype=str)
            Frames.append(dF)
            
outFrame = pd.concat(Frames, axis=0)
outFrame.fillna('', inplace=True)

2187


In [75]:
Array = outFrame.values
Keys = outFrame.columns.values
Dicts = []
for Row in Array:
    
    Data = [re.sub('<B>.*?B>', '', cell) for cell in Row]
    d = dict(zip(Keys, Data))
    Dicts.append(d)
    
cleanFrame = pd.DataFrame(Dicts)
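The pattern in the loop above removes each `<B>` element together with the text it wraps, because `.*?` lazily consumes everything up to the next `B>`. A minimal illustration follows; the sample cell values are invented, and the real CSVs may embed the tags differently.

```python
import re

# Hypothetical cell values, for illustration only.
samples = ['0900-123.00<B>*</B>', 'DEED', '<B>NOTE</B>1,250,000']

# Same substitution as in the cleaning loop.
cleaned = [re.sub('<B>.*?B>', '', cell) for cell in samples]
print(cleaned)  # ['0900-123.00', 'DEED', '1,250,000']
```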

In [76]:
Formatted_Parcels = cleanFrame.TAXMAPNO.values

In [77]:
Stripped_Parcels = [strip_parcel(p) for p in Formatted_Parcels]

In [78]:
cleanFrame = cleanFrame.assign(PARCELID=Stripped_Parcels)
cleanFrame.to_csv('/Users/haru/Documents/Supplements/hamptons/2018-06/top_brokerages/data/gis/sales.csv', index=False)

### End Aside.
<hr>

### Aside B: Merging the Clean Sales Data with Separate Files for North and South Fork
This isn't necessary, since we're going to follow these steps for the entire Suffolk County file instead.

The same principle applies, however.

In [79]:
nmergeFrame = cleanFrame.merge(northFrame, on='PARCELID', how='inner')
smergeFrame = cleanFrame.merge(southFrame, on='PARCELID', how='inner')

In [81]:
nmergeFrame.to_csv('/Users/haru/Documents/Supplements/hamptons/2018-06/top_brokerages/data/gis/northfork_sales_with-footprints.csv', index=False)
smergeFrame.to_csv('/Users/haru/Documents/Supplements/hamptons/2018-06/top_brokerages/data/gis/hamptons_sales_with-footprints.csv', index=False)

### End Aside.
<hr>

## Step 1: Loading the Data
Here, we're going to load the entire Suffolk County Building Footprints file into a Pandas DataFrame. The link to the data is: https://gis.ny.gov/gisdata/inventories/details.cfm?DSID=1300.

Please note that this data is <b>not</b> the CSV file that you can download directly from the site. Rather, it is the ESRI shapefile that has been imported into QGIS and then exported from there into a CSV. The reason for this roundabout method is that the shapefile contains more details than the CSV—including location data.

In [5]:
foot_file = '/Users/haru/Documents/Supplements/hamptons/2018-06/top-sales/data/gis/Suffolk-Tax-Parcels-Centroid-Points-SHP.csv'
sales_file = '/Users/haru/Documents/Supplements/hamptons/2018-06/top-sales/data/gis/hamptons_recorded-sales_all-suffolk_1m+.csv'

In [6]:
footFrame = pd.read_csv(foot_file, dtype=str)
salesFrame = pd.read_csv(sales_file, dtype=str)

## Step 2: Cleaning the Data

### A) Getting District Codes
It turns out that the NY state file doesn't include the Suffolk County district codes—i.e. the first four digits of the complete tax ID number.

But it does include the name of the town.

We can get the district code from the name of the town, using the County's own reference sheet: http://www.suffolkcountyny.gov/Departments/CountyClerk/TownandDistrictCodes.aspx.

Once we put that into a dictionary, we can use it to convert the towns into codes.

In [7]:
distDict = {
    'Amityville': '0101',
    'Asharoken': '0401',
    'Babylon': '0100',#0102
    'Babylon, Village': '0102',#0100
    'Lindenhurst': '0103',
    'Belle Terre': '0201',
    'Bellport': '0202',
    'Brightwaters': '0501',
    'Brookhaven': '0200',
    'Shoreham': '0207',
    'Old Field': '0203',
    'Poquott': '0205',
    'Port Jefferson': '0206',
    'Lake Grove': '0208',
    'Patchogue': '0204',
    'Mastic Beach': '0209',
    'Dering Harbor': '0701',
    'East Hampton': '0300',#0301
    'East Hampton, Village': '0301',#0300
    'Sag Harbor': '0903',
    'Greenport': '1001',
    'Head Of The Harbor': '0801',
    'Huntington': '0400',
    'Lloyd Harbor': '0403',
    'Northport': '0404',
    'Huntington Bay': '0400',
    'Islandia': '0504',
    'Islip': '0500',
    'Saltaire': '0503',
    'Ocean Beach': '0502',
    'Nissequogue': '0802',
    'North Haven': '0901',
    'Quogue': '0902',
    'Riverhead': '0600',
    'Sagaponack': '0908',
    'Shelter Island': '0700',
    'Smithtown': '0800',
    'Village Of The Branch': '0803',
    'West Hampton Dunes': '0907',
    'Southampton': '0900',
    'Westhampton Beach': '0905',
    'Southampton, Village': '0904',
    'Southold': '1000',
}

In [None]:
distDict = {
    'Amityville':'0101',
    'Asharoken':'0401',
    'Babylon':'0100',
    'Babylon, Village':'0102',
    'Belle Terre':'0201',
    'Bellport':'0202',
    'Brightwaters':'0501',
    'Brookhaven':'0200',
    'Dering Harbor':'0701',
    'East Hampton':'0300',
    'East Hampton, Village':'0301',
    'Greenport':'1001',
    'Head Of The Harbor':'0801',
    'Huntington':'0400',
    'Huntington Bay':'0400',
    'Islandia':'0504',
    'Islip':'0500',
    'Lake Grove':'0208',
    'Lindenhurst':'0103',
    'Lloyd Harbor':'0403',
    'Mastic Beach':'0209',
    'Nissequogue':'0802',
    'North Haven':'0901',
    'Northport':'0404',
    'Ocean Beach':'0502',
    'Old Field':'0203',
    'Patchogue':'0204',
    'Poquott':'0205',
    'Port Jefferson':'0206',
    'Quogue':'0902',
    'Riverhead':'0600',
    'Sag Harbor':'0302',
    'Sagaponack':'0908',
    'Saltaire':'0503',
    'Shelter Island':'0700',
    'Shoreham':'0207',
    'Smithtown':'0800',
    'Southampton':'0900',
    'Southampton, Village':'0904',
    'Southold':'1000',
    'Village Of The Branch':'0803',
    'West Hampton Dunes':'0907',
    'Westhampton Beach':'0905',
}

In [8]:
def get_district(muni):
    
    # Fall back to '0000' when the municipality is not in the lookup table.
    return distDict.get(muni, '0000')

In [9]:
Towns = footFrame.MUNI_NAME.values
Dists = [distDict[t] for t in Towns]

In [10]:
footFrame = footFrame.assign(district=Dists)
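The same town-to-code conversion can also be written as a vectorized lookup. The sketch below is one possible alternative, not the approach used above: `dist_lookup` is a three-entry excerpt of the table, `toy` is an invented frame, and 'Atlantis' is a deliberately unknown name used to show the fallback.

```python
import pandas as pd

# Small excerpt of the lookup table and a made-up frame (assumptions for illustration).
dist_lookup = {'Southold': '1000', 'Greenport': '1001', 'East Hampton': '0300'}
toy = pd.DataFrame({'MUNI_NAME': ['Southold', 'Greenport', 'Atlantis']})

# Series.map returns NaN for names missing from the dict; fill those with '0000'.
toy = toy.assign(district=toy['MUNI_NAME'].map(dist_lookup).fillna('0000'))

# List the municipalities that fell through to the placeholder code.
unmatched = sorted(toy.loc[toy['district'] == '0000', 'MUNI_NAME'].unique())
print(toy)
print(unmatched)  # ['Atlantis']
```

Listing the unmatched names makes it easy to see which towns still need an entry in the dictionary.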

In [22]:
for t in footFrame.MUNI_NAME.unique():
    print(t)

Amityville
Asharoken
Babylon
Babylon, Village
Lindenhurst
Belle Terre
Bellport
Brightwaters
Brookhaven
Shoreham
Old Field
Poquott
Port Jefferson
Lake Grove
Patchogue
Mastic Beach
Dering Harbor
East Hampton
East Hampton, Village
Sag Harbor
Greenport
Head Of The Harbor
Huntington
Lloyd Harbor
Northport
Huntington Bay
Islandia
Islip
Saltaire
Ocean Beach
Nissequogue
North Haven
Quogue
Riverhead
Sagaponack
Shelter Island
Smithtown
Village Of The Branch
West Hampton Dunes
Southampton
Westhampton Beach
Southampton, Village
Southold


### B) Fixing the SBLs
Also, the state file's SBL numbers are formatted in a number of different ways.

The cleanest is a simple 15-digit code that is essentially a truncated version of the complete tax number.

Others appear to result from the state recombining elements of the tax number, probably in a botched attempt to standardize them.

Luckily, there are only a few variants, and the pattern for cleaning the number aligns with the length of the string.
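One way to confirm which variants are present is to tally the string lengths before cleaning. This is a minimal sketch on placeholder strings (only their lengths matter); on the real data the series would be the `SBL` column, and the set of handled lengths is taken from the cleaning rules in this section.

```python
import pandas as pd

# Placeholder strings whose only meaningful property is their length.
toy_sbl = pd.Series(['1' * 15, '2' * 16, '3' * 17, '4' * 20, '5' * 15, '6' * 19])

# Count how many values fall into each length bucket.
length_counts = toy_sbl.str.len().value_counts().sort_index()
print(length_counts)

# Lengths with no cleaning rule (anything other than 15, 16, 17 or 20).
handled = {15, 16, 17, 20}
unhandled = sorted(int(n) for n in length_counts.index if n not in handled)
print(unhandled)  # [19]
```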

In [11]:
def correct_sbl(sbl):
    
    length = len(sbl)
    if length == 20:
        
        #Break string apart.
        section = sbl[:5]
        block = sbl[5:10]
        lot = sbl[10:]

        #Re-order and trim block
        block = block[-2:] + block[:-2]
        block = block[:4]

        #Trim 0000 from lot
        lot = lot[:-4]

        #Combine
        revised = ''.join([section, block, lot])
        
    elif length == 16:
        
        #Break string apart.
        section = sbl[:5]
        block = sbl[5:10]
        lot = sbl[10:]
        
        #Trim block
        block = block[1:]
        
        #Combine
        revised = ''.join([section, block, lot])
        
    elif length == 17:
        
        #Break string apart.
        section = sbl[:5]
        block = sbl[5:10]
        lot = sbl[10:]
        
        #Trim block
        block = block[1:]
        
        #Trim lot.
        lot = lot[:-1]
        
        #Combine
        revised = ''.join([section, block, lot])
    
    else:
        
        #Fail loudly on any SBL length that has no cleaning rule.
        raise ValueError('Unexpected SBL length {}: {}'.format(length, sbl))
        
    return revised
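To make the 20-character branch concrete, the sketch below applies the same slicing to a made-up string whose letters mark where each piece ends up. `fix_20` is a hypothetical helper that mirrors only that branch; it is not part of the notebook.

```python
def fix_20(sbl):
    # Mirror of the length-20 branch: split into 5 + 5 + 10 characters.
    section, block, lot = sbl[:5], sbl[5:10], sbl[10:]
    # Move the last two block characters to the front, then keep the first four.
    block = (block[-2:] + block[:-2])[:4]
    # Drop the four trailing characters from the lot.
    lot = lot[:-4]
    return section + block + lot

example = 'SSSSS' + 'bbbXY' + 'LLLLLL0000'  # invented string, not a real SBL
print(fix_20(example))  # SSSSSXYbbLLLLLL
```

The result is 15 characters long, the same length as the SBLs that need no cleaning.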
    

In [12]:
Raw_Codes = footFrame.SBL.values
Clean_Codes = [correct_sbl(c) if len(c) != 15 else c for c in Raw_Codes]

In [13]:
footFrame = footFrame.assign(clean_sbl=Clean_Codes)

### C) Creating Complete Tax ID Numbers
You can concatenate the strings in two DataFrame columns simply by adding them:

In [14]:
footFrame = footFrame.assign(recombined_parcel=footFrame.district + footFrame.clean_sbl)

But it turns out that this produces a low success rate—likely because of slight variations in the district codes used for each parcel. (Only about one in five sales was successfully mapped to a parcel number.)

Instead, we're going to map sales to parcels based on their SBL number only—without the district.

This carries a small risk of matching sales with parcels with the same SBL number but in a different town.

But we can spot-check those when we have the final list.
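A minimal sketch of what an SBL-only join and the follow-up spot-check might look like, using invented toy frames. The column names `PARCELID`, `clean_sbl`, and `MUNI_NAME` follow this notebook, but every value is made up, and flagging ambiguity by counting distinct municipalities is just one possible check.

```python
import pandas as pd

# Invented 15-digit SBLs, prefixed with a 4-digit district to form sale IDs.
sbl_a, sbl_b, sbl_c = '113000300024031', '015000500024031', '999000100001000'
sales = pd.DataFrame({'PARCELID': ['0300' + sbl_a, '1000' + sbl_b, '0900' + sbl_c]})
sales = sales.assign(clean_sbl=sales['PARCELID'].str[4:])

# Toy parcel table in which sbl_b occurs in two different towns.
parcels = pd.DataFrame({
    'clean_sbl': [sbl_a, sbl_b, sbl_b],
    'MUNI_NAME': ['East Hampton', 'Southold', 'Riverhead'],
})

# Left join on the SBL alone; indicator=True records which rows found a match.
merged = sales.merge(parcels, on='clean_sbl', how='left', indicator=True)

# Sales whose SBL matched parcels in more than one municipality need a manual check.
match_counts = merged.groupby('PARCELID')['MUNI_NAME'].nunique()
ambiguous = match_counts[match_counts > 1].index.tolist()

# Sales that matched no parcel at all.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'PARCELID'].tolist()

print(ambiguous)  # ['1000015000500024031']
print(unmatched)  # ['0900999000100001000']
```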

In [15]:
Parcel_Nos = salesFrame.PARCELID.values
SBL_Only = [p[4:] for p in Parcel_Nos]
salesFrame = salesFrame.assign(clean_sbl=SBL_Only)

In [17]:
salesFrame = salesFrame.rename(columns={'PARCELID':'recombined_parcel'})

In [18]:
mergeFrame = salesFrame.merge(footFrame, on='recombined_parcel', how='left')
mergeFrame.to_csv('/Users/haru/Documents/Supplements/hamptons/2018-06/top-sales/data/gis/hamptons_sales_merged_all-suffolk.csv', index=False)

## Step 3: Filtering and Sorting the Data



In [21]:
mask = footFrame.recombined_parcel.str.contains('024031')
group = footFrame.loc[mask]
group

| Unnamed: 0 | X | Y | COUNTY | MUNI_NAME | SWIS | PARCELADDR | PRINT_KEY | SBL | CT_NAME | CT_SWIS | ... | ROLL_YR | SPATIAL_YR | OWNER_TYPE | NYS_NAME | NAMESOURCE | DUP_GEO | CALC_ACRES | district | clean_sbl | recombined_parcel |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 8993 | 643419.1558999997 | 4513602.5412 | Suffolk | Babylon | 472089 | 960 GRAND BL | 67.-1-24.31 | 6700000010240310000 | Babylon | 472000 | ... | 2016 | 2014 | 8 | | | | 2.00182898885 | 102 | 67000100024031 | 102067000100024031 |
| 262827 | 734471.9822000003 | 4542105.3416 | Suffolk | East Hampton | 472489 | 2 DERING LN | 113.-3-24.31 | 11300000030240310000 | East Hampton | 472400 | ... | 2016 | 2014 | 8 | | | | 1.41803377641 | 301 | 113000300024031 | 301113000300024031 |
| 585359 | 730756.6533000004 | 4559490.3216 | Suffolk | Southold | 473889 | 80 Ryder Farm Ln | 15.-5-24.31 | 1500000050240310000 | Southold | 473800 | ... | 2016 | 2014 | 8 | | | | 0.95129833971 | 1000 | 15000500024031 | 1000015000500024031 |
