### Explanation

This notebook attempts to build a dataset suitable for Poisson regression (or log-linear regression more generally) by finding the net new units built on each site in the BlueSky dataset. I could not work out how to do this after roughly a dozen attempts, so I need to circle back with SF staff.

### Code

In [319]:
import pandas as pd
import geopandas as gpd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import contextily as ctx
from shapely.wkt import load

import clean_utils

In [479]:
plan_permits = gpd.read_file('../data/SF_Planning_Permitting_Data.geojson', low_memory=False)

In [312]:
clean_utils.clean_dates(plan_permits)

In [5]:
parcels = pd.read_csv('../data/Blue Sky Code and Inputs/SF_Logistic_Data.csv')

In [304]:
allParcels = gpd.read_file('../data/Parcels   Active and Retired/parcels.shp')

In [7]:
sites = gpd.read_file('../data/site_inventory/xn--Bay_Area_Housing_Opportunity_Sites_Inventory__20072023_-it38a.shp')

### Training Set is RHNA 4

In [422]:
trainParcels = parcels[(parcels.year >= 2007) & (parcels.year < 2015)]
trainY = trainParcels.groupby('MapBlkLot_Master')['Developed'].agg(lambda x: x.ne(0).sum())
trainX = trainParcels[trainParcels.year == 2007]
trainY.sum()

253
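To make the target construction concrete, here is a minimal sketch on made-up data (the lot IDs and flags are invented, not drawn from the BlueSky file). It shows how `x.ne(0).sum()` counts the number of years in which a lot has a nonzero `Developed` flag.

```python
import pandas as pd

# Toy parcel-year panel; IDs and flags are invented for illustration.
toy = pd.DataFrame({
    "MapBlkLot_Master": ["A", "A", "A", "B", "B", "B"],
    "year": [2007, 2008, 2009, 2007, 2008, 2009],
    "Developed": [0, 1, 1, 0, 0, 0],
})

# Count the years in which each lot has a nonzero Developed flag.
toy_y = toy.groupby("MapBlkLot_Master")["Developed"].agg(lambda x: x.ne(0).sum())
print(toy_y.to_dict())
```

Lot `A` is flagged in two years and lot `B` in none, so the aggregate is a per-lot event count rather than a unit count.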

Check that the 2007 snapshot has no duplicate lots (one row per `MapBlkLot_Master`).

In [424]:
nunique_lots = trainParcels[trainParcels.year == 2007].MapBlkLot_Master.nunique()
n_lots = trainParcels[trainParcels.year == 2007].shape[0]
assert nunique_lots == n_lots

trainDf = pd.merge(trainX.drop('Developed', axis=1), trainY, left_on='MapBlkLot_Master', right_index=True)
trainDf.Developed.value_counts()

0    152965
1       251
2         1
Name: Developed, dtype: int64
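The uniqueness check and the merge can also be combined. As a hedged sketch on invented data, `pd.merge` accepts a `validate` argument that raises an error if the join is not one-to-one, which guards against silently duplicated lots. The `zoning` column and its values are placeholders, not fields from the BlueSky data.

```python
import pandas as pd

# Invented features (one row per lot) and targets (indexed by lot).
toy_x = pd.DataFrame({
    "MapBlkLot_Master": ["A", "B", "C"],
    "zoning": ["RH-1", "RM-2", "NC-3"],  # placeholder feature
})
toy_y = pd.Series(
    [2, 0, 1],
    index=pd.Index(["A", "B", "C"], name="MapBlkLot_Master"),
    name="Developed",
)

# validate="one_to_one" raises a MergeError if either side has duplicate keys.
toy_df = pd.merge(
    toy_x, toy_y,
    left_on="MapBlkLot_Master", right_index=True,
    validate="one_to_one",
)
```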

### Make BlueSky data geospatial

In [428]:
df = clean_utils.transform_bluesky_to_geospatial(trainDf)

In [429]:
df.CANTID_blklot_backup.notna().sum()

7756

### Developed parcels

In [364]:
built = df.loc[df.Developed > 0]

In [463]:
dbi_permits = clean_utils.get_dbi_permits()

In [464]:
round(built.MapBlkLot_Master.isin(dbi_permits.blocklot).mean(), 2)

0.53

In [465]:
success = ['complete', 'issued', 'approved', 'granted', 'issuing']
completed_projects = dbi_permits[dbi_permits['status'].isin(success)].copy()

#### Use blklot and APN, then compare

In [466]:
completed_projects['blocklot'].nunique()

1986

In [473]:
dbi_units = completed_projects.groupby(['blocklot'], sort=False)['units'].median()
dbi_units = dbi_units.reset_index()
merge_blklot = built.merge(dbi_units, how='inner', left_on='blklot', right_on='blocklot')
merge_mapblklot = built.merge(dbi_units, how='inner', left_on='mapblklot', right_on='blocklot')
merge_mapblklotm = built.merge(dbi_units, how='inner', left_on='MapBlkLot_Master', right_on='blocklot')

In [474]:
merge_blklot.units.sum()

8969.5
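The fractional total most likely comes from taking the median over several permits on the same lot: with an even number of permits, the median is the mean of the two middle values. The sketch below, on invented permit rows, illustrates this. Whether a different aggregate (for example `max`) is more appropriate for a count target is a judgment call this notebook does not settle.

```python
import pandas as pd

# Invented permit rows: two permits on the first lot, one on the second.
toy_permits = pd.DataFrame({
    "blocklot": ["0001001", "0001001", "0002003"],
    "units": [4, 5, 12],
})

# Median of (4, 5) is 4.5, so the per-lot value is no longer an integer.
median_units = toy_permits.groupby("blocklot", sort=False)["units"].median()

# Shown only for comparison; max keeps integer values.
max_units = toy_permits.groupby("blocklot", sort=False)["units"].max()
```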

In [396]:
all_match1 = pd.concat((merge_mapblklotm,
                        merge_mapblklot,
                        merge_blklot), axis=0)
all_match1 = all_match1.to_crs('EPSG:4326')
all_match1.MapBlkLot_Master.nunique()

In [401]:
built = built.to_crs('EPSG:4326')
merge_geo = gpd.sjoin(built, completed_projects, how="inner", predicate='contains')
merge_geo.MapBlkLot_Master.nunique()

In [478]:
all_match = pd.concat((all_match1, merge_geo), axis=0)
all_match = all_match[~all_match.MapBlkLot_Master.duplicated()]
all_match.MapBlkLot_Master.nunique()

276

I can capture all but 30 permits using the DBI dataset.
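The matching strategy used so far (try several parcel identifiers in turn, stack the results, keep the first match per lot) can be sketched on invented data as follows. The helper name `match_on_keys` is hypothetical and is not part of `clean_utils`.

```python
import pandas as pd

def match_on_keys(parcels, permits, keys, permit_key="blocklot", id_col="MapBlkLot_Master"):
    """Inner-join parcels to permits on each candidate key; keep the first match per lot."""
    matches = [
        parcels.merge(permits, how="inner", left_on=key, right_on=permit_key)
        for key in keys
    ]
    stacked = pd.concat(matches, axis=0, ignore_index=True)
    return stacked.drop_duplicates(subset=id_col, keep="first")

# Invented parcels with three alternative identifiers.
toy_parcels = pd.DataFrame({
    "MapBlkLot_Master": ["0001001", "0002003", "0003004"],
    "mapblklot": ["0001001", "0002003", "0003004"],
    "blklot": ["0001001", "0002003A", "0003005"],
})
# Invented per-lot permit units.
toy_permit_units = pd.DataFrame({"blocklot": ["0001001", "0003005"], "units": [4.0, 10.0]})

toy_matched = match_on_keys(
    toy_parcels, toy_permit_units,
    keys=["MapBlkLot_Master", "mapblklot", "blklot"],
)
# Lots that no identifier could match.
toy_unmatched = toy_parcels[~toy_parcels["MapBlkLot_Master"].isin(toy_matched["MapBlkLot_Master"])]
```

Because duplicates are dropped with `keep="first"`, the order of `keys` decides which identifier wins when a lot matches on more than one.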

#### How many matches do I get if I use SF Planning Permits

In the SF Planning Permits dataset, almost all `mapblocklot` values are block + lot. A smaller number are lot + block, and 29 contain a non-digit character that I need to strip out.

Also, 7% of blocklots are NaN.

In [480]:
plan_permits[
    ~((plan_permits.mapblocklot == (plan_permits.block + plan_permits.lot))
       | plan_permits.block.isna()
       | (plan_permits.mapblocklot == (plan_permits.lot + plan_permits.block)))
].shape

(29, 139)
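As a hedged sketch of the cleanup described above, the snippet below strips non-digit characters from an identifier column and classifies each row as block + lot or lot + block. The rows are invented, and stripping every non-digit character is an assumption: if some lot numbers legitimately contain letters, this would corrupt them.

```python
import pandas as pd

# Invented rows: block + lot, lot + block, and one with a stray character.
toy = pd.DataFrame({
    "block": ["0001", "0002", "0003"],
    "lot": ["001", "003", "004"],
    "mapblocklot": ["0001001", "0030002", "0003-004"],
})

# Remove anything that is not a digit.
toy["mapblocklot_clean"] = toy["mapblocklot"].str.replace(r"\D", "", regex=True)

# Classify each cleaned identifier by the order of its parts.
is_block_lot = toy["mapblocklot_clean"] == (toy["block"] + toy["lot"])
is_lot_block = toy["mapblocklot_clean"] == (toy["lot"] + toy["block"])
```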

In [134]:
ppermits = plan_permits

In [481]:
clean_utils.clean_dates(ppermits)
clean_utils.clean_numbers(ppermits)

In [483]:
ppermits['NA_NUMBER_OF_UNITS_EXIST'] = ppermits['number_of_units_exist'].isna()
ppermits['units'] = ppermits['number_of_units'].fillna(0) - ppermits['number_of_units_exist'].fillna(0)

statuses = ['Closed - Approved', 'Closed',
            'Closed - Issued', 'Closed - DR taken-Approved', 
            'Closed - Appeal Upheld', 'Closed - DR not taken-Approved',
            'Approved', 'Permitted', 'Complete',
            'Accepted', 'Application Accepted', 'Closed - No DR action-Approved']

rhna_ppermits = ppermits[
    (ppermits['units'] > 0)
    & (ppermits['record_status'].isin(statuses)) 
    & (((ppermits['close_date'].dt.year >= 2007)
        & (ppermits['close_date'].dt.year < 2015))
       | ((ppermits['open_date'].dt.year >= 2007)
          & (ppermits['open_date'].dt.year < 2015))
       | ((ppermits['date_application_accepted'].dt.year >= 2007)
          & (ppermits['date_application_accepted'].dt.year < 2015))
       | ((ppermits['date_application_submitted'].dt.year >= 2007)
          & (ppermits['date_application_submitted'].dt.year < 2015)))
].copy()
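The date filter above repeats the same year window for four date columns. A minimal sketch of an equivalent, more compact formulation is shown below on invented rows; the helper name `any_date_in_window` is hypothetical. Note that filling missing existing-unit counts with zero, as the cell above does, may overstate net new units when the true existing count is unknown.

```python
import pandas as pd

def any_date_in_window(df, date_cols, start_year, end_year):
    """True where at least one date column falls in [start_year, end_year)."""
    years = df[date_cols].apply(lambda col: col.dt.year)
    return ((years >= start_year) & (years < end_year)).any(axis=1)

# Invented permit rows with missing dates and unit counts.
toy = pd.DataFrame({
    "open_date": pd.to_datetime(["2006-05-01", "2010-03-15", None]),
    "close_date": pd.to_datetime(["2008-01-10", "2016-07-01", "2015-02-02"]),
    "number_of_units": [10.0, 3.0, None],
    "number_of_units_exist": [None, 3.0, 1.0],
})

# Net new units, treating missing counts as zero.
toy["units"] = toy["number_of_units"].fillna(0) - toy["number_of_units_exist"].fillna(0)

in_window = any_date_in_window(toy, ["open_date", "close_date"], 2007, 2015)
toy_selected = toy[(toy["units"] > 0) & in_window]
```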

In [484]:
merge_pp_geo = gpd.sjoin(built, rhna_ppermits, how="inner", predicate='contains')

In [486]:
plan_units = rhna_ppermits.groupby(['mapblocklot'], sort=False)['units'].median()
merge_pp_mapblklot = built.merge(plan_units, how='inner', left_on='mapblklot', right_on='mapblocklot')
all_match3 = pd.concat((all_match, merge_pp_mapblklot, merge_pp_geo), axis=0)
all_match3.MapBlkLot_Master.nunique()

279

In [487]:
all_match3 = all_match3[~all_match3.MapBlkLot_Master.duplicated()]

In [488]:
finalDf = all_match3[list(trainDf.columns.values) + ['units']]

In [489]:
cantID = built[~built.MapBlkLot_Master.isin(finalDf.MapBlkLot_Master)]

In [490]:
cantID.shape

(28, 23)

In [491]:
cantID.CANTID_blklot_backup.notna().sum()

9

#### Can I identify missing permits by geo? Only one more.

In [494]:
gpd.sjoin(cantID, rhna_ppermits).MapBlkLot_Master.unique()

array(['4044031'], dtype=object)

#### Do these unidentified parcels have different geometries I can try in Parcels?

No. I looked at parcels whose mapblocklot matched multiple rows in `allParcels`, and none of them are among the parcels that were developed 2007-2015.

In [495]:
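# `dbi` is not defined in this notebook; assuming it refers to `dbi_permits`.
remaining = dbi_permits[
    (dbi_permits['units'] > 0)
    & (dbi_permits['permit_type'].isin([1, 2, 3, 8]))
    & (dbi_permits.blocklot.isin(cantID.MapBlkLot_Master))]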

In [496]:
remaining.blocklot.nunique()

5

In [497]:
dbi_units_remain = remaining.groupby(['blocklot'], sort=False)['units'].median()

built_poisson_remainder = cantID.merge(dbi_units_remain, how='inner', left_on='MapBlkLot_Master', right_on='blocklot')
built_poisson_remainder.MapBlkLot_Master.nunique()

5

In [498]:
built_poisson_remainder = built_poisson_remainder[list(trainDf.columns.values) + ['units']]

In [499]:
all_match_last = pd.concat((finalDf, built_poisson_remainder), axis=0)

In [500]:
all_match_last.shape

(284, 18)

In [501]:
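# `dbi` is not defined in this notebook; assuming it refers to `dbi_permits`.
remaining2 = dbi_permits[
    (dbi_permits['units'] > 0)
    & (dbi_permits['permit_type'].isin([1, 2, 3, 8]))]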

In [502]:
cantID2 = cantID[~cantID.MapBlkLot_Master.isin(built_poisson_remainder.MapBlkLot_Master)].copy()

In [503]:
lastgeo = gpd.sjoin(cantID2, remaining2, how="inner", predicate='contains')

In [504]:
lastgeoUq = lastgeo[~lastgeo.MapBlkLot_Master.duplicated()]

In [505]:
lastgeoUq = lastgeoUq[list(trainDf.columns.values) + ['units']]

In [506]:
fdf = pd.concat((all_match_last, lastgeoUq), axis=0)

In [510]:
type(all_match_last), type(lastgeoUq)

(pandas.core.frame.DataFrame, pandas.core.frame.DataFrame)

In [507]:
fdf.shape

(288, 18)

In [508]:
fdf['area'] = fdf.to_crs(5070).geometry.area

AttributeError: 'DataFrame' object has no attribute 'to_crs'
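The `AttributeError` is consistent with how geopandas handles column selection: selecting a list of columns that omits the geometry column returns a plain pandas `DataFrame`. Since `trainDf.columns` appears not to include `geometry`, the selections of the form `[list(trainDf.columns.values) + ['units']]` would drop it. The sketch below reproduces the behavior on invented points; adding `'geometry'` to the column list is one possible fix, not something this notebook has verified end to end.

```python
import geopandas as gpd
from shapely.geometry import Point

# Invented parcels with point geometries.
toy = gpd.GeoDataFrame(
    {"MapBlkLot_Master": ["0001001", "0002003"], "units": [3, 10]},
    geometry=[Point(0, 0), Point(1, 1)],
    crs="EPSG:4326",
)

attribute_cols = ["MapBlkLot_Master", "units"]

# Omitting the geometry column yields a plain DataFrame (no to_crs).
without_geometry = toy[attribute_cols]

# Keeping the geometry column preserves the GeoDataFrame and its CRS.
with_geometry = toy[attribute_cols + ["geometry"]]

lost_geometry = not isinstance(without_geometry, gpd.GeoDataFrame)
kept_geometry = isinstance(with_geometry, gpd.GeoDataFrame)
```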

In [279]:
fdf.to_file('clean_built_data.geojson')

In [282]:
trainDf.MapBlkLot_Master.isin(fdf.MapBlkLot_Master).sum()

580