## Introducing Parcel Data

This notebook uses California parcel data to match Zillow records to the correct building footprints. Joining through parcels avoids the missing data that occurred when Zillow data points did not fall exactly on the footprints of multi-unit housing complexes.

Steps:
1. Load parcel and Zillow data
2. Spatially join Zillow data to parcel data (keeping only parcels with Zillow points)
3. Load building footprint data
4. Spatially join parcel data to building footprint data (keeping only buildings within Zillow parcels)
5. Verify that Zillow data is correctly applied to each multi-unit building
6. Fit a linear regression to predict the number of units for buildings that are missing unit data

In [None]:
import pandas as pd
from shapely.geometry import box
import numpy as np
import geopandas as gpd
import os
import matplotlib.pyplot as plt
import zipfile

from sklearn.linear_model import LinearRegression

### Load data

**Zillow**

In [None]:
# read in the Zillow geodata (takes about 10 minutes)
fp = os.path.join('data', 'final_zillow.gpkg')
zillow = gpd.read_file(fp)

**Building footprints (from parquet file)**

In [None]:
# read in original building data
# specify tile download url; this url is for area containing Santa Barbara, CA
url = 'https://data.source.coop/tge-labs/globalbuildingatlas-lod1/w120_n35_w115_n30.parquet'

# read the parquet file into a DataFrame
building_pqt = pd.read_parquet(url)

In [None]:
# build a rectangular geometry from each row's `bbox` limits using shapely (also takes some time)
building_pqt["geometry"] = building_pqt["bbox"].apply(
    lambda b: box(b["xmin"], b["ymin"], b["xmax"], b["ymax"])
)
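A small sketch of what this step produces, assuming the `bbox` column holds the bounding rectangle of each footprint (the L-shaped polygon below is hypothetical, not taken from the dataset). A bounding rectangle is never smaller than the shape it encloses, so areas computed from it are upper bounds on the true footprint area.

```python
from shapely.geometry import Polygon, box

# Hypothetical L-shaped footprint: a 4x4 square with a 2x2 corner removed
footprint = Polygon([(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)])

# Build the bounding rectangle the same way the notebook does from bbox limits
xmin, ymin, xmax, ymax = footprint.bounds
rect = box(xmin, ymin, xmax, ymax)

print(footprint.area)  # 12.0
print(rect.area)       # 16.0
```

For buildings that are close to rectangular and axis-aligned the difference is small; for irregular or rotated buildings it can be substantial.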

In [None]:
# convert dataframe to a geopandas object
building_raw = gpd.GeoDataFrame(building_pqt, geometry="geometry", crs="EPSG:4326")

# confirm transformation worked and we have a geodataframe
building_raw.head()

**Parcel data**

In [None]:
# read in parcel data (from zip file)
with zipfile.ZipFile("data/Parcels_CA_2014.zip", 'r') as z:
    z.extractall("data/")

# Then read the extracted .gdb file
parcels = gpd.read_file("data/Parcels_CA_2014.gdb")

parcels.head()

### Select only residential parcels
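Residential parcels are selected by spatially joining Zillow data to the parcels. The parcel data is stored as a single geometry per county, so it is first exploded into individual parcel polygons.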

In [None]:
# view what parcel geometry looks like for one county
parcels[parcels['County'] == 'Alameda'].plot()

In [None]:
# explode each county's multi-part geometry into one row per parcel polygon
parcels_exploded = parcels.explode(index_parts=False).reset_index(drop=True)

In [None]:
# make sure exploding works
print(f"Original rows: {len(parcels)}")
print(f"Exploded rows: {len(parcels_exploded)}")
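To illustrate what the row-count check should show, here is a minimal sketch on a toy GeoDataFrame (the county name and geometries are made up). One multi-part geometry becomes one row per part, and the attribute columns are repeated on each row.

```python
import geopandas as gpd
from shapely.geometry import MultiPolygon, box

# Hypothetical "county" stored as a single MultiPolygon of three parcels
county = gpd.GeoDataFrame(
    {"County": ["Example"]},
    geometry=[MultiPolygon([box(0, 0, 1, 1), box(2, 0, 3, 1), box(4, 0, 5, 1)])],
    crs="EPSG:4326",
)

# Same call as in the notebook: one row per polygon, with a fresh integer index
parts = county.explode(index_parts=False).reset_index(drop=True)

print(len(county), len(parts))    # 1 3
print(parts.geom_type.tolist())   # ['Polygon', 'Polygon', 'Polygon']
```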

In [None]:
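# residential parcels: keep each parcel that contains at least one Zillow point.
# sjoin keeps the geometry of the left frame, so parcels go on the left
# to retain the parcel polygons rather than the Zillow points.
# Zillow is reprojected to the parcel CRS (assumes both layers have a CRS defined).
matched = gpd.sjoin(
    parcels_exploded,
    zillow[[zillow.geometry.name]].to_crs(parcels_exploded.crs),
    how = "inner",
    predicate = "intersects"
)

# one row per matched parcel, with parcel attributes only
parcels_res = parcels_exploded.loc[matched.index.unique()]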
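The order of the two frames matters here: with `how="left"` or `how="inner"`, `gpd.sjoin` keeps the geometry of the left frame and only the attributes of the right one. A minimal sketch with made-up toy data (`parcel_id` and the coordinates are hypothetical) shows the difference.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Two hypothetical parcels and one Zillow-like point inside the first parcel
parcels_toy = gpd.GeoDataFrame(
    {"parcel_id": [1, 2]},
    geometry=[box(0, 0, 2, 2), box(3, 0, 5, 2)],
    crs="EPSG:4326",
)
points_toy = gpd.GeoDataFrame({"unit": [4]}, geometry=[Point(1, 1)], crs="EPSG:4326")

# Points on the left: the result carries point geometry
points_left = gpd.sjoin(points_toy, parcels_toy, how="left", predicate="intersects")

# Parcels on the left: the result carries parcel polygons, matched parcels only
parcels_left = gpd.sjoin(parcels_toy, points_toy, how="inner", predicate="intersects")

print(points_left.geom_type.tolist())     # ['Point']
print(parcels_left.geom_type.tolist())    # ['Polygon']
print(parcels_left["parcel_id"].tolist()) # [1]
```

If a later step needs polygon areas, the polygon layer has to be the left frame of the join that produces it.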

### Select buildings only in residential parcels

In [None]:
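# keep only buildings that intersect a residential parcel.
# buildings go on the left so the result keeps the building geometry,
# which the area calculation later on depends on.
# `index_right` from any earlier join is dropped to avoid a column-name clash,
# and parcels are reprojected to the building CRS (assumes a CRS is defined).
building_res = gpd.sjoin(
    building_raw,
    parcels_res.drop(columns = "index_right", errors = "ignore").to_crs(building_raw.crs),
    how = "inner",
    predicate = "intersects"
)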

In [None]:
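# drop parcel column(s) that are no longer needed
# NOTE: "insert parcel column name" is a placeholder; replace it with the actual column name before running
building_res = building_res.drop(columns = ["insert parcel column name"])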

### Attach Zillow data to residential buildings

In [None]:
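# keep all residential buildings, and add Zillow points only where they match up
# drop the previous join's `index_right` helper column first to avoid a column-name clash,
# and reproject Zillow to the building CRS (assumes both layers have a CRS defined)
building = gpd.sjoin(
    building_res.drop(columns = "index_right", errors = "ignore"),
    zillow.to_crs(building_res.crs),
    how = "left",
    predicate = "intersects"
)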

### Find volume information from building footprints

In [None]:
# reproject data frame to crs with meters as units
building_m = building.to_crs("EPSG:6933")

In [None]:
# find and create column from polygon area
building_m['area_m2'] = building_m.geometry.area

# rename height column to be clear about units
building_m.rename(columns={"height":"height_m"}, inplace = True)

building_m.head(2)
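The reprojection matters because `.area` is computed in the units of the CRS. A rough sketch, using a hypothetical 0.001-degree square near Santa Barbara, compares the area in EPSG:4326 (square degrees) with the area in the equal-area EPSG:6933 (square meters).

```python
import warnings

import geopandas as gpd
from shapely.geometry import box

# Hypothetical 0.001-degree square near Santa Barbara, CA
square = gpd.GeoSeries([box(-119.700, 34.420, -119.699, 34.421)], crs="EPSG:4326")

# geopandas warns that areas in a geographic CRS are not meaningful
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    area_deg2 = square.area.iloc[0]  # square degrees

# After reprojecting to an equal-area CRS the result is in square meters
area_m2 = square.to_crs("EPSG:6933").area.iloc[0]

print(f"Area in EPSG:4326: {area_deg2:.2e} square degrees")
print(f"Area in EPSG:6933: {area_m2:,.0f} square meters")
```

The second value should be on the order of 10,000 square meters, which is what a roughly 90 m by 110 m square would suggest at this latitude.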

In [None]:
# create volume column
building_m['volume_m3'] = building_m['area_m2'] * building_m['height_m']

building_m.head(2)

In [None]:
# explore number of rows that don't have unit data -- it's a lot
building_m['unit'].isna().sum()

In [None]:
# keep only observations with unit data
building_w_units = building_m[~building_m['unit'].isna()]

# confirm operation worked
building_w_units['unit'].isna().sum()

### Regression analysis

In [None]:
# plot units vs. volume (m^3)
building_w_units.plot(kind='scatter',
              x='volume_m3', 
              y='unit')

In [None]:
# x-values: reshape to (n_samples, 1); -1 lets numpy infer the number of rows
x = np.array(building_w_units['volume_m3']).reshape((-1, 1))
print(f"Input data shape: {x.shape}")

# y-values
y = np.array(building_w_units['unit'])
print(f"Output data shape: {y.shape}")

In [None]:
# Fit model
model = LinearRegression().fit(x, y)

In [None]:
# Coefficient of determination, evaluated on the same data used for fitting
R_sq = model.score(x,y)
print(f"Coefficient of determination (R^2): {R_sq}")

# Retrieve intercept and slope
intercept = model.intercept_
print(f"y-axis intercept: {intercept}")

slope = model.coef_[0]
print(f"Slope: {slope}")
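The last step in the outline, predicting unit counts for buildings without unit data, could look roughly like the sketch below. It uses a small synthetic table rather than the real data; the values and the `unit_pred` column name are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for building_m: two rows are missing unit data
toy = pd.DataFrame({
    "volume_m3": [500.0, 1000.0, 1500.0, 2000.0, 3000.0],
    "unit": [1.0, 2.0, 3.0, np.nan, np.nan],
})

known = toy[toy["unit"].notna()]
missing = toy[toy["unit"].isna()]

# Fit on rows with unit data, then predict for rows without it
toy_model = LinearRegression().fit(known[["volume_m3"]], known["unit"])
predicted = toy_model.predict(missing[["volume_m3"]])

# Store predictions in a separate (hypothetical) column so observed values stay untouched
toy.loc[missing.index, "unit_pred"] = predicted

print(toy)
```

Because the synthetic data follow one unit per 500 m^3 exactly, the predictions come out at approximately 4 and 6 units. On real data a straight line can produce fractional or negative unit counts, so rounding or clipping the predictions is a modeling choice to make explicitly.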