## Introduction

This analysis seeks to identify and substantiate observations about how various attributes of a home affect its value. The project combines multiple publicly available datasets [made available](https://data.kingcounty.gov/) by King County. One dataset provides a record of home and land sales alongside various identifying features. The second dataset gives very granular data about housing in King County, going beyond total square footage to a room-by-room breakdown and many descriptive features such as porch size. Combining these two datasets makes the analysis that follows possible.

This notebook presents a condensed version of the analysis; it leaves out much of the detail behind the choice of specific models. For a deep dive, please refer to the notebooks in the repository folder notebooks->exploratory.

### The Process
1. Basic Setup and Data Assembly
2. Data Aggregation and Cleaning
3. Feature Selection and Creation
4. The Model
5. The Findings

## 1. Basic Setup and Data Assembly

In [None]:
# Import necessary packages
import numpy as np
import pandas as pd
import sqlite3
import os, sys

# Import functions from a Python file in this repository with context-relevant functionality
path_to_src = os.path.join('..', '..', 'src')
sys.path.insert(1, path_to_src)
from custom_functions import *
%load_ext autoreload
%autoreload 2

##### Import King County housing data

In [None]:
# Read csv files from the data directory
df_lookup = pd.read_csv(os.path.join('..','..', 'data', 'raw', 'EXTR_LookUp.csv'), dtype='str')
df_resbldg = pd.read_csv(os.path.join('..','..', 'data', 'raw', 'EXTR_ResBldg.csv'), dtype='str')
df_rpsale = pd.read_csv(os.path.join('..','..', 'data', 'raw', 'EXTR_RpSale.csv'), dtype='str')

# Use the user-defined strip_spaces function to remove leading and trailing spaces from the entire dataframe
df_lookup = strip_spaces(df_lookup)
df_resbldg = strip_spaces(df_resbldg)
df_rpsale = strip_spaces(df_rpsale)
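
The `strip_spaces` helper is imported from `custom_functions` and its source is not shown in this notebook. As a rough illustration only, a helper with this behavior could look like the sketch below. The name `strip_spaces_sketch` and the toy data are hypothetical, and the real implementation in the repository may differ.

```python
import pandas as pd

def strip_spaces_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for strip_spaces (not the repository's code)."""
    out = df.copy()
    for col in out.columns:
        # Only string-like columns have whitespace to trim
        if pd.api.types.is_string_dtype(out[col]):
            out[col] = out[col].str.strip()
    return out

# Toy data (made up) with stray leading and trailing spaces
toy = pd.DataFrame({'Major': [' 000123', '000456 '],
                    'Minor': ['0001 ', ' 0002']})
cleaned = strip_spaces_sketch(toy)
print(cleaned)
```

Trimming matters here because *Major* and *Minor* are later concatenated into an identifier, and a stray space would prevent otherwise identical identifiers from matching.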

##### Eliminate unnecessary data. After close investigation, the columns below were deemed the most worthy of continued analysis.

In [None]:
# Manual selection of the features of choice
resbldg_desired_columns = ['Major', 'Minor', 'NbrLivingUnits', 'Stories', 'BldgGrade', 
                           'BldgGradeVar', 'SqFt1stFloor', 'SqFtHalfFloor', 'SqFt2ndFloor',
                           'SqFtUpperFloor', 'SqFtUnfinFull', 'SqFtUnfinHalf', 'SqFtTotLiving', 'SqFtTotBasement', 
                           'SqFtFinBasement', 'FinBasementGrade', 'SqFtGarageBasement', 'SqFtGarageAttached', 
                           'DaylightBasement','SqFtOpenPorch', 'SqFtEnclosedPorch', 'SqFtDeck', 'HeatSystem',
                           'HeatSource', 'BrickStone', 'ViewUtilization', 'Bedrooms','BathHalfCount', 
                           'Bath3qtrCount', 'BathFullCount', 'FpSingleStory','FpMultiStory', 'FpFreestanding', 
                           'FpAdditional', 'YrBuilt','YrRenovated', 'PcntComplete', 'Obsolescence', 
                           'PcntNetCondition','Condition']
rpsale_desired_columns = ['ExciseTaxNbr', 'Major', 'Minor', 'DocumentDate', 'SalePrice', 'RecordingNbr', 'PropertyType', 
                          'PrincipalUse', 'SaleInstrument', 'AFForestLand', 'AFCurrentUseLand', 'AFNonProfitUse', 
                          'AFHistoricProperty', 'SaleReason', 'PropertyClass', 'SaleWarning']

# Remove all columns that are not in one of the above two lists.
df_resbldg = df_resbldg[resbldg_desired_columns].copy()
df_rpsale = df_rpsale[rpsale_desired_columns].copy()

##### Create identifier that will be used to connect the two dataframes. 
In this case, each dataset provides *Major* and *Minor*, which serve as location-specific identifiers. From here on, the combination of *Major* and *Minor* will simply be referred to as the *parcel*. A parcel often has more than one sale associated with it, so the parcel alone does not identify a single sale, but it is a good starting point for narrowing the search. The goal is to reduce the *Sales* dataset to one sale per parcel, which allows it to be joined to the second dataset, *Residential Buildings*.

In [None]:
# Create ParcelIDs
df_rpsale['Parcel_ID'] = df_rpsale.Major + '-' + df_rpsale.Minor
df_resbldg['Parcel_ID'] = df_resbldg.Major + '-' + df_resbldg.Minor

##### The *Sales* database: some of the nitty gritty data selection

In [None]:
# Select only sales for "Residential" plots, corresponding to code #6, as can be found in the data dictionary
# This eliminates Commercial, Condominium, Apartment, etc.
df_rpsale['PrincipalUse'] = elimination_by_code(df_rpsale['PrincipalUse'], '6')

# PropertyClass is another distinction between Commercial/Industrial and Residential, as well as 
# other fundamental features. Code #8 corresponds to Residential Improved property
df_rpsale['PropertyClass'] = elimination_by_code(df_rpsale['PropertyClass'], '8')

# Yet another classification of property type. Code #11 corresponds to single family households
# Here we eliminate multiple family residences, alongside many commercial uses
df_rpsale['PropertyType'] = elimination_by_code(df_rpsale['PropertyType'], '11')
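
The `elimination_by_code` helper also comes from `custom_functions` and is not shown here. Because its result is assigned back to a single column, one plausible reading is that it keeps values equal to the given code and blanks out everything else, leaving the actual row removal to a later step. The sketch below illustrates that reading only; the function name, the toy data, and the follow-up `dropna` are all assumptions rather than the repository's code.

```python
import pandas as pd

def elimination_by_code_sketch(column: pd.Series, code: str) -> pd.Series:
    """Hypothetical stand-in: keep entries equal to `code`, set the rest to missing."""
    return column.where(column == code)

# Toy sales records (made up); codes are strings, as in the raw CSVs
toy_sales = pd.DataFrame({
    'Parcel_ID': ['000123-0001', '000123-0002', '000456-0001'],
    'PrincipalUse': ['6', '2', '6'],
    'PropertyType': ['11', '11', '3'],
})

for col, code in [('PrincipalUse', '6'), ('PropertyType', '11')]:
    toy_sales[col] = elimination_by_code_sketch(toy_sales[col], code)

# Assumed follow-up: drop rows in which any filtered column was blanked out
residential = toy_sales.dropna(subset=['PrincipalUse', 'PropertyType'])
print(residential)
```

Under this reading, only the first toy row survives, since it is the only one that carries both the residential use code and the single-family type code.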

##### Limit scope to 2019 sales

In [None]:
# Type conversion: parse the date strings into pandas datetimes
df_rpsale['DocumentDate'] = pd.to_datetime(df_rpsale['DocumentDate'])

# Isolate SaleYear as its own column
df_rpsale['SaleYear'] = df_rpsale['DocumentDate'].dt.year

# Eliminate rows corresponding to sales in a year other than 2019
df_rpsale = df_rpsale.loc[df_rpsale['SaleYear']==2019].copy()

##### Eliminate unrealistically small sales

In [None]:
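# Price threshold in dollars; sales at or below it are treated as unrealistic
min_acceptable_sale_price = 25000

# SalePrice was read in as a string, so convert it before comparing
df_rpsale['SalePrice'] = df_rpsale.SalePrice.astype('int')

# Keep only sales strictly above the threshold
df_rpsale = df_rpsale.loc[df_rpsale.SalePrice > min_acceptable_sale_price].copy()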

##### Create column to identify duplicates, a necessary process before combining the two datasets

In [None]:
# Count how many sales share each Parcel_ID and attach that count to every row
df_rpsale['SaleCount'] = df_rpsale['Parcel_ID'].map(df_rpsale['Parcel_ID'].value_counts())
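
To show how a per-parcel sale count supports the goal of one sale per parcel, here is a minimal sketch on made-up data. Keeping the most recent sale is an assumption chosen for illustration; this notebook section does not state which sale is ultimately kept.

```python
import pandas as pd

# Toy sales records (made up): parcel 'A-1' sold twice in 2019
toy = pd.DataFrame({
    'Parcel_ID': ['A-1', 'A-1', 'B-2'],
    'DocumentDate': pd.to_datetime(['2019-03-01', '2019-09-15', '2019-06-30']),
    'SalePrice': [400000, 450000, 600000],
})

# Number of sales recorded for each row's parcel
toy['SaleCount'] = toy['Parcel_ID'].map(toy['Parcel_ID'].value_counts())

# Assumed rule: keep only the latest sale for each parcel
latest = (toy.sort_values('DocumentDate')
             .drop_duplicates(subset='Parcel_ID', keep='last')
             .sort_values('Parcel_ID')
             .reset_index(drop=True))
print(latest)
```

Rows with a `SaleCount` above 1 are the ones that need a tie-breaking rule before the sales can be matched one-to-one with building records.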