In [1]:
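# Data handling, plotting, and modeling libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
%matplotlib inline

# Original housing dataset provided by the Flatiron School
orig_df = pd.read_csv('data/original.csv')

# Additional King County Assessor tables; low_memory=False reads each file
# in one pass so pandas does not infer mixed dtypes chunk by chunk
res = pd.read_csv('data/residential.csv', low_memory=False)
parcel = pd.read_csv('data/parcel.csv', low_memory=False)
sales = pd.read_csv('data/sales.csv', low_memory=False)

# Zip code population and density from OpenDataSoft
ods = pd.read_csv('data/open_datasoft.csv')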

### Original Dataset
For this project I used the dataset provided by the Flatiron School, which comes from [King County Assessor Data](https://info.kingcounty.gov/assessor/DataDownload/default.aspx).

Here are the variables I used, including their descriptions:
* **id** - Unique identifier for a house
* **date** - Date house was sold
* **price** - Sale price (prediction target)
* **bedrooms** - Number of bedrooms
* **bathrooms** - Number of bathrooms
* **sqft_living** - Square footage of living space in the home
* **sqft_lot** - Square footage of the lot
* **floors** - Number of floors (levels) in house
* **waterfront** - Whether the house is on a waterfront
* **greenbelt** - Whether the house is adjacent to a green belt
* **nuisance** - Whether the house has traffic noise or other recorded nuisances
* **view** - Quality of view from house
* **condition** - Overall condition of the house, related to how well it has been maintained
* **grade** - Overall grade of the house, related to its construction and design
* **sqft_basement** - Square footage of the basement
* **yr_built** - Year when house was built
* **yr_renovated** - Year when house was renovated
* **address** - The street address

In [2]:
orig_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30155 entries, 0 to 30154
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             30155 non-null  int64  
 1   date           30155 non-null  object 
 2   price          30155 non-null  float64
 3   bedrooms       30155 non-null  int64  
 4   bathrooms      30155 non-null  float64
 5   sqft_living    30155 non-null  int64  
 6   sqft_lot       30155 non-null  int64  
 7   floors         30155 non-null  float64
 8   waterfront     30155 non-null  object 
 9   greenbelt      30155 non-null  object 
 10  nuisance       30155 non-null  object 
 11  view           30155 non-null  object 
 12  condition      30155 non-null  object 
 13  grade          30155 non-null  object 
 14  sqft_basement  30155 non-null  int64  
 15  yr_built       30155 non-null  int64  
 16  yr_renovated   30155 non-null  int64  
 17  address        30155 non-null  object 
dtypes: flo

In [3]:
orig_df.head(3)

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,greenbelt,nuisance,view,condition,grade,sqft_basement,yr_built,yr_renovated,address
0,7399300360,5/24/2022,675000.0,4,1.0,1180,7140,1.0,NO,NO,NO,NONE,Good,7 Average,0,1969,0,"2102 Southeast 21st Court, Renton, Washington ..."
1,8910500230,12/13/2021,920000.0,5,2.5,2770,6703,1.0,NO,NO,YES,AVERAGE,Average,7 Average,1570,1950,0,"11231 Greenwood Avenue North, Seattle, Washing..."
2,1180000275,9/29/2021,311000.0,6,2.0,2880,6156,1.0,NO,NO,NO,AVERAGE,Average,7 Average,1580,1956,0,"8504 South 113th Street, Seattle, Washington 9..."
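
The preview above shows `date` stored as text, `grade` as a string such as `7 Average`, and the yes/no columns as `YES`/`NO`. A minimal sketch of how these could be converted for modeling follows. It runs on a toy frame copied from the three preview rows, and the column names `grade_num` and `nuisance_flag` are hypothetical, not names used elsewhere in this notebook.

```python
import pandas as pd

# Toy frame copied from the three preview rows above (selected columns only)
demo = pd.DataFrame({
    "date": ["5/24/2022", "12/13/2021", "9/29/2021"],
    "grade": ["7 Average", "7 Average", "7 Average"],
    "nuisance": ["NO", "YES", "NO"],
})

# Parse month/day/year strings into datetimes
demo["date"] = pd.to_datetime(demo["date"], format="%m/%d/%Y")

# Keep the leading integer of the grade label (hypothetical column name)
demo["grade_num"] = demo["grade"].str.extract(r"^(\d+)", expand=False).astype(int)

# Encode YES/NO as 1/0 (hypothetical column name)
demo["nuisance_flag"] = demo["nuisance"].eq("YES").astype(int)

print(demo.dtypes)
```

The same pattern would apply to `waterfront` and `greenbelt`, assuming they only contain `YES` and `NO` as the preview suggests.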


### Additional Datasets
I also incorporated additional datasets from [King County Assessor Data](https://info.kingcounty.gov/assessor/DataDownload/default.aspx): 
* Residential Building
* Parcel
* Real Property Sales

Each of these datasets includes `Major` and `Minor` columns, which hold six- and four-character codes respectively and may have leading zeros. Concatenated, they form a unique 10-character `id`, which is used for merging.
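
Because pandas drops leading zeros when it reads these columns as integers, the codes need to be zero-padded before they are concatenated. The sketch below shows one way to do this. The helper `make_parcel_id` is a hypothetical name, the toy rows are taken from the previews in this notebook, and the sketch assumes every `Major` and `Minor` value is numeric, which may not hold for columns that load as `object`.

```python
import pandas as pd

def make_parcel_id(df: pd.DataFrame) -> pd.Series:
    """Build a 10-character id from Major (6 chars) and Minor (4 chars).

    Hypothetical helper: assumes both columns hold numeric codes, either
    as integers or as strings of digits.
    """
    major = df["Major"].astype(str).str.strip().str.zfill(6)
    minor = df["Minor"].astype(str).str.strip().str.zfill(4)
    return major + minor

# Toy rows: one from the residential preview, one from the parcel preview
demo = pd.DataFrame({"Major": [12603, 714760], "Minor": [9624, 85]})
demo["id"] = make_parcel_id(demo)
print(demo)

# The original dataset stores id as int64, so it needs the same padding
# before it can be matched against the string ids built above
orig_ids = pd.Series([7399300360, 1180000275]).astype(str).str.zfill(10)
print(orig_ids)
```

Keeping the key as a string on both sides avoids silent mismatches between an integer `id` and a zero-padded one.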

In [4]:
res.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 524552 entries, 0 to 524551
Data columns (total 13 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   Major            524552 non-null  int64  
 1   Minor            524552 non-null  int64  
 2   ZipCode          472502 non-null  object 
 3   Stories          524552 non-null  float64
 4   BldgGrade        524552 non-null  int64  
 5   SqFtTotLiving    524552 non-null  int64  
 6   SqFtTotBasement  524552 non-null  int64  
 7   SqFtFinBasement  524552 non-null  int64  
 8   Bedrooms         524552 non-null  int64  
 9   BathFullCount    524552 non-null  int64  
 10  YrBuilt          524552 non-null  int64  
 11  YrRenovated      524552 non-null  int64  
 12  Condition        524552 non-null  int64  
dtypes: float64(1), int64(11), object(1)
memory usage: 52.0+ MB


In [5]:
res.head(3)

Unnamed: 0,Major,Minor,ZipCode,Stories,BldgGrade,SqFtTotLiving,SqFtTotBasement,SqFtFinBasement,Bedrooms,BathFullCount,YrBuilt,YrRenovated,Condition
0,12603,9624,98133,1.0,8,1810,720,420,3,2,1982,0,3
1,12603,9625,98177,1.5,7,4340,2320,1740,3,2,1994,0,3
2,12603,9628,98133,1.0,7,1800,660,660,4,1,1982,0,3


In [6]:
parcel.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 622667 entries, 0 to 622666
Data columns (total 20 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   Major                622667 non-null  int64 
 1   Minor                622667 non-null  int64 
 2   PropType             622667 non-null  object
 3   SqFtLot              622667 non-null  int64 
 4   MtRainier            622667 non-null  int64 
 5   Olympics             622667 non-null  int64 
 6   Cascades             622667 non-null  int64 
 7   Territorial          622667 non-null  int64 
 8   SeattleSkyline       622667 non-null  int64 
 9   PugetSound           622667 non-null  int64 
 10  LakeWashington       622667 non-null  int64 
 11  LakeSammamish        622667 non-null  int64 
 12  SmallLakeRiverCreek  622667 non-null  int64 
 13  OtherView            622667 non-null  int64 
 14  WfntLocation         622667 non-null  int64 
 15  TrafficNoise         622667 non-nu

In [7]:
parcel.head(3)

Unnamed: 0,Major,Minor,PropType,SqFtLot,MtRainier,Olympics,Cascades,Territorial,SeattleSkyline,PugetSound,LakeWashington,LakeSammamish,SmallLakeRiverCreek,OtherView,WfntLocation,TrafficNoise,AirportNoise,PowerLines,OtherNuisances,AdjacentGreenbelt
0,714760,85,R,16693,0,0,0,0,0,0,0,0,0,0,0,0,0,N,N,N
1,739920,210,R,8686,0,0,0,0,0,0,0,0,0,0,0,0,0,N,N,N
2,510140,8598,R,6434,0,0,0,0,0,0,0,0,0,0,0,0,0,N,N,N


In [8]:
sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2281280 entries, 0 to 2281279
Data columns (total 5 columns):
 #   Column        Dtype 
---  ------        ----- 
 0   Major         object
 1   Minor         object
 2   DocumentDate  object
 3   SalePrice     int64 
 4   PrincipalUse  int64 
dtypes: int64(2), object(3)
memory usage: 87.0+ MB


In [9]:
sales.head(3)

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,PrincipalUse
0,4000,228,04/29/1997,103500,0
1,799671,190,06/26/2019,0,6
2,327620,100,09/01/2020,430000,6


Lastly, I used a dataset from [OpenDataSoft](https://data.opendatasoft.com/) to gather information about King County zip codes. This included:

* **Zip Code** - The 5-digit zip code assigned by the U.S. Postal Service. Only includes zip codes in King County, WA.
* **Population** - An estimate of the zip code's population.
* **Density** - The estimated population per square kilometer.  

In [10]:
ods.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87 entries, 0 to 86
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Zip Code    87 non-null     int64  
 1   Population  87 non-null     float64
 2   Density     87 non-null     float64
dtypes: float64(2), int64(1)
memory usage: 2.2 KB


In [11]:
ods.head(3)

Unnamed: 0,Zip Code,Population,Density
0,98029,29250.0,1261.5
1,98070,10291.0,107.6
2,98074,28775.0,1041.4
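
The zip code columns differ across tables: `ZipCode` in the residential data is an `object` column with missing values, while `Zip Code` here is `int64`. If the tables are joined on zip code, both keys would first need a common form. The following is a sketch under that assumption. The toy residential values (including the ZIP+4 and padded strings), the helper `clean_zip`, and the key name `zip5` are all hypothetical; the zip code rows reuse the preview above.

```python
import pandas as pd

def clean_zip(s: pd.Series) -> pd.Series:
    """Return the first five-digit run of each value as a string (NA if none)."""
    return s.astype("string").str.extract(r"(\d{5})", expand=False)

# Hypothetical residential values illustrating messy zip code strings
res_demo = pd.DataFrame({"ZipCode": ["98029", "98074-1234", None, " 98070"]})

# Rows copied from the zip code preview above
ods_demo = pd.DataFrame({
    "Zip Code": [98029, 98070, 98074],
    "Population": [29250.0, 10291.0, 28775.0],
    "Density": [1261.5, 107.6, 1041.4],
})

# Build a shared string key on both sides, then left-join so that
# residential rows without a usable zip code are kept with missing values
left = res_demo.assign(zip5=clean_zip(res_demo["ZipCode"]))
right = ods_demo.assign(zip5=clean_zip(ods_demo["Zip Code"]))
merged = left.merge(right, on="zip5", how="left")
print(merged)
```

A left join keeps every residential row, so rows with a missing or unmatched zip code can be counted before deciding how to handle them.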
