In [1]:
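# Data handling, plotting, and modeling libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
%matplotlib inline

# Original dataset provided by the Flatiron School
orig_df = pd.read_csv('data/original.csv')

# Additional King County Assessor tables; low_memory=False reads each
# file in a single pass so column dtypes are inferred consistently
res = pd.read_csv('data/residential.csv', low_memory=False)
parcel = pd.read_csv('data/parcel.csv', low_memory=False)
sales = pd.read_csv('data/sales.csv', low_memory=False)

# OpenDataSoft zip code table
# NOTE: with sep=';' this file loads as a single column (see ods.info())
ods = pd.read_csv('data/open_datasoft.csv', sep=';')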

### Original Dataset
For this project I used the dataset provided by the Flatiron School, which comes from [King County Assessor Data](https://info.kingcounty.gov/assessor/DataDownload/default.aspx).

Here are the variables I used, including their descriptions:
* **id** - Unique identifier for a house
* **date** - Date house was sold
* **price** - Sale price (prediction target)
* **bedrooms** - Number of bedrooms
* **bathrooms** - Number of bathrooms
* **sqft_living** - Square footage of living space in the home
* **sqft_lot** - Square footage of the lot
* **floors** - Number of floors (levels) in house
* **waterfront** - Whether the house is on a waterfront
* **greenbelt** - Whether the house is adjacent to a green belt
* **nuisance** - Whether the house has traffic noise or other recorded nuisances
* **view** - Quality of view from house
* **condition** - Overall condition of the house, related to how well it has been maintained
* **grade** - Overall grade of the house, related to the quality of its construction and design
* **sqft_basement** - Square footage of the basement
* **yr_built** - Year when house was built
* **yr_renovated** - Year when house was renovated
* **address** - The street address

In [2]:
orig_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30155 entries, 0 to 30154
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             30155 non-null  int64  
 1   date           30155 non-null  object 
 2   price          30155 non-null  float64
 3   bedrooms       30155 non-null  int64  
 4   bathrooms      30155 non-null  float64
 5   sqft_living    30155 non-null  int64  
 6   sqft_lot       30155 non-null  int64  
 7   floors         30155 non-null  float64
 8   waterfront     30155 non-null  object 
 9   greenbelt      30155 non-null  object 
 10  nuisance       30155 non-null  object 
 11  view           30155 non-null  object 
 12  condition      30155 non-null  object 
 13  grade          30155 non-null  object 
 14  sqft_basement  30155 non-null  int64  
 15  yr_built       30155 non-null  int64  
 16  yr_renovated   30155 non-null  int64  
 17  address        30155 non-null  object 
dtypes: float64(3), int64(7), object(8)
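
The `info()` output above shows `date`, `waterfront`, and several other columns stored as `object`, so they need converting before they can be used in a model. The block below is a minimal sketch of that kind of conversion on a tiny made-up frame. The sample values, the `%m/%d/%Y` date format, the `"YES"`/`"NO"` encoding, the reading of `yr_renovated == 0` as "never renovated", and the new column names `sale_year` and `renovated` are all assumptions for illustration, not something confirmed by the dataset.

```python
import pandas as pd

# Made-up rows; only the column names come from the dataset above.
sample = pd.DataFrame({
    "date": ["5/24/2022", "12/13/2021", "9/29/2021"],
    "waterfront": ["NO", "YES", "NO"],
    "yr_renovated": [0, 1995, 0],
})

# Parse the sale date (assumed month/day/year format) and pull out the year.
sample["date"] = pd.to_datetime(sample["date"], format="%m/%d/%Y")
sample["sale_year"] = sample["date"].dt.year

# Assumed encoding: "YES"/"NO" strings become 1/0 flags.
sample["waterfront"] = (sample["waterfront"] == "YES").astype(int)

# Assumed convention: a yr_renovated of 0 means the house was never renovated.
sample["renovated"] = (sample["yr_renovated"] > 0).astype(int)

print(sample.dtypes)
```

Checking `orig_df['date'].head()` and `orig_df['waterfront'].unique()` first would show whether these assumptions hold for the real data.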

### Additional Datasets
I also incorporated additional datasets from [King County Assessor Data](https://info.kingcounty.gov/assessor/DataDownload/default.aspx): 
* Residential Building
* Parcel
* Real Property Sales

In [3]:
res.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 524552 entries, 0 to 524551
Data columns (total 13 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   Major            524552 non-null  int64  
 1   Minor            524552 non-null  int64  
 2   ZipCode          472502 non-null  object 
 3   Stories          524552 non-null  float64
 4   BldgGrade        524552 non-null  int64  
 5   SqFtTotLiving    524552 non-null  int64  
 6   SqFtTotBasement  524552 non-null  int64  
 7   SqFtFinBasement  524552 non-null  int64  
 8   Bedrooms         524552 non-null  int64  
 9   BathFullCount    524552 non-null  int64  
 10  YrBuilt          524552 non-null  int64  
 11  YrRenovated      524552 non-null  int64  
 12  Condition        524552 non-null  int64  
dtypes: float64(1), int64(11), object(1)
memory usage: 52.0+ MB
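
In the output above, `ZipCode` is the only column with missing values (472,502 non-null out of 524,552 rows, so 52,050 missing) and it is stored as `object`. The sketch below shows one possible way to normalize such a column to five-digit strings. The sample values are made up, and the idea that the real column contains ZIP+4 codes or stray whitespace is an assumption.

```python
import pandas as pd

# Made-up examples of the kinds of values an object-typed zip column might hold.
zips = pd.Series(["98103", "98052-1234", " 98004", None, "A"], name="ZipCode")

# Strip whitespace, then keep only a leading run of exactly five digits.
# Anything that does not start with five digits becomes NaN.
clean = zips.str.strip().str.extract(r"^(\d{5})", expand=False)

print(clean)
print("missing after cleaning:", clean.isna().sum())
```

Keeping the result as a string rather than an integer preserves any leading zeros and makes it easier to match against other zip code tables.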


In [4]:
parcel.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 622667 entries, 0 to 622666
Data columns (total 20 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   Major                622667 non-null  int64 
 1   Minor                622667 non-null  int64 
 2   PropType             622667 non-null  object
 3   SqFtLot              622667 non-null  int64 
 4   MtRainier            622667 non-null  int64 
 5   Olympics             622667 non-null  int64 
 6   Cascades             622667 non-null  int64 
 7   Territorial          622667 non-null  int64 
 8   SeattleSkyline       622667 non-null  int64 
 9   PugetSound           622667 non-null  int64 
 10  LakeWashington       622667 non-null  int64 
 11  LakeSammamish        622667 non-null  int64 
 12  SmallLakeRiverCreek  622667 non-null  int64 
 13  OtherView            622667 non-null  int64 
 14  WfntLocation         622667 non-null  int64 
 15  TrafficNoise         622667 non-nu

In [5]:
sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2281280 entries, 0 to 2281279
Data columns (total 5 columns):
 #   Column        Dtype 
---  ------        ----- 
 0   Major         object
 1   Minor         object
 2   DocumentDate  object
 3   SalePrice     int64 
 4   PrincipalUse  int64 
dtypes: int64(2), object(3)
memory usage: 87.0+ MB
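
All three assessor tables share `Major` and `Minor` columns, but `sales` stores them as `object` while `res` and `parcel` store them as `int64`, so they cannot be merged directly. The block below is a rough sketch of building a common key first. It assumes the King County parcel number is a 6-digit `Major` followed by a 4-digit `Minor`, both zero-padded; the helper `make_parcel_id`, the `parcel_id` column, and the sample rows are hypothetical.

```python
import pandas as pd

def make_parcel_id(df):
    """Build a 10-character key from Major and Minor (assumed 6 + 4 digits)."""
    # Coerce to numbers so that '098765' and 98765 end up identical;
    # values that are not numeric become missing.
    major = pd.to_numeric(df["Major"], errors="coerce").astype("Int64")
    minor = pd.to_numeric(df["Minor"], errors="coerce").astype("Int64")
    # Zero-pad each part and concatenate; missing parts give a missing key.
    return major.astype("string").str.zfill(6) + minor.astype("string").str.zfill(4)

# Made-up rows that mimic the dtypes shown in the info() outputs.
sales_s = pd.DataFrame({
    "Major": ["098765", "123456", "bad"],
    "Minor": ["0010", "0200", "0001"],
    "SalePrice": [500000, 750000, 0],
})
res_s = pd.DataFrame({
    "Major": [98765, 123456],
    "Minor": [10, 200],
    "Bedrooms": [3, 4],
})

sales_s["parcel_id"] = make_parcel_id(sales_s)
res_s["parcel_id"] = make_parcel_id(res_s)

# Drop rows whose key could not be built, then join on the shared key.
merged = sales_s.dropna(subset=["parcel_id"]).merge(
    res_s[["parcel_id", "Bedrooms"]], on="parcel_id", how="inner"
)
print(merged)
```

If the `id` column in the original dataset follows the same 10-digit layout, the same key could be used to join these tables to `orig_df`, but that would need checking against the data.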


Lastly, I used a dataset from [OpenDataSoft](https://data.opendatasoft.com/) to gather information about King County zip codes. 

In [6]:
ods.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87 entries, 0 to 86
Data columns (total 1 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Zip Code,Population,Density  87 non-null     object
dtypes: object(1)
memory usage: 824.0+ bytes
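
The output above shows a single column named `Zip Code,Population,Density`, which suggests the file is comma-delimited and that `sep=';'` did not split it. Here is a small sketch of two possible fixes, using made-up rows in place of the real file: re-reading with a comma separator, or splitting the column that was already loaded.

```python
import io
import pandas as pd

# Made-up contents standing in for data/open_datasoft.csv.
raw = "Zip Code,Population,Density\n98001,31911,1786.2\n98002,31647,4412.7\n"

# Reading with sep=';' reproduces the single-column result shown above.
wrong = pd.read_csv(io.StringIO(raw), sep=";")

# Option 1: re-read with a comma separator, keeping zip codes as strings.
right = pd.read_csv(io.StringIO(raw), sep=",", dtype={"Zip Code": str})

# Option 2: split the already-loaded column on commas and reuse the
# combined header as the new column names (all values stay strings).
fixed = wrong.iloc[:, 0].str.split(",", expand=True)
fixed.columns = wrong.columns[0].split(",")

print(right.dtypes)
print(fixed.head())
```

Option 1 lets pandas infer numeric dtypes for `Population` and `Density`; with option 2 those columns would still need converting, for example with `pd.to_numeric`.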
