# Initial Analysis of King County Housing Dataset

In [142]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [143]:
# Reading in .csv, assigning to dataframe `df`
df = pd.read_csv('../data/kc_house_data.csv')

# Shows all columns, i.e. forces pandas/Jupyter
# not to truncate dataframe horizontally
pd.set_option('display.max_columns', None)

In [144]:
# Checking out the data
df.head(10)

In [145]:
# Checking out the data types and null/non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21597 non-null  int64  
 1   date           21597 non-null  object 
 2   price          21597 non-null  float64
 3   bedrooms       21597 non-null  int64  
 4   bathrooms      21597 non-null  float64
 5   sqft_living    21597 non-null  int64  
 6   sqft_lot       21597 non-null  int64  
 7   floors         21597 non-null  float64
 8   waterfront     19221 non-null  object 
 9   view           21534 non-null  object 
 10  condition      21597 non-null  object 
 11  grade          21597 non-null  object 
 12  sqft_above     21597 non-null  int64  
 13  sqft_basement  21597 non-null  object 
 14  yr_built       21597 non-null  int64  
 15  yr_renovated   17755 non-null  float64
 16  zipcode        21597 non-null  int64  
 17  lat            21597 non-null  float64
 18  long  

### Initial thoughts

#### The Good
- How should we approach `sqft_living` vs. `sqft_above`? What, if anything, is the difference between these two metrics? `sqft_above` counts living area separate from the basement, but it seems the parameter of interest there is whether or not the house *has* livable basement area.
    - A heat map shows that these variables have a correlation coefficient of `0.88`; given that degree of collinearity, we'll likely need to choose one or the other for a linear model.
- Because `sqft_living` refers to a record and `sqft_living15` refers to properties in the geographical vicinity of that record, we can compare the values stored in these two columns in a given row to determine if a home's square footage is greater than or less than nearby homes.
    - Early on, it might be useful to run some simple linear regressions using these as predictor variables to check goodness of fit, etc.
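
The two checks in this list can be sketched on a small frame. This is only an illustration: the values are invented rather than taken from `kc_house_data.csv`, and `toy`, `living_vs_above`, and `larger_than_neighbors` are hypothetical names.

```python
import pandas as pd

# Toy data, invented for illustration; not rows from kc_house_data.csv
toy = pd.DataFrame({
    'sqft_living':   [1180, 2570, 770, 1960, 1680],
    'sqft_above':    [1180, 2170, 770, 1050, 1680],
    'sqft_living15': [1340, 1690, 2720, 1360, 1800],
})

# Pairwise Pearson correlations; this is the matrix a heat map would display
corr = toy.corr()
living_vs_above = corr.loc['sqft_living', 'sqft_above']

# True where a home's living area exceeds the value recorded for nearby homes
toy['larger_than_neighbors'] = toy['sqft_living'] > toy['sqft_living15']
```

On the real data, the same `.corr()` call on `df` would give the coefficient a heat map displays for `sqft_living` and `sqft_above`.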
    
#### The Less Good
- Is `id` going to be useful in any way? It doesn't seem like an instance where a 'unique identifier' is going to provide us with any information we can use for prediction.
    - However! - see "Notes for Cleaning" below.
- We are working with data from a *single county* - will `lat` and `long` be able to tell us anything that we can't glean from, say, zip code? This might also make our presentation and recommendations unnecessarily complex.

#### The... Not Sure?
- `yr_renovated` has a fairly substantial number of null values (3,842 of 21,597), and an even greater number of records with value `0.0`, which likely indicates that no renovation has been done to the house (need to check data dictionary).
    - Heat map examination also indicates a low correlation, < 0.1, between `yr_built` and `price`.   
- `waterfront` may be useful as a boolean value, i.e. to test whether homes located on water are higher priced than homes that we **know** are not located on water.
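
A minimal sketch of the boolean idea for `waterfront`, assuming the column holds only `YES`, `NO`, and nulls. The rows are invented, and `known`, `is_waterfront`, and `mean_price_by_status` are hypothetical names. Records with unknown status are left out of the comparison rather than counted as "not on water".

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    'waterfront': ['NO', 'YES', None, 'NO'],
    'price':      [300000.0, 1200000.0, 450000.0, 350000.0],
})

# Keep only records whose waterfront status is actually known
known = toy.dropna(subset=['waterfront'])

# Boolean flag: True for waterfront homes, False for known non-waterfront homes
known = known.assign(is_waterfront=known['waterfront'].eq('YES'))

# Mean price for each known status
mean_price_by_status = known.groupby('is_waterfront')['price'].mean()
```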

#### Notes for Cleaning
- Null values exist in the following columns:
    - `waterfront`
    - `view`
    - `yr_renovated`
- Duplicates exist in the `id` column - these are houses that were sold more than once! What are we going to do about this, and why are we going to go about it that way?

In [146]:
# Dealing with duplicates in the `id` column, keeping only the most recent sale

## code goes here
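
One possible shape for this step, sketched on invented rows rather than on `df`. It assumes sale dates are strings in month/day/year form (the `%m/%d/%Y` format is an assumption about the file), and `toy` and `latest_sales` are hypothetical names.

```python
import pandas as pd

# Toy data, invented for illustration; id 101 was sold twice
toy = pd.DataFrame({
    'id':    [101, 102, 101, 103],
    'date':  ['5/6/2014', '12/9/2014', '3/25/2015', '2/18/2015'],
    'price': [400000.0, 538000.0, 430000.0, 604000.0],
})

# Parse sale dates so that sorting is chronological, not alphabetical
toy['date'] = pd.to_datetime(toy['date'], format='%m/%d/%Y')

# Sort oldest to newest, then keep the last (most recent) row per id
latest_sales = (toy.sort_values('date')
                   .drop_duplicates(subset='id', keep='last')
                   .reset_index(drop=True))
```

Keeping the most recent sale is only one option; keeping every sale and treating each as its own record is another, and the choice depends on what the model is meant to predict.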

In [147]:
# Some exploratory value counts on columns of interest

# Counting waterfront vs. non-waterfront homes among records
# with a recorded status (value_counts excludes nulls by default)
df.waterfront.value_counts()


# Number of bathrooms - note that many values exist
# with decimal places, i.e. half baths. This could get
# messy and convoluted quickly.

## df.bathrooms.value_counts()

NO     19075
YES      146
Name: waterfront, dtype: int64

In [148]:
# Filling null values in column `view` with string indicating
# that no record exists on the property's view
df['view'] = df['view'].fillna('NO RECORD')

# Filling null values in column `yr_renovated` with 0,
# indicating that no renovation year exists for this record
df['yr_renovated'] = df['yr_renovated'].fillna(0)

# Filling null values in column `waterfront` to reflect
# unknown status - this code may go unused
## df['waterfront'] = df['waterfront'].fillna('UNKNOWN')

In [149]:
# Creating variables to count values in columns `grade`, `condition`, 
# and `yr_renovated` in King County real estate data
grade_counts = df['grade'].value_counts()
condition_counts = df['condition'].value_counts()
renovation_counts = df['yr_renovated'].value_counts()

# Printing value counts
print(f'Condition value counts:\n{condition_counts}\n')
print(f'Grade value counts:\n{grade_counts}\n')
print(f'Renovation year value counts:\n{renovation_counts}\n')

Condition value counts:
Average      14020
Good          5677
Very Good     1701
Fair           170
Poor            29
Name: condition, dtype: int64

Grade value counts:
7 Average        8974
8 Good           6065
9 Better         2615
6 Low Average    2038
10 Very Good     1134
11 Excellent      399
5 Fair            242
12 Luxury          89
4 Low              27
13 Mansion         13
3 Poor              1
Name: grade, dtype: int64

Renovation year value counts:
0.00       20853
2014.00       73
2003.00       31
2013.00       31
2007.00       30
           ...  
1946.00        1
1959.00        1
1971.00        1
1951.00        1
1954.00        1
Name: yr_renovated, Length: 70, dtype: int64



In [150]:
# Dropping columns we determined were either superfluous or irrelevant
df.drop(columns = ['id', 'floors', 'condition', 'lat', 'long', 'zipcode',
                   'sqft_above', 'sqft_basement', 'sqft_lot15'], inplace = True)
# potentially include yr_renovated? depending on use

In [151]:
# Applied universally to notebook -- converts any scientific notation
# to standard notation, rounded to two decimal places

# Use with caution! Will need to restart kernel to reset effect.

## pd.set_option('display.float_format', lambda x: '%.2f' % x)

In [159]:
# Creating new df, grouped by column `view`, and looking
# at mean values of the numeric columns for each category in `view`
# (numeric_only skips object columns such as `date` and `grade`)
view_df = df.groupby('view').mean(numeric_only = True)

view_df.sort_values('price', ascending = False)

                price  bedrooms  bathrooms  sqft_living  sqft_lot  yr_built  yr_renovated  sqft_living15
view
EXCELLENT  1452465.88      3.62       2.78      3334.48  21624.01   1965.95        270.18         2841.0
GOOD         973285.2      3.67       2.67      3016.85  34877.13   1967.39        192.58        2702.21
FAIR        813373.27      3.54       2.35      2571.05   12370.6   1962.89        151.11        2407.35
AVERAGE     791390.37      3.57       2.43      2650.72   22317.0   1964.82        106.27        2427.34
NO RECORD   621958.17      3.43       2.15      2249.17  18111.57   1970.11         31.67        2096.13
NONE        496806.07      3.35       2.07      1998.36  14156.57   1971.62          59.1        1924.74


Takeaways from this quick analysis of `view` include...

- Mean `price` generally descends with view quality, with `EXCELLENT` at top and `NONE` at bottom (though `FAIR` sits slightly above `AVERAGE`)
- Large differences in mean `price` between `EXCELLENT` and middle `view` values, and between middle `view` values and `NONE`
    - Differences are smaller among `GOOD`, `FAIR`, and `AVERAGE`
- `NO RECORD` (formerly null/NaN) will need further investigation, as it seems to lie between `AVERAGE` (i.e. **having** a view, but an average one) and `NONE`, which ostensibly means no scenic view whatsoever.
    - Fortunately, we have only 63 null values in this column (21,597 entries, 21,534 non-null), so we can either drop them or modify them without worrying about those changes affecting our conclusions significantly.
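
A minimal sketch of one way to act on these takeaways: drop the `NO RECORD` rows and treat `view` as an ordered category, so grouped results follow the rating scale instead of alphabetical order. The ranking in `view_order` is an assumption to check against the data dictionary, the rows are invented, and `rated` and `mean_price_by_view` are hypothetical names.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    'view':  ['NONE', 'EXCELLENT', 'NO RECORD', 'AVERAGE', 'NONE', 'GOOD'],
    'price': [450000.0, 1500000.0, 600000.0, 800000.0, 500000.0, 950000.0],
})

# Assumed ranking from worst to best; verify against the data dictionary
view_order = ['NONE', 'FAIR', 'AVERAGE', 'GOOD', 'EXCELLENT']

# Drop the placeholder rows, then encode `view` as an ordered categorical
rated = toy[toy['view'] != 'NO RECORD'].copy()
rated['view'] = pd.Categorical(rated['view'], categories=view_order, ordered=True)

# observed=True skips categories with no rows (FAIR in this toy frame)
mean_price_by_view = rated.groupby('view', observed=True)['price'].mean()
```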

In [153]:
# Multiples in id column
# df[df['id'] == 795000620]
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           21597 non-null  object 
 1   price          21597 non-null  float64
 2   bedrooms       21597 non-null  int64  
 3   bathrooms      21597 non-null  float64
 4   sqft_living    21597 non-null  int64  
 5   sqft_lot       21597 non-null  int64  
 6   waterfront     19221 non-null  object 
 7   view           21597 non-null  object 
 8   grade          21597 non-null  object 
 9   yr_built       21597 non-null  int64  
 10  yr_renovated   21597 non-null  float64
 11  sqft_living15  21597 non-null  int64  
dtypes: float64(3), int64(5), object(4)
memory usage: 2.0+ MB


### Side notes and things to revisit

- `yr_renovated` had null values (now filled with `0`) and a lot of `0` values too. Of the 744 houses with a nonzero value in the `yr_renovated` column, the most common renovation year is 2014 (73 records), followed by 2003 and 2013 (31 each), so recent renovations are the largest group but not a majority. It might be worth further exploring how these recently renovated homes compare pricewise to homes built in prior years.
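
A rough sketch of how that comparison might start, on invented rows. The 2013 cutoff for "recent" is an arbitrary assumption, and `toy`, `renovated`, `recently_renovated`, and `median_price` are hypothetical names.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    'yr_renovated': [0.0, 2014.0, 0.0, 2003.0, 2013.0],
    'price':        [400000.0, 650000.0, 350000.0, 500000.0, 700000.0],
})

# 0 is the fill value used for "no renovation recorded"
toy['renovated'] = toy['yr_renovated'] > 0

# Assumed cutoff for a "recent" renovation
toy['recently_renovated'] = toy['yr_renovated'] >= 2013

# Median price for recently renovated homes vs. everything else
median_price = toy.groupby('recently_renovated')['price'].median()
```

Note that after the earlier fill step a `0` covers both "never renovated" and "renovation status unknown", so the non-renovated group mixes the two.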