# Introduction
This notebook performs basic exploratory data analysis (EDA) on the `kc_house_data.csv` file. I create a data frame, check data types for logical consistency (no numbers stored as strings, etc.), make preliminary observations on the data, and write out a finished `cleaned_kc.csv` file to be used in the final project.

## Importing Data

In [33]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv('data/kc_house_data.csv')

In [67]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 21 columns):
id               21597 non-null int64
date             21597 non-null object
price            21597 non-null float64
bedrooms         21597 non-null int64
bathrooms        21597 non-null float64
sqft_living      21597 non-null int64
sqft_lot         21597 non-null int64
floors           21597 non-null float64
waterfront       19221 non-null float64
view             21534 non-null float64
condition        21597 non-null int64
grade            21597 non-null int64
sqft_above       21597 non-null int64
sqft_basement    21597 non-null object
yr_built         21597 non-null int64
yr_renovated     17755 non-null float64
zipcode          21597 non-null int64
lat              21597 non-null float64
long             21597 non-null float64
sqft_living15    21597 non-null int64
sqft_lot15       21597 non-null int64
dtypes: float64(8), int64(11), object(2)
memory usage: 3.5+ MB
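
The `info()` output lists `date` and `sqft_basement` as `object`, which is the numbers-as-strings situation mentioned in the introduction. Below is a minimal sketch of how one might check such columns. It runs on a small made-up frame: the `toy` name and the `"?"` placeholder are assumptions for illustration, not values confirmed from the real file.

```python
import pandas as pd

# Toy stand-in for the real data; the "?" placeholder is hypothetical.
toy = pd.DataFrame({
    "date": ["10/13/2014", "12/9/2014", "2/25/2015"],
    "sqft_basement": ["0.0", "400.0", "?"],
})

# Coerce strings to numbers; anything unparseable becomes NaN.
basement = pd.to_numeric(toy["sqft_basement"], errors="coerce")

# Rows that failed to parse reveal which raw strings are not numeric.
bad_values = toy.loc[basement.isna(), "sqft_basement"].unique()

# Parse month/day/year strings into proper timestamps.
dates = pd.to_datetime(toy["date"], format="%m/%d/%Y")

print(bad_values)
print(basement.dtype, dates.dtype)
```

Inspecting `bad_values` before overwriting the column shows what the non-numeric entries actually are, which helps decide whether they should become 0, NaN, or something else.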


## Column Investigation

In [32]:
data.columns

Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15'],
      dtype='object')

Most of the column names make sense intuitively or are explained reasonably well in the supplementary `column_names.md` file.

Columns `condition` and `grade` seem to be redundant, but [this](https://www.slideshare.net/PawanShivhare1/predicting-king-county-house-prices) project found through Kaggle asserts that `condition` speaks to the "condition of the apartment" and `grade` speaks to the "level of [building] construction and design." That is still unclear, but I think `condition` describes a single unit and `grade` describes a whole apartment complex. Then again, the .csv file is named `house_data`. It is hard to pin down a meaning here, but that doesn't mean these numbers are useless in analysis.

In [57]:
data.view.describe()

count    21534.000000
mean         0.233863
std          0.765686
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          4.000000
Name: view, dtype: float64

Moreover, the `view` column does not make sense when compared to its description in `markdown.md`, which simply states "Has been viewed." This would imply a boolean column where **True** indicates that the home has been viewed. Instead, the column holds the values 0 through 4 (stored as floats), along with some NaN's:

In [63]:
sorted(data.view.unique())

[0.0, nan, 1.0, 2.0, 3.0, 4.0]

The documentation referenced above found via Kaggle suggests this may refer to the quality of the viewing, perhaps how well a visitor and potential-buyer rated their tour of the unit.

## Null And Zero Values

In [65]:
data.head()

| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 10/13/2014 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | NaN | 0.0 | ... | 7 | 1180 | 0.0 | 1955 | 0.0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 12/9/2014 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0.0 | 0.0 | ... | 7 | 2170 | 400.0 | 1951 | 1991.0 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 2/25/2015 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | 0.0 | 0.0 | ... | 6 | 770 | 0.0 | 1933 | NaN | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 12/9/2014 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0.0 | 0.0 | ... | 7 | 1050 | 910.0 | 1965 | 0.0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 2/18/2015 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0.0 | 0.0 | ... | 8 | 1680 | 0.0 | 1987 | 0.0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |


Some columns, such as `waterfront` and `yr_renovated`, contain 0's as well as NaN's. I'm assuming that 0 means "No waterfront" in the first case and "Has not been renovated" in the second. The NaN's may mean either that there is no data or the same thing as the 0's.

In [55]:
np.sum(data.isna())

id                  0
date                0
price               0
bedrooms            0
bathrooms           0
sqft_living         0
sqft_lot            0
floors              0
waterfront       2376
view               63
condition           0
grade               0
sqft_above          0
sqft_basement       0
yr_built            0
yr_renovated     3842
zipcode             0
lat                 0
long                0
sqft_living15       0
sqft_lot15          0
dtype: int64

We see `view` also contains NaN's, but hardly enough to matter in our data frame of roughly 22,000 rows. We can just replace these NaN's with 0's.
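
A minimal sketch of that replacement, shown on a made-up series rather than the real column (the `describe()` output above shows that at least 75% of `view` values are 0, which is the reason for choosing 0 as the fill value):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the `view` column (values are made up).
view = pd.Series([0.0, np.nan, 2.0, 0.0, np.nan, 4.0], name="view")

# Replace missing entries with 0; non-missing values are left untouched.
view_filled = view.fillna(0)

print(view_filled.isna().sum())
```

On the real frame the equivalent would presumably be `data["view"] = data["view"].fillna(0)`, assuming 0 really does mean "no rating" rather than a genuine score.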

Let's look at the percentage of NaN's in the data:

In [73]:
for i in data:
    print(np.round(100 * np.sum(data[i].isna()) / len(data), 2), "% NaN values")

0.0 % NaN values
0.0 % NaN values
0.0 % NaN values
0.0 % NaN values
0.0 % NaN values
0.0 % NaN values
0.0 % NaN values
0.0 % NaN values
11.0 % NaN values
0.29 % NaN values
0.0 % NaN values
0.0 % NaN values
0.0 % NaN values
0.0 % NaN values
0.0 % NaN values
17.79 % NaN values
0.0 % NaN values
0.0 % NaN values
0.0 % NaN values
0.0 % NaN values
0.0 % NaN values
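
The loop output above does not say which column each percentage belongs to. A possible alternative that keeps the column labels is sketched below on a small made-up frame (the `toy` name and its values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing values (made-up numbers).
toy = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0, 604000.0],
    "waterfront": [np.nan, 0.0, 0.0, 0.0],
    "yr_renovated": [0.0, 1991.0, np.nan, np.nan],
})

# isna() gives booleans; their column-wise mean is the fraction missing.
pct_missing = (toy.isna().mean() * 100).round(2)

print(pct_missing)
```

Because the result is a Series indexed by column name, it can also be sorted or filtered, for example `pct_missing[pct_missing > 0]`.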


`waterfront` and `yr_renovated` are missing about 11% and 18% of their values, respectively. We can replace the NaN's with 0's for `waterfront`, on the assumption that most homes aren't on a body of water, and do the same for `yr_renovated` on the assumption that most homes have not been renovated.

**Note to Steven:** Check how many values are 0. If they are a plurality, replace the NaN's with 0. For `yr_renovated`, probably use the average year or drop those rows.
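
One way to run the check described in the note is sketched below. It uses a made-up `waterfront` series, and the fill rule (fill only when the most common real value is at least as frequent as every other category, NaN included) is one possible reading of "plurality", not a settled decision.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the `waterfront` column (values are made up).
waterfront = pd.Series([0.0, np.nan, 0.0, 1.0, 0.0, np.nan], name="waterfront")

# Share of each value, with NaN counted as its own category.
shares = waterfront.value_counts(dropna=False, normalize=True)

# Most common non-missing value (mode() ignores NaN by default).
most_common = waterfront.mode().iloc[0]

# Fill only if that value is at least as frequent as any other category.
if shares.loc[most_common] == shares.max():
    waterfront = waterfront.fillna(most_common)

print(shares)
print(waterfront.tolist())
```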

In [77]:
# Caution: NaN != 0 evaluates to True, so rows with a missing year are counted here too
100 * np.sum(data.yr_renovated != 0) / len(data)

21.2344307079687
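
This figure needs care. The 21.23% includes the 17.79% of rows where `yr_renovated` is NaN, so by subtraction only about 3.4% of rows appear to carry an actual renovation year. A sketch of the difference on made-up values (the series below is illustrative, not real data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for `yr_renovated`: two zeros, one real year, two NaNs.
yr = pd.Series([0.0, 1991.0, np.nan, 0.0, np.nan], name="yr_renovated")

# Comparing NaN to 0 with != gives True, so missing rows look "non-zero".
naive_pct = 100 * (yr != 0).sum() / len(yr)

# Also require a real value, so only actual renovation years are counted.
renovated_pct = 100 * (yr.notna() & (yr != 0)).sum() / len(yr)

print(naive_pct, renovated_pct)
```

If the real column behaves the same way, 0 is by far the most common value, which would support filling the NaN's with 0 rather than an average year.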

## Question: What is the range of sale prices?

In [29]:
print("Spread of house sales is ${}M.".format((np.max(data.price) - np.min(data.price))/1000000))

Spread of house sales is $7.622M.


## Continuous Variables

## Categorical Variables, Dummies, Etc.

## Multicollinearity

In [79]:
dupe_rows = data[data.duplicated()]