# Data Validation
In this notebook we will inspect the data and identify any issues with data types and missing values.

## Importing modules and setting up the environment

In [None]:
import pandas as pd
import numpy as np

## Importing the data

In [None]:
df = pd.read_csv('data/kc_house_data.csv')
df.head()

## Checking for missing values

In [None]:
df.isna().sum()

### Investigating `waterfront` variable

In [None]:
df.waterfront.value_counts(normalize=True, dropna=False)

It looks like `waterfront` is an incomplete boolean variable. We will fill `NaNs` with 0 and convert the data type to integer.
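
The fill-and-cast step described above can be sketched as follows. This is a minimal illustration on a made-up frame (`demo_wf` and its values are hypothetical, not the real data), and it relies on the assumption that a missing value means "not waterfront".

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data; these values are invented.
demo_wf = pd.DataFrame({"waterfront": [0.0, 1.0, np.nan, 0.0]})

# Assumption: a missing value means the property is not on the waterfront.
# Fill NaN with 0 first, since a plain integer column cannot hold NaN.
demo_wf["waterfront"] = demo_wf["waterfront"].fillna(0).astype(int)
```

Filling before casting matters here: calling `astype(int)` on a column that still contains `NaN` raises an error.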

### Investigating `view` variable

In [None]:
df.view.value_counts(normalize=True, dropna=False)

The description for this variable says 'has been viewed', which on its own suggests a boolean indicator for whether the property has been viewed. However, since there are other rating variables, we assume instead that this variable is a count of the number of times the house has been viewed. Based on this assumption, we will fill `NaNs` with 0 and convert the data type to integer.
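
Before casting a count-like column to integer, it may be worth confirming that the observed values are whole numbers, so the cast loses nothing. A small sketch on invented values (`demo_view`, `is_whole`, and `view_clean` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the `view` column; these values are invented.
demo_view = pd.Series([0.0, 2.0, np.nan, 4.0], name="view")

# Check that every non-missing value is a whole number.
observed = demo_view.dropna()
is_whole = bool((observed % 1 == 0).all())

# Assumption: a missing value means the house was never viewed.
view_clean = demo_view.fillna(0).astype(int)
```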

### Investigating `yr_renovated` variable

In [None]:
df.yr_renovated.value_counts(normalize=True, dropna=False)

It looks like both 0 and `NaN` indicate that a house has never been renovated. We will introduce a `renovated` indicator variable to flag whether the house has ever been renovated. It might also be reasonable to fill the values for un-renovated houses with the value from `yr_built`, providing a year of most recent construction for all houses. It would also be wise to convert the variable to either an integer or a date-time data type.
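
One way the indicator and the `yr_built` fallback could look is sketched below. The frame `demo_reno` and its values are invented, and `yr_last_built` is a hypothetical column name chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data; these values are invented.
demo_reno = pd.DataFrame({
    "yr_built": [1955, 1987, 2001, 1930],
    "yr_renovated": [0.0, np.nan, 2010.0, 1991.0],
})

# Both 0 and NaN are read as "never renovated".
never = demo_reno["yr_renovated"].isna() | (demo_reno["yr_renovated"] == 0)

# Indicator: 1 if the house has ever been renovated, else 0.
demo_reno["renovated"] = (~never).astype(int)

# Year of most recent construction: fall back to yr_built where there
# was no renovation, then cast to integer.
demo_reno["yr_last_built"] = (
    demo_reno["yr_renovated"].mask(never, demo_reno["yr_built"]).astype(int)
)
```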

## Checking data types

In [None]:
df.dtypes

### Inspecting `date` variable
The `date` variable refers to the date on which the house was sold. The current `object` data type indicates that this variable will need to be processed and converted to an `integer` or `datetime` data type. It may also be wise to use this variable in conjunction with the price variable to produce an inflation-adjusted target.

In [None]:
df.date[:5]
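
A possible conversion to `datetime` is sketched here on invented strings. The `%m/%d/%Y` format is an assumption about how the dates are written and should be checked against the output above; `demo_date` and `sold` are hypothetical names.

```python
import pandas as pd

# Hypothetical date strings; the month/day/year layout is an assumption.
demo_date = pd.Series(["10/13/2014", "12/09/2014", "02/25/2015"], name="date")

# Parse with an explicit format so that a mismatch raises an error
# instead of being silently misread.
sold = pd.to_datetime(demo_date, format="%m/%d/%Y")

# Integer components are then available through the .dt accessor.
sold_year = sold.dt.year
sold_month = sold.dt.month
```

Passing an explicit `format` is a deliberate choice: if the real strings use a different layout, the call fails loudly rather than guessing.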

### Inspecting `sqft_basement` variable
The `sqft_basement` variable refers to the square footage of the basement of the house. This variable will need to be converted from `object` to an `integer` or `float` data type.

In [None]:
df.sqft_basement.value_counts(normalize=True)

This variable can be converted to an integer. The `?` entries will need to be replaced with zero, on the assumption that houses with unknown basement square footage have no basement.
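
The replacement and conversion might be sketched like this, using invented strings in place of the real column (`demo_bsmt` and `bsmt_clean` are hypothetical names). It assumes `?` is the only non-numeric entry.

```python
import pandas as pd

# Hypothetical stand-in for the `sqft_basement` column; values are invented.
demo_bsmt = pd.Series(["0.0", "?", "400.0", "910.0"], name="sqft_basement")

# Assumption: an unknown basement size means there is no basement.
# Replace the placeholder, parse the strings as numbers, then cast to integer.
bsmt_clean = pd.to_numeric(demo_bsmt.replace("?", "0")).astype(int)
```

Because `pd.to_numeric` raises on any string it cannot parse, an unexpected placeholder other than `?` would surface as an error rather than pass through unnoticed.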