# Data Cleaning
In this notebook we execute the data cleaning tasks that we identified in the Data Validation notebook.

## Data cleanup functions
Below we write functions that implement the cleanup tasks we identified. Throughout this section we use test-driven design to ensure that the functions work as expected.

### `waterfront` cleanup function
We wish to convert NaNs to zeros and set the data type to integer.

In [None]:
import numpy as np
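def clean_waterfront(value):
    """Converts float encoded boolean input to integer encoded boolean and replaces NaNs with zeros."""
    if np.isnan(value) or value == 0:
        output = 0
    elif value == 1:
        output = 1
    else:
        # Any other value would otherwise leave `output` unassigned.
        raise ValueError(f"Unexpected waterfront value: {value!r}")
    return output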

def test_clean_waterfront():
    """Tests the clean_waterfront function."""
    test_value1 = 0.0
    test_value2 = 1.0
    test_value3 = np.nan
    test1 = clean_waterfront(test_value1) == 0
    test2 = clean_waterfront(test_value2) == 1
    test3 = clean_waterfront(test_value3) == 0
    return test1 and test2 and test3

test_clean_waterfront()
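
For a whole column, the same result can usually be obtained without calling a Python function on every row. The following is a minimal sketch, assuming the column holds only `0.0`, `1.0`, and NaN; the sample Series is made up for illustration and is not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the `waterfront` column.
waterfront = pd.Series([0.0, 1.0, np.nan, 1.0])

# Fill missing values with 0, then cast the float column to integers.
waterfront_clean = waterfront.fillna(0).astype(int)

print(waterfront_clean.tolist())
```

Unlike `clean_waterfront`, this version does not reject unexpected values, so it is only a drop-in replacement if the column has already been validated.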

### `view` cleanup function
We wish to convert values to integers, filling NaNs with zeros.

In [None]:
import numpy as np

def clean_view(value):
    """Converts numerical inputs to integers and replaces NaNs with zeros."""
    if np.isnan(value):
        output = int(0)
    else:
        output = int(value)
    return output

def test_clean_view():
    """Tests the clean_view function."""
    test_values = [ 1.0, 13.0, 5.0, np.nan]
    is_int = True
    for value in test_values:
        is_int = type(clean_view(value)) == int
        if not is_int:
            break
    return is_int

test_clean_view()

### `yr_renovated` cleanup function
Convert to integer and replace zeros and NaNs with False.

In [None]:
def clean_yr_renovated(value):
    """Converts year renovated to an integer and replaces zeroes and NaNs with False."""
    if np.isnan(value) or value == 0:
        output = False
    else:
        output = int(value)
    return output

def test_clean_yr_renovated():
    """Tests the clean_yr_renovated function."""
    test_values = [0.0, np.nan, 1957.0, 2020.0, 1987]
    output = [ clean_yr_renovated(value) for value in test_values]
    return output == [False, False, 1957, 2020, 1987]

test_clean_yr_renovated()
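
One caveat about comparing against `False`: in Python, `bool` is a subclass of `int`, so `False == 0` is true, and a list comparison like the one in the test cannot tell `False` apart from `0`. The sketch below shows the behavior and one possible stricter check; the helper name `same_value_and_type` is hypothetical and not part of the notebook.

```python
# False compares equal to 0, so list equality cannot tell them apart.
print(False == 0)                    # True
print([False, 1957] == [0, 1957])    # True


def same_value_and_type(actual, expected):
    """Return True only if both lists match element-wise in value and type."""
    return len(actual) == len(expected) and all(
        type(a) is type(e) and a == e for a, e in zip(actual, expected)
    )


print(same_value_and_type([False, 1957], [False, 1957]))  # True
print(same_value_and_type([0, 1957], [False, 1957]))      # False
```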

### `date` cleanup function
The date field is formatted as a string. For the purposes of this analysis, we need to convert it to a `datetime`.

In [None]:
from datetime import datetime as dt

def clean_date(date_str):
    """Converts 'MM/DD/YYYY' formatted date strings to datetime objects."""
    return dt.strptime(date_str, '%m/%d/%Y')
    
def test_clean_date():
    """Tests the clean_date function."""
    return clean_date('7/21/1987') == dt(1987, 7, 21, 0, 0)

test_clean_date()
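
As an alternative to parsing one string at a time, pandas can convert a whole column with `pd.to_datetime`. This is a minimal sketch, assuming every entry follows the `MM/DD/YYYY` pattern; the sample dates are made up for illustration.

```python
import pandas as pd

# Hypothetical sample standing in for the `date` column.
dates = pd.Series(['7/21/1987', '10/13/2014'])

# Parse the whole column at once using an explicit format string.
parsed = pd.to_datetime(dates, format='%m/%d/%Y')

print(parsed.dt.year.tolist())
```

Passing `format` explicitly avoids any ambiguity between month-first and day-first dates.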

### `sqft_basement` cleanup function
Replace `?` with zero and convert to integer.

In [None]:
def clean_sqft_basement(value):
    """Convert '?' to zero and string encoded floats to integers."""
    if value == '?':
        output = int(0)
    else:
        output = int(float(value))
    return output

def test_clean_sqft_basement():
    """Tests the clean_sqft_basement function."""
    test_values = ['?', '0.0', '600.0', 2000.0, 334]
    output = [ clean_sqft_basement(value) for value in test_values]
    return output == [0, 0, 600, 2000, 334]

test_clean_sqft_basement()
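
A column-wise variant is possible with `pd.to_numeric`. The sketch below is an assumption-laden alternative rather than the approach used in this notebook: with `errors='coerce'`, every unparseable entry becomes NaN, not just `?`, so it is less strict than `clean_sqft_basement`. The sample values mirror the test inputs above.

```python
import pandas as pd

# Sample values mirroring the test inputs for clean_sqft_basement.
sqft_basement = pd.Series(['?', '0.0', '600.0', 2000.0, 334])

# Unparseable entries such as '?' become NaN instead of raising.
numeric = pd.to_numeric(sqft_basement, errors='coerce')

# Replace the NaNs with zero and cast to integers.
sqft_clean = numeric.fillna(0).astype(int)

print(sqft_clean.tolist())
```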

## Execute data cleaning
In this section we will clean the data and save it for future use.

### Load the raw data

In [None]:
import pandas as pd
df = pd.read_csv('data/kc_house_data.csv')
df.head()

### Data cleaning function

In [None]:
def clean_data(df):
    """Execute all data cleaning tasks and save the resulting DataFrame to data/cleaned.csv."""
    clean = df.copy()
    clean['waterfront'] = df['waterfront'].apply(clean_waterfront)
    clean['view'] = df['view'].apply(clean_view)
    clean['yr_renovated'] = df['yr_renovated'].apply(clean_yr_renovated)
    clean['date'] = df['date'].apply(clean_date)
    clean['sqft_basement'] = df['sqft_basement'].apply(clean_sqft_basement)
    clean.to_csv('data/cleaned.csv', index=False)

### Execute data cleaning

In [None]:
clean_data(df)

### Load cleaned data for inspection

In [None]:
clean = pd.read_csv('data/cleaned.csv')
clean.head()
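
Note that a CSV file stores dates as text, so a plain `pd.read_csv` will not bring the `date` column back as a `datetime`. A minimal sketch of restoring the type on load with the `parse_dates` parameter follows; the two-row CSV text is a made-up stand-in for `data/cleaned.csv`, and its date format is an assumption.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for data/cleaned.csv.
csv_text = "date,price\n1987-07-21,100000\n2014-10-13,250000\n"

# Without parse_dates the column is read back as text.
plain = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the column is converted to datetimes on load.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])

print(pd.api.types.is_datetime64_any_dtype(plain['date']))
print(pd.api.types.is_datetime64_any_dtype(parsed['date']))
```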