# D.C. Properties - Condition Prediction

This notebook loads a subset of the DC Properties dataset to build a classification model that predicts the condition of a building. The columns in the data are:

 * **NUM_UNITS** - Number of Units
 * **ROOMS** - Number of Rooms
 * **BEDRM** - Number of Bedrooms
 * **BATHRM** - Number of Full Bathrooms
 * **HF_BATHRM** - Number of Half Bathrooms (no bathtub or shower)
 * **KITCHENS** - Number of kitchens
 * **STORIES** - Number of stories in primary dwelling
 * **HEAT** - Heating
 * **AC** - Cooling
 * **FIREPLACES** - Number of fireplaces
 * **ROOF** - Roof type
 * **EXTWALL** - Exterior wall
 * **AYB** - The earliest time the main portion of the building was built
 * **EYB** - The year an improvement was built more recent than actual year built
 * **YR_SALE** - Year of most recent sale
 * **CNDTN** - Condition
 * **GBA** - Gross building area in square feet
 * **LANDAREA** - Land area of property in square feet
 * **WARD** - Ward (District is divided into eight wards, each with approximately 75,000 residents)
 * **PRICE** - Price of most recent sale

## Imports and Config setting

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import math

In [None]:
pd.set_option('display.max_columns', None)

## Data loading and Selection

Define the parameters that will be used throughout the notebook.

In [None]:
# Params
input_data_path = '2_dc_properties_processed_zipped.csv'
numerical_cols = ['NUM_UNITS', 'ROOMS', 'BEDRM', 'BATHRM', 'HF_BATHRM', 'KITCHENS',
                   'STORIES', 'FIREPLACES', 'AYB', 'EYB', 'GBA', 'LANDAREA', 'X', 'Y', 'PRICE', 'YR_SALE']
categorical_cols = ['HEAT', 'AC', 'ROOF', 'EXTWALL', 'CNDTN', 'WARD']

Load the data and preview it.
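
As a rough sketch of this step, the data can be read with `pd.read_csv`, keeping only the columns of interest through `usecols`. The helper name `load_properties`, the shortened column lists, and the in-memory CSV are all assumptions made so the example runs on its own; they are not part of the original notebook.

```python
import io
import pandas as pd

# Shortened column lists, for illustration only.
numerical_cols = ['ROOMS', 'BEDRM', 'PRICE']
categorical_cols = ['HEAT', 'CNDTN']

def load_properties(source, columns):
    """Read only the selected columns from a CSV path or file-like object."""
    return pd.read_csv(source, usecols=columns)

# Hypothetical stand-in for the real CSV file (made-up rows).
demo_csv = io.StringIO(
    "ROOMS,BEDRM,PRICE,HEAT,CNDTN,EXTRA\n"
    "6,3,450000,Forced Air,Good,x\n"
    "8,4,,Hot Water Rad,Average,y\n"
)

df = load_properties(demo_csv, numerical_cols + categorical_cols)
print(df.shape)   # rows and columns that were kept
print(df.head())  # preview of the first rows
```

In the notebook itself, the same call would presumably take `input_data_path` and the full column lists instead of the demo CSV.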

In [None]:
# Load the data and preview it


## Split data

How the data is selected for training and testing matters a great deal. In this case, we want to keep things simple and use 2/3 of the data for training and 1/3 for testing. How would you do it?
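
One possible answer, sketched on a toy DataFrame with made-up values, is scikit-learn's `train_test_split` with `test_size=1/3`. The `random_state` and the `stratify=y` option (which keeps the class proportions similar in both parts) are assumptions added for reproducibility, not requirements of the exercise.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the loaded data (values are made up for illustration).
df = pd.DataFrame({
    'ROOMS': [6, 8, 5, 7, 9, 4, 6, 7, 5],
    'PRICE': [450, 600, 300, 520, 710, 250, 480, 530, 310],
    'CNDTN': ['Good', 'Average', 'Good', 'Average', 'Good',
              'Average', 'Good', 'Average', 'Good'],
})

X = df.drop(columns='CNDTN')  # features
y = df['CNDTN']               # label to predict

# Hold out one third of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42, stratify=y
)
print(len(X_train), len(X_test))
```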

In [None]:
# Split data to train and test


In [None]:
# Preview the training features


In [None]:
# Preview the training labels


In [None]:
# Preview the test features


In [None]:
# Preview the test labels


## Fix nulls

Our data still has a few nulls. Let's take a look at those and see what we can do about them.
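
A minimal way to count the nulls per column is `isnull().sum()`. The small frame below uses invented values purely so the sketch is runnable.

```python
import numpy as np
import pandas as pd

# Toy training features with a few missing values (made-up data).
X_train = pd.DataFrame({
    'ROOMS': [6, 8, 5, 7],
    'YR_SALE': [2015, np.nan, 2010, 2018],
    'PRICE': [450000, np.nan, np.nan, 520000],
})

# Number of missing values in each column.
null_counts = X_train.isnull().sum()

# Show only the columns that actually contain nulls.
print(null_counts[null_counts > 0])
```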

In [None]:
# Check the number of nulls


There are multiple ways to fix missing data. Here we want to keep things simple, so we will fill each column's missing values with its mean.
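
A sketch of mean imputation with `fillna` follows. It makes one assumption the notebook does not spell out: the means are computed on the training data only and reused for the test data, a common way to keep information from the test set out of preprocessing. The toy values are invented.

```python
import numpy as np
import pandas as pd

# Toy train and test features with missing values (made-up data).
X_train = pd.DataFrame({
    'YR_SALE': [2010.0, np.nan, 2014.0],
    'PRICE': [300000.0, 500000.0, np.nan],
})
X_test = pd.DataFrame({
    'YR_SALE': [np.nan, 2016.0],
    'PRICE': [np.nan, 450000.0],
})

impute_cols = ['YR_SALE', 'PRICE']

# Column means learned from the training data only (assumption, see above).
train_means = X_train[impute_cols].mean()

# fillna with a Series fills each column with its matching value.
X_train[impute_cols] = X_train[impute_cols].fillna(train_means)
X_test[impute_cols] = X_test[impute_cols].fillna(train_means)

print(X_train)
print(X_test)
```

If the exercise intends the test set to be filled with its own means instead, `train_means` would simply be replaced by `X_test[impute_cols].mean()` in the last `fillna` call.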

In [None]:
# Impute the YR_SALE and PRICE missing values with their means in the train data


In [None]:
# Impute the YR_SALE and PRICE missing values with their means in the test data


## Encode categorical variables

There are two main ways to encode categorical values in your data: a one-hot encoder or an ordinal (label) encoder. Let's decide which strategy we want to use and transform the categorical variables into numeric values.

### One Hot Encoder

In [None]:
# One hot encoding of the training data


In [None]:
# One hot encoding of the test data


### Ordinal encoding
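
A possible sketch uses scikit-learn's `OrdinalEncoder` with an explicit category order. The ordering of the condition labels below is an assumption (the actual values in the dataset may differ), and the new column name `CNDTN_code` is hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Assumed ordering of the condition labels, worst to best (hypothetical).
condition_order = ['Poor', 'Fair', 'Average', 'Good', 'Very Good', 'Excellent']

# Toy train and test data (made-up values).
train = pd.DataFrame({'CNDTN': ['Good', 'Average', 'Poor']})
test = pd.DataFrame({'CNDTN': ['Excellent', 'Fair']})

# Each label is mapped to its position in condition_order.
encoder = OrdinalEncoder(categories=[condition_order])
train['CNDTN_code'] = encoder.fit_transform(train[['CNDTN']]).ravel()
test['CNDTN_code'] = encoder.transform(test[['CNDTN']]).ravel()

print(train)
print(test)
```

Ordinal codes imply an order, which suits a column like condition. For columns without a natural order, such as roof type, the integers are arbitrary and some models may read meaning into them.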

In [None]:
# Ordinal encoding of the training data


In [None]:
# Ordinal encoding of the test data


## Train model

There are countless training algorithms that we could use. Take a moment to think about which ones would be a good fit for this case, and why.
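
As one candidate among many, a random forest handles mixed numeric features without scaling and is a reasonable baseline for a multi-class label. The sketch below is an illustrative choice, not the notebook's prescribed model; the toy features, labels, and hyperparameters are assumptions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy, already-numeric training features and labels (made-up values).
X_train = pd.DataFrame({
    'ROOMS': [4, 5, 6, 7, 8, 9],
    'GBA': [900, 1100, 1500, 1800, 2400, 3000],
})
y_train = pd.Series(['Average', 'Average', 'Average', 'Good', 'Good', 'Good'])

# A small forest with a fixed seed so results are reproducible.
model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

print(model.classes_)  # labels the model learned to predict
```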

In [None]:
# Train a simple model


## Test model

In order to evaluate the performance of the model we want to obtain the predictions on the unseen data (test set).
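
A short sketch of the prediction step, using a small decision tree on invented data as a stand-in for whatever model was trained. The point it illustrates is that the test features need the same columns, in the same order, as the training features.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy training data (made-up values).
X_train = pd.DataFrame({'ROOMS': [4, 5, 8, 9], 'GBA': [900, 1100, 2400, 3000]})
y_train = ['Average', 'Average', 'Good', 'Good']

# Test features whose columns happen to be in a different order.
X_test = pd.DataFrame({'GBA': [1000, 2800], 'ROOMS': [5, 9]})

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Reorder the test columns to match the training columns before predicting.
X_test = X_test[X_train.columns]
y_pred = model.predict(X_test)
print(y_pred)
```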

In [None]:
# Predict the labels on the test set


## Evaluate model

Finally, we can measure the performance of the model by comparing its predictions with the real labels of the test set.
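
The comparison can be sketched with scikit-learn's `accuracy_score`, `confusion_matrix`, and `classification_report`. The label lists below are hand-written for illustration and do not come from a real model run.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hand-written true and predicted labels (illustrative only).
y_test = ['Good', 'Average', 'Good', 'Good', 'Average', 'Average']
y_pred = ['Good', 'Average', 'Average', 'Good', 'Average', 'Good']

labels = ['Average', 'Good']

# Fraction of predictions that match the true labels.
accuracy = accuracy_score(y_test, y_pred)

# Rows are true labels, columns are predicted labels, both in `labels` order.
matrix = confusion_matrix(y_test, y_pred, labels=labels)

print(f"accuracy: {accuracy:.3f}")
print(matrix)

# Per-class precision, recall and F1.
print(classification_report(y_test, y_pred, labels=labels, zero_division=0))
```

Accuracy alone can be misleading when some conditions are much rarer than others, so the per-class figures in the report are worth reading alongside it.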

In [None]:
# Find the classification metrics on the test set
