# D.C. Properties - Fixing the data

This notebook cleans the most important columns of the DC Properties dataset and keeps only the most relevant ones.

## Imports and Config setting

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
pd.set_option('display.max_columns', None)

## Data loading and Selection

Define the parameters that will be used throughout the notebook.

In [None]:
# Params
input_data_path = '0_dc_properties_raw_zipped.csv'
output_data_path = '1_dc_properties_fixed_zipped.csv'


Load the data file and preview it.

In [None]:
# Read the data into data_df (used by the later cells) and preview it
data_df = pd.read_csv(input_data_path, low_memory=False, index_col=0, compression='zip')
data_df.head()


## Drop some nulls

One of the most important aspects of this dataset is the condition of the building. Later on we will try to predict that condition, so we drop the rows that don't have a value for it.

In [None]:
# Drop the rows where the condition is unknown
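
A minimal sketch of this step on a toy frame, assuming the condition lives in a column named `CNDTN` (the name used in the encoding cell further down). `toy_df` and the `PRICE` column are hypothetical stand-ins for the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; PRICE is a hypothetical extra column.
toy_df = pd.DataFrame({
    "CNDTN": ["Good", np.nan, "Average", None],
    "PRICE": [500000, 320000, np.nan, 410000],
})

# Keep only the rows whose condition is known.
# Nulls in other columns (PRICE here) are left untouched.
toy_df = toy_df.dropna(subset=["CNDTN"])
```

Passing `subset` restricts the null check to the target column, so rows are not discarded just because some unrelated column is empty.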


In [None]:
# Check the number of nulls in the data
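
One way to run this check, sketched on a hypothetical toy frame (`toy_df` and its `PRICE` column are illustrative, not part of the dataset description above):

```python
import numpy as np
import pandas as pd

# Toy frame as it might look after dropping rows with no condition.
toy_df = pd.DataFrame({
    "CNDTN": ["Good", "Average"],
    "PRICE": [500000, np.nan],
})

# Count the nulls per column, worst offenders first.
null_counts = toy_df.isnull().sum().sort_values(ascending=False)
```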


## Pre-processing

The raw data stores SALEDATE as a timestamp. Timestamps can be awkward to work with, and for the purposes of this exercise we only need the year of sale.

In [None]:
# SALEDATE conversion to extract the year and name it YR_SALE
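
A possible sketch of the conversion, assuming `SALEDATE` holds strings such as `2016-11-25 00:00:00`. The exact timestamp format in the raw file is an assumption, and `toy_df` is a hypothetical stand-in.

```python
import pandas as pd

# Toy stand-in; the timestamp format is assumed, not confirmed by the source.
toy_df = pd.DataFrame({
    "SALEDATE": ["2016-11-25 00:00:00", "2003-10-09 00:00:00", None],
})

# Parse the strings; unparseable or missing values become NaT.
sale_dates = pd.to_datetime(toy_df["SALEDATE"], errors="coerce")

# Extract the year. The nullable Int64 dtype keeps missing years as <NA>
# instead of turning the whole column into floats.
toy_df["YR_SALE"] = sale_dates.dt.year.astype("Int64")
```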


The other relevant value, which we already looked at, is the condition. Let's look at how many rows have each of its values.

In [None]:
# Look at the distribution of the values
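
A short sketch of the count on made-up labels, assuming the column is named `CNDTN`. The real label set may differ.

```python
import pandas as pd

# Toy labels for illustration only.
toy_df = pd.DataFrame({
    "CNDTN": ["Good", "Average", "Average", "Default", "Good", "Average"],
})

# Number of rows per condition label, most frequent first.
condition_counts = toy_df["CNDTN"].value_counts()
```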


It is not clear what "Default" means compared to the other values, and it appears in only a small number of rows, so let's go ahead and delete those rows.

In [None]:
# Remove Default
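
The removal could be sketched with a boolean mask, assuming the label is spelled exactly `Default` in the `CNDTN` column (the capitalisation is an assumption):

```python
import pandas as pd

# Toy stand-in for the real data.
toy_df = pd.DataFrame({
    "CNDTN": ["Good", "Default", "Average", "Fair"],
})

# Keep every row whose condition is not "Default".
toy_df = toy_df[toy_df["CNDTN"] != "Default"]
```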


Finally, a string-valued target variable is awkward to work with, so let's encode it as a number.

Note: The order of the encoding is important, because the condition labels are ordered from worst to best and the numeric codes should preserve that order.

In [None]:
# Encode the CNDTN values
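
A sketch of an order-preserving encoding. The label list in `condition_order` is an assumption about what `CNDTN` contains after removing "Default", so check it against the value counts before using it.

```python
import pandas as pd

# Toy stand-in for the real data.
toy_df = pd.DataFrame({
    "CNDTN": ["Good", "Poor", "Excellent", "Average"],
})

# Assumed labels, listed from worst to best so the codes keep that order.
condition_order = ["Poor", "Fair", "Average", "Good", "Very Good", "Excellent"]
condition_codes = {label: code for code, label in enumerate(condition_order)}

# Replace each label with its rank; labels missing from the mapping become NaN.
toy_df["CNDTN"] = toy_df["CNDTN"].map(condition_codes)
```

An explicit mapping is used here instead of an automatic label encoder, since automatic encoders typically assign codes alphabetically and would lose the worst-to-best ordering.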


## Save Data

Finally, let's save our results so we can continue using them in the next notebook.

In [None]:
data_df.reset_index(drop=True).to_csv(output_data_path, compression='zip')