# D.C. Properties - Preprocessing

This notebook preprocesses the D.C. Properties dataset: it extracts the desired columns and prepares some of their data types. The selected columns are:

 * **NUM_UNITS** - Number of Units
 * **ROOMS** - Number of Rooms
 * **BEDRM** - Number of Bedrooms
 * **BATHRM** - Number of Full Bathrooms
 * **HF_BATHRM** - Number of Half Bathrooms (no bathtub or shower)
 * **KITCHENS** - Number of kitchens
 * **STORIES** - Number of stories in primary dwelling
 * **HEAT** - Heating
 * **AC** - Cooling
 * **FIREPLACES** - Number of fireplaces
 * **ROOF** - Roof type
 * **EXTWALL** - Exterior wall
 * **AYB** - The earliest time the main portion of the building was built
 * **EYB** - The year an improvement was built, when more recent than the actual year built
 * **YR_SALE** - Year of most recent sale
 * **CNDTN** - Condition
 * **GBA** - Gross building area in square feet
 * **LANDAREA** - Land area of property in square feet
 * **WARD** - Ward (District is divided into eight wards, each with approximately 75,000 residents)
 * **X** - The longitude
 * **Y** - The latitude
 * **PRICE** - Price of most recent sale

## Imports and Configuration

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
pd.set_option('display.max_columns', None)

## Data loading and Selection

Define the parameters that will be used throughout the notebook.

In [None]:
# Params
input_data_path = '1_dc_properties_fixed_zipped.csv'
output_data_path = '2_dc_properties_processed_zipped.csv'
selected_cols = [
    'NUM_UNITS', 'ROOMS', 'BEDRM', 'BATHRM', 'HF_BATHRM', 'KITCHENS',
    'STORIES', 'HEAT', 'AC', 'FIREPLACES', 'ROOF', 'EXTWALL', 'AYB', 'EYB',
    'YR_SALE', 'CNDTN', 'GBA', 'LANDAREA', 'WARD', 'X', 'Y', 'PRICE',
]


Load the data and preview it.

In [None]:
# Load and preview the data
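
One possible way to fill in this cell is sketched below. It assumes the input file is a zip archive despite its `.csv` extension (the save step at the end of this notebook writes its output that way) and that the first column is a saved index; neither is confirmed here. So that the sketch runs anywhere, it first writes a small made-up frame to a temporary file (`toy_path` is a hypothetical stand-in for `input_data_path`).

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the real file (made-up values, hypothetical path).
toy_df = pd.DataFrame({
    'ROOMS': [6, 8, 5],
    'WARD': ['Ward 2', 'Ward 6', 'Ward 2'],
    'PRICE': [1095000.0, None, 2100000.0],
})
toy_path = os.path.join(tempfile.mkdtemp(), 'toy_properties_zipped.csv')
toy_df.to_csv(toy_path, compression='zip')

# Assumption: the file is zip-compressed, so the compression is passed
# explicitly instead of being inferred from the .csv extension.
# Assumption: the first column is a saved index, hence index_col=0.
data_df = pd.read_csv(toy_path, compression='zip', index_col=0)

# head() returns the first five rows by default.
preview = data_df.head()
print(preview)
```

In the notebook itself, `toy_path` would be replaced by `input_data_path`.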


Use the selected_cols parameter to remove the undesired columns from our data.

In [None]:
# Filter the columns in the data
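
A minimal sketch of the column filter, shown on a made-up frame with a shortened `selected_cols` list so that it runs on its own. The `SALEDATE` column here only plays the role of an undesired column.

```python
import pandas as pd

# Toy frame with one column we do not want to keep (made-up values).
data_df = pd.DataFrame({
    'ROOMS': [6, 8],
    'SALEDATE': ['2003-11-25', '2016-06-21'],
    'WARD': ['Ward 2', 'Ward 6'],
    'PRICE': [1095000.0, 2100000.0],
})
selected_cols = ['ROOMS', 'WARD', 'PRICE']  # shortened list for the toy frame

# Indexing with a list keeps only those columns, in the listed order.
# .copy() makes the result independent of the original frame.
data_df = data_df[selected_cols].copy()
print(data_df.columns.tolist())
```

Indexing with a list raises a `KeyError` if any requested column is missing, which is a useful early check that the input file has the expected schema.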


## Drop some nulls

The original raw data has been slightly cleaned and prepared. However, there are still some nulls that we might want to fix.

*Note: For now, you can ignore the nulls in PRICE and YR_SALE*

In [None]:
# Check the nulls per column
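
One way to count the nulls per column is `isnull().sum()`, sketched here on a made-up frame:

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing values (made-up data).
data_df = pd.DataFrame({
    'ROOMS': [6, np.nan, 5],
    'WARD': ['Ward 2', 'Ward 6', None],
    'PRICE': [np.nan, np.nan, 2100000.0],
})

# isnull() marks missing cells as True; sum() counts them column by column.
null_counts = data_df.isnull().sum()
print(null_counts)
```

The same expression can be re-run after dropping rows to confirm that only the expected nulls remain.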


In [None]:
# Drop the few properties that have nulls in the following columns; they are not important
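
The notebook does not list which columns to check, so the sketch below makes an assumption: it drops rows with a null in any column except PRICE and YR_SALE, which the note above says to ignore for now. The names `ignored_cols` and `cols_to_check` are hypothetical, and the data is made up.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up data): rows 1 and 2 have a null outside PRICE/YR_SALE.
data_df = pd.DataFrame({
    'ROOMS': [6, np.nan, 5, 7],
    'WARD': ['Ward 2', 'Ward 6', None, 'Ward 1'],
    'PRICE': [np.nan, 500000.0, 2100000.0, np.nan],
    'YR_SALE': [np.nan, 2010.0, 2015.0, np.nan],
})

# Assumption: every column except PRICE and YR_SALE must be non-null.
ignored_cols = ['PRICE', 'YR_SALE']
cols_to_check = [c for c in data_df.columns if c not in ignored_cols]

# dropna(subset=...) only looks at the listed columns when deciding
# which rows to remove, so nulls in PRICE and YR_SALE are left alone.
data_df = data_df.dropna(subset=cols_to_check)
print(len(data_df))
```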


In [None]:
# Check the nulls again


## Pre-processing

One of the most important variables that we will focus on later is WARD. Let's take a look at it and see the different values it takes.

In [None]:
# Look at the unique values of WARD
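
A short sketch of two ways to inspect the values, using made-up data. The `'Ward N'` string format is an assumption about how the real column looks.

```python
import pandas as pd

# Toy WARD column (made-up data; the 'Ward N' format is an assumption).
data_df = pd.DataFrame({'WARD': ['Ward 2', 'Ward 6', 'Ward 2', 'Ward 1']})

# unique() lists the distinct values in order of first appearance.
ward_values = data_df['WARD'].unique()
print(ward_values)

# value_counts() also shows how many properties fall in each ward.
ward_counts = data_df['WARD'].value_counts()
print(ward_counts)
```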


At the moment, the values of WARD are strings; we would like to convert them to int. How would you do that?

In [None]:
# Transform the WARD values
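
One possible answer, assuming the values look like `'Ward 1'` through `'Ward 8'` (an assumption about the real data, not something shown above): extract the digits with a regular expression and cast them to int. The frame below is made up.

```python
import pandas as pd

# Toy WARD column (made-up data; the 'Ward N' format is an assumption).
data_df = pd.DataFrame({'WARD': ['Ward 2', 'Ward 6', 'Ward 2', 'Ward 1']})

# str.extract pulls out the first run of digits; expand=False returns a
# Series rather than a one-column DataFrame. astype(int) then converts
# the digit strings to integers.
data_df['WARD'] = data_df['WARD'].str.extract(r'(\d+)', expand=False).astype(int)
print(data_df['WARD'].tolist())
```

This cast fails if any row has no digits or is null, so it is best run after the null-dropping step.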


## Save Data

Finally, let's save our results so we can continue using them in the next notebook.

In [None]:
data_df.reset_index(drop=True).to_csv(output_data_path, compression='zip')