# D.C. Properties - Exploratory Data Analysis

This notebook analyzes the DC Properties dataset and answers some fundamental questions about the use case. The selected columns are:

 * **NUM_UNITS** - Number of Units
 * **ROOMS** - Number of Rooms
 * **BEDRM** - Number of Bedrooms
 * **BATHRM** - Number of Full Bathrooms
 * **HF_BATHRM** - Number of Half Bathrooms (no bathtub or shower)
 * **KITCHENS** - Number of kitchens
 * **STORIES** - Number of stories in primary dwelling
 * **HEAT** - Heating
 * **AC** - Cooling
 * **FIREPLACES** - Number of fireplaces
 * **ROOF** - Roof type
 * **EXTWALL** - Exterior wall
 * **AYB** - The earliest time the main portion of the building was built
 * **EYB** - The year of an improvement built more recently than the actual year built
 * **SALEDATE** - Date of most recent sale
 * **CNDTN** - Condition
 * **GBA** - Gross building area in square feet
 * **LANDAREA** - Land area of property in square feet
 * **WARD** - Ward (District is divided into eight wards, each with approximately 75,000 residents)
 * **X** - The longitude
 * **Y** - The latitude
 * **PRICE** - Price of most recent sale

## Imports and configuration

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import math

In [None]:
pd.set_option('display.max_columns', None)
sns.set(rc={'figure.figsize':(11.7,8.27)})

## Data loading

Define the parameters that will be used throughout the notebook.

In [None]:
# Params
input_data_path = '2_dc_properties_processed_zipped.csv'

Load the data file and preview it.

In [None]:
# Load the data and preview it
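
One possible way to fill in this cell is sketched below. It assumes the file is a plain CSV that `pd.read_csv` can read directly and that `SALEDATE` is stored in a format pandas can parse as a date. The helper name `load_properties` and the two-row sample are hypothetical and stand in for the real file so the sketch runs on its own.

```python
import io

import pandas as pd


def load_properties(source):
    """Read the properties CSV and parse SALEDATE as a datetime column."""
    return pd.read_csv(source, parse_dates=["SALEDATE"])


# Hypothetical two-row sample standing in for the real file.
sample_csv = io.StringIO(
    "WARD,PRICE,SALEDATE\n"
    "Ward 2,1095000,2003-11-25\n"
    "Ward 7,250000,2016-06-21\n"
)

data_df = load_properties(sample_csv)
print(data_df.shape)   # (rows, columns)
print(data_df.head())  # preview of the first rows
```

In the notebook itself, the call would presumably be `load_properties(input_data_path)`.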


## Exploratory Data Analysis

The data contains the historical sale records of properties in DC. There are more than 100k rows and 22 selected columns. 

There is a series of questions we would like to answer about our dataset. Let's get started!

First, for those who are not familiar with them, this is what the wards in DC look like.

In [None]:
from IPython.display import HTML, Image

# Display a map of the DC wards
Image(url="https://images.squarespace-cdn.com/content/v1/56e6cad12fe13155d5243018/1467308071684-I84GAR58UWK3D180PL74/image-asset.jpeg?format=300w", width=300, height=300)

Next, how many wards are represented in the dataset? What is the distribution of properties sold across them?

In [None]:
# Plot a bar chart with the number of properties sold in each ward
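
A minimal sketch of the counting step behind such a bar chart follows. The sample frame and the `"Ward N"` label format are assumptions made for illustration. The notebook would use its own DataFrame instead.

```python
import pandas as pd

# Hypothetical sample; the label format "Ward N" is an assumption.
sample_df = pd.DataFrame(
    {"WARD": ["Ward 1", "Ward 2", "Ward 2", "Ward 6", "Ward 6", "Ward 6"]}
)

# Number of distinct wards present in the data
print("Number of wards:", sample_df["WARD"].nunique())

# Number of sales per ward, ordered by ward label
sales_per_ward = sample_df["WARD"].value_counts().sort_index()
print(sales_per_ward)
```

Calling `sales_per_ward.plot.bar()` on the result, or `sns.countplot` on the raw column, would draw the chart.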


This gives us an idea of which areas are 'hotter' in terms of the number of properties sold. However, we would also like to identify the most affordable areas and the most expensive ones.

In [None]:
# Plot a chart that would be representative of the prices in each ward
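
One way to summarize prices per ward is to compute several statistics side by side, since a single extreme sale can pull the mean and the maximum far away from the median. The sketch below uses made-up numbers. Only the column names `WARD` and `PRICE` come from the dataset.

```python
import pandas as pd

# Hypothetical sample with one extreme sale in Ward 2
sample_df = pd.DataFrame({
    "WARD": ["Ward 2", "Ward 2", "Ward 2", "Ward 7", "Ward 7"],
    "PRICE": [900_000, 1_100_000, 25_000_000, 200_000, 300_000],
})

# Median, mean and maximum sale price per ward, cheapest median first
price_by_ward = (
    sample_df.groupby("WARD")["PRICE"]
    .agg(["median", "mean", "max"])
    .sort_values("median")
)
print(price_by_ward)
```

A box plot such as `sns.boxplot(data=sample_df, x="WARD", y="PRICE")` would show the same spread graphically.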


If we compare the highest sale price in each ward against the ward's average, we can see some clear outliers. Let's look further into the prices of the properties being sold.

How could we get a glimpse of the price variable over time?

In [None]:
# Plot a chart that would give us a general idea on the prices over time
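
A possible approach is to aggregate prices by sale year. The sketch below assumes that `YR_SALE` is simply the calendar year of `SALEDATE`. That derivation is an assumption, since the column is not in the list at the top of the notebook. The sample values are made up.

```python
import pandas as pd

# Hypothetical sample sales
sample_df = pd.DataFrame({
    "SALEDATE": pd.to_datetime(
        ["1985-03-01", "1999-07-15", "1999-09-30", "2015-01-20"]
    ),
    "PRICE": [80_000, 200_000, 300_000, 650_000],
})

# Assumption: YR_SALE is the calendar year of SALEDATE
sample_df["YR_SALE"] = sample_df["SALEDATE"].dt.year

# Median sale price per year
median_price_by_year = sample_df.groupby("YR_SALE")["PRICE"].median()
print(median_price_by_year)
```

`median_price_by_year.plot()` would turn this into a line chart. A scatter plot of `PRICE` against `SALEDATE` is an alternative that keeps individual outliers visible.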


In [None]:
# You might want to describe the PRICE variable
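
As a sketch, `Series.describe` gives the count, mean, standard deviation and quantiles in one call. Adding a high percentile such as the 99th (an arbitrary choice here) can help decide where an "astronomical" price begins. The prices below are made up.

```python
import pandas as pd

# Hypothetical prices with one extreme value
prices = pd.Series(
    [150_000, 300_000, 450_000, 600_000, 25_000_000], name="PRICE"
)

# Summary statistics, including an extra upper percentile
summary = prices.describe(percentiles=[0.25, 0.5, 0.75, 0.99])
print(summary)
```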


Let's remove the properties sold before 1990 and ignore the ones with an astronomical price.

Now it should be easier to see what the price per ward looks like.

In [None]:
# Filter data_df by YR_SALE and PRICE
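
The filter can be sketched with a boolean mask. The 1990 cutoff comes from the text above. The price cap of 10 million is a hypothetical threshold chosen only for illustration and would need to be picked after inspecting the real distribution. The sample frame is made up.

```python
import pandas as pd

# Hypothetical sample; the notebook would filter data_df instead
sample_df = pd.DataFrame({
    "YR_SALE": [1985, 1995, 2005, 2015, 2016],
    "PRICE": [90_000, 250_000, 400_000, 700_000, 30_000_000],
})

MIN_YEAR = 1990          # keep sales from 1990 onwards
MAX_PRICE = 10_000_000   # hypothetical cap for "astronomical" prices

# Combine both conditions element-wise and keep the matching rows
mask = (sample_df["YR_SALE"] >= MIN_YEAR) & (sample_df["PRICE"] <= MAX_PRICE)
filtered_df = sample_df[mask].copy()
print(len(sample_df), "->", len(filtered_df), "rows")
```

Taking a `.copy()` avoids chained-assignment warnings if columns are added to the filtered frame later.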


In [None]:
# Plot again the price per ward


It seems that location is a really important factor for the price. We would like to know which matters more: being north/south or west/east of downtown.

In [None]:
# Look at the correlation between X, Y, and PRICE
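
A minimal sketch using the Pearson correlation matrix follows. `X` is the longitude (west/east) and `Y` the latitude (north/south), so comparing their correlations with `PRICE` gives a rough answer. The coordinates and prices below are made up and say nothing about the real data. Pearson correlation also only captures linear relationships.

```python
import pandas as pd

# Hypothetical sample: longitude (X), latitude (Y) and sale price
sample_df = pd.DataFrame({
    "X": [-77.09, -77.05, -77.02, -76.99, -76.95],
    "Y": [38.95, 38.90, 38.93, 38.88, 38.91],
    "PRICE": [900_000, 750_000, 600_000, 400_000, 300_000],
})

# Pairwise Pearson correlation between the three columns
location_corr = sample_df[["X", "Y", "PRICE"]].corr()

# Correlation of each coordinate with PRICE
print(location_corr["PRICE"].drop("PRICE"))
```

`sns.heatmap(location_corr, annot=True)` would display the full matrix as a heatmap.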


Now that we have a clear picture of the impact of location on the price, we would like to know which matters more for the price of a property: the number of kitchens, bathrooms, bedrooms, or the total number of rooms.

In [None]:
# Look at ROOMS against PRICE
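
One way to look at a count-like feature against `PRICE` is to group by the feature and compare median prices, keeping the number of sales per group in view so sparse groups are not over-interpreted. The helper name `median_price_by` is hypothetical and the sample is made up. The same helper could be reused for `KITCHENS`, `BEDRM` and `BATHRM`.

```python
import pandas as pd


def median_price_by(df, feature):
    """Return the median PRICE and the sale count for each value of `feature`."""
    return df.groupby(feature)["PRICE"].agg(["median", "count"])


# Hypothetical sample
sample_df = pd.DataFrame({
    "ROOMS": [4, 4, 6, 6, 8],
    "PRICE": [300_000, 340_000, 500_000, 560_000, 900_000],
})

rooms_summary = median_price_by(sample_df, "ROOMS")
print(rooms_summary)
```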


In [None]:
# Look at KITCHENS against PRICE


In [None]:
# Look at BEDRM against PRICE


In [None]:
# Look at BATHRM against PRICE


In [None]:
# Any other way to look at it?
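
Another option, sketched here as one possibility, is to rank the four features by their Spearman rank correlation with `PRICE`. Rank correlation only assumes a monotonic relationship, which may suit small integer counts better than Pearson. The sample values are made up.

```python
import pandas as pd

FEATURES = ["ROOMS", "KITCHENS", "BEDRM", "BATHRM"]

# Hypothetical sample
sample_df = pd.DataFrame({
    "ROOMS":    [4, 5, 6, 7, 8, 9],
    "KITCHENS": [1, 1, 1, 2, 1, 2],
    "BEDRM":    [2, 2, 3, 3, 4, 5],
    "BATHRM":   [1, 2, 2, 3, 3, 4],
    "PRICE":    [250_000, 320_000, 400_000, 520_000, 610_000, 900_000],
})

# Spearman correlation of every feature with PRICE, strongest first
feature_corr = (
    sample_df[FEATURES + ["PRICE"]]
    .corr(method="spearman")["PRICE"]
    .drop("PRICE")
    .sort_values(ascending=False)
)
print(feature_corr)
```

Correlation does not separate the features from each other (more rooms usually means more bedrooms), so the ranking is a first impression rather than a measure of independent effect.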


Now we have a good idea of which locations have a higher impact on the price, and of which feature is among the most important in determining the price of a property. Well done!