This is the boring but necessary step to get some kick-ass data that we can use to predict property prices and slowly but surely climb up the property ladder. I am using the cleaned data in this [kernel](http://www.kaggle.com/akosciansky/how-to-become-a-property-tycoon-in-new-york/) to predict property prices.

In [1]:
# Import the modules

import pandas as pd
import numpy as np
from scipy import stats
import sklearn as sk
import itertools
import matplotlib.pyplot as plt
import seaborn as sns
import warnings 
warnings.filterwarnings('ignore')
%matplotlib inline
sns.set(style='white', context='notebook', palette='deep') 

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split 
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

In [2]:
data = pd.read_csv('../input/nyc-rolling-sales.csv')

In [3]:
data.head()

# Data Inspection - Overview

There are no NaNs or NULLs. However, as we'll see later, that doesn't mean the data is clean. We have to look at each attribute in more detail.

In [4]:
print(data.isnull().sum())
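`isnull()` only catches real NaN values. The placeholders that turn up later in this kernel (a single blank `' '` and the dash string `' -  '`) are ordinary strings, so they slip through. Here is a minimal sketch of how they could be counted, using a made-up toy frame rather than the real data and assuming every column holds strings:

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up).
# ' ' and ' -  ' are the placeholder strings seen later in this kernel.
toy = pd.DataFrame({
    'TAX CLASS AT PRESENT': ['1', ' ', '2'],
    'SALE PRICE': ['499000', ' -  ', '0'],
})

# A cell counts as a placeholder if nothing or only a dash is left after stripping whitespace
is_placeholder = toy.apply(lambda col: col.str.strip().isin(['', '-']))
placeholder_counts = is_placeholder.sum()

print(toy.isnull().sum())    # reports no missing values
print(placeholder_counts)    # reports the hidden ones
```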

Before I do that I want to know whether all attributes are in the right data format. For example, categorical variables can be numbers but should be stored as objects rather than numerical data. 

In [5]:
data.dtypes
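As a rough illustration of the point about categorical variables, numeric codes can be converted to pandas' `category` dtype so they are not treated as quantities. The toy frame below is made up and the choice of columns is an assumption, not something this kernel does at this point:

```python
import pandas as pd

# Toy frame (made-up values): both columns are labels, not quantities
toy = pd.DataFrame({
    'ZIP CODE': [10009, 10009, 11201],
    'TAX CLASS AT TIME OF SALE': [1, 2, 1],
})

# Convert the numeric codes to categories so they are not averaged or summed by accident
for column in ['ZIP CODE', 'TAX CLASS AT TIME OF SALE']:
    toy[column] = toy[column].astype('category')

print(toy.dtypes)
```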

**Unnamed: 0**

Something seems to have gone wrong when loading the data. There is no mention of this field anywhere and it doesn't seem to contain any useful information. Let's take a quick look at its values and then delete it.

In [6]:
data['Unnamed: 0'].value_counts()

In [7]:
#Delete the column
data = data.drop('Unnamed: 0', axis=1)

**Check for duplicates**

There are 84,548 records in total

In [8]:
len(data)

And there are 765 duplicates. Let's delete those. There are more, which we will discover later. I didn't detect them here because they have different sale prices, some of which are unrealistic, e.g. 0 or 10.

In [9]:
#Get the names of each column. I need to check that for duplicates across all columns
columns = data.columns
#Count the number of duplicates
sum(data.duplicated(columns))

In [10]:
#Delete the duplicates and check that it worked
data = data.drop_duplicates(columns, keep='last')
sum(data.duplicated(columns))
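The remaining duplicates differ only in their sale price, so one way to surface them is to compare every column except `SALE PRICE`. This is a sketch on made-up rows, not the cleaning step used in this kernel:

```python
import pandas as pd

# Toy frame (made-up values): the first two rows differ only in price
toy = pd.DataFrame({
    'ADDRESS': ['1 A ST', '1 A ST', '2 B AVE'],
    'SALE DATE': ['2017-01-05', '2017-01-05', '2017-02-01'],
    'SALE PRICE': ['1', '499000', '750000'],
})

# Compare all columns except the price
key_columns = [c for c in toy.columns if c != 'SALE PRICE']

# keep=False flags every member of a duplicate group, not just the repeats
near_duplicates = toy[toy.duplicated(subset=key_columns, keep=False)]
print(near_duplicates)
```

Which of the flagged rows to keep is a judgment call; the unrealistic prices are the obvious candidates to drop.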

# Data Inspection - Detailed

I'll now check every variable in a bit more detail. That's the boring but necessary part...

**BOROUGH**

BOROUGH contains numbers rather than the actual names. From the Overview page I know that, for example, 1 stands for Manhattan. So let's give the boroughs a name instead. It's easier to read.

In [11]:
data['BOROUGH'].value_counts()

In [12]:
# Map the numeric borough codes to names in one step (avoids chained assignment)
borough_names = {1: 'Manhattan',
                 2: 'Bronx',
                 3: 'Brooklyn',
                 4: 'Queens',
                 5: 'Staten Island'}
data['BOROUGH'] = data['BOROUGH'].map(borough_names)

Check that it worked.

In [13]:
data.head()

Now that the boroughs have names, let's visualise how many records there are.

In [14]:
sns.countplot(y = 'BOROUGH',
              data = data,
              order = data['BOROUGH'].value_counts().index)
plt.show()

**NEIGHBORHOOD**

In [15]:
data['NEIGHBORHOOD'].value_counts()

Let's visualise this just to get a rough idea of the distribution. There are too many neighbourhoods to get much meaningful insight anyway.

In [16]:
sns.countplot(y = 'NEIGHBORHOOD',
              data = data,
              order = data['NEIGHBORHOOD'].value_counts().index)
plt.show()

**TAX CLASS AT PRESENT**

In [17]:
data['TAX CLASS AT PRESENT'].value_counts()

In [18]:
(len(data[data['TAX CLASS AT PRESENT']=='1'])+len(data[data['TAX CLASS AT PRESENT']=='2']))/len(data)*100

Most of the Tax Classes are 1 or 2 - keep this in mind when predicting sale prices

In [19]:
len(data[data['TAX CLASS AT PRESENT']==' '])/len(data)*100

0.87% of the Tax Classes are missing

In [20]:
# Order the bars by frequency, most common tax class first
order = data['TAX CLASS AT PRESENT'].value_counts().index

sns.countplot(y='TAX CLASS AT PRESENT', data=data, order=order)

**BOROUGH-BLOCK-LOT = BBL**

According to the data description, BBL could be a useful feature. So let's go ahead and create it. 

In [21]:
data['BBL'] = data['BOROUGH'] + '_' + data['BLOCK'].astype(str) + '_' + data['LOT'].astype(str)

There are no missing Blocks or Lots

In [22]:
data['BBL'].value_counts()

**EASE-MENT**

In [23]:
data['EASE-MENT'].value_counts()

EASE-MENT is empty and will have to be deleted from the dataset.

**BUILDING CLASS CATEGORY**

There are a lot of categories. It would be great to group these together, but I can't really think of a good way right now.

In [24]:
data['BUILDING CLASS CATEGORY'].value_counts()

**BUILDING CLASS AT PRESENT**

In [25]:
data['BUILDING CLASS AT PRESENT'].value_counts()

How is this related to BUILDING CLASS CATEGORY? These are supposed to be the codes for the classification: http://www1.nyc.gov/assets/finance/downloads/pdf/07pdf/glossary_rsf071607.pdf

According to the pivot table below there is no one-to-one relationship between BUILDING CLASS AT PRESENT and BUILDING CLASS CATEGORY.

In [26]:
data.pivot_table(index='BUILDING CLASS AT PRESENT', columns='BUILDING CLASS CATEGORY', aggfunc='count')

**BUILDING CLASS AT TIME OF SALE**

In [27]:
data['BUILDING CLASS AT TIME OF SALE'].value_counts()

This looks more promising. Instead of a many-to-many relationship, we now have a one-to-many relationship. We see, for example, that all Class A buildings are 1 Family dwellings. This will come in handy if we want to simplify the model. We could just use 

In [28]:
data.pivot_table(index='BUILDING CLASS AT TIME OF SALE', columns='BUILDING CLASS CATEGORY', aggfunc='count')
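The pivot tables are wide and hard to scan. A more compact check, sketched here on made-up codes, is to count how many distinct categories each building class maps to. Any class with more than one category breaks the one-class-to-one-category pattern:

```python
import pandas as pd

# Toy frame: the class codes and category names are hypothetical
toy = pd.DataFrame({
    'BUILDING CLASS AT TIME OF SALE': ['X1', 'X1', 'X2', 'X2'],
    'BUILDING CLASS CATEGORY': ['CAT A', 'CAT A', 'CAT A', 'CAT B'],
})

# Number of distinct categories observed for each building class
categories_per_class = (toy
                        .groupby('BUILDING CLASS AT TIME OF SALE')['BUILDING CLASS CATEGORY']
                        .nunique())

# Classes that map to more than one category
ambiguous = categories_per_class[categories_per_class > 1]
print(ambiguous)
```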

**ADDRESS**

In [29]:
data['ADDRESS'].value_counts()

What's up with the bits after the comma? Are those apartment numbers?

In [30]:
#This counts the addresses that contain apartment numbers or something else - what about those rogue letters?
len(data[data['ADDRESS'].str.contains(',')])

**APARTMENT NUMBER**

In [31]:
data['APARTMENT NUMBER'].value_counts()

In [32]:
len(data[data['APARTMENT NUMBER']==' '])

In [33]:
len(data[data['APARTMENT NUMBER']==' '])/len(data)*100

77% of all apartment numbers are missing. Are those in the addresses that had commas?
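One way to answer that question is to cross-check the two columns: flag addresses that contain a comma, flag blank apartment numbers, and see how often both are true. The sketch below uses made-up addresses, and the `ADDRESS SUFFIX` column is a hypothetical helper, not part of the dataset:

```python
import pandas as pd

# Toy frame (made-up values)
toy = pd.DataFrame({
    'ADDRESS': ['153 AVENUE B', '234 EAST 4TH STREET, 5A', '10 MAIN ST, 2'],
    'APARTMENT NUMBER': [' ', ' ', '2'],
})

has_comma = toy['ADDRESS'].str.contains(',', regex=False)
apartment_blank = toy['APARTMENT NUMBER'].str.strip() == ''

# Rows where the apartment number is blank but the address carries a suffix
recoverable = (has_comma & apartment_blank).sum()

# Text after the first comma, as a candidate apartment number (NaN if no comma)
toy['ADDRESS SUFFIX'] = toy['ADDRESS'].str.split(',', n=1).str[1].str.strip()
print(recoverable)
print(toy)
```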

**ZIP CODE**

In [34]:
data['ZIP CODE'].value_counts()

There are 982 cases where ZIP code is 0. These should be deleted.

In [35]:
data = data.drop(data[data['ZIP CODE']==0].index)

When looking at the boxplot with ZIP codes per borough we see that there are some outliers. Some ZIP codes seem to overlap with another borough, others seem very different from the majority. Let's put those on the watch list for now.

In [36]:
fig, ax = plt.subplots(figsize=(15,8)) 
sns.boxplot(x='ZIP CODE', y='BOROUGH', data=data, ax=ax)
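To turn the watch list into something concrete, the ZIP codes that appear under more than one borough can be listed directly. A minimal sketch on made-up rows:

```python
import pandas as pd

# Toy frame (made-up rows)
toy = pd.DataFrame({
    'BOROUGH': ['Manhattan', 'Manhattan', 'Bronx', 'Queens'],
    'ZIP CODE': [10009, 10463, 10463, 11375],
})

# Count the distinct boroughs each ZIP code is recorded under
boroughs_per_zip = toy.groupby('ZIP CODE')['BOROUGH'].nunique()

# ZIP codes that show up in more than one borough
shared_zips = boroughs_per_zip[boroughs_per_zip > 1].index.tolist()
print(shared_zips)
```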

**RESIDENTIAL UNITS**

In [37]:
data['RESIDENTIAL UNITS'].value_counts()

There are 24783 cases where there are 0 residential units - are these office buildings?

Let's look at the distribution. There are properties with 1844 residential units, which seems very high! The vast majority have very few though. Let's put the outliers on the watch list. Manhattan has many skyscrapers, so it is possible that there are buildings with that many residential units.

In [38]:
data['RESIDENTIAL UNITS'].describe()

In [39]:
fig, ax = plt.subplots(figsize=(10,5)) 
sns.boxplot(x='RESIDENTIAL UNITS', data=data, ax=ax)

**COMMERCIAL UNITS**

In [40]:
data['COMMERCIAL UNITS'].value_counts()

There seem to be an awful lot of cases with 0 commercial units. Let's have a look at how many there are as a proportion of the whole data set.

In [41]:
len(data[data['COMMERCIAL UNITS']==0])/len(data)*100

94% of the properties don't have any commercial units, which means they should be residential units. We'll see in TOTAL UNITS that that's not always true because sometimes units are not defined as either residential or commercial. They can also be both.

Let's look at the distribution. It's a similar story as with RESIDENTIAL UNITS. Some huge outliers but they might just be skyscrapers.

In [42]:
data['COMMERCIAL UNITS'].describe()

In [43]:
fig, ax = plt.subplots(figsize=(10,5)) 
sns.boxplot(x='COMMERCIAL UNITS', data=data, ax=ax)

**TOTAL UNITS**

In [44]:
data['TOTAL UNITS'].value_counts()

There are 19,677 cases of 0 TOTAL UNITS. Let's have a closer look at this after a short descriptive summary.

In [45]:
data['TOTAL UNITS'].describe()

In [46]:
fig, ax = plt.subplots(figsize=(10,5)) 
sns.boxplot(x='TOTAL UNITS', data=data, ax=ax)

Let's look at the 'head' and 'tail' of the data when there are no units. 

1) We can see that the year is wrong. Some buildings were apparently built in the year 67.

2) Some buildings were built in 2017. Maybe they are not finished yet? In fact the BUILDING CLASS CATEGORY reads 'vacant land' in a lot of these cases.

3) There are cases with no SALE PRICE.

4) There are duplicates but with different SALES PRICEs such as 1 and 499,000 for the same property. A lot of those are found in Alphabet City and are old buildings. Are they getting knocked down? Is it just a case of no-one having any data?

In [47]:
pd.set_option('display.max_columns', None)
data[data['TOTAL UNITS']==0].head(50)

In [48]:
pd.set_option('display.max_columns', None)
data[data['TOTAL UNITS']==0].tail(50)

In [49]:
sum(data['RESIDENTIAL UNITS'] + data['COMMERCIAL UNITS'] == data['TOTAL UNITS'])

In [50]:
sum(data['RESIDENTIAL UNITS'] + data['COMMERCIAL UNITS'] != data['TOTAL UNITS'])

There are 2584 cases where residential + commercial != total. In some cases the unit is considered both and in others neither.

In [51]:
data[['RESIDENTIAL UNITS','COMMERCIAL UNITS', 'TOTAL UNITS']][data['RESIDENTIAL UNITS'] + data['COMMERCIAL UNITS'] != data['TOTAL UNITS']]
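The two kinds of mismatch can be told apart by the sign of the gap between the total and the sum. This is a sketch on made-up rows; the labels 'both' and 'neither' are my reading of the gap, not something the dataset states:

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values)
toy = pd.DataFrame({
    'RESIDENTIAL UNITS': [1, 0, 2],
    'COMMERCIAL UNITS': [0, 0, 1],
    'TOTAL UNITS': [1, 1, 2],
})

# Positive gap: units in the total that are in neither sub-column.
# Negative gap: the sum exceeds the total, so some unit is probably counted in both.
unit_gap = toy['TOTAL UNITS'] - (toy['RESIDENTIAL UNITS'] + toy['COMMERCIAL UNITS'])
toy['UNIT CHECK'] = np.sign(unit_gap).map({0: 'consistent', 1: 'neither', -1: 'both'})
print(toy['UNIT CHECK'].value_counts())
```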

**SALE PRICE**

In [52]:
data['SALE PRICE'].value_counts()

I need to delete properties with a price of 0 or 10. Let's also have a look at prices of "-". There are a lot of them.

In [53]:
pd.set_option('display.max_columns', None)
data[data['SALE PRICE'] == ' -  '].head(50)

Some of the data looks fine apart from the SALE PRICE. I will use this as an extra 'prediction set' and will predict their SALE PRICE at the end of the kernel. Unfortunately we don't know how accurate this will be as we don't know the actual price. But what I can do is see whether the prices fit in the range of similar flats.

There are other cases where square feet are missing. I will delete those cases in the square feet sections. At the end of the data inspection part, I'll save the 'prediction set'.
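The split into a training set and a 'prediction set' could look roughly like this. It is a sketch on made-up rows, and the names `prediction_set` and `training_set` are placeholders: prices that cannot be parsed (the `' -  '` placeholder) go to the prediction set, and prices of 0 or 10 are dropped.

```python
import pandas as pd

# Toy frame (made-up values)
toy = pd.DataFrame({
    'ADDRESS': ['1 A ST', '2 B AVE', '3 C RD', '4 D LN'],
    'SALE PRICE': ['499000', ' -  ', '0', '10'],
})

# errors='coerce' turns the dash placeholder into NaN
price = pd.to_numeric(toy['SALE PRICE'], errors='coerce')

# Rows without a usable price: to be predicted later
prediction_set = toy[price.isna()]

# Rows with a price, excluding the unrealistic values 0 and 10
keep = price.notna() & ~price.isin([0, 10])
training_set = toy[keep].assign(**{'SALE PRICE': price[keep]})
print(len(prediction_set), len(training_set))
```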

**LAND SQUARE FEET**

In [54]:
data['LAND SQUARE FEET'].value_counts()

In [55]:
(len(data[data['LAND SQUARE FEET'] == ' -  ']) + len(data[data['LAND SQUARE FEET'] == '0']))/len(data)*100

Over 43% of the data set has no sq ft or a value of '-'. This is problematic. There are two options: 1) I delete all those cases and lose almost half of the data, or 2) I keep them and don't use sq ft in the predictive model.

I will go with option 1 for now but will come back to this point and test out option 2.

**GROSS SQUARE FEET**

Does GROSS SQUARE FEET behave the same way? Let's see.

In [56]:
data['GROSS SQUARE FEET'].value_counts()

In [57]:
(len(data[data['GROSS SQUARE FEET'] == ' -  ']) + len(data[data['GROSS SQUARE FEET'] == '0']))/len(data)*100

OK, this is even worse :( 45% of the data doesn't have a decent value.

In [58]:
# Are land sq feet = 0 or - the same when gross = 0 or -?

**GROSS SQUARE FEET and LAND SQUARE FEET**

I still need to check the overlap of bad data between the two sq ft attributes. If they don't overlap then the share of bad data could be even larger than 45%.

In [59]:
len(data[(data['GROSS SQUARE FEET'] == ' -  ') | (data['GROSS SQUARE FEET'] == '0') |
    (data['LAND SQUARE FEET'] == ' -  ') | (data['LAND SQUARE FEET'] == '0')]
    )/len(data)*100

The good news is that we are still at 45% even though that's a lot.
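The same overlap check can be written with one boolean mask per column, which makes it easy to see how many rows are bad in both columns and how many in either. This sketch uses made-up rows, and `is_bad` is a hypothetical helper:

```python
import pandas as pd

# Toy frame (made-up values)
toy = pd.DataFrame({
    'LAND SQUARE FEET': ['2000', ' -  ', '0', '1500'],
    'GROSS SQUARE FEET': ['1800', ' -  ', '900', '0'],
})

def is_bad(column):
    """True where the value is the dash placeholder (unparseable) or zero."""
    numeric = pd.to_numeric(column, errors='coerce')
    return numeric.isna() | (numeric == 0)

bad_land = is_bad(toy['LAND SQUARE FEET'])
bad_gross = is_bad(toy['GROSS SQUARE FEET'])

pct_either = (bad_land | bad_gross).mean() * 100   # bad in at least one column
pct_both = (bad_land & bad_gross).mean() * 100     # bad in both columns
print(pct_either, pct_both)
```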

**YEAR BUILT**

In [60]:
data['YEAR BUILT'].value_counts()

In [61]:
fig, ax = plt.subplots(figsize=(10,5)) 
sns.boxplot(x = 'YEAR BUILT', data=data, ax=ax)

There are some REALLY old buildings, even from the year 0 or 1100. I think it's safe to delete those. But let's check first how many cases there are. 

In [62]:
data['YEAR BUILT'].describe()

75% of the buildings were built after 1920. But it's not too unrealistic to have buildings older than that. Looking at the boxplot, maybe the year 1750 would be a good cutoff. Let's check how many there are.

In [63]:
data['YEAR BUILT'][data['YEAR BUILT'] < 1750].count()

There are almost 6000 (!) buildings built before 1750. Let's see how they are distributed.

In [64]:
fig, ax = plt.subplots(figsize=(10,5)) 
sns.boxplot(x = 'YEAR BUILT', data=data[data['YEAR BUILT'] < 1750], ax=ax)

So it's quite clear that 0 is just a dummy here and we just don't know when the buildings were built. Again, I have two options: delete the values or not use the attribute in the model. For now I will delete them. The other buildings older than 1750 should also be deleted. Even if the years were true, the prices for those buildings would follow a different pattern.

Having said that, the year in itself is not going to be terribly useful to our predictive model. I should convert it to the building's age.

In [65]:
#Create new field: Built x years ago from 2017
data['Building_Age'] = 2017 - data['YEAR BUILT']
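The age only makes sense once the implausible years are gone, otherwise a year of 0 turns into a 2017-year-old building. Here is a minimal sketch of the filter described above, on made-up values and with the 1750 cutoff taken from the boxplot discussion:

```python
import pandas as pd

# Toy frame (made-up values)
toy = pd.DataFrame({'YEAR BUILT': [0, 1111, 1920, 2017]})

# Keep only buildings from 1750 onwards; .copy() avoids modifying a view of toy
plausible = toy[toy['YEAR BUILT'] >= 1750].copy()

# Age relative to 2017, the year used in the cell above
plausible['Building_Age'] = 2017 - plausible['YEAR BUILT']
print(plausible)
```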

**TAX CLASS AT TIME OF SALE**

**SALE DATE**

In [66]:
data['SALE DATE'].value_counts()

Let's get rid of the time. It's not useful.

In [67]:
data['SALE DATE'] = pd.to_datetime(data['SALE DATE'])

In [68]:
data['SALE DATE'].value_counts()

Let's extract some more useful information from the date: the year, month, and quarter/season. 

In [69]:
data['SALE YEAR'], data['SALE MONTH'], data['SALE QUARTER'] = data['SALE DATE'].dt.year, data['SALE DATE'].dt.month, data['SALE DATE'].dt.quarter

In [70]:
sns.boxplot(x='SALE MONTH', data=data[data['SALE YEAR']==2017])

In [71]:
sns.boxplot(x='SALE MONTH', data=data[data['SALE YEAR']==2016])