A brief look at our data to see how the attributes correlate and to find some interesting insights.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()

%matplotlib inline
pd.options.mode.chained_assignment = None  # default='warn'

Let's see what our data looks like.  
There are around 300 attributes. That's a lot!  
Our prediction target, price_doc, has a maximum of 1.111e+08 and a mean of 7.12e+06.

In [None]:
train_df = pd.read_csv("../input/train.csv")
train_df.describe()

Besides the data on individual houses, we also have macroeconomic data, which is a time series.

In [None]:
macro_df = pd.read_csv("../input/macro.csv")
macro_df.describe()

Let's start with our target, price_doc.

In [None]:
price = train_df['price_doc']
plt.figure(figsize=(8,4))
# distplot is deprecated in recent seaborn releases; histplot draws the same histogram
sns.histplot(price, kde=False)

It looks like the maximum is an outlier. Let's cap the prices at the 99th percentile first.

In [None]:
ulimit = np.percentile(train_df.price_doc.values, 99)
# .ix was removed from pandas; clip() caps every price above the 99th percentile
train_df['price_doc'] = train_df['price_doc'].clip(upper=ulimit)
price = train_df['price_doc']
plt.figure(figsize=(8,4))
sns.histplot(price, kde=False)
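
To make it clear what the capping step does, here is a minimal sketch on made-up numbers (the `toy_price` values are invented for illustration and are not taken from train.csv). It shows that values above the 99th percentile are replaced by the cap and that no rows are removed.

```python
import numpy as np
import pandas as pd

# Toy prices (invented): 99 ordinary values and one extreme value.
toy_price = pd.Series(list(range(1, 100)) + [10_000], dtype=float)

# 99th percentile of the toy data, used as the upper cap.
cap = np.percentile(toy_price.values, 99)

# clip() replaces every value above the cap with the cap itself.
# Nothing is dropped, so the series keeps its length.
capped = toy_price.clip(upper=cap)

print(len(toy_price), len(capped))
print(toy_price.max(), capped.max())
```

Capping keeps the extreme rows in the training set with a reduced price, while dropping them would remove those rows entirely. Which one is better depends on the model, so treat this as one option.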

Looks better.  
Next, we use a heatmap to see the 15 attributes most correlated with price_doc.

In [None]:
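# numeric_only=True skips string columns such as product_type and sub_area,
# which would otherwise raise an error in pandas 2.x
corrmat = train_df.corr(numeric_only=True)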
n = 15
cols = corrmat.nlargest(n, 'price_doc')['price_doc'].index
cm_df = train_df[cols].corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(cm_df, square=True, annot=True, fmt='.2f', annot_kws={'size':10}, cbar=True)

Number 1 is full_sq with a correlation of 0.51. Hmm... not very high.  
You can also see that several of the other attributes are correlated with one another.  
For example, sport_count_5000, sport_count_3000 and sport_count_2000 are highly correlated with each other.  
To start with, we can keep just one of them for training.  
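
One possible way to keep a single attribute from each highly correlated group is sketched below. The helper `drop_highly_correlated`, the 0.9 threshold and the toy column names are all assumptions made for this illustration; they are not part of the original analysis.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.9):
    """Drop every column whose |correlation| with an earlier column exceeds threshold.

    Hypothetical helper for illustration; the threshold is an assumption.
    """
    corr = df.corr().abs()
    # Keep only the part above the diagonal so each pair is looked at once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Toy data: three nearly identical columns plus one unrelated column.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    'count_5000': base,
    'count_3000': base + rng.normal(scale=0.05, size=200),
    'count_2000': base + rng.normal(scale=0.05, size=200),
    'unrelated': rng.normal(size=200),
})

reduced, dropped = drop_highly_correlated(toy, threshold=0.9)
print(dropped)
print(list(reduced.columns))
```

The first column of each correlated group survives, so the column order decides which one is kept.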
  
Now we can look into the relationship between full_sq and price_doc.  

In [None]:
var = 'full_sq'
data = pd.concat([train_df['price_doc'], train_df[var]], axis=1)
data.plot.scatter(x=var, y='price_doc')


There's an obvious outlier there.  
Let's remove the outliers at both the upper and the lower end.

In [None]:
ulimit = np.percentile(train_df['full_sq'].values, 99.9)
trimmed_df = train_df.drop(train_df[train_df['full_sq']>ulimit].index)

data = pd.concat([trimmed_df['price_doc'], trimmed_df[var]], axis=1)
data.plot.scatter(x=var, y='price_doc')
train_df = trimmed_df

In [None]:
dlimit = np.percentile(train_df['full_sq'].values, 0.1)
trimmed_df = train_df.drop(train_df[train_df['full_sq']<dlimit].index)

data = pd.concat([trimmed_df['price_doc'], trimmed_df[var]], axis=1)
data.plot.scatter(x=var, y='price_doc')
train_df = trimmed_df
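
The two trimming cells above can also be folded into one reusable helper. This is only a sketch: `trim_by_percentile` is a hypothetical name, and unlike the cells above it computes both limits from the same data and also drops rows where the column is missing.

```python
import numpy as np
import pandas as pd

def trim_by_percentile(df, column, lower_pct=0.1, upper_pct=99.9):
    """Keep only the rows whose `column` value lies inside the percentile range.

    Hypothetical helper; rows with NaN in `column` are dropped as well,
    because between() returns False for missing values.
    """
    lower, upper = np.percentile(df[column].dropna().values, [lower_pct, upper_pct])
    return df[df[column].between(lower, upper)]

# Toy data (invented): one suspiciously small and one suspiciously large area.
toy = pd.DataFrame({
    'full_sq': [0, 30, 35, 40, 45, 50, 55, 60, 65, 5000],
    'price_doc': range(10),
})

# Wider percentiles than in the notebook, because the toy frame has only 10 rows.
trimmed = trim_by_percentile(toy, 'full_sq', lower_pct=10, upper_pct=90)
print(trimmed['full_sq'].tolist())
```

The same helper could then be applied to other attributes without repeating the percentile and drop logic.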

Okay, so far we have only dealt with full_sq.  
We could carry on and clean the other attributes in the same way.  
However, let's check the missing data first.

In [None]:
total = train_df.isnull().sum().sort_values(ascending=False)
# isnull().mean() is the fraction of missing values per column
percent = train_df.isnull().mean().sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(50)

A large share of the attributes have missing data.  
We can drop the attributes that are missing more than a certain fraction of their values, for example Percent > 0.2.  
For the remaining attributes, we have to find a way to fill in the missing values if we really want to keep them as features.
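
A minimal sketch of that plan is shown below. The helper name `handle_missing` is hypothetical, and filling the gaps with the column median is an assumption made only for this example; the right fill strategy depends on the attribute.

```python
import numpy as np
import pandas as pd

def handle_missing(df, max_missing=0.2):
    """Drop columns missing more than max_missing of their values, then fill the rest.

    Hypothetical helper for illustration.
    """
    missing_share = df.isnull().mean()
    keep = missing_share[missing_share <= max_missing].index
    out = df[keep].copy()
    numeric_cols = out.select_dtypes(include='number').columns
    # Median imputation is an assumption of this sketch, not a recommendation.
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    return out

# Toy data (invented): one column is mostly empty, one has a single gap.
toy = pd.DataFrame({
    'mostly_missing': [np.nan, np.nan, np.nan, 1.0, 2.0],
    'few_missing': [1.0, np.nan, 3.0, 4.0, 5.0],
    'complete': [10, 20, 30, 40, 50],
})

cleaned = handle_missing(toy, max_missing=0.2)
print(cleaned)
```

Non-numeric columns are left untouched here; they would need their own rule, such as a separate "unknown" category.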

Next, let's check some categorical attributes.  
First is product_type:

In [None]:
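# Select price_doc before mean() so string columns are not averaged
g_p_type = train_df.groupby('product_type')['price_doc'].mean()
plt.figure(figsize=(8,4))
# Recent seaborn versions require x and y as keyword arguments
sns.barplot(x=g_p_type.index, y=g_p_type.values)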
plt.ylabel('price_doc')
plt.show()

The mean price of the Investment type is slightly higher than that of OwnerOccupier.  
How about the number of transactions of each type?

In [None]:
g_p_type = train_df['product_type'].value_counts()
plt.figure(figsize=(8,4))
sns.barplot(x=g_p_type.index, y=g_p_type.values)
plt.ylabel('Number of Occurrences')
plt.show()
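
The mean and the count per product type can also be read from a single table instead of two charts. The sketch below uses a small invented frame (the `toy` values are not from train.csv) to show the `agg` call.

```python
import pandas as pd

# Toy data (invented) with the same two column names as the training set.
toy = pd.DataFrame({
    'product_type': ['Investment', 'Investment', 'OwnerOccupier', 'Investment'],
    'price_doc': [6.0, 8.0, 5.0, 7.0],
})

# One groupby call that returns both the mean price and the number of rows per type.
summary = toy.groupby('product_type')['price_doc'].agg(['mean', 'count'])
print(summary)
```

On the real data, the same call with `train_df` in place of `toy` would give both numbers side by side.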

Wow! People like to invest in real estate.  
How about the sub_area?

In [None]:
# The 15 sub-areas with the highest mean price
sub_area_list = train_df.groupby('sub_area')['price_doc'].mean().sort_values(ascending=False)[:15]
plt.figure(figsize=(8,4))
sns.barplot(x=sub_area_list.index, y=sub_area_list.values)
plt.ylabel('price_doc')
plt.xticks(rotation=70)
plt.show()

In [None]:
# The 15 sub-areas with the lowest mean price
sub_area_list = train_df.groupby('sub_area')['price_doc'].mean().sort_values(ascending=True)[:15]
plt.figure(figsize=(8,4))
sns.barplot(x=sub_area_list.index, y=sub_area_list.values)
plt.ylabel('price_doc')
plt.xticks(rotation=70)
plt.show()

Interesting!  
Location, Location, Location!  
I hope all of this is helpful.  
Let's dig into the macro data next time.