Source: https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-zillow-prize

**Zillow:**

Zillow is an online real estate database company founded in 2006.

**Zestimate:**

Zestimates are estimated home values based on 7.5 million statistical and machine learning models that analyze hundreds of data points on each property. Zillow also keeps working to improve their median margin of error.

**Objective**:
Build a model to improve the Zestimate residual error.

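House price prediction usually means building a model from house-related variables to predict the price itself,
but the goal of this competition is to build a model that improves the residual error.

Here the residual is the error between the Zestimate and the actual sale price. The competition's target is **logerror = log(Zestimate) - log(SalePrice)**.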
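
As a small illustration of the target, the competition data page defines `logerror` as log(Zestimate) - log(SalePrice). The sketch below computes it for made-up prices. The helper name `log_error` and all the numbers are hypothetical and are not part of the competition data.

```python
import numpy as np

def log_error(zestimate, sale_price):
    """Return log(Zestimate) - log(SalePrice), element-wise."""
    zestimate = np.asarray(zestimate, dtype=float)
    sale_price = np.asarray(sale_price, dtype=float)
    return np.log(zestimate) - np.log(sale_price)

# Hypothetical prices, for illustration only
zestimate = [500_000, 300_000, 410_000]
sale_price = [480_000, 330_000, 410_000]

errors = log_error(zestimate, sale_price)
# Positive: Zestimate too high, negative: too low, zero: exact
print(errors)
```

Working on the log scale makes the error relative, so a 10% miss on a cheap house and on an expensive house count about the same.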

 

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()

%matplotlib inline
import warnings
warnings.filterwarnings(action='ignore')
pd.options.mode.chained_assignment = None
pd.options.display.max_columns = 999

In [None]:
# List the available data files
from subprocess import check_output
print(check_output(['ls', '../input/zillow-prize-1']).decode('utf8'))

# 1. Train Data

In [None]:
train_df = pd.read_csv("../input/zillow-prize-1/train_2016_v2.csv", parse_dates=["transactiondate"])
train_df.shape

In [None]:
train_df.head()

## 1-1. Logerror

In [None]:
plt.figure(figsize=(8,6))
plt.scatter(range(train_df.shape[0]), np.sort(train_df.logerror.values))
plt.xlabel('index', fontsize=12)
plt.ylabel('logerror', fontsize=12)
plt.show()

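There are outliers at both ends, so let us cap them and then plot a histogram.

We compute the 1st and 99th percentiles and replace any value beyond them with the corresponding limit.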
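
The capping step can also be sketched as a small reusable helper. This is only an illustration on toy data: the function name `cap_by_percentile` is hypothetical, and it uses `Series.clip` instead of the two boolean assignments in the notebook.

```python
import numpy as np
import pandas as pd

def cap_by_percentile(series, lower=1, upper=99):
    """Clip values outside the [lower, upper] percentile range to the limits."""
    llimit, ulimit = np.percentile(series.values, [lower, upper])
    return series.clip(lower=llimit, upper=ulimit)

# Toy data 0..100, so the 1st and 99th percentiles are easy to check by eye
toy = pd.Series(np.arange(101, dtype=float))
capped = cap_by_percentile(toy)
print(capped.min(), capped.max())
```

Capping keeps every row (unlike dropping outliers), so the row count of the data does not change.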

In [None]:
ulimit = np.percentile(train_df.logerror.values, 99)
llimit = np.percentile(train_df.logerror.values, 1)
# Assign through a single .loc call; chained indexing may not write back to train_df
train_df.loc[train_df['logerror'] > ulimit, 'logerror'] = ulimit
train_df.loc[train_df['logerror'] < llimit, 'logerror'] = llimit

plt.figure(figsize=(12,8))
sns.distplot(train_df.logerror.values, bins=50, kde=False)
plt.xlabel('logerror', fontsize=12)
plt.show()

Note: the deprecated `.ix` indexer is replaced with `.loc` here.

## 1-2. transaction 

Now let us explore the date field. Let us first check the number of transactions in each month.

In [None]:
train_df['transaction_month'] = train_df['transactiondate'].dt.month

cnt_srs = train_df['transaction_month'].value_counts()
plt.figure(figsize=(12,6))
sns.barplot(x=cnt_srs.index, y=cnt_srs.values, alpha=0.8, color=color[3])
plt.xticks(rotation='vertical')
plt.xlabel('Month of transaction', fontsize=12)
plt.ylabel('Number of Occurrences', fontsize=12)
plt.show()

As the data page also notes, the train data has all the transactions before October 15, 2016, plus some of the transactions after October 15, 2016.

So we have shorter bars in the last three months.

**Parcel ID:**

In [None]:
# Count how many parcel IDs appear once, twice, and so on
train_df['parcelid'].value_counts().value_counts()

So most of the parcel IDs appear only once in the dataset.

# 2. Properties Data

Now let us explore the properties_2016 file.

These are the property features for 2016.

In [None]:
prop_df = pd.read_csv("../input/zillow-prize-1/properties_2016.csv")
prop_df.shape

In [None]:
prop_df.head()

There are so many NaN values in the dataset. So let us first do some exploration on that one.

## 2-1. missing values
- Check later whether they are meaningful


In [None]:
missing_df = prop_df.isnull().sum(axis=0).reset_index()
missing_df.columns = ['column_name', 'missing_count']
missing_df = missing_df.loc[missing_df['missing_count']>0]
missing_df = missing_df.sort_values(by='missing_count')

ind = np.arange(missing_df.shape[0])
width = 0.9
fig, ax = plt.subplots(figsize=(12,18))
rects = ax.barh(ind, missing_df.missing_count.values, color = 'blue')
ax.set_yticks(ind)
ax.set_yticklabels(missing_df.column_name.values, rotation='horizontal')
ax.set_xlabel("Count of missing values")
ax.set_title("Number of missing values in each column")
plt.show()

## 2-2. latitude and longitude

In [None]:
plt.figure(figsize=(12,12))
sns.jointplot(x=prop_df.latitude.values, y=prop_df.longitude.values, height=10)
plt.ylabel('Longitude', fontsize=12)
plt.xlabel('Latitude', fontsize=12)
plt.show()

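As the map above shows, the data provides the full list of real estate properties in three counties (Los Angeles, Orange and Ventura, California) for 2016.

The train file has 90,811 rows but the properties file has 2,985,217 rows, so we merge the two files and then carry out the analysis.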
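
To see why a left merge suits this size mismatch, here is a minimal sketch on toy tables. The frames `toy_train` and `toy_prop` and their values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the train and properties tables (values are made up)
toy_train = pd.DataFrame({"parcelid": [1, 3], "logerror": [0.02, -0.01]})
toy_prop = pd.DataFrame({"parcelid": [1, 2, 3, 4],
                         "bedroomcnt": [3.0, 2.0, 4.0, 1.0]})

# how='left' keeps every train row and attaches the matching property columns;
# properties without a transaction (parcelid 2 and 4) are left out
merged = pd.merge(toy_train, toy_prop, on="parcelid", how="left")
print(merged.shape)
print(merged)
```

The merged table has as many rows as the train table, as long as each `parcelid` appears only once in the properties table.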

# 3. all data

In [None]:
train_df = pd.merge(train_df, prop_df, on='parcelid', how='left')
train_df.head()

Now let us check the dtypes of different types of variable.

In [None]:
pd.options.display.max_rows = 65

dtype_df = train_df.dtypes.reset_index()
dtype_df.columns = ["Count", "Column Type"]
dtype_df

In [None]:
dtype_df.groupby("Column Type").count().reset_index()

## 3-1. missing values

In [None]:
# Check missing values
missing_df = train_df.isnull().sum(axis=0).reset_index()
missing_df.columns = ['column_name', 'missing_count']
missing_df['missing_ratio'] = missing_df['missing_count'] / train_df.shape[0]
missing_df.loc[missing_df['missing_ratio']>0.999]

## 3-2. Univariate Analysis

Since there are many variables, we look only at how the float variables relate to the target.



In [None]:
mean_values = train_df.mean(axis=0, numeric_only=True)  # skip object columns
train_df.fillna(mean_values, inplace=True)
train_df_new = train_df
x_cols = [col for col in train_df_new.columns 
          if col not in ['logerror'] if train_df_new[col].dtype=='float64']

labels = []
values = []

for col in x_cols:
    labels.append(col)
    values.append(np.corrcoef(train_df_new[col].values, 
                             train_df.logerror.values)[0,1])
corr_df = pd.DataFrame({'col_labels':labels, 'corr_values':values})
corr_df = corr_df.sort_values(by='corr_values')

ind = np.arange(len(labels))
width = 0.9
fig, ax = plt.subplots(figsize=(12,40))
rects = ax.barh(ind, np.array(corr_df.corr_values.values), color='y')
ax.set_yticks(ind)
ax.set_yticklabels(corr_df.col_labels.values, rotation= 'horizontal')
ax.set_xlabel("Correlation coefficient")
ax.set_title("Correlation coefficient of the variables")
plt.show()

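The correlations with the target are low overall.

A few variables have no correlation value at all.

These variables seem to have only one unique value, so the correlation cannot be computed for them.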
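
A quick way to convince ourselves of this is to correlate a constant column with an arbitrary target. The arrays below are made up for illustration.

```python
import numpy as np

constant = np.full(5, 2016.0)                    # a column with one unique value
target = np.array([0.1, -0.2, 0.05, 0.0, 0.3])   # made-up logerror values

# A constant column has zero variance, so the Pearson formula divides 0 by 0
with np.errstate(invalid="ignore", divide="ignore"):
    r = np.corrcoef(constant, target)[0, 1]

print(np.isnan(r))
```

So a missing bar in the correlation plot means "undefined" rather than "uncorrelated".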



In [None]:
corr_zero_cols = ['assessmentyear', 'storytypeid', 'pooltypeid2', 'pooltypeid7', 'pooltypeid10', 'poolcnt', 'decktypeid', 'buildingclasstypeid']
for col in corr_zero_cols:
    print(col, len(train_df_new[col].unique()))

Let us pick out the variables with the highest correlation.

In [None]:
corr_df_sel = corr_df.loc[(corr_df['corr_values']>0.02) | (corr_df['corr_values'] < -0.01)]
corr_df_sel

In [None]:
cols_to_use = corr_df_sel.col_labels.tolist()

temp_df = train_df[cols_to_use]
corrmat = temp_df.corr(method='spearman')
f, ax = plt.subplots(figsize=(8, 8))

# Draw the heatmap using seaborn
sns.heatmap(corrmat, vmax=1., square=True)
plt.title("Important variables correlation map", fontsize=15)
plt.show()

The important variables themselves are very highly correlated! Let us now look at each of them.

**Finished SquareFeet 12:**

Let us see how the finished square feet 12 varies with the log error.

In [None]:
col = "finishedsquarefeet12"
ulimit = np.percentile(train_df[col].values, 99.5)
llimit = np.percentile(train_df[col].values, 0.5)
train_df.loc[train_df[col] > ulimit, col] = ulimit
train_df.loc[train_df[col] < llimit, col] = llimit

plt.figure(figsize=(12,12))
sns.jointplot(x=train_df.finishedsquarefeet12.values, y=train_df.logerror.values, height=10, color=color[4])
plt.ylabel('Log Error', fontsize=12)
plt.xlabel('Finished Square Feet 12', fontsize=12)
plt.title("Finished square feet 12 Vs Log error", fontsize=15)
plt.show()


The range of logerror narrows as finishedsquarefeet12 increases.

Larger houses (more square feet) show less error, so they are probably easier to predict.

**calculatedfinishedsquarefeet:**

In [None]:
col = "calculatedfinishedsquarefeet"
ulimit = np.percentile(train_df[col].values, 99.5)
llimit = np.percentile(train_df[col].values, 0.5)
train_df.loc[train_df[col] > ulimit, col] = ulimit
train_df.loc[train_df[col] < llimit, col] = llimit

plt.figure(figsize=(12,12))
sns.jointplot(x=train_df.calculatedfinishedsquarefeet.values, y=train_df.logerror.values, height=10, color=color[5])
plt.ylabel('Log Error', fontsize=12)
plt.xlabel('Calculated finished square feet', fontsize=12)
plt.title("Calculated finished square feet Vs Log error", fontsize=15)
plt.show()

The distribution is similar, which suggests that the two variables above are highly correlated.
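
One way to check such a guess directly is to compute the correlation between the two columns. Here is a sketch on made-up values; in the notebook the inputs would be the two `train_df` columns of the same names.

```python
import pandas as pd

# Made-up square-feet values standing in for the two real columns
toy = pd.DataFrame({
    "finishedsquarefeet12": [900.0, 1200.0, 1500.0, 2100.0, 3000.0],
    "calculatedfinishedsquarefeet": [950.0, 1200.0, 1480.0, 2200.0, 3000.0],
})

# Pearson correlation between the two columns (the pandas default method)
rho = toy["finishedsquarefeet12"].corr(toy["calculatedfinishedsquarefeet"])
print(rho)
```

A value close to 1 would support treating the two columns as near duplicates.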

In [None]:
plt.figure(figsize=(12,8))
sns.countplot(x="bathroomcnt", data=train_df)
plt.ylabel('Count', fontsize=12)
plt.xlabel('Bathroom', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Frequency of Bathroom count", fontsize=15)
plt.show()

In [None]:
plt.figure(figsize=(12,9))
sns.boxplot(x='bathroomcnt', y = 'logerror', data = train_df)
plt.ylabel('Log error', fontsize=12)
plt.xlabel('Bathroom Count', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("How log error changes with bathroom count?", fontsize=15)
plt.show()

The interquartile range widens as the bathroom count increases.

Lower bathroom counts have more outlier values.

In other words, the log error is widely spread there.
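
The interquartile range that the boxplot draws can also be computed per group. This is a sketch on made-up rows standing in for `train_df[['bathroomcnt', 'logerror']]`.

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "bathroomcnt": [1, 1, 1, 1, 3, 3, 3, 3],
    "logerror": [-0.01, 0.0, 0.01, 0.02, -0.10, -0.02, 0.04, 0.12],
})

grouped = toy.groupby("bathroomcnt")["logerror"]
# IQR = 75th percentile - 25th percentile within each bathroom count
iqr = grouped.quantile(0.75) - grouped.quantile(0.25)
print(iqr)
```

Comparing these numbers across groups gives the same reading as the box heights, without relying on the eye.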

**Bedroom count**

In [None]:
plt.figure(figsize=(12,8))
sns.countplot(x="bedroomcnt", data=train_df)
plt.ylabel('Frequency', fontsize=12)
plt.xlabel('Bedroom Count', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Frequency of Bedroom count", fontsize=15)
plt.show()

In [None]:
train_df.loc[train_df['bedroomcnt'] > 7, 'bedroomcnt'] = 7
plt.figure(figsize=(12,8))
sns.violinplot(x='bedroomcnt', y='logerror', data=train_df)
plt.xlabel('Bedroom count', fontsize=12)
plt.ylabel('Log Error', fontsize=12)
plt.show()

The logerror values are centered on 0 and fall within roughly -0.2 to 0.2.

In [None]:
col = "taxamount"
ulimit = np.percentile(train_df[col].values, 99.5)
llimit = np.percentile(train_df[col].values, 0.5)
train_df.loc[train_df[col] > ulimit, col] = ulimit
train_df.loc[train_df[col] < llimit, col] = llimit

plt.figure(figsize=(12,12))
sns.jointplot(x=train_df['taxamount'].values, y=train_df['logerror'].values, height=10, color='g')
plt.ylabel('Log Error', fontsize=12)
plt.xlabel('Tax Amount', fontsize=12)
plt.title("Tax Amount Vs Log error", fontsize=15)
plt.show()

**YearBuilt:**

Let us explore how the error varies with the yearbuilt variable.


In [None]:
from ggplot import *
ggplot(aes(x='yearbuilt', y='logerror'), data = train_df) + \
    geom_point(color='steelblue', size=1) + \
    stat_smooth()

There is a slight increasing trend with respect to the year built.

Now let us look at logerror against latitude and longitude.

In [None]:
ggplot(aes(x='latitude', y='longitude', color='logerror'), data=train_df) + \
    geom_point() + \
    scale_color_gradient(low = 'red', high = 'blue')

There are no visible pockets.

Let us check for a visible pattern using the variables with the highest positive and negative correlations.

In [None]:
ggplot(aes(x='finishedsquarefeet12', y='taxamount', color='logerror'), data=train_df) + \
    geom_point(alpha=0.7) + \
    scale_color_gradient(low = 'pink', high = 'blue')

There is no visible pattern here either.

In [None]:
ggplot(aes(x='finishedsquarefeet12', y='taxamount', color='logerror'), data=train_df) + \
    geom_now_its_art()

Hurray! Finally we got a nice pattern in the data :P

The univariate analysis gave us an understanding of the important variables. But it looks at each variable on a stand-alone basis and it assumes linearity. So let us now build a non-linear Extra Trees model and take the important variables from it.
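
To illustrate why the linearity assumption matters, here is a toy sketch in which a feature drives the target completely but has a correlation coefficient of about zero. The data is synthetic and the setup is an assumption made for illustration, not something taken from the competition data.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.RandomState(0)
x = np.linspace(-1.0, 1.0, 501)      # symmetric around 0
noise = rng.normal(size=x.size)      # unrelated feature
y = x ** 2                           # purely non-linear dependence on x

# The Pearson correlation misses the symmetric, non-linear relationship
corr_x = np.corrcoef(x, y)[0, 1]

# A tree ensemble can still attribute the signal to x
model = ExtraTreesRegressor(n_estimators=50, random_state=0)
model.fit(np.column_stack([x, noise]), y)
imp_x, imp_noise = model.feature_importances_

print(abs(corr_x) < 1e-6, imp_x > imp_noise)
```

This is the reason to cross-check the correlation plot against tree-based importances before discarding a variable.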

## Building a non-linear model

In [None]:
train_y = train_df['logerror'].values
cat_cols = ["hashottuborspa", "propertycountylandusecode", "propertyzoningdesc", "fireplaceflag", "taxdelinquencyflag"]
train_df = train_df.drop(['parcelid', 'logerror', 'transactiondate', 'transaction_month']+cat_cols, axis=1)
feat_names = train_df.columns.values
# Before fitting the tree model, drop the columns that are not needed (the target,
# categorical variables, date variables) and keep only the numeric variables to analyze.
# axis=1 drops columns (cf. axis=0 drops rows)


from sklearn import ensemble
model = ensemble.ExtraTreesRegressor(n_estimators=25, max_depth=30, max_features=0.3, n_jobs=-1, random_state=0)
model.fit(train_df, train_y)

## plot the importances ##
importances = model.feature_importances_
std = np.std([tree.feature_importances_ for tree in model.estimators_], axis=0)
indices = np.argsort(importances)[::-1][:20]

plt.figure(figsize=(12,12))
plt.title("Feature importances")
plt.bar(range(len(indices)), importances[indices], color="r", yerr=std[indices], align="center")
plt.xticks(range(len(indices)), feat_names[indices], rotation='vertical')
plt.xlim([-1, len(indices)])
plt.show()

In [None]:
import xgboost as xgb
xgb_params = {
    'eta': 0.05,
    'max_depth': 8,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'objective': 'reg:squarederror',  # current name for the deprecated 'reg:linear'
    'verbosity': 0,                   # replaces the old 'silent' flag
    'seed' : 0
}
dtrain = xgb.DMatrix(train_df, train_y, feature_names=list(train_df.columns))
model = xgb.train(dict(xgb_params, verbosity=1), dtrain, num_boost_round=50)

# plot the important features #
fig, ax = plt.subplots(figsize=(12,18))
xgb.plot_importance(model, max_num_features=50, height=0.8, ax=ax)
plt.show()

In [None]:
import numpy as np
import pandas as pd
import xgboost as xgb
import gc

print('Loading data ...')

train = pd.read_csv('../input/zillow-prize-1/train_2016_v2.csv')
prop = pd.read_csv('../input/zillow-prize-1/properties_2016.csv')
sample = pd.read_csv('../input/zillow-prize-1/sample_submission.csv')

print('Binding to float32')

for c, dtype in zip(prop.columns, prop.dtypes):
    if dtype == np.float64:
        prop[c] = prop[c].astype(np.float32)

print('Creating training set ...')

df_train = train.merge(prop, how='left', on='parcelid')

x_train = df_train.drop(['parcelid', 'logerror', 'transactiondate', 'propertyzoningdesc', 'propertycountylandusecode'], axis=1)
y_train = df_train['logerror'].values
print(x_train.shape, y_train.shape)

train_columns = x_train.columns

for c in x_train.dtypes[x_train.dtypes == object].index.values:
    x_train[c] = (x_train[c] == True)

del df_train; gc.collect()

split = 80000
x_train, y_train, x_valid, y_valid = x_train[:split], y_train[:split], x_train[split:], y_train[split:]

print('Building DMatrix...')

d_train = xgb.DMatrix(x_train, label=y_train)
d_valid = xgb.DMatrix(x_valid, label=y_valid)

del x_train, x_valid; gc.collect()

print('Training ...')

params = {}
params['eta'] = 0.02
params['objective'] = 'reg:squarederror'  # current name for the deprecated 'reg:linear'
params['eval_metric'] = 'mae'
params['max_depth'] = 4
params['verbosity'] = 0  # replaces the old 'silent' flag

watchlist = [(d_train, 'train'), (d_valid, 'valid')]
clf = xgb.train(params, d_train, 10000, watchlist, early_stopping_rounds=100, verbose_eval=10)

del d_train, d_valid

print('Building test set ...')

sample['parcelid'] = sample['ParcelId']
df_test = sample.merge(prop, on='parcelid', how='left')

del prop; gc.collect()

x_test = df_test[train_columns]
for c in x_test.dtypes[x_test.dtypes == object].index.values:
    x_test[c] = (x_test[c] == True)

del df_test, sample; gc.collect()

d_test = xgb.DMatrix(x_test)

del x_test; gc.collect()

print('Predicting on test ...')

p_test = clf.predict(d_test)

del d_test; gc.collect()

sub = pd.read_csv('../input/zillow-prize-1/sample_submission.csv')
for c in sub.columns[sub.columns != 'ParcelId']:
    sub[c] = p_test

print('Writing csv ...')
sub.to_csv('xgb_starter.csv', index=False, float_format='%.4f') # Thanks to @inversion
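
The training loop above monitors `mae` on the held-out rows. As a sketch of what that metric measures, the example below scores a constant prediction on made-up validation targets. The helper name `mean_absolute_error` and the numbers are hypothetical.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Made-up validation targets, for illustration only
y_valid_toy = np.array([0.02, -0.05, 0.10, 0.00])

# Predicting the median everywhere is the best constant guess under MAE
pred_constant = np.full_like(y_valid_toy, np.median(y_valid_toy))
baseline_mae = mean_absolute_error(y_valid_toy, pred_constant)
print(baseline_mae)
```

Comparing the model's validation `mae` with such a constant baseline shows how much the features actually add.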