# Zillow’s Home Value Prediction (Zestimate)

Using data provided by Zillow, a U.S. real estate company, we aim to build a house price determination model with regression analysis.

The dataset is distributed through [Kaggle](https://www.kaggle.com/c/zillow-prize-1 "Zillow Prize: Zillow’s Home Value Prediction (Zestimate)") as part of a competition that currently carries prize money.
Only the 2016 data is available at the moment; the 2017 data is scheduled to be added in October 2017. This analysis uses the 2016 data that is currently provided.


The following files are provided:
* properties_2016.csv (618 MB)
* sample_submission.csv (59.7 MB)
* train_2016_v2.csv (2.33 MB)
* zillow_data_dictionary.xlsx (20.1 KB)

(Further explanation to be added.)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Loading and Merging the Data

First, load 'train_2016_v2.csv', which contains the target variable `logerror`.

In [2]:
train = pd.read_csv('./data/train_2016_v2.csv')
train.head()

Unnamed: 0,parcelid,logerror,transactiondate
0,11016594,0.0276,2016-01-01
1,14366692,-0.1684,2016-01-01
2,12098116,-0.004,2016-01-01
3,12643413,0.0218,2016-01-02
4,14432541,-0.005,2016-01-02
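For context, the competition defines `logerror` as the log of the Zestimate minus the log of the actual sale price. The sketch below illustrates that definition with made-up prices. The `zestimate` and `sale_price` arrays are hypothetical (the dataset ships only `logerror`, not these prices), and the natural logarithm is assumed.

```python
import numpy as np

# Hypothetical values for illustration only; these prices are not in the dataset.
zestimate = np.array([310_000.0, 480_000.0, 250_000.0])
sale_price = np.array([300_000.0, 500_000.0, 250_000.0])

# logerror = log(Zestimate) - log(SalePrice), assuming the natural logarithm.
logerror = np.log(zestimate) - np.log(sale_price)

# Positive: the Zestimate was above the sale price.
# Negative: the Zestimate was below the sale price.
# Zero: the Zestimate matched the sale price exactly.
print(np.round(logerror, 4))
```

Read this way, the small values in the `logerror` column above indicate estimates that were within a few percent of the sale price.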


Next, load 'properties_2016.csv', which contains the property information (58 columns).

In [3]:
prop = pd.read_csv('./data/properties_2016.csv')
prop.head()


Unnamed: 0,parcelid,airconditioningtypeid,architecturalstyletypeid,basementsqft,bathroomcnt,bedroomcnt,buildingclasstypeid,buildingqualitytypeid,calculatedbathnbr,decktypeid,...,numberofstories,fireplaceflag,structuretaxvaluedollarcnt,taxvaluedollarcnt,assessmentyear,landtaxvaluedollarcnt,taxamount,taxdelinquencyflag,taxdelinquencyyear,censustractandblock
0,10754147,,,,0.0,0.0,,,,,...,,,,9.0,2015.0,9.0,,,,
1,10759547,,,,0.0,0.0,,,,,...,,,,27516.0,2015.0,27516.0,,,,
2,10843547,,,,0.0,0.0,,,,,...,,,650756.0,1413387.0,2015.0,762631.0,20800.37,,,
3,10859147,,,,0.0,0.0,3.0,7.0,,,...,1.0,,571346.0,1156834.0,2015.0,585488.0,14557.57,,,
4,10879947,,,,0.0,0.0,4.0,,,,...,,,193796.0,433491.0,2015.0,239695.0,5725.17,,,
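Because 'properties_2016.csv' is large (618 MB), it may help to tell pandas the column types up front instead of letting it infer them. The sketch below is one possible approach, not what this notebook does: it uses a tiny in-memory stand-in with a few of the column names shown above, and the rows and the `dtypes` mapping are illustrative assumptions.

```python
import io
import pandas as pd

# Tiny stand-in for properties_2016.csv; the rows are illustrative, not real records.
csv_text = """parcelid,bathroomcnt,taxvaluedollarcnt,taxdelinquencyflag
10754147,0.0,9.0,
10759547,0.0,27516.0,Y
10843547,0.0,1413387.0,
"""

# Default load: pandas infers the type of every column.
prop_default = pd.read_csv(io.StringIO(csv_text))

# Explicit dtypes (an assumed mapping): smaller floats and a categorical flag.
# Note that float32 loses precision for very large values, so check the
# value ranges before applying this to money columns in the real file.
dtypes = {
    "parcelid": "int64",
    "bathroomcnt": "float32",
    "taxvaluedollarcnt": "float32",
    "taxdelinquencyflag": "category",
}
prop_small = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

# Compare the per-column types and the total memory footprint.
print(prop_default.dtypes)
print(prop_small.dtypes)
print(prop_default.memory_usage(deep=True).sum(),
      prop_small.memory_usage(deep=True).sum())
```

For the real file, the same `dtype=` argument would be passed to the `pd.read_csv` call in cell 3, with a mapping covering whichever columns matter.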


In [4]:
print(train.shape, prop.shape)

(90275, 3) (2985217, 58)


Merge the `train` and `prop` tables on `parcelid` to create a single data table (`dataset`).

In [5]:
dataset = pd.merge(train, prop, on="parcelid", how="left")

In [6]:
dataset.head()

Unnamed: 0,parcelid,logerror,transactiondate,airconditioningtypeid,architecturalstyletypeid,basementsqft,bathroomcnt,bedroomcnt,buildingclasstypeid,buildingqualitytypeid,...,numberofstories,fireplaceflag,structuretaxvaluedollarcnt,taxvaluedollarcnt,assessmentyear,landtaxvaluedollarcnt,taxamount,taxdelinquencyflag,taxdelinquencyyear,censustractandblock
0,11016594,0.0276,2016-01-01,1.0,,,2.0,3.0,,4.0,...,,,122754.0,360170.0,2015.0,237416.0,6735.88,,,60371070000000.0
1,14366692,-0.1684,2016-01-01,,,,3.5,4.0,,,...,,,346458.0,585529.0,2015.0,239071.0,10153.02,,,
2,12098116,-0.004,2016-01-01,1.0,,,3.0,2.0,,4.0,...,,,61994.0,119906.0,2015.0,57912.0,11484.48,,,60374640000000.0
3,12643413,0.0218,2016-01-02,1.0,,,2.0,2.0,,4.0,...,,,171518.0,244880.0,2015.0,73362.0,3048.74,,,60372960000000.0
4,14432541,-0.005,2016-01-02,,,,2.5,4.0,,,...,2.0,,169574.0,434551.0,2015.0,264977.0,5488.96,,,60590420000000.0
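A left join keeps every row of `train` and fills the property columns with NaN when no matching `parcelid` exists. The sketch below shows one way to sanity-check such a merge using the `validate` and `indicator` options of `pd.merge`. The `train_toy` and `prop_toy` frames are hypothetical stand-ins with made-up values.

```python
import pandas as pd

# Toy stand-ins for train and prop (values are illustrative).
train_toy = pd.DataFrame({
    "parcelid": [1, 2, 2, 3],            # the same parcel may appear more than once
    "logerror": [0.03, -0.17, 0.01, 0.02],
})
prop_toy = pd.DataFrame({
    "parcelid": [1, 2, 4],               # parcel 3 has no property record here
    "bedroomcnt": [3.0, 4.0, 2.0],
})

merged = pd.merge(
    train_toy, prop_toy,
    on="parcelid", how="left",
    validate="many_to_one",  # raises MergeError if parcelid repeats in prop_toy
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# A left join with unique right-hand keys preserves the left row count.
assert len(merged) == len(train_toy)

# Rows marked 'left_only' had no match and carry NaN in the property columns.
print(merged["_merge"].value_counts())
print(merged[merged["_merge"] == "left_only"])
```

Applied to the real tables, the same check would confirm that `dataset` still has the 90275 rows of `train`.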


To avoid reloading and merging the source files in later work, save the merged data as 'dataset.csv'.

In [7]:
dataset.to_csv('./data/dataset.csv', index=False)

When resuming work later, load only 'dataset.csv'.

In [8]:
dataset = pd.read_csv('./data/dataset.csv')
dataset.head()

Unnamed: 0,parcelid,logerror,transactiondate,airconditioningtypeid,architecturalstyletypeid,basementsqft,bathroomcnt,bedroomcnt,buildingclasstypeid,buildingqualitytypeid,...,numberofstories,fireplaceflag,structuretaxvaluedollarcnt,taxvaluedollarcnt,assessmentyear,landtaxvaluedollarcnt,taxamount,taxdelinquencyflag,taxdelinquencyyear,censustractandblock
0,11016594,0.0276,2016-01-01,1.0,,,2.0,3.0,,4.0,...,,,122754.0,360170.0,2015.0,237416.0,6735.88,,,60371070000000.0
1,14366692,-0.1684,2016-01-01,,,,3.5,4.0,,,...,,,346458.0,585529.0,2015.0,239071.0,10153.02,,,
2,12098116,-0.004,2016-01-01,1.0,,,3.0,2.0,,4.0,...,,,61994.0,119906.0,2015.0,57912.0,11484.48,,,60374640000000.0
3,12643413,0.0218,2016-01-02,1.0,,,2.0,2.0,,4.0,...,,,171518.0,244880.0,2015.0,73362.0,3048.74,,,60372960000000.0
4,14432541,-0.005,2016-01-02,,,,2.5,4.0,,,...,2.0,,169574.0,434551.0,2015.0,264977.0,5488.96,,,60590420000000.0
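The save-then-reload pattern above can be wrapped in a small helper that builds the table only when the cached file is missing. This is a minimal sketch under stated assumptions: `load_or_build` and `build_toy` are hypothetical names, the toy frame reuses the first two rows of `train` shown earlier, and parsing `transactiondate` as a date is an addition that the notebook itself does not perform.

```python
import os
import tempfile
import pandas as pd

def load_or_build(path, build_fn):
    """Return the cached CSV at `path` if it exists; otherwise build, save, and return it."""
    if os.path.exists(path):
        # parse_dates restores the datetime type, which plain CSV text does not keep.
        return pd.read_csv(path, parse_dates=["transactiondate"])
    df = build_fn()
    df.to_csv(path, index=False)
    return df

def build_toy():
    # Stand-in for the expensive load-and-merge step.
    return pd.DataFrame({
        "parcelid": [11016594, 14366692],
        "logerror": [0.0276, -0.1684],
        "transactiondate": pd.to_datetime(["2016-01-01", "2016-01-01"]),
    })

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "dataset.csv")
    first = load_or_build(cache, build_toy)    # builds the table and writes the file
    second = load_or_build(cache, build_toy)   # reads the cached file instead

print(second.dtypes)
```

In the notebook, `build_fn` would be a function that runs the load and merge from cells 2 through 5, and `path` would be './data/dataset.csv'.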
