## KaKR 2nd ML month 
## EDA & Ensemble, stacking based prediction

- EDA
- Data preprocessing
- Feature engineering
- prediction using lgb, xgb, gboost
- ensemble
- stacking

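- The private leaderboard score came out much higher (worse) than the public score, so the ranking dropped considerably.
- This kernel is published to share only the EDA, data inspection, feature engineering, ensemble, and stacking parts.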


#### references
- https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard?fbclid=IwAR0d0GboXiaZP4VeQTpavV4hCFJ8tVlv5TFr-yQNPWZZtSyAHshLpqt5vXU
- https://www.kaggle.com/subinium/subinium-tutorial-house-prices-advanced
- https://www.kaggle.com/chocozzz/house-price-prediction-eda-updated-2019-03-12
- https://www.kaggle.com/kcs93023/2019-ml-month-2nd-baseline

## File descriptions
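#### train.csv - Training data used to build the prediction model. It contains the house information and the target variable, price.
#### test.csv - Test data containing the house information without the price variable, which is predicted with the model built on the training set.
#### sample_submission.csv - An example submission.csv file that can be used when submitting.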

## Data fields
#### ID : identifier of each house
#### date : date the house was purchased
#### price : price of the house (target variable)
#### bedrooms : number of bedrooms
#### bathrooms : number of bathrooms
#### sqft_living : living space area in square feet
#### sqft_lot : lot area in square feet
#### floors : number of floors
#### waterfront : whether a river flows in front of the house (a.k.a. river view)
#### view : how good the house looks
#### condition : overall condition of the house
#### grade : grade of the house under the King County grading system
#### sqft_above : area in square feet excluding the basement
#### sqft_basement : basement area in square feet
#### yr_built : year the house was built
#### yr_renovated : year the house was renovated
#### zipcode : zip code
#### lat : latitude
#### long : longitude
#### sqft_living15 : living space area in square feet as of 2015 (may differ if the house was renovated)
#### sqft_lot15 : lot area in square feet as of 2015 (may differ if the house was renovated)

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import matplotlib.pyplot as plt #Visualization
import seaborn as sns #Visualization
from scipy.stats import norm #Analysis
from sklearn.preprocessing import StandardScaler #Analysis
from scipy import stats #Analysis
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
import gc

import missingno as msno

print(os.listdir("../input"))


# EDA

In [1]:
df_train = pd.read_csv('../input/train.csv')
df_test = pd.read_csv('../input/test.csv')

print("train data: ", df_train.shape)
print("test data: ", df_test.shape)

In [1]:
df_train.head()

In [1]:
# Check for duplicates
idsUnique = len(set(df_train.id))
idsTotal = df_train.shape[0]
idsDupli = idsTotal - idsUnique
print("There are " + str(idsDupli) + " duplicate IDs for " + str(idsTotal) + " total entries")

### set: https://www.codingfactory.net/10043

In [1]:
msno.matrix(df_train)

In [1]:
# Missing data check: percentage of missing values per column
train_na_ratio = (df_train.isnull().sum() / len(df_train)) * 100
for col, ratio in train_na_ratio.items():
    print("Missing data in " + str(col) + ": " + str(ratio) + "%")

In [1]:
df_test.head()

In [1]:
# Check for duplicates
idsUnique = len(set(df_test.id))
idsTotal = df_test.shape[0]
idsDupli = idsTotal - idsUnique
print("There are " + str(idsDupli) + " duplicate IDs for " + str(idsTotal) + " total entries")

In [1]:
msno.matrix(df_test)

In [1]:
# Missing data check: percentage of missing values per column
test_na_ratio = (df_test.isnull().sum() / len(df_test)) * 100
for col, ratio in test_na_ratio.items():
    print("Missing data in " + str(col) + ": " + str(ratio) + "%")

- No missing data in train & test dataset

## Cleaning up the date variable

In [1]:
df_train['date'] = df_train['date'].apply(lambda x : str(x[:8])).astype(str)
df_test['date'] = df_test['date'].apply(lambda x : str(x[:8])).astype(str)

In [1]:
df_train.date

In [1]:
np.shape(df_train.columns)

In [1]:
np.shape(df_test.columns)

### Checking the distribution of each variable

In [1]:
fig, ax = plt.subplots(11, 2, figsize=(20, 60))

count = 0
columns = df_train.columns
for row in range(11):
    for col in range(2):
        sns.kdeplot(df_train[columns[count]], ax=ax[row][col])
        ax[row][col].set_title(columns[count], fontsize=15)
        count+=1
        if count == 21 :
            break

# Target value y: Price
- Sale price as of the time this dataset was collected
- https://www.kaggle.com/c/2019-2nd-ml-month-with-kakr/discussion/83957

In [1]:
#descriptive statistics summary
df_train.price.describe()

In [1]:
#histogram
plt.figure(figsize=(8, 6))
sns.distplot(df_train['price'])

In [1]:
#skewness and kurtosis
print("Skewness: %f" % df_train['price'].skew())
print("Kurtosis: %f" % df_train['price'].kurt())

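- Skewness: a measure of the asymmetry of the probability distribution of a real-valued random variable. Positive: the density has a long right tail, and most of the data, including the median, lies on the left. Negative: the density has a long left tail, and most of the data, including the median, lies on the right. The price here is the former case. A symmetric distribution, where the mean and median coincide, has skewness 0.
- https://ko.wikipedia.org/wiki/%EB%B9%84%EB%8C%80%EC%B9%AD%EB%8F%84

- Kurtosis: a measure of how peaked a probability distribution is, used to gauge how strongly the observations concentrate around the center. A kurtosis K close to 3 means the spread is close to that of a normal distribution; K < 3 means a flatter distribution than the normal; K > 3 means a more peaked distribution than the normal. Note that the pandas `kurt()` call used above returns excess kurtosis (K - 3), so its reference value for a normal distribution is 0.
- https://ko.wikipedia.org/wiki/%EC%B2%A8%EB%8F%84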
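
As a rough illustration of the two statistics described above, the following sketch compares skewness and kurtosis before and after `np.log1p`. It uses a synthetic right-skewed sample, which is an assumption made purely for illustration and not the competition data; the name `toy_price` is hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "prices" (illustrative only, not the competition data)
rng = np.random.default_rng(0)
toy_price = pd.Series(rng.lognormal(mean=13.0, sigma=0.5, size=10_000))

# Statistics on the raw scale
raw_skew, raw_kurt = toy_price.skew(), toy_price.kurt()

# Statistics after the log1p transform
log_price = np.log1p(toy_price)
log_skew, log_kurt = log_price.skew(), log_price.kurt()

# pandas kurt() reports excess kurtosis, so 0 is the normal-distribution reference
print("raw  skew=%.3f  excess kurtosis=%.3f" % (raw_skew, raw_kurt))
print("log  skew=%.3f  excess kurtosis=%.3f" % (log_skew, log_kurt))
```

On data of this shape, both values should move toward 0 after the transform, which is the behavior the notebook relies on when it log-transforms the price.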

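## Price, the target variable
- Skewness is very large
- It does not follow a normal distribution, and its variance is very large
- To use regression models, transform it so that it is approximately normally distributed
- Normality check: use a Q-Q plot (Quantile-Quantile plot), a simple visual tool that compares the distribution of the sample data with a normal distribution to check whether the sample follows a normal distribution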
- https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.probplot.html
- https://datascienceschool.net/view-notebook/76acc92d28354e86940001f9fe85c50f/

In [1]:
fig = plt.figure(figsize = (15,10))

fig.add_subplot(1,2,1)
res = stats.probplot(df_train['price'], plot=plt)

fig.add_subplot(1,2,2)
res = stats.probplot(np.log1p(df_train['price']), plot=plt)

- Taking the log makes the price data resemble a normal distribution

In [1]:
df_train['price'] = np.log1p(df_train['price'])
#histogram
plt.figure(figsize=(8, 6))
sns.distplot(df_train['price'])

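# Variable correlation analysis
- Pearson correlation: used for continuous variables; measures linear association
- Spearman rank correlation: used when ordinal (ranked categorical) variables are also included; measures monotonic association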

- https://m.blog.naver.com/PostView.nhn?blogId=istech7&logNo=50153047118&proxyReferer=https%3A%2F%2Fwww.google.com%2F
- http://blog.daum.net/_blog/BlogTypeView.do?blogid=0JHeT&articleno=5655101&_bloghome_menu=recenttext
- https://support.minitab.com/ko-kr/minitab/18/help-and-how-to/statistics/basic-statistics/supporting-topics/correlation-and-covariance/a-comparison-of-the-pearson-and-spearman-correlation-methods/ 
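
To make the difference between the two coefficients concrete, here is a minimal sketch on synthetic data (an assumption for illustration; `toy`, `x`, and `y` are hypothetical names). The relationship is strictly increasing but far from linear, so the rank-based coefficient stays at its maximum while the linear one does not.

```python
import numpy as np
import pandas as pd

# Synthetic data: y increases with x, but exponentially rather than linearly
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 5.0, size=1_000)
toy = pd.DataFrame({"x": x, "y": np.exp(2.0 * x)})

# Pearson measures linear association, Spearman measures monotonic association
pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print("Pearson : %.3f" % pearson)
print("Spearman: %.3f" % spearman)
```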

In [1]:
corrmat = df_train.corr()
colormap = plt.cm.RdBu
plt.figure(figsize=(16,14))
plt.title('Pearson Correlation of Features', y=1.05, size=15)

sns.heatmap(corrmat, fmt='.2f',linewidths=0.1, vmax=0.9, square=True, cmap=colormap, linecolor='white', annot=True, annot_kws={"size": 10})

In [1]:
print(corrmat.price)

In [1]:
# Find most important features relative to target

corrmat.sort_values(["price"], ascending = False, inplace = True)
print(corrmat.price)

In [1]:
cor_abs = abs(df_train.corr(method='spearman'))
cor_cols = cor_abs.nlargest(n=10, columns='price').index # top 10 columns by correlation with price (descending)
# spearman coefficient matrix
cor = np.array(stats.spearmanr(df_train[cor_cols].values))[0] # 10 x 10
print(cor_cols.values)
plt.figure(figsize=(10,10))
colormap = plt.cm.RdBu
plt.title('Spearman Correlation of Features', y=1.05, size=15)
sns.heatmap(cor, fmt='.2f',linewidths=0.1, vmax=0.9, square=True, cmap=colormap, linecolor='white', annot=True, annot_kws={"size": 8}, xticklabels=cor_cols.values, yticklabels=cor_cols.values)

In [1]:
print(cor_abs.price)

In [1]:
# Find most important features relative to target

cor_abs.sort_values(["price"], ascending = False, inplace = True)
print(cor_abs.price)

In [1]:
# variables highly correlated with price
cor_cols

> ### discussion check
### sqft_living, sqft_lot
### https://www.kaggle.com/c/2019-2nd-ml-month-with-kakr/discussion/87029

In [1]:
df_train.head(20)

In [1]:
df_train.loc[df_train['sqft_above']/df_train['floors']>df_train['sqft_lot']]

In [1]:
df_train.loc[df_train['id']==2464]['sqft_above']/df_train.loc[df_train['id']==2464]['floors']

In [1]:
df_train.loc[df_train['id']==10987]['sqft_above']/df_train.loc[df_train['id']==10987]['floors']

In [1]:
df_train.loc[df_train['id']==12104]['sqft_above']/df_train.loc[df_train['id']==12104]['floors']

In [1]:
data = pd.concat([df_train['price'], df_train['sqft_lot']], axis=1)
plt.figure(figsize=(8, 6))
sns.regplot(x='sqft_lot', y="price", data=data, marker="+", color="g")

In [1]:
data = pd.concat([df_train['sqft_living'], df_train['sqft_lot']], axis=1)
plt.figure(figsize=(8, 6))
sns.regplot(x='sqft_lot', y="sqft_living", data=data, marker="+", color="g")

In [1]:
df_train.loc[(df_train['sqft_lot'] > 1500000)]

In [1]:
data = pd.concat([df_test['sqft_living'], df_test['sqft_lot']], axis=1)
plt.figure(figsize=(8, 6))
sns.regplot(x='sqft_lot', y="sqft_living", data=data, marker="+", color="g")

In [1]:
df_test.loc[(df_test['sqft_lot'] > 1000000)]

In [1]:
data = pd.concat([df_train['price'], df_train['sqft_lot15']], axis=1)
plt.figure(figsize=(8, 6))
sns.regplot(x='sqft_lot15', y="price", data=data, marker="+", color="g")

## Variables highly correlated with price vs price


## grade vs price

In [1]:
data = pd.concat([df_train['price'], df_train['grade']], axis=1)
plt.figure(figsize=(8, 6))
sns.boxplot(x='grade', y="price", data=data)

## sqft_living vs price

In [1]:
sns.set(color_codes=True)
data = pd.concat([df_train['price'], df_train['sqft_living']], axis=1)
plt.figure(figsize=(8, 6))
sns.regplot(x='sqft_living', y="price", data=data, marker="+", color="g")

# seaborn, regplot: https://seaborn.pydata.org/generated/seaborn.regplot.html

In [1]:
# jointplot creates its own figure, so a separate plt.figure() call is not needed
g = sns.jointplot(x='sqft_living', y='price', data=df_train, alpha=0.5)
g.set_axis_labels('sqft_living', 'price')
plt.show()

## sqft_living15 vs price

In [1]:
data = pd.concat([df_train['price'], df_train['sqft_living15']], axis=1)
plt.figure(figsize=(8, 6))
sns.regplot(x='sqft_living15', y="price", data=data, marker="+", color="g")

> ## sqft_living vs sqft_living15  /  sqft_lot vs sqft_lot15

In [1]:
plt.plot(df_train['sqft_living15']-df_train['sqft_living'])

In [1]:
plt.plot(df_train['sqft_lot15']-df_train['sqft_lot'])

In [1]:
df_train.loc[(df_train['sqft_lot15']-df_train['sqft_lot'] < -1200000)]

In [1]:
plt.figure(figsize=(20, 14))
sns.scatterplot(x='long', y='lat', hue='price', data=df_train)

In [1]:
plt.figure(figsize=(20, 14))
sns.scatterplot(x='long', y='lat', hue='sqft_lot', data=df_train)

In [1]:
plt.figure(figsize=(20, 14))
sns.scatterplot(x='long', y='lat', hue='sqft_lot15', data=df_train)

In [1]:
plt.figure(figsize=(20, 14))
sns.scatterplot(x='long', y='lat', hue='sqft_living', data=df_train)

In [1]:
plt.figure(figsize=(20, 14))
sns.scatterplot(x='long', y='lat', hue='sqft_living15', data=df_train)

## sqft_above vs price

In [1]:
data = pd.concat([df_train['price'], df_train['sqft_above']], axis=1)
plt.figure(figsize=(8, 6))
sns.regplot(x='sqft_above', y="price", data=data, marker="+", color="g")

## bathrooms vs price

In [1]:
data = pd.concat([df_train['price'], df_train['bathrooms']], axis=1)
plt.figure(figsize=(18, 6))
sns.boxplot(x='bathrooms', y="price", data=data)

### bathrooms: bathrooms per house
### https://www.kaggle.com/c/2019-2nd-ml-month-with-kakr/discussion/86557

## lat vs price

In [1]:
data = pd.concat([df_train['price'], df_train['lat']], axis=1)
plt.figure(figsize=(8, 6))
sns.regplot(x='lat', y="price", data=data, marker="+", color="g")

In [1]:
data = pd.concat([df_train['price'], df_train['long']], axis=1)
plt.figure(figsize=(8, 6))
sns.regplot(x='long', y="price", data=data, marker="+", color="g")

### lat (latitude) has a linear correlation with price, and the correlation coefficient is relatively high

---
According to the [Discussion](https://www.kaggle.com/c/2019-2nd-ml-month-with-kakr/discussion/83549), this area is Seattle, and Seattle reportedly has more desirable homes the farther north you go.

For more on this, see [김태진's kernel](https://www.kaggle.com/fulrose/map-visualization-with-folium-ing).

## bedrooms vs price

In [1]:
data = pd.concat([df_train['price'], df_train['bedrooms']], axis=1)
plt.figure(figsize=(18, 6))
sns.boxplot(x='bedrooms', y="price", data=data)

In [1]:
df_train.bedrooms.describe()

## floors vs price

In [1]:
data = pd.concat([df_train['price'], df_train['floors']], axis=1)
plt.figure(figsize=(18, 6))
sns.boxplot(x='floors', y="price", data=data)

In [1]:
df_train.floors.describe()

## view vs price

In [1]:
df_train.view.describe()

In [1]:
plt.plot(df_train.view, '+')

In [1]:
data = pd.concat([df_train['price'], df_train['view']], axis=1)
plt.figure(figsize=(18, 6))
sns.boxplot(x='view', y="price", data=data)

## conditions

In [1]:
condition = df_train['condition'].value_counts()

print("Condition counting: ")
print(condition)

fig, ax = plt.subplots(ncols=2, figsize=(14,5))
sns.countplot(x='condition', data=df_train, ax=ax[0])
sns.boxplot(x='condition', y= 'price',
            data=df_train, ax=ax[1])
plt.show()

In [1]:
plt.figure(figsize = (12,8))
g = sns.FacetGrid(data=df_train, hue='condition', height=5, aspect=2)
g.map(plt.scatter, "sqft_living", "price").add_legend()
plt.show()

>#### seaborn FacetGrid:  https://seaborn.pydata.org/generated/seaborn.FacetGrid.html

In [1]:
grade_counts = df_train['grade'].value_counts()

print("Grade counting: ")
print(grade_counts)

fig, ax = plt.subplots(ncols=2, figsize=(14,5))
sns.countplot(x='grade', data=df_train, ax=ax[0])
sns.boxplot(x='grade', y= 'price',
            data=df_train, ax=ax[1])
plt.show()

In [1]:
plt.figure(figsize = (12,8))
g = sns.FacetGrid(data=df_train, hue='grade', height=5, aspect=2)
g.map(plt.scatter, "sqft_living", "price").add_legend()
plt.show()

## Strongly related variable pairs

In [1]:
# Clear view of the relationship between bathrooms and bedrooms

bath = ['bathrooms', 'bedrooms']
cm = sns.light_palette("green", as_cmap=True)
pd.crosstab(df_train[bath[0]], df_train[bath[1]], margins=True).style.background_gradient(cmap = cm)

In [1]:
bath_cond = ['bathrooms', 'condition']
cm = sns.light_palette("green", as_cmap=True)
pd.crosstab(df_train[bath_cond[0]], df_train[bath_cond[1]], margins=True).style.background_gradient(cmap = cm)

In [1]:
bed_cond = ['bedrooms', 'condition']
cm = sns.light_palette("green", as_cmap=True)
pd.crosstab(df_train[bed_cond[0]], df_train[bed_cond[1]], margins=True).style.background_gradient(cmap = cm)

In [1]:
cond_water = ['condition', 'waterfront']
cm = sns.light_palette("green", as_cmap=True)
pd.crosstab(df_train[cond_water[0]], df_train[cond_water[1]], margins=True).style.background_gradient(cmap = cm)

In [1]:
grade_cond = ['grade', 'condition']
cm = sns.light_palette("green", as_cmap=True)
pd.crosstab(df_train[grade_cond[0]], df_train[grade_cond[1]], margins=True).style.background_gradient(cmap = cm)

In [1]:
grade_bed = ['grade', 'bedrooms']
cm = sns.light_palette("green", as_cmap=True)
pd.crosstab(df_train[grade_bed[0]], df_train[grade_bed[1]], margins=True).style.background_gradient(cmap = cm)

In [1]:
grade_bath = ['grade', 'bathrooms']
cm = sns.light_palette("green", as_cmap=True)
pd.crosstab(df_train[grade_bath[0]], df_train[grade_bath[1]], margins=True).style.background_gradient(cmap = cm)

# Data Preprocessing

## Outlier removal
- sqft_living, grade, bedrooms, bathrooms

### sqft_living

In [1]:
#sqft_living vs price
sns.set(color_codes=True)
data = pd.concat([df_train['price'], df_train['sqft_living']], axis=1)
plt.figure(figsize=(8, 6))
sns.regplot(x='sqft_living', y="price", data=data, marker="+", color="g")

In [1]:
df_train.loc[df_train['sqft_living'] > 13000]

In [1]:
#df_train = df_train.loc[df_train['id']!=8912]

In [1]:
#df_train[df_train['id']==8912]

In [1]:
#sqft_living vs price
sns.set(color_codes=True)
data = pd.concat([df_train['price'], df_train['sqft_living']], axis=1)
plt.figure(figsize=(8, 6))
sns.regplot(x='sqft_living', y="price", data=data, marker="+", color="g")

In [1]:
df_train.loc[df_train['sqft_living'] > 11000]

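- Its grade is 13. The price looks somewhat low for its size, but not enough to be a problem.
- It is the largest of the grade-13 houses --> it has the highest price (probably the market ceiling)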

In [1]:
df_train.loc[(df_train['grade'] == 13)]

- Not removed

In [1]:
#df_train = df_train.loc[df_train['id']!=5108]

### grade

In [1]:
data = pd.concat([df_train['price'], df_train['grade']], axis=1)
plt.figure(figsize=(8, 6))
sns.boxplot(x='grade', y="price", data=data)

In [1]:
df_train.loc[(df_train['grade'] == 3)]

In [1]:
df_train.loc[(df_train['price']>12) & (df_train['grade'] == 3)]

In [1]:
df_train.loc[(df_train['price']>14.7) & (df_train['grade'] == 8)]

In [1]:
df_train.loc[(df_train['price']>15.5) & (df_train['grade'] == 11)]

In [1]:
df_train.loc[(df_train['price']>14.5) & (df_train['grade'] == 7)]

In [1]:
#df_train = df_train.loc[df_train['id']!=2302]
#df_train = df_train.loc[df_train['id']!=4123]
#df_train = df_train.loc[df_train['id']!=7173]
#df_train = df_train.loc[df_train['id']!=2775]
#df_train = df_train.loc[df_train['id']!=12346]

In [1]:
data = pd.concat([df_train['price'], df_train['grade']], axis=1)
plt.figure(figsize=(8, 6))
sns.boxplot(x='grade', y="price", data=data)

### bedrooms

In [1]:
data = pd.concat([df_train['price'], df_train['bedrooms']], axis=1)
plt.figure(figsize=(18, 6))
sns.boxplot(x='bedrooms', y="price", data=data)

In [1]:
df_train.loc[df_train['bedrooms']>=10]

### Checking the test data

In [1]:
df_test.loc[df_test['bedrooms']>=10]

In [1]:
df_test.loc[df_test['id']==19745]

- In the test data, id=19745 has a sqft_living of 1620 but 33 bedrooms.
- The bedroom count is far too high for the size of the house, and 33 bedrooms in a single house is implausible in any case

--- 
### Checking the number of bedrooms in other houses of similar size

In [1]:
df_train.loc[(df_train['sqft_living']<=1620) & (df_train['sqft_living']>=1500)]

In [1]:
data1 = df_train.loc[(df_train['sqft_living']<=1620) & (df_train['sqft_living']>=1500)]
data2 = pd.concat([data1['sqft_living'], data1['bedrooms']], axis=1)
plt.figure(figsize=(18, 6))
sns.boxplot(x='bedrooms', y="sqft_living", data=data2)

In [1]:
fig, ax = plt.subplots(1, 1, figsize=(9, 5))
sns.kdeplot(data1[data1['bedrooms']==2]['sqft_living'], ax=ax)
sns.kdeplot(data1[data1['bedrooms']==3]['sqft_living'], ax=ax)
sns.kdeplot(data1[data1['bedrooms']==4]['sqft_living'], ax=ax)
sns.kdeplot(data1[data1['bedrooms']==5]['sqft_living'], ax=ax)
plt.legend(['bedrooms == 2', 'bedrooms == 3', 'bedrooms == 4', 'bedrooms == 5'])
plt.show()

- A sqft_living of 1620, the size of the house listed with 33 bedrooms, is typical of houses with 2, 3, 4, or 5 bedrooms
- 33 bedrooms does not make sense --> change it to 3 (assuming the 3 was typed twice)

In [1]:
# relationship between the number of bedrooms and sqft_living in the train data
data = pd.concat([df_train['sqft_living'], df_train['bedrooms']], axis=1)
plt.figure(figsize=(18, 6))
sns.boxplot(x='bedrooms', y="sqft_living", data=data)

In [1]:
# relationship between the number of bedrooms and sqft_living in the test data
data = pd.concat([df_test['sqft_living'], df_test['bedrooms']], axis=1)
plt.figure(figsize=(18, 6))
sns.boxplot(x='bedrooms', y="sqft_living", data=data)

- The plots above also confirm that a house with 33 bedrooms and a sqft_living of only 1620 is hard to accept

In [1]:
df_test.loc[df_test['id']==19745].bedrooms

In [1]:
df_test.loc[df_test['id']==19745, 'bedrooms'] = 3
#df_test.loc[df_test['id']==19745].bedrooms = 3

In [1]:
df_test.loc[df_test['id']==19745].bedrooms

### bathrooms

In [1]:
# train data: price vs bathrooms
data = pd.concat([df_train['price'], df_train['bathrooms']], axis=1)
plt.figure(figsize=(18, 6))
sns.boxplot(x='bathrooms', y="price", data=data)

In [1]:
# train data: sqft_living vs bathrooms
data = pd.concat([df_train['sqft_living'], df_train['bathrooms']], axis=1)
plt.figure(figsize=(18, 6))
sns.boxplot(x='bathrooms', y="sqft_living", data=data)

In [1]:
# test data: sqft_living vs bathrooms
data = pd.concat([df_test['sqft_living'], df_test['bathrooms']], axis=1)
plt.figure(figsize=(18, 6))
sns.boxplot(x='bathrooms', y="sqft_living", data=data)

- In the train data, rows with bathrooms = 6.75 and 7.5 show an odd, sharp drop in both price and sqft_living.
- The test data shows no such tendency
- Check and remove the corresponding rows in the train data

In [1]:
df_train.loc[df_train['bathrooms']>6]

In [1]:
df_train.loc[(df_train['bathrooms']>=6.75) & (df_train['bathrooms']<=7.5)]

In [1]:
#df_train = df_train.loc[df_train['id']!=2859]
#df_train = df_train.loc[df_train['id']!=5990]

## data normalization
- https://www.kaggle.com/kcs93023/2019-ml-month-2nd-baseline

In [1]:
#skew_columns = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement', 'sqft_living15', 'sqft_lot15']
skew_columns = ['sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement', 'sqft_living15', 'sqft_lot15']
#skew_columns = ['sqft_living', 'sqft_above', 'sqft_basement', 'sqft_living15', 'sqft_lot15']


fig, ax = plt.subplots(3, 2, figsize=(10, 15))

count = 0
for row in range(3):
    for col in range(2):
        if count == 6:
            break
        sns.kdeplot(df_train[skew_columns[count]], ax=ax[row][col])
        ax[row][col].set_title(skew_columns[count], fontsize=15)
        count+=1

In [1]:
#from scipy.special import boxcox1p
#lam = 0.15

#for c in skew_columns:
#    df_train[c] = boxcox1p(df_train[c], lam)
#    df_test[c] = boxcox1p(df_test[c], lam)
    
for c in skew_columns:
    df_train[c] = np.log1p(df_train[c])
    df_test[c] = np.log1p(df_test[c])

In [1]:
#for c in skew_columns:
#    df_train[c] = np.log1p(df_train[c].values)
#    df_test[c] = np.log1p(df_test[c].values)

In [1]:
fig, ax = plt.subplots(3, 2, figsize=(10, 15))

count = 0
for row in range(3):
    for col in range(2):
        if count == 6:
            break
        sns.kdeplot(df_train[skew_columns[count]], ax=ax[row][col])
        ax[row][col].set_title(skew_columns[count], fontsize=15)
        count+=1

- Normalizing the data by log-scaling brings the highly skewed variables closer to a normal distribution

### Fixing yr_renovated
- yr_renovated is the year the house was renovated
- A value of 0 means the house was never renovated --> replace it with the year built

In [1]:
for df in [df_train,df_test]:
    df['yr_renovated'] = df['yr_renovated'].apply(lambda x: np.nan if x == 0 else x)
    df['yr_renovated'] = df['yr_renovated'].fillna(df['yr_built'])

In [1]:
df_train.head()

In [1]:
for df in [df_train,df_test]:
    # date is 'YYYYMMDD'; map MMDD to a value that keeps increasing across the
    # 2014 -> 2015 boundary (2015: MMDD + 800, otherwise: MMDD - 400)
    df['date(new)'] = df['date'].apply(lambda x: int(x[4:8])+800 if x[:4] == '2015' else int(x[4:8])-400)
    del df['date']
    df['total_rooms'] = df['bedrooms'] + df['bathrooms']
    df['grade_condition'] = df['grade'] * df['condition']
    # note: the sqft_* columns were log1p-transformed above, so these combine log values
    df['sqft_ratio'] = df['sqft_living'] / df['sqft_lot']
    df['sqft_total_size'] = df['sqft_above'] + df['sqft_basement']
    #df['sqft_ratio15'] = df['sqft_living15'] / df['sqft_lot15']
    #df['sqft_ratio_1'] = df['sqft_living'] / df['sqft_total_size']
    df['is_renovated'] = df['yr_renovated'] - df['yr_built']
    df['is_renovated'] = df['is_renovated'].apply(lambda x: 0 if x == 0 else 1) # whether the house was renovated (0/1)
    df['yr_renovated'] = df['yr_renovated'].astype('int')

In [1]:
df_train.head()

## Zipcode

In [1]:
len(set(df_train['zipcode'].values))

In [1]:
data = pd.concat([df_train['price'], df_train['zipcode']], axis=1)
plt.figure(figsize=(18, 6))
sns.boxplot(x='zipcode', y="price", data=data)

> ### There are 70 zip codes in total; since zipcode is a location-related factor, it can be expected to affect price

In [1]:
## Based on 현우's kernel (https://www.kaggle.com/chocozzz/house-price-prediction-eda-updated-2019-03-12)
df_train['per_price'] = df_train['price']/df_train['sqft_total_size']


In [1]:
# Extract the mean & var of per_price for the 70 zipcode groups
# (a list, rather than a set, keeps the order of the resulting columns fixed)
zipcode_price = df_train.groupby(['zipcode'])['per_price'].agg(['mean','var']).reset_index()
## groupby, operations, agg reference (https://datascienceschool.net/view-notebook/76dcd63bba2c4959af15bec41b197e7c/)
## reset_index reference (https://datascienceschool.net/view-notebook/a49bde24674a46699639c1fa9bb7e213/)
zipcode_price

In [1]:
print(len(df_train.columns))
print(len(df_test.columns))

In [1]:
df_train.columns

In [1]:
df_test.columns

In [1]:
# the mean and var columns are added to both train and test
df_train = pd.merge(df_train,zipcode_price,how='left',on='zipcode')
df_test = pd.merge(df_test,zipcode_price,how='left',on='zipcode')

In [1]:
print(len(df_train.columns))
print(len(df_test.columns))

In [1]:
# mean/var are statistics of price per unit area, so multiply them by total_size
for df in [df_train,df_test]:
    df['zipcode_mean'] = df['mean'] * df['sqft_total_size']
    df['zipcode_var'] = df['var'] * df['sqft_total_size']
    #del df['mean']; del df['var']
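
The zipcode statistics above are computed on the training data and attached to both datasets with a left merge. The following is a minimal sketch of that pattern on toy data; the frames `toy_train` and `toy_test` and their values are hypothetical. It also shows one thing worth checking under this approach: a zipcode that never appears in training gets NaN statistics after the merge.

```python
import pandas as pd

# Toy data (hypothetical values) standing in for df_train / df_test
toy_train = pd.DataFrame({
    "zipcode":   [98001, 98001, 98002, 98002, 98002],
    "per_price": [0.80, 1.00, 1.50, 1.70, 1.60],
})
toy_test = pd.DataFrame({"zipcode": [98001, 98002, 98003]})

# Per-zipcode mean and variance, computed on the training rows only
zip_stats = (toy_train.groupby("zipcode")["per_price"]
             .agg(["mean", "var"])
             .reset_index())

# A left merge keeps every test row; zipcodes unseen in training get NaN
toy_test_merged = toy_test.merge(zip_stats, how="left", on="zipcode")
n_missing = int(toy_test_merged["mean"].isna().sum())

print(toy_test_merged)
print("rows without zipcode statistics:", n_missing)
```

If such rows exist in the real test set, they would need a fallback value (for example the global mean) before being passed to models that cannot handle NaN.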

In [1]:
print(len(df_train.columns))
print(len(df_test.columns))

In [1]:
df_train.columns

In [1]:
df_train.head()

In [1]:
#skew_columns = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement', 'sqft_living15', 'sqft_lot15']
skew_columns = ['mean', 'var', 'zipcode_mean', 'zipcode_var']
#skew_columns = ['sqft_living', 'sqft_above', 'sqft_basement', 'sqft_living15', 'sqft_lot15']


fig, ax = plt.subplots(2, 2, figsize=(10, 15))

count = 0
for row in range(2):
    for col in range(2):
        if count == 4:
            break
        sns.kdeplot(df_train[skew_columns[count]], ax=ax[row][col])
        ax[row][col].set_title(skew_columns[count], fontsize=15)
        count+=1

In [1]:
from scipy.special import boxcox1p
lam = 0.15

for c in skew_columns:
    df_train[c] = boxcox1p(df_train[c], lam)
    df_test[c] = boxcox1p(df_test[c], lam)

In [1]:
#skew_columns = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement', 'sqft_living15', 'sqft_lot15']
skew_columns = ['mean', 'var', 'zipcode_mean', 'zipcode_var']
#skew_columns = ['sqft_living', 'sqft_above', 'sqft_basement', 'sqft_living15', 'sqft_lot15']


fig, ax = plt.subplots(2, 2, figsize=(10, 15))

count = 0
for row in range(2):
    for col in range(2):
        if count == 4:
            break
        sns.kdeplot(df_train[skew_columns[count]], ax=ax[row][col])
        ax[row][col].set_title(skew_columns[count], fontsize=15)
        count+=1

## train dataset -> X & y split

In [1]:
# y: price
train_price = df_train.price.values

In [1]:
train_price

In [1]:
# df_train without price
df_train = df_train.drop('price', axis=1)

In [1]:
# df_train without per_price
df_train = df_train.drop('per_price', axis=1)

In [1]:
print(len(df_train.columns))
print(len(df_test.columns))

In [1]:
df_train.head()

## Removing unused variables

In [1]:
#for df in [df_train,df_test]:
#    df = df.drop(["id", "zipcode", "long"], axis=1)

In [1]:
test_id = df_test.id

In [1]:
df_train = df_train.drop(["id", "sqft_lot15"], axis=1)
df_test = df_test.drop(["id", "sqft_lot15"], axis=1)

# Modeling

In [1]:
from sklearn.linear_model import ElasticNet, Lasso,  BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor,  GradientBoostingRegressor, AdaBoostRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb

### validation scheme

In [1]:
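# Validation function
n_folds = 5

def rmsle_cv(model):
    # Pass the KFold splitter itself so that shuffle/random_state take effect;
    # KFold(...).get_n_splits() only returns the integer number of folds.
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    # price is already log1p-transformed, so this RMSE is an RMSLE on the original scale
    rmse = np.sqrt(-cross_val_score(model, df_train.values, train_price, scoring="neg_mean_squared_error", cv=kf))
    return rmse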


# cross_val_score
# https://datascienceschool.net/view-notebook/266d699d748847b3a3aa7b9805b846ae/
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html

# scorer, scoring metrics
# https://scikit-learn.org/stable/modules/model_evaluation.html

# RMSLE
# https://programmers.co.kr/learn/courses/21/lessons/943
# https://www.slideshare.net/KhorSoonHin/rmsle-cost-function
# https://dacon.io/user1/41382
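
The validation function is named `rmsle_cv` but scores with plain mean squared error. Because the target was transformed with `np.log1p` earlier in the notebook, the two coincide. A small sketch on synthetic values (an assumption for illustration; `y_true` and `y_pred` are hypothetical) checks this with scikit-learn's `mean_squared_log_error`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Synthetic prices and predictions (illustrative only)
rng = np.random.default_rng(0)
y_true = rng.uniform(1e5, 1e6, size=1_000)
y_pred = y_true * rng.uniform(0.8, 1.2, size=1_000)

# RMSLE computed directly on the original price scale
rmsle_direct = np.sqrt(mean_squared_log_error(y_true, y_pred))

# Plain RMSE computed on log1p-transformed values
rmse_on_log = np.sqrt(mean_squared_error(np.log1p(y_true), np.log1p(y_pred)))

print(rmsle_direct, rmse_on_log)
```

The two numbers should agree up to floating-point error, which is why an RMSE on the log1p scale can be read as an RMSLE on prices.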

## Base models
- LASSO regression
- ElasticNet regression
- Kernel Ridge regression
- Gradient Boosting regression
- XGBoost
- LightGBM
- Random Forest

In [1]:
lasso = make_pipeline(RobustScaler(), Lasso(alpha=0.0005, random_state=1))

ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005, l1_ratio=.9, random_state=3))

KRR = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)

GBoost = GradientBoostingRegressor(n_estimators=8000, learning_rate=0.05, max_depth=5,
                                   max_features='sqrt', min_samples_leaf=15, min_samples_split=10,
                                   loss='huber', random_state=4)

model_xgb = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468, 
                             learning_rate=0.05, max_depth=3, 
                             min_child_weight=1.7817, n_estimators=8000,
                             reg_alpha=0.4640, reg_lambda=0.8571,
                             subsample=0.5213, silent=1,
                             random_state =7, nthread = -1)

model_lgb = lgb.LGBMRegressor(objective='regression',num_leaves=31,
                              learning_rate=0.015, n_estimators=8000,
                              max_bin = 55, bagging_fraction = 0.8,
                              bagging_freq = 1, feature_fraction = 0.9,
                              feature_fraction_seed=9, bagging_seed=9,
                              min_child_samples = 20, reg_alpha= 0.1)

model_RF = RandomForestRegressor(max_depth = 8, 
                                n_estimators = 8000,
                                max_features = 'sqrt', 
                                n_jobs = -1)

In [1]:
#models = [{'model':lasso, 'name':'LASSO'}, {'model':ENet, 'name':'ENet'},
#          {'model':KRR, 'name':'KernelRidge'}, {'model':GBoost, 'name':'GradientBoosting'}, 
#          {'model':model_xgb, 'name':'XGBoost'}, {'model':model_lgb, 'name':'LightGBM'}]

models = [{'model':lasso, 'name':'LASSO'}, {'model':ENet, 'name':'ENet'},
          {'model':GBoost, 'name':'GradientBoosting'}, {'model':model_RF, 'name':'RandomForest'}, 
          {'model':model_xgb, 'name':'XGBoost'}, {'model':model_lgb, 'name':'LightGBM'}]

### Base model scores

In [1]:
for m in models:
    score = rmsle_cv(m['model'])
    print("Model {} CV score : {:.4f} ({:.4f})\n".format(m['name'], score.mean(), score.std()))

> # Stacking based models


In [1]:
class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds
   
    # base_models_ is a list of lists: for each base model, the clones fitted on each fold
    def fit(self, X, y):
        self.base_models_ = [list() for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)
        
        # Train cloned base models then create out-of-fold predictions
        # that are needed to train the cloned meta-model
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            for train_index, holdout_index in kfold.split(X, y):
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[train_index], y[train_index])
                y_pred = instance.predict(X[holdout_index])
                out_of_fold_predictions[holdout_index, i] = y_pred
                
        # Now train the cloned  meta-model using the out-of-fold predictions as new feature
        self.meta_model_.fit(out_of_fold_predictions, y)
        return self
   
    # For each base model, average the predictions of its fold clones, then feed the averages to the meta-model
    def predict(self, X):
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_ ])
        return self.meta_model_.predict(meta_features)
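
The core idea of the class above is that the meta-model is trained on out-of-fold predictions, so it never sees a base-model prediction for a row that model was fitted on. The sketch below shows the same idea in a compact form using scikit-learn's `cross_val_predict` on synthetic data. It is not the class above: the dataset, the two base models, and their parameters are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative only)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# Two small base models chosen only for the example
base_models = [Ridge(alpha=1.0), DecisionTreeRegressor(max_depth=4, random_state=0)]

# One column of out-of-fold predictions per base model: every row is predicted
# by a clone that did not see that row during fitting
oof = np.column_stack([cross_val_predict(m, X, y, cv=kfold) for m in base_models])

# The meta-model learns how to combine the base models' predictions
meta_model = Lasso(alpha=0.001).fit(oof, y)

print(oof.shape)
print(meta_model.coef_)
```

Unlike the class above, this sketch does not keep the per-fold clones for test-time prediction; it only illustrates how the meta-features are built.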

In [1]:
stacked_averaged_models = StackingAveragedModels(
    base_models=(ENet, GBoost, model_RF, model_xgb, model_lgb),
    meta_model=(lasso)
)

#score = rmsle_cv(stacked_averaged_models)
#print("Stacking Averaged models score: {:.4f} ({:.4f})".format(score.mean(), score.std()))

In [1]:
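# RMSLE evaluation function: y and y_pred are expected on the log1p scale,
# so the RMSE computed here corresponds to RMSLE on the original prices
def rmsle(y, y_pred):
    return np.sqrt(mean_squared_error(y, y_pred))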

In [1]:
stacked_averaged_models.fit(df_train.values, train_price)
stacked_train_pred = stacked_averaged_models.predict(df_train.values)
stacked_pred = np.expm1(stacked_averaged_models.predict(df_test.values))
print(rmsle(train_price, stacked_train_pred))

In [1]:
model_xgb.fit(df_train, train_price)
xgb_train_pred = model_xgb.predict(df_train)
xgb_pred = np.expm1(model_xgb.predict(df_test))
print(rmsle(train_price, xgb_train_pred))

In [1]:
GBoost.fit(df_train, train_price)
GBoost_train_pred  = GBoost.predict(df_train)
GBoost_pred = np.expm1(GBoost.predict(df_test))
print(rmsle(train_price, GBoost_train_pred))

In [1]:
model_lgb.fit(df_train, train_price)
lgb_train_pred = model_lgb.predict(df_train)
lgb_pred = np.expm1(model_lgb.predict(df_test))
print(rmsle(train_price, lgb_train_pred))

In [1]:
sub = pd.DataFrame()
sub['id'] = test_id
sub['price'] = stacked_pred
sub.to_csv('submission_stacking.csv',index=False)

In [1]:
sub = pd.DataFrame()
sub['id'] = test_id
sub['price'] = xgb_pred
sub.to_csv('submission_xgb.csv',index=False)

In [1]:
sub = pd.DataFrame()
sub['id'] = test_id
sub['price'] = GBoost_pred
sub.to_csv('submission_GBoost.csv',index=False)

In [1]:
sub = pd.DataFrame()
sub['id'] = test_id
sub['price'] = lgb_pred
sub.to_csv('submission_lgb.csv',index=False)

In [1]:
ensemble = stacked_pred*0.4 + xgb_pred*0.2 + GBoost_pred*0.2 + lgb_pred*0.2
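
The ensemble above is a weighted average whose weights (0.4, 0.2, 0.2, 0.2) sum to 1, so the blended price stays within the range of the individual predictions. A minimal sketch with toy numbers (hypothetical values, not actual model output) makes this explicit and adds a check on the weights:

```python
import numpy as np

# Toy predictions (hypothetical values) from four models for three houses
preds = {
    "stacked": np.array([500_000.0, 300_000.0, 750_000.0]),
    "xgb":     np.array([510_000.0, 295_000.0, 740_000.0]),
    "gboost":  np.array([495_000.0, 310_000.0, 760_000.0]),
    "lgb":     np.array([505_000.0, 305_000.0, 745_000.0]),
}
weights = {"stacked": 0.4, "xgb": 0.2, "gboost": 0.2, "lgb": 0.2}

# Guard against weights that do not sum to 1, which would rescale the prices
assert np.isclose(sum(weights.values()), 1.0)

# Weighted average of the per-model predictions
blend = sum(weights[name] * preds[name] for name in preds)
print(blend)
```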


In [1]:
sub = pd.DataFrame()
sub['id'] = test_id
sub['price'] = ensemble
sub.to_csv('submission_ensemble.csv',index=False)