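## Terms
- RMSE (root mean squared error): The square root of the mean squared error of the regression. The most widely used metric for evaluating regression models.
- RSE (residual standard error): The same as RMSE, but adjusted for degrees of freedom.
- R-squared: The proportion of variance explained by the model, from 0 to 1; that is, the share of the variation in the dependent variable that the fitted model accounts for (synonym: coefficient of determination).
- t-statistic: The coefficient of a predictor divided by the standard error of that coefficient. It gives a basis for comparing the importance of variables in the model.
- Weighted regression: Regression in which the records carry different weights.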
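
The definitions above can be checked numerically. The following is a minimal sketch on synthetic data; the data, the seed, and all variable names are assumptions made for illustration and are unrelated to the housing data used later in this notebook.

```python
import numpy as np

# Synthetic data (illustrative assumption): n records, p predictors
rng = np.random.default_rng(0)
n, p = 200, 2
X = rng.normal(size=(n, p))
y = 3.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=n)

# Ordinary least squares with an explicit intercept column
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta
sse = np.sum(resid ** 2)

rmse = np.sqrt(sse / n)                      # root mean squared error
rse = np.sqrt(sse / (n - p - 1))             # same idea, adjusted for degrees of freedom
r2 = 1 - sse / np.sum((y - y.mean()) ** 2)   # share of variance explained

# t-statistic: coefficient divided by the standard error of the coefficient
cov_beta = rse ** 2 * np.linalg.inv(A.T @ A)
t_stats = beta / np.sqrt(np.diag(cov_beta))

print(f'RMSE: {rmse:.4f}  RSE: {rse:.4f}  R2: {r2:.4f}')
print('t-statistics (intercept first):', np.round(t_stats, 2))
```

With many records and few predictors, RMSE and RSE are nearly identical, since they differ only in the denominator (n versus n - p - 1).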

In [32]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm

from dmba import stepwise_selection, AIC_score

In [2]:
house = pd.read_csv('../../data/house_sales.csv', sep='\t')

In [64]:
6549*0.7

4584.299999999999

In [3]:
house

Unnamed: 0,DocumentDate,SalePrice,PropertyID,PropertyType,ym,zhvi_px,zhvi_idx,AdjSalePrice,NbrLivingUnits,SqFtLot,...,Bathrooms,Bedrooms,BldgGrade,YrBuilt,YrRenovated,TrafficNoise,LandVal,ImpsVal,ZipCode,NewConstruction
1,2014-09-16,280000,1000102,Multiplex,2014-09-01,405100,0.930836,300805.0,2,9373,...,3.00,6,7,1991,0,0,70000,229000,98002,False
2,2006-06-16,1000000,1200013,Single Family,2006-06-01,404400,0.929228,1076162.0,1,20156,...,3.75,4,10,2005,0,0,203000,590000,98166,True
3,2007-01-29,745000,1200019,Single Family,2007-01-01,425600,0.977941,761805.0,1,26036,...,1.75,4,8,1947,0,0,183000,275000,98166,False
4,2008-02-25,425000,2800016,Single Family,2008-02-01,418400,0.961397,442065.0,1,8618,...,3.75,5,7,1966,0,0,104000,229000,98168,False
5,2013-03-29,240000,2800024,Single Family,2013-03-01,351600,0.807904,297065.0,1,8620,...,1.75,4,7,1948,0,0,104000,205000,98168,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27057,2011-04-08,325000,9842300710,Single Family,2011-04-01,318700,0.732307,443803.0,1,5468,...,1.75,3,7,1951,0,0,201000,172000,98126,False
27058,2007-09-28,1580000,9845500010,Single Family,2007-09-01,433500,0.996094,1586196.0,1,23914,...,4.50,4,11,2000,0,1,703000,951000,98040,False
27061,2012-07-09,165000,9899200010,Single Family,2012-07-01,325300,0.747472,220744.0,1,11170,...,1.00,4,6,1971,0,0,92000,130000,98055,False
27062,2006-05-26,315000,9900000355,Single Family,2006-05-01,400600,0.920496,342207.0,1,6223,...,2.00,3,7,1939,0,0,103000,212000,98166,False


## Model fitting and evaluation

In [4]:
features = ['SqFtTotLiving', 'SqFtLot', 'Bathrooms', 
              'Bedrooms', 'BldgGrade']
label = 'AdjSalePrice'

house_lm = LinearRegression()
house_lm.fit(house[features], house[label])

print(f'Intercept: {house_lm.intercept_:.3f}')
print('Coefficients:')
for name, coef in zip(features, house_lm.coef_):
    print(f' {name}: {coef}')

Intercept: -521871.368
Coefficients:
 SqFtTotLiving: 228.83060360240756
 SqFtLot: -0.06046682065306541
 Bathrooms: -19442.840398321103
 Bedrooms: -47769.955185214174
 BldgGrade: 106106.96307898113


In [5]:
# Evaluate the model (in-sample RMSE and R-squared)
fitted = house_lm.predict(house[features])
RMSE = np.sqrt(mean_squared_error(house[label], fitted))
r2 = r2_score(house[label], fitted)

print(f'RMSE: {RMSE:.0f}')
print(f'r2: {r2:.4f}')

RMSE: 261220
r2: 0.5406


In [6]:
# Evaluate the model with statsmodels
# assign(const=1) adds a constant column of ones to the predictors so that OLS fits an intercept
model = sm.OLS(house[label], house[features].assign(const=1))
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:           AdjSalePrice   R-squared:                       0.541
Model:                            OLS   Adj. R-squared:                  0.540
Method:                 Least Squares   F-statistic:                     5338.
Date:                Tue, 07 Dec 2021   Prob (F-statistic):               0.00
Time:                        10:51:02   Log-Likelihood:            -3.1517e+05
No. Observations:               22687   AIC:                         6.304e+05
Df Residuals:                   22681   BIC:                         6.304e+05
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
SqFtTotLiving   228.8306      3.899     58.694

## k-fold cross-validation

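1. Set aside 1/k of the data as a holdout sample.
2. Train the model on the remaining data.
3. Apply the model to the 1/k holdout (score it) and record the model evaluation metrics you need.
4. Restore the first 1/k of the data and set aside the next 1/k (excluding any records that were already selected).
5. Repeat steps 2 and 3.
6. Repeat until every record has been used in a holdout sample.
7. Combine the model evaluation metrics, for example by averaging them.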
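
The steps above can be sketched with scikit-learn's `KFold`. This is a minimal illustration on synthetic data (the data, seed, and k = 5 are assumptions), not an evaluation of the housing model from this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data (illustrative assumption)
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=100)

k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=0)

fold_rmse = []
for train_idx, holdout_idx in kf.split(X):                        # steps 1 and 4: pick the next 1/k
    model = LinearRegression().fit(X[train_idx], y[train_idx])    # step 2: train on the rest
    pred = model.predict(X[holdout_idx])                          # step 3: score the holdout
    fold_rmse.append(np.sqrt(mean_squared_error(y[holdout_idx], pred)))

cv_rmse = np.mean(fold_rmse)                                      # step 7: combine by averaging
print(f'RMSE per fold: {np.round(fold_rmse, 3)}')
print(f'Cross-validated RMSE: {cv_rmse:.3f}')
```

Because each record lands in exactly one holdout fold, the averaged metric is computed only on data the corresponding model did not see during training.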

## Model selection and stepwise regression

In [28]:
features = ['SqFtTotLiving', 'SqFtLot', 'Bathrooms', 'Bedrooms',
              'BldgGrade', 'PropertyType', 'NbrLivingUnits',
              'SqFtFinBasement', 'YrBuilt', 'YrRenovated', 
              'NewConstruction']
label = 'AdjSalePrice'

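# One-hot encode every non-numeric column.
# drop_first=True omits the first category: when all remaining dummies are 0,
# the record must belong to that first category, so it carries no extra information.
# dtype=int keeps the dummy columns numeric so that statsmodels accepts them.
X = pd.get_dummies(house[features], drop_first=True, dtype=int)
X['NewConstruction'] = X['NewConstruction'].astype(int)  # False -> 0, True -> 1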

house_full = sm.OLS(house[label], X.assign(const=1))
results = house_full.fit()
results.summary()

Dep. Variable:,AdjSalePrice,R-squared:,0.595
Model:,OLS,Adj. R-squared:,0.594
Method:,Least Squares,F-statistic:,2771.0
Date:,"Tue, 07 Dec 2021",Prob (F-statistic):,0.0
Time:,11:31:58,Log-Likelihood:,-313750.0
No. Observations:,22687,AIC:,627500.0
Df Residuals:,22674,BIC:,627600.0
Df Model:,12,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
SqFtTotLiving,198.6364,4.234,46.920,0.000,190.338,206.934
SqFtLot,0.0771,0.058,1.330,0.184,-0.037,0.191
Bathrooms,4.286e+04,3808.114,11.255,0.000,3.54e+04,5.03e+04
Bedrooms,-5.187e+04,2396.904,-21.638,0.000,-5.66e+04,-4.72e+04
BldgGrade,1.373e+05,2441.242,56.228,0.000,1.32e+05,1.42e+05
NbrLivingUnits,5723.8438,1.76e+04,0.326,0.744,-2.87e+04,4.01e+04
SqFtFinBasement,7.0611,4.627,1.526,0.127,-2.009,16.131
YrBuilt,-3574.2210,77.228,-46.282,0.000,-3725.593,-3422.849
YrRenovated,-2.5311,3.924,-0.645,0.519,-10.222,5.160

Omnibus:,31006.128,Durbin-Watson:,1.393
Prob(Omnibus):,0.0,Jarque-Bera (JB):,26251977.078
Skew:,7.427,Prob(JB):,0.0
Kurtosis:,168.984,Cond. No.,2980000.0


In [48]:
# Stepwise regression
y = house[label]

# Define a function that returns a fitted model for a given set of variables
def train_model(variables):
    if len(variables) == 0:
        return None
    model = LinearRegression()
    model.fit(X[variables], y)
    return model

# Define a function that returns a score (AIC) for a given model and set of variables
def score_model(model, variables):
    if len(variables) == 0:
        return AIC_score(y, [y.mean()]*len(y), model, df=1)
    return AIC_score(y, model.predict(X[variables]), model)

best_model, best_variables = stepwise_selection(X.columns, train_model, score_model, verbose=True)

print()
print(f'Intercept: {best_model.intercept_:.3f}')
print('Coefficients:')
for name, coef in zip(best_variables, best_model.coef_):
    print(f" {name}: {coef}")

Variables: SqFtTotLiving, SqFtLot, Bathrooms, Bedrooms, BldgGrade, NbrLivingUnits, SqFtFinBasement, YrBuilt, YrRenovated, NewConstruction, PropertyType_Single Family, PropertyType_Townhouse
Start: score=647988.32, constant
Step: score=633013.35, add SqFtTotLiving
Step: score=630793.74, add BldgGrade
Step: score=628230.29, add YrBuilt
Step: score=627784.16, add Bedrooms
Step: score=627602.21, add Bathrooms
Step: score=627525.65, add PropertyType_Townhouse
Step: score=627525.08, add SqFtFinBasement
Step: score=627524.98, add PropertyType_Single Family
Step: score=627524.98, unchanged None

Intercept: 6178645.017
Coefficients:
 SqFtTotLiving: 199.2775530420158
 BldgGrade: 137159.5602261976
 YrBuilt: -3565.4249392494557
 Bedrooms: -51947.383673614146
 Bathrooms: 42396.16452772052
 PropertyType_Townhouse: 84479.1620329995
 SqFtFinBasement: 7.046974967583083
 PropertyType_Single Family: 22912.05518701769


## Weighted regression

In [63]:
house['Year'] = house['DocumentDate'].apply(lambda date: int(date.split('-')[0]))
house['Weight'] = house['Year'] - 2005  # more recent sales get larger weights

features = ['SqFtTotLiving', 'SqFtLot', 'Bathrooms', 
              'Bedrooms', 'BldgGrade']
label = 'AdjSalePrice'

# Ordinary (unweighted) linear model
house_lm = LinearRegression()
house_lm.fit(house[features], house[label])

# Weighted regression model
house_wt = LinearRegression()
house_wt.fit(house[features], house[label], sample_weight=house['Weight'])

# DataFrame.append was removed in pandas 2.0, so build the table with pd.concat
pd.concat([
    pd.DataFrame({
        'features': features,
        'house_lm': house_lm.coef_,
        'house_wt': house_wt.coef_,
    }),
    pd.DataFrame({
        'features': ['intercept'],
        'house_lm': [house_lm.intercept_],
        'house_wt': [house_wt.intercept_],
    }),
], ignore_index=True)

Unnamed: 0,features,house_lm,house_wt
0,SqFtTotLiving,228.830604,245.024089
1,SqFtLot,-0.060467,-0.292415
2,Bathrooms,-19442.840398,-26085.970109
3,Bedrooms,-47769.955185,-53608.876436
4,BldgGrade,106106.963079,115242.434726
5,intercept,-521871.368188,-584189.329446


In [73]:
# Mean absolute residual by year
residuals = pd.DataFrame({
    'abs_residual_lm': np.abs(house_lm.predict(house[features]) - house[label]),
    'abs_residual_wt': np.abs(house_wt.predict(house[features]) - house[label]),
    'Year': house['Year'],
})

residual_means_year = [
    [year, group['abs_residual_lm'].mean(), group['abs_residual_wt'].mean()]
    for year, group in residuals.groupby('Year')
]
pd.DataFrame(residual_means_year, columns=['Year', 'mean abs_residual_lm', 'mean abs_residual_wt'])

Unnamed: 0,Year,mean abs_residual_lm,mean abs_residual_wt
0,2006,140540.303585,146557.454636
1,2007,147747.577959,152848.523235
2,2008,142086.905943,146360.411668
3,2009,147016.720883,151182.924825
4,2010,163267.674885,166364.476152
5,2011,169937.385744,172950.876028
6,2012,169506.670053,171874.424266
7,2013,203659.77751,206242.199403
8,2014,184452.840665,186668.57375
9,2015,172323.435147,169842.742053
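
One way to build intuition for what `sample_weight` does: with integer weights, a weighted fit gives the same coefficients as an ordinary fit on data in which each record is repeated as many times as its weight. The sketch below checks this on synthetic data (the data, seed, and weights are assumptions for illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with integer weights 1..4 (illustrative assumption)
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
y = 1.0 + X @ np.array([2.0, -3.0]) + rng.normal(size=50)
w = rng.integers(1, 5, size=50)

# Weighted fit: each squared residual is multiplied by its weight
weighted = LinearRegression().fit(X, y, sample_weight=w)

# Unweighted fit on data where record i appears w[i] times
repeated = LinearRegression().fit(np.repeat(X, w, axis=0), np.repeat(y, w))

print('weighted:', np.round(weighted.coef_, 6), round(weighted.intercept_, 6))
print('repeated:', np.round(repeated.coef_, 6), round(repeated.intercept_, 6))
```

Seen this way, weighting by `Year - 2005` lets a 2015 sale count ten times as much as a 2006 sale when the coefficients are estimated.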
