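## Terminology
- Standardized residual: a residual divided by the standard error of the residuals (the distance from the regression line, expressed as a number of standard errors)
- Outlier: a record (or outcome value) that is far from the rest of the data (or from the predicted value)
- Influential value: a value or record whose presence or absence makes a large difference to the regression equation
- Leverage: the degree of influence a single record has on the regression equation (synonym: hat value)
- Non-normal residuals: residuals that do not follow a normal distribution can invalidate some requirements of regression analysis, but this is rarely treated as a major concern in data science.
- Heteroskedasticity: the tendency of residuals to show much higher variance over some ranges of the outcome (which may indicate that the regression is missing a predictor)
- Partial residual plot: a plot for diagnosing the relationship between the outcome variable and a single predictor (related term: added variable plot)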
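The definitions of standardized residual and leverage can be made concrete with a small sketch. It assumes a one-predictor regression, where the hat value has the closed form `1/n + (x_i - mean(x))^2 / Sxx`; the function name `simple_regression_diagnostics` and the toy data are hypothetical and not part of the notebook.

```python
import math

def simple_regression_diagnostics(x, y):
    """Return (residuals, hat_values, standardized_residuals) for y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Residual standard error with n - 2 degrees of freedom (two fitted parameters).
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    # Hat value (leverage) of each record in a one-predictor regression.
    hat = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    # Standardized (internally studentized) residual.
    std_resid = [e / (s * math.sqrt(1 - h)) for e, h in zip(residuals, hat)]
    return residuals, hat, std_resid

# Hypothetical toy data: the last record has an extreme x (high leverage),
# the fourth record has a y far from the line (large standardized residual).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 20.0]
y = [2.1, 3.9, 6.2, 14.0, 9.8, 12.1, 40.3]
residuals, hat, std_resid = simple_regression_diagnostics(x, y)

high_leverage = max(range(len(x)), key=lambda i: hat[i])
largest_resid = max(range(len(x)), key=lambda i: abs(std_resid[i]))
print(high_leverage, largest_resid)
```

In this toy example the high-leverage record and the record with the largest standardized residual are different rows, which is why the two terms are defined separately.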

In [10]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

In [2]:
house = pd.read_csv('../../data/house_sales.csv', sep='\t')

## Outliers in Regression
- In regression, an outlier is a record whose actual y value is far from its predicted value. Outliers can be detected by examining the standardized residuals.

In [9]:
# Fit the regression using only the records in ZIP code 98105
house_98105 = house.loc[house['ZipCode'] == 98105, :]

features = ['SqFtTotLiving', 'SqFtLot', 'Bathrooms', 'Bedrooms', 'BldgGrade']
label = 'AdjSalePrice'

house_outlier = sm.OLS(house_98105[label], house_98105[features].assign(const=1))
result_98105 = house_outlier.fit()
result_98105.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | AdjSalePrice | R-squared: | 0.795 |
| Model: | OLS | Adj. R-squared: | 0.792 |
| Method: | Least Squares | F-statistic: | 238.7 |
| Date: | Mon, 13 Dec 2021 | Prob (F-statistic): | 1.69e-103 |
| Time: | 11:55:21 | Log-Likelihood: | -4226.0 |
| No. Observations: | 313 | AIC: | 8464.0 |
| Df Residuals: | 307 | BIC: | 8486.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| SqFtTotLiving | 209.6023 | 24.408 | 8.587 | 0.000 | 161.574 | 257.631 |
| SqFtLot | 38.9333 | 5.330 | 7.305 | 0.000 | 28.445 | 49.421 |
| Bathrooms | 2282.2641 | 2e+04 | 0.114 | 0.909 | -3.7e+04 | 4.16e+04 |
| Bedrooms | -2.632e+04 | 1.29e+04 | -2.043 | 0.042 | -5.17e+04 | -973.867 |
| BldgGrade | 1.3e+05 | 1.52e+04 | 8.533 | 0.000 | 1e+05 | 1.6e+05 |
| const | -7.725e+05 | 9.83e+04 | -7.861 | 0.000 | -9.66e+05 | -5.79e+05 |

| | | | |
|---|---|---|---|
| Omnibus: | 82.127 | Durbin-Watson: | 1.508 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 586.561 |
| Skew: | 0.859 | Prob(JB): | 4.26e-128 |
| Kurtosis: | 9.483 | Cond. No. | 56300.0 |


In [21]:
# Standardized residuals
influence = OLSInfluence(result_98105)
sresiduals = influence.resid_studentized_internal
sresiduals

1036    -2.622626
1769    -0.892439
1770    -0.354599
1771    -1.054669
1783    -0.563783
           ...   
26628    0.085499
26629   -0.544324
26630   -0.355351
26631   -0.049035
26632   -1.480952
Length: 313, dtype: float64
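Once the standardized residuals are available, the usual next step is to locate the most extreme record. The following is a minimal sketch on synthetic data rather than the housing data; the planted outlier at position 17 and the cutoff of 3.0 are arbitrary assumptions chosen for illustration. The residuals are computed by hand with NumPy so that each step is visible.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic stand-in for the housing data (hypothetical values).
n = 200
X = np.column_stack([rng.normal(size=n), rng.normal(size=n), np.ones(n)])
y = X @ np.array([3.0, -2.0, 10.0]) + rng.normal(scale=1.0, size=n)
y[17] -= 15.0  # plant one record whose outcome is far below its prediction

# Least-squares fit and raw residuals.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Hat values are the diagonal of X (X'X)^-1 X'.
hat = np.einsum("ij,ij->i", X @ np.linalg.inv(X.T @ X), X)
# Residual standard error with n - p degrees of freedom.
s = np.sqrt(resid @ resid / (n - X.shape[1]))
sresid = resid / (s * np.sqrt(1.0 - hat))

worst = int(np.argmin(sresid))                  # position of the most negative value
flagged = np.flatnonzero(np.abs(sresid) > 3.0)  # 3.0 is an arbitrary cutoff
print(worst, flagged)
```

For a pandas Series such as `sresiduals`, `sresiduals.idxmin()` and `sresiduals.min()` give the index label and value of the most negative standardized residual.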

In [22]:
result_98105.resid

1036    -456062.232432
1769    -158145.203415
1770     -62978.327751
1771    -186794.017568
1783     -99686.534518
             ...      
26628     15171.804483
26629    -96919.956869
26630    -62899.873474
26631     -8661.168054
26632   -261156.728401
Length: 313, dtype: float64

In [14]:
house_98105

| | DocumentDate | SalePrice | PropertyID | PropertyType | ym | zhvi_px | zhvi_idx | AdjSalePrice | NbrLivingUnits | SqFtLot | ... | Bathrooms | Bedrooms | BldgGrade | YrBuilt | YrRenovated | TrafficNoise | LandVal | ImpsVal | ZipCode | NewConstruction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1036 | 2007-08-16 | 825000 | 394500005 | Multiplex | 2007-08-01 | 434600 | 0.998621 | 826139.0 | 2 | 7245 | ... | 4.50 | 6 | 8 | 1961 | 0 | 2 | 280000 | 468000 | 98105 | False |
| 1769 | 2006-12-27 | 655000 | 714000130 | Single Family | 2006-12-01 | 423400 | 0.972886 | 673255.0 | 1 | 6750 | ... | 1.75 | 4 | 8 | 1946 | 0 | 0 | 350000 | 417000 | 98105 | False |
| 1770 | 2007-10-11 | 650000 | 714000290 | Single Family | 2007-10-01 | 431300 | 0.991039 | 655878.0 | 1 | 6630 | ... | 1.75 | 3 | 7 | 1946 | 0 | 0 | 350000 | 332000 | 98105 | False |
| 1771 | 2006-03-06 | 580000 | 714000330 | Single Family | 2006-03-01 | 392100 | 0.900965 | 643754.0 | 1 | 7130 | ... | 1.75 | 3 | 7 | 1947 | 0 | 0 | 350000 | 350000 | 98105 | False |
| 1783 | 2008-05-28 | 1260000 | 723000114 | Single Family | 2008-05-01 | 407400 | 0.936121 | 1345979.0 | 1 | 8510 | ... | 3.50 | 5 | 9 | 1971 | 0 | 0 | 593000 | 894000 | 98105 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 26628 | 2009-06-18 | 670000 | 9550204590 | Single Family | 2009-06-01 | 357100 | 0.820542 | 816533.0 | 1 | 3825 | ... | 1.75 | 4 | 8 | 1926 | 0 | 0 | 335000 | 560000 | 98105 | False |
| 26629 | 2015-05-12 | 475000 | 9550204620 | Single Family | 2015-05-01 | 435200 | 1.000000 | 475000.0 | 1 | 3825 | ... | 1.75 | 3 | 7 | 1925 | 0 | 0 | 335000 | 295000 | 98105 | False |
| 26630 | 2008-11-20 | 625000 | 9550204650 | Single Family | 2008-11-01 | 385800 | 0.886489 | 705029.0 | 1 | 3060 | ... | 2.50 | 5 | 8 | 1926 | 2004 | 0 | 259000 | 513000 | 98105 | False |
| 26631 | 2008-11-03 | 388000 | 9550204660 | Single Family | 2008-11-01 | 385800 | 0.886489 | 437682.0 | 1 | 3060 | ... | 1.00 | 1 | 7 | 1902 | 0 | 0 | 259000 | 246000 | 98105 | False |
