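The following data summarize housing-price information for each district in California in 1990. Each row holds the information for one district. The goal is to build a model that predicts housing prices from this data.
- Data: California_housing.txt
- MedInc: median income (numeric)
- HouseAge: median house age (numeric)
- AveRooms: average number of rooms (numeric)
- AveBedrms: average number of bedrooms (numeric)
- Population: population (numeric)
- AveOccup: average household occupancy (numeric)
- Target: median house value (numeric)
- Xgrp: index that separates the train and test sets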

In [59]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import statsmodels.formula.api as smf
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

In [2]:
data = pd.read_csv('California_housing.txt')
data.shape

(20640, 8)

### Removing missing values

In [3]:
np.sum(pd.isnull(data))

MedInc        0
HouseAge      0
AveRooms      1
AveBedrms     0
Population    3
AveOccup      3
Target        0
Xgrp          0
dtype: int64

In [5]:
data = data.dropna(axis=0)
data.shape

(20633, 8)
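
The per-column counts above add up to 7 (1 + 3 + 3), and `dropna` removed 20640 - 20633 = 7 rows, which is consistent with each missing value falling in a different row. A minimal sketch of the same check on a made-up toy frame (the values and the name `toy` are illustrative assumptions, not taken from California_housing.txt):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "AveRooms": [5.5, np.nan, 4.6, 4.4],
    "Population": [1060.0, 2824.0, np.nan, 3523.0],
    "Target": [0.697, 1.405, 0.652, 1.683],
})

# Missing values per column; same result as np.sum(pd.isnull(toy)).
missing_per_column = toy.isna().sum()

# Number of rows with at least one missing value,
# i.e. the number of rows that dropna(axis=0) will remove.
rows_with_missing = int(toy.isna().any(axis=1).sum())

cleaned = toy.dropna(axis=0)
print(missing_per_column)
print(len(toy) - len(cleaned), rows_with_missing)
```

Comparing the row-level count with the sum of the column-level counts is a quick way to see whether several missing values share a row.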

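### Problem 1.
Excluding population, we want to identify which variables are strongly associated with housing prices. Using Pearson correlation analysis, state in order the two variables whose correlation with the median house value is weakest.

- Answer: AveOccup, AveBedrms (weakest first, by absolute correlation with Target)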

In [6]:
data.columns

Index(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',
       'Target', 'Xgrp'],
      dtype='object')

In [8]:
print(data[['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'AveOccup','Target']].corr(method='pearson'))

             MedInc  HouseAge  AveRooms  AveBedrms  AveOccup    Target
MedInc     1.000000 -0.119027  0.326912  -0.062013  0.018764  0.688015
HouseAge  -0.119027  1.000000 -0.153331  -0.077790  0.013190  0.105684
AveRooms   0.326912 -0.153331  1.000000   0.847643 -0.004849  0.151964
AveBedrms -0.062013 -0.077790  0.847643   1.000000 -0.006179 -0.046664
AveOccup   0.018764  0.013190 -0.004849  -0.006179  1.000000 -0.023745
Target     0.688015  0.105684  0.151964  -0.046664 -0.023745  1.000000
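
Since the strength of a Pearson correlation is its absolute value, the ranking can be read off by sorting the `Target` column of the matrix by magnitude. A small sketch using the coefficients printed above (the names `target_corr` and `ranked` are introduced here for illustration):

```python
import pandas as pd

# Correlations with Target, copied from the matrix printed above.
target_corr = pd.Series({
    "MedInc": 0.688015,
    "HouseAge": 0.105684,
    "AveRooms": 0.151964,
    "AveBedrms": -0.046664,
    "AveOccup": -0.023745,
})

# Rank by magnitude, weakest first; the sign only gives the direction.
ranked = target_corr.abs().sort_values()
print(ranked.head(2))
```

On the full DataFrame the same series could be obtained with `data[cols].corr(method='pearson')['Target'].drop('Target')`, where `cols` is the column list used in the cell above.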


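### Problem 2
Using the data with Xgrp equal to 0, fit a linear regression model that predicts the median house value from all variables except the one with the relatively weakest association with it. Report the explanatory power (R-squared) of the fitted model, rounded to two decimal places.

Answer: 0.54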

In [17]:
trainData = data[data.Xgrp==0]
testData = data[data.Xgrp==1]

print(trainData.shape, testData.shape)
trainData.columns

(14444, 8) (6189, 8)


Index(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',
       'Target', 'Xgrp'],
      dtype='object')

In [21]:
result = smf.ols(formula='Target ~ MedInc + HouseAge + AveRooms + AveBedrms + Population', data=trainData).fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:                 Target   R-squared:                       0.543
Model:                            OLS   Adj. R-squared:                  0.542
Method:                 Least Squares   F-statistic:                     3426.
Date:                Sun, 06 Sep 2020   Prob (F-statistic):               0.00
Time:                        13:04:01   Log-Likelihood:                -16912.
No. Observations:               14444   AIC:                         3.384e+04
Df Residuals:                   14438   BIC:                         3.388e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.4340      0.035    -12.305      0.0

In [22]:
x = trainData[['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population']]
y = trainData['Target']
model = LinearRegression(fit_intercept=True)
model.fit(x, y)

LinearRegression()
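
The "explanatory power" asked for is the coefficient of determination, R-squared = 1 - SS_res / SS_tot. statsmodels reports it in the summary (also available as `result.rsquared`), and for the scikit-learn model `model.score(x, y)` returns the same quantity. A minimal sketch of the definition on synthetic data (the data and coefficients are made up, and NumPy's least-squares solver stands in for the fitted model):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=n)

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ beta

# R-squared: share of the variance of y explained by the fit.
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot
print(round(r_squared, 2))
```

For a least-squares fit with an intercept, this value also equals the squared correlation between `y` and `y_hat`.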

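### Problem 3
To evaluate the performance of the regression model trained in the previous problem, use all data with Xgrp equal to 1 as test data, predict the median house value, and report the model's mean squared error (MSE).

Answer: 0.631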

In [31]:
test_x = testData[['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population']]
test_y = testData['Target']
test_predict = model.predict(test_x)  # use the test_x defined above (testX was undefined)
print(test_predict)

[2.43748357 3.14697017 2.3803096  ... 3.16912516 1.81251267 2.20151844]


In [32]:
mean_squared_error(test_y, test_predict)

0.6317678459173376
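
`mean_squared_error` averages the squared differences between the true and predicted values. A tiny worked sketch with made-up numbers (not taken from the test set):

```python
import numpy as np

y_true = np.array([2.0, 3.0, 1.5, 4.0])
y_pred = np.array([2.5, 2.0, 1.5, 3.0])

# Errors are -0.5, 1.0, 0.0, 1.0; their squares are 0.25, 1.0, 0.0, 1.0.
mse = np.mean((y_true - y_pred) ** 2)

# RMSE is on the same scale as the target, which can be easier to read.
rmse = np.sqrt(mse)
print(mse, rmse)  # 0.5625 0.75
```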

### Problem 4
To examine the relationships among the variables in more detail, we perform principal component analysis (PCA). Carry out the PCA following the procedure below and report the first and second eigenvalues in order.

Answer: 1.90926239, 1.29795141

Reference: https://m.blog.naver.com/PostView.nhn?blogId=tjdrud1323&logNo=221720259834&proxyReferer=https:%2F%2Fwww.google.com%2F

In [35]:
trainData.columns

Index(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',
       'Target', 'Xgrp'],
      dtype='object')

In [103]:
normal_train_data = trainData[['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Target']]
normal_train_data

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Target
0,2.8333,37.0,5.561321,0.974057,1060.0,2.500000,0.697
1,2.7730,20.0,3.884141,1.207028,2824.0,2.681861,1.405
2,1.4511,32.0,4.644689,1.036630,549.0,2.010989,0.652
3,2.8289,38.0,4.438148,1.031125,3523.0,2.811652,1.683
4,2.8345,31.0,3.894915,1.127966,2048.0,1.735593,1.838
...,...,...,...,...,...,...,...
14439,5.0000,21.0,3.062500,0.875000,29.0,1.812500,0.875
14440,2.7530,14.0,5.789030,1.170886,1153.0,2.432489,1.111
14441,4.2083,33.0,5.026163,1.151163,1073.0,3.119186,1.805
14442,6.7058,44.0,6.335430,0.991614,1235.0,2.589099,4.959


In [112]:
# Standardization (zero mean, unit variance per column)
#scaler = StandardScaler()
#scaler.fit(normal_train_data)
#normal_train_data = scaler.transform(normal_train_data)
normal_train_data = StandardScaler().fit_transform(normal_train_data)
columns = ['MedIncStd', 'HouseAgeStd', 'AveRoomsStd', 'AveBedrmsStd', 'PopulationStd', 'AveOccupStd','TargetStd']
pd.DataFrame(normal_train_data, columns=columns).head()

Unnamed: 0,MedIncStd,HouseAgeStd,AveRoomsStd,AveBedrmsStd,PopulationStd,AveOccupStd,TargetStd
0,-0.544217,0.66719,0.060593,-0.308534,-0.321236,-0.051655,-1.187335
1,-0.575725,-0.688097,-0.715903,0.281825,1.249158,-0.035637,-0.573705
2,-1.266464,0.268576,-0.363787,-0.14997,-0.776152,-0.094727,-1.226337
3,-0.546516,0.746912,-0.45941,-0.163919,1.87144,-0.024205,-0.33276
4,-0.543589,0.188853,-0.710915,0.08148,0.558327,-0.118984,-0.19842


In [106]:
# Build a DataFrame of principal component scores
pca = PCA(n_components=2)  # number of principal components to keep
principalComponents = pca.fit_transform(normal_train_data)
principalDf = pd.DataFrame(data=principalComponents, columns=['principal_component1', 'principal_component2'])
principalDf

Unnamed: 0,principal_component1,principal_component2
0,-0.897614,0.690861
1,-0.794317,0.668446
2,-1.364810,1.107650
3,-0.978891,0.095802
4,-0.807394,0.183314
...,...,...
14439,-0.889177,-0.221346
14440,-0.221644,1.127376
14441,-0.090910,0.005363
14442,1.647256,-2.408463


In [113]:
pca.explained_variance_, pca.explained_variance_ratio_, sum(pca.explained_variance_ratio_)

(array([2.01253895, 1.61302332, 1.30301792]),
 array([0.28748566, 0.23041595, 0.18613253]),
 0.7040341384359075)
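
The "eigenvalues" in the question are what scikit-learn exposes as `explained_variance_`: the eigenvalues of the sample covariance matrix of the data passed to PCA. The output above lists three values, which suggests the cell was run after a three-component fit. A minimal sketch of the correspondence on synthetic data (the array `raw` is made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
raw = rng.normal(size=(300, 4))
raw[:, 1] += 0.8 * raw[:, 0]  # make two columns correlated

Z = StandardScaler().fit_transform(raw)

# Eigenvalues of the sample covariance matrix (n - 1 denominator),
# sorted from largest to smallest.
cov = np.cov(Z, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(cov))[::-1]

pca = PCA(n_components=2).fit(Z)
print(eigenvalues[:2])
print(pca.explained_variance_)
```

Because `StandardScaler` divides by the population standard deviation while the covariance uses n - 1, the eigenvalues sum to p * n / (n - 1) rather than exactly p, where p is the number of columns.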

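### Problem 5
Using the PCA results, build a linear model that predicts the median house value with the top three principal components as independent variables, and report the explanatory power (R-squared) of the fitted model, rounded to two decimal places.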

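- R-squared and Adj. R-squared come out nearly identical here because the sample (14,444 observations) is very large relative to the number of predictors (3); the adjustment depends on the sample size and the number of predictors, not on whether the predictors are correlated.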
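
The adjustment is a function of the sample size n and the number of predictors p only: Adj. R-squared = 1 - (1 - R-squared)(n - 1) / (n - p - 1). A minimal sketch (the helper name `adjusted_r_squared` and the small-sample comparison are illustrative assumptions):

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-squared for a model with an intercept."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# With 14,444 observations and 3 predictors the penalty is negligible.
large_sample = adjusted_r_squared(0.834, 14444, 3)

# The same R-squared with only 20 observations is penalized visibly.
small_sample = adjusted_r_squared(0.834, 20, 3)

print(round(large_sample, 3), round(small_sample, 3))
```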

In [110]:
pca = PCA(n_components=3)  # number of principal components to keep
principalComponents = pca.fit_transform(normal_train_data)
principalDf = pd.DataFrame(data=principalComponents, columns=['principal_component1', 'principal_component2', 'principal_component3'])

# Reset the row index so the two frames align row by row in concat
trainData = trainData.reset_index(drop=True)
principalDf = principalDf.reset_index(drop=True)

regressionData = pd.concat([trainData, principalDf], axis=1)
regressionData

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Target,Xgrp,principal_component1,principal_component2,principal_component3
0,2.8333,37.0,5.561321,0.974057,1060.0,2.500000,0.697,0.0,-0.897614,0.690861,-0.640445
1,2.7730,20.0,3.884141,1.207028,2824.0,2.681861,1.405,0.0,-0.794317,0.668446,1.302496
2,1.4511,32.0,4.644689,1.036630,549.0,2.010989,0.652,0.0,-1.364810,1.107650,-0.786194
3,2.8289,38.0,4.438148,1.031125,3523.0,2.811652,1.683,0.0,-0.978891,0.095802,0.836213
4,2.8345,31.0,3.894915,1.127966,2048.0,1.735593,1.838,0.0,-0.807394,0.183314,0.227596
...,...,...,...,...,...,...,...,...,...,...,...
14439,5.0000,21.0,3.062500,0.875000,29.0,1.812500,0.875,0.0,-0.889177,-0.221346,-0.263480
14440,2.7530,14.0,5.789030,1.170886,1153.0,2.432489,1.111,0.0,-0.221644,1.127376,0.535801
14441,4.2083,33.0,5.026163,1.151163,1073.0,3.119186,1.805,0.0,-0.090910,0.005363,-0.426913
14442,6.7058,44.0,6.335430,0.991614,1235.0,2.589099,4.959,0.0,1.647256,-2.408463,-0.885681
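
One point worth checking: `normal_train_data` was built from a column list that includes `Target`, so the principal components already contain information from the response they are later used to predict, and this can inflate R-squared. The sketch below compares the two setups on synthetic data; it is an illustration under made-up assumptions, not a re-run of the notebook, and whether the exam procedure intends `Target` to be included is not stated here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=1.0, size=n)

# (a) Components computed from the predictors only.
Z_x = StandardScaler().fit_transform(X)
pcs_x = PCA(n_components=3).fit_transform(Z_x)
r2_predictors_only = LinearRegression().fit(pcs_x, y).score(pcs_x, y)

# (b) Components computed from predictors and target together.
Z_xy = StandardScaler().fit_transform(np.column_stack([X, y]))
pcs_xy = PCA(n_components=3).fit_transform(Z_xy)
r2_with_target = LinearRegression().fit(pcs_xy, y).score(pcs_xy, y)

print(round(r2_predictors_only, 3), round(r2_with_target, 3))
```

In setup (b) the leading component loads heavily on the target itself, so the regression partly predicts `y` from `y`. Setup (a) is the usual principal component regression and also allows the same transformation to be applied to test data.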


In [111]:
result = smf.ols(formula='Target ~ principal_component1 + principal_component2 + principal_component3', data=regressionData).fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:                 Target   R-squared:                       0.834
Model:                            OLS   Adj. R-squared:                  0.834
Method:                 Least Squares   F-statistic:                 2.416e+04
Date:                Sun, 06 Sep 2020   Prob (F-statistic):               0.00
Time:                        14:18:23   Log-Likelihood:                -9596.8
No. Observations:               14444   AIC:                         1.920e+04
Df Residuals:                   14440   BIC:                         1.923e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept                2.0669 