## LinearRegression 
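- Used to analyze whether continuous independent variables affect a continuous outcome variable, and to predict the label (target) variable.
- A regression model's performance is measured by how closely its predicted values match the actual values, that is, by how large the difference between them is.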
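
The bullets above describe scoring a regression model by comparing actual and predicted values. As a minimal sketch, the three metrics that appear later in this section (R², RMSE, MAE) can be computed by hand with NumPy. The toy arrays and the helper name `regression_metrics` are made up for illustration and are not part of the notebook.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE and MAE for two equal-length 1-D sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": np.sqrt(np.mean(residuals ** 2)),
        "mae": np.mean(np.abs(residuals)),
    }

# Toy values (hypothetical, not taken from house_price.csv)
metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.5, 7.0, 10.0])
print(metrics)
```

R² is unitless, while RMSE and MAE are in the units of the target, so for house prices they read as a typical error in currency.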

In [1]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd 
data = pd.read_csv("../Data/house_price.csv", encoding="utf-8")
X = data.iloc[:, 1:5]
y = data[["house_value"]]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=410)

In [2]:
import statsmodels.api as sm 
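# statsmodels: Python's statistical analysis module

# Add a constant-term column to the train and test data
# This column lets the regression estimate the intercept
# add_constant adds a column whose value is 1 in every row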
X_train_new = sm.add_constant(X_train)
X_test_new = sm.add_constant(X_test)
X_train_new.head()

       const  income  bedrooms  households      rooms
15910    1.0  2.6066  0.200999    3.968675   5.790361
14581    1.0  1.8086  0.317697    4.435323        3.5
13846    1.0  3.8177  0.221811    2.764846   5.140143
5788     1.0  5.5581  0.169134    2.780488   6.118467
12212    1.0  3.6875  0.192385    1.916854  10.033708
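
The `const` column above is what lets OLS estimate an intercept. Here is a rough NumPy-only sketch of the same idea on synthetic data. The numbers and variable names are illustrative assumptions, and `np.linalg.lstsq` stands in for `sm.OLS`.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(50, 2))        # two synthetic features
y = 4.0 + 1.5 * x[:, 0] - 2.0 * x[:, 1]     # noise-free target with intercept 4.0

# Prepend a column of ones, mirroring what sm.add_constant does by default
x_with_const = np.column_stack([np.ones(len(x)), x])

coef, *_ = np.linalg.lstsq(x_with_const, y, rcond=None)
print(coef)            # first entry is the intercept estimate

# Without the constant column the fit is forced through the origin
coef_no_const, *_ = np.linalg.lstsq(x, y, rcond=None)
print(coef_no_const)
```

Because the target here has a nonzero intercept, dropping the ones column distorts the slope estimates, which is why the constant is added before calling `sm.OLS`.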


In [4]:
model_train = sm.OLS(y_train, X_train_new).fit()
print(model_train.summary())

                            OLS Regression Results                            
Dep. Variable:            house_value   R-squared:                       0.547
Model:                            OLS   Adj. R-squared:                  0.546
Method:                 Least Squares   F-statistic:                     3996.
Date:                Mon, 13 Jun 2022   Prob (F-statistic):               0.00
Time:                        21:07:56   Log-Likelihood:            -1.6561e+05
No. Observations:               13266   AIC:                         3.312e+05
Df Residuals:                   13261   BIC:                         3.313e+05
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -2.647e+04   8715.941     -3.036      0.0
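
As a quick check on how the header rows of the summary relate to each other, the adjusted R² and the F-statistic can be recomputed from R², the number of observations, and the number of predictors. This sketch assumes the unrounded training R² is the value that `model.score` returns later in this section (both fits solve the same least-squares problem, and the summary shows the rounded 0.547). The variable names are hypothetical.

```python
# Values read from the training summary and the later model.score output
n_obs = 13266             # No. Observations
n_features = 4            # Df Model
r2 = 0.5465591521176505   # unrounded training R^2 (assumed equal to the OLS R^2)

df_resid = n_obs - n_features - 1    # residual degrees of freedom
adj_r2 = 1 - (1 - r2) * (n_obs - 1) / df_resid
f_stat = (r2 / n_features) / ((1 - r2) / df_resid)

print(df_resid)
print(round(adj_r2, 3))
print(round(f_stat))
```

Under those assumptions the results should agree with the `Df Residuals`, `Adj. R-squared`, and `F-statistic` rows above.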

In [5]:
model_test = sm.OLS(y_test, X_test_new).fit()
print(model_test.summary())

                            OLS Regression Results                            
Dep. Variable:            house_value   R-squared:                       0.560
Model:                            OLS   Adj. R-squared:                  0.560
Method:                 Least Squares   F-statistic:                     1406.
Date:                Mon, 13 Jun 2022   Prob (F-statistic):               0.00
Time:                        21:08:30   Log-Likelihood:                -55258.
No. Observations:                4423   AIC:                         1.105e+05
Df Residuals:                    4418   BIC:                         1.106e+05
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -2.833e+04   1.56e+04     -1.811      0.0

In [6]:
from sklearn.preprocessing import MinMaxScaler
minmax = MinMaxScaler()
minmax.fit(X_train)
X_scaled_train = minmax.transform(X_train)
X_scaled_test = minmax.transform(X_test)
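
The cell above fits the scaler on the training split only and reuses it for the test split. A minimal NumPy sketch of what that amounts to with the default 0 to 1 range, using made-up arrays and a hypothetical helper `minmax_transform`:

```python
import numpy as np

train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]])
test = np.array([[2.0, 40.0], [6.0, 10.0]])

# "fit": remember the per-column min and max of the training rows only
col_min = train.min(axis=0)
col_max = train.max(axis=0)

def minmax_transform(values):
    """Apply (x - min) / (max - min) using the training statistics."""
    return (values - col_min) / (col_max - col_min)

print(minmax_transform(train))   # each column spans exactly [0, 1]
print(minmax_transform(test))    # values may fall outside [0, 1]
```

Test rows can land below 0 or above 1 because their range is not used when fitting, which keeps information about the test split out of the preprocessing step.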

In [7]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_scaled_train, y_train)
pred_train = model.predict(X_scaled_train)
model.score(X_scaled_train, y_train)

0.5465591521176505

In [8]:
pred_test = model.predict(X_scaled_test)
model.score(X_scaled_test, y_test)

0.5584523289506957
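
The training score of about 0.5466 agrees with the `R-squared` of 0.547 in the statsmodels training summary even though the features were min-max scaled first. That is expected: min-max scaling is an affine change of each column, and least squares with an intercept produces the same fitted values either way. A small sketch on synthetic data, where the helper `ols_r2` and all numbers are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 100, size=(300, 2))
y = 5.0 + 0.3 * x[:, 0] - 0.1 * x[:, 1] + rng.normal(scale=2.0, size=300)

def ols_r2(features, target):
    """Fit least squares with an intercept and return the in-sample R^2."""
    design = np.column_stack([np.ones(len(features)), features])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ beta
    return 1 - np.sum(resid ** 2) / np.sum((target - target.mean()) ** 2)

# Min-max scale each column to [0, 1]
x_scaled = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

r2_raw = ols_r2(x, y)
r2_scaled = ols_r2(x_scaled, y)
print(r2_raw, r2_scaled)
```

The coefficients change with the scaling, but the two R² values should match up to floating-point error.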

In [9]:
import numpy as np 
from sklearn.metrics import mean_squared_error, mean_absolute_error
MSE_train = mean_squared_error(y_train, pred_train)
MSE_test = mean_squared_error(y_test, pred_test)

print("Training data RMSE : ", np.sqrt(MSE_train))
print("Test data RMSE : ", np.sqrt(MSE_test))

Training data RMSE :  63884.46384687857
Test data RMSE :  64619.08085578377


In [10]:
mean_absolute_error(y_test, pred_test)

48017.8014186215