<a href="https://colab.research.google.com/github/sangjin94/itwill-python/blob/main/ml06_multiple_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multiple Linear Regression

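A linear regression model with multiple features (independent variables)
  * Linear regression using first-order terms only
  * Linear regression that includes higher-order terms
  * Regularization: a technique for reducing overfitting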

# Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score,mean_squared_error

# Data Preparation

In [4]:
# GitHub URL of the dataset
fish_csv = 'https://github.com/rickiepark/hg-mldl/raw/master/fish.csv'

In [5]:
# Create the DataFrame
fish = pd.read_csv(fish_csv)

In [7]:
fish.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Species   159 non-null    object 
 1   Weight    159 non-null    float64
 2   Length    159 non-null    float64
 3   Diagonal  159 non-null    float64
 4   Height    159 non-null    float64
 5   Width     159 non-null    float64
dtypes: float64(5), object(1)
memory usage: 7.6+ KB


Goal of the linear regression: predict the weight (Weight) of perch from its other features (Length, Diagonal, Height, Width).

Weight ~ Length + Diagonal + Height + Width

In [8]:
perch = fish[fish.Species == 'Perch']  # select perch only
perch.head()

    Species  Weight  Length  Diagonal  Height   Width
72    Perch     5.9     8.4       8.8  2.1120  1.4080
73    Perch    32.0    13.7      14.7  3.5280  1.9992
74    Perch    40.0    15.0      16.0  3.8240  2.4320
75    Perch    51.5    16.2      17.2  4.5924  2.6316
76    Perch    70.0    17.4      18.5  4.5880  2.9415


In [11]:
# Features (independent variables)
X = perch[['Length', 'Diagonal', 'Height', 'Width']].values

In [13]:
X.shape

(56, 4)

In [15]:
# Label / target (dependent variable)
y = perch['Weight'].values

# train/test split

In [16]:
X_train,X_test,y_train,y_test=train_test_split(X,y,
                                               test_size=0.25,
                                               random_state=42)

In [18]:
X_train.shape,X_test.shape

((42, 4), (14, 4))

In [19]:
y_train.shape,y_test.shape

((42,), (14,))

# Linear Regression with First-Order Terms Only

$
\hat{y}= w_0 + w_1 \times x_1 + w_2 \times x_2 + w_3 \times x_3 + w_4 \times x_4
$

In [20]:
lin_reg = LinearRegression()  # create the linear regression estimator

In [21]:
lin_reg.fit(X_train, y_train)  # fit the estimator to the training data (train the model)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [24]:
lin_reg.intercept_  # w0: intercept (bias)

-610.0275364260526

In [26]:
lin_reg.coef_  # [w1 w2 w3 w4]: array of coefficients
# w1 * Length + w2 * Diagonal + w3 * Height + w4 * Width

array([-40.18338554,  47.80681727,  67.34086612,  35.34904264])
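The intercept and coefficients above can be plugged into the first-order equation by hand. The following is a minimal sketch that copies the printed values and applies them to the first perch row shown earlier (index 72); the variable names `intercept`, `coef`, `x`, and `y_hat` are chosen for illustration only.

```python
import numpy as np

# Values copied from the fitted model's printed output.
intercept = -610.0275364260526                 # w0
coef = np.array([-40.18338554, 47.80681727,
                 67.34086612, 35.34904264])    # [w1, w2, w3, w4]

# One perch (index 72 in the table): Length, Diagonal, Height, Width.
x = np.array([8.4, 8.8, 2.112, 1.408])

# y_hat = w0 + w1*x1 + w2*x2 + w3*x3 + w4*x4
y_hat = intercept + coef @ x
print(round(y_hat, 2))  # -334.87
```

This fish has an actual Weight of 5.9, so the sketch also shows that a purely first-order model can produce a negative weight for a small fish.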

In [27]:
train_pred = lin_reg.predict(X_train)  # predictions on the training set

In [28]:
train_pred[:5]

array([ 50.07831254, 149.63115115,  26.52323981, -11.85322276,
       727.07849472])

In [32]:
y_train[:5]  # actual values

array([ 85., 135.,  78.,  70., 700.])

In [34]:
# RMSE 
np.sqrt(mean_squared_error(y_true=y_train,y_pred=train_pred))

73.07651173088374

In [35]:
# Coefficient of determination (R^2)
r2_score(y_train,train_pred)

0.9567246116638569
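Both metrics are simple to compute by hand, which helps when reading the numbers. The sketch below is a NumPy re-implementation under the usual definitions (RMSE is the square root of the mean squared residual; R^2 is one minus the ratio of residual to total sum of squares). The helper names `rmse` and `r2` are hypothetical, and the demo arrays are only the first five training rows printed earlier, so the results will differ from the full-set values.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)

# First five training targets and predictions (illustration only).
y_demo = [85.0, 135.0, 78.0, 70.0, 700.0]
pred_demo = [50.07831254, 149.63115115, 26.52323981, -11.85322276, 727.07849472]
print(rmse(y_demo, pred_demo), r2(y_demo, pred_demo))
```

An R^2 of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean of the targets.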

In [36]:
test_pred = lin_reg.predict(X_test)  # predictions on the test set

In [37]:
test_pred[:5]

array([-334.87262176,   53.65873458,  318.38723843,  178.88939119,
        155.66294578])

In [40]:
y_test[:5]  # actual test-set values

array([  5.9, 100. , 250. , 130. , 130. ])

In [42]:
np.sqrt(mean_squared_error(y_test,test_pred)) # RMSE

110.1835310901991

In [43]:
r2_score(y_test, test_pred)  # coefficient of determination (R^2)

0.879046561599027

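The first-order-only linear regression model overfits slightly: it scores better on the training set (RMSE ≈ 73.1, R² ≈ 0.957) than on the test set (RMSE ≈ 110.2, R² ≈ 0.879). It also predicts negative weights for small fish, for example -334.87 for a test fish whose actual weight is 5.9.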

# Linear Regression with Terms up to Second Order

$
\hat{y}=w_0+ w_1\times x_1 + \cdots + w_4 \times x_4 + w_5 \times x_1^2 + \cdots
$

In [47]:
poly = PolynomialFeatures(degree=2, include_bias=False)  # transformer that adds polynomial terms
# degree=2 (default): consider terms up to second order
# interaction_only=False (default): add all of x1^2, x2^2, x1*x2, ...

In [44]:
scaler = StandardScaler()  # create the standardization transformer

In [45]:
lin_reg = LinearRegression()  # create the ML estimator

In [48]:
# Create the Pipeline object
model = Pipeline(steps=[('poly',poly),
                        ('scaler',scaler),
                        ('lin_reg',lin_reg)])

In [49]:
# Fit the ML model to the training set.
model.fit(X_train,y_train)

Pipeline(memory=None,
         steps=[('poly',
                 PolynomialFeatures(degree=2, include_bias=False,
                                    interaction_only=False, order='C')),
                ('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('lin_reg',
                 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                                  normalize=False))],
         verbose=False)

In [51]:
model['lin_reg'].intercept_  # intercept found by the linear regression step after fitting

400.83333333332587
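The intercept here should equal the mean of `y_train`: `StandardScaler` centers every feature at zero, and an ordinary least squares fit with an intercept on centered features puts the target mean in the intercept. The sketch below checks that property with NumPy on synthetic data. The data, the seed, and the names `X_std`, `A`, and `w` are assumptions made for illustration, not the perch data or scikit-learn's internals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a training set (42 rows, 3 features).
X = rng.normal(loc=20.0, scale=5.0, size=(42, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=42)

# Standardize: zero mean and unit variance per column.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Least squares with an explicit intercept column of ones.
A = np.column_stack([np.ones(len(X_std)), X_std])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

intercept = w[0]
print(intercept, y.mean())  # the two values agree up to rounding error
```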

In [53]:
model['lin_reg'].coef_  # coefficients found by the linear regression step after fitting

array([   -443.26816039,    1150.91134799,    -650.22360319,
          -368.62831244,  115424.97558536, -210083.78541706,
        -49872.08633924,   29100.85132271,   91656.18352525,
         53699.90248992,  -27521.03052328,    1226.11352267,
         -5243.73927458,    2288.55011685])

In [54]:
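model['poly'].get_feature_names()  # list of polynomial terms generated by PolynomialFeatures
# Note: this method was deprecated in scikit-learn 1.0 and removed in 1.2; newer versions use get_feature_names_out()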

['x0',
 'x1',
 'x2',
 'x3',
 'x0^2',
 'x0 x1',
 'x0 x2',
 'x0 x3',
 'x1^2',
 'x1 x2',
 'x1 x3',
 'x2^2',
 'x2 x3',
 'x3^2']
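The 14 names above are every product of one or two of the four input features. The sketch below reproduces that list and the expansion itself with `itertools` and NumPy. It is an illustration of the idea, not scikit-learn's implementation, and the helper names `poly_terms`, `term_name`, and `expand` are hypothetical.

```python
from itertools import combinations_with_replacement

import numpy as np

def poly_terms(n_features, degree=2):
    """Index tuples for every term of degree 1..degree (no bias term)."""
    terms = []
    for d in range(1, degree + 1):
        terms.extend(combinations_with_replacement(range(n_features), d))
    return terms

def term_name(term):
    """(2,) -> 'x2', (0, 0) -> 'x0^2', (0, 1) -> 'x0 x1'."""
    parts = []
    for i in sorted(set(term)):
        power = term.count(i)
        parts.append(f"x{i}" if power == 1 else f"x{i}^{power}")
    return " ".join(parts)

def expand(X, degree=2):
    """Build the polynomial feature matrix column by column."""
    X = np.asarray(X, dtype=float)
    terms = poly_terms(X.shape[1], degree)
    return np.column_stack([np.prod(X[:, list(t)], axis=1) for t in terms])

names = [term_name(t) for t in poly_terms(4)]
print(len(names))                  # 14
print(expand([[2.0, 3.0]]))        # [[2. 3. 4. 6. 9.]]
```

With two inputs 2 and 3, the columns are x0, x1, x0^2, x0 x1, x1^2, which gives 2, 3, 4, 6, 9.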

In [56]:
train_pred = model.predict(X_train)

In [59]:
train_pred[:5]

array([ 86.22462498, 117.8371985 ,  65.36623277,  51.32036181,
       688.61814191])

In [58]:
y_train[:5]

array([ 85., 135.,  78.,  70., 700.])

In [60]:
np.sqrt(mean_squared_error(y_train, train_pred))  # training-set RMSE

31.408812188346158

In [61]:
r2_score(y_train, train_pred)  # training-set coefficient of determination (R^2)

0.9920055538341124

In [62]:
test_pred = model.predict(X_test)

In [63]:
test_pred[:5]

array([ 23.11093892,  16.86703258, 283.14558245, 126.83444969,
       121.43654058])

In [65]:
y_test[:5]

array([  5.9, 100. , 250. , 130. , 130. ])

In [66]:
np.sqrt(mean_squared_error(y_test, test_pred))  # test-set RMSE

71.36392024375351

In [67]:
r2_score(y_test, test_pred)  # test-set coefficient of determination (R^2)

0.949260960155265