# Supervised Learning (Regression) Practice
---
**Dataset**: fetch_california_housing

**Linear Regression**

### Import the required libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Data preparation & preprocessing

In [2]:
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
type(housing)

sklearn.utils._bunch.Bunch

In [3]:
housing.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'feature_names', 'DESCR'])

In [4]:
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df.head()

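|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |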


In [5]:
df['target'] = housing.target
df.head()

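|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | target |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|--------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |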


In [6]:
df.shape

(20640, 9)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   MedInc      20640 non-null  float64
 1   HouseAge    20640 non-null  float64
 2   AveRooms    20640 non-null  float64
 3   AveBedrms   20640 non-null  float64
 4   Population  20640 non-null  float64
 5   AveOccup    20640 non-null  float64
 6   Latitude    20640 non-null  float64
 7   Longitude   20640 non-null  float64
 8   target      20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB


In [8]:
# Count missing values per column
df.isna().sum(axis=0)

MedInc        0
HouseAge      0
AveRooms      0
AveBedrms     0
Population    0
AveOccup      0
Latitude      0
Longitude     0
target        0
dtype: int64

In [9]:
# Count duplicated rows
df.duplicated().sum()

0

In [10]:
df.columns

Index(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',
       'Latitude', 'Longitude', 'target'],
      dtype='object')

In [11]:
# Split into features X (three selected columns) and target y
X = df[['MedInc', 'HouseAge', 'AveRooms']]
y = df['target']

### Without feature scaling
---

In [12]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

(16512, 3) (4128, 3)
(16512,) (4128,)
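
The shapes above follow from holding out 20% of the 20,640 rows. As a rough illustration of what a shuffled hold-out split does, here is a minimal standard-library sketch. The helper `split_indices` is hypothetical and is not scikit-learn's implementation, so the same seed will not reproduce the rows that `train_test_split` selects; only the split sizes are comparable.

```python
import math
import random

def split_indices(n_samples, test_size=0.2, seed=42):
    """Shuffle row indices and cut them into (train, test) index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # reproducible shuffle
    n_test = math.ceil(n_samples * test_size)  # rows held out for testing
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(20640, test_size=0.2, seed=42)
print(len(train_idx), len(test_idx))  # 16512 4128
```

Every row lands in exactly one of the two lists, which is what keeps the test rows unseen during training.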


In [13]:
X_train = X_train.values
y_train = y_train.values

In [14]:
print(type(X_train), type(y_train))

<class 'numpy.ndarray'> <class 'numpy.ndarray'>


### Model training: Linear Regression

In [15]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)

- **Prediction**

In [16]:
X_test = X_test.values
y_test = y_test.values

In [17]:
print(type(X_test), type(y_test))

<class 'numpy.ndarray'> <class 'numpy.ndarray'>


In [18]:
y_pred = lr.predict(X_test)
y_pred[:5]

array([1.06791912, 1.50634095, 2.32862562, 2.68184955, 2.09182437])

- **Performance evaluation** (RMSE)

In [19]:
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
rmse

0.8117332473994358
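
RMSE is the square root of the mean squared difference between true and predicted values, so it is expressed in the same units as the target. For this dataset the target is the median house value in units of $100,000, so an RMSE of about 0.81 corresponds to roughly $81,000. The sketch below computes the same quantity by hand on made-up numbers; `rmse_by_hand` and the demo lists are illustrative names, not part of the notebook.

```python
import math

def rmse_by_hand(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up values: squared errors are 1, 0, 4, 1, so the mean is 1.5.
y_true_demo = [3.0, 5.0, 2.0, 7.0]
y_pred_demo = [2.0, 5.0, 4.0, 8.0]
demo_rmse = rmse_by_hand(y_true_demo, y_pred_demo)
print(demo_rmse)  # sqrt(1.5), about 1.2247
```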

### With feature scaling
- StandardScaler (standardization)
- MinMaxScaler (min-max normalization)
---

In [20]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

(16512, 3) (4128, 3)
(16512,) (4128,)


In [21]:
# from sklearn.preprocessing import StandardScaler

# scaler = StandardScaler()
# X_train_s = scaler.fit_transform(X_train)
# y_train = y_train.values

# print(type(X_train_s), type(y_train))

In [22]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
y_train = y_train.values

print(type(X_train_s), type(y_train))

<class 'numpy.ndarray'> <class 'numpy.ndarray'>
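
Min-max normalization maps each feature to `(x - min) / (max - min)`, where `min` and `max` are learned from the training data only and then reused for the test data. The following sketch shows that fit/transform separation for a single column of made-up values. The helper names are hypothetical and this is not the `MinMaxScaler` source.

```python
def fit_min_max(column):
    """Learn the minimum and maximum of one training feature column."""
    return min(column), max(column)

def transform_min_max(column, lo, hi):
    """Rescale values using statistics learned from the training column."""
    return [(v - lo) / (hi - lo) for v in column]

train_col = [2.0, 4.0, 6.0, 10.0]
test_col = [1.0, 6.0, 12.0]

lo, hi = fit_min_max(train_col)                   # fit on training data only
train_scaled = transform_min_max(train_col, lo, hi)
test_scaled = transform_min_max(test_col, lo, hi)  # reuse training min/max

print(train_scaled)  # [0.0, 0.25, 0.5, 1.0]
print(test_scaled)   # [-0.125, 0.5, 1.25]
```

Test values can fall outside [0, 1] when they lie beyond the training range. That is expected and is the reason the test split is passed through `transform` rather than `fit_transform`.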


### 1. Model training: Linear Regression

In [23]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train_s, y_train)

- **Prediction**

In [24]:
X_test_s = scaler.transform(X_test)
y_test = y_test.values

In [25]:
print(type(X_test_s), type(y_test))

<class 'numpy.ndarray'> <class 'numpy.ndarray'>


In [26]:
y_pred = lr.predict(X_test_s)
y_pred[:5]

array([1.06791912, 1.50634095, 2.32862562, 2.68184955, 2.09182437])
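
These predictions match the unscaled run digit for digit. That is expected for ordinary least squares without regularization: an affine rescaling of a feature is absorbed by the coefficient and the intercept, so the fitted values do not change. The sketch below checks this for a single made-up feature using the closed-form one-variable fit. All names and numbers here are illustrative assumptions.

```python
def fit_line(x, y):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

x_train_demo = [1.0, 2.0, 4.0, 7.0, 9.0]
y_train_demo = [1.2, 1.9, 3.1, 5.2, 6.8]
x_test_demo = [3.0, 8.0, 12.0]

# Fit and predict on the raw feature.
slope_raw, intercept_raw = fit_line(x_train_demo, y_train_demo)
pred_raw = [slope_raw * v + intercept_raw for v in x_test_demo]

# Fit and predict on the min-max scaled feature (training min/max only).
lo, hi = min(x_train_demo), max(x_train_demo)

def scale(values):
    return [(v - lo) / (hi - lo) for v in values]

slope_s, intercept_s = fit_line(scale(x_train_demo), y_train_demo)
pred_scaled = [slope_s * v + intercept_s for v in scale(x_test_demo)]

# The largest difference between the two sets of predictions is
# on the order of floating-point rounding error.
max_gap = max(abs(a - b) for a, b in zip(pred_raw, pred_scaled))
print(max_gap)
```

The same invariance does not hold for regularized or distance-based models such as Lasso, Ridge, or SVR, which is where scaling changes the result.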

- **Performance evaluation** (RMSE)

In [27]:
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
rmse

0.8117332473994358

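- **Cross-validation** (mean RMSE)

In [28]:
from sklearn.model_selection import cross_val_score

# Note: this runs 3-fold CV on the scaled test split. Cross-validation is
# more commonly run on the training split, keeping the test split held out.
# The scorer returns negated MSE values (higher is better), one per fold.
mse = cross_val_score(lr, X_test_s, y_test,
                      scoring='neg_mean_squared_error',
                      cv=3)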

In [29]:
np.mean(np.sqrt(-mse))

0.8139819987466224
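
The mean RMSE above is obtained by negating each fold's score to recover the MSE, taking the square root per fold, and averaging. The sketch below walks through that arithmetic with contiguous, unshuffled folds and a placeholder model that always predicts the training-fold mean. The helper `k_fold_indices`, the placeholder model, and the data are assumptions for illustration, not scikit-learn internals.

```python
import math

def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) for k contiguous, unshuffled folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples)
                     if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

y_demo = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

neg_mse_scores = []
for train_idx, val_idx in k_fold_indices(len(y_demo), k=3):
    # Placeholder "model": predict the mean of the training fold.
    prediction = sum(y_demo[i] for i in train_idx) / len(train_idx)
    fold_mse = sum((y_demo[i] - prediction) ** 2 for i in val_idx) / len(val_idx)
    neg_mse_scores.append(-fold_mse)  # same sign convention as the scorer

# Negate back to MSE, take the root per fold, then average.
fold_rmse = [math.sqrt(-score) for score in neg_mse_scores]
mean_rmse = sum(fold_rmse) / len(fold_rmse)
print(fold_rmse, mean_rmse)
```

Averaging per-fold RMSE values is not the same as taking the square root of the average MSE. The notebook uses the former.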

### 2. Model training: Decision Tree

In [30]:
from sklearn.tree import DecisionTreeRegressor

dtr = DecisionTreeRegressor()
dtr.fit(X_train_s, y_train)

- **Prediction**

In [31]:
y_pred = dtr.predict(X_test_s)
y_pred[:5]

array([1.455, 0.926, 4.328, 2.456, 1.875])

- **Performance evaluation** (RMSE)

In [32]:
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
rmse

1.0536125854753235

- **Cross-validation** (mean RMSE)

In [33]:
from sklearn.model_selection import cross_val_score

mse = cross_val_score(dtr, X_test_s, y_test, 
                      scoring='neg_mean_squared_error', 
                      cv=3)

In [34]:
np.mean(np.sqrt(-mse))

1.061217685560053

### 3. Model training: Random Forest

In [35]:
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor()
rfr.fit(X_train_s, y_train)

- **Prediction**

In [36]:
y_pred = rfr.predict(X_test_s)
y_pred[:5]

array([1.40528  , 1.06476  , 3.2251226, 2.7258902, 1.95615  ])

- **Performance evaluation** (RMSE)

In [37]:
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
rmse

0.7739554342581239

- **Cross-validation** (mean RMSE)

In [38]:
from sklearn.model_selection import cross_val_score

mse = cross_val_score(rfr, X_test_s, y_test, 
                      scoring='neg_mean_squared_error', 
                      cv=3)

In [39]:
np.mean(np.sqrt(-mse))

0.7925287374368689

### 4. Model training: Support Vector Machine (SVM)

In [40]:
from sklearn.svm import SVR

svr = SVR()
svr.fit(X_train_s, y_train)

- **Prediction**

In [41]:
y_pred = svr.predict(X_test_s)
y_pred[:5]

array([1.05282386, 1.21744894, 2.56049758, 2.54431862, 1.71474953])

- **Performance evaluation** (RMSE)

In [42]:
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
rmse

0.7889643608756951

- **Cross-validation** (mean RMSE)

In [43]:
from sklearn.model_selection import cross_val_score

mse = cross_val_score(svr, X_test_s, y_test, 
                      scoring='neg_mean_squared_error', 
                      cv=3)

In [44]:
np.mean(np.sqrt(-mse))

0.7936926022334534

### 5. Model training: Lasso Regression

In [45]:
from sklearn.linear_model import Lasso

lasso = Lasso(random_state=42)
lasso.fit(X_train_s, y_train)

- **Prediction**

In [46]:
y_pred = lasso.predict(X_test_s)
y_pred[:5]

array([2.07194694, 2.07194694, 2.07194694, 2.07194694, 2.07194694])
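
Every prediction here is the same value, which is consistent with the L1 penalty having shrunk all coefficients to zero so that only the intercept (the mean of the training target) remains. With the default `alpha=1.0` and features squeezed into [0, 1], that outcome is plausible. The sketch below shows the mechanism for a single centered feature, where minimizing `(1 / (2n)) * ||y - Xw||^2 + alpha * |w|` has a closed-form soft-thresholding solution. The helper names and demo data are assumptions, and scikit-learn itself uses coordinate descent rather than this one-feature shortcut.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_lasso_1d(x, y, alpha):
    """One-feature Lasso with intercept: returns (coef, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    xc = [v - mean_x for v in x]
    yc = [v - mean_y for v in y]
    rho = sum(a * b for a, b in zip(xc, yc)) / n  # feature-target covariance
    z = sum(a * a for a in xc) / n                # feature variance
    coef = soft_threshold(rho, alpha) / z
    return coef, mean_y - coef * mean_x

x_demo = [0.0, 0.25, 0.5, 0.75, 1.0]  # feature already scaled to [0, 1]
y_demo = [1.0, 1.5, 2.0, 2.5, 3.0]

coef_big, intercept_big = fit_lasso_1d(x_demo, y_demo, alpha=1.0)
coef_small, intercept_small = fit_lasso_1d(x_demo, y_demo, alpha=0.01)

print(coef_big, intercept_big)      # 0.0 2.0 -> constant prediction = mean(y)
print(coef_small, intercept_small)  # about 1.92 and 1.04
```

In this toy case the covariance is 0.25, so any `alpha` of 0.25 or more zeroes the coefficient. Trying a smaller `alpha` in the notebook would be the natural next check.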

- **Performance evaluation** (RMSE)

In [47]:
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
rmse

1.1448563543099792

- **Cross-validation** (mean RMSE)

In [48]:
from sklearn.model_selection import cross_val_score

mse = cross_val_score(lasso, X_test_s, y_test, 
                      scoring='neg_mean_squared_error', 
                      cv=3)

In [49]:
np.mean(np.sqrt(-mse))

1.1447782719413198

### 6. Model training: Ridge Regression

In [50]:
from sklearn.linear_model import Ridge

ridge = Ridge(random_state=42)
ridge.fit(X_train_s, y_train)

- **Prediction**

In [51]:
y_pred = ridge.predict(X_test_s)
y_pred[:5]

array([1.06899598, 1.50946584, 2.32376153, 2.67776153, 2.09325479])

- **Performance evaluation** (RMSE)

In [52]:
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
rmse

0.8116383301013449

- **Cross-validation** (mean RMSE)

In [53]:
from sklearn.model_selection import cross_val_score

mse = cross_val_score(ridge, X_test_s, y_test, 
                      scoring='neg_mean_squared_error', 
                      cv=3)

In [54]:
np.mean(np.sqrt(-mse))

0.8130990178780722
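
To compare the six models side by side, the hold-out RMSE values reported in the cells above can be collected and sorted. The snippet below simply restates those numbers. The dictionary name and labels are illustrative, and the tree-based results may shift between runs because those models were created without a fixed `random_state`.

```python
# Hold-out RMSE values copied from the outputs above (MinMax-scaled features).
test_rmse = {
    "Linear Regression": 0.8117332473994358,
    "Decision Tree": 1.0536125854753235,
    "Random Forest": 0.7739554342581239,
    "SVR": 0.7889643608756951,
    "Lasso": 1.1448563543099792,
    "Ridge": 0.8116383301013449,
}

# Lower RMSE is better, so sort ascending.
ranking = sorted(test_rmse.items(), key=lambda item: item[1])
for name, value in ranking:
    print(f"{name:<18} {value:.4f}")
```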