### Decision Tree Regression (Regression Tree)
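- Decision trees and decision-tree-based ensemble algorithms can be used for regression as well as classification.
- A regression tree splits the data much as a classification tree does. After the final split, each region (leaf) predicts the mean of the target values of the training samples that fall in it, and the splits are chosen so that the actual data in each region lie as close as possible to that mean.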
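
The leaf-mean behavior described above can be checked on a tiny example. This is a minimal sketch on made-up toy data (not the cattle dataset), assuming scikit-learn's `DecisionTreeRegressor` with its default squared-error criterion; `max_depth=1` is chosen only so that the tree has exactly two regions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data (made up for illustration): two clearly separated groups
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([10.0, 12.0, 14.0, 50.0, 52.0, 54.0])

# max_depth=1 allows a single split, so the tree has two leaf regions
tree = DecisionTreeRegressor(max_depth=1, random_state=124)
tree.fit(X, y)

left_pred = tree.predict([[2.5]])[0]
right_pred = tree.predict([[11.5]])[0]

# Each prediction equals the mean target of the training samples in that leaf
print(left_pred, y[:3].mean())
print(right_pred, y[3:].mean())
```

Any new sample that lands in the left region gets the same prediction, which is why a regression tree produces a step-shaped (piecewise constant) fit.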

<img src="./images/decision_tree_regression01.png" width="600p" style="margin-left: 20px">

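- A regression tree with a complex structure is also prone to overfitting, which should be mitigated by limiting the tree size (depth), the number of nodes, and similar constraints.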

<img src="./images/decision_tree_regression02.png" width="600p" style="margin-left: 20px">
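
As a rough illustration of the size limits mentioned above, the sketch below compares an unconstrained tree with a constrained one on synthetic noisy data. The data and the particular values `max_depth=3` and `min_samples_leaf=10` are assumptions chosen for the example, not tuned settings.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (assumption: a sine curve plus Gaussian noise)
rng = np.random.RandomState(124)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

# Unconstrained tree: keeps splitting until it fits the noise
full_tree = DecisionTreeRegressor(random_state=124).fit(X, y)

# Constrained tree: depth and minimum leaf size are limited
small_tree = DecisionTreeRegressor(
    max_depth=3, min_samples_leaf=10, random_state=124
).fit(X, y)

print('leaves (unconstrained):', full_tree.get_n_leaves())
print('leaves (constrained):  ', small_tree.get_n_leaves())
```

A depth-3 tree can have at most 8 leaves, so it cannot memorize individual noisy samples the way the unconstrained tree does.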

- Regression trees are a good choice when the relationship between the independent variables and the dependent variable is strongly nonlinear.

<img src="./images/decision_tree_regression03.png" width="800p" style="margin-left: 20px">
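
To make the nonlinearity point concrete, here is a small sketch that fits a linear model and a regression tree to the same synthetic sine-shaped data and compares their test R2 scores. The data, `max_depth=5`, and the split settings are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, strongly nonlinear relationship (assumption: y = sin(x) + noise)
rng = np.random.RandomState(124)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=124
)

linear = LinearRegression().fit(X_train, y_train)
tree = DecisionTreeRegressor(max_depth=5, random_state=124).fit(X_train, y_train)

r2_linear = r2_score(y_test, linear.predict(X_test))
r2_tree = r2_score(y_test, tree.predict(X_test))
print('linear R2:', r2_linear)
print('tree R2:  ', r2_tree)
```

A straight line cannot follow the oscillation, while the tree approximates it with many short constant segments.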

In [1]:
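import chardet

# Read the raw bytes and let chardet guess the encoding (EUC-KR is used below)
with open('./datasets/korea_cow.csv', 'rb') as f:
    rawdata = f.read()
chardet.detect(rawdata)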

In [2]:
import pandas as pd

c_df = pd.read_csv('./datasets/korea_cow.csv', encoding='EUC-KR')
c_df

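| | 일자 | 번호 | 출하주 | 개체번호 | 성별 | kpn | 계대 | 중량 | 최저가 | 낙찰가 | 상태 | 비고 | 종류 | 지역 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021.07.23 | 4 | 서*호 | 48928970 | 암 | 550.0 | 3.0 | 580 | 360 | 363 | 낙찰 | 목.배밑혹 | 큰소 | 경상남도고성 |
| 1 | 2021.07.23 | 5 | 이*락 | 102112702 | 암 | 744.0 | 2.0 | 460 | 320 | 353 | 낙찰 | NaN | 큰소 | 경상남도고성 |
| 2 | 2021.07.23 | 7 | 문*종 | 156144852 | 암 | 1263.0 | 4.0 | 340 | 400 | 471 | 낙찰 | 목이모색 상처 | 큰소 | 경상남도고성 |
| 3 | 2021.07.23 | 8 | 문*종 | 136983661 | 암 | 1159.0 | 2.0 | 380 | 400 | 432 | 낙찰 | 뒷다리약간절음 | 큰소 | 경상남도고성 |
| 4 | 2021.07.23 | 9 | 이*만 | 138655532 | 암 | 1124.0 | 6.0 | 550 | 650 | 766 | 낙찰 | NaN | 큰소 | 경상남도고성 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 19976 | 2021.06.22 | 320 | 윤*식 | 157190517 | 암 | 0.0 | 1.0 | 0 | 390 | 0 | 유찰 | NaN | 혈통우 | 전라남도 함평 |
| 19977 | 2021.06.22 | 321 | 윤*식 | 154652064 | 암 | 0.0 | 1.0 | 0 | 430 | 0 | 유찰 | NaN | 혈통우 | 전라남도 함평 |
| 19978 | 2021.06.22 | 322 | 윤*식 | 156278395 | 암 | 0.0 | 1.0 | 0 | 450 | 0 | 유찰 | NaN | 혈통우 | 전라남도 함평 |
| 19979 | 2021.06.22 | 323 | 윤*식 | 155232402 | 암 | 0.0 | 1.0 | 0 | 460 | 530 | 낙찰 | 정영기 -> 박손엽 | 혈통우 | 전라남도 함평 |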


In [3]:
c_df.isna().sum()

일자          0
번호          0
출하주        34
개체번호        0
성별          1
kpn        10
계대         10
중량          0
최저가         0
낙찰가         0
상태          0
비고      12319
종류          0
지역          0
dtype: int64

In [4]:
c_df.성별.value_counts()

성별
수     10806
암      8996
거세      168
프        10
Name: count, dtype: int64

In [5]:
columns = ['성별', '중량', '상태', '종류', '낙찰가']
pre_c_df = c_df.loc[:, columns]
pre_c_df

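| | 성별 | 중량 | 상태 | 종류 | 낙찰가 |
|---|---|---|---|---|---|
| 0 | 암 | 580 | 낙찰 | 큰소 | 363 |
| 1 | 암 | 460 | 낙찰 | 큰소 | 353 |
| 2 | 암 | 340 | 낙찰 | 큰소 | 471 |
| 3 | 암 | 380 | 낙찰 | 큰소 | 432 |
| 4 | 암 | 550 | 낙찰 | 큰소 | 766 |
| ... | ... | ... | ... | ... | ... |
| 19976 | 암 | 0 | 유찰 | 혈통우 | 0 |
| 19977 | 암 | 0 | 유찰 | 혈통우 | 0 |
| 19978 | 암 | 0 | 유찰 | 혈통우 | 0 |
| 19979 | 암 | 0 | 낙찰 | 혈통우 | 530 |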


In [6]:
pre_c_df.상태.value_counts()

상태
낙찰    17343
대기     1621
유찰     1016
보류        1
Name: count, dtype: int64

In [7]:
pre_c_df = pre_c_df[pre_c_df.상태 == '낙찰']
pre_c_df

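| | 성별 | 중량 | 상태 | 종류 | 낙찰가 |
|---|---|---|---|---|---|
| 0 | 암 | 580 | 낙찰 | 큰소 | 363 |
| 1 | 암 | 460 | 낙찰 | 큰소 | 353 |
| 2 | 암 | 340 | 낙찰 | 큰소 | 471 |
| 3 | 암 | 380 | 낙찰 | 큰소 | 432 |
| 4 | 암 | 550 | 낙찰 | 큰소 | 766 |
| ... | ... | ... | ... | ... | ... |
| 19973 | 암 | 0 | 낙찰 | 혈통우 | 460 |
| 19974 | 암 | 0 | 낙찰 | 혈통우 | 451 |
| 19975 | 암 | 0 | 낙찰 | 혈통우 | 480 |
| 19979 | 암 | 0 | 낙찰 | 혈통우 | 530 |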


In [8]:
pre_c_df = pre_c_df[pre_c_df.성별.isin(['수', '암'])]
pre_c_df

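| | 성별 | 중량 | 상태 | 종류 | 낙찰가 |
|---|---|---|---|---|---|
| 0 | 암 | 580 | 낙찰 | 큰소 | 363 |
| 1 | 암 | 460 | 낙찰 | 큰소 | 353 |
| 2 | 암 | 340 | 낙찰 | 큰소 | 471 |
| 3 | 암 | 380 | 낙찰 | 큰소 | 432 |
| 4 | 암 | 550 | 낙찰 | 큰소 | 766 |
| ... | ... | ... | ... | ... | ... |
| 19973 | 암 | 0 | 낙찰 | 혈통우 | 460 |
| 19974 | 암 | 0 | 낙찰 | 혈통우 | 451 |
| 19975 | 암 | 0 | 낙찰 | 혈통우 | 480 |
| 19979 | 암 | 0 | 낙찰 | 혈통우 | 530 |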


In [9]:
pre_c_df.성별.value_counts()

성별
수    9789
암    7426
Name: count, dtype: int64

In [10]:
male_cow = pre_c_df[pre_c_df.성별 == '수'].sample(7426, random_state=124)
female_cow = pre_c_df[pre_c_df.성별 == '암']

pre_c_df = pd.concat([male_cow, female_cow])

In [11]:
pre_c_df.성별.value_counts()

성별
수    7426
암    7426
Name: count, dtype: int64
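
The balancing step above hardcodes the minority count (7426). A minimal sketch of a reusable alternative is shown below. The helper name `undersample` is hypothetical, and the toy frame uses made-up values rather than rows from the cattle data.

```python
import pandas as pd

def undersample(df, column, random_state=124):
    """Hypothetical helper: downsample every class in `column`
    to the size of the smallest class."""
    n_min = df[column].value_counts().min()
    parts = [
        group.sample(n_min, random_state=random_state)
        for _, group in df.groupby(column)
    ]
    return pd.concat(parts).reset_index(drop=True)

# Toy frame (made-up values) with a 4:2 imbalance
toy = pd.DataFrame({
    '성별': ['수', '수', '수', '수', '암', '암'],
    '낙찰가': [300, 310, 320, 330, 400, 410],
})

balanced = undersample(toy, '성별')
print(balanced['성별'].value_counts())
```

Deriving the sample size from `value_counts().min()` keeps the cell correct if the earlier filters change the class counts.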

In [12]:
pre_c_df.중량.value_counts()

# Far too many rows have a weight of 0. Replacing them with the mean would
# leave the feature with almost no influence,
# so the 중량 (weight) feature is dropped altogether.

중량
0      8528
250     526
240     506
1       457
230     423
       ... 
110       1
790       1
248       1
285       1
205       1
Name: count, Length: 132, dtype: int64
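
The reasoning in the comments above can be quantified. The sketch below uses the counts shown in the outputs (8528 zero weights out of 7426 + 7426 = 14852 rows) and then, on a made-up toy column, shows what mean imputation would do to a feature that is mostly zeros. The toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Counts taken from the outputs above: zero weights and total rows
n_zero, n_rows = 8528, 14852
zero_ratio = n_zero / n_rows
print(f'share of zero weights: {zero_ratio:.1%}')

# Toy column (made-up values) with the same problem: most weights are 0
toy = pd.Series([0, 0, 0, 0, 0, 0, 250, 240, 230, 260], dtype=float)

# Treat 0 as missing and fill with the mean of the recorded weights
fill_value = toy[toy != 0].mean()
imputed = toy.replace(0, np.nan).fillna(fill_value)

# Share of rows that now hold the same constant value
constant_share = (imputed == fill_value).mean()
print(f'rows equal to the fill value: {constant_share:.0%}')
```

When more than half of a column collapses to one constant, it carries little information about the target, which supports dropping the feature here.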

In [13]:
pre_c_df = pre_c_df.drop(labels=['중량', '상태'], axis=1).reset_index(drop=True)
pre_c_df

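| | 성별 | 종류 | 낙찰가 |
|---|---|---|---|
| 0 | 수 | 혈통우 | 291 |
| 1 | 수 | 혈통우 | 459 |
| 2 | 수 | 혈통우 | 289 |
| 3 | 수 | 큰소 | 556 |
| 4 | 수 | 혈통우 | 519 |
| ... | ... | ... | ... |
| 14847 | 암 | 혈통우 | 460 |
| 14848 | 암 | 혈통우 | 451 |
| 14849 | 암 | 혈통우 | 480 |
| 14850 | 암 | 혈통우 | 530 |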


In [14]:
pre_c_df.종류.value_counts()

종류
혈통우    10329
큰소      4523
Name: count, dtype: int64

In [15]:
super_cow = pre_c_df[pre_c_df.종류 == '혈통우'].sample(4523, random_state=124)
big_cow = pre_c_df[pre_c_df.종류 == '큰소']

pre_c_df = pd.concat([super_cow, big_cow])

In [16]:
pre_c_df = pre_c_df.reset_index(drop=True)

In [17]:
pre_c_df.종류.value_counts()

종류
혈통우    4523
큰소     4523
Name: count, dtype: int64

In [18]:
pre_c_df

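| | 성별 | 종류 | 낙찰가 |
|---|---|---|---|
| 0 | 수 | 혈통우 | 336 |
| 1 | 수 | 혈통우 | 549 |
| 2 | 암 | 혈통우 | 428 |
| 3 | 수 | 혈통우 | 376 |
| 4 | 수 | 혈통우 | 579 |
| ... | ... | ... | ... |
| 9041 | 암 | 큰소 | 856 |
| 9042 | 암 | 큰소 | 520 |
| 9043 | 암 | 큰소 | 907 |
| 9044 | 암 | 큰소 | 927 |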


In [21]:
from sklearn.preprocessing import LabelEncoder

columns = ['성별', '종류']
encoders = {}

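# Note: fitting on a column that is already integer-encoded (e.g. after
# re-running this cell) makes classes_ show [0, 1] instead of the original
# labels, as for 성별 in the output below.
for column in columns:
    encoder = LabelEncoder()
    result = encoder.fit_transform(pre_c_df[column])
    pre_c_df[column] = result
    encoders[column] = encoder.classes_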

print(encoders)

{'성별': array([0, 1]), '종류': array(['큰소', '혈통우'], dtype=object)}
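
One way to keep the label mapping recoverable is to store the fitted encoder itself rather than only `classes_`. This is a minimal sketch on a made-up toy frame, and storing the encoder object is a suggested variation, not what the cell above does.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (made-up rows) with the same two categorical columns
toy = pd.DataFrame({
    '성별': ['수', '암', '수'],
    '종류': ['혈통우', '큰소', '큰소'],
})

encoders = {}
for column in ['성별', '종류']:
    encoder = LabelEncoder()
    toy[column] = encoder.fit_transform(toy[column])
    encoders[column] = encoder  # keep the fitted encoder, not just classes_

# classes_ are sorted, so index 0 is '수' and index 1 is '암'
print(list(encoders['성별'].classes_))

# inverse_transform maps the integer codes back to the original labels
decoded = encoders['성별'].inverse_transform(toy['성별'].to_numpy())
print(list(decoded))
```

With the encoder kept, predictions or feature values can be translated back to readable labels later.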


In [23]:
pre_c_df

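| | 성별 | 종류 | 낙찰가 |
|---|---|---|---|
| 0 | 0 | 1 | 336 |
| 1 | 0 | 1 | 549 |
| 2 | 1 | 1 | 428 |
| 3 | 0 | 1 | 376 |
| 4 | 0 | 1 | 579 |
| ... | ... | ... | ... |
| 9041 | 1 | 0 | 856 |
| 9042 | 1 | 0 | 520 |
| 9043 | 1 | 0 | 907 |
| 9044 | 1 | 0 | 927 |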


In [22]:
from statsmodels.api import OLS

# Use .iloc for positional selection; pre_c_df[:, -1] raises InvalidIndexError
features, targets = pre_c_df.iloc[:, :-1], pre_c_df.iloc[:, -1]

model = OLS(targets, features)
print(model.fit().summary())

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

def get_vif(features):
    vif = pd.DataFrame()
    vif['vif_score'] = [variance_inflation_factor(features.values, i) for i in range(features.shape[1])]
    vif['feature'] = features.columns
    return vif

In [None]:
get_vif(features)
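
The model-comparison cell that follows calls `get_evaluation`, which is not defined anywhere in this section. A minimal sketch of such a helper is given here, assuming it is meant to report MSE, RMSE, and R2. The exact metrics and output format of the original helper are unknown, so treat this as a hypothetical stand-in.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def get_evaluation(y_test, prediction):
    """Hypothetical stand-in: print and return MSE, RMSE and R2."""
    mse = mean_squared_error(y_test, prediction)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, prediction)
    print(f'MSE: {mse:.4f}, RMSE: {rmse:.4f}, R2: {r2:.4f}')
    return mse, rmse, r2

# Quick check on made-up values
mse, rmse, r2 = get_evaluation([3.0, 5.0, 7.0], [2.0, 5.0, 8.0])
```

Returning the values as well as printing them makes it easy to collect the metrics for several models into one table.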

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

features, targets = pre_c_df.iloc[:, :-1], pre_c_df.iloc[:, -1]

X_train, X_test, y_train, y_test = \
train_test_split(features, targets, test_size=0.2, random_state=124)

dt_r = DecisionTreeRegressor(random_state=124)
rf_r = RandomForestRegressor(random_state=124, n_estimators=1000)
gb_r = GradientBoostingRegressor(random_state=124)
xgb_r = XGBRegressor()
lgb_r = LGBMRegressor(n_estimators=100)

models = [dt_r, rf_r, gb_r, xgb_r, lgb_r]
for model in models:
    model.fit(X_train, y_train)
    prediction = model.predict(X_test)
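    print(model.__class__.__name__)
    # get_evaluation is not defined in this section; it is assumed to be a
    # helper defined elsewhere that prints regression metrics
    get_evaluation(y_test, prediction)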
    
# Do not combine polynomial features with tree models; use one or the other.

In [24]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

features, targets = pre_c_df.iloc[:, :-1], pre_c_df.iloc[:, -1]

X_train, X_test, y_train, y_test = \
train_test_split(features, targets, test_size=0.2, random_state=124)

rf_r = RandomForestRegressor(random_state=124)

# Keys must be valid RandomForestRegressor parameter names
parameters = {'max_depth': [4, 8, 12, 20], 'min_samples_split': [20, 30, 40, 50, 60], 'n_estimators': [10, 50, 100, 500, 1000]}
kfold = KFold(n_splits=10, random_state=124, shuffle=True)

# scoring: mean_squared_error = mse
# GridSearchCV(rf_r, param_grid=parameters, scoring='neg_mean_squared_error', cv=kfold)
grid_rf_r = GridSearchCV(rf_r, param_grid=parameters, scoring='r2', cv=kfold)
grid_rf_r.fit(X_train, y_train)

In [None]:
result_df = pd.DataFrame(grid_rf_r.cv_results_)[['params', 'mean_test_score', 'rank_test_score']]
display(result_df)
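
After the search finishes, the best setting and a held-out score are usually the next things to look at. The sketch below shows that step on synthetic data with a deliberately tiny grid so that it runs quickly. The data, column names `a` and `b`, and grid values are assumptions, while `best_params_`, `best_score_`, and `best_estimator_` are standard `GridSearchCV` attributes.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Synthetic data (assumption): two binary features, like the encoded columns
rng = np.random.RandomState(124)
X = pd.DataFrame({'a': rng.randint(0, 2, 200), 'b': rng.randint(0, 2, 200)})
y = 300 + 100 * X['a'] + 200 * X['b'] + rng.normal(scale=10, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=124
)

# A much smaller grid than the one above, only to keep the example fast
parameters = {'max_depth': [2, 4], 'n_estimators': [10, 50]}
kfold = KFold(n_splits=3, shuffle=True, random_state=124)

grid = GridSearchCV(
    RandomForestRegressor(random_state=124),
    param_grid=parameters, scoring='r2', cv=kfold
)
grid.fit(X_train, y_train)

print(grid.best_params_)   # best parameter combination found by the search
print(grid.best_score_)    # mean cross-validated R2 of that combination

# best_estimator_ is refit on the whole training set (refit=True by default)
test_r2 = grid.best_estimator_.score(X_test, y_test)
print(test_r2)
```

The cross-validated score is computed only on training folds, so the final check on `X_test` is what indicates how the tuned model generalizes.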