<div class="alert alert-block" style="border: 1px solid #455A64;background-color:#ECEFF1;">
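This material and the accompanying video content are protected under Article 25, Paragraph 2 of the Copyright Act. Publishing or posting this content, or any excerpt of its text, externally is prohibited. The Copyright Act will be applied strictly to these materials in particular.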
</div>

### 0. Get data

### 1. Import train/test data

In [121]:
import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

import string
import warnings
import missingno
warnings.filterwarnings('ignore')

In [122]:
df_train = pd.read_csv('bikesharing/train.csv')
df_test = pd.read_csv('bikesharing/test.csv')
df_all = pd.concat((df_train, df_test)).reset_index(drop=True)

In [123]:
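def split_df(df):
    # train.csv has 10,886 rows (positions 0-10885); the rows after that come from test.csv.
    # Slicing with [:10885] would silently drop the last training row.
    return df[:10886], df[10886:]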

### Add log fields for RMSLE-based prediction
> RMSLE is computed on log values, so training the models to predict log-transformed targets tends to improve the RMSLE score.
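The reasoning above can be checked on a few numbers. The sketch below uses made-up counts (not values from the dataset) and two small helper functions written only for this illustration. It shows that the RMSLE on raw counts equals a plain RMSE computed on the log-transformed values, which is why fitting on the `*_log` fields optimizes the competition metric more directly.

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on raw counts."""
    sq = [(math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(sq) / len(sq))

def rmse(a, b):
    """Ordinary root mean squared error."""
    sq = [(x - y) ** 2 for x, y in zip(a, b)]
    return math.sqrt(sum(sq) / len(sq))

# Made-up counts, for illustration only
y_true = [0, 3, 40, 250]
y_pred = [1, 5, 30, 300]

# Same transform as np.log(x + 1) in the notebook
log_true = [math.log1p(v) for v in y_true]
log_pred = [math.log1p(v) for v in y_pred]

score_raw = rmsle(y_true, y_pred)    # metric on raw counts
score_log = rmse(log_true, log_pred) # RMSE in log space

print(score_raw, score_log)
```

Both printed values agree up to floating-point rounding.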

In [124]:
df_all['casual_log'] = np.log(df_all['casual'] + 1)
df_all['registered_log'] = np.log(df_all['registered'] + 1)
df_all['count_log'] = np.log(df_all['count'] + 1)

### Add time fields

In [125]:
dt = pd.DatetimeIndex(df_all['datetime'])
df_all.set_index(dt, inplace=True)

df_all['date'] = dt.date
df_all['day'] = dt.day
df_all['month'] = dt.month
df_all['year'] = dt.year
df_all['hour'] = dt.hour
df_all['dow'] = dt.dayofweek
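# DatetimeIndex.weekofyear was removed in pandas 2.0; isocalendar().week returns the same ISO week number
df_all['woy'] = dt.isocalendar().week.astype('int64').to_numpy()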
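To see what each derived field holds, the sketch below decomposes a single timestamp with the standard library only. The timestamp string is an assumed example in the `YYYY-MM-DD HH:MM:SS` format. Note that the week number follows the ISO calendar, so the first days of January can belong to week 52 or 53 of the previous year.

```python
from datetime import datetime

# Assumed example timestamp, in the same format as the 'datetime' column
ts = datetime.strptime("2011-01-01 05:00:00", "%Y-%m-%d %H:%M:%S")

fields = {
    "day": ts.day,
    "month": ts.month,
    "year": ts.year,
    "hour": ts.hour,
    "dow": ts.weekday(),         # Monday = 0, the same convention as pandas dayofweek
    "woy": ts.isocalendar()[1],  # ISO week number
}

print(fields)
```

Here `dow` is 5 (Saturday) and `woy` is 52, because 1 January 2011 falls in the last ISO week of 2010.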

### Add peak-time field

In [126]:
def func(df_data):
    if df_data['workingday'] == 1:
        if (df_data['hour'] == 8) or (df_data['hour'] == 17) or (df_data['hour'] == 18):
            return 4
        elif (df_data['hour'] == 7) or (df_data['hour'] == 16) or (df_data['hour'] == 19): 
            return 3           
    else:
        if (df_data['hour'] >= 12 and df_data['hour'] <= 16):
            return 2
        elif (df_data['hour'] >= 10 and df_data['hour'] <= 19):
            return 1
    return 0

# axis=0 or 'index': apply the function to each column; axis=1 or 'columns': apply the function to each row
df_all['peak'] = df_all.apply(func, axis=1)
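Row-wise `apply` calls the Python function once per row, which can be slow on larger frames. As a possible alternative, the sketch below expresses the same rules with `np.select`. The helper name `add_peak` and the small `demo` frame are hypothetical and exist only for this illustration. The conditions are listed in the same priority order as the `if`/`elif` chain, since `np.select` takes the first condition that matches.

```python
import numpy as np
import pandas as pd

def add_peak(df: pd.DataFrame) -> pd.Series:
    """Vectorized sketch of the row-wise peak rules (hypothetical helper)."""
    working = df["workingday"] == 1
    hour = df["hour"]
    conditions = [
        working & hour.isin([8, 17, 18]),   # commute peaks on working days
        working & hour.isin([7, 16, 19]),   # shoulders of the commute peaks
        ~working & hour.between(12, 16),    # midday on non-working days (inclusive)
        ~working & hour.between(10, 19),    # wider daytime window on non-working days
    ]
    choices = [4, 3, 2, 1]
    return pd.Series(np.select(conditions, choices, default=0), index=df.index)

# Made-up rows covering each branch
demo = pd.DataFrame({
    "workingday": [1, 1, 1, 0, 0, 0],
    "hour":       [8, 16, 12, 13, 10, 22],
})
demo["peak"] = add_peak(demo)
print(demo)
```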

In [127]:
def func(df_data):
    # Updated 2021-10-22
    # The video applies this rule to Dec 24 through Dec 31. In testing, applying it only to
    # Dec 24 and Dec 31, when most people are clearly off work, gave better results, so only those two days are used.
    if (df_data['month'] == 12) and (df_data['day'] == 24 or df_data['day'] == 31):
        return 1
    return df_data['holiday']

df_all['holiday'] = df_all.apply(func, axis=1)

In [128]:
def func(df_data):
    # Updated 2021-10-22
    # The video applies this rule to Dec 24 through Dec 31. In testing, applying it only to
    # Dec 24 and Dec 31, when most people are clearly off work, gave better results, so only those two days are used.
    if (df_data['month'] == 12) and (df_data['day'] == 24 or df_data['day'] == 31):
        return 0
    return df_data['workingday']

df_all['workingday'] = df_all.apply(func, axis=1)

### Add fit & humid fields based on temperature, wind speed, humidity, and weather

In [129]:
def func(df_data):
    if (df_data['weather'] <= 2 and df_data['windspeed'] <= 20):
        if (df_data['temp'] > 15 and df_data['temp'] <= 35):
            return 1
    return 0

df_all['fit'] = df_all.apply(func, axis=1)

In [130]:
def func(df_data):
    if df_data['humidity'] >= 70:
        return 1
    return 0

df_all['humid'] = df_all.apply(func, axis=1)

### Metric

$$ RMSLE = \sqrt{\dfrac{1}{N}\sum_{i=1}^{N} \left(\log(y_i + 1) - \log(\hat{y}_i + 1)\right)^2} $$

In [131]:
from sklearn.metrics import make_scorer

def get_rmsle(y_actual, y_pred):
    diff = np.log(y_pred + 1) - np.log(y_actual + 1)
    mean_error = np.square(diff).mean()
    return np.sqrt(mean_error)
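One practical caveat: when a model is fitted on raw counts, a linear model can predict values below -1, and the logarithm then yields `NaN`. The sketch below is one possible guard, not the method used in this notebook. The name `get_rmsle_clipped` is hypothetical, and clipping negative predictions to 0 is an assumption made for illustration.

```python
import numpy as np

def get_rmsle_clipped(y_actual, y_pred):
    """RMSLE that clips negative predictions to 0 before taking the log (assumed guard)."""
    y_actual = np.asarray(y_actual, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0, None)
    diff = np.log1p(y_pred) - np.log1p(y_actual)
    return float(np.sqrt(np.mean(np.square(diff))))

# Made-up values; the second prediction is negative
score = get_rmsle_clipped([10, 20, 30], [12, -5, 28])
print(score)
```

The score stays finite even though one prediction is negative.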

### Model Evaluation
- Note: `np.log()` and `np.exp()` are inverse functions of each other.
- Rather than working through the math, we go straight to the code to see how this is used.
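The inverse relationship can be verified with a short round trip. The counts below are made up. The forward step matches how the `*_log` fields are built, and the inverse step matches how predictions are converted back to counts. `np.log1p` and `np.expm1` compute the same pair of transforms.

```python
import numpy as np

counts = np.array([0, 1, 15, 300])   # made-up rental counts

logged = np.log(counts + 1)          # forward transform used for the *_log fields
restored = np.exp(logged) - 1        # inverse transform applied to predictions

# Equivalent pair of NumPy functions
restored_alt = np.expm1(np.log1p(counts))

print(restored, restored_alt)
```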

In [132]:
def predict_bikecount(model, select_columns):
    df_train, df_test = split_df(df_all)

    X_train = df_train[select_columns]
    y_train_cas = df_train['casual_log']
    y_train_reg = df_train['registered_log']
    X_test = df_test[select_columns]
    
    casual_model = model.fit(X_train, y_train_cas)
    y_pred_cas = casual_model.predict(X_test)
    y_pred_cas = np.exp(y_pred_cas) - 1

    registered_model = model.fit(X_train, y_train_reg)
    y_pred_reg = registered_model.predict(X_test)
    y_pred_reg = np.exp(y_pred_reg) - 1

    return y_pred_cas + y_pred_reg
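`fit` returns the estimator itself, so `casual_model` and `registered_model` above refer to the same object. The function still works because each prediction is made before the next `fit` call. The sketch below shows why that order matters. `MeanModel` is a hypothetical stand-in written for this illustration, not a scikit-learn class. For real scikit-learn estimators, `sklearn.base.clone` is the usual way to get an independent unfitted copy.

```python
import copy

class MeanModel:
    """Hypothetical stand-in for an estimator: always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self  # like scikit-learn estimators, fit returns the object itself

    def predict(self, X):
        return [self.mean_ for _ in X]

X = [[0], [1], [2]]
model = MeanModel()

casual_model = model.fit(X, [1, 2, 3])
registered_model = model.fit(X, [10, 20, 30])   # refits the same object

same_object = casual_model is registered_model
stale_prediction = casual_model.predict([[0]])[0]  # reflects the second fit

# Independent copies keep the two fits separate
casual_copy = copy.deepcopy(model).fit(X, [1, 2, 3])
registered_copy = copy.deepcopy(model).fit(X, [10, 20, 30])

print(same_object, stale_prediction, casual_copy.predict([[0]])[0])
```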

### Model Evaluation Test: LinearRegression

In [133]:
df_train, df_test = split_df(df_all)
ml_columns = [
    'season', 'holiday', 'workingday', 'weather', 'temp',
    'atemp', 'humidity', 'windspeed', 'day', 'month',
    'year', 'hour', 'dow', 'woy', 'peak', 'fit', 'humid'
]
X_train = df_train[ml_columns].copy()
y_train = df_train['count']
rmsle_scorer = make_scorer(get_rmsle, greater_is_better=False)

In [134]:
from sklearn.linear_model import LinearRegression

lr_model = LinearRegression()
ml_pred = predict_bikecount(lr_model, ml_columns)
df_test['count'] = ml_pred
final_df = df_test[['datetime', 'count']].copy()
final_df.to_csv('submissions_linear.csv', header=True, index=False)
!kaggle competitions submit -c bike-sharing-demand -f submissions_linear.csv -m "Message"

100%|████████████████████████████████████████| 244k/244k [00:05<00:00, 47.8kB/s]
Successfully submitted to Bike Sharing Demand

### Model Evaluation Test: Lasso Regression

In [135]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

hyperparams = {'max_iter': [1000, 1500, 2000, 2500, 3000], 
               'alpha': 1/np.array([0.1, 1, 2, 3, 4, 10, 30,100,200,300,400,800,900,1000])
}

lasso_grid=GridSearchCV(estimator = Lasso(), param_grid = hyperparams, 
                verbose=True, scoring=rmsle_scorer, cv=5, n_jobs=-1)

lasso_grid.fit(X_train, y_train)
print(lasso_grid.best_params_)

Fitting 5 folds for each of 70 candidates, totalling 350 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    3.2s
[Parallel(n_jobs=-1)]: Done 287 tasks      | elapsed:    5.7s


{'alpha': 10.0, 'max_iter': 1000}


[Parallel(n_jobs=-1)]: Done 350 out of 350 | elapsed:    6.5s finished
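The counts in the log follow directly from the grid size. The sketch below enumerates the same grid with the standard library: 5 values of `max_iter` times 14 values of `alpha` give 70 candidates, and 5-fold cross-validation gives 350 fits. It also shows that the selected `alpha` of 10.0 is the largest value in the grid, which may suggest trying a wider range.

```python
from itertools import product

max_iter = [1000, 1500, 2000, 2500, 3000]
inverse_alpha = [0.1, 1, 2, 3, 4, 10, 30, 100, 200, 300, 400, 800, 900, 1000]
alpha = [1 / v for v in inverse_alpha]   # same values as 1/np.array([...]) above

candidates = list(product(max_iter, alpha))
n_candidates = len(candidates)
n_fits = n_candidates * 5                # cv=5

print(n_candidates, n_fits, min(alpha), max(alpha))
```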


In [136]:
lasso_model = lasso_grid.best_estimator_
ml_pred = predict_bikecount(lasso_model, ml_columns)
df_test['count'] = ml_pred
final_df = df_test[['datetime', 'count']].copy()
final_df.to_csv('submissions_lasso.csv', header=True, index=False)
!kaggle competitions submit -c bike-sharing-demand -f submissions_lasso.csv -m "Message"

100%|████████████████████████████████████████| 240k/240k [00:06<00:00, 38.8kB/s]
Successfully submitted to Bike Sharing Demand

### Model Evaluation Test: Ridge Regression

In [137]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

hyperparams = {'max_iter': [1000, 1500, 2000, 2500, 3000], 
               'alpha':[0.1, 1, 2, 3, 4, 10, 30,100,200,300,400,800,900,1000]
}

ridge_grid=GridSearchCV(estimator = Ridge(), param_grid = hyperparams, 
                verbose=True, scoring=rmsle_scorer, cv=5, n_jobs=-1)

ridge_grid.fit(X_train, y_train)
print(ridge_grid.best_params_)

Fitting 5 folds for each of 70 candidates, totalling 350 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 296 tasks      | elapsed:    0.8s
[Parallel(n_jobs=-1)]: Done 350 out of 350 | elapsed:    0.9s finished


{'alpha': 1000, 'max_iter': 1000}


In [138]:
ridge_model = ridge_grid.best_estimator_
ml_pred = predict_bikecount(ridge_model, ml_columns)
df_test['count'] = ml_pred
final_df = df_test[['datetime', 'count']].copy()
final_df.to_csv('submissions_ridge.csv', header=True, index=False)
!kaggle competitions submit -c bike-sharing-demand -f submissions_ridge.csv -m "Message"

100%|████████████████████████████████████████| 243k/243k [00:04<00:00, 54.6kB/s]
Successfully submitted to Bike Sharing Demand

### Model Evaluation Test: Random Forest Regressor

In [117]:
from sklearn.model_selection import GridSearchCV

n_estimators = [800, 1000, 1200]
max_depth = [10, 12, 15]
min_samples_split = [4, 5, 6]
min_samples_leaf = [4, 5, 6]

hyperparams = {'n_estimators': n_estimators, 'max_depth': max_depth, 
               'min_samples_split': min_samples_split, 'min_samples_leaf': min_samples_leaf}

rf_grid = GridSearchCV(estimator = RandomForestRegressor(), param_grid = hyperparams, 
                verbose=True, scoring=rmsle_scorer, cv=5, n_jobs=-1)

rf_grid.fit(X_train, y_train)
print(rf_grid.best_params_)

Fitting 5 folds for each of 81 candidates, totalling 405 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:   44.7s
[Parallel(n_jobs=-1)]: Done 168 tasks      | elapsed:  7.8min
[Parallel(n_jobs=-1)]: Done 405 out of 405 | elapsed: 19.8min finished


{'max_depth': 15, 'min_samples_leaf': 4, 'min_samples_split': 5, 'n_estimators': 1000}


In [120]:
rf_model = rf_grid.best_estimator_
ml_pred = predict_bikecount(rf_model, ml_columns)
df_test['count'] = ml_pred
final_df = df_test[['datetime', 'count']].copy()
final_df.to_csv('submissions_rf.csv', header=True, index=False)
!kaggle competitions submit -c bike-sharing-demand -f submissions_rf.csv -m "Message"

100%|████████████████████████████████████████| 243k/243k [00:06<00:00, 38.7kB/s]
Successfully submitted to Bike Sharing Demand

### Model Evaluation Test: XGBoost Regressor

In [139]:
from xgboost import XGBRegressor # regression tree model
from sklearn.model_selection import GridSearchCV

hyperparams = {'nthread':[4],
              'learning_rate': [0.05, 0.1, 0.15], 
              'max_depth': [4, 5],
              'min_child_weight': [3, 4, 5],
              'subsample': [0.7, 0.8],
              'colsample_bytree': [0.6, 0.7],
              'n_estimators': [250, 500]}

xgb_grid = GridSearchCV(estimator = XGBRegressor(), param_grid = hyperparams, 
                verbose=True, scoring=rmsle_scorer, cv=5, n_jobs=-1)

xgb_grid.fit(X_train, y_train)
print(xgb_grid.best_params_)

Fitting 5 folds for each of 144 candidates, totalling 720 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:   11.1s
[Parallel(n_jobs=-1)]: Done 168 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 418 tasks      | elapsed:  3.8min
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed:  6.8min finished


{'colsample_bytree': 0.7, 'learning_rate': 0.05, 'max_depth': 5, 'min_child_weight': 5, 'n_estimators': 250, 'nthread': 4, 'subsample': 0.7}


In [140]:
xgb_model = xgb_grid.best_estimator_
ml_pred = predict_bikecount(xgb_model, X_train.columns)
df_test['count'] = ml_pred
final_df = df_test[['datetime', 'count']].copy()
final_df.to_csv('submissions_xgboost.csv', header=True, index=False)
!kaggle competitions submit -c bike-sharing-demand -f submissions_xgboost.csv -m "Message"

100%|████████████████████████████████████████| 188k/188k [00:05<00:00, 36.4kB/s]
Successfully submitted to Bike Sharing Demand

### Model Evaluation Test: Gradient Boosting Regressor

In [141]:
from sklearn.model_selection import GridSearchCV

n_estimators = [100, 150, 200]
max_depth = [5, 7, 9]
min_samples_leaf = [8, 10, 12]
learning_rate = [0.1, 0.15, 0.2]
subsample = [0.6, 0.7, 0.8]

hyperparams = {'n_estimators': n_estimators, 'max_depth': max_depth, 
                    'min_samples_leaf': min_samples_leaf,
                    'learning_rate': learning_rate, 'subsample': subsample
              }

gb_grid=GridSearchCV(estimator = GradientBoostingRegressor(), param_grid = hyperparams, 
                verbose=True, scoring=rmsle_scorer, cv=5, n_jobs=-1)

gb_grid.fit(X_train, y_train)
print(gb_grid.best_params_)

Fitting 5 folds for each of 243 candidates, totalling 1215 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    5.0s
[Parallel(n_jobs=-1)]: Done 168 tasks      | elapsed:   41.8s
[Parallel(n_jobs=-1)]: Done 418 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done 768 tasks      | elapsed:  3.3min
[Parallel(n_jobs=-1)]: Done 1215 out of 1215 | elapsed:  5.2min finished


{'learning_rate': 0.1, 'max_depth': 7, 'min_samples_leaf': 8, 'n_estimators': 100, 'subsample': 0.7}


In [142]:
gb_model = gb_grid.best_estimator_
ml_pred = predict_bikecount(gb_model, X_train.columns)
df_test['count'] = ml_pred
final_df = df_test[['datetime', 'count']].copy()
final_df.to_csv('submissions_gb.csv', header=True, index=False)
!kaggle competitions submit -c bike-sharing-demand -f submissions_gb.csv -m "Message"

100%|████████████████████████████████████████| 243k/243k [00:04<00:00, 50.8kB/s]
Successfully submitted to Bike Sharing Demand
