# Bike Sharing Demand
> Predict bicycle demand from hourly bike rental data.

### Data Fields
|Column|Description|
|:-:|:-:|
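|datetime|Rental date and time|
|season|Season|
|holiday|Whether the day is a holiday|
|workingday|Working day (neither a weekend nor a holiday)|
|weather|Weather|
|temp|Temperature (°C)|
|atemp|"Feels like" temperature (°C)|
|humidity|Relative humidity|
|windspeed|Wind speed|
|casual|Rentals by non-registered users|
|registered|Rentals by registered users|
|count|Total number of rentals|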

### Evaluation
Root Mean Squared Logarithmic Error
$$\text{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\log({p_i + 1}) - \log({a_i + 1}))^2}$$
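
As a quick numeric illustration of the formula, the sketch below evaluates RMSLE on three made-up prediction/actual pairs. The values of `p` and `a` are assumptions chosen for illustration, not competition data. Because the difference is taken on a log scale, the metric penalizes relative error rather than absolute error.

```python
import numpy as np

# Hypothetical predictions p and actual counts a (illustrative values only)
p = np.array([10.0, 50.0, 200.0])
a = np.array([12.0, 40.0, 180.0])

# np.log1p(x) computes log(x + 1), matching the terms inside the sum
log_diff = np.log1p(p) - np.log1p(a)
rmsle_value = np.sqrt(np.mean(log_diff ** 2))
print(rmsle_value)
```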

## Data Preprocessing

In [78]:
import numpy as np
import pandas as pd

train = pd.read_csv("train.csv", parse_dates=['datetime'])
test = pd.read_csv("test.csv", parse_dates=['datetime'])
print(train.shape, test.shape)

(10886, 12) (6493, 9)


In [79]:
train.head()

||datetime|season|holiday|workingday|weather|temp|atemp|humidity|windspeed|casual|registered|count|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|0|2011-01-01 00:00:00|1|0|0|1|9.84|14.395|81|0.0|3|13|16|
|1|2011-01-01 01:00:00|1|0|0|1|9.02|13.635|80|0.0|8|32|40|
|2|2011-01-01 02:00:00|1|0|0|1|9.02|13.635|80|0.0|5|27|32|
|3|2011-01-01 03:00:00|1|0|0|1|9.84|14.395|75|0.0|3|10|13|
|4|2011-01-01 04:00:00|1|0|0|1|9.84|14.395|75|0.0|0|1|1|


In [80]:
print(train.shape, test.shape)

(10886, 12) (6493, 9)


### Splitting the datetime into components

In [81]:
train['year']=train['datetime'].dt.year
train['month']=train['datetime'].dt.month
train['day']=train['datetime'].dt.day
train['hour']=train['datetime'].dt.hour
train['minute']=train['datetime'].dt.minute
train['second']=train['datetime'].dt.second
train['dayofweek']=train['datetime'].dt.dayofweek

test['year']=test['datetime'].dt.year
test['month']=test['datetime'].dt.month
test['day']=test['datetime'].dt.day
test['hour']=test['datetime'].dt.hour
test['minute']=test['datetime'].dt.minute
test['second']=test['datetime'].dt.second
test['dayofweek']=test['datetime'].dt.dayofweek

print(train.shape, test.shape)

(10886, 19) (6493, 16)
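
The same seven assignments are repeated for `train` and `test` above. One way to avoid the duplication is a small helper applied to both frames. The sketch below is only an illustration: `add_datetime_parts` is a hypothetical name and the two-row frame is a stand-in, not the competition data.

```python
import pandas as pd

def add_datetime_parts(df, col="datetime"):
    """Return a copy of df with year, month, ..., dayofweek columns derived from col."""
    out = df.copy()
    for part in ["year", "month", "day", "hour", "minute", "second", "dayofweek"]:
        # Each name is an attribute of the pandas .dt accessor
        out[part] = getattr(out[col].dt, part)
    return out

# Tiny stand-in frame (not the competition data)
demo = pd.DataFrame(
    {"datetime": pd.to_datetime(["2011-01-01 00:00:00", "2011-01-20 13:00:00"])}
)
demo = add_datetime_parts(demo)
print(demo[["year", "month", "day", "hour", "dayofweek"]])
```

With such a helper, `train = add_datetime_parts(train)` and `test = add_datetime_parts(test)` would add the same columns to both frames.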


### Imputing missing (zero) windspeed values with the column mean

In [63]:
(train['windspeed'] == 0).sum()

1313

In [6]:
train['windspeed'] = train['windspeed'].mask(train['windspeed'] == 0)
train['windspeed'] = train['windspeed'].fillna(train['windspeed'].mean())
(train['windspeed'] == 0).sum()

0
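
To make the `mask` / `fillna` step concrete, here is a minimal sketch on a toy series. The values are made up, and treating 0 as "not recorded" is the assumption this approach rests on.

```python
import pandas as pd

# Toy wind speeds; 0 stands in for "not recorded" (an assumption of this approach)
wind = pd.Series([0.0, 6.0, 0.0, 10.0])

# mask() turns the zeros into NaN so they are excluded from the mean
wind_nan = wind.mask(wind == 0)

# mean() skips NaN by default, so this is the mean of the non-zero values
filled = wind_nan.fillna(wind_nan.mean())
print(filled.tolist())  # [8.0, 6.0, 8.0, 10.0]
```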

### Imputing windspeed with a random forest

In [82]:
from sklearn.ensemble import RandomForestClassifier

def predict_windspeed(data):
    # Replace zero values in the windspeed column with random-forest predictions.
    # .copy() avoids pandas' SettingWithCopyWarning when the subsets are modified below.
    data_wind_zero = data.loc[data['windspeed'] == 0].copy()
    data_wind_else = data.loc[data['windspeed'] != 0].copy()
    
    # Input features used to predict windspeed
    w_cols = ['season', 'weather', 'temp', 'atemp', 'humidity', 'year', 'month', 'hour']
    
    # The model is a classifier, so each distinct windspeed value is cast to a
    # string and treated as a class label.
    data_wind_else['windspeed'] = data_wind_else['windspeed'].astype('str')
    
    # Fit on the rows with a non-zero windspeed
    rf_model_wind = RandomForestClassifier()
    rf_model_wind.fit(data_wind_else[w_cols], data_wind_else['windspeed'])
    
    # Predict windspeed for the rows where it is 0
    wind_zero_values = rf_model_wind.predict(data_wind_zero[w_cols])
    data_wind_zero['windspeed'] = wind_zero_values
    
    # Recombine the two subsets (DataFrame.append was removed in pandas 2.0)
    # and rebuild the index. The windspeed column now holds strings (dtype object).
    data = pd.concat([data_wind_else, data_wind_zero])
    data = data.reset_index(drop=True)
    return data

train = predict_windspeed(train)


In [83]:
train['windspeed'].describe()

count      10886
unique        27
top       7.0015
freq        1320
Name: windspeed, dtype: object
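
The `dtype: object` in the summary above is a consequence of the classifier treating wind speeds as string class labels. If a numeric column is preferred, one alternative is to predict the missing values with a regressor instead. The sketch below is an assumption-laden illustration, not the method used in this notebook: `impute_windspeed_regressor` is a hypothetical helper and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def impute_windspeed_regressor(data, feature_cols):
    """Replace zero windspeed values with RandomForestRegressor predictions.

    Hypothetical helper: a numeric alternative to the classifier-based
    approach, not the notebook's original method.
    """
    data = data.copy()
    is_zero = data["windspeed"] == 0
    # Only fit when there are both rows to learn from and rows to fill
    if is_zero.any() and (~is_zero).any():
        model = RandomForestRegressor(n_estimators=50, random_state=42)
        model.fit(data.loc[~is_zero, feature_cols], data.loc[~is_zero, "windspeed"])
        data.loc[is_zero, "windspeed"] = model.predict(data.loc[is_zero, feature_cols])
    return data

# Synthetic stand-in data (not the competition data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "temp": rng.uniform(0, 35, 200),
    "humidity": rng.integers(20, 100, 200),
    "windspeed": rng.choice([0.0, 7.0015, 12.998, 19.0012], 200),
})
demo_filled = impute_windspeed_regressor(demo, ["temp", "humidity"])
print(demo_filled["windspeed"].dtype, (demo_filled["windspeed"] == 0).sum())
```

Assigning through a boolean mask keeps the original row order, so no concatenation or index reset is needed, and the same function could be applied to the test set.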

### Casting categorical columns

In [84]:
category_fn = ['season', 'holiday', 'workingday', 'weather',
               'year', 'month', 'dayofweek']

for col in category_fn:
    train[col] = train[col].astype('category')
    test[col] = test[col].astype('category')
    
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 19 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   datetime    10886 non-null  datetime64[ns]
 1   season      10886 non-null  category      
 2   holiday     10886 non-null  category      
 3   workingday  10886 non-null  category      
 4   weather     10886 non-null  category      
 5   temp        10886 non-null  float64       
 6   atemp       10886 non-null  float64       
 7   humidity    10886 non-null  int64         
 8   windspeed   10886 non-null  object        
 9   casual      10886 non-null  int64         
 10  registered  10886 non-null  int64         
 11  count       10886 non-null  int64         
 12  year        10886 non-null  category      
 13  month       10886 non-null  category      
 14  day         10886 non-null  int64         
 15  hour        10886 non-null  int64         
 16  minute      10886 non-

### Modeling

In [91]:
from sklearn.ensemble import RandomForestRegressor


model = RandomForestRegressor(n_estimators=100,
                              random_state=42,
                              n_jobs=-1)

fn = ['season', 'holiday', 'workingday', 'weather', 'year', 'hour',
      'dayofweek', 'temp', 'atemp', 'humidity', 'windspeed']

x_train = train[fn]
x_test = test[fn]

y_train = train['count']

model.fit(x_train, y_train)
pred = model.predict(x_test)
pred

array([ 12.67      ,   5.07      ,   3.99      , ...,  97.33      ,
       104.03333333,  47.16      ])

### Evaluation metric: RMSLE
$$\text{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\log({p_i + 1}) - \log({a_i + 1}))^2}$$

In [92]:
from sklearn.metrics import make_scorer


def rmsle(pv, av):  # rmsle(predicted values, actual values)
    pv = np.array(pv)
    av = np.array(av)
    
    log_predict = np.log(pv + 1)
    log_actual = np.log(av + 1)
    log_error = log_predict - log_actual
    
    return np.sqrt(np.square(log_error).mean())

rmsle_scorer = make_scorer(rmsle)
rmsle_scorer

make_scorer(rmsle)
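
By default `make_scorer` assumes that a larger score is better. That has no effect when the scores are only read off, but RMSLE is an error, so for model selection (for example with `GridSearchCV`) the scorer would need `greater_is_better=False`, which makes it return the negated value. The sketch below illustrates this on made-up data with a `DummyRegressor` baseline. The data and the `neg_rmsle_scorer` name are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer

def rmsle(av, pv):
    # scikit-learn calls the metric as metric(y_true, y_pred);
    # RMSLE is symmetric in its arguments, so the order does not change the value
    return np.sqrt(np.mean((np.log1p(pv) - np.log1p(av)) ** 2))

# greater_is_better=False makes the scorer return the negated RMSLE,
# so that "higher is better" still holds for model selection utilities
neg_rmsle_scorer = make_scorer(rmsle, greater_is_better=False)

# Toy data (illustrative values only)
X_demo = np.arange(6).reshape(-1, 1)
y_demo = np.array([1.0, 3.0, 2.0, 8.0, 5.0, 13.0])
baseline = DummyRegressor(strategy="mean").fit(X_demo, y_demo)

neg_score = neg_rmsle_scorer(baseline, X_demo, y_demo)
print(neg_score)  # a negative number: -RMSLE
```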

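### Model validation
#### K-fold Cross-validation
> To measure how well the model generalizes, the data is split into several folds; the folds are rotated between the training and test roles, and the model is trained and evaluated on each split.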

<p align="center">
  <img src="https://www.researchgate.net/profile/Halil_Bisgin/publication/228403467/figure/fig2/AS:302039595798534@1449023259454/k-fold-cross-validation-scheme-example.png" alt="K-fold Cross-validation scheme">
</p>

In [93]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score


kfold = KFold(n_splits=10, shuffle=True, random_state=42)
score = cross_val_score(model, x_train, y_train, cv=kfold, scoring=rmsle_scorer)
score.mean()

0.33091693096715785
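
What `cross_val_score` does with a `KFold` splitter can be written out as an explicit loop. The following is a minimal sketch on synthetic data (the data, the smaller forest, and the variable names are assumptions for illustration): each fold is held out once, the model is fit on the remaining folds, and RMSLE is computed on the held-out part.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

# Synthetic stand-in data (not the competition data)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 1, size=(100, 3))
y_demo = 10 * X_demo[:, 0] + rng.uniform(0, 1, size=100)  # non-negative target

demo_kfold = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for train_idx, valid_idx in demo_kfold.split(X_demo):
    # Fit on the training folds, evaluate on the held-out fold
    fold_model = RandomForestRegressor(n_estimators=20, random_state=42)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_pred = fold_model.predict(X_demo[valid_idx])
    log_diff = np.log1p(fold_pred) - np.log1p(y_demo[valid_idx])
    fold_scores.append(np.sqrt(np.mean(log_diff ** 2)))

# One score per fold; their mean is the cross-validated estimate
print(len(fold_scores), np.mean(fold_scores))
```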

In [94]:
bike_submit = pd.read_csv('sampleSubmission.csv')
bike_submit.head()

||datetime|count|
|:-:|:-:|:-:|
|0|2011-01-20 00:00:00|0|
|1|2011-01-20 01:00:00|0|
|2|2011-01-20 02:00:00|0|
|3|2011-01-20 03:00:00|0|
|4|2011-01-20 04:00:00|0|


In [95]:
bike_submit['count'] = pred
bike_submit.head()

||datetime|count|
|:-:|:-:|:-:|
|0|2011-01-20 00:00:00|12.67|
|1|2011-01-20 01:00:00|5.07|
|2|2011-01-20 02:00:00|3.99|
|3|2011-01-20 03:00:00|3.47|
|4|2011-01-20 04:00:00|2.93|


In [96]:
bike_submit.to_csv("submission.csv", index=False)