# 06 Bike Sharing Demand Prediction
* Mission
    * [Predict bike rental counts](https://www.kaggle.com/c/bike-sharing-demand)
* Evaluation metric
    * RMSLE (Root Mean Squared Logarithmic Error)
    * ![](images/bike_metrix.PNG)
        * $n$ is the number of hours in the test set
        * $p_i$ is your predicted count
        * $a_i$ is the actual count
        * $\log(x)$ is the natural logarithm
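
For reference, the symbols above fit the standard RMSLE definition, written out here in case the image does not render. This is the usual textbook form of the metric, not a transcription of the image:

$$
\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(p_i + 1) - \log(a_i + 1)\bigr)^2}
$$

Adding 1 inside each logarithm keeps the expression defined when a count is 0.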

## 6.3 Exploratory Data Analysis
* ![](images/kaggle_eda.PNG)
* [Reference code](https://www.kaggle.com/code/viveksrinivasan/eda-ensemble-model-top-10-percentile)

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
data_path = '../data/06_bike/'

In [3]:
train_df = pd.read_csv(data_path + 'train.csv')
test_df = pd.read_csv(data_path+'test.csv')
submission_df = pd.read_csv(data_path+'sampleSubmission.csv')

In [4]:
print(train_df.shape, test_df.shape, submission_df.shape)

(10886, 12) (6493, 9) (6493, 2)


In [5]:
train_df.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1


* datetime
    - hourly date + timestamp
* season
    - 1 = spring, 2 = summer, 3 = fall, 4 = winter
* holiday
    - whether the day is considered a holiday
* workingday
    - whether the day is neither a weekend nor holiday
* weather
    - 1: Clear, Few clouds, Partly cloudy
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
* temp
    - temperature in Celsius
* atemp
    - "feels like" temperature in Celsius
* humidity
    - relative humidity
* windspeed
    - wind speed
* casual
    - number of non-registered user rentals initiated
* registered
    - number of registered user rentals initiated
* count
    - number of total rentals

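### Analysis Summary
* Target transformation
    * The target (count) is skewed toward 0, so apply a log transformation to bring it closer to a normal distribution
    * Predictions must be exponentiated to restore them to the count scale
* Derived features
    * Create year, month, day, hour, min, and sec features from datetime
    * Add a weekday feature
* Feature removal
    * Drop casual and registered, which exist only in the training data
    * Drop datetime
    * Drop date (the year, month, and day features are used instead)
    * Drop month, since it is a finer-grained breakdown of season
    * Drop day: it has no discriminative power, and the data only covers days 1 to 19
    * Drop min and sec: the bar plots show they carry no useful information
    * Drop windspeed: the scatter plot reveals missing values, and the heatmap shows only a weak correlation
* Outlier removal
    * The point plot shows that rows with weather equal to 4 are outliers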
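
The target transformation and the datetime-derived features can be sketched on a tiny example. The three rows below are made up for illustration and only mimic the shape of the competition data; `sample` is a hypothetical name, not a variable from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample in the same shape as the competition data
sample = pd.DataFrame({
    'datetime': ['2011-01-01 00:00:00', '2011-01-01 01:00:00', '2011-01-02 17:00:00'],
    'count': [16, 40, 250],
})

# Target transformation: train on log(count), restore with exp
log_count = np.log(sample['count'])
restored = np.exp(log_count)

# Derived features from datetime
dt = pd.to_datetime(sample['datetime'])
sample['hour'] = dt.dt.hour
sample['minute'] = dt.dt.minute
sample['second'] = dt.dt.second
sample['weekday'] = dt.dt.weekday  # Monday=0, Sunday=6

print(np.allclose(restored, sample['count']))
print(sample[['hour', 'minute', 'second', 'weekday']])
```

Because the records are hourly, `minute` and `second` are constant in this sample, which is consistent with dropping them.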

## 6.4 Baseline Model

In [6]:
train = train_df.copy()
# Remove outliers (rows with weather == 4)
train = train[train['weather'] != 4]

In [7]:
print(len(train), len(test_df))

10885 6493


In [8]:
# Combine train and test data
all_data = pd.concat([train, test_df], ignore_index=True)
all_data.tail()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
17373,2012-12-31 19:00:00,1,0,1,2,10.66,12.88,60,11.0014,,,
17374,2012-12-31 20:00:00,1,0,1,2,10.66,12.88,60,11.0014,,,
17375,2012-12-31 21:00:00,1,0,1,1,10.66,12.88,60,11.0014,,,
17376,2012-12-31 22:00:00,1,0,1,1,10.66,13.635,56,8.9981,,,
17377,2012-12-31 23:00:00,1,0,1,1,10.66,13.635,65,8.9981,,,


In [9]:
all_data.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3.0,13.0,16.0
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8.0,32.0,40.0
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5.0,27.0,32.0
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3.0,10.0,13.0
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0.0,1.0,1.0


In [10]:
# Add derived features
all_data['datetime'] = pd.to_datetime(all_data['datetime'])

all_data['year'] = all_data['datetime'].dt.year
all_data['month'] = all_data['datetime'].dt.month
all_data['day'] = all_data['datetime'].dt.day
all_data['hour'] = all_data['datetime'].dt.hour
all_data['weekday'] = all_data['datetime'].dt.weekday

In [11]:
# Remove features
removal_feature = ['casual', 'registered', 'datetime', 'month', 'day', 'windspeed']
all_data = all_data.drop(columns=removal_feature)

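### Feature selection
* The task of selecting only the key features that best represent the data for modeling
* Features unrelated to predicting the target are removed
* Uses exploratory data analysis, feature importance, and the correlation matrix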
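
As a rough illustration of the correlation-based check, the sketch below ranks features by their absolute correlation with the target. The data is synthetic and the column names (`temp`, `noise`) are chosen only for the example. A weak linear correlation is a hint rather than proof that a feature is useless.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic data: 'temp' drives the target, 'noise' is unrelated to it
temp = rng.uniform(0, 40, n)
noise = rng.normal(0, 1, n)
count = 5 * temp + rng.normal(0, 10, n)

df = pd.DataFrame({'temp': temp, 'noise': noise, 'count': count})

# Absolute Pearson correlation of each feature with the target
corr_with_target = df.corr()['count'].drop('count').abs()
print(corr_with_target.sort_values(ascending=False))
```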

In [12]:
# Split back into train and test (test rows have no count)
X_train = all_data[~pd.isnull(all_data['count'])]
X_test = all_data[pd.isnull(all_data['count'])]
y = X_train['count']
X_train = X_train.drop(columns=['count'])
X_test = X_test.drop(columns=['count'])

X_train.head()

Unnamed: 0,season,holiday,workingday,weather,temp,atemp,humidity,year,hour,weekday
0,1,0,0,1,9.84,14.395,81,2011,0,5
1,1,0,0,1,9.02,13.635,80,2011,1,5
2,1,0,0,1,9.02,13.635,80,2011,2,5
3,1,0,0,1,9.84,14.395,75,2011,3,5
4,1,0,0,1,9.84,14.395,75,2011,4,5


In [13]:
# Scale features to the [0, 1] range
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler().fit(X_train) # fit on the training data only
X_train_scale = scaler.transform(X_train)
X_train_scale = pd.DataFrame(X_train_scale, columns=X_train.columns)

In [14]:
# Evaluation metric function (convertExp=True means the inputs are on the log scale)
def rmsle(y_true, y_pred, convertExp=True):
    if convertExp:
        y_true = np.exp(y_true)
        y_pred = np.exp(y_pred)
    log_true = np.nan_to_num(np.log(y_true+1))
    log_pred = np.nan_to_num(np.log(y_pred+1))
    return np.sqrt(np.mean((log_true-log_pred)**2))
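
A quick sanity check of the metric on hand-picked numbers may help. The helper below, `rmsle_counts`, is a hypothetical simplified variant that works directly on counts, so it skips the `convertExp` step and the `nan_to_num` guard of the function above.

```python
import numpy as np

def rmsle_counts(actual, predicted):
    """RMSLE computed directly on the count scale (no exp conversion)."""
    return np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2))

actual = np.array([16.0, 40.0, 32.0])

# Perfect predictions give an error of zero
print(rmsle_counts(actual, actual))

# Predicting 2*a + 1 makes every log difference exactly log(2)
predicted = 2 * actual + 1
print(rmsle_counts(actual, predicted), np.log(2))
```

The second case works because $\log(2a + 2) - \log(a + 1) = \log 2$ for every row, so the metric measures relative rather than absolute error.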

In [15]:
### Train the model
from sklearn.linear_model import LinearRegression

linear_reg_model = LinearRegression()
log_y = np.log(y)
linear_reg_model.fit(X_train_scale, log_y)

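### Summary
* ![](images/bike_metrix.PNG)
    * $y = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3$
    * $y = \theta_0 + \theta X$
    * $x_n$: independent variables (features)
    * $\theta_n$: regression coefficients (weights)
    * $y$: dependent variable (target)
* Training
    * Finding the optimal weights given the features and target values
* Prediction
    * Estimating the target value for new data using the trained model with its optimal weights
* Exploratory data analysis
    * Identifying features that will help prediction and exploring suitable modeling approaches
* Feature engineering
    * Processing the selected features so they are suitable for training and help improve performance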
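
Training and prediction for a linear model can be sketched with plain NumPy. This is a minimal illustration on synthetic, noise-free data using ordinary least squares, the same criterion that `LinearRegression` minimizes. The weights and variable names (`true_theta`, `X_design`, `X_new`) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic features and a target generated from known weights (no noise)
X = rng.uniform(0, 1, size=(100, 3))
true_theta0 = 2.0
true_theta = np.array([1.5, -3.0, 0.5])
y = true_theta0 + X @ true_theta

# Training: find the weights that minimize squared error
X_design = np.column_stack([np.ones(len(X)), X])  # column of ones for theta_0
theta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Prediction: apply the learned weights to new data
X_new = np.array([[0.2, 0.4, 0.6]])
y_new = theta_hat[0] + X_new @ theta_hat[1:]

print(theta_hat)
print(y_new)
```

Since the target contains no noise, the recovered weights should match the ones used to generate it up to floating-point error.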

In [16]:
### Validate model performance (on the training data)
preds = linear_reg_model.predict(X_train_scale)
print(f'RMSLE : {rmsle(log_y, preds, True):.4f}')

RMSLE : 1.0205


In [18]:
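### Predict and submit results
# Keep the feature names so the input matches what the model was fitted on
X_test_scale = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)
linearreg_preds = linear_reg_model.predict(X_test_scale)
submission_df['count'] = np.exp(linearreg_preds)  # restore counts from the log scale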
# submission_df.to_csv('submission.csv', index=False)

