# Bike Sharing Demand (Regression)

## Data Description
* Two years of hourly rental data
* training set : data from the 1st through the 19th of each month
* test set : data from the 20th through the end of each month

### Goal : predict the total count of bikes rented during each hour covered by the test set, using only information available prior to the rental period

## Data Fields
* datetime : hourly date + timestamp
* season : 1 = spring, 2 = summer, 3 = fall, 4 = winter
* holiday : whether the day is considered a holiday
* workingday : whether the day is neither a weekend nor holiday
* weather :
    * 1 = Clear, Few clouds, Partly cloudy
    * 2 = Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    * 3 = Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    * 4 = Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
* temp : temperature in Celsius
* atemp : "feels like" temperature in Celsius
* humidity : relative humidity
* windspeed : wind speed
* casual : number of non-registered user rentals initiated
* registered : number of registered user rentals initiated
* count : number of total rentals (the value to predict)

In [None]:
import pylab
import calendar
import numpy as np
import pandas as pd
import seaborn as sn
from scipy import stats
import missingno as msno  # missing-value visualization
from datetime import datetime
import matplotlib.pyplot as plt
import warnings
pd.options.mode.chained_assignment = None
warnings.filterwarnings("ignore", category=DeprecationWarning)
%matplotlib inline

In [None]:
dailyData = pd.read_csv('../input/bike-sharing-demand/train.csv')
dailyData.head()

In [None]:
dailyData.info()

## Feature Engineering
As noted in the Data Description above, "season", "holiday", "workingday", and "weather" are categorical data.
* Split the "datetime" column into new "date", "hour", "weekday", and "month" columns
* Convert "season", "holiday", "workingday", and "weather" to the category type
* Drop "datetime"

In [None]:
dailyData["date"] = dailyData.datetime.apply(lambda x : x.split()[0])
dailyData["hour"] = dailyData.datetime.apply(lambda x : x.split()[1].split(":")[0])
dailyData["weekday"] = dailyData.date.apply(lambda dateString : calendar.day_name[datetime.strptime(dateString,"%Y-%m-%d").weekday()])
dailyData["month"] = dailyData.date.apply(lambda dateString : calendar.month_name[datetime.strptime(dateString,"%Y-%m-%d").month])
dailyData["season"] = dailyData.season.map({1: "Spring", 2 : "Summer", 3 : "Fall", 4 :"Winter" })
# Replace the numeric weather codes with the descriptions listed under Data Fields
dailyData["weather"] = dailyData.weather.map({
    1: "Clear, Few clouds, Partly cloudy",
    2: "Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist",
    3: "Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds",
    4: "Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog"})

In [None]:
categoryVariableList = ["hour","weekday","month","season","weather","holiday","workingday"]
for var in categoryVariableList:
    dailyData[var] = dailyData[var].astype("category")

In [None]:
dailyData  = dailyData.drop(["datetime"],axis=1)

In [None]:
dailyData.head()
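
The row-by-row string splitting above can also be written with pandas' datetime accessor. The following is only a sketch on a hypothetical three-row frame (`sample` and `features` are illustrative names, not part of the notebook); it assumes the `datetime` strings follow the `YYYY-MM-DD HH:MM:SS` layout seen in `train.csv`.

```python
import pandas as pd

# Hypothetical stand-in for train.csv; only the datetime layout is assumed to match.
sample = pd.DataFrame({
    "datetime": ["2011-01-01 00:00:00", "2011-01-19 17:00:00", "2012-07-04 08:00:00"]
})

# Parse once, then derive every calendar feature from the parsed column.
parsed = pd.to_datetime(sample["datetime"], format="%Y-%m-%d %H:%M:%S")
features = pd.DataFrame({
    "date": parsed.dt.strftime("%Y-%m-%d"),
    "hour": parsed.dt.strftime("%H"),      # zero-padded string, like the split() version
    "weekday": parsed.dt.day_name(),
    "month": parsed.dt.month_name(),
})
print(features)
```

Parsing the column once avoids calling `datetime.strptime` twice per row, which may matter on larger files.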

## Missing Values Analysis
* After a first look at the data, the next step is to check whether it contains any missing values
* This dataset, however, has no missing values

## Skewness in Distribution

In [None]:
msno.matrix(dailyData, figsize = (12, 5))
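
A numeric check can complement the matrix plot. The sketch below counts missing values per column on a hypothetical toy frame (`toy` is an illustrative name with one value deliberately removed); run against `dailyData`, the same call would be expected to return zeros for every column, given the statement above that the dataset has no missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one deliberately missing temperature.
toy = pd.DataFrame({"temp": [9.84, np.nan, 9.02], "count": [16, 40, 32]})

# Number of missing entries in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# True if any column has at least one missing entry.
has_missing = bool(missing_per_column.any())
print(has_missing)
```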

## Outliers Analysis
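* At first glance, 'count' contains many outliers, and its distribution is skewed to the right

#### The following inferences are drawn from the box plots below
* Counts are relatively low in spring, as the lower median shows
* The 'Hour Of The Day' box plot is interesting: the median is relatively high at 7-8 AM and 5-6 PM, probably because those are school and office commuting hours
* In the fourth plot, more of the outliers come from working days than from non-working days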

In [None]:
fig, axes = plt.subplots(nrows=2,ncols=2)
fig.set_size_inches(12, 10)
sn.boxplot(data=dailyData,y="count",orient="v",ax=axes[0][0])
sn.boxplot(data=dailyData,y="count",x="season",orient="v",ax=axes[0][1])
sn.boxplot(data=dailyData,y="count",x="hour",orient="v",ax=axes[1][0])
sn.boxplot(data=dailyData,y="count",x="workingday",orient="v",ax=axes[1][1])

axes[0][0].set(ylabel='Count',title="Box Plot On Count")
axes[0][1].set(xlabel='Season', ylabel='Count',title="Box Plot On Count Across Season")
axes[1][0].set(xlabel='Hour Of The Day', ylabel='Count',title="Box Plot On Count Across Hour Of The Day")
axes[1][1].set(xlabel='Working Day', ylabel='Count',title="Box Plot On Count Across Working Day")
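
The medians and outlier counts read off the box plots can also be computed directly. This is a sketch on hypothetical toy data (`toy` and `count_boxplot_outliers` are illustrative names); it assumes the box plots use the common 1.5 × IQR whisker rule, which is seaborn's default.

```python
import pandas as pd

# Hypothetical toy data: six working-day rows and six non-working-day rows.
toy = pd.DataFrame({
    "workingday": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "count": [10, 12, 11, 13, 12, 300, 10, 12, 11, 13, 12, 14],
})

def count_boxplot_outliers(values):
    """Count points outside the 1.5 * IQR whiskers of a box plot."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return int(outside.sum())

# Median per group: the line inside each box.
median_by_day_type = toy.groupby("workingday")["count"].median()
# Outliers per group: the points drawn beyond the whiskers.
outliers_by_day_type = toy.groupby("workingday")["count"].apply(count_boxplot_outliers)
print(median_by_day_type)
print(outliers_by_day_type)
```

Applying the same two group-bys to `dailyData` with `"season"` or `"hour"` as the key would give the numbers behind the other panels.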

#### Let's Remove Outliers In The Count Column

In [None]:
dailyDataWithoutOutliers = dailyData[np.abs(dailyData['count'] - dailyData['count'].mean()) <= (3*dailyData['count'].std())]
print("Shape before removing outliers: ", dailyData.shape)
print("Shape after removing outliers: ", dailyDataWithoutOutliers.shape)

## Correlation Analysis
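One way to understand how the independent variables affect the response variable is to look at a correlation matrix.
#### Let's plot a correlation plot between 'count' and ['temp', 'atemp', 'humidity', 'windspeed']
* temp and humidity are positively and negatively correlated with count, respectively. Neither correlation is strong, so count depends only weakly on them
* windspeed is of almost no use
* atemp is not used because it is too strongly correlated with temp (a multicollinearity problem)
* casual and registered are not used either (casual + registered = count)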

In [None]:
corrMatt = dailyData[["temp","atemp","casual","registered","humidity","windspeed","count"]].corr()
# Boolean mask that hides the upper triangle; the diagonal and lower triangle stay visible
mask = np.triu(np.ones_like(corrMatt, dtype=bool), k=1)
fig, ax = plt.subplots()
fig.set_size_inches(20, 10)
sn.heatmap(corrMatt, mask=mask, vmax=.8, square=True, annot=True, ax=ax)

In [None]:
fig,(ax1,ax2,ax3) = plt.subplots(ncols=3)
fig.set_size_inches(12, 5)
sn.regplot(x="temp", y="count", data=dailyData,ax=ax1)
sn.regplot(x="windspeed", y="count", data=dailyData,ax=ax2)
sn.regplot(x="humidity", y="count", data=dailyData,ax=ax3)
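
The two column-dropping decisions in the correlation notes can be checked numerically before acting on them. The sketch below uses synthetic data (`toy`, the random generator, and the way `atemp` is simulated are all assumptions made for illustration); the same three statements could be run on `dailyData`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: atemp is simulated as a noisy multiple of temp.
rng = np.random.default_rng(0)
temp = rng.uniform(0, 40, size=200)
toy = pd.DataFrame({
    "temp": temp,
    "atemp": temp * 1.1 + rng.normal(0, 1, size=200),
    "casual": rng.integers(0, 50, size=200),
    "registered": rng.integers(0, 200, size=200),
})
toy["count"] = toy["casual"] + toy["registered"]

# 1) A near-1 correlation signals that temp and atemp carry the same information.
temp_atemp_corr = toy["temp"].corr(toy["atemp"])

# 2) If casual + registered always equals count, keeping them would leak the target.
count_is_sum = bool((toy["casual"] + toy["registered"] == toy["count"]).all())

# 3) Drop the redundant and leaking columns.
model_features = toy.drop(columns=["atemp", "casual", "registered"])
print(round(temp_atemp_corr, 3), count_is_sum, list(model_features.columns))
```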

## Visualizing Distribution Of Data

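Many machine learning models work best when the response variable is approximately normally distributed. If the response variable is skewed to one side, one option is:
* apply a log transformation after removing the outlier data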

In [None]:
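fig,axes = plt.subplots(ncols=2,nrows=2)
fig.set_size_inches(12, 10)
# Top row: raw count. histplot(kde=True) replaces the deprecated distplot (needs seaborn >= 0.11)
sn.histplot(dailyData["count"], kde=True, stat="density", ax=axes[0][0])
stats.probplot(dailyData["count"], dist='norm', fit=True, plot=axes[0][1])
# Bottom row: log1p of count after outlier removal, same transform in both panels
sn.histplot(np.log1p(dailyDataWithoutOutliers["count"]), kde=True, stat="density", ax=axes[1][0])
stats.probplot(np.log1p(dailyDataWithoutOutliers["count"]), dist='norm', fit=True, plot=axes[1][1])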

Even after the log transformation, the distribution still does not look ideally normal.
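
How much a log transformation helps can be quantified with the sample skewness. The sketch below uses synthetic, right-skewed data (`toy_counts` is drawn from a lognormal distribution purely for illustration, so it responds to the transform far better than the real `count` column does).

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed "counts"; the lognormal shape is an assumption for illustration.
rng = np.random.default_rng(42)
toy_counts = np.round(rng.lognormal(mean=4.0, sigma=1.0, size=5000))

# Skewness near 0 means a roughly symmetric distribution; large positive means a long right tail.
skew_raw = stats.skew(toy_counts)
skew_log = stats.skew(np.log1p(toy_counts))
print("skewness before log1p:", skew_raw)
print("skewness after log1p:", skew_log)
```

Running `stats.skew` on `dailyDataWithoutOutliers["count"]` before and after `np.log1p` would put a number on the visual impression from the plots.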

## Visualizing Count Vs (Month, Season, Hour, Weekday, Usertype)
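* Summer is a good time for cycling, so people tend to rent more bikes then: demand is relatively high in June, July, and August
* On weekdays, demand is high at 7-8 AM and 5-6 PM, as mentioned earlier
* This pattern does not appear on Saturdays and Sundays; instead, demand is high from 10 AM to 4 PM
* Registered users account for the demand peaks at 7-8 AM and 5-6 PM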


In [None]:
fig,(ax1,ax2,ax3,ax4)= plt.subplots(nrows=4)
fig.set_size_inches(12,20)
sortOrder = ["January","February","March","April","May","June","July","August","September","October","November","December"]
hueOrder = ["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"]

# Average count per month, drawn in calendar order via `order`
monthAggregated = pd.DataFrame(dailyData.groupby("month")["count"].mean()).reset_index()
sn.barplot(data=monthAggregated,x="month",y="count",ax=ax1,order=sortOrder)
ax1.set(xlabel='Month', ylabel='Average Count',title="Average Count By Month")

# Average count per hour, one line per season
hourAggregated = pd.DataFrame(dailyData.groupby(["hour","season"],sort=True)["count"].mean()).reset_index()
sn.pointplot(data=hourAggregated, x="hour", y="count", hue="season", ax=ax2)
ax2.set(xlabel='Hour Of The Day', ylabel='Users Count',title="Average Users Count By Hour Of The Day Across Season")

# Average count per hour, one line per weekday
hourAggregated = pd.DataFrame(dailyData.groupby(["hour","weekday"],sort=True)["count"].mean()).reset_index()
sn.pointplot(data=hourAggregated, x="hour", y="count", hue="weekday", hue_order=hueOrder, ax=ax3)
ax3.set(xlabel='Hour Of The Day', ylabel='Users Count',title="Average Users Count By Hour Of The Day Across Weekdays")

# Average rentals per hour, split into casual and registered users
hourTransformed = pd.melt(dailyData[["hour","casual","registered"]], id_vars=['hour'], value_vars=['casual', 'registered'])
hourAggregated = pd.DataFrame(hourTransformed.groupby(["hour","variable"],sort=True)["value"].mean()).reset_index()
sn.pointplot(data=hourAggregated, x="hour", y="value", hue="variable", hue_order=["casual","registered"], ax=ax4)
ax4.set(xlabel='Hour Of The Day', ylabel='Users Count',title="Average Users Count By Hour Of The Day Across User Type")
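
The peak hours described in the notes for this section can also be extracted as numbers rather than read from the plots. This is a sketch on hypothetical toy data (`toy`, `mean_by_hour`, and `peak_hour` are illustrative names, and the values are made up to mimic the commuting and midday patterns).

```python
import pandas as pd

# Hypothetical toy data: three hours on a working day (1) and a non-working day (0).
toy = pd.DataFrame({
    "hour": ["08", "13", "17", "08", "13", "17"],
    "workingday": [1, 1, 1, 0, 0, 0],
    "count": [400, 150, 450, 80, 350, 200],
})

# Rows are hours, columns are day types, cells are mean rentals.
mean_by_hour = toy.pivot_table(index="hour", columns="workingday", values="count", aggfunc="mean")

# Hour with the highest mean demand for each day type.
peak_hour = mean_by_hour.idxmax()
print(mean_by_hour)
print(peak_hour)
```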