1. Project introduction
 * Project background
 * First glance at the data
2. Overview of the train and test data
3. Feature analysis
 * Time-related features
 * Weather-related features
 * temp and atemp features
 * humidity and windspeed features
4. Feature selection and engineering
 * Selection
 * Engineering
5. Model selection and parameter tuning

# 1. Project introduction
* **1.1 Project background**
 * Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world. The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed are explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this data analysis project, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.
 * The dataset for this project comes from a bike-sharing service in the Washington, D.C. area. Such services generate large amounts of data every day, which is valuable for studying local road systems, travel demand, route planning and the siting of commercial areas. In this project I explore what the data reveals and use it to predict bike-sharing demand.

* **1.2 First glance at the data**
 * datetime - hourly date + timestamp (year, month, day, hour, minute, second)
 * season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
 * holiday - whether the day is considered a holiday (0 or 1)
 * workingday - whether the day is neither a weekend nor holiday (0 or 1)
 * weather - coded 1 to 4; the larger the number, the worse the weather
  * 1: Clear, Few clouds, Partly cloudy
  * 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
  * 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
  * 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
 * temp - temperature in Celsius
 * atemp - "feels like" temperature in Celsius
 * humidity - relative humidity
 * windspeed - wind speed
 * casual - number of non-registered user rentals initiated
 * registered - number of registered user rentals initiated
 * count - number of total rentals (casual plus registered)


# 2. Overview of the train and test data

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_log_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor

In [None]:
train=pd.read_csv('../input/bike-sharing-demand/train.csv',parse_dates=['datetime'])
test=pd.read_csv('../input/bike-sharing-demand/test.csv',parse_dates=['datetime'])
sampleSubmission=pd.read_csv('../input/bike-sharing-demand/sampleSubmission.csv')
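# ignore_index gives the combined frame a unique 0..n-1 index instead of repeating the train/test row labels
full=pd.concat([train,test],ignore_index=True)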

In [None]:
train.info()

In [None]:
test.info()

* Both the train file and the test file are complete, so no imputation or row deletion is needed.
* Eight columns are treated as predictors of the target: time, season, holiday, working day, weather, temperature, humidity and wind speed.
* On the demand side, the training data splits rentals into registered and casual demand; the two add up to the total demand (count).

In [None]:
train[['registered','casual']].sum()/train['count'].sum()

* Registered users account for about 81% of total demand; the remaining 19% or so comes from casual users.
* In the analysis that follows, we look only at the total demand.

In [None]:
sns.displot(data = train, x='count')
plt.show()

* The distribution of count is not normal and has a long right tail, so we consider a log transformation.

In [None]:
sns.displot(data = train, x='count',log_scale=True)
plt.show()

* After the transformation, the distribution of count is much closer to a normal distribution. For that reason, our models will predict demand on the log scale.
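The log transformation discussed above can be illustrated with a small round trip. This is only a sketch on made-up counts (the `counts` array and the `rmsle` helper are illustrative, not part of the notebook); it assumes the `log1p`/`expm1` pair that the modelling cells use later, and shows why an error measured on the log scale treats a 10% miss the same way for small and large counts.

```python
import numpy as np

# Hypothetical hourly rental counts with a long right tail (not taken from train.csv).
counts = np.array([0, 3, 12, 40, 150, 600, 977], dtype=float)

# log1p(x) = log(1 + x) is defined at zero, unlike a plain log.
log_counts = np.log1p(counts)

# expm1 is the inverse of log1p, so predictions made on the log scale
# can be mapped back to rental counts.
restored = np.expm1(log_counts)


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error (illustrative helper)."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))


print(np.allclose(restored, counts))
print(rmsle(counts, counts * 1.1))
```

Training on `log1p(count)` and minimizing squared error is therefore closely related to minimizing a squared log error on the original counts.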

# 3. Feature analysis

### 3.1.1 Time-related feature analysis
The time-related columns in the original data are datetime, season, holiday and workingday. We first split datetime further into year, month, date, day of the week (dayname) and hour.

In [None]:
full['date']=full['datetime'].dt.date
full['month']=full['datetime'].dt.month
full['hour']=full['datetime'].dt.hour
full['dayname'] = full['datetime'].dt.weekday
full['year'] = full['datetime'].dt.year

In [None]:
plt.figure(figsize=(16,9))
sns.set_theme()
plt.subplot(2,2,1)
sns.barplot(data=full,x='month',y='count',estimator=np.sum)
plt.subplot(2,2,2)
sns.lineplot(data=full,x='date',y='count')
plt.subplot(2,2,3)
sns.lineplot(data=full,x='dayname',y='count',estimator=np.mean)
plt.subplot(2,2,4)
sns.lineplot(data=full,x='hour',y='count',marker='o')
plt.show()

**From the figures above, we can draw four main points:**
* By month, demand follows a yearly cycle: it starts rising in spring, reaches and holds a high level through summer and fall, then falls back in winter. The colder the month, the fewer people want a bike.
* Over the whole time span, daily usage rises steadily.
* Demand fluctuates slightly across the days of the week (dayname).
* There are two peaks each day, at the morning and evening commuting hours. The evening peak is higher, possibly because people have more free time after work. Based on this, we can consider recoding hour into peak and off-peak periods to reduce model complexity.
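One way to turn the visual impression of commuting peaks into an explicit list of hours is to rank the mean demand per hour. The sketch below uses synthetic data with made-up commuter bumps (the `demo` frame and its shape parameters are assumptions, not values from `train.csv`); on the real data the same `groupby` would run on `train`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data: 30 days of hourly counts with
# bumps around 08:00 and 17:00-18:00 (all numbers are invented).
rng = np.random.default_rng(0)
hours = np.tile(np.arange(24), 30)
base = (
    50
    + 200 * np.exp(-((hours - 8) ** 2) / 2)
    + 250 * np.exp(-((hours - 17.5) ** 2) / 4)
)
demo = pd.DataFrame({"hour": hours, "count": rng.poisson(base)})

# Mean demand per hour of day, then the five busiest hours.
hourly_mean = demo.groupby("hour")["count"].mean()
busiest = sorted(hourly_mean.nlargest(5).index.tolist())
print(busiest)
```

Ranking the hourly means gives a reproducible basis for choosing which hours to label as peak, rather than reading them off the line plot.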

### 3.1.2 Holiday feature analysis

In [None]:
fig,(ax1,ax2) = plt.subplots(1,2,figsize=(10,6))
full.holiday.value_counts().plot.pie(autopct='%.2f%%',ax=ax1,legend=True)
sns.boxplot(data=full,y='count',x='holiday',ax=ax2)
ax1.set_title('percent of holiday and non-holiday data')
ax2.set_title('average demands for bike sharing system ')
plt.tight_layout()
plt.show()

* Holidays account for less than 3% of the rows, so this feature carries little information.
* Demand on holidays differs slightly from demand on other days, but given how few holiday rows there are, dropping the feature should have little effect on the predictions while reducing model complexity.

### 3.1.3 Season feature analysis

* The month analysis above already gave us some insight into how season affects demand.

In [None]:
fig,(ax1,ax2)=plt.subplots(1,2,figsize=(10,4))
full.groupby('season')['count'].sum().plot.bar(ax=ax1)
sns.boxplot(x=train.season,y=train['count'],ax=ax2)
plt.tight_layout()
ax1.set(title='total demands for each season')
ax1.set_ylabel('total demand')
ax2.set_title('boxplot of each season')
plt.tight_layout()
plt.show()

* Overall, demand keeps rising from spring to fall, peaks in fall, then falls back and bottoms out in spring.
* The box plot shows many more outliers in spring and winter, which suggests more sporadic demand in those seasons: you surely don't want to get on a bike on a cold, snowy day.
* Month and season overlap heavily as time features, so later we can consider keeping only one of them.

### 3.2 Weather analysis

In [None]:
full.weather.unique()

In [None]:
plt.figure()
plt.subplot()
sns.barplot(data=full,x='weather',y='count')
plt.show()

* Common sense says that the worse the weather, the lower the demand should be. Yet for the worst weather, category 4, demand goes back up. We need to dig further to find out why.

In [None]:
full.weather.value_counts()

In [None]:
full.weather.value_counts().plot.pie(shadow=True,autopct='%.2f%%',legend=True)

* Digging further, we find that only 3 entries have the worst weather condition (category 4), a negligible share of the data. With so few observations, the demand shown for category 4 very likely does not represent real demand under that weather. Combined with common sense and the rest of the data, we can still say that the worse the weather, the lower the demand.
* Because category 4 is so rare, we reassign it to category 3.

In [None]:
full.loc[full.weather==4,'weather']=3  # merge weather category 4 into category 3

* Finally, weather should be clearly related to temperature, humidity and wind speed. Let us look at how those three are distributed within each weather category.

In [None]:
g = sns.FacetGrid(full,col='weather',height=4)
g.map(sns.histplot,'windspeed',stat='probability',color='red')
plt.show()

* Better weather comes with lower wind speed.

In [None]:
g = sns.FacetGrid(full,col='weather',height=4)
g.map(sns.histplot,'humidity',stat='probability',color='green')
plt.show()

* Other things being equal, the worse the weather, the higher the humidity. A plausible explanation is that bad weather usually comes with rain or snow, which raises humidity.

In [None]:
g = sns.FacetGrid(full,col='weather',height=4)
g.map(sns.histplot,'temp',stat='probability',color='purple')
plt.show()

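* Under every weather category, temperature is roughly bell-shaped, close to a normal distribution. The relationship between temperature and weather is not as clear as the one between humidity and weather.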

### 3.3 Temperature analysis

In [None]:
g = sns.FacetGrid(full,col='season')
g.map(sns.scatterplot,'temp','atemp')
plt.show()

* Overall, temp and atemp are highly correlated: in all four seasons the points lie close to a 45-degree line.
* The correlation coefficient tells the same story: it is close to 1.

In [None]:
full[['temp','atemp']].corr()

* In the first plot below, demand rises with temperature, and the relationship is close to linear, except below 3 degrees and above 36 degrees, where the trend reverses.
* The second and third plots suggest that the rows with temp above 36 may not be few enough to ignore, so we check what share of the data they make up.

In [None]:
fig, (ax,ax2,ax3) = plt.subplots(3,1,sharex=True,figsize=(14,9))
sns.lineplot(data=train, x='temp', y='count',ax=ax,color='green')
sns.histplot(train['temp'],ax=ax2,color='green')
sns.histplot(full['temp'],ax=ax3,color='green')
ax.set_title('plot of temp and count')
ax2.set_title('plot of train temp distribution')
ax3.set_title('plot of full temp distribution')
plt.show()

In [None]:
train.loc[train['temp']>=36]['temp'].count()/train['temp'].count()

In [None]:
full.loc[full['temp']>=36]['temp'].count()/full['temp'].count()

* In both the train set and the full set, rows with temp of 36 or above make up about 1% of the data or less. We therefore cap every temperature above 36 at 36.
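The share check and the cap described above can be sketched on a handful of made-up temperatures. `Series.clip` is used here as an alternative to boolean `.loc` assignment; the 36-degree threshold is the notebook's choice and the `temps` values are invented for illustration.

```python
import pandas as pd

# Made-up temperatures in Celsius, for illustration only.
temps = pd.Series([9.84, 22.14, 35.26, 36.9, 41.0])

# Fraction of rows at or above the cap.
share_hot = (temps >= 36).mean()

# Values above 36 become 36; everything else is left unchanged.
capped = temps.clip(upper=36)

print(share_hot)
print(capped.tolist())
```

Capping rather than dropping keeps every row available for training while removing the sparse region where the temperature trend reverses.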

In [None]:
full.loc[full['temp']>36,'temp']=36

### 3.4 Humidity and windspeed analysis

In [None]:
train[['count','windspeed','humidity']].corr()

* In the correlation matrix, only humidity shows a noticeable correlation with demand; windspeed does not.

In [None]:
plt.figure(figsize=(16,9))
plt.subplot(2,2,1)
sns.lineplot(x='windspeed',y='count',data=train,color='green')
plt.subplot(2,2,2)
sns.lineplot(x='humidity',y='count',data=train)
plt.subplot(2,2,3)
sns.histplot(full['windspeed'],kde=True,color='green')  # histplot replaces the deprecated distplot
plt.subplot(2,2,4)
sns.histplot(full['humidity'],kde=True)
plt.show()

* A large share of the rows have windspeed equal to 0. This may be a form of missing data: the data collector may have filled missing values with 0. To test this guess, we look at the windspeed distribution by year and by season.

In [None]:
g = sns.FacetGrid(full,col='season',row='year')
g.map(sns.histplot,'windspeed',stat='probability')
g.set_ylabels('probability')
plt.show()

* The windspeed distribution in each year and season is almost the same as the overall distribution. This suggests that, although windspeed is far from normally distributed, the zeros are most likely how the data naturally looks rather than filled-in missing values.
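A numeric companion to the faceted histograms is the share of zero readings per year and season. The sketch below runs on synthetic data (the `demo` frame, the gamma-distributed speeds and the 12% zero rate are all assumptions chosen for illustration); on the notebook's data the same expression would be applied to `full`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the combined data; every value here is invented.
rng = np.random.default_rng(1)
n = 2000
demo = pd.DataFrame({
    "year": rng.choice([2011, 2012], size=n),
    "season": rng.choice([1, 2, 3, 4], size=n),
    "windspeed": rng.gamma(shape=2.0, scale=6.0, size=n).round(0),
})
# Force roughly 12% of the readings to be exactly zero.
demo.loc[rng.random(n) < 0.12, "windspeed"] = 0.0

# Share of zero windspeed readings for each (year, season) cell.
zero_share = (
    demo.assign(is_zero=demo["windspeed"].eq(0))
        .groupby(["year", "season"])["is_zero"]
        .mean()
        .unstack("season")
)
print(zero_share.round(3))
```

If the zero share is similar in every cell, the zeros are not concentrated in one period, which is consistent with the reading given above.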

# 4. Feature selection and feature engineering

**First, let us handle the time-related features: datetime, season, holiday, workingday, date, month and hour. Based on the analysis above, we keep only some of them:**
* datetime is not used directly; it is split into year, month and hour, and those derived features are used instead.
* holiday is dropped because it barely separates the data.
* month stands in for season. In the end we keep only four time-related features: year, month, dayname (day of the week) and hour.
* Both month and season capture how demand fluctuates over the year, and season does so with fewer distinct values. However, a season spans several months, and the next finer time feature we keep is the day of the week. If we used season, the same weekday would get the same prediction across a span of three months, which clearly contradicts the steady upward trend we observed. So we pick month instead of season.

Because we will use tree-based ensemble regressors later, there is no need to dummy-encode ordinal features such as month and year.

In addition, the hourly analysis above showed three clear regimes in daily demand: low, steady and peak. We therefore recode hour into three classes, 1, 2 and 3, for the low, steady and peak periods respectively. By reducing the complexity of the data, we hope the model will perform better.

In [None]:
peak = [8,16,17,18,19]
low = [22,23,0,1,2,3,4,5,6]
full['hour']=full['hour'].apply(lambda x: 3 if x in peak else (1 if x in low else 2) )

* With the time-related features handled, we continue with the remaining features. The analysis so far has already dropped some that contribute little to the model.
* There are two temperature indicators, temp and atemp. We keep temp and drop atemp, which is the only non-time feature we need to discard.

In [None]:
to_drop_features = ['datetime','season','holiday','atemp','date','count','workingday',]
full.drop(to_drop_features,axis=1,inplace=True)

In [None]:
full[['month','hour','dayname','year','weather']]=full[['month','hour','dayname','year','weather']].astype('object')

# 5. Model selection and parameter tuning
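The cells in this section fix `n_estimators` and `max_depth` by hand, although `GridSearchCV` is imported at the top of the notebook. As a rough sketch of what a search could look like, the example below tunes a random forest on synthetic data; the data, the grid values and the `random_state` settings are assumptions for illustration, not the notebook's tuned configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic features and non-negative count-like targets (invented data).
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(300, 4))
y = np.round(50 * X[:, 0] + 20 * X[:, 1] ** 2 + rng.uniform(0, 5, size=300))

# A deliberately small grid so the sketch runs quickly.
param_grid = {"n_estimators": [50, 100], "max_depth": [4, 7]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",  # on a log1p target this is a squared log error
    cv=3,
)

# Fit on the log scale, mirroring the log1p target used by the models in this section.
search.fit(X, np.log1p(y))

print(search.best_params_)
print(-search.best_score_)
```

Because the target is already `log1p`-transformed, the plain squared-error scorer ranks candidates by their error on the log scale, which matches the metric used to report errors in this section.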

In [None]:
# dataset for the regression model that predicts casual demand
X_casual =  full.loc[full['casual'].notnull()].drop(['casual','registered'],axis=1)
y_casual = full.loc[full['casual'].notnull(),['casual']]
predict_X_casual =  full.loc[full['casual'].isnull()].drop(['casual','registered'],axis=1)
train_X_casual, test_X_casual, train_y_casual, test_y_casual = train_test_split(X_casual,y_casual,shuffle=True)

In [None]:
# dataset for the regression model that predicts registered demand
X_registered =  full.loc[full['registered'].notnull()].drop(['registered','casual'],axis=1)
y_registered = full.loc[full['registered'].notnull(),['registered']]
predict_X_registered =  full.loc[full['registered'].isnull()].drop(['registered','casual'],axis=1)
train_X_registered, test_X_registered, train_y_registered, test_y_registered = train_test_split(X_registered,y_registered,shuffle=True)

In [None]:
rfc_casual = RandomForestRegressor(n_estimators=300,max_depth=7)
rfc_casual.fit(train_X_casual,np.log1p(train_y_casual))
rfc_casual.predict(predict_X_casual)

In [None]:
rfc_registered = RandomForestRegressor(n_estimators=300,max_depth=8)
# fit the registered model on the registered training split
rfc_registered.fit(train_X_registered,np.log1p(train_y_registered))
rfc_registered.predict(predict_X_registered)

In [None]:
print('train_error:{}'.format(mean_squared_log_error(train_y_casual,np.expm1(rfc_casual.predict(train_X_casual)))))
print('test_error:{}'.format(mean_squared_log_error(test_y_casual,np.expm1(rfc_casual.predict(test_X_casual)))))
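
Both models are trained on `log1p` targets, so their raw predictions are on the log scale. A possible final step, sketched here with made-up prediction arrays, is to map each prediction back with `expm1` and add the casual and registered parts to get the total count. The `datetime` and `count` column names are assumed to match the sample submission file; the arrays and timestamps are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical log-scale predictions from the two models (invented numbers).
log_pred_casual = np.array([1.2, 0.0, 3.4])
log_pred_registered = np.array([4.1, 2.3, 5.6])

# Undo the log1p transform for each model separately.
pred_casual = np.expm1(log_pred_casual)
pred_registered = np.expm1(log_pred_registered)

# Total demand is the sum of both parts; clip guards against tiny negatives.
pred_count = np.clip(pred_casual + pred_registered, 0, None)

# Assumed submission layout: one row per test timestamp.
submission = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2011-01-20 00:00:00",
        "2011-01-20 01:00:00",
        "2011-01-20 02:00:00",
    ]),
    "count": pred_count,
})
print(submission)
```

The sum has to be taken after the inverse transform: adding the two log-scale predictions first would multiply the demands instead of adding them.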