# Victor Vu Dask

For this project, we were given two datasets from Kaggle (https://www.kaggle.com/marklvl/bike-sharing-dataset/home) containing information about Capital Bikeshare, the bike sharing service in Washington, D.C.

One dataset contains hourly data and the other daily data, both covering the years 2011 and 2012.

The following variables are included in the data:

* instant: Record index
* dteday: Date
* season: Season (1:spring, 2:summer, 3:fall, 4:winter)
* yr: Year (0: 2011, 1:2012)
* mnth: Month (1 to 12)
* hr: Hour (0 to 23, only available in the hourly dataset)
* holiday: Whether the day is a holiday or not (extracted from Holiday Schedule)
* weekday: Day of the week
* workingday: 1 if the day is neither a weekend nor a holiday, otherwise 0
* weathersit: Weather situation (extracted from Freemeteo)
    1: Clear, Few clouds, Partly cloudy
    2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
* temp: Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)
* atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)
* hum: Normalized humidity. The values are divided by 100 (max)
* windspeed: Normalized wind speed. The values are divided by 67 (max)
* casual: Count of casual users
* registered: Count of registered users
* cnt: Count of total rental bikes, including both casual and registered (our target variable)
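
The normalization formulas in the list above can be inverted to recover physical units. The following is a minimal sketch, assuming the bounds quoted above; the names `TEMP_RANGE`, `ATEMP_RANGE` and `denormalize` are hypothetical helpers and are not part of the notebook or of helpers.py.

```python
# Bounds quoted in the variable list (degrees Celsius).
TEMP_RANGE = (-8.0, 39.0)
ATEMP_RANGE = (-16.0, 50.0)


def denormalize(value, bounds):
    """Invert (t - t_min) / (t_max - t_min) to recover t."""
    t_min, t_max = bounds
    return value * (t_max - t_min) + t_min


# First hourly record: temp = 0.24, hum = 0.81
temp_c = denormalize(0.24, TEMP_RANGE)   # about 3.28 degrees Celsius
humidity_pct = 0.81 * 100                # humidity was divided by 100
```

This is only useful for interpreting plots and results; the models can work on the normalized values directly.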

We are tasked with building a predictive model that estimates how many people will use the service on an hourly basis. We train on the first seven quarters of the data (January 2011 through September 2012) and hold out the last quarter of 2012 for validation. Since the holdout data is not used for training, the evaluation metric computed on it (R2 score) is a fair measure of the model's predictive power.
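
The time-based split and the R2 metric described above can be sketched as follows. This is an illustration on a made-up toy frame, not the notebook's actual split; `toy`, `HOLDOUT_START` and `r2` are hypothetical names, and the notebook itself imports `r2_score` from scikit-learn for the metric.

```python
import pandas as pd

# Toy frame standing in for the hourly data (values are made up).
toy = pd.DataFrame({
    "dteday": pd.to_datetime(
        ["2011-01-01", "2012-09-30", "2012-10-01", "2012-12-31"]
    ),
    "cnt": [16, 57, 40, 120],
})

HOLDOUT_START = pd.Timestamp("2012-10-01")  # first day of Q4 2012

# Everything before the last quarter of 2012 is used for training.
train = toy[toy["dteday"] < HOLDOUT_START]
holdout = toy[toy["dteday"] >= HOLDOUT_START]


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```

Splitting by date rather than at random keeps future observations out of the training set, which matters for time-ordered data like this.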

### Planning

Initially, we decided to split the project into four steps:

Data Loading and Exploratory Data Analysis: Load the data and analyze it to obtain an accurate picture of its features, its values (and whether they are incomplete or wrong) and its data types, among others. We also create different types of plots to help us understand the data and make model creation easier.

Data Preparation and Feature Engineering: Once we have the data, we prepare it for the modeling stage by standardizing it, changing data types and dropping features, among others. We then create new features and select among them based on criteria such as correlation.

Modeling and Tuning: Once the data is ready, the modeling stage begins. Using different models (and ensembles) and a strong pipeline with different transformers, we hope to produce a model that meets our performance expectations. Once we have that model, we tune it on the training data.

Results and Conclusions: Finally, with our tuned models, we predict against the test set we separated initially, plot the predictions against the actual values to assess the performance of the model, and outline our conclusions.


### Notes

For the code to run, you must install the following libraries:
* Seaborn (aesthetic plots) Version 0.9.0
* Xgboost (boosting model) Version 0.82
* Gplearn (genetic features) Version 0.3.0

The following cell installs these libraries. If you wish to do so, uncomment the cell and run it.

A file called helpers.py, included inside the zip folder, is also needed to run the code. It contains the functions created throughout the project, collected in one place to declutter the notebook.

In [1]:
# ! pip install seaborn==0.9.0
# ! pip install xgboost==0.82
# ! pip install gplearn==0.3.0

In [7]:
import warnings
import numpy as np
import pandas as pd
import dask.dataframe as dd
import seaborn as sns
import plotly.tools as tls
import plotly.plotly as py
from sklearn.base import clone
import plotly.graph_objs as go
import matplotlib.pyplot as plt
from xgboost import XGBRegressor
from sklearn.decomposition import PCA
from gplearn.genetic import SymbolicTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score as metric_scorer
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.preprocessing import PolynomialFeatures, KBinsDiscretizer, PowerTransformer
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor

tls.set_credentials_file(username='alejandro321', api_key='yBVtyuhfWpl3rH4TrOGE')
warnings.filterwarnings('ignore')

In [8]:
SEED = 1
DATA_PATH = 'https://gist.githubusercontent.com/f-loguercio/f5c10c97fe9afe58f77cd102ca81719b/raw/99fb846b22abc8855de305c2159a57a77c9764cf/bikesharing_hourly.csv'
DATA_PATH2 = 'https://gist.githubusercontent.com/f-loguercio/14ac934fabcca41093a51efef335f8f2/raw/58e00b425c711ac1da2fb75f851f4fc9ce814cfa/bikesharing_daily.csv'
PREC_PATH = 'https://gist.githubusercontent.com/akoury/6fb1897e44aec81cced8843b920bad78/raw/b1161d2c8989d013d6812b224f028587a327c86d/precipitation.csv'
TARGET_VARIABLE = 'cnt'
ESTIMATORS = 50

### Data Loading

Here we load the necessary data and print its first rows.

In [82]:
def read_data(input_path):
    return dd.read_csv(input_path, parse_dates=[1])

data = read_data(DATA_PATH)
data_daily = read_data(DATA_PATH2)

data.head()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
1,2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40
2,3,2011-01-01,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,27,32
3,4,2011-01-01,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,3,10,13
4,5,2011-01-01,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,0,1,1


### Precipitation Data

As an additional input for our model, we add precipitation data obtained from the National Climatic Data Center (https://www.ncdc.noaa.gov/cdo-web/datasets).

However, since most of the values are 0, we convert them to a boolean flag that indicates whether or not rain was present at that specific hour.

In [83]:
precipitation = read_data(PREC_PATH)
data = dd.merge(data, precipitation,  how='left', on=['dteday','hr'])
data = data.fillna(0)
data['precipitation'] = data['precipitation'].apply(lambda x: 1 if x > 0 else 0)
data['precipitation'] = data['precipitation'].astype('category')
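
The conversion to a rain/no-rain flag can also be written as a vectorized comparison instead of a row-wise `apply`. The following is a minimal pandas sketch on a made-up toy frame (the name `toy` is hypothetical); it is an alternative formulation, not the code the notebook ran.

```python
import pandas as pd

# Toy column: a missing value stands for an hour with no precipitation record.
toy = pd.DataFrame({"precipitation": [0.0, 0.3, None, 1.2]})

# Missing hours count as "no rain"; any positive amount counts as "rain".
toy["precipitation"] = (
    (toy["precipitation"].fillna(0) > 0).astype(int).astype("category")
)
```

The comparison produces the same 0/1 values as the lambda, and it avoids calling a Python function once per row.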

### Check for na 

In [95]:
data.isna().sum().compute()

instant          0
dteday           0
season           0
yr               0
mnth             0
hr               0
holiday          0
weekday          0
workingday       0
weathersit       0
temp             0
atemp            0
hum              0
windspeed        0
casual           0
registered       0
cnt              0
precipitation    0
dtype: int64

In [109]:
data_hourly = data.copy()
data_hourly = data_hourly[data_hourly['dteday'].isin(pd.date_range('2011-01-01','2012-09-30'))]

In [111]:
data_hourly.tail()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt,precipitation
15206,15207,2012-09-30,4,1,9,19,0,0,0,3,0.5,0.4848,0.72,0.1343,34,223,257,0
15207,15208,2012-09-30,4,1,9,20,0,0,0,3,0.5,0.4848,0.72,0.1343,31,163,194,0
15208,15209,2012-09-30,4,1,9,21,0,0,0,1,0.5,0.4848,0.68,0.0,19,104,123,0
15209,15210,2012-09-30,4,1,9,22,0,0,0,1,0.48,0.4697,0.72,0.0,15,76,91,0
15210,15211,2012-09-30,4,1,9,23,0,0,0,1,0.48,0.4697,0.72,0.0896,8,49,57,0


### Converting columns to their true categorical type
Now we convert the data types of numerical columns that are actually categorical.

In [93]:
data[['season', 'yr','mnth','hr','holiday','weekday','workingday','weathersit']]= data[['season', 'yr','mnth','hr','holiday','weekday','workingday','weathersit']].astype('category')

In [108]:
data.dtypes

instant                   int64
dteday           datetime64[ns]
season                 category
yr                     category
mnth                   category
hr                     category
holiday                category
weekday                category
workingday             category
weathersit             category
temp                    float64
atemp                   float64
hum                     float64
windspeed               float64
casual                    int64
registered                int64
cnt                       int64
precipitation          category
dtype: object

### Check for missing data

In [127]:
data.isnull().sum().compute()

instant          0
dteday           0
season           0
yr               0
mnth             0
hr               0
holiday          0
weekday          0
workingday       0
weathersit       0
temp             0
atemp            0
hum              0
windspeed        0
casual           0
registered       0
cnt              0
precipitation    0
registered_1     0
dtype: int64

In [148]:
data['registered_1'] = data['registered'].shift(1).rename(str('registered') + '_' + str(1))

In [135]:
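# Note: a negative shift pulls values from 24 rows ahead (a lead, not a lag);
# a 24-hour lag would be shift(24).
data['registered_24'] = data['registered'].shift(-24).rename(str('registered') + '_' + str(24))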

In [149]:
data.head()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt,precipitation,registered_1,registered_24
0,1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16,1,,13.0
1,2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40,1,13.0,16.0
2,3,2011-01-01,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,27,32,1,32.0,8.0
3,4,2011-01-01,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,3,10,13,1,27.0,4.0
4,5,2011-01-01,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,0,1,1,1,10.0,1.0


In [113]:
def add_lag(df, col, lag):
    lagged = df[col].shift(lag).rename(str(col) + '_' + str(lag))
    lagged[0:(lag)] = lagged[lag:(lag*2)]
    return lagged

data = pd.concat([data, add_lag(data, 'registered', 1), add_lag(data, 'registered', 24)], axis = 1)
data.head()

NotImplementedError: 
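
The traceback is not shown, so the cause of the error above is a guess: the helper relies on slice-based item access and assignment, which are pandas operations that a Dask series may not support. The following is a sketch of the same helper on a plain pandas frame, assuming the data fits in memory (for example after calling `data.compute()`); the frame `toy` is a made-up stand-in.

```python
import pandas as pd


def add_lag(df, col, lag):
    """Return `col` shifted down by `lag` rows, with the leading gap filled.

    The first `lag` rows have no history, so they are filled with the
    next `lag` shifted values, mirroring the intent of the original helper.
    Assumes the frame has at least 2 * lag rows.
    """
    lagged = df[col].shift(lag).rename(f"{col}_{lag}")
    lagged.iloc[:lag] = lagged.iloc[lag:2 * lag].to_numpy()
    return lagged


# Toy frame using the first five `registered` values from the hourly data.
toy = pd.DataFrame({"registered": [13, 32, 27, 10, 1]})
toy = pd.concat(
    [toy, add_lag(toy, "registered", 1), add_lag(toy, "registered", 2)],
    axis=1,
)
```

A positive `lag` is used here so that each row only sees past values of `registered`.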