# Washington DC Biking data | Hourly Bike Count Prediction

## 2. Data Preparation & Feature Engineering
MBD O-1-5

### Notebook preperation

In [1]:
%matplotlib inline

# To automatically reload the function file 
%load_ext autoreload
%aimport My_Functions
%run My_Functions.py
%autoreload 1

In [2]:
# Data Import
hourly_raw_data = dd.read_csv('hour.csv')
hourly_raw_data_pd = pd.read_csv('hour.csv')

In [3]:
hourly_raw_data.head()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
1,2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40
2,3,2011-01-01,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,27,32
3,4,2011-01-01,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,3,10,13
4,5,2011-01-01,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,0,1,1


In [4]:
hourly_raw_data_pd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (total 17 columns):
instant       17379 non-null int64
dteday        17379 non-null object
season        17379 non-null int64
yr            17379 non-null int64
mnth          17379 non-null int64
hr            17379 non-null int64
holiday       17379 non-null int64
weekday       17379 non-null int64
workingday    17379 non-null int64
weathersit    17379 non-null int64
temp          17379 non-null float64
atemp         17379 non-null float64
hum           17379 non-null float64
windspeed     17379 non-null float64
casual        17379 non-null int64
registered    17379 non-null int64
cnt           17379 non-null int64
dtypes: float64(4), int64(12), object(1)
memory usage: 2.3+ MB


In [5]:
hourly_raw_data.info()

<class 'dask.dataframe.core.DataFrame'>
Columns: 17 entries, instant to cnt
dtypes: object(1), float64(4), int64(12)

In [6]:
#hourly_raw_data['dteday']= hourly_raw_data['dteday'].
hourly_raw_data['dteday']= dd.to_datetime(hourly_raw_data.dteday ,unit='ns')

In [7]:
hourly_raw_data.info()

<class 'dask.dataframe.core.DataFrame'>
Columns: 17 entries, instant to cnt
dtypes: datetime64[ns](1), float64(4), int64(12)

# Feature Engineering

### Converting `dteday` to date

## Add `isDaylight` and  `isNoon` for hourly data

Astral module is used to calculate flags for daylight and noon time. 

A customized function is defined to classify a row as daylight. If the hour of a record is less than the hour of sunset in Washington DC and more than the time of sunrise, it is flagged as daylight, otherwise it is flagged as not daylight.
Noon time flag is also created using a customized function. If the hour of a record is equal to the hour of noon in Washington DC, it is flagged as noon, otherwise it is flagged as not noon. 

In [8]:
hourly_raw_data['isDaylight']=0
hourly_raw_data['isNoon']=0

hourly_raw_data = hourly_raw_data.apply(lambda x: isDaylight(x), axis=1, meta=int)

In [11]:
hourly_raw_data.info()

AttributeError: 'Series' object has no attribute 'info'

### Adding the temp atemp windspeed hum relative to the last 7 days value

In exploratory data analysis, it was found that there are outliers in seasonal variables. In order to make a robust model that is able to predict outliers, new variables are created for temp, atemp, hum, and wind speed. In case of temp, mean of temp for the last seven days is deducted from current temp value and the resulting value is divided by standard deviation of temp for the last seven days.

In [36]:
to_relative  = ['temp', 'atemp', 'hum','windspeed']
hourly_raw_data = relative_values(hourly_raw_data, to_relative)

NotImplementedError: Series getitem in only supported for other series objects with matching partition structure

### Adding `RushHour-High` & 	`RushHour-Med`  &	`RushHour-Low`

The interactive time series shows that there are variations in casual, registered, and total bikers during the span of a day. This realization led to creation of a rush hour flag. The logic for this flag is as follows:

Working Day:  
10:00 AM to 6:00 PM is flagged as high rush hour. 7:00 PM to Midnight and 8:00 AM and 9:00 AM  are flagged as medium rush hour. Whereas, all other hours are flagged as low rush hour.  
Holiday:  
7:00 AM to 9:00 AM and 4:00 to 8:00 PM is flagged as high rush hour. 6:00 AM, 10:00 AM till 1:00 PM,  3:00 PM, and 9:00 PM till 11:00 PM are flagged as medium rush hour. Whereas, all other hours are flagged as low rush hour.

In [10]:
hourly_raw_data['RushHour-High']= 0
hourly_raw_data['RushHour-Med']= 0
hourly_raw_data['RushHour-Low'] = 0

#hourly_raw_data = hourly_raw_data.apply(lambda x: addRushHourFlags(x), axis=1, meta=int)

TypeError: 'Series' object does not support item assignment

## Splitting Data

In [None]:
workingdays = num_name(hourly_raw_data.loc[(hourly_raw_data['workingday'].isin([1]) )])
holidays = num_name(hourly_raw_data.loc[(~hourly_raw_data['workingday'].isin([1]) )])

## Mean of the past 3 weeks during the same hour

Exploratory data analysis highlighted outliers in total bikers. In order to make a robust model that is able to predict outliers, new variable is created for total bikers. Mean of  total bikers in the last three weeks for the same hour as the current row’s hour is computed and added as a new variable to the dataset. This variable was created separately for working days and holidays as they depict different patterns. 

In [None]:
workingdays= mean_per_hour_3weeks(workingdays)
holidays = mean_per_hour_3weeks(holidays)

### One hot Encoding | 2x for splitted datasets
For `season`, `weathersit`, `mnth`,`weekday`,`hr`

In [None]:
category  = ['season', 'weathersit', 'mnth','weekday','hr']

workingdays = onehot_encode(workingdays,category)
workingdays  = workingdays.drop('instant',axis=1)

holidays = onehot_encode(holidays,category)
holidays  = holidays.drop('instant',axis=1)

## Genetic Programming | 2x for splitted datasets

A supervised algorithm that uses simple mathematical equations such as summation, multiplication, square root, etc. in order to find a relationship between the existing features and the target. It tries multiple combination of these equations and has a learning process which gets better with the number of generations it is set to have. This function added 15 features each to working days and holidays datasets.

In [None]:
dates = workingdays['dteday']
registered = workingdays['registered']
casual = workingdays['casual']
workingdays = Genetic_P(workingdays.drop(['registered','casual','dteday'],axis=1),'cnt')
workingdays['dteday'] = dates
workingdays['registered'] = registered
workingdays['casual'] = casual

In [None]:
dates = holidays['dteday']
registered = holidays['registered']
casual = holidays['casual']
holidays = Genetic_P(holidays.drop(['registered','casual','dteday'],axis=1),'cnt')
holidays['dteday'] = dates
holidays['registered'] = registered
holidays['casual'] = casual

# Final Datasets

In [None]:
holidays[np.arange(1,15)].head()

In [None]:
holidays.head()

## Save Both datasets

In [None]:
workingdays.to_csv("workingdays_data_prepared.csv", index=False)
holidays.to_csv("weekends_holi_data_prepared.csv", index=False)