1. Intro about the data
2. Data import and exploration (train) [take ideas from my github] - remember about outliers
3. Data preprocessing (one-hot, regularisation, etc)
4. Feature engineering
5. Model selection and rationale
5.1. Shallow learning
5.2. Deep learning (see the TensorFlow models I did for Andrew as an inspiration)

6. Compare results with the leaderboard, submit results.

#### Ideas: 
    - Consider converting time to a continuous variable that'd take into account that 00:01 is closer to 23:59 than to 00:10
    - Consider doing the deep learning spiel in a separate file, using the preprocessed data produced in this file (probably more efficient for use on FloydHub)
    - Consider using tools for finding sensible parameters
  

#### 1. Introduction

In [56]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline


Variables included in the data: 

    datetime   - hourly date + timestamp  
    
    season     -  1 = spring,
                  2 = summer,
                  3 = fall,
                  4 = winter 
                  
    holiday    - whether the day is considered a holiday
    
    workingday - whether the day is neither a weekend nor holiday
    
    weather    -  1: Clear, Few clouds, Partly cloudy
                  2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
                  3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
                  4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
                 
    temp       - temperature in Celsius
    
    atemp      - "feels like" temperature in Celsius
    
    humidity   - relative humidity
    
    windspeed  - wind speed
    
    casual     - number of non-registered user rentals initiated
    
    registered - number of registered user rentals initiated
    
    count      - number of total rentals

#### 2. Data import and exploration

In [57]:
data = pd.read_csv("./data/train.csv")

In [58]:
data.head()


Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1
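Since `count` is described as the total number of rentals, it should equal `casual + registered` on every row. A minimal sketch of that sanity check, using the five rows printed above typed in by hand (the `sample` frame and the `totals_consistent` helper are illustrative names, not part of the notebook):

```python
import pandas as pd

# First five rows of train.csv as printed by data.head(), entered by hand
sample = pd.DataFrame({
    "casual": [3, 8, 5, 3, 0],
    "registered": [13, 32, 27, 10, 1],
    "count": [16, 40, 32, 13, 1],
})

def totals_consistent(df):
    """Return True if casual + registered equals count on every row."""
    return bool((df["casual"] + df["registered"] == df["count"]).all())

print(totals_consistent(sample))
```

On the full data the call would presumably be `totals_consistent(data)`. If it holds everywhere, `casual` and `registered` are just a split of the target rather than independent predictors.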


#### 3. Preprocessing and feature extraction

    Let's import the train and test datasets.

In [59]:
train_df = pd.read_csv("./data/train.csv") 
test_df = pd.read_csv("./data/test.csv") 
combined_df = [train_df, test_df] #combined datasets - I'll use this in loops replacing values in both datasets 

In [60]:
# Check: Is this necessary?

In [61]:
train_df.head(20)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1
5,2011-01-01 05:00:00,1,0,0,2,9.84,12.88,75,6.0032,0,1,1
6,2011-01-01 06:00:00,1,0,0,1,9.02,13.635,80,0.0,2,0,2
7,2011-01-01 07:00:00,1,0,0,1,8.2,12.88,86,0.0,1,2,3
8,2011-01-01 08:00:00,1,0,0,1,9.84,14.395,75,0.0,1,7,8
9,2011-01-01 09:00:00,1,0,0,1,13.12,17.425,76,0.0,8,6,14


In [62]:
train_df.info()
print('_'*40)
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
datetime      10886 non-null object
season        10886 non-null int64
holiday       10886 non-null int64
workingday    10886 non-null int64
weather       10886 non-null int64
temp          10886 non-null float64
atemp         10886 non-null float64
humidity      10886 non-null int64
windspeed     10886 non-null float64
casual        10886 non-null int64
registered    10886 non-null int64
count         10886 non-null int64
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.6+ KB
________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6493 entries, 0 to 6492
Data columns (total 9 columns):
datetime      6493 non-null object
season        6493 non-null int64
holiday       6493 non-null int64
workingday    6493 non-null int64
weather       6493 non-null int64
temp          6493 non-null float64
atemp         6493 non-null float64
humidity 

    Excellent, no missing data!
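The `info()` printout reports non-null counts column by column, which is easy to misread on longer frames. A more direct check is to count missing values explicitly. The sketch below uses a toy frame with one deliberate gap; `missing_report` and `toy` are hypothetical names introduced only for illustration:

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Return the number of missing values per column, keeping only columns that have any."""
    counts = df.isnull().sum()
    return counts[counts > 0]

# Toy frame with one gap, standing in for train_df / test_df
toy = pd.DataFrame({"temp": [9.84, np.nan, 9.02], "humidity": [81, 80, 80]})
print(missing_report(toy))
```

An empty result from `missing_report(train_df)` and `missing_report(test_df)` would confirm that neither dataset has gaps.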

    Let's look at each variable individually to understand the data better...

datetime    
season
holiday
workingday 
weather                     
temp         
atemp          
humidity  
windspeed
casual     
registered 
count

(A) datetime

In [63]:
train_df["datetime"].head()

0    2011-01-01 00:00:00
1    2011-01-01 01:00:00
2    2011-01-01 02:00:00
3    2011-01-01 03:00:00
4    2011-01-01 04:00:00
Name: datetime, dtype: object

We can use pandas' `DatetimeIndex` to extract the day of the month, day of the week, week of the year, month, year, and hour, and assign them to new features.

In [64]:
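for df in combined_df:
    # Parse the timestamp strings once, then split them into calendar features
    date = pd.DatetimeIndex(df['datetime'])
    df['date'] = date.date
    df['day'] = date.day
    df['month'] = date.month
    df['year'] = date.year
    df['hour'] = date.hour
    df['day_of_week'] = date.dayofweek  # Monday = 0, Sunday = 6
    # DatetimeIndex.weekofyear was removed in pandas 2.0; isocalendar() gives the ISO week number
    df['week_of_year'] = date.isocalendar().week.to_numpy(dtype='int64')
    # df['day_of_year'] = date.dayofyear

    # The raw timestamp is no longer needed once its parts are extracted
    df.drop('datetime', axis=1, inplace=True)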


In [82]:
# Let's encode hours, months, and days of the week to preserve their cyclical nature 
for df in combined_df:

    df['hour_sin'] = np.sin(df.hour*(2.*np.pi/24))
    df['hour_cos'] = np.cos(df.hour*(2.*np.pi/24))

    df['dow_sin'] = np.sin(df.day_of_week*(2.*np.pi/7))
    df['dow_cos'] = np.cos(df.day_of_week*(2.*np.pi/7))

    df['month_sin'] = np.sin((df.month-1)*(2.*np.pi/12))
    df['month_cos'] = np.cos((df.month-1)*(2.*np.pi/12))
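To see why the sine/cosine pair preserves the cyclical structure, it may help to compare distances between encoded hours. This is a small standalone sketch using only the standard library; `encode_hour` and `circle_distance` are illustrative helpers, not functions used elsewhere in the notebook:

```python
import math

def encode_hour(hour):
    """Map an hour in [0, 24) to a point on the unit circle."""
    angle = 2.0 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)

def circle_distance(h1, h2):
    """Euclidean distance between two encoded hours."""
    return math.dist(encode_hour(h1), encode_hour(h2))

# 23:00 and 00:00 are one hour apart, just like 00:00 and 01:00
print(circle_distance(23, 0), circle_distance(0, 1))
# Raw hour values would instead put 23 and 0 at a distance of 23
print(abs(23 - 0))
```

The two printed distances on the first line should match, whereas the raw hour column treats 23:00 and 00:00 as the two most distant values. The same reasoning applies to `day_of_week` (period 7) and `month - 1` (period 12).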

In [78]:

def onehot_encode(df, varlist, drop_org=True):
    """Append one-hot (dummy) columns for each variable in varlist.

    If drop_org is True, the original columns are dropped afterwards.
    """
    for var in varlist:
        dummies = pd.get_dummies(df[var], prefix=var)
        df = pd.concat([df, dummies], axis=1)

    if drop_org:
        df = df.drop(columns=varlist)

    return df

In [79]:
to_onehot_encode = ['season', 'weather']

# drop_org=False keeps the original 'season' and 'weather' columns alongside the dummies
train_df = onehot_encode(df=train_df, varlist=to_onehot_encode, drop_org=False)
test_df = onehot_encode(df=test_df, varlist=to_onehot_encode, drop_org=False)

In [80]:
train_df.keys()

Index(['season', 'holiday', 'workingday', 'weather', 'temp', 'atemp',
       'humidity', 'windspeed', 'casual', 'registered', 'count', 'date', 'day',
       'month', 'year', 'hour', 'day_of_week', 'week_of_year', 'hour_sin',
       'hour_cos', 'dow_sin', 'dow_cos', 'month_sin', 'month_cos', 'season_1',
       'season_2', 'season_3', 'season_4', 'weather_1', 'weather_2',
       'weather_3', 'weather_4'],
      dtype='object')
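One thing to watch when one-hot encoding train and test separately: if a category (say, `weather` = 4) happens to be absent from one of the datasets, the two frames end up with different dummy columns. Whether that happens here has not been checked; the sketch below only shows one possible way to patch it, with `add_missing_dummies` and the toy frames being hypothetical names:

```python
import pandas as pd

def add_missing_dummies(reference, other, prefix):
    """Add to `other` any `prefix` dummy column present in `reference` but absent from `other`."""
    out = other.copy()
    for col in reference.columns:
        if col.startswith(prefix) and col not in out.columns:
            out[col] = 0  # the category never occurs in `other`
    return out

# Toy frames: the "test" side never sees weather category 4
train_toy = pd.get_dummies(pd.DataFrame({"weather": [1, 2, 3, 4]}), columns=["weather"])
test_toy = pd.get_dummies(pd.DataFrame({"weather": [1, 2, 3]}), columns=["weather"])

test_toy = add_missing_dummies(train_toy, test_toy, prefix="weather_")
print(list(test_toy.columns))
```

Comparing `set(train_df.columns) - set(test_df.columns)` after encoding is a quick way to find out whether this step is needed at all; the target columns (`casual`, `registered`, `count`) are expected to show up in that difference.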

#### 4. Modelling

#### 5. Conclusion

Check to be performed with RMSLE:

    import numpy as np

    def rmsle(h, y):
        """
        Compute the Root Mean Squared Log Error for hypothesis h and targets y

        Args:
            h - numpy array containing predictions with shape (n_samples, n_targets)
            y - numpy array containing targets with shape (n_samples, n_targets)
        """
        return np.sqrt(np.square(np.log1p(h) - np.log1p(y)).mean())
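A short usage sketch for the metric, which also illustrates that RMSLE is the same as ordinary RMSE computed on `log1p`-transformed values. That suggests one option for the modelling step (an assumption, not something decided in this notebook): train on `np.log1p(count)` and map predictions back with `np.expm1`. The predictions below are made up purely for illustration, and `rmse` is a helper introduced here:

```python
import numpy as np

def rmsle(h, y):
    """Root Mean Squared Log Error between predictions h and targets y."""
    return np.sqrt(np.mean(np.square(np.log1p(h) - np.log1p(y))))

def rmse(a, b):
    """Plain Root Mean Squared Error."""
    return np.sqrt(np.mean(np.square(a - b)))

y_true = np.array([16.0, 40.0, 32.0, 13.0, 1.0])  # counts from the first five training rows
y_pred = np.array([20.0, 35.0, 30.0, 10.0, 2.0])  # made-up predictions for illustration

print(rmsle(y_pred, y_true))
print(rmse(np.log1p(y_pred), np.log1p(y_true)))  # same value as the line above
print(rmsle(y_true, y_true))                     # perfect predictions give zero error
```

Because the metric takes logarithms, predictions must not fall below -1; clipping model output at zero before scoring avoids that.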