# WEEK 04
# Encounter 03 - Metrics and KPIs
# Project Challenge - Error Metrics on a naive forecast

## Task Description

For the bikes dataset, a 'naive' forecast uses the count from the previous year to predict the next year. Use the counts from May 2011 as the forecast for May 2012 and check how far off the predictions are:

   1. Filter the dataset for May 2011 and the count column
   2. Filter the dataset for May 2012 and the count column
   3. Use the above results as the input for `rmse`
   4. How far off on average was this naive prediction?

**Bonus:**

How could the results be improved?

Save your work in a notebook.


In [1]:
import pandas as pd
from sklearn.metrics import mean_squared_error
import math

In [2]:
# reading 'bikes' dataset from csv-file
bikes = pd.read_csv('../data/bikes_with_bins.csv')
bikes.sample(10)

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count,hour,month_name,day_of_week,year,part_of_day
14899,2012-09-18 00:00:00,3,0,1,3,24.6,25.76,94,16.9979,2.0,11.0,13,0,September,Tuesday,2012,night
6232,2011-09-22 06:00:00,3,0,1,2,24.6,25.0,100,8.9981,,,114,6,September,Thursday,2011,morning
13347,2012-07-15 08:00:00,3,0,0,2,28.7,33.335,84,8.9981,37.0,96.0,133,8,July,Sunday,2012,morning
2513,2011-04-19 15:00:00,2,0,1,2,22.14,25.76,65,11.0014,44.0,83.0,127,15,April,Tuesday,2011,afternoon
2482,2011-04-18 08:00:00,2,0,1,1,18.86,22.725,63,0.0,20.0,277.0,297,8,April,Monday,2011,morning
8309,2011-12-17 21:00:00,4,0,0,1,8.2,10.605,75,11.0014,8.0,88.0,96,21,December,Saturday,2011,evening
7652,2011-11-20 11:00:00,4,0,0,2,20.5,24.24,63,7.0015,,,406,11,November,Sunday,2011,morning
10293,2012-03-09 23:00:00,1,0,1,1,13.12,15.15,33,19.9995,9.0,77.0,86,23,March,Friday,2012,night
1891,2011-03-24 14:00:00,2,0,1,2,12.3,15.15,70,8.9981,,,70,14,March,Thursday,2011,afternoon
12199,2012-05-28 12:00:00,2,1,0,1,31.16,35.605,62,8.9981,,,378,12,May,Monday,2012,morning


In [3]:
# converting the 'datetime' column to a datetime dtype
bikes['datetime'] = pd.to_datetime(bikes['datetime'])

### 1. Filter the dataset for May 2011 and the count column

In [4]:
# filtering data using a boolean mask
mask_2011 = (bikes['year']==2011) & (bikes['month_name']=='May')
bikes_may_2011 = bikes[mask_2011][['datetime', 'count']]
bikes_may_2011.shape

(744, 2)

### 2. Filter the dataset for May 2012 and the count column

In [5]:
mask_2012 = (bikes['year']==2012) & (bikes['month_name']=='May')
bikes_may_2012 = bikes[mask_2012][['datetime', 'count']]
bikes_may_2012.shape

(744, 2)
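The two masks above rely on the precomputed `year` and `month_name` columns. Since `datetime` has already been converted, the same filter could also be written with the `.dt` accessor. The following is a minimal sketch on a tiny stand-in frame; the helper `filter_month` and the rows with counts 120 and 80 are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Tiny stand-in for the bikes data (hypothetical rows, same column names).
demo = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2011-04-30 23:00:00",
        "2011-05-01 00:00:00",
        "2011-05-31 23:00:00",
        "2012-05-01 00:00:00",
    ]),
    "count": [120, 96, 80, 35],
})


def filter_month(df, year, month):
    """Return the datetime and count columns for one calendar month."""
    mask = (df["datetime"].dt.year == year) & (df["datetime"].dt.month == month)
    return df.loc[mask, ["datetime", "count"]]


may_2011 = filter_month(demo, 2011, 5)
may_2012 = filter_month(demo, 2012, 5)
print(may_2011.shape, may_2012.shape)
```

Using `.loc[mask, columns]` selects rows and columns in one step instead of chaining two indexing operations.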

### 3. Use the above results as the input for rmse

In [6]:
# getting MSE: May 2012 holds the actual values (y_true), May 2011 is the naive forecast (y_pred)
mse = mean_squared_error(bikes_may_2012['count'], bikes_may_2011['count'])
mse

28459.47177419355

In [7]:
# getting RMSE using MSE
rmse = math.sqrt(mse)
rmse

168.69935321213757
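The RMSE is the square root of the mean of the squared differences between actual and predicted values. The two filtered frames carry different index labels (2786 onwards for 2011, 11539 onwards for 2012), so a manual calculation has to pair the values by position, not by label. A minimal sketch with the first three counts of each month, assuming nothing beyond `numpy` and `pandas`:

```python
import numpy as np
import pandas as pd

# First three hourly counts of each month, with their original index labels.
s11 = pd.Series([96, 59, 50], index=[2786, 2787, 2788])
s12 = pd.Series([35, 21, 8], index=[11539, 11540, 11541])

# Subtracting the Series directly aligns on index labels, which never match
# here, so every entry of the result is NaN.
misaligned = s12 - s11

# Converting to NumPy arrays pairs the values by position instead.
diff = s12.to_numpy() - s11.to_numpy()
rmse_manual = float(np.sqrt(np.mean(diff ** 2)))

print(misaligned.isna().all(), rmse_manual)
```

`mean_squared_error` also compares the inputs position by position, which is why the calculation above works even though the two frames do not share an index.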

### 4. How far off on average was this naive prediction?

In [8]:
bikes_may_2011.head(10)

Unnamed: 0,datetime,count
2786,2011-05-01 00:00:00,96
2787,2011-05-01 01:00:00,59
2788,2011-05-01 02:00:00,50
2789,2011-05-01 03:00:00,23
2790,2011-05-01 04:00:00,17
2791,2011-05-01 05:00:00,10
2792,2011-05-01 06:00:00,13
2793,2011-05-01 07:00:00,33
2794,2011-05-01 08:00:00,59
2795,2011-05-01 09:00:00,141


In [9]:
bikes_may_2012.head(10)

Unnamed: 0,datetime,count
11539,2012-05-01 00:00:00,35
11540,2012-05-01 01:00:00,21
11541,2012-05-01 02:00:00,8
11542,2012-05-01 03:00:00,3
11543,2012-05-01 04:00:00,8
11544,2012-05-01 05:00:00,17
11545,2012-05-01 06:00:00,26
11546,2012-05-01 07:00:00,169
11547,2012-05-01 08:00:00,557
11548,2012-05-01 09:00:00,349


In [10]:
# average bikes count in May 2011
bikes_may_2011['count'].mean()

202.3252688172043

In [11]:
# average bikes count in May 2012
bikes_may_2012['count'].mean()

252.5766129032258
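Comparing the two monthly means shows the overall shift, but "how far off on average" can also be answered per hour with the mean absolute error (MAE). The sketch below is an illustration on the ten rows shown by `head(10)` above, not on the full month, so its numbers differ from the RMSE of 169 reported for all 744 hours.

```python
import numpy as np

# First ten hourly counts from the head(10) outputs above.
forecast_2011 = np.array([96, 59, 50, 23, 17, 10, 13, 33, 59, 141])
actual_2012 = np.array([35, 21, 8, 3, 8, 17, 26, 169, 557, 349])

errors = actual_2012 - forecast_2011

mae = float(np.mean(np.abs(errors)))         # typical size of an error
rmse_head = float(np.sqrt(np.mean(errors ** 2)))  # penalises large errors more
bias = float(np.mean(errors))                # positive: forecast is too low on average

print(f"MAE={mae:.1f}  RMSE={rmse_head:.1f}  bias={bias:.1f}")
```

RMSE is never smaller than MAE; a large gap between the two, as here, points to a few very large misses (the 08:00 and 09:00 rows) rather than a uniform error.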

### RESULTS:
   1. Average bikes count in `May 2011` is **202**.
   2. Average bikes count in `May 2012` is **253**.
   3. `RMSE` is **169**, which is large relative to the average counts: about **84%** of the `2011` average and **67%** of the `2012` average.

In [12]:
# RMSE as a fraction of the average bikes count in May 2011
169/202

0.8366336633663366

In [14]:
# RMSE as a fraction of the average bikes count in May 2012
169/253

0.6679841897233202
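The two ratios above use rounded inputs. As a small check, the same ratios can be computed from the unrounded values printed earlier in the notebook; the variable names below are chosen for this sketch only.

```python
# Unrounded values copied from the cell outputs above.
rmse = 168.69935321213757
mean_2011 = 202.3252688172043
mean_2012 = 252.5766129032258

# RMSE relative to each month's average count.
rel_2011 = rmse / mean_2011
rel_2012 = rmse / mean_2012

print(f"{rel_2011:.1%} of the 2011 average, {rel_2012:.1%} of the 2012 average")
```

With unrounded inputs the first ratio comes out closer to 83% than 84%, while the second stays at about 67%, so the conclusion is unchanged.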

### BONUS
   How could the results be improved?

**Answer:**
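   * RMSE measures how well the forecast matches the actual values: the lower it is, the better the predictions.
   * To reduce the error (lower RMSE), the model should be trained on a bigger dataset, for example on more history than a single month from the previous year.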
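One further idea, offered as a sketch and not as part of the original answer: 1 May 2011 was a Sunday while 1 May 2012 was a Tuesday, so the naive forecast compares weekend hours with commuter hours. Forecasting each hour of May 2012 with the mean 2011 count for the same weekday and hour avoids that mismatch. The example below runs on synthetic data; the demand profile in `synthetic_may`, the growth factor of 1.25, and the column names `dow` and `forecast` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


def synthetic_may(year, scale):
    """Hypothetical hourly counts: commuter peaks on weekdays, midday bump on weekends."""
    index = pd.date_range(f"{year}-05-01", f"{year}-05-31 23:00",
                          freq=pd.Timedelta(hours=1))
    hour = index.hour.to_numpy()
    dow = index.dayofweek.to_numpy()  # Monday=0 ... Sunday=6
    weekday_profile = 50 + 300 * np.isin(hour, [8, 17, 18])
    weekend_profile = 50 + 150 * ((hour >= 11) & (hour <= 16))
    counts = scale * np.where(dow >= 5, weekend_profile, weekday_profile)
    return pd.DataFrame({"datetime": index, "dow": dow, "hour": hour, "count": counts})


may_2011 = synthetic_may(2011, scale=1.0)
may_2012 = synthetic_may(2012, scale=1.25)  # assumed year-over-year growth

# Forecast A: pair the months row by row, as in the notebook above.
rmse_calendar = rmse(may_2012["count"], may_2011["count"])

# Forecast B: mean 2011 count for the same weekday and hour.
profile = (
    may_2011.groupby(["dow", "hour"], as_index=False)["count"]
    .mean()
    .rename(columns={"count": "forecast"})
)
aligned = may_2012.merge(profile, on=["dow", "hour"], how="left")
rmse_weekday = rmse(aligned["count"], aligned["forecast"])

print(f"row-by-row RMSE:      {rmse_calendar:.1f}")
print(f"weekday-aligned RMSE: {rmse_weekday:.1f}")
```

On this synthetic data the weekday-aligned forecast has the lower RMSE by construction; whether the same holds for the real bikes data would need to be checked on `bikes_may_2011` and `bikes_may_2012`.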
