# Goal: improve model evaluation
## Evaluating a model is the most important thing!

Remember this point; it is the most important one.  
If you don't understand your quality function, or don't have a good strategy for evaluating a model, the rest doesn't matter.

Checklist
1. Which data set is used for evaluation? 
2. What is the range of values of the quality function (what is its value for an ideal solution)?
3. What affects the output value of the quality function, and how? 
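Questions 2 and 3 can often be answered with a quick experiment. The sketch below takes the RMSLE metric that this notebook evaluates with further down and feeds it made-up counts; `rmsle_sketch` is a hypothetical stand-alone helper, not part of the notebook's own code. It suggests that the ideal value is 0 and that under-predicting by 50 costs more than over-predicting by 50.

```python
import numpy as np

def rmsle_sketch(actual, predicted):
    """Root mean squared logarithmic error (stand-alone helper for this sketch)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2))

actual = np.array([100, 100, 100])            # made-up rental counts

perfect = rmsle_sketch(actual, actual)        # ideal solution
over    = rmsle_sketch(actual, actual + 50)   # predict 150 instead of 100
under   = rmsle_sketch(actual, actual - 50)   # predict 50 instead of 100

print(perfect, over, under)
```

The asymmetry comes from the logarithm: the metric reacts to the ratio between prediction and actual value rather than to their difference.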

In [45]:
import pandas as pd
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import make_scorer
from random import shuffle

# Read data

In [4]:
train = pd.read_csv('train.csv')
train.head()

|   |datetime           |season|holiday|workingday|weather|temp|atemp |humidity|windspeed|casual|registered|count|
|---|-------------------|------|-------|----------|-------|----|------|--------|---------|------|----------|-----|
|0  |2011-01-01 00:00:00|1     |0      |0         |1      |9.84|14.395|81      |0        |3     |13        |16   |
|1  |2011-01-01 01:00:00|1     |0      |0         |1      |9.02|13.635|80      |0        |8     |32        |40   |
|2  |2011-01-01 02:00:00|1     |0      |0         |1      |9.02|13.635|80      |0        |5     |27        |32   |
|3  |2011-01-01 03:00:00|1     |0      |0         |1      |9.84|14.395|75      |0        |3     |10        |13   |
|4  |2011-01-01 04:00:00|1     |0      |0         |1      |9.84|14.395|75      |0        |0     |1         |1    |


1. **datetime** - hourly date + timestamp  
2. **season** -  1 = spring, 2 = summer, 3 = fall, 4 = winter 
3. **holiday** - whether the day is considered a holiday
4. **workingday** - whether the day is neither a weekend nor holiday
5. **weather** - 
    1: Clear, Few clouds, Partly cloudy 
    2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog 
6. **temp** - temperature in Celsius
7. **atemp** - "feels like" temperature in Celsius
8. **humidity** - relative humidity
9. **windspeed** - wind speed
10. **casual** - number of non-registered user rentals initiated
11. **registered** - number of registered user rentals initiated
12. **count** - number of total rentals

# Build a model

In [7]:
model = ExtraTreesRegressor()

## train & test (data sets)
Prepare two data sets (for training and for testing):

### train data set
**X_train** - features (*matrix*).  
**y_train** - target variable (*vector*).

### test data set
**X_test** - features (*matrix*).  
**y_test** - target variable (*vector*).

In [40]:
train['datetime'] = pd.to_datetime( train['datetime'] )
train['day']      = train['datetime'].map(lambda x: x.day)
    
train.day.unique()

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19])

In [44]:
def train_test_split(data, test_fraction=0.3):
    # Hold out whole days, so hours of one day never land in both sets
    days = list(data.day.unique())
    shuffle(days)
    test_days = days[: int(len(days) * test_fraction)]  # slice index must be an int
    
    data['is_test'] = data.day.isin(test_days)
    df_train = data[~data.is_test]
    df_test  = data[data.is_test]
    
    return df_train, df_test
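To see what a day-based hold-out does, the following sketch applies the same idea to made-up hourly data (10 days of 24 hours, not the real `train.csv`). The names `toy`, `toy_train` and `toy_test` and the fixed seed are assumptions made for the illustration. It checks that no calendar day ends up on both sides of the split.

```python
import numpy as np
import pandas as pd

# Made-up hourly timestamps: 10 days x 24 hours.
stamps = pd.to_datetime('2011-01-01') + pd.to_timedelta(np.arange(10 * 24), unit='h')
toy = pd.DataFrame({'datetime': stamps})
toy['day'] = toy['datetime'].dt.day

rng = np.random.default_rng(0)               # fixed seed so the split is repeatable
days = rng.permutation(toy['day'].unique())  # shuffled copy of the distinct days
test_days = days[: int(len(days) * 0.3)]     # 30% of the days go to the test set

is_test = toy['day'].isin(test_days)
toy_train, toy_test = toy[~is_test], toy[is_test]

# Days that appear in both sets; should be empty for a day-based split.
shared_days = set(toy_train['day']) & set(toy_test['day'])
print(len(test_days), len(toy_train), len(toy_test), shared_days)
```

Splitting by day rather than by row keeps neighbouring hours of the same day, which tend to look alike, from leaking between the training and test sets.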

In [46]:
features = ['season', 'holiday', 'workingday', 'weather', 'temp', 'atemp', 'humidity', 'windspeed']
df_train, df_test = train_test_split(train)

X_train  = df_train[features].values
X_test   = df_test[features].values

y_train  = df_train['count'].values
y_test  = df_test['count'].values


model.fit(X_train, y_train)

ExtraTreesRegressor(bootstrap=False, criterion='mse', max_depth=None,
          max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
          min_samples_split=2, min_weight_fraction_leaf=0.0,
          n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
          verbose=0, warm_start=False)

# Evaluating a model

$$ \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(p_i + 1) - \log(a_i+1))^2 }$$

where  
**n** is the number of hours in the test set  
**pi** is your predicted count  
**ai** is the actual count  
**log(x)** is the natural logarithm  

### Why do we add 1 inside the logarithm? Let's recall the shape of the logarithm function.
The argument must be greater than 0, but in our case **pi** and **ai** can be equal to 0.
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/17/Binary_logarithm_plot_with_ticks.svg/408px-Binary_logarithm_plot_with_ticks.svg.png" />
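A short NumPy check makes the point concrete. The counts below are made up; the sketch shows that the plain logarithm breaks at 0 while the shifted version stays finite, and that NumPy's `np.log1p` computes the same shifted logarithm.

```python
import numpy as np

counts = np.array([0.0, 1.0, 10.0])   # made-up counts, including a zero

with np.errstate(divide='ignore'):    # silence the log(0) warning for the demo
    plain = np.log(counts)            # first entry is -inf

shifted = np.log(counts + 1)          # finite for every count >= 0
same    = np.log1p(counts)            # NumPy's built-in log(1 + x)

print(plain[0], shifted[0], np.allclose(shifted, same))
```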

### Let's play around with a quality function

|pi  |ai |  result |
|----|---|---------|
|1000| 0 | 6.909   |   
|100 | 0 | 4.615   |   
|10  | 0 | 2.398   |  
|5   | 0 | 1.792   |
|1   | 0 | 0.693   |
|0.5 | 0 | 0.405   |
|0.2 | 0 | 0.182   |
|0   | 0 | 0.0     |

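For the sample above, `rmsle(pi, ai)` is **3.136**.
*Note: the plain average of the result column is only* **2.124**, *because RMSLE squares the errors before averaging, so large errors weigh more.*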
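The figures for this table can be recomputed with a few lines of NumPy. This is a minimal sketch that takes the **pi** and **ai** columns exactly as listed; the variable names are assumptions made for the illustration.

```python
import numpy as np

p = np.array([1000, 100, 10, 5, 1, 0.5, 0.2, 0])  # predicted values (pi column)
a = np.zeros_like(p)                               # actual values (ai column), all zeros

per_row = np.abs(np.log(p + 1) - np.log(a + 1))    # the "result" column
rmsle_value = np.sqrt(np.mean(per_row ** 2))       # square, average, then take the root
plain_mean  = per_row.mean()                       # simple average, for comparison

print(np.round(per_row, 3))
print(round(rmsle_value, 3), round(plain_mean, 3))
```

The RMSLE comes out above the plain mean whenever the per-row errors differ, since squaring gives the largest errors the most weight.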

In [47]:
def rmsle(y_true, y_pred):
    diff = np.log(y_pred + 1) - np.log(y_true + 1)
    mean_error = np.square(diff).mean()
    return np.sqrt(mean_error)

scorer = make_scorer(rmsle, greater_is_better=False)
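One detail of `greater_is_better=False` is easy to miss: scikit-learn negates the metric so that larger scorer values always mean a better model. The sketch below illustrates this with made-up data and a `DummyRegressor` baseline; both are assumptions for the demonstration and not part of the bike-sharing model.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer

def rmsle(y_true, y_pred):
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

scorer = make_scorer(rmsle, greater_is_better=False)

# Made-up data; DummyRegressor always predicts the mean of the training targets.
X = np.zeros((4, 1))
y = np.array([1.0, 10.0, 100.0, 1000.0])
baseline = DummyRegressor(strategy='mean').fit(X, y)

raw    = rmsle(y, baseline.predict(X))  # the metric itself, lower is better
scored = scorer(baseline, X, y)         # same number with the sign flipped

print(raw, scored)
```

When such a scorer is passed to cross-validation or grid-search utilities, the reported scores are therefore negative RMSLE values.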

In [48]:
y_pred = model.predict(X_test)
rmsle(y_test, y_pred)

1.4272746457576084

## Looks better :)
We solved the problem with evaluation. The next step is to understand the data better and figure out how to improve the result.