## Limit how far back training data goes

### Background
When computing cross-validation metrics, I've noticed that older seasons tend to produce better scores (e.g. higher accuracy, lower MAE) than recent seasons. Also, the data sets cover different time spans:
- match data go back to 1897
- player data go back to 1897 as well, but stats other than goals/behinds only go back to 1965
- betting data go back to 2010

In addition to these data eras, there are distinct historical eras in the VFL/AFL:
- the VFL became the AFL in 1990
- various teams have joined and left the league over the years (the current group of teams only goes back to 2012)
- various rule and strategy changes have affected how the game is played, especially with regard to how often teams and individual players score


### Hypothesis
Having more data to train on is usually good in machine learning, but using very old VFL/AFL data may introduce noise due to the sparsity of data and fundamental differences in the sport itself, resulting in worse performance.
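
As a rough illustration of what limiting the training window means in practice, the sketch below drops every row before a minimum year from a toy data frame. The `limit_min_year` helper and the index level names (`team`, `year`, `round_number`) are assumptions made for this example and are not taken from augury; the scores are toy values.

```python
import pandas as pd


def limit_min_year(data: pd.DataFrame, min_year: int) -> pd.DataFrame:
    """Keep only rows whose 'year' index level is >= min_year (hypothetical helper)."""
    years = data.index.get_level_values("year")
    return data.loc[years >= min_year]


# Toy data with an assumed (team, year, round_number) index
index = pd.MultiIndex.from_tuples(
    [
        ("Adelaide", 1991, 1),
        ("Carlton", 1897, 1),
        ("Carlton", 1965, 1),
        ("GWS", 2012, 1),
    ],
    names=["team", "year", "round_number"],
)
toy = pd.DataFrame({"score": [155, 40, 80, 37]}, index=index)

# Drop everything before 1965, mirroring one of the cut-offs tested below
recent = limit_min_year(toy, min_year=1965)
print(recent)
```

A cut-off of `0` keeps every row, which is why it serves as the "all data" baseline in the experiment.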

## Code Setup

In [105]:
%load_ext autoreload

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [166]:
%autoreload 2
%reload_kedro

from functools import partial

import pandas as pd
import numpy as np

from augury import model_tracking
from augury.ml_data import MLData
from augury.ml_estimators import StackingEstimator

2019-12-28 05:31:02,055 - root - INFO - ** Kedro project augury
2019-12-28 05:31:02,060 - root - INFO - Defined global variable `context` and `catalog`


In [110]:
data = MLData(train_year_range=(CV_YEAR_RANGE[1],))
data.data

2019-12-28 04:17:22,540 - kedro.io.data_catalog - INFO - Loading data from `model_data` (JSONLocalDataSet)...


Unnamed: 0,Unnamed: 1,Unnamed: 2,team,oppo_team,round_type,venue,prev_match_oppo_team,oppo_prev_match_oppo_team,date,team_goals,team_behinds,score,...,oppo_rolling_prev_match_goals_divided_by_rolling_prev_match_goals_plus_rolling_prev_match_behinds,win_odds,oppo_win_odds,line_odds,oppo_line_odds,betting_pred_win,rolling_betting_pred_win_rate,oppo_betting_pred_win,oppo_rolling_betting_pred_win_rate,win_odds_multiplied_by_ladder_position
Adelaide,1991,1,Adelaide,Hawthorn,Regular,Football Park,0,Melbourne,1991-03-22 03:56:00+00:00,24,11,155,...,0.0,0.00,0.00,0.0,0.0,0.0,0.000000,0.0,0.000000,0.00
Adelaide,1991,2,Adelaide,Carlton,Regular,Football Park,Hawthorn,Fitzroy,1991-03-31 03:56:00+00:00,12,9,81,...,0.0,0.00,0.00,0.0,0.0,0.0,0.000000,0.0,0.000000,0.00
Adelaide,1991,3,Adelaide,Sydney,Regular,S.C.G.,Carlton,Hawthorn,1991-04-07 03:05:00+00:00,19,18,132,...,0.0,0.00,0.00,0.0,0.0,0.0,0.000000,0.0,0.000000,0.00
Adelaide,1991,4,Adelaide,Essendon,Regular,Windy Hill,Sydney,North Melbourne,1991-04-13 03:30:00+00:00,6,11,47,...,0.0,0.00,0.00,0.0,0.0,0.0,0.000000,0.0,0.000000,0.00
Adelaide,1991,5,Adelaide,West Coast,Regular,Subiaco,Essendon,North Melbourne,1991-04-21 05:27:00+00:00,9,11,65,...,0.0,0.00,0.00,0.0,0.0,0.0,0.000000,0.0,0.000000,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Western Bulldogs,2019,20,Western Bulldogs,Brisbane,Regular,Gabba,Fremantle,Hawthorn,2019-08-04 02:58:00+00:00,11,14,80,...,0.0,3.15,1.36,18.5,-18.5,0.0,0.391304,1.0,0.478261,28.35
Western Bulldogs,2019,21,Western Bulldogs,Essendon,Regular,Docklands,Brisbane,Port Adelaide,2019-08-10 03:30:00+00:00,21,11,137,...,0.0,1.68,2.20,-6.5,6.5,1.0,0.434783,0.0,0.521739,18.48
Western Bulldogs,2019,22,Western Bulldogs,GWS,Regular,Sydney Showground,Essendon,Hawthorn,2019-08-18 03:05:00+00:00,19,12,126,...,0.0,2.15,1.71,5.5,-5.5,0.0,0.434783,1.0,0.695652,21.50
Western Bulldogs,2019,23,Western Bulldogs,Adelaide,Regular,Eureka Stadium,GWS,Collingwood,2019-08-25 03:30:00+00:00,18,13,121,...,0.0,1.33,3.35,-18.5,18.5,1.0,0.434783,0.0,0.739130,11.97


## Run experiment

In [168]:
year_limits = [0, 1965, 1990, 2010]

models = [
    (
        StackingEstimator(min_year=min_year),
        data,
        'year_limit'
    )
    for min_year in year_limits
]

cv_scores = model_tracking.start_run(models, n_jobs=-1)
cv_scores

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  1.1min finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   42.6s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   26.0s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    8.1s finished


[{'model': 'stacking_estimator',
  'fit_time': array([32.59229183, 33.2250185 , 33.56517506, 34.24861836, 30.35174322]),
  'score_time': array([0.30001688, 0.32877064, 0.27724624, 0.29374981, 0.34624553]),
  'test_match_accuracy': array([0.73913043, 0.71359223, 0.73913043, 0.68115942, 0.70531401]),
  'test_neg_mean_absolute_error': array([-29.22966117, -31.30742567, -28.39680435, -28.87303986,
         -27.08517658])},
 {'model': 'stacking_estimator',
  'fit_time': array([21.21925211, 21.95618176, 22.27621484, 22.8023963 , 20.60602856]),
  'score_time': array([0.27208471, 0.27287936, 0.2641871 , 0.26345754, 0.28093815]),
  'test_match_accuracy': array([0.7294686 , 0.72815534, 0.73429952, 0.66666667, 0.73429952]),
  'test_neg_mean_absolute_error': array([-29.61610675, -31.0290287 , -28.5355068 , -28.91593467,
         -26.52216876])},
 {'model': 'stacking_estimator',
  'fit_time': array([12.14874339, 12.60816479, 13.24781513, 13.64561844, 13.09721708]),
  'score_time': array([0.25736713

In [175]:
for min_year, results in zip(year_limits, cv_scores):
    print('minimum year of data:', min_year)
    print('mean fit time:', results['fit_time'].mean())
    print('mean match accuracy:', results['test_match_accuracy'].mean())
    print('mean MAE:', abs(results['test_neg_mean_absolute_error'].mean()))
    print('')

minimum year of data: 0
mean fit time: 32.796569395065305
mean match accuracy: 0.7156653065053235
mean MAE: 28.978421528554453

minimum year of data: 1965
mean fit time: 21.772014713287355
mean match accuracy: 0.7185779278645467
mean MAE: 28.923749136958428

minimum year of data: 1990
mean fit time: 12.949511766433716
mean match accuracy: 0.7137282491440364
mean MAE: 29.561380815813358

minimum year of data: 2010
mean fit time: 3.7240463733673095
mean match accuracy: 0.6847099104169598
mean MAE: 31.17058661305084



## Conclusion

### Removing data from before 1965 improves model performance
The improvement is small in magnitude (mean accuracy of 0.7186 vs. 0.7157, mean MAE of 28.92 vs. 28.98), but it applies to both accuracy and MAE, which is uncommon for small changes in performance metrics. That starting at 1965 yielded the best performance suggests that the sparsity of data from earlier years was an issue, and that 1965 strikes the right balance between making the data denser and keeping enough data to train on. Starting in 2010 gives the densest data set, but it is too small.
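
Because the difference in means is small, it can help to look at it fold by fold. The sketch below is one possible way to do that, using only the per-fold scores printed in the `cv_scores` output for the first two models (`min_year` of 0 and 1965); the variable names are chosen for this example.

```python
from statistics import mean

# Per-fold scores copied from the cv_scores output (min_year=0 vs. min_year=1965)
acc_all = [0.73913043, 0.71359223, 0.73913043, 0.68115942, 0.70531401]
acc_1965 = [0.7294686, 0.72815534, 0.73429952, 0.66666667, 0.73429952]
mae_all = [29.22966117, 31.30742567, 28.39680435, 28.87303986, 27.08517658]
mae_1965 = [29.61610675, 31.0290287, 28.5355068, 28.91593467, 26.52216876]

# Paired differences: positive is better for accuracy, negative is better for MAE
acc_diffs = [new - old for old, new in zip(acc_all, acc_1965)]
mae_diffs = [new - old for old, new in zip(mae_all, mae_1965)]

# Count the folds in which the 1965 cut-off did better
acc_wins = sum(diff > 0 for diff in acc_diffs)
mae_wins = sum(diff < 0 for diff in mae_diffs)

print(f"mean accuracy change: {mean(acc_diffs):+.4f} ({acc_wins}/{len(acc_diffs)} folds better)")
print(f"mean MAE change: {mean(mae_diffs):+.4f} ({mae_wins}/{len(mae_diffs)} folds better)")
```

The mean differences reproduce the gap between the reported means, while the per-fold counts give a sense of how consistent that gap is across folds.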


### It shortens model training time
In addition to marginally improving the accuracy and MAE of the model, dropping the sparse, early rows also shortens the mean fit time per fold by 33.6% (from about 32.8 s to 21.8 s).
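
The reduction can be checked directly from the mean fit times printed earlier. This is a small worked calculation, with the values copied from that output and the dictionary name chosen for this example.

```python
# Mean fit times in seconds, keyed by minimum year, copied from the printed results
mean_fit_times = {
    0: 32.796569395065305,
    1965: 21.772014713287355,
    1990: 12.949511766433716,
    2010: 3.7240463733673095,
}

baseline = mean_fit_times[0]  # training on all available data

# Percentage reduction in fit time relative to the all-data baseline
reductions = {
    min_year: (baseline - fit_time) / baseline * 100
    for min_year, fit_time in mean_fit_times.items()
}

for min_year, reduction in reductions.items():
    print(f"min_year={min_year}: {reduction:.1f}% shorter fit time")
```

For the 1965 cut-off this gives the 33.6% quoted above; the later cut-offs are faster still, but at the cost of the accuracy and MAE shown in the results.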