# Models Notebook

This notebook lets a user download the 2010-2015 data, train a set of models using temporal validation, and retrieve the top-performing models along with their precision, recall, and feature importance graphs.

## Installation instructions
- All package requirements are listed in `requirements.txt`.
- If you are working in a virtual environment, activate it and run `pip install -r requirements.txt`.
- Make sure your Jupyter kernel points to your virtual environment.
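
One way to check the last point from inside the notebook is to inspect the interpreter the kernel is running. The snippet below is a minimal sketch, not part of the pipeline; `in_virtualenv` is a hypothetical helper name, and the check assumes an environment created with `venv` or `virtualenv` (a conda environment will not be detected this way).

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv/virtualenv."""
    # Inside a virtual environment, sys.prefix points at the environment,
    # while sys.base_prefix points at the base Python installation.
    base = getattr(sys, "base_prefix", sys.prefix)
    return sys.prefix != base


# The printed path should sit inside your environment's directory.
print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
```

If the interpreter path is not inside your environment, the kernel is probably registered against a different Python installation.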

In [None]:
# import statements
import warnings
warnings.simplefilter('ignore')
warnings.filterwarnings('ignore', category=DeprecationWarning)

# load the pipeline helpers into the notebook namespace
%run pipeline/explore.py
%run pipeline/features.py
%run pipeline/preprocess.py
%run pipeline/methods_loop.py
%run pipeline/evaluation.py

### Get Data
- Make sure you are connected to the DSSG server over SSH.
- Run `go_ft()` to fetch the raw pickle files into the `data/` folder.

In [None]:
go_ft()

In [None]:
cv = [('2010-12-31', '2011-12-31'), ('2011-12-31', '2012-12-31'),
      ('2012-12-31', '2013-12-31'), ('2013-12-31', '2014-12-31'),
      ('2014-12-31', '2015-12-31')]

# load each train/test split and print its dimensions
for c, v in cv:
    train = pd.read_pickle('data/c{}_v{}_train.pkl'.format(c[:4], v[:4]))
    test = pd.read_pickle('data/c{}_v{}_test.pkl'.format(c[:4], v[:4]))
    print(train.shape, test.shape)

### Temporal Validation Loop
- `temporal_validation_loop` usage
- Inputs: 
    - `cv_pairs` (list of tuple pairs)
    - `grid_size` ('test', 'small', or 'large')
    - `to_run` (list of methods to run)
    - `basic` (features to use)
    - `filename` (file to store results)
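
The `cv_pairs` lists used in this notebook are written out by hand. As a minimal sketch, they could also be generated from a year range. `make_cv_pairs` is a hypothetical helper, not part of the pipeline, and it assumes each tuple holds a year-end cutoff date followed by the validation date one year later, as the `c`/`v` naming of the pickle files suggests.

```python
def make_cv_pairs(first_year, last_year):
    """Build consecutive (cutoff, validation) year-end date pairs."""
    pairs = []
    for year in range(first_year, last_year):
        cutoff = "{}-12-31".format(year)
        validation = "{}-12-31".format(year + 1)
        pairs.append((cutoff, validation))
    return pairs


cv = make_cv_pairs(2010, 2015)
print(cv[0])    # earliest split
print(cv[-1:])  # list holding only the most recent split
```

Slicing the result (`cv[:1]`, `cv[-1:]`) selects which splits are passed as `cv_pairs`.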

In [None]:
# example usage
cv = [('2010-12-31', '2011-12-31'), ('2011-12-31', '2012-12-31'),
      ('2012-12-31', '2013-12-31'), ('2013-12-31', '2014-12-31')]
to_run = ['KNN', 'LR', 'DT', 'RF', 'AB', 'GB']
cv_pairs = cv[:1]
res = temporal_validation_loop(cv_pairs, 'large', to_run, None, 'results/results_example.pkl')

In [None]:
res.precision_at_5.mean()

In [None]:
# adding another year of data
# example usage
cv = [('2010-12-31', '2011-12-31'), ('2011-12-31', '2012-12-31'),
      ('2012-12-31', '2013-12-31'), ('2013-12-31', '2014-12-31'),
      ('2014-12-31', '2015-12-31')]
cv_pairs = cv[-1:]
res2 = temporal_validation_loop(cv_pairs, 'large', ['KNN', 'DT', 'RF', 'LR', 'AB'], None, 'results/last_split_example.pkl')

In [None]:
res2.precision_at_5.mean()

In [None]:
get_best_model(res2, 'precision_at_5')

### Rank models by precision at 5%
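
For reference, precision at 5% is usually understood as the share of true positives among the 5% of observations with the highest predicted scores. The sketch below illustrates that definition in plain Python. It is an assumption about what the `precision_at_5` column measures, not the code in `pipeline/evaluation.py`, and `precision_at_k` is a hypothetical name.

```python
def precision_at_k(y_true, y_scores, k_percent):
    """Precision among the top k_percent of observations ranked by score."""
    if len(y_true) != len(y_scores):
        raise ValueError("y_true and y_scores must have the same length")
    # Number of observations to flag; always flag at least one.
    n_top = max(1, int(len(y_scores) * k_percent / 100.0))
    # Indices ordered from the highest to the lowest score.
    ranked = sorted(range(len(y_scores)), key=lambda i: y_scores[i], reverse=True)
    top = ranked[:n_top]
    # Fraction of flagged observations that are true positives.
    return sum(y_true[i] for i in top) / float(n_top)


# Toy labels and scores, made up for illustration only.
y_true = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
y_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
print(precision_at_k(y_true, y_scores, 20))
```

With this definition, a model is rewarded only for how well it orders the highest-scored cases, which suits settings where just a small share of cases can be acted on.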

In [None]:
# mean and standard deviation of precision at 5% per model type
for k, grp in res2.groupby('model_type'):
    print(k, grp.precision_at_5.mean(), grp.precision_at_5.std())

In [None]:
# best, mean, and spread of precision at 5% per validation date
for k, grp in res2.groupby('validation_date'):
    print(k, 'MAX:', grp.precision_at_5.max(), 'MEAN:', grp.precision_at_5.mean(), 'STD:', grp.precision_at_5.std())

### Print best model plots

In [None]:
evaluate_best_models(res2.loc[[10, 95]])

In [None]:
from IPython.display import Image
Image("results/c2014_v2015_tree.png")