# crossfit open 2018 supervised learning
In the previous notebook `open_182a_bayesian_regression`, we used the athlete profile benchmarks to perform regression on workout 18.2a. This notebook extends that approach to the entire Open. First, we'll use the benchmarks to regress each Open workout and analyze the results. Afterwards, we'll look for redundancies in the Open: to predict some Open workout *x*, we'll use the other Open workouts *along with* the benchmarks to attempt to better regress Open scores.

## imports
Here we'll import all the modules we'll use throughout this notebook.

In [64]:
#working with ids to implement custom train/test splitting
import numpy as np
#working locally with data (dataframes)
import pandas as pd
#data scaling, learning agents
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import BayesianRidge
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor
#from sklearn.model_selection import train_test_split

## reading in the data
The data was previously filtered and collected from a database in another notebook and written to CSV. We'll load that data here.

In [30]:
raw_df = pd.read_csv("../sample_data/sample_pure_division_18.csv").drop(["Unnamed: 0"], axis=1)
raw_df.head(3)

Unnamed: 0,id,leaderboard_18_1_reps,leaderboard_18_2_time_secs,leaderboard_18_2a_weight_lbs,leaderboard_18_3_time_secs,leaderboard_18_4_time_secs,leaderboard_18_5_reps,height_in,weight_lbs,back_squat_lbs,clean_and_jerk_lbs,snatch_lbs,deadlift_lbs,max_pull_ups,fran_time_secs,grace_time_secs,helen_time_secs,filthy_50_time_secs,sprint_400_m_time_secs,run_5_km_time_secs
0,86,328,651,207,1373,637,77,77,225,335,265,210,415,32,226,174,496,1287,64,1325
1,93,309,480,235,1548,642,94,72,205,305,235,185,355,45,220,154,493,1500,62,1239
2,1636,402,316,260,1148,577,123,70,200,400,285,215,485,67,138,108,429,1056,58,1219


## grabbing columns of interest
This notebook performs regression on the Open workouts. Each learning task comes in 2 forms: with only the benchmarks as features, and with the benchmarks plus the remaining Open workouts. We'll start by extracting the Open workout columns.

In [31]:
open_keys = list(raw_df.columns[1:7])
open_keys

['leaderboard_18_1_reps',
 'leaderboard_18_2_time_secs',
 'leaderboard_18_2a_weight_lbs',
 'leaderboard_18_3_time_secs',
 'leaderboard_18_4_time_secs',
 'leaderboard_18_5_reps']

## scaling the data
We'll scale this data using scikit-learn's `StandardScaler`, which standardizes each feature independently to zero mean and unit variance.

In [34]:
x = [1,2,3]
x.remove(2)
x

[1, 3]

In [52]:
scaled_data = StandardScaler().fit_transform(raw_df.drop(["id"], axis=1))
df = pd.concat(
    [
        raw_df[["id"]],
        #all columns but id in the columns=... statement
        pd.DataFrame(scaled_data, columns=list(raw_df.columns)[1:])
    ],
    axis=1
)
df.head(3)

Unnamed: 0,id,leaderboard_18_1_reps,leaderboard_18_2_time_secs,leaderboard_18_2a_weight_lbs,leaderboard_18_3_time_secs,leaderboard_18_4_time_secs,leaderboard_18_5_reps,height_in,weight_lbs,back_squat_lbs,clean_and_jerk_lbs,snatch_lbs,deadlift_lbs,max_pull_ups,fran_time_secs,grace_time_secs,helen_time_secs,filthy_50_time_secs,sprint_400_m_time_secs,run_5_km_time_secs
0,86,0.15843,2.065809,-1.132659,0.328936,0.414922,-0.546937,1.103061,1.694207,-0.477036,0.182866,0.285815,-0.454332,-0.581332,-0.004048,0.06697,-0.473635,-0.387131,-0.333732,0.105375
1,93,-0.181246,0.266425,-0.359136,0.941889,0.453402,-0.061351,0.336881,0.736628,-0.979968,-0.599759,-0.438017,-1.409255,0.326252,-0.077476,-0.215067,-0.506319,0.265119,-0.49556,-0.434557
2,1636,1.48138,-1.459299,0.33151,-0.459145,-0.046834,0.767001,0.030409,0.497233,0.612652,0.704616,0.430581,0.659746,1.862164,-1.080997,-0.863752,-1.203567,-1.0945,-0.819216,-0.560122
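
To make the effect of `StandardScaler` concrete, here is a minimal sketch on a made-up two-column frame. The column values are illustrative and not taken from the dataset, and names such as `toy_df` are hypothetical. With default settings, each column should end up with mean 0 and unit variance (computed with the population standard deviation), and `inverse_transform` maps scaled values back to the original units.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# toy data (hypothetical values, for illustration only)
toy_df = pd.DataFrame({
    "back_squat_lbs": [305.0, 335.0, 400.0, 280.0],
    "fran_time_secs": [226.0, 220.0, 138.0, 300.0],
})

# fit the scaler and rebuild a dataframe with the original column names
toy_scaler = StandardScaler()
toy_scaled = pd.DataFrame(toy_scaler.fit_transform(toy_df), columns=toy_df.columns)

# per-column mean and population standard deviation after scaling
col_means = toy_scaled.mean().values
col_stds = toy_scaled.std(ddof=0).values

# scaled values can be mapped back to the original units
toy_restored = toy_scaler.inverse_transform(toy_scaled)
```

Because the targets are scaled as well, any error measured on them is in units of standard deviations rather than reps, seconds, or pounds.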


## splitting the data based on features
There will be 6 different dependents (18.1, 18.2, 18.2a, 18.3, 18.4, 18.5), and each will have 2 feature sets, one per learning task (benchmarks only, or benchmarks plus the other Open workouts). That gives 12 datasets, which we'll build iteratively below. The dataset keys use the position of the workout in `open_keys`, so `0_limited` and `0_full` correspond to 18.1.

In [53]:
datasets = {
    #18.1 (keys use the position in open_keys, starting at 0)
    #"0_limited": [features_0_limited_df, targets_0_limited_df],
    #"0_full": [features_0_full_df, targets_0_full_df],
    #"1_limited": ...
}
#iterate over each dependent
for i in range(len(open_keys)):
    #limited dataset
    datasets["{}_limited".format(i)] = [
        #features (only benchmarks)
        df.drop(open_keys, axis=1),
        #targets
        df[["id", open_keys[i]]]
    ]
    #full dataset
    datasets["{}_full".format(i)] = [
        #features (benchmarks and other open workouts)
        df.drop(open_keys[i], axis=1),
        #targets
        df[["id", open_keys[i]]]
    ]

In [54]:
#only benchmark features
datasets["0_limited"][0].head(3)

Unnamed: 0,id,height_in,weight_lbs,back_squat_lbs,clean_and_jerk_lbs,snatch_lbs,deadlift_lbs,max_pull_ups,fran_time_secs,grace_time_secs,helen_time_secs,filthy_50_time_secs,sprint_400_m_time_secs,run_5_km_time_secs
0,86,1.103061,1.694207,-0.477036,0.182866,0.285815,-0.454332,-0.581332,-0.004048,0.06697,-0.473635,-0.387131,-0.333732,0.105375
1,93,0.336881,0.736628,-0.979968,-0.599759,-0.438017,-1.409255,0.326252,-0.077476,-0.215067,-0.506319,0.265119,-0.49556,-0.434557
2,1636,0.030409,0.497233,0.612652,0.704616,0.430581,0.659746,1.862164,-1.080997,-0.863752,-1.203567,-1.0945,-0.819216,-0.560122


In [55]:
#only benchmark targets
datasets["0_limited"][1].head(3)

Unnamed: 0,id,leaderboard_18_1_reps
0,86,0.15843
1,93,-0.181246
2,1636,1.48138


In [56]:
#benchmarks and other open workouts features
datasets["0_full"][0].head(3)

Unnamed: 0,id,leaderboard_18_2_time_secs,leaderboard_18_2a_weight_lbs,leaderboard_18_3_time_secs,leaderboard_18_4_time_secs,leaderboard_18_5_reps,height_in,weight_lbs,back_squat_lbs,clean_and_jerk_lbs,snatch_lbs,deadlift_lbs,max_pull_ups,fran_time_secs,grace_time_secs,helen_time_secs,filthy_50_time_secs,sprint_400_m_time_secs,run_5_km_time_secs
0,86,2.065809,-1.132659,0.328936,0.414922,-0.546937,1.103061,1.694207,-0.477036,0.182866,0.285815,-0.454332,-0.581332,-0.004048,0.06697,-0.473635,-0.387131,-0.333732,0.105375
1,93,0.266425,-0.359136,0.941889,0.453402,-0.061351,0.336881,0.736628,-0.979968,-0.599759,-0.438017,-1.409255,0.326252,-0.077476,-0.215067,-0.506319,0.265119,-0.49556,-0.434557
2,1636,-1.459299,0.33151,-0.459145,-0.046834,0.767001,0.030409,0.497233,0.612652,0.704616,0.430581,0.659746,1.862164,-1.080997,-0.863752,-1.203567,-1.0945,-0.819216,-0.560122


In [57]:
#benchmarks and other open workouts targets
datasets["0_full"][1].head(3)

Unnamed: 0,id,leaderboard_18_1_reps
0,86,0.15843
1,93,-0.181246
2,1636,1.48138


## splitting ids into training and testing
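We'll split the data into training/testing sets on a per-id basis: 80% of the ids are sampled without replacement for training, and the remaining ids form the test set.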

In [79]:
training_size = .8
ids = df["id"].values
train_ids = np.random.choice(ids, size=int(len(ids) * training_size), replace=False)
train_ids

array([  28501,   70477,   12306,   28689,  234360,   31869,   22538,
         21003,  182631,   72068,  239783,   35012,  153742,  136104,
         17606,   17086,  245977,  372305,  317927,   28443,  464740,
        118206,  411410,    6334,   63435,    5443,   21947,    5674,
         82939,  474531,   14924,  243446,  391024,  300883,   31188,
         22026,    2073,  237949,   98711,   60290,   44348,  474048,
        678723,   24981,  126410,  386135,  239786,   24354,  332729,
         17742,      93,   50251,  191004,   19900,   11549,    1665,
         24599,  467774,    5287,  253813,  102178,    3086,   20354,
          5343,  116663,   12084,  250651,  918639,   36494,   11126,
        175023,   18276,  706259,  517493,  746966,    2896,  144171,
         71965,    2808,   17941,  107412,    3000,    5424,  116883,
        111613,   60413,    7571,    2246,    3354,    5457,   88946,
         39547,   51066,  430732,   18285,    3609,  278784,  473048,
         26237,  102

In [80]:
#get ids that aren't in train_ids
test_ids = np.setdiff1d(ids, train_ids)
test_ids

array([   1636,    1662,    2573,    3583,    4231,    4842,    4880,
          5133,    5223,    5229,    5378,    5405,    5869,    8428,
          8832,    8913,    9106,    9405,   10605,   11066,   11826,
         12544,   13006,   13060,   13573,   16495,   16507,   16993,
         17977,   20073,   22883,   25471,   26515,   26756,   28812,
         30180,   31631,   33216,   35407,   35683,   36803,   37101,
         42059,   46231,   47731,   48418,   51992,   52740,   55601,
         58879,   60346,   60561,   63979,   69927,   96028,  103737,
        108971,  110659,  112065,  112816,  113902,  114032,  115623,
        124881,  128178,  128316,  132393,  138612,  150336,  154681,
        159558,  185216,  188677,  190364,  190858,  197022,  201020,
        232699,  237934,  242559,  246455,  256874,  259682,  265064,
        273334,  277609,  278757,  283894,  284760,  325933,  334064,
        355154,  428068,  429208,  513146,  518554,  529744,  622422,
        671575,  803
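
The id arrays by themselves don't select any rows yet. A minimal sketch of how a per-id split could be applied to one features/targets pair is shown here, using a small made-up frame. The `toy_*` names, the values, and the seed are hypothetical, and dropping `id` before fitting is an assumption, since the notebook doesn't state whether `id` is meant to be a feature. A seeded generator is used so the split can be reproduced.

```python
import numpy as np
import pandas as pd

# toy frames standing in for one features/targets pair (hypothetical values)
toy_features = pd.DataFrame({
    "id": [86, 93, 1636, 2073, 2246],
    "back_squat_lbs": [0.18, -0.60, 0.70, 0.10, -0.30],
})
toy_targets = pd.DataFrame({
    "id": [86, 93, 1636, 2073, 2246],
    "leaderboard_18_1_reps": [0.16, -0.18, 1.48, 0.05, -0.40],
})

# seeded generator so the split is reproducible (the seed is arbitrary)
rng = np.random.default_rng(0)
toy_ids = toy_features["id"].values
toy_train_ids = rng.choice(toy_ids, size=int(len(toy_ids) * 0.8), replace=False)
toy_test_ids = np.setdiff1d(toy_ids, toy_train_ids)

# select rows by id membership, then drop id so it isn't used as a feature
X_train = toy_features[toy_features["id"].isin(toy_train_ids)].drop("id", axis=1)
y_train = toy_targets[toy_targets["id"].isin(toy_train_ids)].drop("id", axis=1)
X_test = toy_features[toy_features["id"].isin(toy_test_ids)].drop("id", axis=1)
y_test = toy_targets[toy_targets["id"].isin(toy_test_ids)].drop("id", axis=1)
```

Splitting on ids rather than on row positions keeps the same athletes in the training set across all 12 datasets, so results for different targets stay comparable.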

## fitting learners to each dataset
For each dataset we'll fit 3 different types of learners: [bayesian regressors](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html#sklearn.linear_model.BayesianRidge), [neural networks](http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html#sklearn.neural_network.MLPRegressor), and [random forests](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html). We'll iteratively fit and store a learner of each type for each of the 12 datasets.

### model parameters
We won't tune these models individually for each dataset. The datasets share most of their data, so although per-task tuning might yield small gains, a single set of parameters per model type is used for all of them. Those parameters are set below.

In [61]:
#default parameters
#bayes

#neural net

#random forest

In [58]:
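learners = {}
#iterate over each dependent
for i in range(len(open_keys)):
    #limited dataset (store in learners so datasets isn't overwritten)
    learners["{}_limited".format(i)] = [
        #bayes (unfitted, default parameters)
        BayesianRidge(),
        #neural network
        MLPRegressor(),
        #random forest
        RandomForestRegressor()
    ]
    #full dataset
    learners["{}_full".format(i)] = [
        BayesianRidge(),
        MLPRegressor(),
        RandomForestRegressor()
    ]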
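
As a rough illustration of the fitting step described in this section, the sketch below fits one learner of each type on synthetic data and scores it on held-out rows. Everything here is an assumption for illustration: the synthetic data, the `toy_*` names, the fixed `random_state`, and the raised `max_iter` for the neural network are not the notebook's settings. The `score` method of these regressors returns the coefficient of determination (R^2).

```python
import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor

# synthetic stand-in for scaled features and one scaled target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([0.5, -0.3, 0.0, 0.8, 0.1]) + rng.normal(scale=0.1, size=200)

# simple 80/20 split by row position (the notebook splits by athlete id)
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

# one learner of each type; random_state is fixed only for repeatability
toy_learners = {
    "bayes": BayesianRidge(),
    "neural_net": MLPRegressor(max_iter=2000, random_state=0),
    "random_forest": RandomForestRegressor(random_state=0),
}

# fit on the training rows, then score (R^2) on the held-out rows
toy_scores = {}
for name, model in toy_learners.items():
    model.fit(X_train, y_train)
    toy_scores[name] = model.score(X_test, y_test)
```

The same loop could run over each of the 12 datasets, storing the fitted models under keys such as `0_limited` and `0_full`, which would make it straightforward to compare the benchmark-only and full feature sets for each workout.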