# crossfit open 2018 supervised learning
In the previous notebook `open_182a_bayesian_regression`, we used the athlete profile benchmarks to perform regression on workout 18.2a. This notebook focuses on regressing the entire Open. First, we'll use the benchmarks to regress each Open workout and analyze the results. Afterwards, we'll look for redundancies in the Open: to predict some Open workout *x*, we'll use the other Open workouts *along with* the benchmarks in an attempt to better regress Open scores.

## imports
Here we'll import all the modules we'll use throughout this notebook.

In [82]:
#working with ids to implement custom train/test splitting
import numpy as np
#working locally with data (dataframes)
import pandas as pd
#data scaling, learning agents
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import BayesianRidge
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor
#from sklearn.model_selection import train_test_split

## reading in the data
The data was previously filtered and collected from a database in another notebook and written to CSV. We'll load that data here.

In [83]:
raw_df = pd.read_csv("../sample_data/sample_pure_division_18.csv").drop(["Unnamed: 0"], axis=1)
raw_df.head(3)

Unnamed: 0,id,leaderboard_18_1_reps,leaderboard_18_2_time_secs,leaderboard_18_2a_weight_lbs,leaderboard_18_3_time_secs,leaderboard_18_4_time_secs,leaderboard_18_5_reps,height_in,weight_lbs,back_squat_lbs,clean_and_jerk_lbs,snatch_lbs,deadlift_lbs,max_pull_ups,fran_time_secs,grace_time_secs,helen_time_secs,filthy_50_time_secs,sprint_400_m_time_secs,run_5_km_time_secs
0,86,328,651,207,1373,637,77,77,225,335,265,210,415,32,226,174,496,1287,64,1325
1,93,309,480,235,1548,642,94,72,205,305,235,185,355,45,220,154,493,1500,62,1239
2,1636,402,316,260,1148,577,123,70,200,400,285,215,485,67,138,108,429,1056,58,1219


## grabbing columns of interest
This notebook performs machine learning on the Open workouts. Each learning task takes 2 forms: with the remaining Open workouts as features, and without (benchmarks only). We'll start by extracting the Open workout columns.

In [84]:
open_keys = list(raw_df.columns[1:7])
open_keys

['leaderboard_18_1_reps',
 'leaderboard_18_2_time_secs',
 'leaderboard_18_2a_weight_lbs',
 'leaderboard_18_3_time_secs',
 'leaderboard_18_4_time_secs',
 'leaderboard_18_5_reps']

## scaling the data
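We'll scale this data using scikit-learn's `StandardScaler`, which standardizes each feature independently by removing its mean and scaling it to unit variance. Every column except `id` is scaled, so the Open workout scores (our eventual targets) are standardized along with the benchmarks.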
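
As a quick illustration of what standardization does, here is a minimal sketch on a small toy array. The values and the names `toy`, `toy_scaled`, and `toy_restored` are made up for this example and are not part of the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy data (hypothetical values): two features on very different scales
toy = np.array([
    [225.0, 32.0],
    [205.0, 45.0],
    [200.0, 67.0],
    [250.0, 20.0],
])

scaler = StandardScaler()
toy_scaled = scaler.fit_transform(toy)

# each column is centered and scaled independently of the others
col_means = toy_scaled.mean(axis=0)
# population standard deviation (ddof=0), which is what StandardScaler uses
col_stds = toy_scaled.std(axis=0)

# the fitted scaler can undo the transform, e.g. to read values in original units
toy_restored = scaler.inverse_transform(toy_scaled)
```

Keeping a reference to the fitted scaler (rather than discarding it after `fit_transform`) is what makes the inverse transform possible later on.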

In [86]:
scaled_data = StandardScaler().fit_transform(raw_df.drop(["id"], axis=1))
df = pd.concat(
    [
        raw_df[["id"]],
        #all columns but id in the columns=... statement
        pd.DataFrame(scaled_data, columns=list(raw_df.columns)[1:])
    ],
    axis=1
)
df.head(3)

Unnamed: 0,id,leaderboard_18_1_reps,leaderboard_18_2_time_secs,leaderboard_18_2a_weight_lbs,leaderboard_18_3_time_secs,leaderboard_18_4_time_secs,leaderboard_18_5_reps,height_in,weight_lbs,back_squat_lbs,clean_and_jerk_lbs,snatch_lbs,deadlift_lbs,max_pull_ups,fran_time_secs,grace_time_secs,helen_time_secs,filthy_50_time_secs,sprint_400_m_time_secs,run_5_km_time_secs
0,86,0.15843,2.065809,-1.132659,0.328936,0.414922,-0.546937,1.103061,1.694207,-0.477036,0.182866,0.285815,-0.454332,-0.581332,-0.004048,0.06697,-0.473635,-0.387131,-0.333732,0.105375
1,93,-0.181246,0.266425,-0.359136,0.941889,0.453402,-0.061351,0.336881,0.736628,-0.979968,-0.599759,-0.438017,-1.409255,0.326252,-0.077476,-0.215067,-0.506319,0.265119,-0.49556,-0.434557
2,1636,1.48138,-1.459299,0.33151,-0.459145,-0.046834,0.767001,0.030409,0.497233,0.612652,0.704616,0.430581,0.659746,1.862164,-1.080997,-0.863752,-1.203567,-1.0945,-0.819216,-0.560122


## splitting the data based on features
There are 6 dependents (18.1, 18.2, 18.2a, 18.3, 18.4, 18.5), and each gets 2 feature sets, one per learning task (benchmarks only, and benchmarks plus the other Open workouts). We'll build these 12 datasets iteratively below.

In [87]:
datasets = {
    #18.1 (keys use the 0-based index into open_keys)
    #"0_limited": [features_0_limited_df, targets_0_df],
    #"0_full": [features_0_full_df, targets_0_df],
    #"1_limited": ...
}
#iterate over each dependent
for i in range(len(open_keys)):
    #limited dataset
    datasets["{}_limited".format(i)] = [
        #features (only benchmarks)
        df.drop(open_keys, axis=1),
        #targets
        df[["id", open_keys[i]]]
    ]
    #full dataset
    datasets["{}_full".format(i)] = [
        #features (benchmarks and other open workouts)
        df.drop(open_keys[i], axis=1),
        #targets
        df[["id", open_keys[i]]]
    ]
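
The same limited/full separation can be sketched as a small helper, which makes it easy to check that the target column never leaks into its own features. The function name `make_feature_target_split` and the toy column names are hypothetical and chosen only for this illustration.

```python
import pandas as pd

def make_feature_target_split(df, open_keys, target_key, include_other_open):
    """Return [features, targets] for one Open workout (hypothetical helper)."""
    if include_other_open:
        # "full": benchmarks plus every Open workout except the target
        features = df.drop(columns=[target_key])
    else:
        # "limited": benchmarks only, so drop all Open workouts
        features = df.drop(columns=open_keys)
    # keep the id alongside the target so rows can be split by id later
    targets = df[["id", target_key]]
    return [features, targets]

# toy frame with made-up values: two "Open" columns and one benchmark
toy_df = pd.DataFrame({
    "id": [1, 2, 3],
    "open_a": [0.1, -0.2, 1.5],
    "open_b": [2.0, 0.3, -1.4],
    "back_squat": [-0.5, -1.0, 0.6],
})
toy_open_keys = ["open_a", "open_b"]

limited = make_feature_target_split(toy_df, toy_open_keys, "open_a", include_other_open=False)
full = make_feature_target_split(toy_df, toy_open_keys, "open_a", include_other_open=True)
```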

In [88]:
#only benchmark features
datasets["0_limited"][0].head(3)

Unnamed: 0,id,height_in,weight_lbs,back_squat_lbs,clean_and_jerk_lbs,snatch_lbs,deadlift_lbs,max_pull_ups,fran_time_secs,grace_time_secs,helen_time_secs,filthy_50_time_secs,sprint_400_m_time_secs,run_5_km_time_secs
0,86,1.103061,1.694207,-0.477036,0.182866,0.285815,-0.454332,-0.581332,-0.004048,0.06697,-0.473635,-0.387131,-0.333732,0.105375
1,93,0.336881,0.736628,-0.979968,-0.599759,-0.438017,-1.409255,0.326252,-0.077476,-0.215067,-0.506319,0.265119,-0.49556,-0.434557
2,1636,0.030409,0.497233,0.612652,0.704616,0.430581,0.659746,1.862164,-1.080997,-0.863752,-1.203567,-1.0945,-0.819216,-0.560122


In [89]:
#only benchmark targets
datasets["0_limited"][1].head(3)

Unnamed: 0,id,leaderboard_18_1_reps
0,86,0.15843
1,93,-0.181246
2,1636,1.48138


In [90]:
#benchmarks and other open workouts features
datasets["0_full"][0].head(3)

Unnamed: 0,id,leaderboard_18_2_time_secs,leaderboard_18_2a_weight_lbs,leaderboard_18_3_time_secs,leaderboard_18_4_time_secs,leaderboard_18_5_reps,height_in,weight_lbs,back_squat_lbs,clean_and_jerk_lbs,snatch_lbs,deadlift_lbs,max_pull_ups,fran_time_secs,grace_time_secs,helen_time_secs,filthy_50_time_secs,sprint_400_m_time_secs,run_5_km_time_secs
0,86,2.065809,-1.132659,0.328936,0.414922,-0.546937,1.103061,1.694207,-0.477036,0.182866,0.285815,-0.454332,-0.581332,-0.004048,0.06697,-0.473635,-0.387131,-0.333732,0.105375
1,93,0.266425,-0.359136,0.941889,0.453402,-0.061351,0.336881,0.736628,-0.979968,-0.599759,-0.438017,-1.409255,0.326252,-0.077476,-0.215067,-0.506319,0.265119,-0.49556,-0.434557
2,1636,-1.459299,0.33151,-0.459145,-0.046834,0.767001,0.030409,0.497233,0.612652,0.704616,0.430581,0.659746,1.862164,-1.080997,-0.863752,-1.203567,-1.0945,-0.819216,-0.560122


In [91]:
#benchmarks and other open workouts targets
datasets["0_full"][1].head(3)

Unnamed: 0,id,leaderboard_18_1_reps
0,86,0.15843
1,93,-0.181246
2,1636,1.48138


## splitting ids into training and testing
We'll split the data into training/testing sets on a per-id basis.

In [92]:
training_size = .8
ids = df["id"].values
train_ids = np.random.choice(ids, size=int(len(ids) * training_size), replace=False)
train_ids

array([  42059,  119829,   42521,   13573,   30310,   38781,   50251,
          6740,   40470,  227272,   24366,  161269,   22538,   98737,
        284760,  956598,  671575,    9217,  292529,  253813,   21553,
         34708,  704208,  302852,  289976,  136918,   96028,   60547,
        282422,   35683,   17576,   16869,    9858,  733406,  112065,
        258248,   17941,  112816,  239783,  443282,    2223,  208605,
        278757,   60346,  143549,   18201,    3583,  658554,  118510,
          8496,    8428,  241862,   40808,   44131,    5300,  413402,
         60290,   88946,   35407,   12908,    9405,  325664,  468457,
         21003,    5287,   99852,  159558,  103737,  283894,    8851,
       1238643,  153742,   31631,   12084,   35766,   27522,    4231,
        130502,  201020,  334064,  464294,   42024,  395596,  256491,
        535071,  105116,   20354,  237934,    9799,   13006,   20073,
          3609,  112627,  511412,  317927,    3000,  692033,    4105,
        558738,   12

In [93]:
#get ids that aren't in train_ids
test_ids = np.setdiff1d(ids, train_ids)
test_ids

array([    93,   1636,   2073,   2539,   2649,   3086,   3855,   4158,
         5674,   7638,   8832,  10281,  10332,  10858,  11126,  12318,
        13110,  14924,  16507,  16699,  19900,  21947,  23447,  24354,
        24599,  26200,  30180,  31188,  31869,  33079,  33216,  33625,
        36803,  39021,  43310,  44200,  51066,  51879,  53236,  61901,
        72051,  72068,  79664,  91497,  99303, 100941, 102064, 104910,
       108827, 111613, 112130, 114073, 116663, 123385, 124881, 132810,
       136104, 136288, 142319, 144171, 154681, 156811, 168610, 170461,
       175023, 175612, 179308, 188677, 190364, 190858, 202371, 214125,
       215118, 232699, 234306, 240145, 241391, 242423, 242480, 245977,
       247849, 254219, 256514, 259585, 273334, 276676, 277609, 279132,
       332367, 336757, 355154, 396150, 410098, 411410, 449114, 474531,
       478271, 509683, 545781, 565129, 746966, 759578, 835111], dtype=int64)
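
The split above draws from NumPy's global random state, so unless a seed is set elsewhere it will differ between runs. Below is a minimal sketch of a reproducible variant using a seeded generator. The helpers `split_ids` and `select_by_ids` are hypothetical names, and the sketch assumes the ids are unique.

```python
import numpy as np
import pandas as pd

def split_ids(ids, training_size=0.8, seed=0):
    """Randomly partition unique ids into train/test arrays (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    ids = np.asarray(ids)
    n_train = int(len(ids) * training_size)
    # sample without replacement so no id is picked twice
    train_ids = rng.choice(ids, size=n_train, replace=False)
    # everything not chosen for training becomes the test set
    test_ids = np.setdiff1d(ids, train_ids)
    return train_ids, test_ids

def select_by_ids(frame, keep_ids):
    """Keep only the rows whose id is in keep_ids (hypothetical helper)."""
    return frame[frame["id"].isin(keep_ids)]

# toy frame with made-up ids and values
toy_df = pd.DataFrame({"id": np.arange(100, 110), "value": np.linspace(0.0, 1.0, 10)})
train_ids, test_ids = split_ids(toy_df["id"].values, training_size=0.8, seed=0)
train_df = select_by_ids(toy_df, train_ids)
test_df = select_by_ids(toy_df, test_ids)
```

Because features and targets both carry the `id` column, the same id arrays can be applied to each frame and the rows stay aligned.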

## fitting learners to each dataset
For each dataset we'll fit 3 different types of learners: [Bayesian ridge regressors](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html#sklearn.linear_model.BayesianRidge), [neural networks](http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html#sklearn.neural_network.MLPRegressor), and [random forests](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html). We'll iteratively fit and store a learner of each type for each of the 12 datasets.
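
To show the pattern in isolation, here is a minimal sketch that fits the three learner types to synthetic data and scores each on a held-out portion. The data, the `(3,)` hidden layer, and the names `learners` and `scores` are assumptions for this example only. `score` returns the coefficient of determination R² for scikit-learn regressors.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor

# synthetic, roughly linear data (made up for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -0.5, 0.25, 0.0]) + rng.normal(scale=0.1, size=200)
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

# one learner of each type
learners = {
    "bayes": BayesianRidge(),
    "neural_net": MLPRegressor(hidden_layer_sizes=(3,), max_iter=250, random_state=0),
    "random_forest": RandomForestRegressor(max_depth=5, min_samples_leaf=3, random_state=0),
}

# fit on the training portion, then score (R^2) on the held-out portion
scores = {}
for name, learner in learners.items():
    learner.fit(X_train, y_train)
    scores[name] = learner.score(X_test, y_test)
```

The neural network may emit a convergence warning with a low `max_iter`. That is a warning rather than an error, and the fitted model can still be scored.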

### model parameters
We won't tune these models individually for each dataset. The datasets share a great deal of data, and although individual tuning might yield small gains for each learning task, we won't perform it. The parameters used for each model are set below.

In [152]:
#default parameters
random_state = 0
#bayes

#neural net
max_iter = 250
#random forest
max_depth = 5
min_samples_leaf = 3

In [143]:
datasets["{}_limited".format(0)][1].head(3)

Unnamed: 0,id,leaderboard_18_1_reps
0,86,0.15843
1,93,-0.181246
2,1636,1.48138


In [153]:
mlmap={}
#iterate over each dependent (6 total)
for i in range(len(open_keys)):
    #get keys
    data_keys = list(map(lambda x: x.format(i), ["{}_limited", "{}_full"]))
    
    #for each dataset (2 total)
    for j in range(len(data_keys)):
        #make kvp entry
        mlmap[data_keys[j]] = {
            #train: training features/targets,
            #test: testing features/targets,
            #learners: learners fit to training set, meant to be tested on testing set
        }
        
        #get dataset
        tmp_dfs = datasets[data_keys[j]]
        
        #the only difference between the next 2 list assignments is
        #".isin(train_ids)" and ".isin(test_ids)"
        #- likely candidate for iteration
        #training
        mlmap[data_keys[j]]["train"] = [
            #features
            tmp_dfs[0][tmp_dfs[0]["id"].isin(train_ids)],
            #targets
            tmp_dfs[1][tmp_dfs[1]["id"].isin(train_ids)]
        ]
        #testing
        mlmap[data_keys[j]]["test"] = [
            #features
            tmp_dfs[0][tmp_dfs[0]["id"].isin(test_ids)],
            #targets
            tmp_dfs[1][tmp_dfs[1]["id"].isin(test_ids)]
        ]
        
        #fit 3 learners to training data
        training_set = mlmap[data_keys[j]]["train"]
        tmp_x = training_set[0].drop(["id"], axis=1)
        tmp_y = np.ravel(training_set[1].drop(["id"], axis=1))
        mlmap[data_keys[j]]["learners"] = [
            #bayes
            BayesianRidge(
                compute_score=True,
            ).fit(tmp_x, tmp_y),
            
            #neural network
            MLPRegressor(
                #hidden_layer_sizes lists hidden layers only; scikit-learn
                #infers the input and output layers from the data
                hidden_layer_sizes=(
                    #single hidden layer (https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa)
                    #mean(input layer neurons and output layer neurons)
                    #the -1 factors remove the ID column counts
                    #= (len(training_set[0].columns) - 1 + len(training_set[1].columns) - 1) * .5
                    #= (len(training_set[0].columns) - 1 + 2 - 1) * .5
                    #= len(training_set[0].columns) * .5
                    int(
                        np.ceil(
                            len(training_set[0].columns) * .5
                        )
                    ),
                ),
                #maximum iterations to perform before stopping training
                max_iter=max_iter,
                #random state
                random_state=random_state
            ).fit(tmp_x, tmp_y),
            
            #random forest
            RandomForestRegressor(
                #max depth
                max_depth=max_depth,
                #minimum number of samples required at a leaf node
                min_samples_leaf=min_samples_leaf,
                #random state
                random_state=random_state
            ).fit(tmp_x, tmp_y)
        ]
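
One detail worth checking when sizing the network: scikit-learn's `hidden_layer_sizes` describes hidden layers only, and the input and output layers are inferred from the data passed to `fit`. The sketch below applies the "mean of input and output neurons" heuristic to random data with 13 features (the number of benchmark columns) and inspects the fitted network. The helper `mean_hidden_size` and the random data are assumptions for this example.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def mean_hidden_size(n_inputs, n_outputs=1):
    """Hidden layer width as the rounded-up mean of input and output counts (hypothetical helper)."""
    return int(np.ceil((n_inputs + n_outputs) * 0.5))

# random data with 13 features, matching the number of benchmark columns
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 13))
y = rng.normal(size=50)

hidden = mean_hidden_size(X.shape[1])
net = MLPRegressor(hidden_layer_sizes=(hidden,), max_iter=250, random_state=0).fit(X, y)

# one weight matrix per connection between consecutive layers
weight_shapes = [w.shape for w in net.coefs_]
# n_layers_ counts the input layer, the hidden layers, and the output layer
total_layers = net.n_layers_
```

With a single entry in `hidden_layer_sizes`, the fitted network has three layers in total, so listing the input and output sizes in that tuple would add extra hidden layers rather than describe the existing ones.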