# crossfit open 2018 supervised learning
In the previous notebook `open_182a_bayesian_regression`, we used the athlete profile benchmarks to perform regression on workout 18.2a. This notebook focuses on regressing the entire Open. First, we'll use the benchmarks to regress each Open workout and analyze the results. Afterwards, we'll look for redundancies in the Open: to predict some Open workout *x*, we'll use the other Open workouts *along with* the benchmarks in an attempt to better regress Open scores.

## imports
Here we'll import all the modules we'll make use of throughout this notebook.

In [1]:
#working locally with data (dataframes)
import pandas as pd

## reading in the data
The data has previously been filtered/collected from a database in another learning notebook and written to CSV. We'll grab that data here.

In [3]:
df = pd.read_csv("../sample_data/sample_pure_division_18.csv").drop(["Unnamed: 0"], axis=1)
df.head(3)

Unnamed: 0,id,leaderboard_18_1_reps,leaderboard_18_2_time_secs,leaderboard_18_2a_weight_lbs,leaderboard_18_3_time_secs,leaderboard_18_4_time_secs,leaderboard_18_5_reps,height_in,weight_lbs,back_squat_lbs,clean_and_jerk_lbs,snatch_lbs,deadlift_lbs,max_pull_ups,fran_time_secs,grace_time_secs,helen_time_secs,filthy_50_time_secs,sprint_400_m_time_secs,run_5_km_time_secs
0,86,328,651,207,1373,637,77,77,225,335,265,210,415,32,226,174,496,1287,64,1325
1,93,309,480,235,1548,642,94,72,205,305,235,185,355,45,220,154,493,1500,62,1239
2,1636,402,316,260,1148,577,123,70,200,400,285,215,485,67,138,108,429,1056,58,1219
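
The `Unnamed: 0` column dropped above is what pandas typically produces when a frame is written with `to_csv` (index included) and read back without `index_col`. The sketch below uses a made-up two-row frame and an in-memory CSV instead of the real file, and compares the drop used in this notebook against `index_col=0`, which is only an alternative and not something the notebook itself does.

```python
import io
import pandas as pd

# Made-up two-row frame standing in for the real sample CSV.
original = pd.DataFrame({"id": [86, 93], "leaderboard_18_1_reps": [328, 309]})

# to_csv writes the index as an unnamed first column by default.
csv_text = original.to_csv()

# Approach used in this notebook: read everything, then drop the stray column.
dropped = pd.read_csv(io.StringIO(csv_text)).drop(["Unnamed: 0"], axis=1)

# Alternative: tell read_csv that the first column is the index.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Both routes should leave the same columns and the same values.
print(dropped.columns.tolist() == indexed.columns.tolist())
print(dropped.values.tolist() == indexed.values.tolist())
```

Either route removes the extra column. Dropping it by name, as done here, makes the intent explicit at the cost of failing if the column is ever absent.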


## grabbing columns of interest
This notebook performs machine learning on the Open workouts. The learning tasks come in two forms: with the remaining Open workouts as features, and without them (benchmarks only). We'll start by extracting the names of the Open workout columns.

In [5]:
open_keys = list(df.columns[1:7])
open_keys

['leaderboard_18_1_reps',
 'leaderboard_18_2_time_secs',
 'leaderboard_18_2a_weight_lbs',
 'leaderboard_18_3_time_secs',
 'leaderboard_18_4_time_secs',
 'leaderboard_18_5_reps']
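
The slice `df.columns[1:7]` depends on the column order of the CSV. Since every Open column shown above shares the `leaderboard_18_` prefix, a name-based selection is one possible alternative. The sketch below runs on a made-up, empty frame with a subset of the real column names. The prefix match is an illustration, not the method this notebook uses.

```python
import pandas as pd

# Made-up frame with a subset of the real column names, for illustration only.
df = pd.DataFrame(columns=[
    "id",
    "leaderboard_18_1_reps",
    "leaderboard_18_2_time_secs",
    "height_in",
    "back_squat_lbs",
])

# Hypothetical alternative to df.columns[1:7]: match on the shared name prefix.
open_keys = [col for col in df.columns if col.startswith("leaderboard_18_")]
print(open_keys)
```

With the column layout shown in this notebook, both approaches should select the same six names. The prefix version keeps working if columns are reordered.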

## splitting the data based on features
We know there will be 6 different dependents (18.1, 18.2, 18.2a, 18.3, 18.4, 18.5), and each of these will have 2 feature sets, one per learning task (benchmarks only; benchmarks plus the other Open workouts). We'll separate the data iteratively below. The dictionary keys use the position of the workout in `open_keys`, so `0_limited` refers to 18.1.

In [20]:
datasets = {
    #keys use the index of the workout in open_keys (0 -> 18.1)
    #"0_limited": [features_0_limited_df, targets_0_limited_df],
    #"0_full": [features_0_full_df, targets_0_full_df],
    #"1_limited": ...
}
#iterate over each dependent
for i, key in enumerate(open_keys):
    #limited dataset
    datasets["{}_limited".format(i)] = [
        #features (only benchmarks; every open workout is dropped)
        df.drop(open_keys, axis=1),
        #targets (athlete id and the workout being predicted)
        df[["id", key]]
    ]
    #full dataset
    datasets["{}_full".format(i)] = [
        #features (benchmarks and other open workouts; only the target is dropped)
        df.drop(key, axis=1),
        #targets (athlete id and the workout being predicted)
        df[["id", key]]
    ]
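
Because the target column must never appear among its own features, a quick check over the dictionary can catch a slip in the `drop` calls. The sketch below repeats the construction from the loop above on a made-up miniature frame (two Open workouts, one benchmark) and adds hypothetical sanity checks that are not part of the original notebook.

```python
import pandas as pd

# Made-up miniature of the real frame: two Open workouts, one benchmark.
df = pd.DataFrame({
    "id": [86, 93, 1636],
    "leaderboard_18_1_reps": [328, 309, 402],
    "leaderboard_18_2_time_secs": [651, 480, 316],
    "back_squat_lbs": [335, 305, 400],
})
open_keys = ["leaderboard_18_1_reps", "leaderboard_18_2_time_secs"]

# Same construction as the loop above, on the toy frame.
datasets = {}
for i, key in enumerate(open_keys):
    targets = df[["id", key]]
    datasets["{}_limited".format(i)] = [df.drop(open_keys, axis=1), targets]
    datasets["{}_full".format(i)] = [df.drop(key, axis=1), targets]

# Hypothetical sanity checks: a dataset's target must not be among its features.
for name, (features, targets) in datasets.items():
    target_col = targets.columns[1]
    assert target_col not in features.columns, name
    if name.endswith("_limited"):
        # Benchmark-only features must contain no Open workout at all.
        assert not set(open_keys) & set(features.columns), name

print(sorted(datasets))
```

The same two assertions could be run against the real `datasets` dictionary right after it is built.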

In [21]:
#only benchmark features
datasets["0_limited"][0].head(3)

Unnamed: 0,id,height_in,weight_lbs,back_squat_lbs,clean_and_jerk_lbs,snatch_lbs,deadlift_lbs,max_pull_ups,fran_time_secs,grace_time_secs,helen_time_secs,filthy_50_time_secs,sprint_400_m_time_secs,run_5_km_time_secs
0,86,77,225,335,265,210,415,32,226,174,496,1287,64,1325
1,93,72,205,305,235,185,355,45,220,154,493,1500,62,1239
2,1636,70,200,400,285,215,485,67,138,108,429,1056,58,1219


In [22]:
#targets for the benchmark-only dataset
datasets["0_limited"][1].head(3)

Unnamed: 0,id,leaderboard_18_1_reps
0,86,328
1,93,309
2,1636,402


In [23]:
#benchmarks and other open workouts features
datasets["0_full"][0].head(3)

Unnamed: 0,id,leaderboard_18_2_time_secs,leaderboard_18_2a_weight_lbs,leaderboard_18_3_time_secs,leaderboard_18_4_time_secs,leaderboard_18_5_reps,height_in,weight_lbs,back_squat_lbs,clean_and_jerk_lbs,snatch_lbs,deadlift_lbs,max_pull_ups,fran_time_secs,grace_time_secs,helen_time_secs,filthy_50_time_secs,sprint_400_m_time_secs,run_5_km_time_secs
0,86,651,207,1373,637,77,77,225,335,265,210,415,32,226,174,496,1287,64,1325
1,93,480,235,1548,642,94,72,205,305,235,185,355,45,220,154,493,1500,62,1239
2,1636,316,260,1148,577,123,70,200,400,285,215,485,67,138,108,429,1056,58,1219


In [24]:
#targets for the benchmarks and other open workouts dataset (same as the limited targets)
datasets["0_full"][1].head(3)

Unnamed: 0,id,leaderboard_18_1_reps
0,86,328
1,93,309
2,1636,402
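
Both the features and the targets carry the `id` column, so each pair can be lined up by athlete before any model is fit. The sketch below assumes `id` is a unique identifier rather than a predictor, which the notebook does not state. The helper `to_xy` and the two made-up rows are hypothetical.

```python
import pandas as pd

def to_xy(features, targets):
    """Hypothetical helper: index both frames by id and return (X, y)."""
    X = features.set_index("id")
    y = targets.set_index("id").iloc[:, 0]
    # Reorder y to follow X so rows stay paired by athlete id.
    return X, y.loc[X.index]

# Made-up rows shaped like datasets["0_limited"], with targets deliberately shuffled.
features = pd.DataFrame({"id": [86, 93], "back_squat_lbs": [335, 305]})
targets = pd.DataFrame({"id": [93, 86], "leaderboard_18_1_reps": [309, 328]})

X, y = to_xy(features, targets)
print(X.shape, y.name, y.tolist())
```

Moving `id` into the index keeps it available for lookups while keeping it out of the feature matrix. If `id` were meant to be a feature, this step would be skipped.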
