# Assignment
In this course assignment you must build a predictive model to determine what place a runner will finish in a foot race. More specifically, you must use historical data to predict the place order for all participants running races in the year 2019 which are part of a series (e.g. they have been run annually, or at least once previously). **For each race you must predict the integer place ordering for all participants**. You are not predicting the top n finishers, or performance bands where people will finish. Instead, your method is expected to be integrated into a premium feature of an application such as Strava, where historical data and data about the racer (and who is signed up for the race!) could be used to help build a personalized prediction for them.

## Framing
Through this assignment you will demonstrate your ability to build sophisticated supervised machine learning models, from data manipulation through feature engineering and modelling. This is an authentic dataset, and a real-world problem. You can use whatever modelling method you would like to, and can characterise the problem as a regression, classification, or ordinal prediction problem. There are no particular guidelines you must follow, nor is guidance offered in the course _per se_; however, there is plenty of opportunity to ask course staff questions. **It is expected that this assignment will take significant effort**.

## About the Data
All of the races you are asked to predict outcomes for have a temporal relationship with some race in the past (e.g. they are part of an annual series), and I have included an identifier `sequence_id` to help identify this. The `sequence_id` will be included in all races you need to predict, so you can build race-specific features should you wish to. Races which are in your training set and do not have a `sequence_id` can be used however you like. There may be some races which have a `sequence_id` in the training set but do not have a `sequence_id` in the holdout set -- this all depends on what is offered in a given year!
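
As a rough illustration of the race-specific features mentioned above, the sketch below summarizes each `sequence_id` from past editions and joins the summary onto the rows to be predicted. Only `sequence_id` and `overall_ranking` come from this assignment; the `year` column, the toy values, and the feature names `past_editions` and `mean_field_size` are assumptions made up for the example.

```python
import pandas as pd

# Toy historical results. "year" is a hypothetical column used to tell
# editions of the same series apart; the real data may encode this differently.
history = pd.DataFrame({
    "sequence_id": ["a", "a", "a", "a", "a", "b", "b", "b"],
    "year": [2017, 2017, 2018, 2018, 2018, 2018, 2018, 2018],
    "overall_ranking": [1, 2, 1, 2, 3, 1, 2, 3],
})

# Number of finishers in each past edition of each series
field = (
    history.groupby(["sequence_id", "year"])
    .size()
    .rename("field_size")
    .reset_index()
)

# One row of features per series: how often it ran and how large it usually was
seq_features = (
    field.groupby("sequence_id")
    .agg(past_editions=("year", "nunique"), mean_field_size=("field_size", "mean"))
    .reset_index()
)

# Rows to predict; series "c" has no history, so its features come back as NaN
upcoming = pd.DataFrame({"sequence_id": ["a", "a", "b", "c"]})
upcoming = upcoming.merge(seq_features, on="sequence_id", how="left")
print(upcoming)
```

A left merge keeps every row of the data to be predicted, which matters here because a rank is required for every participant, including those in a series with no usable history.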

A couple of core concepts are important beyond sequences. First, races have categories, which generally (though not necessarily) denote the length of the race (e.g. 5k, marathon, etc.). I've cleaned this column into a new one, prepending the word `clean` so that a column such as `category.completed.name` becomes `clean_category.completed.name`. I have left the original data in there for you as well, and the transformations I've done have been largely to reduce dimensionality along lines I think are reasonable.

In addition to a category, there are `brackets`. Brackets typically denote demographic aspects of the runners and group them, such as Men aged 40-45. I have removed bracket information from the data and instead want you to focus on overall prediction which merges all runners in a given category together. This (should) line up with the rank order based on the individual's time, though I have not verified it (and predicting time is **not** the task).

## Evaluation Criteria
In this assignment you will be penalized for incorrect predictions in proportion to the distance by which you are incorrect within a given race. It does not matter whether you over- or under-predicted a given rank; you are penalized one point per position you are off for a given individual. All DNFs are removed from the dataset, so each person in the dataset has a rank. You *must* provide a predicted rank for each person; however, you may rank multiple people at the same spot if you would like (e.g. ties). The evaluation is for each combination of event and category, so a given event may have a 5 kilometer category, a 1 mile category, and so forth. Only individuals registered for a given event and category combination are included in the `DataFrame` you will be asked to predict for. Each event/category pair is equally weighted, and is scaled by the size of the event. Your overall prediction score will be the sum of all scores across the prediction tasks (e.g. across unique combinations of event and category). The exact scoring function is provided below.
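
To make the penalty concrete, here is a small worked sketch on a made-up four-runner race. The helper name `rank_error` and the toy ranks are assumptions for illustration; the `n * (n - 1)` scaling mirrors the `score` function that appears later in this notebook.

```python
import numpy as np
import pandas as pd

def rank_error(observed: pd.Series, predicted: pd.Series) -> float:
    """Sum of absolute rank differences, scaled by n*(n-1) for n runners."""
    n = len(observed)
    return float(np.sum(np.abs(observed - predicted)) / (n * (n - 1)))

# Hypothetical race: the true finishing order is 1..4
observed = pd.Series([1, 2, 3, 4])
# A prediction that swaps the first two finishers (two runners off by one each)
swapped = pd.Series([2, 1, 3, 4])
# A prediction that ties the first two finishers at rank 1 (one runner off by one)
tied = pd.Series([1, 1, 3, 4])

swapped_score = rank_error(observed, swapped)   # (1 + 1 + 0 + 0) / 12
tied_score = rank_error(observed, tied)         # (0 + 1 + 0 + 0) / 12
perfect_score = rank_error(observed, observed)  # 0 / 12
print(swapped_score, tied_score, perfect_score)
```

Lower is better: a perfect ordering scores zero, and each position of error for each runner adds the same amount regardless of direction.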

## Example Solution
The following cell contains an example solution to demonstrate the API which is used for this assignment. In short, you are to create an `sklearn.pipeline.Pipeline` object which you `fit()` on your training data using whatever method you like and serialize it to disk in a file called `pipeline.cloudpickle`. This object will then be reinstantiated in the autograder and evaluated based on the scoring function described above. Please note that the solution below would be a poor one; it is intended **only** to demonstrate the API for submission.
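
Before the full example, here is a minimal sketch of the fit, serialize, and reload round trip described above. The toy data, the choice of `age` as the only feature, and the temporary directory are assumptions for illustration; only the `Pipeline` object, the `predict()` call, and the file name `pipeline.cloudpickle` come from the assignment.

```python
import os
import tempfile

import cloudpickle
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Toy stand-in for the training data (values are made up)
toy = pd.DataFrame({
    "age": [25, 31, np.nan, 47, 52, 38],
    "sex": ["F", "M", "M", "F", "M", "F"],
    "overall_ranking": [1, 2, 3, 5, 6, 4],
})
X_toy = toy.drop(columns=["overall_ranking"])
y_toy = toy["overall_ranking"]

# The pipeline picks out the columns it needs, so it tolerates being handed
# every column of the frame; unlisted columns are dropped by default.
pipe = Pipeline([
    ("select", ColumnTransformer([("age", SimpleImputer(strategy="median"), ["age"])])),
    ("model", LinearRegression()),
])
pipe.fit(X_toy, y_toy)

# Serialize under the expected file name (in a temp directory for this sketch)
path = os.path.join(tempfile.mkdtemp(), "pipeline.cloudpickle")
with open(path, "wb") as f:
    cloudpickle.dump(pipe, f)

# Reload it the way a grader might, then predict
with open(path, "rb") as f:
    restored = cloudpickle.load(f)
preds = restored.predict(X_toy)
print(preds)
```

Note that this sketch returns raw regression outputs, one per row; they would still need to be converted into ranks before they are meaningful under the scoring function.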

In [2]:
import pandas as pd
import numpy as np
import cloudpickle
import sklearn
df=pd.read_csv("../../assets/assignment/df_train.csv.gz")
events=df['event.id'].unique()

train_set=events[0:100]
test_set=events[100:200]
holdout_set=events[200:300]

train=df.query("`event.id` in @train_set")
test=df.query("`event.id` in @test_set")
holdout=df.query("`event.id` in @holdout_set")

In [3]:
import pandas as pd
import numpy as np
import cloudpickle
import sklearn

# This code simulates the autograder. It is not the full autograder implementation
# but shares an API with the autograder. It expects that your fitted pipeline is
# submitted with the name pipeline.cloudpickle as demonstrated above. This object
# must implement the predict() function. This is done automatically by the sklearn
# Pipeline object if the last element of your pipeline is a classifier which has
# a predict() function. If you are not submitting a Pipeline, and want to do something
# different, you *must* have a predict() function of the same method signature, e.g.:
#
#   predict(self, X, **predict_params)->np.ndarray

# Load holdout data, in this case I'll simulate it by loading the training data
df=pd.read_csv("../../assets/assignment/df_train.csv.gz")

# And evaluate on all 5k races that we didn't consider for training
holdout_data=df.query("`event.id`!='583f013a-1e54-4906-87f7-2b625206f5f9' and `clean_categories.name`=='5k'")


# This is the scoring function to determine model fitness
def score(left: pd.DataFrame, right: pd.DataFrame):
    '''
    Calculates the difference between the left and the right when considering rank of items. 
    This scoring function requires that the two DataFrames have identical indices, and that
    they each contain only one column of values and no missing values. Props to Blake Atkinson
    for providing MWE indicating issues with autograder version #1.
    '''
    assert(type(left)==pd.DataFrame)
    assert(type(right)==pd.DataFrame)
    assert(len(left)==len(right))
    assert(not np.any(np.isnan(left)))
    assert(not np.any(np.isnan(right)))
    assert(left.index.equals(right.index))
    # convert to ndarrays
    left=left.squeeze()
    right=right.squeeze()
    return np.sum(np.abs(left-right))/(len(left)*(len(left)-1))

# This function runs the prediction model against a given event/category pair. It
# intentionally loads the student model each time to avoid accidental leakage of data
# between events.
def evaluate(data, pipeline_file='pipeline.cloudpickle'):
    # Load student pipeline, closing the file handle once it has been read
    with open(pipeline_file,'rb') as f:
        fitted_pipe = cloudpickle.load(f)
    
    # Separate out the X and y
    X=list(set(data.columns)-{'overall_ranking'})
    y=['overall_ranking']
    
    # Drop any missing results (DNFs)
    data=data.dropna(subset=['overall_ranking'])
    
    # Ensure there is data to actually predict on
    if len(data)==0:
        return np.nan

    # Predict on unseen data
    predictions=pd.DataFrame(fitted_pipe.predict(data[X]),data.index)
    observed=data[y]
    
    # Generate rankings within this bracket
    observed=pd.DataFrame(data[y].rank(),data.index)
    
    # Return the ratio of the student score
    return pd.Series({"score":score(observed,predictions)})

# Student solution
pipeline_file='pipeline.cloudpickle'

def autograde(data_held_out):
    # Run prediction on each group
    results=data_held_out.groupby(["event.id","clean_categories.name"]).apply(evaluate, pipeline_file)

    # Display the results, uncomment this for your own display
    results.reset_index()['score'].plot.bar();

    # This is the student final grade
    print(np.average(results))

In [4]:
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 8))

def roll_own()->object:
    """This function returns a fitted object with a predict(x) function"""
    
    # First I'm going to create a new class with a predict function, and that class
    # is just going to call the regression object it is setup with and then rerank
    # all of the values which come back
    class RollingRegressor():
        
        # For this class I'm going to assume it has been given a fitted model, so
        # I'm choosing not to implement the fit() function.
        def __init__(self, fitted_model):
            self.regressor=fitted_model
        
        # For the prediction we are just given our dataframe, so we have to do our
        # data cleaning here.
        def predict(self, X, **predict_params):
            # We need to be careful and *not* drop rows. The autograder is expecting
            # a rank back for every row in X! Let's just grab out the two columns of
            # interest, copying so the cleaning below does not write into a slice of X.
            df=X[["age","sex"]].copy()
            
            # For brevity let's normalize the abbreviations and replace any sex that isn't
            # Male/Female with nan. A better approach would be to inspect and map this data
            # accordingly, but I'll leave that as an enhancement.
            df.loc[df['sex'] == 'F', 'sex'] = 'Female'
            df.loc[df['sex'] == 'M', 'sex'] = 'Male'
            df.loc[~df['sex'].isin(['Male','Female']), 'sex']=np.nan
            
            # Now that this is binary we can convert this column into a numeric value
            df.loc[df['sex'] == 'Male', 'sex']=2
            df.loc[df['sex'] == 'Female', 'sex']=1
            
            # Now just do whatever you want with missing values, this below doesn't seem ideal
            df=df.fillna(0)
            
            # With the data cleaning done, we can now predict the times for our data
            times=self.regressor.predict(df)
            
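            # We can't return the times directly - the autograder wants ranks. Note that
            # a single argsort() returns the indices that would sort the array, not ranks;
            # applying argsort() twice gives each row's 0-based rank, so we add 1.
            return times.squeeze().argsort().argsort()+1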
        
    # Our return class is done, now we just need to initialize it with a fitted
    # model. To fit the model we just do all of the cleaning over, and add in some training.
    # It would be a better idea to put this all in the class itself, but I want to
    # show you that this isn't needed -- the autograder is NOT going to try and fit()
    # your model, it is only going to call predict(), so you can do whatever you want
    # within that predict()
    
    df=train[["age","sex"]]
    #df.loc[df.query("`sex` not in ['Male','Female','Unspecified']").index, 'sex']=np.nan
    #df.loc[df.query("`sex` == 'Male'").index, 'sex']=2
    #df.loc[df.query("`sex` == 'Female'").index, 'sex']=1
    #df.loc[df.query("`sex` == 'Unspecified'").index, 'sex']=0
    #df=df.fillna(0)

    
    # Since we have decided it's a regression problem, we can decide to use a simple linear
    # model for our first attempt too, so I'll create that now
    #from sklearn.linear_model import LinearRegression
    #reg=LinearRegression()
    #reg.fit(df,y)
    
    # Now we just return the object that the autograder will want
    #return RollingRegressor(reg)
    return df
# We can test this out by instantiating it
fitted_reg=roll_own()
# Then saving it to a file, closing the handle so the contents are flushed to disk
with open('pipeline.cloudpickle','wb') as f:
    cloudpickle.dump(fitted_reg, f)
# Then telling the autograder function to fire
autograde(holdout)

ValueError: cannot insert clean_categories.name, already exists

<Figure size 864x576 with 0 Axes>
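
For reference, the wrapper idea from the cell above can be sketched end to end on toy data. Everything here is an assumption for illustration: the class name `RankingWrapper`, the made-up ages, and the choice of `overall_ranking` as the regression target. It is not the course's reference solution, only a small runnable instance of "predict a value, then convert to ranks".

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

class RankingWrapper:
    """Wraps a fitted regressor and converts its outputs to 1-based ranks."""

    def __init__(self, fitted_model, columns):
        self.regressor = fitted_model
        self.columns = columns

    def predict(self, X, **predict_params) -> np.ndarray:
        # Keep every row: a rank is expected for each participant in X
        features = X[self.columns].fillna(0)
        raw = np.ravel(self.regressor.predict(features))
        # rank() turns raw outputs into positions; method="min" lets exact
        # ties share the same (lowest) rank
        return pd.Series(raw).rank(method="min").astype(int).to_numpy()

# Toy training data where older runners finish later (values are made up)
train_toy = pd.DataFrame({"age": [20, 30, 40, 50], "overall_ranking": [1, 2, 3, 4]})
reg = LinearRegression().fit(train_toy[["age"]], train_toy["overall_ranking"])
model = RankingWrapper(reg, columns=["age"])

# One event/category group to predict, including a column the model ignores
group = pd.DataFrame({"age": [45, 22, 33], "sex": ["M", "F", "F"]})
ranks = model.predict(group)
print(ranks)

# For contrast: a single argsort gives sort order, which is not the same as rank
raw = reg.predict(group[["age"]])
sort_order = np.argsort(raw) + 1
print(sort_order)
```

Because the simulated autograder calls `predict()` once per event/category group, ranking inside `predict()` produces ranks within that group.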