# SENG 474 Project
Andrew Wiggins (V00817291), Matt Ehrler (V00824990), Zenara Daley (V00820899)
# 1 Introduction

American football is played by two teams, each with 11 active players on the field at a time. The objective is for a team to move the ball towards and into the opposing team's end zone. When the ball reaches the end zone a touchdown occurs, which awards six points to the offensive team.

The process of moving the ball into the end zone is broken up into downs. The ball must move a minimum of 10 yards towards the defending team's end zone within 4 downs; otherwise possession of the ball is given to the defending team. In summary, the game of football is about gaining territory in order to score points.

Rushing is a common action in football. It simply means that the ball is advanced by a player running with it, in contrast to throwing or kicking the ball. A rush usually occurs when the quarterback hands off the ball to a running back. However, rushing is not restricted to running backs; quarterbacks and wide receivers can also rush. The main intent of a rush is to gain as many yards as possible within a down. Approximately one third of all yards gained come from rushing plays.

The National Football League (NFL) is keen to discover what underlying factors contribute to a successful rush. Understanding these factors can "help teams, media, and fans better understand the skill of players and the strategies of coaches". Additionally, it will help the NFL directly make assessments about the rusher and other game factors.

In order to find the factors that contribute to a successful rush, the NFL has opened the NFL Big Data Bowl. This competition was published on Kaggle, an online data science and machine learning community that hosts large data sets and provides a platform for collaborative competitions. The NFL Big Data Bowl competition provides a large game and player data set to be used for training.

In the following code block, we initialize the project by importing various libraries, defining global variables and lookup maps, and loading the relevant Kaggle NFL data into a Pandas dataframe.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import scipy
import lightgbm as lgb
from kaggle.competitions import nflrush
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import re
from collections import defaultdict
from sklearn.neighbors import NearestNeighbors
import datetime
import copy
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
env = nflrush.make_env()
categories = []
labelEncoder = LabelEncoder()
rushers =[]

# open (1) or closed (0)
in_out_map = defaultdict(int,{
    "open": 1,
    "field": 1, 
    "out": 1,
    "oud": 1,
    "our": 1,
    "cloud": 1,
    "close": 0,
    "retract": 0,
    "dome": 0,
})

# team abbreviations that differ between columns (e.g. "ARZ" vs "ARI") are unified in cleanData


# map to values between 0-1
weather_map = {
    "controlled": 1,
    "indoor": 1,
    "indoors": 1,
    "sunny": 0.8,
    "clear": 0.6, 
    "cloudy": 0.4,
    "coudy": 0.4,
    "hazy": 0.4,
    "cool": 0.4,
    "rain": 0.2,
    "rainy": 0.2,
    "cold": 0,
    "snow": 0,
}

turf_map = {
    "Field Turf": "Artificial",
    "A-Turf Titan": "Artificial",
    "Grass": "Natural",
    "UBU Sports Seed S5-M": "Artificial",
    "Artificial": "Artificial",
    "DD GrassMaster": "Artificial",
    "Natural Grass": "Natural",
    "UBU Seed Series-S5-M": "Artificial",
    "FieldTurf": "Artificial",
    "FieldTurf 360": "Artificial",
    "Natural grass": "Natural",
    "grass": "Natural",
    "Natural": "Natural",
    "Artifical": "Artificial",
    "FieldTurf360": "Artificial",
    "Naturall Grass": "Natural",
    "Field turf": "Artificial",
    "SISGrass": "Artificial",
    "Twenty-Four/Seven Turf": "Artificial",
    "natural grass": "Natural" 
}

# Training data is in the competition dataset as usual
train_df = pd.read_csv('/kaggle/input/nfl-big-data-bowl-2020/train.csv', low_memory=False)

# 2 Data Collection and Preprocessing

This section describes the steps taken to collect and preprocess the data used for data mining. Collection is the trivial component of the project; preprocessing is the non-trivial component.

## 2.1 Collection

As part of the NFL Big Data Bowl competition, a training data set was given. The data set contains tracking data for running plays in a `.csv` file format. The columns in this file are described in the table below:


| Column | Description | Type | 
|:-------|:------------|:-----|
| GameId | a unique game identifier | int |
| PlayId | a unique play identifier | int |
| Team | home or away | string |
| X | player position along the long axis of the field | float |
| Y | player position along the short axis of the field | float |
| S | speed in yards/second | float |
| A | acceleration in yards/second^2 | float |
| Dis | distance traveled from prior time point, in yards | float |
| Orientation | orientation of player (deg) | float |
| Dir | angle of player motion (deg) | float |
| NflId | a unique identifier of the player | int |
| DisplayName | player's name | string |
| JerseyNumber | jersey number | int |
| Season | year of the season | int |
| YardLine | the yard line of the line of scrimmage | int |
| Quarter | game quarter (1-5, 5 == overtime) | int |
| GameClock | time on the game clock | timestamp |
| PossessionTeam | team with possession | string |
| Down | the down (1-4) | int |
| Distance | yards needed for a first down | int |
| FieldPosition | which side of the field the play is happening on | string |
| HomeScoreBeforePlay | home team score before play started | int |
| VisitorScoreBeforePlay | visitor team score before play started | int |
| NflIdRusher | the NflId of the rushing player | string |
| OffenseFormation | offense formation | string |
| OffensePersonnel | offensive team positional grouping | string |
| DefendersInTheBox | number of defenders lined up near the line of scrimmage, spanning the width of the offensive line | int |
| DefensePersonnel | defensive team positional grouping | string |
| PlayDirection | direction the play is headed | string |
| TimeHandoff | UTC time of the handoff | timestamp |
| TimeSnap | UTC time of the snap | timestamp |
| Yards | the yardage gained on the play (you are predicting this) | int |
| PlayerHeight | player height (ft-in) | string |
| PlayerWeight | player weight (lbs) | float |
| PlayerBirthDate | birth date (mm/dd/yyyy) | string |
| PlayerCollegeName | where the player attended college | string |
| Position | the player's position (the specific role on the field that they typically play) | string |
| HomeTeamAbbr | home team abbreviation | string |
| VisitorTeamAbbr | visitor team abbreviation | string |
| Week | week into the season | int |
| Stadium | stadium where the game is being played | string |
| Location | city where the game is being played | string |
| StadiumType | description of the stadium environment | string |
| Turf | description of the field surface | string |
| GameWeather | description of the game weather | string |
| Temperature | temperature (deg F) | float |
| Humidity | humidity (percentage) | float |
| WindSpeed | wind speed in miles/hour | float |
| WindDirection | wind direction | string |

## 2.2 Preprocessing

Many of the non-numeric columns, such as `StadiumType` and `GameWeather`, contained free-form text rather than a discrete set of values and required cleaning. These columns were resolved by mapping keywords to integer or float values.


Some numeric columns, such as `PlayerHeight` and `WindSpeed`, also required cleaning. `PlayerHeight` was converted from a feet-inches string to a single integer number of inches. `WindSpeed` required stripping non-numeric text such as units and directions.


Timestamp columns required special cleaning. The `GameClock` column contained a colon-separated string of minutes and seconds; it was converted into two columns, `GameClock_minutes` (the minutes component) and `GameClock_seconds` (the clock reading converted to seconds). The `TimeSnap` and `TimeHandoff` timestamps were parsed with Python's `datetime` module and reduced to a single `TimeDelta` column holding the number of seconds between the snap and the handoff [1].

A major component of preprocessing involved converting every play so that the offense moves in a standard direction, instead of a direction that depends on `PlayDirection` (right or left). The columns `Orientation`, `X`, `Y`, `Dir`, and `YardLine` were standardized into `StdOrientation`, `StdX`, `StdY`, `StdDir`, and `StdYardline`, respectively, and the original columns were then dropped. To standardize the `Orientation` and `Dir` columns, 180 degrees was added (modulo 360) to all rows with a `PlayDirection` value of left. To standardize the `X` and `Y` columns, the values in rows with a `PlayDirection` value of left were subtracted from 120 and 53.3 (the length and width of the field in yards), respectively. To compute `StdYardline`, the `YardLine` value was kept when the ball was on the possession team's own side of the field (`FieldPosition` equal to `PossessionTeam`) and subtracted from 100 otherwise. Finally, the 2017 orientation convention differed from other years and needed to be rotated by 90 degrees [2].
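As a concrete illustration of the direction standardization, the sketch below flips a single left-moving player record using plain Python. The function name `standardize` and the dictionary layout are hypothetical and chosen only for this example; the notebook itself operates on Pandas columns.

```python
FIELD_LENGTH = 120.0  # yards, including both end zones
FIELD_WIDTH = 53.3    # yards

def standardize(record):
    """Return a copy of a player record expressed as if the play moved right.

    `record` is a hypothetical dict with keys X, Y, Dir, Orientation and
    PlayDirection, mirroring the column names in the data set.
    """
    out = dict(record)
    if record["PlayDirection"] == "left":
        # mirror the position through the centre of the field
        out["X"] = FIELD_LENGTH - record["X"]
        out["Y"] = FIELD_WIDTH - record["Y"]
        # rotate both angles by half a turn, wrapping into [0, 360)
        out["Dir"] = (record["Dir"] + 180.0) % 360.0
        out["Orientation"] = (record["Orientation"] + 180.0) % 360.0
        out["PlayDirection"] = "right"
    return out

player = {"X": 30.0, "Y": 20.0, "Dir": 270.0, "Orientation": 200.0,
          "PlayDirection": "left"}
flipped = standardize(player)
print(flipped["X"], flipped["Dir"], flipped["Orientation"])  # 90.0 90.0 20.0
```

Records that already move right pass through unchanged, so the same function can be applied to every row.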

The `cleanData` function in the next code cell performs these steps.

In [None]:
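def windSpeedToInt(windSpeed):
    # collect every number in the string, e.g. "10-20 mph" -> ['10', '20']
    x = re.findall(r'[0-9]+(?:\.[0-9]+)?', str(windSpeed))
    if len(x) == 0:
        # no digits at all (NaN or a direction such as "SSW")
        return np.nan
    if len(x) > 1:
        # a range is replaced by the mean of its first two numbers
        return (float(x[0]) + float(x[1])) / 2
    return float(x[0])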

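def stadiumTypeToInt(stadium):
    x = re.findall(r"[\w']+", str(stadium).lower())
    if len(x) == 0:
        return np.nan
    weight = 0
    for word in x:
        # keys are keyword stems ("out" is meant to match "outdoor"),
        # so match by substring; words matching no key count as 0
        weight += max((v for k, v in in_out_map.items() if k in word), default=0)
    weight = weight / len(x)
    return round(weight)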

def heightToInches(height):
    return int(height.split('-')[0]) * 12 + int(height.split('-')[1])

def convertToProb(num):
    # step-function CDF: index i represents i - 99 yards, so entries are 1 from `num` yards up
    return np.array([1 if i >= num + 99 else 0 for i in range(199)])
    
def crps(y_true, y_pred):
    y_pred = np.clip(np.cumsum(y_pred,axis = 1),0,1)
    return np.mean((y_pred-y_true)**2)

def gameWeatherToInt(weather):
    x = re.findall(r"[\w']+", str(weather).lower())
    valid = 0
    weight = 0
    for word in x:
        if word in weather_map:
            weight += weather_map[word]
            valid += 1
    if valid == 0:
        return np.nan
    return str(weight / valid)

def gameClockToSeconds(gameClock):
    # GameClock is split on ':'; the first two parts are minutes and seconds
    times = str(gameClock).split(':')
    return int(times[0]) * 60 + int(times[1])

def cleanData(df):
    df = df.replace(["ARZ","BLT","CLV","HST"],["ARI","BAL","CLE","HOU"])
    # clean WindSpeed column
    df["WindSpeed"] = df["WindSpeed"].apply(windSpeedToInt).astype("float64")
    
    # convert Height column to inches
    df["PlayerHeight"] = df["PlayerHeight"].apply(heightToInches)
    
    # clean StadiumType column
    df["StadiumType"] = df["StadiumType"].apply(stadiumTypeToInt).astype("category")
    
    # clean GameWeather column
    df["GameWeather"] = df["GameWeather"].apply(gameWeatherToInt).astype("category")
    
    # team offense cleaning
    df['TeamOnOffense'] = "home"
    df.loc[df.PossessionTeam != df.HomeTeamAbbr, 'TeamOnOffense'] = "away"
    df['IsOnOffense'] = df.Team == df.TeamOnOffense
    
    # convert orientation to move in a standard direction
    df['MovingLeft'] = df.PlayDirection == "left"
    df['StdOrientation'] = df.Orientation
    df.loc[df.MovingLeft, 'StdOrientation'] = np.mod(180 + df.loc[df.MovingLeft, 'StdOrientation'], 360)
    
    # convert X and Y to move in a standard direction
    df['StdX'] = df.X
    df.loc[df.MovingLeft, 'StdX'] = 120 - df.loc[df.MovingLeft, 'X'] 
    df['StdY'] = df.Y
    df.loc[df.MovingLeft, 'StdY'] = 53.3 - df.loc[df.MovingLeft, 'Y'] 
    
    # convert direction to standard
    df['StdDir'] = df.Dir
    df.loc[df.MovingLeft, 'StdDir'] = np.mod(180 + df.loc[df.MovingLeft, 'StdDir'], 360)
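    # the 2017 season used a different orientation reference; rotate the standardized column by 90 degrees [2]
    df.loc[df['Season'] == 2017, 'StdOrientation'] = np.mod(90 + df.loc[df['Season'] == 2017, 'StdOrientation'], 360)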
    
    df['StdYardline'] = 100 - df.YardLine
    df.loc[df.FieldPosition.fillna('') == df.PossessionTeam,'StdYardline'] = df.loc[df.FieldPosition.fillna('') == df.PossessionTeam,'YardLine']
    
    # turf cleaning
    df['Turf'] = df['Turf'].map(turf_map)
    
    # time cleaning
    df['GameClock_seconds'] = df['GameClock'].apply(gameClockToSeconds)
    df['GameClock_minutes'] = df['GameClock'].apply(lambda time: int(time.split(':')[0]))
    
    df['TimeHandoff'] = df['TimeHandoff'].apply(lambda time: datetime.datetime.strptime(time, "%Y-%m-%dT%H:%M:%S.%fZ"))
    df['TimeSnap'] = df['TimeSnap'].apply(lambda time: datetime.datetime.strptime(time, "%Y-%m-%dT%H:%M:%S.%fZ"))
    
    df['TimeDelta'] = df.apply(lambda row: (row['TimeHandoff'] - row['TimeSnap']).total_seconds(), axis=1)
    
    df = df.drop(["X","Y","Dir","Orientation","YardLine","TimeHandoff","TimeSnap"],axis=1)
    
    return df

## 2.3 Feature Engineering

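Missing values in the categorical columns `WindDirection`, `FieldPosition`, `OffenseFormation`, and `Turf` were replaced with the string `Unknown` using the Pandas `fillna` function, so that they could be label encoded like any other category.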

The LGBM model works best when categorical values are encoded as numbers. We therefore built a label encoder that encodes every known category as a number; at test time, any category that was not seen during training is encoded as `Unknown`. To prevent collisions between features, every value was encoded with its feature name prepended.
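The encoding scheme can be sketched in plain Python as follows. The class `PrefixedEncoder` is a hypothetical stand-in written for this example; it assigns codes in insertion order, whereas the notebook uses scikit-learn's `LabelEncoder`, so the actual integer codes will differ.

```python
class PrefixedEncoder:
    """Toy label encoder: one integer per "<feature>__<value>" string."""

    def __init__(self):
        self.codes = {}

    def fit(self, feature, values):
        # reserve a fallback code for values never seen during training
        for value in list(values) + ["Unknown"]:
            key = f"{feature}__{value}"
            self.codes.setdefault(key, len(self.codes))

    def transform(self, feature, values):
        unknown = self.codes[f"{feature}__Unknown"]
        return [self.codes.get(f"{feature}__{v}", unknown) for v in values]

encoder = PrefixedEncoder()
encoder.fit("Team", ["home", "away"])
encoder.fit("Turf", ["Natural", "Artificial"])

print(encoder.transform("Team", ["away", "home"]))       # [1, 0]
print(encoder.transform("Turf", ["Natural", "Hybrid"]))  # [3, 5]
```

Because the feature name is part of the key, the same raw value appearing in two different features never shares a code, and the unseen value `Hybrid` falls back to the `Turf__Unknown` code.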

Another portion of the data cleaning condensed each play into a single row, indexed by the rusher. Before doing this, all other players were taken into consideration through the addition of two new columns, `NearestOffensivePlayer` and `NearestDefensivePlayer`. These columns were created from the standardized X and Y coordinates, using scikit-learn's `NearestNeighbors` class to find the distance from the rusher to the nearest teammate and the nearest opponent. Using a similar technique, another column called `NearestLikelyTackle` was created by considering every defensive player with the position of Outside Linebacker (OLB), Inside Linebacker (ILB), Defensive End (DE), or Defensive Tackle (DT), and finding the distance to whichever was closest to the rusher. These positions were chosen because they are the most likely to tackle the rusher during any given play. Since all players were accounted for in these newly added features, we dropped all rows that did not belong to the rushing player.
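The nearest-player distance can be illustrated with a brute-force search in plain Python. This is a minimal sketch rather than the notebook's method: the helper `nearest_distance` is hypothetical, and it replaces the ball-tree search of `NearestNeighbors` with a direct minimum over Euclidean distances, which gives the same answer for the 11 players per team involved here.

```python
import math

def nearest_distance(rusher_xy, players_xy):
    """Euclidean distance from the rusher to the closest of `players_xy`.

    Returns None when the list is empty, for example when no defender
    with a likely-tackler position is on the field.
    """
    if not players_xy:
        return None
    return min(math.dist(rusher_xy, xy) for xy in players_xy)

rusher = (30.0, 25.0)
defenders = [(33.0, 29.0), (40.0, 25.0), (31.0, 10.0)]
print(nearest_distance(rusher, defenders))  # 5.0
```

The same helper could be called once with the offensive players, once with all defenders, and once with only the OLB, ILB, DE, and DT defenders to produce the three features.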

A few other features were constructed because we thought they might improve the score. For example, the speed at which the rusher is moving towards the end zone was calculated using trigonometry. We also calculated the distance from the rusher to the first-down line. Since the first-down line is commonly where the rusher is trying to get to, it stands to reason that the distance to it will be correlated with the number of yards gained. Next, we calculated the force that the rusher is exerting using Newton's Second Law of Motion (f = ma), on the assumption that a rusher exerting more force can run through other players more easily. Based on the training data, we also calculated the average number of yards that each rusher carries the ball per play, as a rough measure of how good a rusher is at his job [3]. If a rusher has not been seen in the training data, we set this value to 0.
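The arithmetic behind these features is small enough to show on a single play. The sketch below is illustrative only: the function `rusher_features` and its argument names are hypothetical, `target_x` stands in for whichever line the distance is measured to, and the angle convention (90 degrees pointing straight down the field) is assumed from the way the notebook applies the sine to the standardized direction.

```python
import math

def rusher_features(speed, direction_deg, accel, weight_lbs, x, target_x):
    """Compute three illustrative per-play features for the rusher."""
    return {
        # component of the speed along the long axis of the field
        "hspeed": speed * math.sin(math.radians(direction_deg)),
        # remaining distance along the long axis to a target line
        "to_line": target_x - x,
        # "force" as acceleration times weight; weight in lbs stands in
        # for mass, so the unit is not a physical newton
        "force": accel * weight_lbs,
    }

features = rusher_features(speed=4.0, direction_deg=90.0, accel=2.5,
                           weight_lbs=220.0, x=35.0, target_x=42.0)
print(features)
```

A rusher moving straight down the field keeps his full speed as `hspeed`, while one moving sideways (0 or 180 degrees under this convention) contributes almost nothing.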

Lastly, we dropped many features that we deemed unimportant through rigorous cross-validation testing, such as `Season`, `WindSpeed`, `GameId`, `PlayId`, and `Week`.

In [None]:
def pipeline(train_df):
    train_df = cleanData(train_df)
    
    train_df["WindDirection"] = train_df["WindDirection"].fillna("Unknown")
    train_df["FieldPosition"] = train_df["FieldPosition"].fillna("Unknown")
    train_df["OffenseFormation"] = train_df["OffenseFormation"].fillna("Unknown")
    train_df["Turf"] = train_df["Turf"].fillna("Unknown")

    categorical_features = train_df.select_dtypes(include=["object"]).columns
    
    global labelEncoder
    global categories
    global rushers
    if len(categories) == 0:
        for feature in categorical_features:
            train_df[feature] = feature + "__" + train_df[feature].astype("str")
            categories += train_df[feature].unique().tolist()
            categories += [feature +"__Unknown"]
        labelEncoder.fit(categories)

        for feature in categorical_features:
            train_df[feature] = labelEncoder.transform(train_df[feature])
        rushers = train_df[train_df["NflId"] == train_df["NflIdRusher"]]
        rushers = rushers.groupby(["NflId"])["Yards"].agg(AvgCarry="mean")
    else:
        for feature in categorical_features:
            train_df[feature] = feature + "__" + train_df[feature].astype("str")
            unseenLabels = ~train_df[feature].isin(categories)
            train_df.loc[unseenLabels,[feature]] = feature + "__Unknown"
            train_df[feature] = labelEncoder.transform(train_df[feature])
        
    # get distance to nearest neighbor of rusher
    train_df["NearestOffensivePlayer"] = 0
    train_df["NearestDefensivePlayer"] = 0
    for play in train_df.PlayId.unique():
        # get all rows in play
        playXY = train_df.loc[train_df["PlayId"] == play, ['StdX', 'StdY', 'NflId', 'NflIdRusher', 'PossessionTeam', 'Team', 'HomeTeamAbbr',"Position"]]
        # find rusher and their team
        rusher_team = playXY.loc[playXY['NflId'] == playXY['NflIdRusher'], ['Team']]
        rusherXY = playXY.loc[playXY['NflId'] == playXY['NflIdRusher'], ['StdX', 'StdY']]
        # find players and split by offense/defense
        playersXY = playXY.loc[playXY['NflId'] != playXY['NflIdRusher'], ['StdX', 'StdY', 'Team',"Position"]]
        playerOffenseXY = playersXY.loc[playersXY['Team'] == rusher_team.iloc[0]['Team'], ['StdX', 'StdY',"Position"]]
        playerDefenseXY = playersXY.loc[playersXY['Team'] != rusher_team.iloc[0]['Team'], ['StdX', 'StdY',"Position"]]
        # find X,Y coordinate of nearest offensive/defensive neighbor
        o_neighbours = NearestNeighbors(n_neighbors=1, algorithm='ball_tree').fit(playerOffenseXY.drop(["Position"],axis=1))
        d_neighbours = NearestNeighbors(n_neighbors=1, algorithm='ball_tree').fit(playerDefenseXY.drop(["Position"],axis=1))
        # find hypotenuse of neighbor
        o_dist, _ = o_neighbours.kneighbors(rusherXY)
        d_dist, _ = d_neighbours.kneighbors(rusherXY)
        train_df.loc[train_df['PlayId'] == play, 'NearestOffensivePlayer'] = o_dist[0][0]
        train_df.loc[train_df['PlayId'] == play, 'NearestDefensivePlayer'] = d_dist[0][0]
        
        # find x, y of likely tackle positions
        playerOLBXY = playerDefenseXY.loc[playerDefenseXY["Position"] == labelEncoder.transform(['Position__OLB'])[0], ['StdX', 'StdY']]
        playerILBXY = playerDefenseXY.loc[playerDefenseXY["Position"] == labelEncoder.transform(['Position__ILB'])[0], ['StdX', 'StdY']]
        playerDEXY = playerDefenseXY.loc[playerDefenseXY["Position"] == labelEncoder.transform(['Position__DE'])[0], ['StdX', 'StdY']]
        playerDTXY = playerDefenseXY.loc[playerDefenseXY["Position"] == labelEncoder.transform(['Position__DT'])[0], ['StdX', 'StdY']]
        playersTackleXY = pd.concat([playerOLBXY,playerILBXY,playerDEXY,playerDTXY])
        if not playersTackleXY.empty:
            tackle_neighbours = NearestNeighbors(n_neighbors=1, algorithm='ball_tree').fit(playersTackleXY)
            dis, _ = tackle_neighbours.kneighbors(rusherXY)
        else:
            dis = [[1000000]]
        # find minimum and add feature
        train_df.loc[train_df['PlayId'] == play, 'NearestLikelyTackle'] = dis[0][0]
        
    train_df[categorical_features] = train_df[categorical_features].astype('category')
    
    train_df = train_df[train_df["NflId"] == train_df["NflIdRusher"]]
    
    train_df["hspeed"] = train_df["S"] * np.sin(np.deg2rad(train_df["StdDir"]))
    train_df["to_line"] = train_df["StdYardline"] - train_df["StdX"]
    train_df["force"] = train_df["A"] * train_df["PlayerWeight"]
    train_df = pd.merge(train_df,rushers, on="NflId",how="left")
    train_df["AvgCarry"] = train_df["AvgCarry"].fillna(0)
    
    dropCols = categorical_features.drop(["Team","OffenseFormation","OffensePersonnel","DefensePersonnel","WindDirection"])
    dropCols = dropCols.append(pd.Index(["NflIdRusher","Season","WindSpeed","GameId","PlayId","Week"]))
    #print(dropCols)
    train_df = train_df.drop(dropCols, axis=1)
    
    return train_df

# 3 Data Mining 

This section describes how the NFL data was used to train a model and make predictions on the test data. This is a trivial component of the project.

## 3.1 Choosing a Model

Since we were trying to predict a value for each rushing play, we initially treated this as a regression problem and used XGBoost. We chose a tree-based model because such models are easy to use and have been successful in past Kaggle competitions. Additionally, the well-structured data allowed us to engineer features easily, which worked effectively with a tree-based algorithm. However, with this approach a lot of information was lost when converting a single predicted value into a probability distribution, resulting in a poor score. We therefore moved to classification. Because all possible values of yards are known, the predicted class probabilities can be converted into a probability distribution without losing information. We also found LightGBM, an alternative gradient boosting framework that is more lightweight and provides native support for categorical values. Given that most of our data is categorical, we switched to LightGBM with multiclass classification, which significantly improved our score.

## 3.2 Current Model

The model used is [LightGBM](https://lightgbm.readthedocs.io/en/latest/), a light gradient boosting framework. It uses tree-based learning algorithms and is designed to be lightweight, with low memory usage and support for parallel and GPU learning. Our model is trained with multiclass classification, where each possible discrete value of yards is its own class. Yards range from -99 to 99, leaving us with 199 classes. This model is used alongside scikit-learn's [KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) class. KFold provides train and test indices for splitting a data set into k folds, where a single fold is used for validation and the remaining folds form the training set. For our KFold validator, we use 3 shuffled splits and pass in the entire cleaned data set as the training data and the `Yards` column as the target variable. For each split, a LightGBM model is trained on the training folds and a prediction is made on the held-out fold. The score of the model on each iteration is compared against the best score so far; if the score is better, the model replaces the current best. The model with the best score after all iterations is returned as the final prediction model.
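To make the fold structure concrete, the sketch below generates train and validation indices for k folds in plain Python. It is an illustration of the idea rather than scikit-learn's implementation: the function `k_fold_indices` is hypothetical, and it uses consecutive, unshuffled folds, whereas the notebook shuffles the rows before splitting.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, validation_indices) for consecutive folds."""
    indices = list(range(n_samples))
    # the first n_samples % n_splits folds get one extra sample
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        validation = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, validation
        start = stop

for train, validation in k_fold_indices(n_samples=7, n_splits=3):
    print(train, validation)
# [3, 4, 5, 6] [0, 1, 2]
# [0, 1, 2, 5, 6] [3, 4]
# [0, 1, 2, 3, 4] [5, 6]
```

Every sample appears in exactly one validation fold, so each of the three models is scored on rows it never saw during training.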

## 3.3 Model Output

The Kaggle competition required predictions in the form of a cumulative probability distribution. Each distribution has 199 values, indexed from -99 yards to 99 yards, and each value is the probability that the play gains at most that many yards. For example, if we were to predict that the play would gain exactly 3 yards, then every index corresponding to 3 yards or more would be 1 and every index below would be 0. Since our model performs multiclass classification, it outputs a probability for each class; we convert this to the required distribution by taking the cumulative sum and clipping it to the range [0, 1].
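The conversion from class probabilities to a cumulative distribution can be sketched in plain Python. The helper names `to_cdf` and `exact_outcome_cdf` and the toy probabilities are made up for this example; the notebook performs the same steps on NumPy arrays.

```python
from itertools import accumulate

N_CLASSES = 199   # yards from -99 to 99
OFFSET = 99       # index = yards + OFFSET

def to_cdf(class_probs):
    """Turn per-class probabilities into a cumulative distribution in [0, 1]."""
    return [min(max(p, 0.0), 1.0) for p in accumulate(class_probs)]

def exact_outcome_cdf(yards):
    """Step function for a known outcome: 0 below `yards`, 1 from `yards` up."""
    return [1.0 if i >= yards + OFFSET else 0.0 for i in range(N_CLASSES)]

# toy prediction: 25% on 2 yards, 50% on 3 yards, 25% on 4 yards
probs = [0.0] * N_CLASSES
probs[2 + OFFSET], probs[3 + OFFSET], probs[4 + OFFSET] = 0.25, 0.5, 0.25

cdf = to_cdf(probs)
print(cdf[1 + OFFSET], cdf[2 + OFFSET], cdf[3 + OFFSET], cdf[4 + OFFSET])
# 0.0 0.25 0.75 1.0
```

The clipping step matters only when floating-point rounding pushes the running sum slightly above 1; for well-formed probabilities the cumulative sum already ends at 1.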

## 3.4 Model Parameters

The input parameters for the model were chosen through cross-validation. We tested different values on various folds of the data until we found values that performed well. We also kept `num_leaves` low to help prevent overfitting to the training data.

In [None]:
def train_my_model(train_df):    
    train_df = pipeline(train_df)

    columns = list(train_df.columns)
    columns.remove("Yards")
    
    X_train = train_df[columns]
    Y_train = train_df["Yards"]
    
    params = { 'objective' : 'multiclass', 'num_classes': 199,
              'num_leaves': 19, 'feature_fraction': 0.4,
              'subsample': 0.4, 'min_child_samples': 10, 'num_threads': 5,
              'learning_rate': 0.01, 'num_iterations': 100, 'random_state': 42}
   
    bestScore = 1000
    bestModel = None
    
    kfold = KFold(n_splits=3, shuffle=True, random_state=42)

    for train_index, test_index in kfold.split(X_train, Y_train):
        # train folds
        X_train_fold = X_train.iloc[train_index]
        Y_train_fold = Y_train.iloc[train_index]

        # test folds
        X_test_fold = X_train.iloc[test_index]
        Y_test_fold = Y_train.iloc[test_index]
        
        Y_train_fold = Y_train_fold + 99
 
        dataSet = lgb.Dataset(X_train_fold,label=Y_train_fold)
        model = lgb.train(params,dataSet)
    
        Y_pred = model.predict(X_test_fold)
       
        Y_test_fold = np.vstack(pd.Series(Y_test_fold).apply(lambda x: convertToProb(x)))
        score = crps(Y_test_fold,Y_pred)
        if score < bestScore:
            bestModel = copy.deepcopy(model)
            bestScore = score
        print(score)
    return bestModel

## 3.5 Feature Importance
Since we are using a tree-based model, we are able to inspect which features are most important. A significant number of our engineered features are among the most important when features are ranked by information gain.

In [None]:
model = train_my_model(train_df)
lgb.plot_importance(model,importance_type="gain",max_num_features=30)

## 3.6 Prediction

The prediction made is a cumulative probability distribution of yards gained or lost (i.e. for each yardage, the probability that the play gains at most that many yards). In order to make a prediction on new test data, each play must go through the same cleaning and feature engineering pipeline described in sections 2.2 and 2.3. A prediction is then made using the model described in section 3.2. This step is repeated for each play in the test data.

In [None]:
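def predict(test_df,model):
    test_df = pipeline(test_df)
    # the cumulative sum turns class probabilities into a CDF; clip keeps it within [0, 1]
    cdf = np.clip(np.cumsum(model.predict(test_df), axis=1), 0, 1)
    # sample_prediction_df is the global set by the env.iter_test() loop below
    return pd.DataFrame(cdf, columns=sample_prediction_df.columns)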

for (test_df, sample_prediction_df) in env.iter_test():
    predictions_df = predict(test_df, model)
    env.predict(predictions_df)

env.write_submission_file()

# 4 Evaluation

As part of the NFL Big Data Bowl competition, an evaluation of the model is given once a prediction is submitted. The evaluation is posted on a leaderboard where scores are ranked and compared. Submissions are ranked using the [Continuous Ranked Probability Score (CRPS)](https://www.kaggle.com/c/nfl-big-data-bowl-2020/overview/evaluation), where lower is better and a score of zero is the goal. Currently the public leaderboard scores range from `0.01201` to `0.49962`. Our test score of `0.01415` places us at approximately 1425 out of 2038 total competitors. This was a bit surprising because, even after engineering many new features that significantly improved our validation score, the test score did not change at all. This led us to believe there may be some kind of internal Kaggle caching issue. Our validation score of `0.01382`, on the other hand, was noticeably better than our test score; this difference could potentially be due to overfitting.
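For reference, the CRPS used for this competition, as we understand it from the evaluation page linked above, can be written as follows, where $N$ is the number of plays, $Y_m$ is the actual yardage gained on play $m$, $P(y \le n)$ is the predicted cumulative probability for that play, and $H$ is the Heaviside step function:

$$
C = \frac{1}{199N} \sum_{m=1}^{N} \sum_{n=-99}^{99} \left( P(y \le n) - H(n - Y_m) \right)^2,
\qquad
H(x) = \begin{cases} 1 & x \ge 0 \\ 0 & x < 0 \end{cases}
$$

This is the mean squared difference between the predicted distribution and a step function at the true yardage, averaged over all 199 yardage values and all plays, which is what the `crps` function in section 2.2 computes on the validation folds.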

# References
[1] https://www.kaggle.com/mrkmakr/lgbm-multiple-classifier/notebook

[2] https://www.kaggle.com/pednt9/vip-hint-coded

[3] https://www.espn.com/nfl/story/_/id/20114211/the-nfl-stats-matter-most-2017-offseason-bill-barnwell