# Introduction

This notebook is our first foray into time-series forecasting with LSTMs. It follows [this tutorial](https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/).

The code will be broken into the following sections:

```{raw}
I. Data and Imports
II. Data Processing
    a. Cleaning data
    b. Separating data into drives (drive_id)
    c. Next-play feature
III. Model Creation
IV. Model Training
V. Next Steps
```

# I. Data and Imports

In [None]:
import numpy as np
import tensorflow as tf
import pandas as pd

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential

from sklearn.model_selection import train_test_split

In [None]:
BATCH_SIZE = 100
EPOCHS = 10

In [None]:
data = pd.read_csv("../data/NFL_Play_by_Play_2009-2018_(v5).csv")

In [None]:
# Exploratory feature extraction code
[col for col in data.columns.to_list() if "pos" in col]

# II. Data Processing

## II.a Data Cleaning

In [None]:
# Selecting only valid plays
data = data[data['play_type'].notna()]

In [None]:
# Dropping columns with too many missing values
data = data.dropna(axis = 1, thresh=10000)

In [None]:
# Selecting only useful columns
useful_columns = ['game_id', 'yardline_100', 'quarter_seconds_remaining', 'half_seconds_remaining',
                  'game_seconds_remaining', 'quarter_end', 'drive', 'sp', 'qtr', 'down', 'goal_to_go',
                  'ydstogo', 'ydsnet', 'yards_gained', 'shotgun', 'no_huddle', 'home_timeouts_remaining',
                  'defteam_timeouts_remaining', 'defteam_score', 'away_timeouts_remaining',
                  'timeout', 'total_home_score',
                  'posteam_timeouts_remaining', 'posteam_score', 'total_away_score',
                  'score_differential', 'defteam_score_post', 'score_differential_post', 'touchdown', 'play_type']
# data = data[useful_columns]

In [None]:
# For LaTeX formatting
for col in useful_columns:
    print(r"\item " + r"\texttt{" + col.replace("_", r"\_") + "}")
    # print(r"\item " + f"\texttt{col.replace("_", r"\_")}")

In [None]:
def classify_play_type(x):
    if x == "kickoff" or x == "punt" or x == "field_goal" or x == "extra_point":
        return 0    # Special Teams
    elif x == "pass" or x == "qb_spike":
        return 1    # pass
    elif x == "run" or x == "qb_kneel":
        return 2    # run
    else:
        return 3    # no play
    
# Classifying play type
data["play_type"] = data["play_type"].apply(classify_play_type)
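
The same mapping can also be written as a lookup table, which keeps the category codes in one place. This is a minimal sketch on a toy series; `PLAY_TYPE_CODES` and `play_types` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical lookup table mirroring classify_play_type above
PLAY_TYPE_CODES = {
    "kickoff": 0, "punt": 0, "field_goal": 0, "extra_point": 0,  # special teams
    "pass": 1, "qb_spike": 1,                                    # pass
    "run": 2, "qb_kneel": 2,                                     # run
}

# Toy stand-in for data["play_type"]
play_types = pd.Series(["pass", "run", "punt", "no_play", "qb_kneel"])

# Play types missing from the table map to NaN, which is filled with 3 (no play)
codes = play_types.map(PLAY_TYPE_CODES).fillna(3).astype(int)
print(codes.tolist())
```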

In [None]:
data.head()

In [None]:
# Confirming data types are numeric
data.dtypes

In [None]:
# Checking missing values
data.isna().sum()

In [None]:
# Down missing is likely due to undowned plays, such as kickoff, extra point, etc.
data = data[~data["down"].isna()]

In [None]:
# Rechecking missing values
data.isna().sum()

In [None]:
# There aren't many, so dropping the remaining rows with missing values
data = data.dropna()
data.shape

## II.b Separating data by drive

In [None]:
# Creating a unique drive id
data["game_id_str"] = data["game_id"].astype("str")
data["drive_str"] = data["drive"].astype('str')

data["drive_id"] = data["game_id_str"].str.cat(data["drive_str"])

In [None]:
# Dropping game_id and the temporary string columns
data = data.drop(["game_id", "game_id_str", "drive_str"],axis=1)

In [None]:
# Reordering columns
col_order = ["drive_id"] + list(data.columns)[:-1]
data = data[col_order]

In [None]:
data.head()

In [None]:
# Checking drive_ids
ids = list(data["drive_id"].unique())
ids[:5]

In [None]:
# Number of drives
len(ids)

In [None]:
# Checking shape
data.shape

In [None]:
# WARNING: Takes 23 minutes

# Splitting the dataframe by drive ID and storing each drive as its own numpy array.
# Each drive array has shape (?, 35), where ? is the number of plays in the drive (1 to 34).
# drive_id is dropped because it is non-numeric.
# Thus, each element of broken_data is an array of play vectors of length 35.

broken_data = [data[data["drive_id"] == i].drop("drive_id", axis=1).to_numpy() for i in ids]
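
Filtering the full frame once per drive ID rescans every row for each drive, which is likely why this cell is slow. A possible alternative, sketched here on a toy frame with made-up values, is a single `groupby` pass. Passing `sort=False` keeps the drives in order of first appearance, which matches the order of `data["drive_id"].unique()`.

```python
import pandas as pd

# Toy stand-in for the cleaned play-by-play frame (values are made up)
toy = pd.DataFrame({
    "drive_id": ["g1_1", "g1_1", "g1_2", "g1_1", "g2_1"],
    "down": [1, 2, 1, 3, 1],
    "ydstogo": [10, 4, 10, 1, 10],
})

# One pass over the frame: each group holds the plays of one drive
toy_drives = [
    plays.drop(columns="drive_id").to_numpy()
    for _, plays in toy.groupby("drive_id", sort=False)
]

print(len(toy_drives))      # number of drives
print(toy_drives[0].shape)  # (plays in the first drive, features)
```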


#### II.b.i Buffering data for consistency

The longest drive has 34 plays, so each drive frame is front-padded with zeros to shape (34, 35).

In [None]:
# Copying the list so that the padding below does not overwrite entries of broken_data
drive_data = list(broken_data)

# Initializing some values
MAX_DRIVES = 0                      # To store the longest drive (# plays); in this data, MAX_DRIVES = 34
FEATURES = drive_data[0].shape[1]   # To store the num of features: 35

# Finding the longest drive
for drive in drive_data:
    if drive.shape[0] > MAX_DRIVES:
        MAX_DRIVES = drive.shape[0]

# Extending each drive frame by buffer of 0s
for i, drive in enumerate(drive_data):
    rows = drive.shape[0]

    # Pad with rows of 0s
    if rows != MAX_DRIVES:
        buffer = np.zeros((MAX_DRIVES-rows, FEATURES))  # Create an array of 0s to fit onto the data to ensure it is of shape (MAX_DRIVES=34, 35)
        drive_data[i] = np.concatenate((buffer, drive)) # Concatenating the 0-padding and the drive data into one numpy array and storing it

# Stacking the padded drives into a single np.array
drive_data = np.array(drive_data)   
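
The front-padding step can also be expressed as a small helper. This is a sketch on toy drives; `pad_front` is a hypothetical name. The zero rows go before the real plays, as in the cell above, so the last row is always the final play of the drive.

```python
import numpy as np

def pad_front(drive, max_plays):
    """Prepend rows of zeros so the drive has exactly max_plays rows."""
    missing = max_plays - drive.shape[0]
    # Pad widths are ((rows before, rows after), (cols before, cols after))
    return np.pad(drive, ((missing, 0), (0, 0)), mode="constant")

# Toy drives with 1 and 3 plays, 2 features each (values are made up)
toy_drives = [np.array([[1.0, 10.0]]),
              np.array([[1.0, 10.0], [2.0, 4.0], [3.0, 1.0]])]

max_plays = max(d.shape[0] for d in toy_drives)
padded = np.stack([pad_front(d, max_plays) for d in toy_drives])
print(padded.shape)  # (number of drives, max_plays, features)
```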


In [None]:
# Checking that the shape is (# Drives, # Plays in each Drive, # Features) = (58729, MAX_DRIVES, FEATURES)
drive_data.shape

In [None]:
# Examining data. Note the 0 padding and real data at the end.
drive_data[0]

In [None]:
# Creating x and y data
# X data is all plays except the last play
# Y data is the last play of the drive
# TODO: Explore what this would look like if we ignored the last play of the drive (i.e. punt, FG, TD).
    # x.append(drive[:-2])
    # y.append(drive[-2])

x = []
y = []

for drive in drive_data:
    x.append(drive[:-1])
    y.append(drive[-1])
    
# Saving x and y lists as np.arrays
x = np.array(x)
y = np.array(y)
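
Since `drive_data` is a single 3-D array, the same split can be done with slicing instead of a loop. The sketch below uses a toy array. The second pair shows one reading of the TODO above, where the drive-ending play is dropped and the play before it becomes the target. Names prefixed with `toy_` are illustrative.

```python
import numpy as np

# Toy array: 2 drives, 4 (padded) plays each, 3 features
toy_data = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)

# Same split as the loop above: all plays but the last, then the last play
toy_x = toy_data[:, :-1, :]
toy_y = toy_data[:, -1, :]

# TODO variant: ignore the final play and predict the second-to-last one
toy_x_alt = toy_data[:, :-2, :]
toy_y_alt = toy_data[:, -2, :]

print(toy_x.shape, toy_y.shape)
print(toy_x_alt.shape, toy_y_alt.shape)
```

With front padding, a single-play drive would have an all-zero target in the TODO variant, so such drives may need to be filtered out first.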

## II.c Next Play Feature

We will not use this yet, but the code is here. Note the data warnings in the next cell.

In [None]:
# Deprecated, but could be useful later
# Requires that broken_data is a list of pd.DataFrames, not np.arrays

play_pairs = []

for drive in broken_data:
    for i in range(len(drive)-1):
        cur_play = drive.iloc[i,:].to_numpy()
        next_play = drive.iloc[i+1, :].to_numpy()
        play_pairs.append(np.array([cur_play, next_play]))

np.array(play_pairs)

In [None]:
play_pairs = np.array(play_pairs)

In [None]:
play_pairs.shape
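
Because `broken_data` currently holds NumPy arrays rather than DataFrames, the pairing could instead be done with array slicing. This is a sketch on toy drives and assumes the drives have not yet been zero-padded; `toy_drives` and `pairs` are illustrative names.

```python
import numpy as np

# Toy unpadded drives: 3 plays and 2 plays, 2 features each
toy_drives = [np.array([[1.0, 10.0], [2.0, 4.0], [3.0, 1.0]]),
              np.array([[1.0, 10.0], [2.0, 7.0]])]

# For each drive, pair play i with play i + 1; single-play drives yield no pairs
pairs = np.concatenate(
    [np.stack([d[:-1], d[1:]], axis=1) for d in toy_drives if len(d) > 1]
)
print(pairs.shape)  # (total pairs, 2, features)
```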

# III. Model Creation

Here, we create a fairly standard two-layer LSTM model. It takes the first 33 (padded) plays of a drive as input and outputs a vector of 35 features, matching the final play of the drive.

We would like to further explore our optimizer and loss functions, as well as various model architectures.

In [None]:
# NUM_DRIVES = 58279
NUM_PLAYS = 33
NUM_FEATURES = 35
hidden_size = 128

# Creating basic 2 layer LSTM
model = Sequential([
    layers.Input((NUM_PLAYS, NUM_FEATURES)), 
    layers.LSTM(hidden_size, recurrent_activation="tanh", kernel_regularizer="l2", return_sequences=True),
    layers.LSTM(hidden_size, recurrent_activation="tanh", kernel_regularizer="l2"),
    layers.Dense(NUM_FEATURES)
])

# TODO: Explore model params. Add momentum to optimizer? KL Divergence for loss?
model.compile(optimizer='adam',
                loss="mean_squared_error",
                metrics=['accuracy', "f1_score"])

model.summary()

# IV. Model Training

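The model trains without errors and reports an accuracy of 54%. Because it is trained with mean squared error on a 35-feature regression target, and Keras's `accuracy` metric is designed for classification, this figure is not a direct measure of forecast quality and should be read with caution.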

We would like to add validation data to the model to ensure that it is not overfitting.

In [None]:
# Fit the model, using the batch size defined at the top of the notebook
history = model.fit(x=x, y=y, batch_size=BATCH_SIZE, epochs=EPOCHS)
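
One way to add validation data is to hold out a random subset of drives. Below is a minimal sketch using only NumPy on toy arrays. `val_fraction` and the `toy_` names are assumptions for illustration; `train_test_split` from scikit-learn (already imported above) or the `validation_split` argument of `model.fit` would be alternatives.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for x and y: 10 drives, 33 plays, 35 features
toy_x = rng.normal(size=(10, 33, 35))
toy_y = rng.normal(size=(10, 35))

val_fraction = 0.2                    # assumed hold-out share
order = rng.permutation(len(toy_x))   # shuffled drive indices
n_val = int(len(toy_x) * val_fraction)

val_idx, train_idx = order[:n_val], order[n_val:]
x_train, y_train = toy_x[train_idx], toy_y[train_idx]
x_val, y_val = toy_x[val_idx], toy_y[val_idx]

# The held-out drives could then be passed to Keras, for example:
# model.fit(x_train, y_train, epochs=EPOCHS, validation_data=(x_val, y_val))
print(x_train.shape, x_val.shape)
```

Splitting by drive rather than by play keeps every play of a drive on the same side of the split.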

# V. Next Steps

1. Experiment with various model architectures and frameworks
   1. LSTM
   2. GRU
   3. Transformer
   4. Encoder-Decoder
2. Hyperparameter optimization
   1. Loss function
   2. Optimizer
   3. Regularization
   4. Weight normalization
   5. Model architectures
3. Dataset preparation
   1. Normalization
   2. Revisit feature selection
   3. Look into time-series methods (`tf.keras.preprocessing.timeseries_dataset_from_array`)
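
As a starting point for the normalization item, the features could be standardized using statistics from real plays only, so that the zero padding does not skew the mean. This is a sketch on a toy array. It assumes that an all-zero row always marks padding, so a real play with every feature equal to zero would be misclassified.

```python
import numpy as np

# Toy padded data: 2 drives, 3 plays, 2 features; zero rows are padding
toy = np.array([[[0.0, 0.0], [1.0, 10.0], [2.0, 4.0]],
                [[0.0, 0.0], [0.0, 0.0], [3.0, 10.0]]])

# Mask of real plays: rows with at least one non-zero feature
real = np.any(toy != 0, axis=-1)

# Per-feature statistics over real plays only
mean = toy[real].mean(axis=0)
std = toy[real].std(axis=0)
std[std == 0] = 1.0  # avoid division by zero for constant features

# Standardize, then reset the padding rows back to zero
scaled = (toy - mean) / std
scaled[~real] = 0.0
print(scaled.shape)
```

In practice the mean and standard deviation would be computed on the training drives only and then reused for any validation or test drives.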