# Coverage2Vec: Towards a Defensive Performance Embedding Scheme

In this notebook, we describe our Coverage2Vec model, which converts a defense's coverage into a real-valued vector, much as other embedding models (e.g. Doc2Vec, Word2Vec) do. The embedding allows for efficient prediction of relevant metrics (e.g. completion likelihood or expected yards gained), investigation of patterns across defenses, and visualization of defensive schemes for analyst use. At a high level, the steps to implement our model are as follows:

1. **Data Processing and Investigation**: We combine elements from the 'players', 'plays', and 'games' data frames into a week data frame. We also format this new data frame to create training data in which every row is a snapshot of a play and has a consistent structure (i.e. number of players). The output of this step is a dataset that consists of a normalized vector for every play snapshot.
2. **Creation of the Coverage2Vec model and model training**: In the second section, we describe our coverage2vec model and the training procedure used to fit that model to the data created in the previous section. The output of this section is the fully trained coverage2vec model.
3. **Investigation of coverages**: In this section, we use the trained coverage2vec model along with all of the play snapshots to visually investigate various aspects of defensive coverages.


This notebook is organized into sections:

1. [Data Processing and Investigation](#Data-Processing-and-Investigation)
2. [Coverage2Vec model](#Coverage2Vec-model)
3. [Investigation of Defensive Coverages using Embedded Plays](#Investigation-of-Defensive-Coverages-using-Embedded-Plays)
4. [Appendix A: Code for processing the data](#Appendix-A:-Code-for-processing-the-data)
5. [Appendix B: Code Implementation for Coverage2Vec Model](#Appendix-B:-Code-Implementation-for-Coverage2Vec-Model)

# Data Processing and Investigation

In this section, we read in the data, investigate some of its properties, and process it for use in the coverage2vec model:
- First, we import the necessary Python packages and read in the data.
- Second, we investigate aspects of the data, most notably the player positions present, the number of unique plays, and the number of unique play snapshots.

**Note**: The full code for processing the data is in Appendix A.

In [None]:
'''Import the necessary Python packages'''

from __future__ import division
from __future__ import print_function

import numpy as np, pandas as pd, seaborn as sns, os
from matplotlib import pyplot as plt
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MinMaxScaler
import tensorflow as tf

In [None]:
''' Create variables for all the different data repositories, and read in the data'''

save_dir = "/kaggle/working/"

game_data  = pd.read_csv("../input/nfl-big-data-bowl-2021/games.csv")
player_data  = pd.read_csv("../input/nfl-big-data-bowl-2021/players.csv")
play_data  = pd.read_csv("../input/nfl-big-data-bowl-2021/plays.csv")
play_processed_data = np.load("../input/nfl-big-data-bowl-c2v-processed-data/play_processed_data.npy")
play_data_df = pd.read_csv("../input/nfl-big-data-bowl-c2v-processed-data/processed_play_data_labels.csv", index_col=0)
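# Keep the first 64 snapshots of each play and drop the last 180 feature columns
play_processed_data = play_processed_data[:,:64,:-180]
# Reorder the axes from (plays, snapshots, features) to (plays, features, snapshots)
play_processed_data = np.transpose(play_processed_data, (0 ,2, 1))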

### Initial Investigation of the Data
We will first look at the nature of the players and the games. Namely, how many games are available for training, and what types of players do we have telemetry data for?

In [None]:
player_pos = np.unique(player_data['position'])
print("The number of unique player positions is: {}, and they are: {}".format(len(player_pos), player_pos))

In [None]:
print("The number of unique plays is: {}".format(len(np.unique(play_data['gameId'].astype(str)+" "+play_data['playId'].astype(str)))))

In [None]:
print("The number of unique games is: {}".format(len(np.unique(game_data['gameId']))))

In [None]:
print("size of processed data: {}".format(play_processed_data.shape))

So, we get close to 20,000 different plays and around 22 different types of positions in the data. Generally, there are about 64 snapshots per play, with the longest play being 269 snapshots (It was a 4th down in the 4th quarter, with a few seconds left on the clock, and the offense went for a hook and lateral... they didn't make the touchdown). 

To use this data, we first decided how to featurize the play snapshots so as to obtain a time-series signal for each play. For each play snapshot, we used the following fields:
* Each player's position (both offensive and defensive players).
* Each player's telemetry data, including `x`, `y`, `s`, `a`, `dis`, `o`, and `dir`.

The per-player vectors were then concatenated into a single vector of length 660 (1x660) for each play snapshot.
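
As a rough illustration of where the 660 comes from: the `create_attributes` function in Appendix A one-hot encodes each player's position over 22 possible positions and appends the seven telemetry fields plus an `offense` flag, giving 30 values per player, and 22 players x 30 values = 660. The sketch below reproduces that layout with NumPy. The helper name `featurize_snapshot` and the toy inputs are hypothetical, and this is not the code that was used to build the dataset.

```python
import numpy as np

# Position vocabulary as listed in Appendix A (22 entries).
POSITIONS = ['CB', 'DB', 'DE', 'DT', 'FB', 'FS', 'HB', 'ILB', 'K', 'LB', 'LS',
             'MLB', 'NT', 'OLB', 'P', 'QB', 'RB', 'S', 'SS', 'TE', 'WR', 'DL']
TELEMETRY = ['x', 'y', 's', 'a', 'dis', 'o', 'dir', 'offense']  # 8 values per player
N_PLAYERS = 22
PER_PLAYER = len(POSITIONS) + len(TELEMETRY)  # 22 + 8 = 30


def featurize_snapshot(players):
    """Hypothetical helper: turn a list of player dicts into one flat vector.

    Slots for players without telemetry stay all-zero.
    """
    out = np.zeros((N_PLAYERS, PER_PLAYER), dtype=np.float32)
    for slot, player in enumerate(players[:N_PLAYERS]):
        out[slot, POSITIONS.index(player['position'])] = 1.0  # one-hot position
        out[slot, len(POSITIONS):] = [player[k] for k in TELEMETRY]
    return out.flatten()


# Toy snapshot with two tracked players (made-up values in the 0-1 range).
toy = [
    {'position': 'QB', 'x': 0.40, 'y': 0.50, 's': 0.1, 'a': 0.1,
     'dis': 0.0, 'o': 0.25, 'dir': 0.25, 'offense': 1.0},
    {'position': 'CB', 'x': 0.45, 'y': 0.20, 's': 0.2, 'a': 0.1,
     'dis': 0.0, 'o': 0.75, 'dir': 0.70, 'offense': 0.0},
]
vec = featurize_snapshot(toy)
print(vec.shape)  # (660,)
```

Leaving the slots of untracked players at zero is what makes the resulting vectors sparse for most plays.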




# Coverage2Vec model

In this section, we delve into the design and training of the coverage2vec model. (**Note**: Code for the model and its training is available in Appendix B.) Since we want to look at play data, and that play data is composed of play snapshots, which have a time dependency, we opted for an embedding model that can handle time sequences. Furthermore, since the data is often sparse, due to categorical features like `position` and the lack of telemetry data for all players on the field for most plays, we opted for a 1D convolutional autoencoder. More information on these models and the 1D convolution can be found at: [How to Use Convolutional Neural Networks for Time Series Classification](https://towardsdatascience.com/how-to-use-convolutional-neural-networks-for-time-series-classification-56b1b0a07a57) and [1-d Convolutional Neural Networks for Time Series: Basic Intuition](https://boostedml.com/2020/04/1-d-convolutional-neural-networks-for-time-series-basic-intuition.html).

To create the feature space, we settled on having 100 snapshots per play. For reference, the average number of snapshots per play is 64. If a play had fewer than 100 snapshots, it was padded with zeros, and if it had more than 100 snapshots, it was downsampled to 100. The following code displays a graphical depiction of a single processed play:
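
The padding and downsampling rule can be sketched in a few lines of NumPy. This is a simplified variant, not the notebook's `process_play_data` from Appendix A: it keeps evenly spaced snapshots when a play is too long (Appendix A instead deletes evenly spaced ones), and the function name `to_fixed_length` is hypothetical.

```python
import numpy as np


def to_fixed_length(play, sequence_length=100):
    """Pad with zero rows or downsample so the play has `sequence_length` snapshots.

    `play` has shape (n_snapshots, n_features).
    """
    n = play.shape[0]
    if n > sequence_length:
        # Keep evenly spaced snapshots, always including the first and the last.
        keep = np.round(np.linspace(0, n - 1, sequence_length)).astype(int)
        return play[keep]
    pad = np.zeros((sequence_length - n, play.shape[1]), dtype=play.dtype)
    return np.concatenate([play, pad], axis=0)  # zeros go after the real snapshots


short_play = np.ones((64, 660), dtype=np.float32)   # an average-length play
long_play = np.ones((269, 660), dtype=np.float32)   # the longest play in the data
print(to_fixed_length(short_play).shape, to_fixed_length(long_play).shape)
```

Both calls return arrays of shape `(100, 660)`, so plays of any length can be stacked into one 3-D tensor.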

In [None]:
plt.figure(figsize=(18,22))
plt.imshow(np.transpose(play_processed_data, (0,2,1))[8697], cmap='gray', vmin=0, vmax=1)

In this case, the y-axis is time and the x-axis is the players' features, where every 30 units on the x-axis correspond to a unique player.
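
To read a specific column of such an image, an x-axis index can be mapped back to a player slot and a feature name. The sketch below assumes the untrimmed 660-column layout and the per-player column order implied by `create_attributes` in Appendix A (22 one-hot position columns followed by 8 telemetry columns); `describe_column` is a hypothetical helper, not part of the notebook.

```python
# Column order inside one player's block (assumed from Appendix A).
POSITIONS = ['CB', 'DB', 'DE', 'DT', 'FB', 'FS', 'HB', 'ILB', 'K', 'LB', 'LS',
             'MLB', 'NT', 'OLB', 'P', 'QB', 'RB', 'S', 'SS', 'TE', 'WR', 'DL']
TELEMETRY = ['x', 'y', 's', 'a', 'dis', 'o', 'dir', 'offense']
BLOCK = POSITIONS + TELEMETRY
PER_PLAYER = len(BLOCK)  # 30 columns per player


def describe_column(col):
    """Hypothetical helper: map an x-axis index to (player slot, feature name)."""
    slot, offset = divmod(col, PER_PLAYER)
    return slot, BLOCK[offset]


print(describe_column(0))    # (0, 'CB')
print(describe_column(52))   # (1, 'x')
print(describe_column(659))  # (21, 'offense')
```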

### Coverage2Vec Model Architecture

For our autoencoder, we used a combination of 1-D convolutional layers and fully connected layers. We also found that convolving over the player (feature) dimension rather than the snapshot dimension (i.e. feeding inputs of shape `(None, 600, 100)` instead of `(None, 100, 600)`) produced better embeddings. The following code displays the summary of the coverage2vec model.
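
To make the idea of convolving over the player dimension concrete: the first encoder layer in Appendix B uses `kernel_size=30` and `strides=30`, so when the features lie along the convolved axis, each window covers exactly one player's 30-value block. The NumPy sketch below shows what such a non-overlapping strided 1-D convolution computes. It assumes 660 features and 100 snapshots for illustration, uses random weights, omits the bias term, and is not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_snapshots = 660, 100   # one play laid out as (features, snapshots)
kernel, filters = 30, 100            # a window of 30 features = one player slot

play = rng.random((n_features, n_snapshots), dtype=np.float32)
weights = rng.standard_normal((kernel, n_snapshots, filters)).astype(np.float32)

# With stride == kernel size the windows do not overlap, so the convolution
# is a reshape into per-player blocks followed by a tensor contraction.
blocks = play.reshape(n_features // kernel, kernel, n_snapshots)       # (22, 30, 100)
conv_out = np.maximum(np.einsum('pkt,ktf->pf', blocks, weights), 0.0)  # ReLU

print(conv_out.shape)  # (22, 100)
```

Each of the 22 output rows summarizes one player slot across all snapshots, which is one plausible reason this orientation suits coverage data.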

In [None]:
from tensorflow.keras.models import load_model

c2v = load_model("../input/c2v-trained-model/c2v_model_conv1D_trans_v2")

In [None]:
c2v.encoder.summary()

In [None]:
c2v.decoder.summary()

### Training the Model

For training the model, we used mean squared error as the loss and randomly sampled 5 games to act as a validation set. We then trained on the remaining games for 30 iterations. We repeated this process of holding out a validation set and training 50 times.
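
A game-level hold-out of this kind can be sketched as follows. The arrays are toy stand-ins rather than the notebook's data. The one point worth noting is that the five games are sampled from the unique game IDs, so that no game contributes plays to both the training and the validation set.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-ins: 200 plays spread over 20 games, each play a (4, 3) array.
game_ids = np.repeat(np.arange(20), 10)
plays = rng.random((200, 4, 3))

# Sample 5 distinct games (not 5 plays), so no game straddles the split.
val_games = rng.choice(np.unique(game_ids), size=5, replace=False)
val_mask = np.isin(game_ids, val_games)

X_val, X_train = plays[val_mask], plays[~val_mask]
print(X_train.shape, X_val.shape)  # (150, 4, 3) (50, 4, 3)
```

Because an autoencoder reconstructs its own input, the same array serves as both input and target when fitting, as in `c2v.fit(X_train, X_train, ...)` in Appendix B.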

# Investigation of Defensive Coverages using Embedded Plays

In this section, we display some of the output that an autoencoder like coverage2vec can produce. In particular, we will visualize the embedded plays from a single game and from several teams to see the differences between the teams.

In [None]:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import seaborn as sns

import plotly.offline as py
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
from plotly import tools
init_notebook_mode(connected = True)
import plotly.figure_factory as ff

In [None]:
game_id = 2018111110

plot_df = play_data_df[play_data_df['gameId']==game_id].copy()  # copy so the column assignments below do not write to a slice
latent_positions = c2v.encoder(play_processed_data[np.where(play_data_df['gameId']==game_id)])
plot_df['x'] = latent_positions[:,0].numpy()
plot_df['y'] = latent_positions[:,1].numpy()
plot_df['z'] = latent_positions[:,2].numpy()
plot_df['colors'] = plot_df['d_team'].map(lambda x: '#69BE28' if x =='SEA' else '#0080C6')

traces =[]
for team_name in plot_df['d_team'].unique():
    df = plot_df[plot_df['d_team']==team_name]
    trace = go.Scatter3d(surfacecolor='darkgrey',
        x= df['x'],
        y= df['y'],
        z= df['z'],
        name=team_name,
        mode='markers',text=df['playResult'],
         marker=dict(
            color = df['colors'], 
            colorscale='Jet',
            size= 5,
            line=dict(
                color= df['colors'],
                width= 10
            ),
            opacity=0.8
         )
    )
    traces.append(trace)

layout = go.Layout(
    title = 'Visualization of Defensive Coverages for Seattle v. LA',
    scene = dict(
            xaxis = dict(title  = 'latent x-dimension'),
            yaxis = dict(title  = 'latent y-dimension'),
            zaxis = dict(title  = 'latent z-dimension')
        )
)

fig = go.Figure(data = traces, layout = layout)
py.iplot(fig)

Generally, for this 11 November game, we can observe that Seattle has less variability in its coverages than LA does: Seattle's defensive coverages end up much closer together in the latent space.
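
The visual impression of "closer together" could be backed by a simple number, such as the average distance of a team's embedded plays from their centroid. The sketch below is one possible measure on toy data. It was not computed for this notebook, and `mean_distance_to_centroid` is a hypothetical helper.

```python
import numpy as np


def mean_distance_to_centroid(points):
    """Average Euclidean distance of each embedded play from the centroid."""
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)
    return float(np.linalg.norm(points - centroid, axis=1).mean())


# Toy 3-D embeddings: a tight cluster and a more dispersed one.
rng = np.random.default_rng(0)
tight = rng.normal(loc=0.0, scale=0.1, size=(50, 3))
loose = rng.normal(loc=0.0, scale=1.0, size=(50, 3))

print(mean_distance_to_centroid(tight) < mean_distance_to_centroid(loose))
```

With the notebook's variables, the same function could be applied to the `x`, `y`, and `z` columns of `plot_df` for each value of `d_team`.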

In [None]:
team_id = 'PIT'

plot_df = play_data_df[play_data_df['d_team']==team_id].copy()  # copy so the column assignments below do not write to a slice
latent_positions = c2v.encoder(play_processed_data[np.where(play_data_df['d_team']==team_id)])
plot_df['x'] = latent_positions[:,0].numpy()
plot_df['y'] = latent_positions[:,1].numpy()
plot_df['z'] = latent_positions[:,2].numpy()

trace1 = go.Scatter3d(surfacecolor='darkgrey',
    x= plot_df['x'],
    y= plot_df['y'],
    z= plot_df['z'],
    mode='markers',text=plot_df['playResult'],
     marker=dict(
        color = '#FFB612', 
        colorscale='Jet',
        size= 5,
        line=dict(
            color= '#FFB612',
            width= 10
        ),
        opacity=0.8
     )
)

layout = go.Layout(
    title = 'Visualization of Defensive Coverage Plays for Pittsburgh',
    scene = dict(
            xaxis = dict(title  = 'latent x-dimension'),
            yaxis = dict(title  = 'latent y-dimension'),
            zaxis = dict(title  = 'latent z-dimension')
        )
)

fig = go.Figure(data = trace1, layout = layout)
py.iplot(fig)

In [None]:
team_id = 'CHI'

plot_df = play_data_df[play_data_df['d_team']==team_id].copy()  # copy so the column assignments below do not write to a slice
latent_positions = c2v.encoder(play_processed_data[np.where(play_data_df['d_team']==team_id)])
plot_df['x'] = latent_positions[:,0].numpy()
plot_df['y'] = latent_positions[:,1].numpy()
plot_df['z'] = latent_positions[:,2].numpy()

trace2 = go.Scatter3d(surfacecolor='darkgrey',
    x= plot_df['x'],
    y= plot_df['y'],
    z= plot_df['z'],
    mode='markers',text=plot_df['playResult'],
     marker=dict(
        color = '#0B162A', 
        colorscale='Jet',
        size= 5,
        line=dict(
            color= '#0B162A',
            width= 10
        ),
        opacity=0.8
     )
)

layout = go.Layout(
    title = 'Visualization of Defensive Coverage Plays for Chicago',
    scene = dict(
            xaxis = dict(title  = 'latent x-dimension'),
            yaxis = dict(title  = 'latent y-dimension'),
            zaxis = dict(title  = 'latent z-dimension')
        )
)

fig = go.Figure(data = trace2, layout = layout)
py.iplot(fig)

In [None]:
team_id = 'OAK'

plot_df = play_data_df[play_data_df['d_team']==team_id].copy()  # copy so the column assignments below do not write to a slice
latent_positions = c2v.encoder(play_processed_data[np.where(play_data_df['d_team']==team_id)])
plot_df['x'] = latent_positions[:,0].numpy()
plot_df['y'] = latent_positions[:,1].numpy()
plot_df['z'] = latent_positions[:,2].numpy()

trace3 = go.Scatter3d(surfacecolor='darkgrey',
    x= plot_df['x'],
    y= plot_df['y'],
    z= plot_df['z'],
    mode='markers',text=plot_df['playResult'],
     marker=dict(
        color = '#000000', 
        colorscale='Jet',
        size= 5,
        line=dict(
            color= '#000000',
            width= 10
        ),
        opacity=0.8
     )
)

layout = go.Layout(
    title = 'Visualization of Defensive Coverage Plays for Oakland',
    scene = dict(
            xaxis = dict(title  = 'latent x-dimension'),
            yaxis = dict(title  = 'latent y-dimension'),
            zaxis = dict(title  = 'latent z-dimension')
        )
)

fig = go.Figure(data = trace3, layout = layout)
py.iplot(fig)

In [None]:
layout = go.Layout(
    title = 'Visual Comparison of the Three Teams Embedded Plays',
    scene = dict(
            xaxis = dict(title  = 'latent x-dimension'),
            yaxis = dict(title  = 'latent y-dimension'),
            zaxis = dict(title  = 'latent z-dimension')
        )
)

fig = go.Figure(data = [trace1, trace2, trace3], layout = layout)
py.iplot(fig)

As with the single game above, we can also see differences in coverages between teams, most notably between Pittsburgh and the other two.
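
One way to put a number on differences between teams is the distance between their centroids in the latent space. The sketch below uses made-up embeddings, not the notebook's latent positions, and `centroid_distances` is a hypothetical helper.

```python
import numpy as np


def centroid_distances(embeddings_by_team):
    """Return sorted team names and the matrix of distances between centroids."""
    teams = sorted(embeddings_by_team)
    centroids = np.stack([
        np.asarray(embeddings_by_team[t], dtype=float).mean(axis=0) for t in teams
    ])
    diffs = centroids[:, None, :] - centroids[None, :, :]
    return teams, np.linalg.norm(diffs, axis=-1)


# Toy embeddings only; these are not real latent positions.
toy = {
    'PIT': np.array([[3.0, 0.0, 0.0], [5.0, 0.0, 0.0]]),
    'CHI': np.array([[0.0, 0.0, 0.0], [0.0, 2.0, 0.0]]),
    'OAK': np.array([[0.0, 1.0, 0.0], [0.0, 1.0, 2.0]]),
}
teams, dist = centroid_distances(toy)
print(teams)
print(np.round(dist, 2))
```

A large off-diagonal entry indicates that two teams' plays occupy different regions of the embedding on average, although it says nothing about how much the clusters overlap.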

We hope you enjoyed our notebook and that it prompts ideas for even better ways to analyze coverages (like, maybe a defensive_player2vec??)!

# Appendix A: Code for processing the data

### Data Processing Functions

- We need a function, `preprocess_data`, that pre-processes the data so that the player telemetry fields are normalized to the range 0-1 for use by the neural network embedding.
- We also need a function, `create_attributes`, that extracts the necessary elements of the data for the coverage2vec model and processes each play snapshot into a standard vector.
- Finally, we need a function, `process_play_data`, that takes the play snapshot vectors and turns them into sequences (a 3-D tensor). We use 100 as the number of snapshots per play, padding plays that have fewer than 100 snapshots with zeros and downsampling plays that have more than 100 snapshots (only about 100 plays).

In [None]:
# def preprocess_data(telemetry_data, play_data, game_data):
#     telemetry_data['play_uid'] = telemetry_data['gameId'].astype(str) +"_" + telemetry_data['playId'].astype(str)
#     telemetry_data = telemetry_data[telemetry_data['nflId'].notna()] #cut out the football telemetry data
#     play_data['play_uid'] = play_data['gameId'].astype(str) +"_" + play_data['playId'].astype(str)
#     final_data = telemetry_data.merge(play_data.drop(columns=['gameId','playId']), how='left', on='play_uid')
#     final_data = final_data.merge(game_data[['gameId', 'homeTeamAbbr', 'visitorTeamAbbr']], how='left', on='gameId')
    
#     '''Add in some columns to determine which players are on offense or defense, and a unique identifier for each snapshot of each play'''
#     final_data['frame_uid'] = final_data['play_uid'].astype(str) +"_" + final_data['frameId'].astype(str)
#     check_for_bad_plays_series = final_data.groupby('frame_uid').count().iloc[:,0]
#     bad_play_frame_ids = check_for_bad_plays_series[check_for_bad_plays_series>22].index #Find those plays that have more than 22 men on the field so they can be removed
#     final_data = final_data[~ final_data['frame_uid'].isin(bad_play_frame_ids)]
#     final_data['offense'] = ((final_data['possessionTeam'] == final_data['homeTeamAbbr']) & (final_data['team'] == 'home')) | ((final_data['possessionTeam'] == final_data['visitorTeamAbbr']) & (final_data['team'] == 'away'))
#     final_data['d_team'] = final_data.apply(lambda x : x['homeTeamAbbr'] if x['possessionTeam'] != x['homeTeamAbbr'] else x['visitorTeamAbbr'], axis=1)

#     '''Normalize some of the data'''
#     final_data['x'] /= 120 #normalize by length of field
#     final_data['y'] /= 53.3 #normalize by width of field
#     final_data['o'] /= 360.0 #normalize by number of degrees in a full circle
#     final_data['dir'] /= 360.0 #normalize by number of degrees in a full circle
#     final_data[['a','s']] = MinMaxScaler().fit_transform(final_data[['a','s']]) # use a Min-Max scaler to scale across that weeks' data to 0-1
    
#     return final_data

In [None]:
# def create_attributes(play):
#     o_players = sum(play['offense'])
#     d_players = sum(~play['offense'])
    
#     if (o_players >11) | (d_players >11):
#         return np.nan
    
#     possible_positions = ['CB', 'DB', 'DE', 'DT', 'FB', 'FS', 'HB', 'ILB', 'K', 'LB', 'LS',
#         'MLB', 'NT', 'OLB', 'P', 'QB', 'RB', 'S', 'SS', 'TE', 'WR', 'DL']
    
#     possible_o_formations = ['I_FORM', 'SINGLEBACK', 'SHOTGUN', 'EMPTY', 'PISTOL', 'JUMBO',
#        'WILDCAT']
    
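#     if len(play) < 22:
#         # pad with placeholder players (DataFrame.append was removed in pandas 2.0, so use pd.concat)
#         for i in range(22-len(play)):
#             if o_players < 11:
#                 play = pd.concat([play, pd.DataFrame([{'position':str(i), 'offense':1.0}])], ignore_index=True)
#                 o_players +=1
#             elif d_players <11:
#                 play = pd.concat([play, pd.DataFrame([{'position':str(i), 'offense':0.0}])], ignore_index=True)
#                 d_players +=1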
            
#     positions = pd.get_dummies(play['position'])
#     attribs = positions.T.reindex(possible_positions).T.fillna(0)
#     attribs[['x','y','s','a','dis','o','dir','offense']] = play[['x','y','s','a','dis','o','dir','offense']]
#     attribs.fillna(0, inplace=True)
    
#     quarter = play['quarter'].values[0]/4
#     down = play['down'].values[0]/4
#     # use the `play` argument here rather than the global `play_snapshot`
#     time_left_on_clock = play['gameClock'].str.split(':').apply(lambda x: int(x[0]) * 60 + int(x[1])).values[0]/(15*60)
    
#     o_formation = pd.get_dummies(play['offenseFormation'].iloc[0])
#     o_formation = o_formation.T.reindex(possible_o_formations).T.fillna(0).values
    
#     attribs = attribs.values.flatten().astype(np.float32)
#     attribs = np.append(attribs, [quarter, down, time_left_on_clock])
#     attribs = np.array([quarter, down, time_left_on_clock])
#     attribs = np.append(attribs, o_formation)
    
#     return attribs

In [None]:
# def process_play_data(processed_data, uids, sequence_length):
#     from tensorflow.keras.preprocessing.sequence import pad_sequences  
#     play_uids = uids['play_uid'].unique()
#     X = []
#     for play_uid in play_uids:
#         play = processed_data[np.where(uids['play_uid'] == play_uid)]
#         play = play[np.argsort(uids[uids['play_uid'] == play_uid]['frameId']).values]
#         if len(play) > sequence_length:
#             idx_to_remove = np.round(np.linspace(0, play.shape[0] - 1, play.shape[0]-sequence_length)).astype(int)
#             play = np.delete(play, idx_to_remove, axis=0)
#         X.append(play)
            
#     X = pad_sequences(X, padding='post', dtype='float32', maxlen=sequence_length)

#     return play_uids, X

### Continued Investigation of the Data
Let's look at an example of the player telemetry data that we are given. In general, this telemetry data should have all the coverage players' physical characteristics for each snapshot of each play.

In [None]:
# telemetry_data = pd.read_csv("../input/nfl-big-data-bowl-2021/week"+str(5)+".csv")
# telemetry_data = preprocess_data(telemetry_data, play_data, game_data)

In [None]:
# print("The number of unique play snapshots in this week is: {}".format(len(telemetry_data['frame_uid'].unique())))

So, for any given week, we have around 80,000 unique play snapshots. Combined with the number of weeks (17), that is a lot of player telemetry data to work with!

Now, let's look at what a given play snapshot in the telemetry data looks like.

In [None]:
# play_snapshot = telemetry_data[telemetry_data['frame_uid']==np.unique(telemetry_data['frame_uid'])[100]]

In [None]:
# play_snapshot 

In [None]:
# attributes = create_attributes(play_snapshot)
# print(attributes.shape)

So, for this play from Indianapolis versus New England in Week 5, we have 14 players with telemetry data: 8 on defense and 6 on offense. When we featurize this data, we get a vector of length 660 (1x660). So, each play snapshot is transformed into a vector of 660 real-valued numbers between 0 and 1.

### Process data for use in the Coverage2Vec model

The following code featurizes all of the play snapshots, but is commented out as it takes a long time to run and will overload the memory of a standard Kaggle machine. It is included for completeness, and we will load in the processed data at the end.

In [None]:
'''Read in all of the telemetry data from all of the weeks'''

# telemetry_data = list()
# for week in range(1,18):
#     telemetry_data.append(pd.read_csv("/kaggle/week"+str(week)+".csv"))
# telemetry_data = pd.concat(telemetry_data)
# telemetry_data = preprocess_data(telemetry_data, play_data, game_data)

In [None]:
'''function to speed up featurization by doing it in parallel'''

# def applyParallel(dfGrouped, func):
#     from joblib import Parallel, delayed
#     import multiprocessing
#     retLst = Parallel(n_jobs=multiprocessing.cpu_count())(delayed(func)(group) for name, group in dfGrouped)
#     return retLst

In [None]:
# processed_data = applyParallel(telemetry_data.groupby('frame_uid'), create_attributes)
# processed_data = np.vstack(processed_data)
# processed_data = processed_data.astype(np.float32)
# data_df = telemetry_data[['frame_uid', 'play_uid', 'playId', 'gameId', 'event', 'frameId', 'd_team','playResult', 'epa', 'quarter', 'down', 'yardsToGo']].groupby('frame_uid').max()

In [None]:
# def enrich_data_df(df):
#     ball_snap_idx = np.where(df['event']=='ball_snap')
#     frame_of_ball_snap = np.max(df.iloc[ball_snap_idx]['frameId'].values)
#     df['snap_action'] = df['frameId'].apply(lambda x: 'pre-snap' if x < frame_of_ball_snap else('post-snap' if x > frame_of_ball_snap else 'snap'))
#     return df

# data_df = data_df.groupby('play_uid').apply(enrich_data_df)

In [None]:
'''look at statistics about the number of snapshots per play'''

# data_df[['play_uid', 'frameId']].groupby('play_uid').count().describe()

In [None]:
'''Process the data to get timeseries data for all of the plays'''

# play_uids, play_processed_data = process_play_data(processed_data, data_df[['play_uid','frameId']], 100)
# play_data_df = pd.DataFrame(play_uids, columns=['play_uid']).merge(data_df.groupby('play_uid').max(), how='left', on='play_uid')

In [None]:
'''Save out the processed data, since it takes a long time to process the ~1.2 million play snapshots'''

# np.save(os.path.join(save_dir,"processed_data.npy"), processed_data)
# data_df.to_csv(os.path.join(save_dir,"processed_data_labels.csv"))
# np.save(os.path.join(save_dir,"play_processed_data.npy"), play_processed_data)
# play_data_df.to_csv(os.path.join(save_dir,"processed_play_data_labels.csv"))

# Appendix B: Code Implementation for Coverage2Vec Model

In this section, we have the TensorFlow implementation of the coverage2vec model along with the code used to train it. We also briefly compared the outputs of the model to a simpler dimensionality reduction technique, PCA, to see whether coverage2vec does indeed produce better embeddings and lower reconstruction error for the plays.

In [None]:
# import tensorflow as tf
# from tensorflow.keras import layers, losses, regularizers
# from tensorflow.keras.models import Model, save_model

# class Coverage2Vec(Model):
#     def __init__(self, latent_dim, sequence_length, feature_size):
#         super(Coverage2Vec, self).__init__()

#         self.latent_dim = latent_dim
#         self.sequence_length = sequence_length
#         self.feature_size= feature_size

#         self.encoder = tf.keras.Sequential([
#             layers.InputLayer(input_shape = (sequence_length, feature_size), name="input"),
#             layers.Conv1D(filters=100, kernel_size=30, activation ='relu', strides=30, name='encoder_hidden_1'),
#             layers.Conv1D(filters=50, kernel_size=2, activation ='relu', strides=2, name='encoder_hidden_2'),
#             layers.MaxPooling1D(pool_size=2),
#             layers.Conv1D(filters=20, kernel_size=2, activation ='relu', strides=2, name='encoder_hidden_3'),
#             layers.AveragePooling1D(pool_size=2, name='encoder_hidden_4'),
#             layers.Flatten(),
#             layers.Dense(latent_dim, activation='relu', name='embedding')  # embedding size follows the constructor argument
#         ])
#         self.decoder = tf.keras.Sequential([
#             layers.Dense(10, activation='relu', name='decoder_hidden_1'),
#             layers.Dense(100, activation='relu', name='decoder_hidden_2'),
#             layers.Dense(330, activation='relu', name='decoder_hidden_3'),
#             layers.Reshape((330, 1)),
#             layers.BatchNormalization(),
#             layers.Conv1DTranspose(filters=50, kernel_size=5, activation='relu', strides=2, name='decoder_hidden_4', padding='same'),
#             layers.BatchNormalization(),
#             layers.Conv1DTranspose(filters=feature_size, kernel_size=2, activation='sigmoid', strides=1, name='output', padding='same')
#         ])

#     def call(self, x):
#         encoded = self.encoder(x)
#         decoded = self.decoder(encoded)
#         return decoded

In [None]:
# latent_dim = 3

# c2v = Coverage2Vec(latent_dim, play_processed_data.shape[1], play_processed_data.shape[2])
# c2v.compile(optimizer='adam', loss='mse')

In [None]:
# checkpoint_path = os.path.join(save_dir, "cp.ckpt")
# checkpoint_dir = os.path.dirname(checkpoint_path)
# NUM_FULL_ITERATIONS = 50
# cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
#                                                  save_weights_only=True,
#                                                  verbose=1,
#                                                  save_freq=200)

# game_ids = play_data_df['gameId']
# for i in range(NUM_FULL_ITERATIONS):
#     val_game_ids = np.random.choice(game_ids.unique(), size=5, replace=False)  # sample distinct games, not plays
#     X_val = play_processed_data[np.where(game_ids.isin(val_game_ids))]
#     train_game_ids = game_ids[~np.isin(game_ids, val_game_ids)]
    
#     X_train =  play_processed_data[np.where(game_ids.isin(train_game_ids))]
#     print("Iteration Number : {}, X_train size :{}, X_val size: {}".format(i, X_train.shape, X_val.shape))
#     c2v.fit(X_train, X_train,
#                     epochs=50,
#                     batch_size=400,
#                     shuffle=True,
#                     callbacks=[cp_callback],
#                     validation_data=(X_val, X_val)
#            )

In [None]:
# c2v.save(os.path.join(save_dir, "c2v_model"))

### Check on embedding quality
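
For a like-for-like comparison with the autoencoder, each play has to be treated as a single sample. The sketch below assumes every play is flattened into one vector before a 3-component PCA is fitted, which differs from the `np.vstack` call in the cells that follow (that call stacks the rows of all plays). It uses NumPy's SVD on toy data and reports the mean absolute reconstruction error per play.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the play tensor: 40 plays, each of shape (6, 5).
plays = rng.random((40, 6, 5))
flat = plays.reshape(len(plays), -1)          # one row per play

# PCA via SVD on mean-centred data, keeping 3 components.
mean = flat.mean(axis=0)
_, _, vt = np.linalg.svd(flat - mean, full_matrices=False)
components = vt[:3]                           # (3, 30)

codes = (flat - mean) @ components.T          # 3-D embedding of each play
recon = (codes @ components + mean).reshape(plays.shape)

# Mean absolute error per play, comparable across plays of the same shape.
mae_per_play = np.abs(plays - recon).mean(axis=(1, 2))
print(mae_per_play.shape)  # (40,)
```

Computing the same per-play quantity on the autoencoder's reconstructions would put both methods on one scale.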

In [None]:
# test = np.random.randint(low=0,high=play_processed_data.shape[0], size=10)
# temp = c2v.encoder(play_processed_data[test])
# recon = c2v.decoder(temp)

In [None]:
# temp

In [None]:
'''Print the reconstruction errors (summed absolute error) for the autoencoder for each of the plays'''
# for i in range(recon.shape[0]):
#     print(np.sum(np.abs(play_processed_data[test][i]-recon[i])))

In [None]:
# from sklearn.decomposition import PCA

# pca = PCA(n_components=3)
# pca.fit(np.vstack(play_processed_data))

In [None]:
# temp_pca = pca.transform(np.vstack(play_processed_data[test]))
# temp_pca

In [None]:
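'''Print the reconstruction errors (summed absolute error) for the PCA for each of the plays'''
# # inverse_transform returns the stacked rows; reshape back to one 2-D block per play
# recon_pca = pca.inverse_transform(temp_pca).reshape(play_processed_data[test].shape)

# for i in range(recon_pca.shape[0]):
#     print(np.sum(np.abs(play_processed_data[test][i]-recon_pca[i])))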