### Combine NFL play-by-play data with Madden player rating data
* The players on the field vary by formation and by season year
* Each team's player ratings must be mapped to every formation, separately for each season year
* The final combined data is then reformatted, including the creation of a new response variable for modeling

### Import modules

In [30]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

### Load the combined NFL play by play data

In [56]:
# load the pickled play-by-play data (opened in binary read mode)
with open("20220320_play_by_play_finals.pkl", "rb") as f:
    combined_final_data = pickle.load(f)
print("Finished!")

Finished!


###  Map Madden player rating data of each year to different formations
* Offense formations (see details [here](https://en.wikipedia.org/wiki/List_of_formations_in_American_football#T_formation)):
    - Under center: (T formation) 2TE, LT, RT, LG, RG, C, QB, 2HB, FB<br>
    - SHOTGUN: (I formation) 2WR, TE, LT, RT, LG, RG, C, QB, HB, FB<br>
    - WILDCAT: 2TE, WR, LT, RT, LG, RG, C, QB, HB, FB<br>
    - PUNT: MLB, 3CB, 2SS, 2ROLB, FS, LOLB, P<br>
* Defense formations (no formation information is available, so the basic 4-3 formation is applied to all plays; see [here](https://en.wikipedia.org/wiki/List_of_formations_in_American_football#T_formation)):
    - Defense (4-3 formation): 2CB, 2DT, RE, LE, MLB, ROLB, LOLB, FS, SS <br>
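
The position counts listed above can be sanity-checked with a small lookup table. This is a minimal sketch and not part of the original notebook: `FORMATION_POSITIONS`, `RATINGS_PER_PLAYER`, and `players_in` are hypothetical names, and the figure of 60 rating columns per player is taken from the "11 players*60" comments in the mapping code later in this notebook.

```python
# Hypothetical lookup table mirroring the formation lists above
# (position -> number of players taken from the roster).
FORMATION_POSITIONS = {
    "UNDER CENTER": {"TE": 2, "LT": 1, "RT": 1, "LG": 1, "RG": 1,
                     "C": 1, "QB": 1, "HB": 2, "FB": 1},
    "SHOTGUN": {"WR": 2, "TE": 1, "LT": 1, "RT": 1, "LG": 1, "RG": 1,
                "C": 1, "QB": 1, "HB": 1, "FB": 1},
    "WILDCAT": {"TE": 2, "WR": 1, "LT": 1, "RT": 1, "LG": 1, "RG": 1,
                "C": 1, "QB": 1, "HB": 1, "FB": 1},
    "PUNT": {"MLB": 1, "CB": 3, "SS": 2, "ROLB": 2, "FS": 1, "LOLB": 1, "P": 1},
    "DEFENSE 4-3": {"CB": 2, "DT": 2, "RE": 1, "LE": 1, "MLB": 1,
                    "ROLB": 1, "LOLB": 1, "FS": 1, "SS": 1},
}

# Assumption: each player contributes 60 rating columns.
RATINGS_PER_PLAYER = 60


def players_in(formation):
    """Total number of players a formation takes from the roster."""
    return sum(FORMATION_POSITIONS[formation].values())


# Every formation should field exactly 11 players.
for name in FORMATION_POSITIONS:
    assert players_in(name) == 11, name

# Length of one side's flattened rating vector.
print(players_in("PUNT") * RATINGS_PER_PLAYER)
```

Under these assumptions each side contributes 11 * 60 = 660 values, which is the length the mapping code asserts for both the offense and defense vectors.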

### Load Madden player rating data for each year

In [4]:
# load the pickled Madden rating tables, one DataFrame per season
with open("20220423_players_data_finals.pkl", "rb") as f:
    final_18_data, final_19_data, final_20_data, final_21_data = pickle.load(f)
    
print("Finished loading Madden Player data!")

# all four seasons must share the same columns, in the same order
assert final_18_data.columns.equals(final_19_data.columns)
assert final_19_data.columns.equals(final_20_data.columns)
assert final_20_data.columns.equals(final_21_data.columns)


Finished loading Madden Player data!


### Define function to combine players ratings to play by play data

In [181]:
def select_players_from_same_position(rating_data, team, pos_name, num_pos):
    '''
    Function to select players from each team's roster
    based on the position of the player and how many such players are needed
    '''
    
    
    team_rating = rating_data.loc[rating_data.Team == team,:]
    
    team_rating_pos = team_rating.loc[team_rating.Position==pos_name, :]
    #if this position does not exist (usually FB), replace with HB 
    if team_rating_pos.shape[0] == 0:
        pos_name = "HB"
        team_rating_pos = team_rating.loc[team_rating.Position==pos_name, :]
        
    
    #sort by overall rating, highest first
    team_rating_pos = team_rating_pos.sort_values(by='Overall', ascending=False)
    
    #drop the non-rating columns
    team_rating_pos = team_rating_pos.drop(columns=["College", "Full Name", "Team", "Position"])
    
    #repeat the highest-rated player if there are not enough players at this position
    if team_rating_pos.shape[0] < num_pos:
        index_array = [*[0]*(num_pos - team_rating_pos.shape[0]), *list(range(team_rating_pos.shape[0]))]
        return team_rating_pos.iloc[index_array, :].to_numpy().flatten()
    
    else:
        return team_rating_pos.iloc[:num_pos, :].to_numpy().flatten()
    

    
    




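#leftover debug check: defense_vec is not defined at this point in the cell, so this
#line only runs if a defense_vec variable already exists in the notebook session
print(len(defense_vec))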
    

def append_players_rating(plays_df, rating_df):
      
    '''
    Append player rating data to each team's play by play data 
    based on different formations and season years
    '''
    
    #plays_df is expected to hold plays of a single formation
    #for one offense team / defense team pair
    #print(f"plays_df shape is {plays_df.shape}")
   
    
    offense_formation = plays_df.Formation.tolist()[0]
    offense_team = plays_df.OffenseTeam.tolist()[0]
    defense_team = plays_df.DefenseTeam.tolist()[0]

    #defense formation: 2CB, 2DT, RE, LE, MLB, ROLB, LOLB, FS, SS  

    defense_vec = [*select_players_from_same_position(rating_df, defense_team, "CB", 2),
     *select_players_from_same_position(rating_df, defense_team, "DT", 2),
     *select_players_from_same_position(rating_df, defense_team, "RE", 1),
     *select_players_from_same_position(rating_df, defense_team, "LE", 1),
     *select_players_from_same_position(rating_df, defense_team, "MLB", 1),
     *select_players_from_same_position(rating_df, defense_team, "ROLB", 1),
     *select_players_from_same_position(rating_df, defense_team, "LOLB", 1),
     *select_players_from_same_position(rating_df, defense_team, "FS", 1),
     *select_players_from_same_position(rating_df, defense_team, "SS", 1)
    ]

    #the defense vec should hold 11 players * 60 ratings each;
    #if not, show the offending plays and skip this group
    try:
        assert len(defense_vec) == 660
    except AssertionError:
        print(plays_df.head())
        return


    if offense_formation == "PUNT":
        #PUNT: MLB, 3CB, 2SS, 2ROLB, FS, LOLB, P
        offense_vec = [*select_players_from_same_position(rating_df, offense_team, "CB", 3),
         *select_players_from_same_position(rating_df, offense_team, "SS", 2),
         *select_players_from_same_position(rating_df, offense_team, "ROLB", 2),
         *select_players_from_same_position(rating_df, offense_team, "MLB", 1),
         *select_players_from_same_position(rating_df, offense_team, "LOLB", 1),
         *select_players_from_same_position(rating_df, offense_team, "FS", 1),            
         *select_players_from_same_position(rating_df, offense_team, "P", 1)
        ]
    elif offense_formation == "WILDCAT":
        #WILDCAT: 2TE, WR, LT, RT, LG, RG, C, QB, HB, FB
        offense_vec = [*select_players_from_same_position(rating_df, offense_team, "TE", 2),
         *select_players_from_same_position(rating_df, offense_team, "WR", 1),
         *select_players_from_same_position(rating_df, offense_team, "LT", 1),
         *select_players_from_same_position(rating_df, offense_team, "RT", 1),
         *select_players_from_same_position(rating_df, offense_team, "LG", 1),
         *select_players_from_same_position(rating_df, offense_team, "RG", 1),             
         *select_players_from_same_position(rating_df, offense_team, "C", 1),
         *select_players_from_same_position(rating_df, offense_team, "QB", 1),
         *select_players_from_same_position(rating_df, offense_team, "HB", 1),
         *select_players_from_same_position(rating_df, offense_team, "FB", 1)
        ]
    elif offense_formation == "UNDER CENTER":
        #Under center: (T formation) 2TE, LT, RT, LG, RG, C, QB, 2HB, FB
        offense_vec = [*select_players_from_same_position(rating_df, offense_team, "TE", 2),
         *select_players_from_same_position(rating_df, offense_team, "LT", 1),
         *select_players_from_same_position(rating_df, offense_team, "RT", 1),
         *select_players_from_same_position(rating_df, offense_team, "LG", 1),
         *select_players_from_same_position(rating_df, offense_team, "RG", 1),             
         *select_players_from_same_position(rating_df, offense_team, "C", 1),
         *select_players_from_same_position(rating_df, offense_team, "QB", 1),
         *select_players_from_same_position(rating_df, offense_team, "HB", 2),
         *select_players_from_same_position(rating_df, offense_team, "FB", 1)
        ]
    else: #treat all the rest as SHOTGUN
        #SHOTGUN: (I formation) 2WR, TE, LT, RT, LG, RG, C, QB, HB, FB
        offense_vec = [*select_players_from_same_position(rating_df, offense_team, "TE", 1),
         *select_players_from_same_position(rating_df, offense_team, "WR", 2),
         *select_players_from_same_position(rating_df, offense_team, "LT", 1),
         *select_players_from_same_position(rating_df, offense_team, "RT", 1),
         *select_players_from_same_position(rating_df, offense_team, "LG", 1),
         *select_players_from_same_position(rating_df, offense_team, "RG", 1),             
         *select_players_from_same_position(rating_df, offense_team, "C", 1),
         *select_players_from_same_position(rating_df, offense_team, "QB", 1),
         *select_players_from_same_position(rating_df, offense_team, "HB", 1),
         *select_players_from_same_position(rating_df, offense_team, "FB", 1)
        ]


    #assert total length of offense vec is 11 players*60

    assert len(offense_vec) == 660

    final_vec = [*offense_vec, *defense_vec]

    #rep final vec for plays_df.shape[0] times, and concat with plays_df
    final_df = pd.DataFrame(np.repeat([final_vec], plays_df.shape[0], axis=0), index=plays_df.index)

    return_df = pd.concat([final_df, plays_df], axis=1, ignore_index=True)

    return_df.columns = [*[f"Offense_{i}" for i in range(len(offense_vec))], \
                         *[f"Defense_{i}" for i in range(len(defense_vec))],
                        *plays_df.columns.tolist()]

    return return_df
  


660
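
The padding rule in `select_players_from_same_position` (repeat the top-rated player when a roster is short at a position) can be illustrated on a plain list of overall ratings. This is a simplified sketch, not the notebook's implementation: `pick_top` is a hypothetical helper that works on single numbers instead of full rating rows, and raising an error on an empty list is an assumption about how that case might be handled.

```python
def pick_top(ratings, num_needed):
    """Return num_needed ratings, best first, repeating the best one if short."""
    ranked = sorted(ratings, reverse=True)
    if not ranked:
        # Assumption: an empty position is treated as an error in this sketch.
        raise ValueError("no players available at this position")

    # Number of extra copies of the best player needed to fill the slots.
    shortfall = max(0, num_needed - len(ranked))

    # Same index pattern as the notebook: leading zeros repeat the best player.
    index_array = [0] * shortfall + list(range(min(num_needed, len(ranked))))
    return [ranked[i] for i in index_array]


print(pick_top([75, 88], 3))      # roster is one player short
print(pick_top([70, 90, 80], 2))  # roster has enough players
```

With two players and three slots, the best rating appears twice, so every formation still yields a vector of fixed length.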


### Obtain 18, 19, 20, 21 data with all features individually 

In [187]:
#get 2018 data
combined_data_2018 = combined_final_data.loc[["2018" in x for x in combined_final_data.GameDate]]
final_set_2018 = combined_data_2018.groupby(["OffenseTeam", "DefenseTeam"]).\
apply(lambda x: x.groupby(["Formation"]).apply(lambda y: append_players_rating(y, final_18_data)))
print("Finished 2018 data!")


#get 2019 data
combined_data_2019 = combined_final_data.loc[["2019" in x for x in combined_final_data.GameDate]]
final_set_2019 = combined_data_2019.groupby(["OffenseTeam", "DefenseTeam"]).\
apply(lambda x: x.groupby(["Formation"]).apply(lambda y: append_players_rating(y, final_19_data)))
print("Finished 2019 data!")


#get 2020 data
combined_data_2020  = combined_final_data.loc[["2020" in x for x in combined_final_data.GameDate]]
final_set_2020 = combined_data_2020.groupby(["OffenseTeam", "DefenseTeam"]).\
apply(lambda x: x.groupby(["Formation"]).apply(lambda y: append_players_rating(y, final_20_data)))
print("Finished 2020 data!")


#get 2021 data
combined_data_2021 = combined_final_data.loc[["2021" in x for x in combined_final_data.GameDate]]
final_set_2021 = combined_data_2021.groupby(["OffenseTeam", "DefenseTeam"]).\
apply(lambda x: x.groupby(["Formation"]).apply(lambda y: append_players_rating(y, final_21_data)))
print("Finished 2021 data!")



Finished 2018 data!
Finished 2019 data!
Finished 2020 data!
Finished 2021 data!


### Get the final combined data of all four seasons

In [190]:

final_combined_data = pd.concat([final_set_2018, final_set_2019, final_set_2020, final_set_2021], \
                                axis=0)


print(f"final_set's shape is {final_combined_data.shape}, header is \n {final_combined_data.head().T}")

final_set's shape is (114827, 1432), header is 
 OffenseTeam                                                                ARI  \
DefenseTeam                                                                ATL   
Formation                                                            NO HUDDLE   
                                                                        89352    
Offense_0                                                                 82.0   
Offense_1                                                                 80.0   
Offense_2                                                                 87.0   
Offense_3                                                                 76.0   
Offense_4                                                                 81.0   
Offense_5                                                                 80.0   
Offense_6                                                                 86.0   
Offense_7                                        

### Reformatting final dataset
* The response variable is Yards / ToGo, which can be further converted into a probability using the sigmoid function
* Quarter, Minute, and Second are combined into elapsed seconds, from 0 to 60*60 = 3600 secs in regulation (with OT as quarter 5, the maximum is 4500 secs)
* Downs are not converted, since a 1st down is clearly different from a 4th down
* Extra columns (such as Offense/Defense Team) are kept for reference and will be dropped at a later stage
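
The arithmetic behind these two derived columns can be checked on single plays. The following is a minimal sketch with hypothetical helper names (`elapsed_seconds`, `yards_vs_togo`, `sigmoid`). It assumes, as the formula in the next cell implies, that Minute and Second give the time remaining on the quarter clock, and that ToGo is greater than zero.

```python
import math

QUARTER_SECONDS = 15 * 60  # one quarter lasts 15 minutes


def elapsed_seconds(quarter, minute, second):
    """Seconds elapsed since kickoff, given the quarter and the clock remaining in it."""
    return quarter * QUARTER_SECONDS - minute * 60 - second


def yards_vs_togo(yards, to_go):
    """Response variable: yards gained relative to yards needed (assumes to_go > 0)."""
    return yards / to_go


def sigmoid(x):
    """Squash a real-valued score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


print(elapsed_seconds(1, 15, 0))   # kickoff
print(elapsed_seconds(4, 0, 0))    # end of regulation
print(elapsed_seconds(5, 0, 0))    # end of a full OT period
print(sigmoid(yards_vs_togo(5, 10)))
```

A play that gains exactly the yards needed has a ratio of 1, and a play with no gain has a ratio of 0, which the sigmoid maps to 0.5.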

In [200]:
#elapsed seconds = (quarter-1)*15*60 + (15*60 - min*60 - sec) = quarter*15*60 - min*60 - sec
#this includes quarter 5 (which is OT)
final_combined_data["Time_in_seconds"] = (final_combined_data["Quarter"]*15*60
                                          - final_combined_data["Minute"]*60
                                          - final_combined_data["Second"])

#response variable: yards gained relative to yards to go
final_combined_data["Yards_vs_ToGo"] = final_combined_data["Yards"]*1.0/final_combined_data["ToGo"]


final_combined_data.head().T



OffenseTeam,ARI,ARI,ARI,ARI,ARI
DefenseTeam,ATL,ATL,ATL,ATL,ATL
Formation,NO HUDDLE,NO HUDDLE SHOTGUN,NO HUDDLE SHOTGUN,NO HUDDLE SHOTGUN,NO HUDDLE SHOTGUN
Unnamed: 0_level_3,89352,89348,89355,89358,101243
Offense_0,82.0,82.0,82.0,82.0,82.0
Offense_1,80.0,80.0,80.0,80.0,80.0
Offense_2,87.0,87.0,87.0,87.0,87.0
Offense_3,76.0,76.0,76.0,76.0,76.0
Offense_4,81.0,81.0,81.0,81.0,81.0
Offense_5,80.0,80.0,80.0,80.0,80.0
Offense_6,86.0,86.0,86.0,86.0,86.0
Offense_7,69.0,69.0,69.0,69.0,69.0
Offense_8,29.0,29.0,29.0,29.0,29.0
Offense_9,22.0,22.0,22.0,22.0,22.0


### Save the combined final data

In [201]:
#write the combined data to disk (opened in binary write mode)
with open("20220428_final_combined_NFL_data.pkl", "wb") as f:
    pickle.dump(final_combined_data, f)
print("Finished saving final_combined_data with players rating file!")

Finished saving final_combined_data with players rating file!


### FINISHED!!!!