# Nathan Cunningham NFL Player Comparison

When I saw this Kaggle competition, I was immediately intrigued: I love the movie Moneyball and take a lot of inspiration from it. While I'm a beginner with pandas/Python, I knew I could bring some novel ideas because I'm a problem solver.

My approach to this analysis was to put on the hat of a GM in charge of personnel decisions and answer the question posed in the competition: "Who are the NFL's best players against the pass?" I know that I don't understand the nuances of game and play data well enough to put on the hat of a coach, so I'm looking at a higher level to see which players made an outsized impact over the course of a season. This will let the GM find deals for players that could have an impact.

The metric I'm going to narrow in on is "Yards Gained". For the sake of this analysis, I'm going to keep it fairly simple: Expected yards gained vs Actual yards gained 

The other piece of philosophy behind this analysis is the concept of everyone being equal on defense. Unlike baseball, where a single player can be measured by getting on base, an analysis of NFL defense has to either weight certain players differently or treat everyone equally. Before making this decision, I ran a correlation analysis between the "distance" between players on the field and "yards gained", and found no correlation. Therefore, for the purpose of this analysis, I'm going to treat all defensive players equally. And that's not a bad decision in my opinion. If a lineman is able to pressure the quarterback faster than anticipated, the QB has less time to respond on a passing play. If a cornerback or safety blocks a pass, forces a WR out of position, or forces the QB to second-guess a pass due to positioning, that all has an impact. Football is very much a team sport and a team effort.
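
A correlation check like the one described above could be sketched as follows. This is only an illustration on made-up numbers: the column names `distance` and `yardsGained` and the toy values are assumptions, not the data or code from the original check.

```python
import pandas as pd

# Toy data (hypothetical): one row per play, with a player-separation
# distance and the yards gained on that play.
toy = pd.DataFrame({
    "distance":    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "yardsGained": [5.0, 12.0, 3.0, 9.0, 4.0, 10.0],
})

# Pearson correlation coefficient (the default for Series.corr).
# Values near 0 suggest no linear relationship between the two columns.
r = toy["distance"].corr(toy["yardsGained"])
print(round(r, 3))
```

A coefficient close to zero only rules out a linear relationship, so a scatter plot is a useful companion to a check like this.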

So, with all that as context, here's the actual data that I'm going to process: 
1. Aggregate the simple average yards that QBs gained on any given play and formation
2. Get a distinct list of all defensive players that were on the field for the plays against those specific QBs.
3. Calculate the "expected yards" on any given play given the QB and formation (the concept of "expected yards" in this instance is super simple, but I think the framework could be powerful and could be built on over time)
4. For each defensive player, calculate the delta of expected vs actual 
5. Aggregate the average "expected yards delta" for any defensive player
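
The five steps above can be sketched end to end on a tiny made-up dataset. The frames `plays_toy` and `defenders_toy`, their column names, and all values are hypothetical and exist only to show the shape of the calculation. The real notebook works from the competition files.

```python
import pandas as pd

# Toy data (hypothetical): one row per pass play with the QB, formation,
# and actual yards gained.
plays_toy = pd.DataFrame({
    "playId":    [1, 2, 3, 4],
    "qbId":      [10, 10, 10, 10],
    "formation": ["SHOTGUN", "SHOTGUN", "EMPTY", "EMPTY"],
    "yards":     [4, 8, 10, 20],
})
# Toy data (hypothetical): which defenders were on the field for each play.
defenders_toy = pd.DataFrame({
    "playId":   [1, 2, 3, 4, 1, 3],
    "defender": ["A", "A", "A", "A", "B", "B"],
})

# Steps 1 and 3: expected yards = simple mean per QB and formation.
expected = (plays_toy.groupby(["qbId", "formation"])["yards"]
            .mean().rename("expectedYards").reset_index())
plays_toy = plays_toy.merge(expected, on=["qbId", "formation"])

# Steps 2 and 4: attach each play to its defenders, then actual minus expected.
per_defender = defenders_toy.merge(plays_toy, on="playId")
per_defender["delta"] = per_defender["yards"] - per_defender["expectedYards"]

# Step 5: average delta per defender (negative means fewer yards than expected).
avg_delta = per_defender.groupby("defender")["delta"].mean()
print(avg_delta)
```

In this toy example defender A was on the field for every play, so A's average delta is zero by construction, while defender B only appears on the two plays that came in under the average and ends up negative.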

This definitely won't be the most impressive "visual" analysis, but I'm hoping that the calculation (and associated csv file) is useful for personnel choices. Also, I'm not sure what the "standard" is because this is my first Kaggle competition, but I deleted MANY cells where I used dataframe.describe or series.value_counts, because they were for my own learning and not necessarily relevant to the competition.

In [None]:
#Pulling in all data into pandas dataframes

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#Build the 17 weekly tracking file paths instead of listing each one by hand
week_files = ['../input/nfl-big-data-bowl-2021/week{}.csv'.format(w) for w in range(1, 18)]
allweeks = pd.concat(map(pd.read_csv, week_files))

plays = pd.read_csv('../input/nfl-big-data-bowl-2021/plays.csv')
#players = pd.read_csv('../input/nfl-big-data-bowl-2021/players.csv') I realized later that I don't ever use this
games = pd.read_csv('../input/nfl-big-data-bowl-2021/games.csv')

In [None]:
#then merging plays with games so I can get the defensive team within the plays data
plays = pd.merge(plays,  
                      games,  
                      on ='gameId',  
                      how ='inner') 

In [None]:
#adding column def_team to the plays dataframe

plays["def_team"] = np.where(plays["possessionTeam"]==plays["homeTeamAbbr"], plays["visitorTeamAbbr"], plays["homeTeamAbbr"])

In [None]:
#from allweeks data, removing all of the "frame data", removing columns I don't need, removing football, and removing duplicates
aw_col = ['gameId','playId','team','nflId','week','displayName','position']
aw = allweeks.reindex(columns=aw_col)
aw = aw.drop_duplicates()
aw = aw[aw['team'] != "football"] 


In [None]:
#now that I have a more clean list of weekly play/player data, 
#I can merge with plays data to get offensive result
aw = pd.merge(aw,
              plays,  
              on =["gameId","playId"],  
              how ='inner') 

In [None]:
#calculating whether the player's team is on offense or defense and then splitting the dataframe into two

aw["playerTeam"] = np.where(aw["team"]=="home",aw["homeTeamAbbr"],aw["visitorTeamAbbr"])

aw_def = aw[aw["playerTeam"] == aw["def_team"]]
aw_off = aw[aw["playerTeam"] != aw["def_team"]]


In [None]:
#now limiting the offensive players to evaluate to just QBs.
#The reason I'm doing this is mostly to avoid duplicates.
#I toyed around with the idea of including WR and TE, but couldn't figure out the math
#I also figure QBs have an outsized impact on passing plays

aw_off = aw_off[aw_off["position"].isin(['QB'])]

As a reminder, at this point in the code, I have the data I need, structured so that I can start making the necessary calculations. The next step is to work through steps 1-5 listed in the initial markdown, starting with calculating the average offensive result for each player and offense formation.
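
One pandas detail is worth knowing for this kind of aggregation: passing a dict of lists to `agg` returns columns as a two-level MultiIndex, while named aggregation returns flat column names that are easier to merge and rename. Here is a minimal sketch on toy data. The frame `toy` and the output names `avgYards` and `nPlays` are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values) with the same key and value columns
# used for the per-QB, per-formation average.
toy = pd.DataFrame({
    "nflId": [1, 1, 1, 2],
    "offenseFormation": ["SHOTGUN", "SHOTGUN", "EMPTY", "EMPTY"],
    "offensePlayResult": [4, 8, 10, 6],
})
keys = ["nflId", "offenseFormation"]

# Dict-of-lists aggregation: columns come back as a two-level MultiIndex,
# e.g. ('offensePlayResult', 'mean').
multi = toy.groupby(keys).agg({"offensePlayResult": ["mean", "count"]})
print(multi.columns.nlevels)

# Named aggregation: flat, readable column names chosen by the caller.
flat = (toy.groupby(keys)["offensePlayResult"]
        .agg(avgYards="mean", nPlays="count")
        .reset_index())
print(list(flat.columns))
```

With flat names, the aggregated frame can be merged back onto the play-level data on `nflId` and `offenseFormation` without any tuple-style column labels appearing in the result.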

In [None]:
#calculate the average yards per player (QB's) and offense formation
#In theory, the only thing that the defense knows for sure is the formation and the players on the field
#I'm guessing that we could add more intelligence based on quarter/down, team, etc.
#But again, I'm hoping that this is a novel enough idea even if the nuances aren't all worked out

#Named aggregation returns flat column names (shouldBeOffResult, offCount),
#so a single merge works and no rename step is needed afterwards
off_avg_oYards = (aw_off.groupby(['nflId','offenseFormation'])['offensePlayResult']
                  .agg(shouldBeOffResult='mean', offCount='count')
                  .reset_index())
#off_avg_oYards.head()

In [None]:
#for each offensive player/game/play, grab the average offensive yards calculated in the previous step.
#Note: aggregating with a dict of lists instead gives two-level (MultiIndex) columns, and merging
#those into a single-level dataframe triggers a pandas warning about merging between different levels
aw_off = pd.merge(aw_off,
                  off_avg_oYards,
                  on=["nflId","offenseFormation"],
                  how='inner')


In [None]:
#create new dataframe that just gives me what the offensive result should be based on game and play
off_should_be = aw_off[['gameId','playId','shouldBeOffResult','offCount']]
off_should_be = off_should_be.drop_duplicates()

In [None]:
#merge the should be information into the defensive player data 
#then calculating the delta of "should be" and "actual"

aw_def = pd.merge(aw_def,
                 off_should_be,
                 on = ["gameId","playId"],
                 how = 'inner')

In [None]:
#calculate the delta between should be and actual
#negative is good because it means the actual result is less than the anticipated result
aw_def['shouldDelta']=aw_def['offensePlayResult']-aw_def['shouldBeOffResult']

#aw_def.describe()


In [None]:
#drop any plays where the QB/formation combination has fewer than 30 plays
aw_def_filtered = aw_def[aw_def['offCount']>=30]

In [None]:
#quick scatter plot on should be vs actual
#There are technically some duplicate dots because we are looking at multiple defensive players
#but because they will overlap, it shouldn't be a big deal
fig = plt.figure(figsize=(15,15))
ax1 = fig.add_subplot(2,1,1)

ax1.scatter(aw_def_filtered['shouldBeOffResult'],aw_def_filtered['offensePlayResult'])
ax1.set_xlabel("Should Be")
ax1.set_ylabel("Actual")

plt.show()

An argument could be made, particularly with this scatter plot, that I should ignore "actual" plays greater than 40 yards. I'm not going to, because these outliers are phenomenal successes for a passing offense and are exactly what a defense should be working hardest to prevent. Players should be held accountable for them. That said, the filter would be easy to add, and there's an argument to be made either way.
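
To show how much a single long play can move a player's average, here is a minimal sketch comparing the two options on toy data. The values are made up, and the 40-yard cutoff is just the threshold mentioned above.

```python
import pandas as pd

# Toy data (hypothetical): actual vs expected yards for one defender's plays.
toy = pd.DataFrame({
    "offensePlayResult": [3, 7, 0, 55, 12],
    "shouldBeOffResult": [6.0, 6.0, 6.0, 6.0, 6.0],
})
toy["shouldDelta"] = toy["offensePlayResult"] - toy["shouldBeOffResult"]

# Option A: keep every play (the choice made in this notebook).
mean_all = toy["shouldDelta"].mean()

# Option B: drop plays longer than 40 yards before averaging.
mean_trimmed = toy.loc[toy["offensePlayResult"] <= 40, "shouldDelta"].mean()

print(mean_all, mean_trimmed)
```

In this toy case the one 55-yard play flips the average from slightly negative to strongly positive, which is why the choice matters most for players with small play counts.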

In [None]:
#calculate average delta by nfl player
def_avg_delta = aw_def_filtered.groupby(['displayName','nflId','position']).agg({'shouldDelta': ['mean','count']})


In [None]:
print(def_avg_delta)
def_avg_delta.to_csv('def_avg_delta.csv')

Recommended Next Steps (outside of NFL Kaggle Competition): 
1. GMs take this aggregate data and match it with NFL salaries (https://www.pro-football-reference.com/players/salary.htm)
<br>--This is specifically not allowed for the Kaggle competition, but is what I'd do
2. Filter out any defensive players that have fewer than 30 plays (this filters down to 493 players)
3. Trade for or sign players that have a strongly negative value (negative means the player is good)
4. Trade away players that have a high positive value
5. Coaches give more playing time to players with negative values (particularly on passing plays) and see the impact
6. Add in running plays into this analysis as well
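
Steps 2 through 4 could be sketched like this. The table below is a hypothetical stand-in for the per-player output, with made-up players and values and with flat column names (`meanDelta`, `playCount`) assumed for readability.

```python
import pandas as pd

# Toy stand-in (hypothetical values) for the per-player table.
toy = pd.DataFrame({
    "displayName": ["Player A", "Player B", "Player C"],
    "meanDelta":   [-1.2, 0.8, -3.0],
    "playCount":   [120, 45, 12],
})

MIN_PLAYS = 30  # same threshold as step 2 above

# Keep players with enough plays, then sort so the most negative
# (best, under this metric) come first.
ranked = (toy[toy["playCount"] >= MIN_PLAYS]
          .sort_values("meanDelta")
          .reset_index(drop=True))
print(ranked)
```

Player C has the best raw average here but is dropped for having too few plays, which is the point of applying the play-count filter before ranking.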

# Summary

Like I said at the beginning, my goal was to create a metric that could be used at a high level for personnel decisions and potentially playing-time decisions, not a revolutionary way to adjust formations and x/y coordinates to improve defensive plays.

I believe that this analysis not only provides immediate value, but provides a great framework to use for future analysis. I'm happy to build on this with additional time. I'm relatively new at pandas/python, but interested in learning more. Thank you for your time!

Nathan Cunningham <br>
n.cunningham430@gmail.com <br>
www.linkedin.com/in/nacunningham 