(Disclaimer: I work for Pro Football Focus, one of the data providers for this competition, and am therefore not eligible for the prize.)

## What contributes to a successful field goal?


Multiple factors can affect field goal percentage in the NFL, and some of them might not be what you think. This notebook uses [PyMC](https://github.com/pymc-devs/pymc) to construct Bayesian models that estimate field goal success probability across different factors, with [JAX](https://github.com/google/jax) to speed up inference and make Bayesian modelling more accessible.

The previous notebook, https://www.kaggle.com/s903124/bayesian-field-goal-model-with-pymc , established that a model using field goal distance and angle performs the best. The models below expand on that base model.

Field goal success is modelled with a Binomial distribution:

$$ y_i \sim Binomial(n, p_i)$$

where $p_i$ is the success probability of individual field goal $i$ and $n = 1$, since each attempt is a single trial. In the base model, field goal success depends on distance and angle only, and therefore

$$ p_i = InverseLogit( \theta \cdot angle_i + d \cdot distance_i + intercept)$$

where 

$$ \theta \sim Normal(0, 10)$$
$$ d \sim Normal(0, 10)$$
$$ intercept \sim Normal(0, 1)$$
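
To make the link function concrete, here is a minimal sketch of how the base model turns distance and angle into a make probability. The coefficient values are hypothetical and chosen only for illustration; they are not fitted values from this notebook.

```python
import numpy as np

def inv_logit(z):
    """Map a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients, for illustration only (not posterior estimates).
intercept, d, theta = 1.0, -0.08, 0.35

distance = np.array([25.0, 40.0, 55.0])  # yards from the ball to the goal
angle = np.array([14.0, 8.8, 6.4])       # degrees subtended by the uprights

# Linear predictor on the log-odds scale, then squashed to a probability.
p = inv_logit(intercept + d * distance + theta * angle)
print(np.round(p, 3))
```

With a negative distance coefficient and a positive angle coefficient, longer kicks get a lower probability, which is the behaviour the priors above leave the data free to confirm or reject.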


In [None]:
!pip install numpyro
!git clone https://github.com/pymc-devs/pymc/ && cd pymc && pip install .

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import pymc as pm
import pymc.sampling_jax
import arviz as az
from aesara import tensor as aet

pd.options.display.max_columns = 999

import matplotlib.pyplot as plt
import seaborn as sns
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
from patsy import dmatrix


In [None]:
play_data = pd.read_csv('/kaggle/input/nfl-big-data-bowl-2022/plays.csv')
field_goal_data = play_data[play_data.specialTeamsPlayType == 'Field Goal'][['gameId','playId','absoluteYardlineNumber','specialTeamsResult','playDescription']]
pff_data = pd.read_csv('/kaggle/input/nfl-big-data-bowl-2022/PFFScoutingData.csv')
stadium_data = pd.read_csv('../input/weather-data/stadium_coordinates.csv')
weather_data = pd.read_csv('../input/weather-data/games_weather.csv')
game_data = pd.read_csv('../input/weather-data/games.csv')

field_goal_data = pd.merge(field_goal_data,game_data,left_on='gameId',right_on='game_id')
field_goal_data = pd.merge(field_goal_data,stadium_data)

In [None]:
#Load tracking data

tracking_data = []

for year in range(2018,2021):
    data = pd.read_csv('/kaggle/input/nfl-big-data-bowl-2022/tracking'+str(year) + '.csv')
    data = data[data.event == 'field_goal_attempt']
    tracking_data.append(data)
tracking_data = pd.concat(tracking_data)
del data

tracking_data = pd.merge(tracking_data,field_goal_data)
tracking_data.loc[tracking_data.playDirection == 'left','x'] = 120-tracking_data['x']
tracking_data.loc[tracking_data.playDirection == 'left','y'] = 53.33-tracking_data['y']

In [None]:
#Merge with weather data

weather_data['TimeMeasure'] = pd.to_datetime(weather_data['TimeMeasure'])
tracking_data['time'] = pd.to_datetime(tracking_data['time'])
tracking_list = []
for game_id in tracking_data['gameId'].unique():
    tracking_list.append(pd.merge_asof(tracking_data[tracking_data.gameId == game_id],weather_data[weather_data.game_id == game_id],left_on='time',right_on='TimeMeasure',direction='nearest'))
tracking_data = pd.concat(tracking_list)
# Keep only the ball's rows; copy so that new columns can be added without pandas warnings
field_goal_ball_df = tracking_data[tracking_data.team == 'football'].copy()

In [None]:
#Calculate distance and angle

ba = np.array(np.array([120,23.583])-field_goal_ball_df[['x','y']])
bc = np.array(np.array([120,29.75])-field_goal_ball_df[['x','y']])

field_goal_ball_df['angle'] =  np.degrees(np.arccos(np.array([np.dot(a,b) for a,b in zip(ba,bc)])/(np.linalg.norm(ba,axis=1)  * np.linalg.norm(bc,axis=1) )))
# Distance from the ball to the midpoint between the uprights: (23.583 + 29.75) / 2 = 26.67
field_goal_ball_df['fg_dist'] = ((field_goal_ball_df['x'] - 120)**2 + (field_goal_ball_df['y'] - 26.67)**2)**0.5
field_goal_ball_df = field_goal_ball_df[field_goal_ball_df.fg_dist <= 70]
field_goal_ball_df['fg_make'] = np.array(field_goal_ball_df["specialTeamsResult"] == 'Kick Attempt Good').astype(int)

field_goal_ball_df['kickerName'] = field_goal_ball_df['playDescription'].str.split('.',expand=True)[0].str.rsplit(' ',n=1,expand=True)[1].astype(str) + '.' + field_goal_ball_df['playDescription'].str.split('.',expand=True)[1].str.split(' ',expand=True)[0]
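
As a sanity check on the angle feature, the sketch below recomputes the angle subtended by the uprights for two hypothetical ball positions, using the same upright coordinates as the cell above. The helper name `goal_angle` and the ball positions are assumptions made for this example.

```python
import numpy as np

# Upright positions as used in the cell above (x = 120 is the back of the end zone).
LEFT_UPRIGHT = np.array([120.0, 23.583])
RIGHT_UPRIGHT = np.array([120.0, 29.75])

def goal_angle(ball_xy):
    """Angle in degrees subtended by the uprights for an (n, 2) array of ball positions."""
    ba = LEFT_UPRIGHT - ball_xy
    bc = RIGHT_UPRIGHT - ball_xy
    # Row-wise dot product divided by the product of the vector lengths.
    cos = np.einsum("ij,ij->i", ba, bc) / (
        np.linalg.norm(ba, axis=1) * np.linalg.norm(bc, axis=1)
    )
    # Clip guards against tiny floating point overshoots outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Two hypothetical kicks from the same depth: one centred, one near the sideline.
balls = np.array([[80.0, 26.67], [80.0, 10.0]])
angles = goal_angle(balls)
print(np.round(angles, 2))
```

The off-centre kick sees a narrower target even though it is taken from the same yard line, which is why angle carries information beyond distance.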

In [None]:
#Merge with stadium data

direction_df = pd.DataFrame({'CompassDirection':['N','NE','E','SE','S','SW','W','NW','N'],"adjusted_StadiumAzimuthAngle":[0.0,45,90,135,180,225,270,315,360]})
direction_df = direction_df.sort_values(by='adjusted_StadiumAzimuthAngle')

field_goal_ball_df['adjusted_StadiumAzimuthAngle'] = field_goal_ball_df['StadiumAzimuthAngle']
field_goal_ball_df.loc[field_goal_ball_df.playDirection == 'right','adjusted_StadiumAzimuthAngle'] += 180
field_goal_ball_df.loc[field_goal_ball_df.adjusted_StadiumAzimuthAngle > 360,'adjusted_StadiumAzimuthAngle'] -= 360

field_goal_ball_df = field_goal_ball_df.sort_values(by='adjusted_StadiumAzimuthAngle')
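# Snap each azimuth to the nearest of the eight compass directions
field_goal_ball_df = pd.merge_asof(field_goal_ball_df,direction_df,on='adjusted_StadiumAzimuthAngle',direction='nearest')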

field_goal_ball_df['StadiumDirection'] = field_goal_ball_df['StadiumName'] + '_' + field_goal_ball_df['CompassDirection']
field_goal_ball_df = field_goal_ball_df.sort_values(by='StadiumDirection')
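
The nearest-direction lookup above can also be sketched with plain arithmetic. This is an illustrative alternative, not the method used in the notebook; `nearest_compass` is a hypothetical helper and ties may be resolved differently from `pd.merge_asof`.

```python
import numpy as np

COMPASS = np.array(["N", "NE", "E", "SE", "S", "SW", "W", "NW"])

def nearest_compass(azimuth_deg):
    """Snap azimuths in degrees to the nearest of eight compass points."""
    az = np.asarray(azimuth_deg, dtype=float)
    # Each compass point covers a 45 degree sector; the modulo wraps 360 back to N.
    idx = np.round(az / 45.0).astype(int) % 8
    return COMPASS[idx]

# Kicks towards the opposite end zone face the opposite way, hence the 180 degree flip.
stadium_azimuth = np.array([10.0, 100.0, 350.0, 200.0])
flipped = (stadium_azimuth + 180.0) % 360.0

labels = nearest_compass(stadium_azimuth)
flipped_labels = nearest_compass(flipped)
print(list(labels), list(flipped_labels))
```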

In [None]:
#Factorize random effect

player_idxs, players = pd.factorize(field_goal_ball_df['kickerName'])
player_idxs = player_idxs.astype('int32')

stadium_idxs, stadiums = pd.factorize(field_goal_ball_df['StadiumName'])
stadium_idxs = stadium_idxs.astype('int32')

stadium_direction_idxs, stadiums_direction = pd.factorize(field_goal_ball_df['StadiumDirection'])
stadium_direction_idxs = stadium_direction_idxs.astype('int32')
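
For readers new to this pattern, here is a small sketch of what `pd.factorize` returns and how the integer codes are later used to pick one effect per attempt. The kicker names and effect values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical kicker column: one row per field goal attempt.
kickers = pd.Series(["A.Able", "B.Baker", "A.Able", "C.Clark", "B.Baker"])

# codes: integer id per row; names: the unique labels in order of first appearance.
codes, names = pd.factorize(kickers)

# One (made-up) effect per unique kicker, indexed by the integer codes,
# gives one effect per attempt. The models below do the same with player[player_idxs].
effects = np.array([0.4, 0.1, -0.2])
per_attempt = effects[codes]
print(codes, list(names), per_attempt)
```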

In [None]:
y = np.array(field_goal_ball_df['fg_make'])
n = np.ones_like(field_goal_ball_df['fg_make'])

First we use [weather data provided by Thomas Bliss](https://www.kaggle.com/tombliss/weather-data) to see whether weather affects field goal percentage.

In [None]:
field_goal_ball_df['Temperature'] = field_goal_ball_df['Temperature'].fillna(60)
field_goal_ball_df['WindSpeed'] = field_goal_ball_df['WindSpeed'].fillna(7)
field_goal_ball_df['Precipitation'] = field_goal_ball_df['Precipitation'].fillna(0)
field_goal_ball_df['Pressure'] = field_goal_ball_df['Pressure'].fillna(30)

In [None]:
fg_dist = np.array(field_goal_ball_df["fg_dist"])
angle = np.array(field_goal_ball_df["angle"])
pressure = np.array(field_goal_ball_df['Pressure'])
temperature = np.array(field_goal_ball_df['Temperature'])
precipitation = np.array(field_goal_ball_df['Precipitation'])
windspeed = np.array(field_goal_ball_df['WindSpeed'])

In [None]:

with pm.Model() as model_weather:

    # Priors: `sigma` is the current name of the standard deviation argument
    intercept = pm.Normal("intercept", mu=0, sigma=1)
    d = pm.Normal("d", mu=0, sigma=10)
    θ = pm.Normal("θ", mu=0, sigma=10)
    P = pm.Normal("Pressure", mu=0, sigma=100)
    T = pm.Normal("Temperature", mu=0, sigma=100)
    ppt = pm.Normal("Precipitation", mu=0, sigma=100)
    ws = pm.Normal("Windspeed", mu=0, sigma=100)
    
    z = intercept + pm.math.dot(fg_dist, d) + pm.math.dot(angle, θ) + pm.math.dot(pressure, P) + pm.math.dot(temperature, T) + pm.math.dot(precipitation, ppt) + pm.math.dot(windspeed, ws) 

    p = pm.Deterministic("p", pm.math.invlogit(z))

    y_obs = pm.Binomial("y_obs", n=n, p=p, observed=y)

In [None]:
with model_weather:
    logit_weather_trace_jax = pm.sampling_jax.sample_numpyro_nuts(
        2000, tune=2000, target_accept=.9,chains=2)

In [None]:
az.summary(logit_weather_trace_jax)

In [None]:

az.style.use("arviz-darkgrid")
az.plot_posterior(logit_weather_trace_jax, var_names=('d','θ','Pressure','Temperature','Precipitation','Windspeed'))

As shown above, weather has a small effect on field goal success (for example, field goal percentage decreases with high wind speed and heavy precipitation), but the effect is not strong: zero lies inside the 94% credible interval for every weather coefficient. For simplicity, the models below do not account for weather.
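
The "zero inside the credible interval" check can be sketched without any plotting. The function below is a simple highest-density-interval estimate for unimodal samples, written for this example; the draws are synthetic stand-ins, not samples from the weather model.

```python
import numpy as np

def hdi(samples, prob=0.94):
    """Narrowest interval containing `prob` of the samples (unimodal case)."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    k = int(np.ceil(prob * n))            # number of samples inside the interval
    widths = x[k - 1:] - x[: n - k + 1]   # width of every candidate window
    start = int(np.argmin(widths))
    return x[start], x[start + k - 1]

rng = np.random.default_rng(0)
# Synthetic posterior draws: a weak effect and a clearly negative one.
weak = rng.normal(-0.01, 0.02, size=4000)
strong = rng.normal(-0.10, 0.02, size=4000)

weak_lo, weak_hi = hdi(weak)
strong_lo, strong_hi = hdi(strong)
print("weak effect interval contains zero:", weak_lo < 0 < weak_hi)
print("strong effect interval contains zero:", strong_lo < 0 < strong_hi)
```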

Next, we look at random effects for players and stadiums, since some kickers are better than others and it may be easier to convert field goals in some stadiums. With stadium and player random effects:

$$ p_i = InverseLogit( \theta \cdot angle_i + d \cdot distance_i + player_{j[i]} + stadium_{j[i]}  + intercept)$$

where $j[i]$ denotes the kicker (or stadium) of attempt $i$. At the player and stadium level:

$$ player_j \sim Normal(\overline{player}, \sigma_{player})$$
$$ \overline{player} \sim Normal(0,10)$$
$$ \overline{player}  =  \overline{player} - mean(\overline{player})$$
$$ \sigma_{player} \sim HalfCauchy(5)$$

$$ stadium_j \sim Normal(\overline{stadium}, \sigma_{stadium})$$
$$ \overline{stadium} \sim Normal(0,10)$$
$$ \overline{stadium}  =  \overline{stadium} - mean(\overline{stadium})$$
$$ \sigma_{stadium} \sim HalfCauchy(5)$$

$\overline{player}$ and $\overline{stadium}$ are constrained to sum to zero, since in a sports game you would expect the average effect of players and stadiums to be zero.
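
A quick numerical sketch of the centring step, using made-up numbers. Note that centring only does something on a vector: in the model cells of this notebook the `*_bar` variables are declared without a `shape`, so they are scalars, and subtracting a scalar's own mean leaves exactly zero.

```python
import numpy as np

rng = np.random.default_rng(1)

# Vector case: five hypothetical group means, centred so they sum to zero.
raw = rng.normal(0.0, 1.0, size=5)
centred = raw - raw.mean()

# Scalar case: the mean of a single number is the number itself.
scalar = np.float64(0.7)
scalar_centred = scalar - np.mean(scalar)

print(np.round(centred, 3), centred.sum(), scalar_centred)
```

In the scalar case the group-level effects are therefore simply centred on zero with a learned spread, which is still consistent with the zero-average assumption.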

In [None]:
with pm.Model() as model_player_stadium:

    intercept = pm.Normal("intercept", mu=0, sigma=1)
    d = pm.Normal("d", mu=0, sigma=10)
    θ = pm.Normal("θ", mu=0, sigma=10)

    # Kicker random effect. player_bar is a scalar, so subtracting its own mean
    # leaves 0 and the kicker effects are centred on zero.
    sigma_player = pm.HalfCauchy("sigma_player", 5)
    player_bar = pm.Normal("player_bar", mu=0, sigma=10)
    player_bar = player_bar - aet.mean(player_bar)

    player = pm.Normal("player", mu=player_bar, sigma=sigma_player, shape=len(np.unique(player_idxs)))

    # Stadium random effect, built the same way
    sigma_stadium = pm.HalfCauchy("sigma_stadium", 5)
    stadium_bar = pm.Normal("stadium_bar", mu=0, sigma=10)
    stadium_bar = stadium_bar - aet.mean(stadium_bar)

    stadium = pm.Normal("stadium", mu=stadium_bar, sigma=sigma_stadium, shape=len(np.unique(stadium_idxs)))
    
    z = intercept + pm.math.dot(fg_dist, d) + pm.math.dot(angle, θ)  + player[player_idxs] + stadium[stadium_idxs]
    p = pm.Deterministic("p", pm.math.invlogit(z))

    y_obs = pm.Binomial("y_obs", n=n, p=p, observed=y)

In [None]:
with model_player_stadium:
    logit_player_stadium_trace_jax = pm.sampling_jax.sample_numpyro_nuts(
        2000, tune=2000, target_accept=.9,chains=2)

In [None]:
az.plot_trace(logit_player_stadium_trace_jax, var_names=('d','θ','sigma_player','sigma_stadium'))

In [None]:
# Posterior mean and 5% / 95% quantiles (an equal-tailed 90% interval) of each kicker effect
kicker_df = pd.DataFrame({'Kicker':players,'coef':np.mean(logit_player_stadium_trace_jax.posterior.player,axis=(0,1)),
                         'hdi_95':np.quantile(logit_player_stadium_trace_jax.posterior.player,axis=(0,1),q=0.95),
                         'hdi_5':np.quantile(logit_player_stadium_trace_jax.posterior.player,axis=(0,1),q=0.05)}).sort_values(by='coef',ascending=False)

In [None]:
kicker_df

Unsurprisingly, Justin Tucker, kicker for the Baltimore Ravens, is on top, since he is one of the best kickers in NFL history. He is also the only kicker whose credible interval does not cross zero within the span of the tracking data era.
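
The `coef` values in the table are on the log-odds scale, so their size is easier to judge after converting back to probabilities. A minimal sketch, with a hypothetical baseline and a hypothetical kicker effect (neither is taken from the fitted model):

```python
import numpy as np

def logit(p):
    """Probability to log-odds."""
    return np.log(p / (1.0 - p))

def inv_logit(z):
    """Log-odds to probability."""
    return 1.0 / (1.0 + np.exp(-z))

baseline = 0.80        # hypothetical make probability for an average kicker
kicker_effect = 0.5    # hypothetical random effect on the log-odds scale

adjusted = inv_logit(logit(baseline) + kicker_effect)
print(f"{baseline:.3f} -> {adjusted:.3f}")
```

The same shift on the log-odds scale moves the probability less when the baseline is already close to 0 or 1, so a kicker effect matters most on kicks that are close to a coin flip.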

In [None]:
# Posterior mean and 5% / 95% quantiles (an equal-tailed 90% interval) of each stadium effect
stadium_df = pd.DataFrame({'Stadium':stadiums,'coef':np.mean(logit_player_stadium_trace_jax.posterior.stadium,axis=(0,1)),
                         'hdi_95':np.quantile(logit_player_stadium_trace_jax.posterior.stadium,axis=(0,1),q=0.95),
                         'hdi_5':np.quantile(logit_player_stadium_trace_jax.posterior.stadium,axis=(0,1),q=0.05)}).sort_values(by='coef',ascending=False)

In [None]:
stadium_df

We see, for example, that MetLife Stadium, which hosts the New York Giants, is easier to kick in, while Ford Field, which hosts the Detroit Lions, is harder to kick in.

Next we investigate how the direction of the stadium affects field goal percentage. A [blog post by the NFL Operations team](https://operations.nfl.com/gameday/analytics/stats-articles/field-goal-success-probabilities-by-direction/) studies the six stadiums with the largest difference between directions. The next model is similar to the previous one, but each stadium direction is treated as a distinct stadium.

In [None]:
fg_dist = np.array(field_goal_ball_df["fg_dist"])
angle = np.array(field_goal_ball_df["angle"])
pressure = np.array(field_goal_ball_df['Pressure'])

with pm.Model() as model_stadium_direction:

    intercept = pm.Normal("intercept", mu=0, sigma=1)
    d = pm.Normal("d", mu=0, sigma=10)
    θ = pm.Normal("θ", mu=0, sigma=10)

    # Kicker random effect (player_bar is a scalar, so the centred mean is 0)
    sigma_player = pm.HalfCauchy("sigma_player", 5)
    player_bar = pm.Normal("player_bar", mu=0, sigma=10)
    player_bar = player_bar - aet.mean(player_bar)

    player = pm.Normal("player", mu=player_bar, sigma=sigma_player, shape=len(np.unique(player_idxs)))

    # One random effect per (stadium, kicking direction) pair
    sigma_stadium_direction = pm.HalfCauchy("sigma_stadium_direction", 5)
    stadium_direction_bar = pm.Normal("stadium_direction_bar", mu=0, sigma=100)
    stadium_direction_bar = stadium_direction_bar - aet.mean(stadium_direction_bar)

    stadium_direction = pm.Normal("stadium_direction", mu=stadium_direction_bar, sigma=sigma_stadium_direction, shape=len(np.unique(stadium_direction_idxs)))
    
    z = intercept + pm.math.dot(fg_dist, d) + pm.math.dot(angle, θ) + player[player_idxs] + stadium_direction[stadium_direction_idxs]

    p = pm.Deterministic("p", pm.math.invlogit(z))
 
    y_obs = pm.Binomial("y_obs", n=n, p=p, observed=y)

In [None]:
with model_stadium_direction:
    logit_stadium_direction_trace_jax = pm.sampling_jax.sample_numpyro_nuts(
        2000, tune=2000, target_accept=.9,chains=2)

In [None]:
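# Posterior mean and 5% / 95% quantiles (an equal-tailed 90% interval) of each stadium-direction effect
stadium_directions_df = pd.DataFrame({'Stadium_direction':stadiums_direction,'coef':np.mean(logit_stadium_direction_trace_jax.posterior.stadium_direction,axis=(0,1)),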
                         'hdi_95':np.quantile(logit_stadium_direction_trace_jax.posterior.stadium_direction,axis=(0,1),q=0.95),
                         'hdi_5':np.quantile(logit_stadium_direction_trace_jax.posterior.stadium_direction,axis=(0,1),q=0.05)}).sort_values(by='coef',ascending=False)

In [None]:
stadium_directions_df

Both directions at MetLife Stadium continue to rank at the top for converting field goals, and it is interesting to see that the two sides of Gillette Stadium sit at opposite ends of the spectrum. To investigate the difference between the two directions, one can simply take the difference of the two sides, but one can also define a "deviation" term that specifies how much the two directions differ, pooled around the stadium mean. For each stadium:

$$ deviation_j \sim Normal(0,1)$$
$$ \sigma_{stadium\ direction} \sim HalfCauchy(1)$$
$$ \overline{stadium\ direction}_{j,\pm} = stadium_j \pm deviation_j$$
$$ stadium\ direction_{j,\pm} \sim Normal(\overline{stadium\ direction}_{j,\pm}, \sigma_{stadium\ direction})$$
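
The interleaving of the two per-direction means can be checked with a few made-up numbers. This sketch mirrors the stack-transpose-flatten pattern used in the model cell, in NumPy rather than Aesara, and assumes every stadium has exactly two directions stored next to each other.

```python
import numpy as np

# Hypothetical stadium means and deviations for three stadiums.
stadium = np.array([0.2, -0.1, 0.0])
deviation = np.array([0.05, 0.30, -0.10])

# Shape (2, 3) -> transpose to (3, 2) -> flatten so entries 2j and 2j+1 belong to stadium j.
direction_mean = np.stack([stadium + deviation, stadium - deviation]).T.flatten()
print(direction_mean)

# Averaging each pair recovers the stadium mean; half the pair difference recovers the deviation.
pairs = direction_mean.reshape(-1, 2)
recovered_stadium = pairs.mean(axis=1)
recovered_deviation = (pairs[:, 0] - pairs[:, 1]) / 2.0
```

Because the two directions are placed symmetrically around the stadium mean, the sign of the deviation is arbitrary and only its magnitude says how much the two ends of a stadium differ.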



In [None]:
with pm.Model() as model_stadium_pooled:

    intercept = pm.Normal("intercept", mu=0, sigma=1)
    d = pm.Normal("d", mu=0, sigma=10)
    θ = pm.Normal("θ", mu=0, sigma=10)

    # Kicker random effect (player_bar is a scalar, so the centred mean is 0)
    sigma_player = pm.HalfCauchy("sigma_player", 5)
    player_bar = pm.Normal("player_bar", mu=0, sigma=10)
    player_bar = player_bar - aet.mean(player_bar)

    player = pm.Normal("player", mu=player_bar, sigma=sigma_player, shape=len(np.unique(player_idxs)))

    # Stadium-level mean effect
    sigma_stadium = pm.HalfCauchy("sigma_stadium", 5)
    stadium_bar = pm.Normal("stadium_bar", mu=0, sigma=10)
    stadium_bar = stadium_bar - aet.mean(stadium_bar)
    stadium = pm.Normal("stadium", mu=stadium_bar, sigma=sigma_stadium, shape=len(np.unique(stadium_idxs)))

    # Per-stadium deviation between the two kicking directions
    deviation = pm.Normal("deviation", mu=0, sigma=1, shape=len(np.unique(stadium_idxs)))
    # Interleave the means so entries 2j and 2j+1 belong to stadium j. This assumes every
    # stadium has exactly two directions, ordered consistently with stadium_direction_idxs.
    stadium_direction_bar = aet.stack([stadium+deviation, stadium-deviation]).reshape((2,-1)).T.flatten()
    sigma_stadium_direction = pm.HalfCauchy("sigma_stadium_direction", 1)

    stadium_direction = pm.Normal("stadium_direction", mu=stadium_direction_bar, sigma=sigma_stadium_direction, shape=len(np.unique(stadium_direction_idxs)))
    
    z = intercept + pm.math.dot(fg_dist, d) + pm.math.dot(angle, θ)  + player[player_idxs] + stadium_direction[stadium_direction_idxs]
    p = pm.Deterministic("p", pm.math.invlogit(z))

    y_obs = pm.Binomial("y_obs", n=n, p=p, observed=y)

In [None]:
with model_stadium_pooled:
    logit_stadium_pooled_trace_jax = pm.sampling_jax.sample_numpyro_nuts(
        2000, tune=2000, target_accept=.9,chains=2)

In [None]:
az.style.use("arviz-darkgrid")
az.plot_trace(logit_stadium_pooled_trace_jax, var_names=('d','θ','sigma_player','sigma_stadium','deviation')) 

In [None]:
stadium_directions_pooled_df = pd.DataFrame({'Stadium':stadiums,'coef':abs(np.mean(logit_stadium_pooled_trace_jax.posterior.deviation,axis=(0,1)))}).sort_values(by='coef',ascending=False)

In [None]:
stadium_directions_pooled_df

From the model, Gillette Stadium has the largest deviation between the two directions of the stadium.

Lastly, we compare the performance of the models.

In [None]:
compare_dict = {"Logistic Model: Player + Stadium": logit_player_stadium_trace_jax,
                "Logistic Model: Player + Stadium direction": logit_stadium_direction_trace_jax,
                "Logistic Model: Player + Stadium direction pooled": logit_stadium_pooled_trace_jax
               }
df_compare = az.compare(compare_dict, ic="waic")
df_compare

In [None]:
_, ax = plt.subplots(1, 1, figsize=(10, 5))
az.plot_compare(df_compare, ax=ax);

The model with only a stadium effect performs the best, which suggests that stadium direction is probably only a small second-order effect. Because field goal distance and angle are the most important features and all three models share them, the three models perform similarly.
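
For intuition about the WAIC scores used in the comparison, here is a minimal NumPy sketch of the usual estimate: the log pointwise predictive density minus a penalty equal to the posterior variance of the pointwise log-likelihood. The data and the two "models" are synthetic and only illustrate the calculation; `elpd_waic` is a helper written for this example, not an ArviZ function.

```python
import numpy as np

def elpd_waic(log_lik):
    """Expected log predictive density from a (n_draws, n_obs) pointwise log-likelihood."""
    n_draws = log_lik.shape[0]
    # Log of the posterior-mean likelihood for each observation, computed stably.
    lppd = np.logaddexp.reduce(log_lik, axis=0) - np.log(n_draws)
    # Effective number of parameters: variance of the log-likelihood across draws.
    p_waic = log_lik.var(axis=0, ddof=1)
    return float(np.sum(lppd - p_waic))

def bernoulli_log_lik(y, p):
    """Pointwise Bernoulli log-likelihood for outcomes y and draws of p with shape (n_draws, 1)."""
    return y * np.log(p) + (1 - y) * np.log(1 - p)

rng = np.random.default_rng(2)
y = rng.binomial(1, 0.85, size=200)            # synthetic makes and misses

# Posterior draws of the make probability under two hypothetical models.
p_good = rng.beta(85, 15, size=(500, 1))       # centred near the true rate
p_bad = rng.beta(50, 50, size=(500, 1))        # centred at a coin flip

score_good = elpd_waic(bernoulli_log_lik(y, p_good))
score_bad = elpd_waic(bernoulli_log_lik(y, p_bad))
print(round(score_good, 1), round(score_bad, 1))
```

A higher (less negative) value indicates better expected out-of-sample fit, which is the quantity the three models are ranked on.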