In [1]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In this notebook, I take data from https://www.sports-reference.com/cfb/players/tua-tagovailoa-1.html to compare how last year's rookie QB class performed in the NFL relative to what they did in college. I'll use Bayesian methods to project their stats for the 2021 NFL season based on their statistics from the 2020 season and information about their college careers.

In [1]:
import pandas as pd
import numpy as np

The 4 rookie QBs who played a meaningful number of snaps last season were Joe Burrow, Jalen Hurts, Justin Herbert and Tua Tagovailoa. The following dataframes contain all their college stats.

In [1]:
burrow = pd.read_csv('/kaggle/input/collegefb/bur.csv')
tua = pd.read_csv('/kaggle/input/collegefb/tua.csv')
hurts = pd.read_csv('/kaggle/input/collegefb/jalen.csv')
herbert = pd.read_csv('/kaggle/input/collegefb/herbert.csv')

In [1]:
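# Drop rows with missing values
burrow = burrow.dropna()
# Keep the first four rows; .copy() avoids a SettingWithCopyWarning when a column is added later
herbert = herbert.iloc[0:4].copy()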

In [1]:
burrow['name'] = 'burrow'
tua['name'] = 'tua'
hurts['name'] = 'hurts'
herbert['name'] = 'herbert'

I'm computing aggregates over their college careers to summarize what they did on average (which means, unfortunately, that I'm not putting more weight on more recent seasons). I'll use this info as my 'prior', which represents my belief about how well they'll do in the future.

In [1]:
all_rookies = pd.concat([burrow, tua, hurts, herbert], axis=0)

# Career totals per player (flat column names instead of a ('Att', 'sum') MultiIndex)
agg_stats = all_rookies.groupby('name')[['Att', 'Yds', 'Cmp', 'TD', 'Int']].sum()

# Career yards per attempt
agg_stats['YPA'] = agg_stats['Yds'] / agg_stats['Att']
agg_stats

If you know anything about Bayesian inference, the idea is to update your prior beliefs (in our case, the college data) with a likelihood distribution (here, their stats from the 2020 NFL season). The resulting distribution (the posterior distribution) will give us our 2021 season projections. For explanations, I refer you to two great resources: http://varianceexplained.org/r/empirical_bayes_baseball/ and https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers.

Let's visualize the prior distributions for Joe Burrow. For TD, Int and Cmp (again, what these distributions really show are rates such as td_rate and int_rate, i.e. #TDs/#Att), I used a beta distribution, which is well suited to percentages and other values between 0 and 1. For YPA (yards per attempt) I used a normal distribution, since the beta distribution is confined to (0, 1) and YPA is usually somewhere between 6 and 12.

In [1]:
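# Burrow's college totals: 945 attempts, 78 TDs, 11 interceptions, 650 completions.
# Beta(successes, failures) has mean successes / attempts.
td = np.random.beta(78, 945 - 78, size=10000)
yds = np.random.normal(9.36, 3, size=10000)
intr = np.random.beta(11, 945 - 11, size=10000)
cmp = np.random.beta(650, 945 - 650, size=10000)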
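As a quick sanity check on how a beta prior relates to raw counts, here is a minimal sketch that builds the shape parameters from Burrow's college totals and reports the closed-form mean and standard deviation. The helper names `beta_from_counts` and `beta_mean_sd` are hypothetical, and the sketch assumes the second shape parameter counts attempts on which the event did not happen.

```python
import math

def beta_from_counts(events, attempts):
    """Return (alpha, beta) shape parameters for a Beta prior on a per-attempt rate.

    Assumes alpha counts events and beta counts attempts without the event.
    """
    return events, attempts - events

def beta_mean_sd(alpha, beta):
    """Closed-form mean and standard deviation of Beta(alpha, beta)."""
    total = alpha + beta
    mean = alpha / total
    var = alpha * beta / (total ** 2 * (total + 1))
    return mean, math.sqrt(var)

# Burrow's college totals used in the prior cell: 945 attempts.
attempts = 945
for label, events in [("TD", 78), ("Int", 11), ("Cmp", 650)]:
    a, b = beta_from_counts(events, attempts)
    mean, sd = beta_mean_sd(a, b)
    print(f"{label}: Beta({a}, {b}) mean={mean:.4f} sd={sd:.4f}")
```

With this parametrization the prior mean equals the raw career rate (for example 78/945 for touchdowns), which makes it easy to eyeball whether a plotted prior is centered where you expect.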

In [1]:
import warnings

warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

In [1]:
import seaborn as sns
import matplotlib.pyplot as plt

f, axes = plt.subplots(2, 2, figsize=(7, 7), sharex=False)

sns.despine(left=True)

sns.distplot(td, color="b", ax=axes[0, 0]).set_title("TD Rate Prior")
sns.distplot(yds, color="m", ax=axes[0, 1]).set_title("YPA Prior")
sns.distplot(intr, color="r", ax=axes[1, 0]).set_title("Int Rate Prior")
sns.distplot(cmp, color="g", ax=axes[1, 1]).set_title("Cmp% Prior")

plt.setp(axes, yticks=[])
plt.tight_layout()

Now let's visualize the associated likelihood distributions, which will be binomial for cmp, td_rate and int_rate, and normal for YPA. Again, this is just for Joe Burrow's 2020 NFL season. I took these numbers from: https://www.espn.com/nfl/player/_/id/3915511/joe-burrow.

In [1]:
td_nfl = np.random.binomial(404, 13/404, size=1000)
yds_nfl = np.random.normal(6.65, 4.5, size =1000)
intr_nfl = np.random.binomial(404, 5/404, size=1000)
cmp_nfl = np.random.binomial(404, 264/404, size=1000)
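To make "likelihood" concrete, the sketch below evaluates the binomial probability of Burrow's 13 touchdowns on 404 attempts across a grid of candidate TD rates and finds the rate that makes the observed season most probable. The function name `binomial_likelihood` and the grid resolution are illustrative assumptions, not part of the original analysis.

```python
import math

def binomial_likelihood(k, n, p):
    """Probability of k events in n attempts when the per-attempt rate is p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Burrow's 2020 line used in the cell above: 13 TDs on 404 attempts.
n_att, n_td = 404, 13

# Candidate rates on an evenly spaced grid (the resolution is an arbitrary choice).
grid = [i / 10000 for i in range(1, 2000)]
likelihood = [binomial_likelihood(n_td, n_att, p) for p in grid]

# The grid point with the highest likelihood should sit next to the observed rate.
best_rate = grid[likelihood.index(max(likelihood))]
print(f"rate with highest likelihood: {best_rate:.4f}")
print(f"observed rate 13/404:         {n_td / n_att:.4f}")
```

The likelihood is a function of the unknown rate with the data held fixed, which is a different object from the simulated counts plotted above.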

In [1]:
f, axes = plt.subplots(2, 2, figsize=(7, 7), sharex=False)

sns.despine(left=True)

sns.distplot(td_nfl, color="b", ax=axes[0, 0]).set_title("TD Rate Likelihood")
sns.distplot(yds_nfl, color="m", ax=axes[0, 1]).set_title("YPA Likelihood")
sns.distplot(intr_nfl, color="r", ax=axes[1, 0]).set_title("Int Rate Likelihood")
sns.distplot(cmp_nfl, color="g", ax=axes[1, 1]).set_title("Cmp% Likelihood")

plt.setp(axes, yticks=[])
plt.tight_layout()

Finally, we can now determine the posterior distribution for each player (I'll show you the results for Joe Burrow and summarize the rest). I'll make an approximation and say that roughly 75% of the player's college statistics will translate to the NFL (call it my strength of competition adjustment).

In [1]:
import pymc3 as pm

a and b represent the "75 percent" college stats (as a rate, i.e. td_rate = #TD/#Att), and c and d are stats from the 2020 season as referenced above. So any value passed as the 'a' parameter can be found by multiplying the numbers found in the prior distributions by 0.7 (or 1.7 in the case of interceptions).
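The scaling step can be sketched as follows. This is only an illustration under assumptions: the helper `scaled_beta_prior` is hypothetical, the rate is expressed per 100 pseudo-attempts to mirror the `(a, 100)` pairs passed to the model below, and the scale factors (0.75 for touchdowns, 1.7 for interceptions) are taken loosely from the text.

```python
def scaled_beta_prior(events, attempts, scale, pseudo_attempts=100):
    """Turn a college rate into a pair (a, b) for a Beta prior.

    a is the scaled rate expressed per `pseudo_attempts`; b is `pseudo_attempts`.
    Both the scale and the pseudo-attempt count are illustrative assumptions.
    """
    rate = events / attempts
    return scale * rate * pseudo_attempts, pseudo_attempts

# Burrow's college totals: 78 TDs and 11 interceptions on 945 attempts.
td_a, td_b = scaled_beta_prior(78, 945, scale=0.75)
int_a, int_b = scaled_beta_prior(11, 945, scale=1.7)

print(f"TD prior:  a={td_a:.2f}, b={td_b}")
print(f"Int prior: a={int_a:.2f}, b={int_b}")
```

These land close to, but not exactly on, the values of 6 and 1.9 used in the model calls, so some rounding or a slightly different multiplier was presumably applied by hand.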

In [1]:
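def run_model(a, b, c, d):
    """Sample a Beta(a, b) rate and plot it; c = 2020 attempts, d = 2020 rate."""
    model = pm.Model()

    with model:
        # Beta prior on the rate of interest (TD rate, interception rate, ...)
        params = pm.Beta('param_of_interest', a, b)
        # Binomial node with n=c and p=d. As written, p is the fixed value d rather
        # than `params`, so this node does not feed back into `param_of_interest`.
        observed = pm.Binomial('observed', c, d, observed=True)
        trace = pm.sample(1000, return_inferencedata=False)

        plot = pm.plot_posterior(trace)

    return plot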

In [1]:
run_model(6,100,404,13/404)

Across his posterior samples, the average TD rate is 0.056. Joe Burrow had a 0.032 TD rate last season. For next season we might predict a TD rate between 0.018 and 0.09; this is our 95% credible interval (the Bayesian version of a confidence interval). Taking the posterior sample average of 0.056 and assuming Burrow throws 550 passes, our prediction is that he will throw 30.8 touchdowns (0.056 × 550).

Note: Burrow got hurt last season, but had he finished the year, he probably would have thrown close to 20 TDs. 
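For a beta prior with binomial data, the posterior is also available in closed form, which gives a useful cross-check on any sampler output. The sketch below assumes the 2020 counts update the Beta(6, 100) prior directly; the helper names are hypothetical. Because this ties the 2020 counts to the rate itself, its posterior mean sits between the prior mean and the observed 2020 rate, and it need not match the sampled value quoted above.

```python
def beta_binomial_update(a, b, successes, attempts):
    """Conjugate update: a Beta(a, b) prior plus binomial data gives a Beta posterior."""
    return a + successes, b + (attempts - successes)

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

prior_a, prior_b = 6, 100   # prior passed to run_model above
n_td, n_att = 13, 404       # Burrow's 2020 touchdowns and attempts

post_a, post_b = beta_binomial_update(prior_a, prior_b, n_td, n_att)

print(f"prior mean:     {beta_mean(prior_a, prior_b):.3f}")
print(f"observed rate:  {n_td / n_att:.3f}")
print(f"posterior:      Beta({post_a}, {post_b})")
print(f"posterior mean: {beta_mean(post_a, post_b):.3f}")
```

The weight given to the prior is controlled by a + b: a prior built on 100 pseudo-attempts is pulled a long way by 404 real attempts, so a stronger belief in the college numbers would call for larger a and b at the same ratio.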

In [1]:
# Interception rate
run_model(1.9,100,404,5/404)

Similarly, we might project Burrow to throw 10 picks over 550 attempts next year. To project YPA (yards per attempt), I'll have to change the run_model function to use normal distributions.

In [1]:
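def run_model2(a, b, c, d):
    """Sample a Normal(a, b) YPA and plot it; c = 2020 YPA, d = its assumed sd."""
    model = pm.Model()

    with model:
        # Normal prior on yards per attempt (mean a, standard deviation b)
        params = pm.Normal('param_of_interest', a, b)
        # Normal node with mean c and sd d. As written, its mean is the fixed value c
        # rather than `params`, so it does not feed back into `param_of_interest`.
        observed = pm.Normal('observed', c, d, observed=True)
        trace = pm.sample(1000, return_inferencedata=False)

        plot = pm.plot_posterior(trace)

    return plot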

Our model projects Burrow to throw for about 3,900 yards on 550 pass attempts.

In [1]:
run_model2(7.02,3,6.7,3)
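A normal prior combined with a normal observation of known spread also has a closed-form posterior. The following is a minimal sketch, assuming the 2020 YPA of 6.7 is treated as a single observation with standard deviation 3; the function `normal_update` is hypothetical and is not the sampler used above.

```python
import math

def normal_update(prior_mean, prior_sd, obs, obs_sd):
    """Posterior mean and sd for a Normal mean given one observation with known sd."""
    prior_prec = 1 / prior_sd ** 2   # precision = 1 / variance
    obs_prec = 1 / obs_sd ** 2
    post_prec = prior_prec + obs_prec
    post_mean = (prior_mean * prior_prec + obs * obs_prec) / post_prec
    return post_mean, math.sqrt(1 / post_prec)

# Same four numbers passed to run_model2 above.
post_mean, post_sd = normal_update(7.02, 3, 6.7, 3)

print(f"posterior YPA: {post_mean:.2f} (sd {post_sd:.2f})")
print(f"yards on 550 attempts: {post_mean * 550:.0f}")
```

With equal standard deviations the posterior mean is simply the midpoint of the scaled college YPA and the 2020 YPA; changing either sd shifts the weight toward whichever source is treated as more precise.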

The results for the other players are summarized below:

Remember, I'm using 550 attempts to project the rate statistics forward (Herbert threw close to 600 passes last season). As you can see, based on really strong college statistics, Tua is projected for a huge season. By contrast, the projection is pretty soft on Herbert and considers him very similar to Hurts (their numbers in college are almost identical, with Hurts winning out by a small edge). And as we saw before, Burrow is projected to have a good second year.

In [1]:
d = {'QB': ['Burrow', 'Herbert', 'Hurts', 'Tua'], 
     'Projected TDs': [30, 28, 29, 37],
     'Projected Ints': [10, 16, 17, 13],
     'Projected Yards': [3900, 3410, 3685, 4450]
    }

df = pd.DataFrame(data=d)

df
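The conversion from rates to the season totals in the table is simple multiplication by the assumed number of attempts. Here is a minimal sketch for Burrow, where `project_season` is a hypothetical helper, the TD rate and YPA are the values quoted earlier, and the interception rate is assumed to be the mean of a Beta(1.9, 100) distribution.

```python
def project_season(td_rate, int_rate, ypa, attempts=550):
    """Scale per-attempt rates up to season totals for a given number of attempts."""
    return {
        "Projected TDs": td_rate * attempts,
        "Projected Ints": int_rate * attempts,
        "Projected Yards": ypa * attempts,
    }

# Burrow: TD rate 0.056 and YPA 7.02 from the text; interception rate is an assumption.
burrow_proj = project_season(td_rate=0.056, int_rate=1.9 / (1.9 + 100), ypa=7.02)

for stat, value in burrow_proj.items():
    print(f"{stat}: {value:.1f}")
```

Passing a different `attempts` value (for example, closer to 600 for Herbert) rescales every total proportionally, which is worth keeping in mind when comparing the four rows.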