# faNFL – Exploring the possibilities of predicting NFL player performance for fantasy NFL

This is supposed to be a readable [IPython](http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html) notebook that gently introduces the techniques, packages, and methods (oh, and pitfalls) as we approach the goal of predicting NFL player performances by trying out different ideas.

[Greg Sieranski](http://wonbyte.com) (1) and [Samuel John](http://samueljohn.de) (2)

1.  Walmart, USA
2.  HörSys GmbH, Hannover, Germany

This is not an official project of either Walmart or HörSys; it represents our own views. We are very happy that our employers supported us in doing this and made it possible for us to present it at PyCon 2015 as [faNFL - Exploring the possibilities of predicting NFL player performance for Fantasy NFL](https://us.pycon.org/2015/schedule/presentation/433/). Thanks!

#### Download and Contribute

We started out [privately on BitBucket](https://bitbucket.org/samueljohn/fanfl) (now public!), but are switching to GitHub at [github.com/wonbyte/fanfl](https://github.com/wonbyte/fanfl). The notebooks on BitBucket got too messy, and we are cleaning up and documenting the stuff right now at the PyCon 2015 code sprints.

We'd love this to be an open-research-style, collaborative exploration of predicting NFL player performances. Maybe we can all learn a bit from this.

Please open [issues](https://github.com/wonbyte/fanfl/issues) (ideas, todos) and pull requests (bugs, improvements) on GitHub. We encourage you to use the [wiki](https://github.com/wonbyte/fanfl/wiki) for new ideas and such, and/or ping us on Twitter if you like it: [@samueljohn_de](https://twitter.com/samueljohn_de) and [@wonbyte](https://twitter.com/wonbyte)

## Preface

This IPython notebook documents our different approaches to tackling NFL player statistics. Instead of just presenting the final, polished end result, we try to make this notebook an interesting read with a lot of comments and discussion that tell the story of how we tried to do it. Exploring a data set is not a one-shot affair but more or less a trial-and-error process that involves some necessary intuition, plotting, coming up with hypotheses, verifying them, and some black magic in preparing the "features" for the machine learning.

## Introduction

How far can we get with the statistical and machine learning tools of the Python ecosystem when tackling an interesting real-world question: predicting the performance of individual NFL players based on historic data? In the rise (hype?) of “big data”, how important are good models for training a predictor vs. just taking the brute-force approach of checking all correlations to perform the predictions?

How well can one jump-start interesting real-world analysis and prediction within the Python ecosystem?

How close can we get to Yahoo’s predictions? Can we beat them with open source machine learning/statistics tools? (We have not yet beaten them as of the time of writing.)

Fantasy football is an online competition in which users compete against one another as general managers of virtual teams. The performance of the players on a virtual team is based on their real-world performance. Each week, users can perform different actions that simulate running a professional football organization. Fantasy football has vastly increased in popularity, mainly because fantasy football providers such as ESPN, Yahoo! Fantasy Sports, and the NFL are able to keep track of statistics entirely online. The virtual teams are ranked using the performance in the real-world games, so predicting the real-world performance of players can give the virtual general manager an advantage.

Using our fork of NFLGame (we ported the library to Python 3) to directly get statistics from NFL Game Center, we are able to produce a big pandas panel data structure of historic performance of players. This data structure is much more convenient for explorative data analysis and further processing than REST (web) APIs. We started directly with Python 3.4 for this project and the libs and tools we use include IPython, numpy, scipy, pandas, seaborn/matplotlib, sklearn, requests and python-yahooapi.

From simple counting over correlation analysis to building models as a basis for statistical evaluation and machine learning tools (provided by sklearn), we are addressing our main question: How important are carefully hand-crafted performance models for the different learning algorithms vs. how far can we get by "counting numbers"?

## Setup

For install instructions, please see the [README.md](https://github.com/wonbyte/fanfl/blob/master/README.md).

#### Python data and science eco system

We basically just need the "typical" Python packages that you can get, for example, with the [Anaconda](http://www.continuum.io/downloads#py34) distribution. Loading the pre-computed pandas data frame requires PyTables, which you can install with `conda install tables`.

If the following cell evaluates, then you are fine.

In [0]:
import numpy as np  # The basis for typed, high speed array data types.
import pandas as pd  # tabular data on steroids
import matplotlib.pyplot as plt  # the plots work-horse
import seaborn as sns  # nicer statistical plots
import sklearn  # machine learning. Pure awesome.
from itertools import chain  # included in Python's stdlib.
from collections import defaultdict  # stdlib; Like a dict.
from pathlib import Path  # Nice OOP path handling

#### Progress bars

We use pyprind to display nice progress bars here and there for longer-running computations when you execute this notebook. Install it with `pip install pyprind`.

In [0]:
import pyprind

#### nflgame

We use the [nflgame](https://github.com/BurntSushi/nflgame) Python module to load the official NFL stats. However, we currently maintain [our own fork that has been ported to Python 3](https://github.com/samueljohn/nflgame). This fork is currently a git submodule of this repository, so if you cloned this repository with git, you may still need to run `git submodule init` and/or `git submodule update`.

In [0]:
import sys; sys.path.insert(0, "./nflgame")
import nflgame  # load NFL statistics

#### Inline graphics and retina

Tune IPython notebook towards inline retina graphics

In [0]:
%matplotlib inline
#%config InlineBackend.figure_format='svg'
%config InlineBackend.figure_format='retina'
sns.set()

## Loading Data from `nflgame` and a First Look

The `nflgame` module provides us with all the raw data, but the interface is a bit cumbersome, and pandas data frames are infinitely better and faster at juggling all the data. We will go through all the available data for each game, fill a pandas data frame, and from there on only work with the data frame. You can think of a pandas data frame as an Excel table with column and row headings but without the sucking "Excel" part around it ;-)

The API for nflgame is documented at <http://pdoc.burntsushi.net/nflgame>

But first, let's load the games stored in nflgame. These are just lists of all nflgame games as `nflgame.Game` objects. We transform them into a useful pandas data frame further below. Python's `help` shows that we can get the REGular season games and the POST season games.

In [0]:
help(nflgame.games)

The oldest data available in nflgame is from 2009. As we have seen from the help (above cell), we can select either "REG" or "POST" as the kind of games. Since we cannot call this function for a range of years, we iterate over each year and over the kind.

In [0]:
def games(years=(2009, 2010, 2011, 2012, 2013, 2014)):
    if isinstance(years, int):  # allow a single year, e.g. games(2014)
        years = [years]
    for year in years:
        for kind in ['REG', 'POST']:
            for game in nflgame.games(year, kind=kind):
                yield game

Because `games()` returns a generator, we have to call `next` on it to get the next game out of it. Alternatively, we could call `list` on it to explicitly run through all of it and allocate all the memory at once.

In [0]:
all_games = games()
for i in range(3):
    print(next(all_games))
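To see this lazy behavior without loading any NFL data, here is a minimal sketch with a stand-in generator. `fake_games` and the strings it yields are hypothetical and only mimic the shape of `games()` above; `itertools.islice` from the standard library takes the first few items without consuming the rest.

```python
from itertools import islice


def fake_games(years=(2009, 2010)):
    """Hypothetical stand-in for games(): yields labels, not nflgame games."""
    for year in years:
        for kind in ("REG", "POST"):
            yield "{} {}".format(year, kind)


gen = fake_games()
first_two = list(islice(gen, 2))  # consumes only the first two items
rest = list(gen)                  # the generator continues where it stopped
print(first_two)
print(rest)
```

Once a generator is exhausted it stays empty, so call the function again to start over.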

However, this does not look all that informative (we are looking at the first few only). Diving into those objects reveals the stats that are stored in each of them:

In [0]:
some_game = next(all_games)
print(some_game)
some_game

Still not that useful until we `print` it. `print` uses `str()` instead of `repr()`; the latter is what IPython uses to display objects. We assign the game to a variable so we can interactively peek into it with IPython's tab completion. Note that nflgame often returns generators; to view them you have to iterate over them in a `for` loop or a list comprehension, or call `list()` explicitly on 'em.

In [0]:
some_game.loser
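The difference between `str()` and `repr()` is easy to try out on a small class. This is only an illustration: `ToyGame` is a hypothetical class and not how nflgame implements its `Game` objects.

```python
class ToyGame:
    """Hypothetical toy class to contrast str() and repr()."""

    def __init__(self, home, away):
        self.home = home
        self.away = away

    def __repr__(self):
        # Used by repr(); IPython shows this for a bare name at the end of a cell.
        return "ToyGame(home={!r}, away={!r})".format(self.home, self.away)

    def __str__(self):
        # Used by str() and therefore by print().
        return "{} at {}".format(self.away, self.home)


g = ToyGame("PIT", "TEN")
as_str = str(g)
as_repr = repr(g)
print(as_str)
print(as_repr)
```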

In [0]:
[print(d) for d in some_game.drives];

So these are the drives. Actually, this is a text representation based on some internal data. Not sure yet how to make something out of them. Any ideas, anybody?

In [0]:
some_drive = list(some_game.drives)[3]

In [0]:
some_drive.total_yds

A drive consists of several `plays`, and we have some interesting variables to access. The involved players are probably inside the `nflgame.game.Play` objects in the `plays` attribute.

In [0]:
[ str(p) for p in some_drive.plays ]

And these seem to be the players in that game:

In [0]:
some_play = list(some_drive.plays)[0]

In [0]:
print(some_game.players)

In [0]:
some_player = next(iter(some_game.players))
str(some_player)

In [0]:
some_player.passer_rating()

I wonder if we could use this value directly to correlate it to some other stats.

From the player in a game (that is our "some_player" here), we can get the general object containing all the information available for this player via the `player` attribute. For example his typical position:

In [0]:
some_player.team

Look, here are the team stats, in case we want to correlate these against individual players:

In [0]:
some_game.stats_home

In [0]:
some_game.home

Getting all players of the "home" side would work like this:

In [0]:
[pl.name for pl in some_game.players if pl.team == some_game.home]

In [0]:
some_player.playerid

In [0]:
some_player.player.birthdate

In [0]:
some_player.player.years_pro

In [0]:
some_player.player.profile_url

In [0]:
some_player.player.name

At this point, we have found out how to get a list of all the games (in the years we want), how to access the drives and the plays in each of those games, and how to get the players of a game. For the drives (and plays) we also know how to get the involved players and what they did in that part of the game. Let's move on.

## Some helper functions

Getting the player ids for a certain year that are in the stats (not all players are).

In [0]:
def active_players_in_year(year, kind=('REG', 'POST')):
    """Return the set of ids of the players active in the given year.
    
    year: An int for the year of the season.
    kind: A sequence with the kinds of games. Default ('REG', 'POST').
    """
    players = set()
    for k in kind:
        for game in nflgame.games(year, kind=k):
            players.update(p.playerid for p in game.players)
    return players

It is often handy to look up the name of a player when the system spits out an id.

In [0]:
def lookup_player_name(player_id):
    "Return the name (str) of a player, given his player_id (str)."
    return nflgame.players[player_id].name

In [0]:
lookup_player_name(some_player.playerid)

In [0]:
def opposite(side):
    return "home" if side=="away" else "away"

This function goes through all given games and looks for the keys in the different stats that nflgame provides. For example, a key in keys is `("passing", "yds")`, which means to look for the "yds" key in the "passing" category. If you can come up with a better/faster way to pull out the stats from nflgame, please let us know (PRs welcome).

## Filling pandas DataFrames with data

We need to define the `nflgame` keys (for the games) that are interesting to us in order to build up a DataFrame.

In [0]:
offense_keys= ['receiving_tds',
               'receiving_yds',
               'receiving_rec',
               'receiving_lng',
               'receiving_twoptm',
               'passing_yds',
               'passing_tds',
               'passing_att',
               'passing_cmp',
               'passing_ints',
               'rushing_yds',
               'rushing_tds',
               'rushing_att',
               'rushing_lng',
               'fumbles_yds',
               'fumbles_rcv',
               'fumbles_tot',
               'kickret_tds',
               'kickret_avg',
               'kickret_ret',
               'kicking_fga',
               'kicking_fgyds',
               'kicking_xpb',
               'kicking_xpmade',
               'kicking_fgm',
               'puntret_lng',
               'puntret_ret',
               'puntret_tds',
               'puntret_avg']

In [0]:
defense_keys=['defense_ffum',
              'defense_ast',
              'defense_int',
              'defense_sk',
              'defense_tkl',
              'punting_yds',
              'punting_i20',
              'punting_avg',
              'punting_lng']

Let's build a big DataFrame (that is, a table) with all the games we are interested in and how all players performed in those games. When a player has no value for a certain stat in a game, we fill in `None`, which pandas shows as `NaN`.

### `playergames` – how all players did in all games

Go through all games and all players and make a pandas DataFrame that has a MultiIndex of `(player_id, eid)` for the player id and the game id. This means that two labels together identify each row. The columns in the DataFrame are a few hand-picked ones ("side", "team", "op_team", "season") plus the `keys` that we pass in.

In [0]:
def build_player_dataframe(keys, games):
    columns = ['side', 'team', 'op_team', 'season'] + keys
    tmp_list = []
    tmp_index = []
    for game in games:
        for player in game.players:
            is_home = player.home
            stats = player.stats
            tmp_index.append((player.playerid, game.eid))
            tmp_list.append(["home" if is_home else "away",  # side
                             player.team,  # team
                             game.home if not is_home else game.away,  #op_team
                             game.season()] +  # season
                             [stats[c] if player.has_cat(c) else None for c in keys]  # remaining data for each key (like passing_yds)
                            )
    return pd.DataFrame(tmp_list,
                        index=pd.MultiIndex.from_tuples(tmp_index, names=('player_id', 'eid')),
                        columns=columns)

In [0]:
playergames = build_player_dataframe(keys=offense_keys+defense_keys, games=games())
playergames
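To get a feeling for the `(player_id, eid)` MultiIndex without running nflgame, here is a small sketch on made-up data. The ids, stats, and the name `toy` are hypothetical; only the index layout follows `build_player_dataframe` above.

```python
import pandas as pd

# Hypothetical rows; None marks a stat the player did not have in that game.
index = pd.MultiIndex.from_tuples(
    [("p1", "g1"), ("p1", "g2"), ("p2", "g1")],
    names=("player_id", "eid"))
toy = pd.DataFrame(
    [["home", 250.0, None],
     ["away", 310.0, None],
     ["away", None, 42.0]],
    index=index,
    columns=["side", "passing_yds", "rushing_yds"])

one_player = toy.loc["p1"]             # all games of player p1, indexed by eid
one_game = toy.xs("g1", level="eid")   # all players of game g1, indexed by player_id
one_cell = toy.loc[("p2", "g1"), "rushing_yds"]  # a single value
missing = toy["passing_yds"].isna()    # True where the stat was None
print(one_player)
print(one_game)
```

Selecting with the first label alone drops that level from the result, while `xs` is the convenient way to select on the second level.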

With this `playergames`, we can quickly access the performance of each player in each game and with pandas `groupby` we can access the collected stats.
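As a sketch of what such a `groupby` could look like, the following uses a made-up frame with the same index layout as `playergames`. The values and the names `toy`, `per_player`, and `per_player_season` are hypothetical.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("p1", "g1"), ("p1", "g2"), ("p2", "g1"), ("p2", "g2")],
    names=("player_id", "eid"))
toy = pd.DataFrame({"season": [2013, 2014, 2013, 2014],
                    "passing_yds": [250.0, 310.0, 100.0, 180.0]},
                   index=index)

# Career total and per-game mean of passing yards for each player.
per_player = toy.groupby(level="player_id")["passing_yds"].agg(["sum", "mean"])

# Total passing yards per player and season (index level plus column as keys).
per_player_season = toy.groupby(["player_id", "season"])["passing_yds"].sum()
print(per_player)
print(per_player_season)
```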

### games – how the teams did in each game

Now, not looking at individual players but at the **team stats per game**. Either by summing up how the players did, or by directly querying `nflgame`.

In [0]:
def build_games_dataframe(games):
    tmp_list = []
    tmp_index = []
    # Using the field names of the TeamStats namedtuple but not the last one (pos_time),
    # because we have to special case that.
    column_names = ["team", "op_team", "win", "score"] + list(nflgame.game.TeamStats._fields[:-1]) + ["pos_time"]
    for game in games:
        for side in ('home', 'away'):
            if side == 'home':
                stats = game.stats_home
                team = game.home
            else:
                stats = game.stats_away
                team = game.away
            tmp_index.append((game.eid, side))
            tmp_list.append( [game.home if side=='home' else game.away,
                              game.away if side=='home' else game.home,
                              game.winner==team,
                              game.score_home if side=='home' else game.score_away] +
                              list(stats[:-1]) +  # all fields except pos_time
                              [stats.pos_time.total_seconds()])
    return pd.DataFrame(tmp_list, 
                        index=pd.MultiIndex.from_tuples(tmp_index, names=('eid', 'side')),
                        columns=column_names)


In [0]:
allgames = build_games_dataframe(games())

Getting all the games of the team "PIT" works by first doing a `groupby("team")` and then getting "PIT" from the groups `dict`. However, this returns indices that we have to put into `allgames.loc[]` in order to get the actual data.

In [0]:
allgames.loc[allgames.groupby("team").groups["PIT"]]
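The same selection can be written in a few equivalent ways. The sketch below uses a made-up frame in the layout of `allgames` (the scores and the name `toy_games` are hypothetical) and compares the `groups` lookup with `get_group` and with a plain boolean mask.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("g1", "home"), ("g1", "away"), ("g2", "home"), ("g2", "away")],
    names=("eid", "side"))
toy_games = pd.DataFrame({"team": ["PIT", "TEN", "BAL", "PIT"],
                          "score": [13, 10, 20, 23]}, index=index)

grouped = toy_games.groupby("team")
labels = grouped.groups["PIT"]            # only the row labels of the group
via_labels = toy_games.loc[labels]        # look the rows up by label
via_get_group = grouped.get_group("PIT")  # the rows directly
via_mask = toy_games[toy_games["team"] == "PIT"]  # no groupby needed
print(via_mask)
```

For a single team the boolean mask is probably the shortest spelling; the `groupby` pays off when several teams are processed in one go.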

#### Using pivot_table to query some interesting relations

The `pivot_table` function of pandas lets us turn cell values into column headings:

In [0]:
sns.heatmap(pd.pivot_table(allgames, values='score', index="team",
               columns=["op_team"], aggfunc=np.sum))
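What the pivot does is easier to see on a handful of rows. This is a sketch on made-up scores (the frame `toy_games` is hypothetical), using the string `"sum"` as the aggregation function.

```python
import pandas as pd

toy_games = pd.DataFrame({
    "team":    ["PIT", "TEN", "PIT", "BAL", "TEN", "PIT"],
    "op_team": ["TEN", "PIT", "BAL", "PIT", "PIT", "TEN"],
    "score":   [13, 10, 23, 20, 17, 24]})

# One row per team, one column per opponent, cells hold the summed score.
table = pd.pivot_table(toy_games, values="score", index="team",
                       columns="op_team", aggfunc="sum")
print(table)
```

Pairs that never played each other, such as a team against itself, end up as `NaN`.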

In [0]:
allgames.groupby("op_team").groups["PIT"]

## OPSS – Opponent Players Summed up Stats

The idea is to represent an opponent team by summing up all the stats for all of its players from all of their previous games. Let's see if there are some correlations that could be exploited.
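One possible reading of this idea, shown as a rough sketch on made-up data: sum the stats of all players of a team within each game, then keep a running total per team that excludes the current game. The column layout, the frame names, and the assumption that game ids sort chronologically are ours for illustration and are not fixed by the notebook.

```python
import pandas as pd

# Hypothetical per-player, per-game rows.
rows = pd.DataFrame({
    "eid":         ["g1", "g1", "g1", "g2", "g2", "g3", "g3", "g3"],
    "team":        ["PIT", "PIT", "TEN", "PIT", "PIT", "PIT", "PIT", "TEN"],
    "player_id":   ["p1", "p2", "p3", "p1", "p2", "p1", "p2", "p3"],
    "passing_yds": [200.0, 0.0, 150.0, 250.0, 0.0, 300.0, 0.0, 100.0],
    "rushing_yds": [5.0, 80.0, 0.0, 0.0, 60.0, 10.0, 90.0, 20.0]})
stat_cols = ["passing_yds", "rushing_yds"]

# Step 1: sum over all players of a team within each game.
# The result is sorted by (team, eid); eids are assumed to sort by date.
per_game = rows.groupby(["team", "eid"])[stat_cols].sum()

# Step 2: running total per team minus the current game, so every row
# only holds what the team had accumulated before that game.
opss = per_game.groupby(level="team").cumsum() - per_game
print(opss)
```

A team's first game gets all zeros here, which would need special handling before feeding these numbers into a model.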

## Split Training and Test data

It is very important to split our available data into two sets so that we can detect overfitting. Usually an 80%/20% split into train/test is done.
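A random 80%/20% split can be sketched with plain pandas. The frame `toy` and its values are made up; scikit-learn also offers `sklearn.model_selection.train_test_split` for the same purpose.

```python
import pandas as pd

toy = pd.DataFrame({"season": [2009, 2010, 2011, 2012, 2013] * 2,
                    "passing_yds": [210.0, 250.0, 190.0, 300.0, 280.0,
                                    120.0, 90.0, 160.0, 140.0, 200.0]})

# Draw 80% of the rows at random; random_state makes the draw repeatable.
train = toy.sample(frac=0.8, random_state=0)
# Everything that was not drawn becomes the test set (needs a unique index).
test = toy.drop(train.index)
print(len(train), len(test))
```

Since the games are ordered in time, holding out the most recent season instead of random rows may be worth considering as an alternative.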

## Predict and Evaluate functions

To be able 

In [0]:
def predict(what="passing_yds", model=None, op_players=None):
    """
    - for all test data, look up the opponents...
    - go through all models

    op_players: list of player_ids (default: None).
    """
    

In [0]:
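def evaluate(test_data, what=("passing_yds",)):
    """TODO: evaluate the predictions for the stats in `what` on `test_data`."""
    raise NotImplementedError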
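How predictions should be scored is still open at this point in the notebook. Purely as an assumption on our side, the sketch below scores a naive baseline (each player's mean from the training data) with the mean absolute error. The data, the baseline, and the helper `mean_absolute_error` are hypothetical and not the method used by the project.

```python
import pandas as pd


def mean_absolute_error(actual, predicted):
    """Average absolute difference between two aligned Series."""
    return (actual - predicted).abs().mean()


train = pd.DataFrame({"player_id": ["p1", "p1", "p2", "p2"],
                      "passing_yds": [200.0, 300.0, 100.0, 140.0]})
test = pd.DataFrame({"player_id": ["p1", "p2"],
                     "passing_yds": [260.0, 100.0]})

# Baseline "model": predict each player's mean from the training data.
player_means = train.groupby("player_id")["passing_yds"].mean()
predicted = test["player_id"].map(player_means)
error = mean_absolute_error(test["passing_yds"], predicted)
print(error)
```

Any real model should at least beat such a baseline on the held-out test data.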