# faNFL – Exploring the possibilities of predicting NFL player performance for fantasy NFL

This is supposed to be a readable [IPython](http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html) notebook that gently introduces the techniques, packages, and methods (oh, and pitfalls) as we approach the goal of predicting NFL player performances by trying out different ideas.

[Greg Sieranski](http://wonbyte.com) (1) and [Samuel John](http://samueljohn.de) (2)

1.  Walmart, USA
2.  HörSys GmbH, Hannover, Germany

This is not an official project of either Walmart or HörSys; it represents our own views. We are nevertheless very happy that our employers fully supported this work, and we are glad to have been able to present it at PyCon 2015 as [faNFL - Exploring the possibilities of predicting NFL player performance for Fantasy NFL](https://us.pycon.org/2015/schedule/presentation/433/). Thanks!

#### Download and Contribute

We started out [privately on BitBucket](https://bitbucket.org/samueljohn/fanfl) (now public!), but are just switching to GitHub at [github.com/wonbyte/fanfl](https://github.com/wonbyte/fanfl). The notebooks on BitBucket got too messy, and we are cleaning up and documenting the stuff right now at the PyCon 2015 code sprints.

We'd love this to be an open, research-style, collaborative exploration of predicting NFL player performances. Maybe we can all learn a bit from this.

Please [open issues](https://github.com/wonbyte/fanfl/issues) for ideas, todos, and bugs, and send pull requests with improvements on GitHub. We encourage you to use the [wiki](https://github.com/wonbyte/fanfl/wiki) for new ideas and such, and/or ping us on Twitter if you like it: [@samueljohn_de](https://twitter.com/samueljohn_de) and [@wonbyte](https://twitter.com/wonbyte)

## Preface

This IPython notebook documents our different approaches to tackling NFL player statistics. Instead of just presenting the final, polished end result, we try to make this notebook an interesting read with a lot of comments and discussion that tell the story of how we tried to do it. Exploring a data set is not a one-shot affair but more or less a trial-and-error process that involves some necessary intuition, plotting, coming up with hypotheses, verifying them, and some black magic in preparing the "features" for the machine learning.

## Introduction

How far can we get with the statistical and machine learning tools of the Python ecosystem when tackling an interesting real-world question: predicting the performance of individual NFL players based on historic data? With the rise (hype?) of “big data”, how important are good models for training a predictor vs. just taking the brute-force approach of checking all correlations to perform the predictions?

How well can one jump-start interesting real-world analysis and prediction within the Python ecosystem?

How close can we get to Yahoo’s predictions? Can we beat them with open source machine learning/statistics tools? (We have not yet beaten them as of the time of writing.)

Fantasy football is an online competition where users compete against one another as general managers of virtual teams. The performance of the players in a virtual team is based on their real-world performance. Each week, users are able to perform different actions, simulating a professional football organization. Fantasy football has vastly increased in popularity, mainly because fantasy football providers such as ESPN, Yahoo! Fantasy Sports, and the NFL are able to keep track of statistics entirely online. The virtual teams are ranked by the performance in the real-world games; therefore, predicting the real-world performance of players can give the virtual general manager an advantage.

Using our fork of NFLGame (we ported the library to Python 3) to directly get statistics from NFL Game Center, we are able to produce a big pandas panel data structure of historic performance of players. This data structure is much more convenient for explorative data analysis and further processing than REST (web) APIs. We started directly with Python 3.4 for this project and the libs and tools we use include IPython, numpy, scipy, pandas, seaborn/matplotlib, sklearn, requests and python-yahooapi.

From simple counting over correlation analysis to building models as a basis for statistical evaluation and machine learning tools (provided by sklearn), we are addressing our main question: How important are carefully hand-crafted performance models for the different learning algorithms vs. how far can we get by "counting numbers"?

## Setup

For install instructions, please see the [README.md](https://github.com/wonbyte/fanfl/blob/master/README.md).

#### Python data and science ecosystem

We basically just need the "typical" Python packages that you can get, for example, with the [Anaconda](http://www.continuum.io/downloads#py34) distribution. For loading the pre-computed pandas data frame, PyTables is needed; you can install it with `conda install tables`.

If the following cell evaluates, then you are fine.

In [0]:
import numpy as np  # The basis for typed, high speed array data types.
import pandas as pd  # tabular data on steroids
import matplotlib.pyplot as plt  # the plots work-horse
import seaborn as sns  # nicer statistical plots
import sklearn  # machine learning. Pure awesome.
import itertools  # included in Python's stdlib.
from collections import defaultdict  # stdlib; Like a dict.
from pathlib import Path  # Nice OOP path handling
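
If you would rather get a list of what is missing than an `ImportError` on the first absent package, a check along the following lines may help. This is only a sketch: `missing_packages` is a hypothetical helper name that is not part of this project, and the check only covers top-level package names.

```python
import importlib.util


def missing_packages(names):
    """Return those top-level package names that cannot be found.

    Hypothetical helper (not part of fanfl). It only checks whether a
    package is discoverable; it does not import or version-check it.
    """
    return [name for name in names
            if importlib.util.find_spec(name) is None]


# The third-party packages imported in the cell above.
wanted = ["numpy", "pandas", "matplotlib", "seaborn", "sklearn"]
print("Missing packages:", missing_packages(wanted) or "none")
```

An empty result means all listed packages were found; it says nothing about whether their versions are recent enough.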

#### Progress bars

We use pyprind to display nice progress bars here and there on longer-running computations when you execute this notebook. Install it with `pip install pyprind`.

In [0]:
import pyprind

#### nflgame

We use the [nflgame](https://github.com/BurntSushi/nflgame) Python module to load the official NFL stats. However, we currently maintain [our own fork that has been ported to Python 3](https://github.com/samueljohn/nflgame). This fork is currently a git submodule of this repository, so if you git-cloned this repository, you may still need to run `git submodule init` and/or `git submodule update`.

In [0]:
import sys; sys.path.insert(0, "./nflgame")
import nflgame  # load NFL statistics

#### Inline graphics and retina

Tune the IPython notebook for inline retina graphics.

In [0]:
%matplotlib inline
#%config InlineBackend.figure_format='svg'
%config InlineBackend.figure_format='retina'
sns.set()

## Loading Data and a First Look

So nflgame provides us with all the raw data, but the interface is a bit cumbersome, and pandas data frames are infinitely better and faster at juggling all the data. We go through all the available data for each game, fill a pandas data frame, and from there on only work with the data frame. You can think of a pandas data frame as an Excel table with column and row headings but without the sucking "excel" part around it ;-)
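
To make the "table with row and column headings" picture concrete, here is a tiny sketch. The player labels and numbers are made up purely for illustration and do not come from nflgame.

```python
import pandas as pd

# Made-up toy numbers, only to illustrate the table-like structure.
rows = [
    {"player": "QB A", "passing_yds": 250, "passing_tds": 2},
    {"player": "QB B", "passing_yds": 310, "passing_tds": 3},
    {"player": "QB C", "passing_yds": 180, "passing_tds": 0},
]
df = pd.DataFrame(rows).set_index("player")  # row headings = player labels

# Select a single cell by row label and column label.
yds_b = df.loc["QB B", "passing_yds"]

# Column-wise operations need no explicit loop.
mean_yds = df["passing_yds"].mean()

# Boolean filtering: keep rows with at least two passing touchdowns.
two_plus = df[df["passing_tds"] >= 2]
print(two_plus)
```

Label-based selection and filtering like this is what makes the data frame so much more pleasant than walking nested game objects by hand.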

The API for nflgame is documented at <http://pdoc.burntsushi.net/nflgame>

Let us first load the games stored in nflgame. These are just lists of `nflgame.Game` objects, one per game. We transform them into a useful pandas data frame further below. Executing the following cell takes a few seconds.

In [0]:
games = list(itertools.chain(*(nflgame.games(year)
                               for year in [2009, 2010, 2011, 2012, 2013, 2014])))
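
The cell above flattens one list of games per year into a single list. The same pattern can be written with `itertools.chain.from_iterable`, which avoids the `*` unpacking. In the sketch below, `games_for` is a hypothetical stand-in for `nflgame.games(year)` so that the snippet runs without nflgame.

```python
import itertools


def games_for(year):
    """Hypothetical stand-in for nflgame.games(year); returns two labels."""
    return ["{}-game-{}".format(year, i) for i in range(1, 3)]


years = [2009, 2010, 2011]

# chain.from_iterable lazily walks each per-year list in turn;
# list() then collects everything into one flat list.
flat = list(itertools.chain.from_iterable(games_for(y) for y in years))
print(flat)
```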

In [0]:
print("There are {} games in total for the given years.".format(len(games)))
games[:3]

However, this does not look all that informative (we are looking at the first few only). Diving into those objects reveals the stats that are stored in each of them:

In [0]:
some_game = games[0]
print(some_game)
some_game

Still not that useful until we `print` it. It seems that `print` triggers a different representation than `__repr__`, which IPython uses to display objects. We assign the game to a variable so we can interactively peek into it with IPython's tab completion. Note that nflgame returns a lot of generators; to view them you have to iterate over them in a `for` loop or a list comprehension, or call `list()` or `next()` on 'em.
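
As a reminder of how generators behave, here is a minimal sketch with a plain Python generator. `drives` is a hypothetical stand-in; nflgame's own sequence objects may behave differently in detail (for example, they may allow repeated iteration).

```python
def drives():
    """Hypothetical stand-in for something like some_game.drives."""
    for name in ["drive 1", "drive 2", "drive 3"]:
        yield name


gen = drives()
first = next(gen)     # pulls exactly one item
rest = list(gen)      # consumes everything that is left
leftover = list(gen)  # a plain generator is now exhausted

# A list comprehension over a fresh generator sees all items again.
as_strings = [d.upper() for d in drives()]
print(first, rest, leftover, as_strings)
```

The exhausted `leftover` list is the classic pitfall: once a plain generator has been consumed, you need to create a new one (or keep the `list()` result around).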

In [0]:
some_game.loser

Oh, poor 'TEN'.

In [0]:
[print(d) for d in some_game.drives];

So these are the drives. Actually, a text representation based on some internal data. Not sure yet how to make something out of them. Any ideas, anybody?

In [0]:
some_drive = list(some_game.drives)[3]

In [0]:
some_drive.total_yds

A drive seems to consist of several `plays`, and there are some interesting attributes to access. The involved players are probably inside the `nflgame.game.Play` objects that the `plays` attribute yields.

In [0]:
[ str(p) for p in some_drive.plays ]

And these seem to be the players in that game:

In [0]:
str(some_game.players)

In [0]:
some_player = list(some_game.players)[0]
str(some_player)

In [0]:
some_player.passer_rating()

I wonder if we could use this value directly to correlate it to some other stats.

Perhaps we can use that rating, @wonbyte?
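
For reference, the standard NFL passer rating can be computed from five box-score numbers. The sketch below implements that well-known formula; we assume, but have not verified here, that nflgame's `passer_rating()` computes something equivalent. Returning `0.0` for zero attempts is our own convention, since the formula is undefined in that case.

```python
def passer_rating(cmp, att, yds, tds, ints):
    """Standard NFL passer rating from completions, attempts, yards,
    touchdowns and interceptions (all for passing)."""
    if att == 0:
        return 0.0  # assumption: the formula is undefined without attempts

    def clamp(x):
        # Each of the four components is limited to the range [0, 2.375].
        return max(0.0, min(x, 2.375))

    a = clamp((cmp / att - 0.3) * 5)       # completion percentage
    b = clamp((yds / att - 3) * 0.25)      # yards per attempt
    c = clamp((tds / att) * 20)            # touchdowns per attempt
    d = clamp(2.375 - (ints / att) * 25)   # interceptions per attempt
    return (a + b + c + d) / 6 * 100


perfect = passer_rating(cmp=20, att=20, yds=300, tds=4, ints=0)
all_incomplete = passer_rating(cmp=0, att=10, yds=0, tds=0, ints=0)
print(round(perfect, 1), round(all_incomplete, 1))
```

Because every component is clamped, the rating is bounded above by 158.3, which is worth keeping in mind when correlating it with unbounded stats such as yards.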

In [0]:
str(some_player.formatted_stats())

From the player in a game (that is our `some_player` here), we can get the general object containing all the information available for this player via the `player` attribute. For example, his typical position:

In [0]:
some_player.team

Look, here are the team stats, in case we want to correlate these against individual players:

In [0]:
some_game.stats_home

In [0]:
some_game.home

Getting all players of the "home" side:

In [0]:
[pl.name for pl in some_game.players if pl.team == some_game.home]

In [0]:
some_player.playerid

Oh shit, that looks like a better way to get the stats out of a game than we did...

In [0]:
some_player.player.birthdate

In [0]:
some_player.player.years_pro

In [0]:
some_player.player.profile_url

In [0]:
some_player.player.name

### Some helper functions

Getting the player ids for a certain year that are in the stats (not all players are).

In [0]:
def active_players_in_year(year, kind=('REG', 'POST')):
    """Return the set of ids of the players active in the given year.
    
    year: An int for the year of the season.
    kind: A sequence with the kinds of games. Default ('REG', 'POST').
    """
    players = set()
    for k in kind:
        for game in nflgame.games(year, kind=k):
            players.update(p.playerid for p in game.players)
    return players

It is often handy to get the name of a player when the system spits out an id.

In [0]:
def lookup_player_name(player_id):
    """
    Return the name (str) of a player, given his player_id (str).
    """
    return nflgame.players[player_id].name

In [0]:
lookup_player_name(some_player.playerid)

In [0]:
def opposite_side(side):
    return "home" if side=="away" else "away"

This function goes through all given games and looks for the keys in the different stats that nflgame provides. For example, a key in `keys` is `("passing", "yds")`, which means to look for the "yds" key in the "passing" category. If you can come up with a better/faster way to pull the stats out of nflgame, please let us know (PRs welcome).

In [0]:
def collect_stats_for_keys(keys, games):
    cats = {cat for cat, key in keys}
    columns = ['site', 'date', 'week', 'team', 'op_team', 'season']
    columns += [ cat+"_"+key for cat, key in keys]
    tmp_list = []
    tmp_index = []
    for game in games:
        playing = {}
        for site in ['away', 'home']:
            for cat in cats:
                if cat not in game.data[site]['stats']:
                    # A game may lack a category if nothing happened in that category.
                    continue
                stat = game.data[site]['stats'][cat]
                for player_id in stat:
                    if player_id not in playing:
                        playing[player_id] = defaultdict(lambda: None)
                    for key in [key for c, key in keys if c == cat]:
                        if key in stat[player_id]:
                            playing[player_id][cat+"_"+key] = stat[player_id][key]
                    playing[player_id]['site'] = site
                    # pd.Timestamp replaces the pd.datetime alias, which newer pandas versions no longer provide.
                    playing[player_id]['date'] = pd.Timestamp(game.schedule['year'],
                                                              game.schedule['month'],
                                                              game.schedule['day'],
                                                              int(game.schedule['time'].split(':')[0]))
                    playing[player_id]['week'] = game.schedule['week']
                    playing[player_id]['team'] = game.data[site]['abbr']
                    playing[player_id]['op_team'] = game.data[opposite_side(site)]['abbr']
                    playing[player_id]['season'] = game.season()
        for player_id in playing:
            tmp_index.append([player_id, game.eid])
            tmp_list.append([playing[player_id][col] for col in columns])
    return pd.DataFrame(tmp_list,
                        index=pd.MultiIndex.from_tuples(tmp_index, names=('player_id', 'eid')),
                        columns=columns)
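
To see the shape of the result without having nflgame at hand, here is a stripped-down sketch of the same idea. The nested `raw` dict, the player ids, the game ids and the team abbreviation are all made up for illustration; only the pattern (one row per player and game, `None` for stats a player did not record) mirrors the function above.

```python
from collections import defaultdict

import pandas as pd

keys = [("passing", "yds"), ("rushing", "yds")]
columns = ["team"] + [cat + "_" + key for cat, key in keys]

# Made-up raw stats: game id -> category -> player id -> key -> value.
raw = {
    "game-1": {"passing": {"p1": {"yds": 250}},
               "rushing": {"p1": {"yds": 12}, "p2": {"yds": 80}}},
    "game-2": {"rushing": {"p2": {"yds": 95}}},
}

records, index = [], []
for eid, stats in raw.items():
    playing = {}
    for cat, key in keys:
        for player_id, values in stats.get(cat, {}).items():
            # Missing stats default to None, as in the function above.
            row = playing.setdefault(player_id, defaultdict(lambda: None))
            row["team"] = "XYZ"  # hypothetical team abbreviation
            if key in values:
                row[cat + "_" + key] = values[key]
    for player_id, row in playing.items():
        index.append((player_id, eid))
        records.append([row[col] for col in columns])

df = pd.DataFrame(records,
                  index=pd.MultiIndex.from_tuples(index, names=("player_id", "eid")),
                  columns=columns)
print(df)

# All games of one player: select on the first index level.
p2_games = df.xs("p2", level="player_id")
```

The two-level index (`player_id`, `eid`) is what later lets us slice either by player or by game, and the `None` entries show up as missing values that pandas can skip in aggregations.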

### Filling pandas DataFrames with data

We need to define the `nflgame` keys (for the games) that are interesting to us in order to build up a DataFrame.

In [0]:
offense_keys= [('receiving', 'tds'),
               ('receiving', 'yds'),
               ('receiving', 'rec'),
               ('receiving', 'lng'),
               ('receiving', 'twoptm'),
               ('passing', 'yds'),
               ('passing', 'tds'),
               ('passing', 'att'),
               ('passing', 'cmp'),
               ('passing', 'ints'),
               ('rushing', 'yds'),
               ('rushing', 'tds'),
               ('rushing', 'att'),
               ('rushing', 'lng'),
               ('fumbles', 'yds'),
               ('fumbles', 'rcv'),
               ('fumbles', 'tot'),
               ('kickret', 'tds'),
               ('kickret', 'avg'),
               ('kickret', 'ret'),
               ('kicking', 'fga'),
               ('kicking', 'fgyds'),
               ('kicking', 'xpb'),
               ('kicking', 'xpmade'),
               ('kicking', 'fgm'),
               ('puntret', 'lng'),
               ('puntret', 'ret'),
               ('puntret', 'tds'),
               ('puntret', 'avg')]

In [0]:
defense_keys=[('defense', 'ffum'),
              ('defense', 'ast'),
              ('defense', 'int'),
              ('defense', 'sk'),
              ('defense', 'tkl'),
              ('punting', 'yds'),
              ('punting', 'i20'),
              ('punting', 'avg'),
              ('punting', 'lng')]

Let's build a big DataFrame (that is, a table) with all the games we are interested in and how all players performed in those games. When a player wasn't in a game, we denote that by `NaN` (pandas' marker for a missing value, which is what our `None` entries stand for).