# English Premier League and the Ratings Percentage Index

This notebook uses Python tools to automatically generate the English Premier League table ordered by the Ratings Percentage Index (RPI), using football match results data from [www.football-data.co.uk](http://www.football-data.co.uk/englandm.php). The RPI is a technique proposed by [The Tomkins Times](https://tomkinstimes.com/) subscriber Tim O'Brien to take account of the quality of opposition; it is described [here](https://tomkinstimes.com/2016/11/comment-of-the-month-october-2016/). The solution is built into a simple web app.
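
For reference, the weighting that this notebook applies later (in `gen_prem_table_RPI`) can be written out as below. The symbol names simply mirror the column names used in the table; they are shorthand for this notebook rather than notation taken from the original article.

```latex
\mathrm{PTS\%} = 100 \times \frac{\mathrm{PTS}}{3 \times P}
\qquad
\mathrm{RPI} = 0.25\,\mathrm{PTS\%} + 0.50\,\mathrm{OPP\_PTS\%} + 0.25\,\mathrm{OPP\_OPP\_PTS\%}
```

As a check, taking Liverpool's rounded figures from the 13 Nov 16 table further down (78.8, 46.3 and 47.0) gives 0.25 × 78.8 + 0.50 × 46.3 + 0.25 × 47.0 = 19.7 + 23.15 + 11.75 = 54.6, which matches the RPI shown there.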

The project uses [jupyter notebook](http://jupyter.org/index.html), [python](https://www.python.org/), [pandas](http://pandas.pydata.org/), [beautiful soup](https://www.crummy.com/software/BeautifulSoup/), [requests](http://docs.python-requests.org/en/master/), [spyre](https://github.com/adamhajari/spyre) and [heroku](https://www.heroku.com/).

#### Notebook Change Log

In [1]:
%%html
<!-- left align the change log table in next cell -->
<style>
table {float:left}
</style>

| Date          | Change Description |
| :------------ | :----------------- |
| 17th November 2016 | Initial baseline |

## Set-up

Import the modules needed for the analysis.

In [2]:
import pandas as pd
#import matplotlib as mpl
#import matplotlib.pyplot as plt
#import numpy as np
import sys 
import requests
import datetime as dt
import pickle
import os
import bs4
from bs4 import BeautifulSoup, SoupStrainer
from itertools import cycle
from collections import defaultdict
#from datetime import datetime
from IPython.display import Image
from IPython.core.display import HTML 
from __future__ import division

# enable inline plotting
%matplotlib inline

Print version numbers of key modules.

In [3]:
print 'python version: {}'.format(sys.version)
print 'pandas version: {}'.format(pd.__version__)
print 'requests version: {}'.format(requests.__version__)
print 'bs4 version: {}'.format(bs4.__version__)
#print 'matplotlib version: {}'.format(mpl.__version__)
#print 'numpy version: {}'.format(np.__version__)

python version: 2.7.11 |Anaconda 4.0.0 (64-bit)| (default, Feb 16 2016, 09:58:36) [MSC v.1500 64 bit (AMD64)]
pandas version: 0.18.0
requests version: 2.9.1
bs4 version: 4.4.1


## Generate the Premier League table with RPI

Start by defining some utility functions.

In [4]:
def get_pl_master_data():
    """Return url of latest premier league results file and the date the file was last updated.
    
    Data source is www.football-data.co.uk.
    Format of returned date is "%Y-%m-%d" (the pandas default).
    """
    # scrape the data from football-data website
    URL_FD_ROOT = 'http://www.football-data.co.uk/'
    ENGLAND_LOCATION = 'englandm.php'
    PL_TEXT = 'Premier League'
    with requests.Session() as session:
        response = session.get(URL_FD_ROOT + ENGLAND_LOCATION)
        soup = BeautifulSoup(response.content, 'lxml')

        # scrape last updated date
        last_updated_tag = soup.find_all('i')[0]
        last_updated_date = last_updated_tag.text.split('Last updated: \t')[1]
        # set date format to be same as pandas default
        last_updated_date = dt.datetime.strptime(last_updated_date, '%d/%m/%y').strftime('%Y-%m-%d')
                                                                                         
        # scrape url of premier league results file
        latest_pl_results_file_tag = soup.find_all('a', href=True, text=PL_TEXT)[0]['href']
        url_latest_pl_results_file = URL_FD_ROOT + latest_pl_results_file_tag

    return url_latest_pl_results_file, last_updated_date

In [5]:
# check current latest
url_latest_pl_results_file, last_updated_date = get_pl_master_data()
print 'PL results URL: {}, last updated: {}'.format(url_latest_pl_results_file,  last_updated_date)

PL results URL: http://www.football-data.co.uk/mmz4281/1617/E0.csv, last updated: 2016-11-13
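
The date handling inside `get_pl_master_data()` can be tried without hitting the website. The sketch below is only an illustration: `parse_last_updated` is a hypothetical helper, and the sample string assumes the page text has the form `Last updated: <tab>dd/mm/yy`, as the split in the function implies. The extra `strip()` is an addition here to tolerate trailing whitespace.

```python
import datetime as dt


def parse_last_updated(text):
    """Hypothetical helper: pull dd/mm/yy out of the label text and return YYYY-MM-DD."""
    raw_date = text.split('Last updated: \t')[1].strip()
    # convert to the ISO-style format used elsewhere in this notebook
    return dt.datetime.strptime(raw_date, '%d/%m/%y').strftime('%Y-%m-%d')


print(parse_last_updated('Last updated: \t13/11/16'))  # 2016-11-13
```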


In [6]:
def get_pl_results_dataframe(update_cache=False):
    """Return latest premier league results as a dataframe and the date of the results data.
    
    Data source is www.football-data.co.uk.
    Cache data locally to avoid unnecessary calls to football-data website.
    Download results from master data source if local data is out of date.
    """
    
    LOCAL_RESULTS_DATA_FILE = 'data/E0.csv'
    PICKLE_FILE = 'save.p' # holds date of results data file

    # get master data source data
    url_latest_pl_results_file, master_results_data_date = get_pl_master_data()
    
    if update_cache:
        if os.path.exists(PICKLE_FILE):
            os.remove(PICKLE_FILE)

    # get local data source date
    if os.path.exists(PICKLE_FILE):
        local_results_data_date = pickle.load(open(PICKLE_FILE, 'rb'))
    else:
        local_results_data_date = None

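    # no cached date (None) or an older cached date both mean the local copy is stale
    if local_results_data_date is None or local_results_data_date < master_results_data_date: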
        print 'local results data out of date, updating from master results data file'
        parse_dates_col = ['Date']
        df_results = pd.read_csv(url_latest_pl_results_file, parse_dates=parse_dates_col, dayfirst=True)
        df_results.to_csv(LOCAL_RESULTS_DATA_FILE, index=False)
        local_results_data_date = master_results_data_date
        pickle.dump(local_results_data_date, open(PICKLE_FILE, 'wb'))
    else:
        print 'local results data still latest'
        parse_dates_col = ['Date']
        df_results = pd.read_csv(LOCAL_RESULTS_DATA_FILE, parse_dates=parse_dates_col, dayfirst=True)
        
    return df_results, local_results_data_date
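
The freshness check compares dates held as `'YYYY-MM-DD'` strings, which works because zero-padded strings in that format sort in chronological order; a missing local date has to count as stale. The sketch below isolates that logic in a hypothetical helper, `is_cache_stale`, with `None` handled explicitly so that it does not depend on how Python 2 orders `None` against strings.

```python
def is_cache_stale(local_date, master_date):
    """Hypothetical helper: True if there is no local date or it predates the master date."""
    if local_date is None:
        return True
    # zero-padded 'YYYY-MM-DD' strings compare in chronological order
    return local_date < master_date


print(is_cache_stale(None, '2016-11-13'))          # True
print(is_cache_stale('2016-11-05', '2016-11-13'))  # True
print(is_cache_stale('2016-11-13', '2016-11-13'))  # False
```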

In [7]:
# check results
df_results, results_date = get_pl_results_dataframe()
print 'results data date: {}'.format(results_date)
print df_results.dtypes.head()
df_results.head()

local results data still latest
results data date: 2016-11-13
Div                 object
Date        datetime64[ns]
HomeTeam            object
AwayTeam            object
FTHG                 int64
dtype: object


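|  | Div | Date | HomeTeam | AwayTeam | FTHG | FTAG | FTR | HTHG | HTAG | HTR | ... | BbAv<2.5 | BbAH | BbAHh | BbMxAHH | BbAvAHH | BbMxAHA | BbAvAHA | PSCH | PSCD | PSCA |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | E0 | 2016-08-13 | Burnley | Swansea | 0 | 1 | A | 0 | 0 | D | ... | 1.61 | 32 | -0.25 | 2.13 | 2.06 | 1.86 | 1.81 | 2.79 | 3.16 | 2.89 |
| 1 | E0 | 2016-08-13 | Crystal Palace | West Brom | 0 | 1 | A | 0 | 0 | D | ... | 1.52 | 33 | -0.5 | 2.07 | 2.0 | 1.9 | 1.85 | 2.25 | 3.15 | 3.86 |
| 2 | E0 | 2016-08-13 | Everton | Tottenham | 1 | 1 | D | 1 | 0 | H | ... | 1.77 | 32 | 0.25 | 1.91 | 1.85 | 2.09 | 2.0 | 3.64 | 3.54 | 2.16 |
| 3 | E0 | 2016-08-13 | Hull | Leicester | 2 | 1 | H | 1 | 0 | H | ... | 1.67 | 31 | 0.25 | 2.35 | 2.26 | 2.03 | 1.67 | 4.68 | 3.5 | 1.92 |
| 4 | E0 | 2016-08-13 | Man City | Sunderland | 2 | 1 | H | 1 | 0 | H | ... | 2.48 | 34 | -1.5 | 1.81 | 1.73 | 2.2 | 2.14 | 1.25 | 6.5 | 14.5 |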


In [8]:
# check forced update to cache
df_results, results_date = get_pl_results_dataframe(update_cache=True)
print 'results data date: {}'.format(results_date)
print df_results.dtypes.head()
df_results.head()

local results data out of date, updating from master results data file
results data date: 2016-11-13
Div                 object
Date        datetime64[ns]
HomeTeam            object
AwayTeam            object
FTHG                 int64
dtype: object


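|  | Div | Date | HomeTeam | AwayTeam | FTHG | FTAG | FTR | HTHG | HTAG | HTR | ... | BbAv<2.5 | BbAH | BbAHh | BbMxAHH | BbAvAHH | BbMxAHA | BbAvAHA | PSCH | PSCD | PSCA |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | E0 | 2016-08-13 | Burnley | Swansea | 0 | 1 | A | 0 | 0 | D | ... | 1.61 | 32 | -0.25 | 2.13 | 2.06 | 1.86 | 1.81 | 2.79 | 3.16 | 2.89 |
| 1 | E0 | 2016-08-13 | Crystal Palace | West Brom | 0 | 1 | A | 0 | 0 | D | ... | 1.52 | 33 | -0.5 | 2.07 | 2.0 | 1.9 | 1.85 | 2.25 | 3.15 | 3.86 |
| 2 | E0 | 2016-08-13 | Everton | Tottenham | 1 | 1 | D | 1 | 0 | H | ... | 1.77 | 32 | 0.25 | 1.91 | 1.85 | 2.09 | 2.0 | 3.64 | 3.54 | 2.16 |
| 3 | E0 | 2016-08-13 | Hull | Leicester | 2 | 1 | H | 1 | 0 | H | ... | 1.67 | 31 | 0.25 | 2.35 | 2.26 | 2.03 | 1.67 | 4.68 | 3.5 | 1.92 |
| 4 | E0 | 2016-08-13 | Man City | Sunderland | 2 | 1 | H | 1 | 0 | H | ... | 2.48 | 34 | -1.5 | 1.81 | 1.73 | 2.2 | 2.14 | 1.25 | 6.5 | 14.5 |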


In [9]:
# check forced reload
local_data_source_date = '2016-11-05'
PICKLE_FILE = 'save.p'
pickle.dump(local_data_source_date, open(PICKLE_FILE, 'wb'))
print pickle.load(open(PICKLE_FILE, 'rb'))
df_results, results_date = get_pl_results_dataframe()
print 'results data date: {}'.format(results_date)
df_results.head()

2016-11-05
local results data out of date, updating from master results data file
results data date: 2016-11-13


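|  | Div | Date | HomeTeam | AwayTeam | FTHG | FTAG | FTR | HTHG | HTAG | HTR | ... | BbAv<2.5 | BbAH | BbAHh | BbMxAHH | BbAvAHH | BbMxAHA | BbAvAHA | PSCH | PSCD | PSCA |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | E0 | 2016-08-13 | Burnley | Swansea | 0 | 1 | A | 0 | 0 | D | ... | 1.61 | 32 | -0.25 | 2.13 | 2.06 | 1.86 | 1.81 | 2.79 | 3.16 | 2.89 |
| 1 | E0 | 2016-08-13 | Crystal Palace | West Brom | 0 | 1 | A | 0 | 0 | D | ... | 1.52 | 33 | -0.5 | 2.07 | 2.0 | 1.9 | 1.85 | 2.25 | 3.15 | 3.86 |
| 2 | E0 | 2016-08-13 | Everton | Tottenham | 1 | 1 | D | 1 | 0 | H | ... | 1.77 | 32 | 0.25 | 1.91 | 1.85 | 2.09 | 2.0 | 3.64 | 3.54 | 2.16 |
| 3 | E0 | 2016-08-13 | Hull | Leicester | 2 | 1 | H | 1 | 0 | H | ... | 1.67 | 31 | 0.25 | 2.35 | 2.26 | 2.03 | 1.67 | 4.68 | 3.5 | 1.92 |
| 4 | E0 | 2016-08-13 | Man City | Sunderland | 2 | 1 | H | 1 | 0 | H | ... | 2.48 | 34 | -1.5 | 1.81 | 1.73 | 2.2 | 2.14 | 1.25 | 6.5 | 14.5 |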
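
The cells above store the date of the cached results in a pickle file. As a minimal sketch of the same round trip, the block below wraps it in two hypothetical helpers, `save_date` and `load_date`, that use `with` blocks so the file handles are closed promptly. It writes to a temporary directory rather than to the notebook's own `save.p`.

```python
import os
import pickle
import tempfile


def save_date(path, date_text):
    """Hypothetical helper: pickle the results date and close the file afterwards."""
    with open(path, 'wb') as f:
        pickle.dump(date_text, f)


def load_date(path):
    """Hypothetical helper: return the pickled date, or None if the file is missing."""
    if not os.path.exists(path):
        return None
    with open(path, 'rb') as f:
        return pickle.load(f)


demo_dir = tempfile.mkdtemp()                 # throwaway directory for the demo
demo_file = os.path.join(demo_dir, 'save.p')
print(load_date(demo_file))                   # None, nothing saved yet
save_date(demo_file, '2016-11-05')
print(load_date(demo_file))                   # 2016-11-05
```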


In [10]:
def validate_date(date_text):
    """Raise error if date format is not YYYY-MM-DD."""
    try:
        dt.datetime.strptime(date_text, '%Y-%m-%d')
    except ValueError:
        raise ValueError("Incorrect date format, should be YYYY-MM-DD")

In [11]:
validate_date('2016-11-14')

In [12]:
def simple_date(date_text):
    """Convert given date from format '%Y-%m-%d' to '%d %b %y'."""
    validate_date(date_text)
    return (dt.datetime.strptime(date_text, '%Y-%m-%d').strftime('%d %b %y'))

In [13]:
print simple_date('2016-6-2')
print simple_date('2017-1-31')

02 Jun 16
31 Jan 17
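
It may help to see what `validate_date` accepts and rejects. The sketch below repeats the function from the cell above so that it runs on its own, then tries three sample strings chosen for illustration. Note that `strptime` does not insist on zero padding, so `'2016-6-2'` is accepted, which is consistent with the `simple_date('2016-6-2')` output shown above.

```python
import datetime as dt


def validate_date(date_text):
    """Raise error if date format is not YYYY-MM-DD (repeated from the cell above)."""
    try:
        dt.datetime.strptime(date_text, '%Y-%m-%d')
    except ValueError:
        raise ValueError("Incorrect date format, should be YYYY-MM-DD")


# sample inputs chosen for illustration
for candidate in ['2016-11-14', '2016-6-2', '14/11/2016']:
    try:
        validate_date(candidate)
        print('{}: accepted'.format(candidate))
    except ValueError as err:
        print('{}: rejected ({})'.format(candidate, err))
```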


Now define a function to generate the Premier League table with RPI.

In [14]:
def gen_prem_table_RPI(before_date=None, update_cache=False):
    """Return prem table with RPI at given before_date and return data source date."""
    
    results = []
    opponents_d = {}
    df_results, results_date = get_pl_results_dataframe(update_cache)
    
    
    # keep only results played on or before the given before_date
    if before_date:
        validate_date(before_date)
        df_results = df_results[df_results.Date <= before_date]
    
    for team in df_results['HomeTeam'].unique():
        home_results = df_results[df_results['HomeTeam'] == team]
        home_played = len(home_results.index)
        home_win = home_results.FTR[home_results.FTR == 'H'].count()
        home_draw = home_results.FTR[home_results.FTR == 'D'].count()
        home_lose = home_results.FTR[home_results.FTR == 'A'].count()
        home_goals_for = home_results.FTHG.sum()
        home_goals_against = home_results.FTAG.sum()
        home_opponents = list(df_results[df_results.HomeTeam == team].AwayTeam.values)

        away_results = df_results[df_results['AwayTeam'] == team]
        away_played = len(away_results.index)
        away_win = away_results.FTR[away_results.FTR == 'A'].count()
        away_draw = away_results.FTR[away_results.FTR == 'D'].count()
        away_lose = away_results.FTR[away_results.FTR == 'H'].count()
        away_goals_for = away_results.FTAG.sum()
        away_goals_against = away_results.FTHG.sum()
        away_opponents = list(df_results[df_results.AwayTeam == team].HomeTeam.values)

        # add team opponents to dictionary
        team_opponents = home_opponents + away_opponents
        opponents_d[team] = team_opponents
        
        # create team results dictionary and add to results list
        result_d = {} 
        result_d['Team'] = team
        result_d['P'] = home_played + away_played
        result_d['W'] = home_win + away_win
        result_d['D'] = home_draw + away_draw
        result_d['L'] = home_lose + away_lose
        result_d['GF'] = home_goals_for + away_goals_for
        result_d['GA'] = home_goals_against + away_goals_against
        result_d['GD'] = result_d['GF'] - result_d['GA']
        result_d['PTS'] = result_d['W']*3 + result_d['D']
        results.append(result_d) # append team result dictionary to list of results

    # create PL table dataframe from team results and sort by points (and then goal difference and goals for)
    # show date of data in Position column
    PLtable = pd.DataFrame(results, columns=['Team', 'P', 'W', 'D', 'L', 'GF', 'GA', 'GD', 'PTS'])
    PLtable.sort_values(['PTS', 'GD', 'GF'], ascending=False, inplace=True)
    col_date = before_date if before_date else results_date
    pos_title = 'Position at {}'.format(simple_date(col_date))
    PLtable[pos_title] = range(1, len(PLtable)+1) # add new column for position, with highest points first
    PLtable.set_index([pos_title], inplace=True, drop=True) 
    #PLtable.reset_index(inplace=True)
    
    # Add RPI to the table
    PLtable['PTS%'] = 100*(PLtable.PTS/(PLtable.P*3))
    PLtable['OPP_PTS%'] = PLtable.apply(lambda row: PLtable[PLtable.Team.isin(opponents_d[row.Team])]['PTS%'].mean(), axis=1)
    PLtable['OPP_OPP_PTS%'] = PLtable.apply(lambda row: PLtable[PLtable.Team.isin(opponents_d[row.Team])]['OPP_PTS%'].mean(), axis=1)
    PLtable['RPI'] = (PLtable['PTS%']*.25 + PLtable['OPP_PTS%']*.50 + PLtable['OPP_OPP_PTS%']*.25)
    # method='min' gives tied teams the same (best) whole-number position
    PLtable['RPI_Position'] = PLtable['RPI'].rank(ascending=False, method='min').astype(int)
    
    # return PL table with RPI, sorted by RPI and PTS percentage
    return(PLtable.sort_values(['RPI', 'PTS%'], ascending=False), results_date)
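
To make the RPI step easier to follow, here is a dependency-free sketch of the same calculation on a made-up three-team mini-league. The teams, results and helper name `opponent_mean` are hypothetical. Each opponent is held in a set and so counted once, which mirrors the effect of `isin` in the function above.

```python
from collections import defaultdict

# Hypothetical mini-league: (home, away, full-time result 'H', 'D' or 'A')
matches = [('A', 'B', 'H'), ('C', 'A', 'A'), ('B', 'C', 'H')]

points = defaultdict(int)
played = defaultdict(int)
opponents = defaultdict(set)
for home, away, ftr in matches:
    played[home] += 1
    played[away] += 1
    opponents[home].add(away)
    opponents[away].add(home)
    if ftr == 'H':
        points[home] += 3
    elif ftr == 'A':
        points[away] += 3
    else:
        points[home] += 1
        points[away] += 1

# percentage of available points won by each team
pts_pct = {t: 100.0 * points[t] / (3 * played[t]) for t in played}


def opponent_mean(team, metric):
    """Mean of a metric over the distinct opponents a team has faced."""
    opps = opponents[team]
    return sum(metric[o] for o in opps) / float(len(opps))


opp_pts_pct = {t: opponent_mean(t, pts_pct) for t in played}
opp_opp_pts_pct = {t: opponent_mean(t, opp_pts_pct) for t in played}
rpi = {t: 0.25 * pts_pct[t] + 0.50 * opp_pts_pct[t] + 0.25 * opp_opp_pts_pct[t]
       for t in played}

for team in sorted(rpi, key=rpi.get, reverse=True):
    print('{}: PTS% {:.1f}, RPI {:.3f}'.format(team, pts_pct[team], rpi[team]))
```

With these results, team A wins both games, B wins one and C wins none, so the RPI ordering matches the points ordering; the gap narrows because the weaker teams faced stronger opposition.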

In [15]:
pd.set_option('precision', 1)
PLtableRPI, results_date = gen_prem_table_RPI()
print 'Premier league table ordered by RPI, date of results data: {}'.format(results_date)
PLtableRPI

local results data still latest
Premier league table ordered by RPI, date of results data: 2016-11-13


| Position at 13 Nov 16 | Team | P | W | D | L | GF | GA | GD | PTS | PTS% | OPP_PTS% | OPP_OPP_PTS% | RPI | RPI_Position |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 1 | Liverpool | 11 | 8 | 2 | 1 | 30 | 14 | 16 | 26 | 78.8 | 46.3 | 47.0 | 54.6 | 1 |
| 2 | Chelsea | 11 | 8 | 1 | 2 | 26 | 9 | 17 | 25 | 75.8 | 45.7 | 46.9 | 53.5 | 2 |
| 4 | Arsenal | 11 | 7 | 3 | 1 | 24 | 11 | 13 | 24 | 72.7 | 43.3 | 47.6 | 51.7 | 3 |
| 5 | Tottenham | 11 | 5 | 6 | 0 | 15 | 6 | 9 | 21 | 63.6 | 46.6 | 44.3 | 50.3 | 4 |
| 3 | Man City | 11 | 7 | 3 | 1 | 25 | 10 | 15 | 24 | 72.7 | 38.6 | 45.6 | 48.9 | 5 |
| 6 | Man United | 11 | 5 | 3 | 3 | 16 | 13 | 3 | 18 | 54.5 | 46.6 | 46.0 | 48.4 | 6 |
| 7 | Everton | 11 | 5 | 3 | 3 | 15 | 13 | 2 | 18 | 54.5 | 44.1 | 44.4 | 46.8 | 7 |
| 9 | Burnley | 11 | 4 | 2 | 5 | 11 | 15 | -4 | 14 | 42.4 | 48.8 | 46.6 | 46.6 | 8 |
| 8 | Watford | 11 | 4 | 3 | 4 | 15 | 19 | -4 | 15 | 45.5 | 46.6 | 46.5 | 46.3 | 9 |
| 14 | Leicester | 11 | 3 | 3 | 5 | 13 | 18 | -5 | 12 | 36.4 | 49.6 | 46.6 | 45.5 | 10 |


In [16]:
pd.set_option('precision', 1)
request_date = dt.datetime.today().strftime("%Y-%m-%d")
PLtableRPI, results_date = gen_prem_table_RPI(before_date=request_date, update_cache=False)
print 'Premier league table ordered by RPI at {}, date of results data: {}'.format(request_date, results_date)
PLtableRPI

local results data still latest
Premier league table ordered by RPI at 2016-11-17, date of results data: 2016-11-13


| Position at 17 Nov 16 | Team | P | W | D | L | GF | GA | GD | PTS | PTS% | OPP_PTS% | OPP_OPP_PTS% | RPI | RPI_Position |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 1 | Liverpool | 11 | 8 | 2 | 1 | 30 | 14 | 16 | 26 | 78.8 | 46.3 | 47.0 | 54.6 | 1 |
| 2 | Chelsea | 11 | 8 | 1 | 2 | 26 | 9 | 17 | 25 | 75.8 | 45.7 | 46.9 | 53.5 | 2 |
| 4 | Arsenal | 11 | 7 | 3 | 1 | 24 | 11 | 13 | 24 | 72.7 | 43.3 | 47.6 | 51.7 | 3 |
| 5 | Tottenham | 11 | 5 | 6 | 0 | 15 | 6 | 9 | 21 | 63.6 | 46.6 | 44.3 | 50.3 | 4 |
| 3 | Man City | 11 | 7 | 3 | 1 | 25 | 10 | 15 | 24 | 72.7 | 38.6 | 45.6 | 48.9 | 5 |
| 6 | Man United | 11 | 5 | 3 | 3 | 16 | 13 | 3 | 18 | 54.5 | 46.6 | 46.0 | 48.4 | 6 |
| 7 | Everton | 11 | 5 | 3 | 3 | 15 | 13 | 2 | 18 | 54.5 | 44.1 | 44.4 | 46.8 | 7 |
| 9 | Burnley | 11 | 4 | 2 | 5 | 11 | 15 | -4 | 14 | 42.4 | 48.8 | 46.6 | 46.6 | 8 |
| 8 | Watford | 11 | 4 | 3 | 4 | 15 | 19 | -4 | 15 | 45.5 | 46.6 | 46.5 | 46.3 | 9 |
| 14 | Leicester | 11 | 3 | 3 | 5 | 13 | 18 | -5 | 12 | 36.4 | 49.6 | 46.6 | 45.5 | 10 |


In [17]:
pd.set_option('precision', 1)
request_date = '2016-10-24' # 9 games
PLtableRPI, results_date = gen_prem_table_RPI(request_date)
print 'Premier league table ordered by RPI at {}, date of results data: {}'.format(request_date, results_date)
PLtableRPI

local results data still latest
Premier league table ordered by RPI at 2016-10-24, date of results data: 2016-11-13


| Position at 24 Oct 16 | Team | P | W | D | L | GF | GA | GD | PTS | PTS% | OPP_PTS% | OPP_OPP_PTS% | RPI | RPI_Position |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 3 | Liverpool | 9 | 6 | 2 | 1 | 20 | 11 | 9 | 20 | 74.1 | 47.3 | 47.6 | 54.1 | 1 |
| 4 | Chelsea | 9 | 6 | 1 | 2 | 19 | 9 | 10 | 19 | 70.4 | 44.9 | 48.1 | 52.0 | 2 |
| 2 | Arsenal | 9 | 6 | 2 | 1 | 19 | 9 | 10 | 20 | 74.1 | 42.8 | 47.5 | 51.8 | 3 |
| 1 | Man City | 9 | 6 | 2 | 1 | 20 | 9 | 11 | 20 | 74.1 | 40.7 | 46.1 | 50.4 | 4 |
| 5 | Tottenham | 9 | 5 | 4 | 0 | 13 | 4 | 9 | 19 | 70.4 | 43.6 | 43.9 | 50.4 | 5 |
| 7 | Man United | 9 | 4 | 2 | 3 | 13 | 12 | 1 | 14 | 51.9 | 50.6 | 45.7 | 49.7 | 6 |
| 10 | Bournemouth | 9 | 3 | 3 | 3 | 12 | 12 | 0 | 12 | 44.4 | 48.6 | 44.1 | 46.4 | 7 |
| 12 | Leicester | 9 | 3 | 2 | 4 | 11 | 15 | -4 | 11 | 40.7 | 49.0 | 46.8 | 46.4 | 8 |
| 14 | Burnley | 9 | 3 | 1 | 5 | 8 | 13 | -5 | 10 | 37.0 | 50.2 | 46.3 | 45.9 | 9 |
| 6 | Everton | 9 | 4 | 3 | 2 | 13 | 8 | 5 | 15 | 55.6 | 41.2 | 44.4 | 45.6 | 10 |


In [18]:
pd.set_option('precision', 1)
request_date = '2016-10-31' # 10 games
PLtableRPI, results_date = gen_prem_table_RPI(request_date)
print 'Premier league table ordered by RPI at {}, date of results data: {}'.format(request_date, results_date)
PLtableRPI

local results data still latest
Premier league table ordered by RPI at 2016-10-31, date of results data: 2016-11-13


| Position at 31 Oct 16 | Team | P | W | D | L | GF | GA | GD | PTS | PTS% | OPP_PTS% | OPP_OPP_PTS% | RPI | RPI_Position |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 3 | Liverpool | 10 | 7 | 2 | 1 | 24 | 13 | 11 | 23 | 76.7 | 45.3 | 47.4 | 53.7 | 1 |
| 4 | Chelsea | 10 | 7 | 1 | 2 | 21 | 9 | 12 | 22 | 73.3 | 44.7 | 47.4 | 52.5 | 2 |
| 2 | Arsenal | 10 | 7 | 2 | 1 | 23 | 10 | 13 | 23 | 76.7 | 40.0 | 47.8 | 51.1 | 3 |
| 1 | Man City | 10 | 7 | 2 | 1 | 24 | 9 | 15 | 23 | 76.7 | 39.0 | 46.3 | 50.2 | 4 |
| 5 | Tottenham | 10 | 5 | 5 | 0 | 14 | 5 | 9 | 20 | 66.7 | 44.3 | 44.6 | 50.0 | 5 |
| 8 | Man United | 10 | 4 | 3 | 3 | 13 | 12 | 1 | 15 | 50.0 | 50.0 | 45.5 | 48.9 | 6 |
| 11 | Leicester | 10 | 3 | 3 | 4 | 12 | 16 | -4 | 12 | 40.0 | 50.0 | 46.8 | 46.7 | 7 |
| 6 | Everton | 10 | 5 | 3 | 2 | 15 | 8 | 7 | 18 | 60.0 | 40.3 | 45.1 | 46.4 | 8 |
| 14 | Burnley | 10 | 3 | 2 | 5 | 8 | 13 | -5 | 11 | 36.7 | 51.0 | 46.3 | 46.2 | 9 |
| 7 | Watford | 10 | 4 | 3 | 3 | 14 | 13 | 1 | 15 | 50.0 | 42.7 | 47.1 | 45.6 | 10 |


In [19]:
#Check results after 10 games...
Image(url= "https://tomkinstimes.com/wp-content/uploads/2016/11/RPI-by-Tim-OBrien.png")

## Building The Spyre App

Useful reference material:
+ How to develop a Spyre app, including tutorials - [https://github.com/adamhajari/spyre](https://github.com/adamhajari/spyre).

Spyre is a web app framework that provides a simple user interface for Python data projects. In simple terms, the premrpi app involves:
1. creating a user interface to call the getData() function to return a data table.
2. using pandas (and related modules) to generate the data table.

See [premrpi_app.py in the premrpi github repo](https://github.com/terrydolan/premrpi) for the app source code.