# Predicting Scores with Poisson Regression


This notebook attempts to predict the score of any fixture from the two teams playing it, based on their performance in the previous season. The Poisson approach used here relies only on historical results and ignores every other factor. Nevertheless, it gives a reasonable estimate of each team's attacking and defensive strength.

In [2]:
import pandas as pd
import numpy as np
import scipy.stats as scipy
import random


### Load the data

In [5]:
df = pd.read_csv("./Data/Spain/SP1_13.csv")
df_14 = pd.read_csv("./Data/Spain/SP1_14.csv")

In [6]:
df.columns

Index([u'Div', u'Date', u'HomeTeam', u'AwayTeam', u'FTHG', u'FTAG', u'FTR',
       u'HTHG', u'HTAG', u'HTR', u'HS', u'AS', u'HST', u'AST', u'HF', u'AF',
       u'HC', u'AC', u'HY', u'AY', u'HR', u'AR', u'B365H', u'B365D', u'B365A',
       u'BWH', u'BWD', u'BWA', u'IWH', u'IWD', u'IWA', u'LBH', u'LBD', u'LBA',
       u'PSH', u'PSD', u'PSA', u'WHH', u'WHD', u'WHA', u'SJH', u'SJD', u'SJA',
       u'VCH', u'VCD', u'VCA', u'Bb1X2', u'BbMxH', u'BbAvH', u'BbMxD',
       u'BbAvD', u'BbMxA', u'BbAvA', u'BbOU', u'BbMx>2.5', u'BbAv>2.5',
       u'BbMx<2.5', u'BbAv<2.5', u'BbAH', u'BbAHh', u'BbMxAHH', u'BbAvAHH',
       u'BbMxAHA', u'BbAvAHA', u'PSCH', u'PSCD', u'PSCA'],
      dtype='object')

### Cleaning

For this method we do not need the division, the date, or the betting odds from the various bookmakers.

In [9]:
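# Positional slice: the first 23 columns hold the match statistics
# (column 22, B365H, is also kept); the remaining columns are betting odds.
res_13 = df.iloc[:, :23]
res_13 = res_13.drop(['Div', 'Date'], axis=1)
res_14 = df_14.iloc[:, :23]
res_14 = res_14.drop(['Div', 'Date'], axis=1)
bet_13 = df.iloc[:, 23:]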

In [10]:
res_13.head()

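| | HomeTeam | AwayTeam | FTHG | FTAG | FTR | HTHG | HTAG | HTR | HS | AS | ... | AST | HF | AF | HC | AC | HY | AY | HR | AR | B365H |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Sociedad | Getafe | 2 | 0 | H | 1 | 0 | H | 16 | 15 | ... | 2 | 13 | 6 | 6 | 5 | 1 | 1 | 0 | 0 | 1.73 |
| 1 | Valencia | Malaga | 1 | 0 | H | 0 | 0 | D | 9 | 11 | ... | 2 | 15 | 23 | 9 | 6 | 3 | 5 | 0 | 0 | 1.53 |
| 2 | Valladolid | Ath Bilbao | 1 | 2 | A | 1 | 1 | D | 8 | 13 | ... | 3 | 10 | 8 | 5 | 5 | 1 | 0 | 0 | 0 | 2.5 |
| 3 | Barcelona | Levante | 7 | 0 | H | 6 | 0 | H | 22 | 4 | ... | 1 | 15 | 16 | 9 | 3 | 1 | 3 | 0 | 0 | 1.08 |
| 4 | Osasuna | Granada | 1 | 2 | A | 0 | 2 | A | 14 | 13 | ... | 4 | 15 | 17 | 7 | 6 | 1 | 4 | 0 | 0 | 2.0 |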


### Dataframe to store the final league standings in 2013-14

We create a table with the goals scored and conceded by each team, together with its attacking and defensive strength, split by home and away matches.

Number of matches played at home = 19

Attacking strength at home (HAS) = (Goals scored at home / 19) / Average Number of goals at home in the season 

Defensive strength at home (HDS) = (Goals conceded at home / 19) / Average number of goals conceded at home in the season
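As a quick check of these formulas, here is a minimal sketch that recomputes Barcelona's home strengths from figures that appear elsewhere in this notebook (64 goals scored and 15 conceded at home, with league averages of about 1.632 home goals and 1.118 away goals per match). The helper name `strength` is illustrative and not part of the notebook.

```python
MATCHES_AT_HOME = 19

# League averages per match, as printed by the notebook for 2013-14.
league_avg_home_scored = 1.63157894737
# Goals conceded at home are the goals scored by away teams.
league_avg_home_conceded = 1.11842105263


def strength(goals, matches, league_avg):
    """Per-match goal rate of a team relative to the league average."""
    return (goals / float(matches)) / league_avg


barca_has = strength(64, MATCHES_AT_HOME, league_avg_home_scored)    # attack
barca_hds = strength(15, MATCHES_AT_HOME, league_avg_home_conceded)  # defence

print("HAS = %.6f, HDS = %.6f" % (barca_has, barca_hds))
```

A value above 1 for HAS means the team scores more than the league average at home; a value below 1 for HDS means it concedes less than average.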

In [11]:
# Team, Home/Away Goals Scored (HGS, AGS), Home/Away Attacking Strength (HAS, AAS),
# Home/Away Goals Conceded (HGC, AGC), Home/Away Defensive Strength (HDS, ADS)
table_13 = pd.DataFrame(columns=('Team','HGS','AGS','HAS','AAS','HGC','AGC','HDS','ADS'))

In [12]:
avg_home_scored_13 = res_13.FTHG.sum() / 380.0
avg_away_scored_13 = res_13.FTAG.sum() / 380.0
avg_home_conceded_13 = avg_away_scored_13
avg_away_conceded_13 = avg_home_scored_13
print "Average number of goals at home",avg_home_scored_13
print "Average number of goals away", avg_away_scored_13
print "Average number of goals conceded at home",avg_home_conceded_13
print "Average number of goals conceded away",avg_away_conceded_13


Average number of goals at home 1.63157894737
Average number of goals away 1.11842105263
Average number of goals conceded at home 1.11842105263
Average number of goals conceded away 1.63157894737


In [13]:
res_home = res_13.groupby('HomeTeam')
res_away = res_13.groupby('AwayTeam')

In [14]:
# Team names come from the group keys, in the same (sorted) order as the sums below
table_13.Team = res_home.FTHG.sum().index.values
table_13.HGS = res_home.FTHG.sum().values
table_13.HGC = res_home.FTAG.sum().values
table_13.AGS = res_away.FTAG.sum().values
table_13.AGC = res_away.FTHG.sum().values
table_13.head()

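| | Team | HGS | AGS | HAS | AAS | HGC | AGC | HDS | ADS |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Almeria | 26 | 17 | NaN | NaN | 31 | 40 | NaN | NaN |
| 1 | Ath Bilbao | 42 | 24 | NaN | NaN | 18 | 21 | NaN | NaN |
| 2 | Ath Madrid | 49 | 28 | NaN | NaN | 10 | 16 | NaN | NaN |
| 3 | Barcelona | 64 | 36 | NaN | NaN | 15 | 18 | NaN | NaN |
| 4 | Betis | 19 | 17 | NaN | NaN | 31 | 47 | NaN | NaN |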


In [15]:
table_13.HAS = (table_13.HGS / 19.0) / avg_home_scored_13
table_13.AAS = (table_13.AGS / 19.0) / avg_away_scored_13
table_13.HDS = (table_13.HGC / 19.0) / avg_home_conceded_13
table_13.ADS = (table_13.AGC / 19.0) / avg_away_conceded_13
table_13.head()

| | Team | HGS | AGS | HAS | AAS | HGC | AGC | HDS | ADS |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Almeria | 26 | 17 | 0.83871 | 0.8 | 31 | 40 | 1.458824 | 1.290323 |
| 1 | Ath Bilbao | 42 | 24 | 1.354839 | 1.129412 | 18 | 21 | 0.847059 | 0.677419 |
| 2 | Ath Madrid | 49 | 28 | 1.580645 | 1.317647 | 10 | 16 | 0.470588 | 0.516129 |
| 3 | Barcelona | 64 | 36 | 2.064516 | 1.694118 | 15 | 18 | 0.705882 | 0.580645 |
| 4 | Betis | 19 | 17 | 0.612903 | 0.8 | 31 | 47 | 1.458824 | 1.516129 |


In [16]:
# Most likely number of goals (0 to 6) under a Poisson distribution with the given mean
def exp_goals(mean):
    max_pmf = 0
    goals = 0
    for i in range(7):
        pmf = scipy.distributions.poisson.pmf(i, mean) * 100
        if pmf > max_pmf:
            max_pmf = pmf
            goals = i
    return goals
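Note that `exp_goals` returns the single most likely goal count (the mode of the distribution), not the mean itself. The sketch below illustrates the consequence using only the standard library instead of SciPy; `poisson_pmf` and `most_likely_goals` are illustrative names, and the means 1.2 and 1.9 are made-up inputs.

```python
import math


def poisson_pmf(k, mean):
    """Probability of exactly k goals under a Poisson distribution."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def most_likely_goals(mean, max_goals=6):
    """Goal count in 0..max_goals with the highest Poisson probability."""
    return max(range(max_goals + 1), key=lambda k: poisson_pmf(k, mean))


# Two noticeably different expected-goal values share the same mode.
weaker = most_likely_goals(1.2)
stronger = most_likely_goals(1.9)
print("mode for 1.2: %d, mode for 1.9: %d" % (weaker, stronger))
```

Because the comparison is made on these rounded-down counts, a fixture is predicted as a draw whenever both modes coincide, even if one side has the clearly higher expected goals.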

In [17]:
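# Keep only the teams, the full-time score and the full-time result;
# .copy() lets us add a prediction column later without a pandas warning
test_13 = res_13.iloc[:, 0:5].copy()
test_13.head()
test_14 = res_14.iloc[:, 0:5].copy()
test_14.head()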

| | HomeTeam | AwayTeam | FTHG | FTAG | FTR |
|---|---|---|---|---|---|
| 0 | Almeria | Espanol | 1 | 1 | D |
| 1 | Granada | La Coruna | 2 | 1 | H |
| 2 | Malaga | Ath Bilbao | 1 | 0 | H |
| 3 | Sevilla | Valencia | 1 | 1 | D |
| 4 | Barcelona | Elche | 3 | 0 | H |


In [21]:
table_13[table_13['Team'] == 'Barcelona']
test_14['ER'] = ''


In [32]:
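results = []
for index, row in test_13.iterrows():
    home_team = table_13[table_13['Team'] == row['HomeTeam']]
    away_team = table_13[table_13['Team'] == row['AwayTeam']]
    # Guard for teams that have no strengths in table_13. The names listed are
    # English clubs, so with this La Liga data the else branch is never taken.
    if row.HomeTeam not in ['Leicester', 'QPR', 'Burnley'] and row.AwayTeam not in ['Leicester', 'QPR', 'Burnley']:
        # Expected home goals: home attack x away defence x league average home goals
        EH = home_team.HAS.values * away_team.ADS.values * avg_home_scored_13
        # Expected away goals: home defence x away attack x league average away goals
        # (avg_home_conceded_13 equals avg_away_scored_13)
        EA = home_team.HDS.values * away_team.AAS.values * avg_home_conceded_13
        home_goals = exp_goals(EH)
        away_goals = exp_goals(EA)
        if home_goals > away_goals:
            results.append('H')
        elif home_goals < away_goals:
            results.append('A')
        else:
            results.append('D')
    else:
        # No information about one of the teams: fall back to a draw
        results.append('D')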

In [33]:
len(results)

380

In [34]:
test_13['ER'] = results

In [35]:
from sklearn.metrics import accuracy_score

In [36]:
accuracy_score(test_13['FTR'], test_13['ER'])  # (y_true, y_pred)

0.52368421052631575
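The accuracy above is measured on the 2013-14 fixtures, the same season the strengths were computed from, so it is an in-sample figure. Scoring a later season would need a rule for teams that are missing from the strengths table. The following is only a rough sketch of one way to do that, using plain dictionaries rather than the notebook's DataFrames; `predict_result`, the `'PromotedTeam'` name and the fixture list are all hypothetical, while the strength values are copied from the table above.

```python
import math

AVG_HOME, AVG_AWAY = 1.63157894737, 1.11842105263  # league averages from above

# Two rows of the strengths table, copied from the notebook output.
strengths = {
    'Barcelona': {'HAS': 2.064516, 'AAS': 1.694118, 'HDS': 0.705882, 'ADS': 0.580645},
    'Almeria': {'HAS': 0.83871, 'AAS': 0.8, 'HDS': 1.458824, 'ADS': 1.290323},
}


def most_likely_goals(mean, max_goals=6):
    """Most probable goal count in 0..max_goals for a Poisson mean."""
    pmf = lambda k: math.exp(-mean) * mean ** k / math.factorial(k)
    return max(range(max_goals + 1), key=pmf)


def predict_result(home, away, table):
    """Return 'H', 'D' or 'A', or None if either team has no strengths."""
    if home not in table or away not in table:
        return None
    eh = table[home]['HAS'] * table[away]['ADS'] * AVG_HOME
    ea = table[home]['HDS'] * table[away]['AAS'] * AVG_AWAY
    gh, ga = most_likely_goals(eh), most_likely_goals(ea)
    return 'H' if gh > ga else 'A' if gh < ga else 'D'


# Hypothetical fixtures: (home, away, actual result). Not real match data.
fixtures = [('Barcelona', 'Almeria', 'H'), ('Almeria', 'PromotedTeam', 'D')]
scored = [(predict_result(h, a, strengths), actual) for h, a, actual in fixtures]
scored = [(pred, actual) for pred, actual in scored if pred is not None]
accuracy = sum(pred == actual for pred, actual in scored) / float(len(scored))
print("accuracy on %d scorable fixtures: %.2f" % (len(scored), accuracy))
```

Skipping unscorable fixtures, rather than defaulting them to a draw, keeps the accuracy figure from being influenced by an arbitrary fallback; whether that is the right choice depends on how the predictions will be used.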

### Prediction of number of goals

Taking two sample teams, we compute the probability of each number of goals that either side might score in this fixture.


In [54]:
team_1 = 'Barcelona'
team_2 = 'Elche'

home_team = table_13[table_13['Team'] == team_1]
away_team = table_13[table_13['Team'] == team_2]
EH = home_team.HAS.values * away_team.ADS.values * avg_home_scored_13
EA = home_team.HDS.values * away_team.AAS.values * avg_home_conceded_13
print EH, EA

[ 4.12903226] [ 0.63157895]


In [55]:
home_team.Team.values[0]

'Barcelona'

In [63]:
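# Probability (in percent) of scoring 0 to 5 goals for a given expected-goals value
def exp_goals_prob(mean):
    prob = []
    for i in range(0, 6):
        pmf = scipy.distributions.poisson.pmf(i, mean) * 100
        prob.append(pmf[0])  # mean is a one-element array, so take its single value
    return prob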

In [64]:
prob_goals = pd.DataFrame(columns=['Team','0','1','2','3','4','5'])
home_team_prob = exp_goals_prob(EH)
away_team_prob = exp_goals_prob(EA)

In [65]:
# One row per team: the team name followed by its six goal probabilities
prob_goals.loc[0] = [team_1] + home_team_prob
prob_goals.loc[1] = [team_2] + away_team_prob

In [66]:
prob_goals

| | Team | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|---|
| 0 | Barcelona | 1.60985 | 6.6471 | 13.723 | 18.8876 | 19.4969 | 16.1007 |
| 1 | Elche | 53.1752 | 33.5843 | 10.6056 | 2.23275 | 0.35254 | 0.0445313 |


To calculate the probability that the match ends 2-2, we multiply the probability that team_1 scores 2 goals by the probability that team_2 scores 2 goals, treating the two goal counts as independent. With the values in the table above this is 13.723% x 10.6056%, or about 1.46%.
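The multiplication can be sketched as follows. This is a standalone illustration that uses the standard library rather than SciPy, plugs in the expected goals printed earlier for Barcelona v Elche, and assumes the two teams' goal counts are independent; `poisson_pmf` and `scoreline_probability` are illustrative names.

```python
import math


def poisson_pmf(k, mean):
    """Probability of exactly k goals under a Poisson distribution."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def scoreline_probability(home_goals, away_goals, home_mean, away_mean):
    """Probability of an exact scoreline, assuming independent goal counts."""
    return poisson_pmf(home_goals, home_mean) * poisson_pmf(away_goals, away_mean)


EH, EA = 4.12903226, 0.63157895  # expected goals for Barcelona (home) and Elche (away)

p_2_2 = scoreline_probability(2, 2, EH, EA)
print("P(2-2) = %.2f%%" % (100 * p_2_2))
```

With these inputs the product of the two "2 goals" entries in the table comes to roughly 1.46%.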

Similarly, to calculate the probability of a draw, we calculate the probability of each drawn scoreline (0-0, 1-1, 2-2, etc.) and add them together.

Such an analysis lends itself directly to betting. Common markets include home win, draw, away win, over 2.5 goals and under 2.5 goals, and the probability of each of these outcomes can be calculated through Poisson Regression.
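One way to obtain these market probabilities is to sum the scoreline probabilities over every combination of goals. The sketch below does this with the standard library only, again assuming independent goal counts and truncating at 15 goals per team (an arbitrary cut-off chosen here because the remaining probability mass is negligible for these means). `market_probabilities` and its dictionary keys are illustrative names.

```python
import math


def poisson_pmf(k, mean):
    """Probability of exactly k goals under a Poisson distribution."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def market_probabilities(home_mean, away_mean, max_goals=15):
    """Aggregate scoreline probabilities into common betting markets."""
    probs = {'home': 0.0, 'draw': 0.0, 'away': 0.0,
             'over_2.5': 0.0, 'under_2.5': 0.0}
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, home_mean) * poisson_pmf(a, away_mean)
            outcome = 'home' if h > a else 'away' if h < a else 'draw'
            probs[outcome] += p
            # Three or more total goals counts as "over 2.5".
            probs['over_2.5' if h + a > 2 else 'under_2.5'] += p
    return probs


# Expected goals for Barcelona (home) v Elche (away) from the cell above.
markets = market_probabilities(4.12903226, 0.63157895)
for name in sorted(markets):
    print("%-10s %5.1f%%" % (name, 100 * markets[name]))
```

Comparing these probabilities with the implied probabilities of the bookmakers' odds (1 divided by the decimal odds) would be the natural next step, though that comparison is not carried out in this notebook.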