## Soccer Analytics - Predicting the attendances of English soccer matches
This notebook uses a dataset of English League 1 and League 2 matches from the 2017/18 season. I created the dataset with a series of scripts (Python and R).

The code for these scripts can be found in my github repo :-
https://github.com/usingdatascience/Predicting-Soccer-Attendances.  

The dataset file has the following features :-
* Date : Date of match
* HomeTeam : Home Team
* AwayTeam : Away Team
* Day_Eve : Is game a day or evening match ?
* Day Type : Is the game on a weekend or during week ?
* Holiday : Is the game played on a bank holiday ?
* Hol Type : Same as Holiday
* Capacity : Capacity of home team's ground
* Average Travelling Fans : Average number of travelling fans that away team takes (based on previous season)
* Cheapest Season T : Lowest Season ticket price for home team
* Home League Position : Current position at time of game, of home team
* Away League Position : Current position at time of game, of away team
* Form Home : Current form of the home team (based on last 5 matches)
* Form Away : Current form of the away team (based on last 5 matches)
* Distance : Distance between the home side's ground and the away team
* Temperature : Temperature on day of game
* Weather Event : ??
* Lowest Home Ticket Price : Lowest ticket price for a home fan
* Lowest Away Ticket Price : Lowest ticket price for an away fan
* Home PostCode : Postcode of home team
* Away PostCode : Postcode of away team
* Attendance : Attendance for the game
* Highest Home Ticket Price : Highest home ticket price that a fan can pay

This notebook is the first attempt at Multiple Linear Regression on this dataset.

In [None]:
# Load library
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

from sklearn.linear_model import LinearRegression
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge
from sklearn.linear_model import RidgeCV
import statsmodels.api as sm

In [None]:
# Load the dataset
df = pd.read_csv("../input/final_dataset_V2copy.csv")
df.head()

In [None]:
df_sub = df[['HomeTeam','AwayTeam','Day_Eve','Hol_Type','Day_Type','Capacity','Average_Travelling_Fans','Cheapest_Season_T',
             'Home_League_Position','Away_League_Position','Form_Home','Form_Away','Distance','Temperature','Lowest_Home_Ticket_Price',
             'Lowest_Away_Ticket_Price','Highest_Home_Ticket_Price','match_month','Attendance']].copy()  # copy so new columns can be added without a SettingWithCopyWarning
df_sub.iloc[30,:]

In [None]:
# convert the team and day/evening columns to numbers using label encoding
df_sub['HTeam']=df_sub['HomeTeam'].astype('category')
df_sub["HTeam"] = df_sub["HTeam"].cat.codes
df_sub['ATeam']=df_sub['AwayTeam'].astype('category')
df_sub["ATeam"] = df_sub["ATeam"].cat.codes
df_sub["DayEve"] = df_sub["Day_Eve"].astype('category')
df_sub["DayEve"] = df_sub["DayEve"].cat.codes
df_sub.head()

### Encoding of Categorical Variables :- HomeTeam + AwayTeam
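
As a small illustration of what `cat.codes` produces, the sketch below encodes a toy column and recovers the code-to-name mapping. The team names are just sample values, not rows from the dataset. Integer codes imply an ordering between teams when fed to a linear model, so the sketch also shows one-hot encoding via `pd.get_dummies` as a possible alternative, which is not what this notebook uses.

```python
import pandas as pd

# Toy stand-in for the HomeTeam column (illustrative values only)
teams = pd.Series(["Bury", "Accrington", "Bury", "Walsall"], name="HomeTeam")

# Label encoding: categories are sorted, then numbered from 0
cat = teams.astype("category")
codes = cat.cat.codes
mapping = dict(enumerate(cat.cat.categories))
print(mapping)
print(list(codes))

# Alternative: one 0/1 column per team, dropping the first to avoid redundancy
dummies = pd.get_dummies(teams, prefix="HTeam", drop_first=True)
print(list(dummies.columns))
```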

In [None]:
# show the encoded values/mappings
htcodes = df_sub[['HomeTeam','HTeam']]
htcodes = htcodes.drop_duplicates(['HomeTeam','HTeam'])
htcodes

In [None]:
# show the encoded values/mappings
atcodes = df_sub[['AwayTeam','ATeam']]
atcodes = atcodes.drop_duplicates(['AwayTeam','ATeam'])
atcodes

In [None]:
# show the encoded values/mappings
dayeve_codes = df_sub[['Day_Eve','DayEve']]
dayeve_codes = dayeve_codes.drop_duplicates(['Day_Eve','DayEve'])
dayeve_codes

In [None]:
# drop the original home and away team columns
df_sub = df_sub[['HTeam','ATeam','DayEve','Hol_Type','Day_Type','Capacity','Average_Travelling_Fans','Cheapest_Season_T',
             'Home_League_Position','Away_League_Position','Form_Home','Form_Away','Distance','Temperature','Lowest_Home_Ticket_Price',
             'Lowest_Away_Ticket_Price','Highest_Home_Ticket_Price','match_month','Attendance']]
df_sub.head()

In [None]:
X = np.array(df_sub[['Day_Type','Hol_Type','DayEve','Capacity','Average_Travelling_Fans','Cheapest_Season_T','Home_League_Position','Away_League_Position',
       'Form_Home','Form_Away','Distance','Temperature','Lowest_Home_Ticket_Price','Lowest_Away_Ticket_Price',
       'Highest_Home_Ticket_Price','match_month','HTeam','ATeam']])
y = df_sub.Attendance

### Checking for Multicollinearity
Multicollinearity is the presence of correlation among the independent variables. If predictors are correlated, it becomes extremely difficult for the model to determine the true effect of X on Y, or to work out which variable is contributing to the prediction of the response variable. To investigate whether any variables are correlated, produce a correlation table (n.b. scatter plots could also be used).

In [None]:
correlationMatrix = df_sub.corr().abs()

plt.subplots(figsize=(18, 8))
sns.heatmap(correlationMatrix,annot=True)

# Mask unimportant features
sns.heatmap(correlationMatrix, mask=correlationMatrix < 1, cbar=False)
plt.show()

Attendance is the response variable, so its correlations with the predictors can be ignored here. Among the predictors, it's only Capacity and "Highest Home Ticket Price" that seem to be correlated with each other.
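
A pairwise correlation table can miss a predictor that is explained by several others together. One common complement is the variance inflation factor (VIF), which for standardised predictors equals the diagonal of the inverse correlation matrix. The sketch below uses synthetic columns whose names only echo the dataset's features; the values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic predictors: ticket_price is deliberately built from capacity
capacity = rng.normal(size=n)
ticket_price = 0.9 * capacity + 0.3 * rng.normal(size=n)
distance = rng.normal(size=n)
X = np.column_stack([capacity, ticket_price, distance])

# VIF_j = 1 / (1 - R_j^2) = j-th diagonal element of the inverse correlation matrix
corr = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(corr))

for name, value in zip(["capacity", "ticket_price", "distance"], vif):
    print("{}: VIF = {:.2f}".format(name, value))
```

A VIF close to 1 means the predictor is nearly independent of the others; values above roughly 5 to 10 are often treated as a warning sign.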

The features are on very different scales (some are 0 or 1, some are in the hundreds, some in the thousands), so all of the X variables are scaled before modelling the data via regression.
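
For reference, the standardisation that `sklearn.preprocessing.scale` applies by default can be written out with NumPy: each column has its mean subtracted and is divided by its (population) standard deviation. The numbers below are invented purely to show columns on different scales.

```python
import numpy as np

# Three made-up columns: a 0/1 flag, a capacity-like value, a temperature-like value
X = np.array([
    [0.0,  5000.0, 12.5],
    [1.0, 12000.0, 20.0],
    [0.0,  8000.0, 15.0],
    [1.0, 20000.0, 25.0],
])

# Standardise column by column (axis=0)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # each column mean is ~0
print(X_std.std(axis=0))   # each column standard deviation is ~1
```

Checking the mean and standard deviation per column (`axis=0`) is more informative than a single figure over the whole array, because it confirms every feature was rescaled.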

In [None]:
# Scale the features: X_scaled
X_scaled = scale(X)
X_scaled

In [None]:
# Print the mean and standard deviation of the unscaled features
print("Mean of Unscaled Features: {}".format(np.mean(X))) 
print("Standard Deviation of Unscaled Features: {}".format(np.std(X)))

# Print the mean and standard deviation of the scaled features
print("Mean of Scaled Features: {}".format(np.mean(X_scaled))) 
print("Standard Deviation of Scaled Features: {}".format(np.std(X_scaled)))

In [None]:
# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.3, random_state=42)

# Create the regressor: reg_all
reg_all = LinearRegression()

In [None]:
# Fit the regressor to the training data
reg_all.fit(X_train,y_train)

# Predict on the test data: y_pred
y_pred = reg_all.predict(X_test)

# Compute and print R^2 and RMSE
print("R^2: {}".format(reg_all.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test,y_pred))
print("Root Mean Squared Error: {}".format(rmse))

### Model Evaluation Metrics for Regression
* MAE = Mean Absolute Error
* MSE = Mean Squared Error
* RMSE = Root Mean Squared Error

Comparing these metrics:

* MAE is the easiest to understand, because it's the average error.
* MSE is more popular than MAE, because MSE "punishes" larger errors.
* RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.

Mean Absolute Error (MAE) is the mean of the absolute value of the errors:

$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$


Mean Squared Error (MSE) is the mean of the squared errors:

$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$


Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:

$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$
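
The three formulas above translate directly into NumPy. The attendance figures here are invented for the example and are not taken from the dataset.

```python
import numpy as np

# Made-up actual and predicted attendances
y_true = np.array([100.0, 200.0, 300.0])
y_hat = np.array([110.0, 190.0, 330.0])

errors = y_true - y_hat
mae = np.mean(np.abs(errors))   # mean of absolute errors
mse = np.mean(errors ** 2)      # mean of squared errors
rmse = np.sqrt(mse)             # back in the units of y

print("MAE: {:.3f}".format(mae))
print("MSE: {:.3f}".format(mse))
print("RMSE: {:.3f}".format(rmse))
```

The single large error (30) dominates MSE and RMSE far more than it does MAE, which is the "punishing" effect described above.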

In [None]:
# calculate MAE using scikit-learn
from sklearn import metrics
print(metrics.mean_absolute_error(y_test, y_pred))

In [None]:
# calculate MSE using scikit-learn
print(metrics.mean_squared_error(y_test, y_pred))

In [None]:
# calculate RMSE using scikit-learn
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

In [None]:
import matplotlib.pyplot as plt

# regression coefficients
print('Coefficients: \n', reg_all.coef_)
 
# R^2 score: 1 means perfect prediction
print('R^2 score: {}'.format(reg_all.score(X_test, y_test)))
 
# plot for residual error
 
## setting plot style
plt.style.use('fivethirtyeight')
 
## plotting residual errors in training data
plt.scatter(reg_all.predict(X_train), reg_all.predict(X_train) - y_train,
            color = "green", s = 10, label = 'Train data')
 
## plotting residual errors in test data
plt.scatter(reg_all.predict(X_test), reg_all.predict(X_test) - y_test,
            color = "blue", s = 10, label = 'Test data')
 
## plotting line for zero residual error
plt.axhline(y = 0, linewidth = 2)  # span the full x-axis; predicted attendances are far larger than 50
 
## plotting legend
plt.legend(loc = 'upper right')
 
## plot title
plt.title("Residual errors")
 
## function to show plot
plt.show()

### The above plot suggests non-linearity in the data, which means the model does not capture nonlinear effects. The funnel shape in the plot indicates non-constant variance, i.e. heteroskedasticity.

In [None]:
residuals = y_test - y_pred
residuals.describe()

In [None]:
import scipy.stats as stats
plt.figure(figsize=(9,9))
stats.probplot(residuals, dist="norm", plot=plt)

### If residuals are normally distributed, then they should lie along the straight line on the Q-Q plot. The residuals here appear to not fully follow the normality line. This is an indication that this model/straight line may not be sufficient to predict attendances.
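
How closely the points follow the line can be summarised by the correlation coefficient that `scipy.stats.probplot` returns alongside the fitted line. The sketch below compares normally distributed values with skewed ones; both samples are synthetic and stand in for residuals.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Synthetic "residuals": one normal sample, one right-skewed sample
normal_resid = rng.normal(size=500)
skewed_resid = rng.exponential(size=500) - 1.0

# probplot returns ((theoretical quantiles, ordered values), (slope, intercept, r))
(_, _), (_, _, r_normal) = stats.probplot(normal_resid, dist="norm")
(_, _), (_, _, r_skewed) = stats.probplot(skewed_resid, dist="norm")

print("r for normal sample: {:.4f}".format(r_normal))
print("r for skewed sample: {:.4f}".format(r_skewed))
```

A value of r very close to 1 corresponds to points hugging the line; the skewed sample gives a visibly lower value.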


### Now below try backwards elimination

In [None]:
x = df_sub[['Day_Type','Hol_Type','DayEve','Capacity','Average_Travelling_Fans','Cheapest_Season_T','Home_League_Position','Away_League_Position',
       'Form_Home','Form_Away','Distance','Temperature','Lowest_Home_Ticket_Price','Lowest_Away_Ticket_Price',
       'Highest_Home_Ticket_Price','match_month','HTeam','ATeam']]
y = df_sub.Attendance

In [None]:
x2 = sm.add_constant(x)
regressor_OLS = sm.OLS(endog = y, exog = x2).fit()  # the intercept comes from add_constant; OLS has no fit_intercept argument
regressor_OLS.summary()

In [None]:
# remove Temperature as its p-value is greater than 0.05
x = df_sub[['Day_Type','Hol_Type','DayEve','Capacity','Average_Travelling_Fans','Cheapest_Season_T','Home_League_Position','Away_League_Position',
       'Form_Home','Form_Away','Distance','Lowest_Home_Ticket_Price','Lowest_Away_Ticket_Price',
       'Highest_Home_Ticket_Price','match_month','HTeam','ATeam']]
x2 = sm.add_constant(x)
regressor_OLS = sm.OLS(endog = y, exog = x2).fit()
regressor_OLS.summary()

In [None]:
# remove Away_League_Position as its p-value is greater than 0.05
x = df_sub[['Day_Type','Hol_Type','DayEve','Capacity','Average_Travelling_Fans','Cheapest_Season_T','Home_League_Position',
       'Form_Home','Form_Away','Distance','Lowest_Home_Ticket_Price','Lowest_Away_Ticket_Price',
       'Highest_Home_Ticket_Price','match_month','HTeam','ATeam']]
x2 = sm.add_constant(x)
regressor_OLS = sm.OLS(endog = y, exog = x2).fit()
regressor_OLS.summary()

In [None]:
# remove Lowest_Home_Ticket_Price as its p-value is greater than 0.05
x = df_sub[['Day_Type','Hol_Type','DayEve','Capacity','Average_Travelling_Fans','Cheapest_Season_T','Home_League_Position',
       'Form_Home','Form_Away','Distance','Lowest_Away_Ticket_Price',
       'Highest_Home_Ticket_Price','match_month','HTeam','ATeam']]
x2 = sm.add_constant(x)
regressor_OLS = sm.OLS(endog = y, exog = x2).fit()
regressor_OLS.summary()

In [None]:
# remove Form_Away as its p-value is greater than 0.05
x = df_sub[['Day_Type','Hol_Type','DayEve','Capacity','Average_Travelling_Fans','Cheapest_Season_T','Home_League_Position',
       'Form_Home','Distance','Lowest_Away_Ticket_Price',
       'Highest_Home_Ticket_Price','match_month','HTeam','ATeam']]
x2 = sm.add_constant(x)
regressor_OLS = sm.OLS(endog = y, exog = x2).fit()
regressor_OLS.summary()

In [None]:
# remove Lowest_Away_Ticket_Price as its p-value is greater than 0.05
x = df_sub[['Day_Type','Hol_Type','DayEve','Capacity','Average_Travelling_Fans','Cheapest_Season_T','Home_League_Position',
       'Form_Home','Distance',
       'Highest_Home_Ticket_Price','match_month','HTeam','ATeam']]
x2 = sm.add_constant(x)
regressor_OLS = sm.OLS(endog = y, exog = x2).fit()
regressor_OLS.summary()

In [None]:
# remove ATeam as its p-value is greater than 0.05
x = df_sub[['Day_Type','Hol_Type','DayEve','Capacity','Average_Travelling_Fans','Cheapest_Season_T','Home_League_Position',
       'Form_Home','Distance',
       'Highest_Home_Ticket_Price','match_month','HTeam']]
x2 = sm.add_constant(x)
regressor_OLS = sm.OLS(endog = y, exog = x2).fit()
regressor_OLS.summary()

In [None]:
# remove Distance as its p-value is greater than 0.05
x = df_sub[['Day_Type','Hol_Type','DayEve','Capacity','Average_Travelling_Fans','Cheapest_Season_T','Home_League_Position',
       'Form_Home','Highest_Home_Ticket_Price','match_month','HTeam']]
x2 = sm.add_constant(x)
regressor_OLS = sm.OLS(endog = y, exog = x2).fit()
regressor_OLS.summary()

In [None]:
# remove match_month as its p-value is greater than 0.05
x = df_sub[['Day_Type','Hol_Type','DayEve','Capacity','Average_Travelling_Fans','Cheapest_Season_T','Home_League_Position',
       'Form_Home','Highest_Home_Ticket_Price','HTeam']]
x2 = sm.add_constant(x)
regressor_OLS = sm.OLS(endog = y, exog = x2).fit()
regressor_OLS.summary()

In [None]:
# summarize our model
regOLS_model_summary = regressor_OLS.summary()
regOLS_model_summary

In [None]:
fig = plt.figure(figsize=(20,24))
fig = sm.graphics.plot_partregress_grid(regressor_OLS, fig=fig)

In [None]:
# seaborn residual plot
sns.residplot(x=regressor_OLS.fittedvalues, y=df_sub['Attendance'], lowess=True, line_kws={'color':'r', 'lw':1})
plt.title('Residual plot')
plt.xlabel('Predicted values')
plt.ylabel('Residuals');

### The above graph suggests non-linearity in the data. A funnel shape in the plot is a sign of non-constant variance, i.e. heteroskedasticity. To rectify this, one could try transforming Y using ln or sqrt.

In [None]:
fig, ax = plt.subplots(figsize=(24,16))
fig = sm.graphics.influence_plot(regressor_OLS, ax=ax, criterion="cooks")

### The above plot highlights possible outliers. Any observation with a studentized residual greater than 3 in absolute value is a possible outlier. One possible action is to remove these from the dataset.
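
To make the quantities behind the influence plot concrete, the sketch below computes leverage, internally studentized residuals and Cook's distance by hand for a simple regression with one planted outlier. It is an illustration on synthetic data, not a replacement for the statsmodels plot (which by default uses the closely related externally studentized residuals).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

# Synthetic straight-line data with one planted outlier at index 10
x_demo = rng.normal(size=n)
y_demo = 1.0 + 2.0 * x_demo + rng.normal(size=n)
y_demo[10] += 15.0

X_design = np.column_stack([np.ones(n), x_demo])  # intercept + slope
p = X_design.shape[1]

beta, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)
resid = y_demo - X_design @ beta

# Leverage: diagonal of the hat matrix X (X'X)^-1 X'
leverage = np.einsum("ij,jk,ik->i", X_design, np.linalg.inv(X_design.T @ X_design), X_design)

s2 = resid @ resid / (n - p)                        # residual variance estimate
student = resid / np.sqrt(s2 * (1.0 - leverage))    # internally studentized residuals
cooks_d = student ** 2 / p * leverage / (1.0 - leverage)

flagged = np.flatnonzero(np.abs(student) > 3)
print("Flagged observations:", flagged)
print("Largest Cook's distance at index:", int(np.argmax(cooks_d)))
```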

In [None]:
# Q-Q plot for normality
figqq= sm.qqplot(regressor_OLS.resid, line='r')

### This plot shows whether the residuals are normally distributed. Do the residuals follow the straight line well, or do they deviate severely? It's good if the residuals line up closely along the straight red line.

### Now try backwards elimination, but with LOG(Y) used for Response variable Y
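
When the response is modelled as log(Y), coefficients act multiplicatively on Y and predictions have to be exponentiated to get back to attendance units. The sketch below shows both steps on synthetic data; the true slope of 0.3 and the other numbers are assumptions chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data where log(y) is linear in x
x_demo = rng.uniform(0, 10, size=200)
y_demo = np.exp(7.0 + 0.3 * x_demo + rng.normal(scale=0.1, size=200))

# Fit a straight line to log(y); polyfit returns [slope, intercept] for deg=1
slope, intercept = np.polyfit(x_demo, np.log(y_demo), deg=1)

# Back-transform predictions to the original units
y_hat = np.exp(intercept + slope * x_demo)

# A one-unit increase in x multiplies y by exp(slope)
pct_change = (np.exp(slope) - 1.0) * 100.0
print("Estimated slope on the log scale: {:.3f}".format(slope))
print("Approximate % change in y per unit of x: {:.1f}%".format(pct_change))
```

Note that exponentiating a prediction made on the log scale estimates the median rather than the mean of Y when the log-scale errors are symmetric, so back-transformed predictions tend to sit slightly below the average outcome.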

In [None]:
x = df_sub[['Day_Type','Hol_Type','DayEve','Capacity','Average_Travelling_Fans','Cheapest_Season_T','Home_League_Position','Away_League_Position',
       'Form_Home','Form_Away','Distance','Temperature','Lowest_Home_Ticket_Price','Lowest_Away_Ticket_Price',
       'Highest_Home_Ticket_Price','match_month','HTeam','ATeam']]
y = np.log(df_sub.Attendance)  # log of the raw attendance, so re-running this cell does not log twice
x2 = sm.add_constant(x)
regressor_OLS = sm.OLS(endog = y, exog = x2).fit()
regressor_OLS.summary()

In [None]:
# remove temperature variable
x = df_sub[['Day_Type','Hol_Type','DayEve','Capacity','Average_Travelling_Fans','Cheapest_Season_T','Home_League_Position','Away_League_Position',
       'Form_Home','Form_Away','Distance','Lowest_Home_Ticket_Price','Lowest_Away_Ticket_Price',
       'Highest_Home_Ticket_Price','match_month','HTeam','ATeam']]
x2 = sm.add_constant(x)
regressor_OLS = sm.OLS(endog = y, exog = x2).fit()
regressor_OLS.summary()

In [None]:
# remove form_away variable
x = df_sub[['Day_Type','Hol_Type','DayEve','Capacity','Average_Travelling_Fans','Cheapest_Season_T','Home_League_Position','Away_League_Position',
       'Form_Home','Distance','Lowest_Home_Ticket_Price','Lowest_Away_Ticket_Price',
       'Highest_Home_Ticket_Price','match_month','HTeam','ATeam']]
x2 = sm.add_constant(x)
regressor_OLS = sm.OLS(endog = y, exog = x2).fit()
regressor_OLS.summary()

In [None]:
# remove HTeam variable
x = df_sub[['Day_Type','Hol_Type','DayEve','Capacity','Average_Travelling_Fans','Cheapest_Season_T','Home_League_Position','Away_League_Position',
       'Form_Home','Distance','Lowest_Home_Ticket_Price','Lowest_Away_Ticket_Price',
       'Highest_Home_Ticket_Price','match_month','ATeam']]
x2 = sm.add_constant(x)
regressor_OLS = sm.OLS(endog = y, exog = x2).fit()
regressor_OLS.summary()

In [None]:
# remove Cheapest_Season_T variable
x = df_sub[['Day_Type','Hol_Type','DayEve','Capacity','Average_Travelling_Fans','Home_League_Position','Away_League_Position',
       'Form_Home','Distance','Lowest_Home_Ticket_Price','Lowest_Away_Ticket_Price',
       'Highest_Home_Ticket_Price','match_month','ATeam']]
x2 = sm.add_constant(x)
regressor_OLS = sm.OLS(endog = y, exog = x2).fit()
regressor_OLS.summary()

In [None]:
# remove Away_League_Position variable
x = df_sub[['Day_Type','Hol_Type','DayEve','Capacity','Average_Travelling_Fans','Home_League_Position',
       'Form_Home','Distance','Lowest_Home_Ticket_Price','Lowest_Away_Ticket_Price',
       'Highest_Home_Ticket_Price','match_month','ATeam']]
x2 = sm.add_constant(x)
regressor_OLS = sm.OLS(endog = y, exog = x2).fit()
regressor_OLS.summary()

In [None]:
# remove Form_Home variable
x = df_sub[['Day_Type','Hol_Type','DayEve','Capacity','Average_Travelling_Fans','Home_League_Position',
       'Distance','Lowest_Home_Ticket_Price','Lowest_Away_Ticket_Price',
       'Highest_Home_Ticket_Price','match_month','ATeam']]
x2 = sm.add_constant(x)
regressor_OLS = sm.OLS(endog = y, exog = x2).fit()
regressor_OLS.summary()

In [None]:
# remove match_month variable
x = df_sub[['Day_Type','Hol_Type','DayEve','Capacity','Average_Travelling_Fans','Home_League_Position',
       'Distance','Lowest_Home_Ticket_Price','Lowest_Away_Ticket_Price',
       'Highest_Home_Ticket_Price','ATeam']]
x2 = sm.add_constant(x)
regressor_OLS = sm.OLS(endog = y, exog = x2).fit()
regressor_OLS.summary()

In [None]:
# remove ATeam variable
x = df_sub[['Day_Type','Hol_Type','DayEve','Capacity','Average_Travelling_Fans','Home_League_Position',
       'Distance','Lowest_Home_Ticket_Price','Lowest_Away_Ticket_Price',
       'Highest_Home_Ticket_Price']]
x2 = sm.add_constant(x)
regressor_OLS = sm.OLS(endog = y, exog = x2).fit()
regressor_OLS.summary()

In [None]:
fig = plt.figure(figsize=(20,24))
fig = sm.graphics.plot_partregress_grid(regressor_OLS, fig=fig)

In [None]:
# seaborn residual plot
sns.residplot(x=regressor_OLS.fittedvalues, y=y, lowess=True, line_kws={'color':'r', 'lw':1})
plt.title('Residual plot')
plt.xlabel('Predicted values')
plt.ylabel('Residuals');

In [None]:
fig, ax = plt.subplots(figsize=(24,16))
fig = sm.graphics.influence_plot(regressor_OLS, ax=ax, criterion="cooks")

In [None]:
# Q-Q plot for normality
figqq= sm.qqplot(regressor_OLS.resid, line='r')

### This plot shows whether the residuals are normally distributed. Do the residuals follow the straight line well, or do they deviate severely? It's good if the residuals line up closely along the straight red line.