### Predictive Sport: probability of goals, shots, and more 


*Context:* 

A football game generates many events, and it is both important and interesting to take into account the context in which those events occurred. This dataset should keep sports analytics enthusiasts awake for long hours, as the number of questions that can be asked is huge.

Read <a href=http://crabstats.blogspot.com/>these blogs</a> to get a good understanding of soccer/football stats 


*Data description:* 

Nearly 25,000 soccer games from leagues all over the world. The fields in the data set are:

- Columns A to E contain information about the league, home and away teams, date, etc.
- Columns F, G and H contain the odds for the home win, draw and away win.
- Columns I to BQ contain the team statistics.

Home team stats are prefixed with "h"; similarly, away team stats are prefixed with "a". Examples include ladder position (a term for a team's rank in a group; see <a href=https://www.flashscore.com.au/football/europe/euro/standings/>here</a> for an example), games played, goals conceded, away games won, etc. Columns BR to CA contain final result information: the result, the full-time result and, if available, the half-time score.

For each game there is: 
1. Statistics on the two teams, such as ladder position, win-loss history, games played 
2. Odds for home win, draw, away win (zero when the odds are not available)
3. The result for that game (including the half-time result if available)
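
Because missing odds are stored as zero rather than left blank, it may help to convert them to NaN before computing any statistics. A minimal sketch, assuming the odds columns are named `home_odd`, `draw_odd` and `away_odd` (as in the `dfball.info()` listing) and using a small invented frame in place of the real data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "home_odd": [1.80, 0.0, 2.50],
    "draw_odd": [3.40, 0.0, 3.10],
    "away_odd": [4.50, 0.0, 2.90],
})

odds_cols = ["home_odd", "draw_odd", "away_odd"]

# A zero odd means "not available", so treat it as missing.
cleaned = toy.copy()
cleaned[odds_cols] = cleaned[odds_cols].replace(0, np.nan)

# Number of games without odds, per column.
missing_per_column = cleaned[odds_cols].isna().sum()
print(missing_per_column)
```

With the zeros converted, `isna()` and `dropna()` treat missing odds the same way as any other missing value.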

The dataset ranges from January 2016 to October 2017 and the statistics have been sourced from a few different websites. Odds come from BET365 and the results have been manually entered from http://www.soccerstats.com

Get more insight into the columns in the data by hovering your mouse over the names <a href=https://www.soccerstats.com/latest.asp?league=germany3>here</a>

*Data Location:* 
- https://www.kaggle.com/frankpac/soccerdata
- reference for column names

*Motivations:*

1. Predictive model - can a predictive model be created from this dataset, or are the results completely random?
2. Probability - is it possible to calculate the probability of a home win, draw or away win based on this dataset?
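
One simple baseline for the probability question is to read probabilities off the bookmaker odds. The sketch below assumes the odds are decimal odds (the source does not state the format) and normalizes the inverse odds so that the three outcomes sum to one. `implied_probabilities` is a hypothetical helper, not part of the dataset or of any library.

```python
def implied_probabilities(home_odd, draw_odd, away_odd):
    """Convert decimal odds into probabilities that sum to 1.

    Returns None when any odd is zero or negative, since zero marks
    missing odds in this dataset.
    """
    odds = (home_odd, draw_odd, away_odd)
    if any(o <= 0 for o in odds):
        return None
    inverse = [1.0 / o for o in odds]  # raw implied probabilities
    total = sum(inverse)               # exceeds 1 when the bookmaker keeps a margin
    return [p / total for p in inverse]


# Invented example odds: home win 2.0, draw 3.5, away win 4.0
probs = implied_probabilities(2.0, 3.5, 4.0)
print(probs)
```

Comparing these implied probabilities with the observed outcome frequencies is one way to judge whether a model adds anything beyond the odds themselves.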

#### Objectives:

Exploratory analysis:
- Understand which leagues & teams are represented in this data set - use histograms and pandas groupbys to get a sense of this
- Does the home team have a higher chance of winning? What about draws and losses?
- How are predicted odds correlated to final results?
- How many teams played more than 10 games (at home or away)?
- How are the home and away statistics distributed? Is there any imbalance in the data set?
- Be creative and explore the data set: make histograms and line (trend) plots to understand the nature of the data
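
The home-advantage question can be sketched as follows. This assumes `h_final` and `a_final` hold the full-time goals of the home and away team (an interpretation of the column names, not confirmed by the source); the toy frame is invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented scores standing in for dfball[['h_final', 'a_final']].
toy = pd.DataFrame({
    "h_final": [2, 1, 0, 3, 1],
    "a_final": [1, 1, 2, 0, 1],
})

# Label each game by comparing home and away goals.
toy["outcome"] = np.select(
    [toy["h_final"] > toy["a_final"], toy["h_final"] < toy["a_final"]],
    ["home_win", "away_win"],
    default="draw",
)

# Share of games ending in each outcome.
rates = toy["outcome"].value_counts(normalize=True)
print(rates)
```

On the real data, the same `rates` series could be computed per league with a `groupby('league')` to check whether home advantage varies between competitions.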


Predictive analysis: 
- which column has the most missing values? (note that the *_odds columns are zero when no data is available)
- is it feasible to drop all rows with missing values? If not, try to fill missing values with an appropriate strategy
- drop columns with more than 50% missing values
- consider dropping teams with  
- split the data into training (60%), validation (30%) and test (10%) sets
- there are 81 columns; which columns are highly correlated? (you may get insight on this from your exploratory analysis, e.g. by plotting covariance matrices)
- create predictive models for football games in order to bet on football outcomes. This involves training a model on a subset of features, e.g. (home, odd, ladder), and predicting a target variable, e.g. win, draw, loss. Consider the following three algorithms, and comment on their performance if you are able to run two or three of them
    - random forest classifier
    - xgboost 
    - neural networks
- visualize the predicted outcomes (wins vs losses) together with the ground truth for the validation set
- make sure your model is not overfitting the training data - the performance you obtain on the training data should be statistically similar to the performance on the validation data. Otherwise, the model is overfitting the training set and not generalizing


- Given your exploratory and predictive analysis, comment on what else you can do with the current data?
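
The column-dropping and splitting steps listed above could look roughly like this. It is a sketch on invented data: `split_frame` is a hypothetical helper, `sparse_col` is a made-up column, and the fixed seed is only there to make the shuffle repeatable.

```python
import numpy as np
import pandas as pd

# Invented data: 100 rows, one mostly-empty column.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "home_odd": rng.uniform(1.2, 5.0, size=100),
    "h_ladder": rng.integers(1, 21, size=100),
    "sparse_col": [np.nan] * 80 + [1.0] * 20,
})

# Drop columns with more than 50% missing values.
missing_share = toy.isna().mean()
kept = toy.loc[:, missing_share <= 0.5]


def split_frame(df, train=0.6, val=0.3, seed=0):
    """Shuffle the rows and split them into train/validation/test parts."""
    shuffled = df.sample(frac=1.0, random_state=seed)
    n_train = int(round(len(shuffled) * train))
    n_val = int(round(len(shuffled) * val))
    return (
        shuffled.iloc[:n_train],
        shuffled.iloc[n_train:n_train + n_val],
        shuffled.iloc[n_train + n_val:],  # remainder, about 10%
    )


train_df, val_df, test_df = split_frame(kept)
print(len(train_df), len(val_df), len(test_df))
```

A random shuffle is the simplest choice. Since the games are dated, splitting by date instead (older games for training, newer ones for validation and test) may be worth considering so that the model is never evaluated on games that precede its training data.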

In [8]:
import pandas as pd
import numpy as np
import os, sys
import matplotlib.pyplot as plt

%matplotlib inline

In [3]:
dfball = pd.read_excel('data/SoccerData.xlsx')

In [4]:
dfball.shape

(24830, 81)

In [22]:
dfball.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24830 entries, 0 to 24829
Data columns (total 81 columns):
league           24830 non-null object
teams_no         24830 non-null int64
date             24830 non-null datetime64[ns]
home_team        24830 non-null object
away_team        24830 non-null object
home_odd         24830 non-null float64
draw_odd         24830 non-null float64
away_odd         24830 non-null float64
h_played         24830 non-null int64
a_played         24830 non-null int64
ph_ladder5       24830 non-null int64
ph_ladder4       24830 non-null int64
ph_ladder3       24830 non-null int64
ph_ladder2       24830 non-null int64
ph_ladder1       24830 non-null int64
h_ladder         24830 non-null int64
pa_ladder5       24830 non-null int64
pa_ladder4       24830 non-null int64
pa_ladder3       24830 non-null int64
pa_ladder2       24830 non-null int64
pa_ladder1       24830 non-null int64
a_ladder         24830 non-null int64
h_won            24830 non-null int64

In [26]:
c = ['h_final','a_final','h_form','h_elo','a_drawn','h_played','h_ladder','h_scored_h','h_points_h','h_played_h','h_offensiv','h_clean_h','h_goal_signal']
dfball[c]

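| | h_final | a_final | h_form | h_elo | a_drawn | h_played | h_ladder | h_scored_h | h_points_h | h_played_h | h_offensiv | h_clean_h | h_goal_signal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.0 | 2 | 1534 | 4 | 19 | 18 | 9 | 13 | 9 | 20 | 55.556 | 0.444444 |
| 1 | 1 | 1.0 | 6 | 1643 | 7 | 19 | 8 | 12 | 16 | 9 | 12 | 44.444 | 0.444444 |
| 2 | 5 | 1.0 | -12 | 1589 | 2 | 19 | 10 | 19 | 19 | 10 | 3 | 0.000 | 0.500000 |
| 3 | 2 | 2.0 | -1 | 1592 | 4 | 19 | 14 | 9 | 11 | 10 | 14 | 30.000 | 0.000000 |
| 4 | 4 | 1.0 | 17 | 1681 | 6 | 19 | 3 | 17 | 21 | 10 | 4 | 40.000 | 0.800000 |
| 5 | 1 | 0.0 | 4 | 1799 | 7 | 19 | 1 | 17 | 20 | 9 | 2 | 44.444 | 1.333333 |
| 6 | 2 | 0.0 | 0 | 1514 | 5 | 19 | 19 | 9 | 10 | 9 | 10 | 22.222 | -0.111111 |
| 7 | 1 | 3.0 | -11 | 1529 | 9 | 19 | 20 | 7 | 9 | 10 | 19 | 30.000 | -0.200000 |
| 8 | 2 | 0.0 | 6 | 2224 | 4 | 19 | 1 | 27 | 22 | 8 | 1 | 37.500 | 2.625000 |
| 9 | 1 | 0.0 | 7 | 1637 | 7 | 19 | 4 | 13 | 18 | 9 | 6 | 55.556 | 1.000000 |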


In [17]:
df = dfball[['home_team','home_odd','h_final']]
df2 = df.groupby('home_team').count()

In [18]:
df2.sort_values('home_odd',ascending=False)

| home_team | home_odd |
|---|---|
| Coventry City | 45 |
| Red Star | 45 |
| Rochdale | 43 |
| Oxford Utd | 43 |
| Bradford | 42 |
| Plymouth | 42 |
| Swindon | 42 |
| Walsall | 42 |
| Accrington | 42 |
| Wigan | 42 |
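
The count above covers home games only. To answer the question about teams with more than 10 games at home or away, the two team columns can be counted separately and added together. A sketch on invented fixtures (the team names and the threshold of 2 are placeholders chosen to keep the example small):

```python
import pandas as pd

# Invented fixtures standing in for dfball[['home_team', 'away_team']].
toy = pd.DataFrame({
    "home_team": ["A", "A", "B", "C", "A"],
    "away_team": ["B", "C", "A", "A", "B"],
})

home_games = toy["home_team"].value_counts()
away_games = toy["away_team"].value_counts()

# Teams missing from one side get 0 instead of NaN.
total_games = home_games.add(away_games, fill_value=0).astype(int)

threshold = 2  # use 10 on the real data
busy_teams = total_games[total_games > threshold]
print(busy_teams)
```

Note that team names are only unique within a league if two leagues happen to share a club name, so grouping by `league` as well may be safer on the full dataset.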


In [15]:
df2

| home_team | home_odd |
|---|---|
| 1860 Munich | 57.672 |
| A.O Xanthi | 13.420 |
| ABC | 16.880 |
| AD Fafe | 33.109 |
| AEK Athens | 29.313 |
| AFC Eskilstuna | 7.010 |
| AIK | 39.377 |
| AZ | 3.100 |
| Aachen | 0.000 |
| Aalborg | 50.554 |
