# Capstone Project: Milestone Report

## The problem

In this project we look at predicting the outcome of a soccer game from one of the major European leagues. Our client is a betting agency that wants to improve its predictive power in order to produce better odds for the games. Our benchmark is the average correctness of the major betting agencies accepting bets on European soccer. As odds are inversely proportional to the probability of a certain outcome (Home win, Draw, or Away win), lower odds correspond to a higher probability of that outcome. By taking the outcome with the lowest odds as the betting agencies' prediction, we measure their prediction against the actual outcome of a game. Our goal is to produce a prediction model that does better than the average consensus of the agencies.
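
The benchmark described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the odds are taken to be decimal odds, the function names are hypothetical, the sample matches are made up, and normalizing the inverse odds so that they sum to 1 is one common convention rather than something this report prescribes.

```python
def implied_probabilities(odds_home, odds_draw, odds_away):
    """Turn decimal odds into probabilities that sum to 1 (assumed convention)."""
    raw = {"H": 1.0 / odds_home, "D": 1.0 / odds_draw, "A": 1.0 / odds_away}
    total = sum(raw.values())  # usually above 1 because of the agency's margin
    return {outcome: value / total for outcome, value in raw.items()}


def favorite(odds_home, odds_draw, odds_away):
    """Return the outcome with the lowest odds: 'H', 'D' or 'A'."""
    odds = {"H": odds_home, "D": odds_draw, "A": odds_away}
    return min(odds, key=odds.get)


def actual_outcome(home_goals, away_goals):
    """Return the real result of a game from the goals scored."""
    if home_goals > away_goals:
        return "H"
    if home_goals < away_goals:
        return "A"
    return "D"


def correctness(matches):
    """Share of games where the favorite matches the actual outcome.

    Each match is (odds_home, odds_draw, odds_away, home_goals, away_goals).
    """
    hits = sum(
        favorite(oh, od, oa) == actual_outcome(hg, ag)
        for oh, od, oa, hg, ag in matches
    )
    return hits / len(matches)


# Made-up sample data, for illustration only.
sample_matches = [
    (1.50, 4.00, 6.50, 2, 0),  # home favorite, home win
    (2.60, 3.20, 2.70, 1, 1),  # home favorite, draw
    (4.50, 3.60, 1.80, 0, 1),  # away favorite, away win
]
sample_correctness = correctness(sample_matches)
```

On the three made-up games the favorite is right twice, so `sample_correctness` is 2/3.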

The leagues selected and the computed correctness of the betting agencies in each are shown in the graph below. The overall correctness is 53.25%.


## The data

The data comes from the Soccer Database at https://www.kaggle.com/hugomathien/soccer on the Kaggle website. This database is actually a collection of data from different sources:
1. http://football-data.mx-api.enetscores.com/ : scores, lineup, team formation and events

2. http://www.football-data.co.uk/ : betting odds

3. http://sofifa.com/ : players and teams attributes from EA Sports FIFA games

The database consists of 4 different tables: *Player*, *Player_Attributes*, *Team*, and *Match*. The Player table includes the name, birthdate, height and weight of the player associated with a particular player_id. The Team table contains the full and short (3 letters) names of a team. The Player_Attributes table comes from the EA Sports FIFA game and includes several characteristics: overall_ranking, preferred foot, goalkeeping abilities, shooting abilities, offensive and defensive rankings, and others, for a total of about 40. Finally, the Match table contains all details of games played from the 2008/09 season up to the 2015/16 season. It contains the teams involved, the goals scored, the team formations, the teams' starting 11, and betting odds from several betting agencies, for a total of 115 variables. Most of the information relating to events that happened during a game, such as free kicks awarded, shots on goal, or scoring players' names, is contained within a few columns in HTML format.

Most of the data is clean. There are only a few missing values, corresponding to a few games in the Italian Serie A during the 2010/11 season. In addition, the betting odds were not collected for two of the leagues: Poland Ekstraklasa and Switzerland Super League. These leagues will be excluded from the analysis.

## Preliminary Exploration

Our preliminary analysis of the data will look into five different areas: Upsets, Home Advantage, Stage Effect, Team Performance, and Goals Scored.

### Upsets

The goal here is to better understand how the betting agencies are doing in predicting outcomes for different leagues and different outcomes. The following graph shows how the betting agencies' consensus fares in each league for which betting odds are available.

![League Correctness](league_correctness.png)

A few interesting things stand out from the above bar chart. First of all, betting agencies almost never predict a draw. This means that they accept being wrong roughly one fourth of the time, as 25.3% of all games actually end in a draw. Of all games only 31 were 'expected' to most likely be a draw, 28 in the Italian league and 3 in the French league. This is something to consider if one wants to improve on the betting agencies' overall correctness rate, computed at 53.2%. It is important to note that betting agencies have several ways to 'hedge' against such predictions, which are reflected in the odds offered on the games. In other words, while they might favor one result over another, their overall odds strategy offsets the problem, as the difference between the odds will be small.

Another observation is that agencies generally do a better job of predicting a home win than an away win. The only exceptions are the Scottish league, where the opposite is true, and the Portuguese league, where the difference is minimal. No comparison should be made with predictions of a draw, given the very low number of such cases. Predicting a home win seems, intuitively, easier than predicting an away win, especially when the difference between the two teams is large. We will return to this point a little later to confirm our intuition.

To learn more from this graph we zoom in on the results and display horizontal lines at the values of the average predictions across all leagues (by predicted outcome of a game). In the following graph we drop the draw category, due to the small number of games and the resulting unreliability of the related statistics.

![League Correctness Zoom](league_correctness_zoom.png)

As we can see, the agencies seem to do better on certain leagues. The Spanish and the Portuguese leagues are both predicted correctly above average for both Home and Away wins. Other leagues have differing outcomes for Home and Away. Home wins are predicted better than average, and away wins worse than average, in Belgium, England, Italy and the Netherlands, while the opposite is true for Scotland. The hardest leagues to predict are the French and German leagues, where betting agencies do worse than average on both home and away wins.

### Home Advantage

We seek here confirmation of the intuition that home teams have what is usually referred to as home advantage, that is: home teams win more often than away teams. We look at the value of the difference between home and away wins' percentages, and goals scored, for teams in all leagues.

![Home Advantage and Goals](home_adv_goals.png)

There seems to be a clear home advantage, as home teams score almost half a goal more than away teams. In addition, home teams win about 46% of all games while away teams win only about 29%. The remaining 25%, as stated earlier, end in a draw.

We also looked at whether there is any relation between this result and how predictable a league is. To do this we first computed a weighted average for the correctness of predictions, including all predictions (Home, Draw, and Away), and then looked at whether the above 'home advantage' values correlate with league predictability. Once again we dropped the Polish and Swiss leagues, as we don't have betting odds for them. The Pearson test suggests a positive linear correlation between these variables, with Pearson's correlation coefficient r=.60 and p-value=.09. Looking at the graph below we see that the value for Scotland is somewhat of an outlier; indeed, once Scotland is removed the new values are r=.45 and p-value=.27.

![Regression Line](reg_line_win_diff_league_pred.png)

The p-value, in both tests, is too high to be able to state that the positive correlation observed is not due to randomness in the data. Nevertheless, the p-values might be high due to the few data points available. Having previously noticed that home wins are predicted correctly more than away wins we can still say that it seems to be somewhat easier to predict games when the home advantage is higher.
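
To make the role of the small sample concrete, here is a minimal sketch of how a Pearson coefficient and its test statistic relate. It assumes that the reported p-values come from the standard two-sided t-test on r with n-2 degrees of freedom, and that the nine leagues with betting odds were used (eight once Scotland is removed). The function names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """t statistic for testing r = 0 with n data points (n - 2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Values reported above; the league counts are an assumption.
t_all = t_statistic(0.60, 9)          # all leagues with odds
t_no_scotland = t_statistic(0.45, 8)  # Scotland removed
```

Under these assumptions the statistics are about 1.98 (7 degrees of freedom) and 1.23 (6 degrees of freedom), both below the usual two-sided 5% critical values of roughly 2.36 and 2.45. With so few leagues, even a correlation of .60 is not enough to reach significance.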

### Stage Effect

Is there any difference in home win percentage by stage number? That is, is it easier to win home games early or late in the season, or is there no difference at all? To answer these questions we limit ourselves to the 5 major European leagues: England Premier League, France Ligue 1, Germany Bundesliga, Italy Serie A and Spain Liga BBVA. Although we group our data by stage number, we still have about 80 games per stage number per league, so our results should be somewhat *robust*.

![Home Wins by Stage](home_wins_by_stage.png)

The top graph, displaying the home win percentage by stage, does not reveal anything unusual. Results seem to hover around the average for all stages, which for the top 5 leagues is 46.27%. The second graph looks at the (absolute) difference from the average in each league and, while showing some peaks, again does not seem to show any meaningful difference.

The variation among single stages is too high for us to pick up any meaningful difference, so we group stages and display the average of N consecutive stages, with N a parameter we set equal to 5, corresponding for most leagues to about a month of games. We also separate the plots and display each country's average home win percentage. Results are shown in the following graph.

![Home Wins by Stage](home_wins_by_stage_N5.png)

All graphs use the same axes limits so that it is easier to see any anomalies. It looks like the graph for France Ligue 1 doesn't display any pattern, while those for the other leagues do. In England and Germany home field advantage seems to be more important towards the end of the season, while in Italy the opposite is true. Finally, in Spain the best time to play at home is the middle of the season.

Another interesting observation is that home field advantage is generally less important through the first few games, as all graphs above are at or below their averages. Recall that the first point in each graph, at the 0 mark, corresponds to the average of the first N=5 games, while the last corresponds to the average of the last 5 games.
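
The grouping of stages into blocks of N=5 can be sketched as follows. This is an assumption about the mechanics: it uses consecutive, non-overlapping blocks, with a shorter final block when the number of stages is not a multiple of N. A rolling window would be a different choice and would give a smoother curve. The function name is hypothetical.

```python
def block_averages(values, n=5):
    """Average consecutive, non-overlapping blocks of n values.

    The last block may hold fewer than n values.
    """
    averages = []
    for start in range(0, len(values), n):
        block = values[start:start + n]
        averages.append(sum(block) / len(block))
    return averages


# Made-up per-stage values, for illustration only.
example = block_averages([1, 2, 3, 4, 5, 6, 7], n=5)
```

Averaging per-stage percentages this way weights every stage equally, which matches pooling the games only when each stage has the same number of games.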

### Team Performance

We now look at individual team performance in terms of winning percentage. The top 50 teams, and the bottom 15, in terms of winning percentage are labeled on the graph.

![Team Win Percentage](team_win_percentage.png)

Not surprisingly, the two big Spanish juggernauts Barcelona and Real Madrid come out on top of this special ranking. Several lesser-known teams appear in this ranking as well, given that all the games in question are played within each country's league. Thus we see close to the top: Celtic and Rangers (Scotland), RSC Anderlecht and Club Brugge KV (Belgium), FC Basel (Switzerland), and Legia Warszawa and Lech Poznan (Poland). All these teams can be considered juggernauts in their domestic leagues.

As a further consideration, we now have a better idea of why the Spanish, Dutch, and Portuguese leagues turned out to be the three most predictable leagues, with respectively 56.2%, 55.8%, and 54.8% of results predicted correctly. Each of these leagues has two teams in the top 9. Scotland, again, seems to be an anomaly. Although Celtic and Rangers rank right after the two Spanish juggernauts Barcelona and Real Madrid and the two top Portuguese teams Benfica and Porto, the league overall is the least predictable with the exception of the French league. This is probably a consequence of the fact that the next most winning team is well below the two top teams.

One interesting thing this graph shows is how the top two teams in Spain, Portugal and Scotland are well separated from the rest, something that otherwise applies only to one other team, Bayern Munich from Germany. Such dominance is not seen in other leagues. In particular, France has seen no dominating team (during those years) and has therefore turned out to be a little less predictable, with only 50.51% correctly predicted outcomes.

### Goals

The game of soccer is all about scoring one more goal than the opponent, so we conclude with a look at how many goals are scored per game. The following graph shows histograms of the distribution of goals scored per game for each of the 5 major leagues.

![Goals Histogram](goals_hist.png)

All the distributions are right-skewed, but in general look roughly bell-shaped. No differences are apparent from these graphs, except that in some leagues, like Germany and Spain, slightly more high-scoring games are played.

## The approach

In order to predict outcomes for the games we will build onto the work done by the betting agencies and therefore use the consensus odds for Home Win, Draw, and Away Win for the first three features.
 * **avgH**: average odds for Home win for up to 10 betting agencies
 * **avgD**: average odds for Draw for up to 10 betting agencies
 * **avgA**: average odds for Away win for up to 10 betting agencies

At the moment we have a Logistic Regression classifier pipeline working to test the features. After dividing the data into a training set and a test set, we use the accuracy of the model using only the consensus odds features as a baseline. This Logistic Regression has a score of 53.03% on the test set. The score reported is the median of 10 runs of the algorithm using different training and test sets.
 
A second set of features tries to account for the overall strength of a team, in terms of average points during the current season, and whether the team is hot or cold, that is: the average points over the last 5 games. Notice that if a team has played fewer than 5 games, the average is taken over only the games played so far. Points are awarded as: 3 for a win, 1 for a tie, 0 for a loss.
 * **points_home**: the average points earned by the home team up to the previous game in the current season
 * **points_away**: the average points earned by the away team up to the previous game in the current season
 * **streak_home**: the average points earned by the home team over up to the last 5 games before this one in the current season
 * **streak_away**: the average points earned by the away team over up to the last 5 games before this one in the current season
 
Adding the average points and the streak features increases the accuracy to 53.97%.
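
The average points and streak features can be sketched for a single team as below. This is an illustration, not the project's actual code: the function name and the result encoding ('W', 'D', 'L') are hypothetical, and returning `None` for a team's first game of the season (when there is no history yet) is an assumption, since the handling of that case is not specified here.

```python
POINTS = {"W": 3, "D": 1, "L": 0}  # win, tie, loss


def form_features(results, window=5):
    """For each game return (season_average, streak_average).

    `results` lists one team's results in chronological order within a season.
    Both averages use only games played before the current one, so no
    information from the game being predicted leaks into its features.
    """
    features = []
    earned = []  # points from games already played
    for result in results:
        if earned:
            season_average = sum(earned) / len(earned)
            recent = earned[-window:]  # fewer than `window` early in the season
            streak_average = sum(recent) / len(recent)
        else:
            season_average = streak_average = None  # assumed: no history yet
        features.append((season_average, streak_average))
        earned.append(POINTS[result])
    return features


# Made-up season start, for illustration only.
example_form = form_features(["W", "D", "L", "W", "W", "W", "L"])
```

For the seventh game in this made-up sequence the season average covers six games (13 points, about 2.17 per game), while the streak covers only the last five (10 points, 2.0 per game).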

While exploring the actual predictions of the model, we noted that the model never predicted a Draw, only Home wins and Away wins. This is not too different from what the betting agencies do: in only 34 instances out of more than 16,000 games under consideration did the betting agencies favor a Draw. This is somewhat surprising, as more than a fourth of all games end in a Draw. In order to *force* the model to predict more draws we used the class_weight option of the LogisticRegression classifier of sklearn. While this forced the model to make some Draw predictions, it did not improve the accuracy of the model. A different approach was then taken: adding two more features that characterize games that are more likely to end in a draw.
 * **diff**: the difference between highest and lowest value of odds consensus
 * **tie**: a score from 0 to 5 based on how many of 5 criteria, each more likely to identify tie games, a match satisfies

The first, *diff*, should help the classifier identify games likely to end in a tie, as those are the games where the odds difference among the various predicted results is small. The second is based on 5 criteria likely to identify games whose outcome was a tie, based on the values of the previous features. These criteria are:
 1. *diff* value less than the mean
 2. *avgD* less than the mean
 3. *avgA* less than the mean
 4. *points_home* less than the mean
 5. *streak_home* less than the mean

Adding *diff* increased accuracy to 54.29%, while adding *tie*, which is computed using *diff* as well, left accuracy at 53.95%. Using both together gave an accuracy of 54.27%. With the addition of these last features the model starts predicting draws, 1.1% of the time, although the accuracy for Draw is only 24.6%. For Away win the accuracy is 48.1%, and for Home win it is 56.9%. Being more accurate on Home wins, it is no surprise that the model predicts them more often, at a rate of 69.0%, versus Away wins predicted only 29.9% of the time.
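
The *diff* and *tie* features can be sketched as follows. This is a minimal illustration under assumptions: matches are represented as plain dictionaries, the function names are hypothetical, the sample values are made up, and the means are taken over whatever set of matches is passed in (in practice one would probably compute them on the training set only).

```python
from statistics import mean

# Features compared against their mean by the 5 criteria listed above.
CRITERIA = ("diff", "avgD", "avgA", "points_home", "streak_home")


def add_diff(match):
    """Return a copy of `match` with diff = highest minus lowest consensus odds."""
    odds = (match["avgH"], match["avgD"], match["avgA"])
    return {**match, "diff": max(odds) - min(odds)}


def add_tie_scores(matches):
    """Return copies of `matches` with `diff` and a 0-5 `tie` score added."""
    with_diff = [add_diff(m) for m in matches]
    means = {name: mean(m[name] for m in with_diff) for name in CRITERIA}
    return [
        {**m, "tie": sum(m[name] < means[name] for name in CRITERIA)}
        for m in with_diff
    ]


# Made-up matches, for illustration only.
sample = [
    {"avgH": 1.5, "avgD": 4.0, "avgA": 6.5, "points_home": 2.2, "streak_home": 2.4},
    {"avgH": 2.6, "avgD": 3.2, "avgA": 2.7, "points_home": 1.3, "streak_home": 1.0},
    {"avgH": 5.5, "avgD": 3.4, "avgA": 1.8, "points_home": 1.0, "streak_home": 2.0},
]
scored = add_tie_scores(sample)
```

In this made-up sample the closely matched second game satisfies all five criteria, the clear home favorite in the first game satisfies none, and the third game satisfies three.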

Before looking into any extra feature we will look into other classifiers that might perform better on this type of data. 