## Introduction

Soccer is the most popular sport in the world, with billions of fans, millions of players and many great matches every week. All of this produces a large amount of data about each player, match and club, which is valuable to analyse. This tutorial introduces some basic methods for analysing and visualizing soccer data. With these methods we can let the data explain why one club, league or player intuitively seems better than another, and how a player's statistics relate to his position and performance. We can then use visualization techniques to present the results in a clear, vivid and convincing way.

One example of how such data is used is the set of player ability ratings in soccer video games:
[<img src="http://www.ufifa16.net/wp-content/uploads/2016/08/FIFA-16-players-stats.jpg">](http://www.ufifa16.net/wp-content/uploads/2016/08/FIFA-16-players-stats.jpg)
(click for full-size version).  

These are the ratings of Reus in the popular soccer game FIFA 16. We may wonder how these numeric ratings are calculated or predicted, and how well Reus's real-life performance matches his in-game ability. One way to reveal this relation is to analyse the player's performance in real matches and map his statistics to his ratings. The same can be done for a particular team or league.

### Tutorial content

In this tutorial, we will show how to do some soccer data analysis and visualization in Python, specifically using [Bokeh](https://bokeh.pydata.org/en/latest/) and [Pandas](https://pandas.pydata.org/).

We'll be using the European Soccer Database, which contains 25k+ matches along with player and team attributes for European professional football: https://www.kaggle.com/hugomathien/soccer/data. The dataset was collected from several different sources, such as http://football-data.mx-api.enetscores.com/, http://www.football-data.co.uk/ and http://sofifa.com/.

We will cover the following topics in this tutorial:
- [Installing the libraries](#Installing-the-libraries)
- [Loading and Pre-processing data](#Loading-data)
- [Basic analysis](#Basic-analysis)
- [Basic visualization](#Basic-visualization)
- [Example application: Most Valuable Player (MVP) in Premier League](#Example-application:-MVP)

## Installing the libraries

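Before getting started, you'll need to install the libraries that we will use. The import cell further down also uses NumPy (installed as a dependency of Pandas) and NetworkX (`pip install networkx`). You can install Bokeh and Pandas using `pip`: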

    $ pip install --upgrade bokeh
    
    $ pip install pandas

In [1]:
import pandas as pd
from bokeh.plotting import figure, output_notebook, show, ColumnDataSource
from bokeh.models.widgets import Panel, Tabs
from bokeh.models import HoverTool
import numpy as np
import networkx as nx
import sqlite3
import re
output_notebook()

## Loading and Pre-processing data

Now that we've installed and loaded the libraries, let's load our database. First, download the database from https://www.kaggle.com/hugomathien/soccer/version/10#_=_ and unzip `soccer.zip` into the same folder as this notebook. The database is in SQLite format, so we can use `sqlite3` to connect to it and pandas to run queries and load the results. We can look at the tables and their schemas with the following code:

In [2]:
database = 'database.sqlite'
conn = sqlite3.connect(database)
# None disables column truncation (pandas >= 1.0); recent pandas versions reject -1
pd.set_option('display.max_colwidth', None)
tables = pd.read_sql("""SELECT sql FROM sqlite_master WHERE type='table';""", conn)
for s in tables['sql'][1:]:
    print(s)

CREATE TABLE "Player_Attributes" (
	`id`	INTEGER PRIMARY KEY AUTOINCREMENT,
	`player_fifa_api_id`	INTEGER,
	`player_api_id`	INTEGER,
	`date`	TEXT,
	`overall_rating`	INTEGER,
	`potential`	INTEGER,
	`preferred_foot`	TEXT,
	`attacking_work_rate`	TEXT,
	`defensive_work_rate`	TEXT,
	`crossing`	INTEGER,
	`finishing`	INTEGER,
	`heading_accuracy`	INTEGER,
	`short_passing`	INTEGER,
	`volleys`	INTEGER,
	`dribbling`	INTEGER,
	`curve`	INTEGER,
	`free_kick_accuracy`	INTEGER,
	`long_passing`	INTEGER,
	`ball_control`	INTEGER,
	`acceleration`	INTEGER,
	`sprint_speed`	INTEGER,
	`agility`	INTEGER,
	`reactions`	INTEGER,
	`balance`	INTEGER,
	`shot_power`	INTEGER,
	`jumping`	INTEGER,
	`stamina`	INTEGER,
	`strength`	INTEGER,
	`long_shots`	INTEGER,
	`aggression`	INTEGER,
	`interceptions`	INTEGER,
	`positioning`	INTEGER,
	`vision`	INTEGER,
	`penalties`	INTEGER,
	`marking`	INTEGER,
	`standing_tackle`	INTEGER,
	`sliding_tackle`	INTEGER,
	`gk_diving`	INTEGER,
	`gk_handling`	INTEGER,
	`gk_kicking`	INTEGER,
	`gk_p

We can see that there are 7 tables in total in the dataset. _Country_ lists the countries with influential soccer leagues. _Match_, _Player_, _Team_ and _League_ contain real match and player information, while *Player\_Attributes* and *Team\_Attributes* contain information from the FIFA video games.

In [3]:
leagues = pd.read_sql("""Select League.name, league.id, Country.name From League join Country On League.country_id = Country.id;""", conn)
leagues

| | name | id | name |
|---|---|---|---|
| 0 | Belgium Jupiler League | 1 | Belgium |
| 1 | England Premier League | 1729 | England |
| 2 | France Ligue 1 | 4769 | France |
| 3 | Germany 1. Bundesliga | 7809 | Germany |
| 4 | Italy Serie A | 10257 | Italy |
| 5 | Netherlands Eredivisie | 13274 | Netherlands |
| 6 | Poland Ekstraklasa | 15722 | Poland |
| 7 | Portugal Liga ZON Sagres | 17642 | Portugal |
| 8 | Scotland Premier League | 19694 | Scotland |
| 9 | Spain LIGA BBVA | 21518 | Spain |


The table above lists leagues from 10 European countries. In this tutorial we take the most famous one, the England Premier League, as our example. The same methods can easily be applied to the other leagues and countries.
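Switching to another league only requires a different league `id`. A minimal sketch of a reusable query with a bound parameter follows; it runs against a tiny in-memory stand-in for `database.sqlite` with made-up rows, and `demo_conn` and `matches_for_league` are names introduced here for illustration only.

```python
import sqlite3

# In-memory stand-in for database.sqlite; the three rows are invented.
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute(
    'CREATE TABLE "Match" (id INTEGER PRIMARY KEY, league_id INTEGER, season TEXT)'
)
demo_conn.executemany(
    'INSERT INTO "Match" VALUES (?, ?, ?)',
    [(1, 1729, '2008/2009'), (2, 1729, '2008/2009'), (3, 21518, '2008/2009')],
)

def matches_for_league(conn, league_id):
    """Return all Match rows of one league, binding league_id as a parameter."""
    query = 'SELECT id, league_id, season FROM "Match" WHERE league_id = ?'
    return conn.execute(query, (league_id,)).fetchall()

print(matches_for_league(demo_conn, 1729))   # league id of England Premier League
print(matches_for_league(demo_conn, 21518))  # league id of Spain LIGA BBVA
```

`pd.read_sql` accepts the same kind of query through its `params` argument, so the idea carries over to the pandas calls used in this tutorial.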

In [4]:
pd.options.display.max_columns = 100
pd.set_option('max_colwidth', 60)
england_matches = pd.read_sql("""Select * From Match Where league_id = 1729;""", conn)
england_matches

Unnamed: 0,id,country_id,league_id,season,stage,date,match_api_id,home_team_api_id,away_team_api_id,home_team_goal,away_team_goal,home_player_X1,home_player_X2,home_player_X3,home_player_X4,home_player_X5,home_player_X6,home_player_X7,home_player_X8,home_player_X9,home_player_X10,home_player_X11,away_player_X1,away_player_X2,away_player_X3,away_player_X4,away_player_X5,away_player_X6,away_player_X7,away_player_X8,away_player_X9,away_player_X10,away_player_X11,home_player_Y1,home_player_Y2,home_player_Y3,home_player_Y4,home_player_Y5,home_player_Y6,home_player_Y7,home_player_Y8,home_player_Y9,home_player_Y10,home_player_Y11,away_player_Y1,away_player_Y2,away_player_Y3,away_player_Y4,away_player_Y5,away_player_Y6,...,home_player_11,away_player_1,away_player_2,away_player_3,away_player_4,away_player_5,away_player_6,away_player_7,away_player_8,away_player_9,away_player_10,away_player_11,goal,shoton,shotoff,foulcommit,card,cross,corner,possession,B365H,B365D,B365A,BWH,BWD,BWA,IWH,IWD,IWA,LBH,LBD,LBA,PSH,PSD,PSA,WHH,WHD,WHA,SJH,SJD,SJA,VCH,VCD,VCA,GBH,GBD,GBA,BSH,BSD,BSA
0,1729,1729,1729,2008/2009,1,2008-08-17 00:00:00,489042,10260,10261,1,1,1,2,4,6,8,2,4,6,8,4,6,1,2,4,6,8,2,4,6,8,5,5,1,3,3,3,3,7,7,7,7,10,10,1,3,3,3,3,7,...,30829.0,24224,25518.0,24228.0,30929,29581.0,38807.0,40565.0,30360.0,33852.0,34574.0,37799.0,<goal><value><comment>n</comment><stats><goals>1</goals>...,<shoton><value><stats><blocked>1</blocked></stats><event...,<shotoff><value><stats><shotoff>1</shotoff></stats><even...,<foulcommit><value><stats><foulscommitted>1</foulscommit...,<card><value><comment>y</comment><stats><ycards>1</ycard...,<cross><value><stats><crosses>1</crosses></stats><event_...,<corner><value><stats><corners>1</corners></stats><event...,<possession><value><comment>56</comment><event_incident_...,1.29,5.50,11.00,1.30,4.75,8.25,1.30,4.40,8.50,1.25,4.50,10.00,,,,1.25,4.50,10.00,1.25,5.00,10.00,1.28,5.50,12.00,1.30,4.75,10.00,1.29,4.50,11.00
1,1730,1729,1729,2008/2009,1,2008-08-16 00:00:00,489043,9825,8659,1,0,1,2,4,6,8,2,4,6,8,4,6,1,2,4,6,8,5,7,9,1,3,5,1,3,3,3,3,7,7,7,7,10,10,1,3,3,3,3,7,...,36410.0,36373,36832.0,23115.0,37280,24728.0,24664.0,31088.0,23257.0,24171.0,25922.0,27267.0,<goal><value><comment>n</comment><stats><goals>1</goals>...,<shoton><value><stats><blocked>1</blocked></stats><event...,<shotoff><value><stats><shotoff>1</shotoff></stats><even...,<foulcommit><value><stats><foulscommitted>1</foulscommit...,<card />,<cross><value><stats><crosses>1</crosses></stats><event_...,<corner><value><stats><corners>1</corners></stats><event...,<possession><value><comment>65</comment><event_incident_...,1.20,6.50,15.00,1.22,5.50,10.00,1.20,5.20,11.00,1.20,5.00,11.00,,,,1.17,5.50,12.00,1.20,5.50,12.00,1.25,6.00,13.00,1.22,5.50,13.00,1.22,5.00,13.00
2,1731,1729,1729,2008/2009,1,2008-08-16 00:00:00,489044,8472,8650,0,1,1,2,4,6,8,2,4,6,8,4,6,1,2,4,6,8,2,4,6,8,4,6,1,3,3,3,3,7,7,7,7,10,10,1,3,3,3,3,7,...,24410.0,30660,37442.0,30617.0,24134,414792.0,37139.0,30618.0,40701.0,24800.0,24635.0,30853.0,<goal><value><comment>n</comment><stats><goals>1</goals>...,<shoton><value><stats><blocked>1</blocked></stats><event...,<shotoff><value><stats><shotoff>1</shotoff></stats><even...,<foulcommit><value><stats><foulscommitted>1</foulscommit...,<card><value><comment>y</comment><stats><ycards>1</ycard...,<cross><value><stats><crosses>1</crosses></stats><event_...,<corner><value><stats><corners>1</corners></stats><event...,<possession><value><comment>45</comment><event_incident_...,5.50,3.60,1.67,5.00,3.35,1.67,4.50,3.50,1.65,4.50,3.30,1.67,,,,5.50,3.30,1.57,4.33,3.40,1.73,5.50,3.80,1.65,5.00,3.40,1.70,4.50,3.40,1.73
3,1732,1729,1729,2008/2009,1,2008-08-16 00:00:00,489045,8654,8528,2,1,1,2,4,6,8,2,4,6,8,4,6,1,2,6,8,4,2,4,6,8,4,6,1,3,3,3,3,7,7,7,7,10,10,1,3,3,3,3,7,...,23139.0,34421,34987.0,35472.0,111865,25005.0,35327.0,25150.0,97988.0,41877.0,127857.0,34466.0,<goal><value><comment>n</comment><stats><goals>1</goals>...,<shoton><value><stats><shoton>1</shoton></stats><event_i...,<shotoff><value><stats><shotoff>1</shotoff></stats><even...,<foulcommit><value><stats><foulscommitted>1</foulscommit...,<card><value><comment>y</comment><stats><ycards>1</ycard...,<cross><value><stats><crosses>1</crosses></stats><event_...,<corner><value><stats><corners>1</corners></stats><event...,<possession><value><comment>50</comment><event_incident_...,1.91,3.40,4.20,1.90,3.20,3.80,1.80,3.30,3.80,1.80,3.20,4.00,,,,1.83,3.20,3.75,1.91,3.25,3.75,1.90,3.50,4.35,1.91,3.25,4.00,1.91,3.25,3.80
4,1733,1729,1729,2008/2009,1,2008-08-17 00:00:00,489046,10252,8456,4,2,1,2,4,6,8,2,4,6,8,4,6,1,2,4,6,8,1,3,5,7,9,5,1,3,3,3,3,7,7,7,7,10,10,1,3,3,3,3,7,...,26165.0,31432,46403.0,24208.0,23939,33963.0,47413.0,40198.0,42119.0,,33633.0,107216.0,<goal><value><comment>n</comment><stats><goals>1</goals>...,<shoton><value><stats><blocked>1</blocked></stats><event...,<shotoff><value><stats><shotoff>1</shotoff></stats><even...,<foulcommit><value><stats><foulscommitted>1</foulscommit...,<card><value><comment>y</comment><stats><ycards>1</ycard...,<cross><value><stats><corners>1</corners></stats><event_...,<corner><value><stats><corners>1</corners></stats><event...,<possession><value><comment>51</comment><event_incident_...,1.91,3.40,4.33,1.95,3.20,3.60,2.00,3.20,3.30,1.83,3.20,3.75,,,,1.91,3.20,3.50,1.91,3.25,3.75,1.90,3.50,4.35,1.91,3.25,4.00,1.91,3.30,3.75
5,1734,1729,1729,2008/2009,1,2008-08-16 00:00:00,489047,8668,8655,2,3,1,2,4,6,8,1,3,5,7,9,5,1,2,4,6,8,4,6,8,2,6,4,1,3,3,3,3,7,7,7,7,7,11,1,3,3,3,3,7,...,24160.0,30622,37764.0,19020.0,23921,24136.0,30342.0,23889.0,23916.0,23922.0,34176.0,30646.0,<goal><value><comment>n</comment><stats><goals>1</goals>...,<shoton><value><stats><shoton>1</shoton></stats><event_i...,<shotoff><value><stats><shotoff>1</shotoff></stats><even...,<foulcommit><value><stats><foulscommitted>1</foulscommit...,<card><value><comment>y</comment><stats><ycards>1</ycard...,<cross><value><stats><crosses>1</crosses></stats><event_...,<corner><value><stats><corners>1</corners></stats><event...,<possession><value><comment>46</comment><event_incident_...,2.00,3.30,4.00,1.85,3.25,4.00,2.00,3.20,3.30,1.80,3.20,4.00,,,,1.95,3.10,3.50,2.00,3.25,3.40,2.05,3.30,4.00,2.00,3.25,3.75,2.00,3.25,3.50
6,1735,1729,1729,2008/2009,1,2008-08-16 00:00:00,489048,8549,8586,2,1,1,2,4,6,8,2,4,6,8,4,6,1,2,4,6,8,2,4,6,8,4,6,1,3,3,3,3,7,7,7,7,10,10,1,3,3,3,3,7,...,42183.0,30455,34182.0,38697.0,24531,40006.0,30895.0,30818.0,31097.0,23760.0,41157.0,23949.0,<goal><value><comment>dg</comment><event_incident_typefk...,<shoton><value><stats><shoton>1</shoton></stats><event_i...,<shotoff><value><stats><shotoff>1</shotoff></stats><even...,<foulcommit><value><stats><foulscommitted>1</foulscommit...,<card><value><comment>y</comment><stats><ycards>1</ycard...,<cross><value><stats><corners>1</corners></stats><event_...,<corner><value><stats><corners>1</corners></stats><event...,<possession><value><comment>49</comment><event_incident_...,3.20,3.40,2.25,2.80,3.20,2.30,2.90,3.20,2.20,2.80,3.20,2.20,,,,2.90,3.20,2.15,2.88,3.40,2.20,3.20,3.40,2.30,3.00,3.25,2.30,2.80,3.25,2.30
7,1736,1729,1729,2008/2009,1,2008-08-16 00:00:00,489049,8559,10194,3,1,1,2,4,6,8,1,3,5,7,9,5,1,2,4,6,8,2,4,6,8,4,6,1,3,3,3,3,7,7,7,7,7,11,1,3,3,3,3,7,...,34261.0,23794,23369.0,34214.0,40695,40574.0,25668.0,23333.0,23253.0,23072.0,23288.0,23314.0,<goal><value><comment>n</comment><stats><goals>1</goals>...,<shoton><value><stats><blocked>1</blocked></stats><event...,<shotoff><value><stats><shotoff>1</shotoff></stats><even...,<foulcommit><value><stats><foulscommitted>1</foulscommit...,<card><value><comment>y</comment><stats><ycards>1</ycard...,<cross><value><event_incident_typefk>123</event_incident...,<corner><value><stats><corners>1</corners></stats><event...,<possession><value><comment>58</comment><event_incident_...,1.83,3.50,4.50,1.75,3.30,4.40,1.75,3.30,4.20,1.73,3.40,4.00,,,,1.80,3.10,4.00,1.80,3.20,4.33,1.85,3.40,4.80,1.83,3.25,4.50,1.80,3.25,4.33
8,1737,1729,1729,2008/2009,1,2008-08-16 00:00:00,489050,8667,9879,2,1,1,2,4,6,8,2,4,6,8,4,6,1,2,4,6,8,2,4,6,8,4,6,1,3,3,3,3,7,7,7,7,10,10,1,3,3,3,3,7,...,23352.0,30633,37266.0,26777.0,23780,24781.0,24020.0,30338.0,24843.0,24737.0,34248.0,24741.0,<goal><value><comment>n</comment><stats><goals>1</goals>...,<shoton><value><stats><shoton>1</shoton></stats><event_i...,<shotoff><value><stats><shotoff>1</shotoff></stats><even...,<foulcommit><value><stats><foulscommitted>1</foulscommit...,<card><value><comment>y</comment><stats><ycards>1</ycard...,<cross><value><stats><crosses>1</crosses></stats><event_...,<corner><value><stats><corners>1</corners></stats><event...,<possession><value><comment>49</comment><event_incident_...,2.60,3.20,2.80,2.45,3.15,2.65,2.40,3.20,2.60,2.40,3.25,2.50,,,,2.50,2.90,2.62,2.38,3.20,2.75,2.60,3.40,2.80,2.60,3.25,2.60,2.60,3.25,2.50
9,1738,1729,1729,2008/2009,1,2008-08-17 00:00:00,489051,8455,8462,4,0,1,2,4,6,8,3,5,7,4,6,5,1,2,4,6,8,2,4,6,8,4,6,1,3,3,3,3,6,6,6,8,8,11,1,3,3,3,3,7,...,37804.0,36286,34036.0,34418.0,24216,23953.0,25517.0,23988.0,26108.0,38820.0,30348.0,30830.0,<goal><value><comment>n</comment><stats><goals>1</goals>...,<shoton><value><stats><shoton>1</shoton></stats><event_i...,<shotoff><value><stats><shotoff>1</shotoff></stats><even...,<foulcommit><value><stats><foulscommitted>1</foulscommit...,<card><value><comment>y</comment><stats><ycards>1</ycard...,<cross><value><stats><crosses>1</crosses></stats><event_...,<corner><value><stats><corners>1</corners></stats><event...,<possession><value><comment>57</comment><event_incident_...,1.33,5.00,10.00,1.30,4.75,8.25,1.30,4.40,8.50,1.29,4.33,9.00,,,,1.30,4.20,8.50,1.25,5.00,10.00,1.33,5.00,11.00,1.33,4.75,9.00,1.33,4.20,10.00


The SQL query above selects all England Premier League matches in the dataset. Each row contains a lot of information: some of it is directly useful, some is raw XML collected from web APIs, and some is not needed here. We therefore first pre-process the pandas DataFrame to extract the useful information from the raw data in a suitable form.

In [5]:
team_pattern = re.compile(r'<team>(\d*)</team>')
player_pattern = re.compile(r'<player1>(\d*)</player1>')
elapsed_pattern_1 = re.compile(r'<elapsed>(\d*)</elapsed>')
elapsed_pattern_2 = re.compile(r'<elapsed>(\d*)</elapsed>(?:<player2>(\d*)</player2>)?')
pos_pattern = re.compile(r'<homepos>(\d*)</homepos>')
card_pattern = re.compile(r'<card_type>(\w)</card_type>')

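def preprocess_match(raw_data, two_player = False, pos = False, card = False):
    """Turn one raw XML event string into a list of tuples.

    two_player: also capture <player2> (e.g. the assisting player) with the minute.
    pos: parse possession events into (minute, home possession) pairs.
    card: append the card type to each (team, minute, player) tuple.
    """
    if two_player:
        e = elapsed_pattern_2.findall(raw_data)
    else:
        e = elapsed_pattern_1.findall(raw_data)
    if pos:
        p = pos_pattern.findall(raw_data)
        return list(zip(e, p))
    p = player_pattern.findall(raw_data)
    t = team_pattern.findall(raw_data)
    if card:
        c = card_pattern.findall(raw_data)
        return list(zip(t, e, p, c))
    return list(zip(t, e, p))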

england_matches['shoton'] = england_matches['shoton'].apply(lambda x: preprocess_match(x))
england_matches['shotoff'] = england_matches['shotoff'].apply(lambda x: preprocess_match(x))
england_matches['goal'] = england_matches['goal'].apply(lambda x: preprocess_match(x, True))
england_matches['foulcommit'] = england_matches['foulcommit'].apply(lambda x: preprocess_match(x, True))
england_matches['cross'] = england_matches['cross'].apply(lambda x: preprocess_match(x))
england_matches['corner'] = england_matches['corner'].apply(lambda x: preprocess_match(x))
england_matches['possession'] = england_matches['possession'].apply(lambda x: preprocess_match(x, False, True))
england_matches['card'] = england_matches['card'].apply(lambda x: preprocess_match(x, False, False, True))
england_matches['home_players'] = england_matches[england_matches.columns[55:66]].apply(lambda x: set(x.dropna().astype(int)), axis=1)
england_matches['away_players'] = england_matches[england_matches.columns[66:77]].apply(lambda x: set(x.dropna().astype(int)), axis=1)

For each match, the detailed information is stored as XML, which is not convenient for direct analysis. So we first extract this information and put it into a well-organized format. We use regular expressions to extract the useful fields (team, time, player and, for cards, the card type) and turn them into a list of tuples. Moreover, the raw data stores each side's 11 players in separate ID columns (22 columns in total); we merge them into two sets, one for the home team and one for the away team.
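To see what the regular expressions produce, here is a minimal sketch on a hand-written XML fragment. The fragment is a simplified assumption about the layout of the `goal` column (the real strings contain more tags), and the `sample_`-prefixed names are introduced only for this illustration.

```python
import re

sample_team_pattern = re.compile(r'<team>(\d*)</team>')
sample_player_pattern = re.compile(r'<player1>(\d*)</player1>')
# The optional group captures <player2> only when it directly follows <elapsed>.
sample_elapsed_pattern = re.compile(
    r'<elapsed>(\d*)</elapsed>(?:<player2>(\d*)</player2>)?'
)

# Simplified, hand-written fragment: one goal with an assist, one without.
sample_goal_xml = (
    '<goal>'
    '<value><elapsed>22</elapsed><player2>38807</player2>'
    '<player1>37799</player1><team>10261</team></value>'
    '<value><elapsed>24</elapsed>'
    '<player1>24148</player1><team>10260</team></value>'
    '</goal>'
)

sample_teams = sample_team_pattern.findall(sample_goal_xml)
sample_elapsed = sample_elapsed_pattern.findall(sample_goal_xml)
sample_scorers = sample_player_pattern.findall(sample_goal_xml)

# One (team, (minute, assisting player), scorer) tuple per goal; all strings.
sample_events = list(zip(sample_teams, sample_elapsed, sample_scorers))
print(sample_events)
```

Because the pattern has two groups, `findall` returns a `(minute, player2)` pair per event, and the second element is an empty string when no `<player2>` tag follows. This is why some tuples in the processed table look like `(64, )`.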

In [6]:
format_england_matches = england_matches[['id', 'country_id', 'league_id', 'season', 'stage', 'date', 'match_api_id', 'home_team_api_id', 'away_team_api_id', 'home_team_goal', 'away_team_goal', 'shoton', 'shotoff', 'goal', 'foulcommit', 'cross', 'corner', 'possession', 'card', 'home_players', 'away_players']]

Next we drop the attributes we do not need from the raw data, keeping 21 attributes as our final dataset.

In [7]:
format_england_matches

Unnamed: 0,id,country_id,league_id,season,stage,date,match_api_id,home_team_api_id,away_team_api_id,home_team_goal,away_team_goal,shoton,shotoff,goal,foulcommit,cross,corner,possession,card,home_players,away_players
0,1729,1729,1729,2008/2009,1,2008-08-17 00:00:00,489042,10260,10261,1,1,"[(10260, 3, 24154), (10260, 7, 24157), (10260, 14, 30829...","[(10260, 4, 30373), (10261, 5, 37799), (10261, 22, 24228...","[(10261, (22, 38807), 37799), (10260, (24, 24154), 24148)]","[(10261, (1, 32569), 25518), (10261, (2, 24157), 30929),...","[(10260, 7, 30829), (10260, 14, 24148), (10261, 19, 3880...","[(10261, 19, 38807), (10261, 22, 40565), (10261, 22, 388...","[(25, 56), (45, 54), (70, 54), (90, 55)]","[(10260, 78, 24157, y), (10260, 82, 30362, y), (10260, 9...","{34944, 30373, 30726, 30829, 30865, 24148, 24154, 32569,...","{24224, 24228, 37799, 29581, 25518, 34574, 30929, 40565,..."
1,1730,1729,1729,2008/2009,1,2008-08-16 00:00:00,489043,9825,8659,1,0,"[(9825, 7, 31013), (9825, 7, 30960), (9825, 9, 26111), (...","[(8659, 6, 23257), (9825, 9, 26181), (9825, 11, 38835), ...","[(9825, (4, 39297), 26181)]","[(8659, (2, 26181), 36832), (9825, (3, 23257), 31291), (...","[(9825, 3, 39297), (9825, 3, 39297), (9825, 10, 31291), ...","[(9825, 3, 39297), (9825, 3, 39297), (9825, 9, 30960), (...","[(27, 65), (45, 61), (74, 65), (90, 66)]",[],"{39297, 31013, 23686, 26181, 30986, 30960, 38835, 30935,...","{36832, 37280, 25922, 27267, 23115, 24171, 31088, 36373,..."
2,1731,1729,1729,2008/2009,1,2008-08-16 00:00:00,489044,8472,8650,0,1,"[(8472, 5, 23927), (8472, 13, 24410), (8650, 18, 30618),...","[(8472, 10, 30352), (8472, 27, 23927), (8650, 35, 30618)...","[(8650, (83, 30889), 30853)]","[(8650, (12, 38802), 39647), (8472, (15, 24134), 23927),...","[(8472, 7, 24410), (8472, 7, 23927), (8650, 9, 37139), (...","[(8650, 26, 30618), (8472, 30, 38802), (8650, 35, 30618)...","[(25, 45), (45, 43), (70, 48), (90, 46)]","[(8650, 56, 37442, y), (8650, 90, 46621, y)]","{17866, 24655, 30352, 38802, 32562, 38836, 36786, 23927,...","{24800, 37442, 30660, 30853, 24134, 414792, 37139, 30617..."
3,1732,1729,1729,2008/2009,1,2008-08-16 00:00:00,489045,8654,8528,2,1,"[(8654, 15, 34543), (8654, 27, 34543), (8528, 30, 97988)...","[(8528, 7, 127857), (8528, 14, 41877), (8528, 17, 111865...","[(8654, (4, 36394), 23139), (8654, (10, 37277), 23139), ...","[(8654, (1, 25005), 23139), (8528, (2, ), 25150), (8654,...","[(8654, 4, 36394), (8528, 7, 41877), (8654, 9, 36394), (...","[(8654, 9, 36394), (8654, 14, 36394), (8654, 15, 36394),...","[(25, 50), (45, 56), (69, 41), (90, 52)]","[(8654, 39, 24223, y), (8654, 49, 37277, y), (8528, 68, ...","{23139, 24773, 23818, 36394, 34543, 24223, 37169, 30966,...","{34466, 97988, 34987, 25005, 35472, 127857, 34421, 41877..."
4,1733,1729,1729,2008/2009,1,2008-08-17 00:00:00,489046,10252,8456,4,2,"[(10252, 8, 23354), (10252, 10, 26165), (10252, 12, 2616...","[(10252, 6, 26165), (8456, 20, 47413), (10252, 26, 23782...","[(10252, (47, 23354), 26165), (8456, (64, ), 40198), (10...","[(10252, (5, 33963), 30357), (8456, (7, 38609), 42119), ...","[(10252, 2, 23354), (8456, 3, 33633), (10252, 6, 23354),...","[(10252, 2, 23354), (8456, 3, 33633), (10252, 6, 23354),...","[(25, 51), (45, 54), (70, 49), (90, 52)]","[(8456, 34, 23939, y)]","{23264, 23782, 30380, 24780, 43280, 38609, 23282, 24658,...","{33633, 46403, 23939, 40198, 42119, 31432, 33963, 24208,..."
5,1734,1729,1729,2008/2009,1,2008-08-16 00:00:00,489047,8668,8655,2,3,"[(8668, 4, 109058), (8655, 5, 23916), (8655, 15, 30342),...","[(8655, 7, 30646), (8655, 13, 30646), (8668, 19, 30857),...","[(8655, (22, 23916), 30342), (8668, (45, ), 24011), (866...","[(8668, (3, 37764), 24160), (8655, (6, 24011), 30342), (...","[(8655, 2, 37764), (8655, 5, 23916), (8655, 5, 23916), (...","[(8655, 5, 23916), (8668, 19, 24011), (8655, 25, 30342),...","[(25, 46), (45, 59), (70, 51), (90, 51)]","[(8655, 45, 24136, y), (8655, 59, 37764, y), (8668, 75, ...","{24160, 109058, 30371, 24004, 23268, 24006, 31465, 30857...","{34176, 37764, 30342, 24136, 19020, 23916, 23921, 23889,..."
6,1735,1729,1729,2008/2009,1,2008-08-16 00:00:00,489048,8549,8586,2,1,"[(8549, 10, 30892), (8549, 11, 35608), (8549, 45, 35608)...","[(8586, 5, 23760), (8586, 9, 41157), (8586, 15, 23760), ...","[(8549, (32, ), 24166), (8549, (71, 35608), 24166), (854...","[(8549, (8, 23949), 24167), (8549, (19, ), 24166), (8549...","[(8549, 2, 24393), (8586, 3, 30895), (8586, 6, 23760), (...","[(8549, 2, 24393), (8586, 6, 23760), (8586, 9, 23760), (...","[(25, 49), (45, 49), (76, 53), (90, 53)]","[(8586, 61, 30818, y), (8586, 76, 23760, y), (8549, 90, ...","{24161, 24166, 24167, 42183, 24393, 97932, 30892, 24753,...","{30818, 41157, 40006, 34182, 38697, 23949, 30895, 23760,..."
7,1736,1729,1729,2008/2009,1,2008-08-16 00:00:00,489049,8559,10194,3,1,"[(8559, 20, 23934), (8559, 31, 24372), (8559, 38, 23785)...","[(10194, 6, 23288), (10194, 9, 23288), (8559, 14, 24372)...","[(8559, (34, 23933), 26454), (8559, (41, 23783), 23934),...","[(10194, (2, 24372), 23333), (10194, (8, 23933), 23333),...","[(10194, 1, 26454), (10194, 3, 23314), (10194, 6, 23253)...","[(10194, 11, 23253), (10194, 45, 23253), (10194, 45, 232...","[(25, 58), (45, 47), (70, 51), (90, 47)]","[(8559, 62, 26454, y), (10194, 64, 23333, y), (10194, 90...","{23783, 23785, 35532, 24336, 24372, 34261, 26454, 23931,...","{23072, 25668, 23333, 34214, 23369, 23794, 23314, 23253,..."
8,1737,1729,1729,2008/2009,1,2008-08-16 00:00:00,489050,8667,9879,2,1,"[(9879, 2, 24737), (8667, 3, 39073), (8667, 3, 23023), (...","[(9879, 10, 34248), (9879, 18, 37266), (9879, 19, 24020)...","[(9879, (9, 24843), 34248), (8667, (23, 34430), 39073), ...","[(8667, (16, 24843), 30595), (8667, (21, 24843), 23023),...","[(9879, 1, 34248), (8667, 4, 23022), (9879, 7, 24843), (...","[(8667, 4, 23022), (9879, 7, 24843), (9879, 18, 24843), ...","[(24, 49), (70, 50), (90, 53)]","[(8667, 29, 34430, y), (8667, 44, 23022, y), (8667, 90, ...","{39073, 34275, 30595, 23021, 23022, 23438, 23023, 23025,...","{24737, 30338, 23780, 24741, 34248, 30633, 24843, 24781,..."
9,1738,1729,1729,2008/2009,1,2008-08-17 00:00:00,489051,8455,8462,4,0,"[(8462, 3, 30830), (8462, 3, 30348), (8455, 9, 30631), (...","[(8455, 11, 37804), (8455, 19, 30699), (8462, 20, 25517)...","[(8455, (12, 30699), 30630), (8455, (26, 30686), 37804),...","[(8455, (2, 23988), 30686), (8462, (4, 38834), 26108), (...","[(8455, 1, 25925), (8455, 5, 30686), (8455, 5, 30631), (...","[(8455, 5, 30686), (8455, 7, 30631), (8455, 35, 30686), ...","[(27, 57), (45, 64), (72, 65)]","[(8462, 45, 36286, y)]","{30627, 25925, 30630, 30631, 30859, 30699, 37804, 38834,...","{38820, 30348, 25517, 30830, 23953, 34418, 34036, 23988,..."


As we can see above, after pre-processing the data is in a format better suited to further analysis.

## Basic analysis

Now let's do some basic analysis of the England Premier League data.

In [8]:
teams = pd.read_sql("""Select * From Team;""", conn)
all_players = pd.read_sql("""Select * From Player""", conn)
player_dict = {p['player_api_id']: p['player_name'] for index, p in all_players.iterrows()}

First, we want to analyse a specific team in England. We can take a look at the _Team_ table and choose a team we like. In this tutorial, we use Manchester United as an example. We first extract all of Manchester United's match data for the 2008/2009 season from the dataset.

In [9]:
def extract_team_data(team_id, season):
    data_home = format_england_matches.loc[(format_england_matches['home_team_api_id'] == team_id) & (format_england_matches['season'] == season)]
    data_away = format_england_matches.loc[(format_england_matches['away_team_api_id'] == team_id) & (format_england_matches['season'] == season)]
    return data_home, data_away

mun_id = 10260
one_season = '2008/2009'
mun_data_home, mun_data_away = extract_team_data(mun_id, one_season)

The first thing we can easily do is calculate Manchester United's winning rate in the 2008/2009 season. The function returns the home, away and overall winning rates, in that order.

In [10]:
def cal_winning_rate(data_home, data_away):
    home_win = len(data_home[data_home.apply(lambda x: x['home_team_goal'] > x['away_team_goal'], axis=1)].index)
    away_win = len(data_away[data_away.apply(lambda x: x['away_team_goal'] > x['home_team_goal'], axis=1)].index)
    return home_win/len(data_home.index), away_win/len(data_away.index), (home_win + away_win)/(len(data_home.index) + len(data_away.index))

cal_winning_rate(mun_data_home, mun_data_away)

(0.8421052631578947, 0.631578947368421, 0.7368421052631579)
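As a sanity check, these rates can be reproduced by hand. The win counts below are an inference, assuming 19 home and 19 away matches in the season (as in a 20-team league); they are not read from the dataset.

```python
# Assumed match and win counts, inferred from the rates above rather than queried.
home_matches, away_matches = 19, 19
home_wins, away_wins = 16, 12

home_rate = home_wins / home_matches
away_rate = away_wins / away_matches
overall_rate = (home_wins + away_wins) / (home_matches + away_matches)

print(round(home_rate, 4), round(away_rate, 4), round(overall_rate, 4))
# prints: 0.8421 0.6316 0.7368
```

Note that the overall rate is computed over all 38 matches; it equals the mean of the home and away rates here only because both sides have the same number of matches.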

We can then calculate Manchester United's winning rate for every season from 2008/2009 to 2015/2016 to see how the team has changed.

In [11]:
seasons = ['2008/2009', '2009/2010', '2010/2011', '2011/2012',
           '2012/2013', '2013/2014', '2014/2015', '2015/2016']

def all_winning_rate(team_id):
    rates = []
    for s in seasons:
        # Use the team_id argument rather than the global mun_id
        h, a = extract_team_data(team_id, s)
        rates.append(cal_winning_rate(h, a))
    return rates

all_winning_rate(mun_id)

[(0.8421052631578947, 0.631578947368421, 0.7368421052631579),
 (0.8421052631578947, 0.5789473684210527, 0.7105263157894737),
 (0.9473684210526315, 0.2631578947368421, 0.6052631578947368),
 (0.7894736842105263, 0.6842105263157895, 0.7368421052631579),
 (0.8421052631578947, 0.631578947368421, 0.7368421052631579),
 (0.47368421052631576, 0.5263157894736842, 0.5),
 (0.7368421052631579, 0.3157894736842105, 0.5263157894736842),
 (0.631578947368421, 0.3684210526315789, 0.5)]
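One way to read this list is to look at the season-over-season change in the overall rate (the third value of each tuple). A small sketch using the values printed above follows; `season_labels`, `overall_rates` and `changes` are names introduced here for illustration.

```python
season_labels = ['2008/2009', '2009/2010', '2010/2011', '2011/2012',
                 '2012/2013', '2013/2014', '2014/2015', '2015/2016']
# Overall winning rates copied from the output above.
overall_rates = [0.7368421052631579, 0.7105263157894737, 0.6052631578947368,
                 0.7368421052631579, 0.7368421052631579, 0.5,
                 0.5263157894736842, 0.5]

# Pair each season (from the second one on) with its change versus the previous season.
changes = [(season, round(curr - prev, 4))
           for season, prev, curr in zip(season_labels[1:], overall_rates, overall_rates[1:])]

for season, delta in changes:
    print(f'{season}: {delta:+.4f}')

# Season with the largest drop in overall winning rate.
worst_season, worst_delta = min(changes, key=lambda item: item[1])
print(worst_season, worst_delta)
```

On these numbers the largest drop is in 2013/2014, where the overall rate falls from about 0.74 to 0.50.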

The winning rate depends on many factors. One of them is the manager, who designs and implements the team's tactics, and those tactics show up in the statistics of a match. So let's extract the statistics of a single match. The function `cal_stat` computes them for one team in one match. With `detailed=True` it returns, for each event type, the list of events with their time and the player involved; with `detailed=False` it returns only the count of each event type, plus the average possession. The values are returned in this order: shots on target, shots off target, goals, goals conceded, assists, fouls, crosses, yellow cards, red cards and possession.

In [12]:
def cal_stat(match, team_id, detailed = True):
    shot_on = list(map(lambda x: (int(x[1]), int(x[2])), filter(lambda x: int(x[0]) == team_id, match['shoton'])))
    shot_off = list(map(lambda x: (int(x[1]), int(x[2])), filter(lambda x: int(x[0]) == team_id, match['shotoff'])))
    goal = list(map(lambda x: (int(x[1][0]), int(x[2])), filter(lambda x: int(x[0]) == team_id, match['goal'])))
    lost = list(map(lambda x: (int(x[1][0]), int(x[2])), filter(lambda x: int(x[0]) != team_id, match['goal'])))
    assist = list(map(lambda x: (int(x[1][0]), int(x[1][1])), filter(lambda x: int(x[0]) == team_id and x[1][1], match['goal'])))
    foul = list(map(lambda x: (int(x[1][0]), int(x[2])), filter(lambda x: int(x[0]) == team_id, match['foulcommit'])))
    cross = list(map(lambda x: (int(x[1]), int(x[2])), filter(lambda x: int(x[0]) == team_id, match['cross'])))
    y_card = list(map(lambda x: (int(x[1]), int(x[2])), filter(lambda x: int(x[0]) == team_id and x[3] == 'y', match['card'])))
    r_card = list(map(lambda x: (int(x[1]), int(x[2])), filter(lambda x: int(x[0]) == team_id and x[3] == 'r', match['card'])))
    pos = list(map(lambda x: (int(x[0]), int(x[1])) if team_id == int(match['home_team_api_id']) else (int(x[0]), 100 - int(x[1])), match['possession']))
    if detailed:
        return shot_on, shot_off, goal, lost, assist, foul, cross, y_card, r_card, pos
    else:
        return len(shot_on), len(shot_off), len(goal), len(lost), len(assist), len(foul), len(cross), len(y_card), len(r_card), sum(map(lambda x: x[1], pos)) / len(pos)

single_match = format_england_matches.iloc[2,:]
cal_stat(single_match, 8650, True)

([(18, 30618),
  (34, 30618),
  (37, 37139),
  (40, 37139),
  (55, 24800),
  (59, 30618),
  (72, 30618),
  (73, 37139),
  (73, 30853),
  (88, 37139),
  (90, 30853)],
 [(35, 30618), (43, 24635), (53, 30618), (56, 24635), (81, 30889)],
 [(83, 30853)],
 [],
 [(83, 30889)],
 [(12, 39647),
  (16, 30853),
  (24, 37442),
  (38, 24635),
  (41, 24134),
  (48, 37442),
  (51, 30853),
  (56, 37442),
  (60, 30853),
  (64, 39647),
  (85, 39647),
  (90, 46621)],
 [(9, 37139),
  (10, 30618),
  (23, 37442),
  (25, 24635),
  (26, 30618),
  (33, 30853),
  (35, 30618),
  (35, 37139),
  (36, 30618),
  (36, 30618),
  (36, 30618),
  (37, 30618),
  (42, 39647),
  (42, 30618),
  (52, 39647),
  (56, 30889),
  (66, 30618),
  (75, 30618),
  (78, 46621)],
 [(56, 37442), (90, 46621)],
 [],
 [(25, 55), (45, 57), (70, 52), (90, 54)])
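The `filter`/`map` chains in `cal_stat` can also be written as list comprehensions, which some readers find easier to follow. The sketch below covers only goals and assists and uses the goal events of the first match in the table above; `sample_goals` and `goals_and_assists` are illustrative names, not part of the original notebook.

```python
# Goal events in the (team, (minute, assisting player), scorer) layout
# produced by preprocess_match; every field is a string.
sample_goals = [
    ('10261', ('22', '38807'), '37799'),
    ('10260', ('24', '24154'), '24148'),
]

def goals_and_assists(goal_events, team_id):
    """Return ([(minute, scorer)], [(minute, assisting player)]) for one team."""
    goals = [(int(minute), int(scorer))
             for team, (minute, _), scorer in goal_events
             if int(team) == team_id]
    # Skip goals without an assist (empty player2 string).
    assists = [(int(minute), int(helper))
               for team, (minute, helper), _ in goal_events
               if int(team) == team_id and helper]
    return goals, assists

print(goals_and_assists(sample_goals, 10260))
# prints: ([(24, 24148)], [(24, 24154)])
```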

Now we can calculate the average statistics per season for Manchester United and see how they relate to the team's winning rate. The function `cal_season_avg_stat` below computes a team's statistics for one season, separately for its home matches, its away matches and all matches.
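The per-season averages are column-wise means over one tuple of counts per match. A minimal sketch of that step on made-up numbers follows; the three-field tuples are hypothetical and shorter than the ten values `cal_stat` returns.

```python
# Hypothetical per-match tuples: (shots on target, goals, possession %)
match_stats = [
    (9, 2, 58.0),
    (7, 1, 52.0),
    (11, 3, 61.0),
]

# zip(*match_stats) regroups the values by statistic instead of by match,
# so each `col` holds one statistic across all matches.
season_avg = [sum(col) / len(col) for col in zip(*match_stats)]
print(season_avg)  # prints: [9.0, 2.0, 57.0]
```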

In [13]:
def cal_season_avg_stat(team_id, data_home, data_away, avg=True):
    if avg:
        home_stat = []
        away_stat = []
        for index, match in data_home.iterrows():
            home_stat.append(cal_stat(match, team_id, False))
        for index, match in data_away.iterrows():
            away_stat.append(cal_stat(match, team_id, False))
        season_stat = home_stat+away_stat
        return [sum(y) / len(y) for y in zip(*home_stat)], [sum(y) / len(y) for y in zip(*away_stat)], [sum(y) / len(y) for y in zip(*season_stat)]
    else:
        home_stat = [[] for i in range(10)]
        away_stat = [[] for i in range(10)]
        for index, match in data_home.iterrows():
            new = list(map(lambda x: list(map(lambda x: x[0] ,x)),cal_stat(match, team_id, True)))
            home_stat = [home_stat[i] + new[i] for i in range(10)]
        for index, match in data_away.iterrows():
            new = list(map(lambda x: list(map(lambda x: x[0] ,x)),cal_stat(match, team_id, True)))
            away_stat = [away_stat[i] + new[i] for i in range(10)]
        return home_stat, away_stat, [home_stat[i] + away_stat[i] for i in range(10)]

def all_stat(team_id):
    # collect the (home, away, total) average statistics of every season
    season_stats = []
    for s in seasons:
        h, a = extract_team_data(team_id, s)
        season_stats.append(cal_season_avg_stat(team_id, h, a))
    return season_stats

all_stat(mun_id)

[([9.157894736842104,
   8.31578947368421,
   2.6315789473684212,
   0.6842105263157895,
   1.6842105263157894,
   11.473684210526315,
   21.473684210526315,
   1.263157894736842,
   0.05263157894736842,
   58.22105263157895],
  [7.421052631578948,
   7.0,
   1.5789473684210527,
   0.5263157894736842,
   1.2105263157894737,
   10.894736842105264,
   16.736842105263158,
   2.1052631578947367,
   0.05263157894736842,
   54.54511278195489],
  [8.289473684210526,
   7.657894736842105,
   2.1052631578947367,
   0.6052631578947368,
   1.4473684210526316,
   11.18421052631579,
   19.105263157894736,
   1.6842105263157894,
   0.05263157894736842,
   56.38308270676692]),
 ([9.210526315789474,
   8.263157894736842,
   2.526315789473684,
   1.105263157894737,
   2.0,
   10.157894736842104,
   28.68421052631579,
   1.2105263157894737,
   0.0,
   58.0921052631579],
  [6.578947368421052,
   6.526315789473684,
   1.631578947368421,
   1.0526315789473684,
   1.105263157894737,
   11.157894736842104,
 

So far we have done some basic analysis of the soccer dataset: we can now calculate the winning rate and several kinds of statistics for a specific football club. However, these results are raw numbers, which makes it hard to see at a glance how the club performed. Therefore, we need to visualize the data.

## Basic Visualization

In this section, we show basic visualizations of soccer data using the Bokeh visualization library.

Let's start with a simple one: a multi-line graph of Manchester United's winning rate from 2008 to 2016. We then add the per-season statistics from the last section to the same graph, so that we can see how each of them relates to the winning rate and how the club's tactics changed over time. We use Bokeh's line and circle glyphs to draw the data, and tabs to switch between home, away, and total data.
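Before anything is drawn, the nested per-season result has to be turned into one series per statistic. The sketch below shows only that reshaping step, without Bokeh. It assumes the layout returned by `all_stat`, one `(home, away, total)` triple per season; the helper names and the shortened two-statistic data are hypothetical.

```python
STAT_NAMES = ['Shot On', 'Goal']  # shortened list, for illustration only
VIEW_INDEX = {'home': 0, 'away': 1, 'total': 2}

season_labels = ['2008/2009', '2009/2010']
# One entry per season: (home, away, total); values are made up.
all_seasons = [
    ([9.0, 2.5], [7.0, 1.5], [8.0, 2.0]),
    ([10.0, 3.0], [6.0, 1.0], [8.0, 2.0]),
]


def season_start_years(labels):
    """Use the first year of a 'YYYY/YYYY' label as the x coordinate."""
    return [int(label[:4]) for label in labels]


def series_for(seasons_data, view, stat_index):
    """Pick one statistic of one view (home/away/total) across all seasons."""
    idx = VIEW_INDEX[view]
    return [season[idx][stat_index] for season in seasons_data]


def all_series(seasons_data, view, names):
    """Map every statistic name to its per-season series."""
    return {name: series_for(seasons_data, view, i) for i, name in enumerate(names)}


x = season_start_years(season_labels)
home_series = all_series(all_seasons, 'home', STAT_NAMES)
print(x, home_series)
```

Each value of `home_series` can then be passed as the y data of one line, with `x` shared by all lines.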

In [14]:
Stat = ['Shot On', 'Shot Off', 'Goal', 'Lost', 'Assist', 'Foul', 'Cross', 'Yellow Card', 'Red Card', 'Possession']
Color = ['grey','purple','navy','pink','orange','firebrick','blue','green','red','black']

def plot_stat(team_id, team_name, data_type='total'):
    if data_type == 'total':
        index = 2
    elif data_type == 'home':
        index = 0
    elif data_type == 'away':
        index = 1
    # use winning rate data
    win_rate = all_winning_rate(team_id)
    stat = all_stat(team_id)
    x = list(map(lambda x: int(x[:4]), seasons))
    yt = list(map(lambda x: x[index], win_rate))

    # create a new plot with a title and axis labels
    title = team_name + " Statistics(08-16)"
    p = figure(title=title, x_axis_label='season', y_axis_type="log")
    
    for i in range(10):
        if i == 8:
            continue
        p.line(x, list(map(lambda x: x[index][i], stat)), legend=Stat[i], line_dash=(4, 4), line_color=Color[i], line_width=2)
        p.circle(x, list(map(lambda x: x[index][i], stat)), legend=Stat[i], color=Color[i], alpha=0.5)
    
    # add winning rate to the graph
    p.square(x, yt, legend="Winning Rate", fill_color=None, line_color="green")
    p.line(x, yt, legend="Winning Rate", line_color="green")
    
    # make the legend label interactive
    p.legend.click_policy="hide"
    
    # show the results
    return p

def plot_stat_tab(team_id, team_name):
    # create and show tabs
    ph = plot_stat(team_id, team_name, 'home')
    tab1 = Panel(child=ph, title="Home")
    pa = plot_stat(team_id, team_name, 'away')
    tab2 = Panel(child=pa, title="Away")
    pt = plot_stat(team_id, team_name, 'total')
    tab3 = Panel(child=pt, title="Total")
    tabs = Tabs(tabs=[ tab1, tab2, tab3 ])
    show(tabs)

plot_stat_tab(mun_id, "Manchester United")

The figure above shows many changes at Manchester United. For example, after Sir Alex Ferguson's retirement at the end of the 2012/2013 season (the 2012 point on the x-axis), United's winning rate fell from over 70% to about 50%, and the average number of goals per match fell from over 2 to about 1.5. The number of crosses rose while the number of shots fell. When Louis van Gaal became manager (2014), possession rose, because he focused on controlling the ball. Because the legend is interactive, we can click a label to show or hide the corresponding attribute. The tabs let us compare home and away matches directly.
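Changes like these can also be computed directly instead of being read off the chart. The sketch below is one possible way to do it: it computes the relative change of a few home statistics between the first two entries of the `all_stat(mun_id)` output shown earlier, with the values rounded to two decimals. The function name `relative_change` is an assumption made for this example.

```python
# Home averages of the first two seasons, copied from the all_stat(mun_id)
# output earlier in the tutorial and rounded to two decimals.
first = {'Shot On': 9.16, 'Goal': 2.63, 'Cross': 21.47}
second = {'Shot On': 9.21, 'Goal': 2.53, 'Cross': 28.68}


def relative_change(before, after):
    """Relative change per statistic; statistics with a zero baseline are skipped."""
    return {
        name: (after[name] - before[name]) / before[name]
        for name in before
        if before[name] != 0
    }


change = relative_change(first, second)
for name, value in change.items():
    # Print a signed percentage with one decimal place.
    print(f"{name}: {value:+.1%}")
```

In this excerpt the average number of home crosses rises by roughly a third, while shots on target and goals barely move.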

In [15]:
def plot_stat_his(season, team_id, team_name, data_type='total'):
    h, a = extract_team_data(team_id, season)
    h, a, t = cal_season_avg_stat(team_id, h, a, False)
    if data_type == 'home':
        data = h
    elif data_type == 'away':
        data = a
    elif data_type == 'total':
        data = t
    p = figure(title="Distribution of Match Data" + " of " + team_name + " in " + season)
    for i in range(8):
        hist, edges = np.histogram(data[i], density=True, bins=90)
        p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], color = Color[i], legend=Stat[i], alpha=0.5)
    p.legend.click_policy="hide"
    p.xaxis.axis_label = 'Match Time'
    # np.histogram(..., density=True) returns a normalized density, not a raw count
    p.yaxis.axis_label = 'Density'
    return p

def plot_stat_histab(season, team_id, team_name):
    # create and show tabs
    ph = plot_stat_his(season, team_id, team_name, 'home')
    tab1 = Panel(child=ph, title="Home")
    pa = plot_stat_his(season, team_id, team_name, 'away')
    tab2 = Panel(child=pa, title="Away")
    pt = plot_stat_his(season, team_id, team_name, 'total')
    tab3 = Panel(child=pt, title="Total")
    tabs = Tabs(tabs=[ tab1, tab2, tab3 ])
    show(tabs)
    
plot_stat_histab('2008/2009', mun_id, 'Manchester United')

The function above takes the statistics of one club in one season and plots how each type of match event is distributed over the minutes of a match. One interesting fact visible in this figure is that Manchester United in 2008/2009 scored a large share of their goals around the 45th and 90th minutes, that is, at the end of each half. More interesting facts can be found if you look closely at this figure.
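To make the idea of a per-minute distribution concrete, here is a small sketch that counts events by minute with `collections.Counter` and normalizes the counts into shares. The goal minutes are hypothetical, and the share per minute only matches a histogram density when every bin is exactly one minute wide, which is an assumption of this sketch.

```python
from collections import Counter

# Hypothetical goal minutes pooled over several matches (illustrative only).
goal_minutes = [12, 45, 45, 67, 90, 90, 90, 33]


def minute_counts(minutes):
    """Number of events recorded in each minute."""
    return Counter(minutes)


def minute_share(minutes):
    """Share of all events that fall into each minute (shares sum to 1)."""
    counts = Counter(minutes)
    total = sum(counts.values())
    return {minute: count / total for minute, count in sorted(counts.items())}


counts = minute_counts(goal_minutes)
share = minute_share(goal_minutes)
# The minute with the most events and how many events it has.
peak_minute, peak_count = counts.most_common(1)[0]
print(peak_minute, peak_count, share[peak_minute])
```

Raw counts answer "how many goals were scored in minute 90", while shares make seasons with different numbers of goals comparable.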

## Example application: Most Valuable Player (MVP) in England

Now that we have the skills to pre-process soccer data, analyse it with Pandas DataFrames, and visualize it with Bokeh, we can combine them into a small application that shows the most valuable players in England from several different angles.

In [16]:
def player_stat(match, stat, index1, index2, assist=False, card=False):
    for s in match[stat]:
        if card:
            if s[index1] not in players:
                if s[index1 + 1] == 'y':
                    players[s[index1]] = [0, 0, 0, 0, 0, 1, 0]
                else:
                    players[s[index1]] = [0, 0, 0, 0, 0, 0, 1]
            else:
                if s[index1 + 1] == 'y':
                    players[s[index1]][5] += 1
                else:
                    players[s[index1]][6] += 1
        elif assist:
            if s[index1]:
                if s[index1][1] not in players:
                    players[s[index1][1]] = [0, 0, 0, 1, 0, 0, 0]
                else:
                    players[s[index1][1]][index2] += 1
        else:
            if s[index1] not in players:
                players[s[index1]] = [0, 0, 0, 0, 0, 0, 0]
                players[s[index1]][index2] += 1
            else:
                players[s[index1]][index2] += 1
                    
def create_player_dict(season):
    data = format_england_matches.loc[format_england_matches['season'] == season]
    for index, match in data.iterrows():
        player_stat(match, 'shoton', 2, 0)
        player_stat(match, 'shotoff', 2, 1)
        player_stat(match, 'goal', 2, 2)
        player_stat(match, 'goal', 1, 3, True)
        player_stat(match, 'cross', 2, 4)
        # the card branch updates both the yellow (5) and red (6) counters by itself,
        # so it is called once; a second call would count every card twice
        player_stat(match, 'card', 2, 5, False, True)

First, we calculate each player's statistics. We extract them from the dataset season by season and store them in a dictionary keyed by player id.
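The same bookkeeping can be written more compactly with `collections.defaultdict`, which creates a fresh counter list the first time a player id appears. This is a sketch under assumptions rather than a replacement for the function above: the `(minute, player_id)` event shape is borrowed from the single-match `cal_stat` output shown earlier, and the names `FIELDS` and `tally` are hypothetical.

```python
from collections import defaultdict

# Counter layout: one slot per statistic, in the same order as in the tutorial.
FIELDS = ['shoton', 'shotoff', 'goal', 'assist', 'cross', 'ycard', 'rcard']
FIELD_INDEX = {name: i for i, name in enumerate(FIELDS)}


def new_counters():
    """A fresh list of zeros, one per field."""
    return [0] * len(FIELDS)


def tally(players, events, field):
    """Add one to `field` for every (minute, player_id) event."""
    idx = FIELD_INDEX[field]
    for _minute, player_id in events:
        if player_id is None:
            # Skip events that have no player attached.
            continue
        players[player_id][idx] += 1


players = defaultdict(new_counters)
# Events copied from the single-match output shown earlier in the tutorial.
tally(players, [(18, 30618), (34, 30618), (37, 37139)], 'shoton')
tally(players, [(83, 30853)], 'goal')
tally(players, [(83, 30889)], 'assist')
tally(players, [(56, 37442), (90, 46621)], 'ycard')

print(dict(players))
```

Because the counter list is created on first access, the separate "player not seen yet" branches are no longer needed.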

In [17]:
players = {}
def plot_player_tabs(season):
    # reset the shared dictionary so that statistics from an earlier call do not accumulate
    players.clear()
    create_player_dict(season)
    shots, goals, assists, ycards, rcards, names, ids = [], [], [], [], [], [], []
    for p in players:
        if players[p][2] > 0:
            if int(p) in player_dict:
                shots.append(players[p][0] + players[p][1])
                goals.append(players[p][2])
                assists.append(players[p][3])
                ycards.append(players[p][5])
                rcards.append(players[p][6])
                names.append(player_dict[int(p)])
                ids.append(int(p))
    source = ColumnDataSource(data=dict(shots=shots,goals=goals,assists=assists, ycards=ycards, rcards=rcards, names=names, ids=ids))
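    # "@column" shows the data value of the hovered point; "$x"/"$y" would show the cursor position instead
    hover1 = HoverTool(tooltips=[("(shots, goals)", "(@shots, @goals)"), ("name", "@names"), ("id", "@ids")])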
    p1 = figure(tools=[hover1, 'pan'], title="Player's goal rate in England Premier League", x_axis_label='Shots', y_axis_label="Goals")
    p1.circle('shots', 'goals', size=20, source=source, alpha=0.3, legend="Player", color='purple')
    hover2 = HoverTool(tooltips=[("(assists, goals)", "(@assists, @goals)"), ("name", "@names"), ("id", "@ids")])
    p2 = figure(tools=[hover2, 'pan'], title="Player's assists and goals in England Premier League", x_axis_label='Assists', y_axis_label="Goals")
    p2.circle('assists', 'goals', size=20, source=source, alpha=0.3, legend='Player', color='red')
    hover3 = HoverTool(tooltips=[("(ycards, rcards)", "(@ycards, @rcards)"), ("name", "@names"), ("id", "@ids")])
    p3 = figure(tools=[hover3, 'pan'], title="Player's cards in England Premier League", x_axis_label='Yellow Cards', y_axis_label="Red Cards")
    p3.circle('ycards', 'rcards', size=20, source=source, alpha=0.3, legend='Player', color='orange')
    tab1 = Panel(child=p1, title="Goal Rate")
    tab2 = Panel(child=p2, title="Assists and Goals")
    tab3 = Panel(child=p3, title="Cards")
    tabs = Tabs(tabs=[ tab1, tab2, tab3 ])
    show(tabs)

plot_player_tabs('2011/2012')

Then we can take advantage of Bokeh to visualize these data. We show three aspects of each player. The first tab plots the number of shots against the number of goals. The basic pattern is that more shots lead to more goals overall. The point furthest to the top right is the most prolific shooter of the season, and points towards the top left belong to players with the highest shot conversion rate (goals per shot).
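The conversion rate read off the first tab can also be computed explicitly. The sketch below uses hypothetical player names and numbers, and a `min_shots` threshold that is an assumption added here so that players with only a handful of shots do not dominate the ranking.

```python
# name -> (shots, goals); names and numbers are hypothetical.
season = {
    'Player A': (120, 30),
    'Player B': (40, 16),
    'Player C': (3, 2),
    'Player D': (80, 8),
}


def conversion_rates(stats, min_shots=1):
    """Goals per shot for every player with at least `min_shots` shots."""
    return {
        name: goals / shots
        for name, (shots, goals) in stats.items()
        if shots >= min_shots
    }


rates = conversion_rates(season, min_shots=20)
# Highest goals-per-shot ratio among players above the threshold.
best_finisher = max(rates, key=rates.get)
# Most goals overall, regardless of how many shots were needed.
top_scorer = max(season, key=lambda name: season[name][1])

print(best_finisher, rates[best_finisher])
print(top_scorer, season[top_scorer][1])
```

Whether goals should also be counted as shots depends on how the dataset records them; the sketch simply takes the shot totals as given.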

The second tab shows the relationship between each player's assists and goals. A point in the top-right corner belongs to a player with high numbers of both goals and assists.

The third tab shows each player's yellow and red cards in the season, so we can see who was the toughest player in the Premier League that season.

## Summary and references

This tutorial highlighted some methods and skills for analysing and visualizing a soccer dataset. More detail about the libraries, further questions on visualizing soccer data, and more examples are available from the following links.

1. Bokeh: https://bokeh.pydata.org/en/latest/
2. FIFA database: https://www.easports.com/fifa/ultimate-team/fut/database
3. Visualization EPL: https://blog.graphiq.com/visualization-update-english-premier-league-graphiq-feed-world-football-358417ac9530
4. European Soccer Database: https://www.kaggle.com/hugomathien/soccer/data
5. Interactive visualization of EPL data: http://kpotluri.github.io/SoccerGuru/