# Analysis and Prediction of Chemistry Between Soccer Players

In modern soccer, collective performance goes beyond individual metrics such as goals, completed passes, or tackles. 
The interaction between players, the so-called "chemistry", is a crucial factor that is often overlooked, especially 
in recruitment decisions. Cases such as Liverpool signing Andy Carroll and Luis Suárez at the same time in 2011 
illustrate how a lack of on-pitch understanding can undermine performance, even with technically qualified athletes.

This work investigates how well pairs of players who have played together combine, using 
individual and collective performance data. The goal is to classify the level of chemistry between these pairs and to train 
a model capable of predicting the compatibility of players who have never played side by side.

Group: Gabriel Castelo, Matheus Vaz, Victor Kenji, and Vinicius Gomes

GitHub repository: [https://github.com/vinisilvag/cdaf-projeto](https://github.com/vinisilvag/cdaf-projeto)

In [1]:
# Standard library
import random
from functools import reduce
from collections import defaultdict

# Data manipulation
import pandas as pd
import scipy.ndimage

# Progress bar
from tqdm import tqdm

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import matplotsoccer as mps

# Web scraping / HTML parsing
import requests
from bs4 import BeautifulSoup, Comment

# Fuzzy matching
from thefuzz import fuzz, process

# Soccer action analytics
import socceraction.spadl as spd
from socceraction.vaep import features as ft
from socceraction.vaep import labels as lab
from socceraction.vaep import formula as fm

# Machine learning
from sklearn.model_selection import train_test_split
import xgboost as xgb
import sklearn.metrics as mt

# Pandas configuration
pd.set_option('future.no_silent_downcasting', True)

In [8]:
# Import custom modules
from webscraping import *
from viz_utils import *
from io_utils import *
from misc_utils import *
from metrics import *

### Scraping FBref data to enrich our player data

The methodology of the reference work relies on proprietary data. To work around this, we scrape data from FBref to enrich the player information, so that the attributes we use to compute the metrics and train the model are similar to the ones used there.

Note: scraping is complete! There is no need to run the following cells, since the collected data is already saved locally (unless we want to add more data).
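Sports Reference sites such as FBref are known to ship some stat tables inside HTML comments instead of the live DOM, which can make a lookup by table id fail. The sketch below shows one way to search both places with BeautifulSoup. It is not the project's `get_data_by_league`; the helper `find_table` and the inline HTML fragment are hypothetical and only illustrate the idea.

```python
from bs4 import BeautifulSoup, Comment


def find_table(html, table_id):
    """Return the <table> tag with the given id, also searching HTML comments."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is not None:
        return table
    # Fall back to tables that are wrapped in <!-- ... --> comments.
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        if table_id not in comment:
            continue
        table = BeautifulSoup(comment, "html.parser").find("table", id=table_id)
        if table is not None:
            return table
    raise ValueError(f"Table {table_id} not found!")


# Hypothetical page fragment: the table only exists inside a comment.
sample_html = """
<div id="all_stats_passing">
<!--
<table id="stats_passing">
  <tr><th>Player</th><th>Cmp</th></tr>
  <tr><td>Player A</td><td>42</td></tr>
</table>
-->
</div>
"""

table = find_table(sample_html, "stats_passing")
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]
print(rows)
```

If a request is blocked or rate limited, the table is missing from the response altogether, so checking the HTTP status code before parsing is a reasonable first step.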

In [6]:
player_data = {
    "England": [
        ("https://fbref.com/en/comps/9/2017-2018/passing/2017-2018-Premier-League-Stats", "stats_passing"),
        ("https://fbref.com/en/comps/9/2017-2018/defense/2017-2018-Premier-League-Stats", "stats_defense"),
        ("https://fbref.com/en/comps/9/2017-2018/passing_types/2017-2018-Premier-League-Stats", "stats_passing_types"),
        ("https://fbref.com/en/comps/9/2017-2018/misc/2017-2018-Premier-League-Stats", "stats_misc"),
    ],
    "Spain": [
        ("https://fbref.com/en/comps/12/2017-2018/passing/2017-2018-La-Liga-Stats", "stats_passing"),
        ("https://fbref.com/en/comps/12/2017-2018/defense/2017-2018-La-Liga-Stats", "stats_defense"),
        ("https://fbref.com/en/comps/12/2017-2018/passing_types/2017-2018-La-Liga-Stats", "stats_passing_types"),
        ("https://fbref.com/en/comps/12/2017-2018/misc/2017-2018-La-Liga-Stats", "stats_misc"),
    ],
    "Germany": [
        ("https://fbref.com/en/comps/20/2017-2018/passing/2017-2018-Bundesliga-Stats", "stats_passing"),
        ("https://fbref.com/en/comps/20/2017-2018/defense/2017-2018-Bundesliga-Stats", "stats_defense"),
        ("https://fbref.com/en/comps/20/2017-2018/passing_types/2017-2018-Bundesliga-Stats", "stats_passing_types"),
        ("https://fbref.com/en/comps/20/2017-2018/misc/2017-2018-Bundesliga-Stats", "stats_misc"),
    ],
    "Italy": [
        ("https://fbref.com/en/comps/11/2017-2018/passing/2017-2018-Serie-A-Stats", "stats_passing"),
        ("https://fbref.com/en/comps/11/2017-2018/defense/2017-2018-Serie-A-Stats", "stats_defense"),
        ("https://fbref.com/en/comps/11/2017-2018/passing_types/2017-2018-Serie-A-Stats", "stats_passing_types"),
        ("https://fbref.com/en/comps/11/2017-2018/misc/2017-2018-Serie-A-Stats", "stats_misc"),
    ],
    "France": [
        ("https://fbref.com/en/comps/13/2017-2018/passing/2017-2018-Ligue-1-Stats", "stats_passing"),
        ("https://fbref.com/en/comps/13/2017-2018/defense/2017-2018-Ligue-1-Stats", "stats_defense"),
        ("https://fbref.com/en/comps/13/2017-2018/passing_types/2017-2018-Ligue-1-Stats", "stats_passing_types"),
        ("https://fbref.com/en/comps/13/2017-2018/misc/2017-2018-Ligue-1-Stats", "stats_misc"),
    ]
}

In [7]:
england_players_scrapped_data = get_data_by_league(player_data["England"])

ValueError: Table stats_passing not found!

In [5]:
spain_players_scrapped_data = get_data_by_league(player_data["Spain"])

ValueError: Table stats_passing not found!

In [None]:
germany_players_scrapped_data = get_data_by_league(player_data["Germany"])

In [None]:
italy_players_scrapped_data = get_data_by_league(player_data["Italy"])

In [None]:
france_players_scrapped_data = get_data_by_league(player_data["France"])

In [None]:
england_players = reduce(lambda left, right: pd.merge(left, right, on='Player', how='inner'), england_players_scrapped_data)
spain_players = reduce(lambda left, right: pd.merge(left, right, on='Player', how='inner'), spain_players_scrapped_data)
germany_players = reduce(lambda left, right: pd.merge(left, right, on='Player', how='inner'), germany_players_scrapped_data)
italy_players = reduce(lambda left, right: pd.merge(left, right, on='Player', how='inner'), italy_players_scrapped_data)
france_players = reduce(lambda left, right: pd.merge(left, right, on='Player', how='inner'), france_players_scrapped_data)
enhanced_players = pd.concat([england_players, spain_players, germany_players, italy_players, france_players], ignore_index=True)
enhanced_players.to_csv('data/players_scrapped.csv', index=False)

In [7]:
players_merged = save_players_merged("data/")
players_merged.to_csv('data/players_merged.csv', index=False)

### Helper functions

In [14]:
leagues = ['England']
# leagues = ['England', 'Spain', 'Germany', 'Italy', 'France']
events = {}
matches = {}

for league in tqdm(leagues):
    path = r'data/matches/matches_{}.json'.format(league)
    matches[league] = load_matches(path)
    path = r'data/events/events_{}.json'.format(league)
    events[league] = load_events(path)

path = r'data/'
players = load_players(path)
players['player_name'] = players['player_name'].str.decode('unicode-escape')

100%|██████████| 1/1 [00:05<00:00,  5.48s/it]


## Mapping to SPADL

In [15]:
spadl = {}
for league in leagues:
    spadl[league] = spadl_transform(events=events[league], matches=matches[league])
    # Add player names
    spadl[league] = spadl[league].merge(players[['player_id', 'player_name']], on='player_id', how='left')

NameError: name 'spadl_transform' is not defined

## Exploratory analysis

In [None]:
spadl['England']

In [None]:
spadl['Spain']

In [None]:
spadl['Germany']

In [None]:
spadl['Italy']

In [None]:
spadl['France']

### Players

In [None]:
players_england = spadl['England']['player_name'].unique()
print("Number of players in the Premier League:", len(players_england))
players_england

In [None]:
players_spain = spadl['Spain']['player_name'].unique()
print("Number of players in La Liga:", len(players_spain))
players_spain

In [None]:
players_germany = spadl['Germany']['player_name'].unique()
print("Number of players in the Bundesliga:", len(players_germany))
players_germany

In [None]:
players_italy = spadl['Italy']['player_name'].unique()
print("Number of players in Serie A:", len(players_italy))
players_italy

In [None]:
players_france = spadl['France']['player_name'].unique()
print("Number of players in Ligue 1:", len(players_france))
players_france

### Average number of actions per game in each league

Observations:
- The average number of actions per game is roughly the same across leagues.
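The per-league averages can also be collected in a single table, which makes the comparison easier to read. This is a minimal sketch on made-up data; `toy_spadl` is a hypothetical stand-in for the `spadl` dictionary, assuming one row per action and a `game_id` column.

```python
import pandas as pd

# Toy SPADL-like tables keyed by league; values are made up for illustration.
toy_spadl = {
    "England": pd.DataFrame({"game_id": [1, 1, 1, 2, 2]}),
    "Spain": pd.DataFrame({"game_id": [10, 10, 11, 11, 11, 11]}),
}

# One row per action, so the group size is the number of actions in a game.
avg_actions = pd.Series(
    {league: df.groupby("game_id").size().mean() for league, df in toy_spadl.items()},
    name="avg_actions_per_game",
)
print(avg_actions)
```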

In [None]:
spadl['England'].groupby('game_id').size().mean()

In [None]:
spadl['Spain'].groupby('game_id').size().mean()

In [None]:
spadl['Germany'].groupby('game_id').size().mean()

In [None]:
spadl['Italy'].groupby('game_id').size().mean()

In [None]:
spadl['France'].groupby('game_id').size().mean()

### Distribution of Action Types

Histogram of the frequency of each action type in the analyzed data.

Observations:
- Passes are the most common action;
- Depending on the league, interceptions and ball carries alternate as the second most frequent action type;
- Long balls, crosses, clearances, and tackles occur less often.
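To compare leagues directly, relative frequencies can be more convenient than raw counts, because the leagues do not all play the same number of games (the Bundesliga has 306 per season, the others 380). The sketch below uses made-up data; `toy_spadl` is a hypothetical stand-in for the `spadl` dictionary.

```python
import pandas as pd

# Toy SPADL-like tables keyed by league; values are made up for illustration.
toy_spadl = {
    "England": pd.DataFrame({"type_name": ["pass", "pass", "pass", "dribble"]}),
    "Spain": pd.DataFrame({"type_name": ["pass", "interception", "pass", "cross"]}),
}

# Relative frequency of each action type, one column per league.
# Types that never occur in a league get a share of 0.
type_share = pd.DataFrame(
    {league: df["type_name"].value_counts(normalize=True) for league, df in toy_spadl.items()}
).fillna(0.0)
print(type_share.sort_values("England", ascending=False))
```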

In [None]:
plot_action_counts(spadl['England']['type_name'].value_counts())

In [None]:
plot_action_counts(spadl['Spain']['type_name'].value_counts())

In [None]:
plot_action_counts(spadl['Germany']['type_name'].value_counts())

In [None]:
plot_action_counts(spadl['Italy']['type_name'].value_counts())

In [None]:
plot_action_counts(spadl['France']['type_name'].value_counts())

### Actions per Player

Top 10 players with the most recorded actions in the league.

Observations:
- The most involved players are generally midfielders or key players for their team;
- The number of actions appears to vary little.

In [None]:
plot_top_active_players(spadl['England']['player_name'].value_counts().head(10))

In [None]:
plot_top_active_players(spadl['Spain']['player_name'].value_counts().head(10))

In [None]:
plot_top_active_players(spadl['Germany']['player_name'].value_counts().head(10))

In [None]:
plot_top_active_players(spadl['Italy']['player_name'].value_counts().head(10))

In [None]:
plot_top_active_players(spadl['France']['player_name'].value_counts().head(10))

### Pass Heatmap

Heatmap of passes by area of the pitch.

Observations:
- Passing in the 5 leagues follows roughly the same distribution across the pitch;
- Few passes are exchanged inside the opposition penalty area;
- It seems common to have a midfield player who distributes the ball and links defense and attack.
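The implementation of `plot_pass_heatmap` is not shown here. As a rough illustration of what such a heatmap is built from, the sketch below bins pass start locations on a grid over the 105 m x 68 m SPADL pitch with NumPy. The coordinates and the 3 x 2 grid are made up for illustration.

```python
import numpy as np

PITCH_LENGTH, PITCH_WIDTH = 105.0, 68.0  # SPADL pitch dimensions in meters

# Made-up pass start locations (x along the length, y along the width).
start_x = np.array([10.0, 52.5, 52.5, 60.0, 100.0])
start_y = np.array([34.0, 34.0, 30.0, 10.0, 60.0])

# Count passes per cell on a 3 x 2 grid covering the whole pitch.
counts, x_edges, y_edges = np.histogram2d(
    start_x,
    start_y,
    bins=(3, 2),
    range=((0.0, PITCH_LENGTH), (0.0, PITCH_WIDTH)),
)
print(counts)
```

A finer grid combined with a smoothing step such as `scipy.ndimage.gaussian_filter` gives the familiar continuous look.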

In [None]:
england_passes = spadl['England'][spadl['England']['type_name'] == 'pass']
plot_pass_heatmap(england_passes)

In [None]:
spain_passes = spadl['Spain'][spadl['Spain']['type_name'] == 'pass']
plot_pass_heatmap(spain_passes)

In [None]:
germany_passes = spadl['Germany'][spadl['Germany']['type_name'] == 'pass']
plot_pass_heatmap(germany_passes)

In [None]:
italy_passes = spadl['Italy'][spadl['Italy']['type_name'] == 'pass']
plot_pass_heatmap(italy_passes)

In [None]:
france_passes = spadl['France'][spadl['France']['type_name'] == 'pass']
plot_pass_heatmap(france_passes)

### Sequence: Actions That Led to a Goal

Histogram of the frequency of each action type that immediately preceded a goal.

Observations:
- Passes and crosses lead as the actions preceding goals;
- Ball carries do not precede as many goals;
- In La Liga, rebounds tend to generate more goals than corners.
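Looking up the previous action by position (`index - 1`) assumes a default integer index and can pair a goal with the last action of the previous game. A hedged alternative is to shift within each game, as sketched below on made-up data. It assumes the rows are already in chronological order within each `game_id`.

```python
import pandas as pd

# Toy SPADL-like actions for two games; values are made up for illustration.
actions = pd.DataFrame({
    "game_id":     [1, 1, 1, 2, 2],
    "type_name":   ["pass", "cross", "shot", "shot", "pass"],
    "result_name": ["success", "success", "success", "success", "success"],
})

# Previous action type within the same game (NaN for a game's first action).
actions["prev_type"] = actions.groupby("game_id")["type_name"].shift(1)

# In SPADL, a successful shot is a goal.
is_goal = (actions["type_name"] == "shot") & (actions["result_name"] == "success")
pre_goal_counts = actions.loc[is_goal, "prev_type"].value_counts()
print(pre_goal_counts)
```

In this toy example the goal that opens game 2 has no preceding action and is dropped, instead of being attributed to the shot that ended game 1.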

In [None]:
goal_indices = spadl['England'][(spadl['England']['type_name'] == 'shot') & (spadl['England']['result_name'] == 'success')].index
pre_goal_actions = spadl['England'].loc[goal_indices - 1]
action_sequences = pre_goal_actions['type_name'].value_counts()
plot_action_sequences(action_sequences)

In [None]:
goal_indices = spadl['Spain'][(spadl['Spain']['type_name'] == 'shot') & (spadl['Spain']['result_name'] == 'success')].index
pre_goal_actions = spadl['Spain'].loc[goal_indices - 1]
action_sequences = pre_goal_actions['type_name'].value_counts()
plot_action_sequences(action_sequences)

In [None]:
goal_indices = spadl['Germany'][(spadl['Germany']['type_name'] == 'shot') & (spadl['Germany']['result_name'] == 'success')].index
pre_goal_actions = spadl['Germany'].loc[goal_indices - 1]
action_sequences = pre_goal_actions['type_name'].value_counts()
plot_action_sequences(action_sequences)

In [None]:
goal_indices = spadl['Italy'][(spadl['Italy']['type_name'] == 'shot') & (spadl['Italy']['result_name'] == 'success')].index
pre_goal_actions = spadl['Italy'].loc[goal_indices - 1]
action_sequences = pre_goal_actions['type_name'].value_counts()
plot_action_sequences(action_sequences)

In [None]:
goal_indices = spadl['France'][(spadl['France']['type_name'] == 'shot') & (spadl['France']['result_name'] == 'success')].index
pre_goal_actions = spadl['France'].loc[goal_indices - 1]
action_sequences = pre_goal_actions['type_name'].value_counts()
plot_action_sequences(action_sequences)

### Shot Heatmap

Heatmap of the shots taken in the league.

Observations:
- The 5 leagues follow the same shooting pattern.

In [None]:
plot_shot_heatmap(spadl['England'][spadl['England']['type_name'] == 'shot'])

In [None]:
plot_shot_heatmap(spadl['Spain'][spadl['Spain']['type_name'] == 'shot'])

In [None]:
plot_shot_heatmap(spadl['Germany'][spadl['Germany']['type_name'] == 'shot'])

In [None]:
plot_shot_heatmap(spadl['Italy'][spadl['Italy']['type_name'] == 'shot'])

In [None]:
plot_shot_heatmap(spadl['France'][spadl['France']['type_name'] == 'shot'])

### Assist Map

Map with arrows indicating passes that precede a shot.

Observations:
- Only a sample of the assists was plotted, to avoid clutter;
- The assists mostly precede shots at the opponent's goal;
- The subsequent shots are taken mainly from inside the opponent's penalty area.
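Taking the first rows of a table yields the earliest games rather than a representative sample, and a pass followed by a shot is only a shot assist when both actions belong to the same team. The sketch below illustrates both points on made-up data. It assumes `game_id`, `team_id`, and `type_name` columns and rows in chronological order.

```python
import pandas as pd

# Toy SPADL-like actions; values are made up for illustration.
actions = pd.DataFrame({
    "game_id":   [1, 1, 1, 1, 1, 1],
    "team_id":   [7, 7, 7, 9, 7, 7],
    "type_name": ["pass", "shot", "pass", "shot", "pass", "shot"],
})

# Next action within the same game, aligned with the current row.
nxt = actions.groupby("game_id")[["team_id", "type_name"]].shift(-1)

# A pass counts only if the next action is a shot by the same team.
is_assist = (
    (actions["type_name"] == "pass")
    & (nxt["type_name"] == "shot")
    & (nxt["team_id"] == actions["team_id"])
)
assists = actions[is_assist]

# Random, reproducible subset for plotting instead of the first rows.
sample = assists.sample(n=min(len(assists), 300), random_state=42)
print(sample)
```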

In [None]:
shot_idx = spadl['England'][spadl['England']['type_name'] == 'shot'].index
assist_idx = shot_idx - 1
assists = spadl['England'].loc[assist_idx][spadl['England'].loc[assist_idx]['type_name'] == 'pass']
plot_assists_heatmap(assists.head(300))

In [None]:
shot_idx = spadl['Spain'][spadl['Spain']['type_name'] == 'shot'].index
assist_idx = shot_idx - 1
assists = spadl['Spain'].loc[assist_idx][spadl['Spain'].loc[assist_idx]['type_name'] == 'pass']
plot_assists_heatmap(assists.head(300))

In [None]:
shot_idx = spadl['Germany'][spadl['Germany']['type_name'] == 'shot'].index
assist_idx = shot_idx - 1
assists = spadl['Germany'].loc[assist_idx][spadl['Germany'].loc[assist_idx]['type_name'] == 'pass']
plot_assists_heatmap(assists.head(300))

In [None]:
shot_idx = spadl['Italy'][spadl['Italy']['type_name'] == 'shot'].index
assist_idx = shot_idx - 1
assists = spadl['Italy'].loc[assist_idx][spadl['Italy'].loc[assist_idx]['type_name'] == 'pass']
plot_assists_heatmap(assists.head(300))

In [None]:
shot_idx = spadl['France'][spadl['France']['type_name'] == 'shot'].index
assist_idx = shot_idx - 1
assists = spadl['France'].loc[assist_idx][spadl['France'].loc[assist_idx]['type_name'] == 'pass']
plot_assists_heatmap(assists.head(300))

In [None]:
games = select_random_games(spadl)
print("Selected games:", games)

plot_buildup_last_events(spadl, games, last_n=10)
plot_attack_heatmap(spadl, games, bins=30)

## Paper - Training VAEP and computing JOI and JDI

In [None]:
features = {}
labels = {}
for league in leagues:
    features[league] = features_transform(spadl[league])
    features[league] = features[league].astype({"period_id_a0": int, "period_id_a1": int, "period_id_a2": int})
    labels[league] = labels_transform(spadl[league])

100%|███████████████████████████████████████████████| 380/380 [00:16<00:00, 23.09it/s]
100%|███████████████████████████████████████████████| 380/380 [00:18<00:00, 20.07it/s]


In [None]:
features_concat = pd.concat(features.values(), ignore_index=True)
labels_concat = pd.concat(labels.values(), ignore_index=True)

X_train, X_test, y_train, y_test = train_test_split(features_concat, labels_concat, test_size=0.2, random_state=42)
models = train_vaep(X_train, y_train, X_test, y_test)

training scores model
scores Train NBS: 0.8070185437562029

scores Test NBS: 0.8066906690512291

----------------------------------------
training concedes model
concedes Train NBS: 0.9591881632001684

concedes Test NBS: 0.966907347830372

----------------------------------------
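The log does not define "NBS". Assuming it stands for a normalized Brier score, that is, the model's Brier score divided by that of a baseline that always predicts the base rate of the label, it could be computed as sketched below. `normalized_brier_score` is a hypothetical helper and not necessarily what `train_vaep` does. Under that reading, values below 1 mean the model beats the baseline.

```python
import numpy as np


def normalized_brier_score(y_true, y_prob):
    """Brier score of y_prob divided by that of a constant base-rate forecast."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    brier = np.mean((y_prob - y_true) ** 2)
    # Baseline: always predict the observed frequency of the positive class.
    baseline = np.mean((y_true.mean() - y_true) ** 2)
    return brier / baseline


# Made-up labels and predicted probabilities.
y_true = [0, 0, 0, 1]
y_prob = [0.1, 0.2, 0.1, 0.6]
nbs = normalized_brier_score(y_true, y_prob)
print(nbs)
```

The ratio is undefined when all labels are identical, since the baseline error is then zero.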


In [None]:
preds = {}
for league in leagues:
    preds[league] = generate_predictions(features=features[league], models=models)

In [None]:
action_values = {}
for league in leagues:
    action_values[league] = calculate_action_values(spadl=spadl[league], predictions=preds[league])

In [None]:
action_values["England"]

| | original_event_id | action_id | game_id | start_x | start_y | end_x | end_y | type_name | result_name | player_id | Pscores | Pconcedes | offensive_value | defensive_value | vaep_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 177959171 | 0 | 2499719 | 51.45 | 34.68 | 32.55 | 14.96 | pass | success | 25413 | 0.000675 | 0.000127 | 0.000000 | -0.000000 | 0.000000 |
| 1 | 177959172 | 1 | 2499719 | 32.55 | 14.96 | 53.55 | 17.00 | pass | success | 370224 | 0.003925 | 0.000683 | 0.003250 | -0.000556 | 0.002693 |
| 2 | 177959173 | 2 | 2499719 | 53.55 | 17.00 | 36.75 | 19.72 | pass | success | 3319 | 0.002881 | 0.000905 | -0.001044 | -0.000222 | -0.001265 |
| 3 | 177959174 | 3 | 2499719 | 36.75 | 19.72 | 43.05 | 3.40 | pass | success | 120339 | 0.003034 | 0.000719 | 0.000153 | 0.000186 | 0.000340 |
| 4 | 177959175 | 4 | 2499719 | 43.05 | 3.40 | 75.60 | 8.16 | pass | success | 167145 | 0.005474 | 0.000593 | 0.002440 | 0.000126 | 0.002565 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 610378 | 251596226 | 1136 | 2500098 | 55.65 | 7.48 | 103.95 | 19.04 | pass | success | 20620 | 0.081724 | 0.001691 | 0.064737 | -0.000010 | 0.064726 |
| 610379 | 251596229 | 1137 | 2500098 | 103.95 | 19.04 | 103.95 | 19.04 | cross | fail | 14703 | 0.032199 | 0.004769 | -0.049525 | -0.003078 | -0.052603 |
| 610380 | 251596408 | 1138 | 2500098 | 2.10 | 46.92 | 0.00 | 46.24 | interception | success | 8239 | 0.004789 | 0.036650 | 0.000020 | -0.004450 | -0.004430 |
| 610381 | 251596232 | 1139 | 2500098 | 105.00 | 0.00 | 92.40 | 36.04 | corner_crossed | success | 70965 | 0.059718 | 0.003889 | 0.013218 | -0.003889 | 0.009330 |
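The first rows of the output above are consistent with the usual VAEP decomposition for consecutive actions by the same team: the offensive value is the change in `Pscores`, the defensive value is the negative change in `Pconcedes`, and `vaep_value` is their sum. The sketch below recomputes them from the printed probabilities. It is an illustration, not the project's `calculate_action_values`; giving the first action of the game a value of zero and ignoring possession changes are assumptions made to match the printed rows.

```python
import pandas as pd

# Pscores / Pconcedes copied from the first five rows of the output above.
rows = pd.DataFrame({
    "Pscores":   [0.000675, 0.003925, 0.002881, 0.003034, 0.005474],
    "Pconcedes": [0.000127, 0.000683, 0.000905, 0.000719, 0.000593],
})

# Change in scoring probability relative to the previous action.
rows["offensive_value"] = rows["Pscores"].diff().fillna(0.0)
# A drop in conceding probability is a positive defensive contribution.
rows["defensive_value"] = -rows["Pconcedes"].diff().fillna(0.0)
rows["vaep_value"] = rows["offensive_value"] + rows["defensive_value"]
print(rows.round(6))
```

Because the printed probabilities are rounded to six decimals, the recomputed values can differ from the printed ones in the last digit.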


### JOI test