# Introduction

The goal of this project is to analyse football match data and to predict match results with supervised classification algorithms such as KNN, the perceptron and decision trees.

We want to predict the outcome of a match: whether the first team wins, the second team wins, or the match ends in a draw. Match classification is therefore a 3-class problem.


## Training data

In this part, the goal is to retrieve the data from the database and to represent it in a form that classification algorithms can use to predict the outcome of a match.


In [157]:
import sqlite3
import pandas as pd
import numpy as np

from IPython.display import display
pd.options.display.max_columns = None

conn = sqlite3.connect("data/database.sqlite")
conn.row_factory = sqlite3.Row

In [158]:
teams = pd.read_sql_query("Select * from Team", conn)

In [159]:
team_attributes = pd.read_sql_query("Select * from Team_Attributes", conn)

In [160]:
countries = pd.read_sql_query("Select * from Country", conn)

In [161]:
league = pd.read_sql_query("Select * from League", conn)

In [162]:
match = pd.read_sql_query("Select * from Match", conn)

In [163]:
players = pd.read_sql_query("Select * from Player", conn)

In [164]:
player_attributes = pd.read_sql_query("Select * from Player_attributes", conn)

The goal is to build labeled sets containing the training data. To do so, we aggregate the information from the tables so that each match is represented by a single vector. Columns with no value are set to NaN, and the classification algorithms will ignore them. Categorical (non-numeric) columns will be represented as categories and handled separately.

Each data point is a match vector paired with its class, one of {'home_win', 'draw', 'away_win'}, which indicates the result of the match.
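
As an illustration of how the class could be attached to a match, here is a minimal sketch that derives the label from the goals scored. The helper name `match_label` and the toy rows are assumptions made for this example; only the column names `home_team_goal` and `away_team_goal` come from the Match table.

```python
import numpy as np
import pandas as pd

def match_label(home_goals, away_goals):
    """Return the outcome class of each match from the goals scored (hypothetical helper)."""
    home_goals = np.asarray(home_goals)
    away_goals = np.asarray(away_goals)
    # First condition that holds wins; equal scores fall through to the default.
    return np.select(
        [home_goals > away_goals, home_goals < away_goals],
        ["home_win", "away_win"],
        default="draw",
    )

# Toy rows invented for illustration, not taken from the database.
toy = pd.DataFrame({"home_team_goal": [2, 1, 0], "away_team_goal": [0, 1, 3]})
toy["label"] = match_label(toy["home_team_goal"], toy["away_team_goal"])
print(toy)
```

The same call should work on the full match table, since the two goal columns have no missing values.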

In [213]:
conn.close()

In [219]:
display(teams.sample(5))

Unnamed: 0,id,team_api_id,team_fifa_api_id,team_long_name,team_short_name
84,11817,8576,614.0,AC Ajaccio,AJA
8,9,7947,,FCV Dender EH,DEN
163,26554,8526,635.0,De Graafschap,GRA
233,38791,188163,,Tondela,TON
23,2476,8573,682.0,KV Oostende,OOS


In [215]:
display(teams[teams['team_long_name']== 'FC Barcelona'])
display(teams[teams['team_long_name']=="FC Bayern Munich"])

Unnamed: 0,id,team_api_id,team_fifa_api_id,team_long_name,team_short_name
258,43042,8634,241.0,FC Barcelona,BAR


Unnamed: 0,id,team_api_id,team_fifa_api_id,team_long_name,team_short_name
94,15617,9823,21.0,FC Bayern Munich,BMU


In [168]:
display(team_attributes.sample(5))
# id, team_fifa_api_id, team_api_id, *date*, team attributes
# Data: the attribute columns
# 13 categorical attributes (object dtype); the others are int64 or float64

Unnamed: 0,id,team_fifa_api_id,team_api_id,date,buildUpPlaySpeed,buildUpPlaySpeedClass,buildUpPlayDribbling,buildUpPlayDribblingClass,buildUpPlayPassing,buildUpPlayPassingClass,buildUpPlayPositioningClass,chanceCreationPassing,chanceCreationPassingClass,chanceCreationCrossing,chanceCreationCrossingClass,chanceCreationShooting,chanceCreationShootingClass,chanceCreationPositioningClass,defencePressure,defencePressureClass,defenceAggression,defenceAggressionClass,defenceTeamWidth,defenceTeamWidthClass,defenceDefenderLineClass
0,1,434,9930,2010-02-22 00:00:00,60,Balanced,,Little,50,Mixed,Organised,60,Normal,65,Normal,55,Normal,Organised,50,Medium,55,Press,45,Normal,Cover
1,2,434,9930,2014-09-19 00:00:00,52,Balanced,48.0,Normal,56,Mixed,Organised,54,Normal,63,Normal,64,Normal,Organised,47,Medium,44,Press,54,Normal,Cover
2,3,434,9930,2015-09-10 00:00:00,47,Balanced,41.0,Normal,54,Mixed,Organised,54,Normal,63,Normal,64,Normal,Organised,47,Medium,44,Press,54,Normal,Cover
3,4,77,8485,2010-02-22 00:00:00,70,Fast,,Little,70,Long,Organised,70,Risky,70,Lots,70,Lots,Organised,60,Medium,70,Double,70,Wide,Cover
4,5,77,8485,2011-02-22 00:00:00,47,Balanced,,Little,52,Mixed,Organised,53,Normal,48,Normal,52,Normal,Organised,47,Medium,47,Press,52,Normal,Cover


In [169]:
# 1458 rows for 299 teams: several dates per team
display(team_attributes[team_attributes['team_api_id']==8634]) # FC Barcelona
# E.g. FC Barcelona has 6 entries, one per year from 2010 to 2015.

Unnamed: 0,id,team_fifa_api_id,team_api_id,date,buildUpPlaySpeed,buildUpPlaySpeedClass,buildUpPlayDribbling,buildUpPlayDribblingClass,buildUpPlayPassing,buildUpPlayPassingClass,buildUpPlayPositioningClass,chanceCreationPassing,chanceCreationPassingClass,chanceCreationCrossing,chanceCreationCrossingClass,chanceCreationShooting,chanceCreationShootingClass,chanceCreationPositioningClass,defencePressure,defencePressureClass,defenceAggression,defenceAggressionClass,defenceTeamWidth,defenceTeamWidthClass,defenceDefenderLineClass
118,119,241,8634,2010-02-22 00:00:00,42,Balanced,,Little,30,Short,Free Form,65,Normal,40,Normal,70,Lots,Free Form,70,High,30,Contain,70,Wide,Offside Trap
119,120,241,8634,2011-02-22 00:00:00,43,Balanced,,Little,34,Mixed,Free Form,59,Normal,25,Little,68,Lots,Free Form,67,High,43,Press,68,Wide,Cover
120,121,241,8634,2012-02-22 00:00:00,24,Slow,,Little,25,Short,Free Form,37,Normal,24,Little,54,Normal,Free Form,66,Medium,63,Press,66,Normal,Cover
121,122,241,8634,2013-09-20 00:00:00,35,Balanced,,Little,32,Short,Free Form,37,Normal,31,Little,35,Normal,Free Form,61,Medium,63,Press,65,Normal,Cover
122,123,241,8634,2014-09-19 00:00:00,35,Balanced,35.0,Normal,32,Short,Free Form,37,Normal,31,Little,35,Normal,Free Form,61,Medium,63,Press,65,Normal,Cover
123,124,241,8634,2015-09-10 00:00:00,36,Balanced,35.0,Normal,51,Mixed,Free Form,36,Normal,49,Normal,56,Normal,Free Form,61,Medium,65,Press,65,Normal,Cover


In team_attributes, the attribute columns come from FIFA and describe the quality of the team. There are both numeric and categorical columns; how best to handle the categorical ones is still an open question.
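
One possible way to separate and encode the two kinds of columns is sketched below. It is only an assumption about how the categorical columns might be treated, not the method used in the rest of the notebook; the toy values are invented and only the two column names come from Team_Attributes.

```python
import pandas as pd

# Toy rows mimicking two Team_Attributes columns (values invented).
toy = pd.DataFrame({
    "buildUpPlaySpeed": [60, 52, 70],
    "buildUpPlaySpeedClass": ["Balanced", "Balanced", "Fast"],
})

# Split the columns by dtype: numeric on one side, everything else on the other.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Encode each categorical column as integer codes (categories are sorted alphabetically).
encoded = toy.copy()
for col in categorical_cols:
    encoded[col] = encoded[col].astype("category").cat.codes

print(encoded)
```

Note that `cat.codes` maps missing values to -1, so NaN entries would need to be restored afterwards if they must stay distinguishable. On the real table, the `date` column is also non-numeric and would have to be excluded from this encoding.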

In [198]:
null_teams = team_attributes[team_attributes.isnull().any(axis=1)]
display(null_teams)
null_teams.info()
# Rows with at least one NaN column: 969

Unnamed: 0,id,team_fifa_api_id,team_api_id,date,buildUpPlaySpeed,buildUpPlaySpeedClass,buildUpPlayDribbling,buildUpPlayDribblingClass,buildUpPlayPassing,buildUpPlayPassingClass,buildUpPlayPositioningClass,chanceCreationPassing,chanceCreationPassingClass,chanceCreationCrossing,chanceCreationCrossingClass,chanceCreationShooting,chanceCreationShootingClass,chanceCreationPositioningClass,defencePressure,defencePressureClass,defenceAggression,defenceAggressionClass,defenceTeamWidth,defenceTeamWidthClass,defenceDefenderLineClass
0,1,434,9930,2010-02-22 00:00:00,60,Balanced,,Little,50,Mixed,Organised,60,Normal,65,Normal,55,Normal,Organised,50,Medium,55,Press,45,Normal,Cover
3,4,77,8485,2010-02-22 00:00:00,70,Fast,,Little,70,Long,Organised,70,Risky,70,Lots,70,Lots,Organised,60,Medium,70,Double,70,Wide,Cover
4,5,77,8485,2011-02-22 00:00:00,47,Balanced,,Little,52,Mixed,Organised,53,Normal,48,Normal,52,Normal,Organised,47,Medium,47,Press,52,Normal,Cover
5,6,77,8485,2012-02-22 00:00:00,58,Balanced,,Little,62,Mixed,Organised,45,Normal,70,Lots,55,Normal,Organised,40,Medium,40,Press,60,Normal,Cover
6,7,77,8485,2013-09-20 00:00:00,62,Balanced,,Little,45,Mixed,Organised,40,Normal,50,Normal,55,Normal,Organised,42,Medium,42,Press,60,Normal,Cover
9,10,614,8576,2010-02-22 00:00:00,60,Balanced,,Little,40,Mixed,Organised,45,Normal,35,Normal,55,Normal,Organised,30,Deep,70,Double,30,Narrow,Offside Trap
10,11,614,8576,2011-02-22 00:00:00,65,Balanced,,Little,45,Mixed,Organised,65,Normal,65,Normal,50,Normal,Organised,45,Medium,45,Press,50,Normal,Cover
11,12,614,8576,2012-02-22 00:00:00,59,Balanced,,Little,52,Mixed,Organised,48,Normal,34,Normal,52,Normal,Organised,38,Medium,47,Press,53,Normal,Cover
12,13,614,8576,2013-09-20 00:00:00,59,Balanced,,Little,52,Mixed,Organised,48,Normal,34,Normal,52,Normal,Organised,38,Medium,47,Press,53,Normal,Cover
15,16,47,8564,2010-02-22 00:00:00,45,Balanced,,Little,30,Short,Free Form,55,Normal,45,Normal,70,Lots,Free Form,30,Deep,35,Press,60,Normal,Offside Trap


<class 'pandas.core.frame.DataFrame'>
Int64Index: 969 entries, 0 to 1455
Data columns (total 25 columns):
id                                969 non-null int64
team_fifa_api_id                  969 non-null int64
team_api_id                       969 non-null int64
date                              969 non-null object
buildUpPlaySpeed                  969 non-null int64
buildUpPlaySpeedClass             969 non-null object
buildUpPlayDribbling              0 non-null float64
buildUpPlayDribblingClass         969 non-null object
buildUpPlayPassing                969 non-null int64
buildUpPlayPassingClass           969 non-null object
buildUpPlayPositioningClass       969 non-null object
chanceCreationPassing             969 non-null int64
chanceCreationPassingClass        969 non-null object
chanceCreationCrossing            969 non-null int64
chanceCreationCrossingClass       969 non-null object
chanceCreationShooting            969 non-null int64
chanceCreationShootingClass       969 n

In [170]:
#players.drop(['birthday', 'height', 'weight'],axis=1,inplace=True)
display(players.head(5))

Unnamed: 0,id,player_api_id,player_name,player_fifa_api_id,birthday,height,weight
0,1,505942,Aaron Appindangoye,218353,1992-02-29 00:00:00,182.88,187
1,2,155782,Aaron Cresswell,189615,1989-12-15 00:00:00,170.18,146
2,3,162549,Aaron Doran,186170,1991-05-13 00:00:00,170.18,163
3,4,30572,Aaron Galindo,140161,1982-05-08 00:00:00,182.88,198
4,5,23780,Aaron Hughes,17725,1979-11-08 00:00:00,182.88,154


In [200]:
player_attributes.sample(5)

# ids, date, overall_rating, potential, ...attributes...

# Information aggregated in 'overall_rating' or 'potential'

# Categorical: object dtype

# 'preferred_foot' [right, left, (None)]
# 'attacking_work_rate' [low, medium, ?norm?, high.. (None)]
# 'defensive_work_rate' [low, medium, ?ormal?, high.. (None), (_0), (?numeric, e.g. 489710)]
# Numeric: int64 or float64 dtype
# Size: 183978


Unnamed: 0,id,player_fifa_api_id,player_api_id,date,overall_rating,potential,preferred_foot,attacking_work_rate,defensive_work_rate,crossing,finishing,heading_accuracy,short_passing,volleys,dribbling,curve,free_kick_accuracy,long_passing,ball_control,acceleration,sprint_speed,agility,reactions,balance,shot_power,jumping,stamina,strength,long_shots,aggression,interceptions,positioning,vision,penalties,marking,standing_tackle,sliding_tackle,gk_diving,gk_handling,gk_kicking,gk_positioning,gk_reflexes
51397,51398,1393,26461,2007-08-30 00:00:00,78.0,84.0,right,high,medium,38.0,82.0,60.0,58.0,74.0,70.0,42.0,38.0,33.0,73.0,76.0,79.0,76.0,78.0,82.0,77.0,87.0,78.0,60.0,48.0,84.0,82.0,88.0,76.0,75.0,34.0,33.0,56.0,11.0,21.0,33.0,21.0,21.0
96426,96427,173909,27335,2007-02-22 00:00:00,75.0,79.0,right,high,high,44.0,34.0,53.0,78.0,78.0,72.0,74.0,69.0,72.0,79.0,72.0,77.0,79.0,47.0,78.0,78.0,82.0,78.0,76.0,67.0,75.0,77.0,80.0,77.0,69.0,72.0,78.0,67.0,6.0,7.0,72.0,7.0,6.0
12758,12759,171897,31045,2013-06-07 00:00:00,76.0,83.0,left,high,medium,84.0,66.0,58.0,73.0,68.0,83.0,69.0,76.0,74.0,82.0,88.0,84.0,86.0,79.0,91.0,80.0,75.0,85.0,63.0,79.0,71.0,73.0,72.0,77.0,73.0,67.0,75.0,72.0,6.0,9.0,9.0,10.0,9.0
40465,40466,192636,115512,2014-03-28 00:00:00,73.0,79.0,left,high,medium,68.0,56.0,59.0,78.0,55.0,72.0,80.0,75.0,75.0,74.0,69.0,68.0,67.0,76.0,62.0,83.0,69.0,80.0,76.0,77.0,64.0,60.0,69.0,77.0,78.0,48.0,58.0,40.0,10.0,15.0,7.0,15.0,13.0
101871,101872,201428,212420,2013-03-22 00:00:00,56.0,68.0,right,medium,medium,62.0,41.0,51.0,63.0,48.0,61.0,38.0,32.0,54.0,62.0,67.0,71.0,70.0,53.0,73.0,51.0,72.0,63.0,56.0,43.0,54.0,42.0,50.0,54.0,24.0,46.0,60.0,50.0,9.0,15.0,11.0,6.0,5.0


In [192]:
# Rows with NaN values: 3624
null_players = player_attributes[player_attributes.isnull().any(axis=1)]
display(null_players)
# For these rows, only the NaN values should be ignored (plus None/_0/o/numeric values in the categorical columns)

Unnamed: 0,id,player_fifa_api_id,player_api_id,date,overall_rating,potential,preferred_foot,attacking_work_rate,defensive_work_rate,crossing,finishing,heading_accuracy,short_passing,volleys,dribbling,curve,free_kick_accuracy,long_passing,ball_control,acceleration,sprint_speed,agility,reactions,balance,shot_power,jumping,stamina,strength,long_shots,aggression,interceptions,positioning,vision,penalties,marking,standing_tackle,sliding_tackle,gk_diving,gk_handling,gk_kicking,gk_positioning,gk_reflexes
373,374,156626,46447,2010-08-30 00:00:00,64.0,71.0,right,,_0,41.0,33.0,74.0,57.0,24.0,30.0,35.0,40.0,45.0,44.0,60.0,61.0,59.0,58.0,73.0,48.0,75.0,64.0,71.0,39.0,71.0,58.0,28.0,61.0,39.0,62.0,61.0,57.0,15.0,14.0,13.0,10.0,12.0
374,375,156626,46447,2010-02-22 00:00:00,64.0,71.0,right,,_0,41.0,33.0,74.0,57.0,24.0,30.0,35.0,40.0,45.0,44.0,60.0,61.0,59.0,58.0,73.0,48.0,75.0,64.0,71.0,39.0,71.0,62.0,66.0,61.0,58.0,62.0,61.0,57.0,6.0,20.0,45.0,20.0,20.0
375,376,156626,46447,2008-08-30 00:00:00,66.0,71.0,right,,_0,41.0,33.0,74.0,57.0,24.0,30.0,35.0,40.0,45.0,44.0,60.0,61.0,59.0,58.0,73.0,48.0,75.0,64.0,74.0,39.0,71.0,62.0,66.0,61.0,58.0,67.0,61.0,57.0,6.0,20.0,45.0,20.0,20.0
376,377,156626,46447,2007-08-30 00:00:00,68.0,75.0,right,,_0,41.0,33.0,74.0,57.0,24.0,30.0,35.0,40.0,45.0,44.0,60.0,61.0,59.0,62.0,73.0,48.0,75.0,72.0,74.0,39.0,76.0,66.0,69.0,61.0,58.0,69.0,64.0,57.0,6.0,20.0,45.0,20.0,20.0
377,378,156626,46447,2007-02-22 00:00:00,66.0,65.0,right,,_0,41.0,33.0,70.0,51.0,24.0,30.0,35.0,55.0,45.0,44.0,60.0,61.0,59.0,62.0,73.0,48.0,75.0,72.0,74.0,39.0,76.0,66.0,69.0,61.0,55.0,66.0,63.0,57.0,6.0,9.0,45.0,13.0,10.0
392,393,202425,245653,2011-02-22 00:00:00,64.0,69.0,left,,_0,47.0,29.0,65.0,57.0,36.0,47.0,27.0,45.0,55.0,52.0,62.0,67.0,65.0,64.0,59.0,57.0,64.0,65.0,65.0,37.0,67.0,59.0,23.0,52.0,42.0,64.0,65.0,62.0,11.0,5.0,15.0,10.0,10.0
393,394,202425,245653,2007-02-22 00:00:00,64.0,69.0,left,,_0,47.0,29.0,65.0,57.0,36.0,47.0,27.0,45.0,55.0,52.0,62.0,67.0,65.0,64.0,59.0,57.0,64.0,65.0,65.0,37.0,67.0,59.0,23.0,52.0,42.0,64.0,65.0,62.0,11.0,5.0,15.0,10.0,10.0
446,447,52782,38423,2010-02-22 00:00:00,68.0,70.0,right,,_0,60.0,50.0,60.0,74.0,,74.0,,53.0,62.0,73.0,74.0,70.0,,63.0,,64.0,,71.0,64.0,55.0,63.0,69.0,70.0,,47.0,52.0,50.0,,7.0,20.0,62.0,20.0,20.0
447,448,52782,38423,2009-08-30 00:00:00,68.0,70.0,right,,_0,60.0,50.0,60.0,74.0,,74.0,,53.0,62.0,73.0,74.0,70.0,,63.0,,64.0,,71.0,64.0,55.0,63.0,69.0,70.0,,47.0,52.0,50.0,,7.0,20.0,62.0,20.0,20.0
448,449,52782,38423,2008-08-30 00:00:00,68.0,69.0,right,,_0,60.0,50.0,60.0,74.0,,74.0,,53.0,62.0,73.0,74.0,70.0,,63.0,,64.0,,71.0,64.0,55.0,63.0,69.0,70.0,,47.0,52.0,50.0,,7.0,20.0,62.0,20.0,20.0


In [208]:
# Show the raw table first: df_match is only defined further down
display(match.sample(5))
# Size: 25979
# ID: 'match_api_id'

# country, league, season, stage?, date, match_api_id, -teams-, -goals-, player positions (X, Y), players (float ids), -odd columns (goal, shoton.. as XML), -betting odds
df_match = match.copy()
df_match.drop(['id', 'country_id', 'league_id', 'season']+['goal', 'shoton', 'shotoff', 'foulcommit', 'card', 'cross', 'corner', 'possession']+list(df_match.columns[-30:]), axis=1, inplace=True)

# From season/date, pick the most suitable attributes for the players and the teams
# Data: -teams- [inject team attributes here], home_team_goal, -goals-, player positions (X, Y), players [inject player attributes here]

# Output class: 'home_win', 'draw', 'away_win' -> derived from the goals scored by each team


Unnamed: 0,id,country_id,league_id,season,stage,date,match_api_id,home_team_api_id,away_team_api_id,home_team_goal,away_team_goal,home_player_X1,home_player_X2,home_player_X3,home_player_X4,home_player_X5,home_player_X6,home_player_X7,home_player_X8,home_player_X9,home_player_X10,home_player_X11,away_player_X1,away_player_X2,away_player_X3,away_player_X4,away_player_X5,away_player_X6,away_player_X7,away_player_X8,away_player_X9,away_player_X10,away_player_X11,home_player_Y1,home_player_Y2,home_player_Y3,home_player_Y4,home_player_Y5,home_player_Y6,home_player_Y7,home_player_Y8,home_player_Y9,home_player_Y10,home_player_Y11,away_player_Y1,away_player_Y2,away_player_Y3,away_player_Y4,away_player_Y5,away_player_Y6,away_player_Y7,away_player_Y8,away_player_Y9,away_player_Y10,away_player_Y11,home_player_1,home_player_2,home_player_3,home_player_4,home_player_5,home_player_6,home_player_7,home_player_8,home_player_9,home_player_10,home_player_11,away_player_1,away_player_2,away_player_3,away_player_4,away_player_5,away_player_6,away_player_7,away_player_8,away_player_9,away_player_10,away_player_11,goal,shoton,shotoff,foulcommit,card,cross,corner,possession,B365H,B365D,B365A,BWH,BWD,BWA,IWH,IWD,IWA,LBH,LBD,LBA,PSH,PSD,PSA,WHH,WHD,WHA,SJH,SJD,SJA,VCH,VCD,VCA,GBH,GBD,GBA,BSH,BSD,BSA
17739,17740,17642,17642,2008/2009,20,2009-02-28 00:00:00,509283,6403,7841,2,0,1.0,2.0,4.0,6.0,8.0,2.0,4.0,6.0,8.0,4.0,6.0,1.0,2.0,4.0,6.0,8.0,2.0,4.0,6.0,8.0,4.0,6.0,1.0,3.0,3.0,3.0,3.0,7.0,7.0,7.0,7.0,10.0,10.0,1.0,3.0,3.0,3.0,3.0,7.0,7.0,7.0,7.0,10.0,10.0,45245.0,11863.0,24590.0,3263.0,98363.0,11795.0,,22427.0,28461.0,,,150077.0,150080.0,11349.0,11861.0,39629.0,,150082.0,11865.0,40957.0,164212.0,29590.0,,,,,,,,,2.0,3.2,3.5,1.95,3.15,3.7,2.0,3.0,3.3,1.91,3.2,3.5,,,,2.15,2.9,3.2,2.1,2.88,3.75,2.1,3.0,3.5,2.0,3.2,3.6,1.91,3.1,3.75
13272,13273,10257,10257,2015/2016,9,2015-10-25 00:00:00,2060433,8600,9891,1,0,1.0,3.0,5.0,7.0,1.0,3.0,5.0,7.0,9.0,5.0,5.0,1.0,2.0,4.0,6.0,8.0,2.0,4.0,6.0,8.0,4.0,6.0,1.0,3.0,3.0,3.0,7.0,7.0,7.0,7.0,7.0,9.0,11.0,1.0,3.0,3.0,3.0,3.0,7.0,7.0,7.0,7.0,10.0,10.0,13033.0,279873.0,19249.0,27719.0,268394.0,192325.0,41694.0,422685.0,476771.0,38920.0,27734.0,213299.0,182715.0,38122.0,39331.0,415024.0,42443.0,33706.0,415500.0,156534.0,74382.0,259507.0,<goal><value><comment>n</comment><stats><goals...,<shoton><value><stats><shoton>1</shoton></stat...,<shotoff><value><stats><shotoff>1</shotoff></s...,<foulcommit><value><stats><foulscommitted>1</f...,<card><value><comment>y</comment><stats><ycard...,<cross><value><stats><crosses>1</crosses></sta...,<corner><value><stats><corners>1</corners></st...,<possession><value><comment>58</comment><stats...,1.67,3.75,5.5,1.75,3.6,4.5,1.8,3.5,4.2,1.67,3.6,5.0,1.72,3.91,5.47,1.73,3.3,5.5,,,,1.7,3.8,5.75,,,,,,
20304,20305,19694,19694,2010/2011,32,2011-04-10 00:00:00,840420,8429,8548,0,1,1.0,2.0,4.0,6.0,8.0,2.0,4.0,6.0,8.0,4.0,6.0,1.0,2.0,4.0,6.0,8.0,2.0,4.0,6.0,8.0,4.0,6.0,1.0,3.0,3.0,3.0,3.0,7.0,7.0,7.0,7.0,10.0,10.0,1.0,3.0,3.0,3.0,3.0,7.0,7.0,7.0,7.0,10.0,10.0,11024.0,127140.0,32956.0,25505.0,24077.0,190493.0,,32924.0,33553.0,192856.0,25674.0,32496.0,32430.0,32616.0,23998.0,39669.0,32705.0,23792.0,70360.0,171803.0,36783.0,8922.0,,,,,,,,,9.5,5.5,1.29,9.0,5.0,1.25,8.5,4.4,1.3,10.0,5.5,1.29,,,,10.0,5.0,1.25,9.0,5.5,1.29,10.0,5.5,1.3,9.0,5.0,1.27,10.0,5.4,1.29
23076,23077,21518,21518,2012/2013,12,2012-11-18 00:00:00,1260208,9910,8661,1,1,1.0,2.0,4.0,6.0,8.0,4.0,6.0,3.0,5.0,7.0,5.0,1.0,2.0,4.0,6.0,8.0,4.0,6.0,3.0,5.0,7.0,5.0,1.0,3.0,3.0,3.0,3.0,6.0,6.0,8.0,8.0,8.0,11.0,1.0,3.0,3.0,3.0,3.0,6.0,6.0,8.0,8.0,8.0,11.0,71498.0,183462.0,41568.0,192065.0,74721.0,33747.0,205034.0,72623.0,31092.0,39215.0,174472.0,37439.0,260659.0,97491.0,268039.0,291635.0,193978.0,172948.0,103904.0,25462.0,200630.0,45204.0,<goal><value><comment>n</comment><stats><goals...,<shoton />,<shotoff />,<foulcommit />,<card><value><comment>y</comment><stats><ycard...,<cross />,<corner />,<possession />,1.91,3.4,4.0,1.83,3.4,4.33,1.9,3.5,3.7,1.95,3.5,3.75,1.94,3.59,4.4,1.91,3.4,4.0,1.83,3.5,4.5,1.92,3.6,4.33,1.83,3.4,4.33,1.8,3.5,4.2
17752,17753,17642,17642,2008/2009,21,2009-03-08 00:00:00,509296,9809,9772,1,2,1.0,2.0,4.0,6.0,8.0,2.0,4.0,6.0,8.0,4.0,6.0,1.0,2.0,4.0,6.0,8.0,2.0,4.0,6.0,8.0,4.0,6.0,1.0,3.0,3.0,3.0,3.0,7.0,7.0,7.0,7.0,10.0,10.0,1.0,3.0,3.0,3.0,3.0,7.0,7.0,7.0,7.0,10.0,10.0,33550.0,22519.0,45237.0,106175.0,158844.0,25798.0,25915.0,,43293.0,150053.0,112801.0,30582.0,57071.0,30968.0,96937.0,52133.0,46509.0,40004.0,40603.0,38821.0,34653.0,35425.0,,,,,,,,,5.75,3.4,1.6,6.3,3.6,1.5,4.8,3.5,1.55,5.0,3.3,1.62,,,,5.0,3.5,1.57,6.0,3.6,1.53,5.5,3.5,1.55,5.5,3.4,1.6,5.5,3.4,1.57
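
The cell above notes that team and player attributes should be chosen from the match date. A minimal sketch of one way to do this is `pandas.merge_asof`, which picks, for each match, the most recent attribute row dated on or before the match. The match dates below are invented; the three `buildUpPlaySpeed` values are the FC Barcelona entries shown earlier. Using the latest earlier snapshot is an assumption, and other choices (nearest date, same season) are possible.

```python
import pandas as pd

# Toy matches (dates invented); column names follow the Match table.
matches = pd.DataFrame({
    "match_api_id": [1, 2],
    "date": pd.to_datetime(["2011-04-10", "2015-10-25"]),
    "home_team_api_id": [8634, 8634],
})
# Three attribute snapshots for team 8634, as in Team_Attributes.
attrs = pd.DataFrame({
    "team_api_id": [8634, 8634, 8634],
    "date": pd.to_datetime(["2010-02-22", "2011-02-22", "2015-09-10"]),
    "buildUpPlaySpeed": [42, 43, 36],
})

# merge_asof requires both frames to be sorted on the date key.
merged = pd.merge_asof(
    matches.sort_values("date"),
    attrs.sort_values("date"),
    on="date",
    left_by="home_team_api_id",
    right_by="team_api_id",
    direction="backward",  # latest snapshot dated on or before the match
)
print(merged[["match_api_id", "date", "buildUpPlaySpeed"]])
```

A match played before a team's first snapshot gets NaN attributes with `direction="backward"`, which fits the convention of leaving unknown values as NaN. The real `date` columns are stored as strings and would need `pd.to_datetime` first.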


In [220]:
display(df_match.sample(5))
null_match = df_match[df_match.isnull().any(axis=1)]
null_match.sample(3)
# Some teams have unspecified players (e.g. 'home_player_1'): the only way to use these matches is to set every attribute of that player to NaN


Unnamed: 0,stage,date,match_api_id,home_team_api_id,away_team_api_id,home_team_goal,away_team_goal,home_player_X1,home_player_X2,home_player_X3,home_player_X4,home_player_X5,home_player_X6,home_player_X7,home_player_X8,home_player_X9,home_player_X10,home_player_X11,away_player_X1,away_player_X2,away_player_X3,away_player_X4,away_player_X5,away_player_X6,away_player_X7,away_player_X8,away_player_X9,away_player_X10,away_player_X11,home_player_Y1,home_player_Y2,home_player_Y3,home_player_Y4,home_player_Y5,home_player_Y6,home_player_Y7,home_player_Y8,home_player_Y9,home_player_Y10,home_player_Y11,away_player_Y1,away_player_Y2,away_player_Y3,away_player_Y4,away_player_Y5,away_player_Y6,away_player_Y7,away_player_Y8,away_player_Y9,away_player_Y10,away_player_Y11,home_player_1,home_player_2,home_player_3,home_player_4,home_player_5,home_player_6,home_player_7,home_player_8,home_player_9,home_player_10,home_player_11,away_player_1,away_player_2,away_player_3,away_player_4,away_player_5,away_player_6,away_player_7,away_player_8,away_player_9,away_player_10,away_player_11
12332,27,2014-03-09 00:00:00,1536788,9857,7943,0,0,1.0,3.0,5.0,7.0,1.0,3.0,5.0,7.0,9.0,5.0,5.0,1.0,2.0,4.0,6.0,8.0,3.0,5.0,7.0,3.0,5.0,7.0,1.0,3.0,3.0,3.0,7.0,7.0,7.0,7.0,7.0,9.0,11.0,1.0,3.0,3.0,3.0,3.0,7.0,7.0,7.0,10.0,10.0,10.0,27691.0,42308.0,27721.0,73841.0,38148.0,170667.0,26214.0,24789.0,24443.0,40962.0,27690.0,414788.0,154264.0,293205.0,32769.0,112476.0,399549.0,74034.0,27685.0,195199.0,27657.0,41542.0
394,20,2009-12-26 00:00:00,665645,9986,9999,3,0,1.0,2.0,4.0,6.0,8.0,2.0,4.0,6.0,8.0,4.0,6.0,1.0,2.0,4.0,6.0,8.0,2.0,4.0,6.0,8.0,4.0,6.0,1.0,3.0,3.0,3.0,3.0,7.0,7.0,7.0,7.0,10.0,10.0,1.0,3.0,3.0,3.0,3.0,7.0,7.0,7.0,7.0,10.0,10.0,104388.0,,38417.0,37571.0,46666.0,38423.0,39145.0,38439.0,131409.0,38920.0,38419.0,38318.0,,38247.0,94288.0,69805.0,46217.0,33671.0,23997.0,148338.0,148336.0,173432.0
12793,34,2015-05-03 00:00:00,1786350,9876,8600,0,1,1.0,2.0,4.0,6.0,8.0,3.0,5.0,7.0,3.0,5.0,7.0,1.0,2.0,4.0,6.0,8.0,3.0,5.0,7.0,5.0,4.0,6.0,1.0,3.0,3.0,3.0,3.0,7.0,7.0,7.0,10.0,10.0,10.0,1.0,3.0,3.0,3.0,3.0,6.0,6.0,6.0,8.0,10.0,10.0,27558.0,238838.0,39440.0,30740.0,97789.0,41332.0,110378.0,41329.0,362194.0,30709.0,40601.0,13033.0,275034.0,19249.0,306296.0,195190.0,27731.0,188652.0,267863.0,39450.0,401343.0,38920.0
24897,6,2009-08-15 00:00:00,663694,10179,10243,3,3,1.0,2.0,4.0,6.0,8.0,2.0,6.0,8.0,4.0,4.0,6.0,1.0,2.0,4.0,6.0,8.0,2.0,4.0,6.0,8.0,4.0,6.0,1.0,3.0,3.0,3.0,3.0,7.0,7.0,7.0,7.0,10.0,10.0,1.0,3.0,3.0,3.0,3.0,7.0,7.0,7.0,7.0,10.0,10.0,10637.0,37844.0,34066.0,26193.0,95217.0,41084.0,66907.0,132843.0,33107.0,50104.0,25393.0,25811.0,25813.0,121080.0,41717.0,95212.0,92260.0,119702.0,30936.0,33681.0,25824.0,41308.0
20087,34,2010-04-17 00:00:00,820481,8429,8597,3,0,1.0,3.0,4.0,7.0,1.0,3.0,5.0,7.0,9.0,4.0,6.0,1.0,2.0,4.0,6.0,8.0,2.0,4.0,6.0,8.0,4.0,6.0,1.0,3.0,3.0,3.0,7.0,7.0,7.0,7.0,7.0,10.0,10.0,1.0,3.0,3.0,3.0,3.0,7.0,7.0,7.0,7.0,10.0,10.0,11024.0,17480.0,32924.0,32917.0,,31984.0,28901.0,32956.0,25674.0,28199.0,179818.0,32713.0,32711.0,4939.0,32700.0,32696.0,102777.0,32907.0,32694.0,13917.0,24364.0,21528.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25979 entries, 0 to 25978
Data columns (total 73 columns):
stage               25979 non-null int64
date                25979 non-null object
match_api_id        25979 non-null int64
home_team_api_id    25979 non-null int64
away_team_api_id    25979 non-null int64
home_team_goal      25979 non-null int64
away_team_goal      25979 non-null int64
home_player_X1      24158 non-null float64
home_player_X2      24158 non-null float64
home_player_X3      24147 non-null float64
home_player_X4      24147 non-null float64
home_player_X5      24147 non-null float64
home_player_X6      24147 non-null float64
home_player_X7      24147 non-null float64
home_player_X8      24147 non-null float64
home_player_X9      24147 non-null float64
home_player_X10     24147 non-null float64
home_player_X11     24147 non-null float64
away_player_X1      24147 non-null float64
away_player_X2      24147 non-null float64
away_player_X3      24147 non-null float64
a

Post-match information is of no use for predicting the result of a match, so we do not include it in the model. The betting odds are left out as well, because they do not describe the match itself. We can compare our predictions with the match odds at the end.
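
For that final comparison, one simple baseline is to read a predicted class off the odds: with decimal odds, the lowest value marks the outcome the bookmaker considers most likely. The sketch below uses the `B365H`, `B365D` and `B365A` columns with three rows of values copied from the match sample shown earlier; the choice of this bookmaker and the `column_to_label` mapping are assumptions made for illustration.

```python
import pandas as pd

# Three rows of Bet365 odds taken from the sample output of the Match table.
odds = pd.DataFrame({
    "B365H": [2.0, 9.5, 1.67],
    "B365D": [3.2, 5.5, 3.75],
    "B365A": [3.5, 1.29, 5.5],
})

# Map each odds column to the class it stands for (hypothetical mapping name).
column_to_label = {"B365H": "home_win", "B365D": "draw", "B365A": "away_win"}

# The column holding the smallest odds gives the bookmaker's favoured outcome.
bookmaker_pick = odds.idxmin(axis=1).map(column_to_label)
print(bookmaker_pick.tolist())
```

Rows where the odds are missing would have to be dropped or handled separately before calling `idxmin`.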