# **Dataset Description**

## **Intro**

The data was collected from two major sources: Transfermarkt and the FIFA games.

Transfermarkt is a website containing information about nearly all football (soccer) leagues, competitions, teams, and players. Most of the real-life information about the roughly 18,000 players and 650 teams analysed in our capstone project comes from it.

The FIFA games provide ratings of players' abilities based on their real-life performance. Alongside these ratings we also use the real-life history of the trophies and sidelines each player had.

After data collection, our team built a database consisting of the players, transfers, markval, stats, natstats, teams_leagues, sidelined, trophies, sofifa_head and sofifa dataframes.

In [0]:
import pandas as pd
pd.set_option('display.max_columns', None)

In [0]:
players_path = '/content/drive/My Drive/Capstone/Data/Clean/Players_Clean.pkl'
transfers_path = '/content/drive/My Drive/Capstone/Data/Clean/Transfers_Clean.pkl'
markval_path = '/content/drive/My Drive/Capstone/Data/Clean/markval_Clean.pkl'
stats_path = '/content/drive/My Drive/Capstone/Data/Clean/stats_Clean.pkl'
natstats_path = '/content/drive/My Drive/Capstone/Data/Clean/natstats_Clean.pkl'
teams_leagues_path = '/content/drive/My Drive/Capstone/Data/Clean/teams_leagues_Clean.pkl'
sidelined_path = '/content/drive/My Drive/Capstone/Data/Clean/sidelined_Clean.pkl'
trophies_path = '/content/drive/My Drive/Capstone/Data/Clean/trophies_Clean.pkl'
sofifa_head_path = '/content/drive/My Drive/Capstone/Data/Clean/sofifa_head_Clean.pkl'
sofifa_path = '/content/drive/My Drive/Capstone/Data/Clean/sofifa_long_Clean.pkl'
countries_continents_path = '/content/drive/My Drive/Capstone/Data/Clean/countries_leagues_Clean.pkl'
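
The eleven path variables above share one directory and a `_Clean.pkl` suffix, so they could also be generated from a single mapping. The sketch below is one possible way to do that. `BASE_DIR`, `FILE_STEMS` and `build_paths` are hypothetical names that are not used elsewhere in this notebook, and the code only builds the path strings without touching Google Drive.

```python
from pathlib import PurePosixPath

# Directory shared by all path variables above (hypothetical constant name).
BASE_DIR = PurePosixPath('/content/drive/My Drive/Capstone/Data/Clean')

# Dataframe name -> file stem, read off the path variables above.
FILE_STEMS = {
    'players': 'Players',
    'transfers': 'Transfers',
    'markval': 'markval',
    'stats': 'stats',
    'natstats': 'natstats',
    'teams_leagues': 'teams_leagues',
    'sidelined': 'sidelined',
    'trophies': 'trophies',
    'sofifa_head': 'sofifa_head',
    'sofifa': 'sofifa_long',
    'countries_continents': 'countries_leagues',
}


def build_paths(base_dir, stems):
    """Return {name: '<base_dir>/<stem>_Clean.pkl'} for every entry in stems."""
    return {name: str(base_dir / f'{stem}_Clean.pkl') for name, stem in stems.items()}


paths = build_paths(BASE_DIR, FILE_STEMS)
print(paths['players'])
```

With such a mapping, each table could be loaded with `pd.read_pickle(paths['players'])` and so on.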

## **Players**

The Players file contains general information about the players, gathered from Transfermarkt.

### **Columns**

*   Indexing is done with DOB, which is a Datetime64 value holding the player's date of birth
*   tm_id -> Unique Transfermarkt Id for each player
*   name -> Name of the player
*   club -> Current club of the player (as of March 2020)
*   nationality -> Citizenship of the player
*   dob -> Player's date of birth
*   height -> Player's height (as of March 2020)
*   sf -> Player's dominant foot
*   field_position -> Player's field position in his current team (as of March 2020), for example RW for right winger
*   joined -> Date of joining the current team (as of March 2020)
*   contract_expires -> Expiration date of the player's latest contract (as of March 2020)
*   followers -> Number of the player's Instagram followers (as of March 2020)
*   sofifa_id -> The player's Id in the Sofifa API (not available for every player)
*   years_left -> Years left on the player's current contract (as of March 2020)
*   age -> Player's age (as of March 2020)
*   current_mv -> Player's current market value (as of March 2020)
*   main_field_position -> Player's main position group (as of March 2020), for example attackers
*   league -> League of the player's current team
*   nationalities -> Player's nationalities (citizenships held)
*   years_at_club -> Years since the player joined his current club
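
Several of these columns (age, years_at_club) are derived from dates relative to the March 2020 snapshot. The sketch below shows one way such values could be computed from `dob` and `joined`. The exact snapshot day (`SNAPSHOT`) and the "completed whole years" rule are assumptions; the stored columns are floats, so the original derivation may well have used fractional years instead. `whole_years_between` is a hypothetical helper.

```python
import pandas as pd

# Assumed snapshot date; the text only says "March 2020".
SNAPSHOT = pd.Timestamp('2020-03-01')

# Toy frame with one row taken from the sample output of this section.
toy = pd.DataFrame({
    'name': ['Manuel Neuer'],
    'dob': pd.to_datetime(['1986-03-27']),
    'joined': pd.to_datetime(['2011-07-01']),
})


def whole_years_between(start, end):
    """Completed years from each date in the Series `start` to the Timestamp `end`."""
    years = end.year - start.dt.year
    # Subtract one year where the anniversary has not been reached yet.
    not_reached = (start.dt.month > end.month) | (
        (start.dt.month == end.month) & (start.dt.day > end.day)
    )
    return years - not_reached.astype(int)


toy['age'] = whole_years_between(toy['dob'], SNAPSHOT)
toy['years_at_club'] = whole_years_between(toy['joined'], SNAPSHOT)
print(toy)
```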

In [0]:
df = pd.read_pickle(players_path)
print(df.head())

Index(['tm_id', 'name', 'club', 'nationality', 'dob', 'height', 'sf',
       'field_position', 'joined', 'contract_expires', 'followers',
       'sofifa_id', 'years_left', 'age', 'current_mv', 'main_field_position',
       'league', 'nationalities', 'continent', 'years_at_club'],
      dtype='object')
             tm_id                  name           club nationality  \
DOB                                                                   
1986-03-27   17259          Manuel Neuer  Bayern Munich     Germany   
1988-08-03   40680          Sven Ulreich  Bayern Munich     Germany   
2000-01-28  336307     Christian Früchtl  Bayern Munich     Germany   
1999-04-04  317444  Ron-Thorben Hoffmann  Bayern Munich     Germany   
1996-02-14  281963       Lucas Hernández  Bayern Munich      France   

                  dob  height     sf field_position     joined  \
DOB                                                              
1986-03-27 1986-03-27   193.0  right             GK 2011-07-01   
1

### **Datatypes of each column**

In [0]:
print(df.dtypes)

tm_id                           int64
name                           object
club                           object
nationality                    object
dob                    datetime64[ns]
height                        float64
sf                           category
field_position               category
joined                 datetime64[ns]
contract_expires       datetime64[ns]
followers                     float64
sofifa_id                     float64
years_left                   category
age                           float64
current_mv                    float64
main_field_position          category
league                         object
nationalities                  object
continent                    category
years_at_club                 float64
dtype: object
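
The dtypes above show sofifa_id stored as float64, which is what pandas does with an integer column that has missing values (the Id is not available for every player). If the column is used as a join key against the FIFA tables, a nullable integer dtype may be more convenient. This is a minimal sketch on toy data; the Id and rating values are made up for illustration and the step is not part of the original pipeline.

```python
import pandas as pd

# Toy data: the sofifa_id and overall values below are invented placeholders.
players = pd.DataFrame({
    'tm_id': [17259, 336307],
    'name': ['Manuel Neuer', 'Christian Früchtl'],
    'sofifa_id': [1001.0, float('nan')],  # float64 because of the missing Id
})
sofifa_head = pd.DataFrame({'sofifa_id': [1001], 'overall': [88]})

# Nullable integer dtype keeps the missing value as <NA> instead of forcing floats.
players['sofifa_id'] = players['sofifa_id'].astype('Int64')
sofifa_head['sofifa_id'] = sofifa_head['sofifa_id'].astype('Int64')

# Left join keeps players without a FIFA match; their FIFA columns stay empty.
merged = players.merge(sofifa_head, on='sofifa_id', how='left')
print(merged)
```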


## **Transfers**

The Transfers file contains the transfer history of players, gathered from Transfermarkt.

### **Columns**

*   Indexing is done with Date, which is a Datetime64 value and shows the date of the transfer
*   tm_id -> Player's Transfermarkt Id
*   from -> Club the player left in the transfer
*   to -> Club the player joined in the transfer
*   fee -> Amount paid for the player's transfer
*   mv -> Player's market value at the time of the transfer
*   season -> Season in which the transfer took place
*   loan -> Boolean value showing whether the transfer was a loan
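
As an illustration of how these columns fit together, the sketch below filters out loans and compares the fee with the market value at the time of the transfer. The rows are copied from the sample output of this section; the `fee_to_mv` column is a hypothetical name introduced only for this example.

```python
import pandas as pd

transfers = pd.DataFrame(
    {
        'tm_id': [17259, 40680, 40680],
        'from': ['FC Schalke 04', 'VfB Stuttgart', 'Schornbach Yth.'],
        'to': ['Bayern Munich', 'Bayern Munich', 'Stuttgart Yth.'],
        'fee': [30000000.0, 3500000.0, 0.0],
        'mv': [28000000.0, 3500000.0, float('nan')],
        'loan': [False, False, False],
    },
    index=pd.DatetimeIndex(['2011-07-01', '2015-07-01', '1998-07-01'], name='Date'),
)

# Keep permanent transfers only. Note that `from` is a Python keyword, so that
# column has to be accessed as transfers['from'], not as an attribute.
permanent = transfers[~transfers['loan']].copy()

# Fee relative to market value; stays NaN where no market value was recorded.
permanent['fee_to_mv'] = permanent['fee'] / permanent['mv']
print(permanent[['tm_id', 'fee', 'mv', 'fee_to_mv']])
```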

In [0]:
df = pd.read_pickle(transfers_path)
print(df.head())

             tm_id             from                to         fee          mv  \
Date                                                                            
2011-07-01   17259    FC Schalke 04     Bayern Munich  30000000.0  28000000.0   
2015-07-01   40680    VfB Stuttgart     Bayern Munich   3500000.0   3500000.0   
1998-07-01   40680  Schornbach Yth.    Stuttgart Yth.         0.0         NaN   
1994-07-01   40680   Lichtenw. Yth.   Schornbach Yth.         0.0         NaN   
2014-07-01  336307  Deggendorf Yth.  FC Bayern Münche         0.0         NaN   

           season   loan  
Date                      
2011-07-01  11/12  False  
2015-07-01  15/16  False  
1998-07-01  98/99  False  
1994-07-01  94/95  False  
2014-07-01  14/15  False  


### **Datatypes of each column**

In [0]:
print(df.dtypes)

tm_id        int64
from        object
to          object
fee        float64
mv         float64
season    category
loan          bool
dtype: object


## **Market Value**

The Market Value file contains the market values of the players of the clubs we used, for seasons 05/06 to 19/20.

### **Columns**

*   Indexing is done with Date, which is a Datetime64 value and shows the end date of the season
*   tm_id -> Player's Transfermarkt Id
*   club -> Player's club during that season
*   league -> Name of the league the club competed in during that season
*   season -> The season the market value refers to
*   mv -> Player's market value in that season
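
Because there is one row per player and season, season-over-season changes in market value can be obtained with a grouped difference. The following is a small sketch on rows copied from the sample output of this section; `mv_change` is a hypothetical column name and this step is not claimed to be part of the original analysis.

```python
import pandas as pd

markval = pd.DataFrame(
    {
        'tm_id': [2421, 532, 2421],
        'season': ['05/06', '05/06', '06/07'],
        'mv': [750000.0, 9500000.0, 1000000.0],
    },
    index=pd.DatetimeIndex(['2006-06-30', '2006-06-30', '2007-06-30'], name='Date'),
)

# Order by season end date, then difference within each player.
markval = markval.sort_index()
markval['mv_change'] = markval.groupby('tm_id')['mv'].diff()
print(markval)
```

The first season of every player has no previous value, so `mv_change` is NaN there.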

In [0]:
df = pd.read_pickle(markval_path)
print(df.head())

            tm_id           club      league season         mv
Date                                                          
2006-06-30   2421  Bayern Munich  Bundesliga  05/06   750000.0
2006-06-30    532  Bayern Munich  Bundesliga  05/06  9500000.0
2006-06-30   2989  Bayern Munich  Bundesliga  05/06  2500000.0
2007-06-30   2421  Bayern Munich  Bundesliga  06/07  1000000.0
2007-06-30  39732  Bayern Munich  Bundesliga  06/07    50000.0


### **Datatypes of each column**

In [0]:
print(df.dtypes)

tm_id        int64
club        object
league      object
season    category
mv         float64
dtype: object


## **Stats**

The Stats file shows, in long format, what each player accomplished during his career. For all players, the attributes are inclusions in the squad, number of appearances, minutes played, points per game (PPG), substitutions on/off, yellow cards, second yellow cards, red cards and own goals. Goalkeepers additionally have the number of clean sheets and goals conceded, while defenders, midfielders and forwards have the number of assists, goals and minutes per goal.

The column attribute is categorical and contains the values from this list:

*   mpg -> minutes per goal
*   s -> squad
*   soff -> substitutions off
*   son -> substitutions on
*   app -> appearances
*   og -> own goals
*   g -> goals
*   rc -> red cards
*   yc -> yellow cards
*   syc -> second yellow cards
*   gc -> goals conceded
*   a -> assists
*   pg -> penalty goals
*   ppg -> points per game
*   mp -> minutes played
*   cs -> clean sheets

### **Columns**

*   Indexing is done with Date, which is a Datetime64 value and shows the end date of the season
*   tm_id -> Player's Transfermarkt Id
*   club -> Player's club during that season
*   season -> The season the statistics refer to
*   competition -> The competition the player's club was competing in during that season
*   attribute -> Name of the attribute
*   value -> Numerical value of the given attribute
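
Since the table is in long format (one row per attribute), it often has to be reshaped to one column per attribute before modelling. The sketch below shows one way to do this with `pivot_table` on toy data. The `s` values come from the sample output of this section, the `app` values are made up, and `stats_long` / `stats_wide` are hypothetical names.

```python
import pandas as pd

stats_long = pd.DataFrame({
    'tm_id': [17259] * 4,
    'club': ['Bayern Munich'] * 4,
    'season': ['18/19'] * 4,
    'competition': ['DFB-Pokal', 'Bundesliga', 'DFB-Pokal', 'Bundesliga'],
    'attribute': ['s', 's', 'app', 'app'],
    'value': [3.0, 26.0, 2.0, 25.0],  # 'app' values are invented for the example
})

# One row per player/club/season/competition, one column per attribute.
stats_wide = stats_long.pivot_table(
    index=['tm_id', 'club', 'season', 'competition'],
    columns='attribute',
    values='value',
    aggfunc='first',
).reset_index()
stats_wide.columns.name = None  # drop the leftover 'attribute' axis label
print(stats_wide)
```

In the real table, season and attribute are categorical, so passing `observed=True` to `pivot_table` may be needed to avoid rows or columns for unused categories; that is an assumption to verify against the data.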

In [0]:
df = pd.read_pickle(stats_path)
print(df.head())

            tm_id           club season       competition attribute  value
Date                                                                      
2019-06-30  17259  Bayern Munich  18/19         DFB-Pokal         s    3.0
2019-06-30  17259  Bayern Munich  18/19        Bundesliga         s   26.0
2019-06-30  17259  Bayern Munich  18/19  Champions League         s    8.0
2019-06-30  17259  Bayern Munich  18/19      DFL-Supercup         s    1.0
2018-06-30  17259  Bayern Munich  17/18         DFB-Pokal         s    1.0


### **Datatypes of each column**

In [0]:
print(df.dtypes)

tm_id             int64
club             object
season         category
competition      object
attribute      category
value           float64
dtype: object


## **National Stats**

The National Stats file contains the same attributes as the Stats file, for the national-team competitions the players have competed in.

### **Columns**

*   Standard indexing (default integer index)
*   tm_id -> Player's Transfermarkt Id
*   national_club -> Player's current national team (main, u-19, u-21, ..)
*   competition -> The competition the player's national team was competing in
*   attribute -> Name of the attribute
*   value -> Numerical value of the given attribute

In [0]:
df = pd.read_pickle(natstats_path)
print(df.head())

   tm_id national_club               competition attribute  value
0  17259       Germany      UEFA Euro qualifying       app     26
1  17259       Germany  International Friendlies       app     22
2  17259       Germany                 World Cup       app     16
3  17259       Germany   World Cup qualification       app     13
4  17259       Germany                      EURO       app     11


### **Datatypes of each column**

In [0]:
print(df.dtypes)

tm_id               int64
national_club      object
competition        object
attribute        category
value               int64
dtype: object


## **Teams/Leagues**

The Teams/Leagues file contains information about the clubs and leagues used in the dataset.

### **Columns**

*   Indexing is done with Date, which is a Datetime64 value and shows the end date of the season
*   club -> Name of the club
*   league -> Name of the league
*   season -> The season the club was in that league
*   country -> Country of the league

In [0]:
df = pd.read_pickle(teams_leagues_path)
print(df.head())

                  club      league season  country
Date                                              
2006-06-30  1. FC Köln  Bundesliga  05/06  Germany
2007-06-30  1. FC Köln  Bundesliga  06/07  Germany
2008-06-30  1. FC Köln  Bundesliga  07/08  Germany
2009-06-30  1. FC Köln  Bundesliga  08/09  Germany
2010-06-30  1. FC Köln  Bundesliga  09/10  Germany


### **Datatypes of each column**

In [0]:
print(df.dtypes)

club       object
league     object
season     object
country    object
dtype: object


## **Sidelines**

The Sidelines file shows the number of and reasons for the injuries and sidelines players had during their careers.

### **Columns**

*   Indexing is done with Start_Date, which is a Datetime64 value and shows the start date of the player's sideline
*   sofifa_id -> Unique Id for each player taken from Sofifa.com
*   reason -> Reason for the player's sideline
*   start_date -> Start date of the sideline
*   end_date -> End date of the sideline
*   duration -> Duration of the sideline in days
*   active -> Boolean value showing whether the player is still sidelined (as of March 2020)
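
The duration column is consistent with the difference between end_date and start_date in days. The sketch below recomputes it on two rows copied from the sample output of this section and sums the days per player. Whether the original pipeline derived the column this way is an assumption; `days_out` is a hypothetical name.

```python
import pandas as pd

sidelined = pd.DataFrame({
    'sofifa_id': [158023, 158023],
    'reason': ['Hamstring', 'Calf Muscle Strain'],
    'start_date': pd.to_datetime(['2019-09-25', '2019-08-05']),
    'end_date': pd.to_datetime(['2019-10-01', '2019-09-16']),
})

# Duration in days as the difference between the two dates.
sidelined['duration'] = (sidelined['end_date'] - sidelined['start_date']).dt.days.astype(float)

# Total days sidelined per player.
days_out = sidelined.groupby('sofifa_id')['duration'].sum()
print(sidelined)
print(days_out)
```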

In [0]:
df = pd.read_pickle(sidelined_path)
print(df.head())

            sofifa_id              reason start_date   end_date  duration  \
Start_Date                                                                  
2019-09-25     158023           Hamstring 2019-09-25 2019-10-01       6.0   
2019-08-05     158023  Calf Muscle Strain 2019-08-05 2019-09-16      42.0   
2018-10-21     158023          Broken Arm 2018-10-21 2018-11-10      20.0   
2017-03-20     158023           Suspended 2017-03-20 2017-04-03      14.0   
2016-11-19     158023               Virus 2016-11-19 2016-11-22       3.0   

            active  
Start_Date          
2019-09-25   False  
2019-08-05   False  
2018-10-21   False  
2017-03-20   False  
2016-11-19   False  


### **Datatypes of each column**

In [0]:
print(df.dtypes)

sofifa_id              int64
reason                object
start_date    datetime64[ns]
end_date      datetime64[ns]
duration             float64
active                  bool
dtype: object


## **Trophies**

The Trophies file contains the trophies each player won during his career.

### **Columns**

*   Indexing is done with Date, which is a Datetime64 value and shows the end date of the season during which the player won the trophy
*   sofifa_id -> Unique Id for each player taken from Sofifa.com
*   Competition -> Name of the competition that the player participated in
*   Trophy -> Type of the trophy (Winner or Runner-up)
*   Season -> Season when the competition took place

In [0]:
df = pd.read_pickle(trophies_path)
print(df.head())

            sofifa_id             Competition     Trophy     Season
Date                                                               
2014-06-30     158023               World Cup  Runner-up       2014
2014-06-30     158023          Copa Catalunya     Winner  2013/2014
2018-06-30     158023  Supercopa de Catalunya     Winner       2018
2007-06-30     158023          UEFA Super Cup  Runner-up  2006/2007
2006-06-30     158023     FIFA Club World Cup  Runner-up       2006


### **Datatypes of each column**

In [0]:
print(df.dtypes)

sofifa_id         int64
Competition      object
Trophy         category
Season         category
dtype: object


## **Sofifa Head**

The Sofifa Head file contains the general information about the players taken from the FIFA games.

### **Columns**

*   Standard indexing (default integer index)
*   sofifa_id -> Unique Id for each player taken from Sofifa.com
*   short_name -> Player's name shown on his jersey
*   long_name -> Player's full name
*   age -> Player's age at the time the game was released
*   dob -> Player's date of birth
*   height_cm -> Player's height in cm at the time the game was released
*   weight_kg -> Player's weight in kg at the time the game was released
*   nationality -> Player's nationality
*   club -> Player's club at the time the game was released
*   main_position -> Position the player mostly plays in
*   player_positions -> Positions the player is trained to play in
*   overall -> Player's rating in the game (out of 100)
*   potential -> Maximum rating the player can develop to (out of 100)*
*   preferred_foot -> Player's strong foot
*   weak_foot -> Rating of the player's non-dominant foot (out of 5)
*   skill_moves -> Rating of the player's skill moves (out of 5)
*   work_rate -> Player's engagement during attack/defence
*   team_position -> Player's position in his current team (at the time the game was released)
*   joined -> Date the player joined his current team (at the time the game was released)
*   contract_valid_until -> Expiration date of the player's current contract (at the time the game was released)
*   game_year -> Year the game was released in

In [0]:
df = pd.read_pickle(sofifa_head_path)
print(df.head())

   sofifa_id         short_name                            long_name  age  \
0     158023           L. Messi       Lionel Andrés Messi Cuccittini   32   
1      20801  Cristiano Ronaldo  Cristiano Ronaldo dos Santos Aveiro   34   
2     190871          Neymar Jr        Neymar da Silva Santos Junior   27   
3     200389           J. Oblak                            Jan Oblak   26   
4     183277          E. Hazard                          Eden Hazard   28   

         dob  height_cm  weight_kg nationality                 club  \
0 1987-06-24        170         72   Argentina         FC Barcelona   
1 1985-02-05        187         83    Portugal             Juventus   
2 1992-02-05        175         68      Brazil  Paris Saint-Germain   
3 1993-01-07        188         87    Slovenia      Atlético Madrid   
4 1991-01-07        175         74     Belgium          Real Madrid   

  main_position player_positions  overall  potential preferred_foot weak_foot  \
0            RW     [RW, CF, 

### **Datatypes of each column**

In [0]:
print(df.dtypes)

sofifa_id                            int64
short_name                          object
long_name                           object
age                                  int64
dob                         datetime64[ns]
height_cm                            int64
weight_kg                            int64
nationality                         object
club                                object
main_position                     category
player_positions                    object
overall                              int64
potential                            int64
preferred_foot                    category
weak_foot                            int64
skill_moves                          int64
work_rate                         category
team_position                     category
joined                      datetime64[ns]
contract_valid_until        datetime64[ns]
international_reputation             int64
release_clause_eur                 float64
wage_eur                             int64
game_year  

\*   Each individual in FIFA career mode has a pre-programmed player potential which determines how fast their attributes should grow and when they should stop. You can think of it as a player’s predicted or peak overall rating.

## **Sofifa Long**

### Columns
*   Indexing is done with game_year, which is a Datetime64 value and is also kept as a regular column
*   sofifa_id -> Unique Id for each player taken from Sofifa.com
*   player_positions -> Positions the player is trained to play in
*   main_position -> Position the player mostly plays in
*   club_position -> Position in which the player was mostly used at his club
*   game_year -> Year the game was released in
*   attribute -> Attributes from FIFA specific to each position (for example, attackers: pace, shooting, ...)
*   value -> Player's skill in that attribute in FIFA, in the 0 - 100 range

A detailed description of the FIFA attributes can be found at https://www.kaggle.com/stefanoleone992/fifa-20-complete-player-dataset

In [0]:
df = pd.read_pickle(sofifa_path)
print(df.head())

            sofifa_id   player_positions main_position club_position  \
game_year                                                              
2015-01-01     158023               [CF]            CF            CF   
2015-01-01      41236               [ST]            ST            ST   
2015-01-01     176580           [ST, CF]            ST           RES   
2015-01-01     167397               [ST]            ST           SUB   
2015-01-01     188350  [LM, RM, ST, CAM]            LM           SUB   

            game_year attribute value  
game_year                              
2015-01-01 2015-01-01      pace    93  
2015-01-01 2015-01-01      pace    76  
2015-01-01 2015-01-01      pace    83  
2015-01-01 2015-01-01      pace    77  
2015-01-01 2015-01-01      pace    91  


### Datatypes of each column

In [0]:
print(df.dtypes)

sofifa_id                    int64
player_positions            object
main_position             category
club_position             category
game_year           datetime64[ns]
attribute                   object
value                       object
dtype: object
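
The dtypes above show that value is stored as object rather than as a number, so arithmetic or aggregation on it would need a conversion first. Below is a minimal sketch of converting it and reshaping to one column per attribute. The toy `pace` values come from the sample output of this section, the `shooting` values are made up, and treating unparseable entries as missing (`errors='coerce'`) is an assumption about what is appropriate for this data.

```python
import pandas as pd

sofifa_long = pd.DataFrame({
    'sofifa_id': [158023, 158023, 41236, 41236],
    'game_year': pd.to_datetime(['2015-01-01'] * 4),
    'attribute': ['pace', 'shooting', 'pace', 'shooting'],
    'value': ['93', '89', '76', None],  # 'shooting' entries are invented
})

# Convert the object column to numbers; anything unparseable becomes NaN.
sofifa_long['value'] = pd.to_numeric(sofifa_long['value'], errors='coerce')

# One row per player and game year, one column per attribute.
wide = (
    sofifa_long
    .set_index(['sofifa_id', 'game_year', 'attribute'])['value']
    .unstack('attribute')
)
print(wide)
```

In the real table game_year is both the index and a column, so it may be necessary to call `reset_index(drop=True)` before grouping or merging on it.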


# Countries Leagues Continents
This dataset stores information about the leagues' countries and their continents. Each row is one league of a country. If a country has three leagues, then it has three rows, where the columns country and continent are the same and league is different. An example is shown below.

## Columns
*  country -> The country
*  league -> The league 
*  continent -> The continent of the country

In [0]:
df = pd.read_pickle(countries_continents_path)
print(df.head())

   country           league continent
0  England   Premier League        EU
1  England     Championship        EU
2  England       League One        EU
3  England       League Two        EU
4  England  National League        EU


## Datatypes of each column

In [0]:
print(df.dtypes)

country      object
league       object
continent    object
dtype: object
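
A lookup table like this is typically used to attach a continent to other tables through the league column. The sketch below shows a left join on toy data. The `clubs` frame and its club names are hypothetical, and joining on league alone assumes that league names are unique across countries, which the `validate` argument checks.

```python
import pandas as pd

# Rows copied from the sample output above.
countries = pd.DataFrame({
    'country': ['England', 'England'],
    'league': ['Premier League', 'Championship'],
    'continent': ['EU', 'EU'],
})

# Hypothetical table that only knows the league of each club.
clubs = pd.DataFrame({
    'club': ['Club A', 'Club B'],
    'league': ['Premier League', 'Unknown League'],
})

# validate='many_to_one' raises if a league name appears more than once
# in the lookup table, which would silently duplicate rows otherwise.
with_continent = clubs.merge(
    countries[['league', 'continent']],
    on='league',
    how='left',
    validate='many_to_one',
)
print(with_continent)
```

If the same league name exists in more than one country, the join key would need to include country as well.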
