# Six Degrees of (FIFA 15-21) Separation

In this project, I would like to explore the networks formed through the world's top football leagues. The data I am using is from this [Kaggle Dataset](https://www.kaggle.com/stefanoleone992/fifa-21-complete-player-dataset) on FIFA 15-21 player level data from the website [sofifa](https://sofifa.com/). 
<br>
This takes inspiration from the famous Six Degrees of Separation theory. If you haven't heard of it, a short summary is that "any two people in the world can be linked through six or fewer mutual connections."
<br>
<br>
I want to see if this theory holds true for the footballing world.
<br>
<br>
<b>My main learning objectives are: </b>
* Understand how many networks (connected components) exist among players in the world's top leagues.
* Identify the average and largest number of degrees of separation between players in the network.
* Identify the player who is the "center" of the footballing world (at least as represented by FIFA).

<br>

<b>(In case you are interested) Domain Knowledge, Data Structures and Algorithms Used </b>
* Union Find Data Structure
* Breadth First Search Graph Traversal
* Elementary Graph Theory
* Understanding <i>The Beautiful Game</i>
* Knowing a lot of players from playing (too much) FIFA over the past few years

<br>

<b>References:</b>
* [Six Degrees of Separation Wikipedia Article](https://en.wikipedia.org/wiki/Six_degrees_of_separation)
* [Six Degrees of Kevin Bacon Wikipedia Article](https://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon)
* [Kaggle FIFA 21 Complete Player Dataset](https://www.kaggle.com/stefanoleone992/fifa-21-complete-player-dataset)
* [sofifa](https://sofifa.com/)


![FIFA 21](https://atalayar.com/sites/default/files/styles/foto_/public/noticias/Atalayar_FIFA%2021.jpg?itok=V4t36zZI)
![](https://miro.medium.com/max/2408/0*8T9WFQKlf147X7FE.jpg)

# I: Imports and Initialization

In [None]:
import numpy as np
import pandas as pd
import os

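filepath = '/kaggle/input/fifa-21-complete-player-dataset/'
frames = []
#sorted() makes the read order deterministic, so players_15.csv is the first CSV loaded
for filename in sorted(os.listdir(filepath)):
    try:
        current_df = pd.read_csv(filepath+filename, encoding='latin-1')
        #take the two characters after 'players_' in the file name as the FIFA year
        current_year = filename.split('players_')[-1][:2]
        current_df['fifa_year'] = current_year
        
        frames.append(current_df)
    except Exception:
        #print the name of any file that could not be read as a CSV
        print(filename)

#DataFrame.append was removed in pandas 2.0, so collect the frames and concatenate once
df = pd.concat(frames).reset_index(drop=True)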

fifa21_df = pd.read_csv('/kaggle/input/fifa-21-complete-player-dataset/players_21.csv')

In [None]:
df.head()

## I (a) Connection Principles - What makes two players connected?

<b>Two players are connected if ... </b><br>
* They were on the same club team in the same year
* They were on the same national team in the same year
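
As a minimal sketch of these two rules, the snippet below groups players by a (team, year) key and treats everyone who shares a key as directly connected. The rows and the names `rows`, `squads` and `connections` are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy rows: (player, team, year). Not real data.
rows = [
    ("A", "Club X", 15),
    ("B", "Club X", 15),
    ("C", "Club X", 16),
    ("A", "Club Y", 16),
    ("C", "Club Y", 16),
]

# Players that share a (team, year) key were teammates in that year.
squads = defaultdict(set)
for player, team, year in rows:
    squads[(team, year)].add(player)

# Every pair of players inside one squad is a direct connection.
connections = set()
for members in squads.values():
    for p1, p2 in combinations(sorted(members), 2):
        connections.add((p1, p2))

print(sorted(connections))
```

Here B and C both played for Club X, but in different years, so they are not directly connected. They are still linked through A, who was a teammate of both.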

## I (b) National Team Connection Principles - National teams are tricky, since not all of them are in FIFA. The principles are as follows: 
* If the NT is in FIFA, count only the players who were called up (those with an NT jersey number for that year)
* If the NT is NOT in FIFA, we will assume the Top 30 players by overall rating for the year were called up. (30 is the preliminary squad shortlist size for major tournaments)
* We will create a <i>nation_call_up</i> column in the dataset to flag called-up players
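
The "Top N by overall rating" rule can also be sketched with a pandas `groupby` followed by `head`. This is only an illustration on a made-up frame: `toy`, `limit` and the ratings are hypothetical, and `limit = 2` stands in for the 30-player shortlist.

```python
import pandas as pd

# Hypothetical toy frame, using the same column names as the dataset.
toy = pd.DataFrame({
    "sofifa_id": [1, 2, 3, 4, 5],
    "nationality": ["N1", "N1", "N1", "N2", "N2"],
    "fifa_year": ["21"] * 5,
    "overall": [70, 80, 75, 60, 65],
})
limit = 2  # stand-in for the 30-player shortlist

# Sort by rating, then keep the first `limit` rows of each (year, nationality) group.
top = (
    toy.sort_values("overall", ascending=False)
       .groupby(["fifa_year", "nationality"])
       .head(limit)
)

# Flag the rows that made the cut; the original index is preserved by head().
toy["nation_call_up"] = toy.index.isin(top.index).astype(int)
print(toy)
```

Because the flag is set per row, a player is only marked as called up in the years where he actually made the top `limit` for his nationality.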

In [None]:
#nationalities that are included in FIFA
fifa_nt = df.loc[~df['nation_jersey_number'].isnull(), 'nationality'].unique()

#for other nationalities, lets assume they are part of nt if they are within top 30 players of each FIFA year
call_up_limit = 30
fifa_yrs = df['fifa_year'].unique()
non_fifa_nts = df.loc[~df['nationality'].isin(fifa_nt), 'nationality'].unique()
nt_call_ups = []
for fy in fifa_yrs:
    for nt in non_fifa_nts:
        current_df = df.loc[(df['fifa_year']==fy) & (df['nationality']==nt)][['sofifa_id', 'overall']].sort_values(
            'overall', ascending=False).reset_index(drop=True)

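        #head() keeps at most call_up_limit rows (.loc slicing is end-inclusive and would keep one extra)
        current_call_ups = list(current_df.head(call_up_limit)['sofifa_id'].values)
        nt_call_ups.extend(current_call_ups)

#called up if in the assumed top 30 (NT not in FIFA) or if the player has a NT jersey number (NT in FIFA)
#both rules go into a single np.where so the second rule does not overwrite the first
df['nation_call_up'] = np.where(df['sofifa_id'].isin(nt_call_ups) | df['nation_jersey_number'].notnull(), 1, 0)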

Creating a utility function that will come in handy to identify players from sofifa_id

In [None]:
def whois(sofifa_id, player_only=False):
    if player_only:
        return df[df['sofifa_id']==sofifa_id]['short_name'].iloc[0]
    else:
        return df[df['sofifa_id']==sofifa_id]

# II. Algorithm to link players

I will be using two different data structures/algorithm approaches to link players, <i>Union Find</i> and <i>Graph Traversal</i>. Both have their advantages, disadvantages and analysis points

## II (a) Connecting Disjoint Sets with Union Find (Approach 1)

This method is more efficient but less interpretable. Union Find only tracks which set (connected component) each player belongs to, so any graph built from it is minimal and will fail to capture all direct connections.
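
To make the idea concrete, here is a minimal Union Find sketch on five made-up players. The names `toy_parent`, `find_root` and `merge` are hypothetical and chosen so they do not clash with the notebook's own `head`, `find` and `union`; the lookup is written iteratively, which is one possible variation rather than the approach used in the cells below.

```python
# Every player starts as the head of its own one-element set.
toy_parent = {p: p for p in ["A", "B", "C", "D", "E"]}

def find_root(x):
    """Return the head of x's set, compressing the path along the way."""
    root = x
    while toy_parent[root] != root:
        root = toy_parent[root]
    # Second pass: point every node on the path directly at the root.
    while toy_parent[x] != root:
        nxt = toy_parent[x]
        toy_parent[x] = root
        x = nxt
    return root

def merge(a, b):
    """Join the sets that contain a and b."""
    ra, rb = find_root(a), find_root(b)
    if ra != rb:
        toy_parent[ra] = rb

merge("A", "B")  # A and B were teammates
merge("B", "C")  # B and C were teammates
merge("D", "E")  # D and E were teammates

n_components = len({find_root(p) for p in toy_parent})
print(n_components, find_root("A") == find_root("C"))
```

The structure can tell us that A and C are in the same component, but not that they are two steps apart through B. That distance information is what the graph traversal approach adds.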

Union Find Data Structures and Helper Functions

In [None]:
#Union Find approach data structures and functions
head = {}
for player in df['sofifa_id'].unique():
    head[player] = player
       
def find(x):
    if head[x]==x:
        return x
    else:
        head[x]=find(head[x])
        return head[x]

def union(head, x1, x2):
    #we already pass x1, x2 as head of set
    if x2!=x1:
        head[x1]=x2

In [None]:
club_edge = {}
nt_edge = {}

for row in df.itertuples():
    player = getattr(row, 'sofifa_id')
    fifa_year = str(getattr(row, 'fifa_year'))
    club = str(getattr(row, 'club_name'))
    nt = getattr(row, 'nationality')
    
    has_nt = True
    has_club = True
    
    #is the person not called up for NT AND whose NT is in FIFA
    if getattr(row, 'nation_call_up')==0:
        has_nt=False
    
    if str(club)=='nan':
        has_club=False
    
    #add year to each link
    nt_yr = nt + fifa_year
    club_yr = club + fifa_year
    
    #national team
    if has_nt and nt_yr in nt_edge.keys():
        union(head, find(player), find(nt_edge[nt_yr]))
    elif has_nt:
        nt_edge[nt_yr] = player
    
    #club
    if has_club and club_yr in club_edge.keys():
        union(head, find(player), find(club_edge[club_yr]))
    elif has_club:
        club_edge[club_yr] = player

#Final run to link all players back to head of disjoint set
for player in head.keys():
    head[player] = find(head[player])

## II (b) Graph Traversal with Breadth First Search


For the Graph Traversal based approach, we will create a graph with all DIRECT connections (same club, NT in a year)
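
As a small illustration of how BFS turns direct connections into degrees of separation, the sketch below runs on a made-up adjacency list. `toy_graph` and `bfs_distances` are hypothetical names used only for this example.

```python
from collections import deque

# Hypothetical adjacency list: each player maps to their direct connections.
toy_graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B"],
    "E": [],  # no connections, so E is its own component
}

def bfs_distances(graph, start):
    """Return the number of hops from start to every reachable player."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in graph[current]:
            if neighbour not in dist:
                # First visit is via a shortest path, because BFS explores level by level.
                dist[neighbour] = dist[current] + 1
                queue.append(neighbour)
    return dist

toy_distances = bfs_distances(toy_graph, "A")
print(toy_distances)
```

Players that never appear in the result (E here) are not reachable from the starting player, which is how separate connected components show up.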

In [None]:
#Graph Approach data structures
g = {}
for i in df['sofifa_id'].unique():
    g[i] = []

I had to split this into two parts - a.) Building list of players that shared a Club,NT and FIFA year combination and b.) Connecting players to direct connections based on Club, NT

In [None]:
#Note down all edges for each Club,NT and FIFA year combination
#I could have placed this in the same loop as Union Find but wanted to have separation of concerns for easy debugging
nt_edge = {}
club_edge = {}
for row in df.itertuples():
    player = getattr(row, 'sofifa_id')
    fifa_year = str(getattr(row, 'fifa_year'))
    club = str(getattr(row, 'club_name'))
    nt = getattr(row, 'nationality')
    
    has_nt = True
    has_club = True
    
    #is the person not called up for NT AND whose NT is in FIFA
    if getattr(row, 'nation_call_up')==0:
        has_nt=False
    
    if str(club)=='nan':
        has_club=False
    
    #add year to each link
    nt_yr = nt + fifa_year
    club_yr = club + fifa_year
        
    #national team
    if has_nt and nt_yr in nt_edge.keys():
        nt_edge[nt_yr].extend([player])
    elif has_nt:
        nt_edge[nt_yr] = [player]
    
    #club
    if has_club and club_yr in club_edge.keys():
        club_edge[club_yr].extend([player])
    elif has_club:
        club_edge[club_yr] = [player]

#Build full graph by connecting players to direct connections based on Club, NT
for row in df.itertuples():
    player = getattr(row, 'sofifa_id')
    fifa_year = str(getattr(row, 'fifa_year'))
    club = str(getattr(row, 'club_name'))
    nt = getattr(row, 'nationality')
    
    has_nt = True
    has_club = True
    
    #is the person not called up for NT AND whose NT is in FIFA
    if getattr(row, 'nation_call_up')==0:
        has_nt=False
    
    if str(club)=='nan':
        has_club=False
    
    #add year to each link
    nt_yr = nt + fifa_year
    club_yr = club + fifa_year
    
    #national team
    if has_nt:
        g[player].extend(nt_edge[nt_yr])
    
    #club
    if has_club:
        g[player].extend(club_edge[club_yr])

Running BFS algorithm

In [None]:
#BFS over the graph built above, which lists all DIRECT connections
visited = {}
distance = {}
parent = {}
INF = df['sofifa_id'].nunique()
def initialize_search():
    for i in df['sofifa_id'].unique():
        visited[i] = False
        distance[i] = INF
        parent[i] = i


#We use Breadth First Search (BFS) instead of Depth First Search (DFS) so we do not have to work around Python's default recursion limit of 1000
from collections import deque

def bfs(v):
    visited[v] = True
    distance[v] = 0
    #deque pops from the left in O(1); list.pop(0) is O(n)
    q = deque([v])
    
    while q:
        current = q.popleft()
        for node in g[current]:
            if not visited[node]:
                q.append(node)
                visited[node] = True
                distance[node] = distance[current]+1
                parent[node] = current


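#Traverse from the first player in df (Lionel Messi, if the FIFA 15 file is read first, since it is sorted by overall)
#Each BFS started from an unvisited player discovers one new connected component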
cnt_cc=0
initialize_search()
for player in df['sofifa_id'].unique():
    if not visited[player]:
        cnt_cc+=1
        bfs(player)

# III. Analysis and Conclusions

## III(a) Topline Analysis of Results based on FIFA 21 (latest) dataset
### * What is the size of the biggest group of players connected by either i.) club or ii.) national team from FIFA 15-21? 
### * How many groups are there? 
### * What is the average distance that separates them from the first player? 

Check that the BFS and Union Find approaches agree by comparing their number of connected components

In [None]:
print(f"BFS connected components: {cnt_cc}; Union Find connected components: {pd.Series(head.values()).nunique()}")

In [None]:
#Link group number to FIFA 21 dataset
group_number = {}
cnt = 1
for i in pd.Series(head.values()).unique():
    group_number[i] = cnt
    cnt+=1

fifa21_df['group'] = fifa21_df['sofifa_id'].apply(lambda x: group_number[head[x]])
fifa21_df['distance'] = fifa21_df['sofifa_id'].apply(lambda x: distance[x])

In [None]:
fifa21_df.groupby(['group']).agg({'sofifa_id':'count', 'distance':'mean'}).reset_index().sort_values(
    'sofifa_id', ascending=False)

In [None]:
len(fifa21_df[fifa21_df['group']==1])/len(fifa21_df)

## Conclusions:
### * The biggest connected component (Group 1) actually contains 18,804 (99.26%) of the players
### * The average distance from Messi to the players in this group is 2.9 connections
### * The maximum distance from Messi to anyone in the group is just 5!

<br>

### The Six Degrees of Separation Theory holds true for the footballing world!

## III (b) Who is the most <i>well-connected</i> football player in FIFA 15-21?

In [None]:
#Find the vertex (player) with the highest degree in the graph
degree = {}
max_player = ''
max_connections = 0
for player in g.keys():
    degree[player] = pd.Series(g[player]).nunique()-1
    if degree[player]>max_connections:
        max_connections = degree[player]
        max_player=player

print(whois(max_player,player_only=True), max_connections)

### Conclusion: Our most connected player is <b>Allan Nyom</b> with <b> 228 direct connections!</b>

This comes as no surprise given his connections across multiple top leagues with Udinese, Granada, West Brom, Watford and the Cameroon NT (many of whose players are in top European leagues).

![Allan Nyom](https://upload.wikimedia.org/wikipedia/commons/thumb/7/7c/Allan_Nyom_2019.jpg/220px-Allan_Nyom_2019.jpg) <br>

## III (c) Who is at the center of the Football Universe? (at least the one modelled by FIFA 15-21)

<br>

#### This is outside of my current knowledge, so I referred to the original Six Degrees of Separation graph research and its variations for the algorithms they used, and implemented them


[Six Degrees of Separation Wikipedia Article](https://en.wikipedia.org/wiki/Six_degrees_of_separation)
<br>
[Six Degrees of Kevin Bacon Wikipedia Article](https://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon)
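
One way to read "center" is the player with the smallest average distance to everyone else in the same connected component. The sketch below shows that idea on a made-up, fully connected network; `toy_network`, `distances_from` and `average_distance` are hypothetical names, and the average includes the player's own distance of 0.

```python
from collections import deque

# Hypothetical connected network: A - B - C, and B - D - E.
toy_network = {
    "A": ["B"],
    "B": ["A", "C", "D"],
    "C": ["B"],
    "D": ["B", "E"],
    "E": ["D"],
}

def distances_from(graph, start):
    """BFS distances from start to every reachable player."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in graph[current]:
            if neighbour not in dist:
                dist[neighbour] = dist[current] + 1
                queue.append(neighbour)
    return dist

def average_distance(graph, start):
    """Mean distance from start to all players (assumes the graph is connected)."""
    dist = distances_from(graph, start)
    return sum(dist.values()) / len(graph)

# The most central player minimises the average distance.
toy_center = min(toy_network, key=lambda p: average_distance(toy_network, p))
print(toy_center, average_distance(toy_network, toy_center))
```

Running one BFS per candidate gets expensive on the full graph, which is a reason to restrict the search to a shortlist of highly connected players.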

In [None]:
highly_connected_players = pd.DataFrame(degree.items()).sort_values(1, ascending=False)[:30][0].values #Limit to Top 30 highly connected players

df_search = df[(df['sofifa_id'].apply(lambda x: group_number[head[x]])==1)].reset_index(drop=True)
df_search21 = fifa21_df[(fifa21_df['sofifa_id'].apply(lambda x: group_number[head[x]])==1)].reset_index(drop=True)
n_players = df_search['sofifa_id'].nunique()
avg_dist = {}


min_avg_dist = 10 #arbitrary number greater than 6
central_player = ""
max_dist = 0
for player in highly_connected_players:
    initialize_search()
    bfs(player)
    
    cdist = 0
    cmax_dist = 0
    for p1 in df_search['sofifa_id'].unique():
        cdist+=(distance[p1]/n_players)  
        #track the farthest player from the current candidate only
        cmax_dist = max(cmax_dist, distance[p1])
    
    avg_dist[player] = cdist
    if avg_dist[player]<min_avg_dist:
        min_avg_dist = avg_dist[player]
        central_player = player
        #keep the maximum distance that belongs to the best candidate so far
        max_dist = cmax_dist
    

print(f"Most central player is {whois(central_player, player_only=True)} with average distance: {min_avg_dist} and maximum distance to any player in his network of: {max_dist}")

### Conclusion: The <i>Center of the Footballing Universe</i> is <b>Allan Nyom</b> with <b> average distance of 2.87</b> and <b> maximum distance to any player of 6</b> across FIFA 15-21

<i>Note: This may be overstated compared to looking at FIFA 21 only as there will be many players who have already retired, whom Allan Nyom would have been connected to if earlier generations of FIFA were included. </i>

<br>

<i>Note 2: For FIFA 21 players only, the most central player is also Allan Nyom, with an average distance of 2.6</i>

# Let's end with a celebration and a great song by The Script. Thanks for Reading!

![](https://static01.nyt.com/images/2014/06/29/sports/29colombiacup2/29colombiacup2-superJumbo.jpg?quality=90&auto=webp)


![The Script - Six Degrees of Separation](http://25.media.tumblr.com/tumblr_me6mqdyMHb1rz1sbfo1_500.jpg)