# Player Data Analysis from Transfermarkt
This notebook builds a graph of players and their most frequent teammates using data from Transfermarkt.
The aim is to analyze relationships between players with graph analysis techniques, such as how many connections each player has and the shortest path between two players.

## 1. Importing Necessary Libraries and Setting Constants

In this section, we import the libraries required for:
- **Data manipulation** (`pandas`)
- **Graph construction and analysis** (`networkx`)
- **Interactive visualization** (`plotly`)
- **Web scraping** (`selenium` and `BeautifulSoup`)
- **Utility Libraries** (`time` and `re`)

The constants control how much data is gathered during scraping:
- **NUM_PLAYERS**: how many players to take from the top of the list of the most valuable players in the world.
- **NUM_TEAMMATES**: the number of teammates collected for each player.

In [4]:
# 1. Data Manipulation
import pandas as pd  # For handling and analyzing structured data

# 2. Graph Construction
import networkx as nx  # For creating and analyzing network graphs

# 3. Data Visualization
import plotly.graph_objs as go  # For interactive graph visualization

# 4. Web Scraping
from selenium import webdriver  # Automates browser interaction
from selenium.webdriver.chrome.service import Service  # For managing the Chrome service
from bs4 import BeautifulSoup  # For parsing HTML data

# 5. Utility Libraries
import time  # For adding delays in scraping
import re  # For regular expressions, used to clean and extract data


In [5]:
NUM_PLAYERS = 75
NUM_TEAMMATES = 5

## 2. Setting Up WebDriver Service

In this section, we set up the Chrome WebDriver using Selenium. This allows us to interact with web pages for scraping data from Transfermarkt.
Steps:
1. **Configure Chrome Options**: Set any preferences such as headless mode if needed.
2. **Specify the WebDriver Path**: Provide the path to the Chrome WebDriver executable.
3. **Initialize the WebDriver**: Launch Chrome with the configured options.

In [6]:
# 2. Setting up the WebDriver

# Configuring Chrome options (e.g., headless mode can be set here if needed)
chrome_options = webdriver.chrome.options.Options()

# Path to the Chrome WebDriver executable
chrome_driver = "C:\\Users\\Lisandra\\Documents\\webdriver\\chromedriver-win64\\chromedriver.exe"

# Initializing the WebDriver service with the executable path
service_to_pass = Service(executable_path=chrome_driver)

# Launching the WebDriver with the service and options
wd = webdriver.Chrome(service=service_to_pass, options=chrome_options)


## 3. Defining Functions Used in the Notebook

This section defines utility functions that are used to scrape player data and process the information. These functions are modular, making it easier to manage different components of the scraping and analysis workflow.

### Function: `scrape_page()`
- **Purpose**: Extract player information (name and player profile link) from a page containing a list of the most valuable players.
- **Args**: 
  - `html` (str): The raw HTML source of the web page to be scraped.
- **Returns**: 
  - None (Modifies the global `players_data` list by appending player details).

### Function: `scrape_teammates()`
- **Purpose**: Extract information about a player's most frequently played teammates, including their names and profile links, from a web page.
- **Args**:
  - `html` (str): The HTML source of the web page that contains teammate information.
- **Returns**:
  - `teamates_vet` (DataFrame): A pandas DataFrame containing the extracted teammate names and links.

### Function: `change_page()`
- **Purpose**: Extract the URL for the next page in a paginated list of players or teammates.
- **Args**:
  - `html` (str): The HTML source of the current page.
- **Returns**:
  - `link` (str): The URL of the next page to be scraped.

### Function: `name_to_url()`
- **Purpose**: Converts a player's name into a formatted URL that can be used to access their profile on Transfermarkt.
- **Args**:
  - `player_name` (str): The name of the player to be converted into a URL.
- **Returns**:
  - `url` (str): The formatted Transfermarkt URL for the player's profile.

### Function: `extract_player_name()`
- **Purpose**: Extracts a player's name from a Transfermarkt profile URL and converts it into a readable format.
- **Args**:
  - `url` (str): The URL from which the player's name will be extracted.
- **Returns**:
  - `player_name` (str): The formatted player name, with hyphens replaced by spaces and proper capitalization.
  - If no match is found, the original URL is returned.
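
A minimal sketch of how `name_to_url()` and `extract_player_name()` round-trip, assuming profile URLs of the form `https://www.transfermarkt.com/<slug>/profil/spieler/<id>`. The `strip_accents` helper is hypothetical and not part of this notebook; it illustrates one possible way to handle names such as "Kylian Mbappé", whose slug in the scraped URL is `kylian-mbappe`. The numeric ID is a placeholder.

```python
import re
import unicodedata


def strip_accents(text):
    """Hypothetical helper: drop combining marks so 'Mbappé' becomes 'Mbappe'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def name_to_url(player_name):
    # Same logic as the notebook's function, plus accent stripping (an assumption)
    slug = strip_accents(player_name).lower().replace(" ", "-")
    return f"https://www.transfermarkt.com/{slug}/profil/spieler/"


def extract_player_name(url):
    # Same logic as the notebook's function
    match = re.search(r"transfermarkt\.com/([^/]+)/", url)
    return match.group(1).replace("-", " ").title() if match else url


# The numeric ID below is a placeholder, not a real Transfermarkt ID
profile = "https://www.transfermarkt.com/kylian-mbappe/profil/spieler/0"
prefix = name_to_url("Kylian Mbappé")
print(prefix in profile)             # True: the prefix matches the stored URL
print(extract_player_name(profile))  # Kylian Mbappe
```

Note that the name recovered from the URL has no accents, because the slug itself does not carry them.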

### Function: `teamates_page()`
- **Purpose**: Modifies a player's profile URL to navigate to the page listing the player's most frequently played teammates.
- **Args**:
  - `df_row` (Series): A row from the DataFrame that contains player data, including their profile URL.
- **Returns**:
  - `url` (str): The modified URL pointing to the player's teammates page.
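
A small sketch of the URL rewrite, using a plain dictionary as a stand-in for the DataFrame row. The `teammates_url` variant is hypothetical: it rewrites only the `/profil/spieler/` path segment, which would guard against a player slug that happens to contain the text "profil". The numeric ID is a placeholder.

```python
def teamates_page(df_row):
    # Notebook version: replaces every occurrence of "profil" in the URL
    return df_row["Player Page"].replace("profil", "gemeinsameSpiele")


def teammates_url(profile_url):
    # Hypothetical stricter variant: only the path segment is rewritten, once
    return profile_url.replace("/profil/spieler/", "/gemeinsameSpiele/spieler/", 1)


# A dict works here because, like a pandas Series, it supports row["Player Page"]
row = {
    "Name": "Jude Bellingham",
    "Player Page": "https://www.transfermarkt.com/jude-bellingham/profil/spieler/0",  # placeholder ID
}

print(teamates_page(row))
print(teammates_url(row["Player Page"]))
```

For this input both calls produce the same teammates-page URL.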

### Function: `check_players_in_graph()`
- **Purpose**: Checks if two players are present as nodes in the graph.
- **Args**:
  - `graph` (NetworkX Graph): The graph where player nodes are stored.
  - `player1` (str): The name of the first player to check.
  - `player2` (str): The name of the second player to check.
- **Returns**:
  - `result` (bool): `True` if both players are found in the graph, otherwise `False`.
- **Additional Behavior**:
  - If only one player is found, a message is printed indicating which player is missing.
  - If neither player is found, a message is printed indicating that both are missing.
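
The lookup behind this check can be sketched with a plain list standing in for `G.nodes`. Nodes are profile URLs, so a player is considered present when the URL prefix built by `name_to_url()` is a substring of some node. The `find_node` helper is hypothetical, and the numeric IDs are placeholders.

```python
def name_to_url(player_name):
    # Same logic as the notebook's function
    slug = player_name.lower().replace(" ", "-")
    return f"https://www.transfermarkt.com/{slug}/profil/spieler/"


def find_node(nodes, player_name):
    """Hypothetical helper: return the first node URL matching the player, or None."""
    prefix = name_to_url(player_name)
    return next((node for node in nodes if prefix in node), None)


# Stand-in for G.nodes; the numeric IDs are placeholders
nodes = [
    "https://www.transfermarkt.com/jude-bellingham/profil/spieler/1",
    "https://www.transfermarkt.com/phil-foden/profil/spieler/2",
]

print(find_node(nodes, "Phil Foden"))  # the second URL
print(find_node(nodes, "Marco Reus"))  # None
```

Returning the matching node rather than a boolean would let the same lookup serve both the existence check and the later path search.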

### Function: `add_to_graph()`
- **Purpose**: Adds a player and their teammates to a NetworkX graph (`G`), creating nodes for each player and adding edges to represent their connections.
- **Args**:
  - `player` (dict): A dictionary containing information about the player to be added as a node.
  - `teammates` (DataFrame): A DataFrame containing teammate data to be added as nodes and connected to the player.
- **Returns**:
  - None (modifies the global graph `G` in place).
- **Behavior**:
  - Adds a node for the player if it doesn’t already exist.
  - Adds nodes for each teammate.
  - Adds an edge between the player and each teammate, ensuring no duplicate edges.
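
The behavior above can be illustrated with a plain-Python stand-in for the undirected graph: a dictionary mapping each node to a set of neighbours. This is only a sketch of the idea, not how NetworkX stores its data; the short node names are placeholders for the profile URLs used in the notebook.

```python
from collections import defaultdict

adjacency = defaultdict(set)  # node -> set of neighbouring nodes


def add_edge(u, v):
    # Storing the edge in both directions makes it undirected;
    # sets ignore repeated insertions, so duplicates cannot occur
    adjacency[u].add(v)
    adjacency[v].add(u)


add_edge("bellingham", "reus")
add_edge("reus", "bellingham")  # same edge, opposite order
add_edge("bellingham", "hummels")

edge_count = sum(len(neighbours) for neighbours in adjacency.values()) // 2
print(edge_count)                       # 2
print(sorted(adjacency["bellingham"]))  # ['hummels', 'reus']
```

In an undirected `nx.Graph`, `has_edge(u, v)` and `has_edge(v, u)` are equivalent, and adding an existing edge again does not create a duplicate.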

In [7]:
# 3.1 Scrape Player Information from a Web Page

def scrape_page(html):
    """
    Extracts player information from a web page of the most valuable players. 
    The information of each player is added to the players_data list.

    Args:
        html (str): The page source that will be scraped.

    Returns:
        None: Appends player information to the global players_data list.
    """
    soup = BeautifulSoup(html, 'html.parser')
    
    # Find all player rows in the table (rows with 'odd' and 'even' class)
    player_rows = soup.find_all('tr', class_=['odd', 'even'])
    
    # Extract player information from each row
    for player in player_rows:
        name = player.find_all('td')[1].find('img').get('alt')  # Player's name
        player_link = 'https://www.transfermarkt.com' + player.find('td', class_='hauptlink').find('a')['href']  # Player page link
        
        # Add player data to the global list
        players_data.append({
            'Name': name,
            'Player Page': player_link,
        })


In [8]:
# 3.2 Scrape Teammate Information from a Player's Page

def scrape_teammates(html):
    """
    Extracts teammate information from the page listing a player's most frequent teammates.
    Each teammate is also appended to the global players_data list.

    Args:
        html (str): The page source that will be scraped.

    Returns:
        teamates_vet (DataFrame): A DataFrame of all the teammates on the page.
    """
    soup = BeautifulSoup(html, 'html.parser')
    
    # Find the table containing teammate data
    teammate_table = soup.find('table', {'class': 'items'})
    teamates_vet = []  # List to store teammate data
    count_runs = 0  # Counter to handle iterations over the rows

    for tm in teammate_table.find_all('tr')[1:]:  # Skip the header row
        if count_runs % 3 == 0:  # Process every third row
            try:
                # Extract teammate name and profile link
                teammate_name = tm.find('img').get('alt')
                player_link = 'https://www.transfermarkt.com' + tm.find('td', class_='hauptlink').find('a')['href']
                
                # Add teammate data to global players_data and local teamates_vet
                players_data.append({
                    'Name': teammate_name,
                    'Player Page': player_link,
                })
                teamates_vet.append({
                    'Name': teammate_name,
                    'Player Page': player_link,
                })
            except (AttributeError, TypeError, KeyError):
                # Rows missing the expected image or link are reported and skipped
                print("There was a problem gathering data from the row below")
                print(tm)
        
        count_runs += 1  # Increment the counter

    # Return the collected teammate data as a DataFrame
    return pd.DataFrame(teamates_vet)


In [9]:
# 3.3 Extract the Link to the Next Page in Pagination

def change_page(html):
    """
    Extracts the URL of the next page from a paginated web page.

    Args:
        html (str): The HTML source of the current page.

    Returns:
        link (str): The URL of the next page to be scraped.
    """
    soup = BeautifulSoup(html, 'html.parser')

    # Find the 'Next Page' button and extract the link
    next_button = soup.find('li', class_='tm-pagination__list-item tm-pagination__list-item--icon-next-page')
    link = 'https://www.transfermarkt.com' + next_button.find('a')['href']
    
    return link

In [10]:
# 3.4 Convert a Player's Name into a Transfermarkt Profile URL

def name_to_url(player_name):
    """
    Converts a player's name into a URL for their Transfermarkt profile.
    
    Args:
        player_name (str): The name of the player to be converted into a URL.

    Returns:
        url (str): The formatted Transfermarkt URL for the player's profile.
    """
    # Convert the name to lowercase and replace spaces with hyphens
    formatted_name = player_name.lower().replace(" ", "-")
    
    # Construct the URL
    url = f"https://www.transfermarkt.com/{formatted_name}/profil/spieler/"
    
    return url

In [11]:
# 3.5 Extract Player Name from a Transfermarkt Profile URL

def extract_player_name(url):
    """
    Extracts the player's name from a Transfermarkt profile URL and formats it.
    
    Args:
        url (str): The URL from which the player's name will be extracted.

    Returns:
        player_name (str): The player's name with hyphens replaced by spaces and proper capitalization.
        If no match is found, the original URL is returned.
    """
    # Use regex to find the player name part of the URL
    match = re.search(r'transfermarkt\.com/([^/]+)/', url)
    
    if match:
        # Extract the name and replace hyphens with spaces
        player_name = match.group(1).replace("-", " ")
        
        # Capitalize the first letter of each word (e.g., 'john-doe' becomes 'John Doe')
        player_name = player_name.title()
        return player_name
    
    # Return the original URL if no match is found
    return url


In [12]:
# 3.6 Convert Player Profile URL to Teammates Page URL

def teamates_page(df_row):
    """
    Modifies a player's profile URL to navigate to the page listing their most frequently played teammates.

    Args:
        df_row (Series): A row from the DataFrame containing the player's profile URL.

    Returns:
        url (str): The modified URL pointing to the player's teammates page.
    """
    # Replace "profil" with "gemeinsameSpiele" in the player's profile URL to get the teammates page URL
    return df_row['Player Page'].replace("profil", "gemeinsameSpiele")

In [13]:
# 3.7 Check if Two Players Exist in the Graph

def check_players_in_graph(graph, player1, player2):
    """
    Checks if two players (player1 and player2) are present as nodes in the graph.

    Args:
        graph (NetworkX Graph): The graph where player nodes are stored.
        player1 (str): The name of the first player to check.
        player2 (str): The name of the second player to check.

    Returns:
        bool: True if both players are found in the graph, otherwise False.
        
    Additional Behavior:
        - Prints a message if only one player is found or if both are missing.
    """
    # Check if player1 and player2 URLs are present as nodes in the graph
    player1_exists = any(name_to_url(player1) in node for node in graph.nodes)
    player2_exists = any(name_to_url(player2) in node for node in graph.nodes)
    
    # Return the result and print relevant messages
    if player1_exists and player2_exists:
        return True
    elif player1_exists:
        print(f"Player '{player1}' exists, but player '{player2}' does not exist in the graph.")
        return False
    elif player2_exists:
        print(f"Player '{player2}' exists, but player '{player1}' does not exist in the graph.")
        return False
    print(f"Neither player '{player1}' nor player '{player2}' exists in the graph.")
    return False


In [14]:
# 3.8 Add a Player and Their Teammates to the Graph

def add_to_graph(player, teammates):
    """
    Adds a player and their teammates to a NetworkX graph (G), creating nodes and edges.

    Args:
        player (dict): A dictionary containing player data to be added as a node.
        teammates (DataFrame): A DataFrame containing teammate data to be added as nodes and connected to the player.

    Returns:
        None: Modifies the global graph G in place.
    """
    # Add the player as a node if not already in the graph
    if player['Player Page'] not in G:
        G.add_node(player['Player Page'], data=player)

    # Iterate through the teammates DataFrame
    for index, row in teammates.iterrows():
        # Add the teammate as a node if not already in the graph
        if row['Player Page'] not in G:
            G.add_node(row['Player Page'], data=row)
        
        # Add an edge between the player and teammate if no existing edge is found
        if not G.has_edge(player['Player Page'], row['Player Page']) and not G.has_edge(row['Player Page'], player['Player Page']):
            G.add_edge(player['Player Page'], row['Player Page'])

## 4. Accessing and Scraping the Most Valuable Players Webpage
This section collects data from the Transfermarkt webpage that lists the most valuable players. The data includes player names and profile URLs. We loop through the paginated list until `NUM_PLAYERS` players have been gathered or no more pages are available.
Furthermore, a column with the URL of each player's teammates page is added to the resulting DataFrame.
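
The stopping logic of the pagination loop can be sketched without a browser. `FAKE_PAGES` and `collect_players` are hypothetical stand-ins for the website and for the combination of `scrape_page()` and `change_page()`; the loop ends either when enough players are collected or when there is no next page.

```python
NUM_PLAYERS = 5  # small value for illustration

# Hypothetical stand-in for the website: page -> (players on page, next page or None)
FAKE_PAGES = {
    "page1": (["A", "B", "C"], "page2"),
    "page2": (["D", "E", "F"], "page3"),
    "page3": (["G"], None),
}


def collect_players(start_page, limit):
    players = []
    page = start_page
    while page is not None and len(players) < limit:
        names, next_page = FAKE_PAGES[page]
        players.extend(names)
        page = next_page  # None ends the loop when there are no more pages
    return players[:limit]  # trim the overshoot from the last page


print(collect_players("page1", NUM_PLAYERS))  # ['A', 'B', 'C', 'D', 'E']
print(collect_players("page1", 100))          # all seven names
```

Because whole pages are scraped at a time, the collected list can exceed the limit, which is why the result is sliced at the end.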

In [15]:
# Define the URL for the most valuable players on Transfermarkt
url = 'https://www.transfermarkt.com/spieler-statistik/wertvollstespieler/marktwertetop?land_id=0&ausrichtung=alle&spielerposition_id=alle&altersklasse=alle&jahrgang=0&kontinent_id=0&plus=1'
# Open the webpage using the Selenium webdriver
wd.get(url)

In [16]:
# Initialize an empty list to store player data
players_data = []
# Pagination: continue scraping until NUM_PLAYERS players are gathered or there are no more pages
players_count = 0

while players_count < NUM_PLAYERS:
    time.sleep(2)  # Wait for the page to fully load
    
    # Scrape data from the current page
    try:
        scrape_page(wd.page_source)  # Scrape player data from the current page
        players_count = len(players_data)  # Update the player count
        
        # Move to the next page using the 'change_page' function
        url = change_page(wd.page_source)
        wd.get(url)
    except Exception:  # e.g. AttributeError from change_page when there is no next-page button
        print('There was a problem accessing the data or no more players were found')
        break

In [17]:
# Convert the players data list into a DataFrame for future analysis
df_players = pd.DataFrame.from_dict(players_data[:NUM_PLAYERS])
df_players

Unnamed: 0,Name,Player Page
0,Jude Bellingham,https://www.transfermarkt.com/jude-bellingham/...
1,Erling Haaland,https://www.transfermarkt.com/erling-haaland/p...
2,Vinicius Junior,https://www.transfermarkt.com/vinicius-junior/...
3,Kylian Mbappé,https://www.transfermarkt.com/kylian-mbappe/pr...
4,Phil Foden,https://www.transfermarkt.com/phil-foden/profi...
...,...,...
70,Dani Olmo,https://www.transfermarkt.com/dani-olmo/profil...
71,Ousmane Dembélé,https://www.transfermarkt.com/ousmane-dembele/...
72,Marcus Rashford,https://www.transfermarkt.com/marcus-rashford/...
73,Jack Grealish,https://www.transfermarkt.com/jack-grealish/pr...


In [18]:
# Add a new column to the DataFrame with URLs to each player's teammates page
df_players['Teammates Page'] = df_players.apply(teamates_page, axis=1)

## 5. Accessing Teammate Data and Building a Graph
In this section, we access each player's "Teammates Page" to scrape data on their most frequent teammates. For each player, we scrape up to `NUM_TEAMMATES` teammates and build a graph in which players and their teammates are nodes and their relationships are edges. The graph lets us analyze the relationships between players based on their shared game time.

We use pagination to ensure we gather enough teammates and then add both players and teammates to the graph.
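
One way to keep the total at exactly `NUM_TEAMMATES` when teammates span several pages is to take only the remaining quota from each page. The sketch below uses plain lists; the split into pages and the `"Teammate 6"` entry are hypothetical.

```python
NUM_TEAMMATES = 5

# Hypothetical pages of teammate names, in the order they would be scraped
pages = [["Hummels", "Reus", "Brandt"], ["Guerreiro", "Can", "Teammate 6"]]

teammates = []
for page_teammates in pages:
    remaining = NUM_TEAMMATES - len(teammates)
    if remaining <= 0:
        break  # quota reached, no need to visit further pages
    teammates.extend(page_teammates[:remaining])

print(teammates)  # ['Hummels', 'Reus', 'Brandt', 'Guerreiro', 'Can']
```

Slicing each page with the fixed `NUM_TEAMMATES` instead of the remaining quota could collect more teammates than intended when the first page is short.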

In [19]:
# Initialize an empty graph to store players and their relationships
G = nx.Graph()

# Iterate through each player in the DataFrame
for index, row in df_players.iterrows():
    url = row['Teammates Page']
    wd.get(url)  # Open the player's teammates page
    row = row.drop(labels='Teammates Page')  # Drop the teammates page URL from the current row for cleaner data
    
    # Initialize variables to store teammate data
    teammate_count = 0
    teammates = pd.DataFrame(columns=['Name', 'Player Page'])  # Empty DataFrame to store teammates
    
    # Loop to gather up to NUM_TEAMMATES teammates from each player's page
    while teammate_count < NUM_TEAMMATES:
        # Scrape teammate data from the current page
        page_teammates = scrape_teammates(wd.page_source)
        
        # Concatenate the current page's teammates to the overall list
        teammates = pd.concat([teammates, page_teammates[:NUM_TEAMMATES]])
        teammate_count = len(teammates)
        
        # Try to navigate to the next page of teammates
        try:
            url = change_page(wd.page_source)
            wd.get(url)
        except Exception:  # e.g. AttributeError from change_page when there is no next-page button
            print('There was a problem accessing the data or no more players were found')
            break
    
    # Print the scraped teammates and player data for debugging
    print(teammates)
    print(row)
    
    # Add the player and their teammates to the graph
    add_to_graph(row, teammates)

                Name                                        Player Page
0       Mats Hummels  https://www.transfermarkt.com/mats-hummels/pro...
1         Marco Reus  https://www.transfermarkt.com/marco-reus/profi...
2      Julian Brandt  https://www.transfermarkt.com/julian-brandt/pr...
3  Raphaël Guerreiro  https://www.transfermarkt.com/raphael-guerreir...
4           Emre Can  https://www.transfermarkt.com/emre-can/profil/...
Name                                             Jude Bellingham
Player Page    https://www.transfermarkt.com/jude-bellingham/...
Name: 0, dtype: object
             Name                                        Player Page
0   Manuel Akanji  https://www.transfermarkt.com/manuel-akanji/pr...
1           Rodri  https://www.transfermarkt.com/rodri/profil/spi...
2  Bernardo Silva  https://www.transfermarkt.com/bernardo-silva/p...
3         Ederson  https://www.transfermarkt.com/ederson/profil/s...
4      Phil Foden  https://www.transfermarkt.com/phil-foden/profi...
N

In [20]:
# Remove duplicates from the players data
df_players = pd.DataFrame.from_dict(players_data)
df_players = df_players.drop_duplicates()

# Regenerate the teammates page column after adding teammates
df_players['Teammates Page'] = df_players.apply(teamates_page, axis=1)

# Save the updated player data with teammates pages to a CSV file
df_players.to_csv('out.csv', index=False)


## 6. Plotting the Graph Built
In this section, we visualize the graph created in the previous step. The graph displays players as nodes and their connections (teammates) as edges. The color of each node represents the number of connections (teammates) a player has, giving insight into which players are central to the network.

We use the NetworkX `spring_layout()` to position the nodes and Plotly to create an interactive plot where users can hover over nodes to see player names and their connection count.
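
Plotly draws all edges as a single line trace, so the coordinates of every edge are concatenated into one list, with `None` separating consecutive edges. A minimal sketch with a hypothetical three-node layout:

```python
# Hypothetical layout: node -> (x, y), in the shape returned by a layout function
pos = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (1.0, 1.0)}
edges = [("a", "b"), ("b", "c")]

edge_x, edge_y = [], []
for source, target in edges:
    x0, y0 = pos[source]
    x1, y1 = pos[target]
    # None breaks the line, so each edge is drawn as its own segment
    edge_x += [x0, x1, None]
    edge_y += [y0, y1, None]

print(edge_x)  # [0.0, 1.0, None, 1.0, 1.0, None]
print(edge_y)  # [0.0, 0.0, None, 0.0, 1.0, None]
```

Without the `None` entries, the end of one edge would be joined to the start of the next.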

In [21]:
# Position nodes in the graph using a spring layout
pos = nx.spring_layout(G)

# Extract edge positions
edge_x = []
edge_y = []
for edge in G.edges():
    x0, y0 = pos[edge[0]]
    x1, y1 = pos[edge[1]]
    edge_x.append(x0)
    edge_x.append(x1)
    edge_x.append(None)  # For separating edges in the plot
    edge_y.append(y0)
    edge_y.append(y1)
    edge_y.append(None)

# Create the edge trace (lines)
edge_trace = go.Scatter(
    x=edge_x, y=edge_y,
    line=dict(width=1, color='black'),
    hoverinfo='none',
    mode='lines'
)

# Extract node positions
node_x = []
node_y = []
for node in G.nodes():
    x, y = pos[node]
    node_x.append(x)
    node_y.append(y)

# Node adjacency and hover text
node_adjacencies = []
node_text = []
for node, adjacencies in G.adjacency():
    num_connections = len(adjacencies)  # Number of connections for the node
    node_adjacencies.append(num_connections)
    node_text.append(f"Node {extract_player_name(node)} has {num_connections} connections")

# Create the node trace (points)
node_trace = go.Scatter(
    x=node_x, y=node_y,
    mode='markers',
    hoverinfo='text',
    marker=dict(
        showscale=True,
        colorscale='viridis',
        size=10,
        color=node_adjacencies,
        colorbar=dict(
            thickness=15,
            title=dict(text='Node Connections', side='right'),  # nested form; `titleside` is deprecated
            xanchor='left'
        ),
    )
)

# Assign node hover text
node_trace.text = node_text

# Create the plot
fig = go.Figure(data=[edge_trace, node_trace],
                layout=go.Layout(
                    title=dict(text='Interactive Network Graph', font=dict(size=16)),  # nested form; `titlefont_size` is deprecated
                    showlegend=False,
                    hovermode='closest',
                    margin=dict(b=0, l=0, r=0, t=40),
                    annotations=[dict(
                        text="Network visualization using NetworkX and Plotly",
                        showarrow=False,
                        xref="paper", yref="paper",
                        x=0.005, y=-0.002
                    )],
                    xaxis=dict(showgrid=False, zeroline=False),
                    yaxis=dict(showgrid=False, zeroline=False)
                )
)

# Display the graph
fig.show()

## 7. Interactive Section to Check for Shortest Path Between Players
In this section, we allow the user to interactively check for the shortest path between two players in the network. The user inputs the names of the two players, and the notebook calculates and displays the shortest path, if one exists, between them in the graph. The path is represented as a sequence of player names, showing how the two players are connected through the network.
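
In an unweighted graph, a shortest path can be found with a breadth-first search. The sketch below is a plain-Python illustration on a hypothetical toy network; it is not NetworkX's implementation. The length of a path is its number of edges, which is one less than the number of players on it.

```python
from collections import deque


def bfs_shortest_path(adjacency, source, target):
    """Return a shortest path as a list of nodes, or None if unreachable."""
    previous = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through the predecessors to rebuild the path
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in adjacency.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


# Hypothetical toy network, not scraped data
adjacency = {
    "A": ["B"], "B": ["A", "C"], "C": ["B"],
    "D": [],  # isolated node
}

path = bfs_shortest_path(adjacency, "A", "C")
print(path)           # ['A', 'B', 'C']
print(len(path) - 1)  # 2 edges
print(bfs_shortest_path(adjacency, "A", "D"))  # None
```

The `None` result corresponds to the case in which `nx.shortest_path` raises `nx.NetworkXNoPath`.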


In [22]:
# Interactive input to find the shortest path between two players
player1 = input("Enter the first player's name: ")
player2 = input("Enter the second player's name: ")

# Check if both players are in the graph
if check_players_in_graph(G, player1, player2):
    try:
        # Find the shortest path between the two players
        path = nx.shortest_path(G, 
                                source=[node for node in G.nodes if name_to_url(player1) in node][0], 
                                target=[node for node in G.nodes if name_to_url(player2) in node][0])
        
        # Print the path and its length
        print(f"The path has length {len(path) - 1} and is formed by the following players:")
        print(" - ".join(extract_player_name(node) for node in path))
    except nx.NetworkXNoPath:
        print(f'There is no known path between {player1} and {player2}')


There is no known path between Jude Bellingham and Bernardo Silva


## 8. Ending the Selenium Process

In [23]:
# Close the browser
wd.quit()