# Load and Process Player Data

We will now load and clean the player dataset, with the aim of joining this with the match dataset.

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import warnings

# Ignore PerformanceWarning and UserWarning
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=pd.errors.PerformanceWarning)

### Load Datasets

In [None]:
# Load datasets (raw strings keep the backslashes literal; '\a' would otherwise be read as an escape)
player_df = pd.read_csv(r'player_stats\all_player_atts.csv')
match_df = pd.read_csv(r'datasets\match_df.csv')
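
A note on the Windows-style paths used here: in an ordinary (non-raw) Python string literal a backslash starts an escape sequence, and `\a` in particular is the bell character, so a segment such as `\all_player_atts.csv` only survives as typed when the string is raw. The sketch below shows a separator-agnostic alternative using `pathlib`. It assumes the same folder and file names as the loading cell and is an illustration, not the notebook's original code.

```python
from pathlib import Path

# In a normal (non-raw) literal, "\a" collapses to the single BEL character.
bell_is_one_char = len("\a") == 1 and "\a" == "\x07"
print(bell_is_one_char)

# Building paths from parts avoids backslash escapes altogether.
# Folder and file names are taken from the loading cell above.
player_path = Path("player_stats") / "all_player_atts.csv"
match_path = Path("datasets") / "match_df.csv"

print(player_path.name)
print(match_path.parent.name)
```

`pd.read_csv` accepts `Path` objects directly, so `pd.read_csv(player_path)` would work on any operating system.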

As shown below, the player dataset holds one row per player and one column per statistic per year. The statistics include win/loss records (overall, by tournament, by surface and so on) as well as in-game stats such as total aces and returns.

In [189]:
player_df.head()

Unnamed: 0,name,wins_2012,losses_2012,win_pct_2012,hard_wins_2012,hard_losses_2012,hard_pct_2012,clay_wins_2012,clay_losses_2012,clay_pct_2012,...,vsTop10_pct_2025,vsTop20_wins_2025,vsTop20_losses_2025,vsTop20_pct_2025,vsTop50_wins_2025,vsTop50_losses_2025,vsTop50_pct_2025,vsTop100_wins_2025,vsTop100_losses_2025,vsTop100_pct_2025
0,Andy Roddick,23.0,16.0,59.0,14.0,7.0,66.7,0.0,4.0,0.0,...,,,,,,,,,,
1,Roger Federer,71.0,12.0,85.5,30.0,4.0,88.2,15.0,3.0,83.3,...,,,,,,,,,,
2,Juan Carlos Ferrero,5.0,12.0,29.4,0.0,2.0,0.0,5.0,8.0,38.5,...,,,,,,,,,,
3,Andre Agassi,,,,,,,,,,...,,,,,,,,,,
4,Guillermo Coria,,,,,,,,,,...,,,,,,,,,,


### Function to filter columns by desired years

Although the web scraping should only have collected data for the years we want, we will build a function that filters the columns by year, in case any columns from outside the desired range were included:

In [None]:
def remove_year_columns(df, current_year=2025, earliest_year=2003):
    """
    Removes columns that end with the current year or any year before the specified earliest year.

    Parameters:
    - df (pd.DataFrame): The DataFrame from which columns will be removed.
    - current_year (int): The latest year to remove (default is 2025).
    - earliest_year (int): The earliest year to keep (default is 2003).

    Returns:
    - df (pd.DataFrame): The DataFrame with specified columns removed.

    Notes:
    - Removes columns that end with `_current_year`.
    - Removes columns that end with `_year` where year is less than `earliest_year`.
    """

    # Identify columns to remove based on the provided years
    # (years from 2000 up to, but not including, earliest_year are removed)
    columns_to_remove = [
        col for col in df.columns
        if col.endswith(f"_{current_year}")
        or any(col.endswith(f"_{year}") for year in range(2000, earliest_year))
    ]

    # Drop the identified columns
    df = df.drop(columns=columns_to_remove, errors='ignore')

    return df


In [191]:
# Run the function
player_df = remove_year_columns(player_df, current_year=2025, earliest_year=2003)
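
The function above only checks year suffixes from 2000 upward, so a column such as `wins_1999` would pass through untouched. As a hedged alternative, the sketch below parses the four-digit suffix of each column and keeps only those inside an inclusive range. The helper name `keep_year_range` and the toy column names are assumptions made for illustration.

```python
import re

import pandas as pd


def keep_year_range(df, first_year, last_year):
    """Hypothetical helper: keep columns without a year suffix, plus those
    whose 4-digit suffix lies in [first_year, last_year]."""
    kept = []
    for col in df.columns:
        match = re.search(r"_(\d{4})$", col)
        # Columns such as 'name' have no year suffix and are always kept.
        if match is None or first_year <= int(match.group(1)) <= last_year:
            kept.append(col)
    return df[kept]


# Toy frame with one column below, two inside and one above the range.
toy = pd.DataFrame(
    {"name": ["A"], "wins_1999": [1], "wins_2003": [2], "wins_2024": [3], "wins_2025": [4]}
)
filtered = keep_year_range(toy, first_year=2003, last_year=2024)
print(list(filtered.columns))  # ['name', 'wins_2003', 'wins_2024']
```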

### Function to shorten player names

We will need to match up the names from the match_df and the player_df. Since we have the full names in the player_df and shortened names in the match_df, we will need to create a function which reformats a given name from "Firstname Lastname" to "Lastname F." (where F is the first initial of the first name), returning "Invalid Name" for inputs that are not two-part strings:

In [192]:
def format_name_col(name):
    """
    Formats a full name into the format 'Lastname F.' where F is the initial of the first name.

    This function expects the input to be a string consisting of exactly two parts: 
    a first name and a last name, separated by a space. If the input does not meet 
    this criterion, it returns 'Invalid Name'.

    Parameters:
        name (str): A full name string in the format 'First Last'.

    Returns:
        str: A formatted string in the form 'Lastname F.' or 'Invalid Name' 
             if the input is not valid.
    """
    # Check if the provided 'name' is a string.
    if isinstance(name, str):
        # Split the name into parts
        split_name = name.split()
        # Check if the name has exactly two parts (assuming first and last name).
        if len(split_name) == 2:
            # Assign the first part of the name to 'first_name'.
            first_name = split_name[0]
            # Assign the second part of the name to 'last_name'.
            last_name = split_name[1]
            # Return the formatted name as "Lastname F." (F is the first letter of the first name).
            return f"{last_name} {first_name[0]}."
        else:
            # Return "Invalid Name" if the name does not have exactly two parts.
            return "Invalid Name"
    else:
        # Return "Invalid Name" if the provided 'name' is not a string.
        return "Invalid Name"
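
To make the behaviour concrete, the snippet below runs the formatter on a few inputs. The function is restated in condensed form only so that the snippet runs on its own; the logic is intended to be the same as above.

```python
def format_name_col(name):
    # Condensed restatement of the function above, so this snippet is self-contained.
    if isinstance(name, str):
        parts = name.split()
        if len(parts) == 2:
            return f"{parts[1]} {parts[0][0]}."
    return "Invalid Name"


examples = ["Andy Roddick", "Juan Carlos Ferrero", "Federer", None, float("nan")]
results = [format_name_col(example) for example in examples]
for raw, formatted in zip(examples, results):
    print(f"{raw!r:>24} -> {formatted}")
```

Only the two-part name is converted (to `Roddick A.`). The three-part name, the single word, `None` and `NaN` all come back as `Invalid Name`, which is why multi-part names need the extra handling later in this notebook.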


### Function to shorten all player names

We can then create a function that drops rows with no name value and applies the above function to all remaining rows:

In [193]:
def format_names(df):
    """
    Takes a DataFrame and returns a copy with an additional 'formatted_name' column.

    This function performs the following steps:
    - Creates a copy of the input DataFrame.
    - Drops any rows where the 'name' column is missing.
    - Applies the 'format_name_col' function to the 'name' column to generate a new 
      'formatted_name' column, which contains names in the format 'Lastname F.'.

    Parameters:
        df (pandas.DataFrame): A DataFrame containing a 'name' column.

    Returns:
        pandas.DataFrame: A modified copy of the original DataFrame with a new 
        'formatted_name' column and rows with missing names removed.
    """
    # Create a copy
    df_new = df.copy()

    # Drop rows with no name
    df_new.dropna(subset=['name'], inplace=True) 

    # Apply the function to create a formatted_name column
    df_new['formatted_name'] = df_new['name'].apply(format_name_col)

    return df_new


We can then run the function and check it is working as intended:

In [194]:
player_df = format_names(player_df)
player_df[['name', 'formatted_name']].head(2)

Unnamed: 0,name,formatted_name
0,Andy Roddick,Roddick A.
1,Roger Federer,Federer R.
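
One thing worth checking at this stage is whether two different players collapse to the same short name. The manual mapping later in this notebook disambiguates Andrey and Alex Kuznetsov, for instance, and the sketch below shows how such collisions could be surfaced. It uses a toy DataFrame and applies the same 'Lastname F.' rule inline, so it is an illustration rather than part of the original pipeline.

```python
import pandas as pd

toy = pd.DataFrame({"name": ["Andrey Kuznetsov", "Alex Kuznetsov", "Roger Federer"]})

# Same 'Lastname F.' rule as format_name_col; every toy name has exactly two parts.
toy["formatted_name"] = toy["name"].apply(
    lambda full: f"{full.split()[1]} {full.split()[0][0]}."
)

# keep=False flags every member of a duplicated group, not just the repeats.
collisions = toy[toy["formatted_name"].duplicated(keep=False)]
print(collisions)
```

On the real `player_df`, rows marked 'Invalid Name' would need to be excluded first, since they all share that placeholder.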


### Function to analyse unmatched names between the player and match datasets

We can then create a function to compare the names we have in the player and match datasets and to check how many remaining names we have that were not matched:

In [None]:
def analyse_name_lengths(player_df, match_df):
    """
    Analyses and compares player names from two DataFrames, identifying matches and mismatches
    between formatted names and those used in match records.

    This function performs the following:
    - Separates valid and invalid formatted names from 'player_df'.
    - Extracts unique names from both 'player' and 'opponent' columns in 'match_df'.
    - Identifies which long names from 'player_df' correspond to formatted names present in 'match_df'.
    - Determines which names in 'match_df' do not match any formatted names.
    - Prints the count of names in each category for analysis.

    Parameters:
        player_df (pandas.DataFrame): A DataFrame containing columns 'name' and 'formatted_name'.
        match_df (pandas.DataFrame): A DataFrame containing 'player' and 'opponent' columns
                                     with names in formatted form.

    Returns:
        tuple: A tuple containing the following five lists:
            - changed_names (list): Formatted names successfully created from 'player_df'.
            - remaining_names (list): Original names from 'player_df' with invalid formatting.
            - all_shortened_names (list): All unique names from 'player' and 'opponent' columns in 'match_df'.
            - remaining_short_names (list): Names in 'match_df' not matched to any formatted name.
            - matched_long_names (list): Original names from 'player_df' whose formatted versions matched
                                         entries in 'match_df'.
    """
    # Extract valid formatted names and their corresponding long names from 'player_df'.
    valid_names_df = player_df[player_df['formatted_name'] != 'Invalid Name']
    
    # Extract original names where formatted names are invalid.
    remaining_names = list(player_df[player_df['formatted_name'] == 'Invalid Name']['name'])

    # Combine and deduplicate names from 'player' and 'opponent' columns in 'match_df'.
    all_shortened_names = list(set(match_df['player'].unique()).union(set(match_df['opponent'].unique())))

    # Find long names whose formatted name is present in all_shortened_names.
    matched_long_names = list(valid_names_df[valid_names_df['formatted_name'].isin(all_shortened_names)]['name'])

    # Filter out names in 'all_shortened_names' that are not in the list of formatted names.
    changed_names = list(valid_names_df['formatted_name'])
    remaining_short_names = [name for name in all_shortened_names if name not in changed_names]

    # Print the lengths of each list.
    print(f"Number of changed names from player_df: {len(changed_names)}")
    print(f"Number of unchanged names from player_df: {len(remaining_names)}")
    print(f"Number of all unique names in match_df: {len(all_shortened_names)}")
    print(f"Number of remaining unmatched names in match_df: {len(remaining_short_names)}")
    print(f"Number of matched long names: {len(matched_long_names)}")

    # Return the five lists.
    return changed_names, remaining_names, all_shortened_names, remaining_short_names, matched_long_names

In [196]:
changed_names, remaining_names, all_shortened_names, remaining_short_names, matched_long_names = analyse_name_lengths(player_df,match_df)

Number of changed names from player_df: 1062
Number of unchanged names from player_df: 171
Number of all unique names in match_df: 1206
Number of remaining unmatched names in match_df: 200
Number of matched long names: 1020


We can see above that 200 of the 1206 names in the match dataset have still not been matched to a shortened name in the player dataset.
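
The counts above come down to set operations between the formatted player names and the names appearing in matches. The sketch below reproduces that logic on toy data, assuming the same column names (`formatted_name`, `player`, `opponent`); the rows themselves are made up for illustration.

```python
import pandas as pd

player_toy = pd.DataFrame({
    "name": ["Roger Federer", "Andy Roddick", "Juan Carlos Ferrero"],
    "formatted_name": ["Federer R.", "Roddick A.", "Invalid Name"],
})
match_toy = pd.DataFrame({
    "player": ["Federer R.", "Ferrero J.C."],
    "opponent": ["Roddick A.", "Federer R."],
})

# Formatted names that were produced successfully.
valid = player_toy["formatted_name"] != "Invalid Name"
formatted = set(player_toy.loc[valid, "formatted_name"])

# Every distinct name that appears on either side of a match.
in_matches = set(match_toy["player"]) | set(match_toy["opponent"])

unmatched_in_matches = sorted(in_matches - formatted)
matched = sorted(in_matches & formatted)
print(unmatched_in_matches)  # ['Ferrero J.C.']
print(matched)               # ['Federer R.', 'Roddick A.']
```

Set membership tests also avoid the repeated list scans in `name not in changed_names`, which may matter on larger datasets.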

### Function to shorten additional names

After some external analysis of the remaining unmatched names, we can create another function to handle special cases, such as surnames containing 'Van' or 'De' and hyphenated surnames:

In [197]:
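def update_player_names(df, remaining_names, remaining_short_names):
    """
    Fills in 'formatted_name' for rows still marked 'Invalid Name' by searching each
    unmatched long name for a distinctive word taken from an unmatched short name.

    Parameters:
        df (pandas.DataFrame): A DataFrame containing 'name' and 'formatted_name' columns.
        remaining_names (list): Long names whose 'formatted_name' is 'Invalid Name'.
        remaining_short_names (list): Short names from the match dataset not yet matched.

    Returns:
        pandas.DataFrame: A copy of df with 'formatted_name' updated where a match was found.
    """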
    # Creating a copy of the input DataFrame to avoid modifying the original
    df_new = df.copy()

    # Iterating over each name in the remaining_names list
    for name in remaining_names:
        # Iterating over each short name in the remaining_short_names list
        for short_name in remaining_short_names:
            # Skip specific short names that are predefined
            if short_name in ['Munoz-De La Nava D.', 'Garcia-Lopez G.']:
                continue

            # Check if short name starts with specific prefixes
            if short_name.split()[0] in ['De', 'Del', 'O', 'Van', 'Al', 'Lopez']:
                # Set searchable name to the second-to-last word of short name
                searchable_name = short_name.split()[-2]
                # Check if searchable name is part of name
                if searchable_name in name:
                    # Iterate over rows in df_new to update formatted_name
                    for i, row in df_new.iterrows():
                        if (row['name'] == name) & (row['formatted_name'] == 'Invalid Name'):
                            df_new.loc[i, 'formatted_name'] = short_name 

            # Handle hyphenated surnames (the first token of the short name)
            elif '-' in short_name.split()[0]:
                # Set searchable name to the part after the hyphen
                searchable_name = short_name.split()[0].split('-')[1]
                if searchable_name in name:
                    for i, row in df_new.iterrows():
                        if (row['name'] == name) & (row['formatted_name'] == 'Invalid Name'):
                            df_new.loc[i, 'formatted_name'] = short_name
                else:
                    # If above condition fails, set to part before the hyphen
                    searchable_name = short_name.split()[0].split('-')[0]
                    if searchable_name in name:
                        for i, row in df_new.iterrows():
                            if (row['name'] == name) & (row['formatted_name'] == 'Invalid Name'):
                                df_new.loc[i, 'formatted_name'] = short_name
            
            # Default case for setting searchable name
            else:
                searchable_name = short_name.split()[0]
                if searchable_name in name:       
                    for i, row in df_new.iterrows():
                        if (row['name'] == name) & (row['formatted_name'] == 'Invalid Name'):
                            df_new.loc[i, 'formatted_name'] = short_name
                else:
                    # Checking the second word of the short name
                    searchable_name = short_name.split()[1]
                    if searchable_name in name:
                        # Skip specific names
                        if searchable_name in ['Da', 'De']:
                            pass
                        else:
                            for i, row in df_new.iterrows():
                                if (row['name'] == name) & (row['formatted_name'] == 'Invalid Name'):
                                    df_new.loc[i, 'formatted_name'] = short_name
                    else:
                        # Handling exceptions for the third word
                        try:
                            searchable_name = short_name.split()[2]
                            if searchable_name in name:
                                if searchable_name == 'F':
                                    pass
                                else:
                                    for i, row in df_new.iterrows():
                                        if (row['name'] == name) & (row['formatted_name'] == 'Invalid Name'):
                                            df_new.loc[i, 'formatted_name'] = short_name
                        except IndexError:
                            # The short name has no third word, so move on
                            continue

    # Return the updated df
    return df_new

In [198]:
# Run the function
player_df = update_player_names(player_df, remaining_names, remaining_short_names)
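
Because the function above relies on substring checks (`searchable_name in name`), a short surname can match inside an unrelated longer name. As a hedged sketch of a stricter alternative, the helper below requires the long name to end with the full surname from the short name and the first initial to agree. `find_short_name` is a hypothetical helper, and 'Potro A.' is a made-up decoy; the other names are taken from this notebook.

```python
def find_short_name(long_name, short_names):
    """Hypothetical helper: return the one short name whose surname ends
    long_name and whose first initial matches; None for zero or several hits."""
    long_lower = long_name.lower()
    hits = []
    for short_name in short_names:
        parts = short_name.split()
        if len(parts) < 2:
            continue  # no separate initials token to compare against
        surname = " ".join(parts[:-1]).lower()
        initial = parts[-1][0].lower()
        # Require a whole-word surname at the end and a matching first initial.
        if long_lower.endswith(" " + surname) and long_lower.startswith(initial):
            hits.append(short_name)
    return hits[0] if len(hits) == 1 else None


candidates = ["Del Potro J.M.", "Potro A.", "Van Assche L.", "De Schepper K."]
print(find_short_name("Juan Martin Del Potro", candidates))  # Del Potro J.M.
print(find_short_name("Luca Van Assche", candidates))        # Van Assche L.
print(find_short_name("Kenny de Schepper", candidates))      # De Schepper K.
print(find_short_name("Roger Federer", candidates))          # None
```

This stricter rule would miss cases where the match dataset abbreviates or reorders the surname, so it complements the manual overrides rather than replacing them.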

We can run the analyse_name_lengths function again to see how many unmatched names we have remaining after running the above function:

In [199]:
changed_names, remaining_names, all_shortened_names, remaining_short_names, matched_long_names = analyse_name_lengths(player_df,match_df)

Number of changed names from player_df: 1233
Number of unchanged names from player_df: 0
Number of all unique names in match_df: 1206
Number of remaining unmatched names in match_df: 123
Number of matched long names: 1191


This has helped us reduce the number of unmatched names in the match dataset to 123 out of 1206 names, but some other edge cases still remain to be matched.

### Function to manually change remaining names

After some further external analysis, we can see that some names have been missed or incorrectly shortened, so we create a further function that sets these names manually:

In [200]:
def manually_set_names(df, old_name_col, new_name_col):
    """
    Manually updates player names in a DataFrame to a standardised format.

    Args:
        df (pd.DataFrame): The input DataFrame containing player names.
        old_name_col (str): The column name containing the original player names.
        new_name_col (str): The column name where the formatted player names will be stored.

    Returns:
        pd.DataFrame: A new DataFrame with updated player names in the specified column.
    """  
    # Create a copy of the DataFrame to avoid modifying the original directly
    df_new = df.copy()

    # Manually set formatted names for specific players
    df_new.loc[df_new[old_name_col] == 'Alvaro Lopez San Martin', new_name_col] = 'Lopez San Martin A.'
    df_new.loc[df_new[old_name_col] == 'Juan Ignacio Londero', new_name_col] = 'Londero J.I.'
    df_new.loc[df_new[old_name_col] == 'Tomas Martin Etcheverry', new_name_col] = 'Etcheverry T.'
    df_new.loc[df_new[old_name_col] == 'Daniel Dutra Da Silva', new_name_col] = 'Dutra Da Silva D.'
    df_new.loc[df_new[old_name_col] == 'Juan Alejandro Hernandez', new_name_col] = 'Hernandez A.'
    df_new.loc[df_new[old_name_col] == 'Mario Gonzalez Fernandez', new_name_col] = 'Gonzalez M.'
    df_new.loc[df_new[old_name_col] == 'Felix Auger Aliassime', new_name_col] = 'Auger-Aliassime F.'
    df_new.loc[df_new[old_name_col] == 'Daniel Elahi Galan', new_name_col] = 'Galan D.E.'
    df_new.loc[df_new[old_name_col] == 'Gilles Arnaud Bailly', new_name_col] = 'Bailly G.'
    df_new.loc[df_new[old_name_col] == 'Nicolas Moreno De Alboran', new_name_col] = 'Moreno De Alboran N.'
    df_new.loc[df_new[old_name_col] == 'Kenny de Schepper', new_name_col] = 'De Schepper K.'
    df_new.loc[df_new[old_name_col] == 'Rodrigo Pacheco Mendez', new_name_col] = 'Pacheco Mendez R.'
    df_new.loc[df_new[old_name_col] == 'Pedro Martinez Portero', new_name_col] = 'Martinez P.'
    df_new.loc[df_new[old_name_col] == 'Herbert P-H.', new_name_col] = 'Herbert P.'
    df_new.loc[df_new[old_name_col] == 'Herbert P.H.', new_name_col] = 'Herbert P.'
    df_new.loc[df_new[old_name_col] == 'Herbert P-H', new_name_col] = 'Herbert P.'
    df_new.loc[df_new[old_name_col] == 'Herbert P.H', new_name_col] = 'Herbert P.'
    df_new.loc[df_new[old_name_col] == 'Tung-Lin Wu', new_name_col] = 'Wu T.L.'
    df_new.loc[df_new[old_name_col] == 'Zhizhen Zhang', new_name_col] = 'Zhang Zh.'
    df_new.loc[df_new[old_name_col] == 'Alejandro Moro Canas', new_name_col] = 'Moro Canas A.'
    df_new.loc[df_new[old_name_col] == 'Victor Estrella', new_name_col] = 'Estrella Burgos V.'
    df_new.loc[df_new[old_name_col] == 'Jeff Wolf', new_name_col] = 'Wolf J.J.'
    df_new.loc[df_new[old_name_col] == 'Gian Marco Moroni', new_name_col] = 'Moroni G.'
    df_new.loc[df_new[old_name_col] == 'Moroni G.M.', new_name_col] = 'Moroni G.'
    df_new.loc[df_new[old_name_col] == 'Pablo Carreno-Busta', new_name_col] = 'Carreno Busta P.'
    df_new.loc[df_new[old_name_col] == 'Silva F.', new_name_col] = 'Ferreira Silva F.'
    df_new.loc[df_new[old_name_col] == 'Marc-Andrea Huesler', new_name_col] = 'Huesler M.A.'
    df_new.loc[df_new[old_name_col] == 'Thai Kwiatkowski', new_name_col] = 'Kwiatkowski T.S.'
    df_new.loc[df_new[old_name_col] == 'Thiago Moura Monteiro', new_name_col] = 'Monteiro T.'
    df_new.loc[df_new[old_name_col] == 'Galan D.', new_name_col] = 'Galan D.E.'
    df_new.loc[df_new[old_name_col] == 'Paul-Henri Mathieu', new_name_col] = 'Mathieu P.H.'
    df_new.loc[df_new[old_name_col] == 'Andrey Kuznetsov', new_name_col] = 'Kuznetsov An.'
    df_new.loc[df_new[old_name_col] == 'Alex Kuznetsov', new_name_col] = 'Kuznetsov Al.'
    df_new.loc[df_new[old_name_col] == 'Ramkumar Ramanathan', new_name_col] = 'Ramanathan R.'
    df_new.loc[df_new[old_name_col] == 'Daniel Munoz-De La Nava', new_name_col] = 'Munoz-De La Nava D.'
    df_new.loc[df_new[old_name_col] == 'Frederico Ferreira Silva', new_name_col] = 'Ferreira Silva F.'
    df_new.loc[df_new[old_name_col] == 'Soon-Woo Kwon', new_name_col] = 'Kwon S.W.'
    df_new.loc[df_new[old_name_col] == 'Carreno-Busta P.', new_name_col] = 'Carreno Busta P.'
    df_new.loc[df_new[old_name_col] == 'Tomas Barrios Vera', new_name_col] = 'Barrios Vera M.T.'
    df_new.loc[df_new[old_name_col] == 'Roberto Bautista Agut', new_name_col] = 'Bautista R.'
    df_new.loc[df_new[old_name_col] == 'Bautista Agut R.', new_name_col] = 'Bautista R.'
    df_new.loc[df_new[old_name_col] == 'Jc Aragone', new_name_col] = 'Aragone J.C.'
    df_new.loc[df_new[old_name_col] == 'Aragone Jc', new_name_col] = 'Aragone J.C.'
    df_new.loc[df_new[old_name_col] == 'Aragone J.', new_name_col] = 'Aragone J.C.'
    df_new.loc[df_new[old_name_col] == 'Silva F.F.', new_name_col] = 'Ferreira Silva F.'
    df_new.loc[df_new[old_name_col] == 'Felipe Meligeni Alves', new_name_col] = 'Meligeni Rodrigues F'
    df_new.loc[df_new[old_name_col] == 'Meligeni Alves F.', new_name_col] = 'Meligeni Rodrigues F'
    df_new.loc[df_new[old_name_col] == 'John-Patrick Smith', new_name_col] = 'Smith J.P.'
    df_new.loc[df_new[old_name_col] == 'Jan-Lennard Struff', new_name_col] = 'Struff J.L.'
    df_new.loc[df_new[old_name_col] == 'Struff J-L.', new_name_col] = 'Struff J.L.'
    df_new.loc[df_new[old_name_col] == 'Cedrik-Marcel Stebe', new_name_col] = 'Stebe C.M.'
    df_new.loc[df_new[old_name_col] == 'Ze Zhang', new_name_col] = 'Zhang Ze.'
    df_new.loc[df_new[old_name_col] == 'Zhang Ze', new_name_col] = 'Zhang Ze.'
    df_new.loc[df_new[old_name_col] == 'Philipp Kohlschreiber', new_name_col] = 'Kohlschreiber P.'
    df_new.loc[df_new[old_name_col] == 'Kohlschreiber P..', new_name_col] = 'Kohlschreiber P.'
    df_new.loc[df_new[old_name_col] == 'Juan Pablo Varillas', new_name_col] = 'Varillas J.P.'
    df_new.loc[df_new[old_name_col] == 'Varillas J. P.', new_name_col] = 'Varillas J.P.'
    df_new.loc[df_new[old_name_col] == 'Jo-Wilfried Tsonga', new_name_col] = 'Tsonga J.W.'
    df_new.loc[df_new[old_name_col] == 'N. Vijay Sundar Prashanth', new_name_col] = 'Prashanth V.'
    df_new.loc[df_new[old_name_col] == "Christopher O'Connell", new_name_col] = 'O Connell C.'
    df_new.loc[df_new[old_name_col] == "O'Connell C.", new_name_col] = 'O Connell C.'
    df_new.loc[df_new[old_name_col] == 'Luca Van Assche', new_name_col] = 'Van Assche L.'
    df_new.loc[df_new[old_name_col] == 'James McCabe', new_name_col] = 'Mccabe J.'
    df_new.loc[df_new[old_name_col] == 'Rogerio Dutra Da Silva', new_name_col] = 'Dutra Silva R.'
    df_new.loc[df_new[old_name_col] == 'Dutra Da Silva R.', new_name_col] = 'Dutra Silva R.'
    df_new.loc[df_new[old_name_col] == 'Suk-Young Jeong', new_name_col] = 'Jeong S.Y.'
    df_new.loc[df_new[old_name_col] == 'Juan Martin Del Potro', new_name_col] = 'Del Potro J.M.'
    df_new.loc[df_new[old_name_col] == 'Del Potro J. M.', new_name_col] = 'Del Potro J.M.'
    df_new.loc[df_new[old_name_col] == 'Del Potro J.', new_name_col] = 'Del Potro J.M.'
    df_new.loc[df_new[old_name_col] == 'Mario Vilella Martinez', new_name_col] = 'Vilella Martinez M.'
    df_new.loc[df_new[old_name_col] == 'Gerardo Lopez Villasenor', new_name_col] = 'Lopez Villasenor G.'
    df_new.loc[df_new[old_name_col] == 'Dolgopolov O.', new_name_col] = 'Dolgopolov A.'
    df_new.loc[df_new[old_name_col] == 'Alexandr Dolgopolov', new_name_col] = 'Dolgopolov A.'
    df_new.loc[df_new[old_name_col] == 'David Vega Hernandez', new_name_col] = 'Vega Hernandez D.'
    df_new.loc[df_new[old_name_col] == 'Albert Ramos-Vinolas', new_name_col] = 'Ramos A.'
    df_new.loc[df_new[old_name_col] == 'Ramos-Vinolas A.', new_name_col] = 'Ramos A.'
    df_new.loc[df_new[old_name_col] == 'Chun Hsin Tseng', new_name_col] = 'Tseng C.H.'
    df_new.loc[df_new[old_name_col] == 'Tseng C. H.', new_name_col] = 'Tseng C.H.'
    df_new.loc[df_new[old_name_col] == 'Mukund Sasikumar', new_name_col] = 'Mukund S.'
    df_new.loc[df_new[old_name_col] == 'Andres Artunedo Martinavarr', new_name_col] = 'Artunedo Martinavarro A.'
    df_new.loc[df_new[old_name_col] == 'Ricardo Ojeda Lara', new_name_col] = 'Ojeda Lara R.'
    df_new.loc[df_new[old_name_col] == 'Jean-Christophe Faurel', new_name_col] = 'Faurel J.C.'
    df_new.loc[df_new[old_name_col] == 'Roman Andres Burruchaga', new_name_col] = 'Burruchaga R.'
    df_new.loc[df_new[old_name_col] == 'Juan-Martin Aranguren', new_name_col] = 'Aranguren J.M.'
    df_new.loc[df_new[old_name_col] == 'Federico Delbonis', new_name_col] = 'Del Bonis F.'
    df_new.loc[df_new[old_name_col] == 'Delbonis F.', new_name_col] = 'Del Bonis F.'
    df_new.loc[df_new[old_name_col] == 'Juan Carlos Ferrero', new_name_col] = 'Ferrero J.C.'
    df_new.loc[df_new[old_name_col] == 'Matteo Viola', new_name_col] = 'Viola Mat.'
    df_new.loc[df_new[old_name_col] == 'Viola Mat.', new_name_col] = 'Viola M.'
    df_new.loc[df_new[old_name_col] == 'Abdullah Maqdes', new_name_col] = 'Abdulla M.'
    df_new.loc[df_new[old_name_col] == 'Jean-Rene Lisnard', new_name_col] = 'Lisnard J.'
    df_new.loc[df_new[old_name_col] == 'Lisnard J.R.', new_name_col] = 'Lisnard J.'
    df_new.loc[df_new[old_name_col] == 'Jesse Levine', new_name_col] = 'Levine J.'
    df_new.loc[df_new[old_name_col] == 'Levine I.', new_name_col] = 'Levine J.'
    df_new.loc[df_new[old_name_col] == 'Miguel Gallardo-Valles', new_name_col] = 'Gallardo M.'
    df_new.loc[df_new[old_name_col] == 'Gallardo Valles M.', new_name_col] = 'Gallardo M.'
    df_new.loc[df_new[old_name_col] == 'Daniel Gimeno-Traver', new_name_col] = 'Gimeno-Traver D.'
    df_new.loc[df_new[old_name_col] == 'Gimeno D.', new_name_col] = 'Gimeno-Traver D.'
    df_new.loc[df_new[old_name_col] == 'Jose-Antonio Sanchez-De Luna', new_name_col] = 'Sanchez De Luna J.'
    df_new.loc[df_new[old_name_col] == 'Aisam Qureshi', new_name_col] = 'Qureshi A.'
    df_new.loc[df_new[old_name_col] == 'Qureshi A.U.H.', new_name_col] = 'Qureshi A.'
    df_new.loc[df_new[old_name_col] == 'John-Paul Fruttero', new_name_col] = 'Fruttero J.P.'
    df_new.loc[df_new[old_name_col] == 'Francisco Fogues-Domenech', new_name_col] = 'Fogues F.'
    df_new.loc[df_new[old_name_col] == 'Shao-Xuan Zeng', new_name_col] = 'Zeng S.X.'
    df_new.loc[df_new[old_name_col] == 'Mounir El Aarej', new_name_col] = 'El Aarej M.'
    df_new.loc[df_new[old_name_col] == 'Yu Jr. Wang', new_name_col] = 'Wang Y.Jr.'
    df_new.loc[df_new[old_name_col] == 'Arnaud Di Pasquale', new_name_col] = 'Di Pasquale A.'
    df_new.loc[df_new[old_name_col] == 'Ben-Qiang Zhu', new_name_col] = 'Zhu B.Q.'
    df_new.loc[df_new[old_name_col] == 'Guillem Burniol-Teixido', new_name_col] = 'Burniol G.'
    df_new.loc[df_new[old_name_col] == 'Salvador Navarro-Gutierrez', new_name_col] = 'Navarro S.'
    df_new.loc[df_new[old_name_col] == 'Victor Estrella', new_name_col] = 'Estrella V.'
    df_new.loc[df_new[old_name_col] == 'Jason Murray Kubler', new_name_col] = 'Kubler J.'
    df_new.loc[df_new[old_name_col] == 'Marc-Kevin Goellner', new_name_col] = 'Goellner M.K.'
    df_new.loc[df_new[old_name_col] == 'Alex Jr. Bogomolov', new_name_col] = 'Bogomolov A.'
    df_new.loc[df_new[old_name_col] == 'Jean-Claude Scherrer', new_name_col] = 'Scherrer J.C.'
    df_new.loc[df_new[old_name_col] == 'Michael McClune', new_name_col] = 'Mcclune M.'
    df_new.loc[df_new[old_name_col] == 'Alex De Minaur', new_name_col] = 'De Minaur A.'
    df_new.loc[df_new[old_name_col] == 'Cristobal Saavedra-Corvalan', new_name_col] = 'Saavedra Corvalan C.'
    df_new.loc[df_new[old_name_col] == 'Albert Montanes', new_name_col] = 'Albert M.'
    df_new.loc[df_new[old_name_col] == 'Montanes A.', new_name_col] = 'Albert M.'
    df_new.loc[df_new[old_name_col] == 'Frank Condor-Fernandez', new_name_col] = 'Condor F.'
    df_new.loc[df_new[old_name_col] == 'Juan-Antonio Marin', new_name_col] = 'Marin J.A'
    df_new.loc[df_new[old_name_col] == 'Genaro Alberto Olivieri', new_name_col] = 'Olivieri G.'
    df_new.loc[df_new[old_name_col] == 'Bartolome Salva-Vidal', new_name_col] = 'Salva B.'
    df_new.loc[df_new[old_name_col] == 'Juan-Pablo Guzman', new_name_col] = 'Guzman J.P.'
    df_new.loc[df_new[old_name_col] == 'Guzman J.', new_name_col] = 'Guzman J.P.'
    df_new.loc[df_new[old_name_col] == 'Marc Fornell-Mestres', new_name_col] = 'Fornell M.'
    df_new.loc[df_new[old_name_col] == 'Matheus Pucinelli De Almeida', new_name_col] = 'Pucinelli De Almeida M.'
    df_new.loc[df_new[old_name_col] == 'Ji Sung Nam', new_name_col] = 'Nam J.S.'
    df_new.loc[df_new[old_name_col] == 'Nikolas Sanchez-Izquierdo', new_name_col] = 'Sanchez Izquierdo N.'
    df_new.loc[df_new[old_name_col] == 'Izak van der Merwe', new_name_col] = 'Van Der Merwe I.'
    df_new.loc[df_new[old_name_col] == 'Van D. Merwe I.', new_name_col] = 'Van Der Merwe I.'
    df_new.loc[df_new[old_name_col] == 'Mohammad Ghareeb', new_name_col] = 'Al Ghareeb M.'
    df_new.loc[df_new[old_name_col] == 'Mao-Xin Gong', new_name_col] = 'Gong M.X.'
    df_new.loc[df_new[old_name_col] == 'Juan-Pablo Brzezicki', new_name_col] = 'Brzezicki J.P.'
    df_new.loc[df_new[old_name_col] == 'Oscar Serrano-Gamez', new_name_col] = 'Serrano O.'
    df_new.loc[df_new[old_name_col] == 'Jan-Michael Gambill', new_name_col] = 'Gambill J.M.'
    df_new.loc[df_new[old_name_col] == 'Gambill J. M.', new_name_col] = 'Gambill J.M.'
    df_new.loc[df_new[old_name_col] == 'Woong-Sun Jun', new_name_col] = 'Jun W.S.'
    df_new.loc[df_new[old_name_col] == 'Mathieu Perchicot', new_name_col] = 'Mathieu P.'
    df_new.loc[df_new[old_name_col] == 'Gabriel Trujillo-Soler', new_name_col] = 'Trujillo G.'
    df_new.loc[df_new[old_name_col] == 'Reda El Amrani', new_name_col] = 'El Amrani R.'
    df_new.loc[df_new[old_name_col] == 'Jan-Frode Andersen', new_name_col] = 'Andersen J.F.'
    df_new.loc[df_new[old_name_col] == 'Mauricio Perez Mota', new_name_col] = 'Mota B.'
    df_new.loc[df_new[old_name_col] == 'Alex Jr. Bogomolov', new_name_col] = 'Bogomolov Jr.A.'
    df_new.loc[df_new[old_name_col] == 'Juan Ignacio Chela', new_name_col] = 'Chela J.I.'
    df_new.loc[df_new[old_name_col] == 'Chela J.', new_name_col] = 'Chela J.I.'
    df_new.loc[df_new[old_name_col] == 'Pierre-Ludovic Duclos', new_name_col] = 'Duclos P.L.'
    df_new.loc[df_new[old_name_col] == 'Petru-Alexandru Luncanu', new_name_col] = 'Luncanu P.A.'
    df_new.loc[df_new[old_name_col] == 'Hyung-Taik Lee', new_name_col] = 'Lee H.T.'
    df_new.loc[df_new[old_name_col] == 'Tsung-Hua Yang', new_name_col] = 'Yang T.H.'
    df_new.loc[df_new[old_name_col] == 'Alex Jr. Bogomolov', new_name_col] = 'Bogomolov Jr. A.'
    df_new.loc[df_new[old_name_col] == 'Juan-Sebastian Cabal', new_name_col] = 'Cabal J.S.'
    df_new.loc[df_new[old_name_col] == 'Facundo Diaz Acosta', new_name_col] = 'Diaz Acosta F.'
    df_new.loc[df_new[old_name_col] == 'Ryler Deheart', new_name_col] = 'De Heart R.'
    df_new.loc[df_new[old_name_col] == 'Younes El Aynaoui', new_name_col] = 'El Aynaoui Y.'
    df_new.loc[df_new[old_name_col] == 'Alejandro Davidovich Fokina', new_name_col] = 'Davidovich Fokina A.'
    df_new.loc[df_new[old_name_col] == 'Aleksandr Nedovyesov', new_name_col] = 'Nedovyesov O.'
    df_new.loc[df_new[old_name_col] == 'Hyun-Woo Nam', new_name_col] = 'Nam H.W.'
    
    # Return the modified DataFrame
    return df_new

In [201]:
# Run the function
player_df = manually_set_names(player_df, 'name', 'formatted_name')
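
A long run of `.loc` assignments is hard to review. A minimal table-driven sketch of the same idea is shown below: the overrides live in a dictionary and are applied with `Series.map`. The helper name `apply_manual_names` and the toy DataFrame are assumptions, and only three of the pairs from the function above are included.

```python
import pandas as pd

# A few pairs copied from manually_set_names; the full table would hold them all.
MANUAL_NAMES = {
    "Juan Martin Del Potro": "Del Potro J.M.",
    "Jo-Wilfried Tsonga": "Tsonga J.W.",
    "Alex De Minaur": "De Minaur A.",
}


def apply_manual_names(df, old_name_col, new_name_col, mapping):
    """Hypothetical helper: overwrite new_name_col wherever old_name_col is a key of mapping."""
    df_new = df.copy()
    # map() yields NaN for names without an override ...
    overrides = df_new[old_name_col].map(mapping)
    # ... so fillna() falls back to the existing formatted name for those rows.
    df_new[new_name_col] = overrides.fillna(df_new[new_name_col])
    return df_new


toy = pd.DataFrame({
    "name": ["Jo-Wilfried Tsonga", "Roger Federer"],
    "formatted_name": ["Tsonga J.", "Federer R."],
})
result = apply_manual_names(toy, "name", "formatted_name", MANUAL_NAMES)
print(list(result["formatted_name"]))  # ['Tsonga J.W.', 'Federer R.']
```

Keeping one pair per line makes it easier to spot a long name that is listed more than once with different targets.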

Running the analyse_name_lengths function again shows how many unmatched names remain after the manual changes:

In [202]:
changed_names, remaining_names, all_shortened_names, remaining_short_names, matched_long_names = analyse_name_lengths(player_df,match_df)

Number of changed names from player_df: 1233
Number of unchanged names from player_df: 0
Number of all unique names in match_df: 1206
Number of remaining unmatched names in match_df: 65
Number of matched long names: 1232


This leaves us with 65 unmatched names which, from external analysis, could not be obviously matched with any name in the player dataset. However, since we have matched the vast majority of names (1141 of 1206), we shouldn't be excluding too many matches from our dataset.
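
To gauge how many match rows the unmatched names would actually affect, one option is a left merge with `indicator=True`. The join itself is not part of this section, so the sketch below is an assumption about how it might look: it joins `player` in the match data to `formatted_name` in the player data, using toy rows and a made-up name 'Unknown X.'.

```python
import pandas as pd

match_toy = pd.DataFrame({
    "player": ["Federer R.", "Unknown X."],
    "opponent": ["Roddick A.", "Federer R."],
})
player_toy = pd.DataFrame({
    "formatted_name": ["Federer R.", "Roddick A."],
    "wins_2012": [71.0, 23.0],
})

# indicator=True adds a '_merge' column recording where each row was found.
merged = match_toy.merge(
    player_toy, how="left", left_on="player", right_on="formatted_name", indicator=True
)

# 'left_only' rows are matches whose player has no stats in the player dataset.
unmatched_rows = merged[merged["_merge"] == "left_only"]
print(len(unmatched_rows))  # 1
```

The same check would need repeating for the `opponent` column before deciding which matches to drop.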

We can now save this cleaned player dataset for future reference:

In [None]:
# Save to file
player_df.to_csv(r'datasets\player_df.csv', index=False)