In one of my previous [scripts][1] I explored whether surface expertise exists in Men's Tennis.

Let's see how the different surfaces affect the performance of different players.

1. Can we again find players who perform better on one kind of surface than on the others?
2. Can we detect the style differences that surfaces impose (e.g., number of aces, number of first serves in)?
3. Among the experts, the players who have a clear surface preference, can we detect characteristics such as height or age differences compared to other players?



  [1]: https://www.kaggle.com/drgilermo/d/jordangoblet/atp-tour-20002016/competitiveness-and-expertise-in-tennis

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
plt.style.use('fivethirtyeight')

## Read the Data:

In [None]:
path = "../input/"
frames = []
for filename in sorted(os.listdir(path)):
    try:
        # Build the path from `path` instead of changing the working directory
        frames.append(pd.read_csv(os.path.join(path, filename), encoding='utf8'))
    except UnicodeDecodeError:
        # Skip files that are not valid UTF-8
        pass
# Concatenate once at the end; a fresh index avoids duplicated row labels
df = pd.concat(frames, ignore_index=True)

## Let's explore some serving differences

It turns out that serving (and returning) differs from one surface to another.

Is it easier to score an ace on some surfaces?

Is it more difficult to break the serve in others?

Let's see:

In [None]:
df['Aces'] = df.l_ace + df.w_ace

plt.bar(1,np.mean(df.Aces[df.surface == 'Hard']))
plt.bar(2,np.mean(df.Aces[df.surface == 'Grass']), color = 'g')
plt.bar(3,np.mean(df.Aces[df.surface == 'Clay']), color ='r')
plt.ylabel('Aces per Match')
plt.xticks([1,2,3], ['Hard','Grass','Clay'])
plt.title('More Aces on Grass')

When playing on grass, the ball doesn't bounce off the ground as much as it does on clay. Therefore, returning a good serve is more difficult than on clay. This is not surprising. The effect is large: an average match on grass yields almost twice as many aces as a match on clay.
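
The same per-surface averages can also be computed in a single step with `groupby`, rather than one filter per bar. Here is a minimal sketch on a made-up toy frame: the numbers are invented for illustration and are not taken from the data set, and only the column names (`surface`, `w_ace`, `l_ace`) match the ones used above.

```python
import pandas as pd

# Toy matches; the numbers are invented for illustration only
toy = pd.DataFrame({
    'surface': ['Hard', 'Grass', 'Clay', 'Grass', 'Clay', 'Hard'],
    'w_ace':   [6, 10, 3, 12, 2, 8],
    'l_ace':   [4, 8, 1, 6, 2, 2],
})
toy['Aces'] = toy.w_ace + toy.l_ace

# Mean aces per match for each surface, in one call
aces_by_surface = toy.groupby('surface')['Aces'].mean()
print(aces_by_surface)
```

The resulting Series could then be passed straight to a bar plot, for example with `aces_by_surface.plot(kind='bar')`.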

In [None]:
df['loser_1st_rate'] = np.true_divide(df['l_1stIn'], df['l_svpt'])
df['winner_1st_rate'] = np.true_divide(df['w_1stIn'], df['w_svpt'])
df['first_serve_rate'] = (df['loser_1st_rate'] + df['winner_1st_rate'])/2

plt.bar(1,100*np.mean(df.first_serve_rate[df.surface == 'Hard']))
plt.bar(2,100*np.mean(df.first_serve_rate[df.surface == 'Grass']), color = 'g')
plt.bar(3,100*np.mean(df.first_serve_rate[df.surface == 'Clay']), color ='r')
plt.ylabel('First Serve In [%]')
plt.ylim([50,70])
plt.xticks([1,2,3], ['Hard','Grass','Clay'])
plt.title('% of 1st serves in')

plt.figure()
df['loser_1st_taken'] =  np.true_divide(df['l_1stWon'], df['l_1stIn'])
df['winner_1st_taken'] =  np.true_divide(df['w_1stWon'], df['w_1stIn'])
df['first_taken'] = (df['loser_1st_taken'] + df['winner_1st_taken'])/2

plt.bar(1,100*np.mean(df.first_taken[df.surface == 'Hard']))
plt.bar(2,100*np.mean(df.first_taken[df.surface == 'Grass']), color = 'g')
plt.bar(3,100*np.mean(df.first_taken[df.surface == 'Clay']), color ='r')
plt.ylabel('First Serve point taken')
plt.title('% of first serve points taken')
plt.xticks([1,2,3], ['Hard','Grass','Clay'])
plt.ylim([50,70])

## Easier to break on Clay

Not only are aces easier to produce on grass, but winning points on serve in general is easier there: more than 65% of the first serves that land in are won by the player who served. On clay this number drops to about 61%.

We can assume that this effect doesn't come only from aces, but also from good serves that force errors from the receiver.



In [None]:
df['aces_per_serve_w'] = np.true_divide(df.w_ace,df.w_svpt)
df['aces_per_serve_l'] = np.true_divide(df.l_ace,df.l_svpt)
df['aces_per_serve'] = (df['aces_per_serve_w'] + df['aces_per_serve_l'])/2

plt.bar(1,np.mean(df['aces_per_serve'][df.surface == 'Hard']))
plt.bar(2,np.mean(df['aces_per_serve'][df.surface == 'Grass']), color = 'g')
plt.bar(3,np.mean(df['aces_per_serve'][df.surface == 'Clay']), color = 'r')
plt.ylabel('Aces Per Serve')
plt.title('Aces Per Serve')
plt.xticks([1,2,3], ['Hard','Grass','Clay'])

## Most serves are not aces
But on grass almost 5% of them are, whereas on clay fewer than 3% are.

While the serving advantage probably lingers during the whole rally, the 2-percentage-point difference in aces per serve between clay and grass can explain about half of the difference in first-serve points won.
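
The "about half" figure is simple arithmetic on the rounded values quoted above. The sketch below spells it out; treat it as a rough estimate only, since the inputs are approximate readings of the charts and the two rates use different denominators (aces per serve point versus points won per first serve in).

```python
# Approximate values quoted in the text above (rounded chart readings)
first_won_grass, first_won_clay = 0.65, 0.61
ace_rate_grass, ace_rate_clay = 0.05, 0.03

# Gap between grass and clay for each statistic
gap_points_won = first_won_grass - first_won_clay
gap_aces = ace_rate_grass - ace_rate_clay

# Share of the points-won gap that the ace gap could account for
share = gap_aces / gap_points_won
print(f"Aces account for roughly {share:.0%} of the gap")
```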

## Now let's build a players Data Frame:
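
The cell below counts wins and losses by filtering the full match table once per player, which can get slow as the table grows. One possible alternative is `value_counts`, sketched here on a made-up toy frame (the player names `A`, `B`, `C` and the frame `toy_players` are hypothetical; only `winner_name` and `loser_name` match the real columns).

```python
import pandas as pd

# Toy match list; names are placeholders, not real players
toy = pd.DataFrame({
    'winner_name': ['A', 'A', 'B', 'C', 'A'],
    'loser_name':  ['B', 'C', 'C', 'A', 'B'],
})

# One pass over each column gives the per-player counts
wins = toy.winner_name.value_counts()
losses = toy.loser_name.value_counts()

# Align on player name; a player missing from one side gets 0
toy_players = pd.concat([wins, losses], axis=1, keys=['Wins', 'Losses']).fillna(0).astype(int)
toy_players['Games'] = toy_players.Wins + toy_players.Losses
toy_players['PCT'] = toy_players.Wins / toy_players.Games
print(toy_players)
```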

In [None]:
winners = list(np.unique(df.winner_name))
losers = list(np.unique(df.loser_name))

all_players = winners + losers
players = np.unique(all_players)

players_df = pd.DataFrame()
players_df['Name'] = players
players_df['Wins'] = players_df.Name.apply(lambda x: len(df[df.winner_name == x]))
players_df['Losses'] = players_df.Name.apply(lambda x: len(df[df.loser_name == x]))
players_df['PCT'] = np.true_divide(players_df.Wins,players_df.Wins + players_df.Losses)
players_df['Games'] = players_df.Wins + players_df.Losses

surfaces = ['Hard','Grass','Clay','Carpet']
for surface in surfaces:
    players_df[surface + '_wins'] = players_df.Name.apply(lambda x: len(df[(df.winner_name == x) & (df.surface == surface)]))
    players_df[surface + '_losses'] = players_df.Name.apply(lambda x: len(df[(df.loser_name == x) & (df.surface == surface)]))
    players_df[surface + 'PCT'] = np.true_divide(players_df[surface + '_wins'],players_df[surface + '_losses'] + players_df[surface + '_wins'])
    
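# Copy so the new columns are added to an independent frame, not a view of players_df
serious_players = players_df[players_df.Games>40].copy()
serious_players['Height'] = serious_players.Name.apply(lambda x: list(df.winner_ht[df.winner_name == x])[0])
# Series.min() skips missing ranks, unlike the built-in min()
serious_players['Best_Rank'] = serious_players.Name.apply(lambda x: df.winner_rank[df.winner_name == x].min())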
serious_players['Win_Aces'] = serious_players.Name.apply(lambda x: np.mean(df.w_ace[df.winner_name == x]))
serious_players['Lose_Aces'] = serious_players.Name.apply(lambda x: np.mean(df.l_ace[df.loser_name == x]))
serious_players['Aces'] = (serious_players['Win_Aces']*serious_players['Wins'] + serious_players['Lose_Aces']*serious_players['Losses'])/serious_players['Games']


## Players Clusters
We can now use K-means clustering based on each player's winning percentage on each surface, and see if the cluster centers tell us something interesting. We will also look at the average height of the players in each cluster.

I chose 6 clusters, as a lower number of clusters mostly divides the data set into groups based on the general level (winning percentage).
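
One common way to sanity-check a choice like this is to compare the K-means inertia (within-cluster sum of squares) across several values of k. The sketch below does that on synthetic data, not on the players table: the array `X` and its 150 rows are placeholders standing in for the three winning-percentage columns, so the printed values say nothing about the real data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the (Hard, Grass, Clay) winning percentages
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(150, 3))

# Fit K-means for a range of k and record the inertia of each fit
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 2))
```

Inertia generally shrinks as k grows, so the thing to look for is the point where the improvement flattens out rather than the minimum itself.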

In [None]:
# Keep only players with a defined winning percentage on all three clustered surfaces
serious_players = serious_players.dropna(subset=['HardPCT','GrassPCT','ClayPCT']).copy()
kmeans_df = serious_players[['HardPCT','GrassPCT','ClayPCT']]

kmeans = KMeans(n_clusters = 6, random_state = 0).fit(kmeans_df)
kmeans.cluster_centers_

serious_players['label'] = kmeans.labels_
print(['Hard','Grass','Clay'])
for i,label in enumerate(kmeans.cluster_centers_):
    print(label)
    print(np.mean(serious_players.Height[serious_players.label == i]))




## Let's plot the Clusters

In [None]:
for i,cluster in enumerate(kmeans.cluster_centers_):
    plt.bar(i+1-0.25,100*cluster[0], width = 0.25, color = 'b')
    plt.bar(i+1,100*cluster[1],width = 0.25, color = 'g')
    plt.bar(i+1+0.25,100*cluster[2],width = 0.25, color = 'r')

plt.legend(['Hard','Grass','Clay'], loc = 2, fontsize = 10)
plt.ylabel('Winning Percentage')
plt.xlabel('Cluster')
plt.title('6 Clusters of players')

pca = PCA(2)
pca.fit(kmeans_df)
pca_df = pca.transform(kmeans_df)
pca_df = pd.DataFrame(pca_df)
pca_df['label'] = kmeans.labels_

plt.figure()
legend = []

for i,label in enumerate(kmeans.cluster_centers_):
    plt.plot(pca_df[pca_df.columns[0]][pca_df.label == i],pca_df[pca_df.columns[1]][pca_df.label == i],'o')
    legend.append('Cluster ' + str(i+1))
    
plt.legend(legend)
plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.title('PCA 2D view')

The clusters are sorted almost horizontally based on the general winning percentage.

As winning percentage drops, more interesting patterns are revealed. The good players are more well rounded (and taller), whereas the worse players are divided into subgroups: some of them under-perform badly on grass, while others under-perform on clay. This is quite similar to what I've seen with the Men's ATP tennis data set.
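
The reading that the horizontal axis tracks general level can be checked by correlating the first principal component with each player's mean winning percentage. Below is a minimal sketch on synthetic data, assuming a shared skill level plus small surface-specific noise; the arrays `level` and `X` are invented for illustration, and the PCA is done with a plain NumPy SVD rather than the scikit-learn object used above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic players: a shared skill level plus small surface-specific noise
level = rng.uniform(0.3, 0.7, size=200)
X = level[:, None] + rng.normal(0, 0.03, size=(200, 3))

# PCA via SVD of the centred data; rows of Vt are the principal axes
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Xc @ Vt[0]

# The sign of a principal component is arbitrary, so compare absolute correlation
corr = abs(np.corrcoef(pc1, X.mean(axis=1))[0, 1])
print(round(corr, 3))
```

On the real data the same check would use `kmeans_df` in place of `X`; a correlation close to 1 would support the interpretation, while a lower value would suggest the first component mixes in surface preference.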