# ICC Cricket World Cup 2019 - Batting Impact
I have recently been thinking about cricket metrics, and the best way that we can measure how impactful a batsman is. This comes from recently seeing a fantastic thread on Twitter from [Dan Weston](https://twitter.com/SAAdvantage) about the PSL, as well as the recent discussions about where Moeen Ali should bat for England given his power hitting against spin, and where most of the spin overs are in a T20 match.  

I thought that I would apply these ideas and metrics to the top 50 run scorers from last year's ICC Cricket World Cup.  

I got these summary statistics from [ESPNCricInfo](http://stats.espncricinfo.com/ci/engine/records/batting/most_runs_career.html?id=12357;type=tournament). I have written a web scraper that pulls the data from that page and saves it as a csv so it is easier for me to import into Kaggle.  

All the details about the web scraper can be found in my [Cricketer Analysis repository](https://github.com/willcanniford/cricketer-analysis) on GitHub. 

In [None]:
# Imports for loading and cleaning data
import pandas as pd
import re
import numpy as np

# Imports for visualisations and displaying tables
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
from IPython.display import HTML, display

# Imports for batsman grouping and classification
from sklearn.cluster import KMeans
from sklearn import preprocessing 
from sklearn.metrics import silhouette_score, silhouette_samples
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.manifold import TSNE

## Loading and preparing the data

In [None]:
world_cup_batting_raw = pd.read_csv("../input/world_cup_2019_batting_raw.csv")
df = world_cup_batting_raw.loc[:, ['Player','Runs', 'BF','SR','4s','6s']].copy()
df.head(3)

### Clean `Player` column
In the raw form, straight from the website, the `Player` column contains the country in brackets after the player's name. I'd like to separate that out into its own column (`Country`) and remove that from `Player`. I'm going to write 2 simple regex functions that we can `apply` to the `pd.Series` to achieve this. 

In [None]:
def extract_country(player_string):
    # Capture the upper-case country code inside the brackets
    match = re.search(r'\(([A-Z]+)\)', player_string)
    return match.group(1) if match else None

def clean_player_name(player_string):
    # Drop the trailing " (COUNTRY)" suffix, leaving just the player's name
    return re.sub(r'\s*\([A-Z]+\)\s*$', '', player_string)

In [None]:
df['Country'] = df.Player.apply(extract_country) # Create separate `Country` column
df['Player'] = df.Player.apply(clean_player_name) # Clean and replace `Player`
df.head(3) # Inspect new format 
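
As an aside, the same split can be done in one vectorised step with `Series.str.extract`. This is only a sketch: the sample strings below are made up to mimic the "Name (COUNTRY)" shape of the scraped column, and the group names `Player` and `Country` are my own choice.

```python
import pandas as pd

# Hypothetical sample values in the same "Name (COUNTRY)" shape as the scraped column
players = pd.Series(["RG Sharma (INDIA)", "KS Williamson (NZ)", "Imam-ul-Haq (PAK)"])

# One pattern, two named groups: the text before the brackets and the code inside them
parts = players.str.extract(r"^(?P<Player>.+?)\s*\((?P<Country>[A-Z]+)\)\s*$")
print(parts)
```

Named groups become the column names of the returned frame, so there is no need for two separate `apply` calls.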

### Derive boundary-related metrics
The data as it currently stands just has the basic summary statistics for each player, so I am going to derive some new ones based around boundary hitting (using 4s and 6s), and work out the strike rate of each batsman once the boundary balls are removed from their stats.  

We will use some of these metrics later to classify the batsmen into groups using `sklearn`. 

In [None]:
df['BoundaryRuns'] = df['4s'] * 4 + df['6s'] * 6
df['NonBoundaryRuns'] = df['Runs'] - df['BoundaryRuns']
df['TotalBoundaries'] = df['4s'] + df['6s']
df['NonBoundaryBalls'] = df['BF'] - df['TotalBoundaries']
df['RunsFromBoundary %'] = round(df['BoundaryRuns'] / df['Runs'] * 100, 2)
df['Boundary %'] = round(df['TotalBoundaries'] / df['BF'] * 100, 2)
df['NonBoundaryStrikeRate'] = round(df['NonBoundaryRuns'] / df['NonBoundaryBalls'] * 100, 2)
df['Boundary6 %'] = round(df['6s'] / (df['6s'] + df['4s']) * 100, 2)
df.head(3) # Inspect new format 
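
To make the arithmetic concrete, here is a small worked example on a single made-up innings (the numbers are invented for illustration and are not from the tournament data). It follows the same formulas as the cell above.

```python
import pandas as pd

# Made-up totals for illustration only: 100 runs off 80 balls, 10 fours and 2 sixes
toy = pd.DataFrame({"Runs": [100], "BF": [80], "4s": [10], "6s": [2]})

boundary_runs = toy["4s"] * 4 + toy["6s"] * 6              # 40 + 12 = 52
non_boundary_runs = toy["Runs"] - boundary_runs             # 100 - 52 = 48
non_boundary_balls = toy["BF"] - (toy["4s"] + toy["6s"])    # 80 - 12 = 68

# Overall strike rate versus strike rate once boundary balls are removed
toy["SR"] = (toy["Runs"] / toy["BF"] * 100).round(2)
toy["NonBoundaryStrikeRate"] = (non_boundary_runs / non_boundary_balls * 100).round(2)
print(toy[["SR", "NonBoundaryStrikeRate"]])
```

The gap between the two strike rates (125.0 overall against roughly 70.6 without boundaries) is the kind of difference the derived metrics are meant to expose.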

## Visualising Batsmen
Comparing the percentage of balls faced that a batsman hit for boundaries with their strike rate on the balls that weren't boundaries can give us a slight indication of how they approach the game.  

For example, Chris Gayle is known for his power hitting, and also his lack of running between the wickets. It therefore makes sense that his `NonBoundaryStrikeRate` is the lowest of any batsman in this dataset.  

In this particular visualisation, you want to be aiming for the top right, as this indicates a lot of boundaries hit, as well as the ability to score off the balls that you don't connect with. Unfortunately this doesn't take the context of the runs into account: hitting late in the innings brings more aggressive batting, but also spread fields that allow for easy singles if you don't fully connect with the ball. 

In [None]:
fig = px.scatter(df, 
                 x='Boundary %', 
                 y='NonBoundaryStrikeRate', 
                 color='Country', 
                 hover_name='Player', 
                 size='Runs')

fig.update_layout(
    height=500,
    title_text='ICC Cricket World Cup 2019 - Boundary Impact'
)
fig.show()

Looking at both `Boundary %` and `Boundary6 %` you can see the prolific 6 hitters and the risk takers: those that score boundaries frequently, and hit big when they do go for it.  

Chris Gayle rises here, as his strong arms put him in the top half of the graph with more 6 hitting than average. Morgan sits at the very top of the graph, and his big hitting over the leg side is well known in international cricket; he was at it again in [South Africa](https://www.espncricinfo.com/series/19286/scorecard/1185315/south-africa-vs-england-3rd-t20i-england-in-sa-2019-20).

In [None]:
fig = px.scatter(df, 
                 x='Boundary %', 
                 y='Boundary6 %', 
                 color='Country', 
                 hover_name='Player', 
                 size='Runs')

fig.update_layout(
    height=500,
    title_text='ICC Cricket World Cup 2019 - 6 Hitting Impact'
)
fig.show()

## Scaling and classifying batsman data
Firstly, I need to decide which of the metrics that we have defined are going to help us sort the batsmen into groups. I don't think the raw totals (such as `Runs`) are appropriate here, as we are trying to describe the style in which a batsman plays, rather than the number of runs that they happened to score in this tournament.  

I have decided that I will work with just 3 groups in this example, which might not be the optimal grouping but I am working with only 50 players. 

In [None]:
grouping_columns = ['SR', 'RunsFromBoundary %', 'Boundary %', 'NonBoundaryStrikeRate', 'Boundary6 %']
df_chosen = df.loc[:,grouping_columns]

- `SR` - What is the general pace of the batsman's scoring? 
- `RunsFromBoundary %` - How reliant is the player on hitting boundaries for their runs? 
- `Boundary %` - How frequently do they find the rope? 
- `NonBoundaryStrikeRate` - Do they have other options, and are they still able to score when not hitting boundaries?
- `Boundary6 %` - Do they take risks to gain maximum runs for a delivery? 

In [None]:
df_scaled = pd.DataFrame(preprocessing.StandardScaler().fit_transform(df_chosen))
df_scaled.columns = grouping_columns
df_scaled.head(3)
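
To show what the scaling step does, here is a minimal sketch on made-up numbers (two invented columns on very different scales, not the batting data). After `StandardScaler`, each column has mean 0 and standard deviation 1, so no single metric dominates the distance calculations used for clustering.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales (think strike rate vs a percentage)
raw = np.array([[150.0, 5.0],
                [100.0, 10.0],
                [50.0, 15.0]])

scaled = StandardScaler().fit_transform(raw)

print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
```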

In [None]:
np.random.seed(1)

# Instantiate a model with 3 centers
kmeans = KMeans(3)

# Then fit the model to your data using the fit method
model = kmeans.fit(df_scaled)

# Finally predict the labels on the same data to show the category that point belongs to
labels = model.predict(df_scaled)

In [None]:
model

In [None]:
labels_group = pd.Series(labels, dtype="category").map({0:'A', 1:'B',2:'C'})
df['Batting Classification'] = labels_group

In [None]:
fig = px.scatter(df, 
                 x='Boundary %', 
                 y='NonBoundaryStrikeRate', 
                 color='Batting Classification', 
                 hover_name='Player', 
                 size='Runs')

fig.update_layout(
    height=500,
    title_text='ICC Cricket World Cup 2019 - Batting Classifications'
)
fig.show()

In [None]:
df_pair = df_chosen.copy()
df_pair['Batting Classification'] = labels_group
sns.pairplot(df_pair, hue='Batting Classification')
plt.show()

The fairly simple batting classification seems to have grouped them sensibly (judging by the visuals) when we use those 5 metrics, but there are some plots where the groups overlap heavily. For example, `Boundary6 %` doesn't seem to separate the batsmen as well as `Boundary %` does. 

**Group A**  
DA Warner, Shakib Al Hasan, JE Root, Babar Azam, BA Stokes, V Kohli, F du Plessis, SPD Smith, Mushfiqur Rahim, UT Khawaja, JC Buttler, van der Dussen, MDKJ Perera, MS Dhoni, Mohammad Hafeez, HH Pandya, Mahmudullah, C de Grandhomme

**Group B**  
RG Sharma, JM Bairstow, AJ Finch, JJ Roy, AT Carey, EJG Morgan, N Pooran, Q de Kock, SO Hetmyer, CH Gayle, Zadran, WIA Fernando, Fakhar Zaman, MJ Guptill, Liton Das, GJ Maxwell, JO Holder, Soumya Sarkar

**Group C**  
KS Williamson, KL Rahul, LRPL Taylor, Imam-ul-Haq, SD Hope, Rahmat Shah, AD Mathews, Tamim Iqbal, JDS Neesham, FDM Karunaratne, HM Amla, Hashmatullah Shahidi, Gulbadin Naib

### Attempting to find a better `K` value
I'm not convinced of the value of `K` being 3. Let's use the Elbow Method and the Silhouette method to try and determine what the optimal value of `K` is.  

In [None]:
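scores = []
clusters = list(range(2, 10))
for i in clusters:
    kmeans = KMeans(i)
    model = kmeans.fit(df_scaled)
    # score() returns the negative k-means objective, so take the absolute value
    scores.append(np.abs(model.score(df_scaled)))

plt.plot(clusters, scores)
plt.xlabel('Number of clusters (K)')
plt.ylabel('Sum of squared distances')
plt.show()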
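
For anyone wondering what the elbow plot is actually measuring, the sketch below recomputes the quantity by hand. It uses synthetic data from `make_blobs` as a stand-in for the scaled batting metrics (an assumption for illustration), and shows that the absolute value of `score` matches the fitted model's `inertia_`, i.e. the sum of squared distances from each point to its assigned centroid.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled batting metrics: 60 points, 5 features
X, _ = make_blobs(n_samples=60, centers=3, n_features=5, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Sum of squared distances from every point to the centroid it was assigned to
manual_inertia = ((X - km.cluster_centers_[km.labels_]) ** 2).sum()

print(manual_inertia, km.inertia_, abs(km.score(X)))
```

The elbow method then looks for the value of `K` after which this number stops dropping quickly.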

In [None]:
sil = []

# Silhouette scores are not defined for a single cluster, so K starts at 2
for k in clusters:
    kmeans = KMeans(n_clusters=k).fit(df_scaled)
    labels = kmeans.labels_
    sil.append(silhouette_score(df_scaled, labels, metric='euclidean'))

plt.plot(clusters, sil)
plt.show()

The elbow method hasn't told us what the optimal level is, and the Silhouette method is just showing us that maybe more clusters would be better. I'm going to try with 8, but given the small sample size that we are clustering here, this may be too granular. 

In [None]:
kmeans = KMeans(8)
model = kmeans.fit(df_scaled)
labels = model.predict(df_scaled)
df_8_clusters = df.copy()
labels_group = pd.Series(labels, dtype="category").map({0:'A',1:'B',2:'C',3:'D',4:'E',5:'F',6:'G',7:'H'})
df_8_clusters['Batting Classification'] = labels_group
df_8_clusters['Batting Labels'] = labels
fig = px.scatter(df_8_clusters, 
                 x='Boundary %', 
                 y='NonBoundaryStrikeRate', 
                 color='Batting Classification', 
                 hover_name='Player', 
                 size='Runs')

fig.update_layout(
    height=500,
    title_text='ICC Cricket World Cup 2019 - Batting Classifications'
)
fig.show()

In [None]:
df_pair = df_chosen.copy()
df_pair['Batting Classification'] = labels_group
sns.pairplot(df_pair, hue='Batting Classification')
plt.show()

I suppose the most notable difference here is that this has completely isolated Maxwell.

- - -

I think that when so many of the metrics overlap like this, we have reached the point where K-means clustering isn't necessarily the ideal method. It works best when the groups are roughly circular (spherical), and that doesn't immediately appear to be the case here. It is probably also worth considering removing some variables that don't seem to add much variation, as every variable is weighted equally thanks to our scaling. Taking a quick look at the variables (pre-scaling, obviously) we can see that some vary far less than others relative to their mean. The overlap above also suggests that we might do better by running a PCA first, to combine correlated variables and get better clusters overall. 

In [None]:
df[grouping_columns].std() / df[grouping_columns].mean()
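
I haven't run a PCA in this kernel, but as a rough sketch of what that step might look like, here is the idea on synthetic data. Everything below is an assumption for illustration: the data is random, with three of the five columns built as noisy copies of one signal to mimic correlated metrics, and keeping 3 components is an arbitrary choice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: three columns are noisy copies of one signal, two are pure noise
signal = rng.normal(size=(50, 1))
X = np.hstack([signal + 0.1 * rng.normal(size=(50, 3)),
               rng.normal(size=(50, 2))])

X_scaled = StandardScaler().fit_transform(X)

# Project onto the 3 directions that capture the most variance
pca = PCA(n_components=3)
components = pca.fit_transform(X_scaled)

print(pca.explained_variance_ratio_.round(2))
```

The correlated columns collapse into a single component that explains most of the variance, and the clustering could then be run on `components` rather than on the original scaled columns.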

- - -
### Outliers within clusters

We can explore this further by taking a look at the individual silhouette scores for each of the players and seeing where the individual values lie, even though the average is good with the higher number of clusters. Using those scores we can take a further look at the players that aren't clearly in one cluster or another. 

In [None]:
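# Per-player silhouette scores for the 8-cluster labels from above
silhouette_scores = silhouette_samples(df_scaled, labels)
# Negative scores flag players who sit closer, on average, to another cluster
df.loc[silhouette_scores < 0]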

Visualising these players might show us a bit more clearly why they could belong to multiple clusters under the current circumstances.

In [None]:
plt.scatter(df_8_clusters.loc[:, 'Boundary %'], 
            df_8_clusters.loc[:, 'NonBoundaryStrikeRate'], 
            c=df_8_clusters.loc[:, 'Batting Labels'])
cluster_outliers = np.where(silhouette_scores < 0)
cluster_outliers_index = df_8_clusters.index[cluster_outliers]
plt.scatter(df_8_clusters.loc[cluster_outliers_index, 'Boundary %'], 
            df_8_clusters.loc[cluster_outliers_index, 'NonBoundaryStrikeRate'], 
            c='black', 
            s=100, alpha=0.25)
plt.show()

While it is difficult to tell from a single graphic why these players stand out, we at least get a quick look at how those batsmen could belong to more than one of the clusters that we have identified. Going a layer deeper, we can look at the distances to the cluster centroids for these players and see whether they are close to several clusters. 

In [None]:
distances_to_clusters = model.transform(df_scaled)
pd.DataFrame(distances_to_clusters).join(
    pd.Series(silhouette_scores, name='SilhouetteScore')).join(
    pd.Series(labels, name='Group')).sort_values('SilhouetteScore').head(6)

There is confusion for these players about which group they should be in. While the distances to the centroids have assigned them to a particular group, the silhouette score indicates that they may be in the wrong one. It might be that they are more like a different, nearby cluster, and that the initial random selection of the centroids played a part. A player may be assigned to cluster A but be nearer, on average, to the points in cluster B, hence a negative silhouette score: an individual can be nearer to the centroid of its own cluster yet nearer, on average, to the members of another cluster. 
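
That last point is easier to see with numbers. The sketch below uses five made-up points on a line with hand-assigned labels (not the output of K-means, and not the batting data) and computes the silhouette score for one point by hand: `a` is the mean distance to the other members of its own cluster, `b` is the mean distance to the members of the nearest other cluster, and the score is `(b - a) / max(a, b)`.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Made-up 1-D points with hand-assigned labels: cluster 0 is spread out, cluster 1 is tight
X = np.array([[-5.0], [1.0], [5.0], [3.0], [3.2]])
labels = np.array([0, 0, 0, 1, 1])

i = 1  # the point at 1.0, labelled as cluster 0
own = X[(labels == 0) & (np.arange(len(X)) != i)]
other = X[labels == 1]

a = np.abs(own - X[i]).mean()    # mean distance to the rest of its own cluster
b = np.abs(other - X[i]).mean()  # mean distance to the members of the other cluster
s = (b - a) / max(a, b)

# Distances to the two centroids, for comparison
dist_own_centroid = abs(X[labels == 0].mean() - X[i, 0])
dist_other_centroid = abs(X[labels == 1].mean() - X[i, 0])

print(s, silhouette_samples(X, labels)[i])
print(dist_own_centroid, dist_other_centroid)
```

The point is closer to its own centroid than to the other one, yet its silhouette score is negative because its cluster-mates are far away while the other cluster's members are all nearby.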

- - - 
### Hierarchical Clustering using `scipy`

Trying hierarchical clustering using `scipy.cluster.hierarchy` can give us different results, ones that aren't defined using centroids and a predetermined number of clusters. This has its benefits, and we can see that we have a 4 cluster result, with Maxwell as an outlier cluster and the main body of the results split into 3 categories; the first cluster is close to being 2 clusters based on the dendrogram.  

In [None]:
mergings = linkage(df_scaled, method='complete')

fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = (9, 9))

dendrogram(mergings,
           labels=np.array(df['Player']),
           leaf_font_size=9,
           orientation='right',
           ax=ax)

plt.show()

Using different methods of clustering here can yield vastly different results. You can read about the different types of hierarchical clustering on the [scipy documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html?highlight=linkage#scipy.cluster.hierarchy.linkage) and this [good article by John Clements](https://towardsdatascience.com/introduction-hierarchical-clustering-d3066c6b560e). I will illustrate the differences below. 

In [None]:
methods = ['complete', 'single', 'average', 'centroid']
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(11, 11))

for i, method in enumerate(methods):
    ax = axes.flatten()[i]
    mergings = linkage(df_scaled, method=method)
    dendrogram(mergings,
           labels=np.array(df['Player']),
           leaf_font_size=7,
           orientation='right',
           ax=ax)
    ax.set_title(method)
    
fig.tight_layout()
plt.show()

On the face of the above, you might say that Maxwell is such an outlier that the algorithms are just creating two clusters because he is so far away from everyone else (Gayle does sneak in on the single method). Looking closer, however, you can see that the order of the names and how they have been joined in the visual is indicative of how the underlying linkage method is working in each case. Each player starts as an individual cluster and then, using the chosen method to measure the distance between clusters, the algorithm repeatedly joins the two nearest clusters together into a single cluster. 
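
To see that merging process in miniature, here is a sketch on four made-up points on a line (not the batting data). `linkage` returns one row per merge, and comparing 'single' with 'complete' shows how the choice of method changes the distance at which the final two clusters are joined.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four made-up points on a line; indices 0-3 are the starting single-point clusters
points = np.array([[0.0], [1.0], [5.0], [7.0]])

Z_single = linkage(points, method="single")
Z_complete = linkage(points, method="complete")

# Each row reads: [cluster_a, cluster_b, merge distance, size of the new cluster]
print(Z_single)
print(Z_complete)
```

Both methods first join the points at 0 and 1 (distance 1) and then the points at 5 and 7 (distance 2). The last merge happens at distance 4 under 'single' (the closest pair, 1 and 5) but at distance 7 under 'complete' (the furthest pair, 0 and 7).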

Let's have a look and see what the results of these dendrograms look like when we remove Maxwell... 

In [None]:
# Remove Maxwell 
indexes = df[df['Player'] != 'GJ Maxwell'].index
df_scaled_without_maxwell = df_scaled.iloc[indexes, :]
names = df[df['Player'] != 'GJ Maxwell'].Player

# Run the visualisation again
methods = ['complete', 'single', 'average', 'centroid']
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(11, 11))

for i, method in enumerate(methods):
    ax = axes.flatten()[i]
    mergings = linkage(df_scaled_without_maxwell, method=method)
    dendrogram(mergings,
           labels=np.array(names),
           leaf_font_size=7,
           orientation='right',
           ax=ax)
    ax.set_title(method)
    
fig.tight_layout()
plt.show()

We get much more varied, and some interesting, results here. 'centroid' and 'average' are very similar, with smaller clusters around the base of the visuals. 'single' is the same but with a shift of the smaller group. 'complete', on the other hand, has remained essentially the same: it had already formed its groups, with Maxwell too far away to realistically be part of any of them. 

We can generate our final clusters here by taking a clustering and picking a height at which to 'cut' the dendrogram. In this instance, the height that you choose specifies the maximum distance at which clusters are still merged; merges that happen above that height are ignored, so the clusters that remain are all further apart than this value under the chosen linkage method. 

In [None]:
# Use fcluster to generate a set of cluster labels using a set of heights
# NB: include Maxwell again for this
mergings = linkage(df_scaled, method='complete')

fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = (6, 6))

dendrogram(mergings,
           labels=np.array(df['Player']),
           leaf_font_size=7,
           orientation='right',
           ax=ax)

plt.show()
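
Before picking heights on the real dendrogram, here is the cutting idea on a toy example. The four points are made up, and the heights are chosen by me to fall between the merge distances (1, 2 and 7 under 'complete' linkage), so each cut gives a different number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Four made-up points on a line; 'complete' linkage merges them at heights 1, 2 and 7
points = np.array([[0.0], [1.0], [5.0], [7.0]])
Z = linkage(points, method="complete")

heights = [0.5, 1.5, 3.0, 8.0]
n_clusters = {}
for height in heights:
    # Only merges at or below this height are kept when forming the flat clusters
    toy_labels = fcluster(Z, t=height, criterion="distance")
    n_clusters[height] = int(toy_labels.max())
    print(height, toy_labels)
```

Raising the cut from 0.5 to 8.0 takes us from four single-point clusters down to one cluster containing everything.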

### Defining a height for generating cluster labels using `fcluster`

You can imagine that choosing a point to 'cut' the dendrogram is important here. The lower the value, the more clusters we are going to have, as you have a lower tolerance for distance within a cluster. If you're happy for the tolerance to be higher then the clusters will be larger and there will be fewer of them. 

Just eye-balling the dendrogram, I'm going to pick 4 values to show the impact that picking those height metrics can have. I have then printed the dendrogram again, but with the added height thresholds for ease. 

In [None]:
# Pick a range of heights 
heights = [2.75, 4.1, 5, 8]

fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = (6, 6))

dendrogram(mergings,
           labels=np.array(df['Player']),
           leaf_font_size=7,
           orientation='right',
           ax=ax)

columns = ['SR', 'RunsFromBoundary %', 'Boundary %', 'NonBoundaryStrikeRate', 'Boundary6 %', 'label']

for height in heights: 
    ax.axvline(height, c='orange')
    labels = fcluster(mergings, height, criterion='distance')
    display(HTML(f'<hr><h3>Height: {height} - {max(labels)} clusters</h3>'))
    df['label'] = labels
    display(HTML(df.loc[:, columns].groupby('label').mean().to_html()))
    
plt.show()

Even with a lot of clusters at a cut-off height of 2.75, we can start to see the separation in the average stats of the players within each group. As we allow more distance within the clusters by raising the height, we can see the clusters start to become more defined towards a 'play style'. It is only at the very end that we see Maxwell join a cluster, creating two more generic clusters: those that score more of their runs from boundaries and hit more boundaries, and those that have a higher strike rate off balls that aren't boundaries, i.e. players who rotate the strike. 

In [None]:
df_pair = df_chosen.copy()
df_pair['Cluster Label'] = fcluster(mergings, 5, criterion='distance')
sns.pairplot(df_pair, hue='Cluster Label')
plt.show()

### TSNE
We can use `TSNE` to project our scaled data from its higher-dimensional space into 2D while trying to keep nearby points close together. I've done this below using some basic code just to illustrate the process and what the end result is. Note that I've hidden the x and y axis labels, as the t-SNE axes have no direct meaning and, in my opinion, only add confusion to the plot.

In [None]:
tsne_model = TSNE(learning_rate=100)
tsne_features = tsne_model.fit_transform(df_scaled)
x = tsne_features[:, 0]
y = tsne_features[:, 1]
labels = fcluster(mergings, 5, criterion="distance")
plt.scatter(x, y, c=labels)
plt.tick_params(axis="x", which="both", bottom=False, top=False, labelbottom=False)
plt.tick_params(axis="y", which="both", right=False, left=False, labelleft=False)
plt.show()
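
As a self-contained version of the same idea, here is a sketch on synthetic data from `make_blobs` (a stand-in for the scaled metrics, not the real data). The `perplexity` and `random_state` values are arbitrary choices of mine; `perplexity` has to be smaller than the number of samples, which matters with a dataset as small as this one.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic 5-dimensional data with three groups, standing in for the scaled metrics
X, groups = make_blobs(n_samples=30, centers=3, n_features=5, random_state=0)

# perplexity must be smaller than the number of samples; 10 is an arbitrary choice
tsne = TSNE(n_components=2, perplexity=10, learning_rate=100, random_state=0)
embedding = tsne.fit_transform(X)

print(embedding.shape)  # (30, 2)
```

Each row of `embedding` is a 2D position for one sample. Different seeds or perplexity values can give quite different layouts, so it is worth trying a few before reading too much into any one plot.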

- - -

I want to continue to develop this clustering and add additional methods, visualisations and try my best to explain the concepts of each method.  

If you've got any **suggestions**, ideas or tips about how I can improve this kernel then please let me know; I would love to hear them.  

If you've enjoyed reading this then please consider **upvoting** this kernel! 