# Introduction
In this notebook, I perform exploratory data analysis on top chess players, using Plotly, Seaborn, and Matplotlib for visualization.

### About the Data: 
The International Chess Federation, [FIDE](https://en.wikipedia.org/wiki/FIDE), acts as the governing body for international chess. They assign [Elo](https://en.wikipedia.org/wiki/Elo_rating_system) ratings for three time controls (of increasing speed): Standard, Rapid, and Blitz. They also award various titles, based on both rating and performance. The dataset contains information on <b>active players whose Standard FIDE rating was at least 2500</b>, as of September 2020. The 2500 minimum was chosen to match the peak [rating requirement](https://en.wikipedia.org/wiki/Grandmaster_%28chess%29#Current_regulations) for the Grandmaster (GM) title. 

## Importing Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import plotly.express as px
import plotly.graph_objs as go

We load our CSV file and take a cursory glance at our dataset:

In [None]:
top = pd.read_csv("../input/top-chess-players/topchesslist.csv")
top.shape

In [None]:
top.info()

## Data Cleaning
For our analysis, it will be easier to work with age than birth year.
Thus, we add an age column (under the assumption everyone has already had their birthday this year):

In [None]:
top['Age'] = 2020 - top['Birth Year']

### Missing Values
Let us look at the percentages of missing values:

In [None]:
(top.isnull().sum()/len(top))*100

We see that a small fraction of titles are missing (two untitled players), along with some Rapid and Blitz ratings.
First, let's address the missing titles: every player is rated at least 2500 FIDE, which comfortably exceeds the 2300 rating requirement for the title of [FIDE Master (FM)](https://en.wikipedia.org/wiki/FIDE_titles#FIDE_Master_%28FM%29).
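
Percentages alone do not show how many rows are affected. As a minimal sketch, the counts and percentages can be tabulated side by side; the toy DataFrame below is made up for illustration and only borrows the notebook's column names:

```python
import numpy as np
import pandas as pd

# Toy data, not the real dataset: column names follow the notebook
toy = pd.DataFrame({
    'Title': ['GM', np.nan, 'IM', 'GM'],
    'Rapid': [2600.0, 2550.0, np.nan, np.nan],
})

# Count and percentage of missing values per column, side by side
missing = pd.DataFrame({
    'count': toy.isnull().sum(),
    'percent': toy.isnull().mean() * 100,
})
print(missing)
```

The same two expressions should work on `top` in place of `toy`.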

To detect missing titles, we define a helper function that returns `True` for NaN, the only value that is not equal to itself:

In [None]:
def isNaN(some_string):
    return some_string != some_string
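
As a quick sanity check, here is a small sketch comparing the self-inequality trick with the built-in pandas check. The toy Series is made up for illustration:

```python
import numpy as np
import pandas as pd

def isNaN(value):
    # NaN is the only value that compares unequal to itself
    return value != value

# Toy titles; the missing entry mimics an untitled player
titles = pd.Series(['GM', np.nan, 'IM'])

by_helper = titles.apply(isNaN)   # element-wise self-inequality check
by_pandas = titles.isna()         # built-in pandas equivalent

print(by_helper.tolist())
print(by_pandas.tolist())
```

Both approaches flag the same entry here. Note that the trick relies on floating-point NaN; pandas' `pd.NA` marker does not compare this way, so `isna()` is the safer general-purpose check.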

Now, we use the helper function to assign the FM title to these players:

In [None]:
# Flag missing titles with the helper, then fill them all in one assignment
top.loc[top['Title'].apply(isNaN),'Title'] = 'FM'

We confirm that there are no missing titles:

In [None]:
top['Title'].isnull().sum()

As for the missing Rapid and Blitz ratings: we note that they comprise roughly 5% of the data. So, we will simply omit them from our analysis.
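
In practice, "omitting" these values needs little extra work, because pandas aggregations skip NaN by default. A minimal sketch on made-up ratings:

```python
import numpy as np
import pandas as pd

# Toy ratings; the NaN mimics a player with no Rapid rating
rapid = pd.Series([2600.0, np.nan, 2500.0])

mean_default = rapid.mean()              # skipna=True by default: NaN is ignored
mean_dropped = rapid.dropna().mean()     # dropping explicitly gives the same result
mean_strict = rapid.mean(skipna=False)   # NaN propagates if it is not skipped

print(mean_default, mean_dropped, mean_strict)
```

The first two means agree, while the third is NaN.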

## Data Analysis and Visualization
Let's take a first look at the data:

In [None]:
top.head()

In [None]:
top.describe()

### Country Distribution
We turn our attention to the country distribution:

In [None]:
country_count = top['Country'].value_counts().head(16)
# Set the font scale before plotting so that it applies to this figure
sns.set(font_scale=0.8)
# Recent seaborn versions require x and y to be passed as keyword arguments
ax1 = sns.barplot(x=country_count.index,y=country_count.values,alpha=0.8)
plt.title('Country Distribution',fontsize=14)
plt.xlabel('Country',fontsize=12)
plt.ylabel('Count',fontsize=12)
ax1.yaxis.set_major_locator(plt.MultipleLocator(10.0))
plt.show()
print(str(round((country_count.sum()/len(top))*100,1)) + 
      "% of the players come from " +str(len(country_count)) + " countries")

We see that Russia has (by far) the most top players, and that roughly two thirds of the players come from just 16 countries. 

This might lead us to ask: which countries have the strongest players? For simplicity, we will only look at these 16 countries. Let's see a plot of average rating by country:

In [None]:
CL = country_count.index.tolist()
# Mean Standard rating per country, kept in the same order as the count plot
top_countries = top[top['Country'].isin(CL)]
AvgByCountry = top_countries.groupby('Country')['Standard'].mean().reindex(CL)
sns.set(font_scale=0.8)
# Recent seaborn versions require x and y to be passed as keyword arguments
ax2 = sns.barplot(x=AvgByCountry.index,y=AvgByCountry.values,alpha=0.8)
# Horizontal line marking the mean Standard rating of the whole dataset
ax2.axhline(top['Standard'].mean(),color='red')
plt.title('Average Standard Rating by Country',fontsize=14)
plt.xlabel('Country',fontsize=12)
plt.ylabel('Average Standard Rating',fontsize=12)
ax2.set_ylim(2500,2700)
plt.show()

We note that Armenia, China, USA, Azerbaijan, and Ukraine make up the top 5 countries by average rating. We also note that only half of these well-represented countries have average ratings above the overall mean (the red line). This might indicate that a country's depth is not strongly correlated with its strength.
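
One way to probe the depth-versus-strength question more directly is to correlate each country's player count with its average rating. The sketch below uses a made-up toy DataFrame (only the column names match the notebook), so its numbers say nothing about the real data:

```python
import pandas as pd

# Toy data, not the real dataset: column names follow the notebook
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B', 'C'],
    'Standard': [2500, 2520, 2540, 2600, 2620, 2650],
})

# One row per country: number of players (depth) and mean rating (strength)
by_country = toy.groupby('Country')['Standard'].agg(['count', 'mean'])

# Pearson correlation between depth and strength across countries
depth_vs_strength = by_country['count'].corr(by_country['mean'])
print(by_country)
print(round(depth_vs_strength, 3))
```

On the real dataset, a coefficient near zero would support the observation above. With only 16 countries, though, any such estimate would be noisy.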

### Titles
Next, let's look at title counts:

In [None]:
title_count = top['Title'].value_counts()
title_count

We see that GMs comprise the vast majority of our dataset. This is unsurprising, because Grandmaster is the [highest title](https://en.wikipedia.org/wiki/Grandmaster_%28chess%29). We also note that women may hold both [women's titles](https://en.wikipedia.org/wiki/FIDE_titles#Women's_titles) and open titles, which explains the dual titles of "GM WGM" and "IM WGM". 

Let's use a pie chart to visualize the title composition of our dataset:

In [None]:
labels = title_count.index.tolist()
values = title_count.values  # reuse the counts computed above
fig3 = go.Figure(data=[go.Pie(labels=labels,values=values)])
fig3.update_layout(title_text='Title Distribution')
fig3.show()

Let's make a barplot of average rating by title category:

In [None]:
# Mean Standard rating per title, in order of first appearance
Titles = top['Title'].unique()
AvgByTitle = top.groupby('Title')['Standard'].mean().reindex(Titles)
# Recent seaborn versions require x and y to be passed as keyword arguments
ax4 = sns.barplot(x=AvgByTitle.index,y=AvgByTitle.values,alpha=0.8)
plt.title('Average Standard Rating by Title',fontsize=14)
plt.xlabel('Title',fontsize=12)
plt.ylabel('Average Standard Rating',fontsize=12)
ax4.set_ylim(2400,2600)
plt.show()

The result is entirely in line with the relative ranking of [open](https://en.wikipedia.org/wiki/FIDE_titles#Open_titles) and women's titles.

### Age Distribution
Now, let's visualize the age distribution using a histogram and a boxplot:

In [None]:
fig5 = go.Figure(px.histogram(top,x='Age',title='Age Histogram'))
fig5.show()

In [None]:
fig6 = go.Figure(px.box(top,y='Age',title='Age Distribution',points='all'))
fig6.show()

From the histogram, we see that the bulk of the players are between 20 and 40. We also note that the age distribution looks slightly skewed right.

From the boxplot, the median age is 33, slightly lower than the mean age of 34.4 reported in the summary statistics. This is consistent with the right skew, and could be explained by the fact that some older players can maintain 2500+ Standard ratings, but very few children can.
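
The link between a long right tail and a mean that exceeds the median can be checked numerically. A minimal sketch on made-up ages (not the real dataset):

```python
import pandas as pd

# Toy ages with a long right tail (a few much older players)
ages = pd.Series([22, 25, 28, 30, 31, 33, 35, 38, 45, 60, 75])

mean_age = ages.mean()
median_age = ages.median()
skewness = ages.skew()   # sample skewness; positive means right-skewed

print(mean_age, median_age, skewness)
```

Running `top['Age'].skew()` on the real data would give a single number to back up the visual impression from the histogram.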

### Rating Distributions
Next, let's put all the rating distributions into one figure:

In [None]:
fig7 = go.Figure()
for col in [top['Standard'],top['Rapid'],top['Blitz']]:
    fig7.add_trace(go.Box(y=col,name=col.name))
fig7.update_layout(title="Standard, Rapid, and Blitz Ratings"
                   ,legend_title_text="Category")
fig7.show()

We note that Rapid and Blitz ratings are more spread out than Standard ratings (their standard deviations in the summary table above are higher).
This could be partially explained by the fact that many players compete infrequently in Rapid and Blitz tournaments.
We also see that Rapid and Blitz have both high and low outliers, while Standard has only high outliers.
This is unsurprising, because every player in the dataset has a Standard rating of at least 2500.
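
To see why a rating floor removes low outliers, here is a sketch of the common 1.5 × IQR rule for flagging box-plot outliers. The helper name `iqr_outliers` and the toy ratings are assumptions for illustration, and plotting libraries may compute quartiles slightly differently:

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k*IQR outside the quartile box."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return values[(values < low) | (values > high)]

# Toy ratings: a floor at 2500 leaves no room for low outliers
standard = pd.Series([2500, 2505, 2510, 2520, 2530, 2540, 2550, 2560, 2863])
print(iqr_outliers(standard).tolist())
```

Here the lower fence (2450) sits below the 2500 floor, so only the single very high rating is flagged.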

### Interactions Between Rating and Age
Next, let's look at ratings versus age with Plotly scatterplots.

In [None]:
fig8 = go.Figure(px.scatter(top,x=top['Age'],y=top['Standard']))
fig8.update_layout(title="Standard Rating versus Age")
fig8.show()
fig9 = go.Figure(px.scatter(top,x=top['Age'],y=top['Rapid']))
fig9.update_layout(title = "Rapid Rating versus Age")
fig9.show()
fig10 = go.Figure(px.scatter(top,x=top['Age'],y=top['Blitz']))
fig10.update_layout(title="Blitz Rating versus Age")
fig10.show()

From these scatterplots alone, we might suspect that ratings drop off after a certain age, perhaps in the 30s.
To explore this claim, let's look at average ratings per age group (while ignoring null Rapid/Blitz ratings).

In [None]:
ax11 = plt.axes()
# Mean Standard rating at each age (groupby returns the ages in sorted order)
std_by_age = top.groupby('Age')['Standard'].mean()
Ages, AvgStd = std_by_age.index.to_numpy(), std_by_age.to_numpy()
ax11.scatter(Ages,AvgStd,s=5)
plt.xlabel('Age',fontsize=12)
plt.ylabel('Average Standard Rating of Age Group',fontsize=12)
plt.title('Average Standard Rating by Age',fontsize=14)
# Least-squares quadratic fit to the per-age averages
std_quad_model = np.poly1d(np.polyfit(Ages,AvgStd,2))
line = np.linspace(top['Age'].min(),top['Age'].max())
ax11.plot(line,std_quad_model(line),color='Red')
plt.show()

ax12 = plt.axes()
# Drop players without a Rapid rating before averaging by age
rpd_by_age = top.dropna(subset=['Rapid']).groupby('Age')['Rapid'].mean()
RpdAges, AvgRpd = rpd_by_age.index.to_numpy(), rpd_by_age.to_numpy()
ax12.scatter(RpdAges,AvgRpd,s=5)
plt.xlabel('Age',fontsize=12)
plt.ylabel('Average Rapid Rating of Age Group',fontsize=12)
plt.title('Average Rapid Rating by Age',fontsize=14)
rpd_quad_model = np.poly1d(np.polyfit(RpdAges,AvgRpd,2))
ax12.plot(line,rpd_quad_model(line),color = 'Green')
plt.show()

ax13 = plt.axes()
# Drop players without a Blitz rating before averaging by age
blz_by_age = top.dropna(subset=['Blitz']).groupby('Age')['Blitz'].mean()
BlzAges, AvgBlz = blz_by_age.index.to_numpy(), blz_by_age.to_numpy()
ax13.scatter(BlzAges,AvgBlz,s=5)
plt.xlabel('Age',fontsize=12)
plt.ylabel('Average Blitz Rating of Age Group',fontsize=12)
plt.title('Average Blitz Rating by Age',fontsize=14)
blz_quad_model = np.poly1d(np.polyfit(BlzAges,AvgBlz,2))
ax13.plot(line,blz_quad_model(line),color = 'Orange')
plt.show()
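
The age at which each fitted parabola peaks can also be computed from the fit coefficients rather than read off the plots. The sketch below uses synthetic per-age averages generated from a known parabola, so the recovered peak can be checked; the numbers are not from the dataset:

```python
import numpy as np

# Toy per-age averages generated from a known parabola peaking at age 32
ages = np.arange(20, 61)
avg_rating = 2600 - 0.1 * (ages - 32) ** 2

# np.polyfit returns coefficients from the highest degree down: a*x^2 + b*x + c
a, b, c = np.polyfit(ages, avg_rating, 2)

# For a < 0 the parabola opens downward and peaks at x = -b / (2a)
peak_age = -b / (2 * a)
print(round(peak_age, 2))
```

The vertex is only a meaningful "peak" when `a` is negative and the vertex falls inside the observed age range. The fit also gives every age equal weight, no matter how few players share it.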

Surprisingly, it seems that while average Standard ratings peak in the early 30s, average Rapid and Blitz ratings peak after 40. However, there are several potential biases which cast doubt on this conclusion.

Firstly, younger players might not have played enough Rapid/Blitz games for their ratings to catch up to their true level. We will touch on this point later. 

Secondly, older players might be active in Standard, but inactive in Rapid/Blitz, so that their ratings in these categories don't accurately reflect their current abilities. 

Thirdly, older players might make up a disproportionate share of those with missing Rapid/Blitz ratings (and their true ratings might be below average). This possibility would skew the per-age rating averages for older players, especially considering how few older players there are in the dataset. One might ask if there is evidence to support this third point. To that end, we observe the following statistics:

In [None]:
NullRpdAges = top.loc[top['Rapid'].isnull(),'Age']
print("The age statistics for players missing Rapid ratings are: ")
NullRpdAges.describe()

In [None]:
NullBlzAges = top.loc[top['Blitz'].isnull(),'Age']
print("The age statistics for players missing Blitz ratings are: ")
NullBlzAges.describe()

In both cases, the mean and median ages are noticeably higher than those of the whole dataset. Thus, players missing Rapid/Blitz ratings do tend to be older. However, the Standard ratings of these players are only slightly lower than those of the whole dataset: 

In [None]:
NullRpdRtgs = top.loc[top['Rapid'].isnull(),'Standard']
print("The Standard rating statistics for players missing Rapid ratings are: ")
NullRpdRtgs.describe()

In [None]:
NullBlzRtgs = top.loc[top['Blitz'].isnull(),'Standard']
print("The Standard rating statistics for players missing Blitz ratings are: ")
NullBlzRtgs.describe()

Therefore, it is hard to predict whether these players would have lower Rapid/Blitz ratings than the rest of the dataset.
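
The comparisons above can also be laid out in a single table by grouping on whether a rating is missing. A minimal sketch on a made-up DataFrame that borrows the notebook's column names:

```python
import numpy as np
import pandas as pd

# Toy data, not the real dataset: column names follow the notebook
toy = pd.DataFrame({
    'Age': [25, 30, 35, 50, 60],
    'Standard': [2600, 2550, 2580, 2520, 2540],
    'Rapid': [2610.0, 2500.0, 2590.0, np.nan, np.nan],
})

# Group by whether the Rapid rating is missing, then summarize both groups
summary = (toy.groupby(toy['Rapid'].isnull())[['Age', 'Standard']]
              .agg(['mean', 'median']))
summary.index = ['has Rapid', 'missing Rapid']   # group keys sort as False, True
print(summary)
```

This puts the age and Standard rating statistics for both groups next to each other, which makes the size of each gap easier to judge.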

We now look at correlations among Standard, Rapid, and Blitz Ratings, as well as age, with a heatmap:

In [None]:
corr_columns = top.drop(['Name','Title','Country','Birth Year'],axis=1)
ax14 = plt.axes()
mask = np.triu(corr_columns.corr(),1)
ax14 = sns.heatmap(corr_columns.corr(),mask=mask,annot=True,cmap='Blues')
plt.title('Rating and Age Correlations')
plt.show()

Unsurprisingly, all rating categories are strongly correlated.
We might wish to visualize the players' ratings as a 3D scatterplot, with color representing age: 

In [None]:
# Refer to columns by name so Plotly takes axis and hover labels from the DataFrame
fig15 = px.scatter_3d(top,x='Blitz',y='Rapid',z='Standard'
                    ,color = 'Age'
                    ,color_continuous_scale=["lightblue","darkblue"]
                    ,hover_data=['Name','Blitz','Rapid','Standard'])
fig15.update_layout(title="Players' Ratings",title_x = 0.45,title_y=0.95,margin=dict(l=0, r=0, b=0, t=0))
fig15.show()

At first glance, we see that Caruana stands out for his high Standard rating, Nakamura for his high Blitz and Rapid ratings, and Carlsen for all three ratings.  

We also observe that young players such as Kelires, Gukesh, and Praggnanandhaa stand out for the disparities between their Standard and Rapid/Blitz ratings. These disparities most likely arise because the players have played very few Rapid/Blitz games, so that their ratings have not caught up with their true abilities. Such misleading ratings highlight an important source of bias in the data, but one that is difficult to address without knowing how many games each player has played. One potential solution would be to ignore the rating of any player who has played fewer than 30 games in Rapid or Blitz. Note that the choice of 30 is [not arbitrary](https://en.wikipedia.org/wiki/Elo_rating_system#Most_accurate_K-factor): FIDE applies a higher K-factor to new players until they have completed 30 games.
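
If games-played counts were available, the proposed filter could be sketched as follows. The `Rapid Games` column is hypothetical (the dataset does not include it), and the player names and numbers are made up:

```python
import numpy as np
import pandas as pd

MIN_GAMES = 30  # threshold suggested in the text

# Toy data: 'Rapid Games' is a hypothetical column, absent from the dataset
toy = pd.DataFrame({
    'Name': ['Player A', 'Player B', 'Player C'],
    'Rapid': [2750.0, 2300.0, 2600.0],
    'Rapid Games': [120, 8, 45],
})

# Mask (rather than drop) ratings backed by too few games, so that the
# player's other columns stay available for the rest of the analysis
filtered = toy.copy()
filtered.loc[filtered['Rapid Games'] < MIN_GAMES, 'Rapid'] = np.nan
print(filtered)
```

Masking with NaN keeps this consistent with how missing Rapid/Blitz ratings are handled elsewhere in the notebook, since later averages would skip the masked values.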

## Closing Remarks


### Credit Where Credit is Due
First, I must credit posts by [Arindam Baruah](https://www.kaggle.com/arindambaruah/who-is-dominating-women-s-chess) and [Subhanjan Das](https://www.kaggle.com/subhanjandas/chess-queens-eda-using-plotly) involving the [Top Women Chess Players](https://www.kaggle.com/vikasojha98/top-women-chess-players/notebooks) dataset. Their approaches with various Plotly tools were invaluable to me.

### Room for Improvement
This analysis could be improved in many ways, including by:
- Further exploring the relationship between country and rating(s)
- Updating the dataset to include rating history, games played per rating category, gender, and more

### <b> I hope you found this notebook interesting or informative! </b>