<h1><center>Women Chess Players Analysis</center></h1>
<center><img src=https://images.chesscomfiles.com/uploads/v1/news/570892.1736de5a.668x375o.0c825a5ff3e9@2x.jpeg></center>

In this kernel we analyze women chess players in three parts. In the first part we get familiar with the dataset (e.g. how many instances and columns it has, what the column names are, ...). In the second part we analyze the features of the dataset from different angles (e.g. best chess players, top federations, ...). In the last part we visualize the dataset with multiple graphs.

<a id="top"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:black; border:0' role="tab" aria-controls="home"><center>Table Of Contents</center></h3>

* [1. Dataset Overview](#1)
* [2. Feature Engineering And Analysis](#2)
* [3. Data Visualization](#3)

<a id='1'></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:black; border:0' role="tab" aria-controls="home"><center>Dataset Overview</center></h3>

**First of all we have to import all essential libraries**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import plotly.express as px
import plotly.graph_objs as go
import cufflinks as cf
from plotly.offline import download_plotlyjs , init_notebook_mode
init_notebook_mode(connected = True)
cf.go_offline()

%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

import os
print(os.listdir("../input"))

**Importing CSV file**

In [None]:
df = pd.read_csv('../input/top-women-chess-players/top_women_chess_players_aug_2020.csv')
df.head()

In [None]:
print('This dataset has {} instances.'.format(df.shape[0]))
print('This dataset has {} columns.'.format(df.shape[1]))

**Taking 5 samples of dataset**

In [None]:
df.sample(5)

**Let's take a look at the columns of the dataset**

In [None]:
for i , column in enumerate(df.columns):
    print('{}. Column is {}'.format(i + 1 , column))

**General information of dataset**

In [None]:
df.info()

In [None]:
df.describe()

**Number of unique values for each categorical column**

In [None]:
df.select_dtypes('object').nunique()

**Let's take a look at the percentage of NA values in each column**

In [None]:
# Fraction of missing values per column, converted to a percentage
percentage_of_missing_data = df.isnull().mean() * 100
percentage_of_missing_data

* As we can see, **more than 50%** of the values are missing in some columns, such as **Title** , **Rapid_rating** and **Blitz_rating**

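**Let's replace NA values in the Title column with 'No Award'**

In [None]:
# Assign the result back instead of using inplace on a column selection,
# which may not modify df in recent pandas versions
df['Title'] = df['Title'].fillna('No Award')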

**Let's take a look at chess titles**

In [None]:
df['Title'].unique()

* GM : **Grandmaster**
* IM : **International Master**
* WGM : **Woman Grandmaster**
* FM : **FIDE Master**
* WFM : **Woman FIDE Master**
* WIM : **Woman International Master**
* CM : **Candidate Master**
* WCM : **Woman Candidate Master**

<a id='2'></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:black; border:0' role="tab" aria-controls="home"><center>Feature Engineering And Analysis</center></h3>

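**Let's create an 'Age' column and add it to our dataset**

In [None]:
# Age as of 2020; rows with a missing Year_of_birth stay NaN so the column remains numeric
# (filling with the string 'N/A' would turn it into an object column and distort the histogram)
df['Age'] = 2020 - df['Year_of_birth']
df.head()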

**Then let's look at the distribution of players' ages in the dataset**

In [None]:
fig = px.histogram(df,
                   x = 'Age',
                   title = 'Players Age distribution',
                   nbins = 20,
                   marginal = 'rug',
                   color_discrete_sequence = ['indianred'])

fig.update_layout(hoverlabel = dict(
            bgcolor = "white",
            font_size = 15,
            font_family = 'Times New Roman'
    )
)

fig.update_traces(
                  marker_line_color = 'rgb(0 , 0 , 0)',
                  marker_line_width = 2,
                 )

fig.show()

* As we can see, players aged **25-35** are the most common (around **3000** players)
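
One way to check an age-band claim like the one above is to bin the ages explicitly with `pd.cut` and count each band. The sketch below is only an illustration: the birth years are made up (they are not taken from the dataset), and the 10-year band edges are an assumption rather than the bins the histogram uses.

```python
import numpy as np
import pandas as pd

# Hypothetical birth years; NaN stands for a missing value.
toy = pd.DataFrame({'Year_of_birth': [1990, 1988, 1995, 1970, 2005, np.nan, 1992]})
toy['Age'] = 2020 - toy['Year_of_birth']

# Right-closed 10-year bands: (5, 15], (15, 25], (25, 35], ...
bands = pd.cut(toy['Age'], bins=[5, 15, 25, 35, 45, 55])
band_counts = bands.value_counts().sort_index()
print(band_counts)

# The band that holds the most players (missing ages are ignored).
most_common_band = band_counts.idxmax()
print('Most common band:', most_common_band)
```

Running the same binning on `df['Age']` would give exact counts per band instead of values read off the chart.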

**Top 10 Players with best standard rating**

In [None]:
best_standard_ratings = df.sort_values(by = 'Standard_Rating' , ascending = False)
best_standard_ratings.head(10)

**Top 10 Players with best rapid rating**

In [None]:
best_rapid_ratings = df.sort_values(by = 'Rapid_rating' , ascending = False)
best_rapid_ratings.head(10)

**Top 10 Players with best blitz rating**

In [None]:
best_blitz_ratings = df.sort_values(by = 'Blitz_rating' , ascending = False)
best_blitz_ratings.head(10)

* As we can see from the tables above, **Judit Polgar** from **Hungary** has the highest rating in all three categories

**Top 10 federations with best players in ratings**

**First of all let's create a dataset with two columns:**
1. Federation
2. Number of players from each federation

First, we take the value counts of the **Federation** column (each row represents one player), which gives us a Series. To convert the Series to a DataFrame with Federation as a column instead of the index, we use the reset_index method. reset_index returns a DataFrame whose column names are not suitable for our purposes, so we rename them manually and then print the DataFrame.

In [None]:
federation_player_count = df['Federation'].value_counts().reset_index()
federation_player_count.columns = ['Federation' , 'Count']
federation_player_count.head(10)
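
As an alternative to renaming the columns afterwards, the names can be set while the Series is converted. This is a minimal sketch on made-up federation codes, not the notebook's data; `rename_axis` names the index and `reset_index(name=...)` names the values column, so the result should not depend on the default names a particular pandas version produces.

```python
import pandas as pd

# Hypothetical federation codes, one row per player.
toy = pd.DataFrame({'Federation': ['RUS', 'GER', 'RUS', 'CHN', 'RUS', 'GER']})

player_count = (
    toy['Federation']
    .value_counts()               # Series: federation -> number of rows
    .rename_axis('Federation')    # name the index so it becomes a named column
    .reset_index(name='Count')    # name the values column explicitly
)
print(player_count)
```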

**Then we define another dataset that holds the median ratings of each federation**

In [None]:
federation_median_ratings = df[['Federation' , 'Standard_Rating' ,  'Rapid_rating' , 'Blitz_rating']]\
.groupby(['Federation']).agg('median')
federation_median_ratings = federation_median_ratings.sort_values(by = 'Standard_Rating' , ascending = False)
federation_median_ratings.head()

**Then we merge these two datasets, sort the result and return the first 10 rows**

In [None]:
federation_info = federation_player_count.merge(federation_median_ratings , left_on = 'Federation' , right_on = 'Federation')
federation_info = federation_info.sort_values(by = ['Standard_Rating' , 'Rapid_rating' , 'Blitz_rating'] , ascending = False)
federation_info.head(10)

* As we can see, the **Qatar** federation has the best ratings overall, but it has only 1 player in the dataset, so we can't **generalize** that this federation has the best players; this row is better interpreted as an **outlier**. The same applies to other federations with very few players, such as BER (Bermuda) and PAR (Paraguay).

* Countries like **China**, with **198 players** and a median standard rating of **2112.5**, have enough players for us to be **confident** in the result. Other federations whose median ratings generalize well include GEO (Georgia), AZE (Azerbaijan) and KAZ (Kazakhstan).

**To avoid outliers and exceptions we print the top 10 countries with at least 10 players**

In [None]:
federation_info[federation_info['Count'] >= 10].head(10)
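
The count and the median can also be computed in a single groupby with named aggregation, which keeps the group size next to the statistic it qualifies. The sketch below uses made-up ratings and a hypothetical threshold of 2 players purely for illustration; `summary`, `reliable` and `MIN_PLAYERS` are names introduced here, not part of the notebook.

```python
import pandas as pd

# Hypothetical ratings; 'QAT' has a single player, like the outlier discussed above.
toy = pd.DataFrame({
    'Federation': ['QAT', 'CHN', 'CHN', 'CHN', 'GEO', 'GEO'],
    'Standard_Rating': [2400, 2100, 2150, 2050, 2200, 2040],
})

# One row per federation: number of rated players and their median rating.
summary = (
    toy.groupby('Federation')
       .agg(Count=('Standard_Rating', 'count'),
            Median_Standard=('Standard_Rating', 'median'))
       .reset_index()
)

MIN_PLAYERS = 2  # illustrative threshold only; the notebook uses 10
reliable = (summary[summary['Count'] >= MIN_PLAYERS]
            .sort_values('Median_Standard', ascending=False))
print(reliable)
```

With the threshold applied, the single-player federation drops out even though its median is the highest.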

**Let's plot a pair plot of the ratings for a brief overview**

<a id='3'></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:black; border:0' role="tab" aria-controls="home"><center>Data Visualization</center></h3>

In [None]:
print('Mean Standard Score : {:.2f}'.format(df['Standard_Rating'].mean()))
print('Mean Rapid Score : {:.2f}'.format(df['Rapid_rating'].mean()))
print('Mean Blitz Score : {:.2f}'.format(df['Blitz_rating'].mean()))

In [None]:
g = sns.PairGrid(df[['Standard_Rating' , 'Rapid_rating' , 'Blitz_rating']])
g.map_upper(plt.scatter)
g.map_diag(plt.hist)
g.map_lower(plt.scatter)
plt.show()

* As we can see from the diagonal, more than **2600** players have a standard rating **below 2000**, which is **below the mean**, and fewer than **1800** players are rated **above 2500**. From the rapid and blitz rating histograms we can see that most players are rated around **2000**. The scatter plots do not let us judge the correlations at first glance, so in the next step we analyze them with a heatmap, which is much easier to interpret.
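
Counts like these are easier to verify numerically than to read off a histogram. The following is a small sketch on made-up ratings (not the dataset) that shows the pattern: a boolean comparison followed by `sum()` counts rows, and `mean()` gives the share.

```python
import pandas as pd

# Hypothetical standard ratings for six players.
toy = pd.DataFrame({'Standard_Rating': [1850, 1920, 2010, 2550, 1990, 2300]})

mean_rating = toy['Standard_Rating'].mean()
below_2000 = (toy['Standard_Rating'] < 2000).sum()   # number of players under 2000
above_2500 = (toy['Standard_Rating'] > 2500).sum()   # number of players over 2500
share_below_mean = (toy['Standard_Rating'] < mean_rating).mean()

print('Mean rating: {:.2f}'.format(mean_rating))
print('Players below 2000:', below_2000)
print('Players above 2500:', above_2500)
print('Share below the mean: {:.1%}'.format(share_below_mean))
```

Replacing `toy` with `df` would give the exact figures for the real data.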

**First lets get correlations of rating columns**

In [None]:
df_corr = df[['Standard_Rating' , 'Rapid_rating' , 'Blitz_rating']].corr()
df_corr

**And plot the heatmap of it**

In [None]:
f, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(df_corr, annot=True, linewidths = 5 , cmap="YlGnBu" , ax=ax)
plt.show()

* As we can see from the heatmap above, **Rapid_rating** and **Blitz_rating** have the highest **correlation**, so their **relationship is very strong**. This means that players with a high blitz rating tend to have a high rapid rating as well, and vice versa. The same reasoning applies to the other pairs of columns, but note that their correlations are **lower than** the Rapid_rating - Blitz_rating correlation.
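
Because many rapid and blitz ratings are missing, it can help to know how many players each coefficient is based on: `DataFrame.corr` drops missing values pair by pair, so different cells of the matrix may rest on different numbers of rows. The sketch below uses made-up ratings; `present` and `pair_counts` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

cols = ['Standard_Rating', 'Rapid_rating', 'Blitz_rating']

# Hypothetical ratings with gaps in the rapid and blitz columns.
toy = pd.DataFrame({
    'Standard_Rating': [2000, 2100, 2200, 2300, 2400],
    'Rapid_rating':    [1950, np.nan, 2180, 2310, np.nan],
    'Blitz_rating':    [1900, np.nan, 2150, np.nan, 2420],
})

# Pearson correlation; rows with a missing value are excluded per pair of columns.
corr = toy[cols].corr()

# Number of rows where both columns of a pair are present.
present = toy[cols].notna().astype(int)
pair_counts = present.T @ present

print(corr)
print(pair_counts)
```

In this toy table the rapid and blitz columns overlap on only two rows, so their correlation is trivially perfect; a coefficient should be read together with its pair count.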

**First of all let's group our data by the Title column and take the median ratings of each group**

In [None]:
median_ratings = df[['Title' , 'Standard_Rating' , 'Rapid_rating' , 'Blitz_rating']].groupby(['Title']).agg('median')
median_ratings

**Then take the title distribution among players**

In [None]:
title_distribution = df['Title'].value_counts()
title_distribution = title_distribution.reset_index()
title_distribution.columns = ['Title' , 'Count']
title_distribution

**Finally, merge the two datasets above on the 'Title' column**

In [None]:
title_df = title_distribution.merge(median_ratings , left_on = 'Title' , right_on = 'Title')
title_df

**Let's plot a pie chart of the title distribution among players**

In [None]:
fig = px.pie(title_df ,
             names = 'Title' ,
             values = 'Count' ,
             title = 'Titles Distribution Among Players' ,
             hover_data = ["Standard_Rating" , "Rapid_rating" , "Blitz_rating"],
             height = 600 , width = 850
            )

fig.update_traces(textposition = 'inside', textinfo = 'percent+label',
                 textfont_size=20 , marker=dict(line=dict(color='#000000', width=1)))

fig.show()

* As we can see from the chart above, more than **60% of players** hold no title

**Then let's plot a bar chart of the number of players from each federation**

In [None]:
federation_info = federation_info.sort_values(by = ['Count'] , ascending = False)
fig = px.bar(federation_info[:30] ,
             x = 'Federation' ,
             y = 'Count' ,
             title = 'Top 30 Federations With Most Players' ,
             hover_data = ["Standard_Rating" , "Rapid_rating" , "Blitz_rating"],
             hover_name = 'Federation',
             height = 600 , width = 850
            )


fig.show()

* As we can see, the largest number of players comes from **Russia**, well ahead of second-placed **Germany**

**Let's analyze players' standard ratings by title**

In [None]:
fig = px.violin(df[df['Title'] != 'WH'],
                x = 'Title',
                y = 'Standard_Rating',
                title = 'Standard Rating Distribution By Each Title',
                box = True,
                points = False,
                labels = {'Title' : 'Titles' , 'Standard_Rating' : 'Standard Rating'})  # keys must be column names
# fig.update_traces(meanline_visible=True) 
fig.show()

In [None]:
fig = px.scatter(df[df['Title'] != 'WH'],
                 x = 'Title',
                 y = 'Rapid_rating',
                 title = 'Rapid Rating Distribution By Each Title',
                 color_discrete_sequence=["green"]
                 )

fig.update_traces(marker = dict(size = 10,
                                line = dict(width=1)),
                                selector = dict(mode = 'markers')
                 )

fig.show()

In [None]:
fig = px.scatter(df[df['Title'] != 'WH'],
                 x = 'Title',
                 y = 'Blitz_rating',
                 title = 'Blitz Rating Distribution By Each Title',
                 color_discrete_sequence=["goldenrod"]
                 )

fig.update_traces(marker = dict(size = 10,
                                line = dict(width=1,
                                color = 'red')),
                 )


fig.show()

* As we can see, in general players with the **Grandmaster** title have higher ratings than any other group of players

**Let's compare the median ratings of each title**

In [None]:
title_df = title_df.sort_values(by = ['Standard_Rating'] , ascending = False)
title_df

**Let's plot a bar chart of all three median ratings for each title**

In [None]:
fig = px.bar(title_df ,
             x = 'Title' ,
             y = ['Standard_Rating' , 'Rapid_rating' , 'Blitz_rating'] ,
             title = 'Median Ratings of Each Title' , 
             opacity = .8
            )

fig.update_traces(
                  marker_line_color = 'rgb(0 , 0 , 0)',
                  marker_line_width = 2,
                 )

fig.update_layout(legend_title_text='Ratings' , barmode='group' , bargroupgap=0.1)
fig.show()

* The bar chart above shows that, in general, players with the **Candidate Master** or **Woman Candidate Master** title have approximately the same median ratings as players with no title

**Analyzing ratings density**

In [None]:
fig = go.Figure()

fig.add_trace(go.Histogram2dContour(
        x = df['Standard_Rating'],
        y = df['Rapid_rating'],
        colorscale = 'Blues',
        colorbar = dict(title = 'Count'),
        hovertemplate = 'Standard_Rating: %{x} <br>Rapid_rating: %{y}',
        reversescale = True,
        xaxis = 'x',
        yaxis = 'y',
    ))

fig.add_trace(
    go.Scatter(
        x = df['Standard_Rating'],
        y = df['Rapid_rating'],
        mode = 'markers',
        marker = dict(color = 'Red'),
        opacity = .4,
        xaxis = 'x',
        yaxis = 'y',
        hovertemplate = 'Standard_Rating: %{x} <br>Rapid_rating: %{y}',
        name = 'Individual Player',
        showlegend = True
    ))

fig.update_layout(hoverlabel = dict(
            bgcolor = "white",
            font_size = 15,
            font_family = 'Times New Roman'
    )
)

fig.update_layout(
    xaxis_title_text='Standard Rating',
    yaxis_title_text='Rapid Rating',
    title = 'Ratings Density'
)

fig.update_layout(legend = dict(
    orientation = 'h',
    yanchor = 'bottom',
    y = 1.02,
    xanchor = 'right',
    x = 1,
))

fig.show()

* As we can see above, the highest density lies where Standard Rating is roughly **1800 - 1900** and Rapid Rating is roughly **1800 - 2000**. This is because many players (the red data points) have ratings within these two ranges.
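
The densest region can also be located numerically with a plain 2D histogram. This is a rough sketch on made-up ratings, assuming 100-point bins; Plotly's contour plot chooses its own binning and smooths the result, so the cell found this way is only an approximation of what the chart shows.

```python
import numpy as np
import pandas as pd

# Hypothetical rating pairs; one player has no rapid rating.
toy = pd.DataFrame({
    'Standard_Rating': [1820, 1850, 1880, 1890, 2100, 2400, 1860],
    'Rapid_rating':    [1810, 1900, 1950, 1990, 2150, np.nan, 1840],
})

# Keep only players that have both ratings.
pairs = toy.dropna(subset=['Standard_Rating', 'Rapid_rating'])

# 100-point bins from 1800 to 2600 on both axes (an assumed bin width).
edges = np.arange(1800, 2601, 100)
counts, x_edges, y_edges = np.histogram2d(
    pairs['Standard_Rating'], pairs['Rapid_rating'], bins=[edges, edges])

# Index of the fullest cell, converted back to rating ranges.
ix, iy = np.unravel_index(counts.argmax(), counts.shape)
densest = ((x_edges[ix], x_edges[ix + 1]), (y_edges[iy], y_edges[iy + 1]))
print('Densest cell (standard range, rapid range):', densest)
print('Players in that cell:', int(counts.max()))
```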

**Let's plot the correlation between the 'Standard_Rating' and 'Rapid_rating' columns with a scatter plot**

In [None]:
fig = px.scatter(df,
                 x = 'Standard_Rating', 
                 y = 'Rapid_rating',
                 title = 'Ratings Correlation',
                 color = 'Title',
                 symbol = 'Title',
                 hover_name = 'Title',
                 labels = {'Rapid_rating' : 'Rapid Rating' , 'Standard_Rating' : 'Standard Rating'},
                )
fig.update_traces(marker = dict(size=8,
                                line=dict(width=1,
                                color='DarkSlateGrey')),
                                selector=dict(mode='markers'))

fig.update_layout(hoverlabel = dict(
            font_size = 15,
            font_family = 'Times New Roman'),
            legend=dict(
            bordercolor = "Black",
            borderwidth = 1
            ))

fig.show()

* As we can see above, players with the **Grandmaster** title have the highest Rapid and Standard ratings among players (we have seen this in the previous violin plot), which is to be expected, because **Grandmaster is the highest title that a chess player can earn**. After them, players with the **International Master** title have the best ratings. Finally, players without any title have the weakest ratings, which is also to be expected.

**Finally, let's plot standard versus rapid ratings in a separate scatter plot for each title**

In [None]:
fig = px.scatter(df[df['Title'] != 'WH'],
                 x = 'Standard_Rating', 
                 y = 'Rapid_rating',
                 title = 'Ratings Distribution For Each Title',
                 color = 'Title',
                 symbol = 'Title',
                 hover_name = 'Title',
                 facet_col = "Title",
                 facet_col_wrap = 4, 
                 labels = {'Rapid_rating' : 'Rapid Rating' , 'Standard_Rating' : 'Standard Rating'},
                 height=600, width=1000,
                 )

fig.update_layout(hoverlabel = dict(
            font_size = 15,
            font_family = 'Times New Roman'),
            legend=dict(
            bordercolor = "Black",
            borderwidth = 1
            ))

fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1])) # Remove the 'Title=' prefix from every facet title
fig.update_yaxes(showticklabels=True)

fig.show()