# Women Chess Players 2020 [EDA] 

## Introduction 

### About the Game -
Chess is a two-player strategy board game played on a checkered board with 64 squares arranged in an 8×8 square grid. 

### Governing Body - 
The International Chess Federation (FIDE) governs international chess competition. FIDE uses the Elo rating system to calculate the relative skill levels of players.
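
As a rough illustration of how an Elo-style rating relates to expected results, the sketch below uses the standard logistic Elo formula. It is a simplified model, not FIDE's exact procedure; the function names and the `k` value of 20 are assumptions made for this example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B (logistic Elo model)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def updated_rating(rating: float, expected: float, actual: float, k: float = 20.0) -> float:
    """Rating after one game; k = 20 is an assumed development coefficient."""
    return rating + k * (actual - expected)


# Standard ratings taken from the dataset preview: Hou Yifan (2658), Koneru Humpy (2586).
e_hou = expected_score(2658, 2586)
e_koneru = expected_score(2586, 2658)

# Hypothetical game: the lower-rated player wins (actual score 1.0), so her rating rises.
new_koneru = updated_rating(2586, e_koneru, actual=1.0)
print(round(e_hou, 3), round(e_koneru, 3), round(new_koneru, 1))
```

The two expected scores always sum to 1, and a win against a higher-rated opponent earns more points than a win against a lower-rated one.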

### Dataset Details - 
The dataset contains details of the top women chess players in the world with a Standard FIDE rating above 1800 Elo, sorted from highest to lowest rating, as updated in August 2020. The data includes both active and inactive players, which can be distinguished using the Inactive_flag column.

Note: All ratings are updated as published by FIDE in August 2020.

## 1. Importing Libraries and Reading the dataset

In [124]:
import warnings

warnings.filterwarnings('ignore')

In [125]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
from plotly.offline import init_notebook_mode, iplot
import plotly.figure_factory as ff
import cufflinks
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl')
import plotly.graph_objs as go
import plotly
from plotly import tools
import plotly.express as px
from scipy.stats import boxcox
init_notebook_mode(connected=True)
pd.set_option('display.max_columns', 100)

In [126]:
#Reading the dataset 
chess = pd.read_csv('datasets_816231_1397485_top_women_chess_players_aug_2020.csv')
chess

Unnamed: 0,Fide id,Name,Federation,Gender,Year_of_birth,Title,Standard_Rating,Rapid_rating,Blitz_rating,Inactive_flag
0,700070,"Polgar, Judit",HUN,F,1976.0,GM,2675,2646.0,2736.0,wi
1,8602980,"Hou, Yifan",CHN,F,1994.0,GM,2658,2621.0,2601.0,
2,5008123,"Koneru, Humpy",IND,F,1987.0,GM,2586,2483.0,2483.0,
3,4147103,"Goryachkina, Aleksandra",RUS,F,1998.0,GM,2582,2502.0,2441.0,
4,700088,"Polgar, Susan",HUN,F,1969.0,GM,2577,,,wi
...,...,...,...,...,...,...,...,...,...,...
8548,3302288,"Reinkens, Natalia",BOL,F,,,1801,,,wi
8549,343960,"Saffova, Michaela",CZE,F,1994.0,,1801,1791.0,1765.0,
8550,5038294,"Shetye, Siddhali",IND,F,1992.0,,1801,1884.0,1824.0,wi
8551,2072491,"Trakru, Priya",USA,F,2001.0,WFM,1801,,,wi


## 2. Inspecting the data

In [127]:
chess.shape

(8553, 10)

In [128]:
chess.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8553 entries, 0 to 8552
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Fide id          8553 non-null   int64  
 1   Name             8553 non-null   object 
 2   Federation       8553 non-null   object 
 3   Gender           8553 non-null   object 
 4   Year_of_birth    8261 non-null   float64
 5   Title            3118 non-null   object 
 6   Standard_Rating  8553 non-null   int64  
 7   Rapid_rating     3608 non-null   float64
 8   Blitz_rating     3472 non-null   float64
 9   Inactive_flag    5852 non-null   object 
dtypes: float64(3), int64(2), object(5)
memory usage: 668.3+ KB


In [129]:
chess.describe()

Unnamed: 0,Fide id,Year_of_birth,Standard_Rating,Rapid_rating,Blitz_rating
count,8553.0,8261.0,8553.0,3608.0,3472.0
mean,8829011.0,1985.291732,2005.102303,1931.680155,1925.155242
std,9226777.0,14.055386,137.146646,191.449272,188.556849
min,100145.0,1920.0,1801.0,1224.0,1159.0
25%,2119447.0,1978.0,1891.0,1811.0,1805.0
50%,4500539.0,1988.0,1998.0,1921.0,1919.0
75%,13605530.0,1995.0,2090.0,2051.0,2044.0
max,73601140.0,2010.0,2675.0,2646.0,2736.0


### 2.1. Finding columns with high null values

In [130]:
null_perc = chess.isnull().sum()/len(chess)*100
null_perc.sort_values(ascending = False).head(10)

Title              63.544955
Blitz_rating       59.406056
Rapid_rating       57.815971
Inactive_flag      31.579563
Year_of_birth       3.414007
Standard_Rating     0.000000
Gender              0.000000
Federation          0.000000
Name                0.000000
Fide id             0.000000
dtype: float64

We observe that 4 columns have more than 30% of their values missing, namely 'Title', 'Blitz_rating', 'Rapid_rating' and 'Inactive_flag'.
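
The same check can be written so that the columns above a threshold are selected programmatically. The sketch below runs on a small made-up frame standing in for `chess`; the 30% cut-off and the name `high_null` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `chess`; the values are illustrative only.
toy = pd.DataFrame({
    "Title": ["GM", np.nan, np.nan, "WFM"],
    "Rapid_rating": [2646.0, np.nan, 2483.0, np.nan],
    "Standard_Rating": [2675, 2658, 2586, 2582],
})

# isna().mean() is the fraction of missing values per column.
null_perc = toy.isna().mean().mul(100).sort_values(ascending=False)

# Keep only the columns above the assumed 30% threshold.
high_null = null_perc[null_perc > 30].index.tolist()
print(null_perc)
print(high_null)
```

On the real data, replacing `toy` with `chess` should reproduce the percentages listed above.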

### 2.2. Handling Missing Values

#### 2.2.1. Handling Missing Values of column 'Title'

In [131]:
chess.Title.value_counts(dropna=False)

NaN    5435
WFM    1545
WIM     809
WGM     316
WCM     247
IM      119
GM       37
FM       36
CM        8
WH        1
Name: Title, dtype: int64

This column has 5435 null values. A column with this many missing values would normally be dropped, but it is essential for the analysis, so we will fill the null values with 'Title not known' instead.

In [132]:
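# Fill missing titles in the chess DataFrame itself (no separate `df` exists at this point)
chess['Title'] = chess['Title'].fillna('Title not known')
chess['Title'].value_counts()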

Title not known    5435
WFM                1545
WIM                 809
WGM                 316
WCM                 247
IM                  119
GM                   37
FM                   36
CM                    8
WH                    1
Name: Title, dtype: int64

#### 2.2.2. Handling Missing Values of column 'Inactive_flag'

In [133]:
chess.Inactive_flag.value_counts(dropna=False)

wi     5852
NaN    2701
Name: Inactive_flag, dtype: int64

In this column, the value 'wi' stands for Woman Inactive. The null values therefore correspond to players who are active, so we will recode the values as 'Active' and 'Inactive'.

In [134]:
chess.Inactive_flag = chess.Inactive_flag.fillna('Active')
chess.Inactive_flag = chess.Inactive_flag.astype(str).replace('wi','Inactive')
chess.Inactive_flag.value_counts(dropna=False)

Inactive    5852
Active      2701
Name: Inactive_flag, dtype: int64
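
An alternative way to express this recoding is to derive the label in a single step from the raw flag. This is only a sketch on a made-up Series; the names `status` and `is_active` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw column: 'wi' marks inactive players, NaN marks active ones.
flags = pd.Series(["wi", np.nan, np.nan, "wi"], name="Inactive_flag")

# eq('wi') is False for NaN, so missing flags fall through to 'Active'.
status = pd.Series(np.where(flags.eq("wi"), "Inactive", "Active"), name="Status")

# A boolean view can be handy for filtering later on.
is_active = flags.isna()

print(status.value_counts())
```

Keeping a boolean column next to the text label avoids repeated string comparisons when filtering for active players.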

### 2.3 Finding Outliers

#### 2.3.1. Distribution of Age

In [135]:
chess['Current_Year'] = 2020
chess['Age'] = chess['Current_Year'] - chess['Year_of_birth']
chess.head()

Unnamed: 0,Fide id,Name,Federation,Gender,Year_of_birth,Title,Standard_Rating,Rapid_rating,Blitz_rating,Inactive_flag,Current_Year,Age
0,700070,"Polgar, Judit",HUN,F,1976.0,GM,2675,2646.0,2736.0,Inactive,2020,44.0
1,8602980,"Hou, Yifan",CHN,F,1994.0,GM,2658,2621.0,2601.0,Active,2020,26.0
2,5008123,"Koneru, Humpy",IND,F,1987.0,GM,2586,2483.0,2483.0,Active,2020,33.0
3,4147103,"Goryachkina, Aleksandra",RUS,F,1998.0,GM,2582,2502.0,2441.0,Active,2020,22.0
4,700088,"Polgar, Susan",HUN,F,1969.0,GM,2577,,,Inactive,2020,51.0


In [163]:
fig = px.box(chess, y="Age",title='Distribution of Age' )
fig.show()

As we can observe, there are many outliers in the Age column, but this is expected because the dataset also contains top female players who are now inactive.

#### 2.3.2. Distribution of Standard Rating 

In [247]:
fig = px.box(chess, y="Standard_Rating",title='Distribution of Standard Rating' )
fig.show()

Standard Rating also has a large number of outliers, all of them on the high side.

#### 2.3.3 Distribution of Rapid Rating 

In [246]:
fig = px.box(chess, y="Rapid_rating",title='Distribution of Rapid Rating' )
fig.show()

Rapid ratings have outliers on both sides: below the lower whisker (under the first quartile) and above the upper whisker (beyond the third quartile).

#### 2.3.4 Distribution of Blitz Rating 

In [248]:
fig = px.box(chess, y="Blitz_rating",title='Distribution of Blitz Rating' )
fig.show()

Blitz ratings have outliers on both sides: below the lower whisker (under the first quartile) and above the upper whisker (beyond the third quartile).
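
The box plots flag points outside the whiskers, which by the usual convention extend 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to the quartiles printed by `chess.describe()` earlier. Plotly may interpolate quartiles slightly differently, so the plotted fences can differ a little; the helper name `tukey_fences` is an assumption for this example.

```python
# Quartiles and extremes copied from the chess.describe() output shown earlier.
summary = {
    "Standard_Rating": {"min": 1801, "q1": 1891, "q3": 2090, "max": 2675},
    "Rapid_rating": {"min": 1224, "q1": 1811, "q3": 2051, "max": 2646},
    "Blitz_rating": {"min": 1159, "q1": 1805, "q3": 2044, "max": 2736},
}


def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


outlier_sides = {}
for column, s in summary.items():
    lower, upper = tukey_fences(s["q1"], s["q3"])
    # A side has outliers if the observed extreme lies beyond the fence.
    outlier_sides[column] = {"low": s["min"] < lower, "high": s["max"] > upper}
    print(column, lower, upper, outlier_sides[column])
```

Under this rule, Standard_Rating can only have high-side outliers because the dataset is cut off at 1800, well above its lower fence, while Rapid_rating and Blitz_rating have outliers on both sides.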

## 3. Analysis

### 3.1. Univariate Analysis

#### 3.1.1. Top 30 Countries by Number of Top Ranked Female Chess Players

In [136]:
temp = chess["Federation"].value_counts().head(30)
temp.iplot(kind='bar', xTitle = 'Country', yTitle = "Count", title = 'Top 30 Countries by Number of Top Ranked Female Chess Players', color = '#4CB391')

We can observe that Russia has the highest number of Top Ranked Female Chess Players followed by Germany, Poland, Ukraine and India. 


India has 251 Top Ranked Female Chess players!!

#### 3.1.2. Status of Top Ranked Female Chess Players

In [137]:
im = chess["Inactive_flag"].value_counts()
df = pd.DataFrame({'labels': im.index,'values': im.values})
df.iplot(kind='pie',labels='labels',values='values', title='Status of Top Ranked Female Chess Players', hole = 0.5, colors=['#FF414D','#9B116F'])

We observe that 68.4% of the Top Ranked Female Chess Players are Inactive.
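
The percentage can be reproduced directly from the counts printed in section 2.2.2. The sketch below rebuilds those counts by hand; on the real frame, `chess["Inactive_flag"].value_counts(normalize=True)` should give the same shares.

```python
import pandas as pd

# Counts copied from the Inactive_flag value_counts() output shown earlier.
counts = pd.Series({"Inactive": 5852, "Active": 2701})

# Convert counts to percentage shares, rounded to one decimal place.
share = (counts / counts.sum() * 100).round(1)
print(share)
```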

#### 3.1.3 Count of Title of Top Ranked Female Chess Players

In [138]:
temp = chess["Title"].value_counts().head(30)
temp.iplot(kind='bar', xTitle = 'Title', yTitle = "Count", title = 'Count of Title of Top Ranked Female Chess Players', color = '#FF8E15')

Leaving aside players with no recorded title, the most common title among the top female chess players is Woman FIDE Master (WFM), held by 1545 players, whereas only 1 player has the title WH.

### 3.2. Bivariate Analysis

#### 3.2.1 Year of Birth Distribution with active status of Top Female Chess Players 

In [139]:
fig = px.histogram(chess, x = 'Year_of_birth', color="Inactive_flag", title = 'Year of Birth Distribution with active status of Top Female Chess Players ')
fig.show()



The most common year of birth among the top female chess players is 1989.

#### 3.2.2. Age Distribution with active status of Top Female Chess Players

In [140]:
fig = px.histogram(chess, x = 'Age', color="Inactive_flag", title = 'Age Distribution with active status of Top Female Chess Players ')
fig.show()

The most common age among the top female chess players is 31 years.
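
Statements about the most frequent birth year or age can be checked numerically rather than read off the histogram. The sketch below uses a small made-up frame, so its numbers are illustrative only; `modal_share` is a hypothetical name for the fraction of players at the most frequent age.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real frame would be `chess`.
toy = pd.DataFrame({"Year_of_birth": [1989.0, 1989.0, 1994.0, 1976.0, np.nan]})
toy["Age"] = 2020 - toy["Year_of_birth"]

# mode() ignores NaN by default and returns the most frequent value(s).
modal_year = toy["Year_of_birth"].mode().iloc[0]
modal_age = toy["Age"].mode().iloc[0]

# Share of players (with a known age) at the most frequent age.
modal_share = toy["Age"].value_counts(normalize=True).iloc[0]
print(modal_year, modal_age, modal_share)
```

Reporting the share alongside the mode makes it clear whether the most frequent age covers a large or a small part of the players.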

#### 3.2.3 Countries based on Female Grand Master count

In [255]:
gm = chess[chess['Title']=='GM']
temp = gm.groupby('Federation')['Name'].count().sort_values(ascending=False).head(10)
temp.iplot(kind='bar', xTitle = 'Country', yTitle = "Count", title = 'Countries with most number of female Grand Masters', color = '#FD7055')

China has the highest number of female Grand Masters with a count of 7, followed by Russia.

India has two female Grand Masters.
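
The per-country counts behind the chart can also be obtained with `value_counts`. The sketch below uses invented rows (the player names and counts are hypothetical) to show that filtering on `Title == 'GM'` keeps only the open Grand Master title and leaves out WGM and IM.

```python
import pandas as pd

# Made-up rows for illustration; not taken from the dataset.
toy = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D", "Player E"],
    "Federation": ["CHN", "CHN", "RUS", "IND", "RUS"],
    "Title": ["GM", "GM", "GM", "WGM", "IM"],
})

# Select GM rows, then count them per federation (sorted descending by default).
gm_counts = toy.loc[toy["Title"].eq("GM"), "Federation"].value_counts()
print(gm_counts)
```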

#### 3.2.4. Top 10 Female Chess Players based on Standard Rating

In [158]:
temp = chess.sort_values(by = 'Standard_Rating', ascending = False).head(10)
fig = px.funnel(temp, y = 'Name', x = 'Standard_Rating', title = 'Top 10 Female Chess Players based on Standard Rating', color = 'Inactive_flag')
fig.show()

Judit Polgar is the top female chess player with a standard rating of 2675, although she is inactive. Yifan Hou is the top active female chess player with a standard rating of 2658.

#### 3.2.5. Top 10 Female Chess Players based on Rapid Rating

In [157]:
temp = chess.sort_values(by = 'Rapid_rating', ascending = False).head(10)
fig = px.funnel(temp, y = 'Name', x = 'Rapid_rating', title = 'Top 10 Female Chess Players based on Rapid Rating', color = 'Inactive_flag')
fig.show()

Judit Polgar is the top female chess player with a rapid rating of 2646, although she is inactive. Yifan Hou is the top active female chess player with a rapid rating of 2621.

#### 3.2.6. Top 10 Female Chess Players based on Blitz Rating

In [156]:
temp = chess.sort_values(by = 'Blitz_rating', ascending = False).head(10)
fig = px.funnel(temp, y = 'Name', x = 'Blitz_rating', title = 'Top 10 Female Chess Players based on Blitz Rating', color = 'Inactive_flag' )
fig.show()

Judit Polgar is the top female chess player with a blitz rating of 2736, although she is inactive. Kateryna Lagno is the top active female chess player with a blitz rating of 2608.
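
The "top overall" and "top active" players can be looked up without reading the funnel charts. The sketch below uses only the five preview rows shown in section 2.3.1, so it covers the standard and rapid ratings but not the full blitz ranking; `top_player` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# The five preview rows shown earlier (after the Inactive_flag recoding).
top5 = pd.DataFrame({
    "Name": ["Polgar, Judit", "Hou, Yifan", "Koneru, Humpy",
             "Goryachkina, Aleksandra", "Polgar, Susan"],
    "Standard_Rating": [2675, 2658, 2586, 2582, 2577],
    "Rapid_rating": [2646.0, 2621.0, 2483.0, 2502.0, np.nan],
    "Inactive_flag": ["Inactive", "Active", "Active", "Active", "Inactive"],
})


def top_player(frame, rating_column, active_only=False):
    """Name of the highest-rated player, optionally among active players only."""
    subset = frame[frame["Inactive_flag"].eq("Active")] if active_only else frame
    # idxmax skips missing ratings by default.
    return subset.loc[subset[rating_column].idxmax(), "Name"]


for column in ["Standard_Rating", "Rapid_rating"]:
    print(column, top_player(top5, column), top_player(top5, column, active_only=True))
```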

#### 3.2.7. Standard Rating vs Rapid Rating 

In [177]:
fig = px.density_heatmap(chess, x="Standard_Rating", y="Rapid_rating", marginal_x="histogram", marginal_y="histogram")
fig.show()

#### 3.2.8. Standard Rating vs Blitz Rating

In [175]:
fig = px.density_contour(chess, x="Standard_Rating", y="Blitz_rating", color="Inactive_flag", marginal_x="histogram", marginal_y="histogram")
fig.show()

#### 3.2.9. Rapid Rating vs Blitz Rating

In [202]:
fig = px.density_heatmap(chess, x="Rapid_rating", y="Blitz_rating", color_continuous_scale="tropic")
fig.show()


## 4. Map Visuals

### 4.1. Map of Top Ranked Female Chess Players by Country

In [245]:
df_fed = pd.DataFrame(chess.groupby('Federation').size()).reset_index()
df_fed.rename(columns = {0:'Number of Chess Players'}, inplace=True)
fig = px.choropleth(df_fed, locations="Federation",
                    color="Number of Chess Players",
                    hover_name="Federation", 
                    color_continuous_scale="tealgrn",
                    title = 'Top Ranked Female Chess Players by Country')
fig.show()
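
One caveat with this map: `px.choropleth` interprets `locations` as ISO 3166-1 alpha-3 codes by default, while several FIDE federation codes differ from ISO (for example GER instead of DEU), so those countries may not be drawn. The sketch below shows one possible fix using a partial, hand-written mapping; the dictionary `FIDE_TO_ISO3`, the `iso_alpha3` column, and the counts are assumptions for illustration and would need extending for the full dataset.

```python
import pandas as pd

# Partial, hand-written mapping from FIDE federation codes to ISO 3166-1 alpha-3.
# Codes that already match ISO (e.g. RUS, IND, CHN) need no entry.
FIDE_TO_ISO3 = {"GER": "DEU", "NED": "NLD", "SUI": "CHE", "BUL": "BGR", "CRO": "HRV"}

# Made-up counts standing in for the real df_fed.
df_fed = pd.DataFrame({
    "Federation": ["RUS", "GER", "IND", "NED"],
    "Number of Chess Players": [3, 2, 1, 1],
})

# replace() swaps mapped codes and leaves all other codes unchanged.
df_fed["iso_alpha3"] = df_fed["Federation"].replace(FIDE_TO_ISO3)
print(df_fed)
```

Passing `locations="iso_alpha3"` to `px.choropleth` instead of `"Federation"` would then use the converted codes while `hover_name` can still show the FIDE code.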

### 4.2. Map of Female Grand Masters by Country

In [236]:
gm_fed = pd.DataFrame(gm.groupby('Federation').size()).reset_index()
gm_fed.rename(columns = {0:'Number of Grandmasters'}, inplace=True)

In [243]:
fig = px.choropleth(gm_fed, locations="Federation",
                    color="Number of Grandmasters", 
                    hover_name="Federation", 
                    color_continuous_scale='portland',
                    title = 'Female Grandmasters by Country')
fig.show()

## 5. Conclusions
- Judit Polgar is the top female chess player topping all three of the ratings, i.e. Standard Rating, Rapid Rating and Blitz Rating.
- Russia has the highest number of top-ranked female chess players, although China tops the list of female Grand Masters with a count of 7.
- When it comes to female Grand Masters, Georgia ranks 3rd with 5 Grand Masters.
- There are only 37 female Grand Masters in the world. 
- India has 2 female Grand Masters.
