# Chess Openings

**Author: Aramis Farias**  
**Date: 28/04/2021**  
**E-mail: Fariasaramis@gmail.com**

This notebook, along with the Tableau visualizations and the slide presentation, is part of the capstone project for the Google Professional Certificate in Data Analytics.

<a id="section-one"></a>
## ASK

A company has a mobile app for playing and learning chess and is now revisiting its lessons on chess openings. The conventional way of teaching openings is based on how frequently each opening is played, but the company wants to see whether another approach can help its customers improve their rating faster.

The key factors to analyse are the **win-loss ratio of each opening** and **the ELO rating difference between the players**.

The audience is the development and marketing teams, which will implement and advertise the new lessons.

### Deliverables

* A clear statement of the business task you have selected to investigate
* A description of all data sources used
* A summary of your analysis
* Supporting visualizations and key findings
* Your top high-level insights based on the analysis 
* A slide presentation of your findings

<a id="section-two"></a>
## PREPARE

We are going to use the following restrictions:
* Only games in which both players have an ELO rating of at least 2000 (games between lower-rated players would add too much "noise" and reflect the opening itself less)
* Openings will be addressed by their ECO code (Encyclopaedia of Chess Openings code)

Suitable datasets found on Kaggle:
* Chess Game Dataset (Lichess) - https://www.kaggle.com/datasnaek/chess
* Chess_dataset (100,000 games) Lichess - https://www.kaggle.com/sahit2509/chess-dataset-100000-games-lichess
* Chess Games - https://www.kaggle.com/arevel/chess-games

<a id="section-three"></a>
## PROCESS

We are going to import and process the datasets with the pandas library in Python.
To do so, we are going to run two rounds of processing.
The first round processes each dataset individually: looking for anomalies and missing data, dropping unrelated columns, and so on.
The second round prepares the three datasets to be merged into a single one: column names are changed and terminology is unified.
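The first round described above can be summarized as a small helper. This is only a sketch: the function name `first_round` and the toy values are hypothetical, and the cells that follow do the same work step by step for each dataset.

```python
import pandas as pd

def first_round(df, columns, white_col, black_col, min_rating=2000):
    """Keep the selected columns and only games where both players meet min_rating."""
    subset = df[columns]
    both_strong = (subset[white_col] >= min_rating) & (subset[black_col] >= min_rating)
    # .copy() returns an independent frame, so later column edits do not raise warnings
    return subset[both_strong].copy()

# Toy data with made-up values, only to show the call
toy = pd.DataFrame({
    "white_rating": [2100, 1500, 2200],
    "black_rating": [2050, 2300, 1900],
    "opening_eco": ["C22", "A30", "B03"],
    "moves": ["e4 e5", "c4 c5", "e4 Nf6"],
})
filtered = first_round(
    toy,
    ["white_rating", "black_rating", "opening_eco"],
    "white_rating",
    "black_rating",
)
print(filtered)
```

Only the first toy game survives, because it is the only one in which both ratings reach 2000.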

In [None]:
import pandas as pd

In [None]:
df_src1=pd.read_csv("../input/chess/games.csv")
df_src2=pd.read_csv("../input/chess-games/chess_games.csv")
df_src3=pd.read_csv("../input/chess-dataset-100000-games-lichess/sep.csv")

### 1st Round - First dataset

In [None]:
df_src1.head(10)

In [None]:
df_src1.columns

In [None]:
df_src1_filter = df_src1[['victory_status',
       'winner', 'increment_code', 'white_rating',
       'black_rating', 'opening_eco', 'opening_name']]

In [None]:
df_src1_filter.head(10)

In [None]:
df_src1_filter = df_src1_filter[df_src1_filter['white_rating']>=2000]

In [None]:
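# The final .copy() makes the filtered frame independent, so later in-place edits do not trigger SettingWithCopyWarning
df_src1_filter = df_src1_filter[df_src1_filter['black_rating']>=2000].copy()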

In [None]:
df_src1_filter.info()

### 1st Round - Second dataset

In [None]:
df_src2.head(10)

In [None]:
df_src2.columns

In [None]:
df_src2_filter = df_src2[['Result', 'UTCDate', 'WhiteElo',
       'BlackElo', 'ECO', 'Opening',
       'TimeControl', 'Termination']]

In [None]:
df_src2_filter['ECO'].value_counts(normalize=True).head(10)

In [None]:
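# Keep games where both players are rated at least 2000; the final .copy() makes later in-place edits safe
df_src2_filter = df_src2_filter[df_src2_filter['WhiteElo']>=2000]
df_src2_filter = df_src2_filter[df_src2_filter['BlackElo']>=2000].copy()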

In [None]:
df_src2_filter.info()

### 1st Round - Third dataset

In [None]:
df_src3.head(10)

In [None]:
df_src3.columns

In [None]:
df_src3_filter = df_src3[['Event', 'Date', 'Result',
       'BlackElo', 'ECO', 'Opening', 'Termination', 'TimeControl', 'WhiteElo']]

In [None]:
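# Keep games where both players are rated at least 2000; the final .copy() makes later in-place edits safe
df_src3_filter = df_src3_filter[df_src3_filter['WhiteElo']>=2000]
df_src3_filter = df_src3_filter[df_src3_filter['BlackElo']>=2000].copy()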

In [None]:
df_src3_filter.info()

### 1st Round - Conclusion and next steps

We were able to gather more than 700 thousand games in which both players are rated at least 2000 and the ECO code of the opening is recorded.

Now we are going to apply common conventions to the datasets, with the goal of merging them into a single dataset.
The conventions are:  
  
  **COLUMNS**
* Format - the time format of the game, based on its base time: Classical (time > 30 min), Rapid (30 min >= time >= 10 min), Blitz (10 min > time >= 3 min), Bullet (time < 3 min)
* Result - 3 possible results: 'black', 'white' or 'draw'
* BlackElo - ELO rating for Black
* WhiteElo - ELO rating for White
* Termination - How the game ended - 'normal' for checkmate or draw, 'time forfeit', 'resignation', 'other'
* ECO - the ECO code
* Opening - the name of the opening
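
The Format thresholds listed above can be written as one vectorized function. This is a sketch under assumptions: the name `classify_format` is hypothetical, and it uses `numpy.select` instead of the sequence of `.loc` assignments used in the cells below. Missing base times fall through to `'unknown'`.

```python
import numpy as np
import pandas as pd

def classify_format(minutes: pd.Series) -> pd.Series:
    """Map a base time in minutes to Classical / Rapid / Blitz / Bullet."""
    m = minutes.to_numpy(dtype=float)
    # np.select takes the first condition that is true for each element
    conditions = [m > 30, m >= 10, m >= 3, m < 3]
    labels = ["Classical", "Rapid", "Blitz", "Bullet"]
    # NaN fails every comparison, so it receives the default label
    return pd.Series(np.select(conditions, labels, default="unknown"),
                     index=minutes.index)

# Boundary values chosen to exercise each threshold
sample = pd.Series([1, 3, 9.99, 10, 30, 30.5, float("nan")])
labels = classify_format(sample)
print(labels.tolist())
```

The boundary cases matter here: 30 minutes is still Rapid, while 10 minutes is Rapid rather than Blitz.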


### 2nd Round - First dataset

First, we are going to apply the naming convention to the columns.

In [None]:
df_src1_filter.head()

In [None]:
df_src1_filter.rename(columns = {'increment_code': 'Format', 'winner': 'Result','black_rating':'BlackElo',
                                 'white_rating':'WhiteElo','victory_status':'Termination','opening_eco':'ECO',
                                 'opening_name':'Opening'}, inplace = True )

In [None]:
df_src1_filter.head()

Now we are going to go through each column to unify its values according to the conventions stated above.
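
As an illustration of this kind of unification, the sketch below maps the Termination labels of the first dataset to the shared vocabulary with a lookup dictionary and then checks that nothing was left unmapped. The names `TERMINATION_MAP`, `ALLOWED` and `unify_termination` are hypothetical, and the notebook itself uses `.loc` assignments for the same purpose.

```python
import pandas as pd

# Labels as they appear in the first dataset, mapped to the shared vocabulary
TERMINATION_MAP = {
    "resign": "resignation",
    "mate": "normal",
    "draw": "normal",
    "outoftime": "time forfeit",
}
ALLOWED = {"normal", "time forfeit", "resignation", "other"}

def unify_termination(values: pd.Series) -> pd.Series:
    """Replace known labels and fail loudly if an unexpected label remains."""
    unified = values.replace(TERMINATION_MAP)
    leftover = set(unified.dropna().unique()) - ALLOWED
    if leftover:
        raise ValueError(f"unmapped termination labels: {sorted(leftover)}")
    return unified

sample = pd.Series(["resign", "mate", "outoftime", "draw"])
unified = unify_termination(sample)
print(unified.tolist())
```

Raising on leftover labels makes a silent mismatch between datasets visible before the concatenation step.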

#### Format

In [None]:
df_src1_filter['Format'].value_counts()

In [None]:
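# Keep only the base time before the '+' (e.g. '15+2' -> '15'); no unit conversion is applied to this dataset
df_src1_filter['Format'] = df_src1_filter['Format'].str.extract(r'(^.+)\+.+')
df_src1_filter['Format'].value_counts(dropna=False)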

In [None]:
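# Format2 holds the base time as an integer; Format3 starts as a placeholder that the rules below overwrite
df_src1_filter['Format2']=df_src1_filter['Format'].astype(int)
df_src1_filter['Format3']= 'unknown'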

df_src1_filter.loc[df_src1_filter.Format2 > 30, "Format3"] = "Classical"
df_src1_filter.loc[(df_src1_filter.Format2 <= 30) & (df_src1_filter.Format2 >= 10), "Format3"] = "Rapid"
df_src1_filter.loc[(df_src1_filter.Format2 < 10) & (df_src1_filter.Format2 >= 3), "Format3"] = "Blitz"
df_src1_filter.loc[df_src1_filter.Format2 < 3, "Format3"] = "Bullet"

In [None]:
df_src1_filter['Format']=df_src1_filter['Format3']
df_src1_filter.drop(['Format2', 'Format3'], axis=1,inplace = True)

#### Result

In [None]:
df_src1_filter['Result'].value_counts()

#### BlackElo

In [None]:
df_src1_filter['BlackElo'].value_counts()

#### WhiteElo

In [None]:
df_src1_filter['WhiteElo'].value_counts()

#### Termination

In [None]:
df_src1_filter['Termination'].value_counts()

In [None]:
df_src1_filter.loc[df_src1_filter.Termination == 'resign', "Termination"] = "resignation"
df_src1_filter.loc[df_src1_filter.Termination == 'mate', "Termination"] = "normal"
df_src1_filter.loc[df_src1_filter.Termination == 'draw', "Termination"] = "normal"
df_src1_filter.loc[df_src1_filter.Termination == 'outoftime', "Termination"] = "time forfeit"

In [None]:
df_src1_filter['Termination'].value_counts()

#### ECO

In [None]:
df_src1_filter['ECO'].value_counts()

#### Opening

In [None]:
df_src1_filter['Opening'].value_counts()

In [None]:
df_src1_proc = df_src1_filter

### 2nd Round - Second dataset

We are going to follow the same routine as for the first dataset.

In [None]:
df_src2_filter.head()

In [None]:
# 'ECO' already has its final name in this dataset, so only two columns need renaming
df_src2_filter.rename(columns = {'TimeControl': 'Format', 'UTCDate':'Date'}, inplace = True )

#### Format  
Attention: this time, the format is expressed in seconds.
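
A minimal sketch of the conversion, assuming time controls of the form `'<base seconds>+<increment>'` such as `'300+0'`. The function name `base_minutes` is hypothetical. Values that do not match the pattern become `NaN`.

```python
import pandas as pd

def base_minutes(time_control: pd.Series) -> pd.Series:
    """Extract the base time before the '+' and convert seconds to minutes."""
    base_seconds = time_control.str.extract(r"^(\d+)\+\d+$", expand=False)
    # Non-matching entries are NaN after extract and stay NaN after the division
    return pd.to_numeric(base_seconds, errors="coerce") / 60

sample = pd.Series(["300+0", "600+5", "-"])
minutes = base_minutes(sample)
print(minutes)
```

Unlike the pattern used in the cells below, this one only accepts digits, so malformed values are turned into `NaN` instead of failing at the numeric conversion.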

In [None]:
df_src2_filter['Format'].value_counts().head(20)

In [None]:
df_src2_filter['Format_dummy'] = df_src2_filter['Format'].str.extract(r'(^.+)\+.+')
df_src2_filter['Format_dummy'].value_counts(dropna=False)

In [None]:
df_src2_filter['Format_dummy'] = df_src2_filter['Format_dummy'].astype(float)/60
df_src2_filter['Format_dummy'].value_counts(dropna=False)

In [None]:
df_src2_filter.info()

In [None]:
df_src2_filter['Format2']=df_src2_filter['Format_dummy']
df_src2_filter['Format3']= 'unknown'

df_src2_filter.loc[df_src2_filter.Format2 > 30, "Format3"] = "Classical"
df_src2_filter.loc[(df_src2_filter.Format2 <= 30) & (df_src2_filter.Format2 >= 10), "Format3"] = "Rapid"
df_src2_filter.loc[(df_src2_filter.Format2 < 10) & (df_src2_filter.Format2 >= 3), "Format3"] = "Blitz"
df_src2_filter.loc[df_src2_filter.Format2 < 3, "Format3"] = "Bullet"

In [None]:
df_src2_filter['Format']=df_src2_filter['Format3']
df_src2_filter.drop(['Format2', 'Format3','Format_dummy'], axis=1,inplace = True)

In [None]:
df_src2_filter['Format'].value_counts()

#### Result

In [None]:
df_src2_filter['Result'].value_counts().head(10)

Here we observe something unexpected: '*' is not a valid outcome for a finished chess game. Let's take a closer look at those observations.

In [None]:
df_src2_filter[df_src2_filter['Result']=='*']

We can only conclude that the game either was not finished or its result is unknown.  
We are going to drop those rows.
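
Dropping these rows and relabelling the remaining results can also be sketched as a single step. The names `RESULT_MAP` and `unify_result` are hypothetical. `Series.map` turns any value missing from the dictionary (such as `'*'`) into `NaN`, which is then dropped.

```python
import pandas as pd

# PGN-style results mapped to the shared vocabulary
RESULT_MAP = {"1-0": "white", "0-1": "black", "1/2-1/2": "draw"}

def unify_result(df: pd.DataFrame) -> pd.DataFrame:
    """Relabel results and drop games whose result is not in RESULT_MAP."""
    out = df.copy()
    out["Result"] = out["Result"].map(RESULT_MAP)  # unmapped values such as '*' become NaN
    return out.dropna(subset=["Result"])

# Toy data with made-up rows
toy = pd.DataFrame({
    "Result": ["1-0", "*", "0-1", "1/2-1/2"],
    "ECO": ["C22", "A30", "B03", "C50"],
})
cleaned = unify_result(toy)
print(cleaned)
```

Note that this drops every unexpected label, not only `'*'`, so it is worth inspecting `value_counts()` first, as the notebook does.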

In [None]:
df_src2_filter = df_src2_filter.drop(df_src2_filter[df_src2_filter['Result']=='*'].index)

In [None]:
df_src2_filter.loc[df_src2_filter.Result == '1-0', "Result"] = "white"
df_src2_filter.loc[df_src2_filter.Result == '0-1', "Result"] = "black"
df_src2_filter.loc[df_src2_filter.Result == '1/2-1/2', "Result"] = "draw"

In [None]:
df_src2_filter['Result'].value_counts()

#### BlackElo

In [None]:
df_src2_filter['BlackElo'].value_counts()

#### WhiteElo

In [None]:
df_src2_filter['WhiteElo'].value_counts()

#### Termination

In [None]:
df_src2_filter['Termination'].value_counts()

In [None]:
df_src2_filter.loc[df_src2_filter.Termination == 'Abandoned', "Termination"] = "resignation"
df_src2_filter.loc[df_src2_filter.Termination == 'Normal', "Termination"] = "normal"
df_src2_filter.loc[df_src2_filter.Termination == 'Rules infraction', "Termination"] = "other"
df_src2_filter.loc[df_src2_filter.Termination == 'Time forfeit', "Termination"] = "time forfeit"

In [None]:
df_src2_filter['Termination'].value_counts()

#### ECO

In [None]:
df_src2_filter['ECO'].value_counts()

#### Opening

In [None]:
df_src2_filter['Opening'].value_counts()

In [None]:
df_src2_proc = df_src2_filter

### 2nd Round - Third dataset

In [None]:
df_src3_filter.head()

In [None]:
df_src3_filter.rename(columns = {'TimeControl': 'Format'}, inplace = True )

#### Format

In [None]:
df_src3_filter['Format'].value_counts()

In [None]:
df_src3_filter['Format_dummy'] = df_src3_filter['Format'].str.extract(r'(^.+)\+.+')
df_src3_filter['Format_dummy'].value_counts(dropna=False)

Again, the time is recorded in seconds, so we are going to convert it to minutes.

In [None]:
df_src3_filter['Format_dummy'] = df_src3_filter['Format_dummy'].astype(float)/60
df_src3_filter['Format_dummy'].value_counts(dropna=False)

In [None]:
df_src3_filter['Format2']=df_src3_filter['Format_dummy']
df_src3_filter['Format3']= 'unknown'

df_src3_filter.loc[df_src3_filter.Format2 > 30, "Format3"] = "Classical"
df_src3_filter.loc[(df_src3_filter.Format2 <= 30) & (df_src3_filter.Format2 >= 10), "Format3"] = "Rapid"
df_src3_filter.loc[(df_src3_filter.Format2 < 10) & (df_src3_filter.Format2 >= 3), "Format3"] = "Blitz"
df_src3_filter.loc[df_src3_filter.Format2 < 3, "Format3"] = "Bullet"

In [None]:
df_src3_filter['Format']=df_src3_filter['Format3']
df_src3_filter.drop(['Format2', 'Format3','Format_dummy'], axis=1,inplace = True)

In [None]:
df_src3_filter['Format'].value_counts()

#### Result

In [None]:
df_src3_filter['Result'].value_counts(dropna=False)

In [None]:
df_src3_filter.loc[df_src3_filter.Result == '1-0', "Result"] = "white"
df_src3_filter.loc[df_src3_filter.Result == '0-1', "Result"] = "black"
df_src3_filter.loc[df_src3_filter.Result == '1/2-1/2', "Result"] = "draw"

In [None]:
df_src3_filter['Result'].value_counts(dropna=False)

#### BlackElo

In [None]:
df_src3_filter['BlackElo'].value_counts(dropna=False)

#### WhiteElo

In [None]:
df_src3_filter['WhiteElo'].value_counts(dropna=False)

#### Termination

In [None]:
df_src3_filter['Termination'].value_counts(dropna=False)

In [None]:
df_src3_filter.loc[df_src3_filter.Termination == 'Abandoned', "Termination"] = "resignation"
df_src3_filter.loc[df_src3_filter.Termination == 'Normal', "Termination"] = "normal"
df_src3_filter.loc[df_src3_filter.Termination == 'Rules infraction', "Termination"] = "other"
df_src3_filter.loc[df_src3_filter.Termination == 'Time forfeit', "Termination"] = "time forfeit"

In [None]:
df_src3_filter['Termination'].value_counts(dropna=False)

#### ECO

In [None]:
df_src3_filter['ECO'].value_counts(dropna=False)

#### Opening 

In [None]:
df_src3_filter['Opening'].value_counts(dropna=False)

In [None]:
df_src3_filter['Opening'].isnull().sum()

In [None]:
df_src3_proc = df_src3_filter

### 2nd Round - Concatenation

In [None]:
df_src1_proc

In [None]:
df = pd.concat((df_src1_proc, df_src2_proc, df_src3_proc), axis = 0, ignore_index=True) 

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.to_csv('chess_open_concat.csv')

<a id="section-four"></a>
## ANALYSE

With the appropriate data in hand, Tableau Public was used to continue the analysis.
The visualizations can be found at this link: https://public.tableau.com/profile/aramis.de.albuquerque.farias#!/vizhome/EmpiricalAdvantagesinChessOpenings/Story1

First, the ELO difference between the players should be considered. Only games with an ELO difference of no more than 50 points were included, to minimize the effect of the difference in ability between players.
Second, there are only 342 games in Classical format and 968 games in unknown format, so both were excluded.
Lastly, only openings that were played at least 100 times in the dataset were included, to avoid too much noise caused by random chance.


Starting with more than 6 million games, we got down to 265 thousand games meeting all the criteria, less than 5% of the total.
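
The filtering itself was done in Tableau. For reference, the three criteria can be sketched in pandas as follows. The function name `apply_analysis_filters` is hypothetical, and applying the 100-game threshold after the other two filters is an assumption, since the order is not stated here.

```python
import pandas as pd

def apply_analysis_filters(df, max_elo_gap=50, min_games=100):
    """Keep close-rated games, drop Classical/unknown formats and rare openings."""
    close = (df["WhiteElo"] - df["BlackElo"]).abs() <= max_elo_gap
    kept_format = ~df["Format"].isin(["Classical", "unknown"])
    subset = df[close & kept_format]
    # Number of remaining games for the opening of each row
    games_per_eco = subset["ECO"].map(subset["ECO"].value_counts())
    return subset[games_per_eco >= min_games]

# Toy data with made-up values; min_games is lowered so the example stays small
toy = pd.DataFrame({
    "WhiteElo": [2100, 2100, 2100, 2300, 2100],
    "BlackElo": [2120, 2090, 2130, 2100, 2110],
    "Format": ["Blitz", "Blitz", "Classical", "Blitz", "Rapid"],
    "ECO": ["C22", "C22", "C22", "C22", "A30"],
})
kept = apply_analysis_filters(toy, max_elo_gap=50, min_games=2)
print(kept)
```

In the toy data only the first two games remain: the third is Classical, the fourth has a 200-point gap, and the fifth belongs to an opening with a single game.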

#### Outcome rates for each opening


In [None]:
from IPython.display import Image
Image("../input/empirical-advantages-chess-openings-images/images/All games_tableau.png")

Here we can see all the openings and their respective outcome rates. Clearly, some openings favor either the white or the black player. It is also worth noting that, even with an apparent advantage for one side, **the differences aren't overwhelming**, which leaves room for competition regardless of the opening played.
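
Outcome rates like the ones in this chart can be computed from the concatenated dataset with a row-normalized cross-tabulation. This is a sketch with a hypothetical function name `outcome_rates` and made-up toy rows, not the Tableau calculation itself.

```python
import pandas as pd

def outcome_rates(df: pd.DataFrame) -> pd.DataFrame:
    """Share of white wins, black wins and draws for each ECO code."""
    # normalize="index" makes each row (one ECO code) sum to 1
    return pd.crosstab(df["ECO"], df["Result"], normalize="index")

toy = pd.DataFrame({
    "ECO": ["C22", "C22", "C22", "C22", "A30", "A30"],
    "Result": ["white", "white", "black", "draw", "black", "black"],
})
rates = outcome_rates(toy)
print(rates)
```

Each row of `rates` corresponds to one opening, so sorting by the `white` or `black` column gives a ranking comparable to the lists in the following sections.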

#### White advantage - Format

In [None]:
Image("../input/empirical-advantages-chess-openings-images/images/White advantage_tableau.png")

As the games are separated by format, we can see that **some openings are better represented in some formats** than in others.  
The following openings offered the best win rates for White:  

**Blitz (10 min > time >= 3 min)**:
* C22 - 57.1% - Center Game
* C61 - 56.5% - Ruy Lopez, Bird's Defense
* E32 - 56.2% - Nimzo-Indian, Classical

**Bullet (time < 3 min)**:
* A30 - 59.9% - English, Symmetrical
* C27 - 56.2% - Vienna Game
* D13 - 56.2% - Queen's Gambit Declined Slav, Exchange Variation

**Rapid (30 min >= time >= 10 min)**:
* C30 - 61.1% - King's Gambit Declined
* C53 - 57.4% - Giuoco Piano
* C21 - 57.0% - Center Game

*Note: some ECO codes share the same opening name but correspond to different positions.*

With the most favorable openings, and according to this data, we can expect a winning chance of around 56.5% for White. This can be translated into a **63-point advantage** in ELO rating. Considering that we allowed a maximum ELO difference of 50 points, the advantage still amounts to at least 13 ELO points in favor of White, or **at least half a pawn** in terms of material.
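
For readers who want to experiment with this kind of conversion, the sketch below inverts the standard logistic Elo expectation formula. It is not the calculator used for the figures above: it treats the rate as an expected score in which a draw counts as half a point, whereas the calculator listed under Sources may use a different model, so its output is not expected to match the quoted numbers. The function names are hypothetical.

```python
import math

def expected_score_from_elo_gap(gap: float) -> float:
    """Standard Elo expectation for the player who is `gap` points stronger."""
    return 1.0 / (1.0 + 10.0 ** (-gap / 400.0))

def elo_gap_from_expected_score(score: float) -> float:
    """Inverse of the formula above: rating gap implied by an expected score."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return 400.0 * math.log10(score / (1.0 - score))

gap = elo_gap_from_expected_score(0.565)
print(round(gap, 1))
print(round(expected_score_from_elo_gap(gap), 3))
```

Feeding the gap back into `expected_score_from_elo_gap` returns the original score, which is a quick check that the two functions are consistent with each other.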

#### Black advantage - Format

In [None]:
Image("../input/empirical-advantages-chess-openings-images/images/Black advantage_tableau.png")

As with White's best win rates, Black's best win rates also show that **some openings are better represented in some formats**.

**Blitz (10 min > time >= 3 min)**:
* C51 - 55.9% - Evans Gambit
* A60 - 55.1% - Benoni Defense
* E65 - 54.7% - King's Indian, Fianchetto, Yugoslav

**Bullet (time < 3 min)**:
* C46 - 56.5% - Three Knights
* A53 - 55.8% - Old Indian
* A51 - 55.4% - Budapest Gambit

**Rapid (30 min >= time >= 10 min)**:
* B03 - 57.3% - Alekhine's Defense
* C20 - 54.9% - King's Pawn Game
* C50 - 51.8% - Giuoco Piano

*Note: some ECO codes share the same opening name but correspond to different positions.*

With the most favorable openings, and according to this data, we can expect a winning chance of around 55.2% for Black. This can be translated into a 54-point advantage in ELO rating. Considering that we allowed a maximum ELO difference of 50 points, the advantage amounts to at least 4 ELO points in favor of Black. 
This advantage speaks in favor of Black, but it is **arguably not significant** from a statistical point of view. 

In order to better assess the advantage that the most favorable openings can offer Black, a narrower range between the players' ELO ratings should be applied.
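
One way to get a feel for the statistical uncertainty is a confidence interval around an observed win rate. The sketch below uses the Wilson score interval, a technique chosen here for illustration and not taken from the analysis. The game counts are hypothetical, since the per-opening counts are not listed in this notebook.

```python
import math

def wilson_interval(wins: int, games: int, z: float = 1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if games <= 0:
        raise ValueError("games must be positive")
    p = wins / games
    denom = 1.0 + z * z / games
    center = (p + z * z / (2.0 * games)) / denom
    half = z * math.sqrt(p * (1.0 - p) / games + z * z / (4.0 * games * games)) / denom
    return center - half, center + half

# Hypothetical counts: the same 55% win rate observed over 100 and over 10,000 games
small_low, small_high = wilson_interval(55, 100)
large_low, large_high = wilson_interval(5500, 10000)
print(round(small_low, 3), round(small_high, 3))
print(round(large_low, 3), round(large_high, 3))
```

With these hypothetical counts, the interval for 100 games still contains 50%, while the interval for 10,000 games does not, which is why the number of games behind each opening matters when reading the rates above.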

#### Termination as a strategy

In [None]:
Image("../input/empirical-advantages-chess-openings-images/images/Termination_tableau.png")

Applying similar conditions, we can observe that a considerable number of games in particular openings ended in time forfeit, favoring either Black or White with some consistency.

This could be explained by the fact that some positions are harder to maneuver for one side and easier for the other, which leads to a time advantage on the clock. This can be used as a strategy, especially in faster formats, namely Bullet and Blitz, which happen to make up the majority of the games included in this analysis.

In this sense, the following openings can be used to exploit termination by time forfeit:

For White:
* D13 - 62.5% - Queen's Gambit Declined Slav, Exchange Variation
* D32 - 59.4% - Queen's Gambit Declined, Tarrasch
* B80 - 56.9% - Sicilian, Scheveningen

For Black:
* A42 - 57.3% - Modern Defense, Averbakh System
* B09 - 56.3% - Pirc, Austrian Attack
* B43 - 55.1% - Sicilian, Kan, 5.Nc3

Playing those openings and aiming for a victory by time forfeit gives an average expected win rate of 59.6% for White and 56.2% for Black, which can be translated into 86 and 61 ELO rating points, respectively.
This, according to the data, surprisingly represents **a better strategy than playing openings for a regular checkmate**. Again, it should be taken into account that this would be mainly useful in formats with less time.

<a id="section-five"></a>
## SHARE

Slide presentation: https://docs.google.com/presentation/d/1DGU_tYx9Ui06yoQ37I0uuHPrBQFaTfnSYGO8z_enfzg

### Sources

Chess opening names: https://www.chessgames.com/  
Elo Win probability Calculator: https://wismuth.com/elo/calculator.html