<h1> Using Data to Understand Chess (with expert games) </h1>

In this notebook, I utilize data taken from <a href="https://www.kaggle.com/ironicninja/1-million-games-from-chessgames">this dataset</a> to better understand the game of chess.

<h2> Table of Contents </h2>
<ol style="font-size: 16px">
    <li><a href="https://www.kaggle.com/ironicninja/using-data-to-understand-chess-with-expert-games#Section-1:-Setup-the-Packages,-Variables,-and-Data">Setup the Packages, Variables, and Data</a></li>
    <li><a href="https://www.kaggle.com/ironicninja/using-data-to-understand-chess-with-expert-games#Section-2:-Quick-Exploratory-Data-Analysis">Quick Exploratory Data Analysis</a></li>
    <li><a href="https://www.kaggle.com/ironicninja/using-data-to-understand-chess-with-expert-games#Section-3:-Data-Preprocessing">Data Preprocessing</a></li>
    <li><a href="https://www.kaggle.com/ironicninja/using-data-to-understand-chess-with-expert-games#Section-4:-Adding-Features">Adding Features</a></li>
    <li><a href="https://www.kaggle.com/ironicninja/using-data-to-understand-chess-with-expert-games/data#Section-5:-Start-with-the-Setting">Start with the Setting</a></li>
    <li><a href="https://www.kaggle.com/ironicninja/using-data-to-understand-chess-with-expert-games/data#Section-6:-What-Players-Do-We-Have-in-the-Dataset?">What Players Do We Have in the Dataset</a></li>
    <li><a href="https://www.kaggle.com/ironicninja/using-data-to-understand-chess-with-expert-games/data#Section-7.-Opening-Analysis">Opening Analysis</a></li>
    <li><a href="https://www.kaggle.com/ironicninja/using-data-to-understand-chess-with-expert-games/data#Section-8.-The-End?">The End?</a></li>
</ol>

# Section 1: Setup the Packages, Variables, and Data

<h2> Packages </h2>

In [None]:
!pip install bar_chart_race

In [None]:
#-----General------#
import numpy as np
import pandas as pd
import os
import sys

#-----Plotting-----#
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
import plotly.offline as py
py.init_notebook_mode(connected=True)
import seaborn as sns
import bar_chart_race as bcr
from pandas_profiling import ProfileReport

#-----Utility-----#
import math
import itertools
import warnings
warnings.filterwarnings("ignore")
import re
import gc

<h2> Global Variables </h2>

In [None]:
LOOK_AT = 10
BCR_DISPLAY = True

<h2> Import Data </h2>

In [None]:
%%time

games_list = []
for dirname, _, filenames in os.walk("../input/1-million-games-from-chessgames/games"):
    for filename in filenames:
        try:
            games_list.append(pd.read_csv(os.path.join(dirname, filename)).drop("Unnamed: 0", axis=1))
        except Exception as e:
            print(filename, e)

In [None]:
df = pd.concat(games_list).reset_index(drop=True)
df

The following line of code collects all of the garbage. We will do this after each section to ensure the rest of the program has enough RAM to proceed.

In [None]:
gc.collect()

# Section 2: Quick Exploratory Data Analysis

For an extremely quick glance at the data, I love using ```ProfileReport``` from ```pandas_profiling```, an automatic EDA library that produces an interactive overview of the dataset, including its variables, interactions, correlations, and missing values. Feel free to take a look at the report shown below.

In [None]:
report = ProfileReport(df)
report

From this report, we can see that there are some missing cells which we need to account for in the next section, Data Preprocessing.

In [None]:
gc.collect()

# Section 3: Data Preprocessing

Before I continue with my analysis, I need to ensure that the data is clean and consistent. Doing so will minimize the number of bugs I encounter later in the notebook, which will in turn save me a tremendous amount of time debugging.

<h2> 1. Drop any row with missing values. </h2>

I do this by only keeping the rows of the DataFrame that have zero null values.

In [None]:
df = df.loc[np.count_nonzero(df.isnull(), axis=1) == 0]
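
As a quick sanity check, here is a minimal sketch on a made-up toy frame (the column values are invented for illustration) suggesting that the null-count filter above keeps the same rows as the built-in ```dropna()```.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the games table (values are made up).
toy = pd.DataFrame({
    "White Player": ["A", "B", None],
    "Move Count": [40, np.nan, 25],
})

# Same idea as the cell above: keep rows that contain zero null cells.
null_counts = np.count_nonzero(toy.isnull().to_numpy(), axis=1)
kept_by_count = toy.loc[null_counts == 0]

# Built-in equivalent.
kept_by_dropna = toy.dropna()

print(kept_by_count.equals(kept_by_dropna))  # True
```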

<h2> 2. Drop any row which has a move count of zero. </h2>

This is pretty simple; only include indices where the Move Count isn't zero.

In [None]:
df = df.loc[df['Move Count'] != 0]

<h2> 3. Fix PGN spacing. </h2>

I want the PGNs to have consistent spacing so the same function can be used to analyze the notation. In the original data, some games don't have a space after each move number. I change that through a regular expression statement. <a href="https://stackoverflow.com/a/29507362"> Link to original Stack Overflow post detailing the syntax used here. </a>

In [None]:
df['PGN'] = df['PGN'].map(lambda s : re.sub(r'\.(?! )', '. ', re.sub(r' +', ' ', str(s))))
df.head()['PGN']
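
To make the two substitutions easier to follow, here is a small standalone sketch of the same regular expressions applied to a made-up PGN fragment. The helper name ```normalize_pgn``` is hypothetical and only used for this illustration.

```python
import re

def normalize_pgn(pgn: str) -> str:
    """Collapse runs of spaces, then add a space after any '.' not followed by one."""
    single_spaced = re.sub(r" +", " ", pgn)
    return re.sub(r"\.(?! )", ". ", single_spaced)

print(normalize_pgn("1.e4  e5 2.Nf3 Nc6"))   # 1. e4 e5 2. Nf3 Nc6
print(normalize_pgn("1. e4 e5 2. Nf3 Nc6"))  # 1. e4 e5 2. Nf3 Nc6 (unchanged)
```

The negative lookahead ```(?! )``` is what keeps already well-spaced move numbers untouched.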

In [None]:
%%time

# Keep only the rows whose Elo values can be parsed as numbers.
include_list = []
for i in df.index:
    if i%100000 == 0:
        print(i)  # progress indicator
    tmp_ser = df.loc[i]
    try:
        float(tmp_ser['White Elo']), float(tmp_ser['Black Elo'])
        include_list.append(i)
    except (TypeError, ValueError):
        pass

In [None]:
df = df.loc[include_list]
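
The loop above keeps the rows where both Elo values parse as numbers. A possible vectorized alternative, sketched here on made-up data, is ```pd.to_numeric``` with ```errors="coerce"```. This is not what the notebook runs, and edge cases such as the literal string ```"nan"``` may be treated differently than with ```float()```.

```python
import pandas as pd

# Made-up ratings; "?" and "unrated" stand in for unparseable entries.
toy = pd.DataFrame({
    "White Elo": ["2500", "?", "2710"],
    "Black Elo": ["2480", "2600", "unrated"],
})

# Unparseable values become NaN, so notna() marks the rows that parsed.
white_ok = pd.to_numeric(toy["White Elo"], errors="coerce").notna()
black_ok = pd.to_numeric(toy["Black Elo"], errors="coerce").notna()

clean = toy.loc[white_ok & black_ok]
print(clean.index.tolist())  # [0]
```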

<h2> 4. Convert All Datatypes. </h2>

As a best practice, I convert the object data to string data. I do this after dropping null values so that the null check runs on the raw data, before any ```NaN``` is converted to the string dtype's ```<NA>``` marker.

In [None]:
for column in df.columns:
    try:
        df[column] = pd.to_numeric(df[column])
    except:
        df[column] = df[column].astype("string")
        
df.dtypes

In [None]:
gc.collect()

# Section 4: Adding Features

I could only extract a select few features from https://www.chessgames.com/. There are many other features that would be great to have, so let's add them here!

<h2> What features do we have now? </h2>

In [None]:
original_features = df.columns
print(original_features, "\n")
print(f"Number of Features: {len(df.columns)}")

<h2> 1. Adding the updated date + year. </h2>

I would like to add ```year``` as a feature to this dataset so, in the future, I can use it to extract time-series data using Pandas built-in functions like ```.agg()``` and ```.groupby()```. 

To do this, I first convert all of the ```?``` characters in the original dates to ```0```. Then, I extract the year of each game using the fact that all years in this dataset are 4 characters long (e.g. 1962, 2018).

In [None]:
df['Updated Date'] = df['Date'].str.replace('?', '0', regex=False)
df['Year'] = df['Updated Date'].str[:4].astype(int)
df.head()
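
Here is a minimal sketch of the two steps on made-up dates, assuming the dates look like ```YYYY.MM.DD``` with ```?``` for unknown parts. Note that a fully unknown date ends up as year 0.

```python
import pandas as pd

# Made-up dates in the assumed YYYY.MM.DD layout.
dates = pd.Series(["1962.05.13", "2018.??.??", "????.??.??"])

# Step 1: replace every literal "?" with "0" (regex=False treats "?" as plain text).
updated = dates.str.replace("?", "0", regex=False)

# Step 2: the first four characters are the year.
years = updated.str[:4].astype(int)

print(updated.tolist())  # ['1962.05.13', '2018.00.00', '0000.00.00']
print(years.tolist())    # [1962, 2018, 0]
```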

<h2> 1.5. Remove Games from the Year 1620. </h2>

There is a large gap between the earliest year recorded with a chess game, 1620, and the next year recorded with a chess game, 1834. Therefore, it seems reasonable to remove all games played in the year 1620 in this analysis.

In [None]:
unique_years = df['Year'].unique()
print(f"First year recorded: {unique_years[0]}, Second year recorded: {unique_years[1]}, Difference: {unique_years[1] - unique_years[0]} years.")

df = df.loc[df['Year'] != 1620]

<h2> 2. Adding opening names + opening moves. </h2>

I would like to convert the openings from ECO codes into more common opening names like the Queen's Gambit or King's Indian Attack so that those who are not familiar with ECO (I certainly am not) can understand the data. I also add the opening moves of each opening, which could potentially be useful.

I scraped the ```ECO -> Name -> Move``` relationship from https://www.chessgames.com/chessecohelp.html using a Python script. All that is left after that is to use the ```.map``` Series method, which looks up every ECO code in an ECO-indexed Series in one vectorized step.

In [None]:
openings = pd.read_csv("../input/1-million-games-from-chessgames/openings.csv").drop("Unnamed: 0", axis=1)
openings

In [None]:
eco_df = openings.set_index("ECO")
df['Opening Names'] = df['ECO'].map(eco_df['Opening Names'])
df['Opening Moves'] = df['ECO'].map(eco_df['Moves'])
df.head()
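
The lookup can be sketched on a tiny made-up table. The rows below are illustrative only, and ```Z99``` is a hypothetical code that is deliberately missing from the lookup so that the unmatched case is visible.

```python
import pandas as pd

# Illustrative lookup table (not the scraped openings.csv).
openings = pd.DataFrame({
    "ECO": ["B20", "D06"],
    "Opening Names": ["Sicilian", "Queen's Gambit"],
})
games = pd.DataFrame({"ECO": ["D06", "B20", "Z99"]})

# Index the names by ECO code, then map each game's code to its name.
lookup = openings.set_index("ECO")["Opening Names"]
games["Opening Names"] = games["ECO"].map(lookup)

print(games)
```

Codes that are absent from the lookup come back as ```NaN```, which is why games without a recorded ECO end up without an opening name.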

<h2> 3. Adding number of captures, promotions, checks, and checkmates. </h2>

The nice thing about doing this is that the characters ```x```, ```=```, ```+```, and ```#``` appear in the PGN only to symbolize captures, promotions, checks, and checkmates, respectively. Therefore, I will use the Series ```.str``` methods for quick vectorized counting.

In [None]:
df['Captures'] = df['PGN'].str.count("x")
df['Promotions'] = df['PGN'].str.count("=")
df['Checks'] = df['PGN'].str.count(r"\+")  # raw string: "+" must be escaped in a regex
df['Checkmate'] = df['PGN'].str.count("#").astype(bool)
df.head()
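
For a concrete feel of the counting, here is a small sketch on two short made-up games. It assumes, as in the text, that these symbols appear only in move notation.

```python
import pandas as pd

pgn = pd.Series([
    "1. e4 e5 2. Qh5 Nc6 3. Bc4 Nf6 4. Qxf7#",  # one capture, ends in mate
    "1. e4 d5 2. exd5 Qxd5 3. Nc3 Qe5+",        # two captures, one check
])

summary = pd.DataFrame({
    "Captures": pgn.str.count("x"),
    "Promotions": pgn.str.count("="),
    "Checks": pgn.str.count(r"\+"),  # "+" is a regex metacharacter, so escape it
    "Checkmate": pgn.str.count("#").astype(bool),
})
print(summary)
```

A mating move is written with ```#``` rather than ```+```, so it is counted as a checkmate but not as a check.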

<h2> 4. Adding First Moves + Removing Weird Games. </h2>

You'll see what I mean by weird games.

In [None]:
def extract_move(move_num, color="White"):
    """Function which extracts specific move. Not very efficient function but is good enough to get the job done right now."""
    
    if color.lower() != "white" and color.lower() != "black":
        raise Exception("Pass in a correct color (White/Black).")
        
    split_index = 3*move_num - 1 if color.lower() == "white" else 3*move_num
    move = [x[split_index-1]  if len(x) >= split_index else "No Move" for x in df['PGN'].str.split(' ', n=split_index).tolist()]
    return move
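
The index arithmetic is easier to see on a single string. This is a simplified sketch with a hypothetical helper, ```nth_move```, assuming the PGN has already been normalized to ```"<n>. <white> <black>"``` triples separated by single spaces.

```python
def nth_move(pgn: str, move_num: int, color: str = "white") -> str:
    """Return the given move for one PGN string, or "No Move" if the game is too short."""
    color = color.lower()
    if color not in ("white", "black"):
        raise ValueError("color must be 'white' or 'black'")
    tokens = pgn.split(" ")
    # Tokens repeat as: move number, White's move, Black's move.
    index = 3 * (move_num - 1) + (1 if color == "white" else 2)
    return tokens[index] if index < len(tokens) else "No Move"

game = "1. e4 e5 2. Nf3 Nc6"
print(nth_move(game, 1))           # e4
print(nth_move(game, 2, "black"))  # Nc6
print(nth_move(game, 3))           # No Move
```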

In [None]:
first_move = extract_move(1)
df['First Move'] = first_move
df.head()

In [None]:
unique_first_moves = df['First Move'].unique()
unique_first_moves

What on earth are moves like d4# and Kb7? I have no idea, so let's just remove that data.

In [None]:
LETTERS = "abcdefgh"
legal_first_moves = [f"{letter}3" for letter in LETTERS] + [f"{letter}4" for letter in LETTERS] + ["Na3", "Nc3", "Nf3", "Nh3"]
legal_first_moves

In [None]:
# Keep only the games whose first move is one of the 20 legal first moves.
df = df.loc[df['First Move'].isin(legal_first_moves)]

df.head()

# Section 4.5: Re-run the EDA With Our New Features

In [None]:
report = ProfileReport(df.drop(original_features, axis=1))
report

There are some games that don't have a recorded ECO, which means that those games also don't have an opening name or opening moves. Because there are so few games that have unrecorded ECO though, I opt not to fill in the opening names manually.

The last thing we need to do before working with the data is to remove extraneous years with a low number of chess games (e.g. games recorded under "Year 0"). I'll drop any year with 50 or fewer chess games played.

In [None]:
year_ser = df.groupby('Year').size()
for year in year_ser.keys():
    if year_ser.loc[year] <= 50:
        df = df.loc[df['Year'] != year]
        
df.head()
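
The same filter can be written without a loop by mapping each game's year to that year's game count. This is only a sketch on made-up data, and ```MIN_PER_YEAR``` is a hypothetical name with a toy value (the notebook's threshold is 50).

```python
import pandas as pd

toy = pd.DataFrame({"Year": [1900, 1900, 1900, 1901, 1902, 1902]})
MIN_PER_YEAR = 2  # toy threshold for this example

# Number of games played in each game's year.
games_in_year = toy["Year"].map(toy.groupby("Year").size())

# Keep only the years with more than MIN_PER_YEAR games.
kept = toy.loc[games_in_year > MIN_PER_YEAR]
print(kept["Year"].unique().tolist())  # [1900]
```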

In [None]:
gc.collect()

# Section 5: Start with the Setting

<h2> 1. Years with most chess played. </h2>

Games played:

In [None]:
fig = px.line(df.groupby('Year').size())
fig.update_layout(title={'text': f"Number of Chess Games Played Each Year", 'x': 0.5,
                         'xanchor': 'center', 'font': {'size': 20}}, showlegend=False, yaxis_title="Count")
fig.show()

Events hosted:

In [None]:
all_years = np.sort(df['Year'].unique())
year_event_df = df.groupby(['Year', 'Event']).count()

events_list = []
for year in all_years:
    events_list.append(len(year_event_df.loc[year]))
    
year_event_df = pd.Series(data=events_list, index=all_years)    
fig = px.line(year_event_df)
fig.update_layout(title={'text': f"Number of Events Hosted Each Year", 'x': 0.5,
                         'xanchor': 'center', 'font': {'size': 20}}, showlegend=False, yaxis_title="Count")
fig.show()
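
Counting distinct events per year can also be sketched with ```nunique()```, shown here on made-up rows. It should give the same counts as the loop above as long as the ```Event``` column has no missing values.

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [2000, 2000, 2000, 2001],
    "Event": ["Linares", "Linares", "Wijk aan Zee", "Linares"],
})

# Number of distinct events recorded in each year.
events_per_year = toy.groupby("Year")["Event"].nunique()
print(events_per_year.to_dict())  # {2000: 2, 2001: 1}
```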

<h2> 2. The most popular tournaments every year. </h2>

There are some big-name tournaments with some big-name players that are held every year: Linares, Wijk aan Zee (Tata Steel), London, etc. Let's take a look at them! Before we do though, let's drop the most popular events in this dataset which aren't actual tournaments.

In [None]:
# Note: I added this drop list halfway through the data gathering process; it is most likely unnecessary given the current state of the data
dropped_tours = ['corr', 'Match', 'Simul', 'Consultation game', '?', 'Unknown', 'Blindfold simul, 10b', 'Casual game', 'Blindfold simul, 8b']

<h3> Most Legendary Tournaments </h3>

These are the tournaments that have the most games played in them (and therefore the highest chance for hosting some historic games).

In [None]:
tour_df = pd.DataFrame(df.groupby('Event').size().drop(dropped_tours).sort_values(ascending=False), columns=["Games Played"])
fig = px.bar(tour_df[:LOOK_AT], y='Games Played', color='Games Played')
fig.update_layout(title={'text': f"Top {LOOK_AT} Tournaments in Terms of Games Played", 'x': 0.5,
                         'xanchor': 'center', 'font': {'size': 20}})
fig.show()

<h3> Most Longstanding Tournaments </h3>

These tournaments have been hosted for so many years it's hard to count (ok, not really, but they are very historic tournaments).

In [None]:
long_tour_df = df.groupby('Event').nunique().drop(dropped_tours).sort_values("Year", ascending=False).rename(columns={"Year": "Years Played"})
fig = px.bar(long_tour_df[:LOOK_AT], y='Years Played', color='Years Played')
fig.update_layout(title={'text': f"Top {LOOK_AT} Tournaments in Terms of Unique Years Hosted", 'x': 0.5,
                         'xanchor': 'center', 'font': {'size': 20}})
fig.show()

In [None]:
gc.collect()

# Section 6: What Players Do We Have in the Dataset?

The direction of a chess game is largely dependent on the players that are playing. Thus, it's important to know the distribution of players in the dataset.

<h2> 1. Visualization of number of chess games played by each player. </h2>

I will be going more in-depth with many of the visualizations, distributions, and analyses done here (which are similar to the initial EDA done in Section 2).

In [None]:
white_p_df = df[['White Player', 'Year', 'Move Count', 'Result']].rename(columns={"White Player": "Name"})
black_p_df = df[['Black Player', 'Year', 'Move Count', 'Result']].rename(columns={"Black Player": "Name"})
all_players_df = pd.concat((white_p_df, black_p_df)).reset_index(drop=True)

Distribution:

In [None]:
bin_width = 50
player_games = all_players_df.groupby('Name').size().to_numpy()

nbins = 2*math.ceil((player_games.max() - player_games.min()) / (bin_width))

fig = px.histogram(player_games, nbins=nbins)
fig.update_layout(title={'text': f"Distribution of Games the {len(player_games)} Unique Players Play", 'x': 0.5,
                         'xanchor': 'center', 'font': {'size': 20}}, xaxis_title="Number of Games", yaxis_title="Count (log scale)", showlegend=False)
fig.update_yaxes(type="log")
fig.show()

By most games played:

In [None]:
player_games_df = pd.DataFrame(all_players_df.groupby('Name').size().sort_values(ascending=False), columns=["Count"])
fig = px.bar(player_games_df[:LOOK_AT], y="Count", color="Count")
fig.update_layout(title={'text': f"Top {LOOK_AT} Players With Most Chess Games Played", 'x': 0.5,
                         'xanchor': 'center', 'font': {'size': 20}})
fig.show()

In [None]:
all_players_df['Games Played'] = all_players_df['Name'].map(all_players_df.groupby('Name').size())
all_players_df

<h2> 2. Visualization of number of unique years played by each player. </h2>

Note that in this analysis we drop ```NN``` (No Name), which is not a real player.

In [None]:
player_year_df = all_players_df.groupby(['Name', 'Year']).size()
player_name_list = all_players_df['Name'].unique()
years_list = []
for name in player_name_list:
    years_list.append(len(player_year_df[name]))
    
years_played_df = pd.DataFrame(years_list, index=player_name_list, columns=["Count"]).sort_values("Count", ascending=False).drop("NN")
years_played_df

In [None]:
fig = px.histogram(years_played_df)
fig.update_layout(title={'text': f"Distribution of Number of Years Each Player Plays", 'x': 0.5,
                         'xanchor': 'center', 'font': {'size': 20}}, showlegend=False)
fig.show()

In [None]:
fig = px.bar(years_played_df[:LOOK_AT], y="Count", color="Count")
fig.update_layout(title={'text': f"Top {LOOK_AT} Players With Most Years Played", 'x': 0.5,
                         'xanchor': 'center', 'font': {'size': 20}}, xaxis_title="")
fig.show()

Vasily Smyslov, an elite Russian grandmaster, holds the title for the player with the most active years, with a game recorded for him in 63 unique years (out of the 89 years he lived!). Some other top grandmasters are not far behind him.

<h2> 3. Which players have the most wins/win the most? </h2>

<h3> By total wins: </h3>

With the White pieces:

In [None]:
white_df = pd.DataFrame(df.loc[df['Result'] == 'White Wins'].groupby('White Player').size().sort_values(ascending=False), columns=['Count'])
fig = px.bar(white_df[:LOOK_AT], y="Count", color="Count")
fig.update_layout(title={'text': f"Top {LOOK_AT} Players With the Most Wins with White", 'x': 0.5,
                         'xanchor': 'center', 'font': {'size': 20}}, xaxis_title="")
fig.show()

With the Black pieces:

In [None]:
black_df = pd.DataFrame(df.loc[df['Result'] == 'Black Wins'].groupby('Black Player').size().sort_values(ascending=False), columns=['Count'])
fig = px.bar(black_df[:LOOK_AT], y="Count", color="Count")
fig.update_layout(title={'text': f"Top {LOOK_AT} Players With the Most Wins with Black", 'x': 0.5,
                         'xanchor': 'center', 'font': {'size': 20}}, xaxis_title="")
fig.show()

Draw:

In [None]:
draw_df = pd.DataFrame(all_players_df.loc[all_players_df['Result'] == 'Draw'].groupby('Name').size().sort_values(ascending=False), columns=['Count'])
fig = px.bar(draw_df[:LOOK_AT], y="Count", color="Count")
fig.update_layout(title={'text': f"Top {LOOK_AT} Players With the Most Draws", 'x': 0.5,
                         'xanchor': 'center', 'font': {'size': 20}}, xaxis_title="")
fig.show()

In [None]:
results_comb_df = pd.concat([white_df, black_df, draw_df], axis=1)
results_comb_df.columns = ['White', 'Black', 'Draw']
results_comb_df

<h3> By percentage: </h3>

Condition: ```At least 100 games won with one color.```

In [None]:
MIN_VALUE = 100
PERC_SHOWN = 50

In [None]:
white_perc_df = (white_df.loc[white_df['Count'] >= MIN_VALUE]['Count'])/(df.groupby('White Player').size())
white_perc_df = pd.DataFrame(100*white_perc_df.dropna().sort_values(ascending=False).round(5), columns=['Percentage'])
white_perc_df['Count'] = df.groupby('White Player').size()
tmp_white_df = white_perc_df.loc[white_perc_df['Percentage'] >= PERC_SHOWN]
fig = px.bar(tmp_white_df, y="Percentage", color="Percentage", hover_data=["Count"])
fig.update_layout(title={'text': f"The {len(tmp_white_df)} Players with More Than a {PERC_SHOWN}% White Win Rate", 'x': 0.5,
                         'xanchor': 'center', 'font': {'size': 20}}, xaxis_title="")
fig.show()
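
The percentage calculation relies on pandas aligning the two Series by player name. Here is a small sketch with made-up results, where ```MIN_WINS``` is a hypothetical stand-in for the notebook's minimum.

```python
import pandas as pd

toy = pd.DataFrame({
    "White Player": ["A", "A", "A", "A", "B", "B"],
    "Result": ["White Wins", "White Wins", "Draw", "White Wins",
               "Black Wins", "White Wins"],
})
MIN_WINS = 2  # toy threshold for this example

wins = toy.loc[toy["Result"] == "White Wins"].groupby("White Player").size()
games = toy.groupby("White Player").size()

# Players below the minimum are missing from the numerator, so the division
# yields NaN for them and dropna() removes them.
win_pct = (100 * wins[wins >= MIN_WINS] / games).dropna()
print(win_pct.to_dict())  # {'A': 75.0}
```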

In [None]:
black_perc_df = (black_df.loc[black_df['Count'] >= MIN_VALUE]['Count'])/(df.groupby('Black Player').size())
black_perc_df = pd.DataFrame(100*black_perc_df.dropna().sort_values(ascending=False).round(5), columns=['Percentage'])
black_perc_df['Count'] = df.groupby('Black Player').size()
tmp_black_df = black_perc_df.loc[black_perc_df['Percentage'] >= PERC_SHOWN]
fig = px.bar(tmp_black_df, y="Percentage", color="Percentage", hover_data=["Count"])
fig.update_layout(title={'text': f"The {len(tmp_black_df)} Players with More Than a {PERC_SHOWN}% Black Win Rate", 'x': 0.5,
                         'xanchor': 'center', 'font': {'size': 20}}, xaxis_title="")
fig.show()

Chess players clearly win more when they are white compared to when they are black!

<h2> 4. Who is the "best" player? </h2>

Who is the best player, not according to Elo?

Condition: ```Minimum played games is 500.```

Raw score is calculated by wins minus losses.

In [None]:
MIN_GAMES_PLAYED = 500

In [None]:
all_colors_df = pd.concat([white_df, black_df, draw_df], axis=1)
all_colors_df.columns = ["White Win", "Black Win", "Draw"]
all_colors_df['White Loss'] = df.loc[df['Result'] == 'Black Wins'].groupby('White Player').size()
all_colors_df['Black Loss'] = df.loc[df['Result'] == 'White Wins'].groupby('Black Player').size()
all_colors_df['Total Games'] = all_colors_df.sum(axis=1)
all_colors_df.fillna(0, inplace=True)
all_colors_df = all_colors_df.loc[all_colors_df['Total Games'] >= MIN_GAMES_PLAYED]
all_colors_df

In [None]:
all_colors_df['Score'] = all_colors_df['White Win'] + all_colors_df['Black Win'] - all_colors_df['White Loss'] - all_colors_df['Black Loss']
all_colors_df.sort_values("Score", ascending=False, inplace=True)
fig = px.bar(all_colors_df[:LOOK_AT], y="Score", color="Score", hover_data=["White Win", "Black Win", "White Loss", "Black Loss", "Total Games"])
fig.update_layout(title={'text': f"Top {LOOK_AT} Players by Raw Score", 'x': 0.5,
                         'xanchor': 'center', 'font': {'size': 20}}, xaxis_title="")
fig.show()

Proportion score is calculated by dividing a player's raw score by the total number of games they played.

In [None]:
all_colors_df['Prop'] = all_colors_df['Score']/all_colors_df['Total Games']
all_colors_df.sort_values("Prop", ascending=False, inplace=True)
fig = px.bar(all_colors_df[:LOOK_AT], y="Prop", color="Prop", hover_data=["Score"])
fig.update_layout(title={'text': f"Top {LOOK_AT} Players by Proportion Score", 'x': 0.5,
                         'xanchor': 'center', 'font': {'size': 20}}, xaxis_title="")
fig.show()
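
As a worked example of both scores, here is a sketch with two hypothetical players and made-up records.

```python
import pandas as pd

record = pd.DataFrame(
    {
        "White Win": [200, 50],
        "Black Win": [100, 40],
        "Draw": [200, 300],
        "White Loss": [60, 45],
        "Black Loss": [40, 65],
    },
    index=["Player A", "Player B"],  # hypothetical players
)

record["Total Games"] = record.sum(axis=1)
# Raw score: wins minus losses.
record["Score"] = (record["White Win"] + record["Black Win"]
                   - record["White Loss"] - record["Black Loss"])
# Proportion score: raw score divided by total games.
record["Prop"] = record["Score"] / record["Total Games"]

print(record[["Total Games", "Score", "Prop"]])
```

Player A scores 300 - 100 = 200 over 600 games (about 0.33), while Player B scores 90 - 110 = -20 over 500 games (-0.04). Draws lower the proportion score without changing the raw score.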

All of these players are elite and masters at the art of winning.

<h2> 5. Adding Elo to the equation. </h2>

We don't add Elo as a feature in Section 4 since Elo is more relevant when analyzing the dataset on a player-to-player basis. Sporadic changes in Elo can occur after each chess game, which is hard to account for with a sparse dataset of chess ratings.

In [None]:
ratings_list = []
for dirname, _, filenames in os.walk("../input/1-million-games-from-chessgames/ratings"):
    for filename in filenames:
        try:
            tmp_df = pd.read_csv(os.path.join(dirname, filename))
            tmp_df = tmp_df.set_index("Unnamed: 0")
            ratings_list.append(tmp_df)
        except Exception as e:
            print(filename, e)

Using the dataset of ratings (from 1843 to 2005):

In [None]:
%%time

rating_df = pd.concat(ratings_list, axis=1).sort_index()
rating_df.index.name = "Date"
rating_df['Year'] = rating_df.index.str.slice(0, 4)
rating_df.index = rating_df.index.astype("datetime64[ns]")
#rating_df = rating_df.interpolate('linear', axis=0, limit=1)
rating_df = rating_df.groupby('Year').mean()
rating_df

In [None]:
rank_players_df = pd.DataFrame(rating_df.columns.values[np.argsort(-rating_df.values, axis=1)[:, :LOOK_AT]], 
                  index=rating_df.index)
rank_players_df
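
The ranking trick above sorts the negated ratings row by row and uses the resulting positions to pick column names. Here is a minimal sketch with made-up ratings, where ```TOP_K``` is a hypothetical stand-in for ```LOOK_AT```.

```python
import numpy as np
import pandas as pd

ratings = pd.DataFrame(
    {
        "Player A": [2700.0, 2720.0],
        "Player B": [2750.0, np.nan],
        "Player C": [2600.0, 2780.0],
    },
    index=["2000", "2001"],
)
TOP_K = 2

# Negating the values turns an ascending argsort into a descending ranking;
# NaN entries are sorted to the end of each row.
order = np.argsort(-ratings.values, axis=1)[:, :TOP_K]
top_names = pd.DataFrame(ratings.columns.values[order], index=ratings.index)
print(top_names)
```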

In [None]:
for j in rank_players_df.columns:
    tmp_rating_list = []
    tmp_df = rank_players_df.loc[:, j]
    for i in rank_players_df.index:
        tmp_rating_list.append(rating_df.loc[i, tmp_df.loc[i]])
        
    rank_players_df[f"Rating {j}"] = tmp_rating_list
    
rank_players_df

Using ratings from the chessgames dataset:

Condition: ```Minimum games played that year is 10.```

In [None]:
MIN_GAMES = 10

In [None]:
df['White Elo'] = df['White Elo'].replace(-1, np.nan)
df['Black Elo'] = df['Black Elo'].replace(-1, np.nan)
df.head()

In [None]:
# Select the Elo column before averaging so non-numeric columns are never aggregated.
white_rating_df = df.groupby(['Year', 'White Player'])['White Elo'].mean()
black_rating_df = df.groupby(['Year', 'Black Player'])['Black Elo'].mean()
games_played_by_white = df.groupby(['Year', 'White Player']).size()
games_played_by_black = df.groupby(['Year', 'Black Player']).size()

other_ratings_list = []
for year in df['Year'].unique():
    if year >= 2006:
        white_ser = white_rating_df.loc[year]
        black_ser = black_rating_df.loc[year]
        comb_ser = (white_ser + black_ser)/2
        comb_ser.rename(year, inplace=True)
        other_ratings_list.append(comb_ser.loc[games_played_by_white.loc[year] + games_played_by_black.loc[year] >= MIN_GAMES])
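
One detail of the averaging step is worth illustrating: adding two Series aligns them by player name, so a player who only appears with one color in a year gets ```NaN```. A small sketch with made-up averages:

```python
import pandas as pd

# Made-up yearly average ratings for three hypothetical players.
white_mean = pd.Series({"A": 2700.0, "B": 2600.0})
black_mean = pd.Series({"A": 2710.0, "C": 2500.0})

# Index alignment: only "A" appears in both Series.
combined = (white_mean + black_mean) / 2
print(combined)
```

Here only player A gets a value (2705.0). Players B and C come out as ```NaN``` because each has a rating for just one color.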

In [None]:
other_ratings_df = pd.concat(other_ratings_list, axis=1).T.sort_index()
other_ratings_df.index.rename("Year", inplace=True)
other_ratings_df.drop(["Liren Ding", "Vachier-Lagrave, Maxime"], axis=1, inplace=True)
other_ratings_df = other_ratings_df.loc[:, ~other_ratings_df.columns.str.contains("Computer")]
other_rankings_df = pd.DataFrame(other_ratings_df.columns.values[np.argsort(-other_ratings_df.values, axis=1)[:, :LOOK_AT]], 
                  index=other_ratings_df.index)
other_rankings_df

In [None]:
for j in range(len(other_rankings_df.iloc[0])):
    tmp_list = []
    for i in other_rankings_df.index:
        tmp_list.append(other_ratings_df.loc[i].loc[other_rankings_df.loc[i][j]])
        
    other_rankings_df[f'Rating {j}'] = tmp_list
    
other_rankings_df

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way.
all_rank_players_df = pd.concat([rank_players_df, other_rankings_df])
all_rank_players_df

In [None]:
fig = go.Figure()
for i in range(LOOK_AT):
    poss = all_rank_players_df[f'Rating {i}'] != 0
    fig.add_trace(go.Scatter(x=all_rank_players_df.index[poss], y=all_rank_players_df[f"Rating {i}"][poss], text=all_rank_players_df[i][poss], name=f"#{i+1}"))

    
fig.update_layout(hovermode='x unified', title={'text': f"Top {LOOK_AT} Players Each Year by Rating", 'x': 0.5,
                         'xanchor': 'center', 'font': {'size': 20}}, xaxis_title="")
fig.show()

In [None]:
%time highest_rating_df = rating_df.apply(pd.Series.nlargest, axis=1, n=1)
fig = px.line(highest_rating_df)
fig.update_layout(hovermode='x unified', title={'text': f"Players with the Highest Rating up until 2006", 'x': 0.5,
                         'xanchor': 'center', 'font': {'size': 20}}, xaxis_title="")
fig.show()

**Possibility**: Use ARIMA/time-series model to predict rating distributions of players.

In [None]:
comb_rating_list = []

for i in range(len(rating_df)):
    comb_rating_list.append(rating_df.iloc[i][rank_players_df.iloc[i][:LOOK_AT]])
    
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way.
comb_rating_df = pd.concat([pd.concat(comb_rating_list, axis=1).T, other_ratings_df]).fillna(0)
comb_rating_df.index = comb_rating_df.index.astype("string").astype("datetime64[ns]")
comb_rating_df

In [None]:
if BCR_DISPLAY:
    bcr_elo = bcr.bar_chart_race(
        df=comb_rating_df, 
        filename="ratings.mp4",
        n_bars=5,
        interpolate_period=True,
        steps_per_period=12, 
        title="Top 5 Chess Players by Rating",
        period_fmt='%b %-d, %Y'
    )

In [None]:
gc.collect()

# Section 7. Opening Analysis

<h2> 1. First Chess Move Over Time. </h2>

In [None]:
first_moves_list = []
moves_names_list = []
first_move_df = df.groupby(['First Move', 'Year']).size()
for move in legal_first_moves:
    first_moves_list.append(first_move_df[move])
    moves_names_list.append(move)
    
moves_df = pd.concat(first_moves_list, axis=1)
moves_df.columns = moves_names_list

reindex_columns = moves_df.sum().sort_values(ascending=False).keys().tolist()
moves_df = moves_df.reindex(reindex_columns, axis=1)
moves_df

Distribution of first moves:

In [None]:
fig = px.histogram(x=moves_df.columns, y=moves_df.sum())
fig.update_layout(title={'text': f"Total Times First Moves Have Been Played", 'x': 0.5,
                         'xanchor': 'center', 'font': {'size': 20}})
fig.show()

Total number over time:

In [None]:
fig = px.line(moves_df)
fig.update_layout(title={'text': f"History of Chess' First Move", 'x': 0.5,
                         'xanchor': 'center', 'font': {'size': 20}}, yaxis_title="Number of Games")
fig.show()

First moves as a proportion to other first moves that year:

In [None]:
proportion_df = moves_df.divide(moves_df.sum(axis=1), axis=0)
proportion_df.fillna(0, inplace=True)
fig = px.line(proportion_df)
fig.update_layout(title={'text': f"Proportion of First Moves Per Year", 'x': 0.5,
                         'xanchor': 'center', 'font': {'size': 20}}, yaxis_title="Proportion of Games")
fig.show()
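
The row-wise normalization can be checked on a tiny made-up table: each year's counts are divided by that year's total, so every row sums to 1.

```python
import pandas as pd

# Made-up first-move counts per year.
moves = pd.DataFrame({"e4": [30, 10], "d4": [10, 10]}, index=[1900, 1901])

# axis=0 aligns the row totals with the row index (the years).
proportion = moves.divide(moves.sum(axis=1), axis=0)
print(proportion)
```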

In [None]:
moves_df.fillna(0, inplace=True)
moves_cum = moves_df.cumsum()[1:] # Drops the first (earliest) row, which is kinda glitchy
moves_cum.index = moves_cum.index.astype("string").astype("datetime64[ns]")
moves_cum
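
Here is a sketch of the cumulative step on made-up counts. It uses ```pd.to_datetime``` with an explicit ```%Y``` format as one possible way to build the datetime index that ```bar_chart_race``` is given here. This is an alternative to the ```astype``` chain above, not a claim about which is better.

```python
import pandas as pd

# Made-up yearly first-move counts.
moves = pd.DataFrame(
    {"e4": [30.0, 10.0, 20.0], "d4": [10.0, 10.0, 5.0]},
    index=[1900, 1901, 1902],
)

# Running totals over the years.
cumulative = moves.cumsum()

# Turn integer years into timestamps (January 1st of each year).
cumulative.index = pd.to_datetime(cumulative.index.astype(str), format="%Y")
print(cumulative)
```

Nanosecond-resolution timestamps in pandas only cover roughly the years 1678 to 2261, so much earlier years cannot be converted this way.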

In [None]:
if BCR_DISPLAY:
    bcr_moves = bcr.bar_chart_race(
        df=moves_cum, 
        filename="moves.mp4",
        interpolate_period=True,
        steps_per_period=12, 
        title="First Move Bar Chart Race",
        period_fmt='%b %-d, %Y'
    )

<h2> 2. Openings Over Time. </h2>

In [None]:
opening_df = pd.DataFrame(df.groupby('Opening Names').size().sort_values(ascending=False), columns=["Count"]).reset_index()
fig = px.bar(opening_df[:LOOK_AT], x="Opening Names", y="Count", color="Count")
fig.update_layout(title={'text': "Total Openings Played", 'x': 0.5,
                         'xanchor': 'center', 'font': {'size': 20}})
fig.show()

In [None]:
yearly_opening_df = df.groupby(['Opening Names', 'Year']).size()
unique_openings = df['Opening Names'].unique()
openings_list = []
used_openings = []
for opening in unique_openings:
    try:
        openings_list.append(yearly_opening_df[opening])
        used_openings.append(opening)
    except:
        pass
    
total_opening_df = pd.concat(openings_list, axis=1)
total_opening_df.columns = used_openings
total_opening_df.fillna(0, inplace=True)
total_opening_df

In [None]:
opening_cum = total_opening_df.cumsum()[1:]
opening_cum.index = opening_cum.index.astype("string").astype("datetime64[ns]")
opening_cum

In [None]:
if BCR_DISPLAY:
    bcr_opening = bcr.bar_chart_race(
        df=opening_cum, 
        filename="opening.mp4",
        n_bars=10,
        interpolate_period=True,
        steps_per_period=12, 
        title="Top 10 Openings Bar Chart Race",
        period_fmt='%b %-d, %Y'
    )

<h2> 3. Best Openings by Win Rate. </h2>

Condition: ```Opening must be played at least 100 times.```

In [None]:
MIN_OPENINGS_PLAYED = 100

In [None]:
winrate_opening_df = df.groupby(['Result', 'Opening Names']).size()
opening_total_games = df.groupby(['Opening Names']).size()
winrate_opening_df

Most white wins:

In [None]:
white_opening_df = pd.DataFrame((winrate_opening_df.loc['White Wins']/opening_total_games*100).loc[opening_total_games.values >= MIN_OPENINGS_PLAYED], 
                                columns=["White Percentage Won"]).round(3)
white_opening_df['Total Games'] = opening_total_games
white_opening_df.sort_values("White Percentage Won", ascending=False, inplace=True)

fig = px.bar(white_opening_df[:LOOK_AT], y="White Percentage Won", color="White Percentage Won", hover_data=["Total Games"])
fig.update_layout(title={'text': f"Top {LOOK_AT} Openings by White Winning Percentage", 'x': 0.5,
                         'xanchor': 'center', 'font': {'size': 20}}, xaxis_title="")
fig.show()

Most black wins:

In [None]:
black_opening_df = pd.DataFrame((winrate_opening_df.loc['Black Wins']/opening_total_games*100).loc[opening_total_games.values >= MIN_OPENINGS_PLAYED], 
                                columns=["Black Percentage Won"]).round(3)
black_opening_df['Total Games'] = opening_total_games
black_opening_df.sort_values("Black Percentage Won", ascending=False, inplace=True)

fig = px.bar(black_opening_df[:LOOK_AT], y="Black Percentage Won", color="Black Percentage Won", hover_data=["Total Games"])
fig.update_layout(title={'text': f"Top {LOOK_AT} Openings by Black Winning Percentage", 'x': 0.5,
                         'xanchor': 'center', 'font': {'size': 20}}, xaxis_title="")
fig.show()
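
For comparison, the same kind of win-rate table can be sketched with ```pd.crosstab```, shown here on made-up games. This is an alternative formulation rather than the code used above, and it leaves out the minimum-games condition.

```python
import pandas as pd

toy = pd.DataFrame({
    "Opening Names": ["Sicilian"] * 4 + ["French Defense"] * 2,
    "Result": ["White Wins", "Black Wins", "Draw", "White Wins",
               "Draw", "Black Wins"],
})

# normalize="index" divides each row by its total, giving per-opening rates.
rates = 100 * pd.crosstab(toy["Opening Names"], toy["Result"], normalize="index")
print(rates)
```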

In [None]:
gc.collect()

# Exporting the new DataFrame

Feel free to download and use this DataFrame instead of the raw data!

In [None]:
df.to_csv("million_chessgames.csv")

# Section 8. The End?

If you've read down this far in the notebook, thank you so much. This notebook took quite a long time to make, and to be honest, I'm releasing it with far less than I planned to have. I'll link the document I used to outline this notebook <a href="https://docs.google.com/document/d/1nMDrfFCiMltSBMS7EiwOq715AFer8ywRrXJxwnmDSeU/edit">here</a>... feel free to use my ideas to further explore the data!

I'm also going to plug <a href="https://github.com/IronicNinja/chessgames_scraping">my GitHub repo</a> here, which I used for scraping the data for this notebook.

Finally, if you liked my work, please leave a like, comment, and maybe drop a follow! I'd really appreciate it :)