# Chess Games Exploration
## by Yasser Ali

## Preliminary Wrangling

> This is an exploratory analysis of a dataset containing information about **over 20,000 games** collected from a selection of users on the site **[Lichess](https://www.lichess.org)**.

In [None]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from matplotlib import rcParams

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
rcParams['figure.figsize'] = 10,6
base_color = sb.color_palette()[0]

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# loading data into a pandas dataframe
games = pd.read_csv("/kaggle/input/chess/games.csv")

In [None]:
# overview of data
print(games.shape)
print(games.dtypes)
print(games.head(10))

In [None]:
# splitting increment_code into base_time and increment
games['base_time']= games['increment_code'].str.split('+',expand=True)[0].astype(int)
games['increment']= games['increment_code'].str.split('+',expand=True)[1].astype(int)
# removing unused columns in this analysis
games.drop(columns= ['id','created_at','last_move_at','increment_code','white_id','black_id','moves','opening_name','opening_ply'], inplace=True)

In [None]:
# overview of dataset (cont.)
games.info()

In [None]:
# overview of dataset (cont.)
games.describe()

According to [chess.com](https://support.chess.com/article/330-why-are-there-different-ratings-in-live-chess#:~:text=Live%20Chess%20has%20three%20different,games%2010%20minutes%20and%20longer.), live Chess has three different ratings based on different time controls: 
- **Bullet rating** - For games under 3 minutes. 
- **Blitz rating** - For games over 3 minutes but under 10 minutes. 
- **Rapid rating** - For games 10 minutes and longer.

The system assumes an **average game length of 40 moves** to estimate the total time available to each player; the result determines which rating is affected by a given time control.

So, we can define a column **`game_rating`** that contains the rating category based on **`base_time`** (minutes) and **`increment`** (seconds), with an estimated game length of 40 moves.

$$ \text{Estimated Game Time (minutes)} = \text{BaseTime} + \frac{\text{Increment} \times 40}{60} $$
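
As a worked example of the formula above, the sketch below computes the estimated time for a few increment codes and maps each one to a category. The helper names `estimated_minutes` and `rating_category` are hypothetical and exist only for this illustration, the sample codes are made up, and the cut-offs are the 3 and 10 minute thresholds quoted above. The sketch skips the rounding to the nearest 0.5 that the notebook applies.

```python
def estimated_minutes(increment_code, moves=40):
    """Estimated time per player, in minutes, for a code like '5+5'."""
    base, increment = (int(part) for part in increment_code.split('+'))
    # base is in minutes, increment is in seconds per move
    return base + increment * moves / 60


def rating_category(minutes):
    """Map an estimated time in minutes to bullet, blitz or rapid."""
    if minutes < 3:
        return 'bullet'
    if minutes < 10:
        return 'blitz'
    return 'rapid'


# Made-up time controls, for illustration only
for code in ['1+0', '5+5', '10+0', '15+15']:
    minutes = estimated_minutes(code)
    print(code, round(minutes, 2), rating_category(minutes))
```

For instance, a `5+5` game gets 5 minutes plus 40 increments of 5 seconds, which is a little over 8 minutes and therefore counts as blitz.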

In [None]:
# the equation for calculating estimated game time (rounding to nearest 0.5)
games['estimated_time'] = round((games.base_time + games.increment*40/60) * 2)/2

# initiate the game_rating column with the values of estimated time
games['game_rating'] = games['estimated_time']

# replacing estimated game time with associated game rating
games['game_rating'] = games['game_rating'].apply(lambda x : 'bullet' if x<3 
                                                  else ('blitz' if x<10 else 'rapid'))

In [None]:
# converting game_rating data type to ordered categorical data type 
ratings = ['bullet', 'blitz', 'rapid']
ratings_cat = pd.api.types.CategoricalDtype(categories= ratings, ordered=True)
games['game_rating'] = games['game_rating'].astype(ratings_cat)

Now, we have columns for estimated game time and game rating. We don't need time control columns anymore.

In [None]:
# dropping columns
games.drop(columns=['base_time', 'increment'], inplace = True)

In [None]:
# final overview of dataset
print(games.shape)
print(games.dtypes)
games.head(15)

### What is the structure of your dataset?

This is a part of a dataset that can be found [here](https://www.kaggle.com/datasnaek/chess) and it contains information about the most recent 20,000 games (at its time) played by the top 100 teams on [Lichess](https://www.lichess.org).

**rated** (T/F): If true it means that the game is rated and players' ratings increase or decrease depending on the final result.<br>
**turns**: Number of total turns until the game ends.<br>
**victory_status**: How the game ended: out of time, draw, resign, or mate.<br>
**winner**: The winner of the game; player as black or as white.<br>
**white_rating**, **black_rating**: Players' ratings on Lichess.<br>
**opening_eco**: Standardised code for the opening played (full list [here](https://www.365chess.com/eco.php)); it is the opening made by the winner, or by the player as white if the game ends in a draw.<br>
**estimated_time**: Estimated game time.<br>
**game_rating**: Based on time controls, there are three possible ratings; bullet, blitz, rapid.<br>

### What is/are the main feature(s) of interest in your dataset?

I'm interested in what makes a player, as black or white, **win** the game and whether the player as white always has the advantage. It is also interesting to know which attributes are associated with each **final game result**.

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

I expect that factors like the rating difference between players, the opening played, and whether the player plays as white or black could have the most effect on the final result. It is also possible that different game ratings/types have a significant effect, as players may adapt differently to each type.

## Univariate Exploration

**1. Victory Status** <br>

 
  Let's start with one of the variables of most interest.

In [None]:
# distribution of game status in a bar chart
status_order = games['victory_status'].value_counts().index
sb.countplot(data=games, x='victory_status', color=base_color, order=status_order);
plt.xlabel('Victory Status');
print(games['victory_status'].value_counts())

Resignation is the most common outcome, followed by mate; out-of-time and drawn games occur much less often.<br>

It is not unusual that most of the games end with one of the players resigning: these games are played by members of the top 100 teams on Lichess, who can calculate the next possible moves and recognize a lost position before it plays out.<br>

> Since I'm interested in wins and the factors affecting them, whether the win comes by mate or by the other player resigning, I will drop the rows of the two other results: draw and out of time.<br>

In [None]:
games = games.loc[(games['victory_status'].isin(['resign','mate']))]  # keep rows of winnings only
games.reset_index(drop=True, inplace=True)   # resetting index to fill gaps in the index

In [None]:
# quick overview of dataset
print(games.shape)
print(games.describe())

**2. Rated?**

In [None]:
# since there are only two possible values of 'rated' column, I'll investigate them in a pie chart
rated_counts = games['rated'].apply(lambda x: 'Rated' if x==True else 'Unrated').value_counts()

def pie_values(pct):
    absolute = int(round(pct/100*games.shape[0],0))
    return "{}\n\n{:0.0f}%".format(absolute,pct)
def pie_abs(pct):
    absolute = int(round(pct/100*games.shape[0],0))
    return "{}".format(absolute)

plt.pie(data=games, x= rated_counts, labels= rated_counts.index,
        startangle=90, counterclock=False,autopct=pie_values, textprops=dict(fontsize=15),
        colors=['darkseagreen','coral'])
plt.axis('square');

Most of the games in the dataset are rated, meaning that players' ratings are affected by the result, which makes them competitive games. On the other hand, there are far fewer unrated games, which are played for practice or for fun.<br>

Many factors affecting the game result are to be investigated, so it might be a good idea to consider only rated games, as they are meant to be competitive and each player is motivated to win.

In [None]:
# keeping rated games only; .copy() avoids modifying a slice of the original frame in place
games = games[games['rated']==True].copy()

games.drop(columns='rated', inplace=True) # dropping 'rated' column as it contains a single value

games.reset_index(drop=True, inplace=True)   # resetting index to fill gaps in the index

In [None]:
# overview of dataset
print('Number of rows = {:,.0f} rows'.format(games.shape[0]))
games.head()

Now we have 14,115 rated games with a winner in each.

**3. Winner**<br>

In [None]:
# plotting bar chart for the distribution of the winners; white and black
sb.countplot(data=games, x='winner', palette=['silver','black'], order=['white','black']);
games.winner.value_counts()

The player as white has slightly more wins than the player as black, but this alone doesn't establish that white has an advantage over black.

**4. Game Rating/Category**

In [None]:
# The game_rating column contains at most 3 categories (bullet, blitz, rapid), so we can plot it as a pie chart.

# getting count for each rating and account for categories with zero count if any.
games['game_rating'] = games['game_rating'].cat.remove_unused_categories()
game_rating_count = games['game_rating'].value_counts() 

# plotting pie chart
plt.pie(data=games, x= game_rating_count, labels= game_rating_count.index,
        startangle=90, counterclock=False,autopct=pie_abs, textprops=dict(fontsize=15),
        colors=['darkseagreen','orange'])
plt.axis('square');

It looks like there are no bullet games in this dataset. Most of the games are rapid games, with an estimated time of 10 minutes or more.

**5. Opening ECO**

In [None]:
games['opening_eco'].nunique() # number of unique opening ECOs

There are 341 unique openings, which makes them hard to investigate individually.<br>

According to [365Chess](https://www.365chess.com/eco.php), those openings are divided into five volumes labeled "A" through "E". So, I will use those five volumes instead of the individual opening codes.
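
To illustrate the grouping described above, here is a minimal sketch that reduces a handful of ECO codes to their volume letter and counts them. The codes in `sample_codes` are made up for illustration; in the notebook the values come from the `opening_eco` column.

```python
from collections import Counter

# Made-up ECO codes; the real values come from the opening_eco column
sample_codes = ['C20', 'B01', 'A00', 'C50', 'D02', 'E61', 'C41']

# The volume is the leading letter of each code
volumes = [code[0] for code in sample_codes]
volume_counts = Counter(volumes)

for volume in 'ABCDE':
    print(volume, volume_counts[volume])
```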

In [None]:
# keeping volume labels and remove the rest of the codes
games['opening_eco']= games['opening_eco'].apply(lambda x: x[0]) 

# converting opening_eco data type to ordered categorical data type (alphabetically)
openings = ['A', 'B', 'C', 'D', 'E']
openings_cat = pd.api.types.CategoricalDtype(categories= openings, ordered=True)
games['opening_eco'] = games['opening_eco'].astype(openings_cat)


# plot
sb.countplot(data=games, y= 'opening_eco', color= base_color);
plt.ylabel('Opening ECO');
plt.xlabel('Count');

Openings in volume C are the most frequent followed by B and A. Volume D occurs much less and volume E openings are rarely used.

**6. Player Rating** *(White & Black)*

In [None]:
games[['white_rating','black_rating']].describe()

In [None]:
# plotting white rating side by side with black rating

# constructing bins
step = 40
bins_rating = np.arange(780, 2700+step, step)

# plot white rating
plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
sb.distplot(games['white_rating'],kde=False, bins= bins_rating);
plt.xlabel('White Rating')
plt.ylabel('Count')
plt.xlim(750,2700) # based on data min and max
plt.ylim(0,1050) 

# plot black rating
plt.subplot(1,2,2)
sb.distplot(games['black_rating'],kde=False, bins=bins_rating);
plt.xlabel('Black Rating')
plt.xlim(750,2700) # based on data min and max
plt.ylim(0,1050); # to make y-axes synchronized

The two distributions are similar; both have a slight right skew due to the small number of GM (Grandmaster) players who have the highest ratings.<br>

However, I'm more interested in the difference of rating between the two players. Let's see the difference distribution.

**7. Rating Difference** *(White to Black)*

In [None]:
# adding new column for difference in rating between white and black
games['rating_diff_wb'] = games['white_rating'] - games['black_rating'] 
games['rating_diff_wb'].describe()

In [None]:
# bins
bins_diff = np.arange(-1605, 1499+50, 50)

# plot
sb.distplot(games['rating_diff_wb'],kde=False, bins= bins_diff);
plt.xlabel('Rating Difference (W-B)');
plt.ylabel('Count');

The rating difference forms a distribution with long tails at both ends and a peak near 0 of approximately 2,250 games in which both players' ratings are very close.<br>

Points at the far ends could be considered outliers. Although a large rating difference between the two players is possible in some games, those points should be investigated further.

In [None]:
# far points at the left end
left_end = -1000 # eyeballed
print('Number of far points at the left end= {}'.format(games[games['rating_diff_wb']<=left_end].shape[0]))
games[games['rating_diff_wb']<=left_end] 

In [None]:
# far points at the right end
right_end= 1000 # eyeballed
print('Number of far points at the right end= {}'.format(games[games['rating_diff_wb']>=right_end].shape[0]))
games[games['rating_diff_wb']>=right_end] 

There is nothing odd about those points apart from the long tails they cause on both sides of the difference distribution: the higher-rated player wins in all of these games, which makes sense. So I decided to keep them.
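
The observation that the higher-rated player wins every far-end game can be checked in code rather than by reading the table. The sketch below does so on a toy frame with made-up values; the `favorite` column and the `favorite_won` name are hypothetical, and the 1000-point threshold is the eyeballed one used above.

```python
import pandas as pd

# Toy data standing in for the games table (values are made up)
toy = pd.DataFrame({
    'winner': ['white', 'black', 'white', 'black'],
    'white_rating': [2300, 1100, 1500, 1480],
    'black_rating': [1200, 2250, 1450, 1500],
})
toy['rating_diff_wb'] = toy['white_rating'] - toy['black_rating']

# Keep games where the gap is at least 1000 points in either direction
far = toy[toy['rating_diff_wb'].abs() >= 1000].copy()

# A positive difference means white is the higher-rated player
far['favorite'] = far['rating_diff_wb'].gt(0).map({True: 'white', False: 'black'})
favorite_won = far['favorite'] == far['winner']

print(len(far), favorite_won.all())
```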

**8. Number of Turns**

Number of turns may not be related to the variable of interest, winner.

In [None]:
games['turns'].describe()

In [None]:
# bin edges
step= 6
bins_turns = np.arange(games['turns'].min(), games['turns'].max()+step, step)

# plot
sb.distplot(games['turns'], kde=False, bins=bins_turns);
plt.xlabel('Turns')
plt.ylabel('Count');

The distribution of turns is right skewed with a long tail, which suggests further investigation using a log scale.
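
For a histogram on a log axis, the bin edges are usually spaced evenly in log10 units so that the bars look equally wide. The sketch below shows how such edges can be built; `max_turns` is an assumed value used only for illustration, and `step` is the bin width in log10 units.

```python
import numpy as np

max_turns = 349  # assumed maximum, for illustration only
step = 0.05      # bin width in log10 units

# Edges are evenly spaced in log10 space, then mapped back to turns
log_edges = np.arange(0, np.log10(max_turns) + step, step)
edges = 10 ** log_edges

# Each edge is the previous one multiplied by a constant factor of 10**step
ratios = edges[1:] / edges[:-1]
print(edges[:4].round(2), ratios[:3].round(3))
```

Because each bin is a fixed multiple of the previous one, the low end of the range gets many narrow bins, which is what makes very short games stand out.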

In [None]:
np.log10(games['turns'].describe())

In [None]:
# right skewed distribution with long tail suggests plotting the distribution with a log scale 
step = 0.05
bins_log_turns = 10**np.arange(0,np.log10(games['turns'].max())+step, step)
sb.distplot(games['turns'], kde=False, bins= bins_log_turns);
plt.xscale('log')
ticks = [1, 2, 5, 10, 20, 50 ,100, 150, 250]
plt.xticks(ticks, ticks)
plt.xlabel('Turns')
plt.ylabel('Count');

The log scale exposes obvious outliers at the left end, which don't make sense in practice, as it is hard for a game to be won in fewer than 4 turns. I looked through the game history on Lichess, and here is a rated rapid [game](https://lichess.org/V7a3QIoB/black#4) that is barely won in 4 turns.

In [None]:
# investigating far points at the left

print(games[(games['turns']<4)].victory_status.value_counts()) # count number of points
games[(games['turns']<4)] # rows with less than 4 turns

There are 170 games in which one of the players resigned, and it is nearly impossible for a game to be decided in just 3 turns. Since this is an online game, there are many reasons for a player to resign, and with so few turns those reasons are most likely unrelated to the game itself. So those games are dropped from the dataset.

In [None]:
# drop rows with number of turns < 4
games = games[(games['turns']>=4)]

games.reset_index(drop=True, inplace=True)   # resetting index to fill gaps in the index

Final distribution of number of turns:

In [None]:
# bins
step = 0.06
bins_log_turns = 10**np.arange(0.47,np.log10(games['turns'].max())+0.005, step)

# plot
sb.distplot(games['turns'], kde=False, bins= bins_log_turns);
plt.xscale('log')
ticks = [5, 10, 20, 50 ,100, 150, 200]
plt.xticks(ticks, ticks)
plt.xlabel('Turns')
plt.ylabel('Count');

The distribution is still left skewed on the log scale, but with more reasonable data points at both ends.

In [None]:
print('Number of rows = {:,.0f} rows'.format(games.shape[0]))

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

* **The game could end with 4 possible results, and 2 of them are dropped because I'm more interested in the games that are won by checkmate or when one of the players resigns**. A [draw](https://www.chess.com/terms/draw-chess) is obviously not a win, and an out-of-time result could end in a draw, as the rules of the game state that if one player is out of time and the other player has insufficient material to checkmate, the game ends in a draw.

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

* There are **rated and unrated games**, but I decided to investigate only the rated games, as they are more competitive than unrated ones and they form the majority of the data. The column then holds a single value and becomes useless, so it is dropped.<br>


* **Openings** in chess games are divided into **5 volumes**, which are used instead of the individual openings inside each volume because it would be difficult to analyze all of them.<br>


* **Players' ratings** on their own are not relevant, but I expect the difference between them to be important in predicting the winner. So, **I added a column for the difference in ratings between white and black (white minus black)**. Despite the long tails at both ends of the distribution, which indicate large rating differences, I decided not to remove those games as they are possible and there is nothing odd about their data.<br>


* **Number of turns** may not be the best predictor of the game winner, but exploring it helped me find interesting points that needed further investigation. At first, the distribution was right skewed, so I applied a **log scale** to take a good look at the right tail. Instead, the distribution shifted to the right and became left skewed. That is when I found **outliers at the left tail**: games with only 1, 2, or 3 turns are odd in practice. After exploring those games, I found that all of them ended with one of the players resigning, which is not normal in a chess game after only 3 turns and may be due to reasons unrelated to the game, so I decided to **drop those points**.<br>

## Bivariate Exploration

First, let's start with one of the variables of interest: **winner**. I want to look at its relationship with game rating.

In [None]:
# plot
g =sb.FacetGrid(data=games, col='game_rating', height=6, aspect=1.1)
g.map(sb.countplot, 'winner', order=['white', 'black'], palette=['silver','black']);

White wins more games than black in both categories, but the gap in counts is much smaller for blitz games.

So, does the player with the higher rating have a better chance of winning? To answer that, let's look at the relationship between rating difference and the winner.

In [None]:
# plot
sb.pointplot(data=games, x='winner', y= 'rating_diff_wb', order=['white', 'black'],palette=['silver','black'], linestyles="");
plt.ylabel('Average Rating Difference (White-Black)');
plt.xlabel('Winner');

As expected, the winner, whether playing as white or as black, has an average rating advantage of about 80 points.
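
The averages shown in the point plot can also be read as plain numbers with a `groupby`. The sketch below uses a toy frame with made-up ratings and the same column names as the notebook; it is not the notebook's data, so the values differ from the roughly 80 points seen above.

```python
import pandas as pd

# Toy data (made up) with the same column names as the notebook
toy = pd.DataFrame({
    'winner': ['white', 'white', 'black', 'black'],
    'white_rating': [1600, 1500, 1400, 1550],
    'black_rating': [1500, 1450, 1520, 1600],
})
toy['rating_diff_wb'] = toy['white_rating'] - toy['black_rating']

# Mean white-minus-black difference per winner:
# positive favours white, negative favours black
mean_diff = toy.groupby('winner')['rating_diff_wb'].mean()
print(mean_diff)
```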

Are there certain opening groups associated with the winning player? Let's see.

In [None]:
sb.countplot(data=games, hue='winner', x='opening_eco', hue_order=['white','black'],palette=['silver','black']);

Openings of group C are used to win significantly more often by the player as white than by the player as black, while all other groups show a very small difference in favor of the player as white, except for group E, in which there is approximately no difference.

Do players as white need fewer turns to win a game?

In [None]:
# plotting winner against average number of turns
sb.barplot(data=games, x='winner', y='turns', order=['white','black'], palette=['silver','black']);

On average, players as white need fewer turns to win the game. The difference is slight, but in a chess game I'd argue that it may be significant.

How does the winner win the game?

In [None]:
# plotting the distribution of winners for each victory status
sb.countplot(data=games, hue='winner', x='victory_status', hue_order=['white','black'],palette=['silver','black']);

Most of the games are won when one of the players, white or black, resigns. It could be because the games are played by highly rated players who anticipated the next moves and realized they were losing before it happened.

So, let's move on to another variable of interest: **victory status**.

First, I need to take the absolute value of the rating difference because I want to look at its relationship with victory status regardless of which player, black or white, has the higher rating.

In [None]:
# adding new column for absolute difference in rating 
games['abs_rating_diff_wb'] = abs(games['white_rating'] - games['black_rating'])

In [None]:
# plotting (absolute rating difference) against (victory status)
sb.barplot(data= games, x='victory_status', y= 'abs_rating_diff_wb');
plt.ylabel('Average Absolute Rating Difference (White-Black)')
plt.xlabel('Victory Status');

Absolute rating difference, on average, is higher in games that ended by checkmate than in games that ended with one of the players resigning.

Looking at the relationship between victory status and the other variables: 

In [None]:
# plotting victory status distribution for each game rating
sb.countplot(data=games, hue='victory_status', x='game_rating');
plt.ylabel('Count')
plt.xlabel('Game Rating')
plt.legend(title='Victory Status');

In [None]:
# plotting victory status distribution for each opening
sb.countplot(data=games, hue='victory_status', x='opening_eco');
plt.ylabel('Count')
plt.xlabel('Opening ECO')
plt.legend(title='Victory Status');

In [None]:
# plotting the distribution of number of turns for each victory status
sb.violinplot(data=games, x='victory_status', y='turns', inner='quartile');
plt.ylabel('Number of Turns')
plt.xlabel('Victory Status');

- Both game ratings show similar trends for victory status.
- I haven't learnt much from the plot with groups of openings, as each of these groups shows the same trend of victory status.
- Winning by checkmate takes, on average, more turns than winning by resignation; this can be seen from the higher median and the thicker distribution at higher numbers of turns.

The above bivariate exploration is about the variables of interest; winner and victory status. Now, let's take a look at the other relationships that number of turns and rating difference have with the other variables. Starting with the relationship between them.

In [None]:
# scatter plot between absolute rating difference and number of turns using a sample from the data
sb.regplot(data=games.sample(2000, replace=False), y='turns', x='abs_rating_diff_wb',
           fit_reg=True,truncate=False, scatter_kws={'alpha':0.4});
plt.xlabel('Absolute Rating Difference');
plt.ylabel('Number of Turns');

Number of turns has a very weak correlation with the absolute rating difference. However, number of turns tends to be higher when the two players' ratings are very close (approximately zero difference between them).
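
The strength of a relationship like this one can be quantified with the Pearson correlation coefficient. The sketch below computes it with NumPy on made-up values that stand in for `abs_rating_diff_wb` and `turns`; the coefficient for the real data is not computed here.

```python
import numpy as np

# Made-up values standing in for abs_rating_diff_wb and turns
abs_diff = np.array([10, 50, 120, 300, 450, 700])
turns = np.array([70, 40, 85, 52, 66, 38])

# Pearson correlation coefficient between the two variables
r = np.corrcoef(abs_diff, turns)[0, 1]
print(round(r, 3))
```

On the notebook's frame, the equivalent pandas call would be `games['turns'].corr(games['abs_rating_diff_wb'])`.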

In [None]:
# plotting number of turns against game rating and openings
g = sb.PairGrid(data = games, y_vars=['turns'],
            x_vars=['game_rating','opening_eco'], height=6, aspect=1.1,diag_sharey=True )
g.map(sb.boxenplot);

In [None]:
# plotting absolute rating difference against game rating and openings
g = sb.PairGrid(data = games, y_vars=['abs_rating_diff_wb'],
            x_vars=['game_rating','opening_eco'], height=6, aspect=1.1,diag_sharey=True)
g.map(sb.boxenplot);

There is a barely noticeable difference in the distributions of number of turns and rating difference across these variables. Number of turns tends to be higher in rapid games than in blitz ones, while blitz games tend to be won at higher rating differences. The openings with the highest number of turns are those of group E.

There is only one relationship left; game rating versus opening groups.

In [None]:
#plot
sb.countplot(data=games, x='opening_eco', hue='game_rating', palette='dark');
plt.ylabel('Count')
plt.xlabel('Opening ECO')
plt.legend(title='Game Rating');

There is a small number of blitz games compared to rapid games, which is why rapid dominates in each opening group.

After looking at each possible combination of variables, I want to look at the percentages of games won, instead of counts, for the variables of interest; winner and victory status.

I will get the percentage of games won by black and by white in each game rating then compare them.
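
One compact way to get such percentages is a row-normalised cross-tabulation. The sketch below is an alternative to the plotting function used in this notebook, not a copy of it, and it runs on a toy frame with made-up rows.

```python
import pandas as pd

# Toy data (made up) with the same column names as the notebook
toy = pd.DataFrame({
    'game_rating': ['blitz', 'blitz', 'rapid', 'rapid', 'rapid', 'rapid'],
    'winner': ['white', 'black', 'white', 'white', 'black', 'white'],
})

# Row-normalised counts: each game_rating row sums to 100%
win_pct = pd.crosstab(toy['game_rating'], toy['winner'], normalize='index') * 100
print(win_pct.round(1))
```

Normalising by row removes the effect of one category having far more games than another, which is why percentages can tell a different story from raw counts.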

In [None]:
def winner_percentages(var):
        ''' A function to plot the percentage of games won for each winner, white and black,
            and its variation with a variable
            var: (string) the variable you want to plot the variation with.
        '''
        white_counts = games.query('winner == "white"').groupby(var)[var].count()
        black_counts = games.query('winner == "black"').groupby(var)[var].count()
        game_var_counts = games[var].value_counts()
        
        white_var_pct= (white_counts/game_var_counts)*100
        black_var_pct= (black_counts/game_var_counts)*100
        plt.figure(figsize=(12,7))
        plt.bar(data= games, x=white_var_pct.index, height=white_var_pct,
                width=-0.4, align='edge', color='silver',label='white');

        plt.bar(data= games, x=black_var_pct.index, height=black_var_pct, 
                width=0.4, align='edge', color='black', label='black');

        ticks= [0,10,20,30,40,50]
        labels = ['{}%'.format(t) for t in ticks]
        plt.yticks(ticks, labels);
        plt.ylabel('% of Games Won')
        plt.legend();

winner_percentages('game_rating')
winner_percentages('opening_eco')
winner_percentages('victory_status')

- The player as white has a larger advantage over the player as black in blitz games.
- The highest winning percentage for white is in opening group C while, surprisingly, black has a slight advantage when opening group E is used.
- White's winning margin over black is slightly larger for wins by checkmate than for wins by resignation.

Doing the same with victory status variable:

In [None]:
def victory_percentages(var):
        ''' A function to plot the percentage of games won for each victory status, resign and mate,
            and its variation with a variable
            var: (string) the variable you want to plot the variation with.
        '''
        mate_counts = games.query('victory_status == "mate"').groupby(var)[var].count()
        resign_counts = games.query('victory_status == "resign"').groupby(var)[var].count()
        game_var_counts = games[var].value_counts()
        
        mate_var_pct= (mate_counts/game_var_counts)*100
        resign_var_pct= (resign_counts/game_var_counts)*100
        plt.figure(figsize=(12,7))
        
        plt.bar(data= games, x=mate_var_pct.index, height=mate_var_pct,
                width=-0.4, align='edge', color=sb.color_palette('colorblind')[3],label='mate');

        plt.bar(data= games, x=resign_var_pct.index, height=resign_var_pct, 
                width=0.4, align='edge', label='resign');

        ticks= [0,10,20,30,40,50, 60, 70]
        labels = ['{}%'.format(t) for t in ticks]
        plt.yticks(ticks, labels);
        plt.ylabel('% of Games Won')
        plt.legend();

victory_percentages('game_rating')
victory_percentages('opening_eco')

- The trend is the same for both rapid and blitz games, with resignation percentages higher than checkmate percentages in both.
- The resignation percentage is highest in group E, but other than that all openings have a similar distribution.

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?<br>

There are a lot of factors that can possibly affect the features of interest, so I will take the most interesting relationships one by one to keep it from becoming overwhelming.<br>

* **Difference in rating** between players is the most obvious factor in **winning the game**: players who win are rated higher than their opponents by around 80 rating points on average. The average absolute rating difference is higher in games that end with checkmate than in games that end with a resignation, but this doesn't say whether the higher-rated player wins, which will be investigated later.<br>


* The player as black takes slightly more **turns**, on average, to win the game than the player as white, and the number of turns needed to checkmate is, on average, significantly higher than the number needed before a resignation.<br>


* White **wins** a higher percentage of games than black whether by **checkmate or resignation**, but the difference is slightly larger in games that end with checkmate.<br>


* Victory status varies little with variables like **game rating** and **opening ECO**. As for winning, white is more dominant in blitz games than in rapid ones, and it also has a higher winning percentage in all opening groups except group E, in which black has a barely higher winning percentage.<br>


### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

* The two numerical variables, number of turns and difference in rating, have a weak negative correlation: the higher the difference in rating, the fewer turns are needed to win the game.<br>


## Multivariate Exploration

From what I have learned in the previous exploration, the variations of the variables of interest, winner and victory status, with rating difference and number of turns are significant, so I will start plotting the variables of interest one by one with these two variables.

In [None]:
# making a sample of data to use for scatter plots
games_sample= games.sample(1000, replace=False, random_state=100)

In [None]:
# scatter plot of rating difference versus number of turns for each winner
winner_markers= [['white', 'o', 'darkgray'],
                ['black', '^', 'black']]

plt.figure(figsize=(16,9))
for winner, marker, color in winner_markers:
    sb.regplot(data = games_sample.loc[games_sample['winner']==winner], x = 'rating_diff_wb', y = 'turns',
               x_jitter=5,fit_reg = False, marker = marker, scatter_kws={'color':color});

plt.legend(title='Winner',labels=['White','Black']);
plt.axvspan(0, 1100, facecolor='g', alpha=0.05)
plt.axvline(0, color='k')
plt.xlim(-1100,1100)
plt.xlabel('Rating Difference (White-Black)')
plt.ylabel('Number of Turns');

This plot clearly shows what I learned earlier from the data, along with some interesting relationships.
* Most of the games are played between players of the **same or close rating** and last about **40 to 70 turns**.
* Players, as white or as black, **win** more often when their **ratings are higher** than their opponents' than when their ratings are lower.
* Games played with a **high difference in rating** in the winner's favor tend to end in **fewer turns** than those with a lower difference in rating.
* There are more players as **black winning** games with a **high number of turns**, especially when the rating difference is small.

In [None]:
# scatter plot of rating difference versus number of turns for each victory status
victory_markers= [['resign', 'x'],
                ['mate', 's']]

plt.figure(figsize=(15,8))
for victory, marker in victory_markers:
    ax=sb.regplot(data = games_sample.loc[games_sample['victory_status']==victory], x = 'rating_diff_wb', y = 'turns',
               fit_reg = False, marker = marker, scatter_kws={'alpha':0.7},);

plt.legend(title='Victory Status', labels=['resign','mate']);
plt.xlim(ax.get_xlim())
plt.axvspan(0,ax.get_xlim()[1], facecolor='g', alpha=0.05)
plt.axvline(0, color='k')
plt.xlabel('Rating Difference (White-Black)')
plt.ylabel('Number of Turns');

It is hard to interpret any findings in this visual as the distributions of games won by checkmate or resigning are similar.

Let's plot bar charts with the means of rating difference and number of turns that may be clearer.

In [None]:
# plotting victory status against average rating difference for each winner
g = sb.FacetGrid(data=games, col='winner', col_order=['white','black'], height=6, aspect=1.1)
g.map(sb.barplot,'victory_status' , 'rating_diff_wb',order=['mate','resign'] 
      , palette=[sb.color_palette('colorblind')[3],sb.color_palette()[0]]);
plt.ylim(-110,110)
g.axes[0,0].set_xlabel('Victory Status');
g.axes[0,0].set_ylabel('Average Rating Difference (White-Black)');
g.axes[0,1].set_xlabel('Victory Status');

In [None]:
# plotting victory status against average number of turns for each winner
g = sb.FacetGrid(data=games, col='winner', col_order=['white','black'], height=6, aspect=1.1)
g.map(sb.barplot,'victory_status' , 'turns',order=['resign','mate'] 
      , palette=[sb.color_palette()[0],sb.color_palette('colorblind')[3]]);
g.axes[0,0].set_xlabel('Victory Status');
g.axes[0,0].set_ylabel('Average Number of Turns');
g.axes[0,1].set_xlabel('Victory Status');

Regardless of the winner, white or black, games won by checkmate take more turns on average and have a larger average rating difference in the winner's favor than games won by resignation.

I want to include another variable, game rating, but first let's see how it varies with rating difference and number of turns.
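
The bar charts can be cross-checked with a plain table. The snippet below is a minimal sketch, not part of the original analysis: the helper name `summarize_by_outcome` and the toy rows are hypothetical, and it assumes the column names used in the plots above (`winner`, `victory_status`, `rating_diff_wb`, `turns`).

```python
import pandas as pd

def summarize_by_outcome(df):
    """Mean turns and mean rating advantage of the winner,
    per winner color and victory status (hypothetical helper)."""
    # keep only games with a decisive winner
    decided = df[df['winner'].isin(['white', 'black'])].copy()
    # rating_diff_wb is white minus black, so flip the sign when black wins
    sign = decided['winner'].map({'white': 1, 'black': -1})
    decided['winner_advantage'] = decided['rating_diff_wb'] * sign
    return (decided
            .groupby(['winner', 'victory_status'])[['turns', 'winner_advantage']]
            .mean())

# toy rows for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    'winner': ['white', 'white', 'black', 'black', 'draw'],
    'victory_status': ['mate', 'resign', 'mate', 'resign', 'draw'],
    'rating_diff_wb': [120, 60, -150, -40, 5],
    'turns': [70, 50, 80, 55, 90],
})
summary = summarize_by_outcome(toy)
print(summary)
```

On the real data, `summarize_by_outcome(games)` should give the same means that the two bar charts display, with the rating difference expressed from the winner's point of view.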

In [None]:
# scatter plot of rating difference versus number of turns for each game rating
gameRating_markers= [['blitz', 'x', sb.color_palette('dark')[0]],
                ['rapid', 's', sb.color_palette('dark')[3]]]

plt.figure(figsize=(15,8))
for gameRating, marker, color in gameRating_markers:
    ax=sb.regplot(data = games_sample.loc[games_sample['game_rating']==gameRating], x = 'rating_diff_wb', y = 'turns',
               fit_reg = False, marker = marker, scatter_kws={'alpha':0.7,'color': color});

plt.legend(title='Game Rating', labels=['blitz','rapid']);
plt.xlim(ax.get_xlim())
plt.axvspan(0,ax.get_xlim()[1], facecolor='g', alpha=0.035)
plt.axvline(0, color='k')
plt.xlabel('Rating Difference (White-Black)')
plt.ylabel('Number of Turns');

Blitz rating games tend to be won at larger rating differences and in fewer turns than the majority of rapid rating games.

In [None]:
# plotting game rating against rating difference and against number of turns for each winner
plt.figure(figsize=(15,8))

# first plot
# to make the comparison between white and black clearer, I take the absolute value of the mean rating difference

plt.subplot(1,2,1) 
white_data=games.query('winner=="white"').groupby('game_rating')['rating_diff_wb'].mean().abs()
sb.pointplot(x=white_data.index, y=white_data.values, hue=['white']*len(white_data.values),
                 palette=['silver']*len(white_data.values))

black_data=games.query('winner=="black"').groupby('game_rating')['rating_diff_wb'].mean().abs()
sb.pointplot(x=black_data.index, y=black_data.values, hue=['black']*len(black_data.values),
             palette=['black']*len(black_data.values))
plt.xlabel('Game Rating')
plt.ylabel('Average Rating Difference in Favor of the Winner')
plt.ylim(0,110);
plt.legend(title='Winner');



# second plot

plt.subplot(1,2,2)
sb.pointplot(data=games, x='game_rating', y='turns', hue='winner', order=['blitz','rapid'], hue_order=['white','black'],
            palette=['silver', 'black'], ci=False)
plt.ylim(0,70);
plt.xlabel('Game Rating')
plt.ylabel('Average Number of Turns');

These plots support what I learned from the previous scatter plot and add some findings:
* Blitz rating games won by white show a larger average rating advantage for the winner than those won by black.
* Players as black tend to need more turns to win a rapid rating game.

Let's add the victory status variable to each of the previous two plots.

In [None]:
# plotting game rating against rating difference for each winner  and each victory status

# first plot
plt.figure(figsize=(16,9))
plt.subplot(1,2,1)
white_data=games.query('winner=="white" & victory_status=="resign"').groupby('game_rating')['rating_diff_wb'].mean().abs()
ax2=sb.pointplot(x=white_data.index, y=white_data.values, hue=['white']*len(white_data.values),
                 palette=['silver']*len(white_data.values))

black_data=games.query('winner=="black" & victory_status=="resign"').groupby('game_rating')['rating_diff_wb'].mean().abs()
ax1= sb.pointplot(x=black_data.index, y=black_data.values, hue=['black']*len(black_data.values), 
                  palette=['black']*len(black_data.values))
plt.ylim(0,140);
plt.xlabel('Game Rating')
plt.ylabel('Average Rating Difference in Favor of the Winner')
plt.title('Victory Status: Resign');
plt.legend(title='Winner');


# second plot
plt.subplot(1,2,2)
white_data=games.query('winner=="white" & victory_status=="mate"').groupby('game_rating')['rating_diff_wb'].mean().abs()
ax2=sb.pointplot(x=white_data.index, y=white_data.values, hue=['white']*len(white_data.values),
                 palette=['silver']*len(white_data.values))

black_data=games.query('winner=="black" & victory_status=="mate"').groupby('game_rating')['rating_diff_wb'].mean().abs()
ax1= sb.pointplot(x=black_data.index, y=black_data.values, hue=['black']*len(black_data.values),
                  palette=['black']*len(black_data.values))
plt.ylim(0,140);
plt.xlabel('Game Rating')
plt.title('Victory Status: Mate');
plt.legend(title='Winner');

Games that end in **resignation** show similar trends for black and white. However, among games that end in **checkmate**, black wins blitz rating games at a lower average rating advantage than white does, and wins rapid rating games at a higher one.

In [None]:
# plotting game rating against number of turns for each winner  and each victory status

g = sb.FacetGrid(data=games, col='victory_status',col_order=['resign', 'mate'], height=6, aspect=1.1)
g.map(sb.pointplot,'game_rating' , 'turns', 'winner',order=['blitz','rapid'] , hue_order=['white','black']
      , palette=['silver', 'black'], ci=False);

g.axes[0,0].set_ylabel('Average Number of Turns');
g.axes[0,0].set_xlabel('Game Rating');
g.axes[0,1].set_xlabel('Game Rating');
g.axes[0,0].set_title('Victory Status: Resign');
g.axes[0,1].set_title('Victory Status: Mate');
plt.legend(title= 'Winner',title_fontsize=13, fontsize=12 );
plt.ylim(0,80);

The trends are similar for white and black within each victory status. As I found earlier, checkmates take more turns than resignations.

Moving on to openings to see if certain openings guarantee a win for white or black.

I decided to look at the winner's rating instead of the rating difference this time, because I want to know which openings players at each rating level use to win.

In [None]:
games[['white_rating','black_rating']].describe()

In [None]:
# plotting the opening groups against the winner rating for white and black
plt.figure(figsize=(15,8))
white_data=games.query('winner=="white"').groupby('opening_eco')['white_rating'].mean()
sb.pointplot(x=white_data.index, y=white_data.values, hue=['white']*len(white_data.values),
                 palette=['silver']*len(white_data.values))

black_data=games.query('winner=="black"').groupby('opening_eco')['black_rating'].mean()
sb.pointplot(x=black_data.index, y=black_data.values, hue=['black']*len(black_data.values),
             palette=['black']*len(black_data.values))
plt.xlabel('Opening ECO')
plt.ylabel("Average Winner's Rating")
plt.legend(title='Winner');
plt.ylim(np.min([games['white_rating'].min(),games['black_rating'].min()]),
         np.max([games['white_rating'].max(),games['black_rating'].max()]));


It looks like openings of group E are used more by the highest-rated players, while the lowest-rated players prefer openings of group C.
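
To see both how often each opening group wins and how strong its winners are, the counts and mean ratings can be put side by side. This is a minimal sketch under assumptions: the helper `opening_group_summary` and the toy rows are hypothetical, and `opening_eco` is assumed to already hold the group letter (A to E), as in the plot above.

```python
import pandas as pd

def opening_group_summary(df):
    """Number of wins and mean winner rating per opening group
    (hypothetical helper)."""
    # keep only games with a decisive winner
    decided = df[df['winner'].isin(['white', 'black'])].copy()
    # take white's rating when white won, otherwise black's rating
    decided['winner_rating'] = decided['white_rating'].where(
        decided['winner'] == 'white', decided['black_rating'])
    return decided.groupby('opening_eco')['winner_rating'].agg(['count', 'mean'])

# toy rows for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    'winner': ['white', 'black', 'white', 'black'],
    'opening_eco': ['C', 'C', 'E', 'A'],
    'white_rating': [1200, 1500, 2000, 1600],
    'black_rating': [1100, 1300, 1900, 1700],
})
opening_summary = opening_group_summary(toy)
print(opening_summary)
```

A group with a low `count` but a high `mean` would match the pattern described for group E: rarely the winning opening overall, yet used by higher-rated winners.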

In [None]:
# plotting the opening groups against the winner rating for white and black and for each victory status

plt.figure(figsize=(16,9))
plt.subplot(1,2,1)
white_data=games.query('winner=="white" & victory_status=="resign"').groupby('opening_eco')['white_rating'].mean().abs()
ax2=sb.pointplot(x=white_data.index, y=white_data.values, hue=['white']*len(white_data.values),
                 palette=['silver']*len(white_data.values))

black_data=games.query('winner=="black" & victory_status=="resign"').groupby('opening_eco')['black_rating'].mean().abs()
ax1= sb.pointplot(x=black_data.index, y=black_data.values, hue=['black']*len(black_data.values), 
                  palette=['black']*len(black_data.values))
plt.xlabel('Opening ECO')
plt.ylabel("Average Winner's Rating")
plt.title('Victory Status: Resign');
plt.legend(title='Winner');
plt.ylim(np.min([games['white_rating'].min(),games['black_rating'].min()]),
         np.max([games['white_rating'].max(),games['black_rating'].max()]));


plt.subplot(1,2,2)
white_data=games.query('winner=="white" & victory_status=="mate"').groupby('opening_eco')['white_rating'].mean().abs()
ax2=sb.pointplot(x=white_data.index, y=white_data.values, hue=['white']*len(white_data.values),
                 palette=['silver']*len(white_data.values))

black_data=games.query('winner=="black" & victory_status=="mate"').groupby('opening_eco')['black_rating'].mean().abs()
ax1= sb.pointplot(x=black_data.index, y=black_data.values, hue=['black']*len(black_data.values),
                  palette=['black']*len(black_data.values))
plt.xlabel('Opening ECO')
plt.title('Victory Status: Mate');
plt.legend(title='Winner');
plt.ylim(np.min([games['white_rating'].min(),games['black_rating'].min()]),
         np.max([games['white_rating'].max(),games['black_rating'].max()]));

The trends are the same as earlier, except that players as black who win by checkmate are more highly rated, on average, in groups B and E than in the other groups.

In [None]:
# plotting the opening groups against the number of turns for each winner
plt.figure(figsize=(15,8))
plt.subplot(1,2,1)
sb.pointplot(data=games, x='opening_eco', y='turns', hue='winner', hue_order=['white','black'],
            palette=['silver', 'black'], ci=False)
plt.xlabel('Opening ECO')
plt.ylabel('Average Number of Turns');
plt.ylim(0,80);

# plotting the opening groups against the number of turns for each victory status
plt.subplot(1,2,2)
sb.pointplot(data=games, x='opening_eco', y='turns', hue='victory_status', hue_order=['resign','mate'],
            ci=False)
plt.xlabel('Opening ECO')
plt.ylabel('');
plt.ylim(0,80);

On average, black needs more turns to win in every opening group, and resignations come after fewer turns than checkmates in all opening groups.

In [None]:
# plotting the opening groups against the number of turns for each winner and for each victory status
g = sb.FacetGrid(data=games, col='victory_status',col_order=['resign', 'mate'], height=6, aspect=1.1)
g.map(sb.pointplot,'opening_eco' , 'turns', 'winner',order=['A','B','C','D','E'] , hue_order=['white','black']
      , palette=['silver', 'black'], ci=False);

g.axes[0,0].set_ylabel('Average Number of Turns');
g.axes[0,0].set_xlabel('Opening ECO');
g.axes[0,1].set_xlabel('Opening ECO');
g.axes[0,0].set_title('Victory Status: Resign');
g.axes[0,1].set_title('Victory Status: Mate');
plt.legend(title= 'Winner',title_fontsize=13, fontsize=12 );
plt.ylim(0,90);

Games won by resignation take the most turns in group E for both black and white; however, games won by checkmate take the most turns in group B for black and in group E for white.

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I looked more closely at the two factors, number of turns and player ratings, that showed the strongest variation with the winner and victory status, and combined each of them with the other variables: game rating and opening group. This multivariate investigation yielded some findings that were subtle before, such as:


* Blitz rating games are won more by white when white is more skilled than black, and in fewer turns than rapid games. <br>

* Although openings of group E are rarely used by the winner, they are popular among higher-rated players.<br>

* For all opening groups, it takes more turns to beat an opponent by checkmate than by resignation.<br>

### Were there any interesting or surprising interactions between features?

It is interesting that white has no large advantage over black, or even a different strategy, even though white opens the game and black has to react to that opening. Both show similar variation with all other variables, with a slight advantage for white, who wins in fewer turns on average and with a smaller average rating difference over the opponent.
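
The size of this slight advantage can be put into numbers with a short comparison of the two colors. The snippet is a minimal sketch, assuming the same column names as above; the helper `compare_colors` and the toy rows are hypothetical.

```python
import pandas as pd

def compare_colors(df):
    """Win share, mean turns and mean rating advantage of the winner,
    for white and for black (hypothetical helper)."""
    # keep only games with a decisive winner
    decided = df[df['winner'].isin(['white', 'black'])]
    is_white = decided['winner'] == 'white'
    # rating_diff_wb is white minus black, so negate it when black wins
    advantage = decided['rating_diff_wb'].where(is_white, -decided['rating_diff_wb'])
    return pd.DataFrame({
        'win_share': decided['winner'].value_counts(normalize=True),
        'mean_turns': decided.groupby('winner')['turns'].mean(),
        'mean_advantage': advantage.groupby(decided['winner']).mean(),
    })

# toy rows for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    'winner': ['white', 'white', 'white', 'black'],
    'rating_diff_wb': [100, 50, -30, -80],
    'turns': [40, 60, 50, 70],
})
color_summary = compare_colors(toy)
print(color_summary)
```

Running `compare_colors(games)` on the full dataset would show whether white's lower mean turns and smaller mean rating advantage hold up as plain averages.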