In [28]:
import pandas as pd
import numpy as np
import altair as alt
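alt.data_transformers.disable_max_rows() # Lets charts use all 20,058 rows instead of Altair's default 5,000-row limit
alt.renderers.enable('default')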

RendererRegistry.enable('default')

In [29]:
df = pd.read_csv("games.csv")

In [30]:
df.head()

Unnamed: 0,id,rated,created_at,last_move_at,turns,victory_status,winner,increment_code,white_id,white_rating,black_id,black_rating,moves,opening_eco,opening_name,opening_ply
0,TZJHLljE,False,1504210000000.0,1504210000000.0,13,outoftime,white,15+2,bourgris,1500,a-00,1191,d4 d5 c4 c6 cxd5 e6 dxe6 fxe6 Nf3 Bb4+ Nc3 Ba5...,D10,Slav Defense: Exchange Variation,5
1,l1NXvwaE,True,1504130000000.0,1504130000000.0,16,resign,black,5+10,a-00,1322,skinnerua,1261,d4 Nc6 e4 e5 f4 f6 dxe5 fxe5 fxe5 Nxe5 Qd4 Nc6...,B00,Nimzowitsch Defense: Kennedy Variation,4
2,mIICvQHh,True,1504130000000.0,1504130000000.0,61,mate,white,5+10,ischia,1496,a-00,1500,e4 e5 d3 d6 Be3 c6 Be2 b5 Nd2 a5 a4 c5 axb5 Nc...,C20,King's Pawn Game: Leonardis Variation,3
3,kWKvrqYL,True,1504110000000.0,1504110000000.0,61,mate,white,20+0,daniamurashov,1439,adivanov2009,1454,d4 d5 Nf3 Bf5 Nc3 Nf6 Bf4 Ng4 e3 Nc6 Be2 Qd7 O...,D02,Queen's Pawn Game: Zukertort Variation,3
4,9tXo1AUZ,True,1504030000000.0,1504030000000.0,95,mate,white,30+3,nik221107,1523,adivanov2009,1469,e4 e5 Nf3 d6 d4 Nc6 d5 Nb4 a3 Na6 Nc3 Be7 b4 N...,C41,Philidor Defense,5


### Dataset Information:

The dataset contains information on 20,058 chess games and their participating players. Based on the [dataset description](https://www.kaggle.com/datasnaek/chess), the data was collected from users' game histories through the [API](https://lichess.org/api) of Lichess, the chess site on which the games were played. The API was accessed and the dataset put together by Kaggle user Mitchell J. I was interested in finding out which of the recorded elements of the games may change how they proceed or how they end. I thought that various trends may exist in terms of the players' statistics and the types of games they play. From my limited understanding of chess, I do not think that any important information was missing aside from a way to see how long the games lasted. Although the dataset included `created_at` and `last_move_at` times, these values did not have enough significant figures for me to determine their difference, so I could not find the exact time taken in a game. While `increment_code`, which gives the time control the players were given, may generally be related to the amount of time taken, many games could have been much shorter or longer than this time, depending on when win conditions were met. Through analysis, I was hoping to find which aspects of the games were correlated and how those aspects could help me predict other elements, such as the victor of the game.
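The precision problem with the timestamps can be seen directly in the first row of the `df.head()` output. The following is a minimal sketch using the two values copied from that row, assuming they are Unix timestamps in milliseconds; the variable names are only for illustration.

```python
# Values copied from the first row of df.head(), assumed to be
# milliseconds since the Unix epoch.
created_at = 1504210000000.0
last_move_at = 1504210000000.0

# As recorded, the start and end of the game are identical,
# so the computed duration is zero.
duration_ms = last_move_at - created_at
print(duration_ms)

# In the rows shown, every timestamp is a multiple of 10,000,000 ms,
# which corresponds to a resolution of roughly 2.8 hours.
resolution_ms = 10_000_000
print(created_at % resolution_ms == 0)
print(resolution_ms / 1000 / 3600)  # resolution in hours
```

Any game shorter than that resolution can end up with the same start and end value, which is why the difference could not be used as a game length.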

### Quantitative Variables:

For quantitative variables, I wanted to focus on the attributes `increment_code` and `turns`. Because I also wanted to have more information on the players in the games, I decided to include `black_rating` and `white_rating` as well. For each variable, I examined distribution information such as the minimums, maximums, quartiles, means, and standard deviations and used histograms to provide a visual. After this exploration, I found the correlation between all of the variables.

#### Increment Code:

I first looked at the `increment_code` attribute, which shows the time given for a game and the time added after each move. It is written in the format M+S, where M gives the starting minutes on the clock for a game, and S gives the additional seconds allotted to the player on their turn. Since these codes were strings with values in multiple units, I wanted to convert each code into a number with a single unit, which I decided would be seconds to avoid decimals. The increment is added to the starting clock time once for each turn taken. This number would not represent the total time for the game clock in most cases, given that the players may have stopped taking turns before their clock ran out, but I figured that it would give me a better idea of how much time each game had than looking at the first half of the increment code alone. I used a function to accomplish this and made the modified codes into their own column, `seconds`.

In [31]:
pd.unique(df["increment_code"])

array(['15+2', '5+10', '20+0', '30+3', '10+0', '15+30', '15+0', '20+60',
       '5+40', '8+0', '15+15', '15+16', '11+0', '30+60', '15+5', '10+10',
       '15+3', '18+0', '30+30', '20+25', '5+5', '30+0', '40+40', '16+0',
       '6+10', '8+10', '8+12', '7+12', '15+10', '20+5', '20+10', '45+0',
       '5+8', '30+2', '25+0', '10+2', '7+2', '25+25', '7+8', '6+6',
       '10+4', '12+0', '15+1', '2+10', '9+0', '10+9', '6+7', '30+10',
       '30+15', '10+3', '45+30', '9+2', '7+3', '6+5', '25+5', '5+7',
       '8+2', '20+8', '30+5', '25+15', '15+7', '8+8', '5+9', '16+7',
       '10+7', '7+5', '40+20', '10+5', '11+5', '10+15', '180+0', '17+5',
       '8+5', '3+8', '60+120', '90+120', '15+8', '9+1', '9+3', '2+15',
       '10+1', '5+12', '60+10', '5+60', '8+15', '30+25', '12+1', '14+0',
       '60+20', '10+8', '20+3', '14+12', '7+7', '60+0', '45+45', '4+15',
       '5+6', '180+180', '8+3', '4+6', '10+25', '3+10', '7+4', '6+3',
       '5+15', '45+15', '15+20', '6+8', '6+9', '14+8', '6+4', '15+4',
 

In [32]:
def find_time(increment_code, turns):
    '''
    Takes a chess increment code and the number of turns and returns the total number of seconds the code represents based on the turns taken.
    '''
    clock, increment = increment_code.split("+") # Splits the code into the part before the "+" (clock, in minutes) and the part after it (increment, in seconds)
    return int(clock)*60 + (int(increment)*turns) # Converts the clock time to seconds and adds the increment time for each turn

In [33]:
# Finds the seconds version of the increment code for each row, using the increment_code and turns columns by name rather than by position
time_lst = [find_time(code, turns) for code, turns in zip(df["increment_code"], df["turns"])]

df["seconds"] = time_lst
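The same conversion can also be written without calling a function per row. This is only a sketch on a small illustrative frame (the codes and turn counts are taken from the first three rows of `df.head()`); the names `sample`, `clock_min`, and `increment_sec` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Small illustrative frame built from the first three rows of df.head().
sample = pd.DataFrame({
    "increment_code": ["15+2", "5+10", "5+10"],
    "turns": [13, 16, 61],
})

# Split "M+S" into two integer columns: minutes on the clock and
# seconds of increment per turn.
parts = sample["increment_code"].str.split("+", expand=True).astype(int)
parts.columns = ["clock_min", "increment_sec"]

# Clock converted to seconds, plus one increment for each turn taken.
sample["seconds"] = parts["clock_min"] * 60 + parts["increment_sec"] * sample["turns"]
print(sample)
```

For the first row, this gives 15*60 + 2*13 = 926 seconds, matching what `find_time("15+2", 13)` would return.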

In [34]:
df["seconds"].describe()

count    20058.000000
mean      1118.072241
std       1525.794926
min         32.000000
25%        600.000000
50%        645.000000
75%       1185.000000
max      35460.000000
Name: seconds, dtype: float64

The mean is much higher than the median, which makes sense considering how much more common smaller values are and how the maximum is much more distant from common data points than the minimum. The difference between Q1 and Q2 is much smaller than that between Q2 and Q3, demonstrating that values may have been concentrated on the lower end of the data and giving the distribution a right skew. Many games had around 600 seconds on the clock, or 10 minutes.
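The quartile comparison above can be condensed into a single number. The sketch below uses Bowley's quartile skewness, which is not part of the original analysis but is built from the same three quartiles; the function name is hypothetical.

```python
def quartile_skewness(q1, q2, q3):
    """Bowley's quartile skewness: positive values indicate a right skew."""
    return ((q3 - q2) - (q2 - q1)) / (q3 - q1)

# Quartiles of `seconds` as reported by describe() above.
skew_seconds = quartile_skewness(600, 645, 1185)
print(round(skew_seconds, 3))  # 0.846
```

The measure ranges from -1 to 1, so a value this close to 1 is consistent with the strong right skew described above.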

In [35]:
alt.Chart(df).mark_bar().encode(
    x = alt.X("seconds:Q", bin = alt.BinParams(step = 500)),
    y = "count()"
)

alt.Chart(...)

This histogram shows the extent to which smaller values are far more common. The range from 500 to 1000 seconds especially had a staggering number of games, representing over half of the total data points. The distance of the outliers from common values also becomes evident.

In [36]:
df[df["seconds"] > 6000].shape

(266, 17)

In [37]:
alt.Chart(df).mark_bar(clip=True).encode(
    x = alt.X("seconds:Q", bin = alt.BinParams(step = 250), scale = alt.Scale(domain = [0, 6000])),
    y = "count()"
)

alt.Chart(...)

Due to the wide range of the x axis in the first histogram, I decided to make a second one that would provide a better representation of the more common values. I found that the proportion of the data greater than 6000 seconds was very small, about 1.3% of the total data (266 of 20,058 games), so I omitted these points. The range between 500 and 750 seconds represented around half of the data, and a large portion of the data was between 250 and 2000 seconds.
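The share of games above the cutoff can be computed from a boolean mask, since the mean of a True/False column is the fraction of True values. This is a minimal sketch; the `seconds` series here holds hypothetical toy values, not the real column.

```python
import pandas as pd

# Hypothetical toy values standing in for df["seconds"].
seconds = pd.Series([600, 645, 900, 1185, 7200])

# The mean of a boolean mask is the proportion of rows above the cutoff.
share_above = (seconds > 6000).mean()
print(share_above)

# The same proportion from the counts reported in the notebook.
reported_share = 266 / 20058
print(round(reported_share, 4))  # 0.0133
```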

#### Turns:

The `turns` attribute shows the number of turns taken in each game.

In [38]:
df["turns"].describe()

count    20058.000000
mean        60.465999
std         33.570585
min          1.000000
25%         37.000000
50%         55.000000
75%         79.000000
max        349.000000
Name: turns, dtype: float64

The mean is slightly higher than the median, so the data is skewed right. The difference between Q1 and Q2 is slightly smaller than that between Q2 and Q3, so the data is slightly more concentrated on the lower side. The minimum is much closer to common values than the maximum is, supporting the right skew.

In [39]:
alt.Chart(df).mark_bar().encode(
    x = alt.X("turns:Q", bin = alt.BinParams(step = 10)),
    y = "count()"
)

alt.Chart(...)

The mode is between 50 and 60, which is also where the median is. The distribution is fairly symmetrical, but it has a clear tail towards the right and high outliers.

#### Black and White Ratings:

The attributes `black_rating` and `white_rating` show the rating for the player of either color. I thought that considering the difference between or the average of the variables may provide more useful information than the variables individually, so I made new columns for `rating_diff` and `rating_avg`. I also added `rating_diff_abs`, the absolute value of the difference, to show a difference where the color of the players did not matter.

In [40]:
df["white_rating"].describe()

count    20058.000000
mean      1596.631868
std        291.253376
min        784.000000
25%       1398.000000
50%       1567.000000
75%       1793.000000
max       2700.000000
Name: white_rating, dtype: float64

In [41]:
df["black_rating"].describe()

count    20058.000000
mean      1588.831987
std        291.036126
min        789.000000
25%       1391.000000
50%       1562.000000
75%       1784.000000
max       2723.000000
Name: black_rating, dtype: float64

The black and white ratings have almost identical distributions, which makes sense since both sets of ratings would have been taken from the same pool of players.

In [42]:
# Builds the new columns from the rating columns by name: the average of the two ratings, their difference, and the absolute value of their difference
df["rating_avg"] = (df["white_rating"] + df["black_rating"]) / 2
df["rating_diff"] = df["white_rating"] - df["black_rating"] # Positive when white has the higher rating
df["rating_diff_abs"] = df["rating_diff"].abs()

In [43]:
df["rating_avg"].describe()

count    20058.000000
mean      1592.731927
std        263.173825
min        816.500000
25%       1406.500000
50%       1568.750000
75%       1771.000000
max       2475.500000
Name: rating_avg, dtype: float64

The mean and median are very close, but the median is slightly smaller. The data seems to be mostly symmetrical and slightly more concentrated on the lower end.

In [44]:
df["rating_diff"].describe()

count    20058.000000
mean         7.799880
std        249.036667
min      -1605.000000
25%       -108.000000
50%          3.000000
75%        122.000000
max       1499.000000
Name: rating_diff, dtype: float64

The mean and median are close to zero, which reflects the similarity of the black and white distributions. From the way the column was constructed, positive values indicate a greater white rating and negative ones a greater black rating. It makes sense that the matching system would produce a given difference in favor of black about as often as the same difference in favor of white. The distribution is mostly symmetrical as well.

In [45]:
df["rating_diff_abs"].describe()

count    20058.000000
mean       173.091435
std        179.214854
min          0.000000
25%         45.000000
50%        115.000000
75%        241.000000
max       1605.000000
Name: rating_diff_abs, dtype: float64

The mean is higher than the median, and the data is skewed right. The difference between Q1 and Q2 is smaller than that between Q2 and Q3. If the matching system for players worked well, it makes sense that players of similar skill levels and thus ratings would be set to play one another. 

In [46]:
df[["seconds", "turns","rating_avg","rating_diff","rating_diff_abs"]].corr()

Unnamed: 0,seconds,turns,rating_avg,rating_diff,rating_diff_abs
seconds,1.0,0.047946,-0.041242,0.001271,0.053972
turns,0.047946,1.0,0.160526,-0.035781,-0.126531
rating_avg,-0.041242,0.160526,1.0,0.000965,0.054417
rating_diff,0.001271,-0.035781,0.000965,1.0,0.056133
rating_diff_abs,0.053972,-0.126531,0.054417,0.056133,1.0


Every quantitative variable I analyzed had close to no correlation with the other variables. The pairs `rating_avg` and `turns` (0.16) and `rating_diff_abs` and `turns` (-0.13) were more strongly correlated than the others, but their values were still low on their own. It could make sense that higher level players would want to play more involved games and would make fewer fatal mistakes, leading to a greater number of turns. On the other hand, players with a greater difference in skill would have a game end in fewer turns, as the more skilled player would more easily outsmart the less skilled one. In terms of the other variables, time may not influence any other elements because any level of player could take any number of turns depending on how careful they wanted to be. Even in shorter matches, the lower amount of time may lead players to make moves much more quickly. The rating average and difference are naturally not correlated because the same difference could occur between a wide range of values.
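Because several of these variables are right-skewed with large outliers, the default Pearson coefficients from `.corr()` can be pulled around by a few extreme games. One possible check, which was not part of the original analysis, is a rank-based (Spearman) correlation, computed here as the Pearson correlation of the ranks. The `toy` frame below is made-up data used only to show the mechanics.

```python
import pandas as pd

# Made-up data: turns has one extreme outlier, and rating_diff_abs
# strictly decreases as turns increases.
toy = pd.DataFrame({
    "turns": [10, 20, 30, 40, 500],
    "rating_diff_abs": [400, 300, 200, 100, 90],
})

pearson = toy.corr()          # default method is Pearson
spearman = toy.rank().corr()  # Pearson on ranks equals Spearman's rho

print(pearson.loc["turns", "rating_diff_abs"])
print(spearman.loc["turns", "rating_diff_abs"])
```

In this toy case the relationship is perfectly monotonic, so the rank correlation is -1 while the Pearson value is weaker because of the outlier. Whether the same gap appears in the chess data would need to be checked on the real columns.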

### Qualitative Variables:

For this section, I focused on the attributes `victory_status` and `winner`. I found the counts across the categories of each variable and created a side-by-side bar plot to compare them.

#### Victory Status

`victory_status` gives the way in which the winning player won. The possible categories include out of time, where the loser's clock ran out, resign, where the loser resigned, mate, where the winner checkmated their opponent, and draw. 

In [47]:
pd.unique(df["victory_status"])

array(['outoftime', 'resign', 'mate', 'draw'], dtype=object)

In [48]:
alt.Chart(df).mark_bar().encode(
    x = alt.X("victory_status:N"),
    y = "count()"
)

alt.Chart(...)

Games ending in a resignation were much more common than others, and checkmates were fairly common as well. Given the online format of the games, it makes sense that players would commonly resign upon realizing they were in a losing or drawn position rather than playing out the match. Connection issues or players accidentally selecting the wrong format and leaving could be common as well.

#### Winner

`winner` shows which color won or if the players drew.

In [49]:
alt.Chart(df).mark_bar().encode(
    x = alt.X("winner:N"),
    y = "count()"
)

alt.Chart(...)

White and black had similar numbers of victories, but white still had noticeably more than black, which could indicate that playing white is slightly advantageous. Having a victor in general was far more common than drawing.
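The bar heights can be turned into proportions, which makes the gap between white and black easier to state. This is a minimal sketch on a hypothetical toy series; the real shares would come from running the same call on `df["winner"]`.

```python
import pandas as pd

# Hypothetical toy outcomes standing in for df["winner"].
winner = pd.Series(["white", "black", "white", "draw", "white", "black"])

# normalize=True returns each category's share instead of its count.
shares = winner.value_counts(normalize=True)
print(shares)
```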

In [50]:
alt.Chart(df).mark_bar().encode(
    x = alt.X("victory_status:N"),
    y = "count()"
).facet( 
    column = "winner:N" # A chart showing victory_status is created individually for each winner
)

alt.FacetChart(...)

In any case where one side beats the other, the distribution of mates, out of times, and resigns is very similar. In a majority of cases where a draw occurred, the result was a pure draw rather than an out of time. When a player runs out of time, it is probably rare that their opponent does not have enough pieces to deliver checkmate, which is the situation in which running out of time is scored as a draw.
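The comparison across facets can also be made with a table of row proportions, so that each winner's mix of victory types sums to 1 and differences in the number of wins do not get in the way. The sketch below uses a hypothetical toy frame with the same two column names.

```python
import pandas as pd

# Hypothetical toy rows standing in for the winner and victory_status columns.
toy = pd.DataFrame({
    "winner": ["white", "white", "white", "black", "black", "draw"],
    "victory_status": ["resign", "mate", "resign", "resign", "mate", "draw"],
})

# normalize="index" divides each row by its total, giving the share of
# each victory type within each winner category.
shares = pd.crosstab(toy["winner"], toy["victory_status"], normalize="index")
print(shares)
```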

### Visualization:

The below visualization contains the attributes `rating_diff`, `winner`, and `victory_status`. The marks are the box plots, and the channels include x axis position within each chart, x axis position between the 4 charts, the y axis position, the color, the box size and position of the median line, the length of the whiskers, and the position of outlier points.

The visualization shows how the difference in the ratings of the players relates to which player won and how that relationship changes according to how they won. In general, black won more often when the difference was negative, meaning the black player had a higher rating, and white won more often when the difference was positive, meaning the white player had a higher rating. In a draw, the median of the difference was almost zero. When a mate occurred, the boxes were more separated, showing the rating difference favored the winner to a greater extent. Resigns had slightly less separation, and the out of time result had a clear overlap. It makes sense that an out of time result may cause even a better player to lose, if they were spending too much time strategizing, for instance, but checkmating the other player would require more skill, which corresponds to a higher rating. For the draw, there is less variation in the box plot for drawing out of time than for having a pure draw result. This may be because a player who was significantly better than another would not usually be in a position where they would run out of time at an advantage or run out of pieces before a worse opponent, so players who drew out of time would tend to have closer ratings.

I decided to use box plots over regular bars because I thought they provided more useful information. While using mean bars could make the visualization easier to look at, I thought that they would oversimplify the data. Using a box plot displays the distribution for the rating difference, and it is often useful to see where the data varies more.
With more complicated marks, I used a side-by-side format over attempting to stack or layer the bar plots in some form, which could be difficult to read. I chose an interaction that highlights the box plots in different colors for the winner selected in the legend. I hoped that this could help the reader compare the distributions for a single winner across different types of victories.

In [51]:
hover = alt.selection_multi(on="mouseover", nearest=True, empty="none", fields=["winner"]) # Selects the field corresponding to the object the user mouses over 
win_color = alt.condition(hover, alt.Color("winner:N", legend=None), alt.value("lightgray")) # The field's color changes to light gray when not selected

graph = alt.Chart(df).mark_boxplot().encode(
    x=alt.X("winner:N", title="Winner"),
    y=alt.Y("rating_diff:Q", title="Rating Difference"),
    color=win_color
).interactive().facet( 
    column=alt.Column("victory_status:N", title = "Type of Victory"), # A chart showing victor is created for each type of victory
    title = "Rating Difference vs Winner and Type of Victory"
)
legend = alt.Chart(df).mark_rect().encode(
    y = alt.Y("winner:N", axis=alt.Axis(orient="right", title="Winner")),
    color=win_color
).add_selection(hover) # The selection will occur for the fields the user mouses over in the legend


graph|legend

alt.HConcatChart(...)
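The median lines drawn in the box plots can be checked numerically by grouping on the same two categorical fields. This is a sketch on a hypothetical toy frame; the values are made up and only the column names match the notebook.

```python
import pandas as pd

# Hypothetical toy rows; rating_diff is white rating minus black rating.
toy = pd.DataFrame({
    "victory_status": ["mate", "mate", "mate", "mate", "outoftime", "outoftime"],
    "winner": ["white", "white", "black", "black", "white", "black"],
    "rating_diff": [150, 50, -120, -80, 10, -5],
})

# One median per (victory type, winner) pair, laid out with one
# column per winner so the separation within each victory type is visible.
medians = (
    toy.groupby(["victory_status", "winner"])["rating_diff"]
    .median()
    .unstack("winner")
)
print(medians)
```

A larger gap between the white and black columns in a row corresponds to more separated boxes in that facet of the chart.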