# Game length patterns
## Question: How long do chess games typically last, and what factors influence game duration?
## Tasks
- Calculating average game length by victory type
- Finding the shortest and longest games
- Identifying if certain openings lead to shorter/longer games
- Creating a "game length profile" for different victory types

In [8]:
# === INITIALIZE ===
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("data/games.csv")

In [9]:
# == [1] Average game length by victory type

# Get the statistical values of turns grouped by victory status
vdf = df.groupby(by="victory_status")["turns"].agg(["count", "mean", "std"])

# Add the CV - coefficient of variation (std relative to the mean); it is meaningful here because turns is a positive count
vdf['coef_variation'] = (vdf['std'] / vdf['mean']).round(3)

print(vdf)

                count       mean        std  coef_variation
victory_status                                             
draw              906  83.781457  45.318111           0.541
mate             6325  65.415020  33.245468           0.508
outoftime        1680  72.742857  39.104104           0.538
resign          11147  53.912533  29.665326           0.550


**Interpretation**
The data mostly follows intuition, in my opinion:

**1.** "draw" games are the longest on average (about 84 turns). Many late-game situations simply end in a draw.

**2.** "mate" games are considerably shorter (about 65 turns) and their standard deviation is smaller. From experience, "mate" games *feel* more decided than draw games.

**3.** "outoftime" games lean toward the long side: they are the second longest on average, after the draw games.

**4.** "resign" games are the shortest (about 54 turns). From a player's perspective, they cover the most clear-cut situations: you quickly know when there is little or no chance left and forfeit fast.

**5.** If you have ever played online, it makes sense that "resign" has the highest count.

**6.** The standard deviation makes up roughly the same share of the mean for every outcome (CV between 0.508 and 0.550). "mate" stands out with the lowest *coefficient of variation*, up to 0.042 below the other outcomes; why that is could be worth a discussion.
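The task list also asks for a "game length profile" per victory type. A minimal sketch of what such a profile could look like is given here, assuming the same `victory_status` and `turns` columns as above. The helper name `length_profile` and the toy rows are hypothetical and only illustrate the shape of the result; they are not taken from `games.csv`.

```python
import pandas as pd


def length_profile(games: pd.DataFrame) -> pd.DataFrame:
    """Per-outcome summary of game length in turns (hypothetical helper)."""
    grouped = games.groupby("victory_status")["turns"]
    profile = grouped.agg(["count", "mean", "median", "std", "min", "max"])
    # Interquartile range: spread of the middle 50% of games, less sensitive to extreme games
    profile["iqr"] = grouped.quantile(0.75) - grouped.quantile(0.25)
    # Same coefficient of variation as in cell [1]
    profile["coef_variation"] = profile["std"] / profile["mean"]
    return profile.round(3)


# Toy rows for illustration only; replace with df loaded from data/games.csv
toy = pd.DataFrame({
    "victory_status": ["mate", "mate", "mate", "resign", "resign", "resign"],
    "turns": [20, 40, 60, 10, 30, 50],
})
print(length_profile(toy))
```

Adding the median and IQR next to the mean and standard deviation is a design choice: if a few very long games pull the mean upward, the median and IQR would show it.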

In [None]:
# == [2] Shortest and longest games
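# Note: if these columns hold epoch milliseconds (an assumption, not verified here),
# pass unit="ms"; without it, numeric values are read as nanoseconds.
df['created_at'] = pd.to_datetime(df['created_at'])
df['last_move_at'] = pd.to_datetime(df['last_move_at'])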

min_l_game = df.loc[df["turns"].idxmin()]
max_l_game = df.loc[df["turns"].idxmax()]
print(f"Shortest game:\n\n{min_l_game}\n\n")
print(f"Longest game:\n\n{max_l_game}")


### Interpretation: \[2\] Shortest and longest games

**Shortest game:**

At first glance, the shortest game should carry no information, since it is just a game that was forfeited almost instantly. Still, there is some information in it, at least meta information.

1. Although only one move was made (g3), Lichess still assigns an opening, which says something about how Lichess records its data. Under some circumstances this could be crucial for further exploration or modelling.

2. There is no game with 0 turns in the whole dataset, which could suggest that Lichess does not persist games with 0 moves.
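Note that `idxmin()` returns only the first row holding the minimum, so ties are not visible in the printout. The following sketch shows one way to check how many games share the minimum and whether any 0-turn games exist. The helper name `shortest_game_summary` and the toy values are hypothetical.

```python
import pandas as pd


def shortest_game_summary(games: pd.DataFrame) -> dict:
    """Count ties at the minimum game length and any 0-turn games (hypothetical helper)."""
    min_turns = int(games["turns"].min())
    return {
        "min_turns": min_turns,
        # idxmin() reports only one of these rows
        "games_at_min": int((games["turns"] == min_turns).sum()),
        "zero_turn_games": int((games["turns"] == 0).sum()),
    }


# Toy rows for illustration only; replace with df loaded from data/games.csv
toy = pd.DataFrame({"turns": [1, 1, 12, 57]})
print(shortest_game_summary(toy))
```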


**Longest game:**

The longest game went on for an astonishing 349 turns, only to end with Black running out of time. It is to be expected that both players have more or less the same rating.
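The expectation about similar ratings can be checked directly. The sketch below compares the rating gap of the longest game with the median gap over all games. The column names `white_rating` and `black_rating` are assumptions (they do not appear in the cells above), as are the helper name and the toy values.

```python
import pandas as pd


def rating_gap_of_longest(games: pd.DataFrame,
                          white_col: str = "white_rating",
                          black_col: str = "black_rating") -> dict:
    """Rating gap in the longest game vs. the median gap overall (hypothetical helper)."""
    # Absolute rating difference per game; column names are assumed
    gaps = (games[white_col] - games[black_col]).abs()
    longest_idx = games["turns"].idxmax()
    return {
        "longest_game_gap": int(gaps.loc[longest_idx]),
        "median_gap": float(gaps.median()),
    }


# Toy rows for illustration only; replace with df loaded from data/games.csv
toy = pd.DataFrame({
    "turns": [30, 120, 75],
    "white_rating": [1500, 1810, 1400],
    "black_rating": [1450, 1795, 1900],
})
print(rating_gap_of_longest(toy))
```

A gap well below the median would support the expectation; a single game cannot prove it either way.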



In [28]:
# === [3] Identifying if certain openings lead to shorter/longer games
# opening_stats is not defined in the cells shown here; it is assumed to be built like this
opening_stats = df.groupby("opening_eco")["turns"].agg(["count", "mean", "std"])
opening_stats["cv"] = opening_stats["std"] / opening_stats["mean"]

# Filter first, then sort, so the boolean mask matches the index of the frame it is applied to
frequent = opening_stats[opening_stats["count"] > 100]

top_shortest = frequent.sort_values(by='mean', ascending=True).head()
print(f"Top 5 openings with the shortest games on average:\n{top_shortest}\n\n")

top_longest = frequent.sort_values(by='mean', ascending=False).head()
print(f"Top 5 openings with the longest games on average:\n{top_longest}")

Top 5 openings with the shortest games on average:
             count       mean        std        cv
opening_eco                                       
C23            155  49.529032  33.145941  0.669222
C57            121  49.694215  36.070034  0.725840
C40            446  51.782511  31.769603  0.613520
B00            611  54.783961  35.200118  0.642526
C20            675  55.088889  36.655438  0.665387


Top 5 openings with the longest games on average:
             count       mean        std        cv
opening_eco                                       
B90            101  68.653465  30.595240  0.445647
B40            129  68.348837  32.339665  0.473156
B50            226  67.101770  34.457230  0.513507
B06            176  66.602273  35.490535  0.532873
C62            137  66.467153  32.200213  0.484453




**Interpretation:**
I think there is not much information to gain from this data.

1. Across both top-5 lists, shortest and longest, the mean game length varies little
   (roughly 50 to 69 turns). The openings are also fairly stable in terms of deviation (by chess standards).

2. There is a factor which could distort the data. Since some openings are rarely played, their stats
   could come from one or a few players who impose their individual style on the numbers, so the
   opening itself would have little to do with, e.g., the mean.
   If this were a business project, something like this should definitely be checked.
   For now I'll leave it out.

3. If the threshold for relevant game counts is raised from 50 to, e.g., 100, the data shows
   a noticeable rise in the coefficient of variation, which means that the games vary much more
   in length. I interpret this as a result of the more popular openings being played across much
   more diverse "tiers". They get tried and tested more, and they also get practiced and
   trained more. Rarer openings, by contrast, might only be played in very small "pockets", for example
   if only the highest-rated 5% of players tend to play opening xy.
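Points 2 and 3 can both be checked with a few lines. The sketch below compares the average CV at different count thresholds and counts the distinct players behind each opening. The helper names, the `white_id` column, and the toy rows are assumptions for illustration; with the real data, thresholds such as `(50, 100)` would be used.

```python
import pandas as pd


def cv_by_threshold(games: pd.DataFrame, thresholds=(50, 100)) -> pd.DataFrame:
    """Mean coefficient of variation of openings above each count threshold (hypothetical helper)."""
    stats = games.groupby("opening_eco")["turns"].agg(["count", "mean", "std"])
    stats["cv"] = stats["std"] / stats["mean"]
    rows = []
    for t in thresholds:
        kept = stats[stats["count"] > t]
        rows.append({"min_count": t, "openings": len(kept), "mean_cv": kept["cv"].mean()})
    return pd.DataFrame(rows).set_index("min_count")


def distinct_players_per_opening(games: pd.DataFrame, player_col: str = "white_id") -> pd.Series:
    """Number of different players per opening; player_col is an assumed column name."""
    return games.groupby("opening_eco")[player_col].nunique()


# Toy rows for illustration only; replace with df loaded from data/games.csv
toy = pd.DataFrame({
    "opening_eco": ["A00", "A00", "A00", "B00", "B00"],
    "turns": [10, 20, 30, 40, 60],
    "white_id": ["a", "a", "a", "b", "c"],
})
print(cv_by_threshold(toy, thresholds=(1, 2)))
print(distinct_players_per_opening(toy))
```

An opening with many games but very few distinct players would be a candidate for the "individual style" effect described in point 2.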