# Game length patterns
## Question: How long do chess games typically last, and what factors influence game duration?
## Tasks
- Calculate average game length by victory type
- Find the shortest and longest games
- Identify if certain openings lead to shorter/longer games
- Create a "game length profile" for different victory types

In [8]:
# === INITIALIZE ===
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("data/games.csv")

In [9]:
# == [1] Average game length by victory type

# Get the statistical values of turns grouped by victory status
vdf = df.groupby(by="victory_status")["turns"].agg(["count", "mean", "std"])

# Add the coefficient of variation (CV = std / mean) to compare spread across groups with different means.
# Note: the CV needs positive, ratio-scale data such as turn counts; it does not require normality.
vdf['coef_variation'] = (vdf['std'] / vdf['mean']).round(3)

print(vdf)

                count       mean        std  coef_variation
victory_status                                             
draw              906  83.781457  45.318111           0.541
mate             6325  65.415020  33.245468           0.508
outoftime        1680  72.742857  39.104104           0.538
resign          11147  53.912533  29.665326           0.550


**Interpretation**
The data mostly follows intuition, in my opinion:
1. "draw" games are the longest on average (about 84 turns). Many drawn positions only
   arise late in the game.
2. "mate" games are considerably shorter (about 65 turns) and their standard deviation is
   smaller; from experience, "mate" games *feel* more decided than draws.
3. "outoftime" games lean toward the draw profile: second-longest on average (about 73 turns),
   with the second-largest standard deviation.
4. "resign" games are the shortest (about 54 turns). From a player's perspective they are the
   most clear-cut: you know right away when there is little or no chance left, and you forfeit quickly.
5. If you have ever played online, it makes sense that "resign" has the highest count.
6. The standard deviation is roughly the same fraction of the mean (0.51 to 0.55) for
   every outcome. It could be worth discussing why "mate" stands out with the lowest
   *coefficient of variation*, up to 0.042 below the other outcomes.
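
As a quick check on point 6, the spread of the coefficient of variation can be recomputed from the printed table. This is only a minimal sketch: the means and standard deviations are copied from the output above, and the names `stats` and `cv_spread` are illustrative, not part of the notebook.

```python
import pandas as pd

# Means and standard deviations copied from the printed `vdf` output above
stats = pd.DataFrame(
    {
        "mean": [83.781457, 65.415020, 72.742857, 53.912533],
        "std": [45.318111, 33.245468, 39.104104, 29.665326],
    },
    index=["draw", "mate", "outoftime", "resign"],
)

# Coefficient of variation: standard deviation relative to the mean
cv = stats["std"] / stats["mean"]

# Gap between the most and the least variable victory type
cv_spread = cv.max() - cv.min()

print(cv.round(3))
print(f"Lowest CV: {cv.idxmin()}, highest CV: {cv.idxmax()}, spread: {cv_spread:.3f}")
```

The spread comes out at roughly 0.042, with "mate" at the low end and "resign" at the high end.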

In [12]:
# == [2] Shortest and longest games
min_l_game = df.loc[df["turns"].idxmin()]
max_l_game = df.loc[df["turns"].idxmax()]
print(f"Shortest game:\n\n{min_l_game}\n\n")
print(f"Longest game:\n\n{max_l_game}")



Shortest game:

id                         3K5kYPO8
rated                          True
created_at          1491530000000.0
last_move_at        1491530000000.0
turns                             1
victory_status               resign
winner                        black
increment_code                 10+0
white_id               serik-astana
white_rating                   1464
black_id                 brorael357
black_rating                   1355
moves                            g3
opening_eco                     A00
opening_name      Hungarian Opening
opening_ply                       1
Name: 1946, dtype: object


Longest game:

id                                                         pN0ioHNr
rated                                                          True
created_at                                          1503084425823.0
last_move_at                                        1503085571843.0
turns                                                           349
victory_status           

### Interpretation: \[2\] Shortest and longest games

**Shortest game:**

One would think there is nothing to learn from the shortest game, since it is just a game that was forfeited instantly. Still, there is some information in it, at least meta information.

1. Although only one move was made (g3), Lichess still assigns an opening, which tells us something about how Lichess derives its data. Under some circumstances this could be crucial for further exploration and/or modelling.
2. The fact that there is no game with 0 turns in the whole dataset could suggest that Lichess does not persist games with 0 moves.
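
The claim about zero-turn games can be checked directly. Below is a small sketch with a hypothetical helper `turn_floor`; the toy frame exists only to make the snippet runnable on its own and is not real data. In the notebook, one would pass `df` instead.

```python
import pandas as pd

def turn_floor(games: pd.DataFrame) -> dict:
    """Summarize the lower end of the `turns` column."""
    turns = games["turns"]
    return {
        "min_turns": int(turns.min()),
        "zero_turn_games": int((turns == 0).sum()),
        "one_turn_games": int((turns == 1).sum()),
    }

# Hypothetical toy rows for illustration only (not taken from the dataset)
toy = pd.DataFrame({"turns": [1, 13, 57, 120]})

print(turn_floor(toy))
```

If `zero_turn_games` is 0 for the full dataset, that is consistent with the suggestion above, although it does not prove how Lichess stores games.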


**Longest game:**

The longest game went on for an astonishing 349 turns (half-moves, as the one-turn game with the single move g3 shows), only for black to run out of time. It is to be expected that the two players have more or less the same rating.
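
The expectation about similar ratings is easy to test. Here is a minimal sketch with a hypothetical helper `rating_gap`, demonstrated on the ratings of the shortest game as printed above, because the output for the longest game is cut off.

```python
import pandas as pd

def rating_gap(game: pd.Series) -> int:
    """Absolute rating difference between white and black for one game row."""
    return abs(int(game["white_rating"]) - int(game["black_rating"]))

# Ratings of the shortest game, copied from the printed output above
shortest = pd.Series({"white_rating": 1464, "black_rating": 1355})

print(rating_gap(shortest))  # 109
```

In the notebook, `rating_gap(max_l_game)` would give the corresponding gap for the longest game.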



In [None]:
# == [3] Openings and game length

# Assumption: `opening_stats` is not defined in the cells shown here. It is taken to be
# the per-opening statistics of `turns`, built the same way as `vdf` in [1].
opening_stats = df.groupby(by="opening_name")["turns"].agg(["count", "mean", "std"])

# Opening with the shortest games on average
fastest_opening = opening_stats.loc[opening_stats["mean"].idxmin()]
print(fastest_opening)

# Top 5 openings with the shortest games on average
# top_shortest = opening_stats.sort_values(by="mean", ascending=True).head()
# print(f"Top 5 openings with the shortest games on average:\n{top_shortest}")

# Openings with the smallest relative deviation in game length
# opening_stats["cv"] = opening_stats["std"] / opening_stats["mean"]
# print(opening_stats.head())

# Z-score: how many standard deviations is the shortest opening's mean game length
# away from the overall mean?
# shortest_opening = opening_stats.loc[opening_stats["mean"].idxmin()]
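
The z-score idea from the comments above could be sketched as follows. This is one possible reading, not necessarily the intended one: it measures each opening's mean game length against the mean and standard deviation of `turns` over all games. The helper `opening_mean_zscores`, the `min_games` filter, and the toy rows are all hypothetical.

```python
import pandas as pd

def opening_mean_zscores(games: pd.DataFrame, min_games: int = 1) -> pd.DataFrame:
    """Z-score of each opening's mean game length relative to all games."""
    per_opening = games.groupby("opening_name")["turns"].agg(["count", "mean"])
    # Optionally drop rarely played openings, whose means are unstable
    per_opening = per_opening[per_opening["count"] >= min_games].copy()

    overall_mean = games["turns"].mean()
    overall_std = games["turns"].std()  # sample standard deviation (ddof=1)

    per_opening["z_score"] = (per_opening["mean"] - overall_mean) / overall_std
    return per_opening.sort_values("z_score")

# Hypothetical toy rows for illustration only (not taken from the dataset)
toy = pd.DataFrame(
    {
        "opening_name": ["Opening A", "Opening A", "Opening B",
                         "Opening B", "Opening C", "Opening C"],
        "turns": [10, 20, 50, 60, 90, 100],
    }
)

z = opening_mean_zscores(toy)
print(z)
```

The first row of the result is the opening with the shortest games on average. With the real data, a `min_games` threshold would probably be needed, since an opening that appears only once or twice can easily end up at either extreme.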