## Evaluating Different Statistical Questions

1. How diverse are the player ratings (Elo)?
2. Do game lengths predict the outcome? (Using CV)
3. Which openings lead to the most predictable outcomes?
4. Which openings produce unusually short or long games compared to typical games? (Z-scores)

In [None]:
# === INITIALIZE ===
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("data/games.csv")

### Question #1: How diverse are the player ratings?

In [None]:

white_rating_stats = {
    "mean": df["white_rating"].mean(),
    "std_dev": df["white_rating"].std()
}

black_rating_stats = {
    "mean": df["black_rating"].mean(),
    "std_dev": df["black_rating"].std()
}
print(f"Stats White: {white_rating_stats}")
print(f"Stats Black: {black_rating_stats}")

In [None]:
# How many White players are within ±1 std dev of the mean?
lower = white_rating_stats["mean"] - white_rating_stats["std_dev"]
upper = white_rating_stats["mean"] + white_rating_stats["std_dev"]

# Combine both conditions with `&`; Python's `and` does not work element-wise on a Series
around_mean = df[(df["white_rating"] > lower) & (df["white_rating"] < upper)]
print(f"Within ±1 std dev: {len(around_mean)} of {len(df)} ({len(around_mean) / len(df):.1%})")

In [None]:

plt.figure(figsize=(12, 4))

# Plot 1: Distribution with standard deviation markers
plt.subplot(1, 2, 1)
plt.hist(df['white_rating'], bins=50, edgecolor='black')
# Mark the mean and the ±1 standard deviation boundaries
plt.axvline(white_rating_stats['mean'], color='red', linestyle='--', label=f'Mean: {white_rating_stats["mean"]:.0f}')
plt.axvline(white_rating_stats['mean'] - white_rating_stats['std_dev'], color='orange', linestyle='--', label='±1 Std Dev')
plt.axvline(white_rating_stats['mean'] + white_rating_stats['std_dev'], color='orange', linestyle='--')
plt.xlabel('Player Rating')
plt.ylabel('Number of Players')
plt.title('Player Rating Distribution')
plt.legend()

### Interpretation: "Player Rating Distribution"

The visualization shows that most players fall within ±1 standard deviation of the mean.

Next, I could repeat the same analysis for Black, but there is not much additional
information to gain, since the two means are very close.

I could also compute the statistics for all players combined (White and Black), but since
the White sample is a good enough representation of the population for now,
I'll stick with it.
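
The claim that most ratings lie within one standard deviation of the mean can also be checked numerically instead of being read off the histogram. The following is a minimal sketch, not part of the original analysis: `share_within_one_std` is a hypothetical helper, and `toy_ratings` is made-up data standing in for `df["white_rating"]`.

```python
import pandas as pd


def share_within_one_std(ratings: pd.Series) -> float:
    """Return the fraction of values strictly inside mean ± 1 standard deviation."""
    mean = ratings.mean()
    std = ratings.std()  # sample standard deviation (ddof=1), the pandas default
    inside = (ratings > mean - std) & (ratings < mean + std)
    return float(inside.mean())  # mean of a boolean mask = share of True values


# Hypothetical toy ratings, NOT taken from data/games.csv
toy_ratings = pd.Series([1200, 1400, 1500, 1500, 1600, 1600, 1700, 1800, 2000, 2300])
share = share_within_one_std(toy_ratings)
print(f"Share within ±1 std dev: {share:.0%}")
```

On the notebook's data the call would be `share_within_one_std(df["white_rating"])`, and the same helper works for `df["black_rating"]`. As a reference point, roughly 68% of values fall in this range for a normal distribution.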

### Question #2: What is the relationship between victory type and game length?


In [None]:
# Get the statistical values of turns grouped by victory status
vdf = df.groupby(by="victory_status")["turns"].agg(["count", "mean", "std"])

# Add the coefficient of variation (CV): std relative to the mean (the data is approximately normally distributed)
vdf['coef_variation'] = (vdf['std'] / vdf['mean']).round(3)

print(vdf)
for victory_type in vdf.index:
    cv = vdf.loc[victory_type, 'coef_variation']
    print(f"  {victory_type:12}: {cv:.3f} (Mean: {vdf.loc[victory_type, 'mean']:.1f} turns)")

### Interpretation: Coefficient of Variation
Unfortunately, there is no clear pattern linking the game length in turns
to the way the game ends.

**-> Chess games are pretty variable in terms of the number of turns, no matter how the
game is won.**
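
To make it easier to see what "no clear pattern" looks like, here is a minimal sketch on made-up data. `cv_by_group` is a hypothetical helper that wraps the same `groupby` and `agg` steps used above, and `toy_games` is invented for illustration. The two groups have different mean lengths but the same relative spread, so the CV alone cannot tell them apart.

```python
import pandas as pd


def cv_by_group(frame: pd.DataFrame, group_col: str, value_col: str) -> pd.DataFrame:
    """Return count, mean, std and coefficient of variation of value_col per group."""
    stats = frame.groupby(group_col)[value_col].agg(["count", "mean", "std"])
    stats["coef_variation"] = stats["std"] / stats["mean"]  # CV = std / mean
    return stats


# Hypothetical toy games, NOT taken from data/games.csv
toy_games = pd.DataFrame({
    "victory_status": ["mate", "mate", "mate", "resign", "resign", "resign"],
    "turns": [20, 40, 60, 40, 80, 120],
})
toy_stats = cv_by_group(toy_games, "victory_status", "turns")
print(toy_stats)
```

Because the CV divides the standard deviation by the mean, it compares variability across groups with different average game lengths. Similar CV values across victory types mean the relative spread is about the same everywhere.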