# League of Legends Early-Game EDA

This notebook documents the exploratory data analysis (EDA) required for the CMSE 492 final project. It mirrors the expectations outlined in the course requirements document by profiling the dataset, establishing a reproducible train/test split, and surfacing insights that inform feature engineering and model selection.

## 1. Environment Setup
We import the scientific Python stack specified in `requirements.txt`, configure Matplotlib for headless execution, and set up Seaborn styling for consistent visuals.

In [None]:
from __future__ import annotations

import json
from pathlib import Path

import matplotlib
matplotlib.use('Agg')  # Non-interactive backend: figures render without a display and are saved to disk
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split

sns.set_theme(style='whitegrid')

## 2. Load Raw Data
The requirements emphasize documenting data provenance and structure. We load the Kaggle CSV of high-Diamond ranked matches, in which each row summarises one match at the 10-minute mark.

In [None]:
RAW_PATH = Path('data/raw/high_diamond_ranked_10min.csv')
if not RAW_PATH.exists():
    raise FileNotFoundError(f'Dataset missing at {RAW_PATH}')

df = pd.read_csv(RAW_PATH)
df.head()

## 3. Train/Test Split Rationale
The course requires the train/test split to be made before any deeper EDA so that held-out matches cannot leak into the analysis. We reserve 20% of matches for held-out evaluation and stratify on the binary outcome, so both splits keep the same win/loss ratio as the full dataset.

In [None]:
TARGET = 'blueWins'
TEST_SIZE = 0.2
RANDOM_STATE = 42

train_df, test_df = train_test_split(
    df,
    test_size=TEST_SIZE,
    stratify=df[TARGET],
    random_state=RANDOM_STATE,
)
train_df.shape, test_df.shape
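
One way to confirm that stratification behaved as intended is to compare class shares across the splits. The sketch below is self-contained and runs on a synthetic stand-in table rather than the Kaggle CSV. The names `toy`, `toy_train`, `toy_test`, and `max_gap` are illustrative assumptions and are not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the match table (hypothetical data, not the Kaggle CSV).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'blueGoldDiff': rng.normal(0, 2500, size=1000),
    'blueWins': rng.integers(0, 2, size=1000),
})

toy_train, toy_test = train_test_split(
    toy, test_size=0.2, stratify=toy['blueWins'], random_state=42
)

# Class shares in the full table and in each split, aligned on the class label.
balance = pd.DataFrame({
    'full': toy['blueWins'].value_counts(normalize=True),
    'train': toy_train['blueWins'].value_counts(normalize=True),
    'test': toy_test['blueWins'].value_counts(normalize=True),
}).sort_index()

# Largest absolute deviation of any split's class share from the full-data share.
max_gap = float(balance.sub(balance['full'], axis=0).abs().to_numpy().max())
print(balance.round(4))
print(f'largest deviation from full-data share: {max_gap:.4f}')
```

Applied to the real `train_df` and `test_df`, the same comparison should show class shares that differ only by rounding.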

## 4. Dataset Profile
We summarise row/feature counts, class balance, and missingness. These checkpoints address the requirements' call for describing columns, sample counts, and data quality concerns.

In [None]:
profile = {
    'train_rows': int(train_df.shape[0]),
    'test_rows': int(test_df.shape[0]),
    'n_features': int(train_df.shape[1] - 1),  # every column except the target
    'class_balance_train': train_df[TARGET].value_counts(normalize=True).round(4).to_dict(),
    'missing_rates': train_df.isnull().mean().round(4).to_dict(),
}
profile

### Descriptive Statistics
The full numeric summary helps detect anomalies (e.g., impossible ranges) and supports later feature scaling choices for models like logistic regression and neural networks.

In [None]:
numeric_summary = train_df.select_dtypes(include=[np.number]).describe().T
numeric_summary.head()
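
Because the summary is meant to inform scaling, here is a minimal sketch of standardisation that uses training statistics only, which keeps the held-out split from influencing the transform. It runs on synthetic data. The frames `toy_train` and `toy_test` and the variables `mu` and `sigma` are assumptions made for illustration, and this is not necessarily the preprocessing the project will adopt.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for two numeric features (hypothetical values).
rng = np.random.default_rng(1)
toy_train = pd.DataFrame({
    'blueGoldDiff': rng.normal(0, 2500, 800),
    'blueKills': rng.poisson(6, 800).astype(float),
})
toy_test = pd.DataFrame({
    'blueGoldDiff': rng.normal(0, 2500, 200),
    'blueKills': rng.poisson(6, 200).astype(float),
})

# Estimate the mean and standard deviation from the training split only.
mu = toy_train.mean()
sigma = toy_train.std(ddof=0)

# Apply the same training-derived parameters to both splits.
train_scaled = (toy_train - mu) / sigma
test_scaled = (toy_test - mu) / sigma

print(train_scaled.describe().loc[['mean', 'std']].round(3))
print(test_scaled.describe().loc[['mean', 'std']].round(3))
```

The scaled training columns have zero mean and unit variance by construction. The test columns will be close to that but not exact, which is expected.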

## 5. Target Correlations
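Understanding which metrics align with match outcomes guides baseline model selection and feature importance expectations. We compute each numeric feature's Pearson correlation with `blueWins` and rank the features by absolute value, keeping the sign so that negative associations remain visible.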

In [None]:
corr_with_target = (
    train_df.select_dtypes(include=[np.number]).corr()[TARGET]
    .drop(TARGET)
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
corr_with_target.head(10)
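
To illustrate why the ranking sorts on absolute value while keeping the sign, the following sketch builds a synthetic table with one positively linked feature, one negatively linked feature, and one pure-noise column. The effect sizes and the `noise` column are invented for the example and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 2000
wins = rng.integers(0, 2, size=n)

# Hypothetical features: the effect sizes below are made up for illustration.
toy = pd.DataFrame({
    'blueWins': wins,
    'blueGoldDiff': 2000 * (wins - 0.5) + rng.normal(0, 2000, n),  # positive link
    'blueDeaths': -2.0 * (wins - 0.5) + rng.normal(6, 3, n),       # negative link
    'noise': rng.normal(0, 1, n),                                  # no link
})

# Pearson correlation with the target, ranked by magnitude but reported with sign.
corr = toy.corr()['blueWins'].drop('blueWins')
ranked = corr.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked.round(3))
```

Sorting on the signed value instead would push strongly negative features such as deaths to the bottom of the list, where they are easy to overlook.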

## 6. Visual Diagnostics
Plots expose distributional differences and objective control gaps between winning and losing teams. These visuals support the report's narrative sections (Data Description, Results) and motivate feature engineering.

In [None]:
figures_dir = Path('figures/eda')
figures_dir.mkdir(parents=True, exist_ok=True)

# Class balance
fig, ax = plt.subplots(figsize=(6, 4))
class_counts = train_df[TARGET].value_counts().sort_index()
ax.bar(['Red Wins (0)', 'Blue Wins (1)'], class_counts.values, color=['#d95f02', '#1b9e77'])
ax.set_ylabel('Match Count')
ax.set_title('Train Split Class Balance')
for label, count in zip(['Red Wins (0)', 'Blue Wins (1)'], class_counts.values):
    ax.annotate(f'{count}', (label, count), ha='center', va='bottom', fontsize=9)
fig.tight_layout()
fig.savefig(figures_dir / 'class_balance.png', dpi=300)
plt.close(fig)
class_counts

In [None]:
# Feature distributions
key_features = [
    'blueGoldDiff',
    'blueExperienceDiff',
    'blueKills',
    'blueDeaths',
    'blueEliteMonsters',
    'blueDragons',
    'blueTowersDestroyed',
]
fig, axes = plt.subplots(len(key_features), 1, figsize=(7, 2.6 * len(key_features)))
for feature, ax in zip(key_features, axes):
    sns.histplot(
        train_df,
        x=feature,
        hue=TARGET,
        element='step',
        stat='density',
        common_norm=False,
        palette=['#d95f02', '#1b9e77'],
        ax=ax,
    )
    ax.set_title(f'Distribution of {feature} by Match Outcome')
fig.tight_layout()
fig.savefig(figures_dir / 'feature_distributions.png', dpi=300)
plt.close(fig)

In [None]:
# Objective control comparison
objective_features = ['blueDragons', 'blueHeralds', 'blueEliteMonsters', 'blueTowersDestroyed']
obj_stats = (
    train_df.groupby(TARGET)[objective_features]
    .mean()
    .rename(index={0: 'Red Victory', 1: 'Blue Victory'})
    .T
)
fig, ax = plt.subplots(figsize=(7.5, 4.2))
obj_stats.plot(kind='bar', ax=ax, color=['#d95f02', '#1b9e77'])
ax.set_ylabel('Average Count (First 10 Minutes)')
ax.set_title('Objective Control by Winning Team')
ax.legend(title='Outcome')
fig.tight_layout()
fig.savefig(figures_dir / 'objective_control.png', dpi=300)
plt.close(fig)
obj_stats

In [None]:
# Correlation heatmap
selected = list(corr_with_target.head(12).index) + [TARGET]
corr_matrix = train_df[selected].corr()
fig, ax = plt.subplots(figsize=(9, 7))
sns.heatmap(
    corr_matrix,
    annot=True,
    fmt='.2f',
    cmap='coolwarm',
    center=0,
    cbar_kws={'shrink': 0.7},
    ax=ax,
)
ax.set_title('Correlation Heatmap: Top Outcome-Linked Features')
fig.tight_layout()
fig.savefig(figures_dir / 'top_feature_correlation_heatmap.png', dpi=300)
plt.close(fig)

## 7. Outlier Assessment
We compute the interquartile range (IQR) for the key advantage metrics and flag values that fall more than 1.5 × IQR below the first quartile or above the third quartile. Low outlier rates suggest we can feed raw values into tree ensembles, while scale-sensitive models can handle them after standardisation.

In [None]:
outlier_features = ['blueGoldDiff', 'blueExperienceDiff', 'blueKills']
outlier_summary = {}
for feat in outlier_features:
    series = train_df[feat]
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = float(q3 - q1)
    lower = float(q1 - 1.5 * iqr)
    upper = float(q3 + 1.5 * iqr)
    outliers = series[(series < lower) | (series > upper)]
    outlier_summary[feat] = {
        'iqr': iqr,
        'lower_bound': lower,
        'upper_bound': upper,
        'outlier_fraction': float(len(outliers) / len(series)),
    }
outlier_summary
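
The same rule can be packaged as a small helper and checked by hand on a tiny series. This is a sketch under the assumption that the 1.5 × IQR fences used above are the desired rule. The function name `tukey_outlier_summary` and the parameter `k` are hypothetical and do not appear in the notebook.

```python
import pandas as pd


def tukey_outlier_summary(series: pd.Series, k: float = 1.5) -> dict:
    """Summarise IQR-based fences and the share of values outside them."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = float(q3 - q1)
    lower = float(q1 - k * iqr)
    upper = float(q3 + k * iqr)
    outside = (series < lower) | (series > upper)
    return {
        'iqr': iqr,
        'lower_bound': lower,
        'upper_bound': upper,
        'outlier_fraction': float(outside.mean()),
    }


# Hand-checkable example: nine ordinary values and one extreme value.
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
result = tukey_outlier_summary(toy)
print(result)
```

With pandas' default linear interpolation the quartiles are 3.25 and 7.75, so the fences sit at -3.5 and 14.5 and only the value 100 is flagged.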

## 8. Persist EDA Outputs
Storing summaries enables downstream scripts and the final report to reuse statistics without recomputation, satisfying the reproducibility requirement.

In [None]:
processed_dir = Path('data/processed')
processed_dir.mkdir(parents=True, exist_ok=True)

(processed_dir / 'eda_summary.json').write_text(json.dumps(profile, indent=2))
numeric_summary.round(3).to_csv(processed_dir / 'numeric_feature_summary.csv')
split_meta = {
    'random_state': RANDOM_STATE,
    'test_size': TEST_SIZE,
    'stratified': True,
    'train_count': int(train_df.shape[0]),
    'test_count': int(test_df.shape[0]),
}
(processed_dir / 'split_metadata.json').write_text(json.dumps(split_meta, indent=2))
(processed_dir / 'outlier_summary.json').write_text(json.dumps(outlier_summary, indent=2))

path_checks = sorted(p.name for p in processed_dir.glob('*'))
path_checks
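
Downstream scripts that reload these files should be aware that JSON object keys are always strings, so integer class labels such as `0` and `1` come back as `'0'` and `'1'`. The sketch below demonstrates the round trip in a temporary directory. The dictionary `profile_like` and its values are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical summary with integer class labels as keys (values are invented).
profile_like = {
    'train_rows': 800,
    'class_balance_train': {0: 0.501, 1: 0.499},
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'eda_summary.json'
    path.write_text(json.dumps(profile_like, indent=2))
    reloaded = json.loads(path.read_text())

# The integer keys were serialised as strings; convert them back explicitly.
restored_balance = {int(k): v for k, v in reloaded['class_balance_train'].items()}
print(list(reloaded['class_balance_train'].keys()))
print(restored_balance)
```

Converting the keys at load time keeps later lookups such as `restored_balance[1]` consistent with the in-memory `profile` dictionary.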

## 9. Key Takeaways for Modeling
- **Balanced outcome**: Blue and red wins are split close to 50/50 in the training data, so class reweighting is optional.
- **Gold/experience gaps dominate**: The gold and experience differentials show the strongest association with the outcome, so these economic tempo metrics should appear in every baseline model.
- **Objective control matters**: Dragons and Heralds show separation between winning and losing teams that tree models can exploit; these features may benefit from engineered difference features.
- **Minimal outliers**: With few values flagged by the IQR rule, logistic regression and neural networks can rely on standard scaling without aggressive trimming.

These observations map directly to the proposal/report sections on Data Description, Preprocessing, Modeling Strategy, and Interpretation.
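
As a starting point for the engineered differences mentioned under objective control, the sketch below adds blue-minus-red columns for selected objectives. It assumes the dataset provides matching `red*` columns (for example `redDragons` and `redHeralds`), which this notebook does not verify. The helper `add_objective_diffs` and the `blue<Name>Diff` naming are hypothetical.

```python
import pandas as pd


def add_objective_diffs(frame: pd.DataFrame,
                        objectives=('Dragons', 'Heralds')) -> pd.DataFrame:
    """Return a copy of `frame` with a blue-minus-red column per objective."""
    out = frame.copy()
    for name in objectives:
        # Assumes both blue<Name> and red<Name> columns are present.
        out[f'blue{name}Diff'] = out[f'blue{name}'] - out[f'red{name}']
    return out


# Tiny made-up table to show the resulting columns.
toy = pd.DataFrame({
    'blueDragons': [1, 0, 0],
    'redDragons': [0, 1, 0],
    'blueHeralds': [0, 0, 1],
    'redHeralds': [1, 0, 0],
})
toy_diff = add_objective_diffs(toy)
print(toy_diff[['blueDragonsDiff', 'blueHeraldsDiff']])
```

Working on a copy leaves the original frame untouched, so the same helper can be applied separately to the train and test splits.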