# League of Legends Early-Game EDA

This notebook delivers the exploratory analysis and baseline modeling required by the CMSE 492 Project Setup and Proposal assignment. The workflow establishes a clean train/test split, profiles the dataset, surfaces outcome-driven patterns, and records a simple baseline—all prerequisites for the upcoming project proposal and milestone planning.

## 1. Environment Setup

The assignment specifies using the scientific Python stack listed in `requirements.txt`. We load it here and configure plotting for consistent styling, even in headless environments.

In [None]:
from __future__ import annotations

import json
from pathlib import Path

import matplotlib
matplotlib.use('Agg')  # Non-interactive backend: figures render to files without a display (headless runs)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

sns.set_theme(style='whitegrid')

## 2. Data Source and Loading

Per the requirements, we document provenance before analysis. The dataset comes from Kaggle's *League of Legends Diamond Ranked Games (10 min)* collection, which aggregates high-level ranked matches with team statistics captured through the first ten minutes.

In [None]:
RAW_PATH = Path('data/raw/high_diamond_ranked_10min.csv')
if not RAW_PATH.exists():
    raise FileNotFoundError(f'Dataset missing at {RAW_PATH}')

df = pd.read_csv(RAW_PATH)
df.head()

We capture fundamental metadata—table shape and column data types—so the proposal can state the dataset size and feature mix explicitly.

In [None]:
df_shape = df.shape
df_dtypes = df.dtypes.to_frame('dtype')
df_shape, df_dtypes.head()
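
For the "feature mix" statement, a per-dtype column count is often easier to quote than the full dtype table. The sketch below is one possible way to get it; `dtype_mix` is a hypothetical helper, and the toy frame uses made-up values (the `blueAvgLevel` column name is an assumption for illustration only).

```python
import pandas as pd


def dtype_mix(frame: pd.DataFrame) -> dict:
    """Count how many columns share each dtype (hypothetical helper)."""
    counts = frame.dtypes.astype(str).value_counts()
    return {name: int(n) for name, n in counts.items()}


# Toy frame with made-up values standing in for the real table.
toy = pd.DataFrame({
    'gameId': [1, 2, 3],
    'blueWins': [0, 1, 1],
    'blueAvgLevel': [6.8, 7.0, 6.6],  # assumed float-valued column
})
mix = dtype_mix(toy)
print(toy.shape, mix)
```

Applied to the notebook's data, the call would be `dtype_mix(df)`.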

The assignment also requests saving a sample to `data/processed/` for quick inspection by reviewers or teammates.

In [None]:
processed_dir = Path('data/processed')
processed_dir.mkdir(parents=True, exist_ok=True)
sample_path = processed_dir / 'sample_matches.csv'
df.sample(n=200, random_state=42).to_csv(sample_path, index=False)
sample_path

## 3. Train/Test Split

We follow the requirement to split data prior to deeper EDA, reserving 20% of matches for a held-out set and stratifying on the binary target `blueWins` to preserve class balance.

In [None]:
TARGET = 'blueWins'
TEST_SIZE = 0.2
RANDOM_STATE = 42

train_df, test_df = train_test_split(
    df,
    test_size=TEST_SIZE,
    stratify=df[TARGET],
    random_state=RANDOM_STATE,
)
train_df.shape, test_df.shape
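
A quick way to confirm that stratification did its job is to compare the positive-class rate in both splits and check that no row appears twice. The following is a minimal sketch on synthetic data (the values are random, not drawn from the Kaggle file); the same three checks could be run on `train_df` and `test_df`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1,000 matches with a slightly uneven outcome split.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'blueGoldDiff': rng.normal(size=1000),
    'blueWins': np.repeat([0, 1], [520, 480]),
})

toy_train, toy_test = train_test_split(
    toy, test_size=0.2, stratify=toy['blueWins'], random_state=42
)

# Stratification should keep the positive rate nearly identical across splits.
train_rate = toy_train['blueWins'].mean()
test_rate = toy_test['blueWins'].mean()

# The two splits must not share any rows.
no_overlap = toy_train.index.intersection(toy_test.index).empty

print(round(train_rate, 3), round(test_rate, 3), no_overlap)
```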

## 4. Dataset Profile

We record the core profiling statistics—row counts, feature counts, class balance, and missingness—for reuse in the proposal's Data Description section.

In [None]:
profile = {
    'train_rows': int(train_df.shape[0]),
    'test_rows': int(test_df.shape[0]),
    'n_features': int(train_df.shape[1] - 1),  # every non-target column, including the gameId identifier
    'class_balance_train': train_df[TARGET].value_counts(normalize=True).round(4).to_dict(),
    'missing_rates': train_df.isnull().mean().round(4).to_dict(),
}
profile

### Descriptive Statistics

Numerical summaries help flag anomalous ranges and guide scaling choices for baseline models.

In [None]:
numeric_summary = train_df.select_dtypes(include=[np.number]).describe().T
numeric_summary.head()

### Missingness Visualization

Even though this dataset is known to be complete, we include a visualization to explicitly document the absence of missing values, fulfilling the EDA checklist.

In [None]:
figures_dir = Path('figures/eda')
figures_dir.mkdir(parents=True, exist_ok=True)

missing_rates = pd.Series(profile['missing_rates']).sort_values(ascending=False)
fig, ax = plt.subplots(figsize=(8, 4))
missing_rates.head(15).plot(kind='bar', ax=ax, color='#7570b3')
ax.set_ylabel('Fraction Missing')
ax.set_title('Top Feature Missingness (Train Split)')
fig.tight_layout()
fig.savefig(figures_dir / 'missingness.png', dpi=300)
plt.close(fig)
missing_rates.head()

## 5. Target Correlations

Correlations with `blueWins` inform which signals (e.g., gold and experience advantages) should be prioritized in baseline models and hypothesis testing.

In [None]:
corr_with_target = (
    train_df.select_dtypes(include=[np.number]).corr()[TARGET]
    .drop(TARGET)
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
corr_with_target.head(10)
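
Because `blueWins` is binary, the Pearson correlation computed above is equivalent to the point-biserial correlation: the gap between the class means, scaled by the feature's standard deviation and the class proportions, r = (mean₁ − mean₀) / s · sqrt(p(1 − p)), where s is the population standard deviation and p the share of blue wins. The sketch below checks that identity numerically on synthetic data; the feature values are made up and only loosely mimic a gold-difference column.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)               # synthetic binary outcome
x = 1500.0 * y + rng.normal(0, 2000, size=500)  # synthetic advantage-style feature

# Pearson correlation, the default method of DataFrame.corr().
pearson_r = np.corrcoef(x, y)[0, 1]

# Point-biserial form: class-mean gap scaled by spread and class proportions.
p = y.mean()
mean_gap = x[y == 1].mean() - x[y == 0].mean()
point_biserial_r = mean_gap / x.std(ddof=0) * np.sqrt(p * (1 - p))

print(round(pearson_r, 4), round(point_biserial_r, 4))
```

Read this way, a large |corr| for a feature means its average differs substantially between blue wins and blue losses relative to its overall spread.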

## 6. Visual Diagnostics

These figures will feed directly into the proposal's Data Description and Motivation sections, highlighting class balance, feature distributions, objective control, and feature interplay.

In [None]:
# Class balance
fig, ax = plt.subplots(figsize=(6, 4))
class_counts = train_df[TARGET].value_counts().sort_index()
ax.bar(['Red Wins (0)', 'Blue Wins (1)'], class_counts.values, color=['#d95f02', '#1b9e77'])
ax.set_ylabel('Match Count')
ax.set_title('Train Split Class Balance')
for label, count in zip(['Red Wins (0)', 'Blue Wins (1)'], class_counts.values):
    ax.annotate(f'{count}', (label, count), ha='center', va='bottom', fontsize=9)
fig.tight_layout()
fig.savefig(figures_dir / 'class_balance.png', dpi=300)
plt.close(fig)
class_counts

In [None]:
# Feature distributions
key_features = [
    'blueGoldDiff',
    'blueExperienceDiff',
    'blueKills',
    'blueDeaths',
    'blueEliteMonsters',
    'blueDragons',
    'blueTowersDestroyed',
]
fig, axes = plt.subplots(len(key_features), 1, figsize=(7, 2.6 * len(key_features)))
for feature, ax in zip(key_features, axes):
    sns.histplot(
        train_df,
        x=feature,
        hue=TARGET,
        element='step',
        stat='density',
        common_norm=False,
        palette=['#d95f02', '#1b9e77'],
        ax=ax,
    )
    ax.set_title(f'Distribution of {feature} by Match Outcome')
fig.tight_layout()
fig.savefig(figures_dir / 'feature_distributions.png', dpi=300)
plt.close(fig)

In [None]:
# Objective control comparison
objective_features = ['blueDragons', 'blueHeralds', 'blueEliteMonsters', 'blueTowersDestroyed']
obj_stats = (
    train_df.groupby(TARGET)[objective_features]
    .mean()
    .rename(index={0: 'Red Victory', 1: 'Blue Victory'})
    .T
)
fig, ax = plt.subplots(figsize=(7.5, 4.2))
obj_stats.plot(kind='bar', ax=ax, color=['#d95f02', '#1b9e77'])
ax.set_ylabel('Average Count (First 10 Minutes)')
ax.set_title('Objective Control by Winning Team')
ax.legend(title='Outcome')
fig.tight_layout()
fig.savefig(figures_dir / 'objective_control.png', dpi=300)
plt.close(fig)
obj_stats

In [None]:
# Correlation heatmap
selected = list(corr_with_target.head(12).index) + [TARGET]
corr_matrix = train_df[selected].corr()
fig, ax = plt.subplots(figsize=(9, 7))
sns.heatmap(
    corr_matrix,
    annot=True,
    fmt='.2f',
    cmap='coolwarm',
    center=0,
    cbar_kws={'shrink': 0.7},
    ax=ax,
)
ax.set_title('Correlation Heatmap: Top Outcome-Linked Features')
fig.tight_layout()
fig.savefig(figures_dir / 'top_feature_correlation_heatmap.png', dpi=300)
plt.close(fig)

## 7. Outlier Assessment

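We apply the 1.5 × IQR rule to three headline features: gold difference, experience difference, and blue-side kills. A value is flagged as an outlier when it falls below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. If the flagged fractions turn out to be low, models can work with the raw counts (after optional scaling).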

In [None]:
outlier_features = ['blueGoldDiff', 'blueExperienceDiff', 'blueKills']
outlier_summary = {}
for feat in outlier_features:
    series = train_df[feat]
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = float(q3 - q1)
    lower = float(q1 - 1.5 * iqr)
    upper = float(q3 + 1.5 * iqr)
    outliers = series[(series < lower) | (series > upper)]
    outlier_summary[feat] = {
        'iqr': iqr,
        'lower_bound': lower,
        'upper_bound': upper,
        'outlier_fraction': float(len(outliers) / len(series)),
    }
outlier_summary
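
To make the rule concrete, here is a small worked example on a toy series with one obvious outlier. `iqr_outlier_fraction` is a hypothetical helper that mirrors the loop above, and the numbers are invented for illustration.

```python
import pandas as pd


def iqr_outlier_fraction(series: pd.Series, k: float = 1.5) -> dict:
    """Return the IQR fences and the share of values outside them."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outside = (series < lower) | (series > upper)
    return {
        'lower_bound': float(lower),
        'upper_bound': float(upper),
        'outlier_fraction': float(outside.mean()),
    }


# Made-up values: five ordinary counts and one extreme value.
toy_kills = pd.Series([1, 2, 3, 4, 5, 100])
result = iqr_outlier_fraction(toy_kills)
print(result)
```

With pandas' default linear interpolation, Q1 = 2.25 and Q3 = 4.75 here, so the fences sit at −1.5 and 8.5 and only the value 100 is flagged.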

## 8. Baseline Models

The homework requires a simple baseline model. We compare a majority-class `DummyClassifier` with a logistic regression pipeline (L2-regularized, scikit-learn's default) to establish reference metrics: accuracy and F1 for both models, plus ROC-AUC for the logistic regression. `gameId` is excluded because it is an identifier.

In [None]:
feature_cols = [col for col in train_df.columns if col not in {TARGET, 'gameId'}]
X_train = train_df[feature_cols]
y_train = train_df[TARGET]
X_test = test_df[feature_cols]
y_test = test_df[TARGET]

baseline_results = {}

dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
y_pred_dummy = dummy.predict(X_test)
baseline_results['dummy_majority'] = {
    'accuracy': float(accuracy_score(y_test, y_pred_dummy)),
    'f1': float(f1_score(y_test, y_pred_dummy, zero_division=0)),
}

log_reg = Pipeline(
    steps=[
        ('scaler', StandardScaler()),
        ('clf', LogisticRegression(max_iter=1000, solver='lbfgs'))
    ]
)
log_reg.fit(X_train, y_train)
y_pred_lr = log_reg.predict(X_test)
y_proba_lr = log_reg.predict_proba(X_test)[:, 1]
baseline_results['logistic_regression'] = {
    'accuracy': float(accuracy_score(y_test, y_pred_lr)),
    'f1': float(f1_score(y_test, y_pred_lr)),
    'roc_auc': float(roc_auc_score(y_test, y_proba_lr)),
}

baseline_results
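
The dummy entry carries no ROC-AUC because a majority-class model assigns every match the same score, so it cannot rank matches. The sketch below, on synthetic data, illustrates that such a constant scorer lands at exactly 0.5, which is the reference point for reading the logistic regression's ROC-AUC. The arrays are random stand-ins, not the notebook's features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))        # synthetic features (ignored by the dummy)
y_toy = np.repeat([0, 1], [110, 90])     # synthetic labels, class 0 is the majority

dummy_toy = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)

# Every row receives the same probability for class 1.
proba = dummy_toy.predict_proba(X_toy)[:, 1]

# Constant scores give a ROC curve on the diagonal.
dummy_auc = roc_auc_score(y_toy, proba)
print(np.unique(proba), dummy_auc)
```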

## 9. Persist Outputs

To keep the repo reproducible, we persist the profile summary, numeric feature summary, split metadata, outlier stats, and baseline metrics under `data/processed/`, alongside the sample dataset saved in Section 2. Charts are already saved under `figures/eda/`.

In [None]:
(processed_dir / 'eda_summary.json').write_text(json.dumps(profile, indent=2))
numeric_summary.round(3).to_csv(processed_dir / 'numeric_feature_summary.csv')
split_meta = {
    'random_state': RANDOM_STATE,
    'test_size': TEST_SIZE,
    'stratified': True,
    'train_count': int(train_df.shape[0]),
    'test_count': int(test_df.shape[0]),
}
(processed_dir / 'split_metadata.json').write_text(json.dumps(split_meta, indent=2))
(processed_dir / 'outlier_summary.json').write_text(json.dumps(outlier_summary, indent=2))
(processed_dir / 'baseline_metrics.json').write_text(json.dumps(baseline_results, indent=2))
sorted(p.name for p in processed_dir.glob('*'))
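
The explicit `int(...)` and `float(...)` casts used throughout the notebook keep NumPy scalar types out of the JSON payloads; NumPy integers such as `np.int64` are rejected by `json.dumps`. The sketch below demonstrates the failure mode and a round-trip check, writing to a temporary directory rather than `data/processed/`. The count is a made-up value.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

raw_count = np.int64(7903)  # made-up value standing in for a NumPy integer

# json.dumps does not know how to encode NumPy integer scalars.
try:
    json.dumps({'train_count': raw_count})
    numpy_int_serializable = True
except TypeError:
    numpy_int_serializable = False

# Casting to a built-in int makes the payload serializable.
meta = {'train_count': int(raw_count), 'stratified': True}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'split_metadata.json'
    path.write_text(json.dumps(meta, indent=2))
    restored = json.loads(path.read_text())

print(numpy_int_serializable, restored == meta)
```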

## 10. Key Takeaways for Modeling

- Blue and red wins are close to balanced in the train split, so class reweighting is optional.
- Gold and experience differentials dominate the signal (|corr| ≈ 0.5), guiding feature importance expectations for tree ensembles and neural nets.
- Objective control (dragons, heralds, towers) differentiates outcomes, suggesting engineered difference features could boost models.
- Logistic regression already beats the majority baseline, which suggests that even linear models capture early-game signals. This is useful when benchmarking more complex approaches.
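
As a starting point for the engineered difference features mentioned above, the following sketch subtracts red-side counts from blue-side counts. It assumes the dataset provides `red*` columns that mirror the `blue*` objective columns (for example `redDragons`); that assumption, the helper `add_objective_diffs`, and the resulting `blue…Diff` column names are hypothetical and should be checked against the actual schema.

```python
import pandas as pd


def add_objective_diffs(frame: pd.DataFrame, objectives=('Dragons', 'Heralds')) -> pd.DataFrame:
    """Append blue-minus-red difference columns (hypothetical helper)."""
    out = frame.copy()
    for name in objectives:
        # Assumes both f'blue{name}' and f'red{name}' exist in the frame.
        out[f'blue{name}Diff'] = out[f'blue{name}'] - out[f'red{name}']
    return out


# Toy rows with made-up counts; the red* columns are assumed to exist.
toy = pd.DataFrame({
    'blueDragons': [1, 0, 1],
    'redDragons': [0, 1, 0],
    'blueHeralds': [0, 0, 1],
    'redHeralds': [1, 0, 0],
})
with_diffs = add_objective_diffs(toy)
print(with_diffs[['blueDragonsDiff', 'blueHeraldsDiff']])
```

Following the `blueGoldDiff` naming pattern keeps the new columns easy to spot next to the existing differentials.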