# League of Legends Early-Game EDA

This notebook delivers the exploratory analysis and baseline modeling required by the CMSE 492 Project Setup and Proposal assignment. The workflow establishes a clean train/test split, profiles the dataset, surfaces outcome-driven patterns, and records a simple baseline—all prerequisites for the upcoming project proposal and milestone planning.

## 1. Environment Setup

The assignment specifies using the scientific Python stack listed in `requirements.txt`. We load it here and configure plotting for consistent styling, even in headless environments.

In [1]:
from __future__ import annotations

import json
from pathlib import Path

import matplotlib
matplotlib.use('Agg')  # Non-interactive backend so figures render in headless environments
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

sns.set_theme(style='whitegrid')

## 2. Data Source and Loading

Per the requirements, we document provenance before analysis. The dataset comes from Kaggle's *League of Legends Diamond Ranked Games (10 min)* collection, which aggregates high-level ranked matches with team statistics captured through the first ten minutes.

In [11]:
RAW_PATH = Path('/Users/liamsandy/ML_Project/data/raw/high_diamond_ranked_10min.csv')
if not RAW_PATH.exists():
    raise FileNotFoundError(f'Dataset missing at {RAW_PATH}')

df = pd.read_csv(RAW_PATH)
df.head()

| | gameId | blueWins | blueWardsPlaced | blueWardsDestroyed | blueFirstBlood | blueKills | blueDeaths | blueAssists | blueEliteMonsters | blueDragons | ... | redTowersDestroyed | redTotalGold | redAvgLevel | redTotalExperience | redTotalMinionsKilled | redTotalJungleMinionsKilled | redGoldDiff | redExperienceDiff | redCSPerMin | redGoldPerMin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4519157822 | 0 | 28 | 2 | 1 | 9 | 6 | 11 | 0 | 0 | ... | 0 | 16567 | 6.8 | 17047 | 197 | 55 | -643 | 8 | 19.7 | 1656.7 |
| 1 | 4523371949 | 0 | 12 | 1 | 0 | 5 | 5 | 5 | 0 | 0 | ... | 1 | 17620 | 6.8 | 17438 | 240 | 52 | 2908 | 1173 | 24.0 | 1762.0 |
| 2 | 4521474530 | 0 | 15 | 0 | 0 | 7 | 11 | 4 | 1 | 1 | ... | 0 | 17285 | 6.8 | 17254 | 203 | 28 | 1172 | 1033 | 20.3 | 1728.5 |
| 3 | 4524384067 | 0 | 43 | 1 | 0 | 4 | 5 | 5 | 1 | 0 | ... | 0 | 16478 | 7.0 | 17961 | 235 | 47 | 1321 | 7 | 23.5 | 1647.8 |
| 4 | 4436033771 | 0 | 75 | 4 | 0 | 6 | 6 | 6 | 0 | 0 | ... | 0 | 17404 | 7.0 | 18313 | 225 | 67 | 1004 | -230 | 22.5 | 1740.4 |
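The loading cell above uses an absolute path, which ties the notebook to one machine. A minimal sketch of a more portable lookup follows; the helper `resolve_raw_path`, the environment variable `LOL_DATA_DIR`, and the repo-relative default `data/raw` are all assumptions introduced for illustration, not part of the original workflow.

```python
import os
from pathlib import Path


def resolve_raw_path(filename, default_dir='data/raw'):
    """Build the dataset path from LOL_DATA_DIR if set, else a repo-relative default.

    Both the variable name and the default directory are hypothetical.
    """
    base_dir = Path(os.environ.get('LOL_DATA_DIR', default_dir))
    return base_dir / filename


raw_path = resolve_raw_path('high_diamond_ranked_10min.csv')
print(raw_path)
```

The existing `RAW_PATH.exists()` guard would still apply to the resolved path, so a missing file fails loudly either way.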


We capture fundamental metadata—table shape and column data types—so the proposal can state the dataset size and feature mix explicitly.

In [13]:
df_shape = df.shape
df_dtypes = df.dtypes.to_frame('dtype')
df_shape, df_dtypes.head()

((9879, 40),
                     dtype
 gameId              int64
 blueWins            int64
 blueWardsPlaced     int64
 blueWardsDestroyed  int64
 blueFirstBlood      int64)

The assignment also requests saving a sample to `data/processed/` for quick inspection by reviewers or teammates.

In [15]:
processed_dir = Path('data/processed')
processed_dir.mkdir(parents=True, exist_ok=True)
sample_path = processed_dir / 'sample_matches.csv'
df.sample(n=200, random_state=42).to_csv(sample_path, index=False)
sample_path

PosixPath('data/processed/sample_matches.csv')

## 3. Train/Test Split

We follow the requirement to split data prior to deeper EDA, reserving 20% of matches for a held-out set and stratifying on the binary target `blueWins` to preserve class balance.

In [17]:
TARGET = 'blueWins'
TEST_SIZE = 0.2
RANDOM_STATE = 42

train_df, test_df = train_test_split(
    df,
    test_size=TEST_SIZE,
    stratify=df[TARGET],
    random_state=RANDOM_STATE,
)
train_df.shape, test_df.shape

((7903, 40), (1976, 40))
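One way to confirm that stratification preserved the class balance is to compare the positive-class rate in each split. The sketch below runs on a synthetic stand-in (`demo_df` is hypothetical, not the match table) so it executes without the CSV; the same comparison should carry over to `train_df` and `test_df`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the match table (assumption: binary target, roughly balanced).
rng = np.random.default_rng(42)
demo_df = pd.DataFrame({
    'blueWins': rng.integers(0, 2, size=1000),
    'blueGoldDiff': rng.normal(0, 2500, size=1000),
})

demo_train, demo_test = train_test_split(
    demo_df,
    test_size=0.2,
    stratify=demo_df['blueWins'],
    random_state=42,
)

# With stratification, the positive-class rate should agree across splits
# up to the rounding imposed by whole rows.
train_rate = demo_train['blueWins'].mean()
test_rate = demo_test['blueWins'].mean()
print(f'train={train_rate:.4f} test={test_rate:.4f} gap={abs(train_rate - test_rate):.4f}')
```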

## 4. Dataset Profile

We record the core profiling statistics—row counts, feature counts, class balance, and missingness—for reuse in the proposal's Data Description section.

In [19]:
profile = {
    'train_rows': int(train_df.shape[0]),
    'test_rows': int(test_df.shape[0]),
    'n_features': int(train_df.shape[1] - 1),
    'class_balance_train': train_df[TARGET].value_counts(normalize=True).round(4).to_dict(),
    'missing_rates': train_df.isnull().mean().round(4).to_dict(),
}
profile

{'train_rows': 7903,
 'test_rows': 1976,
 'n_features': 39,
 'class_balance_train': {0: 0.5009, 1: 0.4991},
 'missing_rates': {'gameId': 0.0,
  'blueWins': 0.0,
  'blueWardsPlaced': 0.0,
  'blueWardsDestroyed': 0.0,
  'blueFirstBlood': 0.0,
  'blueKills': 0.0,
  'blueDeaths': 0.0,
  'blueAssists': 0.0,
  'blueEliteMonsters': 0.0,
  'blueDragons': 0.0,
  'blueHeralds': 0.0,
  'blueTowersDestroyed': 0.0,
  'blueTotalGold': 0.0,
  'blueAvgLevel': 0.0,
  'blueTotalExperience': 0.0,
  'blueTotalMinionsKilled': 0.0,
  'blueTotalJungleMinionsKilled': 0.0,
  'blueGoldDiff': 0.0,
  'blueExperienceDiff': 0.0,
  'blueCSPerMin': 0.0,
  'blueGoldPerMin': 0.0,
  'redWardsPlaced': 0.0,
  'redWardsDestroyed': 0.0,
  'redFirstBlood': 0.0,
  'redKills': 0.0,
  'redDeaths': 0.0,
  'redAssists': 0.0,
  'redEliteMonsters': 0.0,
  'redDragons': 0.0,
  'redHeralds': 0.0,
  'redTowersDestroyed': 0.0,
  'redTotalGold': 0.0,
  'redAvgLevel': 0.0,
  'redTotalExperience': 0.0,
  'redTotalMinionsKilled': 0.0,
  'r

### Descriptive Statistics

Numerical summaries help flag anomalous ranges and guide scaling choices for baseline models.

In [21]:
numeric_summary = train_df.select_dtypes(include=[np.number]).describe().T
numeric_summary.head()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| gameId | 7903.0 | 4500003000.0 | 27365950.0 | 4295358000.0 | 4482322000.0 | 4510743000.0 | 4521693000.0 | 4527991000.0 |
| blueWins | 7903.0 | 0.499051 | 0.5000307 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| blueWardsPlaced | 7903.0 | 22.39011 | 18.17242 | 5.0 | 14.0 | 16.0 | 20.0 | 250.0 |
| blueWardsDestroyed | 7903.0 | 2.849424 | 2.181258 | 0.0 | 1.0 | 3.0 | 4.0 | 25.0 |
| blueFirstBlood | 7903.0 | 0.504745 | 0.5000091 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |


### Missingness Visualization

Even though this dataset is known to be complete, we include a visualization to explicitly document the absence of missing values, fulfilling the EDA checklist.

In [23]:
figures_dir = Path('figures/eda')
figures_dir.mkdir(parents=True, exist_ok=True)

missing_rates = pd.Series(profile['missing_rates']).sort_values(ascending=False)
fig, ax = plt.subplots(figsize=(8, 4))
missing_rates.head(15).plot(kind='bar', ax=ax, color='#7570b3')
ax.set_ylabel('Fraction Missing')
ax.set_title('Top Feature Missingness (Train Split)')
fig.tight_layout()
fig.savefig(figures_dir / 'missingness.png', dpi=300)
plt.close(fig)
missing_rates.head()

gameId               0.0
blueWins             0.0
redWardsDestroyed    0.0
redFirstBlood        0.0
redKills             0.0
dtype: float64

## 5. Target Correlations

Correlations with `blueWins` inform which signals (e.g., gold and experience advantages) should be prioritized in baseline models and hypothesis testing.

In [25]:
corr_with_target = (
    train_df.select_dtypes(include=[np.number]).corr()[TARGET]
    .drop(TARGET)
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
corr_with_target.head(10)

redGoldDiff           -0.512365
blueGoldDiff           0.512365
redExperienceDiff     -0.489471
blueExperienceDiff     0.489471
blueTotalGold          0.416703
blueGoldPerMin         0.416703
redGoldPerMin         -0.410694
redTotalGold          -0.410694
redTotalExperience    -0.391754
blueTotalExperience    0.391026
Name: blueWins, dtype: float64
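The ranking above contains pairs with identical magnitudes (`blueGoldDiff`/`redGoldDiff`, `blueTotalGold`/`blueGoldPerMin`), which suggests some columns are exact linear copies of others. The sketch below flags such pairs by absolute Pearson correlation. The helper `find_redundant_pairs`, its threshold, and the toy frame are assumptions for illustration; the toy columns mimic the mirroring (sign flip, division by ten) visible in the `df.head()` output.

```python
import numpy as np
import pandas as pd


def find_redundant_pairs(frame, threshold=0.999):
    """Return (left, right, corr) for column pairs with |Pearson corr| >= threshold."""
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    for i, left in enumerate(cols):
        for right in cols[i + 1:]:
            value = float(corr.loc[left, right])
            if abs(value) >= threshold:
                pairs.append((left, right, round(value, 3)))
    return pairs


# Toy data: two independent signals plus exact linear copies of each.
rng = np.random.default_rng(0)
gold_diff = rng.normal(0, 2500, size=500)
total_gold = rng.normal(16500, 1500, size=500)
toy_matches = pd.DataFrame({
    'blueGoldDiff': gold_diff,
    'redGoldDiff': -gold_diff,            # mirrored sign
    'blueTotalGold': total_gold,
    'blueGoldPerMin': total_gold / 10,    # same quantity per minute
    'blueKills': rng.poisson(6, size=500).astype(float),
})

redundant = find_redundant_pairs(toy_matches)
print(redundant)
```

Dropping one column from each flagged pair would likely leave a linear model's fit unchanged while making its coefficients easier to read.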

## 6. Visual Diagnostics

These figures will feed directly into the proposal's Data Description and Motivation sections, highlighting class balance, feature distributions, objective control, and feature interplay.

In [27]:
# Class balance
fig, ax = plt.subplots(figsize=(6, 4))
class_counts = train_df[TARGET].value_counts().sort_index()
ax.bar(['Red Wins (0)', 'Blue Wins (1)'], class_counts.values, color=['#d95f02', '#1b9e77'])
ax.set_ylabel('Match Count')
ax.set_title('Train Split Class Balance')
for label, count in zip(['Red Wins (0)', 'Blue Wins (1)'], class_counts.values):
    ax.annotate(f'{count}', (label, count), ha='center', va='bottom', fontsize=9)
fig.tight_layout()
fig.savefig(figures_dir / 'class_balance.png', dpi=300)
plt.close(fig)
class_counts

blueWins
0    3959
1    3944
Name: count, dtype: int64

In [31]:
# Feature distributions
key_features = [
    'blueGoldDiff',
    'blueExperienceDiff',
    'blueKills',
    'blueDeaths',
    'blueEliteMonsters',
    'blueDragons',
    'blueTowersDestroyed',
]
fig, axes = plt.subplots(len(key_features), 1, figsize=(7, 2.6 * len(key_features)))
for feature, ax in zip(key_features, axes):
    sns.histplot(
        train_df,
        x=feature,
        hue=TARGET,
        element='step',
        stat='density',
        common_norm=False,
        palette=['#d95f02', '#1b9e77'],
        ax=ax,
    )
    ax.set_title(f'Distribution of {feature} by Match Outcome')
fig.tight_layout()
fig.savefig(figures_dir / 'feature_distributions.png', dpi=300)
plt.close(fig)

In [32]:
# Objective control comparison
objective_features = ['blueDragons', 'blueHeralds', 'blueEliteMonsters', 'blueTowersDestroyed']
obj_stats = (
    train_df.groupby(TARGET)[objective_features]
    .mean()
    .rename(index={0: 'Red Victory', 1: 'Blue Victory'})
    .T
)
fig, ax = plt.subplots(figsize=(7.5, 4.2))
obj_stats.plot(kind='bar', ax=ax, color=['#d95f02', '#1b9e77'])
ax.set_ylabel('Average Count (First 10 Minutes)')
ax.set_title('Objective Control by Winning Team')
ax.legend(title='Outcome')
fig.tight_layout()
fig.savefig(figures_dir / 'objective_control.png', dpi=300)
plt.close(fig)
obj_stats

| blueWins | Red Victory | Blue Victory |
|---|---|---|
| blueDragons | 0.262693 | 0.462475 |
| blueHeralds | 0.153322 | 0.221349 |
| blueEliteMonsters | 0.416014 | 0.683824 |
| blueTowersDestroyed | 0.021975 | 0.081389 |


In [35]:
# Correlation heatmap
selected = list(corr_with_target.head(12).index) + [TARGET]
corr_matrix = train_df[selected].corr()
fig, ax = plt.subplots(figsize=(9, 7))
sns.heatmap(
    corr_matrix,
    annot=True,
    fmt='.2f',
    cmap='coolwarm',
    center=0,
    cbar_kws={'shrink': 0.7},
    ax=ax,
)
ax.set_title('Correlation Heatmap: Top Outcome-Linked Features')
fig.tight_layout()
fig.savefig(figures_dir / 'top_feature_correlation_heatmap.png', dpi=300)
plt.close(fig)

## 7. Outlier Assessment

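We apply the 1.5 × IQR rule to the primary advantage metrics, flagging values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. Outlier fractions of roughly 1% indicate that models can work with the raw values (after optional scaling). For `blueKills` the lower fence is negative, so only the upper fence can flag a match.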

In [38]:
outlier_features = ['blueGoldDiff', 'blueExperienceDiff', 'blueKills']
outlier_summary = {}
for feat in outlier_features:
    series = train_df[feat]
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = float(q3 - q1)
    lower = float(q1 - 1.5 * iqr)
    upper = float(q3 + 1.5 * iqr)
    outliers = series[(series < lower) | (series > upper)]
    outlier_summary[feat] = {
        'iqr': iqr,
        'lower_bound': lower,
        'upper_bound': upper,
        'outlier_fraction': float(len(outliers) / len(series)),
    }
outlier_summary

{'blueGoldDiff': {'iqr': 3143.0,
  'lower_bound': -6294.0,
  'upper_bound': 6278.0,
  'outlier_fraction': 0.012779956978362646},
 'blueExperienceDiff': {'iqr': 2516.0,
  'lower_bound': -5085.0,
  'upper_bound': 4979.0,
  'outlier_fraction': 0.011388080475768695},
 'blueKills': {'iqr': 4.0,
  'lower_bound': -2.0,
  'upper_bound': 14.0,
  'outlier_fraction': 0.00835125901556371}}
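The fences reported above follow the standard 1.5 × IQR rule. The sketch below works through it on a tiny hand-made series; the helper `iqr_bounds` and the sample values are hypothetical and chosen so the arithmetic is easy to follow.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)


# Nine made-up kill counts; the last one is deliberately extreme.
kills = pd.Series([2, 4, 5, 6, 6, 7, 8, 9, 30])
lower, upper = iqr_bounds(kills)
flagged = kills[(kills < lower) | (kills > upper)]

# Q1 = 5 and Q3 = 8, so IQR = 3 and the fences are 0.5 and 12.5.
print(lower, upper, flagged.tolist())
```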

## 8. Baseline Models

The assignment requires a simple baseline model. We compare a majority-class `DummyClassifier` with an L2-regularized logistic regression pipeline (scikit-learn's default penalty) to establish reference metrics: accuracy and F1 for both models, plus ROC-AUC for logistic regression. `gameId` is excluded because it is an identifier.

In [40]:
feature_cols = [col for col in train_df.columns if col not in {TARGET, 'gameId'}]
X_train = train_df[feature_cols]
y_train = train_df[TARGET]
X_test = test_df[feature_cols]
y_test = test_df[TARGET]

baseline_results = {}

dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
y_pred_dummy = dummy.predict(X_test)
baseline_results['dummy_majority'] = {
    'accuracy': float(accuracy_score(y_test, y_pred_dummy)),
    'f1': float(f1_score(y_test, y_pred_dummy, zero_division=0)),
}

log_reg = Pipeline(
    steps=[
        ('scaler', StandardScaler()),
        ('clf', LogisticRegression(max_iter=1000, solver='lbfgs'))
    ]
)
log_reg.fit(X_train, y_train)
y_pred_lr = log_reg.predict(X_test)
y_proba_lr = log_reg.predict_proba(X_test)[:, 1]
baseline_results['logistic_regression'] = {
    'accuracy': float(accuracy_score(y_test, y_pred_lr)),
    'f1': float(f1_score(y_test, y_pred_lr)),
    'roc_auc': float(roc_auc_score(y_test, y_proba_lr)),
}

baseline_results

{'dummy_majority': {'accuracy': 0.5010121457489879, 'f1': 0.0},
 'logistic_regression': {'accuracy': 0.7160931174089069,
  'f1': 0.7176648213387016,
  'roc_auc': 0.8057624930850082}}
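A single held-out evaluation yields one point estimate per metric. As a possible complement, stratified k-fold cross-validation on the training split would show how much the score varies between folds without touching the test set. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption; it is not the match data) with the same scaler-plus-logistic-regression pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification problem standing in for X_train / y_train.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=4, random_state=42
)

pipeline = Pipeline(
    steps=[
        ('scaler', StandardScaler()),
        ('clf', LogisticRegression(max_iter=1000)),
    ]
)

# Scaling is refit inside each fold because it is part of the pipeline,
# so no statistics leak from the validation fold into training.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc_scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring='roc_auc')
print(f'ROC-AUC: {auc_scores.mean():.3f} +/- {auc_scores.std():.3f}')
```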

## 9. Persist Outputs

To keep the repo reproducible, we persist summaries, split metadata, outlier stats, baseline metrics, and the sample dataset under `data/processed/`. Charts are already saved under `figures/eda/`.

In [42]:
(processed_dir / 'eda_summary.json').write_text(json.dumps(profile, indent=2))
numeric_summary.round(3).to_csv(processed_dir / 'numeric_feature_summary.csv')
split_meta = {
    'random_state': RANDOM_STATE,
    'test_size': TEST_SIZE,
    'stratified': True,
    'train_count': int(train_df.shape[0]),
    'test_count': int(test_df.shape[0]),
}
(processed_dir / 'split_metadata.json').write_text(json.dumps(split_meta, indent=2))
(processed_dir / 'outlier_summary.json').write_text(json.dumps(outlier_summary, indent=2))
(processed_dir / 'baseline_metrics.json').write_text(json.dumps(baseline_results, indent=2))
sorted(p.name for p in processed_dir.glob('*'))

['baseline_metrics.json',
 'eda_summary.json',
 'numeric_feature_summary.csv',
 'outlier_summary.json',
 'sample_matches.csv',
 'split_metadata.json']
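One detail to keep in mind when reading these files back: JSON object keys are always strings, so the integer class labels in `class_balance_train` return as `'0'` and `'1'`. The sketch below demonstrates the round trip in a temporary directory; `profile_demo` is a trimmed, hypothetical copy of the `profile` dictionary using values shown earlier.

```python
import json
import tempfile
from pathlib import Path

# Trimmed stand-in for the profile dictionary (values copied from the output above).
profile_demo = {
    'train_rows': 7903,
    'class_balance_train': {0: 0.5009, 1: 0.4991},
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'eda_summary.json'
    path.write_text(json.dumps(profile_demo, indent=2))
    reloaded = json.loads(path.read_text())

# Integer keys were serialized as strings; cast them back when needed.
balance = {int(label): rate for label, rate in reloaded['class_balance_train'].items()}
print(reloaded['class_balance_train'], balance)
```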

## 10. Key Takeaways for Modeling

- Blue and red wins are nearly balanced in the train split (49.9% vs. 50.1%), so class reweighting should not be needed.
- Gold and experience differentials carry the strongest signal (|corr| ≈ 0.5), which sets feature-importance expectations for tree ensembles and neural nets.
- Objective control (dragons, heralds, towers) differs by outcome: blue averages 0.46 dragons in wins versus 0.26 in losses, suggesting engineered difference features could boost models.
- Logistic regression (accuracy 0.716, ROC-AUC 0.806) clearly beats the majority baseline (accuracy 0.501), confirming that even linear models capture early-game signal and giving a benchmark for more complex approaches.
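The difference features mentioned in the third takeaway could be built by subtracting each red-side column from its blue-side counterpart. The sketch below is one possible form: the helper `add_diff_features` and the `diff*` column names are hypothetical, and the toy frame only borrows the dataset's `blue*`/`red*` naming convention.

```python
import pandas as pd


def add_diff_features(frame, stems):
    """Return a copy with diff<stem> = blue<stem> - red<stem> for each stem."""
    out = frame.copy()
    for stem in stems:
        out[f'diff{stem}'] = out[f'blue{stem}'] - out[f'red{stem}']
    return out


# Three made-up matches with early objective counts for both sides.
toy_objectives = pd.DataFrame({
    'blueDragons': [1, 0, 0],
    'redDragons': [0, 1, 0],
    'blueHeralds': [0, 1, 0],
    'redHeralds': [0, 0, 1],
})

toy_diff = add_diff_features(toy_objectives, ['Dragons', 'Heralds'])
print(toy_diff[['diffDragons', 'diffHeralds']])
```

A positive value means blue holds the objective lead, a negative value means red does, and zero means parity.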