This project studies whether real VALORANT player behavior can be segmented into interpretable archetypes, whether those archetypes differ between professional and public play, and whether pre-match behavioral structure carries predictive signal for match outcomes.
Using real public data from organized VCT play and public competitive matches, the system builds a canonical player-match dataset, engineers behavioral features, applies unsupervised clustering at both the global and role-conditioned levels, and trains calibrated win-probability models. A separate agent-behavior layer tests whether agents are actually used according to their nominal class or drift into different behavioral roles in practice.
The current evidence suggests three main conclusions:
- role-conditioned clustering is substantially more interpretable than one global cross-role clustering pass
- pre-match player-history and archetype-composition features carry strong predictive signal for match outcomes
- agent usage is mostly aligned with nominal role in pro play, but public play shows materially more behavioral drift
This work is organized around six questions:
- Do coherent behavioral clusters exist in real VALORANT match data?
- Are those clusters more interpretable when conditioning on role?
- Do player archetypes differ between pro and public cohorts?
- Do archetype-composition features add value to pre-match win modeling?
- Can agents be grouped by actual usage behavior rather than official class labels?
- When agents are clustered by behavior, do they align with their nominal role or drift elsewhere?
Current corpus size:
- professional / organized cohort:
  - matches: 224
  - player-match rows: 2235
- public competitive cohort:
  - matches: 568
  - player-match rows: 5792
Data sources:
- organized play: public VLR-backed event and match endpoints
- public competitive play: Henrik public VALORANT API
The repository does not ship synthetic match logs.
Global cohort clustering:
- pro global silhouette: 0.2663
- public global silhouette: 0.2835
Role-conditioned clustering:
- pro Duelist silhouette: 0.2940
- pro Controller silhouette: 0.2585
- pro Initiator silhouette: 0.2864
- pro Sentinel silhouette: 0.2791
- public Duelist silhouette: 0.2267
- public Controller silhouette: 0.2611
- public Initiator silhouette: 0.3088
- public Sentinel silhouette: 0.2413
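The global-versus-role comparison above can be reproduced along the following lines. This is a sketch assuming scikit-learn; the function name `cohort_silhouettes`, the feature matrix, and `k=4` are illustrative choices, not the pipeline's actual settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def cohort_silhouettes(features, roles, k=4, seed=0):
    """Compare one global clustering pass against role-conditioned passes.

    `features` is an (n_players, n_features) array and `roles` an array of
    role labels; both stand in for the pipeline's real feature tables.
    """
    X = StandardScaler().fit_transform(features)
    global_labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    scores = {"global": silhouette_score(X, global_labels)}
    for role in np.unique(roles):
        Xr = X[roles == role]
        if len(Xr) > k:  # need more points than clusters to fit and score
            labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Xr)
            scores[str(role)] = silhouette_score(Xr, labels)
    return scores
```

Higher per-role scores than the global score would indicate the role-conditioned lens separates behavior more cleanly, which is the pattern reported above.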
Interpretation:
- global clustering yields useful exploratory structure but still mixes overlapping cross-role behavior
- role-conditioned clustering is the more credible lens for “real” VALORANT archetypes
- the strongest role-specific separation in the current public sample appears in Initiators
Best current model by cohort:
- pro:
  - model: hist_gradient_boosting
  - Brier: 0.0009
  - ROC AUC: 1.0000
  - F1: 1.0000
  - Brier improvement vs baseline: 84.7%
- public:
  - model: hist_gradient_boosting
  - Brier: 0.0161
  - ROC AUC: 0.9974
  - F1: 0.9737
  - Brier improvement vs baseline: 89.8%
Interpretation:
- pre-match player-history features are strongly informative
- team archetype composition features are useful enough to retain in the supervised pipeline
- calibrated offline models materially outperform the static strength-gap baseline
Agent behavior is inferred from unsupervised clustering over agent-level behavioral profiles rather than from Riot’s nominal role labels.
Current alignment rates:
- pro:
  - represented agents: 27
  - raw alignment rate: 59.3%
  - stable alignment rate: 100.0%
  - low-sample agents: 8
- public:
  - represented agents: 27
  - raw alignment rate: 70.4%
  - stable alignment rate: 78.9%
  - low-sample agents: 0
Interpretation:
- once low-sample agents are separated from the stable set, pro alignment is very strong
- public play shows more agent-role drift, which is consistent with looser coordination and broader usage patterns
- low-sample agents are now retained and explicitly marked as insufficient evidence rather than silently dropped
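The raw-versus-stable split above can be illustrated with a small sketch. The majority-vote alignment rule and the `min_samples` threshold here are assumptions for illustration, not the pipeline's exact logic.

```python
from collections import Counter, defaultdict

def agent_role_alignment(agent_cluster, agent_role, agent_samples, min_samples=30):
    """Raw vs stable alignment between behavioral clusters and nominal roles.

    All inputs are dicts keyed by agent name: behavioral cluster id,
    nominal role label, and player-match sample count.
    """
    # Majority nominal role within each behavioral cluster.
    cluster_votes = defaultdict(Counter)
    for agent, cluster in agent_cluster.items():
        cluster_votes[cluster][agent_role[agent]] += 1
    cluster_majority = {c: votes.most_common(1)[0][0]
                        for c, votes in cluster_votes.items()}

    # An agent is "aligned" when its cluster's majority role matches its own.
    aligned = {a for a, c in agent_cluster.items() if cluster_majority[c] == agent_role[a]}
    stable = {a for a in agent_cluster if agent_samples[a] >= min_samples}

    raw_rate = len(aligned) / len(agent_cluster)
    stable_rate = len(aligned & stable) / len(stable) if stable else float("nan")
    low_sample = len(agent_cluster) - len(stable)
    return raw_rate, stable_rate, low_sample
```

Under this scheme a low-sample agent can drag the raw rate down without affecting the stable rate, which matches the pro cohort's 59.3% raw versus 100.0% stable split.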
Raw source pulls are normalized into a common schema with:
- match metadata
- player identifiers
- team identifiers
- map
- outcome
- combat statistics
- objective interaction
- agent identity
The normalization step also:
- parses mixed datetimes
- standardizes agent names
- removes invalid agent strings
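A minimal pandas sketch of this normalization step follows. The column names (`played_at`, `agent`), the alias map, and the truncated valid-agent set are illustrative assumptions; `format="mixed"` requires pandas 2.0 or later.

```python
import pandas as pd

VALID_AGENTS = {"Jett", "Raze", "Omen", "Viper", "Sova", "Killjoy"}  # truncated for the sketch
AGENT_ALIASES = {"kay/o": "KAY/O", "jett": "Jett"}  # illustrative name fixes

def normalize_player_rows(raw: pd.DataFrame) -> pd.DataFrame:
    """Parse mixed datetimes, standardize agent names, drop invalid agents."""
    df = raw.copy()
    # Mixed datetime strings become tz-aware timestamps; failures -> NaT.
    df["played_at"] = pd.to_datetime(
        df["played_at"], errors="coerce", utc=True, format="mixed")
    # Canonicalize casing via the alias map, falling back to title case.
    df["agent"] = df["agent"].str.strip().map(
        lambda a: AGENT_ALIASES.get(a.lower(), a.title()))
    # Drop rows whose agent string is not a known agent.
    return df[df["agent"].isin(VALID_AGENTS)].reset_index(drop=True)
```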
Per-player features include:
- win rate
- KDA ratio
- kills / deaths / assists per match
- headshot rate
- damage proxy
- entry rate and entry success rate
- support score
- objective score
- consistency score
- role entropy
- role concentration
- map pool entropy
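Two of the less self-explanatory features above, the entropy and concentration measures, can be sketched as follows; the exact definitions used by the pipeline may differ.

```python
import math
from collections import Counter

def entropy(labels) -> float:
    """Shannon entropy (nats) of a label sequence.

    Applied to role labels it gives role entropy; applied to map names
    it gives map pool entropy. Zero means the player always picks the
    same role or map.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def role_concentration(roles) -> float:
    """Share of a player's matches spent on their most frequent role."""
    counts = Counter(roles)
    return counts.most_common(1)[0][1] / sum(counts.values())
```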
Three unsupervised layers are used:
- global cohort clustering
- role-conditioned clustering
- agent-behavior clustering
Global clustering surfaces broad behavioral structure; role-conditioned clustering yields interpretable VALORANT archetypes; agent-behavior clustering tests nominal-versus-actual agent usage.
Match-level modeling includes:
- static baseline
- logistic regression
- histogram gradient boosting
- PyTorch MLP benchmark
Evaluation includes:
- Brier score
- log loss
- ROC AUC
- average precision
- accuracy
- balanced accuracy
- precision
- recall
- F1
- expected calibration error
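Expected calibration error is the least standard metric in this list. One common equal-width binned formulation (not necessarily the pipeline's exact variant) is the bin-weighted mean gap between observed accuracy and mean predicted confidence:

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Equal-width binned ECE over predicted probabilities in [0, 1]."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to a bin; clip so prob == 1.0 lands in the last bin.
    idx = np.clip(np.digitize(y_prob, bins) - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            ece += mask.mean() * gap  # weight each bin by its share of samples
    return ece
```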
The match model now includes team-level composition summaries derived from the unsupervised layer:
- global archetype counts
- role-archetype counts
- archetype diversity
- archetype balance
- opponent-gap versions of the same features
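These composition summaries can be sketched as follows. The feature names, the normalized-entropy diversity measure, and the balance definition are illustrative assumptions rather than the pipeline's exact formulas.

```python
import math
from collections import Counter

def composition_features(team_archetypes, opp_archetypes, k):
    """Team-level summaries of discrete archetype labels in 0..k-1:
    per-archetype counts, diversity, balance, and opponent gaps."""
    def summarize(labels):
        counts = Counter(labels)
        n = len(labels)
        probs = [counts.get(a, 0) / n for a in range(k)]
        ent = -sum(p * math.log(p) for p in probs if p > 0)
        feats = {f"archetype_{a}_count": counts.get(a, 0) for a in range(k)}
        feats["archetype_diversity"] = ent / math.log(k) if k > 1 else 0.0
        feats["archetype_balance"] = 1.0 - max(counts.values()) / n  # 0 = all one archetype
        return feats

    team = summarize(team_archetypes)
    opp = summarize(opp_archetypes)
    # Opponent-gap versions: team value minus opponent value, per feature.
    team.update({f"gap_{name}": team[name] - opp[name] for name in list(team)})
    return team
```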
The original “one clustering pass over everyone” approach was analytically weak for VALORANT-specific interpretation.
The current evidence supports using:
- global clustering for exploratory structure
- role-conditioned clustering for interpretable VALORANT archetypes
Professional and public cohorts differ not only in outcome signal but in behavioral coherence.
The public cohort remains noisier, but still contains usable structure, especially once role conditioning is applied.
Outcome modeling is stronger than unsupervised separation alone.
This matters because it means:
- archetypes are informative
- continuous behavioral features and matchup composition still carry additional signal beyond discrete cluster membership
Agent behavior is not perfectly equivalent to official role labels.
Some agents align very cleanly with their nominal class. Others drift, especially in public play. This is a meaningful result rather than noise: it quantifies how players actually use agents.
Low-sample agents are analytically dangerous.
The current pipeline now keeps them visible but separates them from stable alignment claims instead of silently dropping them or overcommitting to weak inference.
Archetype prevalence is now traceable over time.
The timeline layer is descriptive rather than causal, but it is already useful for showing how cluster participation changes across the observed window. It is currently strongest for public data and partially available for pro data because pro timestamps are now present for 197 / 224 matches.
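The descriptive timeline can be sketched as a monthly share computation over labeled player-match rows; the column names `played_at` and `archetype` are assumptions for illustration.

```python
import pandas as pd

def archetype_prevalence(labeled: pd.DataFrame) -> pd.DataFrame:
    """Monthly share of each archetype among player-match rows.

    Rows with missing timestamps (e.g. the unrecovered pro matches)
    are excluded rather than imputed.
    """
    df = labeled.dropna(subset=["played_at"]).copy()
    df["month"] = df["played_at"].dt.to_period("M")
    counts = df.groupby(["month", "archetype"]).size().unstack(fill_value=0)
    return counts.div(counts.sum(axis=1), axis=0)  # each row sums to 1.0
```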
The project is now useful in three ways.
- As a behavioral analytics system: it discovers and compares player styles across pro and public environments.
- As a predictive modeling system: it shows that player-history and archetype-composition features are strongly informative for pre-match outcome estimation.
- As an agent-usage system: it tests whether agents are used according to design intent or repurposed behaviorally by the player base.
The second and third points are where most of the practical value currently sits.
- global clusters are still broad and should not be mistaken for definitive role archetypes
- pro timestamps are now partially recovered from VLR detail payloads, but timeline coverage is still stronger for public than pro
- low-sample agent behavior should be treated as descriptive only
- supervised metrics are based on offline holdout evaluation, not deployment-grade live validation
- cluster-to-outcome summaries are associative, not causal
Generated outputs include:
- data/interim/matches.parquet
- data/interim/player_matches.parquet
- data/processed/player_features.parquet
- data/processed/match_level.parquet
- artifacts/segmentation/*.parquet
- artifacts/prediction/model_metrics.json
- artifacts/prediction/model_predictions.parquet
- artifacts/prediction/calibration_curves.parquet
- reports/figures/*.html
- results.json
The Streamlit report includes:
- Overview
- Segmentation
- Global
- Role-Specific
- Agent Behavior
- Modeling
- Comparison
- Data
Install:

    .venv/bin/python -m pip install -e .[dev]

Fetch real data:

    HENRIK_API_KEY=... PYTHONPATH=src .venv/bin/python -m valo_player_intel.cli fetch

Run the full pipeline:

    PYTHONPATH=src MPLCONFIGDIR=/tmp/mpl .venv/bin/python -m valo_player_intel.cli run --manifest data/external/source_manifest.json

Run the app:

    PYTHONPATH=src .venv/bin/streamlit run src/valo_player_intel/app/streamlit_app.py

Run tests:

    PYTHONPATH=src .venv/bin/python -m pytest -q

Repository layout:
- data/raw/: raw fetched source files
- data/interim/: normalized canonical tables
- data/processed/: feature-engineered tables
- src/valo_player_intel/: ingestion, feature engineering, clustering, prediction, reporting, app
- artifacts/segmentation/: clustering, role-specific, and agent-behavior outputs
- artifacts/prediction/: model metrics, predictions, calibration curves
- reports/figures/: saved Plotly HTML figures
- tests/: unit tests