![cover image](cover.jpg)

# Problem Framing

The goal is to develop a predictive model that forecasts the outcomes of sports events or gaming matches
from historical data, player statistics, and other relevant factors. The model can be used to
inform betting strategies and improve user engagement on the platform.

- The project will be scoped to VALORANT, particularly its esports matches (VCT)
- The data ranges from 2021 to 2024
- The predictive model will be primarily focused on predicting the outcomes (win/loss) of individual games (by map)
- Optionally, the model could also predict the scoreboard of the game (high-risk bets)
- To simulate a betting environment, the model will be used to predict the game outcomes from Champions 2024 matches
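
The last point amounts to holding out one tournament as a test set. A minimal sketch of that split follows; the helper name `split_holdout` and the tournament label `"Valorant Champions 2024"` are assumptions for illustration (the label follows the naming of earlier years and should be checked against the `Tournament` column).

```python
import pandas as pd

def split_holdout(games: pd.DataFrame, holdout_tournament: str):
    """Return (train, test); test holds every game of the holdout tournament."""
    is_holdout = games["Tournament"] == holdout_tournament
    return games[~is_holdout].copy(), games[is_holdout].copy()

# Toy rows only; the real tournament names come from the dataset itself.
games = pd.DataFrame({
    "Tournament": [
        "Valorant Champions 2023",
        "Champions Tour 2024: Americas Stage 2",
        "Valorant Champions 2024",  # assumed label for the holdout event
    ],
    "Map": ["Bind", "Ascent", "Haven"],
})

train, test = split_holdout(games, "Valorant Champions 2024")
print(len(train), len(test))
```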

# Data Collection

The dataset used in this project is the [Valorant Champion Tour 2021-2024 Data](https://www.kaggle.com/datasets/ryanluong1/valorant-champion-tour-2021-2023-data/data), which was created by Ryan Luong and sourced from Kaggle. We chose this dataset because it offered the most comprehensive set of statistics we found, which is crucial for predictive analytics.

The dataset includes matches, agents, and player data from VCT 2021–2024. This was obtained via data scraping from [vlr.gg](https://www.vlr.gg/). Each year contains four folders: `agents`, `matches`, `player_stats`, and `ids`.

The `agents` folder contains agent pick rates, map pick rates, attacker and defender side win/loss percentage, team pick rates on an agent, and win/loss rate.

The `matches` folder contains team picks and bans, each team's economy per round and per match, player-versus-player kill breakdowns, player kill stats, the maps played in each match and their scores, player overview stats, per-round kills by player and agent, match scores and results, a list of abbreviated team names with their full names, and counts of how each team's rounds were won or lost in a match, together with the round number.

The `player_stats` folder only contains player stats.

The `ids` folder contains the ids for the teams, players, tournaments, stages, match types, matches, and games.

The `all_ids` folder contains all the IDs, and the abbreviated team name with their full name.

# Loading the Data

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# To avoid truncating the output
pd.set_option('display.max_rows', None)

In [3]:
# VCT 2021
matches_overview_2021 = pd.read_csv("vct_data/vct_2021/matches/overview.csv")
matches_maps_scores_2021 = pd.read_csv("vct_data/vct_2021/matches/maps_scores.csv")
matches_eco_rounds_2021 = pd.read_csv("vct_data/vct_2021/matches/eco_rounds.csv")
matches_kills_stats_2021 = pd.read_csv("vct_data/vct_2021/matches/kills_stats.csv")
matches_win_loss_methods_count_2021 = pd.read_csv("vct_data/vct_2021/matches/win_loss_methods_count.csv")

# VCT 2022
matches_overview_2022 = pd.read_csv("vct_data/vct_2022/matches/overview.csv")
matches_maps_scores_2022 = pd.read_csv("vct_data/vct_2022/matches/maps_scores.csv")
matches_eco_rounds_2022 = pd.read_csv("vct_data/vct_2022/matches/eco_rounds.csv")
matches_kills_stats_2022 = pd.read_csv("vct_data/vct_2022/matches/kills_stats.csv")
matches_win_loss_methods_count_2022 = pd.read_csv("vct_data/vct_2022/matches/win_loss_methods_count.csv")

# VCT 2023
matches_overview_2023 = pd.read_csv("vct_data/vct_2023/matches/overview.csv")
matches_maps_scores_2023 = pd.read_csv("vct_data/vct_2023/matches/maps_scores.csv")
matches_eco_rounds_2023 = pd.read_csv("vct_data/vct_2023/matches/eco_rounds.csv")
matches_kills_stats_2023 = pd.read_csv("vct_data/vct_2023/matches/kills_stats.csv")
matches_win_loss_methods_count_2023 = pd.read_csv("vct_data/vct_2023/matches/win_loss_methods_count.csv")

# VCT 2024
matches_overview_2024 = pd.read_csv("vct_data/vct_2024/matches/overview.csv")
matches_maps_scores_2024 = pd.read_csv("vct_data/vct_2024/matches/maps_scores.csv")
matches_eco_rounds_2024 = pd.read_csv("vct_data/vct_2024/matches/eco_rounds.csv")
matches_kills_stats_2024 = pd.read_csv("vct_data/vct_2024/matches/kills_stats.csv")
matches_win_loss_methods_count_2024 = pd.read_csv("vct_data/vct_2024/matches/win_loss_methods_count.csv")

# For mapping ids to names
all_ids_all_teams_ids = pd.read_csv("vct_data/all_ids/all_teams_ids.csv")
all_ids_all_matches_games_ids = pd.read_csv("vct_data/all_ids/all_matches_games_ids.csv")
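
The per-year loading above repeats the same five calls. One possible way to express it as a loop is sketched here; `load_matches` and `MATCH_FILES` are hypothetical names, and the directory layout is assumed to match the paths used in this cell. Passing `low_memory=False` makes pandas infer each column's dtype from the whole file at the cost of more memory.

```python
from pathlib import Path

import pandas as pd

# File stems taken from the read_csv calls in this notebook.
MATCH_FILES = ["overview", "maps_scores", "eco_rounds", "kills_stats", "win_loss_methods_count"]

def load_matches(root, years, files=MATCH_FILES):
    """Return {year: {file_stem: DataFrame}} for each year's `matches` folder."""
    root = Path(root)
    return {
        year: {
            name: pd.read_csv(root / f"vct_{year}" / "matches" / f"{name}.csv", low_memory=False)
            for name in files
        }
        for year in years
    }

# Usage (assumes the same directory layout as the paths above):
# matches = load_matches("vct_data", range(2021, 2025))
# matches[2021]["overview"].head()
```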



# Data Cleaning and Preprocessing

- ~~Load data~~
- ~~For each year, figure out how to join the tables horizontally and which columns to keep~~
- ~~Concatenate 2021-2024 data~~
- Handle missing data (prioritize imputation)
- Encode categorical data to numerical (e.g., one-hot, ordinal)
- Feature engineering
- Normalization and standardization (if necessary)

Variable naming convention: `folder_filename_year`

## Merging and concatenating tables to form a single dataset

In [4]:
def merge_tables(
        matches_maps_scores, 
        matches_overview, 
        matches_eco_rounds, 
        matches_kills_stats,
        matches_win_loss_methods_count,
        all_ids_all_teams_ids,
        all_ids_all_matches_games_ids
    ):

    # Map ids to their corresponding column names
    vct_data = matches_maps_scores.merge(
        all_ids_all_matches_games_ids,
        on=["Tournament", "Stage", "Match Type", "Match Name", "Map"],
        how="left"
    )

    vct_data = vct_data.merge(all_ids_all_teams_ids, left_on="Team A", right_on="Team", how="left")
    vct_data = vct_data.merge(all_ids_all_teams_ids, left_on="Team B", right_on="Team", how="left", suffixes=("_TeamA", "_TeamB"))

    # Always drop these columns after merging for Team B to avoid duplicate col error
    vct_data = vct_data.drop(columns=["Team_TeamA", "Team_TeamB"])

    # Aggregate and merge overview stats
    matches_overview_both = matches_overview[matches_overview["Side"] == "both"]

    overview_agg = matches_overview_both.groupby(["Tournament", "Stage", "Match Type", "Match Name", "Map", "Team"]).agg({
        "Rating": "mean",
        "Average Combat Score": "mean",
        "Kills": "mean",
        "Deaths": "mean",
        "Assists": "mean",
        "Kills - Deaths (KD)": "mean",
        "Kill, Assist, Trade, Survive %": lambda x: np.mean([(float(i.strip("%")) / 100) if isinstance(i, str) else i for i in x]),
        "Average Damage Per Round": "mean",
        "Headshot %": lambda x: np.mean([(float(i.strip("%")) / 100) if isinstance(i, str) else i for i in x]),
        "First Kills": "mean",
        "First Deaths": "mean",
        "Kills - Deaths (FKD)": "mean"
    }).reset_index()

    vct_data = vct_data.merge(
        overview_agg, 
        left_on=["Tournament", "Stage", "Match Type", "Match Name", "Map", "Team A"],
        right_on=["Tournament", "Stage", "Match Type", "Match Name", "Map", "Team"],
        how="left"
    )

    vct_data = vct_data.merge(
        overview_agg, 
        left_on=["Tournament", "Stage", "Match Type", "Match Name", "Map", "Team B"],
        right_on=["Tournament", "Stage", "Match Type", "Match Name", "Map", "Team"],
        how="left",
        suffixes=("_TeamA", "_TeamB")
    )

    vct_data = vct_data.drop(columns=["Team_TeamA", "Team_TeamB"])

    # Aggregate and merge for eco rounds
    eco_rounds_agg = matches_eco_rounds.groupby(["Tournament", "Stage", "Match Type", "Match Name", "Map", "Team"]).agg({
        "Loadout Value": lambda x: np.mean([float(i.replace('k', '').replace(',', '')) * 1000 for i in x]),
        "Remaining Credits": lambda x: np.mean([float(i.replace('k', '').replace(',', '')) * 1000 for i in x]),
        "Type": lambda x: x.mode()[0]
    }).reset_index()

    vct_data = vct_data.merge(
        eco_rounds_agg,
        left_on=["Tournament", "Stage", "Match Type", "Match Name", "Map", "Team A"],
        right_on=["Tournament", "Stage", "Match Type", "Match Name", "Map", "Team"],
        how="left"
    )

    vct_data = vct_data.merge(
        eco_rounds_agg,
        left_on=["Tournament", "Stage", "Match Type", "Match Name", "Map", "Team B"],
        right_on=["Tournament", "Stage", "Match Type", "Match Name", "Map", "Team"],
        how="left",
        suffixes=("_TeamA", "_TeamB")
    )

    vct_data = vct_data.drop(columns=["Team_TeamA", "Team_TeamB"])

    # Aggregate and merge for kills stats
    kills_stats_agg = matches_kills_stats.groupby(["Tournament", "Stage", "Match Type", "Match Name", "Map", "Team"]).agg({
        "2k": "sum",
        "3k": "sum",
        "4k": "sum",
        "5k": "sum",
        "1v1": "sum",
        "1v2": "sum",
        "1v3": "sum",
        "1v4": "sum",
        "1v5": "sum",
        "Econ": "mean",
        "Spike Plants": "sum",
        "Spike Defuses": "sum"
    }).reset_index()

    vct_data = vct_data.merge(
        kills_stats_agg,
        left_on=["Tournament", "Stage", "Match Type", "Match Name", "Map", "Team A"],
        right_on=["Tournament", "Stage", "Match Type", "Match Name", "Map", "Team"],
        how="left"
    )

    vct_data = vct_data.merge(
        kills_stats_agg,
        left_on=["Tournament", "Stage", "Match Type", "Match Name", "Map", "Team B"],
        right_on=["Tournament", "Stage", "Match Type", "Match Name", "Map", "Team"],
        how="left",
        suffixes=("_TeamA", "_TeamB")
    )

    vct_data = vct_data.drop(columns=["Team_TeamA", "Team_TeamB"])
    
    # Merge with win/loss methods count
    vct_data = vct_data.merge(
        matches_win_loss_methods_count,
        left_on=["Tournament", "Stage", "Match Type", "Match Name", "Map", "Team A"],
        right_on=["Tournament", "Stage", "Match Type", "Match Name", "Map", "Team"],
        how="left"
    )

    vct_data = vct_data.merge(
        matches_win_loss_methods_count,
        left_on=["Tournament", "Stage", "Match Type", "Match Name", "Map", "Team B"],
        right_on=["Tournament", "Stage", "Match Type", "Match Name", "Map", "Team"],
        how="left",
        suffixes=("_TeamA", "_TeamB")
    )

    vct_data = vct_data.drop(columns=["Team_TeamA", "Team_TeamB"])

    return vct_data
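
`merge_tables` repeats the same "merge for Team A, merge for Team B, drop the helper columns" pattern for every statistics table. The sketch below shows one way that pattern could be factored out. `merge_team_stats` is a hypothetical helper, not part of the notebook, and the key list is shortened to two columns for the toy data. It renames the statistic columns before merging instead of relying on `suffixes`, and uses `validate="many_to_one"` so pandas raises if the statistics table has duplicate keys that would multiply rows.

```python
import pandas as pd

KEYS = ["Match Name", "Map"]  # shortened key list, for illustration only

def merge_team_stats(games, team_stats, keys=KEYS):
    """Attach per-team stats to each game twice: once for Team A, once for Team B."""
    out = games
    for side, suffix in (("Team A", "_TeamA"), ("Team B", "_TeamB")):
        # Suffix every non-key column up front so the two merges never collide.
        stats = team_stats.rename(
            columns={c: f"{c}{suffix}" for c in team_stats.columns if c not in keys}
        )
        out = out.merge(
            stats,
            left_on=keys + [side],
            right_on=keys + [f"Team{suffix}"],
            how="left",
            validate="many_to_one",  # fail loudly if stats has duplicate keys
        )
        out = out.drop(columns=[f"Team{suffix}"])
    return out

games = pd.DataFrame({
    "Match Name": ["X vs Y"], "Map": ["Haven"], "Team A": ["X"], "Team B": ["Y"],
})
team_stats = pd.DataFrame({
    "Match Name": ["X vs Y", "X vs Y"], "Map": ["Haven", "Haven"],
    "Team": ["X", "Y"], "Kills": [15.0, 10.8],
})

merged = merge_team_stats(games, team_stats)
print(merged)
```

Comparing `len(merged)` with `len(games)` after each merge is a cheap check that a left join has not duplicated games.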

In [5]:
vct_2021 = merge_tables(
    matches_maps_scores_2021, 
    matches_overview_2021, 
    matches_eco_rounds_2021, 
    matches_kills_stats_2021, 
    matches_win_loss_methods_count_2021,
    all_ids_all_teams_ids,
    all_ids_all_matches_games_ids
)

vct_2022 = merge_tables(
    matches_maps_scores_2022, 
    matches_overview_2022, 
    matches_eco_rounds_2022, 
    matches_kills_stats_2022, 
    matches_win_loss_methods_count_2022,
    all_ids_all_teams_ids,
    all_ids_all_matches_games_ids
)

vct_2023 = merge_tables(
    matches_maps_scores_2023, 
    matches_overview_2023, 
    matches_eco_rounds_2023, 
    matches_kills_stats_2023, 
    matches_win_loss_methods_count_2023,
    all_ids_all_teams_ids,
    all_ids_all_matches_games_ids
)

vct_2024 = merge_tables(
    matches_maps_scores_2024, 
    matches_overview_2024, 
    matches_eco_rounds_2024, 
    matches_kills_stats_2024, 
    matches_win_loss_methods_count_2024,
    all_ids_all_teams_ids,
    all_ids_all_matches_games_ids
)

Investigating each year's VCT data

In [6]:
vct_2021.head()

Unnamed: 0,Tournament,Stage,Match Type,Match Name,Map,Team A,Team A Score,Team A Attacker Score,Team A Defender Score,Team A Overtime Score,...,Detonation Denied_TeamA,Time Expiry (Failed to Plant)_TeamA,Elimination_TeamB,Detonated_TeamB,Defused_TeamB,Time Expiry (No Plant)_TeamB,Eliminated_TeamB,Defused Failed_TeamB,Detonation Denied_TeamB,Time Expiry (Failed to Plant)_TeamB
0,Valorant Champions 2021,Group Stage,Opening (D),Vision Strikers vs FULL SENSE,Haven,Vision Strikers,13,9,4,,...,1.0,0.0,4.0,0.0,1.0,0.0,9.0,2.0,2.0,0.0
1,Valorant Champions 2021,Group Stage,Opening (D),Vision Strikers vs FULL SENSE,Breeze,Vision Strikers,13,9,4,,...,2.0,0.0,3.0,0.0,2.0,0.0,9.0,3.0,1.0,0.0
2,Valorant Champions 2021,Group Stage,Opening (C),Team Vikings vs Crazy Raccoon,Icebox,Team Vikings,13,6,7,,...,5.0,1.0,2.0,1.0,5.0,1.0,7.0,2.0,3.0,1.0
3,Valorant Champions 2021,Group Stage,Opening (C),Team Vikings vs Crazy Raccoon,Haven,Team Vikings,13,6,7,,...,1.0,0.0,5.0,2.0,1.0,0.0,11.0,0.0,2.0,0.0
4,Valorant Champions 2021,Group Stage,Opening (D),FNATIC vs Cloud9,Icebox,FNATIC,13,7,6,,...,4.0,0.0,6.0,1.0,4.0,0.0,7.0,0.0,6.0,0.0


In [7]:
vct_2022.head()

Unnamed: 0,Tournament,Stage,Match Type,Match Name,Map,Team A,Team A Score,Team A Attacker Score,Team A Defender Score,Team A Overtime Score,...,Detonation Denied_TeamA,Time Expiry (Failed to Plant)_TeamA,Elimination_TeamB,Detonated_TeamB,Defused_TeamB,Time Expiry (No Plant)_TeamB,Eliminated_TeamB,Defused Failed_TeamB,Detonation Denied_TeamB,Time Expiry (Failed to Plant)_TeamB
0,Valorant Champions 2022,Group Stage,Opening (A),Paper Rex vs EDward Gaming,Pearl,Paper Rex,13,6,7,,...,3.0,0.0,7.0,1.0,3.0,0.0,11.0,0.0,1.0,1.0
1,Valorant Champions 2022,Group Stage,Opening (A),Paper Rex vs EDward Gaming,Icebox,Paper Rex,5,2,3,,...,0.0,0.0,11.0,2.0,0.0,0.0,5.0,0.0,0.0,0.0
2,Valorant Champions 2022,Group Stage,Opening (A),Paper Rex vs EDward Gaming,Haven,Paper Rex,13,7,6,,...,1.0,1.0,5.0,1.0,1.0,1.0,9.0,1.0,3.0,0.0
3,Valorant Champions 2022,Group Stage,Opening (A),Leviatán vs Team Liquid,Haven,Leviatán,13,8,5,,...,2.0,1.0,6.0,1.0,2.0,1.0,8.0,0.0,5.0,0.0
4,Valorant Champions 2022,Group Stage,Opening (A),Leviatán vs Team Liquid,Ascent,Leviatán,13,6,7,,...,2.0,1.0,6.0,1.0,2.0,1.0,7.0,4.0,2.0,0.0


In [8]:
vct_2023.head()

Unnamed: 0,Tournament,Stage,Match Type,Match Name,Map,Team A,Team A Score,Team A Attacker Score,Team A Defender Score,Team A Overtime Score,...,Detonation Denied_TeamA,Time Expiry (Failed to Plant)_TeamA,Elimination_TeamB,Detonated_TeamB,Defused_TeamB,Time Expiry (No Plant)_TeamB,Eliminated_TeamB,Defused Failed_TeamB,Detonation Denied_TeamB,Time Expiry (Failed to Plant)_TeamB
0,Valorant Champions 2023,Group Stage,Opening (D),Team Liquid vs Natus Vincere,Fracture,Team Liquid,11,6,5,,...,1,1,6,5,1,1,8,2,1,0
1,Valorant Champions 2023,Group Stage,Opening (D),Team Liquid vs Natus Vincere,Bind,Team Liquid,15,7,5,3.0,...,4,1,11,1,4,1,11,1,2,1
2,Valorant Champions 2023,Group Stage,Opening (D),DRX vs LOUD,Lotus,DRX,13,7,5,1.0,...,4,1,8,2,4,1,5,4,3,1
3,Valorant Champions 2023,Group Stage,Opening (D),DRX vs LOUD,Split,DRX,13,8,5,,...,2,0,2,2,2,0,8,2,1,2
4,Valorant Champions 2023,Group Stage,Opening (D),DRX vs LOUD,Ascent,DRX,13,8,5,,...,2,0,6,0,2,0,9,1,3,0


In [9]:
vct_2024.head()

Unnamed: 0,Tournament,Stage,Match Type,Match Name,Map,Team A,Team A Score,Team A Attacker Score,Team A Defender Score,Team A Overtime Score,...,Detonation Denied_TeamA,Time Expiry (Failed to Plant)_TeamA,Elimination_TeamB,Detonated_TeamB,Defused_TeamB,Time Expiry (No Plant)_TeamB,Eliminated_TeamB,Defused Failed_TeamB,Detonation Denied_TeamB,Time Expiry (Failed to Plant)_TeamB
0,Champions Tour 2024: Americas Stage 2,Regular Season,Week 1,MIBR vs Leviatán,Ascent,MIBR,9,6,3,,...,1,0,11,1,1,0,8,0,1,0
1,Champions Tour 2024: Americas Stage 2,Regular Season,Week 1,MIBR vs Leviatán,Icebox,MIBR,7,3,4,,...,4,0,8,1,4,0,4,2,1,0
2,Champions Tour 2024: Americas Stage 2,Regular Season,Week 1,Sentinels vs NRG Esports,Lotus,Sentinels,13,8,5,,...,3,0,3,2,3,0,10,1,2,0
3,Champions Tour 2024: Americas Stage 2,Regular Season,Week 1,Sentinels vs NRG Esports,Sunset,Sentinels,14,9,3,2.0,...,2,0,7,3,2,0,11,2,1,0
4,Champions Tour 2024: Americas Stage 2,Regular Season,Week 1,FURIA vs 100 Thieves,Icebox,FURIA,14,6,6,2.0,...,1,0,10,1,1,0,11,0,2,1


Concatenating yearly data into one

In [10]:
vct_2021_2024 = pd.concat([vct_2021, vct_2022, vct_2023, vct_2024])
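
Each yearly frame carries its own 0-based index, so concatenating them as above repeats index labels across years. The toy sketch below shows the difference that `ignore_index=True` makes. Whether the notebook should renumber the index is a judgment call; the example only illustrates the behavior.

```python
import pandas as pd

a = pd.DataFrame({"Map": ["Haven", "Bind"]})
b = pd.DataFrame({"Map": ["Lotus", "Split"]})

stacked = pd.concat([a, b])                         # index labels: 0, 1, 0, 1
renumbered = pd.concat([a, b], ignore_index=True)   # index labels: 0, 1, 2, 3

print(stacked.index.is_unique, renumbered.index.is_unique)  # False True
```

A unique index matters for label-based operations such as `.loc[...]` lookups, which would otherwise return several rows for one label.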

In [11]:
vct_2021_2024.head()

Unnamed: 0,Tournament,Stage,Match Type,Match Name,Map,Team A,Team A Score,Team A Attacker Score,Team A Defender Score,Team A Overtime Score,...,Detonation Denied_TeamA,Time Expiry (Failed to Plant)_TeamA,Elimination_TeamB,Detonated_TeamB,Defused_TeamB,Time Expiry (No Plant)_TeamB,Eliminated_TeamB,Defused Failed_TeamB,Detonation Denied_TeamB,Time Expiry (Failed to Plant)_TeamB
0,Valorant Champions 2021,Group Stage,Opening (D),Vision Strikers vs FULL SENSE,Haven,Vision Strikers,13,9,4,,...,1.0,0.0,4.0,0.0,1.0,0.0,9.0,2.0,2.0,0.0
1,Valorant Champions 2021,Group Stage,Opening (D),Vision Strikers vs FULL SENSE,Breeze,Vision Strikers,13,9,4,,...,2.0,0.0,3.0,0.0,2.0,0.0,9.0,3.0,1.0,0.0
2,Valorant Champions 2021,Group Stage,Opening (C),Team Vikings vs Crazy Raccoon,Icebox,Team Vikings,13,6,7,,...,5.0,1.0,2.0,1.0,5.0,1.0,7.0,2.0,3.0,1.0
3,Valorant Champions 2021,Group Stage,Opening (C),Team Vikings vs Crazy Raccoon,Haven,Team Vikings,13,6,7,,...,1.0,0.0,5.0,2.0,1.0,0.0,11.0,0.0,2.0,0.0
4,Valorant Champions 2021,Group Stage,Opening (D),FNATIC vs Cloud9,Icebox,FNATIC,13,7,6,,...,4.0,0.0,6.0,1.0,4.0,0.0,7.0,0.0,6.0,0.0


In [12]:
vct_2021_2024.info()

<class 'pandas.core.frame.DataFrame'>
Index: 27022 entries, 0 to 1165
Data columns (total 94 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Tournament                            27022 non-null  object 
 1   Stage                                 27022 non-null  object 
 2   Match Type                            27022 non-null  object 
 3   Match Name                            27022 non-null  object 
 4   Map                                   26979 non-null  object 
 5   Team A                                27022 non-null  object 
 6   Team A Score                          27022 non-null  int64  
 7   Team A Attacker Score                 27022 non-null  int64  
 8   Team A Defender Score                 27022 non-null  int64  
 9   Team A Overtime Score                 2407 non-null   float64
 10  Team B                                27022 non-null  object 
 11  Team B Score         

In [13]:
vct_2021_2024.describe()

Unnamed: 0,Team A Score,Team A Attacker Score,Team A Defender Score,Team A Overtime Score,Team B Score,Team B Attacker Score,Team B Defender Score,Team B Overtime Score,Tournament ID,Stage ID,...,Detonation Denied_TeamA,Time Expiry (Failed to Plant)_TeamA,Elimination_TeamB,Detonated_TeamB,Defused_TeamB,Time Expiry (No Plant)_TeamB,Eliminated_TeamB,Defused Failed_TeamB,Detonation Denied_TeamB,Time Expiry (Failed to Plant)_TeamB
count,27022.0,27022.0,27022.0,2407.0,27022.0,27022.0,27021.0,2407.0,26983.0,26983.0,...,26984.0,26984.0,26984.0,26984.0,26984.0,26984.0,26984.0,26984.0,26984.0,26984.0
mean,10.734587,6.417475,4.119865,2.045285,9.486011,3.76175,5.536435,1.94516,680.025498,1355.06556,...,1.904425,0.205752,6.686147,0.681626,1.904425,0.205752,7.70564,0.721687,2.05544,0.241699
std,3.646161,2.737558,2.306665,1.746397,4.151297,2.462209,2.729987,1.724688,390.273136,747.77649,...,1.55575,0.482341,3.399158,0.920029,1.55575,0.482341,3.240275,0.953654,1.57048,0.518553
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,277.0,554.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,8.0,4.0,2.0,1.0,6.0,2.0,3.0,0.0,375.0,779.0,...,1.0,0.0,4.0,0.0,1.0,0.0,5.0,0.0,1.0,0.0
50%,13.0,7.0,4.0,2.0,11.0,4.0,5.0,2.0,557.0,1125.0,...,2.0,0.0,7.0,0.0,2.0,0.0,8.0,0.0,2.0,0.0
75%,13.0,9.0,6.0,3.0,13.0,6.0,8.0,3.0,840.0,1670.5,...,3.0,0.0,9.0,1.0,3.0,0.0,10.0,1.0,3.0,0.0
max,23.0,12.0,12.0,11.0,24.0,12.0,12.0,12.0,2096.0,4034.0,...,10.0,6.0,19.0,7.0,10.0,6.0,20.0,7.0,9.0,5.0


## Handling missing values

In [14]:
missing_values = vct_2021_2024.isnull().sum()

missing_values[missing_values > 0]

Map                                        43
Team A Overtime Score                   24615
Team B Defender Score                       1
Team B Overtime Score                   24615
Duration                                  599
Tournament ID                              39
Stage ID                                   39
Match Type ID                              44
Match ID                                   39
Game ID                                    41
Year                                       39
Team ID_TeamA                               1
Rating_TeamA                             6503
Average Combat Score_TeamA                157
Kills_TeamA                               117
Deaths_TeamA                              117
Assists_TeamA                             117
Kills - Deaths (KD)_TeamA                 118
Kill, Assist, Trade, Survive %_TeamA     6493
Average Damage Per Round_TeamA            426
Headshot %_TeamA                          433
First Kills_TeamA                 
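
The notebook inspects missing values but does not yet impute them. A minimal imputation sketch is given here under two explicit assumptions: a missing overtime score means the game had no overtime (filled with 0), and the remaining numeric statistics can be filled with the column median. `impute` is a hypothetical helper, and ID columns are skipped because a median ID is meaningless.

```python
import numpy as np
import pandas as pd

def impute(df, zero_cols, skip_cols=()):
    """Fill `zero_cols` with 0 and other numeric columns with their median."""
    out = df.copy()
    out[zero_cols] = out[zero_cols].fillna(0)
    numeric = out.select_dtypes(include="number").columns.difference(
        list(zero_cols) + list(skip_cols)
    )
    # fillna with a Series fills each column with the value under its own name.
    out[numeric] = out[numeric].fillna(out[numeric].median())
    return out

toy = pd.DataFrame({
    "Team A Overtime Score": [np.nan, 3.0, np.nan],
    "Rating_TeamA": [1.2, np.nan, 0.8],
    "Game ID": [57948.0, np.nan, 57951.0],
})

clean = impute(toy, zero_cols=["Team A Overtime Score"], skip_cols=["Game ID"])
print(clean)
```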

## Handling categorical data

In [15]:
categorical_columns = vct_2021_2024.select_dtypes(include=["object"]).columns
categorical_columns

Index(['Tournament', 'Stage', 'Match Type', 'Match Name', 'Map', 'Team A',
       'Team B', 'Duration', 'Type_TeamA', 'Type_TeamB'],
      dtype='object')

Check cardinality of categorical columns

In [16]:
for col in categorical_columns:
    print(f"{col}: {vct_2021_2024[col].nunique()}")

Tournament: 209
Stage: 86
Match Type: 253
Match Name: 10364
Map: 11
Team A: 2628
Team B: 3256
Duration: 3220
Type_TeamA: 3
Type_TeamB: 4


In [17]:
# one hot encoding for columns Map, Type_TeamA, and Type_TeamB
dummies = pd.get_dummies(vct_2021_2024[['Map','Type_TeamA','Type_TeamB']]).fillna(0).astype('int16')

vct_2021_2024 = pd.concat([vct_2021_2024, dummies], axis=1)
vct_2021_2024.drop(['Map','Type_TeamA','Type_TeamB'], axis=1, inplace=True)

pd.set_option('display.max_columns', None)
vct_2021_2024.head(5)

Unnamed: 0,Tournament,Stage,Match Type,Match Name,Team A,Team A Score,Team A Attacker Score,Team A Defender Score,Team A Overtime Score,Team B,Team B Score,Team B Attacker Score,Team B Defender Score,Team B Overtime Score,Duration,Tournament ID,Stage ID,Match Type ID,Match ID,Game ID,Year,Team ID_TeamA,Team ID_TeamB,Rating_TeamA,Average Combat Score_TeamA,Kills_TeamA,Deaths_TeamA,Assists_TeamA,Kills - Deaths (KD)_TeamA,"Kill, Assist, Trade, Survive %_TeamA",Average Damage Per Round_TeamA,Headshot %_TeamA,First Kills_TeamA,First Deaths_TeamA,Kills - Deaths (FKD)_TeamA,Rating_TeamB,Average Combat Score_TeamB,Kills_TeamB,Deaths_TeamB,Assists_TeamB,Kills - Deaths (KD)_TeamB,"Kill, Assist, Trade, Survive %_TeamB",Average Damage Per Round_TeamB,Headshot %_TeamB,First Kills_TeamB,First Deaths_TeamB,Kills - Deaths (FKD)_TeamB,Loadout Value_TeamA,Remaining Credits_TeamA,Loadout Value_TeamB,Remaining Credits_TeamB,2k_TeamA,3k_TeamA,4k_TeamA,5k_TeamA,1v1_TeamA,1v2_TeamA,1v3_TeamA,1v4_TeamA,1v5_TeamA,Econ_TeamA,Spike Plants_TeamA,Spike Defuses_TeamA,2k_TeamB,3k_TeamB,4k_TeamB,5k_TeamB,1v1_TeamB,1v2_TeamB,1v3_TeamB,1v4_TeamB,1v5_TeamB,Econ_TeamB,Spike Plants_TeamB,Spike Defuses_TeamB,Elimination_TeamA,Detonated_TeamA,Defused_TeamA,Time Expiry (No Plant)_TeamA,Eliminated_TeamA,Defused Failed_TeamA,Detonation Denied_TeamA,Time Expiry (Failed to Plant)_TeamA,Elimination_TeamB,Detonated_TeamB,Defused_TeamB,Time Expiry (No Plant)_TeamB,Eliminated_TeamB,Defused Failed_TeamB,Detonation Denied_TeamB,Time Expiry (Failed to Plant)_TeamB,Map_Abyss,Map_Ascent,Map_Bind,Map_Breeze,Map_Fracture,Map_Haven,Map_Icebox,Map_Lotus,Map_Pearl,Map_Split,Map_Sunset,Type_TeamA_Eco: 0-5k,Type_TeamA_Full buy: 20k+,Type_TeamA_Semi-buy: 10-20k,Type_TeamB_Eco: 0-5k,Type_TeamB_Full buy: 20k+,Type_TeamB_Semi-buy: 10-20k,Type_TeamB_Semi-eco: 5-10k
0,Valorant Champions 2021,Group Stage,Opening (D),Vision Strikers vs FULL SENSE,Vision Strikers,13,9,4,,FULL SENSE,5,2,3.0,,59:11,449.0,945.0,8272.0,51282.0,57948.0,2021.0,198.0,4050.0,1.234,220.2,15.0,10.8,5.4,4.2,0.834,155.8,0.344,2.0,1.6,0.4,0.786,176.8,10.8,15.0,5.0,-4.2,0.646,126.6,0.248,1.6,2.0,-0.4,18033.333333,8177.777778,15361.111111,4961.111111,12.0,4.0,1.0,0.0,0.0,2.0,1.0,0.0,0.0,60.8,10.0,2.0,12.0,2.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,53.4,4.0,1.0,9.0,2.0,2.0,0.0,4.0,0.0,1.0,0.0,4.0,0.0,1.0,0.0,9.0,2.0,2.0,0.0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0
1,Valorant Champions 2021,Group Stage,Opening (D),Vision Strikers vs FULL SENSE,Vision Strikers,13,9,4,,FULL SENSE,5,2,3.0,,44:30,449.0,945.0,8272.0,51282.0,57949.0,2021.0,198.0,4050.0,1.228,214.8,14.0,10.0,6.0,4.0,0.766,137.4,0.296,2.4,1.2,1.2,0.792,165.4,10.0,14.0,3.8,-4.0,0.68,116.6,0.348,1.2,2.4,-1.2,19711.111111,14011.111111,15150.0,4322.222222,12.0,5.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,61.8,11.0,1.0,7.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,48.2,3.0,2.0,9.0,3.0,1.0,0.0,3.0,0.0,2.0,0.0,3.0,0.0,2.0,0.0,9.0,3.0,1.0,0.0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0
2,Valorant Champions 2021,Group Stage,Opening (C),Team Vikings vs Crazy Raccoon,Team Vikings,13,6,7,,Crazy Raccoon,9,3,6.0,,59:48,449.0,945.0,8268.0,51278.0,57936.0,2021.0,420.0,277.0,1.154,233.2,18.6,15.0,6.2,3.6,0.81,152.6,0.19,2.6,1.8,0.8,0.836,196.6,15.0,18.6,7.6,-3.6,0.656,129.2,0.316,1.8,2.6,-0.8,19213.636364,8922.727273,16254.545455,4759.090909,19.0,5.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,57.4,11.0,3.0,13.0,3.0,1.0,0.0,3.0,1.0,0.0,0.0,0.0,49.8,6.0,5.0,7.0,2.0,3.0,1.0,2.0,1.0,5.0,1.0,2.0,1.0,5.0,1.0,7.0,2.0,3.0,1.0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0
3,Valorant Champions 2021,Group Stage,Opening (C),Team Vikings vs Crazy Raccoon,Team Vikings,13,6,7,,Crazy Raccoon,8,2,6.0,,52:48,449.0,945.0,8268.0,51278.0,57937.0,2021.0,420.0,277.0,1.232,213.0,16.8,11.8,6.6,5.0,0.8,145.8,0.31,2.8,1.4,1.4,0.816,172.0,11.8,16.8,7.4,-5.0,0.572,114.0,0.226,1.4,2.8,-1.4,19676.190476,11185.714286,16000.0,6566.666667,14.0,7.0,1.0,0.0,2.0,1.0,0.0,0.0,0.0,55.2,8.0,2.0,11.0,6.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,46.0,8.0,1.0,11.0,0.0,2.0,0.0,5.0,2.0,1.0,0.0,5.0,2.0,1.0,0.0,11.0,0.0,2.0,0.0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0
4,Valorant Champions 2021,Group Stage,Opening (D),FNATIC vs Cloud9,FNATIC,13,7,6,,Cloud9,11,6,5.0,,59:50,449.0,945.0,8272.0,51283.0,57951.0,2021.0,2593.0,188.0,1.194,222.0,18.8,14.6,6.2,4.2,0.794,146.6,0.182,3.2,1.6,1.6,0.828,181.8,14.6,18.8,6.6,-4.2,0.644,121.4,0.166,1.6,3.2,-1.6,18950.0,11825.0,17329.166667,4641.666667,23.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,64.2,9.0,6.0,16.0,2.0,0.0,0.0,2.0,1.0,0.0,0.0,0.0,43.6,11.0,4.0,7.0,0.0,6.0,0.0,6.0,1.0,4.0,0.0,6.0,1.0,4.0,0.0,7.0,0.0,6.0,0.0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0
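
In the output above, `Type_TeamA` produces three dummy columns while `Type_TeamB` produces four, because `Semi-eco: 5-10k` never occurs for Team A. If symmetric feature sets for both teams are wanted, one option is to give both columns the same category list before encoding. The sketch below uses toy values and is an illustration rather than the notebook's method.

```python
import pandas as pd

toy = pd.DataFrame({
    "Type_TeamA": ["Full buy: 20k+", "Eco: 0-5k"],
    "Type_TeamB": ["Semi-eco: 5-10k", "Full buy: 20k+"],
})
cols = ["Type_TeamA", "Type_TeamB"]

# Union of the levels seen in either column, ignoring missing values.
levels = sorted(pd.concat([toy[c] for c in cols]).dropna().unique())

for c in cols:
    toy[c] = pd.Categorical(toy[c], categories=levels)

# get_dummies emits one column per declared category, observed or not.
dummies = pd.get_dummies(toy[cols], dtype="int16")
print(dummies.columns.tolist())
```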


## Converting columns to their appropriate data types

In [None]:
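convert_columns = ['Team A Overtime Score','Team B Defender Score','Team B Overtime Score','Tournament ID','Stage ID','Match Type ID',
                   'Match ID','Game ID','Team ID_TeamA','Team ID_TeamB','Year']

# Several of these columns still contain missing values, and casting NaN to NumPy's
# 'int32' raises an error, so use pandas' nullable integer dtype 'Int32' instead
vct_2021_2024[convert_columns] = vct_2021_2024[convert_columns].astype('Int32')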


In [None]:
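# Prefix MM:SS strings with "00:" so every value parses as HH:MM:SS;
# the isinstance check lets missing durations pass through and become NaN
duration_hms = vct_2021_2024['Duration'].apply(lambda x: '00:' + x if isinstance(x, str) and len(x.split(':')) == 2 else x)

# Convert to minutes
vct_2021_2024['Duration'] = pd.to_timedelta(duration_hms).dt.total_seconds() / 60
vct_2021_2024['Duration'].head(5)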

In [None]:
vct_2021_2024.dtypes

In [None]:
vct_2021_2024.head(5)

## Feature Engineering
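
This section is still empty in the notebook. As a starting point, the sketch below derives the win/loss target from the map scores and builds one illustrative pre-game feature. Both helpers are hypothetical, and the feature assumes that `Game ID` increases chronologically, which has not been verified. The per-game statistics merged earlier describe the game being predicted, so a pre-match feature would need to be computed from earlier games only; `shift(1)` does that here. For brevity the feature only counts games in which the team appears as Team A.

```python
import pandas as pd

def add_target(games):
    """Add a binary target: 1 if Team A won the map, else 0."""
    out = games.copy()
    out["Team A Win"] = (out["Team A Score"] > out["Team B Score"]).astype("int8")
    return out

def add_prior_win_rate(games, order_col):
    """Team A's win rate over its earlier games as Team A (NaN for its first game)."""
    out = games.sort_values(order_col).copy()
    wins = out.groupby("Team A")["Team A Win"]
    # shift(1) keeps the current game's result out of its own feature.
    out["Team A Prior Win Rate"] = wins.transform(lambda s: s.shift(1).expanding().mean())
    return out

toy = pd.DataFrame({
    "Game ID": [1, 2, 3, 4],  # assumed to be in chronological order
    "Team A": ["X", "X", "X", "Y"],
    "Team A Score": [13, 5, 13, 13],
    "Team B Score": [9, 13, 11, 7],
})

features = add_prior_win_rate(add_target(toy), order_col="Game ID")
print(features[["Team A", "Team A Win", "Team A Prior Win Rate"]])
```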

## Saving cleaned and preprocessed dataset

In [None]:
vct_2021_2024.to_csv("vct_data/vct_2021_2024.csv", index=False)

# Exploratory Data Analysis (EDA)

- Perform univariate and bivariate analysis
- Visualize trends and patterns (line, histogram, scatter, etc.)
- Analyze correlations between different features
- Document findings and form hypotheses about the factors influencing betting outcomes
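
For the correlation step, a small sketch is shown here that ranks numeric columns by their absolute Pearson correlation with a target. `top_correlations` is a hypothetical helper, and the data is synthetic; in the notebook the target would be a win/loss column derived from the scores.

```python
import numpy as np
import pandas as pd

def top_correlations(df, target, n=5):
    """Numeric columns ranked by absolute Pearson correlation with `target`."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order).head(n)

# Synthetic data: `kd_diff` drives the outcome, `noise` does not.
rng = np.random.default_rng(0)
kd_diff = rng.normal(size=200)
toy = pd.DataFrame({
    "kd_diff": kd_diff,
    "noise": rng.normal(size=200),
    "win": (kd_diff + 0.1 * rng.normal(size=200) > 0).astype(int),
})

ranked = top_correlations(toy, "win")
print(ranked)
```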

# Model Selection and Training

- The focus is on predicting match outcomes (a classification task), so the candidate models are **Logistic Regression**, **Random Forest**, and **XGBoost**
- Perform hyperparameter tuning
- Consider ensemble methods or model stacking for improved accuracy
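
The plan above could start from something like the following scikit-learn sketch, which compares two of the listed models with 5-fold cross-validation and tunes one hyperparameter by grid search. The data is synthetic, the parameter grid is an arbitrary example, and XGBoost is left out to keep dependencies small; `xgboost.XGBClassifier` follows the same estimator interface and could be added to the `models` dictionary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and win/loss target.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Mean 5-fold accuracy per model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
print(scores)

# Grid search over the regularization strength of the logistic regression step;
# make_pipeline names each step after its lowercased class name.
search = GridSearchCV(
    models["logistic_regression"],
    param_grid={"logisticregression__C": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_)
```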

# Model Evaluation

- Evaluate model performance using the appropriate metrics
- Use cross-validation to assess the model's ability to generalize to new data
- Compare different models and select the one with the best trade-off between accuracy
and interpretability
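
Which metrics are "appropriate" is left open above. For a betting use case, probability quality matters as well as hit rate, so one reasonable set is accuracy, log loss, and the Brier score. The NumPy sketch below computes all three; `evaluate` is a hypothetical helper and the inputs are made-up numbers.

```python
import numpy as np

def evaluate(y_true, p_win, eps=1e-15):
    """Accuracy, log loss, and Brier score for predicted win probabilities."""
    y = np.asarray(y_true, dtype=float)
    # Clip so log(0) cannot occur for overconfident predictions.
    p = np.clip(np.asarray(p_win, dtype=float), eps, 1 - eps)
    return {
        "accuracy": float(np.mean((p >= 0.5) == (y == 1))),
        "log_loss": float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))),
        "brier": float(np.mean((p - y) ** 2)),
    }

metrics = evaluate([1, 0, 1, 1], [0.9, 0.2, 0.6, 0.4])
print(metrics)
```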

# Model Validation

- Validate the model on unseen data, such as recent matches or a test dataset
- Analyze the model's predictions and compare them to actual outcomes
- Test the model's robustness by simulating different betting scenarios (e.g., low-risk vs.
high-risk bets)
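
A betting scenario can be simulated in a few lines once predicted probabilities are available. The sketch below uses flat staking on Team A and places a bet only when the model's expected value exceeds a threshold. The decimal odds are hypothetical inputs, since the dataset described here contains no odds, and `simulate_flat_bets` is an illustrative helper. Raising or lowering `min_edge` is one simple way to model more or less selective strategies.

```python
import numpy as np

def simulate_flat_bets(p_win, decimal_odds, won, stake=1.0, min_edge=0.0):
    """Bet `stake` on Team A whenever the expected profit per unit exceeds `min_edge`."""
    p = np.asarray(p_win, dtype=float)
    odds = np.asarray(decimal_odds, dtype=float)
    won = np.asarray(won, dtype=bool)

    edge = p * odds - 1.0          # expected profit per unit staked
    placed = edge > min_edge
    # A winning bet returns stake * (odds - 1); a losing bet loses the stake.
    profit = np.where(won, stake * (odds - 1.0), -stake)[placed]
    return {"bets": int(placed.sum()), "profit": float(profit.sum())}

p_win = [0.7, 0.4, 0.6]            # model probabilities that Team A wins
odds = [1.6, 2.2, 1.5]             # hypothetical decimal odds on Team A
won = [True, False, True]          # actual outcomes

selective = simulate_flat_bets(p_win, odds, won)
loose = simulate_flat_bets(p_win, odds, won, min_edge=-0.15)
print(selective, loose)
```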