Name of Members: 

- __LATOSA, JOSE ROMULO NORIEGA__
- __RAMIREZ, BENMAR SIM GREFALDA__
- __VICTORIA, ALFRED EUGENE TAGLE__

## I. Introduction

League of Legends (LOL) is a MOBA (Multiplayer Online Battle Arena) that has been around since 2009 and is one of the most played games globally.

The dataset we chose for this machine project is the LOL Worlds 2021 Play-In Group Stats.

This notebook explores predictive modeling for team and individual performance metrics, aiming to predict win rates using various regression techniques such as Ridge Regression, Bagging Regressor, and Polynomial Regression. 

**Dataset Description:**
- **Features:** Individual metrics (e.g., KDA, Creep Score) and team metrics (e.g., Objectives Taken).
- **Target:** Win rate prediction.

**Methodology:**
1. **Data Preprocessing:** Address missing values and scale features.
2. **Modeling:** Compare Ridge Regression, Bagging Regressor, and Polynomial Regression.
3. **Evaluation:** Use Mean Squared Error (MSE) as the primary metric for comparison.

Our goal is to use the data from the LOL Worlds 2021 matches to predict the win rate of the teams and to determine which role is the most important for winning, based on the feature-engineered role performance metrics of the players.

We'll be utilizing machine learning and other techniques discussed in class to build up our predictive model.

## II. Description

The dataset has 220 instances (rows) and 20 features (columns). Each feature is a statistic of either the player or their team, aligned with the player's specific role, including the gold earned, objectives taken, and other important variables.

The data is the game stats for all matches in the League of Legends Worlds 2021 Play-in Groups.

The dataset states that its data was collected from lolesports.com, a website that shows in-depth statistics for each match, which allowed us to look for correlations between in-game statistics and wins.

### List of Variables

| **Variable Name** | **Description**|
|--------------------------------------|----------------|
|**Team** | Acronym Code of Team.|
|**Player** | Nametag of Player.|
|**Opponent** | Acronym code of opposing team in match.|
|**Position**| Position played by player in a match, 5 unique values.|
|**Champion**| Champion played by a player in a match.|
|**Kills**| Number of kills by player in a match.|
|**Deaths**| Number of deaths by a player in a match.|
|**Assists**| Number of assists by player in match.|
|**Creep Score**| Number of minions and monsters killed by player in a match.|
|**Gold Earned**| Gold earned by player in a match.|
|**Champion Damage Share** | Percentage of the team's total damage to champions that was dealt by player.|
|**Kill Participation** | Percentage of team kills that player was part of.|
|**Wards Placed** | Number of wards placed by player in match.|
|**Wards Destroyed** | Number of wards killed by player in match.|
|**Ward Interactions** | Sum of wards placed and wards killed by player in match.|
|**Dragons For** | Number of dragons team killed in match.|
|**Dragons Against** | Number of dragons opposing team killed in match.|
|**Barons For** | Number of Barons team killed in match.|
|**Barons Against** | Number of Barons opposing team killed in match.|
|**Result** | Win or Lose (W/L).|

## III. Modules

For this machine project, we will utilize the following Python libraries:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.inspection import permutation_importance
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline

## IV. Data Cleaning

We are about to undertake various preprocessing and data cleaning methods to improve the dataset's usability and suitability for our exploratory analysis and model training. 

This process ensures the dataset is ready for use by addressing missing values, duplicates, outliers, and other errors, thereby representing the data accurately.

In the code block below, we will be importing and reading the League of Legends 2021 Worlds Play-In matches CSV file.

In [None]:
worlds = pd.read_csv("LOL2021WORLDS.csv")
worlds.head()

**Checking for Missing Values**

The code identifies and displays the columns in our worlds DataFrame that have missing values. It first counts the total number of missing values in each column and lists the columns with any missing data. 

This step is crucial to ensure we address any gaps in the data before proceeding with further processing.

In [None]:
# Check each column for missing values
missing_values = worlds.isnull().sum()

columns_with_missing_values = missing_values[missing_values > 0]
print(columns_with_missing_values)

**Checking for Duplicates**

This code identifies and removes duplicate rows in the worlds DataFrame. It finds all duplicate rows, separates them into their own DataFrame, and prints them. Duplicates are then removed from the original DataFrame, keeping only the first occurrence.

In [None]:
# Find the duplicates
duplicate_uuids = worlds.duplicated(keep=False)

# Create a DataFrame for duplicates
duplicates = worlds[duplicate_uuids]

# Display the rows with duplicates
print("Duplicate Entries:\n", duplicates)

# Removing the duplicates if detected in DataFrame
if not duplicates.empty:
    worlds = worlds.drop_duplicates(keep='first')
    print(f"Duplicates removed. Dataset has {worlds.shape[0]} rows.")
else:
    print("No duplicates found.")


**Other Cleaning Processes**

The first process replaces infinite values with NaN, which marks them as undefined data. The second and third processes drop any columns and rows with more than 50% of their values missing.

In [None]:
# Replacing any erroneous infinite values
worlds = worlds.replace([np.inf, -np.inf], np.nan)

# Dropping columns with more than 50% missing values
worlds = worlds.dropna(thresh=len(worlds)*0.5, axis=1)

# Dropping rows with more than 50% missing values 
worlds = worlds.dropna(thresh=worlds.shape[1]*0.5, axis=0)
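
To make the `thresh` argument concrete, here is a minimal sketch on a made-up DataFrame. The column names `a`, `b`, and `c` are hypothetical and not part of the dataset, and the thresholds are cast to `int` here as an assumption for clarity. `thresh` is the minimum number of non-missing values a column or row needs in order to be kept.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): column 'c' is mostly missing, the last row is fully missing
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, np.nan, np.nan],
    "b": [4.0, np.inf, 6.0, 7.0, np.nan],
    "c": [np.nan, np.nan, np.nan, 1.0, np.nan],
})

# Infinite values become NaN so they are counted as missing
toy = toy.replace([np.inf, -np.inf], np.nan)

# Keep columns that have at least 50% non-missing values
toy = toy.dropna(thresh=int(len(toy) * 0.5), axis=1)

# Keep rows that have at least 50% non-missing values
toy = toy.dropna(thresh=int(toy.shape[1] * 0.5), axis=0)

print(toy)
```

In this toy case, column `c` and the fully empty last row are dropped, while rows with a single missing value survive.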

**Data Type Verification**

Ensuring the values in our dataset are of the correct data types is essential for effective data analysis and model training. Fortunately, our current dataset has appropriate data types for all variables and their respective values.

The importance of correct data types lies in achieving the following:

- **Data Integrity**: Ensures the consistency and accuracy of data.
- **Accuracy**: Provides precise and reliable data for analysis.
- **Efficiency**: Facilitates optimal processing and analysis speeds.
- **Compatibility**: Enhances compatibility with various data processing tools and methods.
- **Reliability**: Guarantees dependable results in analysis and modeling.

With these aspects in place, we can confidently proceed with our data analysis and model training.

In [None]:
# Printing all of the present data types
print("All data types present:\n", worlds.dtypes)

### Outlier Detection and Treatment

Outliers can significantly impact statistical measures and model performance. In the context of our dataset, outliers might represent unusual game conditions, errors in data collection, or truly exceptional events. 

An example of a truly exceptional event would be C9 Sneaky dropping 50 kills in a single game (insane btw). A result like that is highly unlikely in any other game and would be considered an extreme outlier.

In this dataset, varying values like kills, deaths, gold earned, and objectives taken are all subject to the team's playstyle and ability.

**Our Approach**

We conducted winsorization on the dataset to ensure its robustness against potential future outliers, particularly as the dataset expands. By capping the data at specified percentiles, we mitigate the influence of these extreme values.

Additionally, we presented visualizations of the data before and after outlier treatment to identify any existing outliers.

First, let's establish which columns we're going to check for outliers.

In [None]:
# Columns to check for outliers 
outlier_columns = ['Kills', 'Deaths', 'Assists', 'Creep Score', 
                   'Gold Earned', 'Champion Damage Share', 'Kill Participation', 
                   'Wards Placed', 'Wards Destroyed', 'Ward Interactions', 
                   'Dragons For', 'Dragons Against', 'Barons For', 'Barons Against']

After that, let's just make a copy of the worlds DataFrame to showcase the data before outlier treatment.

In [None]:
# Create a copy of the data before winsorization
worlds_before_outliers = worlds.copy()

We can then start the winsorization process by creating a function that caps values at specific percentiles. Since the game has a lot of possibilities, we can't assume reasonable limits for the features based on domain knowledge (because you never know what'll happen in a League game). Instead, we use extreme percentiles in our winsorization for a more forgiving threshold. This allows us to handle uncertainty in our data and keep data distortion to a minimum. We'll use a lower percentile of 1 and an upper percentile of 99.

In [None]:
# Winsorizes data by capping values at specified percentiles 
def winsorize(data, lower_percentile=1, upper_percentile=99): 
    lower_bound = np.percentile(data, lower_percentile) 
    upper_bound = np.percentile(data, upper_percentile) 
    return np.clip(data, lower_bound, upper_bound)

Now we apply winsorization using *winsorize* to the specified columns.

In [None]:
# Apply winsorization to the specified columns 
for column in outlier_columns: 
    worlds[column] = winsorize(worlds[column], lower_percentile=1, upper_percentile=99)
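
As a quick illustration of what the capping does, the sketch below runs the same logic on a made-up array. The `kills` values are hypothetical and chosen only so that the percentiles are easy to read.

```python
import numpy as np

def winsorize(data, lower_percentile=1, upper_percentile=99):
    # Same capping logic as the function defined above
    lower_bound = np.percentile(data, lower_percentile)
    upper_bound = np.percentile(data, upper_percentile)
    return np.clip(data, lower_bound, upper_bound)

# Hypothetical kill counts: 0 through 99 plus one extreme value
kills = np.append(np.arange(100), 1000)
capped = winsorize(kills)

# The extreme value is pulled down to the 99th percentile
print(kills.max(), capped.max())
# No rows are removed; values are only capped
print(len(kills) == len(capped))
```

Unlike dropping rows, winsorization keeps every observation, which matters here because each game needs all 10 player rows.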

Next, we define a helper that truncates data exceeding an upper bound.

In [None]:
# Truncates data by removing values exceeding the upper bound
def truncate(data, upper_bound):
    return data[data <= upper_bound]

Let's visualize the distributions before and after winsorization. We'll use a few key features: `Kills`, `Deaths`, `Assists`, and `Gold Earned`.

In [None]:
# List of features to visualize
features_to_visualize = ['Kills', 'Deaths', 'Assists', 'Gold Earned']

# Visualize before and after winsorization
for feature in features_to_visualize:
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    
    # Before Winsorization
    sns.boxplot(data=worlds_before_outliers, x=feature, ax=axes[0], color='skyblue')
    axes[0].set_title(f'{feature} - Before Winsorization')
    
    # After Winsorization
    sns.boxplot(data=worlds, x=feature, ax=axes[1], color='lightgreen')
    axes[1].set_title(f'{feature} - After Winsorization')
    
    plt.show()

for feature in features_to_visualize:
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    
    # Before Winsorization
    sns.histplot(data=worlds_before_outliers, x=feature, kde=True, ax=axes[0], color='skyblue')
    axes[0].set_title(f'{feature} - Before Winsorization')
    
    # After Winsorization
    sns.histplot(data=worlds, x=feature, kde=True, ax=axes[1], color='lightgreen')
    axes[1].set_title(f'{feature} - After Winsorization')
    
    plt.show()

Now that we've done outlier treatment, it's time to divide the rows by their games. Every 10 rows in the dataset correspond to 1 game, with 10 players and 2 teams. Dividing them will allow us to measure stats per game.

In [None]:
# Reset index to ensure it's sequential 
worlds.reset_index(drop=True, inplace=True)
# Assign GameID 
worlds['GameID'] = (worlds.index // 10) + 1
# Group by GameID and collect the unique teams involved in each game 
game_teams = worlds.groupby('GameID')['Team'].unique().reset_index()
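
The integer-division trick relies on the rows being in game order with exactly 10 rows per game. A minimal sketch of how that assumption could be checked is shown below, using a made-up frame with hypothetical team codes.

```python
import pandas as pd

# Hypothetical frame with 20 player rows, i.e. two complete games
toy = pd.DataFrame({"Team": ["AAA"] * 5 + ["BBB"] * 5 + ["CCC"] * 5 + ["DDD"] * 5})

# Rows 0-9 map to game 1, rows 10-19 map to game 2
toy["GameID"] = (toy.index // 10) + 1

# Sanity checks: every game should have 10 rows and exactly 2 teams
rows_per_game = toy.groupby("GameID").size()
teams_per_game = toy.groupby("GameID")["Team"].nunique()
print(rows_per_game)
print(teams_per_game)
```

If either check fails on the real data (for example after duplicate rows are dropped), the game boundaries would be shifted and the per-game statistics would mix players from different matches.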

Let's make a function to extract the two teams so that, for each GameID, we can see which teams are facing each other.

In [None]:
# Function to extract the two teams 
def extract_teams(team_list): 
    unique_teams = list(team_list) 
    if len(unique_teams) == 2: 
        return pd.Series({'Team1': unique_teams[0], 'Team2': unique_teams[1]}) 
    else: 
        # Handle cases where there are not exactly two teams 
        return pd.Series({'Team1': unique_teams[0], 'Team2': unique_teams[0]})

Now we can apply this into the 'Team' column and drop it to merge game_teams back into the original DataFrame.

In [None]:
# Apply the function to the 'Team' column 
game_teams[['Team1', 'Team2']] = game_teams['Team'].apply(extract_teams) 
# Drop the original 'Team' column as it's no longer needed 
game_teams.drop('Team', axis=1, inplace=True) 
# Merge the game_teams back into the original 'worlds' DataFrame 
worlds = worlds.merge(game_teams, on='GameID', how='left')

Let's observe the sample.

In [None]:
# Display the first 20 rows to verify 
print(worlds[['GameID', 'Team1', 'Team2']].head(20))

### Data Transformation and Feature Engineering

We aim to calculate the `winrate` of teams based on both individual and team metrics.

**Individual Metrics**

`Kills, Deaths, Assists (KDA)`: This is the ratio of a player's kills plus assists to their deaths, which helps us measure the player's overall performance.

`Kill Participation Difference (KPD)`: This is the calculation of the difference between a player's KP and the average KP for their role, which provides insight into their performance.

`Damage Share Difference (DSD)`: This is the calculation of the difference between a player's Damage Share and the average DS for their role. This will show how much of an impact a player has in terms of dealing damage in comparison to their teammates.

We'll be using this with the Creep Score and Gold Earned.

**Team-based Metrics**

`Team Objectives`: This is the total count of epic monsters, such as Elemental Dragons and Baron Nashor, taken by the team in a specific game. Neither objective can be soloed except under very specific conditions, especially in the meta LOL was under in 2021. This emphasizes team play, so the count is essential in measuring team-play performance. We will be using Barons For and Dragons For.

`Team Vision Score`: This is a measure of vision play in a specific game. It uses the variables Wards Placed, Wards Destroyed, and Ward Interactions.

`Team Variability`: This is a measure of how evenly or unevenly the damage output is distributed between team members. It can help us understand whether the team's playstyle focuses more on specific members (like the damage dealers, with Bottom and Middle doing most of the work) or on a well-rounded utility comp that does well in teamfights.

These features will be used with Team Total Assists.

We can then calculate KDA by dividing Kills + Assists by Deaths. In the event of a zero (0) death game, the death count is replaced with 1 to avoid division by zero.

In [None]:
# Calculate KDA
worlds['KDA'] = (worlds['Kills'] + worlds['Assists']) / worlds['Deaths'].replace(0, 1)
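
A small worked example may help show how the zero-death replacement behaves. The stat lines below are hypothetical.

```python
import pandas as pd

# Hypothetical stat lines: the first player has a deathless game
toy = pd.DataFrame({"Kills": [5, 2], "Deaths": [0, 4], "Assists": [10, 6]})

# Zero deaths are treated as one death to avoid division by zero
toy["KDA"] = (toy["Kills"] + toy["Assists"]) / toy["Deaths"].replace(0, 1)
print(toy)
```

The first player gets (5 + 10) / 1 = 15.0 and the second gets (2 + 6) / 4 = 2.0, so a deathless game is scored the same as a one-death game with identical kills and assists.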

We calculate the Role Averages here to get an idea of standard performance of players. If a player has a KP or DS lower than the standard it just means they are underperforming, and if it's above average they are succeeding.

In [None]:
# Calculate Role Averages
individual_metrics = ['Creep Score', 'KDA', 'Gold Earned', 'Kill Participation', 'Champion Damage Share']
role_averages = worlds.groupby('Position')[individual_metrics].mean().reset_index()

# Rename columns for merging
role_averages.rename(columns={
    'Creep Score': 'Avg CS',
    'KDA': 'Avg KDA',
    'Gold Earned': 'Avg Gold Earned',
    'Kill Participation': 'Avg KP',
    'Champion Damage Share': 'Avg Damage Share'
}, inplace=True)

# Merge role averages into worlds
worlds = pd.merge(worlds, role_averages, on='Position', how='left')

# Print Role Averages
print("Role Averages:")
print(role_averages)

In this code block, we also calculate the KPD and DSD. The KPD is Kill Participation - Avg KP. The DSD follows the same process, using Champion Damage Share - Avg Damage Share.

In [None]:
# Calculate KPD
worlds['KP Difference'] = worlds['Kill Participation'] - worlds['Avg KP'] 

# Calculate DSD
worlds['Damage Share Difference'] = worlds['Champion Damage Share'] - worlds['Avg Damage Share']
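
For comparison, the same "value minus role average" idea can be sketched without a merge by using `groupby(...).transform('mean')`. This is an alternative formulation shown on made-up numbers, not the approach used in the cells above.

```python
import pandas as pd

# Hypothetical kill participation values for two roles
toy = pd.DataFrame({
    "Position": ["Top", "Top", "Support", "Support"],
    "Kill Participation": [0.50, 0.70, 0.60, 0.80],
})

# transform('mean') broadcasts each role's average back onto its own rows
role_avg = toy.groupby("Position")["Kill Participation"].transform("mean")
toy["KP Difference"] = toy["Kill Participation"] - role_avg
print(toy)
```

Within each role the differences sum to zero, so a positive value always means "above the role average" and a negative value means "below it".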

Now we can calculate the team metrics: first `team_objectives`, then `team_vision`, then a simple computation of the total team assists, and finally `team_variability` as the standard deviation of damage share within each team.

In [None]:
# Calculate Team Objectives Taken 
team_objectives = worlds.groupby(['GameID', 'Team']).agg({ 
    'Dragons For': 'max', 
    'Barons For': 'max' 
    }).reset_index() 
team_objectives['Objectives Taken'] = team_objectives['Dragons For'] + team_objectives['Barons For'] 
team_objectives = team_objectives[['GameID', 'Team', 'Objectives Taken']] 

# Calculate Team Vision Score 
team_vision = worlds.groupby(['GameID', 'Team']).agg({ 
    'Wards Placed': 'sum', 
    'Wards Destroyed': 'sum', 
    'Ward Interactions': 'sum' }).reset_index() 
team_vision['Vision Score'] = team_vision['Wards Placed'] + team_vision['Wards Destroyed'] + team_vision['Ward Interactions'] 
team_vision = team_vision[['GameID', 'Team', 'Vision Score']] 

# Calculate Team Total Assists 
team_assists = worlds.groupby(['GameID', 'Team'])['Assists'].sum().reset_index() 
team_assists.rename(columns={'Assists': 'Team Assists'}, inplace=True)

# Calculate Team Variability (Damage Share STD)
team_variability = worlds.groupby(['GameID', 'Team'])['Champion Damage Share'].std().reset_index()
team_variability.rename(columns={'Champion Damage Share': 'Damage Share STD'}, inplace=True)
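
To give a feel for what the Damage Share STD captures, the sketch below compares two made-up teams. The team codes `BAL` and `CARRY` and their damage shares are hypothetical.

```python
import pandas as pd

# Hypothetical damage shares for two five-player teams (each team sums to 1.0)
toy = pd.DataFrame({
    "Team": ["BAL"] * 5 + ["CARRY"] * 5,
    "Champion Damage Share": [0.2, 0.2, 0.2, 0.2, 0.2,
                              0.5, 0.2, 0.1, 0.1, 0.1],
})

# pandas uses the sample standard deviation (ddof=1) by default
damage_std = toy.groupby("Team")["Champion Damage Share"].std()
print(damage_std)
```

The evenly split team has a standard deviation of essentially zero, while the team built around one carry has a clearly higher value (about 0.17).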

Since we've calculated our newly engineered features, it's time to merge them into the worlds DataFrame.

In [None]:
# Merge team objectives into worlds
worlds = pd.merge(worlds, team_objectives, on=['GameID', 'Team'], how='left')
worlds = pd.merge(worlds, team_vision, on=['GameID', 'Team'], how='left')
worlds = pd.merge(worlds, team_assists, on=['GameID', 'Team'], how='left')
worlds = pd.merge(worlds, team_variability, on=['GameID', 'Team'], how='left')

# Display the DataFrame after merging team objectives
print("After merging Team Objectives:")
print(worlds[['GameID', 'Player', 'Team', 'Objectives Taken', "Vision Score", "Team Assists", "Damage Share STD"]].head(10))


Now that we have established all of our newly-engineered features, we can now use a MinMaxScaler to scale all of our metrics and features. Scaling the features ensures that all of them can contribute equally to our machine learning models.

A MinMaxScaler can scale our features to a range between 0 and 1.

In [None]:
# List of features to scale
features_to_scale = [
    'Creep Score', 'KDA', 'Gold Earned', 'KP Difference', 
    'Damage Share Difference', 'Objectives Taken', 
    'Vision Score', 'Team Assists', 'Damage Share STD'
]

# Initialize the scaler
scaler = MinMaxScaler()

# Fit and transform the features
worlds[features_to_scale] = scaler.fit_transform(worlds[features_to_scale])

# Display the first few rows of the scaled features
print(worlds[features_to_scale].head())
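
Min-max scaling with the default range of 0 to 1 amounts to computing `(x - min) / (max - min)` for each column. A minimal sketch of that formula on made-up gold values:

```python
import numpy as np

# Hypothetical gold values for four players
gold = np.array([8000.0, 10000.0, 12000.0, 16000.0])

# Min-max scaling: x' = (x - min) / (max - min)
scaled = (gold - gold.min()) / (gold.max() - gold.min())

# The smallest value maps to 0 and the largest maps to 1
print(scaled)
```

Because the result depends only on the column's minimum and maximum, a single extreme value can compress all other values toward 0, which is one reason the outlier treatment comes before scaling.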

As we've already finished outlier treatment and feature engineering, it's time to decide which features we should drop. It's important to identify any features we have that won't contribute significantly to our models.

In [None]:
# Calculate the correlation matrix
corr_matrix = worlds[features_to_scale].corr().abs()

# Find highly correlated features (correlation > 0.9)
upper_triangle = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Identify columns with correlation greater than 0.9
high_corr_pairs = [(column, row) for column in upper_triangle.columns for row in upper_triangle.index if upper_triangle.loc[row, column] > 0.9]

print("Highly correlated feature pairs (correlation > 0.9):")
for pair in high_corr_pairs:
    print(f"{pair[0]} and {pair[1]}: correlation = {corr_matrix.loc[pair[1], pair[0]]:.2f}")
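
The upper-triangle mask is what keeps each pair from being reported twice (and keeps a feature from being paired with itself). A small sketch on made-up columns `a`, `b`, and `c`, all hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical features: 'b' is almost a copy of 'a', 'c' is unrelated
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [1.1, 2.0, 3.2, 3.9, 5.1],
    "c": [2.0, 1.0, 2.0, 1.0, 2.0],
})

corr = toy.corr().abs()

# Keep only the cells above the diagonal so each pair appears once
mask = np.triu(np.ones(corr.shape), k=1).astype(bool)
upper = corr.where(mask)

# Cells outside the mask are NaN, and NaN > 0.9 evaluates to False
pairs = [(row, col) for col in upper.columns for row in upper.index
         if upper.loc[row, col] > 0.9]
print(pairs)
```

Only the `a`/`b` pair is flagged here; in practice one feature from such a pair would be a candidate for removal.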


We calculated the correlation matrix of the engineered features in the worlds DataFrame to check whether any pairs are highly correlated. Strong correlations suggest redundancy in those features.

## Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a critical step in any data analysis or machine learning project. It enables us to understand the data's characteristics, identify patterns and relationships, detect anomalies, inform feature engineering, and validate our assumptions. For instance, performing EDA on the 'worlds' dataset will provide insights into the individual and team performance metrics that influence the teams' win rates. It will also shed light on team dynamics based on damage share and role averages, helping us understand the criteria for underperforming and overperforming players. These insights will guide our choice of models and assist us in engineering features that enhance our model's predictive capabilities.

Role averages provide a benchmark for assessing individual player performances relative to their roles. They show the typical performance metrics for each position, helping identify overperforming and underperforming players.

### Numerical Data

In [None]:
# Display role averages 
print("Role Averages:") 
print (role_averages)

Key observations here are that ADCs have the highest average CS and Gold Earned, which is a direct reflection of their role as primary damage dealers who rely on farming. They also contribute to the team's damage output the most with strong kill participation rates as well.

The support position focuses on utility rather than farming and has the lowest creep score and gold earned. The jungle position shows moderate values across these metrics, since it impacts the game mainly through objectives.

In [None]:
# Extract unique combinations of GameID, Team, and scaled Damage Share STD
team_variability_scaled = worlds[['GameID', 'Team', 'Damage Share STD']].drop_duplicates()

# Display the scaled Damage Share STD per team per game
print("\nScaled Damage Share STD per Team per Game:")
print(team_variability_scaled)

The Damage Share STD measures the variability in damage contributions within a team. Higher values indicate greater reliance on specific players, while lower values suggest a more balanced distribution.

In [None]:
# Find the team with the highest scaled Damage Share STD
highest_variability = team_variability_scaled.loc[team_variability_scaled['Damage Share STD'].idxmax()]
print("\nTeam with Highest Scaled Damage Share STD:")
print(highest_variability)

# Find the team with the lowest scaled Damage Share STD
lowest_variability = team_variability_scaled.loc[team_variability_scaled['Damage Share STD'].idxmin()]
print("\nTeam with Lowest Scaled Damage Share STD:")
print(lowest_variability)

**High Damage Share STD**: Teams with high values rely heavily on specific players for damage, which can be effective but risky if those players are neutralized.

**Low Damage Share STD**: Teams with low values have a balanced approach, making them more resilient to opponents targeting specific players.

In [None]:
# Extract the first game
first_game_id = worlds['GameID'].min()
first_game = worlds[worlds['GameID'] == first_game_id]

# Display the first game's engineered features
engineered_features = ['KDA', 'KP Difference', 'Damage Share Difference', 'Team Assists', 'Vision Score', 'Damage Share STD']
print(f"\nFeatures of the First Game (Game ID {first_game_id}):")
print(first_game[engineered_features])


### Histograms

These charts show the distribution of four variables: KDA (Kill/Death/Assist), KP Difference (Kill Participation Difference), Damage Share Difference, and Damage Share STD.

The histograms represent the frequency of values in specific ranges, while the overlaid curves are kernel density estimates (KDE) that approximate their probability density functions (PDFs).

In [None]:
# Engineered features for histograms
engineered_features = ['KDA', 'KP Difference', 'Damage Share Difference', 'Damage Share STD']

# Plot histograms for each engineered feature
for feature in engineered_features:
    plt.figure(figsize=(8, 4))
    sns.histplot(worlds[feature], kde=True)
    plt.title(f'Distribution of {feature}')
    plt.xlabel(feature)
    plt.ylabel('Frequency')
    plt.show()

In the KDA chart, the distribution is heavily right-skewed, with the majority of values concentrated around lower ranges, indicating most players have low KDA ratios. 

In contrast, the KP Difference, Damage Share Difference, and Damage Share STD distributions are more symmetrical, resembling a normal distribution centered around 0.5. This suggests a balanced spread of differences, with fewer extreme deviations. 


### Scatterplots

`X-axis (KP Difference)`: Represents the difference between a player's Kill Participation (KP) and the average KP for their role. Higher values indicate that the player is involved in more kills relative to their role average.

`Y-axis (Damage Share Difference)`: Represents the difference between a player's Champion Damage Share and the average Damage Share for their role. Higher values indicate that the player contributes more damage relative to their role average.

In [None]:
# Scatter plot of KP Difference vs. Damage Share Difference for the first game
plt.figure(figsize=(8, 6))
sns.scatterplot(x='KP Difference', y='Damage Share Difference',
                data=first_game, hue='Position', style='Team', s=100)
plt.title('KP Difference vs. Damage Share Difference (First Game)')
plt.xlabel('KP Difference')
plt.ylabel('Damage Share Difference')
plt.legend(title='Position', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

`Red Circle at the Top Left`: Indicates that the ADC from team UOL has a high Damage Share Difference but a lower KP Difference, suggesting this player deals a lot of damage compared to their role average but is less involved in kills.

`Blue Cross at the Bottom Right`: Indicates that the Top player from team GS has a high KP Difference but a lower Damage Share Difference, suggesting this player is highly involved in kills but contributes less damage relative to their role average.

## Model Training

Model training is a crucial step in the machine learning pipeline. It involves learning the underlying patterns in the training data so that we can make predictions on unseen data. The goal is to find a model that generalizes well, meaning it accurately predicts outcomes on new, unseen data based on the patterns it learned from the training data.

In this project, we trained several types of regression models on our 'worlds' dataset:

1. **Ridge Regression**: This is a type of linear regression that adds a regularization term, known as the L2 penalty, to the loss function. The penalty constrains the size of the model's coefficients, which helps prevent overfitting.
2. **Bagging with Regression Trees (Decision Trees)**: A regression tree breaks our dataset down into progressively smaller subsets while the tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes, which provides a clear interpretation of why the model is making certain predictions. A Bagging Regressor trains many such trees on bootstrap samples of the training data and averages their predictions to reduce variance.
3. **Polynomial Regression**: This is a type of regression that models the relationship between the input variable (x) and the output variable (y) as an nth-degree polynomial. Polynomial regression can model relationships between variables that aren't linear and can fit curved trends in the data.
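
To illustrate how the L2 penalty in Ridge Regression constrains coefficients, here is a minimal NumPy sketch of the closed-form ridge solution on synthetic data. The helper `ridge_closed_form` is hypothetical, and it omits the intercept term for brevity, so it is a simplification rather than a reproduction of scikit-learn's `Ridge`.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    # Solves (X^T X + alpha * I) w = X^T y, with no intercept term
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic data (hypothetical): y depends linearly on two features plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=50)

# Larger alpha values shrink the coefficient vector toward zero
for alpha in (0.0, 1.0, 100.0):
    w = ridge_closed_form(X, y, alpha)
    print(alpha, w, np.linalg.norm(w))
```

With `alpha=0.0` this reduces to ordinary least squares, and the coefficient norm decreases as `alpha` grows.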

Let's start off by creating a binary win column just to calculate the raw winrate for each team. We'll use the groupby function to do this and merge back the winrate for each team. We can now drop the Win column and observe the stats and winrate of each of the members of that team.

In [None]:
# Create a binary win column for individual matches
worlds['Win'] = worlds['Result'].apply(lambda x: 1 if x == 'W' else 0)

# Calculate winrate for each team
team_winrate = worlds.groupby('Team')['Win'].mean().reset_index()
team_winrate.rename(columns={'Win': 'Winrate'}, inplace=True)

# Merge winrate back into the worlds DataFrame
worlds = worlds.merge(team_winrate, on='Team', how='left')

# Drop the binary win column as it's not needed for regression
worlds.drop(columns=['Win'], inplace=True)

# Display the first few rows with the new winrate column
print("Data with Winrate Column:")
print(worlds.head())
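
The win rate is simply the mean of a 0/1 win indicator per team. A small sketch with made-up results (team codes `AAA` and `BBB` are hypothetical, with one row per team per game for brevity):

```python
import pandas as pd

# Hypothetical results: AAA wins 3 of 4 games, BBB loses both of its games
toy = pd.DataFrame({
    "Team": ["AAA", "AAA", "AAA", "AAA", "BBB", "BBB"],
    "Result": ["W", "W", "L", "W", "L", "L"],
})

# Map W/L to 1/0; the mean of the indicator is the win rate
toy["Win"] = (toy["Result"] == "W").astype(int)
winrate = toy.groupby("Team")["Win"].mean()
print(winrate)
```

Because every player row of a team in a given game carries the same result, averaging over player rows gives the same win rate as averaging over games.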

Then we aggregated individual metrics at the team level per game to summarize each team's performance using various statistical measures, ensuring a comprehensive dataset for analysis. 

It seemed best to preserve the original Damage Share STD values to maintain accurate variability information. 

In [None]:
# Aggregating individual metrics at the team level (per game)
team_aggregated = worlds.groupby(['GameID', 'Team']).agg({
    'KDA': 'mean',
    'KP Difference': 'mean',
    'Damage Share Difference': 'mean',
    'Creep Score': 'sum',
    'Gold Earned': 'sum',
    'Objectives Taken': 'mean',
    'Vision Score': 'mean',
    'Team Assists': 'mean',
    'Damage Share STD': 'first',
}).reset_index()

# Merge winrate to the aggregated DataFrame 
team_aggregated = team_aggregated.merge(team_winrate, on='Team', how='left')

Then we define the individual and team metrics, set the team `Winrate` as the target, and split the data into training and testing sets for each group of metrics.

In [None]:
# Define individual and team metrics
individual_metrics = [
    'KDA', 'KP Difference', 'Damage Share Difference', 'Creep Score', 
    'Gold Earned'
]
team_metrics = [
    'Objectives Taken', 'Vision Score', 
    'Team Assists', 'Damage Share STD'
]

# Target variable: each team's overall winrate
y = team_aggregated['Winrate']

# Split the data into training and testing sets for individual metrics
X_individual = team_aggregated[individual_metrics]
X_train_ind, X_test_ind, y_train, y_test = train_test_split(X_individual, y, test_size=0.2, random_state=42)

# Split the data into training and testing sets for team metrics
X_team = team_aggregated[team_metrics]
X_train_team, X_test_team, y_train, y_test = train_test_split(X_team, y, test_size=0.2, random_state=42)

# Display the first few rows of the training data for both metrics
print("Training Data (Individual Metrics):")
print(X_train_ind.head())
print("\nTraining Data (Team Metrics):")
print(X_train_team.head())
print("\nTraining Target:")
print(y_train.head())

Now that we've got the data ready for training, we can begin training our machine learning models.

### Ridge Regression

Initialize and train the model, then make predictions, and print out the MSE.

In [None]:
# Individual Metrics
ridge_model_ind = Ridge(alpha=1.0)
ridge_model_ind.fit(X_train_ind, y_train)
ridge_predictions_ind = ridge_model_ind.predict(X_test_ind)
ridge_mse_ind = mean_squared_error(y_test, ridge_predictions_ind)

# Team Metrics
ridge_model_team = Ridge(alpha=1.0)
ridge_model_team.fit(X_train_team, y_train)
ridge_predictions_team = ridge_model_team.predict(X_test_team)
ridge_mse_team = mean_squared_error(y_test, ridge_predictions_team)

print(f"Ridge Regression MSE (Individual Metrics): {ridge_mse_ind:.4f}")
print(f"Ridge Regression MSE (Team Metrics): {ridge_mse_team:.4f}")

#### Error Analysis and Visualizations

With the data cleaned and preprocessed, the model should be better able to capture the relationships in the dataset, which is reflected in low MSE and RMSE values. RMSE is the square root of MSE, so it expresses the error in the same units as the win rate.

In [None]:
# Calculate RMSE for Ridge Regression
ridge_rmse_ind = np.sqrt(ridge_mse_ind)
ridge_rmse_team = np.sqrt(ridge_mse_team)
print(f"Ridge Regression RMSE (Individual Metrics): {ridge_rmse_ind:.4f}")
print(f"Ridge Regression RMSE (Team Metrics): {ridge_rmse_team:.4f}")
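
For reference, the relationship between MSE and RMSE can be worked through by hand. The actual and predicted win rates below are hypothetical.

```python
import numpy as np

# Hypothetical actual and predicted win rates
y_true = np.array([0.50, 0.75, 0.25])
y_pred = np.array([0.50, 0.50, 0.50])

# MSE is the average squared error
mse = np.mean((y_true - y_pred) ** 2)
# RMSE brings the error back to win-rate units
rmse = np.sqrt(mse)
print(f"MSE = {mse:.4f}, RMSE = {rmse:.4f}")
```

Here the squared errors are 0, 0.0625, and 0.0625, so the MSE is about 0.0417 and the RMSE is about 0.204, meaning the predictions are off by roughly 20 percentage points of win rate on average.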

Plot residuals (difference between actual and predicted values) to identify patterns and challenging instances.

In [None]:
# Residuals for Ridge Regression
ridge_residuals_ind = y_test - ridge_predictions_ind
ridge_residuals_team = y_test - ridge_predictions_team

# Plot residuals for Ridge Regression (Individual Metrics)
plt.scatter(ridge_predictions_ind, ridge_residuals_ind)
plt.axhline(y=0, color='r', linestyle='-')
plt.title('Residual Plot for Ridge Regression (Individual Metrics)')
plt.xlabel('Predicted Winrate')
plt.ylabel('Residuals')
plt.show()

# Plot residuals for Ridge Regression (Team Metrics)
plt.scatter(ridge_predictions_team, ridge_residuals_team)
plt.axhline(y=0, color='r', linestyle='-')
plt.title('Residual Plot for Ridge Regression (Team Metrics)')
plt.xlabel('Predicted Winrate')
plt.ylabel('Residuals')
plt.show()

The table below shows the actual win rates, the predicted win rates, and the absolute residuals (errors) for the test instances, sorted so that the most challenging instances, where the prediction is furthest from the actual value, come first. These are pain points that hyperparameter tuning could potentially improve.

In [None]:
# Actual vs Predicted for Ridge Regression (Individual Metrics)
ridge_results_ind = pd.DataFrame({'Actual': y_test, 'Predicted': ridge_predictions_ind})
ridge_results_ind['Residual'] = np.abs(ridge_results_ind['Actual'] - ridge_results_ind['Predicted'])
ridge_difficult_instances_ind = ridge_results_ind.sort_values(by='Residual', ascending=False)
print("Ridge Regression (Individual Metrics) - Most Challenging Instances:")
print(ridge_difficult_instances_ind.head())

# Actual vs Predicted for Ridge Regression (Team Metrics)
ridge_results_team = pd.DataFrame({'Actual': y_test, 'Predicted': ridge_predictions_team})
ridge_results_team['Residual'] = np.abs(ridge_results_team['Actual'] - ridge_results_team['Predicted'])
ridge_difficult_instances_team = ridge_results_team.sort_values(by='Residual', ascending=False)
print("Ridge Regression (Team Metrics) - Most Challenging Instances:")
print(ridge_difficult_instances_team.head())

### Bagging

Initialize and train the model, then make predictions and print the MSE.

In [None]:
# Individual Metrics
tree_model_ind = DecisionTreeRegressor(random_state=42)
bagging_model_ind = BaggingRegressor(estimator=tree_model_ind, n_estimators=100, random_state=42)
bagging_model_ind.fit(X_train_ind, y_train)
bagging_predictions_ind = bagging_model_ind.predict(X_test_ind)
bagging_mse_ind = mean_squared_error(y_test, bagging_predictions_ind)

# Team Metrics
tree_model_team = DecisionTreeRegressor(random_state=42)
bagging_model_team = BaggingRegressor(estimator=tree_model_team, n_estimators=100, random_state=42)
bagging_model_team.fit(X_train_team, y_train)
bagging_predictions_team = bagging_model_team.predict(X_test_team)
bagging_mse_team = mean_squared_error(y_test, bagging_predictions_team)

print(f"Regression Trees (Bagging) MSE (Individual Metrics): {bagging_mse_ind:.4f}")
print(f"Regression Trees (Bagging) MSE (Team Metrics): {bagging_mse_team:.4f}")


#### Error Analysis

The MSE and RMSE values here are lower than those of Ridge Regression, which suggests a slightly better fit to the data.

In [None]:
# Calculate RMSE for Bagging Regressor
bagging_rmse_ind = np.sqrt(bagging_mse_ind)
bagging_rmse_team = np.sqrt(bagging_mse_team)
print(f"Regression Trees (Bagging) RMSE (Individual Metrics): {bagging_rmse_ind:.4f}")
print(f"Regression Trees (Bagging) RMSE (Team Metrics): {bagging_rmse_team:.4f}")

Plot residuals here as well.

In [None]:

# Residuals for Bagging Regressor
bagging_residuals_ind = y_test - bagging_predictions_ind
bagging_residuals_team = y_test - bagging_predictions_team

# Plot residuals for Bagging Regressor (Individual Metrics)
plt.scatter(bagging_predictions_ind, bagging_residuals_ind)
plt.axhline(y=0, color='r', linestyle='-')
plt.title('Residual Plot for Bagging Regressor (Individual Metrics)')
plt.xlabel('Predicted Winrate')
plt.ylabel('Residuals')
plt.show()

# Plot residuals for Bagging Regressor (Team Metrics)
plt.scatter(bagging_predictions_team, bagging_residuals_team)
plt.axhline(y=0, color='r', linestyle='-')
plt.title('Residual Plot for Bagging Regressor (Team Metrics)')
plt.xlabel('Predicted Winrate')
plt.ylabel('Residuals')
plt.show()

Below are the most challenging instances when using the Bagging Regressor to compare the difference between the predictions using Individual Metrics and Team-based Metrics.

In [None]:
# Actual vs Predicted for Bagging Regressor (Individual Metrics)
bagging_results_ind = pd.DataFrame({'Actual': y_test, 'Predicted': bagging_predictions_ind})
bagging_results_ind['Residual'] = np.abs(bagging_results_ind['Actual'] - bagging_results_ind['Predicted'])
bagging_difficult_instances_ind = bagging_results_ind.sort_values(by='Residual', ascending=False)
print("Bagging Regressor (Individual Metrics) - Most Challenging Instances:")
print(bagging_difficult_instances_ind.head())

# Actual vs Predicted for Bagging Regressor (Team Metrics)
bagging_results_team = pd.DataFrame({'Actual': y_test, 'Predicted': bagging_predictions_team})
bagging_results_team['Residual'] = np.abs(bagging_results_team['Actual'] - bagging_results_team['Predicted'])
bagging_difficult_instances_team = bagging_results_team.sort_values(by='Residual', ascending=False)
print("Bagging Regressor (Team Metrics) - Most Challenging Instances:")
print(bagging_difficult_instances_team.head())

### Polynomial Regression

Generate polynomial features, initialize and train the model, then make predictions and print the MSE.

In [None]:
# Individual Metrics
poly_ind = PolynomialFeatures(degree=2)
X_poly_train_ind = poly_ind.fit_transform(X_train_ind)
X_poly_test_ind = poly_ind.transform(X_test_ind)
poly_model_ind = LinearRegression()
poly_model_ind.fit(X_poly_train_ind, y_train)
poly_predictions_ind = poly_model_ind.predict(X_poly_test_ind)
poly_mse_ind = mean_squared_error(y_test, poly_predictions_ind)

# Team Metrics
poly_team = PolynomialFeatures(degree=2)
X_poly_train_team = poly_team.fit_transform(X_train_team)
X_poly_test_team = poly_team.transform(X_test_team)
poly_model_team = LinearRegression()
poly_model_team.fit(X_poly_train_team, y_train)
poly_predictions_team = poly_model_team.predict(X_poly_test_team)
poly_mse_team = mean_squared_error(y_test, poly_predictions_team)

print(f"Polynomial Regression MSE (Individual Metrics): {poly_mse_ind:.4f}")
print(f"Polynomial Regression MSE (Team Metrics): {poly_mse_team:.4f}")


#### Error Analysis

Unlike the previous two models, which fit the data reasonably well, Polynomial Regression produces noticeably higher MSE and RMSE values.

In [None]:
# Calculate RMSE for Polynomial Regression
poly_rmse_ind = np.sqrt(poly_mse_ind)
poly_rmse_team = np.sqrt(poly_mse_team)
print(f"Polynomial Regression RMSE (Individual Metrics): {poly_rmse_ind:.4f}")
print(f"Polynomial Regression RMSE (Team Metrics): {poly_rmse_team:.4f}")

We can plot residuals here as well to identify the patterns or challenging instances.

In [None]:
# Residuals for Polynomial Regression
poly_residuals_ind = y_test - poly_predictions_ind
poly_residuals_team = y_test - poly_predictions_team

# Plot residuals for Polynomial Regression (Individual Metrics)
plt.scatter(poly_predictions_ind, poly_residuals_ind)
plt.axhline(y=0, color='r', linestyle='-')
plt.title('Residual Plot for Polynomial Regression (Individual Metrics)')
plt.xlabel('Predicted Winrate')
plt.ylabel('Residuals')
plt.show()

# Plot residuals for Polynomial Regression (Team Metrics)
plt.scatter(poly_predictions_team, poly_residuals_team)
plt.axhline(y=0, color='r', linestyle='-')
plt.title('Residual Plot for Polynomial Regression (Team Metrics)')
plt.xlabel('Predicted Winrate')
plt.ylabel('Residuals')
plt.show()

Below are the most challenging instances for the Polynomial Regression model, ranked by absolute residual.

In [None]:
# Actual vs Predicted for Polynomial Regression (Individual Metrics)
poly_results_ind = pd.DataFrame({'Actual': y_test, 'Predicted': poly_predictions_ind})
poly_results_ind['Residual'] = np.abs(poly_results_ind['Actual'] - poly_results_ind['Predicted'])
poly_difficult_instances_ind = poly_results_ind.sort_values(by='Residual', ascending=False)
print("Polynomial Regression (Individual Metrics) - Most Challenging Instances:")
print(poly_difficult_instances_ind.head())

# Actual vs Predicted for Polynomial Regression (Team Metrics)
poly_results_team = pd.DataFrame({'Actual': y_test, 'Predicted': poly_predictions_team})
poly_results_team['Residual'] = np.abs(poly_results_team['Actual'] - poly_results_team['Predicted'])
poly_difficult_instances_team = poly_results_team.sort_values(by='Residual', ascending=False)
print("Polynomial Regression (Team Metrics) - Most Challenging Instances:")
print(poly_difficult_instances_team.head())


## Improving Model Performance

Based on our findings from the error analysis, the next step is to try improving the models through hyperparameter tuning. The initial results were promising: the predictions were reasonably close to the actual values and the MSEs were low, with the exception of Polynomial Regression.

### Ridge Regression

For `Ridge Regression`, we'll tune the `alpha` parameter, which controls the regularization strength.

We searched over `alpha` values of 0.1, 1.0, 10.0, and 100.0.

We used Grid Search to find the optimal `alpha` parameter for the Ridge Regression model by tuning it separately for individual and team metrics, aiming to minimize the mean squared error. After training the models with the best parameters, we evaluated their performance and printed the MSE for both sets of metrics.

In [None]:
# Define parameter grid for Ridge Regression
ridge_params = {'alpha': [0.1, 1.0, 10.0, 100.0]}

# Individual Metrics
ridge_grid_ind = GridSearchCV(Ridge(), ridge_params, cv=5, scoring='neg_mean_squared_error')
ridge_grid_ind.fit(X_train_ind, y_train)
best_ridge_ind = ridge_grid_ind.best_estimator_
ridge_predictions_ind_tuned = best_ridge_ind.predict(X_test_ind)
ridge_mse_ind_tuned = mean_squared_error(y_test, ridge_predictions_ind_tuned)

# Team Metrics
ridge_grid_team = GridSearchCV(Ridge(), ridge_params, cv=5, scoring='neg_mean_squared_error')
ridge_grid_team.fit(X_train_team, y_train)
best_ridge_team = ridge_grid_team.best_estimator_
ridge_predictions_team_tuned = best_ridge_team.predict(X_test_team)
ridge_mse_team_tuned = mean_squared_error(y_test, ridge_predictions_team_tuned)

print(f"Tuned Ridge Regression MSE (Individual Metrics): {ridge_mse_ind_tuned:.4f}")
print(f"Tuned Ridge Regression MSE (Team Metrics): {ridge_mse_team_tuned:.4f}")

The MSE values did not improve after tuning and, if anything, got slightly worse.
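One possible explanation is that `GridSearchCV` picks `alpha` by cross-validated score on the training split, which does not guarantee a lower MSE on one particular test split. A minimal sketch of how to inspect what the search selected, using synthetic data as an assumption in place of the project data:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training split (assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(48, 5))
y_demo = 0.5 + 0.1 * X_demo[:, 0] + rng.normal(scale=0.05, size=48)

grid_demo = GridSearchCV(
    Ridge(), {"alpha": [0.1, 1.0, 10.0, 100.0]},
    cv=5, scoring="neg_mean_squared_error")
grid_demo.fit(X_demo, y_demo)

# The scorer is negated MSE, so flip the sign to read it as an MSE.
print("Best alpha:", grid_demo.best_params_["alpha"])
print(f"Best CV MSE: {-grid_demo.best_score_:.4f}")
for alpha, score in zip(grid_demo.cv_results_["param_alpha"],
                        grid_demo.cv_results_["mean_test_score"]):
    print(f"alpha={alpha}: CV MSE={-score:.4f}")
```

If the cross-validated MSEs for the candidate values are nearly identical, the choice of `alpha` is effectively arbitrary and small changes in test MSE are to be expected.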

Another thing we can do is reassess the current feature importance to determine which features have negative importance scores. 

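A negative importance score means that shuffling the feature did not hurt, and may even have slightly improved, the model's score, which suggests the feature contributes noise rather than signal. Removing such features simplifies the model and may improve performance.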
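As a toy illustration of how to read these scores, the sketch below fits a Ridge model on one informative feature and one pure-noise feature. The data and the feature names `signal` and `noise` are hypothetical and are not part of the project dataset.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
n = 200
signal = rng.normal(size=n)   # drives the target
noise = rng.normal(size=n)    # unrelated to the target
X_demo = np.column_stack([signal, noise])
y_demo = 2.0 * signal + rng.normal(scale=0.1, size=n)

# Fit on the first 150 rows, measure importance on the held-out 50.
model_demo = Ridge(alpha=1.0).fit(X_demo[:150], y_demo[:150])
result_demo = permutation_importance(
    model_demo, X_demo[150:], y_demo[150:], n_repeats=10, random_state=42)

for name, mean, std in zip(["signal", "noise"],
                           result_demo.importances_mean,
                           result_demo.importances_std):
    print(f"{name}: {mean:.4f} +/- {std:.4f}")
```

The informative feature gets a large positive score, while the noise feature lands near zero and can come out slightly negative purely by chance, which is why small negative values are best read alongside their standard deviation.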

In [None]:
# Feature importance for Ridge Regression (Individual Metrics)
result = permutation_importance(best_ridge_ind, X_test_ind, y_test, n_repeats=10, random_state=42)
importance_df = pd.DataFrame(result.importances_mean, index=individual_metrics, columns=["Importance"])
print("Feature Importance (Ridge Regression - Individual Metrics):")
print(importance_df.sort_values(by="Importance", ascending=False))

# Feature importance for Ridge Regression (Team Metrics)
result_team_ridge = permutation_importance(ridge_model_team, X_test_team, y_test, n_repeats=10, random_state=42)
importance_df_team_ridge = pd.DataFrame(result_team_ridge.importances_mean, index=team_metrics, columns=["Importance"])
print("Feature Importance (Ridge Regression - Team Metrics):")
print(importance_df_team_ridge.sort_values(by="Importance", ascending=False))

Since none of the Individual Metrics have a negative importance, there is nothing to remove, so we keep the current individual-metric MSE as our best result for this model.

In the Team Metrics, however, some features have negative importance: Damage Share STD, Team Assists, and Vision Score.

Let's try running it through the model without them.

In [None]:
# Define team metrics without Damage Share STD, Team Assists, and Vision Score
team_metrics_updated = [
    'Objectives Taken'
]

# Split the data into training and testing sets for the updated team metrics
X_team_updated = team_aggregated[team_metrics_updated]
X_train_team_updated, X_test_team_updated, y_train, y_test = train_test_split(X_team_updated, y, test_size=0.2, random_state=42)

We train the model again for the updated team metrics.

In [None]:
# Train Ridge Regression model with the updated team metrics
ridge_model_team_updated = Ridge(alpha=1.0)
ridge_model_team_updated.fit(X_train_team_updated, y_train)
ridge_predictions_team_updated = ridge_model_team_updated.predict(X_test_team_updated)
ridge_mse_team_updated = mean_squared_error(y_test, ridge_predictions_team_updated)

print(f"Ridge Regression MSE (Updated Team Metrics): {ridge_mse_team_updated:.4f}")

It's not a big jump, but dropping the negative-importance features brought the team-metric MSE down from the initial `TEAM METRICS MSE: 0.0506` to `UPDATED TEAM METRICS MSE: 0.0499`.
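Because a change from 0.0506 to 0.0499 is small, it may be worth checking whether it exceeds the variation between different train/test splits. One way to do that is to compare both feature sets under repeated k-fold cross-validation. The sketch below uses synthetic data, and the column names `f1` to `f4` are hypothetical placeholders for the team metrics.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in: only f1 is related to the target (assumption).
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.normal(size=(60, 4)), columns=["f1", "f2", "f3", "f4"])
target_demo = 0.5 + 0.1 * demo["f1"] + rng.normal(scale=0.05, size=60)

cv_demo = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)

def cv_mse(columns):
    """Return mean and standard deviation of the CV MSE for a feature subset."""
    scores = cross_val_score(Ridge(alpha=1.0), demo[columns], target_demo,
                             cv=cv_demo, scoring="neg_mean_squared_error")
    return -scores.mean(), scores.std()

full_mean, full_std = cv_mse(["f1", "f2", "f3", "f4"])
reduced_mean, reduced_std = cv_mse(["f1"])

print(f"All features:     {full_mean:.4f} +/- {full_std:.4f}")
print(f"Reduced features: {reduced_mean:.4f} +/- {reduced_std:.4f}")
```

If the gap between the two means is smaller than the spread across folds, the improvement from dropping features should be treated as tentative.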

**BEST FINAL MSE VALUES**

Individual-metric MSE (before tuning): 0.0393

Team-metric MSE (after removing negative-importance features): 0.0499

### Bagging

For the `Bagging Regressor`, we'll tune `n_estimators` (the number of trees) and `max_features` (the fraction of features drawn to train each tree).

We searched over 50, 100, and 200 estimators, and `max_features` values of 0.5 and 1.0.

We used Grid Search to find the optimal parameters for the Bagging Regressor model by tuning it separately for individual and team metrics, focusing on the number of estimators and the maximum features to draw. After training the models with the best parameters, we evaluated their performance and printed the MSE for both sets of metrics.

In [None]:
# Define parameter grid for Bagging Regressor
bagging_params = {'n_estimators': [50, 100, 200], 'max_features': [0.5, 1.0]}

# Individual Metrics
bagging_grid_ind = GridSearchCV(BaggingRegressor(estimator=DecisionTreeRegressor(), random_state=42), 
                                bagging_params, cv=5, scoring='neg_mean_squared_error')
bagging_grid_ind.fit(X_train_ind, y_train)
best_bagging_ind = bagging_grid_ind.best_estimator_
bagging_predictions_ind_tuned = best_bagging_ind.predict(X_test_ind)
bagging_mse_ind_tuned = mean_squared_error(y_test, bagging_predictions_ind_tuned)

# Team Metrics
bagging_grid_team = GridSearchCV(BaggingRegressor(estimator=DecisionTreeRegressor(), random_state=42), 
                                 bagging_params, cv=5, scoring='neg_mean_squared_error')
bagging_grid_team.fit(X_train_team, y_train)
best_bagging_team = bagging_grid_team.best_estimator_
bagging_predictions_team_tuned = best_bagging_team.predict(X_test_team)
bagging_mse_team_tuned = mean_squared_error(y_test, bagging_predictions_team_tuned)

print(f"Tuned Bagging Regressor MSE (Individual Metrics): {bagging_mse_ind_tuned:.4f}")
print(f"Tuned Bagging Regressor MSE (Team Metrics): {bagging_mse_team_tuned:.4f}")


As with Ridge Regression, hyperparameter tuning seems to have raised the MSE. We can again take a different approach and look at feature importance to reduce noise.

In [None]:
# Feature importance for Bagging Regressor (Individual Metrics)
result_ind_bagging = permutation_importance(bagging_model_ind, X_test_ind, y_test, n_repeats=10, random_state=42)
importance_df_ind_bagging = pd.DataFrame(result_ind_bagging.importances_mean, index=individual_metrics, columns=["Importance"])
print("Feature Importance (Bagging Regressor - Individual Metrics):")
print(importance_df_ind_bagging.sort_values(by="Importance", ascending=False))

# Feature importance for Bagging Regressor (Team Metrics)
result_team_bagging = permutation_importance(bagging_model_team, X_test_team, y_test, n_repeats=10, random_state=42)
importance_df_team_bagging = pd.DataFrame(result_team_bagging.importances_mean, index=team_metrics, columns=["Importance"])
print("Feature Importance (Bagging Regressor - Team Metrics):")
print(importance_df_team_bagging.sort_values(by="Importance", ascending=False))

Here we can see negative values in the Individual Metrics for the Bagging Regressor model. As before, we can remove those features and run the data through the model again.

In [None]:
# Define individual metrics without Damage Share Difference
individual_metrics_updated = [
    'KDA', 'KP Difference', 'Creep Score', 
    'Gold Earned'
]

# Split the data into training and testing sets for the updated individual metrics
X_individual_updated = team_aggregated[individual_metrics_updated]
X_train_ind_updated, X_test_ind_updated, y_train, y_test = train_test_split(X_individual_updated, y, test_size=0.2, random_state=42)


We train the model again like we did with the Ridge Model.

In [None]:
# Train Bagging Regressor model with the updated individual metrics
tree_model_ind_updated = DecisionTreeRegressor(random_state=42)
bagging_model_ind_updated = BaggingRegressor(estimator=tree_model_ind_updated, n_estimators=100, random_state=42)
bagging_model_ind_updated.fit(X_train_ind_updated, y_train)
bagging_predictions_ind_updated = bagging_model_ind_updated.predict(X_test_ind_updated)
bagging_mse_ind_updated = mean_squared_error(y_test, bagging_predictions_ind_updated)

print(f"Bagging Regressor MSE (Updated Individual Metrics): {bagging_mse_ind_updated:.4f}")

Dropping the negative-importance feature brought the individual-metric MSE down from the initial `INDIV METRICS MSE: 0.0323` to `UPDATED INDIV METRICS MSE: 0.0249`.

**BEST FINAL MSE VALUES**

Individual-metric MSE (after removing negative-importance features): 0.0249

Team-metric MSE (before tuning): 0.0371

### Polynomial Regression

Finally, for `Polynomial Regression`, we'll tune the degree of the polynomial.

We searched over polynomial degrees of 2, 3, and 4.

We used Grid Search to find the optimal degree for the Polynomial Regression model by tuning it separately for individual and team metrics, aiming to minimize the mean squared error. After training the models with the best parameters, we evaluated their performance and printed the MSE for both sets of metrics.

In [None]:
# Define parameter grid for Polynomial Regression
poly_params = {'poly__degree': [2, 3, 4]}

# Individual Metrics
pipeline_ind = Pipeline([('poly', PolynomialFeatures()), ('linear', LinearRegression())])
poly_grid_ind = GridSearchCV(pipeline_ind, poly_params, cv=5, scoring='neg_mean_squared_error')
poly_grid_ind.fit(X_train_ind, y_train)
best_poly_ind = poly_grid_ind.best_estimator_
poly_predictions_ind_tuned = best_poly_ind.predict(X_test_ind)
poly_mse_ind_tuned = mean_squared_error(y_test, poly_predictions_ind_tuned)

# Team Metrics
pipeline_team = Pipeline([('poly', PolynomialFeatures()), ('linear', LinearRegression())])
poly_grid_team = GridSearchCV(pipeline_team, poly_params, cv=5, scoring='neg_mean_squared_error')
poly_grid_team.fit(X_train_team, y_train)
best_poly_team = poly_grid_team.best_estimator_
poly_predictions_team_tuned = best_poly_team.predict(X_test_team)
poly_mse_team_tuned = mean_squared_error(y_test, poly_predictions_team_tuned)

print(f"Tuned Polynomial Regression MSE (Individual Metrics): {poly_mse_ind_tuned:.4f}")
print(f"Tuned Polynomial Regression MSE (Team Metrics): {poly_mse_team_tuned:.4f}")

We can now check feature importance instead, since there were no changes to the MSE values after tuning.

In [None]:
# Use coefficients to infer feature importance
poly_coefs_ind = pd.Series(poly_model_ind.coef_, index=poly_ind.get_feature_names_out(individual_metrics))
important_features_ind_poly = poly_coefs_ind.abs().sort_values(ascending=False)

print("Feature Importance (Polynomial Regression - Individual Metrics):")
print(important_features_ind_poly)


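Because these are absolute coefficient magnitudes, the ranking shows which terms carry the most weight but cannot flag harmful features the way a negative permutation importance does, so we do not remove any features for Polynomial Regression.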

In [None]:
# Use coefficients to infer feature importance
poly_coefs_team = pd.Series(poly_model_team.coef_, index=poly_team.get_feature_names_out(team_metrics))
important_features_team_poly = poly_coefs_team.abs().sort_values(ascending=False)

print("Feature Importance (Polynomial Regression - Team Metrics):")
print(important_features_team_poly)

**BEST FINAL MSE VALUES**

Before/After Tuning Individual-metric MSE: 0.1843 

Before/After Tuning Team-metric MSE: 0.1311

### Summary 

The steps we've taken include:

1. **Feature Importance Analysis:** We've checked the impact of each feature for Ridge Regression, Bagging Regressor and Polynomial Regression.

2. **Hyperparameter Tuning:** We've tuned the models to find the best-performing configurations.

3. **Error Analysis:** We've evaluated the models' performance using metrics such as MSE and RMSE.

Given these steps, when every feature shows a non-negative impact on the model and hyperparameter tuning does not produce a meaningful improvement, it is reasonable to retain the existing model structure.

This way we can ensure that we are leveraging all the relevant features that contribute positively to the predictions.

## Model Selection Performance Summary

When evaluating and comparing models for the task, we selected three candidates: **Ridge Regression**, **Bagging Regressor**, and **Polynomial Regression**. These models were chosen because they address specific challenges in the dataset: multicollinearity, non-linear relationships, and feature aggregation through ensembles. Each model was evaluated based on its performance metrics and its ability to improve through tuning and feature selection. Below, we provide insights into the models' performances, the improvements achieved, and justifications for their selection.


### **1. Ridge Regression**

Ridge Regression was selected for its ability to handle datasets with multicollinear features, where traditional linear regression models may struggle. By penalizing large coefficients, Ridge Regression ensures more stable predictions and reduces the risk of overfitting. 

- **Performance**:  
  - Individual metrics yielded a low MSE of **0.0393**, with no improvement after tuning `alpha` values ([0.1, 1.0, 10.0, 100.0]).  
  - Team metrics improved slightly after irrelevant features (e.g., `Damage Share STD`, `Team Assists`, `Vision Score`) were dropped, reducing MSE from **0.0506** to **0.0499**.

- **Reasoning for Selection**:  
  Ridge Regression is an optimal choice for datasets like this one due to its regularization capabilities. While the improvement for individual metrics plateaued, the model excelled at balancing bias and variance, making it a reliable option for straightforward, interpretable predictions.


### **2. Bagging Regressor**

The Bagging Regressor was chosen as it combines the predictions of multiple base models (decision trees) to create a more robust ensemble. This method reduces variance and captures non-linear relationships better than linear models.

- **Performance**:  
  - Individual metrics saw a significant improvement after removing irrelevant features with negative importance, with MSE dropping from **0.0323** to **0.0249**.  
  - Team metrics remained optimal at **0.0371**, as hyperparameter tuning (`n_estimators` [50, 100, 200], `max_features` [0.5, 1.0]) did not yield better results.

- **Reasoning for Selection**:  
  The ensemble approach of Bagging Regressor makes it particularly effective for datasets with non-linear interactions and varying feature importance. By averaging the predictions of many decision trees, each trained on a bootstrap sample, it delivered the most accurate results across both individual and team metrics. Furthermore, the variance reduction from averaging makes it less prone to overfitting than a single tree, which makes it a dependable choice for generalization.


### **3. Polynomial Regression**

Polynomial Regression was included to explore higher-order relationships between features and target variables. By transforming input features into polynomial terms, the model attempts to capture non-linear patterns that linear models may miss.

- **Performance**:  
  - Individual metrics stagnated with an MSE of **0.1843** before and after tuning polynomial degrees ([2, 3, 4]).  
  - Team metrics also showed no improvement, with an MSE of **0.1311** before and after tuning.

- **Reasoning for Selection**:  
  Polynomial Regression was initially considered because it can theoretically capture complex relationships. However, its poor performance likely indicates overfitting due to the high number of polynomial terms generated relative to the dataset size. The relationships in the dataset are better modeled by simpler, more robust techniques like Ridge Regression and Bagging Regressor.

Each model was selected to address specific dataset challenges. Ridge Regression provided a reliable baseline for handling multicollinearity, Bagging Regressor captured non-linear interactions effectively, and Polynomial Regression explored higher-order relationships. However, the results demonstrated that simpler, more robust techniques like Ridge Regression and Bagging Regressor were better suited for this task, while Polynomial Regression struggled with overfitting. These insights highlight the importance of aligning model complexity with the nature of the data to achieve optimal performance.
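The overfitting explanation for Polynomial Regression can be made concrete by counting how many terms `PolynomialFeatures` generates at each degree that was tried. The sketch below assumes five input features (an assumption based on the individual-metric list used earlier; the actual count may differ).

```python
import numpy as np
from math import comb
from sklearn.preprocessing import PolynomialFeatures

n_features = 5  # assumed number of input features
X_demo = np.zeros((1, n_features))  # only the shape matters here

term_counts = {}
for degree in [2, 3, 4]:
    poly_demo = PolynomialFeatures(degree=degree).fit(X_demo)
    term_counts[degree] = poly_demo.n_output_features_
    # With the bias column included, the count is C(n_features + degree, degree).
    assert term_counts[degree] == comb(n_features + degree, degree)
    print(f"degree={degree}: {term_counts[degree]} terms")
```

With five inputs, the model already has 21 coefficients at degree 2 and 126 at degree 4, so on a small training set an unregularized `LinearRegression` can fit noise rather than signal.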

## Recommendations for Improvement


### 1. **Enhancing Feature Engineering**  
   To further improve model performance, additional feature engineering steps can be explored. For instance, scaling features to a uniform range can benefit models like Ridge Regression and Polynomial Regression, which are sensitive to feature magnitudes. Dimensionality reduction techniques such as Principal Component Analysis (PCA) could also help remove redundant or highly correlated features, reducing noise in the dataset. Additionally, creating interaction terms or transforming non-linear features into a more interpretable form might enable the models to capture relationships that are not immediately apparent.

   Feature importance analysis has already proven valuable, as seen in the Bagging Regressor and Ridge Regression models, where pruning irrelevant features improved performance. Extending this analysis to include advanced techniques like Recursive Feature Elimination (RFE) could further streamline feature selection, focusing only on those with the most predictive power.
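A minimal sketch of how scaling and Recursive Feature Elimination could be combined in a single pipeline is shown below. The synthetic columns only mimic features on very different scales (for example gold versus KDA); the data, the effect sizes, and the choice of keeping two features are all assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
# Four synthetic features on very different scales (assumption).
X_demo = np.column_stack([
    rng.normal(10000, 2000, size=80),  # large-scale, informative
    rng.normal(3, 1, size=80),         # small-scale, informative
    rng.normal(200, 40, size=80),      # uninformative
    rng.normal(0, 1, size=80),         # uninformative
])
y_demo = (0.5 + 0.00005 * (X_demo[:, 0] - 10000)
          + 0.05 * (X_demo[:, 1] - 3)
          + rng.normal(scale=0.02, size=80))

pipe_demo = Pipeline([
    ("scale", StandardScaler()),   # put all features on a comparable scale
    ("select", RFE(Ridge(alpha=1.0), n_features_to_select=2)),
    ("ridge", Ridge(alpha=1.0)),   # final model on the kept features
])
pipe_demo.fit(X_demo, y_demo)

selected_demo = pipe_demo.named_steps["select"].support_
print("Features kept by RFE:", selected_demo)
```

Putting the scaler and selector inside the pipeline means they are refit on each training fold during cross-validation, which avoids leaking information from held-out data into the feature selection step.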

### 2. **Experimenting with Advanced Models**  
   While Ridge Regression, Bagging Regressor, and Polynomial Regression provided a solid foundation for this analysis, exploring more advanced algorithms may yield better results. Ensemble methods like Gradient Boosting Machines (GBMs), Random Forests, and XGBoost have been widely recognized for their ability to handle complex datasets effectively. These models typically outperform simpler methods due to their capacity to combine weak learners into a stronger predictive system while controlling for overfitting.

   For instance, a Gradient Boosting model could potentially reduce errors on the team metrics, where Polynomial Regression struggled. These methods could also incorporate feature importance analysis intrinsically, further improving interpretability and performance. 
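As a rough sketch of how such candidates could be compared, the example below cross-validates scikit-learn's `RandomForestRegressor` and `GradientBoostingRegressor` on synthetic data. The data and the hyperparameter values are assumptions and are not tuned for the project dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with a non-linear target (assumption).
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(80, 4))
y_demo = (0.5 + 0.1 * np.tanh(X_demo[:, 0])
          + 0.05 * X_demo[:, 1] * X_demo[:, 2]
          + rng.normal(scale=0.02, size=80))

candidates = {
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(
        n_estimators=100, learning_rate=0.1, max_depth=2, random_state=42),
}

cv_mse_demo = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5,
                             scoring="neg_mean_squared_error")
    cv_mse_demo[name] = -scores.mean()
    print(f"{name}: CV MSE = {cv_mse_demo[name]:.4f}")
```

Both estimators also expose `feature_importances_` after fitting, which could complement the permutation importance used in this notebook.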

### 3. **Exploring Neural Networks**  
   If additional data becomes available, neural networks could be a powerful alternative to the current models. With sufficient training data, neural networks excel at capturing intricate, non-linear relationships, especially in problems where patterns are not immediately apparent. Architectures like deep feedforward networks or even recurrent networks (if temporal features are involved) could be particularly effective. However, these models require careful tuning and may necessitate GPU acceleration for efficient training.

   While neural networks are not always the first choice for tabular data, the potential to learn complex representations makes them worth considering, especially if the dataset expands in size or complexity.

### 4. **Reassessing Data and Expanding the Dataset**  
   The dataset’s characteristics, such as the distribution of features, should be thoroughly re-evaluated. Uneven distributions or biases in the data might be limiting model performance. Techniques like stratified sampling during train-test splits can ensure balanced representation across subsets. Additionally, data augmentation techniques (e.g., synthetic data generation) could address limitations caused by small sample sizes in certain feature groups.

   Expanding the dataset by including more data points or relevant features could also improve model generalization. For example, incorporating external data related to team dynamics, player history, or contextual factors may provide the models with additional insights to enhance prediction accuracy.

### 5. **Fine-Tuning Hyperparameters and Cross-Validation**  
   Hyperparameter tuning yielded little improvement in this analysis, but a more extensive search could give better results. Methods like RandomizedSearchCV or Bayesian Optimization might explore a broader hyperparameter space more efficiently than an exhaustive GridSearchCV. Additionally, increasing the number of cross-validation folds or using nested cross-validation could provide more reliable estimates of model performance.

   These adjustments would not only improve the stability of the results but also ensure the models are less prone to overfitting, particularly when working with ensemble methods like Bagging Regressor or Gradient Boosting.
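The sketch below shows one way nested cross-validation with `RandomizedSearchCV` might look for the Ridge model. The synthetic data, the `alpha` range, and the fold counts are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, RandomizedSearchCV, cross_val_score

# Synthetic stand-in for the project data (assumption).
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(60, 5))
y_demo = 0.5 + 0.1 * X_demo[:, 0] + rng.normal(scale=0.05, size=60)

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)  # picks alpha
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # estimates error

# Sample alpha on a log scale instead of from a fixed grid.
search_demo = RandomizedSearchCV(
    Ridge(), {"alpha": loguniform(1e-2, 1e2)}, n_iter=20,
    cv=inner_cv, scoring="neg_mean_squared_error", random_state=42)

# Each outer fold reruns the whole search on its own training portion.
nested_scores = cross_val_score(search_demo, X_demo, y_demo, cv=outer_cv,
                                scoring="neg_mean_squared_error")
print(f"Nested CV MSE: {-nested_scores.mean():.4f} +/- {nested_scores.std():.4f}")
```

Because the outer folds never take part in choosing `alpha`, the resulting MSE is a less optimistic estimate than the best score reported by a single search.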

By implementing these recommendations, the models can be further optimized to achieve lower errors, improved generalization, and greater robustness. These steps would also pave the way for more accurate predictions and actionable insights in similar future projects.

## Insights and Conclusions

### **Model Performance**  
The choice of models for this analysis was driven by their ability to address specific challenges in the dataset, such as multicollinearity, non-linear relationships, and the need for ensemble learning to reduce variance. Each model was selected for its strengths in tackling these issues, providing a diverse approach to understanding win rate prediction.

**Ridge Regression** was a natural choice for handling datasets with multicollinearity. By penalizing large coefficients, it ensured stable predictions without overfitting, especially for individual metrics. Its MSE for individual metrics was already low at **0.0393**, reflecting its ability to capture linear relationships effectively. While hyperparameter tuning did not improve individual metrics further, removing irrelevant features like `Damage Share STD` and `Vision Score` slightly improved team metrics, lowering the MSE from **0.0506** to **0.0499**. This performance demonstrates that Ridge Regression is a reliable baseline for datasets with well-defined linear relationships.

**Bagging Regressor** stood out as the best-performing model overall, excelling in both accuracy and generalization. Its ensemble approach, which aggregates predictions from multiple decision trees, captured the non-linear relationships in the data effectively. After feature importance analysis and the removal of irrelevant features (e.g., `Damage Share Difference`), the individual metrics' MSE improved significantly from **0.0323** to **0.0249**, the lowest among all models. For team metrics, its untuned configuration yielded an optimal MSE of **0.0371**, showing that the model was robust even without extensive parameter adjustments. This highlights the value of ensemble methods in reducing variance and handling noise in complex datasets.

**Polynomial Regression**, while theoretically capable of modeling higher-order interactions, underperformed in this analysis. Its MSE values of **0.1843** for individual metrics and **0.1311** for team metrics remained unchanged despite tuning polynomial degrees. This suggests that the dataset’s relationships were not well-suited for higher-order feature transformations, and the increased complexity likely led to overfitting. The inclusion of Polynomial Regression was still valuable, as it validated that simpler models like Ridge Regression and Bagging Regressor were more appropriate for this problem. Overall, the performance of these models illustrates the importance of aligning model complexity with the underlying structure of the data.


### **Data and Problem**  
The project aimed to predict win rates from individual and team performance metrics, providing a framework for analyzing what contributes to team success. The win rate, calculated as the ratio of wins to total games, served as a clear and interpretable target variable. Because this target is continuous, regression models were an appropriate choice.
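
As a small worked example of the target definition, the sketch below computes the win rate per team from a toy table. The column names `team` and `win` and the values are hypothetical and do not come from the dataset.

```python
import pandas as pd

# Toy data with hypothetical column names: one row per game played by a team.
games = pd.DataFrame({
    "team": ["A", "A", "A", "B", "B", "B", "B"],
    "win":  [1, 0, 1, 0, 0, 1, 0],
})

# Win rate = wins / total games, which is the mean of a 0/1 win indicator.
win_rate = games.groupby("team")["win"].mean()
print(win_rate)
```

Team A wins 2 of 3 games (about 0.667), and team B wins 1 of 4 games (0.25).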

The dataset captured several dimensions of gameplay, including individual metrics like `KDA`, `Gold Earned`, and `Creep Score`, as well as team metrics like `Objectives Taken` and `Vision Score`. These metrics represent both individual contributions and team dynamics, offering a holistic view of the factors influencing success. From the analysis, it became evident that metrics such as `KDA` and `Objectives Taken` were particularly predictive, reflecting the importance of both strong individual performance and coordinated team efforts. Conversely, features like `Damage Share STD` introduced noise into the models, highlighting the need for careful feature selection.
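
The idea of separating predictive features from noisy ones can be sketched with permutation feature importance, which is listed in the references. The example uses synthetic data with one informative column and one pure-noise column. The data, the `Ridge` model, and `n_repeats=10` are assumptions for illustration, not the notebook's actual setup.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic, illustrative data: column 0 drives the target, column 1 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = Ridge(alpha=1.0).fit(X_train, y_train)

# Shuffle one column at a time and measure how much the held-out score drops.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for name, importance in zip(["informative", "noise"], result.importances_mean):
    print(f"{name}: {importance:.4f}")
```

A feature whose shuffling barely changes the score, like the noise column here, is a candidate for removal.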

The benefits of this analysis extend beyond model performance. For coaches and analysts, understanding which metrics correlate most strongly with win rates can guide training strategies and in-game decision-making. Players can use these insights to prioritize individual improvements that have a measurable impact on team success. Additionally, these findings can inform game developers about the balance of gameplay elements, ensuring competitive fairness and engagement. The dataset and problem formulation thus provide a strong foundation for actionable insights that benefit multiple stakeholders.

### **Synthesis**  
This project achieved its primary objective of evaluating models to predict win rates while uncovering insights into the factors that drive individual and team performance. By systematically selecting, tuning, and analyzing models, we took a rigorous approach to identifying the most effective predictive techniques. Ridge Regression served as the linear baseline and Bagging Regressor captured non-linear interactions; together they provided a balanced comparison of accuracy and generalization.

The inclusion of Polynomial Regression, though underwhelming in performance, was an important part of the methodology, as it tested the hypothesis that higher-order interactions might improve predictions. Its poor results reinforced the importance of selecting models that align with the data’s complexity. Feature importance analysis and error evaluation further refined the models, demonstrating the value of iterative adjustments in machine learning workflows. Additionally, the project successfully identified key performance indicators, such as `KDA` and `Objectives Taken`, while filtering out noisy metrics that hindered predictive accuracy.

Overall, the project was a success in terms of meeting its goals and providing actionable insights. The models performed well within the constraints of the dataset, and the results were communicated in a clear and interpretable manner. However, there is room for future improvements. Expanding the dataset, incorporating advanced ensemble methods like Gradient Boosting or XGBoost, and exploring neural networks for more complex relationships could further enhance predictive accuracy. The comprehensive approach taken in this project lays a solid foundation for similar analyses and highlights the importance of balancing technical rigor with practical relevance.

## References

BaggingRegressor. (n.d.). Scikit-learn. https://scikit-learn.org/dev/modules/generated/sklearn.ensemble.BaggingRegressor.html

DecisionTreeRegressor. (n.d.). Scikit-learn. https://scikit-learn.org/dev/modules/generated/sklearn.tree.DecisionTreeRegressor.html

GeeksforGeeks. (2024, July 11). Working with Missing Data in Pandas. GeeksforGeeks. https://www.geeksforgeeks.org/working-with-missing-data-in-pandas/

GridSearchCV. (n.d.). Scikit-learn. https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.GridSearchCV.html

Grover, N. (n.d.). Lasso Regression and Hyperparameter tuning using sklearn. https://www.stochasticbard.com/blog/lasso_regression/

LinearRegression. (n.d.). Scikit-learn. https://scikit-learn.org/dev/modules/generated/sklearn.linear_model.LinearRegression.html

Lyashenko, V. (2024, April 12). Cross-Validation in Machine Learning: How to do it right. neptune.ai. https://neptune.ai/blog/cross-validation-in-machine-learning-how-to-do-it-right

mean_squared_error. (n.d.). Scikit-learn. https://scikit-learn.org/1.5/modules/generated/sklearn.metrics.mean_squared_error.html

Pipeline. (n.d.). Scikit-learn. https://scikit-learn.org/1.5/modules/generated/sklearn.pipeline.Pipeline.html

PolynomialFeatures. (n.d.). Scikit-learn. https://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.PolynomialFeatures.html

4.2. Permutation feature importance. (n.d.). Scikit-learn. https://scikit-learn.org/1.5/modules/permutation_importance.html

User guide and tutorial — seaborn 0.13.2 documentation. (n.d.). https://seaborn.pydata.org/tutorial.html

W3Schools.com. (n.d.). https://www.w3schools.com/python/matplotlib_pyplot.asp

W3Schools.com. (n.d.). https://www.w3schools.com/python/pandas/ref_df_duplicated.asp