This model generates a wins-above-replacement (WAR) statistic for every player.

We can use a rudimentary approach:
1. Partition the data so that a separate model is trained for each position and league.
2. Conduct dimensionality reduction using principal component analysis (PCA) to reduce the features of each dataset to a single WAR statistic.
3. Train each model to generate a WAR statistic for every player.
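
The three steps above can be sketched on synthetic data as follows. This is a minimal illustration under stated assumptions, not the notebook's actual pipeline: `pca_score`, the `toy` frame, and its column names are hypothetical stand-ins, and the values are random rather than real keeper statistics.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def pca_score(df, stat_columns, directions=None, weights=None):
    """Hypothetical helper: collapse per-90 columns into one score per row."""
    X = df[stat_columns].to_numpy(dtype=float)
    if directions is not None:
        X = X * np.asarray(directions)  # flip "lower is better" stats
    X = StandardScaler().fit_transform(X)
    if weights is not None:
        X = X * np.asarray(weights)
    return PCA(n_components=1).fit_transform(X)[:, 0]


# Synthetic stand-in data (NOT real keeper statistics)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "league": np.repeat(["A", "B"], 20),
    "saves_per90": rng.normal(3.0, 0.5, 40),
    "goals_against_per90": rng.normal(1.2, 0.3, 40),
})
stats = ["saves_per90", "goals_against_per90"]

# Step 1: partition by league. Steps 2-3: reduce each partition to one score.
toy["score"] = np.nan
for league, part in toy.groupby("league"):
    toy.loc[part.index, "score"] = pca_score(part, stats, directions=[1, -1])
```

Because each partition is scaled and fitted on its own, a score is only comparable with other scores from the same partition.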

We can start with the keepers - they have a smaller dataset and far fewer features (far fewer per90 data types).

We have adapted the "keepers" dataset to contain only relevant per90 statistics (no counting stats), and removed keepers with fewer than 5 starts to prevent extreme outliers.

Definitions for the statistics can be found [here](https://fbref.com/en/comps/Big5/keepersadv/players/Big-5-European-Leagues-Stats).

In [47]:
# Basic Setup
import os
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Change to your local project folder before running
os.chdir("C:/Users/tobyt/Desktop/Coding/Personal/wins-above-replacement-soccer/WAR-based-soccer-valuation")

In [48]:
# Importing dataframes
overall_keepers_dataset = pd.read_csv("./model/keepers_final_vF.csv")
weights_df = pd.read_csv("./model/weights/keepers-stats-weights.csv")

keepers_columns = weights_df["stat"].to_list()
keepers_weights = weights_df["weight"].to_numpy()
keepers_stat_direction = weights_df["direction"].to_numpy()

# Keeping only players with 5+ starts
overall_keepers_dataset_filtered = overall_keepers_dataset[(overall_keepers_dataset.Starts_Playing >= 5)]

# Removing NaN observations
overall_keepers_dataset_filtered = overall_keepers_dataset_filtered.dropna(subset=keepers_columns).reset_index()

overall_keepers_dataset_modified = overall_keepers_dataset_filtered[keepers_columns].copy()

In [49]:
# Stats where lower is better (e.g. goals allowed) are reversed via the direction column
overall_keepers_dataset_modified = overall_keepers_dataset_modified * keepers_stat_direction

# Scaling our dataset
overall_keepers_dataset_scaled = StandardScaler().fit_transform(overall_keepers_dataset_modified)

# Weights of each stat - Remove if unnecessary
overall_keepers_dataset_weighted = overall_keepers_dataset_scaled * keepers_weights

In [50]:
# PCA decomposition
pca = PCA(n_components=1)
pca_features = pca.fit_transform(overall_keepers_dataset_weighted)

war = pd.DataFrame(pca_features)
war.columns = ["calculated_war"]
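# Sign flip: the sign of a PCA component is arbitrary, so re-check after any data change that higher WAR still means better
war["calculated_war"] = war["calculated_war"] * -1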

overall_keepers_dataset_filtered.insert(overall_keepers_dataset_filtered.columns.get_loc("unique_ID") + 1, "calculated_war", war["calculated_war"])
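
The sign flip in the cell above is hard-coded, and the sign of a principal component can change when the input data changes. One way to make the orientation explicit is sketched below. It assumes every feature column has already been direction-adjusted so that higher is better, and that the first component behaves broadly like a "general quality" axis. `orient_component` is a hypothetical helper, and the data is synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA


def orient_component(scores, features):
    """Flip `scores` if they correlate negatively with the row sum of `features`."""
    reference = features.sum(axis=1)
    corr = np.corrcoef(scores, reference)[0, 1]
    return -scores if corr < 0 else scores


rng = np.random.default_rng(1)
quality = rng.normal(size=50)
# Synthetic features, all "higher is better", driven by one latent quality
X = np.column_stack([quality + 0.1 * rng.normal(size=50) for _ in range(3)])

raw = PCA(n_components=1).fit_transform(X)[:, 0]
oriented = orient_component(raw, X)
```

If the loadings of the first component have mixed signs, a row-sum reference is a weak guide and the orientation should be checked by hand.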


In [51]:
# Percentile Calculations - rough approximation
sorted_war = np.sort(war["calculated_war"])
percentile_ranks = (sorted_war < war["calculated_war"].values[:,None]).mean(axis=1)
percentile_ranks = np.round(percentile_ranks, decimals=4) * 100
overall_keepers_dataset_filtered.insert(overall_keepers_dataset_filtered.columns.get_loc("unique_ID") + 2, "percentile", percentile_ranks)
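
The broadcasting comparison in the cell above gives, for each player, the share of players with a strictly lower score. A small worked example with made-up scores shows the behaviour, along with an equivalent rank-based form in pandas:

```python
import numpy as np
import pandas as pd

scores = np.array([0.5, -1.0, 2.0, 0.5])  # made-up values for illustration

# Same comparison as above: share of values strictly below each score
pct = (np.sort(scores) < scores[:, None]).mean(axis=1) * 100
# pct holds 25, 0, 75, 25

# Equivalent form: (minimum rank - 1) / n
alt = (pd.Series(scores).rank(method="min") - 1) / len(scores) * 100
```

With this definition tied players share the lower value, and the top player never reaches 100.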

In [52]:
if os.path.exists("./output/PCA/overall_keepers_war.csv"):
    os.remove("./output/PCA/overall_keepers_war.csv")

overall_keepers_dataset_filtered.to_csv("./output/PCA/overall_keepers_war.csv")


We can apply the same approach within each of the 5 leagues. A player who stands out at a position where a given league is comparatively weak will have a higher league-WAR than overall-WAR.

_This approach is heavy-handed - we have applied it to keepers because it is faster than deriving a method separate from the one used for outfield players._
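
The per-league steps are identical, so they can also be expressed as one function applied to each league. The sketch below uses the `Comp`, `Starts_Playing`, and `unique_ID` column names from the dataset, but `league_war`, the `stat_a`/`stat_b` columns, and the data are hypothetical and synthetic. It writes no files.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def league_war(df, columns, directions, weights, min_starts=5):
    """Hypothetical helper: PCA-based score and percentile for one league."""
    out = df[df["Starts_Playing"] >= min_starts]
    out = out.dropna(subset=columns).reset_index(drop=True)
    X = StandardScaler().fit_transform(out[columns] * directions) * weights
    war = -PCA(n_components=1).fit_transform(X)[:, 0]  # same sign flip as the overall model
    out["calculated_war"] = war
    # Share of players in this league with a strictly lower score
    out["percentile"] = (war[None, :] < war[:, None]).mean(axis=1) * 100
    return out


# Synthetic stand-in data (NOT real keeper statistics)
rng = np.random.default_rng(2)
n = 60
demo = pd.DataFrame({
    "unique_ID": np.arange(n),
    "Comp": np.tile(["Premier League", "La Liga", "Serie A"], n // 3),
    "Starts_Playing": rng.integers(1, 38, n),
    "stat_a": rng.normal(size=n),
    "stat_b": rng.normal(size=n),
})
cols = ["stat_a", "stat_b"]

by_league = {
    comp: league_war(part, cols, np.array([1, -1]), np.array([1.0, 1.0]))
    for comp, part in demo.groupby("Comp")
}
```

Keeping the results in a dictionary keyed by league makes it harder to write the wrong frame to a given output file.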

In [53]:
# League-wide calculations
overall_keepers_dataset_ENG = overall_keepers_dataset[(overall_keepers_dataset.Comp == "Premier League")]
overall_keepers_dataset_ESP = overall_keepers_dataset[(overall_keepers_dataset.Comp == "La Liga")]
overall_keepers_dataset_GER = overall_keepers_dataset[(overall_keepers_dataset.Comp == "Bundesliga")]
overall_keepers_dataset_ITA = overall_keepers_dataset[(overall_keepers_dataset.Comp == "Serie A")]
overall_keepers_dataset_FRA = overall_keepers_dataset[(overall_keepers_dataset.Comp == "Ligue 1")]

# Data Cleaning:
# Keeping only players with 5+ starts
overall_keepers_dataset_filtered_ENG = overall_keepers_dataset_ENG[(overall_keepers_dataset_ENG.Starts_Playing >= 5)]
overall_keepers_dataset_filtered_ESP = overall_keepers_dataset_ESP[(overall_keepers_dataset_ESP.Starts_Playing >= 5)]
overall_keepers_dataset_filtered_GER = overall_keepers_dataset_GER[(overall_keepers_dataset_GER.Starts_Playing >= 5)]
overall_keepers_dataset_filtered_ITA = overall_keepers_dataset_ITA[(overall_keepers_dataset_ITA.Starts_Playing >= 5)]
overall_keepers_dataset_filtered_FRA = overall_keepers_dataset_FRA[(overall_keepers_dataset_FRA.Starts_Playing >= 5)]

# Removing NaN observations
overall_keepers_dataset_filtered_ENG = overall_keepers_dataset_filtered_ENG.dropna(subset=keepers_columns).reset_index()
overall_keepers_dataset_filtered_ESP = overall_keepers_dataset_filtered_ESP.dropna(subset=keepers_columns).reset_index()
overall_keepers_dataset_filtered_GER = overall_keepers_dataset_filtered_GER.dropna(subset=keepers_columns).reset_index()
overall_keepers_dataset_filtered_ITA = overall_keepers_dataset_filtered_ITA.dropna(subset=keepers_columns).reset_index()
overall_keepers_dataset_filtered_FRA = overall_keepers_dataset_filtered_FRA.dropna(subset=keepers_columns).reset_index()

overall_keepers_dataset_modified_ENG = overall_keepers_dataset_filtered_ENG[keepers_columns].copy()
overall_keepers_dataset_modified_ESP = overall_keepers_dataset_filtered_ESP[keepers_columns].copy()
overall_keepers_dataset_modified_GER = overall_keepers_dataset_filtered_GER[keepers_columns].copy()
overall_keepers_dataset_modified_ITA = overall_keepers_dataset_filtered_ITA[keepers_columns].copy()
overall_keepers_dataset_modified_FRA = overall_keepers_dataset_filtered_FRA[keepers_columns].copy()

In [54]:
# Stats where lower is better (e.g. goals allowed) are reversed via the direction column
overall_keepers_dataset_modified_ENG = overall_keepers_dataset_modified_ENG * keepers_stat_direction
overall_keepers_dataset_modified_ESP = overall_keepers_dataset_modified_ESP * keepers_stat_direction
overall_keepers_dataset_modified_GER = overall_keepers_dataset_modified_GER * keepers_stat_direction
overall_keepers_dataset_modified_ITA = overall_keepers_dataset_modified_ITA * keepers_stat_direction
overall_keepers_dataset_modified_FRA = overall_keepers_dataset_modified_FRA * keepers_stat_direction

# Scaling our dataset
overall_keepers_dataset_scaled_ENG = StandardScaler().fit_transform(overall_keepers_dataset_modified_ENG)
overall_keepers_dataset_scaled_ESP = StandardScaler().fit_transform(overall_keepers_dataset_modified_ESP)
overall_keepers_dataset_scaled_GER = StandardScaler().fit_transform(overall_keepers_dataset_modified_GER)
overall_keepers_dataset_scaled_ITA = StandardScaler().fit_transform(overall_keepers_dataset_modified_ITA)
overall_keepers_dataset_scaled_FRA = StandardScaler().fit_transform(overall_keepers_dataset_modified_FRA)

# Weights of each stat - Remove if unnecessary
overall_keepers_dataset_weighted_ENG = overall_keepers_dataset_scaled_ENG * keepers_weights
overall_keepers_dataset_weighted_ESP = overall_keepers_dataset_scaled_ESP * keepers_weights
overall_keepers_dataset_weighted_GER = overall_keepers_dataset_scaled_GER * keepers_weights
overall_keepers_dataset_weighted_ITA = overall_keepers_dataset_scaled_ITA * keepers_weights
overall_keepers_dataset_weighted_FRA = overall_keepers_dataset_scaled_FRA * keepers_weights

In [55]:
# PCA decomposition
pca = PCA(n_components=1)

pca_features_ENG = pca.fit_transform(overall_keepers_dataset_weighted_ENG)
pca_features_ESP = pca.fit_transform(overall_keepers_dataset_weighted_ESP)
pca_features_GER = pca.fit_transform(overall_keepers_dataset_weighted_GER)
pca_features_ITA = pca.fit_transform(overall_keepers_dataset_weighted_ITA)
pca_features_FRA = pca.fit_transform(overall_keepers_dataset_weighted_FRA)


war_ENG = pd.DataFrame(pca_features_ENG)
war_ESP = pd.DataFrame(pca_features_ESP)
war_GER = pd.DataFrame(pca_features_GER)
war_ITA = pd.DataFrame(pca_features_ITA)
war_FRA = pd.DataFrame(pca_features_FRA)

war_ENG.columns = ["calculated_war"]
war_ESP.columns = ["calculated_war"]
war_GER.columns = ["calculated_war"]
war_ITA.columns = ["calculated_war"]
war_FRA.columns = ["calculated_war"]

war_ENG["calculated_war"] = war_ENG["calculated_war"] * -1
war_ESP["calculated_war"] = war_ESP["calculated_war"] * -1
war_GER["calculated_war"] = war_GER["calculated_war"] * -1
war_ITA["calculated_war"] = war_ITA["calculated_war"] * -1
war_FRA["calculated_war"] = war_FRA["calculated_war"] * -1


overall_keepers_dataset_filtered_ENG.insert(overall_keepers_dataset_filtered_ENG.columns.get_loc("unique_ID") + 1, "calculated_war", war_ENG["calculated_war"])
overall_keepers_dataset_filtered_ESP.insert(overall_keepers_dataset_filtered_ESP.columns.get_loc("unique_ID") + 1, "calculated_war", war_ESP["calculated_war"])
overall_keepers_dataset_filtered_GER.insert(overall_keepers_dataset_filtered_GER.columns.get_loc("unique_ID") + 1, "calculated_war", war_GER["calculated_war"])
overall_keepers_dataset_filtered_ITA.insert(overall_keepers_dataset_filtered_ITA.columns.get_loc("unique_ID") + 1, "calculated_war", war_ITA["calculated_war"])
overall_keepers_dataset_filtered_FRA.insert(overall_keepers_dataset_filtered_FRA.columns.get_loc("unique_ID") + 1, "calculated_war", war_FRA["calculated_war"])


In [56]:
# Percentile Calculations - rough approximation
sorted_war_ENG = np.sort(war_ENG["calculated_war"])
sorted_war_ESP = np.sort(war_ESP["calculated_war"])
sorted_war_GER = np.sort(war_GER["calculated_war"])
sorted_war_ITA = np.sort(war_ITA["calculated_war"])
sorted_war_FRA = np.sort(war_FRA["calculated_war"])

percentile_ranks_ENG = (sorted_war_ENG < war_ENG["calculated_war"].values[:,None]).mean(axis=1)
percentile_ranks_ESP = (sorted_war_ESP < war_ESP["calculated_war"].values[:,None]).mean(axis=1)
percentile_ranks_GER = (sorted_war_GER < war_GER["calculated_war"].values[:,None]).mean(axis=1)
percentile_ranks_ITA = (sorted_war_ITA < war_ITA["calculated_war"].values[:,None]).mean(axis=1)
percentile_ranks_FRA = (sorted_war_FRA < war_FRA["calculated_war"].values[:,None]).mean(axis=1)

percentile_ranks_ENG = np.round(percentile_ranks_ENG, decimals=4) * 100
percentile_ranks_ESP = np.round(percentile_ranks_ESP, decimals=4) * 100
percentile_ranks_GER = np.round(percentile_ranks_GER, decimals=4) * 100
percentile_ranks_ITA = np.round(percentile_ranks_ITA, decimals=4) * 100
percentile_ranks_FRA = np.round(percentile_ranks_FRA, decimals=4) * 100

overall_keepers_dataset_filtered_ENG.insert(overall_keepers_dataset_filtered_ENG.columns.get_loc("unique_ID") + 2, "percentile", percentile_ranks_ENG)
overall_keepers_dataset_filtered_ESP.insert(overall_keepers_dataset_filtered_ESP.columns.get_loc("unique_ID") + 2, "percentile", percentile_ranks_ESP)
overall_keepers_dataset_filtered_GER.insert(overall_keepers_dataset_filtered_GER.columns.get_loc("unique_ID") + 2, "percentile", percentile_ranks_GER)
overall_keepers_dataset_filtered_ITA.insert(overall_keepers_dataset_filtered_ITA.columns.get_loc("unique_ID") + 2, "percentile", percentile_ranks_ITA)
overall_keepers_dataset_filtered_FRA.insert(overall_keepers_dataset_filtered_FRA.columns.get_loc("unique_ID") + 2, "percentile", percentile_ranks_FRA)

In [57]:
if os.path.exists("./output/PCA/overall_keepers_war_ENG.csv"):
    os.remove("./output/PCA/overall_keepers_war_ENG.csv")
if os.path.exists("./output/PCA/overall_keepers_war_ESP.csv"):
    os.remove("./output/PCA/overall_keepers_war_ESP.csv")
if os.path.exists("./output/PCA/overall_keepers_war_GER.csv"):
    os.remove("./output/PCA/overall_keepers_war_GER.csv")
if os.path.exists("./output/PCA/overall_keepers_war_ITA.csv"):
    os.remove("./output/PCA/overall_keepers_war_ITA.csv")
if os.path.exists("./output/PCA/overall_keepers_war_FRA.csv"):
    os.remove("./output/PCA/overall_keepers_war_FRA.csv")

# Write each league's own frame (not the overall frame) to its file
overall_keepers_dataset_filtered_ENG.to_csv("./output/PCA/overall_keepers_war_ENG.csv")
overall_keepers_dataset_filtered_ESP.to_csv("./output/PCA/overall_keepers_war_ESP.csv")
overall_keepers_dataset_filtered_GER.to_csv("./output/PCA/overall_keepers_war_GER.csv")
overall_keepers_dataset_filtered_ITA.to_csv("./output/PCA/overall_keepers_war_ITA.csv")
overall_keepers_dataset_filtered_FRA.to_csv("./output/PCA/overall_keepers_war_FRA.csv")

The fit of each model can be adjusted in future by tweaking how WAR is derived from the inputs (e.g. placing a higher weight on starters, or removing non-starters to reduce noise).
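
As one possible reading of "higher weight on starters", the first principal axis can be computed from a sample-weighted covariance matrix, so that players with more starts influence the axis more. The sketch below is a manual NumPy version under that assumption. `weighted_first_component` is a hypothetical helper, the data is synthetic, and this is not something the notebook currently does.

```python
import numpy as np


def weighted_first_component(X, sample_weights):
    """Scores on the first principal axis of a sample-weighted covariance matrix."""
    w = np.asarray(sample_weights, dtype=float)
    centered = X - np.average(X, axis=0, weights=w)   # weighted centring
    cov = np.cov(centered, rowvar=False, aweights=w)  # weighted covariance
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    axis = eigvecs[:, -1]                             # direction of largest weighted variance
    return centered @ axis


rng = np.random.default_rng(3)
X = rng.normal(size=(30, 4))      # synthetic, already-scaled stats
starts = rng.integers(5, 38, 30)  # synthetic number of starts per keeper
scores = weighted_first_component(X, starts)
```

As with the unweighted model, the sign of the resulting axis is arbitrary and needs to be checked before the scores are read as "higher is better".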