This model generates a wins above replacement (WAR) statistic for every player.

We take a rudimentary approach:
1. Partition the data so that a separate model is trained for each position and league combination.
2. Use principal component analysis (PCA) to reduce the features of each dataset to a single WAR statistic.
3. Fit each model and generate a WAR statistic for every player in its partition.
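
The three steps above can be sketched on synthetic data as follows. This is a minimal illustration rather than the project's code: the column names `position`, `league`, `stat_a`, `stat_b`, `stat_c` and the helper `war_for_group` are hypothetical, and the data is random.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; every column name here is hypothetical.
rng = np.random.default_rng(0)
n = 40
players = pd.DataFrame({
    "position": np.repeat(["GK", "DF"], n // 2),
    "league": np.tile(["League A", "League B"], n // 2),
    "stat_a": rng.normal(size=n),
    "stat_b": rng.normal(size=n),
    "stat_c": rng.normal(size=n),
})
stat_columns = ["stat_a", "stat_b", "stat_c"]


def war_for_group(group: pd.DataFrame) -> pd.Series:
    """Standardize one partition's stats and project them onto the first principal component."""
    scaled = StandardScaler().fit_transform(group[stat_columns])
    scores = PCA(n_components=1).fit_transform(scaled)
    return pd.Series(scores[:, 0], index=group.index)


# Step 1: partition by position and league.
# Steps 2-3: fit one PCA per partition and keep one score per player.
parts = [war_for_group(g) for _, g in players.groupby(["position", "league"])]
players["calculated_war"] = pd.concat(parts)
```

Because each partition is fitted separately, scores are only comparable within a partition, not across positions or leagues.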

We start with the keepers, since they have a smaller dataset and far fewer features (far fewer per90 statistics).

We have adapted the "keepers" dataset to contain only the relevant per90 statistics (no counting stats) and removed keepers with fewer than 5 starts to prevent extreme outliers.

Definitions for the statistics can be found [here](https://fbref.com/en/comps/Big5/keepersadv/players/Big-5-European-Leagues-Stats).

In [37]:
# Basic Setup
import os
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

os.chdir("C:/Users/tobyt/Desktop/Coding/Personal/wins-above-replacement-soccer/WAR-based-soccer-valuation")

In [91]:
# Earlier issue: the WAR stat did not make sense; the suspected cause was mixing
# higher-is-better stats (save%) with lower-is-better stats (goals allowed/90).
# UPDATE: fixed by reversing the lower-is-better stats and weighting each stat.

overall_keepers_dataset = pd.read_csv("./model/keepers_final_vF.csv")
weights_df = pd.read_csv("./model/weights/keepers-stats-weights.csv")

keepers_columns = weights_df["stat"].to_list()
keepers_weights = weights_df["weight"].to_numpy()
keepers_stat_direction = weights_df["direction"].to_numpy()

# Keep only players with 5 or more starts
overall_keepers_dataset_filtered = overall_keepers_dataset[(overall_keepers_dataset.Starts_Playing >= 5)]

# Drop rows with missing values in any stat column, then reset the index
# so that the WAR column inserted later lines up row by row
overall_keepers_dataset_filtered = overall_keepers_dataset_filtered.dropna(subset=keepers_columns).reset_index()

overall_keepers_dataset_modified = overall_keepers_dataset_filtered[keepers_columns].copy()
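
To make the role of the weights file concrete, here is a small sketch of the same selection logic on invented data. Only the column names `stat`, `weight`, `direction`, `Starts_Playing` and `unique_ID` come from the notebook; the stat names, the values, and the assumption that `direction` holds +1 or -1 are hypothetical.

```python
import io

import pandas as pd

# Hypothetical contents of a weights file; only the header names come from the notebook.
weights_csv = io.StringIO(
    "stat,weight,direction\n"
    "save_pct,1.0,1\n"
    "goals_against_per90,1.0,-1\n"
)
weights_df = pd.read_csv(weights_csv)

# Hypothetical keeper rows: "b" has too few starts and "c" has a missing stat.
keepers = pd.DataFrame({
    "unique_ID": ["a", "b", "c", "d"],
    "Starts_Playing": [30, 3, 12, 20],
    "save_pct": [72.0, 80.0, None, 68.5],
    "goals_against_per90": [1.1, 0.5, 1.4, 1.3],
})

stat_columns = weights_df["stat"].to_list()

# Same two filters as above: a minimum number of starts, then no missing stats.
kept = keepers[keepers["Starts_Playing"] >= 5]
kept = kept.dropna(subset=stat_columns).reset_index(drop=True)

# Only the stat columns go into the model; the rest is kept for reporting.
model_input = kept[stat_columns].copy()
```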

In [92]:
# Multiply each stat by its direction so that lower-is-better stats
# (e.g. goals allowed) are reversed and higher always means better
overall_keepers_dataset_modified = overall_keepers_dataset_modified * keepers_stat_direction

# Scaling our dataset
overall_keepers_dataset_scaled = StandardScaler().fit_transform(overall_keepers_dataset_modified)

# Apply the per-stat weights (remove this step if weighting is unnecessary)
overall_keepers_dataset_weighted = overall_keepers_dataset_scaled * keepers_weights
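
The order of operations matters here. `StandardScaler` gives every column unit variance, so multiplying a column by a weight w afterwards gives it variance w squared, and PCA, which looks for the direction of greatest variance, leans toward the more heavily weighted columns. The sketch below checks this on made-up numbers; the two stats, their weights, and the +1/-1 direction encoding are assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Two hypothetical stats: column 0 is higher-is-better, column 1 is lower-is-better.
raw = np.column_stack([
    rng.normal(70.0, 5.0, size=50),
    rng.normal(1.2, 0.3, size=50),
])
direction = np.array([1.0, -1.0])  # assumed encoding: +1 keep, -1 reverse
weights = np.array([2.0, 0.5])     # made-up weights

oriented = raw * direction                         # broadcasts across columns
scaled = StandardScaler().fit_transform(oriented)  # zero mean, unit variance
weighted = scaled * weights

# After weighting, each column's variance equals its weight squared.
column_variances = weighted.var(axis=0)
```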

In [93]:
# PCA decomposition
pca = PCA(n_components=1)
pca_features = pca.fit_transform(overall_keepers_dataset_weighted)

war = pd.DataFrame(pca_features)
war.columns = ["calculated_war"]
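# The sign of a principal component is arbitrary; flip it so the score points in the intended direction
war["calculated_war"] = war["calculated_war"] * -1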

overall_keepers_dataset_filtered.insert(overall_keepers_dataset_filtered.columns.get_loc("unique_ID") + 1, "calculated_war", war["calculated_war"])
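
Since the sign of a principal component carries no meaning by itself, whether the `* -1` is needed can change when the data or weights change. One possible alternative, sketched here on synthetic data, is to orient the score so that it correlates positively with the row sums of the direction-adjusted stats. This is a suggestion that assumes the stats already point in the higher-is-better direction, not what the notebook does; `orient_scores` is a hypothetical helper.

```python
import numpy as np
from sklearn.decomposition import PCA


def orient_scores(features: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """Flip the scores if they correlate negatively with the row sums of the features."""
    reference = features.sum(axis=1)
    correlation = np.corrcoef(reference, scores)[0, 1]
    return -scores if correlation < 0 else scores


rng = np.random.default_rng(2)

# Synthetic features driven by one shared "skill" factor plus noise.
skill = rng.normal(size=100)
features = np.column_stack([skill + 0.3 * rng.normal(size=100) for _ in range(3)])

pca = PCA(n_components=1)
scores = pca.fit_transform(features)[:, 0]
war = orient_scores(features, scores)

# Share of the total variance captured by the single component.
explained = pca.explained_variance_ratio_[0]
```

Checking `explained_variance_ratio_` is also a quick way to see how much information a single component keeps.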


In [94]:
# to_csv overwrites an existing file, so no explicit removal is needed
overall_keepers_dataset_filtered.to_csv("./output/PCA/overall_keepers_war.csv")


In future, the fit of each model can be adjusted by changing how its inputs contribute to WAR (e.g. weighting starters more heavily, or removing non-starters that interfere with the fit).
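
As a rough sketch of one such adjustment, the scaler and PCA can be fitted on starters only and then used to score every player, so that players with few starts do not shape the component. The data, the 15-start threshold, and the stat column names are hypothetical; only `Starts_Playing` and `calculated_war` come from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Synthetic keepers with between 5 and 38 starts.
keepers = pd.DataFrame({
    "Starts_Playing": rng.integers(5, 39, size=60),
    "stat_a": rng.normal(size=60),
    "stat_b": rng.normal(size=60),
})
stat_columns = ["stat_a", "stat_b"]
starters = keepers[keepers["Starts_Playing"] >= 15]  # hypothetical threshold

# Fit on starters only so that low-start players do not influence the component.
scaler = StandardScaler().fit(starters[stat_columns])
pca = PCA(n_components=1).fit(scaler.transform(starters[stat_columns]))

# Score every player, starter or not, with the fitted transforms.
all_scaled = scaler.transform(keepers[stat_columns])
keepers["calculated_war"] = pca.transform(all_scaled)[:, 0]
```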