This model generates a wins above replacement (WAR) statistic for every player.

We can use a rudimentary approach:
1. Partition the data so that a separate model is trained for each position and league.
2. Apply principal component analysis (PCA) to reduce the features of each dataset to a single WAR statistic.
3. Train each model to generate a WAR statistic for every player.
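
The three steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the notebook's actual pipeline: the column names `Pos`, `stat_a` and `stat_b`, and the helper `first_component_scores`, are assumptions made for the example (`Comp` mirrors the league column dropped later in the notebook).

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def first_component_scores(group: pd.DataFrame, stat_cols: list[str]) -> pd.Series:
    """Scale the stat columns and project them onto the first principal component."""
    scaled = StandardScaler().fit_transform(group[stat_cols])
    scores = PCA(n_components=1).fit_transform(scaled)[:, 0]
    return pd.Series(scores, index=group.index, name="calculated_war")


# Synthetic stand-in data; the column names are illustrative only.
rng = np.random.default_rng(0)
players = pd.DataFrame({
    "Pos": ["GK"] * 6 + ["DF"] * 6,
    "Comp": (["Premier League"] * 3 + ["La Liga"] * 3) * 2,
    "stat_a": rng.normal(size=12),
    "stat_b": rng.normal(size=12),
})

# Step 1: partition by position and league.
# Steps 2-3: scale, reduce to one component, and use it as the score.
parts = [
    first_component_scores(group, ["stat_a", "stat_b"])
    for _, group in players.groupby(["Pos", "Comp"])
]
players["calculated_war"] = pd.concat(parts)
```

Because each partition is scaled and decomposed separately, scores from different positions or leagues are not directly comparable in this sketch.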

We can start with the keepers: they have a smaller dataset and far fewer features (far fewer per90 data types).

We have adapted the "keepers" dataset to contain only relevant per90 statistics (no counting stats) and removed keepers with fewer than 5 games to prevent extreme outliers.
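
As an illustration of that preparation step, the filter might look something like the sketch below. The table contents and the `Player`, `MP` (matches played), `GA` and `Save%` column names are assumptions for the example; only `GA90` appears in the notebook itself.

```python
import pandas as pd

# Hypothetical raw keeper table; the schema is assumed for illustration.
raw_keepers = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "MP": [30, 4, 12, 1],
    "GA": [28, 9, 15, 3],              # counting stat, dropped below
    "GA90": [0.93, 2.25, 1.25, 3.00],
    "Save%": [74.1, 55.0, 69.8, 40.0],
})

MIN_GAMES = 5
per90_cols = ["GA90", "Save%"]

# Keep keepers with at least MIN_GAMES appearances, then keep only rate stats.
keepers = raw_keepers.loc[raw_keepers["MP"] >= MIN_GAMES, ["Player"] + per90_cols]
keepers = keepers.reset_index(drop=True)
```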

Definitions for statistics can be found [here](https://fbref.com/en/comps/Big5/keepersadv/players/Big-5-European-Leagues-Stats)

In [2]:
# Basic Setup
import os
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

In [28]:
# TODO: WAR stat does not make sense - suspected that it's mixing good (save%) and bad (goals allowed/90) together
# UPDATE: reversed and weighted statistics but still doesn't work - needs investigation
# UPDATE2: ATM best players are at bottom of list - NOT SURE WHY

# TODO: need to better arrange data pipeline - ALL data should result in same file as calculated war 
# Only two files needed - final + weights (stat + weight) - keep all columns with stat names and drop all else

overall_keepers_dataset = pd.read_csv("C:/Users/Toby Chiu/Desktop/Coding/Personal Projects/Personal/wins-above-rep-soccer/model/keepers_final_vPCA.csv")
overall_keepers_dataset_modified = overall_keepers_dataset.drop(columns=["unique_ID", "Comp", "Unnamed: 0"])

In [29]:
# Goals allowed needs to be reversed - lower number is better
overall_keepers_dataset_modified["GA90"] = overall_keepers_dataset_modified["GA90"]*-1

# Scaling our dataset
overall_keepers_dataset_scaled = StandardScaler().fit_transform(overall_keepers_dataset_modified)

# Weights of each stat - remove if unnecessary
weights_df = pd.read_csv("C:/Users/Toby Chiu/Desktop/Coding/Personal Projects/Personal/wins-above-rep-soccer/model/weights/keepers-stats-weights.csv")
keepers_weights = weights_df["weight"].to_numpy()
overall_keepers_dataset_weighted = overall_keepers_dataset_scaled * keepers_weights
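
The multiplication above relies on the rows of the weights file being in the same order as the dataset's columns. One way to guard against a silent mismatch is to align the weights by stat name first. The sketch below assumes the weights file has a `stat` column next to `weight`; the data and the `PSxG+/-` column are made up for the example.

```python
import numpy as np
import pandas as pd

# Stand-in for the scaled stats, with named columns.
stats = pd.DataFrame({
    "GA90": [-1.0, 0.5],
    "Save%": [0.2, -0.3],
    "PSxG+/-": [1.0, -1.0],
})

# Hypothetical weights file contents, deliberately in a different row order.
weights_df = pd.DataFrame({
    "stat": ["Save%", "PSxG+/-", "GA90"],
    "weight": [2.0, 1.5, 1.0],
})

# Reorder the weights to match the dataset's column order.
weights = weights_df.set_index("stat")["weight"].reindex(stats.columns)
if weights.isna().any():
    missing = list(weights[weights.isna()].index)
    raise ValueError(f"No weight found for: {missing}")

# Broadcasting multiplies each column by its own weight.
weighted = stats.to_numpy() * weights.to_numpy()
```

Since the columns are standardized before weighting, a weight of `w` scales that column's variance by `w**2`, which is what lets it pull the first principal component towards that stat.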

In [30]:
# PCA decomposition
pca = PCA(n_components=1)
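# Fit on the weighted matrix; fitting on overall_keepers_dataset_scaled would leave the weights unused
pca_features = pca.fit_transform(overall_keepers_dataset_weighted)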

war = pd.DataFrame(pca_features)
war.columns = ["calculated_war"]
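# The sign of a principal component is arbitrary, so this flip may or may not be needed for a given fit
war["calculated_war"] = war["calculated_war"] * -1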
overall_keepers_output = overall_keepers_dataset.join(war)
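
Since the sign of a principal component carries no meaning of its own, a fixed `* -1` can put the best players at either end of the list. One possible check, sketched here on synthetic data, is to orient the component so that it correlates positively with a reference stat where higher is clearly better. The variables `quality`, `save_pct` and `neg_ga90` are invented for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
quality = rng.normal(size=50)                     # hidden "true" keeper quality
save_pct = quality + 0.1 * rng.normal(size=50)    # higher is better
neg_ga90 = quality + 0.1 * rng.normal(size=50)    # GA90 already multiplied by -1
X = StandardScaler().fit_transform(np.column_stack([save_pct, neg_ga90]))

pca = PCA(n_components=1)
scores = pca.fit_transform(X)[:, 0]

# Orient the component so it rises with the reference stat (column 0 here).
if np.corrcoef(scores, X[:, 0])[0, 1] < 0:
    scores = -scores

# Share of total variance captured by the single component.
print(pca.explained_variance_ratio_)
```

A low explained variance ratio would suggest that one component is a poor summary of the stats, which is worth checking before reading the scores as WAR.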

In [31]:
# to_csv overwrites an existing file, so no explicit removal is needed
overall_keepers_output.to_csv("C:/Users/Toby Chiu/Desktop/Coding/Personal Projects/Personal/wins-above-rep-soccer/output/PCA/overall_keepers_war.csv")


We need to adjust the fit for every model by tweaking how it values WAR based on its inputs (e.g. putting a higher weight on starters, or removing non-starters so that they do not distort the fit).
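
One way to keep non-starters from distorting the fit, while still scoring them, is to learn the scaling and the component from starters only and then apply that fitted transform to everyone. The sketch below is one possible approach rather than a settled design; the `MP` column, the threshold of 20 games and the synthetic stats are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
keepers = pd.DataFrame({
    "MP": np.tile([6, 12, 25, 34], 10),   # hypothetical matches-played column
    "stat_a": rng.normal(size=40),
    "stat_b": rng.normal(size=40),
})
stat_cols = ["stat_a", "stat_b"]
STARTER_MIN_GAMES = 20                    # hypothetical threshold

starters = keepers[keepers["MP"] >= STARTER_MIN_GAMES]

# Learn the scaling and the component from starters only...
scaler = StandardScaler().fit(starters[stat_cols])
pca = PCA(n_components=1).fit(scaler.transform(starters[stat_cols]))

# ...then score every keeper with that fitted transform.
keepers["calculated_war"] = pca.transform(scaler.transform(keepers[stat_cols]))[:, 0]
```

With this setup the starters' scores are centered on zero, and non-starters are placed relative to that baseline.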