<a href="https://colab.research.google.com/github/rishicarter/MScProject_SOTON/blob/main/Preprocess_to_SPADL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AIM
---

1. Download the Wyscout dataset and preprocess the relevant data.
2. Value game states by training predictive machine learning models.
  * Compute descriptive features for each game state.
  * Obtain labels for each game state (i.e., was a goal scored within the next ten actions? Was a goal conceded within the next ten actions?).
3. Value on-the-ball actions by using the trained predictive machine learning models.
4. Rate players by aggregating the values of their on-the-ball actions.
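
Steps 2 to 4 can be illustrated with a small, self-contained sketch. It is not the socceraction implementation: `label_actions`, `action_value`, and the toy data are hypothetical, the look-ahead window is assumed to include the current action, and own goals and other special cases are ignored.

```python
def label_actions(actions, horizon=10):
    """Label each action with a (scores, concedes) pair.

    actions: list of (team_id, is_goal) tuples in chronological order.
    Assumption: the window covers the current action plus the following
    horizon - 1 actions; own goals are not handled.
    """
    labels = []
    for i, (team, _) in enumerate(actions):
        window = actions[i:i + horizon]
        scores = any(is_goal and t == team for t, is_goal in window)
        concedes = any(is_goal and t != team for t, is_goal in window)
        labels.append((scores, concedes))
    return labels


def action_value(p_scores_before, p_concedes_before, p_scores_after, p_concedes_after):
    """Change in scoring probability minus change in conceding probability."""
    return (p_scores_after - p_scores_before) - (p_concedes_after - p_concedes_before)


# Toy sequence: team A scores with the fourth action.
toy_actions = [("A", False), ("A", False), ("B", False), ("A", True), ("B", False)]
toy_labels = label_actions(toy_actions, horizon=3)

# Toy probabilities: the action raises the scoring chance from 0.02 to 0.05.
toy_value = action_value(0.02, 0.01, 0.05, 0.01)
```

A player's rating could then be obtained by aggregating these values over all of the player's actions, for example by summing them.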


# Imports and requirements

In [None]:
!pip install tables==3.6.1
!pip install socceraction==0.2.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tables==3.6.1
  Downloading tables-3.6.1-cp37-cp37m-manylinux1_x86_64.whl (4.3 MB)
     |████████████████████████████████| 4.3 MB 5.3 MB/s
Installing collected packages: tables
  Attempting uninstall: tables
    Found existing installation: tables 3.7.0
    Uninstalling tables-3.7.0:
      Successfully uninstalled tables-3.7.0
Successfully installed tables-3.6.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting socceraction==0.2.0
  Downloading socceraction-0.2.0.tar.gz (28 kB)
Collecting unidecode
  Downloading Unidecode-1.3.4-py3-none-any.whl (235 kB)
     |████████████████████████████████| 235 kB 8.3 MB/s
Building wheels for collected packages: socceraction
  Building wheel for socceraction (setup.py) ... done
  Created wheel for socceraction: filename=socceraction-0.2.0-py3-none-any.whl size=306

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')
# %mkdir -p '/content/gdrive/MyDrive/MSC_Project/FOT_VAEP/'
%cd '/content/gdrive/MyDrive/MSC_Project/FOT_VAEP/'
%pwd

Mounted at /content/gdrive
/content/gdrive/MyDrive/MSC_Project/FOT_VAEP


'/content/gdrive/MyDrive/MSC_Project/FOT_VAEP'

In [None]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

import warnings
from io import BytesIO
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen, urlretrieve
from zipfile import ZipFile, is_zipfile

import socceraction.vaep.features as features
import socceraction.vaep.labels as labels
from sklearn.metrics import brier_score_loss, roc_auc_score
from socceraction.spadl.wyscout import convert_to_spadl
from socceraction.vaep.formula import value
from tqdm import tqdm
from xgboost import XGBClassifier

In [None]:
# Silence pandas PerformanceWarning messages raised while writing to HDF5
warnings.filterwarnings('ignore', category=pd.errors.PerformanceWarning)

In [None]:
# Functions
def read_json_file(filename):
  '''
  Read a JSON file and return its content as a string.

  The raw bytes are decoded with the 'unicode_escape' codec so that escaped
  special characters (e.g., accents in the names of players and teams), which
  the pd.read_json function cannot handle properly, become real characters.
  '''
  with open(filename, 'rb') as json_file:
    return json_file.read().decode('unicode_escape')
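
To make the effect of the `unicode_escape` decoding concrete, here is a minimal sketch on a made-up byte string (the record and its `name` field are hypothetical, not taken from the dataset). It also shows one caveat worth keeping in mind: the codec interprets raw non-ASCII bytes as Latin-1, so text that is already stored as UTF-8 rather than as `\uXXXX` escapes would be mis-decoded.

```python
import json

# Bytes as they might appear in a file: the accent is stored as a \u00e9 escape.
raw_bytes = b'{"name": "Jos\\u00e9"}'

# Decoding with 'unicode_escape' turns the escape sequence into a real character.
decoded = raw_bytes.decode('unicode_escape')
record = json.loads(decoded)

# Caveat: raw UTF-8 bytes are read as Latin-1, which garbles the accent.
utf8_bytes = '\u00e9'.encode('utf-8')
mis_decoded = utf8_bytes.decode('unicode_escape')
```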

# Data download and preprocessing
---
1. Download the Wyscout dataset;
2. Construct an HDF5 file named wyscout.h5 that contains the relevant information from the dataset;
3. Convert the wyscout.h5 file into a spadl.h5 file that contains the same information in the SPADL representation.



## Download the Wyscout dataset

In [None]:
data_files = {
    'events': 'https://ndownloader.figshare.com/files/14464685',  # ZIP file containing one JSON file for each competition
    'matches': 'https://ndownloader.figshare.com/files/14464622',  # ZIP file containing one JSON file for each competition
    'players': 'https://ndownloader.figshare.com/files/15073721',  # JSON file
    'teams': 'https://ndownloader.figshare.com/files/15073697'  # JSON file
}

In [None]:
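# Download each data file and extract it if it is a ZIP archive.
# The code is commented out; uncomment it to fetch the data into the working directory.
# for url in data_files.values():
#   url_s3 = urlopen(url).geturl()  # follow the redirect to the actual file location
#   path = Path(urlparse(url_s3).path)
#   file_name = path.name
#   file_local, _ = urlretrieve(url_s3, file_name)
#   if is_zipfile(file_local):
#     with ZipFile(file_local) as zip_file:
#       zip_file.extractall()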
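
The extraction step can be tried without any network access. The sketch below is a hypothetical helper (`extract_if_zip` and the toy file names are assumptions, not part of the notebook) that applies the same `is_zipfile` check and `extractall` call to files created in a temporary directory.

```python
import tempfile
from pathlib import Path
from zipfile import ZipFile, is_zipfile


def extract_if_zip(file_path, target_dir):
    """Extract file_path into target_dir if it is a ZIP archive.

    Returns the list of extracted member names (empty for non-ZIP files).
    """
    if not is_zipfile(file_path):
        return []
    with ZipFile(file_path) as zip_file:
        zip_file.extractall(target_dir)
        return zip_file.namelist()


with tempfile.TemporaryDirectory() as tmp:
    # Toy archive standing in for a downloaded ZIP file.
    archive = Path(tmp) / 'events.zip'
    with ZipFile(archive, 'w') as zf:
        zf.writestr('events_Toy.json', '[]')

    # Toy plain JSON file, which should be left untouched.
    plain = Path(tmp) / 'teams.json'
    plain.write_text('[]')

    extracted = extract_if_zip(archive, tmp)
    skipped = extract_if_zip(plain, tmp)
    extracted_exists = (Path(tmp) / 'events_Toy.json').exists()
```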

## Preprocess data

### Teams

In [None]:
json_teams = read_json_file('teams.json')
df_teams = pd.read_json(json_teams)
df_teams.head(5)

In [None]:
df_teams.to_hdf('wyscout.h5', key='teams', mode='w')

### Players

In [None]:
json_players=read_json_file('players.json')
df_players=pd.read_json(json_players)
df_players.head(5)

In [None]:
df_players.to_hdf('wyscout.h5', key='players', mode='a')

### Matches

In [None]:
matches=[]
for x in tqdm(os.listdir()):
  if x.startswith('matches_'):
    json_matches=read_json_file(x)
    df_matches=pd.read_json(json_matches)
    matches.append(df_matches)
df_matches=pd.concat(matches)

100%|██████████| 19/19 [00:01<00:00, 18.15it/s]


In [None]:
df_matches.columns

In [None]:
df_matches.to_hdf('wyscout.h5', key='matches', mode='a')

### Events

In [None]:
for x in tqdm(os.listdir()):
  # Only process the per-competition event files
  if x.startswith('events_'):
    json_events = read_json_file(x)
    df_events = pd.read_json(json_events)
    # Split the competition's events by match and store each match under its own key
    df_events_matches = df_events.groupby('matchId', as_index=False)
    for match_id, df_events_match in df_events_matches:
      df_events_match.to_hdf('wyscout.h5', key=f'events/match_{match_id}', mode='a')

100%|██████████| 19/19 [16:02<00:00, 50.64s/it]
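
The loop above stores the events of every match under a separate key of the form `events/match_<matchId>`. The following standard-library sketch mirrors that grouping on a few toy records to show the resulting key layout; `group_events_by_match` and the toy events are hypothetical and only stand in for the pandas `groupby` used in the notebook.

```python
from collections import defaultdict


def group_events_by_match(events):
    """Group event dicts by 'matchId' under keys shaped like the HDF5 keys."""
    grouped = defaultdict(list)
    for event in events:
        grouped[f"events/match_{event['matchId']}"].append(event)
    return dict(grouped)


# Toy events from two matches (illustrative values only).
toy_events = [
    {'matchId': 1, 'eventName': 'Pass'},
    {'matchId': 2, 'eventName': 'Shot'},
    {'matchId': 1, 'eventName': 'Duel'},
]
events_by_key = group_events_by_match(toy_events)
```

With the real file, the events of a single match could then be read back with `pd.read_hdf('wyscout.h5', key=...)`, passing the key of that match.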


### Convert Wyscout data to SPADL representation

In [None]:
convert_to_spadl('wyscout.h5', 'spadl.h5')

...Inserting actiontypes
...Inserting bodyparts
...Inserting results
...Converting games
...Converting players
...Converting teams
...Generating player_games


100%|██████████| 1941/1941 [04:14<00:00,  7.62game/s]


...Converting events to actions


100%|██████████| 1941/1941 [32:30<00:00,  1.01s/game]
