# Computing the player and the ball performance

**Purpose of the notebook:** This notebook loads the unified dataset for both the players and the ball and computes their performance.

**Input of the notebook:** The unified match data.

**Output of the notebook:** A `delta` table (possibly a feature store in the future).

**Some notes:**
* A `spark.DataFrame` variable name always ends with `_df`
* A `pandas.DataFrame` variable name always ends with `_pd`

## Set the environment

In this part, the environment is set up:

* Load the necessary Python modules and helper functions
* Set the paths to the data and metadata
* Initialize the Spark session

Other settings, such as the `spark` application name and the path where the final `delta` table is saved, are defined in the `config.yaml` file.
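To make the expected configuration easier to follow, here is a minimal sketch of the structure the notebook reads from `config.yaml`, written as the nested dict that `read_config()` is assumed to return. Only the key names are taken from the cells in this notebook; every value, as well as the `missing_config_keys` helper, is hypothetical.

```python
# Hypothetical configuration: the key names are the ones this notebook reads,
# all values are placeholders.
example_config = {
    "spark_application": {
        "spark_app_batch_name": "performance_batch",
    },
    "batch": {
        "delta_player_dir": "/path/to/delta/players",
        "delta_ball_dir": "/path/to/delta/ball",
        "delta_features_player_dir": "/path/to/delta/features_players",
        "delta_features_ball_dir": "/path/to/delta/features_ball",
    },
}

# Keys the notebook accesses, grouped by section.
REQUIRED_KEYS = {
    "spark_application": ["spark_app_batch_name"],
    "batch": [
        "delta_player_dir",
        "delta_ball_dir",
        "delta_features_player_dir",
        "delta_features_ball_dir",
    ],
}


def missing_config_keys(config):
    """Return 'section.key' for every required entry that is absent."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        section_values = config.get(section, {})
        for key in keys:
            if key not in section_values:
                missing.append(f"{section}.{key}")
    return missing


print(missing_config_keys(example_config))
```

Running such a check right after loading the config would surface a missing path before any Spark job starts.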

#### Import modules

In [1]:
# Import the modules
import os

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F
import pyspark.sql.types as T
from delta import DeltaTable, configure_spark_with_delta_pip
from utils import plot_pitch, ball_inside_box, read_config

#### Read config

In [2]:
config = read_config()

#### Initialize spark session

In [3]:
app_name = config['spark_application']['spark_app_batch_name']

builder = (
    SparkSession.builder.appName(app_name) 
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") 
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)

spark = configure_spark_with_delta_pip(builder).getOrCreate()

22/07/18 16:01:34 WARN Utils: Your hostname, tomas-Yoga-Slim-7-Pro-14ACH5-O resolves to a loopback address: 127.0.1.1; using 192.168.0.53 instead (on interface wlp1s0)
22/07/18 16:01:34 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


:: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/tomas/.ivy2/cache
The jars for the packages stored in: /home/tomas/.ivy2/jars
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-9ae56880-f1cd-4662-bda1-e2ca5f178ec0;1.0
	confs: [default]
	found io.delta#delta-core_2.12;1.2.1 in central
	found io.delta#delta-storage;1.2.1 in central
	found org.antlr#antlr4-runtime;4.8 in central
	found org.codehaus.jackson#jackson-core-asl;1.9.13 in central
:: resolution report :: resolve 258ms :: artifacts dl 9ms
	:: modules in use:
	io.delta#delta-core_2.12;1.2.1 from central in [default]
	io.delta#delta-storage;1.2.1 from central in [default]
	org.antlr#antlr4-runtime;4.8 from central in [default]
	org.codehaus.jackson#jackson-core-asl;1.9.13 from central in [default]

#### Set the remaining "env" variables

In [4]:
meta_data_path = "/home/tomas/Personal_projects/Aston_Villa/data/g1059778_Metadata.xml"
delta_player_path = config['batch']['delta_player_dir']
delta_ball_path = config['batch']['delta_ball_dir']
delta_feature_player_dir = config['batch']['delta_features_player_dir']
delta_feature_ball_dir = config['batch']['delta_features_ball_dir']

## Read the metadata

In [5]:
from bs4 import BeautifulSoup

# Reuse the metadata path defined above instead of repeating the literal
with open(meta_data_path, 'r') as f:
    metadata = f.read()

match_metadata = BeautifulSoup(metadata, 'xml')
match_tag = match_metadata.find('match')

# Keep only the part of dtDate before the first space (the date)
metadata_match_date = match_tag.get('dtDate').split(' ')[0]
field_x = float(match_tag.get('fPitchXSizeMeters'))
field_y = float(match_tag.get('fPitchYSizeMeters'))
metadata_field_dim = (field_x, field_y)

print(f"Match date: {metadata_match_date}")
print(f"Field dimension: {metadata_field_dim}")

Match date: 2019-10-05
Field dimension: (104.85, 67.97)
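The same attributes can also be read with the standard library alone. The sketch below parses a small, made-up metadata document: the attribute names and the printed values mirror the cell above, while the root element name, the time part of `dtDate`, and the `parse_match_metadata` helper are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample. Only the <match> attribute names and the date and pitch
# values come from the notebook; everything else is a placeholder.
SAMPLE_XML = """<root>
  <match dtDate="2019-10-05 15:00:00"
         fPitchXSizeMeters="104.85"
         fPitchYSizeMeters="67.97"/>
</root>"""


def parse_match_metadata(xml_text):
    """Return (match_date, (field_x, field_y)) from the first <match> element."""
    root = ET.fromstring(xml_text)
    match = root.find(".//match")
    if match is None:
        raise ValueError("no <match> element found")
    # Keep only the part of dtDate before the first space
    match_date = match.get("dtDate").split(" ")[0]
    field_dim = (
        float(match.get("fPitchXSizeMeters")),
        float(match.get("fPitchYSizeMeters")),
    )
    return match_date, field_dim


match_date, field_dim = parse_match_metadata(SAMPLE_XML)
print(f"Match date: {match_date}")
print(f"Field dimension: {field_dim}")
```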


## Read the unified dataset for both ball and players

In [6]:
unified_data_players_df = (
    spark
    .read
    .format("delta")
    .load(delta_player_path)
)

unified_data_ball_df = (
    spark
    .read
    .format("delta")
    .load(delta_ball_path)
)

AnalysisException: `/home/tomas/Personal_projects/Aston_Villa/task_1/PlayerPerformanceData` is not a Delta table.

## Computing the performance

In this part, the performance of the players and the ball for a particular match is calculated. The previous cell loads the delta tables containing the ingested player and ball datasets.
Because the players dataset contains duplicates, it must first be **deduplicated** for both the home and the away team players. The data is then aggregated per home and away player, and both results are unioned into one dataset, which is saved again as a `delta` table.
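The effect of the deduplicate, aggregate, and union steps can be illustrated without Spark. The following plain-Python sketch uses invented rows and a hypothetical `aggregate_team` helper; it only mirrors the logic and is not the notebook's implementation.

```python
from statistics import mean

# Invented rows: (playerId, match_date, gameClock, speed).
# The third home row is an exact duplicate of the second one.
home_rows = [
    ("h1", "2019-10-05", 0.04, 2.0),
    ("h1", "2019-10-05", 0.08, 4.0),
    ("h1", "2019-10-05", 0.08, 4.0),
]
away_rows = [
    ("a1", "2019-10-05", 0.04, 1.0),
    ("a1", "2019-10-05", 0.08, 5.0),
]


def aggregate_team(rows, team):
    """Deduplicate whole rows, then compute avg and max speed per player and match."""
    unique_rows = set(rows)  # plays the role of dropDuplicates()
    speeds = {}
    for player_id, match_date, _clock, speed in unique_rows:
        speeds.setdefault((player_id, match_date), []).append(speed)
    return [
        {
            "playerId": player_id,
            "match_date": match_date,
            "player_avg_speed": mean(values),
            "player_max_speed": max(values),
            "away_home_team": team,
        }
        for (player_id, match_date), values in sorted(speeds.items())
    ]


# Union of both team results
players_performance = aggregate_team(home_rows, "home") + aggregate_team(away_rows, "away")
for row in players_performance:
    print(row)
```

Without the deduplication, the duplicated home frame would be counted twice and pull the average speed of `h1` away from 3.0.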

#### Define the window functions

`Window` functions play a crucial part in this task. They make it possible to compute the rank of the players, the lag of columns, etc.

In [None]:
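# Define the window functions
# Ranking window: players ordered by average speed within each team
windowTop = Window.partitionBy("away_home_team").orderBy(F.col("player_avg_speed").desc())

# Lag windows: the frames of one player in chronological order
windowDelta_home = Window.partitionBy("homePlayer_playerId").orderBy("period","gameClock")
windowDelta_away = Window.partitionBy("awayPlayer_playerId").orderBy("period","gameClock")
# Order by gameClock as well, so that the time lag refers to the same previous frame as the position lag
windowTimeDelta_home = Window.partitionBy("homePlayer_playerId").orderBy("period","gameClock")
windowTimeDelta_away = Window.partitionBy("awayPlayer_playerId").orderBy("period","gameClock")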
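What a lag over a per-player, time-ordered window computes can be sketched in plain Python. The frames below are invented (with 0.5 s steps so the arithmetic is exact), and `max_speed_x` is a hypothetical helper. In this sketch, the first frame of each player has no predecessor and is skipped, and non-positive time steps are ignored.

```python
from itertools import groupby
from operator import itemgetter

# Invented frames: (playerId, period, gameClock in seconds, x position in metres).
# They are deliberately unordered to show that the ordering matters.
frames = [
    ("p1", 1, 1.0, 12.0),
    ("p1", 1, 0.0, 10.0),
    ("p1", 1, 0.5, 10.5),
    ("p2", 1, 0.0, -5.0),
    ("p2", 1, 0.5, -6.0),
]


def max_speed_x(frames):
    """Maximum |delta x| / delta t between consecutive frames, per player."""
    result = {}
    # Sorting by (player, period, clock) corresponds to
    # partitionBy(player).orderBy(period, gameClock)
    ordered = sorted(frames, key=itemgetter(0, 1, 2))
    for player_id, player_frames in groupby(ordered, key=itemgetter(0)):
        previous = None  # the "lag" row
        speeds = []
        for _pid, _period, clock, x in player_frames:
            if previous is not None:
                prev_clock, prev_x = previous
                delta_time = clock - prev_clock
                if delta_time > 0:
                    speeds.append(abs(x - prev_x) / delta_time)
            previous = (clock, x)
        result[player_id] = max(speeds) if speeds else None
    return result


print(max_speed_x(frames))
```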

In [None]:
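# Take the relevant columns
columns = unified_data_players_df.columns

# Columns shared by both teams. Assumption: these are the non-team columns
# referenced by the cells below; adjust the list to the actual schema.
base_columns = ['match_date', 'period', 'gameClock']

home_columns = [c for c in columns if c.startswith('home')] + base_columns
away_columns = [c for c in columns if c.startswith('away')] + base_columns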

#### Aggregation home players

As stated previously, the data must first be deduplicated.

In [None]:
# Deduplicate the home-team rows
home_df = (
    unified_data_players_df
    .select(home_columns)
    .dropDuplicates()
)

# Average and maximum speed per home player and match
home_grouped_df = (
    home_df
    .groupBy('homePlayer_playerId','match_date')
    .agg(
        F.avg('homePlayer_speed').alias('player_avg_speed'),
        F.max('homePlayer_speed').alias('player_max_speed')
    )
    .withColumn('away_home_team',F.lit('home'))
    .withColumnRenamed('homePlayer_playerId','playerId')
)

# Maximum speed along the x axis, derived from consecutive positions
home_x_dir_df = (
    home_df
    .withColumn('x_pos_lag',F.lag("home_player_3d_position_x",1).over(windowDelta_home))
    .withColumn("time_lag",F.lag("gameClock").over(windowTimeDelta_home))
    # The first frame of a player has no predecessor: its lags stay null and give a
    # null speed, which F.max ignores (filling with 0.0 would fake a jump from x = 0)
    .withColumn("delta_x",F.abs(F.col("home_player_3d_position_x") - F.col("x_pos_lag")))
    .withColumn("delta_time",F.col("gameClock") - F.col("time_lag"))
    # Use only positive time steps to avoid division by zero
    .withColumn("speed_x", F.when(F.col("delta_time") > 0, F.col("delta_x")/F.col("delta_time")))
    .groupBy("homePlayer_playerId")
    .agg(F.max("speed_x").alias("maximum_speed_x"))
    .withColumnRenamed("homePlayer_playerId","playerId")
    .withColumn("away_home_team",F.lit("home"))
)

#### Aggregation away players

In [None]:
# Deduplicate the away-team rows
away_df = (
    unified_data_players_df
    .select(away_columns)
    .dropDuplicates()
)

# Average and maximum speed per away player and match
away_grouped_df = (
    away_df
    .groupBy('awayPlayer_playerId','match_date')
    .agg(
        F.avg('awayPlayer_speed').alias('player_avg_speed'),
        F.max('awayPlayer_speed').alias('player_max_speed')
    )
    .withColumn('away_home_team',F.lit('away'))
    .withColumnRenamed('awayPlayer_playerId','playerId')
)

# Maximum speed along the x axis, derived from consecutive positions
away_x_dir_df = (
    away_df
    .withColumn('x_pos_lag',F.lag("away_player_3d_position_x",1).over(windowDelta_away))
    .withColumn("time_lag",F.lag("gameClock").over(windowTimeDelta_away))
    # The first frame of a player has no predecessor: its lags stay null and give a
    # null speed, which F.max ignores (filling with 0.0 would fake a jump from x = 0)
    .withColumn("delta_x",F.abs(F.col("away_player_3d_position_x") - F.col("x_pos_lag")))
    .withColumn("delta_time",F.col("gameClock") - F.col("time_lag"))
    # Use only positive time steps to avoid division by zero
    .withColumn("speed_x", F.when(F.col("delta_time") > 0, F.col("delta_x")/F.col("delta_time")))
    .groupBy("awayPlayer_playerId")
    .agg(F.max("speed_x").alias("maximum_speed_x"))
    .withColumnRenamed("awayPlayer_playerId","playerId")
    .withColumn("away_home_team",F.lit("away"))
)


In [None]:
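# Union the per-team aggregates (both sides have the same column order)
players_performance_df = (
    home_grouped_df
    .union(away_grouped_df)
)

speed_x_dir_df = (
    home_x_dir_df
    .union(away_x_dir_df)
)

# Attach the maximum x-direction speed to the per-player aggregates
player_perf_final_df = (
    players_performance_df
    .join(speed_x_dir_df.select("playerId","maximum_speed_x"), on = ['playerId'], how = 'left')
)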

In [None]:
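# Check for an actual Delta table, not just an existing directory
if DeltaTable.isDeltaTable(spark, delta_feature_player_dir):

    deltaTable = DeltaTable.forPath(spark, delta_feature_player_dir)

    # Insert only rows that are not in the table yet. The key includes match_date,
    # so a player already stored for an earlier match is still inserted for a new one.
    (
        deltaTable.alias('oldData')
        .merge(
            player_perf_final_df.alias('newData'),
            "oldData.playerId = newData.playerId AND oldData.match_date = newData.match_date"
        )
        .whenNotMatchedInsertAll()
        .execute()
    )
else:

    # First run: create the table, partitioned by match date
    (
        player_perf_final_df
        .write
        .format('delta')
        .mode('overwrite')
        .partitionBy('match_date')
        .save(delta_feature_player_dir)
    )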
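The outcome of an insert-only merge depends on which columns form the match key. The plain-Python sketch below mimics that behaviour on invented rows; `merge_insert_only` is a hypothetical helper and not part of the Delta Lake API.

```python
def merge_insert_only(existing, incoming, key_columns):
    """Append the incoming rows whose key does not occur in the existing rows."""
    existing_keys = {tuple(row[c] for c in key_columns) for row in existing}
    merged = list(existing)
    for row in incoming:
        key = tuple(row[c] for c in key_columns)
        if key not in existing_keys:  # the "when not matched, insert" branch
            merged.append(row)
    return merged


# Invented data: one stored row and two incoming rows for the same player
stored = [{"playerId": "p1", "match_date": "2019-10-05", "player_max_speed": 8.1}]
new_rows = [
    {"playerId": "p1", "match_date": "2019-10-05", "player_max_speed": 8.1},
    {"playerId": "p1", "match_date": "2019-10-12", "player_max_speed": 7.9},
]

by_player = merge_insert_only(stored, new_rows, ["playerId"])
by_player_and_match = merge_insert_only(stored, new_rows, ["playerId", "match_date"])

print(len(by_player))            # rows when matching on playerId only
print(len(by_player_and_match))  # rows when matching on playerId and match_date
```

Matching on `playerId` alone treats the second match of `p1` as already present, so its row is never inserted; adding `match_date` to the key keeps one row per player and match.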


### Ball performance

In [None]:
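# temp_df is not defined in this notebook: it is expected to be the unified ball data
# with a boolean column `ball_inside_box` (presumably added with the imported
# ball_inside_box helper)
ball_perf_df = (
    temp_df
    # Each row inside the box contributes 0.04 s, presumably the duration of one frame
    .withColumn('ball_seconds',F.when(F.col('ball_inside_box') == True, 0.04).otherwise(0))
)

ball_perf_df_final = (
    ball_perf_df
    # match_date is assumed to be present in the ball data; the write below partitions by it
    .groupBy("match_date", "ball_inside_box")
    .agg(
        (F.sum('ball_seconds')/60).alias("minutes_inside_box"),
        # Counts the rows (frames) inside the box, not separate entries into it
        F.count("ball_inside_box").alias("n_times_inside_box")
    )
    .filter(F.col("ball_inside_box") == True)
)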
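The arithmetic behind `minutes_inside_box` can be checked on a short, invented sequence of per-frame flags. The sketch assumes 0.04 s per frame, as in the cell above, and uses a hypothetical `box_statistics` helper. It also counts separate entries into the box, which differs from counting frames.

```python
FRAME_SECONDS = 0.04  # per-frame duration, taken from the cell above

# Invented per-frame flags: True means the ball is inside the box
inside_flags = [False, True, True, True, False, False, True, True, False]


def box_statistics(flags, frame_seconds=FRAME_SECONDS):
    """Return (frames inside, minutes inside, number of entries into the box)."""
    frames_inside = sum(flags)
    minutes_inside = frames_inside * frame_seconds / 60
    # An entry is a frame inside the box whose previous frame was outside
    previous_flags = [False] + flags[:-1]
    entries = sum(1 for prev, cur in zip(previous_flags, flags) if cur and not prev)
    return frames_inside, minutes_inside, entries


frames_inside, minutes_inside, entries = box_statistics(inside_flags)
print(frames_inside, entries)
print(minutes_inside)
```

For this sequence the ball spends five frames inside the box but enters it only twice, so a frame count and an entry count answer different questions.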


In [None]:
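# Check for an actual Delta table, not just an existing directory
if DeltaTable.isDeltaTable(spark, delta_feature_ball_dir):

    deltaTable = DeltaTable.forPath(spark, delta_feature_ball_dir)

    # The ball features have no playerId; match on match_date instead
    (
        deltaTable.alias('oldData')
        .merge(
            ball_perf_df_final.alias('newData'),
            "oldData.match_date = newData.match_date"
        )
        .whenNotMatchedInsertAll()
        .execute()
    )
else:

    # First run: create the table, partitioned by match date
    (
        ball_perf_df_final
        .write
        .format('delta')
        .mode('overwrite')
        .partitionBy('match_date')
        .save(delta_feature_ball_dir)
    )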