# Loading the match data notebook

**Purpose of the notebook:** This notebook loads and unifies the raw match data so that it can be used in further aggregations. It is the mandatory data loading step.

**Input of the notebook:** The input data are raw match data.

**Output of the notebook:** The output of this notebook is a pair of `delta` tables, one with the player data and one with the ball data.

**Some notes:**
* A `spark.DataFrame` variable always has the suffix `_df` at the end of its name
* A `pandas.DataFrame` variable always has the suffix `_pd` at the end of its name

## Set the environment

In this part, the environment is set up. The setup consists of:

* Loading the necessary Python modules and helper functions
* Setting the paths to the data and metadata
* Initializing the Spark session

Other settings, such as the `spark` application name and the paths where the final `delta` tables are saved, are defined in the `config.yaml` file.
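
The cells below read three entries from this config: `spark_application.spark_app_batch_name`, `batch.delta_player_dir` and `batch.delta_ball_dir`. A minimal sketch of a sanity check for them, assuming `read_config()` returns a nested dictionary. The `validate_config` helper and the example values are hypothetical and not part of `utils`:

```python
# Entries the notebook reads from the config (section -> key names).
REQUIRED_KEYS = {
    "spark_application": ["spark_app_batch_name"],
    "batch": ["delta_player_dir", "delta_ball_dir"],
}


def validate_config(config: dict) -> list:
    """Return the missing entries as 'section.key' strings (empty if none)."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        section_values = config.get(section) or {}
        for key in keys:
            if key not in section_values:
                missing.append(f"{section}.{key}")
    return missing


# Hypothetical config content; the real values live in config.yaml.
example_config = {
    "spark_application": {"spark_app_batch_name": "match-batch-load"},
    "batch": {"delta_player_dir": "delta/players", "delta_ball_dir": "delta/ball"},
}

print(validate_config(example_config))
```

Running such a check right after `read_config()` would surface a missing path before the Spark session is started.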

#### Import modules

In [4]:
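# Import the modules
import os  # needed for the os.path.isdir checks before writing the delta tables

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T
from delta import DeltaTable, configure_spark_with_delta_pip
from utils import plot_pitch, ball_inside_box, read_config
from bs4 import BeautifulSoup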

#### Read config

In [5]:
config = read_config()

#### Initialize spark session

In [6]:
app_name = config['spark_application']['spark_app_batch_name']

builder = (
    SparkSession.builder.appName(app_name) 
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") 
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)

spark = configure_spark_with_delta_pip(builder).getOrCreate()

22/07/18 21:58:47 WARN Utils: Your hostname, tomas-Yoga-Slim-7-Pro-14ACH5-O resolves to a loopback address: 127.0.1.1; using 192.168.0.53 instead (on interface wlp1s0)
22/07/18 21:58:47 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


:: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/tomas/.ivy2/cache
The jars for the packages stored in: /home/tomas/.ivy2/jars
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-9d40605a-2081-4731-835b-d62aa817fbbb;1.0
	confs: [default]
	found io.delta#delta-core_2.12;1.2.1 in central
	found io.delta#delta-storage;1.2.1 in central
	found org.antlr#antlr4-runtime;4.8 in central
	found org.codehaus.jackson#jackson-core-asl;1.9.13 in central
:: resolution report :: resolve 300ms :: artifacts dl 15ms
	:: modules in use:
	io.delta#delta-core_2.12;1.2.1 from central in [default]
	io.delta#delta-storage;1.2.1 from central in [default]
	org.antlr#antlr4-runtime;4.8 from central in [default]
	org.codehaus.jackson#jackson-core-asl;1.9.13 from central in [default]

#### Set the remaining "env" variables

In [7]:
raw_data_path = "g1059778_Data.jsonl"
meta_data_path = "/home/tomas/Personal_projects/Aston_Villa/data/g1059778_Metadata.xml"
delta_player_path = config['batch']['delta_player_dir']
delta_ball_path = config['batch']['delta_ball_dir']

## Read the raw match data

In [8]:
raw_match_data_df = (
    spark
    .read
    .json(raw_data_path)
)

                                                                                

## Read the metadata

In [9]:
# Read the whole metadata XML file
with open(meta_data_path, 'r') as f:
    metadata = f.read()

match_metadata = BeautifulSoup(metadata, 'xml')

# dtDate holds date and time separated by a space; keep only the date part
metadata_match_data = match_metadata.find('match').get('dtDate').split(' ')[0]
# Pitch dimensions in meters
field_x = float(match_metadata.find('match').get('fPitchXSizeMeters'))
field_y = float(match_metadata.find('match').get('fPitchYSizeMeters'))
metadata_field_dim = (field_x, field_y)

print(f"Match date: {metadata_match_data}")
print(f"Field dimension: {metadata_field_dim}")

Match date: 2019-10-05
Field dimension: (104.85, 67.97)
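
For readers without the metadata file, the same extraction can be sketched with the standard library. The XML snippet below is a hypothetical, heavily reduced stand-in: only the three attribute names come from the cell above, while the root element name and the time of day are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, reduced metadata: only the attributes read by the notebook.
sample_xml = """
<metadata>
  <match dtDate="2019-10-05 16:00:00"
         fPitchXSizeMeters="104.85"
         fPitchYSizeMeters="67.97"/>
</metadata>
""".strip()


def parse_match_metadata(xml_text: str):
    """Return (match_date, (pitch_x_m, pitch_y_m)) from the first <match> element."""
    root = ET.fromstring(xml_text)
    match = next(root.iter("match"))  # iter() also covers the root element itself
    match_date = match.get("dtDate").split(" ")[0]  # drop the time of day
    pitch = (
        float(match.get("fPitchXSizeMeters")),
        float(match.get("fPitchYSizeMeters")),
    )
    return match_date, pitch


print(parse_match_metadata(sample_xml))
```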


## Unifying the data

In this part, the unified dataset is created. To do that, the values have to be extracted from each `array` column (`homePlayers`, `awayPlayers`, `ball`).

**Note: This approach leads to a dataset with duplicated rows. However, for just storing the ingested raw data, it does not matter.**

In the cell below:
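* `wallClock` (epoch milliseconds) is converted to seconds and the date of the match is extracted from it. The date serves as the unique key that identifies each match, and the tables can be partitioned by this column.
* The timestamp is also derived from the `wallClock` column. Note that `from_unixtime` works with whole seconds, so the millisecond part is not preserved and the `S` field in the output is always `0`.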

In [10]:

match_date_df = (
    raw_match_data_df
    .withColumn('match_date', F.from_unixtime(F.col('wallClock')/1000, 'yyyy-MM-dd')) 
    .withColumn('match_timestamp',F.from_unixtime(F.col('wallClock')/1000, 'yyyy-MM-dd HH:mm:ss:S'))
)

In [8]:
(
    match_date_df
    .select('wallClock','match_date','match_timestamp')
).show(5,truncate=False)

+-------------+----------+---------------------+
|wallClock    |match_date|match_timestamp      |
+-------------+----------+---------------------+
|1570284007331|2019-10-05|2019-10-05 16:00:07:0|
|1570284007371|2019-10-05|2019-10-05 16:00:07:0|
|1570284007411|2019-10-05|2019-10-05 16:00:07:0|
|1570284007451|2019-10-05|2019-10-05 16:00:07:0|
|1570284007491|2019-10-05|2019-10-05 16:00:07:0|
+-------------+----------+---------------------+
only showing top 5 rows
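
The output above shows `:0` for every row although the `wallClock` values differ in their last three digits. A minimal plain-Python sketch of a conversion that keeps the milliseconds, using integer arithmetic only. The function name is hypothetical, and UTC is assumed as the default time zone; the notebook output appears to be rendered two hours ahead of UTC, which the `tz` argument can reproduce.

```python
from datetime import datetime, timedelta, timezone


def wallclock_to_date_and_timestamp(wall_clock_ms: int, tz=timezone.utc):
    """Convert epoch milliseconds to (date string, timestamp string with ms)."""
    # Split into whole seconds and the millisecond remainder to avoid float rounding.
    seconds, millis = divmod(wall_clock_ms, 1000)
    moment = datetime.fromtimestamp(seconds, tz=tz)
    match_date = moment.strftime("%Y-%m-%d")
    match_timestamp = moment.strftime("%Y-%m-%d %H:%M:%S") + f".{millis:03d}"
    return match_date, match_timestamp


# First wallClock value from the output above.
print(wallclock_to_date_and_timestamp(1570284007331))
print(wallclock_to_date_and_timestamp(1570284007331, tz=timezone(timedelta(hours=2))))
```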



In the cell below:
* Both the `homePlayers` and `awayPlayers` columns are exploded, so that each element of these arrays gets its own row.
* Unfortunately, this creates duplicates: because the two explodes are chained, every home player row is repeated once for each away player of the same frame (and vice versa).
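
The duplication can be illustrated without Spark. The sketch below mimics the two chained explodes on a single, hypothetical frame with two home and three away players, and contrasts it with a long format that has one row per player. Both helper functions and the reduced frame are assumptions made for illustration only.

```python
from itertools import product

# Hypothetical, reduced frame: real frames carry more players and more fields.
frame = {
    "frameIdx": 0,
    "homePlayers": [{"playerId": "h1"}, {"playerId": "h2"}],
    "awayPlayers": [{"playerId": "a1"}, {"playerId": "a2"}, {"playerId": "a3"}],
}


def explode_both(frame: dict) -> list:
    """Mimic two chained explodes: one row per (home player, away player) pair."""
    return [
        {"frameIdx": frame["frameIdx"], "home": home["playerId"], "away": away["playerId"]}
        for home, away in product(frame["homePlayers"], frame["awayPlayers"])
    ]


def explode_separately(frame: dict) -> list:
    """Alternative long format: one row per player, tagged with the team."""
    rows = []
    for team, column in (("home", "homePlayers"), ("away", "awayPlayers")):
        for player in frame[column]:
            rows.append(
                {"frameIdx": frame["frameIdx"], "team": team, "playerId": player["playerId"]}
            )
    return rows


paired_rows = explode_both(frame)
long_rows = explode_separately(frame)
print(len(paired_rows), len(long_rows))
```

With the paired layout the row count per frame is the product of the two team sizes, while the long layout grows with their sum.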


In [11]:
base_columns = ['period','frameIdx','gameClock','wallClock','live','lastTouch','match_date']

unified_players_df = (
    match_date_df
    .withColumn('home_players_exploded',F.explode('homePlayers')) 
    .withColumn('away_players_exploded',F.explode('awayPlayers'))
    .select(
        F.col('home_players_exploded.playerId').alias('homePlayer_playerId'),
        F.col('home_players_exploded.speed').alias('homePlayer_speed'),
        F.col('home_players_exploded.xyz').alias('homePlayer_3d_position'),
        F.col('away_players_exploded.playerId').alias('awayPlayer_playerId'),
        F.col('away_players_exploded.speed').alias('awayPlayer_speed'),
        F.col('away_players_exploded.xyz').alias('awayPlayer_3d_position'),
        *base_columns
    )
    .withColumn("home_player_3d_position_x", F.col('homePlayer_3d_position').getItem(0))
    .withColumn("home_player_3d_position_y", F.col('homePlayer_3d_position').getItem(1))
    .withColumn("home_player_3d_position_z", F.col('homePlayer_3d_position').getItem(2))
    .withColumn("away_player_3d_position_x", F.col('awayPlayer_3d_position').getItem(0))
    .withColumn("away_player_3d_position_y", F.col('awayPlayer_3d_position').getItem(1))
    .withColumn("away_player_3d_position_z", F.col('awayPlayer_3d_position').getItem(2))
)

Also, let's create a unified dataset for the ball data.

In [12]:
unified_ball_df = (
    match_date_df
    .select(
        F.col("ball.xyz").alias("ballPosition"),
        F.col("ball.speed").alias("ballSpeed"),
        *base_columns
    )
    .withColumn("ballInsideBox",ball_inside_box(F.col('ballPosition'),F.lit("inside_box")))
)

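Here, we first check whether the path to the delta table exists. If it does, the new data are merged into the existing table: rows whose `match_date` is not yet present are inserted, while rows for an already loaded `match_date` are left untouched. Otherwise, the table is created and partitioned by `match_date`.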
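
The effect of a merge that only has an insert clause for unmatched rows can be sketched in plain Python. This illustrates the semantics on lists of dictionaries, not how Delta Lake executes the merge; the function name and the sample rows are hypothetical.

```python
def insert_only_merge(existing_rows: list, new_rows: list, key: str = "match_date") -> list:
    """Append only those new rows whose key value is absent from the existing rows."""
    existing_keys = {row[key] for row in existing_rows}
    inserted = [row for row in new_rows if row[key] not in existing_keys]
    return existing_rows + inserted


# Hypothetical table content and incoming batch.
table = [{"match_date": "2019-10-05", "frameIdx": 0}]
batch = [
    {"match_date": "2019-10-05", "frameIdx": 1},  # date already loaded: skipped
    {"match_date": "2019-10-12", "frameIdx": 0},  # new date: inserted
    {"match_date": "2019-10-12", "frameIdx": 1},  # new date: inserted
]

merged = insert_only_merge(table, batch)
print(len(merged))
```

One consequence of this design is that re-running the load for a match date that is already stored leaves the table unchanged.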

In [None]:
if os.path.isdir(delta_player_path):

    deltaTable = DeltaTable.forPath(spark, delta_player_path)

    (
        deltaTable.alias('oldData')
        .merge(
            unified_players_df.alias('newData'),
            "oldData.match_date = newData.match_date"
        )
        .whenNotMatchedInsertAll()
        .execute()
    )
else:

    (
        unified_players_df
        .write
        .format('delta')
        .mode('overwrite')
        .partitionBy('match_date')
        .save(delta_player_path)
    )


In [13]:
if os.path.isdir(delta_ball_path):

    deltaTable = DeltaTable.forPath(spark, delta_ball_path)

    (
        deltaTable.alias('oldData')
        .merge(
            unified_ball_df.alias('newData'),
            "oldData.match_date = newData.match_date"
        )
        .whenNotMatchedInsertAll()
        .execute()
    )
else:

    (
        unified_ball_df
        .write
        .format('delta')
        .mode('overwrite')
        .partitionBy('match_date')
        .save(delta_ball_path)
    )


22/07/18 21:59:18 WARN MemoryManager: Total allocation exceeds 95.00% (1,020,054,720 bytes) of heap memory
Scaling row group sizes to 95.00% for 8 writers