# Lab 4: Spark Dataframes and SQL (Part II)

In this part, we will practice with datasets about soccer players and their performance in matches. We have two datasets: players and player attributes. The first has only a few columns and relatively few rows, so it can serve as a lookup table. The second contains many attributes that are useful for extracting different insights.

Let's start by creating a session and loading the data the same way we did with the London crime data set.

In [5]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Analyzing soccer players") \
    .master('spark://spark-master:7077') \
    .config("spark.sql.repl.eagerEval.enabled", True) \
    .getOrCreate()

In [6]:
spark

In [7]:
players = spark.read\
               .format("csv")\
               .option("header", "true")\
               .load("data/player.csv")

In [10]:
players.printSchema()

root
 |-- id: string (nullable = true)
 |-- player_api_id: string (nullable = true)
 |-- player_name: string (nullable = true)
 |-- player_fifa_api_id: string (nullable = true)
 |-- birthday: string (nullable = true)
 |-- height: string (nullable = true)
 |-- weight: string (nullable = true)



In [12]:
players

id,player_api_id,player_name,player_fifa_api_id,birthday,height,weight
1,505942,Aaron Appindangoye,218353,1992-02-29 00:00:00,182.88,187
2,155782,Aaron Cresswell,189615,1989-12-15 00:00:00,170.18,146
3,162549,Aaron Doran,186170,1991-05-13 00:00:00,170.18,163
4,30572,Aaron Galindo,140161,1982-05-08 00:00:00,182.88,198
5,23780,Aaron Hughes,17725,1979-11-08 00:00:00,182.88,154
6,27316,Aaron Hunt,158138,1986-09-04 00:00:00,182.88,161
7,564793,Aaron Kuhl,221280,1996-01-30 00:00:00,172.72,146
8,30895,Aaron Lennon,152747,1987-04-16 00:00:00,165.1,139
9,528212,Aaron Lennox,206592,1993-02-19 00:00:00,190.5,181
10,101042,Aaron Meijers,188621,1987-10-28 00:00:00,175.26,170


In [13]:
player_attributes = spark.read\
                         .format("csv")\
                         .option("header", "true")\
                         .load("data/player_attributes.csv")

In [14]:
player_attributes.printSchema()

root
 |-- id: string (nullable = true)
 |-- player_fifa_api_id: string (nullable = true)
 |-- player_api_id: string (nullable = true)
 |-- date: string (nullable = true)
 |-- overall_rating: string (nullable = true)
 |-- potential: string (nullable = true)
 |-- preferred_foot: string (nullable = true)
 |-- attacking_work_rate: string (nullable = true)
 |-- defensive_work_rate: string (nullable = true)
 |-- crossing: string (nullable = true)
 |-- finishing: string (nullable = true)
 |-- heading_accuracy: string (nullable = true)
 |-- short_passing: string (nullable = true)
 |-- volleys: string (nullable = true)
 |-- dribbling: string (nullable = true)
 |-- curve: string (nullable = true)
 |-- free_kick_accuracy: string (nullable = true)
 |-- long_passing: string (nullable = true)
 |-- ball_control: string (nullable = true)
 |-- acceleration: string (nullable = true)
 |-- sprint_speed: string (nullable = true)
 |-- agility: string (nullable = true)
 |-- reactions: string (nullable = true)

In [16]:
player_attributes.select('player_api_id','date').sample(fraction=0.1)

player_api_id,date
505942,2016-02-18 00:00:00
155782,2014-12-05 00:00:00
155782,2014-03-14 00:00:00
155782,2013-02-15 00:00:00
162549,2009-02-22 00:00:00
30572,2015-07-03 00:00:00
30572,2014-06-06 00:00:00
30572,2014-04-11 00:00:00
23780,2015-09-21 00:00:00
23780,2014-01-31 00:00:00


#### Player attributes

* Have values across multiple years
* Can be associated with a particular player using the **player_api_id** column
* Different attributes are valuable for different kinds of players, e.g. strikers, midfielders, goalkeepers

In [18]:
players.count() , player_attributes.count()

(11060, 183978)

In [17]:
player_attributes.select('player_api_id')\
                 .distinct()\
                 .count()

11060

### Cleaning Data
Drop columns ```id``` and ```player_fifa_api_id``` from the players dataset

In [19]:
#Remove before sharing with students
players = players.drop('id', 'player_fifa_api_id')
players.columns

['player_api_id', 'player_name', 'birthday', 'height', 'weight']

Some attributes are not used anywhere in this lab, so it is better to remove them and make the dataset less bulky. From the player_attributes dataset, drop the columns ```id, player_fifa_api_id, preferred_foot, attacking_work_rate, defensive_work_rate, crossing, jumping, sprint_speed, balance, aggression, short_passing, potential```.

In [20]:
#Remove before sharing with students
player_attributes = player_attributes.drop(
    'id', 
    'player_fifa_api_id', 
    'preferred_foot',
    'attacking_work_rate',
    'defensive_work_rate',
    'crossing',
    'jumping',
    'sprint_speed',
    'balance',
    'aggression',
    'short_passing',
    'potential'
)
player_attributes.columns

['player_api_id',
 'date',
 'overall_rating',
 'finishing',
 'heading_accuracy',
 'volleys',
 'dribbling',
 'curve',
 'free_kick_accuracy',
 'long_passing',
 'ball_control',
 'acceleration',
 'agility',
 'reactions',
 'shot_power',
 'stamina',
 'strength',
 'long_shots',
 'interceptions',
 'positioning',
 'vision',
 'penalties',
 'marking',
 'standing_tackle',
 'sliding_tackle',
 'gk_diving',
 'gk_handling',
 'gk_kicking',
 'gk_positioning',
 'gk_reflexes']

From both datasets, drop any rows that have empty (null) cells.

In [21]:
player_attributes = player_attributes.dropna()
players = players.dropna()

In [22]:
players.count() , player_attributes.count()

(11060, 181265)

#### Extract year information into a separate column

We have a ```date``` attribute in the ```player_attributes``` data frame. It is given as a full timestamp, for example '2007-02-22 00:00:00', but we only need the year, i.e. the first four characters. We can treat the date as a string, split it on the '-' separator, and take the first entry of the resulting list.

We can use a user-defined function (UDF) for this purpose as shown below.

In [23]:
from pyspark.sql.functions import udf

In [24]:
year_extract_udf = udf(lambda date: date.split('-')[0])

player_attributes = player_attributes.withColumn(
    "year",
    year_extract_udf(player_attributes.date)
)
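
The lambda inside the UDF is ordinary Python, so its logic can be checked without Spark. The sketch below is illustrative only: `extract_year` and `extract_year_strict` are hypothetical helper names, and the `datetime` variant is an alternative that also validates the timestamp format.

```python
from datetime import datetime

def extract_year(date_string):
    """Return the year part of a 'YYYY-MM-DD HH:MM:SS' string as text."""
    # Splitting on '-' gives ['2007', '02', '22 00:00:00']; the year comes first.
    return date_string.split('-')[0]

def extract_year_strict(date_string):
    """Parse the full timestamp first, so malformed input raises ValueError."""
    return datetime.strptime(date_string, "%Y-%m-%d %H:%M:%S").year

print(extract_year("2007-02-22 00:00:00"))         # 2007 (a string)
print(extract_year_strict("2007-02-22 00:00:00"))  # 2007 (an int)
```

Note that the split-based version returns a string, which is why the ```year``` column produced by the UDF is also a string column.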

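Alternatively, we can skip the UDF and apply Spark's built-in ```split``` function directly inside ```withColumn```. Built-in functions run inside the JVM, so they generally avoid the Python serialization overhead that a UDF incurs.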

In [25]:
from pyspark.sql.functions import split
player_attributes = player_attributes.withColumn("year2", split(player_attributes.date,('-'))[0])
player_attributes.select("player_api_id","date","year","year2")


player_api_id,date,year,year2
505942,2016-02-18 00:00:00,2016,2016
505942,2015-11-19 00:00:00,2015,2015
505942,2015-09-21 00:00:00,2015,2015
505942,2015-03-20 00:00:00,2015,2015
505942,2007-02-22 00:00:00,2007,2007
155782,2016-04-21 00:00:00,2016,2016
155782,2016-04-07 00:00:00,2016,2016
155782,2016-01-07 00:00:00,2016,2016
155782,2015-12-24 00:00:00,2015,2015
155782,2015-12-17 00:00:00,2015,2015


We can now drop the ```date``` attribute as we do not need it anymore. It is good practice to keep only the columns you need: in a production setting the data is going to be large, and dropping unwanted columns helps reduce resource consumption and improve the throughput of the job.

In [26]:
player_attributes = player_attributes.drop('date')

In [27]:
player_attributes.columns

['player_api_id',
 'overall_rating',
 'finishing',
 'heading_accuracy',
 'volleys',
 'dribbling',
 'curve',
 'free_kick_accuracy',
 'long_passing',
 'ball_control',
 'acceleration',
 'agility',
 'reactions',
 'shot_power',
 'stamina',
 'strength',
 'long_shots',
 'interceptions',
 'positioning',
 'vision',
 'penalties',
 'marking',
 'standing_tackle',
 'sliding_tackle',
 'gk_diving',
 'gk_handling',
 'gk_kicking',
 'gk_positioning',
 'gk_reflexes',
 'year',
 'year2']

#### Filter to get all players who were active in the year 2016

In [28]:
# Remove before sharing with students
pa_2016 = player_attributes.filter(player_attributes.year == 2016)

In [29]:
pa_2016.count()

14098

How many distinct players are there for the 2016 data?

In [30]:
# Remove before sharing with students
pa_2016.select(pa_2016.player_api_id)\
       .distinct()\
       .count()

5586

#### Find the best striker in the year 2016

* Consider the scores for finishing, shot_power and acceleration to determine this
* There can be more than one entry for a player in the year (multiple seasons, some teams make entries per quarter)
* Find the average scores across the multiple records
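
To make the averaging step concrete, here is a plain-Python sketch of what a group-by followed by an average computes. The records and the helper name `average_by_player` are made up for illustration; they are not taken from the dataset.

```python
from collections import defaultdict

# Hypothetical records: (player_api_id, finishing). A player may appear several times.
records = [
    ("p1", 70.0),
    ("p1", 80.0),
    ("p2", 55.0),
]

def average_by_player(rows):
    """Group rows by player id and return {player_id: mean score}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for player_id, score in rows:
        totals[player_id] += score
        counts[player_id] += 1
    return {pid: totals[pid] / counts[pid] for pid in totals}

averages = average_by_player(records)
print(averages)  # {'p1': 75.0, 'p2': 55.0}
```

Each player ends up with exactly one row, which is why the grouped dataframe has as many rows as there are distinct players in the 2016 data.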

In [31]:
pa_striker_2016 = pa_2016.groupBy('player_api_id')\
                       .agg({
                           'finishing':"avg",
                           "shot_power":"avg",
                           "acceleration":"avg"
                       })

In [32]:
pa_striker_2016.count()

5586

In [33]:
pa_striker_2016

player_api_id,avg(finishing),avg(acceleration),avg(shot_power)
309726,75.44444444444444,74.11111111111111,76.0
26112,53.0,51.0,76.0
38433,68.25,74.0,74.0
295060,25.0,62.0,40.0
161396,29.0,72.0,69.0
37774,61.0,64.0,68.0
41157,81.0,87.0,80.0
40740,58.0,73.5,75.0
31432,14.0,59.0,65.0
109653,62.0,65.0,83.5


Tidy the columns by renaming them. Rename each ```avg(colName)``` to ```colName```

In [34]:
# Remove before sharing with students
pa_striker_2016 = pa_striker_2016.withColumnRenamed("avg(finishing)","finishing")\
                                 .withColumnRenamed("avg(shot_power)","shot_power")\
                                 .withColumnRenamed("avg(acceleration)","acceleration")

#### Find an aggregate score to represent how good a particular player is

* Each attribute has a weighting factor
* Find a total score for each striker

In [35]:
weight_finishing = 1
weight_shot_power = 2
weight_acceleration = 1

total_weight = weight_finishing + weight_shot_power + weight_acceleration

In [36]:
strikers = pa_striker_2016.withColumn("striker_grade",
                                      (pa_striker_2016.finishing * weight_finishing + \
                                       pa_striker_2016.shot_power * weight_shot_power+ \
                                       pa_striker_2016.acceleration * weight_acceleration) / total_weight)
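
As a sanity check on the formula, the same weighted average can be computed by hand for a single player. The sketch below uses the weights from this lab (1, 2, 1) and the 2016 averages of player 41157 from the aggregated output earlier in this section; `striker_grade` here is a hypothetical plain-Python helper, not part of the Spark job.

```python
# Weights as used in this lab: shot power counts twice as much as the others.
WEIGHTS = {"finishing": 1, "shot_power": 2, "acceleration": 1}

def striker_grade(scores, weights=WEIGHTS):
    """Weighted average of the attribute scores."""
    total_weight = sum(weights.values())
    weighted_sum = sum(scores[name] * weight for name, weight in weights.items())
    return weighted_sum / total_weight

# Averages for player 41157: (81 * 1 + 80 * 2 + 87 * 1) / 4
grade = striker_grade({"finishing": 81.0, "shot_power": 80.0, "acceleration": 87.0})
print(grade)  # 82.0
```

The result agrees with the ```striker_grade``` of 82.0 that the joined output later in this lab shows for the same player.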

In [37]:
strikers = strikers.drop('finishing',
                         'acceleration',
                         'shot_power'
)

In [39]:
strikers = strikers.filter(strikers.striker_grade > 70)\
                   .sort(strikers.striker_grade.desc())
    
strikers

player_api_id,striker_grade
20276,89.25
37412,89.0
38817,88.75
32118,88.25
31921,87.0
30834,86.75
303824,85.10714285714286
129944,85.0
150565,84.75
158263,84.75


#### Find name and other details of the best strikers

* The information is present in the *players* dataframe
* Will involve a join operation between *players* and *strikers*

In [40]:
strikers.count(), players.count()

(1609, 11060)

#### Joining dataframes

In [41]:
striker_details = players.join(strikers, players.player_api_id == strikers.player_api_id)

In [42]:
striker_details.columns

['player_api_id',
 'player_name',
 'birthday',
 'height',
 'weight',
 'player_api_id',
 'striker_grade']

In [43]:
striker_details.count()

1609

In [44]:
striker_details = players.join(strikers, ['player_api_id'])

In [46]:
striker_details

player_api_id,player_name,birthday,height,weight,striker_grade
309726,Andrea Belotti,1993-12-20 00:00:00,180.34,159,75.38888888888889
38433,Borja Valero,1985-01-12 00:00:00,175.26,161,72.5625
41157,Giovani dos Santos,1989-05-11 00:00:00,175.26,163,82.0
40740,Jeremy Morel,1984-04-02 00:00:00,172.72,157,70.375
109653,John Goossens,1988-07-25 00:00:00,175.26,150,73.5
190851,Kenny McLean,1992-01-08 00:00:00,180.34,154,74.25
196957,Mihai Alexandru R...,1992-05-31 00:00:00,190.5,176,73.33333333333334
362212,Piotr Zielinski,1994-05-20 00:00:00,180.34,165,70.52777777777779
26005,Thomas Vermaelen,1985-11-14 00:00:00,182.88,176,74.0
3517,Alexandr Kerzhakov,1982-11-27 00:00:00,175.26,168,78.0


### Broadcast & Join

* Broadcast the smaller dataframe so that a full copy is available on every executor in the cluster
* The broadcast data must be small enough to be held in memory on each executor
* Each executor can then join its partitions of the larger dataframe locally, without shuffling them across the network, so the overall computation is faster
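
The idea can be illustrated with a small plain-Python hash join. This is a conceptual analogy under simplifying assumptions, not Spark's actual implementation: the dictionary plays the role of the broadcast copy, and `broadcast_hash_join` is a hypothetical name. The sample rows reuse a few ids and names that appear in this lab's outputs.

```python
# Miniature stand-ins for the two dataframes.
strikers_small = [(20276, 89.25), (37412, 89.0)]   # (player_api_id, striker_grade)
players_large = [
    (20276, "Hulk"),
    (37412, "Sergio Aguero"),
    (505942, "Aaron Appindangoye"),                # no grade, dropped by the inner join
]

def broadcast_hash_join(large_rows, small_rows):
    """Inner join on the first field, looking up the small side in a dict."""
    lookup = dict(small_rows)          # the "broadcast" copy every worker would hold
    return [
        (key, name, lookup[key])
        for key, name in large_rows    # stream the large side once, no shuffle needed
        if key in lookup
    ]

joined = broadcast_hash_join(players_large, strikers_small)
print(joined)
```

Because the lookup table fits in memory, each row of the larger side is matched with a single dictionary access instead of being moved to wherever its join partner lives.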

In [47]:
from pyspark.sql.functions import broadcast

In [48]:
striker_details = players.select(
                                "player_api_id",
                                "player_name"
                                 )\
                  .join(
                        broadcast(strikers), 
                        ['player_api_id'],   
                        'inner'
                  )

In [49]:
striker_details = striker_details.sort(striker_details.striker_grade.desc())

In [50]:
striker_details

player_api_id,player_name,striker_grade
20276,Hulk,89.25
37412,Sergio Aguero,89.0
38817,Carlos Tevez,88.75
32118,Lukas Podolski,88.25
31921,Gareth Bale,87.0
30834,Arjen Robben,86.75
303824,Memphis Depay,85.10714285714286
129944,Marco Reus,85.0
158263,Dorlan Pabon,84.75
150565,Pierre-Emerick Au...,84.75


### Accumulators

* Shared variables that tasks running across multiple nodes can add to; only the driver program can read the accumulated value

In [51]:
players.count(), player_attributes.count()

(11060, 181265)

In [52]:
players_heading_acc = player_attributes.select('player_api_id',
                                               'heading_accuracy')\
                                       .join(broadcast(players),
                                             player_attributes.player_api_id == players.player_api_id)

In [53]:
players_heading_acc.columns

['player_api_id',
 'heading_accuracy',
 'player_api_id',
 'player_name',
 'birthday',
 'height',
 'weight']

#### Get player counts by height

In [54]:
short_count = spark.sparkContext.accumulator(0)
medium_low_count = spark.sparkContext.accumulator(0)
medium_high_count = spark.sparkContext.accumulator(0)
tall_count = spark.sparkContext.accumulator(0)

In [55]:
def count_players_by_height(row):
    """Add 1 to the accumulator that matches the row's height bucket."""
    # CSV columns were loaded as strings, so convert before comparing.
    height = float(row.height)

    if height <= 175:
        short_count.add(1)
    elif 175 < height <= 183:
        medium_low_count.add(1)
    elif 183 < height <= 195:
        medium_high_count.add(1)
    elif height > 195:
        tall_count.add(1)

In [56]:
players_heading_acc.foreach(lambda x: count_players_by_height(x))

In [57]:
all_players = [short_count.value,
               medium_low_count.value,
               medium_high_count.value,
               tall_count.value]

all_players

[18977, 97399, 61518, 3371]

#### Find the players who have the best heading accuracy

* Count players who have a heading accuracy above the threshold
* Bucket them by height
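
Both counting functions in this section share the same height thresholds. One possible refactoring, sketched here in plain Python without accumulators, keeps the bucketing in a single function. The names `height_bucket` and `count_by_bucket` and the sample rows are hypothetical. Note that, as in the lab, every attribute record is counted, so a player with several records contributes several times.

```python
from collections import Counter

def height_bucket(height):
    """Map a height to one of the four buckets used in this lab."""
    if height <= 175:
        return "short"
    if height <= 183:
        return "medium_low"
    if height <= 195:
        return "medium_high"
    return "tall"

def count_by_bucket(rows, threshold=None):
    """Count rows per bucket, optionally keeping only heading_accuracy > threshold."""
    counts = Counter()
    for height, heading_accuracy in rows:
        if threshold is not None and float(heading_accuracy) <= threshold:
            continue
        counts[height_bucket(float(height))] += 1
    return counts

# Hypothetical (height, heading_accuracy) pairs, kept as strings like the CSV columns.
sample = [("170.18", "55"), ("182.88", "71"), ("190.5", "64"), ("198.12", "40")]
print(count_by_bucket(sample))                # one row in each bucket
print(count_by_bucket(sample, threshold=60))  # only the 182.88 and 190.5 rows remain
```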

In [58]:
short_ha_count = spark.sparkContext.accumulator(0)
medium_low_ha_count = spark.sparkContext.accumulator(0)
medium_high_ha_count = spark.sparkContext.accumulator(0)
tall_ha_count = spark.sparkContext.accumulator(0)

In [59]:
def count_players_by_height_and_heading_accuracy(row, threshold_score):
    """Like count_players_by_height, but only for rows above the threshold."""
    # CSV columns were loaded as strings, so convert before comparing.
    height = float(row.height)
    ha = float(row.heading_accuracy)

    # Skip rows whose heading accuracy does not exceed the threshold.
    if ha <= threshold_score:
        return

    if height <= 175:
        short_ha_count.add(1)
    elif 175 < height <= 183:
        medium_low_ha_count.add(1)
    elif 183 < height <= 195:
        medium_high_ha_count.add(1)
    elif height > 195:
        tall_ha_count.add(1)

In [60]:
players_heading_acc.foreach(lambda x: count_players_by_height_and_heading_accuracy(x, 60))

In [61]:
all_players_above_threshold = [short_ha_count.value,
                               medium_low_ha_count.value,
                               medium_high_ha_count.value,
                               tall_ha_count.value]

all_players_above_threshold

[3653, 41448, 40270, 1573]

#### Convert to percentages 

* % of players above the threshold heading accuracy for each height bucket

In [62]:
percentage_values = [short_ha_count.value / short_count.value *100,
                     medium_low_ha_count.value / medium_low_count.value *100,
                     medium_high_ha_count.value / medium_high_count.value *100,
                     tall_ha_count.value / tall_count.value *100
                    ]

percentage_values

[19.249617958581442, 42.55485169252251, 65.46051562144413, 46.66271136161376]
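
The same calculation can be written generically with `zip`, which also makes it easy to guard against an empty bucket. This is a small sketch using the counts printed above as literals; `percentages` is a hypothetical helper name.

```python
# Counts copied from the accumulator outputs above (short, medium-low, medium-high, tall).
totals = [18977, 97399, 61518, 3371]
above_threshold = [3653, 41448, 40270, 1573]

def percentages(parts, wholes):
    """Percentage of each part relative to its whole; empty buckets yield None."""
    return [
        part / whole * 100 if whole else None
        for part, whole in zip(parts, wholes)
    ]

result = percentages(above_threshold, totals)
print([round(value, 2) for value in result])  # [19.25, 42.55, 65.46, 46.66]
```

The medium-high bucket has the largest share of records above the heading accuracy threshold, followed by the tall bucket.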

#### Custom accumulator

In [63]:
from pyspark.accumulators import AccumulatorParam

class VectorAccumulatorParam(AccumulatorParam):
    
    def zero(self, value):
        return [0.0] * len(value)

    def addInPlace(self, v1, v2):
        for i in range(len(v1)):
            v1[i] += v2[i]
        
        return v1

In [69]:
sc = spark.sparkContext

In [70]:
vector_accum = sc.accumulator([10.0, 20.0, 30.0], VectorAccumulatorParam())

vector_accum.value

[10.0, 20.0, 30.0]

In [71]:
vector_accum += [1, 2, 3]

vector_accum.value

[11.0, 22.0, 33.0]
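
The two methods matter because updates arrive from many tasks: `zero` supplies an empty starting value and `addInPlace` merges two values. The sketch below imitates that flow in plain Python. It is a rough model under stated assumptions, not Spark's internals: `VectorParamSketch` is a stand-in class that does not inherit from `AccumulatorParam`, and the task updates are invented.

```python
from functools import reduce

class VectorParamSketch:
    """Stand-in with the same two methods as the AccumulatorParam subclass above."""

    def zero(self, value):
        # Identity element: a vector of zeros with the same length.
        return [0.0] * len(value)

    def addInPlace(self, v1, v2):
        # Element-wise addition, mutating and returning the first vector.
        for i in range(len(v1)):
            v1[i] += v2[i]
        return v1

param = VectorParamSketch()
initial = [10.0, 20.0, 30.0]

# Hypothetical updates, as if two tasks had each added some vectors.
task_updates = [[[1, 2, 3]], [[4, 5, 6], [7, 8, 9]]]

# Each task starts from the zero value and folds in its own updates...
partials = [reduce(param.addInPlace, updates, param.zero(initial)) for updates in task_updates]
# ...and the partial results are then folded into the initial value.
final = reduce(param.addInPlace, partials, list(initial))
print(final)  # [22.0, 35.0, 48.0]
```

For the merged result to be independent of task order, `addInPlace` should be associative and commutative, which element-wise addition is.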

#### Save data to file

In [72]:
pa_2016.columns

['player_api_id',
 'overall_rating',
 'finishing',
 'heading_accuracy',
 'volleys',
 'dribbling',
 'curve',
 'free_kick_accuracy',
 'long_passing',
 'ball_control',
 'acceleration',
 'agility',
 'reactions',
 'shot_power',
 'stamina',
 'strength',
 'long_shots',
 'interceptions',
 'positioning',
 'vision',
 'penalties',
 'marking',
 'standing_tackle',
 'sliding_tackle',
 'gk_diving',
 'gk_handling',
 'gk_kicking',
 'gk_positioning',
 'gk_reflexes',
 'year',
 'year2']

#### Save the dataframe to a file

In [101]:
pa_2016.select("player_api_id", "overall_rating")\
    .coalesce(1)\
    .write\
    .option("header", "true")\
    .csv("data/players_overall.csv")

In [102]:
pa_2016.select("player_api_id", "overall_rating")\
    .write\
    .json("players_overall.json")