# Basic DataFrames tutorial

In [1]:
from pyspark.sql import SparkSession

# connect to the standalone Spark master and create (or reuse) a session
ss = (SparkSession.builder
      .master('spark://spark-master:7077')
      .appName('myapp')
      .getOrCreate())

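We can use the `json` reader to read many JSON files at once.  In this dataset each JSON file becomes a single row in the resulting DataFrame (by default the reader expects one JSON object per line; pretty-printed files need the `multiLine` option):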

In [57]:
df = ss.read.json('hdfs://namenode/Users/vagrant/structured-2018-01-14-neworleans/*.json')

In [58]:
# how many rows do we have (i.e. how many json files did we read?)
df.count()

280
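
To make the "one file, one row" behaviour concrete, here is a plain-Python sketch (standard library only, not Spark) that writes a few JSON files to a temporary directory and reads them back through a glob pattern. The match ids and file names are made up for illustration, and the loop only models what the reader does conceptually; it is not how Spark implements it.

```python
import glob
import json
import os
import tempfile

# Hypothetical match records; the ids are invented for illustration.
records = [
    {"id": "match-1", "map": "London Docks", "mode": "Hardpoint"},
    {"id": "match-2", "map": "Ardennes Forest", "mode": "Hardpoint"},
    {"id": "match-3", "map": "London Docks", "mode": "Search & Destroy"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write one JSON object per file, each on a single line.
    for record in records:
        path = os.path.join(tmp_dir, record["id"] + ".json")
        with open(path, "w") as f:
            json.dump(record, f)

    # Read every file matching the glob pattern; each file yields one row.
    rows = []
    for path in sorted(glob.glob(os.path.join(tmp_dir, "*.json"))):
        with open(path) as f:
            rows.append(json.load(f))

print(len(rows))        # number of rows equals number of files
print(sorted(rows[0]))  # the keys play the role of column names
```

Counting the rows is therefore the same as counting the files that matched the pattern.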

In [59]:
# use the .columns member to list the columns out
df.columns

['duration_ms',
 'end_time_s',
 'events',
 'hp_hill_names',
 'hp_hill_rotations',
 'id',
 'map',
 'mode',
 'platform',
 'players',
 'rounds',
 'series_id',
 'start_time_s',
 'teams',
 'title']

In [60]:
# we can select columns just like in SQL
df.select(['map', 'mode', 'title']).show(5)

+---------------+---------+-----+
|            map|     mode|title|
+---------------+---------+-----+
|Ardennes Forest|Hardpoint|  ww2|
|   London Docks|Hardpoint|  ww2|
|   London Docks|Hardpoint|  ww2|
|Ardennes Forest|Hardpoint|  ww2|
|Ardennes Forest|Hardpoint|  ww2|
+---------------+---------+-----+
only showing top 5 rows



In [61]:
# selecting a single column is similar to pandas, but the result is a
# Column expression rather than the data itself
df.mode
# or
df['mode']

Column<b'mode'>

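We can perform standard aggregations (e.g. avg, min, max).

Shortcut methods such as `avg()` are defined on grouped data, so we call `groupBy()` with no arguments to aggregate over the whole DataFrame, even though we aren't really grouping (`df.agg(...)` is shorthand for `df.groupBy().agg(...)`):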

In [62]:
df.groupBy().avg('duration_ms').show()

+-----------------+
| avg(duration_ms)|
+-----------------+
|608682.1428571428|
+-----------------+



There are several equivalent syntaxes for the same aggregation, which can get a little confusing:

In [65]:
df.groupBy().agg({'duration_ms': 'avg'}).show()

+-----------------+
| avg(duration_ms)|
+-----------------+
|608682.1428571428|
+-----------------+



In [66]:
from pyspark.sql import functions as F

df.groupBy().agg(F.avg('duration_ms')).show()

+-----------------+
| avg(duration_ms)|
+-----------------+
|608682.1428571428|
+-----------------+



In [67]:
df.groupBy().agg(F.avg(df.duration_ms)).show()

+-----------------+
| avg(duration_ms)|
+-----------------+
|608682.1428571428|
+-----------------+



The `pyspark.sql.functions` module contains many useful functions, such as the `F.avg` used above; we will use more of them below.

We can also aggregate over actual groups.  Here's an example:

In [68]:
df.groupBy('mode').avg('duration_ms').show()

+----------------+-----------------+
|            mode| avg(duration_ms)|
+----------------+-----------------+
|       Hardpoint|636487.1794871795|
|Capture The Flag|639394.3661971831|
|Search & Destroy|549619.5652173914|
+----------------+-----------------+
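
As a mental model of what `groupBy().avg(...)` and `groupBy('mode').avg(...)` compute, here is a small plain-Python sketch. The helper `avg_by` and the durations are hypothetical, invented for illustration; Spark performs the same logical computation, but distributed across the cluster.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows; the durations are invented, not taken from the dataset.
rows = [
    {"mode": "Hardpoint", "duration_ms": 600000},
    {"mode": "Hardpoint", "duration_ms": 700000},
    {"mode": "Search & Destroy", "duration_ms": 500000},
]

def avg_by(rows, key, value):
    """Average `value` per distinct `key`; key=None puts all rows in one group."""
    groups = defaultdict(list)
    for row in rows:
        group = row[key] if key is not None else None
        groups[group].append(row[value])
    return {group: mean(values) for group, values in groups.items()}

global_avg = avg_by(rows, None, "duration_ms")  # like df.groupBy().avg('duration_ms')
per_mode = avg_by(rows, "mode", "duration_ms")  # like df.groupBy('mode').avg('duration_ms')
print(global_avg)
print(per_mode)
```

Grouping by nothing is just the special case in which every row lands in the same single group.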



JSON data is usually nested, which is a little awkward when you are trying to analyze it using SQL-like tables.

For example, in the CWL JSON the `teams` field is a list of length 2 (one entry per team):

In [70]:
df.select('teams').limit(5).show()

+--------------------+
|               teams|
+--------------------+
|[[false, ENIGMA6,...|
|[[false, LUMINOSI...|
|[[true, MINDFREAK...|
|[[true, RISE NATI...|
|[[false, FAZE CLA...|
+--------------------+



Sometimes it can be better to do a `take` than a `show` so that we can see the nested structure better:

In [71]:
df.select('teams').take(1)

[Row(teams=[Row(is_victor=False, name='ENIGMA6', round_scores=[24, 16, 21, 20, 31, 13, 18, 25, 4, 7, 21, 7, 0], score=207, side='home'), Row(is_victor=True, name='LUMINOSITY GAMING', round_scores=[17, 19, 27, 9, 5, 9, 20, 8, 32, 39, 8, 14, 13], score=220, side='away')])]

The `F.explode()` function is a very useful way to "denormalize" the data.  In short, it turns each element of a nested list into its own row (at the cost of introducing some redundancy):

In [73]:
teams_df = df.select('id', F.explode('teams'))
teams_df.show(5)

+--------------------+--------------------+
|                  id|                 col|
+--------------------+--------------------+
|64d15a2d-2f3c-5a2...|[false, ENIGMA6, ...|
|64d15a2d-2f3c-5a2...|[true, LUMINOSITY...|
|1111848b-1bfb-5d6...|[false, LUMINOSIT...|
|1111848b-1bfb-5d6...|[true, TEAM KALIB...|
|1b615383-6e9e-589...|[true, MINDFREAK,...|
+--------------------+--------------------+
only showing top 5 rows



It is better to rename the new column to "team", because we exploded a list of 2 teams into 2 separate rows, each containing one team.  We use `alias` to rename:

In [74]:
teams_df = df.select('id', F.explode('teams').alias('team'))
teams_df.show(5)

+--------------------+--------------------+
|                  id|                team|
+--------------------+--------------------+
|64d15a2d-2f3c-5a2...|[false, ENIGMA6, ...|
|64d15a2d-2f3c-5a2...|[true, LUMINOSITY...|
|1111848b-1bfb-5d6...|[false, LUMINOSIT...|
|1111848b-1bfb-5d6...|[true, TEAM KALIB...|
|1b615383-6e9e-589...|[true, MINDFREAK,...|
+--------------------+--------------------+
only showing top 5 rows



In [76]:
teams_df.take(1)

[Row(id='64d15a2d-2f3c-5a28-844e-3d903d3cb9bc', team=Row(is_victor=False, name='ENIGMA6', round_scores=[24, 16, 21, 20, 31, 13, 18, 25, 4, 7, 21, 7, 0], score=207, side='home'))]
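
The effect of exploding and aliasing can be modelled in a few lines of plain Python. This is only a sketch of the semantics, not Spark's implementation: the `explode` helper below is hypothetical, and the single match is abbreviated from the `take(1)` output above (only a few fields are kept).

```python
# One match, abbreviated from the row shown above.
matches = [
    {
        "id": "64d15a2d-2f3c-5a28-844e-3d903d3cb9bc",
        "teams": [
            {"name": "ENIGMA6", "score": 207, "is_victor": False},
            {"name": "LUMINOSITY GAMING", "score": 220, "is_victor": True},
        ],
    },
]

def explode(rows, list_field, alias):
    """Emit one output row per element of each row's `list_field`.

    The remaining fields are copied into every output row,
    which is where the redundancy comes from.
    """
    exploded = []
    for row in rows:
        parent = {k: v for k, v in row.items() if k != list_field}
        for element in row[list_field]:
            exploded.append({**parent, alias: element})
    return exploded

team_rows = explode(matches, "teams", "team")
for team_row in team_rows:
    print(team_row["id"], team_row["team"]["name"])
```

One input row with a two-element list becomes two output rows that share the same `id`.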

Notice that the "team" column is still nested.  This isn't really limiting, though: we can use the `column.field` syntax to get at the subfields:

In [78]:
teams_df.select('id', 'team.name').show(5)

+--------------------+-----------------+
|                  id|             name|
+--------------------+-----------------+
|64d15a2d-2f3c-5a2...|          ENIGMA6|
|64d15a2d-2f3c-5a2...|LUMINOSITY GAMING|
|1111848b-1bfb-5d6...|LUMINOSITY GAMING|
|1111848b-1bfb-5d6...|     TEAM KALIBER|
|1b615383-6e9e-589...|        MINDFREAK|
+--------------------+-----------------+
only showing top 5 rows



If we want to rename the column in the same `select`, we need a Column expression so that we can call `alias` on it, which is a little noisier:

In [83]:
teams_df.select('id', teams_df.team['name'].alias('team_name')).show(5)

+--------------------+-----------------+
|                  id|        team_name|
+--------------------+-----------------+
|64d15a2d-2f3c-5a2...|          ENIGMA6|
|64d15a2d-2f3c-5a2...|LUMINOSITY GAMING|
|1111848b-1bfb-5d6...|LUMINOSITY GAMING|
|1111848b-1bfb-5d6...|     TEAM KALIBER|
|1b615383-6e9e-589...|        MINDFREAK|
+--------------------+-----------------+
only showing top 5 rows



... or we could've used the `.withColumnRenamed()` method:

In [84]:
(teams_df.select('id', 'team.name')
         .withColumnRenamed('name', 'team_name')
         .show(5))

+--------------------+-----------------+
|                  id|        team_name|
+--------------------+-----------------+
|64d15a2d-2f3c-5a2...|          ENIGMA6|
|64d15a2d-2f3c-5a2...|LUMINOSITY GAMING|
|1111848b-1bfb-5d6...|LUMINOSITY GAMING|
|1111848b-1bfb-5d6...|     TEAM KALIBER|
|1b615383-6e9e-589...|        MINDFREAK|
+--------------------+-----------------+
only showing top 5 rows



Let's explode the `players` nested field:

In [85]:
players_df = df.select('id', F.explode('players'))
players_df.show(5)

+--------------------+--------------------+
|                  id|                 col|
+--------------------+--------------------+
|64d15a2d-2f3c-5a2...|[3, 1, 0, 1, 0, 0...|
|64d15a2d-2f3c-5a2...|[2, 0, 0, 0, 0, 0...|
|64d15a2d-2f3c-5a2...|[6, 3, 0, 3, 1, 0...|
|64d15a2d-2f3c-5a2...|[4, 2, 0, 2, 0, 1...|
|64d15a2d-2f3c-5a2...|[6, 0, 0, 1, 0, 0...|
+--------------------+--------------------+
only showing top 5 rows



To reduce the number of joins we'll have to make, let's redo this last step but keep some more fields (at the cost of redundancy).  This is "denormalization":

In [88]:
players_df = df.select('id',
                       'platform',
                       'title',
                       'mode',
                       'map',
                       'start_time_s',
                       'end_time_s',
                       'duration_ms',
                       F.explode('players').alias('player'))
players_df.show(5)

+--------------------+--------+-----+---------+---------------+------------+----------+-----------+--------------------+
|                  id|platform|title|     mode|            map|start_time_s|end_time_s|duration_ms|              player|
+--------------------+--------+-----+---------+---------------+------------+----------+-----------+--------------------+
|64d15a2d-2f3c-5a2...|     ps4|  ww2|Hardpoint|Ardennes Forest|  1515814927|1515815692|     765000|[3, 1, 0, 1, 0, 0...|
|64d15a2d-2f3c-5a2...|     ps4|  ww2|Hardpoint|Ardennes Forest|  1515814927|1515815692|     765000|[2, 0, 0, 0, 0, 0...|
|64d15a2d-2f3c-5a2...|     ps4|  ww2|Hardpoint|Ardennes Forest|  1515814927|1515815692|     765000|[6, 3, 0, 3, 1, 0...|
|64d15a2d-2f3c-5a2...|     ps4|  ww2|Hardpoint|Ardennes Forest|  1515814927|1515815692|     765000|[4, 2, 0, 2, 0, 1...|
|64d15a2d-2f3c-5a2...|     ps4|  ww2|Hardpoint|Ardennes Forest|  1515814927|1515815692|     765000|[6, 0, 0, 1, 0, 0...|
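+--------------------+--------+-----+---------+---------------+------------+----------+-----------+--------------------+
only showing top 5 rows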

In [89]:
players_df.take(1)

[Row(id='64d15a2d-2f3c-5a28-844e-3d903d3cb9bc', platform='ps4', title='ww2', mode='Hardpoint', map='Ardennes Forest', start_time_s=1515814927, end_time_s=1515815692, duration_ms=765000, player=Row(2piece=3, 3piece=1, 4piece=0, 4streak=1, 5streak=0, 6streak=0, 7streak=0, 8+streak=0, accuracy=24.6, assists=14, avg_time_per_life_s=17.1, ctf_captures=None, ctf_defends=None, ctf_flag_carry_time_s=None, ctf_kill_carriers=None, ctf_pickups=None, ctf_returns=None, deaths=39, deaths_per_10min=30.6, fave_division='Airborne', fave_scorestreaks=['Fighter Pilot', 'Glide Bomb', 'Artillery Barrage'], fave_training='Hunker', fave_weapon='PPSh-41', headshots=2, hits=146, hp_captures=0, hp_defends=0, hp_hill_time_s=48, kd=0.9, kills=35, kills_per_10min=27.5, name='BLAZT', num_lives=40, scorestreaks_assists=0, scorestreaks_deployed=0, scorestreaks_earned=0, scorestreaks_kills=0, scorestreaks_used=0, shots=593, snd_1kill_round=None, snd_2kill_round=None, snd_3kill_round=None, snd_4kill_round=None, snd_def

You can join just like in SQL.  The join conditions here are Column expressions, so both `id` columns are kept in the result:

In [96]:
joined_df = players_df.join(teams_df, 
                            [players_df.id == teams_df.id,
                             players_df.player['team'] == teams_df.team['name']])
joined_df.show(5)

+--------------------+--------+-----+---------+---------------+------------+----------+-----------+--------------------+--------------------+--------------------+
|                  id|platform|title|     mode|            map|start_time_s|end_time_s|duration_ms|              player|                  id|                team|
+--------------------+--------+-----+---------+---------------+------------+----------+-----------+--------------------+--------------------+--------------------+
|64d15a2d-2f3c-5a2...|     ps4|  ww2|Hardpoint|Ardennes Forest|  1515814927|1515815692|     765000|[3, 1, 0, 1, 0, 0...|64d15a2d-2f3c-5a2...|[false, ENIGMA6, ...|
|64d15a2d-2f3c-5a2...|     ps4|  ww2|Hardpoint|Ardennes Forest|  1515814927|1515815692|     765000|[2, 0, 0, 0, 0, 0...|64d15a2d-2f3c-5a2...|[false, ENIGMA6, ...|
|64d15a2d-2f3c-5a2...|     ps4|  ww2|Hardpoint|Ardennes Forest|  1515814927|1515815692|     765000|[6, 3, 0, 3, 1, 0...|64d15a2d-2f3c-5a2...|[false, ENIGMA6, ...|
|64d15a2d-2f3c-5a2...|
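
The two-condition join above can be pictured as a lookup on the composite key (match id, team name). The sketch below models an inner join, which is Spark's default join type, in plain Python. The match ids `m1`/`m2`, the player names, and the unmatched team are placeholders invented for illustration.

```python
# Hypothetical exploded rows; ids and player names are placeholders.
player_rows = [
    {"id": "m1", "player": {"name": "player_a", "team": "ENIGMA6"}},
    {"id": "m1", "player": {"name": "player_b", "team": "LUMINOSITY GAMING"}},
    {"id": "m2", "player": {"name": "player_c", "team": "MINDFREAK"}},
    {"id": "m2", "player": {"name": "player_d", "team": "UNKNOWN TEAM"}},
]
team_rows = [
    {"id": "m1", "team": {"name": "ENIGMA6", "is_victor": False}},
    {"id": "m1", "team": {"name": "LUMINOSITY GAMING", "is_victor": True}},
    {"id": "m2", "team": {"name": "MINDFREAK", "is_victor": True}},
]

# Index the team rows by the composite join key (match id, team name).
teams_by_key = {}
for t in team_rows:
    teams_by_key.setdefault((t["id"], t["team"]["name"]), []).append(t)

# Inner join: a player row survives only if some team row has the same key.
joined = []
for p in player_rows:
    for t in teams_by_key.get((p["id"], p["player"]["team"]), []):
        joined.append({"id": p["id"], "player": p["player"], "team": t["team"]})

for row in joined:
    print(row["player"]["name"], row["team"]["name"], row["team"]["is_victor"])
```

The player whose team has no matching team row is dropped, which is the behaviour to expect from an inner join.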

Filtering is also easy.  First we list the distinct modes, then we keep only the Search & Destroy rows:

In [97]:
joined_df.select('mode').distinct().collect()

[Row(mode='Hardpoint'),
 Row(mode='Capture The Flag'),
 Row(mode='Search & Destroy')]

In [98]:
# keep only the Search & Destroy rows
snd_df = joined_df.filter(joined_df.mode == 'Search & Destroy')

In [99]:
snd_df.count()

736
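
For comparison, here is what `distinct()` and `filter()` amount to in plain Python. The rows are hypothetical placeholders; the point is only that `distinct` removes duplicate values and `filter` keeps the rows for which a predicate holds.

```python
# Hypothetical rows; the ids are placeholders.
rows = [
    {"id": "m1", "mode": "Hardpoint"},
    {"id": "m2", "mode": "Search & Destroy"},
    {"id": "m3", "mode": "Capture The Flag"},
    {"id": "m4", "mode": "Search & Destroy"},
]

# distinct(): unique values, kept here in first-seen order.
distinct_modes = list(dict.fromkeys(row["mode"] for row in rows))

# filter(): keep only the rows for which the predicate is true.
snd_rows = [row for row in rows if row["mode"] == "Search & Destroy"]

print(distinct_modes)
print(len(snd_rows))
```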