Table: Activity <br>
+--------------+---------+ <br>
| Column Name  | Type    | <br>
+--------------+---------+ <br>
| player_id    | int     | <br>
| device_id    | int     | <br>
| event_date   | date    | <br>
| games_played | int     | <br>
+--------------+---------+ <br>
- (player_id, event_date) is the primary key of this table. <br>
- This table shows the activity of players of some game.
- Each row is a record of a player who logged in and played a number of games (possibly 0) before logging out on some day using some device.

##### Write an SQL query that reports, for each player and date, how many games the player has played so far. That is, the total number of games played by the player up to and including that date. Check the example for clarity.
The query result format is in the following example: <br>
Activity table: <br>
+-----------+-----------+------------+--------------+ <br>
| player_id | device_id | event_date | games_played | <br>
+-----------+-----------+------------+--------------+ <br>
| 1         | 2         | 2016-03-01 | 5            | <br>
| 1         | 2         | 2016-05-02 | 6            | <br>
| 1         | 3         | 2017-06-25 | 1            | <br>
| 3         | 1         | 2016-03-02 | 0            | <br>
| 3         | 4         | 2018-07-03 | 5            | <br>
+-----------+-----------+------------+--------------+ <br>
Result table: <br>
+-----------+------------+---------------------+ <br>
| player_id | event_date | games_played_so_far | <br>
+-----------+------------+---------------------+ <br>
| 1         | 2016-03-01 | 5                   | <br>
| 1         | 2016-05-02 | 11                  | <br>
| 1         | 2017-06-25 | 12                  | <br>
| 3         | 2016-03-02 | 0                   | <br>
| 3         | 2018-07-03 | 5                   | <br>
+-----------+------------+---------------------+ <br>
- For the player with id 1, 5 + 6 = 11 games played by 2016-05-02, and 5 + 6 + 1 = 12 games played by 2017-06-25.
- For the player with id 3, 0 + 5 = 5 games played by 2018-07-03.

Note that for each player we only care about the days when the player logged in.
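
The arithmetic in the example can be checked without Spark. The following is a minimal sketch in plain Python, assuming the example rows are held as tuples; the function name `games_played_so_far` and the variable names are illustrative and not part of the original problem:

```python
from itertools import accumulate, groupby
from operator import itemgetter

# Rows copied from the example Activity table:
# (player_id, device_id, event_date, games_played)
activity = [
    (1, 2, "2016-03-01", 5),
    (1, 2, "2016-05-02", 6),
    (1, 3, "2017-06-25", 1),
    (3, 1, "2016-03-02", 0),
    (3, 4, "2018-07-03", 5),
]


def games_played_so_far(rows):
    """Return (player_id, event_date, running_total) ordered by player and date."""
    # ISO-formatted date strings sort chronologically, so a plain sort is enough.
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))
    result = []
    for player_id, group in groupby(ordered, key=itemgetter(0)):
        group = list(group)
        # Cumulative sum of games_played within one player's rows.
        totals = accumulate(r[3] for r in group)
        result.extend((player_id, r[2], total) for r, total in zip(group, totals))
    return result


result = games_played_so_far(activity)
for row in result:
    print(row)
```

Only dates on which a player actually logged in appear in the output, because the running total is computed over existing rows and no missing dates are filled in.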

Creating dataset:

In [0]:
from pyspark.sql.types import DateType
schema = "player_id int, device_id int, event_date string, games_played int"
data = [(1, 2, "2016-03-01", 5),
        (1, 2, "2016-05-02", 6),
        (2, 3, "2017-06-25", 1),
        (3, 1, "2016-03-02", 0),
        (3, 4, "2018-07-03", 5)]

df = spark.createDataFrame(data = data, schema = schema)

df = df.withColumn("event_date", df["event_date"].cast(DateType()))
df.display()

player_id,device_id,event_date,games_played
1,2,2016-03-01,5
1,2,2016-05-02,6
2,3,2017-06-25,1
3,1,2016-03-02,0
3,4,2018-07-03,5


PySpark Solution

In [0]:
from pyspark.sql.window import Window
from pyspark.sql import functions as F

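# Running-total frame: from the first row of the player's partition up to the current row.
# (player_id, event_date) is unique, so a range frame and a rows frame give the same result here.
windowSpec = (Window.partitionBy("player_id")
              .orderBy("event_date")
              .rangeBetween(Window.unboundedPreceding, Window.currentRow))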

df_games_played = (df
                   .withColumn("games_played_so_far", F.sum("games_played").over(windowSpec))
                   .select("player_id", "event_date", "games_played_so_far")
                   )

df_games_played.display()

player_id,event_date,games_played_so_far
1,2016-03-01,5
1,2016-05-02,11
2,2017-06-25,1
3,2016-03-02,0
3,2018-07-03,5
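
SQL Solution (sketch)

The task statement asks for an SQL query, so the same window logic can also be written in SQL. The sketch below is an assumption-laden illustration, not part of the original notebook: it uses Python's built-in `sqlite3` module as a stand-in SQL engine (window functions require SQLite 3.25 or newer), stores `event_date` as ISO text, and loads the same rows as the dataset cell above.

In [0]:
```python
import sqlite3

# Same rows as the notebook's dataset cell (assumed here; dates kept as ISO text).
rows = [
    (1, 2, "2016-03-01", 5),
    (1, 2, "2016-05-02", 6),
    (2, 3, "2017-06-25", 1),
    (3, 1, "2016-03-02", 0),
    (3, 4, "2018-07-03", 5),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Activity (
        player_id    INTEGER,
        device_id    INTEGER,
        event_date   TEXT,
        games_played INTEGER,
        PRIMARY KEY (player_id, event_date)
    )
    """
)
conn.executemany("INSERT INTO Activity VALUES (?, ?, ?, ?)", rows)

# With ORDER BY and no explicit frame, the default window frame runs from the
# start of the partition to the current row, which yields a running total.
query = """
SELECT player_id,
       event_date,
       SUM(games_played) OVER (
           PARTITION BY player_id
           ORDER BY event_date
       ) AS games_played_so_far
FROM Activity
ORDER BY player_id, event_date
"""

sql_result = conn.execute(query).fetchall()
for row in sql_result:
    print(row)
```

The `PARTITION BY` and `ORDER BY` clauses play the same role as `partitionBy` and `orderBy` in the PySpark window specification.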
