Table: Activity

| Column Name  | Type |
|--------------|------|
| player_id    | int  |
| device_id    | int  |
| event_date   | date |
| games_played | int  |
- (player_id, event_date) is the primary key of this table.
- This table shows the activity of players of some game.
- Each row is a record of a player who logged in and played a number of games (possibly 0) before logging out on some day using some device.

##### Write an SQL query that reports the first login date for each player.

The query result format is in the following example:

Activity table:

| player_id | device_id | event_date | games_played |
|-----------|-----------|------------|--------------|
| 1         | 2         | 2016-03-01 | 5            |
| 1         | 2         | 2016-05-02 | 6            |
| 2         | 3         | 2017-06-25 | 1            |
| 3         | 1         | 2016-03-02 | 0            |
| 3         | 4         | 2018-07-03 | 5            |

Result table:

| player_id | first_login |
|-----------|-------------|
| 1         | 2016-03-01  |
| 2         | 2017-06-25  |
| 3         | 2016-03-02  |
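
Since the task is phrased as an SQL query, here is a minimal sketch of one possible answer: group by `player_id` and take the earliest `event_date`. It assumes an in-memory SQLite database (via Python's standard `sqlite3` module) as a stand-in for the real engine, and stores dates as ISO `YYYY-MM-DD` text, which sorts chronologically. The names `conn`, `first_login_sql`, and `first_logins` are illustrative only.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the target SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Activity (
        player_id    INTEGER,
        device_id    INTEGER,
        event_date   TEXT,     -- ISO dates (YYYY-MM-DD) compare correctly as text
        games_played INTEGER,
        PRIMARY KEY (player_id, event_date)
    )
""")

# Sample rows copied from the Activity table in the problem statement.
rows = [
    (1, 2, "2016-03-01", 5),
    (1, 2, "2016-05-02", 6),
    (2, 3, "2017-06-25", 1),
    (3, 1, "2016-03-02", 0),
    (3, 4, "2018-07-03", 5),
]
conn.executemany("INSERT INTO Activity VALUES (?, ?, ?, ?)", rows)

# One candidate query: the earliest event_date per player is the first login.
first_login_sql = """
    SELECT player_id, MIN(event_date) AS first_login
    FROM Activity
    GROUP BY player_id
    ORDER BY player_id
"""
first_logins = conn.execute(first_login_sql).fetchall()

for player_id, first_login in first_logins:
    print(player_id, first_login)
```

On the sample data this should reproduce the rows of the Result table. The `ORDER BY` is there only to make the output order deterministic; the problem statement does not require it.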

##### PySpark Solution

In [0]:
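from pyspark.sql.types import DateType  # needed for the cast further down

# Alternative (unused): an explicit StructType schema instead of the DDL string.
# With DateType in the schema, the data would need datetime.date values, not strings.
# from pyspark.sql.types import StructType, StructField, IntegerType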

# schema = StructType([
#   StructField("player_id", IntegerType(), True),
#   StructField("device_id", IntegerType(), True),
#   StructField("event_date", DateType(), True),
#   StructField("games_played", IntegerType(), True)
# ])

schema = "player_id int, device_id int, event_date string, games_played int"
data = [(1, 2, "2016-03-01", 5),
        (1, 2, "2016-05-02", 6),
        (2, 3, "2017-06-25", 1),
        (3, 1, "2016-03-02", 0),
        (3, 4, "2018-07-03", 5)]

df = spark.createDataFrame(data=data, schema=schema)

df = df.withColumn("event_date", df["event_date"].cast(DateType()))
df.display()

player_id,device_id,event_date,games_played
1,2,2016-03-01,5
1,2,2016-05-02,6
2,3,2017-06-25,1
3,1,2016-03-02,0
3,4,2018-07-03,5


In [0]:
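from pyspark.sql.functions import col, rank
from pyspark.sql.window import Window

# Rank each player's logins by date; rank 1 is the earliest login.
windowSpec = Window.partitionBy("player_id").orderBy("event_date")

df2 = (df
       .withColumn("rank", rank().over(windowSpec))
       # Filter before projecting, then expose the date as first_login.
       .filter(col("rank") == 1)
       .select("player_id", col("event_date").alias("first_login"))
       )

df2.display()

player_id,first_login
1,2016-03-01
2,2017-06-25
3,2016-03-02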
