Table: Activity


+--------------+---------+
| Column Name  | Type    |
+--------------+---------+
| player_id    | int     |
| device_id    | int     |
| event_date   | date    |
| games_played | int     |
+--------------+---------+

(player_id, event_date) is the primary key (combination of columns with unique values) of this table.
This table shows the activity of players of some games.
Each row is a record of a player who logged in and played a number of games (possibly 0) before logging out on some day using some device.
 

Write a solution to find the first login date for each player.

Return the result table in any order.

The result format is in the following example.

 

Example 1:

Input: 
Activity table:

+-----------+-----------+------------+--------------+
| player_id | device_id | event_date | games_played |
+-----------+-----------+------------+--------------+
| 1         | 2         | 2016-03-01 | 5            |
| 1         | 2         | 2016-05-02 | 6            |
| 2         | 3         | 2017-06-25 | 1            |
| 3         | 1         | 2016-03-02 | 0            |
| 3         | 4         | 2018-07-03 | 5            |
+-----------+-----------+------------+--------------+

Output: 
+-----------+-------------+
| player_id | first_login |
+-----------+-------------+
| 1         | 2016-03-01  |
| 2         | 2017-06-25  |
| 3         | 2016-03-02  |
+-----------+-------------+
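
The expected output above amounts to keeping the earliest event_date seen for each player_id. As a quick sanity check that does not need Spark, here is a minimal plain-Python sketch of that grouping logic. The function name `first_logins` and the tuple layout are assumptions made for illustration; they are not part of the problem statement.

```python
from datetime import date

# Rows mirror the example Activity table:
# (player_id, device_id, event_date, games_played)
rows = [
    (1, 2, date(2016, 3, 1), 5),
    (1, 2, date(2016, 5, 2), 6),
    (2, 3, date(2017, 6, 25), 1),
    (3, 1, date(2016, 3, 2), 0),
    (3, 4, date(2018, 7, 3), 5),
]


def first_logins(activity_rows):
    """Return a dict mapping each player_id to its earliest event_date."""
    earliest = {}
    for player_id, _device_id, event_date, _games_played in activity_rows:
        # Keep the date only if it is the first one seen or earlier than the stored one
        if player_id not in earliest or event_date < earliest[player_id]:
            earliest[player_id] = event_date
    return earliest


result = first_logins(rows)
for player_id in sorted(result):
    print(player_id, result[player_id])
```

A single pass with a dictionary is the same idea as a GROUP BY player_id with a MIN(event_date) aggregate, which is what the PySpark and SQL cells below express.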

In [0]:
from pyspark.sql import SparkSession
# Namespaced import avoids shadowing Python's built-in min()
from pyspark.sql import functions as F

# Initialize SparkSession
spark = SparkSession.builder.appName("FirstLogin").getOrCreate()

# Sample data: (player_id, device_id, event_date, games_played)
data = [
    (1, 2, "2016-03-01", 5),
    (1, 2, "2016-05-02", 6),
    (2, 3, "2017-06-25", 1),
    (3, 1, "2016-03-02", 0),
    (3, 4, "2018-07-03", 5),
]

# Create DataFrame
columns = ["player_id", "device_id", "event_date", "games_played"]
df = spark.createDataFrame(data, columns)

# Convert event_date from string to date type so MIN compares dates, not strings
df = df.withColumn("event_date", F.col("event_date").cast("date"))

# Find the first login (minimum event_date) for each player_id
result_df = df.groupBy("player_id").agg(F.min("event_date").alias("first_login"))

# Show result (display() is available in Databricks notebooks; use result_df.show() elsewhere)
result_df.display()


player_id,first_login
1,2016-03-01
2,2017-06-25
3,2016-03-02


In [0]:
# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView('activity')

In [0]:
%sql
SELECT 
    player_id, 
    MIN(event_date) AS first_login
FROM 
    Activity
GROUP BY 
    player_id;

player_id,first_login
1,2016-03-01
2,2017-06-25
3,2016-03-02
