<a href="https://colab.research.google.com/github/root-git/stratascratch-sql-challenges/blob/main/2_First_Day_Retention_Rate.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**First Day Retention Rate**  

Calculate the first-day retention rate of a group of video game players. First-day retention occurs when a player logs in exactly 1 day after their first-ever log-in.
Return the number of players who meet this definition divided by the total number of players.

**Original Question Link:**  
[StrataScratch ID2090 – First Day Retention Rate](https://platform.stratascratch.com/coding/2090-first-day-retention-rate?code_type=1)

---

## Table Schema

**Table Name**: `players_logins`

| Column Name | Data Type | Description                  |
|-------------|------------|------------------------------|
| player_id   | bigint     | Unique player identifier     |
| login_date  | date       | Date of login (YYYY-MM-DD)   |

---

## Thought Process

1. **Find each player's first login date** using `MIN(login_date)`.
2. **Check if the player logged in the next day** after their first login (`+1 day`).
3. **Count** the number of players who logged in exactly one day after their first login.
4. **Divide** the retained players by the total number of unique players.
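
The four steps above can be sketched in plain Python before writing any SQL. This is only an illustrative cross-check: the `logins` rows are hypothetical sample data, and the variable names are not part of the original notebook.

```python
from datetime import date, timedelta

# Hypothetical sample rows (player_id, login_date); not the real dataset.
logins = [
    (1, date(2024, 1, 1)), (1, date(2024, 1, 2)),  # logs in the next day
    (2, date(2024, 1, 1)),                         # logs in only once
    (3, date(2024, 1, 1)), (3, date(2024, 1, 3)),  # skips the next day
]

# Step 1: earliest login per player.
first_login = {}
for player_id, day in logins:
    if player_id not in first_login or day < first_login[player_id]:
        first_login[player_id] = day

# Steps 2-3: players with a login exactly one day after their first login.
# A set is used so each player is counted at most once.
retained = {
    player_id
    for player_id, day in logins
    if day == first_login[player_id] + timedelta(days=1)
}

# Step 4: retained players divided by all unique players.
retention_rate = round(len(retained) / len(first_login), 4)
print(retention_rate)  # 0.3333 (1 of 3 players)
```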

---

In [18]:
import pandas as pd

# Create mock data with edge cases
data = {
    "player_id": [
        1, 1,         # Logs in on day 0 and day 1 => retained
        2,            # Logs in only once => not retained
        3, 3, 3,      # Logs in on day 0, 2, 3 => missed day 1 => not retained
        4, 4,         # Logs in on day 0 and day 1 => retained
        5, 5,         # Logs in on day 0 and day 2 => not retained
        6, 6, 6       # Logs in on day 0, 1, 2 => retained
    ],
    "login_date": [
        "2024-01-01", "2024-01-02",
        "2024-01-01",
        "2024-01-01", "2024-01-03", "2024-01-04",
        "2024-01-01", "2024-01-02",
        "2024-01-01", "2024-01-03",
        "2024-01-01", "2024-01-02", "2024-01-03"
    ]
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Convert login_date to datetime
df['login_date'] = pd.to_datetime(df['login_date'])

In [19]:
import sqlite3
# Create an in-memory SQLite database (data will not persist after session ends)
conn = sqlite3.connect(":memory:")

# Load the DataFrame into a table named 'players_logins' within the SQLite database
df.to_sql("players_logins", conn, index=False, if_exists='replace')

query = """ Select * from players_logins""" #Preview table

pd.read_sql(query, conn)

|   | player_id | login_date          |
|---|-----------|---------------------|
| 0 | 1         | 2024-01-01 00:00:00 |
| 1 | 1         | 2024-01-02 00:00:00 |
| 2 | 2         | 2024-01-01 00:00:00 |
| 3 | 3         | 2024-01-01 00:00:00 |
| 4 | 3         | 2024-01-03 00:00:00 |
| 5 | 3         | 2024-01-04 00:00:00 |
| 6 | 4         | 2024-01-01 00:00:00 |
| 7 | 4         | 2024-01-02 00:00:00 |
| 8 | 5         | 2024-01-01 00:00:00 |
| 9 | 5         | 2024-01-03 00:00:00 |


In [26]:
# Replace with your SQL query below
query = """ SELECT * FROM players_logins"""

result_df = pd.read_sql(query, conn)

In [27]:
query = """
WITH first_login_date AS
(
  SELECT
    player_id,
    MIN(login_date) AS first_login
  FROM players_logins
  GROUP BY player_id
),
retained_players AS
(
  SELECT
    f.player_id,
    f.first_login,
    DATE(f.first_login, '+1 day') AS next_day
  FROM first_login_date f
  JOIN players_logins p
    ON f.player_id = p.player_id
  WHERE DATE(p.login_date) = DATE(f.first_login, '+1 day')
)
SELECT
  ROUND(CAST(COUNT(DISTINCT r.player_id) AS FLOAT) / COUNT(DISTINCT f.player_id), 4) AS first_day_retention_rate
  FROM first_login_date f
  LEFT JOIN retained_players r
    ON f.player_id = r.player_id
"""
solution = pd.read_sql(query, conn)

In [28]:
# Compare the two results
are_equal = result_df.equals(solution)

# Print result based on the comparison
if are_equal:
    print("Correct!")
else:
    print("Try again!")

Try again!


## Problem Explanation

### Step 1: Find each player's first login date
```sql
SELECT
    player_id,
    MIN(login_date) AS first_login
FROM players_logins
GROUP BY player_id
```
This finds the earliest login date for each player. We group by `player_id` and use `MIN(login_date)` to determine each player's first login date.
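
As a quick check of this step, the sketch below runs the same query against a small in-memory SQLite table. The rows are hypothetical and deliberately inserted out of order. It assumes dates are stored as ISO `YYYY-MM-DD` text, which sorts chronologically, so `MIN` returns the earliest date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players_logins (player_id INTEGER, login_date TEXT)")

# Hypothetical rows, inserted out of chronological order on purpose.
conn.executemany(
    "INSERT INTO players_logins VALUES (?, ?)",
    [(1, "2024-01-02"), (1, "2024-01-01"), (2, "2024-01-07"), (2, "2024-01-05")],
)

first_logins = conn.execute(
    """
    SELECT player_id, MIN(login_date) AS first_login
    FROM players_logins
    GROUP BY player_id
    ORDER BY player_id
    """
).fetchall()
print(first_logins)  # [(1, '2024-01-01'), (2, '2024-01-05')]
```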

### Step 2: Check if the player logged in the next day
```sql
SELECT
    f.player_id
FROM first_login_date f
JOIN players_logins l
  ON f.player_id = l.player_id
 AND DATE(l.login_date) = DATE(f.first_login, '+1 day')
```
This identifies which players had at least one login exactly 1 day after their first login. We join the original logins table to the first login CTE on `player_id`, and filter for `login_date = first_login + 1 day`.
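
To see the `'+1 day'` modifier in action, here is a minimal sketch on hypothetical rows, one of which crosses a month boundary. `SELECT DISTINCT` is added here (it is not in the snippet above) so that a player with several matching rows appears only once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players_logins (player_id INTEGER, login_date TEXT)")

# Hypothetical rows for illustration only.
conn.executemany(
    "INSERT INTO players_logins VALUES (?, ?)",
    [
        (1, "2024-01-31"), (1, "2024-02-01"),  # next-day login across a month boundary
        (2, "2024-01-01"), (2, "2024-01-03"),  # second login is two days later
    ],
)

retained = conn.execute(
    """
    WITH first_login_date AS (
      SELECT player_id, MIN(login_date) AS first_login
      FROM players_logins
      GROUP BY player_id
    )
    SELECT DISTINCT f.player_id
    FROM first_login_date f
    JOIN players_logins l
      ON f.player_id = l.player_id
     AND DATE(l.login_date) = DATE(f.first_login, '+1 day')
    """
).fetchall()
print(retained)  # [(1,)]
```

SQLite's `DATE(..., '+1 day')` handles the rollover from January 31 to February 1, so only player 1 is returned.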

### Step 3: Calculate first-day retention rate
```sql
SELECT
  ROUND(CAST(COUNT(DISTINCT r.player_id) AS FLOAT) / COUNT(DISTINCT f.player_id), 4) AS first_day_retention_rate
FROM first_login_date f
LEFT JOIN retained_players r
  ON f.player_id = r.player_id
```
The numerator, `COUNT(DISTINCT r.player_id)`, counts players who logged in again the next day.  
The denominator, `COUNT(DISTINCT f.player_id)`, counts all unique players.  
`DISTINCT` keeps a player with several matching rows from being counted more than once.  
The result is the proportion of players retained on Day 1, rounded to 4 decimal places.

The `LEFT JOIN` ensures that all players are counted, even if they were not retained (so we don't accidentally filter them out).
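
Putting the three steps together, the sketch below runs the full query on the notebook's mock data using only the standard library. It assumes the dates are stored as plain `YYYY-MM-DD` strings rather than pandas datetimes, and it uses `COUNT(DISTINCT ...)` as in the notebook's solution cell.

```python
import sqlite3

# Mock rows copied from the notebook's DataFrame, with dates kept as ISO strings.
rows = [
    (1, "2024-01-01"), (1, "2024-01-02"),
    (2, "2024-01-01"),
    (3, "2024-01-01"), (3, "2024-01-03"), (3, "2024-01-04"),
    (4, "2024-01-01"), (4, "2024-01-02"),
    (5, "2024-01-01"), (5, "2024-01-03"),
    (6, "2024-01-01"), (6, "2024-01-02"), (6, "2024-01-03"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players_logins (player_id INTEGER, login_date TEXT)")
conn.executemany("INSERT INTO players_logins VALUES (?, ?)", rows)

query = """
WITH first_login_date AS (
  SELECT player_id, MIN(login_date) AS first_login
  FROM players_logins
  GROUP BY player_id
),
retained_players AS (
  SELECT f.player_id
  FROM first_login_date f
  JOIN players_logins p
    ON f.player_id = p.player_id
   AND DATE(p.login_date) = DATE(f.first_login, '+1 day')
)
SELECT
  ROUND(CAST(COUNT(DISTINCT r.player_id) AS FLOAT) / COUNT(DISTINCT f.player_id), 4)
    AS first_day_retention_rate
FROM first_login_date f
LEFT JOIN retained_players r
  ON f.player_id = r.player_id
"""

rate = conn.execute(query).fetchone()[0]
print(rate)  # 0.5 (players 1, 4 and 6 out of 6)
```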

