## Exercise 04 - A/B Testing

## Imports

In [11]:
import pandas as pd
import sqlite3

## 1. Connect to database

In [12]:
conn = sqlite3.connect('../data/checking-logs.sqlite')

## 2. Compute before/after deltas for test group

In [13]:
test_results = pd.read_sql(
    """
    WITH commits AS (
        SELECT
            t.uid,
            t.labname,
            t.first_commit_ts,
            datetime(d.deadlines, 'unixepoch') AS deadline
        FROM test t
        JOIN deadlines d ON t.labname = d.labs
        WHERE t.labname != 'project1'
    )
    SELECT
        time,
        AVG(diff_hours) AS avg_diff
    FROM (
        SELECT
            c.uid,
            CASE WHEN c.first_commit_ts >= t.first_view_ts THEN 'after' ELSE 'before' END AS time,
            (JULIANDAY(c.deadline) - JULIANDAY(c.first_commit_ts)) * 24 AS diff_hours
        FROM commits c
        JOIN test t ON c.uid = t.uid AND c.labname = t.labname
        WHERE t.first_view_ts IS NOT NULL
    )
    WHERE uid IN (
        SELECT uid
        FROM (
            SELECT
                c.uid,
                SUM(CASE WHEN c.first_commit_ts >= t.first_view_ts THEN 1 ELSE 0 END) AS after_cnt,
                SUM(CASE WHEN c.first_commit_ts < t.first_view_ts THEN 1 ELSE 0 END) AS before_cnt
            FROM commits c
            JOIN test t ON c.uid = t.uid AND c.labname = t.labname
            WHERE t.first_view_ts IS NOT NULL
            GROUP BY c.uid
            HAVING after_cnt > 0 AND before_cnt > 0
        )
    )
    GROUP BY time
    ORDER BY time DESC
    """,
    conn
)
test_results

     time    avg_diff
0  before   61.156438
1   after  105.229101
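
The `diff_hours` expression in the query can be checked in isolation. The sketch below runs the same `JULIANDAY` arithmetic against an in-memory SQLite database; the deadline and commit timestamps are hypothetical values chosen for easy arithmetic and do not come from `checking-logs.sqlite`.

```python
import sqlite3
from contextlib import closing

# Hypothetical values (assumptions for illustration, not real course data).
deadline_epoch = 172800                   # 1970-01-03 00:00:00 UTC as a Unix timestamp
first_commit_ts = "1970-01-01 12:00:00"   # a commit made 36 hours before that deadline

with closing(sqlite3.connect(":memory:")) as demo_conn:
    # Same expression as in the query: the difference in Julian days, times 24, gives hours.
    (diff_hours,) = demo_conn.execute(
        "SELECT (JULIANDAY(datetime(?, 'unixepoch')) - JULIANDAY(?)) * 24",
        (deadline_epoch, first_commit_ts),
    ).fetchone()

print(diff_hours)
```

A positive value means the commit landed before the deadline, so a larger `avg_diff` corresponds to students starting earlier.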


## 3. Compute before/after deltas for control group

In [14]:
control_results = pd.read_sql(
    """
    WITH commits AS (
        SELECT
            c.uid,
            c.labname,
            c.first_commit_ts,
            c.first_view_ts,  -- Use the column we already filled in Ex02!
            datetime(d.deadlines, 'unixepoch') AS deadline
        FROM control c
        JOIN deadlines d ON c.labname = d.labs
        WHERE c.labname != 'project1'
    )
    SELECT
        time,
        AVG(diff_hours) AS avg_diff
    FROM (
        SELECT
            c.uid,
            CASE 
                WHEN c.first_commit_ts >= c.first_view_ts THEN 'after' 
                ELSE 'before' 
            END AS time,
            (JULIANDAY(c.deadline) - JULIANDAY(c.first_commit_ts)) * 24 AS diff_hours
        FROM commits c
    )
    WHERE uid IN (
        -- Filter: Only users who have BOTH 'before' and 'after' data
        SELECT uid
        FROM (
            SELECT
                c.uid,
                SUM(CASE WHEN c.first_commit_ts >= c.first_view_ts THEN 1 ELSE 0 END) AS after_cnt,
                SUM(CASE WHEN c.first_commit_ts < c.first_view_ts THEN 1 ELSE 0 END) AS before_cnt
            FROM commits c
            GROUP BY c.uid
            HAVING after_cnt > 0 AND before_cnt > 0
        )
    )
    GROUP BY time
    ORDER BY time DESC
    """,
    conn
)

print("--- Control Group Results ---")
print(control_results)

--- Control Group Results ---
     time    avg_diff
0  before   99.901295
1   after  118.144425
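
Both queries label each commit as 'before' or 'after' and keep only users who have commits in both periods. A minimal pandas sketch of that logic on toy data may make the filter easier to follow; the user ids, timestamps, and `diff_hours` values here are hypothetical and are not rows from the `test` or `control` tables.

```python
import pandas as pd

# Hypothetical toy rows (assumptions for illustration only).
toy = pd.DataFrame(
    {
        "uid": ["user_1", "user_1", "user_2", "user_3", "user_3"],
        "first_commit_ts": pd.to_datetime(
            ["2020-04-10", "2020-04-20", "2020-04-12", "2020-04-11", "2020-04-25"]
        ),
        "first_view_ts": pd.to_datetime(
            ["2020-04-15", "2020-04-15", "2020-04-18", "2020-04-18", "2020-04-18"]
        ),
        "diff_hours": [10.0, 50.0, 30.0, 20.0, 40.0],
    }
)

# Same rule as the CASE expression: a commit at or after the first view counts as 'after'.
is_after = toy["first_commit_ts"] >= toy["first_view_ts"]
toy["time"] = is_after.map({True: "after", False: "before"})

# Same rule as the HAVING clause: keep users with at least one commit in each period.
periods_per_user = toy.groupby("uid")["time"].transform("nunique")
paired = toy[periods_per_user == 2]

# Average hours-before-deadline per period, as in the outer GROUP BY.
toy_avg = paired.groupby("time")["diff_hours"].mean()
print(toy_avg)
```

Here `user_2` has only a 'before' commit and is dropped, so the averages compare the same set of users in both periods.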


## 4. Close connection

In [15]:
conn.close()

## 5. Interpretation

**Did the hypothesis turn out to be true?**

**Yes, the hypothesis appears to be true.**

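Based on the A/B test results, students in the test group made their first commits earlier relative to the deadline after their first Newsfeed visit, and the shift was larger than the one seen in the control group.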

### Analysis:
1.  **Test Group (Saw Newsfeed):**
    *   Before the first Newsfeed visit: first commits came on average **61.16 hours** before the deadline.
    *   After the first Newsfeed visit: first commits came on average **105.23 hours** before the deadline.
    *   **Improvement:** Students started working **~44 hours earlier** after seeing the Newsfeed.

2.  **Control Group (Did NOT see Newsfeed):**
    *   "Before" the virtual first view: first commits came on average **99.90 hours** before the deadline.
    *   "After" the virtual first view: first commits came on average **118.14 hours** before the deadline.
    *   **Improvement:** Students started working **~18 hours earlier** in the later period without any Newsfeed exposure.

### Conclusion:
While both groups showed improvement (likely due to gaining experience or anxiety about tougher deadlines), the **Test Group's improvement (+44h) was more than double that of the Control Group (+18h)**, a net difference of roughly 26 hours. This suggests that the Newsfeed, plausibly through peer pressure, motivated students to start their labs earlier.
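
The comparison between the two groups can be reproduced from the averages reported in sections 2 and 3. The sketch below is plain arithmetic on those four numbers (a difference-in-differences); the variable names are illustrative.

```python
# Averages reported above, in hours before the deadline.
test_before, test_after = 61.156438, 105.229101
control_before, control_after = 99.901295, 118.144425

# Per-group shift: how much earlier students committed in the 'after' period.
test_delta = test_after - test_before
control_delta = control_after - control_before

# Net effect attributable to the test condition, and the ratio of the two shifts.
net_effect = test_delta - control_delta
ratio = test_delta / control_delta

print(f"test: +{test_delta:.2f} h, control: +{control_delta:.2f} h")
print(f"net effect: {net_effect:.2f} h, ratio: {ratio:.2f}")
```

Subtracting the control group's shift removes the change that occurred without the Newsfeed, which leaves about 25.8 hours for the test condition.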