## SQL example

In [1]:
from dotenv import load_dotenv
import os

import pandas as pd
import pandahouse as ph

### 1. Find diligent students.

The courses on the platform consist of various lessons, each of which consists of several small tasks. Each such small task is called a "pea".

A diligent student is a user who has correctly solved at least 20 peas during the current month.
Report the number of diligent students for March 2020.

Table `peas`:
<br>
<br>`st_id` (int) - student id
<br>`timest` (timestamp) - timestamp of submitted solution
<br>`correct` (bool) - whether the solution was correct
<br>`subject` (text) - subject of the "pea"
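
The definition above can be checked offline before querying the database. The following is a minimal pandas sketch, assuming a toy `peas` frame with made-up values that mirrors the schema above; the helper `count_diligent` is hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the `peas` table; all values are made up for illustration.
peas = pd.DataFrame({
    "st_id": [1] * 21 + [2] * 5 + [3] * 20,
    "timest": pd.to_datetime(
        ["2020-03-10"] * 21 + ["2020-03-11"] * 5 + ["2020-02-15"] * 20
    ),
    "correct": [True] * 20 + [False] + [True] * 5 + [True] * 20,
    "subject": "Math",
})


def count_diligent(df, month, threshold=20):
    """Count students with at least `threshold` correct peas in `month` (e.g. "2020-03")."""
    # Keep only submissions that fall inside the requested calendar month.
    in_month = df[df["timest"].dt.to_period("M") == pd.Period(month, freq="M")]
    # Summing a boolean column counts the correct submissions per student.
    correct_per_student = in_month.groupby("st_id")["correct"].sum()
    return int((correct_per_student >= threshold).sum())


num_diligent = count_diligent(peas, "2020-03")
```

In this toy data only student 1 reaches the threshold in March: student 2 has too few correct peas, and student 3 solved theirs in February.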

In [6]:
load_dotenv()
host = os.getenv('DB_HOST')
user = os.getenv('DB_USER')
password = os.getenv('DB_PASSWORD')

In [7]:
# connect to DB

connection = dict(database='default',
                  host=host,
                  user=user,
                  password=password)
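
If one of the environment variables is missing, `os.getenv` returns `None` and the failure only surfaces later, when the query is sent. A minimal sketch of an early check follows, assuming the same three variable names as above; the helper `build_connection` is hypothetical and not part of the notebook or of pandahouse.

```python
import os

# Variable names taken from the cell above.
REQUIRED_VARS = ("DB_HOST", "DB_USER", "DB_PASSWORD")


def build_connection(env=os.environ, database="default"):
    """Build the connection dict, failing early if a setting is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return dict(
        database=database,
        host=env["DB_HOST"],
        user=env["DB_USER"],
        password=env["DB_PASSWORD"],
    )
```

Calling `build_connection()` after `load_dotenv()` would produce the same dictionary as the cell above, or raise a `RuntimeError` that names the missing variables.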

In [8]:
q = """
SELECT COUNT(st_id) AS num_of_studs
FROM
    -- one row per student with at least 20 correct peas in March 2020
    (SELECT st_id
    FROM peas
    WHERE toStartOfMonth(timest) == '2020-03-01'
    GROUP BY st_id
    HAVING SUM(correct) >= 20)
"""
hardworking_studs = ph.read_clickhouse(query=q, connection=connection)
hardworking_studs

Unnamed: 0,num_of_studs
0,0


There are no such students :(

### 2. Product metrics.

The educational platform lets students take courses on a trial basis: a student can solve only 30 peas a day for free. To solve an unlimited number of tasks in a given discipline, the student must purchase full access. The team ran an experiment that tested a new payment screen.

There are three tables:
1. `peas`

<br>`st_id` (int) - student id
<br>`timest` (timestamp) - timestamp of submitted solution
<br>`correct` (bool) - whether the solution was correct
<br>`subject` (text) - subject of the "pea"

2. `studs`

<br>`st_id` (int) - student id
<br>`test_grp` (text) - test group (control or pilot)

3. `final_project_check`

<br>`st_id` (int) - student id
<br>`sale_time` (timestamp) - timestamp of purchase
<br>`money` (int) - price of the course
<br>`subject` (text) - subject of the purchased course

Return the following metrics for each user group in a single query:
* ARPU (average revenue per user)
* ARPAU (average revenue per active user)
* CR to purchase
* CR from active user to purchase
* CR from math activity (`subject = 'Math'`) to math course purchase
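
To make the definitions concrete, here is a minimal pandas sketch of the same metrics on toy data. It assumes made-up tables with the schemas above and treats a student as "active" if they have at least one submission, which is the definition the query below uses; the helper `group_metrics` and the name `checks` (standing in for `final_project_check`) are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the three tables; all values are made up for illustration.
studs = pd.DataFrame({"st_id": [1, 2, 3, 4],
                      "test_grp": ["control", "control", "pilot", "pilot"]})
peas = pd.DataFrame({"st_id": [1, 1, 3, 4],
                     "correct": [True, False, True, True],
                     "subject": ["Math", "Math", "Math", "Stats"]})
checks = pd.DataFrame({"st_id": [1, 4], "money": [100, 300],
                       "subject": ["Math", "Stats"]})

# One row per student: revenue and activity, with 0 where nothing was recorded.
per_student = studs.drop_duplicates().set_index("st_id")
per_student["all_money"] = checks.groupby("st_id")["money"].sum()
per_student["math_money"] = checks[checks["subject"] == "Math"].groupby("st_id")["money"].sum()
per_student["all_tasks"] = peas.groupby("st_id").size()
per_student["math_tasks"] = peas[peas["subject"] == "Math"].groupby("st_id").size()
per_student = per_student.fillna(0)


def group_metrics(g):
    """Compute the five metrics for one test group."""
    active = g["all_tasks"] > 0          # at least one submission
    math_active = g["math_tasks"] > 0    # at least one math submission
    paid = g["all_money"] > 0
    return pd.Series({
        "ARPU": g["all_money"].sum() / len(g),
        "ARPAU": g.loc[active, "all_money"].sum() / active.sum(),
        "CR_purchase": paid.mean(),
        "CR_purchase_active": (paid & active).sum() / active.sum(),
        "CR_math": ((g["math_money"] > 0) & math_active).sum() / math_active.sum(),
    })


# One row per test group, one column per metric.
metrics = pd.DataFrame(
    {grp: group_metrics(g) for grp, g in per_student.groupby("test_grp")}
).T
```

The sketch does not guard against a group with no active users, in which case the ratios would divide by zero.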

In [5]:
query_metrics = """
WITH unique_users AS (
    -- one row per student with the assigned test group
    SELECT DISTINCT st_id, test_grp
    FROM studs
    ),
    all_data AS (
    SELECT a.st_id AS st_id, 
        test_grp, 
        all_money, 
        math_money,
        all_tasks,
        all_math_tasks
    FROM unique_users AS a
    LEFT JOIN (SELECT st_id, 
                    SUM(money) AS all_money, 
                    sumIf(money, subject=='Math') AS math_money 
                FROM final_project_check 
                GROUP BY st_id) AS b 
    ON a.st_id == b.st_id
    LEFT JOIN (SELECT st_id, 
                    COUNT(correct) AS all_tasks,  
                    countIf(correct, subject=='Math') AS all_math_tasks 
                FROM peas 
                GROUP BY st_id) AS c 
    ON a.st_id == c.st_id
    )
    
SELECT
test_grp,
SUM(all_money) / COUNT(st_id) as ARPU,
sumIf(all_money, all_tasks > 0) / countIf(st_id, all_tasks > 0) as ARPAU,
countIf(st_id, all_money > 0) / COUNT(st_id) as CR_purchase,
countIf(st_id, all_money > 0 and all_tasks > 0) / countIf(st_id, all_tasks > 0) as CR_purchase_active,
countIf(st_id, math_money > 0 and all_math_tasks > 0) / countIf(st_id, all_math_tasks > 0) as CR_math

FROM all_data
GROUP BY test_grp
""" 

metrics = ph.read_clickhouse(query=query_metrics, connection=connection)
metrics

Unnamed: 0,test_grp,ARPU,ARPAU,CR_purchase,CR_purchase_active,CR_math
0,control,4540.983607,8393.939394,0.04918,0.090909,0.056604
1,pilot,11508.474576,22832.167832,0.108475,0.20979,0.088889
