### Used Libraries<a class="anchor" id="chapter1"></a>

In [1]:
import pandas as pd
from sqlalchemy import create_engine
import psycopg2

### Access to the DB <a class="anchor" id="chapter2"></a>

In [2]:
db_config = {'user': 'practicum_student',         # username
             'pwd': 's65BlTKV3faNIGhmvJVzOqhs', # password
             'host': 'rc1b-wcoijxj3yxfsf3fs.mdb.yandexcloud.net',
             'port': 6432,              # connection port
             'db': 'data-analyst-eth-payouts-db'}          # the name of the database

connection_string = 'postgresql://{}:{}@{}:{}/{}'.format(
    db_config['user'],
    db_config['pwd'],
    db_config['host'],
    db_config['port'],
    db_config['db'],
)

engine = create_engine(connection_string, connect_args={'sslmode':'require'})
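
If a password ever contains characters such as `@`, `/` or `:`, a plain `format` call produces a URL that cannot be parsed correctly. A minimal sketch of one way to guard against that, using `urllib.parse.quote_plus` from the standard library. The helper name `build_connection_string` and the credentials in `example_config` are made up for illustration and are not the real ones:

```python
from urllib.parse import quote_plus

# Hypothetical credentials, used only to illustrate the escaping.
example_config = {'user': 'some_user',
                  'pwd': 'p@ss/word:1',
                  'host': 'localhost',
                  'port': 6432,
                  'db': 'example-db'}


def build_connection_string(cfg):
    """Build a PostgreSQL URL, percent-encoding the user name and password."""
    return 'postgresql://{}:{}@{}:{}/{}'.format(quote_plus(cfg['user']),
                                                quote_plus(cfg['pwd']),
                                                cfg['host'],
                                                cfg['port'],
                                                cfg['db'])


example_url = build_connection_string(example_config)
print(example_url)
```

The resulting string can be passed to `create_engine` in the same way as `connection_string`.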


## The Database

### payout table:
**user_id:** user's id<br>

**eth_address:** the ETH address the user used to receive the payout in Ethereum (a user can have multiple addresses)<br>

**date:** date of payout to user<br>

**payout:** amount that was paid to user<br>


### plan table:
**user_id:** user's id<br>

**"OS":** Operating system of user<br>

**"Plan":** user's plan on the site

# Table Queries <a class="anchor" id="chapter3"></a>

A general-purpose helper function that takes a query and returns the result as a DataFrame:

In [3]:
def queryResult(q):
    """Run the SQL query `q` against the database and return a DataFrame."""
    return pd.read_sql(q, con=engine)

### 1. How many users got paid?

In [4]:
query = '''
        SELECT
            COUNT(DISTINCT user_id) AS user_cnt
        FROM
            payout
        WHERE
            payout > 0;
        '''
print('The number of users who got paid:')
queryResult(query)

The number of users who got paid:


Unnamed: 0,user_cnt
0,433


----

### 2. Show the 5 users with the highest payouts

In [5]:
query = '''
        SELECT
            user_id,
            SUM(payout) AS total_payout
        FROM
            payout
        GROUP BY
            user_id
        ORDER BY
            total_payout DESC
        LIMIT 5;
        '''
print('The 5 users with the highest payouts are:')
queryResult(query)

The 5 users with the highest payouts are:


Unnamed: 0,user_id,total_payout
0,1537,10.74169
1,3051,9.73121
2,1512,8.80816
3,1127,8.78861
4,4848,8.42445


----

### 3. Show the 5 users with the lowest payouts

In [6]:
query = '''
        SELECT
            user_id,
            SUM(payout) AS total_payout
        FROM
            payout
        GROUP BY
            user_id
        ORDER BY
            total_payout
        LIMIT 5;
        '''
print('The 5 users with the lowest payouts are:')
queryResult(query)

The 5 users with the lowest payouts are:


Unnamed: 0,user_id,total_payout
0,2003,0.00775
1,3410,0.00818
2,2813,0.02914
3,2462,0.03381
4,4467,0.05056


----

### 4. How much ether was paid out in November 2020?

In [7]:
query = '''
        SELECT
            SUM(payout) AS eth_sum
        FROM
            payout
        WHERE
            CAST(date AS date) BETWEEN '2020-11-01' AND '2020-11-30'
        
        '''
print('The total amount of ether paid in November 2020:')
queryResult(query)

The total amount of ether paid in November 2020:


Unnamed: 0,eth_sum
0,166.0118


----

### 5. Which plan is the most popular?

In [8]:
query = '''
        SELECT
            "Plan",
            COUNT(DISTINCT user_id) AS user_count
        FROM
            plan
        GROUP BY
            "Plan"
        ORDER BY
           user_count DESC; 
        '''
print('The number of users per plan:')
queryResult(query)
#print('The free plan is more popular but not by a large margin.')

The number of users per plan:


Unnamed: 0,Plan,user_count
0,Free,220
1,Premium,213


The **Free plan is more popular** (220 vs. 213 users), but not by a large margin.

**Q:** Why is the query result not shown when I call `print` after the function that returns it? <br>
And how can I extract only the name of the plan at index 0? For example, something like query[0,0].
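
On the question above: a notebook cell displays only the value of its last expression, so when a `print(...)` call comes after `queryResult(query)`, the DataFrame is computed but never shown. Assigning the result to a variable and printing it explicitly avoids this. A single cell can then be read with `.loc[row_label, column_name]` or `.iloc[row_position, column_position]`; `query[0, 0]` cannot work because `query` is only the SQL string. The sketch below uses a hand-built DataFrame that mirrors the output shown above instead of a live database call, and the name `plans` is only illustrative:

```python
import pandas as pd

# Stand-in for `plans = queryResult(query)`, mirroring the output above.
plans = pd.DataFrame({'Plan': ['Free', 'Premium'],
                      'user_count': [220, 213]})

print('Users per plan:')
print(plans)                       # an explicit print works anywhere in the cell

top_plan = plans.loc[0, 'Plan']    # by row label and column name
same_plan = plans.iloc[0, 0]       # by row position and column position
print('The most popular plan is', top_plan)
```

Because the query orders by `user_count DESC`, row 0 is the most popular plan.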

----

### 6. Which plan is the most popular amongst Linux users?

In [9]:
query = '''
        SELECT
            "Plan",
            COUNT(DISTINCT user_id) AS user_count
        FROM
            plan
        WHERE
            "OS" = 'Linux'
        GROUP BY
            "Plan"
        ORDER BY
           user_count DESC;
        
        '''
print('The number of users per plan among Linux users:')
queryResult(query)
#print('The Premium plan is more popular among Linux users but not by a large margin.')

The number of users per plan among Linux users:


Unnamed: 0,Plan,user_count
0,Premium,76
1,Free,68


The **Premium plan is more popular** among Linux users (76 vs. 68 users), but not by a large margin.

----

### 7. What is the percentage of payout between the different plans?

In [10]:
query = '''
        SELECT
            DISTINCT "Plan",
            SUM(payout.payout) OVER(PARTITION BY "Plan") AS payout_plan,
            SUM(payout.payout) OVER() AS total_payout,
            SUM(payout.payout) OVER(PARTITION BY "Plan") / SUM(payout.payout) OVER() * 100 AS percentage
        FROM
            payout
            LEFT JOIN plan ON plan.user_id = payout.user_id
        ORDER BY
            percentage DESC;
        '''
print('The percentage of payouts for each plan:')
queryResult(query)

The percentage of payouts for each plan:


Unnamed: 0,Plan,payout_plan,total_payout,percentage
0,Free,488.68999,905.45177,53.971951
1,Premium,416.76178,905.45177,46.028049


The **Free plan accounts for a larger share of the payouts** (about 54% vs. 46% for Premium), but not by a large margin.
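
The same percentages can also be computed with `GROUP BY` and a scalar subquery instead of `DISTINCT` plus window functions, which some readers find easier to follow. The sketch below is only an illustration under stated assumptions: it runs against an in-memory SQLite database with a few made-up rows rather than the PostgreSQL database used in this notebook, and it merely mimics the column names of the two tables:

```python
import sqlite3

# In-memory toy database; the rows are invented for illustration.
conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE payout (user_id INTEGER, date TEXT, payout REAL);
    CREATE TABLE plan (user_id INTEGER, "OS" TEXT, "Plan" TEXT);
    INSERT INTO payout VALUES (1, '2020-07-01', 3.0), (1, '2020-07-02', 1.0),
                              (2, '2020-07-01', 4.0), (3, '2020-07-02', 2.0);
    INSERT INTO plan VALUES (1, 'Linux', 'Free'), (2, 'Windows', 'Premium'),
                            (3, 'MAC', 'Free');
''')

query = '''
        SELECT
            plan."Plan",
            SUM(payout.payout) AS payout_plan,
            SUM(payout.payout) * 100.0
                / (SELECT SUM(payout) FROM payout) AS percentage
        FROM
            payout
            LEFT JOIN plan ON plan.user_id = payout.user_id
        GROUP BY
            plan."Plan"
        ORDER BY
            percentage DESC;
        '''
rows = conn.execute(query).fetchall()
for plan_name, payout_plan, percentage in rows:
    print(plan_name, payout_plan, percentage)
```

With the toy rows, the Free plan receives 6.0 of the 10.0 paid out in total, so the two percentages add up to 100.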

----

### 8. Users of which operating system earned more in payouts?

In [11]:
query = '''
        SELECT
            DISTINCT "OS",
            SUM(payout.payout) OVER(partition BY "OS") AS payout_os
            
        FROM
            payout
            LEFT JOIN plan ON plan.user_id = payout.user_id
        ORDER BY
            payout_os DESC;
        '''
print('The total payouts for each OS:')
queryResult(query)

The total payouts for each OS:


Unnamed: 0,OS,payout_os
0,Linux,322.00353
1,Windows,316.03701
2,MAC,267.41123


**Linux users earned the most in payouts** (about 322 ether), though only slightly more than Windows users (about 316 ether).

----

### 9. What is the average payout amount per user for each of the OS in July 2020?

In [12]:
query = '''
        SELECT
            DISTINCT "OS",
            AVG(payout.payout) OVER(partition BY "OS") AS payout_plan
            
        FROM
            payout
            LEFT JOIN plan ON plan.user_id = payout.user_id
        WHERE
            CAST(payout.date AS date) BETWEEN '2020-07-01' AND '2020-07-31'
        ORDER BY
            payout_plan DESC;
        '''
print('The average payout for each OS in July 2020:')
queryResult(query)

The average payout for each OS in July 2020:


Unnamed: 0,OS,payout_plan
0,Linux,1.931114
1,MAC,1.640537
2,Windows,1.5906


**Linux users had the highest average payout** in July 2020 (about 1.93 ether), ahead of MAC and Windows users.
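
Note that `AVG(payout.payout)` averages over payout rows, so a user who was paid several times in the month contributes several values. If "per user" is instead read as the total paid divided by the number of distinct users, the figure can differ. A small sketch of that difference, using an in-memory SQLite database with invented rows (not the real data); the column aliases and the half-open date range are illustrative choices, not the notebook's original query:

```python
import sqlite3

# In-memory toy database; the rows are invented for illustration.
conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE payout (user_id INTEGER, date TEXT, payout REAL);
    CREATE TABLE plan (user_id INTEGER, "OS" TEXT, "Plan" TEXT);
    INSERT INTO payout VALUES (1, '2020-07-03', 1.0), (1, '2020-07-20', 3.0),
                              (2, '2020-07-10', 2.0), (3, '2020-07-15', 5.0),
                              (1, '2020-08-01', 9.0);
    INSERT INTO plan VALUES (1, 'Linux', 'Free'), (2, 'Linux', 'Premium'),
                            (3, 'Windows', 'Free');
''')

query = '''
        SELECT
            plan."OS",
            AVG(payout.payout) AS avg_per_payout,
            SUM(payout.payout) / COUNT(DISTINCT payout.user_id) AS avg_per_user
        FROM
            payout
            LEFT JOIN plan ON plan.user_id = payout.user_id
        WHERE
            payout.date >= '2020-07-01' AND payout.date < '2020-08-01'
        GROUP BY
            plan."OS"
        ORDER BY
            avg_per_user DESC;
        '''
rows = conn.execute(query).fetchall()
for os_name, avg_per_payout, avg_per_user in rows:
    print(os_name, avg_per_payout, avg_per_user)
```

In the toy data, Linux user 1 is paid twice in July, so the per-payout average for Linux (2.0) is lower than the per-user average (3.0).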

----

### 10. What is the daily share of ether earned by users from Linux that are in the free plan in this data?

In [13]:
query = '''
    SELECT 
        DISTINCT x.user_id,
        x.date, 
        x.user_sum,
        y.total_date,
        x.user_sum/y.total_date AS daily_share 
    FROM
        (
            SELECT
                payout.user_id,
                date,
                SUM(payout) OVER (partition BY payout.user_id, date) AS user_sum
            FROM 
                payout 
                LEFT JOIN plan ON plan.user_id= payout.user_id
            WHERE
                "Plan"='Free' AND "OS"='Linux'
        ) AS x
        
        LEFT JOIN 
            (
            SELECT 
                date,
                SUM(payout) OVER(partition BY date) AS total_date
            FROM 
                payout 
                LEFT JOIN plan ON plan.user_id= payout.user_id
            ) AS y ON x.date=y.date
    ORDER BY 
        x.date

        '''
print('The daily share of payouts for each Linux user on the Free plan:')
queryResult(query)

The daily share of payouts for each Linux user on the Free plan:


Unnamed: 0,user_id,date,user_sum,total_date,daily_share
0,3415,2020-07-07,2.50566,6.94641,0.360713
1,3146,2020-07-08,1.39229,4.45935,0.312218
2,2154,2020-07-11,1.72359,4.72946,0.364437
3,4235,2020-07-13,3.13763,7.24881,0.432848
4,3542,2020-07-14,0.67543,5.77610,0.116935
...,...,...,...,...,...
94,2935,2021-01-02,0.30163,3.23708,0.093180
95,1330,2021-01-03,1.51269,5.85108,0.258532
96,2128,2021-01-04,0.68610,7.50206,0.091455
97,4131,2021-01-05,2.36296,9.85304,0.239820


In [34]:
query = '''
    SELECT
        payout.date,
        SUM(payout) OVER (partition BY payout.user_id, date) / 
        SUM(payout) OVER(partition BY date) * 100 AS daily_pay_share
        
    FROM 
        payout 
        LEFT JOIN plan ON plan.user_id = payout.user_id
    WHERE
        plan."Plan" = 'Free'
        AND
        plan."OS" = 'Linux'
    GROUP BY
        payout.date,
        payout.user_id,
        payout.payout



        '''
print('The daily percentage of payouts for Linux users in Free plan:')
queryResult(query)

The daily percentage of payouts for Linux users in Free plan:


Unnamed: 0,date,daily_pay_share
0,2020-07-07,100.000000
1,2020-07-08,100.000000
2,2020-07-11,100.000000
3,2020-07-13,100.000000
4,2020-07-14,100.000000
...,...,...
94,2021-01-02,100.000000
95,2021-01-03,100.000000
96,2021-01-04,100.000000
97,2021-01-05,36.494422
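
In the second attempt the `WHERE` clause removes all other users before the window functions run, so the denominator only sees Linux/Free rows, which is why most days show 100%. If the question is read as "what fraction of each day's total payouts went to Linux users on the Free plan", one possible approach is conditional aggregation: keep every row and put the condition inside the numerator. A sketch under that assumption, run on an in-memory SQLite database with made-up rows rather than the real data; the alias `linux_free_sum` is hypothetical:

```python
import sqlite3

# In-memory toy database; the rows are invented for illustration.
conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE payout (user_id INTEGER, date TEXT, payout REAL);
    CREATE TABLE plan (user_id INTEGER, "OS" TEXT, "Plan" TEXT);
    INSERT INTO payout VALUES (1, '2020-07-07', 2.0), (2, '2020-07-07', 6.0),
                              (3, '2020-07-08', 1.0), (1, '2020-07-08', 1.0),
                              (4, '2020-07-08', 2.0);
    INSERT INTO plan VALUES (1, 'Linux', 'Free'), (2, 'Windows', 'Premium'),
                            (3, 'Linux', 'Premium'), (4, 'Linux', 'Free');
''')

query = '''
    SELECT
        payout.date,
        SUM(CASE WHEN plan."Plan" = 'Free' AND plan."OS" = 'Linux'
                 THEN payout.payout ELSE 0.0 END) AS linux_free_sum,
        SUM(payout.payout) AS total_date,
        SUM(CASE WHEN plan."Plan" = 'Free' AND plan."OS" = 'Linux'
                 THEN payout.payout ELSE 0.0 END)
            / SUM(payout.payout) AS daily_share
    FROM
        payout
        LEFT JOIN plan ON plan.user_id = payout.user_id
    GROUP BY
        payout.date
    ORDER BY
        payout.date;
        '''
rows = conn.execute(query).fetchall()
for date, linux_free_sum, total_date, daily_share in rows:
    print(date, linux_free_sum, total_date, daily_share)
```

Each date appears once, and the share is measured against all payouts of that day, not only those of Linux/Free users.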
