# Window Functions with Postgres

## Create a fake clickstream table

```sql
drop table if exists clickstream;
create table clickstream (
    eventId varchar(40),
    userId int,
    sessionId int,
    actionType varchar(8),
    datetimeCreated timestamp
);
INSERT INTO clickstream(eventId, userId, sessionId, actionType, datetimeCreated )
VALUES 
('6e598ae5-3fb1-476d-9787-175c34dcfeff',1 ,1000,'click','2020-11-25 12:40:00'),
('0c66cf8c-0c00-495b-9386-28bc103364da',1 ,1000,'login','2020-11-25 12:00:00'),
('58c021ad-fcc8-4284-a079-8df0d51601a5',1 ,1000,'click','2020-11-25 12:10:00'),
('85eef2be-1701-4f7c-a4f0-7fa7808eaad1',1 ,1001,'buy',  '2020-11-22 18:00:00'),
('08dd0940-177c-450a-8b3b-58d645b8993c',3 ,1010,'buy',  '2020-11-20 01:00:00'),
('db839363-960d-4319-860d-2c9b34558994',10,1120,'click','2020-11-01 13:10:03'),
('2c85e01d-1ed4-4ec6-a372-8ad85170a3c1',10,1121,'login','2020-11-03 18:00:00'),
('51eec51c-7d97-47fa-8cb3-057af05d69ac',8 ,6,   'click','2020-11-10 10:45:53'),
('5bbcbc71-da7a-4d75-98a9-2e9bfdb6f925',3 ,3002,'login','2020-11-14 10:00:00'),
('f3ee0c19-a8f9-4153-b34e-b631ba383fad',1 ,90,  'buy',  '2020-11-17 07:00:00'),
('f458653c-0dca-4a59-b423-dc2af92548b0',2 ,2000,'buy',  '2020-11-20 01:00:00'),
('fd03f14d-d580-4fad-a6f1-447b8f19b689',2 ,2000,'click','2020-11-20 00:00:00');
```

> Note: We use the term user session to mean the period from when a user logs in until they log out. The session id is the same for every event between a user's login and the corresponding logout.

## Partition by

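The clickstream table stores events from our app such as login, click, buy, sell, return, and logout (the sample rows above contain only login, click, and buy). The query below numbers the events within each userId and sessionId partition by event creation time (datetimeCreated). Because the ordering is descending, the most recent event in each session gets eventOrder 1.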

```sql
select eventId,
    userId,
    sessionId,
    actionType,
    datetimeCreated,
    ROW_NUMBER() OVER(
        PARTITION BY userId,
        sessionId
        ORDER BY datetimeCreated DESC
    ) as eventOrder
from clickstream;
```
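As a quick check of the numbering, the sketch below restricts the same query to a single partition from the sample data (userId 1, sessionId 1000). The WHERE filter is applied before the window function runs, so only that session's three rows are numbered. The commented output assumes the table holds exactly the rows inserted above.

```sql
select actionType,
    datetimeCreated,
    ROW_NUMBER() OVER(
        PARTITION BY userId,
        sessionId
        ORDER BY datetimeCreated DESC
    ) as eventOrder
from clickstream
where userId = 1
    and sessionId = 1000
order by eventOrder;
-- actiontype | datetimecreated     | eventorder
-- click      | 2020-11-25 12:40:00 | 1
-- click      | 2020-11-25 12:10:00 | 2
-- login      | 2020-11-25 12:00:00 | 3
```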

## Lead and Lag

LEAD and LAG let you perform calculations using data from other rows. LEAD reads a value from a row after the current row and LAG reads a value from a row before it, where "after" and "before" are defined by the ORDER BY clause inside OVER.

Lead and lag can be used to calculate the time difference between events in a given user session (partition). In the example below, we use lead and lag to get the times of the next and previous events in a user session. The first event of a session has no previous event and the last has no next event, so those values are NULL.

```sql
select eventId,
    userId,
    sessionId,
    actionType,
    datetimeCreated,
    LEAD(datetimeCreated, 1) OVER(
        PARTITION BY userId,
        sessionId
        ORDER BY datetimeCreated
    ) as nextEventTime,
    LAG(datetimeCreated, 1) OVER(
        PARTITION BY userId,
        sessionId
        ORDER BY datetimeCreated
    ) as prevEventTime
from clickstream;
```
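The query above returns the neighboring timestamps but not the difference itself. A minimal sketch of that calculation follows: subtracting two timestamps in Postgres yields an interval. The column names timeSincePrevEvent and timeToNextEvent are made up for this illustration, and the named WINDOW clause is only a way to avoid repeating the same partition definition twice.

```sql
select eventId,
    userId,
    sessionId,
    actionType,
    datetimeCreated,
    -- timestamp - timestamp gives an interval; NULL for the first event in a session
    datetimeCreated - LAG(datetimeCreated, 1) OVER w as timeSincePrevEvent,
    -- NULL for the last event in a session
    LEAD(datetimeCreated, 1) OVER w - datetimeCreated as timeToNextEvent
from clickstream
WINDOW w as (
    PARTITION BY userId,
    sessionId
    ORDER BY datetimeCreated
);
```

With the sample rows, the three events of userId 1, sessionId 1000 get timeSincePrevEvent values of NULL, 00:10:00, and 00:30:00.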

## Rolling Window

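We can use window functions without a PARTITION BY clause to simulate a rolling window over all the rows. Let’s say we want to count the buy events among the 5 rows that come before the current row across all users, excluding the current row itself. Note that the query below orders by datetimeCreated DESC, so the 5 preceding rows are the 5 events that occurred after the current one in time.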

```sql
select eventId,
    userId,
    sessionId,
    actionType,
    datetimeCreated,
    SUM(
        CASE
            WHEN actionType = 'buy' THEN 1
            ELSE 0
        END
    ) OVER(
        ORDER BY datetimeCreated DESC ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING
    ) as num_purchases
from clickstream;
```

The window frame starts 5 rows before the current row (5 PRECEDING) and ends at the row just before it (1 PRECEDING), so the current row is excluded. num_purchases is calculated this way for each row in the result set. For the first row in the ordering the frame is empty, so SUM returns NULL.
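If the intent is to look at the 5 events that happened earlier in time than the current one, a variant could order ascending instead. The sketch below is an alternative, not the query above: it adds eventId as a tie-breaker (an assumption, since two sample events share the same timestamp) and uses COUNT with a FILTER clause, which returns 0 rather than NULL when the frame is empty. The column name num_prior_purchases is made up for this illustration.

```sql
select eventId,
    userId,
    sessionId,
    actionType,
    datetimeCreated,
    -- count only the buy events inside the frame
    COUNT(*) FILTER (
        WHERE actionType = 'buy'
    ) OVER(
        -- ascending order: PRECEDING rows are earlier in time;
        -- eventId makes the order deterministic when timestamps tie
        ORDER BY datetimeCreated,
        eventId ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING
    ) as num_prior_purchases
from clickstream;
```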

Let’s write a query to check whether the current, previous, or next event was a buy event. neighborBuy is 1 if any of the three was a buy and 0 otherwise.

```sql
select eventId,
    userId,
    sessionId,
    actionType,
    datetimeCreated,
    MAX(
        CASE
            WHEN actionType = 'buy' THEN 1
            ELSE 0
        END
    ) OVER(
        ORDER BY datetimeCreated DESC ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
    ) as neighborBuy
from clickstream;
```
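The same check can be written so that it returns a boolean directly. The sketch below swaps MAX over a CASE expression for the Postgres aggregate bool_or, which can be used as a window function like any other aggregate; the frame is unchanged.

```sql
select eventId,
    userId,
    sessionId,
    actionType,
    datetimeCreated,
    -- true if any row in the frame (previous, current, next) is a buy event
    bool_or(actionType = 'buy') OVER(
        ORDER BY datetimeCreated DESC ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
    ) as neighborBuy
from clickstream;
```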

## Efficiency Considerations

Window functions can be expensive. Use EXPLAIN to see the query plan; this helps when using window functions in low-latency situations.

For example, if you only want the latest event per session, it might be beneficial to use another technique, such as the GROUP BY shown in the second query below. Compare its plan with that of the window function version:

```sql
EXPLAIN
select *
from (
        select userId,
            sessionId,
            datetimeCreated,
            ROW_NUMBER() OVER(
                PARTITION BY userId,
                sessionId
                ORDER BY datetimeCreated DESC
            ) as eventOrder
        from clickstream
    ) as t
where t.eventOrder = 1;
```

```sql
EXPLAIN
select userId,
    sessionId,
    max(datetimeCreated) as datetimeCreated
from clickstream
group by userId,
    sessionId;
```
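The GROUP BY version returns only the grouping columns and the latest timestamp. If the whole latest row is needed, one Postgres-specific option is DISTINCT ON, sketched below. This is an illustration rather than a benchmarked recommendation: the index name is hypothetical, and whether the planner uses the index depends on table size and statistics (on a 12-row table it will most likely scan the table regardless).

```sql
-- Postgres-specific: keep the first row of each (userId, sessionId) group
-- according to the ORDER BY, which here is the most recent event.
EXPLAIN
select distinct on (userId, sessionId) *
from clickstream
order by userId,
    sessionId,
    datetimeCreated DESC;

-- Hypothetical index (the name is an assumption) whose column order matches
-- the partitioning and ordering used by the queries in this section.
create index if not exists clickstream_user_session_created_idx
    on clickstream (userId, sessionId, datetimeCreated DESC);
```

Plain EXPLAIN shows only the planner's estimates. EXPLAIN ANALYZE actually runs the query and reports measured timings, which is more useful when comparing the alternatives on realistic data volumes.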