## How to sessionize

This notebook demonstrates sessionization analysis in DuckDB.
The basic concept (in SQL) is explained in the following two blog posts:
- [Hands-On Tutorial: Sessionization in SQL, Hive, Python, and Pig](https://knowledge.dataiku.com/latest/courses/advanced-code/python/sessionization.html)
- [Sessionizing Log Data Using SQL](https://randyzwitch.com/sessionizing-log-data-sql/) 
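
Before turning to SQL, the idea can be sketched in plain Python: sort each user's events by time and start a new session whenever the gap since the previous event reaches a threshold. This is only a minimal illustration; the toy `events` list, the `sessionize` helper, and the 10-minute `SESSION_GAP` are assumptions made up for this sketch and are not taken from the blog posts.

```python
from datetime import datetime, timedelta
from itertools import groupby

SESSION_GAP = timedelta(minutes=10)  # hypothetical inactivity threshold

# Toy (user_id, timestamp) events, made up for illustration.
events = [
    ("a", datetime(2008, 1, 1, 0, 0)),
    ("a", datetime(2008, 1, 1, 0, 5)),
    ("a", datetime(2008, 1, 1, 0, 30)),
    ("b", datetime(2008, 1, 1, 0, 2)),
    ("b", datetime(2008, 1, 1, 0, 11)),
]


def sessionize(events, gap=SESSION_GAP):
    """Return (user_id, timestamp, session_id) rows, one per event."""
    rows = []
    ordered = sorted(events)  # sort by user, then by time
    for user, user_events in groupby(ordered, key=lambda e: e[0]):
        session_no = 0
        previous = None
        for _, ts in user_events:
            # A gap of at least `gap` since the previous event opens a new session.
            if previous is not None and ts - previous >= gap:
                session_no += 1
            rows.append((user, ts, f"{user}_{session_no}"))
            previous = ts
    return rows


session_ids = [row[2] for row in sessionize(events)]
print(session_ids)  # ['a_0', 'a_0', 'a_1', 'b_0', 'b_0']
```

The SQL versions in this notebook follow the same three steps: compute the gap with `LAG`, flag gaps that reach the threshold, and take a running sum of the flags per user.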

We use the same dataset as the following blog so that one can easily compare the performance of DuckDB, polars, and pandas.
- Author: https://gist.github.com/koaning
- Blog: https://calmcode.io/polars/introduction.html
- Notebook: https://gist.github.com/koaning/5a0f3f27164859c42da5f20148ef3856
- Dataset: https://www.kaggle.com/datasets/mylesoneill/warcraft-avatar-history?resource=download&select=wowah_data.csv



In [1]:
import os
import glob
from time import time
from pathlib import Path
import polars as pl
import pandas as pd
import duckdb
import sqlite3

In [2]:
datafile = "../data/kaggle/wowah_data.csv"

## DuckDB

In [13]:
conn = duckdb.connect() # create an in-memory database

### Peek at the header and a small sample of data

In [3]:
MAX_ROWS = -1  # -1 reads all rows; set to e.g. 1000 during development

limit_clause = f" LIMIT {MAX_ROWS}" if MAX_ROWS > 0 else " "

sql_cmd = f"""
	SELECT *
	FROM read_csv('{datafile}', AUTO_DETECT=TRUE, ALL_VARCHAR=1)
	{limit_clause}
"""
ts_start = time()
df = conn.execute(sql_cmd).df()
ts_stop = time()
print(f"Elapsed time = {(ts_stop-ts_start):.2f}")

Elapsed time = 22.35


**Note:**

The elapsed time of 22.35 sec covers both reading the CSV and converting the result to a pandas DataFrame with `.df()`.

In comparison, polars (reading the same CSV) took 0.928 sec;
see [wowah-polars.ipynb](https://github.com/wgong/py4kids/blob/master/lesson-14.6-polars/polars-cookbook/cookbook/wowah-polars.ipynb).
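
The `ts_start` / `ts_stop` pattern is repeated in most cells of this notebook. One possible way to keep the timings comparable is a small context manager; the sketch below is only a suggestion, and the `timed` helper and the `timings` dictionary are hypothetical names that are not part of the original notebook.

```python
from contextlib import contextmanager
from time import perf_counter


@contextmanager
def timed(label, results=None):
    """Print the wall-clock time of the enclosed block and optionally record it."""
    start = perf_counter()
    try:
        yield
    finally:
        elapsed = perf_counter() - start
        if results is not None:
            results[label] = elapsed  # keep the number for later comparison
        print(f"{label}: elapsed time = {elapsed:.2f} sec")


timings = {}
with timed("sum of squares", timings):
    total = sum(i * i for i in range(1000))  # stand-in for a query or CSV read
```

In the cells of this notebook, the body of the `with` block would be the call being measured, for example `df = conn.execute(sql_cmd).df()`.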

In [4]:
df.shape
# (10826734, 7)

(10826734, 7)

In [5]:
df.columns

Index(['char', 'level', 'race', 'charclass', 'zone', 'guild', 'timestamp'], dtype='object')

In [6]:
df.head(5)

Unnamed: 0,char,level,race,charclass,zone,guild,timestamp
0,59425,1,Orc,Rogue,Orgrimmar,165,01/01/08 00:02:04
1,65494,9,Orc,Hunter,Durotar,-1,01/01/08 00:02:04
2,65325,14,Orc,Warrior,Ghostlands,-1,01/01/08 00:02:04
3,65490,18,Orc,Hunter,Ghostlands,-1,01/01/08 00:02:04
4,2288,60,Orc,Hunter,Hellfire Peninsula,-1,01/01/08 00:02:09


### Create a table

In [7]:
table_name = "web_logs"

sql_cmd = f"""
CREATE OR REPLACE TABLE {table_name} AS
	SELECT
		"char"::INTEGER AS char_id,
        level::INT as level,
		race AS race,
		"charclass" AS char_class,
		"zone" AS zone,
        guild::INT as guild,
		strptime("timestamp", '%m/%d/%y %H:%M:%S')::DATETIME as timestamp
	FROM df
	WHERE
        guild != -1  -- keep only rows whose guild is not -1
"""



ts_start = time()
conn.execute(sql_cmd)
ts_stop = time()
print(f"Elapsed time = {(ts_stop-ts_start):.2f}")

Elapsed time = 0.73
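
The format string passed to `strptime` above uses the directives `%m/%d/%y %H:%M:%S`, which Python's `datetime.strptime` also understands. As a quick sanity check of the format, here is a minimal sketch that parses the first timestamp from the sample shown earlier; the variable names are illustrative only.

```python
from datetime import datetime

FMT = "%m/%d/%y %H:%M:%S"  # month/day/two-digit year, then 24-hour time

raw = "01/01/08 00:02:04"  # first timestamp in the sample shown earlier
parsed = datetime.strptime(raw, FMT)
print(parsed.isoformat(sep=" "))  # 2008-01-01 00:02:04
```

The two-digit year `08` is read as 2008, which matches the `2008-01-01 00:02:04` values in the table created above.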


In [8]:
sql_cmd = f"""
    SELECT COUNT(*) FROM {table_name} 
"""
conn.execute(sql_cmd).df()

Unnamed: 0,count_star()
0,9132798


In [9]:
sql_cmd = f"""
    SELECT * FROM {table_name} limit 10
"""
conn.execute(sql_cmd).df()

Unnamed: 0,char_id,level,race,char_class,zone,guild,timestamp
0,59425,1,Orc,Rogue,Orgrimmar,165,2008-01-01 00:02:04
1,61239,68,Orc,Hunter,Blade's Edge Mountains,243,2008-01-01 00:02:14
2,59772,69,Orc,Warrior,Shadowmoon Valley,35,2008-01-01 00:02:14
3,22937,69,Orc,Rogue,Warsong Gulch,243,2008-01-01 00:02:14
4,23062,69,Orc,Shaman,Shattrath City,103,2008-01-01 00:02:14
5,48432,70,Orc,Warrior,Blade's Edge Mountains,79,2008-01-01 00:02:19
6,582,70,Orc,Warrior,Sethekk Halls,19,2008-01-01 00:02:19
7,33256,70,Orc,Warrior,Orgrimmar,53,2008-01-01 00:02:19
8,22307,70,Orc,Warrior,Orgrimmar,174,2008-01-01 00:02:19
9,22466,70,Orc,Warrior,Undercity,101,2008-01-01 00:02:19


### Sessionize

In [11]:
SESSION_SPAN = 10 * 60  # session timeout: 10 minutes, expressed in seconds
sql_cmd = f"""
WITH diff as (
    SELECT * 
    ,  EXTRACT(EPOCH FROM timestamp)  -- seconds since EPOCH
        - LAG(EXTRACT(EPOCH FROM timestamp))
       OVER (PARTITION BY char_id ORDER BY timestamp) AS ts_diff
    ,  char_id - LAG(char_id)
       OVER (PARTITION BY char_id ORDER BY timestamp) AS char_diff
    FROM {table_name} 
    order by char_id, timestamp
), session as (
    select * 
    , case when (ts_diff >= {SESSION_SPAN}) --or (char_diff > 0)
           then 1
           else 0    
      end as new_session
    from diff
    order by char_id, timestamp
)
select * 
, char_id || '_' || SUM(new_session)
  OVER (PARTITION BY char_id ORDER BY timestamp) AS session_id
from session
{limit_clause}
"""


ts_start = time()
df = conn.execute(sql_cmd).df()
ts_stop = time()
print(f"Elapsed time = {(ts_stop-ts_start):.2f}")

df.head(100)

Elapsed time = 24.66


Unnamed: 0,char_id,level,race,char_class,zone,guild,timestamp,ts_diff,char_diff,new_session,session_id
0,19,70,Orc,Rogue,Warsong Gulch,65,2008-01-20 00:46:39,,,0,19_0
1,19,70,Orc,Rogue,Warsong Gulch,65,2008-01-20 00:56:22,583.0,0.0,0,19_0
2,19,70,Orc,Rogue,Warsong Gulch,65,2008-01-20 01:06:49,627.0,0.0,1,19_1
3,19,70,Orc,Rogue,Nagrand,65,2008-01-20 04:26:18,11969.0,0.0,1,19_2
4,19,70,Orc,Rogue,Nagrand,65,2008-01-20 04:36:49,631.0,0.0,1,19_3
...,...,...,...,...,...,...,...,...,...,...,...
95,19,70,Orc,Rogue,Shattrath City,65,2008-01-25 03:49:17,93149.0,0.0,1,19_39
96,19,70,Orc,Rogue,Warsong Gulch,65,2008-01-25 03:59:49,632.0,0.0,1,19_40
97,19,70,Orc,Rogue,Warsong Gulch,65,2008-01-25 04:09:33,584.0,0.0,0,19_40
98,19,70,Orc,Rogue,Warsong Gulch,65,2008-01-25 04:19:15,582.0,0.0,0,19_40


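**Note:**

The elapsed time of 24.66 sec covers both running the query and converting the result to a pandas DataFrame with `.df()`.

In comparison, the polars processing took 3.84 sec;
see [wowah-polars.ipynb](https://github.com/wgong/py4kids/blob/master/lesson-14.6-polars/polars-cookbook/cookbook/wowah-polars.ipynb).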

## SQLite

In [21]:
conn_ = sqlite3.connect(":memory:")

In [22]:
ts_start = time()
df = pd.read_csv(datafile)
ts_stop = time()
print(f"Elapsed time = {(ts_stop-ts_start):.2f}")

Elapsed time = 6.53


In [23]:
ts_start = time()
df.shape
ts_stop = time()
print(f"Elapsed time = {(ts_stop-ts_start):.2f}")

Elapsed time = 0.00


In [24]:
ts_start = time()
print(df.columns)
ts_stop = time()
print(f"Elapsed time = {(ts_stop-ts_start):.2f}")

Index(['char', ' level', ' race', ' charclass', ' zone', ' guild',
       ' timestamp'],
      dtype='object')
Elapsed time = 0.00


In [26]:
col_map = {k:k.strip()  for k in df.columns}

In [20]:
print(" text \n, ".join(df.columns))

59425 text 
, 1 text 
, Orc text 
, Rogue text 
, Orgrimmar text 
, 165 text 
, 01/01/08 00:02:04


In [8]:
df.head(3)

Unnamed: 0,char,level,race,charclass,zone,guild,timestamp
0,59425,1,Orc,Rogue,Orgrimmar,165,01/01/08 00:02:04
1,65494,9,Orc,Hunter,Durotar,-1,01/01/08 00:02:04
2,65325,14,Orc,Warrior,Ghostlands,-1,01/01/08 00:02:04


In [28]:
df.rename(columns=col_map, inplace=True)

In [29]:
df.head(3)

Unnamed: 0,char,level,race,charclass,zone,guild,timestamp
0,59425,1,Orc,Rogue,Orgrimmar,165,01/01/08 00:02:04
1,65494,9,Orc,Hunter,Durotar,-1,01/01/08 00:02:04
2,65325,14,Orc,Warrior,Ghostlands,-1,01/01/08 00:02:04
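
The leading spaces in the column names come from the space after each comma in the CSV header. Instead of renaming afterwards, they can be dropped while reading: `pandas.read_csv` accepts `skipinitialspace=True`, and the standard-library `csv` module has a parameter of the same name. The sketch below uses the `csv` module on a two-line sample; the sample text is an assumption that mimics the shape of `wowah_data.csv` and is not read from the real file.

```python
import csv
import io

# Two lines shaped like wowah_data.csv: a space follows each comma in the header.
sample = io.StringIO(
    "char, level, race, charclass, zone, guild, timestamp\n"
    "59425,1,Orc,Rogue,Orgrimmar,165,01/01/08 00:02:04\n"
)

# skipinitialspace=True drops whitespace that directly follows a delimiter.
reader = csv.DictReader(sample, skipinitialspace=True)
fieldnames = reader.fieldnames
first = next(reader)

print(fieldnames)
print(first["level"], first["timestamp"])
```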


In [30]:
sql_stmt = f"""
create table if not exists weblogs
(
char text 
,  level text 
,  race text 
,  charclass text 
,  zone text 
,  guild text 
,  timestamp text
)
"""

cur = conn_.cursor()
cur.executescript(sql_stmt)
conn_.commit()

In [31]:
ts_start = time()
df.to_sql("weblogs", conn_, if_exists='append', index=False)
ts_stop = time()
print(f"Elapsed time = {(ts_stop-ts_start):.2f}")   # Elapsed time = 24.30

Elapsed time = 24.30


In [41]:
sql_cmd = f"""
SELECT * 
    ,  unixepoch(timestamp)  -- seconds since EPOCH
        - LAG(unixepoch(timestamp))
       OVER (PARTITION BY char ORDER BY timestamp) AS ts_diff
    ,  char - LAG(char)
       OVER (PARTITION BY char ORDER BY timestamp) AS char_diff
    FROM weblogs 
    order by char, timestamp

"""
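# SQLite does not support EXTRACT; use unixepoch() instead.
# Caveat: unixepoch() only parses ISO-8601-style text such as 'YYYY-MM-DD HH:MM:SS'.
# For the 'MM/DD/YY HH:MM:SS' strings stored here it returns NULL, so ts_diff is NULL.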

# sql_cmd = f"""
# SELECT *
#     ,  timestamp - LAG(timestamp)
#        OVER (PARTITION BY char ORDER BY timestamp) AS ts_diff
#     FROM weblogs 

# """


ts_start = time()
df = pd.read_sql(sql_cmd, conn_)
ts_stop = time()
print(f"Elapsed time = {(ts_stop-ts_start):.2f}") #  Elapsed time = 137.89

Elapsed time = 137.89


In [None]:
df.head()

In [43]:
SESSION_SPAN = 10 * 60  # session timeout: 10 minutes, expressed in seconds
table_name = "weblogs"
sql_cmd = f"""
WITH diff as (
    SELECT * 
    ,  unixepoch(timestamp)  -- seconds since EPOCH
        - LAG(unixepoch(timestamp))
       OVER (PARTITION BY char ORDER BY timestamp) AS ts_diff
    ,  char - LAG(char)
       OVER (PARTITION BY char ORDER BY timestamp) AS char_diff
    FROM {table_name} 
    order by char, timestamp
), session as (
    select * 
    , case when (ts_diff >= {SESSION_SPAN}) --or (char_diff > 0)
           then 1
           else 0    
      end as new_session
    from diff
    order by char, timestamp
)
select * 
, char || '_' || SUM(new_session)
  OVER (PARTITION BY char ORDER BY timestamp) AS session_id
from session

"""

print(f"sql_cmd = {sql_cmd}")
ts_start = time()
df = pd.read_sql(sql_cmd, conn_)
ts_stop = time()
print(f"Elapsed time = {(ts_stop-ts_start):.2f}")

sql_cmd = 
WITH diff as (
    SELECT * 
    ,  unixepoch(timestamp)  -- seconds since EPOCH
        - LAG(unixepoch(timestamp))
       OVER (PARTITION BY char ORDER BY timestamp) AS ts_diff
    ,  char - LAG(char)
       OVER (PARTITION BY char ORDER BY timestamp) AS char_diff
    FROM weblogs 
    order by char, timestamp
), session as (
    select * 
    , case when (ts_diff >= 600) --or (char_diff > 0)
           then 1
           else 0    
      end as new_session
    from diff
    order by char, timestamp
)
select * 
, char || '_' || SUM(new_session)
  OVER (PARTITION BY char ORDER BY timestamp) AS session_id
from session


Elapsed time = 280.11



In [44]:
df.head(100)

Unnamed: 0,char,level,race,charclass,zone,guild,timestamp,ts_diff,char_diff,new_session,session_id
0,10,29,Orc,Hunter,Undercity,-1,07/06/08 17:04:02,,,0,10_0
1,10,29,Orc,Hunter,Undercity,-1,07/06/08 17:13:46,,0.0,0,10_0
2,10,29,Orc,Hunter,Orgrimmar,-1,07/06/08 17:23:25,,0.0,0,10_0
3,10,29,Orc,Hunter,Orgrimmar,-1,07/06/08 17:33:47,,0.0,0,10_0
4,10,29,Orc,Hunter,Stonetalon Mountains,-1,07/06/08 17:44:39,,0.0,0,10_0
...,...,...,...,...,...,...,...,...,...,...,...
95,10,34,Orc,Hunter,Razorfen Downs,-1,07/19/08 16:29:46,,0.0,0,10_0
96,10,34,Orc,Hunter,Orgrimmar,-1,07/27/08 10:33:23,,0.0,0,10_0
97,10,34,Orc,Hunter,Hillsbrad Foothills,-1,07/27/08 10:42:49,,0.0,0,10_0
98,10,34,Orc,Hunter,Arathi Highlands,-1,07/27/08 10:52:33,,0.0,0,10_0


## References

- [DuckDB with Polars](https://duckdb.org/docs/guides/python/polars.html)