# Creating a Session ID

A Python script that assigns a "Session ID" to every record in the data. A session is a window of activity from one user; it ends when there is at least 15 minutes of inactivity.
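To make the definition concrete, here is a minimal sketch for the clicks of a single user, given as epoch seconds in ascending order. The function name `split_into_sessions` and the `gap_seconds` parameter are illustrative and do not appear in the notebook. The sketch treats a gap of exactly 15 minutes as still belonging to the same session, which is an assumption chosen to match the `<=` comparison in the main loop further down.

```python
def split_into_sessions(epochs, gap_seconds=15 * 60):
    """Return a 1-based session number for each timestamp of ONE user.

    `epochs` must already be sorted in ascending order (illustrative helper).
    """
    session_numbers = []
    current = 0
    previous = None
    for t in epochs:
        # Start a new session on the first click or after a long pause.
        if previous is None or t - previous > gap_seconds:
            current += 1
        session_numbers.append(current)
        previous = t
    return session_numbers


# Clicks at 0 s, 300 s and 1000 s are close together; 2500 s comes
# 1500 s after the previous click, so it opens a second session.
clicks = [0, 300, 1000, 2500, 2600]
print(split_into_sessions(clicks))  # [1, 1, 1, 2, 2]
```

The notebook has to do the same thing for many users at once, which is why the main loop also compares uuid values.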

Libraries: pandas, for handling structured data, and time, to obtain the current time (used to measure how long the main loop takes).

Read the data into a DataFrame with the read_csv function and display the first few rows.

In [5]:
import pandas as pd
import time

In [2]:
df = pd.read_csv("clickStream.csv")
df.head()

| | clicked_epoch | uuid | date | price | product_id | category |
|---|---|---|---|---|---|---|
| 0 | 1496273000.0 | 110971 | 2017-06-01 | 599.5 | 122712 | kurta & kurtis |
| 1 | 1496273000.0 | 110971 | 2017-06-01 | 599.5 | 3453 | kurta & kurtis |
| 2 | 1496276000.0 | 49864 | 2017-06-01 | 1349.1 | 13610 | jeans |
| 3 | 1496277000.0 | 49864 | 2017-06-01 | 1124.1 | 48309 | jeans |
| 4 | 1496280000.0 | 21453 | 2017-06-01 | 999.0 | 133239 | kurta & kurtis |


In [8]:
df.describe()

| | clicked_epoch | uuid | price | product_id |
|---|---|---|---|---|
| count | 413913.0 | 413913.0 | 413913.0 | 413913.0 |
| mean | 1525372000.0 | 74061.14971 | 2142.091122 | 87214.981491 |
| std | 14113300.0 | 42856.196584 | 4826.469276 | 50157.359102 |
| min | 1496273000.0 | 1.0 | -1.0 | 1.0 |
| 25% | 1514964000.0 | 37209.0 | 517.65 | 43913.0 |
| 50% | 1528033000.0 | 73523.0 | 879.6 | 87694.0 |
| 75% | 1536505000.0 | 111149.0 | 1468.0 | 131285.0 |
| max | 1547684000.0 | 148649.0 | 250000.0 | 173030.0 |


In [3]:
df.isnull().sum() # Checking for null values in the data.

clicked_epoch    0
uuid             0
date             0
price            0
product_id       0
category         0
dtype: int64

Sort the DataFrame by "clicked_epoch" so that the records are in chronological order. The session loop further down relies on this ordering to stop scanning early.

In [4]:
df = df.sort_values(["clicked_epoch"])
df.head()

| | clicked_epoch | uuid | date | price | product_id | category |
|---|---|---|---|---|---|---|
| 0 | 1496273000.0 | 110971 | 2017-06-01 | 599.5 | 122712 | kurta & kurtis |
| 1 | 1496273000.0 | 110971 | 2017-06-01 | 599.5 | 3453 | kurta & kurtis |
| 2 | 1496276000.0 | 49864 | 2017-06-01 | 1349.1 | 13610 | jeans |
| 3 | 1496277000.0 | 49864 | 2017-06-01 | 1124.1 | 48309 | jeans |
| 4 | 1496280000.0 | 21453 | 2017-06-01 | 999.0 | 133239 | kurta & kurtis |


#### Calculating the Session ID

sess_id_list is a list that holds the "Session ID" of every record. Every entry starts at 0, meaning that no ID has been assigned to that record yet.

We use two nested loops over the records. When the outer loop reaches a record whose ID is still 0, it assigns the current sess_id (starting from 1) to it. The inner loop then scans the records that follow: each record with the same uuid that falls within 15 minutes of the session's most recent record gets the same sess_id and becomes the new most recent record. The inner loop stops at the first record that is more than 15 minutes later than the session's most recent record.
After the inner loop finishes, we increment sess_id and repeat the process.

The cell below takes approximately 4 minutes to run.

In [7]:
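sess_id_list = [0] * len(df)  # 0 means "no session assigned yet"
sess_id = 1
st = time.time()
for i in range(len(df)):
    if sess_id_list[i] == 0:
        # Row i starts a new session; temp_i tracks the session's latest click.
        temp_i = i
        sess_id_list[i] = sess_id
        for j in range(i+1, len(df)):
            # Column 0 is clicked_epoch, column 1 is uuid.
            # `<=` keeps a click exactly 15 minutes after the previous one in the session.
            if df.iloc[j,0] <= (df.iloc[temp_i,0] + 15*60):
                if df.iloc[temp_i,1] == df.iloc[j,1]:
                    sess_id_list[j] = sess_id
                    temp_i = j
            else:
                # Rows are sorted by time, so no later row can extend this session.
                break
        sess_id+=1
print("Time taken in secs: ",time.time()-st)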

Time taken in secs:  231.35193848609924
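Most of the running time comes from the element-by-element df.iloc lookups inside the nested loops. As a possible alternative, the sketch below uses a different, vectorized technique: it groups the clicks by user, measures the gap to the same user's previous click, and starts a new session whenever that gap exceeds 15 minutes. The function name `add_session_ids` and the toy data are illustrative and not part of the notebook. The sketch assumes the column names clicked_epoch and uuid and follows the loop's `<=` rule. It should group rows into the same sessions as the loop, but the ID numbers themselves will differ because sessions are numbered user by user rather than in order of first appearance.

```python
import pandas as pd

GAP_SECONDS = 15 * 60  # inactivity threshold, same value as in the loop


def add_session_ids(frame, gap_seconds=GAP_SECONDS):
    """Return a copy of `frame` with a `session_id` column (illustrative helper)."""
    # Put each user's clicks next to each other, in time order.
    out = frame.sort_values(["uuid", "clicked_epoch"]).copy()
    # Seconds since the same user's previous click (NaN for a user's first click).
    gap = out.groupby("uuid")["clicked_epoch"].diff()
    # A session starts at a user's first click or after a pause above the threshold.
    starts_session = gap.isna() | (gap > gap_seconds)
    # Running count of session starts gives a unique, 1-based session number.
    out["session_id"] = starts_session.cumsum()
    # Restore chronological order, as in the notebook.
    return out.sort_values("clicked_epoch", kind="mergesort")


toy = pd.DataFrame({
    "clicked_epoch": [0.0, 100.0, 200.0, 1500.0, 1600.0],
    "uuid": [1, 2, 1, 1, 2],
})
result = add_session_ids(toy)
print(result["session_id"].tolist())  # [1, 3, 1, 2, 4]
```

In the toy data, user 1 clicks at 0 s and 200 s (one session) and again at 1500 s (a new session), while user 2's two clicks are 1500 s apart and therefore fall into separate sessions.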


In [9]:
len(sess_id_list)

413913

In [10]:
df["session_id"] = sess_id_list
df.to_csv("clickStream_session_id.csv")
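A quick sanity check on the result can catch mistakes in the assignment. The sketch below is one possible check, not part of the original notebook: the helper name `check_sessions` and the toy data are illustrative. It verifies only that each session contains a single uuid and that no two consecutive clicks inside a session are more than 15 minutes apart. It does not detect sessions that were split too finely.

```python
import pandas as pd


def check_sessions(frame, gap_seconds=15 * 60):
    """Return True if every session has one user and no internal gap above the threshold."""
    ordered = frame.sort_values("clicked_epoch", kind="mergesort")
    # Each session should contain clicks from exactly one user.
    one_user = ordered.groupby("session_id")["uuid"].nunique().le(1).all()
    # Largest gap between consecutive clicks of the same session (NaN if none).
    max_gap = ordered.groupby("session_id")["clicked_epoch"].diff().max()
    gaps_ok = pd.isna(max_gap) or max_gap <= gap_seconds
    return bool(one_user and gaps_ok)


good = pd.DataFrame({
    "clicked_epoch": [0.0, 100.0, 200.0, 1500.0],
    "uuid": [1, 2, 1, 1],
    "session_id": [1, 2, 1, 3],
})
# Same clicks, but the click at 1500 s is wrongly merged into session 1.
bad = good.assign(session_id=[1, 2, 1, 1])

print(check_sessions(good), check_sessions(bad))
```

On the notebook's data, the same function could be called as check_sessions(df) once the session_id column has been added.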