### Introduction

This notebook builds the data, in the expected format, needed to create [d3 sunburst visualizations](https://bl.ocks.org/kerryrodden/7090426) like this one:

![Sunburst Visualization](https://analista-digital.com/wp-content/uploads/2021/02/showcase_sunburst.png)

For this particular case, I kept only sequences that include a purchase, because I wanted to see which flows lead to purchases most often.

### Import needed packages

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### Create initial DataFrame from October's data

In [None]:
# Load October Data (keep only needed Columns)
df = pd.read_csv('/kaggle/input/ecommerce-events-history-in-cosmetics-shop/2019-Oct.csv',
                 usecols=["event_time", "event_type", "user_session"])

In [None]:
# Get a random sample of events
df.sample(10)

### Start Data Cleansing & Manipulation

In [None]:
# Count NaN values per column to check for missing data
df.isna().sum()

In [None]:
# Drop rows containing NaN in any column (this removes the rows with a missing user_session)
df.dropna(inplace=True)

In [None]:
# Check validity of user_session. It seems to be a UUID, and therefore every session id should have the same number of characters
df['user_session'].str.len().value_counts()
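
The length check above only shows that ids have the same number of characters. As a complementary, minimal sketch, the standard-library `uuid` module can test whether an id actually parses as a UUID. The helper `is_uuid` and the toy ids are made up for illustration and are not part of the dataset. Note that `uuid.UUID` also accepts some variants (for example, strings without hyphens), so it does not replace the length check.

```python
import uuid

import pandas as pd


def is_uuid(value):
    """Return True if value parses as a UUID (hypothetical helper)."""
    try:
        uuid.UUID(value)
        return True
    except (ValueError, AttributeError, TypeError):
        return False


# Toy session ids, made up for illustration
toy_sessions = pd.Series([
    "26dd6e6e-4dac-4778-8d2c-92e149dab885",
    "not-a-session-id",
])
uuid_mask = toy_sessions.apply(is_uuid)
print(uuid_mask.tolist())  # [True, False]
```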

In [None]:
# Check validity of event_type. It should be categorical with 4 distinct values
df['event_type'].value_counts()

In [None]:
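# Replace remove_from_cart with remove (the original string is too long for a good visualization)
# Assign the result back instead of using inplace=True on a column selection,
# which is not guaranteed to modify the DataFrame in recent pandas versions
df['event_type'] = df['event_type'].replace('remove_from_cart', 'remove')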

In [None]:
# Convert event_time to a real datetime type
# (infer_datetime_format is deprecated in pandas 2.x, so it is omitted)
df['event_time'] = pd.to_datetime(df['event_time'])

In [None]:
# Check event_time. Are all days represented?
df["event_time"].groupby(df["event_time"].dt.day).count().plot(kind="line")

In [None]:
# Check hour of the day... are all hours represented?
df["event_time"].groupby(df["event_time"].dt.hour).count().plot(kind="bar")

In [None]:
# Factorize session id to save some RAM
df['user_session'] = pd.factorize(df.user_session)[0]

### Create a new feature, sequences, which will be the base of the final output

If you are familiar with `SQL`, this step is the equivalent of a `collect_list` of all event types, `partitioned by` the session id and `ordered by` the event_time. These sequences include every event in the user flow in chronological order. For example, imagine that a user's flow in a particular session consists of the following events (in chronological order, from oldest to newest):

* A product view
* Another product view
* A cart addition
* Another product view
* Another cart addition
* A cart removal
* A purchase

The final sequence for this user session will be this array:

`['view', 'view', 'cart', 'view', 'cart', 'remove', 'purchase']`
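
As a minimal sketch of the same idea on toy data, the snippet below builds one ordered list of events per session. The DataFrame `toy_events` and its values are made up for illustration; only the column names match the dataset.

```python
import pandas as pd

# Toy events for two made-up sessions, deliberately out of order
toy_events = pd.DataFrame({
    "event_time": pd.to_datetime([
        "2019-10-01 10:05:00", "2019-10-01 10:00:00", "2019-10-01 10:02:00",
        "2019-10-01 11:01:00", "2019-10-01 11:00:00",
    ]),
    "event_type": ["purchase", "view", "cart", "cart", "view"],
    "user_session": [0, 0, 0, 1, 1],
})

# Sort by time first; groupby keeps that row order inside each group
toy_sequences = (
    toy_events.sort_values("event_time")
    .groupby("user_session")["event_type"]
    .apply(list)
)
print(toy_sequences.to_dict())
# {0: ['view', 'cart', 'purchase'], 1: ['view', 'cart']}
```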

In [None]:
# Create a collect_list of events, partitioned by user_session & ordered by event_time ascending
grouped_df = (
    df.sort_values('event_time', ascending=True)
      .groupby('user_session')['event_type']
      .apply(list)
      .to_frame(name='sequences')
)

Per specification, every sequence should have an "end" marker as its last element, unless it has been truncated because it is longer than the maximum sequence length (6, in the example). The purpose of the "end" marker is to distinguish a true end point (e.g. the user left the site) from an end point that has been forced by truncation.
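
The rule can be sketched on plain Python lists as follows. The helper `finalize_sequence` and the constant `MAX_LEN` are hypothetical names used only for this illustration; the notebook itself applies the same two steps (append, then truncate) in separate cells.

```python
MAX_LEN = 6  # maximum sequence length used in this notebook


def finalize_sequence(events, max_len=MAX_LEN):
    """Append 'end', then truncate to max_len (hypothetical helper)."""
    return (list(events) + ["end"])[:max_len]


short = finalize_sequence(["view", "cart", "purchase"])
long = finalize_sequence(["view"] * 5 + ["cart", "purchase"])
print(short)  # ['view', 'cart', 'purchase', 'end']
print(long)   # ['view', 'view', 'view', 'view', 'view', 'cart']
```

Note that in the second case both the "end" marker and the purchase fall beyond the sixth position, so that sequence no longer contains a purchase after truncation.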

In [None]:
# Add 'end' to the end of the array for each sequence
# This finalizes sequences with fewer than 6 touchpoints
grouped_df['sequences'] = grouped_df['sequences'].apply(lambda seq: seq + ['end'])

As specified previously, the longest sequence length I want to analyze is 6 (I do not want to go deeper with the analysis). You can easily change this limit in the next code cell.

In [None]:
# Constrain sequences to a maximum of 6 touchpoints
# Sequences with fewer than 6 touchpoints will finish with 'end'
grouped_df['sequences'] = grouped_df['sequences'].apply(lambda seq: seq[:6])

In [None]:
# Keep only sequences with at least one purchase (within the first 6 touchpoints,
# because sequences have already been truncated)
grouped_df = grouped_df[grouped_df['sequences'].apply(lambda seq: 'purchase' in seq)]

### Generate the final output file

In [None]:
# Transform each array into a hyphen-separated str. This is the format expected by d3
grouped_df['sequences'] = ['-'.join(map(str, seq)) for seq in grouped_df['sequences']]

In [None]:
# Create a df with the top N sequences by number of occurrences
top_sequences = grouped_df['sequences'].value_counts().nlargest(100).to_frame(name='occurrences')

In [None]:
# Export resulting dataframe into a csv file
top_sequences.to_csv('top_sequences.csv', header=False)
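
To illustrate the shape of the exported file, here is a minimal sketch that writes a toy result to an in-memory buffer instead of disk. The sequences in `toy` are made up for illustration. Each output row is a hyphen-separated sequence followed by its count, with no header row.

```python
import io

import pandas as pd

# Toy joined sequences, made up for illustration
toy = pd.Series([
    "view-cart-purchase-end",
    "view-cart-purchase-end",
    "view-purchase-end",
])
toy_top = toy.value_counts().nlargest(100)

# Write to an in-memory buffer so the CSV text can be inspected directly
buffer = io.StringIO()
toy_top.to_csv(buffer, header=False)
csv_text = buffer.getvalue()
print(csv_text)
```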

In [None]:
# Display final output
top_sequences