# Predicting user behaviour based on e-commerce data
*__Author:__ Tino Merl*

__Table of Contents__

* [Introduction and planned action](#intro)
    * [CRISP-DM](#crisp)
* [Business Understanding](#bus_und)
* [Data Understanding](#dat_und)
    * [Describing the files](#fil_descr)
    * [Describing the columns](#col_descr)
* [Data Preparation](#dat_pre)
* [Modeling](#model)
* [Evaluation](#eval)

## Introduction and planned action<a class="anchor" id="intro"></a>
This is an assignment for the module Applied Programming in the summer term of 2020 at the *FOM Hochschule für Oekonomie & Management*, study centre Cologne. Throughout this assignment I will work with a dataset that contains user data from an e-commerce system. The dataset is available on Kaggle as the *Retailrocket recommender system dataset*.[[1](#kaggle_dataset)] It consists of four individual files. Since two of them (item_properties_part1 and item_properties_part2) exceed the maximum file size allowed on GitHub, I cannot upload them to this repository. The goal of this assignment is to predict user behaviour. This can be done in two ways: the users can be clustered to predict whether a user belongs to a group that buys or not, or Markov chains can be used to calculate the probability of a user buying or not. The whole analysis follows the CRISP-DM process.

### CRISP-DM<a class="anchor" id="crisp"></a>
CRISP-DM stands for __C__ross __I__ndustry __S__tandard __P__rocess for __D__ata __M__ining. It is a standardized process that describes the steps a machine learning analysis and model-building project should go through. It consists of the following six steps.

1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment

Although the steps are listed sequentially, there is a lot of back and forth between them, especially between Business Understanding and Data Understanding, between Business Understanding and Evaluation, and between Data Preparation and Modeling. Figure 1 illustrates the circular nature of the process.

<div style="margin:auto;">
<img style="display:block; margin-left: auto; margin-right:auto;" src="img/crisp-dm_diagramm.png"/>
<div style="width: 50%; margin:0 auto; text-align:center;"><i><b>Figure 1:</b></i> CRISP-DM diagram by statistik-dresden.de[<a href="#crisp-dm_diagramm">2</a>]</div>
</div>

Since this is an assignment, the last step, deployment, is left out. The process therefore ends with step five: evaluation.

## Business Understanding<a class="anchor" id="bus_und"></a>

The first step is business understanding. In this step the concrete goals and requirements are set before the analysis begins, and the concrete tasks are defined.[[2](#crisp-dm_diagramm)] For this assignment the tasks are the following.

* *Exploratory analysis of the data in data understanding*
* *Can the user behaviour be predicted with a simple clustering algorithm?*
* *Can the user behaviour be predicted with markov chains?*
* *Which of the models performs better?*

The tasks may be subject to changes and additions.
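To make the Markov chain question more concrete, the following is a minimal sketch of how first-order transition probabilities between event types could be estimated. It is not the model built in this assignment; the toy rows are invented and only the column names `timestamp`, `visitorid` and `event` are borrowed from events.csv.

```python
import pandas as pd

# Toy event log using column names from events.csv; the rows are invented.
events = pd.DataFrame({
    "timestamp": [1, 2, 3, 1, 2, 1, 2, 3],
    "visitorid": [1, 1, 1, 2, 2, 3, 3, 3],
    "event": ["view", "addtocart", "transaction",
              "view", "view",
              "view", "addtocart", "view"],
})

# Order each visitor's events in time and pair every event with its successor.
events = events.sort_values(["visitorid", "timestamp"])
events["next_event"] = events.groupby("visitorid")["event"].shift(-1)
pairs = events.dropna(subset=["next_event"])

# Row-normalised transition counts estimate P(next event | current event).
transitions = pd.crosstab(pairs["event"], pairs["next_event"], normalize="index")
print(transitions)
```

Each row of `transitions` sums to 1, so the entry in row `addtocart` and column `transaction` can be read as the estimated probability that a purchase directly follows an add-to-cart event.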

## Data Understanding<a class="anchor" id="dat_und"></a>

The next step is data understanding. Usually we would try to understand the data by talking to stakeholders and data owners. This is followed by an exploratory analysis of the dataset, which also lays the foundation for the following chapter, data preparation. Since this dataset has a usability score of 8.8 on Kaggle and comes with a lot of context describing it, I will cite the Kaggle page.[[1](#kaggle_dataset)]

### Describing the files<a class="anchor" id="fil_descr"></a>

### Describing the columns<a class="anchor" id="col_descr"></a>

The following descriptions are taken from the Kaggle dataset page.[[1](#kaggle_dataset)]

__category_tree.csv__

* categoryid `{int}` -- unique identifier of the category.
* parentid `{int}` -- identifier of the parent category. It's empty, if parent doesn't exist.
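Because `parentid` is empty for categories without a parent, pandas reads the column as floating point with `NaN` for the missing entries. As a small sketch, assuming a toy tree with invented ids, the root categories could be selected like this:

```python
import numpy as np
import pandas as pd

# Toy category tree with the columns of category_tree.csv; the ids are invented.
category_tree = pd.DataFrame({
    "categoryid": [1, 2, 3, 4],
    "parentid": [np.nan, 1, 1, np.nan],
})

# Categories without a parent are the roots of the tree.
roots = category_tree[category_tree["parentid"].isna()]
print(roots["categoryid"].tolist())
```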

__events.csv__

* timestamp `{int}` -- the time, when event is occurred, in milliseconds since 01-01-1970.
* visitorid `{int}` -- unique identifier of the visitor
* event `{string}` -- type of the event {“view”, “addtocart”, “transaction”}
* itemid `{int}` -- unique identifier of the item
* transactionid `{int}` -- unique identifier of the transaction (non empty only for transaction event type).
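Since the timestamps are stored as milliseconds since 01-01-1970, they are hard to read as plain integers. A minimal sketch of one way to convert them with pandas, using two timestamp values from events.csv (the column name `datetime` is an assumption chosen for illustration):

```python
import pandas as pd

# Two timestamps in the millisecond format used by events.csv.
events = pd.DataFrame({"timestamp": [1433221332117, 1433224214164]})

# unit="ms" tells pandas that the integers count milliseconds since 1970-01-01 (UTC).
events["datetime"] = pd.to_datetime(events["timestamp"], unit="ms")
print(events)
```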

__item_properties_part1.csv__

* timestamp `{int}` -- snapshot creation time (Unix timestamp in milliseconds)
* itemid `{int}` -- unique Id of the item
* property `{str}` -- property of the Item. All of them had been hashed excluding "categoryid" and "available"
* value `{str}` -- property value of the item

__item_properties_part2.csv__

* timestamp `{int}` -- snapshot creation time (Unix timestamp in milliseconds)
* itemid `{int}` -- unique Id of the item
* property `{str}` -- property of the Item. All of them had been hashed excluding "categoryid" and "available"
* value `{str}` -- property value of the item
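The property files store timestamped snapshots, so one item can have several values for the same property. As a hedged sketch with invented toy rows, the most recent `categoryid` per item could be extracted as follows; whether the latest snapshot is the right choice depends on the later analysis.

```python
import pandas as pd

# Toy snapshots with the columns of the item_properties files; the values are invented.
properties = pd.DataFrame({
    "timestamp": [100, 200, 100, 300],
    "itemid": [1, 1, 2, 2],
    "property": ["categoryid", "categoryid", "available", "categoryid"],
    "value": ["10", "20", "1", "30"],
})

# Keep only the category snapshots and take the most recent one per item.
category = properties[properties["property"] == "categoryid"]
latest = (
    category.sort_values("timestamp")
    .groupby("itemid")["value"]
    .last()
)
print(latest.to_dict())
```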


__Loading the required packages__

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import sklearn

__Loading the datasets__

In [2]:
categoryTreeDf = pd.read_csv("./data/category_tree.csv")
eventsDf = pd.read_csv("./data/events.csv")
properties1Df = pd.read_csv("./data/item_properties_part1.csv")
properties2Df = pd.read_csv("./data/item_properties_part2.csv")

An exploratory look at the dataframes as a whole: columns and heads.


In [4]:
for elem in [
    categoryTreeDf,
    eventsDf,
    properties1Df,
    properties2Df,
    ]:

    print(elem.columns)
    print("\n")
    print(elem.head())
    print("\n\n")

Index(['categoryid', 'parentid'], dtype='object')
   categoryid  parentid
0        1016     213.0
1         809     169.0
2         570       9.0
3        1691     885.0
4         536    1691.0



Index(['timestamp', 'visitorid', 'event', 'itemid', 'transactionid'], dtype='object')
       timestamp  visitorid event  itemid  transactionid
0  1433221332117     257597  view  355908            NaN
1  1433224214164     992329  view  248676            NaN
2  1433221999827     111016  view  318965            NaN
3  1433221955914     483717  view  253185            NaN
4  1433221337106     951259  view  367447            NaN



Index(['timestamp', 'itemid', 'property', 'value'], dtype='object')
       timestamp  itemid    property                            value
0  1435460400000  460429  categoryid                             1338
1  1441508400000  206783         888          1116713 960601 n277.200
2  1439089200000  395014         400  n552.000 639502 n720.000 424566
3  1431226800000   59481

## Footnotes
[1]<a class="anchor" id="kaggle_dataset"></a> Retailrocket (2017) Retailrocket recommender system dataset, Version 4. Retrieved 2020-04-19 from https://www.kaggle.com/retailrocket/ecommerce-dataset

[2]<a class="anchor" id="crisp-dm_diagramm"></a> Wolf Riepel (2012). CRISP-DM: Ein Standard-Prozess-Modell für Data Mining. Retrieved 2020-05-10 from https://statistik-dresden.de/archives/1128