## Lesson 28 - Neo4j Analysis using E-Commerce Data






### Table of Contents

* [The Data](#the_data)
* [Import data](#import_data)
* [The graph](#the_graph)
* [Conclusion](#conclusion)





<a id="the_data"></a>
## The Data

Relational databases are a logical way to manage data, but alternative approaches such as graph databases can be more useful in many cases. Large companies in various industries, such as eBay, Airbnb and Cisco, use graph databases. Neo4j is one such graph database platform.
In this lesson, we'll create a graph from e-commerce data using Neo4j and also touch on analysis.

You can find the e-commerce data [here](https://www.kaggle.com/carrie1/ecommerce-data). At first glance, it is clear that the data consists of transactions: its columns include the customer, the purchased product, the quantity and the date of the transaction.

It is a good idea to plan the schema before inserting data into Neo4j. The schema we aim to build in this lesson is as follows:

<img src="images/e-commerce-graph-nodes.png">
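
To make the mapping from flat rows to the schema more concrete, here is a minimal Python sketch. The two sample rows are made up for illustration and only follow the dataset's column layout, and `row_to_graph` is a hypothetical helper, not part of Neo4j or any library. It shows which node types and relationships a single row could contribute, and that a row without a customer ID cannot yield a Customer node.

```python
import csv
import io

# Illustrative rows in the dataset's column layout (all values are made up).
SAMPLE = """InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
A0001,P100,SAMPLE MUG,6,12/1/2010 8:26,2.50,10001,United Kingdom
A0002,P200,SAMPLE LAMP,1,12/2/2010 9:00,4.00,,France
"""

def row_to_graph(row):
    """Split one transaction row into the nodes and relationships of the schema."""
    nodes = {
        "Product": {"stockCode": row["StockCode"], "description": row["Description"]},
        "Transaction": {"transactionID": row["InvoiceNo"], "transactionDate": row["InvoiceDate"]},
    }
    rels = [("Transaction", "CONTAINS", "Product")]
    # A row without a customer ID cannot produce a Customer node.
    if row["CustomerID"]:
        nodes["Customer"] = {"customerID": int(row["CustomerID"]), "country": row["Country"]}
        rels.insert(0, ("Customer", "MADE", "Transaction"))
    return nodes, rels

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
graphs = [row_to_graph(r) for r in rows]
for nodes, rels in graphs:
    print(sorted(nodes), rels)
```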

<a id="import_data"></a>
## Import data
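We can start with the customers. Creating a uniqueness constraint before creating the nodes both prevents duplicates and improves performance, because the index behind the constraint speeds up the lookups that MERGE performs. The statements in this lesson use Neo4j 4.x syntax; Neo4j 5 replaced `CREATE CONSTRAINT ON ... ASSERT` with `CREATE CONSTRAINT FOR ... REQUIRE` and `USING PERIODIC COMMIT` with `CALL { ... } IN TRANSACTIONS`. You can create the constraint as follows: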

```
CREATE CONSTRAINT ON (customer:Customer) ASSERT customer.customerID IS UNIQUE
```

### Customer nodes
Note that a uniqueness constraint only applies to nodes that actually have the property, and MERGE cannot be used with a null property value, so rows without a CustomerID are filtered out first. You can then create the customer nodes as follows:

```
:auto 
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS
FROM 'file:///e-commerce-data.csv'
AS line
WITH 
  toInteger(line.CustomerID) AS CustomerID, 
  line WHERE NOT line.CustomerID IS null
MERGE(customer:Customer {customerID: CustomerID})
ON CREATE SET customer.country = line.Country
```

With the customer nodes in place, creating the product and transaction nodes is even easier. As before, first create a constraint for the product nodes:

```
CREATE CONSTRAINT ON (product:Product) ASSERT product.stockCode IS UNIQUE
```

There is an important point here: when you create a constraint, Neo4j also creates an index, and Cypher uses that index for lookups just like any other index. Therefore, there is no need to create a separate index. In fact, if you try to create a constraint when an index on the same label and property already exists, you'll get an error.

### Product nodes
With all of this in mind, you can create the product nodes as follows:

```
:auto 
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS
FROM 'file:///e-commerce-data.csv'
AS line
MERGE(product:Product {stockCode: line.StockCode})
ON CREATE SET product.description = line.Description
```

As you can see above, the ON CREATE clause is used when creating nodes. MERGE either matches an existing node or creates a new one, and the SET under ON CREATE runs only if the node had to be created. Similarly, you can use ON MATCH to set properties when the node already exists.
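
The match-or-create behaviour can be pictured with a plain-Python analogy. This is not Neo4j code: the dictionary stands in for the Product nodes, `merge_product` is a hypothetical helper, and the `seen` counter is an invented property used only to make the ON MATCH branch visible.

```python
# Plain-Python analogy for MERGE with ON CREATE / ON MATCH (not Neo4j code).
products = {}  # stockCode -> property dict, standing in for Product nodes

def merge_product(stock_code, description):
    """Return the node for stock_code, creating it only if it is missing."""
    node = products.get(stock_code)
    if node is None:
        # "ON CREATE" branch: runs only when a new node is created
        node = {"stockCode": stock_code, "description": description, "seen": 1}
        products[stock_code] = node
    else:
        # "ON MATCH" branch: runs only when the node already exists
        node["seen"] += 1
    return node

merge_product("P100", "SAMPLE MUG")
merge_product("P100", "SAMPLE MUG, RENAMED")  # matched, so the description is kept
merge_product("P200", "SAMPLE LAMP")

print(len(products))                    # 2
print(products["P100"]["description"])  # SAMPLE MUG
print(products["P100"]["seen"])         # 2
```

The second call finds the existing entry, so the description set at creation time is left untouched, which mirrors why the same product appearing on many CSV rows still yields a single node.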

### Transaction nodes
It is convenient to create the transaction nodes just before dealing with relationships:

```
CREATE CONSTRAINT ON (transaction:Transaction) ASSERT transaction.transactionID IS UNIQUE;
```

```
:auto 
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS
FROM 'file:///e-commerce-data.csv'
AS line
MERGE(transaction:Transaction {transactionID: line.InvoiceNo})
ON CREATE SET transaction.transactionDate = line.InvoiceDate
```

Looking at the Cypher statements above, you can see that a semicolon ends the constraint statement. In general, you don't need to end a Cypher statement with a semicolon, but if you want to execute multiple Cypher statements together, you must separate them with semicolons.

The nodes in the graph are ready, but they have no connections to each other yet. The connections capture the semantic relationships and context of the nodes in the graph. The graph has three types of nodes: Customer, Transaction and Product. As the schema at the beginning of the lesson shows, relationships between customer and transaction and between transaction and product make the graph much more meaningful: the customer MADE a transaction and the transaction CONTAINS products.

### Relationships of Customer and Transaction
Here is the Cypher statement for building the MADE relationships:
```
:auto 
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS
FROM 'file:///e-commerce-data.csv'
AS line
WITH toInteger(line.CustomerID) AS CustomerID, line.InvoiceNo AS InvoiceNo
MATCH (customer:Customer {customerID: CustomerID})
MATCH (transaction:Transaction {transactionID: InvoiceNo})
MERGE (customer)-[:MADE]->(transaction)
```

Let's finalize the graph by creating the CONTAINS relationships:

```
:auto 
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS
FROM 'file:///e-commerce-data.csv'
AS line
WITH line.InvoiceNo AS InvoiceNo, line.StockCode AS StockCode, toInteger(line.UnitPrice) AS UnitPrice, toInteger(line.Quantity) AS Quantity
MATCH (product:Product {stockCode: StockCode})
MATCH (transaction:Transaction {transactionID: InvoiceNo})
MERGE (transaction)-[s:CONTAINS]->(product)
SET s.price = UnitPrice*Quantity
```

<a id="the_graph"></a>
## The graph

You can now check the schema of the graph with the following statement:
```
CALL db.schema.visualization()
```
The result looks like this:

<img src="images/e-commerce-transaction-relationships.png">

Keep in mind that you can model the graph differently. For instance, the transaction could have been a relationship called BOUGHT rather than a node. Which one you choose depends on your business problem, so define the rules first and build the structure accordingly.

## RFM Analysis

RFM analysis is a behavior-based approach to grouping customers into segments on the basis of their previous purchase transactions. Here are the three dimensions of RFM:
- Recency: How recently did the customer purchase?
- Frequency: How often do they purchase?
- Monetary Value: How much do they spend?
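
The three dimensions above can be sketched in a few lines of plain Python. The transactions and the reference day below are hypothetical, and `rfm` is an illustrative helper rather than part of any library. The sketch only shows what each dimension measures: days since the last purchase, number of distinct invoices, and total spend.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transactions: (customer_id, invoice_no, invoice_date, amount)
transactions = [
    (1, "A1", date(2011, 12, 1), 50.0),
    (1, "A2", date(2011, 12, 8), 25.0),
    (2, "B1", date(2011, 6, 15), 10.0),
    (2, "B1", date(2011, 6, 15), 5.0),  # second line of the same invoice
]
reference_day = date(2011, 12, 9)  # assumed "today" for the recency calculation

def rfm(rows, today):
    """Return {customer: (recency_days, frequency, monetary)}."""
    last_purchase = {}
    invoices = defaultdict(set)
    spend = defaultdict(float)
    for customer, invoice, day, amount in rows:
        last_purchase[customer] = max(last_purchase.get(customer, day), day)
        invoices[customer].add(invoice)  # count each invoice only once
        spend[customer] += amount
    return {
        c: ((today - last_purchase[c]).days, len(invoices[c]), spend[c])
        for c in last_purchase
    }

scores = rfm(transactions, reference_day)
print(scores)  # {1: (1, 2, 75.0), 2: (177, 1, 15.0)}
```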

Segmenting customers with RFM analysis matters to companies in many industries, because they want to know which customers are most valuable to them and to build loyalty among all their customers.

Before we start, the APOC plugin needs to be installed:
<img src="images/e-commerce-install-apoc.png">

Now that we have covered the dimensions of RFM and the significance of customer segmentation, we can compute the recency, frequency and monetary value with the following Python code:

```python
from py2neo import Graph
import pandas as pd

# connection settings; replace them with your own credentials
host = 'localhost'
port = 7687
user = 'neo4j'
password = '42840667'

graph = Graph(
    host=host,
    port=port,
    user=user,
    password=password
)
tx = graph.begin()

# monetary: total spend; frequency: number of distinct transactions;
# recency: days since the customer's most recent transaction
query = """
  MATCH (c:Customer)-[r1:MADE]->(t:Transaction)-[r2:CONTAINS]->(product:Product)
  WITH SUM(r2.price) AS monetary,
       COUNT(DISTINCT t) AS frequency,
       c.customerID AS customer,
       MIN(
         duration.inDays(
           date(datetime({epochmillis: apoc.date.parse(t.transactionDate, 'ms', 'MM/dd/yyyy')})),
           date()
         ).days
       ) AS recency
  RETURN customer, recency, frequency, monetary
"""

# create the dataframe
results = tx.run(query).data()
df = pd.DataFrame(results)
```

```python
df.head()
```

| | customer | recency | frequency | monetary |
|---|---|---|---|---|
| 0 | 17850 | 3674 | 35 | 4521 |
| 1 | 13047 | 3403 | 18 | 2304 |
| 2 | 12583 | 3374 | 18 | 4370 |
| 3 | 13748 | 3467 | 5 | 710 |
| 4 | 15100 | 3702 | 6 | 580 |


```python
# shift recency so that the most recent customer has a recency of 0
df['recency'] = df['recency'] - df['recency'].min()
```

```python
df.tail()
```

| | customer | recency | frequency | monetary |
|---|---|---|---|---|
| 4367 | 13436 | 1 | 1 | 148 |
| 4368 | 15520 | 1 | 1 | 236 |
| 4369 | 13298 | 1 | 1 | 288 |
| 4370 | 14569 | 1 | 1 | 179 |
| 4371 | 12713 | 0 | 1 | 624 |


Next, we define the segments by splitting each dimension into quantiles. Keep in mind that real-world segmentation is often not this simple and can be taken to a much more advanced level.

```python
# split each RFM dimension into three quantile-based scores;
# the recency labels are reversed so that recent buyers get the highest score
df['r_val'] = pd.qcut(df['recency'], q=3, labels=range(3, 0, -1))
df['f_val'] = pd.qcut(df['frequency'], q=3, labels=range(1, 4))
df['m_val'] = pd.qcut(df['monetary'], q=3, labels=range(1, 4))
```

```python
# create the segment value
df['rfm_val'] = (
    df['r_val'].astype(str) +
    df['f_val'].astype(str) +
    df['m_val'].astype(str)
)

df.head()
```

| | customer | recency | frequency | monetary | r_val | f_val | m_val | rfm_val |
|---|---|---|---|---|---|---|---|---|
| 0 | 17850 | 302 | 35 | 4521 | 1 | 3 | 3 | 133 |
| 1 | 13047 | 31 | 18 | 2304 | 2 | 3 | 3 | 233 |
| 2 | 12583 | 2 | 18 | 4370 | 3 | 3 | 3 | 333 |
| 3 | 13748 | 95 | 5 | 710 | 1 | 3 | 2 | 132 |
| 4 | 15100 | 330 | 6 | 580 | 1 | 3 | 2 | 132 |
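
To see how the three scores are formed and joined into one segment code, here is a small standard-library sketch. It is only intended to approximate what `pd.qcut` with `q=3` does for data with distinct cut points. The customer values are made up, `tertile_scores` is a hypothetical helper, and using `statistics.quantiles` with `method="inclusive"` is an assumption chosen to get comparable cut points.

```python
from bisect import bisect_left
from statistics import quantiles

def tertile_scores(values, reverse=False):
    """Score each value 1-3 by tertile; reverse=True gives low values the high score."""
    cuts = quantiles(values, n=3, method="inclusive")  # the two cut points
    scores = []
    for v in values:
        bucket = bisect_left(cuts, v)  # 0, 1 or 2; a value equal to a cut stays in the lower bucket
        scores.append(3 - bucket if reverse else bucket + 1)
    return scores

# Hypothetical customers (all values are made up)
recency = [0, 2, 15, 31, 95, 302]
frequency = [18, 35, 10, 6, 5, 1]
monetary = [4300, 4500, 1000, 500, 700, 100]

r = tertile_scores(recency, reverse=True)  # recent buyers score 3
f = tertile_scores(frequency)
m = tertile_scores(monetary)
rfm_val = [f"{a}{b}{c}" for a, b, c in zip(r, f, m)]
print(rfm_val)  # ['333', '333', '222', '221', '112', '111']
```

Reversing only the recency scale keeps the reading consistent: a 3 is always the favourable end of a dimension, so `'333'` describes a recent, frequent, high-spending customer.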


```python
# example names for segments
mapping = {
    'Best customers': '333',
    'No purchases recently': '133',
    'Low loyalty': '111',
    'New customers': '311'
}
```
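
Because the mapping goes from segment name to RFM code, looking up the name for a given customer means inverting it. The following is a minimal sketch of that lookup; `segment_name` is a hypothetical helper, and the `'Other'` fallback for codes without a name is an assumption.

```python
# Same example mapping as above: segment name -> RFM code
mapping = {
    'Best customers': '333',
    'No purchases recently': '133',
    'Low loyalty': '111',
    'New customers': '311',
}

# Invert it so that a code can be looked up directly
code_to_segment = {code: name for name, code in mapping.items()}

def segment_name(rfm_val, default='Other'):
    """Return the segment name for an RFM code, or `default` if the code is unnamed."""
    return code_to_segment.get(rfm_val, default)

for code in ['333', '133', '232']:
    print(code, '->', segment_name(code))
```

With a DataFrame like the one in this lesson, the same idea could be applied column-wise, for example with `df['rfm_val'].map(code_to_segment)`.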

```python
rowData = df.loc[311,:]
rowData.head()
```

```
customer     15953
recency         15
frequency       10
monetary      1066
r_val            3
Name: 311, dtype: object
```

```python
# descriptive statistics for a single segment, e.g. the best customers
df[df.rfm_val == mapping['Best customers']].describe().T
```

```python
# print the results
for k, v in mapping.items():
    print(k + ',')
    print(df[df.rfm_val == v].drop('customer', axis=1).describe().T)
    print()
```

The output contains descriptive statistics for each segment. For instance, the statistics for the best customers show that they purchased recently and frequently, and that their monetary value is quite high. Therefore, it is important for the company to retain the customers in this segment.

Other segments call for different approaches, which is natural given the purpose of segmentation. Developing a distinct approach for each segment can have a big impact on customer loyalty. For instance, you can see the descriptions and required actions for the four segments as follows:

```python
df
```

<a id="conclusion"></a>
## Conclusion

As mentioned at the beginning of the lesson, different approaches may be needed to solve business problems. It is necessary to identify the problem well and build the solution step by step. Although no single approach is a viable solution to every problem, trying different approaches to tackle an issue benefits both your company and your career development.

This practice should give you a head start on customer segmentation.