The objective of this notebook is to explore ECommerce data and use it to recommend products to customers. The analysis loosely follows a popular method of recommendation: given a sample customer, find similar customers, then recommend the products they've bought which the sample customer has not. In the other algorithms I've seen, the similarity score between customers is measured using a bitwise AND operation on bought/not-bought vectors. Although these algorithms are fast, they disregard how many of each product a customer has bought. Using bitwise AND, all customers who have only bought bottles and nails are measured as equal, whatever the quantities. Using Euclidean distance instead will measure them differently if they bought different quantities of each product.

In [None]:
import numpy as np
import pandas as pd
import math
import scipy.spatial
import os
import seaborn as sns
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

In [None]:
pd.DataFrame({
    'Customers': ['A', 'B', 'C'],
    'Bottles': [1, 1, 1],
    'Nails': [1, 1, 1]
})

Using bitwise AND, all three customers are seen as equal.
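To make the bitwise comparison concrete, here is a minimal sketch of how I understand it: each basket is reduced to a bought/not-bought vector and two customers are scored by the number of products they share. The `and_similarity` helper and the toy quantities are my own illustration, not taken from any particular library.

```python
import numpy as np

# Toy baskets (rows: customers A, B, C; columns: Bottles, Nails).
# The quantities differ, but every customer bought both products.
quantities = np.array([
    [4, 1],  # A
    [1, 2],  # B
    [5, 1],  # C
])

def and_similarity(u, v):
    """Count the products bought by both customers (hypothetical helper)."""
    return int(np.sum((u > 0) & (v > 0)))

scores = {
    'A-B': and_similarity(quantities[0], quantities[1]),
    'A-C': and_similarity(quantities[0], quantities[2]),
    'B-C': and_similarity(quantities[1], quantities[2]),
}
print(scores)  # every pair gets the same score of 2
```

Every pair scores the same because the quantities are thrown away before the comparison.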

In [None]:
pd.DataFrame({
    'Customers': ['A', 'B', 'C'],
    'Bottles': [4, 1, 5],
    'Nails': [1, 2, 1]
})

Using quantity as a distance measure, we can see that customer C is actually more similar to customer A than to customer B, for example.
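As a quick check of that claim, the Euclidean distances for the table above can be worked out by hand. This is only a sketch; the `euclidean` helper is a name I've made up for illustration.

```python
import numpy as np

# Quantity vectors (Bottles, Nails) for customers A, B and C.
a = np.array([4, 1])
b = np.array([1, 2])
c = np.array([5, 1])

def euclidean(u, v):
    """Straight-line distance between two quantity vectors."""
    return float(np.linalg.norm(u - v))

dist_ca = euclidean(c, a)  # sqrt(1^2 + 0^2) = 1.0
dist_cb = euclidean(c, b)  # sqrt(4^2 + 1^2) = sqrt(17), roughly 4.12
print(dist_ca, dist_cb)
```

The smaller the distance, the more similar the customers, so C sits much closer to A than to B.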

Let's explore this further...

In [None]:
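def showVector(vector):
    # Render a 2D array as an image, stretched so that a single row is visible
    plt.figure(figsize = (100,100))
    plt.imshow(vector, aspect = 100)

def sigmoid(x):
    # Squash a quantity into the range (0, 1) so large orders don't dominate the image
    return 1 / (1 + math.exp(-x))
# Vectorise so that sigmoid can be applied to every cell of a matrix
sigmoid = np.vectorize(sigmoid)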

In [None]:
data = pd.read_csv(
    '/kaggle/input/ecommerce-data/data.csv',
    encoding = "ISO-8859-1",
    # Assumption: the invoice column in this dataset is named 'InvoiceNo'
    dtype = {'CustomerID': str, 'InvoiceNo': str}
)

data

Looking at the table above, each row is a transaction with a few columns that are of interest:
- **StockCode** (the product)
- **Description** (of the product)
- **Quantity** (of products bought)
- **CustomerID** (of the customer who bought the product)

These columns provide the values that describe the relationship between the customers and their purchases.

In [None]:
customers = data.CustomerID.unique()
products = data.StockCode.unique()
numberOfCustomers = len(customers)
numberOfProducts = len(products)
pd.DataFrame([{
    'Customers': numberOfCustomers,
    'Products': numberOfProducts
}], index=['quantity'])

In [None]:
productTransactionCounts = data.StockCode.value_counts()
productPurchaseFrequency = pd.DataFrame(data = {
    'Product': productTransactionCounts.index.values,
    '#Transactions': productTransactionCounts.values
})
productPurchaseFrequency.head()

In [None]:
productQuantityTotals = data.groupby('StockCode')['Quantity'].sum()
productQuantityTotals = pd.DataFrame(data = {
    'Product': productQuantityTotals.index.values,
    'QuantitySum': productQuantityTotals.values
})
productQuantityTotals.sort_values(by='QuantitySum', ascending=False).head()

In [None]:
customerQuantityTotals = data.groupby('CustomerID')['Quantity'].sum()
customerQuantityTotals = pd.DataFrame(data = {
    'Customer': customerQuantityTotals.index.values,
    'QuantitySum': customerQuantityTotals.values
})
customerQuantityTotals.sort_values(by='QuantitySum', ascending=False)

The negative quantities are the purchases that have been returned.

*Customer 14646 loves buying stuff! You'll be able to clearly spot them in the matrix image below*

In [None]:
numberOfCancellations = len(data[data.Quantity < 0])
numberOfKeptTransactions = len(data) - numberOfCancellations
percentageOfCancellations = round(numberOfCancellations / len(data) * 100, 2)

fig, ax = plt.subplots()
ax.pie(
    [numberOfKeptTransactions, numberOfCancellations],
    labels=(
        f'Retained Transactions: #{numberOfKeptTransactions}',
        f'Cancelled Transactions {percentageOfCancellations}%: #{numberOfCancellations}'
    ),
    startangle=90
)
plt.show()

In [None]:
# Remove all cancellations (rows with a negative quantity)
data = data[data.Quantity > 0]

For my product recommendation I don't really care about cancelled orders. If they bought the product in the first place then that describes their purchase intent.

In [None]:
customerProductMatrix = data.pivot_table(index='CustomerID', columns='StockCode', values='Quantity', fill_value=0, aggfunc='sum')
customerProductMatrix

I've used pandas' pivot_table to produce a matrix with customers as rows and products as columns, where each cell is the total quantity of that product the customer bought. Most cells are 0, so the matrix is sparse.
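To show what the pivot does on something small enough to read, here is a sketch with a handful of made-up transactions (the `toy` frame and its IDs are hypothetical). Repeated purchases of the same product by the same customer are summed, and missing combinations are filled with 0.

```python
import pandas as pd

# Hypothetical transactions: customer 'c1' buys product 'p1' twice.
toy = pd.DataFrame({
    'CustomerID': ['c1', 'c1', 'c2', 'c1'],
    'StockCode':  ['p1', 'p2', 'p2', 'p1'],
    'Quantity':   [2, 1, 5, 3],
})

toy_matrix = toy.pivot_table(
    index='CustomerID',
    columns='StockCode',
    values='Quantity',
    fill_value=0,   # the customer never bought the product
    aggfunc='sum',  # repeated purchases are added together
)
print(toy_matrix)
```

Here 'c1' ends up with 5 of 'p1' (2 + 3), and 'c2' gets a 0 for 'p1'.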

In [None]:
numberOfCustomerProductRelations = customerProductMatrix.size
numberOfCustomerProductRelationsFulfilled = np.count_nonzero(customerProductMatrix)
sparsity = 1 - (numberOfCustomerProductRelationsFulfilled / numberOfCustomerProductRelations)
density = 1 - sparsity
pd.DataFrame([{
    '#CustomerProductRelations': numberOfCustomerProductRelations,
    '#CustomerProductRelationsFulfilled': numberOfCustomerProductRelationsFulfilled,
    'Sparsity': sparsity,
    'Density': density
}], index=['quantity'])

The table is highly sparse. To give you an idea, I've rendered an image of the table below (each pixel is a cell in the table/matrix) to make the density easier to see. Given that the rows are customers and the columns are products, you can immediately see the highly active buyers and the best-selling products.

In [None]:
matrix = sigmoid(customerProductMatrix)
plt.figure(figsize = (100,50))
plt.imshow(matrix);

In [None]:
sampleCustomerID = '18177'
sampleCustomer = np.asarray(customerProductMatrix.loc[sampleCustomerID])
sampleCustomer = np.reshape(sampleCustomer, [1, sampleCustomer.shape[0]])
showVector(sampleCustomer)

I'll pick an arbitrary customer and render their 'customer vector', which is simply a row in the matrix above. I've vertically stretched the image of the 1D vector so that it's more visible. The cooler colours are lower numbers/quantities and the hotter colours are higher numbers/quantities.

In [None]:
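sampleCustomerIndex = customerProductMatrix.index.get_loc(sampleCustomerID)
# Euclidean distance from every customer to the sample customer, shape (n, 1)
distances = scipy.spatial.distance.cdist(customerProductMatrix, sampleCustomer, metric='euclidean')
# The sample customer's distance to itself is 0; overwrite it with the mean
# so that it is never picked as its own best match
distances[sampleCustomerIndex][0] = distances.mean()
bestMatchIndex = distances.argmin()
# Select the best match by position, keeping it as a one-row DataFrame
bestMatchingCustomer = customerProductMatrix.iloc[[bestMatchIndex]]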

showVector(np.reshape(distances, [1, distances.shape[0]]))

Using the sample customer vector, I can compare it to every other customer vector in the matrix and calculate the Euclidean distance to each. The vector above displays those distances along the customer axis of the matrix (one value per customer, where a lower value means more similar to the sample customer). It shows little variance, which suggests that this customer is somewhat similar to most other customers (hence the small distance values). I've plotted the vector below to better display the distance values.
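The same comparison on a tiny made-up matrix looks like the sketch below. One difference from my cell above, and it is only an assumption about what is cleaner: here the sample customer's own zero distance is masked with infinity rather than the mean, so it can never win the argmin.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Toy customer-product matrix: 4 customers, 3 products (made-up quantities).
toy_matrix = np.array([
    [2, 0, 1],   # customer 0 (the sample customer)
    [2, 0, 2],   # customer 1
    [0, 9, 0],   # customer 2
    [4, 0, 1],   # customer 3
])
sample_index = 0
sample = toy_matrix[[sample_index]]   # keep it 2D, shape (1, 3)

# One distance per customer, shape (4, 1).
toy_distances = cdist(toy_matrix, sample, metric='euclidean')

# Assumption: hide the self-distance so the sample customer
# is never chosen as its own best match.
masked = toy_distances.ravel().copy()
masked[sample_index] = np.inf
best_match = int(masked.argmin())
print(toy_distances.ravel(), best_match)
```

Customer 1 differs from the sample by a single unit of one product, so it comes out as the best match.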

In [None]:
fig = plt.figure(figsize=(100,20))
plt.plot(distances)
plt.axvline(bestMatchIndex, color='k', linestyle='dashed', linewidth=2);

There are only a few customers that are very dissimilar. Most are quite similar, which suggests that this sample customer is a typical customer.

In [None]:
showVector(sampleCustomer)

In [None]:
showVector(bestMatchingCustomer)

The two vectors above display the sample customer (top) and the customer most similar to the sample customer (bottom). We can see they are pretty similar in that they've bought the same products, yet our sample customer has bought more things than the best-matching customer (BMC). The important similarity here is the quantity of the products they've both bought.

In [None]:
diff = np.subtract(bestMatchingCustomer, sampleCustomer)
diff[diff < 0] = 0
showVector(diff)

By subtracting the sample customer from the BMC we can see the products that the BMC bought that our sample customer hasn't. In theory we can use this vector as our recommendation. Recommended products could be sorted by the purchase quantity of the BMC. This could be useful, but there may be times when a product the BMC has bought in large quantities is not similar to the products the sample customer bought. A solution to this is to get a consensus from multiple BMCs.
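A small sketch of the subtraction step on made-up vectors. One thing it makes visible: clipping the difference at 0 also keeps products the sample customer has bought, just in smaller quantities than the BMC. The stricter `unseen` variant is my own assumption about how to keep only products that are completely new to the sample customer.

```python
import numpy as np

# Made-up quantity vectors over five products.
sample = np.array([3, 0, 1, 0, 2])
bmc    = np.array([1, 4, 1, 0, 5])

# Keep only the surplus the BMC bought over the sample customer.
surplus = np.clip(bmc - sample, 0, None)

# Stricter variant (an assumption): only keep products the
# sample customer has never bought at all.
unseen = np.where(sample == 0, bmc, 0)
print(surplus, unseen)
```

The last product shows up in `surplus` (5 - 2 = 3) even though the sample customer already owns it, while `unseen` only keeps the second product.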

In [None]:
customerProductMatrix['Distances'] = distances
customerProductMatrix = customerProductMatrix.sort_values('Distances')
customerProductMatrix.head()

Instead of just pulling out the BMC, we'll grab a number of BMCs that have the lowest distance (highest similarity) from the sample customer. We can calculate the distances and put them into a new column so that we can sort the rows by distance.

In [None]:
del customerProductMatrix['Distances']
bestMatchingCustomers = customerProductMatrix.head()
bestMatchingCustomerBinaries = np.where(bestMatchingCustomers > 0, 1, 0)
consensus = np.expand_dims(np.sum(bestMatchingCustomerBinaries, axis=0), 0)
consensus[sampleCustomer > 0] = 0
showVector(consensus)

Before getting the 'consensus vector' we have to make sure that the quantities of the products are normalised to 1 for each BMC. This is because we want to rank the products by how many BMCs bought them, not by the purchase quantities of the individual BMCs, which would otherwise dominate the sum when the BMC vectors are added together. For example, if one BMC bought 5 bottles and another bought 2, we want bottles to sum to 2 (one 'vote' for each product from each BMC). We can see from the vector above that there are a few products that are highly 'voted' for - they will rank high in the recommendations.
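The voting idea can be sketched on a tiny made-up example: three BMCs, four products, and a sample customer who already owns the third product. The numbers are invented purely for illustration.

```python
import numpy as np

# Made-up quantities for three BMCs over four products.
bmcs = np.array([
    [5, 0, 1, 0],
    [2, 0, 0, 3],
    [1, 0, 7, 0],
])
sample = np.array([0, 0, 4, 0])   # the sample customer already owns product 2

votes = (bmcs > 0).sum(axis=0)    # one vote per BMC per product
votes[sample > 0] = 0             # never recommend what they already bought
print(votes)
```

Product 0 gets three votes regardless of the 5, 2 and 1 units bought, and product 2 is zeroed out because the sample customer has it already.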

In [None]:
customerProductMatrix.loc['Rank'] = np.squeeze(consensus, 0)
customerProductMatrix = customerProductMatrix.T.sort_values('Rank', ascending=False)
recommendedProducts = customerProductMatrix[customerProductMatrix['Rank'] > 0]
recommendedProducts

I can add 'Rank' as a new row, transpose the matrix so that it becomes a column, sort in descending order, and remove all products that have a rank of 0.

*The more BMCs I add to the consensus vector the more product recommendations there'll be.*
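To see the effect of the number of BMCs, the whole pipeline can be folded into one function with the neighbour count as a parameter. This is a rough sketch rather than a drop-in replacement for the cells above: the `recommend` function, its `k` argument and the toy matrix are all names and values I've made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

def recommend(matrix, customer_id, k=5):
    """Rank unseen products by votes from the k nearest customers (sketch)."""
    sample = matrix.loc[[customer_id]].to_numpy()
    dist = pd.Series(
        cdist(matrix.to_numpy(), sample, metric='euclidean').ravel(),
        index=matrix.index,
    ).drop(customer_id)                      # exclude the customer themselves
    neighbours = dist.nsmallest(k).index     # the k best-matching customers
    votes = (matrix.loc[neighbours] > 0).sum(axis=0)
    votes[matrix.loc[customer_id] > 0] = 0   # drop already-bought products
    return votes[votes > 0].sort_values(ascending=False)

# Made-up matrix: rows are customers, columns are products.
toy = pd.DataFrame(
    [[2, 0, 0, 0],
     [2, 1, 0, 0],
     [3, 1, 1, 0],
     [0, 0, 0, 9]],
    index=['c1', 'c2', 'c3', 'c4'],
    columns=['p1', 'p2', 'p3', 'p4'],
)
print(recommend(toy, 'c1', k=1).to_dict())
print(recommend(toy, 'c1', k=2).to_dict())
```

With one neighbour only 'p2' is suggested; with two neighbours 'p3' joins the list and 'p2' collects a second vote.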

In [None]:
# Keep only the 'Rank' column, indexed by StockCode
recommendedProducts = recommendedProducts[['Rank']]
recommendedProductsTop = recommendedProducts.head(20)
recommendedProductsTop

Above are the top 20 ranking recommended products.

In [None]:
data.loc[data['StockCode'].isin(recommendedProductsTop.index.tolist()), 'Description'].unique()

In [None]:
data.loc[data['CustomerID'] == sampleCustomerID, 'Description']

Above are the recommended products (top) and the products that the sample customer has already bought (bottom). Compared to the range of products on offer, the recommended products are pretty similar to what the customer has already bought.

**Thank you for reading my first EDA! All feedback is welcome :)**