# Introduction

In this notebook, we will explore a customer segmentation and its results. As there are plenty of kernels with a clear explanation of the cleanup, I'll do it quickly without much detail. More explanation will be given on the preparation of the dataset and on the results. For readability, some code will be hidden to avoid an extremely long notebook. Feel free to let me know what you think about it, and I hope you'll enjoy it!

# Summary

The notebook is divided into 5 parts, as follows:

<ul>
<li>Pre-Processing</li>
<ul>
<li>Classic Cleanup</li>
<li>Customer Country</li>
</ul>
<li>Data Preparation on Articles</li>
<ul>
<li>Preparation of Matrices</li>
<li>Clustering</li>
<li>Analysis</li>
</ul>
<li>New Datasets per Invoice</li>
<li>New Dataset per Customer</li>
<ul>
<li>Preparation for Clustering</li>
<li>Clustering</li>
<li>Analysis</li>
</ul>
<li>Building Model</li>
</ul>

Now we can start!


In [118]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 
import os
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import scipy
import scipy.sparse
import matplotlib.pyplot as plt
import nltk
import random
from wordcloud import WordCloud

from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.metrics import silhouette_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

%matplotlib inline

# Pre-Processing 

### Classic Cleanup

In this part, I'll handle cancelled orders and remove every line whose description indicates that the object is lost, and so on. All lines without a customer ID will be removed, as well as shipping costs (StockCode = POST or DOT). A new feature for the price of each line will be created, and some outliers (in terms of unit price or quantity) will be removed if and only if a matching cancellation exists. For example, there is an order of about 40k articles that was bought and cancelled just after, with the same article and customer. Without a cancellation, I'll consider that it's not a mistake.
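
To make the "outlier only if a cancellation exists" rule concrete, here is a minimal sketch on a toy DataFrame. The values are hypothetical, and the matching logic (same customer, same article, opposite quantity) is my assumption of what the rule means; the cleanup cell of this notebook uses a hard-coded list of quantities instead.

```python
import pandas as pd

# Toy invoices (hypothetical values): invoice numbers starting with "C" are cancellations.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "C536367", "536368"],
    "CustomerID": [1, 1, 1, 2],
    "Description": ["MUG", "PAPER CRAFT", "PAPER CRAFT", "LAMP"],
    "Quantity": [6, 40000, -40000, 2],
    "UnitPrice": [2.5, 1.0, 1.0, 10.0],
})
toy["Cancelled"] = toy["InvoiceNo"].str.startswith("C")

# Turn each cancellation into the purchase it would undo
# (same customer, same article, opposite quantity).
cancels = (
    toy.loc[toy["Cancelled"], ["CustomerID", "Description", "Quantity"]]
    .assign(Quantity=lambda d: -d["Quantity"])
    .drop_duplicates()
)

# A left merge with an indicator tells us which purchases have a matching cancellation.
purchases = toy[~toy["Cancelled"]].merge(
    cancels, on=["CustomerID", "Description", "Quantity"], how="left", indicator=True
)
kept = purchases[purchases["_merge"] == "left_only"].drop(columns=["_merge", "Cancelled"])
kept = kept.assign(Price=kept["UnitPrice"] * kept["Quantity"])
print(kept)
```

The huge order disappears together with its cancellation, while the regular lines are kept and get their `Price`.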

In [10]:
df = pd.read_csv("../input/data.csv", encoding="ISO-8859-1")
df.info()

In [11]:
# Handle cancellation as new feature
df["Cancelled"] = df["InvoiceNo"].str.startswith("C")
df["Cancelled"] = df["Cancelled"].fillna(False)

# Handle incorrect Description
df = df[df["Description"].str.startswith("?") == False]
df = df[df["Description"].str.isupper() == True]
df = df[df["Description"].str.contains("LOST") == False]
df = df[df["CustomerID"].notnull()]
df["CustomerID"] = df["CustomerID"].astype(int)

# Convert Invoice Number to integer as we already consider Cancellation as new feature
# (raw string for the regex, and a plain assignment instead of an inplace replace on a column)
df['InvoiceNo'] = df['InvoiceNo'].astype(str).str.replace(r"\D+", "", regex=True)
df['InvoiceNo'] = df['InvoiceNo'].astype('int')

# remove shipping invoices
df = df[(df["StockCode"] != "DOT") & (df["StockCode"] != "POST")]
df.drop("StockCode", inplace=True, axis=1)

# remove outliers by qty
qte_false = [74215, 3114, 80995]  # found during exploration (boxplot on qty or price), not reproduced here
# drop the lines with these quantities; cancelled orders are filtered out in the next step
df = df[~df["Quantity"].isin(qte_false)]

# Now we can only keep the order without cancellation
df = df[df["Cancelled"] == False]
df.drop("Cancelled", axis=1, inplace=True)

# We can create the feature Price
df["Price"] = df["UnitPrice"] * df["Quantity"]

# convert date to proper datetime
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])

In [12]:
df.info()

Now that the dataset is starting to be clean, we can convert the countries to a numeric value. To do so, we will rank them based on their impact on the revenue.

### Handling Customer Country

As mentioned above, we can encode each country by its rank based on its impact on the revenue. Let's look at the number of invoices per country and the revenue per country.

In [15]:
revenue_per_countries = df.groupby(["Country"])["Price"].sum().sort_values()
revenue_per_countries.plot(kind='barh', figsize=(15,12))
plt.title("Revenue per Country")
plt.show()

In [16]:
No_invoice_per_country = df.groupby(["Country"])["InvoiceNo"].count().sort_values()
No_invoice_per_country.plot(kind='barh', figsize=(15,12))
plt.title("Number of Invoices per Country")
plt.show()

We can see that, for example, the Netherlands brings in more money with fewer invoices. That means we should order the countries by their average basket price instead of only the total price or the quantity. So the strategy changes.
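
Here is a small sketch of why the two rankings can disagree, on hypothetical numbers. Note one assumption that differs from the cell above: I count distinct invoices with `nunique()`, whereas `count()` counts invoice lines.

```python
import pandas as pd

# Hypothetical invoice lines: the UK has many small baskets, the Netherlands one big basket.
toy = pd.DataFrame({
    "Country": ["UK", "UK", "UK", "UK", "Netherlands", "Netherlands"],
    "InvoiceNo": [1, 1, 2, 3, 4, 4],
    "Price": [100.0, 200.0, 150.0, 250.0, 300.0, 200.0],
})

revenue = toy.groupby("Country")["Price"].sum()
n_invoices = toy.groupby("Country")["InvoiceNo"].nunique()

# First sum the lines of each invoice, then average the baskets of each country.
avg_basket = (
    toy.groupby(["Country", "InvoiceNo"])["Price"].sum()
    .groupby(level="Country").mean()
)

print(pd.DataFrame({"revenue": revenue, "invoices": n_invoices, "avg_basket": avg_basket}))
```

The UK leads on total revenue, but the Netherlands leads on average basket, which is the quantity used for the ranking in the next cell.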

In [17]:
best_buyer = df.groupby(["Country", "InvoiceNo"])["Price"].sum().reset_index().groupby(["Country"])["Price"].mean().sort_values()
best_buyer.plot(kind='barh', figsize=(15,12))
plt.title("Average Basket Price per Country")
plt.show()

That's it, we now have a cleaner scale, as the countries are no longer biased by the huge number of orders coming from the UK. It's also a good indicator of where the shop should expand its market. If I were the owner, I'd think about the Netherlands! Now let's encode the country.

In [18]:
encoder_countries = best_buyer.rank().to_dict()
decoder_countries = {rank: country for country, rank in encoder_countries.items()}  # rank -> country

df["Country"]  = df["Country"].apply(lambda x:encoder_countries[x])
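
As a quick sanity check of the encoding, here is a minimal sketch with hypothetical average baskets. It only illustrates the round trip between countries and ranks; the figures are not taken from the dataset.

```python
import pandas as pd

# Hypothetical average basket per country.
avg_basket = pd.Series({"UK": 230.0, "France": 410.0, "Netherlands": 2100.0})

encoder = avg_basket.rank().to_dict()                          # country -> rank (1.0 = smallest basket)
decoder = {rank: country for country, rank in encoder.items()}  # rank -> country

countries = pd.Series(["UK", "Netherlands", "UK", "France"])
encoded = countries.map(encoder)
decoded = encoded.map(decoder)
print(encoded.tolist(), decoded.tolist())
```

One caveat: `rank()` gives tied countries the same (average) rank by default, so the decoder is only unambiguous when no two countries share the same average basket.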

# Preparation of articles

Now only one feature remains to pre-process before going on: the article descriptions. To do so, we will first generate a Term-Frequency (TF) matrix and a Term-Frequency Inverse-Document-Frequency (TF-IDF) matrix from all unique articles. Then we will prepare these matrices so that we can run a clustering on them. The clustering will be tried with multiple numbers of clusters. The best parameter will be used, and an analysis of the content will be done in order to check that it's relevant.
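
To see the difference between the two matrices, here is a minimal sketch on three made-up descriptions. It skips the stemming step used in this notebook (an assumption to keep it self-contained) and relies on the default scikit-learn analyzers.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Three made-up article descriptions.
docs = ["RED MUG", "BLUE MUG", "RED CANDLE HOLDER"]

tf = CountVectorizer()
tf_matrix = tf.fit_transform(docs)        # raw counts, one column per word

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(docs)  # counts reweighted by rarity, rows L2-normalized

blue, mug = tfidf.vocabulary_["blue"], tfidf.vocabulary_["mug"]
print("TF     'BLUE MUG':", tf_matrix[1, blue], tf_matrix[1, mug])
print("TF-IDF 'BLUE MUG':", tfidf_matrix[1, blue], tfidf_matrix[1, mug])
```

In the TF matrix both words of "BLUE MUG" count the same, while TF-IDF gives more weight to "blue" because it appears in fewer descriptions than "mug".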

In [19]:
X = df["Description"].unique()

In [20]:
X = df["Description"].unique()

stemmer = nltk.stem.porter.PorterStemmer()
stopword = nltk.corpus.stopwords.words('english')

def stem_and_filter(doc):
    tokens = [stemmer.stem(w) for w in analyzer(doc)]
    return [token for token in tokens if token.isalpha()]

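# NB: with a callable analyzer, scikit-learn ignores `lowercase` and `stop_words`;
# lowercasing still happens inside the default analyzer called by stem_and_filter
analyzer = CountVectorizer().build_analyzer()
CV = CountVectorizer(lowercase=True, stop_words="english", analyzer=stem_and_filter)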
TF_matrix = CV.fit_transform(X)
print("TF_matrix :", TF_matrix.shape, "of", TF_matrix.dtype)

analyzer = TfidfVectorizer().build_analyzer()
CV = TfidfVectorizer(lowercase=True, stop_words="english", analyzer=stem_and_filter, min_df=0.00, max_df=0.3)  # we remove words if it appears in more than 30 % of the corpus (not found stopwords like Box, Christmas and so on)
TF_IDF_matrix = CV.fit_transform(X)
print("TF_IDF_matrix :", TF_IDF_matrix.shape, "of", TF_IDF_matrix.dtype)

So no word appears in more than 30% of our descriptions. Which matrix should we use? Both matrices are very sparse, so before clustering we should make sure that we can compute proper distances. The usual approach in such a case is to go through TruncatedSVD. I used 100 output features, but we may increase it; 100 is the value recommended in the scikit-learn documentation for what is called Latent Semantic Analysis. Both outputs will be dense matrices of float64 numbers. After normalizing each row to a norm of 1, we can run the clustering.
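
The reason for the normalization step can be sketched on a toy corpus: once every row has a norm of 1, the squared Euclidean distance used by KMeans equals 2 - 2 * cosine similarity, so the clustering behaves like a cosine-based one. The descriptions and the 2 components below are assumptions chosen to keep the example tiny.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer

# Made-up descriptions; the notebook uses 100 components, 2 are enough here.
docs = ["RED MUG", "BLUE MUG", "RED CANDLE HOLDER",
        "BLUE CANDLE HOLDER", "GLASS CANDLE HOLDER"]
tfidf_matrix = TfidfVectorizer().fit_transform(docs)

svd = TruncatedSVD(n_components=2, random_state=0)
embedded = Normalizer(copy=False).fit_transform(svd.fit_transform(tfidf_matrix))

norms = np.linalg.norm(embedded, axis=1)   # every row now has length 1
cosine = embedded @ embedded.T             # cosine similarity between rows
sq_dist = ((embedded[:, None, :] - embedded[None, :, :]) ** 2).sum(axis=-1)

print(norms)
print(np.allclose(sq_dist, 2 - 2 * cosine))
```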

In [21]:
svd = TruncatedSVD(n_components = 100)
normalizer = Normalizer(copy=False)

TF_embedded = svd.fit_transform(TF_matrix)
TF_embedded = normalizer.fit_transform(TF_embedded)
print("TF_embedded :", TF_embedded.shape, "of", TF_embedded.dtype)

TF_IDF_embedded = svd.fit_transform(TF_IDF_matrix)
TF_IDF_embedded = normalizer.fit_transform(TF_IDF_embedded)
print("TF_IDF_embedded :", TF_IDF_embedded.shape, "of", TF_IDF_embedded.dtype)

For the clustering, we can check the silhouette score. If we compute it for several numbers of clusters, we will see that it plateaus after 60 clusters. To determine the best choice after that, we can also look at the number of articles in each cluster. If all articles are in one cluster, that means the clustering is very poor. We can also look at the distribution of articles per cluster.
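
The selection procedure can be sketched on synthetic data. Everything below is an assumption made for illustration (four well-separated blobs generated with `make_blobs`), not the article embeddings of this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: 4 well-separated groups of 75 points each.
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
points, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)

results = {}
for k in (2, 3, 4, 5, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    sizes = np.bincount(labels, minlength=k)       # number of points per cluster
    results[k] = (silhouette_score(points, labels), sizes.std())
    print(k, results[k])

best_k = max(results, key=lambda k: results[k][0])
print("best k by silhouette:", best_k)
```

A high silhouette score together with a low standard deviation of the cluster sizes is the combination we are looking for.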

In [None]:
score_tf = []
score_tfidf = []
mean_tf = []
std_tf = []
mean_tfidf = []
std_tfidf = []

x = list(range(5, 105, 5))

for n_clusters in x:
    kmeans = KMeans(init='k-means++', n_clusters = n_clusters, n_init=10)
    kmeans.fit(TF_embedded)
    clusters = kmeans.predict(TF_embedded)
    silhouette_avg = silhouette_score(TF_embedded, clusters)
#     print("N clusters =", n_clusters, "Silhouette Score :", silhouette_avg)
    rep = np.bincount(clusters, minlength=n_clusters)  # number of articles in each cluster
    score_tf.append(silhouette_avg)
    mean_tf.append(rep.mean())
    std_tf.append(rep.std())

for n_clusters in x:
    kmeans = KMeans(init='k-means++', n_clusters = n_clusters, n_init=10)
    kmeans.fit(TF_IDF_embedded)
    clusters = kmeans.predict(TF_IDF_embedded)
    silhouette_avg = silhouette_score(TF_IDF_embedded, clusters)
#     print("N clusters =", n_clusters, "Silhouette Score :", silhouette_avg)
    rep = np.bincount(clusters, minlength=n_clusters)  # number of articles in each cluster
    score_tfidf.append(silhouette_avg)
    mean_tfidf.append(rep.mean())
    std_tfidf.append(rep.std())

In [26]:
plt.figure(figsize=(20,16))

plt.subplot(2, 1, 1)
plt.plot(x, score_tf, label="TF matrix")
plt.plot(x, score_tfidf, label="TF-IDF matrix")
plt.title("Evolution of the Silhouette Score")
plt.legend()

plt.subplot(2, 1, 2)
plt.plot(x, mean_tfidf, label="TF-IDF mean")
plt.plot(x, mean_tf, label="TF mean")
plt.plot(x, std_tfidf, label="TF-IDF St.Dev")
plt.plot(x, std_tf, label="TF St.Dev")
plt.ylim(0, 200)
plt.title("Evolution of the mean and Std of the cluster sizes")
plt.legend()

plt.show()

We can see that 100 clusters is the value where the silhouette score is the highest and the standard deviation the lowest. In terms of choice, we should select the TF-IDF matrix instead of the TF one, as its score is higher after 60 clusters. Let's do a final clustering with 100 clusters and explore the content of some clusters using word clouds.

In [114]:
n_clusters = 100

kmeans = KMeans(init='k-means++', n_clusters = n_clusters, n_init=30, random_state=42)
kmeans.fit(TF_IDF_embedded)
clusters = kmeans.predict(TF_IDF_embedded)

In [119]:
plt.figure(figsize=(20,8))
wc = WordCloud()

for num, cluster in enumerate(random.sample(range(100), 12)) :
    plt.subplot(3, 4, num+1)
    wc.generate(" ".join(X[np.where(clusters==cluster)]))
    plt.imshow(wc, interpolation='bilinear')
    plt.title("Cluster {}".format(cluster))
    plt.axis("off")
plt.show()

We can see that the content differs from one cluster to another, and the balance is quite good according to the calculation above. We can consider it a good clustering. We can now map each article to its cluster.

In [29]:
dict_article_to_cluster = {article : cluster for article, cluster in zip(X, clusters)}

We can also take a look at the position of the clusters in space with t-SNE. A good and faster alternative is UMAP (https://github.com/lmcinnes/umap).

In [None]:
tsne = TSNE(n_components=2)
proj = tsne.fit_transform(TF_IDF_embedded)

plt.figure(figsize=(10,10))
plt.scatter(proj[:,0], proj[:,1], c=clusters)
plt.title("Visualisation of the clustering with TSNE", fontsize="25")
plt.show()

# Modeling

### Intermediate dataset grouped by invoices

Now we have everything to start the preparation of the final dataset. First, we will one-hot encode the article cluster of every line and multiply it by the price, to get the amount spent on every cluster per invoice. In parallel, we will also perform a groupby on the initial dataset before merging them.
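
Here is a minimal sketch of that "one-hot times price" trick on hypothetical invoice lines (the cluster ids and prices are made up).

```python
import pandas as pd

# Hypothetical invoice lines, each already mapped to an article cluster.
lines = pd.DataFrame({
    "InvoiceNo": [1, 1, 1, 2],
    "article_cluster": [0, 2, 0, 1],
    "Price": [10.0, 5.0, 2.5, 8.0],
})

# One column per article cluster, holding 1.0 on the lines that belong to it...
dummies = pd.get_dummies(lines["article_cluster"], prefix="Article_cluster", dtype=float)
# ...multiplied row by row with the price, so each line puts its amount in one column.
spend = dummies.mul(lines["Price"], axis=0)

per_invoice = pd.concat([lines["InvoiceNo"], spend], axis=1).groupby("InvoiceNo").sum()
print(per_invoice)
```

Each row of `per_invoice` sums to the total of the invoice, split by article cluster.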

In [30]:
cluster = df['Description'].apply(lambda x : dict_article_to_cluster[x])
df2 = pd.get_dummies(cluster, prefix="Article_cluster").mul(df["Price"], 0)
df2 = pd.concat([df['InvoiceNo'], df2], axis=1)
df2_grouped = df2.groupby('InvoiceNo').sum()

In [31]:
custom_aggregation = {}
custom_aggregation["Price"] = "sum"
custom_aggregation["InvoiceDate"] = lambda x:x.iloc[0]
custom_aggregation["CustomerID"] = lambda x:x.iloc[0]
custom_aggregation["Country"] = lambda x:x.iloc[0]
custom_aggregation["Quantity"] = "sum"

df_grouped = df.groupby("InvoiceNo").agg(custom_aggregation)

In [32]:
# let's create the recency for every invoice
now = df_grouped["InvoiceDate"].max()  # the dataset is historical, so "now" is the date of the last invoice
df_grouped["Recency"] = (now - df_grouped["InvoiceDate"]).dt.days  # whole days elapsed before "now"

# add features required for the next groupby
df_grouped["nb_visit"] = 1
df_grouped["total_spent"] = 1

### Final Dataset per customer

In [33]:
df2_grouped_final = pd.concat([df_grouped['CustomerID'], df2_grouped], axis=1).set_index("CustomerID").groupby("CustomerID").sum()
df2_grouped_final = df2_grouped_final.div(df2_grouped_final.sum(axis=1), axis=0)
df2_grouped_final = df2_grouped_final.fillna(0)

In [34]:
custom_aggregation = {}
custom_aggregation["Price"] = ["mean", "sum"]
custom_aggregation["nb_visit"] = "sum"
custom_aggregation["Country"] = lambda x:x.iloc[0]
custom_aggregation["Quantity"] = "sum"
custom_aggregation["Recency"] = ["min", "max"]

df_grouped_final = df_grouped.groupby("CustomerID").agg(custom_aggregation)

In [35]:
df_grouped_final["Freq"] = (df_grouped_final["Recency"]["max"]  - df_grouped_final["Recency"]["min"] ) / df_grouped_final["nb_visit"]["sum"]
df_grouped_final.columns = ["avg_price", "sum_price", "nb_visit", "country", "quantity", "min_recency", "max_recency", "freq"]

In [36]:
df_grouped_final.head()

### Clustering

Now we have our 2 datasets. We won't merge them directly as they don't share the same scaling, so the scaling has to be done upfront. One dataset has raw values and the other one is scaled by row to hold the share spent on every article cluster. Now we will do a clustering and check the number of clusters required to describe all kinds of customers (nevertheless, in order to be able to explain every group, the number of clusters will be limited to 10).
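
The two kinds of scaling can be sketched on made-up numbers: the spend per article cluster is normalized by row into shares, while the raw features are scaled by column. The arrays below are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical raw features per customer: average basket and number of visits.
raw = np.array([[20.0, 1.0], [35.0, 4.0], [3600.0, 70.0]])
# Hypothetical amount spent per article cluster by the same customers.
spend = np.array([[30.0, 10.0, 0.0], [0.0, 5.0, 5.0], [100.0, 100.0, 200.0]])

shares = spend / spend.sum(axis=1, keepdims=True)   # each row sums to 1
raw_std = StandardScaler().fit_transform(raw)       # each column: mean 0, std 1
raw_minmax = MinMaxScaler().fit_transform(raw)      # each column: between 0 and 1

X_std = np.concatenate([raw_std, shares], axis=1)
X_minmax = np.concatenate([raw_minmax, shares], axis=1)
print(X_std.shape, X_minmax.shape)
```

As a side note, min-max scaling is unaffected by a prior standardization of the same columns, so applying `MinMaxScaler` to already standardized features gives the same result as applying it to the raw ones.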

In [37]:
# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() replaces it
X1 = df_grouped_final.to_numpy()
X2 = df2_grouped_final.to_numpy()

# scale the raw features only; X2 already holds shares between 0 and 1
X1_std = StandardScaler().fit_transform(X1)
X_final_std_scale = np.concatenate((X1_std, X2), axis=1)

X1_minmax = MinMaxScaler().fit_transform(X1)
X_final_minmax_scale = np.concatenate((X1_minmax, X2), axis=1)

In [38]:
x = list(range(2, 10))
y_std = []
y_minmax = []
for n_clusters in x:
    print("n_clusters =", n_clusters)
    
    kmeans = KMeans(init='k-means++', n_clusters = n_clusters, n_init=10)
    kmeans.fit(X_final_std_scale)
    clusters = kmeans.predict(X_final_std_scale)
    silhouette_avg = silhouette_score(X_final_std_scale, clusters)
    y_std.append(silhouette_avg)
    print("The average silhouette_score is :", silhouette_avg, "with Std Scaling")
    
    kmeans.fit(X_final_minmax_scale)
    clusters = kmeans.predict(X_final_minmax_scale)
    silhouette_avg = silhouette_score(X_final_minmax_scale, clusters)
    y_minmax.append(silhouette_avg)
    print("The average silhouette_score is :", silhouette_avg, "with MinMax Scaling")

In [39]:
plt.figure(figsize=(12,6))

plt.plot(x, y_std, label="Using Standard Scaling")
plt.plot(x, y_minmax, label="Using MinMax Scaling")

plt.legend()
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Score")
plt.title("Impact of the Scaling on the Clustering")
plt.xticks(x)

plt.show()

With 2 clusters, the score is clearly high because one cluster groups nearly all customers, so we should focus only on higher numbers of clusters. There we can see that the standard scaling is clearly better. Regarding the score, the best number of clusters seems to be 6. We can take a look at the histogram to make sure that the clusters are balanced, even if we may have a few customers that are really different.

In [122]:
kmeans = KMeans(init='k-means++', n_clusters = 6, n_init=30, random_state=42)  # random state just to be able to refer to cluster numbers during the analysis
kmeans.fit(X_final_std_scale)
clusters = kmeans.predict(X_final_std_scale)

In [104]:
plt.figure(figsize = (20,8))
n, bins, patches = plt.hist(clusters, bins=6) # arguments are passed to np.histogram
plt.xlabel("Cluster")
plt.title("Number of Customer per cluster")
plt.xticks([rect.get_x()+ rect.get_width() / 2 for rect in patches], ["Cluster {}".format(x) for x in range(6)])

for rect in patches:
    y_value = rect.get_height()
    x_value = rect.get_x() + rect.get_width() / 2

    space = 5
    va = 'bottom'
    label = str(int(y_value))
    
    plt.annotate(
        label,                      # Use `label` as label
        (x_value, y_value),         # Place label at end of the bar
        xytext=(0, space),          # Vertically shift label by `space`
        textcoords="offset points", # Interpret `xytext` as offset in points
        ha='center',                # Horizontally center label
        va=va)                      # Anchor the label by its bottom edge, just above the bar

plt.show()

We have 1 nearly empty cluster (#4) with 6 customers, 2 very small ones (#2 and #3) and 3 quite balanced ones (#0, #1 and #5). Let's now explore them. The small clusters may be very good or very bad customers...

### Analysis

We will now use the raw data or the dataset of shares per article cluster to explore all types of customers. Let's start with a t-SNE visualisation as we did previously.

In [None]:
tsne = TSNE(n_components=2)
proj = tsne.fit_transform(X_final_std_scale)

plt.figure(figsize=(10,10))
plt.scatter(proj[:,0], proj[:,1], c=clusters)
plt.title("Visualisation of the clustering with TSNE", fontsize="25")
plt.show()

We can also compare the clusters based on the frequency of purchase and the average price.

In [105]:
df_grouped_final["cluster"] = clusters
df_grouped_final.head()

In [106]:
# custom_aggregation = {}
# custom_aggregation["avg_price"] = "mean"
# custom_aggregation["nb_visit"] = "mean"
# custom_aggregation["Country"] = lambda x:x.iloc[0]
# custom_aggregation["min_recency"] = "mean"
# custom_aggregation["max_recency"] = ["min", "max"]

# df_grouped_final = df_grouped.groupby("CustomerID").agg(custom_aggregation)

df_analysis = df_grouped_final.groupby("cluster").mean()
df_analysis.head()

In [107]:
price = df_analysis["avg_price"].values
freq = df_analysis["freq"].values
visit = df_analysis["nb_visit"].values

plt.figure(figsize = (20,8))
plt.scatter(price, freq, s=visit*20)
    
for label, x, y in zip(["Customer #{}".format(x) for x in range(6)], price, freq):
    plt.annotate(
        label,
        xy=(x, y), xytext=(-20, 20),
        textcoords='offset points', ha='right', va='bottom',
        bbox=dict(boxstyle='round,pad=0.5', fc='yellow', alpha=0.5),
        arrowprops=dict(arrowstyle = '->', connectionstyle='arc3,rad=0'))
    
plt.title("Average basket and purchase frequency per cluster")
plt.xlabel("Average Price per Invoice")
plt.ylabel("Time between 2 invoices")

plt.show()

In [108]:
x_max = df_analysis["min_recency"].values
x_min = df_analysis["max_recency"].values
freq = df_analysis["freq"].values

plt.figure(figsize = (20,8))
for i in range(6):
    plt.plot([-x_min[i] , -x_max[i]], [i, i], linewidth=freq[i]/5)

plt.xlim(-365, 0)
plt.title("Balance of invoices per Customer Clusters")
plt.xlabel("Recency")
plt.ylabel("Cluster")
plt.show()

The previous plots highlight the differences between clusters. 

Cluster #4 has a very high mean price (3600 € per basket) and orders very often (every 5 days). If we look at the histogram of the number of customers, we can see that this small group of people should be considered as V.I.P. and handled like "outliers". In terms of history, we can also see that they are old customers who are still active, so they are probably happy with our store.

In terms of frequency, we can see that there are 2 clusters with a good frequency as well but a small basket (#0 and #1). If we look at the "history", we can see that these are customers who purchased only a few times and whom we were not able to convert into frequent customers. Unfortunately, these 2 clusters group more than 2350 customers. They probably joined the store for a specific discount.

To finish, if we look at cluster #5, which groups 1850 customers, we can see that they do not order very often but for a decent amount (360 € every 60 days). We should maintain the loyalty of this group, as it also represents an important part of the revenue.

Now let's see what every cluster is purchasing.

In [109]:
purchase_mean = df2_grouped_final.set_index(clusters).groupby(clusters).mean()
purchase_mean.head()

In [110]:
plt.figure(figsize=(15,12))
ax = plt.subplot(111, projection='polar')
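theta = 2 * np.pi * np.linspace(0, 1, 100, endpoint=False)  # one angle per article cluster, without overlapping 0 and 2*pi
matrix = purchase_mean.to_numpy()  # as_matrix() was removed in pandas 1.0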

for i in range(6):
    r = matrix[i, :]
    ax.plot(theta, r, label="Customer Group {}".format(i))

ax.set_xticklabels([
    "Cluster 0", "Cluster 12", "Cluster 25", "Cluster 38", 
    "Cluster 50", "Cluster 62", "Cluster 75", "Cluster 88"
])

ax.grid(True)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
ax.set_title("Cluster of article bought per type of customer", va='bottom')
plt.show()

We can see that there are some basic clusters of articles that all customers order from, and some others that are nearly empty (for example cluster 76). If we have to steer our company, we could reduce the orders of this type of item and focus on the main buckets.

# Building Classification Model

We saw that all classes are quite different. For a good trade-off in production, we can guess that a tree model should work well. We will try it first and, based on the result, we may switch to another model. There are also 2 additional good points for this model: we can build the tree and inspect the outcome visually, and we can use raw data without scaling. The evaluation will be done with cross-validation, as we don't have enough customers to also keep a validation dataset (at least for the nearly empty clusters). As the smallest cluster has 6 customers, we will use 5 splits for the validation, and we will track the accuracy and the standard deviation across all scores. With so few customers in some clusters, the scores are not really repeatable.
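
To illustrate why 5 splits is the limit with a class of 6 customers, here is a minimal sketch on synthetic, imbalanced data. The class sizes and features are made up. Note that `cross_val_score` with an integer `cv` already stratifies the folds for classifiers; the explicit `StratifiedKFold` below only makes that visible and adds shuffling.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# Synthetic imbalanced data: 3 classes with 200, 100 and 6 samples, 4 features each.
y = np.repeat([0, 1, 2], [200, 100, 6])
X = rng.normal(loc=y[:, None] * 5.0, scale=1.0, size=(len(y), 4))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# How many samples of the rare class end up in each test fold?
rare_per_fold = [int((y[test] == 2).sum()) for _, test in cv.split(X, y)]
print("rare class per test fold:", rare_per_fold)

clf = DecisionTreeClassifier(max_depth=3, min_samples_split=4, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)
print("Acc {:.3f} - Dev {:.3f}".format(scores.mean(), scores.std()))
```

With 6 samples and 5 folds, every test fold sees only one or two customers of the rare class, which is why the per-fold scores can vary a lot for such a cluster.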

In [123]:
clusters.shape

In [129]:
classification_dataset = pd.concat([df_grouped_final, df2_grouped_final], axis = 1)
classification_dataset.head()

In [132]:
X = classification_dataset.drop("cluster", axis=1).to_numpy()  # as_matrix() was removed in pandas 1.0
y = classification_dataset["cluster"].to_numpy()

In [155]:
min_samples_split = 4 
cv = 5
n_estimators = 40

In [162]:
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

In [156]:
for max_depth in range(3, 10):
    clf = DecisionTreeClassifier(random_state=0, max_depth=max_depth, min_samples_split=min_samples_split )
    scores = cross_val_score(estimator=clf, X=X, y=y, cv=cv, n_jobs=8)
    scores = np.array(scores)
    print("Depth {} : Acc {:.3f} - Dev {:.3f}".format(max_depth, scores.mean(), scores.std()))

In [157]:
for max_depth in range(3, 10):
    clf = ExtraTreeClassifier(random_state=0, max_depth=max_depth, min_samples_split=min_samples_split )
    scores = cross_val_score(estimator=clf, X=X, y=y, cv=cv, n_jobs=8)
    scores = np.array(scores)
    print("Depth {} : Acc {:.3f} - Dev {:.3f}".format(max_depth, scores.mean(), scores.std()))

In [158]:
for max_depth in range(3, 10):
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=0, max_depth=max_depth, min_samples_split=min_samples_split )
    scores = cross_val_score(estimator=clf, X=X, y=y, cv=cv, n_jobs=8)
    scores = np.array(scores)
    print("Depth {} : Acc {:.3f} - Dev {:.3f}".format(max_depth, scores.mean(), scores.std()))

In [159]:
for max_depth in range(3, 10):
    clf = ExtraTreesClassifier(n_estimators=n_estimators, random_state=0, max_depth=max_depth, min_samples_split=min_samples_split )
    scores = cross_val_score(estimator=clf, X=X, y=y, cv=cv, n_jobs=8)
    scores = np.array(scores)
    print("Depth {} : Acc {:.3f} - Dev {:.3f}".format(max_depth, scores.mean(), scores.std()))

The DecisionTreeClassifier is the best in accuracy, but its standard deviation is higher than that of the RandomForestClassifier, which has a lower accuracy. Models using ExtraTrees do not work as well as the Random Forest, so we will exclude them.
At first sight, I'd keep the RandomForestClassifier, as we get the benefit of reduced overfitting thanks to the ensemble of classifiers. Let's see the confusion matrix for both models:

In [170]:
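# stratify so that the tiny cluster 4 appears in both sets; fixed seed for a reproducible split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0, stratify=y)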

clf1 = DecisionTreeClassifier(random_state=0, max_depth=9, min_samples_split=4 )
clf2 = RandomForestClassifier(random_state=0, max_depth=9, min_samples_split=4, n_estimators=n_estimators)

clf1.fit(X_train, y_train)
clf2.fit(X_train, y_train)

y_pred1 = clf1.predict(X_test)
y_pred2 = clf2.predict(X_test)

In [171]:
np.unique(y_pred1)

In [177]:
from sklearn.metrics import confusion_matrix
import itertools

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

#     print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
#     plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

classes = ["Cluster {}".format(x) for x in range(6)]
np.set_printoptions(precision=2)

plt.figure(figsize=(20,12))
plt.subplot(1, 2, 1)
cnf_matrix = confusion_matrix(y_test, y_pred1)
plot_confusion_matrix(cnf_matrix, classes=classes, title='Confusion matrix, with DecisionTreeClassifier')

plt.subplot(1, 2, 2)
cnf_matrix = confusion_matrix(y_test, y_pred2)
plot_confusion_matrix(cnf_matrix, classes=classes, title='Confusion matrix, with RandomForestClassifier')

plt.show()

So we trained the models with fixed values and a dataset split into 2/3 for training and 1/3 for testing. We can see that there are clearly more mistakes with the Decision Tree than with the Random Forest. Moreover, in addition to being more accurate, the Random Forest also properly classifies cluster 4 with our VIP customers, which is a good point. The main error is between clusters 0 and 1 vs cluster 5. The main difference is that in some cases we lost the customer. This can be seen as a good point too: as we classify some lost customers as customers who are not often active, that may mean that they are borderline and that we may be able to convert them into permanent customers with some actions. As a result, this model can be used in production.
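
For readers less familiar with confusion matrices, here is a minimal sketch on made-up labels (not the predictions of the models above). Passing `labels` explicitly is my suggestion: it keeps the shape of the matrix fixed even if a rare cluster is missing from the test split.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted clusters for 9 customers.
y_true = np.array([0, 0, 0, 1, 1, 5, 5, 5, 4])
y_pred = np.array([0, 0, 5, 1, 5, 5, 5, 0, 4])

labels = [0, 1, 4, 5]
cm = confusion_matrix(y_true, y_pred, labels=labels)  # rows: true cluster, columns: predicted

# Share of each true cluster that is correctly predicted (recall per cluster).
recall = cm.diagonal() / cm.sum(axis=1)
print(cm)
print(dict(zip(labels, recall.round(2))))
```

Off-diagonal cells show which clusters get mixed up; in this toy case the confusion is between clusters 0, 1 and 5, while the single customer of cluster 4 is recovered.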

# Conclusion

In this notebook, I tried to explain as much as possible the work done to segment customers based on their behavior. We succeeded in finding some specific patterns and created a model able to classify customers from some of their orders. We can also use this work to drive some actions, like finding items to stop purchasing, spotting trending items, detecting customers who are being lost, rewarding our VIP customers and so on.

I hope you liked it, and feel free to let me know if you have comments or improvements to suggest on this dataset!