This notebook analyzes the content of an e-commerce database that lists purchases made by ∼4000 customers over a period of one year (from 2010/12/01 to 2011/12/09). Based on this analysis, I develop a model that anticipates the purchases a new customer will make during the following year, starting from their first purchase.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import time
import datetime as dt
from sklearn import preprocessing, model_selection, metrics, feature_selection
from sklearn.model_selection import GridSearchCV, learning_curve
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from sklearn import neighbors, linear_model, svm, tree, ensemble
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import silhouette_samples,silhouette_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, KFold
# plot_roc_curve is unused here and was removed in scikit-learn 1.2, so it is not imported
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Data Insight

Getting basic data insights

This gives some basic information on the content of the dataframe: the type of each variable, summary statistics, and the first few rows:


In [None]:
data = pd.read_csv('/kaggle/input/ecommerce-data/data.csv',encoding="ISO-8859-1")
separator = '\n*******************************\n'
print(data.info())
print(separator)
print(data.describe())
print(separator)
print(data.head(5))
print(separator)
print(data.shape)

Count the null values per column and compute the percentage of entries that need to be removed in order to get a cleaner dataset.


In [None]:
data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'])
print(data.info())
print(separator)
print(data.isnull().sum().sort_values(ascending = False))
print(separator)
print((data.isnull().sum().sort_values(ascending = False))/len(data)*100)


In [None]:
print(data.shape)

In [None]:
data.dropna(axis = 0, inplace = True)   # remove every row that contains a null value
print(data.shape)

Looking at the number of null values in the dataframe, it is interesting to note that ∼25% of the entries are not assigned to a particular customer. With the data available, it is impossible to impute values for the customer, so these entries are useless for the current exercise. This is why the cell above deletes them from the dataframe.
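
The missing-value check above can be wrapped in a small helper. The following is only a sketch on a made-up toy frame: the helper name `missing_summary` and the sample values are assumptions for illustration. Unlike the cell above, which drops every row containing any null, it drops only the rows without a `CustomerID`.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "536367", "536368"],
    "Description": ["MUG", None, "LAMP", "CLOCK"],
    "CustomerID": [17850.0, np.nan, 13047.0, np.nan],
})

summary = missing_summary(toy)
print(summary)

# Drop only the rows that cannot be attributed to a customer.
toy_clean = toy.dropna(subset=["CustomerID"])
print(toy_clean.shape)
```

Restricting the drop to `CustomerID` keeps rows that are only missing a less important field, which may or may not matter for the real data.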



In [None]:
x_values = data['Quantity']

# fill replaces the deprecated shade argument (seaborn >= 0.11); label must be lowercase
sns.kdeplot(x_values, color='b', fill=True, label='Quantity')

# Setting the X and Y Label 
plt.xlabel('Quantity')
plt.ylabel('Probability Density')

The Kernel Density Estimate (KDE) plot below shows the distribution of the unit price.

In [None]:
x_values = data['UnitPrice']

# fill replaces the deprecated shade argument (seaborn >= 0.11); label must be lowercase
sns.kdeplot(x_values, color='r', fill=True, label='UnitPrice')

# Setting the X and Y Label 
plt.xlabel('UnitPrice')
plt.ylabel('Probability Density')

In [None]:
data = data[data['Quantity']>=0]
data = data[data['UnitPrice']>=0]

print(data.shape)

In [None]:
'''Basic info and describe analysis over the dataset'''
data.info()
print(separator)
print(data.describe())  #data description 
data.head()

> This dataframe contains 8 variables that correspond to:

**InvoiceNo:** Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation.

**StockCode:** Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.

**Description:** Product (item) name. Nominal.

**Quantity:** The quantities of each product (item) per transaction. Numeric.

**InvoiceDate:** Invoice date and time. Numeric, the day and time when each transaction was generated.

**UnitPrice:** Unit price. Numeric, Product price per unit in sterling.

**CustomerID:** Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.

**Country:** Country name. Nominal, the name of the country where each customer resides.
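
Since cancellations are encoded in the invoice number, they can be flagged with a string test. This is a minimal sketch on made-up invoice numbers. The `Cancelled` column name is an assumption, and the check is case-insensitive in case the prefix appears in either case.

```python
import pandas as pd

# Toy invoices (made up); two of them carry the cancellation prefix.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536380", "c536391"],
    "Quantity": [6, -1, 24, -12],
})

# Cast to str first so the test also works if the column was parsed as numbers.
is_cancelled = toy["InvoiceNo"].astype(str).str.upper().str.startswith("C")
toy["Cancelled"] = is_cancelled

print(toy)
print("cancelled rows:", int(is_cancelled.sum()))
```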

Now add a column with the total price of each line (Quantity × UnitPrice); the code below names it TotalQuantity.


In [None]:
data['TotalQuantity'] = data['Quantity']*data['UnitPrice']
data.head(5)

Now we will look at the countries from which most of the orders were placed.


In [None]:
print(data[['InvoiceNo','Country']].groupby('Country').count().sort_values("InvoiceNo",ascending = False))

Creating a pie chart to visualize the data better

In [None]:
  
# Creating dataset 
country = ['Netherlands', 'United Kingdom', 'Belgium', 'Germany', 'France', 'EIRE', 'Spain', 'Portugal', 'Others', 'Switzerland']
invoice = [2363, 354345, 2031, 9042, 8342, 7238, 2485, 1462, 8774, 1842]
# Creating plot 
fig = plt.figure(figsize =(10, 7)) 
plt.pie(invoice, labels = country) 
  
# show plot 
plt.show() 

We see that the dataset is largely dominated by orders made from the UK.
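
The pie chart above uses hard-coded figures. As a sketch, the same kind of breakdown could be derived directly from the dataframe, so that it stays in sync with the cleaning steps. The helper name `top_n_with_others` and the toy data are assumptions for illustration.

```python
import pandas as pd

def top_n_with_others(counts: pd.Series, n: int) -> pd.Series:
    """Keep the n largest entries and sum the remaining ones into 'Others'."""
    ordered = counts.sort_values(ascending=False)
    top = ordered.iloc[:n]
    rest = ordered.iloc[n:].sum()
    if rest > 0:
        top = pd.concat([top, pd.Series({"Others": rest})])
    return top

# Toy order lines (made up): one dominant country and a few small ones.
toy = pd.DataFrame({
    "Country": ["United Kingdom"] * 6 + ["Germany"] * 2 + ["France", "Spain"]
})

shares = top_n_with_others(toy["Country"].value_counts(), n=2)
print(shares)
```

The result can be passed to the plot as `plt.pie(shares.values, labels=shares.index)`.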

In [None]:
'''Top 10 countries by number of order lines in the cleaned-up data.'''
data['Country'].value_counts().head(10).plot(kind='bar')

# Gross Amount of countries

Comparing the gross amount of each country using a bar plot


In [None]:
'''Top 7 countries by total gross sales amount.'''
data_temp = data.groupby(['Country'])['TotalQuantity'].sum().reset_index().sort_values('TotalQuantity',ascending=False).head(7)
print(data_temp)
data_temp.plot(x='Country', y='TotalQuantity',kind='bar')

Let's find the largest amount order.

In [None]:
data[data['TotalQuantity']==data['TotalQuantity'].max()]

**Conclusion:** Since the website originates from the United Kingdom, variables such as the number of customers and the gross total sales are dominated by the United Kingdom. The remaining portion is taken up by the neighbouring European countries.

**Which product descriptions appear most often**

In [None]:
items = data['Description'].value_counts().head()
print(items)

In [None]:
item_counts = data['Description'].value_counts().sort_values(ascending=False).head(20)
# x and y must be passed as keywords in recent seaborn versions
sns.barplot(x=item_counts.index, y=item_counts.values, palette="Blues_r")
plt.ylabel("Counts")
plt.title("Which items were bought more often?");
plt.xticks(rotation=90);

**This analysis gave us a general idea of which products were most in demand among customers; the bar plot visualizes the same result.**

In [None]:
print(data[['InvoiceNo','Country','CustomerID','TotalQuantity']].sort_values('TotalQuantity',ascending = False).head(15))

# **RFM Analysis**

**RFM (Recency, Frequency, Monetary)** analysis is a customer segmentation technique that uses past purchase behaviour to divide customers into groups.
RFM helps divide customers into various categories or clusters to identify customers who are more likely to respond to promotions and also for future personalization services.

**RECENCY (R)**: Days since last purchase

**FREQUENCY (F)**: Total number of purchases

**MONETARY VALUE (M)**: Total money this customer spent.
We will create those 3 customer attributes for each customer.
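
As a compact illustration of these three definitions, the sketch below computes them in one pass on a made-up toy table. The column name `LineTotal` and the choice of reference date (the day after the last recorded purchase) are assumptions, not necessarily what the cells below use.

```python
import pandas as pd

# Toy transactions (made up): customer 1 has two invoices, customer 2 has one.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1", "B1"],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-05", "2011-01-05", "2011-03-01", "2011-02-10", "2011-02-10"]
    ),
    "Quantity": [2, 1, 5, 10, 3],
    "UnitPrice": [1.5, 4.0, 2.0, 0.5, 1.0],
})
toy["LineTotal"] = toy["Quantity"] * toy["UnitPrice"]

# Assumption: the reference date is the day after the last recorded purchase.
snapshot = toy["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm = toy.groupby("CustomerID").agg(
    LastPurchase=("InvoiceDate", "max"),   # most recent purchase per customer
    Frequency=("InvoiceNo", "nunique"),    # number of distinct invoices
    Monetary=("LineTotal", "sum"),         # total amount spent
)
rfm["Recency"] = (snapshot - rfm["LastPurchase"]).dt.days
rfm = rfm[["Recency", "Frequency", "Monetary"]]
print(rfm)
```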

# **Recency**

To calculate recency, we need to choose a reference date from which we evaluate how many days ago the customer's last purchase was.

In [None]:
data['Date'] = data['InvoiceDate'].apply(lambda x: x.date())
data.head()

In [None]:
#recency dataframe
recency_df = data.groupby(by='CustomerID', as_index=False)['Date'].max()
recency_df.columns = ['CustomerID','LastPurchaseDate']
recency_df.head(5)


In [None]:
# Reference date: the day after the last purchase recorded in the data
now = data['Date'].max() + dt.timedelta(days=1)
print(now)


In [None]:
recency_df['Recency'] = recency_df['LastPurchaseDate'].apply(lambda x: (now - x).days)
recency_df.drop('LastPurchaseDate',axis = 1,inplace=True)
recency_df.head(5)

# Frequency
Frequency helps us to know how many times a customer purchased from us. To do that we need to check how many invoices are registered by the same customer.

In [None]:
temp = data.copy()
temp.drop_duplicates(['InvoiceNo','CustomerID'],keep='first',inplace=True)
frequency_df = temp.groupby(by=['CustomerID'], as_index=False)['InvoiceNo'].count()
frequency_df.columns = ['CustomerID','Frequency']
frequency_df.head()

# Monetary
The monetary attribute answers the question: how much money did the customer spend over time?

To do that, we sum the TotalQuantity column (the total price per line, created earlier) for each customer.

In [None]:
monetary_df = data.groupby(by = 'CustomerID',as_index=False).agg({'TotalQuantity':'sum'})
monetary_df.columns = ['CustomerID','Monetary']
monetary_df.head(5)

# Create RFM Table


In [None]:
rfm_df = recency_df.merge(frequency_df,on='CustomerID').merge(monetary_df,on='CustomerID')
rfm_df.set_index('CustomerID',inplace=True)
rfm_df.head(5)

In [None]:
features = rfm_df.columns
rfm_df.shape

# RFM Table Visualisation

Now we will look at the correlation between the Recency, Frequency and Monetary parts of the RFM table, which will be an integral part of customer segmentation.

In [None]:
print(rfm_df.corr())
sns.heatmap(rfm_df.corr(),cmap="YlGnBu",annot=True)

In [None]:
sns.pairplot(rfm_df, diag_kind="hist")

In [None]:
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer()
rfm_df = pd.DataFrame(pt.fit_transform(rfm_df))
rfm_df.columns = features
rfm_df.head()
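
The power transform above is applied without comment. The sketch below illustrates what it does to a strongly right-skewed feature, using synthetic log-normal values as a stand-in for a monetary column (the data are made up). Passing `columns` and `index` to the new frame keeps the original labels, which a bare `pd.DataFrame(...)` call does not.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Synthetic, right-skewed "Monetary"-like values (log-normal); not real data.
toy = pd.DataFrame({"Monetary": rng.lognormal(mean=5.0, sigma=1.0, size=2000)})

pt = PowerTransformer()  # Yeo-Johnson by default, output is standardised
transformed = pd.DataFrame(
    pt.fit_transform(toy), columns=toy.columns, index=toy.index
)

skew_before = toy["Monetary"].skew()
skew_after = transformed["Monetary"].skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skewness close to zero after the transform indicates a more symmetric distribution, which tends to suit distance-based methods such as K-means better.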

# Conclusion
To gain even further insight into customer behavior, we can dig deeper in the relationship between RFM variables.

The **RFM model** can be used in conjunction with methods such as **K-means clustering, Logistic Regression and Recommendation Engines** to produce more informative results on customer behavior.

We will go for **K-means since it has been widely used for Market Segmentation** and it offers the advantage of being simple to implement.

# PCA

Applying PCA to reduce the number of dimensions and remove the correlation between the Frequency and Monetary features.

In [None]:
sc = StandardScaler()
rfm_scaled = sc.fit_transform(rfm_df)
rfm_scaled[:5]

In [None]:
from sklearn.decomposition import PCA
pca = PCA()
pca_transformed_data = pca.fit_transform(rfm_scaled)

In [None]:
pca.components_

In [None]:
pca.explained_variance_

In [None]:
var_exp = pca.explained_variance_ratio_
var_exp

In [None]:
np.cumsum(var_exp)

In [None]:
pca.mean_

In [None]:
pca.n_features_in_   # n_features_ was deprecated in scikit-learn 1.2 and later removed

In [None]:
plt.figure(figsize=(6,4))
plt.bar(range(3), var_exp, alpha=0.5, align='center', label='Individual explained variance')
plt.step(range(3), np.cumsum(var_exp), where='mid', label='Cumulative explained variance')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
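
The cumulative explained variance can also be used to pick the number of components programmatically. The sketch below uses synthetic data in which two of three features are strongly correlated, mimicking Frequency and Monetary. The 0.95 threshold and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
recency = rng.normal(size=n)
frequency = rng.normal(size=n)
# Synthetic "monetary" feature, strongly correlated with frequency.
monetary = 0.9 * frequency + 0.1 * rng.normal(size=n)

X = StandardScaler().fit_transform(np.column_stack([recency, frequency, monetary]))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.95
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
print("cumulative explained variance:", cumulative.round(3))
print("components to keep:", n_keep)
```

Because two of the three features carry nearly the same information, two components are enough to pass the threshold in this toy setting.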

In [None]:
X = rfm_df.copy()
pca = PCA(n_components = 2)
df_pca = pca.fit_transform(X)

df_pca = pd.DataFrame(df_pca)
df_pca.head(5)

# K-Means Clustering 


In [None]:
X = df_pca.copy()

In [None]:
from sklearn.cluster import KMeans 

cluster_range = range(1, 15)
cluster_errors = []
cluster_sil_scores = []

for num in cluster_range: 
    clusters = KMeans(num, n_init = 100,init='k-means++',random_state=0)
    clusters.fit(X)
    labels = clusters.labels_                     # capture the cluster labels
    centroids = clusters.cluster_centers_         # capture the centroids
    cluster_errors.append( clusters.inertia_ )    # capture the inertia
clusters_df = pd.DataFrame({ "num_clusters":cluster_range, "cluster_errors": cluster_errors} )
clusters_df[0:10]

In [None]:
plt.figure(figsize=(15,6))
plt.plot(clusters_df["num_clusters"],clusters_df["cluster_errors"],marker = 'o')
plt.xlabel('count of clusters')
plt.ylabel('error')

In [None]:
for num in range(2,16):
    clusters = KMeans(n_clusters=num,random_state=0)
    labels = clusters.fit_predict(df_pca)
    
    sil_avg = silhouette_score(df_pca, labels)
    print('For',num,'The Silhouette Score is =',sil_avg)

# Inferences:

We observe a sharp bend in the elbow plot at 2 clusters.
The silhouette score is also the highest for 2 clusters.

But there is also a significant reduction in cluster error as the number of clusters increases from 2 to 4; after 4, the reduction is small.

So, we will choose n_clusters = 4 to properly segment our customers.
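
The two criteria discussed above can be collected side by side in one loop. The sketch below does this on synthetic, well-separated blobs, where inertia and silhouette point to the same number of clusters. On real data, as in this notebook, they can disagree and a judgment call is needed. The blob centers and parameter values are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated synthetic groups (not the RFM data).
centers = [[-10, -10], [-10, 10], [10, -10], [10, 10]]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=1.0, random_state=0)

inertias, silhouettes = {}, {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_                        # within-cluster sum of squares
    silhouettes[k] = silhouette_score(X, km.labels_)  # mean silhouette coefficient

best_k = max(silhouettes, key=silhouettes.get)
for k in inertias:
    print(k, round(inertias[k], 1), round(silhouettes[k], 3))
print("best k by silhouette:", best_k)
```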

In [None]:
kmeans = KMeans(n_clusters = 4, random_state = 0)   # fixed seed so the cluster labels are reproducible
kmeans = kmeans.fit(df_pca)
labels = kmeans.predict(df_pca)
centroids = kmeans.cluster_centers_

print(labels)
print()
print('Cluster Centers')
print(centroids)

In [None]:
df_pca['Clusters'] = labels
df_pca.head()

In [None]:
df_pca['Clusters'].value_counts()

In [None]:
df_pca.boxplot(by = 'Clusters',figsize=(15,6))
plt.show()

In [None]:

sns.pairplot(df_pca,diag_kind='hist',hue='Clusters')

Now we will use classification models to predict the cluster of each customer.

In [None]:
df_pca.head(5)

In [None]:
X = df_pca[[0,1]]
Y = df_pca['Clusters']

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42, stratify=Y)
lr = LogisticRegression(max_iter=1000,random_state=0)
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
print('Test accuracy = ', accuracy_score(y_test, y_pred))

In [None]:
knn = KNeighborsClassifier(n_neighbors = 4) 
  
knn.fit(X_train, y_train) 
pred = knn.predict(X_test) 
  
# Predictions and Evaluations 
# Let's evaluate our KNN model !  
from sklearn.metrics import classification_report, confusion_matrix 
print(confusion_matrix(y_test, pred)) 
  
print(classification_report(y_test, pred)) 

In [None]:
from sklearn.tree import DecisionTreeClassifier
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42, stratify=Y)
# The target is a cluster label, so a classifier (not a regressor) is the appropriate tree model
classifier = DecisionTreeClassifier(random_state = 213)
classifier.fit(X_train,y_train)

pred = classifier.predict(X_test) 
print(confusion_matrix(y_test, pred)) 
  
print(classification_report(y_test, pred)) 


# Conclusion
We saw that classification models such as Logistic Regression, KNeighborsClassifier and DecisionTree can predict the customer clusters, using the PCA-transformed RFM features as independent variables and Cluster as the target variable. The clusters predicted by the classification models align closely with the K-Means clustering. So, we can conclude that our clusters are consistent and well separated.
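
To make this check less dependent on a single train/test split, the agreement can be estimated with cross-validation. The sketch below does so on synthetic blobs clustered with K-means. The data, the number of folds and the parameter values are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic two-dimensional data standing in for the PCA components.
centers = [[0, 0], [6, 0], [0, 6], [6, 6]]
X, _ = make_blobs(n_samples=600, centers=centers, cluster_std=1.5, random_state=0)

# Cluster labels play the role of the classification target.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=cv)
print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

Note that K-means separates clusters with linear boundaries, so a high accuracy here mainly shows that the labels can be recovered from the features. It does not by itself say whether 4 is the right number of segments.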

# Summary

The work described in this notebook is based on a database providing details on purchases made on an e-commerce platform over a period of one year. Each entry in the dataset describes the purchase of a product by a particular customer at a given date. In total, **∼4000** clients appear in the database. Given the available information, I decided to develop a classifier that anticipates the type of purchase a customer will make, as well as the number of visits they will make during a year, starting from their first visit to the e-commerce site.

The next part of the analysis consisted of some basic **data visualization**, done to find out which country used the e-commerce website the most. I used basic plots to show the results of my analysis. I also analysed other important factors, such as the gross purchase amount per country and which product descriptions appeared most often.

The final part of the analysis was the customer segmentation. The main approach to this process is to use the **RFM (Recency, Frequency, Monetary) table** to sort the customers into groups. After creating the RFM table, I used **K-Means clustering (elbow curve and silhouette scores)** to create 4 clusters into which the customers are segmented. After each customer was assigned to a group, I used models such as **Logistic Regression, KNeighborsClassifier and DecisionTree** to cross-check the clustering, which resulted in an accuracy score of **0.98**. Hence, I conclude that the customer segmentation was done with effective methods and high accuracy.