# <center>Online Shoppers Intention Dataset</center>

<a id="top"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:Blue; border:0' role="tab" aria-controls="home"><center>Quick navigation</center></h3>

* [1. Introduction](#1)
* [2. Data Reading and Analysis](#2)
* [3. Data Processing and Cleansing](#3) 
* [4. Model Training](#5) <br>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0' role="tab" aria-controls="home"><center>Introduction</center><a id=1></a></h3>

> This dataframe contains 8 variables that correspond to:

**InvoiceNo:** Invoice number. Nominal, a 6-digit integral number uniquely assigned to each 
transaction. If this code starts with letter 'c', it indicates a cancellation.

**StockCode:** Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.

**Description:** Product (item) name. Nominal.

**Quantity:** The quantities of each product (item) per transaction. Numeric.

**InvoiceDate:** Invoice date and time. Numeric, the day and time when each transaction was generated.

**UnitPrice:** Unit price. Numeric, Product price per unit in sterling.

**CustomerID:** Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.

**Country:** Country name. Nominal, the name of the country where each customer resides.
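The variable list above notes that an `InvoiceNo` starting with 'c' marks a cancellation. A minimal sketch of how such rows could be separated from regular purchases is shown here; the toy rows and the names `toy`, `is_cancelled`, `cancellations` and `purchases` are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Hypothetical toy rows; the real notebook loads these columns from data.csv.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536381", "c536383"],
    "Quantity": [6, -1, 2, -3],
})

# Treat InvoiceNo as text and flag codes that start with 'C' or 'c'.
is_cancelled = toy["InvoiceNo"].astype(str).str.upper().str.startswith("C")

cancellations = toy[is_cancelled]
purchases = toy[~is_cancelled]
print(purchases)
```

Comparing case-insensitively is a cautious choice here, since the description uses a lowercase 'c' while the codes in the file may be uppercase.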

In [None]:
#importing all important libraries 

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import confusion_matrix,accuracy_score,roc_curve,classification_report
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
import time
import datetime as dt

from sklearn.metrics import silhouette_samples,silhouette_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from collections import Counter
import pandas_profiling as pp
# data preprocessing
from sklearn.preprocessing import StandardScaler


<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0' role="tab" aria-controls="home"><center>Data Reading and Analysis</center><a id=2></a></h3>

In [None]:
data = pd.read_csv('/kaggle/input/ecommerce-data/data.csv',encoding="ISO-8859-1")
data

In [None]:
print(data.shape)

In [None]:
# checking null values
data.isnull().sum()

In [None]:
# removing the rows that contain null values
data.dropna(axis = 0, inplace = True)
print(data.shape)

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0' role="tab" aria-controls="home"><center>Data Processing and Cleansing</center><a id=3></a></h3>

We can now compare the shape of the data before and after dropping the null values.

In [None]:
sns.heatmap(data.isnull())

After dropping the rows with null values, no missing values remain, as the heatmap confirms. Had we kept those rows, we would have had to fill the gaps with the median, mean or mode. I tend to use the median, but in this scenario there is nothing left to fill, which makes our job easier.
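For reference, filling missing values with the median (numeric columns) or the mode (categorical columns) could look roughly like the sketch below. The toy frame and the names `toy` and `filled` are assumptions for illustration; the notebook itself drops the rows instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "UnitPrice": [2.5, np.nan, 4.0, 3.0],
    "Country": ["UK", "UK", None, "France"],
})

filled = toy.copy()
# Numeric column: replace missing values with the column median.
filled["UnitPrice"] = filled["UnitPrice"].fillna(filled["UnitPrice"].median())
# Categorical column: replace missing values with the most frequent value.
filled["Country"] = filled["Country"].fillna(filled["Country"].mode().iloc[0])

print(filled.isnull().sum())
```

The median is less sensitive to extreme prices than the mean, which is one reason to prefer it for skewed columns.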

In [None]:
pp.ProfileReport(data)

In [None]:
data = data[data['Quantity']>=0]
data = data[data['UnitPrice']>=0]

print(data.shape)

In [None]:
data.describe()

In [None]:
data['TotalQuantity'] = data['Quantity']*data['UnitPrice']
data.head()

Now we will look at the countries from which most of the orders were placed.


In [None]:
print(data[['InvoiceNo','Country']].groupby('Country').count().sort_values("InvoiceNo",ascending = False))

In [None]:
# Top 10 countries by number of rows in the cleaned-up data
data['Country'].value_counts().head(10).plot(kind='bar')

In [None]:
data[data['TotalQuantity']==data['TotalQuantity'].max()]

In [None]:
items = data['Description'].value_counts().head()
print(items)

In [None]:
print(data[['InvoiceNo','Country','CustomerID','TotalQuantity']].sort_values('TotalQuantity',ascending = False).head(15))

# **RFM Analysis**

**RFM (Recency, Frequency, Monetary)** analysis is a customer segmentation technique that uses past purchase behaviour to divide customers into groups.
RFM helps divide customers into various categories or clusters to identify customers who are more likely to respond to promotions and also for future personalization services.

**RECENCY (R)**: Days since last purchase

**FREQUENCY (F)**: Total number of purchases

**MONETARY VALUE (M)**: Total money this customer spent.
We will create those 3 customer attributes for each customer.
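As a compact illustration of the three definitions above, the sketch below builds an RFM table for a tiny made-up dataset in a single `groupby`. The column `Amount`, the reference date `snapshot` and all values are assumptions for illustration; the notebook builds the same three attributes step by step in the following sections.

```python
import pandas as pd

# Hypothetical transactions: customer 1 has two invoices, customer 2 has one.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1", "B1"],
    "InvoiceDate": pd.to_datetime(
        ["2011-11-01", "2011-11-01", "2011-12-05", "2011-10-10", "2011-10-10"]
    ),
    "Amount": [10.0, 5.0, 20.0, 7.5, 2.5],
})

# Assumed reference date from which recency is measured.
snapshot = pd.Timestamp("2011-12-10")

rfm = toy.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda s: (snapshot - s.max()).days),  # days since last purchase
    Frequency=("InvoiceNo", "nunique"),                            # number of distinct invoices
    Monetary=("Amount", "sum"),                                    # total amount spent
)
print(rfm)
```

Counting distinct invoice numbers, rather than rows, keeps an invoice with several line items from being counted as several purchases.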

# **Recency**

To calculate recency, we need to choose a reference date from which we count how many days have passed since each customer's last purchase.

In [None]:
data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'])
data['Date'] = data['InvoiceDate'].apply(lambda x: x.date())
data.head()

In [None]:
data['Month']=data['InvoiceDate'].apply(lambda x:x.month)
data['Year']=data['InvoiceDate'].apply(lambda x:x.year)
data=data.sort_values(by=['Year','Month'])

mmap={1:'Jan11',2:'Feb11',3:'Mar11',4:'Apr11', 5:'May11', 6:'Jun11', 7:'Jul11',8:'Aug11',9:'Sep11',10:'Oct11',11:'Nov11',12:'Dec11'}
data['Month_name']=data['Month'].map(mmap)


In [None]:
def my(x):
    # x is a row holding the mapped month name and the year
    Month=x['Month_name']
    Year=x['Year']
    
    # all 2010 rows are treated as December 2010
    if Year==2010:
        return 'Dec10'
    return Month

In [None]:
data['Month_name']=data[['Month_name','Year']].apply(my, axis=1)

In [None]:
data.head()

## Total Transactions per Month

In [None]:
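# sum only the numeric columns; text and date columns cannot be summed meaningfully
monthly=data.groupby(['Year','Month','Month_name']).sum(numeric_only=True)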
monthly.head()

In [None]:
#recency dataframe
recency_df = data.groupby(by='CustomerID', as_index=False)['Date'].max()
recency_df.columns = ['CustomerID','LastPurchaseDate']
recency_df.head(5)


In [None]:
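# reference date for measuring recency
# (note: the invoices date from 2010-2011, so the recencies will be on the order of ten years)
current = dt.date(2021,12,1)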
print(current)


In [None]:
recency_df['Recency'] = recency_df['LastPurchaseDate'].apply(lambda x: (current - x).days)
recency_df.drop('LastPurchaseDate',axis = 1,inplace=True)
recency_df.head(5)

# Frequency
Frequency helps us to know how many times a customer purchased from us. To do that we need to check how many invoices are registered by the same customer.

In [None]:
temp = data.copy()
temp.drop_duplicates(['InvoiceNo','CustomerID'],keep='first',inplace=True)
frequency_df = temp.groupby(by=['CustomerID'], as_index=False)['InvoiceNo'].count()
frequency_df.columns = ['CustomerID','Frequency']
frequency_df.head()

# Monetary
The monetary attribute answers the question: how much money did the customer spend over time?

To do that, we sum the `TotalQuantity` column (quantity × unit price, created earlier) for each customer.

In [None]:
monetary_df = data.groupby(by = 'CustomerID',as_index=False).agg({'TotalQuantity':'sum'})
monetary_df.columns = ['CustomerID','TotalQuantity']
monetary_df.head(5)

# Create RFM Table


In [None]:
rfm_df = recency_df.merge(frequency_df,on='CustomerID').merge(monetary_df,on='CustomerID')
rfm_df.set_index('CustomerID',inplace=True)
rfm_df.head(5)

In [None]:
features = rfm_df.columns
rfm_df.shape

# RFM Table Visualisation

Now we will look at the correlation between the Recency, Frequency and Monetary columns of the RFM table, which will be an integral part of customer segmentation.

In [None]:
print(rfm_df.corr())
sns.heatmap(rfm_df.corr(),cmap="YlGnBu",annot=True)

In [None]:
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer()
rfm_df = pd.DataFrame(pt.fit_transform(rfm_df))
rfm_df.columns = features
rfm_df.head()

# Conclusion
To gain further insight into customer behavior, we can dig deeper into the relationship between the RFM variables.

The **RFM model** can be used in conjunction with predictive models such as **K-means clustering, Logistic Regression and Recommendation Engines** to produce more informative results on customer behavior.

We will go for **K-means since it has been widely used for Market Segmentation** and it offers the advantage of being simple to implement.

# PCA

We apply PCA to reduce the number of dimensions and to remove the correlation between the Frequency and Monetary features.
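To see why PCA helps with correlated features, the sketch below runs it on synthetic data in which two of three columns are strongly correlated. The data and the names `X_demo`, `pca_demo`, `components` and `corr` are made up for illustration and are not the notebook's RFM table.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins: 'monetary' is built to be strongly correlated with 'frequency'.
frequency = rng.normal(size=500)
monetary = frequency + 0.1 * rng.normal(size=500)
recency = rng.normal(size=500)
X_demo = np.column_stack([recency, frequency, monetary])

# Standardize, then keep the two leading principal components.
X_scaled = StandardScaler().fit_transform(X_demo)
pca_demo = PCA(n_components=2)
components = pca_demo.fit_transform(X_scaled)

# The component scores are uncorrelated with each other by construction.
corr = np.corrcoef(components, rowvar=False)
print(pca_demo.explained_variance_ratio_)
print(corr)
```

Because two of the three inputs carry almost the same information, two components retain nearly all of the variance while the off-diagonal correlation between them is effectively zero.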

In [None]:
sc = StandardScaler()
rfm_scaled = sc.fit_transform(rfm_df)
rfm_scaled

In [None]:
from sklearn.decomposition import PCA
pca = PCA()
pca_transformed_data = pca.fit_transform(rfm_scaled)

In [None]:
pca.components_

In [None]:
pca.explained_variance_

In [None]:
var_exp = pca.explained_variance_ratio_
var_exp


<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:orange; border:0' role="tab" aria-controls="home"><center>Model Training</center><a id=5></a></h3>

In [None]:
X = rfm_df.copy()
pca = PCA(n_components = 2)
df_pca = pca.fit_transform(X)

df_pca = pd.DataFrame(df_pca)
df_pca.head(5)

# K-Means Clustering 


In [None]:
X = df_pca.copy()

In [None]:
from sklearn.cluster import KMeans 

cluster_range = range(1, 15)
cluster_errors = []

for num in cluster_range: 
    clusters = KMeans(n_clusters=num, n_init = 100,init='k-means++',random_state=0)
    clusters.fit(X)
    # capture the inertia (within-cluster sum of squared distances) for the elbow plot
    cluster_errors.append( clusters.inertia_ )    
clusters_df = pd.DataFrame({ "num_clusters":cluster_range, "cluster_errors": cluster_errors} )
clusters_df[0:10]

In [None]:
plt.figure(figsize=(15,6))
plt.plot(clusters_df["num_clusters"],clusters_df["cluster_errors"],marker = 'o')
plt.xlabel('count of clusters')
plt.ylabel('error')

In [None]:
for num in range(2,16):
    clusters = KMeans(n_clusters=num,random_state=0)
    labels = clusters.fit_predict(df_pca)
    
    sil_avg = silhouette_score(df_pca, labels)
    print('For',num,'The Silhouette Score is =',sil_avg)

# Inferences:

The elbow plot shows a sharp bend at 2 clusters, and the silhouette score is also highest for 2 clusters.

However, the cluster error still drops significantly as the number of clusters increases from 2 to 4; beyond 4, the reduction is small.

So, we will choose n_clusters = 4 to properly segment our customers.
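The two criteria used here, inertia for the elbow plot and the average silhouette score, can be compared side by side. The sketch below does so on synthetic, well-separated blobs; the data and the names `X_demo`, `inertias`, `sil` and `best_k` are assumptions for illustration, and the notebook's real data need not behave as cleanly.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D data with four well-separated groups.
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [-5, 5], [5, -5], [5, 5]],
    cluster_std=0.8,
    random_state=0,
)

inertias = {}
sil = {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    inertias[k] = model.inertia_                      # within-cluster sum of squares
    sil[k] = silhouette_score(X_demo, model.labels_)  # mean silhouette over all points

# The silhouette criterion picks the k with the highest average score.
best_k = max(sil, key=sil.get)
print(best_k)
```

When the two criteria disagree, as they do in this notebook, the choice of k becomes a judgment call that also depends on how many segments are useful in practice.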

In [None]:
kmeans = KMeans(n_clusters = 4, random_state=0)  # 4 clusters, as chosen above
kmeans = kmeans.fit(df_pca)
labels = kmeans.predict(df_pca)
centroids = kmeans.cluster_centers_

print(labels)
print()
print('Cluster Centers')
print(centroids)

In [None]:
df_pca['Clusters'] = labels
df_pca.head()

In [None]:
df_pca['Clusters'].value_counts()

Now we will train classification models to predict the cluster label of each customer.

In [None]:
df_pca.head(5)

In [None]:
X = df_pca[[0,1]]
Y = df_pca['Clusters']

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42, stratify=Y)

# instantiate the Gaussian Naive Bayes classifier (fitted in a later cell)
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()


# train a decision tree on the training set and evaluate it on the test set
# (named dt_clf so that it does not shadow the datetime alias `dt`)
m6 = 'DecisionTreeClassifier'
dt_clf = DecisionTreeClassifier(criterion = 'entropy',random_state=0,max_depth = 6)
dt_clf.fit(X_train, y_train)
dt_predicted = dt_clf.predict(X_test)
dt_conf_matrix = confusion_matrix(y_test, dt_predicted)
dt_acc_score = accuracy_score(y_test, dt_predicted)
print("confusion matrix")
print(dt_conf_matrix)
print("\n")
print("Accuracy of DecisionTreeClassifier:",dt_acc_score*100,'\n')
print(classification_report(y_test,dt_predicted))

In [None]:
# Print the confusion matrix and summarise it per cluster

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, dt_predicted)

print('Confusion matrix\n\n', cm)

# With more than two classes, TP/TN/FP/FN are defined per class,
# so report the correct (diagonal) and incorrect counts for each cluster instead.
print('\nCorrectly classified per cluster = ', np.diag(cm))

print('\nMisclassified per cluster = ', cm.sum(axis=1) - np.diag(cm))
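With more than two clusters, true and false positives and negatives are defined separately for each class (one class versus the rest), not by four fixed cells of the matrix. A minimal sketch of that per-class breakdown follows; the labels `y_true` and `y_pred` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for a three-class problem.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

tp = np.diag(cm)                # correctly predicted as class i
fp = cm.sum(axis=0) - tp        # predicted as class i, but actually another class
fn = cm.sum(axis=1) - tp        # actually class i, but predicted as another class
tn = cm.sum() - (tp + fp + fn)  # everything not involving class i

for i in range(len(tp)):
    print(f"class {i}: TP={tp[i]} FP={fp[i]} FN={fn[i]} TN={tn[i]}")
```

The per-class precision and recall printed by `classification_report` are derived from exactly these counts.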

## KNN

In [None]:
m5 = 'K-NeighborsClassifier'
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)
knn_predicted = knn.predict(X_test)
knn_conf_matrix = confusion_matrix(y_test, knn_predicted)
knn_acc_score = accuracy_score(y_test, knn_predicted)
print("confusion matrix")
print(knn_conf_matrix)
print("\n")
print("Accuracy of K-NeighborsClassifier:",knn_acc_score*100,'\n')
print(classification_report(y_test,knn_predicted))

In [None]:
# Print the confusion matrix and summarise it per cluster

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, knn_predicted)

print('Confusion matrix\n\n', cm)

# With more than two classes, TP/TN/FP/FN are defined per class,
# so report the correct (diagonal) and incorrect counts for each cluster instead.
print('\nCorrectly classified per cluster = ', np.diag(cm))

print('\nMisclassified per cluster = ', cm.sum(axis=1) - np.diag(cm))






## Gaussian Naive Bayes Classifier

In [None]:
m2 = 'Naive Bayes'
gnb.fit(X_train,y_train)
nbpred = gnb.predict(X_test)
nb_conf_matrix = confusion_matrix(y_test, nbpred)
nb_acc_score = accuracy_score(y_test, nbpred)
print("confusion matrix")
print(nb_conf_matrix)
print("\n")
print("Accuracy of Naive Bayes model:",nb_acc_score*100,'\n')
print(classification_report(y_test,nbpred))

In [None]:
# Print the confusion matrix and summarise it per cluster

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test,nbpred)

print('Confusion matrix\n\n', cm)

# With more than two classes, TP/TN/FP/FN are defined per class,
# so report the correct (diagonal) and incorrect counts for each cluster instead.
print('\nCorrectly classified per cluster = ', np.diag(cm))

print('\nMisclassified per cluster = ', cm.sum(axis=1) - np.diag(cm))


# Conclusion
We used classification models (Decision Tree, K-Nearest Neighbors and Gaussian Naive Bayes) to predict each customer's cluster, with the two PCA components of the RFM table as independent variables and the K-Means cluster label as the target variable. The clusters predicted by the classifiers align closely with the K-Means assignments, which suggests that the clusters are well separated and consistent.
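The agreement between a classifier and the K-Means labels can also be estimated with cross-validation rather than a single train/test split. The sketch below is a hedged illustration on synthetic blobs; `X_demo`, `cluster_labels` and `scores` are hypothetical names, and the choice of 5 folds is an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 2-D data standing in for the PCA-transformed RFM table.
X_demo, _ = make_blobs(
    n_samples=400,
    centers=[[-5, -5], [-5, 5], [5, -5], [5, 5]],
    cluster_std=0.8,
    random_state=0,
)

# Cluster first, then treat the cluster labels as the classification target.
cluster_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_demo)

# 5-fold cross-validated accuracy of a KNN classifier on those labels.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=10), X_demo, cluster_labels, cv=5)
print(scores.mean())
```

High accuracy here shows that the cluster boundaries are easy to learn from the same features; it does not by itself show that the segmentation is meaningful for the business.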

# Summary

The work described in this notebook is based on a database providing details on purchases made on an e-commerce platform over a period of one year. Each entry in the dataset describes the purchase of a product by a particular customer on a given date. In total, approximately **4000** customers appear in the database. Given the available information, I decided to develop a classifier that anticipates the type of purchase a customer will make, as well as the number of visits they will make during a year, starting from their first visit to the e-commerce site.

The next part of the analysis consisted of some basic **data visualization**. This was done to find out which country used the e-commerce website the most. I used basic plots to show the results of my analysis. I also analysed other important factors, such as the gross purchase amount by country and which product descriptions appeared most often.

The final part of the analysis was the customer segmentation. The main approach to this process is to use the **RFM (Recency, Frequency, Monetary) table** to sort the customers into groups. After creating the RFM table, I used **K-Means clustering (elbow curve and silhouette scores)** to create 4 clusters into which the customers were segmented. Once each customer had been assigned to a group, I used models such as **DecisionTree, KNeighborsClassifier and Gaussian Naive Bayes** to cross-check the clustering, which resulted in an accuracy score of **0.98**. Hence, I conclude that the customer segmentation was done with effective methods and high accuracy.