# Online Shoppers Intent

The online shoppers intent dataset has been sourced from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset#). It contains features that describe shoppers' actions on an online shop and whether or not any revenue was generated from a given customer.

In this notebook we explore this data, use clustering techniques to discover groups of prospective customers, and then apply classification techniques to predict whether or not a customer is going to make a purchase during their visit to the website.

## Library and Data Import

In [76]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

In [2]:
df = pd.read_csv("./online_shoppers_intention.csv")

In [3]:
df.columns

Index(['Administrative', 'Administrative_Duration', 'Informational',
       'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration',
       'BounceRates', 'ExitRates', 'PageValues', 'SpecialDay', 'Month',
       'OperatingSystems', 'Browser', 'Region', 'TrafficType', 'VisitorType',
       'Weekend', 'Revenue'],
      dtype='object')

In [4]:
df.head()

,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
0,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,1,1,1,1,Returning_Visitor,False,False
1,0,0.0,0,0.0,2,64.0,0.0,0.1,0.0,0.0,Feb,2,2,1,2,Returning_Visitor,False,False
2,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,4,1,9,3,Returning_Visitor,False,False
3,0,0.0,0,0.0,2,2.666667,0.05,0.14,0.0,0.0,Feb,3,2,2,4,Returning_Visitor,False,False
4,0,0.0,0,0.0,10,627.5,0.02,0.05,0.0,0.0,Feb,3,3,1,4,Returning_Visitor,True,False


In [5]:
for col in df.columns:
    print(f"{col} | {df[col].nunique()}")

Administrative | 27
Administrative_Duration | 3335
Informational | 17
Informational_Duration | 1258
ProductRelated | 311
ProductRelated_Duration | 9551
BounceRates | 1872
ExitRates | 4777
PageValues | 2704
SpecialDay | 6
Month | 10
OperatingSystems | 8
Browser | 13
Region | 9
TrafficType | 20
VisitorType | 3
Weekend | 2
Revenue | 2


In [6]:
df.dtypes

Administrative               int64
Administrative_Duration    float64
Informational                int64
Informational_Duration     float64
ProductRelated               int64
ProductRelated_Duration    float64
BounceRates                float64
ExitRates                  float64
PageValues                 float64
SpecialDay                 float64
Month                       object
OperatingSystems             int64
Browser                      int64
Region                       int64
TrafficType                  int64
VisitorType                 object
Weekend                       bool
Revenue                       bool
dtype: object

In [7]:
dtypes = {
    "Month": "category",
    "OperatingSystems": "category",
    "Browser": "category",
    "Region": "category",
    "TrafficType": "category",
    "VisitorType": "category",
    "Weekend": "category",
    "Revenue": "category"
}

In [8]:
for col in df.columns:
    if col in dtypes.keys():
        df[col] = df[col].astype(dtypes[col])

## Exploratory Data Analysis

Let's have a look at the shape of the data.

In [9]:
df.shape

(12330, 18)

Next up, how clean is this data set?

In [10]:
df.isnull().sum()

Administrative             0
Administrative_Duration    0
Informational              0
Informational_Duration     0
ProductRelated             0
ProductRelated_Duration    0
BounceRates                0
ExitRates                  0
PageValues                 0
SpecialDay                 0
Month                      0
OperatingSystems           0
Browser                    0
Region                     0
TrafficType                0
VisitorType                0
Weekend                    0
Revenue                    0
dtype: int64

Now let's see how balanced this data is.

In [11]:
df["Revenue"].value_counts(normalize=True)

False    0.845255
True     0.154745
Name: Revenue, dtype: float64

Only about 15% of the sessions belong to the positive class, so this is a moderately imbalanced data set. Imbalance can be a problem for machine learning algorithms: with too little signal from the minority class, a model tends to favour the majority class, in this case negative (no revenue), and predicts the minority class poorly. Later in this notebook we use an upsampling technique to balance the training data.
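To make the effect of the imbalance concrete, the sketch below computes the accuracy of a trivial model that always predicts the majority class. The labels are synthetic, chosen only to reproduce the proportions printed above, and the helper name `majority_baseline_accuracy` is a hypothetical one introduced for illustration.

```python
import pandas as pd


def majority_baseline_accuracy(labels: pd.Series) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    return labels.value_counts(normalize=True).max()


# Synthetic stand-in for the Revenue column (not the real data):
# 1908 positives out of 12330 rows matches the proportions shown above.
labels = pd.Series([True] * 1908 + [False] * 10422)
baseline = majority_baseline_accuracy(labels)
print(f"Majority-class baseline accuracy: {baseline:.3f}")
```

Any model evaluated on accuracy alone has to beat this baseline of roughly 0.845 before it can be said to have learned anything about buyers.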

In the cell below we are creating pairplots for each of the numerical features to gauge some of the relationships that exist within the data set.

In [12]:
# sns.pairplot(df)
# plt.show()

We can see straight away that most of the numerical features are highly positively skewed.
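Skew can also be quantified rather than read off a plot. The sketch below uses `Series.skew()` on a synthetic right-skewed column (an assumption standing in for a duration feature, not the real data) and shows how a `log1p` transform changes it. The transform is one common option and is not applied elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed column standing in for a feature such as
# ProductRelated_Duration (illustrative only).
demo = pd.DataFrame({"duration": rng.exponential(scale=100.0, size=5000)})

raw_skew = demo["duration"].skew()
log_skew = np.log1p(demo["duration"]).skew()

# A positive value indicates a long right tail; log1p compresses that tail.
print(f"skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

On the real data, `df[numeric_cols].skew()` would give the same statistic per feature.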

# Preprocessing

## Scaling

In [13]:
scaler = StandardScaler()

In [14]:
cols = df.columns

In [15]:
numeric_cols = [col for col in df.columns if df[col].dtype.name in ["int64", "float64"]]
cat_cols = [col for col in df.columns if col not in numeric_cols]

In [16]:
df_scaled = pd.DataFrame(scaler.fit_transform(df[numeric_cols]), columns=numeric_cols)

In [17]:
df = pd.concat([df_scaled, df[cat_cols]], axis=1)
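Note that the scaler above is fitted on the full data set, before the data is split into training and test sets, so the test rows influence the mean and variance used for scaling. A common alternative, sketched below on synthetic data, is to fit the scaler on the training split only and reuse it for the test split. The names `X_demo`, `X_tr`, `X_te` and `demo_scaler` are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.exponential(scale=50.0, size=(1000, 3))  # synthetic numeric features

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)  # statistics learned from training rows only
X_te_scaled = demo_scaler.transform(X_te)      # same statistics reused, no refitting

print(X_tr_scaled.mean(axis=0).round(3))  # approximately 0 by construction
print(X_te_scaled.mean(axis=0).round(3))  # close to 0, but generally not exactly 0
```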

## One Hot Encoding

In [18]:
df = pd.get_dummies(df)

In [19]:
df = df.drop(columns=["Revenue_False", "Weekend_False"]).rename(columns={"Revenue_True": "Revenue", "Weekend_True": "Weekend"})

In [20]:
df.head()

,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,...,TrafficType_16,TrafficType_17,TrafficType_18,TrafficType_19,TrafficType_20,VisitorType_New_Visitor,VisitorType_Other,VisitorType_Returning_Visitor,Weekend,Revenue
0,-0.696993,-0.457191,-0.396478,-0.244931,-0.691003,-0.624348,3.667189,3.229316,-0.317178,-0.308821,...,0,0,0,0,0,0,0,1,0,0
1,-0.696993,-0.457191,-0.396478,-0.244931,-0.668518,-0.590903,-0.457683,1.171473,-0.317178,-0.308821,...,0,0,0,0,0,0,0,1,0,0
2,-0.696993,-0.457191,-0.396478,-0.244931,-0.691003,-0.624348,3.667189,3.229316,-0.317178,-0.308821,...,0,0,0,0,0,0,0,1,0,0
3,-0.696993,-0.457191,-0.396478,-0.244931,-0.668518,-0.622954,0.573535,1.99461,-0.317178,-0.308821,...,0,0,0,0,0,0,0,1,0,0
4,-0.696993,-0.457191,-0.396478,-0.244931,-0.488636,-0.29643,-0.045196,0.142551,-0.317178,-0.308821,...,0,0,0,0,0,0,0,1,1,0


In [21]:
X = df.drop(columns=["Revenue"])
y = df["Revenue"].astype("int64")

## Train Test Split

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Clustering

### KMeans

In [23]:
# squared_distances = list()
# num_clu_range = range(2,50)
# for n_clusters in num_clu_range:
#     km = KMeans(n_clusters=n_clusters)
#     km = km.fit(X_train)
#     squared_distances.append(km.inertia_)

In [24]:
# plt.figure(figsize=(14,8))
# plt.plot(num_clu_range, squared_distances, 'ro-')
# plt.xlabel('Num Clusters')
# plt.ylabel('Sum of Squared Distances')
# plt.title('Elbow Plot For Optimal Num Clusters')
# plt.show()

The elbow of this plot is at around 10 clusters, so we'll use that value to generate a new `cluster_id` feature for the data set.
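Reading an elbow by eye is subjective. As a cross-check, a silhouette score can be computed for a few candidate cluster counts. This is a different technique from the elbow plot, and the sketch below runs it on synthetic blobs rather than on the shoppers data, so the numbers are purely illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data generated around 4 centres (an assumption for the demo).
X_demo, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    # Silhouette lies in [-1, 1]; higher means tighter, better separated clusters.
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print(scores)
print(f"Highest silhouette score at k={best_k}")
```

On the real data the same loop could be run over `X_train`, ideally on a sample, since the silhouette computation is quadratic in the number of rows.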

In [25]:
km = KMeans(n_clusters=10)
clusters = km.fit_predict(X_train)
# X_train comes from train_test_split and may be a slice of X; take an explicit
# copy before adding a column to avoid pandas' SettingWithCopyWarning.
X_train = X_train.copy()
X_train["cluster_id"] = clusters


In [26]:
X_train["cluster_id"].value_counts()

1    4143
5    1408
3    1364
4     735
2     672
0     536
9     412
8     318
6     176
7     100
Name: cluster_id, dtype: int64

### Agglomerative

In [80]:
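squared_distances = list()
num_clu_range = range(2,10,2)
for n_clusters in num_clu_range:
    print(f"Beginning clustering with n_clusters={n_clusters}")
    ag = AgglomerativeClustering(n_clusters=n_clusters)
    labels = ag.fit_predict(X_train)
    # AgglomerativeClustering has no inertia_ attribute, and distances_ is an
    # array of merge distances rather than a single score, so compute the
    # within-cluster sum of squared distances from the labels instead.
    centered = X_train - X_train.groupby(labels).transform("mean")
    squared_distances.append((centered ** 2).to_numpy().sum())
    print(f"Finished clustering with n_clusters={n_clusters}")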

Beginning clustering with n_clusters=2
Finished clustering with n_clusters=2
Beginning clustering with n_clusters=4
Finished clustering with n_clusters=4
Beginning clustering with n_clusters=6
Finished clustering with n_clusters=6
Beginning clustering with n_clusters=8
Finished clustering with n_clusters=8


In [None]:
plt.figure(figsize=(14,8))
plt.plot(num_clu_range, squared_distances, 'ro-')
plt.xlabel('Num Clusters')
plt.ylabel('Sum of Squared Distances')
plt.title('Elbow Plot For Optimal Num Clusters')
plt.show()

## Upsampling

In [27]:
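from imblearn.over_sampling import SMOTENC
# Treat every non-numeric column (one-hot dummies, Weekend, cluster_id) as categorical;
# indices [0, 1] would point at the scaled Administrative columns instead.
cat_idx = [i for i, col in enumerate(X_train.columns) if col not in numeric_cols]
oversample = SMOTENC(categorical_features=cat_idx, random_state=42)
X_train, y_train = oversample.fit_resample(X_train, y_train)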

# Modelling

In [35]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

In [58]:
params = {"max_depth": range(1, 100, 10), "min_samples_leaf": range(1, 100, 10)}

In [59]:
clf = GridSearchCV(DecisionTreeClassifier(), params, verbose=2)

In [60]:
clf.fit(X_train, y_train)

Fitting 5 folds for each of 100 candidates, totalling 500 fits
[CV] END ....................max_depth=1, min_samples_leaf=1; total time=   0.0s
[CV] END ....................max_depth=1, min_samples_leaf=1; total time=   0.0s
[CV] END ....................max_depth=1, min_samples_leaf=1; total time=   0.0s
[CV] END ....................max_depth=1, min_samples_leaf=1; total time=   0.0s
[CV] END ....................max_depth=1, min_samples_leaf=1; total time=   0.0s
[CV] END ...................max_depth=1, min_samples_leaf=11; total time=   0.0s
[CV] END ...................max_depth=1, min_samples_leaf=11; total time=   0.0s
[CV] END ...................max_depth=1, min_samples_leaf=11; total time=   0.0s
[CV] END ...................max_depth=1, min_samples_leaf=11; total time=   0.0s
[CV] END ...................max_depth=1, min_samples_leaf=11; total time=   0.0s
[CV] END ...................max_depth=1, min_samples_leaf=21; total time=   0.0s
...

GridSearchCV(estimator=DecisionTreeClassifier(),
             param_grid={'max_depth': range(1, 100, 10),
                         'min_samples_leaf': range(1, 100, 10)},
             verbose=2)

In [64]:
dtree = clf.best_estimator_

In [71]:
dtree = dtree.fit(X_train, y_train)

In [74]:
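# Add the KMeans cluster feature to the test set so its columns match X_train.
if "cluster_id" not in X_test.columns:
    X_test = X_test.assign(cluster_id=km.predict(X_test))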

In [75]:
dtree.score(X_test, y_test)

0.8564476885644768
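An accuracy of about 0.856 is only slightly above the roughly 0.845 that comes from always predicting no revenue, so accuracy alone says little about how well buyers are identified. Per-class precision and recall are more informative. The sketch below shows how they could be computed with scikit-learn, using hand-written labels in place of the real `y_test` and model predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions (not taken from the notebook's model).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
precision = precision_score(y_true, y_pred)  # of predicted buyers, how many bought
recall = recall_score(y_true, y_pred)        # of actual buyers, how many were found

print(cm)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

In this notebook the same calls would take `y_test` and `dtree.predict(X_test)` as their arguments.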