# 4.0 Pre-processing

## 4.1 Problem Statement

Online Retail, a company specializing in e-commerce, recently invested a substantial portion of its revenue in an advertising campaign to boost brand and product awareness. Despite these efforts, the campaign achieved an acquisition response rate of only 3%, falling short of the anticipated 6%. Management suspects that the campaign's underperformance stemmed from its broad and costly approach, which failed to consider the diverse purchasing behaviors of customers.

To improve outcomes, the company intends to focus future marketing efforts on customers most likely to drive revenue growth. With the next campaign scheduled in six months, management seeks to achieve the following objectives:

- Customer Value Analysis: Assess the commercial value of each customer just before the campaign launch.
- Customer Segmentation: Develop a segmentation strategy based on purchasing behaviors to identify key customer groups.
- Marketing Enablement Tool: Equip the Marketing team with a tool to implement and sustain a targeted marketing strategy.

The Data Science team has been tasked with leading this project. They will collaborate with the Marketing team responsible for promotions, the Technology team, and a Management Committee representative. Although the company's database contains some data gaps due to past system migrations, it will serve as the foundation for this initiative. The success of this project will be evaluated based on the achievement of the targeted response rate of 6%, a key performance metric set by management.

This document continues the data exploration and analysis work. It includes:

- Creation of dummy or indicator features for categorical variables

- Splitting the data into testing and training datasets

- Standardizing the magnitude of numeric features using a scaler


## 4.2 Import libraries

In [1]:
# Code task 1#
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import datetime as dt
from datetime import datetime
import requests
import calendar
%matplotlib inline

In [2]:
import sklearn
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.cluster import KMeans
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.ticker as ticker


## 4.3 Read the Dataset

In [3]:
# Code task 5#
# Load the dataset
df = pd.read_csv('purchases.csv', encoding='utf-8', encoding_errors='ignore')

In [4]:
# Code task 6#
# Check on the dataset using the info method
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397924 entries, 0 to 397923
Data columns (total 14 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   InvoiceNo        397924 non-null  int64  
 1   StockCode        397924 non-null  object 
 2   Description      397924 non-null  object 
 3   Quantity         397924 non-null  int64  
 4   InvoiceDate      397924 non-null  object 
 5   UnitPrice        397924 non-null  float64
 6   CustomerID       397924 non-null  float64
 7   Country          397924 non-null  object 
 8   Month            397924 non-null  int64  
 9   Year             397924 non-null  int64  
 10  Day              397924 non-null  int64  
 11  Revenue          397924 non-null  float64
 12  Continent        397924 non-null  object 
 13  TransactionType  397924 non-null  object 
dtypes: float64(3), int64(5), object(6)
memory usage: 42.5+ MB


In [5]:
# Code task 7#
# Check on some Statistics
df.describe()

Unnamed: 0,InvoiceNo,Quantity,UnitPrice,CustomerID,Month,Year,Day,Revenue
count,397924.0,397924.0,397924.0,397924.0,397924.0,397924.0,397924.0,397924.0
mean,560617.126645,13.021823,3.116174,15294.315171,7.612537,2010.934259,15.042181,22.394749
std,13106.167695,180.42021,22.096788,1713.169877,3.416527,0.247829,8.653771,309.055588
min,536365.0,1.0,0.0,12346.0,1.0,2010.0,1.0,0.0
25%,549234.0,2.0,1.25,13969.0,5.0,2011.0,7.0,4.68
50%,561893.0,6.0,1.95,15159.0,8.0,2011.0,15.0,11.8
75%,572090.0,12.0,3.75,16795.0,11.0,2011.0,22.0,19.8
max,581587.0,80995.0,8142.75,18287.0,12.0,2011.0,31.0,168469.6


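The summary statistics reveal extreme values: `Quantity` peaks at 80,995 against a 75th percentile of 12, and `Revenue` peaks at 168,469.60 against a 75th percentile of 19.80. We will cap these outliers so that they do not distort the machine learning model.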
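To illustrate how the IQR rule flags such values, the sketch below computes the fences for a single column. The series is toy data, not the real `Quantity` column; its quartiles (2 and 12) were chosen to match those reported by `describe()` above, which gives an upper fence of 27.

```python
import pandas as pd

# Toy data standing in for the Quantity column (illustrative values only)
quantity = pd.Series([1, 2, 2, 3, 6, 6, 12, 12, 500])

q1 = quantity.quantile(0.25)
q3 = quantity.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Boolean mask of the values that fall outside the IQR fences
is_outlier = (quantity < lower) | (quantity > upper)
n_outliers = int(is_outlier.sum())
print(f"fences: [{lower}, {upper}], outliers: {n_outliers}")
```

Counting flagged values per column before capping gives a sense of how much of the data the rule will alter.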

## 4.4 Additional Cleaning

### 4.4.1 Convert columns to the Appropriate format

In [6]:
# Code task 8#
# Convert InvoiceDate to datetime format and refresh the derived Month and Day columns:
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], errors='coerce')
df['Month']=df['InvoiceDate'].dt.month
df['Day'] = df['InvoiceDate'].dt.day
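Because `errors='coerce'` silently turns unparseable values into `NaT`, it can be worth counting them after the conversion. The following is a minimal sketch on a toy series (the values are made up for illustration):

```python
import pandas as pd

# Toy column with one unparseable entry (illustrative values)
raw = pd.Series(["2010-12-01 08:26:00", "2011-07-31 14:39:00", "not a date"])

# Unparseable strings become NaT instead of raising an error
parsed = pd.to_datetime(raw, errors="coerce")
n_failed = int(parsed.isna().sum())
print(f"{n_failed} value(s) could not be parsed")
```

A non-zero count would mean that rows are dropped implicitly by any later date-based filter.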


### 4.4.2 Managing Outliers

In [7]:
# Code task 9#
# Cap outliers in every numeric column using the IQR method.
# Note: this also touches identifier and date-part columns (InvoiceNo, CustomerID,
# Month, Year, Day) and converts integer columns to float.
for col in df.select_dtypes(include=np.number).columns:
    Q1 = df[col].quantile(0.25)  # first quartile
    Q3 = df[col].quantile(0.75)  # third quartile
    IQR = Q3 - Q1                # interquartile range
    lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
    df[col] = np.where(df[col] < lower, lower, df[col])  # raise values below the lower fence to it
    df[col] = np.where(df[col] > upper, upper, df[col])  # lower values above the upper fence to it
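The loop above treats every numeric column alike. As one possible variation (an assumption on our part, not what this notebook does), the capping can be limited to measurement columns so that identifiers stay untouched. `cap_iqr` is a hypothetical helper and the frame is toy data:

```python
import pandas as pd

def cap_iqr(frame, columns, k=1.5):
    """Return a copy of `frame` with `columns` clipped to their IQR fences."""
    capped = frame.copy()
    for col in columns:
        q1, q3 = capped[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        capped[col] = capped[col].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    return capped

# Toy frame: one identifier column and one measurement column
toy = pd.DataFrame({
    "CustomerID": [12346, 12347, 12348, 12349, 12350],
    "Quantity": [1, 2, 6, 12, 80995],
})

# Only Quantity is capped; CustomerID is left as it was
capped = cap_iqr(toy, ["Quantity"])
print(capped)
```

Passing an explicit column list keeps the choice of what counts as a measurement visible in the code.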


In [8]:
#Code task 10#
# Check the Dataset after modifications
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397924 entries, 0 to 397923
Data columns (total 14 columns):
 #   Column           Non-Null Count   Dtype         
---  ------           --------------   -----         
 0   InvoiceNo        397924 non-null  float64       
 1   StockCode        397924 non-null  object        
 2   Description      397924 non-null  object        
 3   Quantity         397924 non-null  float64       
 4   InvoiceDate      397924 non-null  datetime64[ns]
 5   UnitPrice        397924 non-null  float64       
 6   CustomerID       397924 non-null  float64       
 7   Country          397924 non-null  object        
 8   Month            397924 non-null  float64       
 9   Year             397924 non-null  float64       
 10  Day              397924 non-null  float64       
 11  Revenue          397924 non-null  float64       
 12  Continent        397924 non-null  object        
 13  TransactionType  397924 non-null  object        
dtypes: datetime64[ns](1), float64(8), object(5)

In [9]:
# Code task 11#
# Review key Statistics
df.describe()

Unnamed: 0,InvoiceNo,Quantity,InvoiceDate,UnitPrice,CustomerID,Month,Year,Day,Revenue
count,397924.0,397924.0,397924,397924.0,397924.0,397924.0,397924.0,397924.0,397924.0
mean,560617.126645,8.317485,2011-07-10 23:43:36.912475648,2.597896,15294.315171,7.612537,2011.0,15.042181,14.382781
min,536365.0,1.0,2010-12-01 08:26:00,0.0,12346.0,1.0,2011.0,1.0,0.0
25%,549234.0,2.0,2011-04-07 11:12:00,1.25,13969.0,5.0,2011.0,7.0,4.68
50%,561893.0,6.0,2011-07-31 14:39:00,1.95,15159.0,8.0,2011.0,15.0,11.8
75%,572090.0,12.0,2011-10-20 14:33:00,3.75,16795.0,11.0,2011.0,22.0,19.8
max,581587.0,27.0,2011-12-09 12:50:00,7.5,18287.0,12.0,2011.0,31.0,42.48
std,13106.167695,8.09761,,2.103131,1713.169877,3.416527,0.0,8.653771,11.984713


## 4.5 Create predictive features

Our goal is to predict each customer's Customer Lifetime Value (CLV, also abbreviated LTV below) over the next six months. To achieve this, we divide the dataset into two parts:
- Transactions from the last six months provide the target (dependent) variable.
- Transactions from the preceding months describe customer activity before that window and provide the features.

### 4.5.1 Define the cutoff date for the training data

In [10]:
# Code task 12#
# InvoiceDate is already a datetime column, so no further conversion is needed
latest_date = df['InvoiceDate'].max()
cutoff_date = latest_date - pd.DateOffset(months=6)

### 4.5.2 Split data into feature and target

In [11]:
# Code task 13#
# Filter data for features (only transactions before the cutoff date)
df_features = df[df['InvoiceDate'] <= cutoff_date]

In [12]:
# Code task 14#
# Filter data for the target variable (transactions in the six months after the cutoff date)
df_target = df[
    (df['InvoiceDate'] > cutoff_date) &
    (df['InvoiceDate'] <= latest_date)]
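A quick way to convince oneself that the time-based split behaves as intended is to run it on a small frame. The sketch below uses toy transactions (invented for illustration) and the same six-month offset:

```python
import pandas as pd

# Toy transactions; dates are illustrative only
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-15", "2011-08-01", "2011-03-10", "2011-12-09"]
    ),
})

latest_date = toy["InvoiceDate"].max()
cutoff_date = latest_date - pd.DateOffset(months=6)

# Rows on or before the cutoff feed the features, later rows feed the target
before = toy[toy["InvoiceDate"] <= cutoff_date]
after = toy[toy["InvoiceDate"] > cutoff_date]

# Customers present in both windows are the ones with an observed future revenue
returning = set(before["CustomerID"]) & set(after["CustomerID"])
print(cutoff_date.date(), len(before), len(after), returning)
```

The two windows are disjoint and together cover every row, so no transaction contributes to both the features and the target.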

## 4.6 Feature Engineering

### 4.6.1 Feature Engineering for df_features

#### 4.6.1-1 Compute RFM values from df_features

In [13]:
# Code task 15#
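# Aggregate the features and calculate the RFM values per customer
features = df_features.groupby('CustomerID').agg(
    Revenue=('Revenue', 'sum'),                  # total revenue before the cutoff
    TotalTransactions=('InvoiceNo', 'nunique'),  # number of distinct invoices
    AvgOrderValue=('Revenue', 'mean'),           # mean revenue per invoice line
    Frequency=('InvoiceNo', 'count'),            # number of invoice lines
    Recency=('InvoiceDate', lambda x: (cutoff_date - x.max()).days),  # days from last purchase to the cutoff
    Tenure=('InvoiceDate', lambda x: (x.max() - x.min()).days + 1)    # days from first to last purchase, inclusive
).reset_index()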
features.head()

Unnamed: 0,CustomerID,Revenue,TotalTransactions,AvgOrderValue,Frequency,Recency,Tenure
0,12346.0,42.48,1,42.48,1,142,1
1,12347.0,1574.67,3,18.746071,84,63,121
2,12348.0,986.44,3,35.23,28,65,110
3,12350.0,334.4,1,19.670588,17,126,1
4,12352.0,848.95,5,22.340789,38,78,35
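Note that `AvgOrderValue` above is a mean over invoice lines: for customer 12347, 1574.67 / 84 ≈ 18.75. If an average per invoice is preferred, it could be derived as in the sketch below. This is a variation under our own assumptions, on toy data, and the column names `LineItems`, `AvgLineValue`, and `AvgInvoiceValue` are hypothetical:

```python
import pandas as pd

# Toy invoice lines: customer 1 has two invoices (three lines), customer 2 has one
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": [100, 100, 101, 102],
    "Revenue": [10.0, 20.0, 30.0, 5.0],
})

per_customer = toy.groupby("CustomerID").agg(
    Revenue=("Revenue", "sum"),
    TotalTransactions=("InvoiceNo", "nunique"),  # distinct invoices
    LineItems=("InvoiceNo", "count"),            # invoice lines
    AvgLineValue=("Revenue", "mean"),            # mean per invoice line
)

# Average value per invoice rather than per invoice line
per_customer["AvgInvoiceValue"] = (
    per_customer["Revenue"] / per_customer["TotalTransactions"]
)
print(per_customer)
```

For customer 1 the two averages differ (20 per line versus 30 per invoice), which shows why the definition matters when interpreting the feature.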


#### 4.6.1-2 Create clusters for Recency

In [14]:
# Code task 16#
# Labels for the recency
r_labels = range(4, 0, -1) # Assign a higher score to customers who purchased more recently

# Code task 17#
# Use the qcut function to divide the customers into 4 equal groups based on the recency quantiles
r_quartiles = pd.qcut(features['Recency'], q=4, labels = r_labels)

# Code task 18#
# Assign the values to a column called RecencyCluster
features = features.assign(RecencyCluster = r_quartiles.values)
features.head()

Unnamed: 0,CustomerID,Revenue,TotalTransactions,AvgOrderValue,Frequency,Recency,Tenure,RecencyCluster
0,12346.0,42.48,1,42.48,1,142,1,1
1,12347.0,1574.67,3,18.746071,84,63,121,2
2,12348.0,986.44,3,35.23,28,65,110,2
3,12350.0,334.4,1,19.670588,17,126,1,1
4,12352.0,848.95,5,22.340789,38,78,35,2
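The effect of the reversed labels is easiest to see on a handful of values. In this sketch (toy recency values, in days) the smallest recencies receive the highest score:

```python
import pandas as pd

# Toy recency values in days (illustrative only)
recency_days = pd.Series([5, 20, 40, 60, 90, 120, 150, 200])

# Labels run from 4 down to 1, so the lowest quartile of recency gets score 4
scores = pd.qcut(recency_days, q=4, labels=[4, 3, 2, 1])
print(scores.tolist())  # [4, 4, 3, 3, 2, 2, 1, 1]
```

This matches the table above, where customer 12346, last seen 142 days before the cutoff, falls into cluster 1.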


#### 4.6.1-3 Create Clusters for Frequency and Revenue

In [15]:
# Code task 19#
# Labels for frequency and monetary values
f_labels = range(1,5)
m_labels = range(1,5)

# Code task 20#
# Divide the customers into 4 equal groups based on the quantiles by using the qcut function
f_quartiles = pd.qcut(features['Frequency'], q=4, labels = f_labels)
m_quartiles = pd.qcut(features['Revenue'], q=4, labels = m_labels)

# Code task 21#
# Assign the values to a column FrequencyCluster for frequency and RevenueCluster for monetary
features = features.assign(FrequencyCluster = f_quartiles.values)
features = features.assign(RevenueCluster = m_quartiles.values)
features

Unnamed: 0,CustomerID,Revenue,TotalTransactions,AvgOrderValue,Frequency,Recency,Tenure,RecencyCluster,FrequencyCluster,RevenueCluster
0,12346.0,42.48,1,42.480000,1,142,1,1,1,1
1,12347.0,1574.67,3,18.746071,84,63,121,2,4,4
2,12348.0,986.44,3,35.230000,28,65,110,2,2,4
3,12350.0,334.40,1,19.670588,17,126,1,1,2,2
4,12352.0,848.95,5,22.340789,38,78,35,2,3,3
...,...,...,...,...,...,...,...,...,...,...
2788,18272.0,972.62,2,17.684000,55,41,22,3,3,4
2789,18273.0,42.48,1,42.480000,1,74,1,2,1,1
2790,18280.0,180.60,1,18.060000,10,94,1,2,1,1
2791,18283.0,535.05,5,2.306250,232,17,137,4,4,3
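`pd.qcut` raises a `ValueError` when several quantile edges coincide, which can happen with heavily tied counts. It did not happen in the run above, but should it occur on other data, one common workaround is to rank the values first. The sketch below assumes toy data and is not part of the original notebook:

```python
import pandas as pd

# Heavily tied toy values: most customers have a single invoice line.
# Calling pd.qcut on these directly fails because three bin edges equal 1.
frequency = pd.Series([1, 1, 1, 1, 1, 1, 2, 9])

# Ranking with method="first" breaks ties by order of appearance,
# so every quantile edge is unique and qcut can form four equal bins
ranked = frequency.rank(method="first")
f_cluster = pd.qcut(ranked, q=4, labels=[1, 2, 3, 4])
print(f_cluster.tolist())  # [1, 1, 2, 2, 3, 3, 4, 4]
```

The trade-off is that customers with identical values can land in different clusters, so the tie-breaking order should be chosen deliberately. Passing `duplicates='drop'` to `pd.qcut` is an alternative, but it yields fewer bins, which means the label list must be shortened to match.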


### 4.6.2 Feature Engineering for df_target

#### 4.6.2-1 Define the LTV in six months for each customer

In [16]:
# Code task 22#
# Target variable: revenue per customer in the six months after the cutoff date
future_revenue = df_target.groupby('CustomerID')['Revenue'].sum().reset_index()

# code task 23# 
# Rename the Target Variable Appropriately
future_revenue.rename(columns={'Revenue': 'Future6MonthRevenue'}, inplace=True)
future_revenue.head()

Unnamed: 0,CustomerID,Future6MonthRevenue
0,12347.0,2194.67
1,12348.0,124.96
2,12349.0,1442.71
3,12352.0,869.19
4,12356.0,58.35


#### 4.6.2-2 Create Clusters for Future6MonthRevenue

In [17]:
# Code # 23
# Labels for the three LTV clusters (1 = lowest, 3 = highest future revenue)
LTV_labels = range(1,4)


# Code 24#
# Divide the customers into 3 equal groups (terciles) based on the quantiles by using the qcut function
LTV_quartiles = pd.qcut(future_revenue['Future6MonthRevenue'], q=3, labels = LTV_labels)

# Code 25#
# Assign the values to a column called LTVCluster
future_revenue = future_revenue.assign(LTVCluster = LTV_quartiles.values)
future_revenue.head()

Unnamed: 0,CustomerID,Future6MonthRevenue,LTVCluster
0,12347.0,2194.67,3
1,12348.0,124.96,1
2,12349.0,1442.71,3
3,12352.0,869.19,3
4,12356.0,58.35,1


### 4.6.3 Merge the two DataFrames

In [18]:
# Code 26#
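# Merge target variable into features
customer_data = features.merge(future_revenue, on='CustomerID', how='left')
# Customers with no purchases after the cutoff get a future revenue of 0; their LTVCluster stays missing
customer_data['Future6MonthRevenue'] = customer_data['Future6MonthRevenue'].fillna(0)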
customer_data.head()

Unnamed: 0,CustomerID,Revenue,TotalTransactions,AvgOrderValue,Frequency,Recency,Tenure,RecencyCluster,FrequencyCluster,RevenueCluster,Future6MonthRevenue,LTVCluster
0,12346.0,42.48,1,42.48,1,142,1,1,1,1,0.0,
1,12347.0,1574.67,3,18.746071,84,63,121,2,4,4,2194.67,3.0
2,12348.0,986.44,3,35.23,28,65,110,2,2,4,124.96,1.0
3,12350.0,334.4,1,19.670588,17,126,1,1,2,2,0.0,
4,12352.0,848.95,5,22.340789,38,78,35,2,3,3,869.19,3.0


In [19]:
# Code 27#
# Check on the new DataFrame
customer_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2793 entries, 0 to 2792
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype   
---  ------               --------------  -----   
 0   CustomerID           2793 non-null   float64 
 1   Revenue              2793 non-null   float64 
 2   TotalTransactions    2793 non-null   int64   
 3   AvgOrderValue        2793 non-null   float64 
 4   Frequency            2793 non-null   int64   
 5   Recency              2793 non-null   int64   
 6   Tenure               2793 non-null   int64   
 7   RecencyCluster       2793 non-null   category
 8   FrequencyCluster     2793 non-null   category
 9   RevenueCluster       2793 non-null   category
 10  Future6MonthRevenue  2793 non-null   float64 
 11  LTVCluster           1943 non-null   category
dtypes: category(4), float64(4), int64(4)
memory usage: 186.1 KB
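The `info()` output shows 2793 customers but only 1943 non-null `LTVCluster` values, so 850 customers made no purchase after the cutoff. One way to make this explicit, sketched here on toy frames, is to count unmatched rows with `indicator=True` and give those customers their own cluster level. The extra level `0` is our assumption, not something the notebook defines:

```python
import pandas as pd

# Toy stand-ins for `features` and `future_revenue`
features = pd.DataFrame({"CustomerID": [1, 2, 3]})
future = pd.DataFrame({
    "CustomerID": [1, 3],
    "Future6MonthRevenue": [100.0, 20.0],
    "LTVCluster": pd.Categorical([3, 1], categories=[1, 2, 3], ordered=True),
})

# indicator=True adds a _merge column telling where each row came from
merged = features.merge(future, on="CustomerID", how="left", indicator=True)
n_missing = int((merged["_merge"] == "left_only").sum())

merged["Future6MonthRevenue"] = merged["Future6MonthRevenue"].fillna(0)
# Hypothetical extra level 0 for customers with no purchases after the cutoff
merged["LTVCluster"] = (
    merged["LTVCluster"]
    .cat.set_categories([0, 1, 2, 3], ordered=True)
    .fillna(0)
)
print(n_missing, merged["LTVCluster"].tolist())
```

Without such a level, these customers end up with all `LTVCluster_*` indicator columns equal to zero after one-hot encoding.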


## 4.7 Categorical Variables

### 4.7.1 Convert categorical columns to boolean values

In [20]:
# Code task 28#
# One-Hot Encode
customer_data_dummie = pd.get_dummies(customer_data)

In [21]:
# Code task 29#
# Display the first five rows after conversion
customer_data_dummie.head()

Unnamed: 0,CustomerID,Revenue,TotalTransactions,AvgOrderValue,Frequency,Recency,Tenure,Future6MonthRevenue,RecencyCluster_4,RecencyCluster_3,...,FrequencyCluster_2,FrequencyCluster_3,FrequencyCluster_4,RevenueCluster_1,RevenueCluster_2,RevenueCluster_3,RevenueCluster_4,LTVCluster_1,LTVCluster_2,LTVCluster_3
0,12346.0,42.48,1,42.48,1,142,1,0.0,False,False,...,False,False,False,True,False,False,False,False,False,False
1,12347.0,1574.67,3,18.746071,84,63,121,2194.67,False,False,...,False,False,True,False,False,False,True,False,False,True
2,12348.0,986.44,3,35.23,28,65,110,124.96,False,False,...,True,False,False,False,False,False,True,True,False,False
3,12350.0,334.4,1,19.670588,17,126,1,0.0,False,False,...,True,False,False,False,True,False,False,False,False,False
4,12352.0,848.95,5,22.340789,38,78,35,869.19,False,False,...,False,True,False,False,False,True,False,False,False,True


In [22]:
# Code task 30# 
# Display the columns of the DataFrame
customer_data_dummie.columns

Index(['CustomerID', 'Revenue', 'TotalTransactions', 'AvgOrderValue',
       'Frequency', 'Recency', 'Tenure', 'Future6MonthRevenue',
       'RecencyCluster_4', 'RecencyCluster_3', 'RecencyCluster_2',
       'RecencyCluster_1', 'FrequencyCluster_1', 'FrequencyCluster_2',
       'FrequencyCluster_3', 'FrequencyCluster_4', 'RevenueCluster_1',
       'RevenueCluster_2', 'RevenueCluster_3', 'RevenueCluster_4',
       'LTVCluster_1', 'LTVCluster_2', 'LTVCluster_3'],
      dtype='object')

In [23]:
# Code task 31#
# Convert the True and False of the categorical variables into numeric values
# Columns to convert to numeric
columns_to_numeric =['RecencyCluster_4', 'RecencyCluster_3', 'RecencyCluster_2',
       'RecencyCluster_1', 'FrequencyCluster_1', 'FrequencyCluster_2',
       'FrequencyCluster_3', 'FrequencyCluster_4', 'RevenueCluster_1',
       'RevenueCluster_2', 'RevenueCluster_3', 'RevenueCluster_4',
       'LTVCluster_1', 'LTVCluster_2', 'LTVCluster_3']

# Code task 32#
# Convert the columns
customer_data_dummie[columns_to_numeric] = customer_data_dummie[columns_to_numeric].astype(int)
customer_data_dummie

Unnamed: 0,CustomerID,Revenue,TotalTransactions,AvgOrderValue,Frequency,Recency,Tenure,Future6MonthRevenue,RecencyCluster_4,RecencyCluster_3,...,FrequencyCluster_2,FrequencyCluster_3,FrequencyCluster_4,RevenueCluster_1,RevenueCluster_2,RevenueCluster_3,RevenueCluster_4,LTVCluster_1,LTVCluster_2,LTVCluster_3
0,12346.0,42.48,1,42.480000,1,142,1,0.00,0,0,...,0,0,0,1,0,0,0,0,0,0
1,12347.0,1574.67,3,18.746071,84,63,121,2194.67,0,0,...,0,0,1,0,0,0,1,0,0,1
2,12348.0,986.44,3,35.230000,28,65,110,124.96,0,0,...,1,0,0,0,0,0,1,1,0,0
3,12350.0,334.40,1,19.670588,17,126,1,0.00,0,0,...,1,0,0,0,1,0,0,0,0,0
4,12352.0,848.95,5,22.340789,38,78,35,869.19,0,0,...,0,1,0,0,0,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2788,18272.0,972.62,2,17.684000,55,41,22,2073.02,0,1,...,0,1,0,0,0,0,1,0,0,1
2789,18273.0,42.48,1,42.480000,1,74,1,84.96,0,0,...,0,0,0,1,0,0,0,1,0,0
2790,18280.0,180.60,1,18.060000,10,94,1,0.00,0,0,...,0,0,0,1,0,0,0,0,0,0
2791,18283.0,535.05,5,2.306250,232,17,137,1559.83,1,0,...,0,0,1,0,0,1,0,0,0,1
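The encoding and the conversion to integers can also be done in a single call, since `pd.get_dummies` accepts a `dtype` argument. A minimal sketch on a toy frame:

```python
import pandas as pd

# Toy frame with one numeric column and one categorical cluster column
toy = pd.DataFrame({
    "Revenue": [42.48, 1574.67, 986.44],
    "RecencyCluster": pd.Categorical([1, 2, 2], categories=[4, 3, 2, 1], ordered=True),
})

# dtype=int yields 0/1 indicator columns directly instead of True/False
encoded = pd.get_dummies(toy, dtype=int)
print(encoded.columns.tolist())
```

The indicator columns follow the order of the categories, which is why `RecencyCluster_4` comes first in the output above.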


## 4.8 Split the Data into Training and Test

In [24]:
# Code task 33#
# Remove the 'CustomerID' column, which is an identifier and not a model feature
customer_data_model = customer_data_dummie.drop(['CustomerID'], axis=1)

In [25]:
# Code task 34#
# Dependent variable (target): Future6MonthRevenue
y = customer_data_model['Future6MonthRevenue']

# Code task 35#
# Independent variables (features)
# Note: X still contains the LTVCluster_* columns, which are derived from the target
X = customer_data_model.drop(['Future6MonthRevenue'], axis=1)

# Code task 36#
# Split the data into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
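Since the `LTVCluster_*` columns are computed from `Future6MonthRevenue`, keeping them among the features would let a regression model see the target indirectly. If `Future6MonthRevenue` is the quantity to predict, a possible adjustment is to drop them before splitting. The sketch below uses a toy frame and is a suggestion, not the notebook's own procedure:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for customer_data_model (illustrative values)
toy = pd.DataFrame({
    "Revenue": [10.0, 20.0, 30.0, 40.0, 50.0],
    "Recency": [5, 40, 90, 10, 60],
    "Future6MonthRevenue": [0.0, 15.0, 80.0, 5.0, 60.0],
    "LTVCluster_1": [0, 1, 0, 1, 0],
    "LTVCluster_3": [0, 0, 1, 0, 1],
})

target = "Future6MonthRevenue"
# Columns derived from the target should not appear among the features
leaky = [c for c in toy.columns if c.startswith("LTVCluster_")]

y = toy[target]
X = toy.drop(columns=[target] + leaky)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X.columns.tolist(), len(X_train), len(X_test))
```

If the cluster is the target instead (a classification setup), the same logic applies in reverse: `Future6MonthRevenue` would then have to be excluded from the features.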

## 4.9 Scale The Data

In [26]:
# Code task 1#
# Initialize the StandardScaler()
scaler = StandardScaler()

# Code task 2
# Fit and transform the training data
X_train_scaled = scaler.fit_transform(X_train)

# Code task 3
# Transform the testing data with the mean and standard deviation learned from the training data
X_test_scaled = scaler.transform(X_test)
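The cell above standardizes every column, including the 0/1 indicator columns. Where only the continuous features should be rescaled, scikit-learn's `ColumnTransformer` offers one way to do so. The sketch below is an alternative under our own assumptions, with toy data and a hand-picked `continuous` list:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy training and test frames (illustrative values)
X_train = pd.DataFrame({
    "Revenue": [10.0, 20.0, 30.0, 40.0],
    "Recency": [5.0, 40.0, 90.0, 10.0],
    "RevenueCluster_4": [0, 0, 1, 1],
})
X_test = pd.DataFrame({
    "Revenue": [25.0],
    "Recency": [30.0],
    "RevenueCluster_4": [1],
})

continuous = ["Revenue", "Recency"]
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), continuous)],
    remainder="passthrough",  # leave the 0/1 indicator columns untouched
)

# Fit on the training data only, then reuse the same statistics for the test data
X_train_scaled = preprocess.fit_transform(X_train)
X_test_scaled = preprocess.transform(X_test)
print(np.round(X_train_scaled, 3))
```

The scaled columns come first in the output array, followed by the passthrough columns in their original order.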

In [27]:
print("Scaled Training Features:\n", X_train_scaled) 
print("Scaled Testing Features:\n", X_test_scaled)

Scaled Training Features:
 [[-0.35509981 -0.20303941  2.33828644 ...  2.19492787 -0.5307525
  -0.66709775]
 [ 0.1348661   0.26822297  1.01045178 ... -0.45559584 -0.5307525
   1.49903069]
 [ 0.34899545  0.26822297 -0.09990729 ... -0.45559584 -0.5307525
   1.49903069]
 ...
 [ 0.20057022  0.26822297 -0.11283636 ... -0.45559584 -0.5307525
   1.49903069]
 [-0.17136974  0.26822297 -1.35585207 ... -0.45559584  1.88411734
  -0.66709775]
 [-0.1663625   0.26822297  0.03588433 ... -0.45559584 -0.5307525
  -0.66709775]]
Scaled Testing Features:
 [[ 0.78960701  0.97511655 -0.41181583 ... -0.45559584 -0.5307525
   1.49903069]
 [ 1.30151454  0.50385417  0.22125728 ... -0.45559584 -0.5307525
   1.49903069]
 [-0.3018693  -0.4386706   0.09295604 ... -0.45559584 -0.5307525
  -0.66709775]
 ...
 [ 0.78395163  0.50385417 -0.8180482  ... -0.45559584 -0.5307525
   1.49903069]
 [ 0.09851914 -0.20303941  1.08976653 ... -0.45559584 -0.5307525
   1.49903069]
 [-0.47460632 -0.20303941  0.00920355 ... -0.45559584 -

### The data is ready for the next steps

### END