In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

import sys
!{sys.executable} -m pip install xgboost
import xgboost as xgb



## Rationale for dataset selection

Datasets "Campaign", "EngagementDetail" and "DigitalUsage" are directly relevant to the objective of designing a cost-benefit analysis model for assessing the impact of personalised marketing strategies.

The "Campaign" dataset provides core business metrics such as campaign type, target audience, cost of acquisition, impressions and ROI, which are needed to evaluate the efficiency and financial impact of each marketing effort. It enables us to assess whether more personalised campaigns lead to better returns on investment, or whether high personalisation inflates costs disproportionately.

The "EngagementDetail" dataset measures campaign effectiveness at the individual level: it records whether and how each customer interacted with a campaign. By linking campaign data to individual engagement outcomes, we can identify which types of campaigns perform better.

The "DigitalUsage" dataset gives further insight into customers. It reveals user preferences for mobile and web banking, as well as their activity patterns. 

Taken together, the three datasets form a comprehensive analytical foundation to evaluate how tailoring marketing to customer behaviour influences both cost and performance, which would provide us with actionable insights.

In [28]:
# Load the datasets
campaigns = pd.read_csv('../../data/processed/campaigns.csv')
engagements = pd.read_csv('../../data/processed/engagement_details.csv')
usage = pd.read_csv('../../data/processed/digital_usage.csv')

## Merging datasets

To enable a holistic analysis, the datasets are merged in two steps: campaigns are joined to engagement records on campaign_id, and the result is joined to digital usage on customer_id. The integrated dataset allows us to link campaign characteristics with individual-level engagement outcomes and digital behaviour patterns.

In [29]:
# Merge datasets
df = pd.merge(campaigns, engagements, on='campaign_id', how='inner')
df = pd.merge(df, usage, on='customer_id', how='inner')
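
Because both merges are inner joins, rows whose keys have no match are silently dropped. The following is a minimal sketch, on hypothetical toy frames rather than the real CSV files, of how one might check for unmatched keys with `indicator=True` and guard the expected key relationships with `validate`. The toy values (including the "Search" and "Display" campaign types) are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy frames standing in for the three CSV files.
campaigns = pd.DataFrame({
    "campaign_id": [1, 2, 3],
    "campaign_type": ["Email Marketing", "Search", "Display"],
})
engagements = pd.DataFrame({
    "campaign_id": [1, 1, 2, 4],
    "customer_id": [10, 11, 10, 12],
    "has_engaged": [1, 0, 0, 1],
})
usage = pd.DataFrame({"customer_id": [10, 11], "has_mobile_app": [1, 0]})

# An outer merge with indicator=True labels each row by where its key was found.
check = pd.merge(campaigns, engagements, on="campaign_id", how="outer", indicator=True)
unmatched = check["_merge"].value_counts()

# validate= raises pandas.errors.MergeError if the key relationship is not as expected.
df = pd.merge(campaigns, engagements, on="campaign_id", how="inner", validate="one_to_many")
df = pd.merge(df, usage, on="customer_id", how="inner", validate="many_to_one")

print(unmatched)
print(len(df))
```

In this toy case one campaign and one engagement record have no partner, so they would not survive the inner join.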

## Dataset summary

To gain an initial understanding of the merged dataset, df.info() is used. It lists the column names, data types and non-null counts, which makes missing values easy to spot.

In [30]:
# Display summary information about the DataFrame
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15992 entries, 0 to 15991
Data columns (total 25 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   campaign_id        15992 non-null  int64  
 1   campaign_type      15992 non-null  object 
 2   target_audience    15992 non-null  object 
 3   campaign_duration  15992 non-null  int64  
 4   conversion_rate    15992 non-null  float64
 5   acquisition_cost   15992 non-null  float64
 6   roi                15992 non-null  float64
 7   campaign_language  15992 non-null  object 
 8   impressions        15992 non-null  int64  
 9   clicks             13409 non-null  float64
 10  engagement_id      15992 non-null  int64  
 11  customer_id        15992 non-null  int64  
 12  channel_used       15992 non-null  object 
 13  has_engaged        15992 non-null  int64  
 14  day                15992 non-null  int64  
 15  month              15992 non-null  object 
 16  duration           290

In [31]:
# Checking for missing values
print(df.isnull().sum())

campaign_id              0
campaign_type            0
target_audience          0
campaign_duration        0
conversion_rate          0
acquisition_cost         0
roi                      0
campaign_language        0
impressions              0
clicks                2583
engagement_id            0
customer_id              0
channel_used             0
has_engaged              0
day                      0
month                    0
duration             13085
has_mobile_app           0
has_web_account          0
mobile_logins_wk      4378
web_logins_wk         1966
avg_mobile_time       4378
avg_web_time          1966
last_mobile_use       4378
last_web_use          1966
dtype: int64


## Dropping column: duration

The 'duration' column from the "EngagementDetail" dataset was dropped because of its high proportion of null values (13,085 of 15,992 rows), which limited its analytical usefulness. The null values result from email marketing campaigns, where duration could not be tracked. Retaining such sparse features can introduce noise without adding meaningful insight.

In [32]:
# Drop 'duration' column as it is mostly null
df = df.drop(columns=['duration'])
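
The decision above rests on the share of missing values per column (roughly 82% for 'duration', given 13,085 nulls in 15,992 rows). As a minimal sketch, the same check can be expressed as a null fraction compared against a cut-off. The toy frame and the 0.5 threshold are assumptions for illustration, not values used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only the column names mirror the notebook.
toy = pd.DataFrame({
    "clicks": [10.0, np.nan, 7.0, 3.0, 5.0],
    "duration": [np.nan, np.nan, np.nan, 120.0, np.nan],
    "has_engaged": [0, 0, 1, 1, 0],
})

# Fraction of missing values per column.
null_fraction = toy.isnull().mean()

# Assumed cut-off: drop any column that is more than half empty.
threshold = 0.5
sparse_cols = null_fraction[null_fraction > threshold].index.tolist()
toy_reduced = toy.drop(columns=sparse_cols)

print(null_fraction)
print(sparse_cols)
```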

## Encoding categorical features

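To prepare the data for analysis, LabelEncoder is used to map each categorical variable to integer codes, and the month name is converted to its month number. Encoding these features allows us to incorporate them into models while preserving their informational content. Note that the integer codes imply an arbitrary ordering of the categories; tree-based models such as those used below generally tolerate this.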

In [33]:
# Encoding categorical features using LabelEncoder
le_campaign_type = LabelEncoder()
df['campaign_type'] = le_campaign_type.fit_transform(df['campaign_type'])

le_target_audience = LabelEncoder()
df['target_audience'] = le_target_audience.fit_transform(df['target_audience'])

le_campaign_language = LabelEncoder()
df['campaign_language'] = le_campaign_language.fit_transform(df['campaign_language'])

le_channel_used = LabelEncoder()
df['channel_used'] = le_channel_used.fit_transform(df['channel_used'])

df['month'] = pd.to_datetime(df['month'], format='%B').dt.month
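
Since the fitted encoders are reused later to transform new inputs, it can help to inspect the mapping each one has learned. A minimal sketch, using hypothetical channel names rather than the values in the real data:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical channel values, for illustration only.
channels = pd.Series(["Email", "Google Ads", "Email", "Instagram"])

le = LabelEncoder()
codes = le.fit_transform(channels)

# classes_ holds the categories in sorted order; the position is the integer code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}

# inverse_transform recovers the original labels from the codes.
decoded = le.inverse_transform(codes)

print(mapping)
print(list(decoded))
```

A label that was not seen during fitting makes `transform` raise a `ValueError`, so new inputs must use category names that already exist in the training data.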

## Feature selection

For the analysis, I selected the following relevant features as input variables:
- campaign_type: Type of marketing campaign
- campaign_duration: Duration of the campaign
- avg_mobile_time and avg_web_time: Average time spent by users on mobile and web platforms respectively.
- mobile_logins_wk and web_logins_wk: Number of weekly logins to mobile and web banking platforms.
- has_mobile_app and has_web_account: Indicators of whether the customer uses the mobile app and web banking.
- channel_used: The channels through which the campaign was promoted.
- campaign_language: The language in which the campaign was communicated.

The target variable is:
- has_engaged: A binary indicator of whether the customer engaged with the campaign (1: engaged, 0: not engaged).

These features capture both user behaviour and campaign characteristics, enabling analysis of which factors influence customer engagement with personalised marketing campaigns.

In [34]:
# Select relevant features and target variable
X = df[['campaign_type', 'campaign_duration', 'avg_mobile_time', 'avg_web_time',
        'mobile_logins_wk', 'web_logins_wk', 'has_mobile_app',
        'has_web_account', 'channel_used', 'campaign_language']]
y = df['has_engaged']

In [35]:
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [36]:
# Initialise and train a Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

y_pred = rf_model.predict(X_test)

In [37]:
# Model evaluation
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.80      0.90      0.85      3755
           1       0.32      0.17      0.22      1043

    accuracy                           0.74      4798
   macro avg       0.56      0.54      0.53      4798
weighted avg       0.69      0.74      0.71      4798



In [38]:
# Initialise and train an XGBoost classifier
xgb_model = xgb.XGBClassifier(eval_metric='logloss', random_state=42)
xgb_model.fit(X_train, y_train)

y_pred_xgb = xgb_model.predict(X_test)

In [39]:
# Evaluate the XGBoost model
print(classification_report(y_test, y_pred_xgb))

              precision    recall  f1-score   support

           0       0.80      0.92      0.85      3755
           1       0.35      0.15      0.21      1043

    accuracy                           0.75      4798
   macro avg       0.57      0.54      0.53      4798
weighted avg       0.70      0.75      0.72      4798
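
In both reports the engaged class (1) makes up 1,043 of the 4,798 test rows, and its recall is only 0.15 to 0.17, so the overall accuracy mostly reflects the majority class. As a minimal sketch of how these per-class figures arise, the toy labels below (assumed, not taken from the notebook) are scored from their confusion matrix:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: six non-engaged and four engaged customers.
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

# Precision: share of predicted engagers that really engaged, tp / (tp + fp).
precision = precision_score(y_true_toy, y_pred_toy)

# Recall: share of real engagers the model found, tp / (tp + fn).
recall = recall_score(y_true_toy, y_pred_toy)

print(tn, fp, fn, tp)
print(precision, recall)
```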



In [40]:
# Predict the cost of a campaign
X = df[['campaign_type', 'campaign_duration', 'campaign_language']]
y = df['acquisition_cost']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

cost_predictor = xgb.XGBRegressor()

cost_predictor.fit(X_train, y_train)

y_pred = cost_predictor.predict(X_test)


In [41]:
# Predict the ROI of a campaign
X = df[['campaign_type', 'campaign_duration', 'campaign_language']]
y = df['roi']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

roi_predictor = xgb.XGBRegressor()

roi_predictor.fit(X_train, y_train)

y_pred = roi_predictor.predict(X_test)
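
The two regressors above produce predictions, but no error metric is reported for them. A minimal sketch of how such predictions could be scored with mean absolute error and R², using small hypothetical arrays in place of the real `y_test` and `y_pred`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical true and predicted values, for illustration only.
y_true_toy = np.array([100.0, 200.0, 300.0, 400.0])
y_pred_toy = np.array([110.0, 190.0, 320.0, 380.0])

# Average absolute deviation, in the same unit as the target.
mae = mean_absolute_error(y_true_toy, y_pred_toy)

# Share of target variance explained by the predictions (1.0 is a perfect fit).
r2 = r2_score(y_true_toy, y_pred_toy)

print(f"MAE: {mae:.2f}")
print(f"R^2: {r2:.2f}")
```

In the notebook, the same two calls could be applied to each regressor's `y_test` and `y_pred` right after the prediction step.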

In [42]:
def cost_benefit_analysis(new_data, xgb_model, cost_predictor, roi_predictor,
                          le_campaign_type, le_campaign_language, le_channel_used):
    new_data = pd.DataFrame(new_data)

    new_data['campaign_type'] = le_campaign_type.transform(new_data['campaign_type'])
    new_data['campaign_language'] = le_campaign_language.transform(new_data['campaign_language'])
    new_data['channel_used'] = le_channel_used.transform(new_data['channel_used'])

    new_campaign_data = new_data[['campaign_type', 'campaign_duration', 'campaign_language']]

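    # Predict the probability of engagement, passing the features in the order used for training
    engagement_features = ['campaign_type', 'campaign_duration', 'avg_mobile_time', 'avg_web_time',
                           'mobile_logins_wk', 'web_logins_wk', 'has_mobile_app',
                           'has_web_account', 'channel_used', 'campaign_language']
    engagement_probability = xgb_model.predict_proba(new_data[engagement_features])[:, 1]

    # Expected cost: predicted acquisition cost weighted by the engagement probability
    predicted_cost = cost_predictor.predict(new_campaign_data) * engagement_probability

    # Expected benefit: predicted ROI applied to the expected cost, weighted again by the engagement probability
    predicted_benefit = roi_predictor.predict(new_campaign_data) * predicted_cost * engagement_probability

    # Calculate the net benefit
    net_benefit = predicted_benefit - predicted_cost

    # Calculate ROI as a percentage, element-wise; rows with non-positive cost are set to 0
    roi = np.zeros_like(predicted_cost, dtype=float)
    positive = predicted_cost > 0
    roi[positive] = net_benefit[positive] / predicted_cost[positive] * 100

    # Print the results for the first row
    print("Cost-Benefit Analysis Results:")
    print(f"Customer Acquisition Cost: ${predicted_cost[0]:.2f}")
    print(f"Expected Customer Benefit: ${predicted_benefit[0]:.2f}")
    print(f"Net Benefit: ${net_benefit[0]:.2f}")
    print(f"ROI: {roi[0]:.2f}%")

    # Return the arrays so the caller can reuse them
    return {'predicted_cost': predicted_cost, 'predicted_benefit': predicted_benefit,
            'net_benefit': net_benefit, 'roi': roi}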

In [43]:
# Test
new_data = {
    'campaign_type': ['Email Marketing'],
    'campaign_duration': [30],
    'avg_mobile_time': [5.50],
    'avg_web_time': [5.50],
    'mobile_logins_wk':  [3.0],
    'web_logins_wk':  [2.0],
    'has_mobile_app': [1],
    'has_web_account': [1],
    'channel_used': ['Email'],
    'campaign_language': ['English']
}

results = cost_benefit_analysis(new_data, xgb_model, cost_predictor, roi_predictor,
                                le_campaign_type, le_campaign_language, le_channel_used)

Cost-Benefit Analysis Results:
Customer Acquisition Cost: $5053.10
Expected Customer Benefit: $3788.83
Net Benefit: $-1264.27
ROI: -25.02%
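
To make the arithmetic inside `cost_benefit_analysis` easier to follow, the sketch below reproduces its formulas with plain NumPy and hypothetical inputs (a predicted cost of 1000, a predicted ROI multiple of 3 and an engagement probability of 0.4). These numbers are assumptions and are unrelated to the model outputs above.

```python
import numpy as np

# Hypothetical model outputs, for illustration only.
raw_cost = np.array([1000.0])      # stands in for cost_predictor.predict(...)
roi_multiple = np.array([3.0])     # stands in for roi_predictor.predict(...)
p_engage = np.array([0.4])         # stands in for predict_proba(...)[:, 1]

# Same formulas as in cost_benefit_analysis.
expected_cost = raw_cost * p_engage
expected_benefit = roi_multiple * expected_cost * p_engage
net_benefit = expected_benefit - expected_cost
roi_pct = net_benefit / expected_cost * 100

print(expected_cost[0], expected_benefit[0], net_benefit[0], roi_pct[0])
```

Because the expected cost cancels out of the ratio, the reported ROI reduces to (predicted ROI × engagement probability − 1) × 100, so it is negative whenever that product is below 1, as in the test run above.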
