In [28]:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder


In [4]:
campaigns = pd.read_csv('../../data/processed/campaigns.csv')
customers = pd.read_csv('../../data/processed/customer.csv')
engagements = pd.read_csv('../../data/processed/engagement_details.csv')

## Campaign personalisation scores

The personalisation scores for each campaign reflect how tailored and relevant a campaign is for a given target audience. A higher personalisation score means that the campaign is better suited for specific audience segments, offering more customised content and messaging.

Score calculation:
- **Campaign type weights**: Different campaign types are assigned different weights based on their level of personalisation. For example, email marketing receives the highest weight as it often involves more tailored content for individual recipients. Display advertising receives the lowest weight as it is typically more generic.
- **Language score**: Campaigns targeting customers who speak languages such as Mandarin are assigned a higher language score. More common languages such as English receive a lower score.
- **Target audience score**: Age groups are scored by the presumed relevance of campaigns to that demographic. The 35-44 group receives the highest score, on the assumption that it engages most with personalised marketing content: its members tend to be familiar with technology and digital campaigns, financially stable and near the peak of their spending power, and more likely to build loyalty to brands. The 55+ group receives the next-highest score, and all other groups receive a baseline score.
- **Campaign duration score**: Longer campaigns are often more engaging and personalised, as they allow for sustained interactions.

The final personalisation score is a weighted sum of the above factors.
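
As a worked instance of the weighted sum, the sketch below scores one campaign by hand. The component scores and weights are copied from the `get_personalisation_score` function in the next cell. The campaign (Email Marketing, Mandarin, 35-44, duration 75) corresponds to campaign 11 in the output shown later in this notebook; the variable names are chosen here for illustration.

```python
# Component scores for an Email Marketing campaign in Mandarin,
# aimed at the 35-44 group, with a duration of 75.
campaign_type_score = 0.8    # Email Marketing
language_score = 0.5         # Mandarin
target_audience_score = 0.7  # 35-44
duration_score = 0.8         # duration > 60

weights = {'type': 0.25, 'language': 0.2, 'audience': 0.25, 'duration': 0.3}
assert abs(sum(weights.values()) - 1.0) < 1e-9  # the weights sum to 1

personalisation_score = (
    weights['type'] * campaign_type_score
    + weights['language'] * language_score
    + weights['audience'] * target_audience_score
    + weights['duration'] * duration_score
)
print(round(personalisation_score, 3))  # 0.715
```

Because the weights sum to 1 and every component score lies between 0 and 1, the final score also stays between 0 and 1.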

In [5]:
def get_personalisation_score(row):
    campaign_type_weights = {
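        # The data spells this type 'Search Engine Optimization', so this key never
        # matches and such rows fall back to the 0.4 default used in .get() below
        'Search Engine Optimisation': 0.2,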
        'Email Marketing': 0.8,
        'Affiliate Marketing': 0.5,
        'Display Advertising': 0.3
    }

    language_score = 0.5 if row['campaign_language'] in ['Mandarin', 'French', 'Spanish', 'German'] else 0.3

    if row['target_audience'] == '35-44':
        target_audience_score = 0.7
    elif row['target_audience'] == '55+':
        target_audience_score = 0.6
    else:
        target_audience_score = 0.5

    if row['campaign_duration'] > 60:
        duration_score = 0.8
    elif row['campaign_duration'] > 30:
        duration_score = 0.6
    else:
        duration_score = 0.4

    campaign_type_score = campaign_type_weights.get(row['campaign_type'], 0.4)

    personalisation_score = (
        0.25 * campaign_type_score +
        0.2 * language_score + 
        0.25 * target_audience_score +
        0.3 * duration_score
    )

    return personalisation_score


campaigns['personalisation_score'] = campaigns.apply(get_personalisation_score, axis=1)

## Campaign engagement scores

The engagement scores for each campaign indicate their effectiveness – how well the campaign has succeeded in attracting interactions from its audience. Higher engagement scores suggest campaigns have resonated well with customers and elicited more significant actions, such as clicks and views.

**Engagement rate** measures the frequency of interactions, whilst the **effective engagement rate** considers the depth of engagement by accounting for the duration of interactions. Both provide complementary insights into how well a campaign resonates with its audience.
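
To illustrate how the two rates can disagree, here is a small sketch on made-up data. The column names follow those used in this notebook, but the values are hypothetical: campaign 1 has several brief interactions, campaign 2 has a single long one.

```python
import pandas as pd

# Hypothetical engagement log (values invented for illustration only).
toy = pd.DataFrame({
    'campaign_id': [1, 1, 1, 1, 2, 2, 2, 2],
    'has_engaged': [1, 1, 1, 0, 1, 0, 0, 0],
    'duration':    [2, 3, 1, 0, 40, 0, 0, 0],
    'impressions': [100] * 8,
})

# Aggregate per campaign: engagement count, total duration, impressions.
rates = toy.groupby('campaign_id').agg(
    total_engagements=('has_engaged', 'sum'),
    total_duration=('duration', 'sum'),
    impressions=('impressions', 'first'),
)
rates['engagement_rate'] = rates['total_engagements'] / rates['impressions']
rates['effective_engagement_rate'] = rates['total_duration'] / rates['impressions']
print(rates[['engagement_rate', 'effective_engagement_rate']])
```

Campaign 1 ranks higher on engagement rate, while campaign 2 ranks higher on effective engagement rate, which is why the two measures are treated as complementary.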


In [6]:
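# Attach campaign attributes (including impressions) to each engagement record
merged = pd.merge(engagements, campaigns, on='campaign_id', how='left')

# Aggregate per campaign: engagement count, total interaction duration and impressions
campaign_engagements = merged.groupby('campaign_id').agg(
    total_engagements=('has_engaged', 'sum'),
    total_duration=('duration', 'sum'),
    impressions=('impressions', 'first')
).reset_index()

# Engagement rate: engagements per impression
campaign_engagements['engagement_rate'] = campaign_engagements['total_engagements'] / campaign_engagements['impressions']
# Effective engagement rate: total interaction duration per impression
campaign_engagements['effective_engagement_rate'] = campaign_engagements['total_duration'] / campaign_engagements['impressions']

# Effectiveness score: equal-weight blend of the two rates
campaign_engagements['campaign_effectiveness_score'] = (
    0.5 * campaign_engagements['engagement_rate'] +
    0.5 * campaign_engagements['effective_engagement_rate']
)

# Left merge keeps campaigns without engagement records (their score is NaN)
campaigns_final = pd.merge(campaigns, campaign_engagements[['campaign_id', 'campaign_effectiveness_score']], on='campaign_id', how='left')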

## Predicting personalisation and engagement scores

A supervised learning model is used to predict the **personalisation score** and **engagement score** of marketing campaigns based on campaign attributes. 

Random Forest is used because it is a robust ensemble method that can capture complex, non-linear relationships between the input features and the target variables. It handles a mix of numerical and encoded categorical features, and it captures interactions between features without requiring them to be specified explicitly. This makes it a reasonable choice for predicting both continuous scores.


In [13]:
# Handle missing values
imputer = SimpleImputer(strategy='mean')
campaigns_imputed = campaigns.copy()
campaigns_imputed['campaign_duration'] = imputer.fit_transform(campaigns_imputed[['campaign_duration']])

# Encode categorical variables using Label Encoder
label_encoder = LabelEncoder()

categoricals = ['campaign_type', 'campaign_language', 'target_audience']

for col in categoricals:
    campaigns_imputed[col] = label_encoder.fit_transform(campaigns_imputed[col])

# Merge with effectiveness score data
merged_data = pd.merge(campaigns_imputed, campaign_engagements[['campaign_id', 'campaign_effectiveness_score']], on='campaign_id', how='left')

# Split the data into features and target variables
X = merged_data[['campaign_type', 'campaign_language', 'target_audience', 'campaign_duration']]
y_personalisation = merged_data['personalisation_score']
y_engagement = merged_data['campaign_effectiveness_score']

# Split the data into training and testing sets (one call keeps X and both targets aligned)
X_train, X_test, y_train_personalisation, y_test_personalisation, y_train_engagement, y_test_engagement = train_test_split(
    X, y_personalisation, y_engagement, test_size=0.2, random_state=42
)

In [None]:
# Train the model for personalisation score
personalisation_model = RandomForestRegressor(n_estimators=100, random_state=42)
personalisation_model.fit(X_train, y_train_personalisation)

# Predict on the test set
y_pred_personalisation = personalisation_model.predict(X_test)

# Train the model for engagement score
engagement_model = RandomForestRegressor(n_estimators=100, random_state=42)
engagement_model.fit(X_train, y_train_engagement)

# Predict on the test set
y_pred_engagement = engagement_model.predict(X_test)
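
The cell above produces predictions but does not score them. The following is a minimal sketch of how held-out predictions could be evaluated with mean absolute error and R². It uses synthetic data as a stand-in for the encoded campaign features; `X_demo`, `y_demo` and the relationship between them are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for four encoded features (not the real campaign data).
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 4, size=(200, 4)).astype(float)
y_demo = 0.1 * X_demo[:, 0] + 0.05 * X_demo[:, 3] + rng.normal(0, 0.01, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_tr, y_tr)
y_hat = model.predict(X_te)

# Score the held-out predictions.
mae = mean_absolute_error(y_te, y_hat)
r2 = r2_score(y_te, y_hat)
print(f"MAE: {mae:.4f}, R^2: {r2:.3f}")
```

In this notebook the same two metric calls could be applied to `y_test_personalisation` and `y_pred_personalisation`, and to the engagement pair. Since the personalisation score is computed deterministically from the same four features, a near-perfect fit would be expected for that target.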

## Predicting acquisition costs

This model predicts the **acquisition cost** of campaigns using linear regression on the log-transformed cost. The log transformation stabilises the variance of the acquisition cost, and it lets each coefficient be read as an approximate percentage change in cost. Cross-validated predictions give an out-of-sample view of how well the model performs.
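
A short sketch of how a coefficient from a log-transformed regression can be read. The coefficient value `beta` is hypothetical and not taken from the fitted model; the cost value is one of the acquisition costs shown later in this notebook.

```python
import math

# Hypothetical coefficient for a one-hot campaign feature (not a fitted value).
beta = 0.25

# In a log-linear model, switching the feature on multiplies the predicted
# cost by exp(beta), i.e. a percentage change of (exp(beta) - 1) * 100.
multiplier = math.exp(beta)
percent_change = (multiplier - 1) * 100
print(f"multiplier: {multiplier:.3f}, change: {percent_change:.1f}%")

# exp is the inverse of log, so values on the log scale map back to the
# original scale via exp.
cost = 9051.65
round_trip = math.exp(math.log(cost))
print(round_trip)
```

For small coefficients the percentage change is close to `beta * 100`, which is why the coefficients are often read directly as approximate percentage effects.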

In [30]:
# Log-transform the acquisition_cost column due to high variance
merged_data['log_acquisition_cost'] = np.log(merged_data.acquisition_cost)

# Encode an order for target_audience
merged_data['target_audience'] = pd.Categorical(merged_data.target_audience, ordered=True)


X = merged_data[['campaign_type', 'campaign_duration', 'campaign_language']]
y = merged_data['log_acquisition_cost']
categorical_features = ['campaign_type', 'campaign_language']
numerical_features = ['campaign_duration']

# Define preprocessor
preprocessor = ColumnTransformer(transformers=[
    ('cat', OneHotEncoder(drop='first'), categorical_features),
    ('num', 'passthrough', numerical_features)
])

# Define pipeline
linreg_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

# 5-fold cross-validation due to small dataset size
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Fit model
linreg_model.fit(X, y)

# Predict (log scale → original scale); np.exp is the inverse of the np.log used above
log_preds = cross_val_predict(linreg_model, X, y, cv=kf)
preds = np.exp(log_preds)
true_vals = np.exp(y)

# Coefficient interpretation
feature_names = linreg_model.named_steps['preprocessor'].get_feature_names_out()
log_coefficients = linreg_model.named_steps['regressor'].coef_
percent_impact = (np.exp(log_coefficients) - 1) * 100

coef_df = pd.DataFrame({
    'Feature': feature_names,
    'Log Coefficient': log_coefficients,
    'Approx % Change in Cost': percent_impact
}).sort_values(by='Approx % Change in Cost', ascending=False)

## Cost-Benefit ratio

The Cost-Benefit ratio combines personalisation, engagement and acquisition cost to measure the cost-effectiveness of a campaign. A high ratio indicates that a campaign is delivering high engagement and personalisation for a reasonable cost, making it more effective and efficient.
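
As a worked instance of the ratio, the sketch below recomputes it for one campaign. The three input values are the rounded figures displayed for campaign 41 in the output later in this notebook, so the result agrees with the displayed ratio only up to that rounding.

```python
# Rounded values as displayed for campaign_id 41 in this notebook's output.
personalisation_score = 0.52
campaign_effectiveness_score = 0.399118
acquisition_cost = 9051.65

# Benefit (personalisation x effectiveness) per unit of acquisition cost.
cost_benefit_ratio = (
    personalisation_score * campaign_effectiveness_score / acquisition_cost
)
print(f"{cost_benefit_ratio:.4e}")
```

Because the acquisition cost is in the thousands while both scores are below 1, the ratio is very small in absolute terms. It is meaningful for ranking campaigns against each other rather than as a standalone figure.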

In [8]:
campaigns_final['cost_benefit_ratio'] = (
    campaigns_final['personalisation_score'] * campaigns_final['campaign_effectiveness_score'] / campaigns_final['acquisition_cost']
)

In [9]:
campaigns_final.head()

| Unnamed: 0 | campaign_id | campaign_type | target_audience | campaign_duration | conversion_rate | acquisition_cost | roi | campaign_language | impressions | clicks | personalisation_score | campaign_effectiveness_score | cost_benefit_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Search Engine Optimization | 35-44 | 15 | 0.1161 | 16230.45 | 4.85 | Mandarin | 14665 | 1703.0 | 0.495 | 0.082748 | 2.523669e-06 |
| 1 | 11 | Email Marketing | 35-44 | 75 | 0.0471 | 13406.29 | 3.5 | Mandarin | 33011 | 1555.0 | 0.715 | 0.000379 | 2.019523e-08 |
| 2 | 41 | Affiliate Marketing | 35-44 | 30 | 0.1332 | 9051.65 | 1.99 | Mandarin | 7142 | 951.0 | 0.52 | 0.399118 | 2.292856e-05 |
| 3 | 35 | Display Advertising | 35-44 | 60 | 0.1021 | 3788.69 | 2.76 | Mandarin | 12935 | 1321.0 | 0.53 | 0.268728 | 3.759241e-05 |
| 4 | 29 | Search Engine Optimization | 55+ | 15 | 0.062 | 2967.56 | 3.29 | German | 22232 | 1378.0 | 0.47 | 0.023997 | 3.800618e-06 |


## Customer personalisation potential

The customer personalisation potential score evaluates how likely a customer is to respond positively to personalised marketing strategies.

Score calculation:
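- **Age score**: Customers aged 30 to 50 receive the highest score, as they are assumed to combine higher spending power with a readiness to engage with personalised content. Customers under 30 score next, and those over 50 score lowest.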
- **Income score**: Customers with higher incomes may be responsive to premium or personalised offerings.
- **Job and education**: Customers with tertiary education, or in management, technician or blue-collar roles, score higher, as they may prefer certain types of content or campaigns.
- **Dependents**: Customers with dependents may be more receptive to personalised offers that cater to family or financial needs.
- **Customer lifetime value**: High lifetime value customers are prime candidates for more personalised, long-term engagement.

The final personalisation potential score is a weighted sum of the above factors.
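
To show what range the weighted sum can take, the sketch below combines the lowest and highest component scores. The weights and score levels are copied from the `get_personalisation_potential` function in the next cell; the dictionary layout is only an illustrative way of arranging them.

```python
# Weights and possible component scores, copied from the scoring function.
weights = {
    'age': 0.2, 'income': 0.3, 'education': 0.2,
    'job': 0.1, 'dependents': 0.1, 'lifetime_value': 0.1,
}
levels = {
    'age': [0.7, 0.8, 0.6],
    'income': [0.9, 0.7, 0.5],
    'education': [0.8, 0.6],
    'job': [0.8, 0.5],
    'dependents': [0.7, 0.5],
    'lifetime_value': [0.9, 0.6],
}
assert abs(sum(weights.values()) - 1.0) < 1e-9  # the weights sum to 1

# Lowest and highest achievable scores under this scheme.
lowest = sum(w * min(levels[k]) for k, w in weights.items())
highest = sum(w * max(levels[k]) for k, w in weights.items())
print(f"score range: {lowest:.2f} to {highest:.2f}")  # score range: 0.55 to 0.83
```

Every customer therefore scores between 0.55 and 0.83, so differences between customers are best read relative to this range rather than to the full 0 to 1 scale.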

In [10]:
def get_personalisation_potential(row):
    if row['age'] < 30:
        age_score = 0.7
    elif 30 <= row['age'] <= 50:
        age_score = 0.8
    else:
        age_score = 0.6

    if row['income'] > 2000:
        income_score = 0.9
    elif 1000 <= row['income'] <= 2000:
        income_score = 0.7
    else:
        income_score = 0.5

    education_score = 0.8 if row['education'] == 'tertiary' else 0.6

    job_score = 0.8 if row['job'] in ['management', 'technician', 'blue-collar'] else 0.5

    dependents_score = 0.7 if row['dependents'] > 0 else 0.5

    lifetime_value_score = 0.9 if row['customer_lifetime_value'] > 200 else 0.6

    potential_score = (
        0.2 * age_score +
        0.3 * income_score +
        0.2 * education_score +
        0.1 * job_score +
        0.1 * dependents_score +
        0.1 * lifetime_value_score
    )

    return potential_score

customers['personalisation_potential_score'] = customers.apply(get_personalisation_potential, axis=1)

## Balancing of personalisation with cost-effectiveness

The computed scores in this notebook – campaign personalisation scores, campaign engagement scores, campaign cost-benefit ratios and customer personalisation potential scores – provide a framework for optimising marketing strategies by aligning campaigns with customers' profiles whilst maintaining cost-effectiveness. These scores can be used to match campaigns with customers.

An important assumption here is that higher personalisation increases costs but also increases effectiveness. It is therefore important to balance personalisation across different aspects of campaigns to maximise the Cost-Benefit ratio.

A few possible strategies are:

- **Balancing high-personalisation with low-cost campaigns**:
  - Email marketing and affiliate marketing can be highly personalised at a lower cost, compared to other channels like display advertising, which may require more resources for customisation. Focus on these more cost-effective channels whilst still maintaining a high personalisation score.

- **Target high-potential segments with moderate personalisations**:
  - Instead of personalising all aspects of a campaign, target customers with high personalisation potential with moderately personalised campaigns. For example, using dynamic email content based on purchase history can be highly effective but not as resource-intensive as fully individualised product recommendations. 

- **Use tiered personalisation levels**:
  - For larger customer groups, use tiered personalisation. For example, offer a higher level of personalisation to top-tier customers (i.e. those with high lifetime value or engagement history), and a more generic approach to lower-tier customers. This can help limit the costs of personalisation whilst maintaining targeted content for the most valuable customers.

- **Optimise campaign duration for cost efficiency**:
  - Longer campaigns tend to cost more, but they also allow for more sustained engagement. Instead of making the entire campaign highly personalised, focus personalisation on key touchpoints, such as the initial email or a special offer. The remaining communication can be more standardised but still relevant, balancing cost and engagement.

- **Prioritise personalisation where it matters most**:
  - Focus on the most impactful personalisation elements. For example, product recommendations based on past purchases or tailoring messages based on age and income may yield a higher ROI than customising every communication channel.

- **Leverage automation for scalable personalisation**:
  - Invest in automated personalisation tools that allow for high levels of personalisation without a significant increase in cost. For example, AI-powered content recommendations or dynamic email campaigns can scale personalisation efficiently, ensuring that personalisation is automated and does not require extensive manual effort.
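
The tiered approach from the list above could be sketched as follows. The customers, the `customer_id` column and the tier boundaries are all hypothetical; in this notebook the score column would come from the `customers` frame.

```python
import pandas as pd

# Hypothetical customers with already-computed potential scores.
demo = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'personalisation_potential_score': [0.58, 0.66, 0.74, 0.83],
})

# Tier boundaries are illustrative assumptions, not values from the analysis.
demo['tier'] = pd.cut(
    demo['personalisation_potential_score'],
    bins=[0.0, 0.65, 0.75, 1.0],
    labels=['generic', 'moderate', 'high'],
)
print(demo)
```

Each tier could then be mapped to a level of personalisation effort, with the most resource-intensive treatment reserved for the `high` tier.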

## Monitoring and optimising scoring systems

To ensure that personalisation and cost-effectiveness are aligned and optimised, the scores must continuously be monitored and refined, and campaign strategies must be periodically re-evaluated.

A few possible approaches are:

- **Score adjustments**:
  - Analyse if campaigns with **higher personalisation scores** are yielding the expected results in terms of engagement and conversion. If engagement drops or costs increase without a proportional increase in effectiveness, revisit the personalisation score computation.

- **Strategy re-evaluation**:
  - Continuously conduct **A/B tests** to evaluate different levels of personalisation. This will help determine the optimal level of personalisation for each customer segment.
  - Create **feedback loops** where the insights from campaign performance inform future personalisation strategies. For example, if campaigns with moderate personalisation are performing better than highly personalised ones for certain customer segments, shift resources to maximise those mid-level personalisation strategies.
  - Regularly review the Cost-Benefit ratio across campaigns to ensure that increasing personalisation does not disproportionately inflate costs. If the ratio begins to decline as personalisation efforts increase, it may indicate that campaigns need to be scaled back or optimised to maintain efficiency.
  - Compare the personalisation scores, engagement scores and Cost-Benefit ratios against **industry benchmarks** or historical performance. Continuously strive for incremental improvements based on these benchmarks, identifying underperforming campaigns and adjusting strategies to maximise the overall return on investment.
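
One way the A/B tests mentioned above could be evaluated is a two-proportion z-test on engagement counts. The sketch below is a minimal version using only the standard library; the function name and the counts are hypothetical.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for the difference in two proportions."""
    p_a, p_b = success_a / n_a, success_b / n_b
    # Pooled proportion under the null hypothesis of no difference.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal, via the complementary error function.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: variant A is generic, variant B is personalised.
z, p = two_proportion_z_test(success_a=120, n_a=2000, success_b=165, n_b=2000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference in engagement between the two variants is unlikely to be due to chance alone. Whether the uplift justifies the extra personalisation cost is a separate question that the Cost-Benefit ratio addresses.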

By monitoring and optimising the personalisation and cost-effectiveness scores over time, businesses can ensure that their campaigns evolve to meet customer needs whilst maintaining a sustainable cost structure. Regular testing, adjustments and feedback loops help keep the personalisation strategy effective and cost-efficient, ultimately driving higher engagement and maximising returns.
