<a href="https://www.kaggle.com/code/mvsaikumar/predict-ctr-of-an-email-campaign?scriptVersionId=108417652" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Predict CTR of an Email Campaign

Can you predict the Click Through Rate (CTR) of an email campaign?

## Problem Statement

Most organizations today rely on email campaigns for effective communication with users. Email communication is one of the popular ways to pitch products to users and build trustworthy relationships with them.

Email campaigns contain different types of CTA (Call To Action). The ultimate goal of email campaigns is to maximize the Click Through Rate (CTR).

CTR is a measure of success for email campaigns: the higher the click rate, the better the campaign is performing. It is calculated as the number of users who clicked on at least one CTA divided by the total number of users the email was delivered to.

CTR =   No. of users who clicked on at least one of the CTA / No. of emails delivered
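As a quick illustration of the formula above, here is a minimal sketch that computes CTR for a single campaign. The function name `click_through_rate` and the campaign numbers are made up for the example; they are not part of the hackathon data.

```python
def click_through_rate(unique_clickers: int, delivered: int) -> float:
    """Return CTR as a fraction: users who clicked at least one CTA / emails delivered."""
    if delivered <= 0:
        raise ValueError("delivered must be positive")
    if not 0 <= unique_clickers <= delivered:
        raise ValueError("unique_clickers must be between 0 and delivered")
    return unique_clickers / delivered


# Hypothetical campaign: 2,000 emails delivered, 150 users clicked at least one CTA.
example_ctr = click_through_rate(150, 2000)
print(f"CTR = {example_ctr:.3f}")
```

A user who clicks several CTAs in the same email is counted once, which is why the numerator is the number of users rather than the number of clicks.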

CTR depends on multiple factors like design, content, personalization, etc. 

How do you design the email content effectively?

What should your subject line look like?

What should be the length of the email?

Do you need images in your email template?

As a part of the Data Science team, in this hackathon, you will build a smart system to predict the CTR for email campaigns and therefore identify the critical factors that will help the marketing team to maximize the CTR.

## Objective

Your task at hand is to build a machine learning-based approach to predict the CTR of an email campaign.

## About the Dataset

You are provided with information about past email campaigns, including email attributes such as subject and body length, number of CTAs, date and time of the email, type of audience, and whether it's a personalized email or not, along with the target variable indicating the CTR of the email campaign.

## Data Dictionary

You are provided with 3 files - train.csv, test.csv and sample_submission.csv


## Train and Test Set

The train and test sets contain different sets of email campaigns, each described by the attributes below. The train set includes the target variable click_rate, and you need to predict the click_rate of each email campaign in the test set.


| Variable | Description |
|---|---|
| campaign_id | Unique identifier of a campaign |
| sender | Sender of an email |
| subject_len | No. of characters in a subject |
| body_len | No. of characters in an email body |
| mean_paragraph_len | Average no. of characters in a paragraph of an email |
| day_of_week | Day on which the email is sent |
| is_weekend | Boolean flag indicating if an email is sent on a weekend or not |
| times_of_day | Time of day when the email is sent: Morning, Noon, Evening |
| category | Category of the product an email is related to |
| product | Type of the product an email is related to |
| no_of_CTA | No. of Call To Actions in an email |
| mean_CTA_len | Average no. of characters in a CTA |
| is_image | No. of images in an email |
| is_personalised | Boolean flag indicating if an email is personalized to the user or not |
| is_quote | No. of quotes in an email |
| is_timer | Boolean flag indicating if an email contains a timer or not |
| is_emoticons | No. of emoticons in an email |
| is_discount | Boolean flag indicating if an email contains a discount or not |
| is_price | Boolean flag indicating if an email contains a price or not |
| is_urgency | Boolean flag indicating if an email contains urgency or not |
| target_audience | Cluster label of the target audience |
| click_rate (Target Variable) | Click rate of an email campaign |


## Submission File Format

sample_submission.csv contains 2 variables: campaign_id and click_rate.


| Variable | Description |
|---|---|
| campaign_id | Unique identifier of a campaign |
| click_rate (Target Variable) | Click rate of an email campaign |


## Evaluation metric

The evaluation metric for this hackathon would be r2_score.
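For reference, the R² score compares the model's squared error with that of a baseline that always predicts the mean of the true values: R² = 1 - SS_res / SS_tot. The sketch below computes it by hand with NumPy on made-up numbers; `r2_score_manual` is a hypothetical helper written for this illustration, and `sklearn.metrics.r2_score` is the library function used later in the notebook.

```python
import numpy as np


def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of the mean-only baseline
    return 1.0 - ss_res / ss_tot


# Made-up click rates, for illustration only.
y_true = [0.10, 0.02, 0.05, 0.30]
y_pred = [0.08, 0.03, 0.05, 0.25]

model_r2 = r2_score_manual(y_true, y_pred)
perfect_r2 = r2_score_manual(y_true, y_true)
baseline_r2 = r2_score_manual(y_true, [np.mean(y_true)] * len(y_true))
print(model_r2, perfect_r2, baseline_r2)
```

A perfect model scores 1, the mean-only baseline scores 0, and a model that does worse than the baseline gets a negative score.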


## Public and Private Split

Test data is further divided into Public (40%) and Private (60%) data. Your initial responses will be checked and scored on the Public data. The final rankings would be based on your private score which will be published once the competition is over.


### 1. Import the required libraries

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

### 2. Data Inspection

In [None]:
# Read the datasets

train = pd.read_csv('../input/jobathon-august-2022/train_F3fUq2S.csv')
test = pd.read_csv('../input/jobathon-august-2022/test_Bk2wfZ3.csv')

In [None]:
# check the shapes of the dataset

print('No.of rows and columns in train dataset',train.shape, '\n')
print('No.of rows and columns in test dataset',test.shape)

__We have 1888 rows and 22 columns in Train set whereas Test set has 762 rows and 21 columns.__ 

In [None]:
# Read the first 5 rows of train and test datasets

train.head()

In [None]:
test.head()

In [None]:
# Info about the train and test datasets

train.info()

__In the train dataset, we have 1 categorical feature (times_of_day) and 20 numerical features, plus the numerical target click_rate.__

In [None]:
test.info()

__In the test dataset, we have 1 categorical feature (times_of_day) and 20 numerical features, which matches its 21 columns.__

### 3. Data Cleaning

* __Check for Missing values__

Before we go on to build the model, we must look for missing values within the dataset as treating the missing values is a necessary step before we fit a machine learning model on the dataset.

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

__In the train and test datasets, we don't have any null values.__

### 4. Exploratory Data Analysis (EDA)

In [None]:
# Features in train and test datasets

train.columns

In [None]:
test.columns

* __Target Variable__

In this section we will take a look at the 'click_rate' (CTR) of an email campaign which is the target variable. It is crucial to understand it in detail as this is what we are trying to predict accurately.

In [None]:
train['click_rate'].describe()

__The target variable (Click Through Rate) has a max of 89% Click rate.__

* __Univariate Analysis__

In [None]:
# Timing-related features

plt.figure(figsize=(22,6))

# Day of week
plt.subplot(1,3,1)
sns.countplot(x='day_of_week',data=train)

# Times of day
plt.subplot(1,3,2)
sns.countplot(x='times_of_day',data=train)

# Weekend or not
plt.subplot(1,3,3)
sns.countplot(x='is_weekend',data=train)

__Assuming 0-6 maps to Sunday through Saturday, most emails were sent on Tuesday, Wednesday and Thursday.__

__Most emails were sent in the evening, presumably because recipients are more likely to be free then.__

__Assuming 0 means weekday and 1 means weekend, most emails were sent on weekdays and far fewer on weekends.__
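The day and weekend encodings above are assumptions, and they can be checked against each other by cross-tabulating the two columns. The sketch below uses a small made-up frame with an arbitrary encoding; running the same two lines on `train` would show which `day_of_week` codes are actually flagged as weekend.

```python
import pandas as pd

# Made-up rows for illustration; the real encoding may differ.
toy = pd.DataFrame({
    "day_of_week": [0, 1, 2, 3, 4, 5, 6, 5, 6, 2],
    "is_weekend":  [0, 0, 0, 0, 0, 1, 1, 1, 1, 0],
})

# Rows: day code, columns: weekend flag, cells: number of campaigns.
table = pd.crosstab(toy["day_of_week"], toy["is_weekend"])
weekend_days = sorted(toy.loc[toy["is_weekend"] == 1, "day_of_week"].unique().tolist())

print(table)
print("Day codes flagged as weekend:", weekend_days)
```

If the weekend codes turn out to be, say, 5 and 6, then 0 would more plausibly be Monday than Sunday, which changes how the day-of-week plot should be read.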

In [None]:
plt.figure(figsize=(22,6))

# No.of Images in an email
plt.subplot(1,3,1)
sns.countplot(x='is_image',data=train)

# No.of quotes in an email
plt.subplot(1,3,2)
sns.countplot(x='is_quote',data=train)

# No. of emoticons in an email
plt.subplot(1,3,3)
sns.countplot(x='is_emoticons',data=train)

__Assuming the values 0 to 6 are the number of images in an email, emails with 0-2 images are common and emails with 3-6 images are rare.__

__Emails with 0-1 quotes are common; emails with 2-6 quotes are rare.__

__Most emails contain no emoticons; emails with 1-6 emoticons are rare.__

In [None]:
plt.figure(figsize=(22,6))

# Personalized emails or not
plt.subplot(1,3,1)
sns.countplot(x='is_personalised',data=train)

# Discount email or not
plt.subplot(1,3,2)
sns.countplot(x='is_discount',data=train)

# Urgent email or not
plt.subplot(1,3,3)
sns.countplot(x='is_urgency',data=train)

__Most of the emails were not personalized; these may be general discount or offer emails. The few personalized emails may be related to the recipient's work.__

__Most of the emails did not contain a discount, possibly because discount emails are mainly sent during sales periods.__

__Most of the emails were not marked as urgent; the few urgent emails may be work-related.__

* __Bivariate Analysis__

In [None]:
# Click rate vs Image 

plt.figure(figsize=(10,8))
sns.barplot(x='is_image',y='click_rate',data=train,palette='bright')

__Most emails contain 0-2 images, but the average click rate looks higher for emails with 3 or 6 images. This suggests that adding images may raise CTR, although those groups contain few campaigns, so their averages are less reliable.__
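A bar plot shows the mean click rate per group but hides how many campaigns each bar is based on. A hedged way to check this is to tabulate the count next to the mean. The frame below is made up for illustration; the same `groupby` could be applied to `train`.

```python
import pandas as pd

# Made-up data: many campaigns with few images, a single one with 6 images.
toy = pd.DataFrame({
    "is_image":   [0, 0, 0, 1, 1, 1, 2, 2, 6],
    "click_rate": [0.02, 0.04, 0.03, 0.05, 0.03, 0.04, 0.02, 0.06, 0.20],
})

# Number of campaigns and mean click rate for each image count.
summary = (
    toy.groupby("is_image")["click_rate"]
       .agg(n_campaigns="count", mean_click_rate="mean")
)
print(summary)
```

In this toy example the highest mean belongs to a group with a single campaign, which is exactly the situation where a tall bar should not be over-interpreted.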

In [None]:
# click rate vs subject length

# relplot is figure-level and creates its own figure, so size it with height/aspect
sns.relplot(x="subject_len", y="click_rate",ci=None,kind="line", data=train, height=6, aspect=2.5)

__CTR appears to peak when the subject line is around 50 characters long.__

In [None]:
# click rate vs length of an email

sns.relplot(x="body_len", y="click_rate",ci=None,kind="line", data=train, height=6, aspect=2.5)

__CTR appears highest when the email body is roughly 100-200 characters long.__

In [None]:
# click rate vs Mean paragraph length of an email

sns.relplot(x="mean_paragraph_len", y="click_rate",ci=None,kind="line", data=train, height=6, aspect=2.5)

__CTR appears highest when the average paragraph length is in the range of 130-150 characters.__

* __Correlation Heat Map__

Understanding the correlation between various features in the dataset

In [None]:
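# Use numeric columns only: 'times_of_day' is still a string column at this point
correlation = train.select_dtypes(include='number').corr()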

In [None]:
# constructing a heatmap to understand the correlation

plt.figure(figsize=(10,10))
sns.heatmap(correlation, cbar=True, square=True, fmt='.1f', annot=True, annot_kws={'size':8}, cmap='Blues')
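Since the goal is to predict click_rate, it can help to pull out just the target's column of the correlation matrix and rank the features by absolute correlation. This is a minimal sketch on a made-up frame; the same lines could be applied to `train`.

```python
import pandas as pd

# Made-up campaigns for illustration only.
toy = pd.DataFrame({
    "subject_len":  [40, 55, 70, 85, 100, 120],
    "body_len":     [900, 1200, 800, 1500, 1100, 1300],
    "times_of_day": ["Evening", "Noon", "Morning", "Evening", "Noon", "Evening"],
    "click_rate":   [0.09, 0.07, 0.06, 0.04, 0.03, 0.02],
})

# Keep numeric columns, take the target's correlations, and drop its self-correlation.
numeric = toy.select_dtypes(include="number")
target_corr = numeric.corr()["click_rate"].drop("click_rate")

# Order features by strength of linear association, regardless of sign.
target_corr = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(target_corr)
```

Pearson correlation only captures linear relationships, so a feature with a low value here can still be useful to a tree-based model.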

### 5. Data Pre-Processing

* __Label Encoding to the Categorical features__

Here 'times_of_day' is the only categorical feature.

In [None]:
print(train['times_of_day'].value_counts(),'\n')
print(test['times_of_day'].value_counts(),'\n')

In [None]:
# Import Label encoder from sklearn

from sklearn.preprocessing import LabelEncoder

In [None]:
# Create the label encoder
le = LabelEncoder()

# Fit on the train set only, then reuse the same mapping for the test set
var_mod = train.select_dtypes(include='object').columns
for i in var_mod:
    train[i] = le.fit_transform(train[i])
    test[i] = le.transform(test[i])

In [None]:
train.head()

In [None]:
test.head()

__The labels in the 'times_of_day' feature have been converted to numerical codes in the train and test data.__

__Here 0 --> Evening, 1 --> Morning, 2 --> Noon.__
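`LabelEncoder` assigns codes by sorted label order, so encoders fitted separately on two datasets only agree when both contain the same set of labels. The sketch below, on made-up series, shows the mapping learned from the train labels and what happens if a test set that lacks one label is encoded with a separately fitted encoder.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up labels; the toy test set has no "Morning" rows.
toy_train = pd.Series(["Evening", "Noon", "Morning", "Evening"])
toy_test = pd.Series(["Evening", "Evening", "Noon"])

# Fit once on train and inspect the learned mapping.
le = LabelEncoder().fit(toy_train)
mapping = {str(label): code for code, label in enumerate(le.classes_)}

# Reusing the train encoder keeps the codes consistent.
test_codes = le.transform(toy_test).tolist()

# Refitting on test silently changes the meaning of the codes.
refit_codes = LabelEncoder().fit_transform(toy_test).tolist()

print(mapping)
print(test_codes, refit_codes)
```

With the shared encoder "Noon" stays 2, while the refitted encoder maps it to 1, which a model trained on the train codes would read as "Morning".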

### 6. Model Building

In [None]:
# Import train test split from sklearn

from sklearn.model_selection import train_test_split

In [None]:
# Splitting the data into Features and Target

X = train.drop(['click_rate'],axis=1)
Y = train['click_rate']

In [None]:
print(X, '\n')
print(Y)

In [None]:
# Splitting the data into Training data and Test data(20%)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 22)

In [None]:
print(X.shape, X_train.shape, X_test.shape)

### 7. Development with ML Models

In [None]:
# Import the ML models libraries

from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from sklearn import metrics

In [None]:
algos = [LinearRegression(), Lasso(), Ridge(), KNeighborsRegressor(), DecisionTreeRegressor(), XGBRegressor()]

names = ['Linear Regression', 'Lasso Regression', 'Ridge Regression', 'K Neighbors Regressor', 'Decision Tree Regressor', 'XGBoost Regressor']

r2_score_list = []

In [None]:
for model in algos:
    model.fit(X_train, Y_train)                     # Fit the model on the training split
    test_data_pred = model.predict(X_test)          # Predict on the held-out 20% split
    r2 = metrics.r2_score(Y_test, test_data_pred)   # R2 score (higher is better)
    r2_score_list.append(r2)

In [None]:
evaluation = pd.DataFrame({'Model': names, 'r2': r2_score_list})

In [None]:
evaluation
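With fewer than 2,000 training rows, a single 80/20 split can rank models differently depending on the random seed. One common alternative is k-fold cross-validation, sketched below on synthetic data with two of the scikit-learn models used above. The data, the 5-fold setting, and the variable names are assumptions made for this illustration, not results from the competition dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data: a linear signal plus a little noise.
rng = np.random.default_rng(22)
X_toy = rng.normal(size=(200, 5))
y_toy = X_toy @ np.array([0.5, -0.2, 0.0, 0.1, 0.3]) + rng.normal(scale=0.1, size=200)

# Shuffled 5-fold split, reused for every model so the comparison is fair.
cv = KFold(n_splits=5, shuffle=True, random_state=22)

cv_results = {}
for label, estimator in [("Ridge", Ridge()), ("KNN", KNeighborsRegressor())]:
    scores = cross_val_score(estimator, X_toy, y_toy, scoring="r2", cv=cv)
    cv_results[label] = (scores.mean(), scores.std())
    print(f"{label}: mean R2 = {scores.mean():.3f}, std = {scores.std():.3f}")
```

Reporting the standard deviation alongside the mean makes it easier to judge whether a small gap between two models is meaningful.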

### 8. Conclusion and Submission

As the table shows, XGBoost Regressor performs slightly better than K Neighbors Regressor, while Linear, Ridge and Lasso regression and Decision Tree Regressor do not improve on that score. We therefore select XGBoost Regressor for making our final predictions.

__Make a Submission to CSV file__

In [None]:
submission = pd.read_csv('../input/jobathon-august-2022/sample_submission_LJ2N3ZQ.csv')
model = XGBRegressor()
model.fit(X, Y)
final_predictions = model.predict(test)
submission['click_rate'] = final_predictions

In [None]:
print(final_predictions)

In [None]:
# Optionally replace negative predictions with 0, since a click rate cannot be negative (disabled here)

#submission['click_rate'] = submission['click_rate'].apply(lambda x: 0 if x<0 else x)
submission.to_csv('my_submission.csv', index=False)
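A regressor such as XGBoost is not constrained to non-negative outputs, so some predicted click rates can fall below zero. A vectorized way to floor them at zero is `numpy.clip`, sketched below on made-up predictions; whether clipping improves the leaderboard score is not something this notebook verifies.

```python
import numpy as np

# Made-up raw model outputs, including one impossible negative value.
raw_predictions = np.array([-0.01, 0.03, 0.12])

# Floor at 0 and leave the upper end untouched.
clipped = np.clip(raw_predictions, 0, None)
print(clipped)
```

Applied to the submission, the equivalent would be `submission['click_rate'] = np.clip(final_predictions, 0, None)` before writing the CSV.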