# Rocketium ML Internship Assignment

**Assignment: Predictive Modeling for Digital Marketing Campaign Performance**

**Importing Libraries**

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import missingno as msno
import plotly.graph_objects as go

**Reading dataset using Pandas Library**

In [4]:
df=pd.read_csv("Rocketium AI_ML Internship Assignment - 2.csv")

**Viewing the dataset and its other information**

In [None]:
print(df.head())
print(df.shape)
print(df.info())

# Step-1 : Exploratory Data Analysis (EDA)

**Segregating Performance Metrics and Creative Attributes features from the dataset**

In [6]:
'''
The features are segregated into 2 categories according to their nature: each feature is either a performance metric or a creative attribute.
The performance metrics are the metrics that are used to measure the performance of digital marketing campaigns.
The creative attributes are the attributes that describe the creative content of the digital marketing campaigns.

Moreover, segregating the features into 2 categories will help in the Exploratory Data Analysis.
'''

performance_metrics = ["date","spend","impressions","likecount","commentcount","repostcount","total engagements","conversion"]

creative_attributes = ["action", "type","posturl","postcontent","profileurl","videourl","sharedposturl","created_at","size","url","number of faces",
    "face emotion","face position","face area percentage %","objects","number of objects","primary object","primary object position","primary object area percentage %","secondary object","secondary object position",
    "secondary object area percentage %","text","text length","dominant colour","cta","logos","logo 1 name","logo 1 position","logo 1 area percentage %","logo 2 name","logo 2 position","logo 2 area percentage %","number of persons",
    "person area %","person area","style","tone","voice","sentiment","text area %","empty space %","topic","language","# faces","# persons","# objects","# text length"
]
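
Because both lists are typed by hand, a misspelled name would only surface later as a `KeyError`. A minimal sketch of a sanity check follows, assuming a small stand-in DataFrame and a hypothetical helper `missing_columns` (neither is part of the original notebook); in the notebook itself one would pass `df` and the two lists above.

```python
import pandas as pd

def missing_columns(frame: pd.DataFrame, expected: list[str]) -> list[str]:
    """Return the names in `expected` that are not columns of `frame`."""
    return [name for name in expected if name not in frame.columns]

# Stand-in data for illustration only; the real notebook would pass `df`.
toy_df = pd.DataFrame({"date": ["2023-08-24"], "spend": [120.0], "conversion": [3]})
toy_metrics = ["date", "spend", "impressions", "conversion"]

# Names that were listed but do not exist in the frame.
absent = missing_columns(toy_df, toy_metrics)
print(absent)
```

An empty result for both `performance_metrics` and `creative_attributes` would indicate that every listed feature can be selected safely.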

**Performing Exploratory Data Analysis on PERFORMANCE METRICS**

**Spend v/s Time**

In [92]:
'''
This code will generate a bar chart where the x-axis represents dates and y-axis represents the spending.
Each bar will represent the spending for a specific date.
'''

fig = go.Figure(data=go.Bar(x=df['date'], y=df['spend']))
fig.update_layout(title='Spend Over Time', xaxis_title='Date', yaxis_title='Spend')
fig.show()


# Findings:
# 1 September 2023 has the highest spending.
# 24 August 2023 has the second highest spending.
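
Findings like these can also be read from the data directly instead of from the chart. The sketch below assumes the dataset may contain more than one row per date, so it sums the metric per date first; `top_dates` is a hypothetical helper and `toy_df` is made-up data, not the assignment's CSV.

```python
import pandas as pd

def top_dates(frame, metric, n=2):
    """Sum `metric` per date and return the n largest totals, highest first."""
    return frame.groupby("date")[metric].sum().nlargest(n)

# Made-up rows for illustration; two of them share the same date.
toy_df = pd.DataFrame({
    "date": ["2023-08-24", "2023-08-24", "2023-09-01", "2023-09-06"],
    "spend": [100.0, 50.0, 200.0, 30.0],
})

top_spend = top_dates(toy_df, "spend")
print(top_spend)
```

The same helper could be reused for the other performance metrics (for example `top_dates(df, "impressions")`) to confirm the dates named in each Findings comment.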

**Impression v/s Time**

In [93]:
'''
This code will generate a bar chart where the x-axis represents dates and y-axis represents the number of impressions.
Each bar will represent the impressions of digital marketing campaign for a specific date.
'''

fig = go.Figure(data=go.Bar(x=df['date'], y=df['impressions']))
fig.update_layout(title='Impressions Over Time', xaxis_title='Date', yaxis_title='Impressions')
fig.show()

# Findings:
# 24 August 2023 has the highest number of impressions.
# 1 September 2023 has the second highest number of impressions.

**Like count v/s Time**

In [94]:
''' 
This code will generate a bar chart where the x-axis represents dates and y-axis represents the number of likes.
Each bar will represent the number of likes received for a specific date.
'''

fig = go.Figure(data=go.Bar(x=df['date'], y=df['likecount']))
fig.update_layout(title='Like Count Over Time', xaxis_title='Date', yaxis_title='Like Count')
fig.show()

# Findings:
# 24 August 2023 has the highest number of likes.
# 6 September 2023 has the second highest number of likes.


**Comment Count v/s Time**

In [95]:
'''
This code will generate a bar chart where the x-axis represents dates and y-axis represents the number of comments.
Each bar will represent the number of comments received for a specific date.
'''

fig = go.Figure(data=go.Bar(x=df['date'], y=df['commentcount']))
fig.update_layout(title='Comment Count Over Time', xaxis_title='Date', yaxis_title='Comment Count')
fig.show()

# Findings:
# 21 August 2023 has the highest number of comments.
# 15 August 2023 has the second highest number of comments.

**Repost Count v/s Time**

In [96]:
'''
This code will generate a bar chart where the x-axis represents dates and y-axis represents the number of reposts.
Each bar will represent the number of reposts received for a specific date.
'''

fig = go.Figure(data=go.Bar(x=df['date'], y=df['repostcount']))
fig.update_layout(title='Repost Count Over Time', xaxis_title='Date', yaxis_title='Repost Count')
fig.show()

# Findings:
# 23 August 2023 has the highest number of reposts.
# 6 September 2023 has the second highest number of reposts.

**Total Engagement v/s Time**

In [97]:
'''
This code will generate a bar chart where the x-axis represents dates and y-axis represents the total number of engagements (likes, comments, and reposts combined).
Each bar will represent the total number of engagements for a specific date.
'''

fig = go.Figure(data=go.Bar(x=df['date'], y=df['total engagements']))
fig.update_layout(title='Total Engagements Over Time', xaxis_title='Date', yaxis_title='Total Engagements')
fig.show()

# Findings:
# 24 August 2023 has the highest number of total engagements.
# 6 September 2023 has the second highest number of total engagements.


**Conversion v/s Time**

In [98]:
'''
This code will generate a bar chart where the x-axis represents dates and y-axis represents the conversion metric (e.g. sign-ups, purchases).
Each bar will represent the conversion metric for a specific date.
'''

fig = go.Figure(data=go.Bar(x=df['date'], y=df['conversion']))
fig.update_layout(title='Conversion Over Time', xaxis_title='Date', yaxis_title='Conversion')
fig.show()

# Findings:
# 6 August 2023 has the highest number of conversions.
# 24 September 2023 has the second highest number of conversions.


**Performing Exploratory Data Analysis (EDA) on CREATIVE ATTRIBUTES**

**Text Area percentage v/s Conversion**

In [99]:
'''
This code will generate a scatter plot where the x-axis represents the text area percentage and y-axis represents the conversion metric (e.g. sign-ups, purchases).
'''

fig = px.scatter(df, x='text area %', y='conversion', trendline='ols', title='Text Area Percentage vs. Conversions')
fig.show()

df['text area %'].corr(df['conversion'])

# Findings:
# The correlation between text area percentage and conversion is -0.04799720967509999.
# This is a very weak negative correlation, close to zero, so text area percentage and conversion show almost no linear relationship in this dataset.

-0.04799720967509999
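
To put the size of this coefficient in perspective, squaring a Pearson correlation gives the share of variance in one variable that a straight-line fit on the other would explain. The short sketch below only reuses the value printed above; the variable names are illustrative.

```python
# Pearson correlation reported by the cell above.
r = -0.04799720967509999

# For a simple linear fit, r squared is the fraction of variance explained.
r_squared = r ** 2

print(f"r = {r:.3f}, r^2 = {r_squared:.4f}")
```

With a value this close to zero, the sign of the coefficient carries very little practical meaning.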

**Video URL presence v/s Conversion**

In [100]:
'''
This histogram visualizes the presence or absence of a video URL in the creative content and how it relates to conversions.
It provides insights into whether including a video URL affects conversion rates.
'''

fig = px.histogram(df, x='videourl', color='conversion', barmode='group', title='Video URL Presence vs. Conversions')
fig.show()

# Findings:
# The presence of video URL has  effect on the conversion rate.
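
One way to check a presence-versus-absence effect numerically is to turn the URL column into a two-level flag and compare conversions per group. This is a sketch on made-up rows, assuming a missing `videourl` means the post has no video; the group labels are illustrative.

```python
import pandas as pd

# Made-up rows for illustration; None stands for a post without a video.
toy_df = pd.DataFrame({
    "videourl": ["https://example.com/v1", None, None, "https://example.com/v2"],
    "conversion": [10, 4, 6, 20],
})

# Label each row by whether a video URL is present.
group = toy_df["videourl"].notna().map({True: "video", False: "no video"})

# Row count and mean conversion for each group.
summary = toy_df.groupby(group)["conversion"].agg(["count", "mean"])
print(summary)
```

Comparing the group means alongside the group sizes gives a rough indication of whether a difference is worth investigating further.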


**Dominant Color Distribution**

In [101]:
'''
This pie chart visualizes the distribution of the dominant color in the creative content.
'''

fig = px.pie(df, names='dominant colour', title='Dominant Color Distribution')
fig.show()

# Findings:
# The most common dominant color is blue.
# The second most common dominant color is red.


**Logo Presence v/s Conversion**

In [102]:
'''
This histogram visualizes the presence or absence of a logo in the creative content and how it relates to conversions (e.g. sign-ups, purchases).
'''

fig = px.histogram(df, x='logos', color='conversion', barmode='group', title='Logo Presence vs. Conversions', height=1000 ,width=800) 
fig.show()

# Findings:
# Creatives containing the Schneider Electric logo have the highest number of conversions.

**Performing Exploratory Data Analysis (EDA) on PERFORMANCE METRICS v/s CREATIVE ATTRIBUTES**

**Performance Metric: Total Engagements vs. Creative Attribute: Dominant Color**

In [103]:
'''
This Box-Plot visualizes the relationship between the dominant color in the creative content and the total number of engagements (likes, comments, and reposts combined).
'''

fig = px.box(df, x='dominant colour', y='total engagements', title='Dominant Color vs. Total Engagements')
fig.show()

# Findings:
# The dominant color dimgray has the highest number of total engagements (likes, comments, and reposts combined), which is 2715.

**Performance Metric: Like Count vs. Creative Attribute: # faces**

In [104]:
'''
This scatter plot visualizes the relationship between the number of faces in the creative content and the like count in the performance metrics.
'''
fig = px.scatter(df, x='# faces', y='likecount', color='conversion', title='Number of Faces vs. Like Count')
fig.show()

# Findings:
# When there is 1 face, both conversion and like count are moderate, which is a good result for a Digital Marketing Campaign.
# When there are 2 faces, the like count is high but conversion is moderate.

**Performance Metric: Comment Count vs Creative Attribute: Text Area Percentage**

In [105]:
'''
This code will generate a scatter plot where the x-axis represents the comment count and y-axis represents the text area percentage.

'''

fig = px.scatter(df, x='commentcount', y='text area %', trendline='ols', title='Comment Count vs. Text Area Percentage')
fig.show()

df['commentcount'].corr(df['text area %'])

# Findings:
# The correlation between comment count and text area percentage is -0.03754204500041261.
# This is a very weak negative correlation, close to zero, so comment count and text area percentage show almost no linear relationship.

-0.03754204500041261

# Step-2: Data Preprocessing

**Importing the Required Libraries**

In [106]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.feature_selection import SelectKBest, mutual_info_regression
import pandas as pd

**Generating separate numerical features and categorical features list**

In [107]:
'''
Two separate lists are created, one for numerical features and one for categorical features.
This is done because numerical and categorical features require different preprocessing methods.
For example, numerical features require imputation and scaling, whereas categorical features require imputation and one-hot encoding.
'''

# Since conversion will be my output variable, I am dropping it from the numerical features list so as to avoid data leakage and protect it from any preprocessing method.
numerical_features=df.select_dtypes(include=['int64','float64']).columns.drop('conversion')
categorical_features=df.select_dtypes(include=['object']).columns
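
Selecting every non-numeric column as categorical also picks up free-text and URL-like columns, which can have nearly as many distinct values as there are rows and therefore produce very wide one-hot encodings. The sketch below counts distinct values per column on made-up data; it uses `exclude="number"` rather than `include=['object']` only so that the example does not depend on how a given pandas version stores strings.

```python
import pandas as pd

# Made-up rows for illustration: one low-cardinality and one URL-like column.
toy_df = pd.DataFrame({
    "type": ["image", "video", "image", "image"],
    "posturl": ["u1", "u2", "u3", "u4"],
    "spend": [1.0, 2.0, 3.0, 4.0],
})

# Non-numeric columns, mirroring the categorical feature list above.
toy_categorical = toy_df.select_dtypes(exclude="number").columns

# Number of distinct values per categorical column, largest first.
cardinality = toy_df[toy_categorical].nunique().sort_values(ascending=False)
print(cardinality)
```

Columns whose count approaches the number of rows are candidates for dropping or for a different encoding, which is a modelling decision outside the scope of this sketch.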

**Using Pipeline and Column Transformer to preprocess data**

**Numerical Features**

In [108]:
'''
A Pipeline is created to handle numerical features.
This pipeline will ensure that the numerical features are first imputed and then scaled.

Imputation is done using Simple Imputer with the median strategy, which means that missing numerical values will be replaced by the median of that feature.
Scaling is done using Standard Scaler, which standardizes each numerical feature to zero mean and unit variance.
Scaling is done in order to ensure that the numerical features are on a comparable scale.
'''

num_transformer=Pipeline(steps=[
    ('imputer',SimpleImputer(strategy='median')),
    ('scaler',StandardScaler())
])
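
The effect of these two steps can be seen on a tiny array. The sketch below rebuilds the same pipeline under the illustrative name `toy_num` and applies it to four made-up values, one of them missing.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same steps as the numerical pipeline above, under an illustrative name.
toy_num = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# One column with a missing value; the median of the observed values is 2.0.
toy_X = np.array([[1.0], [2.0], [np.nan], [9.0]])

scaled = toy_num.fit_transform(toy_X)
print(scaled.ravel())
```

The missing entry is filled with the median before scaling, so it ends up with the same standardized value as the row that originally held 2.0.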

**Categorical Features**

In [109]:
'''
A Pipeline is created to handle categorical features.
This pipeline will ensure that the categorical features are first imputed and then encoded.

Imputation is done using Simple Imputer with the constant strategy, which means that missing categorical values will be replaced by the string 'missing'.
Encoding is done using One Hot Encoder, which turns each category into its own binary (0/1) column.
Encoding is done in order to ensure that the categorical features are converted into numerical features so that the machine learning model can understand them.
'''

cat_transformer=Pipeline(steps=[
    ('imputer',SimpleImputer(strategy='constant',fill_value='missing')),
    ('onehot',OneHotEncoder(handle_unknown='ignore'))
])
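
A small demonstration may clarify what `fill_value='missing'` and `handle_unknown='ignore'` do in practice. The sketch below rebuilds the pipeline as `toy_cat` (an illustrative name) and uses made-up colour values.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Same steps as the categorical pipeline above, under an illustrative name.
toy_cat = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Training values include a missing entry; "green" is never seen in training.
train = np.array([["blue"], ["red"], [np.nan]], dtype=object)
new = np.array([["green"]], dtype=object)

toy_cat.fit(train)
learned = list(toy_cat.named_steps["onehot"].categories_[0])
encoded = toy_cat.transform(new).toarray()

print(learned)
print(encoded)
```

The missing entry becomes its own `missing` category, while the unseen value is encoded as a row of zeros instead of raising an error.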

In [110]:
'''
Column Transformer is used to apply the pipelines on the numerical as well as categorical features.
Column Transformer ensures smooth application of the pipelines on the numerical as well as categorical features.
'''

preprocessor=ColumnTransformer(
    transformers=[
        ('num',num_transformer,numerical_features),
        ('cat',cat_transformer,categorical_features)
    ]
)
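
As a rough illustration of how `ColumnTransformer` routes columns, the sketch below applies one transformer per column on a two-column stand-in DataFrame. The column names and the simplified transformers are chosen for illustration only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Stand-in data: one numerical and one categorical column.
toy_df = pd.DataFrame({
    "spend": [10.0, 20.0, 30.0],
    "type": ["image", "video", "image"],
})

# Each entry is (name, transformer, columns that transformer receives).
toy_pre = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["spend"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["type"]),
])

out = toy_pre.fit_transform(toy_df)
print(out.shape)
```

The result has one scaled column for `spend` plus one indicator column per category of `type`. Columns that are not listed in any transformer are dropped by default (`remainder='drop'`).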

# Step-3: Model Building

In [111]:
'''
A Pipeline is created to apply the machine learning model.
In the pipeline, the preprocessor is applied first.
After the preprocessor, SelectKBest selects the best features using Mutual Information Regression.
After SelectKBest, RandomForestRegressor is applied, which is the machine learning model.

In the regressor step we can apply different machine learning models such as Linear Regression, Decision Tree Regressor and Random Forest Regressor,
compare their performance and then select the best one among them.

I first tried Linear Regression, then Decision Tree Regressor and then Random Forest Regressor, and compared their performance.
I selected the Random Forest Regressor as it gives the best performance among them.
'''

model=Pipeline(steps=[
    ('preprocessor',preprocessor),
     ("selector",SelectKBest(mutual_info_regression)),
    ('regressor',RandomForestRegressor())
])
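
The comparison between the three regressors is described in the note above but not shown. A minimal sketch of one way to run such a comparison with cross-validation follows; it uses synthetic data from `make_regression` rather than the assignment dataset, so the ranking it prints need not match the one reported in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; the real comparison would use the preprocessed features.
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

candidates = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=0),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Mean R^2 over 5 folds for each candidate model.
scores = {
    name: cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="r2").mean()
    for name, estimator in candidates.items()
}

for name, score in scores.items():
    print(f"{name}: mean R^2 = {score:.3f}")
```

In the notebook, each candidate could instead be placed in the `regressor` step of the pipeline so that preprocessing and feature selection are refitted inside every fold.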

**Visualising model**

In [112]:
model

**Creating Input(X) and Output(y) Features**

In [113]:
'''
X is the input features and y is the output feature.
X contains all the features except conversion and y contains conversion.
y is the output feature which we have to predict.
'''

# Reason for choosing conversion as output feature
'''
I have chosen Conversion as the output feature because, first of all, it is a PERFORMANCE METRIC which is used to measure the performance of digital marketing campaigns.

Conversion in digital marketing indicates the number of people who have completed a desired action such as signing up for a newsletter or making a purchase or completing 
a registration or making a transaction for a product or service and many more, which is the ultimate goal of any digital marketing campaign.
'''

X,y=df.drop('conversion',axis=1),df['conversion']

In [114]:
'''
Input features(X) and output feature(y) are split into training and testing data.
Training data is used to train the machine learning model.
Testing data is used to test the machine learning model.
'''

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)

In [115]:
'''
The machine learning model is trained using the training data.

'''
model.fit(X_train,y_train)
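
`GridSearchCV` is imported earlier but not used, and `SelectKBest` keeps its default of `k=10` features. The sketch below shows how the number of selected features could be tuned with a grid search; it runs on synthetic data, and the grid values are arbitrary assumptions, not settings taken from this assignment.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data with 12 features, 4 of them informative.
X_demo, y_demo = make_regression(
    n_samples=150, n_features=12, n_informative=4, noise=5.0, random_state=0
)

demo_model = Pipeline(steps=[
    ("selector", SelectKBest(mutual_info_regression)),
    ("regressor", RandomForestRegressor(n_estimators=30, random_state=0)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {"selector__k": [4, 8, "all"]}

search = GridSearchCV(demo_model, param_grid, cv=3, scoring="r2")
search.fit(X_demo, y_demo)
print(search.best_params_)
```

The same `step__parameter` naming would apply to the full pipeline above, for example `regressor__n_estimators` for the random forest.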

# Step-4: Model  Evaluation

In [116]:
'''
The machine learning model is evaluated on both the training data and the testing data.
Comparing the two sets of scores helps in understanding how well the model generalizes to unseen data.
'''

y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculating Mean Squared Error (MSE), the average squared difference between the predicted and the actual values (lower is better).
train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)

# Calculating the R-squared (R^2) score, the proportion of variance in the output explained by the model (1.0 is a perfect fit).
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)

print(f"Training Mean Squared Error (MSE): {train_mse}")
print(f"Testing Mean Squared Error (MSE): {test_mse}")
print(f"Training R-squared (R^2) Score: {train_r2}")
print(f"Testing R-squared (R^2) Score: {test_r2}")

Training Mean Squared Error (MSE): 237.7629684460261
Testing Mean Squared Error (MSE): 875.7251037914692
Training R-squared (R^2) Score: 0.9353196023599001
Testing R-squared (R^2) Score: 0.7843409615934862
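
MSE is expressed in squared units of the target, so taking its square root (RMSE) gives an error in the same units as `conversion`. The sketch below only reuses the four numbers printed above; because `RandomForestRegressor` was created without a `random_state`, a rerun of the notebook may print slightly different values.

```python
import math

# Scores copied from the output above.
train_mse, test_mse = 237.7629684460261, 875.7251037914692
train_r2, test_r2 = 0.9353196023599001, 0.7843409615934862

# RMSE is in the same units as the target variable.
train_rmse = math.sqrt(train_mse)
test_rmse = math.sqrt(test_mse)

# Difference between training and testing R^2.
r2_gap = train_r2 - test_r2

print(f"train RMSE = {train_rmse:.2f}, test RMSE = {test_rmse:.2f}, R^2 gap = {r2_gap:.3f}")
```

The noticeably better training scores suggest some overfitting, which is common for random forests evaluated on their own training data.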
