In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

**Data**

The data used in this project is from an anonymous organisation’s social media ad campaign. The data file can be downloaded from here. The file conversion_data.csv contains 1143 observations in 11 variables. Below are the descriptions of the variables.

1.) ad_id: a unique ID for each ad.

2.) xyz_campaign_id: an ID associated with each ad campaign of XYZ company.

3.) fb_campaign_id: an ID associated with how Facebook tracks each campaign.

4.) age: age of the person to whom the ad is shown.

5.) gender: gender of the person to whom the ad is shown.

6.) interest: a code specifying the category to which the person’s interest belongs (interests are as mentioned in the person’s Facebook public profile).

7.) Impressions: the number of times the ad was shown.

8.) Clicks: number of clicks on that ad.

9.) Spent: amount paid by company XYZ to Facebook to show that ad.

10.) Total_Conversion: total number of people who enquired about the product after seeing the ad.

11.) Approved_Conversion: total number of people who bought the product after seeing the ad.

**Objective**

Explore features to determine optimal customer segments and next steps for Facebook ad targeting.

# Loading Data and data cleaning

In [None]:
df = pd.read_csv('/kaggle/input/clicks-conversion-tracking/KAG_conversion_data.csv')

Rename some of the features and values to simplify the workflow.

In [None]:
df.rename(columns={'xyz_campaign_id':'xyzCampId', 'fb_campaign_id':'fbCampId','Total_Conversion':'conv','Approved_Conversion':'appConv'}, inplace=True)
df['xyzCampId'].unique()
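# Assign the result back; an in-place replace on a selected column may not update df in newer pandas
df['xyzCampId'] = df['xyzCampId'].replace({916:'campA', 936:'campB', 1178:'campC'})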

# Exploratory Analysis

In [None]:
df.head()

Check for missing data

In [None]:
df.isnull().sum()

In [None]:
df.describe()

We can see that impressions are widely dispersed, with a standard deviation of 312762.2, indicating that some ads receive much more exposure than others.

In [None]:
df1 = pd.get_dummies(df, columns = ['xyzCampId', 'age', 'gender'])

**Correlation Matrix**

The ad ID, Facebook campaign ID and interest columns are excluded because their numeric values are arbitrary codes, so correlations with them are not meaningful.

In [None]:
plt.figure(figsize=(16,5))
x=sns.heatmap(df1[df1.columns.difference(['ad_id','fbCampId','interest'], sort=False)].corr(),annot=True ,fmt=".2f", cmap="coolwarm")

Looking at the correlation matrix, we can see that both total and approved conversions are positively correlated with the 30-34 age group: these customers are more likely to both inquire about and buy the product than other age groups.

Meanwhile, clicks are positively correlated with the 40-44 and 45-49 age groups. This indicates older customers are more likely to click on the ad; however, conversions are negatively correlated with these age groups, suggesting they inquire about the product and complete purchases at a lower rate than younger age groups.

Additionally, females are much more likely to click on the advertisement and inquire about the product; after inquiry, however, males seem to purchase the product more.

Finally, looking at the campaigns, both Campaign A and B were targeted more towards younger customers, as they are positively correlated with ages 30-34, while Campaign C was targeted at older customers, as it is negatively correlated with ages 30-34.

As a side note, the campaign correlations with non-demographic variables are due to the large difference in impressions between campaigns compared to the actual number of individual ads, as will be shown below.

# Campaign deep dive

In [None]:
sns.countplot(x ='xyzCampId', data = df).set_title('Count of Individual Ads')
plt.show() 

campSum = df.groupby(by=['xyzCampId']).sum()
plt.bar(campSum.index, campSum["Impressions"])
plt.ylabel("Impressions")
plt.title("Campaign vs Impressions")
plt.show()

**Need for additional features to properly measure campaign efficacy**

Because both the number of individual ads and the impressions are unevenly distributed across campaigns, additional KPI features are needed: click-through rate (CTR), cost per click (CPC), conversion rate and customer acquisition cost (CAC).

An assumption made for CAC is that each ad links to the exact same landing page, so customers have the same purchasing experience.

Additionally, return on ad spend would give a more holistic analysis. However, we have no information on either customer value or the price of the advertised product, so we will assume the company is advertising widgets at the exact same price in each campaign, and that the different demographic customer groups (e.g. female, age 30-34) all have the same lifetime value. Although not ideal, this allows for an easier analysis.
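The four KPIs named above reduce to simple ratios. A minimal sketch, assuming the column meanings described in the Data section; the function names (`ctr`, `cpc`, `conversion_rate`, `cac`) and the sample totals are hypothetical and not part of the original notebook:

```python
def ctr(clicks, impressions):
    """Click-through rate as a percentage of impressions."""
    return clicks / impressions * 100


def cpc(spent, clicks):
    """Cost per click: spend divided by clicks."""
    return spent / clicks


def conversion_rate(approved_conversions, clicks):
    """Approved conversions as a percentage of clicks."""
    return approved_conversions / clicks * 100


def cac(spent, approved_conversions):
    """Customer acquisition cost: spend per approved conversion."""
    return spent / approved_conversions


# Made-up totals for a single illustrative segment (not taken from the dataset).
impressions, clicks, spent, approved = 100_000, 20, 30.0, 4

kpis = {
    "CTR (%)": ctr(clicks, impressions),
    "CPC ($)": cpc(spent, clicks),
    "Conversion rate (%)": conversion_rate(approved, clicks),
    "CAC ($)": cac(spent, approved),
}
for name, value in kpis.items():
    print(f"{name}: {value:.2f}")
```

Because the helpers use only arithmetic operators, they should also accept pandas Series, for example `ctr(campSum['Clicks'], campSum['Impressions'])` on a grouped frame.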

In [None]:
plt.bar(campSum.index, campSum["Spent"])
plt.ylabel("Spent")
plt.title("Campaign vs Spent")
plt.show()

campSum = df.groupby(by=['xyzCampId']).sum()
plt.bar(campSum.index, campSum["appConv"])
plt.ylabel("Approved_Conversion")
plt.title("Campaign vs Approved Conversion")
plt.show()

No surprises here: Campaign C has the highest ad spend and the most approved conversions, which is consistent with its high impressions in the previous chart.

In [None]:
campCTR = campSum['Clicks']/campSum['Impressions']*100
# Creating our bar plot
plt.bar(campCTR.index, campCTR)
plt.ylabel("CTR")
plt.title("Campaign vs. CTR")
for x,y in zip(campCTR.index, campCTR):

    label = "{:.4f}%".format(y)

    plt.annotate(label, # this is the text
                 (x,y), # this is the point to label
                 textcoords="offset points", # how to position the text
                 xytext=(0,2), # distance from text to points (x,y)
                 ha='center') # horizontal alignment can be left, right or center
plt.show()

Here we see both Campaign A and B have higher CTRs, indicating they have either a more effective message or better targeting than Campaign C. We will investigate this further.

In [None]:
campConv = campSum['appConv']/campSum['Clicks']*100
# Creating our bar plot
plt.bar(campConv.index, campConv)
plt.ylabel("Conversion Rate")
plt.title("Campaign vs. Conversion Rate")
for x,y in zip(campConv.index, campConv):

    label = "{:.2f}%".format(y)

    plt.annotate(label, # this is the text
                 (x,y), # this is the point to label
                 textcoords="offset points", # how to position the text
                 xytext=(0,2), # distance from text to points (x,y)
                 ha='center') # horizontal alignment can be left, right or center
plt.show()

Campaigns A and B have much higher conversion rates than Campaign C. However, the overall conversion rate for the company is only 2.83%, because Campaign C accounts for such a large proportion of total approved conversions that its low rate dominates the total.

In [None]:
overallConvRate = round(df['appConv'].sum()/df['Clicks'].sum()*100, 2)
print(overallConvRate)
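The gap between the per-campaign rates and the overall figure arises because the overall rate pools clicks and conversions, so each campaign is effectively weighted by its clicks. A minimal sketch with made-up counts (the campaign labels and numbers are hypothetical, not the dataset's values):

```python
# Made-up (clicks, approved conversions) per campaign; not the dataset's values.
campaigns = {
    "small": (100, 20),      # 20% conversion rate, few clicks
    "large": (10_000, 200),  # 2% conversion rate, many clicks
}

rates = {name: conv / clicks * 100 for name, (clicks, conv) in campaigns.items()}

total_clicks = sum(clicks for clicks, _ in campaigns.values())
total_conv = sum(conv for _, conv in campaigns.values())

# Pooled rate: total conversions over total clicks, as in the cell above.
pooled_rate = total_conv / total_clicks * 100
# Simple average of the per-campaign rates, ignoring campaign size.
unweighted_mean = sum(rates.values()) / len(rates)

print(f"pooled: {pooled_rate:.2f}%  unweighted mean: {unweighted_mean:.2f}%")
```

The pooled figure sits close to the rate of the campaign with the most clicks, which is the same effect Campaign C has on the overall number.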

In [None]:
campCAC = campSum['Spent']/campSum['appConv']
# Creating our bar plot
plt.bar(campCAC.index, campCAC)
plt.ylabel("CAC")
plt.title("Campaign vs. CAC")
for x,y in zip(campCAC.index, campCAC):

    label = "${:.2f}".format(y)

    plt.annotate(label, # this is the text
                 (x,y), # this is the point to label
                 textcoords="offset points", # how to position the text
                 xytext=(0,2), # distance from text to points (x,y)
                 ha='center') # horizontal alignment can be left, right or center
plt.show()

Here we see Campaign C has a CAC more than 4 times that of Campaign B, the next highest. Together with the CTR above, this suggests Campaign C's messaging and targeting are off: its lower CTR means the message resonates less with customers, and its higher CAC likely means the customers who do click are not actually interested in purchasing the product. Another explanation could be that the cost per click for Campaign C is higher than for the other two campaigns, which we will now examine.

In [None]:
campCPC = campSum['Spent']/campSum['Clicks']
# Creating our bar plot
plt.bar(campCPC.index, campCPC)
plt.ylabel("CPC")
plt.title("Campaign vs. CPC")
for x,y in zip(campCPC.index, campCPC):

    label = "${:.2f}".format(y)

    plt.annotate(label, # this is the text
                 (x,y), # this is the point to label
                 textcoords="offset points", # how to position the text
                 xytext=(0,2), # distance from text to points (x,y)
                 ha='center') # horizontal alignment can be left, right or center
plt.show()

Campaign C has the highest CPC; however, the difference is too small to account for its much larger CAC.
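Because CAC is spend per approved conversion, it can be decomposed as CPC divided by the conversion rate, which makes it possible to check how much of a CAC gap comes from each factor. A minimal sketch with made-up numbers (the function name `cac_from_parts` and both sample campaigns are hypothetical, not values from the dataset):

```python
def cac_from_parts(cpc, conversion_rate):
    """CAC implied by cost per click and conversion rate (a fraction, not a %)."""
    return cpc / conversion_rate


# Hypothetical campaigns: (spend, clicks, approved conversions).
camp_x = (150.0, 100, 20)
camp_y = (2250.0, 1000, 20)

for spent, clicks, conv in (camp_x, camp_y):
    direct = spent / conv                                      # Spent / conversions
    via_parts = cac_from_parts(spent / clicks, conv / clicks)  # CPC / rate
    print(f"CPC={spent / clicks:.2f}  rate={conv / clicks:.1%}  "
          f"CAC={direct:.2f} (from parts: {via_parts:.2f})")
```

In this illustration the second campaign's CPC is 1.5 times higher but its conversion rate is 10 times lower, so the conversion rate accounts for most of the 15-fold CAC gap.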

***Gender***

In [None]:
genSum = df.groupby(by=['gender']).sum()
total_impressions = df['Impressions'].sum()
#dropping both ad_id, fbCampId as they are not relevant to this portion
genSum.drop(['ad_id', 'fbCampId'], axis=1, inplace=True)

In [None]:
gen_impression_prop = {'Female': genSum['Impressions']['F']/total_impressions, 'Male': genSum['Impressions']['M']/total_impressions} 
plt.bar(gen_impression_prop.keys(), gen_impression_prop.values())
plt.ylabel("Impressions")
plt.title("Gender vs. Impressions Proportion")
for x,y in zip(gen_impression_prop.keys(), gen_impression_prop.values()):

    label = "{:.3f}".format(y)

    plt.annotate(label, # this is the text
                 (x,y), # this is the point to label
                 textcoords="offset points", # how to position the text
                 xytext=(0,2), # distance from text to points (x,y)
                 ha='center') # horizontal alignment can be left, right or center
plt.show()

Here we see that, of all impressions across campaigns, 54% were shown to women and 46% to men.

In [None]:
plt.bar(genSum.index, genSum['Clicks'])
plt.ylabel("Clicks")
plt.title("Gender adjusted vs. Clicks")
for x,y in zip(genSum.index, genSum['Clicks']):

    label = "{:.0f}".format(y)

    plt.annotate(label, # this is the text
                 (x,y), # this is the point to label
                 textcoords="offset points", # how to position the text
                 xytext=(0,2), # distance from text to points (x,y)
                 ha='center') # horizontal alignment can be left, right or center
plt.show()


In [None]:
genCTR = genSum['Clicks']/genSum['Impressions']*100
plt.bar(genCTR.index, genCTR)
plt.ylabel("CTR")
plt.title("Gender vs. CTR")
for x,y in zip(genCTR.index, genCTR):

    label = "{:.4f} %".format(y)

    plt.annotate(label, # this is the text
                 (x,y), # this is the point to label
                 textcoords="offset points", # how to position the text
                 xytext=(0,2), # distance from text to points (x,y)
                 ha='center') # horizontal alignment can be left, right or center
plt.show()

Females have a higher CTR than Males

In [None]:
genConvRate = genSum['appConv']/genSum['Clicks']*100
plt.bar(genConvRate.index, genConvRate)
plt.ylabel("Conversion Rate")
plt.title("Gender vs. Conversion Rate")
for x,y in zip(genConvRate.index, genConvRate):

    label = "{:.2f}%".format(y)

    plt.annotate(label, # this is the text
                 (x,y), # this is the point to label
                 textcoords="offset points", # how to position the text
                 xytext=(0,2), # distance from text to points (x,y)
                 ha='center') # horizontal alignment can be left, right or center
plt.show()

Although females have a higher CTR, males are almost twice as likely to actually complete a purchase once on the site.

In [None]:
genCAC = genSum['Spent']/genSum['appConv']
plt.bar(genCAC.index, genCAC)
plt.ylabel("CAC")
plt.title("Gender vs. CAC")
for x,y in zip(genCAC.index, genCAC):

    label = "${:.2f}".format(y)

    plt.annotate(label, # this is the text
                 (x,y), # this is the point to label
                 textcoords="offset points", # how to position the text
                 xytext=(0,2), # distance from text to points (x,y)
                 ha='center') # horizontal alignment can be left, right or center
plt.show()


As such, given both the CTR and the conversion rate, the male CAC is only about 2/3 that of females.

**Gender By Campaign**

In [None]:
sns.set(style="whitegrid")
# Plot directly from df; the seaborn "tips" sample dataset is unrelated to this data
g = sns.barplot(x="xyzCampId", y="Impressions", hue="gender", data=df)
g.set_yscale('log')

Across all campaigns females are more exposed to the advertisements

In [None]:
sns.set(style="whitegrid")
# Per-ad CTR (%) by campaign and gender; the vectors are passed directly, so no data= argument is needed
sns.barplot(x=df["xyzCampId"], y=df["Clicks"]/df['Impressions']*100, hue=df["gender"])


Additionally, females have a higher CTR and CAC on all campaigns, along with a lower conversion rate, as shown below.

In [None]:
genCampSum = df.groupby(by = ['xyzCampId', 'gender']).sum()
genCampConv = genCampSum['appConv']/genCampSum['Clicks']
genCampSumCov = genCampSum.merge(genCampConv.rename('Conversion Rate'), left_index=True, right_index=True)

genCampSumCov['Conversion Rate'].unstack().plot(kind='bar').set_title('Conversion rate by Gender')


In [None]:
genCampSum = df.groupby(by = ['xyzCampId', 'gender']).sum()
genCampCAC = genCampSum['Spent']/genCampSum['appConv']
genCampSumCAC = genCampSum.merge(genCampCAC.rename('CAC'), left_index=True, right_index=True)

genCampSumCAC['CAC'].unstack().plot(kind='bar').set_title('CAC by Gender')


**Age**

In [None]:
ageSum = df.groupby(by=['age']).sum()
#dropping both ad_id, fbCampId as they are not relevant to this portion
ageSum.drop(['ad_id', 'fbCampId'], axis=1, inplace=True)

In [None]:
plt.bar(ageSum.index, ageSum['Impressions'])
plt.ylabel("Impressions")
plt.title("Age vs. Impressions")
for x,y in zip(ageSum.index, ageSum['Impressions']):

    label = "{:.0f}".format(y)

    plt.annotate(label, # this is the text
                 (x,y), # this is the point to label
                 textcoords="offset points", # how to position the text
                 xytext=(0,2), # distance from text to points (x,y)
                 ha='center') # horizontal alignment can be left, right or center
plt.show()

Age 30-34 has the highest impressions with 45-49 close behind

In [None]:
plt.bar(ageSum.index, ageSum['Clicks'])
plt.ylabel("Clicks")
plt.title("Age vs. Clicks")
for x,y in zip(ageSum.index, ageSum['Clicks']):

    label = "{:.0f}".format(y)

    plt.annotate(label, # this is the text
                 (x,y), # this is the point to label
                 textcoords="offset points", # how to position the text
                 xytext=(0,2), # distance from text to points (x,y)
                 ha='center') # horizontal alignment can be left, right or center
plt.show()

The clicks however show a different story with age 45-49 having by far the most clicks

In [None]:
ageCTR = ageSum['Clicks']/ageSum['Impressions']*100
plt.bar(ageCTR.index, ageCTR)
plt.ylabel("CTR")
plt.title("Age vs. CTR")
for x,y in zip(ageCTR.index, ageCTR):

    label = "{:.4f} %".format(y)

    plt.annotate(label, # this is the text
                 (x,y), # this is the point to label
                 textcoords="offset points", # how to position the text
                 xytext=(0,2), # distance from text to points (x,y)
                 ha='center') # horizontal alignment can be left, right or center
plt.show()

As such age 45-49 has the highest CTR with age 30-34 being the lowest

In [None]:
ageConvRate = ageSum['appConv']/ageSum['Clicks']*100
plt.bar(ageConvRate.index, ageConvRate)
plt.ylabel("Conversion Rate")
plt.title("Age vs. Conversion Rate")
for x,y in zip(ageConvRate.index, ageConvRate):

    label = "{:.2f}%".format(y)

    plt.annotate(label, # this is the text
                 (x,y), # this is the point to label
                 textcoords="offset points", # how to position the text
                 xytext=(0,2), # distance from text to points (x,y)
                 ha='center') # horizontal alignment can be left, right or center
plt.show()

However, looking at conversion rate, ages 30-34 convert at almost double the rate of any other group.

In [None]:
ageCPC = ageSum['Spent']/ageSum['Clicks']
# Creating our bar plot
plt.bar(ageCPC.index, ageCPC)
plt.ylabel("CPC")
plt.title("Age vs. CPC")
for x,y in zip(ageCPC.index, ageCPC):

    label = "${:.2f}".format(y)

    plt.annotate(label, # this is the text
                 (x,y), # this is the point to label
                 textcoords="offset points", # how to position the text
                 xytext=(0,2), # distance from text to points (x,y)
                 ha='center') # horizontal alignment can be left, right or center
plt.show()

CPC decreases as age increases

In [None]:
ageCAC = ageSum['Spent']/ageSum['appConv']
plt.bar(ageCAC.index, ageCAC)
plt.ylabel("CAC")
plt.title("Age vs. CAC")
for x,y in zip(ageCAC.index, ageCAC):

    label = "${:.2f}".format(y)

    plt.annotate(label, # this is the text
                 (x,y), # this is the point to label
                 textcoords="offset points", # how to position the text
                 xytext=(0,2), # distance from text to points (x,y)
                 ha='center') # horizontal alignment can be left, right or center
plt.show()

Since CTR increases with age and conversion rate decreases with age, it makes sense that ages 45-49 have the highest customer acquisition cost while ages 30-34 have the lowest.

**Campaign by Age**

In [None]:
sns.set(style="whitegrid")
# Plot directly from df; the seaborn "tips" sample dataset is unrelated to this data
g = sns.barplot(x="xyzCampId", y="Impressions", hue="age", data=df)


Campaign A is the most evenly distributed across age groups, while Campaigns B and C are targeted more towards ages 45-49.

In [None]:
sns.set(style="whitegrid")
# Per-ad CTR (%) by campaign and age; the vectors are passed directly, so no data= argument is needed
sns.barplot(x=df["xyzCampId"], y=df["Clicks"]/df['Impressions']*100, hue=df["age"])

Looking at the CTR, Campaign A has the highest value in three of the four age groups, the exception being 40-44.

In [None]:
ageCampSum = df.groupby(by = ['xyzCampId', 'age']).sum()
ageCampConv = ageCampSum['appConv']/ageCampSum['Clicks']
ageCampSumCov = ageCampSum.merge(ageCampConv.rename('Conversion Rate'), left_index=True, right_index=True)

ageCampSumCov['Conversion Rate'].unstack().plot(kind='bar').set_title('Conversion rate by Age')


Campaign A has the highest conversion rate for all age groups except 30-34, while Campaign C has the lowest.

In [None]:
ageCampSum = df.groupby(by = ['xyzCampId', 'age']).sum()
ageCampCAC = ageCampSum['Spent']/ageCampSum['appConv']
ageCampSumCAC = ageCampSum.merge(ageCampCAC.rename('CAC'), left_index=True, right_index=True)

ageCampSumCAC['CAC'].unstack().plot(kind='bar').set_title('CAC by Age')

We see a steady trend of CAC increasing with age in every campaign except Campaign A, where ages 35-44 have a lower CAC than the 30-34 age group. This could be due to a small sample size, which we will look at now.

In [None]:
ageCampSum = df.groupby(by = ['xyzCampId', 'age']).sum()
print(ageCampSum['Impressions'])
print(ageCampSum['Clicks'])

For this particular breakdown, Campaign A may have too few clicks to give an accurate picture.
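One way to gauge how much a rate based on few clicks can be trusted is a confidence interval for the proportion. The sketch below uses the Wilson score interval as an illustrative choice (it is not part of the original analysis), assumes conversions never exceed clicks, and uses made-up counts:

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (z=1.96)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical counts: the same 20% observed rate at two sample sizes.
for conversions, clicks in [(2, 10), (200, 1000)]:
    low, high = wilson_interval(conversions, clicks)
    print(f"{conversions}/{clicks}: {low:.1%} to {high:.1%}")
```

The interval for the 10-click segment is far wider than the one for the 1000-click segment, which is why differences between small segments should be read with caution.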

**Interest**

In [None]:
interestSum = df.groupby('interest').sum()
interestSum.reset_index(inplace=True)
# count plot on single categorical variable 
fig_dims = (20,6)
fig, ax = plt.subplots(figsize=fig_dims)
sns.barplot(x = 'interest', y ='Spent', data = interestSum)
# Show the plot 
plt.show() 

In [None]:
interestSum = df.groupby('interest').sum()
interestCTR = interestSum['Clicks']/interestSum['Impressions']*100
interestSumCTR = interestSum.merge(interestCTR.rename('CTR'), left_index=True, right_index=True)
CTRmean = [np.mean(interestSumCTR['CTR'])]*len(interestSumCTR.index)

# count plot on single categorical variable 
fig_dims = (20,6)
fig, ax = plt.subplots(figsize=fig_dims)
mean_line = ax.plot(interestSumCTR.index,CTRmean, label='Mean', linestyle='--')
sns.barplot(x =interestSumCTR.index, y ='CTR', data = interestSumCTR) 
# Show the plot 
plt.show() 

In [None]:
interestSum = df.groupby('interest').sum()
interestCAC = interestSum['Spent']/interestSum['appConv']
interestSumCAC = interestSum.merge(interestCAC.rename('CAC'), left_index=True, right_index=True)
CACmean = [np.mean(interestSumCAC['CAC'])]*len(interestSumCAC.index)

# count plot on single categorical variable 
fig_dims = (20,6)
fig, ax = plt.subplots(figsize=fig_dims)
mean_line = ax.plot(interestSumCAC.index,CACmean, label='Mean', linestyle='--')
sns.barplot(x =interestSumCAC.index, y ='CAC', data = interestSumCAC) 
# Show the plot 
plt.show() 


Looking at the charts, we can see some individual interests perform better than average. Let's examine further by pulling the 10 best-performing interests based on CAC.

In [None]:
interestKeyValues = interestSumCAC.sort_values(by = 'CAC', ascending=True).drop(columns=['ad_id', 'fbCampId']).head(n = 10)
interestKeyValues.reset_index(inplace=True)
print(interestKeyValues)

Let's investigate these top interest categories more closely by segmenting by gender.

In [None]:
interestGenSum = df.groupby(by = ['interest', 'gender']).sum()
interestGenCAC = interestGenSum['Spent']/interestGenSum['appConv']
interestGenSumCAC = interestGenSum.merge(interestGenCAC.rename('CAC'), left_index=True, right_index=True)
interestGenSumCAC.reset_index(inplace=True)

interestGenKeyValues = interestGenSumCAC.sort_values(by = 'CAC', ascending=True).head(n = 10)
interestGenKeyValues = interestGenKeyValues.drop(columns=['ad_id', 'fbCampId', 'conv'])
print(interestGenKeyValues)

Looking at the data above, we can see males with interest 101 have a drastically lower CAC than any other interest. However, this could be due to chance, given the small number of clicks. Another interesting insight is that interest 31 is effective for both males and females. Let's examine interest CAC for females further.

In [None]:
interestGenSum = df.groupby(by = ['interest', 'gender']).sum()
interestGenCAC = interestGenSum['Spent']/interestGenSum['appConv']
interestGenSumCAC = interestGenSum.merge(interestGenCAC.rename('CAC'), left_index=True, right_index=True)
interestGenSumCAC.reset_index(inplace=True)


interestFSumCAC = interestGenSumCAC[interestGenSumCAC.gender != 'M']

# drop() returns a new frame; with inplace=True it returns None and acts on a filtered slice
interestFSumCAC = interestFSumCAC.drop(columns=['ad_id', 'fbCampId', 'conv'])
interestFKeyValues = interestFSumCAC.sort_values(by = 'CAC', ascending=True).head(n = 10)
print(interestFKeyValues)

We find interest 31 is a bit of an outlier as it is $10 less than any other CAC. 

**Age and interest** 

In [None]:
interestAgeSum = df.groupby(by = ['interest', 'age']).sum()
interestAgeCAC = interestAgeSum['Spent']/interestAgeSum['appConv']
interestAgeSumCAC = interestAgeSum.merge(interestAgeCAC.rename('CAC'), left_index=True, right_index=True)

interestAgeKeyValues = interestAgeSumCAC.sort_values(by = 'CAC', ascending=True)
interestAgeKeyValues = interestAgeKeyValues[~(interestAgeKeyValues['Clicks'] <= 10)]  
interestAgeKeyValues = interestAgeKeyValues.drop(columns=['ad_id', 'fbCampId', 'conv'])
interestAgeKeyValues.reset_index(inplace=True)
print(interestAgeKeyValues.head(n =20))

We use 20 rows here, compared to 10 for gender, because an age-interest grouping produces more rows. All rows with 10 or fewer clicks are removed because they are more likely to be affected by chance. Interests 102 and 31 both have a much lower CAC and would be worth investing in to determine whether the CAC is sustainable with higher exposure.

# Random Forest Regression Analysis to determine factors influencing approved conversions

In [None]:
df = pd.read_csv('/kaggle/input/clicks-conversion-tracking/KAG_conversion_data.csv')
df.rename(columns={'xyz_campaign_id':'xyzCampId', 'fb_campaign_id':'fbCampId','Total_Conversion':'conv','Approved_Conversion':'appConv'}, inplace=True)
df['xyzCampId'].unique()
df['xyzCampId'] = df['xyzCampId'].replace({916:'campA', 936:'campB', 1178:'campC'})
# drop() returns a new frame, so df2 is independent of df (df2 = df would only alias it)
df2 = df.drop(columns=['ad_id'])
len(df2)

Drop the ad ID, as it is an arbitrary identifier and irrelevant to the model.

In [None]:
plt.figure(figsize=(16,5))
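# Restrict to numeric columns: age, gender and xyzCampId are strings here and cannot be correlated
x=sns.heatmap(df[df.columns.difference(['ad_id','fbCampId','interest'], sort=False)].select_dtypes('number').corr(),annot=True ,fmt=".2f", cmap="coolwarm")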

Impressions, Clicks and Spent are all extremely correlated, so we must pick one of the three for the analysis.
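One common way to quantify this kind of overlap is the variance inflation factor (VIF). In the special case of exactly two predictors (with an intercept), it reduces to 1 / (1 - r²), where r is their Pearson correlation. A minimal sketch with made-up values; the helper names are hypothetical and the numbers are not from the dataset:

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def vif_two_predictors(xs, ys):
    """VIF of either predictor when the model contains exactly these two."""
    r = pearson_r(xs, ys)
    return 1 / (1 - r**2)


# Made-up values: clicks rise almost in lockstep with impressions,
# while the third series is only weakly related to impressions.
impressions = [1000, 2000, 3000, 4000, 5000]
clicks = [1, 2, 3, 5, 5]
weakly_related = [3, 1, 4, 1, 5]

print(f"impressions vs clicks: VIF = {vif_two_predictors(impressions, clicks):.2f}")
print(f"impressions vs weakly related: VIF = {vif_two_predictors(impressions, weakly_related):.2f}")
```

A frequently quoted rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity, which is the motivation for keeping only one of the three exposure variables.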

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor 

df2 = pd.get_dummies(df2, columns=['age', 'gender', 'interest'])
df2.drop(columns=['gender_M','age_45-49', 'xyzCampId','Spent', 'Clicks','conv','fbCampId','appConv', 'interest_107'], inplace=True)

# the independent variables set 
X = df2

# VIF dataframe 
vif_data = pd.DataFrame() 
vif_data["feature"] = X.columns 

# calculating VIF for each feature 
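# Cast to float first: dummy columns may be boolean, and VIF needs a numeric array
X_values = X.values.astype(float)
vif_data["VIF"] = [variance_inflation_factor(X_values, i)
                   for i in range(X_values.shape[1])]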

print(vif_data)


All VIF values are acceptable after dropping the features listed above.

In [None]:
y = df['conv']

In [None]:
from sklearn.feature_selection import SelectKBest, f_regression
# Score features with SelectKBest (f_regression); the 5 highest-scoring features are selected below
bestfeatures = SelectKBest(score_func=f_regression, k=40)
fit = bestfeatures.fit(X,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
#concat two dataframes for better visualization 
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score']  #naming the dataframe columns
print(featureScores.nlargest(33,'Score')) 

Xselect = featureScores.nlargest(5,'Score')['Specs'].to_list()
X = X[Xselect]

In [None]:
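import statsmodels.api as sm

# Cast to float so boolean dummy columns are accepted by OLS
X2 = sm.add_constant(X.astype(float))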
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())

The OLS summary above shows which of the selected factors are statistically significant in addition to having the highest feature scores.

In [None]:
# Keep y one-dimensional, the shape scikit-learn regressors expect for a single target
y = np.array(y)
# Saving feature names for later use
feature_list = list(X.columns)
# Convert to numpy array
X = np.array(X)

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

print('Training Features Shape:', X_train.shape)
print('Training Labels Shape:', y_train.shape)
print('Testing Features Shape:', X_test.shape)
print('Testing Labels Shape:', y_test.shape)

In [None]:
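from sklearn.preprocessing import StandardScaler
# Scale the split arrays, since X_train and X_test are what the model sees.
# The scaler is fit on the training split only to avoid leaking test information.
sc_x = StandardScaler()
X_train = sc_x.fit_transform(X_train)
X_test = sc_x.transform(X_test)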

In [None]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators = 1000)
rf.fit(X_train, y_train)

In [None]:
y_pred=rf.predict(X_test)
y_pred=np.round(y_pred)

In [None]:
from sklearn.metrics import r2_score,mean_squared_error,mean_absolute_error
mae=mean_absolute_error(y_test, y_pred)
mse=mean_squared_error(y_test, y_pred)
rmse=np.sqrt(mse)
# Store the result under a separate name so the r2_score function is not overwritten
r2=r2_score(y_test, y_pred)

Mean Absolute Error

In [None]:
mae

In [None]:
r2

An R² score of 0.7275 indicates that the model explains about 72.75% of the variance in the target on the test set.
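For reference, R² compares the model's squared error with that of always predicting the mean of the target. A minimal sketch of the calculation with made-up values (the `r_squared` helper is hypothetical; the notebook itself uses scikit-learn's `r2_score`):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model error
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # baseline error
    return 1 - ss_res / ss_tot


# Made-up targets and predictions, not taken from the dataset.
actual = [1, 2, 3, 4, 5]
predicted = [1, 2, 3, 4, 4]
baseline = [3, 3, 3, 3, 3]  # always predicting the mean

print(r_squared(actual, predicted))
print(r_squared(actual, baseline))
```

A model that only ever predicts the mean scores 0, and a perfect model scores 1, so the value is best read as the share of variance explained.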

In [None]:
# Get numerical feature importances
importances = list(rf.feature_importances_)
# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
# Print out the feature and importances 
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))

# Conclusion

**Campaigns**
1. Campaign C had the highest number of individual ads, impressions, ad spend and total approved conversions
2. Campaign B had the highest CTR, slightly ahead of Campaign A and over 25% higher than Campaign C
3. Campaign A had by far the highest conversion rate, beating Campaign B by 2X and Campaign C by 10X
4. Campaign A had the lowest CAC, almost 10X lower than Campaign C and 3X lower than Campaign B
5. Campaign A had the lowest CPC

**Gender**
1. Females had 54% of all impressions while males had 46%
2. Females had a CTR ~20% higher than males, while males had almost double the conversion rate of females
3. Female acquisition cost was ~$30 higher than that of males
4. Both impressions and CTR across campaigns look similar to the non-campaign-segmented equivalents
5. Conversion rates and CAC for Campaigns A and C are similar to the non-campaign-segmented equivalents; in Campaign B, however, male conversion rates are 5X those of females

**Age**
1. Ages 30-34 and 45-49 have the highest impressions
2. CTR increases relatively steadily as we move up age groups, with ages 45-49 having a 56% higher CTR than ages 30-34
3. The conversion rate, on the other hand, decreases steadily as we move up age groups, with ages 30-34 having a 247% higher conversion rate than ages 45-49
4. Overall, CAC increases with age, from $30 at ages 30-34 to $100 at ages 45-49. Interestingly, CPC falls as age increases, indicating the higher CAC is simply due to poor conversion rates in the higher age groups
5. CTR and CAC across campaigns are similar to the non-segmented equivalents; however, conversion rates are much higher across all age groups for Campaign A and across the three youngest age groups for Campaign B
6. In the feature importances, the age 30-34 range is 2.5X as important as the next single-factor feature

**Interests**
1. Interests 31 & 36 have an extremely low CAC compared to others
2. Interests 101, 104 & 112 have the largest impact on total conversions, while spend on these interest categories is much lower than on others
3. Interests 31 & 102 combined with the 30-34 age group have a CAC of ~$4.75, almost half that of the next-lowest interest-age segment
4. Interest 101 has the lowest CAC when targeting males, at ~$5.10, almost 1/3 of the next-lowest CAC for an interest-gender segment

# Business insights and Next Steps

**Business Insights**

1. Optimize target demographic
    * Target the age group of 30-34, as it had a higher conversion rate and a lower CAC than any other age segment
    * Males should be the main gender target, as they had a higher conversion rate and a lower CAC than females
    * Of the four largest interest spends, two (interests 10 & 27) are inefficient because their CAC is higher than average
    * (This depends on lifetime value being similar across customer segments)
    
2. Optimize campaigns
    * Currently the largest spend is going to the worst-performing campaign
    * The best campaign by CAC across age groups and gender is Campaign A
    
**Next Steps**
1. Gradually increase spend in Campaign A
    * Due to its small sample size compared to the other campaigns, the findings could be a result of chance; increasing spend will help determine whether that is the case
2. Increase ad spend on high-performing interest categories and decrease ad spend on low-performing ones
3. Shift marketing spend to target a younger audience
    * (Depends on lifetime value being similar across customer segments)
    
    