# Working with Dates, Evaluating Correlations and Developing a Basic Multiple Linear Regression Model on Facebook Advertising Data
## By Tyler Chambers
### Created for APRD6432: Digital Advertising

## Project Summary

In this code, we take a set of Facebook advertising data and glean a few insights from it. First, we add a new column to our dataframe to evaluate 'Cost per Impression'. Next, we convert text into dates so we can look at the day of the week's effect on our new 'Cost per Impression' metric. After that, we examine several variables' relationships to 'Amount Spent' by evaluating their correlations. Finally, we build a very basic multiple linear regression model to evaluate the impact of 'Reach' and 'Frequency' on 'Unique Clicks'.

## Setting up our Environment and Creating our 'Cost per Impression' Column

In [24]:
#importing packages for later use
import pandas
from datetime import datetime
import calendar
import statsmodels.api as sm 
#Reading in the csv file and saving it as tpf
Filename = ('Travel Pony Facebook.csv')
tpf = pandas.read_csv(Filename)

#Creating the new 'Cost per Impression' column
#Dividing one column by another works row by row, so no loop is needed
tpf['Cost per Impression'] = tpf['Amount Spent (USD)']/tpf['Impressions']
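
One thing the division above does not guard against is an ad with zero impressions, which would produce an infinite or undefined cost. Here is a minimal sketch of one way to handle that, using a small made-up frame. The values, and the choice to treat such rows as missing, are assumptions and not part of the original data.

```python
import pandas as pd

# Hypothetical rows standing in for the real CSV; the last ad has no impressions
ads = pd.DataFrame({
    "Amount Spent (USD)": [10.0, 4.5, 1.0],
    "Impressions": [2500, 1500, 0],
})

# Keep impressions only where they are positive; other rows become NaN,
# so the division yields NaN instead of inf
impressions = ads["Impressions"].where(ads["Impressions"] > 0)
ads["Cost per Impression"] = ads["Amount Spent (USD)"] / impressions

print(ads)
```

Rows left as NaN are skipped by default in later aggregations such as `mean()`, so they do not distort the per-day averages.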

## Reformatting our 'Text Dates' into Real 'Date Variables' and Looking at the Day of the Week's Impact on 'Cost per Impression'

In [25]:
#Converting the text in Start Date into real datetime values
#The format string expects month/day/two-digit year
dt = pandas.to_datetime(tpf["Start Date"], format="%m/%d/%y")

#Saving the integer day of the week (Monday is 0, Sunday is 6) in a new column
tpf["Numeric DOW"] = dt.dt.weekday

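#Made a second data table where all the numeric data is averaged by day of the week, allowing for analysis
#numeric_only=True skips text columns such as Start Date, which newer versions of pandas refuse to average
Daycomparison = tpf.groupby(["Numeric DOW"]).mean(numeric_only=True)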

#Next we are going to grab the maximum and minimum values for 'Cost per Impression' grouped by the day of the week
print('Most expensive day to advertise based on cost per impression')
print('------------------------------------------------------------')
print(calendar.day_name[Daycomparison['Cost per Impression'].idxmax()])
print(Daycomparison['Cost per Impression'].max())
print('------------------------------------------------------------')
print('Least expensive day to advertise based on cost per impression')
print('-------------------------------------------------------------')
print(calendar.day_name[Daycomparison['Cost per Impression'].idxmin()])
print(Daycomparison['Cost per Impression'].min())

Most expensive day to advertise based on cost per impression
------------------------------------------------------------
Friday
0.004096890719487211
------------------------------------------------------------
Least expensive day to advertise based on cost per impression
-------------------------------------------------------------
Saturday
0.0026286969333697923
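
The weekday lookup and the per-day comparison can also be done without explicit loops. The following is a sketch on a few made-up rows (the dates and costs are illustrative and not taken from the Travel Pony file), assuming the same `%m/%d/%y` text format.

```python
import calendar
import pandas as pd

# Hypothetical sample; the real data comes from the CSV
sample = pd.DataFrame({
    "Start Date": ["10/05/18", "10/06/18", "10/12/18", "10/13/18"],
    "Cost per Impression": [0.004, 0.002, 0.005, 0.003],
})

# Parse the text dates, then take the weekday (Monday=0 ... Sunday=6)
parsed = pd.to_datetime(sample["Start Date"], format="%m/%d/%y")
sample["Numeric DOW"] = parsed.dt.weekday

# Average only the column of interest, so text columns never enter the mean
by_day = sample.groupby("Numeric DOW")["Cost per Impression"].mean()

# idxmax/idxmin return the weekday number of the highest/lowest average
most_expensive = calendar.day_name[int(by_day.idxmax())]
least_expensive = calendar.day_name[int(by_day.idxmin())]
print(most_expensive, by_day.max())
print(least_expensive, by_day.min())
```

Selecting the single column before calling `mean()` is a design choice: it avoids averaging columns that are not needed for the comparison.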


## Running Correlations to See Variables' Relationships with 'Amount Spent'

In [26]:
#Next I'm going to start on the correlation analysis

print('Amount Spent in relation to Reach')
print(tpf["Amount Spent (USD)"].corr(tpf["Reach"]))
print('---------------------------------')
print('Amount Spent in relation to Frequency')
print(tpf["Amount Spent (USD)"].corr(tpf["Frequency"]))
print('---------------------------------')
print('Amount Spent in relation to Unique Clicks')
print(tpf["Amount Spent (USD)"].corr(tpf["Unique Clicks"]))
print('----------------------------------')
print('Amount Spent in relation to Page Likes')
print(tpf["Amount Spent (USD)"].corr(tpf["Page Likes"]))

Amount Spent in relation to Reach
0.7031238065113846
---------------------------------
Amount Spent in relation to Frequency
0.13020086992866337
---------------------------------
Amount Spent in relation to Unique Clicks
0.8829931774784137
----------------------------------
Amount Spent in relation to Page Likes
0.7576119292180449


These correlations show that the amount we spend on an advertisement is strongly correlated with 'Reach', 'Unique Clicks' and 'Page Likes'. It is only very weakly correlated with 'Frequency', however, with a value of approximately 0.13. The strongest correlation is between 'Amount Spent' and 'Unique Clicks', at approximately 0.88.
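
The pairwise `corr` calls can also be collapsed into a single ranked table. Below is a small sketch with invented numbers: only the column names match the dataset, and the values are placeholders.

```python
import pandas as pd

# Made-up figures purely for illustration
toy = pd.DataFrame({
    "Amount Spent (USD)": [5.0, 12.0, 20.0, 31.0, 44.0],
    "Reach": [900, 2100, 3900, 6200, 8500],
    "Frequency": [1.2, 1.1, 1.4, 1.0, 1.3],
    "Unique Clicks": [4, 11, 17, 30, 41],
})

# Pearson correlation of every other column with 'Amount Spent (USD)',
# sorted from strongest positive to weakest
corr_with_spend = (
    toy.corr()["Amount Spent (USD)"]
    .drop("Amount Spent (USD)")
    .sort_values(ascending=False)
)
print(corr_with_spend)
```

`DataFrame.corr()` uses Pearson correlation by default, the same measure that `Series.corr` computes for a single pair.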

## Building our Basic Multiple Linear Regression Model

In [27]:
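#Now we will set up the variables to be used in the regression
#I did not add a constant to the regression equation, as I did not feel it was applicable to this situation.
#If there is zero reach and zero frequency, I'd also expect there to be zero unique clicks.
#Note that without a constant, statsmodels reports the uncentered R-squared,
#so it is not directly comparable to the R-squared of a model with an intercept.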

#Here we are assigning our X and Y variables
X = tpf[["Reach", "Frequency"]]
Y = tpf["Unique Clicks"]

#Now we will set up the actual model
MLR = sm.OLS(Y, X).fit()
predictions = MLR.predict(X)

print(MLR.summary())

                            OLS Regression Results                            
Dep. Variable:          Unique Clicks   R-squared:                       0.557
Model:                            OLS   Adj. R-squared:                  0.556
Method:                 Least Squares   F-statistic:                     2325.
Date:                Fri, 12 Oct 2018   Prob (F-statistic):               0.00
Time:                        21:59:48   Log-Likelihood:                -15973.
No. Observations:                3705   AIC:                         3.195e+04
Df Residuals:                    3703   BIC:                         3.196e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Reach          0.0019   3.12e-05     62.490      0.0

Looking at this regression output, we can see that both predictors are highly significant. Judging by the raw coefficients, 'Frequency' appears to be the much more powerful predictor, although the two variables are measured on very different scales, so their coefficients are not directly comparable. It should be noted that a high standard error on a coefficient can indicate multicollinearity in the model; if that is the case here, a simple linear regression with just 'Frequency' might model this relationship better. That being said, multicollinearity is a common occurrence in advertising data.
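
One quick way to probe the multicollinearity concern is to check how strongly the two predictors are related to each other. For a model with an intercept and exactly two predictors, the variance inflation factor (VIF) reduces to 1 / (1 - r^2), where r is their Pearson correlation. The sketch below uses invented numbers, not the Travel Pony data, and the threshold mentioned in the comment is a common rule of thumb and not a result from this analysis.

```python
import pandas as pd

# Invented predictor values for illustration only
toy = pd.DataFrame({
    "Reach": [900, 2100, 3900, 6200, 8500, 4000],
    "Frequency": [1.2, 1.1, 1.4, 1.0, 1.3, 1.6],
})

# Pearson correlation between the two predictors
r = toy["Reach"].corr(toy["Frequency"])

# With two predictors (and an intercept), both share the same VIF
vif = 1.0 / (1.0 - r ** 2)

# A VIF near 1 suggests little collinearity; values above roughly 5 to 10
# are often treated as a warning sign
print(f"r = {r:.3f}, VIF = {vif:.3f}")
```

Because the model in this notebook omits the constant, this calculation is only an approximate diagnostic for it.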