# **Snapchat Political Ad Impressions**
> The purpose of this project is to re-analyze what factors might relate to a greater number of political advertisement impressions on Snapchat using Python.

***Import Libraries***

In [10]:
#import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import datetime as dt
import statsmodels.api as sm


***Data Cleaning***
> In this step, I downloaded the dataset from my original GitHub repository, extracted the columns that I wanted to analyze, and deleted any datapoints that had missing values.



In [11]:
#import dataset from GitHub
url = 'https://raw.githubusercontent.com/vchen19/snapchat-2018-ad-impressions/main/PoliticalAds.csv'
df = pd.read_csv(url)
#extract relevant columns
df = df[['Currency Code', 'Spend', 'Impressions', 'StartDate', 'EndDate', 'CountryCode']]
#delete rows with NaN values, reset the index to be from 0 incrementing by 1 again
df = df.dropna()
df = df.reset_index(drop=True)


***Finding Advertising Spend in USD***


> In this step, I found all the currencies that were represented in the dataset. Then, I found their exchange rates (provided by [Morningstar](https://www.google.com/intl/en/googlefinance/disclaimer/)). Finally, I used these exchange rates to convert all the advertising spend to USD.



In [12]:
#determine the unique currency codes in the dataset
unique_currencies = np.unique(df["Currency Code"])
print(unique_currencies)

#convert all advertising spend to USD based on the currency code
for i in range(len(df)):
  if df.loc[i,"Currency Code"] == 'AUD':
    df.loc[i,"Spend"] = float(df.loc[i,"Spend"]) *0.76
  elif df.loc[i,"Currency Code"] == 'CAD':
    df.loc[i,"Spend"] = float(df.loc[i,"Spend"]) * 0.8
  elif df.loc[i,"Currency Code"] == 'EUR':
    df.loc[i,"Spend"] = float(df.loc[i,"Spend"]) * 1.19
  elif df.loc[i,"Currency Code"] == 'GBP':
    df.loc[i,"Spend"] = float(df.loc[i,"Spend"]) * 1.38

['AUD' 'CAD' 'EUR' 'GBP' 'USD']
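
> The row-by-row loop above can also be written as a single column operation. The following is a minimal sketch, not the code that produced the results in this notebook: the `sample` dataframe and the `SpendUSD` column name are hypothetical, and the rates are copied from the loop (with USD added as 1.0).

```python
import pandas as pd

# Hypothetical sample rows; the real values come from PoliticalAds.csv.
sample = pd.DataFrame({
    "Currency Code": ["USD", "AUD", "CAD", "EUR", "GBP"],
    "Spend": [100, 100, 100, 100, 100],
})

# Exchange rates to USD, copied from the loop above (USD added as 1.0).
usd_rates = {"USD": 1.0, "AUD": 0.76, "CAD": 0.8, "EUR": 1.19, "GBP": 1.38}

# Look up each row's rate by currency code, then multiply the whole column at once.
rate = sample["Currency Code"].map(usd_rates)
sample["SpendUSD"] = sample["Spend"].astype(float) * rate

print(sample)
```

> A currency code that is missing from `usd_rates` maps to NaN, so an unexpected currency shows up as a missing value instead of silently keeping its unconverted amount.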


***Finding Length of Ad Run***


> In this step, I found the length of each ad's run by converting the start and end dates to datetime objects and subtracting the start date from the end date. I added the result, measured in seconds, to the original dataframe as a new column, "Time".



In [13]:
#break up the start date and end date of the advertisement into year, month, day, hour, minute, second
timediff = []
for i in range(len(df)):
  startyear = int(df.loc[i, "StartDate"][0:4])
  endyear = int(df.loc[i,"EndDate"][0:4])
  startmonth = int(df.loc[i,"StartDate"][5:7])
  endmonth = int(df.loc[i,"EndDate"][5:7])
  startday = int(df.loc[i,"StartDate"][8:10])
  endday = int(df.loc[i, "EndDate"][8:10])
  starthour = int(df.loc[i,"StartDate"][11:13])
  endhour = int(df.loc[i,"EndDate"][11:13])
  startmin = int(df.loc[i,"StartDate"][14:16])
  endmin = int(df.loc[i,"EndDate"][14:16])
  startsec = int(df.loc[i,"StartDate"][17:19])
  endsec = int(df.loc[i,"EndDate"][17:19])
  #build datetime objects from the parsed date components
  startdate = dt.datetime(startyear, startmonth, startday, starthour, startmin, startsec)
  enddate = dt.datetime(endyear, endmonth, endday, endhour, endmin, endsec)
  #subtract the start date from the end date to find how long the ad ran, in seconds
  time_diff = enddate - startdate
  time_diff_sec = time_diff.total_seconds()  #total_seconds() already returns a float
  timediff.append(time_diff_sec)

#add the list of length of ad run to the dataframe
df['Time'] = timediff
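
> The same duration can be computed without slicing the strings by hand. This is a minimal sketch under an assumption: the timestamps below are hypothetical and follow a `YYYY-MM-DD HH:MM:SS` layout in their first 19 characters, which matches the character positions the loop above reads.

```python
import pandas as pd

# Hypothetical timestamps laid out the way the slicing above expects.
sample = pd.DataFrame({
    "StartDate": ["2018-10-01 00:00:00Z", "2018-11-05 12:30:00Z"],
    "EndDate": ["2018-10-02 00:00:00Z", "2018-11-05 13:00:00Z"],
})

# Keep the first 19 characters (date and time) and let pandas parse them.
start = pd.to_datetime(sample["StartDate"].str[:19])
end = pd.to_datetime(sample["EndDate"].str[:19])

# End minus start gives a timedelta column; convert it to seconds.
sample["Time"] = (end - start).dt.total_seconds()

print(sample["Time"].tolist())
```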

***Plotting Using Plotly Express***


> In this step, I plotted impressions against spend with the colors corresponding to the country and the size of the point corresponding to length of ad run. This allowed me to visually analyze whether there were any interesting trends between all of these variables.



In [14]:
#plot impressions against spend with the colors corresponding to the country and the size of the point corresponding to length of ad run
fig = px.scatter(df, x="Spend", y="Impressions", color = "CountryCode", size="Time")

#change axis to remove empty space
fig.update_yaxes(range=[0, 12000000])
fig.show()

***Adding a New Variable to Multiple Regression***


> From analyzing the plot, I noticed that there were several orange bubbles, corresponding to ads that ran in Canada, that appeared to sit above an overall trendline. This indicated to me that whether or not an ad runs in Canada might have a positive correlation with the number of impressions that a political ad gets. So, I assigned each ad a value of 1 if it ran in Canada, and 0 if not. I added this new data column to the dataframe.



In [15]:
#assign value of 1 if country is Canada, 0 if not
canada = []
for i in range(len(df)):
  if df.loc[i, "CountryCode"] == "canada":
    canada.append(1)
  else:
    canada.append(0)

#add variable to the dataframe
df['Canada?'] = canada
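
> The indicator column can also be built with one comparison instead of a loop. This is a minimal sketch on a hypothetical `sample` dataframe; the country values other than "canada" are made up for illustration.

```python
import pandas as pd

# Hypothetical country values; only "canada" is taken from the code above.
sample = pd.DataFrame({"CountryCode": ["canada", "united states", "canada", "norway"]})

# The comparison yields True/False per row; astype(int) turns that into 1/0.
sample["Canada?"] = (sample["CountryCode"] == "canada").astype(int)

print(sample)
```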

***Multiple Regression***


> In this step, I ran a multiple regression model using the statsmodels library. Spend and time running both predict impressions positively, as shown by their positive coefficients and low p-values. My guess that an ad running in Canada is likely to get more impressions is reflected in that variable's positive coefficient, but its p-value of 0.483 is far above the conventional 0.05 threshold. The coefficient is therefore not statistically significant, and this model does not provide evidence of a relationship between an ad running in Canada and the number of impressions.



In [16]:
#run multiple linear regression with ad spend, amount of time run, and whether or not it was in Canada as the variables
Y = df['Impressions']
X = df[['Spend', 'Time', 'Canada?']]
X = sm.add_constant(X) # adding a constant
model = sm.OLS(Y, X).fit()
predictions = model.predict(X) 
 
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:            Impressions   R-squared:                       0.687
Model:                            OLS   Adj. R-squared:                  0.686
Method:                 Least Squares   F-statistic:                     430.5
Date:                Tue, 13 Apr 2021   Prob (F-statistic):          6.95e-148
Time:                        21:12:50   Log-Likelihood:                -8733.3
No. Observations:                 592   AIC:                         1.747e+04
Df Residuals:                     588   BIC:                         1.749e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -1.158e+05   3.33e+04     -3.472      0.0