# Snapchat Political Ads
* **See the main project notebook for instructions to be sure you satisfy the rubric!**
* See Project 03 for information on the dataset.
* A few example prediction questions to pursue are listed below. However, don't limit yourself to them!
    * Predict the reach (number of views) of an ad.
    * Predict how much was spent on an ad.
    * Predict the target group of an ad. (For example, predict the target gender.)
    * Predict the (type of) organization/advertiser behind an ad.

Be careful to justify what information you would know at the "time of prediction" and train your model using only those features.
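
One way to enforce the time-of-prediction constraint is to keep an explicit list of columns that only become known after an ad has run, and to drop them before training. The sketch below is illustrative only: the helper `features_known_at_prediction` and the list `POST_HOC_COLUMNS` are hypothetical names, and treating `Impressions` as unavailable when predicting `Spend` is an assumption that depends on the prediction question.

```python
import pandas as pd

# Assumption: columns that are only observed after the ad has run.
# Adjust this list to match the prediction question being asked.
POST_HOC_COLUMNS = ['Impressions']


def features_known_at_prediction(df, target, post_hoc=POST_HOC_COLUMNS):
    """Return (X, y), dropping the target and any columns unknown at prediction time."""
    to_drop = [target] + [c for c in post_hoc if c in df.columns and c != target]
    return df.drop(columns=to_drop), df[target]


# Toy frame standing in for the ads data
toy = pd.DataFrame({
    'Spend': [6000, 306, 445],
    'Impressions': [2080852, 164497, 232906],
    'Gender': ['All genders', 'All genders', 'FEMALE'],
})
X, y = features_known_at_prediction(toy, target='Spend')
print(list(X.columns))
```

Keeping the list in one place makes the justification easy to state in the write-up and easy to change if the prediction target changes.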

# Summary of Findings


### Introduction
TODO

### Baseline Model
TODO

### Final Model
TODO

### Fairness Evaluation
TODO

# Code

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # Higher resolution figures

In [2]:
fp_2018 = os.path.join('data', 'PoliticalAds_2018.csv')
fp_2019 = os.path.join('data', 'PoliticalAds_2019.csv')
ads_2018 = pd.read_csv(fp_2018)
ads_2019 = pd.read_csv(fp_2019)
ads = pd.concat([ads_2018, ads_2019], ignore_index=True)

###### Convert Start and End Dates to Datetimes with time zone default UTC

In [3]:
ads['StartDate'] = pd.to_datetime(ads['StartDate'])
ads['EndDate'] = pd.to_datetime(ads['EndDate'])
ads.columns

Index(['ADID', 'CreativeUrl', 'Currency Code', 'Spend', 'Impressions',
       'StartDate', 'EndDate', 'OrganizationName', 'BillingAddress',
       'CandidateBallotInformation', 'PayingAdvertiserName', 'Gender',
       'AgeBracket', 'CountryCode', 'Regions (Included)', 'Regions (Excluded)',
       'Electoral Districts (Included)', 'Electoral Districts (Excluded)',
       'Radius Targeting (Included)', 'Radius Targeting (Excluded)',
       'Metros (Included)', 'Metros (Excluded)', 'Postal Codes (Included)',
       'Postal Codes (Excluded)', 'Location Categories (Included)',
       'Location Categories (Excluded)', 'Interests', 'OsType', 'Segments',
       'Language', 'AdvancedDemographics', 'Targeting Connection Type',
       'Targeting Carrier (ISP)', 'CreativeProperties'],
      dtype='object')
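
The date conversion above can be illustrated on a couple of timestamp strings. This sketch assumes the raw values are ISO 8601 strings with a trailing `Z` (the exact raw format in the CSV files is an assumption). Passing `utc=True` makes the UTC time zone explicit instead of relying on the parser's inference, and a missing end date becomes `NaT`.

```python
import pandas as pd

# Made-up rows in the assumed raw format
raw = pd.DataFrame({
    'StartDate': ['2018-11-18T10:21:59Z', '2018-09-28T20:33:08Z'],
    'EndDate': ['2018-12-02T10:21:59Z', None],
})

for col in ['StartDate', 'EndDate']:
    raw[col] = pd.to_datetime(raw[col], utc=True)

# Duration is NaT for ads with no recorded end date
raw['Duration'] = raw['EndDate'] - raw['StartDate']
print(raw.dtypes)
print(raw['Duration'])
```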

###### Cleaning the Data
I used the readme file to replace missing values with meaningful ones. For example, the readme states that an empty gender field means the ad targets all genders, so I filled the missing `Gender` values with 'All genders'. Many values that appear missing before cleaning carry meaning of this kind, so filling them in leaves only the truly missing values to consider when assessing missingness.

In [4]:
ads['Gender'] = ads['Gender'].fillna('All genders')
ads['AgeBracket'] = ads['AgeBracket'].fillna('All ages')
ads['CountryCode'] = ads['CountryCode'].str.title()
ads['CandidateBallotInformation'] = ads['CandidateBallotInformation'].fillna('No information provided')
ads['Regions (Included)'] = ads['Regions (Included)'].fillna('None included')
ads['Regions (Excluded)'] = ads['Regions (Excluded)'].fillna('All regions')
ads['Electoral Districts (Included)'] = ads['Electoral Districts (Included)'].fillna('No electoral districts targeting')
ads['Electoral Districts (Excluded)'] = ads['Electoral Districts (Excluded)'].fillna('All electoral districts')
ads['Radius Targeting (Included)'] = ads['Radius Targeting (Included)'].fillna('No radius targeting')
ads['Radius Targeting (Excluded)'] = ads['Radius Targeting (Excluded)'].fillna('Full radius')
ads['Metros (Included)'] = ads['Metros (Included)'].fillna('No metros targeting')
ads['Metros (Excluded)'] = ads['Metros (Excluded)'].fillna('All metros')
ads['Postal Codes (Included)'] = ads['Postal Codes (Included)'].fillna('No postal codes targeting')
ads['Postal Codes (Excluded)'] = ads['Postal Codes (Excluded)'].fillna('All postal codes')
ads['Location Categories (Included)'] = ads['Location Categories (Included)'].fillna('No location categories targeting')
# Fix the misspelled column name if it is present (rename is a no-op otherwise)
ads = ads.rename(columns={'Location Categories (Exlcuded)': 'Location Categories (Excluded)'})
ads['Location Categories (Excluded)'] = ads['Location Categories (Excluded)'].fillna('All location categories')
ads['Interests'] = ads['Interests'].fillna('No interest targeting')
ads['OsType'] = ads['OsType'].fillna('All operating systems')
ads['Language'] = ads['Language'].fillna('No language targeting')
ads['AdvancedDemographics'] = ads['AdvancedDemographics'].fillna('No 3rd party data')
ads['Targeting Connection Type'] = ads['Targeting Connection Type'].fillna('No internet connection targeting')
ads['Targeting Carrier (ISP)'] = ads['Targeting Carrier (ISP)'].fillna('All carrier types')
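
The same fill-then-assess idea can be written compactly by passing a column-to-default mapping to `fillna` and then measuring what is still missing. This is a sketch on a toy frame: the `defaults` mapping repeats only two of the fill values used above, and `Segments` is left unfilled purely for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Gender': [np.nan, 'FEMALE', np.nan],
    'Interests': ['Bookworms', np.nan, np.nan],
    'Segments': ['Provided by Advertiser', np.nan, 'Provided by Advertiser'],
})

# Documented defaults: a missing value in these columns has a known meaning
defaults = {'Gender': 'All genders', 'Interests': 'No interest targeting'}
filled = toy.fillna(value=defaults)

# Proportion of values still missing per column after the defaults are applied
still_missing = filled.isna().mean()
print(still_missing)
```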

In [5]:
ads.head()

Unnamed: 0,ADID,CreativeUrl,Currency Code,Spend,Impressions,StartDate,EndDate,OrganizationName,BillingAddress,CandidateBallotInformation,...,Location Categories (Included),Location Categories (Excluded),Interests,OsType,Segments,Language,AdvancedDemographics,Targeting Connection Type,Targeting Carrier (ISP),CreativeProperties
0,5c81aa516a8b62d12172328a536cb66e16c3695431f753...,https://www.snap.com/political-ads/asset/27d66...,EUR,6000,2080852,2018-11-18 10:21:59+00:00,2018-12-02 10:21:59+00:00,Media Agent,"Østre alle 2 ,Værløse ,3500,DK",No information provided,...,No location categories targeting,All location categories,No interest targeting,All operating systems,Provided by Advertiser,No language targeting,No 3rd party data,No internet connection targeting,All carrier types,web_view_url:https://danskfolkeparti.dk
1,ba2b4508ba87ed3c749d5ad5bc296648e6f4d55aacd33b...,https://www.snap.com/political-ads/asset/4f611...,USD,306,164497,2018-09-28 20:33:08+00:00,NaT,Chong and Koster,"1640 Rhode Island Ave. NW, Suite 600,Washingto...",No information provided,...,No location categories targeting,All location categories,No interest targeting,All operating systems,,en,No 3rd party data,No internet connection targeting,All carrier types,web_view_url:https://act.everytown.org/sign/di...
2,2438786c60ae41cf56614885b415a72857bbfb5c06f760...,https://www.snap.com/political-ads/asset/2c264...,GBP,445,232906,2018-11-27 21:44:19+00:00,2019-01-13 21:43:53+00:00,Amnesty International UK,"17-25 New Inn Yard,London,EC2A 3EA,GB",No information provided,...,No location categories targeting,All location categories,No interest targeting,All operating systems,Provided by Advertiser,No language targeting,No 3rd party data,No internet connection targeting,All carrier types,web_view_url:https://www.amnesty.org.uk/write-...
3,c80ca50681d552551ceaf625981c0202589ca710d51925...,https://www.snap.com/political-ads/asset/a36b7...,USD,60,12883,2018-09-28 23:10:14+00:00,2018-10-10 02:00:00+00:00,Chong and Koster,"1640 Rhode Island Ave. NW, Suite 600,Washingto...",No information provided,...,No location categories targeting,All location categories,No interest targeting,All operating systems,Provided by Advertiser,No language targeting,Marital Status (Single),No internet connection targeting,All carrier types,web_view_url:https://www.voterparticipation.or...
4,6648427922b496ea49597c4b74b650805e29544600e6ca...,https://www.snap.com/political-ads/asset/9dfa5...,USD,3403,964607,2018-10-02 17:28:33+00:00,2018-10-09 03:59:44+00:00,ACRONYM,US,No information provided,...,No location categories targeting,All location categories,"Arts & Culture Mavens,Bookworms & Avid Readers...",All operating systems,Provided by Advertiser,No language targeting,No 3rd party data,No internet connection targeting,All carrier types,web_view_url:https://join.knockthe.vote/swag?s...


### Baseline Model

In [6]:
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (FunctionTransformer, LabelEncoder, OneHotEncoder,
                                   OrdinalEncoder, StandardScaler)
from sklearn.tree import DecisionTreeRegressor

In [22]:
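ads_copy = ads.copy()
# Use the month in which the ad started as a feature
ads_copy['Month'] = ads_copy['StartDate'].dt.month
# Drop identifiers, URLs, advertiser details, the country code, and the raw dates
ads_copy = ads_copy.drop(['ADID', 'StartDate', 'EndDate', 'OrganizationName', 'CreativeUrl', 'BillingAddress', 'PayingAdvertiserName', 'CreativeProperties', 'CountryCode'], axis=1)
# Split the feature columns (everything except the target, Spend) by dtype
types = ads_copy.drop(['Spend'], axis=1).dtypes
# np.object was removed in NumPy 1.24; the builtin object is equivalent
catcols = types.loc[types == object].index.to_list()
numcols = types.loc[types != object].index.to_list()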
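
An alternative way to split feature columns by type is `DataFrame.select_dtypes`, which avoids comparing against a specific dtype object. The sketch below uses a toy frame with made-up values; `cat_cols` and `num_cols` are illustrative names, and "categorical" is taken here to mean "not numeric", which is an assumption rather than the notebook's exact rule.

```python
import pandas as pd

toy = pd.DataFrame({
    'Currency Code': ['EUR', 'USD'],
    'Impressions': [2080852, 164497],
    'Month': [11, 9],
    'Spend': [6000, 306],
})
features = toy.drop(columns=['Spend'])

# Numeric columns versus everything else
num_cols = features.select_dtypes(include='number').columns.to_list()
cat_cols = features.select_dtypes(exclude='number').columns.to_list()
print(cat_cols, num_cols)
```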

In [8]:
# Collapse every targeting column to two levels: its "no targeting" default
# (the fill value from the cleaning step) or 'Some targeting'.
no_targeting_defaults = {
    'AgeBracket': 'All ages',
    'Gender': 'All genders',
    'CandidateBallotInformation': 'No information provided',
    'Regions (Included)': 'None included',
    'Regions (Excluded)': 'All regions',
    'Electoral Districts (Included)': 'No electoral districts targeting',
    'Electoral Districts (Excluded)': 'All electoral districts',
    'Radius Targeting (Included)': 'No radius targeting',
    'Radius Targeting (Excluded)': 'Full radius',
    'Metros (Included)': 'No metros targeting',
    'Metros (Excluded)': 'All metros',
    'Postal Codes (Included)': 'No postal codes targeting',
    'Postal Codes (Excluded)': 'All postal codes',
    'Location Categories (Included)': 'No location categories targeting',
    'Location Categories (Excluded)': 'All location categories',
    'Interests': 'No interest targeting',
    'OsType': 'All operating systems',
    'Language': 'No language targeting',
    'AdvancedDemographics': 'No 3rd party data',
    'Targeting Connection Type': 'No internet connection targeting',
    'Targeting Carrier (ISP)': 'All carrier types',
}
for col, default in no_targeting_defaults.items():
    ads_copy.loc[ads_copy[col] != default, col] = 'Some targeting'
ads_copy.head()

Unnamed: 0,Currency Code,Spend,Impressions,CandidateBallotInformation,Gender,AgeBracket,Regions (Included),Regions (Excluded),Electoral Districts (Included),Electoral Districts (Excluded),...,Location Categories (Included),Location Categories (Excluded),Interests,OsType,Segments,Language,AdvancedDemographics,Targeting Connection Type,Targeting Carrier (ISP),Month
0,EUR,6000,2080852,No information provided,All genders,All ages,None included,All regions,No electoral districts targeting,All electoral districts,...,No location categories targeting,All location categories,No interest targeting,All operating systems,Provided by Advertiser,No language targeting,No 3rd party data,No internet connection targeting,All carrier types,11
1,USD,306,164497,No information provided,All genders,All ages,None included,All regions,No electoral districts targeting,All electoral districts,...,No location categories targeting,All location categories,No interest targeting,All operating systems,,Some targeting,No 3rd party data,No internet connection targeting,All carrier types,9
2,GBP,445,232906,No information provided,All genders,Some targeting,None included,All regions,No electoral districts targeting,All electoral districts,...,No location categories targeting,All location categories,No interest targeting,All operating systems,Provided by Advertiser,No language targeting,No 3rd party data,No internet connection targeting,All carrier types,11
3,USD,60,12883,No information provided,Some targeting,Some targeting,Some targeting,All regions,No electoral districts targeting,All electoral districts,...,No location categories targeting,All location categories,No interest targeting,All operating systems,Provided by Advertiser,No language targeting,Some targeting,No internet connection targeting,All carrier types,9
4,USD,3403,964607,No information provided,All genders,Some targeting,Some targeting,All regions,No electoral districts targeting,All electoral districts,...,No location categories targeting,All location categories,Some targeting,All operating systems,Provided by Advertiser,No language targeting,No 3rd party data,No internet connection targeting,All carrier types,10


In [9]:
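# Categorical columns: fill gaps with a placeholder level, one-hot encode, then keep
# the principal components that explain 99% of the variance.
# Note: scikit-learn 1.2 and later use `sparse_output` in place of `sparse`.
cat_transformer = Pipeline(steps=[('imp', SimpleImputer(strategy='constant', fill_value='NULL')),
                                 ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False)),
                                 ('pca', PCA(svd_solver='full', n_components=0.99))])
# Month: impute the most frequent value, then treat it as a category
num_transformer_ohe = Pipeline(steps=[('imp', SimpleImputer(strategy='most_frequent')),
                                     ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False)),
                                     ('pca', PCA(svd_solver='full', n_components=0.99))])
# Numeric columns: impute the median, then standardize
scaled_transformer = Pipeline(steps=[('imp', SimpleImputer(strategy='median')),
                                     ('num', StandardScaler())])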
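
To see what the one-hot and PCA steps produce together, the sketch below runs them on a small made-up column. When `n_components` is a float between 0 and 1 and `svd_solver='full'`, PCA keeps the smallest number of components whose cumulative explained variance exceeds that fraction. The encoder is left at its default sparse output and densified with `toarray()`, so the snippet does not depend on which spelling of the sparse argument the installed scikit-learn accepts.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical column with three levels
X = np.array([['USD'], ['USD'], ['EUR'], ['GBP'], ['USD'], ['EUR']])

# One indicator column per level
onehot = OneHotEncoder(handle_unknown='ignore').fit_transform(X).toarray()

# Keep enough principal components to explain 99% of the variance
pca = PCA(svd_solver='full', n_components=0.99)
reduced = pca.fit_transform(onehot)

print(onehot.shape, reduced.shape)
print(pca.explained_variance_ratio_.sum())
```

Because the indicator columns of a single feature always sum to one, at least one direction carries no variance, so PCA can drop a column without losing information.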

In [14]:
preproc = ColumnTransformer(transformers=[('cat', cat_transformer, catcols),
                                         ('scale', scaled_transformer, numcols),
                                         ('num', num_transformer_ohe, ['Month'])])
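
Note that `numcols` is built from every non-object feature column, so it includes `Month`, and `Month` is also passed to the `'num'` transformer. `ColumnTransformer` allows a column to appear in more than one transformer and concatenates the outputs, so `Month` ends up both scaled and one-hot encoded. If that double treatment is not intended, one option is to remove `Month` from the scaled list. The sketch below shows both behaviours on a toy frame; `scaled_cols` is a hypothetical name.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({'Impressions': [100, 200, 300, 400], 'Month': [9, 10, 11, 9]})
numcols = ['Impressions', 'Month']

# Month listed in both transformers: scaled once and one-hot encoded once
both = ColumnTransformer([('scale', StandardScaler(), numcols),
                          ('month', OneHotEncoder(handle_unknown='ignore'), ['Month'])])

# Hypothetical alternative: scale everything except Month
scaled_cols = [c for c in numcols if c != 'Month']
once = ColumnTransformer([('scale', StandardScaler(), scaled_cols),
                          ('month', OneHotEncoder(handle_unknown='ignore'), ['Month'])])

n_both = both.fit_transform(toy).shape[1]
n_once = once.fit_transform(toy).shape[1]
print(n_both, n_once)
```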

In [15]:
pl = Pipeline(steps=[('preproc', preproc), ('reg', RandomForestRegressor())])

In [16]:
X_train, X_test, y_train, y_test = train_test_split(ads_copy.drop(['Spend'], axis=1), ads_copy.Spend, random_state=1)

In [17]:
pl.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('preproc',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('cat',
                                                  Pipeline(memory=None,
                                                           steps=[('imp',
                                                                   SimpleImputer(add_indicator=False,
                                                                                 copy=True,
                                                                                 fill_value='NULL',
                                                                                 missing_values=nan,
                                                                                 strategy='constant',
                                                                  

In [18]:
preds = pl.predict(ads_copy.drop('Spend', axis=1))

In [19]:
# For a regressor, score() returns R^2; here it is computed on the held-out test set
r_score = pl.score(X_test, y_test)
r_score

0.7427707211713366
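
The value above is a coefficient of determination, which is unitless. To put such a number in context, it can help to report an error in the target's own units and to compare against a trivial baseline. The following sketch uses synthetic data (not the ads data) and scikit-learn's `DummyRegressor`; all variable names are illustrative.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 5 * X[:, 0] + rng.normal(scale=0.5, size=300)  # synthetic target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)
baseline = DummyRegressor(strategy='mean').fit(X_tr, y_tr)  # always predicts the training mean

for name, est in [('forest', model), ('mean baseline', baseline)]:
    pred = est.predict(X_te)
    rmse = np.sqrt(mean_squared_error(y_te, pred))
    print(name, 'R^2:', r2_score(y_te, pred), 'RMSE:', rmse)
```

A model that does not clearly beat the mean baseline on held-out data has learned little, whatever its training score.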

### Final Model

In [None]:
# TODO

### Fairness Evaluation

In [None]:
# TODO