## Data Collection
This notebook shows how to access the "PlacementSuggestionService" API.
We used several keyword lists in `terms.py` and submitted each term to the API. We re-submitted blocked terms without spaces and word-by-word to get the status of each baseword.

In [1]:
import os
import json
import time
import glob
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

from tqdm import tqdm

from terms import category2terms

In [2]:
# create request session
s = requests.Session()
retries = Retry(total=5, 
                backoff_factor=2, 
                status_forcelist=[ 500, 501, 502, 503, 504 ])
s.mount('http://', HTTPAdapter(max_retries=retries))

In [3]:
# output: where is data saved?
DATA_OUT = '../data/input/placements_api'

## Accessing the API
The following `headers`, `params` and `data` are from "ads.google.com".

We used DevTools (on Chrome) and listened for network requests while filling out searches for video-based ad placements. The network request was copied as a `cURL` and converted to a Python request using https://curl.trillworks.com.

💡Note: the params here are no longer valid, and some values have been replaced with "REDACTED".💡

In [4]:
headers = {
    'authority': 'ads.google.com',
    'x-same-domain': '1',
    'dnt': 'REDACTED',
    'x-framework-xsrf-token': 'REDACTED',
    'user-agent': 'REDACTED',
    'build-version': 'v1611596777',
    'content-type': 'application/x-www-form-urlencoded',
    'accept': '*/*',
    'origin': 'https://ads.google.com',
    'x-client-data': 'REDACTED',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'REDACTED',
    'accept-language': 'en-US,en;q=0.9',
    'cookie': 'REDACTED',
}

params = (
    ('authuser', '0'),
    ('acx-v-bv', 'awn_video_auto_REDACTED'),
    ('acx-v-clt', 'REDACTED'),
    ('rpcTrackingId', 'PlacementSuggestionService.Fetch:3'),
    ('f.sid', 'REDACTED'),
)

data = {
    'hl': 'en_US',
    '__lu': 'REDACTED',
    '__u': 'REDACTED',
    '__c': 'REDACTED',
    'f.sid': 'REDACTED',
    'ps': 'aw',
    '__ar': '{"1":"dogs","2":{"1":0,"2":20},"3":[1,4,6,5,2,3],"4":true,"5":false,"8":"355895228","11":["US"],"14":{"1":20}}',
    'activityContext': 'VideoCampaignConstruction.PlacementPickerPanel.ExpansionPanel.PlacementPickerComponent.Search',
    'requestPriority': 'HIGH_LATENCY_SENSITIVE',
    'activityType': 'INTERACTIVE',
    'activityId': 'REDACTED',
    'uniqueFingerprint': 'REDACTED',
    'previousPlace': '/aw/campaigns/new/video',
    'activityName': 'VideoCampaignConstruction.PlacementPickerPanel.ExpansionPanel.PlacementPickerComponent.Search',
    'destinationPlace': '/aw/campaigns/new/video'
}

## Query the API

In [5]:
for keyword_list, terms in category2terms.items():
    print(f"{keyword_list}: {len(terms)} terms")

social_justice: 62 terms
hate: 87 terms
policy: 150 terms
noise: 20 terms
adhoc: 68 terms


In [6]:
def query_placements_api(query: str, 
                         fn_out: str, 
                         headers: dict, 
                         params: tuple, 
                         data: dict) -> None:
    """
    Saves JSON from the "PlacementSuggestionService" API for a given `query`.
    """
    # format the argument in the request
    data['__ar'] = '{"1":"'+ query +'","2":{"1":0,"2":20},"3":[1,4,6,5,2,3],"4":true,"5":false,"8":"527682421","11":["US"],"13":[1],"14":{"1":20}}'
    
    # make the request
    response = s.post('https://ads.google.com/aw_video/_/rpc/PlacementSuggestionService/Fetch', 
                      headers=headers, params=params, data=data)
    
    # save the JSON request
    if response.status_code == 200: 
        with open(fn_out, 'w') as f:
            f.write(json.dumps(response.json()))
    time.sleep(3)

In [7]:
# make a request for each keyword, and save the json response.
for keyword_list, terms in category2terms.items():
    data_dir_ = os.path.join(f'{DATA_OUT}/{keyword_list}')
    os.makedirs(data_dir_, exist_ok=True)
    for term in tqdm(terms):
        fn_out = f'{data_dir_}/{term.lower()}.json'
        if os.path.exists(fn_out):
            continue
        query_placements_api(term, fn_out, headers, params, data)

100%|██████████| 62/62 [00:00<00:00, 83135.18it/s]
100%|██████████| 87/87 [00:00<00:00, 100413.99it/s]
100%|██████████| 150/150 [00:00<00:00, 101622.61it/s]
100%|██████████| 20/20 [00:00<00:00, 47934.90it/s]
100%|██████████| 68/68 [00:03<00:00, 21.65it/s]


## Re-run "blocked" responses without spaces
Blocked responses are just two characters `{}`, so the size is 2.

In [8]:
blocked = []
for fn in glob.glob(DATA_OUT + '/*/*'):
    size = os.stat(fn).st_size
    if size == 2 and "blocked_basewords/" not in fn:
        blocked.append(fn)
len(blocked)

240

In [9]:
# example of a blocked term
blocked[0]

'../data/input/placements_api/social_justice/muslim american.json'

In [10]:
data_out_2 = f'{DATA_OUT}/blocked'
os.makedirs(data_out_2, exist_ok=True)

In [11]:
# Query each blocked term without spaces.
for fn in tqdm(blocked):
    term = fn.split('/')[3].replace('.json', '')
    new_term = term.replace(' ', '')
    if term == new_term:
        continue
    fn_out = f'{data_out_2}/{term}.json'
    if os.path.exists(fn_out):
        continue
    query_placements_api(new_term, fn_out, headers, params, data)

100%|██████████| 240/240 [00:00<00:00, 889251.73it/s]


## Blocked basewords
Split each blocked response into base words.

In [12]:
base_words = []
for fn in blocked:
    term = fn.split('/')[-1].replace('.json', '')
    base_words.extend(term.split(' '))

In [13]:
base_words = list(set(base_words))
len(base_words)

264

In [14]:
data_out_3 = f'{DATA_OUT}/blocked_basewords'
os.makedirs(data_out_3, exist_ok=True)

In [15]:
for term in tqdm(base_words):
    fn_out = f'{data_out_3}/{term}.json'
    if os.path.exists(fn_out):
        continue
    query_placements_api(term, fn_out, headers, params, data)

100%|██████████| 264/264 [00:00<00:00, 226672.72it/s]
