# Retrieve Data

This script retrieves monthly pageview counts for dinosaur Wikipedia pages:
- it gets a list of article titles from 'raw_data/dinosaur_article_titles.csv'
- it runs those through the Wikipedia API to get the pageview counts
- it saves the resulting data to 'raw_data/dino_monthly_\<access_type\>_201501-202209.json'

Code for calling the Wikipedia API is based on [this example notebook](https://drive.google.com/file/d/1gtFZAjRoOShsqZKuNhiiSn9Ko4ky-CSC/view).

## Preprocessing

The file 'raw_data/dinosaur_article_titles.csv' was downloaded from [this file](https://docs.google.com/spreadsheets/d/1zfBNKsuWOFVFTOGK8qnTr2DmHkYK4mAACBKk1sHLt_k/edit?usp=sharing). No changes were made besides renaming the file.

## Import Packages

In [1]:
import json
import time
import urllib.parse
import requests
import pandas as pd

## Set Constants

*Constants are variables that will not be changed later in the script.*

In [None]:
TITLE_PATH = '../raw_data/dinosaur_article_titles.csv'

The location of the CSV file of article titles to be used - see [Preprocessing](#Preprocessing)

In [None]:
API_REQUEST_PAGEVIEWS_ENDPOINT = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'

The REST API 'pageviews' URL - this is the common URL/endpoint for all 'pageviews' API requests

In [None]:
API_REQUEST_PER_ARTICLE_PARAMS = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

This is a parameterized string that specifies what kind of pageviews request we are going to make.  
In this case it will be a 'per-article' based request. The string is a format string so that we can replace each parameter with an appropriate value before making the request.

In [None]:
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

The Pageviews API asks that we not exceed 100 requests per second, so we add a small delay before each request.
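
The delay can be worked out by hand. The following is a minimal sketch of that arithmetic, assuming the 2 ms latency figure above; `MAX_REQUESTS_PER_SECOND` and `throttle_wait` are names introduced here for illustration and are not used elsewhere in the notebook.

```python
# Illustrative arithmetic only; MAX_REQUESTS_PER_SECOND is a name assumed for this sketch
MAX_REQUESTS_PER_SECOND = 100.0
API_LATENCY_ASSUMED = 0.002   # assumed latency per request, in seconds

# Time budget per request at the limit, minus the time the request itself is assumed to take
throttle_wait = (1.0 / MAX_REQUESTS_PER_SECOND) - API_LATENCY_ASSUMED

print(f"wait before each request: {throttle_wait * 1000:.1f} ms")
```

With these values the wait is about 8 ms per request. If the assumed latency were ever larger than the per-request budget, the result would be negative, which is why the request function only sleeps when the wait is greater than zero.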

In [None]:
REQUEST_HEADERS = {
    'User-Agent': '<klein324@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2022',
}

When making a request to the Wikimedia API, they ask that you include a "unique ID" that allows them to contact you if something goes wrong, such as your code exceeding request limits or some other error.  
**NOTE: this should be replaced with your own email and usage information**

In [None]:
ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE = {
    "project":     "en.wikipedia.org",
    "access":      "",             # this value will be set/changed before each request
    "agent":       "user",
    "article":     "",             # this value will be set/changed before each request
    "granularity": "monthly",
    "start":       "2015010100",
    "end":         "2022093000"
}

This template is used to map parameter values into the API_REQUEST_PER_ARTICLE_PARAMS portion of an API request. The dictionary has a field/key for each of the required parameters. In this notebook we only vary the article name and access type, so the majority of the fields can stay constant for each request. Of course, these values *could* be changed if necessary.
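
To make the mapping concrete, here is a minimal sketch of how the template and the format string combine into a request URL. The helper name `build_request_url` and the example title "Tyrannosaurus" are assumptions made for this illustration; the sketch also works on a copy of the template, which is a design choice of the sketch rather than something the API requires.

```python
import urllib.parse

ENDPOINT = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'
PER_ARTICLE_PARAMS = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'
TEMPLATE = {
    "project":     "en.wikipedia.org",
    "access":      "",
    "agent":       "user",
    "article":     "",
    "granularity": "monthly",
    "start":       "2015010100",
    "end":         "2022093000",
}

def build_request_url(article_title, access_type):
    """Hypothetical helper: fill a copy of the template and format the URL."""
    params = dict(TEMPLATE)  # copy, so the shared template keeps its empty placeholders
    params["article"] = urllib.parse.quote(article_title.replace(' ', '_'))
    params["access"] = access_type
    return ENDPOINT + PER_ARTICLE_PARAMS.format(**params)

# "Tyrannosaurus" is only an example title chosen for this sketch
url = build_request_url("Tyrannosaurus", "desktop")
print(url)
```

Each `{name}` placeholder in the format string is filled from the dictionary key of the same name, so only the two varying keys need to be set per request.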

## Functions

### Function to request pageviews for 1 article

This function takes all the information needed to access the API and returns the JSON response unmodified, *except* that it removes the 'access' field from each item.

Inputs:
- article_title: the title of a Wikipedia article
- access_type: the method(s) the page was accessed by to be included in viewcount
    - Available values : all-access, desktop, mobile-app, mobile-web
- endpoint_url: the REST API 'pageviews' URL
- endpoint_params: a parameterized string that specifies what kind of pageviews request to make - see [documentation](https://wikimedia.org/api/rest_v1/#!/Pageviews_data/get_metrics_pageviews_aggregate_project_access_agent_granularity_start_end)
- request_template: a template used to map parameter values into endpoint_params
- headers: the request headers, including a 'User-Agent' value that identifies who is making the request

Output:
- the JSON response as a dictionary whose 'items' key holds a list of dictionaries, with each dictionary being one month's data (or None if the request fails)
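
As a rough picture of what comes back, the sketch below builds a made-up response with the same general structure and strips the 'access' field from each item. The values are invented for illustration and are not real pageview data, and `strip_access` is a hypothetical helper, not part of the notebook.

```python
# Made-up sample mimicking the structure of a per-article response; the numbers are not real data
sample_response = {
    "items": [
        {"project": "en.wikipedia", "article": "Example_article", "granularity": "monthly",
         "timestamp": "2015070100", "access": "desktop", "agent": "user", "views": 100},
        {"project": "en.wikipedia", "article": "Example_article", "granularity": "monthly",
         "timestamp": "2015080100", "access": "desktop", "agent": "user", "views": 120},
    ]
}

def strip_access(response):
    """Hypothetical helper: drop the 'access' key from every monthly record."""
    for item in response["items"]:
        item.pop("access", None)  # pop with a default avoids a KeyError if the key is absent
    return response

cleaned = strip_access(sample_response)
print(cleaned["items"][0])
```

Removing 'access' makes sense here because the mobile file combines two access types, so a single access label on each record would be misleading.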

In [None]:
def request_pageviews_per_article(article_title = None, 
                                  access_type = "desktop",
                                  endpoint_url = API_REQUEST_PAGEVIEWS_ENDPOINT, 
                                  endpoint_params = API_REQUEST_PER_ARTICLE_PARAMS, 
                                  request_template = ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE,
                                  headers = REQUEST_HEADERS):
    
    # Make sure we have an article title
    if not article_title: return None
    
    # Work on a copy so the shared template is not modified between calls
    request_params = dict(request_template)
    
    # Titles are supposed to have spaces replaced with "_" and be URL encoded
    article_title_encoded = urllib.parse.quote(article_title.replace(' ','_'))
    request_params['article'] = article_title_encoded
    
    # update the access type
    request_params['access'] = access_type
    
    # create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_params)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
        
        # remove access type from dictionary
        for item in json_response["items"]:
            del item["access"]
    except Exception as e:
        print(e)
        json_response = None
    return json_response

### Function to compile all pageview data for one access type

This function takes a list of article titles and an access type defined for this notebook, and returns a dictionary of data from the API (using the previous function).

Inputs:
- article_list: a list of titles of Wikipedia articles
- access: the method(s) the page was accessed by to be included in viewcount
    - Available values: desktop, mobile (sum of mobile-app and mobile-web), cumulative (all-access)

Output:
- a dictionary with the keys being the article names and values being the list of dictionaries from the API
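
The 'mobile' value is the sum of two API responses. The function below pairs months by position, which assumes both responses cover the same months in the same order. As an alternative sketch (not the method used in this notebook), the records can be matched on their 'timestamp' field instead; `sum_views_by_timestamp` and the sample records are hypothetical.

```python
def sum_views_by_timestamp(app_items, web_items):
    """Hypothetical helper: add up 'views' for records that share a timestamp."""
    combined = {}
    for item in list(app_items) + list(web_items):
        ts = item["timestamp"]
        if ts in combined:
            combined[ts]["views"] += item["views"]
        else:
            combined[ts] = dict(item)  # copy so the input records are left untouched
    # Timestamps are fixed-width strings (YYYYMMDDHH), so string order is chronological order
    return [combined[ts] for ts in sorted(combined)]

# Made-up records: the web list is missing the first month
app_items = [{"timestamp": "2015070100", "views": 5}, {"timestamp": "2015080100", "views": 7}]
web_items = [{"timestamp": "2015080100", "views": 3}]

mobile_items = sum_views_by_timestamp(app_items, web_items)
print(mobile_items)
```

Matching on the timestamp keeps a month that appears in only one of the two lists, whereas pairing by position would misalign every later month.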

In [None]:
def compile_pageview_data(article_list, access = "desktop"):
    article_dict = {}
    
    # loop through article titles
    for title in article_list:
        if access == "desktop":
            # get API data with 'desktop' access type
            article_dict[title] = request_pageviews_per_article(article_title = title, access_type = "desktop")
            
        elif access == "mobile":
            # get API data from both 'mobile-app' and 'mobile-web' access types
            app = request_pageviews_per_article(article_title = title, access_type = "mobile-app")
            web = request_pageviews_per_article(article_title = title, access_type = "mobile-web")
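            # if either request failed (returned None), store None and move on to the next title
            if app is None or web is None:
                article_dict[title] = None
                continue
            # loop through both lists simultaneously, and add the viewcount from 'mobile-web' to 'mobile-app'
            for app_item, web_item in zip(app["items"], web["items"]):
                app_item['views'] = app_item['views'] + web_item['views']
            article_dict[title] = app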
            
        elif access == "cumulative":
            # get API data with 'all-access' access type
            article_dict[title] = request_pageviews_per_article(article_title = title, access_type = "all-access")
            
    return article_dict

## Create Datafiles

In [None]:
title_list = pd.read_csv(TITLE_PATH)['name'].tolist()
title_list = [item.replace('“','\"').replace('”','\"') for item in title_list]

Read the file of article names and convert it to a list.  
Also replace the curly quote characters (“ and ”) with straight double quotes (") so that the titles are read correctly by the code.
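
The quote replacement can be tried on its own. This is a small sketch using made-up titles; `normalize_quotes` is a hypothetical helper, and the Unicode escapes stand for the left and right curly double quotes.

```python
def normalize_quotes(title):
    """Hypothetical helper: swap curly double quotes for straight ones."""
    return title.replace('\u201c', '"').replace('\u201d', '"')

# Made-up titles for illustration; the first contains curly quotes
titles = ['\u201cExample\u201d title', 'Plain title']
cleaned_titles = [normalize_quotes(t) for t in titles]
print(cleaned_titles)
```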

In [None]:
mobile = compile_pageview_data(title_list, access = "mobile")
desktop = compile_pageview_data(title_list, access = "desktop")
cumulative  = compile_pageview_data(title_list, access = "cumulative")

Get data for all titles in the list for the mobile, desktop, and cumulative (all-access) access types.

In [None]:
mobile_json = json.dumps(mobile, indent=4)
desktop_json = json.dumps(desktop, indent=4)
cumulative_json = json.dumps(cumulative, indent=4)

Serialize each dictionary to a JSON-formatted string.

In [None]:
with open("../raw_data/dino_monthly_mobile_201501-202209.json", "w") as outfile:
    outfile.write(mobile_json)
with open("../raw_data/dino_monthly_desktop_201501-202209.json", "w") as outfile:
    outfile.write(desktop_json)
with open("../raw_data/dino_monthly_cumulative_201501-202209.json", "w") as outfile:
    outfile.write(cumulative_json)

Write the data to the 'raw_data' folder.