## Data Curation on English Wikipedia View Metrics

The goal of this notebook is to provide reproducible steps for analyzing the view metrics data provided by the Wikimedia Foundation (please see the licensing details in the repository README for the terms of use).

In this analysis we collect and analyze monthly traffic data for English Wikipedia from January 1, 2008 through September 30, 2017, using two different API sources:

- Legacy service, a.k.a. the legacy Pagecounts API ([documentation](https://wikitech.wikimedia.org/wiki/Analytics/AQS/Legacy_Pagecounts), [endpoint](https://wikimedia.org/api/rest_v1/#!/Pagecounts_data_(legacy)/get_metrics_legacy_pagecounts_aggregate_project_access_site_granularity_start_end)), provides access to desktop and mobile traffic data from January 2008 through July 2016.

- Current service, a.k.a. the Pageviews API ([documentation](https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews), [endpoint](https://wikimedia.org/api/rest_v1/#!/Pageviews_data/get_metrics_pageviews_aggregate_project_access_agent_granularity_start_end)), provides access to desktop, mobile web, and mobile app traffic data from July 2015 through September 2017.

The following sections cover data acquisition, processing, and finally analysis.


### Step I: Data Acquisition

In this section we collect data from both APIs and save the results in five separate JSON files, one per access type (three for the current API and two for the legacy API).

In [None]:
# -*- coding: utf-8 -*-

import requests
import json

We define dictionaries holding the constant values given in the API documentation for the current (coded as 'curr') and legacy (coded as 'legacy') APIs.

In [None]:
# maps each API name (used in the output file names) to its source code
projects = {'pageCounts': 'legacy', 'pageViews': 'curr'}
# endpoint templates of the current and legacy APIs
endpoint = {
    'curr': 'https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}',
    'legacy': 'https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end}'
}
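As a minimal sketch of how such a template turns into a concrete URL, the block below fills the legacy template with `str.format`. The template string and the parameter values are copied from this notebook; the variable names `LEGACY_TEMPLATE`, `legacy_params`, and `legacy_url` are introduced here for illustration only.

```python
# Template copied from the notebook's endpoint['legacy'] entry.
LEGACY_TEMPLATE = ('https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/'
                   '{project}/{access-site}/{granularity}/{start}/{end}')

# The keys must match the placeholder names exactly, including the hyphen in
# 'access-site', so the values are passed as an unpacked dictionary rather
# than as keyword arguments.
legacy_params = {
    'project': 'en.wikipedia.org',
    'access-site': 'desktop-site',
    'granularity': 'monthly',
    'start': '2008010100',
    'end': '2016080100',
}

legacy_url = LEGACY_TEMPLATE.format(**legacy_params)
print(legacy_url)
```

Because `access-site` is not a valid Python identifier, unpacking a dictionary is the practical way to supply it to `format`.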

Each API endpoint can be accessed in several ways. The legacy data distinguishes only two access types, the desktop and mobile sites. To separate automated (web-crawler) traffic from real-user traffic, the current Pageviews API, whose data starts in July 2015, adds the 'agent' parameter. Specifying the agent as 'user' therefore restricts the counts to traffic from real users.

In [None]:
# the available access mechanisms for each API
access = {
    'curr': {'desktop', 'mobile-app', 'mobile-web'},
    'legacy': {'desktop-site', 'mobile-site'}
}

# start and end timestamps (YYYYMMDDHH) of the date range requested from each API
date_range = {
    'legacy': {'start': '2008010100', 'end': '2016080100'},
    'curr': {'start': '2015070100', 'end': '2017120100'}
}

# request headers identifying the caller to the API
headers = {'User-Agent': 'https://github.com/rezvanielham', 'From': 'rezvanil@uw.edu'}
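To illustrate the effect of the 'agent' parameter described above, here is a small sketch that builds two Pageviews URLs differing only in the agent segment. The helper `build_pageviews_url` is hypothetical and not part of this notebook. The value `'user'` is the one used in this analysis; `'all-agents'` is assumed here purely for contrast and should be checked against the API documentation before use.

```python
# Template copied from the notebook's endpoint['curr'] entry.
PAGEVIEWS_TEMPLATE = ('https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/'
                      '{project}/{access}/{agent}/{granularity}/{start}/{end}')


def build_pageviews_url(access_type, agent):
    """Hypothetical helper: fill the Pageviews template for one access type and agent."""
    return PAGEVIEWS_TEMPLATE.format(
        project='en.wikipedia.org',
        access=access_type,
        agent=agent,
        granularity='monthly',
        start='2015070100',
        end='2017120100',
    )


# 'user' is the value this notebook uses; 'all-agents' is an assumption
# included only to show where the agent appears in the URL.
user_url = build_pageviews_url('desktop', 'user')
all_agents_url = build_pageviews_url('desktop', 'all-agents')

print(user_url)
print(all_agents_url)
```

The two URLs are identical except for the path segment that follows the access type, which is where the filtering by agent takes place.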

The following function builds a dictionary of URL parameters from the access type, the start and end dates, and the API source.
Its result is used to fill in the API call URLs.

In [None]:
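def get_params(access, start, end, project):
    '''
    :param access: the access type, e.g. 'desktop' or 'mobile-site'
    :param start: the start timestamp with YYYYMMDDHH format
    :param end: the end timestamp with YYYYMMDDHH format
    :param project: the API source, either 'curr' or 'legacy'
    :return: the dict of parameters used to fill in the endpoint URL
    '''
    params = dict()

    if project == 'curr':
        # the current API takes an agent; 'user' filters out automated traffic
        params['access'] = access
        params['agent'] = 'user'
    else:
        # the legacy API names its access parameter 'access-site'
        params['access-site'] = access
    params['project'] = 'en.wikipedia.org'
    params['granularity'] = 'monthly'
    params['start'] = start
    params['end'] = end
    return params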

This helper shortens the start and end dates from the YYYYMMDDHH format to the YYYYMM format by dropping the day and hour values.
Its result is used later to name the output files.

In [None]:
def get_ym_date(ymd_start, ymd_end):
    '''
    :param ymd_start: the start date with YYYYMMDDHH format
    :param ymd_end: the end date with YYYYMMDDHH format
    :return: a dict holding the start and end dates with YYYYMM format (the first six characters of each input)
    '''
    params = dict()
    params['ym-start'] = ymd_start[0:6]
    params['ym-end'] = ymd_end[0:6]
    return params
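As a quick worked example of the truncation, the sketch below applies the same slice to the legacy date range used in this notebook. The function `to_year_month` is a hypothetical stand-in for the slicing done inside `get_ym_date`, written separately so the block runs on its own.

```python
def to_year_month(timestamp):
    """Hypothetical stand-in for the slicing in get_ym_date: keep the first six characters (YYYYMM)."""
    return timestamp[0:6]


# Timestamps taken from the notebook's date_range['legacy'] entry.
legacy_start = to_year_month('2008010100')
legacy_end = to_year_month('2016080100')

print(legacy_start, legacy_end)  # prints: 200801 201608
```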

This function returns the 6-digit YYYYMM start and end dates for a given API source (each source needs a different date range).

In [None]:
def get_page_view_formatted_dates(project):
    return get_ym_date(date_range[project]['start'], date_range[project]['end'])

At this point we have everything we need to begin calling the API endpoints and write the corresponding data to files named in this format: "apiname_accesstype_firstmonth_lastmonth.json"


In [None]:
for project in projects:
    prj = projects[project]
    for acs in access[prj]:
        # build the request URL and identify ourselves through the headers
        url = endpoint[prj].format(**get_params(acs, date_range[prj]['start'], date_range[prj]['end'], prj))
        api_call = requests.get(url, headers=headers)
        response = api_call.json()
        # write to a file named apiname_accesstype_firstmonth_lastmonth.json
        ym_dates = get_page_view_formatted_dates(prj)
        out_file_name = '{}_{}_{}_{}.json'.format(project, acs, ym_dates['ym-start'], ym_dates['ym-end'])
        with open(out_file_name, 'w') as out_file:
            json.dump(response, out_file, indent=4)
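The file naming and saving steps of the loop can also be exercised without calling the API. The sketch below is one possible way to do that, not the notebook's own code: `build_file_name` and `save_response` are hypothetical helpers, and the payload is an empty placeholder rather than real API output (its top-level `'items'` key is an assumption about the response shape).

```python
import json
import os
import tempfile


def build_file_name(api_name, access_type, start, end):
    """Hypothetical helper: name a file from YYYYMMDDHH timestamps, keeping only YYYYMM."""
    return '{}_{}_{}_{}.json'.format(api_name, access_type, start[:6], end[:6])


def save_response(payload, path):
    """Hypothetical helper: write a parsed JSON payload to disk, closing the file afterwards."""
    with open(path, 'w') as out_file:
        json.dump(payload, out_file, indent=4)


file_name = build_file_name('pageCounts', 'desktop-site', '2008010100', '2016080100')
print(file_name)  # prints: pageCounts_desktop-site_200801_201608.json

# Round trip through a temporary directory with a placeholder payload
# (not real API output) to confirm that the saved file holds valid JSON.
placeholder = {'items': []}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, file_name)
    save_response(placeholder, path)
    with open(path) as in_file:
        reloaded = json.load(in_file)
```

Opening the file in a `with` block ensures it is flushed and closed before any later step reads it back.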